Rewrite HTML attributes after parsing
To rewrite anchors, inject attributes, normalise URLs, or strip sentinels in already-rendered HTML, implement IHtmlResponseRewriter. Pennington's HtmlResponseRewritingProcessor parses each response body with AngleSharp exactly once and invokes every registered rewriter against that shared IDocument, so the work composes with the built-in xref, locale, and base-URL passes. For non-HTML response types (JSON, plain text) or work that needs the final byte stream, use Inject HTML before </body> on every page instead.
Before you begin
- An existing Pennington site rendering HTML pages (see Create your first Pennington site if not).
- A clear sense of which phase fits the edit: a non-HTML token (something not valid HTML structure, like <xref:uid> or a sentinel comment) belongs in PreParseAsync; anything queryable by selectors belongs in ApplyAsync.
For a working setup, see examples/ExtensibilityLabExample — AnchorLowercaseRewriter.cs exercises both halves of the contract and Program.cs registers it against a bare AddPennington host.
Implement the rewriter
IHtmlResponseRewriter has four members: Order, ShouldApply(HttpContext), PreParseAsync(string, HttpContext), and ApplyAsync(IDocument, HttpContext). The example at examples/ExtensibilityLabExample/AnchorLowercaseRewriter.cs demonstrates all four in one sealed type.
ShouldApply runs per-response; return false to skip both phases when the content-type, path, or headers mean there is nothing to do. The example narrows to text/html responses so non-HTML endpoints (search index JSON, llms.txt) bypass the rewriter entirely.
public bool ShouldApply(HttpContext context)
{
    var contentType = context.Response.ContentType;
    return contentType is not null
        && contentType.StartsWith("text/html", StringComparison.OrdinalIgnoreCase);
}
PreParseAsync receives the raw HTML string before AngleSharp parses it and returns the string to parse — use it only when the target construct is not valid HTML structure (raw <xref:uid> tags are the canonical shipped example; the lab strips a sentinel comment). Return the input unchanged when there is nothing to do, to avoid paying for an allocation on every response.
/// <summary>
/// Pre-parse pass. Strip the sentinel comment so it is gone before
/// AngleSharp runs. A string replace is the right tool when the
/// target construct is not valid HTML structure (raw <c>&lt;xref&gt;</c>
/// tags are the canonical example shipped with Pennington).
/// </summary>
public Task<string> PreParseAsync(string html, HttpContext context)
{
    // Fast path: hand back the same string instance when the sentinel is absent.
    if (!html.Contains("<!--LOWERCASE-SENTINEL-->", StringComparison.Ordinal))
        return Task.FromResult(html);
    return Task.FromResult(html.Replace("<!--LOWERCASE-SENTINEL-->", string.Empty, StringComparison.Ordinal));
}
ApplyAsync receives the already-parsed IDocument shared by every rewriter in this pass — query with QuerySelectorAll, mutate attributes and text, and return; do not re-serialize or reparse. The example lowercases the text content of every <a data-lowercase>; more typical uses include href canonicalisation, loading="lazy" on images, or stamping rel="noopener" on external links.
/// <summary>
/// DOM pass. Walk the parsed document, find every <c>&lt;a&gt;</c>
/// with <c>data-lowercase</c>, lowercase its text content.
/// </summary>
public Task ApplyAsync(IDocument document, HttpContext context)
{
    foreach (var element in document.QuerySelectorAll("a[data-lowercase]"))
    {
        if (element is not IHtmlAnchorElement anchor) continue;
        if (string.IsNullOrEmpty(anchor.TextContent)) continue;
        anchor.TextContent = anchor.TextContent.ToLowerInvariant();
    }
    return Task.CompletedTask;
}
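One of the more typical uses mentioned above, stamping rel="noopener" on external links, could look roughly like the helper below. This is a sketch rather than code from the example project: ExternalLinkStamper, StampNoopener, and the siteHost parameter are hypothetical names, and "external" is assumed to mean an absolute http(s) URL whose host differs from siteHost. It relies only on AngleSharp's QuerySelectorAll, GetAttribute, and SetAttribute.

```csharp
using System;
using AngleSharp.Dom;

// Hypothetical helper for illustration; not part of Pennington.
public static class ExternalLinkStamper
{
    // Adds rel="noopener" to anchors that point at a different http(s) host.
    public static void StampNoopener(IDocument document, string siteHost)
    {
        foreach (var anchor in document.QuerySelectorAll("a[href]"))
        {
            var href = anchor.GetAttribute("href");

            // Relative hrefs are not absolute URIs and are treated as internal.
            if (!Uri.TryCreate(href, UriKind.Absolute, out var uri)) continue;

            // The scheme check also skips mailto:, tel:, and root-relative
            // paths that some platforms parse as file URIs.
            if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps) continue;

            if (string.Equals(uri.Host, siteHost, StringComparison.OrdinalIgnoreCase)) continue;

            // Preserve any existing rel tokens and avoid adding a duplicate.
            var rel = anchor.GetAttribute("rel");
            if (rel is not null && rel.Contains("noopener", StringComparison.OrdinalIgnoreCase)) continue;

            anchor.SetAttribute("rel", string.IsNullOrWhiteSpace(rel) ? "noopener" : rel + " noopener");
        }
    }
}
```

Inside ApplyAsync, a helper like this would be called with the shared document and whichever host you treat as internal, followed by returning Task.CompletedTask as in the example above.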
Pick an Order value
The three shipped rewriters run at 10 (XrefHtmlRewriter), 20 (LocaleLinkHtmlRewriter), and 30 (BaseUrlHtmlRewriter) — choose a number above 30 to see already-resolved xref/locale/base hrefs, below 10 to preempt xref resolution, or between the built-ins only to deliberately slot into that chain. The example uses 500 so anchors are lowercased after every transport-layer transform has landed.
public int Order => 500;
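To make the slotting concrete, the sketch below sorts the three shipped Order values alongside the example's 500 and a hypothetical rewriter at 5. NamedRewriter, OrderDemo, and EarlyRewriter are stand-ins invented for illustration, not Pennington types, and ascending order is an assumption drawn from the description above; how ties are broken is not covered here.

```csharp
using System.Collections.Generic;
using System.Linq;

// Stand-in for illustration only; the real interface has more members.
public sealed record NamedRewriter(string Name, int Order);

public static class OrderDemo
{
    // Returns rewriter names in the order an ascending sort by Order would run them.
    public static IReadOnlyList<string> RunOrder(IEnumerable<NamedRewriter> rewriters) =>
        rewriters.OrderBy(r => r.Order).Select(r => r.Name).ToList();

    public static IReadOnlyList<string> Sample() => RunOrder(new[]
    {
        new NamedRewriter("AnchorLowercaseRewriter", 500),
        new NamedRewriter("BaseUrlHtmlRewriter", 30),
        new NamedRewriter("EarlyRewriter", 5),            // hypothetical: runs before xref resolution
        new NamedRewriter("XrefHtmlRewriter", 10),
        new NamedRewriter("LocaleLinkHtmlRewriter", 20),
    });
    // Sample() yields: EarlyRewriter, XrefHtmlRewriter, LocaleLinkHtmlRewriter,
    // BaseUrlHtmlRewriter, AnchorLowercaseRewriter
}
```

Leaving gaps between values, as the shipped 10/20/30 spacing does, keeps room to slot a later rewriter in without renumbering.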
Register the implementation
HtmlResponseRewritingProcessor resolves every registered IHtmlResponseRewriter from the container and sorts by Order, so a single AddSingleton next to the host wiring is sufficient.
builder.Services.AddSingleton<IHtmlResponseRewriter, AnchorLowercaseRewriter>();
Result
Anchors marked data-lowercase have their text content lowercased, and the sentinel comment is gone from view-source.
Before:
<!--LOWERCASE-SENTINEL-->
<a data-lowercase href="/docs/">Read the DOCS</a>
<a data-lowercase href="/blog/">Latest POSTS</a>
After:
<a data-lowercase href="/docs/">read the docs</a>
<a data-lowercase href="/blog/">latest posts</a>
Anchors without data-lowercase and non-HTML responses pass through unchanged.
Verify
- Run dotnet run --project examples/ExtensibilityLabExample and visit /lowercase-demo/.
- Expect every <a data-lowercase> anchor text to be lowercase in the rendered HTML, and <!--LOWERCASE-SENTINEL--> to be absent from view-source.
- Static build: run dotnet run --project examples/ExtensibilityLabExample -- build output, then grep output/lowercase-demo/index.html to confirm the rewriter also runs during publish.
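The checks above can also be scripted. The following is a minimal sketch, assuming the AngleSharp package is available to the checking code; LowercaseDemoCheck and FindProblems are hypothetical names, not part of the example project. It reports a leftover sentinel comment and any data-lowercase anchor whose text is not lowercase.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Html.Parser;

// Hypothetical verification helper; not part of Pennington or the example project.
public static class LowercaseDemoCheck
{
    // Returns a description of each problem found; an empty list means the HTML looks right.
    public static IReadOnlyList<string> FindProblems(string html)
    {
        var problems = new List<string>();

        // The sentinel should have been stripped in the pre-parse pass.
        if (html.Contains("<!--LOWERCASE-SENTINEL-->", StringComparison.Ordinal))
            problems.Add("sentinel comment still present");

        // Every marked anchor should already carry lowercased text.
        var document = new HtmlParser().ParseDocument(html);
        foreach (var anchor in document.QuerySelectorAll("a[data-lowercase]"))
        {
            var text = anchor.TextContent;
            if (text != text.ToLowerInvariant())
                problems.Add($"anchor text not lowercased: {text}");
        }

        return problems;
    }
}
```

For the static build, one would pass it the contents of output/lowercase-demo/index.html, for example via File.ReadAllText, and expect an empty list.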
Related
- Reference: Response processing interfaces
- Background: The response-processing pipeline
- Related how-to: Write a response processor