UPDATE: I posted a follow-up article on executing the plan described in the last few paragraphs of this post. Check it out after this one.
I recently looked into server-side rendering options to improve search engine optimization for my AngularJS website. What started as a seemingly innocent task morphed into a week of research and development, with me becoming a QA automation engineer along the way.
My website uses dynamic metadata tags in the HTML. I do this because I want each view to have customized header attributes I can set in the controller, just like all the other data in the app. This is all well and good and easy to set up, but the problem is getting search bots to recognize and execute the underlying JavaScript that generates the metadata.
So I did a quick search and came across PhantomJS, a scriptable headless browser commonly used for automated UI testing. I messed around and created a few basic scripts to download web pages, including some with JavaScript to execute. The results were fairly promising, but as I went deeper, I discovered that PhantomJS is no longer actively maintained and decided to look elsewhere.
I soon discovered headless Chrome and Puppeteer. Puppeteer is a Node library for interacting with headless Chrome (Chromium) and issuing it commands to do things like render a web page, generate a PDF snapshot, or click around and wait for things to happen. There is also a .NET port of this code called Puppeteer Sharp.
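To give a flavor of the API, here is a minimal Puppeteer Sharp sketch that loads a page and saves a PDF snapshot, one of the capabilities mentioned above that the code later in this post does not exercise. This is just a sketch: the URL, output file name, and Chromium revision are placeholders, and it assumes the matching Chromium build has already been downloaded locally.

public static async Task PdfSnapshot()
{
    // Launch headless Chromium, load a page, and save a PDF snapshot of it.
    // The revision, URL, and file name below are placeholders.
    var options = new LaunchOptions { Headless = true };
    using (var browser = await Puppeteer.LaunchAsync(options, 526987))
    using (var page = await browser.NewPageAsync())
    {
        await page.GoToAsync("http://localhost:4200/");
        await page.PdfAsync("snapshot.pdf");
    }
}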
With these new tools, I set out to create code that navigates to every route in my website and downloads the corresponding page. The idea is to use the headless Chrome browser programmatically via Puppeteer to generate HTML for search bots with all the AngularJS bits already executed.
Most of the code I wrote is contained in two classes: Render and Route. Render has an asynchronous method called GetContent that opens the Chromium browser session and gets pages from a locally hosted development instance of my website. Route holds the specifics of my local site map and builds the file paths for the downloaded HTML.
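The Route class itself is not shown in this post. Based on how GetContent uses it below, a minimal sketch might look something like this; the OutputFile property and the ToString override are my assumptions, not the actual implementation.

// Hypothetical sketch of the Route class. Only Path and Content appear in the
// GetContent code below; OutputFile and ToString are assumptions.
public class Route
{
    // Full URL of the page on the locally hosted development site.
    public string Path { get; set; }

    // Rendered HTML captured by Puppeteer after the AngularJS code has run.
    public string Content { get; set; }

    // Local file path where the static snapshot should be written.
    public string OutputFile { get; set; }

    // Makes the Console.WriteLine($"Loading page: {route}") output readable.
    public override string ToString() => Path;
}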
Below is the code that does most of the work.
public static async Task GetContent(List<Route> routes, bool exportHtml = false, bool exportPdf = false)
{
    try
    {
        // Pin a specific Chromium revision so renders are reproducible.
        var chromiumRevision = 526987;
        var options = new LaunchOptions
        {
            Headless = true
        };

        // One browser session is shared across all routes; each route gets its own page.
        using (var browser = await Puppeteer.LaunchAsync(options, chromiumRevision))
        {
            foreach (var route in routes)
            {
                Console.WriteLine($"Loading page: {route}");
                using (var page = await browser.NewPageAsync())
                {
                    // Wait until the network is idle so the AngularJS code has
                    // finished executing before the HTML is captured.
                    await page.GoToAsync(route.Path, new NavigationOptions
                    {
                        WaitUntil = new WaitUntilNavigation[]
                        {
                            WaitUntilNavigation.Networkidle0
                        }
                    });

                    // Grab the fully rendered HTML for this route.
                    route.Content = await page.GetContentAsync();

                    if (exportHtml && !string.IsNullOrEmpty(route.Content))
                    {
                        Console.WriteLine($"Exporting page: {route}");
                        await Export(route);
                    }
                }
            }
        }
    }
    catch (Exception ex)
    {
        Utility.Log($"Render error: {ex.Message}. {ex.StackTrace}.");
    }
}
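The Export method called above is not shown either. Assuming it simply writes route.Content to the file path built by Route, it could be as small as the sketch below, paired with a hypothetical call site. The localhost URLs and snapshot folder are placeholders standing in for my real site map, not the actual implementation.

// Assumed shape of the Export helper referenced in GetContent: write the
// rendered HTML for one route out to disk.
private static async Task Export(Route route)
{
    Directory.CreateDirectory(System.IO.Path.GetDirectoryName(route.OutputFile));
    using (var writer = new StreamWriter(route.OutputFile, append: false))
    {
        await writer.WriteAsync(route.Content);
    }
}

// Hypothetical call site; the URLs and output paths are placeholders.
public static async Task Main()
{
    var routes = new List<Route>
    {
        new Route { Path = "http://localhost:4200/",      OutputFile = @"snapshots\index.html" },
        new Route { Path = "http://localhost:4200/about", OutputFile = @"snapshots\about\index.html" }
    };

    await Render.GetContent(routes, exportHtml: true);
}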
I plan on running this periodically to regenerate static snapshots of my pages, first locally for testing and then against the live website for search indexing. I will place the downloaded HTML in AWS S3 and use CloudFront to route search bots to those prerendered copies.
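For the S3 half of that plan, pushing each snapshot up can be a single PutObject call with the AWS SDK for .NET. A rough sketch follows; the bucket name, key convention, and region are made up, and the CloudFront piece that actually routes bots to these objects is left for the follow-up post.

// Rough sketch: upload one rendered snapshot to S3 with the AWS SDK for .NET.
// The bucket name, key convention, and region are placeholders.
private static async Task UploadSnapshot(string localFile, string key)
{
    using (var s3 = new Amazon.S3.AmazonS3Client(Amazon.RegionEndpoint.USEast1))
    {
        await s3.PutObjectAsync(new Amazon.S3.Model.PutObjectRequest
        {
            BucketName = "my-prerendered-site",
            Key = key,                // e.g. "about/index.html", mirroring the route
            FilePath = localFile,
            ContentType = "text/html"
        });
    }
}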
With any luck, I will have a site that is better indexed and more searchable by robots so that human beings may follow in their wake.
Tagged: #code
Posted on Jun 06, 2018