Simon Miller Team : Web Development Tags : Web Development Search Umbraco

Searching PDFs with Umbraco

My current website has a simple requirement: combine regular page search results with PDF file results. Fine, I thought, this should be simple enough, just search on the filename. But I had not considered the details of the requirement: the search has to look inside the files, at the actual text of the PDF!

It turns out this is quite easy. Let’s presume you have already set up your search using the default ExternalSearcher as your provider and have a working content search for Umbraco nodes.

  1. Firstly, install the NuGet package for UmbracoExamine.PDF. Word of warning: when I first did this, it did not recognise my existing Umbraco 7.x install and went about installing all the required packages for Umbraco 6! After I cleaned up that mess (thank god for Undo Checkout) I made sure to exclude “Install Dependencies” when installing the package again. If you do it this way, you may need to install the iTextSharp NuGet package separately.

  2. The UmbracoExamine.PDF package should have updated your ExamineIndex.config and ExamineSettings.config to include the new PDF index and searcher. If you now go to the Umbraco backoffice and search for the contents of a PDF via Developer > Examine Management, you should get results.

    However, we don’t want to run two separate searches. Ideally we want everything returned in one result set. This can be achieved through a MultiIndexSearcher, which (as it says on the tin) searches multiple indexes. To enable it, add a new provider to the ExamineSearchProviders section of ExamineSettings.config like so:
<add name="ContentSearcher"
     type="Examine.LuceneEngine.Providers.MultiIndexSearcher, Examine"
     analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"
     enableLeadingWildcards="true"
     indexSets="ExternalIndexSet,PDFIndexSet" />
  3. Your searcher code in your controller will most likely look something like this (the code has been simplified):
var searcher = ExamineManager.Instance;
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.GroupedOr(new[] { "nodeName", "name", "title", "body" }, filter.Keyword).Compile();
var searchResults = searcher.Search(query);

Update it to use the new searcher by name and to add FileTextContent, the field in which the PDF indexer stores the extracted text:

var searcher = ExamineManager.Instance.SearchProviderCollection["ContentSearcher"];
var searchCriteria = searcher.CreateSearchCriteria();
var query = searchCriteria.GroupedOr(new[] { "nodeName", "name", "title", "body", "FileTextContent" }, filter.Keyword).Compile();
var searchResults = searcher.Search(query);
  4. The properties on a PDF search result (of ‘media’ index type) differ from those on regular content (of ‘content’ index type). The text of the PDF is concatenated into the FileTextContent field. We can use this field both to search against (see step 3) and to build the summary shown in the search results. Your search results processor should look something like this:
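Items = new List<SearchResultItem>();
foreach (var item in pages)
{
    if (item.Fields.ContainsKey("FileTextContent"))
    {
        // PDF result from the media index
        var node = helper.TypedMedia(item.Fields["__NodeId"]);
        if (node == null) continue; // media item removed since it was indexed
        Items.Add(new SearchResultItem()
        {
            Title = node.Name,
            Url = node.Url,
            Summary = StringHelpers.Truncate(item.Fields["FileTextContent"] ?? string.Empty, 300)
        });
    }
    else
    {
        // Regular Umbraco content result
        var node = helper.TypedContent(item.Fields["id"]);
        if (node == null) continue; // page unpublished or removed since it was indexed
        Items.Add(new SearchResultItem()
        {
            // Fall back to the node name when the page has no title field
            Title = item.Fields.ContainsKey("title") ? item.Fields["title"] : node.Name,
            Url = node.Url,
            Summary = item.Fields.ContainsKey("body") ? StringHelpers.Truncate(item.Fields["body"] ?? string.Empty, 300) : MvcHtmlString.Empty
        });
    }
}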

The above loop determines the content type based on the existence of the FileTextContent field. We then normalise the search result items, using a custom helper to turn the body content in each case into a short summary, and return the resulting model to the view.
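The custom helper itself is not shown here, so the following is only a rough sketch of what a summary truncation helper could look like. The class name SummaryText, the whitespace collapsing and the "..." suffix are all assumptions for illustration, not the actual StringHelpers.Truncate. The loop above implies that helper returns an MvcHtmlString; this sketch returns a plain string to stay self-contained, which you could wrap with MvcHtmlString.Create.

```csharp
using System;

// Hypothetical stand-in for the StringHelpers.Truncate helper used above.
public static class SummaryText
{
    public static string Truncate(string text, int maxLength)
    {
        if (string.IsNullOrWhiteSpace(text) || maxLength <= 0)
        {
            return string.Empty;
        }

        // Extracted PDF text tends to contain many line breaks and runs of
        // spaces, so collapse all whitespace to single spaces first.
        string normalised = string.Join(
            " ",
            text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries));

        if (normalised.Length <= maxLength)
        {
            return normalised;
        }

        // Cut at the last space at or before the limit so a word is not
        // split in half; fall back to a hard cut if there is no space.
        int cut = normalised.LastIndexOf(' ', maxLength);
        if (cut <= 0)
        {
            cut = maxLength;
        }

        return normalised.Substring(0, cut) + "...";
    }
}
```

Cutting on a word boundary matters more for PDFs than for page content, because the first 300 characters of a PDF are often a title block or table of contents with irregular spacing.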

You now have a fully functional site search that returns results by keyword for both Umbraco node content and the text inside PDFs stored in the Umbraco media library.