# Issue with Filtering PDF Search by Path in Examine (Umbraco 13)

wanton gate
#

Hey everyone,

I'm implementing a PDF search using Examine in Umbraco 13, and everything is working except for filtering by path.

What Works ✅

  • Searching for text inside PDFs works using fileTextContent.
  • Searching by nodeName also works.
  • If I don't filter by path, I get results.

What Doesn't Work ❌

  • When I try to filter by path, I get zero results.
  • The same query works in Examine Management but not in my code.

How My Examine Data Looks:
In Examine Management, when I search for a PDF, I see this in the index:

path: -1,1600,1603,1611,2227
__IndexType: pdf
nodeName: Marketing Document
fileTextContent: "This document contains marketing strategies..."
  • 1603 is my media root folder (selected by the user).
  • 2227 is the actual PDF file inside /HQM/blog1/.

Lucene Query That Works in Examine Management
If I manually search in Examine Management with:

fileTextContent:marketing~2 OR nodeName:marketing~2

I get results.

Lucene Query That My Code Generates (Fails)
Here’s what my code generates:

Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603* }

This returns zero results ❌, even though 1603* should match -1,1600,1603,1611,2227.

#

What I Tried So Far 🔄
❌ criteria.And().Field("path", $"{rootNode.Id}*")
Fails: Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603 }
→ No results. Wildcard is ignored.

❌ criteria.And().NativeQuery($"path:{rootNode.Id}")
Fails: Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603 }
→ No results.

❌ criteria.And().Field("path", "1603")
Fails: Lucene Query: { Category: pdf, LuceneQuery: +(fileTextContent:marketing~2 nodeName:marketing~2) +path:1603 }
→ No results. Throws error: '*' or '?' not allowed as first character in WildcardQuery

How Can I Filter PDFs by Path Correctly?
I need to filter PDFs so that only those inside a specific media folder (e.g., 1603) appear.

Does anyone have experience with Examine's path filtering in Umbraco?
Why is path:1603* returning nothing, even though the paths exist?
Is there a better way to search PDFs within a media folder and its children?

Thanks in advance! 🚀

vivid wing
#

1603* should match -1,1600,1603,1611,2227.

This is technically not true.

*1603* should match, or -1,1600,1603* should.

I haven't worked with paths in the PDF index, but it depends on how the values are indexed. In the backoffice viewer you can't see any difference between values indexed as:

"1,2,3,4" and [1,2,3,4]

But Examine handles a multi-value field very differently from a single-value field that contains one string of comma-separated values.
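
To illustrate why the two forms behave differently for a prefix search, here is a minimal sketch in plain C#. It is a simplified model, not Lucene or Examine code: it assumes a prefix query matches only when an indexed term starts with the prefix, and the names `PathPrefixDemo` and `AnyTermStartsWith` are made up for this example.

```csharp
using System;
using System.Linq;

public static class PathPrefixDemo
{
    // Simplified model of a prefix query such as "1603*":
    // it matches only if some indexed term starts with the prefix.
    public static bool AnyTermStartsWith(string[] terms, string prefix) =>
        terms.Any(t => t.StartsWith(prefix, StringComparison.Ordinal));

    public static void Main()
    {
        const string path = "-1,1600,1603,1611,2227";

        // Case 1: the whole path is indexed as one single term.
        var singleTerm = new[] { path };

        // Case 2: every id in the path is indexed as its own term.
        var separateTerms = path.Split(',');

        Console.WriteLine(AnyTermStartsWith(singleTerm, "1603"));         // False
        Console.WriteLine(AnyTermStartsWith(singleTerm, "-1,1600,1603")); // True
        Console.WriteLine(AnyTermStartsWith(separateTerms, "1603"));      // True
    }
}
```

Under that assumption, `1603*` can only match once the ids are indexed as separate terms, which is consistent with `-1,1600,1603*` being the prefix that would match a single-term path.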

wanton gate
#

Thanks for the clarification.

I debugged the results in Visual Studio and inspected the path property of a matching PDF result. It matches exactly what I see in the backoffice Examine Management UI.

Here’s what I found:

The path value of the result is: -1,1600,1603,1611,2227.
When I use +path:1603* in my Lucene query, it still does not return results even though 1603 is part of the path.
I also tried +path:*1603* as a raw query to match 1603 anywhere in the path, but it either returns nothing or throws an error on the wildcard.

Do you know if Examine treats path fields differently when querying? Or is there something else I should check to ensure this works?

Thanks for your help!

vivid wing
#

Historically, people using Examine have customized the path field so that the ids are indexed separately or with spaces in between. IIRC this is because Lucene strips certain characters when it searches, among them the comma.
So if you index "-1,1600,1603,1611,2227", Lucene will treat it as "11600160316112227", which leads to issues when filtering by one of the node ids.

My guess is that this is what you are running into here. Indexing the path in a searchable form from a TransformingIndexValues event may fix your issue.

wanton gate
#
using Examine;
using Umbraco.Cms.Core.Events;
using Umbraco.Cms.Core.Notifications;

namespace HunzikerIntranet.HQM.Umbraco.Infrastructure.Examine
{
    public class PDFIndexPathTransformer : INotificationHandler<UmbracoApplicationStartedNotification>
    {
        private readonly IExamineManager _examineManager;
        private readonly ILogger<PDFIndexPathTransformer> _logger;

        public PDFIndexPathTransformer(IExamineManager examineManager, ILogger<PDFIndexPathTransformer> logger)
        {
            _examineManager = examineManager;
            _logger = logger;
        }

        public void Handle(UmbracoApplicationStartedNotification notification)
        {
            if (!_examineManager.TryGetIndex("PDFIndex", out var pdfIndex))
            {
                _logger.LogError("PDFIndex not found in Examine.");
                return;
            }

            pdfIndex.TransformingIndexValues += IndexOnTransformingIndexValues;
        }

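        private void IndexOnTransformingIndexValues(object? sender, IndexingItemEventArgs e)
        {
            // Nothing to do for items that were indexed without a path.
            if (!e.ValueSet.Values.ContainsKey("path")) return;

            var rawPath = e.ValueSet.Values["path"].FirstOrDefault()?.ToString();
            if (string.IsNullOrEmpty(rawPath)) return;

            // "-1,1600,1603" becomes "-1 1600 1603" so each id can be matched on its own.
            var searchablePath = rawPath.Replace(",", " ");

            // Copy the existing values and add the new field next to the original path.
            var indexFields = e.ValueSet.Values.ToDictionary(x => x.Key, x => x.Value.ToList());
            indexFields["searchablePath"] = new List<object> { searchablePath };

            // Hand the modified value set back to Examine before it is written to the index.
            e.SetValues(indexFields.ToDictionary(x => x.Key, x => (IEnumerable<object>)x.Value));
        }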
    }
}
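
The handler above only runs if it is registered with Umbraco. A minimal sketch of one way to do that, assuming registration through a composer; the class name `PDFIndexPathComposer` is hypothetical and not from the thread:

```csharp
using Umbraco.Cms.Core.Composing;
using Umbraco.Cms.Core.DependencyInjection;
using Umbraco.Cms.Core.Notifications;

namespace HunzikerIntranet.HQM.Umbraco.Infrastructure.Examine
{
    // Hypothetical composer: wires the handler to the application-started notification.
    public class PDFIndexPathComposer : IComposer
    {
        public void Compose(IUmbracoBuilder builder)
        {
            builder.AddNotificationHandler<UmbracoApplicationStartedNotification, PDFIndexPathTransformer>();
        }
    }
}
```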

vivid wing
#

Yes, exactly like that. It should now work if you filter on searchablePath instead of path.
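
Filtering on the new field could look roughly like this. It is a sketch under assumptions, not the code from the thread: the class `PdfFolderSearch` and its method are hypothetical, the text part is just one way to express the fuzzy search on fileTextContent and nodeName, and searchablePath is assumed to be indexed as a default full-text field so that each id becomes its own term.

```csharp
using System;
using System.Globalization;
using Examine;
using Examine.Search;

// Hypothetical helper, named for illustration only.
public class PdfFolderSearch
{
    private readonly IExamineManager _examineManager;

    public PdfFolderSearch(IExamineManager examineManager)
    {
        _examineManager = examineManager;
    }

    public ISearchResults Search(string term, int rootFolderId)
    {
        if (!_examineManager.TryGetIndex("PDFIndex", out var index))
        {
            throw new InvalidOperationException("PDFIndex not found in Examine.");
        }

        return index.Searcher
            .CreateQuery("pdf")
            // Text part: fuzzy match on either field.
            .GroupedOr(new[] { "fileTextContent", "nodeName" }, term.Fuzzy())
            .And()
            // Folder filter: a plain term match on one id, no wildcard needed.
            .Field("searchablePath", rootFolderId.ToString(CultureInfo.InvariantCulture))
            .Execute();
    }
}
```

Documents indexed before the handler was added will not have a searchablePath value, so the PDF index needs to be rebuilt before this filter returns anything.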

wanton gate
#

@vivid wing thanks for your help