
Limiting Search Crawling to a subsite

I had an interesting challenge: I was asked to limit Search crawling to a single subsite.  The underlying issue was that a great deal of security in this farm was implemented via Audiences, which are not a secure method of locking down content. Audiences hide or expose documents and items in the UI, but don’t prevent a user from actually accessing them.  Search Content Sources expect nice, simple Web Application URLs to crawl.  So how best to restrict crawling to a subsite?

The simple answer is to set up the Content Source to crawl the whole Web Application, then use Crawl Rules to exclude everything else.  Only two rules are needed:

  1. Include: List the site to include, such as “http://sharepoint/sites/site1/site2/*.*”
    Note the *.* at the end to ensure all sub-content is crawled.  Being the first crawl rule, this takes precedence over the next. Don’t forget the *.*: testing a crawl rule that ends in just a * will appear to capture all content, but at crawl time only a *.* will capture content with a file extension.
  2. Exclude: List everything else: http://*.*
    This will exclude anything not captured in the first rule.
  3. If you have a content source that includes people (sps3://sharepoint), be sure to use a wildcard on the protocol as well.
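
The pair of rules above can also be created in PowerShell.  This is a sketch, not a drop-in script: it assumes a SharePoint Management Shell, the URLs are placeholders for your own, and the -Type values follow the 0 = include, 1 = exclude convention described later in this post.

```powershell
# Sketch only: run in a SharePoint Management Shell; URLs are placeholders.
$ssa = Get-SPEnterpriseSearchServiceApplication

# Rule 1 (include): crawl only the subsite.  Created first, so it takes
# precedence over the exclusion that follows.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://sharepoint/sites/site1/site2/*.*" -Type 0

# Rule 2 (exclude): everything else in the Web Application.
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa `
    -Path "http://*.*" -Type 1
```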

Voila!

Tuning your Crawl

Want to tune your Search crawling? There’s plenty of benefit to be had refining how Search crawls in SharePoint: eliminate useless page hits, or documents that will fail crawl processing.  It’s another way to exclude sensitive documents as well, if you can find a suitable crawl exclusion rule.  I found out the hard way that SharePoint URLs defined in a Content Source MUST be Web Applications.  If you only want to crawl a subsite, your recourse is to pare out all other sites using Crawl Rules.  Crawl Rules come in two basic flavors: simple wildcards, which are quite intuitive, and Regular Expressions.  You can find the Crawl Rules in Central Admin, General Application Settings, Search, (your Content SSA if in FAST), Crawl Rules (visible on the left).

Surprisingly, there is scant documentation on the Regular Expression implementation in SharePoint.  Through a bit of digging and trial and error I’ve summarized the Regular Expression operators supported in SharePoint:

? Optional match; the preceding element appears zero or one time: “http://sharepoint/List_[a-z]?.aspx”
The a-z character is optional.
* Matches the preceding element zero or more times: “http://sharepoint/List_M*”
No M, one M, MM, and so on at the end.
+ Matches the preceding element one or more times: “http://SharePoint/List_M+”
One or more Ms at the end.
. Matches exactly one character: “http://sharepoint/List_.”
One character expected after the underscore.
[abc] Matches any one of the listed characters; ranges such as a-z work too: “http://sharepoint/List_[a-z]”
Matches List_ followed by any letter a-z.
| Alternation (OR); matches the expression on either side.
() Parentheses group characters for an operation.
{x,y} Matches the preceding element between x and y times.
{x} Matches the preceding element exactly x times.
{x,} Matches the preceding element x or more times.
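
Since SharePoint’s regex dialect is so thinly documented, it helps to sanity-check a pattern against standard .NET regex semantics, which the operators above follow.  A few illustrative checks (the URLs are made up; always confirm the real rule with a test crawl):

```powershell
# Illustration only: standard .NET regex semantics via [regex]::IsMatch.

# ?  -- the preceding element is optional (zero or one)
[regex]::IsMatch('http://sharepoint/List_a.aspx', '^http://sharepoint/List_[a-z]?\.aspx$')  # True

# *  -- zero or more of the preceding element
[regex]::IsMatch('http://sharepoint/List_', '^http://sharepoint/List_M*$')                  # True

# +  -- one or more; an empty tail no longer matches
[regex]::IsMatch('http://sharepoint/List_', '^http://sharepoint/List_M+$')                  # False

# .  -- exactly one (any) character
[regex]::IsMatch('http://sharepoint/List_X', '^http://sharepoint/List_.$')                  # True

# |  -- alternation: either side matches
[regex]::IsMatch('http://sharepoint/Docs', '^http://sharepoint/(Lists|Docs)$')              # True

# {x,y} -- between x and y repetitions
[regex]::IsMatch('http://sharepoint/List_MM', '^http://sharepoint/List_M{2,3}$')            # True
```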

For FAST, note the Crawl Rules are under your Content SSA, not the Query SSA.

To create an exclusion rule with PowerShell (the -Type parameter takes 0 = include, 1 = exclude):

New-SPEnterpriseSearchCrawlRule -SearchApplication FASTSearchApp -Path "http://SharePoint/Sites/Secret/*" -Type 1

To output all your Crawl Rules, use this line of PowerShell:

Get-SPEnterpriseSearchServiceApplication | Get-SPEnterpriseSearchCrawlRule | ft

The cmdlet “Get-SPEnterpriseSearchCrawlRule” requires a Service Application object, so we simply pipe one in using the “Get-SPEnterpriseSearchServiceApplication” cmdlet.  You can then pipe the results to whatever you want.  “ft” is an alias for Format-Table, which is the default output, but you can just as easily pipe it to a file for automatic documentation.  This is especially useful when playing with your crawl rules.
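
For example, the file-based documentation trick might look like this (the output path is a placeholder):

```powershell
# Dump every crawl rule to a text file for quick reference
Get-SPEnterpriseSearchServiceApplication |
    Get-SPEnterpriseSearchCrawlRule |
    Format-Table -AutoSize |
    Out-File C:\Temp\CrawlRules.txt
```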
