Limiting Search Crawling to a subsite

I had an interesting challenge: I was asked to limit Search crawling to a single subsite. The underlying issue was that a great deal of security in this farm was implemented via Audiences, which is not a secure method of locking down content. Audiences control which documents and items are shown to users, but don't prevent a user from actually accessing them. Search Content Sources expect nice, simple Web Application URLs to crawl. So how best to restrict crawling to a subsite?

The simple answer is to set up the Content Source to crawl the whole Web Application, but add Crawl Rules to exclude everything else. Only two rules are needed (a PowerShell sketch follows the list):

  1. Include: List the site to include, such as "http://sharepoint/sites/site1/site2/*.*"
    Note the *.* at the end to ensure all sub-content is crawled. Being the first crawl rule, this takes precedence over the next. Don't forget the *.*: testing a crawl rule with just a * will appear to capture all content, but at crawl time only a *.* will capture content with a file extension.
  2. Exclude: List everything else: http://*.*
    This will exclude anything not captured in the first rule.
  3. If you have a content source that includes people (sps3://sharepoint), be sure to use a wildcard on the protocol as well.
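
Here's a minimal PowerShell sketch of those two rules, assuming the hypothetical URLs above and your own SSA name; -Type accepts InclusionRule or ExclusionRule (the numeric 0/1 mentioned later in this post also work). Create the include rule first, since rule order matters:

$ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"
# Rule 1: include the subsite and everything beneath it (note the *.*)
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path "http://sharepoint/sites/site1/site2/*.*" -Type InclusionRule
# Rule 2: exclude everything else in the Web Application
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path "http://*.*" -Type ExclusionRule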

Voila!

Report on all Search Site references across SharePoint Site Collections

I got an interesting request recently to find all search centers configured for all Site Collections. I thought I would share the very simple script to do this:

# One CSV-style line per site collection: URL, Search Center URL
Get-SPSite -Limit All | % {
    Write-Host "$($_.Url),$($_.RootWeb.AllProperties['SRCH_ENH_FTR_URL'])"
}
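
To write the same report to a file instead of the console, a small variant (assuming C:\temp exists and is writable):

Get-SPSite -Limit All | % {
    "$($_.Url),$($_.RootWeb.AllProperties['SRCH_ENH_FTR_URL'])"
} | Out-File C:\temp\SearchCenters.csv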

You can set the search dropdown property using this assignment (note the value is a string and must be quoted):

$web.AllProperties["SRCH_SITE_DROPDOWN_MODE"] = "HideScopeDD_DefaultContextual"

Here are the possible SRCH_SITE_DROPDOWN_MODE values and what they mean:

Site Collection Search Dropdown Mode                                      | Property Value                 | Search Results URL
Do Not Show Scopes Dropdown, and default to contextual scope              | HideScopeDD_DefaultContextual  | Y
Do Not Show Scopes Dropdown, and default to target results page           | HideScopeDD                    | N
Show scopes Dropdown                                                      | ShowDD                         | Y
Show, and default to 's' URL parameter                                    | ShowDD_DefaultURL              | Y
Show, and default to contextual scope                                     | ShowDD_DefaultContextual       | Y
Show, do not include contextual scopes                                    | ShowDD_NoContextual            | N
Show, do not include contextual scopes, and default to 's' URL parameter  | ShowDD_NoContextual_DefaultURL | N

Here’s the full PowerShell script to set these values:

$web = Get-SPWeb http://sharepoint/managedpath/site
$web.AllProperties["SRCH_ENH_FTR_URL"] = "/search/"
$web.AllProperties["SRCH_SITE_DROPDOWN_MODE"] = "HideScopeDD_DefaultContextual"
$web.AllProperties["SRCH_TRAGET_RESULTS_PAGE"] = "/_layouts/OSSSearchResults.aspx"  # "TRAGET" is not my typo: that appears to be the actual (misspelled) property name
$web.Update()
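
To confirm the values took effect, you can read them back (same hypothetical URL as above):

$web = Get-SPWeb http://sharepoint/managedpath/site
$web.AllProperties['SRCH_ENH_FTR_URL']
$web.AllProperties['SRCH_SITE_DROPDOWN_MODE']
$web.AllProperties['SRCH_TRAGET_RESULTS_PAGE']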

A link straight to a SharePoint document’s metadata

Often users want a link directly to a document's metadata. That's easily done using this format:
http://SharePoint/sites/SiteCol/DemoMajorV2/Forms/DispForm.aspx?ID=[x]

Here's a sample link to a document's metadata properties, just add the ID:
http://SharePoint/sites/SiteCol/DemoMajorV2/Forms/DispForm.aspx?ID=285

I took a random document:
http://SharePoint/sites/SiteCol/DemoMajorV2/TestDoc.docx

Found its ID in the browser by adding the ID field to a View:
http://SharePoint/sites/SiteCol/DemoMajorV2/Forms/My%20Documents.aspx

Then took the format:
http://SharePoint/sites/SiteCol/DemoMajorV2/Forms/DispForm.aspx?ID=[x] and added the number to it:

http://SharePoint/sites/SiteCol/DemoMajorV2/Forms/DispForm.aspx?ID=285
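
If you'd rather script the lookup than add an ID column to a view, here's a minimal PowerShell sketch using the hypothetical site, library, and file names above:

# Look up a document's list item ID, then build the DispForm link
$web = Get-SPWeb "http://SharePoint/sites/SiteCol"
$list = $web.Lists["DemoMajorV2"]
$item = $list.Items | Where-Object { $_.File.Name -eq "TestDoc.docx" }
"$($web.Url)/DemoMajorV2/Forms/DispForm.aspx?ID=$($item.ID)"
$web.Dispose()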

That same format can be used within the search XSL to add a reference to view the document's metadata in search results. Here's the XSL to paste into the XSL field of the Core Search Results Web Part:

<div class="srch-Title3">
  <xsl:variable name="itemid" select="itemid"/>
  <xsl:choose>
    <xsl:when test="contentclass[. = 'STS_ListItem_DocumentLibrary']">
      <xsl:choose>
        <xsl:when test="contains(basic4,'http')">
          <xsl:variable name="library" select="substring-after(substring-after(url,basic4),'/')" />
          <xsl:variable name="displayUrl" select="concat(basic4,'/',substring-before($library,'/'),'/Forms/DispForm.aspx?ID=',$itemid)" />
          <a href="{$displayUrl}">
            Show properties
          </a>
        </xsl:when>
        <xsl:otherwise>
          <xsl:variable name="DocLib" select="substring-after(substring-after(url,sitename),'/')" />
          <xsl:variable name="MetaDataPath" select="concat(sitename,'/',substring-before($DocLib,'/'),'/Forms/DispForm.aspx?ID=',$itemid)" />
          <a href="{$MetaDataPath}">
            Show properties
          </a>
        </xsl:otherwise>
      </xsl:choose>
      <a href="{sitename}">
        Show library
      </a>
      <br></br>
    </xsl:when>
    <xsl:otherwise>
    </xsl:otherwise>
  </xsl:choose>
</div>

Reporting on all SharePoint Search Scopes

Search scopes are often created to refine the results returned by SharePoint Search. I've written this small snippet of PowerShell as an easy way to get a report on all scopes. I decided not to embellish it, and keep it quick and (not too) dirty; here goes:

$a = Get-SPEnterpriseSearchServiceApplication  # grabs Content and Query SSAs
$scopes = $a | Get-SPEnterpriseSearchQueryScope
foreach ($Scope in $scopes)
{
    Write-Host $Scope.Name
    Write-Host "======================="
    $Scope.Rules  # outputs all the rules for this scope
}
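
If you'd rather hand someone a file, here's a slightly embellished variant that exports a scope summary to CSV. Name and Description are standard scope properties; treat Count (the number of items in the scope) as my assumption about the property name:

Get-SPEnterpriseSearchServiceApplication |
    Get-SPEnterpriseSearchQueryScope |
    Select-Object Name, Description, Count |
    Export-Csv C:\temp\ScopeReport.csv -NoTypeInformation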

FAST SharePoint Property Mapping Report

In Search, it is not an exaggeration to say that property mapping is the heart of customized search intelligence. Managed properties allow search to be customized to serve an organization's needs. I thought it would be useful to report on the mapping of managed properties to crawled properties. I use a CSV format for the report, where a '|' is the delimiter and a semicolon separates the crawled properties. Using Excel, one can easily convert pipe-delimited data into columns.

The first thing we want to do is get the collection of managed properties using Get-FASTSearchMetadataManagedProperty. Then, for each managed property, we get the associated crawled properties using the GetCrawledPropertyMappings() method. Here's the full script:

$ReportFileName = "C:\temp\MappingReport.csv"
$sep = '|'
cls
$LineOut = "Name$($sep)Description$($sep)Type$($sep)Mapping"
Add-Content $ReportFileName $LineOut
$mps = Get-FASTSearchMetadataManagedProperty
foreach ($MP in $mps)
{
    $q = $MP.GetCrawledPropertyMappings()
    $CPs = $null
    if ($q.GetType().Name -eq "CrawledPropertyMappingImpl")
    {
        # Build a semicolon-separated list of crawled property names
        foreach ($cp in $q)
        {
            $CPs = "$($CPs);$($cp.Name)"
        }
        if ($CPs -ne $null)
        {
            $CPs = $CPs.Remove(0,1)  # strip the leading semicolon
        }
    }
    else
    {
        $CPs = $q.GetType().Name
    }
    $LineOut = "$($MP.Name)$($sep)$($MP.Description)$($sep)$($MP.Type)$($sep)$($CPs)"
    Add-Content $ReportFileName $LineOut
}
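
To pull the report back into PowerShell, or to sanity-check it before opening it in Excel, Import-Csv understands the pipe delimiter directly:

Import-Csv C:\temp\MappingReport.csv -Delimiter '|' | Format-Table -AutoSize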

Tuning your Crawl

Want to tune your Search crawling? There's plenty of benefit to be had in refining how Search crawls in SharePoint.  Eliminate useless page hits, or documents that will fail crawl processing.  It's another way to exclude sensitive documents as well, if you can find a suitable search crawl exclusion rule.  I found out the hard way that SharePoint URLs defined in a Content Source MUST be a Web Application.  If you only want to crawl a subsite, your recourse is to pare out all other sites using Crawl Rules.  Crawl Rules come in two basic flavors: simple wildcards, which are quite intuitive, and Regular Expressions.  You can find the Crawl Rules in Central Admin, General Application Settings, Search, (your Content SSA if in FAST), Crawl Rules (visible on the left).

Surprisingly, there is scant documentation on the Regular Expression implementation in SharePoint.  Through a bit of digging and trial and error I’ve summarized the Regular Expression operators supported in SharePoint:

?      Conditional match; the preceding element is optional.
       Example: http://sharepoint/List_[a-z]?.aspx (the a-z character is optional)
*      Matches zero or more of the preceding element.
       Example: http://sharepoint/List_M* (no M, one M, MM, and so on at the end)
+      Matches one or more of the preceding element.
       Example: http://SharePoint/List_M+ (one or more Ms at the end)
.      Matches exactly one character.
       Example: http://sharepoint/List_. (one character expected after the _)
[abc]  Matches any one of the listed characters; ranges such as a-z work too.
       Example: http://sharepoint/List_[a-z] (matches List_ followed by any letter a-z)
|      Exclusive OR; if both sides are true, this evaluates to false.
()     Parentheses group characters for an operation.
{x,y}  Matches between x and y occurrences.
{x}    Matches exactly x occurrences.
{x,}   Matches x or more occurrences.
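
As a sketch, a regular-expression rule can also be created from PowerShell; the -IsAdvancedRegularExpression parameter marks the path as a regex rather than a simple wildcard (the path below reuses the hypothetical example from the table):

$ssa = Get-SPEnterpriseSearchServiceApplication  # your Content SSA if on FAST
New-SPEnterpriseSearchCrawlRule -SearchApplication $ssa -Path "http://sharepoint/List_[a-z]?.aspx" -Type ExclusionRule -IsAdvancedRegularExpression $true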

For FAST, note the Crawl Rules are under your Content SSA, not the Query SSA.

To create an Exclusion Rule with PowerShell (Type 0 = include, 1 = exclude):

New-SPEnterpriseSearchCrawlRule -SearchApplication FASTSearchApp -Path "http://SharePoint/Sites/Secret/*" -Type 1

To output all your Crawl Rules, use this line of PowerShell:

get-SPEnterpriseSearchServiceApplication | get-SPEnterpriseSearchCrawlRule | ft

The Get-SPEnterpriseSearchCrawlRule cmdlet requires a Service Application object, so we simply pipe one in using the Get-SPEnterpriseSearchServiceApplication cmdlet.  You can then pipe the output to whatever you want.  "ft" is an alias for Format-Table, which is the default output, but you can just as easily pipe it to a file for automatic documentation.  This is especially useful when playing with your crawl rules.
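
For example, to capture the rules to a file for documentation (assuming a writable C:\temp; Path, Type, and Priority are standard crawl rule properties):

Get-SPEnterpriseSearchServiceApplication | Get-SPEnterpriseSearchCrawlRule |
    Format-Table Path, Type, Priority -AutoSize | Out-File C:\temp\CrawlRules.txt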


FAST Search and the ticking time bomb

Tick…tick…tick… When you install FAST Search for SharePoint (FS4S), you probably have a time bomb set to go off exactly one year later, to the minute. Unless you configured trust certificates between servers using a Certificate Authority, FAST/SharePoint uses a self-signed certificate with a one-year expiration.

Worse, the ULS logs only point out that SharePoint can't connect to the Content Distributor (the server name and port are environment specific):

Failed to connect to srv-fast01.ReplaceWithYourDomain.com:13391 sp. Error=Failed to initialize session with document engine: Unable to resolve Contentdistributor [documentsubmitterworkerthread.cpp:132] d:\office\source\search\native\gather\plugins\contentpi\documentsubmitterworkerthread.cpp

I also saw the following error which is either misleading or unrelated:

At memory capacity. Load is 80%, configured to block at 80%. have been waiting 00:57 to queue this document [documentmanager.cpp:969]

To make matters worse, without connectivity to FAST, crawls hang and get stuck saying “Stopping”, clearing the FAST Index hangs in Central Admin; it’s not pretty…

Let's take a step back. FAST requires that SSL be used for search crawling communication between the FAST server(s) and SharePoint server(s). To communicate via SSL, a certificate needs to be generated on the FAST server and installed on the SharePoint server.

If you are getting errors connecting to the Content Distributor, it makes sense to first see if it is running, by running the following in a FAST Search shell:

nctrl status

A very useful PowerShell command shows the connectivity and certificate status:

Ping-SPEnterpriseSearchContentService srv-fast01.[ReplaceWithYourDomain].com:13391

This command shows a list of the certificates the service knows about. When the problem existed, the entry for the certificate in use showed an ExpiryDate of the day before and a ConnectionSuccess of False. Note the port is 391 above your default base port, which is 13000 unless you changed it at installation.
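
To zero in on the relevant fields, something like this should work; ExpiryDate and ConnectionSuccess are the fields described above, but treat the exact property names on the returned objects as my assumption:

Ping-SPEnterpriseSearchContentService "srv-fast01.[ReplaceWithYourDomain].com:13391" |
    Select-Object ExpiryDate, ConnectionSuccess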

Here's how to create a refreshed cert that will again celebrate its own birthday via expiration. On the FAST server, open a PowerShell window in D:\YourFASTdirectory\Installer\Scripts and run:

.\ReplaceDefaultCertificate.ps1 -generateNewCertificate $true

Microsoft helpfully provides a PowerShell script to load the certificate into the SharePoint server. Before we run that, let's configure things so we generate a certificate that's good for longer than a single year. To do that, edit the script called certificatesetup.ps1 in C:\FASTSearch\installer\scripts\include and, right after the line (around line 246) which says:

Add-Content -Path $infFile -Value "SuppressDefaults=true"

Add the following lines underneath it:

Add-Content -Path $infFile -Value "ValidityPeriod=Years"
Add-Content -Path $infFile -Value "ValidityPeriodUnits=20"  # now we are adding 20 years to the cert life

You'll want to be sure to get the SSA name correct, as well as the service account; this is the account under which the Application Pool hosting the FAST Search Connector Service Application runs in IIS. Note you'll need to copy both the SecureFASTSearchConnector.ps1 script and the certificate itself (find the file by going to Certificates (Local Computer)\Trusted Root Certification Authorities in MMC). You will also need to stop the FAST Search Service and FAST Search Monitoring services in Services.msc before you can generate a new certificate. When exporting the certificate, make sure to export the private key; you will be prompted for a password. You'll want it in PFX format, not DER or CER format. If you try to use MMC, you may find you can't export the cert with its private key. The good news is that the FAST script automatically exports the certificate in the right format, to this location: D:\FASTSearch\data\data_security\cert

For the FAST SSA, use the Content SSA, not the Query SSA. You can determine the service account easily by checking the Service Application in Central Admin and clicking "Properties".

.\SecureFASTSearchConnector.ps1 -certPath "path of the certificate\certificatename.pfx" -ssaName "name of your content SSA" -username "domain\username"

If you do need to change the port for your Content Distributor, here are the PowerShell commands. Remember to replace the server/domain/port with your own:

$SearchSSA= Get-SPEnterpriseSearchServiceApplication -identity 'FAST Search Connector'
$SearchSSA.extendedConnectorProperties["ContentDistributor"]="srv-fast01.YourDomain.com:13391"
$SearchSSA.update()
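
To verify the change took effect, read the property back (same hypothetical server and port):

$SearchSSA = Get-SPEnterpriseSearchServiceApplication -Identity 'FAST Search Connector'
$SearchSSA.ExtendedConnectorProperties["ContentDistributor"]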