Friday, January 11th, 2008

Search Server 2008: Federated sites that do not return XML



The OpenSearch standard allows the returned content to be in XML or HTML/XHTML format, although the latter makes federation more difficult for Search Server 2008, which is designed to apply an XSL transformation to the results before presenting them to the user. There is, however, a fairly simple process (although it does require some code) for providing an intermediary step between the Federated web parts and the Federated search source.

In this post I will show how this can be achieved against the well-known search engine Google. You could do the same thing against any data source that is accessible through code, so you could roll your own BDC-type solution to expose Line of Business information through Federation. The basic steps are to create an intermediary page that runs within the SharePoint layouts directory, receives the query string, makes an HTTP request to Google, and uses a simple regular expression and a bit of string manipulation to construct an RSS-formatted XML string to return to the Federated Search Web Part.

Note: This is not production ready code and is provided as an example of how easy it is to federate to other services that do not provide the required query/return formats.

1. Create the Google.aspx page.

In this example I will use code-beside (i.e. deploying the .cs file to the server); you will probably want to pre-compile the code in a production environment.

Create a new file called Google.aspx and copy the following code into it.

<%@ Page Language="C#" AutoEventWireup="true" CodeFile="Google.aspx.cs" Inherits="_Default" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Untitled Page</title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
    </div>
    </form>
</body>
</html>

As you can see, this page is essentially empty; it exists only so that we can hook up a page load event in the code file we will create next.

Create another file in the same directory called Google.aspx.cs, into which we will add some code to:

Get the query string

    protected string query;

    protected void Page_Load(object sender, EventArgs e)
    {
        query = Request.QueryString["q"];
    }

Call Google and parse the results

    private string getRssItemXml(string query)
    {
        // URL-encode the query so that spaces and characters such as & survive the request.
        string url = string.Format("http://www.google.com/search?q={0}", HttpUtility.UrlEncode(query));
        WebClient client = new WebClient();
        byte[] byteData = client.DownloadData(url);
        string strData = Encoding.UTF8.GetString(byteData);

        // The named groups (link, title, desc) capture the parts of each result we need.
        Regex searchPattern = new Regex("<div class=g><h2 class=r><a href=\"(?<link>.*?)\"(.*?)>(?<title>.*?)</a>(.*?)<td class=\"j\">(?<desc>.*?)<br><span class=a>(.*?)</td></tr></table></div>");
        StringBuilder sb = new StringBuilder();

        foreach (Match m in searchPattern.Matches(strData))
        {
            sb.AppendFormat("<item><title><![CDATA[{0}]]></title><link><![CDATA[{1}]]></link><description><![CDATA[{2}]]></description></item>", m.Groups["title"].Value, m.Groups["link"].Value, m.Groups["desc"].Value);
        }

        return sb.ToString();
    }

The code above looks a little messy, so I will explain it step by step.

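Build the URL from the query string that the user entered. This was read during the page load, having been passed in by the federated web part, and is URL-encoded here so that spaces and special characters survive the request.

string url = string.Format("http://www.google.com/search?q={0}", HttpUtility.UrlEncode(query));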

Using the WebClient class, download the results of the query from Google into a byte array.

WebClient client = new WebClient();

byte[] byteData = client.DownloadData(url);

Convert the byte array into a string to be used in the regular expression search.

string strData = Encoding.UTF8.GetString(byteData);

Construct the regular expression to extract the search results. Google does a reasonable job of keeping the formatting of its results consistent, so we are able to search for

<div class=g><h2 class=r>

Which is at the start of each result, then the link tag

<a href=\"(?<link>.*?)\"(.*?)>

Which provides us with the URL of the result. Here we are naming the captured group (link) so that the Regex will make the value available to us by name. This is followed by the title and description, which are captured in the same way as the title and desc groups.
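To see the named groups in isolation, here is a minimal sketch that applies only the link and title part of the pattern to a single hand-written fragment. The HTML fragment and the class name NamedGroupDemo are made up for illustration; they are not taken from a real Google response.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NamedGroupDemo
{
    // Reduced version of the article's pattern: only the link and title groups.
    public static readonly Regex LinkPattern =
        new Regex("<a href=\"(?<link>.*?)\"(.*?)>(?<title>.*?)</a>");

    public static void Main()
    {
        // Hypothetical fragment standing in for one result in the downloaded HTML.
        string html = "<h2 class=r><a href=\"http://example.com/\" class=l>Example Domain</a></h2>";

        Match m = LinkPattern.Match(html);
        if (m.Success)
        {
            // Each named group is read back by the name given in (?<name>...).
            Console.WriteLine(m.Groups["link"].Value);
            Console.WriteLine(m.Groups["title"].Value);
        }
    }
}
```

Because every quantifier is lazy (.*?), each group stops at the first point where the rest of the pattern can match, which is what keeps one result from swallowing the next.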

We then loop through the matches returned by the Regex and create the RSS items that will be returned.

    foreach (Match m in searchPattern.Matches(strData))
    {
        sb.AppendFormat("<item><title><![CDATA[{0}]]></title><link><![CDATA[{1}]]></link><description><![CDATA[{2}]]></description></item>", m.Groups["title"].Value, m.Groups["link"].Value, m.Groups["desc"].Value);
    }
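The loop above builds each item by string formatting with CDATA sections, which is fine for a demo but would produce invalid XML if a captured value ever contained the sequence ]]>. As a possible alternative (not what the downloadable code does), the sketch below builds one item with XmlWriter, which escapes the values itself. The class name RssItemDemo and the method BuildItem are hypothetical names used only for this example.

```csharp
using System;
using System.Text;
using System.Xml;

public static class RssItemDemo
{
    // Builds a single RSS <item> element, letting XmlWriter escape the text.
    public static string BuildItem(string title, string link, string description)
    {
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.OmitXmlDeclaration = true;                    // we only want a fragment
        settings.ConformanceLevel = ConformanceLevel.Fragment; // no document-level checks

        StringBuilder sb = new StringBuilder();
        using (XmlWriter writer = XmlWriter.Create(sb, settings))
        {
            writer.WriteStartElement("item");
            writer.WriteElementString("title", title);
            writer.WriteElementString("link", link);
            writer.WriteElementString("description", description);
            writer.WriteEndElement();
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Sample values chosen to contain characters that need escaping.
        Console.WriteLine(BuildItem("Tom & Jerry", "http://example.com/?a=1&b=2", "A short snippet"));
    }
}
```

Note that any markup inside the description (Google's snippets contain tags such as bold) ends up escaped rather than wrapped in CDATA, so whether this suits your results XSL is something to check before adopting it.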

The whole code looks like this, including the using directives:

using System;
using System.Data;
using System.Net;
using System.Configuration;
using System.Web;
using System.Web.Security;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.Web.UI.WebControls.WebParts;
using System.Web.UI.HtmlControls;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

public partial class _Default : System.Web.UI.Page
{
    protected string query;

    // Read the search terms passed in by the federated web part.
    protected void Page_Load(object sender, EventArgs e)
    {
        query = Request.QueryString["q"];
    }

    // Replace the normal page output with an RSS document.
    protected override void Render(HtmlTextWriter writer)
    {
        StringBuilder sb = new StringBuilder();

        Response.ContentType = "text/xml";
        sb.Append("<?xml version=\"1.0\" encoding=\"utf-8\"?>");
        sb.Append("<rss version=\"2.0\">");
        sb.AppendFormat("<channel><title><![CDATA[Google: {0}]]></title><link/><description/><ttl>60</ttl>", query);
        sb.Append(getRssItemXml(query));
        sb.Append("</channel></rss>");

        writer.Write(sb.ToString());
    }

    // Query Google and turn each result in the returned HTML into an RSS item.
    private string getRssItemXml(string query)
    {
        // URL-encode the query so that spaces and characters such as & survive the request.
        string url = string.Format("http://www.google.com/search?q={0}", HttpUtility.UrlEncode(query));
        WebClient client = new WebClient();
        byte[] byteData = client.DownloadData(url);
        string strData = Encoding.UTF8.GetString(byteData);

        // The named groups (link, title, desc) capture the parts of each result we need.
        Regex searchPattern = new Regex("<div class=g><h2 class=r><a href=\"(?<link>.*?)\"(.*?)>(?<title>.*?)</a>(.*?)<td class=\"j\">(?<desc>.*?)<br><span class=a>(.*?)</td></tr></table></div>");
        StringBuilder sb = new StringBuilder();

        foreach (Match m in searchPattern.Matches(strData))
        {
            sb.AppendFormat("<item><title><![CDATA[{0}]]></title><link><![CDATA[{1}]]></link><description><![CDATA[{2}]]></description></item>", m.Groups["title"].Value, m.Groups["link"].Value, m.Groups["desc"].Value);
        }

        return sb.ToString();
    }
}

2. Save the files to the LAYOUTS directory

If you're just testing this, complete the steps here; otherwise you will want to wrap this up as a WSP solution and deploy it properly.

Copy the files Google.aspx and Google.aspx.cs to the 12\TEMPLATE\LAYOUTS\SEARCH directory. Note that you will need to create the SEARCH directory; it is always better to store your application pages in a subfolder so that they are not overwritten by other installations.

3. Create a Federated Location Definition file (.FLD) to point to your local search page.

Provide a name, a description and the Query Template:

http://mssxdemovpc/_layouts/search/google.aspx?q={searchTerms}

Where http://mssxdemovpc is the URL of your SharePoint installation.

Provide a “More Results” link to allow the user to navigate to Google if they want more.

http://www.google.co.uk/search?hl=en&q={searchTerms}
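Both templates above rely on the OpenSearch {searchTerms} placeholder, which is replaced with the user's URL-encoded query before the request is made. The sketch below only illustrates that substitution; it is not how Search Server implements it internally, and the class name QueryTemplateDemo and the method Expand are hypothetical.

```csharp
using System;

public static class QueryTemplateDemo
{
    // Substitutes the URL-encoded search terms into an OpenSearch-style template.
    public static string Expand(string template, string searchTerms)
    {
        return template.Replace("{searchTerms}", Uri.EscapeDataString(searchTerms));
    }

    public static void Main()
    {
        string template = "http://mssxdemovpc/_layouts/search/google.aspx?q={searchTerms}";

        // Prints the URL that the intermediary page would receive for this query.
        Console.WriteLine(Expand(template, "search server"));
    }
}
```

This is why Google.aspx reads its input from the q parameter: whatever name precedes {searchTerms} in the Query Template is the name the page has to look up in Request.QueryString.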

Specify Credentials: as the web part will be calling your SharePoint page, you will probably need to enable authentication to it. In my example I set this to NTLM – Use Application Pool Identity, as all I needed was to get to the page. You may want to look at user-based authentication or a specific account.

Save your FLD, then edit your search results page to use the new location and see how you can bring Google search federation into your environment :)

Download

The files can be downloaded here from my SkyDrive, along with the Presentation from the SUGUK meeting. 

NOTE:  The FLD file does not store the Credentials in the XML so you will need to manually set this after you import it.

  • dawsonweb
    I just tried this and it doesn't return any results. I think it's the regex. Can anyone confirm if it's still working? Thanks.
  • You are correct the regex is not working any longer, someone commented that they planned to correct this. If I get time I will take a look.
  • David_Effs
    Thanks for a great article. This regex works:
    Regex searchPattern = new Regex("<li class=g>

    <a href=\"(?<link>.*?)\"(.*?)>(?<title>.*?)(.*?)

    <div class=\"s\">(?<desc>.*?)(.*?)");
  • bradgcoza
    @ David_Effs : There is something not 100% with that fix, it seems to use 100% of my processor and my sharepoint grinds to a halt ....
  • Allan Pedersen
    Great article, but unfortunately I haven't been able to get it working. If I test the regex in a standard .net console application it seems that the result page is not properly parsed. I guess that the source HTML for the result page has changed since this article was written. Does anyone have the new regex pattern?
  • Tom
    Connecting directly with a browser is no problem. Also I am using (for test purposes) the existing live Search Connector, which works too. So I guess there is something wrong with the connector ...
    Thank you for your response!
  • Tom,

    Have you tried to access the site http://www.google.com/search?q=test from the SharePoint server? This looks more like your network is locked down to prevent external access or has incorrect routing.
  • Tom
    Same here: No connection to Google available. Starting just the aspx itself creates the following error:

    No connection could be made because the target machine actively refused it 74.125.39.99:80

    Any chance that you will fix this?
    :-O
  • My initial thought on this is that the response from Google has changed so the Regex is no longer working.
  • I am having the same issue as Jesper. I have set up security but still get no results displayed. I checked the Manage Federated Locations and it says that there have been queries, so I believe it is making the call. I just do not get the results displayed.

    Any other hints yet?
  • Hello Andrew,

    Thanks for the code - exactly what I was looking for !!

    I have the same problem as Nick though. The page displays the RSS header but no results at the bottom. I have set up the security, for starters the NTLM App. Pool identity. I am running on a Win2k3 WSS 3 32-bit with the Infrastructure update.

    Any other hints please?
  • Nick, you need to do the last bit and configure security manually.

    NOTE: The FLD file does not store the Credentials in the XML so you will need to manually set this after you import it.
  • Nick
    Andrew

    I am running MOSS 2007 x64 and I'm using the source files you've provided... I have uploaded the Google.aspx and Google.aspx.cs to 12\Template\layouts\search; and I have imported the FLD, replacing "mssxdemovpc" with the URL of my SharePoint installation. However i am not returning any results when pointing a federated search web part at the Google location, or when browsing to http:\\[my url]\_layouts/search/google.aspx?q=[test search terms].

    Am I missing something?
  • Andrew Woodward
    Eugene, I agree the code could be refactored to use the API but that was not really the purpose of the article, it was really just demonstrating how you could provide your own proxy federation and from here do what you needed from parsing html to as you point out using providers APIs.


    Andrew
  • Eugene Rosenfeld
    Nice article. One question: why are you parsing Google's HTML rather than calling their web services? The web services already return the search results in XML.
    http://code.google.com/apis/soapsearch/reference.html
  • Anonymous
    This article helped me greatly during my internship. Thanks!