Java

How to do profile matching with Apache Solr

Remco Moolenaar

Oct 29, 2015

Apache Solr is a Java search engine built on the Lucene library. It is generally used to search inside indexed documents and is very fast.

However, there are many situations where you want more than 'just' search. For instance, when you are looking for a person to start a relationship with. In those cases you are not searching, but matching.

Matching

When searching, you limit the result set by adding constraints, like 'person must live in Amsterdam'.

When matching, the result set is (in theory) as big as the original set of objects. Instead of removing objects, you give every object a score that expresses how closely it approximates the optimal solution.
For instance, when matching persons and you want this person to live in Amsterdam, people actually living in Amsterdam will get a score of 100% and people in New York will get a score of 0%.
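The difference between the two can be sketched in a few lines of plain Python, without Solr. The sample persons and the function names search and match are made up for this illustration.

```python
# Hypothetical sample data; names and cities are made up for illustration.
people = [
    {"name": "Anna", "city": "Amsterdam"},
    {"name": "Bob", "city": "New York"},
    {"name": "Carla", "city": "Amsterdam"},
]


def search(items, city):
    """Searching: drop everything that does not satisfy the constraint."""
    return [p for p in items if p["city"] == city]


def match(items, city):
    """Matching: keep every item, but attach a score between 0.0 and 1.0."""
    scored = [(1.0 if p["city"] == city else 0.0, p) for p in items]
    # Best matches first; the sort is stable, so ties keep their input order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


found = search(people, "Amsterdam")   # 2 persons left
ranked = match(people, "Amsterdam")   # still 3 persons, now with a score
```

Searching shrinks the set, while matching keeps it complete and only changes the order.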

Example

We will first show a simple example of the basic set-up. We will add some documents to a local Solr installation and run a few queries.

Adding documents

See: http://stackoverflow.com/questions/26522651/how-to-add-index-terms-manually-in-apache-solr
Add the following three documents to your Solr collection:

{"id":"1","title":"short", "length_l": 170 }
{"id":"2","title":"normal", "length_l": 185 }
{"id":"3","title":"long", "length_l": 190 }

We now have three different persons, each with a different length (in centimetres).
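One possible way to send these documents from a script is a JSON POST to Solr's update handler. The sketch below assumes a local Solr on the default port and a collection named "profiles"; both are assumptions, so adjust them to your installation. Building the request is kept separate from sending it, so the payload can be inspected without a running Solr.

```python
import json
import urllib.request

# Assumption: local Solr, default port, collection called "profiles".
SOLR_UPDATE_URL = "http://localhost:8983/solr/profiles/update?commit=true"

DOCS = [
    {"id": "1", "title": "short", "length_l": 170},
    {"id": "2", "title": "normal", "length_l": 185},
    {"id": "3", "title": "long", "length_l": 190},
]


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a POST request with the documents as a JSON array."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def add_documents(docs, url=SOLR_UPDATE_URL):
    """Send the documents to Solr; this needs a running Solr instance."""
    with urllib.request.urlopen(build_update_request(docs, url)) as response:
        return response.status


request = build_update_request(DOCS)
```

Calling add_documents(DOCS) should then index the three persons, provided the assumed URL matches your set-up.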

Query

What we want to query is a person of around 185 centimetres in length. To do this we will use the dist function inside Solr (see: https://wiki.apache.org/solr/FunctionQuery#dist).
The dist function takes the power of the distance norm as its first parameter, followed by the coordinates of point 1 and point 2. The distance is then calculated between those two points.
In this example we use the Euclidean distance (straight line), so we assign '2' to the first parameter.
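To make the calculation concrete, here is a small Python sketch of such a distance function. It is not Solr's implementation: it only covers powers of 1 and higher, and it takes each point as a list, whereas Solr expects all coordinates as one flat argument list.

```python
def dist(power, point_a, point_b):
    """p-norm distance between two points of equal dimension.

    power=1 gives the Manhattan distance, power=2 the Euclidean distance.
    """
    if len(point_a) != len(point_b):
        raise ValueError("both points need the same number of coordinates")
    if power < 1:
        raise ValueError("this sketch only covers power >= 1")
    total = sum(abs(a - b) ** power for a, b in zip(point_a, point_b))
    return total ** (1.0 / power)


# Distance between the wanted length (185) and each of the three documents.
distances = {length: dist(2, [185], [length]) for length in (170, 185, 190)}
```

For the three documents this gives distances of 15, 0 and 5 centimetres.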

The query will then look like this:

{!boost b=dist(2,185,length_l)} length_l:[0 TO 200]

Which means: search for documents where the length_l attribute is between 0 and 200. For those documents, multiply the score by the distance between 185 and the length_l attribute.
The '_l' suffix on the attribute name tells Solr to treat the attribute as a long integer (assuming the dynamic field definitions of Solr's example schema).
This results in (using '*,score' for the fl parameter):

{
  "response": {
    "numFound": 3,
    "start": 0,
    "maxScore": 15,
    "docs": [
      {
        "id": "1",
        "length_l": 170,
        "score": 15
      },
      {
        "id": "3",
        "length_l": 190,
        "score": 5
      },
      {
        "id": "2",
        "length_l": 185,
        "score": 0
      }
    ]
  }
}

This is not what we had hoped for: the person who is the least optimal (with a length of 170 cm) pops up on top of the list.
The reason for this is quite obvious: we have boosted the score with the distance itself. In other words, the bigger the distance, the bigger the score.

To fix this, we need the recip function from Solr (see https://wiki.apache.org/solr/FunctionQuery#recip). The function recip(x,m,a,b) implements the calculation a / (m * x + b). When a equals b and x is never negative (which holds for a distance), the outcome lies between 0 and 1, and it equals exactly 1 when x is 0. So, the higher x (in our case the distance), the lower the outcome of this function.
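The formula is easy to try out in Python. The sketch below mirrors a / (m * x + b) with m = 1 and a = b = 10, applied to the distances of our three documents from the wanted 185 cm. The choice of 10 is just a tuning value: larger values of a and b make the score drop more slowly.

```python
def recip(x, m, a, b):
    """Reciprocal function a / (m * x + b)."""
    return a / (m * x + b)


# Distance of each document from the wanted length, turned into a score.
scores = {length: recip(abs(185 - length), 1, 10, 10) for length in (170, 185, 190)}
# 185 cm -> 10 / 10 = 1.0
# 190 cm -> 10 / 15 = 0.666...
# 170 cm -> 10 / 25 = 0.4
```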

The new query will then look like this (calculating 10 / (1 * distance + 10) for the recip function):

{!boost b=recip(dist(2,185,length_l),1,10,10)} length_l:[0 TO 200]

And the corresponding result set:

{
  "response": {
    "numFound": 3,
    "start": 0,
    "maxScore": 1,
    "docs": [
      {
        "id": "2",
        "length_l": 185,
        "score": 1
      },
      {
        "id": "3",
        "length_l": 190,
        "score": 0.6666667
      },
      {
        "id": "1",
        "length_l": 170,
        "score": 0.4
      }
    ]
  }
}

That looks much better!
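To run such a query from a script, the request URL could be assembled as in the sketch below. The host, port and the collection name "profiles" are assumptions, and build_match_url is a hypothetical helper, not part of any Solr client library.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: local Solr, default port, collection called "profiles".
SOLR_SELECT_URL = "http://localhost:8983/solr/profiles/select"


def build_match_url(target, field, low, high, base_url=SOLR_SELECT_URL):
    """Build the select URL for a one-dimensional match on a numeric field."""
    boost = "recip(dist(2,{0},{1}),1,10,10)".format(target, field)
    query = "{!boost b=" + boost + "} " + "{0}:[{1} TO {2}]".format(field, low, high)
    # fl=*,score returns all stored fields plus the calculated score.
    params = {"q": query, "fl": "*,score", "wt": "json"}
    return base_url + "?" + urlencode(params)


url = build_match_url(185, "length_l", 0, 200)
decoded = parse_qs(urlsplit(url).query)  # decode again to inspect the parameters
```

urlencode takes care of escaping the braces, brackets and spaces in the query.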

Some remarks

The example above is a simple one-dimensional space, where the distance is basically the absolute difference between two numbers. However, the approach works for n-dimensional spaces as well.
So, for example, for persons with a length attribute and an age attribute the last query could look like this:

{!boost b=recip(dist(2,185,30,length_l,age_l),1,10,10)} length_l:[0 TO 200] OR age_l:[0 TO 99]

where the optimal solution is a person aged 30 who is 185 centimetres long.
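The two-dimensional score can be checked with a short Python sketch. The candidates below are made up, and match_score is a hypothetical helper that mimics recip(dist(2,...),1,10,10) for points given as (length, age).

```python
import math


def match_score(target, candidate):
    """Mimic recip(dist(2, ...), 1, 10, 10) for two points of equal dimension."""
    distance = math.sqrt(sum((t - c) ** 2 for t, c in zip(target, candidate)))
    return 10 / (1 * distance + 10)


target = (185, 30)  # wanted: 185 cm, 30 years

# Hypothetical candidates as (length in cm, age in years).
candidates = {
    "exact": (185, 30),
    "taller": (188, 30),  # 3 cm off
    "older": (185, 34),   # 4 years off
    "both": (188, 34),    # distance sqrt(3^2 + 4^2) = 5
}
scores = {name: match_score(target, point) for name, point in candidates.items()}
```

Note that in this calculation one centimetre weighs exactly as much as one year. If that is not what you want, the attributes would need to be scaled first.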

Matching profiles

So you think we are there? Well, no.
The really interesting and even more practical use case is where you want to match profiles, where some user-based attributes are being matched.

User-based attributes

In the example above we introduced two attributes: length and age.
But what happens when these attributes aren't predefined, but are based on some sort of settings table inside the database?
For instance:

Attribute: id = 1, name = 'length'
Attribute: id = 2, name = 'age'
Attribute: etc...

In this way the structure of the objects sent to Solr is not predefined, but flexible.

Object structure for Solr

The solution is dead simple: instead of creating an attribute with the name length, derive the attribute name from the id in the settings table, so it becomes attribute_1 (and attribute_1_l once the type suffix is added).
The last query needs to be rewritten to:

{!boost b=recip(dist(2,185,30,attribute_1_l,attribute_2_l),1,10,10)} attribute_1_l:[0 TO 200] OR attribute_2_l:[0 TO 99]

This way a very flexible setup for matching profiles can be built using Apache Solr!
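Since the attribute names now follow a fixed pattern, the query can be generated from the wanted profile. The sketch below shows one way to do that; build_profile_query and the shape of its input (attribute id mapped to target value, lowest and highest allowed value) are assumptions made for this illustration.

```python
def build_profile_query(wanted):
    """Build the boost query for a wanted profile.

    wanted maps an attribute id to (target value, lowest allowed, highest allowed).
    """
    ids = sorted(wanted)
    # Field names follow the attribute_<id>_l convention described above.
    fields = ["attribute_{0}_l".format(attribute_id) for attribute_id in ids]
    targets = [str(wanted[attribute_id][0]) for attribute_id in ids]
    boost = "recip(dist(2,{0},{1}),1,10,10)".format(",".join(targets), ",".join(fields))
    ranges = [
        "{0}:[{1} TO {2}]".format(field, wanted[attribute_id][1], wanted[attribute_id][2])
        for field, attribute_id in zip(fields, ids)
    ]
    return "{!boost b=" + boost + "} " + " OR ".join(ranges)


# Attribute 1 = length (185 wanted), attribute 2 = age (30 wanted).
query = build_profile_query({1: (185, 0, 200), 2: (30, 0, 99)})
```

For the length and age profile this produces the same query as written out above, and adding a third attribute only requires one more entry in the dictionary.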

Some last remarks

When implementing a solution like this, take the following considerations into account:

With a flexible setup, not all attributes have to be present in every Solr object.
When a profile does not have an attribute that is used in the match, that attribute is treated as having a value of zero (0) in the distance calculation.
Such automatic attribute values can have a considerable influence on the calculated distance, and therefore on the score. So please keep this in mind.
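How much a missing attribute can distort the ranking is easy to show with the same score calculation in Python. The two profiles below are made up, and match_score is a hypothetical helper that mimics recip(dist(2,...),1,10,10).

```python
import math


def match_score(target, candidate):
    """Mimic recip(dist(2, ...), 1, 10, 10) for two points of equal dimension."""
    distance = math.sqrt(sum((t - c) ** 2 for t, c in zip(target, candidate)))
    return 10 / (distance + 10)


target = (185, 30)       # wanted: 185 cm, 30 years

complete = (180, 35)     # both attributes present, each 5 off
missing_age = (185, 0)   # perfect length, but age is absent and read as 0

score_complete = match_score(target, complete)    # distance sqrt(50), about 7.07
score_missing = match_score(target, missing_age)  # distance 30, score 10 / 40
```

The profile with the perfect length ends up well below the other one, purely because its age is missing. One option worth evaluating, which is not part of the setup described here, is Solr's def(field,default) function query, which substitutes a default of your choice when a document has no value for the field.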