Similar Entries module: Scalability issues and workarounds

A fast-growing client site uses the Similar Entries module. During a performance tuning exercise, we found a few scalability issues with the module, and some workarounds for them.

Block configuration does not load

This site has more than 41,000 term ids in its categories. It is a user-generated content site with free tagging enabled, so there are lots of tags.

As a result, the Similar Entries module cannot load its block settings page: the request times out after 30 seconds.

The reason is the following code:

if (module_exist('taxonomy')) {
  $names = _similar_taxonomy_names();
  $form['similar_taxonomy'] = array(
    '#type' => 'fieldset',
    '#title' => t('Taxonomy category filter'),
    '#collapsible' => true, '#collapsed' => true
  );
  $form['similar_taxonomy']['similar_taxonomy_filter'] = array(
    '#type' => 'radios',
    '#title' => t('Filter by taxonomy categories'),
    '#default_value' => variable_get('similar_taxonomy_filter', 0),
    '#options' => array(t('No category filtering'), t('Only show the similar nodes in the same category as the original node'), t('Use global category filtering')),
    '#description' => t('By selecting global filtering, only nodes assigned to the following selected categories will display as similar nodes, regardless of the categories the original node is or is not assigned to.')
  );
  $form['similar_taxonomy']['similar_taxonomy_select'] = array(
    '#type' => 'fieldset',
    '#title' => t('Taxonomy categories to display'),
    '#collapsible' => true, '#collapsed' => true
  );
  $form['similar_taxonomy']['similar_taxonomy_select']['similar_taxonomy_tids'] = array(
    '#type' => 'select',
    '#default_value' => variable_get('similar_taxonomy_tids', array_keys($names)),
    '#description' => t('Hold the CTRL key to (de)select multiple options.'),
    '#options' => $names, '#multiple' => true
  );
}

This code builds an array of every term id and its name in the $names variable, then populates a multi-select box with it. For a few tens of terms, this is acceptable. With 41,000+ terms, it is insane.

The workaround is to change the code above to the following, which eliminates the term selection altogether.

if (module_exist('taxonomy')) {
  $form['similar_taxonomy'] = array(
    '#type' => 'fieldset',
    '#title' => t('Taxonomy category filter'),
    '#collapsible' => true, '#collapsed' => true
  );
  $form['similar_taxonomy']['similar_taxonomy_filter'] = array(
    '#type' => 'radios',
    '#title' => t('Filter by taxonomy categories'),
    '#default_value' => variable_get('similar_taxonomy_filter', 0),
    '#options' => array(t('No category filtering'), t('Only show the similar nodes in the same category as the original node')),
  );
}

Now, the block configuration page will load successfully.
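
If you would rather keep term filtering on smaller sites, one option is to skip the select box only when the term count is over some cutoff. The following is a minimal sketch of that idea, not code from the module: similar_term_select_element(), SIMILAR_MAX_SELECT_TERMS and the loader callback are all hypothetical names, and it assumes you can get the term count cheaply (for example with a COUNT(*) query) before loading any names.

```php
<?php
// Hypothetical cutoff, not a setting of the Similar Entries module.
define('SIMILAR_MAX_SELECT_TERMS', 500);

/**
 * Hypothetical helper: build the term select element only for small vocabularies.
 *
 * $term_count    Number of terms on the site, obtained cheaply by the caller.
 * $load_names    Callback returning array(tid => name); only called when needed.
 * $selected_tids Term ids to preselect.
 *
 * Returns a form element array, or NULL when there are too many terms.
 */
function similar_term_select_element($term_count, $load_names, array $selected_tids) {
  if ($term_count > SIMILAR_MAX_SELECT_TERMS) {
    // Too many terms: never load the names, never render the select box.
    return NULL;
  }
  return array(
    '#type' => 'select',
    '#default_value' => $selected_tids,
    '#options' => call_user_func($load_names),
    '#multiple' => TRUE,
  );
}

// Usage: a stand-in loader for illustration; a real one would query the database.
$loader = function () {
  return array(1 => 'drupal', 2 => 'mysql');
};
var_dump(similar_term_select_element(2, $loader, array(1)) !== NULL);   // bool(true)
var_dump(similar_term_select_element(41000, $loader, array()) === NULL); // bool(true)
```

The caller would add the element to the form only when the return value is not NULL. The global filtering radio option would need to be hidden in the same case, since it depends on the term selection.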

Full Text and Temp tables

Another issue we found is that the queries issued by this module are on the heavy side. They do a full text matching of the node text and title.

So, if you run mtop or SHOW PROCESSLIST in MySQL, you will see the following come up a lot:

Command Time State                   Info
Query   0    Copying to tmp table    SELECT DISTINCT(r.nid), r.title, r.teaser, MATCH(r.body, r.title) AGAINST ('lorem ipsum ...
Query   0    FULLTEXT initialization SELECT DISTINCT(r.nid), r.title, r.teaser, MATCH(r.body, r.title) AGAINST ('lorem ipsum ...

Note that these queries go through both a FULLTEXT initialization stage and a stage where they copy to a temporary table.

On this specific server (dual quad-core Xeons, 8GB of memory), these queries do not take long, and hence are not a performance bottleneck. However, this may not be the case on less powerful servers.

Comments

Ah, nice. Will this change

Ah, nice. Will this change be made to the Similar Entries module? I find the module to be much more useful and straightforward than other similar modules (like Related Links).

Reduced functionality

Actually, the change I made causes some reduced functionality (inability to filter by terms).

Because this only affects sites with a large number of terms, I did not bother creating a patch for it, since most sites may not run into this problem.
--
2bits -- Drupal consulting

block cache

I didn't have the same problem as you (my blocks actually completed running...) but I did make sure to enable block caching for these since they are obviously heavy on queries. No reason to recalculate them for every single page view - just implement the caching at the "page" level and you can be sure that you are reducing lots of load.