From The Archives

Post-processing search results without running SQL queries in loops

created August 29, 2009 at 9:41 pm

Recently at work we were presented with an interesting problem, one of many that came up from using the Google Search Appliance to index and serve search results on a large site (I will probably touch on some of the others in future blog posts). The problem is this:

We have a site containing many content items, all indexed by the Google Search Appliance. We have to manually feed XML to the GSA, since we have very specific business requirements and letting Google crawl the site will not work. This means that when an item is updated, we need to tell the GSA to reindex it with the correct information. For most changes this is fine, but queueing an item for reindexing every time a user rates it would significantly slow down the site and would not be the best solution. The rating does not have to be 100% up to the minute, but if an item is indexed today and then rated 100 times without being reindexed, the rating in the index could be very far off. This meant we needed a way to keep each item's rating accurate in the search results.

The solution I came up with (with the help of my coworker, Joseph) was to cache an array of (itemId => rating). The first idea was to cache everything at once, but since the number of items is constantly growing, there is a chance this would exceed Memcached's 1MB limit per cached item. So we decided to split the cache into groups: some items go in the first array, some in the second, and so on.

One thing to keep in mind here is that we already store a de-normalized rating value on each item, so we do not have to average huge numbers of individual votes every time we need an item's rating. Calculating those averages on the fly would almost certainly break this approach.
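The article does not show how that de-normalized value is maintained, but the usual trick is an incremental average: keep a running vote count alongside the stored average, and fold each new vote in with O(1) arithmetic instead of re-averaging every vote row. A minimal sketch of that idea (the function name and the update-on-vote flow are assumptions, not from the original code):

```php
<?php
// Incremental average: fold one new vote into a stored average without
// touching the individual vote rows. $numRatings is the count of votes
// already reflected in $currentAvg.
function updateAverage($currentAvg, $numRatings, $newVote)
{
    return (($currentAvg * $numRatings) + $newVote) / ($numRatings + 1);
}

// e.g. an item averaging 4.0 across 3 votes receives a 5-star vote:
echo updateAverage(4.0, 3, 5); // 4.25
```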

The final code looks something like this:

<?php
public function getRating($itemId)
{
    $numGroups = 5;
    $mod = $itemId % $numGroups;
    $cacheKey = __METHOD__ . '_' . $mod;

    if (($ratings = Core_Environment::getCache()->get($cacheKey)) === false) {
        $query = '/* ' . __METHOD__ . ' */'
               . ' SELECT i.id id, i.rating rating'
               . ' FROM item i'
               . ' WHERE MOD(i.id, ' . $numGroups . ') = :mod';
        $sth = $this->_db->prepare($query);
        $sth->bindValue(':mod', $mod);
        $sth->execute();
        $records = $sth->fetchAllAssoc();

        // format the array as itemId => rating
        $ratings = array();
        foreach ($records as $record) {
            $ratings[$record['id']] = $record['rating'];
        }

        Core_Environment::getCache()->put($cacheKey, $ratings, 3600);
    }

    // guard against items added after the cache group was built
    return isset($ratings[$itemId]) ? $ratings[$itemId] : null;
}

I think this is pretty self-explanatory, but I will give a brief overview. I am using the modulo operation to handle the cache grouping. With the number of groups set to 5, the remainder will always be a number between 0 and 4, which lets us easily split the items into those 5 groups. It also means that if there are 20,000 items in the system, each group will contain about 4,000 items (assuming the ids are sequential). We can then call the getRating() method for each item returned in the search results, and chances are very good that they will all hit the cache.
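The grouping behavior can be demonstrated on its own with a few lines of PHP (the ids below are made up for illustration). Each id is bucketed by id % $numGroups, so a given id always lands in the same cache group:

```php
<?php
// Bucket item ids into cache groups by their remainder mod $numGroups.
$numGroups = 5;
$itemIds = array(1, 2, 7, 10, 12, 101);

$groups = array();
foreach ($itemIds as $id) {
    $groups[$id % $numGroups][] = $id;
}

print_r($groups);
// Group 2 holds ids 2, 7, and 12, since each leaves remainder 2 when
// divided by 5; ids 1 and 101 both land in group 1.
```

Because the mapping is deterministic, repeated lookups for the same id always read (and refresh) the same cache key.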

I know this might not always be necessary: 1MB = 1,048,576 bytes, and even at a generous 15 bytes per item, a single array could hold almost 70,000 item ratings. But smaller arrays are easier and faster to work with, and it is better to be safe than sorry. I would also recommend storing the number of groups and the cache expiration time in a configuration file somewhere so you do not need a code change to update them.
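That configuration could be as simple as a PHP file returning an array. This is a hypothetical sketch, with made-up key names, of what such a file might look like:

```php
<?php
// config.php -- illustrative names, not from the original codebase.
return array(
    'rating_cache' => array(
        'num_groups' => 5,    // modulo divisor used for cache grouping
        'ttl'        => 3600, // cache expiration, in seconds
    ),
);
```

One caveat worth noting: if you later change num_groups, existing cache keys (built from the old modulo value) become stale, so plan to flush or let them expire after such a change.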