craig campbell

yesterday at 2:27 am

A while back I wrote about a way to keep data valid across multiple pages by invalidating cache keys. I have since thought of a better way to handle this.

The basic idea is that anytime you have child objects related to a parent object you can use a property of the parent object to invalidate the children. To make this more clear let's revisit the original problem: You are building a blog with a lot of users where each user posts a bunch of articles. The articles are paginated and cached per page. If the user adds a new article, all the items on the page will shift and we need a way to invalidate all the pages in cache.

The approach mentioned in my previous article involved having a namespace key that would store a random integer which would then be incremented when there is an update. This has a few problems:

It is never 100% safe to rely on random numbers.
If there is a cache failure on one server and the namespace key falls out of cache then there is a chance that the random number generated will lead to a key containing data that was in cache a while ago. Also on that same note incrementing the version could still contain stale data in cache.
It creates unnecessary cache lookups.
The namespace lookups have to happen before you can perform the initial lookup. Cache is fast, but like anything else it should only be used when necessary.
It can cause cache contention issues.
If many people are all trying to access the same key at the same time then the requests start to bubble up and the last one has to wait for the others to finish. This is more prevalent in this situation because if there are 10 pages of articles, each page would have its own unique cache key, but all 10 pages would share the same namespace key.

Enough with the problems, let's take a look at the solution:

class User
{
    protected $_lastArticleUpdate;

    public function getArticles($page, $pageSize)
    {
        $cacheKey = __METHOD__ . $page . $pageSize . $this->_lastArticleUpdate;

        if (($articles = Cache::get($cacheKey)) !== false) {
            return $articles;
        }

        // cache miss, retrieve articles from database
    }
}

This is overly simplified, but you can see how much cleaner and simpler this approach is. The lastArticleUpdate property contains a timestamp. This timestamp would get updated everytime the user adds/removes/updates an article. Since the user object is already required in order to get a page of articles in this example, you already have this property accessible.

December 28, 2009 at 12:57 am

At work I have been working with memcache quite a bit and coming up with creative ways to cache stuff. Perhaps I will elaborate on some of these in later posts, but for now I think I will focus on one interesting issue I ran into using multiget.

Using PECL Memcache extension the Memcache::get() method allows you to pass in a single key as a string or an array of keys to retrieve from cache. This can be very useful for retrieving lots of information with one request to memcache, however, there are a couple things to keep in mind. The first is something minor that I found annoying.

Let's say you request array('item_1', 'item_2') from cache. If only item 1 is in cache then the resulting array will be something like this: array('item_1' => 'this is the value for item 1'). This means you have to actually check if a key exists in the resulting array or not in order to know how to proceed.

The second thing had me banging my head against the wall for a short while since it was not possible to reproduce on a development environment. If you request multiple items from cache in one multiget request and the items requested are distributed over more than one server then they will not be returned in the same order that they were requested in.

While this seems to be the expected behavior, I couldn't find any information or documentation about this in the PHP documentation for memcache.

After a little while I thought of a workaround to this problem. The code is as follows:

<?php /** * gets multiple keys from cache and returns the values in the same order * they were requested in * * @param array $keys * @param bool $preserveOrder * @return array */ public function getMulti(array $keys, $preserveOrder = true) { // if we don't care what order the keys are returned in if (!$preserveOrder) { return $this->get($keys); } // gather the items as usual $items = $this->get($keys); // set all keys to null so if something is not found in cache it won't // be set to a value $results = array_fill_keys($keys, null); // merge the two arrays so the returned values from cache overwrite // the starting values return array_merge($results, $items); }

This pretty much solves both of the problems mentioned above. By using array_fill_keys we are able to guarantee that every item will be returned in the resulting array. By using array_merge any of the items actually returned will overwrite the values in the original while preserving the same order

Please note this code is part of a Memcache wrapper class.

August 29, 2009 at 9:41 pm

Recently at work we were presented with an interesting problem, one of many problems that came up from using the Google Search Appliance to index and handle search results on a large site (I will probably touch on some of the others in future blog posts). The problem is this:

We have a site containing many content items that are all indexed by the Google Search Appliance. We have to manually feed XML to the GSA since we have very specific business requirements and allowing Google to crawl our site will not work. This means that when there are updates to items we need to let the GSA know to reindex this item with the correct information. For most things this is fine, but for ratings if we have to queue an item up each time a user rates an item that would significantly slow down the site and prove to not be the best solution. The rating does not have to be 100% up to the minute, but if something is indexed today and then rated 100 times without being reindexed, the rating in the index could be very far off. This means we needed a way to keep the item's ratings accurate in the search results.

The solution I came up with (with the help of my coworker, Joseph) was to cache an array of (itemId => rating). The first idea was to cache everything at once, but considering the number of items is constantly growing there is a chance that this would exceed Memcached's 1MB limit per item in cache. So we decided to split the cache up into groups so some items would be in the first array, some in the second, etc.

One thing to keep in mind here is that we are already storing a de-normalized rating value so we do not have to calculate averages over huge amounts of data when we need to select an item's rating. That would almost certainly break this approach.

The final code looks something like this:

<?php public function getRating($itemId) { $numGroups = 5; $mod = $itemId % $numGroups; $cacheKey = __METHOD__ . '_' . $mod; if (($ratings = Core_Environment::getCache()->get($cacheKey)) === false) { $query = '/* ' . __METHOD_ . ' */' . "SELECT i.id id, i.rating rating FROM item i WHERE MOD(i.id," . $numGroups . ") = :mod"; $sth = $this->_db->prepare($query); $sth->bindValue(':mod', $mod); $sth->execute(); $records = $sth->fetchAllAssoc(); // format the array $ratings = array(); foreach ($records as $record) { $ratings[$record['id']] = $record['rating']; } Core_Environment::getCache()->put( $cacheKey, $ratings, 3600); } return $ratings[$itemId]; }

I think this is pretty self explanatory, but I will give a brief overview. Basically I am using the Modulo Operation to handle the cache grouping. With the number of groups set to 5, the remainder will always be a number between 0 and 4. This allows us to easily group the items into those 5 categories. This also means if there are 20,000 items in the system each group would contain 4,000 items (assuming the ids are all sequential). This allows us to call the getRating() method for each item returned in the search results, and chances are pretty good that they will all hit cache.

I know this might not always be necessary considering 1MB = 1,048,576 bytes and even if you figure as high as 15 bytes per item that could hold almost 70,000 item ratings in one array, but smaller arrays are easier/faster to work with, and it is better to be safe than sorry. I would also recommend storing the number of groups and cache expiration times in a configuration file somewhere so you don't need to make code changes to update them.

May 27, 2009 at 12:02 am

UPDATED 09 MAR 2010 SEE THIS POST.

I have been so incredibly busy lately that I have not had anytime to write here. Anyway I just wanted to take some time to talk about something cool I have been doing for a project I have been working on outside of work.

By now everyone knows what Memcached is so I won't bore you with an introduction. Basically it is an insanely simple caching system designed to eliminate load on the database.

While in an ideal world you would just throw cache on everything and watch your application scale to infinity and beyond, things aren't always that simple. The problem I'm going to show you today has to do with pagination and cached database queries. I will try to make this pretty straightforward so anyone can follow.

There is a user named Pablo Picasso who is using your site. He makes Cap-Sacs for a living and has received 15 messages from people on the site who are interested in purchasing one from him. This means that page 1 displays messages 6 to 15 (the 10 most recent messages) and page 2 displays 1 to 5.

Let's now imagine you cache these pages of results. A new user comes to your site and wants to buy a pink Cap-Sac. They send Picasso a message so your application says, "Oooh Picasso just got a new message. That's pretty rad. Since the query to get his messages is Cached let's just drop the cache for page 1." Now what happens you ask?

The new message shows up on the top of page 1 as expected so what's the problem?? Picasso now has 16 messages. Page 1 will display messages 7 to 16 which is fine, but since page 2 was cached from before it is going to show messages 1 to 5 still. This means that as long as the cache lives poor message number 6 (the last message on page 1 before the new message was recieved) gets dropped from the result set completely. Not to mention that page 1 thinks Picasso has 16 messages while page 2 still thinks he has 15.

This can also allow duplicates of the same item to show up if for example the user had 22 messages and page 1 and 2 were cached but page 3 wasn't. In this case a new message would cause page 3 and page 1 to be correct but since page 2 is still hitting cache, the last item on page 2 would be the same as the first item on page 3.

So what can we do about this? I see a few options

  1. Don't use memcached for paginated result sets
  2. Live with missing and/or duplicate items from time to time in your pagination
  3. Select everything at once and handle the pagination at the PHP level instead of in the database
  4. Use Pseudo cache namespaces to invalidate all cache keys for paginated queries

And here is my analysis of the options:

Option 1
This seems to be the approach that is often taken where I work, and of course it will work, but what happens if the data rarely changes? In that case it seems rather silly to continue to hit the database over and over for the same data.
Option 2
Surprisingly I don't think option 2 is a terrible option because it will definitely scale pretty well. Facebook search results have duplicate items all over most likely for this very reason, but since, in my example, I am thinking about displaying accurate data to the user I would stay away from this. For search results I think this is a much more viable option.
Option 3
This option is another interesting one. It would arguably scale better than any of the other options I have proposed although the problem here is what if Lady Gaga is using your site and she has 15,000 messages from fans. We don't want to have to select 15,000 items from the database if this were to miss cache.
Option 4
Well considering the title of this post I think you all knew this is the road I was going to go down. You get the performance benefits of caching, and you get 100 percent data accuracy.

The approach here is really quite simple and similar to the example found on the memcached FAQ page.

What I have chosen to do is use the DAO name as my main namespace identifier along with the unique userId. Please understand this is a very crude implementation just to illustrate my idea. It is more or less procedural code so I would not recommend doing it like this. I have been using a Cache_Namespace class to handle common logic such as generating the key and grabbing/setting the generation version/incrementing the version. Anyway here goes:

1 <?php 2 3 /** 4 * this would get called to retrieve a page of a user's messages 5 */ 6 public function getMessages($userId, $pageNumber) 7 { 8 $memcache = new Memcache(); 9 10 // this is the cache key to use to find the version to use 11 $versionKey = 'Message_Dao_' . $userId . '_version'; 12 13 // if there is no cache hit for the version set it to a random value 14 // and set it to never expire 15 if (($version = $memcache->get($versionKey)) === false) { 16 17 // we should use a random number here because otherwise there 18 // is a chance the version key will expire and resetting it to 1 19 // means incrementing it might not invalidate all cache keys 20 // if the query has been cached at version 2 for example 21 $version = rand(1, 9999); 22 $memcache->put($versionKey, $version, 0); 23 } 24 25 // this will be the key we use to actually retrieve the data 26 $queryCacheKey = __METHOD__ . '_' . $userId . '_' . $pageNumber . 27 '_' . $version; 28 29 // if there is no cache hit for this data then we need to run the query 30 if (($records = $memcache->get($queryCacheKey)) === false) { 31 32 // run query here SELECT * FROM blah blah blah blah 33 // grab the result set 34 $records = $sth->fetchAssoc(); 35 36 // store the results in cache 37 $memcache->put($queryCacheKey, $records); 38 } 39 40 return $records; 41 } 42 43 /** 44 * this would get called when a user sends another user a message 45 */ 46 public function sendMessage($message, $toId, $fromId) 47 { 48 // run insert INSERT INTO message blah blah 49 // after adding the message to the database we need to invalidate 50 // all cache keys for the person you are sending it to 51 $versionKey = 'Message_Dao_' . $toId . '_version'; 52 53 $memcache = new Memcache(); 54 55 // we call add here because increment won't increment a value that 56 // does not exist. add() will only set a value if it doesn't exist 57 // otherwise it will return false 58 $memcache->add($versionKey, rand(1, 9999)); 59 $memcache->increment($versionKey); 60 } 61

I tried to comment that code pretty well. As you can see the version is used as part of the query cache key so incrementing its value by 1 will make all calls to that method not hit cache. If you have more than one method you want to invalidate you can give them the same namespace key as seen in line 11.

Okay that is all for now.