Wednesday, April 22, 2009

Cake PHP Set Class - overlay function

The CakePHP Set Class is a wonderful collection of functions that allow you to handle and process CakePHP arrays quickly and efficiently. It has features that deal with stripping out data, changing arrays to objects, and custom merge functionality.

Recently however I had a problem that was bothering me, and I thought some sort of set functionality would do the job. (Please tell me if this functionality has already been written).

Here's the problem:

I have a table that I download periodically and I want to update it into my database. The problem is that I have custom ID's for each row of my current table that are indexed throughout the rest of the database. I've found that I need to update existing records, delete any records that are not in my downloaded table, and I need to add new records that don't exist in my current table.

Here's an example:

On a regular basis, you get a list of 5000 products that your online store promotes from your affiliate. Once a week you need to download all of the products and update your products table. A standard drop and re-insert will not work, because you use custom id's in your products table that are referenced in your customers_products table and others. How do you update these 5000 records?

The answer is to think of it in terms of sets...

In the diagram, Set A is my current table that I want to update. Set B is the new table. Everything that is in Set A and not in Set B, I want to delete (Yellow). Everything in Set B not in Set A, I want to insert (Dark Green). And Everything that exists in both I want to update (Light Green).

Breaking the problem into a sets will allow you to realize what you need to do. One way of looking at a possible solution is to find what you need. From the diagram, all you need is everything in green (light or dark). In our product example all you need is a set of new products while maintaining the light green Primary Keys. (the dark green will generate new id's).

Implementation:

function updateProducts() {
$products= $this->Product->query("select distinct(prodid) as `match_id` from import_products as `Product`");
// Make the products array look like cake arrays...
foreach($products as $i => $product) {
foreach($product as $model=>$this_data) {
unset($products[$i][$model]);
$products[$i]['Product'] = $this_data;
}
}
$current_products = $this->Product->find('all',array('recursive'=>-1));
$res = $this->overlay($current_products, $products, "{n}.product.match_id");
$this->Product->query("truncate table products");
$this->Product->saveAll($res);
}

// function overlay:
// This function returns the $to array with merged $from elements.
// $from - array[0..x] of cake based array structure
// $to - array[0..x] of cake based array structure
// Path field for the overlay using set::extract's strucure
// eg. $path = "{n}.TableName.fieldname"
// * the Path must exist in each element in the $to and $from array
function overlay($from, $to, $path) {
$from_ary = Set::extract($from, $path);
$to_ary = Set::extract($to, $path);
if (count($to_ary) != count($to)) return $to;
if (count($from_ary) != count($from)) return $to;
$flipped_from = array_flip($from_ary);
$cnt = 0;
$results = array();
foreach($to as $key => $val) {
if (!empty($flipped_from[$to_ary[$cnt]])) {
$fkey = $flipped_from[$to_ary[$cnt]];
$results[$cnt] = Set::merge($from[$fkey], $to[$cnt]);
} else {
$results[$cnt] = $to[$cnt];
}
$cnt++;
}
return $results;
}

What does all this do?

Put simply, I gather my SET B (the new updated products that I downloaded), and then my SET B (my current set of products). I then call my function overlay(). Overlay simply returns SET B with elements from SET A merged with it based on the Path field. What is nice about this concept is that it allows you to simply pass the results overlay() directly to a cake saveAll() function.

I would not suggest using this code for giant tables. I did it on a zip code table with 42000 records, and the saveAll took a few minutes to rebuild the table. This may not be the most efficient solution, but it's clean and useful for "once-in-a-while" updates on tables.

No comments: