Why we need another data processing framework

Background

I have had a lot of data processing work to do recently.

Yes, a VERY large amount of data processing work.

I have written a processing framework based on Rhino and Spring, called Jersey, which means "JavaScript with easy".

It is fun to do data processing with Jersey, but it has 2 shortcomings:

  1. The startup time for Jersey is too long. It needs about 2 seconds to start up the context. Sure, it has to start the Java virtual machine, initialise the Rhino runtime and then start the Spring container, so 2 seconds is not so bad. But it is nearly unbearable when I just want to play around with something. Yes, Java is stable, but in a run-and-exit scheme IT IS REALLY SLOW. Why? There is always a lot of bootstrapping. I know that is for flexibility, but it is really slow, man!
  2. The memory footprint of Jersey is too large. The JVM always wants more memory. I can write a Python crawler, run it with a thread group of about 10 threads, and it still consumes less memory than the JVM uses for HelloWorld. This is very bad, since I need to run as many instances of my crawler as possible.

So, I moved to Python (2.7) for small tasks (and even for some bigger ones).

Python is a little better and faster, but compared to Jersey, it lacks:

  1. Better Unicode support: This is fundamental!!!! I don’t get why the Python community ignored this at the very beginning. I can’t open a CSV file properly without using a third-party library.
  2. A fast MySQL driver: I tried pymysql (I didn’t have time to try others) and found it a little slow. I’ll explain this in another blog post.
  3. Libraries: Sure, Python is a good language, and many people use it to do serious things. But compared to Java, the libraries are still not enough, at least for my data processing work.
  4. Consistency for me: I’m now working on a PHP framework for building websites (and a CMS based on it), so why should I write the data processing tool in Python rather than PHP, when I could reuse the libraries I wrote for PHP?

So, I gave up on Python for processing data.

And decided to give PHP a try.

Some thoughts on a data processing framework

After reading the section above, you know why I’m using PHP as the language of my data processing framework (I’ll keep Jersey working though. 🙂 )

And here are some thoughts on what a data processing framework should do (at least for me):

It should connect to most of the popular data sources

This is the fundamental part of the framework.

No matter how good your framework is, it is useless if it can’t even connect to MySQL or Postgres.

And nowadays, it should also have mature libraries or drivers to connect to NoSQL data stores (like Solr, MongoDB etc.), and make the data transfer fast and safe.

It should be based on a scripting language

This is the same as Jersey. For a data processing framework, testing and adjusting might happen on the live server (or the crawler master). This is the reason that I hate Hadoop… Why do I need to recompile, package and redeploy the code just to change a tiny bit of the crawler (only to run a small test)? Hadoop’s HDFS is good though.

It should be able to run across platforms

This is the same as Jersey. That’s why Jersey is based on Java…. Luckily, most scripting languages can run on all the major platforms we use today.

It should be very easy to extend and configure

It should be a framework that contains lots of goodies, and everything from the foundation to the libraries should be very flexible to change or override.

So, no matter how complex the requirement is, there is always a better way to base the program on the framework (Eclipse is a good example).

It should run very fast and have a very small memory footprint

This is the same point as in the Background section: if the processing is simple, you should be able to run it and get the result instantly.

It should have progress bar support by default

I don’t think I need to explain this.

It should embed a fast rule engine

It is very important to embed a fast rule engine into the data processing framework.
Let’s look at the basic workflow for data processing:

  1. Load the data from the data source
  2. Transform the data into a common structural format (most data processing tools use XML)
  3. Process the data
  4. Transform the data into the destination format
  5. Store the data in the data destination
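
The five steps above can be sketched in PHP as a chain of callables. This is only a minimal illustration and not the framework’s API: `runPipeline` and every stage passed to it are hypothetical names, and in-memory arrays stand in for the real data source and destination.

```php
<?php
/**
 * Push every record through the five-step flow.
 * All names are hypothetical; nothing here comes from a real framework.
 */
function runPipeline(callable $load, callable $toCommon, callable $process,
                     callable $toDestination, callable $store)
{
    $count = 0;
    foreach ($load() as $raw) {              // 1. load from the data source
        $common    = $toCommon($raw);        // 2. transform to the common format
        $processed = $process($common);      // 3. process
        $out       = $toDestination($processed); // 4. transform to destination format
        $store($out);                        // 5. store
        $count++;
    }
    return $count;
}

// Usage: two JSON strings in, two upper-cased CSV-like lines out.
$sink = array();
$count = runPipeline(
    function () { return array('{"nick":"Jack"}', '{"nick":"Rose"}'); },
    function ($raw) { return json_decode($raw, true); },
    function ($rec) { $rec['nick'] = strtoupper($rec['nick']); return $rec; },
    function ($rec) { return implode(',', $rec); },
    function ($line) use (&$sink) { $sink[] = $line; }
);
```

Because each stage is a separate callable, steps 2 to 4 are the places where a rule engine could be plugged in later.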

For step 1, you need the ability to connect (this has nothing to do with the rule engine).
For step 2, the best transform method is rule-based, since it is more readable and extensible. I’ll show you a real-world example here.

Let’s suppose you have a small task: to collect the user information gathered via OAuth on 2 different platforms (Twitter and Facebook, for example).

Platform 1 (p1) uses this data format (in JSON):

{
    "nick": "Jack",
    "profile_image": "a.jpg",
    "birthday": "someday"
}

And Platform 2 (p2) uses this data format:

{
    "screen_name": "Jack",
    "img": "b.jpg",
    "birthday": "someday"
}

There are lots of records (about 100,000 each). You need to transform them into a standard form:

<user>
    <nick>Jack</nick>
    <profile_img>a.jpg</profile_img>
    <birthday>someday</birthday>
</user>

Let’s do this with PHP and with some fake code. The first version uses PHP:

function processP1($arg) {
    $ret = array();
    if(isset($arg->nick)) {
        $ret['nick'] = $arg->nick;
    }
    if(isset($arg->profile_image)) {
        $ret['profile_img'] = $arg->profile_image;
    }
    if(isset($arg->birthday)) {
        $ret['birthday'] = $arg->birthday;
    }
    return (object) $ret;
}

function processP2($arg) {
    $ret = array();
    if(isset($arg->screen_name)) {
        $ret['nick'] = $arg->screen_name;
    }
    if(isset($arg->img)) {
        $ret['profile_img'] = $arg->img;
    }
    if(isset($arg->birthday)) {
        $ret['birthday'] = $arg->birthday;
    }
    return (object) $ret;
}

The second is CLIPS code:

(defrule set-result-nick-from-nick
    ?a <- (arg nick ?nick&~nil)
    ?r <- (result (nick nil))
    =>
    (retract ?a)
    (modify ?r (nick ?nick))
)

(defrule set-result-nick-from-screen-name
    ?a <- (arg screen_name ?nick&~nil)
    ?r <- (result (nick nil))
    =>
    (retract ?a)
    (modify ?r (nick ?nick))
)

(defrule set-result-profile-img-from-profile-image
    ?a <- (arg profile_image ?img&~nil)
    ?r <- (result (profile_img nil))
    =>
    (retract ?a)
    (modify ?r (profile_img ?img))
)

(defrule set-result-profile-img-from-img
    ?a <- (arg img ?img&~nil)
    ?r <- (result (profile_img nil))
    =>
    (retract ?a)
    (modify ?r (profile_img ?img))
)

(defrule set-result-birthday-from-birthday
    ?a <- (arg birthday ?birthday&~nil)
    ?r <- (result (birthday nil))
    =>
    (retract ?a)
    (modify ?r (birthday ?birthday))
)

Someone may argue that the first version can be written as one method, like the second one.

But the world keeps changing: if p1 changes its protocol (say, renames profile_image to img), you will regret having jammed them together.

As you can see from the code above, the second version is more concise and, better still, makes no assumptions about p1 or p2.

So, if some day you need to process the information of a platform called p3, you won’t need to change your code very much (just add the missing rules, and if you are lucky, you may not need to add any rule at all, since the fields of a user profile are mostly the same).
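
To make the point concrete without CLIPS, the same idea can be approximated in plain PHP with a lookup table of field aliases. This is a swapped-in illustration and not how the rule engine works internally; `$fieldRules` and `normalizeUser` are hypothetical names.

```php
<?php
// Each target field lists the source field names that may supply it.
// Supporting a new platform means adding aliases here, not new functions.
$fieldRules = array(
    'nick'        => array('nick', 'screen_name'),
    'profile_img' => array('profile_image', 'img'),
    'birthday'    => array('birthday'),
);

/**
 * Build a normalized user record from any platform's decoded JSON object.
 */
function normalizeUser($arg, array $fieldRules)
{
    $ret = array();
    foreach ($fieldRules as $target => $aliases) {
        foreach ($aliases as $source) {
            if (isset($arg->$source)) {
                $ret[$target] = $arg->$source;
                break; // the first matching alias wins
            }
        }
    }
    return (object) $ret;
}

$p1 = json_decode('{"nick": "Jack", "profile_image": "a.jpg", "birthday": "someday"}');
$p2 = json_decode('{"screen_name": "Jack", "img": "b.jpg", "birthday": "someday"}');

$u1 = normalizeUser($p1, $fieldRules);
$u2 = normalizeUser($p2, $fieldRules);
```

Like the rules, the table knows nothing about p1 or p2. If p1 renames profile_image to img, the existing alias already covers it.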

Steps 3 and 4 are the same as step 2. A rule engine runs faster and better when you would otherwise need to write lots of if..then..else.

And the rules are very easy to read and maintain.

CLIPS and PHP

For my PHP website framework, I chose CLIPS to do the rule processing, and not only for the business logic.

I use it as the foundation of the framework. Maybe you are curious about the design: why would I use a rule engine as the foundation of a framework?

Here are two examples.

  1. The core rules for loading configuration: Where to load the configuration from is quite tricky. If a framework is flexible, it can load configuration from lots of places, and where to look is configurable too.
  2. The core rules for loading PHP scripts: This is the most fundamental part of every PHP framework. If you think this should be very easy, take CI’s CI_Loader as an example and try to read it to understand the routine, and if you dare, try to add one more rule. 😀
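
The first point can be illustrated with a small sketch in plain PHP: an ordered list of candidate directories is searched and the first hit wins. The function name `findConfig`, the file name and the directories are hypothetical and only stand in for whatever the framework actually uses. Every extra place to look is one more branch in code like this, which is the kind of logic that rules express more naturally.

```php
<?php
/**
 * Return the first existing path for $file among $dirs, or null if none exists.
 * Hypothetical helper, for illustration only.
 */
function findConfig($file, array $dirs)
{
    foreach ($dirs as $dir) {
        $path = rtrim($dir, '/') . '/' . $file;
        if (is_file($path)) {
            return $path;
        }
    }
    return null;
}

// Usage: build a throwaway layout where only the "system" directory has the file.
$base = sys_get_temp_dir() . '/cfg_demo_' . uniqid();
mkdir($base . '/project', 0777, true);
mkdir($base . '/system', 0777, true);
file_put_contents($base . '/system/app.json', '{}');

// The project directory is searched first, then the system-wide one.
$found = findConfig('app.json', array($base . '/project', $base . '/system'));
```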

So, I first wrote an extension for PHP, called php-clips. It is nearly stable now (it can be compiled and installed using PHP’s build tools).

And I’m trying to write a PHP framework to implement the thoughts above. This framework has the following features:

  1. It embeds CLIPS as its core
  2. It can be run from the command line as an application
  3. Just like Jersey, it loads the classes and extensions from the working directory, or from any configured directory (this can be configured by the system-wide configuration in /etc/… or found from an environment variable, and sure, this is configurable too 😉 )
  4. You can use the CLIPS engine anytime, and even open a console (using PHP’s readline) to run CLIPS commands manually yourself
  5. It can run CLIPS scripts directly, so if you want, you don’t need to write a single line of PHP
  6. You can replace any fundamental part of the framework just by overriding it (no need to replace the script; just like CI, you can have MY_XXX replace the original class, any class, and yes, this is configurable too. 😉 )
  7. It is written following CI’s guidelines, so you’ll find that the API and even the folder structure are like CI’s, but with the rule engine CLIPS as its core
  8. It uses mustache as the template engine, simple and fast
  9. It has a resource scheme and handler design just like Spring’s, and you can write your own handlers using PHP’s resource scheme and handler design too
  10. It will use Console-ProgressBar based on Curses to show the progress bar (the same progress bar that PEAR uses)
  11. It will be distributed using PEAR
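
Point 6 follows the CodeIgniter convention of a prefixed subclass overriding a core class. A minimal sketch of that lookup, assuming a hypothetical `resolveClass` helper and made-up `Loader` and `Router` classes (the real framework resolves this through its rules, which are not shown here):

```php
<?php
// Made-up core classes and one user override, for illustration only.
class Loader { public function name() { return 'core loader'; } }
class Router { public function name() { return 'core router'; } }
class MY_Loader extends Loader { public function name() { return 'custom loader'; } }

/**
 * Return the overriding class name if one with the given prefix exists,
 * otherwise the original class name. The prefix is a parameter because
 * it is meant to be configurable.
 */
function resolveClass($name, $prefix = 'MY_')
{
    $override = $prefix . $name;
    return class_exists($override) ? $override : $name;
}

$loaderClass = resolveClass('Loader'); // an override exists
$routerClass = resolveClass('Router'); // no override, falls back to the core class
$loader = new $loaderClass();
```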

This little toy can be found at clips-tool, and it is functional now.

It is still in development, so it really lacks documentation. I’ll improve the documentation when the current data processing work is done.