For a project I have been working on recently (fetching data and analysing it), I set up a cluster of crawlers and analysers.
First, many crawlers must be spawned and configured: as many as it takes to make full use of each server.
Since the crawler can configure itself through ZooKeeper, the configuration part needs no further consideration.
The hard part of this architecture is that you have to launch many crawler instances by hand, and do it again whenever you want to reconfigure them. The data processing rules, for example, are loaded at the very beginning and are never reloaded while the crawler runs, partly because the rules don't change that often and partly because some of them need to be compiled.
So I wrote a script to launch and kill the crawlers. Here are my thoughts on how to implement it and what functions it should have:
- 1 It must be executable, with almost no dependencies, on most Linux distributions
- 2 It must be easy to deploy on many machines, or be able to publish itself
This is a fundamental function of the launching script. I am already tired of deploying the crawler over many machines, and having to deploy the script as well would only add to the burden.
The crucial question is how to publish the script itself to many machines, so that you can spawn many processes on each machine and keep managing them.
To publish the script to other machines we can use scp, a powerful, easy and safe way to copy files from machine to machine. With a little configuration (for example, SSH keys), you can copy files from the master to many slaves without any interaction.
I can write a script to handle that, so that once the script is deployed on the master machine it is deployed automatically to the slave machines.
For this I found taktuk, a very good tool for managing many slave machines (say, installing Java or Jersey on them). It is written in Perl and is more powerful for publishing, but I won't go into it deeply in this article, since it cannot spawn many processes and manage them (though you can certainly use it as the publishing layer).
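The scp-based publishing step could look roughly like the sketch below. It is not my actual script: the names `publish_script`, `REMOTE_DIR` and the hosts-file format (one host or IP per line, `#` for comments) are assumptions made up for illustration, and it assumes key-based SSH login is already set up.

```shell
#!/usr/bin/env bash
# Hypothetical names: publish_script, REMOTE_DIR and the hosts-file layout are illustrative.
SCP="${SCP:-scp}"                   # overridable, e.g. SCP=echo for a dry run
REMOTE_DIR="${REMOTE_DIR:-/tmp}"    # assumed destination directory on each slave

# publish_script FILE HOSTS_FILE
# Copies FILE to every host listed (one per line) in HOSTS_FILE.
# Returns non-zero if at least one copy failed.
publish_script() {
    local file="$1" hosts_file="$2" host failed=0
    while IFS= read -r host || [ -n "$host" ]; do
        # Skip blank lines and comment lines.
        case "$host" in ''|'#'*) continue ;; esac
        # BatchMode makes scp fail instead of prompting for a password;
        # stdin is /dev/null so scp cannot swallow the rest of the host list.
        if ! "$SCP" -q -o BatchMode=yes "$file" "$host:$REMOTE_DIR/" < /dev/null; then
            echo "publish failed: $host" >&2
            failed=$((failed + 1))
        fi
    done < "$hosts_file"
    [ "$failed" -eq 0 ]
}

# Example (hypothetical file names):
#   publish_script ./launcher.sh ./slaves.txt
```

Keeping the slave list in a plain file also covers the requirement that the slave IPs be configurable.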
- 3 It should be thin, light and cheap when spawning processes
I don't think running this script as a server is a good idea (in my opinion, at least). The script just launches the processes with the proper stdin, stdout, stderr and working directory, and exits after that. Keeping it running would only waste resources. It is just a launcher, after all.
- 4 It should be configurable, for example how many processes to launch at one time
As I wrote above, I want the script to launch many crawlers on one machine, so that the crawlers use as much of that machine's resources as they can get. So the script should be able to spawn many crawler processes at one time, each with its own stdout and stderr.
And yes, the slave machines' IP addresses should be configurable too.
- 5 It should launch the processes as daemons
The crawler runs day and night on the server, so I don't want it to live only in my SSH session. The crawlers must run as daemons in the background, unharmed by signals or by the session ending. The script can use daemonize or nohup to achieve this.
- 6 It should provide a function to check the running status of the processes
The crawlers are programs (and there are quite a few of them), so it is no wonder that some of them hit one problem or another and stop working (say, bugs, running out of memory or disk, the Java VM crashing or the kernel crashing).
So I need to check their health. If a crawler on some slave has died, or a slave has been restarted, I will know when I check. I can also write a script that checks for me and sends me an email if something is wrong, so that I can plan to restart the crawlers or do something else.
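A health check along these lines can be sketched with PID files and `kill -0`. The names `check_crawlers` and `RUN_DIR`, and the convention of one `*.pid` file per crawler, are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Hypothetical names: check_crawlers, RUN_DIR and the *.pid convention are illustrative.
RUN_DIR="${RUN_DIR:-./run}"    # assumed directory holding one PID file per crawler

# check_crawlers
# Prints "alive PID FILE" or "dead PID FILE" for each PID file.
# Returns non-zero if at least one crawler is dead.
check_crawlers() {
    local pidfile pid dead=0
    for pidfile in "$RUN_DIR"/*.pid; do
        [ -e "$pidfile" ] || continue    # the glob matched nothing
        pid=$(cat "$pidfile")
        # kill -0 delivers no signal; it only tests whether the process can be signalled.
        if kill -0 "$pid" 2>/dev/null; then
            echo "alive $pid $pidfile"
        else
            echo "dead $pid $pidfile"
            dead=$((dead + 1))
        fi
    done
    [ "$dead" -eq 0 ]
}

# Example: report only when something is wrong.
#   check_crawlers > /tmp/status.txt || cat /tmp/status.txt
```

Because the function returns non-zero when something has died, a scheduled job can decide from the exit status alone whether to send a notification. One caveat: a PID can be reused by an unrelated process after a reboot, so a stricter check would also compare the command name.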
- 7 It should provide a function to kill the running processes
This is very useful for redeploying the script, or redeploying the data analysis rules for all the crawlers. Say I have added another rule to the crawler: I need to publish the rule to all the slave machines (this is easy with taktuk), and then restart all the crawler processes.
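The kill function could be sketched as follows: send SIGTERM first, wait a little, and fall back to SIGKILL only if the crawler is still there. `stop_crawlers`, `RUN_DIR` and `GRACE_SECONDS` are hypothetical names, and the PID-file layout is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical names: stop_crawlers, RUN_DIR and GRACE_SECONDS are illustrative.
RUN_DIR="${RUN_DIR:-./run}"                # assumed directory holding one PID file per crawler
GRACE_SECONDS="${GRACE_SECONDS:-10}"       # how long to wait before forcing a kill

# stop_crawlers
# Stops every process listed in RUN_DIR/*.pid and removes the PID files.
stop_crawlers() {
    local pidfile pid waited
    for pidfile in "$RUN_DIR"/*.pid; do
        [ -e "$pidfile" ] || continue      # the glob matched nothing
        pid=$(cat "$pidfile")
        # SIGTERM first, so the crawler has a chance to shut down cleanly.
        kill "$pid" 2>/dev/null || true
        waited=0
        while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$GRACE_SECONDS" ]; do
            sleep 1
            waited=$((waited + 1))
        done
        # Still running after the grace period: force it.
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null || true
        fi
        rm -f "$pidfile"
    done
}

# Example: stop everything, publish new rules, then launch again.
#   stop_crawlers
```

Stopping the crawlers one after another keeps the sketch simple. With many crawlers per machine it may be worth signalling all of them first and waiting once.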
So, for requirements 1, 2 and 3, I think bash and Perl are the best choices. Since taktuk can handle the publishing and remote execution, I chose bash as the script language.
Thanks to taktuk, I can keep the logic on the master and manage all the servers from there. I just need to redirect stdout to the master and I get every detail of each slave's status.
Maybe you will ask: why bother? If you just need a job manager, why not use Hadoop? Hadoop is very good at executing and managing jobs.
The answer is that Hadoop, or MapReduce, is not a good fit for crawling. Crawling is recursive by nature: it starts from a seed point and finds more and more tasks from there, and you don't know in advance how deep the recursion will go. MapReduce is not good at recursive operations.
I certainly use Hadoop, but only to process the data the crawlers have fetched. For the crawling itself, it is not useful.
To close this article, I would say that running a cluster of crawlers is very difficult, and the crawler's logic has to be quite complex if you don't want it to eat up your precious time. I'll write about how I wrote the crawler in another article.
Thanks to taktuk for that: without it, building my scripting tool could have been even harder than writing the crawler. Analysing massive data is hard, but with better tools your life will at least be easier.