Category Archives: Essay

This is the category for all my essays, the essay can talk about anything, mostly in philosophy and programming.

TakTuk

Thinking on daemon process launching and management

For the project I recently working on (fetching data and analyse them), I setup a crawler and analyser cluster.

First many(as many crawler as to make the max of the server) crawlers must be spawned and configured.

Since the crawler can use zookeeper to configure it self, the configuration part is not need to considered.

The hard part of this architecture is that, you’ll need to launch many crawler instance by hand, and if you want to reconfigure them(for example, the data processing rules, they are loaded at the very beginning, and won’t get reload in the whole processing time, part for rules won’t change that much, and part for the rules need to be compiled).

Sure, I have wrote a script to launch and kill the crawlers, here is the thoughts on how to implement the script and what function should it have:

  • 1 It must be executable and nearly dependless on most of the Linux distributions

It is the same thought as the crawler since I want the crawler can be run on as many machine as possible, so the crawler is based on jersey(The javascript runtime I have written on Rhino, based on Java). As for this though, we have 3 options: bash, perl, Java

  • 2 It must be easy to deploy on many machines, or has the self publish function.

This is a very fundamental function of this launching script. Since I just tired of deploy the crawler over many machines, and deploy the script bring even more burden for me.

And the crucial part of this function is that, how to publish the scripts it self to many machines so that you can spawn many processes on that machine and keep management them.

For publish the script to other machine, we can use scp, a powerful, easy and safe way to copy resources from machine to machine. Just need a little configuration, you can copy the files from master to many slaves without interaction.

I can wrote a script to handle that, so that after I deploy the script on the master machine, I can deploy it automatically to many slave machines

For this function, I find out taktuk, a very good tool to play and use in order to manage many slave machines(say, install java or jersey for them). It use perl as its own language, and has more power on the publish, but I won’t talk about it very deeply in this article, since it don’t have the ability of spawning many processes and manage them(but you can really use it as the publish layer)

  • 3 It should be thin, light and costless for spawning processes

I don’t think run this script as a server is a good idea (at least for my opinion). The script just launching the processes with proper stdin, stderr, stdout, working directory and die after that. I don’t think keep this script running and wasting the resources is a wise idea. It is just a launcher after all .

  • 4 It should be configurable, say to launch how many process at 1 time

As I wrote on the above, I want the script to launch many crawler on 1 machine, so that the crawler will take as many resource on that machine as we can get. So, I want the script can spawn many crawler processes at 1 time, and each has its own stdout and std error.

And, yes, the slave machines’s ip should be configurable for this script.

  • 5 * It should launch the process as daemon*

Since the crawler runs day and night on the server. I don’t want it lives only in my ssh session. So I must make them an OS daemon to run in the background and do not harm by any signals or sessions. The script can use daemonize or nohup to achieve this

  • 6 It should provide the function to check the process running status

Since the crawlers are programs(and the number of them is not few), so no wonder some of the crawler gets this or that problem and stop working(say bugs, lack of memory and disk, java vm crushing or kernel crushing).

So, I need to check for their health, so, if some slave’s crawler has died, or 1 slave is restarted, I can know it when I check, or I can write a script to check that, if there is something wrong, it can send me an email about the problem, so I can get a plan to restart them or doing something else.

  • 7 It should provide the function to kill the processes running

This is very useful for redeploying the scripts, or redeploying the data analyse rules for all the crawlers. Say, if I have add another rule to the crawler, I need to publish the rule to all the slave machine(this can be done easily using taktuk), and I need to restart all the crawler processes.

So, for argument 1, 2 and 3, I think bash and perl is the best choice. And the publishing and remote executing can use taktuk to handle, I choose bash as the script language.

Thanks for taktuk’s ability, so I can use the logic for master and manage all the servers, so I just need to redirect the stdout to the master and I can get every detail of the slave’s status.

May be you will ask: Why bother? If you just need a job manager, Why not use Hadoop? Hadoop is very good at executing and manage jobs.

The answer is that hadoop or map reduce is not fit for crawling. Crawling is something like a recursive tool. Crawling start at a beginning point, and found more and more task from that, you don’t know how many times should that recur, but map reduce is not good at recursive operations.

I surely use hadoop, but just to handle the data that crawler has fetched, but as crawler, it is not useful.

At the end of this article, I would say, to run a cluster of crawler is very difficult, the logic of crawler is very complex if you want crawler won’t cost very much of your precious time. I’ll write how I wrote the crawler in another article.

Thanks for taktuk, without it, the work for my scripting tool can be more harder than writing the crawler. Life is hard for analysing massive data, but with a better tool, at least, your life will be easier.

Let’s Talk About JavaScript (Part 1)

Mammon slept. And the beast reborn spread over the earth and its numbers grew legion. And they proclaimed the times and sacrificed crops unto the fire, with the cunning of foxes. And they built a new world in their own image as promised by the sacred words, and spoke of the beast with their children. Mammon awoke, and lo! it was naught but a follower.

from The Book of Mozilla, 11:9 (10th Edition)

To me, the programming languages can mean many things, they are tools to do specific jobs, they are the grammar and words to explain the problem and how to handle the problems, they are the methodologies to understand and explain the world’s reality to logic and entities.

So, to me, the programming language has 3 parts:

1. Methodology: This part is the fundamental side of the language, the philosophy of the language, the core concept of the language. This part explains why this language exists, what’s the original purpose of this language and how it evolved. This is the metaphysics side of the language, but the most important part to understand the language deeply(at least for my opinion).
2. Words and Grammar: This part is the body of the whole language, how to claim a flow, how to claim a data structure.
3. Fundamental Libraries: Although the first 2 part define what the language is, this part(the fundamental libraries) for the language runtime, is very crucial for the language, it even can affect the success of the language

As this article is the first part of the discussion for JavaScript, so, I’ll talk about the methodology of JavaScript first.

As I said, the methodology of the language is more important part for the language, and how the language is designed, why this language should been created should be discuss at the very begining.

First, the name, JavaScript is a trademark of Oracle Corporation(cited as Wikipedia). It is the common name for ECMAScript v3(for today’s talking), so, when I talking about JavaScript in my blog, I was really talking about ECMAScript (and v3 only), just the standard features and standard build in function and objects.

So, what’s an ECMAScript then?

ECMAScript is an object-oriented programming language for performing computations and manipulating computational objects within a host environment.

And what’s the purpose of the ECMAScript, why we should need it?

ECMAScript was originally designed to be a Web scripting language, providing a mechanism to enliven Web pages in browsers and to perform server computation as part of a Web-based client-server architecture.

As we can see in the ECMAScript’s introduction[http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262,%203rd%20edition,%20December%201999.pdf] and about [http://inventors.about.com/od/jstartinventions/a/JavaScript.htm]. JavaScript was originally inspirited by the idea of Java, that you can run computation and show animations in the web browser(web pages are really dull back to that year).

So, in the design of JavaScript, the author even don’t think the most fundamental part of the programming language, the Standard Input and Standard Output should be necessary.

As stated in the ECMAScript’s standard:

ECMAScript as defined here is not intended to be computationally self-sufficient; indeed, there are no provisions in this specification for input of external data or output of computed results. Instead, it is expected that the computational environment of an ECMAScript program will provide not only the objects and other facilities described in this specification but also certain environment-specific host objects, whose description and behavior are beyond the scope of this specification except to indicate that they may provide certain properties that can be accessed and certain functions that can be called from an ECMAScript program.

So, we can have a conclusion here:

JavaScript (the language standard ECMAScript v3) is a web scripting language, it is interpreted by the host environment (run time), and besides the language grammar, other things that needed is provided by the host environment, and these things are depended by the host environment, not the language itself.

Then we can get the real fact of JavaScript, it is a interpreting embed language for running the computation tasks by the host environment.

And why I think JavaScript is a good(in fact very good) interpreting embed language, let’s see JavaScript’s philosophy(I try to make it clear, but I’m not teaching JavaScript here, if you want to know the details of JavaScript programming, you can find it on W3CSchool).

1. JavaScript is a Object oriented language

Everything in JavaScript is an object, the object itself is something like a hash table(so javascript don’t need bother to have the data structure for hash tables). Object can have a function as its constructor, and if you are clever enough, you can make objects has parent object (even parent constructor). Because object are somewhat hash table, you can’t have really private in the object, everything you have in the object can be see by anything that can get hold of the object, and JavaScript even let you to iterate the property names in the object.

Since everything is an object, the primitives and functions are object too, but the primitives(numbers for example) can’t really perform as a hash table, though you can setting properties to them, they won’t save any of the property at all.

This purity offers JavaScript the simplicity, if you don’t agree, see what a mess Java has become.

2. JavaScript is a prototype based language

JavaScript is not the first one prototype based language, but I think it is the most famous one. The prototype based nature of javascript can be more easily implemented, and still have the good ways of object oriented. I’ll talk about prototype based language in other articles, so I’ll skip this part in this one.

3. JavaScript is a weak typed language

JavaScript do have types, it has Undefined, Null, Boolean, String, Number and Object type, they can convert to each other (except object, because we don’t have to, everything is an object, remember?)

But the variable don’t have types, so that, you can set anything to any variable and JavaScript won’t complain anything.

But JavaScript won’t let you to create your type, and there is nothing can help you to create a thing like type(since JavaScript is prototype based), it seems weird though(especially from java’s world), but it is really easy for JavaScript’s implementations (If we have Classes, then we must need packages and other many more things, if you don’t want have a mess environment).

4. JavaScript is an embed language

JavaScript can do anything the host environment want it do. So, if the host environment don’t provide the fundamental part JavaScript wants, suppose a calculator can run JavaScript, maybe javascript is just an object oriented calculation language on that calculator, it just can calculate the numbers and output the texts.

All the nature for JavaScripts only points to one conclusion:

JavaScript is build to be an embed language, and it focus on the simplicity of the language(and the implementation of the language), so that, the language(or the implementation) can be run at most machines, and can be easily implemented and mantained.

And as the simplicity of JavaScript (language design), for many circumstances, JavaScript can be the fastest scripting language on any platform, and the philosophy of JavaScript makes it a very nice embed language to use.

Yes, JavaScript have its flaws, but not on its philosophy, we’ll talk that later.

So, a final word for the conclusion of this article:

JavaScript an Objected Oriented Prototype Based Weak Type Scripting Language, its aim is to be a scripting language of simplicity to run in the host environment, and it has achieve its design goal.

We’ll talk about JavaScript’s grammar in next article.