Monthly Archives: January 2013

TakTuk

Thinking on daemon process launching and management

For the project I recently working on (fetching data and analyse them), I setup a crawler and analyser cluster.

First many(as many crawler as to make the max of the server) crawlers must be spawned and configured.

Since the crawler can use zookeeper to configure it self, the configuration part is not need to considered.

The hard part of this architecture is that, you’ll need to launch many crawler instance by hand, and if you want to reconfigure them(for example, the data processing rules, they are loaded at the very beginning, and won’t get reload in the whole processing time, part for rules won’t change that much, and part for the rules need to be compiled).

Sure, I have wrote a script to launch and kill the crawlers, here is the thoughts on how to implement the script and what function should it have:

  • 1 It must be executable and nearly dependless on most of the Linux distributions

It is the same thought as the crawler since I want the crawler can be run on as many machine as possible, so the crawler is based on jersey(The javascript runtime I have written on Rhino, based on Java). As for this though, we have 3 options: bash, perl, Java

  • 2 It must be easy to deploy on many machines, or has the self publish function.

This is a very fundamental function of this launching script. Since I just tired of deploy the crawler over many machines, and deploy the script bring even more burden for me.

And the crucial part of this function is that, how to publish the scripts it self to many machines so that you can spawn many processes on that machine and keep management them.

For publish the script to other machine, we can use scp, a powerful, easy and safe way to copy resources from machine to machine. Just need a little configuration, you can copy the files from master to many slaves without interaction.

I can wrote a script to handle that, so that after I deploy the script on the master machine, I can deploy it automatically to many slave machines

For this function, I find out taktuk, a very good tool to play and use in order to manage many slave machines(say, install java or jersey for them). It use perl as its own language, and has more power on the publish, but I won’t talk about it very deeply in this article, since it don’t have the ability of spawning many processes and manage them(but you can really use it as the publish layer)

  • 3 It should be thin, light and costless for spawning processes

I don’t think run this script as a server is a good idea (at least for my opinion). The script just launching the processes with proper stdin, stderr, stdout, working directory and die after that. I don’t think keep this script running and wasting the resources is a wise idea. It is just a launcher after all .

  • 4 It should be configurable, say to launch how many process at 1 time

As I wrote on the above, I want the script to launch many crawler on 1 machine, so that the crawler will take as many resource on that machine as we can get. So, I want the script can spawn many crawler processes at 1 time, and each has its own stdout and std error.

And, yes, the slave machines’s ip should be configurable for this script.

  • 5 * It should launch the process as daemon*

Since the crawler runs day and night on the server. I don’t want it lives only in my ssh session. So I must make them an OS daemon to run in the background and do not harm by any signals or sessions. The script can use daemonize or nohup to achieve this

  • 6 It should provide the function to check the process running status

Since the crawlers are programs(and the number of them is not few), so no wonder some of the crawler gets this or that problem and stop working(say bugs, lack of memory and disk, java vm crushing or kernel crushing).

So, I need to check for their health, so, if some slave’s crawler has died, or 1 slave is restarted, I can know it when I check, or I can write a script to check that, if there is something wrong, it can send me an email about the problem, so I can get a plan to restart them or doing something else.

  • 7 It should provide the function to kill the processes running

This is very useful for redeploying the scripts, or redeploying the data analyse rules for all the crawlers. Say, if I have add another rule to the crawler, I need to publish the rule to all the slave machine(this can be done easily using taktuk), and I need to restart all the crawler processes.

So, for argument 1, 2 and 3, I think bash and perl is the best choice. And the publishing and remote executing can use taktuk to handle, I choose bash as the script language.

Thanks for taktuk’s ability, so I can use the logic for master and manage all the servers, so I just need to redirect the stdout to the master and I can get every detail of the slave’s status.

May be you will ask: Why bother? If you just need a job manager, Why not use Hadoop? Hadoop is very good at executing and manage jobs.

The answer is that hadoop or map reduce is not fit for crawling. Crawling is something like a recursive tool. Crawling start at a beginning point, and found more and more task from that, you don’t know how many times should that recur, but map reduce is not good at recursive operations.

I surely use hadoop, but just to handle the data that crawler has fetched, but as crawler, it is not useful.

At the end of this article, I would say, to run a cluster of crawler is very difficult, the logic of crawler is very complex if you want crawler won’t cost very much of your precious time. I’ll write how I wrote the crawler in another article.

Thanks for taktuk, without it, the work for my scripting tool can be more harder than writing the crawler. Life is hard for analysing massive data, but with a better tool, at least, your life will be easier.

The launch of Jersey

I was glad to announced that the launch of my JavaScript execution environment(called Jersey, in the name of JavaScript of Easy).

This begin with the thought on how to get a handy too to handle the tiny programming works I have to do very often, and has many foundation between(such as text manipulating, http site data fetch and processing, xml processing, html processing, sold data updating or data migration).

I’ll write an article to explain the this later.

The link for Jersey is https://github.com/guitarpoet/jersey and the README of it is below.

Jersey: A pluggable JavaScript execution environment

This is a pluggable JavaScript execution environment based on Rhino. It can run on any Java runtime above Java 5.

Jersey also contains a package manager called jpm, so you just need to provide an ivy module file, then you can get all the plugin and plugin dependencies using maven distribution.

Concept

Jersey is a running environment for JavaScript, it can run Mozilla Rhino flavoured JavaScript using Java. The aim for Jersey, is to provide a nice and pluggable environment to playwith or use JavaScript and based on the richi set of Java libraries.

The concept of Jersey is as list bellow:

  • Pluggable: Jersey is pluggable, you can install any plugin to it, and the plugin is just a java jar. All the plugins are deactivate at the start of the enviromnent, you can activate the plugins using Spring(Jersey provides you the functions to do that).
  • Simple: Jersey try to provide the api as simple as possible, and provide a console to play JavaScript with and will add the support of other script that compiles to JavaScript(for example, CoffeeScript) using the console, and the console support the tab completion, so you can fool around very easily
  • Configurable: Jersey provides the configuration basics based on Java’s properties, and all the configuration for application and plugins contains within the same configuration file
  • Discoverable: You can list all the provided function and provided modules using native functions.
  • Docable: Jersey provides the doc method for all the function and plugin modules, you can read the documentation of the function and modules using man function, and even have a pager function to view the pages.
  • Easy to be server: Since Java has many good embedded services, Jersey using Apache Derby to provide Database service and using Jetty to provide http server service, I also have a small CMS system written using the mvc and http plugin of Jersey
  • Easy to use: Jersey providing the pacakge management system based on the Famous Maven and Ivy, so you can download every plugin and plugin dependencies using Ivy’s module file.
  • Can support other languages: Since JavaScript is not a very simple language to write, Jersey also supports Coffee script(detected by the file extention) and using coffee to compile it and run, will support others in the future
  • Powerful: Jersey is based on Java, so it can take advantage of current Java and opensource Java libraries to provide powerful functions, it can use JDBC to access Database, using Commons DBCP to make the database connection pool, it can use Solr client to access the Apache Solr, it can use Log4J to handle the logging things and even using JBoss Drools to do the Rule matching.
  • Stable: Thanks to Java’s stablility and JavaScript’s simplicity, Jersey can be very stable for running, it can run very long time without memory leaks or make system unstable. Jersey’s multithreading is using Java’s multithreading support, and Jersey provides a JobManager function using Java’s concurrent library.

Architecture

Jersey is based on Java, Rhino and Spring, the component structure of Jersey is based on Spring, and the interpreter of JavaScript is using Rhino.

All the native functions and native modules is initiliazed at the intializing of the environment, all the plugins you want to use, need to be installed into the lib folder first, configure it correctly in the configuration file, then using require function to require the intialize script for this plugin.

Usage

Jersey is used using commandline, you can run the console using just jersey command. Or jersey a.js b.js c.js to run the script file one by one.

Jersey accept -c option to get the option file location (default is config.properties), and -s to get the script file(This only used for standard in, standard in file for Jersey is – )

Native Functions

Here is docs for some native functions that Jersey provided, you can list all the native functions using function functions().

  • require: Require the library using the resouce location, support file:file and classpath: protocol.
  • appContext: The function to load application context to javascript console
  • config: Read the configuration from system.properties or list all the properties
  • functions: List all the functions this shell provided.
  • modules:Prints all the modules that have loaded.

Native Modules

Here is the docs for some native modules that Jersey provided, you can lis all the native modules using function modules().

  • console: The console object.
  • file: The file utils.
  • csv: The csv utils service.
  • sutils: The string utils service.

Standard Libraries

Jersey provides some standard JavaScript libraries in the distribution. Here is the list of the libraries:

  • std/common: This provides the common extention of JavaScript, for example capitalize, isBlank, endWith for JavaScript and so on.
  • std/date: This provides date
  • std/evn.rhino: This provide env, to use library like jQuery
  • std/man: This provide the manual function for all the native function and modules
  • std/jquery: This provides the famous jQuery
  • std/underscore: This provides underscore

You can load your scripts using classpath protocol too.

Funny Quotes About Programming

I was looking for quotes for the second part of thinking about JavaScript, and found many interesting quotes about programming at here.

Here is some that I loved, they shows how intelligent and humour the author is.

 “A C program is like a fast dance on a newly waxed dance floor by people carrying razors.” – Waldi Ravens.

C is not really friendly to the beginners of the programming(at least to me), just like guns to the kids, they will hurt themselves eventually.

“I think Microsoft named .Net so it wouldn’t show up in a Unix directory listing.” – Oktal

To Unix guys(even for Java guys), .Net seems to be a hell. C# is a good language(no wonder), it has a mix nature of many languages(including Java), it has many good natures but it provides too many ways, so there are so many questions on StackOverFlow, just to ask how to use the some of the syntax of C# Correctly!!!. And for libraries of .Net, I have to admit, that’s a nightmare. The OpenFileDialog relative path bug, it exists until this day (yes, this is a tiny bug in all the misdesign and bugs of .Net framework).

“Fine, Java MIGHT be a good example of what a programming language should be like. But Java applications are good examples of what applications SHOULDN’T be like.” – pixadel

This is true before SWT, no one can say Eclipse is a bad example for applications. Even Swing do much better today

Let’s Talk About JavaScript (Part 1)

Mammon slept. And the beast reborn spread over the earth and its numbers grew legion. And they proclaimed the times and sacrificed crops unto the fire, with the cunning of foxes. And they built a new world in their own image as promised by the sacred words, and spoke of the beast with their children. Mammon awoke, and lo! it was naught but a follower.

from The Book of Mozilla, 11:9 (10th Edition)

To me, the programming languages can mean many things, they are tools to do specific jobs, they are the grammar and words to explain the problem and how to handle the problems, they are the methodologies to understand and explain the world’s reality to logic and entities.

So, to me, the programming language has 3 parts:

1. Methodology: This part is the fundamental side of the language, the philosophy of the language, the core concept of the language. This part explains why this language exists, what’s the original purpose of this language and how it evolved. This is the metaphysics side of the language, but the most important part to understand the language deeply(at least for my opinion).
2. Words and Grammar: This part is the body of the whole language, how to claim a flow, how to claim a data structure.
3. Fundamental Libraries: Although the first 2 part define what the language is, this part(the fundamental libraries) for the language runtime, is very crucial for the language, it even can affect the success of the language

As this article is the first part of the discussion for JavaScript, so, I’ll talk about the methodology of JavaScript first.

As I said, the methodology of the language is more important part for the language, and how the language is designed, why this language should been created should be discuss at the very begining.

First, the name, JavaScript is a trademark of Oracle Corporation(cited as Wikipedia). It is the common name for ECMAScript v3(for today’s talking), so, when I talking about JavaScript in my blog, I was really talking about ECMAScript (and v3 only), just the standard features and standard build in function and objects.

So, what’s an ECMAScript then?

ECMAScript is an object-oriented programming language for performing computations and manipulating computational objects within a host environment.

And what’s the purpose of the ECMAScript, why we should need it?

ECMAScript was originally designed to be a Web scripting language, providing a mechanism to enliven Web pages in browsers and to perform server computation as part of a Web-based client-server architecture.

As we can see in the ECMAScript’s introduction[http://www.ecma-international.org/publications/files/ECMA-ST-ARCH/ECMA-262,%203rd%20edition,%20December%201999.pdf] and about [http://inventors.about.com/od/jstartinventions/a/JavaScript.htm]. JavaScript was originally inspirited by the idea of Java, that you can run computation and show animations in the web browser(web pages are really dull back to that year).

So, in the design of JavaScript, the author even don’t think the most fundamental part of the programming language, the Standard Input and Standard Output should be necessary.

As stated in the ECMAScript’s standard:

ECMAScript as defined here is not intended to be computationally self-sufficient; indeed, there are no provisions in this specification for input of external data or output of computed results. Instead, it is expected that the computational environment of an ECMAScript program will provide not only the objects and other facilities described in this specification but also certain environment-specific host objects, whose description and behavior are beyond the scope of this specification except to indicate that they may provide certain properties that can be accessed and certain functions that can be called from an ECMAScript program.

So, we can have a conclusion here:

JavaScript (the language standard ECMAScript v3) is a web scripting language, it is interpreted by the host environment (run time), and besides the language grammar, other things that needed is provided by the host environment, and these things are depended by the host environment, not the language itself.

Then we can get the real fact of JavaScript, it is a interpreting embed language for running the computation tasks by the host environment.

And why I think JavaScript is a good(in fact very good) interpreting embed language, let’s see JavaScript’s philosophy(I try to make it clear, but I’m not teaching JavaScript here, if you want to know the details of JavaScript programming, you can find it on W3CSchool).

1. JavaScript is a Object oriented language

Everything in JavaScript is an object, the object itself is something like a hash table(so javascript don’t need bother to have the data structure for hash tables). Object can have a function as its constructor, and if you are clever enough, you can make objects has parent object (even parent constructor). Because object are somewhat hash table, you can’t have really private in the object, everything you have in the object can be see by anything that can get hold of the object, and JavaScript even let you to iterate the property names in the object.

Since everything is an object, the primitives and functions are object too, but the primitives(numbers for example) can’t really perform as a hash table, though you can setting properties to them, they won’t save any of the property at all.

This purity offers JavaScript the simplicity, if you don’t agree, see what a mess Java has become.

2. JavaScript is a prototype based language

JavaScript is not the first one prototype based language, but I think it is the most famous one. The prototype based nature of javascript can be more easily implemented, and still have the good ways of object oriented. I’ll talk about prototype based language in other articles, so I’ll skip this part in this one.

3. JavaScript is a weak typed language

JavaScript do have types, it has Undefined, Null, Boolean, String, Number and Object type, they can convert to each other (except object, because we don’t have to, everything is an object, remember?)

But the variable don’t have types, so that, you can set anything to any variable and JavaScript won’t complain anything.

But JavaScript won’t let you to create your type, and there is nothing can help you to create a thing like type(since JavaScript is prototype based), it seems weird though(especially from java’s world), but it is really easy for JavaScript’s implementations (If we have Classes, then we must need packages and other many more things, if you don’t want have a mess environment).

4. JavaScript is an embed language

JavaScript can do anything the host environment want it do. So, if the host environment don’t provide the fundamental part JavaScript wants, suppose a calculator can run JavaScript, maybe javascript is just an object oriented calculation language on that calculator, it just can calculate the numbers and output the texts.

All the nature for JavaScripts only points to one conclusion:

JavaScript is build to be an embed language, and it focus on the simplicity of the language(and the implementation of the language), so that, the language(or the implementation) can be run at most machines, and can be easily implemented and mantained.

And as the simplicity of JavaScript (language design), for many circumstances, JavaScript can be the fastest scripting language on any platform, and the philosophy of JavaScript makes it a very nice embed language to use.

Yes, JavaScript have its flaws, but not on its philosophy, we’ll talk that later.

So, a final word for the conclusion of this article:

JavaScript an Objected Oriented Prototype Based Weak Type Scripting Language, its aim is to be a scripting language of simplicity to run in the host environment, and it has achieve its design goal.

We’ll talk about JavaScript’s grammar in next article.