Let’s talk about Autotools

Coming from a Java background, the most difficult part of writing C/C++ programs (or shared libraries) is getting the code to run on other machines (a live server, for example).

Java’s virtual machine architecture saves a lot of people a lot of time: you compile the code on your DEV machine, deploy (mostly copy) the jar (or war, or even ear) to the destination machine, and you’re done.

So you test the same code that will run in production, and everything it depends on is copied along with it (in a war, for example).

With C/C++ you don’t have any of this; you need a good understanding of everything your code depends on.

The Problem

Why did I need to write a C/C++ shared library? The story begins with a small request to write a SMILES tokenizer for MySQL. Why I needed that tokenizer is another story 😉

To compile that plugin’s code you’ll need:

  1. OpenBabel [headers and shared library are needed]: The foundation of the conversion. I need it to convert SMILES strings to molecule structures so that I can tokenize them with better context
  2. MySQL [headers are needed]: Yes, there must be a MySQL installation on the server, and since MySQL has a very nice plugin architecture, I don’t need to link against any MySQL libraries, oh yeah!

Not so hard, it seems.

But you are wrong:

Headers are not so easy to find.

Different systems, different versions and different distributions (even different installation methods) put the headers you need in different folders.

Take OpenBabel for example:

  1. A source install with the default prefix puts the headers in /usr/local/include/openbabel-2.0 (yes, we’re using the OpenBabel 2.0 API)
  2. If you’re using a Fedora-family system (CentOS, for example) and install openbabel-devel using yum, you’ll find the headers in /usr/include/openbabel-2.0
  3. If, like me, you develop on OS X and install OpenBabel using MacPorts, you’ll find the headers in /opt/local/include/openbabel-2.0

Yes, for a very limited set of systems (only CentOS and OS X), you already get 3 different locations for the headers you need, and the user may change the default path too.

And, worst of all, the server may not have OpenBabel installed at all, in which case you have to tell the user that you need it.
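
The header hunt above can be sketched in a few lines of Python. This only illustrates the idea behind a configure check, it is not how Autoconf implements it. The function name find_header is made up for this example, openbabel/mol.h is just a sample header, and the directories are the three listed above.

```python
import os

# Candidate include directories, taken from the list above.
CANDIDATE_INCLUDE_DIRS = [
    "/usr/local/include/openbabel-2.0",  # source install, default prefix
    "/usr/include/openbabel-2.0",        # yum on CentOS
    "/opt/local/include/openbabel-2.0",  # MacPorts on OS X
]


def find_header(header, search_dirs):
    """Return the first directory in search_dirs that contains header, or None."""
    for directory in search_dirs:
        if os.path.isfile(os.path.join(directory, header)):
            return directory
    return None


if __name__ == "__main__":
    # "openbabel/mol.h" is only an illustrative header name.
    found = find_header("openbabel/mol.h", CANDIDATE_INCLUDE_DIRS)
    if found is None:
        print("OpenBabel headers not found, please install OpenBabel first")
    else:
        print("Using OpenBabel headers from", found)
```

A user who installed to a custom prefix would simply add one more directory to the front of the list, which is roughly what passing an extra include path to configure does.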

The Libraries You Want To Link Are Not Easy To Find Either

Like headers, libraries are quite hard to find too, because:

  1. With a source install to the default prefix, the static library lives in /usr/local/lib with a name like libxxx.a, and the dynamic library lives in the same place with a name like libxxx.so or libxxx.dylib (on OS X)
  2. If you install the library using yum, it’ll be in /usr/lib
  3. If you install the library using MacPorts, it’ll be in /opt/local/lib

That’s not all: on Fedora, if you are using a 64-bit OS, the 64-bit library goes to /usr/lib64.

And, worst of all, the server may not have the libraries you need installed at all.

The Deployment Location Is Uncertain

Since I’m writing a MySQL plugin, what I want the make install target to do is install the plugin into MySQL’s plugin folder.

But with different MySQL installations and different systems, even the default plugin folder can be quite different.

And, even worse, there may be no MySQL plugin folder at all.

The Function You Need May Not Exist

Yes, that’s not the end of the problems. In another application of mine, I found that an API I used in OpenBabel 2.3.2 (from MacPorts) does not exist in 2.2.3 (from CentOS 6’s EPEL yum repository). So I must disable some features when compiling my code on a system that doesn’t support the OpenBabel 2.3.2 API, and let the other features keep working.

I needed a better way to do this.
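
To make the decision concrete, here is a small Python sketch of the "enable a feature only when the installed library is new enough" logic. Everything in it is an assumption for illustration: the feature name and its minimum version are made up, not taken from OpenBabel’s release notes. A real configure script would usually probe for the function itself instead of comparing version numbers.

```python
def parse_version(text):
    """Turn a dotted version string such as "2.3.2" into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


# Hypothetical feature table: the name and the minimum version are
# assumptions for this example only.
FEATURE_MIN_VERSION = {"newer_api_feature": "2.3.0"}


def enabled_features(installed_version):
    """Return the names of the features the installed version can support."""
    installed = parse_version(installed_version)
    return sorted(
        name
        for name, minimum in FEATURE_MIN_VERSION.items()
        if installed >= parse_version(minimum)
    )


if __name__ == "__main__":
    for version in ("2.3.2", "2.2.3"):
        print(version, enabled_features(version))
```

Comparing tuples of integers rather than raw strings matters here, because as plain strings "2.10.0" would sort before "2.9.9".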

My Solution

So I came to GNU’s Autotools. I chose it because MacPorts uses it by default, and PHP uses it to build plugins, 😀

Then I found out that Autotools is really hard to use, especially for newbies…..

This post assumes you have a little background knowledge of GNU Make and of how to write Makefiles.

Problems When I Used Autotools For The First Time

  1. What’s the workflow when using Autotools?
  2. How should I start?
  3. Which commands should I use, and how?
  4. Which files do I need to write?
  5. What should I do if I want to write a shared library (like a MySQL plugin)?
  6. What are AC Macros? Where is the Fking documentation for the Fking AC Macros?

These problems are the motivation for this post.

Autotools is TOO HARD for beginners!!!! There is very little documentation for beginners, and the official documentation is a piece of SHIT!

This scares most beginners away! I’ll try to make it a little simpler so that beginners can start playing with Autotools.

What’s the workflow for Autotools? How should I start? Which commands should I use?

Autotools is a set of tools that helps you write code that adapts to different systems and installations. It breaks down into these commands:

The Commands

  1. autoscan: Scans all of your code and generates a boilerplate configuration file (configure.scan, to be renamed configure.ac) for Autotools
  2. aclocal: Collects the macro definitions your configure.ac needs into aclocal.m4. If you skip this command, your autoconf run will probably fail with a “macro is not defined” error
  3. autoheader: Uses the configuration in configure.ac to generate config.h.in (the template that configure turns into config.h)
  4. autoconf: This is the core of the header and library resolving support. It uses the configuration in configure.ac to generate the configure script
  5. automake: Takes the Makefile configuration file Makefile.am and generates the Makefile template Makefile.in

And that’s not all; if you want to write a shared library, you’ll need this:

  1. libtool: The command line tool to create and install libraries. Autotools supports it by default (sure, they are from the same organization, aren’t they?)

So there are at least 6 commands you should know. Next, I’ll list the files that you should write or obtain (for beginners, this is quite difficult):

The Files

  • configure.scan: The output file of autoscan. You can rename it to configure.ac (it gives you some boilerplate)
  • configure.ac: This file is very important. It is the core configuration file for your Autotools build system; nearly every magic part of Autotools is configured here (using M4 macros)
  • aclocal.m4: Generated by aclocal, which reads configure.ac and collects the macros you’ll need (for example, the automake macros and libtool macros). It is very important for running the other Autotools commands
  • config.h.in: Generated by autoheader. It is the template from which configure generates config.h
  • Makefile.am: The Makefile template that you need to write. In it you define the targets and the variables (but I strongly suggest defining the variables in configure.ac and letting Autotools write them into your Makefile automatically; I’ll discuss this later)
  • Makefile.in: Generated by automake. It is the template for the final Makefile (without paths, since path resolving is done by configure)
  • configure: The final product of Autotools, since every remaining file is generated by this script or by something it generates. It is created by the autoconf command
  • config.status: Generated by configure. This script generates the final config.h and Makefile
  • config.h: The core of portability. Every detection result can be written into this header as a macro, so your code can test those macros to do the tricks (for example, leaving out some features if a function is missing, or using some F**king API instead of the POSIX API on Windows)
  • Makefile: Ah~~~ At last, we come to a file that means something…..

See? That’s why I said Autotools is quite hard for beginners. It has 6 commands (7 if you include libtoolize) and 10 kinds of files (input, output, or both).

The Workflow

I’ll just describe the workflow for shared library development (since it is more complex).

  1. Run autoscan to generate configure.scan
  2. Rename configure.scan to configure.ac
  3. Run libtoolize --force to add libtool support (add the --install option to let libtoolize copy its missing auxiliary files for you). Automake will later expect AUTHORS, COPYING, ChangeLog, INSTALL, NEWS and README files in the folder
  4. You’ll need to enable libtool and automake in your configuration, so add this code to your configure.ac

    AM_INIT_AUTOMAKE                          # Initialize automake
    AC_ENABLE_SHARED                          # Make libtool build shared libraries rather than only static ones
    LT_INIT                                   # Initialize libtool
    AC_CONFIG_MACRO_DIR([m4])                 # Keep the libtool macros in the m4 directory
    AC_CONFIG_FILES([Makefile src/Makefile])  # List the Makefiles that configure should generate
    AC_OUTPUT                                 # Write them out (AC_OUTPUT with arguments is obsolete)

  5. Run aclocal to generate aclocal.m4

  6. Run autoheader to generate config.h.in

  7. Create your own Makefile.am (you can see here for an example of writing a program’s Makefile.am). For a shared library, you should use this code (if you want to install the library to a destination other than /usr/local/lib; the pkgplugin prefix only works if a matching pkgplugindir variable is defined):

    pkgplugin_LTLIBRARIES = xxx.la
    xxx_la_SOURCES = xxx.h xxx.c

  8. Run autoconf to generate the configure script
  9. Run automake to generate Makefile.in (you may need the --add-missing option to add the missing files)
  10. You’re almost done. Now you can run ./configure to generate config.h and Makefile

Yes, 10 steps, and 2 kinds of files to write (configure.ac and Makefile.am).
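
If you want the command order in one place, here is a minimal Python sketch that only builds and prints the list of commands. It does not run anything. The function name autotools_commands is made up for this example, and putting aclocal right before autoheader is my assumption, based on the command list earlier in this post. The hand-written steps (renaming configure.scan, editing configure.ac, writing Makefile.am) still have to happen in between.

```python
def autotools_commands(shared_library=True):
    """Return the bootstrap commands in the order they would be run."""
    # autoscan writes configure.scan, which is renamed to configure.ac by hand.
    commands = [["autoscan"]]
    if shared_library:
        # Only needed when building libraries with libtool.
        commands.append(["libtoolize", "--force"])
    commands += [
        ["aclocal"],                    # collect macros into aclocal.m4
        ["autoheader"],                 # generate config.h.in
        ["autoconf"],                   # generate configure
        ["automake", "--add-missing"],  # generate Makefile.in
        ["./configure"],                # generate config.h and Makefile
    ]
    return commands


if __name__ == "__main__":
    for command in autotools_commands():
        print(" ".join(command))
```

Keeping each command as a list of arguments means the same data could later be handed to subprocess.run without any shell quoting, once the input files exist.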

The workflow looks like this:

[Image: Autotool Workflow]

Tricks

  1. Use the AC_MSG_CHECKING macro to tell your user what is being checked, like this: AC_MSG_CHECKING([F**king Windows API])
  2. Use the AC_MSG_RESULT macro to report the result of the check, like this: AC_MSG_RESULT([Yes, you are using F**king Windows XP])
  3. Use the AC_MSG_WARN macro to warn the user that some feature will not work, without stopping the checks: AC_MSG_WARN([I’m afraid some is not going to work….])
  4. Use the AC_MSG_ERROR macro to stop the flow and let the user install the dependencies, like this: AC_MSG_ERROR([You should at least to have brain to go on])
  5. If you are using F**king C++, you can’t use AC_CHECK_LIB, since C++ mangles its symbol names…. You should use AC_LINK_IFELSE instead; the details are here
  6. Use the AC_SUBST macro to add variables to your Makefile, like this: AC_SUBST([stair_to_heaven], [not exists])

The complete documentation for Autoconf macros is here, help yourself.

Thanks for watching….

Why we need another data processing framework

Background

I have had a lot of data processing work to do recently.

Yes, VERY MUCH data processing work.

I wrote a processing framework based on Rhino and Spring, called Jersey, which means JavaScript with ease.

It is fun to do data processing with Jersey, but it has 2 shortcomings:

  1. The startup time of Jersey is too long. It needs about 2 seconds to start up the context (sure, you need to start the Java virtual machine, initialise the Rhino runtime and then start the Spring container, so 2 seconds is not so bad). But that is nearly unbearable when I just want to play around with something (yes, Java is stable, but in a run-and-exit scheme IT IS REALLY SLOW. Why? There is always a lot of bootstrapping. Yes, I know that’s for flexibility, but it is really slow, man!).
  2. The memory footprint of Jersey is too large. The JVM always wants more memory. I can write a Python crawler, run it with a thread group of about 10 threads, and it still consumes less memory than the JVM uses for HelloWorld. This is very, very bad, since I need to run as many instances of my crawler as possible

So I went to Python (2.7) for small tasks (and even for bigger ones).

Python is a little better and faster, but compared to Jersey, it lacks:

  1. Better Unicode support: This is fundamental!!!! I don’t get why the Python community ignored this at the very beginning. I can’t open a CSV file properly without using a third-party library
  2. A fast MySQL driver: I tried pymysql (I didn’t get time to try others) and found it a little slow. I’ll explain that in another post
  3. Libraries: Sure, Python is a good language, and many people use it to do serious things. But compared to Java, the libraries are still not enough, at least for my data processing work
  4. Consistency for me: I’m working on a PHP framework for building websites (and a CMS based on it) now, so why should I write the data processing tool in Python rather than PHP, where I can use the libraries that I have already written?

So I gave up on Python for processing data.

And decided to give PHP a try.

Some thoughts on a data processing framework

After reading the section above, you know why I’m using PHP as the language of my data processing framework (I’ll keep Jersey working though. 🙂 )

And here are some thoughts on what a data processing framework should do (at least for me):

It should connect to most of the popular data sources

This is the fundamental part of the framework.

No matter how good your framework is, it is useless if it can’t even connect to MySQL or Postgres.

And nowadays it should also have mature libraries or drivers to connect to NoSQL data stores (like Solr, MongoDB etc.) and make the data transfer fast and safe.

It should be based on a scripting language

This is the same as Jersey. With a data processing framework, testing and adjusting might happen on the live server (or the crawler master). This is the reason I hate Hadoop… Why do I need to recompile, package and redeploy the code just to change a tiny bit of the crawler (only to run a small test)? Hadoop’s HDFS is good though.

It should be able to run across platforms

This is the same as Jersey. That’s why Jersey is based on Java…. Luckily, most scripting languages run on all the major platforms we use today.

It should be very easy to extend and configure

It should be a framework that contains lots of goodies, and everything from the foundation to the libraries should be flexible enough to change or override.

So, no matter how complex the requirement is, there is always a good way to base the program on the framework (Eclipse is a good example).

It should run very fast, and have a very small memory footprint

This is the same point as in the background section: you need to run it and get the result instantly if the processing is easy.

It should have progress bar support by default

I don’t think I need to explain this.

It should embed a fast rule engine

It is very important to embed a fast rule engine into the data processing framework.
Let’s look at the basic workflow for data processing:

  1. Load the data from the data source
  2. Transform the data into a common structured format (most data processing tools use XML)
  3. Process the data
  4. Transform the data into the destination format
  5. Store the data in the data destination

For step 1, you need the ability to connect (this has nothing to do with the rule engine).
For step 2, the best transform method is rule based: it is more readable and extensible. I’ll show you a real world example here.

Let’s suppose you have a small task: gather the user information collected using OAuth on 2 different platforms (Twitter and Facebook, for example).

Platform 1 (as p1)’s data format is (using JSON):

{
    "nick": "Jack",
    "profile_image": "a.jpg",
    "birthday": "someday"
}

And Platform 2 (as p2)’s data format is:

{
    "screen_name": "Jack",
    "img": "b.jpg",
    "birthday": "someday"
}

There are lots of records (about 100,000 each). You need to transform them into a standard form:

<user>
    <nick>Jack</nick>
    <profile_img>a.jpg</profile_img>
    <birthday>someday</birthday>
</user>

Let’s do this using PHP and some fake code. The first version is PHP code:

function processP1($arg) {
    $ret = array();
    if(isset($arg->nick)) {
        $ret['nick'] = $arg->nick;
    }
    if(isset($arg->profile_image)) {
        $ret['profile_img'] = $arg->profile_image;
    }
    if(isset($arg->birthday)) {
        $ret['birthday'] = $arg->birthday;
    }
    return (object) $ret;
}

function processP2($arg) {
    $ret = array();
    if(isset($arg->screen_name)) {
        $ret['nick'] = $arg->screen_name;
    }
    if(isset($arg->img)) {
        $ret['profile_img'] = $arg->img;
    }
    if(isset($arg->birthday)) {
        $ret['birthday'] = $arg->birthday;
    }
    return (object) $ret;
}

The second is CLIPS code:

(defrule set-result-nick-from-nick
    ?a <- (arg nick ?nick&~nil)
    ?r <- (result (nick nil))
    =>
    (retract ?a)
    (modify ?r (nick ?nick))
)

(defrule set-result-nick-from-screen-name
    ?a <- (arg screen_name ?nick&~nil)
    ?r <- (result (nick nil))
    =>
    (retract ?a)
    (modify ?r (nick ?nick))
)

(defrule set-result-profile-img-from-profile-image
    ?a <- (arg profile_image ?img&~nil)
    ?r <- (result (profile_img nil))
    =>
    (retract ?a)
    (modify ?r (profile_img ?img))
)

(defrule set-result-profile-img-from-img
    ?a <- (arg img ?img&~nil)
    ?r <- (result (profile_img nil))
    =>
    (retract ?a)
    (modify ?r (profile_img ?img))
)

(defrule set-result-birthday-from-birthday
    ?a <- (arg birthday ?birthday&~nil)
    ?r <- (result (birthday nil))
    =>
    (retract ?a)
    (modify ?r (birthday ?birthday))
)

Someone may argue that the first one can be written as a single method, like the second one.

But the world is changing: if p1 changes its protocol (say, renames profile_image to img), you will regret having jammed them together.

As you can see from the code above, the second one is more concise and, better still, it makes no assumptions about p1 or p2.

So, if some day you need to process the information of a platform called p3, you won’t need to change your code very much (you just add the missing rules and, if you are lucky, you may not need to add any rule at all, since the fields of a user profile are mostly the same).
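
The same idea can be sketched without a rule engine at all. The Python below is only a plain lookup table, not CLIPS and not the code of my framework, and the names FIELD_RULES and normalize are made up for this example. Like the rules above, it knows nothing about which platform a record came from, and it only fills a target field that is still empty.

```python
# Each rule maps a source field name to a field of the standard form.
FIELD_RULES = {
    "nick": "nick",
    "screen_name": "nick",
    "profile_image": "profile_img",
    "img": "profile_img",
    "birthday": "birthday",
}


def normalize(record, rules=FIELD_RULES):
    """Apply every matching rule; the first non-None value per target field wins."""
    result = {}
    for source_field, value in record.items():
        target_field = rules.get(source_field)
        if target_field is None or value is None:
            continue  # no rule for this field, or nothing to copy
        if target_field not in result:
            result[target_field] = value
    return result


if __name__ == "__main__":
    p1 = {"nick": "Jack", "profile_image": "a.jpg", "birthday": "someday"}
    p2 = {"screen_name": "Jack", "img": "b.jpg", "birthday": "someday"}
    print(normalize(p1))
    print(normalize(p2))
```

Supporting a hypothetical p3 that calls the field username would mean adding one entry to the table, for example normalize(record, {**FIELD_RULES, "username": "nick"}), without touching the function.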

Steps 3 and 4 are the same as step 2. A rule engine runs faster and works better when you need to write lots of if..then..else.

And it is very easy to read and maintain.

CLIPS and PHP

For my PHP website framework, I chose CLIPS to do the rule processing, and not only for the business logic.

I used it as the foundation of the framework. Maybe you are curious about the design: why would I use a rule engine as the foundation of a framework?

Here are two examples.

  1. The core rules to load configuration: Where to load the configuration from can be very tricky. If a framework is flexible, it can load from lots of places, and where to look is configurable too
  2. The core rules to load PHP scripts: This is the most fundamental part of every PHP framework. If you think it should be very easy, take CI’s CI_Loader as an example, try to read it to understand the routine and, if you dare, try to add one more rule. 😀

So I first wrote a plugin for PHP, called php-clips. It is nearly stable now (it can be compiled and installed using PHP’s build tools).

And I’m trying to write a PHP framework to implement the thoughts above. This framework has the following features:

  1. Embeds CLIPS as its core
  2. Can be run from the command line as an application
  3. Just like Jersey, it loads the classes and extensions from the working directory or any configured directory (which can be set in the system wide configuration /etc/… or found through an environment variable; sure, this is configurable too 😉 )
  4. You can use the CLIPS engine anytime, and even open a console (using PHP readline) to run CLIPS commands manually yourself
  5. It can run CLIPS scripts directly; if you want, you don’t need to write 1 line of PHP
  6. You can replace any fundamental part of the framework just by overriding it (no need to replace the script; just like CI, you can have MY_XXX to replace the original classes, any class, and yes, this is configurable too. 😉 )
  7. It is written following CI’s guidelines, so you’ll find the API and even the folder structure are like CI’s, but with the rule engine CLIPS as its core
  8. It uses mustache as the template engine, simple and fast
  9. It has a resource scheme and handler design just like Spring, and you can write your own handler using PHP’s resource scheme and handler design too
  10. It’ll use Console-ProgressBar based on Curses to show the progress bar (the same progress bar as PEAR)
  11. It’ll be distributed using PEAR

This little toy can be found at clips-tool, and it is functional now.

It is still in development, so it really lacks documentation. I’ll improve the documentation when the current data processing work is done.

Steps To Set Up an FTP Server

Here is something really basic: FTP setup.

Yeah, FTP setup is easy and fun, isn’t it?

All you need to do is install an FTP server, configure the users, and you’re done.

Piece of cake, right?

YOU ARE FUCKING WRONG!!!!!

I’ll write down my steps for setting up a secure FTP server, in case it helps out some freaking guy like me.

You should use good FTP software.

This is a very easy choice if you’re using a distribution like CentOS or RHEL.

They suggest you install vsftpd as the FTP software. I’m not an expert in this domain, and so far vsftpd works fine for me.

You should create the FTP user in Linux and set up the permissions

vsftpd uses Linux’s user system and file system as its own user system and file system. It’s a brilliant idea, since it gets the most sophisticated user permission system for free.

But this requires you to treat your users and system more carefully. Don’t leave the folders exposed to FTP or to FTP users too open, or anyone will be able to update or read your files over FTP without any problem.

Fine, this is not the key point I want to make, so I’ve kept it as short as I can. Let’s go to the KEY POINTS

1. You must set up SELinux to accept your FTP server, or it will block vsftpd when it tries to access the file system.

This is a very fucking thing, but it is true. If you don’t tell SELinux that vsftpd’s actions are fine, SELinux will stop them to keep the folders safe.

SELinux can be your friend in many ways, so turning it off may not be a good option.

I googled for ways to make these two things work together, and here is how:

/usr/sbin/setsebool -P ftp_home_dir=1 

This command updates SELinux’s policy and gives FTP applications the privileges to access users’ home folders.

It takes a little time to execute, but it is the easiest way to achieve this, believe me.

2. You must configure the iptables firewall to let FTP applications connect

This step is easy to understand. No one wants their server too open, so at the beginning iptables only lets ICMP and SSH requests reach the server.

In order to let FTP applications access the server, you must open two ports: 20 for data transfer and 21 for commands.

So the configuration for iptables should look like this:

-A INPUT -m state --state NEW -m tcp -p tcp --dport 20 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 21 -j ACCEPT

After this, your FTP client can connect to the server.

Are we finished yet?

NO!!!!, not yet.

You still can’t upload your files to the server.

Why?!

Because:

VSFTPD IS USING PASSIVE MODE BY DEFAULT, and the passive mode of FTP works like this:

  • The FTP client tells the server: Let’s use passive mode
  • The server responds: You can connect to me on port xxxx for this transfer
  • The client opens a TCP connection from a local port (2001, say) to the server’s port xxxx to start the transfer

Yes, passive mode can make use of more ports on the server than active mode. That’s a better way to work, isn’t it?

But do you remember that we only allowed requests to ports 21 and 20 in iptables?

So this is a very, very, very big problem for FTP applications.

They get confused by the server: the server tells them to open a connection to port xxxx, but when they try, they get a connection refused.
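
To see where port xxxx comes from, here is a small Python sketch that decodes the server’s reply to the PASV command. The reply has the form 227 Entering Passive Mode (h1,h2,h3,h4,p1,p2), and the data port is p1 * 256 + p2. The function names are made up for this example, and the address and numbers in the sample reply are invented for illustration.

```python
import re


def parse_pasv_port(reply):
    """Extract the data port from a '227 Entering Passive Mode (...)' reply."""
    match = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", reply)
    if match is None:
        raise ValueError("not a PASV reply: %r" % reply)
    # The last two numbers are the high and low bytes of the port.
    high, low = int(match.group(5)), int(match.group(6))
    return high * 256 + low


def is_port_open(port, open_ports):
    """Tell whether the firewall, modelled as a set of open ports, lets port through."""
    return port in open_ports


if __name__ == "__main__":
    # Invented sample reply, for illustration only.
    reply = "227 Entering Passive Mode (192,168,1,10,39,110)"
    port = parse_pasv_port(reply)
    print("server asked for port", port)
    print("allowed by a firewall that opens only 20 and 21:", is_port_open(port, {20, 21}))
```

With only ports 20 and 21 open, any port the server picks for passive mode fails this check, which is exactly the connection refused that the client sees.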

So, you need to:

3. Change the configuration of vsftpd so that passive mode only uses a range of ports

For example, like this:

pasv_max_port=10100
pasv_min_port=10090

This restricts passive mode to ports 10090 to 10100.

Then

4. You need to change the iptables configuration to open ports 10090 to 10100 for requests

-I INPUT -p tcp --dport 10090:10100 -j ACCEPT

Then your FTP server is done and secure. If you want to make the transfers more secure, you can:

5. Add SSL transfer support to vsftpd

First you need to generate a self-signed certificate for SSL

cd /etc/vsftpd
/usr/bin/openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout vsftpd.pem -out vsftpd.pem

This command generates a certificate for SSL, and the certificate will be valid for one year.

Then you need to edit /etc/vsftpd.conf and add these lines

# Turn on SSL
ssl_enable=YES

# Allow anonymous users to use secured SSL connections
allow_anon_ssl=YES

# All non-anonymous logins are forced to use a secure SSL connection in order to
# send and receive data on data connections.
force_local_data_ssl=YES

# All non-anonymous logins are forced to use a secure SSL connection in order to send the password.
force_local_logins_ssl=YES

# Permit TLS v1 protocol connections
ssl_tlsv1=YES

# Permit SSL v2 protocol connections
ssl_sslv2=YES

# permit SSL v3 protocol connections
ssl_sslv3=YES

# Specifies the location of the RSA certificate to use for SSL encrypted connections
rsa_cert_file=/etc/vsftpd/vsftpd.pem

After these steps,

6. Restart all the services

service iptables restart
service vsftpd restart

And you’re done.

So, what did we learn today?

  1. It is very hard to be secure, even with a very simple and fundamental service like FTP
  2. Linux is secure only when you understand it more deeply and use it more carefully
  3. Don’t blame the firewall for the problems; it protects you
  4. When something is wrong, maybe the only problem is your understanding, so reading and asking before complaining is a good way to solve the problem