One area of Computer Vision that has always fascinated people is the ability of robots, and of computers in general, to detect, recognize and interact with their human counterparts. In this article we take advantage of cheap tools for computing and image acquisition, such as Raspberry Pi and its dedicated video camera, Camera Pi, and of open source software for image acquisition and processing, such as OpenCV and SimpleCV, which allow a high-level, and therefore much simplified, approach to this discipline.
In this post we show how to locate human beings, or parts of them such as faces, eyes and noses, within pictures. This functionality is available in the most advanced photo gallery applications, and it is currently being implemented in social network applications. Once the photos are loaded, the system scans them for people’s faces, finds them and offers the chance to associate a name with each one. If the same person appears in different pictures, he or she is recognized and automatically “registered”, privacy concerns notwithstanding. This last functionality is the one we previously referred to as identification or recognition.
The Recognition Method
To recognize real objects such as, in our case, people and their features, both SimpleCV and OpenCV supply a method known as the “Haar feature cascade” or Viola-Jones method. It was proposed in 2001 by Paul Viola and Michael Jones in their article “Rapid Object Detection using a Boosted Cascade of Simple Features”; as the title suggests, objects can be identified rapidly by means of a cascade of successive combinations of simple features. The method combines four key components:
- Comparison features built from rectangular pixel matrices, known as Haar features;
- The integral image, computed from the image to be processed, to speed up feature evaluation;
- The AdaBoost learning method, used to select the most significant features;
- A cascade classifier, to speed up the detection process.
The algorithm used to extract the features derives from the Haar wavelet (http://en.wikipedia.org/wiki/Haar_wavelet). Simplifying to the point of being ridiculous, the wavelet can be represented by two square matrices, one for its upper part and one for its lower part.
In the model used to extract features from images, the reference matrices come in different shapes, such as the ones shown in the figure, which are better suited to picking out the shapes of the human body, like the eyes or the nose. Hence the name Haar features, which distinguishes them from the original wavelet. The same figure shows the shapes of the features used by OpenCV and SimpleCV. Whether a Haar “feature” is present in a portion of the picture is decided by subtracting the sum of the pixel values under the black portion of the “mask” from the sum of the pixel values under the clear portion. If the difference is above a certain threshold, the feature is considered present.
The threshold is determined, for each feature, while the detector is trained to detect a particular object or part of the human body. Training consists of “presenting” to the vision system as many images as possible of the family of “objects” we want to identify, together with as many images as possible that have nothing to do with it. From the data “studied” in this way, a threshold is calculated for each feature; in OpenCV and SimpleCV these values are stored in a file in .xml format.
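The presence test for a single feature can be sketched in a few lines of Python. This is only a minimal illustration: it assumes a two-rectangle vertical-edge mask (clear on the left, black on the right), takes the feature value as the difference between the pixel sums of the two regions, and uses made-up pixel values and a made-up threshold. The function names are hypothetical and are not part of OpenCV or SimpleCV.

```python
def region_sum(image, top, left, height, width):
    """Sum of the grey levels inside a rectangle of a 2D list of pixels."""
    return sum(sum(row[left:left + width]) for row in image[top:top + height])


def two_rect_feature(image, top, left, height, width):
    """Value of a vertical-edge mask: clear left half minus black right half."""
    half = width // 2
    clear = region_sum(image, top, left, height, half)
    black = region_sum(image, top, left + half, height, half)
    return clear - black


def feature_present(image, top, left, height, width, threshold):
    """The feature counts as present when its value exceeds the threshold."""
    return two_rect_feature(image, top, left, height, width) > threshold


# A 4x4 patch that is bright on the left and dark on the right (a vertical edge).
edge_patch = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [200, 200, 10, 10],
]
# A uniform patch with no edge at all.
flat_patch = [[100] * 4 for _ in range(4)]

edge_value = two_rect_feature(edge_patch, 0, 0, 4, 4)   # 1600 - 80 = 1520
flat_value = two_rect_feature(flat_patch, 0, 0, 4, 4)   # 800 - 800 = 0
```

With an illustrative threshold of 1000, the feature is reported as present on the edge patch and absent on the flat one. In a trained detector the threshold would come from the .xml file rather than being chosen by hand.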
The portions of the picture that are analyzed are usually 24×24 pixel matrices, each of which has to be “compared” with all the possible features. Here a first problem arises: a 24×24 pixel matrix yields more than 160,000 features. Viola and Jones introduced a simplification based on the integral image, which allows the pixel sum of any rectangle to be obtained from just four values. Incidentally, this technique was developed mathematically in 1984. You may expand your knowledge on the method at this link
The process assigns to each pixel position in the matrix (image) the sum of all the pixels that lie above and to the left of it. Starting from the top-left pixel and proceeding to the right and downwards, the value for each pixel can be computed incrementally, which makes the calculation very efficient.
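The incremental calculation, and the way four values are enough to obtain the sum of any rectangle, can be sketched as follows. This is a plain-Python illustration on nested lists, not the code used inside OpenCV or SimpleCV; the function names are hypothetical, and the table is padded with one extra row and column of zeros to keep the lookups simple.

```python
def integral_image(image):
    """Build a padded table where table[y][x] is the sum of all pixels
    in rows before y and columns before x of the original image."""
    rows, cols = len(image), len(image[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for y in range(rows):
        row_sum = 0  # running sum of the current row, left to right
        for x in range(cols):
            row_sum += image[y][x]
            # value of the cell above plus everything so far on this row
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def rect_sum(table, top, left, height, width):
    """Pixel sum of any rectangle, obtained from just four table lookups."""
    bottom, right = top + height, left + width
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
whole = rect_sum(table, 0, 0, 3, 3)        # 45, the sum of all nine pixels
lower_right = rect_sum(table, 1, 1, 2, 2)  # 5 + 6 + 8 + 9 = 28
```

The table is built in a single pass over the image, after which the cost of a rectangle sum no longer depends on the size of the rectangle.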
Despite this simplification, the amount of processing is still too large to be efficient. The authors of the method noticed that certain features are more significant than others for recognizing certain parts, e.g. the eyes, but not at all significant for detecting others, e.g. the cheeks. Some features are not significant at all. How can we select the most significant features and discard the rest? It can be done with AdaBoost, a “machine learning” algorithm created to improve the performance of other learning algorithms.
In extremely plain terms, which would horrify purists, its purpose is to find the smallest possible set of features that grants a given accuracy (for example, 75%) in detecting or discarding the required object. The paper mentioned at the beginning shows that 200 features are enough to locate objects with an accuracy of 95%.
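To give an idea of how such a selection can work, here is a toy sketch of AdaBoost with simple threshold tests (“stumps”) on precomputed feature values. It is not the training code of OpenCV or SimpleCV: the data, the function names and the number of rounds are assumptions made for illustration only. At each round it picks the feature whose best threshold makes the smallest weighted error, then reduces the weight of the samples that feature already classifies correctly.

```python
import math


def predict(value, threshold, polarity):
    """Stump decision: +1 (object) or -1 (not object) for one feature value."""
    return 1 if polarity * value >= polarity * threshold else -1


def best_stump(samples, labels, weights, feature):
    """Threshold and polarity with the lowest weighted error for one feature."""
    best = (float("inf"), None, None)
    for threshold in sorted(set(s[feature] for s in samples)):
        for polarity in (1, -1):
            error = sum(w for s, y, w in zip(samples, labels, weights)
                        if predict(s[feature], threshold, polarity) != y)
            if error < best[0]:
                best = (error, threshold, polarity)
    return best


def select_features(samples, labels, rounds):
    """Return a list of (feature index, threshold, polarity, alpha)."""
    n = len(samples)
    weights = [1.0 / n] * n
    chosen = []
    for _ in range(rounds):
        candidates = [(best_stump(samples, labels, weights, f), f)
                      for f in range(len(samples[0]))]
        (error, threshold, polarity), feature = min(
            candidates, key=lambda c: c[0][0])
        # keep the error away from 0 and 1 so the logarithm stays finite
        error = min(max(error, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - error) / error)
        chosen.append((feature, threshold, polarity, alpha))
        # samples classified correctly lose weight, the others gain it
        weights = [w * math.exp(-alpha * y * predict(s[feature], threshold, polarity))
                   for s, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return chosen


# Made-up data: feature 0 separates the two classes, feature 1 is noise.
samples = [(9, 3), (8, 7), (7, 1), (2, 6), (1, 2), (3, 8)]
labels = [1, 1, 1, -1, -1, -1]
selected = select_features(samples, labels, rounds=1)
```

On this made-up data the first feature selected is feature 0, with a threshold of 7, because it is the only one that separates the positive samples from the negative ones.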
For the sake of completeness and greater accuracy, implementations of the method employ about 6,000 of the initial 160,000 features. If we stopped here, we would have to evaluate 6,000 features for each 24×24 pixel matrix, which still takes far too long.
Luckily, another intuition of the same authors supplied a solution to this problem. In most pictures, the greater part of the area doesn’t contain the sought items. It therefore makes sense to define a simple test that can tell whether a portion of the picture might belong to the sought item or is surely not part of it. In the second case, the portion is discarded immediately and is not processed any further. The method instead concentrates on the portions that in some way seem to be part of the sought item, and analyzes them thoroughly.
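The figure of more than 160,000 candidate features for a 24×24 matrix can be checked with a short count. The sketch below assumes the five basic rectangle templates (two-rectangle horizontal and vertical, three-rectangle horizontal and vertical, and four-rectangle), each stretched to every size and placed at every position that fits in the window. The templates and the function name are assumptions for illustration; a different feature set would give a different total.

```python
def count_features(window, base_shapes):
    """Count every scaled and shifted copy of each template in the window."""
    total = 0
    for base_w, base_h in base_shapes:
        # widths and heights grow in multiples of the template size
        for w in range(base_w, window + 1, base_w):
            for h in range(base_h, window + 1, base_h):
                # number of positions where a w x h rectangle fits
                total += (window - w + 1) * (window - h + 1)
    return total


# (width, height) of the smallest version of each template, in pixels
BASE_SHAPES = [(2, 1), (1, 2), (3, 1), (1, 3), (2, 2)]

n_features = count_features(24, BASE_SHAPES)  # 162336
```

Under these assumptions the count is 162,336, consistent with the order of magnitude quoted in the text.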
This led to the idea of a “cascade”, that is to say a cascade of classifiers. Instead of applying all 6,000 features to each portion of the image, the features are grouped into successive classification levels and applied one level at a time. If a matrix is discarded by the first-level classifier, it is not processed any further. If it passes the first level, it is compared with the features belonging to the second level, and so on. If a portion of the image passes all the levels, it belongs to the sought item.
The proposed classifier divides the roughly 6,000 features into 38 classification levels. The first five levels consist of 1, 10, 25, 25 and 50 features respectively. This way each image portion needs to be compared with only about 10 features (on average!). Before getting on to some practical examples, let’s remember that, in order to work, the method has to be trained with a large quantity (600 or more) of images containing different versions of the sought items, and as many images not containing them. Luckily, at least for the main shapes associated with the human body, the work has already been done and is included in the OpenCV and SimpleCV packages. This allows us to start experimenting immediately.
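The early-rejection logic of the cascade can be sketched in a few lines. This is a deliberately simplified illustration: each level here just counts how many of its features are present, whereas a trained cascade combines weighted feature responses with thresholds learned during training. The level sizes, the vote counts and the function names are all made up.

```python
def make_level(feature_indices, min_votes):
    """Build one level: a window passes if enough of its features are present."""
    def level(window_features):
        votes = sum(1 for i in feature_indices if window_features[i])
        return votes >= min_votes
    return level


def run_cascade(levels, window_features):
    """Return (accepted, levels_evaluated), stopping at the first rejection."""
    for evaluated, level in enumerate(levels, start=1):
        if not level(window_features):
            return False, evaluated
    return True, len(levels)


# Three illustrative levels using 1, 3 and 4 features respectively.
levels = [
    make_level([0], 1),
    make_level([1, 2, 3], 2),
    make_level([4, 5, 6, 7], 3),
]

# Presence (True) or absence (False) of eight features in two image portions.
background = [False, True, True, True, True, True, True, True]
face_like = [True, True, False, True, True, True, True, False]

background_result = run_cascade(levels, background)  # (False, 1)
face_result = run_cascade(levels, face_like)         # (True, 3)
```

The background portion is discarded after a single feature has been checked, while only the promising portion pays the cost of all three levels. This is where the saving in processing time comes from.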
Stand up for it, with your face
After this long introduction to the object recognition method based on the Haar feature cascade algorithm, let’s try it out in practice with some examples. Let’s take the occasion to update the Raspberry Pi operating system as well, and to install a new library that helps us manage Camera Pi. Let’s power up Raspberry Pi, verify that the monitor works, and then connect by means of PuTTY (or any other tool you prefer) as the “root” user. Many improvements have appeared by the time of writing, so let’s update the operating system with the commands:
If your distribution is not one of the latest, the updates will take some time to complete. In cases like this, before starting it is always wise to perform a good:
Let’s now install the “picamera” library, developed specifically for the Python environment to interface with Camera Pi. The documentation of the “picamera” library, along with other useful information about the available functions and the operation of Camera Pi, can be found at the address:
Let’s install the library, with the command:
apt-get install python-picamera
In the following examples we will concentrate on recognizing faces and body parts, using the “HaarCascades” that are available. Which ones are they? They are the ones that the OpenCV and SimpleCV developers have already trained and made available with both libraries.
We find them in the folder:
A couple of warnings, in case you run into problems. First, note that the SimpleCV library is installed under the python2.7 folder, so you have to make sure you are using this Python version to follow the examples. If you launch the examples under python3 you will probably get errors. Secondly, to reach an .xml file in the said folder the name alone is usually enough, e.g. “face”, as most of the examples that can be downloaded from the Internet show. If this gives you errors, you only need to use the complete path. In our case we found that naming the “Features” with their extension, e.g. “face.xml”, works fine. The said folder contains the following files, which may differ if you installed a SimpleCV version other than the 1.3 version we used:
eye.xml fullbody.xml mouth.xml two_eyes_big.xml
face2.xml glasses.xml nose.xml two_eyes_small.xml
face3.xml left_ear.xml profile.xml upper_body2.xml
face4.xml left_eye2.xml right_ear.xml upper_body.xml
face_cv2.xml lefteye.xml right_eye2.xml
face.xml lower_body.xml right_eye.xml
Each .xml file contains the features produced by a training process aimed at identifying one particular detail, and we can use it directly to process our images. The file names recall the detail they are able to recognize. Certain body parts are covered by more than one version. Some “Features” are more selective than others, some are better at detecting very obvious shapes within the image, and others are better suited to recognizing details in smaller images. In real applications several feature files are applied to the same picture, so as to “try them all”. In very particular cases, or to recognize objects other than faces and body parts, you need to create your own Features files through the appropriate learning process. But that is another story.
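As a first taste of how these files are used, here is a minimal sketch based on SimpleCV’s `findHaarFeatures` method. The helper names, the optional `features_dir` argument and the image file name are hypothetical; the helper simply follows the advice above of naming the file with its extension and falling back to a complete path when the name alone is not enough. It is meant to run under python2.7 with SimpleCV installed.

```python
import os


def cascade_file(name, features_dir=None):
    """Return the Features file name with its .xml extension,
    as a complete path when a folder is given."""
    if not name.endswith(".xml"):
        name += ".xml"
    if features_dir:
        return os.path.join(features_dir, name)
    return name


def find_faces(image_path, cascade="face.xml", features_dir=None):
    """Return a list of (center_x, center_y, width, height), one per match."""
    # Imported here so that cascade_file() can be used without SimpleCV.
    from SimpleCV import Image
    image = Image(image_path)
    found = image.findHaarFeatures(cascade_file(cascade, features_dir))
    boxes = []
    if found:  # nothing is returned when no match is found
        for feature in found:
            boxes.append((feature.x, feature.y,
                          feature.width(), feature.height()))
    return boxes
```

A call such as `find_faces("photo.jpg")`, where `photo.jpg` is a placeholder for one of your own pictures, would list the faces found with `face.xml`. Passing another name from the list above, for example `cascade="nose"`, switches to a different Features file.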