An outline of an intelligent video surveillance system able to detect and identify abnormal and alarming situations by analyzing object movement. The system is designed to minimize video processing and transmission, allowing large numbers of cameras to operate as an integrated safety and security solution in smart cities.
by Lorena Calavia et al., Universidad de Valladolid
Telecommunications
Advances in Information and Communication Technologies (ICTs) are triggering a transformation of the environments where we live into intelligent entities globally known as smart spaces (smart homes, smart buildings, smart cities, etc.). These spaces capture information using large sensor networks distributed throughout their domain (a house, a building, a whole city, etc.) and use it to intelligently adapt their behavior to the needs of their users.
Additionally, modern society is experiencing an increasing interest in safety and security, which has resulted in numerous wide-area deployments of video surveillance systems. When integrated within Smart Spaces, these video sensor networks give the intelligent system the ability to watch the environment and, combined with its intelligence, to detect and identify the different abnormal situations that may arise inside the smart domain. There is currently a wide range of video surveillance systems used in fields such as intrusion detection and traffic surveillance. However, autonomous detection of alerts and abnormal situations is still at a primitive stage.
Automatic object recognition is therefore a hot topic with a substantial body of literature behind it. Once a system is capable of identifying objects, artificial intelligence (AI) and video interpretation algorithms can detect abnormal behaviors of those objects, mainly using two different strategies: statistical and semantic analysis. Statistical approaches process the visual information mathematically; for instance, work applying Latent Semantic Analysis to video surveillance presents probabilistic models in which statistical classification and relational learning are used to identify recurrent routines.
Autonomous
In order to design and develop an intelligent, autonomous video interpretation system based on semantic reasoning, to be deployed on a massive video surveillance network with thousands of cameras covering an area the size of a city, the objectives would be to:
1. Develop an algorithm for the interpretation of video scenes and the identification of alarm situations.
2. Give rich, human-level information about each alarm. It is not enough to say that there has been an alarm; the system should say, for instance, that there has been a car crash.
3. Operate on large numbers of cheap cameras to allow a wide-area deployment. This means cameras will not have enough processing power to run complex object identification algorithms, and that it is impossible for every camera to send a detailed video signal to the control room for real-time analysis. This is the scenario found in Smart City deployments, i.e., intelligent urban-scale systems.
4. Operate in all the different knowledge domains related to surveillance in the Smart City scenario. For instance, the system should be able to handle traffic control, fire alarms, crowd control and vandalism detection.
Current state-of-the-art surveillance systems are based either on statistical analysis of image features or on the hard-coded interpretation of object identification. Statistical alarm detection simply identifies abnormal behaviors, understanding abnormal as things that do not happen frequently according to a certain mathematical criterion, so it is impossible for such systems to fulfill requirement number 2 above. Systems based on hard-coded interpretations work on the basis of a hard-coded rule engine and do not usually make use of formal semantic technologies. This means that porting the system from one domain to another, if feasible at all, requires manually modifying the specific implementation of its algorithms.
Object identification
For instance, a video surveillance system for traffic control based on the identification of cars would require a complete change of its object identification algorithms in order to operate in the vandalism detection domain. Therefore, this kind of system is not suitable for the multi-domain scenario specified by requirement number 4.
Formal semantic technologies based on ontologies would allow an easy, fast and flexible specification of different operation domains by switching to the appropriate ontology, but their application today is only feasible in environments where cameras have powerful processors and are capable of running complex object identification algorithms.
However, design goal number 3 is not compatible with this computing power requirement. This limitation is imposed when operating with a huge number of cameras: embedding powerful processors in all of them would be too expensive, and sending the entire video signal to one control center would require an enormous amount of bandwidth.
A system capable of fulfilling all four design goals can instead base image interpretation on semantic reasoning, making it possible to change the application domain easily (requirement number 4) by specifying an appropriate ontology. Additionally, semantic reasoning is performed at a high level of abstraction using human concepts (such as "a car should move along a lane" or "there is an alarm if a car is located on a sidewalk"), thus fulfilling requirement number 2.
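To make this concrete, such human-level concepts can be captured as declarative rules over the ontology. The sketch below uses the rule syntax of the Jena framework (the semantic toolkit this system employs, as described later); the URIs, class names and property names are purely illustrative:

```
// Hypothetical traffic-domain rule: a car located on a sidewalk
// constitutes an alarm situation. URIs are illustrative only.
[carOnSidewalk:
    (?c rdf:type <http://example.org/traffic#Car>)
    (?c <http://example.org/traffic#isLocatedOn> ?z)
    (?z rdf:type <http://example.org/traffic#Sidewalk>)
    ->
    (?c rdf:type <http://example.org/traffic#AlarmSituation>)
]
```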
Computing power
However, as mentioned, performing this kind of semantic reasoning directly over the video signal would require a lot of computing power, mainly to identify all the objects in the image, which would conflict with requirement number 3. Therefore, the system proposed in this work does not perform object identification directly over the video stream. Instead, the video is preprocessed in a first stage to extract information about moving objects, their size and their trajectories. With that information a path model of the scene is created for each camera (training mode), and objects are identified on the basis of their movement parameters (operation mode). Finally, semantic reasoning is performed over this interpretation.
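In operation mode, this identification step can be as simple as matching each tracked object's corrected size and speed against the learned route model. The following is a purely illustrative sketch; the thresholds, class names and categories are hypothetical, not those of the actual system:

```java
/**
 * Illustrative sketch only: classify a moving object from its movement
 * parameters instead of analyzing pixels. Thresholds and categories are
 * hypothetical, not taken from the real system.
 */
public final class MovementClassifier {

    public enum Category { PEDESTRIAN, CAR, UNKNOWN }

    /**
     * @param areaM2   apparent object size after perspective correction, in square metres
     * @param speedMps average speed along the object's trajectory, in metres per second
     * @param onRoute  whether the trajectory follows a learned route of the scene
     */
    public Category classify(double areaM2, double speedMps, boolean onRoute) {
        if (!onRoute) {
            return Category.UNKNOWN;      // off-route movement: left to the semantic reasoner
        }
        if (areaM2 < 1.0 && speedMps < 3.0) {
            return Category.PEDESTRIAN;   // small and slow
        }
        if (areaM2 >= 2.0 && speedMps >= 3.0) {
            return Category.CAR;          // large and fast
        }
        return Category.UNKNOWN;
    }
}
```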
One advantage of this approach is that it facilitates real-world deployments of dense networks with many cameras by simplifying the camera-specific calibration and configuration stage. The path model of each scene is built automatically by the system during the learning stage using an unattended route detection algorithm, and the ontology employed is not camera-specific but domain-specific: concepts defining, for instance, the normal behavior of cars (like "cars should be on the road") are always the same, regardless of the specific road a camera is watching. This means that there is a single ontology for each surveillance domain (traffic control, fire detection, perimeter surveillance, etc.), shared by all cameras watching a scene related to that domain. Thanks to this, the only two camera-specific configuration operations required by this approach are: (1) recording camera height and tilt angle at installation time (parameters that will help correct perspective distortion); and (2) selecting the surveillance domain(s) to which each camera is applied (that is, selecting the appropriate ontology or ontologies for each camera).
It is worth mentioning that ontology building is a discipline in its own right, with a well-defined set of procedures, tools and best practices. An ontology built for alarm detection should not be significantly different from ontologies designed for other purposes.
The proposed system is based on a three-stage architecture shown in Figure 1. The first stage of the algorithm is implemented on each sensor camera, and the second and third are located in the system’s control center. These three modules are as follows:
Sensing
A sensor network including smart surveillance cameras and other sensors (fire and movement detectors, for instance) is connected to the control center. Cameras run motion detection algorithms to transform the video stream into data packets (specifically, XML files) that contain information about the different moving objects (speed, position, size, etc.).
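The exact schema of these packets is not specified here, but a packet of this kind might look as follows (element and attribute names are invented for illustration):

```xml
<!-- Hypothetical example; the real schema is not given in this text -->
<frames camera="cam-042">
  <frame timestamp="2012-05-10T10:15:02.120Z">
    <object id="7" x="312" y="188" width="54" height="31"
            speed="8.4" direction="87.5"/>
    <object id="9" x="120" y="240" width="12" height="28"
            speed="1.2" direction="182.0"/>
  </frame>
</frames>
```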
Route detection
Once the XML file with the data of the moving objects is available, the trajectories and movement patterns of the different objects are processed using an algorithm that builds, for each camera, a route model of the scene (zones of the image where objects usually move) enriched with object sources and sinks (zones of the image where objects usually appear or disappear). Route Detection is implemented with two internal submodules. First, the Frame Preprocessor receives from the camera an XML file with the motion parameters of the detected objects. It separates the aggregated data into individual frames (a single file can aggregate several frames to optimize communications), corrects the perspective distortion by applying a simple Inverse Perspective Mapping based on the height and tilt angle of the source camera, and reformats the information into a raw data matrix. From this data matrix, the Route Detection Algorithm, a set of routines implemented in Matlab, determines the routes of the scene. Route Detection is performed only when the system is in training mode.
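For reference, a flat-ground Inverse Perspective Mapping of the kind the Frame Preprocessor could apply amounts to intersecting each pixel's back-projected ray with the ground plane. A minimal sketch follows, assuming a pinhole camera at a known height, tilted down by a known angle; the class and parameter names are illustrative:

```java
/**
 * Minimal flat-ground Inverse Perspective Mapping sketch, assuming a
 * pinhole camera at height h (metres) tilted down by tiltRad about its
 * horizontal axis. Pixel offsets (u, v) are measured from the principal
 * point, v positive downwards; f is the focal length in pixels. Names
 * are illustrative, not taken from the original system.
 */
public final class InversePerspectiveMapping {
    private final double h, tiltRad, f;

    public InversePerspectiveMapping(double cameraHeightM, double tiltRad, double focalPx) {
        this.h = cameraHeightM;
        this.tiltRad = tiltRad;
        this.f = focalPx;
    }

    /** Maps an image point to ground-plane coordinates {lateral x, forward z} in metres. */
    public double[] toGround(double u, double v) {
        // Downward component of the back-projected ray after tilting the camera.
        double denom = v * Math.cos(tiltRad) + f * Math.sin(tiltRad);
        if (denom <= 0) {
            return null;                  // ray points at or above the horizon
        }
        double scale = h / denom;         // ray parameter at ground intersection
        double x = scale * u;                                          // lateral offset
        double z = scale * (f * Math.cos(tiltRad) - v * Math.sin(tiltRad)); // forward distance
        return new double[] {x, z};
    }
}
```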
Semantic reasoning
This stage is only performed in operation mode, once the Route Detection stage has finished and the route model of the scene is complete. Its aim is to translate the syntactic parameters of the objects, routes, sinks and sources obtained by the cameras and the Route Detection stage into meaningful semantic classes ("car" instead of "object") and to identify any alert situation ("a car is on the sidewalk") according to the ontology and semantic rules (a formal knowledge model specified by a human ontology engineer).
The Semantic Translation submodule translates the syntactic information into formal semantic data (according to Semantic Web standard formats) and populates the ontology with it, using the Jena framework, which handles all the semantic operations from within Java. After the translation, the Alarm Detection submodule processes the newly populated ontology with a semantic reasoner to infer new properties about the objects in the image and, specifically, to identify whether an alarm situation is taking place. If it is, an appropriate XML alarm message is sent.
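As an illustration of this population-plus-inference step, a compact sketch using the Jena framework follows. The namespace, class names and rule are hypothetical, and the imports assume a recent Apache Jena release:

```java
import java.util.List;
import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;
import org.apache.jena.vocabulary.RDF;

public final class AlarmDetectionSketch {
    // Hypothetical namespace; the real ontology URIs are not given in the text.
    private static final String NS = "http://example.org/traffic#";

    public static void main(String[] args) {
        // 1. Semantic Translation: populate the model with observed facts.
        Model model = ModelFactory.createDefaultModel();
        Resource car = model.createResource(NS + "object7");
        car.addProperty(RDF.type, model.createResource(NS + "Car"));
        car.addProperty(model.createProperty(NS + "isLocatedOn"),
                        model.createResource(NS + "zone3"));
        model.createResource(NS + "zone3")
             .addProperty(RDF.type, model.createResource(NS + "Sidewalk"));

        // 2. Alarm Detection: run a rule-based reasoner over the populated model.
        List<Rule> rules = Rule.parseRules(
            "[carOnSidewalk: (?c rdf:type <" + NS + "Car>) "
          + "(?c <" + NS + "isLocatedOn> ?z) (?z rdf:type <" + NS + "Sidewalk>) "
          + "-> (?c rdf:type <" + NS + "AlarmSituation>)]");
        InfModel inf = ModelFactory.createInfModel(new GenericRuleReasoner(rules), model);

        // 3. Check whether an alarm situation was inferred.
        boolean alarm = inf.contains(car, RDF.type, inf.createResource(NS + "AlarmSituation"));
        System.out.println(alarm ? "ALARM: car on sidewalk" : "no alarm");
    }
}
```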
Deployment
Because it is based on semantic reasoning over a formal knowledge model specified by a human ontology engineer, the system is suitable for deployment in a wide range of different environments simply by switching to the appropriate ontology and ruleset.
While some statistical-analysis systems that are not domain-specific may operate in different domains without any kind of adaptation, they present several disadvantages with respect to the solution presented here: their behavior is difficult to predict (since they are based on mathematical analysis of the video parameters) and is not related to the semantics of the watched scene; they are unable to discriminate among different types of alarms; and they cannot give rich information about them. They are therefore, in general, unsuitable for deployment within Smart Spaces.
In short, the semantic-based approach for the detection of alarms in video surveillance systems presented in this paper offers a number of important advantages that are not available in other solutions in the current state of the art.