Event Camera: the Next Generation of Visual Perception System


In 2022, it seems computer vision is mature enough to enable intelligence in all sorts of visual perception systems. Image classification, object detection, semantic segmentation, depth estimation... pass any of these keywords to a search engine and we will get tons of papers, code repositories, and pre-trained models that are readily available.
To get a rough idea of the future of computer vision, let's look at the number of submissions to CVPR, the top-tier conference in computer vision. Submission numbers over the last ten years are plotted below, with submission numbers of NeurIPS as a reference. NeurIPS is a top-tier conference for all fields of artificial intelligence, including computer vision. NeurIPS kept receiving record-breaking submissions until 2021, when the submission number went down for the first time, indicating an upper limit of about 10,000 submissions. Extrapolating the trend, CVPR submissions would become saturated in the following 3∼5 years.

Then, what's next for computer vision in the following 3∼5 years? This is the question I have been thinking about for two years. From my observation, there are mainly three directions:
  • Develop magic applications based on mature solutions to basic computer vision problems. Mathpix is a representative example in this direction. I was very surprised to learn that Mathpix was built as an image captioning system. My stereotype of image captioning is generating free-form text from images. Previously I thought it was impossible for a neural network to generate compilable LaTeX code, considering the pain I suffered from the complicated LaTeX grammar. Nevertheless, Mathpix managed it, without any exotic techniques, using only ordinary tools in computer vision like Transformers and image captioning networks. (A public research paper [1] on Image2LaTeX is available. Hopefully someday we can enjoy technical reports from Mathpix.) Therefore, I believe creative minds can continue to surprise us with smart combinations of mature techniques in computer vision.
  • Fix problems in existing deep-learning-based computer vision techniques. Every researcher in deep learning is more or less aware of the drawbacks of deep neural networks: they behave in hard-to-interpret ways, are vulnerable to adversarial attacks, are biased by prejudice in the training data, and are hungry for humongous amounts of data... Addressing these issues is important as deep-learning-based systems enter our daily life. Governments, researchers, and companies are working together to reduce the negative impact of artificial intelligence, by clarifying the ownership of data and the boundary of privacy, and by demanding explainable decisions in critical scenarios. These problems are not only very important but also very difficult. In the future, maybe some intelligent minds will come up with nice solutions to them.
  • Build the next generation of visual perception systems for cases where current computer vision techniques are incompetent. When I say computer vision techniques are mature, I implicitly mean in controllable cases with high-quality data. Notable examples are face identification and indoor surveillance. In face identification, we can guide users to stay still to obtain good images; in indoor surveillance, lighting conditions are mild and the environment is not very complicated. However, when deployed to in-the-wild scenarios, computer vision techniques face many unexpected problems. Take the important auto-driving scenario for example: a visual perception system for auto-driving needs to deal with fast motion (of course cars move at very high speed), dim or bright light (driving at night or towards the sun), and the sudden appearance of pedestrians or cars. Sadly, contemporary computer vision techniques are incompetent to address the aforementioned problems. Besides auto-driving, there are lots of scenarios where computer vision can be improved: industrial vision, medical imaging... Considering the prevalence and importance of cars, I think auto-driving will be the next scenario with broad application of visual perception techniques. In my opinion, improving computer vision to a level that satisfies the requirements of auto-driving is definitely a goal for computer vision in the following years.
In auto-driving, the two mainstream solutions are multi-camera perception (fusing information from multiple commodity cameras to understand the environment) and multi-sensor perception (fusing information from multiple types of sensors to model the environment). In terms of multi-sensor perception, researchers have explored many types of sensors over the past years, including LIDAR, ToF, dToF, RGBD cameras, and so on. In this blog, I would like to share what I recently read about: a new sensor called the event camera, with fascinating advantages for auto-driving systems. A friend of mine introduced the event camera to me, and I found it interesting after consulting several tech leaders. The main reference materials of this blog are a survey paper [2] written by the team who invented the event camera, and a website [3] from Prophesee (the leading supplier of event cameras).

Does the Human Eye Produce Frames?

The prevalent commodity camera follows a pinhole camera model. Periodically (e.g., every 1/30 second), all pixels take a global shutter and report their sensed light values, producing an image frame. We call such devices "frame-based cameras".

The pinhole camera model

Does the human eye follow a frame-based paradigm? It seems too naive a question: our brain can think and imagine in pictures, so of course our eye works in a frame-based paradigm. However, after a simple back-of-the-envelope calculation, I found that the answer may be counter-intuitive. The human retina has a resolution of about 5×10^8 pixels with a refresh rate of about 60 Hz; if the human eye were frame-based, there would be at least 3×10^10 neural spikes every second. Considering that each spike costs about 6×10^-10 joules of energy [4], just transmitting the frames from the retina to the brain would require a power of 18 watts, not to mention the processing of frames. To see how ridiculous the frame-based assumption is, note that the total power of the human brain is about 20 watts [5]. If we insisted that the human eye produces frames, then most of the brain's energy (90%) would be consumed by the retina, which is insane. In fact, the image frame is an illusion dreamed up by the brain, not the working mechanism of the retina.
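
The arithmetic above can be checked in a few lines. The figures below are the rough estimates quoted in the text (and in [4], [5]), not measurements:

```python
# Back-of-envelope check: power required if the retina streamed frames.
pixels = 5e8          # approximate retinal resolution (photoreceptors)
refresh_hz = 60       # assumed frame rate
spikes_per_second = pixels * refresh_hz          # at least one spike per pixel per frame
joules_per_spike = 6e-10                         # energy per spike [4]
power_watts = spikes_per_second * joules_per_spike

print(f"{spikes_per_second:.0e} spikes/s -> {power_watts:.0f} W")  # 3e+10 spikes/s -> 18 W

brain_power_watts = 20   # total power budget of the brain [5]
print(f"{power_watts / brain_power_watts:.0%} of the brain's budget")  # 90%
```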

From Silicon Retina to Event Camera

The first successful reproduction of the human retina in silicon circuits was the Silicon Retina [6], reported in Scientific American in 1991. Taking advantage of a circuit that senses changes in intensity, the authors built a 2500-pixel (50 by 50) sensor which was subject to human-like illusions such as afterimages and the Hermann grid.

Hermann grid illusion. Dark blobs appear at the intersections, but they disappear when zoomed in.

The silicon retina was a huge breakthrough in the 1990s, but its pixels were too big and noisy to be of practical use. The turning point came with the European project CAVIAR, when Prof. Tobi Delbruck proposed the circuit of the dynamic vision sensor. The improved circuit made it possible to steadily increase the resolution from 64x64 [7] to 128x128 [8], 640x480 (VGA resolution) [9], and currently to 1280x720 (720p resolution) [10].
During the development of the event camera, the metaphor "silicon retina" was replaced by "event camera" to contrast with the commodity frame-based camera. Just as the cameras in our smartphones are built from active pixel sensors, event cameras are built from dynamic vision sensors. Therefore, the event camera is sometimes called a dynamic vision sensor (DVS).

Characteristics of Event Camera

According to Wikipedia [11], our daily camera originated in the 1960s, and it took 40 years to evolve into its mature form of the active pixel sensor (the first CMOS image sensor, IMX001, was launched in 2000). By contrast, the event camera (or silicon retina) was proposed in the 1990s and manufactured in the 2010s, a fast industrialization. The force pushing the prototypical circuit to industrialization comes from the attractiveness of the event camera's unique selling points. There are three well-recognized advantages of event cameras:
  • High temporal resolution. Pixels in event cameras respond independently to intensity changes, removing the need for synchronization. As a result, the response time of a pixel can be as short as 1 microsecond (1 μs = 10^-6 s)! By contrast, the framerate of commodity cameras is about 30∼60 fps, with response times on the scale of 10 ms = 10^-2 s.
  • High dynamic range. Because pixels in an event camera measure the amount of (logarithmic) intensity change, they can work in extremely dim or bright conditions. In the literature, the minimum dynamic range of event cameras is 120 dB, which means the brightest light they can sense is 10^6 times brighter than the dimmest light they can sense. By contrast, the dynamic range of commodity frame-based cameras is no more than 80 dB, a ratio of 10^4 between the brightest and the dimmest light.
  • Low power consumption. In an event camera, when a pixel senses no change in the incident light, it simply does nothing. In consequence, pixels only produce signals when necessary, reducing the power consumption of the whole camera. 10 mW is the typical power of event cameras, while the power consumption of commodity cameras is on the scale of watts.
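
To make the "logarithmic intensity change" idea concrete, here is a minimal sketch of how a single DVS pixel emits events. This is my own illustration of the principle, not a circuit model; the function name and threshold value are made up:

```python
import math

def pixel_events(intensities, times, C=0.3):
    """Yield (timestamp, polarity) events for one pixel: an event fires
    whenever log intensity moves more than a contrast threshold C away
    from the last reference level."""
    ref = math.log(intensities[0])
    events = []
    for I, t in zip(intensities[1:], times[1:]):
        while math.log(I) - ref >= C:      # brightness increased -> ON event
            ref += C
            events.append((t, +1))
        while ref - math.log(I) >= C:      # brightness decreased -> OFF event
            ref -= C
            events.append((t, -1))
    return events

# A pixel watching intensity double emits about log(2)/C ≈ 2 ON events,
# timestamped with microsecond precision; a constant input emits nothing.
print(pixel_events([100, 200], [0.0, 1e-6]))  # [(1e-06, 1), (1e-06, 1)]
print(pixel_events([100, 100], [0.0, 1e-6]))  # []
```

Note how both advantages fall out of this design: timestamps are as fine as the pixel can react (high temporal resolution), and the log scale compresses a huge intensity range into a fixed threshold (high dynamic range).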

Application of Event Camera

The last several years have witnessed a surge of event camera applications. As shown in the following figure, the number of papers on event cameras published at computer vision and robotics venues has grown exponentially year over year.

Papers on event camera published during the last 6 years at Computer Vision and Robotics venues (PAMI, IJCV, CVPR+W, ECCV+W, ICCV+W, BMVC, WACV, 3DV, ICIP, TRO, IJRR, RAL, IROS+W, ICRA+W, RSS, ICCP, ICASSP, CoRL)

Applications of event cameras can be categorized into three aspects according to the characteristic they depend on:
  • Frame interpolation, optical flow estimation, motion deblurring, and high-speed recording. These applications rely on the high temporal resolution of event cameras to enhance temporal information. The 1 μs response time makes data from an event camera almost continuous in time, breaking the limit of framerate. For example, this fascinating group uses event cameras to manipulate microfluids at high speed.
  • Auto-driving in poor light conditions can benefit from the high dynamic range of event cameras. This is not a product yet, but active research is going on, including several datasets (DDD17 [12], MVSEC [13]) designed for auto-driving with event cameras.
  • Space situational awareness and wake-up word detection/gesture recognition in embedded devices. These applications have a limited energy budget and can benefit from the low-power characteristic of event cameras. For example, every watt matters for satellites orbiting the Earth because we cannot replace their batteries. In this vein, researchers have launched a satellite with an event camera to sense space debris and avoid collisions.
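
As a concrete example of what these downstream algorithms consume: an event stream is essentially a list of (x, y, timestamp, polarity) tuples, and a common first step is to accumulate a time window of events into a frame-like image that ordinary computer vision pipelines can digest. A sketch, with a made-up data layout:

```python
def accumulate(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over a time window, producing a
    frame-like 2D image from an asynchronous event stream."""
    frame = [[0] * width for _ in range(height)]
    for x, y, t, p in events:
        if t_start <= t < t_end:
            frame[y][x] += p
    return frame

# Three events on a 2x1 sensor; the last one falls outside the window.
events = [(0, 0, 1e-6, +1), (1, 0, 2e-6, -1), (0, 0, 9e-6, +1)]
print(accumulate(events, width=2, height=1, t_start=0.0, t_end=5e-6))  # [[1, -1]]
```

Because the window boundaries are free parameters rather than a fixed shutter period, the same stream can be sliced at any effective framerate, which is what frame interpolation and high-speed recording exploit.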

Future Directions of Event Camera

After reading all the materials, I am convinced that frame-based cameras and event cameras are two different ways of perceiving the world. At present, techniques for frame-based cameras are very mature after decades of development. It is fair to say that almost all the problems of event cameras stem from their immaturity. In the coming decades, there is hope that event cameras will become mature enough to be mass-produced, to have dedicated algorithms, and to show up in widely used products.


[1] Deng et al., Image-to-markup generation with coarse-to-fine attention, ICML 2017
[2] Gallego et al., Event-based Vision: A Survey, TPAMI 2020
[4] Lennie et al., The cost of cortical computation, Current biology 2003
[5] Balasubramanian et al., Brain power, PNAS 2021
[6] Mahowald et al., The Silicon Retina, Scientific American 1991
[7] Lichtsteiner et al., A 64x64 AER logarithmic temporal derivative silicon retina, Research in Microelectronics and Electronics 2005
[8] Lichtsteiner et al., A 128x128 120 dB 15us latency asynchronous temporal contrast vision sensor, IEEE journal of solid-state circuits 2008
[9] Son et al., A 640×480 dynamic vision sensor with a 9µm pixel and 300Meps address-event representation, IEEE International Solid-State Circuits Conference (ISSCC) 2017
[10] Finateu et al., A 1280x720 back-illuminated stacked temporal contrast event-based vision sensor with 4.86 μm pixels, 1.066 GEPS readout, programmable event-rate controller and compressive data-formatting pipeline, IEEE International Solid-State Circuits Conference-(ISSCC), 2020
[12] Binas et al., DDD17: End-To-End DAVIS Driving Dataset, ICML workshop 2017
[13] Zhu et al., The Multivehicle Stereo Event Camera Dataset: An Event Camera Dataset for 3D Perception, IEEE Robotics and Automation Letters 2018
