FPGAs improve vision processing
As higher-resolution cameras and faster frame rates push data rates beyond the processing capabilities of many host PCs, acceleration hardware can make up the shortfall.
By Kumara Ratnayake, Dalsa -- Test & Measurement World, 12/1/2007 2:00:00 AM
Machine-vision systems have taken advantage of host PC computing capabilities to handle image processing. Today’s applications, however, are pushing the envelope of what a host processor can accomplish as data rates increase and host-processor performance reaches hard limits.
Data rates in vision systems have increased significantly as a result of image sizes that extend beyond 16 Mpixels (4 kpixels x 4 kpixels) and frame rates that reach 1000 fps for high-speed motion capture. Even at the modest 30 fps of standard video, the higher-resolution images yield typical system data rates of nearly 500 Mpixels/s. Simple 8-bit monochrome systems therefore feature a data rate of about half a gigabyte per second, and color systems, which typically use 24 to 48 bits per pixel (3 to 6 bytes), quickly reach the multi-gigabyte-per-second range.
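The arithmetic behind these figures can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and "4 kpixels" is taken as 4000 pixels per side (an assumption chosen to match the "nearly 500 Mpixels/s" figure).

```python
def raw_data_rate(width, height, fps, bits_per_pixel):
    """Return (pixels/s, bytes/s) for an uncompressed image stream."""
    pixels_per_s = width * height * fps
    bytes_per_s = pixels_per_s * bits_per_pixel / 8
    return pixels_per_s, bytes_per_s


if __name__ == "__main__":
    # Assumption: a 4 kpixel x 4 kpixel sensor is modeled as 4000 x 4000.
    for bpp in (8, 24, 48):
        pixels, nbytes = raw_data_rate(4000, 4000, 30, bpp)
        print(f"{bpp:2d} bits/pixel: {pixels / 1e6:.0f} Mpixels/s, "
              f"{nbytes / 1e9:.2f} Gbytes/s ({nbytes * 8 / 1e9:.2f} Gbits/s)")
```

With these inputs the 8-bit case works out to 0.48 Gbytes/s, and the 24-bit and 48-bit cases to 1.44 and 2.88 Gbytes/s.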
Such rates stretch host PC performance well past the breaking point. CPU clock speeds are topping out at 5 GHz due to power concerns, and only a portion of total processing power is available for vision tasks, because the host PC must also run the operating system and other applications software.
The mismatch between rising data rates and host PC performance limits points to a need for system acceleration. Switching to a high-performance bus such as PCI Express eliminates one bottleneck, but the processor’s limits are not so readily overcome. Even with the highest-performing CPUs, a host PC can only handle relatively simple image-processing functions in real time, and even those need to be at the lower end of the data rate range.
One way to accelerate a vision system and overcome the limitations of CPUs is to employ machine-vision hardware that makes use of field-programmable gate arrays (FPGAs). Many FPGAs also offer the added benefit of flexibility: Because you can reprogram them, you can use a single vision system for multiple applications.
Fixed designs not cost-effective
Each inspection application dictates how much computational power, I/O bandwidth, latency, and determinism a machine-vision system will need. Designing a system to handle only a single application is typically not cost-effective (see “Flexibility for computation and I/O”). A system capable of handling many different applications will have a larger market, resulting in lower production costs. In addition, the generalized system is easier and less expensive for users to adapt to changing requirements.
When building a vision system, therefore, developers need to keep the design flexible while boosting computational power and I/O bandwidth and controlling latency and determinism. The commonly available choices are to use faster clocks, to switch to a digital signal processor (DSP) architecture, or to use multicore or multiple processors in one of several architectures.
Each of these options has drawbacks. The single-processor approaches, even using a DSP architecture, still face fundamental performance limitations. Processors are also limited by their sequential nature. Only by increasing clock speed can such sequential operations be made faster, but increasing clock speed also increases the power consumption of the logic.
The single processor must also share its capability among several tasks, including running the operating system. While the image-processing tasks may follow a schedule, other system tasks may not. Thus, the need to share processing capability among multiple unscheduled tasks compromises the determinism of image-processing task execution. The need for sharing also applies to the processor’s I/O bus, which must handle all of a processor’s peripherals and memory access. Sharing of the bus limits the processor design’s I/O bandwidth.
A multicore or multiple-processor approach—dedicating one or more processors to the imaging tasks alone—can minimize some of these drawbacks. Possible multiprocessor architectures include cascaded and parallel structures (Figure 1).
Figure 1. Using multiple processors can help boost system performance, but (top) cascaded and (bottom) parallel architectures address different system requirements, and increasing processor count may not increase performance proportionally.
In a cascaded approach, each processor in a series handles a portion of the imaging task, then passes the results to the next processor. Memory buffers between processors help accommodate the timing differences for each step. This approach can be extended to achieve the computational speeds required, but each extension adds cost and latency to a system.
A parallel-processing implementation separates image data into blocks, processing each block in the same way in its own processor. This approach is limited by cost and board space but in theory can be extended as far as one processor per pixel. While such an approach helps minimize latency, it is not suitable for functions such as feature extraction, which are extremely difficult to implement with this kind of block-level parallelism.
Acceleration outperforms multiprocessing
Even with the benefits gained by using multiple processors, a system’s performance is still limited by the sequential nature of processors. An N-fold increase in processor count yields no more than an N-fold increase in performance—in fact, the performance increase is often less because of the overhead required for coordinating processor operations. An alternative approach is to combine the host processor with a coprocessor that uses dedicated parallel logic rather than sequential code execution. Such a processing accelerator can provide substantially greater computational performance increases than conventional processors.
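One common way to illustrate why the gain falls short of N-fold is Amdahl's law. The article does not name this model; it is swapped in here as a rough sketch, with the non-parallel fraction standing in for coordination overhead. The function name and the 10% figure are assumptions for illustration.

```python
def amdahl_speedup(n_processors, serial_fraction):
    """Ideal speedup when only (1 - serial_fraction) of the work scales."""
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)


if __name__ == "__main__":
    # Assumption: 10% of the work cannot be spread across processors.
    for n in (1, 2, 4, 8, 16):
        print(f"{n:2d} processors: {amdahl_speedup(n, 0.10):.2f}x")
```

Under that assumption, 16 processors deliver a speedup of about 6.4 rather than 16, and no processor count can exceed 10.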
One way of creating dedicated coprocessing logic is to develop a custom ASIC. A common example of a coprocessing ASIC is a discrete cosine transform (DCT) engine, which is used to speed system operation in image-compression applications.
Dedicated logic designs can offer substantial performance improvements over processors. Execution of a 5x5 convolution, for instance, can be implemented using 25 multiply-and-accumulate (MAC) structures in parallel, producing a full result after each clock cycle rather than requiring several clock cycles for each step. As a result of this parallelism, the clock speeds needed to achieve a given performance level are substantially less than for processors, with a corresponding reduction in heat generation. The dedicated nature of the logic also ensures that the results are deterministic, and by not having to manage external peripherals to achieve its functions, the ASIC’s pins can be dedicated to providing high I/O bandwidth where needed.
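A software model makes the 25-MAC structure concrete. The sketch below is not hardware code: it computes a 5x5 filter one multiply-and-accumulate at a time, the way a sequential processor would, and the helper `cycles_per_pixel` is a hypothetical, idealized count that ignores memory access and pipeline fill.

```python
import math

MACS_PER_PIXEL = 5 * 5  # one multiply-and-accumulate per kernel tap


def convolve_5x5(image, kernel):
    """5x5 filter over the valid region (kernel applied without flipping)."""
    height, width = len(image), len(image[0])
    out = [[0] * (width - 4) for _ in range(height - 4)]
    for y in range(height - 4):
        for x in range(width - 4):
            acc = 0
            for ky in range(5):          # these 25 steps run one after
                for kx in range(5):      # another here; dedicated logic
                    acc += image[y + ky][x + kx] * kernel[ky][kx]  # does all at once
            out[y][x] = acc
    return out


def cycles_per_pixel(macs_per_cycle):
    """Idealized cycles per output pixel for a given number of MAC units."""
    return math.ceil(MACS_PER_PIXEL / macs_per_cycle)


if __name__ == "__main__":
    ones = [[1] * 5 for _ in range(5)]
    print(convolve_5x5(ones, ones))      # a single output pixel
    print(cycles_per_pixel(1), cycles_per_pixel(25))
```

In this idealized count, one MAC unit needs 25 cycles per output pixel and 25 parallel units need one, which is why the dedicated logic can run at a much lower clock rate for the same throughput.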
What ASICs gain in performance, however, they lose in flexibility. Because their logic is fixed, they are not readily adapted to new requirements. The creation of additional ASIC designs is not practical because of cost. Development of an ASIC, even a derivative one, can require more than a year of design time and hundreds of thousands to millions of dollars in production startup costs.
Despite having some drawbacks, FPGAs offer a cost-effective and flexible alternative to creating custom logic. In terms of performance, FPGAs are typically slower than comparable ASICs, and they also face restrictions in the amount of logic they can embody. While ASICs can be as large as needed to achieve their function, FPGAs typically are available only in set sizes. A design either fits in a given FPGA or it doesn’t, and FPGAs typically have much smaller logic capacity than ASICs of the same die area. In the past, such limitations severely reduced the applicability of FPGAs to machine vision.
But recent technology improvements have substantially reduced these limitations. One improvement has been in process technology. Through 65-nm process lithography, newer FPGAs offer greatly increased speed and logic capacity. The equivalent of several million logic gates and clock speeds of several hundred megahertz are now available.
Another improvement has been the introduction of hard logic cores into the FPGA device. Hard cores, such as DSP blocks and double data rate (DDR) memory interfaces, carry the full performance advantages of ASIC designs, while the surrounding programmable fabric provides design flexibility.
Flexibility a hallmark of FPGA coprocessing
One of the main advantages of the FPGA is that its function is readily changed. Many FPGAs are even reprogrammable in-circuit, giving FPGA-based designs virtually the same degree of flexibility that processors provide without the performance limits of sequential execution.
An FPGA-based machine-vision system can be adapted by developers to handle many applications, which means a single hardware design can service many markets. The field programmability of such designs allows customers to customize and adapt a system without installing new hardware. When FPGAs that can be programmed in-circuit are used, machine-vision systems can switch tasks and still offer accelerated performance simply by being loaded with a new logic program.
Thus, the dedicated logic coprocessor can provide the increased computing power, increased I/O bandwidth, and controlled determinism and latency that machine-vision systems require. Implemented in an FPGA, the coprocessor can also offer the design flexibility needed for providing high functionality at low cost. The key to realizing this potential lies in proper application of the technology.
To begin, developers need to analyze the image-processing algorithms they need to execute, looking for parallelism to exploit. The structures for these parallel tasks can then be implemented in the FPGA hardware. That hardware should include significant amounts of memory for buffering image data. Both SRAM-like memory for random access and DRAM memory for streaming and burst access should be made available to the FPGA using dedicated memory interfaces.
Another step in applying the FPGA is to determine the best location within the system for the acceleration to take place. Where latency is a prime concern in an application, the acceleration element should be positioned closer to the camera where it can work directly with raw pixel data as the data is produced. By the time a full image frame has been captured, it has already been processed.
When latency is not as important, and in highly compute-intensive functions, positioning the accelerator in the frame grabber is more appropriate. Whereas resources such as electrical power and physical space are limited in the camera, the frame grabber is able to accommodate much larger designs. In addition, a frame grabber offers more memory and greater mass storage than a camera.
Using multiple FPGAs
Sometimes the optimal choice is to use an FPGA in several places, each designed to handle a range of functions. An example of such a design is Dalsa’s XRI-1200, an image-processor board that targets x-ray imaging and uses a three-stage processing design with acceleration at each stage (Figure 2).
Figure 2. Coprocessing offers a variety of benefits, some of which depend on its location in the system, as the XRI-1200 three-stage design demonstrates.
The first stage of the XRI-1200 provides programmable shading correction and image warping. The shading correction applies offset and gain on a per-pixel basis to data from the camera in order to compensate for variations in light intensity and sensor response across the image. The image warping counters the distortions typically encountered at the edges of the image field of view due to lensing effects. Both functions must be programmable to accommodate system-specific variations, and both can operate on a pixel-by-pixel basis on the data coming from the camera.
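Per-pixel offset and gain correction can be sketched in a few lines. This is a generic flat-field style model, not the XRI-1200's actual implementation: the formula (raw minus offset, times gain), the function name, and the 12-bit clamp are assumptions for illustration.

```python
def shading_correct(raw, offset, gain, max_value=4095):
    """Apply per-pixel offset and gain, then clamp to the output range.

    raw, offset, and gain are equally sized 2-D lists. The formula
    (raw - offset) * gain is an assumed, commonly used form.
    """
    corrected = []
    for raw_row, offset_row, gain_row in zip(raw, offset, gain):
        row = []
        for pixel, off, g in zip(raw_row, offset_row, gain_row):
            value = (pixel - off) * g
            value = min(max_value, max(0, value))   # clamp to 0..max_value
            row.append(int(value + 0.5))            # round to nearest integer
        corrected.append(row)
    return corrected


if __name__ == "__main__":
    raw = [[100, 200], [4095, 5]]
    offset = [[10, 20], [0, 10]]
    gain = [[2.0, 0.5], [2.0, 1.0]]
    print(shading_correct(raw, offset, gain))
```

Each output pixel depends only on the matching input pixel and its two calibration values, which is why the operation can run on the data stream as it arrives from the camera.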
The second stage provides configurable motion compensation to reduce noise in the image. Noise reduction can be achieved by averaging several frames together, but movement of the target during the averaging can result in blurring of the final image. The motion compensation algorithms determine the speed and direction of the target’s motion between frames, and then correct for the motion before averaging the frames. This operation requires substantial memory buffering to hold successive images as well as feature detection and motion extraction. Its position in the middle of the system gives it access to the necessary resources.
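The correct-then-average step can be sketched as follows. This is a simplified illustration, not the XRI-1200 algorithm: it assumes the motion of each frame has already been estimated as a whole-pixel shift relative to the first frame, and it omits the feature detection and motion extraction that the text describes. All names are hypothetical.

```python
def average_with_motion_compensation(frames, shifts):
    """Average frames after undoing a known per-frame shift.

    frames: list of equally sized 2-D lists.
    shifts: list of (dx, dy); the target in frame i appears displaced
            by (dx, dy) pixels relative to frame 0 (assumed known).
    """
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for frame, (dx, dy) in zip(frames, shifts):
                sx, sy = x + dx, y + dy          # where this point moved to
                if 0 <= sx < width and 0 <= sy < height:
                    total += frame[sy][sx]
                    count += 1
            out[y][x] = total / count if count else 0.0
    return out


if __name__ == "__main__":
    frame0 = [[0, 9, 0]]
    frame1 = [[0, 0, 9]]                         # target moved one pixel right
    print(average_with_motion_compensation([frame0, frame1], [(0, 0), (0, 0)]))
    print(average_with_motion_compensation([frame0, frame1], [(0, 0), (1, 0)]))
```

In the example, averaging without compensation smears the bright pixel across two positions, while averaging with the known shift keeps it at full intensity in one place. Holding several frames for this step is what drives the memory buffering requirement.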
The third stage provides image rotation in increments of 0.01°, 3x3 programmable filter convolution, and output image conditioning. These tasks require extensive I/O and memory resources to handle the rotation as well as the computational acceleration, and they require a different memory structure from stage 2 because of the random addressing involved in rotation. By separating the functions into different stages, the XRI-1200 is able to address the differences in memory requirements with a simpler design.
The use of FPGA-based coprocessing hardware enables the XRI-1200 to process a 1024x1024-pixel image with 12 bits per pixel in real time at 30 fps—a performance level that is beyond what a conventional processor can provide. Simply implementing the 3x3 filter would require a processor performance of more than 400 MFLOPs.
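A rough operation count shows where a figure of this size comes from. The sketch assumes each output pixel of a k x k filter costs k*k multiplications plus k*k - 1 additions; that counting convention and the function name are assumptions, and other conventions give somewhat different totals.

```python
def filter_ops_per_second(width, height, fps, kernel_size):
    """Arithmetic operations per second for a kernel_size x kernel_size filter."""
    pixels_per_s = width * height * fps
    multiplies = kernel_size * kernel_size
    additions = multiplies - 1
    return pixels_per_s * (multiplies + additions)


if __name__ == "__main__":
    ops = filter_ops_per_second(1024, 1024, 30, 3)
    print(f"{ops / 1e6:.0f} million operations per second")
```

For a 1024x1024 image at 30 fps this comes to roughly 535 million operations per second, consistent with the "more than 400 MFLOPs" cited above.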
The FPGA approach also gives the XRI-1200 design considerable flexibility, because the user has full control over function settings: tuning of the image compensation for the system’s specific camera and lens in the first stage, the number of frames to average and the motion thresholds in the second stage, and the filter parameters and rotation angle in the third stage. The FPGA also allows users to implement custom functions in the system without requiring any hardware changes.
The rise in image resolution and growing user demands for processing capability in applications such as medical imaging are being seen throughout the machine-vision industry. These demands have outstripped what can be accomplished simply with a host PC or programmable processors. Hardware acceleration is essential, and of all the performance-boosting options, FPGAs offer the best blend of performance increases and design flexibility.