MMX Can Speed Image-Processing Software
Capitalizing on MMX technology isn't easy, but here are some tips for how you can take advantage of it.
Fernando Serra, Imaging Technology, Bedford, MA -- Test & Measurement World, 6/1/1999
| The MMX technology, which Intel added to its
Pentium processors to speed graphics processing, can also speed image-analysis tasks.
Unfortunately, capitalizing on MMX technology isn't easy. First, programming
languages such as C, C++, and Basic lack the data types needed to let the code they
produce use MMX hardware. Second, existing software cannot automatically benefit from MMX
technology. Your choices are few. You can purchase new or upgraded image-analysis software
that includes MMX capabilities, or you can resort to assembly language to add MMX
capabilities to your own code. Lastly, you may find you get reasonably fast
processing by simply optimizing existing code, without using MMX.

Intel's MMX technology adds several new data types and new machine-language instructions
to Pentium-class CPUs. The new data types let the CPU handle 64-bit data: the CPU uses
64 bits in each of its eight 80-bit floating-point registers to form MMX registers. Think
of each 64-bit MMX register as containing eight bytes, four words, two double words, or
one quadword.

Because MMX and floating-point operations share registers, software cannot mix
floating-point and MMX instructions without paying a price.
A Pentium takes about 50 clock cycles to toggle the floating-point register set between
floating-point use and MMX use. So, simply switching the context of the registers consumes
valuable time.

MMX defines several new CPU instructions that manipulate a register's data in parallel.
For example, when an operation processes 1 byte in an MMX register, the same operation
can take place simultaneously on the other 7 bytes. The added MMX instructions perform
the following types of operations: add, subtract, multiply, multiply-and-accumulate,
compare, shift, logical, move, and pack-unpack.

Test a Real Algorithm

We developed and tested the variance algorithm using Microsoft Visual C++ 5.0 and a
Windows NT 4.0-based 266-MHz Intel Pentium Pro computer with 128 Mbytes of memory. Our
timing information (Table 1) shows results for processing a 1023x1023-pixel 8-bit image.
We chose the nonstandard image size simply to verify that the algorithm worked properly.
Listing 1 shows our C-language algorithm for variance.
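For readers without Listing 1 at hand, the following is a minimal sketch of what a straightforward floating-point variance routine might look like. It is not the article's listing. The function name image_variance_ref, its parameters, and the choice of a whole-image population variance (mean of squares minus square of the mean) are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical reference routine (not the article's Listing 1):
   population variance of every pixel in an 8-bit image, using
   floating-point math and plain array indexing. */
double image_variance_ref(const unsigned char *pixels,
                          size_t width, size_t height)
{
    double sum = 0.0;      /* running sum of pixel values   */
    double sum_sq = 0.0;   /* running sum of squared values */
    size_t n = width * height;
    size_t x, y;

    if (n == 0)
        return 0.0;        /* empty image: define variance as 0 */

    for (y = 0; y < height; y++) {
        for (x = 0; x < width; x++) {
            double p = (double)pixels[y * width + x];
            sum += p;
            sum_sq += p * p;
        }
    }

    {
        double mean = sum / (double)n;
        /* variance = E[p^2] - (E[p])^2 */
        return sum_sq / (double)n - mean * mean;
    }
}
```

A routine of this shape does one floating-point multiply and two floating-point adds per pixel, which is the kind of work the later versions replace with integer and packed operations.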
During our optimization experiments, we kept track of the execution time for each new
version of the software, as shown in Table 1. You can find all the optimization details
in two sections, complete with code, at www.imaging.com/tutorials.html.

We used the first version (Listing 1) as our reference. That routine ran in 147.8 ms.
Each version builds on the code of the previous version, unless noted otherwise. First,
we modified version 1 to use only integer math (version 2). In version 3, we used a more
efficient C-pointer construct to replace the innermost for loop. Version 4 gained more
speed from loop unrolling, a procedure in which we duplicated the code of the inner for
loop to increase the time between CPU branch instructions.

Moving to version 5 required a rewrite in assembly language. This version ran slightly
slower than the best C-language version, but it provided the base from which we began
the code optimization that would eventually include MMX operations.
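The three C-level steps described for versions 2 through 4 (integer-only accumulation, pointer traversal in place of the indexed inner loop, and loop unrolling) can be combined in one short sketch. This is an illustration under assumptions, not the article's code: the name image_variance_int, the unroll factor of four, and the whole-image variance definition are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the C-level optimizations: integer
   accumulators, a walking pointer, and a loop unrolled four times.
   Floating point is used only once, for the final division. */
double image_variance_int(const unsigned char *pixels, size_t count)
{
    uint64_t sum = 0;       /* integer sum of pixel values   */
    uint64_t sum_sq = 0;    /* integer sum of squared values */
    const unsigned char *p = pixels;
    const unsigned char *end4 = pixels + (count & ~(size_t)3);
    const unsigned char *end = pixels + count;
    double mean;

    if (count == 0)
        return 0.0;

    /* Unrolled body: four pixels per loop iteration, so the CPU
       executes one branch for every four pixels instead of one each. */
    while (p < end4) {
        unsigned a = p[0], b = p[1], c = p[2], d = p[3];
        sum += a + b + c + d;
        sum_sq += a * a + b * b + c * c + d * d;
        p += 4;
    }

    /* Remainder loop for counts that are not a multiple of four. */
    while (p < end) {
        unsigned v = *p++;
        sum += v;
        sum_sq += v * v;
    }

    mean = (double)sum / (double)count;
    return (double)sum_sq / (double)count - mean * mean;
}
```

An image whose pixel count is not a multiple of the unroll factor, such as 1023x1023, exercises the remainder loop as well as the unrolled body, which is one reason an odd size is a useful correctness check.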
Unrolling the assembly-language code in version 5 produced version 6. Version 7, the
first MMX code, required a drastic recoding of the assembly-language code from version 6.
To optimize version 7 of the algorithm so it made the best use of MMX hardware, we turned
to Intel's VTune software-analysis tools (developer.intel.com/vtune/analyzer/). This
product analyzes assembly-language code and determines how efficiently the code executes
on the CPU.

VTune Optimizes MMX Code

The VTune program helped us determine which operations we could pair to further optimize
the MMX code. We used the resulting MMX code in version 8.
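The MMX versions gain their speed by applying one operation to several packed values at once. As a portable illustration only, the sketch below emulates a packed add of eight bytes inside an ordinary 64-bit integer, with each byte wrapping around independently, which is the behavior of the MMX PADDB instruction. It uses no MMX instructions and is not the article's assembly code; the function name packed_add_u8 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical emulation of a packed byte add: treats a and b as
   eight independent unsigned bytes and adds them lane by lane,
   with wraparound inside each lane and no carry between lanes. */
uint64_t packed_add_u8(uint64_t a, uint64_t b)
{
    const uint64_t HIGH = 0x8080808080808080ULL; /* top bit of each byte */

    /* Add the low 7 bits of every lane. The largest per-lane result
       is 0x7F + 0x7F = 0xFE, so no carry can leave its lane. */
    uint64_t low7 = (a & ~HIGH) + (b & ~HIGH);

    /* Fold the original top bits back in with XOR, which adds them
       modulo 2 and discards the carry out of each lane. */
    return low7 ^ ((a ^ b) & HIGH);
}
```

For example, adding the lanes 0xFF and 0x01 yields 0x00 in that lane without disturbing its neighbor. Real MMX code does this in a single instruction on a 64-bit register, and getting data into such a packed layout is part of the redesign effort.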
Coding algorithms to take advantage of MMX hardware requires careful redesign of
algorithms and careful coding in assembly language.

As you may have deduced from the data in Table 1, you frequently can increase processing
speeds just by carefully redesigning the code you already have. Although the final MMX
version of our test algorithm operates 10 times faster than the original C algorithm,
most of the ... In most cases, optimizing existing code yields the greatest returns in
the first few optimizations. If you don't need to squeeze out every bit of performance
afforded by MMX while you wait for new software tools, try optimizing the code you
already have. T&MW

Fernando Serra is the Vision Group Manager at Imaging Technology, where he is responsible
for vision algorithms and software tools. He received a B.S.E.E. degree from Wentworth
Institute of Technology in 1986. fernando@imaging.com. |