Cortex A8 ARM features

Monday, June 20, 2011 / 09:23

The ARM Cortex-A8 in OMAP3 is a high-performance, dual-issue applications processor that reaches 2.0 DMIPS/MHz (compared with 1.2 DMIPS/MHz for ARM11). It implements the ARMv7 architecture, which is fully backwards compatible with application code written for previous ARM processors.
It includes a floating point unit (ARM VFPv3 architecture) and the ARM NEON SIMD instruction set.
See the Floating Point Optimization article for an introduction to VFP-lite and NEON.

ARM NEON

NEON is a 64/128-bit wide SIMD vector extension for ARM, designed to be an efficient C compiler target as well as usable from assembly language. It has 32 64-bit registers (which can also be viewed as 16 128-bit registers) that can hold the following data types:

64-bit signed/unsigned
32-bit signed/unsigned
32-bit single precision floating point
16-bit signed/unsigned
8-bit signed/unsigned
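
As a rough illustration of how these element widths map onto the register file, the number of elements (lanes) processed by one instruction is simply the register width divided by the element width. The helper name neon_lanes below is hypothetical and exists only for this sketch; it is not part of any ARM header.

```c
/* Hypothetical helper (not an ARM API): number of elements ("lanes")
   of a given width that fit in one NEON register.
   reg_bits is 64 for a D register or 128 for a Q register;
   elem_bits is 8, 16, 32 or 64. */
unsigned neon_lanes(unsigned reg_bits, unsigned elem_bits)
{
    return reg_bits / elem_bits;
}
```

For example, neon_lanes(64, 32) gives the 2 single-precision values held in a D register, and neon_lanes(128, 8) gives the 16 bytes held in a Q register.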

The key advantage of NEON is very high performance vector math processing, whilst being easy to program. NEON instructions execute in the same thread of control as ordinary ARM instructions (they are simply different instructions in the same stream), and NEON is supported by the same tools, debuggers and operating systems.
The NEON instruction set is documented in ARM's RealView Compilation Tools Assembler Guide.
For NEON-optimized libraries, see "ARM Releases AAC, MP3, MPEG-4, H.264 and FFT OpenMAX DL Libraries, Highly Optimized for Cortex-A8/NEON and ARM11 Processors". Note: read the EULA.
NEON is used by various open source projects:

FFmpeg - libavcodec, used by MPlayer, omapfbplay, and many other Linux applications
libpixman - used by X.org and by Mozilla and WebKit browsers to render text and graphics
BlueZ - official Linux Bluetooth stack
Eigen2 - C++ template library for linear algebra (matrix math etc.)
WebM - Google's new open source video format

Compilation tools support for NEON:

ARM RVDS
gcc
LLVM

ARM Cortex-A8 Floating Point

There are two types of instructions in the ARMv7 ISA that handle floating point:
1) VFPv3 floating point instructions (used for single- and double-precision scalar operations). These are used by gcc for C floating point operations on 'float' and 'double', since ANSI C can only describe scalar floating point, where there is only one operation at a time.
2) NEON vectorized single-precision operations (2 values in a D register, or 4 values in a Q register). These can be used by gcc when -ftree-vectorize is enabled, -mfpu=neon is specified, and the code can be vectorized. In other cases the VFPv3 scalar operations are used.
ARM Cortex-A processors have separate floating point pipelines that handle these different instructions.
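
A minimal sketch of a loop that gcc's auto-vectorizer can usually turn into NEON code is shown here. The function name scale_add and the toolchain prefix in the comment are assumptions made for illustration. Note that, according to the GCC documentation, floating point loops are only auto-vectorized for NEON when -funsafe-math-optimizations (or -ffast-math) is also given, because NEON arithmetic on ARMv7 is not fully IEEE 754 compliant.

```c
#include <stddef.h>

/* Possible build line (toolchain prefix is an assumption):
 *   arm-linux-gnueabi-gcc -std=c99 -O2 -mcpu=cortex-a8 -mfpu=neon \
 *       -mfloat-abi=softfp -ftree-vectorize -funsafe-math-optimizations \
 *       -c scale_add.c
 *
 * The loop is written to be vector friendly:
 *   - restrict promises that the arrays do not overlap
 *   - single-precision data only
 *   - a simple counted loop with unit stride and no function calls
 */
void scale_add(float *restrict out, const float *restrict a,
               const float *restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

The same source compiles unchanged without the NEON flags, in which case the scalar VFPv3 instructions are used instead.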
On Cortex-A8, the designers' focus was on NEON unit performance for consumer multimedia: the NEON unit can sustain a throughput of one instruction per cycle (processing 2 single-precision values at once). The scalar VFPv3 FPU cannot achieve this level of performance (cycle timings are given in the Cortex-A8 Technical Reference Manual), but it is still a lot better than doing floating point using integer instructions.
If you need the highest performance floating point on Cortex-A8, you need to use single precision and ensure the code uses the NEON vectorized instructions:

use gcc with -ftree-vectorize (possibly modifying the source code to make it vector friendly)
use NEON intrinsics (#include <arm_neon.h>, the float32x2_t datatype, vmul_f32() etc.)
use NEON asm directly
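
The intrinsics option above might look roughly like the following sketch. The function name mul_f32 and the HAVE_NEON macro are hypothetical; vld1_f32(), vmul_f32() and vst1_f32() are standard arm_neon.h intrinsics. The scalar loop is included so that the same file still builds on targets without NEON.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1   /* hypothetical macro, local to this sketch */
#else
#define HAVE_NEON 0
#endif

/* Element-wise product out[i] = a[i] * b[i] for single-precision arrays. */
void mul_f32(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_NEON
    /* Two floats per iteration: one 64-bit D register each for a and b. */
    for (; i + 2 <= n; i += 2) {
        float32x2_t va = vld1_f32(a + i);
        float32x2_t vb = vld1_f32(b + i);
        vst1_f32(out + i, vmul_f32(va, vb));
    }
#endif
    /* Scalar tail, and the complete fallback on non-NEON builds. */
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```

Using the Q-register variants (float32x4_t, vld1q_f32(), vmulq_f32(), vst1q_f32()) would process 4 values per instruction; whether that is faster on a given core is something to measure rather than assume.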

Cortex-A9 has a much higher performance floating point unit, which can sustain a throughput of one instruction per cycle with low result latencies. OMAP4 uses a dual-core Cortex-A9 with NEON, which gives excellent floating point performance for both FPU and NEON instructions.
