The ARM Cortex-A8 in OMAP3 is a high-performance, dual-issue applications processor rated at 2.0 DMIPS/MHz (compared with 1.2 DMIPS/MHz for ARM11). It implements the ARM v7 architecture, which is fully backwards compatible with application code written for previous ARM processors.
It includes a floating point unit (ARM VFPv3 architecture) and the ARM NEON SIMD instruction set.
See the Floating Point Optimization article for an introduction to VFP-lite and NEON.
ARM NEON
NEON is a 64/128-bit wide SIMD vector extension for ARM, which has been architected to be an efficient C compiler target as well as being used from assembly language. It has 32x 64-bit registers (with a dual view as 16x 128-bit registers) which can hold the following datatypes:
- 64-bit signed/unsigned
- 32-bit signed/unsigned
- 32-bit single precision floating point
- 16-bit signed/unsigned
- 8-bit signed/unsigned
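The register views and element widths listed above determine how many values a single NEON instruction can process. The following is a minimal sketch of that arithmetic in portable C; the names NEON_D_BITS, NEON_Q_BITS and neon_lanes() are hypothetical helpers invented for illustration and are not part of any ARM header.

```c
#include <assert.h>

/* Widths in bits of the two register views: D (64-bit) and Q (128-bit). */
enum { NEON_D_BITS = 64, NEON_Q_BITS = 128 };

/*
 * Number of elements (lanes) of a given width that fit in one register view.
 * Illustrative helper only; the element width must divide the register width.
 */
unsigned neon_lanes(unsigned register_bits, unsigned element_bits)
{
    assert(element_bits != 0 && register_bits % element_bits == 0);
    return register_bits / element_bits;
}
```

For example, neon_lanes(NEON_D_BITS, 32) gives 2 and neon_lanes(NEON_Q_BITS, 32) gives 4, which matches the vector types float32x2_t and float32x4_t declared in arm_neon.h.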
The NEON instruction set is documented in ARM's RealView Compilation Tools Assembler Guide.
For NEON optimized libraries see ARM Releases AAC, MP3, MPEG-4, H.264 and FFT OpenMAX DL Libraries, Highly Optimized for Cortex-A8/NEON and ARM11 Processors. Note: Read the EULA.
NEON is used by various opensource projects:
- ffmpeg - libavcodec used by mplayer, omapfbplay, and many other linux applications
- libpixman - used by X.org and Mozilla & Webkit browsers to render text and graphics
- Bluez - official Linux Bluetooth stack
- Eigen2 - C++ template library for linear algebra (matrix math etc)
- Webm - Google's new opensource video codec
- ARM RVDS
- gcc
- LLVM
ARM Cortex-A8 Floating Point
There are two types of instructions in the ARM v7 ISA that handle floating point:
1) VFPv3 floating point instruction set (used for single/double precision scalar operations). These are used by gcc for C floating point operations on 'float' and 'double', since ANSI C can only describe scalar floating point, where there is only one operation at a time.
2) NEON vectorized single precision operations (2 values in a D-register, or 4 values in a Q-register). These can be used by gcc when -ftree-vectorize is enabled and -mfpu=neon is specified, and the code can be vectorized. In other cases the VFPv3 scalar ops will be used.
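As a rough illustration of code that the gcc vectorizer has a chance of turning into NEON instructions, here is a sketch of a simple single precision loop. The function name mul_add_f32 and the toolchain prefix in the comment are assumptions made for this example. The flags -mfloat-abi=softfp and -funsafe-math-optimizations are also additions not mentioned above: GCC's documentation notes that floating point auto-vectorization for NEON is not performed without the latter, because NEON does not fully implement IEEE 754. Whether a given gcc version actually vectorizes the loop should be checked in the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Possible build line (toolchain prefix is an assumption):
 *   arm-linux-gnueabi-gcc -std=c99 -O2 -ftree-vectorize -mcpu=cortex-a8 \
 *       -mfpu=neon -mfloat-abi=softfp -funsafe-math-optimizations -c mul_add.c
 *
 * Element-wise multiply-add: dst[i] = a[i] * b[i] + c[i].
 * 'restrict' promises that the arrays do not overlap, which generally makes
 * it easier for the vectorizer to process several elements per iteration.
 */
void mul_add_f32(float *restrict dst, const float *restrict a,
                 const float *restrict b, const float *restrict c, size_t n)
{
    assert(n == 0 || (dst && a && b && c));
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * b[i] + c[i];
}
```

The loop uses only 'float' data, has a simple trip count and no dependencies between iterations, which is the kind of shape "vector friendly" usually refers to. Without the NEON flags the same source still compiles to scalar code.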
ARM Cortex-A processors have separate floating point pipelines that handle these different instructions.
On Cortex-A8, the designers' focus was on the performance of the NEON unit, which can sustain 1 cycle/instr throughput (processing 2 single-precision values at once) for consumer multimedia. The scalar VFPv3 FPU cannot achieve this level of performance (cycle timings are in the Cortex-A8 TRM), but it is still a lot better than doing floating point using integer instructions.
If you need the highest performance floating point on Cortex-A8, you need to use single precision and ensure the code uses the NEON vectorized instructions:
- use gcc with -ftree-vectorize (possibly modify source code to make it vector friendly)
- use NEON intrinsics (#include <arm_neon.h>, float32x2_t datatype and vmul_f32() etc)
- use NEON asm directly
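The intrinsics option can be sketched as follows. This is a minimal example, not a tuned implementation: the function name mul_f32 is hypothetical, and the preprocessor guard with a scalar fallback is an addition so that the same file also builds on targets without NEON. The intrinsics used (vld1_f32, vmul_f32, vst1_f32) and the float32x2_t type come from arm_neon.h.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Element-wise multiply: dst[i] = a[i] * b[i] for i in [0, n). */
void mul_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    assert(n == 0 || (dst && a && b));

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process two single precision values per step in a D-register. */
    for (; i + 2 <= n; i += 2) {
        float32x2_t va = vld1_f32(a + i);    /* load a[i], a[i+1] */
        float32x2_t vb = vld1_f32(b + i);    /* load b[i], b[i+1] */
        vst1_f32(dst + i, vmul_f32(va, vb)); /* multiply and store */
    }
#endif

    /* Scalar tail, and the whole loop when NEON is not available. */
    for (; i < n; ++i)
        dst[i] = a[i] * b[i];
}
```

The Q-register variants (float32x4_t with vld1q_f32, vmulq_f32 and vst1q_f32) follow the same pattern with four values per step.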