Optimizing on BlackFin (Code Optimization: BlackFin)
Author: yygoing   Source: http://yygoing.spaces.live.com/blog/   Updated: 2008-9-5
1 Function Arguments Transferring
Three arguments or fewer: passed in R2:0 (R0, R1, R2). The return value is in R0.
More than three arguments:
  first three: R2:0
  fourth: [FP+20]
  note: a. 0x14 = 20; b. the FP-relative offsets apply once LINK has set up the frame (pair it with UNLINK)
  return value: R0

Function Prototype: int test(int a, int b, int c)
Parameters Passed as: a in R0, b in R1, c in R2
Return Location: R0

Function Prototype: int test(char a, char b, char c, char d, char e)
Parameters Passed as: a in R0, b in R1, c in R2, d in [FP+20], e in [FP+24]
Return Location: R0

Details: see the VisualDSP++ Help => function arguments, transferring
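The rule above can be written down as a small C model, which is handy when working out by hand where an argument of a longer prototype ends up. This is only a sketch: the function names arg_register and arg_fp_offset are made up for illustration, and it assumes that every stack argument occupies one 4-byte slot, which matches the [FP+20] and [FP+24] entries quoted above but is not stated in general.

```c
#include <assert.h>

/* Hypothetical helper: data register number (0..2) that carries the
 * argument with the given 0-based index, or -1 if it is passed on the stack. */
int arg_register(int index)
{
    assert(index >= 0);
    return index < 3 ? index : -1;
}

/* Hypothetical helper: FP-relative byte offset of a stack argument after
 * LINK, or -1 if the argument is passed in a register.
 * Assumption: one 4-byte slot per stack argument, starting at FP+20. */
int arg_fp_offset(int index)
{
    assert(index >= 0);
    return index < 3 ? -1 : 20 + 4 * (index - 3);
}
```

For int test(char a, char b, char c, char d, char e), arg_register gives 0, 1, 2 for a, b, c, and arg_fp_offset gives 20 and 24 for d and e, the same locations as in the table.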
2 Optimizing Step by Step
Step 1: Design the Structure of your Function
a. The algorithm of your function.
   eg: de-quant of mpeg
       data[i] = (coeff[i] * default_intra_matrix[i] * quant2) >> 4;
       dequant_mpeg_intra_c(int16_t * data,
                            const int16_t * coeff,
                            const uint32_t quant,
                            const uint32_t dcscalar,
                            const uint16_t * mpeg_quant_matrices)
b. The use of Vector Operations: two calculations per instruction.
   Two (int16_t * int16_t) multiplies =>
       R7.L = R7.L * R6.L, R7.H = R7.H * R6.H (IS);
c. The use of other effective operations, such as the pixel instructions.
Step 2: Implement the function
<ps: at this stage you need not write any Parallel Instructions yet. However, keep them in mind as much as possible while writing the code, so that the instructions can be paired in Step 3.>
Step 3: Use the Parallel Instructions.
A multi-issue instruction is 64 bits in length <that's why .ALIGN 8 is used in the code segment>.
A 64-bit multi-issue instruction = one 32-bit ALU/MAC instruction + two 16-bit instructions.
The 16-bit instructions include:
a. Ireg add, modify, sub
b. Load
c. Store
Details:
   ADSP-BF53x BF56x Blackfin Processor Programming Reference.pdf
   提高Blackfin系列DSP中代码的并行性.kdh (Improving Code Parallelism on Blackfin-Series DSPs)
Step 4: Adjust the sequence of instructions for the pipeline.
How to see whether the pipeline has a conflict:
a. the pipeline viewer
b. the build messages. eg: xx requires one extra cycle.

Before:
    LSETUP(DEQNT_INTRA_START, DEQNT_INTRA_END) LC0 = P0;
    DEQNT_INTRA_START:
        R7.L = R7.L * R2.H, R7.H = R7.H * R2.H (IS) || [I0++] = R4 || R5 = [I1++];
        ----------------------------------------------------------------------------------
        R7.L = R7.L * R6.L, R7.H = R7.H * R6.H (IS);
        R6 = R7 >>> 4 (V) || R4 = [I2++] || NOP;
        R5.L = R5.L * R2.H, R5.H = R5.H * R2.H (IS) || [I0++] = R6 || R7 = [I1++];
        ----------------------------------------------------------------------------------
        R5.L = R5.L * R4.L, R5.H = R5.H * R4.H (IS) || R6 = [I2++] || NOP;
    DEQNT_INTRA_END:
        R4 = R5 >>> 4 (V);

After reordering:
    LSETUP(DEQNT_INTRA_START, DEQNT_INTRA_END) LC0 = P0;
    DEQNT_INTRA_START:
        R4 = R5 >>> 4 (V);
        R7.L = R7.L * R6.L, R7.H = R7.H * R6.H (IS) || [I0++] = R4 || R5 = [I1++];
        ----------------------------------------------------------------------------------
        R6 = R7 >>> 4 (V) || R4 = [I2++] || NOP;
        R5.L = R5.L * R2.H, R5.H = R5.H * R2.H (IS) || [I0++] = R6 || R7 = [I1++];
        R7.L = R7.L * R2.H, R7.H = R7.H * R2.H (IS) || R6 = [I2++];
        R5.L = R5.L * R4.L, R5.H = R5.H * R4.H (IS) || NOP;
    DEQNT_INTRA_END:
        R4 = R5 >>> 4 (V);

3 Effective Optimizing Tricks
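Several of the tricks in this section are shown on the de-quantization loop from section 2, so it helps to have a plain C baseline to compare hand-optimized assembly output against. The sketch below implements only the formula from Step 1, data[i] = (coeff[i] * matrix[i] * quant2) >> 4. The names dequant_ref, asr64 and sat16 are hypothetical, quant2 is simply taken as a parameter because the text does not say how it is derived from quant, and the final saturation to 16 bits is an assumption (the formula itself does not say what happens on overflow).

```c
#include <assert.h>
#include <stdint.h>

/* Arithmetic shift right by n bits, written so that it does not depend on
 * how the compiler implements >> for negative values. */
int64_t asr64(int64_t x, unsigned n)
{
    return x < 0 ? ~(~x >> n) : x >> n;
}

/* Saturate a wide value to the int16_t range (assumed overflow policy). */
int16_t sat16(int64_t x)
{
    if (x > INT16_MAX) return INT16_MAX;
    if (x < INT16_MIN) return INT16_MIN;
    return (int16_t)x;
}

/* Scalar reference for: data[i] = (coeff[i] * matrix[i] * quant2) >> 4 */
void dequant_ref(int16_t *data, const int16_t *coeff,
                 const uint16_t *matrix, uint32_t quant2, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        /* 64-bit intermediate so the triple product cannot overflow */
        int64_t level = (int64_t)coeff[i] * matrix[i] * quant2;
        data[i] = sat16(asr64(level, 4));
    }
}
```

Running the assembly loop and dequant_ref on the same input block and comparing the two outputs is a cheap way to catch a reordering mistake after Step 4. Where the assembly saturates at intermediate steps, the outputs can only be expected to match for inputs that stay inside the 16-bit range throughout.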
Trick 1: Make use of LINK and UNLINK in Pairs.
A function that takes three arguments or fewer can omit LINK and UNLINK.
Trick 2: Don't Let your store/load/modify IReg instruction feel lonely.
If such an instruction stands alone, you are not using parallel instructions as much as possible.
eg:
    [I0] = R6;
    R0 = 0;
=>
    R0 = R0 -|- R0 || [I0] = R6;    // R0 -|- R0 clears both halves of R0
Trick 3: Pay Great attention to your instructions in loops.
Solve LOOP:
  plan 1: unroll it if the loop count is not very large.
  plan 2: use the hardware loop. <how? See 4.0 PROGRAM SEQUENCER>
Decrease the number of instructions in loops AS MUCH AS POSSIBLE.
Trick 4: Combine your "if then else" as much as possible
eg:
    if (coeff[i] < 0) {
        int32_t level = -coeff[i];
        level = (( 2 * level + 1 ) * inter_matrix[i] * quant) >> 4;
        data[i] = (level <= 2048 ? -level : -2048);
    } else {
        uint32_t level = coeff[i];
        level = (( 2 * level + 1 ) * inter_matrix[i] * quant) >> 4;
        data[i] = (level <= 2047 ? level : 2047);
    }
Both branches do the same computation and differ only in sign, so compute a sign factor instead of branching:
    if negative => -1
    if positive => +1
=>
    // R5 = level (two levels)
    R5 = R5 << 2 (V,S);
    // if negative 1111 1111 1111 1111
    // if positive 0000 0000 0000 0000
    R1 = R5 >>> 15 (V,S);
    BITSET(R1, 0);
    // BITSET(R1, 0) sets bit 0 of the low half only; with two levels packed
    // in R5, the high half needs bit 16 set in the same way.
Trick 5: How to saturate an integer which is not 16 bits or 32 bits
plan A: first shift left with saturation, then shift right back.
eg: saturate to <-2048,2047>
    -2048 = 0b1111 1000 0000 0000
     2047 = 0b0000 0111 1111 1111
so the range needs 12 of the 16 bits: first shift left 4 bits with saturation, then shift right (arithmetic) 4 bits back.
plan B: use MAX( ), MIN( ), MAX( )(V), MIN( )(V).
eg: saturate to <-2048,2047>
    XX = MIN(2047, XX);
    XX = MAX(-2048, XX);
plan C: use saturating adds and subs.
eg: saturate to <-2048,2047>
    32767_minus_2047 = 32767 - 2047;
    XX = XX + 32767_minus_2047 (S);
    XX = XX - 32767_minus_2047 (S);
    32768_minus_2048 = 32768 - 2048;
    XX = XX - 32768_minus_2048 (S);
    XX = XX + 32768_minus_2048 (S);
Which plan do we choose? That depends.
Trick 6: The more parallel, vector and pixel instructions, the better.
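Going back to Trick 5, the three saturation plans are easy to check against each other in portable C before committing one of them to assembly. The sketch below models 16-bit saturating arithmetic by hand; all function names are hypothetical, and the code only imitates what the (S) option is described as doing, it is not Blackfin code. Note that the lower clamp in plan C needs the constant 32768 - 2048: with 32767 - 2048 the value would pin at -2049 instead of -2048.

```c
#include <assert.h>
#include <stdint.h>

/* 16-bit saturating add, a model of an add/sub with the (S) option. */
int16_t add_sat16(int16_t a, int b)
{
    int32_t r = (int32_t)a + b;
    if (r > INT16_MAX) r = INT16_MAX;
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}

/* plan A: shift left 4 with saturation, arithmetic shift right 4 back */
int16_t clamp12_shift(int16_t x)
{
    int32_t wide = (int32_t)x * 16;
    if (wide > INT16_MAX) wide = INT16_MAX;
    if (wide < INT16_MIN) wide = INT16_MIN;
    return (int16_t)(wide < 0 ? ~(~wide >> 4) : wide >> 4);
}

/* plan B: MIN then MAX */
int16_t clamp12_minmax(int16_t x)
{
    if (x > 2047) x = 2047;
    if (x < -2048) x = -2048;
    return x;
}

/* plan C: saturating add/sub pairs */
int16_t clamp12_addsub(int16_t x)
{
    x = add_sat16(x, 32767 - 2047);      /* values above 2047 pin at 32767 */
    x = add_sat16(x, -(32767 - 2047));
    x = add_sat16(x, -(32768 - 2048));   /* values below -2048 pin at -32768 */
    x = add_sat16(x, 32768 - 2048);
    return x;
}

/* Exhaustively checks that the three plans agree on every 16-bit input. */
void check_plans(void)
{
    for (int32_t v = INT16_MIN; v <= INT16_MAX; v++) {
        int16_t x = (int16_t)v;
        assert(clamp12_shift(x) == clamp12_minmax(x));
        assert(clamp12_addsub(x) == clamp12_minmax(x));
    }
}
```

Since all three give the same result, the choice between them comes down to instruction count and to which form pairs best with the neighbouring loads and stores.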
Trick 7: About Align.
Word Align:      4 bytes, 32 bits    [I/Preg++] = Rx;
Half Word Align: 2 bytes, 16 bits    W[I/Preg++] = Rx.L;
Byte Align:      1 byte,  8 bits     B[Preg++] = Rx;
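The alignment rule in Trick 7 amounts to saying that the address used by an access must be a multiple of the access size. A minimal C sketch of that check follows, useful for example in a host-side test that validates buffer addresses before they are handed to an assembly routine. The helper names is_aligned and align_up are hypothetical and not part of any Blackfin toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when addr is a multiple of size.
 * size is the access width in bytes: 4 (word), 2 (half word) or 1 (byte). */
int is_aligned(uintptr_t addr, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4);
    return (addr & (size - 1)) == 0;
}

/* Rounds addr up to the next multiple of size (same size restrictions). */
uintptr_t align_up(uintptr_t addr, unsigned size)
{
    assert(size == 1 || size == 2 || size == 4);
    return (addr + (size - 1)) & ~(uintptr_t)(size - 1);
}
```

For instance, an address of 0x1002 passes the half-word check but fails the word check, so it could be used with W[...] accesses but not with a 32-bit [I/Preg++] access.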