Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)
Hi,
I got that posting with the discussion of the block sizes and times -
thanks.
Here is a new assembler memcpy for the MB.
For aligned transfers the main inner loop drops from 12 to 8 cycles per
transfer compared to the C code.
For un-aligned transfers the main inner loop drops from 17 to 13 cycles per
transfer compared to the C code.
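To put those inner-loop numbers in perspective, here is a rough back-of-the-envelope sketch in C. It assumes that one "transfer" means one 32-bit word, and it ignores the byte-wise head and tail handling, so it is only an estimate. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Inner-loop costs quoted in this post, in cycles per transfer. */
enum { C_ALIGNED = 12, ASM_ALIGNED = 8, C_UNALIGNED = 17, ASM_UNALIGNED = 13 };

/* ASSUMPTION: one "transfer" is one 32-bit word; head/tail bytes are ignored. */
unsigned long inner_loop_cycles(size_t bytes, unsigned cycles_per_word)
{
    assert(cycles_per_word > 0);
    return (unsigned long)(bytes / 4) * cycles_per_word;
}

/* Whole-number percentage of inner-loop cycles saved (rounded down). */
unsigned percent_saved(unsigned before, unsigned after)
{
    assert(before > 0 && after <= before);
    return (before - after) * 100u / before;
}
```

For a 1514-byte frame, inner_loop_cycles(1514, C_ALIGNED) versus inner_loop_cycles(1514, ASM_ALIGNED) gives the aligned estimate, and percent_saved() gives the relative gain for either case.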
... not sure what else I can add at this point. Are there any other
significant transfers or checksums that use other functions and might
benefit?
Jim Law
Iris Power
----- Original Message -----
From: "Brettschneider Falk" <fbrettschneider@xxxxxxxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Tuesday, April 29, 2008 10:30 AM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff
*orig_skb,structnet_device*dev)
Hi again,
I wrote:
A mysterious time gap of 570us in the logging above is
between "entered inner while-loop" and "before
sk_stream_alloc_pskb". It's a bit after
__tcp_push_pending_frames. I'm not sure but I reckon it's a
task switch to the receive part of the TCP stack (maybe
because the TCP-packet with an ACK has been received from my PC).
Meanwhile I improved my logging and demystified that time gap. Indeed, it's
a task switch caused by the arrival of an incoming TCP packet (likely
the ACK). Here is what happens in tcp.c:tcp_sendmsg():
1. entered tcp_sendmsg():
2. entered inner while-loop: 38
3. before skb_add_data(148 bytes): 6
4. after skb_add_data: 36
5. before __tcp_push_pending_frames: 10
6. entered xenet_FifoSend(1514): 287
7. exit xenet_FifoSend: 278
8. after __tcp_push_pending_frames: 59
9. next loop in inner while-loop: 2
10. before sk_stream_alloc_pskb(): 5
11. ---NET-RX-IRQ-(receive ACK)---->
12. entered FifoRecvHandler(64 bytes): 133
13. exit FifoRecvHandler: 148
14. ---soft-IRQ-------------------->
15. entered net_rx_action: 51
16. exit net_rx_action: 248
17. ---back-to-tcp_sendmsg--------->
18. after sk_stream_alloc_pskb: 60
19. before skb_add_data(180 bytes): 20
20. after skb_add_data: 42
21. before tcp_push: 8
22. before TCP_CHECK_TIMER(sk): 41
23. before release_sock(sk): 2
24. entered xenet_FifoSend(234): 525
25. exit xenet_FifoSend: 215
26. after release_sock(sk): 97
27. exit tcp_sendmsg():
a) You see 11., 14. and 17. are context switches between recv and send
tasks.
b) 16. shows we have another memcpy() in FifoRecvHandler() because of
taking over the bytes given by EMAC and putting them into a TCP packet.
c) 23. to 27. cover sending the remaining bytes which didn't fit into
the TCP packet sent in 6.: 328 - 148 = 180 bytes (+ 54 bytes of headers) = 234.
d) I wonder why there is such a long wait between 23. and 24. Does
release_sock(sk) really take that long?
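The figures in the log can be added up with a small C sketch like the one below. It rests on two assumptions that the log itself does not state: that the values are microseconds, and that each value is the time elapsed since the previous log point. Entries without a value (1., 11., 14., 17., 27.) are left out, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One log point: its number in the list above and the value printed next
 * to it. ASSUMPTION: microseconds elapsed since the previous log point. */
struct log_point { int id; unsigned us; };

static const struct log_point trace[] = {
    { 2,  38}, { 3,   6}, { 4,  36}, { 5,  10}, { 6, 287}, { 7, 278},
    { 8,  59}, { 9,   2}, {10,   5}, {12, 133}, {13, 148}, {15,  51},
    {16, 248}, {18,  60}, {19,  20}, {20,  42}, {21,   8}, {22,  41},
    {23,   2}, {24, 525}, {25, 215}, {26,  97},
};

/* Sum the values of all log points numbered first_id..last_id inclusive. */
unsigned trace_sum_us(int first_id, int last_id)
{
    unsigned sum = 0;
    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++)
        if (trace[i].id >= first_id && trace[i].id <= last_id)
            sum += trace[i].us;
    return sum;
}

/* Size of the second frame from c): leftover payload plus header bytes. */
unsigned remainder_frame_bytes(unsigned total, unsigned first_chunk,
                               unsigned header)
{
    assert(first_chunk <= total);
    return total - first_chunk + header;
}
```

Under those assumptions, trace_sum_us(12, 18) covers the detour through the receive path between "before sk_stream_alloc_pskb" and "after sk_stream_alloc_pskb", trace_sum_us(2, 26) covers the whole call, and remainder_frame_bytes(328, 148, 54) reproduces the 234 from c).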
I hope this all helps to understand the time spent for sending in the TCP
stack of microblaze-uClinux. Apart from a faster memcpy(), do you see any
place where we could improve the speed?
Cheers, Falk
###################################-*-asm*-
#
# Copyright 2008 (c) Jim Law - Iris LP All rights reserved.
#
# This file is subject to the terms and conditions of the GNU General
# Public License. See the file COPYING in the main directory of this
# archive for more details.
#
# Written by Jim Law <jlaw@xxxxxxxxxxxxx>
#
# intended to replace the memcpy in memcpy.c in arch/microblaze/lib
#
#
#
# assly_memcpy.s
#
# Attempt at quicker memcpy for MicroBlaze
# Input : Operand1 in Reg r5 - destination address
# Operand2 in Reg r6 - source address
# Operand3 in Reg r7 - number of bytes to transfer
# Output: Result in Reg r3 - starting destination address
#
#
# Explanation:
# Perform (possibly unaligned) copy of a block of memory
# between mem locations with size of xfer spec'd in bytes
#
#
#######################################
#include <asm/clinkage.h>
.globl C_SYMBOL_NAME(memcpy)
.ent C_SYMBOL_NAME(memcpy)
C_SYMBOL_NAME(memcpy):
addi r3,r5,0 # move d to return register as value of function
addi r4,r0,4 # temp = 4
cmpu r4,r4,r7 # temp = c - temp (unsigned)
blti r4,5f # if temp < 0, less than one word to transfer
# transfer enough individual bytes to align the destination address
andi r4,r5,3 # temp = d & 3
beqi r4,16f # if zero, destination already aligned
rsubi r4,r4,4 # temp = 4 - temp (yields 3, 2, 1 transfers for 1, 2, 3 addr offset)
15: beqi r4,16f # if no bytes left to transfer, transfer the bulk
lbui r11,r6,0 # h = *s
sbi r11,r5,0 # *d = h
addi r6,r6,1 # s++
addi r5,r5,1 # d++
addi r4,r4,-1 # temp--
brid 15b # loop
addi r7,r7,-1 # c-- (IN DELAY SLOT)
16: andi r4,r6,3 # temp = s & 3
bnei r4,10f # if temp != 0, unaligned transfers needed
######### case when aligned ########################
addi r12,r0,4 # v = 4
cmpu r4,r12,r7 # temp = c - v (unsigned)
blti r4,5f # if temp < 0 quit loop
bsrli r9,r7,2 # n = c/4
add r10,r0,r0 # offset = 0
1: lw r4,r6,r10 # temp = *(s+offset)
sw r4,r5,r10 # *(d+offset) = temp
addi r9,r9,-1 # n--
bneid r9,1b # loop
addi r10,r10,4 # offset += 4 (IN DELAY SLOT)
add r6,r6,r10 # s = s + offset
add r5,r5,r10 # d = d + offset
rsubi r10,r10,0 # offset = - offset
brid 5f # all done, go finish any remaining bytes
add r7,r7,r10 # c = c + offset (IN DELAY SLOT)
######### all cases when non-aligned ############
10: bsrli r9,r7,2 # n = c/4
beqi r9,5f # if n == 0 just do balance
add r10,r0,r0 # offset = 0
andi r8,r6,0xfffffffc # calc aligned source address 'as'
lw r11,r8,r10 # h = *(as + 0)
addi r8,r8,4 # as += 4 (next aligned source word)
addi r4,r4,-1 # temp-- (temp still holds s & 3 from label 16)
beqi r4,101f # temp was 1 => 1 byte offset
addi r4,r4,-1 # temp--
beqi r4,102f # temp was 2 => 2 byte offset
bri 103f # temp was 3 => 3 byte offset
######### case when offset = 1 byte #############
101: bslli r11,r11,8 # h = h << 8
21: lw r12,r8,r10 # v = *(as + offset)
bsrli r4,r12,24 # temp = v >> 24
or r4,r11,r4 # temp = h | temp
sw r4,r5,r10 # *(d + offset) = temp
bslli r11,r12,8 # h = v << 8
addi r9,r9,-1 # n--
bneid r9,21b # loop
addi r10,r10,4 # offset += 4 (IN DELAY SLOT)
31: addi r4,r0,-3 # temp = addr_adj
bri 39f
######### case when offset = 2 byte #############
102: bslli r11,r11,16 # h = h << 16
22: lw r12,r8,r10 # v = *(as + offset)
bsrli r4,r12,16 # temp = v >> 16
or r4,r11,r4 # temp = h | temp
sw r4,r5,r10 # *(d + offset) = temp
bslli r11,r12,16 # h = v << 16
addi r9,r9,-1 # n--
bneid r9,22b # loop
addi r10,r10,4 # offset += 4 (IN DELAY SLOT)
32: addi r4,r0,-2 # temp = addr_adj
bri 39f
######### case when offset = 3 byte #############
103: bslli r11,r11,24 # h = h << 24
23: lw r12,r8,r10 # v = *(as + offset)
bsrli r4,r12,8 # temp = v >> 8
or r4,r11,r4 # temp = h | temp
sw r4,r5,r10 # *(d + offset) = temp
bslli r11,r12,24 # h = v << 24
addi r9,r9,-1 # n--
bneid r9,23b # loop
addi r10,r10,4 # offset += 4 (IN DELAY SLOT)
33: addi r4,r0,-1 # temp = addr_adj
#bri 39f # fall thru
##################################################
39:
add r8,r8,r10 # as = as + offset
add r5,r5,r10 # d = d + offset
rsubi r10,r10,0 # offset = - offset
add r7,r7,r10 # c = c + offset
add r6,r8,r4 # s = as + addr_adj (addr_adj is negative)
# get here to do balance of individual bytes
5: beqi r7,6f # if no bytes left to transfer, exit
lbui r4,r6,0 # temp = *s
sbi r4,r5,0 # *d = temp
addi r6,r6,1 # s++
addi r5,r5,1 # d++
addi r7,r7,-1 # c--
bri 5b # loop
6:
rtsd r15,8
nop
.end C_SYMBOL_NAME(memcpy)
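
For readers who prefer C, the routine above can be modelled roughly as follows. This is only a sketch of the same idea (align the destination byte by byte, then move whole words, merging two aligned source words with shifts when the source is not aligned), not a replacement for the assembly. The names sketch_memcpy, load_be32, store_be32 and sketch_selftest are made up for illustration. Words are built in big-endian byte order to match the shift directions in the assembly, so the model behaves the same on any host. Like the assembly, it reads the whole aligned word around the first and last source bytes, so in this host model the caller has to leave 3 spare bytes on each side of the source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build and store 32-bit words in big-endian byte order, independent of the
 * host's own endianness. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}

/* Hypothetical C model of the copy strategy; returns dst like memcpy. */
void *sketch_memcpy(void *dst, const void *src, size_t c)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(c == 0 || (dst != NULL && src != NULL));
    if (c >= 4) {
        /* Head: single bytes until the destination is word aligned. */
        while (((uintptr_t)d & 3) != 0) { *d++ = *s++; c--; }

        size_t n = c / 4;                        /* whole words to move */
        unsigned off = (unsigned)((uintptr_t)s & 3);
        if (off == 0) {                          /* aligned case */
            for (; n > 0; n--, d += 4, s += 4, c -= 4)
                store_be32(d, load_be32(s));
        } else if (n > 0) {                      /* non-aligned cases */
            const unsigned char *as = s - off;   /* aligned source address */
            unsigned lsh = 8 * off, rsh = 32 - lsh;
            uint32_t h = load_be32(as) << lsh;   /* bytes kept from 1st word */
            as += 4;
            for (; n > 0; n--, as += 4, d += 4, s += 4, c -= 4) {
                uint32_t v = load_be32(as);
                store_be32(d, h | (v >> rsh));   /* merge two source words */
                h = v << lsh;
            }
        }
    }
    while (c > 0) { *d++ = *s++; c--; }          /* balance of single bytes */
    return dst;
}

/* Compare against the library memcpy for every alignment pair and a range
 * of lengths; returns 1 if all copies (and the guard bytes) match. */
int sketch_selftest(void)
{
    unsigned char src[64], dst[64], ref[64];

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (unsigned char)(i * 7 + 1);
    for (size_t doff = 0; doff < 4; doff++)
        for (size_t soff = 0; soff < 4; soff++)
            for (size_t len = 0; len <= 40; len++) {
                memset(dst, 0xAA, sizeof dst);
                memset(ref, 0xAA, sizeof ref);
                memcpy(ref + 8 + doff, src + 8 + soff, len);
                sketch_memcpy(dst + 8 + doff, src + 8 + soff, len);
                if (memcmp(dst, ref, sizeof dst) != 0)
                    return 0;
            }
    return 1;
}
```

The self-test keeps 8 bytes of padding in front of the data and more behind it, which is what makes the word-wide reads safe in the model.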