
Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)



Hi,

I got that posting with the discussion of the block sizes and times - thanks.

Here is a new assembler memcpy for the MB.

For aligned transfers the main inner loop drops from 12 to 8 cycles per transfer compared to the C code.

For un-aligned transfers the main inner loop drops from 17 to 13 cycles per transfer compared to the C code.

... not sure what else I can add at this point. Are there any other significant transfers or checksums that use other functions and might benefit?

Jim Law
Iris Power


----- Original Message -----
From: "Brettschneider Falk" <fbrettschneider@xxxxxxxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Tuesday, April 29, 2008 10:30 AM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)


Hi again,

I wrote:
A mysterious time gap of 570us in the logging above is
between "entered inner while-loop" and "before
sk_stream_alloc_pskb". It's a bit after
__tcp_push_pending_frames. I'm not sure but I reckon it's a
task switch to the receive part of the TCP stack (maybe
because the TCP-packet with an ACK has been received from my PC).
Meanwhile I improved my logging and demystified that time gap. Indeed it's a task switch caused by the reception of an incoming TCP packet (likely the ACK). Look at what happens in tcp.c:tcp_sendmsg():

1.  entered tcp_sendmsg():
2.    entered inner while-loop:            38
3.      before skb_add_data(148 bytes):     6
4.      after  skb_add_data:               36
5.      before __tcp_push_pending_frames:  10
6.        entered xenet_FifoSend(1514):   287
7.        exit    xenet_FifoSend:         278
8.      after  __tcp_push_pending_frames:  59
9.    next loop in inner while-loop:        2
10.     before sk_stream_alloc_pskb():      5
11. ---NET-RX-IRQ-(receive ACK)---->
12. entered FifoRecvHandler(64 bytes):    133
13. exit    FifoRecvHandler:              148
14. ---soft-IRQ-------------------->
15. entered net_rx_action:                 51
16. exit    net_rx_action:                248
17. ---back-to-tcp_sendmsg--------->
18.     after  sk_stream_alloc_pskb:       60
19.     before skb_add_data(180 bytes):    20
20.     after  skb_add_data:               42
21.   before tcp_push:                      8
22.   before TCP_CHECK_TIMER(sk):          41
23.   before release_sock(sk):              2
24.     entered xenet_FifoSend(234):      525
25.     exit    xenet_FifoSend:           215
26.   after  release_sock(sk):             97
27. exit tcp_sendmsg():

a) You see 11., 14. and 17. are context switches between the recv and send tasks.
b) 16. shows we have another memcpy() in FifoRecvHandler(), which takes the bytes delivered by the EMAC and puts them into a TCP packet.
c) 23. to 27. is the sending of the remaining bytes which didn't fit into the TCP packet sent in 6. It's 328-148==180(+54 header)==234.
d) I wonder why there is such a long wait between 23. and 24. Does release_sock(sk) really take so long?
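The frame-size arithmetic in c) can be written out as a small C sketch. The 328, 148 and 54 figures come from the message above. Splitting the 54 header bytes into 14 (Ethernet) + 20 (IPv4) + 20 (TCP), with no IP or TCP options, is an assumption, and the constant and function names are made up for illustration.

```c
#include <stdio.h>

/* Assumed header split (no IP/TCP options); the post only gives the total 54. */
enum {
    SKETCH_ETH_HDR = 14,
    SKETCH_IP_HDR  = 20,
    SKETCH_TCP_HDR = 20,
    SKETCH_HDR_TOTAL = SKETCH_ETH_HDR + SKETCH_IP_HDR + SKETCH_TCP_HDR
};

/* Length of the frame handed to xenet_FifoSend() for a given TCP payload. */
static unsigned frame_len(unsigned payload)
{
    return payload + SKETCH_HDR_TOTAL;
}

/* Print the size of the second frame from the log (steps 24./25.). */
static void print_second_frame(void)
{
    unsigned total = 328;           /* bytes passed to tcp_sendmsg() */
    unsigned first = 148;           /* bytes that went into the first packet */
    unsigned rest  = total - first; /* 180 bytes left over */
    printf("rest=%u frame=%u\n", rest, frame_len(rest));
}
```

Under the same assumption, the 1514-byte frame in step 6. corresponds to a full 1460-byte TCP payload.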

I hope this all helps to understand the time spent for sending in the TCP stack of microblaze-uClinux. Apart from a faster memcpy(), do you see any place where we could improve the speed?

Cheers, Falk

___________________________
microblaze-uclinux mailing list
microblaze-uclinux@xxxxxxxxxxxxxx
Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
Mailing List Archive : http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/

###################################-*-asm*-
#
# Copyright 2008 (c) Jim Law - Iris LP All rights reserved.
#
# This file is subject to the terms and conditions of the GNU General
# Public License.  See the file COPYING in the main directory of this
# archive for more details.
#
# Written by Jim Law <jlaw@xxxxxxxxxxxxx>
#
# intended to replace the memcpy in memcpy.c in arch/microblaze/lib
#
#
#
# assly_memcpy.s
#
# Attempt at quicker memcpy for MicroBlaze
#	Input :	Operand1 in Reg r5 - destination address
#		Operand2 in Reg r6 - source address
#		Operand3 in Reg r7 - number of bytes to transfer
#	Output: Result in Reg r3 - starting destination address
#			
#
# Explanation:
# 	Perform (possibly unaligned) copy of a block of memory
#	between mem locations with size of xfer spec'd in bytes
#
#
#######################################

#include <asm/clinkage.h>

	.globl	C_SYMBOL_NAME(memcpy)
	.ent	C_SYMBOL_NAME(memcpy)

C_SYMBOL_NAME(memcpy):
	addi	r3,r5,0		# move d to return register as value of function

	addi	r4,r0,4		# temp = 4
	cmpu	r4,r4,r7	# temp = c - temp  (unsigned)
	blti	r4,5f		# if temp < 0, less than one word to transfer

	# transfer enough individual bytes to align the destination address
	andi	r4,r5,3		# temp = d & 3
	beqi	r4,16f		# if zero, destination already aligned	
	rsubi	r4,r4,4		# temp = 4 - temp (yields 3, 2, 1 transfers for 1, 2, 3 addr offset)

15:	beqi	r4,16f		# if no bytes left to transfer, transfer the bulk
	lbui	r11,r6,0	# h = *s
	sbi	r11,r5,0	# *d = h
	addi	r6,r6,1		# s++
	addi	r5,r5,1		# d++
	addi	r4,r4,-1	# temp--
	brid	15b		# loop
	addi	r7,r7,-1	# c-- (IN DELAY SLOT)

16:	andi	r4,r6,3		# temp = s & 3
	bnei	r4,10f		# if temp != 0, unaligned transfers needed

######### case when aligned ########################
	addi	r12,r0,4	# v = 4
	cmpu	r4,r12,r7	# temp = c - v (unsigned)
	blti	r4,5f		# if temp < 0 quit loop
	bsrli	r9,r7,2		# n = c/4
	add	r10,r0,r0	# offset = 0
	
1:	lw	r4,r6,r10	# temp = *(s+offset)
	sw	r4,r5,r10	# *(d+offset) = temp
	addi	r9,r9,-1	# n--
	bneid	r9,1b		# loop
	addi	r10,r10,4	# offset += 4 (IN DELAY SLOT)
	
	add	r6,r6,r10	# s = s + offset
	add	r5,r5,r10	# d = d + offset
	rsubi	r10,r10,0	# offset = - offset
	brid	5f		# all done, go finish any remaining bytes
	add	r7,r7,r10	# c = c + offset (IN DELAY SLOT)


######### all cases when non-aligned ############
10:	bsrli	r9,r7,2		# n = c/4
	beqi	r9,5f		# if n == 0 just do balance

	add	r10,r0,r0	# offset = 0

	andi	r8,r6,0xfffffffc	# calc aligned source address 'as'
	lw	r11,r8,r10	# h = *(as + 0)
	addi	r8,r8,4		# as++

	addi	r4,r4,-1	
	beqi	r4,101f		# temp was 1 => 1 byte offset
	addi	r4,r4,-1
	beqi	r4,102f		# temp was 2 => 2 byte offset
	bri	103f		# temp was 3 => 3 byte offset


######### case when offset = 1 byte #############
101:	bslli	r11,r11,8	# h = h << 8

21:	lw	r12,r8,r10	# v = *(as + offset)
	bsrli	r4,r12,24	# temp = v >> 24
	or	r4,r11,r4	# temp = h | temp
	sw	r4,r5,r10	# *(d + offset) = temp
	bslli	r11,r12,8	# h = v << 8
	addi	r9,r9,-1	# n--
	bneid	r9,21b		# loop
	addi	r10,r10,4	# offset += 4 (IN DELAY SLOT)

31:	addi	r4,r0,-3	# temp = addr_adj
	bri	39f

######### case when offset = 2 byte #############
102:	bslli	r11,r11,16	# h = h << 16

22:	lw	r12,r8,r10	# v = *(as + offset)
	bsrli	r4,r12,16	# temp = v >> 16
	or	r4,r11,r4	# temp = h | temp
	sw	r4,r5,r10	# *(d + offset) = temp
	bslli	r11,r12,16	# h = v << 16
	addi	r9,r9,-1	# n--
	bneid	r9,22b		# loop
	addi	r10,r10,4	# offset += 4 (IN DELAY SLOT)

32:	addi	r4,r0,-2	# temp = addr_adj
	bri	39f

######### case when offset = 3 byte #############
103:	bslli	r11,r11,24	# h = h << 24

23:	lw	r12,r8,r10	# v = *(as + offset)
	bsrli	r4,r12,8	# temp = v >> 8
	or	r4,r11,r4	# temp = h | temp
	sw	r4,r5,r10	# *(d + offset) = temp
	bslli	r11,r12,24	# h = v << 24
	addi	r9,r9,-1	# n--
	bneid	r9,23b		# loop
	addi	r10,r10,4	# offset += 4 (IN DELAY SLOT)

33:	addi	r4,r0,-1	# temp = addr_adj
	#bri	39f		# fall thru

##################################################

39:	
	add	r8,r8,r10	# as = as + offset
	add	r5,r5,r10	# d = d + offset
	rsubi	r10,r10,0	# offset = - offset
	add	r7,r7,r10	# c = c + offset
	add	r6,r8,r4	# s = as + addr_adj (addr_adj is negative)

	# get here to do balance of individual bytes
5:	beqi	r7,6f		# if no bytes left to transfer, exit
	lbui	r4,r6,0		# temp = *s
	sbi	r4,r5,0		# *d = temp
	addi	r6,r6,1		# s++
	addi	r5,r5,1		# d++
	addi	r7,r7,-1	# c--
	bri	5b		# loop

6:
	rtsd	r15,8
	nop

.end C_SYMBOL_NAME(memcpy)
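
For readers who want to check the unaligned inner loops (labels 21, 22 and 23) without MicroBlaze hardware, here is a rough C model of the shift-and-merge technique they use. This is only a sketch, not part of the patch. The function names and the idea of modelling memory as a byte array are made up for illustration, and the words are assembled in big-endian order because the shift directions in the assembly assume a big-endian MicroBlaze. The byte-wise head and tail loops (labels 15 and 5) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load/store a 32-bit word in big-endian byte order, independent of the host. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

static void store_be32(uint8_t *p, uint32_t w)
{
    p[0] = (uint8_t)(w >> 24);
    p[1] = (uint8_t)(w >> 16);
    p[2] = (uint8_t)(w >> 8);
    p[3] = (uint8_t)w;
}

/*
 * Hypothetical model of the unaligned word loop.
 *   mem : byte array standing in for memory; index 0 counts as word aligned
 *   d   : destination index, already aligned (multiple of 4)
 *   s   : source index with s % 4 == 1, 2 or 3
 *   n   : number of whole words to copy (n >= 1)
 * Like the assembly, it only ever loads aligned source words.
 */
static void shift_merge_copy(uint8_t *mem, size_t d, size_t s, size_t n)
{
    assert(d % 4 == 0 && s % 4 != 0 && n >= 1);

    size_t   k     = s & 3;            /* byte offset of s within its word */
    size_t   as    = s - k;            /* aligned source index 'as'        */
    unsigned left  = 8u * (unsigned)k; /* 8, 16 or 24                      */
    unsigned right = 32u - left;       /* 24, 16 or 8                      */

    uint32_t h = load_be32(mem + as) << left;   /* h = *as << left */
    as += 4;

    for (size_t off = 0; n > 0; n--, off += 4) {
        uint32_t v = load_be32(mem + as + off);       /* v = *(as + offset) */
        store_be32(mem + d + off, h | (v >> right));  /* *(d + offset)      */
        h = v << left;                                /* carry the leftover */
    }
}
```

Each loop pass does one aligned load and one aligned store, and carries the leftover bytes of the loaded word in h to the next pass, which is the same pattern as the 21/22/23 loops above.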