[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device *dev)
Hi,
I've been following the discussion on speeding up the ethernet with
interest. I had a look at the do_csum() routine in the
arch/microblaze/lib/checksum.c file and produced an assembler version in the
hopes of providing some speedup.
I'm not familiar with how to best fold this into the normal code base - I
just commented out the do_csum in the c file and added assly_csum.o to the
makefile in that directory.
I've attached the assly_csum.S file to this message.
I'd be interested if the 100us checksum calculation in your test case below
is changed much by using this optimization.
Jim Law
Iris Power LP
----- Original Message -----
From: "Brettschneider Falk" <fbrettschneider@xxxxxxxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Thursday, April 24, 2008 12:25 PM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff
*orig_skb,structnet_device *dev)
Hi,
John Williams wrote:
> That's the primary entry point for the upper network layer
> to send a
> packet via this device. The alignment requirement is deeply
> ingrained
> into the kernel. The problem comes about because ethernet
> frame headers
> are not a multiple of 4 bytes long, but kernel expects (or at least
> assumes) that IP header fields are word aligned.
Here are 2 patches, applying both of them makes xenet_FifoSend() much
faster. Patch skbuff.c.diff makes the data aligned for xenet_FifoSend(),
patch adapter.c.diff checks the alignment and avoids memcpy() in case of
that it's aligned.
Patch skbuff.c.diff slows down tpc.c:tcp_sendmsg() because the memcpy()
there has offset=2 now, but the second memcpy() which had offset=2 anyway is
completely prevented now. So after all it's better now.
What do you think about these 2 patches?
My user program was permanently sending data. Time measuring with an
oscilloscop showed that most of the time the spent time in xenet_FifoSend()
is reduced from about 600us to about 200us. The 200us consists of 100us
checksum calculation plus 100us spent in XEmac_FifoSend().
Though I noticed that the call of xenet_FifoSend() is just every
millisecond. That is why the data throughput (Megabytes per second) to my PC
keeps the same. Is xenet_FifoSend() called by a timer with such frequency of
1/ms? If yes, can it be reduced?
CU, F@lk
###################################-*-asm*-
#
# Copyright 2008 (c) Jim Law - Iris LP All rights reserved.
#
# This file is subject to the terms and conditions of the GNU General
# Public License. See the file COPYING in the main directory of this
# archive for more details.
#
# Written by Jim Law <jlaw@xxxxxxxxxxxxx>
#
# intended to replace the do_csum in checksum.c in arch/microblaze/lib
#
#
# assly_csum.s
#
# Attempt at quicker checksum for ethernet
# Input : Operand1 in Reg r5 - starting address of buffer
# Operand2 in Reg r6 - number of bytes to perform checksum on
# Output: Result in Reg r3 - checksum 16
#
#
# Explanation:
# Perform modulo 16 bit checksum on a (possibly unaligned)
# big-endian buffer of size spec'd in bytes
#
#
#######################################
#include <asm/clinkage.h>
.globl C_SYMBOL_NAME(do_csum)
.ent C_SYMBOL_NAME(do_csum)
C_SYMBOL_NAME(do_csum):
beqid r6,1f # if num of bytes is zero, return with zero csum
addik r3,r0,0 # clear return csum - IN DELAY SLOT
andi r7,r5,0xfffffffc # calc buffer word address, implied sign extend
add r8,r5,r6 # end address = buff address + num of bytes
addi r8,r8,-1 # end address = end address - 1
andi r4,r5,3 # temp = buff address & 3
bslli r4,r4,3 # temp = temp * 8
xori r9,r0,0xffffffff # r9 = 0xffffffff, implied sign extend
bsrl r9,r9,r4 # startmask = r9 >> temp
andi r4,r8,3 # temp = end address & 3
bslli r4,r4,3 # temp = temp * 8
rsubi r4,r4,24 # temp = 24 - temp
xori r10,r0,0xffffffff # r10 = 0xffffffff, implied sign extend
bsll r10,r10,r4 # startmask = r9 << temp
bsrli r4,r5,2 # temp = buff address >> 2
bsrli r11,r8,2 # word count = end address >> 2
rsub r11,r4,r11 # word count = word count - temp
# add in the first word, appropriately masked
lwi r3,r7,0 # csum = *word address
and r3,r3,r9 # csum = csum & startmask
bneid r11,2f # if word count != 0, go do more than one word sum
addi r7,r7,4 # word address++ - IN DELAY SLOT
# when get here, then all bytes to be summed are in one word
brid 3f # goto sum the half-words
and r3,r3,r10 # csum = csum & endmask - IN DELAY SLOT
2: # when get here, then more than one word to be summed
addi r11,r11,-1 # word count = word count - 1
addi r0,r0,0 # clear carry for add with carry in loop
5: beqi r11,4f # if no more words to do, leave loop
lwi r4,r7,0 # temp = *word address
addc r3,r3,r4 # csum = csum + temp + carry
nop # delay to make sure carry is updated before mfs
mfs r4,rmsr # save carry in temp
addi r7,r7,4 # word address++
addi r11,r11,-1 # word count = word count - 1
brid 5b
mts rmsr,r4 # restore carry from temp - IN DELAY SLOT
4: # deal with last (possibly partial) word
lwi r4,r7,0 # temp = *word address
and r4,r4,r10 # temp = temp & endmask
addc r3,r3,r4 # csum = csum + temp, include carry
3: # sum the halfwords in the result
bsrli r4,r3,16 # temp = csum >> 16
andi r3,r3,0x0000ffff # csum = csum & 0x0000ffff, need .imm here to override sign ext.
addc r3,r3,r4 # csum = csum + temp + carry, this might have carried out of ls half-word
# .. so add the high half-word back in again
bsrli r4,r3,16 # temp = csum >> 16
andi r3,r3,0x0000ffff # csum = csum & 0x0000ffff, need .imm here to override sign ext.
add r3,r3,r4 # csum = csum + temp, this will never carry out of ls half-word
andi r4,r5,1 # temp = buff address & 1, check if high/low bytes need to be swapped
beqi r4,1f # started on half-word boundary, ok to not swap
# swap the high / low bytes in the 16 bit csum
bsrli r4,r3,8 # temp = csum >> 8, no high bits on, so no need to mask
andi r3,r3,0x000000ff # csum = csum & 0xff, implied sign extend ok
bslli r3,r3,8 # csum = csum << 8
or r3,r3,r4 # csum = csum | temp
# Restore Frame and return
1: rtsd r15,8
nop
.end C_SYMBOL_NAME(do_csum)