Re: [microblaze-uclinux] kernel BUG at sched.c:687!
On Friday 21 April 2006 12:41, Brettschneider Falk wrote:
> Hi,
> given the assumption there's a kernel stack overflow on high IRQ load, what
> would you suggest to fix that? I suppose a bigger stack wouldn't really
> help but just delay the problem to a higher IRQ load limit.
> CU, F@lk
Kernel stack overflow is rare but possible. I added the stack size check because
I use interrupt virtualization code that can overflow the kernel stack
under high interrupt frequency. With the usual uClinux/Linux code this problem
cannot happen. But since you don't know where the problem is, it is better
to rule out a kernel stack overflow first.
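The check I posted (quoted further down) boils down to the arithmetic below. This is only a C sketch for illustration, not code from the kernel: the names kstack_used, kstack_check and max_used are made up here, the 8192-byte stack area and the 7500-byte threshold are taken from the quoted assembly, and the signed types are an assumption meant to mirror the signed bgei/blei branches.

```c
#include <stdint.h>

#define KSTACK_SIZE  0x2000u  /* 8192 bytes: task_struct plus kernel stack */
#define KSTACK_LIMIT 0x1d4c   /* 7500 bytes: alarm threshold from the assembly */

/* Hypothetical variable; the assembly keeps this value at address 0xc64. */
static int32_t max_used;

/*
 * Bytes of kernel stack in use. The stack grows down from the top of the
 * 8 KB area whose base is the value of "current" (task_base here).
 */
static int32_t kstack_used(uint32_t task_base, uint32_t sp)
{
    return (int32_t)(task_base + KSTACK_SIZE - sp);
}

/*
 * Records the largest usage seen and returns 1 when the usage exceeds the
 * threshold. The assembly spins forever at that point instead of returning.
 */
static int kstack_check(uint32_t task_base, uint32_t sp)
{
    int32_t used = kstack_used(task_base, sp);

    if (used > max_used)
        max_used = used;

    return used > KSTACK_LIMIT;
}
```

The result only means something when sp really is a kernel stack pointer. If sp still holds a user stack pointer, the difference is arbitrary and the check can trip without any overflow.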
This mailing thread began with the BUG at sched ... and this BUG points to a
corruption in the process descriptor. The source of the problem could be in the
context switch code (which is well tested, obviously) or in a chained-interrupt
problem where the previous kernel stack is not restored correctly. Under high
interrupt frequency this could be more problematic, but I don't think this is
the problem, since I have checked the implementation and, as JW comments, a bug
there would show up very soon. Another point: your testbench relies heavily
on signals, and the signal implementation in the kernel is very tricky and
complex. Usually signals are not very frequent, and perhaps this part has not
been tested enough. Threads add more uncertainty to the mystery.
>
> > -----Original Message-----
> > From: owner-microblaze-uclinux@xxxxxxxxxxxxxx
> > [mailto:owner-microblaze-uclinux@xxxxxxxxxxxxxx]On Behalf Of Alejandro
> > Lucero
> > Sent: Friday, April 21, 2006 1:04 PM
> > To: microblaze-uclinux@xxxxxxxxxxxxxx
> > Subject: Re: [microblaze-uclinux] kernel BUG at sched.c:687!
> >
> >
> > Hi Prasad,
> >
> > This is my fault. I sent an email an hour ago but it seems to be
> > delayed. I made a mistake in my instructions. The code must go AFTER
> > SAVE_STATE. Otherwise the stack pointer in use could be the user stack
> > pointer, and the check will fail, giving a false positive.
> >
> > Sorry again.
> >
> > On Friday 21 April 2006 09:41, DeviPrasad Natesan wrote:
> > > Hi alexandro,
> > > I added your stack checking code (with the current pointer updated to
> > > mine), and strange as it seems, the code goes into the loop just when
> > > Linux is coming up. (It seems to work normally without this code.) I
> > > checked with our custom hardware and also the ml403 eval board, and it
> > > behaves the same. I removed all the custom applications, so it is a
> > > simple system with ethernet and uart, with root mounted on jffs2. This
> > > by itself seems to corrupt the kernel stack (according to this test
> > > code).
> > >
> > > Does it mean that the last "current" task corrupts the stack? Got to
> > > continue debugging tomorrow. Seems totally illogical at this point.
> > >
> > > - Prasad
> > >
> > > On 4/20/06, Alejandro Lucero <alucero@xxxxxxxxx> wrote:
> > > > On Thursday 20 April 2006 13:29, Brettschneider Falk wrote:
> > > > > Hi Alejandro,
> > > > > thanks! Am I right that it will loop infinitely in case of a kernel
> > > > > stack overflow? If yes, how could I write a fixed value to a register
> > > > > address in that loop (to e.g. switch an LED on)? Going to try that soon...
> > > > > Cheers, F@lk
> > > >
> > > > If the CPU executes this infinite loop, it means the kernel stack usage
> > > > is more than 7500 bytes, which is very close to 8192. Moreover, the
> > > > real limit is 8192 - sizeof(struct task_struct), since the process
> > > > descriptor is at the bottom. This does not imply a stack overflow, but
> > > > the odds are high.
> > > >
> > > > If you know the LED address (look at
> > > > arch/microblaze/platform/uclinux-auto/autoconfig.in) you can use it with
> > > > a swi instruction:
> > > >
> > > > addi r11, r0, 0x1; /* Assuming that writing 0x00000001 turns an LED on */
> > > > swi r11, r0, led_address;
> > > >
> > > > You can add this before the first nop instruction.
> > > >
> > > > > > -----Original Message-----
> > > > > > From: owner-microblaze-uclinux@xxxxxxxxxxxxxx
> > > > > > [mailto:owner-microblaze-uclinux@xxxxxxxxxxxxxx]On Behalf Of
> > > > > > Alejandro Lucero
> > > > > > Sent: Thursday, April 20, 2006 1:50 PM
> > > > > > To: microblaze-uclinux@xxxxxxxxxxxxxx
> > > > > > Subject: Re: [microblaze-uclinux] kernel BUG at sched.c:687!
> > > > > >
> > > > > > On Thursday 20 April 2006 10:31, Brettschneider Falk wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Alejandro Lucero wrote:
> > > > > > > > I assumed you are using the entry.S without my patch reported
> > > > > > > > two days ago, aren't you?
> > > > > > >
> > > > > > > I've tried JW's version of your patch, but it doesn't help as
> > > > > > > a bugfix. My environment is one active user app with several
> > > > > > > threads (SCHED_RR), a high IRQ frequency (about 2 per
> > > > > > > millisecond), many thread switches, and many locks/unlocks of
> > > > > > > semaphores and mutexes. From time to time one thread of that
> > > > > > > application calls pthread_cancel() on another thread.
> > > > > > > Often (after about 20 kill actions) this leads to either a
> > > > > > > Linux crash (with several versions of "kernel BUG at
> > > > > > > sched.c:***"), or just a total hang, or an exit of the app with
> > > > > > > return code 5. (The statistical distribution is: scheduler bug
> > > > > > > displayed = 0.01%, Linux hang = 60%, process exit = the rest.)
> > > > > > > I don't have the problems if either the IRQ frequency is very
> > > > > > > low or no threads are cancelled. That's why I asked you if you
> > > > > > > ever tried to kill threads in your application; this increases
> > > > > > > the chance of a Linux crash enormously here.
> > > > > >
> > > > > > Perhaps you could do some tests to rule out a kernel stack
> > > > > > overflow. Try putting this in your entry.S file, but update the
> > > > > > address of the "current" pointer and make sure you are not using
> > > > > > memory at 0x554, 0x558, 0xc64 and 0xc68 (surely LMB memory).
> > > > > > This code looks at the kernel stack usage, and if it is greater
> > > > > > than 0x1d4c (7500 bytes) the system will execute an endless loop
> > > > > > with interrupts disabled. The maximum kernel stack usage seen so
> > > > > > far is stored at 0xc64.
> > > > > >
> > > > > > Remember to update current, which in my kernel is at address
> > > > > > 0x0213472c. Use
> > > > > > objdump -t image.elf | grep current
> > > > > >
> > > > > > Try putting this in ENTRY(irq) just after swi r1, r0, ENTRY_SP
> > > > > > and before SAVE_STATE
> > > > > >
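> > > > > > swi r11, r0, 0x554 /* save r11 and r12 in scratch memory */
> > > > > > swi r12, r0, 0x558
> > > > > > lwi r11, r0, 0x0213472c; /* r11 = current */
> > > > > > addi r11, r11, 0x2000; /* r11 = top of the 8 KB stack area */
> > > > > > rsub r11, r1, r11; /* r11 = top - sp = bytes used */
> > > > > > lwi r12, r0, 0xc64; /* r12 = maximum seen so far */
> > > > > > swi r11, r0, 0xc68; /* keep the current usage at 0xc68 */
> > > > > > rsub r11, r11, r12; /* r11 = maximum - used */
> > > > > > bgei r11, 1f; /* no new maximum */
> > > > > > lwi r11, r0, 0xc68;
> > > > > > swi r11, r0, 0xc64; /* record the new maximum */
> > > > > > 1:
> > > > > > lwi r11, r0, 0x0213472c; /* recompute the bytes used */
> > > > > > addi r11, r11, 0x2000;
> > > > > > rsub r11, r1, r11;
> > > > > > addi r12, r0, 0x1d4c; /* 7500 bytes */
> > > > > > rsub r11, r12, r11; /* r11 = used - 7500 */
> > > > > > blei r11, 2f; /* within the limit */
> > > > > > lwi r11, r0, 0; /* r11 = word at address 0 */
> > > > > > mts rmsr, r11; /* write it to MSR (meant to disable interrupts) */
> > > > > > nop;
> > > > > > nop;
> > > > > > nop;
> > > > > > bri -8; /* loop forever */
> > > > > > 2:
> > > > > > lwi r11, r0, 0x554 /* restore r11 and r12 */
> > > > > > lwi r12, r0, 0x558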
> > > > > >
> > > > > > > Cheers, F@lk
> > > > > > > ___________________________
> > > > > > > microblaze-uclinux mailing list
> > > > > > > microblaze-uclinux@xxxxxxxxxxxxxx
> > > > > > > Project Home Page :
> > > > >
> > > > > http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
> > > > >
> > > > > > Mailing List Archive :
> > > > > > http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/
> > > >
> > > > --
> > > > Alejandro Lucero
> > > > Technical Director
> > > > +34 665 68 71 68
> > > > Valencia (SPAIN)
> > > > www.os3sl.com
--
Alejandro Lucero
Technical Director
+34 665 68 71 68
Valencia (SPAIN)
www.os3sl.com