
Re: [microblaze-uclinux] bad page fault kernel panics



Hi Gregori,

> Hi,
> 
> I'm using the PetaLogix subversion snapshot with MMU support enabled on
> the Spartan-3A DSP 1800 Board with MicroBlaze 7.10.d and I was
> accidentally running into bad page fault panics over and over again - 
> sometimes during boot, sometimes later. In the meantime I found a way
> to force this panic by running
> 
>   # while [ -d / ] ; do ls ; done
> 
> which will end in said bad page fault panic sooner or later (but usually
> within the first minute of running)

I haven't tested your commands, but I think I know where the problem is.
Please open arch/microblaze/kernel/entry.S; lines 743-747 should look like this:

	swi	r11, r0, TOPHYS(PER_CPU(KM));	/* Now we're in kernel-mode. */
	lwi	r31, r0, TOPHYS(PER_CPU(CURRENT_SAVE));	/* load current - get the saved current */
2:
	swi	r0, r1, PTO+PT_SYSCALL; /* Save away the syscall number.  */



Please change it to this form (just move the 2: label up one line):
	swi	r11, r0, TOPHYS(PER_CPU(KM));	/* Now we're in kernel-mode. */
2:
	lwi	r31, r0, TOPHYS(PER_CPU(CURRENT_SAVE));	/* load current - get the saved current */
	swi	r0, r1, PTO+PT_SYSCALL; /* Save away the syscall number.  */


Could you please test it and send me your results?

Thanks,
Michal


> Registers, Stack and Call Trace look like this:
> 
> Stack:
>   c04cce68 c04d43a0 c04cced0 c7df52b4 00000000 00000000 00000000 c0227a7c
>   00000000 c0227b4c 00000001 00000007 00010000 00000001 c01ee55c 00000001
>   c0226000 00000000 00000008 00000000 00000000 c0019ba4 00000000 00000000
> Call Trace:
> [<c0019ba4>] update_process_times+0x94/0xa0
> [<c00098c0>] account_system_time+0x10/0x16c
> [<c0001c20>] timer_interrupt+0x60/0x98
> [<c002cf64>] handle_IRQ_event+0x54/0xb4
> [<c0001c2c>] timer_interrupt+0x6c/0x98
> [<c002d070>] __do_IRQ+0xac/0x140
> [<c0068334>] new_inode+0xc/0x9c
> [<c002cf64>] handle_IRQ_event+0x54/0xb4
> [<c00019fc>] do_IRQ+0x3c/0x94
> [<c00880c0>] pid_revalidate+0x24/0x10c
> [<c0001bfc>] timer_interrupt+0x3c/0x98
> [<c0059428>] do_lookup+0x84/0x1e4
> [<c0005ec4>] irq_call+0x0/0x8
> [<c005b2f0>] __link_path_walk+0x844/0xee8
> [<c005b394>] __link_path_walk+0x8e8/0xee8
> [<c0089d08>] proc_pid_lookup+0x30/0x1d4
> [<c00f6e4c>] number+0x2c0/0x3f4
> [<c00f7a98>] __umodsi3+0x8c/0xbc
> [<c005b2f0>] __link_path_walk+0x844/0xee8
> [<c00f730c>] vsnprintf+0x38c/0x6dc
> [<c0001bfc>] timer_interrupt+0x3c/0x98
> [<c005ba94>] link_path_walk+0x100/0x28c
> [<c0019b84>] update_process_times+0x74/0xa0
> [<c00f7730>] sprintf+0x30/0x44
> [<c0001c2c>] timer_interrupt+0x6c/0x98
> [<c0088a1c>] proc_self_follow_link+0x28/0x54
> [<c006855c>] touch_atime+0xc0/0x150
> [<c005ad64>] __link_path_walk+0x2b8/0xee8
> [<c005ba94>] link_path_walk+0x100/0x28c
> [<c00074d4>] do_page_fault+0x2e0/0x444
> [<c005bea0>] do_path_lookup+0xac/0x2a0
> [<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
> [<c0056cc4>] do_execve+0x3c/0x238
> [<c005cdf0>] __path_lookup_intent_open+0x64/0xd8
> [<c005cdc4>] __path_lookup_intent_open+0x38/0xd8
> [<c005cf00>] path_lookup_open+0x8/0x1c
> [<c0056694>] open_exec+0x28/0x108
> [<c0056cd8>] do_execve+0x50/0x238
> [<c0005240>] sys_execve_wrapper+0x0/0x10
> [<c0056cc4>] do_execve+0x3c/0x238
> [<c0001f9c>] sys_execve+0x54/0x94
> [<c0001f6c>] sys_execve+0x24/0x94
> [<c0004f78>] sc+0x10/0x18
> [<c0004f78>] sc+0x10/0x18
> 
> Oops: kernel access of bad area, sig: 11
>  Registers dump: mode=1
>  r1=C02279CC, r2=00000000, r3=00000861, r4=0000010F
>  r5=00000007, r6=00000800, r7=C02279E4, r8=00000018
>  r9=00000001, r10=C0226000, r11=000041AA, r12=C0003AF0
>  r13=00000000, r14=C022796C, r15=C0005AE8, r16=49CAF500
>  r17=00000001, r18=00000000, r19=00000007, r20=4817FFF4
>  r21=00000000, r22=00000000, r23=00000000, r24=00000001
>  r25=00000000, r26=FFFFFFFF, r27=00000002, r28=0000000A
>  r29=00000003, r30=0000000E, r31=C03287B0, rPC=C0003B08
>  msr=000041AA, ear=0000010F, esr=000000B2, fsr=C6AA9E80
> Kernel panic - not syncing: Aiee, killing interrupt handler!
>  <0>Rebooting in 120 seconds..
> 
> 
> The Program Counter is always pointing to the same address, which is
> inside the _unaligned_data_exception code as objdump shows:
> 
> c0003af0 <_unaligned_data_exception>:
> c0003af0:   a50303e0    andi    r8, r3, 992
> c0003af4:   65080002    bsrli   r8, r8, 2
> c0003af8:   a4c30400    andi    r6, r3, 1024
> c0003afc:   be260068    bneid   r6, 104     // c0003b64
> c0003b00:   a4c30800    andi    r6, r3, 2048
> 
> c0003b04 <ex_lw_vm>:
> c0003b04:   be060034    beqid   r6, 52      // c0003b38
> c0003b08:   e0a40000    lbui    r5, r4, 0   <--- here it is
> c0003b0c:   b000c01e    imm -16354
> c0003b10:   30c0ce40    addik   r6, r0, -12736
> c0003b14:   f0a60000    sbi r5, r6, 0
> c0003b18:   e0a40001    lbui    r5, r4, 1
> c0003b1c:   f0a60001    sbi r5, r6, 1
> c0003b20:   e0a40002    lbui    r5, r4, 2
> c0003b24:   f0a60002    sbi r5, r6, 2
> c0003b28:   e0a40003    lbui    r5, r4, 3
> c0003b2c:   f0a60003    sbi r5, r6, 3
> c0003b30:   b8100020    brid    32      // c0003b50
> c0003b34:   e8660000    lwi r3, r6, 0
> 
> 
> I'm a bit confused, as the ESR value is 0xb2, which indicates a Data TLB
> Miss Exception, but it gets stuck executing the handler function for
> Unaligned Data Exception - which again might be caused by the 0x10f in
> r4, which shouldn't be a valid address anyway. (Other times I got 0xff,
> 0x10b, 0x102, 0xffffff79 etc. in r4.)
> 
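For reference, the two values can be decoded with the same masks the quoted handler applies to r3 (andi with 992, 1024 and 2048). The sketch below is only an illustration: the cause codes are my reading of the MicroBlaze reference guide, the assumption that r3 holds the ESR saved on entry to the first exception is mine, and decode_esr is a made-up helper name. Note that the handler itself shifts the register field right by 2, not 5, presumably to index a table; the sketch shifts by 5 to get the plain register number.

```python
# Masks as used by _unaligned_data_exception in the objdump output above.
EC_MASK = 0x01F   # low five bits: exception cause
RX_MASK = 0x3E0   # 992:  source/destination register field
S_MASK = 0x400    # 1024: set for a store, clear for a load
W_MASK = 0x800    # 2048: set for a word access (unaligned cause only)

# Only the two causes discussed in this thread (assumed encodings).
CAUSES = {0x01: "unaligned data access", 0x12: "data TLB miss"}


def decode_esr(esr):
    """Split an ESR value into the fields the handler looks at."""
    ec = esr & EC_MASK
    return {
        "cause": CAUSES.get(ec, "other (0x%02x)" % ec),
        "reg": (esr & RX_MASK) >> 5,
        "store": bool(esr & S_MASK),
        # Taken from the unaligned handler; this bit may mean something
        # else for other exception causes.
        "word": bool(esr & W_MASK),
    }


if __name__ == "__main__":
    for name, value in (("esr", 0xB2), ("r3", 0x861)):
        print(name, hex(value), decode_esr(value))
```

Under those assumptions, r3=0x861 decodes as an unaligned word load into r3, which matches the ex_lw_vm path and the odd address in r4, while esr=0xb2 decodes as a data TLB miss on a load into r5, which matches the lbui r5, r4, 0 at the reported PC. That would mean the TLB miss was taken inside the unaligned handler rather than instead of it.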
> 
> After the 120 seconds, when the kernel calls its reboot function, the
> Stack and Call Trace look like the following - where you can see the
> exception handler functions:
> 
> Stack:
>   c001ed7c c01ad2c4 00003c96 00001998 00003330 00008000 00000000 c000cc18
>   c000cbec 4817fff4 00000000 c03287b0 c0227934 0000000b c0010d6c c01ae8d8
>   00000078 c0375ac8 00000000 00005000 00000000 c02277c4 c0226000 0000000b
> Call Trace:
> [<c001ed7c>] emergency_restart+0xc/0x20
> [<c000cc18>] panic+0x154/0x1dc
> [<c000cbec>] panic+0x128/0x1dc
> [<c0010d6c>] do_exit+0x624/0x93c
> [<c0004060>] die+0x90/0x98
> [<c000404c>] die+0x7c/0x98
> [<c00071e8>] bad_page_fault+0xcc/0xd8
> [<c0003b08>] ex_lw_vm+0x4/0x34
> [<c0007328>] do_page_fault+0x134/0x444
> [<c0003b08>] ex_lw_vm+0x4/0x34
> [<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
> [<c00081dc>] task_running_tick+0x17c/0x2a8
> [<c0003af0>] _unaligned_data_exception+0x0/0x14
> [<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
> [<c0003b08>] ex_lw_vm+0x4/0x34
> [<c0019ba4>] update_process_times+0x94/0xa0
> [<c00098c0>] account_system_time+0x10/0x16c
> [<c0001c20>] timer_interrupt+0x60/0x98
> ...
> 
> 
> So my guess is, the actual panic is caused by a missing entry in the
> __ex_table section for that address (0xc0003b08), but I doubt that this
> is the real reason - as mentioned above, the data in r4 doesn't look
> like valid addresses. However, the Call Trace always starts with an
> interrupt handling function - currently always some timer interrupt
> handler, but I also had a PS/2 keyboard driver and pressing some keys
> forced this panic as well, showing the keyboard ISR functions. For me
> it looks like some bad interaction between page fault handling and
> interrupts, but that's just another guess..
> 
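As a rough model of the __ex_table guess above: on a kernel-mode fault the kernel looks up the faulting PC in a sorted table of (instruction address, fixup address) pairs and only resumes if it finds an entry; otherwise it ends in the "kernel access of bad area" oops. The sketch below is a simplified illustration, not the kernel code. The function names and the table entry are hypothetical; only the PC 0xc0003b08 comes from the report.

```python
import bisect


def find_fixup(ex_table, pc):
    """Return the fixup address for pc, or None if pc has no entry.

    ex_table is a list of (insn_addr, fixup_addr) pairs sorted by insn_addr.
    """
    addrs = [insn for insn, _ in ex_table]
    i = bisect.bisect_left(addrs, pc)
    if i < len(addrs) and addrs[i] == pc:
        return ex_table[i][1]
    return None


def kernel_fault(ex_table, pc):
    """Model the decision taken for a fault raised in kernel mode."""
    fixup = find_fixup(ex_table, pc)
    if fixup is None:
        return "oops at 0x%08x: kernel access of bad area" % pc
    return "resume at fixup 0x%08x" % fixup


if __name__ == "__main__":
    # Hypothetical table with one unrelated entry.
    table = [(0xC0003B18, 0xC0003C00)]
    print(kernel_fault(table, 0xC0003B08))
    print(kernel_fault(table, 0xC0003B18))
```

With no entry for 0xc0003b08 the model takes the oops path, as in the trace. An entry would only hide the symptom, though; it would not explain why r4 held an invalid address in the first place.
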
> Now basically my question is, if that's a known issue or at least if
> the situation can be reproduced by someone else - or if the problem
> might be within my hardware configuration or ..well, dunno.. any hint
> is welcome. If more information is needed, let me know.
> 
> Thanks in advance,
> Sven
> 
> ___________________________
> microblaze-uclinux mailing list
> microblaze-uclinux@xxxxxxxxxxxxxx
> Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
> Mailing List Archive : http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/
> 
> 