Re: [microblaze-uclinux] bad page fault kernel panics
Hi Gregori,
> Hi,
>
> I'm using the PetaLogix subversion snapshot with MMU support enabled on
> the Spartan-3A DSP 1800 Board with MicroBlaze 7.10.d and I was
> accidentally running into bad page fault panics over and over again -
> sometimes during boot, sometimes later. In the meantime I found a way
> to force this panic by running
>
> # while [ -d / ] ; do ls ; done
>
> which ends in said bad page fault panic sooner or later (usually
> within the first minute of running).
I haven't tested your commands, but I think I know where the problem is.
Please go to arch/microblaze/kernel/entry.S; lines 743-747 should be:
swi r11, r0, TOPHYS(PER_CPU(KM)); /* Now we're in kernel-mode. */
lwi r31, r0, TOPHYS(PER_CPU(CURRENT_SAVE)); /* load current - get the saved current */
2:
swi r0, r1, PTO+PT_SYSCALL; /* Save away the syscall number. */
Please change it to the following form (just move the 2: label up one line, so that it comes before the lwi):
swi r11, r0, TOPHYS(PER_CPU(KM)); /* Now we're in kernel-mode. */
2:
lwi r31, r0, TOPHYS(PER_CPU(CURRENT_SAVE)); /* load current - get the saved current */
swi r0, r1, PTO+PT_SYSCALL; /* Save away the syscall number. */
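To make clear what moving the label changes, here is a toy model in Python (not kernel code). It assumes that some entry path branches directly to 2: and therefore skips everything above the label; that branch is not part of the snippet above, so treat it as an assumption. The step names and register values are made up for illustration.

```python
# Toy model of the entry sequence; this is NOT MicroBlaze code.
# ASSUMPTION: one path falls through from the top, another path branches
# straight to the "2:" label (hypothetical, not shown in this mail).

OLD = ["store_km", "load_current", "2:", "save_syscall"]
NEW = ["store_km", "2:", "load_current", "save_syscall"]


def run(sequence, start_label, regs, saved_current):
    """Run the toy sequence from start_label (None means from the top)."""
    regs = dict(regs)                 # do not modify the caller's dict
    started = start_label is None
    for step in sequence:
        if not started:
            # Skip everything until we reach the label we branched to.
            started = (step == start_label)
            continue
        if step == "load_current":
            # Models: lwi r31, r0, TOPHYS(PER_CPU(CURRENT_SAVE))
            regs["r31"] = saved_current
    return regs


STALE = {"r31": 0xDEADBEEF}           # made-up stale value
SAVED = 0xC0DE0000                    # made-up saved "current" pointer

print(hex(run(OLD, "2:", STALE, SAVED)["r31"]))   # load skipped
print(hex(run(NEW, "2:", STALE, SAVED)["r31"]))   # load executed
```

In this model, anything entering at 2: with the old placement keeps whatever was in r31, while with the new placement r31 is always reloaded from CURRENT_SAVE.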
Could you please test it and send me your results?
Thanks,
Michal
> Registers, Stack and Call Trace look like this:
>
> Stack:
> c04cce68 c04d43a0 c04cced0 c7df52b4 00000000 00000000 00000000 c0227a7c
> 00000000 c0227b4c 00000001 00000007 00010000 00000001 c01ee55c 00000001
> c0226000 00000000 00000008 00000000 00000000 c0019ba4 00000000 00000000
> Call Trace:
> [<c0019ba4>] update_process_times+0x94/0xa0
> [<c00098c0>] account_system_time+0x10/0x16c
> [<c0001c20>] timer_interrupt+0x60/0x98
> [<c002cf64>] handle_IRQ_event+0x54/0xb4
> [<c0001c2c>] timer_interrupt+0x6c/0x98
> [<c002d070>] __do_IRQ+0xac/0x140
> [<c0068334>] new_inode+0xc/0x9c
> [<c002cf64>] handle_IRQ_event+0x54/0xb4
> [<c00019fc>] do_IRQ+0x3c/0x94
> [<c00880c0>] pid_revalidate+0x24/0x10c
> [<c0001bfc>] timer_interrupt+0x3c/0x98
> [<c0059428>] do_lookup+0x84/0x1e4
> [<c0005ec4>] irq_call+0x0/0x8
> [<c005b2f0>] __link_path_walk+0x844/0xee8
> [<c005b394>] __link_path_walk+0x8e8/0xee8
> [<c0089d08>] proc_pid_lookup+0x30/0x1d4
> [<c00f6e4c>] number+0x2c0/0x3f4
> [<c00f7a98>] __umodsi3+0x8c/0xbc
> [<c005b2f0>] __link_path_walk+0x844/0xee8
> [<c00f730c>] vsnprintf+0x38c/0x6dc
> [<c0001bfc>] timer_interrupt+0x3c/0x98
> [<c005ba94>] link_path_walk+0x100/0x28c
> [<c0019b84>] update_process_times+0x74/0xa0
> [<c00f7730>] sprintf+0x30/0x44
> [<c0001c2c>] timer_interrupt+0x6c/0x98
> [<c0088a1c>] proc_self_follow_link+0x28/0x54
> [<c006855c>] touch_atime+0xc0/0x150
> [<c005ad64>] __link_path_walk+0x2b8/0xee8
> [<c005ba94>] link_path_walk+0x100/0x28c
> [<c00074d4>] do_page_fault+0x2e0/0x444
> [<c005bea0>] do_path_lookup+0xac/0x2a0
> [<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
> [<c0056cc4>] do_execve+0x3c/0x238
> [<c005cdf0>] __path_lookup_intent_open+0x64/0xd8
> [<c005cdc4>] __path_lookup_intent_open+0x38/0xd8
> [<c005cf00>] path_lookup_open+0x8/0x1c
> [<c0056694>] open_exec+0x28/0x108
> [<c0056cd8>] do_execve+0x50/0x238
> [<c0005240>] sys_execve_wrapper+0x0/0x10
> [<c0056cc4>] do_execve+0x3c/0x238
> [<c0001f9c>] sys_execve+0x54/0x94
> [<c0001f6c>] sys_execve+0x24/0x94
> [<c0004f78>] sc+0x10/0x18
> [<c0004f78>] sc+0x10/0x18
>
> Oops: kernel access of bad area, sig: 11
> Registers dump: mode=1
> r1=C02279CC, r2=00000000, r3=00000861, r4=0000010F
> r5=00000007, r6=00000800, r7=C02279E4, r8=00000018
> r9=00000001, r10=C0226000, r11=000041AA, r12=C0003AF0
> r13=00000000, r14=C022796C, r15=C0005AE8, r16=49CAF500
> r17=00000001, r18=00000000, r19=00000007, r20=4817FFF4
> r21=00000000, r22=00000000, r23=00000000, r24=00000001
> r25=00000000, r26=FFFFFFFF, r27=00000002, r28=0000000A
> r29=00000003, r30=0000000E, r31=C03287B0, rPC=C0003B08
> msr=000041AA, ear=0000010F, esr=000000B2, fsr=C6AA9E80
> Kernel panic - not syncing: Aiee, killing interrupt handler!
> <0>Rebooting in 120 seconds..
>
>
> The Program Counter is always pointing to the same address, which is
> inside the _unaligned_data_exception code as objdump shows:
>
> c0003af0 <_unaligned_data_exception>:
> c0003af0: a50303e0 andi r8, r3, 992
> c0003af4: 65080002 bsrli r8, r8, 2
> c0003af8: a4c30400 andi r6, r3, 1024
> c0003afc: be260068 bneid r6, 104 // c0003b64
> c0003b00: a4c30800 andi r6, r3, 2048
>
> c0003b04 <ex_lw_vm>:
> c0003b04: be060034 beqid r6, 52 // c0003b38
> c0003b08: e0a40000 lbui r5, r4, 0 <--- here it is
> c0003b0c: b000c01e imm -16354
> c0003b10: 30c0ce40 addik r6, r0, -12736
> c0003b14: f0a60000 sbi r5, r6, 0
> c0003b18: e0a40001 lbui r5, r4, 1
> c0003b1c: f0a60001 sbi r5, r6, 1
> c0003b20: e0a40002 lbui r5, r4, 2
> c0003b24: f0a60002 sbi r5, r6, 2
> c0003b28: e0a40003 lbui r5, r4, 3
> c0003b2c: f0a60003 sbi r5, r6, 3
> c0003b30: b8100020 brid 32 // c0003b50
> c0003b34: e8660000 lwi r3, r6, 0
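The masks in this listing can be checked against the register dump. The sketch below replays the first instructions of _unaligned_data_exception on r3=00000861 from the dump, written in Python only for convenience. It assumes that r3 still held the value the handler decoded when the dump was taken; the meaning of the individual bits is not interpreted here.

```python
# Constants copied from the objdump listing above.
REG_MASK = 992      # andi r8, r3, 992   (0x3E0)
MASK_1024 = 1024    # andi r6, r3, 1024  (0x400), tested by bneid
MASK_2048 = 2048    # andi r6, r3, 2048  (0x800), tested by beqid


def decode(r3):
    """Replay the handler prologue and return (r8, r6, bneid, beqid)."""
    r8 = (r3 & REG_MASK) >> 2            # andi + bsrli r8, r8, 2
    bneid_taken = (r3 & MASK_1024) != 0  # branch to c0003b64 if non-zero
    r6 = r3 & MASK_2048                  # executed in the bneid delay slot
    beqid_taken = (r6 == 0)              # branch to c0003b38 if zero
    return r8, r6, bneid_taken, beqid_taken


r8, r6, bneid_taken, beqid_taken = decode(0x861)   # r3 from the dump
print(hex(r8), hex(r6), bneid_taken, beqid_taken)
```

The computed r8 (0x18) and r6 (0x800) agree with r8=00000018 and r6=00000800 in the register dump, and neither branch is taken for this value. Note that beqid is a delayed branch, so the lbui at c0003b08 sits in its delay slot and is executed whether or not the branch is taken.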
>
>
> I'm a bit confused: the ESR value is 0xb2, which indicates a Data TLB
> Miss Exception, but execution gets stuck in the handler for the
> Unaligned Data Exception. That in turn might be caused by the 0x10f in
> r4, which shouldn't be a valid address anyway (other times I got 0xff,
> 0x10b, 0x102, 0xffffff79 etc. in r4).
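The two observations can be put side by side with a little arithmetic. The following Python sketch assumes that the exception cause is the low five bits of ESR and that the code-to-name table matches the MicroBlaze reference guide; both come from my reading of that guide, not from this thread, so please verify them against your core version. It also assumes that r3=00000861 in the dump is an ESR-style value, which is a guess.

```python
# ASSUMPTION: exception cause = low five bits of ESR, and the names below
# follow the MicroBlaze reference guide (subset only, to be verified).
EC_NAMES = {
    0x01: "unaligned data access",
    0x10: "data storage",
    0x11: "instruction storage",
    0x12: "data TLB miss",
    0x13: "instruction TLB miss",
}


def exception_cause(esr):
    """Return (code, name) for the exception cause field of an ESR value."""
    ec = esr & 0x1F
    return ec, EC_NAMES.get(ec, "not listed in this sketch")


def misaligned_for_word(addr):
    """True if addr is not a multiple of 4."""
    return addr % 4 != 0


print(exception_cause(0xB2))     # esr from the register dump
print(exception_cause(0x861))    # r3 from the register dump (a guess)

# The r4 values reported in this thread.
R4_VALUES = [0x10F, 0xFF, 0x10B, 0x102, 0xFFFFFF79]
print([misaligned_for_word(a) for a in R4_VALUES])
```

Under these assumptions, 0xb2 decodes to a data TLB miss, 0x861 decodes to an unaligned data access, and none of the reported r4 values is word aligned. That would fit a sequence where an unaligned access happens first and the lbui inside its handler then misses in the TLB, but this is only one possible reading of the dump.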
>
>
> After the 120 seconds, when the kernel calls its reboot function, the
> Stack and Call Trace look like the following, where you can see the
> exception handler functions:
>
> Stack:
> c001ed7c c01ad2c4 00003c96 00001998 00003330 00008000 00000000 c000cc18
> c000cbec 4817fff4 00000000 c03287b0 c0227934 0000000b c0010d6c c01ae8d8
> 00000078 c0375ac8 00000000 00005000 00000000 c02277c4 c0226000 0000000b
> Call Trace:
> [<c001ed7c>] emergency_restart+0xc/0x20
> [<c000cc18>] panic+0x154/0x1dc
> [<c000cbec>] panic+0x128/0x1dc
> [<c0010d6c>] do_exit+0x624/0x93c
> [<c0004060>] die+0x90/0x98
> [<c000404c>] die+0x7c/0x98
> [<c00071e8>] bad_page_fault+0xcc/0xd8
> [<c0003b08>] ex_lw_vm+0x4/0x34
> [<c0007328>] do_page_fault+0x134/0x444
> [<c0003b08>] ex_lw_vm+0x4/0x34
> [<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
> [<c00081dc>] task_running_tick+0x17c/0x2a8
> [<c0003af0>] _unaligned_data_exception+0x0/0x14
> [<c0005ae8>] page_fault_instr_trap+0x1e8/0x1f0
> [<c0003b08>] ex_lw_vm+0x4/0x34
> [<c0019ba4>] update_process_times+0x94/0xa0
> [<c00098c0>] account_system_time+0x10/0x16c
> [<c0001c20>] timer_interrupt+0x60/0x98
> ...
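The trace entries can be cross-checked against the objdump listing. Each entry has the form [<address>] symbol+0xOFFSET/0xSIZE, so address minus offset gives the start of the symbol. The small Python helper below does that subtraction; the regular expression is my own and only covers the entry format seen in this mail.

```python
import re

# Matches entries such as "[<c0003b08>] ex_lw_vm+0x4/0x34".
TRACE_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_trace_line(line):
    """Return (symbol, symbol_start, symbol_size) or None if no match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    addr = int(m.group(1), 16)
    offset = int(m.group(3), 16)
    size = int(m.group(4), 16)
    return m.group(2), addr - offset, size


for line in (
    "> [<c0003b08>] ex_lw_vm+0x4/0x34",
    "> [<c0003af0>] _unaligned_data_exception+0x0/0x14",
):
    symbol, start, size = parse_trace_line(line)
    print(symbol, hex(start), hex(start + size))
```

For ex_lw_vm this gives a start of 0xc0003b04, and for _unaligned_data_exception a range of 0xc0003af0 to 0xc0003b04, both of which agree with the labels in the objdump output.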
>
>
> So my guess is, the actual panic is caused by a missing entry in the
> __ex_table section for that address (0xc0003b08), but I doubt that this
> is the real reason - as mentioned above, the data in r4 doesn't look
> like valid addresses. However, the Call Trace always starts with an
> interrupt handling function - currently always some timer interrupt
> handler, but I also had a PS/2 keyboard driver and pressing some keys
> forced this panic as well, showing the keyboard ISR functions. For me
> it looks like some bad interaction between page fault handling and
> interrupts, but that's just another guess..
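For readers not familiar with __ex_table, the idea can be sketched as follows: when a fault happens in kernel mode, the faulting address is looked up in a table of (instruction address, fixup address) pairs, and only if an entry exists does execution continue at the fixup. The Python below is a simplified model of that lookup, not the kernel's code; the table entries and function names are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical table: (address of instruction that may fault, fixup address).
# These entries are invented; they are not taken from any real kernel image.
EX_TABLE = sorted([
    (0xC0001000, 0xC0001F00),
    (0xC0002000, 0xC0002F00),
])


def search_exception_table(table, pc):
    """Binary-search a sorted table for pc; return the fixup or None."""
    keys = [insn for insn, _fixup in table]
    i = bisect_left(keys, pc)
    if i < len(keys) and keys[i] == pc:
        return table[i][1]
    return None


def handle_kernel_fault(table, pc):
    """Model the decision taken for a fault raised in kernel mode."""
    fixup = search_exception_table(table, pc)
    if fixup is None:
        return "oops"                      # no fixup known: bad page fault
    return f"resume at {fixup:#x}"         # continue at the fixup code


print(handle_kernel_fault(EX_TABLE, 0xC0003B08))   # PC from the oops
print(handle_kernel_fault(EX_TABLE, 0xC0001000))   # invented entry
```

In this model the PC from the oops (0xc0003b08) has no entry, so the fault ends in the oops path, which matches the bad_page_fault and die frames in the second call trace. Whether a missing entry is the real cause is a separate question, as the quoted text already says.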
>
> Now basically my question is, if that's a known issue or at least if
> the situation can be reproduced by someone else - or if the problem
> might be within my hardware configuration or ..well, dunno.. any hint
> is welcome. If more information is needed, let me know.
>
> Thanks in advance,
> Sven
>
> ___________________________
> microblaze-uclinux mailing list
> microblaze-uclinux@xxxxxxxxxxxxxx
> Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
> Mailing List Archive : http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/
>
>