Wednesday, August 1, 2007

Stone Age Debugging

How do you debug a reasonably complex system without tools like a debugger, core dumps, or even console access? I am talking about a system with no printf's, no gdb, and no persistent filesystem to dump a core to on a crash.

I work on an embedded system: TI C62x DSPs on a Motorola MPC board. A few weeks back I was debugging a problem that caused the DSPs to crash every 18-20 hours. We don't have JTAG to peek into the DSPs; all we have is a shared-memory debugging mechanism in which the DSPs keep updating a location with the code trace. When a DSP hits a problem, it jumps to HALT, and the Controller (the MPC) detects this through a heartbeat mechanism. The Controller then accesses the shared memory via the DSP's HPI, gets the data, and dumps it to a file.
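To make the trace mechanism concrete, here is a minimal sketch of what the DSP side could look like. The ring-buffer layout, the names `trace_area_t`, `trace_point` and `trace_last`, and the depth of 16 are all my assumptions for illustration, not the actual code; on the real board the structure would sit at a fixed address that the Controller can reach over HPI.

```c
#include <stdint.h>

/* Hypothetical layout of the trace area (assumed, not the real one). */
#define TRACE_DEPTH 16u

typedef struct {
    volatile uint32_t write_index;          /* total number of trace points recorded */
    volatile uint32_t entries[TRACE_DEPTH]; /* most recent trace point IDs */
} trace_area_t;

/* Stand-in for the shared memory region that the Controller reads. */
static trace_area_t g_trace;

/* Record a trace point ID, overwriting the oldest entry once the ring is full. */
static void trace_point(uint32_t id)
{
    uint32_t slot = g_trace.write_index % TRACE_DEPTH;
    g_trace.entries[slot] = id;
    g_trace.write_index++;
}

/* Return the most recently recorded ID, or 0 if nothing has been recorded. */
static uint32_t trace_last(void)
{
    if (g_trace.write_index == 0u) {
        return 0u;
    }
    return g_trace.entries[(g_trace.write_index - 1u) % TRACE_DEPTH];
}
```

The DSP code would call `trace_point()` at interesting places, and the Controller would read `write_index` and `entries` after a halt to see the last few places the code went through.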

However, the problem we have is this: if the DSP crashes, i.e. if it does NOT halt but instead screws up badly and reboots, the Controller goes for a toss, eventually causing the whole system to reset. That is where the problem starts.

I spent the first 3 days trying to fix the MPC reset. A little digging showed that the MPC reset was due to the following sequence of events:
- DSP has a problem
- DSP resets
- MPC misses a heartbeat from the ailing DSP
- MPC tries to read the shared memory for the logs [thinking that the DSP has halted]
- The DSP handles are no longer valid because the DSP has reset
- MPC crashes, bringing down the whole system
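The heartbeat step in this sequence can be sketched roughly as below. I am assuming a counter-based heartbeat that the MPC polls; the names `heartbeat_monitor_t`, `heartbeat_poll` and the threshold of 3 missed beats are hypothetical. The sketch also shows why the sequence goes wrong: a stalled counter tells the MPC that the DSP is dead, but not whether it halted or reset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed threshold: declare the DSP dead after this many unchanged polls. */
#define MAX_MISSED_BEATS 3u

typedef struct {
    uint32_t last_count; /* heartbeat counter value seen on the previous poll */
    uint32_t missed;     /* consecutive polls with no change */
} heartbeat_monitor_t;

/* Poll once with the DSP's current heartbeat counter.
   Returns true when the DSP should be considered dead. */
static bool heartbeat_poll(heartbeat_monitor_t *m, uint32_t current_count)
{
    if (current_count != m->last_count) {
        m->last_count = current_count;
        m->missed = 0u;
        return false;
    }
    m->missed++;
    return m->missed >= MAX_MISSED_BEATS;
}
```

Whatever runs after `heartbeat_poll()` returns true has to cope with both cases, halted and reset, before it touches the shared memory.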

Day1:
Aim: The DSP resets by jumping to c_int00; override the reset ISR so that it loops infinitely [or HALTs]

Approach:
  • Find address of c_int00 from map file
  • When the DSP is loaded, overwrite that address with JMP HLT code [RTFM for the asm, or write while (1), compile, and check the asm]
Result: Day wasted. The DSP still resets; either the code I wrote was wrong [I have no way to check] or the screwup is worse than I thought

Day2:
Aim: The MPC crashes because it is trying to read using an invalid handle; get a new handle before reading

Approach: On DSP failure
  • Close the DSPs
  • Get new handle
  • DON'T download the code to the DSP
  • Open HPI
  • Read from the shared location
Result: I get something, but it looks corrupted. At least the MPC does not crash. Still nowhere, though.
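The Day 2 sequence can be sketched as follows. Every `dsp_*` and `hpi_*` name here is a hypothetical stand-in with a stub body, not the real driver API on the board, and `fake_shared_mem` only simulates the DSP's shared memory so that the sketch compiles on its own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical handle type and stub driver calls (assumptions for illustration). */
typedef struct { int valid; } dsp_handle_t;

static uint8_t fake_shared_mem[8] = {1, 2, 3, 4, 5, 6, 7, 8};

static void dsp_close(dsp_handle_t *h) { h->valid = 0; }
static int  dsp_open(dsp_handle_t *h)  { h->valid = 1; return 0; }
static int  hpi_open(dsp_handle_t *h)  { return h->valid ? 0 : -1; }

/* Read len bytes at offset off; fails on a stale handle or an out-of-range read. */
static int hpi_read(dsp_handle_t *h, size_t off, void *dst, size_t len)
{
    if (!h->valid || off > sizeof fake_shared_mem ||
        len > sizeof fake_shared_mem - off) {
        return -1;
    }
    memcpy(dst, fake_shared_mem + off, len);
    return 0;
}

/* On DSP failure: drop the stale handle, get a fresh one, and read the trace.
   Deliberately no code download, since that would wipe the evidence. */
static int recover_trace(dsp_handle_t *h, void *dst, size_t len)
{
    dsp_close(h);
    if (dsp_open(h) != 0) { return -1; }
    if (hpi_open(h) != 0) { return -1; }
    return hpi_read(h, 0, dst, len);
}
```

The point of the ordering is that the read never goes through the handle that was open when the DSP reset.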

Day3:

Aim: Look for the root cause. The DSP is probably resetting because of a stack overflow; arrest that

Approach: In the main task that runs every 20 msecs, check whether the stack usage goes beyond 80%; if so, HALT

Result: None, half the day wasted
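One common way to measure stack usage without any tool support is to paint the stack with a known pattern at start-up and later count how much of the pattern survives. The sketch below uses that technique; it is an assumption on my part about how such a check could be done, not necessarily the check described above. It assumes the stack grows downward from the end of the region, and `STACK_FILL` and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xBEu /* assumed paint pattern */

/* Fill the whole stack region with the pattern (done once, before the task runs). */
static void stack_paint(uint8_t *stack, size_t size)
{
    for (size_t i = 0; i < size; i++) {
        stack[i] = STACK_FILL;
    }
}

/* Assuming the stack grows down from stack[size - 1] toward stack[0],
   the untouched part is the run of pattern bytes starting at stack[0]. */
static size_t stack_used(const uint8_t *stack, size_t size)
{
    size_t untouched = 0;
    while (untouched < size && stack[untouched] == STACK_FILL) {
        untouched++;
    }
    return size - untouched;
}

/* True when the high-water mark exceeds the given percentage of the stack. */
static bool stack_over_limit(const uint8_t *stack, size_t size, unsigned percent)
{
    return stack_used(stack, size) * 100u > size * percent;
}
```

In the 20 msec task this would boil down to calling `stack_over_limit(stack, size, 80)` and jumping to HALT when it returns true.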

Aim: There is a buffer overflow, arrest that

Approach: Put a "guard band" after every major buffer and, in every 20 msec cycle, check whether anything has been written to it. For example, say there is a buffer char caImportantBuffer [100];
Modify that to: char caImportantBuffer [100 + GAURD_BAND];
memset (caImportantBuffer + 100, 0x34, GAURD_BAND); /* one-byte pattern: memset uses only the low 8 bits of the fill value */

Every 20 msecs, check whether the GAURD_BAND bytes at the end have changed; if they have, HALT.
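Put together, the guard band check could look like the sketch below. The band size of 16 bytes, the names `BUFFER_SIZE`, `GUARD_FILL`, `guard_init` and `guard_intact` are my assumptions for illustration; `GAURD_BAND` keeps the spelling used above. Since memset stores only the low byte of its fill value, the sketch uses a one-byte pattern.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define BUFFER_SIZE 100
#define GAURD_BAND  16   /* assumed band size; spelling follows the post */
#define GUARD_FILL  0x34 /* one-byte pattern written into the band */

static char caImportantBuffer[BUFFER_SIZE + GAURD_BAND];

/* Paint the guard band that sits right after the usable part of the buffer. */
static void guard_init(void)
{
    memset(caImportantBuffer + BUFFER_SIZE, GUARD_FILL, GAURD_BAND);
}

/* Return true while every guard byte still holds the pattern. */
static bool guard_intact(void)
{
    for (size_t i = 0; i < GAURD_BAND; i++) {
        if (caImportantBuffer[BUFFER_SIZE + i] != GUARD_FILL) {
            return false;
        }
    }
    return true;
}
```

The 20 msec task would call `guard_intact()` for each protected buffer and HALT as soon as one returns false. One blind spot of this scheme: an overflow that happens to write the pattern value itself goes unnoticed.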

Result: ONE bug found, but the problem still remains. Big achievement, but miles to go before I sleep


To be continued...
