strange - infinite loop in msdelay function

Coding and general discussion relating to the compiler

Moderators: David Barker, Jerry Messina

Post Reply
User avatar
RadioT
Registered User
Registered User
Posts: 157
Joined: Tue Nov 27, 2007 12:50 pm
Location: Winnipeg, Canada

strange - infinite loop in msdelay function

Post by RadioT » Mon Oct 28, 2013 6:15 am

Hello,

We have been re-writing our code to move uart and timer functions to interrupts and make the whole thing more modular. We're pretty much done except we are finding that in long tests, the program will get stuck in what appears to be an infinite loop in the usdelay() routine.

Are any of the variables in the usdelay() routine local? Maybe we are overwriting one in an interrupt? Or could we be messing up a status bit if it's interrupted? When entering interrupts, we are saving with

Code: Select all

Save(FSR0, FSR1, FSR2, PRODL)

<insert respective interrupt code here>

Restore
We also ran into stack overflows but if we hit those, the code reboots. I think we have pretty much fixed all those. In this instance, though, the device just appears to quit working when testing with field devices. In our debug unit connected to MPLab we see where often it halts at location 0x4c, which is in the middle of the usdelay() sub. When we step at that point, MPLab never stops running.

I see the watchdog is set in this routine, so when it's stuck in there it can never time out can force a reboot.

Any ideas?? Maybe the stack overflows when it's in a timer loop, and get stuck there instead of rebooting???

Thanks,

-Tom

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Mon Oct 28, 2013 1:40 pm

First off, if you're saving the PRODL register (which is almost always a good idea), you need to check what it's actually saving.

The PROD register is actually a 16-bit combination of the two PRODL/PRODH registers, but it's defined in the device files as a byte.

Code: Select all

   PROD as byte absolute $0FF3,
   PRODL as byte absolute $0FF3,
   PRODH as byte absolute $0FF4,
So 'save(PRODL)' or 'save(PROD)' will only save the lower 8-bits.

If you re-define PROD as

Code: Select all

public dim PROD as PRODL.AsWord
then 'save(PROD)' will save the entire word

It seems to me that the device files should probably define 'PROD' along with the other 16-bit aliases, but they don't

Code: Select all

// alias...
public dim
   FSR0 as FSR0L.AsWord,
   FSR1 as FSR1L.AsWord,
   FSR2 as FSR2L.AsWord,
   TABLEPTR as TBLPTRL.AsWord,
   PROD as PRODL.AsWord
Note: if you use 'save(0)' to save the system variables then this is taken care of for you... it also saves FSR0, FSR1, and PRODL/PRODH


Now, on to delayms...

delayms() and delayus() both use a few bytes of the system register space as temp storage, so depending on what's in your isrs you may need to add 'save(0, ' to the list, but without knowing what the isrs are doing that's hard to tell. Saving/restoring the system vars adds a far bit of latency to the interrupt, so it's a good idea NOT to use it if at all possible (but that depends on the isr code).

Also, from your other thread it seems you're making a number of subroutines calls in the isrs (because of the stack overflow issue). You need to take a look at this since if these subs use any parameters, temps, or local variables they too would probably have to be added to the 'save' list since the frame variables are recycled

If you're using both high and low priorities, this usually applies to both isrs, so this can be a real issue if you need fast interrupt response.

See if adding something like

Code: Select all

  save(0, sub1_used_by_isr, sub2_used_by_isr, etc)
to each of the interrupt handlers helps. If it does, then you know where to start looking.

Also, you aren't using the 'delayxx' calls IN the isrs, are you?

User avatar
RadioT
Registered User
Registered User
Posts: 157
Joined: Tue Nov 27, 2007 12:50 pm
Location: Winnipeg, Canada

Post by RadioT » Mon Oct 28, 2013 11:36 pm

Hi Gerry,

Yes, we make sure there are NO delay loops in ISRs. They are EVIL there. I actually found one that added 2 minutes to a simple screen clear - the ISR ran a delay, then when the ISR returned to it's original code, it happened to be in an delayms() routine! After taking out the delayms() routine in the ISR, the screen clear was instant once the program returned from the interrupt.

So you are absolutely right, NO delays in the ISR.

As for the context save - if we save 0 that would save PRODL/PRODH as well as the others? OK I was not aware of that. We can use that instead.

In the end, we just set a "disable" high-priority interrupts command at the start of the ISR and an "enable" command at the end of the ISR. Our routines are over fast enough that there has been no adverse effect in our code. We got lucky on that one! We have not seen a single stack overflow since.

Thanks for all your help,

-Tom

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Tue Oct 29, 2013 10:54 am

>>if we save 0 that would save PRODL/PRODH as well as the others?

Yup.

Adding '0' to the save list saves/restores:
- FSR0
- FSR1
- the SF System variables (25 bytes)
- 16-bit PROD (PRODL, PRODH)

The interrupt itself saves BSR, WREG, and STATUS

User avatar
RadioT
Registered User
Registered User
Posts: 157
Joined: Tue Nov 27, 2007 12:50 pm
Location: Winnipeg, Canada

Post by RadioT » Fri Nov 08, 2013 1:45 pm

Unfortunately, we are still having problems.

We worked to reduce the depth of the stack in our program but get strange stack overflows where the stack will have 15 or more consecutive locations stuffed into it. It's like a certain condition is met, and the program flow goes into a loop that rapidly makes calls to consecutive locations in assembler with no explanation.

The program runs anywhere from an hour to 8 hours then has the overflow.

In one example case, it was in a UART "initialize" sub for our GPS.

STKPTR was 10011110, 0x9E, which is 30, with bit 7 set, the stack overflow bit. So the stack overflowed.

So I kept stepping. I was trying to see where the stack led before it died.

I added the trap code you suggested, and that works.

So here is what was in the stack. I gathered it, location by location, by setting the PC at a "return" instruction, then step-in executed in MPLab. The next lower stack location is popped off, and the PC goes to the location popped off. Then find a "return" instruction again, and re-do the process.

Popping values off the stack using "Return". The locations are in hex format, ie 0x002FF2

Each line has the stack location in decimal then in hex, followed by the loction in memory and a comment on where the program counter is.

Stack 30 0x002FF8
29 1D missed it - used pop
28 1C 0x002FF2 - used return, then step, in the Dissassembly window. This is back in the 'initialize gps' routine
27 1B 0x003260 - at end of "public function bitBashingIsEnabled() as boolean"
26 1A 0x002FF0 - in "public sub initialize(bBash as boolean = false)" in clear(data.location.latitude) line
25 19 0x002FEE - public sub initialize(bBash as boolean = false) in clear(data.location.latitude)
24 18 0x002FEC - in "public sub initialize(bBash as boolean = false)" again in clear(data.location.latitude) line
23 17 0x002FEA - in "public sub initialize(bBash as boolean = false)" again in clear(data.location.latitude) line
22 16 0x002FE8 - in "public sub initialize(bBash as boolean = false)" again in clear(data.location.latitude) line
21 15 0x002FE6 - in "public sub initialize(bBash as boolean = false)" again in clear(data.location.latitude) line
20 14 0x002FE4 - in "public sub initialize(bBash as boolean = false)" again in clear(data.location.latitude) line
19 13 0x002FE2 - in "public sub initialize(bBash as boolean = false)" again in clear(data.location.latitude) line
18 12 0x002FE0 - in "public sub initialize(bBash as boolean = false)" at "First_Setting = true"
17 11 0x002FDE - in "public sub initialize(bBash as boolean = false)" at clear(readingCount)
16 10 0x002FDC - in "public sub initialize(bBash as boolean = false)" at clear(readingCount)
15 0F 0x002FDA - private sub cyclePower(), just before "initialize", at end in delayMS(POWER_UP_TIME_NEEDED)
14 0E 0x002FD8 - again in private sub cyclePower(), just before "initialize", at end in delayMS(POWER_UP_TIME_NEEDED)
13 0D 0x002FD6 - again in private sub cyclePower(), just before "initialize", at end in delayMS(POWER_UP_TIME_NEEDED)
12 0C 0x002DF4 - again in private sub cyclePower(), just before "initialize", at end in delayMS(POWER_UP_TIME_NEEDED)
11 0B 0x002DF2 - again in private sub cyclePower(), just before "initialize", at end in delayMS(POWER_UP_TIME_NEEDED)
10 0A 0x000DA - an early function near the top of the memory
9 09 0x002FCE - again in private sub cyclePower(), just before "initialize", at delayMS(10)
8 08 0x002FCC - again in private sub cyclePower(), just before "initialize", at delayMS(10)
7 07 0x002FCA - again in private sub cyclePower(), just before "initialize", at delayMS(10)
6 06 0x002FC8 - again in private sub cyclePower(), just before "initialize", at GPS_pwr = Power_Off
5 05 0x002FC6 - in private sub start() at the Return() at the end
4 04 0x003A7E - in public sub readData() after call to sub start()
3 03 0x01224E - in sub run_GPS(), in enableHighPriorityInterrupts = interruptsWereEnabled
2 02 0x0188C4 - in private sub runApplication(), in "if Iridium.countSinceSignalStrengthCheck <= Iridium.SignalStrength_Timeout then"
1 01 0x18C12 - in Main:, at GoTo Main
0 00 0x00000 - at reset vector. Underflow gets set when at this point.


So, for whatever reason, some sort of chain reaction occurred where the locations pushed onto the stack are all in the GPS functions "cycle power" and "initialize". It's like a condition was found that just kept forcing jumps or some sort of problem with the program counter that keeps pushing locations onto the stack every instruction cycle in the same general functions.

Any help would be appreciated! I am at the point I may have to spend $4000 on an ICE but not only is it expensive but it would take 2 weeks to set up and learn to use it. Even then, though, there's no guarantee we'll find the problem.

-Tom

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Fri Nov 08, 2013 3:18 pm

I'm not sure what you'd have to do to get a stack dump like that. Are you messing around with interrupt enables inside the ISR's by chance?

FYI, if you tell SF to generate a LST file then you can use that to trace the program flow instead of all the manual intervention.

From the IDE select the 'View | Compile and Program Options...' menu. On the 'Compiler' tab, check 'Assembler *.lst file'
This will give you a .LST file that has all the locations, hex code, and SF basic code mixed with the asm instructions, something like this...

Code: Select all

LOC    OBJECT CODE    LINE  SOURCE TEXT
                      07996 ; PON_STATUS_MSG
003652                07997 PROC_L00000285
003652                07998 ?I000143_F008_001365_P000068 ; L#MK IF (PSTAT > BOUND(PON_STAT_MSGS)) THEN
003652 0E0A           07999     MOVLW 10
003654 5C1A           08000     SUBWF F0_U08,0,0
003656 E302           08001     BNC L00000287
003658                08002 ?I000144_F008_001366_P000068 ; L#MK PSTAT = UNKNOWN_EVENT
003658 0E09           08003     MOVLW 9
00365A 6E1A           08004     MOVWF F0_U08,0
00365C                08005 L00000287
00365C                08006 ?I000145_F008_001370_P000068 ; L#MK IX = PSTAT * SIZEOF(PON_STAT_MSGS(0))
00365C 501A           08007     MOVF F0_U08,0,0
00365E 0D10           08008     MULLW 16
003660 CFF3 F01B      08009     MOVFF PRODL,F1_U08
003664                08010 ?I000146_F008_001371_P000068 ; L#MK TABLEPTR = @PON_STAT_MSGS
003664 0E28           08011     MOVLW ((L00000001 >> 8) & 255)
003666 6EF7           08012     MOVWF _TBLPTRL#M0_U16H,0
003668 0E70           08013     MOVLW (L00000001 & 255)
00366A 6EF6           08014     MOVWF _TBLPTRL#M0_U16,0
00366C                08015 ?I000147_F008_001372_P000068 ; L#MK RESULT = TABLEPTR + IX
00366C 501B           08016     MOVF F1_U08,0,0
00366E 26F6           08017     ADDWF _TBLPTRL#F2_U16,1,0
003670 0E00           08018     MOVLW 0
003672 22F7           08019     ADDWFC _TBLPTRL#F2_U16H,1,0
003674                08020 ?I000148_F008_001373_P000068 ; L#MK END FUNCTION
003674 0012           08021     RETURN 0
After dumping the stack contents, just match the stack dump with the 'LOC' field, and that should tell you where you were.

User avatar
RadioT
Registered User
Registered User
Posts: 157
Joined: Tue Nov 27, 2007 12:50 pm
Location: Winnipeg, Canada

Post by RadioT » Fri Nov 08, 2013 4:15 pm

Actually - we ARE. We are disabling interrupts while we are in a high-priority interrupt so we only have one happening at a time. This way, we don't overflow the stack. At least, we thought we weren't!!

-Tom

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Fri Nov 08, 2013 4:20 pm

We are disabling interrupts while we are in a high-priority interrupt so we only have one happening at a time
You don't normally have/want to do this... once you vector to the high-priority interrupt, the chip automatically clears GIE, which disables all interrupts (both priorities). Interrupts are automatically reenabled by the RETFIE instruction that's at the end of the ISR.

For the low-priority ISR, things are a bit different since the high-priority ISR can interrupt it.

Perhaps you should post the code for your ISR's

User avatar
kanderson
Posts: 17
Joined: Tue Oct 15, 2013 4:28 pm
Location: Canada
Contact:

Still happening

Post by kanderson » Fri Nov 08, 2013 10:16 pm

Hi, I'm working with RadioT...

I removed the stuff from our ISR that manipulated GIE, and we're still having the stack overflow issue. Any other thoughts?

Code: Select all

{
*****************************************************
* Name	: highPriorityInterrupts					*
*****************************************************
}
interrupt highPriorityInterrupts(IPHigh)

	Save(0)

	if appCrippledByInterrupts then
		asm
			reset
		end asm
	endif

	inc(interruptsSinceMainLoopRan)
	if interruptsSinceMainLoopRan > interruptsSinceMainLoopLimit then
		appCrippledByInterrupts = true
	endif


	// handle B0 external interrupt from DS1374
	if	DS1374.interrupts_enabled = true	and
		DS1374.interrupt_flag = true
	then
		#if debug
			inc(RTC_interruptsSinceMainLoopRan)
		#endif
		handleInterrupt_realTimeClockAlarm()
	endif
	clear(DS1374.interrupt_flag)


	// GPS data comes in on USART2
	if	GPS_ublox.interrupts_enabled = true	and
		GPS_ublox.interrupt_flag = true
	then
		#if debug
			inc(GPS_interruptsSinceMainLoopRan)
		#endif
		GPS_ublox.handleInterrupt_incomingGPSData()
	endif
	clear(GPS_ublox.interrupt_flag)


	// Iridium data comes in on USART1
	if Iridium.interrupts_enabled and Iridium.interrupt_flag = true then
		#if debug
			inc(Iridium_interruptsSinceMainLoopRan)
		#endif
		Iridium.handleInterrupt()
	endif
	clear(Iridium.interrupt_flag)

	Restore

end interrupt



{
*****************************************************
* Name	: lowPriorityInterrupts						*
*****************************************************
}
interrupt lowPriorityInterrupts(IPLow)

	Save(0)

	if appCrippledByInterrupts then
		asm
			reset
		end asm
	endif

	inc(interruptsSinceMainLoopRan)
	if interruptsSinceMainLoopRan > interruptsSinceMainLoopLimit then
		appCrippledByInterrupts = true
	endif

	//

	if	timer1_interrupts_enabled and
		timer1_interrupt_flag = true
	then
		handleInterrupt_timer1()
	endIf
	clear(timer1_interrupt_flag)

	//

	if	Timer4.interrupts_enabled and
		Timer4.interrupt_flag = true
	then
		#if debug
			inc(Timer4_interruptsSinceMainLoopRan)
		#endif
		handleInterrupt_timer4()
	endIf
	clear(Timer4.interrupt_flag)

	Restore

end interrupt

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Sat Nov 09, 2013 12:42 am

The basic structure seems to be ok there, but without seeing the whole thing it's hard to tell. I assume that the various 'handleInterrupt_XXXXX()' subs do a fair amount of work and are complex with a lot of nested sub calls themselves?

One thing you could try is to save the context of the called subroutines, and see how that reacts. Add the code to the low, then the high and make note of the ram usage after each compile. I'm betting that it'll be fairly substantial.

Code: Select all

interrupt highPriorityInterrupts(IPHigh)

   Save(0, 
   		handleInterrupt_realTimeClockAlarm, 
   		GPS_ublox.handleInterrupt_incomingGPSData,
   		Iridium.handleInterrupt
   		)

<SNIP>


interrupt lowPriorityInterrupts(IPLow)

   Save(0, 
   		handleInterrupt_timer1, 
   		handleInterrupt_timer4
   		)
   
<SNIP>
Oh, and I think you'd be better off moving the clearing of the intr flag into the true condition, otherwise you may miss an intr if it ever gets unmasked later.

Code: Select all

   if (x_interrupts_enabled and (x_interrupt_flag = true)) then
      handleInterrupt_x()
      clear(x_interrupt_flag)
   endIf
Whether you clear it before or after the handleInterrupt_x() call would depend on the peripheral and what the handler does.

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Sat Nov 09, 2013 1:33 pm

Also I see that you're using the intr from a DS1374. If you're using the I2C bus in the isr handler then make sure you're not also using it in the main context to talk to anything.

User avatar
RadioT
Registered User
Registered User
Posts: 157
Joined: Tue Nov 27, 2007 12:50 pm
Location: Winnipeg, Canada

Post by RadioT » Sun Nov 10, 2013 10:23 pm

Good point. There is a lot going on in bus comms.

Jerry Messina
Swordfish Developer
Posts: 1473
Joined: Fri Jan 30, 2009 6:27 pm
Location: US

Post by Jerry Messina » Sun Nov 10, 2013 11:22 pm

You know, there's always the possibility that there's just too many stack levels required the way the code is structured.

With two interrupt priorities enabled, if you use a lot of nested sub/function calls in the isr's you might need a lot of levels. Also, it's pretty easy to start having stack frame issues when you're calling subs from an isr. Because of the way ram is recycled you have to pay very close attention to this, esp with sub parameters and local variables.

I usually try to do as little as required in the isr, and then let the main context do most of the work. For serial comms I usually do little more than check for errors and buffer incoming data in a circular queue. I try to avoid accessing peripherals like I2C, SPI, and others since this seems to always ends up causing more problems than I started out with. Think of what would happen if they were currently in use, and an intr came along and tried to use them. Not pretty. Plus, they're usually too slow, or require blocking loops to wait for flags to become set.

Of course, you do have to do SOMETHING in the isr. Otherwise there's no point in having it. It's just that too much code in the isr can be a big source of failure as well.

Post Reply