LwIP sample program failure

27 replies [Last post]
tamirmichael
Offline
Joined: 2012-04-13

Hello,

I have downloaded the port of lwIP for LPC17xx and encountered a program failure after some time the controller (LPC1788) handes bus events
(I ran the TCP echo sample, but I suspect that any will do to demonstrate this):

    /* all pbufs in a chain are referenced at least once */
    LWIP_ASSERT("pbuf_free: p->ref > 0", p->ref > 0);

Which can be found at pbuf.c, Line 627 of version 1.4.0 (the function "pbuf_free"). The failure occurrs both with dynamic and staticaly allocated heap after at most several minutes of running, being pinged or not.
How can this happen? Did you encounter this too? What can be done about it?

0
Your rating: None

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
wellsk's picture
wellsk
Online
Joined: 2011-05-17

I ran flood ping tests with tcpecho for a while (easily more than 2 minutes). but didn't see this issue. But I have seen this trap get hit before in specific cases (all of which shouldn't be present except for possible one...). larsjep (another developer) recently discovered an issue with the way chained pbufs are de-allocated in the 18xx/43xx port that may be contributing to this issue. (It's an 18xx/43xx patch, but the same logic should apply to the 17xx port).

The patch is here:
http://www.lpcware.com/content/bugtrackerissue/lwiplpc-lpc18xx43xx-drive...

I've noticed problems with the reference counters getting messed up when under load with an RTOS when SYS_LIGHTWEIGHT_PROT is not enabled (SYS_LIGHTWEIGHT_PROT = 0 and NO_SYS = 0). Are you currently using one of the examples without any changes? If you are using a FreeRTOS based example (NO_SYS = 0), can you verify that the SYS_LIGHTWEIGHT_PROT option is set to 1?

Do you have a stack trace available, or know if the issue occurred from the TX or RX handler side?

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

tamirmichael
Offline
Joined: 2012-04-13

Thanks a lot for your reply. I will try to patch my sample/project tomorrow morning!
Yes, indeed, I have been using the sample without my change (after not understanding why my official project fails...).
I do not have stack trace available yet, but I am quite sure the failure occurs while processing a received message.
I will report more details tomorrow.
Thanks again!

tamirmichael
Offline
Joined: 2012-04-13

Kevin,

I have merged this change into the my project and it seems to have been working fine for a while now (pinged for about 30 minutes plus).
Thanks again for sharing this patch.

tamirmichael
Offline
Joined: 2012-04-13

A small update: I did see my program fail again after about an hour of operation. I cannot provide more details as my debugger was not connected.

wellsk's picture
wellsk
Online
Joined: 2011-05-17

Do what you can to identify the area of the problem and we'll try to repeat the issue here. Some type of case where the issue occurs would be very helpful (ie, the stack trace when the error occurs to help trace the failure path). The patch fixes a potential issue with TX pbufs only, but may or may not help at all. Are you using a standalone configuration or with an RTOS?

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

tamirmichael
Offline
Joined: 2012-04-13

In order to reproduce the problem, all you need to do is use the embedded artists LPC1788 evaluation board and program it with this sample program:

"nxpcommon\examples\lpc177x_8x\ea1788\ea1788_tcpecho_freertos"

and let it run being pinged or not. After some time, even with the patch posted above, it will hang at the same place indicated above.
I am going to start another test with a debugger - will provide a stack trace once failure occurs.
Thanks for your help.

tamirmichael
Offline
Joined: 2012-04-13

Without the patch above, the failure seems to occur much faster, producing this stack trace:

pbuf_free (called by ip.c, line 433)
ip_input (called by etharp.c, line 1282)
ethernet_input (called by lpc17_emac.c, line 468)
lpc_enetif_input (called by lpc17_emac.c, line 787)
vPacketReceive_Task

this appiles to version 1.4.0 of lwIP.

I will now attempt to reproduce the failure with the patch.

tamirmichael
Offline
Joined: 2012-04-13

I managed to fail the program with the patch - here is the stack trace:

pbuf_free (called by lpc17_emac.c, line 477)
lpc_enetif_input (called by lpc17_emac.c, line 787)
vPacketReceiveTask

It took me something like 45 minutes of continuous pinging (I stopped and resumed shortly before the failure) to cause this.
I hope this helps.

wellsk's picture
wellsk
Online
Joined: 2011-05-17

It looks like you are using FreeRTOS and the failure occurred on the input path. I'd keep the TX pbuf patch in, but it looks like I might have another problem with the example code itself using the wrong input path for an RTOS. (Sorry :()

When used with FreeRTOS, the 'ethernet_input' function should be replaced with the 'tcpip_input' function. This is correct for the LPC18xx/43xx examples, but it seems I got a little to comfortable with cut-and-paste.

In the example code, can you make the change indicated below and give it a try?

  /* Add netif interface for lpc17xx_8x */
  memset(lpc_netif, 0, sizeof(lpc_netif));
//  if (!netif_add(&lpc_netif, &ipaddr, &netmask, &gw, NULL, lpc_enetif_init, ethernet_input))
  if (!netif_add(&lpc_netif, &ipaddr, &netmask, &gw, NULL, lpc_enetif_init, tcpip_input))
    LWIP_ASSERT("Net interface failed to initialize\r\n", 0);

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

wellsk's picture
wellsk
Online
Joined: 2011-05-17

I've entered a bug tracker issue for this case.
http://www.lpcware.com/content/bugtrackerissue/lwip-17xx-freertos-exampl...

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

tamirmichael
Offline
Joined: 2012-04-13

I will try this fix ASAP but I must deal with some bush fires first...
Thanks for the efforts, I will report the results once I have something.

SeleneSW
Offline
Joined: 2012-06-19

I have the same problem with the "ea1788_http_freertos" demo.
I tried with the patches described above but with no success.

Any news about this thread?

wellsk's picture
wellsk
Online
Joined: 2011-05-17

I have uploaded the LPC17xx and LCP43xx ports with larsjep's updates and several other fixes for both the 17xx and 43xx platforms.
You can pull the changes from the GIT repo at: http://sw.lpcware.com/?p=lwip_lpc.git&a=summary

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

tamirmichael
Offline
Joined: 2012-04-13

Thanks for the effort. I have downloaded the corrected program and just started what I hope will be a very long test. I will report back.

tamirmichael
Offline
Joined: 2012-04-13

Test failed (ping reply no longer generated). I did not have a debugger connected to the target.

tamirmichael
Offline
Joined: 2012-04-13

I am conducting an additional test to make sure what I have seen is indeed a legitimate failure.

tamirmichael
Offline
Joined: 2012-04-13

Test failed again. I did not have a debugger connected to the target but it is easy to reproduce - just let any of the test programs run pings (or not) just like above.

wellsk's picture
wellsk
Online
Joined: 2011-05-17

>Test failed again. I did not have a debugger connected
I wasn't able to repeat the failure here with the current FLASH variant of the ea1788_tcpecho_freertos project.
I opened multiple (3+) flood ping sessions to the EA1788 board and ran them for an hour - the board never lost ping response, although it did occasionally lose some packets or get increased latency times due to loading. (This is with only 3 TX and RX descriptors being used for the driver, so it's to be expected).

The FLASH example is very memory limited with the following configuration:
From lwipopts.h...

#define PBUF_POOL_SIZE                  6
#define MEM_SIZE (12*1024)

From lpc_emac_config.h...
#define LPC_NUM_BUFF_RXDESCS 3
#define LPC_NUM_BUFF_TXDESCS 3

It -is- very possible to get LWIP into a state where it's internally 'out of memory' and no more pbufs can be allocated for the receive descriptor queue. If this happens, the code will continue to run, but the MAC controller will accept no more packets as no more descriptors are queued due to no buffers being allocatable! Packet fragmentation will definitely do this (ping -f -s 5000 ). When LWIP allocates space for copied packet fragments, it uses the same memory for normal buffers - and the pbuf allocation will eventually fail. Only adding more memory will fix this.

The FLASH variant example uses the internal RAM only. Ethernet buffers and descriptors can only be located in peripheral RAM on the 17xx, which is only 32K. This isn't a lot for LWIP. A DRAM variant is also available that uses DRAM for buffer/pbuf/descriptor allocation from a huge memory pool.
Here are the DRAM variant settings, this also boots from FLASH...

#define PBUF_POOL_SIZE                  256
#define MEM_SIZE (4096*1024)
#define LPC_NUM_BUFF_RXDESCS 128
#define LPC_NUM_BUFF_TXDESCS 128

I've attached both versions (FLASH/IRAM and FLASH/DRAM) of the example here for the EA1788 board. I've also attached a debug version of the DRAM version here that should output messages to the UART port similar to the following. If LWIP fails, you should see an error message you can re-post here.

lpc_low_level_input: Packet received: a0027110, size 102 (index=109)
pbuf_alloc(length=1536)
pbuf_alloc(length=1536) == a0026af4
lpc_rxqueue_pbuf: pbuf packet queued: a0026af4 (free desc=0)
pbuf_header: old a0027120 new a002712e (-14)
pbuf_free(a0027110)
pbuf_free: deallocating a0027110
lpc_low_level_input: Packet received: a002772c, size 102 (index=110)
pbuf_alloc(length=1536)
pbuf_alloc(length=1536) == a0027110
lpc_rxqueue_pbuf: pbuf packet queued: a0027110 (free desc=0)
pbuf_header: old a002773c new a002774a (-14)
pbuf_free(a002772c)
pbuf_free: deallocating a002772c
lpc_low_level_input: Packet received: a0027d48, size 102 (index=111)

PreviewAttachmentSize
lwip_ea1788_prebuilt_examples.zip703 KB

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

tamirmichael
Offline
Joined: 2012-04-13

Thanks. I have downloaded "dram_uart_example.hex". I will start a test right away.

tamirmichael
Offline
Joined: 2012-04-13

Kevin,

I have adjusted the latest state of affairs to accommodate more memory - somehow my board refused to boot with the .hex file you provided (so there is, unfortunately, no extra debug information available in the image that I test now). Either way, it seems a little odd that a system would run out of memory sitting idle (but of course, the network interface does process packets). I would expect that there is still a memory/resource "leak" somewhere...

SeleneSW
Offline
Joined: 2012-06-19

I have managed my sources in order to allocate pbuf areas on external SDRAM and the new FW is running ok.
I will keep it running for at least 3 hours and then I will report.

Thank you for the help.

Mario

tamirmichael
Offline
Joined: 2012-04-13

Mine failed after about 2 hours...

SeleneSW
Offline
Joined: 2012-06-19

I have stopped the test. It was alive and kicking after about 370.000 echos...
I think the problem is solved, but I'm going to make more tests leaving it working the whole week-end.
See you on monday.

tamirmichael
Offline
Joined: 2012-04-13

Mario,

Can you tell me what software you are using? I have simply added memory to a DRAM variant make available as indicated above - which eventually failed. Did you run the .hex file provided here?

wellsk's picture
wellsk
Online
Joined: 2011-05-17

Hmm, I'm really not sure what's happening. With the FLASH build, you should be able to turn on LWIP debugging and get some output messages from the UART on failure. Maybe you'll catch something I can trace? (Set LWIP_DEBUG as a define in your compile argument path). You won't need to have a debugger attached to see the error output on the UART if this is defined.

You can trace pbuf related functions by setting the following define in lwipopts.h when LWIP_DEBUG is enabled.

#define PBUF_DEBUG                      LWIP_DBG_ON

The MAC driver also has a special define you can enable to trace some of it's events. Continuous failures to allocate a pbuf and queue a receive descriptor would indicate memory is very tight.

#define UDP_LPC_EMAC                    LWIP_DBG_ON

http://www.lpcware.com/content/project/lightweight-ip-lwip-networking-st...
Note: Enabling debug will drastically impact performance.

Knowing what type of error (or lack of error) is important in determining the failure. In the zero-copy implementation, a failure to allocate a pbuf for the receive queue may not necessarily always be a failure if other receive descriptors still have pbufs associated with them. The driver will attempt to allocate them later, but you will see a warning message like this:

lpc_rx_queue: could not allocate RX pbuf index 4...

A failure to allocate at least 1 pbuf when all receive descriptors are used would be bad. This results in a situation where no more descriptors are queued and the MAC will reject receive packets from that point on (until more descriptors/pbufs are queued). Receive pbufs are allocated in the receive task whenever a packet is received. The new pbuf is immediately associated with the new descriptor before the current packet is processed. If it can't allocate it, it goes on and will attempt to allocate on the next received packet (possibly allocating and queuing more than 1 pbuf). Receive packets are always allocated with the PBUF_DMA type with size 1536 bytes and are contiguous in memory.

If it's any easier, I can provide a copy based pbuf implementation of the driver that frees no pbufs and only allocates pbufs for the receive buffer copy. This is a full static implementation of the descriptors and buffers, but is slower than the zero-copy approach. It won't help with memory constrained issues though.

Kevin Wells

NXP Semiconductors

Kevin Wells

NXP Semiconductors

SeleneSW
Offline
Joined: 2012-06-19

Hi,

good news, the demo program is still running after the week-end!

tamirmichael,

I'm using the tcpecho_freertos demo integrated in my own project and running on a proprietary board.
I started from the new sources posted by Kevin (http://sw.lpcware.com/?p=lwip_lpc.git&a=summary)
and modified them as indicated.

lwipopts.h:
#define PBUF_POOL_SIZE 256
#define MEM_SIZE (4096*1024)

lpc_emac_config.h:
#define LPC_NUM_BUFF_RXDESCS 128
#define LPC_NUM_BUFF_TXDESCS 128

Then I created a new DRAM section called LWIP_RAM modifying my linker script file.
Last, I modified some variable declarations in LwIP to force their allocation in the
new section (see attachments).

After that the program started to function correctly, but now I am a little bit confused:
if over 4MB of memory are needed to perform a simple echo function, can we call it
"Light Weight IP" ?

PreviewAttachmentSize
lwip-1.4.0_src_core.zip10.49 KB
tamirmichael
Offline
Joined: 2012-04-13

It is very kind of you to share your changes. Thank you.
Either way, I would expect the linker to automatically map variables that do not fit in IRAM to DRAM. I will make another attempt first with your code, then I will make an attempt based on my sources.
Regarding memory usage: I have been working successfully with uIP so far (but covet the socket interface of lwIP + improved performance...) so I'm used to a tiny footprint. It is hard to argue with facts - your program keeps on functioning - but it would be interesting to know how much memory is actually consumed? What is the minimum RAM footprint?

feedback