vyatta

open thought and learning

OpenDeploy rollback across a WAN

While working on Interwoven OpenDeploy I came across the following problem:

Large deployments or file pushes spanning a WAN or a continent would sometimes time out or roll back. The problem showed up when the file lists differed significantly in size.

This is what happens:

  1. OD starts n threads, one for each of the n lists of files to be deployed.
  2. Thread 1 finishes while the remaining n-1 threads continue transferring files.
  3. Exactly 5 minutes later, thread 1 times out (tcpdump shows a TCP packet with the RST flag set). Once all threads have finished, the deployment fails and the transaction is rolled back.

Root cause:
Some network device along the path times out TCP sessions that have been idle for more than 300 seconds and sends an RST, essentially dropping the connection. When this happens, OpenDeploy considers the transaction corrupt and rolls it back.

Fix:

  1. Get the firewall's idle timeout extended to a more reasonable value (perhaps similar to the default tcp_keepalive_time of 7200 seconds?). This is not practical if a number of teams are involved.
  2. Lower tcp_keepalive_time to ~200 seconds, so that keepalive probes go out before the 300-second idle timeout expires.
  3. If the keepalive change alone does not help, try http://libkeepalive.sourceforge.net, which enables keepalive on an application's sockets through library preloading. Works like a charm!
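
For applications whose source you control, the effect of steps 2 and 3 can be approximated per socket. The following is only a minimal sketch, not what OpenDeploy or libkeepalive actually do: the helper name `enable_keepalive` and the interval and probe values are assumptions made for illustration, and the `TCP_KEEP*` options are Linux-specific, so the code skips any the platform does not offer.

```python
import socket

def enable_keepalive(sock, idle=200, interval=30, probes=5):
    """Turn on TCP keepalive for one socket and tune its timers where possible.

    idle:     seconds of silence before the first probe (hypothetical default,
              chosen to stay under a 300-second firewall idle timeout)
    interval: seconds between probes (assumed value)
    probes:   unanswered probes before the connection is dropped (assumed value)
    """
    # Without SO_KEEPALIVE the kernel sends no probes at all for this socket,
    # whatever tcp_keepalive_time is set to.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the system-wide timers; skip any option that
    # this platform's socket module does not define.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock

if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    # A non-zero value means keepalive is enabled on this socket.
    print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
    s.close()
```

The system-wide tcp_keepalive_time only applies to sockets that have SO_KEEPALIVE switched on, which may be why the sysctl change alone is not always enough.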

More generally, and not being ‘opendeploy-centric’, I learned how important keepalive packets are, and that the default value of 7200 seconds may not be practical when an application talks to servers across network borders.
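
A quick way to see whether a Linux host is exposed to this is to compare its keepalive time with the idle timeout of the devices in between. This is a rough sketch under a few assumptions: the function names are made up for illustration, the 300-second figure is the timeout observed in this case, and the check says nothing about whether the application enables keepalive on its sockets at all.

```python
from pathlib import Path

# Linux exposes the system-wide keepalive idle time (in seconds) here.
KEEPALIVE_TIME = Path("/proc/sys/net/ipv4/tcp_keepalive_time")

def read_keepalive_time(path=KEEPALIVE_TIME):
    """Return the idle time before the first probe, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None

def outlasts_idle_timeout(keepalive_time, device_timeout=300):
    """True if the first probe is sent before the device drops the idle session."""
    return keepalive_time is not None and keepalive_time < device_timeout

if __name__ == "__main__":
    current = read_keepalive_time()
    print("tcp_keepalive_time:", current)
    print("probes sent before a 300 s idle timeout:", outlasts_idle_timeout(current))
```

With the default of 7200 seconds the check fails against a 300-second timeout; with ~200 seconds it passes.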

Thanks to my colleague Prajal Sutaria for working on this!

Written by mohitsuley

August 6, 2008 at 9:04 pm

Posted in linux, sysadmin
