open thought and learning

Archive for the ‘sysadmin’ Category

Kübler-Ross model – Tailored for operations teams

Unless a direct alert for an application fires, the application team almost always assumes that the problem lies somewhere else. This shows up most often when the alerts are triggered on upstream tiers in the application stack. I first came across the Kübler-Ross model while watching House and reading some articles online; it is widely known as the ‘5 Stages of Grief’.

From our perspective, here’s how the application team’s responses change, in line with the Kübler-Ross model:

Denial: “Nothing’s wrong with our tier. Why did you even call us?”, “This is a *false alarm*”.

This is only a temporary defense for the application team. The feeling is generally replaced by a growing awareness of the impact this incident might have.

Anger: “How can my application fail?!”, “Not a single alert fired!”, “Check the freakin’ network!!”

Once in the second stage, the team recognizes that denial cannot continue and moves on to getting other teams on the line. It cannot be responsible alone.

Bargaining: “This isn’t really a user-facing problem!”, “This is actually a dissatisfaction report, not an incident, come on!”

The third stage involves the hope that the team can somehow postpone the impact or the creation of an ‘incident’. Usually, the negotiation to ignore the incident is made with Tier 2 in exchange for improved alerting, network-bashing and other personal favors.

Depression: “[TIER2] XYZ APPTEAM, are you looking at it?…[TIER2] Ping? …..[TIER2] You there?….[APPTEAM] Still Looking ….. [TIER2] Any update?”

During the fourth stage the application team begins to understand the certainty of an incident. Because of this, the team representative may become silent, refuse disturbance, and spend more time looking at application counters, Cactus etc., trying to determine what went wrong with their beloved application. Didn’t they love it enough? Did it catch ‘the bug’?

Acceptance: “Yes, we appear to be losing X dollars per 100 page hits”, “Can’t fix the code, might as well fail over traffic and mitigate impact.”

In this last stage, the application team begins to come to terms with the ‘mortality’ of the feature and understands that mitigation needs to be done.


Written by mohitsuley

April 16, 2011 at 12:49 am

Posted in sysadmin

TCP/IP Drinking Game

After a long hiatus from my activity here, I have decided on two things:

1. Blog entries can be short

2. They need to be more frequent – for my own sanity as well.

I plan to read and take cues from the TCP/IP Drinking Game and learn the nuances of the protocol a bit more. There are some interesting things I learnt which I will write about, later.

That’s it for now. Ciao!

Written by mohitsuley

April 14, 2009 at 6:52 pm

Posted in networks, sysadmin

disown and nohup

This is the first time I started a file transfer and, in hindsight, smacked my head and said “Wish I’d started this with nohup or screen…”; I could have left for home on time with this laptop tagging along with me.

What I didn’t know *then* (now I know) is that there’s a beautiful bash built-in called disown, which can detach any/all running jobs from the shell so that they survive a logout and end up re-parented to the init process. Yay!

[root@linux-test data]# scp RHEL4-U5-i386-AS-disc4.iso suleym@
suleym@'s password:
RHEL4-U5-i386-AS-disc4.iso 0% 1016KB 1.0MB/s 06:21 ETA
[1]+ Stopped scp RHEL4-U5-i386-AS-disc4.iso suleym@
[root@linux-test data]# bg
[1]+ scp RHEL4-U5-i386-AS-disc4.iso suleym@ &
[root@linux-test data]# disown -h

More about disown here and in man bash.
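The same sequence can be tried safely in a script. This is only a minimal sketch: `sleep 60` is a stand-in for the scp transfer, and the variable names are made up for illustration. It assumes bash, since disown is a bash built-in.

```shell
#!/usr/bin/env bash
# 'sleep 60' is a stand-in for the long-running scp transfer.
sleep 60 &
transfer_pid=$!

# -h marks the job so that bash does not send it SIGHUP if the shell itself
# receives one (for example when the terminal closes). The job stays in the
# job table; plain 'disown' would remove it from the table instead.
disown -h "$transfer_pid"

# kill -0 sends no signal; it only checks that the process still exists.
if kill -0 "$transfer_pid" 2>/dev/null; then
    transfer_state="running"
else
    transfer_state="gone"
fi
echo "transfer (pid $transfer_pid) is $transfer_state"

# Clean up the stand-in; a real transfer would be left alone to finish.
kill "$transfer_pid"
```

The difference from nohup is mostly timing: nohup has to be decided on before the command starts, while disown can be applied to a job that is already running.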

Writing this post for the sake of not reverse-engineering searches on Google, and so that people see disown *related* to nohup.

Written by mohitsuley

August 21, 2008 at 10:09 pm

Posted in linux, sysadmin

Semaphore problems on Apache

I came across a simple but intriguing problem – apachectl restart would work and restart the Apache processes, and in my case restart the CA/Netegrity SiteMinder agent as well. However, the server didn’t respond, nor were there any messages in the error log. The SM logs said the agent had initialized successfully.

When I removed mod_sm.so and restarted Apache (after removing the environment variables related to SM), everything worked just fine. I naturally assumed that the problem was with the module I had just removed.

It turned out that the problem was a particular semaphore that had not been released for about 24 hours and was somehow linked to the SiteMinder agent module. After I did an ipcrm -s ID, everything worked as before.
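For hunting down such leftovers, a rough sketch along these lines might help. It assumes the column layout printed by ipcs -s on Linux (key, semid, owner, perms, nsems); the function name and the owner name apache are hypothetical, so substitute the user your httpd actually runs as. It only prints the ipcrm commands rather than running them.

```shell
#!/bin/sh
# Print the IDs of semaphore arrays owned by a given user.
# Reads 'ipcs -s' style output on stdin: key, semid, owner, perms, nsems.
semids_for_owner() {
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Dry run: show the ipcrm commands instead of executing them.
# 'apache' is a hypothetical owner name used for illustration.
if command -v ipcs >/dev/null 2>&1; then
    ipcs -s | semids_for_owner apache | while read -r semid; do
        echo "ipcrm -s $semid"
    done
fi
```

Semaphores that are in use look the same as stale ones in this listing, so it is probably wise to review the list with Apache stopped before removing anything.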

I always thought that semaphores/shared memory segments not being freed would stop Apache from restarting successfully. This is the first time Apache didn’t complain on a restart, no logs displayed any errors, ‘removing’ a module rectified the error, and putting it back actually made the issue recur!

Need to learn more about semaphore allocation in linux.

Written by mohitsuley

August 16, 2008 at 2:08 am

Posted in linux, sysadmin

Tuning a JVM for Berkeley DB Java Edition

For those who have not heard about Berkeley DB (called BDB): it is a transactional storage engine with basic key/value pairs, very agile and highly performance-oriented, with no SQL-like engine. Compared to its native version, the Java Edition has quite a few differences and is useful when it is to be integrated with a basic Java application.

The aim is to keep as much of the database in RAM as possible, all the time, so that all query responses are fast. Based on this, here’s my take on tuning the JVM that hosts the BDB:

  • JVM heap size should be around the same size as the data store
  • Use the Concurrent Mark/Sweep GC algorithm to have low-pause GC times
  • Since most of the objects are going to be living ‘forever’, it’ll make sense to have a huge tenured generation
  • If the DB size can vary, refrain from giving Xmx and Xms the same values. Give a huge difference so that the JVM can manage it as your data grows

This is what CATALINA_OPTS might look like (includes a lot of debug flags as well):

CATALINA_OPTS="-server -Xms1024m -Xmx4096m -XX:+UseMembar -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -verbose:gc -Xloggc:/appl/tomcat/logs/gcdata.txt"

-XX:+UseMembar is there to accommodate the high IO waits I had been seeing – I think there’s a problem in linux with the JDK using memory barriers. I read about the bug here.
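To keep track of which flag serves which purpose, the same options can be grouped in a shell fragment. This is just a sketch: the flags are the ones above, but the variable names are made up, and placing it in Tomcat's bin/setenv.sh is an assumption on my part.

```shell
#!/bin/sh
# Heap: start small and leave room to grow as the data store grows.
HEAP_OPTS="-server -Xms1024m -Xmx4096m"

# GC: CMS for short pauses; NewRatio=4 makes the tenured generation
# four times the size of the young generation.
GC_OPTS="-XX:+UseConcMarkSweepGC -XX:NewRatio=4"

# Workaround for the high IO waits described above.
WORKAROUND_OPTS="-XX:+UseMembar"

# Debug: GC logging flags, which can be dropped once tuning is done.
GC_LOG_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCApplicationStoppedTime -Xloggc:/appl/tomcat/logs/gcdata.txt"

CATALINA_OPTS="$HEAP_OPTS $GC_OPTS $WORKAROUND_OPTS $GC_LOG_OPTS"
export CATALINA_OPTS
echo "$CATALINA_OPTS"
```

With the tenured generation at four fifths of the heap, the long-lived BDB cache has most of the space, which is the point of the third bullet above.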

BDB Java Edition is not a replacement for a traditional database, but is a means to have almost immediate results for things like look-up data, subscriptions and most frequently-used information. There are quite a number of on-line resources available to help you set it up and use it – native or Java, whichever your flavor is.

memcached is another such tool, useful for improving performance on the path between an application and its database. More on it in another post some other time.


Written by mohitsuley

August 8, 2008 at 9:30 pm

Making SSI work on a JSP response

If you need to parse SSI from a JSP response, there are two simple ways to do it:

1. Use the SSIServlet and handle it within tomcat
2. If you have a separate web server like Apache in front of tomcat, and you want that web server to do it, the plot thickens.

If you ask, ‘why, when you are already using Java? You can do all that you can do with SSI in a JSP, right?‘, you might be surprised. Let’s just say the reason is out-of-scope for this post.

So, you have a three-tier architecture with web servers spread across the world and app/DB servers local to certain data-centers. Naturally, you might want to ‘assimilate’ content on the web servers (closest to local users based on 3DNS/similar) where it’s already present instead of shuttling bytes back-and-forth between the web and app layers. That’s the reason. And did I say earlier it was out of scope? My bad.

The way to do it is to set up a Location in Apache for the path in question, add an AddOutputFilterByType statement with the MIME type text/x-server-parsed-html and finally, in the JSP itself, set that MIME type on the response using response.setContentType.

Your Location section might look like this:

<Location /application/ssiparser >
Options +Includes
AddOutputFilterByType INCLUDES;DEFLATE text/x-server-parsed-html
</Location>

In an ideal scenario, everything should have been hunky-dory, but life isn’t so simple. At least it didn’t happen so easily for me.

Earlier, to improve performance, I had added a CompressionFilter on tomcat to gzip all of its responses so that traffic between the app and web tiers would be lighter as well. This meant that by the time the response reached Apache it was already gzipped, and SSI parsing was not possible. Mind you, this is Apache 2.0.x and not 2.2.x, where you can actually set up FilterDeclare and such.

There are two ways to get around this problem:

1. Get the CompressionFilter to exclude the Location you have set up for SSI, and then pass INCLUDES;DEFLATE to AddOutputFilterByType.
2. Or, unset the Accept-Encoding header on the request first so that it doesn’t take gzip and the CompressionFilter doesn’t compress it at all. If I try to deflate it again now, it doesn’t happen.

The problem with (2) is that you end up sending uncompressed data between the app and web tiers. Option (1) would be the right way to go.

(1) will entail a change on the web.xml for your application.

(2) will look like this:

<Location /application/ssiparser >
Options +Includes
RequestHeader unset Accept-Encoding
RequestHeader set Accept-Encoding deflate
AddOutputFilterByType INCLUDES;DEFLATE text/x-server-parsed-html
</Location>

The JSP will start with:

<!--#include virtual="/static/content/news.html"-->
<!--#include virtual="/static/content/weather.html"-->
<!--#include virtual="/static/content/media.html"-->

Most folks do not upgrade Apache as they do with other kinds of software, just because it's so damn stable and fulfills your requirements very well. However, I feel if you need to work with filters and play around with them, 2.2 will be the way to go.

Written by mohitsuley

August 7, 2008 at 1:15 am

OpenDeploy rollback across a WAN

While working on Interwoven OpenDeploy I came across the following problem:

Large deployments or file pushes spanning a WAN or a continent would sometimes time out and roll back. The problem showed up when the file lists differed significantly in size.

This is what happens:

  1. OD starts n threads based on the n lists of files to be deployed.
  2. Thread 1 finishes and the remaining n-1 threads continue file transfer.
  3. After exactly 5 minutes, thread 1 times out (shows a TCP packet with RST flag set on tcpdump) and after all threads finish, the deployment fails and rolls back the transaction.

Root cause:
Some network device along the way times out TCP sessions that have been idle for more than 300 seconds and sends a packet with the RST flag set, essentially dropping the connection. When this happens, OpenDeploy considers the transaction corrupt and rolls it back.


Possible fixes:

  1. Get the firewall timeout extended to a more reasonable value (perhaps similar to the default tcp_keepalive_time of 7200 seconds?) – not practical if a number of teams are involved.
  2. Change tcp_keepalive_time to ~200 seconds
  3. If the keepalive change does not help alone, try http://libkeepalive.sourceforge.net . Works like a charm!
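A quick way to check where a Linux box stands against the 300-second idle timeout seen above might look like the sketch below. The function name is made up for illustration, and the script only prints the suggested sysctl command instead of changing anything.

```shell
#!/bin/sh
# Succeed if keepalive probes would start before the idle timeout expires.
keepalive_beats_timeout() {
    keepalive_secs=$1
    idle_timeout_secs=$2
    [ "$keepalive_secs" -lt "$idle_timeout_secs" ]
}

# Linux exposes the current setting under /proc; skip quietly elsewhere.
proc_file=/proc/sys/net/ipv4/tcp_keepalive_time
if [ -r "$proc_file" ]; then
    current=$(cat "$proc_file")
    if keepalive_beats_timeout "$current" 300; then
        echo "tcp_keepalive_time=$current is below the 300 s idle timeout"
    else
        echo "tcp_keepalive_time=$current is too high; as root, consider:"
        echo "  sysctl -w net.ipv4.tcp_keepalive_time=200"
    fi
fi
```

Keep in mind that the kernel setting only applies to sockets that have SO_KEEPALIVE enabled. That is presumably why step 3 is needed when the application does not set the option itself.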

Generally speaking, and not being ‘opendeploy-centric’, I did learn the importance of keepalive packets and how the default value of 7200 seconds might not be practical when an application talks to servers across network borders.

Thanks to my colleague Prajal Sutaria for working on this!

Written by mohitsuley

August 6, 2008 at 9:04 pm

Posted in linux, sysadmin
