vyatta

open thought and learning

Archive for August 2008

disown and nohup


This is the first time I started a file transfer and then, in hindsight, smacked my head and said “Wish I’d started this with nohup or screen…”; I could have left home on time with this laptop tagging along with me.

What I didn’t know *then* (I do now) is that there’s a beautiful bash built-in called disown, which can detach any or all running jobs from the shell so they keep running after you log out (orphaned jobs end up reparented to the init process). Yay!


[root@linux-test data]# scp RHEL4-U5-i386-AS-disc4.iso suleym@192.168.1.1:/appl/RHEL-AS4/
suleym@192.168.1.1's password:
RHEL4-U5-i386-AS-disc4.iso 0% 1016KB 1.0MB/s 06:21 ETA
[1]+ Stopped scp RHEL4-U5-i386-AS-disc4.iso suleym@192.168.1.1:/appl/RHEL-AS4/
[root@linux-test data]# bg
[1]+ scp RHEL4-U5-i386-AS-disc4.iso suleym@3.122.220.169:/appl/RHEL-AS4/ &
[root@linux-test data]# disown -h
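
In other words, the rescue sequence above boils down to this (a quick recap of the transcript; Ctrl-Z is what produced the “Stopped” line):

# Ctrl-Z          suspend the running scp
bg                # resume it in the background
disown -h         # tell bash not to send it SIGHUP when the shell exits

Next time, starting the transfer under nohup (or inside screen) avoids the scramble altogether.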

More about disown here and in man bash.

I’m writing this post so I don’t have to reverse-engineer those Google searches next time, and so that people looking up nohup also come across disown.

Written by mohitsuley

August 21, 2008 at 10:09 pm

Posted in linux, sysadmin


Semaphore problems on Apache


I came across a simple but intriguing problem: apachectl restart would work and restart the apache processes, and in my case restart the CA/Netegrity SiteMinder agent as well. However, the server didn’t respond, and there were no messages in the error log either. The SM logs said the agent initialized successfully.

When I removed mod_sm.so and restarted apache (after removing the SM-related environment variables), everything worked just fine. I naturally assumed the problem was with the module I had just removed.

It turned out that the problem was a particular semaphore that hadn’t been released for about 24 hours and was somehow linked to the SiteMinder agent module. After an ipcrm -s ID, everything worked fine as before.
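
For reference, hunting down and clearing a stale semaphore looks roughly like this (a sketch; the grep pattern and the ID are placeholders for whatever user your web server runs as and whatever ipcs reports):

ipcs -s                  # list all semaphore arrays with owner and semid
ipcs -s | grep nobody    # narrow it down to the web server's user, if needed
ipcrm -s <semid>         # remove the stale semaphore array by its ID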

I had always thought that semaphores or shared memory segments not being freed would result in apache failing to restart. This was the first time apache didn’t complain on a restart, no logs showed any errors, ‘removing’ a module rectified the error, and putting it back made the issue recur!

Need to learn more about semaphore allocation in linux.

Written by mohitsuley

August 16, 2008 at 2:08 am

Posted in linux, sysadmin


Tuning a JVM for Berkeley DB Java Edition


For those who have not heard of Berkeley DB (BDB): it is a transactional storage engine for basic key/value pairs, very agile and highly performance-oriented, with no SQL-like engine on top. Compared to its native version, the Java Edition has quite a few differences and is useful when it has to be integrated with a basic Java application.

The aim is to keep the database in RAM as much of the time as possible, so that all query responses are fast. Based on this, here’s my take on tuning the JVM that hosts the BDB:

  • Keep the JVM heap size around the same size as the data store
  • Use the Concurrent Mark/Sweep GC algorithm to get low-pause GC times
  • Since most of the objects are going to live ‘forever’, it makes sense to have a large tenured generation
  • If the DB size can vary, refrain from giving -Xmx and -Xms the same value; leave a big gap between them so the JVM can grow the heap as your data grows

This is what CATALINA_OPTS might look like (includes a lot of debug flags as well):

CATALINA_OPTS="-server -Xms1024m -Xmx4096m -XX:+UseMembar -XX:+PrintGCDetails -
XX:+PrintGCApplicationStoppedTime -XX:NewRatio=4 -XX:+UseConcMarkSweepGC -verbos
e:gc -Xloggc:/appl/tomcat/logs/gcdata.txt"

-XX:+UseMembar is there to work around the high I/O waits I had been seeing; I think there’s a problem on linux with the JDK’s use of memory barriers. I read about the bug here.
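
To see whether the tenured generation is actually filling up with the data store and what the collector is doing about it, jstat against the running Tomcat helps (a sketch; the PID is a placeholder and the interval is in milliseconds):

jstat -gcutil <tomcat-pid> 5000        # watch old-gen occupancy (O) and GC counts/times
tail -f /appl/tomcat/logs/gcdata.txt   # the -Xloggc output from the flags above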

BDB Java Edition is not a replacement for a traditional database, but is a means to have almost immediate results for things like look-up data, subscriptions and most frequently-used information. There are quite a number of on-line resources available to help you set it up and use it – native or Java, whichever your flavor is.

memcached is another such tool that is useful when improving performance for an application-database connection.  More on it in another post some other time.

Cheers!

Written by mohitsuley

August 8, 2008 at 9:30 pm

Making SSI work on a JSP response


If you need to parse SSI from a JSP response, there are two simple ways to do it:

1. Use the SSIServlet and handle it within tomcat
2. If you have a separate web server like Apache in front of tomcat, and you want that web server to do it, the plot thickens.

If you ask, ‘why, when you are already using Java? You can do everything SSI does in a JSP, right?’, you might be surprised. Let’s just say the reason is out of scope for this post.

So, you have a three-tier architecture with web servers spread across the world and app/DB servers local to certain data-centers. Naturally, you might want to ‘assimilate’ content on the web servers (closest to local users based on 3DNS/similar) where it’s already present instead of shuttling bytes back-and-forth between the web and app layers. That’s the reason. And did I say earlier it was out of scope? My bad.

The way you do it: set up a Location in Apache for the path you want parsed, add an AddOutputFilterByType statement with the MIME type text/x-server-parsed-html, and finally, in the JSP itself, set that MIME type on the response (via response.setHeader, as in the snippet further down).

Your Location section might look like this:

<Location /application/ssiparser >
Options +Includes
AddOutputFilterByType INCLUDES;DEFLATE text/x-server-parsed-html
</Location>

In an ideal scenario, everything should have been hunky-dory, but life isn’t so simple. At least it didn’t happen so easily for me.

What I had done earlier, as a performance improvement, was add a CompressionFilter on tomcat to gzip all of its responses so that the app-to-web hop improves as well. That meant that by the time a response reached Apache it was already gzipped, and SSI parsing was not possible. Mind you, this is Apache 2.0.x and not 2.2.x, where you can actually set up FilterDeclare and the like.

There are two ways to get around this problem:

1. Get the CompressionFilter to exclude the Location you have on for SSI, and then pass on INCLUDES;DEFLATE to AddOutputFilterByType.
2. Or, unset the Accept-Encoding header on the request first so that the CompressionFilter never sees gzip as acceptable and doesn’t compress the response at all. If I try to deflate it again now, it doesn’t happen.

The problem with (2) is that you end up sending uncompressed data across the app-to-web hop. Option (1) is the right way to go.

(1) entails a change in your application’s web.xml; a sketch follows below.
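
A rough idea of what that change might look like, assuming the compression filter is declared in web.xml under the name CompressionFilter (the filter name and url-pattern below are placeholders): instead of mapping the filter to /*, map it only to paths that do not include the SSI Location, so those responses leave tomcat uncompressed.

<filter-mapping>
<filter-name>CompressionFilter</filter-name>
<url-pattern>/application/other/*</url-pattern>
</filter-mapping>
<!-- note: no mapping covers /application/ssiparser -->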

(2) will look like this:

<Location /application/ssiparser >
Options +Includes
RequestHeader unset Accept-Encoding
RequestHeader set Accept-Encoding deflate
AddOutputFilterByType INCLUDES;DEFLATE text/x-server-parsed-html
</Location>

The JSP will start with:

<%
response.setHeader("Content-Type","text/x-server-parsed-html");
%>
<!--#include virtual="/static/content/news.html"-->
<!--#include virtual="/static/content/weather.html"-->
<!--#include virtual="/static/content/media.html"-->

Most folks do not upgrade Apache the way they do other software, just because it's so damn stable and fulfills requirements very well. However, if you need to work with filters and play around with them, I feel 2.2 is the way to go.

Written by mohitsuley

August 7, 2008 at 1:15 am

OpenDeploy rollback across a WAN


While working on Interwoven OpenDeploy I came across the following problem:

Large deployments or file pushes spanning a WAN or a continent would sometimes time out or roll back. The problem showed up where there was a significant difference in size between the file lists.

This is what happens:

  1. OD starts n threads based on the n lists of files to be deployed.
  2. Thread 1 finishes and the remaining n-1 threads continue file transfer.
  3. After exactly 5 minutes, thread 1 times out (tcpdump shows a TCP packet with the RST flag set), and after all threads finish, the deployment fails and the transaction is rolled back.

Root cause:
Some network device along the way times out TCP sessions that are idle for more than 300 seconds and sends an RST, essentially dropping the connection. When this happens, OpenDeploy considers the transaction corrupt and rolls it back.

Fix

  1. Get the firewall to extend the timeout to a more reasonable value (perhaps similar to the default tcp_keepalive_time of 7200 seconds?); not practical if a number of teams are involved.
  2. Lower tcp_keepalive_time to ~200 seconds so keepalive probes go out before the 300-second idle cutoff (see the sketch after this list).
  3. If the keepalive change alone does not help, try http://libkeepalive.sourceforge.net . Works like a charm!
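
On linux the keepalive part looks roughly like this (a sketch; tcp_keepalive_time only matters for sockets that have SO_KEEPALIVE enabled, which is exactly what libkeepalive adds via LD_PRELOAD, and the library path and start command below are placeholders):

echo 200 > /proc/sys/net/ipv4/tcp_keepalive_time   # probe idle connections before the 300-second cutoff
# add net.ipv4.tcp_keepalive_time = 200 to /etc/sysctl.conf to make it persistent
LD_PRELOAD=/usr/lib/libkeepalive.so <opendeploy-start-command>   # force SO_KEEPALIVE on the app's sockets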

Generally speaking, and not being ‘opendeploy-centric’, I did learn the importance of keepalive packets and how the default value of 7200 seconds might not be practical when an application talks to servers across network borders.

Thanks to my colleague Prajal Sutaria for working on this!

Written by mohitsuley

August 6, 2008 at 9:04 pm

Posted in linux, sysadmin


Caching problems with SAML


Anyone who has worked with SAML knows how effective and simple it is to achieve federated services with your own authentication mechanism. What needs to be remembered, though, is that end users might very well be behind firewalls. With firewalls come proxies, and those proxies open up a Pandora’s box: caching.

Proxies can cache the POST response from the authentication user agent and make user1 see a page that says ‘Welcome user2’. Do a forced refresh (Ctrl-F5, Cmd-R) in the browser, and you can see your own ID again.

Fixes:
1. Ensure proxies don’t cache any content for your authentication domain.
2. Pass a ‘random’ value such as a timestamp to the URL via Javascript (to make each request unique).
3. Force the content provider’s web server, and the user agent’s web server, to set Cache-Control to max-age=0 and proxy-revalidate (see the sketch after this list).
4. Make sure you’re sending an invalidation string in the packet as well.
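
For (3), with Apache and mod_headers the directive is roughly this (a sketch; the Location is a placeholder for wherever your authentication responses are served from):

<Location /auth>
Header set Cache-Control "max-age=0, proxy-revalidate"
</Location>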

Clearing caches on ~100 proxy servers across a company might not be the right choice. The onus should lie on the development and sysadmin teams to make sure important pages are non-cacheable. Never trust proxy servers is the motto here.

Written by mohitsuley

August 1, 2008 at 4:18 pm

Posted in linux, sysadmin


Pinging hostnames from /etc/hosts


Problem Statement: Ability to ping a user-defined hostname with a valid IP address
Solution: Simple, put it in the /etc/hosts file and you’re done.

You still can’t do it? Did you check nsswitch.conf? This is the line that should be there: hosts: files dns
So, with the right /etc/nsswitch.conf and /etc/hosts, it should work, right?

root@treebeard:~# cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 treebeard
192.168.2.2 mithrandir


root@treebeard:~# ping mithrandir
PING mithrandir (192.168.2.2) 56(84) bytes of data.
64 bytes from mithrandir (192.168.2.2): icmp_seq=1 ttl=64 time=0.092 ms
64 bytes from mithrandir (192.168.2.2): icmp_seq=2 ttl=64 time=0.067 ms

It works!
But…

root@treebeard:~# sudo su - mohit
mohit@treebeard:~$ ping mithrandir
ping: unknown host

It seems when I switch to a non-root user, entries in /etc/hosts fail to take effect.
Why?

The problem was with the read permissions on /etc/nsswitch.conf; I hadn’t noticed that it was not world-readable.

root@treebeard:~# chmod o+r /etc/nsswitch.conf
root@treebeard:~# sudo su - mohit
mohit@treebeard:~$ ping mithrandir
PING mithrandir (192.168.2.2) 56(84) bytes of data.
64 bytes from mithrandir (192.168.2.2): icmp_seq=1 ttl=64 time=0.092 ms
64 bytes from mithrandir (192.168.2.2): icmp_seq=2 ttl=64 time=0.067 ms

Worked, finally. The weird thing is I would have expected ping to complain that it wasn’t able to read a file or something, but there was nothing of the sort. This also means you can force a regular user to stick to DNS resolution while all daemons and root-owned processes leverage /etc/hosts.
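
Two quick checks that would have saved the guesswork (a sketch; the hostname is the one from the example above):

ls -l /etc/nsswitch.conf     # should be world-readable, e.g. -rw-r--r--
getent hosts mithrandir      # resolves through nsswitch, the same path ping uses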

A bad idea, I’d say; this could be a ticking time bomb. I faced this problem when configuring two nodes for a 10g RAC cluster. The DB runs as a non-root user, and the DBA had a tough time getting the private interconnect working, thanks to nsswitch.conf.

Lesson learnt.

Written by mohitsuley

August 1, 2008 at 4:26 am

Posted in linux, sysadmin
