From: Jorge R. Macías L. <jmacias@mexis.net>
Subject: Filesystem layout and CGP performance
Date: Wed, 19 Sep 2007 19:46:10 -0500
To: <CGatePro@mail.stalker.com>
X-Mailer: CommuniGate Pronto 1.2b
Hi everyone

I want to share a very recent experience with our CGP server, in case anyone is having similar issues.

We have a CGP server with about 27K users running Red Hat Enterprise Linux 4 on a Dell PowerEdge 2650 with 2 GB RAM, 2 internal 36 GB HDs in RAID 1, and 2 Xeon 3.6 GHz CPUs.  The server is directly Fiber attached to a Clariion CX200 with 10 Fibre Channel 15K rpm 36 GB disks.  The disks are configured as 2 RAID 5 volumes of 5 disks each, with a LUN using 100% of the space of each volume group, and both LUNs together forming a MetaLUN via striping.  The usable disk space after all that is 210 GB.

The whole MetaLUN is formatted as ext3 and mounted as /var/CommuniGate.

After more than 3 years of very smooth behavior we started seeing a severe degradation of the whole CGP service.  Problems started every day at around 11:00 and lasted more or less until 13:30.  Basically the problem was very slow webmail access; even the admin pages became slow as hell, especially when saving a configuration change.  POP3 connections also slowed down.  Active POP3, HTTPU, HTTPA and SMTPI connections increased to more than 300% of their usual values.  The most visible effects were the linear increase of the localdeliverywaiting and numdequeuerbatches values; on the worst day of all, numdequeuerbatches reached about 120K.
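
For anyone who wants to watch the same counters, here is a crude polling loop; the community string and OID names below are placeholders I made up, so check the MIB that ships with CommuniGate for the real ones:

  # Crude loop to poll the two counters via SNMP every minute.
  # Community string and OID names are placeholders -- look up the
  # real ones in the CommuniGate MIB before using this.
  while true; do
      snmpget -v1 -c public mail.example.com \
          CGATEPRO-MIB::localdeliverywaiting.0 \
          CGATEPRO-MIB::numdequeuerbatches.0
      sleep 60
  done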

When all this happened we were forced to close incoming SMTP channels in order to let CGPro finish processing what was already enqueued, which meant between 2 and 3 hours of very restricted SMTP access to the server.

We got very valuable help from CGP's Philip Slater and other members of their staff, and their feedback pointed us to our storage/filesystem setup.  We accelerated our plans to migrate from the CX200 to a CX3-10 SAN, which we'll be deploying in about 4 weeks.

One of the issues was that our filesystem ran at about 85% usage most of the time, with 90% peaks; on the day of the worst incident the filesystem usage went all the way to 99%.  The first thing we did was change the log settings to keep only the last 3 days, and so we were able to lower the usage to between 70 and 75%.
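
If you want to check the same thing on your own box, it is worth looking at inode usage as well as block usage, since an ext3 filesystem can look fine on blocks and still be starved for inodes:

  df -h /var/CommuniGate     # block usage
  df -i /var/CommuniGate     # inode usage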

Since we were still having problems with dequeuer batches and local delivery, we thought about moving some of the folders under /var/CommuniGate to the local HDs, but since these are small disks we couldn't move SystemLogs, for instance, so what we did was move the Queue directory.
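
Roughly what that move looks like, in case it helps anyone; the local-disk path is just an example, and you should verify that your CGP version is happy following a symlinked Queue:

  # With CommuniGate stopped, copy the queue to the internal disks
  # and leave a symlink behind.  /usr/local/CGPQueue is an example path.
  /etc/init.d/CommuniGate stop
  mkdir -p /usr/local/CGPQueue
  cp -a /var/CommuniGate/Queue/. /usr/local/CGPQueue/
  mv /var/CommuniGate/Queue /var/CommuniGate/Queue.old
  ln -s /usr/local/CGPQueue /var/CommuniGate/Queue
  /etc/init.d/CommuniGate start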

Surprisingly, we've now been running for 5 days without the slightest performance problem.

When the Queue directory was on the same filesystem as the rest of /var/CommuniGate, if I executed a "du -sh /var/CommuniGate/Queue" it would take about 1.5 minutes to answer; now it answers immediately.  We've been monitoring the Queue dir size for these 5 days and it has never gone beyond 700 MB, so I wonder what the problem was, and this is where I need to hear from you guys: what do you think?
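
Two quick checks might tell whether the slowness was sheer entry count rather than data volume:

  # Count directory entries, then time the same du that used to
  # take 1.5 minutes.
  find /var/CommuniGate/Queue | wc -l
  time du -sh /var/CommuniGate/Queue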

Could it be an issue with the storage device?  The performance of the FC disks?  The use of RAID 5?  Or maybe it has to do with some ext3-related issue?  Maybe the number of open files on the same filesystem?  Or the size of the filesystem index, or the journal, or whatever ext3 uses?
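
One ext3 suspect I would check (this is only a guess on my part) is whether the dir_index feature, i.e. hashed b-tree directories, is enabled; without it, lookups in a directory holding tens of thousands of queue files degrade badly.  /dev/sdX below stands for the actual device:

  # Check whether hashed directory indexes are enabled.
  tune2fs -l /dev/sdX | grep -i features
  # Enable dir_index and rebuild existing directory indexes
  # (run e2fsck with the filesystem unmounted).
  tune2fs -O dir_index /dev/sdX
  e2fsck -fD /dev/sdX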

I'd like to get some feedback on the best way to configure our server's storage and filesystem when we migrate to the CX3-10 SAN.  The server will be connected to the SAN in a high-availability arrangement: one FC HBA to the first switch, one to the second switch, both switches connected to the CX3-10.  I'm thinking I have to use at least 3 independent LUNs/filesystems, maybe even on separate physical disk groups: one for /var/CommuniGate/Queue, something small, about 5 GB; another for /var/CommuniGate/SystemLogs, about 100 GB; and the last one for /var/CommuniGate, with maybe 300 GB.  The first two would be created on a volume group formed with 5 FC 10K rpm 300 GB disks on RAID 5, the last one on a similar volume group but with 15K rpm disks.
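
As a sketch, the resulting mount layout would be something like the following; the device names are placeholders for whatever the multipath layer ends up calling the LUNs, noatime is just a suggestion, and the parent filesystem has to be mounted before the two nested ones:

  # Device names are placeholders for the multipathed LUNs.
  # Mount the parent first, then the nested mount points.
  mount -o noatime /dev/mpath/cgp-data  /var/CommuniGate
  mount -o noatime /dev/mpath/cgp-queue /var/CommuniGate/Queue
  mount -o noatime /dev/mpath/cgp-logs  /var/CommuniGate/SystemLogs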

Should I keep on using ext3?  Should I keep on using Red Hat Enterprise 4, or do I upgrade to Enterprise 5?  Do I use any specific parameters for the kernel, or for ext3 when I format the partitions, in order to get better performance?
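
This is the kind of tuning I have in mind, in case anyone can confirm or correct it; the numbers are guesses to be adapted, not a recommendation:

  # Align ext3 to the RAID stripe (stride = chunk size / block size,
  # e.g. 64 KB chunks with 4 KB blocks gives stride=16) and give the
  # filesystem a larger journal.  Values are guesses, adapt them.
  mkfs.ext3 -E stride=16 -J size=128 /dev/sdX
  mount -o noatime /dev/sdX /var/CommuniGate/Queue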

Any suggestions?

Thanks and best regards

Jorge