After a long-running battle of wits with an extremely temperamental server / application, it looks like we may have finally gotten to the bottom of our memory issues.
The problem, simply stated, was that the memory, CPU and threads in use by the JRun process were growing without plateauing, until the server locked up and had to be restarted. It usually locked up once JRun was using about 1.1 GB of RAM, which happened several times a day, and nearly always in the middle of the night.
I've posted about this a few times over the last 6 months or so, and we've tried just about everything we could think of and everything anyone had suggested to diagnose and fix it - we just couldn't keep the server up for more than a few hours.
But yesterday I think we made the crucial breakthrough. Said server has now been quite happy for almost 24 hours, and memory usage is fairly stable at a much more respectable 500 MB. And the thing is, now that I've found the problem, I can't believe it wasn't one of the first things we tried...
Two days ago I whipped up a very quick-n-dirty app based on the coldfusion.runtime.SessionTracker object (more detail over at RewindLife) to list session scopes per application. I was quite surprised to see well over 1000 sessions on just one of our sites when I was fairly certain that there were nowhere near that many users "active" at the time.
The app in question is a one-codebase-serving-many-sites kind of affair, with each site having its own application scope - I'm sure you know the kind of thing - and if there were over 1000 sessions on just one of those sites, multiply that by the number of sites, and you can end up with a pretty big number.
This led me to the following chain of reasoning:
Every time an HTTP request is made to a ColdFusion application with session management enabled, the CF server looks for the CFID and CFTOKEN cookies to determine which of its in-memory session scopes to assign to that request.
If there is no relevant cookie, then a new session scope is created.
Session scopes are also per-application - the same visitor will have a different session scope for each CF application they visit.
By itself, CF only stores a very small amount of data per session (just enough to identify the user).
However, on the system in question, we have to store a significant amount of data per user as everything needs to be permission-controlled - objects and queries are also cached in session scope per-user to speed up performance.
We are also using CF to permission-control the generated RSS feeds - both for access to the feed itself (e.g. blog feeds from a private group blog) and for which items appear in the feed (e.g. search results for a term which appears in a private group's name will depend on whether the current visitor is a logged-in member of that group) - so every access to an RSS feed goes through this same mechanism.
Most RSS readers and server-to-server RSS requests will NOT be storing cookies and re-using them - so virtually every request for RSS will create a new session scope.
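To make that chain of reasoning concrete, here is a minimal sketch in Python of the lookup behaviour described above. It is a toy model, not how ColdFusion is actually implemented: the SessionStore class, its method names and the site_a / site_b application names are all made up for illustration.

```python
import itertools


class SessionStore:
    """Toy stand-in for a server's in-memory session scopes (hypothetical, not CF's real code)."""

    def __init__(self):
        self._sessions = {}              # (app_name, session_id) -> per-session data
        self._ids = itertools.count(1)   # source of fresh session ids

    def handle_request(self, app_name, cookie=None):
        """Return the session id used for this request, creating a new session if needed."""
        if cookie is None or (app_name, cookie) not in self._sessions:
            # No usable cookie for this application: allocate a brand new session scope.
            cookie = next(self._ids)
            self._sessions[(app_name, cookie)] = {"cached": {}}
        return cookie

    def count(self, app_name):
        """Number of live session scopes for one application."""
        return sum(1 for app, _ in self._sessions if app == app_name)


store = SessionStore()

# A browser sends its cookie back, so 100 requests share one session.
browser_cookie = None
for _ in range(100):
    browser_cookie = store.handle_request("site_a", browser_cookie)

# A feed reader sends no cookie, so each of its 100 requests gets a new session.
for _ in range(100):
    store.handle_request("site_a")

# Sessions are per-application: the same browser on another app gets a separate scope.
store.handle_request("site_b", browser_cookie)

print(store.count("site_a"))  # 101
print(store.count("site_b"))  # 1
```

One browser and one cookieless feed reader making the same number of requests end up with 1 session and 100 sessions respectively, which is the imbalance that matters here.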
I've mentioned this before on the Headshift blog: RSS Will Eat Itself? and often joked in the office that the easiest way to create a singularity would be to subscribe two RSS aggregators to each other - but this is the first time I've actually seen it happen.
The session timeout on that app used to be an hour, but it was upped to a day after users were not happy about getting logged out so often. As a result, every time an RSS reader - or another site - requests RSS from the server, a new session is created, complete with cached app-specific objects and data, and it hangs around for a full day.
Hence a pattern of gradually increasing resource usage is not that surprising.
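As a rough back-of-the-envelope sketch in Python: if cookieless requests arrive at a steady rate and each one creates a session that is never touched again, the number of live sessions settles at roughly the arrival rate multiplied by the timeout. The rate of one request per minute below is purely hypothetical, picked to show the scaling, and is not a measurement from our server.

```python
def steady_state_sessions(new_sessions_per_minute, timeout_minutes):
    """Approximate number of live sessions once creation and expiry balance out.

    Assumes every request creates a session that is never reused, so each
    session lives for exactly one timeout period.
    """
    return new_sessions_per_minute * timeout_minutes


RATE = 1  # hypothetical: one cookieless RSS request per minute, per site

one_hour = steady_state_sessions(RATE, 60)       # the old one-hour timeout
one_day = steady_state_sessions(RATE, 24 * 60)   # the one-day timeout

print(one_hour)             # 60
print(one_day)              # 1440
print(one_day // one_hour)  # 24
```

Whatever the real request rate is, going from a one-hour to a one-day timeout multiplies the number of these throwaway sessions by 24, and each one carries its cached objects and data with it.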
So yesterday, after manually restarting CF in the middle of the afternoon for the umpteenth time, I took the opportunity to take the session timeout on that app down to 2 hours - this means people will still be logged in after taking lunch, which I suspect will probably be enough - and now the results are in:
After just short of 24 hours, the server has not needed to be bounced, the number of sessions on that app is showing at 360, the memory usage is pretty stable at around 500 MB, and the response time has stayed snappy throughout.
Can it really have been as simple as that? I'm just glad that we've got it under control, although I'm kicking myself that we didn't think of this sooner. But then, 99% of fixing any problem is working out what it is in the first place - my old school Physics teacher (Mr Watkinson, who managed to make A-Level Physics seem both fascinating and simple, and without whom I probably wouldn't have gone on to study Physics at university, and hence wouldn't be in my present career) used to say that a problem well-stated is a problem half-solved, and I've yet to see that statement disproved.
I'm not about to unfurl the flags and announce "Mission Accomplished", but I think we can declare major combat operations more-or-less finished :)