1. Check that there isn't a scheduled agent that is going nuts - nope.
2. Check that there isn't an agent on another machine that is hitting the server - nope
3. Check it's not the indexer task - nope
4. AMgr - nope
5. Any bizarre stats - wait, what's with all the HTTP connections?
So, that was the first real clue. Whenever the problems started, the HTTP connections would build very fast from a small number, say 10 or 20, right through the roof to a max of 600 (the server limit). Then, just as suddenly, they would disappear and everything would come good. It turned out that was because the connections were timing out. We also couldn't figure out why the timing seemed to be random. Sometimes it would happen every 10 or 15 minutes, and sometimes it could be a day or so between events. Very frustrating.
So, the big clue was the HTTP connections. Alright then, let's look at some HTTP and firewall logs. Well, there are lots of connections (it's a very, very busy server which hosts 30+ websites), but nothing really stands out. There is no single IP address, or even a range of addresses, with a connection count an order of magnitude larger than the others.
As it turned out, it wasn't the number of connections from a particular address, it was that combined with what those connections were doing. Some developers (I won't name them or their company here, but can confirm that they had absolutely nothing to do with my company, they were just using us as a good source of data) were testing some code which downloaded some (reasonably large) PDFs and then matched them against an existing baseline to find changes, or something like that. That code was installed on 5 or 6 development machines, and was supposed to run once a day. Unfortunately, whoever set it up "accidentally" set it to run every MINUTE. That's also why the timing was so random: the developers weren't running the code the whole time. Sometimes they were on leave, or logged off, or running a different VM instance, or whatever.
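In hindsight, counting connections per address was never going to show this, but adding up the bytes served per address might have. Here is a minimal Python sketch of that idea. It is not what we actually ran at the time, and it assumes the access log is in Common Log Format (host, ident, user, timestamp, request, status, bytes); the function names `summarise_by_client` and `top_by_bytes` are made up for illustration, so adjust the pattern to whatever your server really writes.

```python
import re
from collections import defaultdict

# Assumption: Common Log Format, e.g.
# 203.0.113.5 - - [10/Oct/2012:13:55:36 +1000] "GET /a.pdf HTTP/1.1" 200 5000000
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def summarise_by_client(lines):
    """Return {ip: {"requests": n, "bytes": total}} for every parseable line."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip anything that is not in the assumed format
        size = match.group("size")
        entry = totals[match.group("ip")]
        entry["requests"] += 1
        # CLF writes "-" when no body was sent; count that as zero bytes
        entry["bytes"] += 0 if size == "-" else int(size)
    return dict(totals)


def top_by_bytes(totals, n=10):
    """Return the n clients that pulled the most data, heaviest first."""
    ranked = sorted(totals.items(), key=lambda item: item[1]["bytes"], reverse=True)
    return ranked[:n]
```

To use it, open the log file, pass the file object to `summarise_by_client`, and print `top_by_bytes(totals)`. A client with a modest request count but a huge byte total is exactly the sort of thing that hides in a plain per-address connection count.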
So, what did I learn from this? Nothing much. I suppose the big one was that if you know there is a problem and you can't find it, just keep digging. You will be right and people will think you're a genius, or you will be wrong, but either way at least you're something.
This led me to wonder: what's the most annoying, tedious, hard-to-find tech problem that you out there in reader land have had? Leave a comment and let us know.