A couple of weeks ago I started hearing from some panicked customers. They told me that in one piece of software that fetched prices, all the prices were coming back as $0. At first I wasn’t too concerned. The first calls came in during a network hiccup while the web service I used to fetch those prices was offline. Clearly, if the service is offline, there isn’t much I can do about pricing. I told them to refresh their pricing and, worst case, restart the program and fetch the prices again.
I didn’t hear much about it for a couple of days. Then it blew up. Several of my internal customers started reporting the issue, usually at about the same time, between 11:00 and 11:30 am. Unfortunately for me, the log file in this part of the program was fairly sparse. It resembled something like:
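Each failure showed up as little more than a single bare entry along these lines (an illustrative reconstruction, not the actual log line; the timestamp and thread format mirror the log4net entry quoted below):

```
2010-03-04 11:07:42,113 ERROR (794) - Error fetching prices
```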
So the first step was to increase the logging. I started by logging the parameters I was sending and the values I was getting back. I reasoned that if the prices coming back were $0, there probably wasn’t a whole lot I could do, and I’d have to talk to the team that wrote the web service. I also made sure I did a better job of logging the WebException. So I put the new build out and waited.
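The beefed-up logging looked roughly like this (a sketch, not the actual code; `priceService`, `GetPrices`, and `itemIds` are made-up names standing in for the real service proxy and parameters):

```csharp
try
{
    // Log exactly what we send and what comes back, so a $0 result
    // can be traced to either our request or the service's response.
    log.InfoFormat("Requesting prices for items: {0}",
                   string.Join(",", itemIds));

    decimal[] prices = priceService.GetPrices(itemIds);

    log.InfoFormat("Prices returned: {0}",
                   string.Join(",", Array.ConvertAll(prices, p => p.ToString())));
}
catch (WebException ex)
{
    // Log the full exception, including the WebExceptionStatus,
    // instead of swallowing it.
    log.Error("Price fetch failed with status " + ex.Status, ex);
}
```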
A little while later, one of my guys called with the issue. We grabbed his log file, and this entry popped up:
2010-03-04 14:50:19,779 ERROR (794) - System.Net.WebException: The operation has timed out
After that, everything was coming back as $0. However, when I manually typed the URL he was calling into my browser, everything came back fine. Here we were, no more than two minutes after he called, and it appeared as if everything was running fine again. So a timeout occurred, but the service instantly came back? That seemed a little odd, and it seemed awfully lucky for two or three people to hit this at exactly the same time.
That sent me digging. Somehow I needed to be able to see what WebRequest was sending out. Thankfully, after a quick Google search, I found out how to enable tracing for System.Net.
So I turned on the trace logging, redeployed my app.config file, and sat waiting.
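For reference, the app.config section that enables the trace looks roughly like this (adapted from the MSDN network tracing documentation; the listener name and output file name are arbitrary choices, not what I actually used):

```xml
<configuration>
  <system.diagnostics>
    <sources>
      <source name="System.Net" tracemode="includehex" maxdatasize="1024">
        <listeners>
          <add name="TraceFile"/>
        </listeners>
      </source>
    </sources>
    <switches>
      <!-- Verbose captures requests, responses, headers, and cookies. -->
      <add name="System.Net" value="Verbose"/>
    </switches>
    <sharedListeners>
      <add name="TraceFile"
           type="System.Diagnostics.TextWriterTraceListener"
           initializeData="System.Net.trace.log"/>
    </sharedListeners>
    <trace autoflush="true"/>
  </system.diagnostics>
</configuration>
```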
By lunch the next day we had our first trace log. We started combing through the two log files (my log4net file and the trace log) and noticed that the people having this problem were only launching the application once, when they got into the office at around 7:30–8:00 in the morning, and it was timing out at around 11:00–11:30 am. We started examining the cookies we were using with the web service; perhaps they were expiring, and that was causing the timeout.
I did some investigation and found out that I actually didn’t need to log in to access the web services. Since they were internal and existed only inside our firewall, they validated the user before returning data. The solution appeared obvious: remove the code that logs in and keeps a cookie. Everywhere I looked in the log file, there was just the one cookie.
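The change amounted to dropping the login call and the shared cookie container from the request path (a sketch; `serviceUrl` and `sharedCookies` are illustrative names, and the surrounding code is simplified):

```csharp
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(serviceUrl);

// Before: every request carried the session cookie obtained from a
// login done once at startup.
//   request.CookieContainer = sharedCookies;

// After: no login and no cookies at all. The services are internal,
// behind the firewall, and validate the user themselves before
// returning data.
```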
So we ran a test: take two of the internal users, have one use the program as normal, and have the other close it down when it wasn’t in use. We did that on a Friday and didn’t hear a word from the guy who closed the program frequently. Confident we’d solved the problem, we did a new deployment that didn’t keep cookies, thinking we’d put the issue to rest.