The Director of Software Development for WebMaster, Incorporated wrote back almost immediately. I am honestly impressed at the detailed, frank description he gave regarding possible causes of the CNN system failure. I think many other companies could learn from their example by providing timely, accurate, and technically detailed information not only to customers but to the Internet community at large. Kudos to WebMaster. With support like this, I can understand why someone would choose them as a vendor.
The last paragraph is especially important. It echoes my statement that the problem was caused by CNN erroneously giving me voice without any authentication. ConferenceRoom has a NickServ, much like DALnet's, that would have easily allowed CNN to limit who could /nick to President_Clinton.
> I would consider excessive hardware requirements to be a sign of poor
> software design. I can handle far more than 1500 users on a P2-233 with
> 128MB of RAM using almost any of the freely available IRC servers on a
> FreeBSD platform.

The chat server was likely not what was eating up all that memory. More likely it was the web server. You see, each person who connects to the chat server through Java makes a pile of web requests. Those web requests consume far more system resources than the chat does. The problem was not the user load but the rate of new users coming in. Unfortunately, Java security requires that the applet be served by the same machine the chat is on. You are comparing apples to oranges.

> Nonetheless, I agree that, given their choice of software
> and operating system, CNN failed to host the system on a machine with
> sufficient resources.

It has nothing to do with software or operating system. Anyone trying to run a web server handling 800 concurrent connections on a P2-233 with 128MB is going to encounter massive problems.

> However, no piece of software should respond to load it cannot handle
> by simply crashing. The fact that the software crashed rather than simply
> slowing down or gracefully limiting additional users is certainly evidence
> of "instability" and the crashes are what allowed me to use the nick of
> President_Clinton. Because of that, I do not feel that my statement was
> incorrect. However, I will modify the statement to indicate that these
> are based on my evaluation of the incident, and provide a link to your
> objection and my response.

I don't know what actually happened in this case. ConferenceRoom has internal code to detect a low-memory or out-of-memory situation and respond in stages, first disallowing 'expensive' commands and then eventually refusing all new connections. The only case I ever heard of where a ConferenceRoom server actually crashed due to a resource issue was when someone entered a single command that essentially required the server to send over a million messages.

Unfortunately, this doesn't work as well on NT as it does on UNIX. The main problem is that with modern operating systems, it's very hard to tell if they are under memory pressure. Use of swap space could just be due to aggressive swapping to increase caching effectiveness. Low amounts of available physical RAM are normal.

One annoying limitation of NT is that it will only allow you to lock down a certain fraction of memory for I/O. Exceeding this limit (the 'non-paged pool') can cause instability and in extreme cases even crashes. I don't know what kind of tuning CNN did on their server, but the default limit on the non-paged pool may not have been enough.

ConferenceRoom does two things to help deal with this problem. First, if the OS does give us a warning (by returning a 'no buffer space' error), we immediately release some non-paged memory we locked down specifically as an emergency reserve. We then take active steps to shrink the buffers we are using. Second, we have a special 'slab allocator' that carefully fits the pieces of memory we hand to the OS for I/O into the minimum possible number of pages (since NT only locks whole pages). That way, three 2KB chunks that need to be locked will result in only one page instead of potentially six (if each 2KB chunk landed over a page boundary).

> If this is not acceptable or you feel further modifications to my statement
> are necessary to prevent harm to your company, please do not hesitate to
> contact me again.

One other point: CNN was using the 'server managed channels' feature of CR. You'll note that when the server came back up, it instantly made #Auditorium +m. Unfortunately, two more things happened in this case. First, CNN didn't secure the nickname that they were using. They could have; ConferenceRoom has the ability to lock out nicknames by any method you choose. Second, a case of human error -- one of CNN's moderators didn't check where you were coming from, and so opped you in error. Unfortunately, no security scheme can prevent human error.
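The page-locking arithmetic in the reply's slab allocator example can be checked with a small sketch. This is not ConferenceRoom's code: the function names, the buffer offsets, and both page sizes are assumptions made for illustration. It only counts how many whole pages a set of 2 KB buffers touches when each one straddles a page boundary, compared with when they are packed back to back.

```python
def locked_pages(offsets, size, page_size):
    """Count the distinct pages touched by equal-sized buffers.

    The OS locks whole pages, so a buffer that straddles a page
    boundary pins two pages even if it is smaller than one page.
    """
    pages = set()
    for off in offsets:
        first = off // page_size
        last = (off + size - 1) // page_size
        pages.update(range(first, last + 1))
    return len(pages)


def straddling_layout(count, size, page_size):
    """Hypothetical worst case: every buffer sits across its own page boundary."""
    # Start each buffer half a buffer before a boundary, four pages apart,
    # so no two buffers share a page.
    return [k * 4 * page_size + page_size - size // 2 for k in range(count)]


def packed_layout(count, size):
    """Slab-style layout: buffers placed back to back from offset 0."""
    return [k * size for k in range(count)]


CHUNK = 2 * 1024  # the 2 KB chunks from the example

for page_size in (4096, 8192):  # assumed page sizes, not taken from the reply
    worst = locked_pages(straddling_layout(3, CHUNK, page_size), CHUNK, page_size)
    packed = locked_pages(packed_layout(3, CHUNK), CHUNK, page_size)
    print(f"page size {page_size}: scattered={worst} pages, packed={packed} pages")
```

Under these assumptions the scattered layout locks six pages at either page size. The packed layout locks one page when pages are 8 KB and two when they are 4 KB, so the exact saving depends on the page size of the machine.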
WebMaster Incorporated gave me permission to post this provided I include the following:
Yes, you can publish it. I'd appreciate it if you integrate (or separately list) the following corrections:

1) The server was actually up for 170 days, not 70. And it was a P2-266, not P2-233.

2) We aren't really certain whether it was swapping that caused the CPU to overload or whether it was CPU load that caused excessive buffering which led to swapping.

3) You are right in the abstract that programs and machines should not crash when they run low on resources. Unfortunately, this is almost never completely possible to do. For example, FreeBSD (and variants such as BSDi) crash when they run out of mbufs. The accepted solution to this is tuning.

4) The server was not just running ConferenceRoom and a heavily-loaded web server but it was also running our services agent. It crashed with over 2,000 users on the server.

5) Our primary server, irc.webmaster.com, is a P3-500 with 512MB of RAM. We have consistent 30-day uptimes (before we put in new versions we need to test). CPU load is typically about 20% and memory load is typically about 50%.