Another letter from webmaster.com staff

The Director of Software Development for WebMaster, Incorporated wrote back almost immediately. I am honestly impressed by the detailed, frank description he gave of the possible causes of the CNN system failure. I think many other companies could learn from this example of providing timely, accurate, and technically detailed information not only to customers but to the Internet community at large. Kudos to WebMaster. With support like this, I can understand why someone would choose them as a vendor.

The last paragraph is especially important. It echoes my statement that the problem was caused by CNN erroneously giving me voice without any authentication. ConferenceRoom has a NickServ, much like DALnet's, that would have easily allowed CNN to limit who could /nick to President_Clinton.

> I would consider excessive hardware requirements to be a sign of poor
> software design.  I can handle far more than 1500 users on a P2-233 with
> 128MB of RAM using almost any of the freely available IRC servers on a
> FreeBSD platform.

        The chat server was likely not what was eating up all that memory. More
likely it was the web server that was. You see, each person who connects to
the chat server through Java makes a pile of web requests. Those web
requests consume far more system resources than the chat does. The problem
was not the user load but the rate of new users coming in. Unfortunately,
Java security requires that the applet be served by the same machine the
chat is on.

        You are comparing apples to oranges.

> Nonetheless, I agree that, given their choice of
> software
> and operating system, CNN failed to host the system on a machine with
> sufficient resources.

   It has nothing to do with software or operating system. Anyone trying to
run a web server handling 800 concurrent connections on a P2-233 with 128Mb
is going to encounter massive problems.

> However, no piece of software should respond to load it cannot handle
> by simply crashing.  The fact that the software crashed rather than
> simply
> slowing down or gracefully limiting additional users is certainly
> evidence
> of "instability" and the crashes are what allowed me to use the nick of
> President_Clinton.  Because of that, I do not feel that my statement was
> incorrect.  However, I will modify the statement to indicate that these
> are based on my evaluation of the incident, and provide a link to your
> objection and my response.

   I don't know what actually happened in this case. ConferenceRoom has
internal code to detect a low-memory or out-of-memory situation and respond
in stages, first disallowing 'expensive' commands and then eventually
refusing all new connections. The only case I ever heard where a
ConferenceRoom server actually crashed due to a resource issue was when
someone entered a single command that essentially required the server to
send over a million messages.
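
The staged response described above can be sketched roughly as follows. This is only an illustration, not ConferenceRoom's actual code: the names (`Stage`, `stage_for`, `allows`), the action labels, and the 20% and 5% thresholds are all assumptions made up for the example.

```python
from enum import Enum

class Stage(Enum):
    NORMAL = 0              # everything is allowed
    NO_EXPENSIVE = 1        # 'expensive' commands are refused
    NO_NEW_CONNECTIONS = 2  # new connections are refused as well

def stage_for(free_fraction, soft=0.20, hard=0.05):
    """Map the fraction of memory still free to a degradation stage.

    The soft and hard thresholds are hypothetical values.
    """
    if free_fraction < hard:
        return Stage.NO_NEW_CONNECTIONS
    if free_fraction < soft:
        return Stage.NO_EXPENSIVE
    return Stage.NORMAL

def allows(stage, action):
    """Return True if the action is still permitted at this stage.

    Action labels ("cheap", "expensive", "connect") are hypothetical.
    """
    if action == "expensive":
        return stage is Stage.NORMAL
    if action == "connect":
        return stage is not Stage.NO_NEW_CONNECTIONS
    return True  # cheap commands from existing users always go through
```

The point of the sketch is that users already connected keep working at every stage; only the costly operations and, last of all, new arrivals are turned away.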

   Unfortunately, this doesn't work as well on NT as it does on UNIX. The main
problem is that with modern operating systems, it's very hard to tell if
they are under memory pressure. Use of swap space could just be due to
aggressive swapping to increase caching effectiveness. Low amounts of
available physical RAM are normal.

   One annoying limitation of NT is that it will only allow you to lock down a
certain fraction of memory for I/O. Exceeding this limit (the 'non-paged
pool') can cause instability and in extreme cases even crashes. I don't know
what kind of tuning CNN did on their server, but the default limit on the
non-paged pool may not have been enough.

   ConferenceRoom does two things to help deal with this problem. First, if
the OS does give us a warning (by returning a 'no buffer space' error), we
immediately release some non-paged memory we locked down specifically as an
emergency reserve. We then take active steps to shrink the buffers we are
using. Second, we have a special 'slab allocator' that carefully fits the
pieces of memory we hand to the OS for I/O into the minimum possible number
of pages (since NT only locks whole pages). That way, three 2Kb chunks that
need to be locked will result in only one page instead of potentially six
(if each 2Kb chunk landed over a page boundary).
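
The page arithmetic behind the slab allocator can be checked with a small sketch. It is not ConferenceRoom's allocator: the function name `pages_locked` and the chunk offsets are invented for illustration, and the page size is left as a parameter because it depends on the hardware (4 KB is assumed here as the usual x86 value, 8 KB as an alternative).

```python
def pages_locked(chunks, page_size):
    """Count the distinct pages touched by a list of (offset, size) chunks.

    Locking works on whole pages, so a chunk that straddles a page
    boundary pins both of the pages it touches.
    """
    pages = set()
    for offset, size in chunks:
        first = offset // page_size
        last = (offset + size - 1) // page_size
        pages.update(range(first, last + 1))
    return len(pages)

KB = 1024

# Three 2 KB chunks placed so that each one straddles a page boundary
# (hypothetical offsets, chosen for 4 KB pages).
scattered = [(3 * KB, 2 * KB), (11 * KB, 2 * KB), (19 * KB, 2 * KB)]

# The same three chunks packed back to back from a page-aligned start.
packed = [(0, 2 * KB), (2 * KB, 2 * KB), (4 * KB, 2 * KB)]

worst_case = pages_locked(scattered, 4 * KB)   # each chunk pins two pages
packed_4k = pages_locked(packed, 4 * KB)
packed_8k = pages_locked(packed, 8 * KB)
```

Under these assumptions the scattered layout pins six pages. The packed layout pins two pages at a 4 KB page size and a single page at 8 KB, so the exact saving depends on the page size in use.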

> If this is not acceptable or you feel further modifications to my
> statement
> are necessary to prevent harm to your company, please do not hesitate to
> contact me again.

   One other point: CNN was using the 'server managed channels' feature of CR.
You'll note that when the server came back up, it instantly made #Auditorium
+m. Unfortunately, two more things happened in this case. First, CNN didn't
secure the nickname that they were using. They could have; ConferenceRoom has
the ability to lock out nicknames by any method you choose. Second, a case of
human error -- one of CNN's moderators didn't check where you were coming
from, and so opped you in error. Unfortunately, no security scheme can
prevent human error.

WebMaster Incorporated gave me permission to post this provided I include the following:

        Yes, you can publish it. I'd appreciate if you integrate (or separately
list) the following corrections:

        1) The server was actually up for 170 days, not 70. And it was a P2-266,
not P2-233.

        2) We aren't really certain whether it was swapping that caused the CPU to
overload or whether it was CPU load that caused excessive buffering which
led to swapping.

        3) You are right in the abstract that programs and machines should not
crash when they run low on resources. Unfortunately, this is almost never
completely possible to do. For example, FreeBSD (and variants such as BSDi)
crash when they run out of mbufs. The accepted solution to this is tuning.

        4) The server was not just running ConferenceRoom and a heavily-loaded
web server but it was also running our services agent. It crashed with over
2,000 users on the server.

        5) Our primary server, irc.webmaster.com, is a P3-500 with 512Mb of RAM. We
have consistent 30 day uptimes (before we put in new versions we need to
test). CPU load is typically about 20% and memory load is typically about
50%.