Why does my server hang?
A hang, or freeze, occurs when a system stops responding to input. A server can hang because of either hardware or software problems. The most common causes of each kind are described below:
If a Network Interface Controller (NIC) has a faulty component or is attached to a bad cable, false interrupts might occur. Such interrupts are raised at an elevated interrupt request level (IRQL) and demand the attention of the processor. While the processor is busy servicing them, lower-priority (user-level) requests may remain unanswered. This causes the server to hang.
A server might also hang if storage requests are left unanswered. This happens when a disk drive fails, which in turn causes outstanding I/O requests to queue up. Eventually, the unanswered requests cause a cascade of user and system threads to hang, which leads to a system-wide outage.
System Resource Depletion: A memory leak in a driver or kernel-mode thread is one of the most common causes of a software hang. Exceeding the architectural limits of the paged and non-paged memory pools can also deplete resources.
Deadlock Conditions: A deadlock occurs when two or more threads contend for shared resources in such a way that none of them can proceed. More specifically, if one thread holds a resource in non-sharable mode while waiting for a second resource, and another thread holds that second resource while waiting for the first, each waits on the other indefinitely and neither makes progress.
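The circular wait behind a deadlock can be illustrated with a minimal Python sketch. It assumes a simplified model in which each blocked thread waits on exactly one other thread; the thread names and the find_deadlock helper are hypothetical and do not correspond to any operating system API.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return the threads forming a circular wait, or None if there is none.

    waits_for maps a blocked thread to the thread holding the resource it
    needs. Each thread is assumed to wait on at most one other thread.
    """
    for start in waits_for:
        path = []
        seen = set()
        current = start
        # Follow the chain of "waits on" links until it ends or repeats.
        while current in waits_for and current not in seen:
            seen.add(current)
            path.append(current)
            current = waits_for[current]
        if current in seen:
            # The chain came back to a visited thread: the cycle starts there.
            return path[path.index(current):]
    return None


# Hypothetical example: T1 holds lock A and wants B; T2 holds B and wants A.
print(find_deadlock({"T1": "T2", "T2": "T1"}))  # circular wait
# Plain contention: T1 waits on T2, but T2 is not waiting on anything.
print(find_deadlock({"T1": "T2"}))              # no cycle
```

The second call shows why contention alone is not a deadlock: T2 can finish and release the resource, after which T1 proceeds.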
Spinlock Conditions: A spinlock condition occurs when a thread trying to acquire a locked resource waits by repeatedly checking whether the resource has become available. Because the thread stays active without performing any useful work, it can cause a software hang if the wait lasts long enough. Example: a driver that holds a lock for an extended period while performing other activities, without releasing the lock.
Insufficient number of threads: This situation occurs when the number of user requests on a server suddenly increases and MaxThreadCount is not set to an appropriate value. The server might not be able to process that many requests, and this causes a server hang.
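As a rough sizing aid, the number of threads kept busy can be estimated with Little's law (concurrency = arrival rate x average service time). The sketch below is only an approximation under steady-state assumptions; the function names and the example figures are hypothetical, and the way MaxThreadCount is actually configured depends on the server product.

```python
import math


def threads_needed(requests_per_second: float, avg_service_seconds: float) -> int:
    """Estimate how many threads are busy at once (Little's law: L = lambda * W)."""
    return math.ceil(requests_per_second * avg_service_seconds)


def is_pool_saturated(max_thread_count: int,
                      requests_per_second: float,
                      avg_service_seconds: float) -> bool:
    """True when the estimated demand exceeds the configured thread limit."""
    return threads_needed(requests_per_second, avg_service_seconds) > max_thread_count


# Hypothetical figures: a pool of 100 threads, requests taking 0.25 s on average.
print(threads_needed(200, 0.25))           # normal load
print(is_pool_saturated(100, 200, 0.25))   # normal load fits in the pool
print(is_pool_saturated(100, 800, 0.25))   # a sudden spike does not
```

When the estimate exceeds the limit, requests queue faster than threads free up, which is the condition described above.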
Insufficient number of File Descriptors: If too few file descriptors are available, the server responds slowly and eventually hangs.
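Assuming the limit in question is the per-process limit on open file descriptors, the following Python sketch shows one way to compare current usage against that limit. It is Linux-specific (it reads /proc/self/fd and uses the Unix-only resource module), and the 80% warning threshold and helper names are arbitrary assumptions.

```python
import os
import resource  # Unix-only standard library module


def fd_usage():
    """Return (open_descriptors, soft_limit) for the current process."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Listing the directory briefly opens a descriptor itself, so the
    # count can be one higher than the process normally holds.
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft_limit


def near_limit(open_fds: int, soft_limit: int, threshold: float = 0.8) -> bool:
    """True when usage has reached the given fraction of the soft limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return False  # no limit configured
    return open_fds >= threshold * soft_limit


if os.path.isdir("/proc/self/fd"):
    used, limit = fd_usage()
    print(f"{used} of {limit} descriptors in use; near limit: {near_limit(used, limit)}")
```

A process that stays near its limit will start failing to open files and sockets, which users experience as slow or hung requests.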
High-priority, Compute-bound thread: A compute-bound thread is a task whose completion time is determined principally by the speed of the central processor. A high-priority, compute-bound thread that dominates the processors can also cause a software hang. Operating systems typically permit varying levels of thread priority. One or more threads may execute at a higher priority than a typical user thread and consume most of the CPU’s time. This causes an apparent software hang, because users and applications at normal priority are starved of CPU time.
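The starvation effect can be seen in a toy simulation. The sketch below assumes a strict-priority scheduler on a single CPU that always runs the highest-priority runnable thread for one time slice; real operating system schedulers are more sophisticated (for example, they may boost starved threads), and the thread names and priority values are hypothetical.

```python
def run_scheduler(threads, time_slices):
    """Simulate a strict-priority scheduler on one CPU.

    threads maps a name to (priority, slices_of_work); a higher number
    means higher priority. Returns how many slices each thread received.
    """
    remaining = {name: work for name, (_priority, work) in threads.items()}
    received = {name: 0 for name in threads}
    for _ in range(time_slices):
        runnable = [name for name in threads if remaining[name] > 0]
        if not runnable:
            break
        # Always pick the highest-priority thread that still has work.
        chosen = max(runnable, key=lambda name: threads[name][0])
        remaining[chosen] -= 1
        received[chosen] += 1
    return received


# Hypothetical workload: a long compute-bound job outranks a short user request.
print(run_scheduler({"compute": (15, 1000), "user_request": (8, 5)}, 100))
```

In this run the user request receives no CPU time at all during the 100 slices, even though it needs only five, which is what an apparent hang looks like from the user's side.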
Run-away thread: A runaway thread is a thread that continues to execute indefinitely, draining resources including monitors, CPU and memory. Runaway threads often do not show up in the logs, because a log entry is written only when the request completes. Such a thread does not return until it has already affected the entire application. Examples include infinite loops in the code and invalid input data that extends execution beyond the normal response time.
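One common defence against runaway work is to give each task an explicit budget and stop it when the budget is spent. The Python sketch below is a cooperative guard, so it only helps when the work is split into steps that return control regularly; the BudgetExceeded exception, the run_with_budget helper and the default limits are all hypothetical.

```python
import time


class BudgetExceeded(RuntimeError):
    """Raised when a task runs past its time or iteration budget."""


def run_with_budget(step, max_seconds=5.0, max_iterations=1_000_000):
    """Call step() until it returns True, giving up once the budget is spent.

    Returns the number of calls made before step() reported completion.
    """
    deadline = time.monotonic() + max_seconds
    for calls in range(1, max_iterations + 1):
        if step():
            return calls
        if time.monotonic() > deadline:
            raise BudgetExceeded(f"time budget spent after {calls} calls")
    raise BudgetExceeded(f"iteration budget spent after {max_iterations} calls")


# A well-behaved task that finishes on its third step.
pending = iter([False, False, True])
print(run_with_budget(lambda: next(pending)))

# A task that never finishes is cut off instead of running forever.
try:
    run_with_budget(lambda: False, max_iterations=1000)
except BudgetExceeded as exc:
    print("stopped runaway task:", exc)
```

Raising an error at the budget boundary also produces a log entry while the problem is happening, rather than only when the request completes.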
Configuration Errors and Backups: A system hang can occur due to configuration errors. For example, an errant quotation mark in a command line can bring an Apache server or other critical system to a halt. A backup server can sometimes consume a significant amount of CPU resources, which eventually slows the server down and causes it to hang. For instance, below is how a typical Linux server looks when it hangs.
The following scenarios might also cause a server hang:
Too many apps running: Every application running on a system requires a certain amount of hardware and software resources. If many applications run at the same time, the system might be starved of resources, because memory is divided among all of them. This might lead to a software freeze.
Driver Issues: Outdated or damaged drivers might cause a software hang. For example, if the video drivers installed on the system are out of date, the computer might hang when an attempt is made to play a video.
Overheating and Insufficient RAM: If the temperature of the system processor rises above its normal operating range, the system might hang. Insufficient RAM can also cause a system to freeze at regular intervals. A faulty motherboard, CPU or power supply can also cause a system freeze.
BIOS Settings: In certain instances, modifying BIOS settings may lead to serious issues and cause the system to freeze. Overclocking the system processor or RAM also causes instability.
Power issues: Even a powerful computer with the latest processor, sufficient RAM, a GPU and an advanced motherboard can hang because of an insufficient power supply or a sudden power surge.
External Devices: Faulty external devices connected to the system, such as a mouse, keyboard, USB camera or gaming console, might lead to a system freeze or even a system shutdown.
Garbage Data Cleanup: If garbage collection takes a significant amount of time, threads spend their time on cleanup rather than on processing client requests. In such instances a server hang might occur.
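To judge whether garbage collection is taking a significant share of time, the collector's activity can be measured directly. The sketch below uses CPython's gc.callbacks hook purely as an illustration; other runtimes (for example, a Java application server) expose this information through their own GC logs and tools. The GcTimer class is a hypothetical helper.

```python
import gc
import time


class GcTimer:
    """Accumulates the time CPython spends in garbage collection passes."""

    def __init__(self):
        self.total_seconds = 0.0
        self.collections = 0
        self._started = None

    def __call__(self, phase, info):
        # CPython calls each hook with phase "start" before a collection
        # and "stop" after it; info is a dict with details about the pass.
        if phase == "start":
            self._started = time.perf_counter()
        elif phase == "stop" and self._started is not None:
            self.total_seconds += time.perf_counter() - self._started
            self.collections += 1
            self._started = None


timer = GcTimer()
gc.callbacks.append(timer)
try:
    gc.collect()  # force one collection so there is something to measure
finally:
    gc.callbacks.remove(timer)

print(f"{timer.collections} collection(s), {timer.total_seconds:.6f} s in GC")
```

Comparing the accumulated GC time with wall-clock time over the same interval gives a rough idea of how much capacity is lost to cleanup.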
Firmware/Middleware Issues: Obsolete or incompatible firmware/middleware may also result in hang issues.
Troubleshooting methods when a Server Hangs
The first step in troubleshooting a hang is to determine whether the issue is due to a software or a hardware problem. If the hang is due to a hardware issue, contact the hardware vendor immediately. Subsequent troubleshooting to isolate faulty or under-performing hardware must be carried out by the designated vendor personnel.
Check the System log in the Event Logs for any events recorded at the time of the hang. In the case of pool depletion, which occurs when a memory pool runs out of space, Event ID 2019 or 2020 appears with the event source SRV.
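If the System log has been exported for offline review, the relevant entries can be filtered with a few lines of Python. The sketch below assumes the events are already loaded as dictionaries with time, source and event_id fields; that record layout, the 30-minute window and the sample entries are hypothetical. Event ID 2019 is logged when the nonpaged pool is empty and 2020 when the paged pool is empty.

```python
from datetime import datetime, timedelta

# Pool depletion events logged by the Srv source.
POOL_DEPLETION_EVENTS = {2019: "nonpaged pool empty", 2020: "paged pool empty"}


def pool_depletion_events(records, hang_time, window_minutes=30):
    """Return Srv pool-depletion events logged close to the time of the hang."""
    window = timedelta(minutes=window_minutes)
    return [
        r for r in records
        if r["source"].upper() == "SRV"
        and r["event_id"] in POOL_DEPLETION_EVENTS
        and abs(r["time"] - hang_time) <= window
    ]


# Hypothetical exported log entries, for illustration only.
hang = datetime(2024, 1, 10, 14, 0)
sample = [
    {"time": datetime(2024, 1, 10, 13, 50), "source": "Srv", "event_id": 2019},
    {"time": datetime(2024, 1, 10, 13, 55), "source": "Disk", "event_id": 11},
    {"time": datetime(2024, 1, 9, 9, 0), "source": "Srv", "event_id": 2020},
]
for event in pool_depletion_events(sample, hang):
    print(event["time"], event["event_id"], POOL_DEPLETION_EVENTS[event["event_id"]])
```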
Launch Performance Monitor and check the starting value of Free System PTEs (Page Table Entries) under the Memory object. If the system boots with fewer Free System PTEs than normal, it is an indication of a problem: a large share of the PTEs is already consumed at startup, leaving fewer resources available for normal server operations.
If the system hangs frequently or repeatedly, set up a Performance Monitor log and let it run for a period of time. Add counters for Memory, Process, Processor and System. The length of time depends on how long the system takes to hang. Capture a minimum of one hundred samples over the life of the log. Any low-memory scenario will be clear, especially if it is a steady leak.
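A steady leak shows up in such a log as available memory falling at a roughly constant rate. The Python sketch below fits a least-squares line through the samples and flags a consistently negative slope. It assumes the counter values have already been exported as a list of numbers; the threshold of 0.5 MB per sample and the synthetic data are hypothetical.

```python
def slope_per_sample(values):
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def looks_like_steady_leak(available_mb, min_samples=100, min_drop_per_sample=0.5):
    """True when available memory falls by at least min_drop_per_sample on average."""
    if len(available_mb) < min_samples:
        raise ValueError(f"need at least {min_samples} samples")
    return slope_per_sample(available_mb) <= -min_drop_per_sample


# Synthetic data for illustration: 120 samples losing 2 MB each.
leaking = [4000 - 2 * i for i in range(120)]
stable = [4000] * 120
print(looks_like_steady_leak(leaking))
print(looks_like_steady_leak(stable))
```

The minimum of one hundred samples mirrors the guidance above; with fewer points, short-lived fluctuations can easily be mistaken for a trend.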
Methods for diagnosing hanging requests on IIS servers
If requests are hanging, a dump immediately shows three critical things:
Which URLs are involved.
Whether all requests to an application are hanging or only specific ones.
The module or stage in which they are hanging.
Obtain a detailed request trace, which provides more information about the hanging request. Failed Request Tracing can be used for this purpose: set rules to capture traces when requests to specific URLs exceed a time limit or fail with specific errors. Once the trace is obtained, diagnosing the hanging request becomes much easier.
Dr. Watson is a tool included with older versions of the Windows operating system. When properly configured, it detects applications that crash and provides a log file and a user dump file for troubleshooting server hangs. Analysing this data normally yields an error code or condition for which troubleshooting methods exist.
ADPlus is a VBScript tool that monitors an application for an unexpected condition and captures a dump file when such a condition occurs. The tool can also be used to force a dump of a user application that has hung, so that the dump can be analysed with the Windows debugger.
DebugDiag is a tool used on applications that involve Microsoft Internet Information Services (IIS). It identifies a range of problems such as web server hangs, slow performance, crashes and memory leaks. It can also be used on simple Win32 applications that don’t involve IIS.
ProcDump is a tool that can dump a process when its CPU activity spikes to a predetermined level for a specified period of time. The memory contents can be used to get more details about the application, including the thread environment, the process environment and the locking information.
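The trigger condition described here (CPU at or above a threshold for a sustained period) can be sketched in Python as follows. This only illustrates the idea and is not how ProcDump itself is implemented; the function name, the one-sample-per-second assumption and the sample values are hypothetical.

```python
def dump_trigger_index(cpu_samples, threshold_percent, consecutive_seconds):
    """Return the index of the sample at which a dump would be triggered.

    cpu_samples holds one CPU-usage percentage per second. The trigger
    fires once usage has stayed at or above the threshold for the given
    number of consecutive samples. Returns None if it never fires.
    """
    run_length = 0
    for index, cpu in enumerate(cpu_samples):
        # A sample below the threshold resets the run.
        run_length = run_length + 1 if cpu >= threshold_percent else 0
        if run_length >= consecutive_seconds:
            return index
    return None


# Hypothetical samples: a short spike, then a sustained one.
samples = [20, 95, 97, 40, 91, 92, 96, 99]
print(dump_trigger_index(samples, threshold_percent=90, consecutive_seconds=3))
print(dump_trigger_index(samples[:4], threshold_percent=90, consecutive_seconds=3))
```

Requiring several consecutive samples avoids capturing dumps for brief, harmless spikes.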
Using Keyboard: Windows includes a feature that lets you deliberately stop the system and generate a memory dump file from the keyboard. After this feature is enabled, hold down the right CTRL key and press the SCROLL LOCK key twice. This generates a memory dump file which can be used for troubleshooting. The feature is available for both PS/2 and Universal Serial Bus (USB) keyboards.
One limitation of this feature is that the shortcut does not work if the system stops responding at a high interrupt request level (IRQL); it works only at a low IRQL.
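Enabling the keyboard feature involves setting a registry value named CrashOnCtrlScroll for the keyboard driver: under the i8042prt service for PS/2 keyboards and the kbdhid service for USB keyboards, as described in Microsoft's documentation. The Python sketch below only generates the text of a .reg file for review and does not modify the registry; the helper name is hypothetical, and the change should be verified against the documentation for the Windows version in use. A restart is required before the setting takes effect.

```python
# Registry keys for the keyboard drivers, per Microsoft's documentation.
KEYBOARD_DRIVER_KEYS = {
    "ps2": r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\i8042prt\Parameters",
    "usb": r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\kbdhid\Parameters",
}


def crash_on_ctrl_scroll_reg(keyboard_types=("ps2", "usb")):
    """Build .reg file text that enables the keyboard-initiated memory dump."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for kind in keyboard_types:
        lines.append(f"[{KEYBOARD_DRIVER_KEYS[kind]}]")
        lines.append('"CrashOnCtrlScroll"=dword:00000001')
        lines.append("")
    return "\r\n".join(lines)


print(crash_on_ctrl_scroll_reg())
```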