ITIL Problem Management: The IT Version of CSI

In ITIL (Information Technology Infrastructure Library), the process that is hardest to deliver successfully is problem management. Its output is not as straightforward as that of some of the other processes I have discussed thus far. The activities involve investigating the issue at hand, finding the culprit and handing down a judgment to rectify the problem.

It sounds a lot like the Crime Scene Investigation (CSI) shows we see on TV, which involve plenty of investigation, following the clues and accurately pinpointing the person responsible. My personal favorite is the Miami version.

The CSI we see on TV is staged; real forensic investigation takes days or even months to achieve reasonable success. Its subjects are real people, the evidence left behind during the act, background information and connections. In the IT form of CSI, problem management, investigations are quite similar. They involve questioning the techies involved, studying the log files and the history of the IT infrastructure in question, and carefully analyzing the architecture for possible related issues. Don’t you feel problem management could be fun? It most definitely is. I started my career doing problem management, by the way.

What is Considered a Problem?

“Problem” is a common word for people across professions and societies. In IT, we need to be careful when we use this term as it carries plenty of weight. A “problem” in ITIL refers to the as-yet-unknown cause behind an incident, or behind an issue that keeps recurring with no solution in sight.

The official ITIL definition is:

The unknown root cause of one or more existing or potential Incidents. Problems may sometimes be identified because of multiple Incidents that exhibit common symptoms. Problems can also be identified from a single significant Incident, indicative of a single error, for which the cause is unknown. Occasionally Problems will be identified well before any related Incidents occur.

Let’s take an example. You are using MS Outlook 2007, and when you try to open it, Windows throws an error message and the application will not open. This is an incident. The technician looks at your system, cannot pinpoint the culprit and decides to uninstall and reinstall the software. No luck: the issue resurfaces even after the reinstall. In this case, the root cause is unknown, and hence the Outlook 2007 issue can be termed a problem.

The technician escalates the issue to the next level of support. The more skilled technician looks at your system, reads through several pages of documentation, refers to knowledge bases and finds the solution to the problem. He promptly goes into the registry and changes a parameter that was conflicting with MS Outlook 2007, and voila, that fixes the issue. The incident is now resolved, and the problem, whose root cause has been established, becomes a known error.
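The progression in the Outlook example can be sketched in a few lines of Python. This is only an illustrative model, not an ITIL-defined data structure: the `Problem` class, the `ProblemState` enum and the field names are hypothetical, and the sketch assumes that recording a root cause is all it takes for a problem to count as a known error.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProblemState(Enum):
    """Hypothetical two-state view of a problem record."""
    OPEN = "root cause unknown"
    KNOWN_ERROR = "root cause established"


@dataclass
class Problem:
    summary: str
    root_cause: Optional[str] = None  # stays None until the investigation succeeds

    @property
    def state(self) -> ProblemState:
        # Assumption: a recorded root cause turns the problem into a known error.
        if self.root_cause:
            return ProblemState.KNOWN_ERROR
        return ProblemState.OPEN


problem = Problem("MS Outlook 2007 fails to open, even after a reinstall")
print(problem.state.name)  # OPEN

# The second-level technician establishes the cause.
problem.root_cause = "Registry parameter conflicting with MS Outlook 2007"
print(problem.state.name)  # KNOWN_ERROR
```

Deriving the state from `root_cause` instead of storing it separately keeps the two from drifting apart, which is one reasonable design choice among several.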

Temporary Workaround and Permanent Solution

Temporary workarounds and permanent solutions are the two varieties of fixes you find in the IT world. The two are not applied to the same issue at the same time; rather, one follows the other, with a workaround holding things together until a permanent solution takes over.

A temporary workaround is a fix that addresses the issue at hand but is not permanent. If MS Word on your PC is not working, downloading Open Office and using it instead is a workaround, but it is in no sense the optimal solution. However, if you repaired MS Word using the required troubleshooting techniques and brought the application back to life, that could more or less be regarded as a permanent solution.

Note that while the temporary fix was in place, you could get the job done, but not exactly in the manner you intended. It only served to get you through for the time being. Once you opt for a permanent solution, you no longer need the temporary workaround, although it can still exist in the system.

A temporary workaround is generally an output of incident management, while a permanent solution is always the aim of problem management, although I wouldn’t go on record stating that everything that comes out of problem management is permanent in nature.

Root Cause Analysis (RCA)

The Root Cause Analysis (RCA) is a document or report that results from problem management. Its objective is to establish the root cause of the problem and to find a permanent solution.

A typical RCA lists the issue details, such as the nature of the outage, the time frame and the affected regions, among other facts. It also has an entire section dedicated to finding the cause and to the action taken to resolve it. Finally, it recommends a preventive measure to ensure the issue does not happen again.
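As a rough sketch, the sections of a typical RCA described above could be captured in a small Python record. The `RcaReport` class, its field names and the `missing_sections` helper are all hypothetical and not part of any ITIL tool. The sample values come from the Outlook example earlier in this post, with the sections the example does not cover left empty on purpose.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class RcaReport:
    """Hypothetical RCA record with one field per section of the report."""
    nature_of_outage: str
    time_frame: str
    affected_regions: List[str]
    root_cause: str
    action_taken: str
    preventive_measure: str

    def missing_sections(self) -> List[str]:
        # A section counts as missing while it is an empty string or empty list.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


draft = RcaReport(
    nature_of_outage="MS Outlook 2007 throws an error and will not open",
    time_frame="",        # not yet filled in
    affected_regions=[],  # not yet filled in
    root_cause="Registry parameter conflicting with MS Outlook 2007",
    action_taken="Changed the conflicting registry parameter",
    preventive_measure="",  # still to be recommended
)

print(draft.missing_sections())
# ['time_frame', 'affected_regions', 'preventive_measure']
```

A check like this is one way a problem coordinator might spot an RCA that is not yet ready to deliver.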

Here are some RCA samples.

To find the cause of an issue, there are several problem management techniques in place. The simplest and possibly the most commonly used technique is the 5 Whys analysis. You simply ask the question ‘why’ five consecutive times. Let me illustrate it with an example.

Let’s say the keyboard connected to your PC is not working. Here are the questions and answers I would pose to get to the root cause.

  1. Why is the keyboard not working? Windows 7 does not recognize it.
  2. Why does Windows 7 not recognize the keyboard? The driver is corrupt.
  3. Why is the driver corrupt? Maggie tried to install her wireless keyboard on this PC. She installed the wireless keyboard’s drivers onto this PC.

There you go. Driver conflict is the root cause of this issue. You really don’t have to ask ‘why’ exactly five times. Ask it as many times as needed to get a logical and usable answer. Generally you will arrive at the root cause by the fifth why.
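The keyboard example can also be written down as a short Python sketch. The `ask_whys` function is a hypothetical helper, not a standard tool: it assumes the investigator has already gathered the question and answer pairs, prints them in order, stops after at most five, and returns the deepest answer reached as the candidate root cause. Judging whether that answer is logical and usable remains a human call.

```python
from typing import List, Tuple


def ask_whys(chain: List[Tuple[str, str]], max_whys: int = 5) -> str:
    """Walk a chain of (question, answer) pairs and return the last answer."""
    if not chain:
        raise ValueError("at least one why is needed")
    asked = chain[:max_whys]  # never go past the fifth why
    for number, (question, answer) in enumerate(asked, start=1):
        print(f"{number}. {question} {answer}")
    return asked[-1][1]


keyboard_chain = [
    ("Why is the keyboard not working?",
     "Windows 7 does not recognize it."),
    ("Why does Windows 7 not recognize the keyboard?",
     "The driver is corrupt."),
    ("Why is the driver corrupt?",
     "A wireless keyboard's drivers were installed onto this PC."),
]

candidate = ask_whys(keyboard_chain)
print("Candidate root cause:", candidate)
```

Here the chain ends after three whys, which matches the point above that the count of five is a guideline, not a rule.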

Problem Management Roles

There are basically two roles in the IT industry for people who work on problem management. A problem manager is in charge of the entire problem management process end to end. He takes ownership of all incoming problems and is accountable for delivering RCAs on time. A problem coordinator reports to a problem manager. A coordinator liaises between different technical teams to obtain the information needed to develop the RCA, and performs document control tasks along with tracking and monitoring open problems.

In the IT industry, the problem manager position is considered one of the top individual contributor roles. An expert who knows the ITIL processes deeply is entrusted with this job. I worked as a problem coordinator initially and then as a manager a few years back, and the experience has catapulted me into a higher position today. The job was hard, not for the work I had to do, but for the expectations I needed to manage. In my present role, I have four problem managers reporting to me, and I look to them to provide the technical expertise I need to manage my delivery. I say this to stress the importance of problem managers in an organization. If your problem manager is good, your company can most certainly heave a sigh of relief.

Conclusion

Everything I mentioned in this post is perhaps a small fraction of the problem management knowledge area. The field is so vast that no single book I have read so far covers it completely. It will most definitely give you the satisfaction of doing something completely new every day, and it keeps your senses sharp, watching for clues as you unravel the mysteries behind the IT infrastructure.
