Welcome to the Problem Management learning for Problem Managers and Problem Specialists. Click each tab below to view the learning content.
{magictabs}
Introduction ::
Problem Management is the Service Management process responsible for managing the lifecycle of identified problems. The primary goals of the Problem Management process are to:
Identify and resolve the underlying root cause(s) of incidents, and
Prevent incidents, problems, and errors from occurring or recurring.
This is accomplished through Problem Investigation steps that include identifying weaknesses in the infrastructure, determining the root cause of problems that impact the service, and proposing changes that eliminate or mitigate problems.
Problem Definition
A 'Problem' is the unknown cause of one or more incidents, often identified as a result of multiple similar incidents. Key terms used when managing problems include:
Key Terms
Definitions
Known Error
A known error is the result of diagnosing the root cause of a Problem and developing a workaround or permanent fix for it.
Major Problems
A Major Problem is any problem whose severity or impact is such that management decides to review the entire series of activities. The scope of a Major Problem review includes the process, the actions of staff, and the tools and environmental conditions affecting the problem.
Request for Change (RFC)
A Request for Change (RFC) is the form used to record details of a change to a component of the service. When a solution to a problem is identified, an RFC is completed to implement it and resolve the problem.
Root cause
The root cause is the most basic reason for a problem, which if eliminated would prevent recurrence of the problem.
Workaround
A workaround is typically a temporary fix that restores service to a user, but does not resolve the underlying problem.
Problem Management: Benefits{showhide title="See More ..." changetitle="Hide ..."}
Problem Management provides multiple benefits and positive outcomes for customers and Intel. Place your cursor over each benefit area below for more information.
Service Outcomes for IT Customers
Problem Management processes remove defects from the IT environment, eliminate recurring incidents, and stabilize the environment.
The knowledge base allows users to search FAQs, known solutions, and workarounds, leading to quicker resolution and fewer service desk calls.
Decision Making for Management
A standardized and integrated toolset ensures data consistency for more accurate and meaningful performance measurement.
A consistent process and consistent data support the research Problem Specialists do to prevent incidents in the environment.
Problems can be prioritized according to business impact.
Productivity for Service Desk & Service Support Staff
The Problem Management process eliminates recurring incidents and stabilizes the environment.
Problem Management provides the ability to investigate the root causes of incidents, leading to the eradication of problems in both a reactive and a proactive manner.
The knowledge base allows the service desk to search FAQs, known errors, and workarounds to quickly resolve customer issues.
The solution database enables first-level support to resolve issues quickly without escalating to second-level support.
{/showhide}
Problem Management: Structure & Roles{showhide title="See More ..." changetitle="Hide ..."}
Within the Problem Management support services, each supported Service is structured similarly to the diagram shown. However, due to the unique nature of each service, data values can vary.
A "problem" is uniquely classified for support group identification and assignment through its combined values of Service, Service Component, Support Skill, and Process Role.
Click to view a demonstration of how to Set Your Notification Preferences.
Each support skill may have one or more roles associated with it. A role is a category assigned to a user, or group of users, that defines access privileges to functionality. Problem Management has two roles:
Roles
Responsibilities
Problem Manager
Liaises with all problem resolution groups to ensure problems are resolved swiftly and goal targets are met.
Manages the lifecycle of Problems and Known Errors within the scope of their service.
Primarily responsible for preventing incidents and minimizing the impact of incidents that can't be prevented.
May assign problem investigations to problem specialists.
Problem Specialist
Develops or validates workarounds and preventive actions where a permanent solution is not justified.
Takes the initiative to regularly maintain their service-specific and general IT knowledge and skill level.
Upon identifying the underlying root cause of a problem, requests a change to implement the solution if it is justifiable and approved by the Problem Manager.
Mentors Incident Specialists to ensure incident data quality.
Success is measured by the health of the service(s) & problem management process used to support them.
Click to view a demonstration of how to Display Your Service Roles in Service-Now.
{/showhide}
||||
Problem Mgt: Processes ::
In Problem Management, Configuration Items (CIs) provide important information during problem investigations. Problem investigations may be initiated based on:
reaction to incidents currently manifesting themselves in the environment,
business rule requirements (such as the rule that all Major Incidents require a problem investigation), and
proactive research to analyze trends and changes in the environment.
A Configuration Item (CI) is any component that needs to be managed in order to deliver an IT service. CIs are the fundamental structural units of a Configuration Management Database (CMDB) and are generally subject to change control procedures. CI information can address IT Services, Software, Hardware, or Facilities problems. CIs are the common denominator in service management and are identified in most process records, which allows Problem Managers and Problem Specialists to look for trends or other information that is valuable to problem investigations. Besides identifying the CI, Problem Management expects other process records to include enough information to make the record relevant and useful.
Problem Managers and Problem Specialists provide feedback to Incident Specialists on how to better recognize the correct CI. It helps to think of the CIs in an incident as "what did I touch or fix in order to restore the service?". Each CI has a mapping that allows an agent to view the relationships the CI has to other CIs in the CMDB. Knowing the CI mapping is helpful when conducting problem investigations.
Click to view a demonstration of how to Display Configuration Item Information & Relationships.
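Because CIs are identified in most incident records, counting incidents per CI is one simple way to spot recurring failures that may justify a problem investigation. The Python below is an illustrative sketch only: it assumes incident data has already been exported into a list of records, and the Incident type, its field names, and the default threshold of 3 are hypothetical choices, not part of Service-Now or ITSMR.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Incident:
    # Hypothetical record shape; real incident exports will differ.
    number: str  # incident ticket number
    ci: str      # CI recorded on the incident ("what did I touch or fix?")


def recurring_cis(incidents, threshold=3):
    """Return (CI, incident count) pairs for CIs with at least `threshold`
    incidents, most frequent first. The default threshold is an assumption."""
    counts = Counter(inc.ci for inc in incidents)
    return [(ci, n) for ci, n in counts.most_common() if n >= threshold]


if __name__ == "__main__":
    sample = [
        Incident("INC0001", "EXCH-MBX-01"),
        Incident("INC0002", "EXCH-MBX-01"),
        Incident("INC0003", "VPN-GW-02"),
        Incident("INC0004", "EXCH-MBX-01"),
        Incident("INC0005", "VPN-GW-02"),
    ]
    print(recurring_cis(sample, threshold=2))
    # [('EXCH-MBX-01', 3), ('VPN-GW-02', 2)]
```

A CI that keeps appearing is a candidate for closer review during data evaluation; the count alone does not establish a root cause.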
Problem Management ties in closely with other service management processes such as Knowledge and Change Management. Problem Management focuses on investigating problems to identify their root causes and solutions. The image below represents the 4 high-level Problem Investigation stages, with summary steps included. Further details for each stage are provided in the following section.
Problem Investigation: Stage Details{showhide title="See More ..." changetitle="Hide ..."}
Stages
Description
Stage 1
Evaluate Data
Incident Trending
A problem investigation may be needed due to recurring incidents of a similar nature. Viewing incident ticket data and trends is the most common activity.
Proactive Trending
Proactively monitoring and identifying potential issues can reveal concerns across many services. It is important to create a problem investigation record to document this work and demonstrate proactive problem management engagement.
Stage 1
Create Investigation
Once the data is analyzed and a failure point (or potential failure point) is identified, the Problem Manager creates the investigation record.
To prevent future rework, be sure to document ALL activity, even if it may be a duplicate or might be rejected.
The service that caused the problem owns the Problem Investigation, and the impacted services act as subject matter experts.
Click to view a demonstration of how to Create a Problem Investigation Record.
Stage 2
Investigate the Problem
During the investigation stage, the goal is to clarify the nature of the problem and articulate it in the problem statement. If a Problem Investigation team of subject matter experts (SMEs) is required, you need to use your influence to secure their time and talent!
The Problem Manager or Specialist needs to identify and relate any existing knowledge articles that are appropriate to the investigation.
Sometimes there is an existing knowledge article that may need to be validated or updated.
If no knowledge article exists, the Problem Manager or Specialist needs to create one.
If you are unsure of the approach, try the "5 Whys" technique.
5 Whys
Key Term
Reason (example: school exam)
1. Why did you fail?
SYMPTOM
I missed 8 out of 10 questions.
2. Why did you miss so many questions?
EXCUSE
The test was too hard.
3. Why was the test so hard?
BLAME
I didn't study enough.
4. Why didn't you study?
CAUSE
I didn't have time.
5. Why didn't you have time?
ROOT CAUSE
Suzie and I were watching the finals of American Idol.
When should you stop asking 'Why'?
As a guideline, stop when you have identified the CI(s) that must be fixed to prevent incidents from happening.
Click to view a demonstration of how to Associate a Problem Investigation To a Vendor.
Stage 2
Change the Problem Record to Known Error State
At this point, a decision needs to be made:
Leave the problem investigation open as a known error with an appropriate workaround, OR
Pursue identification of a solution.
Click to view a demonstration of how to Update a Problem Investigation Record.
Stage 3
Identify a Solution & Request for Change
Generate a Solution
This is an iterative process where various solutions are investigated.
A solution may never be found, or may be impractical.
If a solution is identified and believed to be practical, a Request for Change (RFC) is generated.
Change Management Activity
The Disposition section of a problem record tracks the status of corrections and completed deployments at affected sites. The problem state should not move to Resolved until all vulnerable sites have been transitioned to Resolved in the problem record.
Stage 3
Request for Change: Activities & Roles
When generating an RFC, you play the role of Change Requestor and do the following:
Document the change
Ensure the implementation fulfills the objective
Work closely with the Change Manager to ensure the following additional tasks are completed:
Provide project management/coordination
Make resource assignments
Stage 3
Determine the Change Type for an RFC
The change type is determined and updated based on a combination of:
Impact: How does the issue affect the business?
Urgency: How immediate is the end user's need?
Risk: Consider the impact and the probability of the event happening. Depending on the risk, mitigation steps may need to be identified.
When the RFC is created, the Change Manager validates that the impact & urgency values are consistent with the change type requested.
Stage 3
Change Types
Change Type
Description
Standard
Definition: Standard/Repeatable/Low Risk. A documented process that has been reviewed and approved. Could be automated, with no human touch.
Example: DNS changes. Standard router module install in ACL/firewall.
Flow Characteristics: ENC service request module, No approval required.
Major
Definition: A change that requires large amounts of effort and could impact a major part of the organization.
Flow Characteristics: CAB review; attach a robust change description, test plan, and back-out plan.
Stage 3
Record Request for Change
When recording a Request for Change, you need to:
Step
Action
Description
1.
Classify RFC:
Impact, Urgency, and its change type of Minor, Significant, or Emergency.
2.
Provide Change Details:
Test Plans
Back-out Plans
Affected Configuration Items
Proposed Implementation Date
3.
Submit Request:
When your RFC is submitted, the system generates a notification to the Change Manager, and the Change Management process takes over from there.
Click to view a demonstration of how to Create a Request for Change (RFC).
Stage 4
Validate the Solution & Close the Problem Investigation
Once the change has been processed through Change Management and deployed, the Problem Investigator validates that the solution worked.
Is Solution: Successful?
Is Solution: Unsuccessful?
The Problem Manager/Specialist...
Communicates back to Change Manager that it was successful.
The Change Manager closes the Request for Change, which closes any related known errors (problem records) and archives any related knowledge articles.
If there is no solution, no plan to implement one, or the solution did not work, the Known Error and its workaround remain open and available if the incident recurs.
Ensure information in the Record is complete, accurate, and understandable.
Click to view a demonstration of how to Close a Problem Investigation Record.
{/showhide}
Root Cause{showhide title="See More ..." changetitle="Hide ..."}
Identifying the root cause of an issue is the foundation of problem management. A root cause is the most basic reason for an issue which, if eliminated, would prevent the issue from recurring. It is the identified reason for a defect or problem, and is the source or origin of an event.
Cause vs Root Cause
A problem can have multiple contributing factors, but the root cause is the underlying reason for the problem and is what must be fixed.
Example
A Microsoft Exchange mailbox stops working, causing issues for customers trying to connect to their email. There are multiple possible reasons the mailbox might stop working, such as outdated mail software on the customer's PC, disconnected cables or networks, power source issues, or a faulty server hard drive. The investigation found that the faulty hard drive was causing the problem.
In the above example, although multiple factors could affect the email service, the root cause of the problem is the faulty hard drive. Even if the outdated mail software were updated, or the power source were proven to be available, there would still be service issues because the faulty hard drive is not functioning correctly.
Root Cause Analysis
Root cause analysis (RCA) is a method for finding and correcting the most important reasons/causes for problems. It is a method to identify known errors in the environment, and helps to avoid wasting time resolving a symptom that does not correct the problem.
RCA differs from troubleshooting and problem-solving: those typically seek solutions to specific difficulties, whereas RCA is directed at the underlying issues.
Root cause analysis includes 3 primary goals:
Root Cause Analysis Phases
1 Data Collection
Focus on a fact-finding investigation, not a fault-finding mission.
Data sources may include: incident records, events and logs, maintenance and equipment records, interviews, correspondence, and meeting minutes.
2 Event Investigation
Evaluate the data collected to identify any causal factor chain that may have led to the occurrence of the issue (failure).
Causal factor chains help identify which issues might be causing other issues.
3 Resolution of Occurrence
Assess the success of the corrective actions implemented in the Event Investigation phase.
{/showhide}
Root Cause Method: Ishikawa/Fishbone{showhide title="See More ..." changetitle="Hide ..."}
The Ishikawa diagramming method is one way to organize your root cause analysis brainstorming activities. It is also called the Fishbone diagram because of the shape that is used for organizing ideas.
Ishikawa/Fishbone diagramming (example referencing above email faulty hard drive)
Process
Brainstorm possible cause Topic Areas, based on your service.
Availability Performance: Bug (Vendor?) | User (Front End) | Expectation of functionality | Improper usage or Development/test/release
{/showhide}
Output to Problem Control{showhide title="See More ..." changetitle="Hide ..."}
The following items are output deliverables to Problem Control. These items support the Problem Management processes and problem prevention.
Problem Control
Description
Root Cause Map
Shows the relationships between root causes and symptoms.
Identification of root causes
Clearly defined statement(s) of what had caused the observed symptoms.
Should be able to answer the question: If this is fixed, will this symptom be reduced or eliminated?
Prioritized
List of root causes to be attacked
Impact
1. ____________________
10
2. ____________________
7
3. ____________________
6
4. ____________________
3
{/showhide}
||||
Problem Mgt: Integration ::
Problem Management intersects closely with specific operational ITIL processes. Although Problem Management intersects with other processes as well, only its main intersects are noted here.
All ITIL processes work together and depend on each other. The main intersects to note are:
Incident Management
Knowledge Management
Change Management
Click to view a demonstration of how Problem Management Processes Intersect Other Processes.
Incident Management: Intersect{showhide title="See More ..." changetitle="Hide ..."}
At the intersect between Incident Management and Problem Management, incident processes provide direct input into the Problem Management processes. During Problem Management process activities, the Problem Manager or Problem Specialist monitors incident tickets and the nature of the issues that come in as incidents. Because much of their time is spent analyzing incident ticket data, it is highly recommended that Problem Managers and Specialists have a comprehensive understanding of the Incident Management processes.
Problem Management and Incident Management differ in their fundamental purposes:
Problem Management
The primary purpose is to find and resolve the root cause of a problem and thus prevent further incidents.
Incident Management
The primary purpose is to return the service to operational standards as soon as possible with the smallest possible business impact.
Work with Incident Managers to adopt a standard format for information in the incident summary description field. This makes finding, sorting, and reporting on incidents much easier.
Proactive Problem Management
Problem Managers and Problem Specialists can take proactive steps to enable effective and efficient process activities. Recommendations for proactive performance include:
Analyze incident trends and other infrastructure data to discover weaknesses that are causing incidents, or may potentially cause them.
Actively review incidents to help identify areas that could potentially become an issue.
Provide early detection and resolution of potential issues as key steps in proactive Problem Management.
Mentor Incident Specialists regarding incident data quality.
Monitor incident tickets daily. The benefits are:
Data quality - identifying trends & reporting,
Understand the health of your service
Be able to answer any question regarding the health of your service & problem management process.
Major Incidents: Key Concepts
As a Problem Manager or Problem Specialist, it is important for you to understand Major Incidents (MIs), because Intel IT now requires a problem investigation for every MI.
A Major Incident is defined as an incident with a high impact, or potentially high impact, which requires a response that is above and beyond that given to normal incidents. Typically, these incidents require:
Cross-company coordination
Management escalation
Mobilization of additional resources
Increased communications
Major Incident Process Flow
If it is determined that a Major Incident should be managed by ITERP, the ITERP Incident Commander will take over and
manage ITERP,
perform communication,
perform diagnostics & resolution,
conduct a post mortem, and
forward to Problem Management.
{/showhide}
Knowledge Management: Intersect{showhide title="See More ..." changetitle="Hide ..."}
Knowledge Management has a close working relationship with Problem Management. The output from Problem Management processes is a direct input to Knowledge Management processes. Problem Managers and Problem Specialists need to be aware of Knowledge Management, and why it is important to be involved in the content & quality of knowledge articles.
Knowledge Articles: Benefits
Knowledge Articles are maintained in the Knowledge Management Database to provide search capabilities for pre-existing (documented) available knowledge.
The Knowledge Management Database saves time by providing efficient, concise, value-added work notes.
Leveraging existing Knowledge Articles can provide immediate support through workarounds and other 'How To...' information.
Knowledge Articles: Search
Incident Specialists are required to search for Knowledge Articles that facilitate restoring service.
Problem Managers and Problem Specialists play a key role in Knowledge Article quality, specifically in reviewing and publishing workarounds.
Knowledge Articles: Create
Problem Managers and Problem Specialists may be required to review a Knowledge Article draft to either reject, update, request more information, or publish it.
Because Incident Specialists rely on Knowledge Articles, a Problem Manager or Problem Specialist may need to mentor them on attaching the right Knowledge Article to their incidents.
Click to view a demonstration of how to Create a Knowledge Article.
Click to view a demonstration of how to Review & Publish a Knowledge Article.
Knowledge Articles: Edit/Change (Update)
If a Knowledge Article is found to have gaps or missing information, anyone can update that article by adding feedback or editing it. The Problem Manager or Specialist acts as a Knowledge Specialist to:
Review the feedback and edited articles
Make additional changes to articles, as needed.
Everyone involved in service support is instrumental in ensuring the accuracy of available Knowledge Articles. Whenever possible, review, reuse, and improve existing knowledge.
Click to view a demonstration of how to Edit a Knowledge Article.
{/showhide}
Change Management: Intersect{showhide title="See More ..." changetitle="Hide ..."}
Change Management also has a close working relationship with Problem Management. The Change Management processes help IT organizations implement fixes to existing problems in the IT environment. When a change is introduced into the environment, there are risks that may destabilize the environment and possibly introduce new incidents and problems.
Change Management ensures that standardized methods and procedures are used for the efficient handling of all changes. A short definition of a change is:
An event, approved by management, that results in a new status for one or more Configuration Items (CIs), is cost effective, and enhances business processes (fixes) with minimal risk to the IT infrastructure.
Change Management goals include
minimal disruption of services,
reduction in back-out activities, and
economic utilization of resources involved in the change.
Problem Management & Change Management Interactions
To correct a problem, a change to the IT environment is often needed. This is where Problem Management processes intersect with the Change Management processes.
The primary interactions include the following actions:
The Problem Manager or Problem Specialist plays the change requester role and submits a request for change. After the change request is submitted, Change Management processes take over to manage the request.
After a change is implemented, the Problem Manager or Problem Specialist verifies the change was successful and contacts the Change Manager to confirm.
The Change Manager closes the change request. This triggers closure of the Known Error and archiving of any related Knowledge Articles.
For larger changes that require stabilization, exit criteria are predetermined and used to judge the 'success' or 'failure' of the change request.
The Problem Manager and Problem Specialist may provide coaching and mentoring to Incident Specialists. If you notice a data or quality problem on an incident, treat it as a "teaching moment": an opportunity to share the value of continuous improvement.
These mentoring and teaching opportunities can provide:
Benefits
Results
Improved data quality
Fewer escalations
Increased confidence for the Incident Specialist
Increased customer satisfaction with incidents effectively resolved
1:1 contact that builds a stronger team
Improved user confidence in agents' ability to resolve issues in a timely manner
Improved confidence and knowledge for Level 1 Incident Specialists
Reinforced process and business rules established by the service
{/showhide}
||||
Performance Indicators ::
Multiple services may be involved in a problem investigation. The service that owns the failing CI also owns the investigation. Impacted services may choose to have their own problem record if their CIs are damaged or degraded as a result of the primary incident.
The intent of key performance indicators (KPIs) is to reflect the objectives of Problem Management: to prevent problems and the resulting incidents from happening, to eliminate recurring incidents, and to minimize the impact of incidents that cannot be prevented. The image below displays the current IT-wide problem dashboard, where you can run KPI reports. Click My Integrated Tool Environment (MyITE) to access the problem dashboard. Within the dashboard, the section/tool you access is IT Service Management Reports (ITSMR).
Key Performance Indicators
Problem Management Reports
In the ITSMR tool, you will find
Core Problem Reports located under Reports in the left Navigation menu.
Key Reports located in the Problem section. These can be modified to meet your needs and saved as Personal Reports.
BKM Recommendation: Using the ITSMR tool, it is recommended you create a graphical representation of the incident trends for the service being reported on. This can help you to
explain how problem management addresses incident trends,
identify when the service becomes stable (when all new incidents are related to known errors),
demonstrate the effect of changes on the health of your service, and
increase your value to the service owner as the driving force of continuous improvement.
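The BKM recommendation above treats a service as stable when all new incidents are related to known errors. As a rough sketch of how that check could be computed from exported ticket data, the Python below groups incidents by ISO week and reports, for each week, the incident count and the share linked to a known error. The Incident fields (opened, known_error) and the export itself are hypothetical assumptions for illustration; ITSMR's own reports and graphs remain the intended tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Incident:
    # Hypothetical record shape; field names are illustrative only.
    number: str
    opened: date
    known_error: Optional[str] = None  # linked known error/problem number, if any


def weekly_trend(incidents):
    """Map (ISO year, ISO week) -> (incident count, share linked to a known error)."""
    buckets = defaultdict(lambda: [0, 0])  # [total, linked]
    for inc in incidents:
        iso_year, iso_week, _ = inc.opened.isocalendar()
        bucket = buckets[(iso_year, iso_week)]
        bucket[0] += 1
        if inc.known_error:
            bucket[1] += 1
    return {week: (total, linked / total)
            for week, (total, linked) in sorted(buckets.items())}


def is_stable(trend, week):
    """Stable, per the BKM definition: every new incident that week
    is related to a known error."""
    _, share_linked = trend[week]
    return share_linked == 1.0


if __name__ == "__main__":
    data = [
        Incident("INC0001", date(2024, 1, 1)),
        Incident("INC0002", date(2024, 1, 3), "PRB0100"),
        Incident("INC0003", date(2024, 1, 8), "PRB0100"),
        Incident("INC0004", date(2024, 1, 9), "PRB0101"),
    ]
    trend = weekly_trend(data)
    for week, (total, share) in trend.items():
        label = "stable" if is_stable(trend, week) else ""
        print(week, total, f"{share:.0%}", label)
```

Weeks with no incidents do not appear in the result, so a graph built from this output should treat missing weeks as zero incidents.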
||||
|| Completion ||::
When you have completed the learning in this content, click the Track Completion button below.
This allows the IT Service Management team to know that you have completed this learning and are prepared for relevant next steps.