Usability Test: Method, Process & Practical Guide for Services
Usability testing for digital and physical services: guide, practical example, common mistakes, and comparison with other testing methods.
A usability test is an empirical method in which real users perform concrete tasks with a service, product, or system while a researcher observes where they encounter difficulties, make errors, or abandon the process [1]. Instead of asking “Do you find this user-friendly?” you observe whether people can actually use the service — and exactly where it fails.
The method was shaped decisively by Jakob Nielsen, who laid the theoretical foundations in 1993 with Usability Engineering and codified the methodological repertoire in 1994 with Usability Inspection Methods [2] [3]. Nielsen’s most influential finding: five test users find 85% of all usability problems [4]. This number — supported by a mathematical analysis of 83 usability studies by Thomas Landauer and Nielsen (1993) — has revolutionized practice because it shows that usability tests need not be large-scale projects.
What most usability guides fail to mention: the method was developed for digital interfaces but is transferable to physical and hybrid services. A usability test does not only check whether a screen works — it checks whether a person can successfully complete a task in a specific context. That can be a form, a self-service terminal, a telephone advisory process, or the physical flow at a service counter.
This article gives you everything you need to apply usability testing in service design projects: the methodological background, test types, a complete step-by-step protocol, a practical example from the automotive sector, the six most common mistakes, and a systematic comparison with related testing methods.
Academic Foundations: From Nielsen to Service Usability
The Origins
Usability as a discipline originated in Human-Computer Interaction (HCI) research in the 1980s. The central insight of this field: it is not the user who is the problem when they cannot operate a system — it is the system. Jakob Nielsen (then at Sun Microsystems, later Nielsen Norman Group) and Ben Shneiderman (University of Maryland) independently formulated principles for user-friendly design that remain valid today [2] [5].
Nielsen defined five quality components of usability [1]:
- Learnability: How quickly can a first-time user accomplish basic tasks?
- Efficiency: How quickly can an experienced user perform tasks?
- Memorability: How easily can a user regain proficiency after a period of not using the system?
- Errors: How many errors do users make, how severe are they, and how easily can they recover?
- Satisfaction: How pleasant is the experience?
The “5-User” Finding
Nielsen and Landauer (1993) analyzed 83 usability studies and derived a mathematical model: the probability of finding a usability problem follows a negative exponential function. With each additional test user, the marginal return decreases. With 5 users, you have found an average of 85% of problems. With 15 users, you have found virtually all of them [4].
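Nielsen and Landauer's curve can be reproduced in a few lines. A minimal sketch in Python, purely illustrative; the discovery probability L ≈ 0.31 is the average value they report, and real projects vary around it:

```python
# Nielsen & Landauer's discovery model: share of problems found by n users.
# L is the probability that a single user encounters a given problem
# (about 0.31 on average across the projects they analyzed; see [4]).

def share_found(n_users: int, l: float = 0.31) -> float:
    """Expected share of usability problems uncovered by n test users."""
    return 1 - (1 - l) ** n_users

for n in (1, 3, 5, 10, 15):
    print(f"{n:2d} users -> {share_found(n):5.0%} of problems found")
# 5 users -> ~84% (the basis of the '85%' figure); 15 users -> ~99%.
```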
The implication: instead of testing once with 20 users, it is more effective to conduct three rounds of 5 users each, fixing the discovered problems between rounds [4]. This iterative testing paradigm has fundamentally changed practice — it makes usability tests fast, cost-effective, and compatible with agile development cycles.
Important caveat: The “5-user rule” applies to qualitative tests with a homogeneous user group. If your service serves different user groups (e.g., customers, employees, and external partners), you need 3-5 users per group [4]. For quantitative usability tests (statistical significance), the Nielsen Norman Group recommends at least 20 participants [6].
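To make the quantitative caveat concrete: with small samples, an observed success rate carries wide uncertainty. A minimal sketch in Python using the adjusted-Wald interval, one common choice for task-success rates at small n (the technique and the numbers are added here for illustration and are not taken from the cited sources):

```python
import math

def adjusted_wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% adjusted-Wald interval for a task success rate."""
    p_adj = (successes + z ** 2 / 2) / (n + z ** 2)
    se = math.sqrt(p_adj * (1 - p_adj) / (n + z ** 2))
    return max(0.0, p_adj - z * se), min(1.0, p_adj + z * se)

low, high = adjusted_wald_ci(successes=16, n=20)
print(f"16/20 successes -> 95% CI roughly {low:.0%} to {high:.0%}")  # ~58% to ~93%
```

Even with 20 participants, an observed 80% success rate is statistically compatible with anything from roughly 58% to 93%, which is why quantitative benchmarking needs far larger samples than qualitative problem discovery.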
From Software to Service
Transferring usability testing to services is methodologically feasible but requires adaptations. Services differ from software in three dimensions that affect test design:
| Dimension | Software | Service | Consequence for Testing |
|---|---|---|---|
| Tangibility | Interface is tangible and repeatable | Service unfolds situationally and is partially intangible | You must simulate the service context or test in the real environment |
| Co-production | User interacts with a system | User interacts with people AND systems | Employee behavior is part of the test object |
| Temporal extension | Task takes seconds to minutes | Service experience takes minutes to hours | The test must cover the full flow or test strategic segments |
Types of Usability Tests
Moderated vs. Unmoderated
Moderated test: A researcher sits beside the participant (physically or remotely via video conference), assigns tasks, observes, and asks follow-up questions. The researcher does not intervene in the task itself, but can clarify the task when the participant gets stuck. Advantage: deep qualitative insights. Disadvantage: time-consuming; a researcher can run only one session at a time.
Unmoderated test: The participant completes tasks independently, often via a platform (e.g., UserTesting, Maze, Lookback). Advantage: scalable, cost-effective, parallelizable. Disadvantage: no follow-up questions possible, less context, tasks must be self-explanatory.
Recommendation for service tests: Moderated tests, because service processes often generate context questions that cannot be resolved in unmoderated settings. For purely digital touchpoints (app, website), unmoderated tests work well.
Think-Aloud vs. Task-Based
Think-aloud protocol: The participant verbalizes their thoughts while performing the task: “I’m looking for the button for… I’m not sure if this is right… I’ll click on it…” This method was formalized by Ericsson and Simon (1993) and provides the richest data on thought processes and decision patterns [7]. Disadvantage: verbalization changes behavior — participants are slower and more reflective than under normal conditions.
Task-based testing: The participant receives a concrete task (“Report damage for a broken windshield”) and completes it without verbalizing their thoughts. The researcher observes and measures: success rate, error rate, task time. Advantage: more objective measurements. Disadvantage: less insight into the “why” behind problems.
Recommendation: Combine both. Start with think-aloud for the first 3-5 sessions (qualitative depth), then task-based for quantification.
Lab vs. Remote vs. Field
Lab test: Controlled environment, typically with eye-tracking, screen recording, and one-way mirror. Advantage: maximum control, best data quality. Disadvantage: artificial environment, expensive, logistically demanding.
Remote test: Participant is in their own environment, connected via video conference and screen sharing. Advantage: more realistic context, geographically flexible, cost-effective. Disadvantage: less control, technical issues possible.
Field test: The test takes place in the real service environment — at the car dealership, in the bank branch, at the self-service terminal. Advantage: maximum realism. Disadvantage: many uncontrollable variables, more difficult documentation.
Recommendation for service tests: Field tests for physical service processes, remote tests for digital touchpoints. Lab tests only when you need eye-tracking or other specialized equipment.
Step by Step: Conducting a Usability Test for a Service
Step 1: Define Test Objectives
What do you want to find out? Formulate 2-4 concrete test questions. Not: “Is the service user-friendly?” Instead: “Can customers complete the self-service check-in at the terminal without assistance?” or “At which points in the advisory process do comprehension questions arise?”
Define success criteria: Before the test, establish what “success” means — e.g., “80% of participants complete the check-in within 5 minutes without needing help.” Success criteria prevent selective interpretation after the test.
Select metrics:
- Task success rate: Percentage of participants who successfully complete the task
- Time per task: How long do participants need?
- Error rate: How many errors do they make?
- System Usability Scale (SUS): Standardized questionnaire by Brooke (1996), score 0-100 [8] (see the scoring sketch after this list)
- Net Promoter Score (NPS): Recommendation likelihood (for the overall service)
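Because SUS scores appear throughout this article (54/100 and 72/100 in the practical example), it helps to see how the ten raw answers become a 0-100 score. A minimal sketch in Python following Brooke's scoring rule; the example answers are hypothetical:

```python
# SUS scoring per Brooke (1996): 10 items, each answered on a 1-5 scale.
# Odd-numbered items are positively worded (contribution = answer - 1),
# even-numbered items negatively worded (contribution = 5 - answer);
# the summed contributions are multiplied by 2.5 to yield a 0-100 score.

def sus_score(answers: list[int]) -> float:
    """Compute the SUS score from the 10 raw item answers (1-5 each)."""
    if len(answers) != 10 or not all(1 <= a <= 5 for a in answers):
        raise ValueError("SUS needs exactly 10 answers in the range 1-5")
    contributions = [
        (a - 1) if i % 2 == 0 else (5 - a)  # index 0, 2, 4, ... = items 1, 3, 5, ...
        for i, a in enumerate(answers)
    ]
    return sum(contributions) * 2.5

# One participant's hypothetical raw answers for items 1-10:
print(sus_score([4, 2, 4, 1, 3, 2, 5, 2, 4, 3]))  # -> 75.0
```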
Step 2: Create the Test Plan
Define participants: Who should test? Select participants who represent your actual user group — not colleagues, not “friendly” customers, but people who would use the service under real conditions. The Nielsen Norman Group documents four bias types when testing with colleagues: relationship bias, loyalty bias, social desirability, and false consensus [9].
Formulate tasks: Create 3-5 realistic tasks that cover the service process. Each task has a clear starting point, a goal, and a recognizable end point.
Example for a self-service check-in at a car dealership:
- Task 1: “You have a service appointment at 8:00 AM. Check in at the terminal.”
- Task 2: “You need a loaner car during the service. Book a loaner car through the terminal.”
- Task 3: “You need a company invoice. Find the option to request one.”
Create the test script: A standardized script ensures all sessions are comparable. It includes: greeting, consent form, briefing, tasks, debrief questions, wrap-up. The script is not a screenplay — it is a guide that ensures consistency while leaving room for follow-up questions.
Step 3: Recruit Participants
Sample size: 5 participants per user group for qualitative tests [4]. If your service has two user groups (e.g., individual customers and corporate customers), you need 10 participants.
Recruitment criteria: Define inclusion and exclusion criteria. Inclusion: “Owns a vehicle and has had a workshop appointment in the last 12 months.” Exclusion: “Works in the automotive industry” (too much prior knowledge).
Incentives: Offer appropriate compensation (40-80 EUR for 60 minutes is standard in Germany). Without incentives, you get only highly motivated volunteers — they are not representative.
Step 4: Conduct the Test
Before the test:
- Prepare the test environment (set up terminal, start recording, arrange observer seating)
- Have the consent form signed
- Give the briefing: “We are testing the service, not you. There are no wrong answers. If you get stuck, that is valuable information for us.”
During the test:
- Present tasks one at a time
- Observe without intervening (the most common temptation: helping the participant when they get stuck)
- Take notes: What does the participant do? Where do they hesitate? Where do they make errors? What do they say (in think-aloud)?
- For think-aloud: if the participant falls silent, gently remind them: “What are you thinking right now?”
After the test (debrief):
- Ask open questions: “What was the most difficult moment?” “What surprised you?” “What would you do differently?”
- Have the participant fill out the SUS questionnaire
- Thank and dismiss
Step 5: Analyze
Consolidate data: Create a matrix: tasks (columns) x participants (rows). For each cell, record: success/failure, errors, time, qualitative observations.
Identify problems: Look for patterns. A problem occurring with one participant may be an isolated case. A problem occurring with 3 of 5 participants is a design flaw.
Assign severity levels: Nielsen defines four severity levels [2]:
- Cosmetic (1): Problem noticed but no impact on task completion
- Minor (2): Slight delay or irritation, task is still completed
- Major (3): Significant difficulty, task completed only with workaround or after multiple attempts
- Catastrophic (4): Task cannot be completed; user abandons
Prioritization: Severity x Frequency = Priority. A catastrophic problem occurring in 4 of 5 users is top priority. A cosmetic problem in one user can wait.
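The severity-times-frequency logic can be written down explicitly. A minimal sketch in Python; the findings, the participant count, and the exact weighting (severity multiplied by the share of affected participants) are illustrative assumptions, not a standardized formula:

```python
# Prioritize findings: severity (Nielsen's 1-4 scale) x share of affected participants.

participants = 5  # size of the test round (hypothetical)

findings = [  # hypothetical findings; 'affected' = participants who hit the problem
    {"issue": "Key option hidden in an unexpected menu", "severity": 4, "affected": 4},
    {"issue": "Secondary button overlooked",             "severity": 3, "affected": 3},
    {"issue": "Required ID not at hand",                 "severity": 3, "affected": 4},
    {"issue": "Slightly ambiguous label",                "severity": 1, "affected": 1},
]

for f in findings:
    f["priority"] = f["severity"] * (f["affected"] / participants)

for f in sorted(findings, key=lambda f: f["priority"], reverse=True):
    print(f"{f['priority']:.1f}  sev {f['severity']}  {f['affected']}/{participants}  {f['issue']}")
```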
Step 6: Iterate
Implement findings: Fix the major and catastrophic problems. Run a second test round with the same tasks and new participants. Compare results. Nielsen recommends three iterative rounds as a minimum [4].
Documentation: Create a usability report with: test objectives, methodology, participant profiles, tasks, findings (sorted by severity), recommendations, metrics (success rate, time, SUS score). The report is the foundation for design decisions — and for comparison with future test rounds.
Comparison: Usability Test vs. A/B Test vs. Expert Review vs. Service Walkthrough
| Dimension | Usability Test | A/B Test | Expert Review | Service Walkthrough |
|---|---|---|---|---|
| Method | Real users perform tasks | Two variants compared live | Expert evaluates against heuristics | Team simulates the service |
| Data type | Qualitative + quantitative | Quantitative (conversion, click rate) | Qualitative (expert opinion) | Qualitative (team discussion) |
| Participants | 5-15 real users | Hundreds to thousands | 3-5 usability experts | Internal team (5-10 people) |
| Prerequisite | Prototype or existing system | Live system with traffic | Existing system or concept | Service concept or blueprint |
| Strength | Uncovers unexpected problems | Statistical validation, scalable | Fast, cost-effective, early | Team alignment, concept gaps |
| Weakness | Time-consuming, small sample | No explanation of “why” | Subjective, no real users | No real users, no real context |
| Best phase | Prototype validation, redesign | Live optimization | Early concept evaluation | Concept development |
Decision guide: Use an expert review early to find obvious problems without user involvement. Use a usability test to uncover real user problems that experts miss. Use an A/B test to decide between variants when you have enough traffic. Use a service walkthrough to align the team on a shared service understanding before testing with real users.
Practical Example: Usability Test of a Self-Service Check-In Terminal at a Car Dealership
Starting Situation
An automotive group is introducing self-service check-in terminals at 120 German locations. The terminals are intended to relieve the service advisor: customers can check in, drop their car key in a key box, and book additional services (loaner car, pick-up/drop-off). The pilot location reports after three weeks: 42% of customers abandon the self-check-in and go directly to the service advisor. The project team commissions a usability test.
Test Design
Test objectives:
- At which points in the check-in process do customers fail?
- Which steps create uncertainty or frustration?
- Can customers book additional services without assistance?
Type: Moderated think-aloud test in the field (directly at the terminal in the dealership)
Participants: 8 customers with actual service appointments, recruited during appointment booking. Criteria: owns a vehicle of the brand, has had a service appointment in the last 12 months, has never used the terminal. Incentive: complimentary car wash.
Tasks:
- Task 1: “You have a service appointment at 8:00 AM. Check in at the terminal.”
- Task 2: “You need a loaner car during the service. Book a loaner car.”
- Task 3: “You need a company invoice. Find the option.”
- Task 4: “Drop your car key in the key box.”
Observations
| Task | Success Rate | Mean Time | Most Frequent Problems |
|---|---|---|---|
| Task 1: Check-in | 6/8 (75%) | 3:42 min | Customer number not at hand (5/8); license plate entry unclear (3/8) |
| Task 2: Loaner car | 4/8 (50%) | 4:15 min | “Additional services” button not found (4/8); pricing display confusing (3/8) |
| Task 3: Company invoice | 2/8 (25%) | 2:50 min (abandoned) | Option hidden under “Payment options” (6/8); participants expected it under “Invoice” |
| Task 4: Key box | 7/8 (88%) | 0:45 min | One participant did not know the flap opens automatically |
Qualitative Findings (Think-Aloud)
Finding 1 — “Customer number?”: 5 of 8 participants did not have their customer number with them. Typical reaction: “Is that on my invoice? I don’t have it with me.” The terminal offered license plate entry as an alternative, but the input field was designed as a free-text field without format guidance — participants typed “B-XY-1234” instead of “BXY1234.”
Finding 2 — “Where is the loaner car?”: The “Additional services” button was placed in the upper right corner of the screen — outside the primary field of vision. 4 participants searched for the loaner car under “Service options” (where it was not) or scrolled through the check-in page without noticing the button.
Finding 3 — “This is like at the airport”: Three participants spontaneously compared the terminal to airport check-in kiosks. The comparison was positive (“I know this”) — but the expectation it created was problematic: participants expected a QR code scan (like with airlines), which the terminal did not offer.
Finding 4 — The service advisor as backup: All 8 participants knew a service advisor was nearby. 3 stated in the debrief that they “would have just gone to the advisor in a real situation.” The terminal was not perceived as a standalone channel but as an optional preliminary step.
Results and Actions
SUS Score: 54/100 (below the commonly cited industry average of 68; scores under 51 are generally classified as “unacceptable” [8]).
Priority actions:
| # | Finding | Severity | Action |
|---|---|---|---|
| 1 | Company invoice not findable | Catastrophic (4) | Rename “Payment options” to “Invoice & Payment,” company invoice as standalone button |
| 2 | Loaner car button not visible | Major (3) | Move button to main navigation, add visual highlight |
| 3 | Customer number not at hand | Major (3) | License plate entry as primary identification, with format hint “e.g. BXY1234” |
| 4 | No QR code scan | Minor (2) | QR code in appointment confirmation email, scanner at terminal |
Second test round (after redesign): Success rate for Task 3 (company invoice) rose from 25% to 88%. SUS score rose to 72/100. Task 2 (loaner car) rose to 75% — still below the 85% target, so a third iteration was planned.
Note: This example is illustratively constructed to demonstrate the method in a service context. The observations are based on typical industry patterns in the automotive sector.
6 Common Mistakes in Usability Testing
1. Testing too late
What goes wrong: The service is fully developed, the IT systems are live, the employees are trained — and then “a quick usability test” is conducted. The results reveal fundamental problems, but changes are now expensive and politically difficult.
Why it hurts: The later you test, the more expensive corrections become. Boehm's analyses from the early 1980s suggest that fixing a defect after release can cost up to 100 times more than correcting it early in development [10].
Solution: Test as early as possible — even with paper prototypes, wireframes, or simulated service flows. A usability test with a paper prototype during the concept phase is more valuable than a perfect test after launch.
2. Leading the participant
What goes wrong: The researcher unconsciously helps the participant: “Did you see the button in the upper right?” or “Try it under ‘Settings’.” Every hint distorts the result.
Why it hurts: When you help the participant, you are not testing the service — you are testing whether someone can use the service when someone helps them. That is a fundamentally different question.
Solution: When the participant gets stuck, wait 30 seconds. Then ask: “What would you do next?” If they still cannot proceed, record “task failed” and move to the next task. A failed task is a valuable result — not a test failure.
3. Testing with colleagues instead of real users
What goes wrong: The team tests with internal employees, the project team, or “friendly” customers selected by the account manager. The feedback is systematically too positive.
Why it hurts: The Nielsen Norman Group documents four bias types [9]: relationship bias (colleagues spare feelings), loyalty bias (employees evaluate the company), social desirability (politeness), and false consensus (projecting own usage patterns onto everyone). Internal testers know the system, the terminology, and the logic — real users do not.
Solution: Recruit external participants who represent your actual user group. Include at least one “skeptical” participant — someone with no particular motivation to rate the service favorably.
4. Ignoring qualitative data
What goes wrong: The team focuses exclusively on metrics (success rate, time, SUS score) and ignores the qualitative observations — the facial expressions, the sighs, the comments, the hesitation moments.
Why it hurts: Metrics tell you THAT a problem exists. Qualitative data tells you WHY. Without the “why,” you cannot solve the problem — you only know it is there.
Solution: Treat qualitative and quantitative data as equally important. Every metric needs a qualitative explanation. “75% success rate on Task 2” is a diagnosis. “Four participants looked for the loaner car under ‘Service options’ because they expected all service-related options there” is the explanation that leads to the solution.
5. Testing only the happy path
What goes wrong: The test tasks describe only the standard case — everything works, the customer has all information, no exceptions occur. The test shows “all good” — and at launch, problems with exception cases, error messages, and edge cases surface.
Why it hurts: In real service operations, the happy path is the minority. Customers do not have their customer number. They mistype. They want something the process does not accommodate. If you only test the happy path, you test at best 30% of real usage situations.
Solution: Plan at least one “exception task”: “You do not have your customer number. How do you check in?” or “You accidentally booked the wrong loaner car. How do you cancel?” These tasks are often more revealing than the standard tasks.
6. No iteration — test once and done
What goes wrong: The team conducts one usability test, documents the results — and implements changes without verifying whether they actually solved the problems. Or: the results are presented but never implemented.
Why it hurts: Without a second test round, you do not know whether your fixes work. Some “fixes” create new problems — you shift the usability burden from one element to the next. Nielsen recommends: test-fix-test as the minimum cycle [4].
Solution: Plan three iterative test rounds from the start. Budget time and participants for all three rounds. The first round uncovers the major problems. The second round validates the fixes. The third round polishes the details.
Service-Specific Usability Considerations
Testing Physical Touchpoints
Not every service is digital. When testing a physical service process (check-in terminal, advisory conversation, self-service counter), additional considerations apply:
- Spatial orientation: Can customers find the touchpoint? Is the signage clear? Test the path to the terminal, not just the terminal itself.
- Physical ergonomics: Is the terminal at the right height? Can wheelchair users operate it? Is the text large enough for customers with visual impairments?
- Context variables: How does the service behave during crowds? In poor lighting? Under time pressure?
- Employee interaction: If an employee is part of the service, test the interaction too. How does the employee react when the customer gets stuck?
Multi-Channel Tests
Many services span multiple channels (app, website, phone, in-person). Test not only individual channels but also the transitions between them. The most common usability error in multi-channel services: information the customer entered in one channel is not available in the next.
Frequently Asked Questions
What is a usability test?
A usability test is an empirical method in which real users perform concrete tasks with a service, product, or system while a researcher observes where difficulties, errors, or abandonments occur [1]. The goal: finding out whether people can actually use the service and where the problems lie — not what they think about the service, but what they do with it.
How many test participants do I need?
5 participants per user group for qualitative usability tests [4]. This number is based on Nielsen and Landauer’s mathematical model (1993), which shows that 5 users uncover an average of 85% of all usability problems. For quantitative tests (statistical significance), the Nielsen Norman Group recommends at least 20 participants [6]. For services with multiple user groups: 5 per group.
What is the difference between a usability test and a user test?
The terms are often used interchangeably but mean slightly different things. A usability test checks usability — can the user complete the task? A user test (or user experience test) checks the overall experience — not just functionality but also satisfaction, emotions, and overall impression. In practice, the boundaries are fluid.
Can I conduct usability tests for physical services?
Yes. Usability tests were originally developed for digital interfaces but are transferable to physical and hybrid services. Test the entire service flow: the path to the touchpoint, the interaction, the handovers, and the completion. Physical tests require additional attention to spatial orientation, ergonomics, and context variables (crowds, lighting, time pressure).
What is the System Usability Scale (SUS)?
The System Usability Scale is a standardized questionnaire with 10 items, developed by John Brooke (1996) [8]. The user rates statements like “I found the system unnecessarily complex” on a scale of 1-5. The result is a score from 0-100. Industry average: 68. Scores below 51 are classified as “unacceptable,” above 85 as “excellent.” The SUS is the most widely used standardized usability questionnaire worldwide.
Related Methods
A typical sequence in service development: With user research, you understand user needs. With service prototyping, you build a testable draft. With a usability test, you verify whether real users can use the service. Insights feed back into a Customer Journey Map and a Service Blueprint. The overarching method selection guide is in the Service Design Methods Overview.
- Service Prototyping: The prototype is the test object — usability tests validate whether the prototype works
- User Research in Service Design: User research provides the foundation for test design — who are the users, what are their tasks?
- Customer Journey Mapping: Usability test results feed back into the journey map — where do problems arise in the customer path?
- Service Design: The overarching discipline in which usability testing is embedded as a validation method
Research Methodology
This article synthesizes findings from Nielsen’s foundational works on usability (1993, 1994), Nielsen and Landauer’s mathematical sample size model (1993), Ericsson and Simon’s think-aloud protocol (1993), Brooke’s System Usability Scale (1996), Shneiderman’s design principles (2016), and the practitioner literature from the Nielsen Norman Group. The practical example (self-service check-in at a car dealership) is illustratively constructed based on typical industry process patterns.
Limitations: The “5-user rule” is debated in the literature — Spool and Schroeder (2001) argue that the optimal sample size depends on problem density and cannot be generalized to 5. The transfer of usability testing principles from digital interfaces to physical services is methodologically feasible but empirically under-researched. The SUS was developed for software; its validity for physical services has not been formally established.
Disclosure
SI Labs advises companies on the design of services. In the Integrated Service Development Process (iSEP), we use usability tests as a validation method to test service concepts with real users. This practical experience informs the framing of the method in this article. Readers should be aware of potential perspective bias.
References
[1] Nielsen, Jakob. “Usability 101: Introduction to Usability.” Nielsen Norman Group, January 4, 2012. URL: https://www.nngroup.com/articles/usability-101-introduction-to-usability/ [Practitioner Article | Five usability components | Quality: 90/100]
[2] Nielsen, Jakob. Usability Engineering. San Francisco: Morgan Kaufmann, 1993. [Foundational work | Usability methodology | Citations: 12,000+ | Quality: 95/100]
[3] Nielsen, Jakob, and Robert L. Mack (eds.). Usability Inspection Methods. New York: John Wiley & Sons, 1994. [Foundational work | Heuristic Evaluation, Cognitive Walkthrough | Citations: 4,000+ | Quality: 90/100]
[4] Nielsen, Jakob. “Why You Only Need to Test with 5 Users.” Nielsen Norman Group, March 19, 2000 (updated). URL: https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/ [Practitioner Article | Mathematical model, 85% rule | Quality: 88/100]
[5] Shneiderman, Ben, Catherine Plaisant, Maxine Cohen, Steven Jacobs, and Niklas Elmqvist. Designing the User Interface: Strategies for Effective Human-Computer Interaction. 6th edition. Boston: Pearson, 2016. [Textbook | HCI foundations, 8 Golden Rules | Citations: 8,000+ | Quality: 90/100]
[6] Nielsen Norman Group. “Quantitative vs. Qualitative Usability Testing.” Accessed February 25, 2026. URL: https://www.nngroup.com/articles/quant-vs-qual/ [Practitioner Article | Sample sizes for quantitative tests | Quality: 85/100]
[7] Ericsson, K. Anders, and Herbert A. Simon. Protocol Analysis: Verbal Reports as Data. Rev. edition. Cambridge: MIT Press, 1993. [Foundational work | Think-aloud methodology | Citations: 15,000+ | Quality: 92/100]
[8] Brooke, John. “SUS: A ‘Quick and Dirty’ Usability Scale.” In Usability Evaluation in Industry, edited by Patrick W. Jordan et al., 189-194. London: Taylor & Francis, 1996. [Book chapter | System Usability Scale | Citations: 8,000+ | Quality: 88/100]
[9] Nielsen Norman Group. “Employees as Usability-Test Participants.” Accessed February 25, 2026. URL: https://www.nngroup.com/articles/employees-user-test/ [Practitioner Article | 4 bias types documented | Quality: 85/100]
[10] Boehm, Barry W. Software Engineering Economics. Englewood Cliffs: Prentice-Hall, 1981. [Foundational work | Cost-of-defects model (1:10:100 rule) | Citations: 5,000+ | Quality: 85/100]