Guest Post: Planning for Failure

[When I moved to New York City a year ago, Chase Culpon was one of the first judges I met — and I’m very glad I did. In addition to his expertise at events, Chase has taken a huge role in strengthening the judge community in the city, by coordinating social gatherings, organizing carpools to tournaments and conferences, and even starting a support group for judges interested in becoming more fit.

I’ve always valued Chase’s insight into tricky topics, so when he approached me a few months ago with an idea for a guest post, I couldn’t have said yes faster. I’m very excited to share his expert perspective about ways we can better prepare ourselves for the challenges of running great tournaments. —Bearz]

Chase Culpon
New York, New York

Hi, I’m Chase–a judge and audio engineer in New York City. On the surface, my job is plugging in some good mics and pushing around some faders to make sure events sound great on a live broadcast. What that really means is that, well before the event begins, I design and implement good systems. Deciding what tools are right for each job, establishing best practices, and working with a team are all essential for making live events run smoothly. Great shows and concerts have great systems that keep rolling despite failures, and the same is true for Magic events. In the spirit of entwining experiences, I’d like to share how we might use some of the tools of my trade to make the best of every Magic event.

One of my favorite frameworks for refining the systems I use professionally is Failure Mode and Effect Analysis, or FMEA for short. It’s an approach designed to pick apart systems where failures mean life and death. Originally developed by the U.S. military in the late 1940s, it was also used extensively by NASA in the USA’s race to the moon. Today, FMEAs are most often seen when designing systems for construction and manufacturing, but the methodology has been incredibly useful for me in both my professional and judging careers.

What’s a FMEA?

FMEA might sound scary and complex, but the basics are very simple. The best way for us to understand it is to look at all the components.

The first step of FMEA is analyzing each potential problem through the following lenses:

System: What thing are we working with? Registration, scorekeeping, deck checks, etc. This can be listed by task, or by responsibility.

Potential Failure Mode: A specific thing that can go wrong.

Severity: A quantitative measure of how much damage this failure will typically do. For Magic, I’d suggest a scale of 0-5 based on the impact it has on the event and the player experience.

5 – Critical Failure: This ends the tournament for multiple participants, does irreparable damage to the event, or does lasting harm to an individual. Anything that requires professional medical attention or results in significant destruction or loss of property automatically falls in this category. Failures of diplomacy that cause players or judges to stop participating in competitive Magic also apply.
4 – Substantial Impact: A problem that results in significant delays to the event, harms the integrity of the tournament as a whole, has a large impact on the integrity of a match, or causes an extremely negative player experience. Having to re-pair a round after seating players, experiences that cause players to drop from the event, and some DQs would fall here.
3 – Significant Issue: A disruption or delay to an event, or an issue that could impact the integrity of a specific game or match. A bad ruling that substantially impacts a game, a short delay in posting pairings, and any errors in reporting a match fall in this category.
2 – Minor Issue: A notable problem, but one that is correctable without substantial impact on the tournament or player experience. These things cause delays of seconds, not minutes, and cause general inefficiencies.
1 – Annoyance: A minor inconvenience that has little or no impact on the event.
0 – No Impact: Does not impact the event in any measurable way.

Current Controls: How do we detect the problem, and do our current systems catch this issue before the potential effects happen?

Occurrence %: Your best guess at how likely an average event is to suffer this failure. I prefer to define this as the likelihood of a failure per event, but it could also be calculated by round, so long as your metric is consistent. In larger industries, this would be strictly defined, studied, and measured.

Risk Priority Number (RPN): A simple formula: Severity x Occurrence = RPN. This is a quantitative value that lets us prioritize what problems need to be addressed first. A common problem with a minor impact is often more important to address than an extreme situation that’s incredibly unlikely.

Suggested Actions: What would you like to do about this problem? This section should also include people to follow up with and questions that need to be answered.
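To make the lenses above concrete, here is a minimal sketch of one way to record FMEA rows and rank them by RPN. The `FailureMode` class, its field names, and the sample rows are all hypothetical, and the severity and occurrence numbers are made up purely for illustration. It assumes Occurrence is entered as a percentage from 0 to 100.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of an FMEA worksheet (hypothetical field names)."""
    system: str
    failure_mode: str
    severity: int          # 0-5 scale described above
    occurrence_pct: float  # best-guess chance per event, 0-100
    current_controls: str = ""
    suggested_actions: str = ""

    def __post_init__(self) -> None:
        if not 0 <= self.severity <= 5:
            raise ValueError("severity must be between 0 and 5")
        if not 0 <= self.occurrence_pct <= 100:
            raise ValueError("occurrence_pct must be between 0 and 100")

    @property
    def rpn(self) -> float:
        # Risk Priority Number: Severity x Occurrence
        return self.severity * self.occurrence_pct


def prioritize(rows):
    """Return the rows sorted from highest RPN to lowest."""
    return sorted(rows, key=lambda row: row.rpn, reverse=True)


# Made-up example rows; replace the numbers with your own estimates.
rows = [
    FailureMode("Registration", "Venue loses power", severity=5, occurrence_pct=1),
    FailureMode("Pairings", "Pairings posted a few minutes late", severity=3, occurrence_pct=30),
    FailureMode("Floor", "Chairs left blocking aisles", severity=1, occurrence_pct=60),
]

for row in prioritize(rows):
    print(f"{row.rpn:6.1f}  {row.system}: {row.failure_mode}")
```

With these invented numbers, the late pairings (RPN 90) and the stray chairs (RPN 60) both rank above the power outage (RPN 5), which is the point made above: a common problem with a modest impact can deserve attention before a rare catastrophe.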

FMEAs work best when tailored for specific situations–the potential problems at a GP, as well as their impacts on hundreds to thousands of players, will be very different than those at your local PPTQ. Be aware that tweaking your system may lead to a new set of problems, so run your great ideas through the process again! For small systems, it’s possible to go through this process quite quickly. This method is great for polishing up all the little things in your events.

Now what?!?

The purpose of a FMEA for judging is to guide a discussion and lead us to good decisions for how we run events. The core metric of FMEA is the RPN, the Risk Priority Number. The higher the number, the more significant an impact a problem is likely to have. These critical areas demand attention and resources for an event to run successfully. Pay attention, this stuff will ruin your day!

A low RPN is a good indicator that your current systems are taking care of the potential problem adequately. These problems can be caught easily, and don’t cause major issues. It can be correct to let something with a minor impact fail in order to solve a bigger issue. Skipping that extra deck-check when judges are going on break, or delaying getting slips out to cover a flurry of judge calls at the start of the round are both examples of how we can optimize how we fail when resources are limited.
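One possible way to picture "optimizing how we fail" is a simple greedy pass: staff the highest-RPN tasks first and let whatever doesn't fit slide. This is only a sketch under stated assumptions. The `triage` function, the task names, the RPN values, and the judge counts are all hypothetical, and a greedy pass is not guaranteed to be the best possible allocation.

```python
def triage(tasks, judges_available):
    """Staff the highest-RPN tasks first; return (covered, deferred) task names.

    Each task is a (name, rpn, judges_needed) tuple.
    """
    covered, deferred = [], []
    for name, rpn, judges_needed in sorted(tasks, key=lambda t: t[1], reverse=True):
        if judges_needed <= judges_available:
            judges_available -= judges_needed
            covered.append(name)
        else:
            deferred.append(name)
    return covered, deferred


# Made-up start-of-round snapshot with three judges free on the floor.
tasks = [
    ("Answer judge calls", 120, 3),
    ("Hand out match slips", 40, 1),
    ("Extra deck check", 15, 2),
]
covered, deferred = triage(tasks, judges_available=3)
print("Cover now:", covered)
print("Let slide:", deferred)
```

In this invented snapshot, all three judges go to the flurry of calls, while the slips and the extra deck check wait.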

Often a potential problem can be solved with very little or no resources. Some of the best habits of judges, like picking up trash and pushing in chairs, fix problems that won’t derail an event but take next to nothing to solve. Just be reasonable and skimp whenever needed. These are awesome things to do, but a bad idea if a ruling is waiting on you!

In the Real World

The exercise of going through a FMEA is a great way of thinking critically and in detail about potential problems, but it’s a much more powerful tool than that. Let’s take a look at one issue that’s been a big challenge for the program recently: product and promo distribution at large events, and how a FMEA might help our planning.

As events get bigger, and TOs offer more and more custom VIP packages, how we handle all the logistics of the player meeting gets increasingly challenging. There are literally thousands of packs, promos, and deck boxes that all need to be distributed as quickly as possible, and many different tiers of product that have to be given to exactly the right players. If we screw this up, we’re costing the TO money, and denying players what they paid for. It’s a particularly sensitive area for players, and each time something fails, it can do substantial damage to the reputation of the Judge Program, the TO, and Organized Play overall.

So what does our FMEA look like for the player meeting? Let’s say we’re using the ‘standard’ approach of judges directly handing out packages to the appropriate players.

Some quick conclusions off the bat: things go most obviously wrong when product runs out on the floor. Even when product is counted and allocated for specific areas, it’s very common for piles to get mixed up, judges to hand out in the wrong area, or just simple miscommunications to run rampant. While this type of issue normally only leads to a slight delay, how easy it is for this to go wrong in any number of ways cranks its RPN up to the max. This means if you’re planning logistics for an event, it’s worth it to slow down the distribution by holding a few judges in reserve to bring extra product to where it’s needed; you’ll save time in the long run.

Poor communication of the plan is the most common cause of failures in this list: it contributes to almost every failure point. This means it’s well worth it for the logistics team/Head Judge to communicate as clearly as possible before the event. Judges need to know where they will get the right stuff and where they should hand out the stuff, and they should also know where all of the other stuff is going so they can pivot and help out elsewhere when things go wrong.

Also, the most common “nightmare scenario” of product running out isn’t one we need to worry about–especially in big events, our TOs are pros. I’ve yet to see a large event run out of packs. Yes, it’s potentially catastrophic, but we can breathe easy knowing our current systems take care of it. (Keep in mind, though, this FMEA is all about the initial player meeting–having enough preregistered pools for sleep-in specials may be a completely different story!)

What to do?

Another great way to apply FMEAs is as a comparative tool for picking the best system to use for an event. To illustrate this, let’s look at two methods of handling end of round (EOR) for large events, which I’ll define as having flights of 700-1000 players.

The traditional approach is for one judge to have a clipboard of all of the outstanding matches that the scorekeeper provides. I’ve worked under this system during GP Charlotte and multiple SCG Opens.

Another approach is to have multiple judges each coordinating EOR coverage in smaller zones, rather than having everyone go to a single judge. This system often involves these judges creating lists of outstanding tables by hand, rather than getting it from the scorekeeper, and may feature the EOR team lead coordinating coverage between zones. This is the method my flight used on day 1 of GP Vegas 2015.

So, what does the FMEA look like for each?

Right away, we see that the clipboard method fails quickly whenever the judge with the clipboard is overwhelmed. It’s difficult to track a stream of incoming slips while simultaneously assigning judges to cover specific matches and bring back intel on potential problems. Some judges can do an amazing job, but we simply can’t overlook that this system creates a choke point which quickly leads to delays when overwhelmed. Tens or possibly hundreds of tables are too much to keep in your head all at once.

Zone coverage, on the other hand, has the opposite problem: There are more judges involved, and more moving parts, so there are more places where things can fail. But the task is much simpler. Any L2 can use the skills they’ve honed managing their local PPTQs to cover 50-100 tables. It’s easy to just walk and physically check on any question-mark matches, and when things do go wrong, the problem is contained to a small area of the tournament, so it’s a manageable problem to triage. When the zone leads feel like all of their matches are covered, their hand-written list can be compared against the outstanding matches from the scorekeeper. Since this comparison happens later in the round, it’s a much more manageable list.

Both systems can work. For a smaller flight, when the number of tables is more manageable, a clipboard offers an efficient, centralized solution. It allows a more experienced judge to centrally distribute resources throughout the event in an optimized fashion, which is handy with a green staff. Conversely, zone coverage empowers judges to “own” their areas and be proactive about getting things covered in manageable chunks. It works great when you have a team of judges that can solve problems with little instruction, but it may require an additional management layer to make sure judges are spread throughout the entire event.

One of the best aspects of the judge program is that it’s full of very smart people, who’ve come up with many possible ways to run events. Choosing among these options, however, is a major challenge in running large events. What works for one event may not work for others; intellectually, we know the most robust solution will depend on the specifics of the event in question–the number of players, the number of staff and their experience, and so on. FMEA allows us to quantify these factors, and tease out the most robust solution for any given event or staff.
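As a rough sketch of that comparison, the snippet below totals the RPNs for two candidate systems and also reports the single worst failure in each. Every failure mode name and number here is a placeholder invented for illustration, not a value from any real FMEA, and the `summarize` helper is hypothetical. Which system comes out ahead depends entirely on the estimates you enter for your own event and staff.

```python
def summarize(rows):
    """Return (total RPN, worst single RPN) for rows of (failure, severity, occurrence_pct)."""
    rpns = [severity * occurrence_pct for _, severity, occurrence_pct in rows]
    return sum(rpns), max(rpns)


# Placeholder estimates for a large flight; substitute your own.
clipboard = [
    ("Clipboard judge overwhelmed", 3, 50),
    ("Outstanding match missed", 3, 10),
]
zones = [
    ("Zone left without a lead", 3, 15),
    ("Hand-written list misses a table", 2, 20),
    ("Zones duplicate coverage", 1, 30),
]

for name, rows in (("Clipboard", clipboard), ("Zone coverage", zones)):
    total, worst = summarize(rows)
    print(f"{name}: total RPN {total}, worst single RPN {worst}")
```

Looking at both the total and the worst single RPN is a deliberate choice: a system with more moving parts can list more failure modes and still be the more robust option if none of them dominates. Re-running the comparison with estimates for a smaller flight or a greener staff may well reverse the result.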

Moving Forward

We’re seeing more and bigger events as the game grows, and it doesn’t look like things will slow down anytime soon. We often talk about how each event is another opportunity to learn; the dark side of that is that each additional event is another opportunity for things to go wrong–and they will. We’ll use the wrong tool for the job, we’ll forget exactly how some system is supposed to work, or we won’t have enough resources to implement our systems successfully.

Failure will be more and more normal in our events.

…but the show must go on. The best events are those that succeed in spite of challenges; they’re ones where the players have a fantastic time. If systems fail more often, that doesn’t mean our events have become horrible. What it really means is that we need to design and implement our systems to make every event great even when things fail.

FMEA is one of many tools that can help us prepare for failures and make sure they don’t spiral out of control. With a bit of attention, a dash of preparation, and a lot of hard work, we’ll bring Magic to new heights.

One thought on “Guest Post: Planning for Failure”

Chase says:

August 19, 2015 at 10:58 pm

FMEAs have given me tons of confidence as a judge. Studying what will likely go wrong, and knowing that you have systems in place to cover it, makes it easier to not worry about things on the floor. It frees up mental space to focus on making an event great for players, rather than worrying about potential fires to put out.

While I’ve presented situations that apply mainly to big events, I challenge you to apply the same methodology to running your local FNM. You’ll be amazed at how smoothly things run, and how easy problems are to solve when you know what’s coming!

What system has been a challenge for your events, and what would you like to do to improve things?
