Archive for CEIP
It’s More than Just Exceptions and Errors
One thing we often debate back and forth internally is how to strike the balance between tracking errors and exceptions and providing broad, general diagnostic capabilities. A lot of attention is paid to tracking exceptions – and a lot of tools can help you do that. Certainly, each exception that gets reported to an end user is a problem and you should address it (even if that’s just to make a better catch & report dialog). But what about application problems that either aren’t exceptions or can’t be solved with just the exception detail?
We recently released Gibraltar 2.1. During the several month preview program we used Gibraltar to gather detailed data about how the application was running and what problems users were encountering. Our Agent tracks exceptions and errors but also a lot of little details. This data got reported back to us automatically using the Hub each day after the application exited.
For each exception that got reported to us we looked at:
- Was this a distinct problem, or part of a cascade?
- Could we determine the root cause from the exception report alone?
We broadened our definition of exception to include any single error log message that could have been expressed as a single exception. This is particularly important since Gibraltar supports logging exceptions as extended data on a log message of any severity and supports recording errors that aren’t the result of an exception. We also counted all of the session summary information (such as operating system, memory, processor architecture, etc.) as part of the exception detail.
What we found is that many of the issues could be solved just from the exception – which is a great step forward from a decade ago where this type of runtime problem information wasn’t readily available. By far, having this information is dramatically better than not having it.
However, most issues – 56% in our case – required significantly more context to be categorized and resolved. In a few of those cases – about 1 in 8 – it would have been possible to recognize cascade errors as long as you knew the exceptions happened in order in the same process. That still leaves 49% of all issues requiring additional context. In most cases we needed to see several messages leading up to the problem, sometimes a range of them. In about 10% of all issues we ultimately needed data from the performance counters to identify the issue. Most commonly these were memory counters – both for the current process and for the system.
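The percentages above chain together as follows. This is only the arithmetic from the paragraph written out, reading "about 1 in 8" as a share of the 56% (the reading that yields the 49% figure); the variable names are ours.

```python
from fractions import Fraction

# Figures taken from the text above.
needed_more_than_exception = Fraction(56, 100)  # not solvable from the exception alone
cascade_share = Fraction(1, 8)                  # of those, recognizable from ordering alone

# Issues that only needed to know the exceptions happened in order in one process.
cascade_only = needed_more_than_exception * cascade_share

# Issues that still needed additional context (log messages, counters, ...).
needed_context = needed_more_than_exception - cascade_only

print(f"cascade-only: {float(cascade_only):.0%}")      # 7%
print(f"needed context: {float(needed_context):.0%}")  # 49%
```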
For example, when a computer gets low on memory it will start to experience all kinds of odd errors due to transient memory allocation problems. These can also happen if the current process is simply allocating memory too quickly. The first scenario is not your fault – nothing you can really do about the user chewing up all of the memory with something else. The second scenario is your fault – you need to be able to see the slope of private memory climbing before the problem and track that back to whatever your app was doing up to the point where errors started happening.
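To make "the slope of private memory" concrete, here is a minimal sketch that fits a least-squares line to sampled counter values and uses the slope to tell the two scenarios apart. It is not how Gibraltar does it: the function names, the threshold, and the sample values are all invented for illustration.

```python
from statistics import mean

def slope_per_second(samples):
    """Least-squares slope of (seconds, bytes) samples, in bytes per second."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = mean(t for t, _ in samples)
    m_mean = mean(m for _, m in samples)
    num = sum((t - t_mean) * (m - m_mean) for t, m in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den

def classify(process_private, system_available, threshold=1_000_000):
    """Rough triage; threshold (bytes/second) is an arbitrary illustrative value."""
    if slope_per_second(process_private) > threshold:
        return "process growth"    # our app's private memory is climbing
    if slope_per_second(system_available) < -threshold:
        return "system pressure"   # free memory is shrinking, but not because of us
    return "no clear trend"

# Invented samples: (seconds since start, bytes)
process_private = [(0, 200_000_000), (10, 260_000_000), (20, 320_000_000), (30, 380_000_000)]
system_available = [(0, 900_000_000), (10, 840_000_000), (20, 780_000_000), (30, 720_000_000)]

print(slope_per_second(process_private))            # 6000000.0
print(classify(process_private, system_available))  # process growth
```

Checking the process counter first matters: when your own process is the one allocating, system-wide available memory falls too, so the system counter alone cannot tell you whose fault it is.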
Here’s a Gibraltar Analyst screenshot from a customer who had this problem with their application:
In the end, it’s situations just like this that motivated us to design the Gibraltar Agent to be a great logging system, performance monitor, and configuration monitor out of the box. It’s great to get notified of exceptions, but it’s even better when you can fix a user’s issues instead of just being better informed that there are problems you can’t nail down.
How Gibraltar helped Gibraltar 2.1 – our CEIP
For us, one of the great things about being the developers of Gibraltar is that, well, we get to use it to support our product and customers. After all, we got into this business because we’re really passionate about software diagnostics. We’ve always had a strong commitment to dogfooding, a term Microsoft uses to talk about using your own products internally before asking others to risk their success. We went through three closed beta cycles before Gibraltar 2.0 shipped – and each used Gibraltar to support itself. Whenever there was an issue we focused on whether we could solve it just with the data we got from our own tool. Many times we couldn’t – so we added key capabilities like the detailed assembly information tracking, culture and time zone tracking, and a host of other little details.
For the past four months we’ve been working on the centerpiece of Gibraltar 2.1 – the Hub Server. We gave the first preview builds out in September and since then have been updating them based on both the feedback of early beta users and our own experience using the Hub as part of our new Customer Experience Improvement Program (CEIP). Since starting the program we’ve received detailed logs on thousands of Gibraltar sessions from around one hundred users (folks that elected to use the Gibraltar 2.1 preview builds).
How it Worked
To implement our CEIP we needed several things:
- End user consent: Before gathering anything from end-users we made sure they knew and agreed to what was going to happen. For the beta there was a simple notification that to participate you had to opt in to the CEIP; anyone who didn’t want to could go back and install the latest release version. For production we have a much nicer opt-in / opt-out system.
- Application runtime monitoring: Exceptions, logging, feature usage metrics, performance counters, and other information about how the application was performing were collected automatically by the Gibraltar Agent. This information is stored locally on the end-user’s computer as the application runs.
- Background data transfer: runtime monitoring data was periodically transferred from the end-user’s computer to a central server. This was done entirely in the background using a resumable, HTTP-based protocol. This is a feature built into the new Gibraltar Agent when working with the Gibraltar Hub.
- Central analysis: As data was submitted, summaries and detailed information were sent down to the development team for analysis. This uses the new integration between the Gibraltar Analyst and the Hub. In particular there are a few key reports built into the Analyst designed to dissect application errors and usage information.
Fortunately, this is just what we designed Gibraltar for.
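The post says only that the background transfer is resumable and HTTP-based; it doesn’t describe the protocol itself. As a rough illustration of what "resumable" generally means, the sketch below asks the receiving side how many bytes it already has and continues from that offset. The server is an in-memory stand-in, and every name here is hypothetical rather than part of Gibraltar’s API.

```python
class FakeServer:
    """In-memory stand-in for an upload endpoint (hypothetical)."""

    def __init__(self):
        self.received = {}

    def offset(self, session_id):
        # How many bytes of this session the server already holds.
        return len(self.received.get(session_id, b""))

    def append(self, session_id, offset, chunk):
        data = self.received.setdefault(session_id, bytearray())
        if offset != len(data):
            raise ValueError("offset mismatch")
        data.extend(chunk)


def upload(server, session_id, payload, chunk_size=4, fail_after=None):
    """Send payload in chunks, starting from whatever the server already has.

    fail_after simulates a dropped connection after that many chunks.
    """
    sent = 0
    pos = server.offset(session_id)
    while pos < len(payload):
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated network drop")
        chunk = payload[pos:pos + chunk_size]
        server.append(session_id, pos, chunk)
        pos += len(chunk)
        sent += 1
    return pos


server = FakeServer()
log = b"session log bytes"

try:
    upload(server, "s1", log, fail_after=2)  # connection drops after 8 bytes
except ConnectionError:
    pass

resumed_from = server.offset("s1")
upload(server, "s1", log)                    # picks up where it left off
print(resumed_from, bytes(server.received["s1"]) == log)
```

Because the client always asks for the current offset first, a retry after a failure never resends data the server already has.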
By The Numbers
We did two broader beta releases of Gibraltar 2.1. For each we tracked all of the issues that were found post-developer. This could be by our internal QA processes, reported to us by end users, or only found by analyzing the CEIP data.
The key question we wanted to answer was: how much better would our product be because of the CEIP? In other words, how many improvements could we make based solely on the automated CEIP feedback, not information from any other source? We were actively engaged in talking with our beta users as well as actively reviewing our own internal testing and information. So to justify itself, the CEIP would have to find real improvements that didn’t also come in from those sources.
Combining the results, here’s how the issues were discovered:

At first glance, a few things jump out:
- Internal testing didn’t find many issues: Our post-developer QA processes didn’t find many things we didn’t already know about. (Known issues that we elected to ship the beta with are excluded from these charts.) More on this below.
- CEIP beat customer reports 3 to 1: The CEIP was the biggest single contributor. It points out that even in this audience of select customers they didn’t report most issues they ran into.
Now, not every issue happens with the same frequency. When you weight each issue by the number of end user sessions that it affected we see a very different distribution:
In this case our internal testing fared better, indicating that while we knew of or identified few issues internally, they reflected the most common issues. The ratio of CEIP-identified issues to Reported issues also improves to almost 2 to 1, indicating that end users are reporting the ones they are running into most frequently.
Finally, if we weight each issue by its total number of occurrences, counting every repeat within the same session, we see:
So this validated that our internal testing overwhelmingly knew about the most frequent problems, but the CEIP to Reported ratio now is nearly 1:1. Users are reporting issues that they run into multiple times in the same session substantially more than ones that happen sporadically. This makes sense if you consider human nature: if you run into the same problem five times in one session you’re likely to be pretty annoyed and more likely to send off a comment than after one freak occurrence.
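The three charts are the same set of issues counted three different ways. The sketch below shows the calculation: each issue contributes 1, its number of affected sessions, or its total occurrences to the source that found it. The records are invented purely to illustrate how the ratios can shift; they are not our CEIP data.

```python
from collections import defaultdict

# Hypothetical issue records: (source, sessions_affected, total_occurrences)
issues = [
    ("Internal", 30, 400),
    ("Reported", 15, 60),
    ("CEIP", 10, 20),
    ("CEIP", 12, 25),
    ("CEIP", 8, 15),
]

def shares(issues, weight):
    """Fraction of the weighted total attributed to each discovery source."""
    totals = defaultdict(float)
    for source, sessions, occurrences in issues:
        totals[source] += weight(sessions, occurrences)
    grand_total = sum(totals.values())
    return {source: value / grand_total for source, value in totals.items()}

by_issue = shares(issues, lambda sessions, occurrences: 1)
by_session = shares(issues, lambda sessions, occurrences: sessions)
by_occurrence = shares(issues, lambda sessions, occurrences: occurrences)

for name, result in [("issue", by_issue), ("session", by_session), ("occurrence", by_occurrence)]:
    print(name, {source: f"{share:.0%}" for source, share in result.items()})
```

With these made-up numbers the CEIP to Reported ratio goes from 3:1 by issue, to 2:1 by session, to 1:1 by occurrence, which is the same kind of shift described above.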
What does this say about a CEIP?
Look back at the charts above and imagine that each slice marked “CEIP” was labeled “Don’t know we Don’t know” – each reflects something that today you don’t know about and, worse, have no way of knowing you don’t know about.
We focus on the middle chart as our true measure of reliability – issues weighted by number of sessions. This gets rid of the funny outliers where a user persisted in running an application that was in a bad state and kept experiencing the same problem repeatedly, as well as issues that tended to cause cascade failures. In this chart, over 40% of the total issues are ones we would never have known about without our CEIP. This was nearly equal to the issues we found in final QA. Imagine not doing any QA. That’d sound ridiculous, right? Well, the CEIP found the same number of problems.
We were honestly surprised by these results, because our mental image was the third graph: We knew about all of the problems. But, this view was skewed because we were unconsciously weighting against the volume of each problem. What we found through our CEIP are all of these little boundary issues, and these are the little things that separate an OK product from one that your customers can count on – one that Just Works.
Announcing Gibraltar Hub for Easy CEIP
Gibraltar Hub is our new server-based product that works with Gibraltar Agent and Gibraltar Analyst to deliver an end-to-end solution for creating a customer experience improvement program (CEIP) as well as remote debugging for customer support. We have been quietly developing and testing Hub for months and are thrilled with how well it’s working. We’ll be releasing it commercially later this Fall and are now inviting existing Gibraltar customers to participate in our beta testing program.
We’ve created a short video tour (3 min, below) to give you a sense of Gibraltar Hub as well as a podcast (8 min) of a conversation between Kendall and me talking about Gibraltar Hub and the problems it solves. You can read an abridged version of the interview below and we’ll be posting more technical details later this week.
If you like what you see and want to participate in our beta program, please shoot me an email.
More about Gibraltar Hub
What problems does Gibraltar Hub solve?
Gibraltar Hub is designed to address a couple key scenarios:
- Collecting data from many application instances, such as installations of a commercial software product, even when they sit behind firewalls.
- Customer Experience Improvement Programs (CEIP) for proactively gathering feedback on application performance in the field through continuous data collection and analysis.
How does Gibraltar Hub complement Gibraltar Agent and Gibraltar Analyst?
Gibraltar Hub sits between Gibraltar Agent and Gibraltar Analyst making it easier to get data from users to the development team. It’s a web service providing two interfaces: one for Agents to submit logs, the other allowing logs to stream down into Analysts. With Gibraltar Hub you can collect, manage and analyze thousands of logs and provide every member of your development team with a consistent, near real-time view of all that data as well as simple, powerful tools to analyze the data and gain new insights into the areas of your applications most needing improvement.
Is Hub required to use Gibraltar?
No, Hub is totally optional. The existing email and file transfer mechanisms in Agent will continue to be supported. However, we believe Hub will provide the best user experience because both Analyst and Agent have been enhanced to support secure, reliable, background data transfers with Hub. This means applications can be configured to silently stream logs in the background, and the development team sees new data automagically appear like new mail popping into an inbox.
How does Hub help development teams?
Gibraltar Analyst has always made it easy to import and export packages containing logs. But some of our customers with large user communities or development teams with multiple members found it challenging to ensure that everyone had a consistent view of all the relevant data. With Gibraltar Hub each team member can subscribe to a shared feed and have all the data available and continuously updated.
What is a Customer Experience Improvement Program (CEIP) and how does Gibraltar Hub help?
Microsoft coined the term Customer Experience Improvement Program (CEIP) and describes it like this:
CEIP collects information about how our customers use Microsoft programs and about some of the problems they encounter. Microsoft uses this information to improve the products and features customers use most often and to help solve problems. Participation in the program is voluntary, and the end results are software improvements to better meet the needs of our customers.
The three components of Gibraltar correspond directly with the three key challenges for development teams wishing to create their own CEIP:
- Agent efficiently collects data about how customers use programs and records details on problems they encounter.
- Hub provides reliable, secure transmission of log data from end-users to each member of the development team. Data is highly compressed and the transfer protocol is firewall-friendly and reliable even when network connectivity is limited and intermittent.
- Analyst indexes all that data and provides powerful visualization tools that help team members identify broad patterns spanning many logs as well as the ability to drill into each log to pinpoint the root cause of a single issue.
Does automatic transmission of logging data raise any privacy concerns?
Yes, dealing responsibly with information privacy is extremely important. At the same time, the considerations vary widely between different applications so we think that it’s important for Gibraltar to provide the flexibility to fit within a broad range of usage scenarios. For example, having a dialog for CEIP opt-in is a recommended best practice for a commercial software product, but for an in-house corporate application that is only used by employees, their employment agreement or computer login screen may already require informed consent to certain information being monitored. With this in mind, the default configuration settings for Gibraltar only transmit data on demand with explicit user consent. And we also make it easy to enable automatic background log transmission when appropriate.
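As a rough illustration of that default, the sketch below models the decision as a small policy check: nothing is sent automatically unless automatic transmission has been explicitly enabled, and an on-demand send still requires consent. The class, field, and function names are hypothetical and are not Gibraltar’s configuration settings.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    SEND = "send"
    ASK_USER = "ask user for consent"
    HOLD = "keep data local"

@dataclass
class TransmitPolicy:
    # Hypothetical setting; off by default so nothing is sent automatically.
    auto_send: bool = False

def decide(policy, user_requested, user_consented):
    """Decide what to do with collected data under the given policy."""
    if policy.auto_send:
        # e.g. an in-house application where consent is already covered elsewhere
        return Action.SEND
    if user_requested and user_consented:
        return Action.SEND
    if user_requested:
        return Action.ASK_USER
    return Action.HOLD

default_policy = TransmitPolicy()
print(decide(default_policy, user_requested=False, user_consented=False).value)
print(decide(default_policy, user_requested=True, user_consented=True).value)
print(decide(TransmitPolicy(auto_send=True), user_requested=False, user_consented=False).value)
```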
UPDATE: Kendall has written a nice follow-up post on what Gibraltar Hub is, how it works and why we created it.
Want to get started?
Upon release you can either purchase your own private Hub to run on your own server or subscribe to the Gibraltar Hub Service we host. If you’re a registered user of Gibraltar you can try out the Gibraltar Hub Service right now. Just contact us to get your preview account set up. Depending on the final user feedback from this preview we expect the general release to be in the next 4 to 8 weeks.





