Archive for CEIP

I spend most of my time on the development side of Gibraltar – I lead the team writing the code and supporting our customers.  I sometimes get involved in presales activities – if you click our web site chat link, it’s a good bet I’m on the other side – but the bulk of our marketing and presales work is done by my partner Jay.   A few times a week we discuss what he’s hearing from folks trying out Gibraltar and we decide if we need to make any adjustments to our development plans. We give first priority to development requests from our customers, then prospects, then the fun cool ideas we have. Every once in a while those ideas collide at the same time and we know we’re onto something.

Several weeks ago I was handling a support request from a new customer and Jay had a conversation with a prospect that hit on the same issues: Our charting features for metrics looked really good on the surface, but fell down in real life:

  • Really Long Strings: Our customer wanted to track the execution characteristics of every database query, and used dynamic SQL – so the query could be really long.  If a label was longer than you could show diagonally on the screen, the whole chart wouldn’t show.
  • Outlier Filtering: Any set of data will contain some weird points. Perhaps your code was just spinning up, or something else unusual happened; either way, you should be able to ignore them.
  • Show Me the Money: It’s great that you can summarize the information, but what details went into that summary? To act on the information, most people ended up replicating the chart in the Metric Grid tool to see the individual data elements they needed to work with.
  • Bad Topping: When you added secondary data to a chart you were topping (restricting the display to the Top N values) it re-topped the secondary data, causing downright bizarre results.

Now, we knew about the last two of them but the first two came out from real users solving problems with Gibraltar.  Frankly, our test data always generated relatively short strings partly because we knew how our tool worked and how to get good results from it.  The killer comment came from our prospective customer, and it hurt:

I mean, it all looks nice but basically it’s demoware until you address these issues

Ouch.

We hate demoware.  We will not make demoware.  This hurt even more because we really believe in the need for great charting – it’s important, it’s one of the things really unique about Gibraltar, and yet we also had a list of things we wanted to do with it.  The trouble is we have a list of things we’d like to do with just about every part of the product; we’re insanely passionate about application diagnostics.

So we wrote out a set of things we had to address and moved this to the front of the development queue.  We got sample data from a few customers we knew were using the feature and having trouble and went to work.

Previously we prepared the data and fed it to the chart control for analysis. Unfortunately, fixing the topping problem would have been really difficult with the chart doing the work – the control vendor disagreed that what we wanted was sensible, and working around it was going to be complicated. Additionally, we were jittery about having the same calculation done two different ways – one way to show the details and another to display the data. In the end, we chose to rewrite the analysis into a central set of code we could check and control. That way we could guarantee the results were consistent, and tailor them to our needs.
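To make the topping problem concrete, here is a minimal Python sketch of one way to avoid re-topping: pick the Top N groups once, from the primary data, and reuse exactly those groups for the secondary data. This is an illustration only, not Gibraltar’s code; the row layout, the column names (`duration_ms`, `calls`), and the function names are all assumptions made for the example.

```python
from collections import defaultdict

def totals_by_group(rows, value_key):
    """Sum one value column per group label."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["group"]] += row[value_key]
    return dict(totals)

def top_groups(totals, n):
    """Return the n group labels with the largest totals, largest first."""
    return sorted(totals, key=totals.get, reverse=True)[:n]

def chart_series(rows, primary, secondary, n):
    """Top once on the primary column, then reuse those groups for the secondary."""
    primary_totals = totals_by_group(rows, primary)
    secondary_totals = totals_by_group(rows, secondary)
    kept = top_groups(primary_totals, n)
    return [(g, primary_totals[g], secondary_totals[g]) for g in kept]

# Hypothetical sample rows: the slowest query is not the most frequently called.
rows = [
    {"group": "query A", "duration_ms": 900.0, "calls": 3},
    {"group": "query B", "duration_ms": 400.0, "calls": 50},
    {"group": "query C", "duration_ms": 50.0, "calls": 200},
]
series = chart_series(rows, "duration_ms", "calls", n=2)
for group, duration, calls in series:
    print(group, duration, calls)
```

Had the secondary column been topped on its own, the chart would have paired the two slowest queries with call counts from "query C" and "query B", which is the kind of bizarre result described above.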

That’s just the way the breaks go – some things are easy, some are hard, but in the end it’s about what it’s worth to our customers.  Here’s what’s fun and new:

See the Details

As the data is grouped up you can see all of the raw data that went into it.  Don’t like one outlier?  You can suppress it and immediately see the chart change to show how that affects the analysis.   Curious as to why there’s a big spike in the chart? Click it and look at the individual rows to see the little details to know what to do.

This is one place I really love what we do vs. traditional performance profiling:  You can see not just the overall time used by a method but the exact parameters used for each call and their individual time to know if it’s a problem of just one particular set of options taking a lot of time or the method is generally slow.  This is useful to eliminate false leads so you know what methods are really worth profiling with your full up performance tool.  The great part is that it’s safe for use in production – which gives you much more accurate information on what really matters.

Have hundreds of group items?  That’s OK too – we’ll automatically scroll the chart to keep things at a sensible size.  Otherwise, you can use the Top feature to just show the most significant information, whether that’s based on a top count, percentage, or threshold.
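The three Top modes can be sketched in a few lines of Python. This is a guess at reasonable semantics, not a description of how Analyst implements them: in particular, reading "percentage" as "keep the largest groups until they cover that share of the grand total" is an assumption, as are all of the function names and sample values.

```python
def ranked(totals):
    """Group totals as (label, value) pairs, largest value first."""
    return sorted(totals.items(), key=lambda pair: pair[1], reverse=True)

def top_by_count(totals, n):
    """Keep the n largest groups."""
    return ranked(totals)[:n]

def top_by_percentage(totals, percent):
    """Keep the largest groups until they cover `percent` of the grand total."""
    target = sum(totals.values()) * percent / 100
    kept, running = [], 0
    for label, value in ranked(totals):
        if running >= target:
            break
        kept.append((label, value))
        running += value
    return kept

def top_by_threshold(totals, minimum):
    """Keep every group whose total is at least `minimum`."""
    return [(label, value) for label, value in ranked(totals) if value >= minimum]

# Hypothetical totals, e.g. summed duration per query.
totals = {"query A": 500, "query B": 300, "query C": 150, "query D": 50}
print(top_by_count(totals, 3))
print(top_by_percentage(totals, 80))
print(top_by_threshold(totals, 150))
```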

Without the Noise

In the real world there’s always some noise in the data – points that will lead you astray. In particular, if you’re looking at the duration of something you really shouldn’t judge its performance on the average or the maximum. The maximum is likely a worst-case scenario that reflects first-time startup or a transient, and the average will hide operations that are often fast but also often slow. The problem is that you’re still upsetting users with the slow ones.

To resolve this, we’ve implemented a 95th percentile performance summarization that gives you a good real-world feel for performance data.  Here’s an example from our web site:

Notice that if you sorted by the average page duration instead of the 95th percentile you wouldn’t get the same pages floating to the top.  It’s clear from this that we have a few pages that are consistently slow for some users – we’ll definitely be taking a look at that!
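Here is a small Python sketch of why the two sort orders differ. It uses the nearest-rank definition of a percentile, which is an assumption (we are not documenting exactly which percentile method Analyst uses), and the page names and durations are invented for the example.

```python
def percentile(values, percent):
    """Nearest-rank percentile for a whole-number percent between 1 and 100."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceiling of percent/100 * count
    return ordered[rank - 1]

def mean(values):
    return sum(values) / len(values)

# Hypothetical page durations in milliseconds, 20 requests each.
durations_ms = {
    "steady page": [300] * 20,               # always mediocre
    "spiky page": [100] * 18 + [2000] * 2,   # usually fast, slow for 1 user in 10
}

by_average = sorted(durations_ms, key=lambda p: mean(durations_ms[p]), reverse=True)
by_p95 = sorted(durations_ms, key=lambda p: percentile(durations_ms[p], 95), reverse=True)

print("sorted by average:", by_average)
print("sorted by 95th percentile:", by_p95)
```

The averages are 300 ms and 290 ms, so sorting by average puts the steady page first. The 95th percentiles are 300 ms and 2000 ms, so the page that is badly slow for some users floats to the top instead.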

Work with Any Data

We’re often… impressed… with what people do with Gibraltar. We’ve seen proof positive that if you design a product to go to X, users will immediately take it to 3X. We’ve previously addressed some edge cases with the log viewer; now we’ve applied the same lessons to Metric Charting. Want to group by raw SQL statements that are a page long? No problem. It’ll be fast, it’ll display, and you won’t have to worry about tooltips trying to go off the edges of the screen.

Need to still be able to see the full page of SQL when you find that slow request?  Yep, you can do that too.

With more Freedom

Previously, you could only chart a single metric at a time.  This was a big coding shortcut, and it worked for us internally because we designed our metrics to work with it.  But, in the real world things aren’t quite that simple.  Our own Agent for PostSharp, which is probably the way most people start trying out metrics, records data in multiple metrics and really can’t be charted well if you can’t throw a bunch of them together.

We’ve updated the charting to let you throw just about any combination of metrics together you think could be sensibly put together – and it’ll figure out what columns are common enough and can be graphed.  In the end there are a range of reasons why you might want to record metrics in different ways in your application – and we’d rather you could do what was convenient and minimize how much you have to worry about how it shows up in Analyst.
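As a rough illustration of the idea, the Python sketch below combines rows from two metrics by keeping only the columns they share. It is a simplification under stated assumptions: "common" is taken to mean "present in every row of every selected metric", and the metric names, column names, and values are hypothetical rather than anything the Agent for PostSharp actually records.

```python
def common_columns(metrics):
    """Column names present in every row of every metric, in first-row order."""
    all_rows = [row for rows in metrics.values() for row in rows]
    first = all_rows[0]
    return [name for name in first if all(name in row for row in all_rows)]

def combine(metrics):
    """Project every row onto the shared columns and tag it with its metric name."""
    columns = common_columns(metrics)
    combined = []
    for metric_name, rows in metrics.items():
        for row in rows:
            combined.append({"metric": metric_name, **{c: row[c] for c in columns}})
    return columns, combined

# Hypothetical metrics recorded in different shapes by the same application.
metrics = {
    "db.query": [{"caller": "Orders.Load", "duration_ms": 12.0, "sql": "SELECT 1"}],
    "web.request": [{"caller": "Orders.Page", "duration_ms": 85.0, "url": "/orders"}],
}
columns, combined = combine(metrics)
print(columns)
for row in combined:
    print(row)
```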

We’ve also made the charting more configurable – you can change most of the different labels if you don’t like the autogenerated values and show or hide various elements to get the chart the way you want it. When you’re done, you can export it right to the format of your choice.

Bringing it All Together

I hope you get a sense from the pictures above just how much you can do with the new metric charting features, but there’s no substitute for downloading a trial and seeing it for yourself.  All of the data behind the charts in this example was collected using the Gibraltar Agent for ASP.NET so no coding was necessary to get these results.   Try it and you’ll see that Gibraltar can help you solve problems in the real world you live in, not just in some theoretical abstract place where demos happen.

Check out our recent post on error notification for another example of how we are incorporating customer feedback to help you build rock solid .NET software.

Categories : CEIP, Development


One thing we often debate back and forth internally is how to strike the balance between tracking errors and exceptions and providing broad, general diagnostic capabilities.  There’s a lot of basic attention paid to tracking exceptions – and a lot of tools that can help you do that.  Certainly, each exception that gets reported to an end user is a problem and you should address it (even if that’s just to make a better catch & report dialog).  But, what about application problems that either aren’t exceptions or can’t be solved with just the exception detail?

We recently released Gibraltar 2.1.  During the several month preview program we used Gibraltar to gather detailed data about how the application was running and what problems users were encountering.  Our Agent tracks exceptions and errors but also a lot of little details.  This data got reported back to us automatically using the Hub each day after the application exited.

For each exception that got reported to us we looked at:

  1. Was this a distinct problem, or part of a cascade?
  2. Could we determine the root cause from the exception report alone?

We broadened our definition of exception to cover any single error log message, so that it included things that could have been expressed as a single exception. This is particularly important since Gibraltar supports logging exceptions as extended data on any log message severity and supports recording errors that aren’t the result of an exception. We also counted all of the session summary information (such as operating system, memory, processor architecture, etc.) as part of the exception detail.

What we found is that many of the issues could be solved just from the exception – which is a great step forward from a decade ago, when this type of runtime problem information wasn’t readily available. Having this information is dramatically better than not having it.

Data Required to Identify and Resolve

However, most issues required significantly more context to be categorized and resolved: 56% in our case. In a few of those cases – about 1 in 8 – it would have been possible to recognize cascade errors as long as you knew the exceptions happened in the same process, in order. That means 49% of all issues required additional context. In most cases we needed to see several messages leading up to the problem, sometimes a range of them. In about 10% of all issues we ultimately needed data from the performance counters to identify the issue. Most commonly these were memory counters – both for the current process and for the system.
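The 49% figure follows from the two numbers before it, assuming the "1 in 8" applies to the 56% of issues that needed more than the exception report (that reading is our interpretation of the arithmetic, and the variable names are just for illustration):

```python
from fractions import Fraction

# Share of all issues that could not be resolved from the exception report alone.
needs_more_than_exception = Fraction(56, 100)

# Of those, the share that only needed "same process, in order" to spot a cascade.
cascade_share = Fraction(1, 8)

# What is left needed genuinely additional context (messages, counters, ...).
needs_additional_context = needs_more_than_exception * (1 - cascade_share)

print(float(needs_additional_context))  # 0.49
```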

For example, when a computer gets low on memory it will start to experience all kinds of odd errors due to transient memory allocation problems.  These can also happen if the current process is simply allocating memory too quickly.  The first scenario is not your fault – nothing you can really do about the user chewing up all of the memory with something else.  The second scenario is your fault – you need to be able to see the slope of private memory climbing before the problem and track that back to whatever your app was doing up to the point where errors started happening.
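One simple way to quantify "the slope of private memory climbing" is a least-squares fit over the counter samples leading up to the first error. The Python sketch below is only an illustration: the sample values are invented, and the 0.5 MB per second cutoff is an arbitrary assumption, not a threshold Gibraltar uses.

```python
def slope_per_second(samples):
    """Least-squares slope of (seconds, megabytes) samples, in MB per second."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    denominator = sum((t - mean_t) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("need samples taken at two or more different times")
    numerator = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return numerator / denominator

def likely_cause(process_private_mb, growth_threshold=0.5):
    """Rough triage: is our own process the one eating the memory?"""
    if slope_per_second(process_private_mb) > growth_threshold:
        return "our process is allocating quickly"
    return "look at the rest of the system"

# Hypothetical private-memory samples: (seconds before the first error, MB).
climbing = [(0, 200.0), (60, 260.0), (120, 320.0), (180, 380.0)]
flat = [(0, 200.0), (60, 201.0), (120, 199.0), (180, 200.0)]

print(slope_per_second(climbing), likely_cause(climbing))
print(slope_per_second(flat), likely_cause(flat))
```

A flat process slope alongside falling system memory points at the first scenario; a steep process slope points at the second, and the time window tells you which log messages to read.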

Here’s a Gibraltar Analyst screenshot from a customer who had this problem with their application:

Example of Gibraltar Performance Counter graph for application induced memory problem

In the end, it’s situations just like this that motivated us to design the Gibraltar Agent to be a great logging system, performance monitor, and configuration monitor out of the box. It’s great to get notified of exceptions, but it’s even better when you can fix a user’s issues instead of just being better informed that there are problems you can’t nail down.


Categories : CEIP, Development, Logging

For us, one of the great things about being the developers of Gibraltar is that, well, we get to use it to support our product and customers. After all, we got into this business because we’re really passionate about software diagnostics. We’ve always had a strong commitment to dogfooding, a term Microsoft uses to talk about using your own products internally before asking others to risk their success. We went through three closed beta cycles before Gibraltar 2.0 shipped – and each used Gibraltar to support itself. Whenever there was an issue we focused on whether we could solve it just with the data we got from our own tool. Many times we couldn’t – so we added key capabilities like the detailed assembly information tracking, culture and time zone tracking, and a host of other little details.

For the past four months we’ve been working on the centerpiece of Gibraltar 2.1 – the Hub Server.  We gave the first preview builds out in September and since then have been updating them based on both the feedback of early beta users and our own experience using the Hub as part of our new Customer Experience Improvement Program (CEIP).  Since starting the program we’ve received detailed logs on thousands of Gibraltar sessions from around one hundred users (folks that elected to use the Gibraltar 2.1 preview builds).

How it Worked

To implement our CEIP we needed several things:

  1. End user consent: Before gathering anything from end-users we made sure they knew and agreed to what was going to happen.  For the beta there was a simple notification that to participate in the beta you had to opt in to the CEIP.  Otherwise, go back and install the latest release version.  For production we have a much nicer opt-in / opt-out system.
  2. Application runtime monitoring: Exceptions, logging, feature usage metrics, performance counters, and other information about how the application was performing were collected automatically by the Gibraltar Agent.  This information is stored locally on the end-user’s computer as the application runs.
  3. Background data transfer: runtime monitoring data was periodically transferred from the end-user’s computer to a central server.  This was done entirely in the background using a resumable, HTTP-based protocol.  This is a feature built into the new Gibraltar Agent when working with the Gibraltar Hub.
  4. Central analysis: As data was submitted, summaries and detailed information were sent down to the development team for analysis.  This uses the new integration between the Gibraltar Analyst and the Hub.  In particular there are a few key reports built into the Analyst designed to dissect application errors and usage information.

Fortunately, this is just what we designed Gibraltar for.
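To show what "resumable" means in the background transfer step, here is a toy Python sketch. It is not the Agent-to-Hub protocol, which isn’t described in this post: the in-memory server, the offset handshake, and every name below are assumptions chosen to show the general idea of asking how much has arrived and continuing from there.

```python
class InMemoryServer:
    """Stand-in for a central server that tracks the bytes stored per session."""

    def __init__(self):
        self.received = {}

    def offset(self, session_id):
        """How many bytes of this session have already been stored."""
        return len(self.received.get(session_id, b""))

    def append(self, session_id, offset, chunk):
        data = self.received.setdefault(session_id, bytearray())
        if offset != len(data):
            raise ValueError("offset does not match bytes already stored")
        data.extend(chunk)

def upload(server, session_id, payload, chunk_size, fail_after_chunks=None):
    """Send payload in chunks, starting from whatever the server already has."""
    sent = 0
    offset = server.offset(session_id)
    while offset < len(payload):
        if fail_after_chunks is not None and sent == fail_after_chunks:
            raise ConnectionError("simulated network drop")
        chunk = payload[offset:offset + chunk_size]
        server.append(session_id, offset, chunk)
        offset += len(chunk)
        sent += 1

server = InMemoryServer()
payload = bytes(range(10))

try:
    upload(server, "session-1", payload, chunk_size=4, fail_after_chunks=1)
except ConnectionError:
    pass
bytes_after_drop = server.offset("session-1")  # first chunk made it

upload(server, "session-1", payload, chunk_size=4)  # picks up where it left off
print(bytes_after_drop, bytes(server.received["session-1"]) == payload)
```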

By The Numbers

We did two broader beta releases of Gibraltar 2.1.  For each we tracked all of the issues that were found post-developer.  This could be by our internal QA processes, reported to us by end users, or only found by analyzing the CEIP data.

The key question we wanted to answer was: how much better would our product be because of the CEIP? In other words, how many improvements could we make based solely on the automated CEIP feedback, not information from any other source? We were actively engaged in talking with our beta users as well as actively reviewing our own internal testing and information. So to justify itself, the CEIP would have to find real improvements that didn’t also come in from those sources.

Combining the results, here’s how the issues were discovered:

Chart of the % of issues detected through different sources.

At first glance, a few things jump out:

  • Internal testing didn’t find many issues: Our post-developer QA processes didn’t find many things we didn’t already know about (we excluded from these charts the known issues that we elected to ship the beta with). More on this below.
  • CEIP beat customer reports 3 to 1: The CEIP was the biggest single contributor.  It points out that even in this audience of select customers they didn’t report most issues they ran into.

Now, not every issue happens with the same frequency.  When you weight each issue by the number of end user sessions that it affected we see a very different distribution:

Issue Discovery weighted by Sessions

In this case our internal testing fared better, indicating that while we knew about or identified few issues internally, they reflected the most common issues. The ratio of CEIP-identified issues to Reported issues also narrows from 3 to 1 to almost 2 to 1, indicating that end users are reporting the ones they are running into most frequently.

Finally, if we look at the number of issues weighted by occurrence, counting every occurrence even when an issue repeats many times in the same session, we see:

Issue Discovery weighted by Frequency

So this validated that our internal testing overwhelmingly knew about the most frequent problems, but the CEIP to Reported ratio is now nearly 1:1; users are reporting issues that they run into multiple times in the same session substantially more than ones that happen sporadically. This makes sense if you consider human nature: if you run into the same problem five times in one session you’re likely to be pretty annoyed and more likely to send off a comment than after one freak occurrence.
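The three charts differ only in the weight given to each issue. The Python sketch below shows the mechanics with invented numbers; it does not reproduce our actual CEIP data, and the field names are assumptions made for the example.

```python
def share_by_source(issues, weight):
    """Percentage of the weighted total attributed to each discovery source."""
    totals = {}
    for issue in issues:
        totals[issue["source"]] = totals.get(issue["source"], 0) + weight(issue)
    grand_total = sum(totals.values())
    return {source: round(100 * value / grand_total) for source, value in totals.items()}

# Invented issues: where each was found, sessions affected, total occurrences.
issues = [
    {"source": "Internal", "sessions": 40, "occurrences": 400},
    {"source": "Reported", "sessions": 10, "occurrences": 60},
    {"source": "CEIP", "sessions": 8, "occurrences": 20},
    {"source": "CEIP", "sessions": 6, "occurrences": 20},
    {"source": "CEIP", "sessions": 6, "occurrences": 20},
]

by_issue = share_by_source(issues, lambda issue: 1)
by_sessions = share_by_source(issues, lambda issue: issue["sessions"])
by_frequency = share_by_source(issues, lambda issue: issue["occurrences"])

print(by_issue)
print(by_sessions)
print(by_frequency)
```

With these made-up inputs the CEIP to Reported ratio goes from 3 to 1 by issue count, to roughly 2 to 1 by sessions, to 1 to 1 by frequency, which is the same shape of shift described above: the same list of issues tells a different story depending on the weight.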

What does this say about a CEIP?

Look back at the charts above and imagine that each slice marked “CEIP” was labeled “Don’t know we Don’t know” – each reflects something that today you don’t know about, and worse don’t have any way of knowing you don’t know about it.

We focus on the middle chart as our true measure of reliability – issues weighted by number of sessions. This gets rid of the funny outliers where a user persisted in running an application that was in a bad state and kept experiencing the same problem repeatedly, or issues that tended to cause cascade failures. In this chart, over 40% of the total issues are ones we would never have known about without our CEIP. This was nearly equal to the issues we found in final QA. Imagine not doing any QA. That’d sound ridiculous, right? Well, the CEIP found nearly the same number of problems.

We were honestly surprised by these results, because our mental image was the third graph:  We knew about all of the problems.  But, this view was skewed because we were unconsciously weighting against the volume of each problem.  What we found through our CEIP are all of these little boundary issues, and these are the little things that separate an OK product from one that your customers can count on – one that Just Works.

Categories : CEIP, Development