Archive for Logging

Jun
14

Reworking the Gibraltar AddIn API

Posted by: | Comments (0)

There’s a classic saying in designing software that you should plan on throwing the first one away.  I’ve never been very comfortable with that because it can be easily abused by a team to justify half-baked thinking, inadequate requirements and design, and just general hope instead of solid engineering.

That said, it’s generally been my experience that no matter how good and diligent you are when designing an API you’ll inevitably find that you’ve missed the mark as soon as the first set of users start solving real problems with it.   So, you’ll want to leave room to redo and rethink your original approach before you’re married to it.  This is the major reason that we shipped Gibraltar 2.2.0 as a Beta back in early April with our first vision of a Gibraltar Add In API.

Feedback from inside & outside

To make sure we got solid evidence that the design was sound we worked hard to get some existing customers as well as outsiders to try solving their challenges with the API.  We also did our own internal Red Team approach:  we asked Jay (who isn’t doing any day-to-day development on Gibraltar, but has extensive commercial development experience) to scratch his favorite itch with the API as well.  We’re going to publish more about what he built soon, but without going into details something significant emerged from his work and discussions with our customers.

Instead of people using the API to solve a few tactical problems (like data export or an alternate view of the data) people wanted to use it to extensively integrate Gibraltar with external systems. This went way beyond our original intent; we have a big block of features on our roadmap for creating an external integration engine and hadn’t ever intended this to be the way that came about.  Despite that, it’s the problem people kept wanting to solve with the API.

Unfortunately, there are several things that we just didn’t allow for, but are common problems in these scenarios:

  • Configuration Management: Whenever you’re working between two systems, you’ll want some metadata to describe endpoints and map data between the systems.
  • Credential Management: Nobody likes hardcoding credentials, or doing something simple like popping them in a file without encryption.  We all know the right things to do, but it’s annoying and overhead.  It should be built in.
  • Coordination: All the various user interface and process extensions don’t exist in a vacuum; they need easy ways to coordinate their work to enable new processes, not just isolated features.
  • More Extensibility: People want to be able to extend other parts of the UI again with the goal of connecting processes together.
  • Development Debugging: In a twist of irony, it’s hard to log and view data from within an Add in because you’re running within the Gibraltar environment; you can’t use the Agent to log because you’re running inside of the Agent.  This made development of extensions a pain.

Powerful capabilities, not demoware

As we’ve said before, we hate demoware.   Although we’ve heard from a few parties that we just need to “check the feature box” with the API (have it but no one will use it), we just aren’t willing to accept that measure for success.  Instead, we see another angle:  We’re going to leverage the add in API ourself to enable us to create targeted features that only apply to some part of our audience so we can create them faster and keep control over the product’s complexity.  In short, we’ll make sure that it’ll solve problems for you by having it solve problems for us.

Fortunately, we didn’t have to throw the first one away in a literal sense.  We did end up discarding virtually our entire implementation of it from the Gibraltar side, but the change to the API you see is fairly modest.

We’re still finalizing the details, and deciding whether we need to do another beta round of 2.2 or should go ahead and formally release 2.2 so that everyone can get the benefit of all of the improvements that have happened over the past several months beyond the API.  But, we can speak to a few great features:

Configuration Management

Configuration data can be shared between the Hub and all of its connected Analysts.  This makes it more practical for add ins that connect systems together but need everyone on the team to share metadata on how items in both systems relate.  Since add ins can run on both the Hub and in the Analyst we made it so you can change the Hub configuration for an add in remotely.  This means you don’t have to run to IT and get on the server to say add new mappings for new projects or do whatever other configuration your favorite add in requires.  The central data is automatically synchronized between the Hub and Analyst and is available even if the Hub isn’t (say because you’re running disconnected from the network)

For example, say you want to integrate with your defect tracking system.  It’ll have some form of heirarchy for managing things – probably with different areas for the current development release vs. the production releases, etc.  You need to be able to relate the data Gibraltar has (like product and application names as well as versions) to those areas so when it opens or updates an issue it shows up in the right place and in the right workflow.

Credential Management

In a perfect world, your Windows credentials would give you access to any external system your Add In needs to access, so no credentials need to be stored.  Unfortunately, most of the third party systems you’ll want to integrate with (particularly ones that aren’t designed for enterprises) just flat won’t integrate that way.  You need to be able to store a user id and password but do so in a way that doesn’t create an unreasonable security risk.

We’ve baked in credential management leveraging the data protection API in Windows (DPAPI) so that you can encrypt account names and passwords so they can be accessed only by the current user, computer, or the hub.  Here again you can manage it all from Analyst, so if you need to change the account the Hub is using to login into your defect tracking system or something you don’t have to go to the server to do it.

Development Debugging

We’re taking advantage of improvements in the Live Log Viewer control to offer a real-time view of what’s happening with Add Ins, both the Gibraltar side and the messages logged by each Add In.  We log a lot of information as Analyst runs, and we don’t want to bury you in all of that detail so it automatically filters down to just information needed to monitor and support Add Ins.

This is done using only capabilities that are exposed through our normal API, so you can do the same thing in your WinForms/WPF application as well.  You’ll notice we’ve updated the live viewer to include a number of capabilities formerly only available in Analyst, like source code lookup and exception / detail viewing.

More Extensibility

We’ve added a number of places you can extend Gibraltar to the original list:

  • Analyze Sessions on Hub: Any analysis engine you create can run in Analyst or Hub.  We always intended it, but it wasn’t ready for the beta.  This means you can have data flowing from Gibraltar to external systems without anyone being logged in or having to leave something running.
  • Global Commands: You can extend the main Analyst menu bar with your own menu items to create global commands end users can easily access.
  • Controllers: We’ve added some key controller interfaces so you can hold state in the right places (think MVC) and substantially improve the efficiency of integrating with remote systems that may be expensive to query.
  • And More: Frankly, we’re going to continue adding items as we head towards release based on our internal experimentation of what we can safely get done and commit to.  Our ultimate goal is that you can extend each of the different views in Analyst so you can have it seamlessly participate in end-user workflow.

Real World Experience

We’ve been internally using the new API to help us automate our CEIP process for Gibraltar itself and it’s given us a lot of confidence not just in the new implementation but also in the vision it represents.  We’re going to be able to ship extensions outside of our release cycle, add optional modules in the box for a great initial experience, and help create a space for you to show your peers the cool things you can do where Gibraltar + Your System  = Awesome.

Is there a specific system you’d like to see us integrate with?  Drop us a line; we love to engage with our peers to understand new & creative ways to make a difference in how developers create rock solid .NET software.

Categories : .NET, Development, Logging
Comments (0)

We had a customer quiz us about why one of our thread names was showing up on some of their log messages.  We looked into the problem and were a bit baffled.  We name all of the threads we create inside the Agent to ensure we can separate what they do from any client application.  The name in question is used by a thread that the Gibraltar Agent creates and then destroys relatively early in the process.  This thread isn’t taken from the threadpool or put back into one, we confirmed it gets created and released so there just seemed no way that they could be processing on our thread.

We checked the data up and down and were confident that it wasn’t a data corruption problem – the only assumption made by the code was that Managed Thread Ids are unique.  This seemed pretty reasonable: the documentation for the ManagedThreadId property reads:

Thread.ManagedThreadId: Gets a unique identifier for the current managed thread.

But, we kept digging and found another scenario on a long running ASP.NET application where a similar event occurred – a thread that was created and destroyed relatively early in the application was clearly now in the thread pool and handling events.  Researching more, we found this gem in the documentation.  Not on the MSDN documentation for ManagedThreadId but rather for Thread.GetHashCode:

The hash code is not guaranteed to be unique. Use the ManagedThreadId property if you need a unique identifier for a managed thread.

OK, still pointing us that ManagedThreadId is the right guy for our use.  But then there’s this note on the Thread Class itself:

GetHashCode provides identification for managed threads. For the lifetime of your thread, it will not collide with the value from any other thread, regardless of the application domain from which you obtain the value.

This started to cast some concern:  That little bit of weasel room in the second sentence is troubling: “For the lifetime of your thread”…  Was .NET reusing thread Id’s after a thread exits?  The wiggle room in the statement above made that sound possible, even though there’s no reason necessarily that the hash code and the thread Id are related.  My first read of this was that the variation was about the second part of the sentence – uniqueness across application domains (which we never assumed).

So we created a few brutal tests – creating and destroying threads then ramping up the thread pool’s activity.  Sure enough, the same Managed Thread Ids showed up in the thread pool.  These weren’t the same threads – the thread static variables we were using for tests had been reset – but they had the same Managed Thread Id.

Go Team

The fix for us is to not rely on Managed Thread Id for correlating events to threads.  Instead, we’re using an internal thread static variable to track the relationship and identify it with our own unique identifier.  Because we track the thread responsible for log messages and many other things we record we had to represent this in the smallest amount of data feasible, and remain backwards/forwards compatible with existing data.

We’ve updated the display to automatically generate unique display names to separate out threads with the same Id’s and had to do a range of other adjustments to ensure we treat the Managed Thread Id as nothing more unique than a display name.   That way you’ll be sure that if two events are ascribed to something called “Thread 14″, they really are the same thread.  All of  the changes for this are included in Gibraltar 2.1.1 which will ship within the next few days (this was the last issue we needed to resolve before shipping).

Incomplete is worse than Missing

The frustrating part is that if the documentation had never made any claim about the uniqueness of the thread Id we’d likely have gone through a set of proof and qualification testing.  Like many people, when there isn’t documentation on something we have to create experiments to tease out the true behavior, review source code, and then decide what risks we want to take.  This is one reason we are passionate about documentation, even at the expense of extra features.  We want to make sure that you never have a doubt about what something on our API does.  We also know that people don’t want to review documentation if they don’t have to – so we try hard to make the API understandable just from Intellisense.

Now, I don’t want to knock Microsoft too hard here – .NET is a massive framework even if you just look at the core .NET 2.0 API.  But, as we all rely more and more on ever increasing layers of abstraction over what’s really going on it’s more important than ever to be precise in the documentation – about what something is and what it isn’t.  Precise is more important than being comprehensive, because it will set the right expectation for people about what they can rely on and what they’ll have to verify for themselves.

Categories : .NET, Development, Logging
Comments (5)
Hippocrates - 460-377 BC

Hippocrates’ Primum non nocere, “First do no harm”

Several customers have requested a notification mechanism to be alerted when errors are detected in their programs.  Simply raising an event is straightforward, but our promise to our customers is that we’ll do the hard thinking that ensures Gibraltar is safe and robust in production systems.  Our mantra is: first, do no harm.

In this case, we asked ourselves questions like:

  • What if a customer’s error notification logic is slow?  How do we ensure that it doesn’t slow down the application as a whole?
  • What if the program starts screaming thousands of errors?  How do we ensure that we don’t swamp the error notification handler?
  • What if there are errors in the customer’s error notification handler?  What if it throws an exception?  What if it hangs?

This resulted in a design that ensures that the logging infrastructure (including Gibraltar itself AND customer logic that interfaces with it) will be robust and safe.

Our central Log object in Gibraltar Agent now has a MessageAlert event that is raised when warning, error, or critical messages are recorded.  This event has a number of safety features such as:

  • Asynchronous: The event is raised on a background thread that is not part of the logging path, ensuring that time spent handling the event will not slow down logging or affect other threads.
  • Batching: When a burst of messages are recorded that qualify they will typically be raised together to allow more efficient processing
  • Throttling: A minimum delay between events can be easily specified to ensure the event isn’t raised too frequently, particularly in error cascade scenarios.  Messages are batched up until the next time the event can be raised.
  • Hang Protection: If the event handler never returns the Agent will continue to process messages and not queue them, allowing them to be released from memory.
  • Loop Protection: Messages that are recorded by your event handler will not cause additional events to be raised.  This prevents notification loops where an event handler records an error during notification which subsequently causes the message alert notification to be raised again.
  • Low Overhead: We don’t spin up anything (the threading, queue, etc.) until someone subscribes to the event so if you don’t use this feature it doesn’t take up resources either.

The MessageAlert event is particularly useful for automatically triggering immediate data transmission in the case of an error and implementing your own error notification mechanism.  The full detail of each log message is available in the event.

Check out our recent post on charting enhancements for more examples of how we are incorporating customer feedback to ensure that Gibraltar provides a robust logging infrastructure allowing you to build rock solid .NET software.

Categories : .NET, Development, Logging
Comments (2)

kick it on DotNetKicks.com

One thing we often debate back and forth internally is how to strike the balance between tracking errors and exceptions and providing broad, general diagnostic capabilities.  There’s a lot of basic attention paid to tracking exceptions – and a lot of tools that can help you do that.  Certainly, each exception that gets reported to an end user is a problem and you should address it (even if that’s just to make a better catch & report dialog).  But, what about application problems that either aren’t exceptions or can’t be solved with just the exception detail?

We recently released Gibraltar 2.1.  During the several month preview program we used Gibraltar to gather detailed data about how the application was running and what problems users were encountering.  Our Agent tracks exceptions and errors but also a lot of little details.  This data got reported back to us automatically using the Hub each day after the application exited.

For each exception that got reported to us we looked at:

  1. Was this a distinct problem, or part of a cascade?
  2. Could we determine the root cause from the exception report alone?

We broadened our definition of exception to include any single error log message to include things that could have been expressed as a single exception.  This is particularly important since Gibraltar supports logging exceptions as extended data on any log message severity and supports recording errors that aren’t the result of an exception.  We also counted all of the session summary information (such as operating system, memory, processor architecture, etc.) as part of the exception detail.

What we found is that many of the issues could be solved just from the exception – which is a great step forward from a decade ago where this type of runtime problem information wasn’t readily available.    By far, having this information is dramatically better than not having it.

Data Required to Identifiy and Resolve

However, most issues required significantly more context to be categorized and resolved- 56% in our case.  In a few cases – about 1 in 8 – it would have been possible to recognize cascade errors as long as you knew the exceptions happened in the same process in order.   That means 49% of all issues required additional context.  In most cases we needed to see several messages leading up to the problem, sometimes a range of them.  In about 10% of all issues we ultimately needed data from the performance counters to identify the issue.  Most commonly these were memory counters – both for the current process and for the system.

For example, when a computer gets low on memory it will start to experience all kinds of odd errors due to transient memory allocation problems.  These can also happen if the current process is simply allocating memory too quickly.  The first scenario is not your fault – nothing you can really do about the user chewing up all of the memory with something else.  The second scenario is your fault – you need to be able to see the slope of private memory climbing before the problem and track that back to whatever your app was doing up to the point where errors started happening.

Here’s a Gibraltar Analyst screenshot from a customer who had this problem with their application:

Example of Gibraltar Performance Counter graph for application induced memory problem

In the end, it’s situations just like this that motivated us to design the Gibraltar Agent to be a great logging system, performance monitor, and configuration monitor out of the box.  It’s great to get notified of exceptions, but it’s even better when you can fix a user’s issues instead of just better informed that there are problems you can’t nail down.

kick it on DotNetKicks.com

Categories : CEIP, Development, Logging
Comments (1)