Archive for Logging
Managed Thread Ids – Unique Id’s that aren’t Unique
Posted by: | CommentsWe had a customer quiz us about why one of our thread names was showing up on some of their log messages. We looked into the problem and were a bit baffled. We name all of the threads we create inside the Agent to ensure we can separate what they do from any client application. The name in question is used by a thread that the Gibraltar Agent creates and then destroys relatively early in the process. This thread isn’t taken from the threadpool or put back into one, we confirmed it gets created and released so there just seemed no way that they could be processing on our thread.
We checked the data up and down and were confident that it wasn’t a data corruption problem – the only assumption made by the code was that Managed Thread Ids are unique. This seemed pretty reasonable: the documentation for the ManagedThreadId property reads:
Thread.ManagedThreadId: Gets a unique identifier for the current managed thread.
But, we kept digging and found another scenario on a long running ASP.NET application where a similar event occurred – a thread that was created and destroyed relatively early in the application was clearly now in the thread pool and handling events. Researching more, we found this gem in the documentation. Not on the MSDN documentation for ManagedThreadId but rather for Thread.GetHashCode:
The hash code is not guaranteed to be unique. Use the ManagedThreadId property if you need a unique identifier for a managed thread.
OK, still pointing us that ManagedThreadId is the right guy for our use. But then there’s this note on the Thread Class itself:
GetHashCode provides identification for managed threads. For the lifetime of your thread, it will not collide with the value from any other thread, regardless of the application domain from which you obtain the value.
This started to cast some concern: That little bit of weasel room in the second sentence is troubling: “For the lifetime of your thread”… Was .NET reusing thread Id’s after a thread exits? The wiggle room in the statement above made that sound possible, even though there’s no reason necessarily that the hash code and the thread Id are related. My first read of this was that the variation was about the second part of the sentence – uniqueness across application domains (which we never assumed).
So we created a few brutal tests – creating and destroying threads then ramping up the thread pool’s activity. Sure enough, the same Managed Thread Ids showed up in the thread pool. These weren’t the same threads – the thread static variables we were using for tests had been reset – but they had the same Managed Thread Id.
Go Team
The fix for us is to not rely on Managed Thread Id for correlating events to threads. Instead, we’re using an internal thread static variable to track the relationship and identify it with our own unique identifier. Because we track the thread responsible for log messages and many other things we record we had to represent this in the smallest amount of data feasible, and remain backwards/forwards compatible with existing data.
We’ve updated the display to automatically generate unique display names to separate out threads with the same Id’s and had to do a range of other adjustments to ensure we treat the Managed Thread Id as nothing more unique than a display name. That way you’ll be sure that if two events are ascribed to something called “Thread 14″, they really are the same thread. All of the changes for this are included in Gibraltar 2.1.1 which will ship within the next few days (this was the last issue we needed to resolve before shipping).
Incomplete is worse than Missing
The frustrating part is that if the documentation had never made any claim about the uniqueness of the thread Id we’d likely have gone through a set of proof and qualification testing. Like many people, when there isn’t documentation on something we have to create experiments to tease out the true behavior, review source code, and then decide what risks we want to take. This is one reason we are passionate about documentation, even at the expense of extra features. We want to make sure that you never have a doubt about what something on our API does. We also know that people don’t want to review documentation if they don’t have to – so we try hard to make the API understandable just from Intellisense.
Now, I don’t want to knock Microsoft too hard here – .NET is a massive framework even if you just look at the core .NET 2.0 API. But, as we all rely more and more on ever increasing layers of abstraction over what’s really going on it’s more important than ever to be precise in the documentation – about what something is and what it isn’t. Precise is more important than being comprehensive, because it will set the right expectation for people about what they can rely on and what they’ll have to verify for themselves.
First, Do No Harm – Designing Robust Infrastructure
Posted by: | CommentsSeveral customers have requested a notification mechanism to be alerted when errors are detected in their programs. Simply raising an event is straightforward, but our promise to our customers is that we’ll do the hard thinking that ensures Gibraltar is safe and robust in production systems. Our mantra is: first, do no harm.
In this case, we asked ourselves questions like:
- What if a customer’s error notification logic is slow? How do we ensure that it doesn’t slow down the application as a whole?
- What if the program starts screaming thousands of errors? How do we ensure that we don’t swamp the error notification handler?
- What if there are errors in the customer’s error notification handler? What if it throws an exception? What if it hangs?
This resulted in a design that ensures that the logging infrastructure (including Gibraltar itself AND customer logic that interfaces with it) will be robust and safe.
Our central Log object in Gibraltar Agent now has a MessageAlert event that is raised when warning, error, or critical messages are recorded. This event has a number of safety features such as:
- Asynchronous: The event is raised on a background thread that is not part of the logging path, ensuring that time spent handling the event will not slow down logging or affect other threads.
- Batching: When a burst of messages are recorded that qualify they will typically be raised together to allow more efficient processing
- Throttling: A minimum delay between events can be easily specified to ensure the event isn’t raised too frequently, particularly in error cascade scenarios. Messages are batched up until the next time the event can be raised.
- Hang Protection: If the event handler never returns the Agent will continue to process messages and not queue them, allowing them to be released from memory.
- Loop Protection: Messages that are recorded by your event handler will not cause additional events to be raised. This prevents notification loops where an event handler records an error during notification which subsequently causes the message alert notification to be raised again.
- Low Overhead: We don’t spin up anything (the threading, queue, etc.) until someone subscribes to the event so if you don’t use this feature it doesn’t take up resources either.
The MessageAlert event is particularly useful for automatically triggering immediate data transmission in the case of an error and implementing your own error notification mechanism. The full detail of each log message is available in the event.
Check out our recent post on charting enhancements for more examples of how we are incorporating customer feedback to ensure that Gibraltar provides a robust logging infrastructure allowing you to build rock solid .NET software.
It’s More than Just Exceptions and Errors
Posted by: | CommentsOne thing we often debate back and forth internally is how to strike the balance between tracking errors and exceptions and providing broad, general diagnostic capabilities. There’s a lot of basic attention paid to tracking exceptions – and a lot of tools that can help you do that. Certainly, each exception that gets reported to an end user is a problem and you should address it (even if that’s just to make a better catch & report dialog). But, what about application problems that either aren’t exceptions or can’t be solved with just the exception detail?
We recently released Gibraltar 2.1. During the several month preview program we used Gibraltar to gather detailed data about how the application was running and what problems users were encountering. Our Agent tracks exceptions and errors but also a lot of little details. This data got reported back to us automatically using the Hub each day after the application exited.
For each exception that got reported to us we looked at:
- Was this a distinct problem, or part of a cascade?
- Could we determine the root cause from the exception report alone?
We broadened our definition of exception to include any single error log message to include things that could have been expressed as a single exception. This is particularly important since Gibraltar supports logging exceptions as extended data on any log message severity and supports recording errors that aren’t the result of an exception. We also counted all of the session summary information (such as operating system, memory, processor architecture, etc.) as part of the exception detail.
What we found is that many of the issues could be solved just from the exception – which is a great step forward from a decade ago where this type of runtime problem information wasn’t readily available. By far, having this information is dramatically better than not having it.
However, most issues required significantly more context to be categorized and resolved- 56% in our case. In a few cases – about 1 in 8 – it would have been possible to recognize cascade errors as long as you knew the exceptions happened in the same process in order. That means 49% of all issues required additional context. In most cases we needed to see several messages leading up to the problem, sometimes a range of them. In about 10% of all issues we ultimately needed data from the performance counters to identify the issue. Most commonly these were memory counters – both for the current process and for the system.
For example, when a computer gets low on memory it will start to experience all kinds of odd errors due to transient memory allocation problems. These can also happen if the current process is simply allocating memory too quickly. The first scenario is not your fault – nothing you can really do about the user chewing up all of the memory with something else. The second scenario is your fault – you need to be able to see the slope of private memory climbing before the problem and track that back to whatever your app was doing up to the point where errors started happening.
Here’s a Gibraltar Analyst screenshot from a customer who had this problem with their application:
In the end, it’s situations just like this that motivated us to design the Gibraltar Agent to be a great logging system, performance monitor, and configuration monitor out of the box. It’s great to get notified of exceptions, but it’s even better when you can fix a user’s issues instead of just better informed that there are problems you can’t nail down.
Stay connected with your users via Gibraltar Hub
Posted by: | CommentsFollowing up on our announcement of Gibraltar Hub, I wanted to talk a bit more about what it does, how it works and our design goals in creating it.
We’ve always had an eye to making it as easy as possible to get diagnostic data from where your application is run back to the developers that can make good use of it. We built in a packager that could gather up all of the data and send it to you via email or write it to a single, highly compressed file. But we always knew we also wanted a way to transmit logs over basic web protocols.
What is Gibraltar Hub?
With the Gibraltar Hub we’ve created a service that listens for Gibraltar data sent to it via email or our RESTful web API. We’ve updated the Gibraltar Agent to be able to send data to the Hub just like it can send via email or write to a file. On the other side we’ve updated the Gibraltar Analyst to be able to retrieve all of the data sent to a Hub automatically in the background.
It sounds like a small thing – we put a web service between the Agent and the Analyst – but the effect on using Gibraltar is remarkable. First, it means that diagnostic data automatically flows to you without any extra work on your part. When you have hundreds or thousands of installed copies of your software that’s key. It makes it practical to gather and analyze all that data.
Second, multiple users can subscribe to the same Hub, meaning each gets a copy of the data. This keeps your whole team in the loop on what’s happening in the field. You can have your laptop and desktop both set up to get data so you can provide great customer support in the field and in the office.
Third, it still works with email. Based on the suggestions from our beta users we added the ability for the Hub to connect up to a mailbox and automatically pull in all of the Gibraltar data sent to it. This provides a few neat capabilities:
- End users can email in data collected from computers that can’t talk to the Hub directly.
- You can start off using email and easily migrate to the Hub without having to change the configuration of the Agents.
- You can use email as a fallback if for some reason a computer can’t access the web but can send email.
- You can selectively forward sets of log data into your whole team by just forwarding the email. This may fit well with your existing customer service system if it’s receiving the mail first.
It Just Works
Mobile User Support
Like many of our customers, we’re all over the place from day to day so it’s essential that our support infrastructure works wherever we are. We designed the Hub for this exact scenario because only the Hub has to be in a well known, accessible location: The Agents send to it, and the Analysts get data from wherever they are.
Large Sessions? Bad Connections? Firewalls?
We wanted to be absolutely sure that the Hub would Just Work. That meant even if you created really large log files or had a very dodgy network connection it still had to function without complaint. To support this, the Hub uses a chunked encoding method for uploading and downloading data to ensure that it’s efficient over bad or slow connections and won’t get halfway through sending a file only to discover that the web server won’t accept it because it’s too large.
We chose to use a RESTful API for the Hub with an eye to making it as compatible as possible with even the most strict firewall configurations. We work with customers that categorically block all SOAP web services and do other interesting content restrictions. Without being downright subversive, we wanted to make sure the Agent could always get data through.
On the Analyst side, it works in the background to stream the data from the Hub to your local repository so it’s available to you even if you lose your network connection.
Want to get started?
Upon release you can either purchase your own Private Hub to run on your own server or you can use the Gibraltar Hub Service that we host.
If you’re a registered user of Gibraltar you can try out the Gibraltar Hub Service right now. Just contact us to get your preview account set up. Depending on feedback we receive, we expect the general release to be in late December or early January.
If you hadn’t seen it in our previous post, here’s a short video tour (3 min) of Gibraltar Hub:





