Archive for Logging
First, Do No Harm – Designing Robust Infrastructure
Posted by: | CommentsSeveral customers have requested a notification mechanism to be alerted when errors are detected in their programs. Simply raising an event is straightforward, but our promise to our customers is that we’ll do the hard thinking that ensures Gibraltar is safe and robust in production systems. Our mantra is: first, do no harm.
In this case, we asked ourselves questions like:
- What if a customer’s error notification logic is slow? How do we ensure that it doesn’t slow down the application as a whole?
- What if the program starts screaming thousands of errors? How do we ensure that we don’t swamp the error notification handler?
- What if there are errors in the customer’s error notification handler? What if it throws an exception? What if it hangs?
This resulted in a design that ensures that the logging infrastructure (including Gibraltar itself AND customer logic that interfaces with it) will be robust and safe.
Our central Log object in Gibraltar Agent now has a MessageAlert event that is raised when warning, error, or critical messages are recorded. This event has a number of safety features such as:
- Asynchronous: The event is raised on a background thread that is not part of the logging path, ensuring that time spent handling the event will not slow down logging or affect other threads.
- Batching: When a burst of messages are recorded that qualify they will typically be raised together to allow more efficient processing
- Throttling: A minimum delay between events can be easily specified to ensure the event isn’t raised too frequently, particularly in error cascade scenarios. Messages are batched up until the next time the event can be raised.
- Hang Protection: If the event handler never returns the Agent will continue to process messages and not queue them, allowing them to be released from memory.
- Loop Protection: Messages that are recorded by your event handler will not cause additional events to be raised. This prevents notification loops where an event handler records an error during notification which subsequently causes the message alert notification to be raised again.
- Low Overhead: We don’t spin up anything (the threading, queue, etc.) until someone subscribes to the event so if you don’t use this feature it doesn’t take up resources either.
The MessageAlert event is particularly useful for automatically triggering immediate data transmission in the case of an error and implementing your own error notification mechanism. The full detail of each log message is available in the event.
Check out our recent post on charting enhancements for more examples of how we are incorporating customer feedback to ensure that Gibraltar provides a robust logging infrastructure allowing you to build rock solid .NET software.
It’s More than Just Exceptions and Errors
Posted by: | CommentsOne thing we often debate back and forth internally is how to strike the balance between tracking errors and exceptions and providing broad, general diagnostic capabilities. There’s a lot of basic attention paid to tracking exceptions – and a lot of tools that can help you do that. Certainly, each exception that gets reported to an end user is a problem and you should address it (even if that’s just to make a better catch & report dialog). But, what about application problems that either aren’t exceptions or can’t be solved with just the exception detail?
We recently released Gibraltar 2.1. During the several month preview program we used Gibraltar to gather detailed data about how the application was running and what problems users were encountering. Our Agent tracks exceptions and errors but also a lot of little details. This data got reported back to us automatically using the Hub each day after the application exited.
For each exception that got reported to us we looked at:
- Was this a distinct problem, or part of a cascade?
- Could we determine the root cause from the exception report alone?
We broadened our definition of exception to include any single error log message to include things that could have been expressed as a single exception. This is particularly important since Gibraltar supports logging exceptions as extended data on any log message severity and supports recording errors that aren’t the result of an exception. We also counted all of the session summary information (such as operating system, memory, processor architecture, etc.) as part of the exception detail.
What we found is that many of the issues could be solved just from the exception – which is a great step forward from a decade ago where this type of runtime problem information wasn’t readily available. By far, having this information is dramatically better than not having it.
However, most issues required significantly more context to be categorized and resolved- 56% in our case. In a few cases – about 1 in 8 – it would have been possible to recognize cascade errors as long as you knew the exceptions happened in the same process in order. That means 49% of all issues required additional context. In most cases we needed to see several messages leading up to the problem, sometimes a range of them. In about 10% of all issues we ultimately needed data from the performance counters to identify the issue. Most commonly these were memory counters – both for the current process and for the system.
For example, when a computer gets low on memory it will start to experience all kinds of odd errors due to transient memory allocation problems. These can also happen if the current process is simply allocating memory too quickly. The first scenario is not your fault – nothing you can really do about the user chewing up all of the memory with something else. The second scenario is your fault – you need to be able to see the slope of private memory climbing before the problem and track that back to whatever your app was doing up to the point where errors started happening.
Here’s a Gibraltar Analyst screenshot from a customer who had this problem with their application:
In the end, it’s situations just like this that motivated us to design the Gibraltar Agent to be a great logging system, performance monitor, and configuration monitor out of the box. It’s great to get notified of exceptions, but it’s even better when you can fix a user’s issues instead of just better informed that there are problems you can’t nail down.
Stay connected with your users via Gibraltar Hub
Posted by: | CommentsFollowing up on our announcement of Gibraltar Hub, I wanted to talk a bit more about what it does, how it works and our design goals in creating it.
We’ve always had an eye to making it as easy as possible to get diagnostic data from where your application is run back to the developers that can make good use of it. We built in a packager that could gather up all of the data and send it to you via email or write it to a single, highly compressed file. But we always knew we also wanted a way to transmit logs over basic web protocols.
What is Gibraltar Hub?
With the Gibraltar Hub we’ve created a service that listens for Gibraltar data sent to it via email or our RESTful web API. We’ve updated the Gibraltar Agent to be able to send data to the Hub just like it can send via email or write to a file. On the other side we’ve updated the Gibraltar Analyst to be able to retrieve all of the data sent to a Hub automatically in the background.
It sounds like a small thing – we put a web service between the Agent and the Analyst – but the effect on using Gibraltar is remarkable. First, it means that diagnostic data automatically flows to you without any extra work on your part. When you have hundreds or thousands of installed copies of your software that’s key. It makes it practical to gather and analyze all that data.
Second, multiple users can subscribe to the same Hub, meaning each gets a copy of the data. This keeps your whole team in the loop on what’s happening in the field. You can have your laptop and desktop both set up to get data so you can provide great customer support in the field and in the office.
Third, it still works with email. Based on the suggestions from our beta users we added the ability for the Hub to connect up to a mailbox and automatically pull in all of the Gibraltar data sent to it. This provides a few neat capabilities:
- End users can email in data collected from computers that can’t talk to the Hub directly.
- You can start off using email and easily migrate to the Hub without having to change the configuration of the Agents.
- You can use email as a fallback if for some reason a computer can’t access the web but can send email.
- You can selectively forward sets of log data into your whole team by just forwarding the email. This may fit well with your existing customer service system if it’s receiving the mail first.
It Just Works
Mobile User Support
Like many of our customers, we’re all over the place from day to day so it’s essential that our support infrastructure works wherever we are. We designed the Hub for this exact scenario because only the Hub has to be in a well known, accessible location: The Agents send to it, and the Analysts get data from wherever they are.
Large Sessions? Bad Connections? Firewalls?
We wanted to be absolutely sure that the Hub would Just Work. That meant even if you created really large log files or had a very dodgy network connection it still had to function without complaint. To support this, the Hub uses a chunked encoding method for uploading and downloading data to ensure that it’s efficient over bad or slow connections and won’t get halfway through sending a file only to discover that the web server won’t accept it because it’s too large.
We chose to use a RESTful API for the Hub with an eye to making it as compatible as possible with even the most strict firewall configurations. We work with customers that categorically block all SOAP web services and do other interesting content restrictions. Without being downright subversive, we wanted to make sure the Agent could always get data through.
On the Analyst side, it works in the background to stream the data from the Hub to your local repository so it’s available to you even if you lose your network connection.
Want to get started?
Upon release you can either purchase your own Private Hub to run on your own server or you can use the Gibraltar Hub Service that we host.
If you’re a registered user of Gibraltar you can try out the Gibraltar Hub Service right now. Just contact us to get your preview account set up. Depending on feedback we receive, we expect the general release to be in late December or early January.
If you hadn’t seen it in our previous post, here’s a short video tour (3 min) of Gibraltar Hub:
Announcing Gibraltar Hub for Easy CEIP
Posted by: | CommentsGibraltar Hub is our new server-based product that works with Gibraltar Agent and Gibraltar Analyst to deliver an end-to-end solution for creating a customer experience improvement programs (CEIP) as well as remote debugging for customer support. We have been quietly developing and testing Hub for months and are thrilled with how well it’s working. We’ll be releasing it commercially later this Fall and are now inviting existing Gibraltar customers to participate in our beta testing program.
We’ve created a short video tour (3 min, below) to give you a sense of Gibraltar Hub as well as a podcast (8 min) of a conversation between Kendall and me talking about Gibraltar Hub and the problems it solves. You can read an abridged version of the interview below and we’ll be posting more technical details later this week.
If you like what you see and want to participate in our beta program, please shoot me an email.
More about Gibraltar Hub
What problems does Gibraltar Hub solve?
Gibraltar Hub is designed to address a couple key scenarios:
- Collecting data from many application instances even past firewalls such as commercial software products.
- Customer Experience Improvement Programs (CEIP) for proactively gathering feedback on application performance in the field through continuous data collection and analysis.
How does Gibraltar Hub complement Gibraltar Agent and Gibraltar Analyst?
Gibraltar Hub sits between Gibraltar Agent and Gibraltar Analyst making it easier to get data from users to the development team. It’s a web service providing two interfaces: one for Agents to submit logs, the other allowing logs to stream down into Analysts. With Gibraltar Hub you can collect, manage and analyze thousands of logs and provide every member of your development team with a consistent, near real-time view of all that data as well as simple, powerful tools to analyze the data and gain new insights into the areas of your applications most needing improvement.
Is Hub required to use Gibraltar?
No, Hub is totally optional. The existing email and file transfer mechanisms in Agent will continue to be supported. However, we believe Hub will provides the best user experience because both Analyst and Agent have been enhanced to support secure, reliable, background data transfers with Hub. This means the applications can be configured to silently stream logs in the background and the development team sees new data automagically appear like new mail popping into your inbox.
How does Hub help development teams?
Gibraltar Analyst has always made it easy to import and export packages containing logs. But some of our customers with large user communities or development teams with multiple members found it challenging to ensure that everyone had a consistent view of all the relevant data. With Gibraltar Hub each team member can subscribe to a shared feed and have all the data available and continuously updated.
What is a Customer Experience Improvement Program (CEIP) and how does Gibraltar Hub help?
Microsoft coined the term Customer Experience Improvement Program (CEIP) and describe it like this:
CEIP collects information about how our customers use Microsoft programs and about some of the problems they encounter. Microsoft uses this information to improve the products and features customers use most often and to help solve problems. Participation in the program is voluntary, and the end results are software improvements to better meet the needs of our customers.
The three components of Gibraltar correspond directly with the three key challenges for development teams wishing to create their own CEIP:
- Agent efficiently collects data about how customers use programs and records details on problems they encounter.
- Hub provides reliable, secure transmission of log data from end-users to each member of the development team. Data is highly compressed and the transfer protocol is firewall-friendly and reliable even when network connectivity is limited and intermittent.
- Analyst indexes all that data and provides powerful visualization tools that help team members identify broad patterns spanning many logs as well as the ability to drill into each log to point the root cause of a single issue.
Does automatic transmission of logging data raise any privacy concerns?
Yes, dealing responsibly with information privacy is extremely important. At the same time, the considerations vary widely between different applications so we think that it’s important for Gibraltar to provide the flexibility to fit within a broad range of usage scenarios. For example, having a dialog for CEIP opt-in is a recommended best practice for a commercial software product, but for an in-house corporate application that is only used by employees, their employment agreement or computer login screen may already require informed consent to certain information being monitored. With this in mind, the default configuration settings for Gibraltar only transmit data on demand with explicit user consent. And we also make it easy to enable automatic background log transmission when appropriate.
UPDATE: Kendall has written a nice follow-up post on what Gibraltar Hub is, how it works and why we created it.
Want to get started?
Upon release you can either purchase your own private Hub to run on your own server or subscribe to the Gibraltar Hub Service we host. If you’re a registered user of Gibraltar you can try out the Gibraltar Hub Service right now. Just contact us to get your preview account set up. Depending on the final user feedback from this preview we expect the general release to be in the next 4 to 8 weeks.





