Is your data center healthy?

Author: Julie Wills - MarCom/Thursday, February 19, 2015/Categories: Data Services

When it comes to your data center, there is no such thing as a “small change.” Current data center infrastructure is more interrelated and interdependent than ever before, and one minor change in one place can change everything. As server, network, storage, and hypervisor operating systems become increasingly dependent on each other, tremendous efficiencies are gained, but a new challenge emerges as well. To keep all data center services working together at peak efficiency, all components must be continuously in compliance with each other and must follow best practices. In addition, staying secure (or even gaining new features) requires keeping all of these systems up to date with firmware and patches. But how can you keep everything up to date and healthy when some changes are easy, some are difficult, and just finding the best practices and compatibility matrices takes more time than you have to spare?

The answer to this question is:  Periodic Health Checks.

Just as an annual exam at your doctor can ensure proper blood pressure and cholesterol levels, a periodic health check for your data center makes sure that your systems are up to date and healthy. And just as annual exams keep your health in check and help keep avoidable ailments at bay, an investment in data center health checks will pay for itself in avoided downtime and in keeping up with security issues.

Best practices are not static, however; people discover new and better ways to do things each year. A periodic review of your environment is a critical part of keeping your systems running well. We look at this process with two primary goals in mind:

1. To ensure your systems are up to date and optimized.

It is impossible to be an expert in everything, so relying on a partner with extensive expertise is a wise decision.

2. To hear about any successes and/or pains you may have experienced.

We love to hear the successes. However, the pains are even more important to share, as there may be simple things that can be done to reduce or eliminate them entirely.

How often should a health check be done?

The short answer is annually, with a mini checkup in between, but it really depends on your environment. The rate of change or growth, or the introduction of significant updates or patches, may modify that recommendation. Not long ago I would have said that once a year was a good standard for most situations. I would add that if you have been doing annual health checks, a mini checkup can be done quickly and easily before and after upgrades or new hardware installations to verify continued health. More recently, issues with widespread impact, such as the Heartbleed security bug from last April, have also influenced when to do a full health check. Not everything gets major media attention, though.

The National Vulnerability Database lists 734 Software Flaws (CVE), US-CERT Technical Alerts, and US-CERT Vulnerability Notes in the last 3 months alone.

Because of stats like these, I still believe a full annual health check is good for everyone. I also believe in the value of a mini checkup when any significant changes or upgrades take place. Mini checkups are easy once an annual health check has been done. I would suggest doing one at the six-month mark between full checks if you haven’t done one for other reasons.
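To make that cadence concrete, here is a minimal sketch in Python of how such a schedule could be laid out. The function names (`add_months`, `plan_checks`) and the labels are hypothetical, and the logic reflects one reading of the recommendation above: a full check every 12 months, a mini checkup around each significant change (simplified here to a single checkup on the change date), and a six-month mini checkup only when no change-driven checkup happened in that period.

```python
from datetime import date

def add_months(d, months):
    """Shift a date by whole months. The day is clamped to 28 so the
    result is always a valid date (a simplification for this sketch)."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    return date(year, month, min(d.day, 28))

def plan_checks(last_full_check, change_dates=()):
    """Return a sorted list of (date, kind) entries covering the year
    that follows the last full health check."""
    next_full = add_months(last_full_check, 12)
    plan = [(next_full, "full health check")]

    # One mini checkup per significant change inside the window.
    minis = [
        (changed, "mini checkup (after change)")
        for changed in change_dates
        if last_full_check < changed < next_full
    ]

    # Assumption: the six-month checkup is only needed when no
    # change-driven mini checkup took place during the year.
    if not minis:
        minis.append((add_months(last_full_check, 6), "mini checkup (midpoint)"))

    return sorted(plan + minis)

# Example: last full check on the day this article was published,
# with one hardware upgrade planned in May.
for when, kind in plan_checks(date(2015, 2, 19), [date(2015, 5, 4)]):
    print(when.isoformat(), kind)
```

Your own rate of change and growth should drive the real schedule; the point of the sketch is only that the rules are simple enough to write down and track.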

 

 

What kinds of things can happen if I don’t do health checks?

The scary answer here is, almost anything could happen! Because systems are so interdependent, even down to applications in some cases, one seemingly minor change can have a significant impact in unexpected areas.

  • A driver update in VMware had a compatibility issue with a specific firmware revision on a Cisco UCS Blade server. This not only put the server into an unsupported configuration but, more than two months later, caused it to lose all connectivity without warning and drop a critical system.
  • New servers were ordered and installed in a Cisco UCS Blade system, but the firmware on the new blades was newer than the firmware in use. This kept the blades from being fully discovered and left them in a state that should not have been used for production.
  • A firmware issue on specific hard drives was found and published. It caused drives to show as failed when there was no actual failure. The issue was very random, but the odds of seeing it increased with the age of the drive. Suddenly drives began showing as failed at an alarming rate, and data loss became a real threat.

These are just a few real-world examples of recent issues I have personally run into. The possibilities are endless, and in many cases they are nearly impossible to trace to a root cause. The first example was only found through the assistance of both the Cisco and VMware support teams, along with hours of scouring logs.
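The first two examples come down to a driver and firmware combination that was not on the supported list. As a rough illustration of what a health check compares, the Python sketch below tests an inventory against a compatibility matrix. Everything in it is hypothetical: the `SUPPORTED` matrix, the `Host` type, the host names, and the version strings are invented for illustration and do not correspond to any actual VMware or Cisco release. In practice the matrix would come from the vendors' published compatibility guides.

```python
from dataclasses import dataclass

# Hypothetical compatibility matrix: driver version -> firmware versions
# it is supported with. All version strings are made up.
SUPPORTED = {
    "driver-1.0": {"fw-2.0", "fw-2.1"},
    "driver-1.1": {"fw-2.1", "fw-2.2"},
}

@dataclass(frozen=True)
class Host:
    name: str
    driver: str
    firmware: str

def find_unsupported(hosts, matrix):
    """Return (host name, reason) pairs for every host whose
    driver/firmware pair does not appear in the matrix."""
    findings = []
    for host in hosts:
        allowed = matrix.get(host.driver)
        if allowed is None:
            findings.append((host.name, f"driver {host.driver} is not in the matrix"))
        elif host.firmware not in allowed:
            findings.append(
                (host.name, f"{host.driver} is not supported with {host.firmware}")
            )
    return findings

# Hypothetical inventory collected during a health check.
inventory = [
    Host("blade-01", "driver-1.0", "fw-2.1"),  # supported pair
    Host("blade-02", "driver-1.1", "fw-2.0"),  # driver updated, firmware left behind
    Host("blade-03", "driver-0.9", "fw-2.0"),  # driver missing from the matrix
]

for name, reason in find_unsupported(inventory, SUPPORTED):
    print(f"{name}: {reason}")
```

A check like this only flags the mismatch. Deciding whether to update the firmware, roll back the driver, or accept the risk is still part of the review.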

Who should be involved in a health check and how is it done?

Both you and your data center partner should be involved. The actual health check should be performed by a partner who is very familiar with current best practices, updates, and threats, and who has extensive experience with the systems involved. There is an advantage to the reviewers not being people who are on your systems every day: someone who already "knows" what the settings are is more likely to miss something.

For your part, you will need to provide any information the reviewers need, along with any concerns or questions you might have. This is the perfect opportunity for the reviewers to add any special concerns to the standard check. Finally, you will need to be involved in the review of the results. The reviewers will prepare a report and discuss the results with you. A health check does not fix any issues, but it does identify the issues that need to be addressed. The report review is critical to the success of the check, and you should be prepared to ask questions and begin to form an action plan to address any issues found. We would all love to have this come back saying everything is perfect and there are no issues. I have done these for years and I am still waiting for my first perfect review.

While health checks are a critical part of maintaining IT infrastructure, they are often overlooked and undervalued until a serious issue comes up that could have been avoided. You usually can’t be in IT for very long before you start seeing those issues, and nobody wants to be there when that happens.

Article courtesy of Bill Baker, Consolidated Communications Senior Solutions Engineer


About Bill Baker

 With over 20 years of experience assessing and diagnosing networks, Bill is a Senior Solutions Engineer at Enventis. He currently spends his time focused on all things data center. Bill has been a home-brewer in his free time since before it was cool, and has also been running a successful side business breeding reptiles.
