In this article, we explore the concept of the “Tragedy of the Commons” as it relates to data centres and server rooms. Before we begin, a brief background.
The tragedy of the commons is where a common resource, which is neither publicly nor privately owned, is depleted or becomes poorly managed to the detriment of all users.
In economic science, the tragedy of the commons is a situation in which individual users, who have open access to a resource unhampered by shared social structures or formal rules that govern access and use, act independently according to their own self-interest and, contrary to the common good of all users, cause depletion of the resource through their uncoordinated action. The concept originated in an essay written in 1833 by the British economist William Forster Lloyd, who used a hypothetical example of the effects of unregulated grazing on common land (also known as a "common") in Great Britain and Ireland. The concept became widely known as the "tragedy of the commons" over a century later after an article written by Garrett Hardin in 1968.
IT’s “Tragedy of the Commons”
What do we mean by IT’s “Tragedy of the Commons”? I’m sure we have all been into unlocked server or communications cupboards where cabling appears to have been thrown in, the room is full of boxes or is used as a storage facility by people who don’t know better, and everything is generally poorly managed.
It’s more prevalent in branch offices, or wherever there are no IT staff around to get antsy about it, but it can also happen in environments where staff are firefighting IT problems and never seem to have the time to deal with the issue.
However, that box or item that is “stored” in a server room or communications cupboard can actually cause air flow issues, which can lead to failure of IT components or increased energy consumption.
Figure 1 - Server Room, Cable Management Fail
I visited a university data centre a few years ago to conduct an assessment that covered operational efficiency as well as energy efficiency. We use an audit technique called a RAG (Red, Amber, Green), which essentially identifies problems and applies a severity level to each issue.
This particular site had multiple entries on the RAG in every category. The principal issues were that one of the UPS units had been offline for nearly a year awaiting a part, leaving the entire data centre running in a reduced-resiliency mode; there was no formal agreement with the on-site facilities manager regarding a maintenance regime; and the fire management protocol was very poor, among many other problems. One thing that stood out was that the fire suppression room, where the gas was stored, was filled with rubbish, old boxes and ancillary IT equipment (server arms, old PCBs and the like). In conjunction with the poor fire protocols, it was an accident waiting to happen!
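To make that concrete, here is a minimal sketch of how RAG findings could be recorded and summarised during an assessment. The field names, categories and severity ordering are my own illustrative assumptions, not a formal audit template:

```python
from dataclasses import dataclass
from enum import Enum
from collections import Counter

class Rag(Enum):
    RED = 3    # severe: immediate risk to availability or safety
    AMBER = 2  # significant: needs a remediation plan and an owner
    GREEN = 1  # minor: note and monitor

@dataclass
class Finding:
    category: str   # e.g. "Power", "Fire", "Documentation"
    issue: str
    severity: Rag

# Example entries based on the university site described above
findings = [
    Finding("Power", "UPS offline for ~1 year awaiting a part; site in reduced resiliency", Rag.RED),
    Finding("Maintenance", "No formal maintenance agreement with the on-site facilities manager", Rag.AMBER),
    Finding("Fire", "Fire suppression room used to store rubbish and old IT equipment", Rag.RED),
]

# Worst issues first, then a count per severity for the report summary
for f in sorted(findings, key=lambda f: f.severity.value, reverse=True):
    print(f"[{f.severity.name}] {f.category}: {f.issue}")
print(Counter(f.severity.name for f in findings))
```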
I’ve also been to colocation data centres where the business model is merely to provide power, space and cooling, and where external IT personnel are allowed in via a change control ticket to install, remove or maintain IT systems on behalf of customers. These are sometimes the worst of the lot!
I’ve seen IT equipment powered by wall sockets; I’ve seen multiple extension leads plugged into other extension leads being used to power mission-critical IT equipment; I’ve seen racks installed in cages that do not adhere to air flow management protocols; and I’ve seen rubbish stacked up in the white space. All of this is poor management and adds risk to our operations.
It also blocks capacity. How many times have you heard “I can’t install that server, there isn’t enough power or space”? A poorly run server room will suffer from a lack of capacity simply because records are not being updated (see the documentation section below), or because available slots (power/network) are assumed to be in use by long-decommissioned kit; you’d never know, because they are filled with cables and the records haven’t been updated!
Figure 2 - Another Example of the "Tragedy of the Commons"
It happens when there are poor controls on the management of the server room or data centre, but personal responsibility plays a part too. Far too many people just do what is needed to get a service up and running, and this results in problems later down the line, which then require downtime to “fix”. The mantra should be “get it right, first time”.
There may also be a lack of training for technicians, which can be addressed by sending them on some of the training courses available globally.
But, in my opinion, the biggest problem is the lack of a thorough installation inspection and check by other technicians or managers, which would stop most of this from happening. So, if you manage a data centre and bring in internal or external IT staff to install equipment, make sure you have a process in place to check installation and maintenance work.
How to stop it!
All technology rooms, be they server cupboards or data centres, need to be locked and have access control protocols applied. Put simply, the only people who should be allowed in are those with business in the room; it shouldn’t be a free-for-all with unrestricted entry.
If any work is done inside your facility, check it upon completion and get it rectified if necessary, and make sure people know that they cannot leave site without their work being checked, either by other on-site technicians or remotely via video call.
Documentation matters just as much. One of the EU Code of Conduct for Data Centres (Energy Efficiency) best practices puts it this way:
“Ensure that high quality, accurate O&M manuals, As-Built records, commissioning records, schematics and single line diagrams are available in order to enable all installed infrastructure and equipment to be maintained as originally designed and operated at optimum levels of efficiency.
Accurate documentation and records are essential to the correct operation and use of energy efficiency functions built-in by equipment manufacturers.
Updates should be made whenever any settings are changed or equipment is added, replaced or modified. Historical records should also be kept.
Effective commissioning and delivery of detailed and accurate documentation should be a key part of any project handover.
Note: EN 50600-3-1 can be referenced for more detail on this area.”
By referencing this best practice, as well as ITIL, you have an excellent reason to implement a robust post-installation checklist that will hopefully prevent the kind of examples shown earlier.
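What goes on such a checklist will differ from site to site, but as a hypothetical sketch of how sign-off might be enforced (the items and the function below are illustrative assumptions, not drawn from the Code or ITIL):

```python
# Hypothetical post-installation checklist: every item must be signed off
# before the visiting engineer leaves site (items are illustrative only).
CHECKLIST = [
    "Cabling dressed and labelled at both ends",
    "Blanking panels refitted; no gaps in the rack face",
    "Equipment powered from rack PDUs, not wall sockets or daisy-chained extension leads",
    "Packaging and waste removed from the white space",
    "As-built records, schematics and asset register updated",
]

def sign_off(results):
    """Return True only if every checklist item has been verified."""
    missing = [item for item in CHECKLIST if not results.get(item, False)]
    for item in missing:
        print(f"FAILED / NOT CHECKED: {item}")
    return not missing

# Example: the engineer forgot to update the records
if not sign_off({item: True for item in CHECKLIST[:-1]}):
    print("Work not accepted - rectify before leaving site.")
```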
On my many assessments, I can usually tell a well-run facility from a poorly run one merely by looking at where the IT staff are located. Too many times I have entered a messy room, with boxes, cables and the innards of servers and PCs lying around; if it is like that in their own room, you can imagine what it is like in the white space.
Cabling as shown in the examples blocks air from leaving the servers, which will lead to hot spots or failures, and almost certainly means that the air conditioning units are struggling to maintain the appropriate ASHRAE-recommended temperature and humidity ranges, and as a result they will use more energy.
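For illustration only, here is a simple sketch of flagging server inlet readings that fall outside the ASHRAE recommended envelope. The 18 to 27 °C dry-bulb band is the commonly quoted recommended range; the humidity bounds below are placeholder assumptions that should be adjusted to the ASHRAE edition you work to:

```python
# Illustrative check of server inlet readings against the ASHRAE recommended
# envelope. 18-27 degC is the widely quoted recommended dry-bulb band; the
# humidity limits below are placeholder assumptions.
RECOMMENDED_TEMP_C = (18.0, 27.0)
RECOMMENDED_RH_PCT = (20.0, 60.0)   # assumption - check the edition you work to

def check_inlet(name, temp_c, rh_pct):
    """Return a list of human-readable issues for one inlet reading."""
    issues = []
    if not RECOMMENDED_TEMP_C[0] <= temp_c <= RECOMMENDED_TEMP_C[1]:
        issues.append(f"{name}: inlet {temp_c:.1f} degC outside recommended range")
    if not RECOMMENDED_RH_PCT[0] <= rh_pct <= RECOMMENDED_RH_PCT[1]:
        issues.append(f"{name}: relative humidity {rh_pct:.0f}% outside recommended range")
    return issues

# Blocked exhausts or obstructed airflow typically show up as a hot spot
# at a nearby inlet, e.g.:
for problem in check_inlet("Rack 14 / U20", temp_c=31.5, rh_pct=42):
    print(problem)
```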
In a previous article we highlighted that small server rooms and data centres could be using up to 12% of the UK’s electricity consumption, and poor management is the primary cause of this.
I only mentioned one of the 160 EU Code of Conduct for Data Centres (Energy Efficiency) best practices earlier, but the prudent application of the Code will yield significant energy savings and get you started on the road to ICT sustainability, so use it!
The 12th Edition can be downloaded from https://e3p.jrc.ec.europa.eu/communities/data-centres-code-conduct, but the 13th Edition (2022) is currently in production and will be published in Q1 2022. In this edition, some of the optional best practices will become mandatory for “new build and retro-fit” projects, and some of the metrics will become reportable.
As soon as the EUCOC 13th Edition is published, we’ll be publishing a “What’s New in the EU Code of Conduct for Data Centres (Energy Efficiency)” article detailing the changes.