Documentation might not be evil, per se, but I have never heard the words, "Oh my gosh! I get to document my implementation. That is so exciting!" or "I love filling out tickets, change control forms, CMDBs, etc.!" If someone who feels this way is out there, you are certainly a rare breed and could probably find a job anywhere, at any time. For the rest of us, documentation is an often-dreaded afterthought, once the great fun of design and deployment is over.
However, documentation is very important. Without it, infrastructures, applications, and security function as a black box. The environment is a mystery to anyone who did not implement the solution from the virtual ground up. While some might consider this job security, it is not the best way to run an enterprise. At this point, you might be thinking of the many times documentation would have been useful to you. But, let me supply some examples nonetheless.
A prime example is the datacenter move. Whether the move is due to a merger or acquisition, or it is simply time to relocate to a better-suited site, knowing the ins and outs of the infrastructure and applications is critical to a successful migration. So many times the project starts with the question, "What is running over there?" followed by, "Who is using it?" Once those questions have been answered satisfactorily, discovering the physical and logical layouts of the datacenter and its applications must be tackled. Generally, the project is launched into an investigative phase that lasts months or years. The actual move usually takes no longer than a few scheduled weekends. Of course, scheduled outage windows do not guarantee success. Surprises occur, and the IT organization spends thirty hours with all hands on deck delving into the system looking for the cause of the issue. If the answer is not found, the application must be rolled back to the original datacenter. Sometimes the migration project never moves past the discovery phase because the entire task is too daunting. Enterprises are then stuck in costly datacenters on aging and insecure infrastructure.
Another important situation is the moment disaster strikes. Maybe a single system or application has failed. Maybe a natural disaster or power event has affected the entire datacenter. Regardless, it is time to bring production workloads online in a short amount of time. But how do we do this? We might have backups to restore (fingers crossed that the restore actually works). Or data could be replicated to a secondary or tertiary site, waiting to be powered on as systems. Even if we succeed in bringing the individual servers online, that only puts our environment into a crash-consistent state. The application might not yet be functional in this scenario. In addition to assessing the health of the environment, we must identify where IPs and host names are set, what the effect is if those values are changed, what inter-system communication occurs and how, what startup dependencies exist either internally or externally, and so on. Stress is already high in a disaster. Think how peaceful it would be to have the recovery work completely documented.
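To make the startup-dependency point concrete, here is a minimal sketch of what documented dependencies could look like when written down as data. The system names and their relationships are hypothetical, invented purely for illustration; the only library used is `graphlib` from the Python standard library (3.9 and later). Once the dependencies are recorded, a safe power-on order falls out mechanically.

```python
from graphlib import TopologicalSorter

# Hypothetical inventory: each system maps to the systems that must be
# running before it can start. Names and relationships are illustrative.
STARTUP_DEPENDENCIES = {
    "dns": [],
    "database": ["dns"],
    "app-server": ["database", "dns"],
    "web-frontend": ["app-server"],
}


def recovery_order(dependencies):
    """Return a startup order in which every system follows its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, system in enumerate(recovery_order(STARTUP_DEPENDENCIES), start=1):
        print(f"{step}. start {system}")
```

If someone records a circular dependency, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding to have before the disaster rather than during it.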
The above examples were extreme. However, other very important needs for documentation arise, no matter how trivial they seem at first glance. It might be time to promote an application into production that developers have tinkered with for the last six months in a test environment. Go-live is in a week. How is that accomplished without thorough documentation? An application might already be in production, but it is time to integrate it with another system, enhance the security, or just upgrade the application. Without proper documentation of the system configuration, well, the discovery work will far surpass the originally planned project.
I could provide statistics on how much money companies can save in datacenter migrations and consolidations alone: millions of dollars, often just in power savings. Or we could consider the likelihood of a large enterprise sustaining a profitable business model after unsuccessfully recovering from a disaster, which is in the low tens of percent. But it is more interesting to share a really irritating experience I had because I did not provide detailed and clear instructions on an upgrade I was leading.
Years ago, when Active Directory-integrated DNS was only four years old, I joined a company that still used BIND on its UNIX servers for internal DNS, alongside an Active Directory forest and Windows DHCP. After convincing everyone that AD-integrated DNS was secure and reliable, I migrated the company to AD DNS during a forest-wide Windows 2003 upgrade. I performed the upgrades at the main datacenter and then gave my other team members instructions for replacing the Windows 2000 domain controllers in the field. I basically said, “Check the box to install DNS and choose AD integrated.”
A few weeks after the transition, support calls started rolling in that e-mail wasn’t working. We all know e-mail is the most important application in a company, regardless of where it is tiered on the official application inventory. So, this was a big deal. We discovered that e-mail was malfunctioning because the MX record had mysteriously disappeared from the forward lookup zone. I quickly re-entered the record by hand in the management console, and e-mail was back online. However, this kept happening every week or two. My credibility was on the line, not to mention that I was not getting enough sleep during my on-call rotation with e-mail going down at a global company.
After reading through many, many AD log files, I finally pieced together that this happened whenever domain controllers were replaced in remote locations. In speaking with the person who touched the last DC, I discovered he had created the Forward and Reverse Lookup zones manually after the promotion instead of letting them replicate on their own (as AD-integrated DNS is designed to function). The A and PTR records would remain, but other useful records in AD DNS, like MX records, would not reappear. You can imagine my fury at the revelation! But it could all have been avoided if I had explicitly documented the replacement steps for the administrators instead of telling them in passing, “Install AD Integrated DNS.”
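A documented baseline would also have shortened the hunt for the missing record. The sketch below is not what we ran at the time; it is a hypothetical illustration, with made-up record names and addresses, of comparing the records a zone is documented to contain against whatever is observed after a change. How the observed set is collected (a zone export, a management tool, and so on) is left as an assumption.

```python
# Hypothetical documented baseline for a forward lookup zone.
# Every name and address here is invented for illustration.
EXPECTED_RECORDS = {
    ("example.com.", "MX", "10 mail.example.com."),
    ("mail.example.com.", "A", "192.0.2.25"),
}


def missing_records(expected, observed):
    """Return the documented records absent from the observed zone, sorted."""
    return sorted(set(expected) - set(observed))


# Assumed state after a domain controller replacement: the A record
# survived, but the MX record is gone.
observed_after_dc_replacement = {
    ("mail.example.com.", "A", "192.0.2.25"),
}

if __name__ == "__main__":
    for name, record_type, value in missing_records(
        EXPECTED_RECORDS, observed_after_dc_replacement
    ):
        print(f"MISSING: {name} {record_type} {value}")
```

Running a comparison like this after each replacement would flag the missing MX record before the first support call.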
Why am I writing about documentation in the cloud era? This is the age of third-generation applications, and mobile and social apps are the only types that we wish to develop. Self-healing infrastructures are available, and environments automatically deploy more systems to support the user load as thresholds are reached. This is all true, but it does not excuse us from documenting. One cannot automate something until the process is defined. And how is the process defined? It is documented; otherwise it is ambiguous and open to interpretation. This is true for any deployment to the hybrid cloud, as many options exist. Explicit documentation ensures the system configuration is clear.
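One way to picture "defined before automated" is to write the process down once as data and let people and tooling read the same definition. The sketch below is only an illustration: the `Step` type, the `render_runbook` helper, and the step wording (borrowed loosely from my domain controller story) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One explicit, unambiguous step of a documented procedure."""
    title: str
    detail: str


# Hypothetical procedure; the wording is illustrative, not a vendor checklist.
DC_REPLACEMENT = [
    Step("Install DNS", "Check the box to install DNS and choose AD integrated."),
    Step("Wait for replication", "Do not create lookup zones manually; let them replicate."),
    Step("Verify records", "Confirm the MX record is present in the forward lookup zone."),
]


def render_runbook(name, steps):
    """Render the steps as a numbered, human-readable runbook."""
    lines = [name]
    lines += [f"{number}. {step.title}: {step.detail}" for number, step in enumerate(steps, start=1)]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_runbook("Domain controller replacement", DC_REPLACEMENT))
```

The same list could later drive an automated job, which keeps the documentation and the automation from drifting apart.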
I recently visited with a few customers who confirmed my suspicion that new services such as self-service provisioning, automation, and network and security virtualization will cause issues without proper documentation. Once the application is brought online and runs in the cloud, the infrastructure and security teams must reverse engineer the implementation to ensure they understand the building block components, the configuration, and the overall support mechanisms required. I wouldn’t go so far as to say the cloud era brings more work to these teams, but it does leave them in a precarious position for day two operations and support.
During a presentation about cloud native applications, I recently asked a co-worker, “How does developer and application promotion automation include documentation?” He thought for a moment and said, “That is a really good question that I don’t know the answer to.” He is one of the brightest minds I know on this subject, so if he is not aware of a solution, it might not exist. And we must resolve this soon. Otherwise, we will continue to suffer uncharted applications and datacenters. As folks retire or move on, the knowledge of how systems function will be lost. Future workers will start from the beginning; they will reverse engineer the systems that generate millions, if not billions, of dollars in revenue. They will not be able to focus on innovation, whether to help the company’s profits or to implement emerging technologies. And this hurts everyone, the individual and the company alike.