Infrastructure audit and disaster recovery in 45 minutes

When a routine audit uncovers a hidden fourth server and a power outage paralyzes a store, thorough documentation proves critical. How a 45-minute recovery saved a company's ERP and PBX from extended downtime.

In January 2019, I was contacted for what seemed routine: audit the Linux infrastructure of a commercial company. What I discovered during those first hours of analysis would completely change the project’s direction. The servers were hiding secrets that not even the client knew.

Weeks later, when a power outage left the ERP and PBX offline, those secrets would become the difference between a 45-minute recovery and days of business interruption.

Linux server architecture with KVM virtualization, DRBD distributed filesystem, and Asterisk PBX

The scenario: when lack of documentation is the problem

The client needed to transfer maintenance of critical servers from their outgoing provider. It was a commercial company with a physical store. The initial scenario was puzzling: virtually non-existent technical documentation, unexplained configurations, and the feeling that something important was missing from the puzzle.

My mission was clear: discover what was really there, document everything, and assess whether the infrastructure could continue or needed urgent migration.

Obsolete operating system

Debian Wheezy without security updates since May 2018. Months of unpatched vulnerabilities accumulating on a production system.

Non-existent documentation

No network maps, no service diagrams, no recovery procedures. Everything to discover.

The discovery: nothing was what it seemed

When “mail server” means something completely different

The first server I analyzed supposedly managed email. But 15 minutes into the investigation, something didn’t add up. Postfix was installed, yes, but it only forwarded internal messages. There was no IMAP service, no user mailboxes. Where was the real email?

What I did find was much more interesting: a Windows XP virtual machine running on KVM with a DRBD distributed filesystem that theoretically replicated to a secondary server… which was offline.
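For readers wondering what this kind of discovery looks like in practice, a handful of generic commands is usually enough to surface a running guest and its storage on an undocumented host. A minimal sketch of that kind of check (not the literal session from the audit):

    # Any hypervisor processes running? The bracketed pattern keeps grep out of its own results
    ps aux | grep -Ei '[k]vm|[q]emu'

    # If libvirt manages the guests it will list them; on a hand-rolled setup this may return nothing
    virsh list --all

    # What actually sits on the storage backing the guest images?
    ls -lh /var/kvm-images/0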

The “aha” moment

A Linux server running a Windows XP virtual machine with a critical ERP, on top of a distributed filesystem… whose secondary node had been inactive for months. The “high availability” was an illusion.

Discovered configuration:

  • Hardware: Dell PowerEdge R210 II, Intel i3-2100, 4GB RAM
  • Virtualization: KVM running Windows XP for the ERP
  • Storage: DRBD mounted at /var/kvm-images/0
  • Critical problem: DRBD secondary node inoperative, no real redundancy (a verification sketch follows this summary)
  • Operating system: Debian Wheezy (unsupported since May 2018)
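That critical problem is easy to confirm on a DRBD 8.x host like this one. A quick sketch of the kind of verification involved (resource names vary per installation):

    # Connection state and role of every DRBD resource
    cat /proc/drbd
    drbdadm cstate all
    drbdadm role all

    # A healthy pair reports "Connected" and "Primary/Secondary".
    # "WFConnection" or "StandAlone" with an unknown peer means nothing is replicating.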

I documented the exact startup command for the virtual machine. This would prove critical weeks later when the system collapsed and I had to manually start it during a crisis.
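The documented command itself isn't reproduced here, but on a Wheezy-era host a hand-started guest typically looks something like the sketch below. The guest name, disk image filename, memory size, and network options are illustrative assumptions, not the client's actual values:

    # Hypothetical example of a manually launched KVM guest backed by the DRBD mount
    kvm \
      -name erp-winxp \
      -m 2048 -smp 2 \
      -drive file=/var/kvm-images/0/winxp.img,if=ide,cache=none \
      -net nic,model=rtl8139 -net tap \
      -vnc :1 \
      -daemonize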

The PBX: Asterisk at the limit

The second server was more straightforward: an Asterisk PBX with MySQL managing extensions and call records. It worked, but it ran on the same obsolete Debian Wheezy: the company's entire phone infrastructure depended on software that had gone months without security patches. A few quick health checks are sketched after the summary below.

  • Asterisk PBX: version 1.8, MySQL backend, ~15 SIP extensions
  • Hardware: Dell R210 II, 2GB RAM, 465GB RAID 1
  • Risk: no failover, no high availability, obsolete OS
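A few non-invasive commands are enough to confirm that picture on an Asterisk 1.8 box. A minimal sketch (credentials and database names vary by installation):

    # Confirm the running version from the Asterisk console
    asterisk -rx "core show version"

    # chan_sip view of extensions and trunks (1.8 still uses chan_sip)
    asterisk -rx "sip show peers"
    asterisk -rx "sip show registry"

    # Is the MySQL backend for extensions and call records up?
    mysqladmin -u root -p status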

The ghost server: when you find what nobody knew existed

But the email remained a mystery. The domain’s MX records pointed to an IP, but that Linux server had no IMAP or POP3 listening. Postfix logs showed only internal forwards. Where the hell was the real email?

I spent days investigating, verifying configurations, analyzing network traffic. Finally, I confronted the outgoing provider with the facts: “This server doesn’t manage email. There has to be another server somewhere.”
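The mismatch is easy to demonstrate with a few standard tools. A sketch of that kind of verification, with example.com standing in for the client's real domain:

    # Where does the domain's mail actually go?
    dig +short MX example.com

    # Is anything on this host listening on SMTP/IMAP/POP3 ports?
    ss -tlnp | grep -E ':(25|110|143|465|587|993|995)\b'

    # Is Postfix configured as a full mail host or just an internal relay?
    postconf -n | grep -E 'mydestination|relayhost|inet_interfaces'

    # What has it actually been handling lately?
    tail -n 50 /var/log/mail.log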

The moment of truth

“We finally discovered that there was a fourth server at the client’s facilities… The mail service uses Postfix as MTA and Dovecot as MDA/LDA and auth, with MySQL as backend.”

— Provider’s response after physically verifying the premises

This saved the project. If I had assumed the first server managed email and started a migration, I would have taken down the company’s entire email. Exhaustive investigation prevented disaster.

Service discovery and undocumented servers diagram

With all infrastructure documented, I delivered a complete report to the client with clear recommendations: migrate email to the cloud, keep Asterisk but plan its update, and above all, be aware that DRBD “high availability” was fiction.

  • Email to the cloud: Google Workspace, Microsoft 365, or AWS WorkMail would eliminate dependence on obsolete physical hardware
  • Fictional DRBD: the inoperative secondary node turned the “distributed system” into a dangerous single point of failure

When the crisis arrives: 45 minutes to save a company

February 21, 2019, 9:30 AM

A power outage took down every server. When power returned, nothing started automatically. The ERP managing the entire store: down. The PBX handling commercial calls: silent. The company was paralyzed.

The client called me urgently. The voice on the other end conveyed real concern: “We need this working now. Every hour that passes means lost sales.”

When documentation meets urgency

Having thoroughly documented everything during the audit meant I could quickly identify what to check and where the issue might be. Without that prior work, troubleshooting would have taken significantly longer.

Problem 1: The invisible ERP

The diagnosis was quick: the DRBD filesystem wasn’t mounting automatically because someone had commented out the line in /etc/fstab. Without the mounted disk, KVM couldn’t start the Windows XP virtual machine containing the Sage ERP.

Thanks to my exhaustive documentation from weeks before, I knew exactly what to do (a consolidated command sketch follows these steps):

  • Uncomment /etc/fstab: restore the line that automatically mounted the DRBD device
  • Manually mount DRBD: execute mount /dev/drbd0 /var/kvm-images/0
  • Start the VM with KVM: use the exact command documented during the audit
  • ERP operational: 30 minutes from the call to a working store
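Put together, the recovery boils down to a handful of commands. The sketch below assumes an ext4 filesystem on /dev/drbd0 and a DRBD resource named r0; both are illustrative, the real values came from the audit notes:

    # 1. Restore the commented-out entry in /etc/fstab (illustrative line)
    # /dev/drbd0  /var/kvm-images/0  ext4  defaults  0  0

    # 2. If the node came back as Secondary after the outage, promote it before mounting
    drbdadm primary r0

    # 3. Mount the DRBD device that holds the guest images
    mount /dev/drbd0 /var/kvm-images/0

    # 4. Launch the Windows XP guest with the exact command recorded during the audit
    #    (not reproduced here)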

Problem 2: The amnesiac PBX

With the ERP up, the PBX remained. The problem here was different: the Asterisk server had changed its IP via DHCP, from 10.100.100.119 to 192.168.1.52. All office phones were statically configured to connect to the original IP, so none of them worked.

Reconfiguring each terminal would have taken hours. The elegant solution was creating a virtual interface with the original IP:

ip addr add 10.100.100.119/24 dev eth0 label eth0:0

Elegance under pressure

The server now responded on both IPs simultaneously. Phones kept connecting to their configured IP, and everything worked without touching a single terminal. 15 minutes from diagnosis to fully operational PBX.
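To keep that fix from evaporating at the next reboot, the extra address can be made persistent. A minimal sketch for Debian's classic /etc/network/interfaces, using the addresses from this incident; the netmask and gateway are illustrative assumptions:

    # /etc/network/interfaces: pin eth0 statically and keep the legacy-style alias
    auto eth0
    iface eth0 inet static
        address 192.168.1.52
        netmask 255.255.255.0
        gateway 192.168.1.1

    auto eth0:0
    iface eth0:0 inet static
        address 10.100.100.119
        netmask 255.255.255.0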

The result: 45 minutes and two services saved

  • 45 minutes: total recovery time
  • 2: critical services restored
  • 0: data loss

Audit: I discovered a fourth undocumented server and prevented a disastrous migration. I identified critical systems running on obsolete infrastructure with no real redundancy.

Crisis: when the power outage brought everything down, exhaustive documentation turned what could have been days of interruption into 45 minutes of surgical recovery.

Lesson: in critical infrastructure, meticulous documentation and deep knowledge aren't luxuries; they're the difference between an operating company and a paralyzed one.

What we learned (and what saved the day)

  • Exhaustive documentation saves lives: without it, recovery would have taken days; with it, 45 minutes.
  • DHCP in production is dangerous: critical servers need static IPs, period.
  • Never assume anything without verification: the “mail server” didn't manage email, and uncovering the hidden fourth server prevented disaster.

Is your infrastructure ready for a crisis?

If your business depends on critical systems:

  • Infrastructure audits that discover what you really have (even hidden servers)
  • Exhaustive documentation that turns days-long recoveries into minutes
  • Disaster recovery when every minute counts and the business is paralyzed
  • Legacy systems forensic analysis with undocumented complex architectures
  • Modernization recommendations to migrate to cloud without interrupting operations

With over 20 years of Linux systems administration experience, I turn undocumented infrastructures into known, predictable, and recoverable systems.

Specialized in KVM virtualization, distributed filesystems, Asterisk PBX, and emergency recovery when seconds count.

Get in touch →

About the author

Daniel López Azaña

20+ Years Experience · AWS & GCP Certified · AI/LLM Specialist

Tech entrepreneur and cloud architect with over 20 years of experience transforming infrastructures and automating processes. Specialist in AI/LLM integration, Rust and Python development, and AWS & GCP architecture. Restless mind, idea generator, and passionate about technological innovation and AI.

