Overview
These runbooks capture common operational workflows for the MOOD MNKY data center:
- Capturing or refreshing hardware snapshots.
- Onboarding a new node at the hardware and Proxmox levels.
- Expanding storage.
- Handling node failures or GPU replacement.
Runbook: Capture or refresh hardware snapshots
Purpose
Collect a comprehensive hardware and Proxmox snapshot on any node, suitable for auditing, troubleshooting, and documentation.
Prerequisites
- Root SSH or console access to the node.
- The collector script (`collect-node-hardware.sh`) present on the node.
Steps
1. SSH into the target node.
2. Run the collector script.
3. Confirm snapshot creation. You should see a path like `/root/hardware-snapshots/<NODE>/<TIMESTAMP>/`.
4. Record the snapshot location:
   - Use this path when updating node or storage/network documentation.
   - Optionally, archive the directory for long-term historical records.
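The steps above can be sketched as a small helper for confirming and locating the newest capture. This is a sketch, not part of the collector itself: `SNAP_ROOT` and the assumption that timestamped directory names sort lexicographically are both ours to verify.

```shell
# Sketch: locate the newest hardware snapshot for a node, based on the
# /root/hardware-snapshots/<NODE>/<TIMESTAMP>/ convention described above.
# SNAP_ROOT is overridable for testing or non-standard layouts.
SNAP_ROOT="${SNAP_ROOT:-/root/hardware-snapshots}"

latest_snapshot() {
  # Timestamped directory names sort lexicographically, so the last entry
  # after sorting is the most recent capture. Prints nothing if none exist.
  ls -1d "$SNAP_ROOT/$1"/*/ 2>/dev/null | sort | tail -n 1
}

# Typical use on a node, right after running the collector:
#   ./collect-node-hardware.sh
#   latest_snapshot CODE-MNKY
```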
Runbook: Onboard a new node
Purpose
Add a new physical node to the cluster and capture its initial hardware profile.
Steps (high level)
1. Install Proxmox VE:
   - Install Proxmox VE 8.x from ISO.
   - Configure EFI boot; disable Secure Boot.
   - Ensure networking reaches the existing cluster.
2. Join the Proxmox cluster:
   - From an existing node, generate cluster join information via the Proxmox UI or CLI.
   - On the new node, join the cluster using the provided command.
   - Verify cluster membership.
3. Configure storage:
   - Create ZFS pools as needed (e.g. a local pool and a dedicated data pool).
   - Add storage entries in Proxmox (Datacenter -> Storage, or via `pvesh`).
4. Run the hardware collector:
   - Use the snapshot runbook above.
   - Store the resulting snapshot under `/root/hardware-snapshots/<NODE>/<TIMESTAMP>/`.
5. Update documentation:
   - Add the node to the table in `/infra/data-center/nodes`.
   - Document new pools and mounts in `/infra/data-center/storage-and-network`.
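After joining, membership can be verified from the new node with standard Proxmox CLI commands; the storage names below are purely illustrative, and the exact `pvesh` parameters should be checked against your Proxmox version before use.

```shell
# Verify cluster membership from the new node (standard Proxmox commands;
# output details vary by cluster and version).
pvecm status    # quorum state and vote counts
pvecm nodes     # list of cluster members

# Storage entries can also be added from the CLI instead of the UI.
# Storage name (data-zfs) and pool name (DATA-zfs) are illustrative:
pvesh create /storage --storage data-zfs --type zfspool --pool DATA-zfs
```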
Runbook: Expand storage on a node
Purpose
Increase storage capacity on a node and reflect the change in both ZFS and documentation.
Steps
1. Plan the expansion:
   - Decide which pool (e.g. `CODE-MAIN-zfs`, `STUD-zfs`) to expand.
   - Confirm free drive bays, SATA/NVMe slots, and power constraints.
2. Install new disk(s):
   - Power down if required.
   - Install the disk(s) and boot.
3. Create or extend ZFS vdevs:
   - Use `zpool add` or similar commands to add new vdevs.
   - Verify the resulting pool layout and health.
4. Run the hardware collector:
   - Capture a fresh snapshot.
5. Update docs:
   - Adjust pool sizes and layout in `/infra/data-center/storage-and-network`.
   - Update the node summary in `/infra/data-center/nodes`.
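A typical vdev extension looks like the following; the pool and device names are illustrative, and since `zpool add` is effectively irreversible, confirm device identities with `lsblk` or `ls -l /dev/disk/by-id` before running anything.

```shell
# Sketch: extend an existing pool with a new mirror vdev.
# Pool name and disk IDs below are placeholders, not real devices.
zpool add CODE-MAIN-zfs mirror \
  /dev/disk/by-id/ata-NEWDISK1 \
  /dev/disk/by-id/ata-NEWDISK2

# Verify layout and health after the change:
zpool status CODE-MAIN-zfs
zpool list CODE-MAIN-zfs
```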
Runbook: Handle node failure
Purpose
Provide a structured response when a node becomes unavailable or unstable.
Steps (outline)
1. Confirm failure:
   - Use the Proxmox UI, `pvesh get /nodes`, and physical inspection.
2. Identify impact:
   - Determine which LXCs/VMs or services are affected (use `cluster/resources` and your LXC runbooks).
3. Mitigation:
   - Migrate or restart affected workloads on other nodes, respecting storage constraints.
4. Root cause, once the node is reachable again:
   - Run `collect-node-hardware.sh`.
   - Check ZFS health, SMART status, and logs.
5. Update docs:
   - Document the incident in an internal log or appendix.
   - Note any permanent hardware changes in the relevant node and storage sections.
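The confirm-and-assess steps can be run from any healthy cluster member with standard Proxmox CLI calls; output shape varies by version.

```shell
# Quick triage from a healthy cluster member.
pvesh get /nodes                         # per-node status (online/offline)
pvesh get /cluster/resources --type vm   # which VMs/LXCs run (or ran) where
```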
Runbook: MNKY-HQ rpool capacity monitoring and relief
Purpose
MNKY-HQ’s root pool (rpool) was ~93% full at last capture. This runbook helps monitor usage and plan capacity relief so the standalone node remains stable for pfSense and other VMs.
Prerequisites
- Root SSH or console access to MNKY-HQ (101.0.0.100).
Steps
1. Check current pool and dataset usage.
2. Identify large consumers:
   - Use `zfs list` output and `du` on key datasets (e.g. `rpool/ROOT/pve-1`, VM disk paths) to find what is using space.
   - Check Proxmox VM/LXC disk locations and sizes in the UI or via `qm list`/`pct list` and `qm config`/`pct config`.
3. Relief options (choose as appropriate):
   - Cleanup: remove old snapshots, ISO images, or logs; trim VM disks if over-provisioned.
   - Offload data: move large VM disks or backups to another pool or node if MNKY-HQ has additional storage.
   - Expand storage: add disk(s) and extend the pool (e.g. mirror expansion), or create a dedicated data pool and migrate VM storage; then run the hardware collector and update the MNKY-HQ node and storage/network docs.
4. Re-run the snapshot and update docs:
   - Run `collect-node-hardware.sh` on MNKY-HQ.
   - Update the MNKY-HQ node page and runbooks if layout or capacity guidance changes.
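The usage check can be sketched as follows; the `command -v` guard keeps the sketch copy-paste safe on non-ZFS hosts, and the 90% alert threshold in the helper is our assumption (chosen because the pool was already ~93%), not an official limit.

```shell
# On MNKY-HQ: current capacity and per-dataset usage (guarded so the
# sketch is harmless on hosts without ZFS tooling).
if command -v zpool >/dev/null 2>&1; then
  zpool list -H -o capacity rpool        # e.g. "93%"
  zfs list -r -o name,used,avail rpool   # per-dataset consumers
fi

# Hypothetical helper: succeeds when a capacity reading (as printed by
# `zpool list`, e.g. "93%") crosses an assumed 90% alert threshold.
needs_relief() {
  cap="${1%\%}"        # strip the trailing %
  [ "$cap" -ge 90 ]
}
```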
Runbook: GPU replacement on CODE-MNKY
Purpose
Safely replace or upgrade the GPU on CODE-MNKY while preserving documentation and minimizing downtime.
Steps
1. Pre-change state capture:
   - Run `nvidia-smi -q` and `collect-node-hardware.sh`.
   - Save the latest snapshot path.
2. Drain GPU workloads:
   - Stop or migrate GPU-dependent LXCs/VMs.
   - Confirm no critical GPU jobs are running.
3. Perform the hardware change:
   - Power down the node.
   - Replace or add the GPU.
   - Boot and confirm the new GPU is recognized.
4. Install drivers if needed:
   - Ensure the appropriate NVIDIA driver and CUDA version are installed and compatible.
5. Post-change snapshot:
   - Run `collect-node-hardware.sh` again.
   - Confirm `nvidia-smi -q` reflects the new GPU.
6. Update documentation:
   - Refresh the GPU section of `/infra/data-center/code-mnky-node`.
   - If capabilities changed significantly (e.g. memory size, architecture), note any constraints or new use cases.
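For the pre/post comparison, a concise query is often easier to diff than the full `-q` report; the output file path below is illustrative.

```shell
# Capture GPU state before and after the swap (standard nvidia-smi calls;
# /root/gpu-pre-change.txt is an illustrative path, not a convention).
nvidia-smi -q > /root/gpu-pre-change.txt

# A compact one-line summary that is easy to diff across the change:
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
```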
Keeping runbooks in sync
When you discover a better approach, an additional safety check, or a new edge case:
- Update the relevant runbook steps here.
- Cross-link to any detailed how-to docs in `proxmox-ansible/docs/` or elsewhere.
- Consider capturing before/after snapshots as examples for future incidents.