Restoring a Hive-Engine Full Node: How I Fought MongoDB's WiredTiger Throttle and Won


I thought this would be easy.

I’m upgrading my Hive Engine node from "Lite" to "Full." I have the hardware to make this trivial: a Dual Xeon Gold server boasting 80 logical cores and 64GB of RAM, backed by fast SSD storage.

I had a 250GB .archive snapshot ready to go. With that much horsepower, I figured I’d run a standard parallel mongorestore command, grab lunch, and be done by the afternoon.

Instead, I spent the last 24 hours fighting hidden bottlenecks, single-threaded legacy code, and MongoDB’s internal panic triggers.


Here is the autopsy of a restore gone wrong, the "Triple Pincer" method that finally broke the logjam, and the custom tracking script I wrote to keep my sanity.

Attempt 1: The "Sane Defaults" (ETA: 5 Days)

I started with what I thought was an aggressive command. I told mongorestore to use 16 parallel streams and handle decompression on the fly.

# The naive approach
mongorestore \
  -j=16 \
  --numInsertionWorkersPerCollection=16 \
  --bypassDocumentValidation \
  --drop \
  --gzip \
  --archive=hsc_snapshot.archive

I fired it up and watched the logs. It was moving... but barely.

The Diagnosis: A look at htop revealed the problem immediately. One single CPU core was pegged at 100%, while the other 79 were asleep.

The built-in --gzip flag in MongoDB tools is single-threaded. I had a Ferrari engine, and I was feeding it fuel through a coffee stirrer. It was crunching about 2GB per hour. At that rate, I'd be done next Tuesday.
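
If you want to confirm a bottleneck like this on your own box, a per-thread view makes it obvious. Here's a minimal check, assuming pidstat from the sysstat package is installed (htop with thread display enabled shows the same thing):

# Hypothetical sanity check: per-thread CPU usage of the restore, sampled every 5 seconds
pidstat -t -p "$(pgrep -x mongorestore)" 5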

Attempt 2: The pigz Pipe (CPU Unleashed)

If the built-in tool is the bottleneck, bypass it. I aborted the restore and switched to pigz (Parallel Implementation of GZip), which uses every available core to decompress the stream and pipes the decompressed archive straight into mongorestore's stdin.

# The "Nuclear" Option
pigz -dc hsc_snapshot.archive | mongorestore \
  --archive \
  -j=16 \
  --numInsertionWorkersPerCollection=10 \
  --bypassDocumentValidation \
  --drop

CPU usage skyrocketed across all 80 cores. The intake pipe was finally wide open. Data started flying into the database.

Until it didn't.

After about 20 minutes of high speed, the restore started stuttering. It would run fast for 10 seconds, then completely stall for 30 seconds. It was faster than Attempt 1, but painfully inconsistent.

The Real Enemy: The WiredTiger "Panic Button"

Why was my powerful server stuttering? It wasn't CPU anymore. I ran mongostat 1 to look under the hood of the database engine.

The "smoking gun" was in the dirty column. It was flatlining at 20%.

Here is what that means: MongoDB’s storage engine, WiredTiger, keeps data in RAM (dirty cache) before writing it to disk. It has safety triggers:

  1. At 5% dirty, background threads start lazily flushing data to disk.
  2. At 20% dirty, it hits the panic button. It decides the disk can't keep up. To prevent crashing, it forces the application threads (my restore workers) to stop inserting data and help flush the cache to disk instead.

My 80 cores were decompressing data so fast that the SSD drive couldn't swallow it quick enough. WiredTiger was throttling my CPU to protect the disk.
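
You don't have to wait for mongostat to catch it, either. Here's a rough way to pull the dirty ratio straight out of serverStatus; a minimal sketch, assuming mongosh is installed and mongod is listening on the default local port (the field names come from WiredTiger's cache section):

# Rough dirty-cache check: tracked dirty bytes versus the configured cache size
mongosh --quiet --eval '
  const c = db.serverStatus().wiredTiger.cache;
  const dirty = c["tracked dirty bytes in the cache"];
  const max = c["maximum bytes configured"];
  print("dirty cache: " + (100 * dirty / max).toFixed(1) + "%");
'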

I tried to tune this live with db.adminCommand, raising the panic threshold to 30%, but it didn't help much. I was stuck.

db.adminCommand({
  "setParameter": 1,
  "wiredTigerEngineRuntimeConfig": "eviction_dirty_target=20,eviction_dirty_trigger=30"
})

The Final Solution: The "Triple Pincer" Attack

If I couldn't tune the engine to accept one massive stream, I decided to overwhelm it with three smaller ones.

The Hive Engine database is dominated by two massive collections: hsc.chain and hsc.transactions. When restoring linearly, you hit lock contention as dozens of threads fight over the same collection lock while simultaneously fighting eviction threads.
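
If you want to see which collections dominate your own copy (say, on the existing lite node before wiping it), a rough mongosh one-off like this ranks them by storage size; it's only a sketch, and it assumes the database is named hsc, matching the restore commands below:

# Sketch: list collections in the hsc database, largest first (storage size in GB)
mongosh --quiet --eval '
  const d = db.getSiblingDB("hsc");
  d.getCollectionNames()
    .map(n => ({ name: n, gb: d.getCollection(n).stats().storageSize / 1e9 }))
    .sort((a, b) => b.gb - a.gb)
    .forEach(c => print(c.name + ": " + c.gb.toFixed(1) + " GB"));
'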

I aborted everything and launched three simultaneous restore processes in separate terminals.

(Screenshot: screenshot-20260102-074621.png — the three restore terminals running side by side, with mongostat in the top-right pane.)

Terminal 1 (The Chain):

pigz -dc hsc_snapshot.archive | mongorestore --archive --nsInclude="hsc.chain" --drop --numInsertionWorkersPerCollection=10 --bypassDocumentValidation

Terminal 2 (The Transactions):

pigz -dc hsc_snapshot.archive | mongorestore --archive --nsInclude="hsc.transactions" --drop --numInsertionWorkersPerCollection=10 --bypassDocumentValidation

Terminal 3 (Everything Else):

pigz -dc hsc_snapshot.archive | mongorestore --archive --nsExclude="hsc.chain" --nsExclude="hsc.transactions" --drop --numInsertionWorkersPerCollection=10 --bypassDocumentValidation

Why this works:
Yes, this reads the 250GB archive from disk three times simultaneously. But sequential reads are the cheap part of this workload; the SSD can stream the archive far faster than WiredTiger can digest the resulting writes, so the extra reads barely register.

By splitting the job, I broke the collection-level locks. The process restoring chain doesn't care if the transactions process is paused for cache eviction. It smoothed out the I/O pattern.
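
If you'd rather not babysit three terminals, the same split can be scripted. This is just a sketch of how I'd wrap it, assuming the archive sits in the working directory and mongod is local; the helper function and log file names are made up for the example:

#!/bin/bash
# Sketch: run the three-way split restore as background pipelines and wait for all of them.
set -euo pipefail

restore() {
  local logfile=$1; shift
  pigz -dc hsc_snapshot.archive | mongorestore --archive \
    --numInsertionWorkersPerCollection=10 --bypassDocumentValidation --drop \
    "$@" > "$logfile" 2>&1 &
}

restore chain.log        --nsInclude="hsc.chain"
restore transactions.log --nsInclude="hsc.transactions"
restore rest.log         --nsExclude="hsc.chain" --nsExclude="hsc.transactions"

wait   # blocks until all three pipelines finish
echo "All three restore streams are done."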

The Proof:
Looking at my mongostat now (top right pane in the screenshot), the insert rate is holding steady in the thousands, and the dirty column is hovering around 15%, comfortably below the 20% panic threshold.

The Tooling: Tracking the Invisible

There was one final problem. Because I was piping data through pigz, mongorestore had no idea how big the file was (not that it shows useful progress anyway), so I had zero progress bars. Restoring Hive-Engine nodes is a slog, and there is nothing to tell you where you are...

Where there's a Linux, there is a way.

Everything is a file: lsof can show you the exact byte offset the kernel is currently reading from, stat gives you the total size of the archive, and with those two numbers you can do a little bit of math.
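
For example, a quick manual check looks roughly like this (assuming a single pigz process owns the read; lsof reports the offset in the SIZE/OFF column, usually with a 0t decimal prefix):

# Current read offset of the pigz process inside the archive
lsof -o -p "$(pgrep -x pigz)" 2>/dev/null | grep '\.archive'

# Total size of the archive in bytes, for the denominator
stat -c %s hsc_snapshot.archive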


So, I wrote track_restore.sh. This script auto-detects the pigz process, finds the open file descriptor using lsof, reads the byte offset from the kernel, and calculates the real-time progress. It works with the normal mongorestore method as well, and would probably be helpful to other Hive-Engine node operators (even light nodes).

You can see it running, keeping me sane while the gigabytes churn.

(Screenshot: track_restore.sh drawing its progress bar while the restore runs.)

#!/bin/bash

# Configuration
INTERVAL=5

# AUTO-DETECT: Check for pigz first (fast mode), then mongorestore (slow mode)
PID=$(pgrep -x "pigz" | head -n 1)
PROC_NAME="pigz"

if [ -z "$PID" ]; then
  PID=$(pgrep -x "mongorestore" | head -n 1)
  PROC_NAME="mongorestore"
fi

if [ -z "$PID" ]; then
  echo "Error: Neither pigz nor mongorestore process found."
  exit 1
fi

echo "--- Restore Progress Tracker (V3) ---"
echo "Monitoring Process: $PROC_NAME (PID: $PID)"

# Find file and size
ARCHIVE_PATH=$(lsof -p $PID -F n | grep ".archive$" | head -n 1 | cut -c 2-)

if [ -z "$ARCHIVE_PATH" ]; then
  echo "Could not auto-detect .archive file. Is the restore running?"
  exit 1
else
  TOTAL_SIZE=$(stat -c%s "$ARCHIVE_PATH")
  echo "Tracking File: $ARCHIVE_PATH"
fi

TOTAL_GB=$(echo "scale=2; $TOTAL_SIZE / 1024 / 1024 / 1024" | bc)
echo "Total Archive Size: $TOTAL_GB GB"
echo "----------------------------------------"

while true; do
  # 1. Get Offset
  # 2>/dev/null suppresses "lsof: WARNING" noise
  RAW_OFFSET=$(lsof -o -p $PID 2>/dev/null | grep ".archive" | awk '{print $7}')

  # 2. Safety Check: If empty, assume finished or closing
  if [ -z "$RAW_OFFSET" ]; then
    echo -e "\n\nRestore finished! (Process closed file)"
    break
  fi

  # 3. Clean the Offset (The Fix)
  # Remove '0t' (decimal prefix) and '0x' (hex prefix) to be safe
  # Bash handles 0x, but we can treat everything as standard base-10 if we convert hex
  if [[ "$RAW_OFFSET" == 0x* ]]; then
    # It's hex (mongorestore style)
    CURRENT_BYTES=$((RAW_OFFSET))
  else
    # It's likely 0t (pigz style) or raw number. Strip 0t.
    CURRENT_BYTES=$(echo "$RAW_OFFSET" | sed 's/^0t//')
  fi

  # 4. Math Safety Check
  if [ -z "$CURRENT_BYTES" ]; then continue; fi

  # 5. Calculate
  PERCENT=$(echo "scale=4; ($CURRENT_BYTES / $TOTAL_SIZE) * 100" | bc)
  CURRENT_GB=$(echo "scale=2; $CURRENT_BYTES / 1024 / 1024 / 1024" | bc)

  # 6. Bar
  BAR_WIDTH=50
  # Use 0 if PERCENT is empty to avoid crash
  INT_PERCENT=$(echo "${PERCENT:-0}" | cut -d'.' -f1)

  # Ensure INT_PERCENT is a number
  if ! [[ "$INT_PERCENT" =~ ^[0-9]+$ ]]; then INT_PERCENT=0; fi

  FILLED=$(($INT_PERCENT * $BAR_WIDTH / 100))
  EMPTY=$(($BAR_WIDTH - $FILLED))

  # printf padding handles a width of 0 cleanly (seq-based repetition would still print one character)
  BAR=$(printf "%${FILLED}s" "" | tr ' ' '#')
  SPACE=$(printf "%${EMPTY}s" "" | tr ' ' '-')

  printf "\rProgress: [%s%s] %s%% (%s GB / %s GB)" "$BAR" "$SPACE" "$PERCENT" "$CURRENT_GB" "$TOTAL_GB"

  sleep $INTERVAL
done

The Lesson

When you throw enterprise-grade hardware at standard-grade tools, things break in weird ways.

Don't trust defaults. Monitor your bottlenecks. And if the database engine tries to throttle you, sometimes the only answer is to hit it from three directions at once.

The node should finally be synced by mid-day.

As always,
Michael Garcia a.k.a. TheCrazyGM



18 comments

I am enjoying following you along as you are going through these processes from the beginning, so cool that before even sync'ing you have an incredibly useful new gist to add to our project builder!

!PAKX
!PIMP
!PIZZA


You're a masterful coding wizard, my friend, and I very much appreciate reading your adventures. Oh, and no, I never trust defaults. 😁🙏💚✨🤙


Amazing! Mongorestore has been the bane of my existence! On NVMEs a full restore was under 20 hours with some tweaks, but your approach should make that even faster!


My bottleneck is still the SSD. I probably should invest in an NVMe at some point. But I'm poor folk. 😅


Aren't consumer NVME and SSDs almost identical in price with SSDs being slightly cheaper?


Pretty close, but I don't have an M.2 adapter either (though I think you can get one pretty dirt cheap), and this SSD was a gift.

Not making excuses, it's just I live week to week, hand to mouth, and barely, if ever, have "extra" to get anything.


Fair enough if you already have it. 0 additional cost is best :)


👀 What filesystem for the MongoDB data, please? :D


I'm using BTRFS with CoW turned off for the mongod dir, but I wanted the ease of snapshots for backup purposes, so I don't ever have to do this restore again...


Interesting... haven't done that for a while. Have you tried in-memory with ZFS delayed writes?


Honestly, I haven't set up ZFS in, I wanna say, about 5 years. Best thing to ever come out of Solaris. But this machine is my "dev machine" so it gets scrambled all the time, probably worth trying.


ZFS has some interesting performance trends on higher memory bandwidth and high core count mobos... especially when you can delay writebacks if that's a SSD/NVMe problem.

So, for a single mobo, a must! For cluster backends, it depends a lot more... as it's a DDoS game (and highly depends on fabric being IP-based or RDMA).

With ASIC cards and SAS stuff, you can overcome some of these problems, but memory ZFS stuff is becoming way interesting. Especially when comparing with PCIe speeds.


That's def worth looking into. Even with every tweak I could throw at it, I ran into a physical bottleneck: the SSD can only eat it so fast. It's pegged at 100% util and "still going to take for fucking ever".


Depending on the SSD, you can also tune the queue and block size to perform to its best.

NVMe's are better because of their RAM cache and controllers' further "atomic" stuff, but SAS enterprise cards' stuff does many more atomic commands, which, when used on SAS controllers, drops the latency of high IO/ps quite a lot. So, depends on what you have...

New BIG NVMe's are a different beast, I am still exploring... and they are almost like a computer! It's going to be a game on PCIe over those...


Now I know I need to meet you in real life. Enough proof from your posts. 😉🙃
