MongoDB Kept Crashing After a Power Outage, But the Real Culprit Was Systemd
Hey everyone,
If you run a Hive-Engine node, this one might save you a few hours of chasing the wrong problem.
I had a power outage, my machine came back up, and suddenly mongod would not stay running. At first glance it looked exactly like the kind of issue you would expect after an unclean shutdown: lock file warnings, recovery logs, and a database that simply refused to stay alive.
That turned out to be only half the story.
The outage was real. The unclean shutdown was real. But the actual reason mongod kept dying was not plain old data corruption. It was a startup-environment difference between launching MongoDB manually and launching it through systemd.

The Initial Symptoms
After the outage, mongod would start and then die roughly 30 seconds later. Sometimes it stretched to about a minute, but it never stayed up.
systemctl status mongod showed the process exiting with a segmentation fault:
Main process exited, code=killed, status=11/SEGV
The MongoDB journal and log showed a few important details:
- WiredTiger detected an unclean shutdown
- recovery completed successfully
- MongoDB finished startup
- it became writable
- then it segfaulted shortly afterward
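Two places worth checking for exactly these markers are the systemd journal and MongoDB's own log. A sketch of the commands I mean; the log path below is the Debian default and may differ on your install:

```shell
# Service-level view: exit status and the SIGSEGV kill (requires systemd)
journalctl -u mongod --since "1 hour ago" | tail -n 20

# MongoDB's own log: recovery and shutdown markers
# (assumes the default Debian log path)
grep -Ei 'unclean shutdown|recovery|fatal|signal' /var/log/mongodb/mongod.log | tail -n 20
```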
That was the first clue that this was not just "MongoDB cannot open the data files."
The First False Lead: Repair
Because the problem started immediately after a power outage, I assumed I was dealing with corruption somewhere in WiredTiger metadata or an index file that did not survive the abrupt shutdown.
So I ran:
sudo systemctl stop mongod
sudo mongod --repair --dbpath /var/lib/mongodb
I should note one important detail here: I did not take a fresh backup right before running repair, because /var is already covered by my regular Btrfs snapshots. That meant I had multiple rollback points available if repair made the situation worse.
The first run actually failed because of open file limits:
Too many open files
Running repair again with a much higher ulimit -n let it complete successfully with exit code 0.
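The retry can be done in a single subshell so the higher limit applies only to that one repair run and does not change anything system-wide. A sketch, assuming the default Debian dbpath:

```shell
sudo systemctl stop mongod

# Raise the file-descriptor limit only inside this subshell,
# then run the same repair as before
sudo bash -c 'ulimit -n 64000; mongod --repair --dbpath /var/lib/mongodb'
```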
That made it look like the problem might be fixed.
It was not.
After repair, mongod still crashed when started from systemd.
The Second False Lead: Permissions
Repair did leave me with a real permission problem afterward. mongod started failing immediately with:
/var/lib/mongodb/WiredTiger.turtle: Permission denied
That part was easy to fix:
sudo chown -R mongodb:mongodb /var/lib/mongodb /var/log/mongodb
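A quick way to confirm the ownership change actually took, and to catch any stragglers repair may have left behind (stat format flags here are GNU coreutils):

```shell
# Show owner:group for the data and log directories
stat -c '%U:%G' /var/lib/mongodb /var/log/mongodb

# List anything under the dbpath that is still not owned by mongodb
sudo find /var/lib/mongodb ! -user mongodb -ls
```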
Once ownership was corrected, the original crash pattern returned: MongoDB started successfully, ran for a short period, then segfaulted.
So at that point I knew I had fixed a side effect of repair, but not the underlying problem.
The Weird Part
The breakthrough came when I stopped testing with systemd and launched MongoDB manually as the mongodb user:
sudo -u mongodb bash -lc 'ulimit -n 65535; /usr/bin/mongod --config /etc/mongod.conf'
And that version stayed up.
That immediately narrowed the scope of the problem. If the same binary and the same data directory work in the foreground, then the problem is probably not the database files themselves. It is probably something about the service environment.
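One way to put the two environments side by side is to dump what systemd will inject and what a manual login shell actually sees. A sketch; the `systemctl show` property names are standard, but your unit name may differ:

```shell
# What systemd injects into mongod
systemctl show mongod -p Environment -p LimitNOFILE

# What a manual login shell as mongodb sees
sudo -u mongodb bash -lc 'env | sort; echo "nofile: $(ulimit -n)"'
```

Diffing the two outputs is usually enough to spot the variable or limit that only exists on one side.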
Comparing Manual Start vs Systemd
Looking at the packaged unit file showed this:
[Service]
Environment="MONGODB_CONFIG_OVERRIDE_NOFORK=1"
Environment="GLIBC_TUNABLES=glibc.pthread.rseq=0"
LimitNOFILE=64000
My manual foreground run differed in two meaningful ways:
- it did not include GLIBC_TUNABLES=glibc.pthread.rseq=0
- it had a much higher open file limit than the systemd service
At that point the question was simple: which difference was actually causing the segfault?
The Decisive Test
I reproduced the systemd environment as closely as possible in the foreground:
sudo -u mongodb env GLIBC_TUNABLES=glibc.pthread.rseq=0 bash -lc 'ulimit -n 64000; /usr/bin/mongod --config /etc/mongod.conf'
That crashed immediately with a segmentation fault.
That was the smoking gun.
The trigger was not the outage anymore. The trigger was the GLIBC_TUNABLES=glibc.pthread.rseq=0 environment variable that the systemd unit was injecting.
What Was Actually Going On
The power outage still mattered, because it sent me down the path of recovery and repair. But the repeatable crash turned out to be tied to the systemd startup environment, not just the recovered dataset.
In my case:
- MongoDB 8.2.5
- Debian sid/forky
- mongod.service setting GLIBC_TUNABLES=glibc.pthread.rseq=0
That combination was enough to make mongod segfault when started as a service, while the same binary stayed alive when launched manually without that tunable.
The Fix
I created a systemd override and removed the GLIBC_TUNABLES line while also raising the file descriptor limit to match the healthier manual environment.
sudo systemctl edit mongod
I added:
[Service]
Environment=
Environment="MONGODB_CONFIG_OVERRIDE_NOFORK=1"
LimitNOFILE=524288
Then:
sudo systemctl daemon-reload
sudo systemctl restart mongod
The important part is the blank Environment= line. That clears the inherited environment entries from the packaged unit so you can add back only the safe ones you want.
What Is the Risk of Removing GLIBC_TUNABLES=glibc.pthread.rseq=0?
From a practical node-operator perspective, the risk appears low compared to leaving a guaranteed crash in place.
What you are giving up is not data safety logic. You are removing a glibc runtime tuning knob, not disabling journaling, replication, or WiredTiger recovery.
The likely tradeoff is one of these:
- a workaround for a userspace/kernel/glibc interaction on some systems
- a performance or stability tweak intended for another environment
- a packaging-level compatibility setting that is harmful on this specific stack
What you are not doing:
- disabling MongoDB durability
- bypassing recovery
- suppressing corruption checks
- changing your on-disk format
In my case, leaving it enabled caused reliable segfaults. Removing it allowed the service to behave like the stable foreground launch.
Why This Might Matter for Hive-Engine Operators
Hive-Engine nodes put MongoDB under a pretty specific workload:
- lots of collections
- lots of indexes
- steady local traffic from the smart contracts stack
- a restart path that often follows maintenance, package updates, or bad shutdowns
That makes it easy to assume every MongoDB startup issue is a data problem. Sometimes it is. But if:
- mongod starts fine in the foreground
- mongod only dies under systemd
- repair does not solve it
- logs show recovery completing successfully
then compare the service environment before you start tearing apart the database.
The Takeaway
The outage was the event that exposed the issue, but it was not the full cause of the persistent crash.
If your Hive-Engine node's MongoDB instance keeps segfaulting after a reboot or power event, test these two paths:
- Start it manually as the mongodb user in the foreground.
- Compare that environment to the systemd unit with systemctl cat mongod.
If your unit includes:
Environment="GLIBC_TUNABLES=glibc.pthread.rseq=0"
then it is worth testing whether that line is the actual trigger.
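A one-liner to check whether your effective unit (including any overrides) carries the variable at all:

```shell
# Print any GLIBC_TUNABLES lines in the effective unit, or say so if absent
systemctl cat mongod | grep -n GLIBC_TUNABLES || echo "not set in unit"
```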
Sometimes the difference between "corrupt data" and "broken service environment" is just one env var.
As always,
Michael Garcia a.k.a. TheCrazyGM
Great detective work Batman!
Systemd logs were full of zombie processes, never would’ve thought that was the real issue.
Great. Thanks for letting me know.
I have always had great admiration for people who understand and formulate the gobbledegook that sits in the background of anything internetty.
As a complete technophobe this is way above my pay grade!!
Thank you for this detailed write-up. This is the kind of post that saves people hours of frustration.
Your breakdown resonates with something I experienced just yesterday. I uploaded and updated a file on my server via SSH. Even though that file didn't have many active connections, a pm2 restart caused my entire server to crash. All functions stopped, and I couldn't even log in. I stayed up until 3 AM getting everything back online, and MongoDB was right in the middle of the mess.
What’s interesting is that when I talked to some friends in this space the next morning, around 90% of them had hit a similar issue that day, especially those who had pushed some kind of update.
Your point about distinguishing between "the triggering event" and "the actual root cause" is something I wish I had understood before I started panic-repairing things. I assumed data corruption, when the real problem might have been in the service environment the whole time.
The two-step test you laid out, manual foreground start vs. systemd, is something I’m saving permanently:
✅ If it runs fine manually, it’s an environment problem.
❌ If it dies under systemd, check your unit file before touching the data.
The GLIBC_TUNABLES finding is particularly valuable. It’s a great reminder that a single environment variable in a service definition can be the difference between a stable node and a guaranteed segfault after every restart.
I’m going to audit my mongod.service unit today.
I appreciate you taking the time to document this so thoroughly. This is what makes this community worth being part of.
I've been doing many troubleshooting rabbit holes very much like this. Congratulations on finding and resolving the issue. I've never tried to run a Hive-Engine node yet, but if I ever do, I'll remember this post. 😁🙏💚✨🤙