CentOS mail VM lost months of data on reboot

@admindev @hidead @rgk @BirdoBaggins

This one has me stumped…

I rebooted my mail VM (CentOS 7 running on KVM)… In fact, I shut down all my VMs and rebooted the host - all cleanly with a shutdown now -h - no errors seen anywhere.

No errors appeared, but the ENTIRE file system on this one VM appears as if it reverted back to a July snapshot… Not even sure how that’s possible - it has been rebooted between then and now, so I don’t see how the xfs journal could have rolled back that far…

xfs_repair shows no errors…

The dates on everything just jump from 7/11 to today…

Thoughts?

I’ve never seen this before.
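
To pin down exactly where the dates jump, a rough sketch like the one below walks a directory tree inside the guest and reports the widest gap between consecutive file modification times. The function name `largest_mtime_gap` and the command-line usage are made up for illustration, and it only looks at mtimes, so treat the result as a hint rather than proof of what happened.

```python
import os
import sys
from datetime import datetime


def largest_mtime_gap(root):
    """Return (gap_seconds, older_mtime, newer_mtime) for the widest gap
    between consecutive file mtimes under root, or None if fewer than
    two files were found."""
    mtimes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that dangling symlinks do not raise
                mtimes.append(os.lstat(path).st_mtime)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    mtimes.sort()
    if len(mtimes) < 2:
        return None
    return max((b - a, a, b) for a, b in zip(mtimes, mtimes[1:]))


if __name__ == "__main__":
    # Usage (path is whatever tree you want to inspect): python gap.py /some/dir
    result = largest_mtime_gap(sys.argv[1])
    if result is None:
        print("not enough files to compare")
    else:
        gap, older, newer = result
        print("largest gap: %.1f days" % (gap / 86400))
        print("from", datetime.fromtimestamp(older), "to", datetime.fromtimestamp(newer))
```

Running it against the mail spool and again against `/var/log` would show whether every tree stops at the same point in July or whether some of them carry on past it.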

Did you find any errors on the host about devices being disconnected?

Is there something on virt-manager or the XML about booting a particular snapshot?

Fuck dude, I’m sorry to hear that tho :persevere:
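
For the XML question, a minimal sketch of what to look for: feed it the output of `virsh dumpxml` for the guest and it lists, per disk, the image actually in use, any backing file underneath it, and whether the disk is marked `<transient/>` (libvirt discards changes to a transient disk when the guest exits). The sample XML, the paths in it, and the helper names are all hypothetical, not taken from the affected system.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down example of `virsh dumpxml <guest>` output.
SAMPLE_XML = """
<domain type='kvm'>
  <name>mail</name>
  <devices>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/tank/vms/mail-overlay.qcow2'/>
      <backingStore type='file'>
        <format type='qcow2'/>
        <source file='/tank/vms/mail-base.qcow2'/>
      </backingStore>
      <target dev='vda' bus='virtio'/>
      <transient/>
    </disk>
  </devices>
</domain>
"""


def source_path(element):
    """Return the file/dev/name attribute of a <source> element, if any."""
    if element is None:
        return None
    for attr in ("file", "dev", "name"):
        if element.get(attr) is not None:
            return element.get(attr)
    return None


def disk_report(domain_xml):
    """Summarise each <disk> in a libvirt domain XML string."""
    root = ET.fromstring(domain_xml)
    report = []
    for disk in root.findall("./devices/disk"):
        target = disk.find("target")
        report.append({
            "target": target.get("dev") if target is not None else None,
            "source": source_path(disk.find("source")),
            # A backing file means the guest is writing to an overlay image.
            "backing": source_path(disk.find("backingStore/source")),
            # <transient/> means writes are thrown away when the guest exits.
            "transient": disk.find("transient") is not None,
        })
    return report


if __name__ == "__main__":
    for entry in disk_report(SAMPLE_XML):
        print(entry)
```

If the real XML shows an unexpected backing file or a source path that is not the image you think the guest boots from, that would be worth chasing before blaming the filesystem.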

Did you figure it out? I am not well versed in KVM, so idk if I can add anything to the convo.

What are you using for a storage backend?

No snapshots tho, I realized after I posted that that could be confusing. That’s how it behaved, but there is no snapshot set up for this system. It backs up various things but I hadn’t been making qcow snapshots (maybe I should have, but I’ve been busy with the move).

ZFS running on the host. I’m thinking, though I’m not liking this thought, that I’m mis-remembering the sequence of events since July and it really has been running continuously, and what happened here is that ZFS failed to flush the cache on shutdown.

I’m baffled by that thought, as I added a second Optane to the cache at some point that I thought was in August, not July. I would have had to reboot then.

I also thought I updated and rebooted in October when my cert needed to renew. I have no logs to check. Need to fix my backups to catch those. The timeline doesn’t make sense for the theory above, but it’s the only way I can explain that much data being lost.

Re: cache, there are other files and VMs on the same host that lost nothing though. This is the only one that appears affected.

I have backups but I’m just baffled by the behavior here. If ZFS cache really is that busted, I need to stop using it.

I feel like the VM lost/disconnected from its storage and ran from memory this whole time.

That’s the net effect of what appears to have happened, but I’m not aware of a mechanism that would allow that to happen without error. It was running fine, handling loads of spam and large documents from the real estate and house building things going on in my life.

Just before I shut down I pulled up the console and saw the last debug of the mail filter I had done (something was getting over-filtered). That console showed no indication of issues.

You just reminded me that I definitely would have rebooted the guest when I did that back in October.

I’ve definitely seen VMs run for a bit when the disk store disappears until they hit an uncached file, but it crashes eventually… to survive a reboot in that condition doesn’t seem possible.

No errors on the host btw… logs look clean so far… need to look through the rotated messages to see if something happened weeks ago…
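
For going through the rotated messages files, a small sketch along these lines can save some scrolling. It scans every file matching a glob (plain or gzipped) for a few storage-related kernel phrases. The pattern list is an assumption about what a dropped disk would typically log, not an exhaustive set, and the `/var/log/messages*` glob in the usage line assumes the stock CentOS log location.

```python
import glob
import gzip
import re

# Assumed phrases worth flagging; extend to taste.
PATTERN = re.compile(r"I/O error|rejecting I/O|offline device", re.IGNORECASE)


def scan_logs(glob_pattern):
    """Return (path, line) pairs for every log line matching PATTERN."""
    hits = []
    for path in sorted(glob.glob(glob_pattern)):
        # Rotated logs may or may not be compressed, so handle both.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line in handle:
                if PATTERN.search(line):
                    hits.append((path, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for path, line in scan_logs("/var/log/messages*"):
        print(path + ": " + line)
```

An empty result only means none of these phrases appear, so it narrows things down rather than ruling a storage drop out.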

That would most definitely cause a kernel panic

Depends on the kernel

I know both SUSE and Ubuntu Server will KP, from experience. idk about CentOS

guest was 3.10-9xx I believe at the time (it’s 1062 now)…
host is 5.2.0-1.el7.elrepo (this was needed to get EPYC and ZFS into a good place, but that means… bum bum bah… the potential for ZoL bugs…)
