submitted 3 months ago by anzo@programming.dev to c/linux@programming.dev
[-] sxan@midwest.social 4 points 3 months ago* (last edited 3 months ago)

I'm getting random reboots, tied to nothing. Micro computer, AMD Ryzen 7 5800H. New (<6mo) computer; no re-used old components. 36GB RAM, which has passed a few runs of memtest. I have regularly seen the k10temp reading spike to the low 90s without a reboot, and when the reboots happen I haven't noticed the temps being higher than 60. The only thing I've been able to correlate it with at all is composing email; I'm a fairly fast typist, and markdown-oxide goes berserk and consumes well over 100% CPU (~165%) while I'm typing. I made the correlation because several of the reboots happened while I was composing emails (which I subsequently lost).

There is nothing in boot-1 logs. Just normal logging and then reboot. Nothing at all suspicious, no weird errors. I struggle to use more than 50% memory, so memory contention is not an issue. It's like a sudden power cycle.
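For anyone wanting to repeat that check, here is a minimal Python sketch that pulls the tail of the previous boot's journal and keeps only entries at warning level or worse. It assumes a systemd journal with persistent storage; the function names are made up for illustration. One caveat: on a sudden power loss, the last few seconds of log may never have reached the disk, so a clean tail does not prove nothing was logged.

```python
import json
import subprocess
from datetime import datetime, timezone


def parse_journal_json(lines):
    """Turn `journalctl -o json` lines into (time, priority, source, message) tuples."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        usec = int(rec["__REALTIME_TIMESTAMP"])  # microseconds since the epoch
        when = datetime.fromtimestamp(usec / 1_000_000, tz=timezone.utc)
        prio = int(rec.get("PRIORITY", 6))  # assumption: treat missing as "info"
        msg = rec.get("MESSAGE", "")
        if not isinstance(msg, str):  # binary or oversized messages are not plain strings
            msg = "<binary message>"
        entries.append((when, prio, rec.get("SYSLOG_IDENTIFIER", "?"), msg))
    return entries


def tail_of_previous_boot(count=50):
    """Fetch the last `count` journal entries of the previous boot (boot -1)."""
    out = subprocess.run(
        ["journalctl", "-b", "-1", "-n", str(count), "-o", "json", "--no-pager"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_journal_json(out.splitlines())


def warnings_only(entries, max_priority=4):
    """Keep entries at warning level or worse (lower number = more severe)."""
    return [e for e in entries if e[1] <= max_priority]
```

Calling `warnings_only(tail_of_previous_boot())` after a surprise reboot gives a short list to eyeball; an empty list is consistent with the "sudden power cycle" picture.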

The system is on a UPS; my next avenue of investigation is the UPS itself, but power surges in the house shouldn't be a possibility; there are a half dozen other computers in the house, some on UPS, some not, and none of those are having issues.

I saw an article a few days ago about a tool to help track down mysterious reboots like this, but can't find it now. I don't know how software could help; it is literally: everything is working, the screens go blank, and in a second or so the BIOS posts.

I am suspicious of the CPU core temp readings, which I can't seem to get at. I get the GPU temp, which is never stressed (stays around 45C); and k10temp_tctl, which from what I can find is an edge temp and not the core temp; and all of the NVMe temps, which all stay in the 40s. But the fact that I don't know if I'm seeing what's really going on temp-wise in the CPU worries me. But I don't think I've had it crash during a software update, which often includes compiling a bunch of Rust, C, Go, and whatever packages which I can see pegging multiple cores.
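One way to see every temperature the kernel actually exposes, rather than whatever a monitoring tool chooses to show, is to walk the hwmon sysfs tree directly. The sketch below is a rough Python take on that; `read_hwmon_temps` is a hypothetical helper name. Which labels show up (for example whether k10temp offers anything besides Tctl on this chip) depends on the driver and CPU, so treat the output as "what is available" and not as a guarantee of a true core reading.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip, label): degrees_C} for every temp*_input under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for input_file in sorted(chip_dir.glob("temp*_input")):
            prefix = input_file.name[: -len("_input")]  # e.g. "temp1"
            label_file = chip_dir / f"{prefix}_label"
            label = label_file.read_text().strip() if label_file.exists() else prefix
            try:
                millidegrees = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors fail or return junk when read; skip them
            temps[(chip, label)] = millidegrees / 1000.0  # sysfs reports millidegrees C
    return temps
```

Printing the result of `read_hwmon_temps()` in a loop during a heavy typing session would show whether any sensor moves in a way the usual dashboard hides.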

I'm at a loss. I've looked at everything I can think of, but still haven't gotten a hint about what is triggering this. I may just do a bunch of markdown editing with markdown-oxide enabled and see if I can reliably force it to happen, but that still wouldn't tell me why. I am certain it's not memory, and have mostly convinced myself it isn't temperature, unless it's something hidden I can't get a reading on.

Help?

Edit: it just occurred to me: how do I check for UPS issues when the NUT monitor is running on the computer connected to the UPS? If the UPS is stuttering, it's not going to get logged by NUT. I suppose I could connect a laptop and use it as the monitor, but this sounds like a lot of work to set up. What else should I try first?
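A lower-effort option than a second machine might be to poll the UPS every second on the same computer and force each reading to disk, so that the last line before a reboot at least shows what the UPS was reporting a moment earlier. The Python sketch below assumes NUT's `upsc` client is installed; the UPS name `myups@localhost`, the log path, and the chosen variables are placeholders, and which variables a given UPS driver reports varies. It cannot record the drop itself, only the state just before it.

```python
import os
import subprocess
import time


def parse_upsc(text):
    """Parse `upsc` output ("key: value" per line) into a dict."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:  # ignore anything that is not a key/value line
            values[key.strip()] = value.strip()
    return values


def append_durably(path, line):
    """Append one line and fsync it so it is on disk before power can vanish."""
    with open(path, "a") as fh:
        fh.write(line + "\n")
        fh.flush()
        os.fsync(fh.fileno())


def log_ups(ups="myups@localhost", path="ups.log", interval=1.0,
            keys=("ups.status", "input.voltage", "battery.charge")):
    """Poll upsc forever; `ups`, `path` and `keys` are placeholders to adjust."""
    while True:
        out = subprocess.run(["upsc", ups], capture_output=True, text=True).stdout
        values = parse_upsc(out)
        fields = " ".join(f"{k}={values.get(k, '?')}" for k in keys)
        append_durably(path, f"{time.strftime('%Y-%m-%dT%H:%M:%S')} {fields}")
        time.sleep(interval)
```

If the final lines before a reboot show the status flipping away from `OL` or the input voltage sagging, the UPS becomes a much stronger suspect; if they look perfectly normal, that is weaker evidence, since a sub-second glitch would fall between polls.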

Edit 2: I've now run stress with 16 cores for multiple minutes a couple of times. Once with -c (busy-work threads), and once with -m (busy-work using malloc/free). Both times, gotop showed all 16 cores gratifyingly pegged at 99/100%. Interestingly, k10temp never hit 90C, which I've seen it do before, but today is cool so that's probably helping. With mem-thrashing, I got a bunch of cached memory and finally saw free memory drop to 28%, which I rarely see on this machine because - when I set it up - I was tired of always fretting about memory use and decided to make it a non-issue by maxing the memory with 64GB. Anyway, that's the lowest I've ever noticed free memory drop to. Neither test crashed the machine. I may try longer runs - a half-hour, maybe? But I'm now less inclined to suspect that it's thermal-load related.

[-] anzo@programming.dev 1 points 3 months ago

Replace markdown-oxide with another tool for a while; try breaking the correlation to find causation.

[-] sxan@midwest.social 1 points 3 months ago

Yeah, I've disabled markdown-oxide for the moment, so I'll see what my uptimes look like for a bit.

I honestly can't imagine how a userspace program could cause this behavior, though. There's no memory pressure, and there are 16 cores in this CPU, fer chrissake. Even trying to peg the CPU, I didn't notice the md-oxide correlation until I started watching top; the temps weren't going up, performance wasn't impacted.
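To put numbers on what top was showing, one could sample a single process's CPU time straight from /proc. This is only a sketch in Python with made-up helper names; it assumes Linux and that you look up the PID of markdown-oxide yourself. The value is relative to one core, so a multi-threaded process can legitimately read above 100%.

```python
import os
import time


def cpu_ticks(stat_text):
    """Return utime + stime (in clock ticks) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces or ')',
    # so cut at the last ')' before counting fields.
    rest = stat_text[stat_text.rindex(")") + 2:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(pid, interval=1.0):
    """Sample one process twice; 100.0 means one fully busy core."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    with open(f"/proc/{pid}/stat") as fh:
        before = cpu_ticks(fh.read())
    time.sleep(interval)
    with open(f"/proc/{pid}/stat") as fh:
        after = cpu_ticks(fh.read())
    return 100.0 * (after - before) / ticks_per_second / interval
```

Logging `cpu_percent(pid)` with a timestamp once a second gives something to line up against the reboot times later, which is a firmer basis for the correlation than remembering what top looked like.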

I thought for sure it was a memory (hardware) issue, but I've run several memtests and they come back clean. No odd kernel module crashes in the logs; no indication anything is wrong until - poof. Reboot.

this post was submitted on 14 Jul 2024
38 points (95.2% liked)