CPU load over 70 means I can't even ssh into my server

@PlutoniumAcid · edit-2 2 years ago

CPU load over 70 means I can't even ssh into my server

@[email protected] · edit-2 2 years ago

It sounds like it could be an IO wait issue: system load will climb a ton without showing much CPU usage.

Make sure you’re not running out of RAM and going into swap space, though it doesn’t sound like it.

iotop might show something useful. In htop you can also add the 'PERCENT_IO_DELAY' column, which can be useful.
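If you'd rather check the iowait share without installing anything, here is a rough Python sketch. It assumes a Linux box where the aggregate `cpu` line of /proc/stat lists user, nice, system, idle, iowait, irq, softirq and steal in that order; the function names are made up for illustration and are not part of any tool mentioned here.

```python
import time


def parse_cpu_times(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-core lines are "cpu0", "cpu1", ...
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    # Only the first eight fields are summed: guest time is already counted in user/nice.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    # Index 4 is the iowait counter.
    return 100.0 * deltas[4] / total if total else 0.0


def measure(interval=1.0, path="/proc/stat"):
    """Sample the file at `path` twice, `interval` seconds apart (Linux only)."""
    with open(path) as handle:
        first = parse_cpu_times(handle.read())
    time.sleep(interval)
    with open(path) as handle:
        second = parse_cpu_times(handle.read())
    return iowait_percent(first, second)
```

Calling `print(measure())` on the server gives the iowait percentage over one second. A high value next to low user and system time would point at storage rather than CPU.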

@PriorProject · edit-2 2 years ago

My money is also on IO. Outside of CPU and RAM, it’s the most likely resource to get saturated, especially with rotational magnetic disks rather than an SSD: magnetic disks are going to be the performance limiter by a lot for many workloads. It’s also the one that OP said nothing about, suggesting it’s a blind spot for them.

In addition to the excellent command-line approaches suggested above, I recommend installing netdata on the box, as it will show you a very comprehensive set of performance metrics without having to learn to collect each one on the CLI. A downside is that it will use RAM proportional to the data retention period, which will be an issue if you’re swapping hard. But even a few hours of data can be very useful, and with 16 GB of RAM I feel like any swapping is likely to be a gross misconfiguration rather than true memory demand… and once that’s sorted, dedicating a gig or two to observability will be a good investment.

@[email protected] · 2 years ago

And I know OP mentioned not using much RAM, but almost every time I see a server load that high, it’s because the server is swapping heavily, which causes the iowait.
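One way to tell whether the box is actively swapping, rather than just holding some old pages in swap, is to watch the swap-in and swap-out counters. A minimal sketch, assuming Linux and the `pswpin` / `pswpout` counters in /proc/vmstat (pages swapped in and out since boot); the function names are only illustrative.

```python
def parse_swap_counters(vmstat_text):
    """Extract the pswpin / pswpout counters from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swap_pages_per_second(before, after, interval):
    """Pages swapped in and out per second between two samples."""
    return {
        name: (after[name] - before[name]) / interval
        for name in ("pswpin", "pswpout")
    }
```

Read /proc/vmstat twice a few seconds apart, pass both texts through `parse_swap_counters`, and feed the results to `swap_pages_per_second`. Rates that stay near zero suggest swap is not what is driving the iowait.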

@[email protected] · edit-2 2 years ago

Yeah I figured I would mention it since OP does describe symptoms like that.

@[email protected] · 2 years ago

Does top show unpaged memory too? I’ve had an application with a memory leak before that would fill up unpaged memory and it would look like nothing was using ram when I looked in the task manager, even though usage was 99%.

thelastknowngod · 2 years ago

Yep. IO.

OP, this might be overkill for you, but it might be worth standing up a Grafana/Prometheus stack… You’d be able to see this stuff a lot faster and potentially narrow in on a root cause.

@PlutoniumAcid · 2 years ago

That is definitely an interesting idea! Much, much better than the stupid dashdot container I am running now :-D