Is RAID still needed?

@[email protected] · 8 months ago

@[email protected] · 8 months ago

Yeah and Titanic was unsinkable.

If the controller in your SSD fries, it doesn’t matter how many unused gigabytes your SSD has got for relocating bad sectors. It is still fried. For you, that data is forever gone.

This is why you have redundancy. Full redundancy. You can go for RAID1, where one disk can die and you still have no data loss, or go bananas with RAID6, where two full disks can die and you’re still going strong.
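A rough sketch of the fault tolerance described above, in Python. The function name `raid_summary` and the returned tuple layout are made up for illustration; the per-level figures are the standard ones (RAID1 as an n-way mirror, RAID5 with single parity, RAID6 with double parity), and capacity is counted in "disks' worth", assuming equal-sized disks.

```python
from typing import Tuple


def raid_summary(level: str, disks: int) -> Tuple[int, int]:
    """Return (disk failures always survived, usable capacity in disks).

    Hypothetical helper for illustration; assumes equal-sized disks.
    """
    if level == "RAID0":      # striping only, no redundancy
        minimum, survived, usable = 2, 0, disks
    elif level == "RAID1":    # every disk holds a full copy
        minimum, survived, usable = 2, disks - 1, 1
    elif level == "RAID5":    # one disk's worth of parity
        minimum, survived, usable = 3, 1, disks - 1
    elif level == "RAID6":    # two disks' worth of parity
        minimum, survived, usable = 4, 2, disks - 2
    else:
        raise ValueError(f"unknown RAID level: {level}")
    if disks < minimum:
        raise ValueError(f"{level} needs at least {minimum} disks")
    return survived, usable


if __name__ == "__main__":
    for level, disks in [("RAID1", 2), ("RAID5", 4), ("RAID6", 12)]:
        survived, usable = raid_summary(level, disks)
        print(f"{level} x{disks}: survives {survived}, usable {usable} disks")
```

Note that this only counts whole-disk failures; it says nothing about deletions, malware, or a fault that takes out several disks at once.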

PS. Spinning harddrives have had hidden sectors used for relocation of bad sectors for ages. It’s nothing new. If you have too much time on your hands, Google harddrive hidden sectors nsa.

@[email protected] · edit-2 8 months ago

Unlike with HDDs, I have never experienced a graceful disk failure on an SSD. Instead, they just randomly decided to die at the most inconvenient time. RAID 1 has saved my hide a couple of times now from those SSD failures.

r00ty · 8 months ago

Yep. It has been decades since I had a home SSD failure, but I have had 2 SSD failures in the last 10 years in server hardware. In the first case it was in a striped RAID and I needed to restore from backup. In the second case it was part of a RAID 1 array and I just requested a replacement and got on with my day.

In my house, I have non raid SSDs on my own PC. But important stuff is on my NAS made up of 4xHDD drives in raid 5 (that also has the important folders backed up to an encrypted cloud).

RAID still has a place in an overall data security solution. Especially for servers that you want to keep up.

@[email protected] · 8 months ago

…absolutely, positively, super false. I work in a sector where we’re constantly dealing with huge capacity enterprise SSDs - 15 and 30 terabytes at times. Always using RAID. It’s not even a question. Not only can you have controller malfunctions, but even though you’ve got what’s known as “over provisioning” on the SSDs, you still need to watch out for total disk failures!

PirateJesus · edit-2 8 months ago

SSDs still have component bottlenecks that can kill the whole drive, same as hard drives.

Also, 3-2-1 is far superior to RAID, but having RAID on top of that is nice.

Maintain three copies of your data: This includes the original data and at least two copies.
Use two different types of media for storage: Store your data on two distinct forms of media to enhance redundancy.
Keep at least one copy off-site: To ensure data safety, have one backup copy stored in an off-site location, separate from your primary data and on-site backups. https://www.veeam.com/blog/321-backup-rule.html
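A minimal sketch of the three conditions listed above as a check. The `Copy` record, its field names, and the sample inventory are invented for illustration; a whole RAID array is counted as a single copy here, which is an assumption of this sketch rather than part of the rule's wording.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Copy:
    """One copy of the data (hypothetical record, not from any tool)."""
    label: str
    media: str      # e.g. "hdd", "ssd", "tape", "cloud"
    offsite: bool


def satisfies_3_2_1(copies: List[Copy]) -> bool:
    """True if there are 3+ copies, on 2+ media types, with 1+ off-site."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Sample inventory (invented): the NAS array counts as ONE copy here.
inventory = [
    Copy("NAS RAID 5 array", "hdd", offsite=False),
    Copy("encrypted cloud backup", "cloud", offsite=True),
]

if __name__ == "__main__":
    print("3-2-1 satisfied:", satisfies_3_2_1(inventory))
```

With only two entries the check fails on the copy count; adding a third, separate copy (say, an external drive) would be enough to pass it in this sketch.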

@[email protected] · 8 months ago

3-2-1 is for backup, RAID is also for availability, eg your domain server not going down in case of drive failure. good point though.

Zagorath · 8 months ago

People say RAID isn’t backup, but I’ve never understood that. Yes it’s only one medium and it’s probably not off-site, but if you’ve got an off-site copy in a different medium, why doesn’t a single RAID 5 count as 2 copies of your data to add up to get the 3 in 321 backup?

@[email protected] · 8 months ago

Media failure isn’t the only reason to back up. If you delete a file on a RAID array, it’s gone on all disks. If you need to recover that deleted file, you can’t recover from RAID. The same goes for formatting/damage of the file system, recovery from something wrong inside a database, etc.

@IphtashuFitz · 8 months ago

Suppose you’re hit by a ransomware attack and all the data on your NAS gets encrypted. Your RAID “backup” is just as inaccessible as everything else. So it’s not a backup. A true backup would let you recover from the ransomware attack once you have identified and removed the malware that allowed the attack.

Zagorath · 8 months ago

I really, really liked @[email protected]’s answer, because even as I was reading it, I was thinking of things that they could have said—but didn’t—which would have been easily rebutted. Those things fell into two basic categories: malware, and environmental effects.

As I understand it, malware is an issue with any online backup system, whether that’s a RAID or just a second external hard drive. So I don’t really think it works as an answer to why RAIDs specifically don’t qualify as backup.

@IphtashuFitz · 8 months ago

A well thought out and implemented backup system, along with a good security setup is how you deal with malware. If backups won’t protect you from malware then you’re doing backups wrong. A proper backup implementation keeps a series of full backups plus incremental backups based on those full ones. So say your data doesn’t change very often, then you might do a full backup once a month and incremental ones twice a week. You keep 6 months of the combinations of full & incrementals, you don’t just overwrite the backups with new ones.

If you’re doing backups like that and you suffer a malware attack then you have the ability to recover data as far as 6 months ago. The chances you don’t discover malware encrypting your data for 6+ months is tiny. If you’re really paranoid then you also test recovering files from random backups on a regular basis.
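The rotation described above can be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's actual backup tooling: fulls are placed on the 1st of each month, incrementals on Mondays and Thursdays, and "6 months" is approximated as 183 days. The names `retained_backups` and `last_clean_backup` are hypothetical.

```python
from datetime import date, timedelta
from typing import List, Optional, Tuple

RETENTION_DAYS = 183           # assumption: "6 months" as a day count
INCREMENTAL_WEEKDAYS = {0, 3}  # assumption: Monday and Thursday

Backup = Tuple[date, str]      # (day taken, "full" or "incremental")


def retained_backups(today: date) -> List[Backup]:
    """List every backup still kept on `today`, oldest first."""
    kept = []
    for age in range(RETENTION_DAYS, -1, -1):
        day = today - timedelta(days=age)
        if day.day == 1:
            kept.append((day, "full"))
        elif day.weekday() in INCREMENTAL_WEEKDAYS:
            kept.append((day, "incremental"))
    return kept


def last_clean_backup(backups: List[Backup],
                      infected_on: date) -> Optional[Backup]:
    """Newest backup taken strictly before the infection, if any is kept."""
    clean = [b for b in backups if b[0] < infected_on]
    return clean[-1] if clean else None


if __name__ == "__main__":
    backups = retained_backups(date(2024, 6, 30))
    print(last_clean_backup(backups, infected_on=date(2024, 6, 12)))
```

Restoring to an incremental also needs the full backup it is based on and the incrementals in between, which is one reason to keep the whole chain rather than overwrite it.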

My employer has detected and blocked multiple malware attacks using a combination of the above practices plus device management software that can detect unusual NAS activity and block suspect devices on our networks. Each time our security team was able to identify the encrypted files and restore over 99% from backups.

@brygphilomena · 8 months ago

RAID is resiliency, but not a backup. It doesn’t hold a previous date’s version, and it doesn’t protect against accidental deletion. Nor does it protect against changes to files.

@[email protected] · 8 months ago

Many causes of data loss affect all RAID drives equally, from accidental deletion to power surges, fire, water damage, theft,…

Zagorath · 8 months ago

I really, really liked @[email protected]’s answer, because even as I was reading it, I was thinking of things that they could have said—but didn’t—which would have been easily rebutted. Those things fell into two basic categories: malware, and environmental effects.

Environmental effects like water damage and theft are a problem for any local storage, regardless of the technology, whether it’s a RAID, an external USB drive, or even a NAS in your closet. The power surge is probably the best example of RAID not being backup, since it’s very possible that one device might receive the surge but not the other, if they’re connected to different outlets. But as for the other ones? Eh, I don’t really buy it.

@[email protected] · 8 months ago

I have literally lost all data on a RAID6 of 12 drives since the power distributor in the server (the bit between the redundant PSUs and the rest of the system) got fried and took 5 out of the 12 drives with it.

@blurg · 7 months ago

What if the RAID 5 gets encrypted with ransomware, how many backups are there?

lemmyreader · 8 months ago

Reminds me of the days when CD-ROMs were brand new and advertised as indestructible, with photos of elephants walking over them. Having said that, I assume SSDs can break like other hard disks can break, and in that case RAID can save a lot of time in getting a computer back up, especially when a lot of data is involved.

@JusticeForPorygon · 8 months ago

Had a microsd card literally break in half last week. They’re definitely not invincible

@[email protected] · 8 months ago

Yeah they sometimes get touted as that

@[email protected] · 8 months ago

Was that a SteamDeck? 🙃

@JusticeForPorygon · 8 months ago

@[email protected] · 8 months ago

Ok. Coz it is really common for SteamD users to forget to remove the SD card when disassembling the device. Lots of cards have been lost

@JusticeForPorygon · 8 months ago

Actually that’s kinda what happened. I inserted the card to test if it was working before I put the bottom back on, but forgot to take it out. When I started screwing the bottom back on I heard a snap and that’s when I realized…

Definitely a lot of data lost, but most of it is redownloadable.

@[email protected] · 8 months ago

Funny. Growing up, I was taught to be extra careful with CDs because the moment you look at them wrong, all your data gets corrupted.

originalucifer · 8 months ago

it’s not about the individual drive… it’s about total drive failure… if that ssd’s controller dies it doesn’t matter if it has extra data sectors.

that said, I moved on from raid by mirroring multiple unraided NAS devices for redundancy, with data stored specifically on the drives in such a way as to eliminate cross disk logical volumes.

@[email protected] · 8 months ago

you can replace sectors within them if a problem occurs

That won’t help you if the sector where your data is located dies!

@thorbot · 8 months ago

This is a total load of bullshit, your friend is wrong

Dekkia · edit-2 8 months ago

I don’t think the internal wear-leveling and overprovisioning of SSDs can or should be able to replace raid. Disregarding a dead sector without losing capacity is great, but it won’t help you when (for example) the controller dies.

Depending on the amount of data you’re storing SSDs also might be too expensive.

The only exception is maybe Raid 0 in a normal PC. Here it’s probably better to just get one disk for each logical drive.

@[email protected] · 8 months ago

RAID0 has always been playing with fire

@[email protected] · 8 months ago

It’s very much still needed and heavily utilised in the enterprise world. Volume size is usually the lowest priority when it comes to arrays; redundancy and IOPS (input/output operations per second the storage can handle) are typically the priority. The exception here would be backup and archive storage, where IOPS is less important and volume size is more important.

As far as replacing sectors goes, I’ve never heard of this and I might just be ignorant on the subject but as far as I know you can’t “replace” a bad sector. Only mark it as bad and not use it, and whatever was there before is gone. This has existed since HDD days. This is also why we use RAID - parity across disks to protect data.
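To make "parity across disks" concrete, here is a toy sketch of single-parity reconstruction using XOR, the idea behind RAID 5 style parity. It is only the principle: the function name `xor_blocks` and the sample blocks are invented, and real implementations work on stripes and distribute parity across the disks.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


if __name__ == "__main__":
    # Three data "disks" holding one block each (made-up contents).
    data = [b"RAID", b"5 is", b"neat"]
    parity = xor_blocks(data)          # stored on a fourth disk

    # Disk 1 dies: rebuild its block from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    print(rebuilt == data[1])
```

This works for exactly one missing block per stripe; losing two needs a second, independent parity as in RAID 6.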

Generally production storage will be in RAID-10, and backup/archive storage in RAID-6 or in some cases RAID-60 but I’m personally not a fan.

You also would consider how many disks are in the volume because there is a sweet spot. Too many disks = higher likelihood of total array failure due to simultaneous disk failures and more data loss in the event it does, but too few disks and you won’t have good redundancy, capacity or performance either (depending on RAID level).
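A back-of-the-envelope sketch of the "too many disks" half of that trade-off. It assumes every disk fails independently with the same probability over some window, which is optimistic (shared power or controller faults break it), and the 3% figure is a made-up input, not a measured rate. The function name is hypothetical.

```python
from math import comb


def array_loss_probability(disks: int, tolerated: int, p_disk: float) -> float:
    """Probability that more than `tolerated` of `disks` fail.

    Binomial model: independent, identical per-disk failure chance.
    """
    p_survive = sum(
        comb(disks, k) * p_disk ** k * (1 - p_disk) ** (disks - k)
        for k in range(tolerated + 1)
    )
    return 1.0 - p_survive


if __name__ == "__main__":
    P_DISK = 0.03  # assumption for illustration only
    for disks in (4, 6, 12, 24):
        loss = array_loss_probability(disks, tolerated=2, p_disk=P_DISK)
        print(f"{disks:2d} disks, double parity: loss chance {loss:.6f}")
```

For a fixed tolerance of two failures, the loss chance in this model only grows as disks are added, while the usable fraction of capacity improves; that tension is where the sweet spot comes from.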

The biggest change I see in RAID these days is moving away from hardware RAID cards and into software-based solutions like Microsoft Storage Spaces, md, ZFS and similar. These all have their own way of doing things and some can even synchronise the data with other hosts.

Hope this helps!

@Blue_Morpho · 8 months ago

As far as replacing sectors goes, I’ve never heard of this and I might just be ignorant on the subject but as far as I know you can’t “replace” a bad sector.

SSDs maintain stats on cell writes and move data when a cell nears its end. They keep spare capacity hidden from end users for this. Not using part of the drive also increases this spare capacity.

However ssds do fail and moving data to spare cells doesn’t change that.

@[email protected] · 8 months ago

Bit rot is still a problem; you need a high integrity file system and/or RAID to avoid that.
Full drive failure is still about as likely, i.e. the main reason for RAID of multiple drives in the first place.

A good read on the problems with SSDs: SSD 101: How Reliable are SSDs?

@[email protected] · 8 months ago

I found this article from the one you posted. It is crazy to think DNA can be used for storage one day.

storage tech of the future

@[email protected] · 8 months ago

I do recall Google apparently stopped using RAID in some data centres, but it was because they had whole-machine redundancy.

RAID is probably redundant for some of the uses it used to have, like optimising read performance by using many drives (SSD is fast) and honestly I suspect that SSDs are probably more reliable as they don’t have a bunch of platters and bearings and screaming rotational speeds.

So if you needed it for a base level of reliability, an SSD on its own may have exceeded that.

I suspect there are still uses for drive redundancy in some high availability setups… although your friend might be right. If the likelihood of drive failure is lower than other parts in the machine and you need high redundancy for availability it might make more sense to replicate the whole machine rather than the drives.

It’s possible redundancy specifically for the drives was an artifact of unreliable drives back in the day 🤔 they might have a point! I think it’s likely still useful at times though.

I’d rather hotswap a drive than set up a new server, even if it’s a less likely scenario.

@[email protected] · edit-2 8 months ago

I wholeheartedly agree with you. It is worth noting that a lot of the use cases of RAID can now be solved via software, but there are some places where hardware RAID still shines, such as redundancy. Yes, software also can provide redundancy, but I still haven’t seen a software solution that is equivalent to a proper RAID controller with a dedicated battery to keep the I/O buffer alive in case of hardware failure. That one has saved me a few times.

Source: I’m in charge of 6 storage clusters at work. BeeGFS is what takes care of the actual clustering, resulting in each cluster clocking in at 1.2PB of storage. Each cluster consists of four machines with three storage volumes each. Each storage volume consists of 12 drives in a RAID6 configuration.
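The layout above works out as follows. The drive counts come straight from the description; the per-drive size is not stated, so the last figure is only what the numbers would imply if the 1.2PB is usable capacity in decimal units, which is an assumption.

```python
MACHINES = 4
VOLUMES_PER_MACHINE = 3
DRIVES_PER_VOLUME = 12
RAID6_PARITY_DRIVES = 2   # RAID6 spends two drives' worth on parity
CLUSTER_TB = 1200         # 1.2 PB as decimal TB (raw vs usable is unstated)

volumes = MACHINES * VOLUMES_PER_MACHINE
total_drives = volumes * DRIVES_PER_VOLUME
data_drives = volumes * (DRIVES_PER_VOLUME - RAID6_PARITY_DRIVES)

# Assumption: if 1.2 PB is the usable figure, each drive would hold this much.
implied_tb_per_drive = CLUSTER_TB / data_drives

if __name__ == "__main__":
    print(f"{volumes} volumes, {total_drives} drives, "
          f"{data_drives} drives' worth of data")
    print(f"implied size per drive: {implied_tb_per_drive} TB")
```

Each volume can lose two drives without data loss, so in the best case the cluster tolerates two failures in every one of its volumes at once.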

I can yank faulty drives and toss them out and have them replaced with no downtime. I know some like to set up hot spares, but I for one don’t. I’ve even had entire servers die on me, and thanks to additional redundancy provided by BeeGFS, I’ve changed motherboards with no cluster downtime either. Just move the drives over to an identical machine (yes, each cluster has a dedicated spare machine), import the RAID, and you’re good to go.

@[email protected] · 8 months ago

a dedicated battery to keep the I/O buffer alive in case of hardware failure

Unless I’m misunderstanding, that sounds like you’re worried about the write hole, which RAIDZ doesn’t have

@[email protected] · edit-2 8 months ago

It’s mostly a matter of making sure any writes that are interrupted part way through (power failure, etc) are kept alive until the issue has been resolved. The raid controller caches everything until the write is complete.

It’s not so much about disks being out of sync, but more about preventing data loss.

@[email protected] · 8 months ago

RAIDZ is copy-on-write, and will notice and correct parity discrepancies if interrupted partway through. Doesn’t help if you don’t get at least one copy of the data written, but I’d take RAIDZ and a UPS over a hardware raid any day

@[email protected] · 8 months ago

And at the scale I’m operating, I’ll take hardware RAID over RAIDZ any day. I did some performance benchmarking when initially building these clusters, and BeeGFS really doesn’t like RAIDZ.

I use raidz at home, though.

@[email protected] · 8 months ago

That’s fair. My biggest concern with a hardware raid is the risk of having trouble finding compatible hardware if/when a controller dies, but I expect that’s not really an issue at larger scale; you probably buy hardware in bulk and have replacements on hand

@xkforce · 8 months ago

Higher end Samsung SSDs were dying a lot faster than they should. I don’t know what drugs your friend is on thinking they can’t fail but they’d better have enough for the rest of the class.

@lemmylommy · 8 months ago

This has nothing to do with ssd or their size. Harddisks also have a little spare area (though not as big) and can mark and remap failing sectors.

RAID (1) is still (possibly) good for the only thing it ever was (possibly) good for: Keeping the system running long enough for you to put in a new harddisk if one fails.

Think of industrial systems where every minute of downtime can cost thousands of dollars. And even there the usefulness of RAID can be questioned: should you not in that case have a whole spare system, easy to swap in, because more than just storage can fail?

And what about the RAID controller itself? Does it not add complexity and another point of failure to the whole system?

And most importantly: will anyone actually get notified of a failing disk and replace it quickly? Or will the whole thing just prolong the inevitable?

Would you even trust a system that had one disk fail already to keep going in a critical place? Or would it not be safer to just replace the whole thing anyway after one failure?

@[email protected] · 8 months ago

And what about the RAID controller itself? Does it not add complexity and another point of failure to the whole system?

This is why people prefer software raid these days instead of hardware raid.

Atemu · edit-2 8 months ago

That does not address the point made. It doesn’t matter whether it’s a complex hardware or software component in the stack; they will both fail.

@[email protected] · 8 months ago

Yes, I didn’t address the point made, just want to mention that people are increasingly avoiding hardware raid these days.