Help with understanding MySQL column and row size limits

@a_fancy_kiwi · edit-2 1 year ago

Help with understanding MySQL column and row size limits

dual_sport_dork 🐧🗡️ · edit-2 1 year ago

The answer is complicated, but TL;DR, MySQL has a hard coded limit of 65,535 bytes (that’s 2^16 - 1, or 64k) per row. This excludes column types ‘text’ and ‘blob,’ which your storage engine will store separately. Each type of column you have requires a certain amount of storage space. For instance, this is why on a ‘varchar’ column you have to specify how many characters it can hold. Longer fields, more data required. Add all this up – probably plus some overhead or other quirks I don’t know off the top of my head – and if the result is more than 64k per row, you’re toast.

MySQL supports multiple types of storage engines, and they can be different for each table. In general, InnoDB is the best one to use for most purposes (I’m sure I’ll catch flak for this opinion) but each one technically has subtle pros and cons. InnoDB also has a column limit of 1017 columns per table, regardless of their composition. Exceed this, you are also toast.

However. If your database design is such that you are running up against row storage limits, you’re probably doing it wrong. You should reconsider how you’re storing data and whether everything in whatever your app considers a single logical ‘record’ actually all has to be in the same table. (It probably doesn’t.) So what are you doing that requires creating more than 1017 or greater than 64k of data per row?

There is a complete, but mildly baffling per usual, explanation of MySQL’s column count and row size limits here: https://dev.mysql.com/doc/refman/8.0/en/column-count-limit.html

@a_fancy_kiwi · edit-2 1 year ago

For instance, this is why on a ‘varchar’ column you have to specify how many characters it can hold.

That’s likely what I’m doing wrong. I have roughly 220 columns and 95% of them were set to varchar(255) not understanding the size restrictions. A few were set to text(1000) as well (but these don’t count?). I definitely don’t need 255 characters worth of data in all fields

This excludes column types ‘text’ and ‘blob,’ which your storage engine will store separately.

are there any storage limits when using text or blob? I assume that the lookups are slower?

So what are you doing that requires creating more than 1017 or greater than 64k of data per row?

64 kilobytes, correct?

And I suppose it’s a glorified project book generator. 220ish variables per project and once it’s all entered into the webapp, I can generate a few hundred page PDF. Each variable of info isn’t very big, most are one word variables, a few are like 3 sentences; each row worth of info fits in a max 15KB JSON file.

At the moment, I was just putting all the data of each JSON file in one table. As I progress, variables like contacts will be put in a separate contact table and I’ll reference them (once I figure out how to do that lol). Based on the way the work is split up to collect the data on each project, I could logically split the table into 3 but I was hoping to avoid that just to reduce complexity and have everything, minus contacts, easily viewable on one table

@ShunkW · 1 year ago

220 columns in one table? Yeah, you definitely need to do some research and learn how to design a schema that isn’t based on incredibly wide tables.

@a_fancy_kiwi · 1 year ago

If you had to ball park it, what’s the max number of columns you would use per table?

Right now I’m considering splitting the table into 4. 3 tables per person/job and 1 for contacts.

Or, within each job there are 5ish main topics. So I could have 2 tables. One table for contacts and one table 15 columns wide but I’d store JSON in the cells. The data between each row isn’t related to another row except by contacts.

Is one option more correct than the other?

@ShunkW · 1 year ago

In a perfect world, I’d say 20 columns per table max, but shoot under that if possible. This isn’t always feasible, and I’ve definitely had some fat tables in some legacy apps I’ve worked on. But 220 is just unmanageable, especially if you’re doing a select * against that table ever in your app.

@a_fancy_kiwi · 1 year ago

Thank you for the info, I appreciate it

dual_sport_dork 🐧🗡️ · 1 year ago

If your variables are inconsistent in number but relatively consistent in format, and can be uniquely identified, it is probably a better idea to have a table that’s all variables – one per row. Your table structure would be something like id as INT or LONGINT, variable_name as VARCHAR(x), and variable_value as TEXT. When you look up a record, you SELECT * FROM variables WHERE id=whatever and parse the results. Note that in this case, the table will not have a primary key. You’ll be able to have more than one row with the id of whatever, which matches the ID of your document. You can keep whatever metadata about the document in another table, which will hopefully be short.

Having everything stored in text fields will not necessarily make lookups slow, but it may make retrieval of the data in them slow if they contain a lot of data and there are an awful lot of them. Especially if you retrieve it all the time when maybe you don’t have to. It will also make your app temporarily contain a lot of data in memory while it’s holding the result of the SQL call.

In SQL, finding a row (the seek or lookup) is a very different procedure to returning the data within it once found. The amount of time and CPU cycles it takes to find a given row can be quite high, especially if your tables are not efficiently designed and do not have suitable indexes, or you have to use a complex query to narrow it down. Once found, returning the data should take the same amount of time no matter what, dependent only on how much of it there is. Also, if your returned result is huge that will make it super slow if your data has to be piped over an external connection. If your app and the database live on the same machine the data transfer from database to app can be pretty fast even if the result is huge. If they’re on separate machines and that data has to be squeezed through a network connection, though, that’s going to be painful.

Consider that:

SELECT * FROM table WHERE id=1

and

SELECT * FROM table WHERE id REGEXP(‘#^1$#’)

Will result in very different lookup times despite superficially accomplishing the same thing. The second one is going to take longer. If your table is long, it will take a lot longer. You could even cause it to exceed the query time limit of your connection if you’re not careful. The first is just finding an ID by value, on a column that should (hopefully) be indexed. The second is using a regular expression to match the digit “1” as a string, which must be compared against every single value in column id one at a time in a full scan of every row in the table. Full table scans are slow and expensive, and you should avoid them whenever possible.

@surewhynotlem · 1 year ago

Just spitballing here, but it sounds more like you need a nosql database instead of a relational one. Those are designed to ingest store and search things like json objects.

@a_fancy_kiwi · 1 year ago

I’ve played around with MongoDB before. Do you know if nosql can still have relationships? Meaning, about the only data that will be reused per row would be contact info. In my app’s current state with everything in JSON files, I have the same contacts rewritten 10s of times. When contact info changes which happens often enough, I have to search through all the files and update the contacts in each place. I really want there to be 1 contact table that can be referenced from.

@surewhynotlem · 1 year ago

You would create a contacts object, with a unique reference id number.

Then all your other records would point to that unique reference id number, and would not hold any contacts data.

Changes to the customer would be done in the contacts object.

Queries could be done that join both groups of data.