I'm a data engineer, I use Parquet all the time, and I absolutely love love love it as a format!
Arrow (an in-memory columnar data format) + Parquet is particularly powerful, and lets you:
- Only read the columns you need (with a CSV, your computer has to parse all the data even if you then discard all but one column).
- Use metadata to only read relevant files. This is particularly cool and probably needs some unpacking. Say you're reading 10 files, but only want data where "column-a" is greater than 5. A Parquet reader can look at the min/max statistics stored in each file's footer at run time and figure out that a file doesn't have any column-a values over 5. It therefore never has to read that file's data!
- Have data in an unambiguous format that can be read by multiple programming languages. Since CSV is text, anything reading it has to guess: it will look at a value like "2022-04-05" and say "oh, this text looks like a date, let's see what happens if I read it as a date". Parquet stores actual data type information, so it will always be read consistently.
If you're handling a lot of data, this kind of stuff can wind up making a huge difference.