I had a DataFrame whose memory usage was 159.7 MB. When I wrote it to storage with the .to_csv method, the resulting file was about 400 MB, and when I loaded that file back, the memory usage was again 159.7 MB. Is there an explanation for this difference in sizes, and how can I write the file so that it takes less space on the hard drive? Thank you for your help.
If your DataFrame contains strings, try using a tab as the delimiter instead of a comma. That can save you the quoting that commas inside values would otherwise require.
df.to_csv('new_file.csv', sep='\t')
The easiest way to reduce the size of the CSV is to compress it when writing, using the compression parameter of to_csv. For example df.to_csv('new_file.csv.gz', compression='gzip').
There are a variety of reasons the memory usage could be so different from the size of the CSV on disk; it's hard to say without knowing the specifics of the data you're working with.
One generic recommendation is to check the precision of any floating-point values in your DataFrame: if you're writing numbers with 15 decimal places of precision, that will take up a lot of space on disk. Try truncating these values to the precision you actually need (the float_format parameter of to_csv can do this); see the sketch below.
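Putting the compression and float-precision suggestions together, a minimal sketch (the file name, data, and precision are made up for illustration):

import pandas as pd

df = pd.DataFrame({"x": [0.123456789012345, 1.5], "label": ["a", "b"]})

# Truncate floats to 6 decimal places and gzip-compress the output;
# pandas infers gzip from the .gz extension, or you can pass
# compression='gzip' explicitly.
df.to_csv("new_file.csv.gz", float_format="%.6f", index=False, compression="gzip")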
I'm looking for a way to enforce a specific size limit (4 GB) per file when writing a DataFrame to CSV in PySpark. I have already tried using maxPartitionBytes, but it does not work as expected.
Below is what I have used and tested on a 90 GB Hive table in ORC format. At the export (write) stage it produces files of seemingly random sizes rather than 4 GB.
Any suggestion on how to split the files to a size limit while writing? I don't want to use repartition or coalesce here, as the DataFrame goes through a lot of wide transformations.
df.write.format("csv").mode("overwrite").option("maxPartitionBytes", 4*1024*1024*1024).save(outputpath)
According to the documentation, spark.sql.files.maxPartitionBytes applies on read; if you do some shuffles later, the final size of the tasks, and therefore the final files written out, may change.
Spark docs
You may try spark.sql.files.maxRecordsPerFile instead, as according to the docs it applies on write:
spark.sql.files.maxRecordsPerFile: Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.
If that doesn't do the trick, I think the other option is, as you mentioned, to repartition the dataset just before the write; a sketch of the maxRecordsPerFile route is shown below.
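For illustration, a minimal PySpark sketch of the maxRecordsPerFile route, assuming spark is your active SparkSession and df/outputpath are as in the question; the record count is a made-up number that you would tune so that roughly 4 GB of your rows fit in one file:

# Cap each output file at a fixed number of records; tune this so that
# average row size times record count lands near the 4 GB target.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 50000000)

df.write.format("csv").mode("overwrite").save(outputpath)

# The same limit can also be passed as a writer option:
# df.write.option("maxRecordsPerFile", 50000000).format("csv").mode("overwrite").save(outputpath)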
I am loading the dataset from a SQL DB using pd.read_sql(). I tried to store 100 million rows and 300 columns in an Excel/CSV file, but it failed due to the limit of 1,048,576 rows.
So I am trying to store the same data as a .tsv file using
df.to_csv("data.txt", header=True, index=False, sep='\t', mode='a')
I can't find a documented row limit for a tab-separated .txt file. Is this approach good to go, or is there a better option?
The only thing here that I am not sure about is how pandas handles this internally. Besides that, your approach is totally fine. Hadoop widely uses the .tsv format to store and process data, and there is no such thing as "the limitation of a .tsv file": a file is just a sequence of bytes, and \t and \n are just characters like any others. The limit you ran into is imposed by Microsoft Excel, not by the OS; for example, it was lower a long time ago, and other spreadsheet applications can impose different limits.
If you open('your_file.tsv', 'rt') and call readline, the bytes up to the next \n are simply returned; nothing else happens. There is no rule about how many \t characters are allowed before a \n, or how many \n characters are allowed in a file. They are all just bytes, and a file can hold as many characters as the OS allows.
The maximum file size varies across OSs; according to NTFS vs FAT vs exFAT, an NTFS file can be almost 16 TB. In practice, though, splitting a big file into multiple files of a reasonable size is a good idea, for example because you can distribute them easily.
To process data this big, you should take an iterative or distributed approach, for example Hadoop; a chunked sketch with pandas is shown below.
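As an illustration of the iterative route with the tools from the question, a minimal sketch using pd.read_sql with chunksize (the connection string, query, and chunk size are made up):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://user:password@host/db")  # hypothetical connection

first = True
# Stream the result set in chunks instead of loading all 100M rows at once.
for chunk in pd.read_sql("SELECT * FROM big_table", engine, chunksize=1000000):
    chunk.to_csv("data.txt", sep="\t", index=False, header=first, mode="w" if first else "a")
    first = False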
Probably not a good idea. Your limitation is your machine's memory, since pandas loads everything into memory, and a DataFrame of that size won't fit. You probably need more machines and a distributed computing framework like Apache Spark or Dask.
Alternatively, depending on what you want to do with the data, you might not need to load it all into memory.
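If Dask is an option, a rough sketch might look like this (the table name, connection string, index column, and partition count are made up, and the exact read_sql_table signature can differ between Dask versions):

import dask.dataframe as dd

# Read the table lazily, in partitions, instead of pulling it all into RAM.
ddf = dd.read_sql_table("big_table", "postgresql://user:password@host/db",
                        index_col="id", npartitions=200)

# Writes one .tsv file per partition; the * is replaced by a partition number.
ddf.to_csv("data-*.tsv", sep="\t", index=False)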
I have a pandas DataFrame that I want to query often (in Ray, via an API). I'm trying to speed up loading it, but it takes significant time (3+ s) to cast it into pandas. For most of my datasets this is fast, but this one is not. My guess is that it's because 90% of the data consists of strings.
[742461 rows x 248 columns]
Which is about 137MB on disk. To eliminate disk speed as a factor I've placed the .parq file in a tmpfs mount.
Now I've tried:
Reading the parquet using pyArrow Parquet (read_table) and then casting it to pandas (reading into table is immediate, but using to_pandas takes 3s)
Playing around with pretty much every setting of to_pandas I can think of in pyarrow/parquet
Reading it using pd.read_parquet
Reading it from Plasma memory store (https://arrow.apache.org/docs/python/plasma.html) and converting to pandas. Again, reading is immediate but to_pandas takes time.
Casting all strings as categories
Does anyone have any good tips on how to speed up the pandas conversion when dealing with strings? I have plenty of cores and RAM.
My end result needs to be a pandas DataFrame, so I'm not bound to the Parquet file format, although it's generally my favourite.
Regards,
Niklas
In the end I reduced the time by handling the data more carefully: mainly by removing blank values, making sure we had as many NA values as possible (instead of blank strings etc.), and making categories out of all text data with less than 50% unique content.
I ended up generating the schemas via PyArrow so I could create categorical values with a custom index size (int64 instead of int16), so my categories could hold more values. The data size was reduced by 50% in the end.
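For anyone trying the same thing, a minimal sketch of building a dictionary-encoded (categorical) Arrow column with a 64-bit index type; the column names and data are made up, and this is only one way to construct such a schema:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"city": ["Oslo", "Oslo", "Bergen"], "value": [1.0, 2.0, 3.0]})

# Dictionary-encode the text column with int64 indices so the category
# can hold more distinct values than a small index type would allow.
codes, uniques = pd.factorize(df["city"])
city = pa.DictionaryArray.from_arrays(pa.array(codes, type=pa.int64()),
                                      pa.array(uniques, type=pa.string()))
value = pa.array(df["value"], type=pa.float64())

table = pa.Table.from_arrays([city, value], names=["city", "value"])
pq.write_table(table, "data.parq")

# Asking for the column back as a dictionary keeps it categorical in pandas.
df2 = pq.read_table("data.parq", read_dictionary=["city"]).to_pandas()
print(df2.dtypes)  # city comes back as a pandas category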
Here is my problem.

I have a single big CSV file containing a bit more than 100M rows which I need to divide into much smaller files (if needed I can add more details). At the moment I'm reading the big CSV in chunks, doing some computations to determine how to subdivide each chunk, and finally writing (appending) to the smaller files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True if it is the first time that I write to 'outfile', otherwise it is False).
While running the code I noticed that the total on-disk size of the smaller files was on track to become more than three times the size of the single big CSV.
So here are my questions:
is this behavior normal? (probably it is, but I'm asking just in case)
is it possible to reduce the size of the files? (different file formats?) [SOLVED through compression, see update below and comments]
are there file types better suited to this situation than CSV?
Please note that I don't have an extensive knowledge of programming, I'm just using Python for my thesis.
Thanks in advance to whoever will help.
UPDATE: thanks to #AshishAcharya and #PatrickArtner I learned how to use compression while writing and reading the CSV. Still, I'd like to know whether there are any file types that may be better than CSV for this task.
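For reference, the compressed write and read look roughly like this (a sketch with made-up file names, leaving out the chunking loop; df stands for one chunk of the big CSV):

import pandas as pd

# Write with gzip compression; pandas can also infer gzip from the
# .gz extension if compression is left at its default of 'infer'.
df.to_csv("outfile.csv.gz", float_format="%.8f", index=False, compression="gzip")

# Read it back; compression is inferred from the extension by default.
df_back = pd.read_csv("outfile.csv.gz")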
NEW QUESTION: (maybe a stupid question) does appending work on compressed files?

UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but using it on the folder containing the compressed files or on the one containing the uncompressed files gives the same value, '3.8G' (both were created from the same first 5M rows of the big CSV). From the file explorer (Nautilus), on the other hand, I get about 590 MB for the folder with the uncompressed CSVs and 230 MB for the other. What am I missing?
I have a 3.3 GB file containing one long line. The values in the file are comma-separated and are either floats or ints; most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 GB of memory, so I don't think it's even halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two for file accesses. In addition, the HDF libraries support MPI and concurrent access.

This means that, assuming you can reformat your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between them.

You need to be careful with your code and understand how MPI works in conjunction with HDF, but it will speed things up no end.

Of course, all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe it is not the most practical suggestion.
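Leaving the MPI part aside, the h5py side of this might look roughly like the sketch below; it assumes the one-line file has already been parsed into a NumPy array, and the file and group names are made up:

import numpy as np
import h5py

# Parse the comma-separated line once (slow), as in the question.
values = np.fromfile("distance_matrix.tmp", sep=",")

with h5py.File("distance_matrix.h5", "w") as f:
    grp = f.create_group("segment_000")  # one group per 'large' segment of data
    grp.create_dataset("values", data=values, compression="gzip")

# Later accesses can slice the dataset without re-parsing the text file.
with h5py.File("distance_matrix.h5", "r") as f:
    chunk = f["segment_000/values"][:1000000]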
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
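Concretely, the one-off conversion and the later fast loads might look like this (a sketch; the .npy file name is made up):

import numpy as np

# One-off cost: parse the text file once (slow), as in the question...
distance_matrix = np.fromfile("distance_matrix.tmp", sep=",")

# ...then dump it in numpy's binary format.
np.save("distance_matrix.npy", distance_matrix)

# Every later load reads raw bytes instead of parsing text, which is much faster.
distance_matrix = np.load("distance_matrix.npy")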
If you can't change the file format (i.e. it isn't produced by one of your own programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8 GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary that does the parsing and then dumps the data in a binary format. I don't have any links to examples of this.