Parallelize Pandas CSV Writing - python

Is it possible to write multiple CSVs out simultaneously? At the moment, I do a listdir() on an outputs directory, and iterate one-by-one through a list of files. I would ideally like to write them all at the same time.
Has anyone had any experience in this before?

If you have only one HDD (not even an SSD), then disk IO is your bottleneck and you are better off writing to it sequentially rather than in parallel. The disk head needs to be repositioned before each write, so trying to write in parallel will most probably be slower than a single writer process. Parallel writing would only make sense if you had multiple disks...
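That said, if you are writing to an SSD or to multiple disks and still want to try it, here is a minimal sketch using a process pool from concurrent.futures. The frames dict and the output directory below are made up for illustration, not taken from the question.

import os
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def write_one(args):
    # each worker writes one DataFrame to its own CSV file
    path, df = args
    df.to_csv(path, index=False)
    return path

if __name__ == "__main__":
    # hypothetical data; in practice these would be your real DataFrames
    frames = {"a.csv": pd.DataFrame({"x": [1, 2]}),
              "b.csv": pd.DataFrame({"x": [3, 4]})}
    os.makedirs("outputs", exist_ok=True)
    jobs = [(os.path.join("outputs", name), df) for name, df in frames.items()]
    with ProcessPoolExecutor() as pool:
        for path in pool.map(write_one, jobs):
            print("wrote", path)

On a single spinning disk, benchmark this against a plain loop first; the sequential version will often win.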

Related

Python: Reading a single big .gz file consumes more memory than several small files

I have a JSON file with several million rows, compressed to .gz. When I read the full file I run out of memory.
So I split the file into multiple files of 100k rows each, compress them all to .gz, and read them in a loop. No memory problems.
Now I think that in both cases I read exactly the same number of rows into memory, yet the one-file approach runs out of memory.
Could somebody elaborate why?
Because when you have a single file, you open it and store its full content in RAM, whereas with a loop you process the files one by one. If done properly (as it looks like you do), your program will close each file and free the allocated memory before opening the next one, reducing the RAM the program needs and avoiding running out of memory.
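A minimal sketch of that loop approach, assuming newline-delimited JSON split across several .gz files (the glob pattern and the process() step are placeholders):

import glob
import gzip
import json

def process(record):
    # placeholder for whatever per-row work you do
    pass

for path in sorted(glob.glob("data_part_*.json.gz")):
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            process(json.loads(line))
    # the file is closed here, so its memory can be reclaimed before the next one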

Is it possible in Python to load a large object into memory with one process, and access it in separate independent processes?

I'm writing a program that requires running algorithms on a very large (~6GB) csv file, which is loaded with pandas using read_csv().
The issue I have now, is that anytime I tweak my algorithms and need to re-simulate (which is very often), I need to wait ~30s for the dataset to load into memory, and then another 30s afterward to load the same dataset into a graphing module so I can visually see what's going on. Once it's loaded however, operations are done very quickly.
So far I've tried using mmap, and loading the dataset into a RAM disk for access, with no improvement.
I'm hoping to find a way to load up the dataset once into memory with one process, and then access it in memory with the algorithm-crunching process, which gets re-run each time I make a change.
This thread seems to be close-ish to what I need, but uses multiprocessing which needs everything to be run within the same context.
I'm not a computer engineer (I'm electrical :), so I'm not sure what I'm asking for is even possible. Any help would be appreciated however.
Thanks,
Found a solution that worked, although it was not directly related to my original question.
Instead of loading a large file into memory and sharing it between independent processes, I found that the bottleneck was really the parsing step in the pandas library.
Particularly, CSV parsing, as CSVs are notoriously inefficient in terms of data storage.
I started storing my files in the python-native pickle format, which is supported by pandas through the to_pickle() and read_pickle() functions. This cut my load times drastically from ~30s to ~2s.
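In code, the one-off conversion and the fast reload look roughly like this (file names are illustrative):

import pandas as pd

# slow, one-time step: parse the big CSV and cache it as a pickle
df = pd.read_csv("big_dataset.csv")
df.to_pickle("big_dataset.pkl")

# fast step, on every subsequent re-run
df = pd.read_pickle("big_dataset.pkl")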

Most efficient way to store data on drive

baseline - I have CSV data with 10,000 entries. I save this as 1 csv file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load them individually.
Approximately how much more inefficient is this computationally? I'm not hugely interested in memory concerns. The purpose of the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (it will be a single file), then add indexes as necessary.
To access your data, use Python's sqlite3 library. You can use this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
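A rough sketch of the import-then-query workflow with the standard sqlite3 module (the table and column names are hypothetical):

import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS entries (id INTEGER, value TEXT)")

# one-off import from the CSV
with open("data.csv", newline="") as fh:
    rows = ((int(r["id"]), r["value"]) for r in csv.DictReader(fh))
    conn.executemany("INSERT INTO entries (id, value) VALUES (?, ?)", rows)

conn.execute("CREATE INDEX IF NOT EXISTS idx_entries_id ON entries (id)")
conn.commit()

# later: pull only the subset you need instead of re-reading everything
subset = conn.execute(
    "SELECT id, value FROM entries WHERE id BETWEEN ? AND ?", (100, 200)
).fetchall()
conn.close()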
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there is still the overhead of 10,000 files' worth of metadata for the operating system to deal with.) Also, with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file and, if possible, pad all of its lines with spaces so that they all have the same length. For example, say you make sure, when writing out the CSV file, that every line is exactly 80 bytes long. Then reading the nth line of the file becomes relatively fast and easy:
myFileObject.seek(n*80)
theLine = myFileObject.read(80)
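Put together, the write-and-read cycle might look like this (the 80-byte width, the file name, and the sample rows are only for illustration; the file is opened in binary mode so the byte offsets are exact):

LINE_LEN = 80
rows = ["1,alpha", "2,beta", "3,gamma"]   # stand-in CSV lines

# write every line padded to exactly LINE_LEN bytes (including the newline)
with open("padded.csv", "wb") as out:
    for row in rows:
        out.write(row.ljust(LINE_LEN - 1)[:LINE_LEN - 1].encode() + b"\n")

def read_line(path, n, line_len=LINE_LEN):
    # jump straight to the nth line without scanning the whole file
    with open(path, "rb") as fh:
        fh.seek(n * line_len)
        return fh.read(line_len).decode().rstrip()

print(read_line("padded.csv", 2))   # -> "3,gamma"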

Reparsing a file or unpickling

I have a 100Mb file with roughly 10million lines that I need to parse into a dictionary every time I run my code. This process is incredibly slow, and I am hunting for ways to speed it up. One thought that came to mind is to parse the file once and then use pickle to save it to disk. I'm not sure this would result in a speed up.
Any suggestions appreciated.
EDIT:
After doing some testing, I am worried that the slowdown happens when I create the dictionary. Pickling does seem significantly faster, though I wouldn't mind doing better.
Lalit
MessagePack has in my experience been much faster for dumping/loading data in Python than cPickle, even when using the highest protocol.
However, if you have a dictionary with 10 million entries in it, you might want to check that you're not hitting the upper limit of your computer's memory. The process will be much slower if you run out of memory and have to use swap.
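A small sketch of the dump/load round trip, assuming the third-party msgpack package is installed (the dict and file name are placeholders):

import msgpack

data = {str(i): i * i for i in range(1_000_000)}   # stand-in for the parsed dict

with open("data.msgpack", "wb") as fh:
    msgpack.pack(data, fh)

with open("data.msgpack", "rb") as fh:
    restored = msgpack.unpack(fh)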
Depending on how you use the data, you could
divide it into many smaller files and load only what's needed
create an index into the file and lazy-load (see the sketch after this answer)
store it in a database and then query the database
Can you give us a better idea of just what your data looks like (its structure)?
How are you using the data? Do you actually use every row on every execution? If you only use a subset on each run, could the data be pre-sorted?
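For the index-and-lazy-load suggestion, a minimal sketch: scan the file once, record each record's byte offset, pickle that small index, and on later runs seek straight to the records you need. The file name and the assumption that the key is the first comma-separated field are hypothetical.

import pickle

# one-off pass: build {key: byte offset} for every line
offsets = {}
with open("big_file.txt", "rb") as fh:
    while True:
        pos = fh.tell()
        line = fh.readline()
        if not line:
            break
        key = line.split(b",", 1)[0].decode()
        offsets[key] = pos

with open("big_file.index", "wb") as fh:
    pickle.dump(offsets, fh)   # tiny compared to the data file

# later runs: load only the index, then fetch records on demand
with open("big_file.index", "rb") as fh:
    offsets = pickle.load(fh)
with open("big_file.txt", "rb") as fh:
    fh.seek(offsets["some_key"])
    record = fh.readline().decode()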

python parallel processing

I am new to Python. I have 2000 files, each about 100 MB. I have to read each of them and merge them into a big matrix (or table). Can I use parallel processing for this so that I can save some time? If yes, how? I tried searching and things seem very complicated. Currently, it takes about 8 hours to get this done serially. We have a really big server with one terabyte of RAM and a few hundred processors. How can I make efficient use of it?
Thank you for your help.
You may be able to preprocess the files in separate processes using the subprocess module; however, if the final table is kept in memory, then that process will end up being your bottleneck.
There is another possible approach using shared memory with mmap objects. Each subprocess can be responsible for loading the files into a subsection of the mapped memory.
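A related approach, using multiprocessing rather than the subprocess module: parse the files in worker processes and concatenate in the parent. This is only a sketch; the glob pattern is hypothetical, and the final concat still happens in a single process.

import glob
from multiprocessing import Pool

import pandas as pd

def load_one(path):
    # each worker parses one file and sends the DataFrame back to the parent
    return pd.read_csv(path)

if __name__ == "__main__":
    paths = sorted(glob.glob("parts/*.csv"))
    with Pool(processes=32) as pool:   # tune to your core count
        frames = pool.map(load_one, paths)
    big_table = pd.concat(frames, ignore_index=True)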
