Modify and write a large file in Python

Say I have a 5GB data file on disk, and I want to append another 100MB of data at the end of the file -- simply append, without modifying or moving the original data in the file. I know I can read the whole file into memory as a very long list and append my small new data to it, but that's too slow. How can I do this more efficiently?
I mean, without reading the whole file into memory?
I have a script that generates a large stream of data, say 5GB, as a very long list, and I need to save this data to a file. I tried generating the whole list first and then writing it out all at once, but as the list grew, the computer slowed down severely. So I decided to write it out in several passes: each time I accumulate a 100MB list, write it out, and clear the list. (This is why I have the first problem.)
I have no idea how to do this. Is there any library or function that can do this?

Let's start from the second point: if the list you store in memory is larger than the available RAM, the computer starts using the disk as RAM, and this severely slows everything down. The optimal way to write output in your situation is to fill the RAM as much as possible (always keeping enough space for the rest of the software running on your PC) and then write it to the file all at once.
The fastest way to store a list in a file is to use pickle, so that you store binary data, which takes much less space than formatted text (and makes the read/write process much faster too).
When you write to the file, keep it open the whole time, using something like with open('namefile', 'a') as f. Append mode leaves the existing data untouched and keeps the cursor at the end, and keeping the file open saves the time of repeatedly opening and closing it. If you do that, call f.flush() after each write to avoid losing data if something bad happens.
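For illustration, a minimal sketch of that chunked-append pattern; generate_chunks() is a hypothetical stand-in for the real data generator, not something from the question:

import pickle

def generate_chunks():
    # Hypothetical stand-in for the real generator: yields a few small
    # lists here; in practice each chunk would be ~100MB of data.
    for i in range(3):
        yield list(range(i * 5, i * 5 + 5))

with open('datafile.bin', 'ab') as f:      # 'ab' = append, binary: existing data is untouched
    for chunk in generate_chunks():
        pickle.dump(chunk, f)              # compact binary, fast to write
        f.flush()                          # limit data loss if something goes wrong

Reading it back is then a matter of calling pickle.load(f) repeatedly until EOFError is raised.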
If you provide some code it would be easier to help you...

Related

Is it more beneficial to read many small files or fewer large files of the exact same data?

I am working on a project where I am combining 300,000 small files to form a dataset for training a machine learning model. Because each of these files does not represent a single sample, but rather a variable number of samples, the dataset I require can only be formed by iterating through each of these files and concatenating/appending them to a single, unified array. In other words, I unfortunately cannot avoid iterating through these files in order to form the dataset I require. As such, the process of loading the data prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to
check whether the file exists,
check whether you have permission to access the file,
get the file's metadata from disk (where the file starts on disk, what its size is, etc.),
seek to the start of the file on disk (when it has to read data),
create the system's buffer for data from the disk (the system reads extra data into this buffer, so later read() calls can read partly from the buffer instead of partly from the disk).
With many files it has to do all of this for every file, and the disk is much slower than a buffer in memory.
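As an illustration of the point above, a rough sketch of packing many small per-sample files into a few hundred bigger ones; the paths, the .npy format, and the group size of 1000 are assumptions, not details from the question:

import glob
import os
import numpy as np

small_files = sorted(glob.glob("samples/*.npy"))   # assumed location and format
group_size = 1000                                  # 300,000 files -> ~300 merged files
os.makedirs("merged", exist_ok=True)

for i in range(0, len(small_files), group_size):
    # Each small file holds a variable number of samples; concatenate them once.
    arrays = [np.load(name) for name in small_files[i:i + group_size]]
    merged = np.concatenate(arrays, axis=0)
    np.save(os.path.join("merged", f"chunk_{i // group_size:04d}.npy"), merged)

Training then loops over ~300 np.load calls instead of 300,000 opens, which is where the savings listed above come from.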

Save periodically gathered data with python

I periodically receive data (every 15 minutes) and have it in an array (a numpy array, to be precise) in Python. The array has roughly 50 columns; the number of rows varies, usually somewhere around 100-200.
Before, I only analyzed this data and tossed it, but now I'd like to start saving it, so that I can create statistics later.
I have considered saving it in a csv file, but it did not seem right to me to dump so many of these fairly large 2D arrays into a csv file.
I've looked at serialization options, particularly pickle and numpy's .tobytes(), but in both cases I run into an issue - I have to track the number of arrays stored. I've seen people write that number as the first thing in the file, but I don't know how I could keep incrementing it while the file is still open (the program that gathers the data runs practically non-stop). Constantly opening the file, reading the number, rewriting it, seeking to the end to write new data and closing the file again doesn't seem very efficient.
I feel like I'm missing some vital information and have not been able to find it. I'd love it if someone could show me something I can not see and help me solve the problem.
Saving to a csv file might not be a good idea in this case; think about the accessibility and availability of your data. Using a database would be better: you can easily update your data and control the amount of data you store.
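For example, a rough sketch of the database idea using the standard-library sqlite3 module; the table layout (one row per 15-minute batch, with the raw array bytes stored as a blob) is just one possible design, not something from the question:

import sqlite3
import numpy as np

conn = sqlite3.connect("measurements.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS batches ("
    " id INTEGER PRIMARY KEY,"
    " received_at TEXT DEFAULT CURRENT_TIMESTAMP,"
    " rows INTEGER, cols INTEGER,"
    " payload BLOB)"          # payload = raw float64 bytes of the 2D array
)

def save_batch(array):
    # Append one batch; there is no counter to maintain, the table keeps track.
    conn.execute(
        "INSERT INTO batches (rows, cols, payload) VALUES (?, ?, ?)",
        (array.shape[0], array.shape[1], array.astype(np.float64).tobytes()),
    )
    conn.commit()

def load_all():
    # Rebuild every stored batch later, e.g. for statistics.
    rows = conn.execute("SELECT rows, cols, payload FROM batches")
    return [np.frombuffer(blob, dtype=np.float64).reshape(r, c) for r, c, blob in rows]

Because each batch is its own row, the collecting program can keep the connection open and simply insert every 15 minutes, and the analysis code can query whatever slice it needs later.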

How a file gets shuffled on disk without being loaded into memory

I have been looking for a way to shuffle a file on disk without loading it into memory. At first I doubted such an approach existed, but recently I came across this answer. Since that answer is not supported or voted on, I would love to know whether the code really does shuffle the file without loading it into memory. If so, HOW does that happen? I don't see how a file can be shuffled without first loading it into memory!
I assume you're talking about shuffling lines in a text file.
I don't know if the linked answer by Jamie Cockburn works, but it looks totally reasonable to me. The idea is the following:
mmap doesn't load the whole file into memory, but lets you access arbitrary parts of it by indexing with "from" and "to" byte positions, as if it were a list loaded into memory
You do go through the file twice, but you don't load the file's contents into memory
On the first pass through the file, you look for line breaks \n and store not the lines themselves but the byte offsets (indices) marking each line's start and end. You effectively store two numbers per line
You then shuffle the list of offsets called lines (remember, it contains only (int, int) pairs)
Now you open a new file for writing and iterate over the shuffled offsets; for each pair, you read a single line data[start:end+1] from the original file into memory and write it to the new file. You don't keep the line in memory longer than that single operation.
This approach requires an amount of memory linear in the number of lines of the input file. That can be much less than reading the whole file if the average line length is larger than the memory needed to store two integers.
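To make the steps concrete, here is a rough sketch of that approach (my own reconstruction, not the linked answer verbatim):

import mmap
import random

def shuffle_lines(src_path, dst_path):
    # Shuffle the lines of a text file without holding all of them in memory.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        data = mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ)
        # Pass 1: record the (start, end) byte offsets of every line.
        lines = []
        start = 0
        while True:
            end = data.find(b"\n", start)
            if end == -1:
                if start < len(data):               # last line without a trailing newline
                    lines.append((start, len(data) - 1))
                break
            lines.append((start, end))
            start = end + 1
        # Shuffle only the offset pairs, never the line contents.
        random.shuffle(lines)
        # Pass 2: copy one line at a time into the new file.
        for s, e in lines:
            chunk = data[s:e + 1]
            dst.write(chunk if chunk.endswith(b"\n") else chunk + b"\n")
        data.close()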

Comparing/Computing path info stored in several large text files in python

I have a bunch of flat files that basically store millions of paths and their corresponding info (name, atime, size, owner, etc)
I would like to compile a full list of all the paths stored collectively on the files. For duplicate paths only the largest path needs to be kept.
There are roughly 500 files, with approximately a million paths in each text file. The files are also gzipped. So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
Should I go for a database solution? sqlite3? Is there a data structure or better algorithm to go about this in python? Thanks for any help!
So far I've been able to do this in python but the solution is not optimized as for each file it basically takes an hour to load and compare against the current list.
If "the current list" implies that you're keeping track of all of the paths seen so far in a list, and then doing if newpath in listopaths: for each line, then each one of those searches takes linear time. If you have 500M total paths, of which 100M are unique, you're doing O(500M*100M) comparisons.
Just changing that list to a set, and changing nothing else in your code (well, you need to replace .append with .add, and you can probably remove the in check entirely… but without seeing your code it's hard to be specific) makes each one of those checks take constant time. So you're doing O(500M) comparisons—100M times faster.
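In code, the change is as small as it sounds; the names below are stand-ins for whatever the real script uses:

paths_seen = set()                         # was: paths_seen = []  (a list)
for newpath in ["/a/b", "/a/c", "/a/b"]:   # stand-in for lines parsed from the files
    if newpath in paths_seen:              # constant time with a set, linear time with a list
        continue
    paths_seen.add(newpath)                # was: paths_seen.append(newpath)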
Another potential problem is that you may not have enough memory. On a 64-bit machine, you've got enough virtual memory to hold almost anything you want… but if there's not enough physical memory available to back that up, eventually you'll spend more time swapping data back and forth to disk than doing actual work, and your program will slow to a crawl.
There are actually two potential sub-problems here.
First, you might be reading each entire file in at once (or, worse, all of the files at once) when you don't need to (e.g., by decompressing the whole file instead of using gzip.open, or by using f = gzip.open(…) but then doing f.readlines() or f.read(), or whatever). If so… don't do that. Just iterate over the lines in each GzipFile, for line in f:.
Second, maybe even a simple set of however many unique lines you have is too much to fit in memory on your computer. In that case, you probably want to look at a database. But you don't need anything as complicated as sqlite. A dbm acts like a dict (except that its keys and values have to be byte strings), but it's stored on disk, caching things in memory where appropriate, instead of stored in memory and paged to disk randomly, which means it will go a lot faster in this case. (And it'll be persistent, too.) Of course you want something that acts like a set, not a dict… but that's easy. You can model a set as a dict whose values are always ''. So instead of paths.add(newpath), it's just paths[newpath] = ''. Yeah, that wastes a few bytes of disk space over building your own custom on-disk key-only hash table, but it's unlikely to make any significant difference.
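A rough sketch of that dbm-as-a-set idea, streaming the gzipped files line by line; the file names and the one-path-per-line layout are assumptions:

import dbm
import glob
import gzip

with dbm.open("paths.db", "c") as paths:        # on-disk, dict-like, persistent
    for name in glob.glob("*.gz"):              # assumed: the ~500 gzipped flat files
        with gzip.open(name, "rt") as f:        # iterate lines; never f.read() the whole thing
            for line in f:
                path = line.rstrip("\n")        # assumed: one path per line
                paths[path] = ""                # set semantics: the value doesn't matter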

Sorting .csv file by column title

Is there a way to sort a csv file by column header name (sort vertically) without loading the whole thing into memory? I tagged this as python because it is the language I am most familiar with, but any other way would be fine also. I am limited to doing this via commandline on a remote machine due to data protection rules.
Any on-disk sorting algorithm is going to require more disk operations than just reading and writing once, and that I/O is likely to be your bottleneck. It's also going to be more complicated. So, unless you really can't fit the file into memory, it will be a lot faster, and a whole lot simpler, to sort it in memory.
But if you have to do this…
The standard on-disk sorting algorithm is a merge sort, similar to the familiar in-memory merge sort. It works like this:
Split the file into chunks that are big enough to fit into memory. You can do this iteratively/lazily, and easily: just read, say, 100MB at a time. Just make sure to rfind the last newline and hold everything after it over for the next chunk.
For each chunk, sort it in memory, and write the result to a temporary file. You can use the csv module, and the sort function with key=itemgetter(colnum).
If you have, say, 10 or fewer chunks, just open all of the temporary files and merge them. Again, you can use the csv module, and min with the same key or heapq.merge with equivalent decorate-sort-undecorate.
If you have 10-100 chunks, merge groups of 10 into larger temp files, then merge the larger ones in exactly the same way. With 100-1000, or 1000-10000, etc., just keep doing the same thing recursively.
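Under the assumption of a plain, header-first CSV and a modest number of chunks (so one merge pass is enough), the whole recipe might look roughly like this; the names and the chunk size are illustrative:

import csv
import heapq
import os
import tempfile
from operator import itemgetter

def dump_chunk(rows):
    # Write one sorted chunk to a temporary CSV file and return its path.
    fd, path = tempfile.mkstemp(suffix=".csv")
    with os.fdopen(fd, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path

def external_sort_csv(infile, outfile, colnum, chunk_rows=1_000_000):
    # Phase 1: sort chunks that fit in memory and spill them to temp files.
    temp_paths = []
    with open(infile, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= chunk_rows:
                chunk.sort(key=itemgetter(colnum))
                temp_paths.append(dump_chunk(chunk))
                chunk = []
        if chunk:
            chunk.sort(key=itemgetter(colnum))
            temp_paths.append(dump_chunk(chunk))
    # Phase 2: merge all the sorted chunks in a single pass.
    files = [open(p, newline="") for p in temp_paths]
    try:
        readers = [csv.reader(f) for f in files]
        with open(outfile, "w", newline="") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(heapq.merge(*readers, key=itemgetter(colnum)))
    finally:
        for f in files:
            f.close()
        for p in temp_paths:
            os.remove(p)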
If you have a simple CSV file with no quoting/escaping, and you have either ASCII data, ASCII-superset data that you want to sort asciibetically, or ASCII-superset data that you want to sort according to LC_COLLATE, the POSIX sort command does exactly what you're looking for, in the same way you'd probably build it yourself. Something like this:
sort -t, -k ${colnum},${colnum} -o outfile.csv infile.csv
If your data don't meet those requirements, you might be able to do a "decorate-sort-undecorate" three-pass solution. But at that point, it might be easier to switch to Python. Trying to figure out how to sed an arbitrary Excel CSV into something sort can handle and that can be reversed sounds like you'd waste more time debugging edge cases than you would writing the Python.
