I have a lot of data stored in generators, and I would like to sort it without using lists, so that I don't run out of memory in the process. Is it possible to sort generators this way? I have spent some hours thinking about this and I can't find a way to do it without saving the seen values somewhere (or is there a way to save them only "partially"?). I have read on Google about lazy sorting; is that a good approach? Thanks for the answers!
EDIT: My final objective is to write all the sorted data to a file.
PS: sorry about my bad english ><
You should just write the data to your output file in unsorted order, then sort it on the filesystem. If you're on Linux this is easily and very efficiently done with sort(1). Or, if you want to do it within Python, try csvsort, which is designed specifically for this.
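For example, a rough sketch along those lines (the file names are made up, and it assumes one record per line and a sort(1) binary on the PATH):

import subprocess

def write_unsorted(gen, path):
    # Stream records from the generator to disk, one per line,
    # so nothing is ever held in memory all at once.
    with open(path, "w") as f:
        for record in gen:
            f.write(f"{record}\n")

data_gen = (line for line in ["pear", "apple", "banana"])   # stand-in generator
write_unsorted(data_gen, "unsorted.txt")

# Let the OS sort tool do the heavy lifting; it spills to temporary
# files on its own instead of loading everything into memory.
subprocess.run(["sort", "unsorted.txt", "-o", "sorted.txt"], check=True)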
My code can be summarised as a for loop (M ~ 10^5-10^6 iterations) over some function which sequentially produces data in the form of (W, N)-arrays, where W ~ 500 and N ~ 100, and I need to store these as efficiently as possible. Apart from saving data of this form, I would also like to be able to access it as fast as possible.
So far, I tried:
Creating an np.empty((M, W, N)) array up front and filling it,
Starting from a (W, N)-array and appending data using np.append, np.vstack or np.hstack (roughly as in the sketch below).
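For reference, here is roughly what the two attempts look like, with scaled-down made-up sizes and a dummy produce() standing in for my real function:

import numpy as np

M, W, N = 100, 500, 100            # scaled-down stand-ins for the real sizes

def produce(i):
    # dummy stand-in for the function that yields one (W, N) array per iteration
    return np.full((W, N), i, dtype=np.float64)

# Attempt 1: preallocate once, then fill slices in place.
out = np.empty((M, W, N))
for i in range(M):
    out[i] = produce(i)

# Attempt 2: grow an array with np.append; this copies the whole
# accumulated block on every iteration, so it slows down as it grows.
acc = np.empty((0, W, N))
for i in range(M):
    acc = np.append(acc, produce(i)[None, ...], axis=0)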
So far, everything seems pretty slow.
What is the fastest way to manage this?
Do I need to rely on third-party packages like Dask? If so, which ones?
I usually use pandas for this kind of problem, but in this situation I'm having serious trouble with memory. I would like to avoid having to upgrade my computer to deal with it.
A possible solution could be to use a generator instead of
df["column"] = df["column"].apply(func = lambda x: myfunc(x))
Can I read and write the same file at the same time? A possible solution could be to read "df.csv" and write "df1.csv" with the new column instead.
What do you think about that?
Am I on track?
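Something like this chunked read/write is what I have in mind (a rough sketch; myfunc stands in for my real function and the chunk size is a guess):

import pandas as pd

def myfunc(x):
    # placeholder for the real function
    return x

# Process "df.csv" in chunks so the whole frame never has to fit in
# memory, and write the result to "df1.csv" instead of back to "df.csv".
first = True
for chunk in pd.read_csv("df.csv", chunksize=100_000):
    chunk["column"] = chunk["column"].apply(myfunc)
    chunk.to_csv("df1.csv", mode="w" if first else "a", header=first, index=False)
    first = False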
I basically have a large (multi-terabyte) dataset of text (it's in JSON but I could change it to dict or dataframe). It has multiple keys, such as "group" and "user".
Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I filter and read only the key.
Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
There must be an off-the-shelf system for this. Can anyone recommend one?
There are discussions about this, but some of the better ones are old. I'm looking for the simplest off-the-shelf solution.
I suggest you split your large file into multiple small files using the readlines(CHUNK) method, and then process them one by one (see the sketch below).
I worked with a large JSON file; at first, processing took 45 seconds per file and my program ran for 2 days, but once I split the file up, the program finished in only 4 hours.
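A rough sketch of that splitting step, assuming a line-delimited file (the file names and chunk size are made up):

CHUNK = 64 * 1024 * 1024               # read roughly 64 MB of complete lines at a time

part = 0
with open("big.json") as src:          # assumes one JSON record per line
    while True:
        lines = src.readlines(CHUNK)   # readlines(hint) stops after ~CHUNK bytes
        if not lines:
            break
        with open(f"part_{part:04d}.json", "w") as dst:
            dst.writelines(lines)
        part += 1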
I am using Python and MongoDB. I have a collection called citymap and I need to read one field out of each document. Previously, I did this:
for NN5_doc in citymap.find().batch_size(500):
    current_cell = citymap.find({'_id': NN5_doc['_id']})
    citycell_pool.append(current_cell[0]['big_cell8']['POI'])
Now I have learned that parallel_scan might help me do the same thing more efficiently. However, I don't know how to use it. To the best of my knowledge, maybe I can use:
grid750_cursors = citymap.parallel_scan(5)
But then how do I handle these cursors so that they give me the same citycell_pool as before?
I'm not sure why you think parallel_scan is what you need here. It seems that your approach to the iteration isn't efficient: why iterate through the citymap collection and then fetch from that same collection a document you've already fetched? Given your code, this would make more sense to me:
for NN5_doc in citymap.find({}, {'big_cell8.POI': 1}).batch_size(500):
    citycell_pool.append(NN5_doc['big_cell8']['POI'])
Reading should be more efficient with a projection (depending on the size of your documents), since you only read the field you need (big_cell8.POI), which reduces the size of the retrieved documents. And by removing the redundant find from the loop, it should be at least twice as fast.
Is there a way to sort a csv file by column header name (sort vertically) without loading the whole thing into memory? I tagged this as python because it is the language I am most familiar with, but any other way would be fine also. I am limited to doing this via commandline on a remote machine due to data protection rules.
Any on-disk sorting algorithm is going to require more disk operations than just reading and writing once, and that I/O is likely to be your bottleneck. It's also going to be more complicated. So, unless you really can't fit the file into memory, it will be a lot faster to sort it in memory, and a whole lot simpler.
But if you have to do this…
The standard on-disk sorting algorithm is a merge sort, similar to the familiar in-memory merge sort. It works like this:
Split the file into chunks that are small enough to sort in memory. You can do this iteratively/lazily, and easily: just read, say, 100MB at a time. Just make sure to rfind the last newline and hold everything after it over for the next chunk.
For each chunk, sort it in memory and write the result to a temporary file. You can use the csv module and list.sort (or sorted) with key=itemgetter(colnum).
If you have, say, 10 or fewer chunks, just open all of the temporary files and merge them. Again, you can use the csv module, and either repeatedly take min with the same key, or use heapq.merge (with its key argument on Python 3.5+, or with an equivalent decorate/undecorate wrapper on older versions).
If you have 10-100 chunks, merge groups of 10 into larger temp files, then merge the larger ones in exactly the same way. With 100-1000, or 1000-10000, etc., just keep doing the same thing recursively.
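Put together, a rough sketch of the chunk-and-merge approach above (it assumes a plain CSV with a header row, sorts the column as text, and does a single merge pass rather than the recursive grouping just described):

import csv
import heapq
import os
import tempfile
from operator import itemgetter

def external_sort_csv(in_path, out_path, colnum, chunk_rows=100_000):
    # Sort a CSV on one column without holding the whole file in memory.
    tmp_paths = []
    with open(in_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        while True:
            chunk = [row for _, row in zip(range(chunk_rows), reader)]
            if not chunk:
                break
            chunk.sort(key=itemgetter(colnum))      # in-memory sort of one chunk
            tmp = tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False)
            csv.writer(tmp).writerows(chunk)
            tmp.close()
            tmp_paths.append(tmp.name)

    # Merge the sorted chunks lazily; heapq.merge only keeps one row
    # per chunk in memory at any time.
    files = [open(p, newline="") for p in tmp_paths]
    readers = [csv.reader(f) for f in files]
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(heapq.merge(*readers, key=itemgetter(colnum)))
    for f, p in zip(files, tmp_paths):
        f.close()
        os.remove(p)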
If you have a simple CSV file with no quoting/escaping, and you have either ASCII data, ASCII-superset data that you want to sort asciibetically, or ASCII-superset data that you want to sort according to LC_COLLATE, the POSIX sort command does exactly what you're looking for, in the same way you'd probably build it yourself. Something like this:
sort -t, -k ${colnum},${colnum} -o outfile.csv infile.csv
If your data don't meet those requirements, you might be able to do a "decorate-sort-undecorate" three-pass solution. But at that point, it might be easier to switch to Python. Trying to figure out how to sed an arbitrary Excel CSV into something sort can handle and that can be reversed sounds like you'd waste more time debugging edge cases than you would writing the Python.