Sorting .csv file by column title - python

Is there a way to sort a csv file by column header name (sort vertically) without loading the whole thing into memory? I tagged this as python because it is the language I am most familiar with, but any other way would be fine also. I am limited to doing this via commandline on a remote machine due to data protection rules.

Any on-disk sorting algorithm is going to require more disk operations than a single read and write, and that I/O is likely to be your bottleneck. It's also going to be more complicated. So, unless you really can't fit the file into memory, sorting it in memory will be a lot faster, and a whole lot simpler.
But if you have to do this…
The standard on-disk sorting algorithm is a merge sort, similar to the familiar in-memory merge sort. It works like this:
Split the file into chunks that are big enough to fit into memory. You can do this iteratively/lazily, and easily: just read, say, 100MB at a time. Just make sure to rfind the last newline and hold everything after it over for the next chunk.
For each chunk, sort it in memory and write the result to a temporary file. You can use the csv module, and list.sort (or sorted) with key=operator.itemgetter(colnum).
If you have, say, 10 or fewer chunks, just open all of the temporary files and merge them. Again, you can use the csv module, and min with the same key or heapq.merge with equivalent decorate-sort-undecorate.
If you have 10-100 chunks, merge groups of 10 into larger temp files, then merge the larger ones in exactly the same way. With 100-1000, or 1000-10000, etc., just keep doing the same thing recursively; a single-level version is sketched below.
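A minimal sketch of those steps (one level of merging only), assuming the file has a header row and that the column should sort as plain text; external_sort_csv, colnum, and chunk_rows are names made up for illustration, and heapq.merge's key argument needs Python 3.5+:

import csv
import heapq
import tempfile
from itertools import islice
from operator import itemgetter

def external_sort_csv(infile, outfile, colnum, chunk_rows=1000000):
    key = itemgetter(colnum)
    chunks = []
    with open(infile, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)                       # if you only know the column name, colnum = header.index(name)
        while True:
            rows = list(islice(reader, chunk_rows)) # one memory-sized chunk of rows
            if not rows:
                break
            rows.sort(key=key)                      # in-memory sort of the chunk
            tmp = tempfile.TemporaryFile('w+', newline='')
            csv.writer(tmp).writerows(rows)
            tmp.seek(0)                             # rewind so the merge can read it back
            chunks.append(tmp)
    with open(outfile, 'w', newline='') as out:
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(heapq.merge(*(csv.reader(c) for c in chunks), key=key))

With more than a handful of chunks, you would merge them in groups of about 10, exactly as described above.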
If you have a simple CSV file with no quoting/escaping, and you have either ASCII data, ASCII-superset data that you want to sort asciibetically, or ASCII-superset data that you want to sort according to LC_COLLATE, the POSIX sort command does exactly what you're looking for, in the same way you'd probably build it yourself. Something like this:
sort -t, -k ${colnum},${colnum} -o outfile.csv infile.csv
If your data don't meet those requirements, you might be able to do a "decorate-sort-undecorate" three-pass solution. But at that point, it might be easier to switch to Python. Trying to figure out how to sed an arbitrary Excel CSV into something sort can handle and that can be reversed sounds like you'd waste more time debugging edge cases than you would writing the Python.

Related

Best structure for on-disk retrieval of large data using Python?

I basically have a large (multi-terabyte) dataset of text (it's in JSON but I could change it to dict or dataframe). It has multiple keys, such as "group" and "user".
Right now I'm filtering the data by reading through the entire text for these keys. It would be far more efficient to have a structure where I filter and read only the key.
Doing the above would be trivial if it fit in memory, and I could use standard dict/pandas methods and hash tables. But it doesn't fit in memory.
There must be an off the shelf system for this. Can anyone recommend one?
There are discussions about this, but some of the better ones are old. I'm looking for the simplest off the shelf solution.
I suggest you split your large file into multiple small files with readlines(CHUNK) and then process them one by one.
I worked with large JSON files; at the beginning each file took 45 seconds to process and my program ran for two days, but after I split them up, the program finished in only 4 hours.
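A rough sketch of that chunked approach (processing the file a piece at a time rather than literally writing the small files out), assuming one JSON record per line; if the file is a single JSON document you would need a streaming parser instead. CHUNK and handle_records are made-up names, with handle_records standing in for whatever does your filtering by "group" or "user":

import json

CHUNK = 100 * 1024 * 1024                  # read roughly 100 MB of lines at a time

def process_in_chunks(path, handle_records):
    with open(path) as f:
        while True:
            lines = f.readlines(CHUNK)     # sizehint: stops soon after ~CHUNK bytes
            if not lines:
                break
            handle_records(json.loads(line) for line in lines)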

modify and write large file in python

Say I have a 5GB data file on disk, and I want to append another 100MB of data at the end of the file -- just append, without modifying or moving the original data in the file. I know I can read the whole file into memory as a long list and append my small new data to it, but that's too slow. How can I do this more efficiently?
I mean, without reading the whole file into memory?
I have a script that generates a large stream of data, say 5GB, as a long list, and I need to save this data to a file. I tried to generate the whole list first and then write it all out at once, but as the list grew, the computer slowed down very severely. So I decided to write it out in several passes: each time I have about 100MB in a list, I write it out and clear the list. (This is why I have the first problem.)
I have no idea how to do this. Is there any library or function that can do this?
Let's start with the second point: if the list you keep in memory is larger than the available RAM, the computer starts using the disk as RAM (swapping), and this slows everything down severely. The optimal way to write output in your situation is to fill the RAM as much as possible (always leaving enough room for the rest of the software running on your PC) and then write it to a file in one go.
The fastest way to store a list in a file would be using pickle, so that you store binary data that takes much less space than formatted text (which also makes reading and writing much faster).
When you write to a file, you should keep the file open the whole time, using something like with open('namefile', 'w') as f. That way you save the cost of repeatedly opening and closing the file, and the cursor will always be at the end. If you do that, call f.flush() after each write to avoid losing data if something bad happens. Opening in append mode ('a') is a good alternative anyway.
If you provide some code, it will be easier to help you...
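In the meantime, a minimal sketch of the append-and-flush idea, assuming the new records arrive as strings from a generator (generate_records and BATCH are made-up names):

BATCH = 100 * 1024 * 1024                 # flush roughly every 100 MB

with open('datafile', 'a') as f:          # 'a' appends; the existing 5 GB is never read
    pending = 0
    for record in generate_records():
        f.write(record)
        pending += len(record)
        if pending >= BATCH:
            f.flush()                     # push buffered data out so a crash loses little
            pending = 0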

Comparing/Computing path info stored in several large text files in python

I have a bunch of flat files that basically store millions of paths and their corresponding info (name, atime, size, owner, etc)
I would like to compile a full list of all the paths stored collectively in the files. For duplicate paths, only the largest one needs to be kept.
There are roughly 500 files, with approximately a million paths in each text file. The files are also gzipped. So far I've been able to do this in Python, but the solution is not optimized, as each file basically takes an hour to load and compare against the current list.
Should I go for a database solution? sqlite3? Is there a data structure or better algorithm for this in Python? Thanks for any help!
So far I've been able to do this in Python, but the solution is not optimized, as each file basically takes an hour to load and compare against the current list.
If "the current list" implies that you're keeping track of all of the paths seen so far in a list, and then doing if newpath in list_of_paths: for each line, then each one of those searches takes linear time. If you have 500M total paths, of which 100M are unique, you're doing on the order of 500M * 100M comparisons.
Just changing that list to a set, and changing nothing else in your code (well, you need to replace .append with .add, and you can probably remove the in check entirely… but without seeing your code it's hard to be specific), makes each one of those checks take constant time. So you're doing on the order of 500M checks, roughly 100M times faster.
Another potential problem is that you may not have enough memory. On a 64-bit machine, you've got enough virtual memory to hold almost anything you want… but if there's not enough physical memory available to back that up, eventually you'll spend more time swapping data back and forth to disk than doing actual work, and your program will slow to a crawl.
There are actually two potential sub-problems here.
First, you might be reading each entire file in at once (or, worse, all of the files at once) when you don't need to (e.g., by decompressing the whole file instead of using gzip.open, or by using f = gzip.open(…) but then doing f.readlines() or f.read(), or whatever). If so… don't do that. Just iterate over the lines in each GzipFile, for line in f:.
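A minimal sketch of that streaming, set-based version, assuming the path is the first whitespace-separated field on each line and that filenames holds your ~500 .gz files (both of which are assumptions about your data):

import gzip

paths = set()
for name in filenames:
    with gzip.open(name, 'rt') as f:      # decompress lazily, line by line
        for line in f:
            path = line.split()[0]        # adjust if the path isn't the first field
            paths.add(path)               # constant-time membership test and insert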
Second, maybe even a simple set of however many unique lines you have is too much to fit in memory on your computer. In that case, you probably want to look at a database. But you don't need anything as complicated as sqlite. A dbm acts like a dict (except that its keys and values have to be byte strings), but it's stored on disk, caching things in memory where appropriate, instead of stored in memory, paging to disk randomly, which means it will go a lot faster in this case. (And it'll be persistent, too.) Of course you want something that acts like a set, not a dict… but that's easy. You can model a set as a dict whose values are always ''. So instead of paths.add(newpath), it's just paths[newpath] = ''. Yes, that wastes a few bytes of disk space over building your own custom on-disk key-only hash table, but it's unlikely to make any significant difference.
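The same idea with the set swapped for an on-disk dbm (a sketch using Python 3's dbm module, where keys must be bytes; the same layout assumptions as above):

import dbm
import gzip

with dbm.open('paths.db', 'c') as paths:  # 'c' creates the database if it doesn't exist
    for name in filenames:
        with gzip.open(name, 'rb') as f:
            for line in f:
                path = line.split()[0]
                paths[path] = b''         # dict-as-set: every value is empty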

How to read a big (3-4GB) file that doesn't have newlines into a numpy array?

I have a 3.3gb file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used HDF files, via h5py, of sizes well over 200 GB with very little processing time, on the order of a minute or two, for file accesses. In addition, the HDF libraries support MPI and concurrent access.
This means that, assuming you can format your original one-line file as an appropriately hierarchical HDF file (e.g. make a group for every 'large' segment of data), you can use the built-in capabilities of HDF to process your data on multiple cores, using MPI to pass whatever data you need between them.
You need to be careful with your code and understand how MPI works in conjunction with HDF, but it'll speed things up no end.
Of course all of this depends on putting the data into an HDF file in a way that allows you to take advantage of MPI... so maybe not the most practical suggestion.
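For what it's worth, a rough sketch of just the conversion step (no MPI): read the single long line in pieces, parse each piece, and append it to a resizable HDF5 dataset. BUF and the dataset name 'distances' are illustrative choices, not anything required by h5py:

import h5py
import numpy as np

BUF = 100 * 1024 * 1024                   # parse roughly 100 MB of text at a time

with open('distance_matrix.tmp') as src, h5py.File('distance_matrix.h5', 'w') as h5:
    dset = h5.create_dataset('distances', shape=(0,), maxshape=(None,), dtype='f8')
    leftover = ''
    while True:
        piece = src.read(BUF)
        if not piece:
            break
        head, _, leftover = (leftover + piece).rpartition(',')  # hold back a possibly cut-off value
        if head:
            values = np.array(head.split(','), dtype='f8')
            start = len(dset)
            dset.resize((start + len(values),))
            dset[start:] = values
    if leftover.strip():                  # the very last value has no trailing comma
        dset.resize((len(dset) + 1,))
        dset[len(dset) - 1] = float(leftover)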
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
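For example, once the array has been built (by whatever program wrote distance_matrix.tmp, or after one slow parse), numpy's own binary format avoids the parsing on every later run; a small sketch:

import numpy as np

np.save('distance_matrix.npy', distance_matrix)    # fast binary dump, no text formatting
distance_matrix = np.load('distance_matrix.npy')   # reloads without any parsing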
If you can't change the file format (it's not the result of one of your programs), then there's not much you can do about it. Make sure your machine has lots of RAM (at least 8GB) so that it doesn't need to use swap at all. Defragmenting the hard drive might help as well, or using an SSD.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.

Python synchronised reading of sorted files

I have two groups of files that contain data in CSV format with a common key (Timestamp) - I need to walk through all the records chronologically.
Group A: 'Environmental Data'
Filenames are in format A_0001.csv, A_0002.csv, etc.
Pre-sorted ascending
Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
Contains environmental data in CSV/column format
Very large, several GBs worth of data
Group B: 'Event Data'
Filenames are in format B_0001.csv, B_0002.csv
Pre-sorted ascending
Key is Timestamp, i.e. YYYY-MM-DD HH:MM:SS
Contains event based data in CSV/column format
Relatively small compared to Group A files, < 100 MB
What is the best approach?
Pre-merge: Use one of the various recipes out there to merge the files into a single sorted output and then read it for processing
Real-time merge: Implement code to 'merge' the files in real-time
I will be running lots of iterations of the post-processing side of things. Any thoughts or suggestions? I am using Python.
I'm thinking importing it into a DB (MySQL, SQLite, etc.) will give better performance than merging it in a script. The DB typically has optimized routines for loading CSV, and the join will probably be as fast as, or much faster than, merging two dicts (one being very large) in Python.
"YYYY-MM-DD HH:MM:SS" can be sorted with a simple ascii compare.
How about reusing external merge logic? If the first field is the key then:
import os

for entry in os.popen("sort -m -t, -k1,1 file1 file2"):
    process(entry)
This is similar to a relational join. Since your timestamps don't have to match, it's called a non-equijoin.
Sort-Merge is one of several popular algorithms. For non-equijoins, it works well. I think this is what you called "pre-merge". I don't know what you mean by "merge in real time", but I suspect it's still a simple sort-merge, which is a fine technique, heavily used by real databases.
Nested Loops can also work. In this case, you read the smaller table in the outer loop. In the inner loop you find all of the "matching" rows from the larger table. This is effectively a sort-merge, but with an assumption that there will be multiple rows from the big table that will match the small table.
This, BTW, will allow you to more properly assign meaning to the relationship between Event Data and Environmental Data. Rather than reading the result of a massive sort merge and trying to determine which kind of record you've got, the nested loops handle that well.
Also, you can do "lookups" into the smaller table while reading the larger table.
This is hard when you're doing non-equal comparisons because you don't have a proper key to do a simple retrieval from a simple dict. However, you can easily extend dict (override __contains__ and __getitem__) to do range comparisons on a key instead of simple equality tests.
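A rough sketch of such an extended dict, assuming each key is a (start, end) timestamp pair and lookups use plain timestamp strings (which, as noted above, compare correctly as text); RangeDict is a made-up name:

class RangeDict(dict):
    # keys are (start, end) timestamp pairs; lookups use a single timestamp
    def _find(self, when):
        for start, end in self:           # linear scan; fine for the small Event table
            if start <= when <= end:
                return (start, end)
        return None

    def __contains__(self, when):
        return self._find(when) is not None

    def __getitem__(self, when):
        key = self._find(when)
        if key is None:
            raise KeyError(when)
        return dict.__getitem__(self, key)

Populate it with something like events[('2013-01-01 00:00:00', '2013-01-01 00:59:59')] = some_event, and events['2013-01-01 00:30:00'] will then find that event.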
I would suggest pre-merge.
Reading a file takes a lot of time. Reading two files, twice as much. Since your program will be dealing with a large input (lots of files, especially in Group A), I think it would be better to get it over with in one file read, and have all your relevant data in that one file. It would also reduce the number of variables and read statements you will need.
This will improve the runtime of your algorithm, and I think that's a good enough reason in this scenario to decide to use this approach.
Hope this helps
You could read from the files in chunks of, say, 10000 records (or whatever number further profiling tells you is optimal) and merge on the fly. Possibly using a custom class to encapsulate the I/O; the actual records could then be accessed through the iterator protocol (__iter__ + next).
This would be memory friendly, probably very good in terms of total time to complete the operation and would enable you to produce output incrementally.
A sketch:
class Foo(object):
    def __init__(self, env_filenames=[], event_filenames=[]):
        self._cache = []
        # open the files, create csv readers, etc.

    def next(self):
        if not self._cache:
            # take care of reading (and merging) more records into the cache
            pass
        # return the first record and pop it from the cache
        return self._cache.pop(0)

    # ... other stuff you need (__iter__, raising StopIteration when done) ...
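If the timestamp really is the first CSV column (as the sort -m example above assumes), the merging core of such a class can also lean on the standard library; a hedged sketch, relying on the files being individually pre-sorted:

import csv
import heapq

def merged_records(filenames):
    # heapq.merge is lazy: it keeps only one pending row per file in memory,
    # and rows compare column by column, so the leading timestamp decides order
    readers = [csv.reader(open(name, newline='')) for name in filenames]
    for row in heapq.merge(*readers):
        yield row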
