Efficient query on a sorted csv

Efficient query on a sorted csv - python

I have a .csv with several million rows. The first column is the id of each entry, and each id only occurs one time. The first column is sorted. Intuitively I'd say that it might be pretty easy to query this file efficiently using a divide and conquer algorithm. However, I couldn't find anything related to this.
Sample .csv file:
+----+------------------+-----+
| id | name | age |
+----+------------------+-----+
| 1 | John Cleese | 34 |
+----+------------------+-----+
| 3 | Mary Poppins | 35 |
+----+------------------+-----+
| .. | ... | .. |
+----+------------------+-----+
| 87 | Barry Zuckerkorn | 45 |
+----+------------------+-----+
I don't want to load the file in memory (too big), and I prefer to not use databases. I know I can just import this file in sqlite, but then I have multiple copies of this data, and I'd prefer to avoid that for multiple reasons.
Is there a good package I'm overlooking? Or is it something that I'd have to write myself?

Ok, my understanding is that you want some of the functionnalities of a light database, but are constrained to use a csv text file to hold the data. IMHO, this is probably a questionable design: past several hundred of rows, I would only see a csv file an an intermediate or exchange format.
As it is a very uncommon design, it is unlikely that a package for it already exists - for my part I know none. So I would imagine 2 possible ways: scan the file once and build an index id->row_position, and then use that index for your queries. Depending on the actual length of you rows, you could index only every n-th row to change speed for memory. But it costs an index file
An alternative way would be a direct divide and conquer algo: use stat/fstat to get the file size, and search for the next end of line starting at the middle of the file. You immediately get an id after it. If the id you want is that one, fine you have won, if it is greater, just recurse in the upper part, if lesser, recurse in the lower part. But because of the necessity to search for end of lines, be prepared to corner case like never finding the end of line in the expected range, or find it at the end.

After Serges answer I decided to write my own implementation, here it is. It doesn't allow newlines and doesn't deal with a lot of details regarding the .csv format. It assumes that the .csv is sorted on the first column, and that the first column are integer values.
import os
def query_sorted_csv(fname, id):
filesize = os.path.getsize(fname)
with open(fname) as fin:
row = look_for_id_at_location(fin, 0, filesize, id)
if not row:
raise Exception('id not found!')
return row
def look_for_id_at_location(fin, location_lower, location_upper, id, sep=',', id_column=0):
location = int((location_upper + location_lower) / 2)
if location_upper - location_lower < 2:
return False
fin.seek(location)
next(fin)
try:
full_line = next(fin)
except StopIteration:
return False
id_at_location = int(full_line.split(sep)[id_column])
if id_at_location == id:
return full_line
if id_at_location > id:
return look_for_id_at_location(fin, location_lower, location, id)
else:
return look_for_id_at_location(fin, location, location_upper, id)
row = query_sorted_csv('data.csv', 505)
You can look up about 4000 ids per second in a 2 million row 250MB .csv file. In comparison, you can look up 3 ids per second whilst looping over the entire file line by line.

Related

Trying to place strings into columns

There are 3 columns, levels 1-3. A file is read, and each line of the file contains various data, including the level to which it belongs, located at the back of the string.
Sample lines from file being read:
thing_1 - level 1
thing_17 - level 3
thing_22 - level 2
I want to assign each "thing" to it's corresponding column. I have looked into pandas, but it would seem that DataFrame columns won't work, as passed data would need to have attributes that match the number of columns, where in my case, I need 3 columns, but each piece of data only has 1 data point.
How could I approach this problem?
Desired output:
level 1 level 2 level 3
thing_1 thing_22 thing_17
Edit:
In looking at suggestion, I can refine my question further. I have up to 3 columns, and the line from file needs to be assigned to one of the 3 columns. Most solutions seem to need something like:
data = [['Mary', 20], ['John', 57]]
columns = ['Name', 'Age']
This does not work for me, since there are 3 columns, and each piece of data goes into only one.

There's an additional wrinkle here that I didn't notice at first. If each of your levels has the same number of things, then you can build a dictionary and then use it to supply the table's columns to PrettyTable:
from prettytable import PrettyTable
# Create an empty dictionary.
levels = {}
with open('data.txt') as f:
for line in f:
# Remove trailing \n and split into the parts we want.
thing, level = line.rstrip('\n').split(' - ')
# If this is is a new level, set it to a list containing its thing.
if level not in levels:
levels[level] = [thing]
# Otherwise, add the new thing to the level's list.
else:
levels[level].append(thing)
# Create the table, and add each level as a column
table = PrettyTable()
for level, things in levels.items():
table.add_column(level, things)
print(table)
For the example data you showed, this prints:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
+---------+----------+----------+
The Complication
I probably wouldn't have posted an answer (believing it was covered sufficiently in this answer), except that I realized there's an unintuitive hurdle here. If your levels contain different numbers of things each, you get an error like this:
Exception: Column length 2 does not match number of rows 1!
Because none of the solutions readily available have an obvious, "automatic" solution to this, here is a simple way to do it. Build the dictionary as before, then:
# Find the length of the longest list of things.
longest = max(len(things) for things in levels.values())
table = PrettyTable()
for level, things in levels.items():
# Pad out the list if it's shorter than the longest.
things += ['-'] * (longest - len(things))
table.add_column(level, things)
print(table)
This will print something like this:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
| - | - | thing_5 |
+---------+----------+----------+
Extra
If all of that made sense and you'd like to know about a way part of it can be streamlined a little, take a look at Python's defaultdict. It can take care of the "check if this key already exists" process, providing a default (in this case a new list) if nothing's already there.
from collections import defaultdict
levels = defaultdict(list)
with open('data.txt') as f:
for line in f:
# Remove trailing \n and split into the parts we want.
thing, level = line.rstrip('\n').split(' - ')
# Automatically handles adding a new key if needed:
levels[level].append(thing)

python how to nicely align long text in pandas dataframe?

Given a panda's dataframe (received from a database) I'm trying to output the result to the console in such was that it will be complete and readable.
The challenge I have is with respect to the long text in 2 columns:LPQ_REASON & LPQ_RESOLUTION. You will note from the output below (print df) that both LPQ columns are ended with 3 dots (...) so I can't read the text. This comes despite my initial settings of:
pd.set_option('display.max_rows', 1500)
pd.set_option('display.max_columns', 1500)
pd.set_option('display.width', 1000)
so the result on the console looks like this:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (h... This Memo was issued with conjunction to our j... 3979
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% (h... This Memo was issued on matching viloation for... 3979
An optimal solution I'm looking for (if doable) is to print the entire line such that:
ID DIS_CASE_ID CREATION_DATE type_2 LPQ_REASON LPQ_RESOLUTION RESOLUTION_CODE
0 727990 61180481 2017-01-05 13:47:05 7891 The LPQ we know is shorto add is 25% (here This Memo was issued with conjunction to our 3979
comes the rest of the sentence. it might be analysis to foster a better bs when writing
long, or not, it might be short or whatever)
1 727889 61180482 2017-01-05 13:51:09 7891 The LPQ he collide will increase 15% yes and This Memo was issued on matching viloation for 3979
here I'm going to write the entire sentence who cares on what violation. just issued.
as if I really remember what was written. ha

Not as optimal as you want, but you can try the following:
pd.set_option('display.max_colwidth',100)
where 100 is the column width you can choose. but this will not create a multi-line cell but rather a very long column.
or:
not much elegant, but
you can try and use 'tabulate' library (https://pypi.python.org/pypi/tabulate) which creates nice text tables like:
+--------+-------+
| item | qty |
+========+=======+
| spam | 42 |
+--------+-------+
| eggs | 451 |
+--------+-------+
| bacon | 0 |
+--------+-------+
with tabulate you can use the '\n' new-line character.
just iterate over your text cells and put an '\n' every X characters (lets say every 50 characters).
a simple code for that:
for i in range(len(data)):
data.at[i,'text'] = data.at[i,'text'][0:50] + '\n' + data.at[i,'text'][50:]
the above is limited to only one line break, but you can improve it to make multi breaks for a long text. and also doesn't take into consideration whether it breaks in a middle of a word.
!Make sure to do that on a copy of the data, because it changes your data. and if you'll try to print it with a regular 'print' then you will see the '\n' stuck inside the middle of the text!

How to efficiently process a large file with a grouping variable in Python

I've got a dataset that looks something like the following:
ID Group
1001 2
1006 2
1008 1
1027 2
1013 1
1014 4
So basically, a long list of unsorted IDs with a grouping variable as well.
At the moment, I want to take subsets of this list based on the generation of a random number (imagine they're being drafted, or won the lottery, etc.). Right now, this is the code I'm using to just process it row-by-row, by ID.
reader = csv.reader(open(inputname), delimiter=' ')
out1 = open(output1name,'wb')
out2 = open(output2name,'wb')
for row in reader:
assignment = gcd(1,p,marg_rate,rho)
if assignment[0,0]==1:
out1.write(row[0])
out1.write("\n")
if assignment[0,1]==1:
out2.write(row[0])
out2.write("\n")
Basically, i the gcd() function goes one way, you get written to one file, another way to a second, and then some get tossed out. The issue is I'd now like to do this by Group rather than ID - basically, I'd like to assign values to the first time a member of the group appears, and then apply it to all members of that group (so for example, if 1001 goes to File 2, so does 1006 and 1027).
Is there an efficient way to do this in Python? The file's large enough that I'm a little wary of my first thought, which was to do the assignments in a dictionary or list and then have the program look it up for each line.

I used random.randint to generate a random number, but this can be easily replaced.
The idea is to use a defaultdict to have single score (dict keys are unique) for a group from the moment it's created:
import csv
import random
from collections import defaultdict
reader = csv.DictReader(open(inputname), delimiter=' ')
out1 = open(output1name,'wb')
out2 = open(output2name,'wb')
# create a dictionary with a random default integer value [0, 1] for
# keys that are accessed for the first time
group_scores = defaultdict(lambda: random.randint(0,1))
for row in reader:
# set a score for current row according to it's group
# if none found - defaultdict will call it's lambda for new keys
# and create a score for this row and all who follow
score = group_scores[row['Group']]
if score==0:
out1.write(row['ID'])
out1.write("\n")
if score==1:
out2.write(row['ID'])
out2.write("\n")
out1.close()
out2.close()
I've also used DictReader which I find nicer for csv files with headers.
Tip: you may want to use the with context manager to open files.
Example output:
reut#sharabani:~/python/ran$ cat out1.txt
1001
1006
1008
1027
1013
reut#sharabani:~/python/ran$ cat out2.txt
1014

Sounds like you're looking for a mapping. You can use dicts for that.
Once you've first decided 1001 goes to file 2, you can add to your mapping dict.
fileMap={}
fileMap[group]="fileName"
And then, when you need to check if the group has been decided yet, you just
>>>group in fileMap
True
This is instead of mapping every ID to a filename. Just map the groups.
Also, I'm wondering about whether it's worth condsidering batching the writes with .write([aListofLines]).

Merge two lists of objects containing lists

I have a directory tree containing html files called slides. Something like:
slides_root
|
|_slide-1
| |_slide-1.html
| |_slide-2.html
|
|_slide-2
| |
| |_slide-1
| | |_slide-1.html
| | |_slide-2.html
| | |_slide-3.html
| |
| |_slide-2
| |_slide-1.html
...and so on. They could go even deeper. Now imagine I have to replace some slides in this structure by merging it with another tree which is a subset of this.
WITH AN EXAMPLE: say that I want to replace slide-1.html and slide-3.html inside "slides_root/slide-2/slide-1" merging "slides_root" with:
slide_to_change
|
|_slide-2
|
|_slide-1
|_slide-1.html
|_slide-3.html
I would merge "slide_to_change" into "slides_root". The structure is the same so everything goes fine. But I have to do it in a python object representation of this scheme.
So the two trees are represented by two instances - slides1, slides2 - of the same "Slide" class which is structured as follows:
Slide(object):
def __init__(self, path):
self.path = path
self.slides = [Slide(path)]
Both slide1 and slide2 contains a path and a list that contain other Slide objects with other path and list of Slide objects and so on.
The rule is that if the the relative path is the same then I would replace the slide object in slide1 with the one in slide2.
How can achieve this result? It is really difficult and I can see no way out. Ideally something like:
for slide_root in slide1.slides:
for slide_dest in slide2.slides:
if slide_root.path == slide_dest.path:
slide_root = slide_dest
// now restart the loop at a deeper level
// repeat
Thank everyone for any answer.

Sounds not so complicated.
Just use a recursive function for walking the to-be-inserted tree and keep a hold on the corresponding place in the old tree.
If the parts match:
If the parts are both leafs (html thingies):
Insert (overwrite) the value.
If the parts are both nodes (slides):
Call yourself with the subslides (here's the recursion).
I know this is just kind of a hint, just kind of a sketch on how to do it. But maybe you want to start on this. In Python it could look sth like this (also not completely fleshed out):
def merge_slide(slide, old_slide):
for sub_slide in slide.slides:
sub_slide_position_in_old_slide = find_sub_slide_position_by_path(sub_slide.path)
if sub_slide_position_in_old_slide >= 0: # we found a match!
sub_slide_in_old_slide = old_slide.slides[sub_slide_position_in_old_slide]
if sub_slide.slides: # this is a node!
merge_slide(sub_slide, sub_slide_in_old_slide) # here we recurse
else: # this is a leaf! so we replace it:
old_slide[sub_slide_position_in_old_slide] = sub_slide
else: # nothing like this in old_slide
pass # ignore (you might want to consider this case!)
Maybe that gives you an idea on how I would approach this.

python file manipulations (bash script porting)

I am attempting to rewrite some of my old bash scripts that I think are very inefficient (not to mention inelegant) and use some horrid piping...Perhaps somebody with real Python skills can give me some pointers...
The script makes uses of multiple temp files...another thing I think is a bad style and probably can be avoided...
It essentially manipulates INPUT-FILE by first cutting out certain number of lines from the top (discarding heading).
Then it pulls out one of the columns and:
calculate number of raws = N;
throws out all duplicate entries from this single column file (I use sort -u -n FILE > S-FILE).
After that, I create a sequential integer index from 1 to N and paste this new index column into the original INPUT-FILE using paste command.
My bash script then generates Percentile Ranks for the values we wrote into S-FILE.
I believe Python leverage scipy.stats, while in bash I determine number of duplicate lines (dupline) for each unique entry in S-FILE, and then calculated per-rank=$((100*($counter+$dupline/2)/$length)), where $length= length of FILE and not S-FILE. I then would print results into a separate 1 column file (and repeat same per-rank as many times as we have duplines).
I would then paste this new column with percentile ranks back into INPUT-FILE (since I would sort INPUT-FILE by the column used for calculation of percentile ranks - everything would line up perfectly in the result).
After this, it goes into the ugliness below...
sort -o $INPUT-FILE $INPUT-FILE
awk 'int($4)>2000' $INPUT-FILE | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 2000-$INPUT-FILE
diff $INPUT-FILE 2000-$INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>1000' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 1000-$INPUT-FILE
cat 2000-$INPUT-FILE 1000-$INPUT-FILE | sort > merge-$INPUT-FILE
diff merge-$INPUT-FILE $INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>500' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 500-$INPUT-FILE
rm merge-$INPUT-FILE
Essentially, this is a very inelegant bash way of doing the following:
RANDOMLY select 500 lines from $INPUT-FILE where value in column 4 is greater then 2000 and write it out to file 2000-$INPUT-FILE
For all REMAINING lines in $INPUT-FILE, randomly select 500 lines where value in column 4 is greater then 1000 and write it out to file 1000-$INPUT-FILE
For all REMAINING lines in $INPUT-FILE after 1) and 2), randomly select 500 lines where value in column 4 is greater then 500 and write it out to file 500-$INPUT-FILE
Again, I am hoping somebody can help me in reworking this ugly piping thing into a thing of python beauty! :) Thanks!

Two crucial points in the comments:
(A) The file is ~50k lines of ~100 characters. Small enough to comfortably fit in memory on modern desktop/server/laptop systems.
(B) The author's main question is about how to keep track of lines that have already been chosen, and don't choose them again.
I suggest three steps.
(1) Go through the file, making three separate lists -- call them u, v, w -- of the line numbers which satisfy each of the criteria. These lists may have more than 500 lines, and they may contain duplicates, but we will get rid of these problems in step (2).
u = []
v = []
w = []
with open(filename, "r") as f:
for linenum, line in enumerate(f):
x = int(line.split()[3])
if x > 2000:
u.append(x)
if x > 1000:
v.append(x)
if x > 500:
w.append(x)
(2) Choose line numbers. You can use the builtin Random.sample() to pick a sample of k elements from a population. We want to remove elements that have previously been chosen, so keep track of such elements in a set. (The "chosen" collection is a set instead of a list because the test "if x not in chosen" is O(log(n)) for a set, but O(n) for a list. Change it to a list and you'll see slowdown if you measure the timings precisely, though it might not be a noticeable delay for a data set of "only" 50k data points / 500 samples / 3 categories.)
import random
rand = random.Random() # change to random.Random(1234) for repeatable results
chosen = set()
s0 = rand.sample(u, 500)
chosen.update(s0)
s1 = rand.sample([x for x in v if x not in chosen], 500)
chosen.update(s1)
s2 = rand.sample([x for x in w if x not in chosen], 500)
chosen.update(s2)
(3) Do another pass through the input file, putting lines whose numbers are s0 into your first output file, lines whose numbers are in s1 into your second output file, and lines whose numbers are in s2 into your third output file. It's pretty trivial in any language, but here's an implementation which uses Python "idioms":
linenum2sample = dict([(x, 0) for x in s0]+[(x, 1) for x in s1]+[(x, 2) for x in s2])
outfile = [open("-".join(x, filename), "w") for x in ["2000", "1000", "500"]]
try:
with open(filename, "r") as f:
for linenum, line in enumerate(f):
s = linenum2sample.get(linenum)
if s is not None:
outfile[s].write(line)
finally:
for f in outfile:
f.close()

Break it up into easy pieces.
Read the file using csv.DictReader, or csv.reader if the headers are unusable. As you're iterating through the lines, check the value of column 4 and insert the lines into a dictionary of lists where the dictionary keys are something like 'gt_2000', 'gt_1000', 'gt_500'.
Iterate through your dictionary keys and for each, create a file and do a loop of 500 and for each iteration, use random.randint(0, len(the_list)-1) to get a random index of the list, write it to the file, then delete the item at that index from the list. If there could ever be fewer than 500 items in any bucket then this will require a tiny bit more.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.