I am attempting to rewrite some of my old bash scripts that I think are very inefficient (not to mention inelegant) and use some horrid piping...Perhaps somebody with real Python skills can give me some pointers...
The script makes use of multiple temp files... another thing I think is bad style and can probably be avoided...
It essentially manipulates INPUT-FILE by first cutting a certain number of lines from the top (discarding the heading).
Then it pulls out one of the columns and:
calculates the number of rows = N;
throws out all duplicate entries from this single-column file (I use sort -u -n FILE > S-FILE).
After that, I create a sequential integer index from 1 to N and paste this new index column into the original INPUT-FILE using the paste command.
My bash script then generates Percentile Ranks for the values we wrote into S-FILE.
I believe Python could leverage scipy.stats for this; in bash I determine the number of duplicate lines (dupline) for each unique entry in S-FILE, and then calculate per-rank=$((100*($counter+$dupline/2)/$length)), where $length is the length of FILE, not S-FILE. I then print the results into a separate one-column file (repeating the same per-rank as many times as there are duplines).
I would then paste this new column with percentile ranks back into INPUT-FILE (since I would sort INPUT-FILE by the column used for calculation of percentile ranks - everything would line up perfectly in the result).
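For reference, I imagine the scipy.stats version of that percentile step would look something like the sketch below; the column index, header handling, and file name are guesses on my part, not taken from my actual script:

from scipy import stats

with open("INPUT-FILE") as f:
    next(f)  # skip the heading line(s); adjust to however many lines get discarded
    values = [float(line.split()[3]) for line in f]   # assumed: the ranked column is column 4

# rankdata(..., method='average') gives count_below + (ties + 1)/2, so this
# reproduces per-rank = 100*(counter + dupline/2)/length from the bash version.
pct_ranks = 100.0 * (stats.rankdata(values, method='average') - 0.5) / len(values)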
After this, it goes into the ugliness below...
sort -o $INPUT-FILE $INPUT-FILE
awk 'int($4)>2000' $INPUT-FILE | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 2000-$INPUT-FILE
diff $INPUT-FILE 2000-$INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>1000' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 1000-$INPUT-FILE
cat 2000-$INPUT-FILE 1000-$INPUT-FILE | sort > merge-$INPUT-FILE
diff merge-$INPUT-FILE $INPUT-FILE | sed '/^[0-9][0-9]*/d; s/^. //; /^---$/d' | awk 'int($4)>500' | awk -v seed=$RANDOM 'BEGIN{srand(seed);} {print rand()"\t"$0}' | sort -k1 -k2 -n | cut -f2- | head -n 500 > 500-$INPUT-FILE
rm merge-$INPUT-FILE
Essentially, this is a very inelegant bash way of doing the following:
RANDOMLY select 500 lines from $INPUT-FILE where the value in column 4 is greater than 2000 and write them out to file 2000-$INPUT-FILE
For all REMAINING lines in $INPUT-FILE, randomly select 500 lines where the value in column 4 is greater than 1000 and write them out to file 1000-$INPUT-FILE
For all REMAINING lines in $INPUT-FILE after 1) and 2), randomly select 500 lines where the value in column 4 is greater than 500 and write them out to file 500-$INPUT-FILE
Again, I am hoping somebody can help me in reworking this ugly piping thing into a thing of python beauty! :) Thanks!
Two crucial points in the comments:
(A) The file is ~50k lines of ~100 characters. Small enough to comfortably fit in memory on modern desktop/server/laptop systems.
(B) The author's main question is about how to keep track of lines that have already been chosen, and avoid choosing them again.
I suggest three steps.
(1) Go through the file, making three separate lists -- call them u, v, w -- of the line numbers which satisfy each of the criteria. These lists may hold more than 500 line numbers each, and they will overlap (a line whose column 4 exceeds 2000 also exceeds 1000 and 500), but we will get rid of these problems in step (2).
u = []
v = []
w = []
with open(filename, "r") as f:
    for linenum, line in enumerate(f):
        x = int(line.split()[3])      # column 4
        if x > 2000:
            u.append(linenum)         # store the line number, not the value
        if x > 1000:
            v.append(linenum)
        if x > 500:
            w.append(linenum)
(2) Choose line numbers. You can use random.sample() (or the sample() method of a random.Random instance) to pick a sample of k elements from a population. We want to remove elements that have previously been chosen, so keep track of such elements in a set. (The "chosen" collection is a set instead of a list because the membership test "if x not in chosen" is O(1) on average for a set, but O(n) for a list. Change it to a list and you'll see a slowdown if you measure the timings precisely, though it might not be a noticeable delay for a data set of "only" 50k data points / 500 samples / 3 categories.)
import random
rand = random.Random() # change to random.Random(1234) for repeatable results
chosen = set()
s0 = rand.sample(u, 500)
chosen.update(s0)
s1 = rand.sample([x for x in v if x not in chosen], 500)
chosen.update(s1)
s2 = rand.sample([x for x in w if x not in chosen], 500)
chosen.update(s2)
(3) Do another pass through the input file, putting lines whose numbers are in s0 into your first output file, lines whose numbers are in s1 into your second output file, and lines whose numbers are in s2 into your third output file. It's pretty trivial in any language, but here's an implementation which uses Python "idioms":
linenum2sample = dict([(x, 0) for x in s0]+[(x, 1) for x in s1]+[(x, 2) for x in s2])
outfile = [open("-".join([x, filename]), "w") for x in ["2000", "1000", "500"]]
try:
    with open(filename, "r") as f:
        for linenum, line in enumerate(f):
            s = linenum2sample.get(linenum)
            if s is not None:
                outfile[s].write(line)
finally:
    for f in outfile:
        f.close()
Break it up into easy pieces.
Read the file using csv.DictReader, or csv.reader if the headers are unusable. As you're iterating through the lines, check the value of column 4 and insert the lines into a dictionary of lists where the dictionary keys are something like 'gt_2000', 'gt_1000', 'gt_500'.
Iterate through your dictionary keys and, for each, create a file and do a loop of 500; on each iteration, use random.randint(0, len(the_list)-1) to get a random index of the list, write that item to the file, then delete the item at that index from the list. If there could ever be fewer than 500 items in any bucket then this will require a tiny bit more care. A rough sketch follows.
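Here is roughly what I mean (the tab delimiter, the output file names, and the min(500, ...) guard for short buckets are my own assumptions, not from the question):

import csv
import random

buckets = {'gt_2000': [], 'gt_1000': [], 'gt_500': []}

with open('INPUT-FILE', newline='') as f:
    for row in csv.reader(f, delimiter='\t'):
        val = int(row[3])                      # column 4
        if val > 2000:                         # each line lands in exactly one bucket
            buckets['gt_2000'].append(row)
        elif val > 1000:
            buckets['gt_1000'].append(row)
        elif val > 500:
            buckets['gt_500'].append(row)

for name, rows in buckets.items():
    with open(name + '.txt', 'w', newline='') as out:
        writer = csv.writer(out, delimiter='\t')
        for _ in range(min(500, len(rows))):   # handles buckets with fewer than 500 rows
            i = random.randint(0, len(rows) - 1)
            writer.writerow(rows.pop(i))       # write it, then remove it so it can't repeat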
Related
There are 3 columns, levels 1-3. A file is read, and each line of the file contains various data, including the level to which it belongs, located at the end of the string.
Sample lines from file being read:
thing_1 - level 1
thing_17 - level 3
thing_22 - level 2
I want to assign each "thing" to its corresponding column. I have looked into pandas, but it would seem that DataFrame columns won't work, as the passed data would need to have attributes that match the number of columns, whereas in my case I need 3 columns but each piece of data has only 1 data point.
How could I approach this problem?
Desired output:
level 1 level 2 level 3
thing_1 thing_22 thing_17
Edit:
In looking at the suggestions, I can refine my question further. I have up to 3 columns, and each line from the file needs to be assigned to one of the 3 columns. Most solutions seem to need something like:
data = [['Mary', 20], ['John', 57]]
columns = ['Name', 'Age']
This does not work for me, since there are 3 columns, and each piece of data goes into only one.
There's an additional wrinkle here that I didn't notice at first. If each of your levels has the same number of things, then you can build a dictionary and then use it to supply the table's columns to PrettyTable:
from prettytable import PrettyTable
# Create an empty dictionary.
levels = {}
with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # If this is a new level, set it to a list containing its thing.
        if level not in levels:
            levels[level] = [thing]
        # Otherwise, add the new thing to the level's list.
        else:
            levels[level].append(thing)
# Create the table, and add each level as a column
table = PrettyTable()
for level, things in levels.items():
    table.add_column(level, things)
print(table)
For the example data you showed, this prints:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
+---------+----------+----------+
The Complication
I probably wouldn't have posted an answer (believing it was covered sufficiently in this answer), except that I realized there's an unintuitive hurdle here. If your levels contain different numbers of things each, you get an error like this:
Exception: Column length 2 does not match number of rows 1!
Because none of the readily available solutions has an obvious, "automatic" fix for this, here is a simple way to do it. Build the dictionary as before, then:
# Find the length of the longest list of things.
longest = max(len(things) for things in levels.values())
table = PrettyTable()
for level, things in levels.items():
    # Pad out the list if it's shorter than the longest.
    things += ['-'] * (longest - len(things))
    table.add_column(level, things)
print(table)
This will print something like this:
+---------+----------+----------+
| level 1 | level 3 | level 2 |
+---------+----------+----------+
| thing_1 | thing_17 | thing_22 |
| - | - | thing_5 |
+---------+----------+----------+
Extra
If all of that made sense and you'd like to know about a way part of it can be streamlined a little, take a look at Python's defaultdict. It can take care of the "check if this key already exists" process, providing a default (in this case a new list) if nothing's already there.
from collections import defaultdict
levels = defaultdict(list)
with open('data.txt') as f:
    for line in f:
        # Remove trailing \n and split into the parts we want.
        thing, level = line.rstrip('\n').split(' - ')
        # Automatically handles adding a new key if needed:
        levels[level].append(thing)
I have a .csv with several million rows. The first column is the id of each entry, and each id only occurs one time. The first column is sorted. Intuitively I'd say that it might be pretty easy to query this file efficiently using a divide and conquer algorithm. However, I couldn't find anything related to this.
Sample .csv file:
+----+------------------+-----+
| id | name | age |
+----+------------------+-----+
| 1 | John Cleese | 34 |
+----+------------------+-----+
| 3 | Mary Poppins | 35 |
+----+------------------+-----+
| .. | ... | .. |
+----+------------------+-----+
| 87 | Barry Zuckerkorn | 45 |
+----+------------------+-----+
I don't want to load the file in memory (too big), and I prefer to not use databases. I know I can just import this file in sqlite, but then I have multiple copies of this data, and I'd prefer to avoid that for multiple reasons.
Is there a good package I'm overlooking? Or is it something that I'd have to write myself?
Ok, my understanding is that you want some of the functionalities of a light database, but are constrained to use a csv text file to hold the data. IMHO, this is probably a questionable design: past several hundred rows, I would only see a csv file as an intermediate or exchange format.
As it is a very uncommon design, it is unlikely that a package for it already exists - for my part I know of none. So I would imagine two possible ways: scan the file once and build an index id -> row_position, then use that index for your queries (rough sketch below). Depending on the actual length of your rows, you could index only every n-th row to trade speed for memory. But it costs you an index file.
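A minimal sketch of that first idea (the file name, the comma separator, and the single header row are assumptions on my part):

index = {}                              # id -> byte offset of the start of its row
with open('data.csv', 'rb') as f:       # binary mode, so offsets are exact byte positions
    f.readline()                        # skip the header row
    offset = f.tell()
    for row in f:
        index[int(row.split(b',')[0])] = offset
        offset += len(row)              # row includes its line ending

def get_row(fname, wanted_id):
    with open(fname, 'rb') as f:
        f.seek(index[wanted_id])        # jump straight to the indexed row
        return f.readline().decode()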
An alternative way would be a direct divide-and-conquer algorithm: use stat/fstat to get the file size, and search for the next end of line starting at the middle of the file. You immediately get an id after it. If the id you want is that one, fine, you have won; if it is greater, recurse into the upper part; if it is lesser, recurse into the lower part. But because of the need to search for ends of lines, be prepared for corner cases like never finding an end of line in the expected range, or finding it right at the end.
After Serge's answer I decided to write my own implementation; here it is. It doesn't allow newlines inside fields and doesn't deal with a lot of details regarding the .csv format. It assumes that the .csv is sorted on the first column, and that the first column holds integer values.
import os

def query_sorted_csv(fname, id):
    filesize = os.path.getsize(fname)
    with open(fname) as fin:
        row = look_for_id_at_location(fin, 0, filesize, id)
    if not row:
        raise Exception('id not found!')
    return row

def look_for_id_at_location(fin, location_lower, location_upper, id, sep=',', id_column=0):
    location = int((location_upper + location_lower) / 2)
    if location_upper - location_lower < 2:
        return False
    fin.seek(location)
    next(fin)  # skip the (probably partial) line we landed in
    try:
        full_line = next(fin)
    except StopIteration:
        return False
    id_at_location = int(full_line.split(sep)[id_column])
    if id_at_location == id:
        return full_line
    if id_at_location > id:
        return look_for_id_at_location(fin, location_lower, location, id)
    else:
        return look_for_id_at_location(fin, location, location_upper, id)

row = query_sorted_csv('data.csv', 505)
You can look up about 4000 ids per second in a 2 million row 250MB .csv file. In comparison, you can look up 3 ids per second whilst looping over the entire file line by line.
I want to clean out all the "waste" (content that makes the files unsuitable for analysis) in unstructured text files.
In this specific situation, one option to retain only the wanted information is to keep all numbers above 250 (the text is a combination of strings, numbers, ...).
For a large number of text files, I want to do the following action in R:
x <- x[which(x >= "250"),]
The code for one text file (above) works perfectly; when I try to do the same in a loop over the large number of text files, it fails (error: incorrect number of dimensions).
for(i in 1:length(files)){
    i <- i[which(i >= "250"),]
}
Does anyone have any idea how to solve this in R (or Python)?
picture: very simplified example of a text file, I want to retain everything between (START) and (END)
This makes no sense if it is 10K files; why are you even trying to do this in R or Python? Why not just a simple awk or bash command? Moreover, your image shows parsing info between START and END from the text files; I'm not sure it is a data frame with columns across (try to provide a simple dput rather than images).
All you are trying to do is a grep between START and END across 10K files. I would do that in bash.
something like this in bash should work.
for i in *.txt
do
    sed -n '/START/,/END/{//!p}' "$i" > "$i.edited.txt"
done
If the columns are standard across files, in R you can do the following (but I would not read 10K files into R memory).
Read the files as a list of data frames, then simply do an lapply:
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a ,b = b,c = c)
lapply(df_list, subset, col1 >= 250)
I have a collection of large (~100,000,000 line) text files in the format:
0.088293 1.3218e-32 2.886e-07 2.378e-02 21617 28702
0.111662 1.1543e-32 3.649e-07 1.942e-02 93804 95906
0.137970 1.2489e-32 4.509e-07 1.917e-02 89732 99938
0.149389 8.0725e-32 4.882e-07 2.039e-02 71615 69733
...
And I'd like to find the mean and sum of column 2 and maximum and minimum values of columns 3 and 4, and the total number of lines. How can I do this efficiently using NumPy? Because of their size, loadtxt and genfromtxt are no good (take a long time to execute) since they attempt to read the whole file into memory. In contrast, Unix tools like awk:
awk '{ total += $2 } END { print total/NR }' <filename>
work in a reasonable amount of time.
Can Python/NumPy do the job of awk for such big files?
You can say something like:
awk '{ total2 += $2
for (i=2;i<=3;i++) {
max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i
min[i]=(length(min[i]) && min[i]<$i)?min[i]:$i
}
} END {
print "items", "average2", "min2", "min3", "max2", "max3"
print NR, total2/NR, min[2], min[3], max[2], max[3]
}' file
Test
With your given input:
$ awk '{total2 += $2; for (i=2;i<=3;i++) {max[i]=(length(max[i]) && max[i]>$i)?max[i]:$i; min[i]=((length(min[i]) && min[i]<$i)?min[i]:$i)}} END {print "items", "average2", "min2", "min3", "max2", "max3"; print NR, total2/NR, min[2], min[3], max[2], max[3]}' a | column -t
items average2 min2 min3 max2 max3
4 2.94938e-32 1.1543e-32 2.886e-07 8.0725e-32 4.882e-07
Loop through the lines and apply a regex (or a plain split) to extract the data you are looking for, appending it to an initially empty list for each column you need.
Once you have the columns in list form you can apply max(), min(), and an average (e.g. sum(values)/len(values)) to get whatever calculations you are interested in.
Note: you may need to watch where you add the data to the lists and convert the numbers from str to float so that the max, min, and average calculations can operate on them.
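A minimal sketch of that idea; the file name is made up, and it keeps running totals instead of whole lists so memory stays flat on 100-million-line files:

import math

count = 0
total2 = 0.0                             # running sum of column 2
min3 = min4 = math.inf
max3 = max4 = -math.inf

with open('data.txt') as f:
    for line in f:
        cols = line.split()
        total2 += float(cols[1])                                            # column 2
        min3, max3 = min(min3, float(cols[2])), max(max3, float(cols[2]))   # column 3
        min4, max4 = min(min4, float(cols[3])), max(max4, float(cols[3]))   # column 4
        count += 1

print(count, total2, total2 / count, min3, max3, min4, max4)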
I have a large 40 million line, 3 gigabyte text file (probably wont be able to fit in memory) in the following format:
399.4540176 {Some other data}
404.498759292 {Some other data}
408.362737492 {Some other data}
412.832976111 {Some other data}
415.70665675 {Some other data}
419.586515381 {Some other data}
427.316825959 {Some other data}
.......
Each line starts off with a number and is followed by some other data. The numbers are in sorted order. I need to be able to:
Given a number x and a range y, find all the lines whose number is within y of x. For example, if x=20 and y=5, I need to find all lines whose number is between 15 and 25.
Store these lines into another separate file.
What would be an efficient method to do this without having to trawl through the entire file?
If you don't want to generate a database ahead of time for line lengths, you can try this:
import os
import sys

# Configuration, change these to suit your needs
maxRowOffset = 100  # increase this if some lines are being missed
fileName = 'longFile.txt'
x = 2000
y = 25

# seek backwards to the first occurrence of byte c before the current position
def seekTo(f, c):
    while f.read(1) != c:
        f.seek(-2, 1)

def parseRow(row):
    # the leading numbers are floats, so parse them as such
    return (float(row.split(None, 1)[0]), row)

minRow = x - y
maxRow = x + y
step = os.path.getsize(fileName) / 2.

with open(fileName, 'rb') as f:  # binary mode: relative seeks are not allowed on text files
    while True:
        f.seek(int(step), 1)
        seekTo(f, b'\n')
        row = parseRow(f.readline())
        if row[0] < minRow:
            if minRow - row[0] < maxRowOffset:
                with open('outputFile.txt', 'wb') as fo:
                    for row in f:
                        row = parseRow(row)
                        if row[0] > maxRow:
                            sys.exit()
                        if row[0] >= minRow:
                            fo.write(row[1])
            else:
                step /= 2.
                step = step * -1 if step < 0 else step
        else:
            step /= 2.
            step = step * -1 if step > 0 else step
It starts by performing a binary search on the file until it lands on a line whose number is below, but within maxRowOffset of, x-y. Then it reads line by line, writing every line whose number is at least x-y to an output file, until it finds a line greater than x+y, at which point the program exits.
I tested this on a 1,000,000 line file and it runs in 0.05 seconds. Compare this to reading every line which took 3.8 seconds.
You need random access to the lines, which you won't get with a text file unless the lines are all padded to the same length.
One solution is to dump the table into a database (such as SQLite) with two columns, one for the number and one for all the other data (assuming that the data is guaranteed to fit into whatever the maximum number of characters allowed in a single column in your database is). Then index the number column and you're good to go.
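For what it's worth, the SQLite route is only a few lines with the standard sqlite3 module (a rough sketch; the database, table, and column names are made up):

import sqlite3

conn = sqlite3.connect('lines.db')
conn.execute('CREATE TABLE lines (number REAL, rest TEXT)')

with open('data.txt') as f:
    conn.executemany(
        'INSERT INTO lines VALUES (?, ?)',
        ((float(num), rest.rstrip('\n'))
         for num, rest in (line.split(None, 1) for line in f if line.strip())))

conn.execute('CREATE INDEX idx_number ON lines(number)')
conn.commit()

# Range query, answered from the index:
x, y = 410.0, 5.0
for number, rest in conn.execute(
        'SELECT number, rest FROM lines WHERE number BETWEEN ? AND ?',
        (x - y, x + y)):
    print(number, rest)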
Without a database, you could read through the file once and create an in-memory data structure containing pairs of values (number, line-offset). You calculate the line-offset by adding up the lengths of each row (including the line ending). Now you can binary search these value pairs on number and randomly access the lines in the file using the offset. If you need to repeat the search later, pickle the in-memory structure and reload it for later re-use.
This reads the entire file (which you said you don't want to do), but does so only once to build the index. After that you can execute as many requests against the file as you want and they will be very fast.
Note that this second solution is essentially creating a database index on your text file.
Rough code to create the index in second solution:
import pickle

offset = 0
index = []  # probably a better structure to use than a list
with open(filename, 'rb') as f:  # binary mode so len(row) matches the byte offset exactly
    for row in f:
        nbr = float(row.split()[0])
        index.append([nbr, offset])
        offset += len(row)  # row already includes its line ending
pickle.dump(index, open('filename.idx', 'wb'))  # saves it for future use
Now you can perform a binary search on the list; the standard bisect module will do this for you on a sorted list. There's probably a better data structure to use for accruing the index values than a list, but I'd have to read up on the various collection types.
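For the lookup itself, here's a sketch using bisect on the index built above (the function name and the [lo, hi] range semantics are my own):

import bisect

numbers = [nbr for nbr, _ in index]            # index is the sorted [number, offset] list

def lines_in_range(filename, lo, hi):
    """Yield every line whose leading number lies in [lo, hi]."""
    start = bisect.bisect_left(numbers, lo)    # first indexed number >= lo
    if start == len(index):
        return
    with open(filename, 'rb') as f:            # binary mode, to match the byte offsets
        f.seek(index[start][1])                # jump straight to that line
        for raw in f:
            line = raw.decode()
            if float(line.split()[0]) > hi:
                break
            yield line

# e.g. copy the matching lines to another file:
# with open('matches.txt', 'w') as out:
#     for line in lines_in_range(filename, x - y, x + y):
#         out.write(line)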
Since you want to match the first field, you can use gawk:
$ gawk '{if ($1 >= 15 && $1 <= 25) { print }; if ($1 > 25) { exit }}' your_file
Edit: Taking a file with 261,775,557 lines that is 2.5 GiB big and searching for lines 50,010,015 to 50,010,025, this takes 27 seconds on my Intel(R) Core(TM) i7 CPU 860 @ 2.80GHz. Sounds good enough for me.
In order to find the line that starts with the number just above your lower limit, you have to go through the file line by line until you find that line. No other way, i.e. all data in the file has to be read and parsed for newline characters.
We have to run this search up to the first line that exceeds your upper limit and stop. Hence, it helps that the file is already sorted. This code will hopefully help:
with open(outpath, 'w') as outfile:
    with open(inpath) as infile:
        for line in infile:
            t = float(line.split()[0])
            if lower_limit <= t <= upper_limit:
                outfile.write(line)
            elif t > upper_limit:
                break
I think theoretically there is no other option.