Search and sort data from several files - python

I have a set of 1000 text files with names in_s1.txt, in_s2.txt and so on. Each file contains millions of rows and each row has 7 columns like:
ccc245 1 4 5 5 3 -12.3
The most important values for me are those in the first and seventh columns, i.e. pairs like ccc245, -12.3.
What I need to do is to find, across all the in_sXXXX.txt files, the 10 cases with the lowest values in the seventh column, and I also need to know where each value is located, i.e. in which file. I need something like:
FILE 1st_col 7th_col
in_s540.txt ccc3456 -9000.5
in_s520.txt ccc488 -723.4
in_s12.txt ccc34 -123.5
in_s344.txt ccc56 -45.6
I was thinking about using python and bash for this purpose, but so far I have not found a practical approach. All I know how to do is:
concatenate all in_ files in IN.TXT
search the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top ten list, use them to filter the in_s files, using grep -n VALUE in_s*, so I get for each value the name of the file
It works, but it is a bit tedious. I wonder whether there is a faster approach using only bash or python, or both. Or another, better language for this.
Thanks

In python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
Update after above answer accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)

I would:
take the first 10 items,
sort them, and then
for every line read from the files, insert the element into that top 10
in case its value is lower than the highest one in the current top 10
(keeping the list sorted for performance).
I wouldn't post the complete program here as it looks like homework.
Yes, if it were much more than ten, this would not be optimal.

Try something like this in python:
import os

min_values = []

def add_to_min(file_name, one, seven):
    # checks to see if the 7th column is lower than the existing values
    if len(min_values) < 10 or seven < max(min_values)[0]:
        min_values.sort()
        # once we already hold 10 values, let's remove the biggest one
        if len(min_values) >= 10:
            min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))

# print answers
for (seven, file_name, one) in sorted(min_values):
    print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
values = []

# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f:
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))

# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven
Just re-read your question: with millions of rows, you might run out of RAM....
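If memory does become a problem, here is a sketch of a memory-bounded variant in the same spirit as the heapq answer above (the directory path is a placeholder, column positions as in the question):
import os, heapq

def rows(directory):
    # yield (7th_col_value, file_name, 1st_col_value) for every row of every file
    for file_name in os.listdir(directory):
        with open(os.path.join(directory, file_name)) as f:
            for line in f:
                columns = line.split()
                yield float(columns[6]), file_name, columns[0]

# nsmallest is designed to keep only a bounded number of candidates around
# (see the heapq answer above for caveats on older Python versions)
for seven, file_name, one in heapq.nsmallest(10, rows("/path/to/files")):
    print file_name, one, seven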

A small improvement of your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8

This might be close to what you're looking for:
for file in *; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out

If your files are millions of lines each, you might want to consider "buffering". The script below goes through those lines, each time comparing field 7 with the values already in the buffer. If a value is smaller than one in the buffer, that buffer entry is replaced by the new, lower value.
for file in in_*.txt
do
  awk -vt=$t 'NR<=10{
    c=c+1
    val[c]=$7
    tag[c]=$1
  }
  NR>10{
    for(o=1;o<=c;o++){
      if ( $7 <= val[o] ){
        val[o]=$7
        tag[o]=$1
        break
      }
    }
  }
  END{
    for(i=1;i<=c;i++){
      print val[i], tag[i] | "sort"
    }
  }' $file
done


Split large CSV file based on row value

The problem
I have a csv file called data.csv. On each row I have:
timestamp: int
account_id: int
data: float
for instance:
timestamp,account_id,value
10,0,0.262
10,0,0.111
13,1,0.787
14,0,0.990
This file is ordered by timestamp.
The number of rows is too big to store all rows in memory.
Order of magnitude: 100 M rows, number of accounts: 5 M
How can I quickly get all rows of a given account_id? What would be the best way to make the data accessible by account_id?
Things I tried
to generate a sample:
import os
import random
import shutil
import tqdm

N_ROW = 10**6
N_ACCOUNT = 10**5

# Generate data to split
with open('./data.csv', 'w') as csv_file:
    csv_file.write('timestamp,account_id,value\n')
    for timestamp in tqdm.tqdm(range(N_ROW), desc='writing csv file to split'):
        account_id = random.randint(1, N_ACCOUNT)
        data = random.random()
        csv_file.write(f'{timestamp},{account_id},{data}\n')

# Clean result folder
if os.path.isdir('./result'):
    shutil.rmtree('./result')
os.mkdir('./result')
Solution 1
Write a script that creates a file for each account, reads the rows one by one from the original csv, and writes each row to the file that corresponds to its account (opening and closing a file for each row).
Code:
# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file_path = f'result/{account_id}.csv'
        file_opening_mode = 'a' if os.path.isfile(account_file_path) else 'w'
        with open(account_file_path, file_opening_mode) as account_file:
            account_file.write(row)
        p_bar.update(1)
Issues:
It is quite slow (I think it is inefficient to open and close a file on each row); it takes around 4 minutes for 1 M rows. Even if it works, will it be fast? Given an account_id I know the name of the file I should read, but the system has to look over 5 M files to find it. Should I create some kind of binary tree of folders, with the leaves being the files?
Solution 2 (works on small example not on large csv file)
Same idea as solution 1, but instead of opening / closing a file for each row, store the open file objects in a dictionary.
Code:
# A dict that will contain all files
account_file_dict = {}

# A function that, given an account id, returns the file to write in (creates a new file if it does not exist)
def get_account_file(account_id):
    file = account_file_dict.get(account_id, None)
    if file is None:
        file = open(f'./result/{account_id}.csv', 'w')
        account_file_dict[account_id] = file
        file.__enter__()
    return file

# Split the data
p_bar = tqdm.tqdm(total=N_ROW, desc='splitting csv file')
with open('./data.csv') as data_file:
    next(data_file)  # skip header
    for row in data_file:
        account_id = row.split(',')[1]
        account_file = get_account_file(account_id)
        account_file.write(row)
        p_bar.update(1)
Issues:
I am not sure it is actually faster.
I have to open 5 M files simultaneously (one per account). I get an error OSError: [Errno 24] Too many open files: './result/33725.csv'.
Solution 3 (works on small example not on large csv file)
Use awk command, solution from: split large csv text file based on column value
code:
after generating the file, run: awk -F, 'NR==1 {h=$0; next} {f="./result/"$2".csv"} !($2 in p) {p[$2]; print h > f} {print >> f}' ./data.csv
Issues:
I get the following error: input record number 28229, file ./data.csv source line number 1 (the number 28229 is an example; it usually fails around 28k). I assume it is also because I am opening too many files.
@VinceM:
While not quite 15 GB, I do have a 7.6 GB one with 3 columns:
-- 148 mn prime numbers, their base-2 log, and their hex
in0: 7.59GiB 0:00:09 [ 841MiB/s] [ 841MiB/s] [========>] 100%
148,156,631 lines 7773.641 MB ( 8151253694) /dev/stdin

f="$( grealpath -ePq ~/master_primelist_19d.txt )"
( time ( for __ in '12' '34' '56' '78' '9'; do
    ( gawk -v ___="${__}" -Mbe 'BEGIN {
        ___="^["(___%((_+=_^=FS=OFS="=")+_*_*_)^_)"]"
      } ($_)~___ && ($NF = int(($_)^_))^!_' "${f}" & ) done |
  gcat - ) ) | pvE9 > "${DT}/test_primes_squared_00000002.txt"

out9: 13.2GiB 0:02:06 [98.4MiB/s] [ 106MiB/s] [ <=> ]

Using only 5 instances of gawk with the big-integer package gnu-GMP, each with a designated subset of leading digit(s) of the prime numbers, it managed to calculate the full-precision squaring of those primes in just 2 minutes 6 seconds, yielding an unsorted 13.2 GB output file.
If it can square that quickly, then merely grouping by account_id should be a walk in the park.
Have a look at https://docs.python.org/3/library/sqlite3.html
You could import the data, create the required indexes and then run queries normally. No dependencies except for Python itself.
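For example, a minimal sketch of that approach (the database file, table and column layout below are illustrative, not a fixed recipe):
import csv
import sqlite3

con = sqlite3.connect("data.db")
con.execute("CREATE TABLE IF NOT EXISTS rows (timestamp INT, account_id INT, value REAL)")

# load the csv once (the sqlite3 command-line shell's .import would also work)
with open("data.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header
    con.executemany("INSERT INTO rows VALUES (?, ?, ?)", reader)

# an index on account_id is what makes the later lookups cheap
con.execute("CREATE INDEX IF NOT EXISTS idx_account ON rows (account_id)")
con.commit()

# afterwards, fetching one account no longer scans the whole table
rows = con.execute("SELECT * FROM rows WHERE account_id = ?", (42,)).fetchall()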
https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.scan_csv.html
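With polars, a sketch of that lazy-scan approach could look like this (column name from the question; polars can push the filter down into the CSV scan so only matching rows are materialised):
import polars as pl

matching = (
    pl.scan_csv("data.csv")
    .filter(pl.col("account_id") == 1)
    .collect()
)
print(matching)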
If you have to query the raw data every time and you are limited to simple python only, then you can either write code to read it manually and yield matched rows, or use a helper like this:
from convtools.contrib.tables import Table
from convtools import conversion as c

iterable_of_matched_rows = (
    Table.from_csv("tmp/in.csv", header=True)
    .filter(c.col("account_id") == "1")
    .into_iter_rows(dict)
)
However, this won't be faster than reading a 100 M row csv file with csv.reader.

Better regex implementation than for looping whole file?

I have files looking like this:
# BJD K2SC-Flux EAPFlux Err Flag Spline
2457217.463564 5848.004 5846.670 6.764 0 0.998291
2457217.483996 6195.018 6193.685 6.781 1 0.998291
2457217.504428 6396.612 6395.278 6.790 0 0.998292
2457217.524861 6220.890 6219.556 6.782 0 0.998292
2457217.545293 5891.856 5890.523 6.766 1 0.998292
2457217.565725 5581.000 5579.667 6.749 1 0.998292
2457217.586158 5230.566 5229.232 6.733 1 0.998292
2457217.606590 4901.128 4899.795 6.718 0 0.998293
2457217.627023 4604.127 4602.793 6.700 0 0.998293
I need to find and count the lines with Flag = 1. (5th column.) This is how I have done it:
import re

foundlines = []
c = 0
with open('examplefile') as f:
    for index, line in enumerate(f):
        try:
            found = re.findall(r' 1 ', line)[0]
            foundlines.append(index)
            print(line)
            c += 1
        except:
            pass
print(c)
In Shell, I would just do grep " 1 " examplefile | wc -l, which is much shorter than the Python script above. The Python script works, but I am interested in whether there is a shorter, more compact way to do the task. I prefer the shortness of Shell, so I would like to have something similar in Python.
You have CSV data, you can use the csv module:
import csv

with open('your file', 'r', newline='', encoding='utf8') as fp:
    rows = csv.reader(fp, delimiter=' ')
    # generator comprehension
    errors = (row for row in rows if row[4] == '1')
    for error in errors:
        print(error)
Your shell implementation can be made even shorter; grep has a -c option to get you the count, so there is no need for a pipe and wc:
grep -c " 1 " examplefile
Your shell code simply gets you the line count where the pattern 1 is found, but your Python code additionally keeps a list of indexes of lines where the pattern is matched.
Only to get the line count, you can use sum with a genexp/list comprehension; there is also no need for a regex, a simple string containment check (the in operator, i.e. __contains__) will do:
with open('examplefile') as f:
    count = sum(1 for line in f if ' 1 ' in line)
print(count)
If you want to keep indexes as well, you can stick to your idea with only replacing re test with str test:
count = 0
indexes = []
with open('examplefile') as f:
    for idx, line in enumerate(f):
        if ' 1 ' in line:
            count += 1
            indexes.append(idx)
Additionally, doing a bare except is almost always a bad idea (at the very least use except Exception, which leaves out SystemExit and KeyboardInterrupt-like exceptions); catch only the exceptions you know might be raised.
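For instance, a sketch of the original loop that catches only the error a short or blank line can actually raise:
count = 0
indexes = []
with open('examplefile') as f:
    for idx, line in enumerate(f):
        try:
            if line.split()[4] == '1':
                count += 1
                indexes.append(idx)
        except IndexError:  # only raised by lines with fewer than 5 columns
            pass
print(count)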
Also, while parsing structured data, you should use a specific tool, e.g. here csv.reader with space as the separator (line.split(' ') would do in this case as well), and checking against index 4 would be safest (see Tomalak's answer). With the ' 1 ' in line test, there would be misleading results if any other column contained 1.
Considering the above, here's the shell way using awk to match against the 5-th field:
awk '$5 == "1" {count+=1}; END{print count}' examplefile
Shortest code
This is a very short version under some specific preconditions:
You just want to count occurrences like your grep invocation
There is guaranteed to be only one " 1 " per line
That " 1 " can only occur in the desired column
Your file fits easily into memory
Note that if these preconditions are not met, this may cause issues with memory or return false positives.
print(open("examplefile").read().count(" 1 "))
Easy and versatile, slightly longer
Of course, if you're interested in actually doing something with these lines later on, I recommend Pandas:
import pandas

df = pandas.read_table('test.txt', delimiter=" ",
                       comment="#",
                       names=['BJD', 'K2SC-Flux', 'EAPFlux', 'Err', 'Flag', 'Spline'])
To get all the rows where Flag is 1:
flaggedrows = df[df.Flag == 1]
returns:
BJD K2SC-Flux EAPFlux Err Flag Spline
1 2.457217e+06 6195.018 6193.685 6.781 1 0.998291
4 2.457218e+06 5891.856 5890.523 6.766 1 0.998292
5 2.457218e+06 5581.000 5579.667 6.749 1 0.998292
6 2.457218e+06 5230.566 5229.232 6.733 1 0.998292
To count them:
print(len(flaggedrows))
returns 4

How to write specific line lengths of a file?

I have sequences (over 9000) like this:
>TsM_000224500
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL
The lines containing the ">" are the IDs and the lines with the letters are the amino acid (aa) sequences. I need to delete (or move to other files) the sequences below 40 aa and over 4000 aa.
Then the resulting file should contain only the sequences within this range (>= 40 aa and <= 4000 aa).
I've tried writing the following script:
def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]

ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")
tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')

for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):
            tsf.write('%s\n'%(x))

tsf.close()
print "OK!"
I've done some modifications, but all I'm getting are empty files, or files with all 9000+ sequences.
In your for loop, x is an iterating integer due to using range() (i.e., 0, 1, 2, 3, 4...). Try this instead:
for x in ts:
This will give you each element in ts as x
Also, you don't need the brackets around x; Python can iterate over the characters in strings on its own. When you put brackets around a string, you put it into a list, and thus if you tried, for example, to get the second character in x: [x][1], Python will try to get the second element in the list that you put x in, and will run into problems.
EDIT: To include IDs, try this:
NOTE: I also changed if (len(x) > 40 or len(x) < 4000) to if (len(x) > 40 and len(x) < 4000) -- using and instead of or will give you the result you're looking for.
for i, x in enumerate(ts): #NEW: enumerate ts to get the index of every iteration (stored as i)
    if (x[0] != '>'):
        if (len(x) > 40 and len(x) < 4000):
            tsf.write('%s\n'%(ts[i-1])) #NEW: write the ID number found on preceding line
            tsf.write('%s\n'%(x))
Try this, simple and easy to understand. It does not load the entire file into memory; instead it iterates over the file line by line.
tsf = open('output.txt', 'w')  # open the output file
with open("yourfile", 'r') as ts:  # open the input file
    for line in ts:  # iterate over each line of the input file
        line = line.strip()  # removes all whitespace at the start and end, including spaces, tabs, newlines and carriage returns
        if line[0] == '>':  # if line is an ID
            continue  # move to the next line
        else:  # otherwise
            if len(line) >= 40 and len(line) <= 4000:  # if the sequence is within the required length range
                tsf.write('%s\n' % line)  # write it to the output file
tsf.close()  # done
print "OK!"
FYI, you could also use awk for a one line solution if working in unix environment:
cat yourinputfile.txt | grep -v '>' | awk 'length($0)>=40' | awk 'length($0)<=4000' > youroutputfile.txt

Processing Large Files in Python [ 1000 GB or More]

Let's say I have a text file of 1000 GB. I need to find how many times a phrase occurs in the text.
Is there any faster way to do this than the one I am using below?
How long would it take to complete the task?
phrase = "how fast it is"
count = 0
with open('bigfile.txt') as f:
    for line in f:
        count += line.count(phrase)
If I am right, and I do not have this file in memory, I would need to wait till the PC loads the file each time I do the search, which should take at least 4000 sec for a 250 MB/sec hard drive and a file of 1000 GB (1,000,000 MB / 250 MB/s = 4000 s).
I used file.read() to read the data in chunks; in the current examples the chunks were of size 100 MB, 500 MB, 1 GB and 2 GB respectively. The size of my text file is 2.1 GB.
Code:
from functools import partial

def read_in_chunks(size_in_bytes):
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt', 'r+b') as f:
        prev = ''
        count = 0
        f_read = partial(f.read, size_in_bytes)
        for text in iter(f_read, ''):
            if not text.endswith('\n'):
                # if file contains a partial line at the end, then don't
                # use it when counting the substring count.
                text, rest = text.rsplit('\n', 1)
                # pre-pend the previous partial line if any.
                text = prev + text
                prev = rest
            else:
                # if the text ends with a '\n' then simply pre-pend the
                # previous partial line.
                text = prev + text
                prev = ''
            count += text.count(s)
        count += prev.count(s)
        print count
Timings:
read_in_chunks(104857600)
$ time python so.py
10000000
real 0m1.649s
user 0m0.977s
sys 0m0.669s
read_in_chunks(524288000)
$ time python so.py
10000000
real 0m1.558s
user 0m0.893s
sys 0m0.646s
read_in_chunks(1073741824)
$ time python so.py
10000000
real 0m1.242s
user 0m0.689s
sys 0m0.549s
read_in_chunks(2147483648)
$ time python so.py
10000000
real 0m0.844s
user 0m0.415s
sys 0m0.408s
On the other hand the simple loop version takes around 6 seconds on my system:
def simple_loop():
    s = 'Lets say i have a text file of 1000 GB'
    with open('data.txt') as f:
        print sum(line.count(s) for line in f)
$ time python so.py
10000000
real 0m5.993s
user 0m5.679s
sys 0m0.313s
Results of @SlaterTyranus's grep version on my file:
$ time grep -o 'Lets say i have a text file of 1000 GB' data.txt|wc -l
10000000
real 0m11.975s
user 0m11.779s
sys 0m0.568s
Results of @woot's solution:
$ time cat data.txt | parallel --block 10M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m5.955s
user 0m14.825s
sys 0m5.766s
Got best timing when I used 100 MB as block size:
$ time cat data.txt | parallel --block 100M --pipe grep -o 'Lets\ say\ i\ have\ a\ text\ file\ of\ 1000\ GB' | wc -l
10000000
real 0m4.632s
user 0m13.466s
sys 0m3.290s
Results of woot's second solution:
$ time python woot_thread.py # CHUNK_SIZE = 1073741824
10000000
real 0m1.006s
user 0m0.509s
sys 0m2.171s
$ time python woot_thread.py #CHUNK_SIZE = 2147483648
10000000
real 0m1.009s
user 0m0.495s
sys 0m2.144s
System Specs: Core i5-4670, 7200 RPM HDD
Here is a Python attempt... You might need to play with the THREADS and CHUNK_SIZE. Also it's a bunch of code in a short time so I might not have thought of everything. I do overlap my buffer though to catch the ones in between, and I extend the last chunk to include the remainder of the file.
import os
import threading

INPUTFILE = 'bigfile.txt'
SEARCH_STRING = 'how fast it is'
THREADS = 8  # Set to 2 times number of cores, assuming hyperthreading
CHUNK_SIZE = 32768

FILESIZE = os.path.getsize(INPUTFILE)
SLICE_SIZE = FILESIZE / THREADS

class myThread (threading.Thread):
    def __init__(self, filehandle, seekspot):
        threading.Thread.__init__(self)
        self.filehandle = filehandle
        self.seekspot = seekspot
        self.cnt = 0
    def run(self):
        self.filehandle.seek( self.seekspot )
        p = self.seekspot
        if FILESIZE - self.seekspot < 2 * SLICE_SIZE:
            readend = FILESIZE
        else:
            readend = self.seekspot + SLICE_SIZE + len(SEARCH_STRING) - 1
        overlap = ''
        while p < readend:
            if readend - p < CHUNK_SIZE:
                buffer = overlap + self.filehandle.read(readend - p)
            else:
                buffer = overlap + self.filehandle.read(CHUNK_SIZE)
            if buffer:
                self.cnt += buffer.count(SEARCH_STRING)
            overlap = buffer[len(buffer)-len(SEARCH_STRING)+1:]
            p += CHUNK_SIZE

filehandles = []
threads = []
for fh_idx in range(0, THREADS):
    filehandles.append(open(INPUTFILE, 'rb'))
    seekspot = fh_idx * SLICE_SIZE
    threads.append(myThread(filehandles[fh_idx], seekspot))
    threads[fh_idx].start()

totalcount = 0
for fh_idx in range(0, THREADS):
    threads[fh_idx].join()
    totalcount += threads[fh_idx].cnt

print totalcount
Have you looked at using parallel / grep?
cat bigfile.txt | parallel --block 10M --pipe grep -o 'how\ fast\ it\ is' | wc -l
Had you considered indexing your file? The way a search engine works is by creating a mapping from words to their locations in the file. Say if you have this file:
Foo bar baz dar. Dar bar haa.
You create an index that looks like this:
{
    "foo": {0},
    "bar": {4, 21},
    "baz": {8},
    "dar": {12, 17},
    "haa": {25},
}
A hashtable index can be looked up in O(1); so it's freaking fast.
When someone searches for the query "bar baz", you first break the query into its constituent words: ["bar", "baz"], and you then find {4, 21}, {8}; then you use this to jump right to the places where the queried text could possibly exist.
There are out of the box solutions for indexed search engines as well; for example Solr or ElasticSearch.
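A minimal sketch of the idea in plain Python (it indexes word positions rather than character offsets, which is a simplification of the example above, and the index itself still has to fit somewhere):
import re
from collections import defaultdict

def build_index(filename):
    index = defaultdict(set)  # word -> set of word positions
    pos = 0
    with open(filename) as f:
        for line in f:
            for word in re.findall(r'\w+', line.lower()):
                index[word].add(pos)
                pos += 1
    return index

def count_phrase(index, phrase):
    words = re.findall(r'\w+', phrase.lower())
    # keep only the start positions that are followed by the remaining words
    hits = index.get(words[0], set())
    for offset, word in enumerate(words[1:], 1):
        hits = set(p for p in hits if p + offset in index.get(word, set()))
    return len(hits)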
Going to suggest doing this with grep instead of python. Will be faster, and generally if you're dealing with 1000GB of text on your local machine you've done something wrong, but all judgements aside, grep comes with a couple of options that will make your life easier.
grep -o '<your_phrase>' bigfile.txt|wc -l
Specifically this will count the number of occurrences of your desired phrase, including multiple occurrences on a single line.
If you don't need that you could instead do something like this:
grep -c '<your_phrase>' bigfile.txt
We're talking about a simple count of a specific substring within a rather large data stream. The task is nearly certainly I/O bound, but very easily parallelised. The first layer is the raw read speed; we can choose to reduce the read amount by using compression, or distribute the transfer rate by storing the data in multiple places. Then we have the search itself; substring searches are a well known problem, again I/O limited. If the data set comes from a single disk pretty much any optimisation is moot, as there's no way that disk beats a single core in speed.
Assuming we do have chunks, which might for instance be the separate blocks of a bzip2 file (if we use a threaded decompressor), stripes in a RAID, or distributed nodes, we have much to gain from processing them individually. Each chunk is searched for needle, then joints can be formed by taking len(needle)-1 from the end of one chunk and beginning of the next, and searching within those.
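A minimal sketch of that joint handling in plain Python (chunks is assumed to be any iterable yielding the pieces in order):
def count_in_chunks(chunks, needle):
    total = 0
    carry = ''  # last len(needle)-1 characters of the previous chunk
    for chunk in chunks:
        block = carry + chunk
        total += block.count(needle)
        # a match lying entirely inside carry is impossible (carry is shorter
        # than needle), so nothing gets counted twice
        carry = block[-(len(needle) - 1):] if len(needle) > 1 else ''
    return total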
A quick benchmark demonstrates that the regular expression state machines operate faster than the usual in operator:
>>> timeit.timeit("x.search(s)", "s='a'*500000; import re; x=re.compile('foobar')", number=20000)
17.146117210388184
>>> timeit.timeit("'foobar' in s", "s='a'*500000", number=20000)
24.263535976409912
>>> timeit.timeit("n in s", "s='a'*500000; n='foobar'", number=20000)
21.562405109405518
Another step of optimization we can perform, given that we have the data in a file, is to mmap it instead of using the usual read operations. This permits the operating system to use the disk buffers directly. It also allows the kernel to satisfy multiple read requests in arbitrary order without making extra system calls, which lets us exploit things like an underlying RAID when operating in multiple threads.
Here's a quickly tossed together prototype. A few things could obviously be improved, such as distributing the chunk processes if we have a multinode cluster, doing the tail+head check by passing one to the neighboring worker (an order which is not known in this implementation) instead of sending both to a special worker, and implementing an interthread limited queue (pipe) class instead of matching semaphores. It would probably also make sense to move the worker threads outside of the main thread function, since the main thread keeps altering its locals.
from mmap import mmap, ALLOCATIONGRANULARITY, ACCESS_READ
from re import compile, escape
from threading import Semaphore, Thread
from collections import deque

def search(needle, filename):
    # Might want chunksize=RAID block size, threads
    chunksize = ALLOCATIONGRANULARITY * 1024
    threads = 32
    # Read chunk allowance
    allocchunks = Semaphore(threads)  # should maybe be larger
    chunkqueue = deque()  # Chunks mapped, read by workers
    chunksready = Semaphore(0)
    headtails = Semaphore(0)  # edges between chunks into special worker
    headtailq = deque()
    sumq = deque()  # worker final results
    # Note: although we do push and pop at differing ends of the
    # queues, we do not actually need to preserve ordering.
    def headtailthread():
        # Since head+tail is 2*len(needle)-2 long,
        # it cannot contain more than one needle
        htsum = 0
        matcher = compile(escape(needle))
        heads = {}
        tails = {}
        while True:
            headtails.acquire()
            try:
                pos, head, tail = headtailq.popleft()
            except IndexError:
                break  # semaphore signaled without data, end of stream
            try:
                prevtail = tails.pop(pos - chunksize)
                if matcher.search(prevtail + head):
                    htsum += 1
            except KeyError:
                heads[pos] = head
            try:
                nexthead = heads.pop(pos + chunksize)
                if matcher.search(tail + nexthead):
                    htsum += 1
            except KeyError:
                tails[pos] = tail
        # No need to check spill tail and head as they are shorter than needle
        sumq.append(htsum)
    def chunkthread():
        threadsum = 0
        # escape special characters to achieve fixed string search
        matcher = compile(escape(needle))
        borderlen = len(needle) - 1
        while True:
            chunksready.acquire()
            try:
                pos, chunk = chunkqueue.popleft()
            except IndexError:  # End of stream
                break
            # Let the re module do the heavy lifting
            threadsum += len(matcher.findall(chunk))
            if borderlen > 0:
                # Extract the end pieces for checking borders
                head = chunk[:borderlen]
                tail = chunk[-borderlen:]
                headtailq.append((pos, head, tail))
                headtails.release()
            chunk.close()
            allocchunks.release()  # let main thread allocate another chunk
        sumq.append(threadsum)
    with open(filename, 'rb') as infile:
        htt = Thread(target=headtailthread)
        htt.start()
        chunkthreads = []
        for i in range(threads):
            t = Thread(target=chunkthread)
            t.start()
            chunkthreads.append(t)
        pos = 0
        fileno = infile.fileno()
        while True:
            allocchunks.acquire()
            chunk = mmap(fileno, chunksize, access=ACCESS_READ, offset=pos)
            chunkqueue.append((pos, chunk))
            chunksready.release()
            pos += chunksize
            if pos > chunk.size():  # Last chunk of file?
                break
        # File ended, finish all chunks
        for t in chunkthreads:
            chunksready.release()  # wake thread so it finishes
        for t in chunkthreads:
            t.join()  # wait for thread to finish
        headtails.release()  # post event to finish border checker
        htt.join()
    # All threads finished, collect our sum
    return sum(sumq)

if __name__ == "__main__":
    from sys import argv
    print "Found string %d times" % search(*argv[1:])
Also, modifying the whole thing to use some mapreduce routine (map chunks to counts, heads and tails, reduce by summing counts and checking tail+head parts) is left as an exercise.
Edit: Since it seems this search will be repeated with varying needles, an index would be much faster, being able to skip searches of sections that are known not to match. One possibility is making a map of which blocks contain any occurrence of various n-grams (accounting for the block borders by allowing the ngram to overlap into the next); those maps can then be combined to find more complex conditions, before the blocks of original data need to be loaded. There are certainly databases to do this; look for full text search engines.
Here is a third, longer method that uses a database. The database is sure to be larger than the text. I am not sure whether the indexes are optimal, and some space savings could come from playing with that a little (e.g. maybe WORD alone, and POS, WORD, are better, or perhaps WORD, POS is just fine; it needs a little experimenting).
This may not perform well on 200 OK's test though, because it is a lot of repeating text, but it might perform well on more unique data.
First create a database by scanning the words, etc:
import sqlite3
import re

INPUT_FILENAME = 'bigfile.txt'
DB_NAME = 'words.db'
FLUSH_X_WORDS = 10000

conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()

cursor.execute("""
CREATE TABLE IF NOT EXISTS WORDS (
     POS INTEGER
    ,WORD TEXT
    ,PRIMARY KEY( POS, WORD )
) WITHOUT ROWID
""")
cursor.execute("""
DROP INDEX IF EXISTS I_WORDS_WORD_POS
""")
cursor.execute("""
DROP INDEX IF EXISTS I_WORDS_POS_WORD
""")
cursor.execute("""
DELETE FROM WORDS
""")
conn.commit()

def flush_words(words):
    for word in words.keys():
        for pos in words[word]:
            cursor.execute('INSERT INTO WORDS (POS, WORD) VALUES( ?, ? )', (pos, word.lower()))
    conn.commit()

words = dict()
pos = 0
recomp = re.compile('\w+')
with open(INPUT_FILENAME, 'r') as f:
    for line in f:
        for word in [x.lower() for x in recomp.findall(line) if x]:
            pos += 1
            if words.has_key(word):
                words[word].append(pos)
            else:
                words[word] = [pos]
            if pos % FLUSH_X_WORDS == 0:
                flush_words(words)
                words = dict()
if len(words) > 0:
    flush_words(words)
    words = dict()

cursor.execute("""
CREATE UNIQUE INDEX I_WORDS_WORD_POS ON WORDS ( WORD, POS )
""")
cursor.execute("""
CREATE UNIQUE INDEX I_WORDS_POS_WORD ON WORDS ( POS, WORD )
""")
cursor.execute("""
VACUUM
""")
cursor.execute("""
ANALYZE WORDS
""")
Then search the database by generating SQL:
import sqlite3
import re

SEARCH_PHRASE = 'how fast it is'
DB_NAME = 'words.db'

conn = sqlite3.connect(DB_NAME)
cursor = conn.cursor()

recomp = re.compile('\w+')
search_list = [x.lower() for x in recomp.findall(SEARCH_PHRASE) if x]

from_clause = 'FROM\n'
where_clause = 'WHERE\n'
num = 0
fsep = ' '
wsep = ' '
for word in search_list:
    num += 1
    from_clause += '{fsep}words w{num}\n'.format(fsep=fsep, num=num)
    where_clause += "{wsep} w{num}.word = '{word}'\n".format(wsep=wsep, num=num, word=word)
    if num > 1:
        where_clause += " AND w{num}.pos = w{lastnum}.pos + 1\n".format(num=str(num), lastnum=str(num-1))
    fsep = ' ,'
    wsep = ' AND'

sql = """{select}{fromc}{where}""".format(select='SELECT COUNT(*)\n', fromc=from_clause, where=where_clause)

res = cursor.execute(sql)
print res.fetchone()[0]
I concede that grep will be faster. I assume this file is a large string-based file.
But you could do something like this if you really really wanted.
import os
import re
import mmap

fileName = 'bigfile.txt'
phrase = re.compile("how fast it is")

with open(fileName, 'r') as fHandle:
    data = mmap.mmap(fHandle.fileno(), os.path.getsize(fileName), access=mmap.ACCESS_READ)
    matches = phrase.findall(data)
    print('matches = {0}'.format(len(matches)))

Bash or Python to go backwards?

I have a text file with a lot of random occurrences of the string #STRING_A, and I would be interested in writing a short script which removes only some of them. Particularly, one that scans the file and once it finds a line which starts with this string, like
#STRING_A
then checks if 3 lines backwards there is another occurrence of a line starting with the same string, like
#STRING_A
#STRING_A
and if that happens, to delete the occurrence 3 lines backward. I was thinking about bash, but I do not know how to "go backwards" with it, so I am sure that this is not possible with bash. I also thought about python, but then I would have to store all the information in memory in order to go backwards, and for long files that would be unfeasible.
What do you think? Is it possible to do it in bash or python?
Thanks
Funny that after all these hours nobody's yet given a solution to the problem as actually phrased (as @John Machin points out in a comment) -- remove just the leading marker (if followed by another such marker 3 lines down), not the whole line containing it. It's not hard, of course -- here's a tiny mod as needed of @truppo's fun solution, for example:
from itertools import izip, chain

f = "foo.txt"
for third, line in izip(chain("   ", open(f)), open(f)):
    if third.startswith("#STRING_A") and line.startswith("#STRING_A"):
        line = line[len("#STRING_A"):]
    print line,
Of course, in real life, one would use itertools.tee instead of reading the file twice, have this code in a function, not repeat the marker constant endlessly, &c ;-).
Of course Python will work as well. Simply store the last three lines in an array and check if the first element in the array is the same as the value you are currently reading. Then delete the value and print out the current array. You would then move over your elements to make room for the new value and repeat. Of course when the array is filled you'd have to make sure to continue to move values out of the array and put in the newly read values, stopping to check each time to see if the first value in the array matches the value you are currently reading.
Here is a more fun solution, using two iterators with a three element offset :)
from itertools import izip, chain, tee

f1, f2 = tee(open("foo.txt"))
for third, line in izip(chain("   ", f1), f2):
    if not (third.startswith("#STRING_A") and line.startswith("#STRING_A")):
        print line,
Why shouldn't it be possible in bash? You don't need to keep the whole file in memory, just the last three lines (if I understood correctly), and write what's appropriate to standard output. Redirect that into a temporary file, check that everything worked as expected, and overwrite the source file with the temporary one.
Same goes for Python.
I'd provide a script of my own, but that wouldn't be tested. ;-)
As AlbertoPL said, store lines in a fifo for later use--don't "go backwards". For this I would definitely use python over bash+sed/awk/whatever.
I took a few moments to code this snippet up:
from collections import deque

line_fifo = deque()
for line in open("test"):
    line_fifo.append(line)
    if len(line_fifo) == 4:
        # "look 3 lines backward"
        if line_fifo[0] == line_fifo[-1] == "#STRING_A\n":
            # get rid of that match
            line_fifo.popleft()
        else:
            # print out the top of the fifo
            print line_fifo.popleft(),
# don't forget to print out the fifo when the file ends
for line in line_fifo: print line,
This code will scan through the file, and remove lines starting with the marker. It only keeps only three lines in memory by default:
from collections import deque

def delete(fp, marker, gap=3):
    """Delete lines from *fp* if they start with *marker* and are followed
    by another line starting with *marker* *gap* lines after.
    """
    buf = deque()
    for line in fp:
        if len(buf) < gap:
            buf.append(line)
        else:
            old = buf.popleft()
            if not (line.startswith(marker) and old.startswith(marker)):
                yield old
            buf.append(line)
    for line in buf:
        yield line
I've tested it with:
>>> from StringIO import StringIO
>>> fp = StringIO('''a
... b
... xxx 1
... c
... xxx 2
... d
... e
... xxx 3
... f
... g
... h
... xxx 4
... i''')
>>> print ''.join(delete(fp, 'xxx'))
a
b
xxx 1
c
d
e
xxx 3
f
g
h
xxx 4
i
This "answer" is for lyrae ... I'll amend my previous comment: if the needle is in the first 3 lines of the file, your script will either cause an IndexError or access a line that it shouldn't be accessing, sometimes with interesting side-effects.
Example of your script causing IndexError:
>>> lines = "#string line 0\nblah blah\n".splitlines(True)
>>> needle = "#string "
>>> for i,line in enumerate(lines):
... if line.startswith(needle) and lines[i-3].startswith(needle):
... lines[i-3] = lines[i-3].replace(needle, "")
...
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
IndexError: list index out of range
and this example shows not only that the Earth is round but also why your "fix" to the "don't delete the whole line" problem should have used .replace(needle, "", 1) or [len(needle):] instead of .replace(needle, "")
>>> lines = "NEEDLE x NEEDLE y\nnoddle\nnuddle\n".splitlines(True)
>>> needle = "NEEDLE"
>>> # Expected result: no change to the file
... for i,line in enumerate(lines):
... if line.startswith(needle) and lines[i-3].startswith(needle):
... lines[i-3] = lines[i-3].replace(needle, "")
...
>>> print ''.join(lines)
x y <<<=== whoops!
noddle
nuddle
<<<=== still got unwanted newline in here
>>>
My awk-fu has never been that good... but the following may provide you what you're looking for in a bash-shell/shell-utility form:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "d"
LAST = NR
}' test_file` test_file
Basically... awk is producing a command for sed to strip certain lines. I'm sure there's a relatively easy way to make awk do all of the processing, but this does seem to work.
The bad part? It does read the test_file twice.
The good part? It is a bash/shell-utility implementation.
Edit: Alex Martelli points out that the sample file above might have confused me. (my above code deletes the whole line, rather than the #STRING_A flag only)
This is easily remedied by adjusting the command to sed:
sed `awk 'BEGIN{ORS=";"}
/#STRING_A/ {
if(LAST!="" && LAST+3 >= NR) print LAST "s/#STRING_A//"
LAST = NR
}' test_file` test_file
This may be what you're looking for?
lines = open('sample.txt').readlines()
needle = "#string "
for i, line in enumerate(lines):
    if line.startswith(needle) and lines[i-3].startswith(needle):
        lines[i-3] = lines[i-3].replace(needle, "")
print ''.join(lines)
this outputs:
string 0 extra text
string 1 extra text
string 2 extra text
string 3 extra text
--replaced -- 4 extra text
string 5 extra text
string 6 extra text
#string 7 extra text
string 8 extra text
string 9 extra text
string 10 extra text
In bash you can use sort -r filename and tail -n filename to read the file backwards.
LINES=`tail -n filename | sort -r`
# now iterate through the lines and do your checking
I would consider using sed. GNU sed supports definition of line ranges. If sed fails, then there is another beast, awk, and I'm sure you can do it with awk.
O.K. I feel I should put up my awk POC. I could not figure out how to use sed addresses. I have not tried a combination of awk+sed, but it seems to me it's overkill.
my awk script works as follows:
It reads lines and stores them into a 3-line buffer
once the desired pattern is found (/^data.*/ in my case), the 3-line buffer is looked up to check whether the desired pattern was seen three lines ago
if the pattern has been seen, then those 3 lines are scratched
to be honest, I would probably go with python also, given that awk is really awkward.
the AWK code follows:
function max(a, b)
{
    if (a > b)
        return a;
    else
        return b;
}

BEGIN {
    w = 0;         # write index
    r = 0;         # read index
    buf[0, 1, 2];  # buffer
}

END {
    # flush buffer
    # start at read index and print out up to w index
    for (k = r % 3; k >= r - max(r - 3, 0); k--) {
        # search in 3 line history buf
        if (match(buf[k % 3], /^data.*/) != 0) {
            # found -> remove lines from history
            # by rewriting them -> adjust write index
            w -= max(r, 3);
        }
    }
    buf[w % 3] = $0;
    w++;
}

/^.*/ {
    # store line into buffer, if the history
    # is full, print out the oldest one.
    if (w > 2) {
        print buf[r % 3];
        r++;
        buf[w % 3] = $0;
    }
    else {
        buf[w] = $0;
    }
    w++;
}
