I am trying to find a fast way to search for strings in a file. I don't have just one string to find: I have a list of 1900 strings to look for in a file that is 150 MB. So, basically, I am opening the file and looping 1900 times to find all occurrences of each string. Here are some attributes of my search:
The file to be searched is 150 MB of plain text.
I need to find all occurrences of 1900 strings, which means I loop over the entire file 1900 times.
It's not a simple search; I have to use a regex to find each string.
In a few cases I need the line above and the line below the one where the search string was found, so I need file.readlines(), not file.read().
In a few cases I also have to replace the found string with a new string.
First I am trying to find the best way to search the file. My code is taking too long, and I am not sure this is the best way to do it:
#searchstrings is a list of 1900 strings
file = open("mytextfile.txt", "r")
for line in file:
    for i in range(len(searchstrings)):
        if searchstrings[i] in line:
            print(line)
file.close()
This code does the job, but it's extremely slow, and it does not give me the option to grab the line above or below the one where the search string is found.
The other code I am using, to replace strings, is below. It is also extremely slow. Here I am using a regex.
import re

file = open("mytextfile.txt", "r")
file_data = file.read()
#searchstrings is a list of 1900 strings
#replacestrings is a list of the 1900 replacement strings
for i in range(len(searchstrings)):
    src_str = re.compile(searchstrings[i], re.IGNORECASE)
    file_data = src_str.sub(replacestrings[i], file_data)
file.close()
I know the performance of the code depends on the computing power as well; still, I want to know the best way to write this code so that it runs at optimum speed on given hardware. I would also like to know how to time the program's execution.
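For the timing question, a minimal sketch with the standard library; the work being timed here is a stand-in for the real search loop:

```python
import time

start = time.perf_counter()
# a stand-in for the real search/replace work being timed
total = sum(range(1000))
elapsed = time.perf_counter() - start
print("took %.6f seconds" % elapsed)
```

For repeatable micro-benchmarks of small snippets, the timeit module is also worth a look.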
I like Unix commands, they are fun, fast and efficient.
import re, sys

# filter stdin like grep: write out the lines matching the pattern in argv[1]
for line in sys.stdin:
    if re.search(sys.argv[1], line):
        sys.stdout.write(line)
A few observations.
For idiomatic Python, you usually want

for string in searchstrings:
    ...

instead of

for i in range(len(searchstrings)):
    searchstrings[i]

and with open(filename) as f: ... instead of open()/close(). The with statement will close the file automatically.
When you want to replace any of several strings with a regex, you can do
re.sub('|'.join(YOUR_STRINGS), replacement, text)
because | is the regex symbol for "or", instead of looping over them all individually.
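As a sketch of that idea with invented stand-in strings, the alternation can also drive the replacement step in a single pass, using re.escape so that any regex metacharacters in the search strings are treated literally:

```python
import re

searchstrings = ["foo", "bar", "baz"]                  # stand-ins for the 1900 strings
replacements = {"foo": "FOO", "bar": "BAR", "baz": "BAZ"}

# escape each string, then join with | so one pass over the text finds any of them
pattern = re.compile("|".join(re.escape(s) for s in searchstrings), re.IGNORECASE)

# look up the replacement for whichever string actually matched
result = pattern.sub(lambda m: replacements[m.group(0).lower()], "foo and bar and baz")
print(result)  # FOO and BAR and BAZ
```

This walks the 150 MB of text once instead of 1900 times, which is where most of the speedup would come from.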
For performance, I might try switching from CPython to PyPy. PyPy is another implementation of the same language but often much faster.
On the other hand, if that's really all your program is supposed to do, you might want to use a dedicated tool for the job, like Ag or RipGrep, which have already been optimized for it, possibly through the subprocess.run() function if you're working in Python.
I know there are similar threads to this question (I have looked at them already), but as a noob I cannot work out how to translate those answers to make my script work (4+ days of trying).
So: I have a Python script to randomly select a subset of items from a file, along with the components of those items. I want to create two new txt files as output: one with the subset of items, and one with just a list of components (ingredients) for those items.
To do this I write lines to the first txt file (MenuOutput.txt), and then want to use a regex (re.sub) to strip out the first part of the string from each line in the second file (ShoppingOutput.txt).
Now the issue: I get TypeError: 'list' object cannot be interpreted as an integer. I understand (I think) that the problem is that re.sub is being handed a list object. But I don't know another way to strip the first part of each line from a text file. Is there a way of tweaking the re.sub call to make it work, or do I need another function I am unaware of?
Menu_choices = random.sample(sample_list, k=6)

MenuOutput = open('MenuOutput.txt', 'w')
for element in Menu_choices:
    MenuOutput.write(element)
MenuOutput.close()

MyFile = open('ShoppingOutput.txt', 'w')
ShoppingOutput = re.sub(r'.*?', 'I', Menu_choices)
for element in ShoppingOutput:
    MyFile.write(element)
MyFile.close()
Just like you loop over the list of strings to write them, you have to loop over them to perform other string manipulations on them: re.sub works on a single string, not on a list.

with open('ShoppingOutput.txt', 'w') as my_file:
    for element in menu_choices:
        my_file.write(re.sub(r'.*?', 'I', element))

Notice also the upgrade to a with statement, and the use of snake_case for regular variables.
Your regex seems both inexact and inefficient, though: the non-greedy .*? happily matches the empty string, so it strips nothing. Probably better to just my_file.write('I' + element) and get rid of the re.sub, or to use a simple substring operation if the intent was to remove a prefix but you hadn't worked out the correct regex for that yet. For example,

my_file.write('I' + element[element.index(' ') + 1:])

would write everything after the first space.
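A quick check of that slicing on an invented menu line:

```python
element = "Pizza: cheese, tomato"
# everything after the first space, with the 'I' prefix from the answer above
shopping = 'I' + element[element.index(' ') + 1:]
print(shopping)  # Icheese, tomato
```

Note that element.index(' ') raises ValueError when a line contains no space; 'I' + element.partition(' ')[2] would sidestep that.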
I have a .txt file of words I want to 'clean' of swear words, so I have written a program which checks each position of the word list one by one and, if the word appears anywhere within the list of censorable words, removes it with var.remove(arg). That works fine, and the list ends up clean, but the list can't be written to any file.
wordlist is the clean list.
newlist = open("lists.txt", "w")
newlist.write(wordlist)
newlist.close()
This returns this error:
newlist.write(wordlist)
TypeError: expected a string or other character buffer object
I'm guessing this is because I'm trying to write a list variable to the file, but there really is no alternative; there are 3526 items in the list.
Any ideas why it can't write a list variable to a file?
Note: lists.txt does not exist, it is created by the write mode.
write expects a string. You cannot write a list variable because, even though it is clear to a human that the words should be separated by spaces or newlines, the computer cannot make that assumption for you; it must be given the exact bytes you want written.
So you need to convert this list to a string explicitly, and then write that to the file. For that goal,
newlist.write('\n'.join(wordlist))
would suffice (and provide a file where every line contains a single word).
For certain tasks, converting the list with str(wordlist) (which returns something like ['hi', 'there']) and writing that would also work (and allow retrieving it later via eval-style methods), but for long lists this is a wasteful use of space (it adds about 4 bytes per word) and would probably take more time.
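For illustration, that round trip could look like this, with ast.literal_eval as the safe way to read the text back:

```python
import ast

wordlist = ['hi', 'there']
as_text = str(wordlist)              # the literal "['hi', 'there']"
restored = ast.literal_eval(as_text) # parses the literal back into a list
print(restored)
```

ast.literal_eval only accepts Python literals, so unlike eval it cannot be tricked into running arbitrary code.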
If you want better formatting for structured data, you can use the built-in json module.
text_file.write(json.dumps(list_data, separators=(',\n', ':')))
Written this way, the list is also valid Python, so you could even import the file later.
So this could look something like this:
import json

var_name = 'newlist'
with open(path, "w", encoding='utf-8') as text_file:
    text_file.write(f"{var_name} = [\n")
    text_file.write(json.dumps(list_data, separators=(',\n', ':')))
    text_file.write("\n]\n")
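As a sketch of the round trip with invented data, what json.dumps writes can be read back directly with json.loads:

```python
import json

list_data = ['bread', 'milk']
encoded = json.dumps(list_data, separators=(',\n', ':'))  # one item per line
print(encoded)
decoded = json.loads(encoded)  # recover the original list
```

This is generally safer than eval-based retrieval, since JSON cannot contain executable code.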
import re

p = re.compile('>.*\n')
text = p.sub('', text)
I want to delete all lines starting with a '>'. I have a really huge file (3 GB) that I process in chunks of 250 MB, so the variable text is a string of about 250 MB. (I tried different chunk sizes, but the performance was always the same for the complete file.)
Now, can I speed up this regex somehow? I tried multi-line matching, but it was a lot slower. Or are there even better ways?
(I already tried splitting the string and then filtering out the lines like this, but that was also slower. I also tried a lambda instead of def del_line; that might not be working code, it's just from memory.)
def del_line(x):
    return x[0] != '>'

def func():
    ...
    text = file.readlines(chunksize)
    text = filter(del_line, text)
    ...
EDIT:
As suggested in the comments, I also tried walking line by line:
text = []
for line in file:
    if line[0] != '>':
        text.append(line)
text = ''.join(text)
That's also slower: it needs ~12 sec, while my regex needs ~7 sec. (Yeah, that's fast, but it must also run on slower machines.)
EDIT: Of course, I also tried str.startswith('>'); it was slower...
If you have the chance, running grep as a subprocess is probably the most pragmatic choice.
If for whatever reason you can't rely on grep, you could try implementing some of the "tricks" that make grep fast. The author himself describes them here: http://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
At the end of the article, the author summarizes the main points. The one that stands out to me most is:
Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for
newlines would slow grep down by a factor of several times, because to
find the newlines it would have to look at every byte!
The idea would be to load the entire file in memory and iterate with it on byte-level instead of line-level. Only when you find a match, you look for the line boundaries and delete it.
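A rough Python sketch of that idea, scanning the whole buffer for '>' and only locating line boundaries once a match is found (the sample text is made up):

```python
text = "keep1\n>drop\nkeep2\n>drop2\nkeep3"

out = []
pos = 0
while True:
    hit = text.find('>', pos)                      # scan the whole buffer at once
    if hit == -1:
        out.append(text[pos:])
        break
    line_start = text.rfind('\n', pos, hit) + 1    # only now look for line boundaries
    if line_start == hit:                          # '>' really starts its line: drop it
        out.append(text[pos:line_start])
        line_end = text.find('\n', hit)
        pos = len(text) if line_end == -1 else line_end + 1
    else:                                          # '>' was mid-line: keep it
        out.append(text[pos:hit + 1])
        pos = hit + 1

cleaned = ''.join(out)
print(cleaned)
```

Whether this beats the compiled regex depends on how rare '>' is in the data; it only pays off when matches are sparse, which is exactly the case grep optimizes for.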
You say you have to run this on other computers. If it's within your reach and you are not doing it already, consider running it on PyPy instead of CPython (the default interpreter). This may (or may not) improve the runtime by a significant factor, depending on the nature of the program.
Also, as some comments already mentioned, benchmark with the actual grep to get a baseline of how fast you can go, reasonably speaking. Get it on Cygwin if you are on Windows, it's easy enough.
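Concretely, delegating to grep from Python might look like this (it assumes a Unix-like system with grep on the PATH; the sample input is invented):

```python
import subprocess

result = subprocess.run(
    ["grep", "-v", "^>"],            # -v inverts the match: drop '>'-prefixed lines
    input="keep1\n>drop\nkeep2\n",
    capture_output=True, text=True,
)
print(result.stdout)
```

In the real program you would pass the 3 GB file as an argument and redirect stdout to the output file instead of feeding chunks through stdin.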
This is not faster?
def cleanup(chunk):
    return '\n'.join(st for st in chunk.split('\n') if not (st and st[0] == '>'))
EDIT: yeah, that is not faster. That's twice as slow.
Maybe consider using subprocess and a tool like grep, as suggested by Ryan P. You could even take advantage of multiprocessing.
I'm using python 2.6 on linux.
I have two text files
first.txt has a single string of text on each line. So it looks like
lorem
ipus
asfd
The second file doesn't quite have the same format; it looks more like this:
1231 lorem
1311 assss 31 1
etc
I want to take each line of text from first.txt and determine whether there's a match in the second file. If there isn't a match, I would like to save the missing text to a third file. I would like to ignore case, but it's not completely necessary. This is why I was looking at regex, but I didn't have much luck.
So I'm opening the files, using readlines() to create lists, then iterating through the lists and printing out the matches.
Here's my code
first_file = open('first.txt', "r")
first = first_file.readlines()
first_file.close()

second_file = open('second.txt', "r")
second = second_file.readlines()
second_file.close()

i = 0
while i < len(first):
    j = search[i]
    k = 0
    while k < len(second):
        m = compare[k]
        if not j.find(m):
            print m
        k = k + 1
    i = i + 1
exit()
It's definitely not elegant. Anyone have suggestions how to fix this or a better solution?
My approach is this: read the second file, convert it to lowercase, and create a list of the words it contains. Then convert this list into a set, for better performance with large files.
Then go through each line in the first file, and if it (also converted to lowercase, with surrounding whitespace removed) is not in the set we created, write it to the third file.
with open("second.txt") as second_file:
    second_values = set(second_file.read().lower().split())

with open("first.txt") as first_file:
    with open("third.txt", "wt") as third_file:
        for line in first_file:
            if line.lower().strip() not in second_values:
                third_file.write(line)
set objects are a simple container type that is unordered and cannot contain duplicate values. They are designed to let you quickly add or remove items, or test whether an item is already in the set.
with statements are a convenient way to ensure that a file is closed, even if an exception occurs. They are enabled by default from Python 2.6 onwards; in Python 2.5 they require the line from __future__ import with_statement at the top of your file.
The in operator does what it sounds like: tell you if a value can be found in a collection. When used with a list it just iterates through, like your code does, but when used with a set object it uses hashes to perform much faster. not in does the opposite. (Possible point of confusion: in is also used when defining a for loop (for x in [1, 2, 3]), but this is unrelated.)
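A tiny demonstration with an invented second-file line, building the word set the same way as the answer's code:

```python
# split the raw text into words, lowercased, exactly as in the answer
second_values = set("1231 lorem\n1311 assss 31 1".lower().split())

found = "lorem" in second_values    # set membership is a hash lookup, roughly O(1)
missing = "asfd" in second_values   # a word from first.txt with no match
print(found, missing)  # True False
```

With a list, each in test would scan every element; with 1900 lookups against a large collection, the set version wins by a wide margin.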
Assuming that you're looking for the entire line in the second file:
second_file = open('second.txt', "r")
second = second_file.readlines()
second_file.close()

first_file = open('first.txt', "r")
for line in first_file:
    if line not in second:
        print line
first_file.close()
I have a text file with lots of lines and with this structure:
[('name_1a',
'name_1b',
value_1),
('name_2a',
'name_2b',
value_2),
.....
.....
('name_XXXa',
'name_XXXb',
value_XXX)]
I would like to convert it to:
name_1a, name_1b, value_1
name_2a, name_2b, value_2
......
name_XXXa, name_XXXb, value_XXX
I wonder what would be the best way to do this: awk, Python, or bash?
Thanks,
Jose
Have you tried evaluating it in Python? It looks like a list of tuples to me.
eval(your_string)
Note: this is massively unsafe! If there's code in there to delete your hard disk, evaluating it will run that code!
I would like to use Python:
lines = open('filename.txt', 'r').readlines()
n = len(lines)  # assumes n % 3 == 0
for i in range(0, n, 3):
    name1 = lines[i].strip(" '(),[]\n\r")
    name2 = lines[i + 1].strip(" '(),[]\n\r")
    value = lines[i + 2].strip(" '(),[]\n\r")
    print name1, name2, value
It looks like legal Python. You might be able to just import it as a module and then write it back out after formatting it.
Oh boy, here is a job for ast.literal_eval!
literal_eval is safer than eval, since it restricts the input string to literals such as strings, numbers, tuples, lists, dicts, booleans and None:
import ast

filename = 'in'
with open(filename, 'r') as f:
    contents = f.read()

data = ast.literal_eval(contents)
for elt in data:
    print(', '.join(map(str, elt)))
Here's one way to do it with (g)awk:
$ awk -vRS=")," ' { gsub(/\n|[\047\]\[)(]/,"") } 1' file
name_1a,name_1b,value_1
name_2a,name_2b,value_2
name_XXXa,name_XXXb,value_XXX
Awk is typically line oriented, and bash is a shell with a limited number of string-manipulation functions. It really depends on where your strengths as a programmer lie, but all other things being equal, I would choose Python.
Did you ever consider that by redirecting the time it took to post this on SO, you could have had it done?
"AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed." - Alfred V. Aho[2]
Asking which language is best for a given task is a very different question from asking what the best way of doing that task is in a particular language. The former, which is what you're asking, is in most cases entirely subjective.
Since this is a fairly simple task, I would suggest going with what you know (unless you're doing this for learning purposes, which I doubt).
If you know any of the languages you suggested, go ahead and solve this in a matter of minutes. If you know none of them, here enters the subjective part: I would suggest learning Python, since it's so much more fun than the other two ;)
If the values are legal Python values, you can take advantage of eval(), since your data is a legal Python data structure. The following would work if the values are integers; otherwise you might have to massage the print call a bit:
input = """[('name_1a',
'name_1b',
1),
('name_2a',
'name_2b',
2),
('name_XXXa',
'name_XXXb',
3)]"""
for e in eval(input):
    print '%s,%s,%d' % e
P.S. Using eval() is quite controversial, since it will execute any valid Python code that you pass into it, so take care.