Finding sub-strings in LARGE string - python

# read in csv file in form ("case, num, val \n case1, 1, baz\n...")
# convert to form FOO = "casenumval..." -- roughly 6 million characters
for someString in List:  # 60,000 substrings
    if someString not in FOO:
        pass  # do stuff
    else:
        pass  # do other stuff
So my issue is that there are far too many substrings to check against this massive string. I have tried reading the file line by line and checking the substrings against each line, but this still crashes the program. Are there any techniques for checking a lot of substrings against a very large string efficiently?
FOR CONTEXT:
I am performing data checks; suspect data is saved to a csv file to be reviewed/changed. This reviewed/changed file is then compared to the original file. Data which has not changed has been verified as good and must be saved to a new "exceptionFile". Data that has been changed and now passes is disregarded. And data which has been changed, is checked again, and is still suspect is sent off for review again.

The first thing you should do is convert your list of 60,000 strings to search for into one big regular expression:
import re
searcher = re.compile("|".join(re.escape(s) for s in List))
Now you can search for them all at once:
for m in searcher.finditer(FOO):
    print(m.group(0))  # prints the substring that matched
If all you care about is knowing which ones were found,
print(set(m.group(0) for m in searcher.finditer(FOO)))
This is still doing substantially more work than the absolute minimum, but it should be much more efficient than what you were doing before.
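For example, a minimal sketch (my own, assuming List and FOO are built as in the question) that uses that set of found substrings to drive the original if/else branching:
found = set(m.group(0) for m in searcher.finditer(FOO))
for someString in List:
    if someString not in found:
        pass  # do stuff: someString does not occur anywhere in FOO
    else:
        pass  # do other stuff: someString occurs at least once in FOO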
Also, if you know that your input is a CSV file and you also know that none of the strings-to-search-for contain a newline, you can operate line by line, which may or may not be faster than what you were doing depending on conditions, but will certainly use less memory:
with open("foo.csv") as FOO:
for line in FOO:
for m in searcher.finditer(line):
# do something with the substring that matched

Make a list from the words of fileA and check with that list Against fileB in python

To give a bit more detail: I have a list of common words in a txt file and I want to check whether any of those words (around 2000) exist in another file (html) and, if they do, replace them with a constant string ("sssss" for example). Regex didn't help me much; I tried patterns along the lines of \b(?:one|two|three)\b, \w, and (?:^|(?<= ))(one|common|word|or|another)(?:(?= )|$).
I know how to open a file and import the first list, but I don't know how to check every entry of that list against a huge text and replace their instances. I don't mind if it takes time, I just really need this done and don't know how.
import re
import string
f = open('test2.txt', 'r')
lines = f.readlines()
print (lines)
Here's a hint for you. Parse each file into a set where each word would be an entry.
Then you can compare the two sets with one of the standard set operations: union, intersection, difference, or symmetric difference.
Regular expressions are not necessary unless you plan to make additional correlations with each word (comparing cat to cats, for example). But if you plan to go down that road, you're probably better off generating a trie (prefix tree). I can expand more if you are willing to show some more code (progress).
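As a starting point, here is a minimal sketch of that set-based approach (my own, assuming whitespace-separated words and the hypothetical file names 'common_words.txt' and 'page.html'):
with open('common_words.txt') as f:
    common = set(f.read().split())        # the ~2000 common words

with open('page.html') as f:
    text = f.read()

overlap = common & set(text.split())      # intersection: words present in both

# Replace every occurrence of each overlapping word with the constant string.
# Note that str.replace matches raw substrings, not whole words; for strict
# word boundaries you would be back to a regex with \b.
for word in overlap:
    text = text.replace(word, 'sssss')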

How can I efficiently search for many strings at once in many files?

Hi, I'll post my portion of code and then explain my goal:
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        lines = f.readlines()
        for entry in rs:
            for line in lines:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))
So in "matches" i got a list of csv files (the path to them). I open every csv file and i load them into memory with readlines(). "rs" is a list of unique ids. For every element of the list "rs" i need to search every line of the csv file and print each time i find the id on the file (later on i will test if the line contains another fixed word also).
The code above works for my purpose but i don't know why it takes more than 10 minutes to process a 400k row file, i need to do this task for thousand of files so it's impossible for me to finish the task. It seems to me that the slow part is the testing process, but i'm not sure.
Please note that i use python because i'm more confortable with it, if there is ANY other solution to my problem using other tools i'm ok with it.
EDIT:
I will try to post some examples.
"rs" list:
rs12334435
rs3244567
rs897686
....
files
# header data not needed
# data
# data
# data
# data
# data [...]
#COLUMN1 COLUMN2 COLUMN3 ...
data rs7854.rs165463 dataSS=1(random_data)
data rs465465data datadata
data rs798436 dataSS=1
data datars45648 dataSS=1
The final goal is to count how many times every rs appears in each file and, if column 3 contains SS=1, to flag it in the output.
Something like
found rs12345 SS yes file 3 folder /root/foobar/file
found rs74565 SS no file 3 folder /root/foobar/file
Much of the problem is that you have so many nested loops. You can probably make your program faster by eliminating some of them:

One of the loops is over each of the lines in the file. But if all you want to do is determine whether any of the matches exists in the file, you can search the whole file body in one operation. To be sure, this searches a longer string, but it does so in one operation in native code instead of doing it in Python.

You also loop over all the match strings. But you know those before you start, and they are the same for each file. So this is a good case where doing more up-front work will pay off in time saved in the rest of the program. Stand back, I'm going to use a regular expression.
Here is a version of the code which combines these two ideas:
import re
import random
import sys
import time

# get_patterns makes up some test data.
def get_patterns():
    rng = random.Random(1)  # fixed seed, for reproducibility
    n = 300
    # Generate up to n unique integers between 60k and 80k.
    return list(set([str(rng.randint(60000, 80000)) for _ in xrange(n)]))

def original(rs, matches):
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            lines = f.readlines()
            for entry in rs:
                for line in lines:
                    if entry in line:
                        print("found %s in %s" % (entry, eachcsv))

def mine(rs, matches):
    my_rx = re.compile(build_regex(rs))
    for eachcsv in matches:
        with open(eachcsv, 'r') as f:
            body = f.read()
            matches = my_rx.findall(body)
            for match in matches:
                print "found %s in %s" % (match, eachcsv)

def build_regex(literal_patterns):
    return "|".join([re.escape(pat) for pat in literal_patterns])

def print_elapsed_time(label, callable, args):
    t1 = time.time()
    callable(*args)
    t2 = time.time()
    elapsed_ms = (t2 - t1) * 1000
    print "%8s: %9.1f milliseconds" % (label, elapsed_ms)

def main(args):
    rs = get_patterns()
    filenames = args[1:]
    for function_name_and_function in (('original', original), ('mine', mine)):
        name, func = function_name_and_function
        print_elapsed_time(name, func, [rs, filenames])
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))
Your original code is in there as original and my replacement is mine.
For 300 patterns, my implementation runs in 400ms on my computer, which is roughly a 30x speedup. For more match strings, the improvement should be greater. Doubling the number of patterns roughly doubles the runtime of your implementation, but the regex-based one only takes about 3% longer (though this is partly because my test patterns all have similar prefixes, which may not be true for the real data).
Edit: updated the code to print one message for each match. My code is now somewhat slower, but it's still an improvement, and should be relatively faster with more strings to match:
~/source/stackoverflow/36923237$ python search.py example.csv
found green fox in example.csv
original: 9218.0 milliseconds
found green fox in example.csv
mine: 600.4 milliseconds
Edit: explanation of the regex technique, as requested.
Suppose you want to search your file for the strings foobar and umspquux. One way to do this is to search the file first for foobar and then for umspquux. This is the approach you started with.
Another approach is to search for both strings at once. Imagine that you check the first character of the file. If it's 'f' or 'u', you might be looking at a match, and should check the second character to see if it is, respectively, 'o' or 'm'. And so on. If you get to the end of the file, you will have found all the matches there are in the file to find.
A convenient way to tell a computer to look for multiple strings at once is to use a regular expression. Normal strings are regular expressions. The regular expression 'foobar' matches the sub-string 'foobar' inside 'the foobar is all fuzzy'. However, you can do more complex things. You can combine two regular expressions, each of which matches something, into a combined regular expression which will match either of those somethings. This is done with the alternation symbol, '|'. So the regular expression 'foobar|umspquux' will match either 'foobar' or 'umspquux'. You can also match a real '|' by escaping the significance of the '|' with a backslash '\'.
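For instance, a quick illustration (my own example) of that alternation in action:
import re
rx = re.compile('foobar|umspquux')
print(rx.findall('the foobar is all fuzzy near the umspquux'))
# ['foobar', 'umspquux']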
This is what build_regex is all about. It converts the list ['foobar', 'umspquux'] into the string 'foobar|umspquux'. Try it out - put the function definition into your own file and call it with some try-out arguments to see how it behaves.
That's a good way, by the way, to figure out how any piece of code works - run part of it and print the intermediate result. This is harder to do safely with programs that have side effects, of course, but this program doesn't have any.
The call to re.escape in build_regex simply ensures that any special regular expression operators (such as '|') are escaped (giving in this case '\|') so that they will match themselves.
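For instance, a small check (my own example) of what the escaping produces:
import re
print(re.escape('a.b|c'))   # a\.b\|c -- the '.' and '|' now match themselves
print("|".join(re.escape(p) for p in ['foo|bar', 'ums.quux']))
# foo\|bar|ums\.quux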
The last part of the regular expression method here is to use the findall method of the compiled regular expression. This simply returns all the matches for our regular expression in the input string (i.e. the body of the file).
You can read up on Python regular expressions in the Python documentation on regular expressions. That documentation is basically reference material, so you might find the gentler introduction to Python regular expressions on the Google Developers site a better starting point. Jeffrey Friedl's Mastering Regular Expressions is a pretty comprehensive work on regular expressions, though it doesn't happen to cover the Python dialect.
Your most memory-intensive operation is reading every line into memory at once; reading the lines should happen in the outer loop.
You also don't want to read the entire document in at once with readlines(); iterate over the file line by line instead (or use readline()), which is much less memory-intensive.
for eachcsv in matches:
    with open(eachcsv, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, eachcsv))
More reading here https://docs.python.org/2/tutorial/inputoutput.html
There are other optimizations you can make that fall outside the scope of this question, such as using threading or multiprocessing to read many csv files at the same time. https://docs.python.org/2/library/threading.html
import threading

def findRSinCSVFile(csvFile, rs):
    with open(csvFile, 'r') as f:
        for line in f:
            for entry in rs:
                if entry in line:
                    print("found %s in %s" % (entry, csvFile))

threads = []
for csvFile in csvFiles:  # csvFiles: your list of csv paths
    threads.append(threading.Thread(target=findRSinCSVFile, args=(csvFile, rs)))
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
This will allow you to parse all the csv files at the same time.
(Note: I have not tested any of this code; it only serves as an example.)
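In the same untested-example spirit, here is a rough sketch of the multiprocessing variant, assuming findRSinCSVFile, rs and csvFiles are defined as above:
from multiprocessing import Pool
from functools import partial

if __name__ == '__main__':
    pool = Pool(processes=4)  # 4 worker processes; tune to your machine
    # Each worker handles one csv file; rs is passed along unchanged.
    pool.map(partial(findRSinCSVFile, rs=rs), csvFiles)
    pool.close()
    pool.join()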

Pushing data into regex?

I am working on a small project which I call pydlp. It is basically a set of regex signatures that extract data from a file object, plus a function that checks whether the extracted data is in fact interesting.
This code is how I perform matching. It is far from optimal, as I have to read the file over and over again.
for signature in signatures:
    match = signature.validate(signature.regex.match(fobj.read()))
    if match: matches.append(match)
    fobj.seek(0)
Is there a way to perform multiple regex matches on the same file object while only reading the file object's content once? The file can be large, so I cannot put it all in memory.
Edit:
I want to clarify what I mean by "pushing data into regex". I recognize that a regex has similarities with a finite state machine. Instead of passing the whole data to the regex engine at once, is it possible to push parts of it at a time?
while True:
    data = fobj.read(1024)
    if data == "": break
    for signature in signatures:
        match = signature.regex.push_and_match(data)
        if match: matches.append(match)
Edit 2:
Removed link, as I removed the project from github.
The standard way to do this sort of text processing with files too large to read into memory is to iterate over the file line by line:
regexes = [ .... ]

with open('large.file.txt') as fh:
    for line in fh:
        for rgx in regexes:
            m = rgx.search(line)
            if m:
                pass  # Do stuff.
But that approach assumes your regexes can operate successfully on single lines of text in isolation. If they cannot, perhaps there are other units that you can pass to the regexes (eg, paragraphs delimited by blank lines). In other words, you might need to do a bit of pre-parsing so that you can grab meaningful sections of text before sending them to your main regexes.
with open('large.file.txt') as fh:
    section = []
    for line in fh:
        if line.strip():
            section.append(line)
        else:
            # We've hit the end of a section, so we
            # should check it against our regexes.
            process_section(''.join(section), regexes)
            section = []
    # Don't forget the last one.
    if section:
        process_section(''.join(section), regexes)
Regarding your literal question: "Is there a way to perform multiple regex matches on the same file object while only reading the file object content once". No and yes. No in the sense that Python regexes operate on strings, not file objects. But you can perform multiple regex searches at the same time on one string, simply by using alternation. Here's a minimal example:
import re

patterns = 'aa bb cc'.split()
big_regex = re.compile('|'.join(patterns))  # Match this or that or that.
m = big_regex.search(some_text)
But that doesn't really solve your problem if the file is too big for memory.
Maybe consider using re.findall() if you don't need match objects but only the matched strings? If the file is too big, you can split it into parts, as you suggest, but with some overlap so that no matches are missed (if you know the nature of the regexes, it may be possible to work out how big the overlap should be).
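To sketch that idea (my own example, assuming plain literal patterns so that the maximum match length, and hence the required overlap, is known):
import re

patterns = ['aa', 'bb', 'cc']                  # hypothetical literal patterns
big_regex = re.compile('|'.join(map(re.escape, patterns)))

CHUNK = 1 << 20                                # read about 1 MB at a time
OVERLAP = max(len(p) for p in patterns) - 1    # a match can straddle a cut by at most this much

matches = []
seen = set()    # absolute start offsets already reported, to avoid double counting
offset = 0      # absolute offset of the start of the current window

with open('large.file.txt') as fh:
    window = ''
    while True:
        data = fh.read(CHUNK)
        window += data
        for m in big_regex.finditer(window):
            if offset + m.start() not in seen:
                seen.add(offset + m.start())
                matches.append(m.group(0))
        if not data:                           # end of file
            break
        keep = max(len(window) - OVERLAP, 0)   # carry only the overlap into the next round
        offset += keep
        window = window[keep:]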

Search a delimited string in a file - Python

I have the following read.json file
{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}
and the following Python script:
import re
shakes = open("read.json", "r")
needed = open("needed.txt", "w")
for text in shakes:
    if re.search('JOL":"(.+?).tr', text):
        print >> needed, text,
I want it to find what's between the two delimiters (JOL":" and .tr) and then print it. But all it does is print all the text in "read.json".
You're calling re.search, but you're not doing anything with the returned match, except to check that there is one. Instead, you're just printing out the original text. So of course you get the whole line.
The solution is simple: just store the result of re.search in a variable, so you can use it. For example:
for text in shakes:
    match = re.search('JOL":"(.+?).tr', text)
    if match:
        print >> needed, match.group(1)
In your example, the match is JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr, and the first (and only) group in it is EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD, which is (I think) what you're looking for.
However, a couple of side notes:
First, . is a special character in a regex, so you're actually matching anything up to any character followed by tr, not .tr. To match a literal dot, escape the . with a \. (And, once you start putting backslashes into a regex, use a raw string literal.) So: r'JOL":"(.+?)\.tr'.
Second, this is making a lot of assumptions about the data that probably aren't warranted. What you really want here is not "everything between JOL":" and .tr", it's "the value associated with key 'JOL' in the JSON object". The only problem is that this isn't quite a JSON object, because of that prefixed :. Hopefully you know where you got the data from, and therefore what format it's actually in. For example, if you know it's actually a sequence of colon-prefixed JSON objects, the right way to parse it is:
import json

d = json.loads(text[1:])
if 'JOL' in d:
    print >> needed, d['JOL']
Finally, double-check that the name you print to is really bound to the file object you opened as 'needed.txt'. If it's bound to some other file object, you may be overwriting a completely different file over and over, and then looking in needed.txt and seeing nothing changed each time...
If you know that your starting and ending matching strings only appear once, you can ignore that it's JSON. If that's OK, then you can split on the starting characters (JOL":"), take the 2nd element of the split array [1], then split again on the ending characters (.tr) and take the 1st element of the split array [0].
>>> text = '{:{"JOL":"EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD.tr","LAPTOP":"error"}'
>>> text.split('JOL":"')[1].split('.tr')[0]
'EuXaqHIbfEDyvph%2BMHPdCOJWMDPD%2BGG2xf0u0mP9Vb4YMFr6v5TJzWlSqq6VL0hXy07VDkWHHcq3At0SKVUrRA7shgTvmKVbjhEazRqHpvs%3D-%1E2D%TL/xs23EWsc40fWD'

How to ensure two line breaks between each paragraph in python

I am reading txt files into python, and want to get paragraph breaks consistent. Sometimes there is one, two, three, four... occasionally several tens or hundreds of blank lines between paragraphs.
It is obviously easy to strip out all the breaks, but I can only think of "botched" ways of making everything two breaks (i.e. a single blank line between each paragraph). All I can think of is specifying multiple strips/replaces for different possible combinations of breaks, which gets unwieldy when the number of breaks is very large, or iteratively removing excess breaks until I'm left with two, which I guess would be slow and not particularly scalable to many tens of thousands of txt files.
Is there a moderately fast [and/or simple] way of achieving this?
import re
re.sub(r"([\r\n]){2,}",r"\1\1",x)
You can try this.Here x will be your string containing all the paragraphs.
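For instance, a quick check (my own example) of how the substitution behaves:
import re

x = "para one\n\n\n\n\npara two\n\npara three"
print(re.sub(r"([\r\n]){2,}", r"\1\1", x))
# Output:
# para one
#
# para two
#
# para three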
Here's one way.
import os
f = open("text.txt")
r = f.read()
pars = [p for p in r.split(os.linesep) if p]
print (os.linesep * 2).join(pars)
This assumes that by "paragraph" we mean a block of text not containing a line break.
