Edit: it was huge mistake to initialize name-dataset on each call (its loading the blacklist-name-dataset into memory).
Just moved m = NameDataset() to the main function. Did not measured, but its now at least 100 times faster.
I developed a (multilingual) and fast searchable DB with dataTables. I want to anonymize some data (names etc.) from the MySQL DB.
In first instance I used spaCy - but hence the dictionaries were not trained - the result had too much false-positives. Right now I am pretty happy with name-dataset. Its working completely different of course and its much slower as spaCy, which benefits of GPU-Power.
Hence the custom names-DB got pretty big (350.000 lines) and target DB is huge - the processing of each found word with regex.finditer in a loop takes for ever (Ryzen 7 3700X).
We can say 5 cases/sec - which makes >100 hours for some Million rows.
Hence each process eats just about 10% CPU power, I start several (up to 10) python processes - on the end its still takes too long.
I hope, I never have to do this again - but I am afraid, that I have to.
Thatswhy I ask, do you have any performance tipps for the following routines?
the outer for loop in main() - which is looping through the piped object (the DB rows) and calls three times (= 3 items/columns) anonymize()
which has a for loop as well, which runs through each found word
Would it make sense to rewrite them, to use CUDA/numba (RTX 2070 available) etc.?
Any other performance tipps? Thanks!
import simplejson as json
import sys, regex, logging, os
from names_dataset import NameDataset
def anonymize(sourceString, col):
replacement = 'xxx'
output = ''
words = sourceString.split(' ')
#and this second loop for each word (will run three times per row)
for word in words:
newword = word
#regex for findind/splitting the words
fRegExStr = r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)'
pattern = regex.compile(fRegExStr)
regx = pattern.finditer(word)
if regx is None:
if m.search_first_name(word, use_upper_Row=True):
output += replacement
elif m.search_last_name(word, use_upper_Row=True):
output += replacement
else:
output += word
else:
for eachword in regx:
if m.search_first_name(eachword.group(), use_upper_Row=True):
newword = newword.replace(eachword.group(), replacement)
elif m.search_last_name(eachword.group(), use_upper_Row=True):
newword = newword.replace(eachword.group(), replacement)
output += newword
output += ' '
return output
def main():
#object with data is been piped to the python script, data structure:
#MyRows: {
# [Text_A: 'some text', Text_B: 'some more text', Text_C: 'still text'],
# [Text_A: 'some text', Text_B: 'some more text', Text_C: 'still text'],
# ....several thousand rows
# }
MyRows = json.load(sys.stdin, 'utf-8')
#this is the first outer loop for each row
for Row in MyRows:
xText_A = Row['Text_A']
if Row['Text_A'] and len(Row['Text_A']) > 30:
Row['Text_A'] = anonymize(xText_A, 'Text_A')
xText_B = Row['Text_B']
if xText_B and len(xText_B) > 10:
Row['Text_B'] = anonymize(xText_B, 'Text_B')
xMyRowText_C = Row['MyRowText_C']
if xMyRowText_C and len(xMyRowText_C) > 10:
Row['MyRowText_C'] = anonymize(xMyRowText_C, 'MyRowText_C')
retVal = json.dumps(MyRows, 'utf-8')
return retVal
if __name__ == '__main__':
m = NameDataset() ## Right here is good - THIS WAS THE BOTTLENECK ##
retVal = main()
sys.stdout.write(str(retVal))
You are doing
for word in words:
newword = word
#regex for findind/splitting the words
fRegExStr = r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)'
pattern = regex.compile(fRegExStr)
regx = pattern.finditer(word)
meaning that you regex.compile exactly same thing in each turn in loop, whilst you can do it before loop begin and get same result.
I do not see other obvious optimizations, so I suggest to profile code to found what part is most time consuming.
Related
I have huge strings (7-10k characters) from log files that I need to automatically extract and tabulate information from. Each string contains approximately 40 values that are input by various people. Example;
Example string 1.) 'Color=Blue, [randomJunkdataExampleHere] Weight=345Kg, Age=34 Years, error#1 randomJunkdataExampleThere error#1'
Example string 2.) '[randomJunkdataExampleHere] Color=Red 42, Weight=256 Lbs., Age=34yers, error#1, error#2'
Example string 3.) 'Color=Yellow 13,Weight=345lbs., Age=56 [randomJunkdataExampleHere]'
Desired outcome is a new string, or even dictionary that organizes the data and readies for database entry (one string for each row of data);
Color,Weight,Age,Error#1Count,Error#2Count
blue,345,34,2,0
red,256,24,1,1
yellow,345,56,0,0
Considered using re.search for each column/value, but since there's variance in how users input data I don't know how to trap just the numbers that I want to extract. Also have no idea how to capture the number of times a 'Error#1Count' occurs in the string.
import re
line = '[randomJunkdataExampleHere] Color=Blue, Weight=345Kg, Age=34 Years, error#1, randomJunkdataExampleThere error#1'
try:
Weight = re.search('Weight=(.+?), Age',line).group(1)
except AttributeError:
Weight = 'ERROR'
Goal/Result:
Color,Weight,Age,Error#1Count,Error#2Count
blue,345,34,2,0
red,256,24,1,1
yellow,345,56,0,0
As stated above, 10000 characters really isn't a huge deal.
import time
example_string_1 = 'Color=Blue, Weight=345Kg, Age=34 Years, error#1, error#1'
example_string_2 = 'Color=Red 42, Weight=256 Lbs., Age=34 yers, error#1, error#2'
example_string_3 = 'Color=Yellow 13, Weight=345lbs., Age=56'
def run():
examples = [example_string_1, example_string_2, example_string_3]
dict_list = []
for example in examples:
# first, I would suggest tokenizing the string to identify individual data entries.
tokens = example.split(', ')
my_dict = {}
for token in tokens: # Non-error case
if '=' in token:
subtokens = token.split('=') # this will split the token into two parts, i.e ['Color', 'Blue']
my_dict[subtokens[0]] = subtokens[1]
elif '#' in token: # error case. Not sure if this is actually the format. If not, you'll have to find something to key off of.
if 'error' not in my_dict or my_dict['error'] is None:
my_dict['error'] = [token]
else:
my_dict['error'].append(token)
dict_list.append(my_dict)
# Now lets test out how fast it is.
before = time.time()
for i in range(100000): # run it a hundred thousand times
run()
after = time.time()
print("Time: {0}".format(after - before))
Yields:
Time: 0.5782015323638916
See? Not too bad. Now all that is left is to iterate over the dictionary and record the metrics you want.
In the previous post, I did not clarify the questions properly, therefore, I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (range from 3 characters "FFK" to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the location of where the match is found. (My friend helped wrtoe a script for that.)
import sys
import re
from itertools import chain, izip
# Read input
with open(sys.argv[1], 'r') as f:
sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
patterns = g.read().splitlines()
# Write output
with open(sys.argv[3], 'w') as outputFile:
data_iter = iter(sequences)
order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
header = '\t'.join([k for k in order])
outputFile.write(header + '\n')
for seq_name, seq in izip(data_iter, data_iter):
locations = [[{'antibody name': seq_name, 'epitope sequence': pattern, 'start': match.start()+1, 'end': match.end(), 'length': len(pattern)} for match in re.finditer(pattern, seq)] for pattern in patterns]
for loc in chain.from_iterable(locations):
output = '\t'.join([str(loc[k]) for k in order])
outputFile.write(output + '\n')
f.close()
g.close()
outputFile.close()
Problem is, within these 59,000 patterns, after sorted, I found that some part of one pattern match with part of the other patterns, and I would like to consolidate these into one big "consensus" patterns and just keep the consensus (see examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS : I am aligning them here so it's easier to visualize. The 59,000 patterns initially are not sorted so it's hard to see the consensus in the actual file.
In my particular problem, I am not picking the longest patterns, instead, I need to take each pattern into account to find the consensus. I hope I have explained clearly enough for my specific problem.
Thanks!
Here's my solution with randomized input order to improve confidence of the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
copy_data_li = data_li[:]
while len(copy_data_li) > 0:
remove_li = []
len_remove_li = len(remove_li)
longest_str = max(copy_data_li, key=len)
copy_data_li.remove(longest_str)
remove_li.append(longest_str)
while len_remove_li != len(remove_li):
len_remove_li = len(remove_li)
for value in copy_data_li:
value_pattern = "".join([x+"?" for x in value])
longest_match = max(re.findall(value_pattern, longest_str), key=len)
if longest_match in value:
longest_str_index = longest_str.index(longest_match)
value_index = value.index(longest_match)
if value_index > longest_str_index and longest_str_index > 0:
longest_str = value[:value_index] + longest_str
copy_data_li.remove(value)
remove_li.append(value)
elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
longest_str += value[len(longest_str)-longest_str_index:]
copy_data_li.remove(value)
remove_li.append(value)
elif value in longest_str:
copy_data_li.remove(value)
remove_li.append(value)
print(longest_str)
print(remove_li)
random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
#patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
test = find_core(patterns)
test = find_pre_and_post(test, patterns)
#final = "YLQMNSLRAED"
final = "KPGQAPRLLIYGASSRATGIPD"
if test == final:
print("worked:" + test)
else:
print("fail:"+ test)
def find_pre_and_post(core, patterns):
pre = ""
post = ""
for pattern in patterns:
start_index = pattern.find(core)
if len(pattern[0:start_index]) > len(pre):
pre = pattern[0:start_index]
if len(pattern[start_index+len(core):len(pattern)]) > len(post):
post = pattern[start_index+len(core):len(pattern)]
return pre+core+post
def find_core(patterns):
test = ""
for i in range(len(patterns)):
for j in range(2,len(patterns[i])):
patterncount = 0
for pattern in patterns:
if patterns[i][0:j] in pattern:
patterncount += 1
if patterncount == len(patterns):
test = patterns[i][0:j]
return test
main()
So what I do first is find the main core in the find_core function by starting with a string of length two, as one character is not sufficient information, for the first string. I then compare that substring and see if it is in ALL the strings as the definition of a "core"
I then find the indexes of the substring in each string to then find the pre and post substrings added to the core. I keep track of these lengths and update them if one length is greater than the other. I didn't have time to explore edge cases so here is my first shot
I have a file that every line is containing names of persons and a file containing text of speeches. The file with the names is very big(250k lines) ordered alphabetically, the speeches file has around 1k lines. What I want to do is a lookup for the names in my text file and do replacements for every occurring name from my names file.
This is my code EDIT: The with function that opens the list is executed only one time.
members_list = []
with open(path, 'r') as l:
for line in l.readlines():
members_list.append(line.strip('\n'))
for member in self.members_list:
if member in self.body:
self.body = self.body.replace(member, '<member>' + member + '</member>')
This code takes about 2.2 seconds to run, but because I have many speech files (4.5k) the total time is around 3 hours.
Is it possible to make this faster? Are generators the way to go?
Currently, you re-read each speech in memory once for each of the 250,000 names when you check "if member in self.body".
You need to parse the speech body once, finding whole words, spaces, and punctuation. Then you need to see if you have found a name, using a linear time lookup of the known member names, or at worst log time.
The problem is you have to find member names which have various word lengths. So here is a quick (and not very good) implementation I wrote up to handle checking the last three words.
# This is where you load members from a file.
# set gives us linear time lookup
members = set()
for line in ['First Person', 'Pele', 'Some Famous Writer']:
members.add(line)
# sample text
text = 'When Some Famous Writer was talking to First Person about Pele blah blah blah blah'
from collections import deque
# pretend we are actually parsing, but I'm just splitting. So lazy.
# This is why I'm not handling punctuation and spaces well, but not relevant to the current topic
wordlist = text.split()
# buffer the last three words
buffer = deque()
# TODO: loop while not done, but this sort of works to show the idea
for word in wordlist:
name = None
if len(buffer) and buffer[0] in members:
name = buffer.popleft()
if not name and len(buffer)>1:
two_word_name = buffer[0] + ' ' + buffer[1]
if two_word_name in members:
name = two_word_name
buffer.popleft()
buffer.popleft()
if not name and len(buffer)>2:
three_word_name = buffer[0] + ' ' + buffer[1] + ' ' + buffer[2]
if three_word_name in members:
name = three_word_name
buffer.popleft()
buffer.popleft()
buffer.popleft()
if name:
print ('<member>', name, '</member> ')
if len(buffer) >2:
print (buffer.popleft() + ' ')
buffer.append(word)
# TODO handle the remaining words which are still in the buffer
print (buffer)
I am just trying to demonstrate the concept. This doesn't handle spaces or punctuation. This doesn't handle the end at all -- it needs to loop while not done. It creates a bunch of temporary strings as it parses. But it illustrates the basic concept of parsing once, and even though it is horribly slow at parsing through the speech text, it might beat searching the speech text 250,000 times.
The reason you want to parse the text and check for name in set is that you do this once. A set has amortized linear time lookup, so it is much faster to check if name in members.
If I get the chance, I might edit it later to be a class that generates tokens, and fix finding names at the end, but I didn't intend this to be your final code.
I am trying to generate a sentence in the style of the bible. But whenever I run it, it stops at a KeyError on the same exact word. This is confusing as it is only using its own keys and it is the same word every time in the error, despite having random.choice.
This is the txt file if you want to run it: ftp://ftp.cs.princeton.edu/pub/cs226/textfiles/bible.txt
import random
files = []
content = ""
output = ""
words = {}
files = ["bible.txt"]
sentence_length = 200
for file in files:
file = open(file)
content = content + " " + file.read()
content = content.split(" ")
for i in range(100): # I didn't want to go through every word in the bible, so I'm just going through 100 words
words[content[i]] = []
words[content[i]].append(content[i+1])
word = random.choice(list(words.keys()))
output = output + word
for i in range(int(sentence_length)):
word = random.choice(words[word])
output = output + word
print(output)
The KeyError happens on this line:
word = random.choice(words[word])
It always happens for the word "midst".
How? "midst" is the 100th word in the text.
And the 100th position is the first time it is seen.
The consequence is that "midst" itself was never put in words as a key.
Hence the KeyError.
Why does the program reach this word so fast? Partly because of a bug here:
for i in range(100):
words[content[i]] = []
words[content[i]].append(content[i+1])
The bug here is the words[content[i]] = [] statement.
Every time you see a word,
you recreate an empty list for it.
And the word before "midst" is "the".
It's a very common word,
many other words in the text have "the".
And since words["the"] is ["midst"],
the problem tends to happen a lot, despite the randomness.
You can fix the bug of creating words:
for i in range(100):
if content[i] not in words:
words[content[i]] = []
words[content[i]].append(content[i+1])
And then when you select words randomly,
I suggest to add a if word in words condition,
to handle the corner case of the last word in the input.
"midst" is the 101st word in your source text and it is the first time it shows up. When you do this:
words[content[i]].append(content[i+1])
you are making a key:value pair but you aren't guaranteed that that value is going to be equivalent to an existing key. So when you use that value to search for a key it doesn't exist so you get a KeyError.
If you change your range to 101 instead of 100 you will see that your program almost works. That is because the 102nd word is "of" which has already occurred in your source text.
It's up to you how you want to deal with this edge case. You could do something like this:
if i == (100-1):
words[content[i]].append(content[0])
else:
words[content[i]].append(content[i+1])
which basically loops back around to the beginning of the source text when you get to the end.
Really been struggling with this one for some time now, i have many text files with a specific format from which i need to extract all the data and file into different fields of a database. The struggle is tweaking the parameters for parsing, ensuring i get all the info correctly.
the format is shown below:
WHITESPACE HERE of unknown length.
K PA DETAILS
2 4565434 i need this sentace as one DB record
2 4456788 and this one
5 4879870 as well as this one, content will vary!
X Max - there sometimes is a line beginning with 'Max' here which i don't need
There is a Line here that i do not need!
WHITESPACE HERE of unknown length.
The tough parts were 1) Getting rid of whitespace, and 2)defining the fields from each other, see my best attempt, below:
dict = {}
XX = (open("XX.txt", "r")).readlines()
for line in XX:
if line.isspace():
pass
elif line.startswith('There is'):
pass
elif line.startswith('Max', 2):
pass
elif line.startswith('K'):
pass
else:
for word in line.split():
if word.startswith('4'):
tmp_PA = word
elif word == "1" or word == "2" or word == "3" or word == "4" or word == "5":
tmp_K = word
else:
tmp_DETAILS = word
cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',(tmp_PA,tmp_K,tmp_DETAILS))
At the minute, i can pull the K & PA fields no problem using this, however my DETAILS is only pulling one word, i need the entire sentance, or at least 25 chars of it.
Thanks very much for reading and I hope you can help! :)
K
You are splitting the whole line into words. You need to split into first word, second word and the rest. Like line.split(None, 2).
It would probably use regular expressions. And use the oposite logic, that is if it starts with number 1 through 5, use it, otherwise pass. Like:
pattern = re.compile(r'([12345])\s+\(d+)\s+\(.*\S)')
f = open('XX.txt', 'r') # No calling readlines; lazy iteration is better
for line in f:
m = pattern.match(line)
if m:
cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
(m.group(2), m.group(1), m.group(3)))
Oh, and of course, you should be using prepared statement. Parsing SQL is orders of magnitude slower than executing it.
If I understand correctly your file format, you can try this script
filename = 'bug.txt'
f = file(filename,'r')
foundHeaders = False
records = []
for rawline in f:
line = rawline.strip()
if not foundHeaders:
tokens = line.split()
if tokens == ['K','PA','DETAILS']:
foundHeaders = True
continue
else:
tokens = line.split(None,2)
if len(tokens) != 3:
break
try:
K = int(tokens[0])
PA = int(tokens[1])
except ValueError:
break
records.append((K,PA,tokens[2]))
f.close()
for r in records:
print r # replace this by your DB insertion code
This will start reading the records when it encounters the header line, and stop as soon as the format of the line is no longer (K,PA,description).
Hope this helps.
Here is my attempt using re
import re
stuff = open("source", "r").readlines()
whitey = re.compile(r"^[\s]+$")
header = re.compile(r"K PA DETAILS")
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
if whitey.match(line):
pass
elif header.match(line):
pass
elif juicy_info.match(line):
result = juicy_info.search(line)
print result.group('third')
print result.group('second')
print result.group('first')
Using re I can pull the data out and manipulate it on a whim. If you only need the juicy info lines, you can actually take out all the other checks, making this a REALLY concise script.
import re
stuff = open("source", "r").readlines()
#create a regular expression using subpatterns.
#'first, 'second' and 'third' are our own tags ,
# we could call them Adam, Betty, etc.
juicy_info = re.compile(r"^(?P<first>[\d])\s(?P<second>[\d]+)\s(?P<third>.+)$")
for line in stuff:
result = juicy_info.search(line)
if result:#do stuff with data here just use the tag we declared earlier.
print result.group('third')
print result.group('second')
print result.group('first')
import re
reg = re.compile('K[ \t]+PA[ \t]+DETAILS[ \t]*\r?\n'\
+ 3*'([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*\r?\n')
with open('XX.txt') as f:
mat = reg.search(f.read())
for tripl in ((2,1,3),(5,4,6),(8,7,9)):
cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
mat.group(*tripl)
I prefer to use [ \t] instead of \s because \s matches the following characters:
blank , '\f', '\n', '\r', '\t', '\v'
and I don't see any reason to use a symbol representing more that what is to be matched, with risks to match erratic newlines at places where they shouldn't be
Edit
It may be sufficient to do:
import re
reg = re.compile(r'^([1-5])[ \t]+(\d+)[ \t]*([^\r\n]+?)[ \t]*$',re.MULTILINE)
with open('XX.txt') as f:
for mat in reg.finditer(f.read()):
cu.execute('''INSERT INTO bugInfo2 (pa, k, details) VALUES(?,?,?)''',
mat.group(2,1,3)