How to extract substrings from a huge string in Python?

I have huge strings (7-10k characters) from log files that I need to automatically extract and tabulate information from. Each string contains approximately 40 values that are input by various people. Example:
Example string 1.) 'Color=Blue, [randomJunkdataExampleHere] Weight=345Kg, Age=34 Years, error#1 randomJunkdataExampleThere error#1'
Example string 2.) '[randomJunkdataExampleHere] Color=Red 42, Weight=256 Lbs., Age=34yers, error#1, error#2'
Example string 3.) 'Color=Yellow 13,Weight=345lbs., Age=56 [randomJunkdataExampleHere]'
Desired outcome is a new string, or even a dictionary, that organizes the data and readies it for database entry (one string for each row of data):
Color,Weight,Age,Error#1Count,Error#2Count
blue,345,34,2,0
red,256,24,1,1
yellow,345,56,0,0
I considered using re.search for each column/value, but since there's variance in how users input data, I don't know how to trap just the numbers that I want to extract. I also have no idea how to count the number of times an 'error#1' occurs in the string.
import re

line = '[randomJunkdataExampleHere] Color=Blue, Weight=345Kg, Age=34 Years, error#1, randomJunkdataExampleThere error#1'
try:
    Weight = re.search('Weight=(.+?), Age', line).group(1)
except AttributeError:
    Weight = 'ERROR'

As stated above, 10,000 characters really isn't a huge deal.
import time

example_string_1 = 'Color=Blue, Weight=345Kg, Age=34 Years, error#1, error#1'
example_string_2 = 'Color=Red 42, Weight=256 Lbs., Age=34 yers, error#1, error#2'
example_string_3 = 'Color=Yellow 13, Weight=345lbs., Age=56'

def run():
    examples = [example_string_1, example_string_2, example_string_3]
    dict_list = []
    for example in examples:
        # First, I would suggest tokenizing the string to identify individual data entries.
        tokens = example.split(', ')
        my_dict = {}
        for token in tokens:
            if '=' in token:  # non-error case
                subtokens = token.split('=')  # splits the token into two parts, e.g. ['Color', 'Blue']
                my_dict[subtokens[0]] = subtokens[1]
            elif '#' in token:  # error case. Not sure if this is actually the format; if not, you'll have to find something else to key off of.
                if 'error' not in my_dict or my_dict['error'] is None:
                    my_dict['error'] = [token]
                else:
                    my_dict['error'].append(token)
        dict_list.append(my_dict)

# Now let's test how fast it is.
before = time.time()
for i in range(100000):  # run it a hundred thousand times
    run()
after = time.time()
print("Time: {0}".format(after - before))
Yields:
Time: 0.5782015323638916
See? Not too bad. Now all that's left is to iterate over the dictionaries and record the metrics you want.
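For example, the tabulation step could look something like this (a minimal sketch; the field names, the digit-extraction regex, and the lower-casing are my assumptions based on the question's examples, so adjust them to your real data):

```python
import re

def tabulate(my_dict):
    """Turn one parsed dict into a CSV row: Color,Weight,Age,Error#1Count,Error#2Count."""
    def first_number(text):
        # Pull the first run of digits out of messy values like '345Kg' or '34 Years'.
        m = re.search(r'\d+', text or '')
        return m.group() if m else 'ERROR'

    errors = my_dict.get('error', [])
    return ','.join([
        my_dict.get('Color', 'ERROR').split()[0].lower(),  # 'Blue' -> 'blue'
        first_number(my_dict.get('Weight', '')),           # '345Kg' -> '345'
        first_number(my_dict.get('Age', '')),              # '34 Years' -> '34'
        str(sum(1 for e in errors if 'error#1' in e)),
        str(sum(1 for e in errors if 'error#2' in e)),
    ])

row = tabulate({'Color': 'Blue', 'Weight': '345Kg', 'Age': '34 Years',
                'error': ['error#1', 'error#1']})
print(row)  # blue,345,34,2,0
```

Each resulting row is then ready to append to your CSV or insert into the database.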


Anonymisation of xMillion entries - need performance hints

Edit: it was a huge mistake to initialize name-dataset on each call (it loads the blacklist name dataset into memory).
I just moved m = NameDataset() to the main function. I haven't measured it, but it's now at least 100 times faster.
I developed a (multilingual) fast, searchable DB with dataTables. I want to anonymize some data (names etc.) from the MySQL DB.
At first I used spaCy, but since the dictionaries were not trained, the results had too many false positives. Right now I'm pretty happy with name-dataset. It works completely differently, of course, and is much slower than spaCy, which benefits from GPU power.
The custom names DB got pretty big (350,000 lines) and the target DB is huge, so processing each found word with regex.finditer in a loop takes forever (Ryzen 7 3700X).
Call it 5 cases/sec, which makes >100 hours for several million rows.
Since each process eats only about 10% CPU power, I start several (up to 10) Python processes, but in the end it still takes too long.
I hope I never have to do this again, but I'm afraid I will.
That's why I'm asking: do you have any performance tips for the following routines?
the outer for loop in main(), which loops through the piped object (the DB rows) and calls anonymize() three times (= 3 items/columns)
which has a for loop as well, running through each found word
Would it make sense to rewrite them to use CUDA/numba (an RTX 2070 is available), etc.?
Any other performance tips? Thanks!
import simplejson as json
import sys, regex, logging, os
from names_dataset import NameDataset

def anonymize(sourceString, col):
    replacement = 'xxx'
    output = ''
    words = sourceString.split(' ')
    # and this second loop for each word (will run three times per row)
    for word in words:
        newword = word
        # regex for finding/splitting the words
        fRegExStr = r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)'
        pattern = regex.compile(fRegExStr)
        regx = pattern.finditer(word)
        if regx is None:
            if m.search_first_name(word, use_upper_Row=True):
                output += replacement
            elif m.search_last_name(word, use_upper_Row=True):
                output += replacement
            else:
                output += word
        else:
            for eachword in regx:
                if m.search_first_name(eachword.group(), use_upper_Row=True):
                    newword = newword.replace(eachword.group(), replacement)
                elif m.search_last_name(eachword.group(), use_upper_Row=True):
                    newword = newword.replace(eachword.group(), replacement)
            output += newword
        output += ' '
    return output

def main():
    # object with data is piped to the python script, data structure:
    # MyRows: {
    #   [Text_A: 'some text', Text_B: 'some more text', Text_C: 'still text'],
    #   [Text_A: 'some text', Text_B: 'some more text', Text_C: 'still text'],
    #   ....several thousand rows
    # }
    MyRows = json.load(sys.stdin, 'utf-8')
    # this is the first outer loop, for each row
    for Row in MyRows:
        xText_A = Row['Text_A']
        if Row['Text_A'] and len(Row['Text_A']) > 30:
            Row['Text_A'] = anonymize(xText_A, 'Text_A')
        xText_B = Row['Text_B']
        if xText_B and len(xText_B) > 10:
            Row['Text_B'] = anonymize(xText_B, 'Text_B')
        xMyRowText_C = Row['MyRowText_C']
        if xMyRowText_C and len(xMyRowText_C) > 10:
            Row['MyRowText_C'] = anonymize(xMyRowText_C, 'MyRowText_C')
    retVal = json.dumps(MyRows, 'utf-8')
    return retVal

if __name__ == '__main__':
    m = NameDataset()  ## Right here is good - THIS WAS THE BOTTLENECK ##
    retVal = main()
    sys.stdout.write(str(retVal))
You are doing
for word in words:
    newword = word
    # regex for finding/splitting the words
    fRegExStr = r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)'
    pattern = regex.compile(fRegExStr)
    regx = pattern.finditer(word)
meaning that you regex.compile exactly the same thing on every turn of the loop, whereas you could do it once before the loop begins and get the same result.
I don't see any other obvious optimizations, so I suggest profiling the code to find out which part is the most time-consuming.
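To illustrate the hoisting, here's a sketch (the pattern is copied from the question; I've substituted the stdlib `re` for the third-party `regex` module, which behaves the same for these calls):

```python
import re  # the question uses the third-party `regex` module; `re` works the same here

# Compiled ONCE, at module level, instead of once per word per row.
PATTERN = re.compile(r'(?=[^\s\r\n|\(|\)])(\w+)(?=[\.\?:,!\-/\s\(\)]|$)')

def tokenize(words):
    """Collect every word-token the precompiled pattern finds."""
    hits = []
    for word in words:
        for match in PATTERN.finditer(word):  # reuses the compiled pattern
            hits.append(match.group())
    return hits

print(tokenize(['Hello,', 'Stefan!']))  # ['Hello', 'Stefan']
```

For the profiling step, running the script under `python -m cProfile -s cumtime yourscript.py` will list the most time-consuming calls.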

python script not joining strings as expected

I have a list of lists of sequences, and a corresponding list of lists of names.
testSequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
testNames = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]
I also have a list of all the identifying parts of the names:
taxonNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
I am trying to produce a new list, where each item in the list will correspond to one of the "identifying parts of the names", and the string will be made up of all the sequences for that name.
If a name and sequence do not appear in one of the inner lists (i.e., no redFish or blueFish in the first list of testNames), I want to add a string of hyphens the same length as the sequences in that list. This would give me this output:
['aaaa--AAAAAA', 'cccc--CCCCCC', '----ttTTTTTT', '----ggGGGG']
I have this piece of code to do this.
complete = [''] * len(taxonNames)
for i in range(len(testSequences)):
    for j in range(len(taxonNames)):
        sequenceLength = len(testSequences[i][0])
        for k in range(len(testSequences[i])):
            if taxonNames[j] in testNames[i][k]:
                complete[j].join(testSequences[i][k])
            if taxonNames[j] not in testNames[i][k]:
                hyphenString = "-" * sequenceLength
                complete[j].join(hyphenString)
print complete
"complete" should give my final output as explained above, but it comes out looking like this:
['', '', '', '']
How can I fix my code to give me the correct answer?
The main issue with your code is that str.join does not modify the string in place (strings in Python are immutable); it returns a new string, and you discard that result, so complete never changes. More generally, the code is hard to follow because it isn't leveraging the language elements that make Python so strong.
Here's a solution to your problem that works:
test_sequences = [
    ['aaaa', 'cccc'],
    ['tt', 'gg'],
    ['AAAAAAA', 'CCCCCC', 'TTTTTT', 'GGGGGG']]
test_names = [
    ['>xx_oneFish |xzx', '>xx_twoFish |zzx'],
    ['>xx_redFish |zxx', '>xx_blueFish |zxx'],
    ['>xx_oneFish |xzx', '>xx_twoFish |xzx', '>xz_redFish |xxx', '>zx_blueFish |xzz']]

taxon_names = ['oneFish', 'twoFish', 'redFish', 'blueFish']

def get_seqs(taxon_name, sequences_list, names_list):
    for seqs, names in zip(sequences_list, names_list):
        found_seq = None
        for seq, name in zip(seqs, names):
            if taxon_name in name:
                found_seq = seq
                break
        yield found_seq if found_seq else '-' * len(seqs[0])

result = [''.join(get_seqs(taxon_name, test_sequences, test_names))
          for taxon_name in taxon_names]
print(result)
The generator get_seqs pairs up lists from test_sequences and test_names and, for each pair, tries to find the sequence (seq) whose name (name) matches, yielding either that sequence or a string with the right number of hyphens for that list of sequences.
The generator (a function that yields multiple values) has code that quite literally follows the explanation above.
The result is then simply a matter of, for each taxon_name, getting all the resulting sequences that match in order and joining them together into a string, which is the result = ... line.
You could make it work with list indexing loops and string concatenation, but this is not a PHP question, now is it? :)
Note: for brevity, you could just access the global test_sequences and test_names instead of passing them in as parameters, but I think that would come back to haunt you if you were to actually use this code. Also, I think it makes semantic sense to change the order of names and sequences in the entire example, but I didn't to avoid further deviating from your example.
Here is a solution that may do what you want. It begins, not with your data structures from this post, but with the three example files from your previous post (which you used to build this post's data structures).
The only thing I couldn't figure out was how many hyphens to use for a missing sequence from a file.
differentNames = ['oneFish', 'twoFish', 'redFish', 'blueFish']
files = ['f1.txt', 'f2.txt', 'f3.txt']
data = [[] for _ in range(len(differentNames))]
final = []

for file in files:
    d = dict()
    with open(file, 'r') as fin:
        for line in fin:
            line = line.rstrip()
            if line.startswith('>'):  # for ex., >xx_oneFish |xxx
                underscore = line.index('_')
                space = line.index(' ')
                key = line[underscore+1:space]
            else:
                d[key] = line
    for i, key in enumerate(differentNames):
        data[i].append(d.get(key, '-' * 4))

for array in data:
    final.append(''.join(array))
print(final)
Prints:
['AAAAAAAaaaa----', 'CCCCCCcccc----', 'TTTTTT----tt', 'GGGGGG----gg']

String Cutting with multiple lines

So I'm new to Python, aside from some experience with Tkinter (some GUI experiments).
I read an .mbox file and copy the text/plain part into a string. This text contains a registration form. So a Stefan, living on Maple Street, London, working for the company "MultiVendor XXVideos", has registered with an email for a subscription.
Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
I would like to take this data and put it in a .csv row with columns
"Name", "Adress", "Company", ...
Now I've tried to cut and slice everything. For debugging I use "print" (IDE = KATE/KDE + terminal :-D).
The problem is that the data contains multiple lines after the keywords, but I only get the first line.
How would you improve my code?
import mailbox
import csv
import email
from time import sleep
import string

fieldnames = ["ID", "Subject", "Name", "Adress", "Company"]
searchKeys = ['Name_OF_Person', 'Adress_HOME', 'Company_NAME']
mbox_file = "REG.mbox"
export_file_name = "test.csv"

if __name__ == "__main__":
    with open(export_file_name, "w") as csvfile:
        writer = csv.DictWriter(csvfile, dialect='excel', fieldnames=fieldnames)
        writer.writeheader()
        for message in mailbox.mbox(mbox_file):
            if message.is_multipart():
                content = '\n'.join(part.get_payload() for part in message.get_payload())
                content = content.split('<')[0]  # only want text/plain; I'll split right before the HTML starts
                #print content
            else:
                content = message.get_payload()
            idea = message['message-id']
            sub = message['subject']
            fr = message['from']
            date = message['date']
            writer.writerow ('ID':idea,......) # CSV writing will work fine
            for line in content.splitlines():
                line = line.strip()
                for pose in searchKeys:
                    if pose in line:
                        tmp = line.split(pose)
                        pmt = tmp[1].split(":")[1]
                        if next in line !=:
                            print pose + "\t" + pmt
            sleep(1)
    csvfile.closed
OUTPUT:
OFFICIAL_POSTAL_ADDRESS =20
Here, the lines are missing..
from file:
OFFICIAL_POSTAL_ADDRESS: =20
London, testarossa street 41
EDIT2:
@Yaniv
Thank you, I'm still trying to understand every step, but I just wanted to leave a comment. I like the idea of working with the list/matrix/vector "key_value_pairs".
The number of keywords in the emails is ~20 words. Additionally, my values are sometimes line-broken by "=".
I was thinking of something like:
Search text for keyword A,
if true:
    search text from keyword A until keyword B
    if true:
        copy text after A until B
Name_OF_=
Person: Stefan
Adress_
=HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos
Maybe the HTML from EMAIL.mbox is easier to process?
<tr><td bgcolor=3D"#eeeeee"><font face=3D"Verdana" size=3D"1">
<strong>NAM=
E_REGISTERING_PERSON</strong></font></td><td bgcolor=3D"#eeeeee"><font
fac=e=3D"Verdana" size=3D"1">Stefan </font></td></tr>
But the "=" signs are still there.
Should I replace ["=", " = "] with ""?
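A side note on those "=" artifacts: =20, =3D, and a trailing = at a line break are hallmarks of MIME quoted-printable encoding, so rather than hand-replacing "=", you can decode the payload with the stdlib. (That your payload really is quoted-printable is an assumption here; check the message's Content-Transfer-Encoding header.)

```python
import quopri

# "=3D" decodes to "=", "=20" to a space, and a trailing "=" marks a
# soft line break that simply disappears when decoded.
raw = b'Name_OF_=\nPerson: Stefan=20'
decoded = quopri.decodestring(raw).decode('utf-8')
print(decoded)
```

Decoding before the keyword search would also rejoin the values that were line-broken by "=".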
I would go for a "routine" parsing loop over the input lines, maintaining current_key and current_value variables, since the value for a certain key in your data might be "annoying" and spread across multiple lines.
I've demonstrated such a parsing approach in the code below, with some assumptions about your problem. For example, if an input line starts with a whitespace character, I assume it must be one of those "annoying" values spread across multiple lines. Such lines are concatenated into a single value, using a configurable string (the parameter join_lines_using_this). Another assumption is that you may want to strip whitespace from both keys and values.
Feel free to adapt the code to fit your assumptions on the input, and raise Exceptions whenever they don't hold!
# Note the usage of .strip() in some places, to strip away whitespace. I assumed you might want that.
def parse_funky_text(text, join_lines_using_this=" "):
    key_value_pairs = []
    current_key, current_value = None, ""
    for line in text.splitlines():
        line_split = line.split(':')
        if line.startswith(" ") or len(line_split) == 1:
            if current_key is None:
                raise ValueError("Failed to parse this line, not sure which key it belongs to: %s" % line)
            current_value += join_lines_using_this + line.strip()
        else:
            if current_key is not None:
                key_value_pairs.append((current_key, current_value))
                current_key, current_value = None, ""
            current_key = line_split[0].strip()
            # We've just found a new key, so here you might want to perform additional checks,
            # e.g. if current_key not in sharedKeys: raise ValueError("Encountered a weird key?! %s in line: %s" % (current_key, line))
            current_value = ':'.join(line_split[1:]).strip()
    # Don't forget the last parsed key, value
    if current_key is not None:
        key_value_pairs.append((current_key, current_value))
    return key_value_pairs
Example usage:
text = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor
XXVideos"""
parse_funky_text(text)
Will output:
[('Name_OF_Person', 'Stefan'), ('Adress_HOME', 'London, Maple Street 45'), ('Company_NAME', 'MultiVendor XXVideos')]
You indicate in the comments that your input strings from the content should be relatively consistent. If that is the case, and you want to be able to split that string across multiple lines, the easiest thing to do would be to replace \n with spaces and then just parse the single string.
I've intentionally constrained my answer to string methods rather than inventing a huge function to do this. Reasons: 1) your process is already complex enough, and 2) your question really boils down to how to process string data that spans multiple lines. If that is the case, and the pattern is consistent, this will get this one-off job done.
content = content.replace('\n', ' ')
Then you can split on each of the boundaries in your consistently structured headers.
content = content.split("Name_OF_Person:")[1]  # take second element of the list
person = content.split("Adress_HOME:")[0]      # take content before "Adress_HOME"
content = content.split("Adress_HOME:")[1]     # take second element of the list
address = content.split("Company_NAME:")[0]    # take content before "Company_NAME"
company = content.split("Company_NAME:")[1]    # take second element of the list (the remainder), which is the company
Normally, I would suggest regex (https://docs.python.org/3.4/library/re.html). Long term, if you need to do this sort of thing again, regex will pay dividends on time spent munging data. To make "." match across multiple lines, you would use the re.DOTALL option (re.MULTILINE only changes how ^ and $ anchor, not what "." matches). So it might end up looking something like re.search('Name_OF_Person:(.*?)Adress_HOME:', html_reg_form, re.DOTALL)
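As a runnable sketch of that regex route (the field names come from the question; note it is re.DOTALL, not re.MULTILINE, that lets "." cross newline characters):

```python
import re

content = """Name_OF_Person: Stefan
Adress_HOME: London, Maple
Street
45
Company_NAME: MultiVendor XXVideos"""

# re.DOTALL makes '.' match newlines, so a value may span lines;
# the non-greedy (.*?) stops at the next known header.
m = re.search(r'Name_OF_Person:(.*?)Adress_HOME:', content, re.DOTALL)
person = ' '.join(m.group(1).split())  # collapse line breaks and extra spaces
print(person)  # Stefan
```

The same pattern shape works for the other fields; for the last field, anchor on the end of the string with `$` instead of the next header.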

How can I effectively pull out human readable strings/terms from code automatically?

I'm trying to determine the most common words, or "terms" (I think) as I iterate over many different files.
Example - For this line of code found in a file:
for w in sorted(strings, key=strings.get, reverse=True):
I'd want these unique strings/terms returned to my dictionary as keys:
for
w
in
sorted
strings
key
strings
get
reverse
True
However, I want this code to be tunable so that I can also return strings with periods or other characters between them, because I just don't know yet what makes sense; I'll need to run the script and count up the "terms" a few times first:
strings.get
How can I approach this problem? It would help to understand how to do this one line at a time so I can loop it as I read my file's lines in. I've got the basic logic down, but currently I'm tallying by unique line instead of by "term":
strings = dict()
fname = '/tmp/bigfile.txt'
with open(fname, "r") as f:
    for line in f:
        if line in strings:
            strings[line] += 1
        else:
            strings[line] = 1
for w in sorted(strings, key=strings.get, reverse=True):
    print str(w).rstrip() + " : " + str(strings[w])
(Yes I used code from my little snippet here as the example at the top.)
If the only Python token you want to keep together is the object.attr construct, then all the tokens you are interested in would fit the regular expression
\w+\.?\w*
which basically means "one or more alphanumeric characters (including _), optionally followed by a . and then some more characters".
Note that this would also match number literals like 42 or 7.6, but those are easy enough to filter out afterwards.
Then you can use collections.Counter to do the actual counting for you:
import collections
import re

pattern = re.compile(r"\w+\.?\w*")

# here I'm using the source file for `collections` as the test example
with open(collections.__file__, "r") as f:
    tokens = collections.Counter(t.group() for t in pattern.finditer(f.read()))

for token, count in tokens.most_common(5):  # show only the top 5
    print(token, count)
Running Python version 3.6.0a1, the output is this:
self 226
def 173
return 170
self.data 129
if 102
which makes sense for the collections module, since it is full of classes that use self and define methods; it also shows that the pattern does capture self.data, which fits the construct you are interested in.
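Following up on the note about number literals: one way to filter them out before counting is to drop any token that parses as a float (a sketch; the is_number helper is my own name, not from any library):

```python
import collections
import re

pattern = re.compile(r"\w+\.?\w*")

def is_number(token):
    """True for tokens like '42' or '7.6' that the pattern also matches."""
    try:
        float(token)
        return True
    except ValueError:
        return False

text = "for i in range(42): total += 7.6 * data.value"
tokens = collections.Counter(
    t.group() for t in pattern.finditer(text) if not is_number(t.group()))
print(tokens.most_common(3))
```

Here '42' and '7.6' are dropped while identifier-like tokens, including the dotted 'data.value', are kept.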

Python - reading data from file with variable attributes and line lengths

I'm trying to find the best way to parse through a file in Python and create a list of namedtuples, with each tuple representing a single data entity and its attributes. The data looks something like this:
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
UI: T145
RL: exhibits
ABR: EX
RIN: exhibited_by
RTN: R3.3.2
DEF: Shows or demonstrates.
HL: {isa} performs
STL: [Animal|Behavior]; [Group|Behavior]
UI: etc...
While several attributes are shared (e.g. UI), some are not (e.g. STY). However, I could hardcode an exhaustive list if necessary.
Since each grouping is separated by an empty line, I used split so I can process each chunk of data individually:
input = file.read().split("\n\n")
for chunk in input:
    process(chunk)
I've seen some approaches using string find/splice, itertools.groupby, and even regexes. I was thinking of using a regex like '[A-Z]*:' to find where the headers are, but I'm not sure how to pull out the multiple lines that follow until another header is reached (such as the multiline data following DEF in the first example entity).
I appreciate any suggestions.
I worked on the assumption that if a string spans multiple lines, you want the newlines replaced with spaces (and any extra spaces removed).
import re

def process_file(filename):
    reg = re.compile(r'([\w]{2,3}):\s')  # matches a line header
    tmp = ''     # stored/cached data for a multiline string
    key = None   # current key
    data = {}
    with open(filename, 'r') as f:
        for row in f:
            row = row.rstrip()
            match = reg.match(row)
            # Matches a header or is the end; store the string:
            if (match or not row) and key:
                data[key] = tmp
                key = None
                tmp = ''
            # Empty row, next dataset
            if not row:
                # Prevent empty returns
                if data:
                    yield data
                    data = {}
                continue
            # We do have a header
            if match:
                key = str(match.group(1))
                tmp = row[len(match.group(0)):]
                continue
            # No header, just append the string -> here goes the assumption that you want to
            # remove newlines and trailing spaces and replace them with one single space
            tmp += ' ' + row
    # Missed row?
    if key:
        data[key] = tmp
    # Missed group?
    if data:
        yield data
This generator yields a dict with pairs like UI: T020 on each iteration (and always at least one item).
Since it uses a generator and reads the file line by line, it should be efficient even on large files, and it won't read the whole file into memory at once.
Here's little demo:
for data in process_file('data.txt'):
    print('-' * 20)
    for i in data:
        print('%s:' % (i), data[i])
    print()
And actual output:
--------------------
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found in or deriving from a previously normal structure. Acquired abnormalities are distinguished from diseases even though they may result in pathological functioning (e.g., "hernias incarcerate").
STY: Acquired Abnormality
HL: {isa} Anatomical Abnormality
UI: T020
ABR: acab
--------------------
DEF: Shows or demonstrates.
STL: [Animal|Behavior]; [Group|Behavior]
RL: exhibits
HL: {isa} performs
RTN: R3.3.2
UI: T145
RIN: exhibited_by
ABR: EX
source = """
UI: T020
STY: Acquired Abnormality
ABR: acab
STN: A1.2.2.2
DEF: An abnormal structure, or one that is abnormal in size or location, found
in or deriving from a previously normal structure. Acquired abnormalities are
distinguished from diseases even though they may result in pathological
functioning (e.g., "hernias incarcerate").
HL: {isa} Anatomical Abnormality
"""
inpt = source.split("\n")  # just emulating a file

import re
reg = re.compile(r"^([A-Z]{2,3}):(.*)$")

output = dict()
current_key = None
current = ""
for line in inpt:
    line_match = reg.match(line)  # check if we hit a "CODE: content" line
    if line_match is not None:
        if current_key is not None:
            output[current_key] = current  # if so - store the previous key's contents
        current_key = line_match.group(1)
        current = line_match.group(2)
    else:
        current = current + line  # if not - it should be the continuation of the previous key's line
output[current_key] = current  # don't forget the last guy
print(output)
import re
from collections import namedtuple

def process(chunk):
    split_chunk = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    d = dict()
    fields = list()
    # split_chunk[0] holds whatever precedes the first header, so walk the
    # (key, value) pairs that follow it
    for i in range(1, len(split_chunk) - 1, 2):
        fields.append(split_chunk[i])
        d[split_chunk[i]] = split_chunk[i + 1].strip()
    my_tuple = namedtuple(split_chunk[1], fields)
    return my_tuple(**d)
should do. I think I'd just use the dict, though; why are you so attached to a namedtuple?
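For comparison, the dict-only version of the same split is just a few lines (a self-contained sketch; process_dict is my name for it, and the chunk text follows the question's format):

```python
import re

def process_dict(chunk):
    # re.split with a capture group keeps the headers: parts[0] is anything
    # before the first header, then (key, value) pairs alternate.
    parts = re.split(r'^([A-Z]{2,3}):', chunk, flags=re.MULTILINE)
    return {parts[i]: parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}

chunk = "UI: T145\nRL: exhibits\nDEF: Shows or demonstrates."
print(process_dict(chunk))
```

Multiline values such as the DEF paragraph stay attached to their key, since the split only triggers on line-initial headers.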
