I am working on some Latin texts that contain dates, and I have been using various regex patterns and rule-based statements to extract the dates. I was wondering whether I could instead train an algorithm to extract these dates rather than keep the method I am currently using. Thanks
This is an extract of my algorithm:
def checkLatinDates(i, record, no):
    if(i == 0 and isNumber(record[i])): #get deed no
        df.loc[no,'DeedNo'] = record[i]
    rec = record[i].lower()
    split = rec.split()
    if(split[0] == 'die'):
        items = deque(split)
        items.popleft()
        split = list(items)
    if('eodem' in rec):
        n = no-1
        if(no>1):
            while (pd.isnull(df.ix[n]['LatinDate'])):
                n = n-1
                print n
            df['LatinDate'][no] = df.ix[n]['LatinDate']
    if(words_in_string(latinMonths, rec.lower()) and len(split)<10):
        if not (dates.loc[dates['Latin'] == split[0], 'Number'].empty):
            day = dates.loc[dates['Latin'] == split[0], 'Number'].iloc[0]
            split[0] = day
            nd = ' '.join(map(str, split))
            df['LatinDate'][no] = nd
        elif(convertArabic(split[0]) != ''):
            day = convertArabic(split[0])
            split[0] = day
            nd = ' '.join(map(str, split))
            df['LatinDate'][no] = nd
You could use a machine learning algorithm such as AdaBoost with IOB tagging, adding some context features: the type of each word, a regex flag for whether it obviously looks like a date, the types of the surrounding words, etc.
Here is a tutorial.
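A rough sketch of that idea (the feature names, the toy month set and the tiny training sample below are made up for illustration): give each token an IOB label, build per-token context features as dictionaries, and feed them to scikit-learn's DictVectorizer followed by an AdaBoostClassifier:
import re
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import make_pipeline

latin_months = {'januarii', 'februarii', 'martii', 'aprilis', 'maii', 'junii'}

def token_features(tokens, i):
    """Context features for token i: the word itself, regex cues and its neighbours."""
    word = tokens[i].lower()
    return {
        'word': word,
        'looks_numeric': bool(re.match(r'^[0-9ivxlcdm]+$', word)),  # Arabic digits or Roman numeral letters
        'is_month': word in latin_months,
        'prev': tokens[i - 1].lower() if i > 0 else '<s>',
        'next': tokens[i + 1].lower() if i < len(tokens) - 1 else '</s>',
    }

# Toy training sample: tokens paired with IOB labels (B-DATE / I-DATE / O)
train = [(['die', 'quinto', 'mensis', 'januarii', 'anno', 'domini'],
          ['O', 'B-DATE', 'I-DATE', 'I-DATE', 'O', 'O'])]

X = [token_features(toks, i) for toks, labels in train for i in range(len(toks))]
y = [label for _, labels in train for label in labels]

model = make_pipeline(DictVectorizer(), AdaBoostClassifier())
model.fit(X, y)
With real training data (sentences where you have marked the date tokens), the fitted model predicts an IOB label for each new token, and consecutive B-DATE/I-DATE tokens form the extracted date.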
I have the following script that does the following:
Extracts all text from a PowerPoint (all separated by a ":::")
Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
Creates a dataframe of the term + the file in which that term appeared
Iterates through each PowerPoint for the given folder
I am hoping to adjust this to also include the specific sentence in which the term appears (i.e. the entire content between the ::: before and the ::: after the term).
end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]
files = []
text = []
for p in ppt:
    try:
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except:
        print("Failed: " + str(p))
agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()
terms = ['test', 'testing']
a = [(x, z, i) for x, z, y in zip(agg['File'], agg['Unstructured'], agg['Unstructured']) for i in terms if i in y]
#how do I also include the sentence where this term appears
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term']) #will need to add a column here
onepager = onepager.drop_duplicates(keep="first")
1 line sample of agg:
File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation
Adjustment based on input:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured','Term'])
for t in terms:
    onepager['Sentence'] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t))+3: x.find(":::", x.find(t))-3])
To find the sentence containing the word "test", try:
>>> agg["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find("test"))+3: x.find(":::",x.find("test"))-3])
Looping through your terms:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured','Term'])
for t in terms:
    onepager[t] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t))+3: x.find(":::", x.find(t))-3])
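If the find/rfind arithmetic gets fiddly (as written it trims a few characters off the end of each matched segment and needs care when the term falls in the last segment), an alternative sketch is to split on the ':::' delimiter and keep only the segments that contain the term; the column names below are the ones already used above:
def segments_with_term(unstructured, term, sep=":::"):
    """Return the delimiter-separated segments that mention the term."""
    return [seg for seg in unstructured.split(sep) if term in seg]

for t in terms:
    mask = onepager['Term'] == t
    onepager.loc[mask, 'Sentence'] = onepager.loc[mask, 'Unstructured'].apply(
        lambda x: ' | '.join(segments_with_term(x, t)))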
In the previous post I did not state my questions clearly, so I would like to start a new topic here.
I have the following items:
a sorted list of 59,000 protein patterns (range from 3 characters "FFK" to 152 characters long);
some long protein sequences, aka my reference.
I am going to match these patterns against my reference and find the locations where the matches occur. (My friend helped write a script for that.)
import sys
import re
from itertools import chain, izip
# Read input
with open(sys.argv[1], 'r') as f:
    sequences = f.read().splitlines()
with open(sys.argv[2], 'r') as g:
    patterns = g.read().splitlines()
# Write output
with open(sys.argv[3], 'w') as outputFile:
    data_iter = iter(sequences)
    order = ['antibody name', 'epitope sequence', 'start', 'end', 'length']
    header = '\t'.join([k for k in order])
    outputFile.write(header + '\n')
    for seq_name, seq in izip(data_iter, data_iter):
        locations = [[{'antibody name': seq_name, 'epitope sequence': pattern,
                       'start': match.start()+1, 'end': match.end(), 'length': len(pattern)}
                      for match in re.finditer(pattern, seq)]
                     for pattern in patterns]
        for loc in chain.from_iterable(locations):
            output = '\t'.join([str(loc[k]) for k in order])
            outputFile.write(output + '\n')
f.close()
g.close()
outputFile.close()
The problem is that, within these 59,000 patterns, after sorting I found that parts of some patterns overlap with parts of other patterns, and I would like to consolidate these into one big "consensus" pattern and keep only the consensus (see the examples below):
TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
will yield
TLYLQMNSLRAEDTAV
another example:
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR
will yield
KPGQAPRLLIYGASSRATGIPD
PS: I am aligning them here so it's easier to visualize; the 59,000 patterns are initially unsorted, so the consensus is hard to see in the actual file.
In my particular problem I am not simply picking the longest pattern; instead, I need to take every pattern into account to find the consensus. I hope I have explained my specific problem clearly enough.
Thanks!
Here's my solution, with the input order randomized to improve confidence in the test.
import re
import random
data_values = """TLYLQMNSLRAED
TLYLQMNSLRAEDT
YLQMNSLRAED
YLQMNSLRAEDT
YLQMNSLRAEDTA
YLQMNSLRAEDTAV
APRLLIYGASS
APRLLIYGASSR
APRLLIYGASSRA
APRLLIYGASSRAT
APRLLIYGASSRATG
APRLLIYGASSRATGIP
APRLLIYGASSRATGIPD
GQAPRLLIY
KPGQAPRLLIYGASSR
KPGQAPRLLIYGASSRAT
KPGQAPRLLIYGASSRATG
KPGQAPRLLIYGASSRATGIPD
LLIYGASSRATG
LLIYGASSRATGIPD
QAPRLLIYGASSR"""
test_li1 = data_values.split()
#print(test_li1)
test_li2 = ["abcdefghi", "defghijklmn", "hijklmnopq", "mnopqrst", "pqrstuvwxyz"]
def aggregate_str(data_li):
    copy_data_li = data_li[:]
    while len(copy_data_li) > 0:
        remove_li = []
        len_remove_li = len(remove_li)
        longest_str = max(copy_data_li, key=len)
        copy_data_li.remove(longest_str)
        remove_li.append(longest_str)
        # keep growing longest_str until a full pass makes no more merges
        while len_remove_li != len(remove_li):
            len_remove_li = len(remove_li)
            for value in copy_data_li:
                # pattern of optional characters finds the longest in-order overlap
                value_pattern = "".join([x + "?" for x in value])
                longest_match = max(re.findall(value_pattern, longest_str), key=len)
                if longest_match in value:
                    longest_str_index = longest_str.index(longest_match)
                    value_index = value.index(longest_match)
                    if value_index > longest_str_index and longest_str_index == 0:
                        # overlap sits at the start of longest_str: prepend value's extra leading characters
                        longest_str = value[:value_index] + longest_str
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value_index < longest_str_index and longest_str_index + len(longest_match) == len(longest_str):
                        # overlap reaches the end of longest_str: append value's characters beyond the overlap length
                        longest_str += value[len(longest_str) - longest_str_index:]
                        copy_data_li.remove(value)
                        remove_li.append(value)
                    elif value in longest_str:
                        copy_data_li.remove(value)
                        remove_li.append(value)
        print(longest_str)
        print(remove_li)

random.shuffle(test_li1)
random.shuffle(test_li2)
aggregate_str(test_li1)
#aggregate_str(test_li2)
Output from print().
KPGQAPRLLIYGASSRATGIPD
['KPGQAPRLLIYGASSRATGIPD', 'APRLLIYGASS', 'KPGQAPRLLIYGASSR', 'APRLLIYGASSRAT', 'APRLLIYGASSR', 'APRLLIYGASSRA', 'GQAPRLLIY', 'APRLLIYGASSRATGIPD', 'APRLLIYGASSRATG', 'QAPRLLIYGASSR', 'LLIYGASSRATG', 'KPGQAPRLLIYGASSRATG', 'KPGQAPRLLIYGASSRAT', 'LLIYGASSRATGIPD', 'APRLLIYGASSRATGIP']
TLYLQMNSLRAEDTAV
['YLQMNSLRAEDTAV', 'TLYLQMNSLRAED', 'TLYLQMNSLRAEDT', 'YLQMNSLRAED', 'YLQMNSLRAEDTA', 'YLQMNSLRAEDT']
Edit1 - brief explanation of the code.
1.) Find longest string in list
2.) Loop through all remaining strings and find longest possible match.
3.) Make sure that the match is not a false positive. Based on the way I've written this code, it should avoid pairing single overlaps on terminal ends.
4.) Append the match to the longest string if necessary.
5.) When nothing else can be added to the longest string, repeat the process (1-4) for the next longest string remaining.
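For example, a minimal run on two strings whose only overlap is "def" merges them end to end and prints the merged string followed by the inputs it consumed:
>>> aggregate_str(["abcdef", "defghi"])
abcdefghi
['abcdef', 'defghi']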
Edit2 - Corrected unwanted behavior when treating data like ["abcdefghijklmn", "ghijklmZopqrstuv"]
def main():
    #patterns = ["TLYLQMNSLRAED","TLYLQMNSLRAEDT","YLQMNSLRAED","YLQMNSLRAEDT","YLQMNSLRAEDTA","YLQMNSLRAEDTAV"]
    patterns = ["APRLLIYGASS","APRLLIYGASSR","APRLLIYGASSRA","APRLLIYGASSRAT","APRLLIYGASSRATG","APRLLIYGASSRATGIP","APRLLIYGASSRATGIPD","GQAPRLLIY","KPGQAPRLLIYGASSR","KPGQAPRLLIYGASSRAT","KPGQAPRLLIYGASSRATG","KPGQAPRLLIYGASSRATGIPD","LLIYGASSRATG","LLIYGASSRATGIPD","QAPRLLIYGASSR"]
    test = find_core(patterns)
    test = find_pre_and_post(test, patterns)
    #final = "YLQMNSLRAED"
    final = "KPGQAPRLLIYGASSRATGIPD"
    if test == final:
        print("worked:" + test)
    else:
        print("fail:" + test)

def find_pre_and_post(core, patterns):
    pre = ""
    post = ""
    for pattern in patterns:
        start_index = pattern.find(core)
        if len(pattern[0:start_index]) > len(pre):
            pre = pattern[0:start_index]
        if len(pattern[start_index+len(core):len(pattern)]) > len(post):
            post = pattern[start_index+len(core):len(pattern)]
    return pre + core + post

def find_core(patterns):
    test = ""
    for i in range(len(patterns)):
        for j in range(2, len(patterns[i])):
            patterncount = 0
            for pattern in patterns:
                if patterns[i][0:j] in pattern:
                    patterncount += 1
            if patterncount == len(patterns):
                test = patterns[i][0:j]
    return test

main()
What I do first is find the main core in the find_core function, starting from a substring of length two of the first string, since a single character is not enough information. I then check whether that substring appears in ALL the strings, which is my definition of a "core".
I then find the index of the core in each string to work out the pre and post substrings to attach around the core, keeping track of their lengths and updating them whenever a longer one is found. I didn't have time to explore edge cases, so here is my first shot.
The date document is written like the following:
1060301 1030727 1041201 1060606 1060531 1060629 1060623 1060720
...and some of them like....
831008 751125 1060110 890731 700815 731022 1010724 980116
Which represent the date data of:
Year (2-3 characters) / Month (2 characters) / Day (2 characters)
Some entries are blank because of missing data.
Is there a way to read this data into a properly arranged date format?
So reading 1060301, I'm assuming that's year 106, month 03 and day 01, so you can handle the different-length numbers with something like this:
valuelist = []
value = ''
date = ''
file = open('testfile.txt', 'r+')
filetowriteto = open('OUTPUTFILE', 'a+')
for line in file:
    for char in line:
        if char == ' ' or char == '\n':  # token boundary: space or end of line
            if len(value) == 6:
                date = value[0:2] + '/' + value[2:4] + '/' + value[4:]
            elif len(value) == 7:
                date = value[:3] + '/' + value[3:5] + '/' + value[5:]
            if value:  # skip empty tokens produced by consecutive separators
                valuelist.append(date)
            value = ''
            date = ''
        else:
            value += char
for t in valuelist:
    filetowriteto.write(t + ' ')
file.close()
filetowriteto.close()
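A shorter sketch of the same idea, assuming the tokens are whitespace-separated and each is either 6 or 7 digits long:
with open('testfile.txt') as infile, open('OUTPUTFILE', 'a+') as outfile:
    for line in infile:
        dates = []
        for value in line.split():
            year_len = len(value) - 4  # 2 or 3 leading year digits
            dates.append(value[:year_len] + '/' + value[year_len:year_len+2] + '/' + value[year_len+2:])
        outfile.write(' '.join(dates) + ' ')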
Please don't hesitate to comment about anything.
I have this code that I've been struggling for a while to optimize.
My dataframe is a CSV file with two columns, the second of which contains the texts.
I have a function summarize(text, n) that needs a single text and an integer as input.
def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Checking if there are fewer sentences in the given review than the required length of the summary
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word tokenized sentences
    frequency = calculate_freq(list_sentences)  # calculating the word frequency for all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Calling the rank function to get the highest ranking
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
To summarize() all the texts, I first iterate through my dataframe and create a list of all the texts, which I then iterate over again to send them one by one to the summarize() function and get the summary of each text. These for loops make my code really, really slow, but I haven't been able to figure out a way to make it more efficient, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:, 2]  # ilocating the texts
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts
our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)
ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in list(frequency.keys()):  # copy the keys so entries can be deleted while iterating
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency
def rank(ranking, n):
    # return the n sentences with the highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
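The same pattern should work with the summarize() from your question; a sketch, assuming the texts really sit in the third column (as in your data.iloc[:,2]) and every review has at least one sentence:
data = pd.read_csv('dataframe.csv')
# summarize() returns a list of sentences, so join them back into a single string
data['our_summary'] = data.iloc[:, 2].apply(lambda t: ' '.join(summarize(t, 1)))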
Such a long story...
I'm going to assume that, since you are performing a text frequency analysis, the order of reviewText doesn't matter. If that is the case:
Mega_String = ' '.join(data['reviewText'])
This concatenates all the strings in the reviewText column into one big string, with each review separated by a whitespace.
You can then feed this result to your functions.
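For instance, a sketch that reuses your calculate_freq and the NLTK tokenizers (assuming the column is called 'reviewText'), building one corpus-wide frequency table up front:
from nltk.tokenize import sent_tokenize, word_tokenize

mega_string = ' '.join(data['reviewText'].astype(str))

# Tokenize once and build a single frequency table for the whole corpus
corpus_sentences = [word_tokenize(s.lower()) for s in sent_tokenize(mega_string)]
corpus_frequency = calculate_freq(corpus_sentences)
You would then adapt summarize() to accept this precomputed frequency instead of recomputing it for every review.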
I have a large text document that I am reading in and attempting to split into multiple lists. I'm having a hard time with the logic behind actually splitting up the string.
example of the text:
Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410
This data contains 4 pieces of information in this format:
City[coordinates]Population Distances_to_previous
My aim is to split this data up into a List:
Data = [[City] , [Coordinates] , [Population] , [Distances]]
As far as I know I need to use .split statements but I've gotten lost trying to implement them.
I'd be very grateful for some ideas to get started!
I would do this in stages.
Your first split is at the '[' of the coordinates.
Your second split is at the ']' of the coordinates.
Third split is end of line.
The next line (if it starts with a number) is your distances.
I'd start with something like:
numCities = 0
Data = []
i = 0
while i < len(lines):
    split = lines[i].partition('[')
    if (split[1]):  # We found something
        city = split[0]
        split = split[2].partition(']')
        if (split[1]):
            coords = split[0]  # If you want this as a list then rsplit it
            population = split[2]
            distances = []
            if i > 0:
                i += 1
                distances = lines[i].rsplit(' ')
            Data.append([city, coords, population, distances])
            numCities += 1
    i += 1

for data in Data:
    print(data)
This will print
['Youngstown, OH', '4110,8065', '115436', []]
['Yankton, SD', '4288,9739', '12011', ['966']]
['Yakima, WA', '4660,12051', '49826', ['1513', '2410']]
The easiest way would be with a regex.
lines = """Youngstown, OH[4110,8065]115436
Yankton, SD[4288,9739]12011
966
Yakima, WA[4660,12051]49826
1513 2410"""
import re
pat = re.compile(r"""
(?P<City>.+?) # all characters up to the first [
\[(?P<Coordinates>\d+,\d+)\] # grabs [(digits,here)]
(?P<Population>\d+) # population digits here
\s # a space or a newline?
(?P<Distances>[\d ]+)? # Everything else is distances""", re.M | re.X)
groups = pat.finditer(lines)
results = [[[g.group("City")],
[g.group("Coordinates")],
[g.group("Population")],
g.group("Distances").split() if
g.group("Distances") else [None]]
for g in groups]
DEMO:
In[50]: results
Out[50]:
[[['Youngstown, OH'], ['4110,8065'], ['115436'], [None]],
[['Yankton, SD'], ['4288,9739'], ['12011'], ['966']],
[['Yakima, WA'], ['4660,12051'], ['49826'], ['1513', '2410']]]
Though if I may, it's probably BEST to do this as a list of dictionaries.
groups = pat.finditer(lines)
results = [{key: g.group(key)
            for key in ["City", "Coordinates", "Population", "Distances"]}
           for g in groups]
# then modify later
for d in results:
    try:
        d['Distances'] = d['Distances'].split()
    except AttributeError:
        # distances is None -- that's okay
        pass
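With the sample lines above, each match then becomes one dictionary, e.g.:
>>> results[1]
{'City': 'Yankton, SD', 'Coordinates': '4288,9739', 'Population': '12011', 'Distances': ['966']}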