Isolating the Sentence in which a Term appears - python

I have the following script, which:
Extracts all text from a PowerPoint (all separated by a ":::")
Compares each term in my search term list to the text and isolates just those lines of text that contain one or more of the terms
Creates a dataframe of each term plus the file in which that term appeared
Iterates through each PowerPoint for the given folder
I am hoping to adjust this to also include the specific sentence in which the term appears (i.e. the entire content between the ::: before and the ::: after the term).
import os
import pandas as pd
from pptx import Presentation

end = r'C:\Users\xxx\Table Lookup.xlsx'
rfps = r'C:\Users\xxx\Folder1'
ls = os.listdir(rfps)
ppt = [s for s in ls if '.ppt' in s]

files = []
text = []
for p in ppt:
    try:
        prs_text = []
        prs = Presentation(os.path.join(rfps, p))
        for slide in prs.slides:
            for shape in slide.shapes:
                if hasattr(shape, "text"):
                    prs_text.append(shape.text)
        prs_text = ':::'.join(prs_text)
        files.append(p)
        text.append(prs_text)
    except:
        print("Failed: " + str(p))

agg = pd.DataFrame()
agg['File'] = files
agg['Unstructured'] = text
agg['Unstructured'] = agg['Unstructured'].str.lower()

terms = ['test', 'testing']
a = [(x, z, i) for x, z, y in zip(agg['File'], agg['Unstructured'], agg['Unstructured']) for i in terms if i in y]
# how do I also include the sentence where this term appears

onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])  # will need to add a column here
onepager = onepager.drop_duplicates(keep="first")
One-row sample of agg:
File | Unstructured
File1.pptx | competitive offerings:::real-time insights and analyses for immediate use:::disruptive “moves”:::deeper strategic insights through analyses generated and assessed over time:::launch new business models:::enter new markets::::::::::::internal data:::external data:::advanced computing capabilities:::insights & applications::::::::::::::::::machine learning
write algorithms that continue to “learn” or test and improve themselves as they ingest data and identify patterns:::natural language processing
allow interactions between computers and human languages using voice and/or text. machines directly interact, analyze, understand, and reproduce information:::intelligent automation
Adjustment based on input:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager['Sentence'] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t))])

To find the sentence containing the word "test", try:
>>> agg["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find("test")) + 3: x.find(":::", x.find("test"))])
Looping through your terms:
onepager = pd.DataFrame(a, columns=['File', 'Unstructured', 'Term'])
for t in terms:
    onepager[t] = onepager["Unstructured"].apply(lambda x: x[x.rfind(":::", 0, x.find(t)) + 3: x.find(":::", x.find(t))])
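If the index arithmetic is hard to follow, a split-based sketch does the same thing per row, using each row's own Term (column names as above; this is purely an alternative, not the answer's original approach):

def sentence_for(row):
    # split the joined text back into its ::: segments and return the first one containing the term
    for segment in row["Unstructured"].split(":::"):
        if row["Term"] in segment:
            return segment
    return ""  # term not found; should not happen for rows built above

onepager["Sentence"] = onepager.apply(sentence_for, axis=1)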

Related

Repeatedly extracting substring in between specific characters, in a text file (python)

I have several pieces of data stored in a text file. I am trying to extract each type of data into individual lists so that I can plot them/make various figures. There are thousands of values, so picking them out one by one isn't really an option.
An example of the text file is :
"G4WT7 > interaction in material = MATERIAL
G4WT7 > process PROCESSTYPE
G4WT7 > at position [um] = (x,y,z)
G4WT7 > with energy [keV] = 0.016
G4WT7 > track ID and parent ID = ,a,b
G4WT7 > with mom dir = (x,y,z)
G4WT7 > number of secondaries= c
G4WT1 > interaction in material = MATERIAL
G4WT1 > process PROCESSTYPE
G4WT1 > at position [um] = (x,y,z)
G4WT1 > with energy [keV] = 0.032
G4WT1 > track ID and parent ID = ,a,b
G4WT1 > with mom dir = (x,y,z)
G4WT1 > number of secondaries= c"
I would like to extract strings such as the one following "energy [keV] =", so 0.016, 0.032, etc., into a list. I hope to be able to separate all the data similarly to this.
So far I've tried to use regex, as following:
import re

file = open('file.txt')
textfile = file.read()
Energy = re.findall('[keV] = ;(.*)G', textfile)
But it just generates an empty list; []
I'm a newbie to python, so apologies if the answer is obvious, and any help would be greatly appreciated.
You might want to escape the square brackets!

Energy = re.findall(r'\[keV\] = (.*)', text)

... or, to be on the safe side, you can also use re.escape to make sure all characters are properly escaped, e.g.:

Energy = re.findall(re.escape('[keV] = ') + '(.*)', text)
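Put together, a minimal sketch that pulls the energies out of the sample text above and converts them to floats (the file name 'file.txt' is taken from the question):

import re

with open('file.txt') as f:
    textfile = f.read()

# escape the literal brackets, then capture the rest of each matching line
energies = [float(v) for v in re.findall(r'\[keV\] = (.*)', textfile)]
print(energies)  # e.g. [0.016, 0.032]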

How to convert inkml file to an image format

I have a dataset consisting of InkML files of handwritten text. I want to convert it to a usable image format to train a CNN. A Python script would be helpful.
I found a method; the source code is given below:
import xml.etree.ElementTree as ET

def get_traces_data(inkml_file_abs_path):
    traces_data = []
    tree = ET.parse(inkml_file_abs_path)
    root = tree.getroot()
    doc_namespace = "{http://www.w3.org/2003/InkML}"

    'Stores traces_all with their corresponding id'
    traces_all = [{'id': trace_tag.get('id'),
                   'coords': [[round(float(axis_coord)) if float(axis_coord).is_integer() else round(float(axis_coord))
                               for axis_coord in coord[1:].split(' ')] if coord.startswith(' ')
                              else [round(float(axis_coord)) if float(axis_coord).is_integer() else round(float(axis_coord))
                                    for axis_coord in coord.split(' ')]
                              for coord in (trace_tag.text).replace('\n', '').split(',')]}
                  for trace_tag in root.findall(doc_namespace + 'trace')]
    # print("before sort ", traces_all)

    'Sort traces_all list by id to make searching for references faster'
    traces_all.sort(key=lambda trace_dict: int(trace_dict['id']))
    # print("after sort ", traces_all)

    'Always 1st traceGroup is a redundant wrapper'
    traceGroupWrapper = root.find(doc_namespace + 'traceGroup')

    if traceGroupWrapper is not None:
        for traceGroup in traceGroupWrapper.findall(doc_namespace + 'traceGroup'):
            label = traceGroup.find(doc_namespace + 'annotation').text

            'traces of the current traceGroup'
            traces_curr = []
            for traceView in traceGroup.findall(doc_namespace + 'traceView'):
                'Id reference to specific trace tag corresponding to currently considered label'
                traceDataRef = int(traceView.get('traceDataRef'))

                'Each trace is represented by a list of coordinates to connect'
                single_trace = traces_all[traceDataRef]['coords']
                traces_curr.append(single_trace)

            traces_data.append({'label': label, 'trace_group': traces_curr})
    else:
        'Consider Validation data that has no labels'
        [traces_data.append({'trace_group': [trace['coords']]}) for trace in traces_all]

    return traces_data
You may consider using xml.etree.ElementTree in Python to parse your inkml files and use OpenCV's cv2.line method to connect the points to draw the stroke.
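A minimal sketch of that idea, rendering one trace_group returned by get_traces_data() with cv2.line (the canvas size, margin, and scaling are illustrative choices rather than part of the original code, and only the first two coordinate channels are treated as x and y):

import cv2
import numpy as np

def draw_trace_group(trace_group, size=256):
    """Render one trace_group (a list of traces, each a list of coordinate lists) as a grayscale image."""
    img = np.full((size, size), 255, dtype=np.uint8)  # white canvas
    # collect every point so the strokes can be scaled to fit the canvas
    pts = [p for trace in trace_group for p in trace]
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    min_x, min_y = min(xs), min(ys)
    scale = (size - 10) / max(max(xs) - min_x, max(ys) - min_y, 1)
    for trace in trace_group:
        scaled = [(int((p[0] - min_x) * scale) + 5, int((p[1] - min_y) * scale) + 5) for p in trace]
        for p1, p2 in zip(scaled, scaled[1:]):
            cv2.line(img, p1, p2, 0, thickness=2)  # connect consecutive points in black
    return img

# hypothetical usage: rasterise every trace group in one file
for entry in get_traces_data('example.inkml'):
    cv2.imwrite(entry.get('label', 'unlabelled') + '.png', draw_trace_group(entry['trace_group']))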

How to avoid for loops and iterate through pandas dataframe properly?

I have this code that I've been struggling for a while to optimize.
My dataframe is a CSV file with 2 columns, of which the second column contains texts. It looks like the picture:
I have a function summarize(text, n) that needs a single text and an integer as input.
def summarize(text, n):
    sents = sent_tokenize(text)  # text into tokenized sentences
    # Checking if there are less sentences in the given review than the required length of the summary
    assert n <= len(sents)
    list_sentences = [word_tokenize(s.lower()) for s in sents]  # word tokenized sentences
    frequency = calculate_freq(list_sentences)  # calculating the word frequency for all the sentences
    ranking = defaultdict(int)
    for i, sent in enumerate(list_sentences):
        for w in sent:
            if w in frequency:
                ranking[i] += frequency[w]
    # Calling the rank function to get the highest ranking
    sents_idx = rank(ranking, n)
    # Return the best choices
    return [sents[j] for j in sents_idx]
To summarize() all the texts, I first iterate through my dataframe and create a list of all the texts, which I later iterate over again to send them one by one to the summarize() function so I can get the summary of each text. These for loops are making my code really, really slow, but I haven't been able to figure out a way to make it more efficient, and I would greatly appreciate any suggestions.
data = pd.read_csv('dataframe.csv')
text = data.iloc[:, 2]  # ilocating the texts
list_of_strings = []
for t in text:
    list_of_strings.append(t)  # creating a list of all the texts

our_summary = []
for s in list_of_strings:
    for f in summarize(s, 1):
        our_summary.append(f)

ours = pd.DataFrame({"our_summary": our_summary})
EDIT:
The other two functions are:
def calculate_freq(list_sentences):
    frequency = defaultdict(int)
    for sentence in list_sentences:
        for word in sentence:
            if word not in our_stopwords:
                frequency[word] += 1
    # We want to filter out the words with frequency below 0.1 or above 0.9 (once normalized)
    if frequency.values():
        max_word = float(max(frequency.values()))
    else:
        max_word = 1
    for w in list(frequency.keys()):
        frequency[w] = frequency[w] / max_word  # normalize
        if frequency[w] <= min_freq or frequency[w] >= max_freq:
            del frequency[w]  # filter
    return frequency

def rank(ranking, n):
    # return n first sentences with highest ranking
    return nlargest(n, ranking, key=ranking.get)
Input text: Recipes are easy and the dogs love them. I would buy this book again and again. Only thing is that the recipes don't tell you how many treats they make, but I suppose that's because you could make them all different sizes. Great buy!
Output text: I would buy this book again and again.
Have you tried something like this?
# Test data
df = pd.DataFrame({'ASIN': [0, 1], 'Summary': ['This is the first text', 'Second text']})

# Example function
def summarize(text, n=5):
    """A very basic summary"""
    return (text[:n] + '..') if len(text) > n else text

# Applying the function to the text
df['Result'] = df['Summary'].map(summarize)

#    ASIN                 Summary   Result
# 0     0  This is the first text  This ..
# 1     1             Second text  Secon..
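With the question's own summarize(text, n), which returns a list of sentences, the same pattern would look roughly like this (a sketch, assuming every review has at least one sentence so the assert passes):

data = pd.read_csv('dataframe.csv')
# apply summarize row by row and keep the single best sentence of each text
ours = pd.DataFrame({"our_summary": data.iloc[:, 2].apply(lambda t: summarize(t, 1)[0])})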
Such a long story...
I'm going to assume that, since you are performing a text frequency analysis, the order of reviewText doesn't matter. If that is the case:

Mega_String = ' '.join(data['reviewText'])

This should concatenate all the strings in the reviewText column into one big string, with each review separated by a whitespace.
You can then just feed this result to your functions.
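For example, a quick sketch of that idea against the question's summarize (assuming the text column is named reviewText, as this answer does):

Mega_String = ' '.join(data['reviewText'])
# one summarize() call over the whole joined text instead of one call per review
print(summarize(Mega_String, 1))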

Extracting certain columns from multiple files simultaneously by Python

My purpose is to extract one certain column from multiple data files.
So, I tried to use the glob module to read the files and to extract one column from each file with for statements, like below:
filin = diri + '*_7.txt'
FileList = sorted(glob.glob(filin))
for INPUT in FileList:
    a = []
    b = []
    c = []
    T = []
    f = open(INPUT, 'r')
    f.seek(0, 0)
    for columns in (raw.strip().split() for raw in f):
        b.append(columns[11])
    t = np.array(b, float)
    print t
    t = list(t)
    T = T + [t]
    f.close()
print T
The number of data files I used is 32. So, I expected the second 'for' statement to run only 32 times, generating only 32 arrays of t. However, the result doesn't look like what I expected.
I assume it may be due to the influence of the first 'for' statement, but I am not sure.
Any idea or help would be really appreciated.
Thank you,
Isaac
You clear T = [] for every file. Move the T = [] line before the first loop.
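A minimal corrected sketch of the loop, keeping the question's structure (diri and the column index 11 come from the question; the unused a and c lists are dropped):

import glob
import numpy as np

FileList = sorted(glob.glob(diri + '*_7.txt'))
T = []  # initialise once, before the loop, so it accumulates across files
for INPUT in FileList:
    b = []
    with open(INPUT, 'r') as f:
        for columns in (raw.strip().split() for raw in f):
            b.append(columns[11])
    T.append(np.array(b, float))
print(T)  # one array per file, 32 in total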

How to predict the topic of a new query using a trained LDA model using gensim?

I have trained a corpus for LDA topic modelling using gensim.
Going through the tutorial on the gensim website (this is not the whole code):
question = 'Changelog generation from Github issues?'
temp = question.lower()
for i in range(len(punctuation_string)):
    temp = temp.replace(punctuation_string[i], '')

words = re.findall(r'\w+', temp, flags=re.UNICODE | re.LOCALE)
important_words = []
important_words = filter(lambda x: x not in stoplist, words)
print important_words

dictionary = corpora.Dictionary.load('questions.dict')
ques_vec = []
ques_vec = dictionary.doc2bow(important_words)
print dictionary
print ques_vec
print lda[ques_vec]
This is the output that I get:
['changelog', 'generation', 'github', 'issues']
Dictionary(15791 unique tokens)
[(514, 1), (3625, 1), (3626, 1), (3627, 1)]
[(4, 0.20400000000000032), (11, 0.20400000000000032), (19, 0.20263215848547525), (29, 0.20536784151452539)]
I don't know how the last output is going to help me find the possible topic for the question !!!
Please help!
I have written a function in python that gives the possible topic for a new query:
def getTopicForQuery(question):
    temp = question.lower()
    for i in range(len(punctuation_string)):
        temp = temp.replace(punctuation_string[i], '')

    words = re.findall(r'\w+', temp, flags=re.UNICODE | re.LOCALE)
    important_words = []
    important_words = filter(lambda x: x not in stoplist, words)

    dictionary = corpora.Dictionary.load('questions.dict')
    ques_vec = []
    ques_vec = dictionary.doc2bow(important_words)

    topic_vec = []
    topic_vec = lda[ques_vec]

    word_count_array = numpy.empty((len(topic_vec), 2), dtype=numpy.object)
    for i in range(len(topic_vec)):
        word_count_array[i, 0] = topic_vec[i][0]
        word_count_array[i, 1] = topic_vec[i][1]

    idx = numpy.argsort(word_count_array[:, 1])
    idx = idx[::-1]
    word_count_array = word_count_array[idx]

    final = []
    final = lda.print_topic(word_count_array[0, 0], 1)
    question_topic = final.split('*')  # as format is like "probability * topic"

    return question_topic[1]
Before going through this, do refer to this link!
In the initial part of the code, the query is pre-processed so that it is stripped of stop words and unnecessary punctuation.
Then, the dictionary that was made from our own database is loaded.
We then convert the tokens of the new query to a bag of words, and the topic probability distribution of the query is calculated by topic_vec = lda[ques_vec], where lda is the trained model as explained in the link referred to above.
The distribution is then sorted w.r.t. the probabilities of the topics. The topic with the highest probability is then displayed by question_topic[1].
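A hypothetical usage line, assuming lda, stoplist and punctuation_string are already defined as in the question (Python 2, to match the code above):

print getTopicForQuery('Changelog generation from Github issues?')  # prints the most probable topic's top word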
Assuming we just need the topic with the highest probability, the following code snippet may be helpful:
def findTopic(testObj, dictionary):
    text_corpus = []
    '''
    For each query (document in the test file), tokenize the
    query, create a feature vector just like how it was done while training,
    and create text_corpus
    '''
    for query in testObj:
        temp_doc = tokenize(query.strip())
        current_doc = []
        for word in range(len(temp_doc)):
            if temp_doc[word][0] not in stoplist and temp_doc[word][1] == 'NN':
                current_doc.append(temp_doc[word][0])
        text_corpus.append(current_doc)
    '''
    For each feature vector text, lda[doc_bow] gives the topic
    distribution, which can be sorted in descending order to print the
    very first topic
    '''
    for text in text_corpus:
        doc_bow = dictionary.doc2bow(text)
        print text
        topics = sorted(lda[doc_bow], key=lambda x: x[1], reverse=True)
        print(topics)
        print(topics[0][0])
The tokenize function removes punctuation and domain-specific characters to be filtered out and gives the list of tokens. Here, the dictionary created during training is passed as a parameter of the function, but it can also be loaded from a file.
Basically, Anjmesh Pandey suggested good example code. However, the single word with the highest probability in a topic may not solely represent that topic, because in some cases clustered topics may share their most frequent words with other topics, even at the top of the list. Therefore, returning the index of the topic that is most likely to be close to the query is enough.
topic_id = sorted(lda[ques_vec], key=lambda (index, score): -score)[0][0]

The transformation of ques_vec gives you an idea per topic, and then you would try to understand what the unlabeled topic is about by checking the words that contribute most to it:

latent_topic_words = map(lambda (score, word): word, lda.show_topic(topic_id))

The show_topic() method returns a list of tuples sorted by the score of each word contributing to the topic, in descending order, and we can roughly understand the latent topic by checking those words with their weights.
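For newer Python/gensim versions that no longer allow tuple-unpacking lambdas, an equivalent sketch (assuming show_topic() returns (word, probability) pairs, as it does in recent gensim releases):

# pick the topic with the highest probability for the query
topic_id = max(lda[ques_vec], key=lambda pair: pair[1])[0]
# collect the words that contribute most to that topic
latent_topic_words = [word for word, score in lda.show_topic(topic_id)]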
