I am putting together a text analysis script in Python using pyLDAvis, and I am trying to clean up one of the outputs into something easier to read. The function that returns the top 5 important words for 4 topics produces a list that looks like:
[(0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
 (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
 (2, '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
 (3, '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"')]
Ideally, I want to turn this into a dataframe where the first row contains the first word of each topic along with its corresponding score, and the columns alternate between word and score, i.e.:
r1col1 is 'de', r1col2 is 0.008, r1col3 is 'sas', r1col4 is 0.009, and so on.
Is there a way to extract the contents of the list and separate the values given the format it is in?
Assuming the output is consistent with your example, this should be fairly straightforward. The list contains 2-tuples, the second element of which is a string, and strings offer plenty of useful operations in Python.
str.split("+") will return a list split from str along the '+' character.
To then extract the word and the score you could make use of the python package 're' for matching regular expressions.
score = re.search('\d+.?\d*', str)
word = re.search('".*"', str)
you then use .group() to get the match as such:
score.group()
word.group()
You could also simply use split again, along '*' this time, to separate the two parts.
The returned list preserves their order.
parts = chunk.split('*')
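Putting those pieces together, here is a rough sketch that builds the dataframe layout you describe (assuming pandas is available and the list is named topics; that name is mine, not from the question):

import pandas as pd

# `topics` is the list of (topic_id, string) tuples from the question
parsed = []
for _, topic_str in topics:
    pairs = []
    for chunk in topic_str.split('+'):
        score, word = chunk.strip().split('*')   # e.g. '0.008*"de"'
        pairs.append((word.strip('"'), float(score)))
    parsed.append(pairs)

# row n holds the nth word and score of every topic side by side
rows = [[x for pair in rank for x in pair] for rank in zip(*parsed)]
df = pd.DataFrame(rows)

This assumes no word ever contains '*' or '+', which holds for your example output.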
Here is a solution using the regex "(.*?)" to extract the text between double quotes, enumerate over the extracted values to produce the expected labels, and a join on the delimiter ", ".
import re

# `values` is the list of (topic_id, string) tuples from the question
for k, v in values:
    print(
        ", ".join([f"r{k + 1}col{i + 1} is {j}"
                   for i, j in enumerate(re.findall(r'"(.*?)"', v))])
    )
r1col1 is de, r1col2 is sas, r1col3 is la, r1col4 is et, r1col5 is see
r2col1 is sas, r2col2 is de, r2col3 is les, r2col4 is recovery, r2col5 is data
r3col1 is sas, r3col2 is data, r3col3 is de, r3col4 is recovery, r3col5 is raid
r4col1 is sas, r4col2 is expensive, r4col3 is disgustingly, r4col4 is cool., r4col5 is houses
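If you also need the scores next to the words, a variant of the same idea (a sketch, not part of the original answer) can capture both with one pattern:

import re

pattern = re.compile(r'(\d+\.\d+)\*"(.*?)"')   # captures the score, then the quoted word

# `values` is the same list of (topic_id, string) tuples
for k, v in values:
    for i, (score, word) in enumerate(pattern.findall(v)):
        print(f"topic {k}, rank {i + 1}: {word} ({score})")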
I'm currently trying to convert character-level spans to token-level spans and am wondering if there's functionality in the library that I may not be taking advantage of.
The data that I'm currently using consists of "proper" text (I say "proper" as in it's written as if it's a normal document, not with things like extra whitespaces for easier split operations) and annotated entities. The entities are annotated at the character level but I would like to obtain the tokenized subword-level span.
My plan was to first convert character-level spans to word-level spans, then convert that to subword-level spans. A piece of code that I wrote looks like this:
new_text = []
for word in original_text.split():
    if (len(word) > 1) and (word[-1] in ['.', ',', ';', ':']):
        new_text.append(word[:-1] + ' ' + word[-1])
    else:
        new_text.append(word)
new_text = ' '.join(new_text).split()

word2char_span = {}
start_idx = 0
for idx, word in enumerate(new_text):
    char_start = start_idx
    char_end = char_start + len(word)
    word2char_span[idx] = (char_start, char_end)
    start_idx += len(word) + 1
This seems to work well but one edge case I didn't think of is parentheses. To give a more concrete example, one paragraph-entity pair looks like this:
>>> original_text = "RDH12, a retinol dehydrogenase causing Leber's congenital amaurosis, is also involved in \
steroid metabolism. Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities: human and \
murine RDH 12 and human RDH13. RDH12 is involved in retinal degeneration in Leber's congenital amaurosis (LCA). \
We show that murine Rdh12 and human RDH13 do not reveal activity towards the checked steroids, but that human type \
12 RDH reduces dihydrotestosterone to androstanediol, and is thus also involved in steroid metabolism. Furthermore, \
we analyzed both expression and subcellular localization of these enzymes."
>>> entity_span = [139, 143]
>>> original_text[139:143]
'RDHs'
This example actually returns a KeyError when I try to refer to (139, 143) because the adjustment code I wrote takes (RDHs) as the entity rather than RDHs. I don't want to hardcode parentheses handling either because there are some entities where the parentheses are included.
I feel like there should be a simpler approach to this issue and I'm overthinking things a bit. Any feedback on how I could achieve what I want is appreciated.
Tokenization is tricky. I'd suggest using spaCy to process your data, as you can access the character-level offset of each token in the source text, which should make mapping your character spans to tokens straightforward:
import spacy

nlp = spacy.load("en_core_web_sm")
original_text = "Three retinol dehydrogenases (RDHs) were tested for steroid converting abilities:"

# Process data
doc = nlp(original_text)
for token in doc:
    print(token.idx, token, sep="\t")
Output:
0 Three
6 retinol
14 dehydrogenases
29 (
30 RDHs
34 )
36 were
41 tested
48 for
52 steroid
60 converting
71 abilities
80 :
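From there, mapping a character-level span onto tokens is a matter of comparing offsets. A minimal sketch (the helper name is mine, not a spaCy API):

def char_span_to_tokens(doc, start, end):
    # keep tokens whose character range overlaps [start, end)
    return [t for t in doc if t.idx < end and t.idx + len(t.text) > start]

print(char_span_to_tokens(doc, 30, 34))   # [RDHs]

Note that spaCy also provides doc.char_span(start, end), which returns a Span when the boundaries line up with token boundaries and None otherwise, so you can detect misaligned annotations directly.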
I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want a combination of the words "Data" and "Analyst", but at the same time I want to specify the number of words allowed between the two words used for the combination (here there are 2 words between "Data" and "Analyst"). Currently I am using DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst') to get the counts for "job Analyst".
How can I specify the number of words in between the 2 words in the str.contains syntax?
Thanks in advance
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from a string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge and corner cases are
    not handled. Just one example: hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')
    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`
    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]
    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False
    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]
    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.
I have abstracts of academic articles. Sometimes, the abstract will contain lines like "PurposeThis article explores...." or "Design/methodology/approachThe design of our study....". I call terms like "Purpose" and "Design/methodology/approach" labels. I want the string to look like this: [label][:][space]. For example: "Purpose: This article explores...."
The code below gets me the result I want when the original string has a space between the label and the text (e.g. "Purpose This article explores...."). But I don't understand why it doesn't also work when there is no space. What do I need to do to the code below so that the labels are formatted the way I want, even when the original text has no space between the label and the text? Note that I imported re.sub.
def clean_abstract(my_abstract):
    labels = ['Purpose', 'Design/methodology/approach', 'Methodology/Approach', 'Methodology/approach', 'Findings', 'Research limitations/implications', 'Research limitations/Implications', 'Practical implications', 'Social implications', 'Originality/value']
    cleaned_abstract = my_abstract
    for i in labels:
        cleaned_abstract = sub(i, i + ': ', cleaned_abstract)
    return cleaned_abstract
Code
labels = ['Purpose', 'Design/methodology/approach', 'Methodology/Approach', 'Methodology/approach', 'Findings', 'Research limitations/implications', 'Research limitations/Implications', 'Practical implications', 'Social implications', 'Originality/value']
strings = ['PurposeThis article explores....', 'Design/methodology/approachThe design of our study....']
print [l + ": " + s.split(l)[1].lstrip() for l in labels for s in strings if l in s]
Results
[
'Purpose: This article explores....',
'Design/methodology/approach: The design of our study....'
]
Explanation
Using the logic from this post.
print([...]) prints the list of results
l + ": " + s.split(l)[1].lstrip() creates our strings
l is the current label
": " is added literally
s.split(l)[1].lstrip() splits s on l, takes the text after the label, and removes any whitespace from the left side of that remainder
for l in labels Loops over labels setting l to the value upon each iteration
for s in strings Loops over strings setting s to the value upon each iteration
if l in s If l is found in s
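For what it's worth, the original re.sub approach can also be made to handle both the space and no-space cases by consuming any whitespace that follows the label. A minimal sketch (the label list is trimmed for brevity; this assumes labels do not also appear mid-sentence):

from re import escape, sub

labels = ['Purpose', 'Design/methodology/approach', 'Findings']  # trimmed for brevity

def clean_abstract(my_abstract):
    cleaned_abstract = my_abstract
    for label in labels:
        # replace the label plus any whitespace after it with 'label: '
        cleaned_abstract = sub(escape(label) + r'\s*', label + ': ', cleaned_abstract)
    return cleaned_abstract

print(clean_abstract('PurposeThis article explores....'))
# Purpose: This article explores....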
Say I have one large string and an array of substrings that when joined equal the large string (with small differences).
For example (note the subtle differences between the strings):
large_str = "hello, this is a long string, that may be made up of multiple
substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng strin", ", that ay be mad up of multiple",
"subsrings tat aproimately ", "match the orginal strng"]
How can I best align the strings to produce a new set of sub strings from the original large_str? For example:
["hello, this is a long string", ", that may be made up of multiple",
"substrings that approximately ", "match the original string"]
Additional Info
The use case for this is to find the page breaks of the original text from the existing page breaks of text extracted from a PDF document. Text extracted from the PDF is OCR'd and has small errors compared to the original text, but the original text does not have page breaks. The goal is to accurately page break the original text avoiding the OCR errors of the PDF text.
1. Concatenate the substrings
2. Align the concatenation with the original string
3. Keep track of which positions in the original string are aligned with the boundaries between the substrings
4. Split the original string on the positions aligned with those boundaries
An implementation using Python's difflib:
from difflib import SequenceMatcher
from itertools import accumulate

large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = [
    "hello, ths is a lng strin",
    ", that ay be mad up of multiple",
    "subsrings tat aproimately ",
    "match the orginal strng"]

sub_str_boundaries = list(accumulate(len(s) for s in sub_strs))
sequence_matcher = SequenceMatcher(None, large_str, ''.join(sub_strs), autojunk=False)

match_index = 0
matches = [''] * len(sub_strs)
for tag, i1, i2, j1, j2 in sequence_matcher.get_opcodes():
    if tag == 'delete' or tag == 'insert' or tag == 'replace':
        matches[match_index] += large_str[i1:i2]
        while j1 < j2:
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            while submatch_len == 0:
                match_index += 1
                submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            j1 += submatch_len
    else:
        while j1 < j2:
            submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            while submatch_len == 0:
                match_index += 1
                submatch_len = min(sub_str_boundaries[match_index], j2) - j1
            matches[match_index] += large_str[i1:i1+submatch_len]
            j1 += submatch_len
            i1 += submatch_len

print(matches)
Output:
['hello, this is a long string',
', that may be made up of multiple ',
'substrings that approximately ',
'match the original string']
You are trying to solve a sequence alignment problem. In your case it is a "local" sequence alignment, which can be solved with the Smith-Waterman approach. One possible implementation is here.
If you run it, you receive:
large_str = "hello, this is a long string, that may be made up of multiple substrings that approximately match the original string"
sub_strs = ["hello, ths is a lng sin", ", that ay be md up of mulple", "susrings tat aproimately ", "manbvch the orhjgnal strng"]
for sbs in sub_strs:
    water(large_str, sbs)
>>>
Identity = 85.185 percent
Score = 210
hello, this is a long strin
hello, th s is a l ng s in
hello, th-s is a l-ng s--in
Identity = 84.848 percent
Score = 255
, that may be made up of multiple
, that ay be m d up of mul ple
, that -ay be m-d- up of mul--ple
Identity = 83.333 percent
Score = 225
substrings that approximately
su s rings t at a pro imately
su-s-rings t-at a-pro-imately
Identity = 75.000 percent
Score = 175
ma--tch the or-iginal string
ma ch the or g nal str ng
manbvch the orhjg-nal str-ng
The middle line shows the matching characters. If you need the positions, look for the max_i value to get the ending position in the original string. The starting position will be the value of i at the end of the water() function.
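For reference, here is a minimal sketch of the Smith-Waterman scoring recurrence itself (the weights are my own illustrative choices; the linked water() implementation additionally performs the traceback that produces the alignments shown above):

def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    # H[i][j] = best local alignment score ending at a[:i], b[:j]
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best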
(The additional info makes a lot of the following unnecessary. It was written for a situation where the substrings provided might be any permutation of the order in which they occur in the main string)
There will be a dynamic programming solution for a problem very close to this. In the dynamic programming algorithm that gives you edit distance, the state of the dynamic program is (a, b) where a is the offset into the first string and b is the offset into the second string. For each pair (a, b) you work out the smallest possible edit distance that matches the first a characters of the first string with the first b characters of the second string, working out (a, b) from (a-1, b-1), (a-1, b), and (a, b-1).
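For concreteness, a minimal sketch of that classic edit-distance recurrence (the textbook Levenshtein algorithm, not code from this answer):

def edit_distance(s, t):
    # dp[a][b] = smallest edit distance between s[:a] and t[:b]
    dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for a in range(len(s) + 1):
        dp[a][0] = a
    for b in range(len(t) + 1):
        dp[0][b] = b
    for a in range(1, len(s) + 1):
        for b in range(1, len(t) + 1):
            cost = 0 if s[a - 1] == t[b - 1] else 1
            dp[a][b] = min(dp[a - 1][b] + 1,         # delete
                           dp[a][b - 1] + 1,         # insert
                           dp[a - 1][b - 1] + cost)  # match/substitute
    return dp[len(s)][len(t)]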
You can now write a similar algorithm with state (a, n, m, b) where a is the total number of characters consumed by substrings so far, n is the index of the current substring, m is the position within the current substring, and b is the number of characters matched in the second string. This solves the problem of matching b against a string composed by pasting together any number of copies of any of the available substrings.
This is a slightly different problem: when reconstituting a long string from fragments, the solution might use the same fragment more than once. In practice, though, you might hope that the answer is obvious enough that the collection of substrings it produces happens to be a permutation of the collection given to it.
Because the edit distance returned by this method will always be at least as good as the best edit distance when you force a permutation, you could also use this to compute a lower bound on the best possible edit distance for a permutation, and run a branch and bound algorithm to find the best permutation.
I'm confronted with such a challenge right now. I've read some web tutorials and Dive Into Python on regex, and found nothing on my issue, so I'm not sure whether this is even possible to achieve.
Given this dict-alike string:
"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]
I'd like to compare values in brackets at corresponding positions and take only the greatest one. So comparing 11.76, 11.76, 11.91 should result in 11.91.
My alternative is to get all the values and compare them afterwards but I'm wondering whether regex could cope?
>>> import ast
>>> text = '''"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]'''
>>> rows = ast.literal_eval('{' + text + '}').values()
>>> [max(col) for col in zip(*rows)]
[11.91, 7.19]
Try this:
import re

text = '''"Mon.":[11.76,7.13],"Tue.":[11.76,7.19],"Wed.":[11.91,6.94]'''
values = re.findall(r'\[(.*?)\]', text)   # ['11.76,7.13', '11.76,7.19', '11.91,6.94']
values = [v.split(',') for v in values]   # split each bracketed pair
values = list(zip(*values))               # transpose: group values by position
print(max(map(float, values[0])))
print(max(map(float, values[1])))
Output:
11.91
7.19