Why is my RegEx code replacing some strings, but not others? - python

I have abstracts of academic articles. Sometimes, the abstract will contain lines like "PurposeThis article explores...." or "Design/methodology/approachThe design of our study....". I call terms like "Purpose" and "Design/methodology/approach" labels. I want the string to look like this: [label][:][space]. For example: "Purpose: This article explores...."
The code below gets me the result I want when the original string has a space between the label and the text (e.g. "Purpose This article explores...."), but I don't understand why it doesn't work when there is no space. What do I need to change in the code below so that the labels are formatted the way I want, even when the original text has no space between the label and the text? Note that I imported sub from re.
def clean_abstract(my_abstract):
    labels = ['Purpose', 'Design/methodology/approach', 'Methodology/Approach', 'Methodology/approach' 'Findings', 'Research limitations/implications', 'Research limitations/Implications' 'Practical implications', 'Social implications', 'Originality/value']
    for i in labels:
        cleaned_abstract = sub(i, i + ': ', cleaned_abstract)
    return cleaned_abstract

Code
See code in use here
# Note the commas after 'Methodology/approach' and 'Research limitations/Implications':
# without them Python silently concatenates the adjacent string literals
labels = ['Purpose', 'Design/methodology/approach', 'Methodology/Approach', 'Methodology/approach', 'Findings', 'Research limitations/implications', 'Research limitations/Implications', 'Practical implications', 'Social implications', 'Originality/value']
strings = ['PurposeThis article explores....', 'Design/methodology/approachThe design of our study....']
print([l + ": " + s.split(l)[1].lstrip() for l in labels for s in strings if l in s])
Results
[
'Purpose: This article explores....',
'Design/methodology/approach: The design of our study....'
]
Explanation
Using the logic from this post.
print(...) outputs the list that the comprehension builds
l + ": " + s.split(l)[1].lstrip() creates each formatted string
l is the label (see the loops below)
": " is the literal colon followed by a space
s.split(l)[1].lstrip() splits s on l, takes the text that follows the label, and removes any whitespace from its left side
for l in labels loops over labels, setting l to each label in turn
for s in strings loops over strings, setting s to each string in turn
if l in s keeps only the combinations where l is found in s
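For the regex route the question started from, here is a minimal sketch, assuming the goal is "label, colon, exactly one space" whether or not the source already has a space or a colon. The names LABELS and LABEL_PATTERN and the longest-first ordering are illustrative choices of mine, not part of the original code.

```python
import re

# Labels from the question, with the missing commas restored
LABELS = [
    'Purpose', 'Design/methodology/approach', 'Methodology/Approach',
    'Methodology/approach', 'Findings', 'Research limitations/implications',
    'Research limitations/Implications', 'Practical implications',
    'Social implications', 'Originality/value',
]

# re.escape protects characters such as "/" and "." from being read as regex
# syntax; longer labels come first so they win over shorter ones they contain.
LABEL_PATTERN = re.compile(
    "(" + "|".join(re.escape(label) for label in sorted(LABELS, key=len, reverse=True)) + r"):?\s*"
)

def clean_abstract(abstract):
    # Swallow an optional colon and any whitespace after the label,
    # then write the label back followed by ": "
    return LABEL_PATTERN.sub(r"\1: ", abstract)

print(clean_abstract('PurposeThis article explores....'))
print(clean_abstract('Design/methodology/approach The design of our study....'))
```

Because the pattern also consumes an existing colon, running the function twice leaves the text unchanged. One caveat: it rewrites a label wherever it occurs, so a word like "Purpose" in the middle of a sentence would be affected too.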

Related

Confused about the type of parameter that goes into this method

I'm trying to understand how this code works, we have:
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']
def split_title_and_name(person):
    return person.split()[0] + ' ' + person.split()[-1]
So we are given a list, and this method is supposed to delete everything between "Dr." and the last name. As far as I know, split() cannot be used on lists, only on strings, so person must be a string. However, we also add [0] and [-1] to person, which means we should be getting the first and last character of person, but instead we get the first word and the last word. I cannot make sense of this code! Could you please help me understand?
Any help is greatly appreciated, thank you :)
The split function splits the string into a list of words. And then we select the first and last words to form the output.
>>> person = 'Dr. Christopher Brooks'
>>> person.split()
['Dr.', 'Christopher', 'Brooks']
>>> person.split()[0]
'Dr.'
>>> person.split()[-1]
'Brooks'
This is not a real answer, just adding this for clarification on how the function would be used, given a list of strings.
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']
def split_title_and_name(person: str):
    return person.split()[0] + ' ' + person.split()[-1]
# This code does not actually run (I guess this might have been what you were trying)
# result = split_title_and_name(people)
# Using a for loop to print the result of running function over each list element
print('== With loop')
for person in people:
    result = split_title_and_name(person)
    print(result)
# Using a list comprehension to get the same results as above
print('== With list comprehension')
results = [split_title_and_name(person) for person in people]
print(results)
Python's split() method splits a string into a list. You can specify the separator; the default separator is any whitespace. In your case you didn't specify a separator, so the function splits the string person into ['Dr.', 'Christopher', 'Brooks'], and therefore [0] is 'Dr.' and [-1] is 'Brooks'. The indexes are applied to the list returned by split(), not to the string itself.
The syntax of split() is string.split(separator, maxsplit), where both parameters are optional.
If you don't give any parameters, the separator defaults to any run of whitespace (space, \t, \n, etc.) and maxsplit defaults to -1 (meaning all occurrences).
You can learn more about split() on https://www.w3schools.com/python/ref_string_split.asp
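To make the difference between indexing the string and indexing the split result concrete, here is a small sketch. The variable names (parts, shortened, first_char, first_word, limited) are mine; the function is the one from the question, rewritten to call split() only once.

```python
people = ['Dr. Christopher Brooks', 'Dr. Kevyn Collins-Thompson',
          'Dr. VG Vinod Vydiswaran', 'Dr. Daniel Romero']

def split_title_and_name(person):
    parts = person.split()             # list of words, split on any whitespace
    return parts[0] + ' ' + parts[-1]  # first word + last word

shortened = [split_title_and_name(person) for person in people]
print(shortened)

# Indexing the string gives a character; indexing the split list gives a word
first_char = people[0][0]
first_word = people[0].split()[0]
print(first_char, first_word)

# With an explicit separator and maxsplit=1, only the first space is used
limited = 'Dr. VG Vinod Vydiswaran'.split(' ', 1)
print(limited)
```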

Slice a string into two chunks of different lengths based on character in Python

So I have a file that looks something like this:
oak
elm
tulip
redbud
birch
/plants/
allium
bellflower
ragweed
switchgrass
All I want to do is split the trees and herbaceous species into two chunks so I can call them separately like this:
print(trees)
oak
elm
tulip
redbud
birch
print(herbs)
allium
bellflower
ragweed
switchgrass
As you can see in the sample data, the data chunks are of unequal length, so I have to split based on the divider "/plants/". If I try slicing, the data is now just separated by space:
for groups in plant_data:
    groups = groups.strip()
    groups = groups.replace('\n\n', '\n')
    pos = groups.find("/plants/")
    trees, herbs = (groups[:pos], groups[pos:])
    print(trees)
oa
el
tuli
redbu
birc
alliu
bellflowe
ragwee
switchgras
If I try simply splitting, I'm getting lists (which would be okay for my purposes), but they are still not split into the two groups:
for groups in plant_data:
    groups = groups.strip()
    groups = groups.replace('\n\n', '\n')
    trees = groups.split("/plants/")
    print(trees)
['oak']
['elm']
['tulip']
['redbud']
['birch']
['']
['', '']
['']
['allium']
['bellflower']
['ragweed']
['switchgrass']
To remove blank lines, which I thought was the issue, I tried following: How do I remove blank lines from a string in Python?
And I know that splitting a string by a character has been asked similarly here: Python: split a string by the position of a character
But I'm very confused as to why I can't get these two to split.
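spam = """oak
elm
tulip
redbud
birch

/plants/

allium
bellflower
ragweed
switchgrass"""
# The data has a blank line before and after the '/plants/' divider
# (see the empty strings in the question's output), hence idx-1 and idx+2
spam = spam.splitlines()
idx = spam.index('/plants/')
trees, herbs = spam[:idx-1], spam[idx+2:]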
print(trees)
print(herbs)
output
['oak', 'elm', 'tulip', 'redbud', 'birch']
['allium', 'bellflower', 'ragweed', 'switchgrass']
Of course, instead of juggling idx-1 and idx+2, you can drop the empty strings first (e.g. with a list comprehension), starting again from the original multi-line string:
spam = [line for line in spam.splitlines() if line]
idx = spam.index('/plants/')
trees, herbs = spam[:idx], spam[idx+1:]
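As for why the attempts in the question fail: the loop there handles one line at a time, so find("/plants/") returns -1 on every line that is not the divider, and slicing with [:-1] just drops the last character. A minimal sketch of working on the whole text at once instead, using str.partition; the function name split_on_divider and the inline sample text are assumptions for illustration (with a real file you would read its full contents into one string first).

```python
plant_text = """oak
elm
tulip
redbud
birch

/plants/

allium
bellflower
ragweed
switchgrass
"""

def split_on_divider(text, divider="/plants/"):
    # partition splits once, at the first occurrence of the divider
    before, _, after = text.partition(divider)

    def non_empty_lines(block):
        return [line.strip() for line in block.splitlines() if line.strip()]

    return non_empty_lines(before), non_empty_lines(after)

trees, herbs = split_on_divider(plant_text)
print(trees)
print(herbs)

# What went wrong per line: no divider in 'oak', so find() gives -1
pos = 'oak'.find('/plants/')
truncated = 'oak'[:pos]
```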

Python extracting contents from list

I am putting together a text analysis script in Python using pyLDAvis, and I am trying to clean up one of the outputs into something cleaner and easier to read. The function to return the top 5 important words for 4 topics is a list that looks like:
[(0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
(1,
'0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
(2,
'0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
(3,
'0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"')]
I ideally want to turn this into a dataframe where the first row contains the first words of each topic, as well as the corresponding score, and the columns represent the word and its score i.e.:
r1col1 is 'de', r1col2 is 0.008, r1col3 is 'sas', r1col4 is 0.009, etc, etc.
Is there a way to extract the contents of the list and separate the values given the format it is in?
Assuming the output is consistent with your example, it should be fairly straightforward. The list contains 2-tuples whose second element is a string, and Python has plenty of operations available for strings.
terms.split("+") returns a list of the pieces of the string terms, split at each '+' character (terms being the second element of one tuple).
To then extract the word and the score from one of those pieces (called term here), you could use the Python package 're' for matching regular expressions.
score = re.search(r'\d+\.?\d*', term)
word = re.search(r'"(.*?)"', term)
You then use .group() to get the match:
score.group()
word.group(1)
You could also simply use split again, this time on '*', to separate the two parts.
The returned list keeps the order of the original text: the score first, then the quoted word.
l = term.split('*')
Here is a solution that uses the regex "(.*?)" to extract the text between double quotes, enumerates the extracted values to build the expected labels, and joins them with the delimiter ", ".
import re
# values is the list of (topic, terms) tuples shown in the question
for k, v in values:
    print(
        ", ".join([f"r{k + 1}col{i + 1} is {j}"
                   for i, j in enumerate(re.findall(r'"(.*?)"', v))])
    )
r1col1 is de, r1col2 is sas, r1col3 is la, r1col4 is et, r1col5 is see
r2col1 is sas, r2col2 is de, r2col3 is les, r2col4 is recovery, r2col5 is data
r3col1 is sas, r3col2 is data, r3col3 is de, r3col4 is recovery, r3col5 is raid
r4col1 is sas, r4col2 is expensive, r4col3 is disgustingly, r4col4 is cool., r4col5 is houses
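Neither answer keeps the scores or produces the row layout the question describes (row 1 holding the first word of every topic with its score). A possible sketch of that step with only the standard library; the names PAIR, per_topic and rows are mine, and the layout is my reading of the "r1col1 is 'de', r1col2 is 0.008, r1col3 is 'sas', r1col4 is 0.009" description.

```python
import re

values = [
    (0, '0.008*"de" + 0.007*"sas" + 0.004*"la" + 0.003*"et" + 0.003*"see"'),
    (1, '0.009*"sas" + 0.004*"de" + 0.003*"les" + 0.003*"recovery" + 0.003*"data"'),
    (2, '0.007*"sas" + 0.006*"data" + 0.005*"de" + 0.004*"recovery" + 0.004*"raid"'),
    (3, '0.019*"sas" + 0.009*"expensive" + 0.008*"disgustingly" + 0.008*"cool." + 0.008*"houses"'),
]

# One match per term: group 1 is the score, group 2 the word between the quotes
PAIR = re.compile(r'(\d*\.?\d+)\*"(.*?)"')

# per_topic[t] is the list of (word, score) pairs of topic t, in ranked order
per_topic = [
    [(word, float(score)) for score, word in PAIR.findall(terms)]
    for _, terms in values
]

# zip(*per_topic) groups the n-th pair of every topic together, so each row
# holds word, score, word, score, ... across the topics
rows = [
    [cell for pair in rank for cell in pair]
    for rank in zip(*per_topic)
]

for row in rows:
    print(row)
```

A list of lists like rows can then be handed to pandas.DataFrame(rows) if a dataframe is what you need in the end.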

Remove mirrored duplicate strings in list python?

What is an efficient python algorithm to remove all mirrored text duplicates in a list where the items are in the format as below?
ExList = [' dutch italian english', ' italian english dutch', ' dutch italian german', ' dutch german italian' ]
Required result: [' dutch english italian ', 'dutch german italian' ]
This solution uses the set data structure and focuses on producing compact code, mostly with list/set/generator comprehensions. If this is a homework task for a beginner course and you just copy the result, it will be very obvious that you did not write the code yourself. Try to follow the thought process and reproduce the results yourself.
1) Split each element at " " (space):
for item in ExList:
    splitted = item.split(" ")
2) Remove the elements that are now empty because of superfluous spaces in the input. This can be done in one line together with the step above (empty strings are "falsy") using a list comprehension:
for item in ExList:
    splitted = [lang for lang in item.split(" ") if lang]
3) Put the result in a set, which by definition disregards order and ignores duplicates. For this step we primarily need the property of unordered identity, meaning set([1, 2]) == set([2, 1]). This can be combined with the line above using a generator comprehension:
for item in ExList:
    itemSet = set(lang for lang in item.split(" ") if lang)
Now, within that loop, put all those sets of languages into another set. This time, because all the item sets with the same items in any order are considered equal, the outer set will automatically disregard any duplicates. To be able to put the item set into another set, it needs to be immutable (because mutability might cause a change in identity), which is called a frozenset in python. The code looks like this:
ExList = [' dutch italian english', ' italian english dutch', ' dutch italian german', ' dutch german italian' ]
result = set()
for item in ExList:
    result.add(frozenset(lang for lang in item.split(" ") if lang))
Or, as a set comprehension on one line:
result = {frozenset(lang for lang in item.split(" ") if lang) for item in ExList}
The result is as follows:
>>> print(result)
{frozenset({'italian', 'dutch', 'german'}), frozenset({'italian', 'dutch', 'english'})}
You can turn that back into lists if the printed set output looks confusing to you:
>>> print([list(itemSet) for itemSet in result])
[['italian', 'dutch', 'german'], ['italian', 'dutch', 'english']]
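Since sets are unordered, the order inside each list above can vary between runs. If you want strings like the ones in the required result, one option (my own addition, with the variable names unique and result chosen for illustration) is to sort the languages within each set and then sort the joined strings:

```python
ExList = [' dutch italian english', ' italian english dutch',
          ' dutch italian german', ' dutch german italian']

# split() without arguments also discards the leading spaces
unique = {frozenset(item.split()) for item in ExList}

# Sorting inside each set and across the results gives a stable output
result = sorted(' '.join(sorted(langs)) for langs in unique)
print(result)
```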
This may work for you:
def unique_list(items):
    # Sorting the words makes mirrored strings identical; the set drops duplicates
    x = set(tuple(sorted(s.split())) for s in items)
    return [" ".join(s) for s in x]
print(unique_list(ExList))
This might not be the most efficient solution, but I hope it will be of some help.
It uses the property that the keys of a dictionary are unique.
m_dict = {}
for a in ExList:
    b = a.split()
    b.sort()
    m_dict[' '.join(b)] = None
print(list(m_dict.keys()))

Python, working with strings

I need to write a program for my class that reads a messed-up text from a file and gives it a book layout, turning this input:
This is programing story , for programmers . One day a variable
called
v comes to a bar and ordred some whiskey, when suddenly
a new variable was declared .
a new variable asked : " What did you ordered? "
into output
This is programing story,
for programmers. One day
a variable called v comes
to a bar and ordred some
whiskey, when suddenly a
new variable was
declared. A new variable
asked: "what did you
ordered?"
I am a total beginner at programming, and here is my code:
def vypis(t):
    cely_text = ''
    for riadok in t:
        cely_text += riadok.strip()
    a = 0
    for i in range(0,80):
        if cely_text[0+a] == " " and cely_text[a+1] == " ":
            cely_text = cely_text.replace ("  ", " ")
        a+=1
    d=0
    for c in range(0,80):
        if cely_text[0+d] == " " and (cely_text[a+1] == "," or cely_text[a+1] == "." or cely_text[a+1] == "!" or cely_text[a+1] == "?"):
            cely_text = cely_text.replace (" ", "")
        d+=1
def vymen(riadok):
    for ch in riadok:
        if ch in '.,":':
            riadok = riadok[ch-1].replace(" ", "")
x = int(input("Zadaj x"))  # "Zadaj x" means "Enter x"
t = open("text.txt", "r")
v = open("prazdny.txt", "w")
print(vypis(t))
This code deleted some of the spaces, and I tried to delete the spaces before signs like " .,_?", but that did not work. Why? Thanks for the help :)
You want to do quite a lot of things, so let's take them in order:
Let's get the text in a nice text form (a list of strings):
>>> with open('text.txt', 'r') as f:
...     lines = f.readlines()
>>> lines
['This is programing story , for programmers . One day a variable',
'called', 'v comes to a bar and ordred some whiskey, when suddenly ',
' a new variable was declared .',
'a new variable asked : " What did you ordered? "']
You have newlines all around the place. Let's replace them by spaces and join everything into a single big string:
>>> text = ' '.join(line.replace('\n', ' ') for line in lines)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'
Now we want to remove any multiple spaces. We split by space, tabs, etc... and keep only the non-empty words:
>>> words = [word for word in text.split() if word]
>>> words
['This', 'is', 'programing', 'story', ',', 'for', 'programmers', '.', 'One', 'day', 'a', 'variable', 'called', 'v', 'comes', 'to', 'a', 'bar', 'and', 'ordred', 'some', 'whiskey,', 'when', 'suddenly', 'a', 'new', 'variable', 'was', 'declared', '.', 'a', 'new', 'variable', 'asked', ':', '"', 'What', 'did', 'you', 'ordered?', '"']
Let us join our words by spaces... (only one this time)
>>> text = ' '.join(words)
>>> text
'This is programing story , for programmers . One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared . a new variable asked : " What did you ordered? "'
We now want to remove all the <SPACE>., <SPACE>, etc...:
>>> for char in (',', '.', ':', '"', '?', '!'):
...     text = text.replace(' ' + char, char)
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. a new variable asked:" What did you ordered?"'
OK, the work is not done: the double quotes are still messed up, the capital letters are not set, etc. You can keep updating your text incrementally. For the capital letters, consider for instance:
>>> sentences = text.split('.')
>>> sentences
['This is programing story, for programmers', ' One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared', ' a new variable asked:" What did you ordered?"']
See how you can fix it?
The trick is to use only string transformations such that:
A correct sentence is UNCHANGED by the transformation
An incorrect sentence is IMPROVED by the transformation
This way you can compose them and improve your text incrementally.
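A minimal sketch of that idea, assuming three transformations of my own choosing (the function names and the exact rules are illustrative, not a complete solution): each one leaves already-correct text alone, so they can be chained in a loop and re-applied safely.

```python
import re

def collapse_spaces(text):
    # Any run of whitespace (including newlines) becomes a single space
    return ' '.join(text.split())

def attach_punctuation(text):
    # Remove the whitespace in front of a punctuation sign
    return re.sub(r'\s+([,.:;?!])', r'\1', text)

def capitalize_sentences(text):
    # Upper-case a lower-case letter at the start of the text or after . ? !
    return re.sub(r'(^|[.?!]\s+)([a-z])',
                  lambda m: m.group(1) + m.group(2).upper(), text)

def clean(text):
    for step in (collapse_spaces, attach_punctuation, capitalize_sentences):
        text = step(text)
    return text

messy = ('This is programing story , for programmers . One day   a variable\n'
         'was declared .\na new variable asked')
print(clean(messy))
```

Applying clean to its own output changes nothing, which is exactly the "correct text is unchanged" property described above.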
Once you have a nicely formatted text, like this:
>>> text
'This is programing story, for programmers. One day a variable called v comes to a bar and ordred some whiskey, when suddenly a new variable was declared. A new variable asked: "what did you ordered?"'
You have to define similar syntactic rules for printing it out in book format. Consider for instance the function:
>>> def prettyprint(text):
...     return '\n'.join(text[i:i+50] for i in range(0, len(text), 50))
It cuts the text into lines of exactly 50 characters (the last one may be shorter):
>>> print(prettyprint(text))
This is programing story, for programmers. One day
a variable called v comes to a bar and ordred som
e whiskey, when suddenly a new variable was declar
ed. A new variable asked: "what did you ordered?"
Not bad, but it can be better: words are cut in the middle. Just as we previously juggled text, lines, sentences and words to match the syntactic rules of the English language, we want to do exactly the same to match the syntactic rules of printed books.
In that case, both the English language and printed books work on the same units: words, arranged in sentences. This suggests we might want to work on these directly. A simple way to do that is to define your own objects:
>>> class Sentence(object):
...     def __init__(self, content, punctuation):
...         self.content = content
...         self.endby = punctuation
...     def pretty(self):
...         nice = []
...         content = self.content.pretty()
...         # A sentence starts with a capital letter
...         nice.append(content[0].upper())
...         # The rest has already been prettified by the content
...         nice.extend(content[1:])
...         # Do not forget the punctuation sign the sentence ends with
...         nice.append(self.endby)
...         return ''.join(nice)
>>> class Paragraph(object):
...     def __init__(self, sentences):
...         self.sentences = sentences
...     def pretty(self):
...         # Separating our sentences by a single space
...         return ' '.join(sentence.pretty() for sentence in self.sentences)
etc... This way you can represent your text as:
>>> Paragraph(
...     Sentence(
...         Propositions([Proposition(['this',
...                                    'is',
...                                    'programming',
...                                    'story']),
...                       Proposition(['for',
...                                    'programmers'])],
...                      ','),
...         '.'),
...     Sentence(...
etc...
Converting from a string (even a messed up one) to such a tree is relatively straightforward as you only break down to the smallest possible elements. When you want to print it in book format, you can define your own book methods on each element of the tree, e.g. like this, passing around the current line, the output lines and the current offset on the current line:
class Proposition(object):
    ...
    def book(self, line, lines, offset, line_length):
        for word in self.words:
            if offset + len(word) > line_length:
                # The word does not fit: flush the current line, start a new one
                lines.append(' '.join(line))
                line = []
                offset = 0
            line.append(word)
            # Count the word plus the space that will follow it
            offset += len(word) + 1
        return line, lines, offset
    ...
class Propositions(object):
    ...
    def book(self, line, lines, offset, line_length):
        line, lines, offset = self.Proposition1.book(line, lines, offset, line_length)
        if offset + len(self.punctuation) + 1 > line_length:
            # Need to add the punctuation sign with the last word
            # to a new line
            word = line.pop()
            lines.append(' '.join(line))
            line = [word + self.punctuation + ' ']
            offset = len(word + self.punctuation + ' ')
        else:
            # The sign fits: attach it to the last word of the current line
            line[-1] += self.punctuation
            offset += len(self.punctuation)
        line, lines, offset = self.Proposition2.book(line, lines, offset, line_length)
        return line, lines, offset
And work your way up to Sentence, Paragraph, Chapter...
This is a very simplistic implementation (and actually a non-trivial problem) which does not take into account syllabification or justification (which you would probably like to have), but this is the way to go.
Note that I did not mention the string module, string formatting or regular expressions which are tools to use once you can define your syntactic rules or transformations. These are extremely powerful tools, but the most important here is to know exactly the algorithm to transform an invalid string into a valid one. Once you have some working pseudocode, regexps and format strings can help you achieve it with less pain than plain character iteration. (in my previous example of tree of words for instance, regexps can tremendously ease the construction of the tree, and Python's powerful string formatting functions can make the writing of book or pretty methods much easier).
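If all you need for the line-breaking step is to wrap at word boundaries, the standard library's textwrap module already does that. This is a simpler substitute for the object tree above, not an implementation of it; the function name book_format and the width of 25 are assumptions (the width would presumably be the x the question reads from input).

```python
import textwrap

text = ('This is programing story, for programmers. One day a variable '
        'called v comes to a bar and ordred some whiskey, when suddenly '
        'a new variable was declared. A new variable asked: '
        '"what did you ordered?"')

def book_format(text, width=25):
    # fill() breaks only between words and keeps every line <= width
    return textwrap.fill(text, width=width)

page = book_format(text)
print(page)
```

Note that textwrap packs each line greedily, so the exact line breaks may differ from the sample output in the question.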
To strip the multiple spaces you could use a simple regex substitution.
import re
cely_text = re.sub(r' +', ' ', cely_text)
Then, for the punctuation, you could run a similar substitution that drops the spaces in front of each sign:
cely_text = re.sub(r' +([,.:])', r'\g<1>', cely_text)
