I have an acronym dictionary whose keys are acronyms and whose values are the full forms.
I want to replace the acronyms found in text_list with the full forms to arrive at output_list.
acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed',
    'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]
output_list = [
    'The status Quotation has changed',
    'I SWEAR, This is part of India_Software',
    'The update does not belong to the Software, version, branch',
    'This is a Regular Expression_Text'
]
I wrote a method to do that
import string

def remove_punctuations(text):
    punct_str = string.punctuation  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
    for punctuation in punct_str:
        text = text.replace(punctuation, ' ')
    return text.strip()

def replace_single_acronym(text, acronym, fullform):
    words = text.split()
    return_words = []
    for w in words:
        if remove_punctuations(w).lower() == acronym.lower():
            return_words.append(w.replace(acronym, fullform))
        else:
            return_words.append(w)
    return " ".join(return_words)
my_op_list = []
for text in text_list:
    for acronym in acronym_dict.keys():
        text = replace_single_acronym(text, acronym, acronym_dict[acronym])
    my_op_list.append(text)
Ideally output_list and my_op_list should look the same. Instead it prints the result below (failing in 2 instances):
['The status Quotation has changed',
'I SWEAR, This is part of IN_SW',
'The update does not belong to the Software, version, branch',
'This is a RE_Text']
Also, the method replace_single_acronym is very slow on a corpus of 1000 text_list items.
Can someone help me adjust the method so it works correctly and efficiently?
You can use re.sub for this task by passing a function as the second argument, like this:
import re
acronym_dict = {
    'QUO': 'Quotation',
    'IN': 'India',
    'SW': 'Software',
    'RE': 'Regular Expression'
}
text_list = [
    'The status QUO has changed',
    'I SWEAR, This is part of IN_SW',
    'The update does not belong to the SW, version, branch',
    'This is a RE_Text'
]
def get_full_name(m):
    return acronym_dict.get(m.group(1), m.group(1))

def replace_acronyms(text):
    return re.sub(r'(?<![A-Z])([A-Z]+)(?![A-Z])', get_full_name, text)
output_list = [replace_acronyms(i) for i in text_list]
print(output_list)
output:
['The status Quotation has changed', 'I SWEAR, This is part of India_Software', 'The update does not belong to the Software, version, branch', 'This is a Regular Expression_Text']
Explanation: the pattern contains two zero-length assertions and one capturing group. It finds one or more uppercase ASCII letters that are not preceded by an uppercase ASCII letter (negative lookbehind) and not followed by one (negative lookahead). get_full_name is the function passed as the second argument to re.sub, so it receives a single argument, the match object. m.group(1) is the content of the sole capturing group in the pattern, i.e. the acronym. I used the dict's .get method so that if the acronym is present in the dictionary's keys the corresponding value is used, and if it is not, the acronym itself is returned, i.e. nothing changes.
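If the corpus grows, a variant that builds the alternation directly from the dictionary keys is also an option, so only known acronyms are ever considered (a sketch, reusing acronym_dict and text_list from above; the function name is just illustrative):
import re
# Longest keys first is only a precaution in case one key is a prefix of another
keys = sorted(acronym_dict, key=len, reverse=True)
acronym_pattern = re.compile(r'(?<![A-Z])(' + '|'.join(map(re.escape, keys)) + r')(?![A-Z])')
def replace_acronyms_from_keys(text):
    return acronym_pattern.sub(lambda m: acronym_dict[m.group(1)], text)
result = [replace_acronyms_from_keys(t) for t in text_list]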
Related
I need some help with a regex I am writing. I have a list of words that I want to match, along with the words that might come after them (where "words" means [A-Za-z/\s]+, i.e. no parentheses, symbols, or numbers).
words = ['qtr', 'hard', 'quarter']  # keywords that must exist
test = ['id:12345 cli hard/qtr Mix',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work', 'Hello world']
The expected output is
['hard/qtr Mix', 'qtr', 'hard', 'hard work', None]
What I have tried so far
re.search(r'((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))',x,re.I)
The problem with the pattern you have, i.e. '((hard|qtr|quarter)(?:[[A-Za-z/\s]+]))', is the extra pair of square brackets: the inner '[' becomes a literal character inside the character class, and the final ']' then has to be matched literally in the text, so the pattern hardly ever matches. Instead, you can drop the extra brackets and just use a space character inside the class, i.e.
You can join all the words in the words list with | to create the pattern '((qtr|hard|quarter)([a-zA-Z/ ]*))', then search for the pattern in each of the strings in the list. If a match is found, take group 0 and append it to the resulting list; otherwise, append None:
import re

pattern = re.compile('((' + '|'.join(words) + ')([a-zA-Z/ ]*))')
result = []
for x in test:
    groups = pattern.search(x)
    if groups:
        result.append(groups.group(0))
    else:
        result.append(None)
OUTPUT:
result
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
And since you are including the space character in the class, some values may end up with a trailing space; you can simply strip the whitespace off afterwards.
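For example (a one-line sketch over the result list built above):
result = [r.strip() if r is not None else None for r in result]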
Idea extracted from the existing answer and made shorter:
>>> pattern = re.compile('((' + '|'.join(words) + ')([a-zA-Z/ ]*))')
>>> [pattern.search(x).group(0) if pattern.search(x) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
As mentioned in a comment:
But this is quite inefficient, because it searches for the same pattern twice per string, once for pattern.search(x).group(0) and once for if pattern.search(x), and a list comprehension is not the best way to go about such scenarios.
We can try this to overcome that issue:
>>> [v.group(0) if v else None for v in (pattern.search(x) for x in test)]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]
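On Python 3.8+, an assignment expression lets the comprehension search only once as well (a sketch):
>>> [m.group(0) if (m := pattern.search(x)) else None for x in test]
['hard/qtr Mix', 'qtr ', 'hard ', 'hard work', None]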
You can put all the needed words into an alternation ("or" expression) and put your word definition after it:
import re

words = ['qtr', 'hard', 'quarter']
regex = r"(" + "|".join(words) + r")[A-Za-z/\s]+"
p = re.compile(regex)
test = ['id:12345 cli hard/qtr Mix(qtr',
        'id:12345 cli qtr 90%',
        'id:12345 cli hard (red)',
        'id:12345 cli hard work', 'Hello world']
for string in test:
    result = p.search(string)
    if result is not None:
        print(result.group(0))
    else:
        print(result)
Output:
hard/qtr Mix
qtr
hard
hard work
None
I'm currently trying to scrape a website for some information but am running into some issues.
I currently have a bs4.element.Tag element with some html and text in it, and when I do "variable.text", I get the following text:
\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t
What I want is to get rid of all the white space characters (\n and \t) to get the relevant information in a list or any iterable form.
I've tried a bunch of regex commands already, but the one that got me closest to my goal was re.split('[\t\n]', variable.text), which gave me the following:
['',
'',
'Ulmstead Club',
'',
'',
'',
'',
'',
'911 Lynch Dr',
'',
'',
'',
'',
'',
'',
'',
'Arnold, Maryland',
'',
'',
'',
'',
I've cut off a lot of the output to save some space.
I'm super lost and any help would be greatly appreciated
Try splitting on [\t\n]+:
re.split('[\t\n]+', variable.text.strip())
This would seem to work as it would eliminate the empty string entries in the output array.
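For example, on a shortened version of the text from the question (a sketch; the real variable.text behaves the same way):
import re
sample = "\n\nUlmstead Club\n\t\t\t911 Lynch Dr\n\n\t\tArnold, Maryland\t\t 21012\n"
print(re.split('[\t\n]+', sample.strip()))
# ['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland', ' 21012']
# (the stray leading space before 21012 comes from the single space in the source text,
# which is neither a tab nor a newline)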
My guess is that this simple expression might also be helpful:
(?:\\n|\\t)
Demo
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:\\n|\\t)"
test_str = "\\n\\nUlmstead Club\\n\\t\\t\\t\\t\\t911 Lynch Dr\\n\\n\\t\\t\\t\\t\\t\\tArnold, Maryland\\t\\t\\t\\t\\t 21012\\n\\t\\t\\t\\t\\tUnited States\\n(410) 757-9836 \\n\\n Get directions\\n\\n Favorite court \\n\\n\\n\\nTennis Court Details\\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tLocation type:\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tClub\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tMatches played here:\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t0\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t"
subst = ""
# You can manually specify the number of replacements by changing the 4th argument
result = re.sub(regex, subst, test_str, 0, re.MULTILINE)
if result:
print (result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
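Note that the export above targets the escaped test string, where \n and \t appear as literal backslash sequences. On the real variable.text from the question, which contains actual newline and tab characters, the equivalent would be the unescaped pattern (a sketch, keeping the same empty-string substitution):
import re
result = re.sub(r"(?:\n|\t)", "", variable.text)
print(result)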
You could use the string .replace() method to get rid of the \n and \t, without really needing a regular expression to do so (I have replaced each \n and \t with a space as preparation for the next step):
variable.text = variable.text.replace("\n"," ")
variable.text = variable.text.replace("\t"," ")
If you then want to split your data into a list, you can split it on runs of whitespace and use remove() to delete any extra empty strings in the list (note that I am not 100% sure how you want your data separated; I have just split it in the way that seemed most logical to me):
result = re.split(r"[\s]\s+", variable.text)
while '' in result:
    result.remove('')
Here is the full code example:
import re
teststring ="\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t"
teststring = teststring.replace("\n", " ")
teststring = teststring.replace("\t", " ")
# split any fields with more than 1 whitespace between them
result = re.split(r"[\s]\s+", teststring)
# remove any empty string fields of the list
while '' in result:
    result.remove('')
print(result)
Result is:
['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland', '21012', 'United States', '(410) 757-9836', 'Get directions', 'Favorite court', 'Tennis Court Details', 'Location type:', 'Club', 'Matches played here:', '0']
I would run two regex replacements on the string, applying the first and then the second:
Find \s*(?:\r?\n)\s*
Replace \n
https://regex101.com/r/EGTyKB/1
Find [ ]*\t+[ ]*
Replace \t
https://regex101.com/r/XIyi44/1
This clears out all the whitespace cruft and turns it into
a readable block of text.
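In Python the two passes might look like this (a sketch, assuming the scraped text is in a variable called text; printing step2 gives the block below):
import re
step1 = re.sub(r'\s*(?:\r?\n)\s*', '\n', text)   # pass 1: collapse whitespace runs around newlines into one newline
step2 = re.sub(r'[ ]*\t+[ ]*', '\t', step1)      # pass 2: collapse tab runs (and surrounding spaces) into one tab
print(step2)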
Ulmstead Club
911 Lynch Dr
Arnold, Maryland 21012
United States
(410) 757-9836
Get directions
Favorite court
Tennis Court Details
Location type:
Club
Matches played here:
0
I have a custom filter that highlights the keywords that the user had put into the search bar (like on Google search). However, as of now, it only highlights the last word of the keywords. For example, if the keywords are "American film industry", only "industry" will be highlighted. But I want all three words to be highlighted whenever and wherever they are present on the webpage (even if they aren't next to each other). To treat the keywords string as individual words, I have split the keywords:
def highlight(value, search_term, autoescape=True):
    search_term_list = search_term.split()
    search_term_word = ''
    for search_term_word in search_term_list:
        pattern = re.compile(re.escape(search_term_word), re.IGNORECASE)
        new_value = pattern.sub('<span class="highlight">\g<0></span>', value)
    return mark_safe(new_value)
Any idea why the filter only highlights the last word and how to make the code work?
Here is an alternative to the proposed solution by #WiktorStribiżew
import re

def highlight(value, search_term):
    pattern = r'{}'.format(search_term.replace(' ', '|'))
    return re.sub(pattern, r'<span class="highlight">\g<0></span>', value)

highlight('Hello one world', 'one world')
# 'Hello <span class="highlight">one</span> <span class="highlight">world</span>'
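For reference, the reason the filter in the question only highlights the last word is that new_value is recomputed from the untouched value on every loop iteration, so each substitution discards the previous one. A minimal sketch that accumulates into the same variable instead (assuming the usual Django mark_safe import):
import re
from django.utils.safestring import mark_safe
def highlight(value, search_term, autoescape=True):
    for word in search_term.split():
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        # keep substituting into the same string so earlier highlights survive
        value = pattern.sub(r'<span class="highlight">\g<0></span>', value)
    return mark_safe(value)
One caveat: if a search word also appears inside the inserted markup (e.g. "span"), it would get wrapped again.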
Given a string dialogue such as below, I need to find the sentence that corresponds to each user.
text = '''CHRIS: Hello, how are you...
PETER: Great, you? PAM: He is resting.
[PAM SHOWS THE COUCH]
[PETER IS NODDING HIS HEAD]
CHRIS: Are you ok?'''
For the above dialogue, I would like to return tuples with three elements with:
The name of the person
The sentence in lower case and
The sentences within Brackets
Something like this:
('CHRIS', 'Hello, how are you...', None)
('PETER', 'Great, you?', None)
('PAM', 'He is resting', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD')
('CHRIS', 'Are you ok?', None)
etc...
I am trying to use regex to achieve the above. So far I was able to get the names of the users with the below code. I am struggling to identify the sentence between two users.
actors = re.findall(r'\w+(?=\s*:[^/])',text)
You can do this with re.findall:
>>> re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
[('CHRIS', ' Hello, how are you...', ''),
('PETER', ' Great, you? ', ''),
('PAM',
' He is resting.',
'[PAM SHOWS THE COUCH]\n[PETER IS NODDING HIS HEAD]\n'),
('CHRIS', ' Are you ok?', '')]
You will have to figure out how to remove the square braces yourself, that cannot be done with regex while still attempting to match everything.
Regex Breakdown
\b # Word boundary
(\S+) # First capture group, string of characters not having a space
: # Colon
( # Second capture group
[^ # Match anything that is not...
: # a colon
\[\] # or square braces
]+? # Non-greedy match
)
\n? # Optional newline
( # Third capture group
\[ # Literal opening brace
[^:]+? # Similar to above - exclude colon from match
\]
\n? # Optional newlines
)? # Third capture group is optional
(?= # Lookahead for...
\b # a word boundary, followed by
\S+ # one or more non-space chars, and
: # a colon
| # Or,
$ # EOL
)
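As a sketch of the clean-up step (not part of the answer above): strip the braces and newlines from the third field and turn empty matches into None, to get tuples close to what the question asks for:
import re
matches = re.findall(r'\b(\S+):([^:\[\]]+?)\n?(\[[^:]+?\]\n?)?(?=\b\S+:|$)', text)
def tidy(name, speech, stage):
    stage = '. '.join(s.strip('[]') for s in stage.split('\n') if s) or None
    return name, speech.strip(), stage
result = [tidy(*m) for m in matches]
# [('CHRIS', 'Hello, how are you...', None),
#  ('PETER', 'Great, you?', None),
#  ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD'),
#  ('CHRIS', 'Are you ok?', None)]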
Regex is one way to approach this problem, but you can also think about it as iterating through each token in your text and applying some logic to form groups.
For example, we could first find groups of names and text:
from itertools import groupby

def isName(word):
    # Names end with ':'
    return word.endswith(":")

text_split = [
    " ".join(list(g)).rstrip(":")
    for i, g in groupby(text.replace("]", "] ").split(), isName)
]
print(text_split)
#['CHRIS',
# 'Hello, how are you...',
# 'PETER',
# 'Great, you?',
# 'PAM',
# 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]',
# 'CHRIS',
# 'Are you ok?']
Next you can collect pairs of consecutive elements in text_split into tuples:
print([(text_split[i*2], text_split[i*2+1]) for i in range(len(text_split)//2)])
#[('CHRIS', 'Hello, how are you...'),
# ('PETER', 'Great, you?'),
# ('PAM', 'He is resting. [PAM SHOWS THE COUCH] [PETER IS NODDING HIS HEAD]'),
# ('CHRIS', 'Are you ok?')]
We're almost at the desired output. We just need to deal with the text in the square brackets. You can write a simple function for that. (Regular expressions are admittedly an option here, but I'm purposely avoiding them in this answer.)
Here's something quick that I came up with:
def isClosingBracket(word):
    return word.endswith("]")

def processWords(words):
    if "[" not in words:
        return [words, None]
    else:
        return [
            " ".join(g).replace("]", ".")
            for i, g in groupby(map(str.strip, words.split("[")), isClosingBracket)
        ]

print(
    [(text_split[i*2], *processWords(text_split[i*2+1])) for i in range(len(text_split)//2)]
)
#[('CHRIS', 'Hello, how are you...', None),
# ('PETER', 'Great, you?', None),
# ('PAM', 'He is resting.', 'PAM SHOWS THE COUCH. PETER IS NODDING HIS HEAD.'),
# ('CHRIS', 'Are you ok?', None)]
Note that using the * to unpack the result of processWords into the tuple is strictly a python 3 feature.
I wish to let the user ask a simple question, so I can extract a few standard elements from the string entered.
Examples of strings to be entered:
Who is the director of The Dark Knight?
What is the capital of China?
Who is the president of USA?
As you can see, sometimes it is "Who" and sometimes it is "What"; I'm most likely looking for the "|" operator. I'll need to extract two things from these strings: the word after "the" and before "of", as well as the word after "of".
For example:
1st sentence: I wish to extract "director" and place it in a variable called Relation, and extract "The Dark Knight" and place it in a variable called Concept.
Desired output:
RelationVar = "director"
ConceptVar = "The Dark Knight"
2nd sentence: I wish to extract "capital", assign it to variable "Relation".....and extract "China" and place it in variable "Concept".
RelationVar = "capital"
ConceptVar = "China"
Any ideas on how to use the re.match function, or any other method?
You're correct that you want to use | for who/what. The rest of the regex is very simple; the group names are there for clarity, but you could use r"(?:Who|What) is the (.+) of (.+)[?]" instead.
>>> r = r"(?:Who|What) is the (?P<RelationVar>.+) of (?P<ConceptVar>.+)[?]"
>>> l = ['Who is the director of The Dark Knight?', 'What is the capital of China?', 'Who is the president of USA?']
>>> [re.match(r, i).groupdict() for i in l]
[{'RelationVar': 'director', 'ConceptVar': 'The Dark Knight'}, {'RelationVar': 'capital', 'ConceptVar': 'China'}, {'RelationVar': 'president', 'ConceptVar': 'USA'}]
Change (?:Who|What) to (Who|What) if you also want to capture whether the question uses who or what.
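For instance, with the capturing variant the who/what word becomes group 1 (a quick sketch):
>>> r2 = r"(Who|What) is the (?P<RelationVar>.+) of (?P<ConceptVar>.+)[?]"
>>> m = re.match(r2, "Who is the president of USA?")
>>> m.group(1)
'Who'
>>> m.groupdict()
{'RelationVar': 'president', 'ConceptVar': 'USA'}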
Actually extracting the data and assigning it to variables is very simple:
>>> m = re.match(r, "What is the capital of China?")
>>> d = m.groupdict()
>>> relation_var = d["RelationVar"]
>>> concept_var = d["ConceptVar"]
>>> relation_var
'capital'
>>> concept_var
'China'
Here is the script; you can simply use | to match either of the alternatives inside the parentheses.
This worked fine for me.
import re

questions = ['Who is the director of The Dark Knight?',
             'What is the capital of China?',
             'Who is the president of USA?']
for string in questions:
    a = re.compile(r'(What|Who) is the (.+) of (.+)[?]')
    nodes = a.findall(string)
    Relation = nodes[0][1]
    Concept = nodes[0][2]
    print(Relation)
    print(Concept)
    print('----')
Best Regards:)