How do you write a regular expression to match a specific word in a string, when the string has white space added in random places?
I've got a string that was extracted from a PDF document with a table structure. As a consequence of that structure, the extracted string contains randomly inserted newlines and white space. The specific words and phrases I'm looking for are there, with all characters in the correct order, but chopped up randomly by white space. For example: "sta ck over flow".
The content of the PDF document was extracted with PyPDF2, as this is the only option available in my company's Python library.
I know that I can write a specific string match for this with a possible white space after every character, but there must be a better way of searching for it.
Here's an example of what I've been trying to do.
my_string = "find the ans weron sta ck over flow"
# r's\s*t\s*a\s*c\s*k\s*' # etc
my_cleaned_string = re.sub(r's\s*t\s*a\s*c\s*k\s*', '', my_string)
Any suggestions?
Actually, what you're doing is the best way. The only addition I can suggest is to construct such a regexp dynamically from the word:
word = "stack"
regexp = r'\s*'.join(word)
my_string = "find the ans weron sta ck over flow"
my_cleaned_string = re.sub(regexp, '', my_string)
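One caveat, as a hedged aside: if the target word can contain regex metacharacters (say, "c++"), each character should be escaped before joining. A minimal sketch:

import re

word = "c++"
regexp = r'\s*'.join(re.escape(ch) for ch in word)  # escapes metacharacters like '+'
print(re.sub(regexp, '', "see c + + code"))  # -> 'see  code'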
The best you can probably do here is to just strip all whitespace and then search for the target string inside the stripped text:
my_string = "find the ans weron sta ck over flow"
my_string = re.sub(r'\s+', '', my_string)
if 'stack' in my_string:
print("MATCH")
The reason I say "best" is that, in general, you won't know whether a space is an actual word boundary or just randomly inserted whitespace. So you can really do no better than finding your target as a substring of the stripped text. Note that the input text 'rust acknowledge' would now match positive for stack.
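To illustrate that false positive:

import re

stripped = re.sub(r'\s+', '', 'rust acknowledge')  # -> 'rustacknowledge'
print('stack' in stripped)  # True, even though "stack" never appears as a word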
Related
I've made this Python program for printing words from a text, but I got stuck: when Python reaches the next 'tab' index, it returns to the initial one when it checks the conditional, and I don't know why. Can anyone explain to me why it doesn't take the new 'tab' index?
import re

initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
text = re.sub(r'\W+', ' ', initial_text)
t = -1
for i in text:
    n = text.find(i)
    if i == ' ':
        print(text[t+1:n])
        t = n
This is because you are using the find() function, which returns the index of the first occurrence of the character you are searching for; that's why it keeps moving back to the first index.
You can refer to the find() function documentation.
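As a sketch of one possible fix (not the only one): track the current position with enumerate instead of re-searching from the start with find():

import re

initial_text = 'Any sequence of sentences that belong together can be considered a text.'
text = re.sub(r'\W+', ' ', initial_text)

t = -1
for idx, ch in enumerate(text):
    if ch == ' ':
        print(text[t + 1:idx])  # the word between the previous space and this one
        t = idx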
Use this approach
import re

initial_text = "whatever your text is"
text = re.sub(r'[^\w\s]', '', initial_text)
words_list = text.split()
for word in words_list:
    print(word)
Explanation using an example:
import re
initial_text = "Hello : David welcome to Stack ! overflow"
text = re.sub(r'[^\w\s]', '', initial_text)
The piece above removes the punctuation.
words_list = text.split()
words_list after this step will be: ['Hello', 'David', 'welcome', 'to', 'Stack', 'overflow']
for word in words_list:
    print(word)
The code above takes each element from the list and prints it.
Looks like you can use
import re
initial_text = '# Traditionally, a text is understood to be a piece of written or spoken material in its primary form (as opposed to a paraphrase or summary). A text is any stretch of language that can be understood in context. It may be as simple as 1-2 words (such as a stop sign) or as complex as a novel. Any sequence of sentences that belong together can be considered a text.'
words = re.findall(r'[^\W_]+', initial_text)
for word in words:
    print(word)
re.findall extracts all non-overlapping matches from the given text.
[^\W_]+ is a regular expression that matches one or more characters other than non-word characters and underscores; that means it matches substrings that consist of letters and/or digits only (all of them, ASCII and other Unicode).
EXPLANATION

[^\W_]+    any character except non-word characters
           (all but a-z, A-Z, 0-9, _) and '_'
           (1 or more times, matching the most amount possible)
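A quick comparison with \w+ makes the underscore handling concrete:

import re

print(re.findall(r'[^\W_]+', 'foo_bar 123 héllo'))  # ['foo', 'bar', '123', 'héllo']
print(re.findall(r'\w+', 'foo_bar'))                # ['foo_bar'] (\w+ keeps the underscore)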
Using the below code, I imported a few .csv files with sentences like the following into Python:
import pandas as pd

df = pd.concat((pd.read_csv(f) for f in path), ignore_index=True)  # path: an iterable of .csv file paths
Sample sentence:
I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n
While I have no problem removing the newline characters surrounded by spaces, in the middle of words, or at the end of the string, I don't know what to do with the newline characters separating words.
The output I want is as follows:
Goal sentence:
I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.
Is there a way for me to indicate in my code that the newline character is surrounded by two distinct words? Or is this classic garbage in, garbage out?
For now, I've simply been dropping the affected rows:
df = df[~df['Sentence'].str.contains("\n")]
After doing some digging, I came up with two solutions.
1. The textwrap package: Though it seems that the textwrap package is normally used for visual formatting (i.e. telling a UI when to show "..." to signify a long string), it successfully identified the \n patterns I was having issues with. Though it's still necessary to remove extra whitespace of other kinds, this package got me 90% of the way there.
import textwrap
sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
sample_wrap = textwrap.wrap(sample)
print(sample_wrap)
# ['I WANT TO UNDERSTAND WHERE THERE ARE SOME NEW RESTAURANTS.']
2. Function to ID different \n appearance patterns: The 'boil the ocean' solution I came up with before learning about textwrap, and it doesn't work as well. This function finds matches defined as a newline character surrounded by two word (alphanumeric) characters. For all matches, the function searches NLTK's words.words() list for each string surrounding the newline character. If at least one of the two strings is a word in that list, it's considered to be two separate words.
This doesn't take into consideration domain-specific words, which have to be added to the wordlist, or words like "about", which would be incorrectly categorized by this function if the newline character appeared as "ab\nout". I'd recommend textwrap for this reason, but still thought I'd share.
import re
from nltk.corpus import words  # the wordlist mentioned above; requires nltk.download('words')

wordlist = set(words.words())

carriage = re.compile(r'(\n+)')
wordword = re.compile(r'((\w+)\n+(\w+))')

def carriage_return(sentence):
    if carriage.search(sentence):
        if not wordword.search(sentence):
            sentence = re.sub(carriage, '', sentence)
        else:
            matches = re.findall(wordword, sentence)
            for match in matches:
                word1 = match[1].lower()
                word2 = match[2].lower()
                if word1 in wordlist or word2 in wordlist or word1.isdigit() or word2.isdigit():
                    sentence = sentence.replace(match[0], word1 + ' ' + word2)
                else:
                    sentence = sentence.replace(match[0], word1 + word2)
            sentence = re.sub(carriage, '', sentence)
    display(sentence)  # display() is available in Jupyter/IPython; use print() elsewhere
    return sentence
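Hypothetical usage on the sample sentence from the question (the exact output depends on the NLTK wordlist and, as noted above, is not guaranteed to be correct):

sample = 'I WANT TO UNDERSTAND WHERE TH\nERE ARE\nSOME \n NEW RESTAURANTS. \n'
carriage_return(sample)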
I scraped some text from PDFs, and accents/umlauts on characters get scraped after their letter, e.g. "Jos´e" and "Mu¨ller". Because there are just a few of these characters, I would like to fix them, e.g. to "José" and "Müller".
I am trying to adapt the pattern from Regex to match words with hyphens and/or apostrophes:
pattern="(?=\S*[´])([a-zA-Z´]+)"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
m.group() #returns "Jos´e" and "Vald´ez"
m.start() #returns 0 and 6, but I want 3 and 10
In the example above, what pattern can I use to get the position of the '´' character? Then I can check the subsequent letter and replace the text accordingly.
My texts are scraped from scientific papers and could contain those characters elsewhere, for example in code. That is why I am using regex instead of .replace or text normalization with e.g. unicodedata: I want to make sure I am replacing "words" (more precisely, the authors' first and last names).
EDIT: I can relax these conditions and simply replace those characters everywhere because, if they appear in non-words such as "F=m⋅x¨", I will discard non-words anyway. Therefore, I can use a simple replace approach
I suggest using
import re

d = {'´e': 'é', 'u¨': 'ü'}
pattern = "|".join(d)  # '´e|u¨'
print(re.sub(pattern, lambda m: d[m.group()], "Jos´e Vald´ez"))
# => José Valdéz
If you need to make sure there are word boundaries, you may consider using
pattern = r"\b´e|u¨\b"
The \b before ´ and after u will make sure there are other word chars before/after them.
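A quick check of the word-boundary variant, reusing the dictionary d from above:

print(re.sub(r"\b´e|u¨\b", lambda m: d[m.group()], "Jos´e Mu¨ller"))
# => José Müller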
A quick fix on the pattern returns the indexes you are looking for. Instead of matching the whole word, the group will catch only the accent characters.
import re
pattern = "(?=\S*[´])[a-zA-Z]+([´]+)[a-zA-Z]+"
ms = re.finditer(pattern, "Jos´e Vald´ez")
for m in ms:
print(m.group()) # returns "Jos´e" and "Vald´ez"
print(m.start(1)) # returns 3 and 10
Whilst searching for a text classification method, I came across this Python code which was used in the pre-processing step
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|#,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    """
    text: a string
    return: modified initial string
    """
    text = text.lower()  # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)  # replace REPLACE_BY_SPACE_RE symbols with a space
    text = BAD_SYMBOLS_RE.sub('', text)  # remove symbols which are in BAD_SYMBOLS_RE from text
    text = text.replace('x', '')
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)  # remove stopwords from text
    return text
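A hypothetical run (assuming NLTK's English stopwords are downloaded) that also previews the problem with text.replace('x', '') discussed below:

print(clean_text("Text about X-rays, #42!"))
# => 'tet rays 42'  (punctuation is stripped, 'about' is dropped as a stopword, and 'text' loses its x)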
I then tested this section of code to understand the syntax and its purpose
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
text = '[0a;m]'
BAD_SYMBOLS_RE.sub(' ', text)
# returns ' 0a m ' whilst I thought it would return ' ; '
Question: why didn't the code replace 0, a, and m although 0-9a-z was specified inside the [ ]? Why did it replace ; although that character wasn't specified?
Edit to avoid being marked as a duplicate:
My perceptions of the code are:
The line BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]') is confusing. Including the characters #, +, and _ inside the [ ] made me think the line was trying to remove the characters in the list (because no word in an English dictionary contains those bad characters #+_, I believe?). Consequently, it made me interpret the ^ as the start of a string (instead of negation); hence the original post (which was kindly answered by Tim Pietzcker and Raymond Hettinger). I think the two lines REPLACE_BY_SPACE_RE and BAD_SYMBOLS_RE should have been combined into one, such as
REMOVE_PUNCT = re.compile('[^0-9a-z]')
text = REMOVE_PUNCT.sub('', text)
I also think the code text = text.replace('x', '') (which was meant to remove the IDs that were masked as XXX-XXXX.... in the raw data) will lead to bad outcomes; for example, the word next will become net.
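To make that concern concrete, and to sketch one hypothetical safer variant that removes only standalone runs of x:

import re

print('next'.replace('x', ''))               # 'net': the x inside a real word is lost
print(re.sub(r'\bx+\b', '', 'next xxx id'))  # 'next  id': only the masked run is removed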
Additional questions:
Are my perceptions reasonable?
Should numbers/digits be removed from text?
Could you please recommend an overall/general strategy/code for text pre-processing for (English) text classification?
Here's some documentation about character classes.
Basically, [abc] means "any one of a, b, or c" whereas [^abc] means "any character that is not a, b, or c".
So your regex operation removes every non-digit, non-letter character except space, #, + and _ from the string, which explains the result you're getting.
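Reproducing the observed result, as a check:

import re

BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
print(repr(BAD_SYMBOLS_RE.sub(' ', '[0a;m]')))  # ' 0a m ': only '[', ';', ']' are replaced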
General rules
The square brackets match any one single character from the listed set.
Roughly [xyz] is a short-cut for (x|y|z) but without creating a group.
Likewise [a-z] is a short-cut for (a|b|c|...|y|z).
The interpretation of character sets can be a little tricky. The start and end points get converted to their ordinal positions and the range of matching characters is inferred from there. For example [A-z] converts A to 65 and z to 122, so everything from 65 to 122 is included. That means that it also matches characters like ^ which convert to 94. It also means that characters like ö won't match because that converts to 246 which is outside the range.
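For instance:

import re

print(re.findall('[A-z]', 'a^ö'))  # ['a', '^']: '^' is code point 94, inside 65..122; 'ö' (246) is outside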
Another interesting form of character class uses the ^ to invert the selection. For example, [^a-z] means "any character not in the range from a to z".
The full details are in the "character sets" section of the re docs.
Specific Problem
In the OP's example, BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]'), the caret ^ at the beginning inverts the range so that the listed symbols are excluded from the search.
That is why the code didn't replace 0, a, and m although 0-9a-z was specified inside the [ ]. Essentially, it treated the specified characters as good characters.
Hope this helps :-)
The problem is that I now have a string where some words are stuck together:
"fooledDog..." needs to become "fooled D..." (the string text continues, with an inserted " ")
"whateveredJ..." needs to become "whatevered J..." (the string text continues, with an inserted " ")
string = string.replace("edD","ed D")
string = string.replace("edJ","ed J")
but instead of "D" and "J" I need to match any possible character, to avoid hard-coding values, so that the code works with any letter or number in this position.
This is a pretty easy problem to solve with regular expressions (which is not always true, even when regex is the best tool for the job). Try this:
import re
text = "fooledDog whateveredJob"
fixed_text = re.sub(r'ed([A-Z])', r'ed \1', text)
print(fixed_text) # prints "fooled Dog whatevered Job"
The pattern looks for the letters 'ed' in lowercase, followed by any capital letter (which gets captured). The replacement is 'ed' and a space, followed by the capital letter from the capturing group.
I don't fully understand your question, but it seems you have some camelCase words you want to separate. If that's the case, try this:
import re

name = 'CamelCaseTest123'
splitted = re.sub(r'(?!^)([A-Z][a-z]+)', r' \1', name).split()
print(splitted)
Output:
['Camel', 'Case', 'Test123']
Here (?!^) prevents a space from being inserted at the very start, and ([A-Z][a-z]+) captures each capitalized word so the replacement r' \1' can prepend a space to it.