Format string using regex to remove non space whitespace characters - python

I'm currently trying to scrape a website for some information but am running into some issues.
I currently have a bs4.element.Tag element with some html and text in it, and when I do "variable.text", I get the following text:
\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t
What I want is to get rid of all the whitespace characters (\n and \t) and get the relevant information in a list or any other iterable form.
I've tried a bunch of regex commands already, but the one that got me closest to my goal was re.split('[\t\n]', variable.text), which gave me the following:
['',
'',
'Ulmstead Club',
'',
'',
'',
'',
'',
'911 Lynch Dr',
'',
'',
'',
'',
'',
'',
'',
'Arnold, Maryland',
'',
'',
'',
'',
I've cut off a lot of the output to save some space.
I'm super lost and any help would be greatly appreciated

Try splitting on [\t\n]+:
re.split('[\t\n]+', variable.text.strip())
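This works because the + makes each run of consecutive tabs and newlines count as a single delimiter, which eliminates the empty strings from the output list; strip() keeps the list from starting or ending with an empty string.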
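A minimal sketch of that suggestion, using a shortened copy of the text from the question (the variable names are only for illustration). Splitting on tabs and newlines alone can leave stray spaces on some pieces, so the sketch also strips each piece:

```python
import re

# Shortened copy of the scraped text from the question.
text = ("\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland"
        "\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n")

# One or more tabs/newlines act as a single delimiter.
parts = re.split(r"[\t\n]+", text.strip())

# Some pieces still carry a leading or trailing space, e.g. ' 21012'.
fields = [p.strip() for p in parts if p.strip()]
print(fields)
```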

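My guess is that this simple expression might also be helpful. Note that it matches the literal two-character sequences \n and \t (a backslash followed by a letter), as in the test string below, not real newline or tab characters: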
(?:\\n|\\t)
Test
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"(?:\\n|\\t)"
test_str = "\\n\\nUlmstead Club\\n\\t\\t\\t\\t\\t911 Lynch Dr\\n\\n\\t\\t\\t\\t\\t\\tArnold, Maryland\\t\\t\\t\\t\\t 21012\\n\\t\\t\\t\\t\\tUnited States\\n(410) 757-9836 \\n\\n Get directions\\n\\n Favorite court \\n\\n\\n\\nTennis Court Details\\n\\n\\n\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tLocation type:\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tClub\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\tMatches played here:\\t\\t\\t\\t\\t\\t\\t\\t\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t0\\t\\t\\t\\t\\t\\t\\t\\t\\t\\n\\n\\n\\n\\t\\t\\t\\t\\t\\t\\t\\t\\t\\t"
subst = ""
# You can limit the number of replacements with the count argument (0 replaces all matches)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
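The difference between escaped sequences and real control characters is easy to miss, so here is a small sketch of it. The sample strings are shortened, hypothetical versions of the text in the question; the second pattern is an alternative for real control characters, not part of the answer above:

```python
import re

# Text with real newline and tab characters, as .text would return it.
real = "\n\nUlmstead Club\n\t\t911 Lynch Dr\n"
# The same text with backslash + letter pairs, as it looks when shown with repr().
literal = "\\n\\nUlmstead Club\\n\\t\\t911 Lynch Dr\\n"

escaped_pattern = r"(?:\\n|\\t)"  # a backslash followed by n or t
control_pattern = r"[\n\t]+"       # actual newline/tab characters

literal_cleaned = re.sub(escaped_pattern, "", literal)
real_unchanged = re.sub(escaped_pattern, "", real)
real_cleaned = re.sub(control_pattern, " ", real).strip()

print(repr(literal_cleaned))
print(repr(real_unchanged))
print(repr(real_cleaned))
```

If the text comes straight from BeautifulSoup, it most likely contains real control characters, in which case the second pattern is the one that applies.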

You could use the str.replace() method to get rid of the \n and \t; a regular expression is not really needed for that (I have replaced each \n and \t with 2 spaces for the next step):
text = variable.text.replace("\n", "  ")  # two spaces
text = text.replace("\t", "  ")  # two spaces
If you then want to split your data into a list, you could split it on runs of whitespace and use remove() to delete any leftover empty strings in the list (note that I am not 100% sure how you want your data separated; I have just made the solution that fit my logic of how it should be split):
result = re.split(r"\s\s+", text)
while '' in result:
    result.remove('')
Here is the full code example:
import re
teststring ="\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n Favorite court \n\n\n\nTennis Court Details\n\n\n\n\n\n\n\t\t\t\t\t\t\t\t\t\tLocation type:\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\tClub\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\tMatches played here:\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t0\t\t\t\t\t\t\t\t\t\n\n\n\n\t\t\t\t\t\t\t\t\t\t"
teststring = teststring.replace("\n", "  ")  # two spaces
teststring = teststring.replace("\t", "  ")  # two spaces
# split wherever two or more whitespace characters separate the fields
result = re.split(r"\s\s+", teststring)
# remove any empty string fields from the list
while '' in result:
    result.remove('')
print(result)
Result is:
['Ulmstead Club', '911 Lynch Dr', 'Arnold, Maryland', '21012', 'United States', '(410) 757-9836', 'Get directions', 'Favorite court', 'Tennis Court Details', 'Location type:', 'Club', 'Matches played here:', '0']

I would run two regex replacements on the string, in this order:
1. Find \s*(?:\r?\n)\s*
   Replace \n
   https://regex101.com/r/EGTyKB/1
2. Find [ ]*\t+[ ]*
   Replace \t
   https://regex101.com/r/XIyi44/1
This clears out all the whitespace cruft and turns it into a readable block of text:
Ulmstead Club
911 Lynch Dr
Arnold, Maryland 21012
United States
(410) 757-9836
Get directions
Favorite court
Tennis Court Details
Location type:
Club
Matches played here:
0
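In Python, the two passes above could be sketched roughly as follows. This assumes re.sub is used for both replacements and runs on a shortened copy of the text from the question; the variable names are made up for the example:

```python
import re

raw = ("\n\nUlmstead Club\n\t\t\t\t\t911 Lynch Dr\n\n\t\t\t\t\t\tArnold, Maryland"
       "\t\t\t\t\t 21012\n\t\t\t\t\tUnited States\n(410) 757-9836 \n\n Get directions\n\n")

# Pass 1: any whitespace run that contains a line break becomes one newline.
step1 = re.sub(r"\s*(?:\r?\n)\s*", "\n", raw)
# Pass 2: remaining runs of tabs (with surrounding spaces) become one tab.
step2 = re.sub(r"[ ]*\t+[ ]*", "\t", step1)

# One entry per line; the city and the zip code stay on one line, separated by a tab.
lines = step2.strip().split("\n")
print(lines)
```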

Related

Extract text with multiple regex patterns in Python

I have a list with address information
The placement of words in the list can be random.
address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']
I want to extract each item of a list in a new string.
r = re.compile(".region")
region = list(filter(r.match, address))
It works, but there is more than one pattern for "region". For example, there can be "South reg." or "South r-n".
How can I combine multiple patterns?
Also, the digit 4 in the list means the building number. It can be only digits, or something like 4k1.
How can I extract the building number?
Hopefully I understood the requirement correctly.
For extracting the region, I chose to get it by the first word, but if you can be sure of the regions which are accepted, it would be better to construct the regex based on the valid values, not first word.
Also, for the building extraction, I am not sure of which are the characters you want to keep, versus the ones which you may want to remove. In this case I chose to keep only alphanumeric, meaning that everything else would be stripped.
CODE
import re
list1 = [' South region', ' district KTS', ' -4k-1.', ' app. 106', ' ent. 1', ' st. 15']
def GetFirstWord(list2, column):
    return re.search(r'\w+', list2[column].strip()).group()
def KeepAlpha(list2, column):
    return re.sub(r'[^A-Za-z0-9 ]+', '', list2[column].strip())
print(GetFirstWord(list1,0))
print(KeepAlpha(list1,2))
OUTPUT
South
4k1
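If the items can appear in any position, another option is to search every item rather than index by column. The sketch below combines the region spellings from the question with alternation (|). It assumes a building number is made of digits, optionally followed by one letter and more digits; both patterns are illustrative guesses and would need adjusting to the real data:

```python
import re

address = [' South region', ' district KTS', ' 4', ' app. 106', ' ent. 1', ' st. 15']

# Hypothetical list of region spellings, combined with alternation.
region_re = re.compile(r"\b(?:region|reg\.|r-n)(?!\w)")
# Assumed building format: digits, optionally a letter plus more digits (e.g. 4k1).
building_re = re.compile(r"\d+(?:[A-Za-z]\d+)?")

region = [item.strip() for item in address if region_re.search(item)]
# fullmatch() requires the whole item to be a building number,
# so 'app. 106' or 'st. 15' are not picked up.
building = [item.strip() for item in address if building_re.fullmatch(item.strip())]

print(region, building)
```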

replace punctuation with space in text

I have a text like this Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office.
I removed punctuations from this text by the following code.
import string
string.punctuation
def remove_punctuation(text):
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree
#storing the punctuation free text
df_Train['BULLET_POINTS']= df_Train['BULLET_POINTS'].apply(lambda x:remove_punctuation(x))
df_Train.head()
here in the above code df_Train is a pandas dataframe in which "BULLET_POINTS" column contains the kind of text data mentioned above.
The result I got is Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office
Notice how the two words Eksioglu and Handsome are joined because there is no space after the comma. I need a way to overcome this issue.
In this case, it makes sense to replace each run of special chars with a single space and then strip the result:
df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
Or, if you have chunks of punctuation + whitespace to handle:
df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()
Output:
>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0 Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
Name: BULLET_POINTS, dtype: object
The (?:[^\w\s]|_)+ regex matches one or more occurrences of any char that is neither a word char nor whitespace, or of an underscore (i.e. one or more punctuation/symbol chars), and replaces each such run with a space.
The [\W_]+ pattern is similar but also matches whitespace, so a chunk such as a comma followed by a space is replaced with a single space.
The .str.strip() part is necessary as the replacement might result in leading/trailing spaces.
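To see the difference between the two patterns without a DataFrame, here is a small sketch that applies them with plain re.sub to the sentence from the question (re.sub stands in for the pandas .str.replace call here):

```python
import re

text = ("Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,"
        "Handsome cello wrapped hard magnet, Ideal for home or office.")

# Punctuation and underscores become a space; existing whitespace is left alone,
# so a comma followed by a space ends up as two spaces.
punct_only = re.sub(r"(?:[^\w\s]|_)+", " ", text).strip()

# Punctuation, underscores and whitespace are matched together,
# so every such run collapses to exactly one space.
punct_and_space = re.sub(r"[\W_]+", " ", text).strip()

print(punct_only)
print(punct_and_space)
```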

How do I replace double quote?

I have a list of strings and I would like to replace the " at the last two strings
"racist superman"|"rudy"|"mancuso"|"king"|"bach"|"racist"|"superman"|"love"|"rudy mancuso poo bear black white official music video"|"iphone x by pineapple"|"lelepons"|"hannahstocking"|"rudymancuso"|"inanna"|"anwar"|"sarkis"|"shots"|"shotsstudios"|"alesso"|"anitta"|"brazil"|"Getting My Driver's License | Lele Pons"
My code looks like this, it does however replace the "" from the other strings and removes the "|".
Note: the input tags_str for the function is received by a file
def extract_tags(tags_str):
    b = [n.strip('""').strip().replace('""', '') for n in tags_str.split("|")]
    return b
['racist superman', 'rudy', 'mancuso', 'king', 'bach', 'racist', 'superman', 'love', 'rudy mancuso poo bear black white official music video', 'iphone x by pineapple', 'lelepons', 'hannahstocking', 'rudymancuso', 'inanna', 'anwar', 'sarkis', 'shots', 'shotsstudios', 'alesso', 'anitta', 'brazil', "Getting My Driver's License", 'Lele Pons']
As you can see, the first strip() gets rid of the double quotes and the second strip() gets rid of whitespace. However, "Getting My Driver's License" still has double quotes. With replace('""', '') I expected the double quotes to be removed, but that's not the case.
The preferred output is:
['racist superman', 'rudy', 'mancuso', 'king', 'bach', 'racist', 'superman', 'love', 'rudy mancuso poo bear black white official music video', 'iphone x by pineapple', 'lelepons', 'hannahstocking', 'rudymancuso', 'inanna', 'anwar', 'sarkis', 'shots', 'shotsstudios', 'alesso', 'anitta', 'brazil', 'Getting My Driver's License', 'Lele Pons']
Edit:
Thanks for the answers/comments. It got fixed by b = [n.strip('""').strip().replace("'", '') for n in tags_str.split("|")], since it was a single quote instead of a double one.
Instead of using the single quote (apostrophe ( ' )) you could use the " ´ " instead.
This would avoid the whole issue.
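For context, a short sketch of what seems to be going on, using a shortened tag string. The double quotes around "Getting My Driver's License" in the printed list are not part of the string: Python shows a string in double quotes when it contains an apostrophe. Note also that strip() takes a set of characters, so strip('""') behaves the same as strip('"'):

```python
tags_str = '"racist superman"|"rudy"|"Getting My Driver\'s License | Lele Pons"'

def extract_tags(tags_str):
    # Split on the pipe, then trim double quotes and surrounding spaces.
    return [part.strip('"').strip() for part in tags_str.split("|")]

tags = extract_tags(tags_str)
print(tags)           # the list display quotes the third tag with "..."
print(tags[2])        # the string itself contains no double quotes
print(repr(tags[2]))  # repr() picks double quotes because of the apostrophe

# Removing the apostrophe, as in the question's edit, changes the text itself.
no_apostrophe = tags[2].replace("'", "")
```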

Tokenization which works with terms that contain whitespace in Python?

My standard approach to tokenize a text using a regex in Python is this:
> text = "Los Angeles is in California"
> tokens = re.findall(r'\w+', text)
> tokens
['Los','Angeles','is','in','California']
A problem arises if I want to find the name Los Angeles in the above text.
What is the best way to find a needle which contains whitespace in a haystack?
I am asking a general question, because the solution should also work for a case like United States of America and for needles which don't contain whitespace.
For example, a simple if "Los Angeles" in text would not do, because if "for" in text would also return a match (inside California). I am looking for full words only: for should match the word for, but not California.
I suggest using a text-processing library like NLTK for such tasks.
But for this case you can use the following regex:
>>> re.findall(r'\b([A-Z]\w+ [A-Z]\w+)|(\w+)\b',text)
[('Los Angeles', ''), ('', 'is'), ('', 'in'), ('', 'California')]
The regex r'([A-Z]\w+ [A-Z]\w+)|(\w+)' has two groups: the first matches a pair of consecutive capitalized words, and the second matches a single word.
The solution turned out to be simple:
re.search(r'\b'+needle+r'\b', haystack)
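A small sketch of that word-boundary search wrapped in a helper. The function name contains_phrase is made up for the example, and the call to re.escape is an addition that is not in the answer above; it guards against needles that contain regex metacharacters:

```python
import re

def contains_phrase(needle, haystack):
    # \b on both sides makes sure the needle starts and ends at a word boundary.
    # re.escape (an addition here) treats the needle as literal text.
    pattern = r"\b" + re.escape(needle) + r"\b"
    return re.search(pattern, haystack) is not None

text = "Los Angeles is in California"

print(contains_phrase("Los Angeles", text))  # multi-word needle
print(contains_phrase("for", text))          # only occurs inside "California"
print(contains_phrase("in", text))           # single-word needle
```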

Python regex findall

I am trying to extract all occurrences of tagged words from a string using regex in Python 2.7.2. Put simply, I want to extract every piece of text inside the [P][/P] tags.
Here is my attempt:
regex = ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
Printing person produces ['President [P]', '[/P]', '[P] Bill Gates [/P]']
What is the correct regex to get: ['[P] Barack Obama [/P]', '[P] Bill Gates [/P]']
or ['Barack Obama', 'Bill Gates'].
import re
regex = ur"\[P\] (.+?) \[/P\]+?"
line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
person = re.findall(regex, line)
print(person)
yields
['Barack Obama', 'Bill Gates']
The regex ur"[\u005B1P\u005D.+?\u005B\u002FP\u005D]+?" is exactly the same
unicode as u'[[1P].+?[/P]]+?' except harder to read.
The first bracketed group [[1P] tells re that any of the characters in the list ['[', '1', 'P'] should match, and similarly with the second bracketed group [/P]]. That's not what you want at all. So,
Remove the outer enclosing square brackets. (Also remove the stray 1 in front of P.)
To protect the literal brackets in [P], escape the brackets with a backslash: \[P\].
To return only the words inside the tags, place grouping parentheses around .+?.
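The steps above can be sketched in Python 3 as follows. The ur"" prefix is only valid in Python 2, so a plain raw string is used here; the \s* around the group is a small addition that keeps the spaces next to the tags out of the result:

```python
import re

line = ("President [P] Barack Obama [/P] met Microsoft founder "
        "[P] Bill Gates [/P], yesterday.")

# Escaped brackets match the literal tags; the group captures the text between them.
inner = re.findall(r"\[P\]\s*(.+?)\s*\[/P\]", line)

# Without a capturing group, findall returns the whole match, tags included.
tagged = re.findall(r"\[P\].+?\[/P\]", line)

print(inner)
print(tagged)
```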
Try this (subject is the string to search):
for match in re.finditer(r"\[P[^\]]*\](.*?)\[/P\]", subject):
    # match start: match.start()
    # match end (exclusive): match.end()
    # matched text: match.group()
    print(match.group(1))  # text between the tags
Your question is not 100% clear, but I'm assuming you want to find every piece of text inside [P][/P] tags:
>>> import re
>>> line = "President [P] Barack Obama [/P] met Microsoft founder [P] Bill Gates [/P], yesterday."
>>> re.findall(r'\[P\]\s?(.+?)\s?\[\/P\]', line)
['Barack Obama', 'Bill Gates']
you can replace your pattern with
regex = ur"\[P\]([\w\s]+)\[\/P\]"
Use this pattern:
pattern = r'\[P\].+?\[\/P\]'
