I have text like this: Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office.
I removed the punctuation from this text with the following code.
import string
string.punctuation
def remove_punctuation(text):
    # keep only the characters that are not in string.punctuation
    punctuationfree = "".join([i for i in text if i not in string.punctuation])
    return punctuationfree
# storing the punctuation-free text
df_Train['BULLET_POINTS'] = df_Train['BULLET_POINTS'].apply(lambda x: remove_punctuation(x))
df_Train.head()
In the code above, df_Train is a pandas DataFrame whose "BULLET_POINTS" column contains the kind of text shown above.
The result I got is Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office
Notice how the two words Eksioglu and Handsome are joined because there is no space after the comma. I need a way to overcome this issue.
In this case, it makes sense to replace each run of special chars with a single space and then strip the result:
df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
Or, if you have chunks of punctuation + whitespace to handle:
df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()
Output:
>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0 Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
Name: BULLET_POINTS, dtype: object
The (?:[^\w\s]|_)+ regex matches one or more chars that are either underscores or neither word nor whitespace chars (i.e. one or more punctuation or symbol chars), and the whole run is replaced with a single space.
The [\W_]+ pattern is similar but includes whitespace.
The .str.strip() part is necessary as the replacement might result in leading/trailing spaces.
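As a quick check outside pandas, the two patterns can be compared on the sample string with plain re.sub. This is only a sketch: the variable names are made up for illustration, and the text is the sample from the question.

```python
import re

# Sample text from the question (note the missing space after the first comma).
text = ("Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,"
        "Handsome cello wrapped hard magnet, Ideal for home or office.")

# Pattern 1: runs of punctuation/underscores only; existing whitespace is untouched.
punct_only = re.sub(r'(?:[^\w\s]|_)+', ' ', text).strip()

# Pattern 2: runs of punctuation, underscores and whitespace together.
punct_and_space = re.sub(r'[\W_]+', ' ', text).strip()

print(punct_only)
print(punct_and_space)
```

With the first pattern, a comma that is already followed by a space leaves two spaces behind ("magnet  Ideal"); the second pattern collapses the comma and the space into one.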
I am trying to find words and print them using the code below. Everything works, but the only issue is that I am unable to print the last value (which is a number).
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid ']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(i), textfile)
    print(y)
The text I am working with:
textfile = """1, REBECCA M. ROTH , COLLECTOR OF TAXES of the taxing district of the
township of MORRIS for Six Hundred Sixty Seven dollars andFifty Two cents, the land
in said taxing district described as Block No. 10303 Lot No. 10 :
and known as 239 E HANOVER AVE , on the tax Taxes For: 2012
Sewer
Assessments For Improvements
Total Cost of Sale 35.00
Total
Premium (if any) Paid 1,400.00 """
I would like to know where I am making a mistake.
Any suggestion is appreciated.
A couple of issues:
As others have mentioned, you need to escape special characters like parentheses ( ) and dots (.). The simplest way is to use re.escape.
Another issue is the trailing space in 'Premium (if any) Paid ': your regex '{} ([^ ]*)' already has a space after {}, so the pattern ends up requiring two spaces instead of one.
You should instead change your code to the following:
See working code here
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    y = re.findall('{} ([^ ]*)'.format(re.escape(i)), textfile)
    print(y)
Two problems:
Your current 'Premium (if any) Paid ' string ends on a space, and '{} ([^ ]*)' also has a space after {}, which adds them together. Delete the trailing space in 'Premium (if any) Paid '.
You need to escape the parentheses, so if you want to keep your regular expression unchanged, the string in the list should be ['Premium \(if any\) Paid']. You can also use re.escape instead.
For your particular cases, this seems to be an optimal solution:
words = ['Town of','Block No.','Lot No.','Premium (if any) Paid']
import re
for i in words:
    # raw string avoids invalid-escape warnings; textfile is the variable from the question
    y = re.findall(r'{}\s+(\S*)'.format(re.escape(i)), textfile, re.I)
    print(y)
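To see why the escaping matters, here is a small sketch on a shortened sample of the text. The helper name value_after and the sample string are made up for illustration; only re.escape, re.search and the pattern shape come from the answers above.

```python
import re

# Shortened, single-line sample based on the text in the question.
sample = "Block No. 10303 Lot No. 10 : Premium (if any) Paid 1,400.00"
labels = ['Block No.', 'Lot No.', 'Premium (if any) Paid']

def value_after(label, text):
    """Return the first whitespace-delimited token after a literal label (hypothetical helper)."""
    pattern = r'{}\s+(\S+)'.format(re.escape(label))
    match = re.search(pattern, text, re.I)
    return match.group(1) if match else None

results = {label: value_after(label, sample) for label in labels}
print(results)

# Without re.escape, "(if any)" is read as a capture group, not as literal parentheses,
# so the pattern looks for "Premium if any Paid" and finds nothing.
unescaped = re.search(r'Premium (if any) Paid\s+(\S+)', sample)
print(unescaped)
```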
How can I remove non-English words (words that contain both letters and numbers) from text in a DataFrame column?
Ex
df['text']
'the interiors nrd studio | happy mothers day ”there is no influence so powerful as that of the mother.” —sara josepha hale... happy mother’s day mom & to all the mothers around the world! lots of light natasha
0wet3bxtfl'
'but still missing you every day happy mothers day francis mcclafferty (mccool) 9wlhju7cxf'
From the above 2 rows I need to remove the words '0wet3bxtfl' and '9wlhju7cxf'.
The example retains some strings that would not be found in a list of English words ("nrd", "mcclafferty", "mccool") while removing "0wet3bxtfl" and "9wlhju7cxf". So the expected result is probably best achieved by removing any non-whitespace sequence that contains either a letter followed by a digit or a digit followed by a letter (together with any spaces that follow), without regard to whether the words are "English" or not.
The following would do this:
import re
...
# re.sub works on a single string, so apply it to each value of the column
filtered = df['text'].apply(lambda s: re.sub(r'[^\s]*(\d[a-zA-Z]|[a-zA-Z]\d)[^\s]* *', '', s))
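A minimal sketch of the same idea on plain strings, so it can be tried without a DataFrame. The function name drop_mixed_tokens and the second sample string are assumptions made for illustration; the pattern is the one described above, written with \S and a non-capturing group.

```python
import re

# Any run of non-whitespace that contains a digit next to a letter (in either order),
# plus the spaces that follow it.
MIXED_TOKEN = re.compile(r'\S*(?:\d[a-zA-Z]|[a-zA-Z]\d)\S* *')

def drop_mixed_tokens(text):
    """Remove tokens that mix letters and digits (hypothetical helper name)."""
    return MIXED_TOKEN.sub('', text).strip()

row = ("but still missing you every day happy mothers day "
       "francis mcclafferty (mccool) 9wlhju7cxf")
print(drop_mixed_tokens(row))

# Made-up example with the token in the middle of the text.
print(drop_mixed_tokens("nrd studio 0wet3bxtfl happy day"))
```

Note that this rule also removes ordinary tokens such as "mp3" or "2nd", since they contain a letter next to a digit.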
Here is an example string:
text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
I am trying to extract the words "patties", "burgers", "fingers", and "meat" from this text. I want the word that comes after each chicken, up to the closing quotation mark.
I have gotten stumped on how to even extract a single one. I can split after "chicken ' but then how can I select the text up until the next ' ?
I would like to iterate through a list to save the variables to an array. Thanks for any help you can provide.
You can use regular expressions:
import re
text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
match = re.findall(r'chicken \'(\S+)\'', text)
print (match)
Outputs:
['patties', 'burgers', 'fingers', 'meat']
This is a good use-case for regex.
import re
print(re.findall(r"chicken '(.*?)'", text))
Here's an explanation of the regex: https://regex101.com/r/8IdseD/1
Here's the python code running: https://repl.it/repls/SquareQuerulousModes
The regex, part by part:
chicken ' - matches that literal text
( - starts a capture group - the part that re.findall will spit out.
. - matches any character...
*? - ...any number of times, but as few as possible (this is to ensure we don't capture the final ')
) - end the capture group
' - match a literal '.
So re.findall will give you a list of all the substrings that are captured in the group.
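The effect of the lazy *? is easiest to see next to its greedy counterpart. This is a small sketch using the sample text from the question; the variable names lazy and greedy are just for illustration.

```python
import re

text = ("hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' "
        "and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too.")

# Lazy: stop at the first closing quote after each "chicken '".
lazy = re.findall(r"chicken '(.*?)'", text)

# Greedy: run to the last quote in the text, swallowing everything in between.
greedy = re.findall(r"chicken '(.*)'", text)

print(lazy)
print(greedy)
```

The greedy version returns a single long string that starts at patties and ends at meat, which is why the lazy quantifier is needed here.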
You can use zero-width lookarounds to match the surroundings:
(?<=chicken\s')[^']+(?=')
(?<=chicken\s') is zero-width positive lookbehind that matches chicken '
[^']+ matches the portion up to the next single quote, i.e. the desired substring
(?=') is zero-width positive lookahead that matches ' after the desired substring
Example:
In [713]: text = "hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too."
In [714]: re.findall(r"(?<=chicken\s')[^']+(?=')", text)
Out[714]: ['patties', 'burgers', 'fingers', 'meat']
Select just the portion of the sentence from the first occurrence of "chicken":
chicken_text = text[text.find("chicken"):]
Split that text on spaces:
chicken_words = chicken_text.split(" ")
Scan the list for words that begin and end with a single quote:
for word in chicken_words:
    # startswith/endswith are safe even if a word is an empty string
    if len(word) > 1 and word.startswith("'") and word.endswith("'"):
        print(word[1:-1])
This won't work if the single-quoted words themselves contain spaces, but that isn't the case in the sample text you gave.
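The question also asks how to continue after splitting on "chicken '". One possible sketch, using only str.split: split on that marker, then cut each remaining piece at its first single quote. The variable names are made up for illustration.

```python
text = ("hello, i like to eat beef 'sandwiches' and beef 'jerky' and chicken 'patties' "
        "and chicken 'burgers' and also chicken 'fingers' and other chicken 'meat' too.")

# Everything before the first "chicken '" is dropped by the [1:] slice.
pieces = text.split("chicken '")[1:]

# Each piece starts with the wanted word; keep the part before the next quote.
chicken_words = [piece.split("'", 1)[0] for piece in pieces]
print(chicken_words)
```

Unlike splitting on spaces, this keeps quoted phrases that contain spaces, because the cut is made at the closing quote.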
I am trying to crop a portion of a list of strings and print them. The data looks like the following -
Books are on the table\nPick them up
Pens are in the bag\nBring them
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found\nSearch them
Laptops, headphones are lost\nSearch for them
(This is just few lines from 100 lines of data in the file)
I have to crop the string before the \n in line 1,2,5,6 and print them. I have to also print line 3,4 along with them. Expected output -
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils erasers ruler cannot be found
Laptops headphones are lost
What I have tried so far -
First I replace the comma with a space - a = name.replace(',',' ');
Then I use regex to crop out the substring. My regex expression is - b = r'.*-\s([\w\s]+)\\n'. I am unable to print line 3 and 4 in which \n is not present.
The output that I am receiving now is -
Books are on the table
Pens are in the bag
Pencils erasers ruler cannot be found
Laptops headphones are lost
What should I add to my expression to print out lines 3 and 4 as well?
TIA
You may match and remove the line parts starting with a combination of a backslash and n, or all punctuation (non-word and non-whitespace) chars using a re.sub:
a = re.sub(r'\\n.*|[^\w\s]+', '', a)
See the regex demo
Details
\\n.* - a \, n, and then the rest of the line
| - or
[^\w\s]+ - 1 or more chars other than word and whitespace chars
If you need to make sure there is an uppercase letter after \n, you may add [A-Z] after n in the pattern.
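A short sketch of that substitution applied to a few of the sample lines. It assumes the lines contain a literal backslash followed by n (as the question suggests), so the samples are written as raw strings; the list names are made up for illustration.

```python
import re

# Raw strings: each "\n" here is two characters, a backslash and the letter n.
lines = [
    r"Books are on the table\nPick them up",
    r"Cats are roaming around",
    r"Pencils, erasers, ruler cannot be found\nSearch them",
]

# Drop everything from the literal \n onwards, and drop remaining punctuation.
cleaned = [re.sub(r'\\n.*|[^\w\s]+', '', line) for line in lines]

for line in cleaned:
    print(line)
```

One caveat: if punctuation sits directly before the backslash (for example found.\nSearch them), the punctuation alternative consumes the backslash as well, so the tail of the line is not removed.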
I know many people like to twist their minds into knots with regex but why not,
with open('geek_lines.txt') as lines:
    for line in lines:
        print(line.rstrip().split(r'\n')[0])
Simple to write, simple to read, seems to produce the correct result.
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found
Laptops, headphones are lost
I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or fewer immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (rest of the word)
_? - optional space separating capitalized words (_ stands for a space); the ? lets a single word, or the last word, have no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
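For use from Python, the pattern can be run with re.finditer and group 2 collected from each match. This is a sketch on a shortened version of the dummy sentence (the shortening is mine, not from the question); the variable names are made up for illustration.

```python
import re

# Shortened sample based on the sentence in the question.
txt = '''Saul "Canelo" Alvarez and Floyd 'Money' Mayweather. "Here Goes a String". 'Money Mayweather'. "this is not A string", "This IS a" "Three Word String".'''

# Group 1: opening quote. Group 2: one to three capitalized words. \1: same quote again.
pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

names = [m.group(2) for m in pattern.finditer(txt)]
print(names)
```

re.findall would return a tuple of both groups per match, so finditer with group(2) keeps only the quoted text.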
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    # keep matches made of at most 3 title-cased words
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1