Cropping out a portion of a string and printing using regex - python

I am trying to crop a portion of a list of strings and print them. The data looks like the following -
Books are on the table\nPick them up
Pens are in the bag\nBring them
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found\nSearch them
Laptops, headphones are lost\nSearch for them
(This is just few lines from 100 lines of data in the file)
I have to crop the string before the \n in line 1,2,5,6 and print them. I have to also print line 3,4 along with them. Expected output -
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils erasers ruler cannot be found
Laptops headphones are lost
What I have tried so far -
First I replace the comma with a space - a = name.replace(',',' ');
Then I use regex to crop out the substring. My regex expression is - b = r'.*-\s([\w\s]+)\\n'. I am unable to print line 3 and 4 in which \n is not present.
The output that I am receiving now is -
Books are on the table
Pens are in the bag
Pencils erasers ruler cannot be found
Laptops headphones are lost
What should I add to my expression to print out lines 3 and 4 as well?
TIA

You may match and remove the line parts starting with a combination of a backslash and n, or all punctuation (non-word and non-whitespace) chars using a re.sub:
a = re.sub(r'\\n.*|[^\w\s]+', '', a)
See the regex demo
Details
\\n.* - a \, n, and then the rest of the line
| - or
[^\w\s]+ - 1 or more chars other than word and whitespace chars
If you need to make sure there is an uppercase letter after \n, you may add [A-Z] after n in the pattern.

I know many people like to twist their minds into knots with regex but why not,
with open('geek_lines.txt') as lines:
for line in lines:
print (line.rstrip().split(r'\n')[0])
Simple to write, simple to read, seems to produce the correct result.
Books are on the table
Pens are in the bag
Cats are roaming around
Dogs are sitting
Pencils, erasers, ruler cannot be found
Laptops, headphones are lost

Related

replace punctuation with space in text

I have a text like this Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu,Handsome cello wrapped hard magnet, Ideal for home or office.
I removed punctuations from this text by the following code.
import string
string.punctuation
def remove_punctuation(text):
punctuationfree="".join([i for i in text if i not in string.punctuation])
return punctuationfree
#storing the puntuation free text
df_Train['BULLET_POINTS']= df_Train['BULLET_POINTS'].apply(lambda x:remove_punctuation(x))
df_Train.head()
here in the above code df_Train is a pandas dataframe in which "BULLET_POINTS" column contains the kind of text data mentioned above.
The result I got is Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan EksiogluHandsome cello wrapped hard magnet Ideal for home or office
Notice how two words Eksioglu and Handsome are combing due to no space after , . I need a way to overcome this issue.
In these case, it makes sense to replace all the special chars with a space, and then strip the result and shrink multiple spaces to a single space:
df['BULLET_POINTS'] = df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
Or, if you have chunks of punctuation + whitespace to handle:
df['BULLET_POINTS'].str.replace(r'[\W_]+', ' ', regex=True).str.strip()
Output:
>>> df['BULLET_POINTS'].str.replace(r'(?:[^\w\s]|_)+', ' ', regex=True).str.strip()
0 Cat In A Tea Cup by New Yorker cover artist Gurbuz Dogan Eksioglu Handsome cello wrapped hard magnet Ideal for home or office
Name: BULLET_POINTS, dtype: object
The (?:[^\w\s]|_)+ regex matches one or more occurrences of any char other than word and whitespace chars or underscores (i.e. one or more non-alphanumeric chars), and replaces them with a space.
The [\W_]+ pattern is similar but includes whitespace.
The .str.strip() part is necessary as the replacement might result in leading/trailing spaces.

My Regex to remove RT is not working for some reason

The head of my dataframe looks like this
for i in df.index:
txt = df.loc[i]["tweet"]
txt=re.sub(r'#[A-Z0-9a-z_:]+','',txt)#replace username-tags
txt=re.sub(r'^[RT]+','',txt)#replace RT-tags
txt = re.sub('https?://[A-Za-z0-9./]+','',txt)#replace URLs
txt=re.sub("[^a-zA-Z]", " ",txt)#replace hashtags
df.at[i,"tweet"]=txt
However, running this does not remove the 'RT' tags. In addition, I would like to remove the 'b' tag also.
Raw result tweet column:
b Yal suppose you would people waiting for a tub of paint and garden furniture the league is gone and any that thinks anything else is a complete tool of a human who really needs to get down off that cloud lucky to have it back for
b RT watching porn aftern normal people is like no turn it off they don xe x x t love each other
b RT If not now when nn
b Used red wine as a chaser for Captain Morgan xe x x s Fun times
b RT shackattack Hold the front page s Lockdown property project sent me up the walls
Your regular expression is not working, beause this sing ^ means at the beginning of the string. But the two characters you want to remove are not at the beginning.
Change r'^[RT]+' to r'[RT]+' the two letters will be removed. But tbe carefull beacause all other matches will be removed, too.
If you want to remove the letter be as well, try r'^b\s([RT]+)?'.
I suggest you try it yourself on https://regex101.com/

Accumulating Characters in Python

So I have this textfile, and in that file it goes like this... (just a bit of it)
"The truest love that ever heart
Felt at its kindled core
Did through each vein in quickened start
The tide of being pour
Her coming was my hope each day
Her parting was my pain
The chance that did her steps delay
Was ice in every vein
I dreamed it would be nameless bliss
As I loved loved to be
And to this object did I press
As blind as eagerly
But wide as pathless was the space
That lay our lives between
And dangerous as the foamy race
Of ocean surges green
And haunted as a robber path
Through wilderness or wood
For Might and Right and Woe and Wrath
Between our spirits stood
I dangers dared I hindrance scorned
I omens did defy
Whatever menaced harassed warned
I passed impetuous by
On sped my rainbow fast as light
I flew as in a dream
For glorious rose upon my sight
That child of Shower and Gleam"
Now, the calculate the length of words without the letter 'e' in each line of text. So in the first line it should have 4, then 5, then 17, etc.
My current code is
for line in open("textname.txt"):
line_strip = line.strip()
line_strip_split = line_strip.split()
for word in line_strip_split:
if "e" not in word:
word_e = word
print (len(word_e))
My explanation is: Strip each word from each other by removing spaces, so it becomes ['Felt','at','its','kindled','core'], etc. Then we split each word because we can regard it individually when removing words with 'e'?. So we want words without e, then print the length of the string.
HOWEVER, this separates each word into a different line by splitting then separating the string? So this doesn't add all the words together in each line but separates it, so the answer becomes "4 / 2 / 3"
Try this:
for line in open("textname.txt"):
line_strip = line.strip()
line_strip_split = line_strip.split()
words_with_no_e = []
for word in line_strip_split:
if "e" not in word:
# Adding words without e to a new list
words_with_no_e.append(word)
# ''.join() will returns all the elements of array concatenated
# len() will count the length
print(len(''.join(words_with_no_e)))
It append all the words without e in into new list in each line, then concatenate all words then it prints length of it.

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find an any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seem to match any string in quotes, no matter what the length, no matter what the content, as long is it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word begining char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

Regex Python [python-2.7]

I'm working on a Python program that sifts through a .txt file to find the genus and species name. The lines are formatted like this (yes, the equals signs are consistently around the common name):
1. =Common Name= Genus Species some other words that I don't want.
2. =Common Name= Genus Species some other words that I don't want.
I can't seem to figure out a regex that will work to match only the genus and species and not the common name. I know the equals signs (=) will probably help in some way but I cannot think of how to use them.
Edit: Some real data:
1. =Western grebe.= ÆCHMOPHORUS OCCIDENTALIS. Rare migrant; western species, chiefly interior regions of North America.
2. =Holboell's grebe.= COLYMBUS HOLBOELLII. Rare migrant; breeds far north; range, all of North America.
3. =Horned grebe.= COLYMBUS AURITUS. Rare migrant; range, almost the same as the last.
4. =American eared grebe.= COLYMBUS NIGRICOLLIS CALIFORNICUS. Summer resident; rare in eastern, common in western Colorado; breeds from plains to 8,000 feet; partial to alkali lakes; western species.
You probably don't need regex for this one. If the order of the words you need and the count of the words is always the same, you can just split each line into list of substrings and get the third (genus) and the fourth (species) element of that list. The code will probably look like that:
myfile = open('myfilename.txt', 'r')
for line in myfile.readlines():
words = line.split()
genus, species = words[2], words[3]
It just looks a little more "pythonic" to me.
If common name can consist of multiple words, then suggested code will return an incorrect result. To get the right result in this case too, you can use this code:
myfile = open('myfilename.txt', 'r')
for line in myfile.readlines():
words = line.split('=')[2].split() # If the program returns wrong results, try changing the index from 2 to 1 or 3. What number is the right one depends on whether there can be any symbols before the first "=".
genus, species = words[0], words[1]
If it is enough to capture words in groups (and you dont't wont direct match) you can try with:
(?=\d\.\s*=[^=]+=\s(?:(?P<genus>\w+)\s(?P<species>\w+)))
DEMO
the desired values will be in groups <genus> and <species>. The whole regex is a positive lookbehind, so it match a zero point position on a beginning of string, but it captures some content into groups.
(?=\d\.\s*=[^=]+=\s - decimal folowed by some content between equal
signs and space,
(?:(?P<genus>\w+)\s(?P<species>\w+))) - capture first word to genus
groups, and second word do species groups,
You can try something like:
import re
txt='1. =Common Name= Genus Species some other words that I don\'t want.'
re1='.*?' # Non-greedy match on filler
re2='(?:[a-z][a-z]+)' # Uninteresting: word
re3='.*?' # Non-greedy match on filler
re4='(?:[a-z][a-z]+)' # Uninteresting: word
re5='.*?' # Non-greedy match on filler
re6='((?:[a-z][a-z]+))' # Word 1
re7='.*?' # Non-greedy match on filler
re8='((?:[a-z][a-z]+))' # Word 2
rg = re.compile(re1+re2+re3+re4+re5+re6+re7+re8,re.IGNORECASE|re.DOTALL)
m = rg.search(txt)
if m:
word1=m.group(1)
word2=m.group(2)
print "("+word1+")"+"("+word2+")"+"\n"
In your test input as shown in txt, this will print
(Genus)(Species)
You can you this awesome site to help do regexes like this!
Hope this helps

Categories

Resources