substitute '=' sign when an integer is encountered using python - python

Hi, I am new to Python and regex. I have a string that I want to reformat by inserting '=' after each leading number:
string = '1John Radcliffe Hospital/Oxford/United Kingdom, 11Ruhr-Universität
3/Bochum/Bochum/Germany, 3University of British Columbia/Vancouver/Canada, 4National
Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan,
5University of Catania/Catania/Italy, 6F. Hoffmann-La Roche Ltd/Basel/Switzerland, 7
University of Colorado School of Medicine/Aurora/United States of America'
I tried:
re.sub('(, \d+()?)', r'\1=', string).strip()
Expected output:
string = '1=John Radcliffe Hospital/Oxford/United Kingdom, 11=Ruhr-Universität
3/Bochum/Bochum/Germany, 3=University of British Columbia/Vancouver/Canada, 4=National
Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan,
5=University of Catania/Catania/Italy, 6=F. Hoffmann-La Roche Ltd/Basel/Switzerland,
7=University of Colorado School of Medicine/Aurora/United States of America'

You can match either the start of the string or a comma followed by a space, using a non-capturing group, then match one or more digits and assert that they are not directly followed by a /.
(?:^|, )\d+(?!/)
The pattern matches
(?:^|, ) Non-capturing group: assert the start of the string, or match a comma and a space
\d+(?!/) Match 1+ digits asserting not a / directly to the right
In the replacement use the full match followed by an equals sign
\g<0>=
Example
import re
string = ("1John Radcliffe Hospital/Oxford/United Kingdom, 11Ruhr-Universität \n"
"3/Bochum/Bochum/Germany, 3University of British Columbia/Vancouver/Canada, 4National \n"
"Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan, \n"
"5University of Catania/Catania/Italy, 6F. Hoffmann-La Roche Ltd/Basel/Switzerland, 7 \n"
"University of Colorado School of Medicine/Aurora/United States of America")
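# Pass count and flags by keyword; passing them positionally is deprecated in recent Python versions
result = re.sub(r'(?:^|, )\d+(?!/)', r'\g<0>=', string, count=0, flags=re.MULTILINE).strip()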
print(result)
Output
1=John Radcliffe Hospital/Oxford/United Kingdom, 11=Ruhr-Universität
3/Bochum/Bochum/Germany, 3=University of British Columbia/Vancouver/Canada, 4=National
Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan,
5=University of Catania/Catania/Italy, 6=F. Hoffmann-La Roche Ltd/Basel/Switzerland, 7=
University of Colorado School of Medicine/Aurora/United States of America
Another option could be using a positive lookahead to assert an uppercase char [A-Z] after matching a digit.
(?:^|, )\d+(?=\s*[A-Z])
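As a quick check of this second pattern, here is a minimal sketch run on a shortened version of the sample string. The shortened text and the names `sample`, `pattern` and `result` are assumptions made for illustration. The `\s*` inside the lookahead is what lets `7` match even though the institution name continues on the next line.

```python
import re

# Shortened, hypothetical version of the string from the question
sample = (
    "1John Radcliffe Hospital/Oxford, 11Ruhr-Universität \n"
    "3/Bochum/Germany, 3University of British Columbia/Vancouver, 7 \n"
    "University of Colorado"
)

# Digits at the start of a line or after ", " that are followed by
# optional whitespace and an uppercase ASCII letter
pattern = re.compile(r"(?:^|, )\d+(?=\s*[A-Z])", re.MULTILINE)

# \g<0> re-inserts the whole match, then "=" is appended
result = pattern.sub(r"\g<0>=", sample)
print(result)
```

Note that `[A-Z]` only covers ASCII uppercase letters, so a name starting with a letter such as É would not be matched by this variant.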

Related

How to replace second instance of character in substring?

I have the following strings:
text_one = str("\"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.")
text_two = str("\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said.\"")
I need to replace every instance of s/S with $, but not the first instance of s/S in a given word. So the input/output would look something like:
> Mississippi
> Mis$i$$ippi
My idea is to do something like 'after every " " character, skip first "s" and then replace all others up until " " character' but I have no idea how I might go about this. I also thought about creating a list to handle each word.
Solution with re:
import re
text_one = '"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.'
text_two = '\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said."'
def replace(s):
    return re.sub(
        r"(?<=[sS])(\S+)",
        lambda g: g.group(1).replace("s", "$").replace("S", "$"),
        s,
    )
print(replace(text_one))
print(replace(text_two))
Prints:
"A Ukrainian American woman who lives near Boston, Mas$achu$ett$, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Rus$ian attacks on Ukraine and the fear these attacks have engendered.
Many people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Rus$ian soldier$ overtake the area, the Boston-area woman said."
The first thing you're going to want to do is to find the index of the first s
Then, you'll want to split the string so you get the string until after the first s and the rest of the string into two separate variables
Next, replace all of the s's in the second string with dollar signs
Finally, join the two strings with an empty string
test = "mississippi"
first_index = test.find("s")
tests = [test[:first_index+1], test[first_index+1:]]
tests[1] = tests[1].replace("s", "$")
result = ''.join(tests)
print(result)
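The snippet above handles a single lowercase word. A minimal sketch of how the same steps might be applied to every word of a longer text, covering both s and S, is given below. The helper names `mask_word` and `mask_text` are hypothetical, and treating each run of non-whitespace characters as a "word" is an assumption about what the question means by a word.

```python
import re

def mask_word(word):
    # Index of the first "s" or "S" in the word, or -1 if there is none
    first = next((i for i, ch in enumerate(word) if ch in "sS"), -1)
    if first == -1:
        return word
    # Keep everything up to and including the first s, mask the rest
    head, tail = word[: first + 1], word[first + 1 :]
    return head + tail.replace("s", "$").replace("S", "$")

def mask_text(text):
    # Treat every run of non-whitespace characters as one word
    return re.sub(r"\S+", lambda m: mask_word(m.group(0)), text)

print(mask_word("Mississippi"))
print(mask_text("Russian soldiers overtake the area"))
```

Because `re.sub` only rewrites the non-whitespace runs, the original spacing and line breaks of the text are left as they are.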

choosing certain movie characters from a string using Regex in python

superheroines = '''
batman
spiderman
ironman
wonderwoman
superman
captainamerica
blackpanther
joker
hulk
thor
'''
''' I only want spiderman, ironman, captain america, hulk, thor '''
''' I want to exclude batman, wonderwoman, superman, joker '''
import re
pattern = re.compile(r'[^batman]+')
matches = pattern.findall(superheroines)
for match in matches:
    print(f'Marvel characters : {match}')
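A character class such as `[^batman]` excludes the individual letters b, a, t, m and n rather than the word "batman". One possible way to exclude whole names is a negative lookahead over a list of names, sketched below. The `excluded` list is taken from the comments in the code above; the names `alternatives` and `matches` are illustrative. Since "blackpanther" appears in neither list, this sketch keeps it.

```python
import re

superheroines = """
batman
spiderman
ironman
wonderwoman
superman
captainamerica
blackpanther
joker
hulk
thor
"""

# Names to leave out, as listed in the question's comments
excluded = ["batman", "wonderwoman", "superman", "joker"]
alternatives = "|".join(re.escape(name) for name in excluded)

# Match a whole line of word characters unless the line is exactly an excluded name
pattern = re.compile(rf"^(?!(?:{alternatives})$)\w+$", re.MULTILINE)

matches = pattern.findall(superheroines)
for match in matches:
    print(f"Marvel characters : {match}")
```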

Python regex, negate a set of characters in between a string

I have several sets of strings in which numbers and words appear in varying order.
For example,
"Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"
I am trying to separate the street names from the apartment numbers.
Here is my current code:
import re
pattern_street = re.compile(r'[A-Za-z]+\s?\w+\s?[A-Za-z]+\s?[A-Za-z]+',re.X)
pattern_apartmentnumber = re.compile(r'(^\d+\s? | [A-Za-z]+[\s?]+[0-9]+$)',re.X)
for i in ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]:
    match_street = pattern_street.search(i)
    match_apartmentnumber = pattern_apartmentnumber.search(i)
    fin_street = match_street[0]
    fin_apartmentnumber = match_apartmentnumber[0]
    print("street--",fin_street)
    print("apartmentnumber--",fin_apartmentnumber)
which prints:
street-- Street 50 No
apartmentnumber-- No 40
street-- saint bakers holy street
apartmentnumber-- 5
street-- Syndicate street
apartmentnumber-- 32
I want to remove the "No" from the first street name, i.e. if a string ends with No followed by a number, that part should be taken as the apartment number and not as part of the street.
How can I do this for my above example strings?
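First try the case where the string ends with No followed by a number, using a positive lookahead so that part is not included in the match.
If that alternative does not match, fall back to matching the street without the lookahead.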
pattern_street = re.compile(r'[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+',re.X)
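A minimal sketch that runs this pattern over the three strings from the question. The list names `addresses` and `streets` are assumptions for illustration, and `re.X` is omitted here because the pattern contains no literal whitespace.

```python
import re

# First alternative: stop before a trailing "No <digits>"; second: plain street
pattern_street = re.compile(r"[A-Za-z]+[\s\w]+(?=\s[Nn]o\s\d+$)|[A-Za-z]+[\s\w]+")

addresses = ["Street 50 No 40", "5, saint bakers holy street", "32 Syndicate street"]

# search() returns the leftmost match; group(0) is the full matched text
streets = [pattern_street.search(address).group(0) for address in addresses]

for address, street in zip(addresses, streets):
    print(f"{address!r} -> street-- {street}")
```

The order of the alternatives matters: the lookahead branch has to come first, otherwise the plain branch would match "Street 50 No 40" in full.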
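Alternatively, you can match the street name with the following pattern, which stops before the first occurrence of No. Note that it also stops at No inside a later word (for example in "Old North Road") and can leave a trailing space in the match.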
pattern_street = re.compile(r'[A-Za-z]+((?!No).)+',re.X)

Regex Pattern doesn't work using look behind without validating the fixed-width pattern

I need to find a regex that will extract the city name from strings below.
The order of string is the restaurant name, address, city, phone, cuisine type
Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave
Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food
Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian
Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian
Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs
I tried this regex, but it doesn't work:
zagat['city'] = zagat['raw'].str.extract("""
((?<=Ave.|Rd.|St.|Blvd.|Dr.|Way.|Pl.|Ln.|Ct.|Beach|Way ).+(?=...-...-....))
""", expand=True)
Can you help?
You may use
rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
zagat['city'] = zagat['raw'].str.extract(rx, expand=False)
Details
(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk) - Ave, Rd, St, Blvd, Dr, Way, Pl, Ln or Ct followed with . or Beach, Way or Walk
\s* - 0+ whitespaces
(.+?) - Group 1 (this value will be returned by .extract): any one or more chars other than line break chars, as few as possible
\s* - 0+ whitespaces
\d{3}-\d{3}-\d{4} - 3 digits, -, 3 digits, - and 4 digits.
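To see what Group 1 captures without needing pandas, here is a minimal sketch that applies the same pattern with the standard `re` module to the five sample rows. The names `rows` and `cities` are assumptions for illustration; `Series.str.extract` would return the same first capture group for each row.

```python
import re

rx = re.compile(
    r"(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)"
    r"\s*(.+?)\s*\d{3}-\d{3}-\d{4}"
)

rows = [
    "Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave",
    "Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food",
    "Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian",
    "Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian",
    "Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs",
]

cities = []
for row in rows:
    match = rx.search(row)
    # Group 1 is the text between the street suffix and the phone number
    cities.append(match.group(1) if match else None)

for city in cities:
    print(city)
```

For the second row the captured text is "Promenade Santa Monica", because the pattern cannot know that "Promenade" still belongs to the address; rows like this need an extra rule or manual cleanup.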

python reg-ex pattern not matching

I have a regex matching problem with the following pattern and string. The pattern is basically a name, followed by any number of characters, followed by one of the phrases (see pattern below), followed by any number of characters, followed by the institution name.
pattern = "[David Maxwell|David|Maxwell] .* [educated at|graduated from|attended|studied at|graduate of] .* Eton College"
str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
match = re.search(pattern, str)
But the search method returns no match for the above string. Is my regex correct? I'm new to regex, so any help is appreciated.
[...] means "any character from this set of characters". If you want "any word in this group of words" you need to use parentheses: (...|...).
There's another problem in your expression, where you have .* (space, dot, star, space), which means "a space, followed by zero or more characters, followed by a space". In other words, the shortest possible match is two spaces. However, your text only has one space between "educated at" and "Eton College".
>>> pattern = '(David Maxwell|David|Maxwell).*(educated at|graduated from|attended|studied at|graduate of).*Eton College'
>>> str = "David Maxwell was educated at Eton College, where he was a King's Scholar and Captain of Boats, and at Cambridge University where he rowed in the winning Cambridge boat in the 1971 and 1972 Boat Races."
>>> re.search(pattern, str)
<_sre.SRE_Match object at 0x1006d10b8>
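The difference between the two patterns can be checked side by side. The following is a minimal sketch using a shortened version of the sentence; the names `text`, `broken` and `fixed` are assumptions for illustration, and `text` is used instead of `str` to avoid shadowing the built-in.

```python
import re

# Shortened, hypothetical version of the sentence from the question
text = "David Maxwell was educated at Eton College, where he was a King's Scholar."

# Square brackets: each [...] matches exactly ONE character from the set
broken = (r"[David Maxwell|David|Maxwell] .* "
          r"[educated at|graduated from|attended|studied at|graduate of] .* Eton College")

# Parentheses: each (...|...) matches one of the listed phrases
fixed = (r"(David Maxwell|David|Maxwell).*"
         r"(educated at|graduated from|attended|studied at|graduate of).*Eton College")

print(re.search(broken, text))  # no match object is returned

match = re.search(fixed, text)
print(match.group(1))  # the name that matched
print(match.group(2))  # the phrase that matched
```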
