choosing certain movie characters from a string using Regex in python

import re

super_heroines = '''
batman
spiderman
ironman
wonderwoman
superman
captainamerica
blackpanther
joker
hulk
thor
'''
# I only want spiderman, ironman, captain america, hulk, thor
# I want to exclude batman, wonderwoman, superman, joker
# Attempt: [^batman] is a character class, so it excludes the letters b, a, t, m, n, not the word
pattern = re.compile(r'[^batman]+')
matches = pattern.findall(super_heroines)
for match in matches:
    print(f'Marvel characters : {match}')
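
One possible way to get the remaining names is to exclude whole words rather than single characters. The sketch below assumes each name sits on its own line, as in the string above, and builds a negative lookahead from a hypothetical `excluded` list. Note that `blackpanther` is not on the exclusion list in the question, so this approach keeps it as well.

```python
import re

super_heroines = '''
batman
spiderman
ironman
wonderwoman
superman
captainamerica
blackpanther
joker
hulk
thor
'''

# Hypothetical list of names to drop, taken from the wording of the question.
excluded = ["batman", "wonderwoman", "superman", "joker"]

# Match a whole line of word characters unless the whole line is an excluded name.
alternatives = "|".join(re.escape(name) for name in excluded)
pattern = re.compile(rf"^(?!(?:{alternatives})$)\w+$", re.MULTILINE)

matches = pattern.findall(super_heroines)
for match in matches:
    print(f"Marvel characters : {match}")
```

If only an explicit allow-list is wanted, the same idea works with a plain alternation of the wanted names instead of a lookahead.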

Related

How to replace second instance of character in substring?

I have the following strings:
text_one = str("\"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.")
text_two = str("\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said.\"")
I need to replace every instance of s/S with $, but not the first instance of s/S in a given word. So the input/output would look something like:
> Mississippi
> Mis$i$$ippi
My idea is to do something like 'after every " " character, skip first "s" and then replace all others up until " " character' but I have no idea how I might go about this. I also thought about creating a list to handle each word.

Solution with re:
import re
text_one = '"A Ukrainian American woman who lives near Boston, Massachusetts, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Russian attacks on Ukraine and the fear these attacks have engendered.'
text_two = '\n\nMany people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Russian soldiers overtake the area, the Boston-area woman said."'
def replace(s):
    return re.sub(
        r"(?<=[sS])(\S+)",
        lambda g: g.group(1).replace("s", "$").replace("S", "$"),
        s,
    )
print(replace(text_one))
print(replace(text_two))
Prints:
"A Ukrainian American woman who lives near Boston, Mas$achu$ett$, told Fox News Digital on Monday that she can no longer speak on the phone with her own mother, who lives in southern Ukraine, because of the Rus$ian attacks on Ukraine and the fear these attacks have engendered.
Many people in southern Ukraine — as well as throughout the country — are right now living in fear for their lives as Rus$ian soldier$ overtake the area, the Boston-area woman said."

The first thing you're going to want to do is find the index of the first s.
Then, split the string into two separate variables: everything up to and including the first s, and the rest of the string.
Next, replace all of the s's in the second string with dollar signs.
Finally, join the two strings with an empty string.
test = "mississippi"
first_index = test.find("s")
tests = [test[:first_index+1], test[first_index+1:]]
tests[1] = tests[1].replace("s", "$")
result = ''.join(tests)
print(result)
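
The snippet above handles a single lower-case word. A minimal sketch of applying the same split-and-replace idea to every word of a longer text, assuming words are separated by whitespace, could look like this (the helper names `mask_after_first_s` and `mask_text` are made up for illustration):

```python
import re

def mask_after_first_s(word):
    """Keep everything up to the first s/S, replace later s/S with $."""
    match = re.search(r"[sS]", word)
    if match is None:
        return word  # no s in this word, nothing to do
    cut = match.end()  # index just past the first s/S
    head, tail = word[:cut], word[cut:]
    return head + tail.replace("s", "$").replace("S", "$")

def mask_text(text):
    # \S+ matches each whitespace-delimited word; the whitespace is left untouched.
    return re.sub(r"\S+", lambda m: mask_after_first_s(m.group(0)), text)

print(mask_text("Mississippi"))
print(mask_text("Russian soldiers said"))
```

Using `re.sub` with `\S+` instead of `str.split` preserves the original spacing and newlines between words.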

substitute '=' sign when an integer is encountered using python

Hi, I am new to Python and regex. I have a string which I want to reformat/substitute:
string = '1John Radcliffe Hospital/Oxford/United Kingdom, 11Ruhr-Universität
3/Bochum/Bochum/Germany, 3University of British Columbia/Vancouver/Canada, 4National
Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan,
5University of Catania/Catania/Italy, 6F. Hoffmann-La Roche Ltd/Basel/Switzerland, 7
University of Colorado School of Medicine/Aurora/United States of America'
I tried:
re.sub('(, \d+()?)', r'\1=', string).strip()
Expected output:
string = '1=John Radcliffe Hospital/Oxford/United Kingdom, 11=Ruhr-Universität
3/Bochum/Bochum/Germany, 3=University of British Columbia/Vancouver/Canada, 4=National
Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan,
5=University of Catania/Catania/Italy, 6=F. Hoffmann-La Roche Ltd/Basel/Switzerland,
7=University of Colorado School of Medicine/Aurora/United States of America'

You can match either the start of the string or a comma followed by a space, using a non-capturing group, then match one or more digits and assert that they are not directly followed by a /.
(?:^|, )\d+(?!/)
The pattern matches
(?:^|, ) Non-capturing group: either assert the start of the string or match a comma and a space
\d+(?!/) Match 1+ digits asserting not a / directly to the right
In the replacement use the full match followed by an equals sign
\g<0>=
Example
import re
string = ("1John Radcliffe Hospital/Oxford/United Kingdom, 11Ruhr-Universität \n"
"3/Bochum/Bochum/Germany, 3University of British Columbia/Vancouver/Canada, 4National \n"
"Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan, \n"
"5University of Catania/Catania/Italy, 6F. Hoffmann-La Roche Ltd/Basel/Switzerland, 7 \n"
"University of Colorado School of Medicine/Aurora/United States of America")
result = re.sub(r'(?:^|, )\d+(?!/)', r'\g<0>=', string, flags=re.MULTILINE).strip()
print(result)
Output
1=John Radcliffe Hospital/Oxford/United Kingdom, 11=Ruhr-Universität
3/Bochum/Bochum/Germany, 3=University of British Columbia/Vancouver/Canada, 4=National
Institute of Neuroscience, National Center of Neurology and Psychiatry/Tokyo/Japan,
5=University of Catania/Catania/Italy, 6=F. Hoffmann-La Roche Ltd/Basel/Switzerland, 7=
University of Colorado School of Medicine/Aurora/United States of America
Another option could be using a positive lookahead to assert an uppercase char [A-Z] after matching a digit.
(?:^|, )\d+(?=\s*[A-Z])
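
As a rough illustration of the positive-lookahead variant, the sketch below runs it on a shortened, single-line version of the sample string (the shortening is an assumption made for brevity, so `re.MULTILINE` is not needed here).

```python
import re

# Shortened, single-line version of the sample string from the question.
string = ("1John Radcliffe Hospital/Oxford/United Kingdom, "
          "11Ruhr-Universität 3/Bochum/Bochum/Germany, "
          "3University of British Columbia/Vancouver/Canada")

# Digits at the start or after ", " that are followed by an uppercase letter.
pattern = r"(?:^|, )\d+(?=\s*[A-Z])"
result = re.sub(pattern, r"\g<0>=", string)
print(result)
```

One reason to prefer this variant: with `\d+(?!/)` a multi-digit number directly followed by `/` can still match partially, because `\d+` backtracks by one digit. For example, `, 13/Bochum` would become `, 1=3/Bochum`, while the positive lookahead leaves it unchanged.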

Regex Pattern doesn't work using look behind without validating the fixed-width pattern

I need to find a regex that will extract the city name from strings below.
The order of string is the restaurant name, address, city, phone, cuisine type
Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave
Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food
Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian
Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian
Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs
I tried this regex, but it doesn't work:
zagat['city'] = zagat['raw'].str.extract("""
((?<=Ave.|Rd.|St.|Blvd.|Dr.|Way.|Pl.|Ln.|Ct.|Beach|Way ).+(?=...-...-....))
""", expand=True)
Can you help?

You may use
rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'
zagat['city'] = zagat['raw'].str.extract(rx, expand=False)
Details
(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk) - Ave, Rd, St, Blvd, Dr, Way, Pl, Ln or Ct followed by a ., or Beach, Way or Walk
\s* - 0+ whitespaces
(.+?) - Group 1 (this value will be returned by .extract): any one or more chars other than line break chars, as few as possible
\s* - 0+ whitespaces
\d{3}-\d{3}-\d{4} - 3 digits, -, 3 digits, - and 4 digits.
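
To see what the pattern captures without needing the `zagat` DataFrame, here is a small sketch that applies it with the standard `re` module to the five sample rows from the question. The `rows` list and `cities` variable are illustrative names only.

```python
import re

rx = r'(?:(?:Ave|Rd|St|Blvd|Dr|Way|Pl|Ln|Ct)\.|Beach|Way|Walk)\s*(.+?)\s*\d{3}-\d{3}-\d{4}'

# Sample rows copied from the question.
rows = [
    "Chinois on Main 2709 Main St. Santa Monica 310-392-9025 Pacific New Wave",
    "Benita's Frites 1433 Third St. Promenade Santa Monica 310-458-2889 Fast Food",
    "Indo Cafe 10428 1/2 National Blvd. LA 310-815-1290 Indonesian",
    "Diaghilev 1020 N. San Vicente Blvd. W. Hollywood 310-854-1111 Russian",
    "Jody Maroni's Sausage Kingdom 2011 Ocean Front Walk Venice 310-306-1995 Hot Dogs",
]

cities = []
for row in rows:
    match = re.search(rx, row)
    # Group 1 is the text between the street suffix and the phone number.
    cities.append(match.group(1) if match else None)

for city in cities:
    print(city)
```

The second row yields "Promenade Santa Monica", because "Promenade" follows the "St." suffix. Rows like that would need an extra suffix in the alternation or some post-processing.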

regular expression to exclude 2 consecutive capital letters

I'm having difficulty using regex to solve this expression,
e.g when given below:
regex_exp(address, "OG 56432")
It should return
"OG 56432: Middle Street Pollocksville | 686"
address is an array of strings:
address = [
"622 Gordon Lane St. Louisville OH 52071",
"432 Main Long Road St. Louisville OH 43071",
"686 Middle Street Pollocksville OG 56432"
]
My solution currently looks like this (Python):
import re
def regex_exp(address, zipcode):
    for x in address:
        if zipcode in x:
            postal_code = (re.search(r"[A-Z]{2}\s[0-9]{5}", x)).group(0)
            # returns "OG 56432"
            digits = (re.search(r"\d+", x)).group(0)
            # returns "686"
            address = (re.search(r"\D+", x)).group(0)
            # returns " Middle Street Pollocksville OG "
            print(postal_code + ":" + address + "| " + digits)
regex_exp(address, "OG 56432")
# prints OG 56432: Middle Street Pollocksville OG | 686
As you can see from my second paragraph, this is not the correct answer - I need the returned value to be
"OG 56432: Middle Street Pollocksville | 686"
How do I change the regex search for my address variable to exclude the two consecutive capital letters? I've tried things like
address = (re.search("?!\D+", x)).group(0)
to remove the two consecutive capitals based on A regular expression to exclude a word/string but I think this is a step in the wrong direction.
PS: I understand there are easier methods to solve this, but I want to use regex to improve my fundamentals

If you just want to remove the two consecutive capital letters that precede the zip code (a 5-digit number), use this:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'([A-Z]{2}[\s]{1})(?=[\d]{5})','',text)
print(address)
# Output: 432 Main Long PC Market Road St. Louisville 43071
To remove all occurrences of two consecutive capital letters:
import re
text = "432 Main Long PC Market Road St. Louisville OG 43071"
address = re.sub(r'\b[A-Z]{2}\s', '', text)
print(address)
# Output: 432 Main Long Market Road St. Louisville 43071

With re.sub() and group capturing you can use:
s="686 Middle Street Pollocksville OG 56432"
re.sub(r"(\d+)\s+(.*)\s+([A-Z]+\s+\d+)",r"\3: \2 | \1",s)
Out: 'OG 56432: Middle Street Pollocksville | 686'
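
Putting the capture-group idea into the shape of the original `regex_exp` function might look like the following sketch. It assumes every entry has the form "number, street text, two capital letters, five digits" and returns the string instead of printing it; `ADDRESS_RE` is a made-up name.

```python
import re

# house number, street text, then "XX 12345" at the end of the entry
ADDRESS_RE = re.compile(r"(\d+)\s+(.*)\s+([A-Z]{2}\s+\d{5})$")

def regex_exp(addresses, zipcode):
    for entry in addresses:
        match = ADDRESS_RE.match(entry)
        if match and match.group(3) == zipcode:
            number, street, code = match.groups()
            return f"{code}: {street} | {number}"
    return None  # no entry with that code

address = [
    "622 Gordon Lane St. Louisville OH 52071",
    "432 Main Long Road St. Louisville OH 43071",
    "686 Middle Street Pollocksville OG 56432",
]
print(regex_exp(address, "OG 56432"))
```

Because the state and zip code are captured in their own group, the street group never contains the two capital letters, so nothing has to be removed afterwards.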

Parse a very large text file with Python?

The file has about 57,000 book titles, author names and ETEXT numbers. I am trying to parse the file to get only the ETEXT numbers.
The File is like this:
TITLE and AUTHOR ETEXT NO.
Aspects of plant life; with special reference to the British flora,      56900
by Robert Lloyd Praeger
The Vicar of Morwenstow, by Sabine Baring-Gould 56899
[Subtitle: Being a Life of Robert Stephen Hawker, M.A.]
Raamatun tutkisteluja IV, mennessä Charles T. Russell 56898
[Subtitle: Harmagedonin taistelu]
[Language: Finnish]
Raamatun tutkisteluja III, mennessä Charles T. Russell 56897
[Subtitle: Tulkoon valtakuntasi]
[Language: Finnish]
Tom Thatcher's Fortune, by Horatio Alger, Jr. 56896
A Yankee Flier in the Far East, by Al Avery 56895
and George Rutherford Montgomery
[Illustrator: Paul Laune]
Nancy Brandon's Mystery, by Lillian Garis 56894
Nervous Ills, by Boris Sidis 56893
[Subtitle: Their Cause and Cure]
Pensées sans langage, par Francis Picabia 56892
[Language: French]
Helon's Pilgrimage to Jerusalem, Volume 2 of 2, by Frederick Strauss 56891
[Subtitle: A picture of Judaism, in the century
which preceded the advent of our Savior]
Fra Tommaso Campanella, Vol. 1, di Luigi Amabile 56890
[Subtitle: la sua congiura, i suoi processi e la sua pazzia]
[Language: Italian]
The Blue Star, by Fletcher Pratt 56889
Importanza e risultati degli incrociamenti in avicoltura, 56888
di Teodoro Pascal
[Language: Italian]
And this is what I tried:
def search_by_etext():
    fhand = open('GUTINDEX.ALL')
    print("Search by ETEXT:")
    for line in fhand:
        if not line.startswith(" [") and not line.startswith("~"):
            if not line.startswith(" ") and not line.startswith("TITLE"):
                words = line.rstrip()
                words = line.lstrip()
                words = words[-7:]
                print (words)
search_by_etext()
Well, the code mostly works. However, for some lines it gives me part of the title or other things. For example, the output contains 'decott', which is part of an author name and shouldn't be there, for this input:
The Bashful Earthquake, by Oliver Herford                                56765
[Subtitle: and Other Fables and Verses]
The House of Orchids and Other Poems, by George Sterling                 56764
North Italian Folk, by Alice Vansittart Strettel Carr                    56763
 and Randolph Caldecott
[Subtitle: Sketches of Town and Country Life]
Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762
[Subtitle: New Zealand Board of Science and Art, Manual No. 2]
Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761
De drie steden: Lourdes, door Émile Zola 56760
[Language: Dutch]
Another example, for this input:
Rhandensche Jongens, door Jan Lens 56702
[Illustrator: Tjeerd Bottema]
[Language: Dutch]
The Story of The Woman's Party, by Inez Haynes Irwin 56701
Mormon Doctrine Plain and Simple, by Charles W. Penrose 56700
[Subtitle: Or Leaves from the Tree of Life]
The Stone Axe of Burkamukk, by Mary Grant Bruce 56699
[Illustrator: J. Macfarlane]
The Latter-Day Prophet, by George Q. Cannon 56698
[Subtitle: History of Joseph Smith Written for Young People]
Here, Life] shouldn't be there. Lines starting with a blank space have been filtered out with this:
if not line.startswith(" [") and not line.startswith("~"):
But I am still getting those stray values in my output.

Simple solution: regexps to the rescue !
import re
with open("etext.txt") as f:
    for line in f:
        match = re.search(r" (\d+)$", line.strip())
        if match:
            print(match.group(1))
The regular expression r" (\d+)$" matches "a space followed by one or more digits at the end of the string", and captures only the "one or more digits" group.
You can tighten the regexp further: if you know all ETEXT codes are exactly 5 digits long, you can change it to r" (\d{5})$".
This works with the example text you posted. If it doesn't properly work on your own file then we need enough of the real data to find out what you really have.
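
As a quick check of this approach, the sketch below runs the stricter five-digit variant over a few lines copied from the question instead of the real file. The assumption that every ETEXT number has exactly five digits holds for the sample, but may not hold for the whole index.

```python
import re

# A few lines copied from the sample listing in the question.
sample_lines = [
    "The Bashful Earthquake, by Oliver Herford                                56765",
    " [Subtitle: and Other Fables and Verses]",
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763",
    " and Randolph Caldecott",
    "Wild Life in New Zealand. Part 1, Mammalia, by George M. Thomson 56762",
    " [Subtitle: New Zealand Board of Science and Art, Manual No. 2]",
    "Universal Brotherhood, Volume 13, No. 10, January 1899, by Various 56761",
]

# Assumption: ETEXT numbers are exactly five digits at the end of the line.
ETEXT_RE = re.compile(r" (\d{5})$")

etext_numbers = []
for line in sample_lines:
    match = ETEXT_RE.search(line.strip())
    if match:
        etext_numbers.append(match.group(1))

print(etext_numbers)
```

Anchoring on the end of the line means subtitle and continuation lines are skipped without any `startswith` checks, since they do not end in a space followed by digits.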

It could be that those extra lines that are not being filtered out start with whitespace other than a " " char, such as a tab. As a minimal change that might work, try filtering out lines that start with any whitespace rather than specifically a space char.
To check for whitespace in general rather than a space char, you can use a regular expression. Try if not re.match(r'\s', line) and ... (re.match only matches at the start of the string, so a leading ^ is not needed).
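
A minimal sketch of that whitespace filter, assuming continuation and subtitle lines always start with some whitespace character and that title lines end with the ETEXT number (the sample lines, including the tab-indented one, are constructed for illustration):

```python
import re

sample_lines = [
    "TITLE and AUTHOR ETEXT NO.\n",
    "North Italian Folk, by Alice Vansittart Strettel Carr                    56763\n",
    "\tand Randolph Caldecott\n",  # hypothetical tab-indented continuation line
    " [Subtitle: Sketches of Town and Country Life]\n",
    "\n",
    "De drie steden: Lourdes, door Émile Zola 56760\n",
]

def etext_numbers(lines):
    numbers = []
    for line in lines:
        # Skip lines that start with any whitespace (space, tab, bare newline).
        if re.match(r"\s", line):
            continue
        # Skip the header and separator lines mentioned in the question.
        if line.startswith(("TITLE", "~")):
            continue
        numbers.append(line.split()[-1])  # last whitespace-separated token
    return numbers

print(etext_numbers(sample_lines))
```

This keeps the structure of the original loop and only swaps the space check for a general whitespace check. It still relies on the last token of each kept line being the number.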
