How can I achieve this without using if statements and huge multiline constructs?
Example:
around $98 million last week
above $10 million next month
about €5 billion past year
after £1 billion this day
Convert to
around 98 million dollars last week
above 10 million dollars next month
about 5 billion euros past year
after 1 billion pounds this day
You probably have more cases than these, so this regular expression may be too specific, but re.sub can be called with a function that processes each match and returns the correct replacement. The code below handles the cases provided:
import re
text = '''\
around $98 million last week
above $10 million next month
about €5 billion past year
after £1 billion this day
'''
currency_name = {'$':'dollars', '€':'euros', '£':'pounds'}
def replacement(match):
    # group 2 is the digits and million/billion,
    # tack on the currency type afterward using a lookup dictionary
    return f'{match.group(2)} {currency_name[match.group(1)]}'
# capture the symbol followed by digits and million/billion
print(re.sub(r'([$€£])(\d+ [mb]illion)\b', replacement, text))
Output:
around 98 million dollars last week
above 10 million dollars next month
about 5 billion euros past year
after 1 billion pounds this day
You could make a dictionary mapping currency symbol to name, and then generate the regexes. Note that these regexes will only work for something in the form of a number and then a word.
import re
CURRENCIES = {
    r"\$": "dollars",  # Note the backslash; $ is a special regex character
    "€": "euros",
    "£": "pounds",
}
REGEXES = []
for symbol, name in CURRENCIES.items():
    REGEXES.append((re.compile(rf"{symbol}(\d+ [^\W\d_]+)"), rf"\1 {name}"))
text = """around $98 million last week
above $10 million next month
about €5 billion past year
after £1 billion this day"""
for regex, replacement in REGEXES:
    text = regex.sub(replacement, text)
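The per-symbol regexes can also be folded into a single pattern. This is only a minimal sketch, assuming the same three currencies; SYMBOL_NAMES and spell_out_currency are illustrative names that do not appear in the answer, and re.escape is used so the $ does not have to be escaped by hand.

```python
import re

# Hypothetical lookup table: plain symbols, no manual escaping needed.
SYMBOL_NAMES = {"$": "dollars", "€": "euros", "£": "pounds"}

# Build one alternation from the dictionary keys; re.escape protects "$".
SYMBOLS = "|".join(re.escape(symbol) for symbol in SYMBOL_NAMES)
PATTERN = re.compile(rf"({SYMBOLS})(\d+ [^\W\d_]+)")

def spell_out_currency(text):
    # Move the currency after "<number> <word>" and spell it out.
    return PATTERN.sub(lambda m: f"{m.group(2)} {SYMBOL_NAMES[m.group(1)]}", text)

print(spell_out_currency("about €5 billion past year"))
# about 5 billion euros past year
```

Like the loop version, this only handles a number followed by a single word.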
It's useful in a case like this to remember that re.sub can accept a lambda rather than just a string.
The following requires Python 3.8+ for the := operator.
import re
s = "about €5 billion past year"
re.sub(r'([€$])(\d+)\s+([mb]illion)',
       lambda m: f"{(g := m.groups())[1]} {g[2]} {'euros' if g[0] == '€' else 'dollars'}",
       s)
# 'about 5 billion euros past year'
Related
I am trying to match an "n years m months x days" pattern using a regex. The "n years", "m months" and "x days" parts and the word "and" may or may not be present in the string. For an exact match I am able to extract this using the regex:
re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?', '2 years 25 days')
which returns 2 years 25 days, but if there is additional text in the string I don't get the match, like:
re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?', 'in 2 years 25 days')
returns ''
I tried this:
re.search(r'.*(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))?.*', 'in 2 years 25 days')
which returns the whole string, but I don't want the additional text.
You get an empty string with the last pattern as all the parts in the regex are optional, so it will also match an empty string.
If all the parts are optional but you want to match at least 1 of them, you can use a leading assertion.
\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b
Explanation
\b A word boundary
(?=\d+ (?:years?|months?|days?)\b) Assert to the right 1+ digits and 1 of the alternatives
(?:\d+ years?)? Match 1+ digits, space and year or years
(?:\s*\d+ months?)? Same for months
(?:\s*\d+ days?)? Same for days
\b A word boundary
Example
import re
pattern = r'\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b'
m = re.search(pattern, 'in 2 years 25 days')
if m:
    print(m.group())
Output
2 years 25 days
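To see why the all-optional pattern returns an empty string, it may help to run both patterns side by side and look at where each match starts. A small sketch; the variable names are illustrative, and the first pattern is the question's regex with the capture groups simplified to years?, months? and days?.

```python
import re

text = 'in 2 years 25 days'

# Every part is optional, so the search succeeds at index 0 with an empty match.
all_optional = r'(?:\d+ years?)?\s*(?:\d+ months?)?\s*(?:\d+ days?)?'
empty = re.search(all_optional, text)
print(repr(empty.group()), empty.span())
# '' (0, 0)

# The lookahead only lets a match start where a number and a unit follow.
guarded = r'\b(?=\d+ (?:years?|months?|days?)\b)(?:\d+ years?)?(?:\s*\d+ months?)?(?:\s*\d+ days?)?\b'
found = re.search(guarded, text)
print(repr(found.group()), found.span())
# '2 years 25 days' (3, 18)
```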
Since years, months, days are temporal units, you could use the pint module for that.
Parse temporal units with pint
See the String parsing tutorial and related features used:
printing quantities, see String formatting
converting quantities, see Converting to different units
from pint import UnitRegistry
ureg = UnitRegistry()
temporal_strings = '2 years and 25 days'.split('and') # remove and split
quantities = [ureg(q) for q in temporal_strings] # parse quantities
# [<Quantity(2, 'year')>, <Quantity(25, 'day')>]
# print the quantities separately
for q in quantities:
    print(q)
# get the total days
print(f"total: {sum(quantities)}")
print(f"total days: {sum(quantities).to('days')}")
Output printed:
2 year
25 day
total: 2.0684462696783026 year
total days: 755.5 day
You can try this:
import re
match = re.search(r'(?:\d+ year(s?))?\s*(?:\d+ month(s?))?\s*(?:\d+ day(s?))', 'in 2 years 25 days')
if match:
    print(match.group())
Output:
2 years 25 days
I am trying to write a Python regular expression that captures multiple values from a few columns in a dataframe. The regular expression below attempts to do so. There are 4 parts to the string.
group 1: Date - month and day
group 2: Date - month and day
group 3: description text before amount i.e. group 4
group 4: amount - this group is optional
Some peculiar conditions for the text in group 3:
(1) The text itself might contain characters like "-" and "$", so we cannot use - and $ as the boundary of the text.
(2) The text (group 3) is sometimes not followed by an amount.
(3) The whitespace between groups 3 and 4 is optional.
Below is the Python function, which takes a dataframe with 4 columns c1, c2, c3, c4 and adds the columns dt, txt and amt to it after processing.
def parse_values(args):
    re_1='(([JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC]{3}\s{0,}[\d]{1,2})\s{0,}){2}(.*[\s]|.*[^\$]|.*[^-]){1}([-+]?\$[\d|,]+(?:\.\d+)?)?'
    srch=re.search(re_1, args[0])
    if srch is None:
        return args
    m = re.match(re_1, args[0])
    args['dt']=m.group(1)
    args['txt']=m.group(3)
    args['amt']=m.group(4)
    if m.group(4) is None:
        if pd.isnull(args['c3']):
            args['amt']=args.c2
        else:
            args['amt']=args.c3
    return args
And in order to test the results I have below 6 rows which needs to return a properly formatted amt column in return.
tt=[{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL ','c2':'$16.84'},
{'c1':'OCT 7 OCT 8 HURRY CURRY THORNHILL','c2':'$16.84'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK-$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70'},
{'c1':'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK $80,00,7770.70'}
]
t=pd.DataFrame(tt,columns=['c1','c2','c3','c4'])
t=t.apply(parse_values,1)
t
However, due to the error in my regular expression re_1, the amt and txt columns are not parsed properly: they return NaN or miss some words in some rows of the output.
How about this:
(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)
As seen at regex101.com
Explanation:
First off, I've shortened the regex by changing a few minor details like using \s* instead of \s{0,}, which mean the exact same thing.
The original [JAN|...|DEC] code used a character class, i.e. [], which matches only a single character from the set. A non-capturing group with alternation is the correct way to choose between multi-letter alternatives, which in your case are the month names.
The meat of the regex: LOOKAHEADS
(?=[\-$]) tells the regex that the text captured by (.*?) should extend only until it reaches a position followed by a dash or a dollar sign where the amount can match. Lookaheads don't consume whatever they look for; they only assert that it follows the current position.
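Applied from Python, the suggested pattern might be used roughly as follows. This is only a sketch: split_row is a hypothetical helper, not part of the question's parse_values, and the rows are taken from the question's test data. Note that the amount is mandatory in this pattern, so a row without an amount does not match at all and still needs the c2/c3 fallback from the question.

```python
import re

# The suggested pattern; group 3 is the text, group 4 the amount.
ROW = re.compile(
    r'(((?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s*[\d]{1,2})\s*){2}'
    r'(.*?)\s*(?=[\-$])([-+]?\$[\d|,]+(?:\.\d+)?)'
)

def split_row(line):
    # Hypothetical helper: returns (text, amount), or None when nothing matches.
    m = ROW.search(line)
    return (m.group(3), m.group(4)) if m else None

rows = [
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK -$80,00,7770.70',
    'MAR 15 MAR 16 LOBLAWS FOODS INC - EAST YORK$2070.70',
    'OCT 7 OCT 8 HURRY CURRY THORNHILL ',
]
for row in rows:
    print(split_row(row))
# ('LOBLAWS FOODS INC - EAST YORK', '-$80,00,7770.70')
# ('LOBLAWS FOODS INC - EAST YORK', '$2070.70')
# None
```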
Hello, I have some messy text that I am unable to process in any good way. I want to match all zip codes (5-digit numbers) in the raw string and append them to a list. My string looks something like this:
string = '''
January 2020
Zip Code
Current Month
Sales Breakdown
(by type)
Last Month Last Year Year-to-Date
95608
Carmichael
95610
Citrus Heights
95621
Citrus Heights
95624
Elk Grove
95626
Elverta
95628
Fair Oaks
95630
Folsom
95632
Galt
95638
Herald
95641
Isleton
95655
Mather
95660
North Highlands
95662
Orangevale
Total Sales
43 REO Sales 0 45
40 43
Median Sales Price $417,000
$0 $410,000 $400,000
$417,000
'''
It can be done with re.findall and the regular expression \b\d{5}\b or even just \d{5}. Let's see an example:
import re
string = '''
January 2020
Zip Code
Current Month
Sales Breakdown
(by type)
Last Month Last Year Year-to-Date
95608
Carmichael
95610
Citrus Heights
95621
Citrus Heights
95624
Elk Grove
95626
Elverta
95628
Fair Oaks
95630
Folsom
95632
Galt
95638
Herald
95641
Isleton
95655
Mather
95660
North Highlands
95662
Orangevale
Total Sales
43 REO Sales 0 45
40 43
Median Sales Price $417,000
$0 $410,000 $400,000
$417,000
'''
regex = r'\b\d{5}\b'
zip_codes = re.findall(regex, string)
You can then get each code from zip_codes. I recommend reading the re documentation and the Regular Expression HOWTO. There are also useful tools for writing and testing regexes, such as Regex101.
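The difference between the two patterns only shows up when the text contains longer runs of digits. A short sketch on a made-up sample string (the parcel number is invented purely for illustration):

```python
import re

# Made-up sample; "1234567" stands in for any number longer than five digits.
sample = "95608 Carmichael, parcel 1234567, $417,000"

# Without word boundaries, any five consecutive digits match,
# even when they sit inside a longer number.
loose = re.findall(r'\d{5}', sample)
print(loose)
# ['95608', '12345']

# With word boundaries, only standalone five-digit numbers match.
strict = re.findall(r'\b\d{5}\b', sample)
print(strict)
# ['95608']
```

For the string in the question both patterns give the same list, because it contains no longer digit runs.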
For next time, I also recommend investigating a bit yourself and trying to do what you want first; then, if you run into an issue, ask about that specific issue. The help pages How do I ask a good question? and How to create a Minimal, Reproducible Example might help you write a good question.
I have a list of strings and wish to find exact phrases.
So far my code finds the month and year only, but the whole phrase including “- Recorded” is needed, like “March 2016 - Recorded”.
How can I add the “- Recorded” to the regex?
import re
texts = [
"Shawn Dookhit took annual leave in March 2016 - Recorded The report",
"Soondren Armon took medical leave in February 2017 - Recorded It was in",
"David Padachi took annual leave in May 2016 - Recorded It says",
"Jack Jagoo",
"Devendradutt Ramgolam took medical leave in August 2016 - Recorded Day back",
"Kate Dudhee",
"Vinaye Ramjuttun took annual leave in - Recorded Answering"
]
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s')
for t in texts:
    try:
        m = regex.search(t)
        print(m.group())
    except AttributeError:
        print("keyword's not found")
You have 2 named groups here, month and year, which capture the month and year from your strings. To capture - Recorded in a named group called recorded, you can do this:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s(?P<recorded>- Recorded)')
Or you can just add - Recorded to your regex without a named group:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s- Recorded')
Or you can add a named group called other that matches a hyphen and one capitalized word:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s(?P<other>- [A-Z][a-z]+)')
I think the first or third option is preferable because you already use named groups. I also recommend the website http://pythex.org/; it really helps with constructing regexes :).
Use a list comprehension with the corrected regex:
regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s* - Recorded')
matches = [match.groups() for text in texts for match in [regex.search(text)] if match]
print(matches)
# [('March', '2016'), ('February', '2017'), ('May', '2016'), ('August', '2016')]
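If the whole phrase is wanted rather than the captured groups, calling group() with no arguments returns the full match. A minimal sketch on a shortened copy of the question's list; the \s+-\s+ spacing is an assumption that tolerates extra whitespace around the hyphen.

```python
import re

texts = [
    "Shawn Dookhit took annual leave in March 2016 - Recorded The report",
    "Jack Jagoo",
    "Vinaye Ramjuttun took annual leave in - Recorded Answering",
]

regex = re.compile(r'(?P<month>[a-zA-Z]+)\s+(?P<year>\d{4})\s+-\s+Recorded')

# group() with no arguments returns the entire matched phrase;
# strings without a month and year are skipped.
phrases = [m.group() for m in map(regex.search, texts) if m]
print(phrases)
# ['March 2016 - Recorded']
```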
I have a batch of raw text files. Each file begins with Date>>month.day year News garbage.
garbage is a whole lot of text I don't need, and varies in length. The words Date>> and News always appear in the same place and do not change.
I want to copy month day year and insert this data into a CSV file, with a new line for every file in the format day month year.
How do I copy month day year into separate variables?
I tried to split a string after a known word and before a known word. I'm familiar with string[x:y], but I basically want to change x and y from numbers into actual words (i.e. string[Date>>:News])
import re, os, sys, fnmatch, csv
folder = input('Drag and drop the folder > ')
for filename in os.listdir(folder):
    # First, avoid system files
    if filename.startswith("."):
        pass
    else:
        # Tell the script the file is in this directory and can be written
        file = open(folder+'/'+filename, "r+")
        filecontents = file.read()
        thestring = str(filecontents)
        print(thestring[9:20])
An example text file:
Date>>January 2. 2012 News 122
5 different news agencies have reported the story of a man washing his dog.
Here's a solution using the re module:
import re
s = "Date>>January 2. 2012 News 122"
m = re.match(r"^Date>>(\S+)\s+(\d+)\.\s+(\d+)", s)
if m:
    month, day, year = m.groups()
    print("{} {} {}".format(month, day, year))
Outputs:
January 2 2012
Edit:
Actually, there's another nicer (imo) solution using re.split described in the link Robin posted. Using that approach you can just do:
month, day, year = re.split(">>| |\. ", s)[1:4]
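To get from the parsed values to the CSV rows in day month year order, the csv module can do the writing. The following is only a sketch under stated assumptions: header_to_row is a hypothetical helper, and an in-memory io.StringIO plus a list of strings stand in for the real output file and the files read from the folder.

```python
import csv
import io
import re

HEADER = re.compile(r"^Date>>(\S+)\s+(\d+)\.\s+(\d+)")

def header_to_row(contents):
    # Hypothetical helper: [day, month, year], or None if the header is missing.
    m = HEADER.match(contents)
    if m is None:
        return None
    month, day, year = m.groups()
    return [day, month, year]

# Stand-ins for the contents of the real files (assumption for this sketch).
files = [
    "Date>>January 2. 2012 News 122\n"
    "5 different news agencies have reported the story of a man washing his dog.",
    "a file without the expected header",
]

out = io.StringIO()  # replace with an opened CSV file in real use
writer = csv.writer(out)
for contents in files:
    row = header_to_row(contents)
    if row is not None:
        writer.writerow(row)

print(out.getvalue())
```

When writing to a real file, the csv documentation recommends opening it with newline='' so that line endings are not translated twice.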
You can use the string method .split(" ") to separate the text into a list split at the space character. Because year and month.day will always be in the same place, you can access them by their position in the resulting list. To separate month and day, use .split again, but this time on the "." character.
Example:
parts = theString.split(" ")
year = parts[1]
month = parts[0].split(".")[0]
day = parts[0].split(".")[1]
You could use string.split:
x = "A b c"
x.split(" ")
Or you could use regular expressions (which I see you import but don't use) with groups. I don't remember the exact syntax offhand, but the pattern is something like r'(.*)(Date>>)(.*)'. It searches for the string "Date>>" between two runs of any other text. The parentheses capture the pieces into numbered groups.