Python replace year mentions like '85 with 1985

I am using this regular expression to replace all occurrences of years of the form '85 with 1985 inside a string:
import re
re.sub("'(\d\d)", "19\1", "Today '45")
but the result I get back is far from expected:
'Today 19\x01'
I would expect to get Today 1945. What is the proper way to do this? Any help is much appreciated.

Make the string a raw string
>>> re.sub(r"'(\d\d)", r"19\1", "Today '45")
'Today 1945'
Or, as Avinash suggests, use a word boundary \b. It is safer because it skips apostrophes followed by more than two digits, such as '3456:
>>> re.sub(r"'(\d{2})\b", r"19\1", "Today '45, '3456")
"Today 1945, '3456"

Reference the group with \g<1> instead of \1:
In [21]: re.sub("'(\d\d)", "19\g<1>", "Today '45")
Out[21]: 'Today 1945'
or use raw strings:
In [22]: re.sub("'(\d\d)", r"19\1", "Today '45")
Out[22]: 'Today 1945'
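Your code isn't working because, in a normal (non-raw) string literal, Python interprets \1 as an octal escape for the single character \x01, so the regex engine never sees a backreference.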
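To make the difference concrete, here is a minimal sketch that compares the two string literals and wraps the fix in a helper. The function name expand_years and the default "19" century prefix are assumptions for illustration, not part of the answers above.

```python
import re

# In a normal string literal "\1" is an octal escape: one character, code point 1.
assert "19\1" == "19\x01" and len("19\1") == 3
# In a raw string the backslash survives, so re.sub sees a backreference.
assert len(r"19\1") == 4


def expand_years(text, century="19"):
    """Replace 'NN (apostrophe plus exactly two digits) with <century>NN."""
    # \g<1> is the explicit form of \1; it stays unambiguous if digits ever follow it.
    return re.sub(r"'(\d{2})\b", century + r"\g<1>", text)


print(expand_years("Today '45, '3456"))  # Today 1945, '3456
```

The word boundary keeps '3456 untouched, as in the second example above.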

Related

regex to match several string in Python

I have a question about regex in Python.
Possible variations can be
10 hours, 12 weeks or 7 business days.
I want to have my regex something like
string = "I have an 7 business day trip and 12 weeks vacation."
re.findall(r'\d+\s(business)?\s(hours|weeks|days)', string)
so that I expect to find "7 business day" and "12 weeks" but it returns None
string = "I have an 7 business day trip and 12 weeks vacation."
print(re.findall(r'\d+\s(?:business\s)?(?:hour|week|day)s?', string))
['7 business day', '12 weeks']
\d+\s(?:business\s)?(?:hour|week|day)s?
Yours didn't match for two reasons: it requires a plural unit (hours, weeks or days), so "7 business day" fails, and it always requires two whitespace characters between the number and the unit, so "12 weeks" fails as well.
Although if you don't want to accept business week/hour, you'll need to modify it further:
\d+\s(?:hour|week|(?:business )?day)s?
You need to tweak your regex to this:
>>> string = "I have an 7 business day trip and 12 weeks vacation."
>>> print(re.findall(r'(\d+)\s*(?:business day|hour|week)s?', string))
['7', '12']
This matches any number that is followed by business day or hour or week and an optional s in the end.
Similar to @anubhava's answer, but this matches "7 business day" rather than just "7". Just move the closing parenthesis from after \d+ to the end:
re.findall(r'(\d+\s*(?:business day|hour|week)s?)', string)
\d+\s+(business\s)?(hour|week|day)s?
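The answers above differ mainly in whether their groups capture, which changes what re.findall returns. The sketch below runs the question's pattern and two variants side by side; the variable names are illustrative only. Note that re.findall returns an empty list, not None, when nothing matches.

```python
import re

text = "I have an 7 business day trip and 12 weeks vacation."

# Original pattern: needs a plural unit and two whitespace characters, so nothing matches.
original = re.findall(r'\d+\s(business)?\s(hours|weeks|days)', text)

# Capturing groups: findall returns one tuple of group contents per match.
capturing = re.findall(r'\d+\s+(business\s)?(hour|week|day)s?', text)

# Non-capturing groups: findall returns the whole matched substring.
non_capturing = re.findall(r'\d+\s+(?:business\s)?(?:hour|week|day)s?', text)

print(original)       # []
print(capturing)      # [('business ', 'day'), ('', 'week')]
print(non_capturing)  # ['7 business day', '12 weeks']
```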

Match to second regular expression if first has no matches

I'm attempting to extract the text between HTML tags using regex in python. The catch is that sometimes there are no HTML tags in the string, so I want my regex to match the entire string. So far, I've got the part that matches the inner text of the tag:
(?<=>).*(?=<\/)
This would match to Russia in the tag below
<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>
Alternately, the entire string would be matched:
Typhoon Vongfong prompted ANA to cancel 101 flights, affecting about 16,600 passengers, the airline said in a faxed statement. Japan Airlines halted 31 flights today and three tomorrow, it said by fax. The storm turned northeast after crossing Okinawa, Japan’s southernmost prefecture, with winds gusting to 75 knots (140 kilometers per hour), according to the U.S. Navy’s Joint Typhoon Warning Center.
Otherwise I want it to return all the text in the string.
I've read a bit about regex conditionals online, but I can't seem to get them to work. If anyone can point me in the right direction, that would be great. Thanks in advance.
You could do this with a single regex. You don't need to go for any workaround.
>>> import re
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]+(?=</)|^(?!.*?>.*?</).*', s, re.M)
['This is Russia Today']
Here is a work-around. Instead of adjusting the regex, we adjust the string:
>>> s='<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['Russia']
>>> s='This is Russia Today'
>>> re.findall(r'(?<=>)[^<>]*(?=<\/)', s if '>' in s else '>%s</' % s)
['This is Russia Today']
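If a single pattern is not a hard requirement, the fallback can also be written in plain Python: try the tag pattern first and return the input unchanged when it finds nothing. This is only a sketch; the names INNER_TEXT and inner_text_or_all are made up for illustration, and it assumes the simple, non-nested tags shown in the question.

```python
import re

# Text between a '>' and the next '</', with no tags inside.
INNER_TEXT = re.compile(r'>([^<>]+)</')


def inner_text_or_all(s):
    """Return the text inside the first tag pair, or the whole string if none is found."""
    match = INNER_TEXT.search(s)
    return match.group(1) if match else s


link = '<a density="sparse" href="http://topics.bloomberg.com/russia/">Russia</a>'
print(inner_text_or_all(link))                    # Russia
print(inner_text_or_all('This is Russia Today'))  # This is Russia Today
```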

Python deal with date format like "1st, 2nd, 3rd, 4th"

I want to deal with those strings like:
"I will meet you at 1st."
"5th... OK, 5th?"
"today is 2nd\n"
"Aug.3rd"
I want to replace the "st|nd|rd|th" suffix with a corresponding string, which is actually an XML tag, so that "1st, 2nd, 3rd, 4th" are rendered as superscripts:
1<Font Script=”super”>st</Font>
5<Font Script=”super”>th</Font> ... OK, 5<Font Script=”super”>th</Font>?
Like this
Use re module to identify the date patterns and replace them.
>>> re.sub(r"([0123]?[0-9])(st|th|nd|rd)",r"\1<sup>\2</sup>","Meet you on 5th")
'Meet you on 5<sup>th</sup>'
Regex demo: http://regexr.com/38lao
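The same substitution can be adapted to the Font tag from the question. The sketch below is one possible variant, not the answerer's code: the names ORDINAL and superscript_ordinals are hypothetical, the attribute is written with straight quotes, and word boundaries are added so that only one- or two-digit ordinals are touched. It does not check that the suffix fits the number (it would also tag "1th").

```python
import re

# One or two digits followed by an ordinal suffix, as a standalone token.
ORDINAL = re.compile(r'\b(\d{1,2})(st|nd|rd|th)\b')


def superscript_ordinals(text):
    """Wrap the ordinal suffix in the Font tag used in the question."""
    return ORDINAL.sub(r'\1<Font Script="super">\2</Font>', text)


for sample in ["I will meet you at 1st.", "5th... OK, 5th?", "today is 2nd\n", "Aug.3rd"]:
    print(superscript_ordinals(sample))
```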

Python regex: How to match a number with a non-number?

I want to split a number from the non-digit characters next to it.
Example
Input:
we spend 100year
Output:
we spend 100 year
Input:
today i'm200 pound
Output:
today i'm 200 pound
Input:
he maybe have212cm
Output:
he maybe have 212 cm
I tried re.sub(r'(?<=\S)\d', ' \d', string) and re.sub(r'\d(?=\S)', '\d ', string), which doesn't work.
This will do it:
import re

ins = '''\
we spend 100year
today i'm200 pound
he maybe have212cm'''
for line in ins.splitlines():
    # Replace each digit run and any whitespace around it with " <digits> "
    line = re.sub(r'\s*(\d+)\s*', r' \1 ', line)
    print(line)
Prints:
we spend 100 year
today i'm 200 pound
he maybe have 212 cm
Same syntax for multiple matches in the same line of text:
>>> re.sub(r'\s*(\d+)\s*',r' \1 ', "we spend 100year + today i'm200 pound")
"we spend 100 year + today i'm 200 pound"
Capturing groups are numbered left to right by their opening parenthesis, and \number refers to the corresponding group in the match:
>>> re.sub(r'(\d)(\d)(\d)',r'\2\3\1','567')
'675'
If it is easier to read, you can name your capturing groups rather than using the \1 \2 notation:
>>> line="we spend 100year today i'm200 pound"
>>> re.sub(r'\s*(?P<nums>\d+)\s*',r' \g<nums> ',line)
"we spend 100 year today i'm 200 pound"
With s = 'he maybe have212cm', this takes care of one case:
>>> re.sub(r'([a-zA-Z])(?=\d)',r'\1 ',s)
'he maybe have 212cm'
And this takes care of the other:
>>> re.sub(r'(?<=\d)([a-zA-Z])',r' \1',s)
'he maybe have212 cm'
Hopefully someone with more regex experience than me can figure out how to combine them ...
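One way to combine the two cases is to match the empty position between a letter and a digit (in either order) using only lookarounds, and substitute a single space there. This is a sketch under the same ASCII-letter assumption as the two patterns above; the names BOUNDARY and space_out_numbers are illustrative.

```python
import re

# An empty position with a letter on one side and a digit on the other.
BOUNDARY = re.compile(r'(?<=[a-zA-Z])(?=\d)|(?<=\d)(?=[a-zA-Z])')


def space_out_numbers(text):
    """Insert a space wherever a letter touches a digit."""
    return BOUNDARY.sub(' ', text)


for sample in ["we spend 100year", "today i'm200 pound", "he maybe have212cm"]:
    print(space_out_numbers(sample))
```

Because nothing but the empty position is matched, existing spaces are left alone and no trailing space is added.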

Extracting sub-string after the first space in Python

I need help with regex or Python to extract a substring from each of a set of strings. The strings consist of alphanumeric characters. I just want the substring that starts after the first space and ends before the last space, as in the example given below.
Example 1:
A:01 What is the date of the election ?
BK:02 How long is the river Nile ?
Results:
What is the date of the election
How long is the river Nile
While I am at it, is there an easy way to extract the part of a string before or after a certain character? For example, I want to extract the date or day from strings like the ones given in Example 2.
Example 2:
Date:30/4/2013
Day:Tuesday
Results:
30/4/2013
Tuesday
I have actually read about regex but it's very alien to me. Thanks.
I recommend using split
>>> s="A:01 What is the date of the election ?"
>>> " ".join(s.split()[1:-1])
'What is the date of the election'
>>> s="BK:02 How long is the river Nile ?"
>>> " ".join(s.split()[1:-1])
'How long is the river Nile'
>>> s="Date:30/4/2013"
>>> s.split(":")[1:][0]
'30/4/2013'
>>> s="Day:Tuesday"
>>> s.split(":")[1:][0]
'Tuesday'
>>> s="A:01 What is the date of the election ?"
>>> s.split(" ", 1)[1].rsplit(" ", 1)[0]
'What is the date of the election'
There's no need to dig into regex if this is all you need; you can use str.partition
s = "A:01 What is the date of the election ?"
before,sep,after = s.partition(' ') # could be, eg, a ':' instead
If all you want is the last part, you can use _ as a placeholder for 'don't care':
_,_,theReallyAwesomeDay = s.partition(':')
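As a rough illustration of how str.partition and str.rpartition cover both examples from the question, here is a short sketch. The variable names are made up for the example.

```python
question = "A:01 What is the date of the election ?"
# Drop everything up to the first space, then everything from the last space on.
middle = question.partition(" ")[2].rpartition(" ")[0]
print(middle)  # What is the date of the election

# Split a "key:value" string at the first colon.
key, sep, value = "Date:30/4/2013".partition(":")
print(value)  # 30/4/2013

# When the separator is missing, partition returns the whole string first,
# followed by two empty strings, so no exception handling is needed.
head, sep_missing, tail = "Tuesday".partition(":")
print((head, sep_missing, tail))  # ('Tuesday', '', '')
```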
