Insert space to separate conjoined alpha and numeric strings - Python RegEx - python

In Python, I need to create a regex that inserts a space between any concatenated AlphaNum combinations. For example, this is what I want:
8min15sec ==> 8 min 15 sec
7m12s ==> 7 m 12 s
15mi25s ==> 15 mi 25 s
RegEx101 demo
I am blundering around with solutions found online, but they are a bit too complex for me to parse/modify. For example, I have this:
[a-zA-Z][a-zA-Z\d]*
but it only identifies the first insertion point: 8Xmin15sec (the X)
And this
(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Z])
but it only finds this point: 8minX15sec (the X)
I could sure use a hand with the full syntax for finding each insertion point and inserting the spaces.

How about the following approach:
import re

for test in ['8min15sec', '7m12s', '15mi25s']:
    # Append a space to every run of digits or non-digits, then trim the last one
    print(re.sub(r'(\d+|\D+)', r'\1 ', test).strip())
Which would give you:
8 min 15 sec
7 m 12 s
15 mi 25 s
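The same idea can also be written without a substitution. As a minimal sketch (the helper name space_runs is only illustrative and not part of the answer), collect the alternating runs with re.findall and join them:

```python
import re

def space_runs(text):
    """Join maximal runs of digits and non-digits with single spaces."""
    # \d+|\D+ matches each run of digits, or each run of non-digits, in turn.
    return ' '.join(re.findall(r'\d+|\D+', text))

for test in ['8min15sec', '7m12s', '15mi25s']:
    print(space_runs(test))
```

This avoids the trailing space that the substitution has to strip afterwards.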

You can use this regex, which matches the zero-width positions on the boundary between digits and letters, in either order (digits followed by letters, or letters followed by digits):
(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)
The first alternative, (?<=\d)(?=[a-zA-Z]), uses a positive lookbehind for a digit and a positive lookahead for a letter.
Similarly, (?<=[a-zA-Z])(?=\d) does the same in the opposite order.
Then just replace each matched position with a space.
Demo
Here is sample Python code for the same:
import re

arr = ['8min15sec', '7m12s', '15mi25s']
for s in arr:
    print(s + ' --> ' + re.sub(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ', s))
Which prints the following output:
8min15sec --> 8 min 15 sec
7m12s --> 7 m 12 s
15mi25s --> 15 mi 25 s
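One practical difference from a run-based substitution such as (\d+|\D+) shows up when the input already contains spaces or punctuation. The sketch below is only an illustration (the sample string and the variable names are made up for the comparison): the lookaround pattern inserts spaces solely at letter/digit boundaries, so existing separators are left alone.

```python
import re

BOUNDARY = re.compile(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)')

sample = '8min, 15sec'  # hypothetical input that already has a separator

# Lookarounds match zero-width positions, so nothing is consumed or duplicated.
by_boundary = BOUNDARY.sub(' ', sample)

# The run-based approach appends a space after every run, including 'min, '.
by_runs = re.sub(r'(\d+|\D+)', r'\1 ', sample).strip()

print(repr(by_boundary))  # '8 min, 15 sec'
print(repr(by_runs))      # '8 min,  15 sec' (note the doubled space)
```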

How about:
"(\d+)([a-zA-Z]+)"
to
"\1 \2 "
https://regex101.com/r/yvqCtQ/2
And in python:
In [59]: re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', '8min15sec')
Out[59]: '8 min 15 sec '
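Note the trailing space in the output above: the replacement template always ends with a space. A small sketch of one way to tidy that up, assuming that stripping trailing whitespace is acceptable for your data (the function name split_units is illustrative):

```python
import re

def split_units(text):
    # Same substitution as above; rstrip() drops the space added after the last unit.
    return re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', text).rstrip()

print(repr(split_units('8min15sec')))  # '8 min 15 sec'
```

Because the pattern requires digits followed by letters, a string that starts with letters (for example 'min8sec') only gets a space after the 8.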

Related

Dealing with comma and fullstops as per convention

I have various instance of strings such as:
- hello world,i am 2000to -> hello world, i am 2000 to
- the state was 56,869,12th -> the state was 56,869, 12th
- covering.2% -> covering. 2%
- fiji,295,000 -> fiji, 295,000
For dealing with first case, I came up with two step regex:
re.sub(r"(?<=[,])(?=[^\s])(?=[^0-9])", r" ", text) # hello world, i am 20,000to
re.sub(r"(?<=[0-9])(?=[.^[a-z])", r" ", text) # hello world, i am 20,000 to
But this breaks the text in other ways, and the other cases are not covered either. Can anyone suggest a more general regex that handles all of these cases? I've tried using replace, but it makes some unintended replacements, which in turn cause other problems. I'm not a regex expert and would appreciate pointers.
This approach covers your cases above by breaking the text into tokens:
import re

in_list = [
    'hello world,i am 2000to',
    'the state was 56,869,12th',
    'covering.2%',
    'fiji,295,000',
    'and another example with a decimal 12.3not4,5 is right out',
    'parrot,, is100.00% dead',
    'Holy grail runs for this portion of 100 minutes,!, 91%. Fascinating'
]
tokenizer = re.compile(r'[a-zA-Z]+[\.,]?|(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?(?:%|st|nd|rd|th)?[\.,]?')
for s in in_list:
    print(' '.join(re.findall(pattern=tokenizer, string=s)))
# hello world, i am 2000 to
# the state was 56,869, 12th
# covering. 2%
# fiji, 295,000
# and another example with a decimal 12.3 not 4, 5 is right out
# parrot, is 100.00% dead
# Holy grail runs for this portion of 100 minutes, 91%. Fascinating
Breaking up the regex, each token is the longest available substring with:
Only letters, with or without a trailing period or comma: [a-zA-Z]+[\.,]?
OR |
A number-ish expression, which could be
1 to 3 digits \d{1,3} followed by one or more groups of comma + 3 digits (?:,\d{3})+
OR | any number of comma-free digits \d+
optionally a decimal point followed by at least one digit (?:\.\d+)?
optionally a suffix (percent, 'st', 'nd', 'rd', 'th') (?:%|st|nd|rd|th)?
optionally a period or comma [\.,]?
Note the (?:blah) is used to suppress re.findall's natural desire to tell you how every parenthesized group matches up on an individual basis. In this case we just want it to walk forward through the string, and the ?: accomplishes this.
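To make the effect of (?:...) concrete, here is a small sketch comparing a capturing and a non-capturing version of a simplified number pattern. The sample text and the simplified pattern are assumptions chosen for illustration; they are not the full tokenizer above.

```python
import re

text = '12.5% of 1,000'

# With capturing groups, findall returns one tuple of group contents per match.
capturing = re.findall(r'(\d{1,3}(,\d{3})+|\d+)(\.\d+)?%?', text)

# With non-capturing groups, findall returns the whole matched substring.
non_capturing = re.findall(r'(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?%?', text)

print(capturing)      # [('12', '', '.5'), ('1,000', ',000', '')]
print(non_capturing)  # ['12.5%', '1,000']
```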

Using regex to split a column

The regex I am using is \d+-\d+, but I'm not quite sure about how to separate the Roman numbers and how to create a new column with them.
I have this dataset:
Date_Title                       Date  Copies
05-21 I. Don Quixote             1605  252
21-20 IV. Macbeth                1629  987
10-12 ML. To Kill a Mockingbird  1960  478
12 V. Invisible Man              1897  136
Basically, I would like to split the "Date Title", so, when I print a row, I would get this:
('05-21 I', 'I', 'Don Quixote', 1605, 252)
Or
('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)
That is: first, the numbers together with the Roman numeral; second, only the Roman numeral; third, the name; the fourth and fifth values stay the same as in the dataset.
You can use
import pandas as pd

df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman  Roman                   Name  Date  Copies
0   05-21 I      I            Don Quixote  1605     252
1  21-20 IV     IV                Macbeth  1629     987
2  10-12 ML     ML  To Kill a Mockingbird  1960     478
3      12 V      V          Invisible Man  1897     136
See the regex demo. Details:
^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
\d+(?:-\d+)? - one or more digits followed with an optional sequence of a - and one or more digits
\s* - zero or more whitespaces
(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible
Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.
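To see what each group captures without pandas in the picture, the same pattern can be tried on plain strings with the standard re module. This is only a sketch for checking the groups (the variable names rows and parsed are illustrative); the pandas code above is what actually builds the columns.

```python
import re

rx = re.compile(
    r'^(\d+(?:-\d+)?\s*'                                             # digits, optional -digits
    r'(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))'  # Roman numeral
    r'\.\s*(.*)'                                                     # dot, then the name
)

rows = [
    '05-21 I. Don Quixote',
    '21-20 IV. Macbeth',
    '10-12 ML. To Kill a Mockingbird',
    '12 V. Invisible Man',
]

# groups() returns (NumRoman, Roman, Name) for each row.
parsed = [rx.match(row).groups() for row in rows]
for item in parsed:
    print(item)
```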
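I am pretty sure there is a more optimal solution, but this would be a quick way of solving it (note that it replaces Date_Title with a tuple and keeps the trailing dot on the Roman numeral):
df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0], x.split()[1], ' '.join(x.split()[2:])))
Or, as separate columns (the column names here are only illustrative):
df['Num'] = df['Date_Title'].str.split().str[0]
df['Roman'] = df['Date_Title'].str.split().str[1]
df['Name'] = df['Date_Title'].str.split().str[2:].str.join(' ')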
Focusing on the string split:
string = "21-20 IV. Macbeth"
i = string.index(".") # Finds the first point
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:] # Macbeth
df = df.assign(x=df['Date_Title'].str.split(r'\.').str[0], y=df['Date_Title'].str.extract(r'(\w+(?=\.))'), z=df['Date_Title'].str.split(r'\.').str[1:].str.join(','))

Regular expression in Python, 2-3 numbers then 2 letters

I am trying to do autodetection of bra size in a list of clothes. While I managed to extract only the bra items, I am now looking at extracting the size information and I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find on another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
Any letters (lower or upper case) between A and F
and then another letter for the two special case AA and FF or the end of the string.
My question is: is there a way to require the second letter to match the first (AA or FF)? As it stands, my code outputs sizes such as BA and CA, which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" returns "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!
You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits and an optional space (prefix the pattern with a space, as in the question, to require one before the digits)
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
Regex demo
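A minimal sketch of how the pattern could be applied to the descriptions from the question. The helper name find_size is an assumption for illustration, and the pattern is used exactly as written above.

```python
import re

# Pattern from the answer above, unchanged.
SIZE = re.compile(r'\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])')

def find_size(description):
    """Return the first size-like match, or None when there is none."""
    m = SIZE.search(description)
    return m.group() if m else None

samples = [
    "bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain",
    "puma, sport-bh, strl: 34cd, svart/grå",
    "victoria's secret, bh, strl: 32c, gul/vit",
    "pink victorias secret bh 75dd burgundy",
]
for s in samples:
    print(find_size(s))  # None, None, '32c', '75dd'
```

The backreference compares the captured text exactly, so a mixed-case pair such as 75Dd is not accepted by this pattern as written.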

Find date from image/text

I have dates like this and I need regex to find these types of dates
12-23-2019
29 10 2019
1:2:2018
9/04/2019
22.07.2019
Here's what I did.
First I removed all spaces from the text, which leaves it looking like this:
12-23-2019291020191:02:2018
And this is my regex:
re.findall(r'((\d{1,2})([.\/-])(\d{2}|\w{3,9})([.\/-])(\d{4}))',new_text)
It finds 12-23-2019, 9/04/2019 and 22.07.2019, but not 29 10 2019 or 1:02:2018.
You may use
(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)
See the regex demo
Details
(?<!\d) - no digit right before
\d{1,2} - 1 or 2 digits
([.:/ -]) - a dot, colon, slash, space or hyphen (captured in Group 1)
(?:\d{1,2}|\w{3,}) - 1 or 2 digits or 3 or more word chars
\1 - same value as in Group 1
\d{4} - four digits
(?!\d) - no digit allowed right after
Python sample usage:
import re
text = 'Aaaa 12-23-2019, bddd 29 10 2019 <=== 1:2:2018'
pattern = r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)'
results = [x.group() for x in re.finditer(pattern, text)]
print(results) # => ['12-23-2019', '29 10 2019', '1:2:2018']
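The backreference \1 and the digit lookarounds are what keep the pattern strict. The following sketch probes a few hand-made candidates (these strings are assumptions for illustration, not from the question) to show which ones are accepted:

```python
import re

pattern = re.compile(r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)')

candidates = [
    '22.07.2019',   # same separator twice: accepted
    '12-Dec-2019',  # \w{3,} lets a month name through: accepted
    '12-23/2019',   # mixed separators: rejected because of \1
    '112-23-2019',  # three leading digits: rejected because of (?<!\d)
]

accepted = {c: bool(pattern.search(c)) for c in candidates}
for c, ok in accepted.items():
    print(c, '->', ok)
```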

ReGex for surrounding numbers with whitespaces

I would like to find a Regex to convert string like the following one:
wienerstr256pta 18 graz austria8051 4
Into the following one:
wienerstr 256 pta 18 graz austria 8051 4
So I just want to surround every number set between spaces.
I know I can easily find the digits by:
/[0-9]+/g
But how can I replace this match with the same content plus extra whitespaces?
You may find all the positions between a non-digit/non-whitespace and a digit, or between a digit and a non-digit/non-whitespace and insert a space there:
(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])
Replace with a space.
See the regex demo.
Details
(?<=[^0-9\s]) - matches a position that is immediately preceded with a char other than a digit and a whitespace...
(?=[0-9]) - and is followed with a digit
| - or
(?<=[0-9]) - matches a position immediately preceded with a digit and
(?=[^0-9\s]) - followed with a char other than a digit and a whitespace.
A Pandas test:
>>> import pandas as pd
>>> col_list = ['wienerstr256pta 18 graz austria8051 4']
>>> rx = r'(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])'
>>> df = pd.DataFrame(col_list, columns=['col'])
>>> df['col'] = df['col'].replace(rx, " ", regex=True)
>>> df['col']
0    wienerstr 256 pta 18 graz austria 8051 4
Name: col, dtype: object
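If pandas is not needed, the same pattern should work with the standard re module as well. A minimal sketch (the function name surround_numbers is illustrative):

```python
import re

rx = re.compile(r'(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])')

def surround_numbers(text):
    # Insert a space at each digit/non-digit boundary that is not already whitespace.
    return rx.sub(' ', text)

print(surround_numbers('wienerstr256pta 18 graz austria8051 4'))
# wienerstr 256 pta 18 graz austria 8051 4
```

Because whitespace is excluded in both character classes, numbers that are already surrounded by spaces are left untouched.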
echo "wienerstr256pta18graz austria8051 4" \
| sed -r "s/([^0-9])([0-9])/\1 \2/g;s/([0-9])([^0-9])/\1 \2/g;s/ +/ /g"
wienerstr 256 pta 18 graz austria 8051 4
Insert a blank at every transition from non-digit to digit and from digit to non-digit. Since a blank is itself a non-digit, this can produce runs of blanks, so the last substitution condenses each run into a single blank.
To preserve runs of blanks that are already in the input, exclude the blank from the non-digit classes instead:
echo "wienerstr256pta18graz austria8051 4" | sed -r "s/([^0-9 ])([0-9])/\1 \2/g;s/([0-9])([^0-9 ])/\1 \2/g;"
wienerstr 256 pta 18 graz austria 8051 4
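For readers working in Python rather than the shell, the first sed pipeline translates fairly directly into three re.sub calls. This is a sketch of an equivalent, not something from the answer, and the function name is made up:

```python
import re

def space_numbers_sed_style(text):
    # 1. non-digit followed by digit: put a blank between them
    text = re.sub(r'([^0-9])([0-9])', r'\1 \2', text)
    # 2. digit followed by non-digit: put a blank between them
    text = re.sub(r'([0-9])([^0-9])', r'\1 \2', text)
    # 3. condense any run of blanks into one
    return re.sub(r' +', ' ', text)

print(space_numbers_sed_style('wienerstr256pta18graz austria8051 4'))
# wienerstr 256 pta 18 graz austria 8051 4
```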
