I would like to find a regex that converts a string like the following one:
wienerstr256pta 18 graz austria8051 4
Into the following one:
wienerstr 256 pta 18 graz austria 8051 4
So I just want to surround every run of digits with spaces.
I know I can easily find the digits by:
/[0-9]+/g
But how can I replace this match with the same content plus extra whitespaces?
You may find all the positions between a non-digit/non-whitespace and a digit, or between a digit and a non-digit/non-whitespace and insert a space there:
(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])
Replace with a space.
See the regex demo.
Details
(?<=[^0-9\s]) - matches a position that is immediately preceded with a char other than a digit and a whitespace...
(?=[0-9]) - and is followed with a digit
| - or
(?<=[0-9]) - matches a position immediately preceded with a digit and
(?=[^0-9\s]) - followed with a char other than a digit and a whitespace.
A Pandas test:
>>> import pandas as pd
>>> col_list = ['wienerstr256pta 18 graz austria8051 4']
>>> rx = r'(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])'
>>> df = pd.DataFrame(col_list, columns=['col'])
>>> df['col'] = df['col'].replace(rx, " ", regex=True)
>>> df['col']
0 wienerstr 256 pta 18 graz austria 8051 4
Name: col, dtype: object
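For a single string outside pandas, the same pattern can be handed to re.sub. This is only a minimal sketch; the helper name space_out_digits is made up for illustration and is not part of the answer above.

```python
import re

# Same lookaround pattern as in the answer: the zero-width positions between
# a digit and a non-digit, non-whitespace character, in either order.
BOUNDARY = re.compile(r'(?<=[^0-9\s])(?=[0-9])|(?<=[0-9])(?=[^0-9\s])')

def space_out_digits(text):
    """Insert one space at every digit/non-digit boundary (hypothetical helper)."""
    return BOUNDARY.sub(' ', text)

print(space_out_digits('wienerstr256pta 18 graz austria8051 4'))
# wienerstr 256 pta 18 graz austria 8051 4
```

Because the pattern only matches positions that are not already next to whitespace, spaces present in the input are left untouched.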
echo "wienerstr256pta18graz austria8051 4" \
| sed -r "s/([^0-9])([0-9])/\1 \2/g;s/([0-9])([^0-9])/\1 \2/g;s/ +/ /g"
wienerstr 256 pta 18 graz austria 8051 4
Replace every transition from digit to non-digit, or from non-digit to digit, with the same two characters and a blank in between. Condense runs of blanks into one at the end, since a blank is a non-digit too and therefore picks up an extra blank next to a digit.
To keep multiple blanks that might be in the input intact, exclude the blank from the non-digit class:
echo "wienerstr256pta18graz austria8051 4" | sed -r "s/([^0-9 ])([0-9])/\1 \2/g;s/([0-9])([^0-9 ])/\1 \2/g;"
wienerstr 256 pta 18 graz austria 8051 4
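The second sed command translates to Python almost literally. The sketch below assumes plain re.sub with capture groups; the function name is hypothetical, and the input is a made-up variant of the example with a double blank.

```python
import re

def space_out_digits_groups(text):
    # Hypothetical helper mirroring the second sed command: capture the two
    # characters on either side of a boundary and re-emit them with a blank.
    # Blanks are excluded from the non-digit class, so existing runs of
    # blanks are not touched.
    text = re.sub(r'([^0-9 ])([0-9])', r'\1 \2', text)
    text = re.sub(r'([0-9])([^0-9 ])', r'\1 \2', text)
    return text

# The double blank between "graz" and "austria" survives.
print(space_out_digits_groups('wienerstr256pta18graz  austria8051 4'))
# wienerstr 256 pta 18 graz  austria 8051 4
```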
I have a data frame where one column holds string values and the other holds integers, but the values are polluted: some contain special characters, and some string values contain digits (or the integer values contain letters). To clean them I used regex. It mostly works, but in the integer column a value like 'abc123' keeps its 'abc', and in the string column a value like '123abc' keeps its '123'. I don't know whether the pattern or the code is wrong. Below is my code:
import re
import pandas as pd
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)
           str      int
0          abc      123
1        gbc#*    23abc
2       abc123   abc200
3       124abc   1230&*
4  abcer£$%&*!  230!?*&
num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)
Output:
correct_int: ['23', '1230', '230']
As you can see, it cleaned '23abc', '1230&*' and '230!?*&' but not 'abc200', where the letters come first.
correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']
Here it cleaned every value, but sometimes it does not when the value is like '124abc'.
Is my pattern wrong? I have also tried different patterns, but nothing worked.
I am removing the digits and special characters in column 'str', and the letters and special characters in column 'int'.
Expected output:
After cleaning and replacing the old values with the cleaned ones, the output should look like this:
     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230
You can do it with
df1['str'] = df1['str'].str.replace(r"[\d\W]+", '', regex=True)  # removes digits (\d) and non-word characters (\W)
df1['int'] = df1['int'].str.replace(r"\D+", '', regex=True)  # removes every non-digit character (like [^0-9])
Returns:
     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230
Try the following:
\D matches any non-digit character; replace those with the empty string '' in the int column.
[^a-zA-Z] matches any character outside the ranges a-z and A-Z; replace those with the empty string '' in the str column.
Apply each substitution to its column using .apply() and a lambda function:
import pandas as pd
import re
d = [['abc','123'],['gbc#*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
df1['int'] = df1['int'].apply(lambda r: re.sub(r'\D', '', r))  # raw string avoids the invalid-escape warning
df1['str'] = df1['str'].apply(lambda r: re.sub(r'[^a-zA-Z]', '', r))
print(df1)
Output:
     str   int
0    abc   123
1    gbc    23
2    abc   200
3    abc  1230
4  abcer   230
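The two answers give the same result on the sample data but differ on edge cases: \W treats the underscore and non-ASCII letters as word characters, so a \W-based class leaves them in place, while [^a-zA-Z] removes them. A small sketch with plain re; the sample values and function names are made up and do not come from the question.

```python
import re

def clean_word_based(value):
    # Removes digits and non-word characters, as in the first answer.
    return re.sub(r'[\d\W]+', '', value)

def clean_ascii_only(value):
    # Removes everything except ASCII letters, as in the second answer.
    return re.sub(r'[^a-zA-Z]+', '', value)

for sample in ['abc_1', 'café9', 'gbc#*']:  # hypothetical inputs
    print(repr(sample), '->', repr(clean_word_based(sample)), repr(clean_ascii_only(sample)))
# 'abc_1' -> 'abc_' 'abc'
# 'café9' -> 'café' 'caf'
# 'gbc#*' -> 'gbc' 'gbc'
```

Which behaviour is right depends on whether accented letters and underscores count as valid content in the column.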
The regex I am using is \d+-\d+, but I'm not quite sure how to separate the Roman numerals and how to create a new column with them.
I have this dataset:
Date_Title                       Date  Copies
05-21 I. Don Quixote             1605  252
21-20 IV. Macbeth                1629  987
10-12 ML. To Kill a Mockingbird  1960  478
12 V. Invisible Man              1897  136
Basically, I would like to split the "Date_Title" column so that, when I print a row, I get this:
('05-21 I', 'I', 'Don Quixote', 1605, 252)
Or
('10-12 ML', 'ML', 'To Kill a Mockingbird',1960, 478)
The first element holds the numbers and the Roman numeral; the second, only the Roman numeral; the third, the name; the fourth and fifth are the same as in the dataset.
You can use
import pandas as pd
df = pd.DataFrame({'Date_Title':['05-21 I. Don Quixote','21-20 IV. Macbeth','10-12 ML. To Kill a Mockingbird','12 V. Invisible Man'], 'Date':[1605,1629,1960,1897], 'Copies':[252,987,478,136]})
rx = r'^(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})))\.\s*(.*)'
df[['NumRoman','Roman','Name']] = df.pop('Date_Title').str.extract(rx)
df = df[['NumRoman','Roman','Name', 'Date', 'Copies']]
>>> df
   NumRoman Roman                   Name  Date  Copies
0   05-21 I     I            Don Quixote  1605     252
1  21-20 IV    IV                Macbeth  1629     987
2  10-12 ML    ML  To Kill a Mockingbird  1960     478
3      12 V     V          Invisible Man  1897     136
See the regex demo. Details:
^ - start of string
(\d+(?:-\d+)?\s*(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3}))) - Group 1 ("NumRoman"):
\d+(?:-\d+)? - one or more digits followed with an optional sequence of a - and one or more digits
\s* - zero or more whitespaces
(M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})) - Group 2 ("Roman"): see How do you match only valid roman numerals with a regular expression? for explanation
\. - a dot
\s* - zero or more whitespaces
(.*) - Group 3 ("Name"): any zero or more chars other than line break chars, as many as possible
Note df.pop('Date_Title') removes the Date_Title column and yields it as input for the extract method. df = df[['NumRoman','Roman','Name', 'Date', 'Copies']] is necessary if you need to keep the original column order.
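To check the pattern on single strings without a DataFrame, it can be rewritten with named groups and used through re directly. This is a sketch only; the constant and function names are assumptions, and the group names simply reuse the column names from the answer. (As far as I know, str.extract also turns named groups into column names, but verify that against your pandas version.)

```python
import re

# Roman-numeral subpattern copied from the answer above.
ROMAN = r'M{0,4}(?:C[MD]|D?C{0,3})(?:X[CL]|L?X{0,3})(?:I[XV]|V?I{0,3})'

# Same structure as the answer's regex, with named instead of numbered groups.
RX = re.compile(
    rf'^(?P<NumRoman>\d+(?:-\d+)?\s*(?P<Roman>{ROMAN}))\.\s*(?P<Name>.*)'
)

def split_title(value):
    """Return the three parts as a dict, or None if the value does not match."""
    m = RX.match(value)
    return m.groupdict() if m else None

print(split_title('10-12 ML. To Kill a Mockingbird'))
# {'NumRoman': '10-12 ML', 'Roman': 'ML', 'Name': 'To Kill a Mockingbird'}
```

Note that the Roman subpattern can match the empty string, so a value such as '12. Title' would still match with an empty Roman part.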
I am pretty sure there is a more optimal solution, but this would be a fast way of solving it:
df['Date_Title'] = df['Date_Title'].apply(lambda x: (x.split()[0], x.split()[1], ' '.join(x.split()[2:])))
Or, splitting into separate columns (the column names here are only illustrative; the second part keeps its trailing dot, e.g. 'IV.'):
parts = df['Date_Title'].str.split()
df['Num'], df['Roman'], df['Name'] = parts.str[0], parts.str[1], parts.str[2:].str.join(' ')
Focusing on the string split:
string = "21-20 IV. Macbeth"
i = string.index(".") # Index of the first period
date, roman = string[:i].split() # 21-20, IV
title = string[i+2:] # Macbeth
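Wrapped in a function, the same split can be run over every title. A rough sketch: the function name is made up, and it uses i + 1 plus strip() instead of i + 2 so that a missing space after the period does not eat the first letter of the title.

```python
def split_date_title(value):
    """Split 'NN-NN ROMAN. Title' at the first period (hypothetical helper)."""
    i = value.index('.')                 # raises ValueError if there is no period
    date, roman = value[:i].split()      # expects exactly two tokens before the period
    title = value[i + 1:].strip()
    return date, roman, title

titles = ['05-21 I. Don Quixote', '21-20 IV. Macbeth',
          '10-12 ML. To Kill a Mockingbird', '12 V. Invisible Man']
for t in titles:
    print(split_date_title(t))
# ('05-21', 'I', 'Don Quixote')
# ('21-20', 'IV', 'Macbeth')
# ('10-12', 'ML', 'To Kill a Mockingbird')
# ('12', 'V', 'Invisible Man')
```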
df = df.assign(
    x=df['Date_Title'].str.split(r'\.').str[0],
    y=df['Date_Title'].str.extract(r'(\w+(?=\.))', expand=False),
    z=df['Date_Title'].str.split(r'\.').str[1:].str.join(','),
)
I have dates like these and I need a regex to find them:
12-23-2019
29 10 2019
1:2:2018
9/04/2019
22.07.2019
Here's what I did.
First I removed all spaces from the text, and here's what it looks like:
12-23-2019291020191:02:2018
And this is my regex:
re.findall(r'((\d{1,2})([.\/-])(\d{2}|\w{3,9})([.\/-])(\d{4}))',new_text)
It can find 12-23-2019, 9/04/2019 and 22.07.2019, but it cannot find 29 10 2019 and 1:02:2018.
You may use
(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)
See the regex demo
Details
(?<!\d) - no digit right before
\d{1,2} - 1 or 2 digits
([.:/ -]) - a dot, colon, slash, space or hyphen (captured in Group 1)
(?:\d{1,2}|\w{3,}) - 1 or 2 digits or 3 or more word chars
\1 - same value as in Group 1
\d{4} - four digits
(?!\d) - no digit allowed right after
Python sample usage:
import re
text = 'Aaaa 12-23-2019, bddd 29 10 2019 <=== 1:2:2018'
pattern = r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)'
results = [x.group() for x in re.finditer(pattern, text)]
print(results) # => ['12-23-2019', '29 10 2019', '1:2:2018']
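The backreference \1 is what forces both separators to be the same character. A short sketch to illustrate; the function name and the extra test strings are made up. finditer is used rather than findall because, with a single capturing group in the pattern, findall would return only the captured separators.

```python
import re

DATE = re.compile(r'(?<!\d)\d{1,2}([.:/ -])(?:\d{1,2}|\w{3,})\1\d{4}(?!\d)')

def find_dates(text):
    # Collect the whole match, not just the captured separator.
    return [m.group() for m in DATE.finditer(text)]

print(find_dates('9/04/2019 and 22.07.2019'))  # ['9/04/2019', '22.07.2019']
print(find_dates('12-23/2019'))                # [] because the separators differ
print(find_dates('12 March 2019'))             # ['12 March 2019']
```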
I want to search for "region" in s1. I want to return 1 if the text contains "region", "région", "regions" or "régions", and 0 otherwise.
I wrote the code below, but it doesn't work:
s1 = pd.Series(['here is region', 'my regions', 'régionally', 'région','régions','regions','region'])
s1.str.contains('r.gion[s][^a-zA-Z]', regex=True).astype(int)
In this case the result must be
[1,1,0,1,1,1,1]
You may use
s1.str.contains(r'\br[ée]gions?\b').astype(int)
If you want to store the regex in a file and read it into a variable later, write it there without the r'...' wrapper, i.e. just \br[ée]gions?\b.
Test:
>>> import pandas as pd
>>> s1 = pd.Series(['here is region', 'my regions', 'régionally', 'région','régions','regions','region'])
>>> s1.str.contains(r'\br[ée]gions?\b').astype(int)
0 1
1 1
2 0
3 1
4 1
5 1
6 1
dtype: int32
Details
\b - a word boundary
r - r char
[ée] - one of the letters in the character class
gion - gion
s? - an optional s letter
\b - a word boundary.
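The same check can be tried on plain strings with re. A minimal sketch; the function name is hypothetical. The pattern is case-sensitive as written, so the second compiled pattern shows the re.IGNORECASE variant as an optional extension that the question did not ask for.

```python
import re

REGION = re.compile(r'\br[ée]gions?\b')
REGION_ANY_CASE = re.compile(r'\br[ée]gions?\b', re.IGNORECASE)  # optional variant

def has_region(text, pattern=REGION):
    """Return 1 if the pattern is found anywhere in text, else 0."""
    return int(pattern.search(text) is not None)

values = ['here is region', 'my regions', 'régionally', 'région',
          'régions', 'regions', 'region']
print([has_region(v) for v in values])          # [1, 1, 0, 1, 1, 1, 1]
print(has_region('Région'))                     # 0
print(has_region('Région', REGION_ANY_CASE))    # 1
```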
In Python, I need to create a regex that inserts a space between any concatenated AlphaNum combinations. For example, this is what I want:
8min15sec ==> 8 min 15 sec
7m12s ==> 7 m 12 s
15mi25s ==> 15 mi 25 s
RegEx101 demo
I am blundering around with solutions found online, but they are a bit too complex for me to parse/modify. For example, I have this:
[a-zA-Z][a-zA-Z\d]*
but it only identifies the first insertion point: 8Xmin15sec (the X)
And this
(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Z])
but it only finds this point: 8minX15sec (the X)
I could sure use a hand with the full syntax for finding each insertion point and inserting the spaces.
RegEx101 demo (same link as above)
How about the following approach:
import re
for test in ['8min15sec', '7m12s', '15mi25s']:
print(re.sub(r'(\d+|\D+)', r'\1 ', test).strip())
Which would give you:
8 min 15 sec
7 m 12 s
15 mi 25 s
You can use this regex, which marks the boundaries between digits and letters in either order, i.e. a digit followed by a letter or vice versa.
(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)
The first alternative, (?<=\d)(?=[a-zA-Z]), matches a position with a positive lookbehind for a digit and a positive lookahead for a letter.
Similarly, (?<=[a-zA-Z])(?=\d) does the same in the opposite order.
Then just replace each matched position with a space.
Demo
Here is sample Python code for the same:
import re
arr = ['8min15sec', '7m12s', '15mi25s']
for s in arr:
print(s + ' --> ' + re.sub(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ', s))
Which prints the following output:
8min15sec --> 8 min 15 sec
7m12s --> 7 m 12 s
15mi25s --> 15 mi 25 s
How about:
"(\d+)([a-zA-Z]+)"
to
"\1 \2 "
https://regex101.com/r/yvqCtQ/2
And in python:
In [59]: re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', '8min15sec')
Out[59]: '8 min 15 sec '
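The three approaches in this thread agree on the question's examples but behave differently once the input already contains a space. A small comparison sketch; the function names and the test input '8min 15sec' are made up for illustration.

```python
import re

def by_tokens(s):
    # Appends a space after every run of digits or non-digits, then strips.
    return re.sub(r'(\d+|\D+)', r'\1 ', s).strip()

def by_lookaround(s):
    # Inserts a space only at digit/letter boundaries.
    return re.sub(r'(?<=\d)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=\d)', ' ', s)

def by_pairs(s):
    # Rewrites each digits+letters pair and appends a trailing space.
    return re.sub(r'(\d+)([a-zA-Z]+)', r'\1 \2 ', s)

for f in (by_tokens, by_lookaround, by_pairs):
    print(f.__name__, repr(f('8min 15sec')))
# by_tokens '8 min  15 sec'
# by_lookaround '8 min 15 sec'
# by_pairs '8 min  15 sec '
```

Only the lookaround version leaves existing spacing alone; the other two double the space that was already there, and the last one also leaves a trailing space unless the result is stripped.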