Selecting patterns in character sequence using regex - python

I need to select all the accounts where 3 (or more) consecutive characters are identical and/or the name also includes digits, for example:
Account
aaa12
43qas
42134dfsdd
did
Output
Account
aaa12
43qas
42134dfsdd
I am considering using a regex for this: [a-zA-Z]{3,} , but I am not sure of the approach. Also, this does not cover the and/or condition on the digits. I am interested in selecting accounts with at least one of these:
repeated identical characters,
numbers in the name.

Give this a try
n = 3 #for 3 chars repeating
pat = f'([a-zA-Z])\\1{{{n-1}}}|(\\d)+' #need `{{` to pass a literal `{`
df_final = df[df.Account.str.findall(pat).astype(bool)]
Out[101]:
Account
0 aaa12
1 43qas
2 42134dfsdd
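The same pattern can be checked without pandas. This is a minimal sketch using only the standard re module; the helper name is_flagged and the sample list are made up for illustration. The backreference \1 repeats whatever letter the first group captured, and \d covers the "contains a digit" condition.

```python
import re

def is_flagged(account, n=3):
    # ([a-zA-Z])\1{n-1} matches one letter followed by n-1 copies of that same letter;
    # the alternative \d matches any single digit.
    pat = rf'([a-zA-Z])\1{{{n - 1}}}|\d'
    return re.search(pat, account) is not None

accounts = ['aaa12', '43qas', '42134dfsdd', 'did']
flagged = [a for a in accounts if is_flagged(a)]
print(flagged)
```

Only 'did' is dropped here, because it has neither a digit nor a letter repeated three times in a row.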

You can try:
import re
x = re.search(r'([a-zA-Z])\1{2}|\d', string)  # a letter repeated 3 times, or any digit

Related

Want to replace comma with decimal point in text file where after each number there is a comma in python

eg
Arun,Mishra,108,23,34,45,56,Mumbai
The output I want is
Arun,Mishra,108.23,34,45,56,Mumbai
I tried to replace the comma with a dot using text.replace(',','.'), but that replaces every comma, including the delimiters.
You can use a regex for this kind of task:
import re
old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
new_str = re.sub(r'(\d+)(,)(\d+)', r'\1.\3', old_str, 1)
>>> 'Arun,Mishra,108.23,34,45,56,Mumbai'
The search pattern r'(\d+)(,)(\d+)' finds a comma between two numbers. It has three capture groups, which can be used in the replacement r'\1.\3' (\1 and \3 are the first and third groups). old_str is the input string, and the final argument 1 is the count: it tells re.sub to replace only the first occurrence (thus keeping 34,45).
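To see what the count argument changes, here is a small sketch comparing count=1 with the default (replace every match). It uses a simplified two-group version of the pattern; the variable names are illustrative only.

```python
import re

old_str = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pattern = r'(\d+),(\d+)'  # the comma itself does not need its own group

# count=1 stops after the first comma that sits between two numbers
first_only = re.sub(pattern, r'\1.\2', old_str, count=1)

# without count, every non-overlapping match is replaced
every_match = re.sub(pattern, r'\1.\2', old_str)

print(first_only)
print(every_match)
```

Without the count, the pair 34,45 is also rewritten, which is not what the question asks for.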
It may be instructive to show how this can be done without additional module imports.
The idea is to search the string for commas. Once the index of a comma has been found, examine the characters on either side (checking for digits). If both are digits, modify the string accordingly and stop.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
pos = 1
while (pos := s.find(',', pos, len(s)-1)) > 0:
    if s[pos-1].isdigit() and s[pos+1].isdigit():
        s = s[:pos] + '.' + s[pos+1:]
        break
    pos += 1
print(s)
Output:
Arun,Mishra,108.23,34,45,56,Mumbai
Assuming you have a plain CSV file as in your single line example, we can assume there are 8 columns and you want to 'merge' columns 3 and 4 together. You can do this with a regular expression - as shown below.
Here I explicitly match the 8 columns into 8 groups - matching everything that is not a comma as a column value and then write out the 8 columns again with commas separating all except columns 3 and 4 where I put the period/dot you require.
$ echo "Arun,Mishra,108,23,34,45,56,Mumbai" | sed -r "s/([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)/\1,\2,\3.\4,\5,\6,\7,\8/"
Arun,Mishra,108.23,34,45,56,Mumbai
This regex is for your exact data. A generic regex that replaces any comma between two adjacent sets of digits might give false matches on other data, so I think explicitly matching the exact columns you have is the safest way to do it.
You can use the above regex in your Python code as shown below.
import re
inLine = 'Arun,Mishra,108,23,34,45,56,Mumbai'
outLine = re.sub(r'([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)'
, r'\1,\2,\3.\4,\5,\6,\7,\8', inLine, 0)
print(outLine)
As Tim Biegeleisen pointed out in an original comment, if you have access to the original source data you would be better fixing the formatting there. Of course that is not always possible.
First split the string on commas using s.split(','), then join the third and fourth elements with a dot. After that, join the list back into a string.
s = 'Arun,Mishra,108,23,34,45,56,Mumbai'
ls = s.split(',')
ls[2] = '.'.join([ls[2], ls[3]])
ls.pop(3)
s = ','.join(ls)
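Since the question is about a text file, the split-and-join idea can be wrapped in a function and applied to each line. This is only a sketch: the function name merge_fields, the index parameter i, and the sample lines are assumptions, and lines whose third and fourth fields are not both digits are left unchanged.

```python
def merge_fields(line, i=2):
    # Split into fields, then merge fields i and i+1 with a dot
    # only when both consist of digits.
    fields = line.rstrip('\n').split(',')
    if len(fields) > i + 1 and fields[i].isdigit() and fields[i + 1].isdigit():
        fields[i:i + 2] = [fields[i] + '.' + fields[i + 1]]
    return ','.join(fields)

lines = ['Arun,Mishra,108,23,34,45,56,Mumbai', 'Name,Only']
merged = [merge_fields(line) for line in lines]
print(merged)
```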
This changes every comma to a dot if the comma has a digit immediately before and after it.
txt = "2459,12 is the best number. lets change the dots . with commas , 458,45."
commaindex = txt.find(",")
while commaindex != -1:
    # replace only when there is a digit on both sides of the comma
    if 0 < commaindex < len(txt) - 1 and txt[commaindex-1].isnumeric() and txt[commaindex+1].isnumeric():
        txt = txt[:commaindex] + "." + txt[commaindex+1:]
    commaindex = txt.find(",", commaindex+1)
print(txt)

How To Extract Three Letters Followed By Five Digits Using Regex in Python

I have the following dataframe in Python:
abc12345
abc1234
abc1324.
How do I extract only the ones that have three letters followed by five digits?
The desired result would be:
abc12345.
df.column.str.extract('[^0-9](\d\d\d\d\d)$')
I think this works, but is there any better way to modify (\d\d\d\d\d) ?
What if I had like 30 digits. Then I'll have to type \d 30 times, which is inefficient.
You should be able to use:
'[a-zA-Z]{3}\d{5}'
If the strings don't include capital letters this can reduce to:
'[a-z]{3}\d{5}'
Change the values in the {x} to adjust the number of chars to capture.
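One thing worth checking is whether the pattern is anchored. The sketch below (standard re module only; the fourth sample value is added for illustration) contrasts fullmatch, which requires the whole string to be three letters plus five digits, with search, which also accepts longer strings that merely contain such a run.

```python
import re

values = ['abc12345', 'abc1234', 'abc1324.', 'abc123456']
pattern = re.compile(r'[a-zA-Z]{3}\d{5}')

# fullmatch: the entire string must be 3 letters followed by 5 digits
exact = [v for v in values if pattern.fullmatch(v)]

# search: the pattern may appear anywhere, so a 6-digit value also passes
loose = [v for v in values if pattern.search(v)]

print(exact)
print(loose)
```

In pandas, adding ^ and $ to the pattern passed to str.extract or str.contains should give the stricter behaviour.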
Or use the following code, which matches five digits starting at index 3:
import re
s = "abc12345"
p = re.compile(r"\d{5}")
c = p.match(s, 3)  # start matching at position 3, after the three letters
print(c.group())

separate upper case chars with digits from lower case chars with digits

I have a column Name with data in format below:
Name Name2
0 MORR1223ldkeha12 ldkeha12
1 FRAN2771yetg4fq1 yetg4fq1
2 MORR56333gft4tsd1 gft4tsd1
I wanted to separate name as per column Name2. There is a pattern of 4 upper case chars, followed by 4-5 digits, and I'm interested in what follows these 4-5 digits.
Is there any way to achieve this?
You can try the logic below:
import re
_names = ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']
result = []
for _name in _names:
    m = re.search('^[A-Z]{4}[0-9]{4,5}(.+)', _name)
    result.append(m.group(1))
print(result)
Using str.extract
import pandas as pd
df = pd.DataFrame({"Name": ['MORR1223ldkeha12', 'FRAN2771yetg4fq1', 'MORR56333gft4tsd1']})
df["Name2"] = df["Name"].str.extract(r"\d{4,5}(.*)")
print(df)
Output:
Name Name2
0 MORR1223ldkeha12 ldkeha12
1 FRAN2771yetg4fq1 yetg4fq1
2 MORR56333gft4tsd1 gft4tsd1
You could use a regex to find out if there are 4 or 5 digits and then remove either the first 8 or 9 letters. So if the pattern ^[A-Z]{4}[0-9]{5}.* matches, there are 5 digits, else 4.
If you change your regex to '(^[A-Z]{4})([0-9]{4,5})(.+)', you can access the different parts using the groups of the match result.
So in Anil's code, group(0) will return the whole match, 1 the first group, 2 the second one and 3 the rest.
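A short sketch of the group access described above, using one of the sample names from the question. The variable names are illustrative.

```python
import re

pattern = re.compile(r'^([A-Z]{4})([0-9]{4,5})(.+)')
m = pattern.match('MORR56333gft4tsd1')

whole = m.group(0)    # the entire match
prefix = m.group(1)   # the 4 upper-case letters
digits = m.group(2)   # the 4 or 5 digits
rest = m.group(3)     # everything after the digits

print(whole, prefix, digits, rest)
```

Note that {4,5} is greedy: if the part after the 4-digit block itself started with a digit, that digit would be taken into the second group.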

Need help to find a suitable regex

I have a pandas DataFrame with one column of prices that contains strings of various forms such as US$250.00, MYR35.50, and S$50, and have been facing trouble in developing a suitable regex in order to split the non-numerical portion from the numerical portion. The end result I would like to have is to split this single column of prices into two new columns. One of the columns would hold the alphabetical part as a string and be named "Currency", while the other column would hold the numbers as "Price".
The only possible alphabetical parts I would encounter in the strings, prepended to the numerical parts, are just of the forms: US$, BAHT, MYR, S$. Sometimes there might be a whitespace between the alphabetical part and numerical part, sometimes there might not be. All the help that I need here is just figure out the right regex for this job.
Help please! Thank you so much!
If you want to extend @Tristan's answer to pandas you can use the extractall method in the str accessor.
First create some data
s=pd.Series(['US$250.00', 'MYR35.50','&*', 'S$ 50', '50'])
0 US$250.00
1 MYR35.50
2 &*
3 S$ 50
4 50
Then use extractall. Notice that this method skips rows that do not have a match.
s.str.extractall(r'([A-Z$]+)\s*([\d.]+)')
0 1
match
0 0 US$ 250.00
1 0 MYR 35.50
3 0 S$ 50
You can use re.match on each cell with a regex like this:
import re
cell = 'US$50.00'
result = re.match(r'([A-Z$]+)\s*([\d.]+)', cell)
print(result.groups()[0], result.groups()[1])
The relevant different parts are captured in groups and can be accessed separately, while the optional whitespace is ignored.
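Because re.match returns None when a cell does not fit the pattern, it helps to guard against that before reading the groups. The following is a sketch only; the helper name split_price and the sample cells are assumptions, and the price is kept as a string.

```python
import re

PRICE_RE = re.compile(r'([A-Z$]+)\s*([\d.]+)')

def split_price(cell):
    # Returns (currency, price) as strings, or None if the cell does not match.
    m = PRICE_RE.match(cell)
    if m is None:
        return None
    return m.group(1), m.group(2)

cells = ['US$250.00', 'MYR35.50', 'S$ 50', 'BAHT 100', '&*', '50']
parsed = [split_price(c) for c in cells]
print(parsed)
```

Cells such as '&*' or a bare '50' yield None, which mirrors how extractall skips rows without a match.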
The trick is to use '\$* *' in your search pattern.
Since $ is a metacharacter in regex, it needs to be escaped to be treated as a literal $. So the '\$*' part tells the regex engine that the $ sign may appear zero or more times. Similarly, ' *' says that a space may appear zero or more times.
Hope this helps.
>>> import re
>>> string = 'Rs50 US$56 MYR83 S$102 Baht 105 Us$77'
>>> M = re.findall(r'[A-Za-z]+\$*',string)
>>> M
['Rs', 'US$', 'MYR', 'S$', 'Baht', 'Us$']
>>> C = re.findall(r'[A-Za-z]+\$* *([0-9]+)',string)
>>> C
['50', '56', '83', '102', '105', '77']
With this regex
^([^0-9]+)([0-9]+\.?[0-9]*)$
Group 1 will be the currency part and Group 2 will be the numerical part:
https://regex101.com/delete/MjfCYY4H8g1uCfCywL0TFImZ
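With this pattern, any whitespace between the currency and the number ends up in Group 1, so it may be worth stripping it. A minimal sketch, assuming the cell is known to match (the sample value is taken from the earlier answer):

```python
import re

pattern = re.compile(r'^([^0-9]+)([0-9]+\.?[0-9]*)$')

m = pattern.match('S$ 50')
raw_currency = m.group(1)          # includes the trailing space
currency = raw_currency.strip()    # remove the optional whitespace
price = m.group(2)

print(repr(raw_currency), repr(currency), repr(price))
```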

Split and check first 8 digits are met

I have some data that looks like this, when reading this data from a file, is there a way to only add to the list if the first 8 digits are met?
11111111 ABC Data1
My current method only splits on the space in between:
Number = descr.split(' ')[0]
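If you only want to keep the number when the first field consists of exactly 8 digits, do it as shown below:
descr = input()
first = descr.split(' ')[0]  # text before the first space
if len(first) == 8 and first.isdigit():
    reqd_int = int(first)
If the input does not start with exactly 8 digits followed by a space, reqd_int is not set.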
The other option is to use regular expressions, as shown below:
import re
m = re.match(r'\d{8}', descr)  # re.match only matches at the start of the string
reqd_int = int(m.group()) if m else None
In the first parameter, \d matches a single digit and {8} requires a contiguous block of 8 such digits. re.match returns None when the string does not start with 8 digits.
You can look up more on regular expressions in the documentation of Python's re module.
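Applied to several lines read from a file, the check could look like the sketch below. The function name leading_numbers and the sample lines are made up; the assumption is that a line qualifies only when it starts with exactly 8 digits, which the word boundary \b enforces by rejecting a ninth digit or a letter directly after them.

```python
import re

# Assumption: exactly 8 digits at the start, followed by a space,
# another non-word character, or the end of the line.
EIGHT_DIGITS = re.compile(r'\d{8}\b')

def leading_numbers(lines):
    numbers = []
    for line in lines:
        m = EIGHT_DIGITS.match(line)
        if m:
            numbers.append(m.group())
    return numbers

sample = ['11111111 ABC Data1', '1234567 DEF Data2', '123456789 GHI Data3', '22222222']
print(leading_numbers(sample))
```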
