I have a pandas DataFrame with one column of prices containing strings of various forms such as US$250.00, MYR35.50, and S$50, and I have been having trouble developing a suitable regex to split the non-numerical portion from the numerical portion. I would like to split this single column of prices into two new columns: one would hold the alphabetical part as a string and be named "Currency", while the other would hold the numbers as "Price".
The only possible alphabetical parts I would encounter, prepended to the numerical parts, are of the forms US$, BAHT, MYR, and S$. Sometimes there might be whitespace between the alphabetical part and the numerical part, sometimes not. All I need help with is figuring out the right regex for this job.
Help please! Thank you so much!
If you want to extend @Tristan's answer to pandas you can use the extractall method in the str accessor.
First create some data
s = pd.Series(['US$250.00', 'MYR35.50', '&*', 'S$ 50', '50'])
0 US$250.00
1 MYR35.50
2 &*
3 S$ 50
4 50
Then use extractall. Notice that this method skips rows that do not have a match.
s.str.extractall(r'([A-Z$]+)\s*([\d.]+)')
0 1
match
0 0 US$ 250.00
1 0 MYR 35.50
3 0 S$ 50
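To get the two named columns the question asks for, str.extract may be the better fit, since it keeps one row per input (with NaN where nothing matches). A minimal sketch, using named groups so the result columns come out as Currency and Price:

```python
import pandas as pd

s = pd.Series(['US$250.00', 'MYR35.50', '&*', 'S$ 50', '50'])

# Named groups become the column names; rows without a match get NaN.
out = s.str.extract(r'(?P<Currency>[A-Z$]+)\s*(?P<Price>[\d.]+)')
print(out)
```

Note that the bare '50' gets NaN too, since the pattern requires at least one letter or $ before the number.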
You can use re.match on each cell with a regex like this:
import re
cell = 'US$50.00'
result = re.match(r'([A-Z$]+)\s*([\d.]+)', cell)
print(result.groups()[0], result.groups()[1])
The relevant parts are captured in groups and can be accessed separately, while the optional whitespace is ignored.
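Since re.match returns None when the cell does not match, it's worth guarding before calling groups(). A small sketch (the helper name split_price is just for illustration):

```python
import re

def split_price(cell):
    # Returns (currency, amount), or (None, None) when the cell doesn't match.
    result = re.match(r'([A-Z$]+)\s*([\d.]+)', cell)
    if result is None:
        return None, None
    return result.group(1), result.group(2)

print(split_price('US$50.00'))  # ('US$', '50.00')
print(split_price('&*'))        # (None, None)
```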
The trick is to use '\$* *' in your search pattern.
Since $ is a metacharacter in regex, it needs to be escaped to be treated as a literal $. So the '\$*' part tells the regex engine that a $ sign may appear zero or more times. Similarly, ' *' says that a space may appear zero or more times.
Hope this helps.
>>> import re
>>> string = 'Rs50 US$56 MYR83 S$102 Baht 105 Us$77'
>>> M = re.findall(r'[A-Za-z]+\$*', string)
>>> M
['Rs', 'US$', 'MYR', 'S$', 'Baht', 'Us$']
>>> C = re.findall(r'[A-Za-z]+\$* *([0-9]+)', string)
>>> C
['50', '56', '83', '102', '105', '77']
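If both parts are needed together rather than in two separate lists, a single findall with two capture groups returns (currency, amount) pairs. A sketch on the same sample string:

```python
import re

string = 'Rs50 US$56 MYR83 S$102 Baht 105 Us$77'

# Two groups per match: the currency prefix (with optional $) and the amount.
pairs = re.findall(r'([A-Za-z]+\$?)\s*([0-9]+)', string)
print(pairs)
```

The \s* between the groups absorbs the optional space, as in 'Baht 105'.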
With this regex
^([^0-9]+)([0-9]+\.?[0-9]*)$
Group 1 will be the currency part and Group 2 will be the numerical part:
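For example, a quick check with re.match:

```python
import re

# Group 1: everything up to the first digit; group 2: the numeric part.
m = re.match(r'^([^0-9]+)([0-9]+\.?[0-9]*)$', 'MYR35.50')
print(m.group(1), m.group(2))  # MYR 35.50
```

Note that with an input like 'S$ 50', group 1 would include the trailing space, so a strip() may be wanted.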
This is an example of a bigger dataframe. Imagine I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({"ID":["4SSS50FX","2TT1897FA"],
"VALUE":[13, 56]})
df
Out[2]:
ID VALUE
0 4SSS50FX 13
1 2TT1897FA 56
I would like to insert "-" in the strings from df["ID"] every time the string changes from number to text or from text to number. So the output should be:
ID VALUE
0 4-SSS-50-FX 13
1 2-TT-1897-FA 56
I could create specific conditions for each case, but I would like to automate it for all the samples. Could anyone help me?
You can use a regular expression with lookarounds.
df['ID'] = df['ID'].str.replace(r'(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)', '-', regex=True)
The regexp matches an empty string that's either preceded by a digit and followed by a letter, or vice versa. This empty string is then replaced with -.
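The same substitution can be tried outside pandas with re.sub, which makes the pattern easy to verify on its own:

```python
import re

# Zero-width pattern: a digit/letter boundary in either direction.
pat = r'(?<=\d)(?=[A-Z])|(?<=[A-Z])(?=\d)'
print(re.sub(pat, '-', '4SSS50FX'))   # 4-SSS-50-FX
print(re.sub(pat, '-', '2TT1897FA'))  # 2-TT-1897-FA
```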
Use a regex.
>>> df['ID'].str.replace(r'(\d+(?=\D)|\D+(?=\d))', r'\1-', regex=True)
0 4-SSS-50-FX
1 2-TT-1897-FA
Name: ID, dtype: object
\d+(?=\D) means digits followed by a non-digit.
\D+(?=\d) means non-digits followed by a digit.
Either match is replaced with itself plus a - character.
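Checked in plain re, the backreference replacement behaves the same way:

```python
import re

# \1 re-inserts the matched run of digits or letters, then appends '-'.
print(re.sub(r'(\d+(?=\D)|\D+(?=\d))', r'\1-', '4SSS50FX'))  # 4-SSS-50-FX
```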
I have a column consisting of rows with different strings (Python), e.g.
5456656352
435365
46765432
...
I want to separate the strings every 2 digits with a comma, so I have the following result:
54,56,65,63,52
43,53,65
46,76,54,32
...
Can someone help me, please?
Try:
text = "5456656352"
print(",".join(text[i:i + 2] for i in range(0, len(text), 2)))
output:
54,56,65,63,52
You can wrap it into a function if you want to apply it to a DF or ...
note: This will separate from left, so if the length is odd, there will be a single number at the end.
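To apply this to a whole column, the expression can be wrapped in a function and used with Series.apply. A sketch, assuming the column is named col:

```python
import pandas as pd

def pair_digits(text):
    # Split the string into chunks of two characters and join with commas.
    return ",".join(text[i:i + 2] for i in range(0, len(text), 2))

df = pd.DataFrame({"col": ["5456656352", "435365", "46765432"]})
df["col"] = df["col"].apply(pair_digits)
print(df)
```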
Not sure about the structure of desired output (pandas and dataframes, pure strings, etc.). But, you can always use a regex pattern like:
import re
re.findall(r"\d{2}", "5456656352")
Output
['54', '56', '65', '63', '52']
You can have this output as a string too:
",".join(re.findall(r"\d{2}", "5456656352"))
Output
54,56,65,63,52
Explanation
\d{2} is a regex pattern that matches a part of a string containing exactly 2 digits. Using the findall function, this pattern divides each string into elements containing just two digits.
Edit
Based on your comment, you want to apply this to a column. In this case, wrap the expression in a function and use apply:
def split_it(text):
    return ",".join(re.findall(r"\d{2}", text))

df["my_column"] = df["my_column"].apply(split_it)
I need to select all the accounts where 3 (or more) consecutive characters are identical and/or the name also includes digits, for example:
Account
aaa12
43qas
42134dfsdd
did
Output
Account
aaa12
43qas
42134dfsdd
I am considering using regex for this: [a-zA-Z]{3,}, but I am not sure of the approach. Also, this does not cover the and/or condition on the digits. I would be interested in selecting accounts with at least one of these:
repeated identical characters,
numbers in the name.
Give this a try
n = 3 #for 3 chars repeating
pat = f'([a-zA-Z])\\1{{{n-1}}}|(\\d)+' #need `{{` to pass a literal `{`
df_final = df[df.Account.str.findall(pat).astype(bool)]
Out[101]:
Account
0 aaa12
1 43qas
2 42134dfsdd
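The same pattern (with n = 3 substituted into the f-string) can be sanity-checked on individual strings with plain re:

```python
import re

# ([a-zA-Z])\1{2} = a letter followed by two copies of itself; | \d = any digit.
pat = r'([a-zA-Z])\1{2}|\d'
for name in ['aaa12', '43qas', '42134dfsdd', 'did']:
    print(name, bool(re.search(pat, name)))  # only 'did' prints False
```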
Can you try:
x = re.search(r'([a-zA-Z])\1{2}|\d', string)
(The pattern must be a string, and a backreference is needed to match three identical characters; [a-zA-Z]{3} alone would match any three letters.)
I have a list in Python with values
['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
I want to match only strings where length is 8 and there are 3 characters before underscore and 4 digits after underscore so I eliminate values not required. I am interested only in the MMM_YYYY values from above list.
I tried the below, but I am not able to filter out values like YTD_TY_1, which has multiple underscores.
for c in col_headers:
    d = re.match(r'^(?=.*\d)(?=.*[A-Z0-9])[A-Z_0-9\d]{8}$', c)
    if d:
        data_period.append(d[0])
Update: based on @WiktorStribiżew's observation that re.match does not require a full-string match in Python.
The regex I am using is based upon the one that @dvo provided in a comment:
import re
REGEX = '^[A-Z]{3}_[0-9]{4}$'
col_headers = ['JUL_2018', 'AUG_2018', 'SEP_2018', 'OCT_2018', 'NOV_2018', 'DEC_2018', 'JAN_2019', 'FEB_2019', 'MAR_2019', 'APR_2019', 'MAY_2019', 'JUN_2019', 'MAT_YA_1', 'MAT_TY_1', 'YTD_YA_1', 'YTD_TY_1', 'L3M_YA_1', 'L1M_YA_1']
regex = re.compile(REGEX)
data_period = list(filter(regex.search, col_headers))
Once again, based on a comment made by @WiktorStribiżew, if you do not want to match something like "SXX_0012" or "XYZ_0000", you should use the regex he has provided in a comment:
REGEX = r'^(?:JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)_[0-9]{4}$'
Rather than use regex for this, you should just try to parse it as a date in the first place:
from datetime import datetime
date_fmt = "%b_%Y"
for c in col_headers:
    try:
        d = datetime.strptime(c, date_fmt)
        data_period.append(c)  # Or just save the datetime object directly
    except ValueError:
        pass
The part of this code that is actually doing the matching in your solution is this
[A-Z_0-9\d]{8}
The problem is that you're asking for exactly 8 characters drawn from A-Z, _, and 0-9. Now, \d is equivalent to 0-9, so you can eliminate it, but that doesn't solve the whole problem: the issue is that you've encased the entire pattern in a single character class []. Your string will therefore match anything 8 characters long made up of those characters, e.g. A_19_KJ9.
What you need to do is specify that you want exactly 3 A-Z characters, then a single _, then 4 \d, see below:
[A-Z]{3}_\d{4}
This will match anything with exactly 3 A-Z characters, then a single _, then 4 digits (\d matches any numeric digit).
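A quick check with re.fullmatch, which requires the whole string to match:

```python
import re

for c in ['JUL_2018', 'YTD_TY_1', 'MAT_YA_1']:
    # Only the MMM_YYYY value matches; the _1 suffixes fail the \d{4} part.
    print(c, bool(re.fullmatch(r'[A-Z]{3}_\d{4}', c)))
```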
For a better understanding of regex, I'd encourage you to use an online tool, like regex101
I've tried to find ways to do this and searched online here, but cannot find examples to help me figure this out.
I'm reading in rows from a large csv and changing each row to a list. The problem is that the data source isn't very clean. It has empty strings or bad data sometimes, and I need to fill in default values when that happens. For example:
list_ex1 = ['apple','9','','2012-03-05','455.6']
list_ex2 = ['pear','0','45','wrong_entry','565.11']
Here, list_ex1 has a blank third entry and list_ex2 has erroneous data where a date should be. To be clear, I can create a regex that limits what each of the five entries should be:
reg_ex_check = ['[A-Za-z]+','[0-9]','[0-9]','[0-9]{4}-[0-1][0-9]-[0-3][0-9]','[0-9.]+']
That is:
1st entry: A string, no numbers
2nd entry: Exactly one digit between 0 and 9
3rd entry: Exactly one digit as well.
4th entry: Date in standard format (allowing any four digit ints for year)
5th entry: Float
If an entry is blank OR does not match the regular expression, then it should be filled in/replaced with the following defaults:
default_fill = ['empty','0','0','2000-01-01','0']
I'm not sure of the best way to go about this. I think I could write a complicated loop, but that doesn't feel very 'pythonic' to me.
Any better ideas?
Use zip and a conditional expression in a list comprehension:
[x if re.match(r,x) else d for x,r,d in zip(list_ex2,reg_ex_check,default_fill)]
Out[14]: ['pear', '0', '45', '2000-01-01', '565.11']
You don't really need to explicitly check for blank strings since your various regexen (plural of regex) will all fail on blank strings.
Other note: you probably still want to add an anchor for the end of your string to each regex. Using re.match ensures that it tries to match from the start, but still provides no guarantee that there is not illegal stuff after your match. Consider:
['pear and a pear tree', '0blah', '4 4 4', '2000-01-0000', '192.168.0.bananas']
The above entire list is "acceptable" if you don't add a $ anchor to the end of each regex :-)
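A minimal illustration of why the $ anchor matters with re.match:

```python
import re

print(bool(re.match(r'[0-9]', '0blah')))   # True  - matches the leading digit
print(bool(re.match(r'[0-9]$', '0blah')))  # False - the anchor rejects the trailing text
```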
What about something like this?
list(map(lambda t: t[0] if re.search(t[1], t[0]) else t[2],
         zip(list_ex1, reg_ex_check, default_fill)))
(Tuple-unpacking lambdas were removed in Python 3, so unpack inside the body; a conditional expression also avoids the and/or trick misfiring when the matched value is falsy.)