I have a column consisting of rows with different strings (Python), e.g.
5456656352
435365
46765432
...
I want to separate the strings every 2 digits with a comma, so I get the following result:
54,56,65,63,52
43,53,65
46,76,54,32
...
Can someone help me, please?
Try:
text = "5456656352"
print(",".join(text[i:i + 2] for i in range(0, len(text), 2)))
output:
54,56,65,63,52
You can wrap it in a function if you want to apply it to a DataFrame or ...
Note: this splits from the left, so if the length is odd there will be a single digit at the end.
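For example, a minimal sketch of that idea (the DataFrame, the column name "numbers", and the helper name split_pairs are assumptions for illustration):
import pandas as pd

def split_pairs(text):
    # hypothetical helper: join consecutive 2-character slices with commas
    return ",".join(text[i:i + 2] for i in range(0, len(text), 2))

df = pd.DataFrame({"numbers": ["5456656352", "435365", "46765432"]})
df["numbers"] = df["numbers"].apply(split_pairs)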
Not sure about the structure of the desired output (pandas and DataFrames, pure strings, etc.), but you can always use a regex pattern like:
import re
re.findall("\d{2}", "5456656352")
Output
['54', '56', '65', '63', '52']
You can have this output as a string too:
",".join(re.findall("\d{2}", "5456656352"))
Output
54,56,65,63,52
Explanation
\d{2} is a regex pattern that matches any two consecutive digits. With findall, this pattern splits each string into elements of exactly two digits each. Note that if the string has an odd length, the final unpaired digit is dropped.
Edit
Based on your comment, you want to apply this to a column. In that case, you could do something like:
df["my_column"] = df["my_column"].apply(split_it)
I have a column within a dataframe that is composed of lists. I am trying to use an if statement to identify values in these lists that contain any special character or number. The numbers I am trying to identify are string values, not numeric. I have tried using regex to identify these values, but I don't know exactly how to use this in an if statement.
The code below gives me what I want, but I know there has to be a more succinct way to do it:
if '-' in row['col_name'].iloc[0] or '/' in row['col_name'].iloc[0] or '0' in row['col_name'].iloc[0] or '1' in row['col_name'].iloc[0]:
    return action
I only included a few special characters and numbers in this example. I would like to find ANY special character or numeric value. Thank you in advance!
In reference to this post, the following might be what you need:
special_chars = ['-', '/', '0', '1']
# returns df with only the rows in which the column contains any of these characters
result_df = df.loc[df['col_name'].str.contains('|'.join(special_chars))]
The '|' functions as the regex alternation ("or") operator here.
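If you want to flag ANY special character or digit rather than a fixed list, here is one possible sketch (it assumes the column holds lists of strings, as described in the question; has_special_or_digit is a hypothetical helper):
import re

def has_special_or_digit(values):
    # True if any string in the list contains something other than letters or whitespace
    return any(re.search(r"[^A-Za-z\s]", str(v)) for v in values)

result_df = df.loc[df['col_name'].apply(has_special_or_digit)]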
I did some searching but couldn't find any useful information.
s = ['33PM']
My aim is to cut 'PM' from s[0] and append it as s[1].
You can use re.findall to extract contiguous runs of digits and letters: \d+ matches a run of digits and \w+ matches a run of word characters. Because \d+ comes first in the alternation, the digit run is split off before \w+ picks up the letters.
>>> import re
>>> s = re.findall(r'\d+|\w+', s[0])
>>> s
['33', 'PM']
Here is a method that uses plain Python code, avoiding the complications of regular expressions. It is designed for when you know that 'PM' is in the string; any text after 'PM' will be moved to the second list item together with the 'PM'. This code also assumes that you care only about the first item in the list; any later items will be dropped.
s = ['33PM']
string0 = s[0]
loc = string0.find('PM')
s = [string0[:loc], string0[loc:]]
If you now print s, the result is:
['33', 'PM']
I have a pandas DataFrame with one column of prices that contains strings of various forms such as US$250.00, MYR35.50, and S$50, and have been facing trouble in developing a suitable regex in order to split the non-numerical portion from the numerical portion. The end result I would like to have is to split this single column of prices into two new columns. One of the columns would hold the alphabetical part as a string and be named "Currency", while the other column would hold the numbers as "Price".
The only possible alphabetical parts I would encounter in the strings, prepended to the numerical parts, are just of the forms: US$, BAHT, MYR, S$. Sometimes there might be whitespace between the alphabetical part and the numerical part, sometimes there might not be. All the help I need here is figuring out the right regex for this job.
Help please! Thank you so much!
If you want to extend @Tristan's answer to pandas, you can use the extractall method of the str accessor.
First create some data
import pandas as pd

s = pd.Series(['US$250.00', 'MYR35.50', '&*', 'S$ 50', '50'])
0 US$250.00
1 MYR35.50
2 &*
3 S$ 50
4 50
Then use extractall. Notice that this method skips rows that do not have a match.
s.str.extractall(r'([A-Z$]+)\s*([\d.]+)')
           0       1
  match
0 0      US$  250.00
1 0      MYR   35.50
3 0       S$      50
You can use re.match on each cell with a regex like this:
import re
cell = 'US$50.00'
result = re.match(r'([A-Z$]+)\s*([\d.]+)', cell)
print(result.groups()[0], result.groups()[1])
The relevant parts are captured in separate groups and can be accessed individually, while the optional whitespace is ignored.
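If the end goal is the two new columns from the question, one possible sketch uses pandas str.extract with named groups (the DataFrame and the column name "price" are assumptions; "Currency" and "Price" come from the question):
import pandas as pd

df = pd.DataFrame({"price": ["US$250.00", "MYR35.50", "S$ 50"]})
# named groups become the column names of the extracted frame
extracted = df["price"].str.extract(r"(?P<Currency>[A-Z$]+)\s*(?P<Price>[\d.]+)")
df = df.join(extracted)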
The trick is to use '\$* *' in your search pattern.
Since $ is a metacharacter in regex, it needs to be escaped to be treated as a literal $. So the '\$*' part tells the regex engine that a $ sign may appear zero or more times. Similarly, ' *' says that a space may appear zero or more times.
Hope this helps.
>>> import re
>>> string = 'Rs50 US$56 MYR83 S$102 Baht 105 Us$77'
>>> M = re.findall(r'[A-Za-z]+\$*', string)
>>> M
['Rs', 'US$', 'MYR', 'S$', 'Baht', 'Us$']
>>> C = re.findall(r'[A-Za-z]+\$* *([0-9]+)', string)
>>> C
['50', '56', '83', '102', '105', '77']
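To pair each currency with its amount, a small follow-up (this assumes every currency token in the string is followed by a number, as in the sample above):
>>> list(zip(M, C))
[('Rs', '50'), ('US$', '56'), ('MYR', '83'), ('S$', '102'), ('Baht', '105'), ('Us$', '77')]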
With this regex
^([^0-9]+)([0-9]+\.?[0-9]*)$
Group 1 will be the currency part and Group 2 will be the numerical part.
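For example, a quick check of this pattern in Python (the sample value and the .strip() call for the trailing whitespace are additions for illustration):
import re

m = re.match(r'^([^0-9]+)([0-9]+\.?[0-9]*)$', 'US$ 250.00')
if m:
    currency, price = m.group(1).strip(), m.group(2)
    # currency == 'US$', price == '250.00'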
I would like to create a single regular expression in Python that extracts two interleaved portions of text from a filename as named groups. An example filename is given below:
CM00626141_H12.d4_T0001F003L01A02Z03C02.tif
The part of the filename I'd like to extract is contained between the underscores, and consists of the following:
An uppercase letter: [A-H]
A zero-padded two-digit number: 01 to 12
A period
A lowercase letter: [a-d]
A single digit: 1 to 4
For the example above, I would like one group ('Row') to contain H.d, and the other group ('Column') to contain 12.4. However, I don't know how to do this when the text is separated as it is here.
EDIT: A constraint which I omitted: it needs to be a single regex to handle the string. I've updated the text/title to reflect this point.
Regexp capturing groups (whether numbered or named) do not actually capture text - they capture starting/ending indices within the original text. Thus, it is impossible for them to capture non-contiguous text. Probably the best thing to do here is have four separate groups, and combine them into your two desired values manually.
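For example, a sketch of that four-group approach (the group names r1, c1, r2, c2 are arbitrary):
import re

filename = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
# four named groups, combined by hand into the two desired values
m = re.search(r'_(?P<r1>[A-H])(?P<c1>0[0-9]|1[0-2])\.(?P<r2>[a-d])(?P<c2>[1-4])_', filename)
row = m.group('r1') + '.' + m.group('r2')      # 'H.d'
column = m.group('c1') + '.' + m.group('c2')   # '12.4'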
You may do it in two steps using re.findall():
Step 1: Extract the substring matching your pattern from the main string:
>>> import re
>>> my_file = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> my_content = re.findall(r'_([A-H])(0[0-9]|1[0-2])\.([a-d])([1-4])_', my_file)
# where content of my_content is: [('H', '12', 'd', '4')]
Step 2: Join the tuple elements to get the row and column values:
>>> row = ".".join(my_content[0][::2])
>>> row
'H.d'
>>> column = ".".join(my_content[0][1::2])
>>> column
'12.4'
I do not believe there is any way to capture everything you want in exactly two named capture groups and one regex call. The most straightforward way I see is to do the following:
>>> import re
>>> source = 'CM00626141_H12.d4_T0001F003L01A02Z03C02.tif'
>>> match = re.search(r'_([A-H])(0[0-9]|1[0-2])\.([a-d])([1-4])_', source)
>>> row, column = '.'.join(match.groups()[0::2]), '.'.join(match.groups()[1::2])
>>> row
'H.d'
>>> column
'12.4'
Alternatively, you might find it more appealing to handle the parsing almost completely in the regex:
>>> row, column = re.sub(
...     r'^.*_([A-H])(0[0-9]|1[0-2])\.([a-d])([1-4])_.*$',
...     r'\1.\3,\2.\4',
...     source).split(',')
>>> row, column
('H.d', '12.4')
I have a string format, let's say, where A = alphanumeric and N = integer, so the template is "AAAAAA-NNNN". The user sometimes will omit the dash, and sometimes the "NNNN" part is only three digits, in which case I need to pad it with a 0. The first digit of "NNNN" has to be 0, so if there is a nonzero digit where "NNNN" would start, it is the last character of "AAAAAA" rather than the first digit of "NNNN". So in essence, given the following inputs I want the following results:
Sample Inputs:
"SAMPLE0001"
"SAMPL1-0002"
"SAMPL3003"
"SAMPLE-004"
Desired Outputs:
"SAMPLE-0001"
"SAMPL1-0002"
"SAMPL3-0003"
"SAMPLE-0004"
I know how to check for this using regular expressions, but essentially I want to do the opposite. I was wondering if there is an easy way to do this other than a nested conditional checking for all these variations. I am using Python and pandas, but either will suffice.
The regex pattern would be:
"[a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9][a-zA-Z0-9]-\d\d\d\d"
or in abbreviated form:
"[a-zA-Z0-9]{6}-[\d]{4}"
It would be possible with two re.sub calls.
>>> import re
>>> s = '''SAMPLE0001
SAMPL1-0002
SAMPL3003
SAMPLE-004'''
>>> print(re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)))
SAMPLE-0001
SAMPL1-0002
SAMPL3-0003
SAMPLE-0004
Explanation:
re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s) is processed first. It inserts a hyphen after the 6th character of each line, but only if the following character is not already a hyphen.
re.sub(r'(?m)(?<=-)(?=\d{3}$)', '0', re.sub(r'(?m)(?<=^[A-Z\d]{6})(?!-)', '-', s)) takes the above command's output as input and adds a digit 0 right after the hyphen whenever exactly three digits follow it to the end of the line.
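Since the question also mentions pandas, the same two substitutions could be applied to a Series (a sketch; the Series name codes is an assumption, and its values mirror the sample inputs):
import pandas as pd

codes = pd.Series(['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003', 'SAMPLE-004'])
# insert the missing hyphen, then pad the three-digit part with a leading 0
codes = codes.str.replace(r'(?<=^[A-Z\d]{6})(?!-)', '-', regex=True)
codes = codes.str.replace(r'(?<=-)(?=\d{3}$)', '0', regex=True)
# 0    SAMPLE-0001
# 1    SAMPL1-0002
# 2    SAMPL3-0003
# 3    SAMPLE-0004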
An alternative solution that uses str.join:
import re
inputs = ['SAMPLE0001', 'SAMPL1-0002', 'SAMPL3003','SAMPLE-004']
outputs = []
for input_ in inputs:
    m = re.match(r'(\w{6})-?\d?(\d{3})', input_)
    outputs.append('-0'.join(m.groups()))

print(outputs)
# ['SAMPLE-0001', 'SAMPL1-0002', 'SAMPL3-0003', 'SAMPLE-0004']
We are matching the regex (\w{6})-?\d?(\d{3}) against the input strings and joining the captured groups with the string '-0'. This is very simple and fast.
Let me know if you need a more in-depth explanation of the regex itself.