I have a dataframe with a column that looks like this:
0 EIAB^EIAB^6
1 8W^W844^A
2 8W^W844^A
3 8W^W858^A
4 8W^W844^A
...
826136 EIAB^EIAB^6
826137 SICU^6124^A
826138 SICU^6124^A
826139 SICU^6128^A
826140 SICU^6128^A
I just want to keep everything before the second caret, e.g.: 8W^W844, what regex would I use in Python? Similarly PACU^SPAC^06 would be PACU^SPAC. And to apply it to the whole column.
I tried r'[\^].+$' since I thought it would take the last caret and everything after, but it didn't work.
You can negate the character class to find everything except ^ and put it in a capture group. You don't need to escape the ^ inside the character class, but you do need to escape the one outside it.
re.match(r"([^^]+\^[^^]+)", "8W^W844^A").group(1)
This is quite useful in a pandas dataframe. Assuming you want to do this on a single column you can extract the string you want with
df['col'].str.extract(r'^([^^]+\^[^^]+)', expand=False)
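For example, applied to a few of the sample values:
import pandas as pd

col = pd.Series(["EIAB^EIAB^6", "8W^W844^A", "SICU^6124^A"])
print(col.str.extract(r'^([^^]+\^[^^]+)', expand=False))
# 0    EIAB^EIAB
# 1      8W^W844
# 2    SICU^6124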
NOTE
Originally, I used replace, but the extract solution suggested in the comments executed in 1/4 the time.
import pandas as pd
import numpy as np
from timeit import timeit
df = pd.DataFrame({"foo":np.arange(1_000_000)})
df["bar"] = "8W^W844^A"
df2 = df.copy()
def t1():
    df.bar.str.replace(r"([^^]+\^[^^]+).*", r"\1", regex=True)
def t2():
    df.bar.str.extract(r'^([^^]+\^[^^]+)', expand=False)
print("replace", timeit("t1()", globals=globals(), number=20))
print("extract", timeit("t2()", globals=globals(), number=20))
output
replace 39.73989862400049
extract 9.910304663004354
I don't think regex is really necessary here, just slice the string up to the position of the second caret:
>>> s = 'PACU^SPAC^06'
>>> s[:s.find("^", s.find("^") + 1)]
'PACU^SPAC'
Explanation: str.find accepts a second argument of where to start the search, place it just after the position of the first caret.
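To apply the same slicing to a whole column (assuming the column is named col, as in the other answer), map works:
# assumes every value has at least two carets, as in the sample data
df['col'] = df['col'].map(lambda s: s[:s.find("^", s.find("^") + 1)])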
def split_it(email):
    return re.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)", s)
df['email_list'] = df['email'].apply(lambda x: split_it(x))
This code seems to work for the first row of the df, but then will print the result of the first row on all other rows.
Is it not iterating through all rows? Or does it print the result of row 1 on all rows?
The reason every row shows the first row's result is that split_it searches the global variable s instead of its email parameter, so the same string is processed for every row. In any case, you do not need apply here; use Series.str.findall directly:
df['email_list'] = df['email'].str.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)")
If there are several emails per row, you can join the results:
df['email_list'] = df['email'].str.findall(r"[^#]+#[^#]+(?:\.com|\.se|\.br|\.org)").str.join(", ")
Note that the email pattern can be enhanced in many ways, but I would add \s into the negated character classes to exclude whitespace matching, and move \. outside the group to avoid repetition:
r"[^\s#]+#[^\s#]+\.(?:com|se|br|org)"
I'm trying to replace all but the first occurrence of a text string in an entire column. My specific case is replacing underscores with periods in data that looks like client_19_Aug_21_22_2022 and I need this to be client_19.Aug.21.22.2022
if I use [1], I get this error: string index out of range
but [:1] does all occurrences (it doesn't skip the first one)
[1:] inserts . after every character but doesn't find _ and replace
df1['Client'] = df1['Client'].str.replace('_'[:1],'.')
Not the simplest, but a working solution:
import re
df1['Client'] = df1['Client'].apply(lambda s: re.sub(r'^(.*?)\.', r'\1_', s.replace('_', '.')))
In the lambda we first replace every _ with ., then change the first . back to _. The lambda is then applied to each value in the column.
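For example, on one of the sample values:
import re

s = "client_19_Aug_21_22_2022"
step1 = s.replace('_', '.')                 # 'client.19.Aug.21.22.2022'
print(re.sub(r'^(.*?)\.', r'\1_', step1))   # client_19.Aug.21.22.2022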
Pandas Series have a .map method that you can use to apply an arbitrary function to every row in the Series.
In your case you can write your own replace_underscores_except_first
function, looking something like:
def replace_underscores_except_first(s):
    newstring = ''
    # Some logic here to handle replacing all but first.
    # You probably want a for loop with some conditional checking
    return newstring
and then pass that to .map like:
df1['Client'] = df1['Client'].map(replace_underscores_except_first)
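One possible way to fill in that skeleton, using str.partition (a sketch, not the only approach):
def replace_underscores_except_first(s):
    # Keep everything up to and including the first underscore,
    # then swap the remaining underscores for dots
    first, sep, rest = s.partition('_')
    return first + sep + rest.replace('_', '.')

df1['Client'] = df1['Client'].map(replace_underscores_except_first)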
An example using map: in the function, check if the string contains an underscore; if it does, split on it and join all parts except the first back together with dots.
import pandas as pd
items = [
    "client_19_Aug_21_22_2022",
    "client123"
]
def replace_underscore_with_dot_except_first(s):
    if "_" in s:
        parts = s.split("_")
        return f"{parts[0]}_{'.'.join(parts[1:])}"
    return s
df1 = pd.DataFrame(items, columns=["Client"])
df1['Client'] = df1['Client'].map(replace_underscore_with_dot_except_first)
print(df1)
Output
Client
0 client_19.Aug.21.22.2022
1 client123
I am cleaning data in my pandas dataframe, and I hope there is a better way than mine to do this.
The "count" column in my pandas dataframe has input like this:
~186-205
4 and 4
200
800-1000
550-550[2]
10, 20 or 50
5 (four score and bla bla)
38 or 30
88-80
If somebody could tell me how to add numbers together if they say "x and x" that would be great.
However, my main goal is just to get the lowest number from each row and everything else gone.
I succeed almost entirely with my solution:
df['Count'] = df['Count'].str.replace(r"\(.*\)","") #all square brackets with content
df['Count'] = df['Count'].str.replace(r"\[.*\]","") #all square brackets with content
df['Count'] = df['Count'].str.replace("(−).*","") #For one type of hyphens
df['Count'] = df['Count'].str.replace("(-).*","") #for another type of hyphens
df['Count'] = df['Count'].str.replace("(—).*","") #for yet another type of hyphens
df['Count'] = df['Count'].str.replace("(\u2013).*","") #because of different formating for hyphens
df['Count'] = df['Count'].str.replace("(or).*","") #for other alternatives, remove
df['Count'] = df['Count'].str.replace("(,).*","") #everything after commas
df['Count'] = df['Count'].replace(r'\D+', "", regex=True) #everything but numbers
Any suggestions to make this more elegant, either in a function, a for loop, or just something smarter, would be appreciated.
Thank you for your time.
Regarding your approach of stripping out unneeded symbols from the values: you can use the built-in re module to collect all numbers in the string and just take the lowest one:
import re
min(map(int, re.findall(r'[0-9]+', value)))
To evaluate simple Python expressions you might try the built-in eval function, but if you need to support other operations, like 'and' meaning the numbers should be added together, you will probably need to write a small parser; there are good articles on parsers and their parts worth checking.
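A small sketch of that idea without a full parser, assuming the only extra rule you need is summing when the text contains "and":
import re

def parse_count(value):
    numbers = list(map(int, re.findall(r'[0-9]+', value)))
    # Assumed rule: "x and y" means the numbers should be added;
    # otherwise keep the smallest number (e.g. the low end of a range)
    if ' and ' in value:
        return sum(numbers)
    return min(numbers)

print(parse_count('4 and 4'))    # 8
print(parse_count('800-1000'))   # 800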
Edit:
To apply it to the whole column, extract the smallest-number logic into a function and then apply that function:
import re
def get_min_number(value):
    return min(map(int, re.findall(r'[0-9]+', value)))
df['Count'].apply(get_min_number)
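Applied to a few of the sample values:
import pandas as pd

counts = pd.Series(['~186-205', '200', '800-1000', '38 or 30'])
print(counts.apply(get_min_number))  # get_min_number as defined above
# 0    186
# 1    200
# 2    800
# 3     30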
Here is an example:
cars2 = {'Brand': ['Hon*da\nCivic', 'BM*AMT*B6*W'],'Price': [22000, 55000]}
df2 = pd.DataFrame(cars2, columns = ['Brand', 'Price'])
df2['Allowed_Amount'] = np.where(
    df2['Brand'].apply(lambda x: x.count("AMT" + "*" + "B6") > 0),
    df2['Brand'].str.split("AMT" + "*").str[1].str.split("B6").str[1].str[1:].str.split('\n').str[0], 0.00)
Output:
Brand Price Allowed_Amount
0 Hon*da\nCivic 22000 0
1 BM*AMT*B6*W 55000 W
Which is exactly what I need.
However, if the df contains only one row, which does not satisfy the condition, I get an error:
cars = {'Brand': ['Hon*da\nCivic'],'Price': [22000]}
df = pd.DataFrame(cars, columns = ['Brand', 'Price'])
df['Allowed_Amount'] = np.where(
    df['Brand'].apply(lambda x: x.count("AMT" + "*" + "B6") > 0),
    df['Brand'].str.split("AMT" + "*").str[1].str.split("B6").str[1].str[1:].str.split('\n').str[0], 0.00)
Output:
AttributeError: Can only use .str accessor with string values!
What I need:
Brand Price Allowed_Amount
0 Hon*da\nCivic 22000 0
Why doesn't it just fall back to the default value when the condition is not met? How can I make this code work with one row as well?
The problem with your code is that df['Brand'].str.split("AMT" + "*") in the "negative" case returns a list of size 1 (the whole source string in a single element).
In this case .str[1] (in the following code) yields NaN, and the subsequent .str methods cannot be called on it.
But in Pandas the actual exception is raised only if this happens for every source element, which is exactly the case for df.
I also think that such a long sequence of str.split, str and index
selections is difficult to read.
Try another approach based on extract with a regex:
df['Allowed_Amount'] = df['Brand'].str.extract(r'AMT\*.*?B6.(.*)').fillna(0)
Details of the regex:
AMT\* - Match AMT and an asterisk.
.*? - Match any number of characters, as little as possible (chars
between "AMT*" and "B6", if any). Maybe you can drop this fragment
from the regex.
B6 - These characters match themselves.
. - Match any single char (a counterpart of [1:] in your code).
(.*) - Match text up to a newline (excluding, as the dot does not match
the newline) or to the end of string, as a capturing group, so this
is just the extracted content.
If the above regex doesn't match, NaN is returned for this row.
These NaN values are then replaced with 0, due to call to fillna(0)
afterwards.
Try the same on df2.
So this way you will achieve your desired result with shorter and more readable code.
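For illustration, the same extract on df2 from the question (expand=False is used here just so the single group comes back as a Series):
df2['Brand'].str.extract(r'AMT\*.*?B6.(.*)', expand=False).fillna(0)
# 0    0
# 1    W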
Of course, it requires some knowledge of regular expressions, but it is definitely worth taking some time to learn them.
Edit following the question
To replace the literal star in the regex with a given delimiter,
you can define the following function, generating the content
for the new column:
def myExtract(df, delimiter='*'):
    pat = rf'AMT\{delimiter}B6.(.*)'
    return df['Brand'].str.extract(pat).fillna(0)
As you can see:
the delimiter is incorporated into the regex using the f-string feature (which can co-exist with the r-string prefix),
it must be preceded with a backslash, to treat it literally (not as a special regex char); see the sketch below for a more general way.
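If the delimiter could be something other than a single metacharacter, re.escape is a safer way to build the pattern; a possible variant of the function above (an assumption, not part of the original answer):
import re

def myExtract(df, delimiter='*'):
    # re.escape handles any delimiter, including regex metacharacters
    pat = rf'AMT{re.escape(delimiter)}B6.(.*)'
    return df['Brand'].str.extract(pat).fillna(0)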
And to generate the new column, just call this function, passing at
least the source DataFrame (and optionally the right delimiter):
df['Allowed_Amount'] = myExtract(df); df
The same for df2.
I have the following data stored in my Pandas dataframe:
Factor SimTime RealTime SimStatus
0 Factor[0.48] SimTime[83.01] RealTime[166.95] Paused[F]
1 Factor[0.48] SimTime[83.11] RealTime[167.15] Paused[F]
2 Factor[0.49] SimTime[83.21] RealTime[167.36] Paused[F]
3 Factor[0.48] SimTime[83.31] RealTime[167.57] Paused[F]
I want to create a new dataframe with only everything within [].
I am attempting to use the following code:
df = dataframe.apply(lambda x: x.str.slice(start=x.str.find('[')+1, stop=x.str.find(']')))
However, all I see in df is NaN. Why? What's going on? What should I do to achieve the desired behavior?
Your slice produces NaN because Series.str.slice expects integer start and stop values, not Series of positions. You can use a regex with replace to strip the surrounding text instead:
df.replace(r'\w+\[([\S]+)\]', r'\1', regex=True)
Edit
The replace function of a pandas DataFrame "replace[s] values given in to_replace with value".
Both the target string and the value with which it needs to be replaced can be regular expressions; for that you need to set regex=True in the arguments to replace.
https://regex101.com/r/7KCs6q/1
Look at the link above to see the explanation of the regular expression in detail.
Basically, the target is any run of word characters followed by square brackets containing non-whitespace characters, and the replacement is the non-whitespace content captured inside the brackets.
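For example, with two of the rows from the question:
import pandas as pd

dataframe = pd.DataFrame({
    'Factor':    ['Factor[0.48]', 'Factor[0.49]'],
    'SimTime':   ['SimTime[83.01]', 'SimTime[83.21]'],
    'RealTime':  ['RealTime[166.95]', 'RealTime[167.36]'],
    'SimStatus': ['Paused[F]', 'Paused[F]'],
})

df = dataframe.replace(r'\w+\[([\S]+)\]', r'\1', regex=True)
print(df)
#   Factor SimTime RealTime SimStatus
# 0   0.48   83.01   166.95         F
# 1   0.49   83.21   167.36         F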