Using pandas extract regex with multiple groups - python

I am trying to extract a number from a pandas series of strings. For example consider this series:
s = pd.Series(['a-b-1', 'a-b-2', 'c1-d-5', 'c1-d-9', 'e-10-f-1-3.xl', 'e-10-f-2-7.s'])
0 a-b-1
1 a-b-2
2 c1-d-5
3 c1-d-9
4 e-10-f-1-3.xl
5 e-10-f-2-7.s
dtype: object
There are 6 rows, and three string formats/templates (known). The goal is to extract a number for each of the rows depending on the string. Here is what I came up with:
s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
and this correctly extracts the numbers that I want from each row:
0 1 2
0 1 NaN NaN
1 2 NaN NaN
2 NaN 5 NaN
3 NaN 9 NaN
4 NaN NaN 3
5 NaN NaN 7
However, since I have three groups in the regex, I have 3 columns, and here comes the question:
Can I write a regex with a single group that produces a single column, or do I need to coalesce the three columns into one? If the latter, how can I do that without a loop?
Desired outcome would be a series like:
0 1
1 2
2 5
3 9
4 3
5 7

The simplest thing to do is bfill/ffill across the columns:
(s.str.extract('a-b-([0-9])|c1-d-([0-9])|e-10-f-[0-9]-([0-9])')
 .bfill(axis=1)  # fill each row's NaNs from the later columns, so column 0 ends up with the match
 [0]             # keep only the first column as a Series
)
Output:
0 1
1 2
2 5
3 9
4 3
5 7
Name: 0, dtype: object
Another way is to use optional non-capturing groups for the prefixes:
s.str.extract('(?:a-b-)?(?:c1-d-)?(?:e-10-f-[0-9]-)?([0-9])')
Output:
0
0 1
1 2
2 5
3 9
4 3
5 7

You could use a single capturing group at the end and put the three prefixes in a non-capturing group (?:...).
As they all end with a hyphen, you can move that hyphen to just after the non-capturing group to shorten the pattern a bit:
(?:a-b|c1-d|e-10-f-[0-9])-([0-9])
s.str.extract('(?:a-b|c1-d|e-10-f-[0-9])-([0-9])')
Output:
0
0 1
1 2
2 5
3 9
4 3
5 7

Related

Ignore nan elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save into tdf['b'] the values of df2 at each index found by idxmax() in tdf['a'].
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2.iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this using .loc, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, index 0 of tdf['b'] should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
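For reference, here is a minimal self-contained sketch of this approach, assuming the two sample frames from the question:
import pandas as pd

df1 = pd.DataFrame({'a': [10, 2, 3, 1, 7, 6]})
df2 = pd.DataFrame({'a': [1, 2, 4, 3, 20, 5]})

lookback = 3
# Index (label) of the rolling maximum over the last `lookback` rows.
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
# Map each found index to the corresponding value in df2['a'];
# the NaNs from the incomplete first windows simply stay NaN.
result = s.map(df2['a'])
print(result)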

Drop consecutive duplicates in Pandas dataframe if repeated more than n times

Building off an earlier question/solution, I'm trying to set a parameter that will only remove consecutive duplicates if the same value occurs 5 (or more) times consecutively.
I'm able to apply the solution in the linked post, which uses .shift() to check whether the previous value (or a value a specified number of periods away, via the shift periods parameter) equals the current value. But how could I adjust this to check several consecutive values at once?
Suppose a dataframe that looks like this:
x y
1 2
2 2
3 3
4 3
5 3
6 3
7 3
8 4
9 4
10 4
11 4
12 2
I'm trying to achieve this:
x y
1 2
2 2
3 3
8 4
9 4
10 4
11 4
12 2
Where we lose rows 4,5,6,7 because we found five consecutive 3's in the y column, but keep rows 1,2 because we only find two consecutive 2's in the y column. Similarly, we keep rows 8,9,10,11 because we only find four consecutive 4's in the y column.
Let's try cumsum on the differences to find the consecutive blocks. Then groupby().transform('size') to get the size of the blocks:
thresh = 5
s = df['y'].diff().ne(0).cumsum()
small_size = s.groupby(s).transform('size') < thresh
first_rows = ~s.duplicated()
df[small_size | first_rows]
Output:
x y
0 1 2
1 2 2
2 3 3
7 8 4
8 9 4
9 10 4
10 11 4
11 12 2
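For completeness, a self-contained sketch of the above, assuming the sample frame from the question (the first row of each over-long run is kept, matching the desired output):
import pandas as pd

df = pd.DataFrame({'x': range(1, 13),
                   'y': [2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 2]})

thresh = 5
# Give every run of equal consecutive y values its own block id.
blocks = df['y'].diff().ne(0).cumsum()
# Keep rows belonging to blocks shorter than the threshold...
small_size = blocks.groupby(blocks).transform('size') < thresh
# ...plus the first row of every block, so a long run keeps its first value.
first_rows = ~blocks.duplicated()
print(df[small_size | first_rows])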
Not straightforward; I would go with @Quang Hoang's approach.
Create a column that gives the number of times a value is duplicated. In this case I used np.where() and df.duplicated(), and assigned NaN to any count > 4:
df['g'] = np.where(df.groupby('y').transform(lambda x: x.duplicated(keep='last').count()) > 4, np.nan, 1)
I then create two dataframes: one where I drop all the NaNs, and one with only NaNs. From the NaN one I keep only the row at .last_valid_index(). I then append them, sort by index with .sort_index(), and use .iloc[:, :2] to slice off the helper column created above:
df.dropna().append(df.loc[df[df.g.isna()].last_valid_index()]).sort_index().iloc[:,:2]
x y
0 1.0 2.0
1 2.0 2.0
6 7.0 3.0
7 8.0 4.0
8 9.0 4.0
9 10.0 4.0
10 11.0 4.0
11 12.0 2.0

Replacing values with the next unique one

In my pandas dataframe I have a column of non-unique values.
I want to add a second column that contains the next unique value, i.e.:
col
1
5
5
2
2
4
col addedCol
1 5
5 2
5 2
2 4
2 4
4 (last value doesn't matter)
How can I achieve this using pandas?
To clarify: I want each row to contain the next value that differs from that row's value. I hope that explains it better.
IIUC, you need the next value which is different from the current value.
df.loc[:, 'col2'] = df.drop_duplicates().shift(-1).col
df['col2'].ffill(inplace=True)
col col2
0 1 5.0
1 5 2.0
2 5 2.0
3 2 2.0
(Notice that the last 2.0 value doesn't matter.) As suggested by @MartijnPieters,
df['col2'] = df['col2'].astype(int)
can convert the values back to the original integers if needed.
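A self-contained sketch of this approach, assuming the six-value column from the question:
import pandas as pd

df = pd.DataFrame({'col': [1, 5, 5, 2, 2, 4]})

# Keep the first row of each distinct value, shift it up by one to get the
# "next different value", then forward-fill back onto the duplicated rows.
df['addedCol'] = df.drop_duplicates().shift(-1)['col']
df['addedCol'] = df['addedCol'].ffill()
print(df)
As the later answer points out, this keys on globally distinct values, so non-contiguous repeats of a value need extra care.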
Adding another good solution, from @piRSquared:
df.assign(addedcol=df.index.to_series().shift(-1).map(df.col.drop_duplicates()).bfill())
col addedcol
0 1 5.0
1 5 2.0
2 5 2.0
3 2 NaN
Another example, if df is
col
0 1
1 5
2 5
3 2
4 3
5 3
6 10
7 9
Then
df.loc[:, 'col2'] = df.drop_duplicates().shift(-1).col
df = df.ffill()
yields
col col2
0 1 5.0
1 5 2.0
2 5 2.0
3 2 3.0
4 3 10.0
5 3 10.0
6 10 9.0
7 9 9.0
Using factorize
s=pd.factorize(df.col)[0]
pd.Series(s+1).map(dict(zip(s,df.col)))
Output:
0 5.0
1 2.0
2 2.0
3 NaN
dtype: float64
#df['newadd']=pd.Series(s+1).map(dict(zip(s,df.col))).values
Under @MartijnPieters' condition, using consecutive run ids so each row maps to the next different value:
s=df.col.diff().ne(0).cumsum()
(s+1).map(dict(zip(s,df.col)))
Output:
0 5.0
1 2.0
2 2.0
3 4.0
4 4.0
5 5.0
6 NaN
7 NaN
Name: col, dtype: float64
Setup
Added additional data with multiple clusters
df = pd.DataFrame({'col': [*map(int, '1552554442')]})
Two interpretations
We have to consider the case where non-contiguous clusters exist:
df
col
0 1 # First instance of `1`; next unique is `5`
1 5 # First instance of `5`; next unique is `2`
2 5 # Next unique is `2`
3 2 # First instance of `2`; next unique is `4`, because `5` is not new
4 5 # Next unique is `4`
5 5 # Next unique is `4`
6 4 # First instance of `4`; next unique is null
7 4 # Next unique is null
8 4 # Next unique is null
9 2 # Second time we've seen `2`; should the next unique be null, or `4` as before?
Allowed to look back
Use factorize and add 1. This is very much in the spirit of @Wen's answer:
import numpy as np

i, u = df.col.factorize()
u_ = np.append(u, -1)  # append an integer value to represent null
df.assign(addedcol=u_[i + 1])
col addedcol
0 1 5
1 5 2
2 5 2
3 2 4
4 5 2
5 5 2
6 4 -1
7 4 -1
8 4 -1
9 2 4
Only Forward
Similar to before except we'll track the cumulative maximum factorized value
i, u = df.col.factorize()
u_ = np.append(u, -1) # Append an integer value to represent null
x = np.maximum.accumulate(i)
df.assign(addedcol=u_[x + 1])
col addedcol
0 1 5
1 5 2
2 5 2
3 2 4
4 5 4
5 5 4
6 4 -1
7 4 -1
8 4 -1
9 2 -1
You'll notice that the difference is in the last value. When we can only look forward, we see that there is no next unique value.

What is the difference with or without .loc when using groupby + transform in Pandas?

I am new to Python. Here is my question, which seems really weird to me.
A simple data frame looks like:
a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})
I need to group a1 by Hash, calculate how many rows are in each group, and then add a column to a1 with that count. So I want to use groupby + transform.
When I use:
a1['CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is correct:
Card Hash CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 3 2 3
5 3 3 1
6 4 4 2
7 4 4 2
But when I use:
a1.loc[:,'CustomerCount']=a1.groupby(['Hash']).transform(lambda x: x.shape[0])
The result is:
Card Hash CustomerCount
0 1 1 NaN
1 1 1 NaN
2 2 2 NaN
3 2 2 NaN
4 3 2 NaN
5 3 3 NaN
6 4 4 NaN
7 4 4 NaN
So, why does this happen?
As far as I know, loc and iloc (as in a1.loc[:,'CustomerCount']) are usually recommended over plain column access (as in a1['CustomerCount']). So why does this happen?
Also, I have used loc and iloc many times to create a new column in a data frame, and they usually work. So does this have something to do with groupby + transform?
The difference is how loc handles assigning a DataFrame to a single column. The groupby + transform result is a DataFrame whose only column is Card; when you assign it with .loc, pandas tries to align both the index and the column names, the column names don't match ('Card' vs 'CustomerCount'), and you get NaNs. When you assign via direct column access, pandas treats the one-column result as the values for that column and just does it.
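A minimal sketch illustrating the alignment described above, assuming the a1 from the question (exact behaviour may vary between pandas versions):
import pandas as pd

a1 = pd.DataFrame({'Hash': [1, 1, 2, 2, 2, 3, 4, 4],
                   'Card': [1, 1, 2, 2, 3, 3, 4, 4]})

counts = a1.groupby(['Hash']).transform(lambda x: x.shape[0])
print(type(counts))    # a DataFrame, not a Series
print(counts.columns)  # its only column is named 'Card', not 'CustomerCount'
# .loc aligns on column labels when assigning a DataFrame, so writing `counts`
# into a column named 'CustomerCount' finds no matching column and fills NaN,
# which is exactly what the question observed.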
Reduce to a single column
You can resolve this by reducing the result of the groupby operation to a single column, so the assignment is unambiguous:
a1.loc[:,'CustomerCount'] = a1.groupby(['Hash']).Card.transform('size')
a1
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2
Rename the column
Don't really do this; the other options here are far simpler:
a1.loc[:, 'CustomerCount'] = a1.groupby('Hash').transform(len).rename(
    columns={'Card': 'CustomerCount'})
a1
pd.factorize and np.bincount
What I'd actually do
import numpy as np

f, u = pd.factorize(a1.Hash)             # f: integer code per row, u: unique Hash values
a1['CustomerCount'] = np.bincount(f)[f]  # count rows per code, then broadcast back to each row
a1
Or inline making a copy
a1.assign(CustomerCount=(lambda f: np.bincount(f)[f])(pd.factorize(a1.Hash)[0]))
Hash Card CustomerCount
0 1 1 2
1 1 1 2
2 2 2 3
3 2 2 3
4 2 3 3
5 3 3 1
6 4 4 2
7 4 4 2

Calculate Mean by Groupby, drop some rows with Boolean conditions and then save the file in original format

I have data like this.
I calculate the mean of each ID:
df.groupby(['ID'], as_index= False)['A'].mean()
Now I want to drop all those IDs whose mean value is more than 3:
df.drop(df[df.A > 3].index)
And this is where I am stuck. I want to save the file in its original format (without grouping and without the mean column), but without those IDs whose means were more than 3.
Any idea how I can achieve this? The output would be something like this. Also, I want to know how many unique IDs were removed by the drop.
Use transform to get a Series the same size as the original DataFrame, which makes filtering with boolean indexing possible; note the condition flips from > 3 (drop) to <= 3 (keep):
df1 = df[df.groupby('ID')['A'].transform('mean') <= 3]
print (df1)
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
Details:
print (df.groupby('ID')['A'].transform('mean'))
0 2.000000
1 2.000000
2 2.000000
3 6.666667
4 6.666667
5 6.666667
6 2.250000
7 2.250000
8 2.250000
9 2.250000
Name: A, dtype: float64
print (df.groupby('ID')['A'].transform('mean') <= 3)
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 True
Name: A, dtype: bool
Another solution uses groupby and filter. It is slower than using transform with boolean indexing:
df.groupby('ID').filter(lambda x: x['A'].mean() <= 3)
Output:
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
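The question also asked how many unique IDs were removed. One way, reusing the transform mask from the first solution (a sketch assuming the same df):
mask = df.groupby('ID')['A'].transform('mean') <= 3
removed = df.loc[~mask, 'ID'].nunique()
print(removed)  # number of unique IDs dropped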
