How to groupby consecutive occurrences of duplicates in pandas - python

I have a dataframe which contains two columns, [Name, In.Cl]. I want to group by Name, but based on consecutive occurrences. For example, consider the DataFrame below.
Code to generate the DF:
df=pd.DataFrame({'Name':['A','B','B','A','A','B','C','C','C','B','C'],'In.Cl':[2,1,5,2,4,2,3,1,8,5,7]})
Input:
In.Cl Name
0 2 A
1 1 B
2 5 B
3 2 A
4 4 A
5 2 B
6 3 C
7 1 C
8 8 C
9 5 B
10 7 C
I want to group the rows where the same Name repeats consecutively, e.g. groups [B] (rows 1-2), [A] (rows 3-4), [C] (rows 6-8), etc., and perform a sum operation on the In.Cl column within each group.
Expected Output:
In.Cl Name col1 col2
0 2 A A(1) 2
1 1 B B(2) 6
2 5 B B(2) 6
3 2 A A(2) 6
4 4 A A(2) 6
5 2 B B(1) 2
6 3 C C(3) 12
7 1 C C(3) 12
8 8 C C(3) 12
9 5 B B(1) 5
10 7 C C(1) 7
So far I have tried combinations of duplicated and groupby, but it didn't work as I expected. I think I need something like groupby + consecutive, but I don't have an idea of how to solve this problem.
Any help would be appreciated.

In [37]: g = df.groupby((df.Name != df.Name.shift()).cumsum())
In [38]: df['col1'] = df['Name'] + '(' + g['In.Cl'].transform('size').astype(str) + ')'
In [39]: df['col2'] = g['In.Cl'].transform('sum')
In [40]: df
Out[40]:
Name In.Cl col1 col2
0 A 2 A(1) 2
1 B 1 B(2) 6
2 B 5 B(2) 6
3 A 2 A(2) 6
4 A 4 A(2) 6
5 B 2 B(1) 2
6 C 3 C(3) 12
7 C 1 C(3) 12
8 C 8 C(3) 12
9 B 5 B(1) 5
10 C 7 C(1) 7
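The heart of this answer is the run identifier built from shift and cumsum: each time Name differs from the previous row the comparison yields True, and the cumulative sum turns those change points into consecutive group labels. A minimal sketch of just that key:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B', 'B', 'A', 'A', 'B', 'C', 'C', 'C', 'B', 'C'],
                   'In.Cl': [2, 1, 5, 2, 4, 2, 3, 1, 8, 5, 7]})

# True wherever Name differs from the row above; cumsum numbers the runs.
key = (df['Name'] != df['Name'].shift()).cumsum()
print(key.tolist())  # [1, 2, 2, 3, 3, 4, 5, 5, 5, 6, 7]
```

Grouping by this key (rather than by Name itself) is what keeps the two separate runs of B apart.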

Slightly long-winded answer utilizing itertools.groupby.
For greater than ~1000 rows, use @MaxU's solution - it's faster.
from itertools import groupby, chain, repeat
from operator import itemgetter

chainer = chain.from_iterable

def sumfunc(x):
    # Return (sum of In.Cl values, run length) for one consecutive run.
    return (sum(map(itemgetter(1), x)), len(x))

grouper = groupby(zip(df['Name'], df['In.Cl']), key=itemgetter(0))
summer = [sumfunc(list(j)) for _, j in grouper]

df['Name'] += pd.Series(list(chainer(repeat(j, j) for i, j in summer))).astype(str)
df['col2'] = list(chainer(repeat(i, j) for i, j in summer))
print(df)
In.Cl Name col2
0 2 A1 2
1 1 B2 6
2 5 B2 6
3 2 A2 6
4 4 A2 6
5 2 B1 2
6 3 C3 12
7 1 C3 12
8 8 C3 12
9 5 B1 5
10 7 C1 7

Related

Pandas - replicate rows with new column value from a list for each replication

So I have a data frame that has two columns, State and Cost, and a separate list of new "what-if" costs:
State Cost
A 2
B 9
C 8
D 4
New_Cost_List = [1, 5, 10]
I'd like to replicate all the rows in my data set for each value of New_Cost, adding a new column for each New_Cost for each state.
State Cost New_Cost
A 2 1
B 9 1
C 8 1
D 4 1
A 2 5
B 9 5
C 8 5
D 4 5
A 2 10
B 9 10
C 8 10
D 4 10
I thought a for loop might be appropriate to iterate through, replicating my dataset for the length of the list and adding the values of the list as a new column:
for v in New_Cost_List:
    df_new = pd.DataFrame(np.repeat(df.values, len(New_Cost_List), axis=0))
    df_new.columns = df.columns
    df_new['New_Cost'] = v
The output of this gives me the correct replication of State and Cost but the New_Cost value is 10 for each row. Clearly I'm not connecting how to get it to run through the list for each replicated set, so any suggestions? Or is there a better way to approach this?
EDIT 1
Reducing the number of values in the New_Cost_List from 4 to 3 so there's a difference in row count and length of the list.
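For reference, the loop in the question fails because df_new is rebuilt from scratch on every pass, so only the last value of v survives. A minimal sketch of one way to repair the loop itself is to collect one tagged copy per value and concatenate once at the end:

```python
import pandas as pd

df = pd.DataFrame({'State': ['A', 'B', 'C', 'D'], 'Cost': [2, 9, 8, 4]})
New_Cost_List = [1, 5, 10]

# The original loop overwrote df_new on every iteration, leaving only
# v == 10.  Instead, build one tagged copy per value and concat once.
pieces = [df.assign(New_Cost=v) for v in New_Cost_List]
df_new = pd.concat(pieces, ignore_index=True)
```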
Here is a way using the keys parameter of pd.concat():
(pd.concat([df]*len(New_Cost_List),
           keys=New_Cost_List,
           names=['New_Cost', None])
   .reset_index(level=0))
Output:
New_Cost State Cost
0 1 A 2
1 1 B 9
2 1 C 8
3 1 D 4
0 5 A 2
1 5 B 9
2 5 C 8
3 5 D 4
0 10 A 2
1 10 B 9
2 10 C 8
3 10 D 4
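If the repeated 0-3 index in that output is unwanted, one extra reset_index(drop=True) (an addition to the snippet above, not part of it) yields the flat 0-11 range shown in the question:

```python
import pandas as pd

df = pd.DataFrame({'State': ['A', 'B', 'C', 'D'], 'Cost': [2, 9, 8, 4]})
New_Cost_List = [1, 5, 10]

out = (pd.concat([df] * len(New_Cost_List),
                 keys=New_Cost_List,
                 names=['New_Cost', None])
         .reset_index(level=0)      # lift the key level into a column
         .reset_index(drop=True))   # then discard the repeated 0-3 index
```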
If I understand your question correctly, this should solve your problem.
df['New Cost'] = new_cost_list
df = pd.concat([df]*len(new_cost_list), ignore_index=True)
Output:
State Cost New Cost
0 A 2 1
1 B 9 5
2 C 8 10
3 D 4 15
4 A 2 1
5 B 9 5
6 C 8 10
7 D 4 15
8 A 2 1
9 B 9 5
10 C 8 10
11 D 4 15
12 A 2 1
13 B 9 5
14 C 8 10
15 D 4 15
You can use index.repeat and numpy.tile (note numpy.tile, not numpy.repeat: each row is repeated consecutively, so the whole list must cycle once per row):
df2 = (df
       .loc[df.index.repeat(len(New_Cost_List))]
       .assign(**{'New_Cost': np.tile(New_Cost_List, len(df))})
       )
or, simply, with a cross merge:
df2 = df.merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
output:
State Cost New_Cost
0 A 2 1
0 A 2 5
0 A 2 10
1 B 9 1
1 B 9 5
1 B 9 10
2 C 8 1
2 C 8 5
2 C 8 10
3 D 4 1
3 D 4 5
3 D 4 10
For the provided order:
(df
 .merge(pd.Series(New_Cost_List, name='New_Cost'), how='cross')
 .sort_values(by='New_Cost', kind='stable')
 .reset_index(drop=True)
)
output:
State Cost New_Cost
0 A 2 1
1 B 9 1
2 C 8 1
3 D 4 1
4 A 2 5
5 B 9 5
6 C 8 5
7 D 4 5
8 A 2 10
9 B 9 10
10 C 8 10
11 D 4 10

Construct a df such that every number within a range gets the value 'A' assigned, knowing the start and end of the range of values that belong to 'A'

Suppose I have the following Pandas dataframe:
In[285]: df = pd.DataFrame({'Name':['A','B'], 'Start': [1,6], 'End': [4,12]})
In [286]: df
Out[286]:
Name Start End
0 A 1 4
1 B 6 12
Now I would like to construct the dataframe as follows:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
My biggest struggle is in getting the 'Name' column right. Is there a smart way to do this in Python?
I would do pd.concat on a generator expression:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1)})
            .assign(Name=n)
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
Output:
Number Name
0 1 A
1 2 A
2 3 A
3 4 A
0 6 B
1 7 B
2 8 B
3 9 B
4 10 B
5 11 B
6 12 B
Update: As commented by @rafaelc:
pd.concat(pd.DataFrame({'Number': np.arange(s, e + 1), 'Name': n})
          for n, s, e in zip(df['Name'], df['Start'], df['End']))
works just fine.
Let us do it with this example (with 3 names):
import pandas as pd
df = pd.DataFrame({'Name':['A','B','C'], 'Start': [1,6,18], 'End': [4,12,20]})
You may create the target columns first, using list comprehensions:
name = [row.Name for i, row in df.iterrows() for _ in range(row.End - row.Start + 1)]
number = [k for i, row in df.iterrows() for k in range(row.Start, row.End + 1)]
And then you can create the target DataFrame:
expanded = pd.DataFrame({"Name": name, "Number": number})
You get:
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
11 C 18
12 C 19
13 C 20
I'd take advantage of loc and index.repeat for a vectorized solution.
base = df.loc[df.index.repeat(df['End'] - df['Start'] + 1), ['Name', 'Start']]
base['Start'] += base.groupby(level=0).cumcount()
Name Start
0 A 1
0 A 2
0 A 3
0 A 4
1 B 6
1 B 7
1 B 8
1 B 9
1 B 10
1 B 11
1 B 12
Of course we can rename the columns and reset the index at the end, for a nicer showing.
base.rename(columns={'Start': 'Number'}).reset_index(drop=True)
Name Number
0 A 1
1 A 2
2 A 3
3 A 4
4 B 6
5 B 7
6 B 8
7 B 9
8 B 10
9 B 11
10 B 12
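As one more alternative (not from the original answers), DataFrame.explode can expand a column of per-row ranges in a single chain; the exploded Number column comes out with object dtype, so a final astype is included:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Start': [1, 6], 'End': [4, 12]})

# Build one list of numbers per row, then emit one output row per element.
out = (df.assign(Number=[list(range(s, e + 1))
                         for s, e in zip(df['Start'], df['End'])])
         .explode('Number')[['Name', 'Number']]
         .astype({'Number': int})
         .reset_index(drop=True))
```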

How to merge two dataframes without being based on any index

Suppose I have two dataframes X and Y:
import pandas as pd
X = pd.DataFrame({'A':[1,4,7],'B':[2,5,8],'C':[3,6,9]})
Y = pd.DataFrame({'D':[1],'E':[11]})
In [4]: X
Out[4]:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
In [6]: Y
Out[6]:
D E
0 1 11
and then, I want to get the following result dataframe:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
how?
Assuming that Y contains only one row:
In [9]: X.assign(**Y.to_dict('records')[0])
Out[9]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
or a much nicer alternative from @piRSquared:
In [27]: X.assign(**Y.iloc[0])
Out[27]:
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
Helper dict:
In [10]: Y.to_dict('records')[0]
Out[10]: {'D': 1, 'E': 11}
Here is another way
Y2 = pd.concat([Y]*3, ignore_index=True)  # this duplicates the rows
Which produces:
D E
0 1 11
1 1 11
2 1 11
Then concat once again:
pd.concat([X,Y2], axis =1)
A B C D E
0 1 2 3 1 11
1 4 5 6 1 11
2 7 8 9 1 11
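Since pandas 1.2 the same broadcast can also be written as a cross join, which additionally handles a multi-row Y (every pairing of rows is produced):

```python
import pandas as pd

X = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C': [3, 6, 9]})
Y = pd.DataFrame({'D': [1], 'E': [11]})

# A cross join pairs every row of X with every row of Y; with a
# single-row Y this simply broadcasts D and E across all of X.
result = X.merge(Y, how='cross')
```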

How can I split the dataframe adjusting some condition? (divide by columns, change structure by T)

I have dataframe like
df:
id f1 c1 c2
0 a x 1 2
1 b x 3 4
3 c x 5 6
4 a y 7 8
5 b y 9 10
6 c y 11 12
and expected result
dfX (filtered f1=x value):
newid a b c
0 c1 1 3 5
1 c2 2 4 6
dfY (filtered f1=y value):
0 c1 7 9 11
1 c2 8 10 12
I need to split the dataframe by the f1 value and change the structure.
I tried it with 5-6 for loops and more than 20-30 lines, and it worked, but it seems that's not the best way to do it.
It would be appreciated if you could give me some tips on handling the data in a more efficient way.
# Set id as the index, transpose the DF, and rename the column.
df[df.f1=='x'][['id','c1','c2']].set_index('id').T.reset_index().rename(columns={'index':'newid'})
Out[115]:
id newid a b c
0 c1 1 3 5
1 c2 2 4 6
Similarly you can filter f1 on 'y':
df[df.f1=='y'][['id','c1','c2']].set_index('id').T.reset_index().rename(columns={'index':'newid'})
Out[120]:
id newid a b c
0 c1 7 9 11
1 c2 8 10 12
You can use groupby.
>>> for _, b in df.groupby(['f1']):
...     print(b.pivot_table(columns='id').reset_index().rename(columns={'index': 'newid'}))
...
id newid a b c
0 c1 1 3 5
1 c2 2 4 6
id newid a b c
0 c1 7 9 11
1 c2 8 10 12
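To keep the per-value frames instead of just printing them, the same groupby can feed a dict comprehension (a sketch; values=['c1', 'c2'] is passed explicitly so the non-numeric f1 column is left out of the aggregation):

```python
import pandas as pd

df = pd.DataFrame({'id': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'f1': ['x', 'x', 'x', 'y', 'y', 'y'],
                   'c1': [1, 3, 5, 7, 9, 11],
                   'c2': [2, 4, 6, 8, 10, 12]})

# One reshaped frame per f1 value, keyed by that value.
frames = {k: (g.pivot_table(columns='id', values=['c1', 'c2'])
                .reset_index()
                .rename(columns={'index': 'newid'}))
          for k, g in df.groupby('f1')}

dfX = frames['x']   # rows c1/c2, columns a/b/c
dfY = frames['y']
```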

Repeating items in a data frame using pandas

I have the following dataFrame:
id z2 z3 z4
1 2 a fine
2 7 b good
3 9 c delay
4 30 d cold
I am going to generate a data frame by repeating each row two extra times, except the items in column z4 (which should not be repeated). How can I do it using python and pandas?
The output should be like this:
id z2 z3 z4
1 2 a fine
1 2 a
1 2 a
2 7 b good
2 7 b
2 7 b
3 9 c delay
3 9 c
3 9 c
4 30 d cold
4 30 d
4 30 d
Another way to do this is to use indexing:
Notice that df.iloc[[0, 1, 2, 3]*2, :3] will give you two extra copies of the first three columns.
These can then be concatenated with the original df. Fill the NA with empty strings, then sort on index values and reset the index (dropping the old one). All of which can be chained:
pd.concat([df, df.iloc[[0, 1, 2, 3]*2, :3]]).fillna('').sort_index().reset_index(drop=True)
which produces:
id z2 z3 z4
0 1 2 a fine
1 1 2 a
2 1 2 a
3 2 7 b good
4 2 7 b
5 2 7 b
6 3 9 c delay
7 3 9 c
8 3 9 c
9 4 30 d cold
10 4 30 d
11 4 30 d
groupby and apply will do the trick:
def func(group):
    copy = group.copy()
    copy['z4'] = ""
    return pd.concat((group, copy, copy))

df.groupby('id').apply(func).reset_index(drop=True)
id z2 z3 z4
0 1 2 a fine
1 1 2 a
2 1 2 a
3 2 7 b good
4 2 7 b
5 2 7 b
6 3 9 c delay
7 3 9 c
8 3 9 c
9 4 30 d cold
10 4 30 d
11 4 30 d
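A vectorized alternative to both answers (a sketch, not from the original post): repeat every row three times with index.repeat, then blank z4 on the copies:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'z2': [2, 7, 9, 30],
                   'z3': ['a', 'b', 'c', 'd'],
                   'z4': ['fine', 'good', 'delay', 'cold']})

# Three copies of every row, then keep z4 only on the first of each triple.
out = df.loc[df.index.repeat(3)].reset_index(drop=True)
out.loc[out.index % 3 != 0, 'z4'] = ''
```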
