cumcount groupby --- how to list by groups - python

My question is related to this question
import pandas as pd
df = pd.DataFrame(
[['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
columns=['c1', 'c2', 'v1'])
df['CNT'] = df.groupby(['c1', 'c2']).cumcount()+1
I got column 'CNT'. But I'd like to break it apart according to group 'c2' to obtain cumulative count of 'X' and 'Y' respectively.
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
Any suggestions? I am just starting to explore Pandas and appreciate your help.

I don't know a way to do this directly, but starting from the calculated CNT column, you can do it as follows:
Make the Xcnt and Ycnt columns:
In [13]: df['Xcnt'] = df['CNT'][df['c2']=='X']
In [14]: df['Ycnt'] = df['CNT'][df['c2']=='Y']
In [15]: df
Out[15]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 NaN
1 A X 5 2 2 NaN
2 A Y 7 1 NaN 1
3 A Y 1 2 NaN 2
4 B X 3 1 1 NaN
5 B X 1 2 2 NaN
6 B X 3 3 3 NaN
7 B Y 1 1 NaN 1
8 C X 7 1 1 NaN
9 C Y 4 1 NaN 1
10 C Y 1 2 NaN 2
11 C Y 6 3 NaN 3
Next, we want to fill the NaN's per group of c1 by forward filling:
In [23]: df['Xcnt'] = df.groupby('c1')['Xcnt'].ffill()
In [24]: df['Ycnt'] = df.groupby('c1')['Ycnt'].ffill().fillna(0)
In [25]: df
Out[25]:
c1 c2 v1 CNT Xcnt Ycnt
0 A X 3 1 1 0
1 A X 5 2 2 0
2 A Y 7 1 2 1
3 A Y 1 2 2 2
4 B X 3 1 1 0
5 B X 1 2 2 0
6 B X 3 3 3 0
7 B Y 1 1 3 1
8 C X 7 1 1 0
9 C Y 4 1 1 1
10 C Y 1 2 1 2
11 C Y 6 3 1 3
For Ycnt, an extra fillna was needed to convert the NaNs to 0s where a group starts with NaNs (there is nothing to fill forward from).
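A more direct route, sketched here as an alternative rather than as part of the answer above: count each label with a cumulative sum of an indicator within every c1 group. It assumes the rows are already in the order in which they should be counted.

```python
import pandas as pd

df = pd.DataFrame(
    [['A', 'X', 3], ['A', 'X', 5], ['A', 'Y', 7], ['A', 'Y', 1],
     ['B', 'X', 3], ['B', 'X', 1], ['B', 'X', 3], ['B', 'Y', 1],
     ['C', 'X', 7], ['C', 'Y', 4], ['C', 'Y', 1], ['C', 'Y', 6]],
    columns=['c1', 'c2', 'v1'])

# For each label, flag the matching rows with 1 and take a running
# total of those flags inside every c1 group.
for label in ['X', 'Y']:
    flag = df['c2'].eq(label).astype(int)
    df[label + 'cnt'] = flag.groupby(df['c1']).cumsum()

print(df)
```

Because the counts come out as integers and start at 0, no NaN filling step is needed.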

Related

Compute a new column as delta to another value in pandas dataframe

I have this data frame:
rank cost brand city
0 1 1 a x
1 2 2 a x
2 3 3 a x
3 4 4 a x
4 5 5 a x
5 1 2 b y
6 2 4 b y
7 3 6 b y
8 4 8 b y
9 5 10 b y
I want to create a new column 'delta' which contains the cost difference compared to rank 1 for a certain brand-city combination.
Desired outcome:
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
This answer gave me some hints, but I am stuck on the fact that I cannot map a series to a multi-index.
To save on typing, here is some code:
data = {'rank': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'cost': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
'brand': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
'city': ['x', 'x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
'delta': ['0', '1', '2', '3', '4', '0', '2', '4', '6', '8']
}
This is transform + first
df['delta']=df.cost-df.groupby(['brand','city'])['cost'].transform('first')
df
Out[291]:
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
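Note that transform('first') takes whichever row comes first in each group, so it relies on the rank 1 row being listed first. If that ordering is not guaranteed, one option is to look up the rank 1 cost explicitly. The following is a minimal sketch; the names base and base_cost are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'rank':  [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
    'cost':  [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
    'brand': ['a'] * 5 + ['b'] * 5,
    'city':  ['x'] * 5 + ['y'] * 5,
})

# Cost of the rank 1 row for every (brand, city) pair, indexed by that pair.
# Assumes exactly one rank 1 row per pair.
base = (df[df['rank'].eq(1)]
        .set_index(['brand', 'city'])['cost']
        .rename('base_cost'))

# join() aligns the two key columns with the two index levels of base.
df['delta'] = df['cost'] - df.join(base, on=['brand', 'city'])['base_cost']

print(df)
```

This also addresses the difficulty mentioned in the question: joining on a list of columns matches them against the levels of a MultiIndex, so no manual mapping is needed.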
Use groupby with apply
data['delta'] = (data.groupby(['brand', 'city'], group_keys=False)
.apply(lambda x: x['cost'] - x[x['rank'].eq(1)]['cost'].values[0]))
data
rank cost brand city delta
0 1 1 a x 0
1 2 2 a x 1
2 3 3 a x 2
3 4 4 a x 3
4 5 5 a x 4
5 1 2 b y 0
6 2 4 b y 2
7 3 6 b y 4
8 4 8 b y 6
9 5 10 b y 8
A solution without groupby: sort by brand, city and rank, use pd.merge_ordered to forward-fill the rank 1 cost, and then use assign to create the delta column.
In [1077]: pd.merge_ordered(data.sort_values(['brand', 'city', 'rank']), data.query('rank == 1'), how='left', on=['brand', 'city', 'rank'], fill_method='ffill').assign(delta=lambda x: x.cost_x - x.cost_y).drop(columns='cost_y')
Out[1077]:
brand city cost_x rank delta
0 a x 1 1 0
1 a x 2 2 1
2 a x 3 3 2
3 a x 4 4 3
4 a x 5 5 4
5 b y 2 1 0
6 b y 4 2 2
7 b y 6 3 4
8 b y 8 4 6
9 b y 10 5 8

Adding dataframes only on selected rows

I have a dataframe like this
import pandas as pd
df = pd.DataFrame({'id' : [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],\
'crit_1' : [0, 0, 1, 0, 0, 0, 1, 0, 0, 1], \
'crit_2' : ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],\
'value' : [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]}, \
columns=['id' , 'crit_1', 'crit_2', 'value' ])
df
Out[41]:
id crit_1 crit_2 value
0 1 0 a 3
1 1 0 a 4
2 1 1 b 3
3 1 0 b 5
4 2 0 a 1
5 2 0 b 2
6 2 1 a 4
7 3 0 a 6
8 3 0 a 2
9 3 1 a 3
I pull a subset out of this frame based on crit_1
df_subset = df[(df['crit_1']==1)]
Then I perform a complex operation (the nature of which is unimportant for this question) on that subset, producing a new column
df_subset['some_new_val'] = [1, 4,2]
df_subset
Out[42]:
id crit_1 crit_2 value some_new_val
2 1 1 b 3 1
6 2 1 a 4 4
9 3 1 a 3 2
Now I want to add some_new_val back into my original dataframe, onto the column value. However, I only want to add it where there is a match on id and crit_2.
The result should look like this
id crit_1 crit_2 value new_value
0 1 0 a 3 3
1 1 0 a 4 4
2 1 1 b 3 4
3 1 0 b 5 6
4 2 0 a 1 1
5 2 0 b 2 6
6 2 1 a 4 4
7 3 0 a 6 8
8 3 0 a 2 4
9 3 1 a 3 5
You can use merge with left join and then add:
#filter only columns for join and for append
cols = ['id','crit_2', 'some_new_val']
df = pd.merge(df, df_subset[cols], on=['id','crit_2'], how='left')
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 NaN
1 1 0 a 4 NaN
2 1 1 b 3 1.0
3 1 0 b 5 1.0
4 2 0 a 1 4.0
5 2 0 b 2 NaN
6 2 1 a 4 4.0
7 3 0 a 6 2.0
8 3 0 a 2 2.0
9 3 1 a 3 2.0
df['some_new_val'] = df['some_new_val'].add(df['value'], fill_value=0)
print (df)
id crit_1 crit_2 value some_new_val
0 1 0 a 3 3.0
1 1 0 a 4 4.0
2 1 1 b 3 4.0
3 1 0 b 5 6.0
4 2 0 a 1 5.0
5 2 0 b 2 2.0
6 2 1 a 4 8.0
7 3 0 a 6 8.0
8 3 0 a 2 4.0
9 3 1 a 3 5.0
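A variant of the same idea, as a sketch: keep value untouched and write the sum to a separate new_value column. The validate argument is an added safeguard (not part of the answer above) that raises an error if df_subset contains a repeated (id, crit_2) pair, which would otherwise duplicate rows in the result.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                   'crit_1': [0, 0, 1, 0, 0, 0, 1, 0, 0, 1],
                   'crit_2': ['a', 'a', 'b', 'b', 'a', 'b', 'a', 'a', 'a', 'a'],
                   'value': [3, 4, 3, 5, 1, 2, 4, 6, 2, 3]})

# .copy() avoids assigning into a view of df.
df_subset = df[df['crit_1'] == 1].copy()
df_subset['some_new_val'] = [1, 4, 2]

merged = pd.merge(df, df_subset[['id', 'crit_2', 'some_new_val']],
                  on=['id', 'crit_2'], how='left',
                  validate='many_to_one')

# Rows without a match get NaN from the merge; treat that as adding 0.
merged['new_value'] = merged['value'] + merged['some_new_val'].fillna(0)
merged = merged.drop(columns='some_new_val')

print(merged)
```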

Split a MultiIndex DataFrame based on another DataFrame

Say you have a multiindex DataFrame
     x  y  z
a 1  0  1  2
  2  3  4  5
b 1  0  1  2
  2  3  4  5
  3  6  7  8
c 1  0  1  2
  2  0  4  6
Now you have another DataFrame which is
col1 col2
0 a 1
1 b 1
2 b 3
3 c 1
4 c 2
How do you split the multiindex DataFrame based on the one above?
Use loc by tuples:
df = df1.loc[df2.set_index(['col1','col2']).index.tolist()]
print (df)
x y z
a 1 0 1 2
b 1 0 1 2
3 6 7 8
c 1 0 1 2
2 0 4 6
df = df1.loc[[tuple(x) for x in df2.values.tolist()]]
print (df)
x y z
a 1 0 1 2
b 1 0 1 2
3 6 7 8
c 1 0 1 2
2 0 4 6
Or join:
df = df2.join(df1, on=['col1','col2']).set_index(['col1','col2'])
print (df)
x y z
col1 col2
a 1 0 1 2
b 1 0 1 2
3 6 7 8
c 1 0 1 2
2 0 4 6
Simply using isin
df[df.index.isin(list(zip(df2['col1'],df2['col2'])))]
Out[342]:
0 1 2 3
index1 index2
a 1 1 0 1 2
b 1 1 0 1 2
3 3 6 7 8
c 1 1 0 1 2
2 2 0 4 6
You can also do this with the DataFrame.reindex method, passing it a MultiIndex: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html
import pandas as pd
## Recreate your dataframes
tuples = [('a', 1), ('a', 2),
('b', 1), ('b', 2),
('b', 3), ('c', 1),
('c', 2)]
data = [[1, 0, 1, 2],
[2, 3, 4, 5],
[1, 0, 1, 2],
[2, 3, 4, 5],
[3, 6, 7, 8],
[1, 0, 1, 2],
[2, 0, 4, 6]]
idx = pd.MultiIndex.from_tuples(tuples, names=['index1','index2'])
df= pd.DataFrame(data=data, index=idx)
df2 = pd.DataFrame([['a', 1],
['b', 1],
['b', 3],
['c', 1],
['c', 2]])
# Answer Question
idx_subset = pd.MultiIndex.from_tuples([(a, b) for a, b in df2.values], names=['index1', 'index2'])
out = df.reindex(idx_subset)
print(out)
0 1 2 3
index1 index2
a 1 1 0 1 2
b 1 1 0 1 2
3 3 6 7 8
c 1 1 0 1 2
2 2 0 4 6
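The approaches above differ in how they treat a key that does not exist in the MultiIndex. As a small self-contained sketch using the frames from the question (the extra key ('z', 9) is made up for illustration): isin silently skips it, while reindex keeps it as a row of NaN.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('b', 3), ('c', 1), ('c', 2)])
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5], [0, 1, 2], [3, 4, 5],
                    [6, 7, 8], [0, 1, 2], [0, 4, 6]],
                   index=idx, columns=['x', 'y', 'z'])
df2 = pd.DataFrame({'col1': ['a', 'b', 'b', 'c', 'c'],
                    'col2': [1, 1, 3, 1, 2]})

# One (col1, col2) tuple per row of df2.
keys = list(zip(df2['col1'], df2['col2']))

# Boolean mask: keeps the rows of df1 whose index tuple is in keys.
selected = df1[df1.index.isin(keys)]

# Add a key that is absent from df1 (hypothetical, for illustration).
keys_with_missing = keys + [('z', 9)]
by_isin = df1[df1.index.isin(keys_with_missing)]      # still 5 rows
by_reindex = df1.reindex(
    pd.MultiIndex.from_tuples(keys_with_missing))     # 6 rows, last is NaN

print(selected)
print(by_reindex)
```

Which behaviour is preferable depends on whether a missing key should be ignored or made visible.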

My dataframe has a list as an index, how do I access a cell or edit a cell

My pandas dataframe looks like this
A B C D E
(Name1, 1) NaN NaN NaN NaN NaN
(Name2, 2) NaN NaN NaN NaN NaN
How do I access a particular cell or change the value of a particular cell?
I created the dataframe using this
id=list(product(array1,array2))
data=pd.DataFrame(index=id ,columns=array3)
I think you need MultiIndex:
import numpy as np
import pandas as pd
np.random.seed(124)
array1 = np.array(['Name1','Name2'])
array2 = np.array([1,2])
array3 = np.array(list('ABCDE'))
idx= pd.MultiIndex.from_product([array1,array2])
data=pd.DataFrame(np.random.randint(10, size=[len(idx), len(array3)]),
index=idx ,columns=array3)
print (data)
A B C D E
Name1 1 1 7 2 9 0
2 4 4 5 5 6
Name2 1 9 6 0 8 9
2 9 0 2 2 1
print (data.index)
MultiIndex(levels=[['Name1', 'Name2'], [1, 2]],
labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
data.loc[('Name1', 2), 'B'] = 20
print (data)
A B C D E
Name1 1 1 7 2 9 0
2 4 20 5 5 6
Name2 1 9 6 0 8 9
2 9 0 2 2 1
For more complicated selections, use slicers:
idx = pd.IndexSlice
data.loc[idx['Name1', 2], 'B'] = 20
print (data)
A B C D E
Name1 1 1 7 2 9 0
2 4 20 5 5 6
Name2 1 9 6 0 8 9
2 9 0 2 2 1
idx = pd.IndexSlice
print (data.loc[idx['Name1', 2], 'A'])
4
#select all values with 2 of second level and column A
idx = pd.IndexSlice
print (data.loc[idx[:, 2], 'A'])
Name1 2 4
Name2 2 9
Name: A, dtype: int32
#select 1 from second level and slice between B and D columns
idx = pd.IndexSlice
print (data.loc[idx[:, 1], idx['B':'D']])
B C D
Name1 1 7 2 9
Name2 1 6 0 8
For simpler selections, use DataFrame.xs:
print (data.xs('Name1', axis=0, level=0))
A B C D E
1 1 7 2 9 0
2 4 4 5 5 6
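For a single cell, DataFrame.at is another option alongside loc. The sketch below uses fixed values instead of random ones so that the results are reproducible; the data itself is made up.

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['Name1', 'Name2'], [1, 2]])
data = pd.DataFrame(np.arange(20).reshape(4, 5),
                    index=idx, columns=list('ABCDE'))

# Read and write one cell: the row key is a tuple with one entry per level.
before = data.at[('Name1', 2), 'B']
data.at[('Name1', 2), 'B'] = 20

# All rows whose second level equals 2, column A only.
col_a = data.loc[pd.IndexSlice[:, 2], 'A']

# xs selects on one level and drops that level from the result.
level_2 = data.xs(2, level=1)

print(before, data.at[('Name1', 2), 'B'])
print(col_a)
print(level_2)
```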

Reordering columns in CSV

This question has been posted before, but the requirements were not properly conveyed. I have a CSV file with more than 1000 columns:
A B C D .... X Y Z
1 0 0.5 5 .... 1 7 6
2 0 0.6 4 .... 0 7 6
3 0 0.7 3 .... 1 7 6
4 0 0.8 2 .... 0 7 6
Here X, Y and Z are the 999th, 1000th and 1001st columns, and A, B, C and D are the 1st, 2nd, 3rd and 4th. I need to reorder the columns in such a way that it gives me the following.
D Y Z A B C ....X
5 7 6 1 0 0.5 ....1
4 7 6 2 0 0.6 ....0
3 7 6 3 0 0.7 ....1
2 7 6 4 0 0.8 ....0
That is, the 4th column becomes the 1st, the 1000th and 1001st columns become the 2nd and 3rd, and the other columns are shifted right accordingly.
So the question is how to reorder your columns in a custom way.
For example you have the following DF and you want to reorder your columns in the following way (indices):
5, 3, rest...
DF
In [82]: df
Out[82]:
A B C D E F G
0 1 0 0.5 5 1 7 6
1 2 0 0.6 4 0 7 6
2 3 0 0.7 3 1 7 6
3 4 0 0.8 2 0 7 6
columns
In [83]: cols = df.columns.tolist()
In [84]: cols
Out[84]: ['A', 'B', 'C', 'D', 'E', 'F', 'G']
reordered:
In [88]: cols = [cols.pop(5)] + [cols.pop(3)] + cols
In [89]: cols
Out[89]: ['F', 'D', 'A', 'B', 'C', 'E', 'G']
In [90]: df[cols]
Out[90]:
F D A B C E G
0 7 5 1 0 0.5 1 6
1 7 4 2 0 0.6 0 6
2 7 3 3 0 0.7 1 6
3 7 2 4 0 0.8 0 6
In [4]: df
Out[4]:
A B C D X Y Z
0 1 0 0.5 5 1 7 6
1 2 0 0.6 4 0 7 6
2 3 0 0.7 3 1 7 6
3 4 0 0.8 2 0 7 6
In [5]: df.reindex(columns=['D','Y','Z','A','B','C','X'])
Out[5]:
D Y Z A B C X
0 5 7 6 1 0 0.5 1
1 4 7 6 2 0 0.6 0
2 3 7 6 3 0 0.7 1
3 2 7 6 4 0 0.8 0
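With more than 1000 columns it is impractical to type every name, so the new order can be built from positions instead. A minimal sketch, using a 7-column stand-in for the real file; front and rest are just illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [0, 0, 0, 0],
                   'C': [0.5, 0.6, 0.7, 0.8], 'D': [5, 4, 3, 2],
                   'X': [1, 0, 1, 0], 'Y': [7, 7, 7, 7], 'Z': [6, 6, 6, 6]})

n = df.shape[1]
# 0-based positions to move to the front: the 4th column, then the last two.
front = [3, n - 2, n - 1]
# Every other position, in its original order.
rest = [i for i in range(n) if i not in front]

reordered = df.iloc[:, front + rest]
print(reordered.columns.tolist())
```

Since n is taken from the frame itself, the same front list should carry over to the full file once it has been loaded with pd.read_csv.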
