I have a pandas data frame in python coming from a pd.concat with a recurring multiindex:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
0 0 8880873
1 1000521
1 0 1135488
1 5388773
No, I will reset only the first index of the multiIndex, so that I get a recurring number on the index. Something like this:
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
In general, I have around 5 Mio records and not the biggest machine. So I'm looking for a memory efficient solution for that.
ignore_index=True in pd.concat do not works, because then I lose the Multiindex.
Many thanks
You can convert first level by get_level_values to_series, then compare it with shifted values and add cumsum for count and last use MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
customer_id
0 0 46841769
1 4683936
1 0 8880872
1 8880812
2 0 8880873
1 1000521
3 0 1135488
1 5388773
Related
I am trying to flatten a Pandas Dataframe MultiIndex so that there is only a single level index. The usual solution based on any number of SE posts is to use the df.reset_index command, but that is just not fixing the problem.
I started out with an Xarray DataArray and converted it to a dataframe. The original dataframe looked like this.
results
simdata a_ss_yr attr attr1 attr2 attr3
run year
0 0 0 0 0 0 0
1 1 6 2 0 4
2 2 4 2 2 0
3 3 1 0 0 1
4 4 2 0 2 0
To flatten the index I used
df.reset_index(drop=True)
This only accomplished this:
run year results
simdata a_ss_yr attr attr1 attr2
0 0 0 0 0 0 0
1 0 1 1 6 2 0
2 0 2 2 4 2 2
3 0 3 3 1 0 0
4 0 4 4 2 0 2
I tried doing the df.reset_index() option more than once, but this is still not flattening the index, and I want to get this to only a single level index.
More specifically I need the "run" and "year" variables to go to the level 0 set of column names, and I need to remove the "result" heading entirely.
I have been reading the Pandas documentation, but it seems like doing this kind of surgery on the index is not really described. Does anyone have a sense of how to do this?
Use first droplevel for remove first level of MultiIndex and then reset_index:
df.columns = df.columns.droplevel(0)
df = df.reset_index()
I am using panda version 0.23.0. I want to use data frame group by function to generate new aggregated columns using [lambda] functions..
My data frame looks like
ID Flag Amount User
1 1 100 123345
1 1 55 123346
2 0 20 123346
2 0 30 123347
3 0 50 123348
I want to generate a table which looks like
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM Flag0_User_Count Flag1_User_Count
1 2 2 0 155 0 2
2 2 0 50 0 2 0
3 1 0 50 0 1 0
here:
Flag0_Count is count of Flag = 0
Flag1_Count is count of Flag = 1
Flag0_Amount_SUM is SUNM of amount when Flag = 0
Flag1_Amount_SUM is SUNM of amount when Flag = 1
Flag0_User_Count is Count of Distinct User when Flag = 0
Flag1_User_Count is Count of Distinct User when Flag = 1
I have tried something like
df.groupby(["ID"])["Flag"].apply(lambda x: sum(x==0)).reset_index()
but it creates a new a new data frame. This means I will have to this for all columns and them merge them together into a new data frame.
Is there an easier way to accomplish this?
Use DataFrameGroupBy.agg by dictionary by column names with aggregate function, then reshape by unstack, flatten MultiIndex of columns, rename columns and last reset_index:
df = (df.groupby(["ID", "Flag"])
.agg({'Flag':'size', 'Amount':'sum', 'User':'nunique'})
.unstack(fill_value=0))
#python 3.6+
df.columns = [f'{i}{j}' for i, j in df.columns]
#python below
#df.columns = [f'{}{}'.format(i, j) for i, j in df.columns]
d = {'Flag0':'Flag0_Count',
'Flag1':'Flag1_Count',
'Amount0':'Flag0_Amount_SUM',
'Amount1':'Flag1_Amount_SUM',
'User0':'Flag0_User_Count',
'User1':'Flag1_User_Count',
}
df = df.rename(columns=d).reset_index()
print (df)
ID Flag0_Count Flag1_Count Flag0_Amount_SUM Flag1_Amount_SUM \
0 1 0 2 0 155
1 2 2 0 50 0
2 3 1 0 50 0
Flag0_User_Count Flag1_User_Count
0 0 2
1 2 0
2 1 0
I am still new to Python pandas' pivot_table and im trying to reshape the data to have a binary indicator if a value is in a certain observation. I have follow some previous codes and got some encouraging results, however instead of 1 and zeros as Is my ideal result I get a sum. Please see a small sample data set below
ID SKILL NUM
1 A 1
1 A 1
1 B 1
2 C 1
3 C 1
3 C 1
3 E 1
The results I am aiming for is:
ID A B C E
1 1 1 0 0
2 0 0 1 0
3 0 0 0 1
My code atm get the following result:
ID A B C E
1 2 1 0 0
2 0 0 2 0
3 0 0 0 1
Should I remove the duplicates first??
The code I'm using atm is below;
df_pivot = df2.pivot_table(index='Job_posting_ID', columns='SKILL', aggfunc=len, fill_value=0)
You can use get_dummies with set_index for indicator columns and then get max values per index:
df = pd.get_dummies(df.set_index('ID')['SKILL']).max(level=0)
For better performance remove duplicates by drop_duplicates and reshape by set_index with unstack:
df = df.drop_duplicates(['ID','SKILL']).set_index(['ID','SKILL'])['NUM'].unstack(fill_value=0)
Solution with pivot, but then is necessary replace NaNs to 0:
df = df.drop_duplicates(['ID','SKILL']).pivot('ID','SKILL','NUM').fillna(0).astype(int)
If want use your solution, just remove duplicates, but better is unstack, beacuse data are never aggregated, because not duplicated pairs ID with SKILL:
df2 = df.drop_duplicates(['ID','SKILL'])
df_pivot = (df2.pivot_table(index='ID',
columns='SKILL',
values='NUM',
aggfunc=len,
fill_value=0))
print (df_pivot)
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
Try like this:
df.pivot_table(index='ID', columns='SKILL', values='NUM', aggfunc=lambda x: len(x.unique()), fill_value=0)
Or this:
df.pivot_table(index='ID', columns='SKILL',aggfunc=lambda x: int(x.any()), fill_value=0)
Whichever suits you best.
You can use aggfunc='any' and convert to int as a separate step. This avoids having to use a lambda / custom function, and may be more efficient.
df_pivot = df.pivot_table(index='ID', columns='SKILL',
aggfunc='any', fill_value=0).astype(int)
print(df_pivot)
NUM
SKILL A B C E
ID
1 1 1 0 0
2 0 0 1 0
3 0 0 1 1
The same would work with aggfunc=len + conversion to int, except this is likely to be more expensive.
I have a simple question which relates to similar questions here, and here.
I am trying to drop all columns from a pandas dataframe, which have only zeroes (vertically, axis=1). Let me give you an example:
df = pd.DataFrame({'a':[0,0,0,0], 'b':[0,-1,0,1]})
a b
0 0 0
1 0 -1
2 0 0
3 0 1
I'd like to drop column asince it has only zeroes.
However, I'd like to do it in a nice and vectorized fashion if possible. My data set is huge - so I don't want to loop. Hence I tried
df = df.loc[(df).any(1), (df!=0).any(0)]
b
1 -1
3 1
Which allows me to drop both columns and rows. But if I just try to drop the columns, locseems to fail. Any ideas?
You are really close, use any - 0 are casted to Falses:
df = df.loc[:, df.any()]
print (df)
b
0 0
1 1
2 0
3 1
If it's a matter of 0s and not sum, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
b
0 0
1 -1
2 0
3 1
Alternatively:
In [296]: df.T[(df != 0).any()].T # or df.loc[:, (df != 0).any()]
Out[296]:
b
0 0
1 -1
2 0
3 1
In [73]: df.loc[:, df.ne(0).any()]
Out[73]:
b
0 0
1 1
2 0
3 1
or:
In [71]: df.loc[:, ~df.eq(0).all()]
Out[71]:
b
0 0
1 1
2 0
3 1
If we want to check those that do NOT sum up to 0:
In [78]: df.loc[:, df.sum().astype(bool)]
Out[78]:
b
0 0
1 1
2 0
3 1
Ok, I admit, I had troubles to really formulate a good header for that. So I will try to make give an example.
This is my sample dataframe:
df = pd.DataFrame([
(1,"a","good"),
(1,"a","good"),
(1,"b","good"),
(1,"c","bad"),
(2,"a","good"),
(2,"b","bad"),
(3,"a","none")], columns=["id", "type", "eval"])
What I do with it is the following:
df.groupby(["id", "type"])["id"].agg({'id':'count'})
This results in:
id
id type
1 a 2
b 1
c 1
2 a 1
b 1
3 a 1
This is fine, although what I will need later on is that e.g. the id would be repeated in every row. But this is not the most important part.
What I would need now is something like this:
id good bad none
id type
1 a 2 2 0 0
b 1 1 0 0
c 1 0 1 0
2 a 1 1 0 0
b 1 0 1 0
3 a 1 0 0 1
And even better would be a result like this, because I will need this back in a dataframe (and finally in an Excel sheet) with all fields populated. In reality, there will be many more columns I am grouping by. They would have to be completely populated as well.
id good bad none
id type
1 a 2 2 0 0
1 b 1 1 0 0
1 c 1 0 1 0
2 a 1 1 0 0
2 b 1 0 1 0
3 a 1 0 0 1
Thank you for helping me out.
You can use groupby + size (last column was added) or value_counts with unstack:
df1 = df.groupby(["id", "type", 'eval'])
.size()
.unstack(fill_value=0)
.rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
df1 = df.groupby(["id", "type"])[ 'eval']
.value_counts()
.unstack(fill_value=0)
.rename_axis(None, axis=1)
print (df1)
bad good none
id type
1 a 0 2 0
b 0 1 0
c 1 0 0
2 a 0 1 0
b 1 0 0
3 a 0 0 1
But for write to excel get:
df1.to_excel('file.xlsx')
So need reset_index last.
df1.reset_index().to_excel('file.xlsx', index=False)
EDIT:
I forget for id column, but it is duplicate column name, so need id1:
df1.insert(0, 'id1', df1.sum(axis=1))