Pandas groupby to create new dataframe with values as columns - python

I want to reshape the data by Date in Python as a dataframe, so that the values of one column become the new columns.
Is there any Pandas function for this?

Create an additional key using cumcount, then pivot (data from jpp):
df.assign(key=df.groupby('Col1').cumcount()).pivot(index='key', columns='Col1', values='Col2')
Out[29]:
Col1    A    B    C
key
0     1.0  4.0  6.0
1     2.0  5.0  7.0
2     3.0  NaN  8.0
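For completeness, here is a self-contained sketch of the same idea, assuming the sample data used in the concat answer below:
import pandas as pd

df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'Col2': [1, 2, 3, 4, 5, 6, 7, 8]})

# number each row within its Col1 group, then spread Col1 into columns
out = (df.assign(key=df.groupby('Col1').cumcount())
         .pivot(index='key', columns='Col1', values='Col2'))
print(out)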

One way is to use pandas.concat on series derived from unique values in your key column.
Here is a minimal example.
import pandas as pd
df = pd.DataFrame({'Col1': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
                   'Col2': [1, 2, 3, 4, 5, 6, 7, 8]})
res = pd.concat({k: df.loc[df['Col1']==k, 'Col2'].reset_index(drop=True)
                 for k in df['Col1'].unique()}, axis=1)
print(res)
A B C
0 1 4.0 6
1 2 5.0 7
2 3 NaN 8

Related

Find Rows and Delete Them - Pandas DataFrame

Example dataframe:
name stuff floats ints
0 Mike a 1.0 1
1 Joey d 2.2 3
2 Zendaya c NaN 8
3 John a 1.0 1
4 Abruzzi d NaN 3
I have a 'to_delete' list:
[['Abruzzi', 'd', pd.NA, 3], ['Mike', 'a', 1.0, 1]]
How can I remove rows from the dataframe based on the 'to_delete' list?
What pandas method suits this?
So I will get a new dataframe like:
name stuff floats ints
1 Joey d 2.2 3
2 Zendaya c NaN 8
3 John a 1.0 1
Thanks, I'm new to pandas.
I would use a merge with indicator:
keep = (
    df.merge(pd.DataFrame(to_delete, columns=df.columns), how='left', indicator=True)
      .query('_merge == "left_only"').index
)
out = df.loc[keep]
print(out)
Output:
name stuff floats ints
1 Joey d 2.2 3
2 Zendaya c <NA> 8
3 John a 1.0 1
You can use the drop function to delete rows (and columns) in a Pandas DataFrame.
The following code finds the rows that match an entry in the 'to_delete' list and drops them:
import pandas as pa
res = pa.DataFrame({
    'name': ['Mike', 'Joey', 'Zendaya', 'John', 'Abruzzi'],
    'stuff': ['a', 'd', 'c', 'a', 'd'],
    'floats': [1.0, 2.2, pa.NA, 1.0, pa.NA],
    'ints': [1, 3, 8, 1, 3]
})
remove = [['Abruzzi', 'd', pa.NA, 3], ['Mike', 'a', 1.0, 1]]
# compare whole rows as tuples, mark the ones listed in `remove`, then drop them
mask = res.apply(tuple, axis=1).isin([tuple(r) for r in remove])
res = res.drop(res.loc[mask].index)

Can I use pandas' pivot_table to aggregate over a column with missing values?

Can I use pandas pivot_table to aggregate over a column with missing values and have those missing values included as a separate category?
In:
df = pd.DataFrame({'a': pd.Series(['X', 'X', 'Y', 'Y', 'N', 'N'], dtype='category'),
                   'b': pd.Series([None, None, 'd', 'd', 'd', 'd'], dtype='category')})
Out:
a b
0 X NaN
1 X NaN
2 Y d
3 Y d
4 N d
5 N d
In:
df.groupby('a')['b'].apply(lambda x: x.value_counts(dropna=False)).unstack(1)
Out:
   NaN    d
a
N  NaN  2.0
X  2.0  0.0
Y  NaN  2.0
Can I achieve the same result using pandas pivot_table? If yes, then how? Thanks.
For some unknown reason, dtype="category" does not work with pivot_table() when counting NaN values. Casting the columns to regular strings (which turns the missing values into the literal string "nan", visible in the result below) enables a plain pivot_table(aggfunc="size").
df.astype(str).pivot_table(index="a", columns="b", aggfunc="size")
Result
b      d  nan
a
N    2.0  NaN
X    NaN  2.0
Y    2.0  NaN
One can optionally do .fillna(0) to replace the NaNs with 0s, as shown below.
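For example, chaining it onto the pivot above:
res = df.astype(str).pivot_table(index="a", columns="b", aggfunc="size").fillna(0)
# same table as above, with 0.0 in place of NaN
print(res)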

Pandas groupby with specified conditions

I'm learning Python/Pandas with a DataFrame having the following structure:
df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
print(df1)
unique_id brand
0 1 A
1 1 B
2 2 A
3 2 C
4 2 X
5 3 A
6 3 C
7 3 X
8 3 X
9 3 X
My goal is to make some calculations on the above DataFrame.
Specifically, for each unique_id, I want to:
Count the number of brands without taking brand X into account;
Count only how many times brand 'X' appears.
Visually, using the above example, the resulting DataFrame I'm looking for should look like this:
unique_id count_brands_not_x count_brand_x
0 1 2 0
1 2 2 1
2 3 2 3
I have used the groupby method on simple examples in the past but I don't know how to specify conditions in a groupby to solve this new problem I have. Any help would be appreciated.
You can use GroupBy and merge:
maskx = df1['brand'].eq('X')
d1 = df1[~maskx].groupby('unique_id')['brand'].size().reset_index()
d2 = df1[maskx].groupby('unique_id')['brand'].size().reset_index()
df = d1.merge(d2, on='unique_id', how='outer', suffixes=['_not_x', '_x']).fillna(0)
unique_id brand_not_x brand_x
0 1 2 0.00
1 2 2 1.00
2 3 2 3.00
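To match the column names asked for in the question, you could follow up with a rename (a small addition to the answer above):
df = df.rename(columns={'brand_not_x': 'count_brands_not_x', 'brand_x': 'count_brand_x'})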
I use pd.crosstab on a True/False mask obtained by comparing against the value 'X':
s = df1.brand.eq('X')
df_final = (pd.crosstab(df1.unique_id, s)
              .rename({False: 'count_brands_not_x', True: 'count_brand_x'}, axis=1))
Out[134]:
brand      count_brands_not_x  count_brand_x
unique_id
1                           2              0
2                           2              1
3                           2              3
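If you want unique_id back as a regular column, as in the desired output, a reset_index can follow (the leftover 'brand' columns name is cosmetic):
df_final = df_final.reset_index()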
You can subset the original DataFrame and use the appropriate groupby operations for each calculation. concat joins the results.
import pandas as pd
s = df1.brand.eq('X')
res = (pd.concat([df1[~s].groupby('unique_id').brand.nunique().rename('unique_not_X'),
                  df1[s].groupby('unique_id').size().rename('count_X')],
                 axis=1)
         .fillna(0))
# unique_not_X count_X
#unique_id
#1 2 0.0
#2 2 1.0
#3 2 3.0
If instead of counting unique brands (the "unique_not_X" column above) you just want the number of rows whose brand is not "X", you can perform a single groupby and unstack the result.
(df1.groupby(['unique_id', df1.brand.eq('X').map({True: 'count_X', False: 'count_not_X'})])
    .size().unstack(-1).fillna(0))
#brand count_X count_not_X
#unique_id
#1 0.0 2.0
#2 1.0 2.0
#3 3.0 2.0
I would first create groups and later count the elements in each group.
But maybe there is a better function to count items in agg() (one possibility is sketched below).
import pandas as pd
df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})
g = df1.groupby('unique_id')
df = pd.DataFrame()
df['count_brand_x'] = g['brand'].agg(lambda data:sum(data=='X'))
df['count_brands_not_x'] = g['brand'].agg(lambda data:sum(data!='X'))
df = df.reset_index()
print(df)
EDIT: If I already have df['count_brand_x'], then the other column can be computed from the group size:
df['count_brands_not_x'] = g['brand'].count() - df['count_brand_x']
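Picking up the "maybe there is a better function" remark above, here is a sketch of one tidier variant that sums boolean masks directly instead of using lambdas in agg():
import pandas as pd

df1 = pd.DataFrame({'unique_id' : [1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
                    'brand' : ['A', 'B', 'A', 'C', 'X', 'A', 'C', 'X', 'X', 'X']})

is_x = df1['brand'].eq('X')
# summing a boolean mask per group counts the rows matching / not matching 'X'
out = pd.DataFrame({
    'count_brands_not_x': (~is_x).groupby(df1['unique_id']).sum(),
    'count_brand_x': is_x.groupby(df1['unique_id']).sum(),
}).reset_index()
print(out)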

Replace missing values in all columns except one in pandas dataframe

I have a pandas dataframe with 10 columns and I want to fill missing values for all columns except one (lets say that column is called test). Currently, if I do this:
df.fillna(df.median(), inplace=True)
It replaces NA values in all columns with the median value; how do I exclude specific column(s) without specifying ALL the other columns?
You can use pd.DataFrame.drop to help out:
df.drop('unwanted_column', axis=1).fillna(df.median())
Or pd.Index.difference
df.loc[:, df.columns.difference(['unwanted_column'])].fillna(df.median())
Or just:
df.loc[:, df.columns != 'unwanted_column'].fillna(df.median())
The input to the difference function should be passed as a list/array of column labels (edited).
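Note that these expressions return a new DataFrame rather than modifying df. If the goal is to fill df itself while leaving the excluded column untouched (here called 'test', as in the question), one sketch is to assign back to the selected columns:
cols = df.columns.difference(['test'])
df[cols] = df[cols].fillna(df[cols].median())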
Just select whatever columns you want using pandas' column indexing:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [np.nan, 5, 2, np.nan, 3], 'B': [np.nan, 4, 3, 5, np.nan], 'C': [np.nan, 4, 3, 2, 1]})
>>> df
A B C
0 NaN NaN NaN
1 5.0 4.0 4.0
2 2.0 3.0 3.0
3 NaN 5.0 2.0
4 3.0 NaN 1.0
>>> cols = ['A', 'B']
>>> df[cols] = df[cols].fillna(df[cols].median())
>>> df
A B C
0 3.0 4.0 NaN
1 5.0 4.0 4.0
2 2.0 3.0 3.0
3 3.0 5.0 2.0
4 3.0 4.0 1.0

Pandas: Merge two dataframe columns

Consider two dataframes:
import numpy as np
import pandas as pd

df_a = pd.DataFrame([
    ['a', 1],
    ['b', 2],
    ['c', np.nan],
], columns=['name', 'value'])
df_b = pd.DataFrame([
    ['a', 1],
    ['b', np.nan],
    ['c', 3],
    ['d', 4],
], columns=['name', 'value'])
So looking like
# df_a
name value
0 a 1
1 b 2
2 c NaN
# df_b
name value
0 a 1
1 b NaN
2 c 3
3 d 4
I want to merge these two dataframes and fill in the NaN values of the value column with the existing values in the other column. In other words, the output I want is:
# DESIRED RESULT
name value
0 a 1
1 b 2
2 c 3
3 d 4
Sure, I can do this with a custom .map or .apply, but I want a solution that uses merge or the like, not writing a custom merge function. How can this be done?
I think you can use combine_first:
print (df_b.combine_first(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Or fillna:
print (df_b.fillna(df_a))
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
Solution with update is not so common as combine_first:
df_b.update(df_a)
print (df_b)
name value
0 a 1.0
1 b 2.0
2 c 3.0
3 d 4.0
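Since the question explicitly asks for a merge-based solution, here is one hedged sketch: do an outer merge on name and coalesce the two value columns (the '_a'/'_b' suffixes are just illustrative names):
merged = df_a.merge(df_b, on='name', how='outer', suffixes=('_a', '_b'))
# take df_a's value where present, otherwise fall back to df_b's
merged['value'] = merged['value_a'].fillna(merged['value_b'])
out = merged[['name', 'value']]
print(out)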
