With a df as below:
   name  year  total
0     A  2015    100
1     A  2016    200
2     A  2017    500
3     C  2016    400
4     B  2016    100
5     B  2015    200
6     B  2017    800
How do I create a new dataframe with years as columns and the totals as values:
Name  2015  2016  2017
A      100   200   500
B      200   100   800
C        0   400     0
UMMM pivot with fillna
df.pivot(*df.columns).fillna(0).reset_index()
Out[815]:
year name   2015   2016   2017
0       A  100.0  200.0  500.0
1       B  200.0  100.0  800.0
2       C    0.0  400.0    0.0
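Note that df.pivot(*df.columns) simply unpacks the column names ('name', 'year', 'total') positionally as index, columns and values. On newer pandas releases (2.x) those arguments are keyword-only, so you may need the explicit form; a sketch of the equivalent call:
out = (df.pivot(index='name', columns='year', values='total')
         .fillna(0)
         .reset_index())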
My toy DataFrame is similar to
import pandas as pd
data = {'year': [1999, 2000, 2001, 2002, 2003, 2004, 2005,
1999, 2000, 2003, 2004, 2005],
'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'price': [1200, 150, 300, 450, 200, 300, 400, 120,
140, 150, 155, 156]
}
df = pd.DataFrame(data)
What's the most elegant way to add missing years?
In the example, the years 2001 and 2002 are missing for id = 2 because of missing data. In such cases, I still want those years in the DataFrame, with id = 2 and price = NaN.
My real DataFrame has thousands of IDs.
Use a cross merge to create all possible combinations of "year" and "id", then merge back to the original DataFrame:
>>> df["year"].drop_duplicates().to_frame().merge(df["id"].drop_duplicates(), how="cross").merge(df, how="left")
year id price
0 1999 1 1200.0
1 1999 2 120.0
2 2000 1 150.0
3 2000 2 140.0
4 2001 1 300.0
5 2001 2 NaN
6 2002 1 450.0
7 2002 2 NaN
8 2003 1 200.0
9 2003 2 150.0
10 2004 1 300.0
11 2004 2 155.0
12 2005 1 400.0
13 2005 2 156.0
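The same chain split into steps, if you prefer it more readable (how="cross" needs pandas 1.2+; the final sort is just an assumption to get an id-then-year ordering):
years = df[['year']].drop_duplicates()
ids = df[['id']].drop_duplicates()
out = (years.merge(ids, how='cross')    # every year/id combination
            .merge(df, how='left')      # attach known prices, NaN where missing
            .sort_values(['id', 'year'], ignore_index=True))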
You could make "year" a Categorical variable and include it in the groupby:
df['year'] = pd.Categorical(df['year'], categories=df['year'].unique())
out = df.groupby(['id', 'year'], as_index=False).first()
Output:
id year price
0 1 1999 1200.0
1 1 2000 150.0
2 1 2001 300.0
3 1 2002 450.0
4 1 2003 200.0
5 1 2004 300.0
6 1 2005 400.0
7 2 1999 120.0
8 2 2000 140.0
9 2 2001 NaN
10 2 2002 NaN
11 2 2003 150.0
12 2 2004 155.0
13 2 2005 156.0
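A version-dependent caveat, hedged because behaviour differs across pandas releases: newer versions warn that the default of observed for categorical groupers is changing, so if you rely on getting every category combination back it is safer to spell it out:
out = df.groupby(['id', 'year'], as_index=False, observed=False).first()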
Update
You can also use product from itertools:
>>> from itertools import product
>>> df.set_index(['year', 'id']).reindex(product(set(df['year']), set(df['id']))) \
.sort_index(level=1).reset_index()
year id price
0 1999 1 1200.0
1 2000 1 150.0
2 2001 1 300.0
3 2002 1 450.0
4 2003 1 200.0
5 2004 1 300.0
6 2005 1 400.0
7 1999 2 120.0
8 2000 2 140.0
9 2001 2 NaN
10 2002 2 NaN
11 2003 2 150.0
12 2004 2 155.0
13 2005 2 156.0
Create a MultiIndex of all combinations of the year and id columns, set those columns as the index, and reindex by the MultiIndex:
mi = pd.MultiIndex.from_product([df['year'].unique(), df['id'].unique()], names=['year', 'id'])
out = df.set_index(['year', 'id']).reindex(mi).reset_index().sort_values('id', ignore_index=True)
Output:
>>> out
year id price
0 1999 1 1200.0
1 2000 1 150.0
2 2001 1 300.0
3 2002 1 450.0
4 2003 1 200.0
5 2004 1 300.0
6 2005 1 400.0
7 1999 2 120.0
8 2000 2 140.0
9 2001 2 NaN
10 2002 2 NaN
11 2003 2 150.0
12 2004 2 155.0
13 2005 2 156.0
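Yet another sketch of the same reshape, assuming an unstack/stack round trip suits you (stack's dropna flag keeps the missing pairs as NaN; it is being phased out in very recent pandas, so treat this as version-dependent):
wide = df.set_index(['id', 'year'])['price'].unstack()   # ids as rows, every year as a column
out = (wide.stack(dropna=False)                           # back to long form, NaN kept
           .rename('price')
           .reset_index())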
I have the following dataframe:
PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
1         100         2017                 2016
2         20          2014                 2014
1         30          2017                 2016
1         40          2016                 2016
4         300         2015                 2000
5         150         2005                 2002
What I'm looking for: AmountPaid should appear in the withinNYears column if the payment was made within N years of StartDate; otherwise the value should be NaN.
N can be any number, but let's use 2 for this example (I will be varying it to explore the findings).
So the above dataframe would come out like this if the amount was paid within 2 years:
PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
1         100         2017                 2016       100
2         20          2014                 2014       20
1         30          2017                 2016       30
1         40          2016                 2016       40
4         300         2015                 2000       NaN
5         150         2005                 2002       NaN
Does anyone know how to achieve this? Cheers.
Subtract the columns and compare with the scalar to get a boolean mask, then set the values with numpy.where, Series.where or DataFrame.loc:
import numpy as np

m = (df['PaymentReceivedDate'] - df['StartDate']) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)

# alternatives
# df['withinNYears'] = df['AmountPaid'].where(m)
# df.loc[m, 'withinNYears'] = df['AmountPaid']
print (df)
PersonID AmountPaid PaymentReceivedDate StartDate \
0 1 100 2017 2016
1 2 20 2014 2014
2 1 30 2017 2016
3 1 40 2016 2016
4 4 300 2015 2000
5 5 150 2005 2002
withinNYears
0 100.0
1 20.0
2 30.0
3 40.0
4 NaN
5 NaN
EDIT:
If the StartDate column contains datetimes:
m = (df['PaymentReceivedDate'] - df['StartDate'].dt.year) < 2
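And, as an assumption beyond what the question shows, if both columns are datetimes you can compare the years directly:
import numpy as np

m = (df['PaymentReceivedDate'].dt.year - df['StartDate'].dt.year) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)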
Just assign with loc:
df.loc[(df['PaymentReceivedDate'] - df['StartDate']<2),'withinNYears']=df.AmountPaid
df
Out[37]:
PersonID AmountPaid ... StartDate withinNYears
0 1 100 ... 2016 100.0
1 2 20 ... 2014 20.0
2 1 30 ... 2016 30.0
3 1 40 ... 2016 40.0
4 4 300 ... 2000 NaN
5 5 150 ... 2002 NaN
[6 rows x 5 columns]
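Since N is meant to be tweakable, the same loc assignment with N pulled out as a variable (a small sketch, not a change in logic):
N = 2  # number of years to experiment with
df.loc[df['PaymentReceivedDate'] - df['StartDate'] < N, 'withinNYears'] = df['AmountPaid']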
I have a DF with 50ish columns and duplicate IDs. The section I'm interested in looks something like this:
ID Value year
0 3 200 1995
1 3 100 2001
2 4 300 1995
3 4 250 2000
All first entries of each ID are for 1995; the second entries correspond to a ValuedFrom column (the second entry is the retirement age of each object, and so its last value in most cases). I'd like to merge all three of these columns so that I end up with two, like so:
ID Value1995 ValueRetired
0 3 200 100
1 4 300 250
Any ideas on how I might do this?
General solution:
print (df)
ID year Value
1 3 2003 95
2 3 1995 200
2 3 2001 100
3 4 1995 300
4 4 2000 250
5 4 2004 150
6 5 2000 201
7 5 1995 202 <- remove this row with 1995, because it is the last row of group 5; selecting the next row would cross into another group
8 6 2000 203
9 6 2000 204
First select the indices of the 1995 rows (except when 1995 is the last row of its group) and the rows immediately after them:
idx = df.index[(df['year'] == 1995) & (df.groupby('ID').cumcount(ascending=False) != 0)]
idx2 = df.index.intersection(idx + 1).union(idx)
df = df.loc[idx2]
print (df)
   ID  year  Value
2   3  1995    200
2   3  2001    100
3   4  1995    300
4   4  2000    250
Detail:
print (df.groupby('ID').cumcount(ascending=False))
1 2
2 1
2 0
3 2
4 1
5 0
6 1
7 0
8 1
9 0
dtype: int64
Then relabel the values of the year column so they become the new column names when reshaping with unstack:
df['year'] = np.where(df['year'] == 1995, 'Value1995', 'ValueRetired')
df = df.set_index(['ID', 'year'])['Value'].unstack().reset_index().rename_axis(None, axis=1)
print (df)
ID Value1995 ValueRetired
0 3 200 100
1 4 300 250
You can create a series mapping year to labels, then use pd.DataFrame.pivot:
df['YearType'] = np.where(df['year'] == 1995, 'Value1995', 'ValueRetired')
res = df.pivot(index='ID', columns='YearType', values='Value')
print(res)
YearType Value1995 ValueRetired
ID
3 200 100
4 300 250
5 150 95
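If the data really does have exactly one 1995 row followed by one retirement row per ID, as in the original example, a simpler sketch is a sort plus named aggregation (this assumes the two-rows-per-ID layout and sorts by year so 'first' is the 1995 value):
out = (df.sort_values(['ID', 'year'])
         .groupby('ID')['Value']
         .agg(Value1995='first', ValueRetired='last')
         .reset_index())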
I have a dataframe with a price column (P) that contains some undesired values like 0, 1.50, 92.80 and 0.80. Before I calculate the mean of the price by product code, I would like to remove these outliers:
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
7 100 2017 1 28 2.0 92.80
8 100 2017 2 1 0.0 0.00
9 100 2017 2 7 2.0 1.50
10 100 2017 2 8 5.0 0.80
11 100 2017 2 9 1.0 45.05
12 100 2017 2 11 1.0 1.50
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
16 100 2017 3 30 2.0 1.50
What would be a good way to filter the outliers for each product (grouping by Code)?
I tried this:
stds = 1.0 # Number of standard deviation that defines 'outlier'.
z = df[['Code','P']].groupby('Code').transform(
lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]
And then :
print(df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean())
But the outlier filter doesn't work properly.
IIUC you can use a groupby on Code, do your z-score calculation on P, and filter out rows where the z-score is greater than your threshold:
stds = 1.0
filtered_df = df[~df.groupby('Code')['P'].transform(lambda x: abs((x - x.mean()) / x.std()) > stds)]
filtered_df
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
11 100 2017 2 9 1.0 45.05
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
filtered_df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean()
                         P
Code Year Month
100  2017 1      44.821429
          2      45.050000
          3      46.666667
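The same filter split into named steps, in case you want to inspect the intermediate z-scores (a sketch of the logic above, not a different method):
z = df.groupby('Code')['P'].transform(lambda x: (x - x.mean()) / x.std())
keep = ~(z.abs() > stds)                  # True for rows within the threshold
filtered_df = df[keep]
means = filtered_df.groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()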
You have the right idea. Just take the Boolean opposite of your outliers['P'] series via ~ and filter your dataframe via loc:
res = df.loc[~outliers['P']]\
.groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()
print(res)
Code Year Month P
0 100 2017 1 44.821429
1 100 2017 2 45.050000
2 100 2017 3 46.666667
I am dealing with a multi-index dataframe because I've used pd.pivot_table. There are two levels in my column header.
I am currently processing it and would like to sum two columns together.
I would like to make my code cleaner by processing the df in one chain using .pipe()
What I have come up with is this:
reg_cat =
   1 or 0  total_orders  year
0       1          2000  2011
1       0          5500  2012
2       1          6000  2013
3       0          1000  2014
4       0          3000  2015
pivot = (
    reg_cat
    .pivot_table(values=['total_orders'], index=['year'], columns=['1 or 0'], aggfunc=np.sum)
    .reset_index()
    .fillna(0)
    .pipe(lambda x: x.assign(total_orders_total=x['total_orders', 0] + x['total_orders', 1]))
)
The output is presented as such:
        year total_orders         total_orders_total
1 or 0                  0       1
0       2011          0.0  2000.0             2000.0
1       2012       5500.0     0.0             5500.0
2       2013          0.0  6000.0             6000.0
3       2014       1000.0     0.0             1000.0
4       2015       3000.0     0.0             3000.0
How can I insert a 2nd level column name for the column 'total_orders_total' with this method? So that it will look something like this:
        year total_orders         total_orders_total
1 or 0                  0       1              total
0       2011          0.0  2000.0             2000.0
1       2012       5500.0     0.0             5500.0
2       2013          0.0  6000.0             6000.0
3       2014       1000.0     0.0             1000.0
4       2015       3000.0     0.0             3000.0
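One hedged sketch for getting that second level: .assign() keyword names cannot be tuples, so one option is to pipe through a small helper that assigns the new column under a two-level name directly (the helper name add_total is just illustrative):
def add_total(d):
    d = d.copy()
    # assigning with a tuple key creates a column with a two-level name
    d[('total_orders_total', 'total')] = d[('total_orders', 0)] + d[('total_orders', 1)]
    return d

pivot = (
    reg_cat
    .pivot_table(values=['total_orders'], index=['year'], columns=['1 or 0'], aggfunc='sum')
    .reset_index()
    .fillna(0)
    .pipe(add_total)
)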