% Difference Pivot Table python - python

I have a sample dataframe/table as below and I would like to do a simple pivot table in Python to calculate the % difference from the previous year.
DataFrame
Year Month Count Amount Retailer
2019 5 10 100 ABC
2019 3 8 80 XYZ
2020 3 8 80 ABC
2020 5 7 70 XYZ
...
Expected Output
MONTH %Diff
ABC 7 -0.2
XYG 8 -0.125
Thanks,
EDIT: I would like to reiterate that I would like to create the following table below. Not to do a join with the two tables

It looks like you need a groupby not pivot
gdf = df.groupby(['Retailer']).agg({'Amount': 'pct_change'})
Then rename and merge with original df
df = gdf.rename(columns={'Amount': '%Diff'}).dropna().merge(df, how='left', left_index=True, right_index=True)
%Diff Year Month Count Amount Retailer
2 -0.200 2020 3 7 80 ABC
3 -0.125 2020 5 8 70 XYZ

Related

Filling in missing values from other dataframe with multiple conditions

I am trying to fill the missing value from other dataframe with matching column values and conditions.
The two Dataframes that I'm working with looks like below:
The First Dataframe contains values that are recorded in Quarter periods (Mar.-June-Sep.-Dec.)
Name
Date
Value_1
Value_2
Value_3
John
2022-03
1
5
9
John
2022-06
2
NA
10
John
2022-09
NA
7
11
John
2022-12
4
8
12
Dave
2021-03
NA
5
10
Dave
2021-06
20
NA
10
Dave
2021-09
10
5
8
Dave
2021-12
NA
NA
30
The Second Dataframe contains values that are recorded in Semiannual periods (June and Dec.)
Name
Date
Value_1
Value_2
Value_3
John
2022-06
NA
NA
30
John
2022-12
50
NA
NA
Dave
2021-06
NA
NA
NA
Dave
2021-12
30
20
10
There are two things that I want to put as a condition when merging this two Dataframe.
(1.) If the 'Value_1' and 'Value_3' of Semiannual column is missing, replace with values in Quarterly Dataframe by same Year and Month. For example, the Value_3 second column will be '12' since the Value'_3, forth row in Quarter dataframe has the value 12. In this case when filling the values, values in March and September in quarter data will be ignored.
(2.) If the 'Value_2' of Semiannual column is missing, repace the values in Quarterly Dataframe by adding the relavant months. For example, the Value_2 second column (2022-12) in Semiannual Dataframe will be replace with the value by adding 2022-09 and 2022-12 in Quarterly dataframe.
The Final Dataframe will look like below:
Name
Date
Value_1
Value_2
Value_3
John
2022-06
2 (from quarter df)
5 (5 + NA from quarter df)
30
John
2022-12
50
15 (7 + 8 from quarter df)
12 (from quarter df
Dave
2021-06
20 (from quarter df)
NA (NA + NA from quarter df)
10 (from quarter df)
Dave
2021-12
30
20
10
I am trying to develop the code below, but it only satisfies the first condition.
dt_quarter= [['John','2022-03',1,5,9], ['John','2022-06',2,'NA',10], ['John','2022-09','NA',7,11], ['John','2022-12',4,8,12],
['Dave','2021-03','NA',5,10], ['Dave','2021-06',20,'NA',10], ['Dave','2021-09',10,5,8], ['Dave','2021-12','NA','NA',30]]
df_quarter= pd.DataFrame(dt_quarter, columns=['Name','Date','Value_1','Value_2','Value_3'])
dt_semi= [['John','2022-06','NA','NA',30], ['John','2022-12',50,'NA','NA'],
['Dave','2021-06','NA','NA','NA'], ['Dave','2021-12',30,20,10]]
df_semi= pd.DataFrame(dt_semi, columns=['Name','Date','Value_1','Value_2','Value_3'])
group_cols = ['Name', 'Date']
df_combined = df_semi.set_index(group_cols).fillna(df_quarter.set_index(group_cols)).reset_index()
Please give advice on this problem.
Thanks

How to spread the data in pandas?

i'm working on spread r equivalent in pandas my dataframe looks like below
Name age Language year Period
Nik 18 English 2018 Beginer
John 19 French 2019 Intermediate
Kane 33 Russian 2017 Advanced
xi 44 Thai 2015 Beginer
and looking for output like this
Name age Language Beginer Intermediate Advanced
Nik 18 English 2018
John 19 French 2019
Kane 33 Russian 2017
John 44 Thai 2015
my code
pd.pivot(x1,values='year', columns=['Period'])
i'm getting only these columns Beginer,Intermediate,Advanced not the entire dataframe
while reshaping it i tried using index but says no duplicates in index.
So i created new index column but still not getting entire dataframe
If I understood correctly you could do something like this:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], 1), res), 1)
print(output)
Output
Name age Language Advanced Beginner Intermediate
0 Nik 18 English 0 2018 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 2017 0 0
3 xi 44 Thai 0 2015 0
If you want to match the exact same output, convert the column to categorical first, and specify the order:
# encode as categorical
df['Period'] = pd.Categorical(df['Period'], ['Beginner', 'Advanced', 'Intermediate'], ordered=True)
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
# concat and drop columns
output = pd.concat((df.drop(['year', 'Period'], 1), res), 1)
print(output)
Output
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018 0 0
1 John 19 French 0 0 2019
2 Kane 33 Russian 0 2017 0
3 xi 44 Thai 2015 0 0
Finally if you want to replace the 0, with missing values, add a third step:
# create dummy columns
res = pd.get_dummies(df['Period']).astype(np.int64)
res.values[np.arange(len(res)), np.argmax(res.values, axis=1)] = df['year']
res = res.replace(0, np.nan)
Output (with missing values)
Name age Language Beginner Advanced Intermediate
0 Nik 18 English 2018.0 NaN NaN
1 John 19 French NaN NaN 2019.0
2 Kane 33 Russian NaN 2017.0 NaN
3 xi 44 Thai 2015.0 NaN NaN
One way you can get to the equivalent of R's spread function using pd.pivot_table:
If you don't mind about the index, you can use reset_index() on the newly created df:
new_df = (pd.pivot_table(df, index=['Name','age','Language'],columns='Period',values='year',aggfunc='sum')).reset_index()
which will get you:
Period Name age Language Advanced Beginer Intermediate
0 John 19 French NaN NaN 2019.0
1 Kane 33 Russian 2017.0 NaN NaN
2 Nik 18 English NaN 2018.0 NaN
3 xi 44 Thai NaN 2015.0 NaN
EDIT
If you have many columns in your dataframe and you want to include them in the reshaped dataset:
Grab in a list the columns to be used in pivot table (i.e. Period and year)
Grab all the other columns in your dataframe in a list (using not in)
Use the index_cols as index in the pd.pivot_table() command
non_index_cols = ['Period','year'] # SPECIFY THE 2 COLUMNS IN THE PIVOT TABLE TO BE USED
index_cols = [i for i in df.columns if i not in non_index_cols] # GET ALL THE REST IN A LIST
new_df = (pd.pivot_table(df, index=index_cols,columns='Period',values='year',aggfunc='sum')).reset_index()
The new_df, will include all the columns of your initial dataframe.

Adding a column with values from another dataframe based on column conditions

I have two dataframes of differing index lengths that look like this:
df_1:
State Month Total Time ... N columns
AL 4 1000
AL 5 500
.
.
.
VA 11 750
VA 12 1500
df_2:
State Month ... N columns
AL 4
AL 5
.
.
.
VA 11
VA 12
I would like to add a Total Time column to df_2 that uses the values from the Total Time column of df_1 if the State and Month value are the same between dataframes. Ultimately, I would end up with:
df_2:
State Month Total Time ... N columns
AL 4 1000
AL 5 500
.
.
.
VA 11 750
VA 12 1500
I want to do this conditionally since the index lengths are not the same. I have tried this so far:
def f(row):
if (row['State'] == row['State']) & (row['Month'] == row['Month']):
val = x for x in df_1["Total Time"]
return val
df_2['Total Time'] = df_2.apply(f, axis=1)
This did not work. What method would you use to accomplish this?
Any help is appreciated!
You can do this:
Consider my sample dataframes:
In [2327]: df_1
Out[2327]:
State Month Total Time
0 AL 2 1000
1 AB 4 500
2 BC 1 600
In [2328]: df_2
Out[2328]:
State Month
0 AL 2
1 AB 5
In [2329]: df_2 = pd.merge(df_2, df_1, on=['State', 'Month'], how='left')
In [2330]: df_2
Out[2330]:
State Month Total Time
0 AL 2 1000.0
1 AB 5 NaN
As mentioned in other comment, pd.merge() is how you would join two dataframes and extract a column. The issue is that merging solely on 'State' and 'Month' would result in every permutation becoming a new column (all Al-4 in df_1 would join with all other AL-4 in df_2).
With your example, there'd be
df_1
State Month Total Time df_1 col...
0 AL 4 1000 6
1 AL 4 500 7
2 VA 12 750 8
3 VA 12 1500 9
df_2
State Month df_2 col...
0 AL 4 1
1 AL 4 2
2 VA 12 3
3 VA 12 4
df_1_cols_to_use = ['State', 'Month', 'Total Time']
# note the selection of the column to use from df_1. We only want the column
# we're merging on, plus the column(s) we want to bring in, in this case 'Total Time'
new_df = pd.merge(df_2, df_1[df_1_cols_to_use], on=['State', 'Month'], how='left')
new_df:
State Month df_2 col... Total Time
0 AL 4 1 1000
1 AL 4 1 500
2 AL 4 2 1000
3 AL 4 2 500
4 VA 12 3 750
5 VA 12 3 1500
6 VA 12 4 750
7 VA 12 4 1500
You mention these have differing index lengths. Based on the parameters of the question, it's not possible to determine what value of Total Time would match up with a row in df_2. If there's three AL-4 entries in df_2, do they each get 1000, 500, or some combination? That info would be needed. Without this, this would be the best guess at getting all possibilities.

How to extract info from original dataframe after doing some analysis on it?

So I had a dataframe and I had to do some cleansing to minimize the duplicates. In order to do that I created a dataframe that had instead of 40 only 8 of the original columns. Now I have two columns I need for further analysis from the original dataframe but they would mess with the desired outcome if I used them in my previous analysis. Anyone have any idea on how to "extract" these columns based on the new "clean" dataframe I have?
You can merge the new "clean" dataframe with the other two variables by using the indexes. Let me use a pratical example. Suppose the "initial" dataframe, called "df", is:
df
name year reports location
0 Jason 2012 4 Cochice
1 Molly 2012 24 Pima
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
4 Amy 2014 3 Yuma
while the "clean" dataframe is:
d1
year location
0 2012 Cochice
2 2013 Santa Cruz
3 2014 Maricopa
The remaing columns are saved in dataframe "d2" ( d2 = df[['name','reports']] ):
d2
name reports
0 Jason 4
1 Molly 24
2 Tina 31
3 Jake 2
4 Amy 3
By using the inner join on the indexes d1.merge(d2, how = 'inner' left_index= True, right_index = True) you get the following result:
name year reports location
0 Jason 2012 4 Cochice
2 Tina 2013 31 Santa Cruz
3 Jake 2014 2 Maricopa
You can make a new dataframe with the specified columns;
import pandas
#If your columns are named a,b,c,d etc
df1 = df[['a','b']]
#This will extract columns 0, to 2 based on their index
#[remember that pandas indexes columns from zero!
df2 = df.iloc[:,0:2]
If you could, provide a sample piece of data, that'd make it easier for us to help you.

create separate columns whose titles are based on values in a column

I am trying to create values for each location of data. I have:
Portafolio Zona Region COM PROV Type of Housing
654738 1 2 3 21 compuesto
65344 3 8 4 22 error
I want to make new columns for each of the types of housing and for their values i want to be able to count how many there are total in each portafolio, zona, region, com, and prov. I have struggled with it for 2 days and I am new to python pandas. It should look like this:
Zona Region COM PROV Compuesto Error
1 2 3 21 24 444
3 8 4 22 34 32
You want pd.pivot_table specifying that the aggregation function is size
df1 = pd.pivot_table(df, index=['Zona', 'Region', 'COM', 'PROV'],
columns='Type of Housing',
aggfunc='size').reset_index()
df1.columns.name=None
Output: df1
Zona Region COM PROV compuesto error
0 1 2 3 21 1.0 NaN
1 3 8 4 22 NaN 1.0

Categories

Resources