Dataframe GroupBy and List Columns Values and Pivot Column Generation - python

I have the input table below:

ID  DepartmentName  Salary
1   Sales           100
1   Sales           200
2   Sales           100
2   Marketing       300
2   Finance         400
3   Sales           100
2   Marketing       600
Desired Output is
Code I tried, but I could not do the pivot:

from pyspark.sql.functions import *
import pandas as pd

df = spark.sql("""select * from EmployeeTable""")
df_pan = df.toPandas()
# group by ID and department, collecting salaries into lists
df_pan = df_pan.groupby(["ID", "DepartmentName"]).agg({"Salary": lambda x: list(x)})
# df_pan.reset_index().pivot()
print(df_pan)
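Since the desired output isn't shown, here is a sketch of one likely interpretation: pivot DepartmentName into columns, with each cell holding the list of salaries for that ID/department pair. This is pure pandas (the Spark-to-pandas step is skipped) with the input reconstructed from the table above:

```python
import pandas as pd

# input reconstructed from the table in the question
df_pan = pd.DataFrame({
    'ID': [1, 1, 2, 2, 2, 3, 2],
    'DepartmentName': ['Sales', 'Sales', 'Sales', 'Marketing', 'Finance', 'Sales', 'Marketing'],
    'Salary': [100, 200, 100, 300, 400, 100, 600],
})

# group to lists, then unstack the department level into columns (the pivot)
out = (df_pan.groupby(['ID', 'DepartmentName'])['Salary']
             .agg(list)
             .unstack('DepartmentName'))
print(out)
```

Missing ID/department combinations come out as NaN after the unstack.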

Select Multiple Conditions - 2 DataFrames - Create New Column

I have 2 DataFrames:

DF1 - Master List
ID   Item
100  Wood
101  Steel
102  Brick
103  Soil

DF2
ID
100
103

I want my final DataFrame to look like this:

ID   Item / ID
100  100 - Wood
103  103 - Soil
The issue I'm having is that DF2 doesn't have the Item column.
I could do it manually with np.select(conditions, choices, default='N/A'), but the full dataset is huge and that would take a lot of time.
I also tried np.select across the two datasets, citing the columns, but got a "Can only compare identically-labeled DataFrame objects" error.
Is there a way to pull the relevant information from Item so I can join the string values from ID to create Item / ID?
Thanks in advance.
# right join keeps only the IDs present in df2
dff = pd.merge(df1, df2, how='right', on='ID')
dff['Item / ID'] = dff['ID'].astype(str) + ' - ' + dff['Item']
dff.drop('Item', axis=1, inplace=True)
print(dff)
output:
ID Item / ID
0 100 100 - Wood
1 103 103 - Soil
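An alternative sketch that avoids the merge entirely: build an ID-to-Item lookup with set_index and use Series.map (same column names as above):

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [100, 101, 102, 103],
                    'Item': ['Wood', 'Steel', 'Brick', 'Soil']})
df2 = pd.DataFrame({'ID': [100, 103]})

# ID -> Item lookup from the master list; map pulls the Item for each ID in df2
lookup = df1.set_index('ID')['Item']
df2['Item / ID'] = df2['ID'].astype(str) + ' - ' + df2['ID'].map(lookup)
print(df2)
```

IDs missing from the master list would come out as NaN rather than raising, which map makes easy to spot.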

Extrapolate exact same rows from yearly to monthly data point in python

I have example dataframe in yearly granularity:
df = pd.DataFrame({
"date": ["2020-01-01", "2021-01-01", "2022-01-01"],
"cost": [100, 1000, 150],
"person": ["Tom","Jerry","Brian"]
})
I want to create a dataframe with monthly granularity without any estimation methods (just repeat each row 12 times for its year). So from this 3-row dataframe I would like to get 36 rows, exactly like:
2020-01-01 / 100 / Tom
2020-02-01 / 100 / Tom
2020-03-01 / 100 / Tom
2020-04-01 / 100 / Tom
2020-05-01 / 100 / Tom
[...]
2022-10-01 / 150 / Brian
2022-11-01 / 150 / Brian
2022-12-01 / 150 / Brian
I tried
df.resample('M', on='date').apply(lambda x: x)
but I can't seem to get it working...
I'm a beginner, so forgive my ignorance.
Thanks for the help in advance!
Here is a way to do that.
count = len(df)
for var in df[['date', 'cost', 'person']].values:
    for i in range(2, 13):
        # replace the month part of the 'yyyy-mm-dd' string with 02..12
        df.loc[count] = [var[0][0:5] + "{:02d}".format(i) + var[0][7:], var[1], var[2]]
        count += 1
df = df.sort_values('date')
Following should also work,
# Typecasting
df['date'] = pd.to_datetime(df['date'])
# Making a new dataframe with monthly frequency
op = pd.DataFrame(pd.date_range(start=df['date'].min(),
                                end=df['date'].max() + pd.offsets.DateOffset(months=11),
                                freq='MS'),
                  columns=['date'])
# Merging both frames on year (outer join)
res = pd.merge(df, op,
               left_on=df['date'].apply(lambda x: x.year),
               right_on=op['date'].apply(lambda x: x.year),
               how='outer')
# Dropping the key column and the yearly date from the left side
res.drop(['key_0', 'date_x'], axis=1, inplace=True)

Reshape multi-header table in python

I am new to Python and am trying to reshape a table from an Excel file. The file has multiple header rows, and I am trying to convert the first header row into two separate columns. I am attaching my code, output, and data here.
Input Table
import pandas as pd
import numpy as nm
df = pd.read_excel(r'.\test.xlsx', header=[0, 1])
df = (df.stack(0, dropna=False)
.rename_axis(index=('Customer','Date'), columns=None)
.reset_index())
df.to_csv(r'.\testnew.csv',index=False)
print(df)
Printed Output -
Desired Output -

Customer  Date    Budget  Actual  Amount
John      Jan-20  100     50      0
John      Feb-20
John      Mar-20
Chris     Jan-20  120     80      0
Chris     Feb-20  50      10      20
Chris     Mar-20  50      45
I believe you need DataFrame.stack:
df = pd.read_excel(r'.\test.xlsx', header=[0, 1])
df = (df.stack(0, dropna=False)
.rename_axis(index=('Customer','Date'), columns=None)
.reset_index())
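Since the Excel file isn't shown, here is a minimal self-contained sketch of what stack(0) does with a two-level header (the column names are assumed from the desired output):

```python
import pandas as pd

# hypothetical two-level header: dates on top, measures underneath
cols = pd.MultiIndex.from_product([['Jan-20', 'Feb-20'], ['Budget', 'Actual']])
df = pd.DataFrame([[100, 50, 0, 10]], index=['John'], columns=cols)

# stack(0) moves the first column level (the dates) into the row index
out = (df.stack(0)
         .rename_axis(index=('Customer', 'Date'), columns=None)
         .reset_index())
print(out)
```

Each (customer, date) pair becomes one row, with the second header level (Budget/Actual) left as ordinary columns.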

Pandas - Possible to select which index in multi-index to use when dividing a multi-index series by a single index series?

Data
data = {"account":{"0":383080,"1":383080,"2":383080,"3":412290,"4":412290,"5":412290,"6":412290,"7":412290,"8":218895,"9":218895,"10":218895,"11":218895},"name":{"0":"Will LLC","1":"Will LLC","2":"Will LLC","3":"Jerde-Hilpert","4":"Jerde-Hilpert","5":"Jerde-Hilpert","6":"Jerde-Hilpert","7":"Jerde-Hilpert","8":"Kulas Inc","9":"Kulas Inc","10":"Kulas Inc","11":"Kulas Inc"},"order":{"0":10001,"1":10001,"2":10001,"3":10005,"4":10005,"5":10005,"6":10005,"7":10005,"8":10006,"9":10006,"10":10006,"11":10006},"sku":{"0":"B1-20000","1":"S1-27722","2":"B1-86481","3":"S1-06532","4":"S1-82801","5":"S1-06532","6":"S1-47412","7":"S1-27722","8":"S1-27722","9":"B1-33087","10":"B1-33364","11":"B1-20000"},"quantity":{"0":7,"1":11,"2":3,"3":48,"4":21,"5":9,"6":44,"7":36,"8":32,"9":23,"10":3,"11":-1},"unit price":{"0":33.69,"1":21.12,"2":35.99,"3":55.82,"4":13.62,"5":92.55,"6":78.91,"7":25.42,"8":95.66,"9":22.55,"10":72.3,"11":72.18},"ext price":{"0":235.83,"1":232.32,"2":107.97,"3":2679.36,"4":286.02,"5":832.95,"6":3472.04,"7":915.12,"8":3061.12,"9":518.65,"10":216.9,"11":72.18}}
pd.DataFrame(data=data)
Current Solution
sku_total = df.groupby(['order','sku'])['ext price'].sum().rename('sku total').reset_index()
sku_total['sku total'] / sku_total['order'].map(df.groupby('order')['ext price'].sum())
Question
How to divide:
df.groupby(['order','sku'])['ext price'].sum()
by
df.groupby('order')['ext price'].sum()
Without having to reset_index?
Doesn't div do the trick, or am I understanding something incorrectly?
import pandas as pd
import numpy as np
data = {"account":{"0":383080,"1":383080,"2":383080,"3":412290,"4":412290,"5":412290,"6":412290,"7":412290,"8":218895,"9":218895,"10":218895,"11":218895},"name":{"0":"Will LLC","1":"Will LLC","2":"Will LLC","3":"Jerde-Hilpert","4":"Jerde-Hilpert","5":"Jerde-Hilpert","6":"Jerde-Hilpert","7":"Jerde-Hilpert","8":"Kulas Inc","9":"Kulas Inc","10":"Kulas Inc","11":"Kulas Inc"},"order":{"0":10001,"1":10001,"2":10001,"3":10005,"4":10005,"5":10005,"6":10005,"7":10005,"8":10006,"9":10006,"10":10006,"11":10006},"sku":{"0":"B1-20000","1":"S1-27722","2":"B1-86481","3":"S1-06532","4":"S1-82801","5":"S1-06532","6":"S1-47412","7":"S1-27722","8":"S1-27722","9":"B1-33087","10":"B1-33364","11":"B1-20000"},"quantity":{"0":7,"1":11,"2":3,"3":48,"4":21,"5":9,"6":44,"7":36,"8":32,"9":23,"10":3,"11":-1},"unit price":{"0":33.69,"1":21.12,"2":35.99,"3":55.82,"4":13.62,"5":92.55,"6":78.91,"7":25.42,"8":95.66,"9":22.55,"10":72.3,"11":72.18},"ext price":{"0":235.83,"1":232.32,"2":107.97,"3":2679.36,"4":286.02,"5":832.95,"6":3472.04,"7":915.12,"8":3061.12,"9":518.65,"10":216.9,"11":72.18}}
df = pd.DataFrame(data=data)
print(df)
df_1 = df.groupby(['order','sku'])['ext price'].sum()
df_2 = df.groupby('order')['ext price'].sum()
df_res = df_1.div(df_2)
print(df_res)
Output:
order sku
10001 B1-20000 0.409342
B1-86481 0.187409
S1-27722 0.403249
10005 S1-06532 0.429090
S1-27722 0.111798
S1-47412 0.424170
S1-82801 0.034942
10006 B1-20000 0.018657
B1-33087 0.134058
B1-33364 0.056063
S1-27722 0.791222
Name: ext price, dtype: float64
IIUC,
we can use transform, which allows you to do groupby operations while maintaining the index;
you can then assign the result to a new column if you wish.
s = (df.groupby(['order','sku'])['ext price'].transform('sum')
/ df.groupby('order')['ext price'].transform('sum'))
print(s)
0 0.409342
1 0.403249
2 0.187409
3 0.429090
4 0.034942
5 0.429090
6 0.424170
7 0.111798
8 0.791222
9 0.134058
10 0.056063
11 0.018657
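As mentioned, the transform ratio can be assigned straight back as a new column. A minimal sketch on a three-row slice of the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'order': [10001, 10001, 10001],
    'sku': ['B1-20000', 'S1-27722', 'B1-86481'],
    'ext price': [235.83, 232.32, 107.97],
})

# transform keeps the original index, so the ratio aligns row-for-row
df['sku_share'] = (df.groupby(['order', 'sku'])['ext price'].transform('sum')
                   / df.groupby('order')['ext price'].transform('sum'))
print(df)
```

Within each order the shares sum to 1, matching the 0.409342 / 0.403249 / 0.187409 figures in the output above.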

How to unpivot multiple Columns and name months dynamically in Python?

I need some help converting multiple columns into individual observations. Last time, with your help, I converted the Demand columns; now I have to add more columns like Job and PO (12 columns of each), convert them into three individual observations, and then later calculate a FutureFree column (FutureFree = max(Job, PO) - Demand).
from sqlalchemy import create_engine
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
import calendar
from pandas.tseries.offsets import MonthEnd
engine = create_engine('mssql+pyodbc://server/driver=SQL+Server')
con=engine.connect()
rs=con.execute("""Select StockCode, Demand00, Demand01, Demand02,
Demand03, Demand04, Demand05, Demand06, Demand07, Demand08, Demand09,
Demand10, Demand11 from ForecastData""")
df= pd.DataFrame(rs.fetchall())
df.columns = ["StockCode", "Demand01","Demand02", "Demand03", "Demand04",
"Demand05", "Demand06","Demand07", "Demand08", "Demand09", "Demand10",
"Demand11", "Demand12"]
df.set_index('StockCode')
demand_columns=[i for i in df.columns if i.startswith('Demand')]
today=pd.Timestamp.now()
month_list=[(today+pd.DateOffset(months=i)) for i in
range(len(demand_columns))]
dic_month={col:month for col,month in zip(demand_columns,month_list)}
df.rename(columns=dic_month)
df2 = pd.DataFrame(df.rename(columns=dict(zip(demand_columns, month_list)))
                     .set_index('StockCode')
                     .stack()).reset_index()
df2.columns = ['StockCode', 'Month', 'Value']
df2['Month'] = pd.to_datetime(df2['Month'], format = '%Y%m').dt.date
Previous Output
StockCode Month Value
ABC 2019-01-01 100
ABC 2019-02-01 80
BXY 2019-01-01 50
Desired Output
StockCode Month Demand Job PO FutureFree
ABC January 100 120 0 20
ABC February 120 80 0 0
BXY January 50 00 60 10
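Since no answer is shown, here is a hedged sketch using pd.wide_to_long, with made-up two-month data (the real frame has 12 numeric suffixes per stub); FutureFree is clipped at zero so that months where Demand exceeds supply show 0, matching the desired output:

```python
import calendar
import pandas as pd

# hypothetical wide frame; column names follow the Demand/Job/PO + suffix pattern
df = pd.DataFrame({
    'StockCode': ['ABC', 'BXY'],
    'Demand1': [100, 50], 'Demand2': [120, 40],
    'Job1': [120, 0],     'Job2': [80, 30],
    'PO1': [0, 60],       'PO2': [0, 10],
})

# one row per StockCode/month, with Demand, Job, PO as ordinary columns
long = (pd.wide_to_long(df, stubnames=['Demand', 'Job', 'PO'],
                        i='StockCode', j='MonthNum')
          .reset_index())
long['Month'] = long['MonthNum'].map(lambda m: calendar.month_name[m])
long['FutureFree'] = (long[['Job', 'PO']].max(axis=1) - long['Demand']).clip(lower=0)
print(long)
```

wide_to_long handles all three stubs in one call, so there is no need to stack Demand, Job, and PO separately and merge them back together.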
