Below is the data that I have:
There are three columns: the class, the date and the marks.
I need my output to be in the format below:
Here the column headers 1, 2, 3 are the classes, and each column contains the total marks stored month- and year-wise.
My approach to this used the following logic:
(Although I get the result, I am not satisfied with the code. I was hoping that someone could help me with a more efficient solution?)
classes = sorted(df['Class'].unique())        # 'class' is a reserved word in Python, so use a different name
dict_all = dict.fromkeys(classes, 1)
for c in classes:
    actuals = []
    for i in range(2018, 2019):               # note: this covers 2018 only; widen the range for more years
        for j in range(1, 13):                # months 1-12
            a = df['date'].map(lambda x: x.year == i)
            b = df['date'].map(lambda x: x.month == j)
            x = df[a & b & (df['Class'] == c)].marks.sum()
            actuals.append(x)
    dict_all[c] = actuals
result = pd.DataFrame.from_dict(dict_all)
Input:
Class Date Total
0 3 01-01-2018 32
1 2 01-01-2018 69
2 1 01-01-2018 129
3 3 01-01-2019 12
Using df.pivot
df1 = df.pivot(index="Date", columns="Class", values="Total").reset_index().fillna(0)
print(df1)
Using crosstab
df1 = pd.crosstab(index=df["Date"], columns=df["Class"], values=df["Total"], aggfunc="max").reset_index().fillna(0)
print(df1)
Using groupby & unstack()
df1 = df.groupby(['Date','Class'])['Total'].max().unstack(fill_value=0).reset_index()
print(df1)
All three snippets give the output:
Class Date 1 2 3
0 01-01-2018 129 69 32
1 01-01-2019 0 0 12
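If the totals should instead be rolled up by month and year, as the original question describes, a minimal variant of the crosstab approach is sketched below; it assumes df is the input frame above, pandas is imported as pd, and the Date strings are day-first so they can be parsed with pd.to_datetime:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True)
df1 = (pd.crosstab(index=df["Date"].dt.to_period("M"),   # month-and-year buckets
                   columns=df["Class"],
                   values=df["Total"],
                   aggfunc="sum")
         .fillna(0)
         .reset_index())
print(df1)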
Let's say I have the dataset:
df1 = pd.DataFrame()
df1['number'] = [0,0,0,0,0]
df1["decade"] = ["1970", "1980", "1990", "2000", "2010"]`
print(df1)
#output:
number decade
0 0 1970
1 0 1980
2 0 1990
3 0 2000
4 0 2010
and I want to merge it with another dataset:
df2 = pd.DataFrame()
df2['number'] = [1,1]
df2["decade"] = ["1990", "2010"]
print(df2)
#output:
number decade
0 1 1990
1 1 2010
such that it gets values only for the decades in df2 that have values in them and leaves the others untouched, yielding:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
How must one go about doing that in pandas? I've tried things like join, merge, and concat, but they all seem to either not give the desired result or not work because of the different dimensions of the two datasets. Any suggestions regarding which function I should be looking at?
Thank you so much!
You can use pandas.DataFrame.merge with pandas.DataFrame.fillna:
out = (
df1[["decade"]]
.merge(df2, on="decade", how="left")
.fillna({"number": df1["number"]}, downcast="infer")
)
# Output :
print(out)
decade number
0 1970 0
1 1980 0
2 1990 1
3 2000 0
4 2010 1
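An alternative one-liner (a sketch, assuming the decade values in df2 are unique) maps df2 onto df1 and falls back to df1's original number where there is no match:
df1["number"] = (df1["decade"]
                 .map(df2.set_index("decade")["number"])   # NaN where the decade is absent from df2
                 .fillna(df1["number"])
                 .astype(int))
print(df1)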
What about using apply?
First you create a function:
def validation(previous, latest):
    if pd.isna(latest):
        return previous
    else:
        return latest
Then you can use DataFrame.apply to compare the data in df1 with df2:
df1['number'] = df1.apply(
    lambda row: validation(row['number'],
                           df2.loc[df2['decade'] == row.decade].number.max()),
    axis=1)
Your result:
number decade
0 0 1970
1 0 1980
2 1 1990
3 0 2000
4 1 2010
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all the counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first and the last day of counts) is the same for all of the ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this and it does work, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = df.loc[df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find the Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
   id level_1    date  count
0   1   35 22  9/3/22     11
1   1   36 22  9/9/22     28
2   2   35 22  9/3/22      7
3   2   36 22  9/9/22     13
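Note that the 9/9/22 labels above come from taking the max of date while it is still a string (a lexicographic maximum); parsing the dates first yields 9/10/22 for week 36. A fully vectorised alternative (a sketch, assuming the date column is converted with pd.to_datetime) is pd.Grouper with a weekly frequency anchored on Saturday, which labels each group with the Saturday that closes the week:
df['date'] = pd.to_datetime(df['date'])
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
print(out)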
I tried to understand as much as I can :)
Here is my process:
# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting the week number so that Saturday becomes the last day of the week
# (this simple +1 shift works for this sample, but can misbehave at year boundaries)
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11 2022-09-03 00:00:00
1   1    36     28 2022-09-10 00:00:00
2   2    35      7 2022-09-03 00:00:00
3   2    36     13 2022-09-10 00:00:00
I have two dataframes, just like below.
Dataframe1:
country  type  start_week  end_week
      1     a          12        13
      2     b          13        14
Dataframe2:
country  type  week  value
      1     a    12   1000
      1     a    13    900
      1     a    14    800
      2     b    12   1000
      2     b    13    900
      2     b    14    800
I want to add to the first dataframe a column with the mean value from the second dataframe, keyed on (country, type) and restricted to the weeks between start_week and end_week.
I want the desired output to look like the below:
country  type  start_week  end_week  avg
      1     a          12        13  950
      2     b          13        14  850
Here is one way:
combined = df1.merge(df2, on=['country', 'type'])
combined = combined.loc[(combined.start_week <= combined.week) & (combined.week <= combined.end_week)]
output = combined.groupby(['country', 'type', 'start_week', 'end_week'])['value'].mean().reset_index()
output:
>>
country type start_week end_week value
0 1 a 12 13 950.0
1 2 b 13 14 850.0
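To attach that result to the first dataframe under the avg column name (a small follow-up sketch), merge it back on the same keys:
df1 = df1.merge(output.rename(columns={'value': 'avg'}),
                on=['country', 'type', 'start_week', 'end_week'],
                how='left')
print(df1)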
You can use pd.melt and comparison of numpy arrays.
# melt df1
melted_df1 = df1.melt(id_vars=['country', 'type'], value_name='week')[['country', 'type', 'week']]
# nested loop comparing rows of the two dataframes as numpy arrays
result = []
for i in df2.values:
    for j in melted_df1.values:
        if (j == i[:3]).all():
            result.append(i)
            break
# compute the mean of the matching rows per type
result_df = pd.DataFrame(result, columns=df2.columns).groupby('type').mean().reset_index()['value']
# assign result_df to df1
df1['avg'] = result_df
country type start_week end_week avg
0 1 a 12 13 950.0
1 2 b 13 14 850.0
I'm trying to add a number of years to a date column based on the value in another column.
This is what I mean:
base_date amount_years
0 2006-09-01 2
1 2007-04-01 4
The result would be:
base_date amount_years
0 2008-09-01 2
1 2011-04-01 4
Is there a way to achieve this in python?
Use DateOffset with apply and axis=1 to process the rows one by one:
f = lambda x: x['base_date'] + pd.offsets.DateOffset(years=x['amount_years'])
df['base_date'] = df.apply(f, axis=1)
print (df)
base_date amount_years
0 2008-09-01 2
1 2011-04-01 4
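A list-comprehension variant (a sketch, assuming base_date is parsed as datetime first) avoids apply and gives the same result:
df['base_date'] = pd.to_datetime(df['base_date'])
df['base_date'] = [d + pd.DateOffset(years=int(y))          # add the per-row number of years
                   for d, y in zip(df['base_date'], df['amount_years'])]
print(df)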
I have two files which contain information about transactions on products.
Operations of type 1
d_op_1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3],'cost':[10,20,20,20,10,20,20,20],
'date':[2000,2006,2012,2000,2009,2009,2002,2006]})
Operations of type 2
d_op_2 = pd.DataFrame({'id':[1,1,2,2,3,4,5,5],'cost':[3000,3100,3200,4000,4200,3400,2000,2500],
'date':[2010,2015,2008,2010,2006,2010,1990,2000]})
I want to keep only those records where there has been an operation of type 1 between two operations of type 2.
E.g. for the product with id "1" there was an operation of type 1 (2012) between two operations of type 2 (2010, 2015), so I want to keep that record.
The desired output could be either this:
or this:
Using pd.merge() I got this result:
How can I filter this to get the desired output?
You can use:
#concat DataFrames together
df4 = pd.concat([d_op_1.rename(columns={'cost':'cost1'}),
d_op_2.rename(columns={'cost':'cost2'})]).fillna(0).astype(int)
#print (df4)
#find max and min dates per group
df3 = d_op_2.groupby('id')['date'].agg(start='min', end='max')
#print (df3)
#join max and min dates to concated df
df = df4.join(df3, on='id')
df = df[(df.date > df.start) & (df.date < df.end)]
#reshape df for min, max and the dates between them
df = pd.melt(df,
             id_vars=['id','cost1'],
             value_vars=['date','start','end'],
             value_name='date_all')
#remove helper columns, drop duplicates and restore the date column name
df = df.drop(['cost1','variable'], axis=1) \
       .drop_duplicates() \
       .rename(columns={'date_all':'date'})
#merge to original, sorting
df = pd.merge(df, df4, on=['id', 'date']) \
.sort_values(['id','date']).reset_index(drop=True)
#reorder columns
df = df[['id','cost1','cost2','date']]
print (df)
id cost1 cost2 date
0 1 0 3000 2010
1 1 20 0 2012
2 1 0 3100 2015
3 2 0 3200 2008
4 2 10 0 2009
5 2 20 0 2009
6 2 0 4000 2010
#if need lists for duplicates
df = df.groupby(['id','cost2', 'date'])['cost1'] \
.apply(lambda x: list(x) if len(x) > 1 else x.values[0]) \
.reset_index()
df = df[['id','cost1','cost2','date']]
print (df)
id cost1 cost2 date
0 1 20 0 2012
1 1 0 3000 2010
2 1 0 3100 2015
3 2 [10, 20] 0 2009
4 2 0 3200 2008
5 2 0 4000 2010
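A shorter sketch of the same filtering idea (assuming "between two operations of type 2" means between the earliest and the latest type-2 date for that id) joins the per-id bounds onto the type-1 operations:
#per-id earliest and latest type-2 dates
bounds = d_op_2.groupby('id')['date'].agg(start='min', end='max')
#keep only type-1 operations that fall strictly between those dates
kept = d_op_1.join(bounds, on='id')
kept = kept[(kept['date'] > kept['start']) & (kept['date'] < kept['end'])]
print(kept[['id', 'cost', 'date']])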