I have an example dataframe with yearly granularity:
df = pd.DataFrame({
    "date": ["2020-01-01", "2021-01-01", "2022-01-01"],
    "cost": [100, 1000, 150],
    "person": ["Tom", "Jerry", "Brian"]
})
I want to create a dataframe with monthly granularity, without any estimation methods (just repeat each row 12 times, once per month of its year). So from this 3-row dataframe I would like to get exactly 36 rows, like:
2020-01-01 / 100 / Tom
2020-02-01 / 100 / Tom
2020-03-01 / 100 / Tom
2020-04-01 / 100 / Tom
2020-05-01 / 100 / Tom
[...]
2022-10-01 / 150 / Brian
2022-11-01 / 150 / Brian
2022-12-01 / 150 / Brian
I tried
df.resample('M', on='date').apply(lambda x: x)
but can't seem to get it working. I'm a beginner, so forgive my ignorance.
Thanks in advance for any help!
Here is a way to do that.
count = len(df)
for var in df[['date', 'cost', 'person']].values:
    for i in range(2, 13):
        # Build "YYYY-MM-01" by splicing the month number into the date string
        df.loc[count] = [var[0][0:5] + "{:02d}".format(i) + var[0][7:], var[1], var[2]]
        count += 1
df = df.sort_values('date')
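For reference, here is a more idiomatic sketch of the same idea, expanding each yearly date into the twelve month-start dates and exploding; it assumes the sample frame above, and `out` is just an illustrative name:
import pandas as pd

df = pd.DataFrame({
    "date": ["2020-01-01", "2021-01-01", "2022-01-01"],
    "cost": [100, 1000, 150],
    "person": ["Tom", "Jerry", "Brian"],
})

# Turn each yearly date into a list of the 12 month-start dates of that year
df["date"] = pd.to_datetime(df["date"])
df["date"] = df["date"].apply(lambda d: list(pd.date_range(d, periods=12, freq="MS")))

# One row per month; cost and person are repeated automatically
out = df.explode("date").reset_index(drop=True)
print(out)  # 36 rows, from 2020-01-01 / 100 / Tom to 2022-12-01 / 150 / Brian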
The following should also work:
# Typecasting
df['date'] = pd.to_datetime(df['date'])
# Making a new dataframe with monthly frequency
op = pd.DataFrame(pd.date_range(start=df['date'].min(),
                                end=df['date'].max() + pd.offsets.DateOffset(months=11),
                                freq='MS'),
                  columns=['date'])
# Merging both frames on year (outer join)
res = pd.merge(df, op,
               left_on=df['date'].apply(lambda x: x.year),
               right_on=op['date'].apply(lambda x: x.year),
               how='outer')
# Dropping the key column and the yearly date from the left side
res.drop(['key_0', 'date_x'], axis=1, inplace=True)
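As a hedged follow-up: after the merge, the monthly dates live in date_y (pandas' default suffix for the overlapping column name), so a small cleanup leaves a tidy date column:
# Assumes the default _x/_y merge suffixes from the code above
res = res.rename(columns={'date_y': 'date'}).sort_values('date').reset_index(drop=True)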
I have a dataframe df with three months of revenue per user, and I need to find the percent change between August and July using Python.
user revenuejune revenuejuly revenueaugust
Sam 231.13 1345.2 2455
Desired output:
user revenuejune revenuejuly revenueaugust change
Sam 231.13 1345.2 2455 82.5
Use:
df['change'] = ((df['revenueaugust'] - df['revenuejuly'])/df['revenuejuly']*100)
Output:
user revenuejune revenuejuly revenueaugust change
0 Sam 231.13 1345.2 2455 82.500743
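For reference, pandas' built-in pct_change can compute the same thing across columns; a minimal sketch, assuming the same column names as in the question:
# Fractional change between adjacent columns, scaled to percent
df['change'] = df[['revenuejuly', 'revenueaugust']].pct_change(axis=1)['revenueaugust'] * 100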
I am not sure I understood it right, but I guess you just need to add a new column based on an operation between the columns revenueaugust and revenuejuly.
import pandas as pd
data = pd.DataFrame(
{
'name': ['Sam', 'Bob'],
'revenuejune': [231.13, 200],
'revenuejuly': [1345.2, 300],
'revenueaugust': [2455, 400],
}
)
data['change'] = (data['revenueaugust'] - data['revenuejuly'])/ data['revenuejuly'] * 100
print(data)
Output:
name revenuejune revenuejuly revenueaugust change
0 Sam 231.13 1345.2 2455 82.500743
1 Bob 200.00 300.0 400 33.333333
This should also work for calculating the difference between August and July:
df["change"] = (df["revenueaugust"] - df["revenuejuly"]) / df["revenuejuly"] * 100
I have two dataframes which need to be compared iteratively, and the mismatched rows have to be stored in a CSV. Since the data contains historical dates, the comparison needs to be performed per year. How can this be achieved in pandas?
product_1 price_1 Date of purchase
0 computer 1200 2022-01-02
1 monitor 800 2022-01-03
2 printer 200 2022-01-04
3 desk 350 2022-01-05
product_2 price_2 Date of purchase
0 computer 900 2022-01-02
1 monitor 800 2022-01-03
2 printer 300 2022-01-04
3 desk 350 2022-01-05
I would use a split/merge/where approach.
df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x : x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x : x.split('-')[0])
From there you can merge the two frames using a join or merge.
After that you can build a boolean check column with np.where():
merge_df['Check'] = np.where(merge_df['comp_column'] == merge_df['another_comp_column'], True, False)
From there you can just look for where the comp columns didn't match
merge_df.loc[merge_df['Check'] == False]
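Putting those steps together, here is a minimal end-to-end sketch; df1/df2 follow the sample frames in the question, and 'year', 'Check', and the filename are illustrative:
import numpy as np

# Split out the year from the string dates
df1['year'] = df1['Date of purchase'].apply(lambda x: x.split('-')[0])
df2['year'] = df2['Date of purchase'].apply(lambda x: x.split('-')[0])
# Merge per year and product
merge_df = df1.merge(df2, left_on=['year', 'product_1'],
                     right_on=['year', 'product_2'])
# True where the prices match
merge_df['Check'] = np.where(merge_df['price_1'] == merge_df['price_2'], True, False)
# Rows where the check failed can go straight to a CSV
merge_df.loc[merge_df['Check'] == False].to_csv('mismatches.csv', index=False)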
Let's first solve the problem for a single group of dates/years. You could merge your data using the date and product names:
df = df1.merge(df2, left_on=["Date of purchase", "product_1"], right_on=["Date of purchase", "product_2"])
# Bonus points if you rename "product_2" and only use `on` instead of `left_on` and `right_on`
After that, you could simply use .loc to find the rows where prices do not match:
df.loc[df["price_1"] != df["price_2"]])
product_1 price_1 Date of purchase product_2 price_2
0 computer 1200 2022-01-02 computer 900
2 printer 200 2022-01-04 printer 300
Now, you could process each year by iterating a list of years, querying only the data from that year on df1 and df2 and then using the above procedure to find the price mismatches:
# List available years
years = pd.concat([df1["Date of purchase"].dt.year, df2["Date of purchase"].dt.year], axis=0).unique()
# Rename columns for those bonus points
df1 = df1.rename(columns={"product_1": "product"})
df2 = df2.rename(columns={"product_2": "product"})
# Accumulate your rows in a new dataframe (starting from a list)
output_rows = list()
for year in years:
    # Find the data for this `year`
    df1_year = df1.loc[df1["Date of purchase"].dt.year == year]
    df2_year = df2.loc[df2["Date of purchase"].dt.year == year]
    # Apply the procedure described at the beginning
    df = df1_year.merge(df2_year, on=["Date of purchase", "product"])
    # Find rows where prices do not match
    mismatch_rows = df.loc[df["price_1"] != df["price_2"]]
    output_rows.append(mismatch_rows)
# Now, transform your rows into a single dataframe
output_df = pd.concat(output_rows)
Output:
product price_1 Date of purchase price_2
0 computer 1200 2022-01-02 900
2 printer 200 2022-01-04 300
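Since the question asks for the mismatches to end up in a CSV, one last hedged step (the filename is an assumption):
# Write the accumulated mismatch rows out
output_df.to_csv('price_mismatches.csv', index=False)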
I have a dataframe containing two columns of dates: start date and end date. I need to set up a dataframe where each month of the year is a separate column, based on the start/end date intervals, so that I can sum the values from another column for each month, per name.
To illustrate:
Original df:
Start Date End Date Name Value
10/22/20 01/25/21 John 100
10/12/20 04/30/21 John 50
02/25/21 None John 20
Desired df:
Name Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 Jul_21 Aug_21 ...
John 150 150 150 150 70 70 70 20 20 20 20 ...
Any suggestions or pointers on how I could achieve that result would be greatly appreciated!
First convert the values to datetimes, coercing anything that is not a date to a missing value and replacing those with some end date. Then, in a list comprehension, expand each row into a Series of all its months, which is used for pivoting with DataFrame.pivot_table:
end = '2021-12-31'
df['Start'] = pd.to_datetime(df['Start Date'])
df['End'] = pd.to_datetime(df['End Date'], errors='coerce').fillna(pd.Timestamp(end))
s = pd.concat([pd.Series(r.Index,pd.date_range(r.Start, r.End, freq='M'))
for r in df.itertuples()])
df1 = pd.DataFrame({'Date': s.index}, s).join(df)
df2 = df1.pivot_table(index='Name',
columns='Date',
values='Value',
aggfunc='sum',
fill_value=0)
df2.columns = df2.columns.strftime('%b_%y')
print (df2)
Date Oct_20 Nov_20 Dec_20 Jan_21 Feb_21 Mar_21 Apr_21 May_21 Jun_21 \
Name
John 150 150 150 50 70 70 70 20 20
Date Jul_21 Aug_21 Sep_21 Oct_21 Nov_21 Dec_21
Name
John 20 20 20 20 20 20
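Note that freq='M' marks month ends, so an interval that stops before a month's last day (e.g. 01/25/21) does not count that month; that is why Jan_21 shows 50 rather than the 150 in the desired table. If partial months should count, a hedged variant of the comprehension using period_range:
# Count every month the interval touches, including partial ones
s = pd.concat([pd.Series(r.Index, pd.period_range(r.Start, r.End, freq='M').to_timestamp())
               for r in df.itertuples()])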
I have a dataframe, df, for which I wish to take the delta over every 7-day period.
df:
Date Value
10/15/2020 75
10/14/2020 70
10/13/2020 65
10/12/2020 60
10/11/2020 55
10/10/2020 50
10/9/2020 45
10/8/2020 40
10/7/2020 35
10/6/2020 30
10/5/2020 25
10/4/2020 20
10/3/2020 15
10/2/2020 10
10/1/2020 5
Desired Output:
Date Value
10/9/2020 30
10/2/2020 30
This is what I am doing, thanks to the help of someone on this platform:
df.Date = pd.to_datetime(df.Date)
s = df.set_index('Date')['Value']
df['New'] = s.shift(freq = '-6 D').reindex(s.index).values
df['Delta'] = df['New'] - df['Value']
df[['Date','Delta']].dropna()
However, this gives me a running delta; I wish to have the delta displayed for every 7-day period, as shown in the desired output.
Any suggestion is appreciated.
I think the way you have done it is almost right; modifying it a bit will give you the desired result. Try this:
df.Date = pd.to_datetime(df.Date)
s = df.set_index('Date')['Value']
df['New'] = s.shift(freq = '-6 D').reindex(s.index).values
df['Delta'] = df['New'] - df['Value']
df_new=df[['Date','Delta']].dropna()
df_new.iloc[::7, :]
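For reference, here is a hedged alternative without the shift/reindex step, chunking the rows into consecutive 7-day groups ('bucket' is an illustrative name; it assumes the dates are sorted descending, as in the sample):
df = df.sort_values('Date', ascending=False).reset_index(drop=True)
df['bucket'] = df.index // 7  # rows 0-6 form the first week, 7-13 the second, ...
delta = (df.groupby('bucket')
           .agg(Date=('Date', 'last'),  # the oldest date in each chunk
                Delta=('Value', lambda v: v.iloc[0] - v.iloc[-1])))
# Keep only complete 7-day chunks (a trailing partial chunk would give a bogus delta)
delta = delta[df.groupby('bucket').size() == 7]
On the sample data this leaves 10/9 and 10/2 with a delta of 30 each, matching the desired output.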
I have a dataframe containing dates and prices. I need to add up all prices belonging to a given week, e.g. 17/12 to 23/12, and put the total in front of a new label corresponding to that week.
Date Price
12/17/2015 10
12/18/2015 20
12/19/2015 30
12/21/2015 40
12/24/2015 50
I want the output to be the following
week total
17/12-23/12 100
24/12-30/12 50
I tried using different datetime and groupby functions but was not able to get the desired output. Please help.
What about this approach?
In [19]: df.groupby(df.Date.dt.isocalendar().week)['Price'].sum().rename_axis('week_no').reset_index(name='total')
Out[19]:
week_no total
0 51 60
1 52 90
UPDATE: with a plain 7-day resample (the bins start at the first date in the data):
In [49]: df.resample('7D', on='Date').sum().rename_axis('week_from').reset_index()
Out[49]:
week_from Price
0 2015-12-17 100
1 2015-12-24 50
UPDATE2:
x = (df.resample('7D', on='Date')
       .sum()
       .reset_index()
       .rename(columns={'Price': 'total'}))
x = x.assign(week=x['Date'].dt.strftime('%d/%m')
                  + '-'
                  + (x.pop('Date') + pd.DateOffset(days=6)).dt.strftime('%d/%m'))
In [127]: x
Out[127]:
total week
0 100 17/12-23/12
1 50 24/12-30/12
Using resample
df['Date'] = pd.to_datetime(df['Date'])
df.set_index(df.Date, inplace = True)
df = df.resample('W').sum()
Price
Date
2015-12-20 60
2015-12-27 90
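Note that plain 'W' anchors each weekly bin on a Sunday. To match the Thursday-to-Wednesday weeks asked for (17/12 to 23/12), a hedged tweak is to end the bins on Wednesdays instead:
# 'W-WED' ends each weekly bin on a Wednesday, so 12/17-12/23 forms one bin
df = df.resample('W-WED').sum()
This yields 100 for the bin labelled 2015-12-23 and 50 for 2015-12-30, matching the totals in the desired output.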