I have the following table:
TimeStamp                Name   Marks  Subject
2022-01-01 00:00:02.969  Chris  70     DK
2022-01-01 00:00:04.467  Chris  75     DK
2022-01-01 00:00:05.965  Mark   80     DK
2022-01-01 00:00:08.962  Cuban  60     DK
2022-01-01 00:00:10.461  Cuban  58     DK
I want to aggregate the table into 20-minute intervals, with the min, max and standard deviation of Marks for each Name.
Expected output:
TimeStamp                Subject  Chris_Min  Chris_Max  Chris_STD  Mark_Min  Mark_Max  Mark_STD
2022-01-01 00:00:00.000  DK       70         75
2022-01-01 00:20:00.000  DK       etc        etc
2022-01-01 00:40:00.000  DK       etc        etc
I am having a hard time aggregating the data into the required output.
The aggregation should be dynamic, so that the interval can be changed to 10min or 30min.
I tried using bins to do it, but I am not getting the desired results.
Please help.
You could try the following:
rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg(Min=("Marks", "min"), Max=("Marks", "max"), STD=("Marks", "std"))
.unstack(0)
.swaplevel(0, 1).reset_index()
)
First, set TimeStamp as the index and group by Name and Subject to get the right chunks to work on.
Then resample each group with the given frequency rule using .resample().
Then compute the required statistics with .agg() using named aggregation.
Unstack the first index level (Name) to move it into the columns.
Finally, swap the remaining index levels to get the right order when resetting the index.
Result for the given sample:
      TimeStamp Subject   Min              Max                   STD
Name                    Chris Cuban Mark Chris Cuban Mark      Chris     Cuban Mark
0    2022-01-01      DK    70    58   80    75    60   80   3.535534  1.414214  NaN
If you want the columns exactly like in your expected output then you could add the following
result = result[
list(result.columns[:2]) + sorted(result.columns[2:], key=lambda c: c[1])
]
result.columns = [f"{lev1}_{lev0}" if lev1 else lev0 for lev0, lev1 in result.columns]
to get
TimeStamp Subject Chris_Min Chris_Max ... Cuban_STD Mark_Min Mark_Max Mark_STD
0 2022-01-01 DK 70 75 ... 1.414214 80 80 NaN
If you're getting the TypeError: aggregate() missing 1 required positional argument... error (the comment is gone), then it could be that you're working with an older Pandas version that can't deal with named aggregation after .resample(). You could try the following instead:
rule = "10min"
result = (
df.set_index("TimeStamp").groupby(["Name", "Subject"])
.resample(rule)
.agg({"Marks": ["min", "max", "std"]})
.droplevel(0, axis=1)
.unstack(0)
.swaplevel(0, 1).reset_index()
)
...
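Since the question asks for the interval to be adjustable, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: the function name aggregate_marks and the column-label mapping are made up for illustration, and it swaps .resample() for pd.Grouper inside a single groupby, which only produces rows for intervals that actually contain data.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "TimeStamp": pd.to_datetime([
        "2022-01-01 00:00:02.969", "2022-01-01 00:00:04.467",
        "2022-01-01 00:00:05.965", "2022-01-01 00:00:08.962",
        "2022-01-01 00:00:10.461",
    ]),
    "Name": ["Chris", "Chris", "Mark", "Cuban", "Cuban"],
    "Marks": [70, 75, 80, 60, 58],
    "Subject": ["DK"] * 5,
})

# Hypothetical mapping from pandas function names to the desired suffixes
LABELS = {"min": "Min", "max": "Max", "std": "STD"}

def aggregate_marks(frame, rule):
    """Aggregate Marks per time bin and Subject, one column set per Name."""
    out = (
        frame.groupby([pd.Grouper(key="TimeStamp", freq=rule), "Subject", "Name"])["Marks"]
        .agg(list(LABELS))
        .unstack("Name")  # columns become (statistic, Name) pairs
    )
    # Flatten the two column levels into labels such as "Chris_Min"
    out.columns = [f"{name}_{LABELS[stat]}" for stat, name in out.columns]
    return out.reset_index()

result = aggregate_marks(df, "20min")
print(result)
```

Calling aggregate_marks(df, "10min") or aggregate_marks(df, "30min") changes the interval without touching the rest of the code.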
Is your table a pandas dataframe?
If it's a pandas dataframe you can use resample:
# only if timestamp is not the index yet:
df = df.set_index('TimeStamp')
# the important part: pass agg a list of function names (strings work for
# simple functions like mean) or any callables:
df = df.resample('10min').agg(['max', 'min'])
# only if you had to set index to timestamp and want to go back to normal index:
df = df.reset_index()
Edit to get second table in the function:
# choose aggregation functions
agg_functions = ['min', 'max', 'std']
# set_index on time column, keep the numeric column ('std' fails on strings), resample
resampled_df = df.set_index('TimeStamp')[['Marks']].resample('10min').agg(agg_functions)
# flatten multiindex
resampled_df.columns = resampled_df.columns.map('_'.join)
# drop time column
resampled_df = resampled_df.reset_index(drop=True)
# concatenate with original df
pd.concat([df, resampled_df], axis=1)
I have a dataframe which contains sales information for products. What I need is a function which, given a product id, product type and date, calculates the average sales over the period up to and including that date.
This is how I have implemented it, but this approach takes a lot of time and I was wondering if there is a faster way to do this.
Dataframe:
product_type = ['A', 'B']
df = pd.DataFrame({'prod_id': np.repeat(np.arange(start=2, stop=5, step=1), 235),
                   'prod_type': np.random.choice(np.array(product_type), 705),
                   'sales_time': pd.date_range(start='1-1-2018', end='3-30-2018', freq='3h'),
                   'sale_amt': np.random.randint(4, 100, size=705)})
Current code:
def cal_avg(product, ptype, pdate):
    temp_df = df[(df['prod_id'] == product) & (df['prod_type'] == ptype) & (df['sales_time'] <= pdate)]
    return temp_df['sale_amt'].mean()
Calling the function:
cal_avg(2,'A','2018-02-12 15:00:00')
53.983
If you are running the cal_avg function only "rarely", then I suggest ignoring my answer. Otherwise, it might be beneficial to simply calculate the expanding-window average for each product/product type. It might be slow depending on your dataset size (in which case maybe just run it on specific product types?), but you'll only need to run it once. First sort by the column you want to perform the 'expanding' on ('expanding' has no 'on' parameter) to ensure the proper row order. Then 'groupby' and 'transform' each group (to keep the indices of the original dataframe) with your expanding-window aggregation of choice (in this case 'mean').
df = df.sort_values('sales_time')
df['exp_mean_sales'] = df.groupby(['prod_id', 'prod_type'])['sale_amt'].transform(lambda gr: gr.expanding().mean())
With the result being:
df.head()
prod_id prod_type sales_time sale_amt exp_mean_sales
0 2 B 2018-01-01 00:00:00 8 8.000000
1 2 B 2018-01-01 03:00:00 72 40.000000
2 2 B 2018-01-01 06:00:00 33 37.666667
3 2 A 2018-01-01 09:00:00 81 81.000000
4 2 B 2018-01-01 12:00:00 83 49.000000
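Once the expanding mean is precomputed, a single query still needs some way to find the last row at or before the requested date. One possible sketch, with hypothetical names (lookup, cal_avg_fast) and deterministic stand-in data in place of the random product types, keeps one sorted array of times per group and uses a binary search:

```python
import numpy as np
import pandas as pd

# Stand-in data shaped like the question's dataframe (705 rows, 3-hour steps)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "prod_id": np.repeat(np.arange(2, 5), 235),
    "prod_type": np.resize(np.array(["A", "B"]), 705),  # alternating types (assumption)
    "sales_time": pd.date_range("2018-01-01", periods=705, freq=pd.Timedelta(hours=3)),
    "sale_amt": rng.integers(4, 100, size=705),
})

# Precompute the expanding mean per (prod_id, prod_type), as in the answer above
df = df.sort_values("sales_time")
df["exp_mean_sales"] = (
    df.groupby(["prod_id", "prod_type"])["sale_amt"]
    .transform(lambda gr: gr.expanding().mean())
)

# One (times, means) pair of arrays per group; times are already sorted
lookup = {
    key: (g["sales_time"].to_numpy(), g["exp_mean_sales"].to_numpy())
    for key, g in df.groupby(["prod_id", "prod_type"])
}

def cal_avg_fast(product, ptype, pdate):
    """Average of sale_amt for the group over all rows with sales_time <= pdate."""
    times, means = lookup[(product, ptype)]
    # Position of the last timestamp that is <= pdate (-1 if there is none)
    pos = np.searchsorted(times, pd.Timestamp(pdate).to_datetime64(), side="right") - 1
    return means[pos] if pos >= 0 else np.nan

print(cal_avg_fast(2, "A", "2018-02-12 15:00:00"))
```

The dictionary is built once; each call afterwards is a dictionary access plus a binary search instead of a full boolean filter over the dataframe.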
Check the code below, with a %%timeit comparison (Google Colab):
import numpy as np
import pandas as pd
product_type = ['A','B']
df = pd.DataFrame({'prod_id':np.repeat(np.arange(start=2,stop=5,step=1),235),'prod_type': np.random.choice(np.array(product_type), 705),'sales_time': pd.date_range(start ='1-1-2018',
end ='3-30-2018', freq ='3h'),'sale_amt':np.random.randint(4,100,size = 705)})
## OP's function
def cal_avg(product,ptype,pdate):
    temp_df = df[(df['prod_id']==product) & (df['prod_type']==ptype) & (df['sales_time']<= pdate)]
    return temp_df['sale_amt'].mean()
## Numpy data prep
prod_id_array = np.array(df.values[:,:1])
prod_type_array = np.array(df.values[:,1:2])
sales_time_array = np.array(df.values[:,2:3], dtype=np.datetime64)
values = np.array(df.values[:,3:])
OP's function -
%%timeit
cal_avg(2,'A','2018-02-12 15:00:00')
Output:
Numpy version
%%timeit -n 1000
cal_vals = [2,'A','2018-02-12 15:00:00']
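# np.logical_and takes only two inputs (a third positional argument is `out`), so combine the three conditions with &
mask = (prod_id_array == cal_vals[0]) & (prod_type_array == cal_vals[1]) & (sales_time_array <= np.datetime64(cal_vals[2]))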
np.mean(values[mask])
Output:
I have a dataframe with three columns, let's say:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
Zara qrs 2021-02-25
I want to compare each date in the Date column with all the other dates in the Date column and keep only those rows which lie within 6 months of at least one of the other dates.
For example: (2022-06-06 - 2022-01-01) = 5 months, so we keep both these dates,
but
(2022-06-06 - 2021-02-25) and (2022-01-01 - 2021-02-25) exceed the 6 month limit,
so we will drop that row.
Desired Output:
Name Address Date
faraz xyz 2022-01-01
Abdul abc 2022-06-06
I have tried a couple of approaches such as nested loops, but I have 1 million+ entries and it takes forever to run that loop. Some of the dates repeat too; not all are unique.
for index, row in dupes_df.iterrows():
    for date in uniq_dates_list:
        format_date = datetime.strptime(date,'%d/%m/%y')
        if (( format_date.year - row['JournalDate'].year ) * 12 + ( format_date.month - row['JournalDate'].month ) <= 6):
            print("here here")
            break
    else:
        dupes_df.drop(index, inplace=True)
I need a much more optimal solution for this. I have read about lambda functions, but couldn't get to the depths of them.
IIUC, this should work for you:
import pandas as pd
import itertools
from io import StringIO
data = StringIO("""Name;Address;Date
faraz;xyz;2022-01-01
Abdul;abc;2022-06-06
Zara;qrs;2021-02-25
""")
df = pd.read_csv(data, sep=';', parse_dates=['Date'])
df_date = pd.DataFrame([sorted(l, reverse=True) for l in itertools.combinations(df['Date'], 2)], columns=['Date1', 'Date2'])
df_date['diff'] = (df_date['Date1'] - df_date['Date2']).dt.days
close = df_date.loc[df_date['diff'] <= 180, ['Date1', 'Date2']]  # every pair within 180 days, not only the first one
df[df['Date'].isin(close.to_numpy().ravel())]
Output:
Name Address Date
0 faraz xyz 2022-01-01
1 Abdul abc 2022-06-06
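With the 1 million+ rows mentioned in the question, building every pair of dates grows quadratically. A possible alternative, sketched here under the assumption that "6 months" is approximated by 180 days as above, relies on the fact that after sorting, a date has some other date within the limit exactly when its previous or next neighbour is within the limit. The names limit, gap_prev, gap_next and keep are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["faraz", "Abdul", "Zara"],
    "Address": ["xyz", "abc", "qrs"],
    "Date": pd.to_datetime(["2022-01-01", "2022-06-06", "2021-02-25"]),
})

limit = pd.Timedelta(days=180)  # assumption: 6 months approximated as 180 days

dates = df["Date"].sort_values()
gap_prev = dates.diff()          # distance to the previous (earlier) date
gap_next = dates.diff(-1).abs()  # distance to the next (later) date

# Keep a row if either neighbour is close enough; NaT comparisons are False
keep = (gap_prev <= limit) | (gap_next <= limit)

# Align the mask back to the original row order before filtering
result = df[keep.reindex(df.index)]
print(result)
```

Repeated dates have a gap of zero to their duplicate, so under this sketch they are always kept; whether that is the desired behaviour depends on how duplicates should be treated.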
First, I think it'd be easier if you use 'relativedelta' from 'dateutil'.
Reference: https://pynative.com/python-difference-between-two-dates-in-months/
Second, I think you need to add a column, let's call it score, initialised to 0.
In the second (inner) loop, if delta <= 6 months:
set score = 1 and 'continue'
This way each row is compared to all rows.
Finally, delete all rows that have score == 0.
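As a small illustration of the relativedelta suggestion, the sketch below counts whole months between two dates. The helper name months_between is made up for this example, and counting only completed months is an assumption about how the difference should be measured.

```python
from datetime import date

from dateutil.relativedelta import relativedelta

def months_between(a, b):
    """Number of completed months between two dates, regardless of order."""
    start, end = sorted((a, b))
    delta = relativedelta(end, start)
    return delta.years * 12 + delta.months

# Dates taken from the question's example
print(months_between(date(2022, 1, 1), date(2022, 6, 6)))   # 5
print(months_between(date(2021, 2, 25), date(2022, 1, 1)))  # 10
```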
Say I have the following code that generates a dataframe:
df = pd.DataFrame({"customer_code": ['1234','3411','9303'],
"main_purchases": [3,10,5],
"main_revenue": [103.5,401.5,99.0],
"secondary_purchases": [1,2,4],
"secondary_revenue": [43.1,77.5,104.6]
})
df.head()
There's the customer_code column that's the unique ID for each client.
And then there are 2 columns to indicate the purchases that took place and revenue generated from main branches by those clients.
And another 2 columns to indicate the purchases/revenue from secondary branches by those clients.
I want to get the data into a format like this, where a pivot is done where there's a new column to differentiate between main vs secondary, but the revenue numbers and purchase columns are not mixed up:
The obvious solution is just to split this into 2 dataframes, and then simply do a concatenate, but I'm wondering whether there's a built-in way to do this in a line or two - this strikes me as the kind of thing someone might have thought to bake in a solution for.
With a little column renaming to get the "revenue" and "purchases" in the column names first using a regular expression and str.replace we can use pd.wide_to_long to convert these now stubnames from columns to rows:
# Reorder column names so stubnames are first
df.columns = [df.columns[0],
*df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
# Convert wide_to_long
df = (
pd.wide_to_long(
df,
i='customer_code',
stubnames=['purchases', 'revenue'],
j='type',
sep='_',
suffix='.*'
)
.sort_index() # Optional sort to match expected output
.reset_index() # retrieve customer_code from the index
)
df:
  customer_code       type  purchases  revenue
0          1234       main          3    103.5
1          1234  secondary          1     43.1
2          3411       main         10    401.5
3          3411  secondary          2     77.5
4          9303       main          5     99
5          9303  secondary          4    104.6
What does reordering the column headers do?
df.columns = [df.columns[0],
*df.columns[1:].str.replace(r'(.*)_(.*)', r'\2_\1', regex=True)]
Produces:
Index(['customer_code', 'purchases_main', 'revenue_main',
'purchases_secondary', 'revenue_secondary'],
dtype='object')
The "type" column is now the suffix of the column header which allows wide_to_long to process the table as expected.
You can abstract the reshaping process with pivot_longer from pyjanitor, which provides a set of wrapper functions around Pandas:
#pip install pyjanitor
import pandas as pd
import janitor
df.pivot_longer(index = 'customer_code',
names_to=('type', '.value'),
names_sep='_',
sort_by_appearance=True)
customer_code type purchases revenue
0 1234 main 3 103.5
1 1234 secondary 1 43.1
2 3411 main 10 401.5
3 3411 secondary 2 77.5
4 9303 main 5 99.0
5 9303 secondary 4 104.6
The .value in names_to signifies to the function that you want that part of the column to remain as a header; the other part goes under the type column. The split is determined in this case by names_sep (there is a names_pattern option, that allows regular expression split); if you do not care about the order of appearance, you can set sort_by_appearance as False.
You can also use the melt() and concat() functions to solve this problem.
import pandas as pd
df1 = df.melt(
id_vars='customer_code',
value_vars=['main_purchases', 'secondary_purchases'],
var_name='type',
value_name='purchases',
ignore_index=True)
df2 = df.melt(
id_vars='customer_code',
value_vars=['main_revenue', 'secondary_revenue'],
var_name='type',
value_name='revenue',
ignore_index=True)
Then we use concat() with the parameter axis=1 to join side by side and use sort_values(by='customer_code') to sort data by customer.
result= pd.concat([df1,df2['revenue']],
axis=1,
ignore_index=False).sort_values(by='customer_code')
Using replace() with regex to align type names:
result['type'] = result['type'].replace(r'_.*$', '', regex=True)
The above code will output the below dataframe:
  customer_code       type  purchases  revenue
0          1234       main          3    103.5
3          1234  secondary          1     43.1
1          3411       main         10    401.5
4          3411  secondary          2     77.5
2          9303       main          5     99
5          9303  secondary          4    104.6
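Another route that stays within plain pandas is to split the column names into two levels and stack the branch level into the rows. This is only a sketch: the variable names wide and long_df are illustrative, and it assumes every column other than customer_code follows the "branch_measure" naming pattern.

```python
import pandas as pd

df = pd.DataFrame({"customer_code": ['1234', '3411', '9303'],
                   "main_purchases": [3, 10, 5],
                   "main_revenue": [103.5, 401.5, 99.0],
                   "secondary_purchases": [1, 2, 4],
                   "secondary_revenue": [43.1, 77.5, 104.6]})

wide = df.set_index("customer_code")
# "main_purchases" -> ("main", "purchases"): a two-level column index
wide.columns = wide.columns.str.split("_", expand=True)

long_df = (
    wide.stack(level=0)                        # move main/secondary into the rows
    .rename_axis(["customer_code", "type"])    # name the two index levels
    .reset_index()
)
print(long_df)
```

Depending on the pandas version, stack may emit a FutureWarning about its new implementation; the reshaped values are the same for this data because there are no missing entries.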
I have a CSV that initially creates the following dataframe:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-05 52304.0
Using the following script, I would like to fill in the missing dates, each with a corresponding NaN value in the Portfoliovalue column. So the result would be this:
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
I first tried the method here: Fill the missing date values in a Pandas Dataframe column
However the bfill replaces all my NaN's and removing it only returns an error.
So far I have tried this:
df = pd.read_csv("Tickers_test5.csv")
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
portfolio_value = portfolio_value + cash
date = datetime.date(datetime.now())
df2.loc[len(df2)] = [date, portfolio_value]
print(df2.asfreq('D'))
However, this only returns this:
Date Portfoliovalue
1970-01-01 NaN NaN
Thanks for your help. I am really impressed at how helpful this community is.
Quick update:
I have added the code, so that it fills my missing dates. However, it is part of a programme, which tries to update the missing dates every time it launches. So when I execute the code and no dates are missing, I get the following error:
ValueError: cannot reindex from a duplicate axis
The code is as follows:
df2 = pd.read_csv("Portfoliovalues.csv")
portfolio_value = df['Currentvalue'].sum()
date = datetime.date(datetime.now())
df2.loc[date, 'Portfoliovalue'] = portfolio_value
#Solution provided by Uts after asking on Stackoverflow
df2.Date = pd.to_datetime(df2.Date)
df2 = df2.set_index('Date').asfreq('D').reset_index()
So by the looks of it the code adds a duplicate date, which then causes the .reindex() function to raise the ValueError. However, I am not sure how to proceed. Is there an alternative to .reindex() or maybe the assignment of today's date needs changing?
Pandas has an asfreq function for a DatetimeIndex; it is basically a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
Code
df.Date = pd.to_datetime(df.Date)
df = df.set_index('Date').asfreq('D').reset_index()
Output
Date Portfoliovalue
0 2021-05-01 50000.0
1 2021-05-02 NaN
2 2021-05-03 NaN
3 2021-05-04 NaN
4 2021-05-05 52304.0
Pandas has a reindex method: given a list of index labels, it keeps only the labels from that list and fills any label missing from the original with NaN.
In your case, you can create all the dates you want, with date_range for example, and then pass them to reindex. You might need a simple set_index and reset_index, but I assume you don't care much about the original index.
Example:
df.set_index('Date').reindex(pd.date_range(start=df['Date'].min(), end=df['Date'].max(), freq='D')).reset_index()
First we set the 'Date' column as the index. Then we call reindex with the full list of dates (given by date_range from the minimum to the maximum date in the 'Date' column, with daily frequency) as the new index. This results in NaNs in the places that had no value before.
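The "cannot reindex from a duplicate axis" error in the question's update would be consistent with the same date ending up in the index more than once when the script runs again on the same day. One possible way around it, sketched with a hypothetical helper record_value and stand-in data instead of the CSV, is to write today's value by label on a date index (which overwrites an existing entry instead of appending), drop any leftover duplicates, and only then call asfreq:

```python
import pandas as pd

# Stand-in for the contents of Portfoliovalues.csv
df2 = pd.DataFrame({"Date": ["2021-05-01", "2021-05-05"],
                    "Portfoliovalue": [50000.0, 52304.0]})
df2["Date"] = pd.to_datetime(df2["Date"])
df2 = df2.set_index("Date")

def record_value(frame, day, value):
    """Store value for day (overwriting an existing entry) and fill missing days with NaN."""
    frame = frame.copy()
    # Label-based assignment updates the row if the date exists, else adds it
    frame.loc[pd.Timestamp(day).normalize(), "Portfoliovalue"] = value
    # Guard against duplicates already present in the stored data
    frame = frame[~frame.index.duplicated(keep="last")].sort_index()
    return frame.asfreq("D")

out = record_value(df2, "2021-05-07", 53000.0)
out = record_value(out, "2021-05-07", 53100.0)  # second run on the same day
print(out)
```

Running the helper twice for the same day leaves a single row for that date, so asfreq has a unique index to work with.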
I need to resample a Pandas MultiIndex consisting of two levels. The inner level is a datetime index, which needs to be resampled.
import numpy as np
import pandas as pd
rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
df = pd.DataFrame(np.random.randint(0, 100, (len(rng), 2)), index=rng, columns=['sec1', 'sec2'])
df['month'] = df.index.month
df.set_index(['month', rng], inplace=True)
print(df)
# At this point I need to apply resample. How do I specify the level that I would like to resample?
df = df.resample('M').last() # does not work
# I'm looking for something like this: df = df.resample('M', level=1).last()
Try:
df.groupby('month').resample('M', level=1).last()
Output:
sec1 sec2
month date
1 2019-01-31 59 87
2 2019-02-28 70 33
3 2019-03-31 71 38
4 2019-04-30 56 79
Details:
First, group the dataframe on 'month', or level=0 of the index.
Next, use resample with the level parameter for the MultiIndex.
The level parameter accepts either a str, the index level name such as 'date' in this case, or the level number.
Lastly, chain an aggregation function such as last.
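The same grouping can also be written as a single groupby with pd.Grouper objects, one per index level. This is a sketch rather than the answer's method: it uses deterministic stand-in values instead of random ones, and it passes a pd.offsets.MonthEnd() object as the frequency (an assumption made here to avoid depending on a particular month-end alias string).

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2019-01-01', '2019-04-27', freq='B', name='date')
# Deterministic stand-in values instead of np.random.randint
df = pd.DataFrame(np.arange(2 * len(rng)).reshape(len(rng), 2),
                  index=rng, columns=['sec1', 'sec2'])
df['month'] = df.index.month
df = df.set_index(['month', rng])

# Group by the 'month' level as-is and by month-end bins of the 'date' level
monthly = df.groupby([
    pd.Grouper(level='month'),
    pd.Grouper(level='date', freq=pd.offsets.MonthEnd()),
]).last()
print(monthly)
```

The result has one row per month, labelled with the month-end date, holding the last available business-day values.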