I would like to group by id, roll over windows of 7 dates, and apply a simple OLS regression to each id and each window.
Here's my function:
import pandas as pd
import statsmodels.api as sm

endo_var = 'monthly_gross_ret'
exogenous_vars = ['bcmk_ret']

def get_r2(data_it):
    # regress the monthly return on the benchmark return (plus a constant)
    exog = sm.add_constant(data_it[exogenous_vars], prepend=False)
    endog = data_it[endo_var]
    mod = sm.OLS(endog, exog)
    res = mod.fit()
    return res.rsquared
The fabricated sample data is:
ids = ['A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B']
dates = [1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8]
bcmk_rets = [4,5,6,3,7,8,3,5,1,5,8,4,7,3,7,3]
monthly_gross_rets = [8,4,7,2,5,7,2,6,8,3,6,6,2,7,3,2]
df = pd.DataFrame({'id':ids,'date':dates,'bcmk_ret':bcmk_rets,'monthly_gross_ret':monthly_gross_rets})
I would like to group by id with a rolling window of 7 dates and then apply the function get_r2, something like this:
df.groupby('id').rolling(7).apply(get_r2)
However, this raises: KeyError: "None of [Index(['bcmk_ret'], dtype='object')] are in the [index]"
I know how to do this with a for loop:
record_r2 = pd.DataFrame()
count = 0
for i in df['id'].unique():
    data_i = df[df['id'] == i]
    for t in data_i['date'].unique():
        data_it = data_i[(data_i['date'] >= t) & (data_i['date'] <= t + 6)]
        if len(data_it) == 7:  # should have put 24, but that gives literally ZERO results because the data quality is rather poor (we have lots of missing months; e.g. id == 6679 has May to Nov, is missing Dec and Jan, then has Feb to Dec...)
            exog = sm.add_constant(data_it[exogenous_vars], prepend=False)
            endog = data_it[endo_var]
            mod = sm.OLS(endog, exog)
            res = mod.fit()
            record_r2.loc[count, 'id'] = i
            record_r2.loc[count, 'date'] = t
            record_r2.loc[count, 'R2'] = res.rsquared
            record_r2.loc[count, 'adj_R2'] = res.rsquared_adj  # the adjusted R^2 shown in the output below
            count = count + 1
The for loop returns the desired dataframe recording each group's R^2 and adjusted R^2:
{'id': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'date': {0: 1.0, 1: 2.0, 2: 1.0, 3: 2.0},
'R2': {0: 0.3512152777777777,
1: 0.7200000000000001,
2: 0.4736842105263158,
3: 0.026894573272770783},
'adj_R2': {0: 0.2214583333333333,
1: 0.6640000000000001,
2: 0.368421052631579,
3: -0.16772651207267497}}
But the for loop is computationally inefficient and takes ages to run on my full data sample. That's why I would love to use the groupby method instead of the for loop. Could someone help out?
Thank you very much!
Best,
Darcy
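The KeyError above happens because rolling(...).apply() hands the function one column at a time as a Series, so data_it[exogenous_vars] has nothing to select from. A minimal sketch of a faster alternative, assuming statsmodels >= 0.11 (which ships RollingOLS); note it windows by row count rather than by calendar date, so it matches the loop above only where the dates are contiguous:
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

def rolling_r2(g):
    # regress one id's returns on the benchmark over a 7-row rolling window
    exog = sm.add_constant(g[exogenous_vars], prepend=False)
    return RollingOLS(g[endo_var], exog, window=7).fit().rsquared

r2 = df.sort_values(['id', 'date']).groupby('id', group_keys=False).apply(rolling_r2)
The first six rows of each group come back as NaN, since no full window is available there yet.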
I have a column in my Dataframe that has values that look like this (I want to sort my Dataframe by this column):
Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R']
Basically all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit as the highest priority and Char_2 as the second-highest priority, and with that in mind, sort my whole Dataframe by this column.
I thought I could do that with sorted(), using this list-sorting function as the key:
import re

def alpha_numeric_sort_key(unsorted_list):
    return int("".join(re.findall(r"\d*", unsorted_list)))
This works for lists. But when I tried the same thing on my Dataframe:
df = raw_df.sort_values(by='Mutation', key=alpha_numeric_sort_key, ignore_index=True)  # sorts values by one-letter amino acid code
I got this error:
TypeError: expected string or bytes-like object
I just need to understand the right way to use key= in df.sort_values(), explained in a way that's understandable to someone with an intermediate level of experience using Python.
I also provided the head of what my Dataframe looks like, in case it is helpful for answering my question. If not, ignore it.
Thanks!
raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )
If you look at the definition of the parameter key in sort_values, it clearly says:
It should expect a Series and return a Series with the same shape as
the input. It will be applied to each column in by independently.
In other words, the key callable receives a whole Series (each column listed in by), not a single scalar value, which is why a key written for scalars raises the TypeError above.
You can do sorting in two ways:
First way: chain two stable sorts, applying the secondary key (the trailing characters) first and the primary key (the digits, cast to int so they sort numerically rather than lexicographically) second:
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)
raw_df.sort_values(by="Mutation", key=sort_char_key, kind="mergesort").sort_values(
    by="Mutation", key=sort_int_key, kind="mergesort"
)
Second way: assign the extracted values as temporary columns and sort on them via the by parameter:
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"]).drop(columns=["sort_int", "sort_char"])
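With the four-row raw_df above, the digit ties at 100 for every row, so Char_2 breaks the tie and either way should order the Mutation column F100D, F100G, F100M, F100S.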
I have code that checks several columns for all the dates that are >= "2022-03-01" and <= "2024-12-31", then appends them to a list ext.
What I would like is to be able to extract more of the information located on the same row.
My code:
import pandas as pd

data = pd.read_csv("Book1.csv")
# converting column data to lists
D_EXT_1 = data['D_EXT_1'].tolist()
D_INT_1 = data['D_INT_1'].tolist()
D_EXT_2 = data['D_EXT_2'].tolist()
D_INT_2 = data['D_INT_2'].tolist()
D_EXT_3 = data['D_EXT_3'].tolist()
D_INT_3 = data['D_INT_3'].tolist()
D_EXT_4 = data['D_EXT_4'].tolist()
D_INT_4 = data['D_INT_4'].tolist()
D_EXT_5 = data['D_EXT_5'].tolist()
D_INT_5 = data['D_INT_5'].tolist()
D_EXT_6 = data['D_EXT_6'].tolist()
D_INT_6 = data['D_INT_6'].tolist()
ext = [i for i in D_INT_1 + D_INT_2 + D_INT_3 + D_INT_4 + D_INT_5 + D_INT_6 if "2022-03-01" <= i <= "2024-12-31"]
print(*ext, sep="\n")
Example of data:
NAME,ADRESS,D_INT_1,D_EXT_1,D_INT_2,D_EXT_2
ALEX,h4n1p8,2020-01-01,2024-01-01,2023-02-02,2020-01-01
What my code will print with that data:
2024-01-01
DESIRED OUTPUT:
Alex, 2024-01-01
As requested by not_speshal
-> data.head().to_dict()
{'EMPL. NO': {0: 5}, "NOM A L'EMPLACEMENT": {0: 'C010 - HOPITAL REGIONAL DE RIMOUSKI/CENTRE SERVEUR OPTILAB'}, 'ADRESSE': {0: '150 AVENUE ROULEAU'}, 'VILLE': {0: 'RIMOUSKI'}, 'PROV': {0: 'QC'}, 'OBJET NO': {0: 67}, "EMPLACEMENT DE L'APPAREIL": {0: 'CHAUFFERIE'}, 'RBQ 2018': {0: nan}, "DESCRIPTION DE L'APPAREIL": {0: 'CHAUDIERE AQUA. A VAPEUR'}, 'MANUFACTURIER': {0: 'MIURA'}, 'DIMENSIONS': {0: nan}, 'MAWP': {0: 170}, 'SVP': {0: 150}, 'DERNIERE INSP. EXT.': {0: '2019-05-29'}, 'FREQ. EXT.': {0: 12}, 'DERNIERE INSP. INT.': {0: '2020-06-03'}, 'FREQ. INT.': {0: 12}, 'D_EXT_1': {0: '2020-05-29'}, 'D_INT_1': {0: '2021-06-03'}, 'D_EXT_2': {0: '2021-05-29'}, 'D_INT_2': {0: '2022-06-03'}, 'D_EXT_3': {0: '2022-05-29'}, 'D_INT_3': {0: '2023-06-03'}, 'D_EXT_4': {0: '2023-05-29'}, 'D_INT_4': {0: '2024-06-03'}, 'D_EXT_5': {0: '2024-05-29'}, 'D_INT_5': {0: '2025-06-03'}, 'D_EXT_6': {0: '2025-05-29'}, 'D_INT_6': {0: '2026-06-03'}}
Start with:
import pandas as pd

cols = [prefix + str(i) for prefix in ['D_EXT_', 'D_INT_'] for i in range(1, 7)]
data = pd.read_csv("Book1.csv")
for col in cols:
    data.loc[:, col] = pd.to_datetime(data.loc[:, col])
Then use:
ext = data[
    (
        data.loc[:, cols].ge(pd.to_datetime("2022-03-01"))
        & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
    ).any(axis=1)
]
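Here .ge/.le compare every date column against the bounds elementwise, producing a boolean frame, and .any(axis=1) keeps a row if at least one of its dates falls in the range.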
EDIT: while it's not clear which date you want when several fall in the required range, to get what (I understand) you're requesting, use:
# assuming
import numpy as np
import pandas as pd
# and cols as above
cols = [prefix + str(i) for prefix in ['D_EXT_', 'D_INT_'] for i in range(1, 7)]
ext = data[
    np.concatenate(
        (
            np.setdiff1d(data.columns, cols),
            np.array(
                (
                    data.loc[:, cols].ge(pd.to_datetime("2022-03-01"))
                    & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
                ).idxmax(axis=1)
            ),
        ),
        axis=None,
    )
]
where the idxmax(axis=1) part names, for each row, the first column whose date falls in the range.
IIUC, try:
columns = ['D_EXT_1', 'D_EXT_2', 'D_EXT_3', 'D_EXT_4', 'D_EXT_5', 'D_EXT_6', 'D_INT_1', 'D_INT_2', 'D_INT_3', 'D_INT_4', 'D_INT_5', 'D_INT_6']
data[columns] = data[columns].apply(pd.to_datetime)
output = data[((data[columns] >= "2022-03-01") & (data[columns] <= "2024-12-31")).any(axis=1)]
This will return all the rows where any date in the columns list is between 2022-03-01 and 2024-12-31.
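To also recover which date matched alongside the row's NAME (the desired "Alex, 2024-01-01" output), a rough sketch, assuming a NAME column as in the sample data:
mask = (data[columns] >= "2022-03-01") & (data[columns] <= "2024-12-31")
for row, col in zip(*mask.to_numpy().nonzero()):
    print(data["NAME"].iloc[row], data[columns].iloc[row, col].date(), sep=", ")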
It seems that you want to get only rows where at least one of the dates is in the range ["2022-03-01", "2024-12-31"], correct?
First, convert all the date columns to datetime, using DataFrame.apply + pandas.to_datetime.
import pandas as pd
date_cols = ['D_EXT_1', 'D_EXT_2', 'D_EXT_3', 'D_EXT_4', 'D_EXT_5', 'D_EXT_6', 'D_INT_1', 'D_INT_2', 'D_INT_3', 'D_INT_4', 'D_INT_5', 'D_INT_6']
data[date_cols] = data[date_cols].apply(pd.to_datetime)
Then create a 2D boolean mask of all the dates that are in the desired range
is_between_dates = (data[date_cols] >= "2022-03-01") & (data[date_cols] <= "2024-12-31")
# print(is_between_dates) to clearly understand what it represents
Finally, select the rows that contain at least one True value, meaning that there is at least one date in that row that belongs to the date range. This can be achieved using DataFrame.any with axis=1 on the 2D boolean mask, is_between_dates.
# again, print(is_between_dates.any(axis=1)) to see
data = data[is_between_dates.any(axis=1)]
Use melt to reformat your dataframe to be easily searchable:
df = pd.read_csv('Book1.csv').melt(['NAME', 'ADRESS']) \
.astype({'value': 'datetime64[ns]'}) \
.query("'2022-03-01' <= value & value <= '2024-12-31'")
At this point your dataframe looks like:
>>> df
NAME ADRESS variable value
1 ALEX h4n1p8 D_EXT_1 2024-01-01
2 ALEX h4n1p8 D_INT_2 2023-02-02
Now it's easy to get a NAME for a date:
>>> df.loc[df['value'] == '2024-01-01', 'NAME']
1 ALEX
Name: NAME, dtype: object
# OR
>>> df.loc[df['value'] == '2024-01-01', 'NAME'].tolist()
['ALEX']
I am trying to select the values from the top 3 records of each group in a sorted pandas dataframe and put them into new columns. I have a function that processes each group, but I am having difficulty finding the right method to extract and rename the series, then combine the results into a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
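For reference, the expected df_out (as the working function below produces it) looks roughly like:
  Product  Price_1  Qty_1  Price_2  Qty_2  Price_3  Qty_3
0       A     25.0   15.0     30.5   13.0     50.0   14.0
1       B    120.0    5.0      0.0    0.0      0.0    0.0
2       C    650.0    2.0    680.0    1.0      0.0    0.0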
I am reproducing below two examples of the functions I've tested, hoping to find a more efficient option that works, especially as I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column or variable, which becomes an issue as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
My second attempt tries to reduce the code and avoid explicit names for each variable, but it is incomplete and not working:
def best3_prices_v2(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get record values for the best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in = (df_in.set_index([df_in.groupby('Product').cumcount().add(1), 'Product'])
              .unstack(0, fill_value=0))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
OR via pivot() (using keyword arguments, which current pandas requires):
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              .pivot(index='Product', columns='key', values=['Price', 'Qty'])
              .fillna(0, downcast='infer'))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
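Either route should leave df_in looking roughly like this (Product A has four records, hence the _4 columns; to keep only the best three, take groupby('Product').head(3) first, as done below):
  Product  Price_1  Price_2  Price_3  Price_4  Qty_1  Qty_2  Qty_3  Qty_4
0       A     25.0     30.5     50.0     61.5     15     13     14     10
1       B    120.0      0.0      0.0      0.0      5      0      0      0
2       C    650.0    680.0      0.0      0.0      2      1      0      0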
Based on #AnuragDabas's pivot solution and #ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])
group_list = ['Product', 'Model']
stats_list = ['Price', 'Qty', 'Ratings']
# keep at most the best 3 records per group
df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                .pivot(index=group_list, columns='key', values=stats_list)
                .fillna(0, downcast='infer'))
df_out.columns = [f"{x}_{y}" for x, y in df_out]
df_out = df_out.reset_index()
I would like to mask my dataframe conditional on multiple columns inside a loop. I am trying to do something like this:
dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
for i in range(4):
    items = [val_dict[i] for i in range(i+1)]
    df_ = df[(df['0'] == items[0]) & (df['1'] == items[1]) & ... ]
    dfs.append(df_)
Please note that the second condition I wrote above would not exist for the first iteration of the loop because there would be no items[1] element.
Here is a sample dataframe you are welcome to test on:
df = pd.DataFrame({'0': ['a']*3 + ['b']*3 + ['c']*3,
'1': ['a']*3 + ['b']*6,
'2': ['b']*4 + ['c']*5,
'3': ['c']*5 + ['d']*4})
The only solution I have come up with uses eval which I would like very much to avoid.
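For what it's worth, a variable number of conditions can also be combined without eval by reducing a list of boolean Series; a minimal sketch against the sample df above:
import numpy as np

dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
for i in range(4):
    # one boolean Series per column 0..i, AND-ed together
    conds = [df[str(j)] == val_dict[j] for j in range(i + 1)]
    dfs.append(df[np.logical_and.reduce(conds)])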
If you subset your DataFrame to include only the columns you want to use for comparison (as you have done in your example) and the keys in your val_dict are the same as the columns you want to compare, then you can get Pandas to do this for you.
Making a slight modification to your df
df = pd.DataFrame({0: ['a']*3 + ['b']*3 + ['c']*3,
1: ['a']*3 + ['d']*6,
2: ['b']*4 + ['c']*5,
3: ['c']*5 + ['a']*4})
You can now accomplish what you want with the following:
dfs = []
val_dict = {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
val_series = pd.Series(val_dict)
for i in range(4):
    # rows where every column matches its target value
    # (note the mask does not depend on i here; see the EDIT below)
    mask = (df == val_series).all(axis=1)
    dfs.append(df[mask])
EDIT
I am leaving my original solution even though it addresses a different problem than the OP intended to solve. The intended problem can be solved by the following:
mask = True
for key in range(4):
    # accumulate conditions one column at a time; after the first &=,
    # mask becomes a boolean Series
    mask &= df[key] == val_dict[key]
    dfs.append(df[mask])
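With this version, dfs[i] holds the rows whose first i+1 columns match val_dict.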
Again, this is using the modified df used earlier in my original answer.
I'll share my eval solution.
for i in range(4):
    items = [val_dict[i] for i in range(i+1)]
    df_ = eval('df[(' + ') & ('.join(['df["'+str(j)+'"] == items['+str(j)+']' for j in range(i+1)]) + ')]')
    dfs.append(df_)
It works... but so ugly :(
I have a pandas df that contains 2 columns 'start' and 'end' (both are integers). I would like an efficient method to search for rows such that the range that is represented by the row [start,end] contains a specific value.
Two additional notes:
It is possible to assume that ranges don't overlap
The solution should support a batch mode: given a list of inputs, the output will be a mapping (a dictionary or whatever) from each input value to the index of the row whose range contains it.
For example:
start end
0 7216 7342
1 7343 7343
2 7344 7471
3 7472 8239
4 8240 8495
and the query of
[7215,7217,7344]
will result in
{7217: 0, 7344: 2}
Thanks!
Brute force solution, could use lots of improvements though.
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
search = [7215, 7217, 7344]
res = {}
for i in search:
    # a row matches when its [start, end] range contains i
    mask = (df.start <= i) & (df.end >= i)
    idx = df[mask].index.values
    if len(idx):
        res[i] = idx[0]
print(res)
Yields
{7344: 2, 7217: 0}
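Since the ranges are guaranteed not to overlap, a pandas IntervalIndex can also vectorize the whole batch lookup; a minimal sketch (an alternative to the loop above, assuming a pandas version with IntervalIndex.get_indexer):
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
query = [7215, 7217, 7344]

# one closed interval per row; get_indexer returns -1 for values
# that fall in no interval
intervals = pd.IntervalIndex.from_arrays(df['start'], df['end'], closed='both')
positions = intervals.get_indexer(query)
result = {q: df.index[p] for q, p in zip(query, positions) if p != -1}
# result: {7217: 0, 7344: 2}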
Selected solution
This new solution could have better performance. But there is a limitation: it will only work if there is no gap between ranges, as in the example provided.
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start', 'end'])
query = [7215, 7217, 7344]
# Reshaping the original DataFrame: one entry per boundary value,
# keeping the original row number in the 'index' column
df = df.reset_index()
df = pd.concat([df['start'], df['end']]).reset_index()
df = df.set_index(0).sort_index()
# Creating a DataFrame with a continuous index
max_range = max(df.index) + 1
min_range = min(df.index)
s = pd.DataFrame(index=range(min_range, max_range))
# Joining them
s = s.join(df)
# Filling the gaps: each value now maps to the row of the nearest boundary at or above it
s = s.fillna(method='backfill')
# Then a simple selection gives the result
s.loc[query, :].dropna().to_dict()['index']
# Result
{7217: 0.0, 7344: 2.0}
Previous proposal
import numpy as np
import pandas as pd

# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start', 'end'])
# Constructing a DataFrame containing the query numbers
query = [7215, 7217, 7344]
result = pd.DataFrame(np.tile(query, (len(df), 1)), columns=query)
# Merging the data and the query
df = pd.concat([df, result], axis=1)
# Making the test
df = df.apply(lambda x: (x >= x['start']) & (x <= x['end']), axis=1).loc[:, query]
# Keeping only values found
df = df[df == True]
df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)
# Extracting to the output format
result = df.to_dict('split')
result = dict(zip(result['columns'], result['index']))
# The result
{7217: 0, 7344: 2}