I have code that checks different columns for all the dates that are >= "2022-03-01" and <= "2024-12-31", then appends them to a list ext=[].
What I would like is to also be able to extract more of the information located on the same row.
My code:
from pandas import *
data = read_csv("Book1.csv")
# converting column data to list
D_EXT_1 = data['D_EXT_1'].tolist()
D_INT_1 = data['D_INT_1'].tolist()
D_EXT_2 = data['D_EXT_2'].tolist()
D_INT_2 = data['D_INT_2'].tolist()
D_EXT_3 = data['D_EXT_3'].tolist()
D_INT_3 = data['D_INT_3'].tolist()
D_EXT_4 = data['D_EXT_4'].tolist()
D_INT_4 = data['D_INT_4'].tolist()
D_EXT_5 = data['D_EXT_5'].tolist()
D_INT_5 = data['D_INT_5'].tolist()
D_EXT_6 = data['D_EXT_6'].tolist()
D_INT_6 = data['D_INT_6'].tolist()
ext = []
ext = [i for i in D_INT_1 + D_INT_2 + D_INT_3 + D_INT_4 + D_INT_5 + D_INT_6 if i >= "2022-03-01" and i <= "2024-12-31"]
print(*ext, sep="\n")
Example of data:
NAME,ADRESS,D_INT_1,D_EXT_1,D_INT_2,D_EXT_2
ALEX,h4n1p8,2020-01-01,2024-01-01,2023-02-02,2020-01-01
What my code will print with that data:
2024-01-01
DESIRED OUTPUT:
Alex, 2024-01-01
As requested by not_speshal
-> data.head().to_dict()
{'EMPL. NO': {0: 5}, "NOM A L'EMPLACEMENT": {0: 'C010 - HOPITAL REGIONAL DE RIMOUSKI/CENTRE SERVEUR OPTILAB'}, 'ADRESSE': {0: '150 AVENUE ROULEAU'}, 'VILLE': {0: 'RIMOUSKI'}, 'PROV': {0: 'QC'}, 'OBJET NO': {0: 67}, "EMPLACEMENT DE L'APPAREIL": {0: 'CHAUFFERIE'}, 'RBQ 2018': {0: nan}, "DESCRIPTION DE L'APPAREIL": {0: 'CHAUDIERE AQUA. A VAPEUR'}, 'MANUFACTURIER': {0: 'MIURA'}, 'DIMENSIONS': {0: nan}, 'MAWP': {0: 170}, 'SVP': {0: 150}, 'DERNIERE INSP. EXT.': {0: '2019-05-29'}, 'FREQ. EXT.': {0: 12}, 'DERNIERE INSP. INT.': {0: '2020-06-03'}, 'FREQ. INT.': {0: 12}, 'D_EXT_1': {0: '2020-05-29'}, 'D_INT_1': {0: '2021-06-03'}, 'D_EXT_2': {0: '2021-05-29'}, 'D_INT_2': {0: '2022-06-03'}, 'D_EXT_3': {0: '2022-05-29'}, 'D_INT_3': {0: '2023-06-03'}, 'D_EXT_4': {0: '2023-05-29'}, 'D_INT_4': {0: '2024-06-03'}, 'D_EXT_5': {0: '2024-05-29'}, 'D_INT_5': {0: '2025-06-03'}, 'D_EXT_6': {0: '2025-05-29'}, 'D_INT_6': {0: '2026-06-03'}}
Start with
import pandas as pd
cols = [prefix + str(i) for prefix in ['D_EXT_','D_INT_'] for i in range(1,7)]
data = pd.read_csv("Book1.csv")
for col in cols:
    data.loc[:, col] = pd.to_datetime(data.loc[:, col])
Then use
ext = data[
    (
        data.loc[:, cols].ge(pd.to_datetime("2022-03-01"))
        & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
    ).any(axis=1)
]
EDIT: while it's not clear what date you want if multiple are in the required range, to get what (I understand) you're requesting, use
# assuming
import numpy as np
import pandas as pd
# and
cols = [prefix + str(i) for prefix in ['D_EXT_','D_INT_'] for i in range(1,7)]
ext = data[
    np.concatenate(
        (
            np.setdiff1d(data.columns, cols),
            np.array(
                (
                    data.loc[:, cols].ge(pd.to_datetime("2022-03-01"))
                    & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
                ).idxmax(axis=1)
            )
        ),
        axis=None
    )]
where cols is as above
IIUC, try:
columns = ['D_EXT_1', 'D_EXT_2', 'D_EXT_3', 'D_EXT_4', 'D_EXT_5', 'D_EXT_6', 'D_INT_1', 'D_INT_2', 'D_INT_3', 'D_INT_4', 'D_INT_5', 'D_INT_6']
data[columns] = data[columns].apply(pd.to_datetime)
output = data[((data[columns]>="2022-03-01")&(data[columns]<="2024-12-31")).any(axis=1)]
This will return all the rows where any date in the columns list is between 2022-03-01 and 2024-12-31.
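If you then want name/date pairs like the desired "Alex, 2024-01-01" output, here is a minimal sketch on top of the above (assuming a NAME column as in the sample data and the columns list already converted to datetime):
# Keep only the dates in range, then pair each one with its row's NAME.
mask = (data[columns] >= "2022-03-01") & (data[columns] <= "2024-12-31")
matches = data[columns].where(mask).stack()  # Series indexed by (row, column)
for (row, col), date in matches.items():
    print(data.loc[row, "NAME"], date.date(), sep=", ")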
It seems that you want to get only rows where at least one of the dates is in the range ["2022-03-01", "2024-12-31"], correct?
First, convert all the date columns to datetime, using DataFrame.apply + pandas.to_datetime.
import pandas as pd
date_cols = ['D_EXT_1', 'D_EXT_2', 'D_EXT_3', 'D_EXT_4', 'D_EXT_5', 'D_EXT_6', 'D_INT_1', 'D_INT_2', 'D_INT_3', 'D_INT_4', 'D_INT_5', 'D_INT_6']
data[date_cols] = data[date_cols].apply(pd.to_datetime)
Then create a 2D boolean mask of all the dates that are in the desired range
is_between_dates = (data[date_cols] >= "2022-03-01") & (data[date_cols] <= "2024-12-31")
# print(is_between_dates) to clearly understand what it represents
Finally, select the rows that contain at least one True value, meaning that there is at least one date in that row that belongs to the date range. This can be achieved using DataFrame.any with axis=1 on the 2D boolean mask, is_between_dates.
# again, print(is_between_dates.any(axis=1)) to see
data = data[is_between_dates.any(axis=1)]
Use melt to reformat your dataframe to be easily searchable:
df = pd.read_csv('Book1.csv').melt(['NAME', 'ADRESS']) \
.astype({'value': 'datetime64[ns]'}) \
.query("'2022-03-01' <= value & value <= '2024-12-31'")
At this point your dataframe looks like:
>>> df
NAME ADRESS variable value
1 ALEX h4n1p8 D_EXT_1 2024-01-01
2 ALEX h4n1p8 D_INT_2 2023-02-02
Now it's easy to get a NAME for a date:
>>> df.loc[df['value'] == '2024-01-01', 'NAME']
1 ALEX
Name: NAME, dtype: object
# OR
>>> df.loc[df['value'] == '2024-01-01', 'NAME'].tolist()
['ALEX']
I would like to groupby id, rolling 7 dates, and apply a simple OLS regression to each id and each period.
Here's my function:
import statsmodels.api as sm

endo_var = 'monthly_gross_ret'
exogenous_vars = ['bcmk_ret']

def get_r2(data_it):
    data_it.exog = sm.add_constant(data_it[exogenous_vars], prepend=False)
    data_it.endog = data_it[endo_var]
    mod = sm.OLS(data_it.endog, data_it.exog)
    res = mod.fit()
    return res.rsquared
The fabricated sample data is:
ids = ['A','A','A','A','A','A','A','A','B','B','B','B','B','B','B','B']
dates = [1,2,3,4,5,6,7,8,1,2,3,4,5,6,7,8]
bcmk_rets = [4,5,6,3,7,8,3,5,1,5,8,4,7,3,7,3]
monthly_gross_rets = [8,4,7,2,5,7,2,6,8,3,6,6,2,7,3,2]
df = pd.DataFrame({'id':ids,'date':dates,'bcmk_ret':bcmk_rets,'monthly_gross_ret':monthly_gross_rets})
I would like to group by id and roll over 7 dates, then apply the function get_r2, something like this:
data.groupby('id').rolling(7).apply(get_r2)
However, the error says: KeyError: "None of [Index(['bcmk_ret'], dtype='object')] are in the [index]"
I know how to do this with a for loop:
record_r2 = pd.DataFrame()
count = 0
for i in data['id'].unique():
    data_i = data[(data['id']==i)]
    for t in data_i['date'].unique():
        data_it = data_i[(data_i['date']>=t) & (data_i['date']<=t + 6)]
        if len(data_it) == 7:  # should have put 24, but there's literally ZERO result, because the data quality is rather poor. (we have lots of missing months. For e.g., id == 6679 has May to Nov, then missing Dec and Jan, then have Feb to Dec...)
            data_it.exog = sm.add_constant(data_it[exogenous_vars], prepend=False)
            data_it.endog = data_it[endo_var]
            mod = sm.OLS(data_it.endog, data_it.exog)
            res = mod.fit()
            record_r2.loc[count,'id'] = i
            record_r2.loc[count,'date'] = t
            record_r2.loc[count,'R2'] = res.rsquared
            count = count + 1
The for loop returns the desired dataframe that records the R^2 of each group's regression results:
{'id': {0: 'A', 1: 'A', 2: 'B', 3: 'B'},
'date': {0: 1.0, 1: 2.0, 2: 1.0, 3: 2.0},
'R2': {0: 0.3512152777777777,
1: 0.7200000000000001,
2: 0.4736842105263158,
3: 0.026894573272770783},
'adj_R2': {0: 0.2214583333333333,
1: 0.6640000000000001,
2: 0.368421052631579,
3: -0.16772651207267497}}
But the for loop is computationally inefficient and takes ages to run on my full data sample. That's why I would love to use the groupby method instead of the for loop. Could someone help out?
Thank you very much!
Best,
Darcy
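One possible direction (a sketch, not a verified drop-in solution): rolling().apply() passes the function a single column at a time, which is why 'bcmk_ret' cannot be found inside get_r2. A workaround is to group by id and slide the 7-row window manually inside each group, using the column names from the fabricated sample:
import pandas as pd
import statsmodels.api as sm

def rolling_r2(group, window=7):
    # R^2 of an OLS fit for every consecutive block of `window` rows within one id.
    out = []
    for start in range(len(group) - window + 1):
        chunk = group.iloc[start:start + window]
        exog = sm.add_constant(chunk[['bcmk_ret']], prepend=False)
        res = sm.OLS(chunk['monthly_gross_ret'], exog).fit()
        out.append({'date': chunk['date'].iloc[0], 'R2': res.rsquared})
    return pd.DataFrame(out)

record_r2 = (df.groupby('id', group_keys=True)
               .apply(rolling_r2)
               .reset_index(level=0)
               .reset_index(drop=True))
On the sample data this computes the same two 7-row windows per id (starting at dates 1 and 2) as the for loop; it still loops over window positions, but it avoids repeatedly slicing the full frame and writing rows one by one with .loc.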
I have referred to this post but cannot get it to run for my particular case. I have two dataframes:
import pandas as pd
df1 = pd.DataFrame(
    {
        "ein": {0: 1001, 1: 1500, 2: 3000},
        "ein_name": {0: "H for Humanity", 1: "Labor Union", 2: "Something something"},
        "lname": {0: "Cooper", 1: "Cruise", 2: "Pitt"},
        "fname": {0: "Bradley", 1: "Thomas", 2: "Brad"},
    }
)
df2 = pd.DataFrame(
    {
        "lname": {0: "Couper", 1: "Cruise", 2: "Pit"},
        "fname": {0: "Brad", 1: "Tom", 2: "Brad"},
        "score": {0: 3, 1: 3.5, 2: 4},
    }
)
Then I do:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product
N = 60
names = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["lname"].tolist(), df2["lname"].tolist())
}
s1 = pd.Series(names)
s1 = s1[s1 > N]
s1 = s1[s1.groupby(level=0).idxmax()]
degrees = {
    tup: fuzz.ratio(*tup)
    for tup in product(df1["fname"].tolist(), df2["fname"].tolist())
}
s2 = pd.Series(degrees)
s2 = s2[s2 > N]
s2 = s2[s2.groupby(level=0).idxmax()]
df2["lname"] = df2["lname"].map(s1).fillna(df2["lname"])
df2["fname"] = df2["fname"].map(s2).fillna(df2["fname"])
df = df1.merge(df2, on=["lname", "fname"], how="outer")
The result is not what I expect. Can you help me edit this code, please? Note that I have millions of rows in df1 and millions in df2, so I need some efficiency as well.
Basically, I need to match people from df1 to people in df2. In the above example, I am matching them on last name (lname) and first name (fname). I also have a third field, which I leave out here for parsimony.
The expected result should look like:
ein ein_name lname fname score
0 1001 H for Humanity Cooper Bradley 3
1 1500 Labor Union Cruise Thomas 3.5
2 3000 Something something Pitt Brad 4
You could try this:
from functools import cache
import pandas as pd
from fuzzywuzzy import fuzz
# First, define indices and values to check for matches
indices_and_values = [(i, value) for i, value in enumerate(df2["lname"] + df2["fname"])]
# Define helper functions to find matching rows and get corresponding score
@cache
def find_match(x):
    return [i for i, value in indices_and_values if fuzz.ratio(x, value) > 75]

def get_score(x):
    try:
        return df2.loc[x[0], "score"]
    except (KeyError, IndexError):
        return pd.NA
# Add scores to df1:
df1["score"] = (
(df1["lname"] + df1["fname"])
.apply(find_match)
.apply(get_score)
)
And then:
print(df1)
ein ein_name lname fname score
0 1001 H for Humanity Cooper Bradley 3.0
1 1500 Labor Union Cruise Thomas 3.5
2 3000 Something something Pitt Brad 4.0
Given the size of your dataframes, I suppose you have namesakes (identical first and last names), hence the use of the @cache decorator from the Python standard library to try to speed things up (but you can do without it).
I am trying to select the values from the top 3 records of each group in a Python sorted dataframe and put them into new columns. I have a function that processes each group, but I am having difficulty finding the right method to extract and rename the series, then combine the results into a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
I am reproducing below 2 examples of the functions I've tested, and I am trying to get a more efficient option that works, especially if I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column or variable, which is especially an issue as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
Second attempt to reduce the code and the explicit names for each variable ... not complete and not working.
def best3_prices_v2(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get records values for best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in = (df_in.set_index([df_in.groupby('Product').cumcount().add(1), 'Product'])
              .unstack(0, fill_value=0))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
OR via pivot()
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              .pivot('Product', 'key', ['Price', 'Qty'])
              .fillna(0, downcast='infer'))
df_in.columns = [f"{x}_{y}" for x, y in df_in]
df_in = df_in.reset_index()
Based on @AnuragDabas's pivot solution and @ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])
group_list = ['Product', 'Model']
stats_list = ['Price','Qty', 'Ratings']
df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                .pivot(group_list, 'key', stats_list)
                .fillna(0, downcast='infer'))
df_out.columns = [f"{x}_{y}" for x, y in df_out]
df_out = df_out.reset_index()
I have a list of dictionaries
items = [{'name': 'Fruit', 'title': 'Apple', 'id': '1'},
         {'name': 'Fruit', 'title': 'Banana', 'id': '1'},
         {'name': 'Vegetable', 'title': 'Tomato', 'id': '2'},
         {'name': 'Vegetable', 'title': 'Onion', 'id': '2'}]
and I have a dataframe
df = pd.DataFrame({'Name': {0: 'Banana', 1: 'Apple', 2: 'Orange', 3: 'Tomato', 4: 'Onion'}, 'Price': {0: 25, 1: 20, 2: 10, 3: 26, 4: 45}, 'Kg': {0: 1, 1: 25, 2: 3, 3: 55, 4: 10}})
Now I need to map the title into the next column; if the value does not exist in the dict, it can be empty.
Expected output:
What I have tried is
df["Title"] = df["Name"].map(lambda x : for i in items[name] == x)
Adding a new pandas column with mapped value from a dictionary
But it's not working, as items is a list of dictionaries.
You can flatten your list into a single dict and use that to map.
df['Title'] = df['Name'].map({i['title']:i['name'] for i in items})
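For reference, an illustrative check using the items and df above: the flattened dict maps each title to its name, so 'Orange', which is absent from items, ends up as NaN in the new column.
mapping = {i['title']: i['name'] for i in items}
print(mapping)
# {'Apple': 'Fruit', 'Banana': 'Fruit', 'Tomato': 'Vegetable', 'Onion': 'Vegetable'}
print(df['Name'].map(mapping).tolist())
# ['Fruit', 'Fruit', nan, 'Vegetable', 'Vegetable']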
While I can find help and documentation on how to convert a pandas DataFrame to a dictionary so that columns are keys and values are rows, I find myself stuck when I would like to have one column's values as keys and the associated values from another column as values, so that a df like this
a b
1 car
1 train
2 boot
2 computer
2 lipstick
converts to the following dictionary: {'1': ['car', 'train'], '2': ['boot', 'computer', 'lipstick']}
I have a feeling it's something pretty simple but I'm out of ideas. I tried df.groupby('a').to_dict() but was unsuccessful
Any suggestions?
You could view this as a groupby-aggregation (i.e., an operation which turns each group into one value -- in this case a list):
In [85]: df.groupby(['a'])['b'].agg(lambda grp: list(grp))
Out[85]:
a
1 [car, train]
2 [boot, computer, lipstick]
dtype: object
In [68]: df.groupby(['a'])['b'].agg(lambda grp: list(grp)).to_dict()
Out[68]: {1: ['car', 'train'], 2: ['boot', 'computer', 'lipstick']}
You can't perform a to_dict() on the result of a groupby, but you can use it to perform your own dictionary construction. The following code will work with the example you provided.
import pandas as pd
df = pd.DataFrame(dict(a=[1, 1, 2, 2, 2],
                       b=['car', 'train', 'boot', 'computer', 'lipstick']))
# Using a loop
dt = {}
for g, d in df.groupby('a'):
    dt[g] = d['b'].values
# Using dictionary comprehension
dt2 = {g: d['b'].values for g, d in df.groupby('a')}
Now both dt and dt2 will be dictionaries like this:
{1: array(['car', 'train'], dtype=object),
2: array(['boot', 'computer', 'lipstick'], dtype=object)}
Of course you can put the numpy arrays back into lists, if you so desire.
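For example, a small variation on the comprehension above that stores plain Python lists instead of numpy arrays (dt3 is just an illustrative name):
# Same comprehension, but .tolist() converts each numpy array to a list.
dt3 = {g: d['b'].tolist() for g, d in df.groupby('a')}
# {1: ['car', 'train'], 2: ['boot', 'computer', 'lipstick']}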
Yes, because DataFrameGroupBy has no to_dict attribute; only DataFrame has a to_dict attribute.
DataFrame.to_dict(outtype='dict')
Convert DataFrame to dictionary.
You can read more about DataFrame.to_dict here
Take a look at this:
import numpy as np
import pandas as pd
df = pd.DataFrame([np.random.sample(9), np.random.sample(9)])
df.columns = [c for c in 'abcdefghi']
# it will convert the DataFrame to dict, with {column -> {index -> value}}
df.to_dict()
{'a': {0: 0.53252618404947039, 1: 0.78237275521385163},
'b': {0: 0.43681232450879315, 1: 0.31356312459390356},
'c': {0: 0.84648298651737541, 1: 0.81417040486070058},
'd': {0: 0.48419015448536995, 1: 0.37578177386187273},
'e': {0: 0.39840348154035421, 1: 0.35367537180764919},
'f': {0: 0.050381560155985827, 1: 0.57080653289506755},
'g': {0: 0.96491634442628171, 1: 0.32844653606404517},
'h': {0: 0.68201236712813085, 1: 0.0097104037581828839},
'i': {0: 0.66836630467152902, 1: 0.69104505886376366}}
type(df)
pandas.core.frame.DataFrame
# DataFrame.groupby is another type
type(df.groupby('a'))
pandas.core.groupby.DataFrameGroupBy
df.groupby('a').to_dict()
AttributeError: Cannot access callable attribute 'to_dict' of 'DataFrameGroupBy' objects, try using the 'apply' method
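Following that hint, a small sketch of the apply-based route, re-using the a/b frame from the earlier answer rather than the random one above:
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 2],
                   'b': ['car', 'train', 'boot', 'computer', 'lipstick']})
# Collect column 'b' into a list per group of 'a', then convert to a dict.
result = df.groupby('a')['b'].apply(list).to_dict()
# {1: ['car', 'train'], 2: ['boot', 'computer', 'lipstick']}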