Removing similar data after grouping and sorting in Python

I have this data:
lat = [79.211, 79.212, 79.214, 79.444, 79.454, 79.455, 82.111, 82.122, 82.343, 82.231, 79.211, 79.444]
lon = [0.232, 0.232, 0.233, 0.233, 0.322, 0.323, 0.321, 0.321, 0.321, 0.411, 0.232, 0.233]
val = [2.113, 2.421, 2.1354, 1.3212, 1.452, 2.3553, 0.522, 0.521, 0.5421, 0.521, 1.321, 0.422]
df = pd.DataFrame({"lat": lat, 'lon': lon, 'value':val})
and I am grouping it by lat & lon and then sorting by the value column and taking the top 5 as shown below:
grouped = df.groupby(["lat", "lon"])
val_max = grouped['value'].max()
df_1 = pd.DataFrame(val_max)
df_1 = df_1.sort_values('value', ascending = False)[0:5]
The output I get is this:
value
lat lon
79.212 0.232 2.4210
79.455 0.323 2.3553
79.214 0.233 2.1354
79.211 0.232 2.1130
79.454 0.322 1.4520
I want to remove any row whose lat and lon are both within 1 unit of the last decimal place (i.e. 0.001) of a higher-ranked row. In the output above, row 4 is almost the same location as row 1, and row 5 is almost the same location as row 2, so rows 4 and 5 would be replaced by the next-ranked lat/lon pairs, which would make the output:
value
lat lon
79.212 0.232 2.4210
79.455 0.323 2.3553
79.214 0.233 2.1354
82.343 0.321 0.5421
82.111 0.321 0.5220
Please let me know how I can do this.

You could sort the dataframe, like this:
grouped = df.groupby(["lat", "lon"])
val_max = grouped["value"].max()
df_1 = pd.DataFrame(val_max)
df_1 = (
    df_1.sort_values("value", ascending=False)
    .reset_index()
    .sort_values(["lat", "lon"])
    .reset_index(drop=True)
)
Then, iterate over each row, compare it to the previous one, and mark the similar ones so they can be dropped:
# Mark rows whose lat and lon are both within 0.001 of the previous row
# in a new "match" column
df_1["match"] = ""
for i in range(1, df_1.shape[0]):
    if (abs(df_1.iloc[i, 0] - df_1.iloc[i - 1, 0]) <= 0.001) and (
        abs(df_1.iloc[i, 1] - df_1.iloc[i - 1, 1]) <= 0.001
    ):
        df_1.loc[i, "match"] = pd.NA
# Drop the row preceding each marked row and clean up
index = [i - 1 for i in df_1[df_1["match"].isna()].index]
df_1 = df_1.drop(index=index).drop(columns="match").reset_index(drop=True)
Which outputs:
print(df_1)
lat lon value
0 79.212 0.232 2.4210
1 79.214 0.233 2.1354
2 79.444 0.233 1.3212
3 79.455 0.323 2.3553
4 82.111 0.321 0.5220
5 82.122 0.321 0.5210
6 82.231 0.411 0.5210
7 82.343 0.321 0.5421
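As a hedged alternative sketch (not taken from the answer above): if "close" may be approximated by rounding the coordinates to 2 decimal places, a vectorized round-and-drop_duplicates approach avoids the loop entirely. Note that rounding is coarser than the strict 0.001 pairwise check, so it can merge points (e.g. 79.212 and 79.214) that the loop above keeps apart.
import pandas as pd

lat = [79.211, 79.212, 79.214, 79.444, 79.454, 79.455, 82.111, 82.122, 82.343, 82.231, 79.211, 79.444]
lon = [0.232, 0.232, 0.233, 0.233, 0.322, 0.323, 0.321, 0.321, 0.321, 0.411, 0.232, 0.233]
val = [2.113, 2.421, 2.1354, 1.3212, 1.452, 2.3553, 0.522, 0.521, 0.5421, 0.521, 1.321, 0.422]
df = pd.DataFrame({"lat": lat, "lon": lon, "value": val})

# Round coordinates, keep the highest value per rounded cell, then take the top 5
dedup = (
    df.assign(lat_r=df["lat"].round(2), lon_r=df["lon"].round(2))
    .sort_values("value", ascending=False)
    .drop_duplicates(["lat_r", "lon_r"])
    .nlargest(5, "value")[["lat", "lon", "value"]]
)
print(dedup)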

Optimal way to acquire percentiles of DataFrame rows

Problem
I have a pandas DataFrame df:
year val0 val1 val2 ... val98 val99
1983 -42.187 15.213 -32.185 12.887 -33.821
1984 39.213 -142.344 23.221 0.230 1.000
1985 -31.204 0.539 2.000 -1.000 3.442
...
2007 4.239 5.648 -15.483 3.794 -25.459
2008 6.431 0.831 -34.210 0.000 24.527
2009 -0.160 2.639 -2.196 52.628 71.291
My desired output, i.e. new_df, contains the 9 different percentiles including the median, and should have the following format:
year percentile_10 percentile_20 percentile_30 percentile_40 median percentile_60 percentile_70 percentile_80 percentile_90
1983 -40.382 -33.182 -25.483 -21.582 -14.424 -9.852 -3.852 6.247 10.528
...
2009 -3.248 0.412 6.672 10.536 12.428 20.582 46.248 52.837 78.991
Attempt
The following was my initial attempt:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_
new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()
However, instead of returning the percentiles across all columns of each row, it calculated these percentiles for each val column separately and therefore returned 1000 columns. Since each year group contains only a single value per column, every percentile of that single value is just the value itself, so all percentiles returned the same values.
I still managed to run the desired task by trying the following:
list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []
for i in range(len(df)):
    list_1.append(np.percentile(df.iloc[i, 1:], 10))
    list_2.append(np.percentile(df.iloc[i, 1:], 20))
    list_3.append(np.percentile(df.iloc[i, 1:], 30))
    list_4.append(np.percentile(df.iloc[i, 1:], 40))
    mlist.append(np.median(df.iloc[i, 1:]))
    list_6.append(np.percentile(df.iloc[i, 1:], 60))
    list_7.append(np.percentile(df.iloc[i, 1:], 70))
    list_8.append(np.percentile(df.iloc[i, 1:], 80))
    list_9.append(np.percentile(df.iloc[i, 1:], 90))
df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9
new_df= df[['year', 'percentile_10','percentile_20','percentile_30','percentile_40','median','percentile_60','percentile_70','percentile_80','percentile_90']]
But this is blatantly a laborious, manual, and one-dimensional way to achieve the task. What is the most optimal way to find the percentiles of each row across multiple columns?
You can use the .describe() function like this:
# Create a DataFrame
df = pd.DataFrame(np.random.randn(5,3))
# .apply() the .describe() function row-wise with axis=1
df.apply(pd.DataFrame.describe, axis=1)
output:
count mean std min 25% 50% 75% max
0 3.0 0.422915 1.440097 -0.940519 -0.330152 0.280215 1.104632 1.929049
1 3.0 1.615037 0.766079 0.799817 1.262538 1.725259 2.022647 2.320036
2 3.0 0.221560 0.700770 -0.585020 -0.008149 0.568721 0.624849 0.680978
3 3.0 -0.119638 0.182402 -0.274168 -0.220240 -0.166312 -0.042373 0.081565
4 3.0 -0.569942 0.807865 -1.085838 -1.035455 -0.985072 -0.311994 0.361084
If you want percentiles other than the default 0.25, 0.5, 0.75, you can pass your own list via .describe(percentiles=[0.1, 0.2, ..., 0.9]), for example wrapped in a small function as sketched below.
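For example, a minimal sketch of such a wrapper (assuming df has a year column plus the val columns, as in the question; the function name is just illustrative):
import numpy as np
import pandas as pd

def row_percentiles(frame, percentiles=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    # describe() accepts a custom percentiles list; applying it row-wise (axis=1)
    # summarises the val columns of each year in one pass
    return frame.set_index('year').apply(
        pd.Series.describe, axis=1, percentiles=list(percentiles)
    )

# small example with random data in the same shape as the question
df = pd.DataFrame(np.random.randn(3, 5), columns=[f'val{i}' for i in range(5)])
df.insert(0, 'year', [1983, 1984, 1985])
print(row_percentiles(df))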
Use DataFrame.quantile after converting year to the index, then transpose and rename the columns with a custom lambda function:
a = np.arange(1, 10) / 10
f = lambda x: f'percentile_{int(x * 100)}' if x != 0.5 else 'median'
new_df = df.set_index('year').quantile(a, axis=1).T.rename(columns=f)
print (new_df)
percentile_10 percentile_20 percentile_30 percentile_40 median \
year
1983 -38.8406 -35.4942 -33.4938 -32.8394 -32.185
1984 -85.3144 -28.2848 0.3840 0.6920 1.000
1985 -19.1224 -7.0408 -0.6922 -0.0766 0.539
2007 -21.4686 -17.4782 -11.6276 -3.9168 3.794
2008 -20.5260 -6.8420 0.1662 0.4986 0.831
2009 -1.3816 -0.5672 0.3998 1.5194 2.639
percentile_60 percentile_70 percentile_80 percentile_90
year
1983 -14.1562 3.8726 13.3522 14.2826
1984 9.8884 18.7768 26.4194 32.8162
1985 1.1234 1.7078 2.2884 2.8652
2007 3.9720 4.1500 4.5208 5.0844
2008 3.0710 5.3110 10.0502 17.2886
2009 22.6346 42.6302 56.3606 63.8258

Pandas: select the first value which is not negative anymore, return the row

For now my code looks like this:
df = pd.DataFrame()
max_exp = []
gammastar = []
for idx, rw in df_gamma_count.iterrows():
    exp = rw['Pr_B'] * (rw['gamma_index'] * float(test_spread) * (1 + f) - (f + f))
    df = df.append({'exp': exp, 'gamma_perc': rw['gamma_index'], 'Pr_B': rw['Pr_B'], 'spread-test in %': test_spread}, ignore_index=True)
df = df.sort_values(by=['exp'], ascending=True)
df
which gives me the following dataframe:
Pr_B exp gamma_perc spread-test in %
10077 0.000066 -2.078477e-08 1.544700 0.001090292473058004120128368625
10078 0.000066 -2.073422e-08 1.545400 0.001090292473058004120128368625
10079 0.000066 -2.071978e-08 1.545600 0.001090292473058004120128368625
10080 0.000066 -2.071256e-08 1.545700 0.001090292473058004120128368625
10081 0.000000 -0.000000e+00 1.545900 0.001090292473058004120128368625
10082 0.000000 -0.000000e+00 1.546200 0.001090292473058004120128368625
10083 0.000000 0.000000e+00 1.546300 0.001090292473058004120128368625
10084 0.000000 1 1.546600 0.001090292473058004120128368625
What I need now is to select the first value from the column exp which is not negative anymore. What I have done so far is sort the dataframe by the exp column, but after that I am a bit stuck and do not know where to go... any ideas?
Try:
df.loc[df.exp.ge(0).idxmax()]
this will select the first row where exp is not negative anymore
If you are trying to get the largest value in the series:
df.exp.nlargest(1)
EDIT:
Use this to get your desired output:
import numpy as np
target = np.where(
    all(i > 0 for i in df.exp.tolist()),
    min([n for n in df.exp.tolist() if n <= 0]),
    max([n for n in df.exp.tolist() if n <= 0]),
)
print(df.loc[df.exp == target].head(1))
Pr_B exp gamma_perc spread-test in %
4 0.0 0.0 1.5459 0.00109
I would screen for numbers greater than or equal to 0 and take the first one:
data = [-1,-2,-3, 0]
df = pd.DataFrame(data, columns=['exp'])
value = df.exp[df.exp >= 0].iloc[0] if df.exp[df.exp >= 0].any() else df.exp.max()
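A hedged follow-up to the line above: since the question asks for the whole row rather than just the value, you can index back into the frame once the position is known (toy data below, not the asker's frame):
import pandas as pd

df = pd.DataFrame({'exp': [-2.07e-08, -0.0, 0.0, 1.0]})
non_neg = df[df['exp'] >= 0]
# first row whose exp is not negative; fall back to the row with the largest exp
row = non_neg.iloc[0] if not non_neg.empty else df.loc[df['exp'].idxmax()]
print(row)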

Find out the percentage of missing values in each column in the given dataset

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))
input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0
and the output should be
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
How about this? I think I actually found something similar on here once before, but I'm not seeing it now...
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})
And if you want the missing percentages sorted, follow the above with:
missing_value_df.sort_values('percent_missing', inplace=True)
As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:
percent_missing = df.isnull().sum() * 100 / len(df)
Update: let's use mean with isnull:
df.isnull().mean() * 100
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
IIUC:
df.isnull().sum() / df.shape[0] * 100.00
Output:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.238124
Discount 0.654840
Order_Quantity 0.654840
Profit 0.654840
Shipping_Cost 0.654840
Product_Base_Margin 1.297774
dtype: float64
single line solution
df.isnull().mean().round(4).mul(100).sort_values(ascending=False)
To cover all missing values and round the results:
((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)
The output:
Out[556]:
Ord_id 0.00
Prod_id 0.00
Ship_id 0.00
Cust_id 0.00
Sales 0.24
Discount 0.65
Order_Quantity 0.65
Profit 0.65
Shipping_Cost 0.65
Product_Base_Margin 1.30
dtype: float64
The solution you're looking for is:
round(df.isnull().mean()*100,2)
This will round the percentage to 2 decimal places.
Another way to do this is:
round((df.isnull().sum()*100)/len(df),2)
but this is not as efficient as using mean().
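If you want to check that efficiency claim yourself, here is a rough benchmark sketch (an assumption-laden micro-benchmark; timings will vary with your machine and data):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([1.0, np.nan], size=(100_000, 10)))
print(timeit.timeit(lambda: df.isnull().mean() * 100, number=100))
print(timeit.timeit(lambda: df.isnull().sum() * 100 / len(df), number=100))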
import numpy as np
import pandas as pd
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
            'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'],
            'age': [22, np.nan, 23, 24, 25],
            'sex': ['m', np.nan, 'f', 'm', 'f'],
            'Test1_Score': [4, np.nan, 0, 0, 0],
            'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])
results
first_name last_name age sex Test1_Score Test2_Score
0 Jason Miller 22.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina NaN 23.0 f 0.0 NaN
3 Jake Milner 24.0 m 0.0 0.0
4 Amy Cooze 25.0 f 0.0 0.0
You can use the following function, which will give you the output in a DataFrame with these columns:
Zero Values
Missing Values
% of Total Values
Total Zero Missing Values
% Total Zero Missing Values
Data Type
Just copy and paste the following function and call it by passing your pandas DataFrame:
def missing_zero_values_table(df):
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(
        columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Values'})
    mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
    mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
    mz_table['Data Type'] = df.dtypes
    mz_table = mz_table[
        mz_table.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) +
          " columns that have missing values.")
    # mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index=False)
    return mz_table
missing_zero_values_table(results)
Output
Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.
Zero Values Missing Values % of Total Values Total Zero Missing Values % Total Zero Missing Values Data Type
last_name 0 2 40.0 2 40.0 object
Test2_Score 2 2 40.0 4 80.0 float64
first_name 0 1 20.0 1 20.0 object
age 0 1 20.0 1 20.0 float64
sex 0 1 20.0 1 20.0 object
Test1_Score 3 1 20.0 4 80.0 float64
If you want to keep it simple, you can use the following function to get missing values in %:
def missing(dff):
    print(round((dff.isnull().sum() * 100 / len(dff)), 2).sort_values(ascending=False))
missing(results)
Test2_Score 40.0
last_name 40.0
Test1_Score 20.0
sex 20.0
age 20.0
first_name 20.0
dtype: float64
One-liner
I'm wondering why nobody takes advantage of size and count? It seems like the shortest (and probably fastest) way to do it.
df.apply(lambda x: 1-(x.count()/x.size))
Resulting in:
Ord_id 0.000000
Prod_id 0.000000
Ship_id 0.000000
Cust_id 0.000000
Sales 0.002381
Discount 0.006548
Order_Quantity 0.006548
Profit 0.006548
Shipping_Cost 0.006548
Product_Base_Margin 0.012978
dtype: float64
If you find any reason why this is not a good way, please comment
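One hedged note on the expression above: it returns fractions in [0, 1]; if you want percentages like the other answers, just scale by 100:
df.apply(lambda x: 1 - (x.count() / x.size)) * 100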
If you have multiple dataframes, below is a function to calculate the number of missing values in each column, with percentages:
def miss_data(df):
    x = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum() / df[col].shape[0]) * 100
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
With the following code, you can get the corresponding percentage values for every column. Just replace the name train_data with df in your case.
Input:
In [1]:
all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)
Output :
Out[1]:
Missing Ratio
left_eyebrow_outer_end_x 68.435239
left_eyebrow_outer_end_y 68.435239
right_eyebrow_outer_end_y 68.279189
right_eyebrow_outer_end_x 68.279189
left_eye_outer_corner_x 67.839410
left_eye_outer_corner_y 67.839410
right_eye_inner_corner_x 67.825223
right_eye_inner_corner_y 67.825223
right_eye_outer_corner_x 67.825223
right_eye_outer_corner_y 67.825223
mouth_left_corner_y 67.811037
mouth_left_corner_x 67.811037
left_eyebrow_inner_end_x 67.796851
left_eyebrow_inner_end_y 67.796851
right_eyebrow_inner_end_y 67.796851
mouth_right_corner_x 67.796851
mouth_right_corner_y 67.796851
right_eyebrow_inner_end_x 67.796851
left_eye_inner_corner_x 67.782664
left_eye_inner_corner_y 67.782664
For me, I did it like this:
def missing_percent(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_percent = 100 * df.isnull().sum() / len(df)
    # Make a table with the results
    mis_table = pd.concat([mis_val, mis_percent], axis=1)
    # Rename the columns
    mis_columns = mis_table.rename(
        columns={0: 'Missing Values', 1: 'Percent of Total Values'})
    # Sort the table by percentage of missing descending
    mis_columns = mis_columns[
        mis_columns.iloc[:, 1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)
    # Print some summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_columns.shape[0]) +
          " columns that have missing values.")
    # Return the dataframe with missing information
    return mis_columns
Let's break down your ask:
you want the percentage of missing values
it should be sorted in ascending order and the values rounded to 2 decimal places
Explanation:
dhr[fill_cols].isnull().sum() - gives the total number of missing values, column-wise
dhr.shape[0] - gives the total number of rows
(dhr[fill_cols].isnull().sum() * 100 / dhr.shape[0]) - gives you a Series with percentages as values and column names as index
since the output is a Series, you can round and sort based on the values
code:
(dhr[fill_cols].isnull().sum() * 100 / dhr.shape[0]).round(2).sort_values()
Reference:
sort, round
import numpy as np
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.loc[np.isnan(df['Product_Base_Margin']),['Product_Base_Margin']]=df['Product_Base_Margin'].mean()
print(round(100*(df.isnull().sum()/len(df.index)), 2))
Try this solution
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
print(round(100*(df.isnull().sum()/len(df.index)),2))
The best solution I have found (it only shows the columns that have missing values):
missing_values = [feature for feature in df.columns if df[feature].isnull().sum() > 0]
for feature in missing_values:
    print(f"{feature} {np.round(100 * df[feature].isnull().mean(), 2)}% missing values")
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.isna().sum()
Output:
Ord_id 0
Prod_id 0
Ship_id 0
Cust_id 0
Sales 20
Discount 55
Order_Quantity 55
Profit 55
Shipping_Cost 55
Product_Base_Margin 109
dtype: int64
df.shape
Output: (8399, 10)
# for share [0; 1] of nan in each column
df.isna().sum() / df.shape[0]
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0024 # (20 / 8399)
Discount 0.0065 # (55 / 8399)
Order_Quantity 0.0065 # (55 / 8399)
Profit 0.0065 # (55 / 8399)
Shipping_Cost 0.0065 # (55 / 8399)
Product_Base_Margin 0.0130 # (109 / 8399)
dtype: float64
# for percent [0; 100] of nan in each column
df.isna().sum() / (df.shape[0] / 100)
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.2381 # (20 / (8399 / 100))
Discount 0.6548 # (55 / (8399 / 100))
Order_Quantity 0.6548 # (55 / (8399 / 100))
Profit 0.6548 # (55 / (8399 / 100))
Shipping_Cost 0.6548 # (55 / (8399 / 100))
Product_Base_Margin 1.2978 # (109 / (8399 / 100))
dtype: float64
# for share [0; 1] of nan in dataframe
df.isna().sum() / (df.shape[0] * df.shape[1])
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0002 # (20 / (8399 * 10))
Discount 0.0007 # (55 / (8399 * 10))
Order_Quantity 0.0007 # (55 / (8399 * 10))
Profit 0.0007 # (55 / (8399 * 10))
Shipping_Cost 0.0007 # (55 / (8399 * 10))
Product_Base_Margin 0.0013 # (109 / (8399 * 10))
dtype: float64
# for percent [0; 100] of nan in dataframe
df.isna().sum() / ((df.shape[0] * df.shape[1]) / 100)
Output:
Ord_id 0.0000
Prod_id 0.0000
Ship_id 0.0000
Cust_id 0.0000
Sales 0.0238 # (20 / ((8399 * 10) / 100))
Discount 0.0655 # (55 / ((8399 * 10) / 100))
Order_Quantity 0.0655 # (55 / ((8399 * 10) / 100))
Profit 0.0655 # (55 / ((8399 * 10) / 100))
Shipping_Cost 0.0655 # (55 / ((8399 * 10) / 100))
Product_Base_Margin 0.1298 # (109 / ((8399 * 10) / 100))
dtype: float64

Pandas - get last n values from a group with an offset.

I have a data frame (pandas, Python 3.5) with date as the index.
The electricity_use column is the label I should predict.
e.g.
City Country electricity_use
DATE
7/1/2014 X A 1.02
7/1/2014 Y A 0.25
7/2/2014 X A 1.21
7/2/2014 Y A 0.27
7/3/2014 X A 1.25
7/3/2014 Y A 0.20
7/4/2014 X A 0.97
7/4/2014 Y A 0.43
7/5/2014 X A 0.54
7/5/2014 Y A 0.45
7/6/2014 X A 1.33
7/6/2014 Y A 0.55
7/7/2014 X A 2.01
7/7/2014 Y A 0.21
7/8/2014 X A 1.11
7/8/2014 Y A 0.34
7/9/2014 X A 1.35
7/9/2014 Y A 0.18
7/10/2014 X A 1.22
7/10/2014 Y A 0.27
Of course the data is larger.
My goal is to add, for each row, the last 3 electricity_use values of its group ('City', 'Country') with a gap of 5 days (i.e. take the last 3 values counting back from 5 days before the row's date). The dates can be non-consecutive, but they are ordered.
For example, for the two last rows the result should be:
City Country electricity_use prev_1 prev_2 prev_3
DATE
7/10/2014 X A 1.22 0.54 0.97 1.25
7/10/2014 Y A 0.27 0.45 0.43 0.20
because the date is 7/10/2014 and the gap is 5 days, so we start looking from 7/5/2014, and those are the 3 last values from that date back, for each group (in this case, the groups are (X, A) and (Y, A)).
I implemented it with a loop going over each group, but I have a feeling it could be done in a much more efficient way.
A naive approach to do this would be to copy the index into a column and iteratively merge n times:
from datetime import timedelta
# make sure the index is in datetime format
df['index'] = df.index
df1 = df.copy()
for i in range(3):
    df1['index'] = df['index'] - timedelta(5 + i)
    df = df1.merge(df, on=['City', 'Country', 'index'], how='left',
                   suffixes=('', '_' + str(i)))
A faster approach would be to use shift and then remove the bogus values:
df['date'] = df.index
df.sort_values(by=['City', 'Country', 'date'], inplace=True)
# To pick the oldest date of every City + Country group
temp = df[['City', 'Country', 'date']].groupby(['City', 'Country']).first()
df = df.merge(temp, left_on=['City', 'Country'], right_index=True, suffixes=('', '_first'))
df['diff_date'] = df['date'] - df['date_first']
df['diff_date'] = [int(i.days) for i in df['diff_date']]
# Do a shift by 5, 6 and 7
for i in range(5, 8):
    df['days_prior_' + str(i)] = df['electricity_use'].shift(i)
    # The top i values of every City + Country group would be bogus values,
    # as they would come from the group before it
    df.loc[df['diff_date'] < i, 'days_prior_' + str(i)] = 0
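For completeness, a hedged sketch of the same idea with groupby + shift, assuming df is the frame from the question and that every (City, Country) group has exactly one row per calendar day (the question allows non-consecutive dates, in which case the date-based bookkeeping above is safer):
out = df.reset_index()  # the DATE index becomes a column
g = out.groupby(['City', 'Country'])['electricity_use']
for k in (1, 2, 3):
    # prev_1 = value 5 days back, prev_2 = 6 days back, prev_3 = 7 days back
    out['prev_' + str(k)] = g.shift(4 + k)
print(out.tail(2))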

Transaction data analysis

I have a transactions data frame (15k lines):
customer_id order_id order_date var1 var2 product_id \
79 822067 1990-10-21 0 0 51818
79 771456 1990-11-29 0 0 580866
79 771456 1990-11-29 0 0 924147
156 720709 1990-06-08 0 0 167205
156 720709 1990-06-08 0 0 132120
product_type_id designer_id gross_spend net_spend
139 322 0.174 0.174
139 2366 1.236 1.236
432 919 0.205 0.205
474 4792 0.374 0.374
164 2243 0.278 0.278
I'd like to group by product_type_id and by time bin of the transaction for each customer. To be more clear: for each customer_id I'd like to know how many times the customer bought from the same category in the last 30, 60, 90, 120, 150, 180, 360 days (counting back from a date such as 1991-01-01).
For each customer I'd also like to have the total number of purchases made, how many distinct product_type_id values were bought, and the total net_spend.
It is not clear to me how to reduce the data to a flat pandas data frame with one line per customer_id....
I can get a simplified view with something like:
transactions['order_date'] = transactions['order_date'].apply(lambda x: dt.datetime.strptime(x,"%Y-%m-%d"))
NOW = dt.datetime(1991, 1, 1)
Table = transactions.groupby('customer_id').agg({ 'order_date': lambda x: (NOW - x.max()).days,'order_id': lambda x: len(set(x)), 'net_spend': lambda x: x.sum()})
Table.rename(columns={'order_date': 'Recency', 'order_id': 'Frequency', 'net_spend': 'Monetization'}, inplace=True)
Use:
date = '1991-01-01'
last = [30,60,90]
#get all last datetimes shifted by last
a = [pd.to_datetime(date)- pd.Timedelta(x, unit='d') for x in last]
d1 = {}
#create new columns by conditions with between
for i, x in enumerate(a):
    df['last_' + str(last[i])] = df['order_date'].between(x, date).astype(int)
    #create dictionary for aggregate
    d1['last_' + str(last[i])] = 'sum'
#aggregating dictionary
d = {'customer_id':'size', 'product_type_id':'nunique', 'net_spend':'sum'}
#add d1 to d
d.update(d1)
print (d)
{'product_type_id': 'nunique', 'last_30': 'sum', 'net_spend': 'sum',
'last_60': 'sum', 'customer_id': 'size', 'last_90': 'sum'}
df1 = df.groupby('customer_id').agg(d)
#change order of columns if necessary
cs = df1.columns
m = cs.str.startswith('last')
cols = cs[~m].tolist() + cs[m].tolist()
df1 = df1.reindex(columns=cols)
print (df1)
product_type_id net_spend customer_id last_30 last_60 \
customer_id
79 2 1.615 3 0 2
156 2 0.652 2 0 0
last_90
customer_id
79 3
156 0
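Note that to cover all the windows mentioned in the question (30 through 360 days), you only need to extend the list at the top, e.g.:
last = [30, 60, 90, 120, 150, 180, 360]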
