I want to split a dataframe into quartiles of a specific column.
I have data from 800 companies. One column contains a specific score which ranges from 0 to 100.
I want to split the dataframe into 4 equally sized groups (quartiles), Q1 to Q4, where Q4 contains the companies with the highest scores. Each group should therefore contain 200 companies. How can I divide the companies into 4 equal-sized groups according to their score in a specific column (here the last column, "ESG Combined Score 2011")? I then want to export the groups to separate sheets in Excel (Q1 to a sheet named Q1, Q2 to a sheet named Q2, and so on).
Here is an extract of the data:
df1
Company Common Name Company Market Capitalization ESG Combined Score 2011
0 SSR Mining Inc 3.129135e+09 32.817325
1 Fluor Corp 3.958424e+09 69.467729
2 CBRE Group Inc 2.229251e+10 59.632423
3 Assurant Inc 8.078239e+09 46.492803
4 CME Group Inc 6.269954e+10 42.469682
5 Peabody Energy Corp 3.842130e+09 73.374671
And as an additional question: how can I turn off the scientific notation of the column in the middle? I want it to display with thousands separators.
Thanks for your help.
Suppose your dataframe is already sorted by the relevant values.
import numpy as np
import pandas as pd

writer = pd.ExcelWriter('splited_df.xlsx', engine='xlsxwriter')

# determine how many parts you want to divide the dataframe into
n_groups = 4

# build the list of slicing boundaries
separator = list(map(int, np.linspace(0, len(df1), n_groups + 1)))

for idx in range(len(separator)):
    if idx >= len(separator) - 2:
        # last slice: take everything from this boundary to the end
        df1.iloc[separator[idx]:, :].to_excel(writer, sheet_name=f'Sheet{idx+1}')
        break
    df1.iloc[separator[idx]:separator[idx+1], :].to_excel(writer, sheet_name=f'Sheet{idx+1}')

writer.close()  # close() also saves the file
And to suppress scientific notation, you can refer to this Stack Overflow post.
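For example, a minimal sketch (my own, not from that post) that switches the display to thousands separators, assuming you only want to change how the numbers are shown, not the stored values:

# show floats with a thousands separator and two decimals instead of scientific notation
pd.options.display.float_format = '{:,.2f}'.format
print(df1['Company Market Capitalization'])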
I hope this answers your question.
You will need to sort the dataframe and partition it based on indices corresponding to quantiles:
def partition_quantiles(df, by, quantiles):
    num_samples = len(df)
    df = df.sort_values(by)
    q_idxs = [0, *(int(num_samples * q) for q in quantiles), num_samples + 1]
    for q_start, q_end in zip(q_idxs[:-1], q_idxs[1:]):
        yield df.iloc[q_start:q_end]
It will work as follows:
from random import choices
from string import ascii_letters
import numpy as np
import pandas as pd
num_rows = 12
companies = ["".join(choices(ascii_letters, k=10)) for _ in range(num_rows)]
capitalizations = np.random.rand(num_rows) * 1e6
scores = np.random.rand(num_rows) * 1e2
df = pd.DataFrame(
    {
        "company": companies,
        "capital": capitalizations,
        "score": scores,
    }
)

for partition in partition_quantiles(df, "score", [0.25, 0.5, 0.75]):
    print("-" * 40)
    print(partition)
which prints:
----------------------------------------
company capital score
7 QVdnUUiaSV 599523.318607 0.506453
2 CahcnFEMlB 247175.132381 11.201345
10 OpvllkCfWp 203289.934774 36.328395
----------------------------------------
company capital score
6 YzqHvWewOC 774025.801826 49.618631
4 taDrQHvHoB 354491.773921 60.153841
11 JrZmmTvwyD 248947.408524 62.414680
----------------------------------------
company capital score
8 nvkomHSjtP 139345.993291 63.949291
9 soigFZMVjo 666688.879067 64.449568
5 LQSInRRnZd 691896.831968 85.375991
----------------------------------------
company capital score
0 wNMoypFeXN 12712.591339 85.638396
1 XNDqUMDrTb 858545.389446 92.531258
3 okUNZChvsJ 697386.417437 95.398392
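If you also want each partition in its own Excel sheet named Q1 to Q4, as the question asks, here is a small sketch built on the function above (the file name and sheet names are my own choices):

with pd.ExcelWriter('quartiles.xlsx', engine='xlsxwriter') as writer:
    for i, partition in enumerate(partition_quantiles(df, 'score', [0.25, 0.5, 0.75]), start=1):
        partition.to_excel(writer, sheet_name=f'Q{i}')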
You can use numpy's array_split for that:
import numpy as np
import pandas as pd

dfs = np.array_split(df.sort_values(by=['ESG Combined Score 2011']), 4)
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for index, df in enumerate(dfs):
    df.to_excel(writer, sheet_name=f'Sheet{index+1}')
writer.close()
You can use pandas.qcut.
Overall solution (edited to reuse part of @RJ Adriaansen's solution):
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]

writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{i+1}')
writer.close()
Input:
df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'], 'score': [1, 2, 3, 4]})
company score
0 A 1
1 B 2
2 C 3
3 D 4
Script:
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
Output:
company score categories
0 A 1 low
1 B 2 low-mid
2 C 3 mid-high
3 D 4 high
Then write each category to a different Excel sheet (edited to reuse part of @RJ Adriaansen's solution):
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{i+1}')
writer.close()
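Since the question asks for sheets named Q1 to Q4, here is a variation of the above (my own adaptation, assuming the score column name from the question) that labels the quartiles directly and closes the writer so the file is written:

df['quartile'] = pd.qcut(df['ESG Combined Score 2011'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
with pd.ExcelWriter('quartiles.xlsx', engine='xlsxwriter') as writer:
    for label, group in df.groupby('quartile'):
        group.to_excel(writer, sheet_name=str(label))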
I have a dataframe as shown below:
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
df = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                   'grade': rng.choice(list('ACD'), size=(5)),
                   'dash': rng.choice(list('PQRS'), size=(5)),
                   'dumeel': rng.choice(list('QWER'), size=(5)),
                   'dumma': rng.choice((1234), size=(5)),
                   'target': rng.choice([0, 1], size=(5))
                   })
My objective is to compute the drill-down info for each column.
Let me explain with an example.
If we filter the dataframe by df[df['grade']=='A'], we get 2 records as the result. Let's call the filtered column grade the parent_variable. Out of those 2 records, how much do the dumeel column (child_variable) values and the dash column (child_variable) values account for the target column values (which are 0 and 1)? All categorical/object columns other than the parent variable are called child variables.
We have to repeat the above example procedure for all the categorical/object variables in our dataset.
As a first step, I made use of the below from an SO post:
funcs = {
    'cnt of records': 'count',
    'target met': lambda x: sum(x),
    'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%"
}

out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
        .groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
out.rename(columns={'variable': 'parent_variable', 'value': 'parent_value'}, inplace=True)
But the above only gets me the % and count of target for each parent variable. I would like to get the breakdown by child variables as well (for each parent variable).
%_contrib is obtained by computing the % that the record contributes to the target value. For example: for dash=P, we have one grade value, A (for target = 1), so it has to be 100%. Hope this helps.
I expect my output to look like the sample below. I have shown a sample only for a couple of columns under parent_variable, but in my real data there will be more than 20 categorical variables, so any efficient approach is welcome and useful.
As you are using a random function to generate the DataFrame, it is hard for me to reproduce your example exactly, but I think you are looking for value_counts.
This is the DataFrame I generated with your code -
grade dash dumeel dumma target
0 D P W 50 1
1 D S R 595 0
2 C P E 495 1
3 A Q Q 690 0
4 B P W 653 1
5 D R E 554 0
6 C P Q 392 1
7 D Q Q 186 0
8 B Q E 1228 1
9 C P E 14 0
When I do a value_counts() on the two columns -
df[(df['dash']=='P') & (df['target'] == 1)]['dumeel'].value_counts(normalize=True)
W 0.50
Q 0.25
E 0.25
Name: dumeel, dtype: float64
df[(df['dash']=='P') & (df['target'] == 1)]['grade'].value_counts(normalize=True)
C 0.50
D 0.25
B 0.25
Name: grade, dtype: float64
If you want to loop over all the child_columns - you can do
excl_cols = ['dash', 'target']
child_cols = [col for col in df.columns if col not in excl_cols]
for col in child_cols:
    print(df[(df['dash']=='P') & (df['target'] == 1)][col].value_counts(normalize=True))
If you want to loop over all the columns - then you can use:
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
    print(f'Parent column is {parent_col}\n')
    parent_vals = df[parent_col].unique()
    child_cols = loop_columns - {parent_col}
    for parent_val in parent_vals:
        for child_col in child_cols:
            print(df[(df[parent_col]==parent_val) & (df['target'] == 1)][child_col].value_counts(normalize=True))
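To collect the same breakdowns into one tidy frame instead of printing them, here is a sketch (my own addition; the column names parent_variable, parent_value, child_variable, child_value and %_contrib are taken from the question):

records = []
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
    for parent_val in df[parent_col].unique():
        subset = df[(df[parent_col] == parent_val) & (df['target'] == 1)]
        for child_col in loop_columns - {parent_col}:
            shares = subset[child_col].value_counts(normalize=True)
            for child_val, share in shares.items():
                records.append({'parent_variable': parent_col,
                                'parent_value': parent_val,
                                'child_variable': child_col,
                                'child_value': child_val,
                                '%_contrib': f"{100 * share:.2f}%"})
drill_down = pd.DataFrame(records)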
I'm trying to use hypothesis to generate pandas dataframes where some column values are dependent on other column values. So far, I haven't been able to 'link' two columns.
This code snippet:
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, column, range_indexes

def create_dataframe():
    id1 = st.integers().map(lambda x: x)
    id2 = st.shared(id1).map(lambda x: x * 2)
    df = data_frames(index=range_indexes(min_size=10, max_size=100), columns=[
        column(name='id1', elements=id1, unique=True),
        column(name='id2', elements=id2),
    ])
    return df
Produces a dataframe with a static second column:
id1 program_id
0 1.170000e+02 110.0
1 3.600000e+01 110.0
2 2.876100e+04 110.0
3 -1.157600e+04 110.0
4 5.300000e+01 110.0
5 2.782100e+04 110.0
6 1.334500e+04 110.0
7 -3.100000e+01 110.0
I think that you're after the rows argument, which allows you to compute some column values from other columns. For example, if we wanted a full_price and a sale_price column where the sale price has some discount applied:
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes
def create_dataframe():
    full = st.floats(1, 1000)  # all items cost $1 to $1,000
    discounts = st.sampled_from([0, 0.1, 0.25, 0.5])
    rows = st.tuples(full, discounts).map(
        lambda xs: dict(price=xs[0], sale_price=xs[0] * (1 - xs[1]))
    )
    return data_frames(
        index=range_indexes(min_size=10, max_size=100),
        rows=rows
    )
price sale_price
0 757.264509 378.632254
1 824.384095 618.288071
2 401.187339 300.890504
3 723.193610 650.874249
4 777.171038 699.453934
5 274.321034 205.740776
So what went wrong with your example code? It looks like you imagined that the id1 and id2 strategies were defined relative to each other on a row-wise basis, but they're actually independent - and the shared() strategy shares a single value between every row in the column.
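Applying the same rows-based idea to your id1/id2 case, here is a sketch (my own; note that it does not express the unique=True constraint on id1, and I bound the integers only to keep the example small):

id_rows = st.integers(min_value=-1_000_000, max_value=1_000_000).map(
    lambda x: dict(id1=x, id2=x * 2)  # id2 is derived from id1 on every row
)
linked_df = data_frames(index=range_indexes(min_size=10, max_size=100), rows=id_rows)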
I want to do the following to my dataframe:
For each row identify outliers/anomalies
Highlight/color the identified outliers' cells (preferably 'red' color)
Count the number of identified outliers in each row (store in a column 'anomaly_count')
Export the output as an xlsx file
See below for sample data
import numpy as np
import pandas as pd

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)
df
A B C D E
0 -1.685112 -0.432143 0.876200 1.626578 1.512677
1 0.401134 0.439393 1.027222 0.036267 -0.655949
2 -0.074890 0.312793 -0.236165 0.660909 0.074468
3 0.842169 2.759467 0.223652 0.432631 -0.484871
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380
5 0.083653 0.792835 -0.643204 1.182606 -1.207692
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188
8 2.354769 1.099483 -0.653342 -0.532208 0.269307
9 0.431649 0.666982 0.361765 0.419482 0.531072
10 -0.124268 -0.170720 -0.979012 -0.410861 1.000371
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283
14 0.029966 -0.579152 0.648176 0.833141 -0.942752
15 0.824767 0.974580 0.363170 0.428062 -0.232174
The desired outcome should look something like this:
## I want to ONLY identify the outliers, NOT remove or substitute them. I only used NaN to depict the outlier values. Ideally, the outlier cells should be colored/highlighted red.
## Please note: the outlier NaNs in the sample are randomly assigned.
A B C D E Anomaly_Count
0 NaN -0.432143 0.876200 NaN 1.512677 2
1 0.401134 0.439393 1.027222 0.036267 -0.655949 0
2 -0.074890 0.312793 -0.236165 0.660909 0.074468 0
3 0.842169 NaN 0.223652 0.432631 -0.484871 1
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380 0
5 0.083653 0.792835 -0.643204 NaN NaN 2
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728 0
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188 0
8 2.354769 1.099483 -0.653342 -0.532208 0.269307 0
9 0.431649 0.666982 0.361765 0.419482 0.531072 0
10 -0.124268 -0.170720 -0.979012 -0.410861 NaN 1
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289 0
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504 0
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283 0
14 0.029966 -0.579152 0.648176 0.833141 -0.942752 0
15 0.824767 NaN 0.363170 0.428062 -0.232174 1
See below for my attempt; I am open to other approaches.
import numpy as np
from scipy import stats

def outlier_detection(data):
    # step I: identify the outliers in each row
    df[(np.abs(stats.zscore(df)) < 3).all(axis=0)]  # unfortunately this removes the outliers, which I don't want

    # step II: color/highlight the outlier cell
    df = df.style.highlight_null('red')

    # step III: count the number of outliers in each row
    df['Anomaly_count'] = df.isnull().sum(axis=1)

    # step IV: export as xlsx file
    df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index=False)

outlier_detection(df)
Thanks for your time.
This works for me
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)
mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")
sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer,
                                                           sheet_name=sheet_name,
                                                           index=False)
Here mask is the boolean condition that is True wherever the z-score exceeds the limit. Based on this boolean mask I create a string dataframe style_df with the value 'background-color: red' in the deviating cells. The values of style_df are imposed on the style of the df dataframe by the last statement.
The resulting Excel file then shows the deviating cells highlighted in red.
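If you want to keep the question's |z| > 3 cutoff and the column name Anomaly_Count, here is a small variation (my own sketch; it computes the mask only on the original data columns so the count column itself is not z-scored):

data_cols = list('ABCDE')
mask = pd.DataFrame(np.abs(stats.zscore(df[data_cols])) > 3, columns=data_cols, index=df.index)
df['Anomaly_Count'] = mask.sum(axis=1)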
I have a Pandas dataframe with the columns ['week', 'price_per_unit', 'total_units']. I wish to create a new column called 'weighted_price' as follows: first group by 'week' and then for each week calculate price_per_unit * total_units / sum(total_units) for that week. I have code that does this:
import pandas as pd
import numpy as np
def create_features_by_group(df):
    # first group the data
    grouped = df.groupby(['week'])
    df_temp = pd.DataFrame(columns=['weighted_price'])

    # run through the groups and create the weighted_price per group
    for name, group in grouped:
        res = (group['total_units'] * group['price_per_unit']) / np.sum(group['total_units'])
        for idx in res.index:
            df_temp.loc[idx] = [res[idx]]

    df = df.join(df_temp['weighted_price'])
    return df
The only problem is that this is very, very slow. Is there some faster way to do this?
I used the following code to test the function.
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['week', 'price_per_unit', 'total_units'])
for i in range(10):
    df.loc[i] = [round(int(i % 3), 0), 10 * np.random.rand(), round(10 * np.random.rand(), 0)]
I think you need to do it this way:
df
price total_units week
0 5 100 1
1 7 200 1
2 9 150 2
3 11 250 2
4 13 125 2
def fun(table):
    table['measure'] = table['price'] * (table['total_units'] / table['total_units'].sum())
    return table

df.groupby('week').apply(fun)
price total_units week measure
0 5 100 1 1.666667
1 7 200 1 4.666667
2 9 150 2 2.571429
3 11 250 2 5.238095
4 13 125 2 3.095238
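A vectorized alternative (my own sketch, not part of the answer above): groupby().transform avoids the per-group Python work entirely, which is usually much faster on large frames and is what the question is really after:

df['measure'] = df['price'] * df['total_units'] / df.groupby('week')['total_units'].transform('sum')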
I have grouped the dataset by 'Week' to calculate the weighted price for each week.
Then I joined the original dataset with the grouped dataset to get the result:
# importing the libraries
import pandas as pd
import numpy as np
# creating the dataset
df = {
    'Week': [1, 1, 1, 1, 2, 2],
    'price_per_unit': [10, 11, 22, 12, 12, 45],
    'total_units': [10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(df)
df['price'] = df['price_per_unit'] * df['total_units']
# calculate the total sales and total number of units sold in each week
df_grouped_week = df.groupby(by = 'Week').agg({'price' : 'sum', 'total_units' : 'sum'}).reset_index()
# calculate the weighted price
df_grouped_week['wt_price'] = df_grouped_week['price'] / df_grouped_week['total_units']
# merging df and df_grouped_week
df_final = pd.merge(df, df_grouped_week[['Week', 'wt_price']], how = 'left', on = 'Week')
I have the following python pandas dataframe:
        Number of visits per year
user id | 2013 | 2014 | 2015 | 2016
A       |    4 |    3 |    6 |    0
B       |    3 |    0 |    7 |    3
C       |   10 |    6 |    3 |    0
I want to calculate the percentage of users who returned, based on their number of visits. I am sorry, I don't have any code yet; I wasn't sure how to start this.
This is the end result I am looking for:
       Number of visits in the year
Year | 1  | 2  | 3  | 4   | 5  | 6  | 7   | 8   | 9  | 10
2014 | 7% | 3% | 4% | 15% | 6% | 7% | 18% | 17% | 3% | 2%
2015 | 3% | ...
2016
So based on the above I can say that 15% of clients who visited the store 4 times in 2013, came back to the store in 2014.
Thank you very much.
UPDATE: This is what I did; maybe there is a better way, using a loop? (A sketch of such a loop follows after the steps below.)
For each year, I had a csv like this:
user_id | NR_V
A 4
B 3
C 10
NR_V stands for number of visits.
So I loaded each csv as its own df and I had df_2009, df_2010, ... up to df_2016.
For each file I added a column with 0/1 indicating whether they shopped the next year.
df_2009['shopped2010'] = np.where(df_2009['user_ID'].isin(df_2010['user_ID']), 1, 0)
Then I pivoted each dataframe.
pivot_2009 = pd.pivot_table(df_2009,index=["NR_V"],aggfunc={"NR_V":len, "shopped2010":np.sum})
Next, for each dataframe I created a new dataframe with a column calculating the percentage by number of visits.
p_2009 = pd.DataFrame()
p_2009['%returned2010'] = (pivot_2009['shopped2010']/pivot_2009['NR_V'])*100
Finally, I merged all those dataframes into one.
dfs = [p_2009, p_2010, p_2011, p_2012, p_2013, p_2014, p_2015 ]
final = pd.concat(dfs, axis=1)
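Here is the sketch of such a loop (my own addition; it assumes the yearly frames are collected in a dict keyed by year and that the id column is named user_id as in the sample):

yearly = {2009: df_2009, 2010: df_2010}  # ... fill in up to df_2016
years = sorted(yearly)
parts = []
for this_year, next_year in zip(years[:-1], years[1:]):
    cur = yearly[this_year].copy()
    cur['returned'] = cur['user_id'].isin(yearly[next_year]['user_id']).astype(int)
    grouped = cur.groupby('NR_V')['returned'].agg(['size', 'sum'])
    parts.append((100 * grouped['sum'] / grouped['size']).rename(f'%returned{next_year}'))
final = pd.concat(parts, axis=1)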
Consider the sample visits dataframe df
df = pd.DataFrame(
    np.random.randint(1, 10, (100, 5)),
    pd.Index(['user_{}'.format(i) for i in range(1, 101)], name='user id'),
    [
        ['Number of visits per year'] * 5,
        [2012, 2013, 2014, 2015, 2016]
    ]
)
df.head()
You can apply pd.value_counts with parameter normalize=True.
Also, since an entry of 8 represents 8 separate visits, it should count 8 times. I'll use repeat to accomplish this prior to value_counts
def count_visits(col):
    v = col.values
    return pd.value_counts(v.repeat(v), normalize=True)

df.apply(count_visits).stack().unstack(0)
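To show the shares as percentages like the desired output, a small follow-up sketch (the formatting choice is my own):

shares = df.apply(count_visits).stack().unstack(0)
print(shares.applymap(lambda x: f'{x:.0%}' if pd.notna(x) else ''))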
I used the index value of every visitor and checked whether the same index value (i.e. the same Visitor_ID) was more than 0 the next year. This was then added to a dictionary in the form of True or False, which you could use for a bar chart. I also made two lists (times_returned and returned_at_all) for additional data manipulation.
import pandas as pd

# Part 1, Building the dataframe.
df = pd.DataFrame({
    'Visitor_ID': [1, 2, 3],
    '2010': [4, 3, 10],
    '2011': [3, 0, 6],
    '2012': [6, 7, 3],
    '2013': [0, 3, 0]
})
df.set_index("Visitor_ID", inplace=True)

# Part 2, preparing the required variables.
def dictionary(max_visitors):
    dictionary = {}
    for x in range(max_visitors):
        dictionary["number_{}".format(x)] = []
    # print(dictionary)
    return dictionary

# Part 3, Figuring out if the customer returned.
def compare_yearly_visits(current_year, next_year):
    index = 1
    years = df.columns
    for x in df[current_year]:
        # print(df[years][current_year][index], 'this year.')
        # print(df[years][next_year][index], 'Next year.')
        how_many_visits = df[years][current_year][index]
        did_he_return = df[years][next_year][index]
        if did_he_return > 0:
            # If the visitor returned, add to a bunch of formats:
            returned_at_all.append([how_many_visits, True])
            times_returned.append([how_many_visits, did_he_return])
            dictionary["number_{}".format(x)].append(True)
        else:
            # If the visitor did not return, add to a bunch of formats:
            returned_at_all.append([how_many_visits, False])
            dictionary["number_{}".format(x)].append(False)
        index = index + 1

# Part 4, The actual program:
highest_amount_of_visits = 11  # should be done automatically, max(visits)?
relevant_years = len(df.columns) - 1
times_returned = []
returned_at_all = []
dictionary = dictionary(highest_amount_of_visits)

for column in range(relevant_years):
    # print(dictionary)
    this_year = df.columns[column]
    next_year = df.columns[column + 1]
    compare_yearly_visits(this_year, next_year)
    print("cumulative dictionary up to:", this_year, "\n", dictionary)
Please find below my solution. As a note, I am pretty positive that this can be improved.
import pandas as pd

# step 0: create the data frame
df = pd.DataFrame({'2013': [4, 3, 10], '2014': [3, 0, 6], '2015': [6, 7, 3], '2016': [0, 3, 0]}, index=['A', 'B', 'C'])

# container list of dataframes to be concatenated
frames = []

# iterate through the dataframe one column at a time and determine its value_counts (freq table)
for name, series in df.items():
    frames.append(series.value_counts())

# merge the frequency tables for all columns into a dataframe
temp_df = pd.concat(frames, axis=1).transpose().fillna(0)

# find the range of visit counts in the new dataframe's columns, and append any missing ones
cols = temp_df.columns
min_col = cols.min()
max_col = cols.max()
for i in range(min_col, max_col):
    if i not in cols:
        temp_df[i] = 0

# calculate the percentage
final_df = temp_df.div(temp_df.sum(axis=1), axis=0)
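And to display the result as percentages like the desired output, a one-line sketch (the formatting is my own choice):

print(final_df.applymap(lambda x: f'{x:.0%}'))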