Split a dataframe based on values in a column - python

I want to split a dataframe into quartiles of a specific column.
I have data from 800 companies. Each row contains a score that ranges from 0 to 100.
I want to split the dataframe into 4 equally sized groups (quartiles Q1 to Q4, where Q4 should contain the companies with the highest scores), so each group should contain 200 companies. How can I divide the companies into 4 equal-sized groups according to their score in a specific column (here the last column, "ESG Combined Score 2011")? I then want to export the groups to separate Excel sheets (Q1 in a sheet named Q1, Q2 in a sheet named Q2, and so on).
Here is an extract of the data:
df1
Company Common Name Company Market Capitalization ESG Combined Score 2011
0 SSR Mining Inc 3.129135e+09 32.817325
1 Fluor Corp 3.958424e+09 69.467729
2 CBRE Group Inc 2.229251e+10 59.632423
3 Assurant Inc 8.078239e+09 46.492803
4 CME Group Inc 6.269954e+10 42.469682
5 Peabody Energy Corp 3.842130e+09 73.374671
And as an additional question: how can I turn off the scientific notation in the middle column? I want it displayed with thousands separators.
Thanks for your help

Suppose your dataframe is already sorted by the relevant column.
import numpy as np
import pandas as pd

writer = pd.ExcelWriter('splited_df.xlsx', engine='xlsxwriter')
# decide how many parts you want to divide the dataframe into
n_groups = 4
# build the list of slice boundaries
separator = list(map(int, np.linspace(0, len(df1), n_groups + 1)))
for idx in range(len(separator)):
    if idx >= len(separator) - 2:
        # last group: slice to the end of the dataframe
        df1.iloc[separator[idx]:, :].to_excel(writer, sheet_name=f'Sheet{idx+1}')
        break
    df1.iloc[separator[idx]:separator[idx+1], :].to_excel(writer, sheet_name=f'Sheet{idx+1}')
writer.close()  # close() also saves the file
And to suppress the scientific notation, you can refer to this Stack Overflow post on float formatting.
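A minimal sketch for the formatting part (assuming you only need this for display and for the Excel output; the 'B:B' column range and 'Sheet1' name below are assumptions, adjust them to your file):

# display only: show floats with thousands separators and two decimals
pd.options.display.float_format = '{:,.2f}'.format

# in the Excel file, a number format can be applied per column via xlsxwriter
# (apply this after writing the sheet but before writer.close())
workbook = writer.book
num_fmt = workbook.add_format({'num_format': '#,##0'})
writer.sheets['Sheet1'].set_column('B:B', 20, num_fmt)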
I hope this answers your question.

You will need to sort the dataframe and partition it based on indices corresponding to quantiles:
def partition_quantiles(df, by, quantiles):
    num_samples = len(df)
    df = df.sort_values(by)
    q_idxs = [0, *(int(num_samples * q) for q in quantiles), num_samples + 1]
    for q_start, q_end in zip(q_idxs[:-1], q_idxs[1:]):
        yield df.iloc[q_start:q_end]
It will work as follows:
from random import choices
from string import ascii_letters

import numpy as np
import pandas as pd

num_rows = 12
companies = ["".join(choices(ascii_letters, k=10)) for _ in range(num_rows)]
capitalizations = np.random.rand(num_rows) * 1e6
scores = np.random.rand(num_rows) * 1e2
df = pd.DataFrame(
    {
        "company": companies,
        "capital": capitalizations,
        "score": scores,
    }
)
for partition in partition_quantiles(df, "score", [0.25, 0.5, 0.75]):
    print("-" * 40)
    print(partition)
which prints:
----------------------------------------
company capital score
7 QVdnUUiaSV 599523.318607 0.506453
2 CahcnFEMlB 247175.132381 11.201345
10 OpvllkCfWp 203289.934774 36.328395
----------------------------------------
company capital score
6 YzqHvWewOC 774025.801826 49.618631
4 taDrQHvHoB 354491.773921 60.153841
11 JrZmmTvwyD 248947.408524 62.414680
----------------------------------------
company capital score
8 nvkomHSjtP 139345.993291 63.949291
9 soigFZMVjo 666688.879067 64.449568
5 LQSInRRnZd 691896.831968 85.375991
----------------------------------------
company capital score
0 wNMoypFeXN 12712.591339 85.638396
1 XNDqUMDrTb 858545.389446 92.531258
3 okUNZChvsJ 697386.417437 95.398392
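To tie this back to the original question, each partition can then be written to its own Excel sheet. A minimal sketch (the sheet names Q1 to Q4 and the output filename are assumptions based on the question):

# sketch: write each quartile to its own sheet named Q1..Q4
with pd.ExcelWriter("quartiles.xlsx") as writer:
    for i, part in enumerate(partition_quantiles(df, "score", [0.25, 0.5, 0.75]), start=1):
        part.to_excel(writer, sheet_name=f"Q{i}", index=False)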

You can use numpy's array_split for that:
import numpy as np
import pandas as pd

dfs = np.array_split(df.sort_values(by=['ESG Combined Score 2011']), 4)
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for index, df in enumerate(dfs):
    df.to_excel(writer, sheet_name=f'Sheet{index+1}')
writer.close()  # save the workbook

You can use pandas.qcut
Overall solution (edited to use part of @RJ Adriaansen's solution):
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{i+1}')
writer.close()
Input:
df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'], 'score': [1, 2, 3, 4]})
company score
0 A 1
1 B 2
2 C 3
3 D 4
Script:
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
Output:
company score categories
0 A 1 low
1 B 2 low-mid
2 C 3 mid-high
3 D 4 high
Then separate into different Excel sheets (edited to use part of @RJ Adriaansen's solution):
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{i+1}')
writer.close()
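If you want the four sheets named Q1 to Q4 as in the question, the qcut labels can double as sheet names. A minimal sketch (the output filename is an assumption; the column name is taken from the question):

df['quartile'] = pd.qcut(df['ESG Combined Score 2011'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
with pd.ExcelWriter('quartiles.xlsx', engine='xlsxwriter') as writer:
    for label, group in df.groupby('quartile'):
        group.to_excel(writer, sheet_name=str(label), index=False)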

Related

How to drill down data using pandas - pythonic way?

I have a dataframe as shown below
import numpy as np
import pandas as pd
from numpy.random import default_rng
rng = default_rng(100)
cdf = pd.DataFrame({'Id': [1, 2, 3, 4, 5],
                    'grade': rng.choice(list('ACD'), size=(5)),
                    'dash': rng.choice(list('PQRS'), size=(5)),
                    'dumeel': rng.choice(list('QWER'), size=(5)),
                    'dumma': rng.choice((1234), size=(5)),
                    'target': rng.choice([0, 1], size=(5))
                    })
My objective is to compute the drill down info for each column
Let me explain with an example.
If we filter the dataframe with df[df['grade']=='A'], we get 2 records as the result. Consider the filtered column grade as the parent_variable. Out of those 2 records, how much do the dumeel column (child_variable) values and the dash column (child_variable) values account for the target column values (which are 0 and 1)? All categorical/object columns other than the parent variable are called child variables.
We have to repeat the above example procedure for all the categorical/object variables in our dataset.
As a first step, I made use of the below from a SO post
funcs = {
    'cnt of records': 'count',
    'target met': lambda x: sum(x),
    'target met %': lambda x: f"{round(100 * sum(x) / len(x), 2):.2f}%"
}
out = df.select_dtypes('object').melt(ignore_index=False).join(df['target']) \
        .groupby(['variable', 'value'])['target'].agg(**funcs).reset_index()
out.rename(columns={'variable': 'parent_variable', 'value': 'parent_value'}, inplace=True)
But the above only gets me the % and count of the target based on each parent variable. I would like to get the breakdown by child variables as well (for each parent variable).
%_contrib is obtained by computing the % that a record contributes to the target value. For example: for dash=P, we have one grade value A (for target = 1), so it has to be 100%. Hope this helps.
I expect my output to be like as shown below. I have shown the sample only for a couple of columns under parent_variable, but in my real data there will be more than 20 categorical variables, so any efficient approach is welcome and useful.
As you are using a random function to generate the DataFrame it is hard for me to reproduce your example, but I think you are looking for value_counts -
This is the DataFrame I generated with your code -
grade dash dumeel dumma target
0 D P W 50 1
1 D S R 595 0
2 C P E 495 1
3 A Q Q 690 0
4 B P W 653 1
5 D R E 554 0
6 C P Q 392 1
7 D Q Q 186 0
8 B Q E 1228 1
9 C P E 14 0
When I do a value_counts() on the two columns -
df[(df['dash']=='P') & (df['target'] == 1)]['dumeel'].value_counts(normalize=True)
W 0.50
Q 0.25
E 0.25
Name: dumeel, dtype: float64
df[(df['dash']=='P') & (df['target'] == 1)]['grade'].value_counts(normalize=True)
C 0.50
D 0.25
B 0.25
Name: grade, dtype: float64
If you want to loop over all the child_columns - you can do
excl_cols = ['dash', 'target']
child_cols = [col for col in df.columns if col not in excl_cols]
for col in child_cols:
    print(df[(df['dash']=='P') & (df['target'] == 1)][col].value_counts(normalize=True))
If you want to loop over all the columns - then you can use:
loop_columns = set(df.columns) - {'target'}
for parent_col in loop_columns:
    print(f'Parent column is {parent_col}\n')
    parent_vals = df[parent_col].unique()
    child_cols = loop_columns - {parent_col}
    for parent_val in parent_vals:
        for child_col in child_cols:
            print(df[(df[parent_col]==parent_val) & (df['target'] == 1)][child_col].value_counts(normalize=True))
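If you need the breakdown collected into a single dataframe with parent_variable, parent_value, child_variable and child_value columns (the shape the question seems to ask for), a rough sketch of one way to build it (the drilldown function name is just for illustration):

import pandas as pd

def drilldown(df, target='target'):
    # sketch: parent/child breakdown of the target for every categorical column
    obj_cols = df.select_dtypes('object').columns
    rows = []
    for parent in obj_cols:
        for parent_val, sub in df.groupby(parent):
            for child in obj_cols.drop(parent):
                agg = sub.groupby(child)[target].agg(['count', 'sum'])
                for child_val, r in agg.iterrows():
                    rows.append({
                        'parent_variable': parent,
                        'parent_value': parent_val,
                        'child_variable': child,
                        'child_value': child_val,
                        'cnt of records': int(r['count']),
                        'target met': int(r['sum']),
                        'target met %': f"{100 * r['sum'] / r['count']:.2f}%",
                    })
    return pd.DataFrame(rows)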

Generate a Pandas Dataframe with python hypothesis library where one row is dependent on another

I'm trying to use hypothesis to generate pandas dataframes where some column values are dependent on other column values. So far, I haven't been able to 'link' two columns.
This code snippet:
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames , column, range_indexes
def create_dataframe():
    id1 = st.integers().map(lambda x: x)
    id2 = st.shared(id1).map(lambda x: x * 2)
    df = data_frames(index=range_indexes(min_size=10, max_size=100), columns=[
        column(name='id1', elements=id1, unique=True),
        column(name='id2', elements=id2),
    ])
    return df
Produces a dataframe with a static second column:
id1 program_id
0 1.170000e+02 110.0
1 3.600000e+01 110.0
2 2.876100e+04 110.0
3 -1.157600e+04 110.0
4 5.300000e+01 110.0
5 2.782100e+04 110.0
6 1.334500e+04 110.0
7 -3.100000e+01 110.0
I think that you're after the rows argument, which allows you to compute some column values from other columns. For example, if we wanted a full_price and a sale_price column where the sale price has some discount applied:
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes

def create_dataframe():
    full = st.floats(1, 1000)  # all items cost $1 to $1,000
    discounts = st.sampled_from([0, 0.1, 0.25, 0.5])
    rows = st.tuples(full, discounts).map(
        lambda xs: dict(price=xs[0], sale_price=xs[0] * (1 - xs[1]))
    )
    return data_frames(
        index=range_indexes(min_size=10, max_size=100),
        rows=rows
    )
price sale_price
0 757.264509 378.632254
1 824.384095 618.288071
2 401.187339 300.890504
3 723.193610 650.874249
4 777.171038 699.453934
5 274.321034 205.740776
So what went wrong with your example code? It looks like you imagined that the id1 and id2 strategies were defined relative to each other on a row-wise basis, but they're actually independent - and the shared() strategy shares a single value between every row in the column.
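Applying the same rows-based idea to the id1/id2 case from the question might look like this (an untested sketch; the doubling rule is just the example relationship, and note this sketch does not enforce uniqueness of id1):

from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, range_indexes

def create_linked_dataframe():
    # sketch: build each row as a dict so id2 is derived from id1 row by row
    rows = st.integers().map(lambda x: {"id1": x, "id2": x * 2})
    return data_frames(index=range_indexes(min_size=10, max_size=100), rows=rows)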

How to identify and highlight outliers in each row of a pandas dataframe

I want to do the following to my dataframe:
For each row identify outliers/anomalies
Highlight/color the identified outliers' cells (preferably 'red' color)
Count the number of identified outliers in each row (store in a column 'anomaly_count')
Export the output as an xlsx file
See below for sample data
import numpy as np
import pandas as pd

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)
df
A B C D E
0 -1.685112 -0.432143 0.876200 1.626578 1.512677
1 0.401134 0.439393 1.027222 0.036267 -0.655949
2 -0.074890 0.312793 -0.236165 0.660909 0.074468
3 0.842169 2.759467 0.223652 0.432631 -0.484871
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380
5 0.083653 0.792835 -0.643204 1.182606 -1.207692
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188
8 2.354769 1.099483 -0.653342 -0.532208 0.269307
9 0.431649 0.666982 0.361765 0.419482 0.531072
10 -0.124268 -0.170720 -0.979012 -0.410861 1.000371
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283
14 0.029966 -0.579152 0.648176 0.833141 -0.942752
15 0.824767 0.974580 0.363170 0.428062 -0.232174
The desired outcome should look something like this:
## I want to ONLY identify the outliers, NOT remove or substitute them. I only used NaN to depict the outlier values. Ideally, the outlier value cells should be colored/highlighted 'red'.
## Please note: the outlier NaNs in the sample are randomly assigned.
A B C D E Anomaly_Count
0 NaN -0.432143 0.876200 NaN 1.512677 2
1 0.401134 0.439393 1.027222 0.036267 -0.655949 0
2 -0.074890 0.312793 -0.236165 0.660909 0.074468 0
3 0.842169 NaN 0.223652 0.432631 -0.484871 1
4 -0.619873 -1.738938 -0.054074 0.337663 0.358380 0
5 0.083653 0.792835 -0.643204 NaN NaN 2
6 -1.168773 -1.456870 -0.707450 -0.439400 0.319728 0
7 2.316974 -0.177750 1.289067 -2.472729 -1.310188 0
8 2.354769 1.099483 -0.653342 -0.532208 0.269307 0
9 0.431649 0.666982 0.361765 0.419482 0.531072 0
10 -0.124268 -0.170720 -0.979012 -0.410861 NaN 1
11 -0.392863 0.933516 -0.502608 -0.759474 -1.364289 0
12 1.405442 -0.297977 0.477609 -0.046791 -0.126504 0
13 -0.711799 -1.042558 -0.970183 -1.672715 -0.524283 0
14 0.029966 -0.579152 0.648176 0.833141 -0.942752 0
15 0.824767 NaN 0.363170 0.428062 -0.232174 1
See below for my attempt, I am open to other approaches
import numpy as np
from scipy import stats

def outlier_detection(data):
    # step I: identify the outliers in each row
    df[(np.abs(stats.zscore(df)) < 3).all(axis=0)]  # unfortunately this removes the outliers, which I don't want
    # step II: color/highlight the outlier cell
    df = df.style.highlight_null('red')
    # Step III: count the number of outliers in each row
    df['Anomaly_count'] = df.isnull().sum(axis=1)
    # step IV: export as xlsx file
    df.to_excel(r'Path to store the exported excel file\File Name.xlsx', sheet_name='Your sheet name', index=False)

outlier_detection(df)
Thanks for your time.
This works for me
import numpy as np
import pandas as pd
from scipy import stats

np.random.seed([5, 1591])
df = pd.DataFrame(
    np.random.normal(size=(16, 5)),
    columns=list('ABCDE')
)

mask = pd.DataFrame(abs(stats.zscore(df)) > 1, columns=df.columns)
df["Count"] = mask.sum(axis=1)
mask["Count"] = False
style_df = mask.applymap(lambda x: "background-color: red" if x else "")

sheet_name = "Values"
with pd.ExcelWriter("score_test.xlsx", engine="openpyxl") as writer:
    df.style.apply(lambda x: style_df, axis=None).to_excel(writer,
                                                           sheet_name=sheet_name,
                                                           index=False)
Here the mask is the boolean condition, which is true where the z-score exceeds the limit. Based on this boolean mask I create a string dataframe style_df with the value 'background-color: red' for the deviating cells. The last statement imposes the values of style_df on the style of the df dataframe.
The resulting Excel file then shows the deviating cells highlighted in red.
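Since the question asks for outliers per row, note that stats.zscore works column-wise by default; a per-row variant is a one-line tweak (a sketch, keeping the rest of the code above unchanged and reusing the > 1 threshold from this answer):

# sketch: z-scores computed across each row instead of down each column
mask = pd.DataFrame(np.abs(stats.zscore(df, axis=1)) > 1, columns=df.columns)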

Creating a new feature column on grouped data in a Pandas dataframe

I have a Pandas dataframe with the columns ['week', 'price_per_unit', 'total_units']. I wish to create a new column called 'weighted_price' as follows: first group by 'week' and then for each week calculate price_per_unit * total_units / sum(total_units) for that week. I have code that does this:
import pandas as pd
import numpy as np

def create_features_by_group(df):
    # first group the data
    grouped = df.groupby(['week'])
    df_temp = pd.DataFrame(columns=['weighted_price'])
    # run through the groups and create the weighted_price per group
    for name, group in grouped:
        res = (group['total_units'] * group['price_per_unit']) / np.sum(group['total_units'])
        for idx in res.index:
            df_temp.loc[idx] = [res[idx]]
    df = df.join(df_temp['weighted_price'])
    return df
The only problem is that this is very, very slow. Is there some faster way to do this?
I used the following code to test the function.
import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['week', 'price_per_unit', 'total_units'])
for i in range(10):
    df.loc[i] = [round(int(i % 3), 0), 10 * np.random.rand(), round(10 * np.random.rand(), 0)]
I think you need to do it this way:
df
price total_units week
0 5 100 1
1 7 200 1
2 9 150 2
3 11 250 2
4 13 125 2
def fun(table):
    table['measure'] = table['price'] * (table['total_units'] / table['total_units'].sum())
    return table

df.groupby('week').apply(fun)
price total_units week measure
0 5 100 1 1.666667
1 7 200 1 4.666667
2 9 150 2 2.571429
3 11 250 2 5.238095
4 13 125 2 3.095238
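For larger frames, a vectorized variant of the same calculation avoids apply entirely. A sketch, using the question's original column names ('price_per_unit', 'total_units', 'week') rather than the toy frame above:

# sketch: per-week weighted price without groupby.apply
df['weighted_price'] = (
    df['price_per_unit'] * df['total_units']
    / df.groupby('week')['total_units'].transform('sum')
)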
I have grouped the dataset by 'Week' to calculate the weighted price for each week.
Then I joined the original dataset with the grouped dataset to get the result:
# importing the libraries
import pandas as pd
import numpy as np
# creating the dataset
df = {
    'Week': [1, 1, 1, 1, 2, 2],
    'price_per_unit': [10, 11, 22, 12, 12, 45],
    'total_units': [10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(df)
df['price'] = df['price_per_unit'] * df['total_units']
# calculate the total sales and total number of units sold in each week
df_grouped_week = df.groupby(by = 'Week').agg({'price' : 'sum', 'total_units' : 'sum'}).reset_index()
# calculate the weighted price
df_grouped_week['wt_price'] = df_grouped_week['price'] / df_grouped_week['total_units']
# merging df and df_grouped_week
df_final = pd.merge(df, df_grouped_week[['Week', 'wt_price']], how = 'left', on = 'Week')

Python Pandas Calculate percentage of return per category

I have the following python pandas dataframe:
| Number of visits per year |
user id | 2013 | 2014 | 2015 | 2016 |
A 4 3 6 0
B 3 0 7 3
C 10 6 3 0
I want to calculate the percentage of users who returned based on their numbers of visits. I am sorry, I don't have any code yet; I wasn't sure how to start this.
This is the end result I am looking for:
| Number of visits in the year |
Year | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
2014 7% 3% 4% 15% 6% 7% 18% 17% 3% 2%
2015 3% ....
2016
So based on the above I can say that 15% of clients who visited the store 4 times in 2013, came back to the store in 2014.
Thank you very much.
UPDATE: This is what I did, maybe there is a better way through a loop?
For each year, I had a csv like this:
user_id | NR_V
A 4
B 3
C 10
NR_V stands for number of visits.
So I uploaded each csv as its own df, and I had df_2009, df_2010, ... until df_2016.
For each file I added a column with 0/1 indicating whether they shopped the next year.
df_2009['shopped2010'] = np.where(df_2009['user_ID'].isin(df_2010['user_ID']), 1, 0)
Then I pivoted each dataframe.
pivot_2009 = pd.pivot_table(df_2009,index=["NR_V"],aggfunc={"NR_V":len, "shopped2010":np.sum})
Next, for each dataframe I created a new dataframe with the a column calculating the percentage by number of visits.
p_2009 = pd.DataFrame()
p_2009['%returned2010'] = (pivot_2009['shopped2010']/pivot_2009['NR_V'])*100
Finally, I merged all those dataframes into one.
dfs = [p_2009, p_2010, p_2011, p_2012, p_2013, p_2014, p_2015 ]
final = pd.concat(dfs, axis=1)
Consider the sample visits dataframe df
df = pd.DataFrame(
    np.random.randint(1, 10, (100, 5)),
    pd.Index(['user_{}'.format(i) for i in range(1, 101)], name='user id'),
    [
        ['Number of visits per year'] * 5,
        [2012, 2013, 2014, 2015, 2016]
    ]
)

df.head()
You can apply pd.value_counts with parameter normalize=True.
Also, since an entry of 8 represents 8 separate visits, it should count 8 times. I'll use repeat to accomplish this prior to value_counts
def count_visits(col):
    v = col.values
    return pd.value_counts(v.repeat(v), normalize=True)

df.apply(count_visits).stack().unstack(0)
I used the index value of every visitor and checked whether the same index value (i.e. the same Visitor_ID) had more than 0 visits the next year. The result was then added to a dictionary as True or False, which you could use for a bar chart. I also made two lists (times_returned and returned_at_all) for additional data manipulation.
import pandas as pd

# Part 1, Building the dataframe.
df = pd.DataFrame({
    'Visitor_ID': [1, 2, 3],
    '2010': [4, 3, 10],
    '2011': [3, 0, 6],
    '2012': [6, 7, 3],
    '2013': [0, 3, 0]
})
df.set_index("Visitor_ID", inplace=True)

# Part 2, preparing the required variables.
def dictionary(max_visitors):
    dictionary = {}
    for x in range(max_visitors):
        dictionary["number_{}".format(x)] = []
    # print(dictionary)
    return dictionary

# Part 3, Figuring out if the customer returned.
def compare_yearly_visits(current_year, next_year):
    index = 1
    years = df.columns
    for x in df[current_year]:
        # print(df[years][current_year][index], 'this year.')
        # print(df[years][next_year][index], 'Next year.')
        how_many_visits = df[years][current_year][index]
        did_he_return = df[years][next_year][index]
        if did_he_return > 0:
            # If the visitor returned, add to a bunch of formats:
            returned_at_all.append([how_many_visits, True])
            times_returned.append([how_many_visits, did_he_return])
            dictionary["number_{}".format(x)].append(True)
        else:
            # If the visitor did not return, add to a bunch of formats:
            returned_at_all.append([how_many_visits, False])
            dictionary["number_{}".format(x)].append(False)
        index = index + 1

# Part 4, The actual program:
highest_amount_of_visits = 11  # should be done automatically, max(visits)?
relevant_years = len(df.columns) - 1
times_returned = []
returned_at_all = []
dictionary = dictionary(highest_amount_of_visits)

for column in range(relevant_years):
    # print(dictionary)
    this_year = df.columns[column]
    next_year = df.columns[column + 1]
    compare_yearly_visits(this_year, next_year)
    print("cumulative dictionary up to:", this_year, "\n", dictionary)
Please find below my solution. As a note, I am pretty positive that this can be improved.
# step 0: create data frame
df = pd.DataFrame({'2013': [4, 3, 10], '2014': [3, 0, 6], '2015': [6, 7, 3], '2016': [0, 3, 0]}, index=['A', 'B', 'C'])

# container list of dataframes to be concatenated
frames = []

# iterate through the dataframe one column at a time and determine its value_counts (freq table)
for name, series in df.iteritems():
    frames.append(series.value_counts())

# merge the frequency tables for all columns into one dataframe
temp_df = pd.concat(frames, axis=1).transpose().fillna(0)

# find the range of visit counts in the new dataframe, and append missing ones as zero columns
cols = temp_df.columns
min_col = cols.min()
max_col = cols.max()
for i in range(min_col, max_col):
    if i not in cols:
        temp_df[i] = 0

# calculate percentage
final_df = temp_df.div(temp_df.sum(axis=1), axis=0)
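For comparison, a compact way to get the table in the question directly from a wide dataframe of visit counts might look like this (a sketch; the rule for "returned" is assumed to be "visited at least once the following year", and the return_rate name is just for illustration):

import pandas as pd

def return_rate(visits):
    # visits: one row per user, one column per year, values = number of visits
    out = {}
    years = list(visits.columns)
    for this_year, next_year in zip(years[:-1], years[1:]):
        active = visits[visits[this_year] > 0]
        returned = active[next_year] > 0
        # % of users who returned next year, grouped by visit count this year
        out[next_year] = returned.groupby(active[this_year]).mean() * 100
    return pd.DataFrame(out).T

# usage with the sample data from step 0 above:
# print(return_rate(df[['2013', '2014', '2015', '2016']]))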
