I have the following dataframe, df, with millions of rows:
id score
140 0.1223
142 0.01123
148 0.1932
166 0.0226
.. ..
My problem is: how can I study the distribution across percentiles?
My idea was to divide score into percentiles and see what percentage of the total corresponds to each one.
I would like to get something like:
percentile  countofindex  percentage
1           154.000       20%
2           100.320       17%
3           250.000       21%
...
where countofindex is the number of distinct ids, and percentage is the share that the first, second, ... percentile represents.
So far I have df['percentage'] = df['score'] / df['score'].sum() * 100, but this is the percentage over all the data, not per percentile.
To get the percentage of each score you can get the sum of all scores and divide each one by it:
df= pd.DataFrame({'score': [0.1223,0.01123,0.1932]})
df['percentage'] = df['score'] / df['score'].sum() * 100
score percentage
0 0.12230 37.431518
1 0.01123 3.437089
2 0.19320 59.131393
To rank the scores you can use .sort_values (note that the result has to be assigned back, since sort_values does not modify df in place):
df = df.sort_values(by=['percentage'], ascending=False)
df.insert(1, 'percentile', range(1, len(df) + 1))
score percentile percentage
2 0.19320 1 59.131393
0 0.12230 2 37.431518
1 0.01123 3 3.437089
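If the goal is the bucketed summary from the question (count of ids and share of the total score per percentile) rather than per-row percentages, a rough sketch using pd.qcut could look like this, assuming the original df with id and score columns:
import pandas as pd
# bucket scores into (up to) 100 percentile groups; ties may reduce the number of buckets
df['percentile'] = pd.qcut(df['score'], 100, labels=False, duplicates='drop') + 1
# count distinct ids and sum the scores per bucket (column names follow the question)
summary = (df.groupby('percentile')
             .agg(countofindex=('id', 'nunique'),
                  percentage=('score', 'sum')))
# convert each bucket's score sum into a share of the overall total
summary['percentage'] = summary['percentage'] / df['score'].sum() * 100
print(summary)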
Let's go through the following example.
print(df)
0
0 0.127975
1 0.146976
2 0.721326
3 0.003722
df[0].sum()
1.0
Now, to create the chart:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_axes([0,0,1,1])
langs = [str(round(v * 100, 2)) for v in df[0]]
data = [round(v * 100, 2) for v in df[0]]
ax.bar(langs,data)
plt.ylabel("Percentiles")
plt.xlabel("Values")
plt.xticks(rotation=45)
plt.show()
If you want to add it to the dataset, you can use the code below.
df['Percentiles (%)'] = df[0].apply(lambda x: round(x * 100, 2))
print(df)
0 Percentiles (%)
0 0.127975 12.80
1 0.146976 14.70
2 0.721326 72.13
3 0.003722 0.37
I'm trying to create a new df from race_dbs that's grouped by 'horse_id' showing the number of times 'place' = 1 as well as the total number of times that 'horse_id' occurs.
Some background on the dataset, in case it's helpful:
race_dbs contains horse race data. There are 12 horses in a race, and for each one the odds, fire, place, time, and gate number are shown.
What I'm trying to achieve from this code is the calculation of win rates for each horse.
A win is denoted by 'place' = 1
Total race count will be calculated by how many times a particular 'horse_id' occurs in the db.
race_dbs
race_id    horse_id  odds  fire  place  horse_time  gate
V14qANzi   398807    NaN   0     1      72.0191     7
xeieZak    191424    NaN   0     8      131.3010    10
xeieZak    139335    NaN   0     1      131.3713    9
xeieZak    137195    NaN   0     11     131.6310    11
xeieZak    398807    NaN   0     12     131.7886    2
...        ...       ...   ...   ...    ...         ...
From this simple table the output would look like the following, but please bear in mind my dataset is very large, containing 12882353 rows in total.
desired output
horse_id  wins  races  win rate
398807    1     2      50%
191424    0     1      0%
139335    1     1      100%
137195    0     1      0%
...       ...   ...    ...
It should be noted that I'm a complete coding beginner so forgive me if this is an easy solve.
I have tried to use the groupby and lambda pandas functions but I am struggling to combine both functions, and believe there will be a much simpler way.
import pandas as pd
race_db = pd.read_csv('horse_race_data_db.csv')
race_db_2 = pd.read_csv('2_horse_race_data.csv')
frames = [race_db, race_db_2]
race_dbs = pd.concat(frames, ignore_index=True, sort=False)
race_dbs_horse_wins = race_dbs.groupby('horse_id')['place'].apply(lambda x: x[x == 1].count())
race_dbs_horse_sums = race_dbs.groupby('horse_id').aggregate({"horse_id":['sum']})
Thanks for the help!
To count the True values, create a helper boolean column and aggregate it with sum; for the win rate aggregate the same column with mean; and for the race count use GroupBy.size. All three can be expressed as named aggregations in GroupBy.agg:
out = (race_dbs.assign(no1=race_dbs['place'].eq(1))
               .groupby('horse_id', sort=False, as_index=False)
               .agg(**{'wins': ('no1', 'sum'),
                       'races': ('horse_id', 'size'),
                       'win rate': ('no1', 'mean')}))
print (out)
horse_id wins races win rate
0 398807 1 2 0.5
1 191424 0 1 0.0
2 139335 1 1 1.0
3 137195 0 1 0.0
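If the win rate should be shown as a percentage string, as in the desired output, a purely cosmetic follow-up could be:
# format the win rate as a percentage string, e.g. 0.5 -> '50%'
out['win rate'] = (out['win rate'] * 100).round(0).astype(int).astype(str) + '%'
print(out)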
Can you try this way?
Example code:
import pandas as pd
import numpy as np
new_technologies= {
'Courses':["Python","Java","Python","Ruby","Ruby"],
'Fees' :[22000,25000,23000,24000,26000],
'Duration':['30days','50days','30days', '30days', '30days']
}
print('new_technologies:',new_technologies)
df = pd.DataFrame(new_technologies)
print('df:',df)
# calculate percentage of aggregated values
df2 = df.groupby(['Courses', 'Fees']).agg({'Fees': 'sum'})
print(df2)
# Percentage by lambda and DataFrame.apply() method.
df3 = df2.groupby(level=0).apply(lambda x:100 * x / float(x.sum()))
print(df3)
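A shorter sketch of the same idea (each Fees value as a percentage of its Courses group total) uses transform instead of a second groupby; the Fees_pct column name here is just illustrative:
# percentage of each Fees value within its Courses group
df['Fees_pct'] = df['Fees'] / df.groupby('Courses')['Fees'].transform('sum') * 100
print(df)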
I prefer to use matrix operations in my code because they're so much more efficient than iterating, but I'm curious how to do this when the dimensions are different.
I have two different dataframes
A:
Orig_vintage
Q12018    185
Q22018    200
and B:
default_month   1   2   3
orig_vintage
Q12018          0  25  35
Q22018          0  15  45
Q32018          0  35  65
and I'm trying to divide the columns of B by A, so that the B dataframe becomes (note these are made-up, rounded percentages):
default_month    1    2    3
orig_vintage
Q12018           0  .03  .04
Q22018           0  .04  .05
Q32018           0  .06  .07
But the bottom line is that I want to divide the monthly defaults by the total origination figure to get a monthly default %.
The first step is to get the data side by side with a right join().
Then divide all the columns by the required value (see: Divide multiple columns by another column in pandas).
The required value, as I understand it, is the overall sum whenever the join did not supply a value.
import pandas as pd
import io
df1 = pd.read_csv(
io.StringIO("""Orig_vintage,Unnamed: 1\nQ12018,185\nQ22018,200\n"""), sep=","
)
df2 = pd.read_csv(
io.StringIO(
"""default_month,1,2,3\nQ12018,0.0,25.0,35.0\nQ22018,0.0,15.0,45.0\nQ32018,0.0,35.0,65.0\n"""
),
sep=",",
)
df1.set_index("Orig_vintage").join(df2.set_index("default_month"), how="right").pipe(
lambda d: d.div(d["Unnamed: 1"].fillna(d["Unnamed: 1"].sum()), axis=0)
)
               Unnamed: 1  1          2          3
default_month
Q12018         1           0  0.135135   0.189189
Q22018         1           0  0.075      0.225
Q32018         nan         0  0.0909091  0.168831
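As a side note, if the totals from A are kept as a Series indexed by vintage, essentially the same division can be sketched more directly (vintages missing from A simply come out as NaN here, rather than being divided by the overall sum):
# divide each row of B by the matching total in A
totals = df1.set_index("Orig_vintage")["Unnamed: 1"]
result = df2.set_index("default_month").div(totals, axis=0)
print(result)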
I'm a biology student who is fairly new to Python, and I was hoping someone might be able to help with a problem I have yet to solve.
With some earlier code I have created a pandas dataframe that looks like the example below:
Distance.  No. of values  Mean rSquared
1          500            0.6
2          80             0.3
3          40             0.4
4          30             0.2
5          50             0.2
6          30             0.1
I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.
I need to sum the number of values column until I achieve a value >= 100; and then combine the data of the rows of the adjacent columns, taking the weighted average of the distance and mean r2 values, as seen in the example below
Mean Distance.          No. of values      Mean rSquared
1                       500                0.6
(80*2+40*3)/120         (80+40) = 120      (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110    (30+50+30) = 110   (30*0.2+50*0.2+30*0.1)/110
etc...
I know pandas has its .cumsum function, which I might be able to implement in a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns.
Any help would be appreciated!
You can use this code snippet to solve your problem.
import pandas as pd
# First, compute some weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]
min_threshold = 100
indexes = []
temp_sum = 0
# placeholder for the final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]
# reset the index so positional access with .iloc below lines up with iterrows
df = df.reset_index(drop=True)
# main loop: accumulate rows until the threshold is reached, then collapse them
for index, _ in df.iterrows():
    temp_sum += df.iloc[index]["No. of values"]
    indexes.append(index)
    # if the running sum reaches 'min_threshold', compute the weighted averages
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
        # create a temporary dataframe and concatenate it with 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df])
        # reset the accumulators
        temp_sum = 0
        indexes = []
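One caveat with the loop above: trailing rows whose running sum never reaches the threshold are silently dropped. A small addition, reusing the same variable names, that keeps them and prints the result could be:
# flush any leftover rows that never reached the threshold
if indexes:
    temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
    temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum
    leftover = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
    final_df = pd.concat([final_df, leftover])
print(final_df)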
Numpy has a function, numpy.frompyfunc. You can use it to get a cumulative value that restarts based on a threshold.
Here's how to implement it. With that, you can then figure out the index when the value goes over the threshold. Use that to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.
I also leveraged #sujanay's idea of calculating the weighted values first.
c = ['Distance','No. of values','Mean rSquared']
d = [[1,500,0.6], [2,80,0.3], [3,40,0.4],
[4,30,0.2], [5,50,0.2], [6,30,0.1]]
import pandas as pd
import numpy as np
df = pd.DataFrame(d,columns=c)
#calculate the weighted distance and weighted mean squares first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]
#use numpy.frompyfunc to setup the threshold condition
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
#assign value to cumvals based on threshold
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)
#find out all records that have >= 100 as cumulative values
idx = df.index[df['cumvals'] >= 100].tolist()
#if last row not in idx, then add it to the list
if (len(df)-1) not in idx: idx += [len(df)-1]
#iterate thru the idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j, 'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j, 'cumvals']).round(2)
    df.loc[j, 'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j, 'cumvals']).round(2)
    i = j + 1
print(df)
The output of this will be:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
1 2 80 ... NaN NaN
2 3 40 ... 2.33 0.33
3 4 30 ... NaN NaN
4 5 50 ... NaN NaN
5 6 30 ... 5.00 0.17
If you want to extract only the records that are non NaN, you can do:
final_df = df[df['Mean Distance'].notnull()]
This will result in:
Distance No. of values ... Mean Distance New Mean rSquared
0 1 500 ... 1.00 0.60
2 3 40 ... 2.33 0.33
5 6 30 ... 5.00 0.17
I looked up BEN_YO's implementation of numpy.frompyfunc. The original SO post can be found here: Restart cumsum and get index if cumsum more than value.
If you figure out the grouping first, pandas groupby-functionality will do a lot of the remaining work for you. A loop is appropriate to get the grouping (unless somebody has a clever one-liner):
>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]
Before doing the grouped operations you need to use the No. of values information to get the weighting in:
df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
Now get the sums like this:
>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0 500
1 120
2 110
Name: No. of values, dtype: int64
And finally the weighted group averages like this:
>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
Distance. Mean rSquared
0 1.000000 0.600000
1 2.333333 0.333333
2 5.000000 0.172727
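To assemble these pieces into a single frame like the one in the question (weighted means plus the group sizes), a possible last step with the variables above is:
# combine the weighted group averages with the group sizes into one frame
result = df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
result.insert(1, "No. of values", sums)
print(result)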
I want to create a new column in my table by implementing an equation, but there are 2 possible equations for the new column.
(1) frequency = (total x 100) / hour
(2) frequency = (total x 1000000) / km_length
the table is similar to this:
type hour km_length total
A 1 - 1
B - 2 1
The calculation for the "frequency" column depends on which of the two columns, hour or km_length, has a value.
then, I expect the table will be like this:
type hour km_length total frequency
A 1 - 1 100
B - 2 1 500000
I have tried using np.nan_to_num before, but it did not produce the table I expect.
Is there any way I can do this in Python? Looking forward to any help.
Thank you.
We can use np.where for assigning values based on a condition:
df[["hour", "km_length"]] = df[["hour", "km_length"]].apply(pd.to_numeric, errors="coerce")
df["frequency"] = np.where(
df["km_length"].isna(),
df["total"] * 100 / df["hour"],
df["total"] * 1_000_000 / df["km_length"]
)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
Make your values numeric, then calculate. Because a missing value indicates which method to use, and because division involving NaN results in NaN, do both calculations and use .fillna to determine the correct resulting value.
df[['hour', 'km_length']] = df[['hour', 'km_length']].apply(pd.to_numeric, errors='coerce')
s1 = df['total'].divide(df['hour']).multiply(100)
s2 = df['total'].divide(df['km_length']).multiply(10**6)
df['frequency'] = s1.fillna(s2)
type hour km_length total frequency
0 A 1.0 NaN 1 100.0
1 B NaN 2.0 1 500000.0
You can store the data in a numpy array.
import numpy as np
# rows: [hour, km_length, total, frequency]; np.nan marks a missing value and
# frequency starts at 0 (dividing by NaN does not raise an error, so check
# with np.isnan instead of relying on try/except)
table = np.array([[1.0,    np.nan, 1.0, 0.0],
                  [np.nan, 2.0,    1.0, 0.0]])
for row in table:
    if not np.isnan(row[0]):                  # hour available -> equation (1)
        row[3] = (row[2] * 100) / row[0]
    else:                                     # otherwise use km_length -> equation (2)
        row[3] = (row[2] * 1000000) / row[1]
print(table)
This prints the numeric table with the frequency column filled in (100 and 500000 for the example rows).
I'm learning pandas and have a query about aggregate functions. Apologies for what might be a very basic question for experts on this forum :).
Here's a sample of my dataset:
EmpID Age_Range Salary
0 321 20, 35 34000
1 561 20, 35 24000
2 789 50, 65 34000
The above dataset is df, and I'm saving avg. salary info per employee age range into a separate dataframe (df_age), where I'm persisting the above data. I was able to successfully apply mean() on the Salary column to get the avg. salary per age range.
So basically what I want is the count of employees for each age_range.
df_age['EmpCount'] = df.groupby('Age_Range')['EmpID'].count() doesn't work, and returns a 'NaN' in my dataset.
additionally, when I used the transform function
df_age['EmpCount'] = df.groupby('Age_Range')['EmpID'].transform(count)
it returns values, but the same value across the three age ranges - 37, which is not correct. There are a total of 100 entries in my dataset.
desired output for df_age:
0 (20, 35] 50000 27
1 (35, 50] 37000 11
2 (50, 65] 65000 30
Thanks!
If I understood your question correctly, you want a new column with the count of employees per Age_Range. You can use aggregate functions to get your answer as follows:
df_age = df.set_index(['Age_Range', 'EmpID']).groupby(level=0).size().reset_index(name='count_of_employees')
df_age['Ave_Salary'] = df_age['Age_Range'].map(df.groupby('Age_Range')['Salary'].mean())
You can use size or len in a transform, just like you did with count:
# Dummy data
df = pd.DataFrame({"sample": ["sample1", "sample2", "sample2", "sample3", "sample3", "sample3"]})
df["number_of_samples"] = df.groupby("sample").sample.transform("size")
df["number_of_samples_again"] = df.groupby("sample").sample.transform(len)
Output:
sample number_of_samples number_of_samples_again
0 sample1 1 1
1 sample2 2 2
2 sample2 2 2
3 sample3 3 3
4 sample3 3 3
5 sample3 3 3
I have found a solution to this, but it's not neat / efficient:
df_age1 = df.groupby('Age_Range')['Salary'].mean()
df_age1 = df_age1.reset_index()
df_age1.rename(columns={'Salary':'SalAvg'}, inplace=True)
df_age2 = df.groupby('Age_Range')['EmpID'].count()
df_age2 = df_age2.reset_index()
df_age2.rename(columns={'EmpID':'EmpCount'}, inplace=True)
Then finally,
df_age = pd.merge(df_age1, df_age2, on='Age_Range')
The above iteration gives me what I need, but across three dataframes - I'll obviously be ignoring df_age1 and 2, but I'm still on the lookout for an efficient answer!
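For reference, a single groupby with named aggregation should produce both columns in one pass (a sketch reusing the column names from above):
# average salary and employee count per age range in one groupby call
df_age = (df.groupby('Age_Range', as_index=False)
            .agg(SalAvg=('Salary', 'mean'),
                 EmpCount=('EmpID', 'count')))
print(df_age)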