This question already has answers here:
Pandas percentage of total with groupby
(16 answers)
Closed 4 years ago.
I have a pandas groupby that I've done
grouped = df.groupby(['name','type'])['count'].count().reset_index()
Looks like this:
name type count
x a 32
x b 1111
x c 4214
What I need to do is take this and generate percentages, so i would get something like this (I realize the percentages are incorrect):
name type count
x a 1%
x b 49%
x c 50%
I can think of some pseudocode that might make sense but I haven't been able to get anything that actually works...
something like
def getPercentage(df):
for name in df:
total = 0
where df['name'] = name:
total = total + df['count']
type_percent = (df['type'] / total) * 100
return type_percent
df.apply(getPercentage)
Is there a good way to do this with pandas?
Try:
df.loc[:,'grouped'] = df.groupby(['name','type'])['count'].count() / df.groupby(['name','type'])['count'].sum()
Using crosstab + normalize
pd.crosstab(df.name,df.type,normalize='index').stack().reset_index()
Any series can be normalized by just passing in an argument "normalize=False" as follows (it's cleaner than deviding by count):
Series.value_counts(normalize=True, sort=True, ascending=False)
So, it will be something like (which is a series, not a dataframe):
df['type'].value_counts(normalize=True) * 100
or, if you use groupby, you can simply do:
total = grouped['count'].sum()
grouped['count'] = grouped['count']/total * 100
Related
I have the following dataframe df, in which I highlighted in green the cells with values of interest:
enter image description here
and I would like to obtain for each columns (therefore by considering the whole dataframe) the following statistics: the occurrence of a value less or equal to 0.5 (green cells in the dataframe) -Nan values are not to be included- and its percentage in the considered columns in order to use say 50% as benchmark.
For the point asked I tried with value_count like (df['A'].value_counts()/df['A'].count())*100, but this returns the partial result not the way I would and only for specific columns; I was also thinking about using filter or lamba function like df.loc[lambda x: x <= 0.5] but cleary that is not the result I wanted.
The goal/output will be a dataframe as shown below in which are displayed just the columns that "beat" the benchmark (recall: at least (half) 50% of their values <= 0.5).
enter image description here
e.g. in column A the count would be 2 and the percentage: 2/3 * 100 = 66%, while in column B the count would be 4 and the percentage: 4/8 * 100 = 50%. (The same goes for columns X, Y and Z). On the other hand in column C where 2/8 *100 = 25% won't beat the benchmark and therefore not considered in the output.
Is there a suitable way to achieve this IYHO? Apologies in advance if this was a kinda duplicated question but I found no other questions able to help me out, and thx to any saviour.
I believe I have understood your ask in the below code...
It would be good if you could provide an expected output in your question so that it is easier to follow.
Anyways the first part of the code below is just set up so can be ignored as you already have your data set up.
Basically I have created a quick function for you that will return the percentage of values that are under a threshold that you can define.
This function is called in a loop of all the columns within your dataframe and if this percentage is more than the output threshold (again you can define it) it will keep it for actually outputting.
import pandas as pd
import numpy as np
import random
import datetime
### SET UP ###
base = datetime.datetime.today()
date_list = [base - datetime.timedelta(days=x) for x in range(10)]
def rand_num_list(length):
peak = [round(random.uniform(0,1),1) for i in range(length)] + [0] * (10-length)
random.shuffle(peak)
return peak
df = pd.DataFrame(
{
'A':rand_num_list(3),
'B':rand_num_list(5),
'C':rand_num_list(7),
'D':rand_num_list(2),
'E':rand_num_list(6),
'F':rand_num_list(4)
},
index=date_list
)
df = df.replace({0:np.nan})
##############
print(df)
def less_than_threshold(thresh_df, thresh_col, threshold):
if len(thresh_df[thresh_col].dropna()) == 0:
return 0
return len(thresh_df.loc[thresh_df[thresh_col]<=threshold]) / len(thresh_df[thresh_col].dropna())
output_dict = {'cols':[]}
col_threshold = 0.5
output_threshold = 0.5
for col in df.columns:
if less_than_threshold(df, col, col_threshold) >= output_threshold:
output_dict['cols'].append(col)
df_output = df.loc[:,output_dict.get('cols')]
print(df_output)
Hope this achieves your goal!
I'm new to Python and stackoverflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
df before adjustment
I want to plot this df in ArcGIS but for that I need to replace the (mathematical) coordinates that currently exist by the real life geograhical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly but my problem is that it takes much too long to run because of the size of the df and I haven't yet got a result after some hours because I used the row number as the range like this:
for i in range(0,3108729):
if i == 0:
df.at[i,'IDX'] = 1055000
else:
df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
if j == 0:
df.at[j,'IDY'] = 6315000
else:
df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range? I appreciate any help !!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5
I would like to calculate a sum of variables for a given day. Each day contains a different calculation, but all the days use the variables consistently.
There is a df which specifies my variables and a df which specifies how calculations will change depending on the day.
How can I create a new column containing answers from these different equations?
import pandas as pd
import numpy as np
conversion = [["a",5],["b",1],["c",10]]
conversion_table = pd.DataFrame(conversion,columns=['Variable','Cost'])
data1 = [[1,"3a+b"],[2,"c"],[3,"2c"]]
to_solve = pd.DataFrame(data1,columns=['Day','Q1'])
desired = [[1,16],[2,10],[3,20]]
desired_table=pd.DataFrame(desired,columns=['Day','Q1 solved'])
I have separated my variables and equations based on row. Can I loop though these equations to find non-numerics and re-assign them?
#separate out equations and values
for var in conversion_table["Variable"]:
cost=(conversion_table.loc[conversion_table['Variable'] == var, 'Cost']).mean()
for row in to_solve["Q1"]:
equation=row
A simple suggestion, perhaps you need to rewrite a part of your code. Not sure if your want something like this:
a = 5
b = 1
c = 10
# Rewrite the equation that is readable by Python
# e.g. replace 3a+b by 3*a+b
data1 = [[1,"3*a+b"],
[2,"c"],
[3,"2*c"]]
desired_table = pd.DataFrame(data1,
columns=['Day','Q1'])
desired_table['Q1 solved'] = desired_table['Q1'].apply(lambda x: eval(x))
desired_table
Output:
Day Q1 Q1 solved
0 1 3*a+b 16
1 2 c 10
2 3 2*c 20
If it's possible to have the equations changed to equations with * then you could do this.
Get the mapping from the
mapping = dict(zip(conversion_table['Variable'], conversion_table['Cost'])
the eval the function and replace variables with numeric from the mapping
desired_table['Q1 solved'] = to_solve['Q1'].map(lambda x: eval(''.join([str(mapping[i]) if i.isalpha() else str(i) for i in x])))
0 16
1 10
2 20
Coming from R, the code would be
x <- data.frame(vals = c(100,100,100,100,100,100,200,200,200,200,200,200,200,300,300,300,300,300))
x$state <- cumsum(c(1, diff(x$vals) != 0))
Which marks every time the difference between rows is non-zero, so that I can use it to spot transitions in data, like so:
vals state
1 100 1
...
7 200 2
...
14 300 3
What would be a clean equivalent in Python?
Additional question
The answer to the original question is posted below, but won't work properly for a grouped dataframe with pandas.
Data here: https://pastebin.com/gEmPHAb7. Notice that there are 2 different filenames.
When imported as df_all I group it with the following, and then apply solution posted below.
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
Using diff and cumsum, as in your R example:
df['state'] = (df['vals'].diff()!= 0).cumsum()
This uses the fact that True has integer value 1
Bonus question
df_grouped = df_all.groupby("filename")
df_all["state"] = (df_grouped['Fit'].diff() != 0).cumsum()
I think you misunderstand what groupby does. All groupby does is create groups based on the criterium (filename in this instance). You then need to tell add another operation to tell what needs to happen with this group.
Common operations are mean, sum, or more advanced as apply and transform.
You can find more information here or here
If you can explain more in detail what you want to achieve with the groupby I can help you find the correct method. If you want to perform the above operation per filename, you probably need something like this:
def get_state(group):
return (group.diff()!= 0).cumsum()
df_all['state'] = df_all.groupby('filename')['Fit'].transform(get_state)
I have a dataframe:
Type Name Cost
A X 545
B Y 789
C Z 477
D X 640
C X 435
B Z 335
A X 850
B Y 152
I have all such combinations in my dataframe with Type ['A','B','C','D'] and Names ['X','Y','Z'] . I used the groupby method to get stats on a specific combination together like A-X , A-Y , A-Z .Here's some code:
df = pd.DataFrame({'Type':['A','B','C','D','C','B','A','B'] ,'Name':['X','Y','Z','X','X','Z','X','Y'], 'Cost':[545,789,477,640,435,335,850,152]})
df.groupby(['Name','Type']).agg([mean,std])
#need to use mad instead of std
I need to eliminate the observations that are more than 3 MADs away ; something like:
test = df[np.abs(df.Cost-df.Cost.mean())<=(3*df.Cost.mad())]
I am confused with this as df.Cost.mad() returns the MAD for the Cost on the entire data rather than a specific Type-Name category. How could I combine both?
You can use groupby and transform to create new data series that can be used to filter out your data.
groups = df.groupby(['Name','Type'])
mad = groups['Cost'].transform(lambda x: x.mad())
dif = groups['Cost'].transform(lambda x: np.abs(x - x.mean()))
df2 = df[dif <= 3*mad]
However, in this case, no row is filtered out since the difference is equal to the mean absolute deviation (the groups have only two rows at most).
You can get your aggregate function on the grouped object:
df["mad"] = df.groupby(['Name','Type'])["Cost"].transform("mad")
df = df.loc[df.mad<3]