Afternoon,
I am trying to recreate a table, replacing the raw numbers with each value's percentage of its column total. For instance, I have:
Code 03/31/2016 12/31/2015 09/30/2015
F55 425 387 369
F554 109 106 106
F508 105 105 106
The desired output is a new dataframe with the numbers replaced by percentages, where the total is the sum of the column (03/31/2016 total = 425 + 109 + 105):
Code 03/31/2016 12/31/2015 09/30/2015
F55 66.5% 64.7% 63.5%
F554 17% 17.7% 18.2%
F508 16.4% 17.5% 18.2%
Thanks for your help.
I'm sure there's a more elegant answer somewhere but this will work:
df['03/31/2016'].apply(lambda x : x/df['03/31/2016'].sum())
or if you want to do this for the entire dataframe:
df.apply(lambda x : x/x.sum(), axis=0)
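To reproduce the percent strings shown in the desired output, here is a minimal sketch (the sample dataframe is reconstructed from the question; keeping Code out of the arithmetic and rounding to one decimal are my choices):
import pandas as pd

df = pd.DataFrame({'Code': ['F55', 'F554', 'F508'],
                   '03/31/2016': [425, 109, 105],
                   '12/31/2015': [387, 106, 105],
                   '09/30/2015': [369, 106, 106]})

pct = df.set_index('Code')                       # keep the Code labels out of the math
pct = (100 * pct / pct.sum()).round(1).astype(str) + '%'
print(pct)
Note that set_index('Code') matters: applying x/x.sum() across the whole frame would fail on the string-valued Code column.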
Related
I have my dataframe object df which looks like this:
product 7.month 8.month 9.month 10.month 11.month 12.month 1.month 2.month 3.month 4.month 5.month 6.month
0 phone 68 137 202 230 143 220 110 173 187 149 204 90
1 television <same kind of numerical data>
2
3
4
...
I would like to plot this data, but I'm not sure how, because the months run horizontally (as columns) and I have around 20 products (rows) in my dataframe, so the plot still needs to be readable.
Transpose the dataframe
df1 = df.T
and now plot df1
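For example, a minimal sketch (assuming matplotlib, and that the product column should become the line labels):
import matplotlib.pyplot as plt

df1 = df.set_index('product').T   # months become the index, products become columns
df1.plot(figsize=(10, 6))         # one line per product, months along the x-axis
plt.show()
Setting the index to product before transposing gives each line a readable product name in the legend.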
I agree and recommend Aavesh's approach. However, if it is absolutely necessary to access the data horizontally, you can use list(df.iloc[index]), where index is the positional index of the row.
Then plot.
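A minimal sketch of that row-based access (the value of index and the column slicing are assumptions based on the layout above):
import matplotlib.pyplot as plt

index = 0                             # positional index of the row to plot
values = list(df.iloc[index])[1:]     # drop the 'product' cell, keep the monthly numbers
months = df.columns[1:]
plt.plot(months, values)
plt.title(df.iloc[index]['product'])
plt.show()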
I am having an algorithmic problem which I am trying to solve in Python. I have a pandas dataframe of two columns, say (I have kept it sorted in descending order here to make the problem easier to explain):
df:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
LA7 185
LA8 180
LA9 150
LA10 100
I have a threshold value for BCOL, say 215. What I want is the maximal subset of the above dataframe whose BCOL mean is greater than or equal to 215.
In this case, if I keep the BCOL values down to 200, the mean of (234, 230, ..., 200) is 218.67, whereas if I keep values down to 185 (234, 230, ..., 200, 185), the mean is 213.86. So the maximal subset whose BCOL mean stays at or above 215 is (234, ..., 200), and I drop the remaining rows. My final output dataframe should be:
dfnew:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
I was trying to put BCOL into a list and use a for/while loop, but that is not Pythonic and is also time consuming for a very large table. Is there a more Pythonic way to achieve this in pandas?
I will appreciate any help. Thanks.
IIUC, you could do:
import numpy as np

# guarantee that the DF is sorted in descending order
df = df.sort_values(by=['BCOL'], ascending=False)
# cumulative mean (running sum over running count), then find where it is >= 215
mask = (df['BCOL'].cumsum() / np.arange(1, len(df) + 1)) >= 215.0
print(df[mask])
Output
ACOL BCOL
0 LA1 234
1 LA2 230
2 LA3 220
3 LA4 218
4 LA5 210
5 LA6 200
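An equivalent variant replaces the cumsum/arange pair with pandas' built-in running mean via Series.expanding(); a sketch using the same sorted data:
df = df.sort_values(by=['BCOL'], ascending=False)
mask = df['BCOL'].expanding().mean() >= 215   # running mean over each prefix
dfnew = df[mask]
print(dfnew)
Because the values are sorted in descending order, the running mean never increases, so the mask selects a contiguous prefix, which is exactly the maximal subset being asked for.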
In the Titanic dataset, I wish to calculate the percentage of passengers who survived in each passenger class (Pclass) 1, 2 & 3. I figured out how to get the count of passengers and the number of survivors using groupby, as below:
train[['PassengerId','Pclass','Survived']]\
.groupby('Pclass')\
.agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
SurvivedPassengerCount=pd.NamedAgg(column='Survived',aggfunc='sum'))
So, I get the below output:
PassengerCount SurvivedPassengerCount
Pclass
1 216 136
2 184 87
3 491 119
But how do I get a percentage column? I want the output as below:
PassengerCount SurvivedPassengerCount PercSurvived
Pclass
1 216 136 62.9%
2 184 87 47.3%
3 491 119 24.2%
Thanks in advance!
Since you only need to divide SurvivedPassengerCount by PassengerCount, you can do this using the .assign method:
result = train[['PassengerId','Pclass','Survived']]\
    .groupby('Pclass')\
    .agg(PassengerCount=pd.NamedAgg(column='PassengerId', aggfunc='count'),
         SurvivedPassengerCount=pd.NamedAgg(column='Survived', aggfunc='sum'))
result = result.assign(PercSurvived=result['SurvivedPassengerCount'] / result['PassengerCount'])
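To format the new column as a percent string like the desired output, one more step (a sketch; the rounding is my choice and may differ by 0.1 from the hand-computed values above):
result['PercSurvived'] = (100 * result['SurvivedPassengerCount']
                          / result['PassengerCount']).round(1).astype(str) + '%'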
I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with the distance values shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# being iterated over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell#'s data is:
use an apply function to go row by row,
then join inp_df with matrix_df on column B, where matrix_df is somehow translated into tuples of (column name, distance, avg_distance).
But I am looking for an idiomatic pandas way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside each iteration to fetch the matches, since the number of columns in matrix_df varies from cell to cell.
If it's any help, the matrix files are the distance outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: In inp_df the values of column B are unique; the values of column A may or may not be unique.
Also, matrix_df's first column header was empty, so I renamed it with the following code for ease of understanding, since it was a header-less matrix output file:
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
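If the row-wise apply becomes the bottleneck, an alternative sketch reshapes each matrix file to long form with melt and does a single vectorized merge (the file-name pattern and the 'Unnamed: 0' rename follow the question's description; the rest is an assumption):
import pandas as pd

frames = []
for cell in inp_df['cell'].unique():
    m = pd.read_csv(f'cell_{cell}.csv', index_col=False)
    m = m.rename(columns={'Unnamed: 0': 'B'})
    # wide -> long: one row per (B, A) pair instead of one column per A value
    long = m.melt(id_vars=['B', 'avg_distance'], var_name='A', value_name='distance')
    long['A'] = long['A'].astype(int)   # column headers arrive as strings
    long['cell'] = cell
    frames.append(long)

lookup = pd.concat(frames, ignore_index=True)
out_df = inp_df.merge(lookup, on=['A', 'B', 'cell'], how='left')
This trades memory (the long lookup table) for speed, and it copes naturally with a different number of columns per cell.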
I have the following dataframe:
df:
Wins Ratio
id
234 10 None
143 32 None
678 2 None
I'm running a model to find Ratio for each id.
The model's output, the Ratio, is in another dataframe that looks like this:
result:
143
Wins 32
Ratio 98
However, I'm struggling to update df with the ratio. I'm looking for a function that simply updates df for id 143. I tried to use pd.DataFrame.update(), but it seems it doesn't work that way (or I was unable to make it work). Can someone help with that?
You can update df using combine_first:
import numpy as np

df.replace('None', np.nan).combine_first(result.T)
Output:
Wins Ratio
143 32 98.0
234 10 NaN
678 2 NaN
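If you would rather use the update method the question mentions, a sketch (assuming the None entries are real missing values rather than the string 'None'; update aligns on both the index and the column labels):
result_t = result.T    # index: 143, columns: Wins, Ratio
df.update(result_t)    # overwrites the matching cells of df in place
After this, df holds Ratio 98 for id 143 and leaves the other rows untouched.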