I have an algorithmic problem which I am trying to solve in Python. I have a pandas dataframe of two columns, say (I have kept it sorted in descending order here to make the problem easier to explain):
df:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
LA7 185
LA8 180
LA9 150
LA10 100
I have a threshold value for BCOL, say 215. What I want is the maximal subset of rows from the above dataframe whose BCOL average is greater than or equal to 215.
In this case, if I keep the BCOL values down to 200, the mean of (234, 230, ..., 200) is 218.67, whereas if I keep down to 185 (234, 230, ..., 200, 185), the mean is 213.86. So my maximal subset with BCOL mean greater than or equal to 215 is (234, ..., 200), and I will drop the rest of the rows. My final output dataframe should be:
dfnew:
ACOL BCOL
LA1 234
LA2 230
LA3 220
LA4 218
LA5 210
LA6 200
I was trying to put the BCOL values into a list and use a for/while loop, but that is not pythonic and is also time-consuming for a very large table. Is there a more pythonic way to achieve this in pandas?
I will appreciate any help. Thanks.
IIUC, you could do:
import numpy as np

# guarantee that the DF is sorted by BCOL in descending order
df = df.sort_values(by=['BCOL'], ascending=False)
# cumulative mean; on descending data it is non-increasing, so the
# rows where it is >= 215 form a prefix of the frame
mask = (df['BCOL'].cumsum() / np.arange(1, len(df) + 1)) >= 215.0
print(df[mask])
Output
ACOL BCOL
0 LA1 234
1 LA2 230
2 LA3 220
3 LA4 218
4 LA5 210
5 LA6 200
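For reference, the same running average can be written with pandas' expanding window instead of numpy, which may read more naturally (a sketch equivalent to the mask above):
mask = df['BCOL'].expanding().mean() >= 215.0
print(df[mask])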
Related
I have a csv dataset with a column named "Types of Incidents" and another column named "Number of units".
Using Python and pandas, I am trying to find the average of "Number of units" for rows where the value of "Types of Incidents" is 111 (it occurs multiple times).
I have tried searching through several pandas methods but couldn't find how to compute this on a huge dataset.
Here is the question:
What is the ratio of the average number of units that arrive to a scene of an incident classified as '111 - Building fire' to the number that arrive for '651 - Smoke scare, odor of smoke'?
An alternative to ML-Nielsen's value-specific answer:
df.groupby('Types of Incidents')['Number of units'].mean()
This will provide the average Number of units for all Incident Types.
You can specify multiple columns as well if needed.
Reproducible Example:
import pandas as pd

data = {
    "Incident_Type": [111, 380, 390, 111, 651, 651],
    "Number_of_units": [50, 40, 45, 99, 12, 13]
}
data = pd.DataFrame(data)
data
Incident_Type Number_of_units
0 111 50
1 380 40
2 390 45
3 111 99
4 651 12
5 651 13
data.groupby('Incident_Type')['Number_of_units'].mean()
Incident_Type
111 74.5
380 40.0
390 45.0
651 12.5
Name: Number_of_units, dtype: float64
Now if you wish to find the ratio of the average units, you will need to store this result as a dataframe.
average_units = data.groupby('Incident_Type')['Number_of_units'].mean().to_frame()
average_units = average_units.reset_index()
average_units
Incident_Type Number_of_units
0 111 74.5
1 380 40.0
2 390 45.0
3 651 12.5
So we have our result stored in a dataframe called average_units.
incident1_units = average_units[average_units['Incident_Type']==111]['Number_of_units'].values[0]
incident2_units = average_units[average_units['Incident_Type']==651]['Number_of_units'].values[0]
incident1_units / incident2_units
5.96
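A shorter route (just a sketch) keeps the groupby result as a Series and indexes it directly, skipping the dataframe round-trip:
means = data.groupby('Incident_Type')['Number_of_units'].mean()
means[111] / means[651]  # 74.5 / 12.5 == 5.96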
If I understand correctly, you probably have to first select the right rows and then calculate the mean. Something like this:
df.loc[df['Types of Incidents']==111, 'Number of units'].mean()
This will give you the mean of Number of units where the condition df['Types of Incidents']==111 is true.
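Applied to the original question, and assuming the incident labels appear as the full strings quoted there, that could look like:
avg_111 = df.loc[df['Types of Incidents'] == '111 - Building fire', 'Number of units'].mean()
avg_651 = df.loc[df['Types of Incidents'] == '651 - Smoke scare, odor of smoke', 'Number of units'].mean()
avg_111 / avg_651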
I have a very long dataframe dfWeather with the column TEMP. Because of its size, I want to keep only the relevant information. Concretely, I want to keep only the entries where the temperature changed by more than 1 since the last entry I kept. I want to use dfWeather.apply, since it seems to iterate much faster (10x) over the rows than a for-loop over dfWeather.iloc. I tried the following.
dfTempReduced = pd.DataFrame(columns = dfWeather.columns)
dfTempReduced.append(dfWeather.iloc[0])
dfWeather.apply(lambda x: dfTempReduced = dfTempReduced.append(x) if np.abs(TempReduced[-1].TEMP - x.TEMP) >= 1 else None, axis = 1)
unfortunately I get the error
SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Is there a fast way to get that desired result? Thanks!
EDIT:
Here is some example data
dfWeather[200:220].TEMP
Out[208]:
200 12.28
201 12.31
202 12.28
203 12.28
204 12.24
205 12.21
206 12.17
207 11.93
208 11.83
209 11.76
210 11.66
211 11.55
212 11.48
213 11.43
214 11.37
215 11.33
216 11.36
217 11.33
218 11.29
219 11.27
The desired result would yield only the first and the last entry, since the absolute difference is larger than 1. The first entry is always included.
If you don't need this to be recursive (recursive meaning: with [1, 2, 3] you would keep [1, 3], because 2 is only 1 degree larger than 1 while 3 is more than 1 degree larger than 1, though not more than 1 degree larger than 2), then you can simply use diff.
However, this doesn't work if the values stay below the 1 °C threshold for longer stretches. To overcome this limitation, you could round the values (to whatever precision, but the 1 °C threshold suggests that rounding to zero decimals is a good idea).
Let us create an example:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['TEMP'] = np.random.rand(100) * 2
So now, if you are OK with using diff, it can be done very efficiently just by:
# either slice (set the first element explicitly, since diff() yields NaN there)
lg = df['TEMP'].round().diff().abs() > 1
lg.iloc[0] = True
df = df[lg]
# or drop the complement rows (drop only the rows where the inverse
# mask is True, not the whole index)
lg = df['TEMP'].round().diff().abs() <= 1
df.drop(index=df.index[lg], inplace=True)
You even have two options to do the reduction. I guess that drop takes a touch longer but is more memory-efficient than the slicing way.
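If you do need the "changed by more than 1 since the last kept entry" behaviour exactly as described, a plain loop over the underlying numpy array is still reasonably fast; a minimal sketch, where keep_changed is just an illustrative helper name:
import numpy as np

def keep_changed(temps, threshold=1.0):
    # keep the first entry, then every entry that differs from the
    # last *kept* entry by more than the threshold
    temps = np.asarray(temps)
    kept = [0]
    last = temps[0]
    for i in range(1, len(temps)):
        if abs(temps[i] - last) > threshold:
            kept.append(i)
            last = temps[i]
    return kept

dfTempReduced = dfWeather.iloc[keep_changed(dfWeather['TEMP'].to_numpy())]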
I am trying to translate the input dataframe (inp_df) to the output dataframe (out_df) using the data from the cell-based intermediate dataframe (matrix_df), as shown below.
There are several cell-number-based files with the distance values shown in matrix_df.
The program iterates by cell and fetches data from the appropriate file, so each time matrix_df holds the data for all rows of the current cell# that we are iterating over in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell's data is:
use an apply function to go row by row
then use a join on column B of inp_df with matrix_df, where matrix_df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a pandas-idiomatic way of doing this, since my approach will slow down when there are millions of rows in the input. I am specifically looking for the core logic inside the iteration that fetches the matches, since the number of columns in matrix_df varies from cell to cell.
If it's any help, the matrix files are the distance-based outputs from sklearn.metrics.pairwise.pairwise_distances.
NB: in inp_df the values of column B are unique, while the values of column A may or may not be unique.
Also, matrix_df's first column was empty and I renamed it with the following code for ease of understanding, since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47
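If the row-wise apply in Step 2 becomes a bottleneck at millions of rows, a numpy-based lookup is a possible alternative (a sketch, assuming the merged frame from Step 1 is called out_df and its distance columns are labelled with the string values of A, as in the example data):
import numpy as np

# position of the matching column for each row, then one fancy-indexing pass
col_idx = out_df.columns.get_indexer(out_df['A'].astype(int).astype(str))
out_df['distance'] = out_df.to_numpy()[np.arange(len(out_df)), col_idx]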
I have the following dataframe:
df:
Wins Ratio
id
234 10 None
143 32 None
678 2 None
I'm running a model to find the Ratio for each id.
My model computes Ratio; it lives in another dataframe that looks like this:
result:
143
Wins 32
Ratio 98
However, I'm struggling to update df with the ratio. I'm looking for a function that simply updates df for id 143. I tried to use pd.DataFrame.update() but it seems it doesn't work that way (or I was unable to make it work). Can someone help with that?
You can update df using combine_first:
import numpy as np

df.replace('None', np.nan).combine_first(result.T)
Output:
Wins Ratio
143 32 98.0
234 10 NaN
678 2 NaN
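For what it's worth, pd.DataFrame.update can also be made to work here: it aligns on both index and columns, so the missing piece is transposing result to get one row per id (a sketch; note that update modifies df in place):
df.update(result.T)  # row 143 aligns with df's id index; Ratio becomes 98
df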
Afternoon,
I am trying to recreate a table, replacing the raw numbers with each value's percentage of its column total. For instance, I have:
Code 03/31/2016 12/31/2015 09/30/2015
F55 425 387 369
F554 109 106 106
F508 105 105 106
The desired output is a new dataframe, with the numbers replaced by the percentage of the column total (e.g. for 03/31/2016 the total is 425 + 109 + 105):
Code 03/31/2016 12/31/2015 09/30/2015
F55 66.5% 64.7% 63.5%
F554 17% 17.7% 18.2%
F508 16.4% 17.5% 18.2%
Thanks for your help.
I'm sure there's a more elegant answer somewhere, but this will work:
df['03/31/2016'].apply(lambda x : x/df['03/31/2016'].sum())
or, if you want to do this for the entire dataframe:
df.apply(lambda x : x/x.sum(), axis=0)
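A fully vectorised variant (a sketch, assuming Code is a regular column; set it as the index first so only numeric columns remain) is a broadcast division, which can then be formatted as percentage strings:
num = df.set_index('Code')                         # numeric columns only
pct = num / num.sum()                              # each column / its total
pct_str = (pct * 100).round(1).astype(str) + '%'   # e.g. '66.5%'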