My goal, for a dataset similar to the example below, is to group by [s_num, ip, f_num, direction], then filter the score columns using separate thresholds and count how many values are above each threshold.
id s_num ip f_num direction algo_1_x algo_2_x algo_1_score algo_2_score
0 0.0 0.0 0.0 0.0 X -4.63 -4.45 0.624356 0.664009
15 19.0 0.0 2.0 0.0 X -5.44 -5.02 0.411217 0.515843
16 20.0 0.0 2.0 0.0 X -12.36 -5.09 0.397237 0.541112
20 24.0 0.0 2.0 1.0 X -4.94 -5.15 0.401744 0.526032
21 25.0 0.0 2.0 1.0 X -4.78 -4.98 0.386410 0.564934
22 26.0 0.0 2.0 1.0 X -4.89 -5.03 0.394326 0.513896
24 28.0 0.0 2.0 2.0 X -4.78 -5.00 0.420078 0.521993
25 29.0 0.0 2.0 2.0 X -4.91 -5.14 0.407355 0.485878
26 30.0 0.0 2.0 2.0 X 11.83 -4.97 0.392242 0.659122
27 31.0 0.0 2.0 2.0 X -4.73 -5.07 0.377011 0.524774
The result should look something like this: each entry in an algo_i column is the number of values in the group larger than the corresponding threshold.
So far I tried first grouping, and applying custom aggregation like so:
def count_success(x, thresh):
    return ((x > thresh) * 1).sum()

thresholds = [0.1, 0.2]
df.groupby(attr_cols).agg({f'algo_{i+1}_score': count_success(thresh) for i, thresh in enumerate(thresholds)})
but this results in an error:
count_success() missing 1 required positional argument: 'thresh'
So, how can I pass another argument to a function using .agg()? Or is there an easier way to do it using some pandas function?
Named aggregation does not allow extra parameters to be passed to your function. You can use numpy broadcasting:
attr_cols = ["s_num", "ip", "f_num", "direction"]
score_cols = df.columns[df.columns.str.match(r"algo_\d+_score")]
# Convert everything to numpy to prepare for broadcasting
score = df[score_cols].to_numpy()
threshold = np.array([0.1, 0.5])
# Raise `threshold` up 2 dimensions so that every value in `score` is
# broadcast against every value in `threshold`
mask = score > threshold[:, None, None]
# Assemble the result
row_index = pd.MultiIndex.from_frame(df[attr_cols])
col_index = pd.MultiIndex.from_product([threshold, score_cols], names=["threshold", "algo"])
result = (
    pd.DataFrame(np.hstack(mask), index=row_index, columns=col_index)
    .groupby(attr_cols)
    .sum()
)
Result:
threshold 0.1 0.5
algo algo_1_score algo_2_score algo_1_score algo_2_score
s_num ip f_num direction
0.0 0.0 0.0 X 1 1 1 1
2.0 0.0 X 2 2 0 2
1.0 X 3 3 0 3
2.0 X 4 4 0 3
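As a side note: if you only need one threshold per score column (rather than the full threshold × column cross product built above), you can also bind the extra argument before handing the function to .agg, for example with functools.partial. A minimal sketch with made-up data (the group and column names are illustrative, not from the question):

```python
import pandas as pd
from functools import partial

df = pd.DataFrame({
    "grp": ["a", "a", "b", "b"],
    "algo_1_score": [0.05, 0.3, 0.15, 0.25],
    "algo_2_score": [0.1, 0.6, 0.4, 0.7],
})

def count_success(x, thresh):
    # Count values in the group strictly above the threshold
    return (x > thresh).sum()

thresholds = [0.1, 0.2]
# partial() fixes `thresh`, so .agg receives a one-argument callable
result = df.groupby("grp").agg(
    {f"algo_{i+1}_score": partial(count_success, thresh=t)
     for i, t in enumerate(thresholds)}
)
print(result)
```

A plain lambda with a default argument (`lambda s, t=t: (s > t).sum()`) works the same way; the point is that the callable passed to .agg must take only the group Series.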
I am new to Python and pandas, so this may be a silly question.
Problem:
So I have two data frames let's say df1 and df2 where
df1 is like
treatment1 treatment2 value comparision test adjustment statsig p_value
0 Treatment Control 0.795953 Treatment:Control t-test Benjamini-Hochberg False 0.795953
1 Treatment2 Control 0.795953 Treatment2:Control t-test Benjamini-Hochberg False 0.795953
2 Treatment2 Treatment 0.795953 Treatment2:Treatment t-test Benjamini-Hochberg False 0.795953
and df2 is like
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
.. ... ...
336 Treatment3 35.0
337 Treatment3 9.0
338 Treatment3 35.0
339 Treatment3 9.0
340 Treatment3 35.0
I want to add a column mean_percentage_lift in df1 where
lift_mean_percentage = (mean(treatment1)/mean(treatment2) -1) * 100
where `treatment1` and `treatment2` can be anything in `[Treatment, Control, Treatment2]`
My Approach:
I am using the assign function of the data frame.
df1.assign(mean_percentage_lift = lambda dataframe: lift_mean_percentage(df2, dataframe['treatment1'], dataframe['treatment2']))
where
def lift_mean_percentage(df, treatment1, treatment2):
    treatment1_data = df[df[group_type_col] == treatment1]
    treatment2_data = df[df[group_type_col] == treatment2]
    mean1 = treatment1_data['metric'].mean()
    mean2 = treatment2_data['metric'].mean()
    return (mean1 / mean2 - 1) * 100
But I am getting the error Can only compare identically-labeled Series objects for the line treatment1_data = df[df[group_type_col] == treatment1]. Is there something I am doing wrong, or is there an alternative to this?
For dataframe df2:
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
5 Treatment3 35.0
6 Treatment3 9.0
7 Treatment 35.0
8 Treatment3 9.0
9 Control 5.0
You can try:
def lift_mean_percentage(df, T1, T2):
    treatment1 = df['metric'][df['group_type'] == T1].mean()
    treatment2 = df['metric'][df['group_type'] == T2].mean()
    return (treatment1 / treatment2 - 1) * 100
Running:
lift_mean_percentage(df2,'Treatment2','Control')
the result:
260.8695652173913
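For the original error itself: assign hands the whole treatment1/treatment2 columns (Series) to the function, so df['group_type'] == treatment1 ends up comparing two differently-indexed Series. Applying the helper row by row passes scalar labels instead. A minimal sketch with made-up data:

```python
import pandas as pd

# Hypothetical small frames mirroring the question's df1/df2
df2 = pd.DataFrame({
    "group_type": ["Treatment", "Treatment2", "Treatment", "Control"],
    "metric": [31.0, 83.0, 51.0, 41.0],
})
df1 = pd.DataFrame({
    "treatment1": ["Treatment", "Treatment2"],
    "treatment2": ["Control", "Control"],
})

def lift_mean_percentage(df, t1, t2):
    m1 = df.loc[df["group_type"] == t1, "metric"].mean()
    m2 = df.loc[df["group_type"] == t2, "metric"].mean()
    return (m1 / m2 - 1) * 100

# apply row-wise: each row supplies scalar labels, avoiding the
# Series-vs-Series comparison that raised the original error
df1["mean_percentage_lift"] = df1.apply(
    lambda r: lift_mean_percentage(df2, r["treatment1"], r["treatment2"]),
    axis=1,
)
print(df1)
```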
Initially I had everything written in pandas, and for this exploratory exercise I did a lot of groupbys; running with the whole data took 1h 26m. Over the last weekend I changed everything from pandas to numpy, and it currently takes 38min 27s (wall time). I would like to know if it can be improved further.
While converting to numpy, I additionally used numpy_indexed.
Overall, I am calling the function below in a loop (I have read in lots of places that loops are bad). The dataset has around 657,058 rows and there are around 5,000 tickers.
for idx, ticker in enumerate(ticker_list):
    ...
    df_temp = weekly_trend_analysis(exchange, df_weekly, df_daily)
    ...
    df_weekly_all = pd.concat([df_weekly_all, df_temp], sort=False)
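One loop-level improvement worth noting before looking inside the function: pd.concat inside the loop re-copies the accumulated frame on every iteration, which is quadratic in the number of tickers. Collecting the pieces in a list and concatenating once at the end is the usual fix. A minimal illustration (the loop body here is a stand-in for the real per-ticker work):

```python
import pandas as pd

pieces = []
for i in range(3):  # stands in for the ticker loop
    # each iteration produces one small frame instead of growing a big one
    pieces.append(pd.DataFrame({"ticker": [i], "value": [i * 10]}))

# single concat at the end: each row is copied only once
df_weekly_all = pd.concat(pieces, ignore_index=True, sort=False)
```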
def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
    if exchange == 'BSE':
        ticker = df_daily.iloc[0]['sc_code']
    else:
        ticker = df_daily.iloc[0]['symbol']

    arr_yearWeek = df_daily['yearWeek'].to_numpy()
    arr_close = df_daily['close'].to_numpy()
    arr_prevclose = df_daily['prevclose'].to_numpy()
    arr_chng = df_daily['chng'].to_numpy()
    arr_chngp = df_daily['chngp'].to_numpy()
    arr_ts = df_daily['ts'].to_numpy()
    arr_volumes = df_daily['volumes'].to_numpy()

    # Close
    arr_concat = np.column_stack((arr_yearWeek, arr_close))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    #a = df_temp[['yearWeek', 'close']].to_numpy()
    yearWeek, daysTraded = np.unique(arr_concat[:, 0], return_counts=True)
    cmaxs, cmins = [], []
    first, last, wChng, wChngp = [], [], [], []
    for idx, subarr in enumerate(npi_gb):
        cmaxs.append(np.amax(subarr))
        cmins.append(np.amin(subarr))
        first.append(subarr[0])
        last.append(subarr[-1])
        wChng.append(subarr[-1] - subarr[0])
        wChngp.append(((subarr[-1] / subarr[0]) * 100) - 100)
    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Chng
    arr_concat = np.column_stack((arr_yearWeek, arr_chng))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    HSDL, HSDG = [], []
    for idx, subarr in enumerate(npi_gb):
        HSDL.append(np.amin(subarr))
        HSDG.append(np.amax(subarr))
    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Chngp
    arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    HSDLp, HSDGp = [], []
    for idx, subarr in enumerate(npi_gb):
        HSDLp.append(np.amin(subarr))
        HSDGp.append(np.amax(subarr))
    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Last traded date of the week
    i = df_daily[['yearWeek', 'ts']].to_numpy()
    j = npi.group_by(i[:, 0]).split(i[:, 1])
    lastTrdDoW = []
    for idx, subarr in enumerate(j):
        lastTrdDoW.append(subarr[-1])
    i = np.empty((100, 100))
    #j.clear()

    # Times increased
    TI = np.where(arr_close > arr_prevclose, 1, 0)

    # npi_gb_yearWeekTI below is used in the volumes section
    arr_concat = np.column_stack((arr_yearWeek, TI))
    npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)

    # Volume (depends on the section above; that's the reason it was moved from the top to here)
    arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
    for idx, subarr in enumerate(npi_gb):
        vavgs.append(np.mean(subarr))
        ldvotWk = subarr[-1]
        lastDVotWk.append(ldvotWk)
        #print(idx, 'O - ', subarr, np.argmax(subarr), ', average : ', np.mean(subarr))
        ixDel = np.argmax(subarr)
        hV = subarr[ixDel]
        vmaxs.append(hV)
        if len(subarr) > 1:
            subarr = np.delete(subarr, ixDel)
            vawoHV = np.mean(subarr)
        else:
            vawoHV = np.mean(subarr)
        volAvgWOhv.append(vawoHV)
        HVdAV.append(hV / vawoHV)
        CPveoHVD.append(npi_gb_yearWeekTI[idx][ixDel])
        lastDVdAV.append(ldvotWk / vawoHV)
    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Preparing the dataframe
    # yearWeek and occurrences
    #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
    yearWeek = yearWeek.astype(int)
    HSDL = np.round(HSDL, 2)
    HSDG = np.round(HSDG, 2)
    HSDLp = np.round(HSDLp, 2)
    HSDGp = np.round(HSDGp, 2)
    first = np.round(first, 2)
    last = np.round(last, 2)
    wChng = np.round(wChng, 2)
    wChngp = np.round(wChngp, 2)
    vavgs = np.array(vavgs).astype(int)
    volAvgWOhv = np.array(volAvgWOhv).astype(int)
    HVdAV = np.round(HVdAV, 2)

    dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh': vmaxs, 'volAvg': vavgs, 'daysTraded': daysTraded
                 , 'HSDL': HSDL, 'HSDG': HSDG, 'HSDLp': HSDLp, 'HSDGp': HSDGp, 'first': first, 'last': last, 'wChng': wChng, 'wChngp': wChngp
                 , 'lastTrdDoW': lastTrdDoW, 'TI': TI, 'volAvgWOhv': volAvgWOhv, 'HVdAV': HVdAV, 'CPveoHVD': CPveoHVD
                 , 'lastDVotWk': lastDVotWk, 'lastDVdAV': lastDVdAV}
    df_weekly = pd.DataFrame(data=dict_temp)
    df_weekly['sc_code'] = ticker

    cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
            , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
            , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
    df_weekly = df_weekly[cols].copy()

    # df_weekly_all will be empty when it is a new company or an FTA (First Time Analysis)
    if df_weekly_all.shape[0] == 0:
        df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))

    # Remove from df_weekly_all all yearWeek values that appear in df_weekly
    a = set(df_weekly_all['yearWeek'])
    b = set(df_weekly['yearWeek'])
    c = list(a.difference(b))
    #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)))
    df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()

    # Append the latest week's data to df_weekly_all
    df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
    #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
    return df_weekly_all
Input data
ts = ['2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00']
sc_code = ['500002','500002','500002','500002','500002','500002','500002','500002','500002','500002']
high = [1326.6, 208.45, 732.15, 14.87, 1979.0, 57.8, 11.55, 1.68, 8.1, 139.4]
low = [1306.35, 204.0, 717.05, 13.41, 1937.65, 54.65, 11.2, 1.52, 7.75, 135.65]
close = [1313.55, 206.65, 723.05, 13.53, 1955.25, 56.0, 11.21, 1.68, 8.1, 136.85]
prevclose = [1319.85, 202.95, 718.95, 14.29, 1967.3, 54.65, 11.22, 1.6, 7.75, 135.05]
volumes = [7785, 6150, 21641, 46296, 707019, 40089, 25300, 5920, 500, 235355]
yearWeek = [201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913]
chng = [-6.29, 3.70, 4.09, -0.75, -12.04, 1.35, -0.09, 0.079, 0.34, 1.79]
chngp = [-0.48, 1.82, 0.57, -5.32, -0.61, 2.47, -0.09, 5.0, 4.52, 1.33]
dict_temp = {'ts':ts, 'sc_code':sc_code, 'high':high, 'low':low, 'close':close, 'prevclose':prevclose, 'volumes':volumes, 'yearWeek':yearWeek, 'chng':chng, 'chngp':chngp}
df_weekly = pd.DataFrame(data=dict_temp)
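As a sketch of a possible further improvement (not the author's full pipeline): most of the per-week statistics above are plain grouped max/min/first/last/size/mean, so they can be computed for all tickers at once with a single groupby on the combined daily frame, removing the per-ticker Python loop entirely. An illustrative example using a few of the sample columns; the remaining columns would follow the same pattern:

```python
import pandas as pd

# Tiny stand-in for the combined daily data (two weeks of one ticker)
df = pd.DataFrame({
    "sc_code": ["500002"] * 4,
    "yearWeek": [201913, 201913, 201914, 201914],
    "close": [1313.55, 206.65, 723.05, 13.53],
    "volumes": [7785, 6150, 21641, 46296],
})

# One vectorized groupby replaces the per-ticker loop and the
# per-group Python loops over numpy_indexed splits
weekly = df.groupby(["sc_code", "yearWeek"]).agg(
    closeH=("close", "max"),
    closeL=("close", "min"),
    first=("close", "first"),
    last=("close", "last"),
    daysTraded=("close", "size"),
    volAvg=("volumes", "mean"),
).reset_index()
weekly["wChng"] = weekly["last"] - weekly["first"]
weekly["wChngp"] = weekly["last"] / weekly["first"] * 100 - 100
print(weekly)
```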
Adding line-profiler details,
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s
Total time: 0.043637 s
File: BTD-Analysis1V3.py
Function: weekly_trend_analysis at line 36
Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
37
38 1 3.0 3.0 0.0 if exchange == 'BSE':
39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
40 else:
41 ticker = df_daily.iloc[0]['symbol']
42
95 # Last Traded Date of the Week
96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
98
160
161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
166
167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
168
169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
172
173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
174
175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
178
179 # Removing all yearWeek in df_weekly2 from df_weekly
180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
182 1 5.0 5.0 0.0 c = list(a.difference(b))
183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
185
186 # Append the latest week data to df_weekly
187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
189
190 1 2.0 2.0 0.0 return df_weekly_all
After reviewing the above profile, I made changes to the code that consumed the most time, basically adding more numpy code and removing pandas from the function. When run with the whole data, the code below took only 7min 47s (wall time).
I encountered some numpy errors like the one below and handled them by writing intermediate files. I am using a Windows machine and the intermediate files were < 3MB; I am not sure whether there are any limitations.
MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32
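A likely cause of the `<U32` in that MemoryError: np.column_stack over mixed string and numeric columns promotes the whole result to a fixed-width unicode array, which can multiply memory use compared with keeping each column in its own typed array (or in a DataFrame). A small demonstration of the promotion:

```python
import numpy as np

ticker = np.array(["500002", "500002"])  # string column
close = np.array([1313.55, 206.65])      # float column

# Stacking mixed dtypes forces a common dtype: everything
# becomes fixed-width unicode, e.g. '<U32'
mixed = np.column_stack((ticker, close))
print(mixed.dtype)
```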
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 def weekly_trend_analysis_np(exchange, np_weekly_all, df_daily):
39
40 1 4.0 4.0 0.0 if exchange == 'BSE':
43 1 152.0 152.0 1.2 ticker = df_daily['sc_code'].to_numpy()[0]
44 else:
47 ticker = df_daily['symbol'].to_numpy()[0]
48
101 # Last Traded Date of the Week
102 1 33.0 33.0 0.3 arr_concat = np.column_stack((arr_yearWeek, arr_ts))
103 1 341.0 341.0 2.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
104
152 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
153 1 59.0 59.0 0.5 HSDL = np.round(HSDL,2)
154 1 26.0 26.0 0.2 HSDG = np.round(HSDG,2)
155 1 23.0 23.0 0.2 HSDLp = np.round(HSDLp,2)
156 1 23.0 23.0 0.2 HSDGp = np.round(HSDGp,2)
157
158 1 23.0 23.0 0.2 first = np.round(first,2)
159 1 23.0 23.0 0.2 last = np.round(last,2)
160 1 23.0 23.0 0.2 wChng = np.round(wChng,2)
161 1 23.0 23.0 0.2 wChngp = np.round(wChngp,2)
162
163 1 12.0 12.0 0.1 vavgs = np.array(vavgs).astype(int)
164 1 16.0 16.0 0.1 volAvgWOhv = np.array(volAvgWOhv).astype(int)
165 1 24.0 24.0 0.2 HVdAV = np.round(HVdAV,2)
166
167 1 16.0 16.0 0.1 ticker = np.full(yearWeek.shape[0], ticker)
168 1 2.0 2.0 0.0 np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
169 1 2.0 2.0 0.0 , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
170 1 546.0 546.0 4.2 , CPveoHVD, lastDVotWk, lastDVdAV))
171
173 1 2.0 2.0 0.0 if len(np_weekly_all) > 0:
175 1 2.0 2.0 0.0 a = np_weekly_all[:,1]
176 1 1.0 1.0 0.0 b = np_weekly[:,1]
177 1 205.0 205.0 1.6 tf_1 = np.isin(a, b, invert=True)
179 1 13.0 13.0 0.1 t_result = list(compress(range(len(tf_1)), tf_1))
181 1 13.0 13.0 0.1 np_weekly_all = np_weekly_all[t_result]
182 1 40.0 40.0 0.3 np_weekly_all = np.vstack((np_weekly_all, np_weekly))
183 else:
184 np_weekly_all = []
185 np_weekly_all = np.vstack((np_weekly))
186
187 1 2.0 2.0 0.0 return np_weekly_all
I would be glad to hear your suggestions, and thanks for pointing me to the profiler; I didn't know about it.
I have edited in the line-profiler output above. Updating the code from pandas to numpy reduced the time considerably, from the initial 1h 26min to 7min now.
I have a series (of length 201) created from reading a .xlsx spread sheet, as follows:
xl = pandas.ExcelFile(file)
data = xl.parse('Sheet1')
data.columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
A = data.a
So I am working with A and if I print (A) I get
0 76.0
1 190.0
2 0.0
3 86.0
4 0.0
196 156.0
197 0.0
198 0.0
199 320.0
200 0.0
Name: Vazi, Length: 201, dtype: float64
I want to iterate through A and find all values >= 180, and make a new array (or series) where, for values in A >= 180, I subtract 180, while for values in A < 180 I keep the original value. I have tried the following, but I get errors:
nx = len(A)
for i in range (nx):
if A_new(i) >= A(i) + 180:
else A_new(i) == A(i)
Use Series.mask / Series.where:
new_s = s.mask(s.ge(180), s.sub(180))
#new_s = s.sub(180).where(s.ge(180), s)  # or Series.where
or np.where
new_s = pd.Series(data=np.where(s.ge(180), s.sub(180), s),
                  index=s.index,
                  name=s.name)
We could also use Series.loc
new_s = s.copy()
new_s.loc[s.ge(180)] = s.sub(180)
new_s output
0 76.0
1 10.0
2 0.0
3 86.0
4 0.0
196 156.0
197 0.0
198 0.0
199 140.0
200 0.0
Name: Vazi, Length: 201, dtype: float64
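For completeness, the variants above can be checked on a small stand-alone series (values taken from the question's sample output):

```python
import pandas as pd

s = pd.Series([76.0, 190.0, 0.0, 86.0, 320.0])

# subtract 180 where the value is >= 180, keep it otherwise
new_s = s.mask(s.ge(180), s.sub(180))
print(new_s.tolist())  # [76.0, 10.0, 0.0, 86.0, 140.0]
```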
I am trying to modify cells in an existing df: if I find a string with no alpha characters (such as "*"), I set it to the string "0.0", and when all cells are processed I try to convert the column to a numeric type.
But for some reason setting "0.0" is not reflected in the resulting df.
for i, col in enumerate(cols):
    for ii in range(0, df.shape[0]):
        row = df.iloc[ii]
        value = row[col]
        if isinstance(value, str):
            if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', x) is None:
                df.iat[ii, i] = "0.0"
    df[col] = df[col].astype(np.float_)
    #df[col] = df[col].to_numeric()  # this throws an error that Series does not have to_numeric()
I get the error
could not convert string to float: 'cat'
And when I print df I see that values were not changed.
What could be the issue?
Thanks!
df
f289,f290,f291,f292,f293,f294,f295,f296,f297,f298,f299,f300,f301,f302,f303,f304,f305,f306,f307,f308,f309,f310
01M015,P.S. 015 Roberto Clemente,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M019,P.S. 019 Asher Levy,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M020,P.S. 020 Anna Silver,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M034,P.S. 034 Franklin D. Roosevelt,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,14
01M063,The STAR Academy - P.S.63,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,6
01M064,P.S. 064 Robert Simon,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M110,P.S. 110 Florence Nightingale,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M134,P.S. 134 Henrietta Szold,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M137,P.S. 137 John L. Bernstein,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M140,P.S. 140 Nathan Straus,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M142,P.S. 142 Amalia Castro,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M184,P.S. 184m Shuang Wen,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M188,P.S. 188 The Island School,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,10
So, in this case, I expect this df to have "0.0" instead of "*" and these cols to have numeric datatype e.g float after conversion
You can change the condition for returning 0.0; for this test I set x == "*":
df.iloc[:,3:] = df.iloc[:,3:].applymap(lambda x: 0.0 if x=="*" else x)
f289 f290 f291 ... f308 f309 f310
0 01M015 P.S. 015 Roberto Clemente Elementary ... 0.0 0.0 0
1 01M019 P.S. 019 Asher Levy Elementary ... 0.0 0.0 0
2 01M020 P.S. 020 Anna Silver Elementary ... 0.0 0.0 0
3 01M034 P.S. 034 Franklin D. Roosevelt K-8 ... 0.0 0.0 14
4 01M063 The STAR Academy - P.S.63 Elementary ... 0.0 0.0 6
5 01M064 P.S. 064 Robert Simon Elementary ... 0.0 0.0 0
6 01M110 P.S. 110 Florence Nightingale Elementary ... 0.0 0.0 0
7 01M134 P.S. 134 Henrietta Szold Elementary ... 0.0 0.0 0
8 01M137 P.S. 137 John L. Bernstein Elementary ... 0.0 0.0 0
9 01M140 P.S. 140 Nathan Straus K-8 ... 0.0 0.0 0
10 01M142 P.S. 142 Amalia Castro Elementary ... 0.0 0.0 0
11 01M184 P.S. 184m Shuang Wen K-8 ... 0.0 0.0 0
12 01M188 P.S. 188 The Island School K-8 ... 0.0 0.0 10
Update
define function
def f(value):
    if isinstance(value, str):
        # note: the question's code used an undefined `x` here; it must be `value`
        if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', value) is None:
            return 0.0
    return float(value)
Apply it to each cell
df = df.applymap(f)
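A vectorized alternative, assuming (as in the question's data) that every non-parsable cell such as "*" should become 0.0: pd.to_numeric with errors="coerce" turns non-numeric strings into NaN in one pass, which you can then fill. A minimal sketch on a made-up column:

```python
import pandas as pd

df = pd.DataFrame({"name": ["A", "B"], "f293": ["*", "14"]})

# coerce non-numeric strings to NaN, then replace NaN with 0.0;
# the column comes out as float dtype in one step
df["f293"] = pd.to_numeric(df["f293"], errors="coerce").fillna(0.0)
print(df)
```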