Pandas Seaborn Data labels showing as 0.00 - python

I have two questions:
My data labels are showing as 0.00 and aren't matching my crosstab. I don't know why...
Updating with the full code:
df = pd.read_csv('2018_ms_data_impact_only.csv', low_memory=False)
df.head()
StartDate EndDate Status IPAddress Progress duration Finished RecordedDate ResponseId RecipientLastName ... Gender LGBTQ Mobile organizing_interest Parent Policy policy_interest reg_to_vote unique_id Veteran
0 4/6/18 10:32 4/6/18 10:39 1 NaN 100 391 1 4/6/18 10:39 R_1liSDxRmTKDLFfT Mays ... Woman 0.0 4752122624 Currently in this field 1.0 NaN NaN 0.0 0034000001VAbTAAA1 0.0
1 4/9/18 6:31 4/9/18 6:33 1 NaN 100 160 1 4/9/18 6:33 R_0ezRf2zyaLwFDa1 Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
2 4/9/18 9:14 4/9/18 9:15 1 NaN 100 70 1 4/9/18 9:15 R_DeHh3DQ23uQZwLD Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
3 4/9/18 9:21 4/9/18 9:22 1 NaN 100 69 1 4/9/18 9:22 R_1CC0ckmyS7E1qs3 Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
4 4/9/18 9:28 4/9/18 9:29 1 NaN 100 54 1 4/9/18 9:29 R_01GuM5KqtHIgvEl Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
def impact_action_yn_new(series):
    if series == 3:
        return 'No'
    elif series == 1:
        return 'Yes'
df['impact_action_yn_new'] = df['impact_action_yn'].apply(impact_action_yn_new)
df['impact_action_yn_new'].value_counts(sort=False)
# clean up engagement - collapse nan and 0, 2s
def engagement_new(series):
    if series == '0':
        return 'Null'
    elif series == 'NaN':
        return 'Null'
    elif series == '1':
        return '1'
    elif series == '2':
        return '2a'
    elif series == '2a':
        return '2a'
    elif series == '2b':
        return '2b'
    elif series == '3':
        return '3'
    elif series == '4':
        return '4'
    elif series == '5':
        return '5'
df['engagement_new'] = df['Engagement'].apply(engagement_new)
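(Aside: a hedged alternative sketch for the same cleanup, assuming Engagement holds strings plus possibly real NaN values, which the 'NaN' string comparison above would not catch. A mapping dict with Series.map and fillna collapses both to 'Null'.)
# Sketch: same collapsing via a dict; unmapped values and real NaN both become 'Null'
engagement_map = {'0': 'Null', '1': '1', '2': '2a', '2a': '2a',
                  '2b': '2b', '3': '3', '4': '4', '5': '5'}
df['engagement_new'] = df['Engagement'].map(engagement_map).fillna('Null')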
impact_action_table_eng = pd.crosstab(df.impact_action_yn_new,df.engagement_new)
print(impact_action_table_eng)
engagement_new 1 2a 2b 3 4 5 Null
impact_action_yn_new
No 676 508 587 683 172 31 1
Yes 410 405 303 671 357 237 1
# Crosstab: Impact YN x Engagement - Row percentages
impact_action_table_eng_rowperc = pd.crosstab(df.impact_action_yn_new,df.engagement_new).apply(lambda r: r/r.sum()*100, axis=1)
print(impact_action_table_eng_rowperc)
engagement_new 1 2a 2b 3 4 \
impact_action_yn_new
No 25.432656 19.112114 22.084274 25.696012 6.471031
Yes 17.197987 16.988255 12.709732 28.145973 14.974832
engagement_new 5 Null
impact_action_yn_new
No 1.166290 0.037622
Yes 9.941275 0.041946
#plot data
stacked_imp_eng_rowperc = impact_action_table_eng_rowperc.stack().reset_index().rename(columns={0:'value'})
total = float(len(df))
#set fig size
fig, ax = plt.subplots(figsize=(15,10))
#set style
sns.set_style('whitegrid')
#plot
ax = sns.barplot(x=stacked_imp_eng_rowperc.engagement_new,
                 y=stacked_imp_eng_rowperc.value,
                 hue=stacked_imp_eng_rowperc.impact_action_yn_new)
#plot legend
ax.legend(loc='center right',bbox_to_anchor=(.95,.9),ncol=1, fancybox=True, shadow=True)
#plot axis labels
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height,
            '{:1.2f}'.format(height/total),
            ha="center")
ax.set(xlabel='Engagement Level', ylabel='% Reporting an Action within Last 12 Months');
I'm not sure why the data labels on the bar plot are showing as 0.00 when the plot is built from the crosstab. Any thoughts?
Is there a way to convert the crosstab calculations to show as percentages? I'd like to plot those percentages instead of decimals.
Thanks for all your help!
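A note on both questions: impact_action_table_eng_rowperc already holds row percentages, so height in the labelling loop is the percentage itself; dividing it by total (the number of rows in df) is what pushes every label down to 0.00. A minimal sketch of the labelling loop that formats the bar height directly as a percentage (same ax and data as above):
# Sketch: the bar heights are already row percentages, so format them as-is
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height,
            '{:.1f}%'.format(height),
            ha="center")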

Related

How to count values over different thresholds per column in Pandas dataframe groupby?

My goal is, for a dataset similar to the example below, to group by [s_num, ip, f_num, direction], then filter the score columns using separate thresholds and count how many values are above each threshold.
id s_num ip f_num direction algo_1_x algo_2_x algo_1_score algo_2_score
0 0.0 0.0 0.0 0.0 X -4.63 -4.45 0.624356 0.664009
15 19.0 0.0 2.0 0.0 X -5.44 -5.02 0.411217 0.515843
16 20.0 0.0 2.0 0.0 X -12.36 -5.09 0.397237 0.541112
20 24.0 0.0 2.0 1.0 X -4.94 -5.15 0.401744 0.526032
21 25.0 0.0 2.0 1.0 X -4.78 -4.98 0.386410 0.564934
22 26.0 0.0 2.0 1.0 X -4.89 -5.03 0.394326 0.513896
24 28.0 0.0 2.0 2.0 X -4.78 -5.00 0.420078 0.521993
25 29.0 0.0 2.0 2.0 X -4.91 -5.14 0.407355 0.485878
26 30.0 0.0 2.0 2.0 X 11.83 -4.97 0.392242 0.659122
27 31.0 0.0 2.0 2.0 X -4.73 -5.07 0.377011 0.524774
The result should look something like this: each entry in an algo_i column is the number of values in the group larger than the corresponding threshold.
So far I tried grouping first and then applying a custom aggregation, like so:
def count_success(x, thresh):
    return ((x > thresh)*1).sum()

thresholds = [0.1, 0.2]
df.groupby(attr_cols).agg({f'algo_{i+1}_score': count_success(thresh) for i, thresh in enumerate(thresholds)})
but this results in an error:
count_success() missing 1 required positional argument: 'thresh'
So, how can I pass another argument to a function using .agg()? Or is there an easier way to do this using some pandas function?
Named aggregation does not allow extra parameters to be passed to your function. You can use numpy broadcasting:
import numpy as np
import pandas as pd

attr_cols = ["s_num", "ip", "f_num", "direction"]
score_cols = df.columns[df.columns.str.match(r"algo_\d+_score")]

# Convert everything to numpy to prepare for broadcasting
score = df[score_cols].to_numpy()
threshold = np.array([0.1, 0.5])

# Raise `threshold` up 2 dimensions so that every value in `score` is
# broadcast against every value in `threshold`
mask = score > threshold[:, None, None]

# Assemble the result
row_index = pd.MultiIndex.from_frame(df[attr_cols])
col_index = pd.MultiIndex.from_product([threshold, score_cols], names=["threshold", "algo"])
result = (
    pd.DataFrame(np.hstack(mask), index=row_index, columns=col_index)
    .groupby(attr_cols)
    .sum()
)
Result:
threshold 0.1 0.5
algo algo_1_score algo_2_score algo_1_score algo_2_score
s_num ip f_num direction
0.0 0.0 0.0 X 1 1 1 1
2.0 0.0 X 2 2 0 2
1.0 X 3 3 0 3
2.0 X 4 4 0 3
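If one threshold per score column (as in the original attempt) is enough, a smaller hedged alternative is to bind the extra argument before handing the callable to .agg, for example with functools.partial or a lambda default argument. A minimal sketch, reusing count_success, attr_cols and the question's thresholds:
from functools import partial

thresholds = [0.1, 0.2]
# Bind each threshold so .agg receives a one-argument callable per column
agg_funcs = {f'algo_{i+1}_score': partial(count_success, thresh=t)
             for i, t in enumerate(thresholds)}
result = df.groupby(attr_cols).agg(agg_funcs)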

Pandas (Python): add a new column in a data frame which depends on its row value and an aggregated value from another data frame

I am new to Python and pandas, so my question may be silly.
Problem:
So I have two data frames, let's say df1 and df2, where
df1 is like
treatment1 treatment2 value comparision test adjustment statsig p_value
0 Treatment Control 0.795953 Treatment:Control t-test Benjamini-Hochberg False 0.795953
1 Treatment2 Control 0.795953 Treatment2:Control t-test Benjamini-Hochberg False 0.795953
2 Treatment2 Treatment 0.795953 Treatment2:Treatment t-test Benjamini-Hochberg False 0.795953
and df2 is like
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
.. ... ...
336 Treatment3 35.0
337 Treatment3 9.0
338 Treatment3 35.0
339 Treatment3 9.0
340 Treatment3 35.0
I want to add a column mean_percentage_lift in df1 where
lift_mean_percentage = (mean(treatment1)/mean(treatment2) -1) * 100
where `treatment1` and `treatment2` can be anything in `[Treatment, Control, Treatment2]`
My Approach:
I am using the assign function of the data frame.
df1.assign(mean_percentage_lift = lambda dataframe: lift_mean_percentage(df2, dataframe['treatment1'], dataframe['treatment2']))
where
def lift_mean_percentage(df, treatment1, treatment2):
    treatment1_data = df[df[group_type_col] == treatment1]
    treatment2_data = df[df[group_type_col] == treatment2]
    mean1 = treatment1_data['metric'].mean()
    mean2 = treatment2_data['metric'].mean()
    return (mean1/mean2 -1) * 100
But I am getting the error Can only compare identically-labeled Series objects for the line treatment1_data = df[df[group_type_col] == treatment1]. Is there something I am doing wrong, or is there an alternative to this?
For dataframe df2:
group_type metric
0 Treatment 31.0
1 Treatment2 83.0
2 Treatment 51.0
3 Treatment 20.0
4 Control 41.0
5 Treatment3 35.0
6 Treatment3 9.0
7 Treatment 35.0
8 Treatment3 9.0
9 Control 5.0
You can try:
def lift_mean_percentage(df, T1, T2):
    treatment1 = df['metric'][df['group_type']==T1].mean()
    treatment2 = df['metric'][df['group_type']==T2].mean()
    return (treatment1/treatment2 -1) * 100
Running:
lift_mean_percentage(df2,'Treatment2','Control')
the result:
260.8695652173913
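As a quick check with the df2 shown above: Treatment2 has a single metric of 83.0 and Control averages (41.0 + 5.0) / 2 = 23.0, so (83/23 - 1) * 100 ≈ 260.87. To fill the mean_percentage_lift column the question asks for, one hedged sketch (assuming df1 and df2 as shown) applies the helper row-wise instead of passing whole Series into assign:
# Sketch: evaluate the helper for each (treatment1, treatment2) pair in df1
df1['mean_percentage_lift'] = df1.apply(
    lambda row: lift_mean_percentage(df2, row['treatment1'], row['treatment2']),
    axis=1,
)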

numpy - any way to improve this further (pandas took 1h 26m, numpy takes 38m)

Initially I had everything written in pandas, and for this exploratory exercise I did a lot of groupbys; running with the whole data took 1h 26m. Over the last weekend I changed everything from pandas to numpy, and it currently takes 38min 27s (wall time). I would like to know if it can be improved further.
While converting to numpy, I additionally used numpy_indexed.
Overall, what I am doing is calling the function below in this loop (I have read in lots of places that loops are bad). The dataset has around 657,058 rows and there are around 5,000 tickers.
for idx, ticker in enumerate(ticker_list):
    ...
    df_temp = weekly_trend_analysis(exchange, df_weekly, df_daily)
    ...
    df_weekly_all = pd.concat([df_weekly_all, df_temp], sort=False)
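(A note on the loop itself, independent of the numpy rewrite: calling pd.concat inside the loop copies the whole accumulated frame on every iteration, which grows roughly quadratically with ~5,000 tickers. A minimal sketch, assuming the per-ticker frames can be collected independently, appends to a list and concatenates once:)
# Sketch: collect per-ticker weekly frames and concatenate a single time at the end
weekly_frames = []
for idx, ticker in enumerate(ticker_list):
    # ... build df_daily for this ticker as in the original loop ...
    df_temp = weekly_trend_analysis(exchange, df_weekly, df_daily)
    weekly_frames.append(df_temp)
df_weekly_all = pd.concat(weekly_frames, sort=False, ignore_index=True)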
def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
    if exchange == 'BSE':
        ticker = df_daily.iloc[0]['sc_code']
    else:
        ticker = df_daily.iloc[0]['symbol']
    arr_yearWeek = df_daily['yearWeek'].to_numpy()
    arr_close = df_daily['close'].to_numpy()
    arr_prevclose = df_daily['prevclose'].to_numpy()
    arr_chng = df_daily['chng'].to_numpy()
    arr_chngp = df_daily['chngp'].to_numpy()
    arr_ts = df_daily['ts'].to_numpy()
    arr_volumes = df_daily['volumes'].to_numpy()
    # Close
    arr_concat = np.column_stack((arr_yearWeek, arr_close))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    #a = df_temp[['yearWeek', 'close']].to_numpy()
    yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
    cmaxs, cmins = [], []
    first, last, wChng, wChngp = [], [], [], []
    for idx,subarr in enumerate(npi_gb):
        cmaxs.append( np.amax(subarr) )
        cmins.append( np.amin(subarr) )
        first.append(subarr[0])
        last.append(subarr[-1])
        wChng.append( subarr[-1] - subarr[0] )
        wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
    #npi_gb.clear()
    arr_concat = np.empty((100,100))
    # Chng
    arr_concat = np.column_stack((arr_yearWeek, arr_chng))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    HSDL, HSDG = [], []
    for idx,subarr in enumerate(npi_gb):
        HSDL.append( np.amin(subarr) )
        HSDG.append( np.amax(subarr) )
    #npi_gb.clear()
    arr_concat = np.empty((100,100))
    # Chngp
    arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    HSDLp, HSDGp = [], []
    for idx,subarr in enumerate(npi_gb):
        HSDLp.append( np.amin(subarr) )
        HSDGp.append( np.amax(subarr) )
    #npi_gb.clear()
    arr_concat = np.empty((100,100))
    # Last Traded Date of the Week
    i = df_daily[['yearWeek', 'ts']].to_numpy()
    j = npi.group_by(i[:, 0]).split(i[:, 1])
    lastTrdDoW = []
    for idx,subarr in enumerate(j):
        lastTrdDoW.append( subarr[-1] )
    i = np.empty((100,100))
    #j.clear()
    # Times increased
    TI = np.where(arr_close > arr_prevclose, 1, 0)
    # Below npi_gb_yearWeekTI is used in volumes section
    arr_concat = np.column_stack((arr_yearWeek, TI))
    npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
    # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
    arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
    for idx,subarr in enumerate(npi_gb):
        vavgs.append( np.mean(subarr) )
        ldvotWk = subarr[-1]
        lastDVotWk.append(ldvotWk)
        #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
        ixDel = np.argmax(subarr)
        hV = subarr[ixDel]
        vmaxs.append( hV )
        if(len(subarr)>1):
            subarr = np.delete(subarr, ixDel)
            vawoHV = np.mean(subarr)
        else:
            vawoHV = np.mean(subarr)
        volAvgWOhv.append( vawoHV )
        HVdAV.append(hV / vawoHV)
        CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
        lastDVdAV.append(ldvotWk / vawoHV)
    #npi_gb.clear()
    arr_concat = np.empty((100,100))
    # Preparing the dataframe
    # yearWeek and occurances
    #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
    yearWeek = yearWeek.astype(int)
    HSDL = np.round(HSDL,2)
    HSDG = np.round(HSDG,2)
    HSDLp = np.round(HSDLp,2)
    HSDGp = np.round(HSDGp,2)
    first = np.round(first,2)
    last = np.round(last,2)
    wChng = np.round(wChng,2)
    wChngp = np.round(wChngp,2)
    vavgs = np.array(vavgs).astype(int)
    volAvgWOhv = np.array(volAvgWOhv).astype(int)
    HVdAV = np.round(HVdAV,2)
    dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
                 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
                 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
                 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
    df_weekly = pd.DataFrame(data=dict_temp)
    df_weekly['sc_code'] = ticker
    cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
            , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
            , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
    df_weekly = df_weekly[cols].copy()
    # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
    if df_weekly_all.shape[0] == 0:
        df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
    # Removing all yearWeek in df_weekly2 from df_weekly
    a = set(df_weekly_all['yearWeek'])
    b = set(df_weekly['yearWeek'])
    c = list(a.difference(b))
    #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
    df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
    # Append the latest week data to df_weekly
    df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
    #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
    return df_weekly_all
Input data
ts = ['2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00']
sc_code = ['500002','500002','500002','500002','500002','500002','500002','500002','500002','500002']
high = [1326.6, 208.45, 732.15, 14.87, 1979.0, 57.8, 11.55, 1.68, 8.1, 139.4]
low = [1306.35, 204.0, 717.05, 13.41, 1937.65, 54.65, 11.2, 1.52, 7.75, 135.65]
close = [1313.55, 206.65, 723.05, 13.53, 1955.25, 56.0, 11.21, 1.68, 8.1, 136.85]
prevclose = [1319.85, 202.95, 718.95, 14.29, 1967.3, 54.65, 11.22, 1.6, 7.75, 135.05]
volumes = [7785, 6150, 21641, 46296, 707019, 40089, 25300, 5920, 500, 235355]
yearWeek = [201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913]
chng = [-6.29, 3.70, 4.09, -0.75, -12.04, 1.35, -0.09, 0.079, 0.34, 1.79]
chngp = [-0.48, 1.82, 0.57, -5.32, -0.61, 2.47, -0.09, 5.0, 4.52, 1.33]
dict_temp = {'ts':ts, 'sc_code':sc_code, 'high':high, 'low':low, 'close':close, 'prevclose':prevclose, 'volumes':volumes, 'yearWeek':yearWeek, 'chng':chng, 'chngp':chngp}
df_weekly = pd.DataFrame(data=dict_temp)
Adding line-profiler details,
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s
Total time: 0.043637 s
File: BTD-Analysis1V3.py
Function: weekly_trend_analysis at line 36
Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
37
38 1 3.0 3.0 0.0 if exchange == 'BSE':
39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
40 else:
41 ticker = df_daily.iloc[0]['symbol']
42
95 # Last Traded Date of the Week
96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
98
160
161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
166
167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
168
169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
172
173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
174
175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
178
179 # Removing all yearWeek in df_weekly2 from df_weekly
180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
182 1 5.0 5.0 0.0 c = list(a.difference(b))
183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
185
186 # Append the latest week data to df_weekly
187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
189
190 1 2.0 2.0 0.0 return df_weekly_all
After reviewing the above profile, I made changes to the code that consumed the most time, basically adding more numpy and removing pandas inside the function. The code below, when run with the whole data, took only 7min 47s (wall time).
I encountered some numpy errors like the one below and handled them by writing intermediate files. I am using a Windows machine and the intermediate files were < 3 MB; I am not sure if there were any limitations.
MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 def weekly_trend_analysis_np(exchange, np_weekly_all, df_daily):
39
40 1 4.0 4.0 0.0 if exchange == 'BSE':
43 1 152.0 152.0 1.2 ticker = df_daily['sc_code'].to_numpy()[0]
44 else:
47 ticker = df_daily['symbol'].to_numpy()[0]
48
101 # Last Traded Date of the Week
102 1 33.0 33.0 0.3 arr_concat = np.column_stack((arr_yearWeek, arr_ts))
103 1 341.0 341.0 2.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
104
152 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
153 1 59.0 59.0 0.5 HSDL = np.round(HSDL,2)
154 1 26.0 26.0 0.2 HSDG = np.round(HSDG,2)
155 1 23.0 23.0 0.2 HSDLp = np.round(HSDLp,2)
156 1 23.0 23.0 0.2 HSDGp = np.round(HSDGp,2)
157
158 1 23.0 23.0 0.2 first = np.round(first,2)
159 1 23.0 23.0 0.2 last = np.round(last,2)
160 1 23.0 23.0 0.2 wChng = np.round(wChng,2)
161 1 23.0 23.0 0.2 wChngp = np.round(wChngp,2)
162
163 1 12.0 12.0 0.1 vavgs = np.array(vavgs).astype(int)
164 1 16.0 16.0 0.1 volAvgWOhv = np.array(volAvgWOhv).astype(int)
165 1 24.0 24.0 0.2 HVdAV = np.round(HVdAV,2)
166
167 1 16.0 16.0 0.1 ticker = np.full(yearWeek.shape[0], ticker)
168 1 2.0 2.0 0.0 np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
169 1 2.0 2.0 0.0 , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
170 1 546.0 546.0 4.2 , CPveoHVD, lastDVotWk, lastDVdAV))
171
173 1 2.0 2.0 0.0 if len(np_weekly_all) > 0:
175 1 2.0 2.0 0.0 a = np_weekly_all[:,1]
176 1 1.0 1.0 0.0 b = np_weekly[:,1]
177 1 205.0 205.0 1.6 tf_1 = np.isin(a, b, invert=True)
179 1 13.0 13.0 0.1 t_result = list(compress(range(len(tf_1)), tf_1))
181 1 13.0 13.0 0.1 np_weekly_all = np_weekly_all[t_result]
182 1 40.0 40.0 0.3 np_weekly_all = np.vstack((np_weekly_all, np_weekly))
183 else:
184 np_weekly_all = []
185 np_weekly_all = np.vstack((np_weekly))
186
187 1 2.0 2.0 0.0 return np_weekly_all
I would be glad to hear your suggestions, and thanks for pointing me to the profiler; I didn't know about it.
I have edited in the line-profiler output above and updated the code from pandas to numpy, which reduced the runtime considerably, from the initial 1h 26min to about 7 minutes now.
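For reference, a rough sketch of what a vectorized pandas alternative for part of the weekly aggregation could look like, grouping all tickers at once instead of looping. The name df_all (all daily rows for every ticker, with the columns shown in the input data, already sorted by date) is an assumption, and only a subset of the weekly columns is covered:
# Sketch: one groupby over (sc_code, yearWeek) for all tickers at once
wk = df_all.groupby(['sc_code', 'yearWeek']).agg(
    daysTraded=('close', 'size'),
    closeH=('close', 'max'),
    closeL=('close', 'min'),
    first=('close', 'first'),
    last=('close', 'last'),
    HSDL=('chng', 'min'),
    HSDG=('chng', 'max'),
    HSDLp=('chngp', 'min'),
    HSDGp=('chngp', 'max'),
    volAvg=('volumes', 'mean'),
    volHigh=('volumes', 'max'),
    lastTrdDoW=('ts', 'last'),
).reset_index()
wk['wChng'] = wk['last'] - wk['first']
wk['wChngp'] = wk['last'] / wk['first'] * 100 - 100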

Iterating through a series to find values >= x then use values

I have a series (of length 201) created from reading an .xlsx spreadsheet, as follows:
xl = pandas.ExcelFile(file)
data = xl.parse('Sheet1')
data.columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
A = data.a
So I am working with A, and if I print(A) I get
0 76.0
1 190.0
2 0.0
3 86.0
4 0.0
196 156.0
197 0.0
198 0.0
199 320.0
200 0.0
Name: Vazi, Length: 201, dtype: float64
I want to iterate through A and find all the values >= 180 and make a new array (or series) where, for the values in A >= 180, I subtract 180, but for values in A < 180 I keep the original value. I have tried the following but I get errors:
nx = len(A)
for i in range(nx):
    if A_new(i) >= A(i) + 180:
    else A_new(i) == A(i)
Use Series.mask / Series.where:
new_s = s.mask(s.ge(180),s.sub(180))
#new_s = s.sub(180).where(s.ge(180),s) #or series.where
or np.where
new_s = pd.Series(data = np.where(s.ge(180),s.sub(180),s),
                  index = s.index,
                  name = s.name)
We could also use Series.loc
new_s = s.copy()
new_s.loc[s.ge(180)] =s.sub(180)
new_s output
0 76.0
1 10.0
2 0.0
3 86.0
4 0.0
196 156.0
197 0.0
198 0.0
199 140.0
200 0.0
Name: Vazi, Length: 201, dtype: float64
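Applied to the series from the question, the same pattern would be, for example:
# Sketch: subtract 180 only where A is >= 180, keep the rest as-is
A_new = A.mask(A.ge(180), A.sub(180))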

Modifying cells in pandas df does not succeed

I am trying to modify cells in an existing df: if I find a string with no alpha characters (such as "*") I set it to the string "0.0", and when all cells are processed I try to convert the column to a numeric type.
But setting "0.0" for some reason is not reflected in the resulting df.
for i, col in enumerate(cols):
    for ii in range(0, df.shape[0]):
        row = df.iloc[ii]
        value = row[col]
        if isinstance(value, str):
            if not( utils.representsInt(value) or utils.representsFloat(value) ) and re.search('[a-zA-Z]', x) is None:
                df.iat[ii, i] = "0.0"
    df[col] = df[col].astype(np.float_)
    #df[col] = df[col].to_numeric() #this throws error that Series does not have to_numeric()
I get the error
could not convert string to float: 'cat'
And when I print df I see that values were not changed.
What could be the issue?
Thanks!
df
f289,f290,f291,f292,f293,f294,f295,f296,f297,f298,f299,f300,f301,f302,f303,f304,f305,f306,f307,f308,f309,f310
01M015,P.S. 015 Roberto Clemente,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M019,P.S. 019 Asher Levy,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M020,P.S. 020 Anna Silver,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M034,P.S. 034 Franklin D. Roosevelt,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,14
01M063,The STAR Academy - P.S.63,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,6
01M064,P.S. 064 Robert Simon,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M110,P.S. 110 Florence Nightingale,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M134,P.S. 134 Henrietta Szold,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M137,P.S. 137 John L. Bernstein,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M140,P.S. 140 Nathan Straus,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M142,P.S. 142 Amalia Castro,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M184,P.S. 184m Shuang Wen,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M188,P.S. 188 The Island School,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,10
So, in this case, I expect this df to have "0.0" instead of "*" and these columns to have a numeric datatype, e.g. float, after conversion.
You can change the condition for returning 0.0; for this test I used x == "*":
df.iloc[:,3:] = df.iloc[:,3:].applymap(lambda x: 0.0 if x=="*" else x)
f289 f290 f291 ... f308 f309 f310
0 01M015 P.S. 015 Roberto Clemente Elementary ... 0.0 0.0 0
1 01M019 P.S. 019 Asher Levy Elementary ... 0.0 0.0 0
2 01M020 P.S. 020 Anna Silver Elementary ... 0.0 0.0 0
3 01M034 P.S. 034 Franklin D. Roosevelt K-8 ... 0.0 0.0 14
4 01M063 The STAR Academy - P.S.63 Elementary ... 0.0 0.0 6
5 01M064 P.S. 064 Robert Simon Elementary ... 0.0 0.0 0
6 01M110 P.S. 110 Florence Nightingale Elementary ... 0.0 0.0 0
7 01M134 P.S. 134 Henrietta Szold Elementary ... 0.0 0.0 0
8 01M137 P.S. 137 John L. Bernstein Elementary ... 0.0 0.0 0
9 01M140 P.S. 140 Nathan Straus K-8 ... 0.0 0.0 0
10 01M142 P.S. 142 Amalia Castro Elementary ... 0.0 0.0 0
11 01M184 P.S. 184m Shuang Wen K-8 ... 0.0 0.0 0
12 01M188 P.S. 188 The Island School K-8 ... 0.0 0.0 10
Update
Define the function:
def f(value):
    if isinstance(value, str):
        if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', value) is None:
            return 0.0
    return float(value)
Apply it to each cell
df = df.applymap(f)
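If the end goal is simply numeric columns, another hedged option (an alternative to the per-cell function, not taken from the answer above) is to let pandas coerce the non-numeric placeholders and then fill them:
# Sketch: '*' and other non-numeric strings become NaN, then 0.0
df.iloc[:, 3:] = df.iloc[:, 3:].apply(pd.to_numeric, errors='coerce').fillna(0.0)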
