Iterating through a Series to find values >= x, then using those values - Python

I have a series (of length 201) created from reading a .xlsx spreadsheet, as follows:
xl = pandas.ExcelFile(file)
data = xl.parse('Sheet1')
data.columns = ["a", "b", "c", "d", "e", "f", "g", "h"]
A = data.a
So I am working with A, and if I print(A) I get:
0       76.0
1      190.0
2        0.0
3       86.0
4        0.0
       ...
196    156.0
197      0.0
198      0.0
199    320.0
200      0.0
Name: Vazi, Length: 201, dtype: float64
I want to iterate through A and find all the values >= 180, and make a new array (or series) where, for the values in A >= 180, I subtract 180, but for values in A < 180 I use the original value. I have tried the following but I get errors:
nx = len(A)
for i in range(nx):
    if A_new(i) >= A(i) + 180:
    else A_new(i) == A(i)
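For reference, a corrected version of that loop might look like the sketch below (indexing with .iloc rather than calling the Series like a function, and building A_new as a copy of A); the vectorized answers that follow are still the better fix:

A_new = A.copy()
for i in range(len(A)):
    # subtract 180 from values >= 180; smaller values keep the original value
    if A.iloc[i] >= 180:
        A_new.iloc[i] = A.iloc[i] - 180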

Use Series.mask / Series.where:
new_s = s.mask(s.ge(180), s.sub(180))
#new_s = s.sub(180).where(s.ge(180), s)  # or Series.where
or np.where:
new_s = pd.Series(data=np.where(s.ge(180), s.sub(180), s),
                  index=s.index,
                  name=s.name)
We could also use Series.loc:
new_s = s.copy()
new_s.loc[s.ge(180)] = s.sub(180)
new_s output:
0       76.0
1       10.0
2        0.0
3       86.0
4        0.0
       ...
196    156.0
197      0.0
198      0.0
199    140.0
200      0.0
Name: Vazi, Length: 201, dtype: float64
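As a quick self-contained check of the mask approach (using a handful of the values above):

import pandas as pd

s = pd.Series([76.0, 190.0, 0.0, 320.0])
print(s.mask(s.ge(180), s.sub(180)).tolist())  # [76.0, 10.0, 0.0, 140.0]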

Appending floats to empty pandas DataFrame

I am trying to build a recursive function that will take data from a specific dataframe, do a little math with it, and append that result to a new dataframe. My current code looks as follows:
div1, div2, div3 = [pd.DataFrame(index=range(1), columns=['g']) for i in range(3)]

# THIS IS NOT WORKING FOR SOME REASON
def stats(div, obp):
    loop = 1
    while loop <= 3:
        games = obp['g'].sum() / 2
        div = div.append(games)
        loop += 1
        if loop == 2:
            stats(div2, dii_obp, dii_hr)
        elif loop == 3:
            stats(div3, diii_obp, diii_hr)
        else:
            print('Invalid')

stats(div1, di_obp)
I get an error that reads:
TypeError: cannot concatenate object of type '<class 'numpy.float64'>'; only Series and DataFrame objs are valid
div1 and di_obp are dataframes, and 'g' is a column in the di_obp dataframe.
I have tried making the variable games an empty list and a Series, and got different errors. I am not sure what I should try next. Any help is much appreciated!
here is the head of the di_obp dataframe, the dii_obp and the diii_obp dataframes are the same but with different values.
print(di_obp.head())
rank team g ab h bb hbp sf sh
608 213.0 None 56.0 1947.0 526.0 182.0 55.0 19.0 22.0
609 214.0 None 36.0 1099.0 287.0 124.0 25.0 11.0 24.0
610 215.0 None 35.0 1099.0 247.0 159.0 51.0 11.0 24.0
611 216.0 None 36.0 1258.0 317.0 157.0 30.0 11.0 7.0
612 217.0 None 38.0 1136.0 281.0 138.0 41.0 14.0 10.0
CURRENT PROBLEM:
div1, div2, div3 = [pd.DataFrame(index=range(1), columns=['g']) for i in range(3)]

def stats(div, obp):
    loop = 1
    while loop <= 3:
        games = obp['g'].sum() / 2
        div[0] = div[0].append({'g': games}, ignore_index=True)
        loop += 1
        if loop == 2:
            stats([div2], dii_obp)
        elif loop == 3:
            stats([div3], diii_obp)
        else:
            print('Done')

stats([div1], di_obp)
This does not return an error, but my dataframes are still empty.
Appending to a dataframe is generally not recommended. Instead, you should accumulate your data in lists and then create dataframes from those lists. (This is also why the dataframes above stay empty: DataFrame.append returns a new object, and div[0] = ... only rebinds an element of the temporary list passed into the function, so div1, div2 and div3 themselves are never modified.)
div1, div2, div3 = [[] for _ in range(3)]

def stats(div, obp):
    loop = 1
    while loop <= 3:
        games = obp['g'].sum() / 2
        div.append(games)
        loop += 1
        if loop == 2:
            stats(div2, dii_obp)
        elif loop == 3:
            stats(div3, diii_obp)
        else:
            print('Done')

stats(div1, di_obp)
div1_df, div2_df, div3_df = [pd.DataFrame({'g': div}) for div in [div1, div2, div3]]
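As a quick sanity check, here is a minimal self-contained version (with a made-up di_obp, since the real frames aren't shown in full) demonstrating that list accumulation produces a non-empty dataframe:

import pandas as pd

di_obp = pd.DataFrame({'g': [56.0, 36.0, 35.0]})  # hypothetical stand-in for the real data

div1 = []
for _ in range(3):
    div1.append(di_obp['g'].sum() / 2)

div1_df = pd.DataFrame({'g': div1})
print(div1_df)  # three rows of 63.5 - not empty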

numpy - any way to improve this further? (pandas took 1h 26m, numpy takes 38m)

Initially I had everything written in pandas, and for this exploratory exercise I did a lot of groupbys; running with the whole data took 1h 26m. Over the last weekend I changed everything from pandas to numpy, and it currently takes (Wall time: 38min 27s). I would like to know if it can be further improved.
While converting to numpy, I additionally used numpy_indexed.
Overall, what I am doing is calling the below function in this loop (I have read in lots of places that loops are bad). The dataset has around 657058 rows and there are around 5000 tickers.
for idx, ticker in enumerate(ticker_list):
    ...
    df_temp = weekly_trend_analysis(exchange, df_weekly, df_daily)
    ...
    df_weekly_all = pd.concat([df_weekly_all, df_temp], sort=False)
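One cheap win independent of the numpy rewrite: pd.concat inside a loop re-copies the accumulated frame on every iteration, which is quadratic in the number of tickers. The usual fix is to collect the per-ticker results in a list and concatenate once (a sketch based on the loop above):

frames = []
for idx, ticker in enumerate(ticker_list):
    ...
    df_temp = weekly_trend_analysis(exchange, df_weekly, df_daily)
    ...
    frames.append(df_temp)
df_weekly_all = pd.concat(frames, sort=False)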
def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
    if exchange == 'BSE':
        ticker = df_daily.iloc[0]['sc_code']
    else:
        ticker = df_daily.iloc[0]['symbol']

    arr_yearWeek = df_daily['yearWeek'].to_numpy()
    arr_close = df_daily['close'].to_numpy()
    arr_prevclose = df_daily['prevclose'].to_numpy()
    arr_chng = df_daily['chng'].to_numpy()
    arr_chngp = df_daily['chngp'].to_numpy()
    arr_ts = df_daily['ts'].to_numpy()
    arr_volumes = df_daily['volumes'].to_numpy()

    # Close
    arr_concat = np.column_stack((arr_yearWeek, arr_close))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    #a = df_temp[['yearWeek', 'close']].to_numpy()
    yearWeek, daysTraded = np.unique(arr_concat[:, 0], return_counts=True)

    cmaxs, cmins = [], []
    first, last, wChng, wChngp = [], [], [], []
    for idx, subarr in enumerate(npi_gb):
        cmaxs.append(np.amax(subarr))
        cmins.append(np.amin(subarr))
        first.append(subarr[0])
        last.append(subarr[-1])
        wChng.append(subarr[-1] - subarr[0])
        wChngp.append(((subarr[-1] / subarr[0]) * 100) - 100)

    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Chng
    arr_concat = np.column_stack((arr_yearWeek, arr_chng))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    HSDL, HSDG = [], []
    for idx, subarr in enumerate(npi_gb):
        HSDL.append(np.amin(subarr))
        HSDG.append(np.amax(subarr))

    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Chngp
    arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    HSDLp, HSDGp = [], []
    for idx, subarr in enumerate(npi_gb):
        HSDLp.append(np.amin(subarr))
        HSDGp.append(np.amax(subarr))

    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Last Traded Date of the Week
    i = df_daily[['yearWeek', 'ts']].to_numpy()
    j = npi.group_by(i[:, 0]).split(i[:, 1])

    lastTrdDoW = []
    for idx, subarr in enumerate(j):
        lastTrdDoW.append(subarr[-1])

    i = np.empty((100, 100))
    #j.clear()

    # Times increased
    TI = np.where(arr_close > arr_prevclose, 1, 0)

    # Below npi_gb_yearWeekTI is used in volumes section
    arr_concat = np.column_stack((arr_yearWeek, TI))
    npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)

    # Volume (dependent on above section value t_group, that's the reason to move from top to here)
    arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
    for idx, subarr in enumerate(npi_gb):
        vavgs.append(np.mean(subarr))
        ldvotWk = subarr[-1]
        lastDVotWk.append(ldvotWk)
        #print(idx, 'O - ', subarr, np.argmax(subarr), ', average : ', np.mean(subarr))
        ixDel = np.argmax(subarr)
        hV = subarr[ixDel]
        vmaxs.append(hV)
        if len(subarr) > 1:
            subarr = np.delete(subarr, ixDel)
            vawoHV = np.mean(subarr)
        else:
            vawoHV = np.mean(subarr)
        volAvgWOhv.append(vawoHV)
        HVdAV.append(hV / vawoHV)
        CPveoHVD.append(npi_gb_yearWeekTI[idx][ixDel])
        lastDVdAV.append(ldvotWk / vawoHV)

    #npi_gb.clear()
    arr_concat = np.empty((100, 100))

    # Preparing the dataframe
    # yearWeek and occurrences
    #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
    yearWeek = yearWeek.astype(int)
    HSDL = np.round(HSDL, 2)
    HSDG = np.round(HSDG, 2)
    HSDLp = np.round(HSDLp, 2)
    HSDGp = np.round(HSDGp, 2)

    first = np.round(first, 2)
    last = np.round(last, 2)
    wChng = np.round(wChng, 2)
    wChngp = np.round(wChngp, 2)

    vavgs = np.array(vavgs).astype(int)
    volAvgWOhv = np.array(volAvgWOhv).astype(int)
    HVdAV = np.round(HVdAV, 2)

    dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh': vmaxs, 'volAvg': vavgs, 'daysTraded': daysTraded
                 , 'HSDL': HSDL, 'HSDG': HSDG, 'HSDLp': HSDLp, 'HSDGp': HSDGp, 'first': first, 'last': last, 'wChng': wChng, 'wChngp': wChngp
                 , 'lastTrdDoW': lastTrdDoW, 'TI': TI, 'volAvgWOhv': volAvgWOhv, 'HVdAV': HVdAV, 'CPveoHVD': CPveoHVD
                 , 'lastDVotWk': lastDVotWk, 'lastDVdAV': lastDVdAV}
    df_weekly = pd.DataFrame(data=dict_temp)

    df_weekly['sc_code'] = ticker

    cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
            , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
            , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
    df_weekly = df_weekly[cols].copy()

    # df_weekly_all will be 0 when it's a new company or it's an FTA (First Time Analysis)
    if df_weekly_all.shape[0] == 0:
        df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))

    # Removing all yearWeek in df_weekly2 from df_weekly
    a = set(df_weekly_all['yearWeek'])
    b = set(df_weekly['yearWeek'])
    c = list(a.difference(b))
    #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
    df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()

    # Append the latest week data to df_weekly
    df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
    #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))

    return df_weekly_all
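A possible further simplification, with the caveat that this relies on my reading of the numpy_indexed API (I believe its group_by object exposes reductions such as min, max, first and last directly, each returning the unique keys plus the reduced values): the split-then-loop pattern above could then be replaced with vectorized reductions, e.g. for the close section:

# hedged sketch, assuming numpy_indexed's GroupBy reduction methods
g = npi.group_by(arr_yearWeek)
_, cmaxs = g.max(arr_close)
_, cmins = g.min(arr_close)
_, first = g.first(arr_close)
_, last = g.last(arr_close)
wChng = last - first
wChngp = (last / first) * 100 - 100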
Input data
ts = ['2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-01 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00','2019-04-02 00:00:00']
sc_code = ['500002','500002','500002','500002','500002','500002','500002','500002','500002','500002']
high = [1326.6, 208.45, 732.15, 14.87, 1979.0, 57.8, 11.55, 1.68, 8.1, 139.4]
low = [1306.35, 204.0, 717.05, 13.41, 1937.65, 54.65, 11.2, 1.52, 7.75, 135.65]
close = [1313.55, 206.65, 723.05, 13.53, 1955.25, 56.0, 11.21, 1.68, 8.1, 136.85]
prevclose = [1319.85, 202.95, 718.95, 14.29, 1967.3, 54.65, 11.22, 1.6, 7.75, 135.05]
volumes = [7785, 6150, 21641, 46296, 707019, 40089, 25300, 5920, 500, 235355]
yearWeek = [201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913, 201913]
chng = [-6.29, 3.70, 4.09, -0.75, -12.04, 1.35, -0.09, 0.079, 0.34, 1.79]
chngp = [-0.48, 1.82, 0.57, -5.32, -0.61, 2.47, -0.09, 5.0, 4.52, 1.33]
dict_temp = {'ts':ts, 'sc_code':sc_code, 'high':high, 'low':low, 'close':close, 'prevclose':prevclose, 'volumes':volumes, 'yearWeek':yearWeek, 'chng':chng, 'chngp':chngp}
df_weekly = pd.DataFrame(data=dict_temp)
Adding line-profiler details,
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s
Total time: 0.043637 s
File: BTD-Analysis1V3.py
Function: weekly_trend_analysis at line 36
Line # Hits Time Per Hit % Time Line Contents
==============================================================
36 def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
37
38 1 3.0 3.0 0.0 if exchange == 'BSE':
39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
40 else:
41 ticker = df_daily.iloc[0]['symbol']
42
95 # Last Traded Date of the Week
96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
98
160
161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
166
167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
168
169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
172
173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
174
175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
178
179 # Removing all yearWeek in df_weekly2 from df_weekly
180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
182 1 5.0 5.0 0.0 c = list(a.difference(b))
183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
185
186 # Append the latest week data to df_weekly
187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
189
190 1 2.0 2.0 0.0 return df_weekly_all
After reviewing the above profile, I made changes to the code that consumed the most time, basically adding more numpy and removing pandas inside the function. The code below, when run with the whole data, took only Wall time: 7min 47s.
I encountered some numpy errors like the one below, which I handled by writing intermediate files. I am using a Windows machine and the intermediate files were < 3MB. Not sure if there were any limitations.
MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32
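For reference, the <U32 dtype in that error is a clue: np.column_stack on mixed numeric and string inputs promotes everything to a fixed-width unicode array, which multiplies memory use. A tiny demonstration:

import numpy as np

a = np.column_stack((np.arange(3), np.array(['x', 'y', 'z'])))
print(a.dtype)  # a unicode dtype such as <U21 - the integers became strings

Keeping numeric columns in their own float array (and string columns separate) avoids the blow-up.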
Line # Hits Time Per Hit % Time Line Contents
==============================================================
38 def weekly_trend_analysis_np(exchange, np_weekly_all, df_daily):
39
40 1 4.0 4.0 0.0 if exchange == 'BSE':
43 1 152.0 152.0 1.2 ticker = df_daily['sc_code'].to_numpy()[0]
44 else:
47 ticker = df_daily['symbol'].to_numpy()[0]
48
101 # Last Traded Date of the Week
102 1 33.0 33.0 0.3 arr_concat = np.column_stack((arr_yearWeek, arr_ts))
103 1 341.0 341.0 2.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
104
152 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
153 1 59.0 59.0 0.5 HSDL = np.round(HSDL,2)
154 1 26.0 26.0 0.2 HSDG = np.round(HSDG,2)
155 1 23.0 23.0 0.2 HSDLp = np.round(HSDLp,2)
156 1 23.0 23.0 0.2 HSDGp = np.round(HSDGp,2)
157
158 1 23.0 23.0 0.2 first = np.round(first,2)
159 1 23.0 23.0 0.2 last = np.round(last,2)
160 1 23.0 23.0 0.2 wChng = np.round(wChng,2)
161 1 23.0 23.0 0.2 wChngp = np.round(wChngp,2)
162
163 1 12.0 12.0 0.1 vavgs = np.array(vavgs).astype(int)
164 1 16.0 16.0 0.1 volAvgWOhv = np.array(volAvgWOhv).astype(int)
165 1 24.0 24.0 0.2 HVdAV = np.round(HVdAV,2)
166
167 1 16.0 16.0 0.1 ticker = np.full(yearWeek.shape[0], ticker)
168 1 2.0 2.0 0.0 np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
169 1 2.0 2.0 0.0 , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
170 1 546.0 546.0 4.2 , CPveoHVD, lastDVotWk, lastDVdAV))
171
173 1 2.0 2.0 0.0 if len(np_weekly_all) > 0:
175 1 2.0 2.0 0.0 a = np_weekly_all[:,1]
176 1 1.0 1.0 0.0 b = np_weekly[:,1]
177 1 205.0 205.0 1.6 tf_1 = np.isin(a, b, invert=True)
179 1 13.0 13.0 0.1 t_result = list(compress(range(len(tf_1)), tf_1))
181 1 13.0 13.0 0.1 np_weekly_all = np_weekly_all[t_result]
182 1 40.0 40.0 0.3 np_weekly_all = np.vstack((np_weekly_all, np_weekly))
183 else:
184 np_weekly_all = []
185 np_weekly_all = np.vstack((np_weekly))
186
187 1 2.0 2.0 0.0 return np_weekly_all
I would be glad to hear your suggestions, and thanks for pointing me to the profiler; I didn't know about that.
Edited and posted the line-profiler output above. I have updated the code from pandas to numpy, which reduced the time considerably, from the initial 1h 26min to now 7min.

LabelEncoder cannot inverse_transform (unseen labels) after imputing missing values

I'm at a beginner-to-intermediate data science level. I want to impute missing values in a dataframe using knn.
As the dataframe contains strings and floats, I need to encode/decode values using LabelEncoder.
My method is as follows:
1. Replace NaN to be able to encode
2. Encode the text values and put the encoders in a dictionary
3. Restore the NaN (previously converted) to be imputed with knn
4. Impute values with knn
5. Decode values using the dictionary
Unfortunately, in the last step, imputing values adds new values that cannot be decoded (an "unseen labels" error message).
Could you please explain what I am doing wrong, and ideally help me correct it? Before concluding, I want to say that I know there are other tools like OneHotEncoder, but I don't know them well enough, and I found LabelEncoder much more intuitive because you can see the encoded values directly in the dataframe (rather than as a separate array).
Please find below an example of my method. Thank you very much for your help:
[1]
# Import libraries.
import pandas as pd
import numpy as np
# initialise data of lists.
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]}
# Make a DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Output
Name Age Car color Height
0 Jack 59.0 Blue 177.0
1 NaN NaN Black 150.0
2 Victoria 29.0 NaN NaN
3 Nicolas NaN Black 180.0
4 Victor 65.0 Grey 175.0
5 Brad 50.0 NaN 190.0
[2]
# LabelEncoder does not work with NaN values, so I replace them with the value '1000':
df = df.replace(np.nan, 1000)
# And to avoid errors, str columns must be set as strings (even the '1000' value):
df[['Name','Car color']] = df[['Name','Car color']].astype(str)
df
Output
Name Age Car color Height
0 Jack 59.0 Blue 177.0
1 1000 1000.0 Black 150.0
2 Victoria 29.0 1000 1000.0
3 Nicolas 1000.0 Black 180.0
4 Victor 65.0 Grey 175.0
5 Brad 50.0 1000 190.0
[3]
# Import LabelEncoder library:
from sklearn.preprocessing import LabelEncoder
# Define labelencoder:
le = LabelEncoder()
# Import defaultdict library to make a dict of labelencoders:
from collections import defaultdict
# Initiate a dict of LabelEncoder values:
encoder_dict = defaultdict(LabelEncoder)
# Make a new dataframe of LabelEncoder values:
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))
# Show output:
df
Output
Name Age Car color Height
0 2 59.0 2 177.0
1 0 1000.0 1 150.0
2 5 29.0 0 1000.0
3 3 1000.0 1 180.0
4 4 65.0 3 175.0
5 1 50.0 0 190.0
[4]
# Convert the 1000s back to missing values in order to impute them:
df = df.replace(1000, np.nan)
df
Output
Name Age Car color Height
0 2 59.0 2 177.0
1 0 NaN 1 150.0
2 5 29.0 0 NaN
3 3 NaN 1 180.0
4 4 65.0 3 175.0
5 1 50.0 0 190.0
[5]
# Import knn imputer library to impute missing values:
from sklearn.impute import KNNImputer
# Define imputer:
imputer = KNNImputer(n_neighbors=2)
# Impute and reassign index/columns:
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
df
Output
Name Age Car color Height
0 2.0 59.0 2.0 177.0
1 0.0 47.0 1.0 150.0
2 5.0 29.0 0.0 165.0
3 3.0 44.0 1.0 180.0
4 4.0 65.0 3.0 175.0
5 1.0 50.0 0.0 190.0
[6]
# Decode data:
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
# Apply it to df -> THIS IS WHERE THE ERROR OCCURS:
df[['Name','Car color']].apply(inverse_transform_lambda)
Error message:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6929
6930 def applymap(self, func):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
297 "y contains previously unseen labels: %s" % str(diff))
298 y = np.asarray(y)
--> 299 return self.classes_[y]
300
301 def _more_tags(self):
IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
Based on my comment, you should do:
# Decode data:
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int))  # or x[].astype(int)
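Applied to the dataframe, that looks like the following (the imputed codes come back from KNNImputer as floats, hence the cast to int; note this only works because the rounded codes stay inside the range of labels each encoder has seen):

df[['Name','Car color']] = df[['Name','Car color']].apply(
    lambda x: encoder_dict[x.name].inverse_transform(x.astype(int)))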

Modifying cells in pandas df does not succeed

I am trying to modify cells in an existing df: if I find a string with no alpha characters (such as "*"), I set it to the string "0.0", and when all cells are processed I try to convert the column to a numeric type.
But setting "0.0" for some reason is not reflected in the resulting df.
for i, col in enumerate(cols):
    for ii in range(0, df.shape[0]):
        row = df.iloc[ii]
        value = row[col]
        if isinstance(value, str):
            if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', x) is None:
                df.iat[ii, i] = "0.0"
    df[col] = df[col].astype(np.float_)
    #df[col] = df[col].to_numeric() #this throws error that Series does not have to_numeric()
I get the error:
could not convert string to float: 'cat'
And when I print df I see that values were not changed.
What could be the issue?
Thanks!
df
f289,f290,f291,f292,f293,f294,f295,f296,f297,f298,f299,f300,f301,f302,f303,f304,f305,f306,f307,f308,f309,f310
01M015,P.S. 015 Roberto Clemente,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M019,P.S. 019 Asher Levy,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M020,P.S. 020 Anna Silver,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M034,P.S. 034 Franklin D. Roosevelt,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,14
01M063,The STAR Academy - P.S.63,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,6
01M064,P.S. 064 Robert Simon,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M110,P.S. 110 Florence Nightingale,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M134,P.S. 134 Henrietta Szold,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M137,P.S. 137 John L. Bernstein,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M140,P.S. 140 Nathan Straus,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M142,P.S. 142 Amalia Castro,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M184,P.S. 184m Shuang Wen,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M188,P.S. 188 The Island School,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,10
So, in this case, I expect this df to have "0.0" instead of "*", and these cols to have a numeric datatype, e.g. float, after conversion.
You can change the condition for returning 0.0; for the test I set x == "*":
df.iloc[:, 3:] = df.iloc[:, 3:].applymap(lambda x: 0.0 if x == "*" else x)
f289 f290 f291 ... f308 f309 f310
0 01M015 P.S. 015 Roberto Clemente Elementary ... 0.0 0.0 0
1 01M019 P.S. 019 Asher Levy Elementary ... 0.0 0.0 0
2 01M020 P.S. 020 Anna Silver Elementary ... 0.0 0.0 0
3 01M034 P.S. 034 Franklin D. Roosevelt K-8 ... 0.0 0.0 14
4 01M063 The STAR Academy - P.S.63 Elementary ... 0.0 0.0 6
5 01M064 P.S. 064 Robert Simon Elementary ... 0.0 0.0 0
6 01M110 P.S. 110 Florence Nightingale Elementary ... 0.0 0.0 0
7 01M134 P.S. 134 Henrietta Szold Elementary ... 0.0 0.0 0
8 01M137 P.S. 137 John L. Bernstein Elementary ... 0.0 0.0 0
9 01M140 P.S. 140 Nathan Straus K-8 ... 0.0 0.0 0
10 01M142 P.S. 142 Amalia Castro Elementary ... 0.0 0.0 0
11 01M184 P.S. 184m Shuang Wen K-8 ... 0.0 0.0 0
12 01M188 P.S. 188 The Island School K-8 ... 0.0 0.0 10
Update
Define a function:
def f(value):
    if isinstance(value, str):
        if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', value) is None:
            return 0.0
    return float(value)
Apply it to each cell:
df = df.applymap(f)
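As an aside on the commented-out line in the question: to_numeric is a top-level pandas function, not a Series method, which is why df[col].to_numeric() raises AttributeError. The conversion step would be:

df[col] = pd.to_numeric(df[col])
#df[col] = pd.to_numeric(df[col], errors='coerce')  # alternative: turn unparseable leftovers into NaN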

Pandas Seaborn Data labels showing as 0.00

I have two questions:
My data labels are showing as 0.00 and aren't matching my crosstab. I don't know why...
Updating with the full code:
df = pd.read_csv('2018_ms_data_impact_only.csv', low_memory=False)
df.head()
StartDate EndDate Status IPAddress Progress duration Finished RecordedDate ResponseId RecipientLastName ... Gender LGBTQ Mobile organizing_interest Parent Policy policy_interest reg_to_vote unique_id Veteran
0 4/6/18 10:32 4/6/18 10:39 1 NaN 100 391 1 4/6/18 10:39 R_1liSDxRmTKDLFfT Mays ... Woman 0.0 4752122624 Currently in this field 1.0 NaN NaN 0.0 0034000001VAbTAAA1 0.0
1 4/9/18 6:31 4/9/18 6:33 1 NaN 100 160 1 4/9/18 6:33 R_0ezRf2zyaLwFDa1 Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
2 4/9/18 9:14 4/9/18 9:15 1 NaN 100 70 1 4/9/18 9:15 R_DeHh3DQ23uQZwLD Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
3 4/9/18 9:21 4/9/18 9:22 1 NaN 100 69 1 4/9/18 9:22 R_1CC0ckmyS7E1qs3 Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
4 4/9/18 9:28 4/9/18 9:29 1 NaN 100 54 1 4/9/18 9:29 R_01GuM5KqtHIgvEl Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
def impact_action_yn_new(series):
    if series == 3:
        return 'No'
    elif series == 1:
        return 'Yes'

df['impact_action_yn_new'] = df['impact_action_yn'].apply(impact_action_yn_new)
df['impact_action_yn_new'].value_counts(sort=False)
# clean up engagement - collapse nan and 0, 2s
def engagement_new(series):
    if series == '0':
        return 'Null'
    elif series == 'NaN':
        return 'Null'
    elif series == '1':
        return '1'
    elif series == '2':
        return '2a'
    elif series == '2a':
        return '2a'
    elif series == '2b':
        return '2b'
    elif series == '3':
        return '3'
    elif series == '4':
        return '4'
    elif series == '5':
        return '5'

df['engagement_new'] = df['Engagement'].apply(engagement_new)
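One aside on the snippet above (an observation, not part of the original question): the series == 'NaN' branch never fires for real missing values, because a missing value is a float NaN, not the string 'NaN'. A sketch of the same mapping with a proper missing-value check:

import pandas as pd

def engagement_new(value):
    # real NaNs need pd.isna; '0' also collapses to 'Null'
    if pd.isna(value) or value == '0':
        return 'Null'
    if value == '2':
        return '2a'
    return value  # '1', '2a', '2b', '3', '4', '5' map to themselves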
impact_action_table_eng = pd.crosstab(df.impact_action_yn_new,df.engagement_new)
print(impact_action_table_eng)
engagement_new 1 2a 2b 3 4 5 Null
impact_action_yn_new
No 676 508 587 683 172 31 1
Yes 410 405 303 671 357 237 1
# Crosstab: Impact YN x Engagement - Row percentages
impact_action_table_eng_rowperc = pd.crosstab(df.impact_action_yn_new,df.engagement_new).apply(lambda r: r/r.sum()*100, axis=1)
print(impact_action_table_eng_rowperc)
engagement_new 1 2a 2b 3 4 \
impact_action_yn_new
No 25.432656 19.112114 22.084274 25.696012 6.471031
Yes 17.197987 16.988255 12.709732 28.145973 14.974832
engagement_new 5 Null
impact_action_yn_new
No 1.166290 0.037622
Yes 9.941275 0.041946
#plot data
stacked_imp_eng_rowperc = impact_action_table_eng_rowperc.stack().reset_index().rename(columns={0: 'value'})
total = float(len(df))

#set fig size
fig, ax = plt.subplots(figsize=(15,10))
#set style
sns.set_style('whitegrid')
#plot
ax = sns.barplot(x=stacked_imp_eng_rowperc.engagement_new,
                 y=stacked_imp_eng_rowperc.value,
                 hue=stacked_imp_eng_rowperc.impact_action_yn_new)
#plot legend
ax.legend(loc='center right', bbox_to_anchor=(.95,.9), ncol=1, fancybox=True, shadow=True)
#plot axis labels
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height,
            '{:1.2f}'.format(height/total),
            ha="center")
ax.set(xlabel='Engagement Level', ylabel='% Reporting an Action within Last 12 Months');
I'm not sure why the data labels on the bar plot are showing as 0.00. It is calling the crosstab. Any thoughts?
Is there a way to convert the crosstab calculations to show as percentages? I'd like to plot those percentages instead of decimals.
Thanks for all your help!
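For the first question, a likely cause: the values being plotted already are row percentages (from impact_action_table_eng_rowperc), so dividing each bar's height by total (the number of rows in df) collapses every label toward 0.00. Formatting the height directly should match the crosstab, and it also answers the second question, since the heights are already percentages:

for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2., height,
            '{:.1f}%'.format(height), ha="center")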
