Modifying cells in pandas df does not succeed - python

I am trying to modify cells in an existing df: if I find a string with no alpha characters (such as "*") I set it to the string "0.0", and once all cells are processed I try to convert the column to a numeric type.
But setting "0.0" for some reason is not reflected in the resulting df:
for i, col in enumerate(cols):
    for ii in range(0, df.shape[0]):
        row = df.iloc[ii]
        value = row[col]
        if isinstance(value, str):
            if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', value) is None:
                df.iat[ii, i] = "0.0"
    df[col] = df[col].astype(np.float_)
    #df[col] = df[col].to_numeric()  # this throws an error that Series does not have to_numeric()
I get the error:
could not convert string to float: 'cat'
And when I print df I see that the values were not changed.
What could be the issue?
Thanks!
df
f289,f290,f291,f292,f293,f294,f295,f296,f297,f298,f299,f300,f301,f302,f303,f304,f305,f306,f307,f308,f309,f310
01M015,P.S. 015 Roberto Clemente,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M019,P.S. 019 Asher Levy,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M020,P.S. 020 Anna Silver,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M034,P.S. 034 Franklin D. Roosevelt,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,14
01M063,The STAR Academy - P.S.63,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,6
01M064,P.S. 064 Robert Simon,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M110,P.S. 110 Florence Nightingale,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M134,P.S. 134 Henrietta Szold,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M137,P.S. 137 John L. Bernstein,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M140,P.S. 140 Nathan Straus,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M142,P.S. 142 Amalia Castro,Elementary,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M184,P.S. 184m Shuang Wen,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
01M188,P.S. 188 The Island School,K-8,1.0,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,10
So, in this case, I expect this df to have "0.0" instead of "*", and these columns to have a numeric datatype (e.g. float) after conversion.

You can change the condition for returning 0.0; for the test I set x == "*":
df.iloc[:,3:] = df.iloc[:,3:].applymap(lambda x: 0.0 if x=="*" else x)
f289 f290 f291 ... f308 f309 f310
0 01M015 P.S. 015 Roberto Clemente Elementary ... 0.0 0.0 0
1 01M019 P.S. 019 Asher Levy Elementary ... 0.0 0.0 0
2 01M020 P.S. 020 Anna Silver Elementary ... 0.0 0.0 0
3 01M034 P.S. 034 Franklin D. Roosevelt K-8 ... 0.0 0.0 14
4 01M063 The STAR Academy - P.S.63 Elementary ... 0.0 0.0 6
5 01M064 P.S. 064 Robert Simon Elementary ... 0.0 0.0 0
6 01M110 P.S. 110 Florence Nightingale Elementary ... 0.0 0.0 0
7 01M134 P.S. 134 Henrietta Szold Elementary ... 0.0 0.0 0
8 01M137 P.S. 137 John L. Bernstein Elementary ... 0.0 0.0 0
9 01M140 P.S. 140 Nathan Straus K-8 ... 0.0 0.0 0
10 01M142 P.S. 142 Amalia Castro Elementary ... 0.0 0.0 0
11 01M184 P.S. 184m Shuang Wen K-8 ... 0.0 0.0 0
12 01M188 P.S. 188 The Island School K-8 ... 0.0 0.0 10
Update
Define a function:
def f(value):
    if isinstance(value, str):
        if not (utils.representsInt(value) or utils.representsFloat(value)) and re.search('[a-zA-Z]', value) is None:
            return 0.0
    return float(value)
Apply it to each cell:
df = df.applymap(f)
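As another option (my own suggestion, not part of the original answer), pandas' pd.to_numeric with errors='coerce' does the replace-and-convert in one step: anything that cannot be parsed as a number (such as "*") becomes NaN, which you can then fill with 0.0. A minimal sketch with hypothetical column names:

```python
import pandas as pd

# Hypothetical columns mimicking the "*"-filled data above.
df = pd.DataFrame({"f292": ["1.0", "*", "2.5"], "f293": ["*", "*", "14"]})

for col in df.columns:
    # Unparseable strings such as "*" become NaN, then 0.0.
    df[col] = pd.to_numeric(df[col], errors="coerce").fillna(0.0)

print(df["f293"].tolist())    # [0.0, 0.0, 14.0]
print(str(df["f293"].dtype))  # float64
```

This avoids the manual string checks entirely, at the cost of also zeroing any other non-numeric strings.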

Algorithm for predict(start=start_date, end=end_date) on unique same named weather station

Story that pertains to a new design solution
The goal is to use weather data to run an ARIMA model fit on each group of like-named 'stations' with their associated precipitation data, then execute a 30-day forward forecast. I am looking to process one set of same-named stations, then the next unique set of same-named stations, and so on.
The algorithm question
How do I write an algorithm that runs an ARIMA model for each unique 'station' (perhaps grouping same-named stations into groups and running the ARIMA model on each group), then fits a 30-day forward forecast? ARIMA(2,1,1) is a working set of order terms from auto.arima().
How do I group same-named 'stations' before running the ARIMA model, fit, and forecast? Or what other approach would collect like-named stations so I can process one set of same-named stations and then move on to the next unique set?
Working code executes but needs a broader algorithm
The code was working, but on the last run predict(start=start_date, end=end_date) raised a KeyError. I removed NA values, so this may fix the predict(start, end):
wd.weather_data = wd.weather_data[wd.weather_data['date'].notna()]
forecast_models = [50000]
n = 1
df_all_stations = data_prcp.drop(['level_0', 'index', 'prcp'], axis=1)
wd.weather_data.sort_values("date", axis=0, ascending=True, inplace=True)

for station_name in wd.weather_data['station']:
    start_date = pd.to_datetime(wd.weather_data['date'])
    number_of_days = 31
    end_date = pd.to_datetime(start_date) + pd.DateOffset(days=30)
    model = statsmodels.tsa.arima_model.ARIMA(wd.weather_data['prcp'], order=(2,1,1))
    model_fit = model.fit()
    forecast = model_fit.predict(start=start_date, end=end_date)
    forecast_models.append(forecast)
Data Source
station date tavg tmin tmax prcp snow
0 Anchorage, AK 2018-01-01 -4.166667 -8.033333 -0.30 0.3 80.0
35328 Grand Forks, ND 2018-01-01 -14.900000 -23.300000 -6.70 0.0 0.0
86016 Key West, FL 2018-01-01 20.700000 16.100000 25.60 0.0 0.0
59904 Wilmington, NC 2018-01-01 -2.500000 -7.100000 0.00 0.0 0.0
66048 State College, PA 2018-01-01 -13.500000 -17.000000 -10.00 4.5 0.0
... ... ... ... ... ... ... ...
151850 Kansas City, MO 2022-03-30 9.550000 3.700000 16.55 21.1 0.0
151889 Springfield, MO 2022-03-30 12.400000 4.500000 17.10 48.9 0.0
151890 St. Louis, MO 2022-03-30 14.800000 8.000000 17.60 24.9 0.0
151891 State College, PA 2022-03-30 0.400000 -5.200000 6.20 0.2 0.0
151899 Wilmington, NC 2022-03-30 14.400000 6.200000 20.20 0.0 0.0
wdir wspd pres
0 143.0 5.766667 995.133333
35328 172.0 33.800000 1019.200000
86016 4.0 13.000000 1019.900000
59904 200.0 21.600000 1017.000000
66048 243.0 12.700000 1015.200000
... ... ... ...
151850 294.5 24.400000 998.000000
151889 227.0 19.700000 997.000000
151890 204.0 20.300000 996.400000
151891 129.0 10.800000 1020.400000
151899 154.0 16.400000 1021.900000
Error
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
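For the grouping part of the question: DataFrame.groupby('station') visits each unique station name exactly once and hands you that station's rows as a sub-frame, which is where the per-station ARIMA fit would go. Below is a minimal sketch; the weather frame and its values are hypothetical stand-ins for wd.weather_data, and the ARIMA fit itself is left as a comment so the grouping logic is the focus:

```python
import pandas as pd

# Hypothetical stand-in for wd.weather_data.
weather = pd.DataFrame({
    "station": ["Anchorage, AK", "Anchorage, AK", "Key West, FL", "Key West, FL"],
    "date": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-01", "2018-01-02"]),
    "prcp": [0.3, 0.0, 0.0, 1.2],
})

forecasts = {}
for station, grp in weather.groupby("station"):
    # One daily precipitation series per unique station name.
    series = grp.sort_values("date").set_index("date")["prcp"].asfreq("D")
    # model_fit = ARIMA(series, order=(2, 1, 1)).fit()
    # forecasts[station] = model_fit.forecast(steps=30)
    forecasts[station] = series  # placeholder for the fitted forecast

print(sorted(forecasts))  # ['Anchorage, AK', 'Key West, FL']
```

Giving each per-station series a proper DatetimeIndex with a frequency (asfreq("D")) also avoids the KeyError above: predict/forecast can then resolve start and end against the index.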

Create new columns based on previous columns with multiplication

I want to create a set of columns where each new column is the previous column times 1.5, rolling forward until year 2020. I tried to use previous and current but it didn't work as expected. How can I make it work?
df = pd.DataFrame({
    'us2000': [5, 3, 6, 9, 2, 4],
}); df

a = []
for i in range(1, 21):
    a.append("us202" + str(i))

for previous, current in zip(a, a[1:]):
    df[current] = df[previous] * 1.5
IIUC you can fix your code with:
a = []
for i in range(0, 21):
    a.append(f'us20{i:02}')

for previous, current in zip(a, a[1:]):
    df[current] = df[previous] * 1.5
Another, vectorial, approach with numpy would be:
df2 = pd.DataFrame(df['us2000'].to_numpy()[:, None] * 1.5**np.arange(21),
                   columns=[f'us20{i:02}' for i in range(21)])
output:
us2000 us2001 us2002 us2003 us2004 us2005 us2006 us2007 ...
0 5 7.5 11.25 16.875 25.3125 37.96875 56.953125 85.429688
1 3 4.5 6.75 10.125 15.1875 22.78125 34.171875 51.257812
2 6 9.0 13.50 20.250 30.3750 45.56250 68.343750 102.515625
3 9 13.5 20.25 30.375 45.5625 68.34375 102.515625 153.773438
4 2 3.0 4.50 6.750 10.1250 15.18750 22.781250 34.171875
5 4 6.0 9.00 13.500 20.2500 30.37500 45.562500 68.343750
Try:
for i in range(1, 21):
    df[f"us{2000 + i}"] = df[f"us{2000 + i - 1}"].mul(1.5)
>>> df
us2000 us2001 us2002 ... us2018 us2019 us2020
0 5 7.5 11.25 ... 7389.45940 11084.18910 16626.283650
1 3 4.5 6.75 ... 4433.67564 6650.51346 9975.770190
2 6 9.0 13.50 ... 8867.35128 13301.02692 19951.540380
3 9 13.5 20.25 ... 13301.02692 19951.54038 29927.310571
4 2 3.0 4.50 ... 2955.78376 4433.67564 6650.513460
5 4 6.0 9.00 ... 5911.56752 8867.35128 13301.026920
[6 rows x 21 columns]
pd.DataFrame(df.to_numpy() * [1.5**i for i in range(0, 21)]) \
  .rename(columns=lambda x: str(x).rjust(2, '0')).add_prefix("us20")
output:
us2000 us2001 us2002 ... us2018 us2019 us2020
0 5 7.5 11.25 ... 7389.45940 11084.18910 16626.283650
1 3 4.5 6.75 ... 4433.67564 6650.51346 9975.770190
2 6 9.0 13.50 ... 8867.35128 13301.02692 19951.540380
3 9 13.5 20.25 ... 13301.02692 19951.54038 29927.310571
4 2 3.0 4.50 ... 2955.78376 4433.67564 6650.513460
5 4 6.0 9.00 ... 5911.56752 8867.35128 13301.026920
[6 rows x 21 columns]

Column has dtype object, cannot use method 'nlargest' with this dtype

I'm using Google Colab and I want to analyze a file from Google Sheets using pandas. I imported it successfully and can print it out as a pd.DataFrame:
data_tablet = gc.open_by_url(f'https://docs.google.com/spreadsheets/d/{sheet_id}/edit#gid={tablet_gid}')
tablet_var = data_tablet.worksheet('tablet')
tablet_data = tablet_var.get_all_records()
df_tablet = pd.DataFrame(tablet_data)
print(df_tablet)
name 1st quarter ... 4th quarter total
0 Albendazol 400 mg 18.0 ... 60.0 78
1 Alopurinol 100 mg 125.0 ... 821.0 946
2 Ambroksol 30 mg 437.0 ... 798.0 1,235.00
3 Aminofilin 200 mg 70.0 ... 522.0 592
4 Amitriptilin 25 mg 83.0 ... 178.0 261
.. ... ... ... ... ...
189 Levoflaksin 250 mg 611.0 ... 822.0 1,433.00
190 Linezolid 675.0 ... 315.0 990
191 Moxifloxacin 400 mg 964.0 ... 99.0 1,063.00
192 Pyrazinamide 500 mg 395.0 ... 189.0 584
193 Vitamin B 6 330.0 ... 825.0 1,155.00
[194 rows x 6 columns]
I want to select the top 10 of the 194 items by total, but it did not work.
Selecting the top 10 by total with the command below, I get cannot use method 'nlargest' with this dtype:
# Ambil data 10 terbesar dari 194 item
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-a7295330f7a9> in <module>()
1 # Ambil data 10 terbesar dari 194 item
----> 2 df_tablet_top10 = df_tablet.nlargest(10, 'total')
3 print(df_tablet_top10)
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/algorithms.py in compute(self, method)
1273 if not self.is_valid_dtype_n_method(dtype):
1274 raise TypeError(
-> 1275 f"Column {repr(column)} has dtype {dtype}, "
1276 f"cannot use method {repr(method)} with this dtype"
1277 )
TypeError: Column 'total' has dtype object, cannot use method 'nlargest' with this dtype
But when I select it from 1st quarter it works just fine
df_tablet_top10 = df_tablet.nlargest(10, '1st quarter')
print(df_tablet_top10)
nama 1st quarter ... 4th quarter total
154 Salbutamol 4 mg 981.0 ... 23.0 1,004.00
74 MDT FB dewasa (obat kusta) 978.0 ... 910.0 1,888.00
155 Paracetamol 500 mg Tablet 976.0 ... 503.0 1,479.00
33 Furosemid 40 mg 975.0 ... 524.0 1,499.00
23 Deksametason 0,5 mg 972.0 ... 793.0 1,765.00
21 Bisakodil (dulkolax) 5 mg 970.0 ... 798.0 1,768.00
191 Moxifloxacin 400 mg 964.0 ... 99.0 1,063.00
85 Metronidazol 250 mg 958.0 ... 879.0 1,837.00
96 Nistatin 500.000 IU 951.0 ... 425.0 1,376.00
37 Glimepirid 2 mg 947.0 ... 890.0 1,837.00
[10 rows x 6 columns]
Any idea what causes this to happen?
Also, I have changed the format of the 1st quarter through total columns to Number in Google Sheets and it still did not work.
I found the solution, but not the explanation.
All I did was convert the total column to float with:
df_tablet['total'] = df_tablet['total'].astype(float)
df_tablet_top10 = df_tablet.nlargest(10, 'total')
print(df_tablet_top10)
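The explanation: get_all_records() returns the sheet's displayed values, so a formatted total such as "1,235.00" arrives as a string and the column gets dtype object, which nlargest refuses. Note that astype(float) cannot parse a thousands separator, so if your totals keep the comma you have to strip it first. A sketch with hypothetical values:

```python
import pandas as pd

# Strings as they might arrive from the spreadsheet, comma separator included.
df = pd.DataFrame({"total": ["78", "1,235.00", "946"]})

# Remove the thousands separator, then convert to a numeric dtype.
df["total"] = pd.to_numeric(df["total"].str.replace(",", "", regex=False))

print(df.nlargest(2, "total")["total"].tolist())  # [1235.0, 946.0]
```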

Pandas Seaborn Data labels showing as 0.00

I have two questions:
My data labels are showing as 0.00 and aren't matching my crosstab, and I don't know why...
Updating with the full code:
df = pd.read_csv('2018_ms_data_impact_only.csv', low_memory=False)
df.head()
StartDate EndDate Status IPAddress Progress duration Finished RecordedDate ResponseId RecipientLastName ... Gender LGBTQ Mobile organizing_interest Parent Policy policy_interest reg_to_vote unique_id Veteran
0 4/6/18 10:32 4/6/18 10:39 1 NaN 100 391 1 4/6/18 10:39 R_1liSDxRmTKDLFfT Mays ... Woman 0.0 4752122624 Currently in this field 1.0 NaN NaN 0.0 0034000001VAbTAAA1 0.0
1 4/9/18 6:31 4/9/18 6:33 1 NaN 100 160 1 4/9/18 6:33 R_0ezRf2zyaLwFDa1 Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
2 4/9/18 9:14 4/9/18 9:15 1 NaN 100 70 1 4/9/18 9:15 R_DeHh3DQ23uQZwLD Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
3 4/9/18 9:21 4/9/18 9:22 1 NaN 100 69 1 4/9/18 9:22 R_1CC0ckmyS7E1qs3 Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
4 4/9/18 9:28 4/9/18 9:29 1 NaN 100 54 1 4/9/18 9:29 R_01GuM5KqtHIgvEl Mays ... Woman 0.0 4752122684 2020 0.0 A 2020 0.0 0034000001W3tGOAAZ 0.0
def impact_action_yn_new(series):
    if series == 3:
        return 'No'
    elif series == 1:
        return 'Yes'

df['impact_action_yn_new'] = df['impact_action_yn'].apply(impact_action_yn_new)
df['impact_action_yn_new'].value_counts(sort=False)
# clean up engagement - collapse nan and 0, 2s
def engagement_new(series):
    if series == '0':
        return 'Null'
    elif series == 'NaN':
        return 'Null'
    elif series == '1':
        return '1'
    elif series == '2':
        return '2a'
    elif series == '2a':
        return '2a'
    elif series == '2b':
        return '2b'
    elif series == '3':
        return '3'
    elif series == '4':
        return '4'
    elif series == '5':
        return '5'

df['engagement_new'] = df['Engagement'].apply(engagement_new)
impact_action_table_eng = pd.crosstab(df.impact_action_yn_new,df.engagement_new)
print(impact_action_table_eng)
engagement_new 1 2a 2b 3 4 5 Null
impact_action_yn_new
No 676 508 587 683 172 31 1
Yes 410 405 303 671 357 237 1
# Crosstab: Impact YN x Engagement - Row percentages
impact_action_table_eng_rowperc = pd.crosstab(df.impact_action_yn_new,df.engagement_new).apply(lambda r: r/r.sum()*100, axis=1)
print(impact_action_table_eng_rowperc)
engagement_new 1 2a 2b 3 4 \
impact_action_yn_new
No 25.432656 19.112114 22.084274 25.696012 6.471031
Yes 17.197987 16.988255 12.709732 28.145973 14.974832
engagement_new 5 Null
impact_action_yn_new
No 1.166290 0.037622
Yes 9.941275 0.041946
#plot data
stacked_imp_eng_rowperc = impact_action_table_eng_rowperc.stack().reset_index().rename(columns={0: 'value'})
total = float(len(df))

#set fig size
fig, ax = plt.subplots(figsize=(15, 10))

#set style
sns.set_style('whitegrid')

#plot
ax = sns.barplot(x=stacked_imp_eng_rowperc.engagement_new,
                 y=stacked_imp_eng_rowperc.value,
                 hue=stacked_imp_eng_rowperc.impact_action_yn_new)

#plot legend
ax.legend(loc='center right', bbox_to_anchor=(.95, .9), ncol=1, fancybox=True, shadow=True)

#plot axis labels
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2.,
            height,
            '{:1.2f}'.format(height / total),
            ha="center")
ax.set(xlabel='Engagement Level', ylabel='% Reporting an Action within Last 12 Months');
I'm not sure why the data labels on the bar plot show as 0.00; the plot is built from the crosstab. Any thoughts?
Is there a way to convert the crosstab calculations to show as percentages? I'd like to plot those percentages instead of decimals.
Thanks for all your help!
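On the 0.00 labels: the crosstab was built with r/r.sum()*100, so each bar height is already a percentage; dividing by total = len(df) again shrinks it to a near-zero fraction that the two-decimal format displays as roughly 0.00. Formatting the raw height answers both questions at once. A small numeric illustration (the sample numbers are hypothetical):

```python
height = 25.432656  # a bar height taken from the row-percentage crosstab
total = 4042.0      # hypothetical len(df)

print('{:1.2f}'.format(height / total))  # '0.01' -- the tiny label seen on the plot
print('{:.1f}%'.format(height))          # '25.4%' -- label the bar height directly
```

In the plotting loop that means replacing '{:1.2f}'.format(height/total) with '{:.1f}%'.format(height).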

Python: Imported csv not being split into proper columns

I am importing a csv file into python using pandas, but the data frame ends up in a single column. I copied and pasted data from the comma-separated format of the Player Standing Field table at this link (the second one) into an Excel file and saved it as a csv (originally as MS-DOS, then both as normal and UTF-8 per the recommendation by AllthingsGo42). But it still returned a single-column data frame.
Examples of what I tried:
dataset = pd.read_csv('MLB2016PlayerStats2.csv')
dataset = pd.read_csv('MLB2016PlayerStats2.csv', delimiter=',')
dataset = pd.read_csv('MLB2016PlayerStats2.csv', encoding='ISO-8859-9',
                      delimiter=',')
Each line of code above returned:
Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary
1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2...
2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1...
3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,...
4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,1...
5,Cristhian Adames\adamecr01,24,COL,NL,69,43,3...
Also tried:
dataset = pd.read_csv('MLB2016PlayerStats2.csv', encoding='ISO-8859-9',
                      delimiter=',', quoting=3)
Which returned:
"Rk Name Age Tm Lg G GS CG Inn Ch
\
0 "1 Fernando Abad\abadfe01 30 TOT AL 57 0 0 46.2 4
1 "2 Jose Abreu\abreujo02 29 CHW AL 152 152 150 1355.2 1337
2 "3 A.J. Achter\achteaj01 27 LAA AL 27 0 0 37.2 6
3 "4 Dustin Ackley\ackledu01 28 NYY AL 23 16 10 140.1 97
4 "5 Cristhian Adames\adamecr01 24 COL NL 69 43 38 415.0 212
E DP Fld% Rtot Rtot/yr Rdrs Rdrs/yr RF/9 RF/G \
0 ... 0 1 1.000 NaN NaN NaN NaN 0.77 0.07
1 ... 10 131 0.993 -2.0 -2.0 -5.0 -4.0 8.81 8.73
2 ... 0 0 1.000 NaN NaN 0.0 0.0 1.43 0.22
3 ... 0 8 1.000 1.0 9.0 3.0 27.0 6.22 4.22
4 ... 6 24 0.972 -4.0 -12.0 1.0 3.0 4.47 2.99
Pos Summary"
0 P"
1 1B"
2 P"
3 1B-OF-2B"
4 SS-2B-3B"
Below is what the data looks like in notepad++
"Rk,Name,Age,Tm,Lg,G,GS,CG,Inn,Ch,PO,A,E,DP,Fld%,Rtot,Rtot/yr,Rdrs,Rdrs/yr,RF/9,RF/G,Pos Summary"
"1,Fernando Abad\abadfe01,30,TOT,AL,57,0,0,46.2,4,0,4,0,1,1.000,,,,,0.77,0.07,P"
"2,Jose Abreu\abreujo02,29,CHW,AL,152,152,150,1355.2,1337,1243,84,10,131,.993,-2,-2,-5,-4,8.81,8.73,1B"
"3,A.J. Achter\achteaj01,27,LAA,AL,27,0,0,37.2,6,2,4,0,0,1.000,,,0,0,1.43,0.22,P"
"4,Dustin Ackley\ackledu01,28,NYY,AL,23,16,10,140.1,97,89,8,0,8,1.000,1,9,3,27,6.22,4.22,1B-OF-2B"
"5,Cristhian Adames\adamecr01,24,COL,NL,69,43,38,415.0,212,68,138,6,24,.972,-4,-12,1,3,4.47,2.99,SS-2B-3B"
"6,Austin Adams\adamsau01,29,CLE,AL,19,0,0,18.1,1,0,0,1,0,.000,,,0,0,0.00,0.00,P"
Sorry for the confusion with my question before. I hope this edit will clear things up. Thank you to those that answered thus far.
Running it quickly myself, I was able to get what I understand is the desired output.
My only thought is that there is no need to specify a delimiter for a csv, because a csv is a comma-separated values file, but that should not matter. I suspect something is wrong with your actual data file, and I would go and make sure it is saved correctly. I would echo the previous comments and make sure the csv is saved as UTF-8, not MS-DOS or Macintosh (both options when saving in Excel).
Best of luck!
There is no need to specify a delimiter for a csv. You only have to change the separator from ";" to ",". To do this you can open your csv file in Notepad and change the separators with the replace tool.
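Looking at the Notepad++ view above, every physical line is wrapped in double quotes, so the parser sees one big quoted field per row; that explains both the single column and the stray quotes that appear with quoting=3. One way to recover is to strip the outer quotes from each line before parsing. A sketch using an inline two-row sample shaped like your data:

```python
import io
import pandas as pd

# Inline sample mimicking the quoted lines shown in Notepad++.
raw = '''"Rk,Name,Age,Tm"
"1,Fernando Abad\\abadfe01,30,TOT"
"2,Jose Abreu\\abreujo02,29,CHW"'''

# Strip the outer quotes from each line, then parse normally.
cleaned = "\n".join(line.strip().strip('"') for line in raw.splitlines())
dataset = pd.read_csv(io.StringIO(cleaned))

print(dataset.columns.tolist())  # ['Rk', 'Name', 'Age', 'Tm']
print(dataset.shape)             # (2, 4)
```

For the real file, read it with open(), clean the lines the same way, and feed the result to read_csv via io.StringIO.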
