Pandas df.ix not replacing - python

I have this dataframe
df1_9
date store_nbr item_nbr units station_nbr tavg preciptotal
8 2012-01-01 1 9 29 1 42 0.05
119 2012-01-02 1 9 60 1 41 0.01
...
452 2012-01-05 1 9 16 1 32 0.00
563 2012-01-06 1 9 12 1 36 T
I want to replace the 'T' in the preciptotal column with the value .01.
df1_9.ix[df1_9.preciptotal == 'T', 'preciptotal'] = 0.01
I wrote this code, but for some reason it is not working. I have been staring at this for a while, any help would be appreciated.
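One thing worth trying, as a sketch only: .ix was deprecated in pandas 0.20 and removed in 1.0, so .loc is the label-based replacement. If the comparison still matches nothing, a possible cause (an assumption, not something visible in the question) is that the stored value is not exactly 'T', for example because of surrounding whitespace. The small frame below is made-up stand-in data with a padded '  T' to illustrate that case.

```python
import pandas as pd

# Stand-in data; the padded "  T" is an assumption used to illustrate
# why an exact comparison with 'T' might match nothing.
df1_9 = pd.DataFrame({
    "units": [29, 60, 16, 12],
    "preciptotal": ["0.05", "0.01", "0.00", "  T"],
})

# Normalise the text first so that "  T" and "T" compare equal.
df1_9["preciptotal"] = df1_9["preciptotal"].astype(str).str.strip()

# .loc with a boolean mask replaces the removed .ix indexer.
df1_9.loc[df1_9["preciptotal"] == "T", "preciptotal"] = "0.01"

# Convert the whole column to floats once the trace marker is gone.
df1_9["preciptotal"] = pd.to_numeric(df1_9["preciptotal"])
print(df1_9)
```

If the column really does contain a bare 'T', the .loc line on its own should be enough.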

Related

Calculate a weighted average on a dataframe without using a loop in python

Let's say I have the following dataframe:
Individual stop x y z time
0 23 1 20 27 4 21
1 23 2 23 24 13 63
2 1756 2 5 41 73 12
3 1756 3 7 42 72 6
4 1756 4 4.5 39 72 45
5 1756 4 3 50 73 98
6 2153 2 121 12 6 33
7 2153 3 122.5 2 6 0
8 3276 1 54 33 -12 0
9 5609 1 -2 44 -32 56
10 5609 2 8 44 -32 23
11 5609 5 102 -23 16 76
I would like to calculate the average of the position x, y, z weighted by the time for each Individual. I would like to then put the results in a new dataframe like this:
Individual bar_x bar_y bar_z
0 23 22.25 24.75 10.75
2 1756 3.72 45.96 72.68
6 2153 121 12 6
9 5609 50.48 11.15 -8.46
I have done this with a loop that goes through every Individual and calculates the weighted average. It works well, but the running time is VERY long when the dataframe gets bigger. I am pretty sure there is a much faster solution using pandas, but I haven't found it yet. Any ideas, please?
Thanks in advance!
Is this what you are looking for?
(df[['x','y','z']].mul(df.groupby('Individual')['time']
.transform(lambda x: x.div(x.sum())).fillna(0),axis=0)
.groupby(df['Individual'])
.sum()
.round(2))
Output:
x y z
Individual
23 22.25 24.75 10.75
1756 3.72 45.96 72.68
2153 121.00 12.00 6.00
3276 0.00 0.00 0.00
5609 50.48 11.15 -8.46
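The same weighted average can also be written as "sum of value * time per group, divided by sum of time per group", which some readers find easier to follow. This is a sketch of that alternative, not the answer's exact method; the variable names and the bar_ column names are chosen here for illustration. One behavioural difference: an Individual whose total time is 0 (3276 here) comes out as NaN rather than 0.

```python
import pandas as pd

df = pd.DataFrame({
    "Individual": [23, 23, 1756, 1756, 1756, 1756, 2153, 2153, 3276, 5609, 5609, 5609],
    "x": [20, 23, 5, 7, 4.5, 3, 121, 122.5, 54, -2, 8, 102],
    "y": [27, 24, 41, 42, 39, 50, 12, 2, 33, 44, 44, -23],
    "z": [4, 13, 73, 72, 72, 73, 6, 6, -12, -32, -32, 16],
    "time": [21, 63, 12, 6, 45, 98, 33, 0, 0, 56, 23, 76],
})

cols = ["x", "y", "z"]

# Numerator: per-group sum of (coordinate * time).
weighted_sums = df[cols].mul(df["time"], axis=0).groupby(df["Individual"]).sum()

# Denominator: per-group sum of time.
total_time = df.groupby("Individual")["time"].sum()

# Row-wise division aligned on the Individual index; 0 / 0 gives NaN.
bar = weighted_sums.div(total_time, axis=0).round(2)
bar.columns = ["bar_x", "bar_y", "bar_z"]
print(bar)
```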

Create new Pandas columns using the value from previous row

I need to create two new Pandas columns using the logic and value from the previous row.
I have the following data:
Day Vol Price Income Outgoing
1 499 75
2 3233 90
3 1812 70
4 2407 97
5 3474 82
6 1057 53
7 2031 68
8 304 78
9 1339 62
10 2847 57
11 3767 93
12 1096 83
13 3899 88
14 4090 63
15 3249 52
16 1478 52
17 4926 75
18 1209 52
19 1982 90
20 4499 93
My challenge is to come up with logic where the Income and Outgoing columns (which are currently empty) hold the value of (Vol * Price).
The Income column should carry this value when the previous day's "Price" is lower than the present day's. The Outgoing column should carry this value when the previous day's "Price" is higher than the present day's. Everywhere else, the Income and Outgoing columns should just hold NaN. If the Price is unchanged, that day's row is to be dropped.
The entire logic should start on day (n + 1): the first row should be skipped and the logic applied from row 2 onwards.
I have tried using shift in my code example such as:
if sample_data['Price'].shift(1) < sample_data['Price'].shift(2)):
sample_data['Income'] = sample_data['Vol'] * sample_data['Price']
else:
sample_data['Outgoing'] = sample_data['Vol'] * sample_data['Price']
But it isn't working.
I feel there should be a simpler, more comprehensive way to go about this. Could someone please help?
Update (The final output should look like this):
For day 16, the data is deleted because we have two similar prices for day 15 and 16.
I'd calculate the product and the mask separately, and then update the cols:
In [11]: vol_price = df["Vol"] * df["Price"]
In [12]: incoming = df["Price"].diff() < 0
In [13]: df.loc[incoming, "Income"] = vol_price
In [14]: df.loc[~incoming, "Outgoing"] = vol_price
In [15]: df
Out[15]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 NaN 290970.0
2 3 1812 70 126840.0 NaN
3 4 2407 97 NaN 233479.0
4 5 3474 82 284868.0 NaN
5 6 1057 53 56021.0 NaN
6 7 2031 68 NaN 138108.0
7 8 304 78 NaN 23712.0
8 9 1339 62 83018.0 NaN
9 10 2847 57 162279.0 NaN
10 11 3767 93 NaN 350331.0
11 12 1096 83 90968.0 NaN
12 13 3899 88 NaN 343112.0
13 14 4090 63 257670.0 NaN
14 15 3249 52 168948.0 NaN
15 16 1478 52 NaN 76856.0
16 17 4926 75 NaN 369450.0
17 18 1209 52 62868.0 NaN
18 19 1982 90 NaN 178380.0
19 20 4499 93 NaN 418407.0
Or, if Income is meant for days when the price rose (which is how the question reads), the other way around:
In [21]: incoming = df["Price"].diff() > 0
In [22]: df.loc[incoming, "Income"] = vol_price
In [23]: df.loc[~incoming, "Outgoing"] = vol_price
In [24]: df
Out[24]:
Day Vol Price Income Outgoing
0 1 499 75 NaN 37425.0
1 2 3233 90 290970.0 NaN
2 3 1812 70 NaN 126840.0
3 4 2407 97 233479.0 NaN
4 5 3474 82 NaN 284868.0
5 6 1057 53 NaN 56021.0
6 7 2031 68 138108.0 NaN
7 8 304 78 23712.0 NaN
8 9 1339 62 NaN 83018.0
9 10 2847 57 NaN 162279.0
10 11 3767 93 350331.0 NaN
11 12 1096 83 NaN 90968.0
12 13 3899 88 343112.0 NaN
13 14 4090 63 NaN 257670.0
14 15 3249 52 NaN 168948.0
15 16 1478 52 NaN 76856.0
16 17 4926 75 369450.0 NaN
17 18 1209 52 NaN 62868.0
18 19 1982 90 178380.0 NaN
19 20 4499 93 418407.0 NaN
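Note that ~incoming also selects the first row (where diff() is NaN) and any day with an unchanged price, so both outputs above still fill Outgoing for day 1 and day 16. A possible variation, sketched under my reading of the question, uses two separate masks so that those rows get NaN in both columns, and then drops the unchanged-price rows. Keeping day 1 as an all-NaN row rather than dropping it is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "Day": range(1, 21),
    "Vol": [499, 3233, 1812, 2407, 3474, 1057, 2031, 304, 1339, 2847,
            3767, 1096, 3899, 4090, 3249, 1478, 4926, 1209, 1982, 4499],
    "Price": [75, 90, 70, 97, 82, 53, 68, 78, 62, 57,
              93, 83, 88, 63, 52, 52, 75, 52, 90, 93],
})

vol_price = df["Vol"] * df["Price"]
change = df["Price"].diff()  # NaN for the first row, 0 when the price is unchanged

# Two explicit masks: neither is True for the first row or for unchanged prices.
df["Income"] = vol_price.where(change > 0)    # price rose versus the previous day
df["Outgoing"] = vol_price.where(change < 0)  # price fell versus the previous day

# Drop the days whose price equals the previous day's price (day 16 here).
result = df[change != 0]
print(result)
```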

apply lambda function using group by and using previous row value

I have the following data frame. For each date and hour, I want to create a new column "result" such that if the value in column "B" is >= 0 it uses the value in column A; otherwise it uses the maximum of 0 and the previous row's value in column B.
Date Hour A B result
1/1/2018 1 5 95 5
1/1/2018 1 16 79 16
1/1/2018 1 85 -6 79
1/1/2018 1 12 -18 0
1/1/2018 2 17 43 17
1/1/2018 2 17 26 17
1/1/2018 2 16 10 16
1/1/2018 2 142 -132 10
1/1/2018 2 10 -142 0
I tried grouping by date and hour and then applying a lambda function using shift but I got an error:
df['result'] = df.groupby(['Date','Hour']).apply(lambda x: x['A'] if x['B'] >= 0 else np.maximum(0, x['B'].shift(1)), axis = 1)
Use np.where. The groupby is only necessary when shifting "B", so you can vectorise this operation without using apply.
df['result'] = np.where(
    df.B >= 0,
    df.A,
    df.groupby(['Date', 'Hour'])['B'].shift().clip(lower=0))
df
Date Hour A B result
0 1/1/2018 1 5 95 5.0
1 1/1/2018 1 16 79 16.0
2 1/1/2018 1 85 -6 79.0
3 1/1/2018 1 12 -18 0.0
4 1/1/2018 2 17 43 17.0
5 1/1/2018 2 17 26 17.0
6 1/1/2018 2 16 10 16.0
7 1/1/2018 2 142 -132 10.0
8 1/1/2018 2 10 -142 0.0
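One edge case the sample data does not exercise: within each (Date, Hour) group, shift() returns NaN for the first row, so a group that starts with a negative B would get NaN in "result". The sketch below uses made-up rows to show this and fills the gap with 0; treating "no previous row" as 0 is an assumption, not something the question specifies.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first row of the Hour 1 group has a negative B.
df = pd.DataFrame({
    "Date": ["1/1/2018"] * 4,
    "Hour": [1, 1, 2, 2],
    "A": [5, 85, 17, 142],
    "B": [-3, -6, 43, -132],
})

# Previous B within each (Date, Hour) group; NaN at the start of a group.
prev_b = df.groupby(["Date", "Hour"])["B"].shift()

# Assumption: a missing previous row counts as 0.
fallback = prev_b.clip(lower=0).fillna(0)

df["result"] = np.where(df["B"] >= 0, df["A"], fallback)
print(df)
```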

combining columns for a new date format in python

I'm new to Python, so any help or advice is much appreciated, and sorry if I'm asking very obvious things.
I have the following data:
WMO_NO YEAR MONTH DAY HOUR MINUTE H PS T RH TD WDIR WSP
0 4018 2006 1 1 11 28 38 988.6 0.9 98 0.6 120 14.4
1 4018 2006 1 1 11 28 46 987.6 0.5 91 -0.7 122 15.0
2 4018 2006 1 1 11 28 57 986.3 0.5 89 -1.1 124 15.5
3 4018 2006 1 1 11 28 66 985.1 0.5 90 -1.1 126 16.0
4 4018 2006 1 1 11 28 74 984.1 0.4 90 -1.1 127 16.5
I would like to combine the YEAR, MONTH, DAY, HOUR and MINUTE columns into a new column formatted as YEAR:MONTH:DAY:HOUR:MINUTE (and then index the T data with this column) and do some analysis.
My first question is: how do I create such a new column? The second is: can I do comparisons and analysis on this column, such as (YEAR:MONTH:DAY:HOUR:MINUTE > 2007:04:13:04:44)?
Cheers.
You can use to_datetime and then, if necessary, Series.dt.strftime with a custom format (see http://strftime.org/):
df['date'] = pd.to_datetime(df[['YEAR','MONTH','DAY','HOUR','MINUTE']])
df['date_new'] = df['date'].dt.strftime('%Y:%m:%d:%H:%M')
print (df)
WMO_NO YEAR MONTH DAY HOUR MINUTE H PS T RH TD WDIR \
0 4018 2006 1 1 11 28 38 988.6 0.9 98 0.6 120
1 4018 2006 1 1 11 28 46 987.6 0.5 91 -0.7 122
2 4018 2006 1 1 11 28 57 986.3 0.5 89 -1.1 124
3 4018 2006 1 1 11 28 66 985.1 0.5 90 -1.1 126
4 4018 2006 1 1 11 28 74 984.1 0.4 90 -1.1 127
WSP date date_new
0 14.4 2006-01-01 11:28:00 2006:01:01:11:28
1 15.0 2006-01-01 11:28:00 2006:01:01:11:28
2 15.5 2006-01-01 11:28:00 2006:01:01:11:28
3 16.0 2006-01-01 11:28:00 2006:01:01:11:28
4 16.5 2006-01-01 11:28:00 2006:01:01:11:28
If your data consists of integers instead of strings you can use this to create a datetime index:
import pandas as pd
import datetime as dt
columns = ['ID', 'Year', 'Month', 'Day', 'Hour', 'Minute']
data = [ ['1', 2006, 1, 1, 11, 28],
['2', 2006, 1, 1, 11, 29]]
df = pd.DataFrame(data=data, columns=columns)
df.index = df.apply(lambda x: dt.datetime(x['Year'], x['Month'], x['Day'], x['Hour'], x['Minute']), axis=1)
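On the second part of the question (comparisons), a datetime column or index can be compared directly against a timestamp, so the custom string format is only needed for display. A minimal sketch with made-up rows; the column names follow the question, and the values are invented for illustration:

```python
import pandas as pd

# Made-up rows using the question's column names.
df = pd.DataFrame({
    "YEAR": [2006, 2007, 2007],
    "MONTH": [1, 4, 4],
    "DAY": [1, 13, 13],
    "HOUR": [11, 4, 5],
    "MINUTE": [28, 44, 0],
    "T": [0.9, 0.5, 0.4],
})

# Assemble a datetime from the component columns and use it as the index.
df["date"] = pd.to_datetime(df[["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]])
df = df.set_index("date")

# Equivalent of "YEAR:MONTH:DAY:HOUR:MINUTE > 2007:04:13:04:44".
later = df[df.index > pd.Timestamp("2007-04-13 04:44")]
print(later["T"])
```

Comparing the formatted strings would only work because every field is zero-padded; comparing real datetimes avoids relying on that.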

converting an HTML table in Pandas Dataframe

I am reading an HTML table with pd.read_html, but the result comes back as a list. I want to convert it into a pandas dataframe so I can continue with further operations on it. I am using the following script:
import pandas as pd
import html5lib
data=pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)
and since my results come back as one list, I tried to convert it into a data frame with
data1=pd.DataFrame(data)
and result came as
0
0 0 1 2 3 4...
and because the result is a list, I can't apply any functions such as rename, dropna, drop.
I would appreciate any help.
I think you need to add [0] to select the first item of the list, because read_html returns a list of DataFrames.
So you can use:
import pandas as pd
data1 = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)[0]
print (data1)
0 1 2 3 4 5 6 7 8 9 \
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
1 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
2 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
3 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
4 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
5 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
6 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
7 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
8 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
9 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
10 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
11 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
12 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
13 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
14 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
15 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
16 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
17 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94
18 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87
19 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85
20 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01
21 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91
22 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87
23 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
24 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
25 21 Max Pacioretty, LW MTL 80 37 30 67 38 32 0.84
26 NaN Logan Couture, C SJ 82 27 40 67 -6 12 0.82
27 23 Jonathan Toews, C CHI 81 28 38 66 30 36 0.81
28 NaN Erik Karlsson, D OTT 82 21 45 66 7 42 0.80
29 NaN Henrik Zetterberg, LW DET 77 17 49 66 -6 32 0.86
30 26 Pavel Datsyuk, C DET 63 26 39 65 12 8 1.03
31 NaN Joe Thornton, C SJ 78 16 49 65 -4 30 0.83
32 28 Nikita Kucherov, RW TB 82 28 36 64 38 37 0.78
33 NaN Patrick Kane, RW CHI 61 27 37 64 10 10 1.05
34 NaN Mark Stone, RW OTT 80 26 38 64 21 14 0.80
35 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
36 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
37 NaN Alexander Steen, LW STL 74 24 40 64 8 33 0.86
38 NaN Kyle Turris, C OTT 82 24 40 64 5 36 0.78
39 NaN Johnny Gaudreau, LW CGY 80 24 40 64 11 14 0.80
40 NaN Anze Kopitar, C LA 79 16 48 64 -2 10 0.81
41 35 Radim Vrbata, RW VAN 79 31 32 63 6 20 0.80
42 NaN Jaden Schwartz, LW STL 75 28 35 63 13 16 0.84
43 NaN Filip Forsberg, C NSH 82 26 37 63 15 24 0.77
44 NaN Jordan Eberle, RW EDM 81 24 39 63 -16 24 0.78
45 NaN Ondrej Palat, LW TB 75 16 47 63 31 24 0.84
46 40 Zach Parise, LW MIN 74 33 29 62 21 41 0.84
10 11 12 13 14 15 16
0 SOG PCT GWG G A G A
1 253 13.8 6 10 13 2 3
2 278 13.7 8 13 18 0 1
3 237 11.8 3 10 21 0 0
4 395 13.4 11 25 9 0 0
5 221 10.0 3 11 22 0 0
6 153 11.8 3 3 30 0 0
7 280 13.2 5 13 16 0 0
8 158 19.6 5 6 10 0 0
9 226 8.9 5 4 21 0 0
10 264 14.0 6 8 10 0 0
11 NaN NaN NaN NaN NaN NaN NaN
12 SOG PCT GWG G A G A
13 182 17.0 3 11 15 0 0
14 279 9.0 4 14 23 0 0
15 101 17.8 0 5 20 0 0
16 268 16.0 6 13 12 0 0
17 203 14.3 6 8 9 0 0
18 202 12.9 0 7 19 2 0
19 261 14.2 5 19 12 0 0
20 212 13.2 4 9 17 0 0
21 191 13.1 6 3 10 0 2
22 304 13.8 8 6 6 4 1
23 NaN NaN NaN NaN NaN NaN NaN
24 SOG PCT GWG G A G A
25 302 12.3 10 7 4 3 2
26 263 10.3 4 6 18 2 0
27 192 14.6 7 6 11 2 1
28 292 7.2 3 6 24 0 0
29 227 7.5 3 4 24 0 0
30 165 15.8 5 8 16 0 0
31 131 12.2 0 4 18 0 0
32 190 14.7 2 2 13 0 0
33 186 14.5 5 6 16 0 0
34 157 16.6 6 5 8 1 0
35 NaN NaN NaN NaN NaN NaN NaN
36 SOG PCT GWG G A G A
37 223 10.8 5 8 16 0 0
38 215 11.2 6 4 12 1 0
39 167 14.4 4 8 13 0 0
40 134 11.9 4 6 18 0 0
41 267 11.6 7 12 11 0 0
42 184 15.2 4 8 8 0 2
43 237 11.0 6 6 13 0 0
44 183 13.1 2 6 15 0 0
45 139 11.5 5 3 8 1 1
46 259 12.7 3 11 5 0 0
If your dataframe ends up with columns indexed as 0, 1, 2, etc. and the headings in the first row (as above), just specify that the column names are in the first row with header=0.
Without this, pandas may see a mix of data types - text in row 1 and numbers in the rest and cast the column as object rather than, say, int64.
Full line would be:
data1 = pd.read_html(url, skiprows=1, header=0)[0]
[0] is the first table in the list of possible tables.
There are options for handling NA values as well. Check out the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
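Even with header=0, the printed table above still contains repeated "RK PLAYER ..." header rows and "PP SH" group rows in the body. A possible clean-up is sketched below on a small hand-built stand-in for the scraped frame, so it runs without network access; the reduced column set and the two filter conditions are assumptions based on the printed output, not on the live page.

```python
import pandas as pd

# Hand-built stand-in for the scraped frame: the header appears as a data
# row, is repeated mid-table, and a "PP SH" group row is mixed in.
raw = pd.DataFrame([
    ["RK", "PLAYER", "TEAM", "GP", "PTS"],
    ["1", "Jamie Benn, LW", "DAL", "82", "87"],
    [None, "PP", "SH", None, None],
    ["RK", "PLAYER", "TEAM", "GP", "PTS"],
    [None, "Nick Foligno, LW", "CBJ", "79", "73"],
])

# Promote the first row to column names (what header=0 does at read time).
table = raw.iloc[1:].copy()
table.columns = raw.iloc[0].tolist()

# Drop repeated header rows and the "PP SH" group rows.
is_repeated_header = table["PLAYER"] == "PLAYER"
is_group_row = table["PLAYER"] == "PP"
clean = table[~(is_repeated_header | is_group_row)].reset_index(drop=True).copy()

# The stats were read as text because of the mixed rows; convert them back.
clean[["GP", "PTS"]] = clean[["GP", "PTS"]].apply(pd.to_numeric)
print(clean)
```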
I know this is late, but here's a better way...
I noticed that the DataFrames in the list are all part of the same table/dataset you are trying to analyze, so instead of breaking them up and then merging them together, a better solution is to concatenate the list of DataFrames.
Check out the results of this code:
df = pd.concat(pd.read_html('https://www.espn.com/nhl/stats/player/_/view/goaltending'),axis=1)
output:
df.head(1)
index RK Name POS GP W L OTL GA/G SA GA SV SV% SO TOI PIM SOSA SOS SOS%
0 1 Igor ShesterkinNYR G 53 36 13 4 2.07 1622 106 1516 0.935 6 3070:32 2 28 20 0.714
