pandas dataframe in def - python

I tried the code below to pass a DataFrame to a function.
The first step, df.dropna, works fine.
However, df.replace does not make the replacement I expected.
def Max(df):
    df.dropna(subset=df.columns[3:10], inplace=True)
    print(df)
    df.replace(to_replace=65535, value=-10, inplace=True)
    print(df)
    return df
Does anyone know what the issue is and how to solve it?

Your code works well. Maybe try this version without inplace modifications:
>>> df
       A      B      C   D     E   F   G   H   I     J
0      1      2      3   4   5.0   6   7   8   9  10.0
1     11  65535     13  14  15.0  16  17  18  19  20.0
2     21     22     23  24  25.0  26  27  28  29   NaN
3  65535     32     33  34   NaN  36  37  38  39  40.0
4     41     42  65535  44  45.0  46  47  48  49  50.0
5     51     52     53  54  55.0  56  57  58  59  60.0
def Max(df):
    return df.dropna(subset=df.columns[3:10]).replace(65535, -10)
>>> Max(df)
    A   B   C   D     E   F   G   H   I     J
0   1   2   3   4   5.0   6   7   8   9  10.0
1  11 -10  13  14  15.0  16  17  18  19  20.0
4  41  42 -10  44  45.0  46  47  48  49  50.0
5  51  52  53  54  55.0  56  57  58  59  60.0
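One way to check the behaviour in isolation is a small, self-contained run of the non-inplace version. The sketch below is only an illustration: the five-column frame is made up (so `df.columns[3:10]` covers just D and E), and the function name `clean` is hypothetical. It also shows that the caller's frame is left untouched.

```python
import numpy as np
import pandas as pd

def clean(df):
    """Drop rows with NaN in columns 3..9, then map 65535 to -10.

    Returns a new frame; the frame passed in is not modified.
    """
    return df.dropna(subset=df.columns[3:10]).replace(65535, -10)

# Made-up sample data: row 1 has a NaN in E, so it is dropped.
df = pd.DataFrame({
    "A": [1, 65535, 3],
    "B": [4, 5, 6],
    "C": [7, 8, 65535],
    "D": [1.0, 2.0, 3.0],
    "E": [1.0, np.nan, 3.0],
})

result = clean(df)
print(result)
```

Because nothing is modified in place, the caller has to use the returned value (for example `df = clean(df)`).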

Related

how to split an integer value from one column to two columns in text file using pandas or numpy (python)

I have a text file which has a number of integer values like this.
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0 4 5 2
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27 34 54 11
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69 66 87 14
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67 82 92 17
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49 41 53 12
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47 29 36 21
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34 27 64 7
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10 7 11 1
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2 6 6 10
I have to build a file by merging several files like this, but there is a problem with the data.
In lines 4 and 5, the first values, 1017 and 1106, run directly into the end date of the period index with no separating space.
When I read these two lines, I always get the result below.
The first value after the index columns is not recognized as a value of its own; it is absorbed into the index, and the remaining values shift one column to the left.
In [14]: fw.iloc[80,:]
Out[14]:
3 72.0
4 46.0
5 52.0
6 29.0
7 29.0
8 22.0
9 204.0
10 41.0
11 46.0
12 51.0
13 57.0
14 67.0
15 82.0
16 92.0
17 17.0
18 NaN
Name: (20180722, 201807281017), dtype: float64
I tried to correct it with indexing but failed.
The desired result is:
In [14]: fw.iloc[80,:]
Out[14]:
2 1017.0
3 110.0
4 72.0
5 46.0
6 52.0
7 29.0
8 29.0
9 22.0
10 204.0
11 41.0
12 46.0
13 51.0
14 57.0
15 67.0
16 82.0
17 92.0
18 17.0
Name: (20180722, 201807281017), dtype: float64
How can I solve this problem?
I used this code to read the file:
fw = pd.read_csv('warm_patient.txt', index_col=[0,1], header=None, delim_whitespace=True)
A better fit for this would be pandas.read_fwf. For your example:
df = pd.read_fwf(filename, index_col=[0,1], header=None, widths=2*[10]+17*[4])
I don't know if the column widths can be inferred for all your data or need to be hardcoded.
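As a rough illustration of how explicit widths separate a fused value, here is a minimal sketch on a two-line in-memory sample. The widths `[9, 8, 4, 4]` are an assumption chosen to fit this made-up sample (an 8-digit date plus one space, an 8-digit date, then 4-character fields); they are not the widths of the real file, which would need to be checked against the actual layout.

```python
from io import StringIO

import pandas as pd

# Two sample lines; in the second, 1017 runs directly into the end date.
raw = (
    "20180715 20180721 654  52\n"
    "20180722 201807281017 110\n"
)

# Assumed layout for this sample only: 9 + 8 + 4 + 4 characters.
fw = pd.read_fwf(StringIO(raw), widths=[9, 8, 4, 4], header=None)
fw = fw.set_index([0, 1])
print(fw)
```

Because the split is made by character position rather than by whitespace, the missing space between 20180728 and 1017 no longer matters.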
One possibility is to construct the DataFrame manually; that way we can parse the text by splitting the values every 4 characters.
from textwrap import wrap
import pandas as pd
def read_file(f_name):
    data = []
    with open(f_name) as f:
        for line in f.readlines():
            idx1 = line[0:8]
            idx2 = line[10:18]
            points = map(lambda x: int(x.replace(" ", "")), wrap(line.rstrip()[18:], 4))
            data.append([idx1, idx2, *points])
    return pd.DataFrame(data).set_index([0, 1])
It could be made somewhat more efficient (in particular if this is a particularly long text file), but here's one solution.
fw = pd.read_csv('test.txt', header=None, delim_whitespace=True)
for i in fw[pd.isna(fw.iloc[:,-1])].index:
    num_str = str(fw.iat[i,1])
    a,b = map(int,[num_str[:-4],num_str[-4:]])
    fw.iloc[i,3:] = fw.iloc[i,2:-1]
    fw.iloc[i,:3] = [fw.iat[i,0],a,b]
fw = fw.set_index([0,1])
The result of print(fw) from there is
2 3 4 5 6 7 8 9 10 11 12 13 14 15 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1 0
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9 27
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38 69
20180722 20180728 1017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 20180804 1106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38 47
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31 34
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8 2
16 17 18
0 1
20180701 20180707 4 5 2.0
20180708 20180714 34 54 11.0
20180715 20180721 66 87 14.0
20180722 20180728 82 92 17.0
20180729 20180804 41 53 12.0
20180805 20180811 29 36 21.0
20180812 20180818 27 64 7.0
20180819 20180825 7 11 1.0
20180826 20180901 6 6 10.0
Here's the result of the print after applying your initial solution of fw = pd.read_csv('test.txt', index_col=[0,1], header=None, delim_whitespace=True) for comparison.
2 3 4 5 6 7 8 9 10 11 12 13 14 \
0 1
20180701 20180707 52 11 1 2 4 1 0 0 10 7 1 3 1
20180708 20180714 266 8 19 3 2 9 7 25 20 17 12 9 9
20180715 20180721 654 52 34 31 20 16 12 25 84 31 38 37 38
20180722 201807281017 110 72 46 52 29 29 22 204 41 46 51 57 67
20180729 201808041106 276 37 11 87 20 10 8 284 54 54 72 38 49
20180805 20180811 624 78 19 15 55 16 8 9 172 15 31 35 38
20180812 20180818 488 63 17 7 26 10 9 7 116 17 14 39 31
20180819 20180825 91 4 7 0 4 5 1 3 16 3 4 5 10
20180826 20180901 49 2 2 1 0 4 0 1 2 0 1 4 8
15 16 17 18
0 1
20180701 20180707 0 4 5 2.0
20180708 20180714 27 34 54 11.0
20180715 20180721 69 66 87 14.0
20180722 201807281017 82 92 17 NaN
20180729 201808041106 41 53 12 NaN
20180805 20180811 47 29 36 21.0
20180812 20180818 34 27 64 7.0
20180819 20180825 10 7 11 1.0
20180826 20180901 2 6 6 10.0

Rolling average across several columns and rows

import random
import pandas as pd
random.sample(range(1, 100), 10)
df = pd.DataFrame({"A": random.sample(range(1, 100), 10),
                   "B": random.sample(range(1, 100), 10),
                   "C": random.sample(range(1, 100), 10)})
df["D"] = "need_to_calc"
df
I need the value of Column D, Row 9 to equal the average of the block of cells from rows 6 through 8 across columns A through C. I want to do this for all rows.
I am not sure how to do this in a single pythonic action. Instead I have hacky temporary columns and ugly nonsense.
Is there a cleaner way to define this column without temporary tables?
You can do it like this:
# Select the numeric columns explicitly; D still holds the placeholder string
means = df[['A', 'B', 'C']].rolling(3).mean().shift(1)
df['D'] = (means['A'] + means['B'] + means['C'])/3
Output:
A B C D
0 43 57 15 NaN
1 86 34 68 NaN
2 40 12 78 NaN
3 97 24 54 48.111111
4 90 42 10 54.777778
5 34 54 98 49.666667
6 98 36 31 55.888889
7 16 5 24 54.777778
8 35 53 67 44.000000
9 80 66 37 40.555556
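You can also do it like this (summing only A, B and C, since D still holds the placeholder string):
df["D"] = (df[["A", "B", "C"]].sum(axis=1).rolling(window=3, min_periods=3).sum()/9).shift(1)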
Example:
A B C D
0 62 89 12 need_to_calc
1 44 13 63 need_to_calc
2 28 21 54 need_to_calc
3 93 93 4 need_to_calc
4 95 84 42 need_to_calc
5 68 68 35 need_to_calc
6 3 92 56 need_to_calc
7 13 88 83 need_to_calc
8 22 37 23 need_to_calc
9 64 58 5 need_to_calc
Output:
A B C D
0 62 89 12 NaN
1 44 13 63 NaN
2 28 21 54 NaN
3 93 93 4 42.888889
4 95 84 42 45.888889
5 68 68 35 57.111111
6 3 92 56 64.666667
7 13 88 83 60.333333
8 22 37 23 56.222222
9 64 58 5 46.333333
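The two answers compute the same quantity: the mean of a block of `window` rows by `len(cols)` columns sitting directly above the current row. A minimal sketch that makes the window and the columns explicit is shown below; the helper name `trailing_block_mean` is hypothetical, and the four sample rows are copied from the first answer's output.

```python
import pandas as pd

def trailing_block_mean(df, cols, window):
    """Mean over `cols` of the `window` rows directly above each row."""
    row_sums = df[cols].sum(axis=1)              # one total per row
    block_sums = row_sums.rolling(window).sum()  # total of the last `window` rows
    # Divide by the number of cells in the block, then shift so that
    # each row sees the block that ends on the row above it.
    return (block_sums / (window * len(cols))).shift(1)

df = pd.DataFrame({
    "A": [43, 86, 40, 97],
    "B": [57, 34, 12, 24],
    "C": [15, 68, 78, 54],
})
df["D"] = trailing_block_mean(df, ["A", "B", "C"], window=3)
print(df)
```

Passing the column list explicitly keeps non-numeric columns, such as a placeholder D, out of the calculation.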

Reshaping a multiindex pandas dataframe

I have a multiindex pandas dataframe that looks like this
ID           I              II             III
METRIC       a   b   c   d   a   b   c   d   a   b   c   d
2015-08-01   0   1   2   3  20  21  22  23  40  41  42  43
2015-08-02   4   5   6   7  24  25  26  27  44  45  46  47
2015-08-03   8   9  10  11  28  29  30  31  48  49  50  51
where it is indexed by the dates (2015-08-01, 2015-08-02, 2015-08-03, etc.), the first-level columns (I, II, III) are IDs and the second-level columns are corresponding METRICs (a, b, c, d). I would like to reshape it to the following
METRIC           a   b   c   d
ID
I   2015-08-01   0   1   2   3
    2015-08-02   4   5   6   7
    2015-08-03   8   9  10  11
II  2015-08-01  20  21  22  23
    2015-08-02  24  25  26  27
    2015-08-03  28  29  30  31
III 2015-08-01  40  41  42  43
    2015-08-02  44  45  46  47
    2015-08-03  48  49  50  51
I have (unsuccessfully) looked into using .pivot, .stack, and .melt, but they don't give me what I am looking for. I currently loop over IDs and build a list of dataframes and concat them together as a new dataframe to get what I want.
Any suggestions would be greatly appreciated.
Let's use stack, swaplevel and sort_index:
df.stack(0).swaplevel(0,1).sort_index()
Output:
METRIC           a   b   c   d
ID
I   2015-08-01   0   1   2   3
    2015-08-02   4   5   6   7
    2015-08-03   8   9  10  11
II  2015-08-01  20  21  22  23
    2015-08-02  24  25  26  27
    2015-08-03  28  29  30  31
III 2015-08-01  40  41  42  43
    2015-08-02  44  45  46  47
    2015-08-03  48  49  50  51
You can let transpose or T do some of the work for you.
df.T.stack().unstack(1)
METRIC           a   b   c   d
ID
I   2015-08-01   0   1   2   3
    2015-08-02   4   5   6   7
    2015-08-03   8   9  10  11
II  2015-08-01  20  21  22  23
    2015-08-02  24  25  26  27
    2015-08-03  28  29  30  31
III 2015-08-01  40  41  42  43
    2015-08-02  44  45  46  47
    2015-08-03  48  49  50  51
Using @piRSquared's method, we can skip the transpose and just use df.unstack().unstack(1)
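To try the reshape end to end, the example frame can be rebuilt by hand. The sketch below is one possible reconstruction (the construction via `MultiIndex.from_product` is an assumption about how such a frame might be created, not taken from the question) and then applies the double-unstack variant, addressing the METRIC level by name instead of by position.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2015-08-01", "2015-08-02", "2015-08-03"])
columns = pd.MultiIndex.from_product(
    [["I", "II", "III"], list("abcd")], names=["ID", "METRIC"]
)

# Values as in the question: 0..11 for I, offset by 20 for II and by 40 for III.
base = np.arange(12).reshape(3, 4)
df = pd.DataFrame(np.hstack([base, base + 20, base + 40]), index=dates, columns=columns)

# First unstack: Series indexed by (ID, METRIC, date).
# Second unstack: move METRIC back into the columns.
long = df.unstack().unstack("METRIC")
print(long)
```

Referring to the level by name rather than by position keeps the call readable if the column levels are ever reordered.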

Python: Lookup value in header of another data frame and replace/map the corresponding value

I have a data frame with index members which looks like this (A,B,C,... are the company names):
df_members
Date 1 2 3 4
0 2016-01-01 A B C D
1 2016-01-02 B C D E
2 2016-01-03 C D E F
3 2016-01-04 F A B C
4 2016-01-05 B C D E
5 2016-01-06 A B C D
and I have a second table including e.g. prices:
df_prices
Date A B C D E F
0 2015-12-30 1 2 3 4 5 6
1 2015-12-31 7 8 9 10 11 12
2 2016-01-01 13 14 15 16 17 18
3 2016-01-02 20 21 22 23 24 25
4 2016-01-03 27 28 29 30 31 32
5 2016-01-04 34 35 36 37 38 39
6 2016-01-05 41 42 43 44 45 46
7 2016-01-06 48 49 50 51 52 53
The goal is to replace all company names in df_members with the price from df_prices, resulting in df_result:
df_result
Date 1 2 3 4
0 2016-01-01 13 14 15 16
1 2016-01-02 21 22 23 24
2 2016-01-03 29 30 31 32
3 2016-01-04 39 34 35 36
4 2016-01-05 42 43 44 45
5 2016-01-06 48 49 50 51
I already have a solution where I iterate through all cells in df_members, look for the values in df_prices and write them in a new data frame df_result. The problem is that my data frames are very large and this process takes around 7 hours.
I already tried to use the merge/join, map or lookup function but it could not solve the problem.
My approach is the following:
# Create new dataframes
df_result = pd.DataFrame(columns=df_members.columns, index=unique_dates_list)
# Load prices
df_prices = prices
# Search ticker & write values in new dataframe
for i in range(0,len(df_members)):
    for j in range(0,len(df_members.columns)):
        if str(df_members.iloc[i, j]) != 'nan' and df_members.iloc[i, j] in df_prices.columns:
            df_result.iloc[i, j] = df_prices.iloc[i, df_prices.columns.get_loc(df_members.iloc[i, j])]
Question: Is there a way to map the values more efficiently?
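DataFrame.lookup() will do what you need (note that lookup was deprecated in pandas 1.2 and removed in 2.0, so this code requires an older pandas):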
Code:
df_result = pd.DataFrame(columns=[], index=df_members.index)
for column in df_members.columns:
    df_result[column] = df_prices.lookup(
        df_members.index, df_members[column])
Test Code:
from io import StringIO
import pandas as pd
df_members = pd.read_fwf(StringIO(
    u"""
    Date        1  2  3  4
    2016-01-01  A  B  C  D
    2016-01-02  B  C  D  E
    2016-01-03  C  D  E  F
    2016-01-04  F  A  B  C
    2016-01-05  B  C  D  E
    2016-01-06  A  B  C  D"""
), header=1).set_index('Date')
df_prices = pd.read_fwf(StringIO(
    u"""
    Date         A   B   C   D   E   F
    2015-12-30   1   2   3   4   5   6
    2015-12-31   7   8   9  10  11  12
    2016-01-01  13  14  15  16  17  18
    2016-01-02  20  21  22  23  24  25
    2016-01-03  27  28  29  30  31  32
    2016-01-04  34  35  36  37  38  39
    2016-01-05  41  42  43  44  45  46
    2016-01-06  48  49  50  51  52  53"""
), header=1).set_index('Date')
df_result = pd.DataFrame(columns=[], index=df_members.index)
for column in df_members.columns:
    df_result[column] = df_prices.lookup(
        df_members.index, df_members[column])
print(df_result)
Results:
1 2 3 4
Date
2016-01-01 13 14 15 16
2016-01-02 21 22 23 24
2016-01-03 29 30 31 32
2016-01-04 39 34 35 36
2016-01-05 42 43 44 45
2016-01-06 48 49 50 51
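On pandas versions without DataFrame.lookup, the same label-based lookup can be approximated with positional NumPy indexing. The sketch below is one possible substitute, not the answer's method: the helper name `lookup_by_label` is hypothetical, it assumes the price columns and dates are unique, and it returns floats so that missing dates or company names can be represented as NaN. The sample values are a subset of the tables in the question.

```python
import numpy as np
import pandas as pd

def lookup_by_label(members, prices):
    """For each cell of `members`, fetch prices[row's date, company named in the cell]."""
    values = prices.to_numpy(dtype=float)
    rows = prices.index.get_indexer(members.index)       # row position per date, -1 if absent
    result = {}
    for col in members.columns:
        cols = prices.columns.get_indexer(members[col])  # column position per name, -1 if absent
        picked = values[rows, cols]                      # fancy indexing returns a copy
        picked[(rows < 0) | (cols < 0)] = np.nan         # blank out failed lookups
        result[col] = picked
    return pd.DataFrame(result, index=members.index)

members = pd.DataFrame(
    {1: ["A", "B"], 2: ["B", "C"]},
    index=pd.Index(["2016-01-01", "2016-01-02"], name="Date"),
)
prices = pd.DataFrame(
    {"A": [7, 13, 20], "B": [8, 14, 21], "C": [9, 15, 22]},
    index=pd.Index(["2015-12-31", "2016-01-01", "2016-01-02"], name="Date"),
)

result = lookup_by_label(members, prices)
print(result)
```

The loop runs once per column of the members table rather than once per cell, so the per-cell work stays inside NumPy.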

converting an HTML table in Pandas Dataframe

I am reading an HTML table with pd.read_html, but the result comes back as a list. I want to convert it into a pandas DataFrame so I can continue working with it. I am using the following script:
import pandas as pd
import html5lib
data=pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)
Since my result comes back as a single list, I tried to convert it into a DataFrame with
data1=pd.DataFrame(data)
and the result was
0
0 0 1 2 3 4...
Because the result is a list, I can't apply functions such as rename, dropna, or drop.
I would appreciate any help.
I think you need to add [0] to select the first item of the list, because read_html returns a list of DataFrames.
So you can use:
import pandas as pd
data1 = pd.read_html('http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/2',skiprows=1)[0]
print (data1)
0 1 2 3 4 5 6 7 8 9 \
0 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
1 1 Jamie Benn, LW DAL 82 35 52 87 1 64 1.06
2 2 John Tavares, C NYI 82 38 48 86 5 46 1.05
3 3 Sidney Crosby, C PIT 77 28 56 84 5 47 1.09
4 4 Alex Ovechkin, LW WSH 81 53 28 81 10 58 1.00
5 NaN Jakub Voracek, RW PHI 82 22 59 81 1 78 0.99
6 6 Nicklas Backstrom, C WSH 82 18 60 78 5 40 0.95
7 7 Tyler Seguin, C DAL 71 37 40 77 -1 20 1.08
8 8 Jiri Hudler, LW CGY 78 31 45 76 17 14 0.97
9 NaN Daniel Sedin, LW VAN 82 20 56 76 5 18 0.93
10 10 Vladimir Tarasenko, RW STL 77 37 36 73 27 31 0.95
11 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
12 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
13 NaN Nick Foligno, LW CBJ 79 31 42 73 16 50 0.92
14 NaN Claude Giroux, C PHI 81 25 48 73 -3 36 0.90
15 NaN Henrik Sedin, C VAN 82 18 55 73 11 22 0.89
16 14 Steven Stamkos, C TB 82 43 29 72 2 49 0.88
17 NaN Tyler Johnson, C TB 77 29 43 72 33 24 0.94
18 16 Ryan Johansen, C CBJ 82 26 45 71 -6 40 0.87
19 17 Joe Pavelski, C SJ 82 37 33 70 12 29 0.85
20 NaN Evgeni Malkin, C PIT 69 28 42 70 -2 60 1.01
21 NaN Ryan Getzlaf, C ANA 77 25 45 70 15 62 0.91
22 20 Rick Nash, LW NYR 79 42 27 69 29 36 0.87
23 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
24 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
25 21 Max Pacioretty, LW MTL 80 37 30 67 38 32 0.84
26 NaN Logan Couture, C SJ 82 27 40 67 -6 12 0.82
27 23 Jonathan Toews, C CHI 81 28 38 66 30 36 0.81
28 NaN Erik Karlsson, D OTT 82 21 45 66 7 42 0.80
29 NaN Henrik Zetterberg, LW DET 77 17 49 66 -6 32 0.86
30 26 Pavel Datsyuk, C DET 63 26 39 65 12 8 1.03
31 NaN Joe Thornton, C SJ 78 16 49 65 -4 30 0.83
32 28 Nikita Kucherov, RW TB 82 28 36 64 38 37 0.78
33 NaN Patrick Kane, RW CHI 61 27 37 64 10 10 1.05
34 NaN Mark Stone, RW OTT 80 26 38 64 21 14 0.80
35 NaN PP SH NaN NaN NaN NaN NaN NaN NaN
36 RK PLAYER TEAM GP G A PTS +/- PIM PTS/G
37 NaN Alexander Steen, LW STL 74 24 40 64 8 33 0.86
38 NaN Kyle Turris, C OTT 82 24 40 64 5 36 0.78
39 NaN Johnny Gaudreau, LW CGY 80 24 40 64 11 14 0.80
40 NaN Anze Kopitar, C LA 79 16 48 64 -2 10 0.81
41 35 Radim Vrbata, RW VAN 79 31 32 63 6 20 0.80
42 NaN Jaden Schwartz, LW STL 75 28 35 63 13 16 0.84
43 NaN Filip Forsberg, C NSH 82 26 37 63 15 24 0.77
44 NaN Jordan Eberle, RW EDM 81 24 39 63 -16 24 0.78
45 NaN Ondrej Palat, LW TB 75 16 47 63 31 24 0.84
46 40 Zach Parise, LW MIN 74 33 29 62 21 41 0.84
10 11 12 13 14 15 16
0 SOG PCT GWG G A G A
1 253 13.8 6 10 13 2 3
2 278 13.7 8 13 18 0 1
3 237 11.8 3 10 21 0 0
4 395 13.4 11 25 9 0 0
5 221 10.0 3 11 22 0 0
6 153 11.8 3 3 30 0 0
7 280 13.2 5 13 16 0 0
8 158 19.6 5 6 10 0 0
9 226 8.9 5 4 21 0 0
10 264 14.0 6 8 10 0 0
11 NaN NaN NaN NaN NaN NaN NaN
12 SOG PCT GWG G A G A
13 182 17.0 3 11 15 0 0
14 279 9.0 4 14 23 0 0
15 101 17.8 0 5 20 0 0
16 268 16.0 6 13 12 0 0
17 203 14.3 6 8 9 0 0
18 202 12.9 0 7 19 2 0
19 261 14.2 5 19 12 0 0
20 212 13.2 4 9 17 0 0
21 191 13.1 6 3 10 0 2
22 304 13.8 8 6 6 4 1
23 NaN NaN NaN NaN NaN NaN NaN
24 SOG PCT GWG G A G A
25 302 12.3 10 7 4 3 2
26 263 10.3 4 6 18 2 0
27 192 14.6 7 6 11 2 1
28 292 7.2 3 6 24 0 0
29 227 7.5 3 4 24 0 0
30 165 15.8 5 8 16 0 0
31 131 12.2 0 4 18 0 0
32 190 14.7 2 2 13 0 0
33 186 14.5 5 6 16 0 0
34 157 16.6 6 5 8 1 0
35 NaN NaN NaN NaN NaN NaN NaN
36 SOG PCT GWG G A G A
37 223 10.8 5 8 16 0 0
38 215 11.2 6 4 12 1 0
39 167 14.4 4 8 13 0 0
40 134 11.9 4 6 18 0 0
41 267 11.6 7 12 11 0 0
42 184 15.2 4 8 8 0 2
43 237 11.0 6 6 13 0 0
44 183 13.1 2 6 15 0 0
45 139 11.5 5 3 8 1 1
46 259 12.7 3 11 5 0 0
If your dataframe ends up with columns indexed as 0, 1, 2, etc. and the headings in the first row (as above), just specify that the column names are in the first row with header=0.
Without this, pandas may see a mix of data types (text in the first row and numbers in the rest) and cast the column as object rather than, say, int64.
Full line would be:
data1 = pd.read_html(url, skiprows=1, header=0)[0]
[0] is the first table in the list of possible tables.
There are options for handling NA values as well. Check out the documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.read_html.html
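The table printed earlier also repeats its header row and contains filler rows partway through, which header=0 alone does not remove. A minimal clean-up sketch follows; it works on a small hand-built frame standing in for the result of read_html(...)[0], and the helper name `tidy` and its `required` parameter are hypothetical.

```python
import pandas as pd

# Stand-in for the raw frame: header in row 0, then repeated header and filler rows.
raw = pd.DataFrame([
    ["RK", "PLAYER", "PTS"],
    ["1", "Jamie Benn, LW", "87"],
    ["2", "John Tavares, C", "86"],
    [None, "PP", None],
    ["RK", "PLAYER", "PTS"],
    ["3", "Sidney Crosby, C", "84"],
])

def tidy(raw, required):
    """Promote row 0 to the header, drop repeated headers and rows missing `required`."""
    header = raw.iloc[0].tolist()
    body = raw.iloc[1:].copy()
    body.columns = header
    # A row that matches the header cell for cell is a repeated header.
    repeated = body.apply(lambda row: row.tolist() == header, axis=1)
    body = body[~repeated].dropna(subset=[required])
    # Convert the required column from text to numbers.
    body = body.assign(**{required: pd.to_numeric(body[required])})
    return body.reset_index(drop=True)

clean = tidy(raw, "PTS")
print(clean)
```

Only the `required` column is used to detect filler rows here, because other columns (such as RK for tied ranks) can legitimately be empty.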
I know this is late, but here's a better way...
I noticed that the DataFrames in the list are all part of the same table/dataset you are trying to analyze, so instead of breaking them up and then merging them together, a better solution is to concatenate the list of DataFrames.
Check out the results of this code:
df = pd.concat(pd.read_html('https://www.espn.com/nhl/stats/player/_/view/goaltending'),axis=1)
output:
df.head(1)
index RK Name POS GP W L OTL GA/G SA GA SV SV% SO TOI PIM SOSA SOS SOS%
0 1 Igor ShesterkinNYR G 53 36 13 4 2.07 1622 106 1516 0.935 6 3070:32 2 28 20 0.714
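For reference, this is what concatenating along axis=1 does on two small frames. The sketch uses placeholder data (the names and numbers are made up) and assumes both frames share the same row index, which is what makes the side-by-side join line up.

```python
import pandas as pd

# Placeholder frames standing in for the pieces returned by read_html.
left = pd.DataFrame({"RK": [1, 2], "Name": ["Player A", "Player B"]})
right = pd.DataFrame({"GP": [53, 50], "W": [36, 30]})

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([left, right], axis=1)
print(combined)
```

If the pieces do not share an index, the rows are still matched by label, so mismatched labels produce NaN rather than an error.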
