I have a data set that has year vs month values like this:
JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
2004 1.9 1.7 1.7 2.3 3.1 3.3 3 2.7 2.5 3.2 3.5 3.3
2005 3 3 3.1 3.5 2.8 2.5 3.2 3.6 4.7 4.3 3.5 3.4
2006 4 3.6 3.4 3.5 4.2 4.3 4.1 3.8 2.1 1.3 2 2.5
2007 2.1 2.4 2.8 2.6 2.7 2.7 2.4 2 2.8 3.5 4.3 4.1
2008 4.3 4 4 3.9 4.2 5 5.6 5.4 4.9 3.7 1.1 0.1
I want to convert it to a format like this using Python / pandas:
Date Value
Jan-04 1.9
Feb-04 1.7
Mar-04 1.7
Apr-04 2.3
May-04 3.1
Jun-04 3.3
Jul-04 3
Aug-04 2.7
Sep-04 2.5
Oct-04 3.2
Nov-04 3.5
Dec-04 3.3
Jan-05 3
Feb-05 3
Mar-05 3.1
Apr-05 3.5
May-05 2.8
Jun-05 2.5
Jul-05 3.2
Aug-05 3.6
Sep-05 4.7
Oct-05 4.3
Nov-05 3.5
Dec-05 3.4
How can this be done?
Alternatively, use melt:
import numpy as np
import pandas as pd
cols = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN', 'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
rows = list(range(2004, 2009))
# I've used random numbers instead of your values.
df = pd.DataFrame(index=rows, columns=cols, data=np.random.rand(5, 12)).reset_index()
tdf = df.melt(id_vars=['index'])
# Combine the month name and the year into one string, then parse it as a date.
tdf['comb'] = tdf['variable'] + tdf['index'].astype(str)
tdf['d'] = pd.to_datetime(tdf['comb'], format='%b%Y')
print(tdf)
Output (first five rows):
index variable value comb d
0 2004 JAN 0.963338 JAN2004 2004-01-01
1 2005 JAN 0.265815 JAN2005 2005-01-01
2 2006 JAN 0.254360 JAN2006 2006-01-01
3 2007 JAN 0.275372 JAN2007 2007-01-01
4 2008 JAN 0.042116 JAN2008 2008-01-01
The cleaning of the columns, I leave to the OP.
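The cleanup left to the reader above could look roughly like the sketch below. It uses the first two years of the question's data instead of random numbers; the names wide, long and result are illustrative only, and the 'Jan-04' style labels assume an English locale for strftime.

```python
import pandas as pd

# Sample data copied from the question (first two years only, for brevity).
months = ['JAN', 'FEB', 'MAR', 'APR', 'MAY', 'JUN',
          'JUL', 'AUG', 'SEP', 'OCT', 'NOV', 'DEC']
wide = pd.DataFrame(
    [[1.9, 1.7, 1.7, 2.3, 3.1, 3.3, 3.0, 2.7, 2.5, 3.2, 3.5, 3.3],
     [3.0, 3.0, 3.1, 3.5, 2.8, 2.5, 3.2, 3.6, 4.7, 4.3, 3.5, 3.4]],
    index=[2004, 2005], columns=months)

# Melt to long form, keeping the year as an identifier column.
long = wide.rename_axis('year').reset_index().melt(
    id_vars='year', var_name='month', value_name='Value')

# Build a real datetime so the rows can be sorted chronologically
# (melt orders rows by month column, not by date).
long['d'] = pd.to_datetime(long['month'] + long['year'].astype(str),
                           format='%b%Y')
long = long.sort_values('d').reset_index(drop=True)

# Format the dates as Jan-04, Feb-04, ... and keep the two requested columns.
long['Date'] = long['d'].dt.strftime('%b-%y')
result = long[['Date', 'Value']]
print(result.head())
```

Sorting on the parsed datetime rather than on the text label is what keeps the months in calendar order.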
Use DataFrame.stack to reshape (with the years as the index), then join the month names with the last two digits of each year:
df = df.rename_axis('date').stack().reset_index(name='Value')
df['date'] = df.pop('level_1') + '-' +df['date'].astype(str).str[2:]
print (df.head())
date Value
0 JAN-04 1.9
1 FEB-04 1.7
2 MAR-04 1.7
3 APR-04 2.3
4 MAY-04 3.1
Or convert to datetimes:
df = df.rename_axis('date').stack().reset_index(name='Value')
df['date'] = pd.to_datetime(df.pop('level_1') + df['date'].astype(str) , format='%b%Y')
print (df.head())
date Value
0 2004-01-01 1.9
1 2004-02-01 1.7
2 2004-03-01 1.7
3 2004-04-01 2.3
4 2004-05-01 3.1
Or convert to datetimes and then format them back as upper-case strings:
df = df.rename_axis('date').stack().reset_index(name='Value')
df['date'] = pd.to_datetime(df.pop('level_1') + df['date'].astype(str) , format='%b%Y').dt.strftime('%b-%y').str.upper()
print (df.head())
date Value
0 JAN-04 1.9
1 FEB-04 1.7
2 MAR-04 1.7
3 APR-04 2.3
4 MAY-04 3.1
I have two dataframes df1, df2 described below
df1
prod age
0 Winalto_eu 28
1 Winalto_uc 25
2 CEM_eu 30
df2
age qx
0 25 2.7
1 26 2.8
2 27 2.8
3 28 2.9
4 29 3.0
5 30 3.2
6 31 3.4
7 32 3.7
8 33 4.1
9 34 4.6
10 35 5.1
11 36 5.6
12 37 6.1
13 38 6.7
14 39 7.5
15 40 8.2
I would like to add new columns to df1 with a for loop.
The names of the new columns should be qx1, qx2, ..., qx10:
for i in range(0,10):
    df1['qx'+str(i)]
The values should be filled in by the loop, doing a kind of VLOOKUP on age.
For instance, in the first row, for the prod 'Winalto_eu', the value of qx1 should be the value of
df2['qx'] at age 28+1, qx2 the value at age 28+2, and so on.
The target dataframe should look like this :
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Do you have any ideas?
Thanks
I think this would give what you want. I used the shift function to first generate the additional columns in df2, and then merged the result with df1.
import pandas as pd
df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'], 'age' : [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25,41)), 'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7, 4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})
# qx<i> holds the qx value found i rows (here: i years) further down
for i in range(1,11):
    df2['qx'+str(i)] = df2.qx.shift(-i)
# df3 also keeps the original qx column; drop it if it is not needed
df3 = pd.merge(df1,df2,how = 'left',on = ['age'])
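The loop described in the question can also be written almost literally, as a lookup per offset. The sketch below is one possible way, assuming Series.map against a lookup table indexed by age; the name qx_by_age is illustrative and not part of the original answers.

```python
import pandas as pd

df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'],
                    'age': [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25, 41)),
                    'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7,
                           4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})

# Lookup table: qx values indexed by age.
qx_by_age = df2.set_index('age')['qx']

for i in range(1, 11):
    # Look up qx at age + i; ages missing from df2 become NaN.
    df1['qx' + str(i)] = (df1['age'] + i).map(qx_by_age)

print(df1)
```

Unlike shift, this does not depend on df2 being sorted or on the ages being consecutive, because each value is looked up by age rather than by row position.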
To begin with, you could try df.set_index('prod', inplace=True) and, after that, transpose the dataframe with qx.
Here's a way using .loc to filter the data:
top_n = 10
values = [df2.loc[df2['age'].gt(x),'qx'].iloc[:top_n].tolist() for x in df1['age']]
coln = ['qx'+str(x) for x in range(1,11)]
df1[coln] = pd.DataFrame(values)
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
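Ridiculously overengineered solution:
# ser1 is not defined in the original answer; this definition is an assumption
ser1 = df2.set_index('age')[['qx']]
pd.concat([df1, pd.DataFrame(columns=['qx'+str(i) for i in range(11)],
                             data=[ser1.T.loc[:, i:i+10].values.flatten().tolist()
                                   for i in df1['age']])],
          axis=1)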
prod age qx0 qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.7 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Try:
df=df1.assign(key=0).merge(df2.assign(key=0), on="key", suffixes=["", "_y"]).query("age<age_y").drop(["key"], axis=1)
df["q"]=df.groupby("prod")["age_y"].rank()
#to keep only 10 positions for each
df=df.loc[df["q"]<=10]
df=df.pivot_table(index=["prod", "age"], columns="q", values="qx")
df.columns=[f"qx{col:0.0f}" for col in df.columns]
df=df.reset_index()
Output:
prod age qx1 qx2 qx3 ... qx6 qx7 qx8 qx9 qx10
0 CEM_eu 30 3.4 3.7 4.1 ... 5.6 6.1 6.7 7.5 8.2
1 Winalto_eu 28 3.0 3.2 3.4 ... 4.6 5.1 5.6 6.1 6.7
2 Winalto_uc 25 2.8 2.8 2.9 ... 3.4 3.7 4.1 4.6 5.1
Say I have two dataframes, df1 and df2 as shown here:
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
Timestamp_A
0 0.6
1 1.1
2 1.6
3 2.1
4 2.6
5 3.1
6 3.6
7 4.1
8 4.6
9 5.1
10 5.6
11 6.1
12 6.6
13 7.1
Timestamp_B
0 2.2
1 2.7
2 3.2
3 3.7
4 5.2
5 5.7
Each dataframe is the output of a different sensor, and both are transmitted at the same frequency. I would like to align the two dataframes so that each timestamp in B lines up with the timestamp in A closest to its value. Every value in Timestamp_A that has no match in Timestamp_B should get np.nan. Does anyone have advice on the best way to do this? Here is the desired output:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 NaN
7 4.1 NaN
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 NaN
12 6.6 NaN
13 7.1 NaN
You probably want some application of merge_asof, like so:
import pandas as pd
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
df3 = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                    tolerance=0.5, direction='nearest')
print(df3)
Output as follows:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
The tolerance will define what "not having a match" means numerically, so that is up to you to determine.
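To see how much the tolerance matters, the same merge can be run with two values and the matches compared. This is only a sketch: 0.25 is an arbitrary value chosen for illustration, and the names loose and tight are not from the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6,
                                    4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})

# Tolerance from the answer above: a B value may be reused for two A rows.
loose = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                      tolerance=0.5, direction='nearest')

# Tighter tolerance (illustrative value): only close neighbours count as matches.
tight = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                      tolerance=0.25, direction='nearest')

# Compare how many rows of A found a partner in B under each setting.
print(loose['Timestamp_B'].notna().sum())
print(tight['Timestamp_B'].notna().sum())
print(tight)
```

With this data, the tighter tolerance should leave each B value matched to a single A row, since consecutive readings in A are 0.5 apart.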
When you only have two columns and a single value to assign, I feel like reindex is more suitable:
df2.index=df2.Timestamp_B
df1['New']=df2.reindex(df1.Timestamp_A,method='nearest',tolerance=0.5).values
df1
Out[109]:
Timestamp_A New
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
For more columns:
s=pd.DataFrame(df2.reindex(df1.Timestamp_A,method='nearest',tolerance=0.5).values,index=df1.index,columns=df2.columns)
df1=pd.concat([df1,s],axis=1)
Here is my code. Everything works pretty well until I try to send the data to Excel. I have a script that works fine for one web page but not for multiple pages.
Working code and what I want:
import pandas as pd
from pandas import ExcelWriter
dfs = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play/',header=0)
for df in dfs:
    print(df)
writer = pd.ExcelWriter('nfl.xlsx')
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
writer.save()
Non-working code:
import pandas as pd
from pandas import ExcelWriter
oyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0)
dyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/opponent-yards-per-play',header=0)
for df in (oyyp_df, dyyp_df):
    print(df)
writer = pd.ExcelWriter('nfl.xlsx')
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
df.to_excel('nflypp.xlsx', sheet_name='yppd', index=False, engine='xlsxwriter')
writer.save()
It works until it gets to df.to_excel.
Error: AttributeError: 'list' object has no attribute 'to_excel'
Here is the output:
C:\Cabs\projects>nflstatsypp.py
[ Rank Team 2018 Last 3 Last 1 Home Away 2017
0 1 Kansas City 7.0 7.0 6.9 6.4 7.5 6.1
1 2 LA Chargers 6.8 6.4 6.2 6.6 6.9 5.9
2 3 LA Rams 6.7 6.2 5.4 7.0 6.4 5.8
3 4 Tampa Bay 6.5 6.3 5.3 6.3 6.8 5.6
4 5 New Orleans 6.2 6.0 3.6 6.7 5.7 6.3
5 6 Pittsburgh 6.2 6.0 5.3 6.2 6.2 5.8
6 7 Carolina 6.2 7.3 6.8 6.1 6.2 5.1
7 8 Atlanta 6.0 5.0 2.9 6.5 5.5 5.8
8 9 Green Bay 6.0 5.4 4.4 5.9 6.1 4.9
9 10 Denver 5.9 6.1 6.3 6.1 5.8 4.8
10 11 New England 5.9 6.2 6.6 6.2 5.5 6.0
11 12 NY Giants 5.8 6.2 5.0 5.4 6.1 4.9
12 13 Houston 5.7 6.0 5.2 6.2 5.3 5.0
13 14 Seattle 5.7 6.2 6.8 5.5 5.9 5.2
14 15 San Francisco 5.7 5.8 6.1 5.4 5.9 5.3
15 16 Indianapolis 5.7 5.7 3.7 6.2 5.1 4.6
16 17 Cincinnati 5.6 5.1 4.8 5.5 5.7 4.8
17 18 Minnesota 5.6 5.1 4.7 5.6 5.6 5.4
18 19 Oakland 5.5 5.3 6.4 6.2 5.0 5.4
19 20 Philadelphia 5.5 5.4 6.1 5.5 5.5 5.6
20 21 Chicago 5.5 4.6 4.9 6.0 5.0 4.9
21 22 Cleveland 5.4 7.3 8.2 5.1 5.8 4.9
22 23 Tennessee 5.4 7.1 7.5 5.8 5.0 5.2
23 24 Miami 5.4 4.7 3.5 5.8 4.9 4.9
24 25 Dallas 5.3 5.2 4.7 5.6 5.1 5.3
25 26 Detroit 5.3 5.0 4.8 5.2 5.5 5.5
26 27 Baltimore 5.2 5.4 4.8 5.3 5.2 4.6
27 28 Washington 5.2 4.8 5.6 5.0 5.4 5.3
28 29 Jacksonville 5.0 4.3 3.8 5.0 5.1 5.4
29 30 NY Jets 4.9 4.5 4.3 5.4 4.4 5.0
30 31 Buffalo 4.5 6.2 6.3 4.5 4.6 4.7
31 32 Arizona 4.4 4.8 5.5 4.5 4.2 4.7]
[ Rank Team 2018 Last 3 Last 1 Home Away 2017
0 1 Baltimore 4.6 4.1 2.9 4.5 4.8 5.0
1 2 Buffalo 4.9 4.2 3.5 5.1 4.7 5.3
2 3 Chicago 4.9 4.8 5.0 4.6 5.2 5.1
3 4 Pittsburgh 5.2 5.1 6.2 5.6 4.8 5.3
4 5 Dallas 5.3 5.2 3.6 4.9 5.6 5.1
5 6 Minnesota 5.3 5.4 6.6 4.6 5.9 4.8
6 7 Arizona 5.3 5.1 4.4 5.0 5.6 4.9
7 8 Jacksonville 5.3 5.6 7.5 4.3 6.2 4.8
8 9 Houston 5.4 6.1 8.2 5.9 4.9 5.7
9 10 Tennessee 5.4 5.1 3.8 5.0 5.7 5.1
10 11 LA Chargers 5.5 5.1 5.3 5.7 5.4 5.3
11 12 Indianapolis 5.5 4.8 3.9 5.6 5.4 5.7
12 13 Green Bay 5.5 5.7 5.5 5.2 5.8 5.5
13 14 San Francisco 5.6 5.9 6.8 5.1 5.8 5.3
14 15 New England 5.7 5.4 4.7 5.4 5.9 5.7
15 16 NY Jets 5.7 6.8 6.7 6.0 5.4 5.4
16 17 Cleveland 5.7 5.3 5.2 6.0 5.5 5.1
17 18 Carolina 5.8 5.5 5.3 5.8 5.8 5.4
18 19 Washington 5.8 5.8 6.1 5.7 5.9 5.3
19 20 NY Giants 5.8 6.0 4.9 5.7 6.0 5.7
20 21 Denver 5.9 6.2 4.8 6.0 5.7 4.9
21 22 New Orleans 5.9 4.8 4.7 6.1 5.8 5.4
22 23 Kansas City 6.0 5.4 6.4 5.4 6.4 5.6
23 24 Philadelphia 6.1 7.0 5.6 5.7 6.6 5.2
24 25 Detroit 6.1 5.6 5.4 5.9 6.4 5.5
25 26 LA Rams 6.1 6.4 4.8 6.4 5.8 5.3
26 27 Seattle 6.1 7.2 6.1 6.7 5.8 4.9
27 28 Atlanta 6.2 5.1 4.8 6.4 5.9 5.2
28 29 Cincinnati 6.2 5.7 6.3 6.2 6.2 5.0
29 30 Miami 6.3 6.7 6.3 6.1 6.5 5.4
30 31 Tampa Bay 6.4 6.4 6.8 5.8 7.1 6.0
31 32 Oakland 6.6 6.2 6.9 6.5 6.6 5.6]
Traceback (most recent call last):
File "C:\Cabs\projects\nflstatsypp.py", line 14, in <module>
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
AttributeError: 'list' object has no attribute 'to_excel'
One last question: how do you clean up the second table above so the headers line up like in the first table? If this has already been answered, please add a link. Note, just for clarification, that when printed out in Python the first table's headers are correct. Thanks again. No more edits; I hope all this helps.
I'm brand new, just having fun. I have been researching for months with all sorts of code and have about 15 .py files trying to get this to work.
Thanks for any help. If the answer is out there, I can't find or understand it. :-) Again, sorry for being such a newbie. LOL
There are a few ways you can do this. I would probably condense the code with a loop and save each dataframe as you iterate. But it looks like you want different names for your sheets, which would mean creating a variable to associate a sheet name with each pd.read_html call. Since it appears you're a beginner, we'll keep this as simple as possible and do it another way: save the data directly.
First off, when you do oyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0), it stores the table as a dataframe, BUT packages it into a list, because pd.read_html always returns a list of dataframes.
It would also be beneficial to go back and read up on lists in Python. Your for loop iterates over the two lists (oyyp_df, dyyp_df), so df is a list each time, not a dataframe.
If you want to call a specific item in a list, you call it by its index/position. The key thing to note is that the index starts at 0, so the first item in a list is at position 0, the 2nd item is at position 1, etc.
a_list = ['first item', 'second item', 'third item']
To call that first item, you'd type a_list[0] and you would see the output 'first item'.
Now a list can hold many data types. It could hold strings, like above, integers, dictionaries, or, in your case here, dataframes.
So oyyp_df is really [<your DATAFRAME>, <maybe a 2nd dataframe>, etc.]. Yours contains only 1 item, in the first position. That is why you get the error: lists can't do .to_excel, but dataframes can.
What we can do is store that 1st item, the dataframe, under another name (or you could actually reuse the same name, but be careful: if your list has other items in it, you'd lose those): oyyp_df = oyyp_df[0]
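The difference between the list and the dataframe inside it can be checked without touching the website. In this sketch, tables is a hypothetical stand-in for what pd.read_html returns, filled with two made-up rows rather than the scraped data.

```python
import pandas as pd

# Stand-in for the return value of pd.read_html: a list holding one dataframe.
# (Hypothetical data; no web request is made here.)
tables = [pd.DataFrame({'Rank': [1, 2],
                        'Team': ['Kansas City', 'LA Chargers']})]

# The list itself has no to_excel method, which is what the error says.
print(type(tables))
print(hasattr(tables, 'to_excel'))

# The first item of the list is the dataframe, and it does have to_excel.
first = tables[0]
print(type(first))
print(hasattr(first, 'to_excel'))
```

Printing type(...) like this is a quick way to find out which kind of object a variable really holds before calling a method on it.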
I changed a couple of things in your code below to hopefully make it clearer.
import pandas as pd
html_data1 = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0)
html_data2 = pd.read_html('https://www.teamrankings.com/nfl/stat/opponent-yards-per-play',header=0)
for df in (html_data1, html_data2):
    print(df)
# read_html returns a list, so take the first table out of each list
oyyp_df = html_data1[0]
dyyp_df = html_data2[0]
# the with block saves and closes the workbook when it ends
# (writer.save() no longer exists in recent pandas versions)
with pd.ExcelWriter('nflypp.xlsx') as writer:
    oyyp_df.to_excel(writer, sheet_name='yppo', index=False)
    dyyp_df.to_excel(writer, sheet_name='yppd', index=False)
I am reading the book : "Building Machine Learning Systems with Python".
In the classification of the Iris data, I am having trouble understanding the syntax of:
plt.scatter(features[target == t,0],
            features[target == t,1],
            marker=marker,
            c=c)
Specifically, what does features[target == t,0] actually mean?
Looking at this code, it seems that features and target are both arrays and t is a number. Moreover, both features and target have the same number of rows.
In that case, features[target == t, 0] does the following:
target == t creates a Boolean array of the same shape as target (True if the value is t, otherwise False).
features[target == t, 0] selects those rows from features which correspond to True in the target == t array. The 0 specifies that the first column of features should be selected.
In other words, the code selects the rows of features for which target is equal to t and from those rows, the 0 selects the first column.
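A small, made-up example makes the two steps visible. The arrays below are illustrative stand-ins for features and target, not the actual Iris data.

```python
import numpy as np

# Four samples with two features each (made-up values).
features = np.array([[5.1, 3.5],
                     [7.0, 3.2],
                     [4.9, 3.0],
                     [6.3, 3.3]])
# Class label of each sample, one per row of features.
target = np.array([0, 1, 0, 2])
t = 0

# Step 1: Boolean array, True where the label equals t.
mask = target == t

# Step 2: keep the rows where mask is True, and from those take column 0.
selected = features[mask, 0]

print(mask)
print(selected)
```

Here mask is True for the first and third rows, so selected holds the first feature of exactly those two samples.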
A better explanation could be that this for loop splits the features array into 3 different arrays, each corresponding to a particular species of Iris.
Each of these arrays holds the first feature of every plant (instance) of that species.
This is the output if you print features[target==t,0] inside the loop:
[ 5.1 4.9 4.7 4.6 5. 5.4 4.6 5. 4.4 4.9 5.4 4.8 4.8 4.3 5.8
5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5. 5. 5.2 5.2 4.7
4.8 5.4 5.2 5.5 4.9 5. 5.5 4.9 4.4 5.1 5. 4.5 4.4 5. 5.1
4.8 5.1 4.6 5.3 5. ]
[ 7. 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5. 5.9 6. 6.1 5.6
6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6. 5.7
5.5 5.5 5.8 6. 5.4 6. 6.7 6.3 5.6 5.5 5.5 6.1 5.8 5. 5.6
5.7 5.7 6.2 5.1 5.7]
[ 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8
6.4 6.5 7.7 7.7 6. 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6. 6.9 6.7 6.9 5.8 6.8 6.7
6.7 6.3 6.5 6.2 5.9]