How to add columns with a for loop in a dataframe? - python

I have two dataframes, df1 and df2, described below.

df1

         prod  age
0  Winalto_eu   28
1  Winalto_uc   25
2      CEM_eu   30

df2

    age   qx
0    25  2.7
1    26  2.8
2    27  2.8
3    28  2.9
4    29  3.0
5    30  3.2
6    31  3.4
7    32  3.7
8    33  4.1
9    34  4.6
10   35  5.1
11   36  5.6
12   37  6.1
13   38  6.7
14   39  7.5
15   40  8.2
I would like to add new columns to df1 with a for loop. The names of the new columns should be qx1, qx2, ..., qx10:

for i in range(0, 10):
    df1['qx' + str(i)]

The values of qx1 should be filled in by the loop, doing a kind of VLOOKUP on the age: for instance on the first row, for the prod 'Winalto_eu', the value of qx1 should be the value of df2['qx'] at age 28+1, qx2 the same at 28+2, and so on.
The target dataframe should look like this:

         prod  age  qx1  qx2  qx3  qx4  qx5  qx6  qx7  qx8  qx9  qx10
0  Winalto_eu   28  3.0  3.2  3.4  3.7  4.1  4.6  5.1  5.6  6.1   6.7
1  Winalto_uc   25  2.8  2.8  2.9  3.0  3.2  3.4  3.7  4.1  4.6   5.1
2      CEM_eu   30  3.4  3.7  4.1  4.6  5.1  5.6  6.1  6.7  7.5   8.2
Do you have any idea?
Thanks

I think this would give what you want. I used the shift function to first generate the additional columns in df2, then merged with df1.

import pandas as pd

df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'], 'age': [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25, 41)),
                    'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7, 4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})

# qx.shift(-i) moves the qx column up i rows, so at a given age it
# holds the qx value i years later.
for i in range(1, 11):
    df2['qx' + str(i)] = df2.qx.shift(-i)

df3 = pd.merge(df1, df2, how='left', on=['age'])

As a first step you could try df1.set_index('prod', inplace=True), and after that transpose the frame holding the qx values so they can be looked up by age.
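That set-index idea can be sketched as follows, assuming the sample frames from the question; instead of a transpose, indexing the qx values by age is enough to do the lookup (the helper name qx below is my own):

```python
import pandas as pd

# Sample frames from the question.
df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'],
                    'age': [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25, 41)),
                    'qx': [2.7, 2.8, 2.8, 2.9, 3.0, 3.2, 3.4, 3.7, 4.1,
                           4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})

# Index the qx values by age so .loc can slice by age labels.
qx = df2.set_index('age')['qx']

# For each product's age, take the 10 qx values at age+1 .. age+10
# (label-based slicing with .loc is inclusive on both ends).
cols = ['qx' + str(i) for i in range(1, 11)]
rows = [qx.loc[a + 1:a + 10].tolist() for a in df1['age']]
df1[cols] = pd.DataFrame(rows, index=df1.index, columns=cols)

print(df1)
```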

Here's a way using .loc filtering the data:
top_n = 10
values = [df2.loc[df2['age'].gt(x),'qx'].iloc[:top_n].tolist() for x in df1['age']]
coln = ['qx'+str(x) for x in range(1,11)]
df1[coln] = pd.DataFrame(values)
         prod  age  qx1  qx2  qx3  qx4  qx5  qx6  qx7  qx8  qx9  qx10
0  Winalto_eu   28  3.0  3.2  3.4  3.7  4.1  4.6  5.1  5.6  6.1   6.7
1  Winalto_uc   25  2.8  2.8  2.9  3.0  3.2  3.4  3.7  4.1  4.6   5.1
2      CEM_eu   30  3.4  3.7  4.1  4.6  5.1  5.6  6.1  6.7  7.5   8.2

Ridiculously overengineered solution (ser1 is not defined in the original answer; it appears to be df2 indexed by age, added below as an assumption):

ser1 = df2.set_index('age')  # assumed; not defined in the original answer
pd.concat([df1,
           pd.DataFrame(columns=['qx' + str(i) for i in range(11)],
                        data=[ser1.T.loc[:, i:i + 10].values.flatten().tolist()
                              for i in df1['age']])],
          axis=1)

         prod  age  qx0  qx1  qx2  qx3  qx4  qx5  qx6  qx7  qx8  qx9  qx10
0  Winalto_eu   28  2.9  3.0  3.2  3.4  3.7  4.1  4.6  5.1  5.6  6.1   6.7
1  Winalto_uc   25  2.7  2.8  2.8  2.9  3.0  3.2  3.4  3.7  4.1  4.6   5.1
2      CEM_eu   30  3.2  3.4  3.7  4.1  4.6  5.1  5.6  6.1  6.7  7.5   8.2

Try:
df = (df1.assign(key=0)
         .merge(df2.assign(key=0), on="key", suffixes=["", "_y"])
         .query("age < age_y")
         .drop(["key"], axis=1))
df["q"]=df.groupby("prod")["age_y"].rank()
#to keep only 10 positions for each
df=df.loc[df["q"]<=10]
df=df.pivot_table(index=["prod", "age"], columns="q", values="qx")
df.columns=[f"qx{col:0.0f}" for col in df.columns]
df=df.reset_index()
Output:
         prod  age  qx1  qx2  qx3  ...  qx6  qx7  qx8  qx9  qx10
0      CEM_eu   30  3.4  3.7  4.1  ...  5.6  6.1  6.7  7.5   8.2
1  Winalto_eu   28  3.0  3.2  3.4  ...  4.6  5.1  5.6  6.1   6.7
2  Winalto_uc   25  2.8  2.8  2.9  ...  3.4  3.7  4.1  4.6   5.1

Related

How to format monthly data table into time series in python

The table looks like this:

      Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
Year
1948  3.4  3.8  4.0  3.9  3.5  3.6  3.6  3.9  3.8  3.7  3.8  4.0
1949  4.3  4.7  5.0  5.3  6.1  6.2  6.7  6.8  6.6  7.9  6.4  6.6
1950  6.5  6.4  6.3  5.8  5.5  5.4  5.0  4.5  4.4  4.2  4.2  4.3
1951  3.7  3.4  3.4  3.1  3.0  3.2  3.1  3.1  3.3  3.5  3.5  3.1
1952  3.2  3.1  2.9  2.9  3.0  3.0  3.2  3.4  3.1  3.0  2.8  2.7
I want the output to look like:
Date Data
1948-01-01 0.034
1948-02-01 ....
etc
I tried this already:
Convert monthly data table to seasonal time series using pandas
import pandas as pd
columns = pd.date_range(start = "1948", periods=12, freq='MS')
columns = columns.strftime('%Y-%m-%d') #this is optional if you want to change format
print (columns)
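Building on that date_range idea, here is a hedged sketch of the full reshape with stack, assuming a small stand-in table shaped like the one in the question, and assuming the values are percentages to be scaled to fractions (as the 0.034 in the desired output suggests):

```python
import pandas as pd

# Stand-in for the monthly table in the question (Year index, Jan..Dec columns).
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
table = pd.DataFrame(
    [[3.4, 3.8, 4.0, 3.9, 3.5, 3.6, 3.6, 3.9, 3.8, 3.7, 3.8, 4.0],
     [4.3, 4.7, 5.0, 5.3, 6.1, 6.2, 6.7, 6.8, 6.6, 7.9, 6.4, 6.6]],
    index=[1948, 1949], columns=months)
table.index.name = 'Year'

# Stack the month columns into rows, then build a date from Year + month name.
s = table.stack()
dates = pd.to_datetime(s.index.get_level_values('Year').astype(str)
                       + '-' + s.index.get_level_values(1),
                       format='%Y-%b')
out = pd.DataFrame({'Date': dates, 'Data': s.values / 100})

print(out.head())
```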

How to find the best match between two pandas columns?

Say I have two dataframes, df1 and df2 as shown here:
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
Timestamp_A
0 0.6
1 1.1
2 1.6
3 2.1
4 2.6
5 3.1
6 3.6
7 4.1
8 4.6
9 5.1
10 5.6
11 6.1
12 6.6
13 7.1
Timestamp_B
0 2.2
1 2.7
2 3.2
3 3.7
4 5.2
5 5.7
Each dataframe is the output of different sensor readings, and each is being transmitted at the same frequency. What I would like to do is align these two dataframes so that each timestamp in B lines up with the timestamp in A closest to its value. For all values in Timestamp_A which do not have a match in Timestamp_B, fill in np.nan. Does anyone have advice on the best way to do something like this? Here is the desired output:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 NaN
7 4.1 NaN
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 NaN
12 6.6 NaN
13 7.1 NaN
You probably want some application of merge_asof, like so:
import pandas as pd
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
df3 = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                    tolerance=0.5, direction='nearest')
print(df3)
Output as follows:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
The tolerance will define what "not having a match" means numerically, so that is up to you to determine.
When you only have two columns and one value to assign, I feel like reindex is more suitable:
df2.index=df2.Timestamp_B
df1['New']=df2.reindex(df1.Timestamp_A,method='nearest',tolerance=0.5).values
df1
Out[109]:
Timestamp_A New
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
For more columns
s = pd.DataFrame(df2.reindex(df1.Timestamp_A, method='nearest', tolerance=0.5).values,
                 index=df1.index, columns=df2.columns)
df1=pd.concat([df1,s],axis=1)

keep getting error with multiple pages when using df to excel works fine with 1 page

Here is my code. Everything works pretty well until I try to send the data to Excel. I have a script that works fine for one web page, but not for multiple pages.
Working code and what I want:
import pandas as pd
from pandas import ExcelWriter
dfs = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play/',header=0)
for df in dfs:
    print(df)
writer = pd.ExcelWriter('nfl.xlsx')
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
writer.save()
Non-working code:
import pandas as pd
from pandas import ExcelWriter
oyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0)
dyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/opponent-yards-per-play',header=0)
for df in (oyyp_df, dyyp_df):
    print(df)
writer = pd.ExcelWriter('nfl.xlsx')
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
df.to_excel('nflypp.xlsx', sheet_name='yppd', index=False, engine='xlsxwriter')
writer.save()
It works until it gets to the df.to_excel call, then raises:
AttributeError: 'list' object has no attribute 'to_excel'
Here is the output:
C:\Cabs\projects>nflstatsypp.py
[ Rank Team 2018 Last 3 Last 1 Home Away 2017
0 1 Kansas City 7.0 7.0 6.9 6.4 7.5 6.1
1 2 LA Chargers 6.8 6.4 6.2 6.6 6.9 5.9
2 3 LA Rams 6.7 6.2 5.4 7.0 6.4 5.8
3 4 Tampa Bay 6.5 6.3 5.3 6.3 6.8 5.6
4 5 New Orleans 6.2 6.0 3.6 6.7 5.7 6.3
5 6 Pittsburgh 6.2 6.0 5.3 6.2 6.2 5.8
6 7 Carolina 6.2 7.3 6.8 6.1 6.2 5.1
7 8 Atlanta 6.0 5.0 2.9 6.5 5.5 5.8
8 9 Green Bay 6.0 5.4 4.4 5.9 6.1 4.9
9 10 Denver 5.9 6.1 6.3 6.1 5.8 4.8
10 11 New England 5.9 6.2 6.6 6.2 5.5 6.0
11 12 NY Giants 5.8 6.2 5.0 5.4 6.1 4.9
12 13 Houston 5.7 6.0 5.2 6.2 5.3 5.0
13 14 Seattle 5.7 6.2 6.8 5.5 5.9 5.2
14 15 San Francisco 5.7 5.8 6.1 5.4 5.9 5.3
15 16 Indianapolis 5.7 5.7 3.7 6.2 5.1 4.6
16 17 Cincinnati 5.6 5.1 4.8 5.5 5.7 4.8
17 18 Minnesota 5.6 5.1 4.7 5.6 5.6 5.4
18 19 Oakland 5.5 5.3 6.4 6.2 5.0 5.4
19 20 Philadelphia 5.5 5.4 6.1 5.5 5.5 5.6
20 21 Chicago 5.5 4.6 4.9 6.0 5.0 4.9
21 22 Cleveland 5.4 7.3 8.2 5.1 5.8 4.9
22 23 Tennessee 5.4 7.1 7.5 5.8 5.0 5.2
23 24 Miami 5.4 4.7 3.5 5.8 4.9 4.9
24 25 Dallas 5.3 5.2 4.7 5.6 5.1 5.3
25 26 Detroit 5.3 5.0 4.8 5.2 5.5 5.5
26 27 Baltimore 5.2 5.4 4.8 5.3 5.2 4.6
27 28 Washington 5.2 4.8 5.6 5.0 5.4 5.3
28 29 Jacksonville 5.0 4.3 3.8 5.0 5.1 5.4
29 30 NY Jets 4.9 4.5 4.3 5.4 4.4 5.0
30 31 Buffalo 4.5 6.2 6.3 4.5 4.6 4.7
31 32 Arizona 4.4 4.8 5.5 4.5 4.2 4.7]
[ Rank Team 2018 Last 3 Last 1 Home Away 2017
0 1 Baltimore 4.6 4.1 2.9 4.5 4.8 5.0
1 2 Buffalo 4.9 4.2 3.5 5.1 4.7 5.3
2 3 Chicago 4.9 4.8 5.0 4.6 5.2 5.1
3 4 Pittsburgh 5.2 5.1 6.2 5.6 4.8 5.3
4 5 Dallas 5.3 5.2 3.6 4.9 5.6 5.1
5 6 Minnesota 5.3 5.4 6.6 4.6 5.9 4.8
6 7 Arizona 5.3 5.1 4.4 5.0 5.6 4.9
7 8 Jacksonville 5.3 5.6 7.5 4.3 6.2 4.8
8 9 Houston 5.4 6.1 8.2 5.9 4.9 5.7
9 10 Tennessee 5.4 5.1 3.8 5.0 5.7 5.1
10 11 LA Chargers 5.5 5.1 5.3 5.7 5.4 5.3
11 12 Indianapolis 5.5 4.8 3.9 5.6 5.4 5.7
12 13 Green Bay 5.5 5.7 5.5 5.2 5.8 5.5
13 14 San Francisco 5.6 5.9 6.8 5.1 5.8 5.3
14 15 New England 5.7 5.4 4.7 5.4 5.9 5.7
15 16 NY Jets 5.7 6.8 6.7 6.0 5.4 5.4
16 17 Cleveland 5.7 5.3 5.2 6.0 5.5 5.1
17 18 Carolina 5.8 5.5 5.3 5.8 5.8 5.4
18 19 Washington 5.8 5.8 6.1 5.7 5.9 5.3
19 20 NY Giants 5.8 6.0 4.9 5.7 6.0 5.7
20 21 Denver 5.9 6.2 4.8 6.0 5.7 4.9
21 22 New Orleans 5.9 4.8 4.7 6.1 5.8 5.4
22 23 Kansas City 6.0 5.4 6.4 5.4 6.4 5.6
23 24 Philadelphia 6.1 7.0 5.6 5.7 6.6 5.2
24 25 Detroit 6.1 5.6 5.4 5.9 6.4 5.5
25 26 LA Rams 6.1 6.4 4.8 6.4 5.8 5.3
26 27 Seattle 6.1 7.2 6.1 6.7 5.8 4.9
27 28 Atlanta 6.2 5.1 4.8 6.4 5.9 5.2
28 29 Cincinnati 6.2 5.7 6.3 6.2 6.2 5.0
29 30 Miami 6.3 6.7 6.3 6.1 6.5 5.4
30 31 Tampa Bay 6.4 6.4 6.8 5.8 7.1 6.0
31 32 Oakland 6.6 6.2 6.9 6.5 6.6 5.6]
Traceback (most recent call last):
File "C:\Cabs\projects\nflstatsypp.py", line 14, in
df.to_excel('nflypp.xlsx', sheet_name='yppo', index=False, engine='xlsxwriter')
AttributeError: 'list' object has no attribute 'to_excel'
One last question: how do you clean up the second table above so the headers are lined up like the first table? If it has been answered elsewhere, please add a link. Thanks. Note that when printed in Python, the first table's headers are correct. No more edits; hope all this helps.
I'm brand new and just having fun. I have been researching for months with all different code, and have about 15 .py files trying to get this to work.
Thanks for any help. If the answer is out there, I can't find or understand it. :-) Again, sorry for being such a newbie. LOL
There are a few ways you can do this. I would probably loop it to condense the code a bit, saving each dataframe as you iterate in your for loop. But it also looks like you want different names for your sheets, which would involve creating a variable to associate with each of your pd.read_html calls. Since you appear to be a beginner, we'll keep this as simple as possible and just save the data straight away.
First off, when you do oyyp_df = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html', header=0), it stores the result as a dataframe, BUT packaged inside a list (see here).
It would also be beneficial to go back and read up about lists in Python. Your for loop iterates through the items in each of your lists (oyyp_df, dyyp_df).
If you want to call a specific item in a list, you call it by its index/position. The key to note, though, is that the index starts at 0. So the first item in a list is at position 0, the 2nd item is at position 1, etc.
a_list = ['first item', 'second item', 'third item']
To call that first item, you'd type a_list[0] and you would see the output 'first item'.
Now a list can hold many data types. It could be strings, like above; it can be integers, it can be dictionaries, or, in your case here, dataframes.
So oyyp_df is really [<your DATAFRAME>, <maybe a 2nd dataframe>, etc.]. Yours only contains one item, in the first position. So you get that error: lists can't do .to_excel, but dataframes can.
What we can do is store that first-item dataframe by assigning it to another name (or you could reuse the same name, but be careful: if your list had other items in it, you'd lose them): oyyp_df = oyyp_df[0]
I changed a couple things to hopefully make it clearer in your code below.
import pandas as pd
html_data1 = pd.read_html('https://www.teamrankings.com/nfl/stat/yards-per-play.html',header=0)
html_data2 = pd.read_html('https://www.teamrankings.com/nfl/stat/opponent-yards-per-play',header=0)
for df in (html_data1, html_data2):
    print(df)
oyyp_df = html_data1[0]
dyyp_df = html_data2[0]
writer = pd.ExcelWriter('nflypp.xlsx')
oyyp_df.to_excel(writer, sheet_name='yppo', index=False)
dyyp_df.to_excel(writer, sheet_name='yppd', index=False)
writer.save()
writer.close()
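As a side note, using pd.ExcelWriter as a context manager saves and closes the file automatically, so the explicit writer.save()/writer.close() calls are not needed. A sketch with hypothetical stand-in frames (in the real script these would be html_data1[0] and html_data2[0] from pd.read_html):

```python
import pandas as pd

# Hypothetical stand-in frames; the real ones come from pd.read_html.
oyyp_df = pd.DataFrame({'Team': ['A', 'B'], '2018': [7.0, 6.8]})
dyyp_df = pd.DataFrame({'Team': ['C', 'D'], '2018': [4.6, 4.9]})

# The with-block closes (and thereby saves) the workbook on exit.
with pd.ExcelWriter('nflypp.xlsx') as writer:
    oyyp_df.to_excel(writer, sheet_name='yppo', index=False)
    dyyp_df.to_excel(writer, sheet_name='yppd', index=False)
```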

Pandas dataframe threshold -- Keep number fixed if exceed

I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
1 2 3 4 5 6 7
John 1.3 2.8 3.0 4.4 2.6 3.1 4.8
Terry 1.1 2.3 4.1 5.5 3.7 2.1 3.8
Henry 0.3 1.0 2.0 3.0 2.7 1.1 2.8
How do I set a score ceiling such that once a score goes above 2.5, all scores from that day onward are FIXED at that first exceeding value, no matter what the later scores are?
The output should be:
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
I tried the following, but it didn't work. I first do a boolean on all numbers > 2.5, then apply a mask based on the cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
You can find the first non-NaN value by where with bfill, and select the first column by iloc:
m = (df > 2.5).cumsum(axis=1) > 0
s = df.where(m).bfill(axis=1).iloc[:, 0]
print (s)
John 2.8
Terry 4.1
Henry 3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)
Or shift mask and forward filling NaNs to last values:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1).fillna(False)).ffill(axis=1)
print (df)
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0

Change values of one column in pandas dataframe

How can I change the values of the column 4 to 1 and -1, so that Iris-setosa is replace with 1 and Iris-virginica replaced with -1?
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
I would appreciate the help.
You can use replace:
d = {'Iris-setosa': 1, 'Iris-virginica': -1}
df['4'].replace(d,inplace = True)
0 1 2 3 4
0 5.1 3.5 1.4 0.2 1
1 4.9 3.0 1.4 0.2 1
2 4.7 3.2 1.3 0.2 1
3 4.6 3.1 1.5 0.2 1
4 5.0 3.6 1.4 0.2 1
5 5.4 3.9 1.7 0.4 1
6 4.6 3.4 1.4 0.3 1
.. ... ... ... ... ...
120 6.9 3.2 5.7 2.3 -1
121 5.6 2.8 4.9 2.0 -1
122 7.7 2.8 6.7 2.0 -1
123 6.3 2.7 4.9 1.8 -1
124 6.7 3.3 5.7 2.1 -1
125 7.2 3.2 6.0 1.8 -1
126 6.2 2.8 4.8 1.8 -1
This needs .loc rather than .iloc (boolean masks don't work with .iloc):
df.loc[df["4"] == "Iris-setosa", "4"] = 1
df.loc[df["4"] == "Iris-virginica", "4"] = -1
I would do something like this:

def encode_row(row):
    if row[4] == "Iris-setosa":
        return 1
    return -1

df_test[4] = df_test.apply(encode_row, axis=1)

assuming that df_test is your data frame.
Sounds like
import numpy as np

df['4'] = np.where(df['4'] == 'Iris-setosa', 1, -1)
should do the job
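If there were more than two labels to encode, Series.map generalizes the same idea (a small sketch with a stand-in frame; column 4 holds the species string as in the question):

```python
import pandas as pd

# Stand-in frame; column 4 holds the species label as in the question.
df = pd.DataFrame({4: ['Iris-setosa', 'Iris-virginica', 'Iris-setosa']})

# map substitutes each label via the dict; a label missing from the
# dict becomes NaN, which makes unmapped values easy to spot.
df[4] = df[4].map({'Iris-setosa': 1, 'Iris-virginica': -1})

print(df)
```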
