Pandas dataframe threshold -- keep number fixed once exceeded - python

I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
        1    2    3    4    5    6    7
John  1.3  2.8  3.0  4.4  2.6  3.1  4.8
Terry 1.1  2.3  4.1  5.5  3.7  2.1  3.8
Henry 0.3  1.0  2.0  3.0  2.7  1.1  2.8
How do I set a score ceiling such that once a score exceeds 2.5, all scores from that day onwards are FIXED at that day's value, no matter what the later scores are?
The output should be:
        1    2    3    4    5    6    7
John  1.3  2.8  2.8  2.8  2.8  2.8  2.8
Terry 1.1  2.3  4.1  4.1  4.1  4.1  4.1
Henry 0.3  1.0  2.0  3.0  3.0  3.0  3.0
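For reference, the frame can be built like this (a minimal sketch; the column labels are assumed to be the integers 1 through 7):
import pandas as pd

df = pd.DataFrame([[1.3, 2.8, 3.0, 4.4, 2.6, 3.1, 4.8],
                   [1.1, 2.3, 4.1, 5.5, 3.7, 2.1, 3.8],
                   [0.3, 1.0, 2.0, 3.0, 2.7, 1.1, 2.8]],
                  index=['John', 'Terry', 'Henry'],
                  columns=range(1, 8))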
I tried the following, but it didn't work. I first take a boolean mask of all numbers > 2.5, then apply a mask to its cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
(This replaces each masked value with itself, so it is a no-op.)

You can find the first breaching value per row by where with bfill, and select the first column by iloc:
m = (df > 2.5).cumsum(axis=1) > 0         # True from the first day a score exceeds 2.5
s = df.where(m).bfill(axis=1).iloc[:, 0]  # the first breaching value per row
print (s)
John     2.8
Terry    4.1
Henry    3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)                # overwrite all masked cells with that value, row-wise
Or shift the mask and forward fill the NaNs with the last valid values:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1).fillna(False)).ffill(axis=1)  # NaN out everything after the first breach, then forward fill it
print (df)
        1    2    3    4    5    6    7
John  1.3  2.8  2.8  2.8  2.8  2.8  2.8
Terry 1.1  2.3  4.1  4.1  4.1  4.1  4.1
Henry 0.3  1.0  2.0  3.0  3.0  3.0  3.0
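A quick sanity check for either snippet (a sketch, assuming the df construction shown in the question):
expected = pd.DataFrame([[1.3, 2.8, 2.8, 2.8, 2.8, 2.8, 2.8],
                         [1.1, 2.3, 4.1, 4.1, 4.1, 4.1, 4.1],
                         [0.3, 1.0, 2.0, 3.0, 3.0, 3.0, 3.0]],
                        index=['John', 'Terry', 'Henry'],
                        columns=range(1, 8))
pd.testing.assert_frame_equal(df, expected)  # should pass after either approach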

How to add columns with a for loop in a dataframe?

I have two dataframes df1, df2 described below
df1
prod age
0 Winalto_eu 28
1 Winalto_uc 25
2 CEM_eu 30
df2
age qx
0 25 2.7
1 26 2.8
2 27 2.8
3 28 2.9
4 29 3.0
5 30 3.2
6 31 3.4
7 32 3.7
8 33 4.1
9 34 4.6
10 35 5.1
11 36 5.6
12 37 6.1
13 38 6.7
14 39 7.5
15 40 8.2
I would like to add new columns to df1 with a for loop. The names of the new columns should be qx1, qx2, ..., qx10:
for i in range(0,10):
    df1['qx'+str(i)]
The values of qx1 should be assigned in the loop, doing a kind of VLOOKUP on the age:
For instance on the first row, for the prod 'Winalto_eu', the value of qx1 should be the value of
df2['qx'] at age 28+1, qx2 the same at 28+2, and so on.
The target dataframe should look like this :
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Do you have any idea?
Thanks
I think this would give what you want. I used the shift function to first generate the additional columns in df2, then merged with df1.
import pandas as pd
df1 = pd.DataFrame({'prod': ['Winalto_eu', 'Winalto_uc', 'CEM_eu'], 'age' : [28, 25, 30]})
df2 = pd.DataFrame({'age': list(range(25,41)), 'qx': [2.7, 2.8, 2.8, 2.9, 3, 3.2, 3.4, 3.7, 4.1, 4.6, 5.1, 5.6, 6.1, 6.7, 7.5, 8.2]})
for i in range(1,11):
    df2['qx'+str(i)] = df2.qx.shift(-i)  # qx from i rows further down, i.e. at age + i
df3 = pd.merge(df1, df2, how='left', on=['age'])
Note that shift(-i) lines up with age + i only because df2 has exactly one row per consecutive age.
At the beginning you could try df1.set_index('prod', inplace=True), and after that transpose the df with qx.
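That answer is terse; here is a minimal sketch of one way to read it (the reindex-based lookup is my interpretation, not spelled out above; df1 and df2 are as defined in the question):
lookup = df2.set_index('age')['qx']  # qx values indexed by age, ready for lookups
for i in range(1, 11):
    # for each row of df1, fetch qx at age + i
    df1['qx' + str(i)] = lookup.reindex(df1['age'] + i).to_numpy()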
Here's a way using .loc to filter the data (note this assumes df2 is sorted by age, since it takes the first top_n rows with age greater than x):
top_n = 10
values = [df2.loc[df2['age'].gt(x),'qx'].iloc[:top_n].tolist() for x in df1['age']]
coln = ['qx'+str(x) for x in range(1,11)]
df1[coln] = pd.DataFrame(values)
prod age qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Ridiculously overengineered solution:
ser1 = df2.set_index('age')  # assumed definition; ser1 was never defined in the original answer
pd.concat([df1, pd.DataFrame(columns=['qx'+str(i) for i in range(11)],
                             data=[ser1.T.loc[:, i:i+10].values.flatten().tolist()
                                   for i in df1['age']])],
          axis=1)
prod age qx0 qx1 qx2 qx3 qx4 qx5 qx6 qx7 qx8 qx9 qx10
0 Winalto_eu 28 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7
1 Winalto_uc 25 2.7 2.8 2.8 2.9 3.0 3.2 3.4 3.7 4.1 4.6 5.1
2 CEM_eu 30 3.2 3.4 3.7 4.1 4.6 5.1 5.6 6.1 6.7 7.5 8.2
Try:
df = (df1.assign(key=0)
         .merge(df2.assign(key=0), on="key", suffixes=["", "_y"])
         .query("age<age_y")
         .drop(["key"], axis=1))
df["q"] = df.groupby("prod")["age_y"].rank()
# keep only 10 positions for each
df = df.loc[df["q"]<=10]
df = df.pivot_table(index=["prod", "age"], columns="q", values="qx")
df.columns = [f"qx{col:0.0f}" for col in df.columns]
df = df.reset_index()
Output:
prod age qx1 qx2 qx3 ... qx6 qx7 qx8 qx9 qx10
0 CEM_eu 30 3.4 3.7 4.1 ... 5.6 6.1 6.7 7.5 8.2
1 Winalto_eu 28 3.0 3.2 3.4 ... 4.6 5.1 5.6 6.1 6.7
2 Winalto_uc 25 2.8 2.8 2.9 ... 3.4 3.7 4.1 4.6 5.1

How to append multiple columns to the first 3 columns and repeat the index values using pandas?

I have a data set in which the columns come in multiples of 3 (excluding the index column [0]).
I am new to python.
Here there are 9 columns excluding the index. I want to append the 4th column to the 1st, the 5th to the 2nd, the 6th to the 3rd, then the 7th to the 1st, the 8th to the 2nd, the 9th to the 3rd, and so on for a large data set. My large data set will always have columns in multiples of 3 (excluding the index column).
Also I want the index values to repeat in the same order. In this case 6, 9, 4, 3 repeats 3 times.
import pandas as pd
import io
data =io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data,index_col=[0],header = None)
Expected Output:
df
6,5.6,4.6,8.2
9,2.3,7.8,1
4,4.8,9.1,0
3,9.4,10.6,7.5
6,2.5,9.4,7.6
9,4.8,6.7,8.4
4,7.1,5.6,3.6
3,1.5,4.3,14.3
6,9.3,4.1,1.9
9,45.2,8.9,1.5
4,63.7,7.6,4
3,36.1,6.3,0
The idea is to reshape by stack and sort by the second level of the MultiIndex; for correct row ordering, first create an ordered CategoricalIndex (so sorting keeps the original 6, 9, 4, 3 order):
import numpy as np

a = np.arange(len(df.columns))
df.index = pd.CategoricalIndex(df.index, ordered=True, categories=df.index.unique())
df.columns = [a // 3, a % 3]  # (group number, position within group)
df = df.stack(0).sort_index(level=1).reset_index(level=1, drop=True)
print (df)
0 1 2
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
Split the data frame horizontally and concatenate the components vertically:
df.columns = [1, 2, 3] * (len(df.columns) // 3)  # identical labels so the pieces align
rslt = pd.concat([df.iloc[:, i:i+3] for i in range(0, len(df.columns), 3)])
1 2 3
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
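For completeness, the same split-and-stack idea can be written with NumPy (a sketch of an alternative, not from the original answers; df is the frame built in the question, with 4 rows and 9 value columns):
import numpy as np
import pandas as pd

rslt = pd.DataFrame(np.vstack([df.to_numpy()[:, i:i+3] for i in range(0, df.shape[1], 3)]),
                    index=np.tile(df.index, df.shape[1] // 3))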

Merging two dataframes with one common column name [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two .csv files, "train_id.csv" and "train_ub.csv", that I want to load as pandas dataframes. Their dimensions are different, but they have exactly one column in common, let's say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
One may see that they have the first column in common, but each dataframe is missing some of the other's IDs. Is there a way in pandas to merge them column-wise to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Notice that this is an oversimplified example, the real databases have the shapes id(144233, 41) and ub(590540, 394).
You could accomplish this using an outer join. Here is the code for it:
train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
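A minimal runnable sketch on a slice of the example data (one value column from each frame, values taken from the tables above):
import pandas as pd

train_id = pd.DataFrame({'ID': [1, 2, 5, 7, 8, 9, 11],
                         'id_01': [0.0, 0.5, 1.5, 3.0, 5.5, 7.2, 9.5]})
train_ub = pd.DataFrame({'ID': [1, 2, 3, 4, 6, 10, 11],
                         'ub_01': [0.0, 1.5, 2.7, 4.8, 6.0, 9.1, 7.3]})

train_merged = train_id.merge(train_ub, on=['ID'], how='outer')
print(train_merged)  # IDs present in only one frame get NaN in the other frame's columns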

How to create new column based on top and bottom parts of single dataframe in PANDAS?

I have merged two dataframes having the same column names. Is there an easy way to get another column with the mean of these two appended dataframes?
Maybe code explains it better.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[10,20,30,40]})
df2 = pd.DataFrame({'a':[1.2,2.2,3.2,4.2],'b':[10.2,20.2,30.2,40.2]})
df = df1.append(df2)  # note: append was removed in pandas 2.0; pd.concat([df1, df2]) is the equivalent
print(df)
df['a_mean'] = ???
a b
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
0 1.2 10.2
1 2.2 20.2
2 3.2 30.2
3 4.2 40.2
How to create a new column a_mean with values
[1.1, 2.1, 3.1, 4.1, 1.1, 2.1, 3.1, 4.1] effectively ?
Using melt():
df = df.assign(a_mean=df1.add(df2).div(2).melt().value)
This works because assign aligns the melted values on df's duplicated index 0-3, so each row picks up the mean of column a at its label (the b half of the melt, at labels 4-7, is dropped by the alignment).
Or, taking only df, you can group by the duplicated index:
df = df.assign(a_mean=df.groupby(df.index)['a'].mean())
a b a_mean
0 1.0 10.0 1.1
1 2.0 20.0 2.1
2 3.0 30.0 3.1
3 4.0 40.0 4.1
0 1.2 10.2 1.1
1 2.2 20.2 2.1
2 3.2 30.2 3.1
3 4.2 40.2 4.1
Try this:
df['a_mean'] = np.tile( (df1.a.to_numpy() + df2.a.to_numpy())/2, 2)
As per the comments, there is already a great answer by Anky, but to extend this method you can do this:
df['a_mean2'] = np.tile( (df.iloc[0: len(df)//2].a.to_numpy() + df.iloc[len(df)//2:].a.to_numpy())/2, 2)
Update:
df['a_mean3'] = np.tile(df.a.to_numpy().reshape(2,-1).mean(0), 2)  # stack the two halves as rows, average down the columns, repeat
Output
print(df)
a b a_mean2 a_mean a_mean3
0 1.0 10.0 1.1 1.1 1.1
1 2.0 20.0 2.1 2.1 2.1
2 3.0 30.0 3.1 3.1 3.1
3 4.0 40.0 4.1 4.1 4.1
0 1.2 10.2 1.1 1.1 1.1
1 2.2 20.2 2.1 2.1 2.1
2 3.2 30.2 3.1 3.1 3.1
3 4.2 40.2 4.1 4.1 4.1

Smallest Difference Between 2 object in Dataframe

df = pd.DataFrame({'Stock': ['Apple', 'Broadcomm', 'Citi', 'D&G'],
                   'PE': pd.Series([1.5, 3.9, 5.6, 6.8])})
I'm looking for an algorithm to rank stock pairs from a pool based on the difference of PE, i.e. PE of stock 1 - PE of stock 2.
E.g. with a pool of 40 stocks, rank the unique stock pairs by smallest PE difference; in total there will be 20 unique pairs.
E.g. if MSFT appears in pair 1 (the pair with the smallest PE difference involving MSFT), MSFT should not reappear in any subsequent pair.
What's the correct algorithm for doing this?
So far I have found the PE difference of each and every pair and ranked them ascending. What should I do next?
A pandas-based solution:
First make the matches:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Stock': ['Apple', 'Broadcomm', 'Citi', 'D&G', 'Samsung', 'Elite'],
                   'PE': pd.Series([1.5, 3.9, 5.6, 6.8, 6, 6])})
df.set_index('Stock', inplace=True)
df.sort_values('PE', inplace=True)
crosstable = pd.DataFrame(np.add.outer(df.PE, -df.PE), df.index, df.index)  # pairwise PE differences
v = crosstable.mask(np.triu(np.ones((len(df), len(df)), bool)))  # keep only valid (lower-triangle) comparisons
Then v is:
Stock Apple Broadcomm Citi Samsung Elite D&G
Stock
Apple NaN NaN NaN NaN NaN NaN
Broadcomm 2.4 NaN NaN NaN NaN NaN
Citi 4.1 1.7 NaN NaN NaN NaN
Samsung 4.5 2.1 0.4 NaN NaN NaN
Elite 4.5 2.1 0.4 0.0 NaN NaN
D&G 5.3 2.9 1.2 0.8 0.8 NaN
Then the ranking:
w = v.stack()
w.sort_values(inplace=True)
w is:
Stock Stock
Elite Samsung 0.0
Samsung Citi 0.4
Elite Citi 0.4
D&G Samsung 0.8
Elite 0.8
Citi 1.2
Citi Broadcomm 1.7
Samsung Broadcomm 2.1
Elite Broadcomm 2.1
Broadcomm Apple 2.4
D&G Broadcomm 2.9
Citi Apple 4.1
Samsung Apple 4.5
Elite Apple 4.5
D&G Apple 5.3
And extract the best pairs:
i = 0
s = set(df.index)  # stocks not yet paired
top = []
while s:
    x, y = w.index[i]
    if x in s and y in s:
        top.append((x, y))
        s -= {x, y}
    i += 1
w[top] is the result:
Stock Stock
Elite Samsung 0.0
D&G Citi 1.2
Broadcomm Apple 2.4
This is an approach that uses itertools.combinations(), isin(), and drop():
import pandas as pd
import itertools as it

df = pd.DataFrame({'Stock': ['Apple', 'Broadcomm', 'Citi', 'D&G', 'Elixir', 'Foxtrot'],
                   'PE': [3.8, 3.9, 5.6, 6.8, 0.5, 3.9]})
print(df)
assert len(df) % 2 == 0
m = df.set_index('Stock')
ranking = pd.DataFrame(columns=['StockA', 'StockB', 'minPE', 'deltaPE'],
                       data=[(a, b, min(m.PE[a], m.PE[b]), abs(m.PE[a] - m.PE[b]))
                             for a, b in it.combinations(m.index, 2)])
ranking.sort_values(['deltaPE', 'minPE'], inplace=True)
print(ranking)
# ranking is sorted from best to worst.
# Start with the first line, eliminate other lines that contain exactly one of
# this line's stocks (but not both), then proceed to the next line and repeat.
for i in range(len(df) // 2):
    a = ranking.iloc[i].StockA
    b = ranking.iloc[i].StockB
    contenders = ranking[ranking.StockA.isin([a, b]) ^ ranking.StockB.isin([a, b])]
    ranking.drop(contenders.index, inplace=True)
print(ranking)
Output:
PE Stock
0 3.8 Apple
1 3.9 Broadcomm
2 5.6 Citi
3 6.8 D&G
4 0.5 Elixir
5 3.9 Foxtrot
# ---- Ranking after sorting:
StockA StockB minPE deltaPE
8 Broadcomm Foxtrot 3.9 0.0
0 Apple Broadcomm 3.8 0.1
4 Apple Foxtrot 3.8 0.1
9 Citi D&G 5.6 1.2
5 Broadcomm Citi 3.9 1.7
11 Citi Foxtrot 3.9 1.7
1 Apple Citi 3.8 1.8
6 Broadcomm D&G 3.9 2.9
13 D&G Foxtrot 3.9 2.9
2 Apple D&G 3.8 3.0
3 Apple Elixir 0.5 3.3
7 Broadcomm Elixir 0.5 3.4
14 Elixir Foxtrot 0.5 3.4
10 Citi Elixir 0.5 5.1
12 D&G Elixir 0.5 6.3
# ---- Ranking after dropping rows:
StockA StockB minPE deltaPE
8 Broadcomm Foxtrot 3.9 0.0
9 Citi D&G 5.6 1.2
3 Apple Elixir 0.5 3.3
