df = pd.DataFrame({'Stock': ['Apple', 'Broadcomm', 'Citi', 'D&G'],
                   'PE': [1.5, 3.9, 5.6, 6.8]})
I'm looking for an algorithm to rank stock pairs from a pool based on the difference of PE, i.e. PE of stock 1 - PE of stock 2.
For example, given a pool of 40 stocks, rank the unique stock pairs by smallest PE difference; in total there will be 20 unique pairs.
E.g. if MSFT appears in pair 1 (the pair with the smallest PE difference involving MSFT), MSFT should not reappear in any subsequent pair.
What's the correct algorithm for doing this?
So far I have computed the PE difference of each and every pair and ranked them ascending. What should I do next?
A pandas-based solution.
First make the matches:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Stock': ['Apple', 'Broadcomm', 'Citi', 'D&G', 'Samsung', 'Elite'],
                   'PE': [1.5, 3.9, 5.6, 6.8, 6, 6]})
df.set_index('Stock', inplace=True)
df.sort_values('PE', inplace=True)

# All pairwise differences PE_i - PE_j, as a Stock-by-Stock table.
crosstable = pd.DataFrame(np.add.outer(df.PE, -df.PE), df.index, df.index)
# Mask the upper triangle (incl. diagonal) to keep each unordered pair once.
v = crosstable.mask(np.triu(np.ones((len(df), len(df)), bool)))
Then v is:
Stock Apple Broadcomm Citi Samsung Elite D&G
Stock
Apple NaN NaN NaN NaN NaN NaN
Broadcomm 2.4 NaN NaN NaN NaN NaN
Citi 4.1 1.7 NaN NaN NaN NaN
Samsung 4.5 2.1 0.4 NaN NaN NaN
Elite 4.5 2.1 0.4 0.0 NaN NaN
D&G 5.3 2.9 1.2 0.8 0.8 NaN
Then the ranking:
w = v.stack()
w.sort_values(inplace=True)
w is:
Stock Stock
Elite Samsung 0.0
Samsung Citi 0.4
Elite Citi 0.4
D&G Samsung 0.8
Elite 0.8
Citi 1.2
Citi Broadcomm 1.7
Samsung Broadcomm 2.1
Elite Broadcomm 2.1
Broadcomm Apple 2.4
D&G Broadcomm 2.9
Citi Apple 4.1
Samsung Apple 4.5
Elite Apple 4.5
D&G Apple 5.3
And extract the best pairs:
i = 0
s = set(df.index)
top = []
while s:
    x, y = w.index[i]
    if x in s and y in s:
        top += (x, y),
        s -= {x, y}
    i += 1
w[top] is the result:
Stock Stock
Elite Samsung 0.0
D&G Citi 1.2
Broadcomm Apple 2.4
This is an approach that uses itertools.combinations(), isin(), and drop():
import pandas as pd
import itertools as it
df = pd.DataFrame({'Stock' : ['Apple', 'Broadcomm', 'Citi', 'D&G', 'Elixir', 'Foxtrot'],
'PE' : [3.8, 3.9, 5.6, 6.8, 0.5, 3.9]})
print(df)
assert len(df) % 2 == 0
m = df.set_index('Stock')
ranking = pd.DataFrame(columns=['StockA', 'StockB', 'minPE', 'deltaPE'],
                       data=[(a, b, min(m.PE[a], m.PE[b]), abs(m.PE[a] - m.PE[b]))
                             for a, b in it.combinations(m.index, 2)])
ranking.sort_values(['deltaPE', 'minPE'], inplace=True)
print(ranking)
# ranking is sorted from best to worst.
# Start with first line, eliminate other lines that belong to either one of
# this line's stocks (but not both), then proceed to next line and repeat.
for i in range(len(df) // 2):
    a = ranking.iloc[i].StockA
    b = ranking.iloc[i].StockB
    contenders = ranking[ranking.StockA.isin([a, b]) ^ ranking.StockB.isin([a, b])]
    ranking.drop(contenders.index, inplace=True)
print(ranking)
Output:
PE Stock
0 3.8 Apple
1 3.9 Broadcomm
2 5.6 Citi
3 6.8 D&G
4 0.5 Elixir
5 3.9 Foxtrot
# ---- Ranking after sorting:
StockA StockB minPE deltaPE
8 Broadcomm Foxtrot 3.9 0.0
0 Apple Broadcomm 3.8 0.1
4 Apple Foxtrot 3.8 0.1
9 Citi D&G 5.6 1.2
5 Broadcomm Citi 3.9 1.7
11 Citi Foxtrot 3.9 1.7
1 Apple Citi 3.8 1.8
6 Broadcomm D&G 3.9 2.9
13 D&G Foxtrot 3.9 2.9
2 Apple D&G 3.8 3.0
3 Apple Elixir 0.5 3.3
7 Broadcomm Elixir 0.5 3.4
14 Elixir Foxtrot 0.5 3.4
10 Citi Elixir 0.5 5.1
12 D&G Elixir 0.5 6.3
# ---- Ranking after dropping rows:
StockA StockB minPE deltaPE
8 Broadcomm Foxtrot 3.9 0.0
9 Citi D&G 5.6 1.2
3 Apple Elixir 0.5 3.3
I have a data set in which the number of columns is a multiple of 3 (excluding the index column [0]).
I am new to Python.
Here there are 9 columns excluding the index. I want to append the 4th column to the 1st, the 5th to the 2nd, the 6th to the 3rd, then the 7th to the 1st, the 8th to the 2nd, the 9th to the 3rd, and so on for a large data set. My large data set will always have a multiple of 3 columns (excluding the index column).
I also want the index values to repeat in the same order; in this case 6, 9, 4, 3 should repeat 3 times.
import pandas as pd
import io
data =io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data,index_col=[0],header = None)
Expected Output:
df
6,5.6,4.6,8.2
9,2.3,7.8,1
4,4.8,9.1,0
3,9.4,10.6,7.5
6,2.5,9.4,7.6
9,4.8,6.7,8.4
4,7.1,5.6,3.6
3,1.5,4.3,14.3
6,9.3,4.1,1.9
9,45.2,8.9,1.5
4,63.7,7.6,4
3,36.1,6.3,0
The idea is to reshape with stack, sorting by the second level of the MultiIndex; to keep the row order 6, 9, 4, 3 intact while sorting, first create an ordered CategoricalIndex:
import numpy as np

a = np.arange(len(df.columns))
df.index = pd.CategoricalIndex(df.index, ordered=True, categories=df.index.unique())
# Split column labels into (chunk, position-within-chunk) pairs, then stack the chunks.
df.columns = [a // 3, a % 3]
df = df.stack(0).sort_index(level=1).reset_index(level=1, drop=True)
print(df)
0 1 2
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
Split the data frame horizontally and concatenate the pieces vertically:
df.columns = [1, 2, 3] * (len(df.columns) // 3)
rslt = pd.concat([df.iloc[:, i:i + 3] for i in range(0, len(df.columns), 3)])
1 2 3
0
6 5.6 4.6 8.2
9 2.3 7.8 1.0
4 4.8 9.1 0.0
3 9.4 10.6 7.5
6 2.5 9.4 7.6
9 4.8 6.7 8.4
4 7.1 5.6 3.6
3 1.5 4.3 14.3
6 9.3 4.1 1.9
9 45.2 8.9 1.5
4 63.7 7.6 4.0
3 36.1 6.3 0.0
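For reference, here is the whole round trip, from the sample CSV to the reshaped frame, as one self-contained sketch of the split-and-concat approach:

```python
import io

import pandas as pd

data = io.StringIO("""
6,5.6,4.6,8.2,2.5,9.4,7.6,9.3,4.1,1.9
9,2.3,7.8,1,4.8,6.7,8.4,45.2,8.9,1.5
4,4.8,9.1,0,7.1,5.6,3.6,63.7,7.6,4
3,9.4,10.6,7.5,1.5,4.3,14.3,36.1,6.3,0
""")
df = pd.read_csv(data, index_col=[0], header=None)

# Relabel so every chunk of 3 columns shares the labels 1, 2, 3, then
# stack the chunks on top of each other; the index repeats automatically.
df.columns = [1, 2, 3] * (len(df.columns) // 3)
rslt = pd.concat([df.iloc[:, i:i + 3] for i in range(0, len(df.columns), 3)])
print(rslt)
```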
This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two .csv files, "train_id.csv" and "train_ub.csv", that I want to load as pandas dataframes. Their dimensions are different, but they have one column in common, say:
train_id:
ID id_01 id_02 id_03 id_04
1 0.0 1.0 5.2 7.1
2 0.5 7.7 3.4 9.8
5 1.5 0.8 1.6 2.5
7 3.0 0.2 3.4 6.3
8 5.5 1.8 7.5 7.0
9 7.2 2.6 9.1 1.1
11 9.5 3.5 2.2 0.3
while train_ub:
ID ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 9.2 8.3
2 1.5 2.7 0.4 4.9
3 2.7 4.8 7.6 3.7
4 4.8 9.2 2.4 5.4
6 6.0 5.8 5.5 0.6
10 9.1 3.6 4.1 2.0
11 7.3 7.5 0.2 9.5
One may see that they have the first column in common, but each dataframe has IDs missing from the other. Is there a way in pandas to merge them column-wise in order to get a dataframe of the form:
ID id_01 id_02 id_03 id_04 ub_01 ub_02 ub_03 ub_04
1 0.0 1.0 5.2 7.1 0.0 1.0 9.2 8.3
2 0.5 7.7 3.4 9.8 1.5 2.7 0.4 4.9
3 NaN NaN NaN NaN 2.7 4.8 7.6 3.7
4 NaN NaN NaN NaN 4.8 9.2 2.4 5.4
5 1.5 0.8 1.6 2.5 NaN NaN NaN NaN
6 NaN NaN NaN NaN 6.0 5.8 5.5 0.6
7 3.0 0.2 3.4 6.3 NaN NaN NaN NaN
8 5.5 1.8 7.5 7.0 NaN NaN NaN NaN
9 7.2 2.6 9.1 1.1 NaN NaN NaN NaN
10 NaN NaN NaN NaN 9.1 3.6 4.1 2.0
11 9.5 3.5 2.2 0.3 9.5 3.5 2.2 0.3
PS: Notice that this is an oversimplified example, the real databases have the shapes id(144233, 41) and ub(590540, 394).
You could accomplish this using an outer join. Here is the code for it:
train_id = pd.read_csv("train_id.csv")
train_ub = pd.read_csv("train_ub.csv")
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
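A self-contained sketch with hypothetical miniature stand-ins for the two files (column names illustrative), so the behavior is visible without the CSVs:

```python
import pandas as pd

# Hypothetical miniature versions of the two CSVs, inlined so the example runs.
train_id = pd.DataFrame({'ID': [1, 2, 5], 'id_01': [0.0, 0.5, 1.5]})
train_ub = pd.DataFrame({'ID': [1, 3], 'ub_01': [0.0, 2.7]})

# An outer join keeps every ID found in either frame; rows missing on one
# side are filled with NaN.
train_merged = train_id.merge(train_ub, on=["ID"], how="outer")
print(train_merged)
```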
I have merged two dataframes having the same column names. Is there an easy way to get another column with the mean of these two appended dataframes?
Maybe the code explains it better.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'a':[1,2,3,4],'b':[10,20,30,40]})
df2 = pd.DataFrame({'a':[1.2,2.2,3.2,4.2],'b':[10.2,20.2,30.2,40.2]})
df = df1.append(df2)
print(df)
df['a_mean'] = ???
a b
0 1.0 10.0
1 2.0 20.0
2 3.0 30.0
3 4.0 40.0
0 1.2 10.2
1 2.2 20.2
2 3.2 30.2
3 4.2 40.2
How do I create a new column a_mean with the values
[1.1, 2.1, 3.1, 4.1, 1.1, 2.1, 3.1, 4.1] efficiently?
Using melt():
df = df.assign(a_mean=df1.add(df2).div(2).melt().value)
Or, using only df, you can do:
df = df.assign(a_mean=df.groupby(df.index)['a'].mean())
a b a_mean
0 1.0 10.0 1.1
1 2.0 20.0 2.1
2 3.0 30.0 3.1
3 4.0 40.0 4.1
0 1.2 10.2 1.1
1 2.2 20.2 2.1
2 3.2 30.2 3.1
3 4.2 40.2 4.1
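The groupby variant can also be checked on its own; note that DataFrame.append was removed in pandas 2.0, so this sketch uses pd.concat to build the same stacked frame:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 30, 40]})
df2 = pd.DataFrame({'a': [1.2, 2.2, 3.2, 4.2], 'b': [10.2, 20.2, 30.2, 40.2]})
# pd.concat replaces the deprecated/removed DataFrame.append.
df = pd.concat([df1, df2])

# Rows from the same original position share an index label (0..3 twice),
# so grouping by the index averages them pairwise; assignment aligns by label.
df['a_mean'] = df.groupby(df.index)['a'].mean()
print(df)
```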
Try this:
df['a_mean'] = np.tile( (df1.a.to_numpy() + df2.a.to_numpy())/2, 2)
As per the comments, there is already a great answer by Anky, but to extend this method you can do this:
df['a_mean2'] = np.tile( (df.iloc[0: len(df)//2].a.to_numpy() + df.iloc[len(df)//2:].a.to_numpy())/2, 2)
Update:
df['a_mean3'] = np.tile(df.a.to_numpy().reshape(2,-1).mean(0), 2)
Output:
print(df)
a b a_mean2 a_mean a_mean3
0 1.0 10.0 1.1 1.1 1.1
1 2.0 20.0 2.1 2.1 2.1
2 3.0 30.0 3.1 3.1 3.1
3 4.0 40.0 4.1 4.1 4.1
0 1.2 10.2 1.1 1.1 1.1
1 2.2 20.2 2.1 2.1 2.1
2 3.2 30.2 3.1 3.1 3.1
3 4.2 40.2 4.1 4.1 4.1
Say I have two dataframes, df1 and df2 as shown here:
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
Timestamp_A
0 0.6
1 1.1
2 1.6
3 2.1
4 2.6
5 3.1
6 3.6
7 4.1
8 4.6
9 5.1
10 5.6
11 6.1
12 6.6
13 7.1
Timestamp_B
0 2.2
1 2.7
2 3.2
3 3.7
4 5.2
5 5.7
Each dataframe is the output of different sensor readings, and each is transmitted at the same frequency. What I would like to do is align these two dataframes so that each timestamp in B aligns with the timestamp in A closest to its value. All values in Timestamp_A which do not have a match in Timestamp_B should be paired with np.nan. Does anyone have advice on the best way to go about doing something like this? Here is the desired output:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 NaN
7 4.1 NaN
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 NaN
12 6.6 NaN
13 7.1 NaN
You probably want some application of merge_asof, like so:
import pandas as pd
df1 = pd.DataFrame({'Timestamp_A': [0.6, 1.1, 1.6, 2.1, 2.6, 3.1, 3.6, 4.1, 4.6, 5.1, 5.6, 6.1, 6.6, 7.1]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7, 3.2, 3.7, 5.2, 5.7]})
df3 = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
tolerance=0.5, direction='nearest')
print(df3)
Output as follows:
Timestamp_A Timestamp_B
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
The tolerance defines numerically what "not having a match" means, so that is up to you to determine.
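To see the effect, here is a hypothetical comparison of a loose and a tight tolerance on a tiny slice of the data:

```python
import pandas as pd

df1 = pd.DataFrame({'Timestamp_A': [0.6, 2.1, 2.6]})
df2 = pd.DataFrame({'Timestamp_B': [2.2, 2.7]})

# tolerance=0.5: 2.1 pairs with 2.2 and 2.6 with 2.7; 0.6 is too far from anything.
loose = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                      tolerance=0.5, direction='nearest')
# tolerance=0.05: even the 0.1 gaps are now "too far", so nothing matches.
tight = pd.merge_asof(df1, df2, left_on='Timestamp_A', right_on='Timestamp_B',
                      tolerance=0.05, direction='nearest')
print(loose)
print(tight)
```

merge_asof requires both key columns to be sorted, which holds here and for the timestamps in the question.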
When you only have two columns and a single value assignment, I feel reindex is more suitable:
df2.index=df2.Timestamp_B
df1['New']=df2.reindex(df1.Timestamp_A,method='nearest',tolerance=0.5).values
df1
Out[109]:
Timestamp_A New
0 0.6 NaN
1 1.1 NaN
2 1.6 NaN
3 2.1 2.2
4 2.6 2.7
5 3.1 3.2
6 3.6 3.7
7 4.1 3.7
8 4.6 NaN
9 5.1 5.2
10 5.6 5.7
11 6.1 5.7
12 6.6 NaN
13 7.1 NaN
For more columns
s=pd.DataFrame(df2.reindex(df1.Timestamp_A,method='nearest',tolerance=0.5).values,index=df1.index,columns=df2.columns)
df1=pd.concat([df1,s],axis=1)
I have a dataframe with scores of three persons (John, Terry, Henry) from day 1 to day 7.
1 2 3 4 5 6 7
John 1.3 2.8 3.0 4.4 2.6 3.1 4.8
Terry 1.1 2.3 4.1 5.5 3.7 2.1 3.8
Henry 0.3 1.0 2.0 3.0 2.7 1.1 2.8
How do I set a score ceiling such that once a score goes above 2.5, all scores from that day onwards are fixed at that value, no matter what the later scores are?
The output should be:
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
I tried the following, which didn't work. I first flag all numbers > 2.5 with a boolean, then apply a mask to the cumulative sum:
df = df.mask((df > 2.5).cumsum(axis=1) > 0, df)
You can find the first capped value per row using where with bfill, then select the first column with iloc:
m = (df > 2.5).cumsum(axis=1) > 0
s = df.where(m).bfill(axis=1).iloc[:, 0]
print (s)
John 2.8
Terry 4.1
Henry 3.0
Name: 1, dtype: float64
df = df.mask(m, s, axis=0)
Or shift the mask and forward-fill to propagate the last valid values:
m = (df > 2.5).cumsum(axis=1) > 0
df = df.mask(m.shift(axis=1).fillna(False)).ffill(axis=1)
print (df)
1 2 3 4 5 6 7
John 1.3 2.8 2.8 2.8 2.8 2.8 2.8
Terry 1.1 2.3 4.1 4.1 4.1 4.1 4.1
Henry 0.3 1.0 2.0 3.0 3.0 3.0 3.0
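The second variant can be verified end to end with a self-contained sketch; it uses shift's fill_value so the mask stays boolean instead of relying on fillna:

```python
import pandas as pd

df = pd.DataFrame({'John': [1.3, 2.8, 3.0, 4.4, 2.6, 3.1, 4.8],
                   'Terry': [1.1, 2.3, 4.1, 5.5, 3.7, 2.1, 3.8],
                   'Henry': [0.3, 1.0, 2.0, 3.0, 2.7, 1.1, 2.8]}).T
df.columns = range(1, 8)

# True from the first day a score exceeds 2.5 onwards.
m = (df > 2.5).cumsum(axis=1) > 0
# Blank out everything strictly after the first exceedance, then
# forward-fill so the exceeding score propagates to the end of the row.
out = df.mask(m.shift(axis=1, fill_value=False)).ffill(axis=1)
print(out)
```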