I have two pandas dataframes that I want to merge. My first dataframe, names, is a list of stock tickers and corresponding dates. Example below:
Date Symbol DateSym
0 2017-01-05 AGRX AGRX01-05-2017
1 2017-01-05 TMDI TMDI01-05-2017
2 2017-01-06 ATHE ATHE01-06-2017
3 2017-01-06 AVTX AVTX01-06-2017
4 2017-01-09 CVM CVM01-09-2017
5 2017-01-10 DFFN DFFN01-10-2017
6 2017-01-10 VKTX VKTX01-10-2017
7 2017-01-11 BIOC BIOC01-11-2017
8 2017-01-11 CVM CVM01-11-2017
9 2017-01-11 PALI PALI01-11-2017
I created another dataframe, price1, by looping through the tickers and building a dataframe with the open, high, low, close, and other relevant info I need. When I merge the two dataframes together, I want to show only the names dataframe on the left with the corresponding price data on the right. When I ran a test of the first 10 tickers, I noticed that the combined dataframe outputs redundant rows (see CVM in rows 4 and 5 below), even though the price1 dataframe doesn't have duplicate values. What am I doing wrong?
import numpy as np
import pandas as pd
import yfinance as yf
# `nyse` is assumed to be an exchange calendar created elsewhere,
# e.g. pandas_market_calendars.get_calendar('NYSE')

def price_stats(df):
    # df['ticker'] = df
    df['Gap Up%'] = df['Open'] / df['Close'].shift(1) - 1
    df['HOD%'] = df['High'] / df['Open'] - 1
    df['Close vs Open%'] = df['Close'] / df['Open'] - 1
    df['Close%'] = df['Close'] / df['Close'].shift(1) - 1
    df['GU and Goes Red'] = np.where((df['Low'] < df['Close'].shift(1)) & (df['Open'] > df['Close'].shift(1)), 1, 0)
    df['Yday Intraday>30%'] = np.where((df['Close vs Open%'].shift(1) > .30), 1, 0)
    df['Gap Up?'] = np.where((df['Gap Up%'] > 0), 1, 0)
    df['Sloppy $ Vol'] = (df['High'] + df['Low'] + df['Close']) / 3 * df['Volume']
    df['Prev Day Sloppy $ Vol'] = (df['High'].shift(1) + df['Low'].shift(1) + df['Close'].shift(1)) / 3 * df['Volume'].shift(1)
    df['Prev Open'] = df['Open'].shift(1)
    df['Prev High'] = df['High'].shift(1)
    df['Prev Low'] = df['Low'].shift(1)
    df['Prev Close'] = df['Close'].shift(1)
    df['Prev Vol'] = df['Volume'].shift(1)
    df['D-2 Close'] = df['Close'].shift(2)
    df['D-2 Vol'] = df['Volume'].shift(2)
    df['D-3 Close'] = df['Close'].shift(3)
    df['D-2 Open'] = df['Open'].shift(2)
    df['D-2 High'] = df['High'].shift(2)
    df['D-2 Low'] = df['Low'].shift(2)
    df['D-2 Intraday Range'] = df['D-2 Close'] / df['D-2 Open'] - 1
    df['D-2 Close%'] = df['D-2 Close'] / df['D-3 Close'] - 1
    df.dropna(inplace=True)
vol_excel = pd.read_excel('C://U******.xlsx')
names = vol_excel.Symbol.to_list()

price1 = pd.DataFrame()
for name in names[0:10]:
    print(name)
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    price1 = pd.concat([price1, price])

price1 = price1.reset_index()
orig_day = pd.to_datetime(price1['Date'])
price1['Prev Day Date'] = orig_day - pd.tseries.offsets.CustomBusinessDay(1, holidays=nyse.holidays().holidays)
price1['DateSym'] = price1['ticker'] + price1['Date'].dt.strftime('%m-%d-%Y')
price1 = price1.rename(columns={'ticker': 'Symbol'})

# move DateSym to the front
datesym = price1['DateSym']
price1.drop(labels=['DateSym'], axis=1, inplace=True)
price1.insert(0, 'DateSym', datesym)

vol_excel['DateSym'] = vol_excel['Symbol'] + vol_excel['Date'].dt.strftime('%m-%d-%Y')
dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')
See how CVM is duplicated when I print out dfcombo:
Date Symbol DateSym_x DateSym_y Open High Low Close Adj Close Volume ... Prev Vol D-2 Close D-2 Vol D-3 Close D-2 Open D-2 High D-2 Low D-2 Intraday Range D-2 Close% Prev Day Date
0 2017-01-05 AGRX AGRX01-05-2017 AGRX01-05-2017 2.71 2.71 2.40 2.52 2.52 2408400 ... 18584900.0 5.000 2390400.0 5.700000 5.770 5.813000 4.460 -0.133449 -0.122807 2017-01-04
1 2017-01-05 TMDI TMDI01-05-2017 TMDI01-05-2017 15.60 16.50 12.90 13.50 13.50 43830 ... 114327.0 10.500 61543.0 7.200000 7.500 10.500000 7.500 0.400000 0.458333 2017-01-04
2 2017-01-06 ATHE ATHE01-06-2017 ATHE01-06-2017 2.58 2.60 2.23 2.42 2.42 222500 ... 1750700.0 1.930 53900.0 1.750000 1.790 1.950000 1.790 0.078212 0.102857 2017-01-05
3 2017-01-06 AVTX AVTX01-06-2017 AVTX01-06-2017 1.24 1.24 1.02 1.07 1.07 480500 ... 1246100.0 0.883 44900.0 0.890000 0.896 0.950000 0.827 -0.014509 -0.007865 2017-01-05
4 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
5 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
6 2017-01-10 DFFN DFFN01-10-2017 DFFN01-10-2017 111.00 232.50 108.75 125.25 125.25 165407 ... 43167.0 30.900 67.0 34.650002 31.500 34.349998 30.900 -0.019048 -0.108225 2017-01-09
7 2017-01-10 VKTX VKTX01-10-2017 VKTX01-10-2017 1.64 1.64 1.43 1.56 1.56 981700 ... 1550400.0 1.260 264900.0 1.230000 1.250 1.299000 1.210 0.008000 0.024390 2017-01-09
8 2017-01-11 BIOC BIOC01-11-2017 BIOC01-11-2017 858.00 1017.00 630.00 813.00 813.00 210182 ... 78392.0 306.000 5368.0 285.000000 285.000 315.000000 285.000 0.073684 0.073684 2017-01-10
9 2017-01-11 CVM CVM01-11-2017 CVM01-11-2017 4.25 4.50 3.00 3.75 3.75 487584 ... 672692.0 2.750 376520.0 2.750000 2.750 3.000000 2.750 0.000000 0.000000 2017-01-10
I'm wondering whether the issue is that the names dataframe can contain the same ticker with different dates, and each pass through the loop pulls that ticker's price data and appends it to the price1 dataframe again.
For example, in the names dataframe, AGRX can be listed for both 2017-01-05 and 2020-12-20. My loop as shown pulls the Yahoo data and appends it to the price1 dataframe even though it's the same set of data. By the same token, is there a way for me to skip appending that duplicate ticker, and will that solve the issue?
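A minimal sketch of both ideas, reusing the variables defined above (names, price1, vol_excel); either deduplicating the tickers before the loop or dropping the duplicated rows before the merge should remove the repeated CVM rows:
# Sketch only, not a drop-in replacement for the script above.

# Option 1: download each ticker only once, even if it appears on several dates.
unique_names = list(dict.fromkeys(names))  # preserves order, drops repeats

# Option 2: drop exact duplicate rows after the loop, before merging.
price1 = price1.drop_duplicates(subset=['Symbol', 'Date'])
dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')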
Sorry if I'm not clear, but I have a challenge.
This is the sample data I have generated to try to make my challenge clear:
        B      V      S      F      K
0    0.32  10.32  11.32  12.32  13.32
1    1.32  11.32  12.32  13.32  14.32
2    2.32  12.32  13.32  14.32  15.32
3    3.32  13.32  14.32  15.32  16.32
4    4.32  14.32  15.32  16.32  17.32
5    5.32  15.32  16.32  17.32  18.32
6    6.32  16.32  17.32  18.32  19.32
7    7.32  17.32  18.32  19.32  20.32
8    8.32  18.32  19.32  20.32  21.32
9    9.32  19.32  20.32  21.32  22.32
10  10.32  20.32  21.32  22.32  23.32
My expected output:
    K      L  M
0   1   2.32
1   2   3.32
2   3   4.32
3   4   5.32
4   5   6.32
5   6  13.32
6   7  14.32
7   8  15.32
8   9  16.32
9  10  17.32
The table above shows the outcome I'm after: I would like to know how to create another column M in dataset 2 that returns the name of the column from dataset 1 that contains the value in column L (which is in dataset 2).
I have tried the code below, but it didn't work. I hope someone here can help; thanks in advance!
spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]
It returned the following error:
~\AppData\Local\Temp/ipykernel_25368/552331776.py in <module>
----> 1 spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]

~\AppData\Local\Temp/ipykernel_25368/552331776.py in <listcomp>(.0)
----> 1 spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]

TypeError: 'in <string>' requires string as left operand, not Series
I created two dataframes here. The for loop searches for matches and records the matching column names for each row. You can take the dataframes df and df1 from my answer and swap in your own data.
import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [0.32, 1.32, 2.32, 3.32, 4.32, 5.32, 6.32, 7.32, 8.32, 9.32, 10.32],
                   'V': [10.32, 11.32, 12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32],
                   'S': [11.32, 12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32],
                   'F': [12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32, 22.32],
                   'K': [13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32, 22.32, 23.32]})
print(df)

df1 = pd.DataFrame({'K': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'L': [2.32, 3.32, 4.32, 5.32, 6.32, 13.32, 14.32, 15.32, 16.32, 17.32]})

M = []
for k in range(len(df1)):
    i, c = np.where(df == df1['L'][k])   # row and column indexes where there was a match
    ttt = df.columns[c]                  # names of the matching columns
    M.append(','.join(list(ttt)))        # join multiple matches into a comma-separated string

df1['M'] = M  # add a column with the collected values
print(df1)
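For reference, a vectorized sketch that produces the same M column without the explicit loop, assuming the same df and df1 as above:
# For each value in L, collect the names of the df columns that contain it.
df1['M'] = df1['L'].map(lambda v: ','.join(df.columns[(df == v).any(axis=0)]))
print(df1)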
This is my dataset:
import pandas as pd
import numpy as np

data = {'brand_no': ['BH 1', 'BH 2', 'BH 5', 'BH 7', 'BH 6'],
        '1240000601_min': [5.87, 5.87, 5.87, 5.87, np.nan],
        '1240000601_max': [8.87, 7.47, 10.1, 1.9, 10.8],
        '1240000603_min': [5.87, np.nan, 6.5, 2.0, 7.8],
        '1240000603_max': [8.57, 7.47, 10.2, 1.0, 10.2],
        '1240000604_min': [5.87, 5.67, 6.9, 1.0, 7.8],
        '1240000604_max': [8.87, 8.87, 8.87, np.nan, 8.87],
        '1240000605_min': [15.87, 15.67, 16.9, 1.0, 17.8],
        '1240000605_max': [18.11, 17.47, 20.1, 1.9, 22.6],
        '1240000606_min': [8.12, 8.12, np.nan, 8.12, np.nan],
        '1240000606_max;': [np.nan, 7.47, 10.1, 1.9, np.nan]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
df
As you can see, some columns hold the same value in every row apart from NaN (because the data is sparse, it has NaN as well). I want to drop the columns whose value is the same across all rows: in this case, 1240000601_min, 1240000604_max, and 1240000606_min.
Desired output: all the columns with the same value across all rows are dropped. Please help me get this.
You can use something like this:
columns = [column for column in df.columns if df[column].nunique() == 1]
df = df.drop(columns=columns)
df.nunique() ignores NaNs by default, so you don't have to worry about those.
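A quick illustration of that NaN behaviour on a made-up series (dropna is a standard parameter of nunique):
import pandas as pd
import numpy as np

s = pd.Series([5.87, 5.87, np.nan])
print(s.nunique())               # 1 -- NaN is ignored by default
print(s.nunique(dropna=False))   # 2 -- NaN counted as a distinct value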
Try this:
df_cleared = df.loc[:, df.nunique() > 1]
You can use df.nunique() to count the unique items in each column and keep those greater than 1 with .gt(1). This forms a boolean mask over the columns. Then use .loc, putting the mask in its second (column) position to filter the columns:
df_cleaned = df.loc[:, df.nunique().gt(1)]
Result:
print(df_cleaned)
brand_no 1240000601_max 1240000603_min 1240000603_max 1240000604_min 1240000605_min 1240000605_max 1240000606_max;
0 BH 1 8.87 5.87 8.57 5.87 15.87 18.11 NaN
1 BH 2 7.47 NaN 7.47 5.67 15.67 17.47 7.47
2 BH 5 10.10 6.50 10.20 6.90 16.90 20.10 10.10
3 BH 7 1.90 2.00 1.00 1.00 1.00 1.90 1.90
4 BH 6 10.80 7.80 10.20 7.80 17.80 22.60 NaN
import pandas as pd

df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID', 'VegType', 'Elevation', 'C', 'N', 'C/N', 'Soil13C', 'Soil15N', 'pH', 'Na_[mg/g]', 'K_[mg/g]', 'Mg_[mg/g]', 'Ca_[mg/g]', 'P_[µg/g]']]
new_list = df[['Elevation', 'C', 'N', 'C/N', 'Soil13C', 'Soil15N', 'pH', 'Na_[mg/g]', 'K_[mg/g]', 'Mg_[mg/g]', 'Ca_[mg/g]', 'P_[µg/g]']]
nnew_list = [float(i) for i in new_list]

print(type(df['ID']))
# --> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
# --> True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the column headings, but I only wanted to convert the values.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because iterating over a dataframe, like iterating over a dictionary and its keys, yields only the column headings.
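A tiny demonstration of that iteration behaviour on a made-up two-column frame:
import pandas as pd

df = pd.DataFrame({'Elevation': [2425, 2489], 'C': [2.78, 2.18]})
print([col for col in df])  # ['Elevation', 'C'] -- iteration yields the labels
# float() on one of those labels raises exactly your error:
# ValueError: could not convert string to float: 'Elevation'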
Update 1
Try also using the options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40
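Note that the Na_[mg/g] column above is still text because of the 'ND' entries (hence the unconverted '0,0005'). A sketch of one more step, reusing the na_values='ND' option from the question so that column parses numerically as well:
mystr.seek(0)  # rewind the buffer before re-reading it
df = pd.read_csv(mystr, decimal=',', delimiter=';', na_values='ND')
print(df['Na_[mg/g]'].dtype)  # float64 -- 'ND' becomes NaN and the commas parse as decimals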