So, I'm trying to import the stock prices for the S&P 500 (SPY) and BP (the O&G/energy company). The result I am looking for is a "table" of 3 columns: one for dates, one for the Adj Close of SPY and one for the Adj Close of BP. However, my code produces:
ValueError: columns overlap but no suffix specified: Index(['SPY'], dtype='object')
I understand roughly what this error is telling me: the "Adj Close" column I extract gets renamed to "SPY" more than once, so there is an overlap, and the join() method I am using complains because column names must be unique. Well, something like that is how I've interpreted it, anyway...
The code:
import pandas as pd

def test_run():
    start_date = '2016-03-10'  # start date parameter
    end_date = '2017-03-10'    # end date parameter
    dates = pd.date_range(start_date, end_date)
    df1 = pd.DataFrame(index=dates)  # create empty dataframe df1
    # create dataframe for SPY stock
    dfSPY = pd.read_csv("C:\SPY.csv", index_col="Date", parse_dates=True,
                        usecols=['Date', 'Adj Close'], na_values=['nan'])
    # rename Adj Close column to SPY to prevent clash
    dfSPY = dfSPY.rename(columns={'Adj Close': 'SPY'})
    # join the 2 dataframes using DataFrame.join(), and how='inner'
    df1 = df1.join(dfSPY, how='inner')

    # read in more stocks; SPY & BP
    symbols = ['SPY', 'BP']
    for symbol in symbols:
        df_temp = pd.read_csv("C{}.csv".format(symbol), index_col='Date', parse_dates=True,
                              usecols=['Date', 'Adj Close'], na_values=['nan'])
    # rename to prevent clash
    df_temp = df_temp.rename(columns={'Adj Close': symbol})
    df1 = df1.join(df_temp)  # use default how='left'
    print(df1)

if __name__ == "__main__":
    test_run()
So, that's the code I've got. If there's anyone out there who can shed some light as to what on Earth I've done wrong, please let me know.
Many thanks!
The code you provided is overwriting the value of df_temp in your for loop - it will only end up with the value assigned during the last iteration. I assume the last two lines posted below are actually meant to be inside your for loop:
for symbol in symbols:
    df_temp = pd.read_csv("C{}.csv".format(symbol), index_col='Date', parse_dates=True,
                          usecols=['Date', 'Adj Close'], na_values=['nan'])
    # rename to prevent clash
    df_temp = df_temp.rename(columns={'Adj Close': symbol})
    df1 = df1.join(df_temp)  # use default how='left'
There's already an 'SPY' column after you joined dfSPY to df1. You have 'SPY' again in your list of symbols, which is going to throw an error because pandas can't join dataframes with overlapping column names unless you specify a suffix to distinguish the columns.
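For completeness, a minimal sketch of the two obvious fixes (illustrative only, not tested against your CSV files): either leave 'SPY' out of the loop since it is already joined, or tell join() how to suffix the clashing name:

# Option 1: SPY is already in df1, so only loop over the remaining symbols
symbols = ['BP']

# Option 2: keep the loop as-is, but allow the duplicate column name
df1 = df1.join(df_temp, rsuffix='_dup')   # the second SPY column becomes 'SPY_dup'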
I just wanted to get this question closed out. I gave up on importing the .csv files of stocks and just "imported" directly from Yahoo Finance instead. This doesn't really answer my original question, so I still don't know what went wrong, but the following solution feels much more efficient and "elegant" to me:
import pandas as pd
import pandas.io.data as web
import datetime
start = datetime.datetime(2000,1,1)
end = datetime.date.today()
BP=web.DataReader("BP","yahoo",start,end)
SPY=web.DataReader("SPY","yahoo",start,end)
df_stocks=pd.DataFrame({"BP":BP["Adj Close"],"SPY":SPY["Adj Close"]})
df_stocks.tail()
BP SPY
Date
2017-03-07 33.869999 237.000000
2017-03-08 33.310001 236.559998
2017-03-09 33.500000 236.860001
2017-03-10 34.330002 237.690002
2017-03-13 34.070000 237.809998
Thanks to anyone who had a look.
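A note for anyone trying this later: pandas.io.data was split out of pandas into the separate pandas-datareader package, so on a current install the import above fails. A minimal sketch of the same idea with that package (assuming it is installed and the Yahoo source still responds):

import datetime
import pandas as pd
from pandas_datareader import data as web   # successor to pandas.io.data

start = datetime.datetime(2000, 1, 1)
end = datetime.date.today()

# same idea as above: pull Adj Close for each ticker and line them up by date
bp = web.DataReader("BP", "yahoo", start, end)
spy = web.DataReader("SPY", "yahoo", start, end)
df_stocks = pd.DataFrame({"BP": bp["Adj Close"], "SPY": spy["Adj Close"]})
print(df_stocks.tail())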
I'm very new to coding, so perhaps there is a super simple answer for this, but here it goes:
I have a dataframe of a bunch of stocks. Each stock has a ticker, and the tickers are stored in a column. I've created a list of all the stocks I want in my data frame. I am wondering how I remove the stocks with tickers that do not appear in my list.
from pandas import *
C = DataFrame(["TD","CM","AAPL","GOOG", "GOOS"],columns=["Ticker"])
There are several hundred occurrences of each ticker, and each has an associated price, return, risk free rate, and time. I've created a list of stocks that I want to analyze, based on how many occurrences they have in the dataframe. I have already done this, but the simplified list looks like this:
list = ['GOOG', 'AAPL']
I want to return a dataframe that only has these tickers in it, but also includes all the row data associated with each one. I'm honestly pretty stumped on how to do this, but I'm sure there is a simple answer. Any help would be super appreciated!
You can use isin for this:
tickers = ['GOOG', 'AAPL']
df = C.loc[C['Ticker'].isin(tickers)].reset_index(drop=True)
Output:
#df
Ticker
0 AAPL
1 GOOG
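Since your real frame has more than just the ticker column, here is a small sketch showing that the same isin mask keeps every column of the matching rows (the Price and Return columns below are made up purely for illustration):

import pandas as pd

df = pd.DataFrame({'Ticker': ['TD', 'CM', 'AAPL', 'GOOG', 'GOOS', 'AAPL'],
                   'Price': [60.1, 80.2, 150.3, 1100.4, 40.5, 151.2],
                   'Return': [0.01, -0.02, 0.03, 0.00, 0.01, -0.01]})

tickers = ['GOOG', 'AAPL']
filtered = df[df['Ticker'].isin(tickers)].reset_index(drop=True)
print(filtered)   # only the AAPL/GOOG rows, with Price and Return intact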
I am new to Python and this data science world, and I am trying to play with different datasets.
In this case I am using the housing price index from Quandl, but unfortunately I get stuck when I need to take the abbreviated state names from the wiki page, always getting the same KeyError.
import quandl
import pandas as pd

#pull every single housing price index from quandl
#quandl api key
api_key = 'xxxxxxxxxxxx'

#get stuff from quandl
df = quandl.get('FMAC/HPI_AK', authtoken=api_key)  #alaska
##print(df.head())

#get 50 states using pandas read_html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1]) #first data frame is index 0, #looking for column 1, #from element 1 on

#get quandl frannymac query names for each of the 50 states
for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))
So the problem I get is in the following step:

#get 50 states using pandas read_html from wikipedia
fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
##print(fifty_states[0][1]) #first data frame is index 0, #looking for column 1, #from element 1 on

I have tried different ways to get just the abbreviations, but it does not work:

for abbv in fifty_states[0][1][2:]:
    print('FMAC/HPI_' + str(abbv))

for abbv in fifty_states[0][1][1:]:
    print('FMAC/HPI_' + str(abbv))

I always get KeyError: 0.
I just need this step to work, and to have the following output:
FMAC/HPI_AL,
FMAC/HPI_AK,
FMAC/HPI_AZ,
FMAC/HPI_AR,
FMAC/HPI_CA,
FMAC/HPI_CO,
FMAC/HPI_CT,
FMAC/HPI_DE,
FMAC/HPI_FL,
FMAC/HPI_GA,
FMAC/HPI_HI,
FMAC/HPI_ID,
FMAC/HPI_IL,
FMAC/HPI_IN,
FMAC/HPI_IA,
FMAC/HPI_KS,
FMAC/HPI_KY,
FMAC/HPI_LA,
FMAC/HPI_ME
for the 50 US states, and then proceed to do some data analysis on this data.
Can anybody tell me what I am doing wrong? Cheers.
Note that fifty_states is a list of DataFrames, filled with the content of the tables from the source page. The first of them (at index 0 in fifty_states) is the table of US states.

If you don't know the column names in a DataFrame (e.g. df), then to get column 1 from it (numbering from 0), run:

df.iloc[:, 1]

So, since we want this column from fifty_states[0], run:

fifty_states[0].iloc[:, 1]

Your code failed because you attempted to apply [1] to this DataFrame, but this DataFrame has no column named 1.

Note that e.g. fifty_states[0][('Cities', 'Capital')] gives a proper result, because this DataFrame has a MultiIndex on columns, and one of the columns has Cities at the first MultiIndex level and Capital at the second level.
And getting back to your code, run:

for abbv in fifty_states[0].iloc[:, 1]:
    print('FMAC/HPI_' + str(abbv))
Note that [2:] is not needed. You probably wanted to skip the 2 initial rows of the <table> HTML tag, which contain the column names, but in Pandas they are actually kept in the MultiIndex on columns, so to get all the values you don't need to skip anything.
If you want these strings as a list, for future use, the code can be:
your_list = ('FMAC/HPI_' + fifty_states[0].iloc[:, 1]).tolist()
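If the eventual goal (per the top of your script) is to pull the HPI series for every state, a rough sketch along those lines - assuming each 'FMAC/HPI_XX' code exists on Quandl and that the abbreviations really are in column 1 of the first table - could be:

import quandl
import pandas as pd

api_key = 'xxxxxxxxxxxx'  # your own Quandl key

fifty_states = pd.read_html('https://en.wikipedia.org/wiki/List_of_states_and_territories_of_the_United_States')
abbreviations = fifty_states[0].iloc[:, 1].tolist()

main_df = pd.DataFrame()
for abbv in abbreviations:
    query = 'FMAC/HPI_' + str(abbv)
    df = quandl.get(query, authtoken=api_key)
    df.columns = [str(abbv)]                      # one HPI column per state
    main_df = df if main_df.empty else main_df.join(df)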
Please excuse me if this question is too n00bish, I am brand new to Python and need to use it for work, which unfortunately means diving into higher level stuff without first understanding the basics...
I have a massive CSV with text transcripts which I read into a pandas dataframe. These transcripts are broken down by ID, and the IDs must be grouped to get a single record for each interaction, as they are broken apart into segments in the original database they come from. The format is something like this:
ID TEXT
1 This is the beginning of a convo
1 heres the middle
1 heres the end of the convo
2 this is the start of another convo...etc.
I used this code to group by ID and create singular records:
df1 = df.groupby('ID').text.apply(' '.join)
This code worked great, but now I am stuck with a Series (?) that no longer recognizes the index "ID" - I think it's been merged with the text or something. When I use to_frame() the problem remains. I am wondering how I might separate the ID out again and use it to index the data?
The groupby will return the grouped-by column as the index. Looking at your code, this is what I see:
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2],
                   'TEXT': ['This is the beginning of a convo',
                            'heres the middle',
                            'heres the end of the convo',
                            'this is the start of another convo...etc.']})
df1 = df.groupby('ID').TEXT.apply(' '.join)
print(df1)
ID
1 This is the beginning of a convo heres the mid...
2 this is the start of another convo...etc.
Name: TEXT, dtype: object
You can take the series df1 and reset its index if you want ID as a column in a dataframe, or move on with it as the index of the series, which can be handy depending on what your next steps will be.
df1 = df1.reset_index()
print(df1)
ID TEXT
0 1 This is the beginning of a convo heres the mid...
1 2 this is the start of another convo...etc.
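If you would rather get a DataFrame with ID as an ordinary column in one step, an equivalent variation on the same groupby (using the same toy data as above) is to aggregate with as_index=False:

df1 = df.groupby('ID', as_index=False).agg({'TEXT': ' '.join})
print(df1)
# ID becomes a regular column and TEXT holds the joined strings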
To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings to a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The number of rows used for 'description' isn't consistent - sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifting up, and then iterating in some way, but I haven't found a solution that matches what I'm trying to do. This is where I'm at so far:
#Add columns of description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did. If not, just use the value in 'Y'. Then carry that logic across, but stop if an NA is encountered in any subsequent column. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')
df.fillna(method='ffill').groupby([
    'name',
    'date'
]).description.apply(lambda x: ', '.join(x)).to_frame(name='description')
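If the data is coming from a file rather than the clipboard, the same idea can be written out in full. A minimal sketch, assuming the CSV is saved as 'dogdata.csv' (hypothetical name) - note that read_csv skips the fully blank rows between entries by default:

import pandas as pd

df = pd.read_csv('dogdata.csv')

# carry name/date down over the continuation rows, then glue the
# description fragments back together for each (name, date) pair
out = (df.fillna(method='ffill')
         .groupby(['name', 'date'])['description']
         .apply(', '.join)
         .reset_index())
print(out)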
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?
I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the earliest date for which data is available in the shorter series, and remove the data in the 2 columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NAs in one of the columns for more recent dates).
You can use idxmax on the inverted series s = df['osr'][::-1] to find the last date that is still NaN, and then take a subset of df:
print(df)
#              osr      go
# Date
# 1990-08-17     NaN  239.75
# 1990-08-20     NaN  251.50
# 1990-08-21  352.00  265.00
# 1990-08-22  353.25  274.25
# 1990-08-23  351.75  290.25

s = df['osr'][::-1]
print(s)
# Date
# 1990-08-23    351.75
# 1990-08-22    353.25
# 1990-08-21    352.00
# 1990-08-20       NaN
# 1990-08-17       NaN
# Name: osr, dtype: float64

maxnull = s.isnull().idxmax()
print(maxnull)
# 1990-08-20 00:00:00

print(df[df.index > maxnull])
#              osr      go
# Date
# 1990-08-21  352.00  265.00
# 1990-08-22  353.25  274.25
# 1990-08-23  351.75  290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential, and once you have rows that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good - or that you don't care about dropping rows in the middle; it depends on how sequential you need the data to be. If the data needs to stay sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions. I'm going to assume you have many columns, two of which - time_series_1 and time_series_2 - are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series (the one with fewer non-NaN values) by just using

shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']

Now we want the first date for which that series has data

remove_date = shorter_col.first_valid_index()

Now we want to remove data before that date

mask = df.index >= remove_date
df = df[mask]
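To tie it back to the osr/go fragment in the question, a small sketch of that cutoff logic end to end (values retyped from the post):

import numpy as np
import pandas as pd

df = pd.DataFrame({'osr': [np.nan, np.nan, 352.00, 353.25, 351.75],
                   'go': [239.75, 251.50, 265.00, 274.25, 290.25]},
                  index=pd.to_datetime(['1990-08-17', '1990-08-20', '1990-08-21',
                                        '1990-08-22', '1990-08-23']))
df.index.name = 'Date'

# 'osr' is the shorter series; keep everything from its first valid date onward
cutoff = df['osr'].first_valid_index()
print(df[df.index >= cutoff])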