Pandas is not labelling columns correctly, and it does not make sense to me because I have used this method a few times already. I cannot think of anything that could be going wrong, and I am able to reproduce this error in a new Jupyter notebook as well.
import pandas as pd
columns = {'Start Time', 'Open', 'High', 'Low', 'Close', 'Volume',
'End Time', 'Amount', 'No. Trades', 'Taker Buy Base',
'Taker Buy Quote', 'N'}
df = pd.DataFrame(test_data, columns=columns)
df
Sample data
test_data = [[1617231600000,
'538.32000000',
'545.15000000',
'535.05000000',
'541.06000000',
'8663.58299000',
1617235199999,
'4686031.35064850',
11361,
'5051.86680000',
'2733476.69840350',
'0'],
[1617235200000,
'541.11000000',
'554.67000000',
'541.00000000',
'552.49000000',
'13507.97221000',
1617238799999,
'7404389.49931720',
14801,
'7736.80002000',
'4242791.14275430',
'0'],
[1617238800000,
'552.58000000',
'553.73000000',
'544.82000000',
'548.50000000',
'7115.15238000',
1617242399999,
'3907155.60293150',
5556,
'3580.46860000',
'1964701.27448790',
'0'],
[1617242400000,
'548.49000000',
'550.63000000',
'544.70000000',
'545.45000000',
'3589.18173000',
1617245999999,
'1964514.51702120',
3974,
'1742.76278000',
'954042.85262340',
'0'],
[1617246000000,
'545.80000000',
'545.80000000',
'540.48000000',
'541.56000000',
'4767.67233000',
1617249599999,
'2586841.14566500',
4960,
'2516.25626000',
'1364734.14900580',
'0']]
Expected output: the column names should appear in the same order as defined in the columns variable. For me they come out in a clearly wrong order. There is another post about this but it did not help.
Python sets do not maintain order. Use a list (square brackets instead of curly braces) instead:
columns = ['Start Time', 'Open', 'High', 'Low', 'Close', 'Volume',
'End Time', 'Amount', 'No. Trades', 'Taker Buy Base',
'Taker Buy Quote', 'N']
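A quick way to see the difference (a minimal standalone sketch): curly braces build a set, whose iteration order is arbitrary, while square brackets build a list, which keeps the order you wrote.
labels = {'Start Time', 'Open', 'High'}   # set: iteration order is arbitrary
print(list(labels))                       # e.g. ['High', 'Start Time', 'Open']
labels = ['Start Time', 'Open', 'High']   # list: order is preserved
print(labels)                             # always ['Start Time', 'Open', 'High']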
I'm a relative novice with pandas, but I use it to plot and compare trends in industrial and economic data across countries and time. The dataframes are organised like this:
#create sample dfs
import pandas as pd

df1 = pd.DataFrame(columns=['2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
                            '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'],
                   index=['United Arab Emirates', 'Argentina', 'Australia', 'Austria', 'Bulgaria',
                          'Brazil', 'Canada'])
df2 = pd.DataFrame(columns=['2004', '2005', '2006', '2007', '2008', '2009', '2010',
                            '2011', '2012', '2013', '2014', '2015', '2016'],
                   index=['Argentina', 'Australia', 'Austria', 'Bulgaria',
                          'Brazil', 'Canada', 'Switzerland', 'Chile', 'Colombia'])
df3 = pd.DataFrame(columns=['2005', '2006', '2007', '2008', '2009', '2010',
                            '2011', '2012', '2013', '2014', '2015', '2016', '2017', '2018'],
                   index=['Argentina', 'Australia', 'Austria', 'Bulgaria',
                          'Brazil', 'Canada'])
This data comes from different sources, so it does not always contain the same list of countries and years. In order to scatter plot them I need to wrangle the dfs so that they are all the same shape, with identical rows & columns / lists of countries and years. I am doing this as follows:
Concat the dfs with an inner join, creating lists of the countries and years that are common to all of them:
#create lists of countries and years common to all df
dfList = [df1, df2, df3]
merged = pd.concat(dfList, axis = 1, join='inner')
countryList = merged.index
merged = pd.concat(dfList, axis=0, join='inner')
yearList = merged.columns
However, I am having problems writing a function that loops through the dfs and removes the columns and rows (years & countries) that are not contained in yearList and countryList. The following function seems to run okay but does not change the columns / rows of the dfs. I'm pretty sure this is down to my misunderstanding of how variables work within loops, but I haven't been able to find anything on this as it applies to complete dataframes.
Can anyone point out why this loop isn't working, or suggest a more elegant / efficient way of wrangling a group of dfs so that they all contain identically labelled indices & columns? Many thanks in advance.
#loop through all dfs removing all rows / cols that are not in countryList & yearList
def countryyear(x):
    for x in dfList:
        x = x[x.index.isin(countryList)]
        x = x.loc[:, x.columns.isin(yearList)]
        #return x

countryyear(dfList)
In your countryyear function the loop itself runs, but reassigning the local variable x inside it never modifies the dataframes stored in dfList; each filtered result is bound to x and then discarded on the next iteration. (If you uncommented the return inside the loop, the function would also exit after processing only the first dataframe.) Note that you also shadow the function argument x by reusing it as the loop variable. I would put the per-dataframe logic inside the function, iterate over the dataframes outside it, and collect the cleaned dataframes in a list:
def countryyear(df):
    df = df[df.index.isin(countryList)]
    df = df.loc[:, df.columns.isin(yearList)]
    return df
cleanDFs = [countryyear(df) for df in dfList]
Just use loc to select the common index and columns:
df.loc[countryList, yearList]
merged = pd.concat(dfList, axis = 1, join='inner')
countryList = merged.index
merged = pd.concat(dfList, axis=0, join='inner')
yearList = merged.columns
dfn_list = list()
for df in dfList:
    dfn = df.loc[countryList, yearList].copy()
    dfn_list.append(dfn)
I'm attempting to set up a stock market gauge displaying the S&P 500, NASDAQ, and DJIA composite indices; however, I'm not sure where to find a source that gathers the three together. Otherwise, I think I could bring them in one at a time, like the following:
yahoo_url = "http://finance.yahoo.com/q?s=^GSPC"
web.DataReader('^GSPC','yahoo') # S&P 500
I'm not too sure how to go about making this particular code work though.
Any advice or a pointer in the right direction is greatly appreciated.
Thank you,
AJ
List the indexes you want
indexes = ['^DJI', '^GSPC', '^IXIC']
Query Yahoo Finance for them
import pandas_datareader.data as web

# DataReader accepts a list of symbols
df = web.DataReader(indexes, 'yahoo', start='2010-01-01')
>>> df.columns
MultiIndex([('Adj Close', '^DJI'),
('Adj Close', '^GSPC'),
('Adj Close', '^IXIC'),
( 'Close', '^DJI'),
( 'Close', '^GSPC'),
( 'Close', '^IXIC'),
( 'High', '^DJI'),
( 'High', '^GSPC'),
( 'High', '^IXIC'),
( 'Low', '^DJI'),
( 'Low', '^GSPC'),
( 'Low', '^IXIC'),
( 'Open', '^DJI'),
( 'Open', '^GSPC'),
( 'Open', '^IXIC'),
( 'Volume', '^DJI'),
( 'Volume', '^GSPC'),
( 'Volume', '^IXIC')],
names=['Attributes', 'Symbols'])
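Because the columns are a MultiIndex, one attribute can be pulled for all three symbols at once, which is handy for a gauge-style display; a minimal sketch building on the df above:
# Select the adjusted close for every symbol in one step
adj_close = df['Adj Close']                        # columns: ^DJI, ^GSPC, ^IXIC
latest = adj_close.iloc[-1]                        # most recent level of each index
daily_pct = adj_close.pct_change().iloc[-1] * 100  # day-over-day % change
print(latest)
print(daily_pct)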
I have the following two dataframes:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['01/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
(perhaps it's clearer in the screenshots here: https://imgur.com/a/YNrWpR2)
The df2 is much larger than shown here - it contains columns for 100 companies. So for example, for the 10th company, the column names are: ReturnOnAssets.10, etc.
I have created a dictionary which maps the company names to the column names:
stocks = {'Microsoft':'','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7'}
and so on.
Now, what I am trying to achieve is adding a column "ReturnOnAssets" from d2 to d1, but for a specific company and for a specific date. So looking at df1, the first tweet (i.e. "text") contains a keyword "Amazon" and it was posted on 04/28/2017. I now need to go to df2 to the relevant column name for Amazon (i.e. "ReturnOnAssets.2") and fetch the value for the specified date.
So what I expect looks like this:
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon', **'10.5'**], ["blala Amazon", '04/28/2017', 'Amazon', 'x'], ["blabla Netflix', '06/28/2017', 'Netflix', 'x']], columns=['text', 'date', 'keyword', 'ReturnOnAssets'])
By x I mean values which where not included in the example df1 and df2.
I am fairly new to pandas and I can't wrap my head around it. I tried:
keyword = df1['keyword']
txt = 'ReturnOnAssets.'+ stocks[keyword]
df1['ReturnOnAssets'] = df2[txt]
But I don't know how to fetch the relevant date, and also this gives me an error: "Series' objects are mutable, thus they cannot be hashed", which probably comes from the fact that I cannot just add a whole column of keywords to the text string.
I don't know how to achieve the operation I need to do, so I would appreciate help.
Here is a working approach. It can probably be shortened, and you can add if statements to deal with missing values.
import pandas as pd
import numpy as np
df1 = pd.DataFrame([["blala Amazon", '05/28/2017', 'Amazon'], ["blala Facebook", '04/28/2017', 'Facebook'], ['blabla Netflix', '06/28/2017', 'Netflix']], columns=['text', 'dates', 'keyword'])
df1
df2 = pd.DataFrame([['06/28/2017', '3.4', '10.2'], ['05/28/2017', '3.7', '10.5'], ['04/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAsset.1', 'ReturnOnAsset.2'])
#creating myself a bigger df2 to cover all the way to Netflix
for i in range(9):
    df2['ReturnOnAsset.' + str(i)] = np.random.randint(1, 1000, df1.shape[0])
stocks = {'Microsoft':'0','Apple' :'1', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Johnson & Johnson':'5',
'JPMorgan' :'6', 'Alphabet': '7', 'Netflix': '8'}
#new col where to store values
df1['ReturnOnAsset'] = np.nan
for index, row in df1.iterrows():
    colname = 'ReturnOnAsset.' + stocks[row['keyword']]
    # .loc avoids chained assignment; .iloc[0] pulls out the matching scalar
    df1.loc[index, 'ReturnOnAsset'] = df2.loc[df2['dates'] == row['dates'], colname].iloc[0]
Next time please give us correct test data; I modified your dates and dictionary so that the first and second columns match (the Netflix and Amazon values).
This code will work if and only if all dates from df1 are in df2 (Note that in df1 the column name is date and in df2 the column name is dates)
df1 = pd.DataFrame([["blala Amazon", '02/30/2017', 'Amazon'], ["blala Amazon", '04/28/2017', 'Amazon'], ['blabla Netflix', '02/30/2017', 'Netflix']], columns=['text', 'date', 'keyword'])
df2 = pd.DataFrame([['04/28/2017', '3.4', '10.2'], ['02/30/2017', '3.7', '10.5'], ['03/28/2017', '6.0', '10.9']], columns=['dates', 'ReturnOnAssets.1', 'ReturnOnAssets.2'])
stocks = {'Microsoft':'','Apple' :'5', 'Amazon':'2', 'Facebook':'3',
'Berkshire Hathaway':'4', 'Netflix':'1',
'JPMorgan' :'6', 'Alphabet': '7'}
df1["ReturnOnAssets"]= [ df2["ReturnOnAssets." + stocks[ df1[ "keyword" ][ index ] ] ][ df2.index[ df2["dates"] == df1["date"][index] ][0] ] for index in range(len(df1)) ]
df1
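As an alternative technique (not from either answer above), df2 can be reshaped to long form with melt and then merged onto df1 by date and company, which avoids the per-row lookups; a sketch assuming the same stocks mapping and column naming:
# Invert the mapping: column suffix -> company name
suffix_to_name = {v: k for k, v in stocks.items()}
# One row per (date, company) with its ReturnOnAssets value
long_df = df2.melt(id_vars='dates', var_name='col', value_name='ReturnOnAssets')
long_df['keyword'] = (long_df['col']
                      .str.replace('ReturnOnAssets.', '', regex=False)
                      .map(suffix_to_name))
# Left-join onto df1 by date and company keyword
df1 = df1.merge(long_df[['dates', 'keyword', 'ReturnOnAssets']],
                left_on=['date', 'keyword'], right_on=['dates', 'keyword'],
                how='left').drop(columns='dates')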
I am currently trying to apply the data science skills I am learning through Coursera and Dataquest to little personal projects.
I found a dataset on Google BigQuery from the US Department of Health and Human Services which includes all weekly surveillance reports of nationally notifiable diseases for all U.S. cities and states published between 1888 and 2013.
I exported the data to a .csv file and imported it into a Jupyter notebook which I am running through Anaconda. Upon looking at the header of the dataset I noticed that the dates/weeks are shown as 'epi_week'.
I am trying to make the data more readable and usable for some analysis; to do this I was hoping to convert it into something along the lines of DD/MM/YYYY or Week/Month/Year etc.
I did some research: apparently epi weeks are also referred to as CDC weeks, and so far I have found a package for Python 3 called "epiweeks".
Using the epiweeks package I can turn 'normal' dates into what the package creator refers to as epi week form, but the results look nothing like what I see in the dataset.
For example, if I use today's date, the 24th of May 2019 (24/05/2019), then the output is "Week 21 of Year 2019", but this is what the first four entries in the data look like (all the other ones follow the same format):
epi_week
'197006'
'197007'
'197008'
'197012'
In [1]: disease_header
Out [1]:
[['epi_week', 'state', 'loc', 'loc_type', 'disease', 'cases', 'incidence_per_100000']]
In [2]: disease[:4]
Out [2]:
[['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']]
The epiweeks package was developed to solve problems like the one you have here.
With the example data you provided, let's create a new column with week ending date:
import pandas as pd
from epiweeks import Week
columns = ['epi_week', 'state', 'loc', 'loc_type',
'disease', 'cases', 'incidence_per_100000']
data = [
['197006', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197007', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197008', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0'],
['197012', 'AK', 'ALASKA', 'STATE', 'MUMPS', '0', '0']
]
df = pd.DataFrame(data, columns=columns)
# Now create a new column with week ending date in ISO format
df['week_ending'] = df['epi_week'].apply(lambda x: Week.fromstring(x).enddate())
That results in something like:
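Illustrative output (the exact dates assume the package's default CDC/MMWR system, in which each epi week ends on a Saturday):
  epi_week state     loc loc_type disease cases incidence_per_100000 week_ending
0   197006    AK  ALASKA    STATE   MUMPS     0                    0  1970-02-14
1   197007    AK  ALASKA    STATE   MUMPS     0                    0  1970-02-21
2   197008    AK  ALASKA    STATE   MUMPS     0                    0  1970-02-28
3   197012    AK  ALASKA    STATE   MUMPS     0                    0  1970-03-28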
I recommend having a look at the epiweeks package documentation for more examples.
If you only need to have year and week columns, that can be done without using the epiweeks package:
df['year'] = df['epi_week'].apply(lambda x: int(x[:4]))
df['week'] = df['epi_week'].apply(lambda x: int(x[4:6]))
That results in something like:
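Illustrative output (year and week are parsed straight from the epi_week string):
  epi_week state     loc loc_type disease cases incidence_per_100000  year  week
0   197006    AK  ALASKA    STATE   MUMPS     0                    0  1970     6
1   197007    AK  ALASKA    STATE   MUMPS     0                    0  1970     7
2   197008    AK  ALASKA    STATE   MUMPS     0                    0  1970     8
3   197012    AK  ALASKA    STATE   MUMPS     0                    0  1970    12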
I am getting an HTML table per day, so if I search for 20 days it brings me 20 tables, and I want to add all 20 tables together into one table so I can verify the data across the time series.
I have tried the merge and add functions of pandas, but they just add the values as strings.
Table one
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
Table two
[['\xa0', 'All Issues', 'Investment Grade', 'High Yield', 'Convertible'],
['Total Issues Traded', '8039', '5456', '2386', '197'],
['Advances', '3834', '2671', '1075', '88'],
['Declines', '3668', '2580', '994', '94'],
['Unchanged', '163', '54', '99', '10'],
['52 Week High', '305', '100', '193', '12'],
['52 Week Low', '152', '83', '63', '6'],
['Dollar Volume*', '27568', '17000', '9299', '1269']]
My code, which adds the values as strings:
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
df = pd.DataFrame(tab_data)
df2 = pd.DataFrame(tab_data)
df3 = df.add(df2,fill_value=0)
df
If you want to convert the numeric cells into integers, you would need to do that explicitly, as follows:
tab_data = [[int(item.text) if item.text.isdigit() else item.text
             for item in row_data.select("th,td")]
            for row_data in tables.select("tr")]
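Once the numeric cells are integers, add will sum them. One way (a sketch) is to make the label column the index first, so that only the numeric columns take part in the addition:
import pandas as pd
header, rows = tab_data[0], tab_data[1:]
df = pd.DataFrame(rows, columns=header).set_index(header[0])
# in practice df2 would be built from a different day's table
df2 = pd.DataFrame(rows, columns=header).set_index(header[0])
total = df.add(df2, fill_value=0)   # e.g. 'Advances' becomes 3834 + 3834 = 7668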
Hope it helps.
The way you are converting the data frame treats all values as text.
There are two options here:
1. Explicitly convert the strings to the data type you want using astype.
2. Use read_html to create data frames from HTML tables, which also tries to do the data type conversion (see the sketch below).
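A minimal sketch of the read_html option (html_text and html_text_day2 are placeholder names for two days' page source, not from the original post):
import pandas as pd
# read_html parses every <table> in the markup and infers numeric dtypes
df = pd.read_html(html_text, index_col=0)[0]
df2 = pd.read_html(html_text_day2, index_col=0)[0]
total = df.add(df2, fill_value=0)   # numeric cells are summed, not concatenated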