Question about combining pandas dataframe

I have two dataframes loaded from .csv files, and I am merging them on a column name they share, "NAME". I then want to show the difference between two of the columns in a new column. However, I get this error:
Traceback (most recent call last):
File "C:\Users\nhoss\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\indexes\base.py", line 2891, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '\ufeff2010'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\nhoss\OneDrive\Desktop\Senior_Project\responserate.py", line 22, in <module>
combinedresponse['DIFFERENCE'] = combinedresponse['\ufeff2010'] - combinedresponse['2000']
File "C:\Users\nhoss\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\frame.py", line 2902, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\nhoss\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\pandas\core\indexes\base.py", line 2893, in get_loc
raise KeyError(key) from err
KeyError: '\ufeff2010'
[Finished in 0.933s]
Here is my code:
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import string
response2000 = pd.read_csv(r'C:\Users\nhoss\OneDrive\Desktop\Senior_Project\2000ResponseRates.csv', skiprows=0)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
response2010 = pd.read_csv(r'C:\Users\nhoss\OneDrive\Desktop\Senior_Project\responserate2010.csv', skiprows=0 )
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)
combinedresponse = response2000.merge(response2010, on='NAME', how='inner')
combinedresponse['DIFFERENCE'] = combinedresponse['2010'] - combinedresponse['2000']
print(combinedresponse)
The CSV Files:
responserate2010.csv
2010,NAME,STATE,COUNTY_ID
52,Allegany County,36,3
64,Bronx County,36,5
68,Broome County,36,7
57,Cattaraugus County,36,9
64,Cayuga County,36,11
61,Chautauqua County,36,13
71,Chemung County,36,15
58,Chenango County,36,17
62,Clinton County,36,19
50,Columbia County,36,21
67,Cortland County,36,23
50,Delaware County,36,25
66,Dutchess County,36,27
70,Erie County,36,29
63,Fulton County,36,35
52,Essex County,36,31
59,Franklin County,36,33
2000ResponseRates.csv:
SS,CCC,NAME,2000
36,001,Albany County,70
36,003005,Allegany County,60
36,005,Bronx County,56
36,007,Broome County,72
36,009,Cattaraugus County,64
36,011,Cayuga County,60
36,013,Chautauqua County,66
36,015,Chemung County,75
36,017,Chenango County,65
36,019,Clinton County,68
36,021,Columbia County,62
36,023,Cortland County,64
36,025,Delaware County,53
36,027,Dutchess County,68
36,029,Erie County,74
36,031,Essex County,58
36,033,Franklin County,67

Please try as below:
combinedresponse = pd.merge(response2000, response2010, on="NAME", how="inner")
combinedresponse['DIFFERENCE'] = combinedresponse['2010'] - combinedresponse['2000']
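The '\ufeff' in the KeyError is the Unicode byte order mark (U+FEFF), which suggests the first header of responserate2010.csv may not be exactly '2010'. Printing list(response2010.columns) shows what pandas actually read. A minimal sketch, assuming a stray BOM in the header is the cause; the two small frames below are stand-ins built by hand, not the real files:

```python
import pandas as pd

# Stand-in for response2010 whose first header carries a stray BOM character.
response2010 = pd.DataFrame(
    {"\ufeff2010": [52, 64], "NAME": ["Allegany County", "Bronx County"]}
)
# Stand-in for response2000 with a clean header.
response2000 = pd.DataFrame(
    {"NAME": ["Allegany County", "Bronx County"], "2000": [60, 56]}
)

# Remove any BOM characters and surrounding whitespace from the column labels.
response2010.columns = (
    response2010.columns.str.replace("\ufeff", "", regex=False).str.strip()
)

combinedresponse = response2000.merge(response2010, on="NAME", how="inner")
combinedresponse["DIFFERENCE"] = combinedresponse["2010"] - combinedresponse["2000"]
print(combinedresponse)
```

Reading the file with pd.read_csv(path, encoding='utf-8-sig') is another option that is often suggested for files saved with a BOM.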

Related

Creating a deltatime array in Python

I am new to Python, so I decided to start a project to improve my skills, and I chose this one on GeeksForGeeks. I am having difficulty appending a deltaTime variable to an array. I tried a numpy array as well, but it did not work.
My code:
from matplotlib.ticker import Formatter
import pandas as pd
import matplotlib.pyplot as plt
import datetime
import numpy as np
from pandas._libs.tslibs import timestamps
birdData = pd.read_csv("bird_tracking.csv")
birdNames = pd.unique(birdData.bird_name)
# Getting the time interval
timestamps = []
for i in range(len(birdData)):
    timestamps.append(datetime.datetime.strptime(birdData.date_time.iloc[i][:-3], "%Y-%m-%d %H:%M:%S"))
birdData["timestamps"] = pd.Series(timestamps, index = birdData.index)
plt.figure(figsize=(7, 7))
for name in birdNames:
    times = birdData.timestamps[birdData.bird_name == name]
    elapsedTime = []
    for time in times:
        x = time-times[0]
        #print(x)
        elapsedTime.append(x)
    plt.plot(np.array(elapsedTime)/datetime.timedelta(days=1), label = name)
plt.xlabel(" Observation ")
plt.ylabel(" Elapsed time (days) ")
plt.show()
The error I am getting:
Traceback (most recent call last):
File "C:\Users\User\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3080, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 1625, in pandas._libs.hashtable.Int64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 1632, in pandas._libs.hashtable.Int64HashTable.get_item
KeyError: 0
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "c:\Users\User\Documents\GitHub\TrackingBirdMigration\dataTime.py", line 24, in <module>
x = time-times[0]
File "C:\Users\User\anaconda3\lib\site-packages\pandas\core\series.py", line 853, in __getitem__
return self._get_value(key)
File "C:\Users\User\anaconda3\lib\site-packages\pandas\core\series.py", line 961, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\User\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3082, in get_loc
raise KeyError(key) from err
KeyError: 0
[Done] exited with code=1 in 8.313 seconds
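A likely cause of KeyError: 0 is that times[0] looks up the row label 0, not the first element. After filtering by bird name, the Series keeps its original row labels, so only the bird whose rows start at label 0 has that key. A minimal sketch of positional access with .iloc, using a small made-up frame in place of bird_tracking.csv (the names and dates are invented for illustration):

```python
import datetime
import pandas as pd

# Small stand-in for bird_tracking.csv; values are made up.
birdData = pd.DataFrame({
    "bird_name": ["Eric", "Eric", "Nico", "Nico"],
    "timestamps": pd.to_datetime([
        "2013-08-15 00:00:00", "2013-08-16 12:00:00",
        "2013-08-15 00:00:00", "2013-08-17 00:00:00",
    ]),
})

times = birdData.timestamps[birdData.bird_name == "Nico"]
# times keeps the original row labels (2 and 3), so times[0] would raise KeyError: 0.
first = times.iloc[0]  # first element by position, whatever its label

# Subtracting a Timestamp from the Series gives timedeltas; dividing converts to days.
elapsed_days = (times - first) / datetime.timedelta(days=1)
print(elapsed_days)
```

Working on the whole Series at once also removes the need for the inner loop and the elapsedTime list.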

Key error message when calculating variables using pandas and yfinance

I am trying to calculate some variables from the df['Close'] column of yfinance data, but I am getting an error that I have not seen before. Here is the code:
import os
import pandas as pd
import plotly.graph_objects as go
symbols = 'AAPL'
for filename in os.listdir('datasets/'):
    #print(filename)
    symbol = filename.split('.')[0]
    #print(symbol)
    df = pd.read_csv('datasets/{}'.format(filename))
    if df.empty:
        continue
    df['20_sma'] = df['Close'].rolling(window=20).mean()
    df['stddev'] = df['Close'].rolling(window=20).std()
    # Upper band is the average plus two standard deviations; lower band is minus.
    df['upperband'] = df['20_sma'] + (2* df['stddev'])
    df['lowerband'] = df['20_sma'] - (2* df['stddev'])
    if symbol in symbols:
        print(df)
And here is the error message:
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Close'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/Kit/Documents/TTM_squeezer/squeeze.py", line 16, in <module>
df['20_sma'] = df['Close'].rolling(window=20).mean()
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/frame.py", line 2906, in __getitem__
indexer = self.columns.get_loc(key)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
raise KeyError(key) from err
KeyError: 'Close'
It seems the 'Close' column is causing this error, but I can't figure out why.
Many thanks

It turns out there was an error in the process that saved the local file.
Case closed, thanks all.
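For anyone hitting the same KeyError, one way to catch a badly saved file early is to check that the column exists before using it. A minimal sketch; add_bands is a hypothetical helper name, and the two tiny frames are made up to stand in for a good and a bad file:

```python
import pandas as pd

def add_bands(df, window=20):
    """Return a copy of df with band columns, or None if 'Close' is missing."""
    if "Close" not in df.columns:
        return None
    out = df.copy()
    out["20_sma"] = out["Close"].rolling(window=window).mean()
    out["stddev"] = out["Close"].rolling(window=window).std()
    out["upperband"] = out["20_sma"] + 2 * out["stddev"]
    out["lowerband"] = out["20_sma"] - 2 * out["stddev"]
    return out

good = pd.DataFrame({"Close": [1.0, 2.0, 3.0, 4.0]})
bad = pd.DataFrame({"close ": [1.0, 2.0]})  # wrong header, as in a badly saved file

print(add_bands(bad))             # None: the file can be skipped or reported
print(add_bands(good, window=2))  # small window only so the toy data produces values
```

In the loop from the question, a None result could be handled with continue, in the same way as the df.empty check.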

KeyError: 1.0 after renaming columns of dataframe

Consider the following script:
import pandas as pd
import numpy as np
import math
A = pd.DataFrame(np.array([[1,2,3,4],[5,6,7,8]]))
Floor1 = math.floor(A.min()[1]/2)*2
names = np.array([ 0. , 0.635, 1.27 , 1.905])
A.columns = names
Floor2 = math.floor(A.min()[1]/2)*2
Floor1 is computed correctly, but Floor2, which uses the same dataframe with renamed columns, is not. I get a KeyError:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2646, in get_loc
return self._engine.get_loc(key)
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 385, in pandas._libs.hashtable.Float64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 392, in pandas._libs.hashtable.Float64HashTable.get_item
KeyError: 1.0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Desktop\Python\untitled0.py", line 13, in <module>
Floor2 = math.floor(A.min()[1]/2)*2
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 871, in __getitem__
result = self.index.get_value(self, key)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\numeric.py", line 449, in get_value
loc = self.get_loc(k)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\numeric.py", line 508, in get_loc
return super().get_loc(key, method=method, tolerance=tolerance)
File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 2648, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 138, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 385, in pandas._libs.hashtable.Float64HashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 392, in pandas._libs.hashtable.Float64HashTable.get_item
KeyError: 1.0
I know there is a similar question: After rename column get keyerror
But I didn't really understand the answer or, more importantly, how to solve it.

A.min() returns a Series indexed by the column labels. Before renaming, list(A.columns) gives [0, 1, 2, 3], so you can index the result with the key 1. After renaming, you can no longer index with the key 1 because the column names have changed.
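If you still want the second column regardless of its name, a positional lookup with .iloc is one option, and a lookup with the new label is another. A minimal sketch (the names col_min, floor_by_position and floor_by_label are only for illustration):

```python
import math
import numpy as np
import pandas as pd

A = pd.DataFrame(np.array([[1, 2, 3, 4], [5, 6, 7, 8]]))
A.columns = np.array([0.0, 0.635, 1.27, 1.905])

col_min = A.min()  # Series of column minima, indexed by the new float labels

# Second column by position, independent of the column names.
floor_by_position = math.floor(col_min.iloc[1] / 2) * 2

# Same column by its new label.
floor_by_label = math.floor(col_min[0.635] / 2) * 2

print(floor_by_position, floor_by_label)
```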

A.min() finds the minimum along axis=0 by default, that is, the minimum of each column, so the result is indexed by the column names.
After changing the column names, you cannot access the label 1 because no column has that name.
If your intention is to find the minimum of each row, you can use A.min(axis=1).
You can write the code like this:
import pandas as pd
import numpy as np
import math
A = pd.DataFrame(np.array([[1,2,3,4],[5,6,7,8]]))
Floor1 = math.floor(A.min(axis=1)[1]/2)*2
names = np.array([ 0. , 0.635, 1.27 , 1.905])
A.columns = names
Floor2 = math.floor(A.min(axis=1)[1]/2)*2
Thank you

Pandas datareader failure

I want to save price data for every S&P 500 stock to a folder in CSV format.
Scanning the S&P 500 list works, but for some tickers the 'Date' column seems to be missing, either because the stock does not exist or because it has no data for the requested period. I tried changing the start and end dates, with no effect. In an earlier post I was told to filter out those cases with an exception handler, but I am new to Python and could not work out how. Can someone help me?
This is the error:
/home/mu351i/PycharmProjects/untitled/venv/bin/python /home/mu351i/PycharmProjects/untitled/get_sp500_beautifulsoup_intro.py
Traceback (most recent call last):
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Date'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mu351i/PycharmProjects/untitled/get_sp500_beautifulsoup_intro.py", line 44, in get_data_from_yahoo
df = web.DataReader (ticker, 'yahoo', start, end)
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas/util/_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas_datareader/data.py", line 387, in DataReader
session=session,
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas_datareader/base.py", line 251, in read
df = self._read_one_data(self.url, params=self._get_params(self.symbols))
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas_datareader/yahoo/daily.py", line 165, in _read_one_data
prices["Date"] = to_datetime(to_datetime(prices["Date"], unit="s").dt.date)
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas/core/frame.py", line 2995, in getitem
indexer = self.columns.get_loc(key)
File "/home/mu351i/PycharmProjects/untitled/venv/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2899, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Date'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/mu351i/PycharmProjects/untitled/get_sp500_beautifulsoup_intro.py", line 57, in
get_data_from_yahoo()
File "/home/mu351i/PycharmProjects/untitled/get_sp500_beautifulsoup_intro.py", line 48, in get_data_from_yahoo
except RemoteDataError:
NameError: name 'RemoteDataError' is not defined
Process finished with exit code 1
How would you avoid this by changing the code below?
import datetime as dt
import os
import pickle
import bs4 as bs
import pandas_datareader.data as web
import requests
def safe_sp500_tickers():
    resp = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text,'lxml')
    table = soup.find('table',{'class':'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker=row.findAll('td')[0].text.strip()
        tickers.append(ticker)
    with open('sp500tickers.pickle','wb') as f:
        pickle.dump(tickers,f)
    return tickers
safe_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers=safe_sp500_tickers()
    else:
        with open('sp500tickers.pickle', 'rb') as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')
    start = dt.datetime(1999,1,1)
    end = dt.datetime(2019,12,19)
    for ticker in tickers:
        try:
            if not os.path.exists ('stock_dfs/{}.csv'.format (ticker)):
                df = web.DataReader (ticker, 'yahoo', start, end)
                df.to_csv ('stock_dfs/{}.csv'.format (ticker))
            else:
                print ("Ticker from {} already available".format (ticker))
        except RemoteDataError:
            print ("No information for ticker '%s'" % ticker)
            continue
        except KeyError:
            print("no Date for Ticker: " +ticker )
            continue
get_data_from_yahoo()
A commenter asked for a data sample; this is data from TSLA.csv:
Date,High,Low,Open,Close,Volume,Adj Close
2010-06-29,25.0,17.540000915527344,19.0,23.889999389648438,18766300,23.889999389648438
2010-06-30,30.420000076293945,23.299999237060547,25.790000915527344,23.829999923706055,17187100,23.829999923706055
2010-07-01,25.920000076293945,20.270000457763672,25.0,21.959999084472656,8218800,21.959999084472656
2010-07-02,23.100000381469727,18.709999084472656,23.0,19.200000762939453,5139800,19.200000762939453
2010-07-06,20.0,15.829999923706055,20.0,16.110000610351562,6866900,16.110000610351562
2010-07-07,16.6299991607666,14.979999542236328,16.399999618530273,15.800000190734863,6921700,15.800000190734863
2010-07-08,17.520000457763672,15.569999694824219,16.139999389648438,17.459999084472656,7711400,17.459999084472656
2010-07-09,17.899999618530273,16.549999237060547,17.579999923706055,17.399999618530273,4050600,17.399999618530273
2010-07-12,18.06999969482422,17.0,17.950000762939453,17.049999237060547,2202500,17.049999237060547
2010-07-13,18.639999389648438,16.899999618530273,17.389999389648438,18.139999389648438,2680100,18.139999389648438
2010-07-14,20.149999618530273,17.760000228881836,17.940000534057617,19.84000015258789,4195200,19.84000015258789
2010-07-15,21.5,19.0,19.940000534057617,19.889999389648438,3739800,19.889999389648438
2010-07-16,21.299999237060547,20.049999237060547,20.700000762939453,20.639999389648438,2621300,20.639999389648438
Please provide constructive feedback because I'm new here.
Thanks :)

You are missing an import.
Add the following import at the top of your script:
from pandas_datareader._utils import RemoteDataError
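With the import in place, the except clauses in the loop can skip a failing ticker and move on to the next one. A minimal sketch of that skip-and-continue pattern; download_all and fake_fetch are hypothetical names, and fake_fetch stands in for web.DataReader so the example runs without network access:

```python
def download_all(tickers, fetch):
    """Call fetch(ticker) for each ticker; return the results and the skipped tickers."""
    results, skipped = {}, []
    for ticker in tickers:
        try:
            results[ticker] = fetch(ticker)
        except KeyError:
            # In the real script, RemoteDataError would be caught here as well.
            skipped.append(ticker)
    return results, skipped

def fake_fetch(ticker):
    # Hypothetical stand-in for web.DataReader: fails for one made-up ticker.
    if ticker == "XXXX":
        raise KeyError("Date")
    return [1.0, 2.0]

results, skipped = download_all(["TSLA", "XXXX"], fake_fetch)
print(skipped)
```

Collecting the skipped tickers in a list makes it easy to report or retry them after the loop finishes.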

import pandas as pd
df = pd.read_html(
    "https://en.wikipedia.org/wiki/List_of_S%26P_500_companies")[0]
sort = pd.DataFrame(df).sort_values(by=['Date first added'])
sort['Date first added'] = pd.to_datetime(sort['Date first added'])
start_date = '1-1-1999'
end_date = '11-12-2019'
mask = (sort['Date first added'] > start_date) & (
    sort['Date first added'] <= end_date)
sort = sort.loc[mask]
pd.DataFrame(sort).to_csv('result.csv', index=False)

Reading a file with pandas and use correlation coefficients on two columns

I have a file like the following, with no header:
0.000000 0.330001 0.280120
1.000000 0.355590 0.298581
2.000000 0.305945 0.280231
I want to read this file into a pandas dataframe and compute the correlation coefficient between the second and third columns.
I am trying the following:
import pandas as pd
df = pd.read_csv('COLVAR_hbondnohead', header=None)
df['1'].corr(df['2'])
It fails with a huge error message. Am I not treating the columns properly? Any suggestions or hints?
Error message:
Traceback (most recent call last):
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
TypeError: an integer is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sbhakat/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3063, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 164, in pandas._libs.index.IndexEngine.get_loc
KeyError: '1'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 958, in pandas._libs.hashtable.Int64HashTable.get_item
TypeError: an integer is required
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/sbhakat/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2685, in __getitem__
return self._getitem_column(key)
File "/home/sbhakat/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2692, in _getitem_column
return self._get_item_cache(key)
File "/home/sbhakat/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 2486, in _get_item_cache
values = self._data.get(item)
File "/home/sbhakat/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/home/sbhakat/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3065, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 164, in pandas._libs.index.IndexEngine.get_loc
KeyError: '1'

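You have to specify the separator, which is a space, when reading the file. Then access the columns by their integer labels: with header=None, pandas labels the columns 0, 1, 2, so the key is the integer 1, not the string '1'. The code below should work.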
df = pd.read_csv('test.txt', sep=' ', header=None)
df[1].corr(df[2])
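If the columns in the real file are separated by a varying number of spaces, a regular-expression separator may be more forgiving than a single space. A minimal sketch, using an in-memory copy of the three sample rows instead of the actual COLVAR_hbondnohead file:

```python
import io
import pandas as pd

# In-memory stand-in for the file, using the three sample rows from the question.
sample = io.StringIO(
    "0.000000 0.330001 0.280120\n"
    "1.000000 0.355590 0.298581\n"
    "2.000000 0.305945 0.280231\n"
)

# sep=r"\s+" splits on any run of whitespace; header=None labels the columns 0, 1, 2.
df = pd.read_csv(sample, sep=r"\s+", header=None)

# Pearson correlation between the second and third columns (integer labels 1 and 2).
r = df[1].corr(df[2])
print(r)
```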

Roy, what is the file extension? Is it .csv? If it is, you should add it to the end of the file name, like pd.read_csv('COLVAR_hbondnohead.csv', header=None)

You don't have columns named '1' and '2': with header=None the columns are labelled with the integers 0, 1 and 2. If you want string names, rename the columns first.
import pandas as pd
df = pd.read_csv('COLVAR_hbondnohead', sep=' ', header=None)
df1 = df.rename(columns={0: '1', 1: '2', 2: '3'})
then
df1['2'].corr(df1['3'])
