Python Decompressing gzip csv in pandas csv reader - python

The following code works in Python3 but fails in Python2
r = requests.get("http://api.bitcoincharts.com/v1/csv/coinbaseUSD.csv.gz", stream=True)
decompressed_file = gzip.GzipFile(fileobj=r.raw)
data = pd.read_csv(decompressed_file, sep=',')
data.columns = ["timestamp", "price" , "volume"] # set df col headers
return data
The error I get in Python2 is the following:
TypeError: 'int' object has no attribute '__getitem__'
The error is on the line where I set data equal to pd.read_csv(...)
Seems to be a pandas error to me
Stacktrace:
Traceback (most recent call last):
File "fetch.py", line 51, in <module>
print(f.get_historical())
File "fetch.py", line 36, in get_historical
data = pd.read_csv(f, sep=',')
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 709, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 449, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 818, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1049, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1695, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 562, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 760, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 965, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2197, in pandas._libs.parsers.raise_parser_error
io.UnsupportedOperation: seek

The issue from the traceback you posted is related to the fact that the Response object's raw attribute is a file-like object that does not support the .seek method that typical file objects support. However, when ingesting the file object with pd.read_csv, pandas (in python2) seems to be making use of the seek method of the provided file object.
You can confirm that the returned response's raw data is not seekable by calling r.raw.seekable(), which should normally return False.
The way to circumvent this issue may be to wrap the returned data into an io.BytesIO object as follows:
import gzip
import io
import pandas as pd
import requests
# file_url = "http://api.bitcoincharts.com/v1/csv/coinbaseUSD.csv.gz"
file_url = "http://api.bitcoincharts.com/v1/csv/aqoinEUR.csv.gz"
r = requests.get(file_url, stream=True)
dfile = gzip.GzipFile(fileobj=io.BytesIO(r.raw.read()))
data = pd.read_csv(dfile, sep=',')
print(data)
0 1 2
0 1314964052 2.60 0.4
1 1316277154 3.75 0.5
2 1316300526 4.00 4.0
3 1316300612 3.80 1.0
4 1316300622 3.75 1.5
As you can see, I used a smaller file from the directory of files available. You can switch this to your desired file.
In any case, io.BytesIO(r.raw.read()) should be seekable, and therefore should help avoid the io.UnsupportedOperation exception you are encountering.
As for the TypeError exception, it is inexistent in this snippet of code.
I hope this helps.

Related

pandas read_csv error when loading a plain CSV file

I am having this very weird error with python pandas:
import pandas as pd
df = pd.read_csv('C:\Temp\test.csv', index_col=None, comment='#', sep=',')
The test.csv is a very simple CSV file created in Notepad:
aaa,bbb,date
hhhhh,wws,20220701
Now I get the error:
File "C:\test\untitled0.py", line 10, in <module>
df = pd.read_csv('C:\temp\test.csv', index_col=None, comment='#', sep=',')
File "C:\...\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
return func(*args, **kwargs)
File "C:\...\lib\site-packages\pandas\io\parsers\readers.py", line 586, in read_csv
return _read(filepath_or_buffer, kwds)
File "C:\...\lib\site-packages\pandas\io\parsers\readers.py", line 482, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\...\lib\site-packages\pandas\io\parsers\readers.py", line 811, in __init__
self._engine = self._make_engine(self.engine)
File "C:\...\lib\site-packages\pandas\io\parsers\readers.py", line 1040, in _make_engine
return mapping[engine](self.f, **self.options) # type: ignore[call-arg]
File "C:\...\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 51, in __init__
self._open_handles(src, kwds)
File "C:\...\lib\site-packages\pandas\io\parsers\base_parser.py", line 229, in _open_handles
errors=kwds.get("encoding_errors", "strict"),
File "C:\...\lib\site-packages\pandas\io\common.py", line 707, in get_handle
newline="",
OSError: [Errno 22] Invalid argument: 'C:\temp\test.csv'
I also tried to use Excel to export a CSV file, and get the same error.
Does anyone know what goes wrong?
In a python string, the backslash in '\t' is an escape character which causes those two characters ( \ followed by t) to mean tab. You can get around this using raw strings by prefacing the opening quote with the letter 'r':
r'C:\Temp\test.csv'

Lineterminator giving extra row using pandas

I think I have a problem using line terminator when reading data from a text file using pandas. It gives an unneeded row at the end of the column of NaN. My data has 12 rows plus a header but using the code provided it produces 13rows plus a header. How can I get it to output the correct data.
Data was written to text file using
with open(filepath, 'a', newline='') as filey:
csv_writer = csv.writer(filey, delimiter = '\t', lineterminator='\r\n')
Code
import pandas as pd
path_to_results = r"C:\\...\\Desktop\\Results\\_results.txt"
data = pd.read_csv(path_to_results,sep='\t',lineterminator='\r')
data = pd.DataFrame(data)
#print(data.head())
print(data["Vi_V"])
print(data["mass_g"])
using Lineterminator \r :
Name: Vi_V, dtype: object
0 0.24
1 0.47
...
11 3.66
12 NaN
using Lineterminator \r\n :
Traceback (most recent call last):
File "c:/Users.../g_00.py", line 5, in <module>
data = pd.read_csv(path_to_results,sep='\t',lineterminator='\r\n')
File "C:...\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:...\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 448, in _read
parser = TextFileReader(fp_or_buf, **kwds)
File "C:...\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 880, in __init__
self._make_engine(self.engine)
File "C:...\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 1114, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:...\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parsers.py", line 1891, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas\_libs\parsers.pyx", line 395, in pandas._libs.parsers.TextReader.__cinit__
ValueError: Only length-1 line terminators supported
dee cue gave the answer in a comment:
"Have you tried leaving out the line terminator parameter? Try to see if python on Windows automatically reads new line as \r\n. – dee cue Oct 12 at 10:01"
It worked.

How can I concatenate 3 large tweet dataframe (csv) files each having approximately 5M tweets?

I have three csv dataframes of tweets, each ~5M tweets. The following code for concatenating them exists with low memory error. My machine has 32GB memory. How can I assign more memory for this task in pandas?
df1 = pd.read_csv('tweets.csv')
df2 = pd.read_csv('tweets2.csv')
df3 = pd.read_csv('tweets3.csv')
frames = [df1, df2, df3]
result = pd.concat(frames)
result.to_csv('tweets_combined.csv')
The error is:
$ python concantenate_dataframes.py
sys:1: DtypeWarning: Columns (0,1,2,3,4,5,6,8,9,10,11,12,13,14,19,22,23,24) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "concantenate_dataframes.py", line 19, in <module>
df2 = pd.read_csv('tweets2.csv')
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
UPDATE: tried the suggestions in the answer and still get error
$ python concantenate_dataframes.py
Traceback (most recent call last):
File "concantenate_dataframes.py", line 18, in <module>
df1 = pd.read_csv('tweets.csv', low_memory=False, error_bad_lines=False)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 943, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I am running the code on Ubuntu 20.04 OS
I think this is problem with malformed data (some data not structure properly in tweets2.csv) for that you can use error_bad_lines=False and try to chnage engine from c to python like engine='python'
ex : df2 = pd.read_csv('tweets2.csv', error_bad_lines=False)
or
ex : df2 = pd.read_csv('tweets2.csv', engine='python')
or maybe
ex : df2 = pd.read_csv('tweets2.csv', engine='python', error_bad_lines=False)
but I recommand to identify those revord and repair that.
And also if you want hacky way to do this than use
1) https://askubuntu.com/questions/941480/how-to-merge-multiple-files-of-the-same-format-into-a-single-file
2) https://askubuntu.com/questions/656039/concatenate-multiple-files-without-headerenter link description here
Specify dtype option on import or set low_memory=False

pandas cannot converge

I get an error that a file does not exist while I have the file there in the folder, would you please tell me where I am making a mistake?
pd.DataFrame.from_csv
I am getting an error shown below.
Traceback (most recent call last):
File "main.py", line 194, in <module>
start_path+end_res)
File "/Users/admin/Desktop/script/mergeT.py", line 5, in merge
df_peak = pd.DataFrame.from_csv(peak_score, index_col = False, sep='\t')
File "/Library/Python/2.7/site-packages/pandas/core/frame.py", line 1231, in from_csv
infer_datetime_format=infer_datetime_format)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 645, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 388, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 729, in __init__
self._make_engine(self.engine)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 922, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Library/Python/2.7/site-packages/pandas/io/parsers.py", line 1389, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas/parser.pyx", line 373, in pandas.parser.TextReader.__cinit__ (pandas/parser.c:4175)
File "pandas/parser.pyx", line 667, in pandas.parse**strong text**r.TextReader._setup_parser_source (pandas/parser.c:8440)
IOError: File results\scoring\fed\score_peak.txt does not exist
I have tried to set a path to the exact file
for example
As per documentation of pandas 0.19.1 pandas.DataFrame.from_csv does not support index_col = False. Try to use pandas.read_csv instead (with the same parameters). Also make sure you are using the up to date version of pandas.
See if this works:
import pandas as pd
def merge(peak_score, profile_score, res_file):
df_peak = pd.read_csv(peak_score, index_col = False, sep='\t')
df_profile = pd.read_csv(profile_score, index_col = False, sep='\t')
result = pd.concat([df_peak, df_profile], axis=1)
print result.head()
test = []
for a,b in zip(result['prot_a_p'],result['prot_b_p']):
if a == b:
test.append(1)
else:
test.append(0)
result['test']=test
result = result[result['test']==0]
del result['test']
result = result.fillna(0)
result.to_csv(res_file)
if __name__ == '__main__':
pass
Regarding the path issue when changing from Windows to OS X:
In all flavours of Unix, paths are written with slashes /, while in Windows backslashes \ are used. Since OS X is a descendant of Unix, as other users have correctly pointed out, when you change there from Windows you need to adapt your paths.

Error from download HF data from netfonds

I am trying to download HF data from netfonds website by directly using Dr. Yves Hilpisch's sample code, however, I ran into error message such as
ValueError: No columns to parse from file
— can anyone help with this? Thanks a lot.
Here is the sample code:
import numpy as np
import pandas as pd
import datetime as dt
from urllib import urlretrieve
url1='http://hopey.netfonds.no/posdump.php?'
url2='date=%s%s%s&paper=AAPL.O&csv_format=csv'
url=url1+url2
year='2014'
month='09'
days=['23','24']
AAPL=pd.DataFrame()
for day in days:
AAPL=AAPL.append(pd.read_csv(url % (year,month,day),
index_col=0, header=0, parse_dates=True))
AAPL.columns=['bid','bdepth','bdeptht','offer','odepth','odeptht']
AAPL.info()
The error message is like this :
Traceback (most recent call last):
File "<ipython-input-87-27cc48982059>", line 18, in <module>
index_col=0, header=0, parse_dates=True))
File "C:\Users\jinj\AppData\Local\Continuum\Miniconda\lib\site-packages\pandas\io\parsers.py", line 474, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\jinj\AppData\Local\Continuum\Miniconda\lib\site-packages\pandas\io\parsers.py", line 250, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "C:\Users\jinj\AppData\Local\Continuum\Miniconda\lib\site-packages\pandas\io\parsers.py", line 566, in __init__
self._make_engine(self.engine)
File "C:\Users\jinj\AppData\Local\Continuum\Miniconda\lib\site-packages\pandas\io\parsers.py", line 705, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "C:\Users\jinj\AppData\Local\Continuum\Miniconda\lib\site-packages\pandas\io\parsers.py", line 1072, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "pandas\parser.pyx", line 512, in pandas.parser.TextReader.__cinit__ (pandas\parser.c:4814)
ValueError: No columns to parse from file

Categories

Resources