Unable to Read a large CSV into Panda dataframe

Unable to Read a large CSV into Panda dataframe - python

I cant read the whole file into a dataframe so I tried breaking it into very small chunks
for chunk in pd.read_csv(r'HugeFile.txt',chunksize=1000, sep=r"\s+", header=None):
df = pd.concat(chunk)
print(df)
But I am still getting this error
Traceback (most recent call last):
File "C:\Data Science\Python\main.py", line 13, in <module>
pd.read_csv(r'HugeFile.txt',chunksize=1000, sep=r"\s+", header=None):
File "C:\Data Science\Python\venv\lib\site-packages\pandas\io\parsers.py", line 1034, in __next__
return self.get_chunk()
File "C:\Data Science\Python\venv\lib\site-packages\pandas\io\parsers.py", line 1084, in get_chunk
return self.read(nrows=size)
File "C:\Data Science\Python\venv\lib\site-packages\pandas\io\parsers.py", line 1057, in read
index, columns, col_dict = self._engine.read(nrows)
File "C:\Data Science\Python\venv\lib\site-packages\pandas\io\parsers.py", line 2061, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 756, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 783, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 827, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 814, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas\_libs\parsers.pyx", line 1951, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 8, saw 4
Thanks

Related

I'm getting an error while reading a csv with pandas

I'm getting this error while I'm trying to read a csv in python with pandas
df02 = pd.read_csv('PMDM Full\filename.csv', sep = '|')
Traceback (most recent call last): File "<stdin>", line 1, in <module> File
"C:\Users\dm\Google Drive\CS\GV\Tickets\Status
Check\venv\lib\site-packages\pandas\util\_decorators.py", line 311, in
wrapper File "C:\Users\dm\Google Drive\CS\GV\Tickets\Status
Check\venv\lib\site-packages\pandas\io\parsers\readers.py", line 1250,
in read index, columns, col_dict = self._engine.read(nrows) File
"C:\Users\dm\Google Drive\CS\GV\Tickets\Status
Check\venv\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py",
line 225, in read chunks = self._reader.read_low_memory(nrows)
File "pandas\_libs\parsers.pyx", line 805, in
pandas._libs.parsers.TextReader.read_low_memory File
"pandas\_libs\parsers.pyx", line 861, in
pandas._libs.parsers.TextReader._read_rows File
"pandas\_libs\parsers.pyx", line 847, in
pandas._libs.parsers.TextReader._tokenize_rows File
"pandas\_libs\parsers.pyx", line 1960, in
pandas._libs.parsers.raise_parser_error pandas.errors.ParserError:
Error tokenizing data. C error: Expected 109 fields in line 1021, saw
113
Code used:
df02 = pd.read_csv('filepath', sep = '|')
sample file

This error is occurring because line 5 of the csv contains a different number of columns than the other lines.
To read the file excluding this line you can use the following code:
df = pd.read_csv('sample.csv', sep='|', error_bad_lines=False)

How can I concatenate 3 large tweet dataframe (csv) files each having approximately 5M tweets?

I have three csv dataframes of tweets, each ~5M tweets. The following code for concatenating them exists with low memory error. My machine has 32GB memory. How can I assign more memory for this task in pandas?
df1 = pd.read_csv('tweets.csv')
df2 = pd.read_csv('tweets2.csv')
df3 = pd.read_csv('tweets3.csv')
frames = [df1, df2, df3]
result = pd.concat(frames)
result.to_csv('tweets_combined.csv')
The error is:
$ python concantenate_dataframes.py
sys:1: DtypeWarning: Columns (0,1,2,3,4,5,6,8,9,10,11,12,13,14,19,22,23,24) have mixed types.Specify dtype option on import or set low_memory=False.
Traceback (most recent call last):
File "concantenate_dataframes.py", line 19, in <module>
df2 = pd.read_csv('tweets2.csv')
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 859, in pandas._libs.parsers.TextReader.read
UPDATE: tried the suggestions in the answer and still get error
$ python concantenate_dataframes.py
Traceback (most recent call last):
File "concantenate_dataframes.py", line 18, in <module>
df1 = pd.read_csv('tweets.csv', low_memory=False, error_bad_lines=False)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 676, in parser_f
return _read(filepath_or_buffer, kwds)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 454, in _read
data = parser.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 1133, in read
ret = self._engine.read(nrows)
File "/home/mona/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py", line 2037, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 862, in pandas._libs.parsers.TextReader.read
File "pandas/_libs/parsers.pyx", line 943, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
File "pandas/_libs/parsers.pyx", line 874, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas/_libs/parsers.pyx", line 928, in pandas._libs.parsers.TextReader._read_rows
File "pandas/_libs/parsers.pyx", line 915, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2070, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
I am running the code on Ubuntu 20.04 OS

I think this is problem with malformed data (some data not structure properly in tweets2.csv) for that you can use error_bad_lines=False and try to chnage engine from c to python like engine='python'
ex : df2 = pd.read_csv('tweets2.csv', error_bad_lines=False)
or
ex : df2 = pd.read_csv('tweets2.csv', engine='python')
or maybe
ex : df2 = pd.read_csv('tweets2.csv', engine='python', error_bad_lines=False)
but I recommand to identify those revord and repair that.
And also if you want hacky way to do this than use
1) https://askubuntu.com/questions/941480/how-to-merge-multiple-files-of-the-same-format-into-a-single-file
2) https://askubuntu.com/questions/656039/concatenate-multiple-files-without-headerenter link description here

Specify dtype option on import or set low_memory=False

Getting the below error while trying to read a CSV file in python

File "<ipython-input-10-9cc4e896b568>", line 1, in <module>
pd.read_csv('temp.csv')
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 401, in _read
data = parser.read()
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)
File "pandas\parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)
File "pandas\parser.pyx", line 947, in pandas.parser.TextReader._read_rows (pandas\parser.c:11728)
File "pandas\parser.pyx", line 1049, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:13162)
File "pandas\parser.pyx", line 1108, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:14116)
File "pandas\parser.pyx", line 1206, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:16172)
File "pandas\parser.pyx", line 1222, in pandas.parser.TextReader._string_convert (pandas\parser.c:16400)
File "pandas\parser.pyx", line 1458, in pandas.parser._string_box_utf8 (pandas\parser.c:22072)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
pd.read_csv('temp.csv')
Traceback (most recent call last):
File "<ipython-input-11-9cc4e896b568>", line 1, in <module>
pd.read_csv('temp.csv')
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 401, in _read
data = parser.read()
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)
File "pandas\parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)
File "pandas\parser.pyx", line 947, in pandas.parser.TextReader._read_rows (pandas\parser.c:11728)
File "pandas\parser.pyx", line 1049, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:13162)
File "pandas\parser.pyx", line 1108, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:14116)
File "pandas\parser.pyx", line 1206, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:16172)
File "pandas\parser.pyx", line 1222, in pandas.parser.TextReader._string_convert (pandas\parser.c:16400)
File "pandas\parser.pyx", line 1458, in pandas.parser._string_box_utf8 (pandas\parser.c:22072)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte
import sys
sys.setdefaultencoding("ISO-8859-1")
Traceback (most recent call last):
File "<ipython-input-12-b416bfca896f>", line 2, in <module>
sys.setdefaultencoding("ISO-8859-1")
AttributeError: module 'sys' has no attribute 'setdefaultencoding'
pd.read_csv('temp.csv')
Traceback (most recent call last):
File "<ipython-input-13-9cc4e896b568>", line 1, in <module>
pd.read_csv('temp.csv')
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 401, in _read
data = parser.read()
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "C:\Users\nivetha.n\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)
File "pandas\parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)
File "pandas\parser.pyx", line 947, in pandas.parser.TextReader._read_rows (pandas\parser.c:11728)
File "pandas\parser.pyx", line 1049, in pandas.parser.TextReader._convert_column_data (pandas\parser.c:13162)
File "pandas\parser.pyx", line 1108, in pandas.parser.TextReader._convert_tokens (pandas\parser.c:14116)
File "pandas\parser.pyx", line 1206, in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:16172)
File "pandas\parser.pyx", line 1222, in pandas.parser.TextReader._string_convert (pandas\parser.c:16400)
File "pandas\parser.pyx", line 1458, in pandas.parser._string_box_utf8 (pandas\parser.c:22072)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

use pd.read_csv('temp.csv', encoding='latin-1')

Pandas Reading csv - Error

I have a problem with reading a CSV file with pandas (I know there are other topics but I could not solve the problem). My code is:
import pandas as pd
f = pd.read_csv('1803Ltem.csv',sep='\t', dtype=object,)
The error I get is:
Traceback (most recent call last):
File "/username/username/Documents/first.py", line 362, in <module>
fuck = pd.read_csv('1803Ltem.csv',sep='\t', dtype=object,)
File "/Users/username/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/username/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py", line 325, in _read
return parser.read()
File "/Users/username/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py", line 815, in read
ret = self._engine.read(nrows)
File "/Users/username/anaconda/lib/python3.5/site-packages/pandas/io/parsers.py", line 1314, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 4587, saw 5
What am I doing wrong?

Try adding the argument error_bad_lines=False to read_csv

The following worked for me by adding:
import pandas as pd
f = pd.read_csv('1803Ltem.csv',sep='\t', dtype=object,error_bad_lines=False)

Python-Reportlab error: ValueError: format not resolved

when I was using python-reportlab to create a pdf document, sometimes it throws out an exception: ValueError: format not resolved talk.google.com, I wonder why this came out, and how to solve it, the full error stack is like below:
File "/usr/lib64/python2.7/threading.py", line 552, in __bootstrap_inner
self.run()
File "/usr/lib64/python2.7/threading.py", line 505, in run
self.__target(*self.__args, **self.__kwargs)
File "/usr/lib/python2.7/site-packages/tweets2pdf/tweets2pdf.py", line 42,
in generate_thread
tpdoc.dump()
File "/usr/lib/python2.7/site-packages/tweets2pdf/pdfgen.py", line 609, in
dump
self.pdfdoc.build(self.elements, onFirstPage = self.on_first_page,
onLaterPages = self.on_later_pages)
File "/usr/lib64/python2.7/site-
packages/reportlab/platypus/doctemplate.py", line 1117, in build
BaseDocTemplate.build(self,flowables, canvasmaker=canvasmaker)
File "/usr/lib64/python2.7/site-
packages/reportlab/platypus/doctemplate.py", line 906, in build
self._endBuild()
File "/usr/lib64/python2.7/site-
packages/reportlab/platypus/doctemplate.py", line 848, in _endBuild
if getattr(self,'_doSave',1): self.canv.save()
File "/usr/lib64/python2.7/site-packages/reportlab/pdfgen/canvas.py", line
1123, in save
self._doc.SaveToFile(self._filename, self)
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 235, in SaveToFile
f.write(self.GetPDFData(canvas))
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 257, in GetPDFData
return self.format()
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 417, in format
IOf = IO.format(self)
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 869, in format
fcontent = format(self.content, document, toplevel=1) # yes this is at
top level
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 102, in format
f = element.format(document)
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 1635, in format
return D.format(document)
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 667, in format
L = [(format(PDFName(k),document)+" "+format(dict[k],document)) for k in
keys]
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 102, in format
f = element.format(document)
File "/usr/lib64/python2.7/site-packages/reportlab/pdfbase/pdfdoc.py",
line 1764, in format
if f is None: raise ValueError, "format not resolved %s" % self.name
ValueError: format not resolved talk.google.com

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Unable to Read a large CSV into Panda dataframe - python

Related

I'm getting an error while reading a csv with pandas

How can I concatenate 3 large tweet dataframe (csv) files each having approximately 5M tweets?

Getting the below error while trying to read a CSV file in python

Pandas Reading csv - Error

Python-Reportlab error: ValueError: format not resolved

Categories

Resources