I am trying to read in a TSV file (0.5 GB) using pandas; however, I can't seem to get it to work. I have stripped my code down to its simplest form and still have no luck:
import pandas as pd
import os
rawpath = 'my path'
filename = 'my file name'
finalfile = os.path.join(rawpath, filename)
df = pd.read_csv(finalfile, nrows=5000, sep='\t')
print(df.head())
I have tried to chunk the file, with no luck, and read_table doesn't work either. I have freed up as much memory as possible on my machine, but when I finally receive any output from PyCharm, it says:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
Can anyone assist please?
Try setting dtype=object and na_values="your NA format" (optional, if you know it).
Also, make sure you have the right separator.
Something like:
df = pd.read_csv(finalfile, nrows=5000, sep='\t', dtype=object, na_values = '-NaN')
Edit:
Also, you mentioned chunking the file. I am not sure what you mean by that, but I will mention that you can chunk directly using pandas' chunksize parameter, instead of nrows. Final code:
mylist = []
for chunk in pd.read_csv(finalfile, sep='\t', dtype=object, na_values='-NaN', chunksize=100):
    mylist.append(chunk)
big_data = pd.concat(mylist, axis=0)
del mylist
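If the concatenated frame is still too large for memory, another option is to load only the columns you need and give them compact dtypes. A sketch with made-up column names (the real ones aren't shown in the question):

import pandas as pd
# hypothetical column names and dtypes; usecols drops unneeded columns at
# parse time, and explicit dtypes avoid storing every value as a Python object
wanted = ['id', 'value']
dtypes = {'id': 'int32', 'value': 'float32'}
df = pd.read_csv(finalfile, sep='\t', usecols=wanted, dtype=dtypes)
print(df.memory_usage(deep=True))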
I have a dataset from the State Security Department in my county that has some problems.
I can't read the records at all from the file that is made available as CSV; it brings up only empty records. When I convert the file to XLSX, it does get read.
I would like to know if there is any possible solution to the above problem.
The dataset is available at: here or here.
I tried the code below, but I only get nulls, except for the first row in the first column:
df = pd.read_csv('mensal_ss.csv', sep=';', names=cols, encoding='latin1')
Thank you!
If you try with utf-16 as the encoding, it seems to work. However, note that the year rows complicate the parsing, so you may need some extra manipulation of the CSV to circumvent that, depending on what you want to do with the data:
df = pd.read_csv('mensal_ss.csv', sep=';', encoding='utf-16')
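If you want to drop those year rows programmatically, one hypothetical approach (an assumption, since the exact layout isn't shown: the year rows have a value in the first column and nothing in the others) is to keep only rows with data outside the first column:

df = pd.read_csv('mensal_ss.csv', sep=';', encoding='utf-16')
# keep rows that have at least one non-null value outside the first column
df = df[df.iloc[:, 1:].notna().any(axis=1)]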
Try using 'utf-16-le':
import pandas as pd
df = pd.read_csv('mensal_ss.csv', sep=';', encoding='utf-16-le')
print(df.head())
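If you are unsure which UTF-16 variant the file uses, one way to check is to look at the first bytes for a byte-order mark. A small sketch (a file starting with b'\xff\xfe' is UTF-16 little-endian, b'\xfe\xff' is big-endian, and b'\xef\xbb\xbf' is UTF-8 with a BOM):

with open('mensal_ss.csv', 'rb') as f:
    bom = f.read(2)
print(bom)  # b'\xff\xfe' -> utf-16-le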
I am trying to load a flat file into a python pandas data frame.
Using Python 3.8.3 and pandas version 1.0.5
The read_csv code is like this:
import pandas as pd
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                 dtype=str,
                 encoding='UTF-8',
                 memory_map=True,
                 low_memory=True, engine='c')
print('nb entries:', df["ID"].size)
This gives me a number of entries.
However, this does not match the number of entries I get with the following code:
num_lines = sum(1 for line in open(myfile, encoding='UTF-8'))
print('nb lines:', num_lines)
I don't get an error message.
I tried several options (with/without encoding, with/without low_memory, with/without memory_map, with/without warn_bad_lines, with the C engine or the default one), but I always got the same erroneous results.
By changing the nrows parameters I identified where in the file the problem seems to be. And I copied the lines of interest in a test file and re-run the code on the test file. This time I get the correct result.
Now I realize that my machine is a little short on memory, so maybe some allocation is failing silently. Would there be a way to test for that? I tried running the script without any other applications open, but I got the same erroneous results.
How should I troubleshoot this type of problem?
Something like this could be used to read the file in chunks:
import pandas as pd
import numpy as np
n_rows = sum(1 for _ in open("./test.csv", encoding='UTF-8')) - 1  # minus the header line
chunk_size = 300
n_chunks = int(np.ceil(n_rows / chunk_size))
read_lines = 0
for chunk_idx in range(n_chunks):
    # a range for skiprows keeps the header (line 0) and skips only the
    # data rows covered by previous chunks; a plain integer would swallow
    # the header and silently drop one data row per chunk
    df = pd.read_csv("./test.csv", header=0,
                     skiprows=range(1, 1 + chunk_idx * chunk_size),
                     nrows=chunk_size)
    read_lines += len(df)
print(read_lines)
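Alternatively, pandas can do the slicing itself via the chunksize parameter, which returns an iterator of DataFrames and avoids re-opening the file once per chunk:

read_lines = 0
for chunk in pd.read_csv("./test.csv", header=0, chunksize=chunk_size):
    read_lines += len(chunk)
print(read_lines)

As for why the raw line count and the pandas count can legitimately differ in the first place: a quoted field containing an embedded newline spans two physical lines but is parsed as one record. A quick check with the standard csv module (an assumption: the file uses the default '"' quote character) counts records the way a CSV parser sees them:

import csv
with open(myfile, encoding='UTF-8', newline='') as f:
    n_records = sum(1 for _ in csv.reader(f, delimiter='|'))
print('nb records:', n_records)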
Currently I am getting the error below, and I have tried out the following posts:
Solution 1
Solution 2
But I am not able to get the error resolved. My Python code is as follows:
import pandas as pd
testdata = pd.read_csv(file_name, header=None, delim_whitespace=True)
I tried to print the value of testdata, but it doesn't show any output.
The following is my CSV file:
Firstly, pass your filename to read_csv as a string, and make sure the file is either in the local directory or that you have the correct file path.
import pandas as pd
testdata = pd.read_csv("filename.csv", header=None, delim_whitespace=True)
If that does not work, post some information about the environment you are using.
First, you probably don't need header=None, as you seem to have headers in the file.
Also try removing the blank line between the headers and the first line of data.
Check and double check your file name.
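On the blank-line point: read_csv's skip_blank_lines option (True by default) should make pandas ignore fully empty lines, so you can also try stating it explicitly together with a real header row. A sketch, assuming the file really is whitespace-delimited with headers:

import pandas as pd
# header=0 uses the first line as column names; skip_blank_lines=True
# (the default) drops completely empty lines such as the one between
# the header and the data
testdata = pd.read_csv("filename.csv", delim_whitespace=True,
                       header=0, skip_blank_lines=True)
print(testdata.head())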
I have a csv file that contains some data with columns names:
"PERIODE"
"IAS_brut"
"IAS_lissé"
"Incidence_Sentinelles"
I have a problem with the third one, "IAS_lissé", which is misinterpreted by pd.read_csv() and returned as �.
What is that character?
Because it's generating a bug in my Flask application, is there a way to read that column in another way without modifying the file?
In [1]: import pandas as pd
In [2]: pd.read_csv("Openhealth_S-Grippal.csv",delimiter=";").columns
Out[2]: Index([u'PERIODE', u'IAS_brut', u'IAS_liss�', u'Incidence_Sentinelles'], dtype='object')
I found the same problem with Spanish and solved it with "latin1" encoding:
import pandas as pd
pd.read_csv("Openhealth_S-Grippal.csv",delimiter=";", encoding='latin1')
Hope it helps!
You can change the encoding parameter of read_csv; see the pandas doc here, and the Python standard encodings here. The � you are seeing is the Unicode replacement character (U+FFFD), which shows up when bytes can't be decoded in the encoding being assumed.
I believe for your example you can use the utf-8 encoding (assuming that your language is French).
df = pd.read_csv("Openhealth_S-Grippal.csv", delimiter=";", encoding='utf-8')
Here's an example showing some sample output. All I did was make a csv file with one column, using the problem characters.
df = pd.read_csv('sample.csv', encoding='utf-8')
Output:
IAS_lissé
0 1
1 2
2 3
Try using:
import pandas as pd
df = pd.read_csv('file_name.csv', encoding='utf-8-sig')
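If none of the suggested encodings works, you can try guessing the encoding from the raw bytes with the third-party chardet package (an assumption: it has to be installed separately, e.g. via pip). A sketch:

import chardet
import pandas as pd

# guess the encoding from the first 100 kB of raw bytes
with open('Openhealth_S-Grippal.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))
print(guess)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}

df = pd.read_csv('Openhealth_S-Grippal.csv', delimiter=';',
                 encoding=guess['encoding'])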
Using utf-8 didn't work for me. E.g. this piece of code:
bla = pd.DataFrame(data = [1, 2])
bla.to_csv('funkyNamé , things.csv')
blabla = pd.read_csv('funkyNamé , things.csv', delimiter=";", encoding='utf-8')
blabla
Ultimately returned: OSError: Initializing from file failed
I know you said you didn't want to modify the file. If you meant the file content vs the filename, I would rename the file to something without an accent, read the csv file under its new name, then reset the filename back to its original name.
import os

originalfilepath = r'C:\Users\myself\funkyNamé , things.csv'
originalfolder = r'C:\Users\myself'
# temporarily rename to an ASCII-only name, read, then rename back
os.rename(originalfilepath, originalfolder + '\\tempName.csv')
df = pd.read_csv(originalfolder + '\\tempName.csv', encoding='ISO-8859-1')
os.rename(originalfolder + '\\tempName.csv', originalfilepath)
If you did mean "without modifying the filename", my apologies for not being helpful to you, and I hope this helps someone else.
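As an alternative to renaming, read_csv also accepts an already-open file object, which was a common workaround for this OSError since pandas then never has to handle the accented path itself. A sketch under the same assumptions about path and encoding:

import pandas as pd
# open the file ourselves and hand the file object to read_csv
with open(r'C:\Users\myself\funkyNamé , things.csv', encoding='ISO-8859-1') as f:
    df = pd.read_csv(f)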
I have a rather large fixed-width file (~30M rows, 4 GB), and when I attempted to create a DataFrame using pandas read_fwf() it only loaded a portion of the file. I am just curious if anyone has had a similar issue with this parser not reading the entire contents of a file.
import pandas as pd
file_name = r"C:\....\file.txt"
fwidths = [3,7,9,11,51,51]
df = pd.read_fwf(file_name, widths=fwidths,
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
print(df.shape)  # <30M
If I naively read the file into 1 column using read_csv(), all of the file is read to memory and there is no data loss.
import pandas as pd
file_name = r"C:\....\file.txt"
df = pd.read_csv(file_name, delimiter='|', names=['col0'])  # arbitrary delimiter (the file doesn't include pipes)
print(df.shape)  # ~30M
Of course, without seeing the contents or format of the file it could be related to something on my end, but wanted to see if anyone else has had any issues with this in the past. I did a sanity check and tested a couple of the rows deep in the file and they all seem to be formatted correctly (further verified when I was able to pull this into an Oracle DB with Talend using the same specs).
Let me know if anyone has any ideas, it would be great to run everything via Python and not go back and forth when I begin to develop analytics.
A few lines of the input file would be useful to see what the data looks like. Nevertheless, I generated a random file of a similar format (I think) to what you have, and applied pd.read_fwf to it. This is the code for generating and reading it:
from random import random
import pandas as pd
file_name = r"/tmp/file.txt"
lines_no = int(30e6)
with open(file_name, 'w') as f:
    for i in range(lines_no):
        if i % int(1e5) == 0:
            print("Writing progress: {:0.1f}%"
                  .format(float(i) / float(lines_no) * 100), end='\r')
        f.write(" ".join(["{:<10.8f}".format(random() * 10) for v in range(6)]) + "\n")
print("File created. Now read it using pd.read_fwf ...")
fwidths = [11,11,11,11,11,11]
df = pd.read_fwf(file_name, widths=fwidths,
                 names=['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
#print(df)
print(df.shape) #<30M
So in this case, it seems it is working fine. I use Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. It takes a while to create the file and read it using pd.read_fwf, but it seems to be working, at least for me and my setup.
The result is: (30000000, 6)
Example file created:
7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905
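To localize the problem in the original file, one idea is to scan for lines whose width doesn't match the field specification. A sketch (an assumption: with widths [3, 7, 9, 11, 51, 51] and no separators, every valid line should be exactly 132 characters wide; lines may legitimately be shorter if trailing spaces were stripped):

expected = sum([3, 7, 9, 11, 51, 51])  # 132 characters per valid row
with open(file_name) as f:
    for i, line in enumerate(f):
        if len(line.rstrip('\n')) != expected:
            print(i, repr(line))  # report suspicious lines and where they are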