I am trying to load a flat file into a python pandas data frame.
Using Python 3.8.3 and pandas version 1.0.5
The read_csv code is like this:
import pandas as pd
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
dtype=str,
encoding='UTF-8',
memory_map=True,
low_memory=True, engine='c')
print('nb entries:', df["ID"].size)
This gives me a number of entries.
However, this does not match the number of entries I get with the following code:
num_lines = sum(1 for line in open(myfile, encoding='UTF-8')
print('nb lines:', num_lines)
I don't get an error message.
I tried several options (with/without encoding, with/without low memory, with or without memory map, with or without warn_bad_lines, with the c engine or the default one), but I always got the same erroneous results.
By changing the nrows parameters I identified where in the file the problem seems to be. And I copied the lines of interest in a test file and re-run the code on the test file. This time I get the correct result.
Now I realize that my machine is a little short on memory, so maybe some allocation is failing silently. Would there be a way to test for that? I tried running the script without any other applications open, but I got the same erroneous results.
How should I troubleshoot this type of problem?
Something like this could be used to read the file in chunks
import pandas as pd
import numpy as np
n_rows = sum(1 for _ in open("./test.csv", encoding='UTF-8')) - 1
chunk_size = 300
n_chunks = int(np.ceil(n_rows / chunk_size))
read_lines = 0
for chunk_idx in range(n_chunks):
df = pd.read_csv("./test.csv", header=0, skiprows=chunk_idx*chunk_size, nrows=chunk_size)
read_lines += len(df)
print(read_lines)
Related
I am trying to get the dimensions (shape) of a data frame using pandas in python without reading the entire data frame first in memory given that the file is quite large.
To get the number of columns with minimal loading of the file into the memory, I can for example use the argument below.
import pandas as pd
pd = pd.read_csv("myData.csv", nrows=1)
print(pd.shape)
To get the row numbers I can use the argument usecols = [1] when reading the file but there must be a simpler way of doing this.
If there are other packages or scripts that can easily give me such metadata information, I would be happy as well. It is really metadata I am looking for such as column names, number of rows, number of columns etc but I don't want to read the entire file in!
You don't even need pandas for this. Use the built-in csv module to parse the file:
import csv
with open('myData.csv')as fp:
reader = csv.reader(fp)
headers = next(reader) # The header row is now consumed
ncol = len(headers)
nrow = sum(1 for _ in reader) # What remains are the data rows
I am trying to read in a tsv file (0.5gb) using pandas however, I can't seem to get it to work. I have stripped my code down to its simplest form and still no luck:
import pandas as pd
import os
rawpath = 'my path'
filename = 'my file name'
finalfile = os.path.join(rawpath, filename)
df = pd.read_csv(finalfile, nrows=5000, sep='\t')
print(df.head())
I have tried to chunk the file, with no luck, read_table doesn't work either. I have gone in and freed up as much memory as possible on my machine but when I finally recieve any output from Pycharm, it says:
pandas.errors.ParserError: Error tokenizing data. C error: out of memory
Can anyone assist please?
Try setting dtype=object and na_values = "Your NA format" (optional, if you know it)
Also, make sure you have the right separator.
something like:
df = pd.read_csv(finalfile, nrows=5000, sep='\t', dtype=object, na_values = '-NaN')
Edit:
Also, you mentioned chunking the file. I am not sure what you mean by that, but i will mention that you can chunk directly using pandas, instead of nrows. Final code:
mylist = []
for chunk in pd.read_csv(finalfile, sep='\t', dtype=object, na_values = '-NaN', chunksize=100):
mylist.append(chunk)
big_data = pd.concat(mylist, axis= 0)
del mylist
i want to read a dataset from a file with pandas, but when i use pd.read_csv(), the program read it, but when i want to see the dataframe appears:
pandas.io.parsers.TextFileReader at 0x1b3b6b3e198
As additional informational the file is too large (around 9 Gigas)
The file use as a separator the vertical lines, and i tried using chunksize but it doesn't work.
import pandas as pd
df = pd.read_csv(r"C:\Users\dguerr\Documents\files\Automotive\target_file", iterator=True, sep='|',chunksize=1000)
I want to import my data in the traditional pandas dataframe format.
You can load it chunk by chunk by doing:
import pandas as pd
path_to_file = "C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file"
chunk_size = 1000
for chunk in pd.read_csv(path_to_file,chunksize=chunk_size):
# do your stuff
You might want to check encoding types within a DataFrame. Your pd.read_csv defaults to utf8, should you be using latin1 for instance, this could potentially lead to such errors.
import pandas as pd
df = pd.read_csv('C:/Users/dguerr/Documents/Acxiom files/Automotive/auto_model_target_file',
encoding='latin-1', chunksize=1000)
After solving a sorting of a dataset, I have a problem at this point of my code.
with open(fns_land[xx]) as infile:
lines = infile.readlines()
for line in lines:
result_station.append(line.split(',')[0])
result_date.append(line.split(',')[1])
result_metar.append(line.split(',')[-1])
I have a problem in the lines line. In this line the data are sometimes to huge and i get a kill error.
Is there a short/nice way to rewrite this point?
Use readline instead, this read it one line at a time without loading the entire file into memory.
with open(fns_land[xx]) as infile:
while True:
line = infile.readline()
if not line:
break
result_station.append(line.split(',')[0])
result_date.append(line.split(',')[1])
result_metar.append(line.split(',')[-1])
If you are dealing with a dataset, I would suggest that you have a look at pandas, which I great for dealing with data wrangling.
If your problem is a large dataset, you could load the data in chunks.
import pandas as pd
tfr = pd.read_csv('fns_land{0}.csv'.format(xx), iterator=True, chunksize=1000)
Line: imported pandas modul
Line: read data from your csv file in chunks of 1000 lines.
This will be of type pandas.io.parsers.TextFileReader. To load the entire csv file, you follow up with:
df = pd.concat(tfr, ignore_index=True)
The parameter ignore_index=True is added to avoid duplicity of indexes.
You now have all your data loaded into a dataframe. Then do your data manipulation on the columns as vectors, which also is faster than regular line by line.
Have a look here this question that dealt with something similar.
I have a rather large fixed-width file (~30M rows, 4gb) and when I attempted to create a DataFrame using pandas read_fwf() it only loaded a portion of the file, and was just curious if anyone has had a similar issue with this parser not reading the entire contents of a file.
import pandas as pd
file_name = r"C:\....\file.txt"
fwidths = [3,7,9,11,51,51]
df = read_fwf(file_name, widths = fwidths, names = [col0, col1, col2, col3, col4, col5])
print df.shape #<30M
If I naively read the file into 1 column using read_csv(), all of the file is read to memory and there is no data loss.
import pandas as pd
file_name = r"C:\....\file.txt"
df = read_csv(file_name, delimiter = "|", names = [col0]) #arbitrary delimiter (the file doesn't include pipes)
print df.shape #~30M
Of course, without seeing the contents or format of the file it could be related to something on my end, but wanted to see if anyone else has had any issues with this in the past. I did a sanity check and tested a couple of the rows deep in the file and they all seem to be formatted correctly (further verified when I was able to pull this into an Oracle DB with Talend using the same specs).
Let me know if anyone has any ideas, it would be great to run everything via Python and not go back and forth when I begin to develop analytics.
Few lines of the input file would be useful to see how the date looks like. Nevertheless, I generated some random file of similar format (I think) that you have, and applied pd.read_fwf into it. This is the code for the generation and reading it:
from random import random
import pandas as pd
file_name = r"/tmp/file.txt"
lines_no = int(30e6)
with open(file_name, 'w') as f:
for i in range(lines_no):
if i%int(1e5) == 0:
print("Writing progress: {:0.1f}%"
.format(float(i) / float(lines_no)*100), end='\r')
f.write(" ".join(["{:<10.8f}".format(random()*10) for v in range(6)])+"\n")
print("File created. Now read it using pd.read_fwf ...")
fwidths = [11,11,11,11,11,11]
df = pd.read_fwf(file_name, widths = fwidths,
names = ['col0', 'col1', 'col2', 'col3', 'col4', 'col5'])
#print(df)
print(df.shape) #<30M
So in this case, it seams it is working fine. I use Python 3.4, Ubuntu 14.04 x64 and pandas 0.15.1. It takes a while to create the file and read it using pd.read_fwf. But it seems to be working, at least for me and my setup.
The result is : (30000000, 6)
Example file created:
7.83905215 9.64128377 9.64105762 8.25477816 7.31239330 2.23281189
8.55574419 9.08541874 9.43144800 5.18010536 9.06135038 2.02270145
7.09596172 7.17842495 9.95050576 4.98381816 1.36314390 5.47905083
6.63270922 4.42571036 2.54911162 4.81059164 2.31962024 0.85531626
2.01521946 6.50660619 8.85352934 0.54010559 7.28895079 7.69120905