I understand that generators in Python can help for reading and processing large files when specific transformations or outputs are needed from the file (i.e. such as reading a specific column or computing an aggregation).
However, for me it's not clear if there is any benefit in using generators in Python when the only purpose is to read the entire file.
Edit: Assuming your dataset fits in memory.
Lazy Method for Reading Big File in Python?
pd.read_csv('sample_file.csv', chunksize=chunksize)
vs.
pd.read_csv('sample_file.csv')
Are generators useful just to read the entire data without any data processing?
The DataFrame you get from pd.read_csv('sample_file.csv') might fit into memory; however, pd.read_csv itself is a memory intensive function so while reading a file that will end up consuming 10 gigabytes of memory your actual memory usage may exceed 30-40 gigabytes. In cases like this, reading the file in smaller chunks might be the only option.
I have an SAS file that is roughly 112 million rows. I do not actually have access to SAS software, so I need to get this data into, preferably, a pandas DataFrame or something very similar in the python family. I just don't know how to do this efficiently. ie, just doing df = pd.read_sas(filename.sas7bdat) takes a few hours. I can do chunk sizes but that doesn't really solve the underlying problem. Is there any faster way to get this into pandas, or do I just have to eat the multi-hour wait? Additionally, even when I have read in the file, I can barely do anything with it because iterating over the df takes forever as well. It usually just ends up crashing the Jupyter kernel. Thanks in advance for any advice in this regard!
Regarding the first part, I guess there is not much to do as the read_sas options are limited.
For the second part 1. iterating manually through rows is slow and not pandas philosophy. Whenever possible use vectorial operations. 2. Try to look into specialized solutions for large datasets, like dask. Also read how to scale to large dataframes.
Maybe you don't need your entire file to work on it so you can take 10%. You can also change your variable types to reduce its memory.
if you want to store a df and re use it instead of re importing the entire file each time you want to work on it you can save it as a pickle file (.pkl) and re open it by using pandas.read_pickle
here is my problem.
I have a single big CSV file containing a bit more than 100M rows which I need to divide in much smaller files (if needed I can add more details). At the moment I'm reading in chunks the big CSV, doing some computations to determine how to subdivide the chunk and finally writing (appending) to the files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True if it is the first time that I write to 'outfile', otherwise it is False).
While running the code I noticed that the amount of on-disk memory taken by the smaller files on the whole was likely to become larger than three times the size of the single big csv.
So here are my questions:
is this behavior normal? (probably it is, but I'm asking just in case)
is it possible to reduce the size of the files? (different file formats?) [SOLVED through compression, see update below and comments]
are there better file types for this situation with respect to CSV?
Please note that I don't have an extensive knowledge of programming, I'm just using Python for my thesis.
Thanks in advance to whoever will help.
UPDATE: thanks to #AshishAcharya and #PatrickArtner I learned how to use the compression while writing and reading the CSV. Still, I'd like to know if there are any file types that may be better than CSV for this task.
NEW QUESTION: (maybe stupid question) does appending work on compressed files?
UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but using it on the folder containing compressed files or the one containing the uncompressed files results in the same value of '3.8G' (both are created using the same first 5M rows from the big CSV). From the file explorer (Nautilus) instead, I get about 590MB for the one containing uncompressed CSV and 230MB for the other. What am I missing?
I am using pandas to analyse large CSV data files. They are around 100 megs in size.
Each load from csv takes a few seconds, and then more time to convert the dates.
I have tried loading the files, converting the dates from strings to datetimes, and then re-saving them as pickle files. But loading those takes a few seconds as well.
What fast methods could I use to load/save the data from disk?
As #chrisb said, pandas' read_csv is probably faster than csv.reader/numpy.genfromtxt/loadtxt. I don't think you will find something better to parse the csv (as a note, read_csv is not a 'pure python' solution, as the CSV parser is implemented in C).
But, if you have to load/query the data often, a solution would be to parse the CSV only once and then store it in another format, eg HDF5. You can use pandas (with PyTables in background) to query that efficiently (docs).
See here for a comparison of the io performance of HDF5, csv and SQL with pandas: http://pandas.pydata.org/pandas-docs/stable/io.html#performance-considerations
And a possibly relevant other question: "Large data" work flows using pandas
Posting this late in response to a similar question that had found simply using modin out of the box fell short. The answer will be similar with dask - use all of the below strategies in combination for best results, as appropriate for your use case.
The pandas docs on Scaling to Large Datasets have some great tips which I'll summarize here:
Load less data. Read in a subset of the columns or rows using the usecols or nrows parameters to pd.read_csv. For example, if your data has many columns but you only need the col1 and col2 columns, use pd.read_csv(filepath, usecols=['col1', 'col2']). This can be especially important if you're loading datasets with lots of extra commas (e.g. the rows look like index,col1,col2,,,,,,,,,,,. In this case, use nrows to read in only a subset of the data to make sure that the result only includes the columns you need.
Use efficient datatypes. By default, pandas stores all integer data as signed 64-bit integers, floats as 64-bit floats, and strings as objects or string types (depending on the version). You can convert these to smaller data types with tools such as Series.astype or pd.to_numeric with the downcast option.
Use Chunking. Parsing huge blocks of data can be slow, especially if your plan is to operate row-wise and then write it out or to cut the data down to a smaller final form. You can use the chunksize and iterator arguments to loop over chunks of the data and process the file in smaller pieces. See the docs on Iterating through files chunk by chunk for more detail. Alternately, use the low_memory flag to get Pandas to use the chunked iterator on the backend, but return a single dataframe.
Use other libraries. There are a couple great libraries listed here, but I'd especially call out dask.dataframe, which specifically works toward your use case, by enabling chunked, multi-core processing of CSV files which mirrors the pandas API and has easy ways of converting the data back into a normal pandas dataframe (if desired) after processing the data.
Additionally, there are some csv-specific things I think you should consider:
Specifying column data types. Especially if chunking, but even if you're not, specifying the column types can dramatically reduce read time and memory usage and highlight problem areas in your data (e.g. NaN indicators or Flags that don't meet one of pandas's defaults). Use the dtypes parameter with a single data type to apply to all columns or a dict of column name, data type pairs to indicate the types to read in. Optionally, you can provide converters to format dates, times, or other numerical data if it's not in a format recognized by pandas.
Specifying the parser engine - pandas can read csvs in pure python (slow) or C (much faster). The python engine has slightly more features (e.g. currently the C parser can't read files with complex multi-character delimeters and it can't skip footers). Try using the argument engine='c' to make sure the C engine is being used. If your file can't be read by the c engine, I'd try fixing the file(s) first manually (e.g. stripping out a footer or standardizing the delimiters) and then parsing with the C engine, if possible.
Make sure you're catching all NaNs and data flags in numeric columns. This can be a tough one, and specifying specific data types in your inputs can be helpful in catching bad cases. Use the na_values, keep_default_na, date_parser, and converters argumentss to pd.read_csv. Currently, the default list of values interpreted as NaN are ['', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'].For example, if your numeric columns have non-numeric values coded as notANumber then this would be missed and would either cause an error (if you had dtypes specified) or would cause pandas to re-categorieze the entire column as an object column (suuuper bad for memory and speed!).
Read the pd.read_csv docs over and over and over again. Many of the arguments to read_csv have important performance considerations. pd.read_csv is optimized to smooth over a large amount of variation in what can be considered a csv, and the more magic pandas has to be ready to perform (determine types, interpret nans, convert dates (maybe), skip headers/footers, infer indices/columns, handle bad lines, etc) the slower the read will be. Give it as many hints/constraints as you can and you might see performance increase a lot! And if it's still not enough, many of these tweaks will also apply to the dask.dataframe API, so this scales up further nicely.
Additionally, if you have the option, save the files in a stable binary storage format. Apache Parquet is a good columnar storage format with pandas support, but there are many others (see that Pandas IO guide for more options). Pickles can be a bit brittle across pandas versions (of course, so can any binary storage format, but this is usually less a concern for explicit data storage formats rather than pickles), and CSVs are inefficient and under-specified, hence the need for type conversion and interpretation.
One thing to check is the actual performance of the disk system itself. Especially if you use spinning disks (not SSD), your practical disk read speed may be one of the explaining factors for the performance. So, before doing too much optimization, check if reading the same data into memory (by, e.g., mydata = open('myfile.txt').read()) takes an equivalent amount of time. (Just make sure you do not get bitten by disk caches; if you load the same data twice, the second time it will be much faster because the data is already in RAM cache.)
See the update below before believing what I write underneath
If your problem is really parsing of the files, then I am not sure if any pure Python solution will help you. As you know the actual structure of the files, you do not need to use a generic CSV parser.
There are three things to try, though:
Python csv package and csv.reader
NumPy genfromtext
Numpy loadtxt
The third one is probably fastest if you can use it with your data. At the same time it has the most limited set of features. (Which actually may make it fast.)
Also, the suggestions given you in the comments by crclayton, BKay, and EdChum are good ones.
Try the different alternatives! If they do not work, then you will have to do write something in a compiled language (either compiled Python or, e.g. C).
Update: I do believe what chrisb says below, i.e. the pandas parser is fast.
Then the only way to make the parsing faster is to write an application-specific parser in C (or other compiled language). Generic parsing of CSV files is not straightforward, but if the exact structure of the file is known there may be shortcuts. In any case parsing text files is slow, so if you ever can translate it into something more palatable (HDF5, NumPy array), loading will be only limited by the I/O performance.
Modin is an early-stage project at UC Berkeley’s RISELab designed to facilitate the use of distributed computing for Data Science. It is a multiprocess Dataframe library with an identical API to pandas that allows users to speed up their Pandas workflows.
Modin accelerates Pandas queries by 4x on an 8-core machine, only requiring users to change a single line of code in their notebooks.
pip install modin
if using dask
pip install modin[dask]
import modin by typing
import modin.pandas as pd
It uses all CPU cores to import csv file and it is almost like pandas.
Most of the solutions are helpful here, I would like to say that parallelizing the loading can also help. Simple code below:
import os
import glob
path = r'C:\Users\data' # or whatever your path
data_files = glob.glob(os.path.join(path, "*.psv")) #list of all the files to be read
import reader
from multiprocessing import Pool
def read_psv_all (file_name):
return pd.read_csv(file_name,
delimiter='|', # change this as needed
low_memory=False
)
pool = Pool(processes=3) # can change 3 to number of processors you want to utilize
df_list = pool.map(read_psv_all, data_files)
df = pd.concat(df_list, ignore_index=True,axis=0, sort=False)
Note that if you are using windows/jupyter, it might be a sinister combination to use parallel processing. You might need to use if __name__ == '__main__' in your code.
Along with this, do utilize columns, dtypes which would definitely help.
I have a 3.3gb file containing one long line. The values in the file are comma separated and either floats or ints. Most of the values are 10. I want to read the data into a numpy array. Currently, I'm using numpy.fromfile:
>>> import numpy
>>> f = open('distance_matrix.tmp')
>>> distance_matrix = numpy.fromfile(f, sep=',')
but that has been running for over an hour now and it's currently using ~1 Gig memory, so I don't think it's halfway yet.
Is there a faster way to read in large data that is on a single line?
This should probably be a comment... but I don't have enough reputation to put comments in.
I've used hdf files, via h5py, of sizes well over 200 gigs with very little processing time, on the order of a minute or two, for file accesses. In addition the hdf libraries support mpi and concurrent access.
This means that, assuming you can format your original one line file, as an appropriately hierarchic hdf file (e.g. make a group for every `large' segment of data) you can use the inbuilt capabilities of hdf to make use of multiple core processing of your data exploiting mpi to pass what ever data you need between the cores.
You need to be careful with your code and understand how mpi works in conjunction with hdf, but it'll speed things up no end.
Of course all of this depends on putting the data into an hdf file in a way that allows you to take advantage of mpi... so maybe not the most practical suggestion.
Consider dumping the data using some binary format. See something like http://docs.scipy.org/doc/numpy/reference/generated/numpy.save.html
This way it will be much faster because you don't need to parse the values.
If you can't change the file type (not the result of one of your programs) then there's not much you can do about it. Make sure your machine has lots of ram (at least 8GB) so that it doesn't need to use the swap at all. Defragmenting the harddrive might help as well, or using a SSD drive.
An intermediate solution might be a C++ binary to do the parsing and then dump it in a binary format. I don't have any links for examples on this one.