I have a very large CSV that takes ~30 seconds to read when using the normal pd.read_csv command. Is there a way to speed this process up? I'm thinking maybe something that only reads rows that have some matching value in one of the columns.
i.e. only read in rows where the value in column 'A' is the value '5'.
The Dask module can do a lazy read of a large CSV file in Python.
You trigger the computation by calling the .compute() method. At that point the file is read in chunks and whatever conditional logic you specified is applied.
import dask.dataframe as dd
df = dd.read_csv(csv_file)
df = df[df['A'] == 5]
df = df.compute()
print(len(df)) # print number of records
print(df.head()) # print first 5 rows to show sample of data
If you're looking for a specific value in a CSV file, you have to scan the entire file and then limit the results.
If you just want to retrieve the first five rows, you may be looking for this:
nrows : int, optional
Number of rows of file to read. Useful for reading pieces of large files.
Reference: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
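For example, a minimal sketch of using nrows (the file name here is just a placeholder):
import pandas as pd

# read only the first five rows of the file
df = pd.read_csv('large_file.csv', nrows=5)
print(df)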
Try and chunk it dude! Truffle Shuffle! Goonies Never say die.
mylist = []
for chunk in pd.read_csv('csv_file.csv', sep=',', chunksize=10000):
    mylist.append(chunk[chunk.A == 5])
big_data = pd.concat(mylist, axis=0)
del mylist
I'm trying to create a dictionary file for a big CSV file that is divided into chunks to be processed, but when I create the dictionary it only covers one chunk, and when I try to append it, an empty dataframe gets passed to the new df. This is the code I used:
wdata = pd.read_csv(fileinput, nrows=0,).columns[0]
skip = int(wdata.count(' ') == 0)
dic = pd.DataFrame()
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic.append(dic_tmp)
dic.to_csv('newwww.csv', index=False)
If I save dic_tmp, it is just a dictionary for one chunk, not the whole set, and dic takes a lot of time to process but returns empty dataframes at the end. Is there any error in my code?
(Screenshots in the original question showed the input CSV, the current output CSV, and the expected output.)
So it's not adding the chunks together; it just pastes the new chunk regardless of what is in the previous chunk or the CSV.
In order to split the column into words and count the occurrences:
df['sentences'].apply(lambda x: pd.value_counts(x.split(" "))).sum(axis=0)
or
from collections import Counter
result = Counter(" ".join(df['sentences'].values.tolist()).split(" ")).items()
Both seem to be roughly equally slow, but probably better than your approach.
Taken from here:
Count distinct words from a Pandas Data Frame
A couple of problems that I see:
Why read the CSV file twice?
First time here: wdata = pd.read_csv(fileinput, nrows=0,).columns[0], and a second time in the for loop.
If you aren't using the combined data frame further, I think it is better to write the chunks to a CSV file in append mode, as shown below:
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    dic_tmp = (chunk['sentences'].str.split(expand=True).stack().value_counts().rename_axis('word').reset_index(name='freq'))
    dic_tmp.to_csv('newwww.csv', mode='a', header=False)
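If you do want one combined frequency table in memory, note that dic.append(dic_tmp) returns a new DataFrame instead of modifying dic in place, which is why dic stays empty. A rough sketch (reusing fileinput and skip from your code) would be to collect the per-chunk counts and aggregate them once at the end:
import pandas as pd

counts = []
for chunk in pd.read_csv(fileinput, names=['sentences'], skiprows=skip, chunksize=1000):
    # word counts for this chunk only
    counts.append(chunk['sentences'].str.split(expand=True).stack().value_counts()
                  .rename_axis('word').reset_index(name='freq'))

# sum the per-chunk counts so every word appears once with its total frequency
dic = (pd.concat(counts)
       .groupby('word', as_index=False)['freq'].sum()
       .sort_values('freq', ascending=False))
dic.to_csv('newwww.csv', index=False)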
I have a number of .xls datasheets which I am looking to clean and merge.
Each data sheet is generated by a larger system which cannot be changed.
The method that generates the data sets displays the selected parameters for the data set (e.g. Example 1). I am looking to automate the removal of these.
The number of rows that this takes up varies, so I am unable to blanket remove x rows from each sheet. Furthermore, the system that generates the report arbitrarily merges cells in the blank sections to the right of the information.
Currently I am attempting what feels like a very inelegant solution where I convert the file to a CSV, read it as a string and remove everything before the first column.
data_xls = pd.read_excel('InputFile.xls', index_col=None)
data_xls.to_csv('Change1.csv', encoding='utf-8')
with open("Change1.csv") as f:
s = f.read() + '\n'
a=(s[s.index("Col1"):])
df = pd.DataFrame([x.split(',') for x in a.split('\n')])
This works but it seems wildly inefficient:
Multiple format conversions
Reading every line in the file when the only rows being altered occur within first ~20
Dataframe ends up with column headers shifted over by one and must be re-aligned (Less concern)
With some of the files being around 20mb, merging a batch of 8 can take close to 10 minutes.
A little hacky, but here is an idea to speed up your process by doing some operations directly on your dataframe. Considering you know your first column name to be Col1, you could try something like this:
df = pd.read_excel('InputFile.xls', index_col=None)
# Find the first occurrence of "Col1"
column_row = df.index[df.iloc[:, 0] == "Col1"][0]
# Use this row as header
df.columns = df.iloc[column_row]
# Remove the columns' name (currently a useless index number)
df.columns.name = None
# Keep only the data after the (old) column row
df = df.iloc[column_row + 1:]
# And tidy it up by resetting the index
df.reset_index(drop=True, inplace=True)
This should work for any dynamic number of header rows in your Excel (xls & xlsx) files, as long as you know the title of the first column...
If you know the number of junk rows, you can skip them using skiprows:
data_xls = pd.read_excel('InputFile.xls', index_col=None, skiprows=2)
I have a time series in a big text file.
That file is more than 4 GB.
As it is a time series, I would like to read only 1% of lines.
Desired minimalist example:
df = pandas.read_csv('super_size_file.log',
                     load_line_percentage=1)
print(df)
desired output:
>line_number, value
0, 654564
100, 54654654
200, 54
300, 46546
...
I can't resample after loading, because it takes too much memory to load it in the first place.
I may want to load chunk by chunk and resample every chunk, but that seems inefficient to me.
Any ideas are welcome. ;)
Anytime I have to deal with a very large file, I ask "What would Dask do?".
Load the large file as a dask.DataFrame, convert the index to a column (workaround due to full index control not being available), and filter on that new column.
import dask.dataframe as dd
import pandas as pd
nth_row = 100 # grab every nth row from the larger DataFrame
dask_df = dd.read_csv('super_size_file.log') # assuming this file can be read by pd.read_csv
dask_df['df_index'] = dask_df.index
dask_df_smaller = dask_df[dask_df['df_index'] % nth_row == 0]
df_smaller = dask_df_smaller.compute() # to execute the operations and return a pandas DataFrame
This will give you rows 0, 100, 200, etc. from the larger file. If you want to cut down the DataFrame to specific columns, do this before calling compute, i.e. dask_df_smaller = dask_df_smaller[['Signal_1', 'Signal_2']]. You can also call compute with the scheduler='processes' option to use all cores on your CPU.
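For instance, combining both of those suggestions (Signal_1 and Signal_2 are just the example column names from above):
# keep only the two columns of interest, then run the whole graph on all CPU cores
dask_df_smaller = dask_df_smaller[['Signal_1', 'Signal_2']]
df_smaller = dask_df_smaller.compute(scheduler='processes')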
You can specify the number of rows you want to read when you use the pandas read_csv function. Here is what you could do:
import pandas as pd
# Select file
infile = 'path/file'
number_of_lines = x
# Use nrows to choose number of rows
data = pd.read_csv(infile, nrows=int(number_of_lines * 0.01))
You can also use the chunksize option if you want to read the data chunk by chunk, as you mentioned:
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)  # replace with whatever per-chunk processing you need
Take a look at Iterating through files chunk by chunk.
It contains an elegant description of how to read a CSV file in chunks.
The basic idea is to pass the chunksize parameter (number of rows per chunk).
Then, in a loop, you can read this file chunk by chunk.
This should do what you want.
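A rough sketch of that applied to your 1% case, keeping every 100th row while reading chunk by chunk (the file name is taken from your example):
import pandas as pd

nth_row = 100  # keep roughly 1% of the rows
pieces = []
for chunk in pd.read_csv('super_size_file.log', chunksize=10 ** 6):
    # the default index keeps counting across chunks, so this keeps every 100th row of the whole file
    pieces.append(chunk[chunk.index % nth_row == 0])
df = pd.concat(pieces, ignore_index=True)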
# Select All From CSV File Where
import csv
# Asks for search criteria from user
search_parts = input("Enter search criteria:\n").split(",")
# Opens csv data file
file = csv.reader(open("C:\\your_path\\test.csv"))
# Go over each row and print it if it contains user input.
for row in file:
    if all([x in row for x in search_parts]):
        print(row)
# If you only want to read rows 1,000,000 ... 1,999,999
read_csv(..., skiprows=1000000, nrows=999999)
I have a very large CSV file with millions of rows, and a list of the row numbers that I need, like:
rownumberList = [1,2,5,6,8,9,20,22]
I know there is something called skiprows that helps skip several rows when reading a CSV file, like this:
df = pd.read_csv('myfile.csv',skiprows = skiplist)
#skiplist would contain the total row list deducts rownumberList
However, since the CSV file is very large, directly selecting the rows that I need could be more efficient. So I was wondering: are there any methods to select rows while using read_csv, rather than selecting rows from the dataframe afterwards? I am trying to minimize the time spent reading the file. Thanks.
There is a parameter called nrows : int, default None
Number of rows of file to read. Useful for reading pieces of large files (Docs)
pd.read_csv(file_name,nrows=int)
In case you need a part from the middle, use both skiprows and nrows in read_csv: skiprows indicates how many rows to skip at the beginning, and nrows indicates how many rows to read after that.
Example:
pd.read_csv('../input/sample_submission.csv',skiprows=5,nrows=10)
This will select the data from the 6th row through the 16th row.
Edit based on comment:
Since you have a list, this might help, i.e.
li = [1,2,3,5,9]
r = [i for i in range(max(li)) if i not in li]
df = pd.read_csv('../input/sample_submission.csv', skiprows=r, nrows=max(li))
# This will skip the rows you don't want and also limit the number of rows to the maximum of the list.
import pandas as pd
rownumberList = [1,2,5,6,8,9,20,22]
# note: row 0 (the header) is also skipped unless it is in rownumberList
df = pd.read_csv('myfile.csv', skiprows=lambda x: x not in rownumberList)
As of pandas 0.25.1, you can pass a callable function to skiprows in read_csv.
I am not sure about read_csv() from pandas (there is, though, a way to use an iterator for reading a large file in chunks), but you can read the file line by line (lazy loading, not reading the whole file into memory) with csv.reader (or csv.DictReader), keeping only the desired rows with the help of enumerate():
import csv
import pandas as pd
DESIRED_ROWS = {1, 17, 28}
with open("input.csv") as input_file:
reader = csv.reader(input_file)
desired_rows = [row for row_number, row in enumerate(reader)
if row_number in DESIRED_ROWS]
df = pd.DataFrame(desired_rows)
(Assuming you would like to pick random/discontinuous rows and not a "continuous chunk" from somewhere in the middle - in that case #James's idea to have "start" and "stop" would generally work better.)
import pandas as pd
df = pd.read_csv('Data.csv')
df.iloc[3:6]
Returns rows 3 through 5 and all columns.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html
From the documentation you can see that skiprows can take an integer or a list as values to remove some lines.
So basically you can tell it to remove all but the ones you want. For this you first need to know the number of lines in the file (best if you know it beforehand), by opening it and counting as follows:
with open('myfile.csv') as f:
    row_count = sum(1 for row in f)
Now you need to create the complementary list (sets are used here, but they work just as well). First you create the one from 1 to the number of rows, and then subtract the numbers of the rows you want to read.
skiplist = set(range(1, row_count+1)) - set(rownumberList)
Finally you can read the csv as normal.
df = pd.read_csv('myfile.csv',skiprows = skiplist)
here is the full code:
import pandas as pd
with open('myfile.csv') as f:
    row_count = sum(1 for row in f)
rownumberList = [1,2,5,6,8,9,20,22]
skiplist = set(range(1, row_count+1)) - set(rownumberList)
df = pd.read_csv('myfile.csv', skiprows=skiplist)
You could try this:
import pandas as pd
# making a data frame from a csv file
data = pd.read_csv("your_csv_file.csv", index_col="What_you_want")
# retrieving multiple rows by the iloc method
rows = data.iloc[[1, 2, 5, 6, 8, 9, 20, 22]]
You will not be able to circumvent the read time when accessing a large file. If you have a very large CSV file, any program will need to read through it at least up to the point where you want to begin extracting rows. Really, that is what databases are designed for.
However, if you want to extract rows 300,000 to 300,123 from a 10,000,000 row CSV file, you are better off reading just the data you need into Python before converting it to a data frame in Pandas. For this you can use the csv module.
import csv
import pandas as pd
start = 300000
stop = start + 123
data = []
with open('/very/large.csv', 'r') as fp:
    reader = csv.reader(fp)
    for i, line in enumerate(reader):
        if i > stop:
            break
        if i >= start:
            data.append(line)
df = pd.DataFrame(data)
for i in range(1, 20)
The first parameter is the first row, and the second parameter is one past the last row (range's stop value is exclusive)...
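If the intent was to skip a block of rows while reading, skiprows also accepts any list-like of row indices, so a range works; a small sketch (file name taken from the question):
import pandas as pd

# skips rows 1-19 and keeps row 0, which is the header
df = pd.read_csv('myfile.csv', skiprows=range(1, 20))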
I have a large CSV file, about 600 MB with 11 million rows, and I want to create statistical data like pivots, histograms, graphs etc. Obviously just trying to read it normally:
df = pd.read_csv('Check400_900.csv', sep='\t')
Doesn't work, so I found iterator and chunksize in a similar post and used:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
All good, I can for example print df.get_chunk(5) and search the whole file with just:
for chunk in df:
    print(chunk)
My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk.
plt.plot()
print(df.head())
print(df.describe())
print(df.dtypes)
customer_group3 = df.groupby('UserID')
y3 = customer_group3.size()
Solution, if you need to create one big DataFrame, i.e. if you need to process all the data at once (possible, but not recommended):
Use concat on all the chunks to build df, because the output of:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
isn't a DataFrame, but a pandas.io.parsers.TextFileReader (source).
tp = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(tp)
# <pandas.io.parsers.TextFileReader object at 0x00000000150E0048>
df = pd.concat(tp, ignore_index=True)
I think it is necessary to pass the ignore_index parameter to concat, to avoid duplicate indexes.
EDIT:
But if you want to work with large data, e.g. aggregating, it is much better to use dask, because it provides advanced parallelism.
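A minimal dask sketch of the same groupby (file and column name taken from the question; not tested against your data):
import dask.dataframe as dd

ddf = dd.read_csv('Check1_900.csv', sep='\t')
# the groupby is built lazily and only executed when .compute() is called
customer_sizes = ddf.groupby('UserID').size().compute()
print(customer_sizes)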
You do not need concat here. It's exactly like writing sum(map(list, grouper(tup, 1000))) instead of list(tup). The only thing iterator and chunksize=1000 does is to give you a reader object that iterates 1000-row DataFrames instead of reading the whole thing. If you want the whole thing at once, just don't use those parameters.
But if reading the whole file into memory at once is too expensive (e.g., takes so much memory that you get a MemoryError, or slow your system to a crawl by throwing it into swap hell), that's exactly what chunksize is for.
The problem is that you named the resulting iterator df, and then tried to use it as a DataFrame. It's not a DataFrame; it's an iterator that gives you 1000-row DataFrames one by one.
When you say this:
My problem is I don't know how to use stuff like these below for the whole df and not for just one chunk
The answer is that you can't. If you can't load the whole thing into one giant DataFrame, you can't use one giant DataFrame. You have to rewrite your code around chunks.
Instead of this:
df = pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000)
print(df.dtypes)
customer_group3 = df.groupby('UserID')
… you have to do things like this:
for df in pd.read_csv('Check1_900.csv', sep='\t', iterator=True, chunksize=1000):
    print(df.dtypes)
    customer_group3 = df.groupby('UserID')
Often, what you need to do is aggregate some data: reduce each chunk down to something much smaller with only the parts you need. For example, if you want to sum the entire file by groups, you can groupby each chunk, then sum the chunk by groups, and store a series/array/list/dict of running totals for each group.
Of course it's slightly more complicated than just summing a giant series all at once, but there's no way around that. (Except to buy more RAM and/or switch to 64 bits.) That's how iterator and chunksize solve the problem: by allowing you to make this tradeoff when you need to.
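For example, a sketch of the group-size aggregation from the question done chunk by chunk, keeping only a running total per group (file and column names taken from the question):
import pandas as pd

totals = pd.Series(dtype='int64')  # running row count per UserID
for chunk in pd.read_csv('Check1_900.csv', sep='\t', chunksize=1000):
    chunk_sizes = chunk.groupby('UserID').size()
    # add this chunk's counts to the running totals; groups missing on either side count as 0
    totals = totals.add(chunk_sizes, fill_value=0).astype('int64')
print(totals.sort_values(ascending=False).head())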
You need to concatenate the chunks. For example:
df2 = pd.concat([chunk for chunk in df])
And then run your commands on df2
This might not answer the question directly, but when you have to load a big dataset it is good practice to convert the dtypes of your columns while reading the dataset. Also, if you know which columns you need, use the usecols argument to load only those.
df = pd.read_csv("data.csv",
usecols=['A', 'B', 'C', 'Date'],
dtype={'A':'uint32',
'B':'uint8',
'C':'uint8'
},
parse_dates=['Date'], # convert to datetime64
sep='\t'
)