I have a large csv file with more than 1 million rows. Each row has two features, callsite (the location of an API invocation) and a sequence of tokens to the callsite. They are written as:
callsite 1, token 1, token 2, token 3, ...
callsite 1, token 3, token 4, token 4, token 6, ...
callsite 2, token 3, token 1, token 6, token 7, ...
I want to shuffle the rows and split them into two files (for training and testing). The problem is that I want to split than according to the callsites instead of the rows. There may be more than one row belonging to one callsite. So I first read all the callsites, shuffle and split them as follows:
import csv
import random
with open(file,'r') as csv_file:
reader = csv.reader(csv_file)
callsites = [row[0] for row in reader]
random.shuffle(callsites)
test_callsites = callsites[0:n_test] //n_test is the number of test cases
Then, I read each row from the csv file and compare the callsite to put it in the train.csv or test.csv as follows:
with open(file,'r') as csv_file, open('train.csv','w') as train_file, open('test.csv','w') as test_file:
reader = csv.reader(csv_file)
train_writer = csv.writer(train_file)
test_writer = csv.writer(test_file)
for row in reader:
if row[0] in test_callsites:
test_writer.writerow(row)
else:
train_writer.writerow(row)
The problem is that the code works too slow, more than one day to finish. The comparison for each row causes the complexity O(n^2). And the read and write row by row may also be not efficient. But I am afraid that loading all data in the memory would cause memory error. Is there any better way to deal with large files like that?
Would it be quicker if I use dataframe to read and write it? But the sequence length is varied each row. I tried to write the data as (put all tokens as a list in one column):
callsite, sequence
callsite 1, [token1||token2||token 3]
However, it seems not convenient to restore the [token 1||token 2||token 3] as sequences.
Is there any better practice to store and restore that kind of data with variable length?
The simplest fix is to change:
test_callsites = callsites[0:n_test]
to
test_callsites = frozenset(callsites[:n_test]) # set also works; frozenset just reduces chance of mistakenly modifying it
This would reduce the work for each test of if row[0] in test_callsites: from O(n_test) to O(1), which would likely make a huge improvement if n_test is on the order of four digits or more (likely, when we're talking about millions of rows).
You could also slightly reduce the work (mostly in terms of improving memory locality by having a smaller bin of things being selected) in creating it in the first place by changing:
random.shuffle(callsites)
test_callsites = callsites[0:n_test]
to:
test_callsites = frozenset(random.sample(callsites, n_test))
which avoids reshuffling the whole of callsites in favor of selecting n_test values from it (which you then convert to a frozenset, or just set, for cheap lookup). Bonus, it's a one-liner. :-)
Side-note: Your code is potentially wrong as written. You must pass newline='' to your various calls to open to ensure that the chosen CSV dialect's newline preferences are honored.
What about something like this?
import csv
import random
random.seed(42) # need this to get reproducible splits
with open("input.csv", "r") as input_file, open("train.csv", "w") as train_file, open(
"test.csv", "w"
) as test_file:
reader = csv.reader(input_file)
train_writer = csv.writer(train_file)
test_writer = csv.writer(test_file)
test_callsites = set()
train_callsites = set()
for row in reader:
callsite = row[0]
if callsite in test_callsites:
test_writer.writerow(row)
elif callsite in train_callsites:
train_writer.writerow(row)
elif random.random() <= 0.2: # put here the train/test split you need
test_writer.writerow(row)
test_callsites.add(callsite)
else:
train_writer.writerow(row)
train_callsites.add(callsite)
In this way you'll need a single pass over the file. Drawback is that you'll get a split which is approximately 20%.
Tested on 1Mx100 rows (~850mb) and seems reasonably usable.
Related
I have file which has 4 columns with, separated values. I need only first column only so I have read file then split that line with, separated and store it in one list variable called first_file_list.
I have another file which has 6 columns with, separated values. My requirement is read first column of first row of file and check that string is exist in list called first_file_list. If that is exist then copy that line to new file.
My first file has approx. 6 million records and second file has approx. 4.5 million records. Just to check the performance of my code instead of 4.5 million I have put only 100k records in second file and to process the 100k record code takes approx. 2.5 hours.
Following is my logic for this:
first_file_list = []
with open("c:\first_file.csv") as first_f:
next(first_f) # Ignoring first row as it is header and I don't need that
temp = first_f.readlines()
for x in temp:
first_file_list.append(x.split(',')[0])
first_f.close()
with open("c:\second_file.csv") as second_f:
next(second_f)
second_file_co = second_f.readlines()
second_f.close()
out_file = open("c:\output_file.csv", "a")
for x in second_file_co:
if x.split(',')[0] in first_file_list:
out_file.write(x)
out_file.close()
Can you please help me to get to know that what I am doing wrong here so that my code take this much time to compare 100k records? or can you suggest better way to do this in Python.
Use a set for fast membership checking.
Also, there's no need to copy the contents of the entire file to memory. You can just iterate over the remaining contents of the file.
first_entries = set()
with open("c:\first_file.csv") as first_f:
next(first_f)
for line in first_f:
first_entries.add(line.split(',')[0])
with open("c:\second_file.csv") as second_f:
with open("c:\output_file.csv", "a") as out_file:
next(second_f)
for line in second_f:
if line.split(',')[0] in first_entries:
out_file.write(line)
Additionally, I noticed you called .close() on file objects that were opened with the with statement. Using with (context managers) means all the clean up is done after you exit its context. So it handles the .close() for you.
work with sets - see below
first_file_values = set()
second_file_values = set()
with open("c:\first_file.csv") as first_f:
next(first_f)
temp = first_f.readlines()
for x in temp:
first_file_values.add(x.split(',')[0])
with open("c:\second_file.csv") as second_f:
next(second_f)
second_file_co = second_f.readlines()
for x in second_file_co:
second_file_values.add(x.split(',')[0])
with open("c:\output_file.csv", "a") as out_file:
for x in second_file_values:
if x in first_file_values:
out_file.write(x)
I am trying to use accordingly to this question and answer reading a large csv file by chunks and processing it. Since I'm not native with python I got an optimization problem and looking for a better solution here.
What my code does:
I read in the line count of my csv with
with open(file) as f:
row_count = sum(1 for line in f)
afterwards I "slice" my data in 30 equal sized chunks and process it accordingly to the linked answer with a for loop and pd.read_csv(file, chunksize). Since plotting 30 graphs in one is pretty unclear, I plot it every 5 steps with modulo (which may be variated). For this I use an external counter.
chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
df = chunk
print(counter)
if ((counter % 5) == 0 | (counter == 0):
plt.plot(df["Variable"])
counter = counter +1
plt.show()
Now to my question:
It seems like, this loop reads the chunk size in before processing the loop, which is reasonable. I can see this, since the print(counter) steps are also fairly slow. Since I read a few million rows of a csv, it takes some time every step. Is there a way to skip the not wanted chunks in the for loop, before reading it in? I was trying out something like:
wanted_plts <- [1,5,10,15,20,25,30]
for i in wanted_plts:
for chunk[i] in pd.read_csv(file, chunksize=chunksize):
.
.
I think I have understanding issues how I can manipulate this syntax of the for loop range. There should be an elegant way to fix this.
Also: i found the .get_chunk(x) by pandas but this seems to create just one chunk of size x.
Another attempt by me is trying to subset the reader object of pd.read_csv like pd.read_csv()[0,1,2] but it seems that's not possible too.
Amendment: I'm aware plotting a lot of data in matplotlib is slow. I preprocess it earlier, but for making this code readable I removed all unnecessary parts.
You are wasting a lot of resources when parsing CSV into DataFrame without using it. To avoid this you can create line index during the first pass:
fp = open(file_name)
row_count = 0
pos = {0: 0}
line = fp.readline()
while line:
row_count += 1
pos[row_count] = fp.tell()
line = fp.readline()
Do not dispose the file handle yet! Because read_csv() accepts streams, you can move your file pointer as you want:
chunksize = row_count // 30
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts:
fp.seek(pos[i*chunksize]) # this will bring you to the first line of the desired chunk
obj = pd.read_csv(fp, chunksize=chunksize) # read your chunk lazily
df = obj.get_chunk() # convert to DataFrame object
plt.plot(df["Variable"]) # do something
fp.close() # Don't forget to close the file when finished.
And finally a warning: when reading CSV this way you will lose column names. So make an adjustment:
obj = pd.read_csv(fp, chunksize=chunksize, names=[!!<column names you have>!!])
P.S. file is a reserved word, avoid using it to prevent undesired side effects. You can use file_ or file_name instead.
I've toyed with your setup, trying to find a way to skip chunks, using another rendering library like pyqtgraph or using matplotlib.pyplot subroutines instead of plot(), all to no avail.
So the only fair advice I can give you is to limit the scope of read_csv to only the data you're interested in by passing the usecols parameter.
Instead of:
for chunk in pd.read_csv(file, chunksize=chunksize):
plt.plot(chunk['Variable'])
Use:
for chunk in pd.read_csv(file, usecols=['Variable'], chunksize=chunksize):
plt.plot(chunk)
And, if you haven't already, definitely limit the number of iterations by going for the biggest chunksize you possibly can (so in your case the lowest row_count divider).
I haven't quantified their respective weight but you will gain on both the csv_read() and the plot() method overheads, even ever so slightly due to the fact that your current chunks are already quite big.
With my test data, quadrupling the chunksize cuts down processing time in half:
chunksize=1000 => executed in 12.7s
chunksize=2000 => executed in 9.06s
chunksize=3000 => executed in 7.68s
chunksize=4000 => executed in 6.94s
And specifying usecols at read time also cuts down processing time in half again:
chunksize=1000 + usecols=['Variable'] => executed in 8.33s
chunksize=2000 + usecols=['Variable'] => executed in 5.27s
chunksize=3000 + usecols=['Variable'] => executed in 4.39s
chunksize=4000 + usecols=['Variable'] => executed in 3.54s
As far as I know, pandas does not provide any support for skipping chunks of file. At least I never found anything about it in the documentation.
In general, skipping lines from file (not reading them at all) is difficult unless you know in advance how many lines you want to skip and how many characters you have in each of those lines. In this case you can try to play with IO and seek to move the stream position to the exact place you need the next iteration.
But it does not seem your case.
I think the best thing you can do to improve efficiency is to read the lines using standard IO, and convert to a dataframe only the lines you need / want to plot.
Consider for example the following custom iterator.
When instantiated, it saves the header (first line). Each iteration it reads a chunk of lines from the file and then skip the following n*chunksize lines. It returns the header line followed by the read lines, wrapped in a io.StringIO object (so it's a stream and can be fed directly to pandas.read_csv).
import io
from itertools import islice
class DfReaderChunks:
def __init__(self, filename, chunksize, n):
self.fo = open(filename)
self.chs = chunksize
self.skiplines = self.chs * n
self.header = next(self.fo)
def getchunk(self):
ll = list(islice(self.fo, self.chs))
if len(ll) == 0:
raise StopIteration
dd = list(islice(self.fo, self.skiplines))
return self.header + ''.join(ll)
def __iter__(self):
return self
def __next__(self):
return io.StringIO(self.getchunk())
def close(self):
self.fo.close()
def __del__(self):
self.fo.close()
Using this class, your can read from your file:
reader = DfReaderChunks(file, chunksize, 4)
for dfst in reader:
df = pd.read_csv(dfst)
print(df) #here I print to stdout, you can plot
reader.close()
which is "equivalent" to your setup:
for chunk in pd.read_csv(file, chunksize=chunksize):
df = chunk
if (counter % 5 == 0):
print(df) #again I print, you can plot
counter += 1
I tested the time used by both the above snippets using a dataframe of 39 Mb (100000 rows or random numbers).
On my machine, the former takes 0.458 seconds, the latter 0.821 seconds.
The only drawback is that the former snippet loses track of the row index (it's a new dataframe each time, so index always start from 0) but the printed chunks are the same.
I want to read only a certain amount of rows starting from a certain row in a csv file without iterating over the whole csv file to reach this certain point.
Lets say i have a csv file with 100 rows and i want to read only row 50 to 60. I dont want to iterate from row 1 to 49 to reach row 50 to start reading. Can i somehow achieve this with seek()?
For example:
Seek to row 50
read from 50 to 60
next time:
seek to row 27
read 27 to 34
and so on
So not only seeking continuesly forward through the file but also backwards.
Thank you a lot
An option would be to use Pandas. For example:
import pandas as pd
# Select file
infile = r'path/file'
# Use skiprows to choose starting point and nrows to choose number of rows
data = pd.read_csv(infile, skiprows = 50, nrows=10)
You can use chunksize
import pandas as pd
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
process(chunk)
If the # of columns/line lengths are variable, it isn't possible to find the line you want without "reading" (ie, processing) every character of the file that comes before that, and counting the line terminators. And the fastest way to process them in python, is to use iteration.
As to the fastest way to do that with a large file, I do not know whether it is faster to iterate by line this way:
with open(file_name) as f:
for line,_ in zip(f, range(50)):
pass
lines = [line for line,_ in zip(f, range(10))]
...or to read a character at a time using seek, and count new line characters. But it is certainly MUCH more convenient to do the first.
However if the file gets read a lot, iterating over the lines will be slow over time. If the file contents do not change, you could instead accomplish this by reading the whole thing once and building a dict of the line lengths ahead of time:
from itertools import accumulate
with open(file_name) as f:
cum_lens = dict(enumerate(accumulate(len(line) for line in f), 1))
This would allow you to seek to any line number in the file without processing the whole thing ever again:
def seek_line(path, line_num, cum_lens):
with open(path) as f:
f.seek(cum_lens[line_num], 0)
return f.readline()
class LineX:
"""A file reading object that can quickly obtain any line number."""
def __init__(self, path, cum_lens):
self.cum_lens = cum_lens
self.path = path
def __getitem__(self, i):
return seek_line(self.path, i, self.cum_lens)
linex = LineX(file_name, cum_lens)
line50 = linex[50]
But at this point, you might be better off loading the file contents into some kind of database. It depends on what you're trying to do, and what kind of data the file contains.
As others are saying the most obvious solution is to use pandas read csv !
The method has a parameter called skiprows:
from the doc there is what is said :
skiprows : list-like, int or callable, optional
Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
You can have something like this :
import pandas as pd
data = pd.read_csv('path/to/your/file', skiprows =lambda x: x not in range(50, 60))
Since you specify that the memory is your problem you can use the chunksize parameter as said in this tutorial
he said :
The parameter essentially means the number of rows to be read into a
dataframe at any single time in order to fit into the local memory.
Since the data consists of more than 70 millions of rows, I specified
the chunksize as 1 million rows each time that broke the large data
set into many smaller pieces.
df_chunk = pd.read_csv(r'../input/data.csv', chunksize=1000000)
You can try this and iterate over the chunk to retrieve only the rows you are looking for.
The function should return true if the row number is in the specified list
its that easy:
with open("file.csv", "r") as file:
print(file.readlines()[50:60])
I'm reading in a large csv file using chuncksize (pandas DataFrame), like so
reader = pd.read_csv('log_file.csv', low_memory = False, chunksize = 4e7)
I know I could just calculate the number of chunks with which it reads in the file but I would like to do it automatically and save the number of chunks into a variable, like so (in pseudo code)
number_of_chuncks = countChuncks(reader)
Any ideas?
You can use a generator expression to iterate through reader (a TextFileReader returned by read_csv when we define chunksize) and sum 1 for each iteration:
number_of_chunks = sum(1 for chunk in reader)
Alternatively, you can use a generator expression to count the number of lines in your file (similar logic of the first option, but iterating through the lines of the file), than divide this number by the chunksize and round up the result (with math.ceil)
import math
number_of_rows = sum(1 for row in open('log_file.csv', 'r'))
number_of_chunks = math.ceil(number_of_rows/chunksize)
or
import math
number_of_chunks = math.ceil(sum(1 for row in open('log_file.csv', 'r'))/chunksize)
In my tests, the second solution showed a better performance than the first.
I have an interesting problem.
I have a very large (larger than 300MB, more than 10,000,000 lines/rows in the file) CSV file with time series data points inside. Every month I get a new CSV file that is almost the same as the previous file, except for a few new lines have been added and/or removed and perhaps a couple of lines have been modified.
I want to use Python to compare the 2 files and identify which lines have been added, removed and modified.
The issue is that the file is very large, so I need a solution that can handle the large file size and execute efficiently within a reasonable time, the faster the better.
Example of what a file and its new file might look like:
Old file
A,2008-01-01,23
A,2008-02-01,45
B,2008-01-01,56
B,2008-02-01,60
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,9
etc...
New file
A,2008-01-01,23
A,2008-02-01,45
A,2008-03-01,67 (added)
B,2008-01-01,56
B,2008-03-01,33 (removed and added)
C,2008-01-01,3
C,2008-02-01,7
C,2008-03-01,22 (modified)
etc...
Basically the 2 files can be seen as matrices that need to be compared, and I have begun thinking of using PyTable. Any ideas on how to solve this problem would be greatly appreciated.
Like this.
Step 1. Sort.
Step 2. Read each file, doing line-by-line comparison. Write differences to another file.
You can easily write this yourself. Or you can use difflib. http://docs.python.org/library/difflib.html
Note that the general solution is quite slow as it searches for matching lines near a difference. Writing your own solution can run faster because you know things about how the files are supposed to match. You can optimize that "resynch-after-a-diff" algorithm.
And 10,000,000 lines hardly matters. It's not that big. Two 300Mb files easily fit into memory.
This is a little bit of a naive implementation but will deal with unsorted data:
import csv
file1_dict = {}
file2_dict = {}
with open('file1.csv') as handle:
for row in csv.reader(handle):
file1_dict[tuple(row[:2])] = row[2:]
with open('file2.csv') as handle:
for row in csv.reader(handle):
file2_dict[tuple(row[:2])] = row[2:]
with open('outfile.csv', 'w') as handle:
writer = csv.writer(handle)
for key, val in file1_dict.iteritems():
if key in file2_dict:
#deal with keys that are in both
if file2_dict[key] == val:
writer.writerow(key+val+('Same',))
else:
writer.writerow(key+file2_dict[key]+('Modified',))
file2_dict.pop(key)
else:
writer.writerow(key+val+('Removed',))
#deal with added keys!
for key, val in file2_dict.iteritems():
writer.writerow(key+val+('Added',))
You probably won't be able to "drop in" this solution but it should get you ~95% of the way there. #S.Lott is right, 2 300mb files will easily fit in memory ... if your files get into the 1-2gb range then this may have to be modified with the assumption of sorted data.
Something like this is close ... although you may have to change the comparisons around for the added a modified to make sense:
#assumming both files are sorted by columns 1 and 2
import datetime
from itertools import imap
def str2date(in):
return datetime.date(*map(int,in.split('-')))
def convert_tups(row):
key = (row[0], str2date(row[1]))
val = tuple(row[2:])
return key, val
with open('file1.csv') as handle1:
with open('file2.csv') as handle2:
with open('outfile.csv', 'w') as outhandle:
writer = csv.writer(outhandle)
gen1 = imap(convert_tups, csv.reader(handle1))
gen2 = imap(convert_tups, csv.reader(handle2))
gen2key, gen2val = gen2.next()
for gen1key, gen1val in gen1:
if gen1key == gen2key and gen1val == gen2val:
writer.writerow(gen1key+gen1val+('Same',))
gen2key, gen2val = gen2.next()
elif gen1key == gen2key and gen1val != gen2val:
writer.writerow(gen2key+gen2val+('Modified',))
gen2key, gen2val = gen2.next()
elif gen1key > gen2key:
while gen1key>gen2key:
writer.writerow(gen2key+gen2val+('Added',))
gen2key, gen2val = gen2.next()
else:
writer.writerow(gen1key+gen1val+('Removed',))