Python - Reducing Import and Parse Time for Large CSV Files

My first post:
Before beginning, I should note I am relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows and larger likely to come). The two limitations that I've run into repeatedly have been runtime and memory (32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in reasonable time. How can I speed up this process? I am splitting the file as I import and performing interim summaries due to memory limitations and using pandas for the summarization:
Parsing and Summarization:
import pandas as pd

def ParseInts(inString):
    try:
        return int(inString)
    except (ValueError, TypeError):
        return None

def TextToYearMo(inString):
    # Convert a 'YYYY-MM-...' string to the integer YYYYMM.
    try:
        return 100*int(inString[0:4]) + int(inString[5:7])
    except ValueError:
        return 100*int(inString[0:4]) + int(inString[5:6])

def ParseAllElements(elmValue, elmPos):
    if elmPos in [0, 2, 5]:
        return elmValue
    elif elmPos == 3:
        return TextToYearMo(elmValue)
    else:
        if elmPos == 18:
            return ParseInts(elmValue.strip('\n'))
        else:
            return ParseInts(elmValue)

def MakeAndSumList(inList):
    df = pd.DataFrame(inList, columns=['x1','x2','x3','x4','x5',
                                       'x6','x7','x8','x9','x10',
                                       'x11','x12','x13','x14'])
    return df[['x1','x2','x3','x4','x5',
               'x6','x7','x8','x9','x10',
               'x11','x12','x13','x14']].groupby(
        ['x1','x2','x3','x4','x5']).sum().reset_index()
Function Calls:
import math
import pandas as pd
# pse is the module containing the parsing/summarization functions shown above

def ParsedSummary(longString, delimtr, rowNum):
    keepColumns = [0,3,2,5,10,9,11,12,13,14,15,16,17,18]
    # Do some other stuff that takes very little time
    return [pse.ParseAllElements(longString.split(delimtr)[i], i) for i in keepColumns]

def CSVToList(fileName, delimtr=','):
    with open(fileName) as f:
        enumFile = enumerate(f)
        listEnumFile = set(enumFile)   # keeps every (line number, line) pair in memory
    lineCount = len(listEnumFile) - 1  # highest zero-based line number
    maxSplit = math.floor(lineCount / 10) + 1
    counter = 0
    Summary = pd.DataFrame({}, columns=['x1','x2','x3','x4','x5',
                                        'x6','x7','x8','x9','x10',
                                        'x11','x12','x13','x14'])
    for counter in range(0, 10):
        startRow = int(counter * maxSplit)
        endRow = int((counter + 1) * maxSplit)
        includedRows = set(range(startRow, endRow))
        listOfRows = [ParsedSummary(row, delimtr, rownum)
                      for rownum, row in listEnumFile if rownum in includedRows]
        Summary = pd.concat([Summary, pse.MakeAndSumList(listOfRows)])
        listOfRows = []
        counter += 1
    return Summary
(Again, this is my first question - so I apologize if I simplified too much or, more likely, too little, but I am at a loss as to how to expedite this.)
For runtime comparison:
Using Access I can import, parse, summarize, and merge several files in this size range in <5 minutes (though I am right at its 2 GB limit). I'd hope I can get comparable results in Python - presently I'm estimating ~30 min run time for one file. Note: I threw something together in Access' miserable environment only because I didn't have admin rights readily available to install anything else.
Edit: Updated parsing code. Was able to shave off five minutes (est. runtime at 25m) by changing some conditional logic to try/except. Also - runtime estimate doesn't include pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.

If you want to optimize performance, don't roll your own CSV reader in Python. There is already a standard csv module. Perhaps pandas or numpy have faster csv readers; I'm not sure.
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow and NumPy's fromfile and load impressively fast.
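For instance, a minimal sketch of the question's chunked read-and-summarize workflow with pandas doing the parsing (the file name, chunk size, and generic x1..x14 column names are placeholders, not the asker's real layout):

import pandas as pd

CSV_PATH = "big_file.csv"   # placeholder path
KEEP = [0, 2, 3, 5, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
NAMES = ['x%d' % i for i in range(1, 15)]

partials = []
for chunk in pd.read_csv(CSV_PATH, header=None, usecols=KEEP, chunksize=500000):
    chunk.columns = NAMES
    # Summarize each chunk right away so only small frames stay in memory.
    partials.append(chunk.groupby(['x1', 'x2', 'x3', 'x4', 'x5'], as_index=False).sum())

# Combine the per-chunk summaries and aggregate once more to merge duplicate groups.
summary = (pd.concat(partials, ignore_index=True)
             .groupby(['x1', 'x2', 'x3', 'x4', 'x5'], as_index=False)
             .sum())
print(summary.head())

This assumes the kept columns other than the group keys are numeric; something like the question's TextToYearMo conversion could be plugged in via read_csv's converters argument.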

Related

I need to extract data from a text file that contains >8 million records. What is the best-suited language to do this using multithreading?

Currently I am using the multiprocessing feature of Python. Though this works fine for text files with up to 2 million records, it fails for the file with 8 million records with "Can't access lock."
Moreover, it takes about 30 minutes to process the file with 2 million records and fails after about an hour or so for the big file.
I am doing this:
def try_multiple_operations(item):
    aab_type = item[15:17]
    aab_amount = item[35:46]
    aab_name = item[82:100]
    aab_reference = item[64:82]
    if aab_type not in ('99', 'Z5'):
        aab_record = f'{aab_name} {aab_amount} {aab_reference}'
    else:
        aab_record = 'ignore'
    return aab_record
Calling try_multiple_operations in __main__:
if __name__ == '__main__':
    # some other code
    executor = concurrent.futures.ProcessPoolExecutor(10)
    futures = [executor.submit(try_multiple_operations, item) for item in aab]
    concurrent.futures.wait(futures)
    aab_list = [x.result() for x in futures]
    aab_list.sort()
    # some other code for further processing
I have used pandas/dataframes too, and I am able to do a bit of the processing using that. However, I want to retain the original format of the file after processing, which dataframes make a bit tricky, as they return data in either an ndarray or a CSV format.
I would like to understand if there is a faster way of doing this, maybe using some other programming language.
As advised by @wwii, I updated the code to get rid of the multiprocessing. This made the code a lot faster.
def try_multiple_operations(items):
    data_aab = []
    value = ['Z5', 'RA', 'Z4', '99', 99]
    for item in items:
        aab_type = item[15:17]
        aab_amount = item[35:46]
        aab_name = item[82:100]
        aab_reference = item[64:82]
        if aab_type not in value:
            # aab_record = f'{aab_name} {aab_amount} {aab_reference}'
            data_aab.append(f'{aab_name} {aab_amount} {aab_reference}')
    return data_aab
Called this as:
if __name__ == '__main__':
    # some code
    aab_list = try_multiple_operations(aab)
    # some extra code
This, while a bit surprising to me, is a lot faster than multiprocessing.
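That result is less surprising than it looks: with executor.submit on one record at a time, each tiny slicing task pays the cost of pickling the string over to a worker process and pickling the result back, which dwarfs the work itself. If parallelism is ever needed again, the usual fix is to submit a few large batches instead of millions of single records. A hedged sketch of that idea (the process_batch helper and the batch size are illustrative, not part of the original code):

import concurrent.futures

def process_batch(items):
    # Same per-record slicing as try_multiple_operations, applied to a whole batch.
    out = []
    for item in items:
        if item[15:17] not in ('99', 'Z5'):
            out.append(f'{item[82:100]} {item[35:46]} {item[64:82]}')
    return out

def parallel_batches(records, n_workers=4, batch_size=100000):
    # A few large batches keep the pickling/IPC overhead per task negligible.
    batches = [records[i:i + batch_size] for i in range(0, len(records), batch_size)]
    results = []
    with concurrent.futures.ProcessPoolExecutor(n_workers) as executor:
        for part in executor.map(process_batch, batches):
            results.extend(part)
    return results

Whether this beats the plain single-process loop still depends on how much work each record needs; for slicing alone, the sequential version above is likely to stay ahead.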

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one for every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. It is, unfortunately, a lot of overhead / duplicate data, so I want to parse out the relevant stuff to do visualizations on it.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it performs really slowly. It has been running for over 1.5 hours now (on my 2019 MacBook Pro 15") and isn't finished yet.
How can I optimize / improve this python parser?
Or should I reduce the number of files, and i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np

directory = '../data/tripu/27/'
datapoints = np.zeros((0, 3), int)
read_trips = set()

# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
        feed = gtfs_realtime_pb2.FeedMessage()
        feed.ParseFromString(response)
        print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
        for trip in feed.entity:
            if trip.trip_update.trip.trip_id not in read_trips:
                try:
                    if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                        print("\t", "Adding delays for", len(trip.trip_update.stop_time_update), "stops, on trip_id", trip.trip_update.trip.trip_id)
                        for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                            # Store the delay data point (arrival difference of two ascending nodes)
                            delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay - trip.trip_update.stop_time_update[i].arrival.delay)
                            # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                            ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                            key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                            # Append data to numpy array
                            datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                        read_trips.add(trip.trip_update.trip.trip_id)
                except KeyError:
                    continue
            else:
                continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iterating to two iterators zipping through the list of updates; this is a slightly more advanced style of working through a list. Plus the cur_update/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)
                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                read_trips.add(this_trip_id)
        except KeyError:
            continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

Reading specific chunks pandas / not reading all chunks in pandas

Following this question and answer, I am trying to read a large CSV file in chunks and process it. Since I'm not a native Python user, I have an optimization problem and am looking for a better solution here.
What my code does:
I read in the line count of my csv with
with open(file) as f:
    row_count = sum(1 for line in f)
afterwards I "slice" my data in 30 equal sized chunks and process it accordingly to the linked answer with a for loop and pd.read_csv(file, chunksize). Since plotting 30 graphs in one is pretty unclear, I plot it every 5 steps with modulo (which may be variated). For this I use an external counter.
chunksize = row_count // 30
counter = 0
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    print(counter)
    if (counter % 5) == 0 or counter == 0:
        plt.plot(df["Variable"])
    counter = counter + 1
plt.show()
Now to my question:
It seems like this loop reads the chunk in before processing the loop body, which is reasonable. I can see this, since the print(counter) steps are also fairly slow. Since I read a few million rows of a CSV, every step takes some time. Is there a way to skip the unwanted chunks in the for loop, before reading them in? I was trying out something like:
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts:
    for chunk[i] in pd.read_csv(file, chunksize=chunksize):
        .
        .
I think I have trouble understanding how to manipulate the syntax of the for loop's range. There should be an elegant way to fix this.
Also: I found pandas' .get_chunk(x), but this seems to create just one chunk of size x.
Another attempt was to subset the reader object of pd.read_csv like pd.read_csv()[0,1,2], but it seems that's not possible either.
Amendment: I'm aware plotting a lot of data in matplotlib is slow. I preprocess it earlier, but to keep this code readable I removed all unnecessary parts.
You are wasting a lot of resources when parsing CSV into a DataFrame without using it. To avoid this, you can create a line index during the first pass:
fp = open(file_name)
row_count = 0
pos = {0: 0}
line = fp.readline()
while line:
    row_count += 1
    pos[row_count] = fp.tell()
    line = fp.readline()
Do not dispose the file handle yet! Because read_csv() accepts streams, you can move your file pointer as you want:
chunksize = row_count // 30
wanted_plts = [1,5,10,15,20,25,30]
for i in wanted_plts:
    fp.seek(pos[i*chunksize])  # this will bring you to the first line of the desired chunk
    obj = pd.read_csv(fp, chunksize=chunksize)  # read your chunk lazily
    df = obj.get_chunk()  # convert to DataFrame object
    plt.plot(df["Variable"])  # do something

fp.close()  # Don't forget to close the file when finished.
And finally a warning: when reading CSV this way you will lose column names. So make an adjustment:
obj = pd.read_csv(fp, chunksize=chunksize, names=[!!<column names you have>!!])
P.S. file is a built-in name (in Python 2); avoid shadowing it to prevent undesired side effects. You can use file_ or file_name instead.
I've toyed with your setup, trying to find a way to skip chunks, using another rendering library like pyqtgraph or using matplotlib.pyplot subroutines instead of plot(), all to no avail.
So the only fair advice I can give you is to limit the scope of read_csv to only the data you're interested in by passing the usecols parameter.
Instead of:
for chunk in pd.read_csv(file, chunksize=chunksize):
    plt.plot(chunk['Variable'])
Use:
for chunk in pd.read_csv(file, usecols=['Variable'], chunksize=chunksize):
    plt.plot(chunk)
And, if you haven't already, definitely limit the number of iterations by going for the biggest chunksize you possibly can (so in your case the lowest row_count divider).
I haven't quantified their respective weight, but you will gain on both the read_csv() and the plot() method overheads, even if ever so slightly, due to the fact that your current chunks are already quite big.
With my test data, quadrupling the chunksize cuts down processing time in half:
chunksize=1000 => executed in 12.7s
chunksize=2000 => executed in 9.06s
chunksize=3000 => executed in 7.68s
chunksize=4000 => executed in 6.94s
And specifying usecols at read time also cuts down processing time in half again:
chunksize=1000 + usecols=['Variable'] => executed in 8.33s
chunksize=2000 + usecols=['Variable'] => executed in 5.27s
chunksize=3000 + usecols=['Variable'] => executed in 4.39s
chunksize=4000 + usecols=['Variable'] => executed in 3.54s
As far as I know, pandas does not provide any support for skipping chunks of file. At least I never found anything about it in the documentation.
In general, skipping lines from file (not reading them at all) is difficult unless you know in advance how many lines you want to skip and how many characters you have in each of those lines. In this case you can try to play with IO and seek to move the stream position to the exact place you need the next iteration.
But that does not seem to be your case.
I think the best thing you can do to improve efficiency is to read the lines using standard IO, and convert to a dataframe only the lines you need / want to plot.
Consider for example the following custom iterator.
When instantiated, it saves the header (first line). On each iteration it reads a chunk of lines from the file and then skips the following n*chunksize lines. It returns the header line followed by the read lines, wrapped in an io.StringIO object (so it's a stream and can be fed directly to pandas.read_csv).
import io
from itertools import islice

class DfReaderChunks:
    def __init__(self, filename, chunksize, n):
        self.fo = open(filename)
        self.chs = chunksize
        self.skiplines = self.chs * n
        self.header = next(self.fo)

    def getchunk(self):
        ll = list(islice(self.fo, self.chs))
        if len(ll) == 0:
            raise StopIteration
        dd = list(islice(self.fo, self.skiplines))
        return self.header + ''.join(ll)

    def __iter__(self):
        return self

    def __next__(self):
        return io.StringIO(self.getchunk())

    def close(self):
        self.fo.close()

    def __del__(self):
        self.fo.close()
Using this class, you can read from your file:
reader = DfReaderChunks(file, chunksize, 4)
for dfst in reader:
    df = pd.read_csv(dfst)
    print(df)  # here I print to stdout, you can plot
reader.close()
which is "equivalent" to your setup:
for chunk in pd.read_csv(file, chunksize=chunksize):
    df = chunk
    if (counter % 5 == 0):
        print(df)  # again I print, you can plot
    counter += 1
I tested the time used by both the above snippets using a dataframe of 39 MB (100000 rows of random numbers).
On my machine, the former takes 0.458 seconds, the latter 0.821 seconds.
The only drawback is that the former snippet loses track of the row index (it's a new dataframe each time, so the index always starts from 0), but the printed chunks are the same.

"pygame.error: Out of memory" when loading level with a large area [duplicate]

Is there a limit to memory for Python? I've been using a Python script to calculate the average values from a file which is a minimum of 150 MB in size.
Depending on the size of the file I sometimes encounter a MemoryError.
Can more memory be assigned to the python so I don't encounter the error?
EDIT: Code now below
NOTE: The file sizes can vary greatly (up to 20 GB); the minimum size of a file is 150 MB.
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for u in files:
    line = u.readlines()
    list_of_lines = []
    for i in line:
        values = i.split('\t')
        list_of_lines.append(values)

    count = 0
    for j in list_of_lines:
        count += 1

    for k in range(0, count):
        list_of_lines[k].remove('\n')

    length = len(list_of_lines[0])
    print_counter = 4

    for o in range(0, length):
        total = 0
        for p in range(0, count):
            number = float(list_of_lines[p][o])
            total = total + number
        average = total/count
        print average
        if print_counter == 4:
            file_write.write(str(average)+'\n')
            print_counter = 0
        print_counter += 1

file_write.write('\n')
(This is my third answer because I misunderstood what your code was doing in my original, and then made a small but crucial mistake in my second; hopefully three's a charm.)
Edits: Since this seems to be a popular answer, I've made a few modifications to improve its implementation over the years, most not too major. This is so that if folks use it as a template, it will provide an even better basis.
As others have pointed out, your MemoryError problem is most likely because you're attempting to read the entire contents of huge files into memory and then, on top of that, effectively doubling the amount of memory needed by creating a list of lists of the string values from each line.
Python's memory limits are determined by how much physical ram and virtual memory disk space your computer and operating system have available. Even if you don't use it all up and your program "works", using it may be impractical because it takes too long.
Anyway, the most obvious way to avoid that is to process each file a single line at a time, which means you have to do the processing incrementally.
To accomplish this, a list of running totals for each of the fields is kept. When that is finished, the average value of each field can be calculated by dividing the corresponding total value by the count of total lines read. Once that is done, these averages can be printed out and some written to one of the output files. I've also made a conscious effort to use very descriptive variable names to try to make it understandable.
try:
    from itertools import izip_longest
except ImportError:  # Python 3
    from itertools import zip_longest as izip_longest

GROUP_SIZE = 4
input_file_names = ["A1_B1_100000.txt", "A2_B2_100000.txt", "A1_B2_100000.txt",
                    "A2_B1_100000.txt"]
file_write = open("average_generations.txt", 'w')
mutation_average = open("mutation_average", 'w')  # left in, but nothing written

for file_name in input_file_names:
    with open(file_name, 'r') as input_file:
        print('processing file: {}'.format(file_name))
        totals = []
        for count, fields in enumerate((line.split('\t') for line in input_file), 1):
            totals = [sum(values) for values in
                      izip_longest(totals, map(float, fields), fillvalue=0)]
    averages = [total/count for total in totals]

    for print_counter, average in enumerate(averages):
        print('  {:9.4f}'.format(average))
        if print_counter % GROUP_SIZE == 0:
            file_write.write(str(average)+'\n')

file_write.write('\n')
file_write.close()
mutation_average.close()
You're reading the entire file into memory (line = u.readlines()) which will fail of course if the file is too large (and you say that some are up to 20 GB), so that's your problem right there.
Better iterate over each line:
for current_line in u:
    do_something_with(current_line)
is the recommended approach.
Later in your script, you're doing some very strange things like first counting all the items in a list, then constructing a for loop over the range of that count. Why not iterate over the list directly? What is the purpose of your script? I have the impression that this could be done much easier.
This is one of the advantages of high-level languages like Python (as opposed to C where you do have to do these housekeeping tasks yourself): Allow Python to handle iteration for you, and only collect in memory what you actually need to have in memory at any given time.
Also, as it seems that you're processing TSV files (tabulator-separated values), you should take a look at the csv module which will handle all the splitting, removing of \ns etc. for you.
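A minimal sketch of that, assuming tab-separated numeric fields (the file name comes from the question; everything else here is illustrative):

import csv

with open("A1_B1_100000.txt") as tsv_file:
    reader = csv.reader(tsv_file, delimiter='\t')
    for fields in reader:  # one row at a time, no readlines() needed
        numbers = [float(value) for value in fields if value.strip()]
        # ...update the running totals with `numbers` here...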
Python can use all memory available to its environment. My simple "memory test" crashes on ActiveState Python 2.6 after using about
1959167 [MiB]
On Jython 2.5 it crashes earlier:
239000 [MiB]
I can probably configure Jython to use more memory (it uses limits from the JVM).
Test app:
import sys

sl = []
i = 0
# some magic 1024 - overhead of string object
fill_size = 1024
if sys.version.startswith('2.7'):
    fill_size = 1003
if sys.version.startswith('3'):
    fill_size = 497
print(fill_size)
MiB = 0
while True:
    s = str(i).zfill(fill_size)
    sl.append(s)
    if i == 0:
        try:
            sys.stderr.write('size of one string %d\n' % (sys.getsizeof(s)))
        except AttributeError:
            pass
    i += 1
    if i % 1024 == 0:
        MiB += 1
        if MiB % 25 == 0:
            sys.stderr.write('%d [MiB]\n' % (MiB))
In your app you read the whole file at once. For such big files you should read them line by line.
No, there's no Python-specific limit on the memory usage of a Python application. I regularly work with Python applications that may use several gigabytes of memory. Most likely, your script actually uses more memory than available on the machine you're running on.
In that case, the solution is to rewrite the script to be more memory efficient, or to add more physical memory if the script is already optimized to minimize memory usage.
Edit:
Your script reads the entire contents of your files into memory at once (line = u.readlines()). Since you're processing files up to 20 GB in size, you're going to get memory errors with that approach unless you have huge amounts of memory in your machine.
A better approach would be to read the files one line at a time:
for u in files:
    for line in u:  # This will iterate over each line in the file
        # Read values from the line, do necessary calculations
Not only are you reading the whole of each file into memory, but also you laboriously replicate the information in a table called list_of_lines.
You have a secondary problem: your choices of variable names severely obfuscate what you are doing.
Here is your script rewritten with the readlines() caper removed and with meaningful names:
file_A1_B1 = open("A1_B1_100000.txt", "r")
file_A2_B2 = open("A2_B2_100000.txt", "r")
file_A1_B2 = open("A1_B2_100000.txt", "r")
file_A2_B1 = open("A2_B1_100000.txt", "r")

file_write = open("average_generations.txt", "w")
mutation_average = open("mutation_average", "w")  # not used

files = [file_A2_B2, file_A2_B2, file_A1_B2, file_A2_B1]

for afile in files:
    table = []
    for aline in afile:
        values = aline.split('\t')
        values.remove('\n')  # why?
        table.append(values)

    row_count = len(table)
    row0length = len(table[0])
    print_counter = 4

    for column_index in range(row0length):
        column_total = 0
        for row_index in range(row_count):
            number = float(table[row_index][column_index])
            column_total = column_total + number
        column_average = column_total/row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1

file_write.write('\n')
It rapidly becomes apparent that (1) you are calculating column averages and (2) the obfuscation led some others to think you were calculating row averages.
As you are calculating column averages, no output is required until the end of each file, and the amount of extra memory actually required is proportional to the number of columns.
Here is a revised version of the outer loop code:
for afile in files:
    for row_count, aline in enumerate(afile, start=1):
        values = aline.split('\t')
        values.remove('\n')  # why?
        fvalues = map(float, values)
        if row_count == 1:
            row0length = len(fvalues)
            column_index_range = range(row0length)
            column_totals = fvalues
        else:
            assert len(fvalues) == row0length
            for column_index in column_index_range:
                column_totals[column_index] += fvalues[column_index]

    print_counter = 4
    for column_index in column_index_range:
        column_average = column_totals[column_index] / row_count
        print column_average
        if print_counter == 4:
            file_write.write(str(column_average)+'\n')
            print_counter = 0
        print_counter += 1

Parallel function call in Python

I am quite new to Python. I have been thinking of turning the code below into parallel calls, where a list of doj values is formatted with the help of a lambda:
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)

def formatdoj(doj):
    doj = str(doj).split(" ")[0]
    doj = datetime.strptime(doj, '%Y' + "-" + '%m' + "-" + "%d")
    return doj
Since the list has a million records, formatting them all takes a lot of time.
How can I make parallel function calls in Python, similar to Parallel.ForEach in C#?
I think that in your case using parallel computation is a bit of an overkill. The slowness comes from the code, not from using a single processor. I'll show you in some steps how to make it faster, guessing a bit that you're working with a Pandas dataframe and what your dataframe contains (please stick to SO guidelines and include a complete working example!!)
For my test, I've used the following random dataframe with 100k rows (scale times up to get to your case):
N = int(1e5)
m_df = pd.DataFrame([['{}-{}-{}'.format(y, m, d)]
                     for y, m, d in zip(np.random.randint(2007, 2019, N),
                                        np.random.randint(1, 13, N),
                                        np.random.randint(1, 28, N))],
                    columns=['doj'])
Now this is your code:
tstart = time()
m_df[['doj']] = m_df[['doj']].apply(lambda x: formatdoj(*x), axis=1)
print("Done in {:.3f}s".format(time()-tstart))
On my machine it runs in around 5.1s. It has several problems. The first one is that you're using a dataframe instead of a series, although you work on only one column, and that you're creating a useless lambda function. Simply doing:
m_df['doj'].apply(formatdoj)
Cuts down the time to 1.6s. Also, joining strings with '+' is slow in Python; you can change your formatdoj to:
def faster_formatdoj(doj):
    return datetime.strptime(doj.split()[0], '%Y-%m-%d')

m_df['doj'] = m_df['doj'].apply(faster_formatdoj)
This is not a great improvement but does cut the time down a bit to 1.5s. If you need to join the strings for real (because e.g. they are not fixed), rather use '-'.join(['%Y', '%m', '%d']); that's faster.
But the true bottleneck comes from using datetime.strptime a lot of times. It is intrinsically a slow command - dates are a bulky thing. On the other hand, if you have millions of dates, and assuming they're not uniformly spread since the beginning of humankind, chances are they are massively duplicated. So the following is how you should truly do it:
tstart = time()
# Create a new column with only the first word
m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0])
converter = {
    x: faster_formatdoj(x) for x in m_df['doj_split'].unique()
}
m_df['doj'] = m_df['doj_split'].apply(lambda x: converter[x])
# Drop the column we added
m_df.drop(['doj_split'], axis=1, inplace=True)
print("Done in {:.3f}s".format(time() - tstart))
This works in around 0.2/0.3s, more than 10 times faster than your original implementation.
After all this, if you are still running too slow, you can consider working in parallel (rather, parallelizing separately the first "split" instruction and, maybe, the apply-lambda part; otherwise you'd be creating many different "converter" dictionaries, nullifying the gain). But I'd take that as a last step rather than the first solution...
[EDIT]: Originally in the first step of the last code box I used m_df['doj_split'] = m_df['doj'].str.split().apply(lambda x: x[0]) which is functionally equivalent but a bit slower than m_df['doj_split'] = m_df['doj'].apply(lambda x: x.split()[0]). I'm not entirely sure why, probably because it's essentially applying two functions instead of one.
Your best bet is to use dask. Dask has a dataframe type which you can use to create a similar dataframe, and while executing the compute function you can specify the number of cores with the num_workers argument. This will parallelize the task.
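A hedged sketch of that idea (the file name, the map_partitions date parsing, and the worker count are assumptions, not anything from the question):

import dask.dataframe as dd
import pandas as pd

ddf = dd.read_csv("records.csv")  # placeholder file with a 'doj' column

# Parse the date column partition by partition; each partition is a pandas Series.
ddf['doj'] = ddf['doj'].map_partitions(
    lambda s: pd.to_datetime(s.str.split().str[0], format='%Y-%m-%d'),
    meta=('doj', 'datetime64[ns]'),
)

# compute() materialises the result; num_workers controls the parallelism.
result = ddf.compute(scheduler='processes', num_workers=4)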
Since I'm not sure about your example, I will give you another one using the multiprocessing library:
# -*- coding: utf-8 -*-
import multiprocessing as mp

input_list = ["str1", "str2", "str3", "str4"]

def format_str(str_input):
    str_output = str_input + "_test"
    return str_output

if __name__ == '__main__':
    with mp.Pool(processes=2) as p:
        result = p.map(format_str, input_list)
    print(result)
Now, let's say you want to map a function with several arguments; you should then use starmap():
# -*- coding: utf-8 -*-
import multiprocessing as mp

input_list = ["str1", "str2", "str3", "str4"]

def format_str(str_input, i):
    str_output = str_input + "_test" + str(i)
    return str_output

if __name__ == '__main__':
    with mp.Pool(processes=2) as p:
        result = p.starmap(format_str, [(input_list[i], i) for i in range(len(input_list))])
    print(result)
Do not forget to place the Pool inside the if __name__ == '__main__': block, and note that multiprocessing will not work inside an IDE such as Spyder (or others), so you'll need to run the script from the cmd.
To keep the results, you can either save them to a file, or keep the cmd open at the end with os.system("pause") (Windows) or an input() on Linux.
It's a fairly simple way to use multiprocessing with python.
