fastest method to read big data files in python - python

I have got some (about 60) huge (>2 gig) CSV files which I want to loop through to to make subselections (e.g. each file contains data of 1 month of various financial products, i want to make 60-month time series of each product) .
Reading an entire file into memory (e.g. by loading the file in excel or matlab) is unworkable, so my initial search on stackoverflow made me try python. My strategy was to loop through each line iteratively and write it away in some folder. This strategy works fine, but it is extremely slow.
From my understanding there is a trade-off between memory usage and computation speed. Where loading the entire file in memory is one end of the spectrum (computer crashes), loading a single line unto the memory each time is obviously on the other end (computation time is about 5 hours).
So my main question is: *Is there a way that to load multiple lines into memory, as to do this process (100 times?) faster. While not losing functionality? * And if so, how would I implement this? Or am I going about this all wrong? Mind you, below is just a simplified code of what I am trying to do (I might want to make subselections in other dimensions than time). Assume that the original data files have no meaningful ordering (other than they being split into 60 files for each month).
The method in particular I am trying is:
#Creates a time series per bond
import csv
import linecache
#I have a row of comma-seperated bond-identifiers 'allBonds.txt' for each month
#I have 60 large files financialData_&month&year
filedoc=[];
months=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'];
years=['08','09','10','11','12'];
bonds=[];
for j in range(0,5):
for i in range(0,12):
filedoc.append('financialData_' +str(months[i]) + str(years[j])+ '.txt')
for x in range (0,60):
line = linecache.getline('allBonds.txt', x)
bonds=line.split(','); #generate the identifiers for this particular month
with open(filedoc[x]) as text_file:
for line in text_file:
temp=line.split(';');
if temp[2] in bonds: : #checks if the bond of this iteration is among those we search for
output_file =open('monthOutput'+str(temp[2])+ str(filedoc[x]) +'.txt', 'a')
datawriter = csv.writer(output_file,dialect='excel',delimiter='^', quoting=csv.QUOTE_MINIMAL)
datawriter.writerow(temp)
output_file.close()
Thanks in advance.
P.s. Just to make sure: the code works at the moment (though any suggestions are welcome of course), but the issue is speed.

I would test pandas.read_csv mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file . It supports reading the file in chunks (iterator=True option)
I think this part of your code may cause serious performance problems if the condition is matched frequently.
if temp[2] in bonds: : #checks if the bond of this iteration is among those we search for
output_file = open('monthOutput'+str(temp[2])+ str(filedoc[x]) +'.txt', 'a')
datawriter = csv.writer(output_file,dialect='excel',delimiter='^',
quoting=csv.QUOTE_MINIMAL)
datawriter.writerow(temp)
output_file.close()
It would be better to avoid opening a file, creating a cvs.writer() object and then closing the file inside a loop.

Related

Most efficient way to loop through a file's values and check a dictionary for any/all corresponding instances

I have a file with user's names, one per line, and I need to compare each name in the file to all values in a csv file and make note each time the user name appears in the csv file. I need to make the search as efficient as possible as the csv file is 40K lines long
My example persons.txt file:
Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge
My example locations.csv file:
GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NortherHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemispher,Australia,Mechanics,"Freeman, Gordon",Yes
NortherHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemispher,New Zealand,Mechanics,"Samson, David",Yes
My Code:
import csv
def parse_files():
with open('data_file/persons.txt', 'r') as user_list:
lines = user_list.readlines()
for user_row in lines:
new_user = user_row.strip()
per = []
with open('data_file/locations.csv', newline='') as target_csv:
DictReader_loc = csv.DictReader(target_csv)
for loc_row in DictReader_loc:
if new_user.lower() in loc_row['DisplayName'].lower():
per.append(DictReader_loc.line_num)
print(DictReader_loc.line_num, loc_row['DisplayName'])
if len(per) > 0:
print("\n"+new_user, per)
print("Parse Complete")
def main():
parse_files()
main()
My code currently works. Based on the sample data in the example files, the code matches the 2 instances of "Samson, David" and 1 instance of "Simpson, Marge" in the locations.csv file. I'm hoping that someone can give me guidance on how I might transform either the persons.txt file or the locations.csv file (40K+ lines) so that the process is as efficient as it can be. I think it currently takes 10-15 minutes. I know looping isn't the most efficient, but I do need to check each name and see where it appears in the csv file.
I think #Tomalak's solution with SQLite is very useful, but if you want to keep it closer to your original code, see the version below.
Effectively, it reduces the amount of file opening/closing/reading that is going on, and hopefully will speed things up.
Since your sample is very small, I could not do any real measurements.
Going forward, you can consider using pandas for these kind of tasks - it can be very convenient working with CSVs and more optimized than the csv module.
import csv
def parse_files():
with open('persons.txt', 'r') as user_list:
# sets are faster to match against than lists
# do the lower() here to avoid repetition
user_set = set([u.strip().lower() for u in user_list.readlines()])
# open file at beginning, close after done
# you could also encapsulate the whole thing into a `with` clause if
# desired
target_csv = open("locations.csv", "r", newline='')
DictReader_loc = csv.DictReader(target_csv)
for user in user_set:
per = []
for loc_row in DictReader_loc:
if user in loc_row['DisplayName'].lower():
per.append(DictReader_loc.line_num)
print(DictReader_loc.line_num, loc_row['DisplayName'])
if len(per) > 0:
print("\n"+user, per)
print("Parse Complete")
target_csv.close()
def main():
parse_files()
main()

Load group from hdf5

I have an hdf5 file that contains datasets inside groups. Example:
group1/dataset1
group1/dataset2
group1/datasetX
group2/dataset1
group2/dataset2
group2/datasetX
I'm able to read each dataset independently. This is how I read a dataset from a .hdf5 file:
def hdf5_load_dataset(hdf5_filename, dsetname):
with h5py.File(hdf5_filename, 'r') as f:
dset = f[dsetname]
return dset[...]
# pseudo-code of how I call the hdf5_load_dataset() function
group = {'group1':'dataset1', 'group1':'dataset2' ...}
for group in groups:
for dataset in groups[group]:
dset_value = hdf5_load_dataset_value(path_hdf5_file, f'{group}/{dataset}')
# do stuff
I would like to know if it's possible to load into memory all the datasets of group1, then of group2, etc as a dictionary or similar in a single file read. My script is taking quite some time (4min) to read ~200k datasets. There are 2k groups with 100 datasets. So if I load a group in memory at once it will not overload it and I will gain in speed.
This is a pseudo-code of what I'm looking for:
for group in groups:
dset_group_as_dict = hdf5_load_group(path_hdf5_file, f'{group}')
for dataset in dset_group_as_dict;
#do stuff
EDIT:
Inside each .csv file:
time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05
For each .csv file in each folder I have a dataset for the time and for the amplitude. The structure of the hdfile is like this:
XY_1/impact_X/time
XY_1/impact_Y/amplitude
where
time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...]) # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...]) # 500 items
XY_1 is a position in space.
impact_X means that X was impacted in position XY_1 so X amplitude has changed.
So, XY_1 must be in a different group of XY_2 as well as impact_X, impact_Y etc since they represent data to a particular XY position.
I need to create plots from each or only one (time, amplitude) pair (configurable). I also need to compare the amplitudes with a "golden" array to see the differences and calculate other stuff. To perform calculation I will read all the datasets, perform calculation and save the result.
I have more than 200k .csv files for each test case, in a total is more than 5M. Using 5M reads from disk will take quite some time in this case. For the 200k files, by exporting all the .csv to a unique JSON file, it takes ~40s to execute, using .csv is taking ~4min. I can't use a unique json any longer due to memory issues when loading the single JSON file. That's why I chose hdf5 as an alternative.
EDIT 2:
How I read the csv file:
def read_csv_return_list_of_rows(csv_file, _delimiter):
csv_file_list = list()
with open(csv_file, 'r') as f_read:
csv_reader = csv.reader(f_read, delimiter = _delimiter)
for row in csv_reader:
csv_file_list.append(row)
return csv_file_list
No, there is no single function that reads multiple groups or datasets at once. You have to make it yourself from lower level functions that read one group or dataset.
And can you give us some further context? What kind of data is it and how do you want to process it? (Do you want to make statistics? Make plots? Etcetera.) What are you ultimately trying to achieve? This may help us to avoid the to avoid the classical XY problem.
In your earlier question you said you converted a lot of small CSV file into one big HDF file. Can you tell us why? What is wrong with having many CSV small files?
In my experience HDF files with a huge number of groups and datasets are fairly slow, as you are experiencing now. Is it better to have relatively few, but larger, datasets. Is it possible for you to somehow merge multiple datasets into one? If not, HDF may not be the best solution for your problem.

Best way to write rows of a numpy array to file inside, NOT after, a loop?

I'm new here and to python in general, so please forgive any formatting issues and whatever else. I'm a physicist and I have a parametric model, where I want to iterate over one or more of the model's parameter values (possibly in an MCMC setting). But for simplicity, imagine I have just a single parameter with N possible values. In a loop, I compute the model and several scalar metrics pertaining to it.
I want to save the data [parameter value, metric1, metric2, ...] line-by-line to a file. I don't care what type: .pickle, .npz, .txt, .csv or anything else are fine.
I do NOT want to save the array after all N models have been computed. The issue here is that, sometimes a parameter value is so nonphysical that the program I call to calculate the model (which is a giant complicated thing years in development, so I'm not touching it) crashes the kernel. If I have N = 30000 models to do, and this happens at 29000, I'll be very unhappy and have wasted a lot of time. I also probably have to be conscious of memory usage - I've figured out how to do what I propose with a text file, but it crashes around 2600 lines because I don't think it likes opening a text file that long.
So, some pseudo-code:
filename = 'outFile.extension'
dataArray = np.zeros([N,3])
idx = 0
for p in Parameter1:
modelOutputVector = calculateModel(p)
metric1, metric2 = getMetrics(modelOutputVector)
dataArray[idx,0] = p
dataArray[idx,1] = metric1
dataArray[idx,2] = metric2
### Line that saves data here
idx+=1
I'm partial to npz or pickle formats, but can't figure out how to do this with either. If there is a better format or a better solution, I appreciate any advice.
Edit: What I tried to make a text file was this, inside the loop:
fileObject = open(filename, 'ab')
np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
fileObject.write('\n')
fileObject.close()
The first time it crashed at 2600 or whatever I thought it was just coincidence, but every time I try this, that's where it stops. I could hack it and make a batch of files that are all 2600 lines, but there's got to be a better solution.
Its hard to say with such a limited knowledge of the error, but if you think it is a file writing error maybe you could try something like:
with open(filename, 'ab') as fileObject:
# code that computes numpy array
np.savetxt(fileObject, rowOfData, delimiter = ',', newline = ' ')
fileObject.write('\n')
# no need to .close() because the "with open()" will handle it
However
I have not used np.savetxt()
I am not an expert on your project
I do not even know if it is truly a file writing error to begin with
I just prefer the with open() technique because that's how all the introductory python books I've read structure their file reading/writing processes, so I assume there is wisdom in it. You could also consider doing like fabianegli commented and save to separate files (thats what my work does).

Increase speed numpy.loadtxt?

I have hundred of thousands of data text files to read. As of now, I'm importing the data from text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file faster to read.
Anyway, right now every text files I have look like:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
path = 'T2'
if len(sys.argv) == 2:
path = sys.argv[1]
files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
trans_import.append(np.loadtxt(path+'/'+files[1], dtype=float, skiprows=4, usecols=(0,1)))
The resulting array looks in the variable explorer as:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering, how I could speed up this part? It takes a little too long as of now just to import ~4k text files. There are 841 lines inside every text files (spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
Let use pandas.read_csv (with C speed) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt

MPI in Python: load data from a file by line concurrently

I'm new to python as well as MPI.
I have a huge data file, 10Gb, and I want to load it into, i.e., a list or whatever more efficient, please suggest.
Here is the way I load the file content into a list
def load(source, size):
data = [[] for _ in range(size)]
ln = 0
with open(source, 'r') as input:
for line in input:
ln += 1
data[ln%size].sanitize(line)
return data
Note:
source: is file name
size: is the number of concurrent process, I divide data into [size] of sublist.
for parallel computing using MPI in python.
Please advise how to load data more efficient and faster. I'm searching for days but I couldn't get any results matches my purpose and if there exists, please comment with a link here.
Regards
If I have understood the question, your bottleneck is not Python data structures. It is the I/O speed that limits the efficiency of your program.
If the file is written in continues blocks in the H.D.D then I don't know a way to read it faster than reading the file starting form the first bytes to the end.
But if the file is fragmented, create multiple threads each reading a part of the file. The must slow down the process of reading but modern HDDs implement a technique named NCQ (Native Command Queueing). It works by giving high priority to the read operation on sectors with addresses near the current position of the HDD head. Hence improving the overall speed of read operation using multiple threads.
To mention an efficient data structure in Python for your program, you need to mention what operations will you perform to the data? (delete, add, insert, search, append and so on) and how often?
By the way, if you use commodity hardware, 10GBs of RAM is expensive. Try reducing the need for this amount of RAM by loading the necessary data for computation then replacing the results with new data for the next operation. You can overlap the computation with the I/O operations to improve performance.
(original) Solution using pickling
The strategy for your task can go this way:
split the large file to smaller ones, make sure they are divided on line boundaries
have Python code, which can convert smaller files into resulting list of records and save them as
pickled file
run the python code for all the smaller files in parallel (using Python or other means)
run integrating code, taking pickled files one by one, loading the list from it and appending it
to final result.
To gain anything, you have to be careful as overhead can overcome all possible gains from parallel
runs:
as Python uses Global Interpreter Lock (GIL), do not use threads for parallel processing, use
processes. As processes cannot simply pass data around, you have to pickle them and let the other
(final integrating) part to read the result from it.
try to minimize number of loops. For this reason it is better to:
do not split the large file to too many smaller parts. To use power of your cores, best fit
the number of parts to number of cores (or possibly twice as much, but getting higher will
spend too much time on swithing between processes).
pickling allows saving particular items, but better create list of items (records) and pickle
the list as one item. Pickling one list of 1000 items will be faster than 1000 times pickling
small items one by one.
some tasks (spliting the file, calling the conversion task in parallel) can be often done faster
by existing tools in the system. If you have this option, use that.
In my small test, I have created a file with 100 thousands lines with content "98-BBBBBBBBBBBBBB",
"99-BBBBBBBBBBB" etc. and tested converting it to list of numbers [...., 98, 99, ...].
For spliting I used Linux command split, asking to create 4 parts preserving line borders:
$ split -n l/4 long.txt
This created smaller files xaa, xab, xac, xad.
To convert each smaller file I used following script, converting the content into file with
extension .pickle and containing pickled list.
# chunk2pickle.py
import pickle
import sys
def process_line(line):
return int(line.split("-", 1)[0])
def main(fname, pick_fname):
with open(pick_fname, "wb") as fo:
with open(fname) as f:
pickle.dump([process_line(line) for line in f], fo)
if __name__ == "__main__":
fname = sys.argv[1]
pick_fname = fname + ".pickled"
main(fname, pick_fname)
To convert one chunk of lines into pickled list of records:
$ python chunk2pickle xaa
and it creates the file xaa.pickled.
But as we need to do this in parallel, I used parallel tool (which has to be installed into
system):
$ parallel -j 4 python chunk2pickle.py {} ::: xaa xab xac xad
and I found new files with extension .pickled on the disk.
-j 4 asks to run 4 processes in parallel, adjust it to your system or leave it out and it will
default to number of cores you have.
parallel can also get list of parameters (input file names in our case) by other means like ls
command:
$ ls x?? |parallel -j 4 python chunk2pickle.py {}
To integrate the results, use script integrate.py:
# integrate.py
import pickle
def main(file_names):
res = []
for fname in file_names:
with open(fname, "rb") as f:
res.extend(pickle.load(f))
return res
if __name__ == "__main__":
file_names = ["xaa.pickled", "xab.pickled", "xac.pickled", "xad.pickled"]
# here you have the list of records you asked for
records = main(file_names)
print records
In my answer I have used couple of external tools (split and parallel). You may do similar task
with Python too. My answer is focusing only on providing you an option to keep Python code for
converting lines to required data structures. Complete pure Python answer is not covered here (it
would get much longer and probably slower.
Solution using process Pool (no explicit pickling needed)
Following solution uses multiprocessing from Python. In this case there is no need to pickle results
explicitly (I am not sure, if it is done by the library automatically, or it is not necessary and
data are passed using other means).
# direct_integrate.py
from multiprocessing import Pool
def process_line(line):
return int(line.split("-", 1)[0])
def process_chunkfile(fname):
with open(fname) as f:
return [process_line(line) for line in f]
def main(file_names, cores=4):
p = Pool(cores)
return p.map(process_chunkfile, file_names)
if __name__ == "__main__":
file_names = ["xaa", "xab", "xac", "xad"]
# here you have the list of records you asked for
# warning: records are in groups.
record_groups = main(file_names)
for rec_group in record_groups:
print(rec_group)
This updated solution still assumes, the large file is available in form of four smaller files.

Categories

Resources