I have an hdf5 file that contains datasets inside groups. Example:
group1/dataset1
group1/dataset2
group1/datasetX
group2/dataset1
group2/dataset2
group2/datasetX
I'm able to read each dataset independently. This is how I read a dataset from a .hdf5 file:
def hdf5_load_dataset(hdf5_filename, dsetname):
with h5py.File(hdf5_filename, 'r') as f:
dset = f[dsetname]
return dset[...]
# pseudo-code of how I call the hdf5_load_dataset() function
group = {'group1':'dataset1', 'group1':'dataset2' ...}
for group in groups:
for dataset in groups[group]:
dset_value = hdf5_load_dataset_value(path_hdf5_file, f'{group}/{dataset}')
# do stuff
I would like to know if it's possible to load into memory all the datasets of group1, then of group2, etc as a dictionary or similar in a single file read. My script is taking quite some time (4min) to read ~200k datasets. There are 2k groups with 100 datasets. So if I load a group in memory at once it will not overload it and I will gain in speed.
This is a pseudo-code of what I'm looking for:
for group in groups:
dset_group_as_dict = hdf5_load_group(path_hdf5_file, f'{group}')
for dataset in dset_group_as_dict;
#do stuff
EDIT:
Inside each .csv file:
time, amplitude
1.000e-08, -1.432e-07
1.001e-08, 7.992e-07
1.003e-08, -1.838e-05
1.003e-08, 2.521e-05
For each .csv file in each folder I have a dataset for the time and for the amplitude. The structure of the hdfile is like this:
XY_1/impact_X/time
XY_1/impact_Y/amplitude
where
time = np.array([1.000e-08, 1.001e-08, 1.003e-08, ...]) # 500 items
amplitude = np.array([-1.432e-07, 7.992e-07, -1.838e-05, ...]) # 500 items
XY_1 is a position in space.
impact_X means that X was impacted in position XY_1 so X amplitude has changed.
So, XY_1 must be in a different group of XY_2 as well as impact_X, impact_Y etc since they represent data to a particular XY position.
I need to create plots from each or only one (time, amplitude) pair (configurable). I also need to compare the amplitudes with a "golden" array to see the differences and calculate other stuff. To perform calculation I will read all the datasets, perform calculation and save the result.
I have more than 200k .csv files for each test case, in a total is more than 5M. Using 5M reads from disk will take quite some time in this case. For the 200k files, by exporting all the .csv to a unique JSON file, it takes ~40s to execute, using .csv is taking ~4min. I can't use a unique json any longer due to memory issues when loading the single JSON file. That's why I chose hdf5 as an alternative.
EDIT 2:
How I read the csv file:
def read_csv_return_list_of_rows(csv_file, _delimiter):
csv_file_list = list()
with open(csv_file, 'r') as f_read:
csv_reader = csv.reader(f_read, delimiter = _delimiter)
for row in csv_reader:
csv_file_list.append(row)
return csv_file_list
No, there is no single function that reads multiple groups or datasets at once. You have to make it yourself from lower level functions that read one group or dataset.
And can you give us some further context? What kind of data is it and how do you want to process it? (Do you want to make statistics? Make plots? Etcetera.) What are you ultimately trying to achieve? This may help us to avoid the to avoid the classical XY problem.
In your earlier question you said you converted a lot of small CSV file into one big HDF file. Can you tell us why? What is wrong with having many CSV small files?
In my experience HDF files with a huge number of groups and datasets are fairly slow, as you are experiencing now. Is it better to have relatively few, but larger, datasets. Is it possible for you to somehow merge multiple datasets into one? If not, HDF may not be the best solution for your problem.
Related
I have created a huge hdf5 dataset in the following form:
group1/raw
group1/preprocessed
group1/postprocessed
group2/raw
group2/preprocessed
group2/postprocessed
....
group10/raw
group10/preprocessed
group10/postprocessed
However, I realized that for portability I would like to have 10 different hdf5 files, one for each group. Is there a function in python to achieve this without looping through all the data and scanning the entire original hdf5 tree?
something like:
import h5py
file_path = 'path/to/data.hdf5'
hf = h5py.File(file_path, 'r')
print(hf.keys())
for group in hf.keys():
# create a new dataset for the group
hf_tmp = h5py.File(group + '.h5', 'w')
# get data from hf[key] and dumb them into the new file
# something like
# hf_tmp = hf[group]
# hf_tmp.dumb()
hf_tmp.close()
hf.close()
You have the right idea. There are several questions and answers on SO that show how to do this.
Start with this one. It shows how to loop over the keys and determine if it's a group or a dataset.: h5py: how to use keys() loop over HDF5 Groups and Datasets
Then look at these. Each shows a slightly different approach to the problem.
This shows one way. Extracting datasets from 1 HDF5 file to multiple files
Also, here is an earlier post I wrote: How to copy a dataset object to a different hdf5 file using pytables or h5py?
This does the opposite (copies datasets from different files to 1 file). It's useful because it demonstrates how to use the .copy() method: How can I combine multiple .h5 file?
Finally, you should review visititems() method to recursively search all Groups and Datasets. Take a look at this answer for details: is there a way to get datasets in all groups at once in h5py?
That should answer your questions.
Below is some pseudo-code that pulls all of these ideas together. It works for your schema, where all datasets are in root level groups. It will not work for the more general case with datasets at multiple group levels. Use visititems() for the more general case.
Pseudo-code below:
with h5py.File(file_path, 'r') as hf:
print(hf.keys())
# loop on group names at root level
for group in hf.keys():
hf_tmp = h5py.File(group + '.h5', 'w')
# loop on datasets names in group
for dset in hf[group].keys():
# copy dataset to the new group file
hf.copy(group+'/'+dset, hf_tmp)
hf_tmp.close()
I have recently sourced and curated a lot of reddit data from Google Bigquery.
The dataset looks like this:
Before passing this data to word2vec to create a vocabulary and be trained, it is required that I properly tokenize the 'body_cleaned' column.
I have attempted the tokenization with both manually created functions and NLTK's word_tokenize, but for now I'll keep it focused on using word_tokenize.
Because my dataset is rather large, close to 12 million rows, it is impossible for me to open and perform functions on the dataset in one go. Pandas tries to load everything to RAM and as you can understand it crashes, even on a system with 24GB of ram.
I am facing the following issue:
When I tokenize the dataset (using NTLK word_tokenize), if I perform the function on the dataset as a whole, it correctly tokenizes and word2vec accepts that input and learns/outputs words correctly in its vocabulary.
When I tokenize the dataset by first batching the dataframe and iterating through it, the resulting token column is not what word2vec prefers; although word2vec trains its model on the data gathered for over 4 hours, the resulting vocabulary it has learnt consists of single characters in several encodings, as well as emojis - not words.
To troubleshoot this, I created a tiny subset of my data and tried to perform the tokenization on that data in two different ways:
Knowing that my computer can handle performing the action on the dataset, I simply did:
reddit_subset = reddit_data[:50]
reddit_subset['tokens'] = reddit_subset['body_cleaned'].apply(lambda x: word_tokenize(x))
This produces the following result:
This in fact works with word2vec and produces model one can work with. Great so far.
Because of my inability to operate on such a large dataset in one go, I had to get creative with how I handle this dataset. My solution was to batch the dataset and work on it in small iterations using Panda's own batchsize argument.
I wrote the following function to achieve that:
def reddit_data_cleaning(filepath, batchsize=20000):
if batchsize:
df = pd.read_csv(filepath, encoding='utf-8', error_bad_lines=False, chunksize=batchsize, iterator=True, lineterminator='\n')
print("Beginning the data cleaning process!")
start_time = time.time()
flag = 1
chunk_num = 1
for chunk in df:
chunk[u'tokens'] = chunk[u'body_cleaned'].apply(lambda x: word_tokenize(x))
chunk_num += 1
if flag == 1:
chunk.dropna(how='any')
chunk = chunk[chunk['body_cleaned'] != 'deleted']
chunk = chunk[chunk['body_cleaned'] != 'removed']
print("Beginning writing a new file")
chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='w+', index=None, header=True)
flag = 0
else:
chunk.dropna(how='any')
chunk = chunk[chunk['body_cleaned'] != 'deleted']
chunk = chunk[chunk['body_cleaned'] != 'removed']
print("Adding a chunk into an already existing file")
chunk.to_csv(str(filepath[:-4] + '_tokenized.csv'), mode='a', index=None, header=None)
end_time = time.time()
print("Processing has been completed in: ", (end_time - start_time), " seconds.")
Although this piece of code allows me to actually work through this huge dataset in chunks and produces results where otherwise I'd crash from memory failures, I get a result which doesn't fit my word2vec requirements, and leaves me quite baffled at the reason for it.
I used the above function to perform the same operation on the Data subset to compare how the result differs between the two functions, and got the following:
The desired result is on the new_tokens column, and the function that chunks the dataframe produces the "tokens" column result.
Is anyone any wiser to help me understand why the same function to tokenize produces a wholly different result depending on how I iterate over the dataframe?
I appreciate you if you read through the whole issue and stuck through!
First & foremost, beyond a certain size of data, & especially when working with raw text or tokenized text, you probably don't want to be using Pandas dataframes for every interim result.
They add extra overhead & complication that isn't fully 'Pythonic'. This is particularly the case for:
Python list objects where each word is a separate string: once you've tokenized raw strings into this format, as for example to feed such texts to Gensim's Word2Vec model, trying to put those into Pandas just leads to confusing list-representation issues (as with your columns where the same text might be shown as either ['yessir', 'shit', 'is', 'real'] – which is a true Python list literal – or [yessir, shit, is, real] – which is some other mess likely to break if any tokens have challenging characters).
the raw word-vectors (or later, text-vectors): these are more compact & natural/efficient to work with in raw Numpy arrays than Dataframes
So, by all means, if Pandas helps for loading or other non-text fields, use it there. But then use more fundamntal Python or Numpy datatypes for tokenized text & vectors - perhaps using some field (like a unique ID) in your Dataframe to correlate the two.
Especially for large text corpuses, it's more typical to get away from CSV and instead use large text files, with one text per newline-separated line, and any each line being pre-tokenized so that spaces can be fully trusted as token-separated.
That is: even if your initial text data has more complicated punctuation-sensative tokenization, or other preprocessing that combines/changes/splits other tokens, try to do that just once (especially if it involves costly regexes), writing the results to a single simple text file which then fits the simple rules: read one text per line, split each line only by spaces.
Lots of algorithms, like Gensim's Word2Vec or FastText, can either stream such files directly or via very low-overhead iterable-wrappers - so the text is never completely in memory, only read as needed, repeatedly, for multiple training iterations.
For more details on this efficient way to work with large bodies of text, see this artice: https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
After taking gojomo's advice I simplified my approach at reading the csv file and writing to a text file.
My initial approach using pandas had yielded some pretty bad processing times for a file with around 12 million rows, and memory issues due to how pandas reads data all into memory before writing it out to a file.
What I also realized was that I had a major flaw in my previous code.
I was printing some output (as a sanity check), and because I printed output too often, I overflowed Jupyter and crashed the notebook, not allowing the underlying and most important task to complete.
I got rid of that, simplified reading with the csv module and writing into a txt file, and I processed the reddit database of ~12 million rows in less than 10 seconds.
Maybe not the finest piece of code, but I was scrambling to solve an issue that stood as a roadblock for me for a couple of days (and not realizing that part of my problem was my sanity checks crashing Jupyter was an even bigger frustration).
def generate_corpus_txt(csv_filepath, output_filepath):
import csv
import time
start_time = time.time()
with open(csv_filepath, encoding = 'utf-8') as csvfile:
datareader = csv.reader(csvfile)
count = 0
header = next(csvfile)
print(time.asctime(time.localtime()), " ---- Beginning Processing")
with open(output_filepath, 'w+') as output:
# Check file as empty
if header != None:
for row in datareader:
# Iterate over each row after the header in the csv
# row variable is a list that represents a row in csv
processed_row = str(' '.join(row)) + '\n'
output.write(processed_row)
count += 1
if count == 1000000:
print(time.asctime(time.localtime()), " ---- Processed 1,000,000 Rows of data.")
count = 0
print('Processing took:', int((time.time()-start_time)/60), ' minutes')
output.close()
csvfile.close()
I have hundred of thousands of data text files to read. As of now, I'm importing the data from text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file faster to read.
Anyway, right now every text files I have look like:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
path = 'T2'
if len(sys.argv) == 2:
path = sys.argv[1]
files = os.listdir(path)
trans_import = []
for index, item in enumerate(files):
trans_import.append(np.loadtxt(path+'/'+files[1], dtype=float, skiprows=4, usecols=(0,1)))
The resulting array looks in the variable explorer as:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering, how I could speed up this part? It takes a little too long as of now just to import ~4k text files. There are 841 lines inside every text files (spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...
It would probably be much faster if you had one large file instead of many small ones. This is generally more efficient. Additionally, you might get a speedup from just saving the numpy array directly and loading that .npy file in instead of reading in a large text file. I'm not as sure about the last part though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
If for some reason you really can't just have one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading in the files at the same time. Then you can just concatenate the matrices together at the end.
Not your primary question but since it seems to be an issue - you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.values.
Let use pandas.read_csv (with C speed) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt
I want to read a zipfile into memory and extract its content into a numpy array (as numpy-datatypes). This needs to happen in an extremely efficient/fast manner, since the files are rather big and there are many of them. Unfortunately looking at similiar questions didn't help me, because I couldn't find a way to convert the data into numpy-datatypes at the time of reading. Also the speed turned out to be a big problem.
For example: The zipfile "log_ks818.zip" contains "log_file.csv", which contains the needed data in the following format (yyyymmdd hhnnsszzz,float,float,zero):
20161001 190000100,1.000500,1.000800,0
20161001 190001000,1.001000,1.002000,0
20161001 190002500,1.001500,1.001200,0
...
The fastest that I managed to do so far (using pandas):
zfile = zipfile.ZipFile("log_ks818.zip", 'r')
df = pd.read_csv(io.BytesIO(zfile.read("log_file.csv")), header=None, usecols=[0, 1, 2], delimiter=',', encoding='utf-8')
print(df.head())
However this takes around 6 seconds for ~2,000,000 lines in the file (~80MB if unpacked), which is too slow (plus it's not a numpy object). When I compared the read-in speed of different numpy/pandas-methods, using the extracted file on the hard drive to test, np.fromfile performed the best with 0.08 seconds to simply get it into memory. It would be nice if it was possible to stay in this magnitude when reading the data from the zip-file.
I think this is not a problem about read speed from disk. Even though you are using HDD, reading 80MB into memory can be done in one second.
Take my experience as an example, the cost of time is determined by the process of uncompressing. If you just work with extracted data, it won't cost you a lot I believe.
I have got some (about 60) huge (>2 gig) CSV files which I want to loop through to to make subselections (e.g. each file contains data of 1 month of various financial products, i want to make 60-month time series of each product) .
Reading an entire file into memory (e.g. by loading the file in excel or matlab) is unworkable, so my initial search on stackoverflow made me try python. My strategy was to loop through each line iteratively and write it away in some folder. This strategy works fine, but it is extremely slow.
From my understanding there is a trade-off between memory usage and computation speed. Where loading the entire file in memory is one end of the spectrum (computer crashes), loading a single line unto the memory each time is obviously on the other end (computation time is about 5 hours).
So my main question is: *Is there a way that to load multiple lines into memory, as to do this process (100 times?) faster. While not losing functionality? * And if so, how would I implement this? Or am I going about this all wrong? Mind you, below is just a simplified code of what I am trying to do (I might want to make subselections in other dimensions than time). Assume that the original data files have no meaningful ordering (other than they being split into 60 files for each month).
The method in particular I am trying is:
#Creates a time series per bond
import csv
import linecache
#I have a row of comma-seperated bond-identifiers 'allBonds.txt' for each month
#I have 60 large files financialData_&month&year
filedoc=[];
months=['jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec'];
years=['08','09','10','11','12'];
bonds=[];
for j in range(0,5):
for i in range(0,12):
filedoc.append('financialData_' +str(months[i]) + str(years[j])+ '.txt')
for x in range (0,60):
line = linecache.getline('allBonds.txt', x)
bonds=line.split(','); #generate the identifiers for this particular month
with open(filedoc[x]) as text_file:
for line in text_file:
temp=line.split(';');
if temp[2] in bonds: : #checks if the bond of this iteration is among those we search for
output_file =open('monthOutput'+str(temp[2])+ str(filedoc[x]) +'.txt', 'a')
datawriter = csv.writer(output_file,dialect='excel',delimiter='^', quoting=csv.QUOTE_MINIMAL)
datawriter.writerow(temp)
output_file.close()
Thanks in advance.
P.s. Just to make sure: the code works at the moment (though any suggestions are welcome of course), but the issue is speed.
I would test pandas.read_csv mentioned in https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file . It supports reading the file in chunks (iterator=True option)
I think this part of your code may cause serious performance problems if the condition is matched frequently.
if temp[2] in bonds: : #checks if the bond of this iteration is among those we search for
output_file = open('monthOutput'+str(temp[2])+ str(filedoc[x]) +'.txt', 'a')
datawriter = csv.writer(output_file,dialect='excel',delimiter='^',
quoting=csv.QUOTE_MINIMAL)
datawriter.writerow(temp)
output_file.close()
It would be better to avoid opening a file, creating a cvs.writer() object and then closing the file inside a loop.