I'm doing something seemingly trivial that takes much longer than I would expect it to. I'm loading a 70MB file, running it through a reducer which calls a Python script that does not modify the data, and writing the data back to a new file.
It takes 42 minutes when I run the data through the Python script; it takes less than one minute (including compilation) if I don't.
I'm trying to understand:
What am I doing wrong?
What is going on under the hood that takes so long?
I store the input and output files on Azure Data Lake Store. I'm using parallelism 1, a TSV input file of about 70MB (2000 rows, 2 columns). I'm just passing the data through. It takes 42 minutes until the job finishes.
I generated the test input data with this Python script:
import base64
import os

# create a roughly 70MB TSV file with 2000 rows and 2 columns: ID (integer) and roughly 30KB data (string)
fo = open('testinput.tsv', 'wb')
for i in range(2000):
    fo.write(str(i).encode() + b'\t' + base64.b85encode(bytearray(os.urandom(30000))) + b'\n')
fo.close()
This is the U-SQL script I use:
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def usqlml_main(df):
    return df
";

@step1 =
    EXTRACT
        col1 string,
        col2 string
    FROM "/test/testinput.tsv" USING Extractors.Tsv();

@step2 =
    REDUCE @step1 ON col1
    PRODUCE col1 string, col2 string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @step2
    TO "/test/testoutput.csv"
    USING Outputters.Tsv(outputHeader: true);
I have the same issue.
I have a 116 MB CSV file I want to read in (and then do stuff). When I try to read in the file and do nothing in the Python script, it times out after 5 hours; I even tried reducing the file to 9.28 MB and it also times out after 5 hours.
However, when reduced to 1.32 MB, the job finishes after 16 minutes (with results as expected).
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def usqlml_main(df):
    return df
";

@train =
    EXTRACT txt string,
            casegroup string
    FROM "/test/t.csv"
    USING Extractors.Csv();

@train =
    SELECT *,
           1 AS Order
    FROM @train
    ORDER BY Order
    FETCH 10000;

@train =
    SELECT txt,
           casegroup
    FROM @train; // 1000 rows: 16 mins, 10000 rows: times out at 5 hours.

@m =
    REDUCE @train ON txt, casegroup
    PRODUCE txt string, casegroup string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @m
    TO "/test/t_res.csv"
    USING Outputters.Csv();
REDUCE @ROWSET ALL
If you don't reduce on ALL, the Python function is invoked once per group; since the key you reduce on here is (nearly) unique per row, that means once per row, and the per-invocation overhead dominates.
If you want to use parallelism, you could create temporary groups to reduce on.
I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (eg: muscle cell, heart cell), subtypes (eg: a lung dataset can be split into normal lung and cancerous lung), cancer stage (eg: stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
The only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or should I consider another storage framework (I briefly saw something called Redis Hashes)?
One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files took similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write to hdf5, and query:
import h5py
import numpy as np
import time

def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}':np.random.random() for i in range(num_agg_metrics)}

    level,num_groups = level_counts.popitem()
    return {f'{level}_{i+1}':create_data_dict(level_counts.copy()) for i in range(num_groups)}

def write_dict_to_hdf5(hdf5_path,d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f,d):
        for k,v in d.items():
            #check if the next level is also a dict
            sk,sv = v.popitem()
            v[sk] = sv

            if type(sv) == dict:
                #this is a 'node', move on to next level
                _recur_write(f.create_group(k),v)
            else:
                #this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk,sv in v.items():
                    leaf.attrs[sk] = sv

    with h5py.File(hdf5_path,'w') as f:
        _recur_write(f,d)

def query_hdf5(hdf5_path,search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path,'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}

        return dict(f.attrs)
################
# start #
################
#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene':40,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}

#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene':400,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}

#Determine which test dataset to use
small_test = True

if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'
np.random.seed(1)
start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time()-start))
start = time.time()
write_dict_to_hdf5(hdf5_path,data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time()-start))
#Search terms in order of most broad to least
search_terms = ['Metadata_1','Unique_Group_3','Dataset_8','Cell_Type_15','Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path,search_terms)
print('queried in {:.2f} seconds'.format(time.time()-start))
direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you use as dictionary keys. From your description of your data structure, you probably have 10,000 copies of "Agg metric 1", "Agg metric 2", etc. for every gene in your dataset, and these duplicate strings are likely taking up a significant amount of memory. They can be deduplicated with sys.intern so that, although you still have just as many references to the string in your dictionary, they all point to a single copy in memory. You would only need a minimal adjustment to your code: simply change the assignment to data[sys.intern('Agg metric 1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
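As a minimal sketch of that change (the parsed_key() helper, key names, and loop sizes below are made up for illustration, not taken from your pipeline), interning the keys as the per-gene dictionaries are built looks like this:
import sys

def parsed_key(i):
    # stands in for a key that is built or parsed fresh each time,
    # which is when duplicate string objects normally pile up
    return 'Agg_metric_' + str(i)

genes = {}
for g in range(10000):
    genes['Gene_{}'.format(g + 1)] = {
        sys.intern(parsed_key(i)): 0.0   # every per-gene dict now shares one interned key object
        for i in range(1, 8)
    }
The values here are placeholders; the only point is that every per-gene dict ends up referencing the same key strings instead of holding its own copies.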
I have the following code, which computes cosine similarity of the descriptions of tv shows and movies.
for i, row in df.iterrows():
    doc = nlp(row['description'])
    similarities[i] = {}
    # print(row['title'])
    for j, row2 in df.iterrows():
        doc2 = nlp(row2['description'])
        #print(f"{row['title']} x {row2['title']}: {doc.similarity(doc2):.10f}")
        similarities[i][j] = doc.similarity(doc2)
I've also written this function, which takes as arguments two titles and returns their similarity
def lookup(title1, title2):
    return similarities[lookup_by_title(title1)][lookup_by_title(title2)]
My issue is that the dataframe I loop through has 4884 rows, so I have roughly 23.8 million computations. I'm wondering what the best way is to run the computations once and save that information somewhere efficiently.
After you calculate similarities the first time, you can dump it to a local file; on subsequent runs, instead of redoing the computations, just load similarities from the file.
You can use pickle for this; see a nice tutorial here.
I'm copying the samples in case the webpage becomes unavailable in the future. In your case, of course, you need to replace config_dictionary with similarities:
Dump:
# Step 1
import pickle
config_dictionary = {'remote_hostname': 'google.com', 'remote_port': 80}
# Step 2
with open('config.dictionary', 'wb') as config_dictionary_file:
    # Step 3
    pickle.dump(config_dictionary, config_dictionary_file)
Load:
# Step 1
import pickle
# Step 2
with open('config.dictionary', 'rb') as config_dictionary_file:
    # Step 3
    config_dictionary = pickle.load(config_dictionary_file)
# After config_dictionary is read from file
print(config_dictionary)
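Adapted to the similarities dict, the compute-once-then-cache pattern might look roughly like this; the file name and the compute_similarities() helper are placeholders, not names from your code:
import os
import pickle

CACHE_PATH = 'similarities.pickle'          # placeholder file name

if os.path.exists(CACHE_PATH):
    # reuse the previously saved results
    with open(CACHE_PATH, 'rb') as f:
        similarities = pickle.load(f)
else:
    similarities = compute_similarities()   # your nested iterrows() loop, wrapped in a function
    with open(CACHE_PATH, 'wb') as f:
        pickle.dump(similarities, f)        # pay for the ~23.8 million comparisons only once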
I want to clean all the "waste" (which makes the files unsuitable for analysis) out of unstructured text files.
In this specific situation, one option for retaining only the wanted information is to keep only the numbers above 250 (the text is a combination of strings, numbers, ...).
For a large number of text files, I want to do the following action in R:
x <- x[which(x >= "250"),]
The code above works perfectly for one text file; when I try to do the same in a loop over the large number of text files, it fails (error: incorrect number of dimensions).
for(i in 1:length(files)){
  i <- i[which(i >= "250"),]
}
Does anyone have any idea how to solve this in R (or Python)?
Picture: a very simplified example of a text file; I want to retain everything between (START) and (END).
If it is 10K files, why are you even trying to do this in R or Python? Why not just a simple awk or bash command? Moreover, your image shows parsing the info between START and END from the text files; it is not clear whether it is a data frame with columns across (try to post a simple dput rather than images).
All you are trying to do is grep between START and END across 10K files; I would do that in bash.
Something like this in bash should work:
for i in *.txt
do
    sed -n '/START/,/END/{//!p}' "$i" > "$i.edited.txt"
done
If the columns are standard across files, you can do the following in R (but I would not read 10K files into R memory): read the files as a list of data frames, then simply do an lapply.
a = data.frame(col1 = c(100,250,300))
b = data.frame(col1 = c(250,450,100,346))
c = data.frame(col1 = c(250,123,122,340))
df_list <- list(a = a ,b = b,c = c)
lapply(df_list, subset, col1 >= 250)
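Since the question also mentions Python, here is a rough sketch of the same idea there; the START/END marker strings, the *.txt glob, and the one-number-per-line assumption are guesses about the real format:
import glob

for path in glob.glob('*.txt'):
    keep = []
    inside = False
    with open(path) as fh:
        for line in fh:
            token = line.strip()
            if token == 'START':        # adjust if the marker is really "(START)"
                inside = True
                continue
            if token == 'END':
                inside = False
                continue
            # keep only numeric values of at least 250
            if inside and token.isdigit() and int(token) >= 250:
                keep.append(token)
    with open(path + '.edited.txt', 'w') as out:
        out.write('\n'.join(keep) + '\n')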
I'm new to Python and am having a difficult time figuring out how to write a program that will write out a single .txt file for every line in a .csv file. For instance, I have the following .csv file with data from multiple calculations, and I need .txt files created for each individual calculation. Formatting is rough to do here, but the bold letters are column names and the corresponding elements are underneath (ex: "Run2" and "20" belong to column C).
A                           B      C      D
Title:                      Run1   Run2   Run3
Initial Composition: FeO    10     20     30
Initial Composition: MgO    40     50     60
I want my Python code to output the following:
1.txt:
Title: Run 1
Initial Composition: FeO 10
Initial Composition: MgO 40
2.txt:
Title: Run 2
Initial Composition: FeO 20
Initial Composition: MgO 50
The elements from column A need to be printed in every .txt file, with the numbers from the various calculations contained in columns B, C, etc. printed beside them, separated by a space. Bonus points for anyone who can also help me create custom filenames for the .txt files based on the title (ex: the data from column B creates a .txt file called "Run1.txt"). I don't know if assigning each column to a dictionary and then appending them all together would be the best route.
Thank you!
Something like this:
import csv

with open('runs.csv', 'rb') as read_file:
    reader = csv.reader(read_file)
    for run in reader:
        with open(run[0] + '.txt', 'wb') as write_file:
            write_file.write(run[1] + '\n')
For a csv file with the format "Name of file","Run results", obviously this can be replaced with anything you want.
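For the column-per-file layout described in the question (labels in column A, one run per later column, file names taken from the title row), a sketch along these lines may be closer to what is wanted; the 'runs.csv' file name and the exact layout are assumptions:
import csv

with open('runs.csv', newline='') as f:
    rows = list(csv.reader(f))

labels = [row[0] for row in rows]                  # "Title:", "Initial Composition: FeO", ...

for col in range(1, len(rows[0])):                 # one output file per run column
    run_name = rows[0][col]                        # e.g. "Run1" -> "Run1.txt"
    with open(run_name + '.txt', 'w') as out:
        for label, row in zip(labels, rows):
            out.write('{} {}\n'.format(label, row[col]))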
I have a netCDF file with eight variables. (Sorry, I can't share the actual file.)
Each variable has two dimensions, time and station. Time is about 14 steps and station is currently 38000 different ids.
So for 38000 different "locations" (actually just an id) we have 8 variables and 14 different times.
$ ncdump -h stationdata.nc
netcdf stationdata {
dimensions:
        station = 38000 ;
        name_strlen = 40 ;
        time = UNLIMITED ; // (14 currently)
variables:
        int time(time) ;
                time:long_name = "time" ;
                time:units = "seconds since 1970-01-01" ;
        char station_name(station, name_strlen) ;
                station_name:long_name = "station_name" ;
                station_name:cf_role = "timeseries_id" ;
        float var1(time, station) ;
                var1:long_name = "Variable 1" ;
                var1:units = "m3/s" ;
        float var2(time, station) ;
                var2:long_name = "Variable 2" ;
                var2:units = "m3/s" ;
...
This data needs to be loaded into a PostGres database so that the data can be joined to some geometries matching the station_name for later visualization.
Currently I have done this in Python with the netCDF4-module. Works but it takes forever!
Now I am looping like this:
times = rootgrp.variables['time']
stations = rootgrp.variables['station_name']
for timeindex, time in enumerate(times):
    stations = rootgrp.variables['station_name']
    for stationindex, stationnamearr in enumerate(stations):
        var1val = var1[timeindex][stationindex]
        print "INSERT INTO ncdata (validtime, stationname, var1) \
              VALUES ('%s','%s', %s);" % \
              ( time, stationnamearr, var1val )
This takes several minutes on my machine to run and I have a feeling it could be done in a much more clever way.
Does anyone have any idea how this can be done in a smarter way? Preferably in Python.
Not sure this is the right way to do it but I found a good way to solve this and thought I should share it.
In the first version the script took about one hour to run. After a rewrite of the code it now runs in less than 30 sec!
The big thing was to use numpy arrays and transpose the variable arrays from the NetCDF reader into rows, and then stack all the columns into one matrix. This matrix was then loaded into the db using the psycopg2 copy_from function. I got the code for that from this question:
Use binary COPY table FROM with psycopg2
Parts of my code:
import cStringIO

import numpy as np
import psycopg2
from netCDF4 import num2date

# rootgrp and stationnames come from the netCDF setup shown in the question;
# only parts of the script are reproduced here.
dates = num2date(rootgrp.variables['time'][:], units=rootgrp.variables['time'].units)
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']

cpy = cStringIO.StringIO()

for timeindex, time in enumerate(dates):
    validtimes = np.empty(var1[timeindex].size, dtype="object")
    validtimes.fill(time)

    # Transpose and stack the arrays of parameters
    # [a,a,a,a]      [[a,b,c],
    # [b,b,b,b]  =>   [a,b,c],
    # [c,c,c,c]       [a,b,c],
    #                 [a,b,c]]
    a = np.hstack((
        validtimes.reshape(validtimes.size, 1),
        stationnames.reshape(stationnames.size, 1),
        var1[timeindex].reshape(var1[timeindex].size, 1),
        var2[timeindex].reshape(var2[timeindex].size, 1)
    ))

    # Fill the cStringIO with a text representation of the created array
    for row in a:
        cpy.write(row[0].strftime("%Y-%m-%d %H:%M") + '\t' + row[1] + '\t' + '\t'.join([str(x) for x in row[2:]]) + '\n')

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()

cpy.seek(0)
curs.copy_from(cpy, 'ncdata', columns=('validtime', 'stationname', 'var1', 'var2'))
conn.commit()
There are a few simple improvements you can make to speed this up. All these are independent, you can try all of them or just a couple to see if it's fast enough. They're in roughly ascending order of difficulty:
Use the psycopg2 database driver, it's faster
Wrap the whole block of inserts in a transaction. If you're using psycopg2 you're already doing this - it auto-opens a transaction you have to commit at the end.
Collect up several rows worth of values in an array and do a multi-valued INSERT every n rows.
Use more than one connection to do the inserts via helper processes - see the multiprocessing module. Threads won't work as well because of GIL (global interpreter lock) issues.
If you don't want to use one big transaction, you can set synchronous_commit = off and set a commit_delay so the connection can return before the disk flush actually completes. This won't help you much if you're doing all the work in one transaction. (A short sketch of the session setting follows below.)
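If you go the session-level route, a minimal sketch from psycopg2 might look like this (the connection parameters are copied from the answer above; treat it as an illustration, not a drop-in):
import psycopg2

# Sketch only: synchronous_commit can be turned off per session; it trades
# durability of the most recent commits for speed. (commit_delay is normally
# a server-side setting, so it is not shown here.)
conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()
curs.execute("SET synchronous_commit TO off")

# ... run the INSERTs / COPY here, committing in batches ...
conn.commit()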
Multi-valued inserts
Psycopg2 doesn't directly support multi-valued INSERT but you can just write:
curs.execute("""
    INSERT INTO blah(a,b) VALUES
    (%s,%s),
    (%s,%s),
    (%s,%s),
    (%s,%s),
    (%s,%s);
    """, parms)
and loop with something like:
parms = []
rownum = 0
for x in input_data:
    parms.extend([x.firstvalue, x.secondvalue])
    rownum += 1
    if rownum % 5 == 0:
        curs.execute("""INSERT ...""", tuple(parms))
        del(parms[:])
# note: any rows left over after the loop (when rownum isn't a multiple of 5)
# still need one final, smaller INSERT
Organize your loop to access all the variables for each time. In other words, read and write a record at a time rather than a variable at a time. This can speed things up enormously, especially if the source netCDF dataset is stored on a file system with large disk blocks, e.g. 1MB or larger. For an explanation of why this is faster and a discussion of order-of-magnitude resulting speedups, see this NCO speedup discussion, starting with entry 7.
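As a rough illustration of reading a record at a time with netCDF4 (the variable names follow the ncdump listing above; the loop body where rows are emitted is left as a placeholder):
from netCDF4 import Dataset

rootgrp = Dataset('stationdata.nc')
station_names = rootgrp.variables['station_name'][:]   # read once, reuse for every time step
var_names = ['var1', 'var2']                            # extend with the remaining variables

for timeindex in range(len(rootgrp.variables['time'])):
    # one contiguous slice per variable per time step instead of per-element indexing
    record = {name: rootgrp.variables[name][timeindex, :] for name in var_names}
    # ... build the COPY buffer / INSERT rows for this time step from "record" here ...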