Large Data Analytics - python

I'm trying to analyze a large amount of GitHub Archive Data and am stumped by many limitations.
So my analysis requires me too search a 350GB Data set. I have a local copy of the data and there is also a copy available via Google BigQuery. The local dataset is split up into 25000 individual files. The dataset is a timeline of events.
I want to plot the number of stars each repository has since its creation. (Only for repos with > 1000 currently)
I can get this result very quickly using Google BigQuery, but it "analyzes" 13.6GB of data each time. This limits me to <75 requests without having to pay $5 per additional 75.
My other option is to search through my local copy, but searching through each file for a specific string (repository name) takes way too long. Took over an hour on an SSD drive to get through half the files before I killed the process.
What is a better way I can approach analyzing such a large amount of data?
Python Code for Searching Through all Local Files:
for yy in range(11,15):
for mm in range(1,13):
for dd in range(1,32):
for hh in range(0,24):
counter = counter + 1
if counter < startAt:
continue
if counter > stopAt:
continue
#print counter
strHH = str(hh)
strDD = str(dd)
strMM = str(mm)
strYY = str(yy)
if len(strDD) == 1:
strDD = "0" + strDD
if len(strMM) == 1:
strMM = "0" + strMM
#print strYY + "-" + strMM + "-" + strDD + "-" + strHH
try:
f = json.load (open ("/Volumes/WD_1TB/GitHub Archive/20"+strYY+"-"+strMM+"-"+strDD+"-"+strHH+".json", 'r') , cls=ConcatJSONDecoder)
for each_event in f:
if(each_event["type"] == "WatchEvent"):
try:
num_stars = int(each_event["repository"]["watchers"])
created_at = each_event["created_at"]
json_entry[4][created_at] = num_stars
except Exception, e:
print e
except Exception, e:
print e
Google Big Query SQL Command:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
AND repository_owner = "mojombo"
AND repository_name = "grit"
ORDER BY created_at
I am really stumped so any advice at this point would be greatly appreciated.

If most of your BigQuery queries only scan a subset of the data, you can do one initial query to pull out that subset (use "Allow Large Results"). Then subsequent queries against your small table will cost less.
For example, if you're only querying records where type = "WatchEvent", you can run a query like this:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
And set a destination table as well as the "Allow Large Results" flag. This query will scan the full 13.6 GB, but the output is only 1 GB, so subsequent queries against the output table will only charge you for 1 GB at most.
That still might not be cheap enough for you, but just throwing the option out there.

I found a solution to this problem - Using a database. i imported the relevant data from my 360+GB of JSON data to a MySQL Database and queried that instead. What used to be a 3hour+ query time per element became <10seconds.
MySQL wasn't the easiest thing to set up, and import took approximately ~7.5 hours, but the results made it well worth it for me.

Related

Hierarchical dictionary (reducing memory footprint or using a database)

I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (eg: muscle cell, heart cell), subtypes (eg: a lung dataset can be split into normal lung and cancerous lung), cancer stage (eg: stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
Only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or, should I consider another storage framework (briefly saw something called Redis Hashes)?
One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files similar amounts of time showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write to hdf5, and query
import h5py
import numpy as np
import time
def create_data_dict(level_counts):
"""
Create test data in the same nested-dict format as the diagram you show
The Agg_metric values are random floats between 0 and 1
(you shouldn't need this function since you already have real data in dict format)
"""
if not level_counts:
return {f'Agg_metric_{i+1}':np.random.random() for i in range(num_agg_metrics)}
level,num_groups = level_counts.popitem()
return {f'{level}_{i+1}':create_data_dict(level_counts.copy()) for i in range(num_groups)}
def write_dict_to_hdf5(hdf5_path,d):
"""
Write the nested dictionary to an HDF5 file to act as a database
only have to create this file once, but can then query it any number of times
(unless the data changes)
"""
def _recur_write(f,d):
for k,v in d.items():
#check if the next level is also a dict
sk,sv = v.popitem()
v[sk] = sv
if type(sv) == dict:
#this is a 'node', move on to next level
_recur_write(f.create_group(k),v)
else:
#this is a 'leaf', stop here
leaf = f.create_group(k)
for sk,sv in v.items():
leaf.attrs[sk] = sv
with h5py.File(hdf5_path,'w') as f:
_recur_write(f,d)
def query_hdf5(hdf5_path,search_terms):
"""
Query the hdf5_path with a list of search terms
The search terms must be in the order of the dict, and have a value at each level
Output is a dict of agg stats
"""
with h5py.File(hdf5_path,'r') as f:
k = '/'.join(search_terms)
try:
f = f[k]
except KeyError:
print('oh no! at least one of the search terms wasnt matched')
return {}
return dict(f.attrs)
################
# start #
################
#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
'Gene':40,
'Cell_Type':30,
'Dataset':10,
'Unique_Group':3,
'Metadata':3,
}
#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
'Gene':400,
'Cell_Type':30,
'Dataset':10,
'Unique_Group':3,
'Metadata':3,
}
#Determine which test dataset to use
small_test = True
if small_test:
level_counts = small_level_counts
hdf5_path = 'small_test.hdf5'
else:
level_counts = large_level_counts
hdf5_path = 'large_test.hdf5'
np.random.seed(1)
start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time()-start))
start = time.time()
write_dict_to_hdf5(hdf5_path,data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time()-start))
#Search terms in order of most broad to least
search_terms = ['Metadata_1','Unique_Group_3','Dataset_8','Cell_Type_15','Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path,search_terms)
print('queried in {:.2f} seconds'.format(time.time()-start))
direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage you are likely storing multiple copies of the strings you are using as dictionary keys. From your description of your data structure it is likely that you have 10000 copies of “Agg metric 1”, “Agg metric 2”, etc for every gene in your dataset. It is likely that these duplicate strings are taking up a significant amount of memory. These can be deduplicated with sys.inten so that although you still have as many references to the string in your dictionary, they all point to a single copy in memory. You would only need to make a minimal adjustment to your code by simply changing the assignment to data[sys.intern(‘Agg metric 1’)] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.

JSON multiline file performace tuning

I am trying to read a multiline file using pyspark in databricks and then flatten the file to get respective column which will invariably be stored as a delta table. I have successfully done it for 40 odd files but having performance issue with the last one.
This flatten part is taking 10+ mins to generate but the write part is running for ever.
#Flatten the dataframe to get all keys
def flatten(df):
complex_fields = dict([
(field.name, field.dataType)
for field in df.schema.fields
if isinstance(field.dataType, T.ArrayType) or isinstance(field.dataType, T.StructType)
])
qualify = list(complex_fields.keys())[0] + "_"
while len(complex_fields) != 0:
col_name = list(complex_fields.keys())[0]
if isinstance(complex_fields[col_name], T.StructType):
expanded = [F.col(col_name + '.' + k).alias(col_name + '_' + k)
for k in [ n.name for n in complex_fields[col_name]]
]
df = df.select("*", *expanded).drop(col_name)
elif isinstance(complex_fields[col_name], T.ArrayType):
df=df.withColumn(col_name,F.explode_outer(col_name))
complex_fields = dict([
(field.name, field.dataType)
for field in df.schema.fields
if isinstance(field.dataType, T.ArrayType) or isinstance(field.dataType, T.StructType)
])
for df_col_name in df.columns:
df = df.withColumnRenamed(df_col_name, df_col_name.replace(qualify, ""))
return df
df1=flatten(df)
To increase performance i have increased the cores and executor memory , not sure what else can be done. This is 1 single file but have multiple arrays, looking at the DAG looks like one input record is going to create millions of records in the target file.
Need help if anyone has faced similar issues.
I have attached the sql interpeter that is generated automatically. From stdout log all i can understand is Garbace Collector(GC) is trying to free up space , and also has multiple Full GC running.

Making comparing 2 tables faster (Postgres/SQLAlchemy)

I wrote a code in python to manipulate a table I have in my database. I am doing so using SQL Alchemy. Basically I have table 1 that has 2 500 000 entries. I have another table 2 with 200 000 entries. Basically what I am trying to do, is compare my source ip and dest ip in table 1 with source ip and dest ip in table 2. if there is a match, I replace the ip source and ip dest in table 1 with a data that matches ip source and ip dest in table 2 and I add the entry in table 3. My code also checks if the entry isn't already in the new table. If so, it skips it and then goes on with the next row.
My problem is its extremely slow. I launched my script yesterday and in 24 hours it only went through 47 000 entries out of 2 500 000. I am wondering if there are anyways I can speed up the process. It's a postgres db and I can't tell if the script taking this much time is reasonable or if something is up. If anyone had a similar experience with something like this, how much time did it take before completion ?
Many thanks.
session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
for flow in flows:
if flow.vlan_destination in vlan_list:
usage = session.query(Table2).filter(Table2.ip ==
str(flow.ip_destination)).all()
if len(usage) > 0:
usage = usage[0].usage
else:
usage = str(flow.ip_destination)
usage_ip_src = session.query(Table2).filter(Table2.ip ==
str(flow.ip_source)).all()
if len(usage_ip_src) > 0:
usage_ip_src = usage_ip_src[0].usage
else:
usage_ip_src = str(flow.ip_source)
if flow.protocol == "17":
protocol = func.REPLACE(flow.protocol, "17", 'UDP')
elif flow.protocol == "1":
protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
elif flow.protocol == "6":
protocol = func.REPLACE(flow.protocol, "6", 'TCP')
else:
protocol = flow.protocol
is_in_db = session.query(Table3).filter(Table3.protocol ==
protocol)\
.filter(Table3.application == flow.application)\
.filter(Table3.destination_port == flow.destination_port)\
.filter(Table3.vlan_destination == flow.vlan_destination)\
.filter(Table3.usage_source == usage_ip_src)\
.filter(Table3.state == flow.state)\
.filter(Table3.usage_destination == usage).count()
if is_in_db == 0:
to_add = Table3(usage_ip_src, usage, protocol, flow.application, flow.destination_port,
flow.vlan_destination, flow.state)
session.add(to_add)
session.flush()
session.commit()
print("added " + str(i))
else:
print("usage already in DB")
i = i + 1
session.close()
EDIT As requested, here are more details : Table 1 has 11 columns, the two we are interested in are source ip and dest ip.
Table 1
Here, I have Table 2 :Table 2. It has an IP and a Usage. What my script is doing is that it takes source ip and dest ip from table one and looks up if there is a match in Table 2. If so, it replaces the ip address by usage, and adds this along with some of the columns of Table 1 in Table 3 :[Table3][3]
Along doing this, when adding the protocol column into Table 3, it writes the protocol name instead of the number, just to make it more readable.
EDIT 2 I am trying to think about this differently, so I did a diagram of my problem Diagram (X problem)
What I am trying to figure out is if my code (Y solution) is working as intended. I've been coding in python for a month only and I feel like I am messing something up. My code is supposed to take every row from my Table 1, compare it to Table 2 and add data to table 3. My Table one has over 2 million entries and it's understandable that it should take a while but its too slow. For example, when I had to load the data from the API to the db, it went faster than the comparisons im trying to do with everything that is already in the db. I am running my code on a virtual machine that has sufficient memory so I am sure it's my code that is lacking and I need direction to as what can be improved. Screenshots of my tables:
Table 2
Table 3
Table 1
EDIT 3 : Postgresql QUERY
SELECT
coalesce(table2_1.usage, table1.ip_source) AS coalesce_1,
coalesce(table2_2.usage, table1.ip_destination) AS coalesce_2,
CASE table1.protocol WHEN %(param_1) s THEN %(param_2) s WHEN %(param_3) s THEN %(param_4) s WHEN %(param_5) s THEN %(param_6) s ELSE table1.protocol END AS anon_1,
table1.application AS table1_application,
table1.destination_port AS table1_destination_port,
table1.vlan_destination AS table1_vlan_destination,
table1.state AS table1_state
FROM
table1
LEFT OUTER JOIN table2 AS table2_2 ON table2_2.ip = table1.ip_destination
LEFT OUTER JOIN table2 AS table2_1 ON table2_1.ip = table1.ip_source
WHERE
table1.vlan_destination IN (
%(vlan_destination_1) s,
%(vlan_destination_2) s,
%(vlan_destination_3) s,
%(vlan_destination_4) s,
%(vlan_destination_5) s
)
AND NOT (
EXISTS (
SELECT
1
FROM
table3
WHERE
table3.usage_source = coalesce(table2_1.usage, table1.ip_source)
AND table3.usage_destination = coalesce(table2_2.usage, table1.ip_destination)
AND table3.protocol = CASE table1.protocol WHEN %(param_1) s THEN %(param_2) s WHEN %(param_3) s THEN %(param_4) s WHEN %(param_5) s THEN %(param_6) s ELSE table1.protocol END
AND table3.application = table1.application
AND table3.destination_port = table1.destination_port
AND table3.vlan_destination = table1.vlan_destination
AND table3.state = table1.state
)
)
Given the current question, I think this at least comes close to what you might be after. The idea is to perform the entire operation in the database, instead of fetching everything – the whole 2,500,000 rows – and filtering in Python etc.:
from sqlalchemy import func, case
from sqlalchemy.orm import aliased
def newhotness(session, vlan_list):
# The query needs to join Table2 twice, so it has to be aliased
dst = aliased(Table2)
src = aliased(Table2)
# Prepare required SQL expressions
usage = func.coalesce(dst.usage, Table1.ip_destination)
usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
protocol = case({"17": "UDP",
"1": "ICMP",
"6": "TCP"},
value=Table1.protocol,
else_=Table1.protocol)
# Form a query producing the data to insert to Table3
flows = session.query(
usage_ip_src,
usage,
protocol,
Table1.application,
Table1.destination_port,
Table1.vlan_destination,
Table1.state).\
outerjoin(dst, dst.ip == Table1.ip_destination).\
outerjoin(src, src.ip == Table1.ip_source).\
filter(Table1.vlan_destination.in_(vlan_list),
~session.query(Table3).
filter_by(usage_source=usage_ip_src,
usage_destination=usage,
protocol=protocol,
application=Table1.application,
destination_port=Table1.destination_port,
vlan_destination=Table1.vlan_destination,
state=Table1.state).
exists())
stmt = insert(Table3).from_select(
["usage_source", "usage_destination", "protocol", "application",
"destination_port", "vlan_destination", "state"],
flows)
return session.execute(stmt)
If the vlan_list is selective, or in other words filters out most rows, this will perform a lot less operations in the database. Depending on the size of Table2 you may benefit from indexing Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If some column of the ones used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT using the NOT EXISTS subquery expression (which PostgreSQL will perform as an antijoin). If there is a possibility that the flows query may produce duplicates, add a call to Query.distinct() to it.

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one from every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. It's, unfortunately, a lot of overhead / duplicate data, so I want to parse out the relevant stuff to do visualizations on it.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below it performs really slow. It has been running for over 1,5 hours now (on my 2019 Macbook Pro 15') and isn't finished yet.
How can I optimize / improve this python parser?
Or should I reduce the number of files, and i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np
directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
try:
# Uncompress and parse protobuff-file using gtfs_realtime_pb2
with gzip.open(directory + filename, 'rb') as file:
response = file.read()
feed = gtfs_realtime_pb2.FeedMessage()
feed.ParseFromString(response)
print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
for trip in feed.entity:
if trip.trip_update.trip.trip_id not in read_trips:
try:
if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
read_trips.add(trip.trip_update.trip.trip_id)
except KeyError:
continue
else:
continue
except OSError:
continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iterating to using two iterators zipping through the list of updates, this is a little more advanced style of working through a list. Plus the cur/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I remove the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
this_trip_update = trip.trip_update
this_trip_id = this_trip_update.trip.trip_id
if this_trip_id not in read_trips:
try:
if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
this_trip_id)
# create two iterators to walk through the list of updates
cur_updates = iter(this_trip_update.stop_time_update)
nxt_updates = iter(this_trip_update.stop_time_update)
# advance the nxt_updates iter so it is one ahead of cur_updates
next(nxt_updates)
for cur_update, next_update in zip(cur_updates, nxt_updates):
# Store the delay data point (arrival difference of two ascending nodes)
delay = int(nxt_update.arrival.delay - cur_update.arrival.delay)
# Store contextual metadata (timestamp and edgeID) for the unique delay data point
ts = int(next_update.arrival.time)
key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
# Append data to numpy array
datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
read_trips.add(this_trip_id)
except KeyError:
continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

Python and MYSQL performance: Writing a massive number of SQL query results to file

I have a file containing a dictionary of values on each line that I grab and use to query a mysql database using each key as a query. The results of each query get placed in a dict and once all values for the query dict have been generated the line gets written out.
IN > foo bar someotherinfo {1: 'query_val', 2: 'query_val', 3: 'query_val'
OUT > foo bar someotherinfo 1_result 2_result 3_result
This whole process appears to be somewhat slow because I'm performing around 200,000 mysql queries per file and have around 10 files per sample and around 30 samples in total, so I'm looking to speed up the whole process.
I'm just wondering if the fileIO could be creating a bottleneck. Instead of writing the line_info (foo,bar,somblah) followed by each result dict as it's returned, would I be better of chunking these results into memory before writing them to file in batches?
Or is this simply a case of having to just wait it out... ?
Example Input line and output line
INPUT
XM_006557349.1 1 - exon XM_006557349.1_exon_2 10316 10534 {1: 10509:10534', 2: '10488:10508', 3: '10467:10487', 4: '10446:10466', 5: '10425:10445', 6: '10404:10424', 7: '10383:10403', 8: '10362:10382', 9: '10341:10361', 10: '10316:10340'}
OUTPUT
XM_006557349.1 1 - exon XM_006557349.1_exon_2 10316 105340.7083 0.2945 0.2 0.2931 0.125 0.1154 0.2095 0.5833 0.0569 0.0508
CODE
def array_2_meth(sample,bin_type,type,cur_meth):
bins_in = open('bin_dicts/'+bin_type,'r')
meth_out = open('meth_data/'+bin_type+'_'+sample+'_plus_'+type+'_meth.tsv','w')
for line in bins_in.readlines():
meth_dict = {}
# build array of data from each line
array = line.strip('\n').split('\t')
mrna_id = array[0]
assembly = array[1]
strand = array[2]
bin_dict = ast.literal_eval(array[7])
for bin in bin_dict:
coords = bin_dict[bin].split(':')
start = int(coords[0]) -1
end = int(coords[1]) +1
cur_meth.execute('select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5')
for row in cur_meth.fetchall():
if str(row[0]) == 'None':
meth_dict[bin] = 'no_cov'
else:
meth_dict[bin] = float(row[0])
meth_out.write('\t'.join(array[:7]))
for k in sorted(meth_dict.keys()):
meth_out.write('\t'+str(meth_dict[k]))
meth_out.write('\n')
meth_out.close()
Not sure if adding this code is going to be a massive help, but it should show the way I'm approaching this.. Any advice you could provide on mistakes I'm making in my approach or tips on how to optimise would be greatly appreciated!!!
Thanks ^_^
I think the fileIO shouldn't take too long, the main bottleneck is probably the amount of queries you are making. But from the example you provide I see no pattern in those start and end position so I have no idea how to cut down the amount of queries you are making.
I have a probably amazing or stupid ideas depending on your test results.(also i don't know shxt about python so ignore the syntax haha)
it SEEMS that every query will only return a single value?
maybe you could try something like
SQL = ''
for bin in bin_dict:
coords = bin_dict[bin].split(':')
start = int(coords[0]) -1
end = int(coords[1]) +1
SQL += 'select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5'
SQL += 'UNION ALL'
//somehow remove the last UNION ALL at end of loop
cur_meth.execute(str(SQL))
for row in cur_meth.fetchall():
//loop through the 10 row array and write to file
The core idea is to use UNION ALL to join all queries into 1, and thus you'll only need to do 1 transaction instead of 10 shown in your example. You also reduce the 10 write to file action into 1. The possible drawback is that UNION ALL might be slow, but as far as I know it shouldn't take anymore processing time then 10 individual queries as long as you keep the SQL format in my example.
The second obvious method is to do it multi-thread. if you are not using all your processing power of your machine, you could probably try to start multiple script/program at the same time as all you do is query data and doesn't modify anything. This would cause individual script slightly slower but overall faster as it should reduce wait time between queries.

Categories

Resources