I have this code in Pylons that calculates the network usage of the Linux system the webapp runs on. To calculate network utilization, we read /proc/net/dev twice, which gives us the amount of transmitted data, and divide the difference between the two readings by the time elapsed between the reads.
I don't want to do this calculation at regular intervals. There is JavaScript code which periodically fetches this data, so the transfer rate is the average number of bytes transmitted between two requests per unit of time. In Pylons, I used pylons.app_globals to store the reading, which is then subtracted from the next reading on the subsequent request. But apparently there is no app_globals in Pyramid, and I'm not sure whether using thread locals is the correct course of action. Also, although request.registry.settings is apparently shared across all requests, I'm reluctant to store my data there, since the name implies it should only hold the settings.
import re
import string
import time


def netUsage():
    # app_globals is Pylons' pylons.app_globals; humanReadable() formats a byte rate
    netusage = {'rx': 0, 'tx': 0, 'time': time.time()}
    rtn = {}
    net_file = open('/proc/net/dev')
    # skip the two header lines, then sum received/transmitted bytes over every interface except lo
    for line in net_file.readlines()[2:]:
        tmp = map(string.atof, re.compile(r'\d+').findall(line[line.find(':'):]))
        if line[:line.find(':')].strip() == "lo":
            continue
        netusage['rx'] += tmp[0]
        netusage['tx'] += tmp[8]
    net_file.close()
    rx = netusage['rx'] - app_globals.prevNetusage['rx'] if app_globals.prevNetusage['rx'] else 0
    tx = netusage['tx'] - app_globals.prevNetusage['tx'] if app_globals.prevNetusage['tx'] else 0
    elapsed = netusage['time'] - app_globals.prevNetusage['time']
    rtn['rx'] = humanReadable(rx / elapsed)
    rtn['tx'] = humanReadable(tx / elapsed)
    app_globals.prevNetusage = netusage
    return rtn
#memorize(duration = 3)
def getSysStat():
    memTotal, memUsed = getMemUsage()
    net = netUsage()
    loadavg = getLoadAverage()
    return {'cpu': getCPUUsage(),
            'mem': int((memUsed / memTotal) * 100),
            'load1': loadavg[0],
            'load5': loadavg[1],
            'load15': loadavg[2],
            'procNum': loadavg[3],
            'lastProc': loadavg[4],
            'rx': net['rx'],
            'tx': net['tx']}
Using request thread locals is considered bad design and should not be abused, according to the official Pyramid docs.
My advice is to use a simple key-value store like memcached or Redis if possible.
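For example, here is a minimal sketch of that idea with Redis (the key name prev_netusage and the helper functions are illustrative, not part of your code):

import json

import redis

# Sketch only: assumes a Redis server on localhost; "prev_netusage" is a made-up key name.
r = redis.StrictRedis(host='localhost', port=6379, db=0)

def save_reading(reading):
    # store the latest /proc/net/dev snapshot so the next request can diff against it
    r.set('prev_netusage', json.dumps(reading))

def load_previous_reading():
    raw = r.get('prev_netusage')
    return json.loads(raw) if raw else None

An external store like this also keeps the previous reading shared across worker processes, which module-level globals or thread locals would not.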
I have a nested for-loop in Python that creates a netCDF file. The loop takes a pandas DataFrame with time, lat, lon, and parameter columns and writes the parameters into the netCDF file at the correct location and time. This is taking too long, since the DataFrame has more than 80000 rows and the netCDF file has around 8000 time steps. I've looked at using either xargs or multiprocessing, but the first takes files as inputs and the second produces as many outputs as the processes I use. I have no experience with parallel processing, so my assumptions may be completely wrong. This is the code I am using:
with Dataset(os.path.join('Downloads', inv, 'observations.nc'), 'w') as dset:
    dset.createDimension('time_components', 6)
    groups = ['obs', 'mix_apri', 'mix_apos', 'mix_background']
    for group in groups:
        dset.createGroup(group)
        dset[group].createDimension('nt', 8760)
        dset[group].createDimension('nlat', 80)
        dset[group].createDimension('nlon', 100)
        times_start = dset[group].createVariable('times_start', 'i4', ('nt', 'time_components'))
        times_end = dset[group].createVariable('times_end', 'i4', ('nt', 'time_components'))
        lats = dset[group].createVariable('lats', 'f4', ('nlat'))
        lons = dset[group].createVariable('lons', 'f4', ('nlon'))
        times_start[:,:] = list(emis_apri['biosphere']['times_start'])
        times_end[:,:] = list(emis_apri['biosphere']['times_end'])
        lats[:] = list(emis_apri['biosphere']['lats'])
        lons[:] = list(emis_apri['biosphere']['lons'])
    conc_obs = dset['obs'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    conc_mix_apri = dset['mix_apri'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    conc_mix_apos = dset['mix_apos'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    conc_mix_background = dset['mix_background'].createVariable('conc', 'f8', ('nt', 'nlat', 'nlon'))
    for i in range(8760):
        conc_obs[i,:,:] = emis_apri['biosphere']['emis'][i][:,:]*0
    conc_mix_apri[:,:,:] = list(conc_obs)
    conc_mix_apos[:,:,:] = list(conc_obs)
    conc_mix_background[:,:,:] = list(conc_obs)
    db = obsdb(os.path.join('Downloads', inv, 'observations.apos.tar.gz'))
    nsites = db.sites.shape[0]
    for isite, site in enumerate(db.sites.itertuples()):
        dbs = db.observations.loc[db.observations.site == site.Index]
        lat = where((array(emis_apri['biosphere']['lats']) >= list(dbs.lat)[0]-0.25) & (array(emis_apri['biosphere']['lats']) <= list(dbs.lat)[0]+0.25))[0][0]
        lon = where((array(emis_apri['biosphere']['lons']) >= list(dbs.lon)[0]-0.25) & (array(emis_apri['biosphere']['lons']) <= list(dbs.lon)[0]+0.25))[0][0]
        for i in range(len(list(dbs.time))):
            for j in range(len(times_start)):
                if datetime(*times_start[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]) and datetime(*times_end[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]):
                    conc_obs[i,lat,lon] = list(dbs.obs)[i]
                    conc_mix_apri[i,lat,lon] = list(dbs.mix_apri)[i]
                    conc_mix_apos[i,lat,lon] = list(dbs.mix_apos)[i]
                    conc_mix_background[i,lat,lon] = list(dbs.mix_background)[i]
The part of the code I need to parallelize starts at for isite, site in enumerate(db.sites.itertuples()):. I would really appreciate any insights on this.
Consider the following as pseudocode, since I cannot run any tests without sample data. I have usually parallelized my code with mpi4py, and in your case you could do the following at the beginning:
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()  # let your program know how many processors you are using
rank = comm.Get_rank()  # let the running program know which processor it is
Now, at the beginning of the code, let one of the processes be the so-called master task, which does all the basic/important work that cannot be done simultaneously by all the tasks, for example opening and initializing the output file. For those parts of your code you can use:
if rank == 0:
    # do some important stuff
    pass
else:
    # do something not important (for example a = 5)
    pass
comm.barrier()  # this is important to synchronize the processes
Now, to parallelize your code, loop over a distributed portion of db.sites, i.e. divide db.sites.itertuples() over the number of processors you are going to use:
allsites = list(db.sites.itertuples())  # every process has to know all the sites
sites = allsites[rank::size]            # each one starts from its own rank and jumps by size
for isite, site in enumerate(sites):
    dbs = db.observations.loc[db.observations.site == site.Index]
    lat = where((array(emis_apri['biosphere']['lats']) >= list(dbs.lat)[0]-0.25) & (array(emis_apri['biosphere']['lats']) <= list(dbs.lat)[0]+0.25))[0][0]
    lon = where((array(emis_apri['biosphere']['lons']) >= list(dbs.lon)[0]-0.25) & (array(emis_apri['biosphere']['lons']) <= list(dbs.lon)[0]+0.25))[0][0]
    for i in range(len(list(dbs.time))):
        for j in range(len(times_start)):
            if datetime(*times_start[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]) and datetime(*times_end[j,:].data) >= Timestamp.to_pydatetime(list(dbs.time)[i]):
                conc_obs[i,lat,lon] = list(dbs.obs)[i]
                conc_mix_apri[i,lat,lon] = list(dbs.mix_apri)[i]
                conc_mix_apos[i,lat,lon] = list(dbs.mix_apos)[i]
                conc_mix_background[i,lat,lon] = list(dbs.mix_background)[i]
comm.barrier()  # do not forget to synchronize
Nevertheless, "isite" now takes its value from the size of the local list you pass in: instead of running over 0...len(allsites), it runs over 0...len(allsites)/size. If it is important for "isite" to range from 0 to len(allsites), you have to recalculate it, for example isite_global = isite*size + rank, to get the actual site number this process is working on.
Finally, to run the code I usually do:
mpiexec -np 10 ipython script_name
at the terminal to run the code on 10 processors.
In any case, the hardest part is parallelizing the I/O without specific support from the library. I am not sure that netCDF4 supports parallel I/O, in the sense that if your processes with ranks 0...X open the file simultaneously, each writes to its own location, and then they close the file, the data from all the processes will actually end up written there.
Therefore, the safest approach is to let one of the processes (the master) be responsible for the output and to collect the data that needs to be written from all the other processes before writing.
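As a rough sketch of that idea (still pseudocode in the spirit of the above; the local_updates list and the tuple layout are mine, not from the original code):

local_updates = []                      # each process collects (i, lat, lon, values...) tuples
for isite, site in enumerate(sites):
    # ... same per-site work as above, but append the values to local_updates
    # instead of writing them into the netCDF variables directly
    pass

gathered = comm.gather(local_updates, root=0)   # rank 0 receives one list per process

if rank == 0:
    # only the master touches the netCDF file
    for updates in gathered:
        for i, lat, lon, obs, apri, apos, background in updates:
            conc_obs[i, lat, lon] = obs
            conc_mix_apri[i, lat, lon] = apri
            conc_mix_apos[i, lat, lon] = apos
            conc_mix_background[i, lat, lon] = background
comm.barrier()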
Hope this helps, good luck with the code!
I've recently started using PyModbus and have found it very easy to do basic polling with its ModbusTcpClient and read_holding_registers function.
I'm now interested in the best way to structure a more complex logger: non-consecutive registers, different function codes, different endian encodings, and so on.
For example, to avoid a separate read_holding_registers call for each tag of a device, I have built a function that groups all consecutive tag registers to reduce the number of calls.
I'm planning to implement something similar for BinaryPayloadDecoder: group registers with the same byteorder and wordorder to reduce the number of decoder instances.
def polldevicesfast(client, device, taglist):
    # loop through tags, order by address, group consecutive addresses into single reads,
    # merge the resulting lists, then decode
    orderedtaglist = sorted(taglist, key=lambda i: i['address'])
    callgroups = sorttogroups(orderedtaglist)
    allreturns = []
    results = []
    for acall in callgroups:
        areturn = client.read_holding_registers(acall['start'], (1 + (acall['end'] - acall['start'])), unit=device['device_id'])
        allreturns = allreturns + areturn.registers
    decoder = BinaryPayloadDecoder.fromRegisters(allreturns, byteorder=Endian.Big, wordorder=Endian.Big)
    for tag in orderedtaglist:
        results.append({'tagname': tag['name'], 'value': str(tag['autoScaling']['slope'] * mydecoder(tag['dataType'], decoder)), 'unit': tag['unit']})
    client.close()
    return results
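For reference, the grouping helper works roughly like this (a simplified sketch rather than the exact implementation; it ignores tags whose data types span several registers):

def sorttogroups(orderedtaglist, maxgap=0):
    # walk the address-sorted tags and start a new group whenever the next
    # address is not contiguous with the current block
    if not orderedtaglist:
        return []
    groups = []
    start = end = orderedtaglist[0]['address']
    for tag in orderedtaglist[1:]:
        if tag['address'] <= end + 1 + maxgap:
            end = max(end, tag['address'])
        else:
            groups.append({'start': start, 'end': end})
            start = end = tag['address']
    groups.append({'start': start, 'end': end})
    return groups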
None of this is extremely complicated; it just seems like there should already be an accepted standard or template for this somewhere, but I can't find one in any of the documentation online.
I have written a script that makes around 530 API calls, which I intend to run every 5 minutes; from these calls I store data to process in bulk later (prediction etc.).
The API has a limit of 500 requests per second. However, when running my code I am seeing about 2 seconds per call (due to SSL, I believe).
How can I speed this up so that I can make the 500 requests within 5 minutes? The current time required renders the data I am collecting useless :(
Code:
def getsurge(lat, long):
    response = client.get_price_estimates(
        start_latitude=lat,
        start_longitude=long,
        end_latitude=-34.063676,
        end_longitude=150.815075
    )
    result = response.json.get('prices')
    return result


def writetocsv(database):
    database_writer = csv.writer(database)
    database_writer.writerow(HEADER)
    pool = Pool()
    # Open Estimate Database
    while True:
        for data in coordinates:
            line = data.split(',')
            long = line[3]
            lat = line[4][:-2]
            estimate = getsurge(lat, long)
            timechecked = datetime.datetime.now()
            for d in estimate:
                if d['display_name'] == 'TAXI':
                    database_writer.writerow([timechecked, [line[0], line[1]], d['surge_multiplier']])
                    database.flush()
                    print(timechecked, [line[0], line[1]], d['surge_multiplier'])
Is the API under your control? If so, create an endpoint that can give you all the data you need in one go.
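If it isn't, one option is to issue the calls concurrently so the per-call SSL latency overlaps instead of adding up. A rough sketch using a thread pool (your code already creates a Pool it never uses; multiprocessing.dummy provides the same interface backed by threads, so the API client object is shared rather than pickled):

from multiprocessing.dummy import Pool  # thread-based Pool with the multiprocessing API

def fetch(data):
    # same per-coordinate work as in the loop above
    line = data.split(',')
    long = line[3]
    lat = line[4][:-2]
    return line, getsurge(lat, long)

pool = Pool(20)                          # 20 concurrent calls; tune to stay under the rate limit
results = pool.map(fetch, coordinates)   # list of (line, estimate) pairs
pool.close()
pool.join()

The CSV writing can then stay in a single loop over results, which also keeps the csv writer out of the threads.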
I'm trying to analyze a large amount of GitHub Archive data and am stumped by many limitations.
My analysis requires searching a 350 GB dataset. I have a local copy of the data, and a copy is also available via Google BigQuery. The local dataset is split into 25000 individual files. The dataset is a timeline of events.
I want to plot the number of stars each repository has had since its creation (only for repos that currently have more than 1000).
I can get this result very quickly using Google BigQuery, but each query "analyzes" 13.6 GB of data. This limits me to fewer than 75 queries before having to pay $5 per additional 75.
My other option is to search through my local copy, but searching every file for a specific string (the repository name) takes far too long: it took over an hour on an SSD to get through half the files before I killed the process.
What is a better way to approach analyzing such a large amount of data?
Python Code for Searching Through all Local Files:
for yy in range(11,15):
    for mm in range(1,13):
        for dd in range(1,32):
            for hh in range(0,24):
                counter = counter + 1
                if counter < startAt:
                    continue
                if counter > stopAt:
                    continue
                #print counter
                strHH = str(hh)
                strDD = str(dd)
                strMM = str(mm)
                strYY = str(yy)
                if len(strDD) == 1:
                    strDD = "0" + strDD
                if len(strMM) == 1:
                    strMM = "0" + strMM
                #print strYY + "-" + strMM + "-" + strDD + "-" + strHH
                try:
                    f = json.load(open("/Volumes/WD_1TB/GitHub Archive/20"+strYY+"-"+strMM+"-"+strDD+"-"+strHH+".json", 'r'), cls=ConcatJSONDecoder)
                    for each_event in f:
                        if each_event["type"] == "WatchEvent":
                            try:
                                num_stars = int(each_event["repository"]["watchers"])
                                created_at = each_event["created_at"]
                                json_entry[4][created_at] = num_stars
                            except Exception, e:
                                print e
                except Exception, e:
                    print e
Google BigQuery SQL command:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
AND repository_owner = "mojombo"
AND repository_name = "grit"
ORDER BY created_at
I am really stumped, so any advice at this point would be greatly appreciated.
If most of your BigQuery queries only scan a subset of the data, you can do one initial query to pull out that subset (use "Allow Large Results"). Then subsequent queries against your small table will cost less.
For example, if you're only querying records where type = "WatchEvent", you can run a query like this:
SELECT repository_owner, repository_name, repository_watchers, created_at
FROM [githubarchive:github.timeline]
WHERE type = "WatchEvent"
And set a destination table as well as the "Allow Large Results" flag. This query will scan the full 13.6 GB, but the output is only 1 GB, so subsequent queries against the output table will only charge you for 1 GB at most.
That still might not be cheap enough for you, but just throwing the option out there.
I found a solution to this problem: using a database. I imported the relevant data from my 360+ GB of JSON into a MySQL database and queried that instead. What used to be a 3+ hour query time per element became less than 10 seconds.
MySQL wasn't the easiest thing to set up, and the import took approximately 7.5 hours, but the results made it well worth it for me.
I am trying to simulate a situation where we have 5 machines arranged as 1 -> 3 -> 1, i.e. the 3 in the middle operate in parallel to reduce the effective time they take.
I can easily simulate this by creating a SimPy resource with a capacity of three, like this:
simpy.Resource(env, capacity=3)
However, in my situation each of the three resources operates slightly differently, and sometimes I want to be able to use any of them (when I'm operating) or book a specific one (when I want to clean). Basically, the three machines slowly foul up at different rates and operate more slowly as they do; I want to simulate this and also trigger a clean when one gets too dirty.
I have tried a few ways of simulating this but have run into problems and issues every time.
The first was to have the resource booking also set one of three global flags for the machines (A, B, C), plus a flag telling the process which machine it was using. This works, but it's not clean and makes it really difficult to understand what is going on, with huge if statements everywhere.
The second was to model the machines as three separate resources and then try to wait for and request whichever of the three becomes available, with something like:
reqA = A.res.request()
reqB = B.res.request()
reqC = C.res.request()
unitnumber = yield reqA | reqB | reqC
yield env.process(batch_op(env, name, machineA, machineB, machineC, unitnumber))
But this doesn't work, and I can't work out the best way to yield on one of a choice of requests.
What would be the best way to simulate this scenario? For completeness, here is what I'm looking for:
Request any of the 3 machines
Request a specific machine
Have each machine track its history
Have each machine's characteristics be different, i.e. one fouls up faster but works faster initially
Detect and schedule a clean based on performance or an indicator
This is what I have so far in my latest attempt at modelling each machine as a separate resource:
class Machine(object):
    def __init__(self, env, cycletime, cleantime, k1foul, k2foul):
        self.env = env
        self.res = simpy.Resource(env, 1)
        self.cycletime = cycletime
        self.cleantime = cleantime
        self.k1foul = k1foul
        self.k2foul = k2foul
        self.batchessinceclean = 0

    def operate(self):
        # each batch since the last clean slows the cycle down a little more (fouling)
        self.cycletime = self.cycletime + self.k2foul * np.log(self.k1foul * self.batchessinceclean + 1)
        self.batchessinceclean += 1
        yield self.env.timeout(self.cycletime)

    def clean(self):
        print('begin cleaning at %s' % self.env.now)
        self.batchessinceclean = 0
        yield self.env.timeout(self.cleantime)
        print('finished cleaning at %s' % self.env.now)
You should try (Filter)Store:
import simpy


def user(machine):
    m = yield machine.get()
    print(m)
    yield machine.put(m)

    m = yield machine.get(lambda m: m['id'] == 1)
    print(m)
    yield machine.put(m)

    m = yield machine.get(lambda m: m['health'] > 98)
    print(m)
    yield machine.put(m)


env = simpy.Environment()
machine = simpy.FilterStore(env, 3)
machine.put({'id': 0, 'health': 100})
machine.put({'id': 1, 'health': 95})
machine.put({'id': 2, 'health': 97.2})

env.process(user(machine))
env.run()
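To map this back onto your Machine class: put the machine objects themselves into the FilterStore, so "any machine" is a plain get() and "a specific machine" (e.g. for cleaning) is a get() with a filter on one of its attributes. A rough sketch combining the two (the id and history attributes and the parameter values are illustrative):

import numpy as np
import simpy


class Machine(object):
    def __init__(self, env, mid, cycletime, cleantime, k1foul, k2foul):
        self.env = env
        self.id = mid
        self.cycletime = cycletime
        self.cleantime = cleantime
        self.k1foul = k1foul
        self.k2foul = k2foul
        self.batchessinceclean = 0
        self.history = []                 # track what happened on this machine and when

    def operate(self):
        # fouling slows the machine down a little more after every batch
        self.cycletime += self.k2foul * np.log(self.k1foul * self.batchessinceclean + 1)
        self.batchessinceclean += 1
        self.history.append(('batch', self.env.now))
        yield self.env.timeout(self.cycletime)

    def clean(self):
        self.batchessinceclean = 0
        self.history.append(('clean', self.env.now))
        yield self.env.timeout(self.cleantime)


def batch(env, store):
    machine = yield store.get()                        # any free machine
    yield env.process(machine.operate())
    yield store.put(machine)


def cleaner(env, store, mid):
    machine = yield store.get(lambda m: m.id == mid)   # book one specific machine
    yield env.process(machine.clean())
    yield store.put(machine)


env = simpy.Environment()
store = simpy.FilterStore(env, capacity=3)
for i in range(3):
    store.put(Machine(env, i, cycletime=5 + i, cleantime=10, k1foul=0.5, k2foul=0.2))
for _ in range(4):
    env.process(batch(env, store))
env.process(cleaner(env, store, mid=1))
env.run()

Scheduling a clean when a machine gets too dirty can then be a monitor process that does a filtered get() on, say, batchessinceclean above some threshold.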