I am using the nidaqmx-python library to acquire data. Is it possible to access an existing task that is already defined in NI MAX?
My solution, thanks to the tip from #nekomatic, is:
import nidaqmx

system = nidaqmx.system.System.local()  # load local system
task_names = system.tasks.task_names  # returns a list of task names
task = system.tasks[0]  # select the first task
loaded_task = task.load()  # load the task
sent_samples = []  # list for saving acquired data

with loaded_task:
    loaded_task.timing.cfg_samp_clk_timing(
        rate=2560,
        sample_mode=nidaqmx.constants.AcquisitionType.CONTINUOUS,
        samps_per_chan=1000)

    def callback(task_handle, every_n_samples_event_type,
                 number_of_samples, callback_data):
        """
        Callback function.
        """
        print('Every N Samples callback invoked.')
        samples = loaded_task.read(number_of_samples_per_channel=2560)
        sent_samples.extend(samples)
        return 0

    loaded_task.register_every_n_samples_acquired_into_buffer_event(
        200, callback)
    loaded_task.start()
    input('Running task. Press Enter to stop.\n')
I have a task function like this:
def task(s):
    # do something
    return res
The original program is:
res = []
for i in data:
    res.append(task(i))
    # use pickle to save res every 30 s
I need to process a lot of data and I don't care about the order of the results. Because the run takes a long time, I need to save the current progress regularly. Now I want to change it to use multiprocessing:
from multiprocessing import Pool

pool = Pool(4)
status = []
res = []
for i in data:
    status.append(pool.apply_async(task, (i,)))
for i in status:
    res.append(i.get())
    # use pickle to save res every 30 s
Suppose I have processes p0, p1, p2, p3 in the Pool and 10 tasks (task(0) ... task(9)), and p0 takes a very long time to finish task(0).
1. Will the main process be blocked at the first "res.append(i.get())"?
2. If p1 finishes task(1) while p0 is still working on task(0), will p1 continue with task(4) or a later task?
3. If the answer to the first question is yes, how can I get the other results first and the result of task(0) last?
I updated my code, but the main process blocked somewhere while the other processes were still working on tasks. What's wrong? Here is the core of the code:
with concurrent.futures.ProcessPoolExecutor(4) as ex:
    for i in self.inBuffer:
        futuresList.append(ex.submit(warpper, i))
    for i in concurrent.futures.as_completed(futuresList):
        (word, r) = i.result()
        self.resDict[word] = r
        self.logger.info("{} --> {}".format(word, r))
        cur = datetime.now()
        if (cur - self.timeStmp).total_seconds() > 30:
            self.outputPickle()
            self.timeStmp = datetime.now()
The length of self.inBuffer is about 100000. self.logger.info writes the info to a log file. For some special inputs i, the warpper function prints auxiliary information with print. self.resDict is a dict that stores the results. self.outputPickle() writes a .pkl file using pickle.dump.
At first the code ran normally: the log file was updated and warpper printed its messages. But at some point I found that the log file had not been updated for a long time (several hours, although a single warpper call should not take more than 120 s), while warpper was still printing information (it printed about 100 messages without any update to the log file before I killed the process). The time stamp of the output .pkl file did not change either. Here is the implementation of outputPickle():
def outputPickle(self):
    if os.path.exists(os.path.join(self.wordDir, self.outFile)):
        if os.path.exists(os.path.join(self.wordDir, "{}_backup".format(self.outFile))):
            os.remove(os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
        shutil.copy(os.path.join(self.wordDir, self.outFile), os.path.join(self.wordDir, "{}_backup".format(self.outFile)))
    with open(os.path.join(self.wordDir, self.outFile), 'wb') as f:
        pickle.dump(self.resDict, f)
Then I added three print calls:
print("getting res of something")
(word, r) = i.result()
print("finishing i.result")
self.resDict[word] = r
print("finished getting res of {}".format(word))
Here is the log:
getting res of something
finishing i.result
finished getting res of CNICnanotubesmolten
getting res of something
finishing i.result
finished getting res of CNN0
getting res of something
message by warpper
message by warpper
message by warpper
message by warpper
message by warpper
The message "message by warpper" is printed at most once per call to warpper.
1. Yes. i.get() blocks until that particular task has finished.
2. Yes, because the tasks are submitted asynchronously. p1 (or any other free worker) picks up the next pending task whenever there are more tasks than worker processes.
"... how to get other results in advance"
One convenient option is concurrent.futures.as_completed, which yields the futures as they complete, so you can collect each result as soon as it is ready:
import time
import concurrent.futures

def func(x):
    time.sleep(3)
    return x ** 2

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with concurrent.futures.ProcessPoolExecutor(4) as ex:
        futures = [ex.submit(func, i) for i in data]
        # process the results as they are completed
        for fut in concurrent.futures.as_completed(futures):
            res = fut.result()
            results.append(res)
            print(res)
Sample output:
4
1
9
16
Another option is to pass a callback to apply_async(func[, args[, kwds[, callback[, error_callback]]]]). The callback accepts a single argument, the value returned by the function. In that callback you can do some minimal processing of the result (keeping in mind that it only sees the result of one particular call). The general scheme looks as follows:
from multiprocessing import Pool

def res_callback(v):
    # ... process the result
    with open('test.txt', 'a') as f:  # just an example
        f.write(str(v))
    print(v, flush=True)

if __name__ == '__main__':
    data = range(1, 5)
    results = []
    with Pool(4) as pool:
        # func is the function defined in the previous example
        tasks = [pool.apply_async(func, (i,), callback=res_callback) for i in data]
        # wait for the tasks to finish before the pool is torn down
        for t in tasks:
            t.wait()
But that scheme still requires waiting for the submitted tasks somehow (for example by calling get() or wait() on each result).
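For the goal stated in the question (collect results in any order and save progress roughly every 30 seconds), the as_completed loop can also do the checkpointing. The following is only a minimal sketch: `run_with_checkpoints`, `save` and `interval` are hypothetical names that do not come from the question, and `save` stands in for whatever writes the pickle file.

```python
import concurrent.futures
import time


def run_with_checkpoints(executor, func, data, save, interval=30.0):
    """Run func over data and call save(results) about every `interval` seconds.

    `save` is a hypothetical callback, for example one that pickles the list.
    """
    futures = [executor.submit(func, item) for item in data]
    results = []
    last_save = time.monotonic()
    # as_completed yields each future as soon as it finishes, in any order,
    # so one slow task does not hold back the results of the others.
    for fut in concurrent.futures.as_completed(futures):
        results.append(fut.result())
        if time.monotonic() - last_save >= interval:
            save(results)
            last_save = time.monotonic()
    save(results)  # final checkpoint once everything has finished
    return results


def square(x):
    # Stand-in for the real task function.
    return x ** 2
```

With a process pool, this would be called under `if __name__ == '__main__':` as `run_with_checkpoints(ex, square, data, save)` inside a `with concurrent.futures.ProcessPoolExecutor(4) as ex:` block. The checkpoint only runs when a result arrives, so a long gap between results also delays the next save.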
I'm trying to change the playback speed in Azure Media Player (AMP).
The following is the URL generated from the Azure APIs: https://ampdemo.azureedge.net/?url=https://testingmedia-usea.streaming.media.azure.net/bbd51d47-cc1a-4515-bac8-4053040f8c58/ignite.ism/manifest(format=mpd-time-cmaf,filter=filter1)&heuristicprofile=lowlatency
If you open that link, there is no playback speed control.
I saw the link below, but I don't know where to apply it in my Python code:
https://amp.azure.net/libs/amp/latest/docs/index.html#amp.player.options.playbackspeed
Below is my code:
from dotenv import load_dotenv
from azure.identity import DefaultAzureCredential
from azure.mgmt.media import AzureMediaServices
from azure.storage.blob import BlobServiceClient
from azure.mgmt.media.models import (
    Asset,
    Transform,
    TransformOutput,
    BuiltInStandardEncoderPreset,
    Job,
    JobInputAsset,
    JobOutputAsset,
    OnErrorType,
    Priority,
    StreamingLocator,
    AssetFilter,
    PresentationTimeRange,
)
import os
import random
#Timer for checking job progress
import time
import requests
#Get environment variables
load_dotenv()
default_credential = DefaultAzureCredential(exclude_shared_token_cache_credential=True)
# Get the environment variables SUBSCRIPTIONID, RESOURCEGROUP and ACCOUNTNAME
subscription_id = os.getenv('SUBSCRIPTIONID')
resource_group = os.getenv('RESOURCEGROUP')
account_name = os.getenv('ACCOUNTNAME')
# The file you want to upload. For this example, put the file in the same folder as this script.
# The file ignite.mp4 has been provided for you.
source_file = "https://testingmedia.blob.core.windows.net/data/ignite.mp4"
#url = requests.get(source_file)
# This is a random string that will be added to the naming of things so that you don't have to keep doing this during testing
uniqueness = "streamAssetFilters-" + str(random.randint(0,9999))
# Change this to your specific streaming endpoint name if not using "default"
streaming_endpoint_name = "default"
# Set the attributes of the input Asset using the random number
in_asset_name = 'inputassetName' + uniqueness
in_alternate_id = 'inputALTid' + uniqueness
in_description = 'inputdescription' + uniqueness
# Create an Asset object
# The asset_id will be used for the container parameter for the storage SDK after the asset is created by the AMS client.
in_asset = Asset(alternate_id=in_alternate_id, description=in_description)
# Set the attributes of the output Asset using the random number
out_asset_name = 'outputassetName' + uniqueness
out_alternate_id = 'outputALTid' + uniqueness
out_description = 'outputdescription' + uniqueness
# Create an output asset object
out_asset = Asset(alternate_id=out_alternate_id, description=out_description)
# The AMS Client
print("Creating AMS Client")
client = AzureMediaServices(default_credential, subscription_id)
# Create an input Asset
print(f"Creating input asset {in_asset_name}")
input_asset = client.assets.create_or_update(resource_group, account_name, in_asset_name, in_asset)
# An AMS asset is a container with a specific id that has "asset-" prepended to the GUID.
# So, you need to create the asset id to identify it as the container
# where Storage is to upload the video (as a block blob)
in_container = 'asset-' + input_asset.asset_id
# create an output Asset
print(f"Creating output asset {out_asset_name}")
output_asset = client.assets.create_or_update(resource_group, account_name, out_asset_name, out_asset)
### Use the Storage SDK to upload the video ###
print(f"Uploading the file {source_file}")
blob_service_client = BlobServiceClient.from_connection_string(os.getenv('STORAGEACCOUNTCONNECTION'))
blob_client = blob_service_client.get_blob_client(in_container, "ignite.mp4")
# working_dir = os.getcwd() + "\Media"
# print(working_dir)
# print(f"Current working directory: {working_dir}")
# upload_file_path = os.path.join(working_dir, source_file)
# print(upload_file_path,"####")
# WARNING: Depending on where you are launching the sample from, the path here could be off, and not include the BasicEncoding folder.
# Adjust the path as needed depending on how you are launching this python sample file.
# Upload the video to storage as a block blob
#with open(url, "rb") as data:
blob_client.upload_blob_from_url(source_file)
transform_name = 'ContentAwareEncodingAssetFilters'
# Create a new Standard encoding Transform for Built-in Copy Codec
print(f"Creating Encoding transform named: {transform_name}")
# For this snippet, we are using 'BuiltInStandardEncoderPreset'
transform_output = TransformOutput(
    preset=BuiltInStandardEncoderPreset(
        preset_name="ContentAwareEncoding"
    ),
    # What should we do with the job if there is an error?
    on_error=OnErrorType.STOP_PROCESSING_JOB,
    # What is the relative priority of this job to others? Normal, high or low?
    relative_priority=Priority.NORMAL
)
print("Creating encoding transform...")
# Adding transform details
my_transform = Transform()
my_transform.description="Transform with Asset filters"
my_transform.outputs = [transform_output]
print(f"Creating transform {transform_name}")
transform = client.transforms.create_or_update(
    resource_group_name=resource_group,
    account_name=account_name,
    transform_name=transform_name,
    parameters=my_transform)
print(f"{transform_name} created (or updated if it existed already). ")
job_name = 'ContentAwareEncodingAssetFilters'+ uniqueness
print(f"Creating custom encoding job {job_name}")
files = (source_file)
# Create Job Input and Output Assets
input = JobInputAsset(asset_name=in_asset_name)
outputs = JobOutputAsset(asset_name=out_asset_name)
# Create the job object and then create transform job
the_job = Job(input=input, outputs=[outputs])
job: Job = client.jobs.create(resource_group, account_name, transform_name, job_name, parameters=the_job)
# Check job state
job_state = client.jobs.get(resource_group, account_name, transform_name, job_name)
# First check
print("First job check")
print(job_state.state)
# Check the state of the job every 10 seconds. Adjust time_in_seconds = <how often you want to check for job state>
def countdown(t):
    while t:
        mins, secs = divmod(t, 60)
        timer = '{:02d}:{:02d}'.format(mins, secs)
        print(timer, end="\r")
        time.sleep(1)
        t -= 1
    job_current = client.jobs.get(resource_group, account_name, transform_name, job_name)
    if(job_current.state == "Finished"):
        print(job_current.state)
        # TODO: Download the output file using blob storage SDK
        return
    if(job_current.state == "Error"):
        print(job_current.state)
        # TODO: Provide Error details from Job through API
        return
    else:
        print(job_current.state)
        countdown(int(time_in_seconds))
time_in_seconds = 10
countdown(int(time_in_seconds))
print(f"Creating locator for streaming...")
# Publish the output asset for streaming via HLS or DASH
locator_name = f"locator-{uniqueness}"
# Create the Asset filters
print("Creating an asset filter...")
asset_filter_name = 'filter1'
# Create the asset filter
asset_filter = client.asset_filters.create_or_update(
    resource_group_name=resource_group,
    account_name=account_name,
    asset_name=out_asset_name,
    filter_name=asset_filter_name,
    parameters=AssetFilter(
        # In this sample, we are going to filter the manifest by the time range of the presentation using the default timescale.
        # You can adjust these settings for your own needs. Note that you can also control output tracks and quality levels with a filter.
        tracks=[],
        # start_timestamp = 100000000 and end_timestamp = 300000000 using the default timescale will generate
        # a play-list that contains fragments from between 10 seconds and 30 seconds of the VoD presentation.
        # If a fragment straddles the boundary, the entire fragment will be included in the manifest.
        presentation_time_range=PresentationTimeRange(start_timestamp=100000000, end_timestamp=300000000)
    )
)
if asset_filter:
    print(f"The asset filter ({asset_filter_name}) was successfully created.")
    print()
else:
    raise ValueError("There was an issue creating the asset filter.")
if output_asset:
    streaming_locator = StreamingLocator(asset_name=out_asset_name, streaming_policy_name="Predefined_DownloadAndClearStreaming",filters=list(asset_filter_name.split(" ")))
    locator = client.streaming_locators.create(
        resource_group_name=resource_group,
        account_name=account_name,
        streaming_locator_name=locator_name,
        parameters=streaming_locator
    )
    if locator:
        print(f"The streaming locator {locator_name} was successfully created!")
    else:
        raise Exception(f"Error while creating streaming locator {locator_name}")
if locator.name:
    hls_format = "format=m3u8-cmaf"
    dash_format = "format=mpd-time-cmaf"
    # Get the default streaming endpoint on the account
    streaming_endpoint = client.streaming_endpoints.get(
        resource_group_name=resource_group,
        account_name=account_name,
        streaming_endpoint_name=streaming_endpoint_name
    )
    if streaming_endpoint.resource_state != "Running":
        print(f"Streaming endpoint is stopped. Starting endpoint named {streaming_endpoint_name}")
        client.streaming_endpoints.begin_start(resource_group, account_name, streaming_endpoint_name)
    basename_tup = os.path.splitext(source_file) # Extracting the filename and extension
    path_extension = basename_tup[1] # Setting extension of the path
    manifest_name = os.path.basename(source_file).replace(path_extension, "")
    print(f"The manifest name is: {manifest_name}")
    manifest_base = f"https://{streaming_endpoint.host_name}/{locator.streaming_locator_id}/{manifest_name}.ism/manifest"
    hls_manifest = ""
    if asset_filter_name is None:
        hls_manifest = f'{manifest_base}({hls_format})'
    else:
        hls_manifest = f'{manifest_base}({hls_format},filter={asset_filter_name})'
    print(f"The HLS (MP4) manifest URL is: {hls_manifest}")
    print("Open the following URL to playback the live stream in an HLS compliant player (HLS.js, Shaka, ExoPlayer) or directly in an iOS device")
    # print the URL itself, not a set containing it
    print(hls_manifest)
    print()
    dash_manifest = ""
    if asset_filter_name is None:
        dash_manifest = f'{manifest_base}({dash_format})'
    else:
        dash_manifest = f'{manifest_base}({dash_format},filter={asset_filter_name})'
    print(f"The DASH manifest URL is: {dash_manifest}")
    print("Open the following URL to playback the live stream from the LiveOutput in the Azure Media Player")
    print(f"https://ampdemo.azureedge.net/?url={dash_manifest}&heuristicprofile=lowlatency")
    print()
else:
    raise ValueError("Locator was not created or Locator name is undefined.")
There's an example on https://amp.azure.net/libs/amp/latest/samples/dynamic_playback_speed.html for how to use playback speed. This is also available at https://github.com/Azure-Samples/azure-media-player-samples/blob/master/html/dynamic_playback_speed.html.
I have a setup where I need to extract data from Elasticsearch and store it in Azure Blob Storage. To get the data I am using Elasticsearch's _search and _scroll APIs. The indexes are well designed and are named something like game1.*, game2.*, game3.*, etc.
I've created a worker.py file, which I stored in a folder called shared_code as Microsoft suggests, and I have several Timer Trigger Functions that import and call worker.py. Because of the way ES was set up on our side, I had to create a VNET and a static outbound IP address, which we then whitelisted on ES. Also, the data can only be extracted from ES on port 9200. So I've created an Azure Function App that has the connection set up, and I am trying to create multiple Functions (game1-worker, game2-worker, game3-worker) that pull the data from ES in parallel at minute 5. I've noticed that if I add the FUNCTIONS_WORKER_PROCESS_COUNT = 1 setting, the functions wait until the first triggered one finishes its task before the second one triggers. If I don't add this app setting, or if I increase the number, then once a function has stopped because it finished working, it tries to start again and I get an OSError: [WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted error. Is there a way to make these run in parallel without getting this error?
Here is the code for worker.py:
#!/usr/bin/env python
# coding: utf-8
# # Elasticsearch to Azure Microservice
import json, datetime, gzip, importlib.util, os, re, logging
from elasticsearch import Elasticsearch
import azure.storage.blob as azsb
import azure.identity as azi
import tempfile
def batch(game_name, env='prod'):
    # #### Global Variables
    env = env.lower()
    connection_string = os.getenv('conn_storage')
    lowerFormat = game_name.lower().replace(" ","_")
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    stateStorageContainerName = "azure-webjobs-state"
    minutesOffset = 5
    tempFilePath = tempfile.gettempdir()
    curFileName = f"{lowerFormat}_cursor.py"
    curTempFilePath = os.path.join(tempFilePath,curFileName)
    curBlobFilePath = f"cursors/{curFileName}"
    esUrl = os.getenv('esUrl')
    # #### Connections
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)

    def uploadJsonGzipBlob(filePathAndName, jsonBody):
        blob = azsb.BlobClient.from_connection_string(
            conn_str=connection_string,
            container_name=storageContainerName,
            blob_name=filePathAndName
        )
        blob.upload_blob(gzip.compress(bytes(json.dumps(jsonBody), encoding='utf-8')))

    def getAndLoadCursor(filePathAndName):
        # Get cursor from blob
        blob = azsb.BlobClient.from_connection_string(
            conn_str=os.getenv('AzureWebJobsStorage'),
            container_name=stateStorageContainerName,
            blob_name=filePathAndName
        )
        # Stream it to Temp file
        with open(curTempFilePath, "wb") as f:
            data = blob.download_blob()
            data.readinto(f)
        # Load it by path
        spec = importlib.util.spec_from_file_location("cursor", curTempFilePath)
        cur = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(cur)
        return cur

    def writeCursor(filePathAndName, body):
        blob = azsb.BlobClient.from_connection_string(
            conn_str=os.getenv('AzureWebJobsStorage'),
            container_name=stateStorageContainerName,
            blob_name=filePathAndName
        )
        blob.upload_blob(body, overwrite=True)

    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is None:
        maxSizeMB = 10 # Default to 10 MB
    else:
        maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is None:
        maxProcessTimeSeconds = 300 # Default to 300 seconds
    else:
        maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))
    try:
        cur = getAndLoadCursor(curBlobFilePath)
    except Exception as e:
        dtStr = f"{datetime.datetime.utcnow():%Y/%m/%d %H:%M:00}"
        writeCursor(curBlobFilePath, f"# Please use format YYYY/MM/DD HH24:MI:SS\nlastPolled = '{dtStr}'")
        logging.info(f"No cursor file. Generated {curFileName} file with date {dtStr}")
        return 0

    # # Scrolling and Batching Engine
    lastRowDateOffset = cur.lastPolled
    nrFilesThisInstance = 0
    while 1:
        # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
        initTime = datetime.datetime.utcnow()
        ## Filter lt (less than) endDate to avoid infinite loops.
        ## Filter lt manually when compiling historical based on
        endDate = initTime-datetime.timedelta(minutes=minutesOffset)
        endDate = f"{endDate:%Y/%m/%d %H:%M:%S}"
        doc = {
            "query": {
                "range": {
                    "baseCtx.date": {
                        "gt": lastRowDateOffset,
                        "lt": endDate
                    }
                }
            }
        }
        Index = lowerFormat + ".*"
        if env == 'dev': Index = 'dev.' + Index
        if nrFilesThisInstance == 0:
            page = es.search(
                index = Index,
                sort = "baseCtx.date:asc",
                scroll = "2m",
                size = 10000,
                body = doc
            )
        else:
            page = es.scroll(scroll_id = sid, scroll = "10m")
        pageSize = len(page["hits"]["hits"])
        data = page["hits"]["hits"]
        sid = page["_scroll_id"]
        totalSize = page["hits"]["total"]
        print(f"Total Size: {totalSize}")
        cnt = 0
        # totalSize might be flawed as it returns at times an integer > 0 but array is empty
        # To overcome this, I've added the below check for the array size instead
        if pageSize == 0: break
        while 1:
            cnt += 1
            page = es.scroll(scroll_id = sid, scroll = "10m")
            pageSize = len(page["hits"]["hits"])
            sid = page["_scroll_id"]
            data += page["hits"]["hits"]
            sizeMB = len(gzip.compress(bytes(json.dumps(data), encoding='utf-8'))) / (1024**2)
            loopTime = datetime.datetime.utcnow()
            processTimeSeconds = (loopTime-initTime).seconds
            print(f"{cnt} Results pulled: {pageSize} -- Cumulative Results: {len(data)} -- Gzip Size MB: {sizeMB} -- processTimeSeconds: {processTimeSeconds} -- pageSize: {pageSize} -- startDate: {lastRowDateOffset} -- endDate: {endDate}")
            if sizeMB > maxSizeMB: break
            if processTimeSeconds > maxProcessTimeSeconds: break
            if pageSize < 10000: break
        lastRowDateOffset = max([x['_source']['baseCtx']['date'] for x in data])
        lastRowDateOffsetDT = datetime.datetime.strptime(lastRowDateOffset, '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}.json.gz"
        uploadJsonGzipBlob(outFile, data)
        writeCursor(curBlobFilePath, f"# Please use format YYYY/MM/DD HH24:MI:SS\nlastPolled = '{lastRowDateOffset}'")
        nrFilesThisInstance += 1
        logging.info(f"File compiled: {outFile} -- {sizeMB} MB\n")
        # If the while loop ran for more than maxProcessTimeSeconds then end it
        if processTimeSeconds > maxProcessTimeSeconds: break
        if pageSize < 10000: break
    logging.info(f"Closing Connection to {esUrl}")
    es.close()
    return 0
And these are 2 of the timing triggers I am calling:
game1-worker
import logging
import datetime
import azure.functions as func
#from shared_code import worker
import importlib.util

def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()
    if mytimer.past_due:
        logging.info('The timer is past due!')
    # Load a new instance of worker.py
    spec = importlib.util.spec_from_file_location("worker", "shared_code/worker.py")
    worker = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(worker)
    worker.batch('game1name')
    logging.info('Python timer trigger function ran at %s', utc_timestamp)
game2-worker
import logging
import datetime
import azure.functions as func
#from shared_code import worker
import importlib.util

def main(mytimer: func.TimerRequest) -> None:
    utc_timestamp = datetime.datetime.utcnow().replace(
        tzinfo=datetime.timezone.utc).isoformat()
    if mytimer.past_due:
        logging.info('The timer is past due!')
    # Load a new instance of worker.py
    spec = importlib.util.spec_from_file_location("worker", "shared_code/worker.py")
    worker = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(worker)
    worker.batch('game2name')
    logging.info('Python timer trigger function ran at %s', utc_timestamp)
TL;DR
Based on what you described, multiple worker processes share the underlying runtime's resources (sockets).
For your use case, just leave FUNCTIONS_WORKER_PROCESS_COUNT at 1. The default value is supposed to be 1, so not specifying it should be the same as setting it to 1.
You need to understand how Azure Functions scales. It is very unnatural and confusing.
The following assumes the Consumption plan.
Coding: You write Functions, say F1 and F2. How you organize them is up to you.
Provisioning:
- You create a Function App.
- You deploy F1 and F2 to this App.
- You start the App (not the individual functions).
Runtime:
- At start:
  - Azure spawns one Function Host. Think of this as a container/OS.
  - Inside the Host, one worker process is created. This worker process hosts one instance of the App.
  - If you change FUNCTIONS_WORKER_PROCESS_COUNT to, say, 10, the Host spawns 10 processes and runs your App inside each of them.
- When a Function is triggered (by a timer, a REST call, a message in a queue, ...):
  - Each worker process is capable of servicing one request at a time, be it a request for F1 or F2. One at a time.
  - Each Host is capable of servicing one request per worker process in it.
  - If the backlog of requests grows, the Azure load balancer triggers a scale-out and creates new Function Hosts.
Based on the limited info, creating 3 functions looks like a bad design. You could instead create a single timer-triggered function that sends 3 messages to a queue (a Storage queue should be more than enough for such minuscule traffic), which in turn triggers your actual implementation (a Storage-queue-triggered Function). A message would be something like {"game_name": "game1"}.
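The message flow of that design can be sketched in plain Python. This is only an illustration of the payloads, not the Azure Functions bindings: the timer trigger, the queue output binding and the queue trigger are left out, and `GAME_NAMES`, `build_messages` and `handle_message` are hypothetical names. The `batch` argument stands in for worker.batch from the question.

```python
import json

GAME_NAMES = ["game1", "game2", "game3"]  # hypothetical list of games


def build_messages(game_names):
    """Timer side: build one JSON message per game to put on the queue."""
    return [json.dumps({"game_name": name}) for name in game_names]


def handle_message(body, batch):
    """Queue side: decode one message and run the worker for that game.

    `batch` stands in for worker.batch from the question.
    """
    payload = json.loads(body)
    return batch(payload["game_name"])


if __name__ == "__main__":
    for message in build_messages(GAME_NAMES):
        # print stands in for the real worker here
        handle_message(message, print)
```

Because each message names exactly one game, the queue-triggered function needs no per-game copies, and the platform can decide how many messages to process at once.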
I am using python 3.7 with SimPy 4. I have 4 Resources (say "First Level") with a capacity of 5 and each Resource has an associated Resource (say "Second Level") with a capacity of 1 (So, 4 "First Level" Resources and 4 "Second Level" Resources in total). When an agent arrives, it requests a Resource from any Resource of the "First Level", when it gets access to it then it requests the associated Resource of the "Second Level".
I am using AnyOf to choose any of the "First Level" Resources. It works but I need to know which Resource is chosen by which agent. How can I do that?
Here is a representation of what I am doing so far:
import simpy
from simpy.events import AnyOf, Event

num_FL_Resources = 4
capacity_FL_Resources = 5
# env is the simpy.Environment; the yield below sits inside the agent's process function
FL_Resources = [simpy.Resource(env, capacity=capacity_FL_Resources) for i in range(num_FL_Resources)]
events = [FirstLevelResource.request() for FirstLevelResource in FL_Resources]
yield AnyOf(env, events)
Note 1: I didn't use a Store or FilterStore for the "First Level" and randomly assign the agent to one of the available Stores, because the agents keep coming and all of the Stores might be in use. They need to queue up. Also, please let me know if there is a good way of using a Store here.
Note 2: Resource.users gives me <Request() object at 0x...>, so it isn't helpful.
Note 3: I am using a nested dictionary for the "First Level" and "Second Level" Resources, like the one below. For brevity, I didn't include my longer code here.
{'Resource1': {'FirstLevel1': <simpy.resources.resource.Resource at 0x121f45690>,
'SecondLevel1': <simpy.resources.resource.Resource at 0x121f45710>},
'Resource2': {'FirstLevel2': <simpy.resources.resource.Resource at 0x121f457d0>,
'SecondLevel2': <simpy.resources.resource.Resource at 0x121f458d0>},
'Resource3': {'FirstLevel3': <simpy.resources.resource.Resource at 0x121f459d0>,
'SecondLevel3': <simpy.resources.resource.Resource at 0x121f45a90>},
'Resource4': {'FirstLevel4': <simpy.resources.resource.Resource at 0x121f47750>,
'SecondLevel4': <simpy.resources.resource.Resource at 0x121f476d0>}}
So I did it with a Store. In the store I have groups of first level objects that share a common second level resource. Here is the code:
"""
example of a two stage resource grab using a store and resources
An agent will queue up to get a first level resource object
and then use this object to get a second level resource
However, groups of the first level resource have one common second level resource
so there will also be a queue for the second level resource.
programmer: Michael R. Gibbs
"""
import simpy
import random

class FirstLevel():
    """
    A first level object, a group of these objects will make a type of resource
    each object in the group will have the same second level resource
    """
    def __init__(self, env, groupId, secondLevel):
        self.env = env
        self.groupId = groupId
        self.secondLevel = secondLevel

def agent(env, agentId, firstLevelStore):
    """
    sims an agent/entity that will first grab a first level resource
    then a second level resource
    """
    print(f'agent {agentId} requesting from store with {len(firstLevelStore.items)} and queue {len(firstLevelStore.get_queue)}')
    # queue and get first level resource
    firstLevel = yield firstLevelStore.get()
    print(f"agent {agentId} got first level resource {firstLevel.groupId} at {env.now}")
    # use the first level resource to queue and get the second level resource
    with firstLevel.secondLevel.request() as req:
        yield req
        print(f"agent {agentId} got second level resource {firstLevel.groupId} at {env.now}")
        yield env.timeout(random.randrange(3, 10))
        print(f"agent {agentId} done second level resource {firstLevel.groupId} at {env.now}")
    # put the first level resource back into the store
    yield firstLevelStore.put(firstLevel)
    print(f"agent {agentId} done first level resource {firstLevel.groupId} at {env.now}")

def agentGen(env, firstLevelStore):
    """
    creates a sequence of agents
    """
    id = 1
    while True:
        yield env.timeout(random.randrange(1, 2))
        print(f"agent {id} arrives {env.now}")
        env.process(agent(env,id, firstLevelStore))
        id += 1

if __name__ == '__main__':
    print("start")
    num_FL_Resources = 4 # number of first level groups/pools
    capacity_FL_Resources = 5 # number of first level in each group/pool
    env = simpy.Environment()
    # store of all first level objects, all mixed together
    store = simpy.Store(env, capacity=(num_FL_Resources * capacity_FL_Resources))
    for groupId in range(num_FL_Resources):
        # create the second level resource for each group of first level resources
        secondLevel = simpy.Resource(env,1)
        for cap in range(capacity_FL_Resources):
            # create the individual first level objects for the group
            firstLevel = FirstLevel(env,groupId,secondLevel)
            store.items.append(firstLevel)
    env.process(agentGen(env, store))
    env.run(200)
    print("done")
I am making a web scraper to build a database. The site I plan to use has index pages, each containing 50 links. The number of pages to be parsed is estimated at around 60K and up, which is why I want to implement multiprocessing.
Here is some pseudo-code of what I want to do:
def harvester(index):
    main = dict()
    ....
    links = foo.findAll('a')
    for link in links:
        main.update(worker(link))
        # or maybe something like: map_async(worker(link))

def worker(url):
    '''this function gathers the data from the given url'''
    return dictionary
What I want is a certain number of worker functions gathering data in parallel from different pages. This data would then be added to a big dictionary located in harvester, or written directly to a csv file by the worker function.
I'm wondering how I can implement the parallelism. I have done a fair amount of research on gevent, threading and multiprocessing, but I am not sure how to implement it.
I am also not sure whether adding data to a large dictionary or writing directly to a csv using DictWriter will be stable with that many inputs at the same time.
Thanks
I suggest splitting the work into separate workers that communicate via Queues.
Here most of the time is spent waiting on IO (crawling, csv writing), so threads are enough.
You can do something like the following (not tested, it just shows the idea):
import csv
import queue
import threading

class CsvWriter(threading.Thread):
    def __init__(self, resultq, fieldnames):
        super().__init__()
        self.resultq = resultq
        # fieldnames (the csv column names) is an added parameter: DictWriter requires it
        self.outfile = open('results.csv', 'w', newline='')
        self.writer = csv.DictWriter(self.outfile, fieldnames=fieldnames)

    def run(self):
        done = False
        while not done:
            row = self.resultq.get()
            if row != -1:
                self.writer.writerow(row)
            else:
                done = True
        self.outfile.close()

class Crawler(threading.Thread):
    def __init__(self, inputq, resultq):
        super().__init__()
        self.iq = inputq
        self.oq = resultq

    def run(self):
        done = False
        while not done:
            link = self.iq.get()
            if link != -1:
                result = self.extract_data(link)
                self.oq.put(result)
            else:
                done = True

    def extract_data(self, link):
        # crawl and extract what you need and return a dict
        pass

def main():
    # your_urls and your_fieldnames are placeholders for your own data
    linkq = queue.Queue()
    for url in your_urls:
        linkq.put(url)
    resultq = queue.Queue()
    writer = CsvWriter(resultq, your_fieldnames)
    writer.start()
    crawlers = [Crawler(linkq, resultq) for _ in range(10)]
    for c in crawlers:
        c.start()
    # one -1 sentinel per crawler tells it to stop
    for _ in crawlers:
        linkq.put(-1)
    for c in crawlers:
        c.join()
    resultq.put(-1)
    writer.join()
This code should work (after fixing any typos) and exits once all the URLs have been processed.
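The same idea can be written more compactly with concurrent.futures, which is a different technique from the hand-written threads and queues above. A minimal sketch, assuming a hypothetical `extract_data(url)` callable that returns a dict whose keys match `fieldnames`; `crawl_all` and its parameters are illustrative names only:

```python
import csv
from concurrent.futures import ThreadPoolExecutor, as_completed


def crawl_all(urls, extract_data, out_file, fieldnames, workers=10):
    """Fetch every URL in a thread pool and write one CSV row per result.

    `extract_data(url)` is a placeholder that must return a dict whose keys
    match `fieldnames`. Only the calling thread touches the CSV writer.
    """
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(extract_data, url) for url in urls]
        # rows are written in completion order, not in input order
        for fut in as_completed(futures):
            writer.writerow(fut.result())
            count += 1
    return count
```

It could be called as `crawl_all(your_urls, extract_data, f, ['url', 'title'])` with `f` opened via `open('results.csv', 'w', newline='')`. Since only one thread writes to the file, there is no concurrent access to DictWriter to worry about.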