Salesforce Bulk API batch download with Python

I have an async Python script that creates a bulk API job/batch in Salesforce. After the batch is complete, I download the CSV file for processing.
Here's my problem: a streaming download of a ~300 MB CSV file using Python can take 3+ minutes with this asynchronous code:
If you're familiar with Salesforce bulk jobs, you can enter your information
into the variables below and download your batch results for testing. This is a working example, provided you enter the necessary information.
import asyncio, aiohttp, aiofiles
from simple_salesforce import Salesforce
from credentials import credentials as cred
sf_data_path = 'C:/Users/[USER NAME]/Desktop/'
job_id = '[18 DIGIT JOB ID]'
batch_id = '[18 DIGIT BATCH ID]'
result_id = '[18 DIGIT RESULT ID]'
instance_name = '[INSTANCE NAME]'
result_url = f'https://{instance_name}.salesforce.com/services/async/45.0/job/{job_id}/batch/{batch_id}/result/{result_id}'
sf = Salesforce(username='[SALESFORCE USERNAME]',
                password='[SALESFORCE PASSWORD]',
                security_token='[SALESFORCE SECURITY TOKEN]',
                organizationId='[SALESFORCE ORGANIZATION ID]')
async def download_results():
    err = None
    retries = 3
    status = 'Not Downloaded'
    for _ in range(retries):
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(url=result_url,
                                       headers={"X-SFDC-Session": sf.session_id, 'Content-Encoding': 'gzip'},
                                       timeout=300) as resp:
                    async with aiofiles.open(f'{sf_data_path}_DOWNLOAD_.csv', 'wb') as outfile:
                        while True:
                            chunk = await resp.content.read(10485760)  # = 10 MB
                            if not chunk:
                                break
                            await outfile.write(chunk)
                        status = 'Downloaded'
        except Exception as e:
            err = e
            retries -= 1
            status = 'Retrying'
            continue
        else:
            break
    else:
        status = 'Failed'
    return err, status, retries

asyncio.run(download_results())
However, if I download the result of the batch in the Developer Workbench: https://workbench.developerforce.com/asyncStatus.php?jobId='[18 DIGIT JOB ID]' the same file might download in 5 seconds.
There is obviously something going on here that I'm missing. I know that Workbench uses PHP; is this functionality even available in Python? I figured the async calls would make the download fast, but they don't make it anywhere near as fast as the download in the browser. Any ideas?
Thanks!

You can try a curl request to get the CSV. That method is as quick as what you see in Workbench.
You can read more here:
https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/query_walkthrough.htm
https://developer.salesforce.com/docs/atlas.en-us.api_asynch.meta/api_asynch/query_get_job_results.htm#example_locator
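If you would rather stay in Python than shell out to curl, here is a minimal sketch of the same streaming download using the synchronous requests library, reusing result_url, sf, and sf_data_path from the question; whether it matches Workbench's speed depends on your org and network:

import requests

# Stream the batch result to disk. "Accept-Encoding: gzip" asks Salesforce to
# compress the transfer (similar to curl --compressed); requests decompresses
# the chunks transparently in iter_content().
with requests.get(result_url,
                  headers={"X-SFDC-Session": sf.session_id,
                           "Accept-Encoding": "gzip"},
                  stream=True,
                  timeout=300) as resp:
    resp.raise_for_status()
    with open(f"{sf_data_path}_DOWNLOAD_.csv", "wb") as outfile:
        for chunk in resp.iter_content(chunk_size=1024 * 1024):  # 1 MB chunks
            outfile.write(chunk)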

Related

proxies do not work with aiohttp requests?

So I've been experimenting with web scraping with aiohttp, and I ran into this issue where whenever I use a proxy, the code within the session.get doesn't run. I've looked all over the internet and couldn't find a solution.
import asyncio
import time
import aiohttp
from aiohttp.client import ClientSession
import random
failed = 0
success = 0
proxypool = []
with open("proxies.txt", "r") as jsonFile:
lines = jsonFile.readlines()
for i in lines:
x = i.split(":")
proxypool.append("http://"+x[2]+":"+x[3].rstrip()+"#"+x[0]+":"+x[1])
async def download_link(url: str, session: ClientSession):
    global failed
    global success
    proxy = proxypool[random.randint(0, len(proxypool) - 1)]  # randint is inclusive at both ends
    print(proxy)
    async with session.get(url, proxy=proxy) as response:
        if response.status != 200:
            failed += 1
        else:
            success += 1
            result = await response.text()
            print(result)

async def download_all(urls: list):
    my_conn = aiohttp.TCPConnector(limit=1000)
    async with aiohttp.ClientSession(connector=my_conn, trust_env=True) as session:
        tasks = []
        for url in urls:
            task = asyncio.ensure_future(download_link(url=url, session=session))
            tasks.append(task)
        await asyncio.gather(*tasks, return_exceptions=True)  # the await must be nested inside the session

url_list = ["https://www.google.com"] * 100
start = time.time()
asyncio.run(download_all(url_list))
end = time.time()
print(f'download {len(url_list) - failed} links in {end - start} seconds')
print(failed, success)
Here is the problem though: the code works fine on my Mac. However, when I try to run the exact same code on Windows, it doesn't run. It also works fine without proxies, but as soon as I add them, it doesn't work.
At the end, you can see that I print failed and success. On my Mac it will output 0, 100, whereas on my Windows computer it will print 0, 0 - this proves that the code isn't running (also, nothing is printed).
The proxies I am using are paid proxies, and they work normally if I use requests.get(). Their format is "http://user:pass#ip:port"
I have also tried just using "http://ip:port" then using BasicAuth to carry the user and password, but this does not work either.
I've seen that many other people have had this problem; however, the issue never seems to get solved.
Any help would be appreciated :)
So after some more testing and researching, I found the issue: I needed to add ssl=False.
So the correct way to make the request would be:
async with session.get(url, proxy=proxy, ssl=False) as response:
That worked for me.
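For context, here is a minimal sketch of the question's download_link coroutine with that fix applied; note that ssl=False disables certificate verification for the proxied request, so only use it if you accept that trade-off:

async def download_link(url: str, session: ClientSession):
    # Pick a random proxy from the pool built above and skip TLS verification,
    # which is the workaround that made the proxied requests run on Windows.
    proxy = random.choice(proxypool)
    async with session.get(url, proxy=proxy, ssl=False) as response:
        return response.status, await response.text()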

aiohttp: client.get() returns html tag rather than file

I'm trying to download bounding box files (stored as gzipped tar archives) from image-net.org. When I print(resp.read()), rather than a stream of bytes representing the archive, I get the HTML b'<meta http-equiv="refresh" content="0;url=/downloads/bbox/bbox/[wnid].tar.gz" />\n', where [wnid] refers to a particular WordNet identification string. This leads to the error tarfile.ReadError: file could not be opened successfully. Any thoughts on what exactly the issue is and/or how to fix it? Code is below (images is a pandas data frame).
import asyncio
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor

import aiohttp
import pandas as pd

def get_boxes(images, nthreads=1000):
    def parse_xml(xml):
        return 0

    def read_tar(data, wnid):
        bytes = io.BytesIO(data)
        tar = tarfile.open(fileobj=bytes)
        return 0

    async def fetch_boxes(wnid, client):
        url = ('http://www.image-net.org/api/download/imagenet.bbox.'
               'synset?wnid={}').format(wnid)
        async with client.get(url) as resp:
            res = await loop.run_in_executor(executor, read_tar,
                                             await resp.read(), wnid)
            return res

    async def main():
        async with aiohttp.ClientSession(loop=loop) as client:
            tasks = [asyncio.ensure_future(fetch_boxes(wnid, client))
                     for wnid in images['wnid'].unique()]
            return await asyncio.gather(*tasks)

    loop = asyncio.get_event_loop()
    executor = ThreadPoolExecutor(nthreads)
    shapes, boxes = zip(*loop.run_until_complete(main()))
    return pd.concat(shapes, axis=0), pd.concat(boxes, axis=0)
EDIT: I understand now that this is a meta refresh used as a redirect. Would this be considered a "bug" in aiohttp?
This is OK.
Some services redirect from a user-friendly web page to a zip file. Sometimes this is implemented with an HTTP status (301 or 302, see the example below), and sometimes with a page whose meta tag contains the redirect, as in your example.
HTTP/1.1 302 Found
Location: http://www.iana.org/domains/example/
aiohttp handles the first case automatically (allow_redirects=True is the default).
In the second case the library just retrieves the HTML page and cannot follow the redirect automatically.
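If you want to follow that second kind of redirect anyway, one option is to detect the meta refresh yourself and issue a second request. A minimal sketch, assuming the refresh markup looks like the one in the question (the helper name and regex are illustrative, not part of aiohttp):

import re
import aiohttp
from yarl import URL  # yarl is installed together with aiohttp

META_REFRESH = re.compile(rb'url=([^">\s]+)')

async def read_following_meta_refresh(session: aiohttp.ClientSession, url: str) -> bytes:
    # First request; plain HTTP 301/302 redirects are followed automatically.
    async with session.get(url) as resp:
        base = resp.url
        body = await resp.read()
    # If the body is a tiny meta-refresh page, extract the target and fetch that.
    if b'http-equiv="refresh"' in body[:200]:
        match = META_REFRESH.search(body)
        if match:
            target = base.join(URL(match.group(1).decode()))
            async with session.get(target) as resp:
                body = await resp.read()
    return body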
I ran into the same problem
when I tried to download with wget from the same URL as you did:
http://www.image-net.org/api/download/imagenet.bbox.synset?wnid=n01729322
but it works if you request this URL directly:
www.image-net.org/downloads/bbox/bbox/n01729322.tar.gz
P.S. n01729322 is the wnid.

Computer hangs on large number of Python async aiohttp requests

I have a text file with over 20 million lines in the below format:
ABC123456|fname1 lname1|fname2 lname2
.
.
.
.
My task is to read the file line by line, send both names to the Google transliteration API, and print the results on the terminal (Linux). Below is my code:
import asyncio
import urllib.parse
from aiohttp import ClientSession
async def getResponse(url):
    async with ClientSession() as session:
        async with session.get(url) as response:
            response = await response.read()
            print(response)

loop = asyncio.get_event_loop()
tasks = []
# I'm using test server localhost, but you can use any url
url = "https://www.google.com/inputtools/request?{}"
for line in open('tg.txt'):
    vdata = line.split("|")
    if len(vdata) == 3:
        names = vdata[1] + "_" + vdata[2]
        tdata = {"text": names, "ime": "transliteration_en_te"}
        qstring = urllib.parse.urlencode(tdata)
        task = asyncio.ensure_future(getResponse(url.format(qstring)))
        tasks.append(task)
loop.run_until_complete(asyncio.wait(tasks))
In the above code, my file tg.txt contains 20+ million lines. When I run it, my laptop freezes and I have to hard restart it. But this code works fine when I use another file tg1.txt which has only 10 lines. What am I missing?
You can try to use asyncio.gather(*futures) instead of asyncio.wait.
Also try to do this in batches of a fixed size (for example, 10 lines per batch) and print something after each processed batch; that should help you debug your app.
Your futures can also finish in a different order, so it's better to store the result of gather and print it once processing of the batch has finished.
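A minimal sketch of that batching approach, assuming the getResponse coroutine and the url template from the question (the batch size is arbitrary, and the results are whatever getResponse returns):

import asyncio
import urllib.parse

BATCH_SIZE = 1000  # arbitrary; tune it to what your machine and the API tolerate

async def process_in_batches(path):
    batch = []
    with open(path) as f:
        for line in f:
            vdata = line.split("|")
            if len(vdata) == 3:
                qstring = urllib.parse.urlencode(
                    {"text": vdata[1] + "_" + vdata[2], "ime": "transliteration_en_te"})
                batch.append(getResponse(url.format(qstring)))
            if len(batch) >= BATCH_SIZE:
                results = await asyncio.gather(*batch)  # preserves input order
                print("processed a batch of", len(results))
                batch = []
        if batch:
            results = await asyncio.gather(*batch)
            print("processed the final batch of", len(results))

asyncio.run(process_in_batches('tg.txt'))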

Python aiohttp (with asyncio) sends requests very slowly

Situation:
I am trying to send an HTTP request to every domain listed in a file I already downloaded and get the destination URL I was forwarded to.
Problem: Well, I followed a tutorial and I get far fewer responses than expected. It's around 100 responses per second, but the tutorial lists 100,000 responses per minute.
The script also gets slower and slower after a couple of seconds, until I get just one response every 5 seconds.
Already tried: At first I thought the problem was that I ran it on a Windows server. After I tried the script on my own computer, I realized it was just a little bit faster, but not much. On another Linux server it was the same as on my computer (Unix, macOS).
Code: https://pastebin.com/WjLegw7K
work_dir = os.path.dirname(__file__)

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                delay = response.headers.get("DELAY")
                date = response.headers.get("DATE")
                print("{}:{} with delay {}".format(date, response.url, delay))
                return await response.read()
    except Exception:
        pass

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run():
    os.chdir(work_dir)
    for file in glob.glob("cdx-*"):
        print("Opening: " + file)
        opened_file = file
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(40000)
        with open(work_dir + '/' + file) as infile:
            seen = set()
            async with ClientSession() as session:
                for line in infile:
                    regex = re.compile(r'://(.*?)/')
                    domain = regex.search(line).group(1)
                    domain = domain.lower()
                    if domain not in seen:
                        seen.add(domain)
                    task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                    tasks.append(task)
                    del line
                responses = asyncio.gather(*tasks)
                await responses
        infile.close()
        del seen
        del file

loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)
I really don't know how to fix this issue, especially because I'm very new to Python... but I have to get it to work somehow :(
It's hard to tell what is going wrong without actually debugging the code, but one potential problem is that file processing is serialized. In other words, the code never processes the next file until all the requests from the current file have finished. If there are many files and one of them is slow, this could be a problem.
To change this, define run along these lines:
async def run():
    os.chdir(work_dir)
    async with ClientSession() as session:
        sem = asyncio.Semaphore(40000)
        seen = set()
        pending_tasks = set()
        for f in glob.glob("cdx-*"):
            print("Opening: " + f)
            with open(f) as infile:
                lines = list(infile)
            for line in lines:
                domain = re.search(r'://(.*?)/', line).group(1)
                domain = domain.lower()
                if domain in seen:
                    continue
                seen.add(domain)
                task = asyncio.ensure_future(bound_fetch(sem, 'http://' + domain, session))
                pending_tasks.add(task)
                # ensure that each task removes itself from the pending set
                # when done, so that the set doesn't grow without bounds
                task.add_done_callback(pending_tasks.remove)
        # await the remaining tasks
        await asyncio.wait(pending_tasks)
Another important thing: silencing all exceptions in fetch() is bad practice because there is no indication that something has started going wrong (due to either a bug or a simple typo). This might well be the reason your script becomes "slow" after a while - fetch is raising exceptions and you're never seeing them. Instead of pass, use something like print(f'failed to get {url}: {e}') where e is the object you get from except Exception as e.
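For instance, a minimal sketch of fetch with the exception made visible, keeping the signature from the question (the header printing is omitted for brevity):

async def fetch(url, session):
    try:
        async with session.get(url, ssl=False) as response:
            if response.status == 200:
                return await response.read()
    except Exception as e:
        # Surface the failure instead of silently swallowing it.
        print(f'failed to get {url}: {e}')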
Several additional remarks:
There is almost never a need to del local variables in Python; the garbage collector does that automatically.
You needn't close() a file opened using a with statement. with is designed specifically to do such closing automatically for you.
The code added domains to a seen set, but also processed an already seen domain. This version skips the domain for which it had already spawned a task.
You can create a single ClientSession and use it for the entire run.

Is it possible to send many requests asynchronously with Python

I'm trying to send about 70 requests to the Slack API but can't find a way to implement it in an asynchronous way; I have about 3 seconds for it, or I get a timeout error.
Here is how I've tried to implement it:
import asyncio
def send_msg_to_all(sc, request, msg):
    user_list = sc.api_call(
        "users.list"
    )
    members_array = user_list["members"]
    ids_array = []
    for member in members_array:
        ids_array.append(member['id'])
    real_users = []
    for user_id in ids_array:
        user_channel = sc.api_call(
            "im.open",
            user=user_id,
        )
        if user_channel['ok'] == True:
            real_users.append(User(user_id, user_channel['channel']['id']))
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(send_msg(sc, real_users, request, msg))
    loop.close()
    return HttpResponse()

async def send_msg(sc, real_users, req, msg):
    for user in real_users:
        send_ephemeral_msg(sc, user.user_id, user.dm_channel, msg)

def send_ephemeral_msg(sc, user, channel, text):
    sc.api_call(
        "chat.postEphemeral",
        channel=channel,
        user=user,
        text=text
    )
But it looks like I'm still doing it in a synchronous way
Any ideas guys?
Slack's API has a rate limit of one query per second (QPS), as documented in Slack's rate-limit documentation.
Even if you get this working you'll be well exceeding the limits and you will start to see HTTP 429 Too Many Requests errors. Your API token may even get revoked / cancelled if you continue at that rate.
I think you'll need to find a different way.
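If you do end up needing to message every user, the calls have to be spread out rather than fired all at once. Here is a minimal sketch that paces the question's send_ephemeral_msg helper to roughly one call per second; the one-second delay mirrors the documented limit and is an assumption about what your workspace tolerates:

import asyncio

async def send_msg(sc, real_users, req, msg):
    for user in real_users:
        # Reuse the question's synchronous helper, then pause before the next
        # call so we stay near Slack's documented one-query-per-second limit.
        send_ephemeral_msg(sc, user.user_id, user.dm_channel, msg)
        await asyncio.sleep(1)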
