I'm a newbie in Python, and this is my first time working with a REST API in Python. First let me explain what I want to do. I have a CSV file that has the name of a product and some other details; these are data that went missing after a migration. My job is to check in the downstream Application 1 whether it contains each product or whether it is missing there too; if it is missing there, I have to keep digging further back.
I have the API of Application 1 (it returns the product name and details if the product exists) and an API for OAuth 2 that creates a token; I use that token to access the Application 1 API (the URL looks like https://Applciationname/rest/<productname>). I get <productname> from a list built from the first column of the CSV file. Everything works, but my list has 3000 entries and it takes almost 2 hours to complete.
Is there a faster way to check this? BTW, I'm calling the token API only once. This is how my code looks:
import csv
import json
import requests

products = []
# read the CSV and append the first column to the list
with open("products.csv") as csvfile:   # filename assumed for illustration
    for row in csv.reader(csvfile):
        products.append(row[0])

get_token = requests.get(tokenurl, params=OAuthdetails)   # similar type of code
token_dict = json.loads(get_token.content.decode())
token = token_dict['access_token']

headers = {
    'Authorization': 'Bearer ' + str(token)
}
url = 'https://Applciationname/rest/'

for element in products:
    full_url = url + element
    api_response = requests.get(full_url, headers=headers)
    received_data = json.loads(api_response.content.decode())
    if api_response.status_code == 200 and len(received_data) != 0:
        with open("successcall.txt", "a") as f:   # product found in App 1
            f.write(element + "\n")
    else:
        with open("failurecall.txt", "a") as f:   # product missing in App 1
            f.write(element + "\n")
Could you please help me optimize this, so that I can find the product names that are not in App 1 faster?
You could use threading for your for loop, like so:
import threading

lock = threading.RLock()
thread_list = []

def check_api(element, full_url):
    api_response = requests.get(full_url, headers=headers)
    received_data = json.loads(api_response.content.decode())
    if api_response.status_code == 200 and len(received_data) != 0:
        # don't forget to take the lock when writing to the file
        with lock:
            with open("successcall.txt", "a") as f:
                f.write(element + "\n")
    else:
        # again, take the lock before writing, just like above
        with lock:
            with open("failurecall.txt", "a") as f:
                f.write(element + "\n")

for element in products:
    full_url = url + element
    t = threading.Thread(target=check_api, args=(element, full_url))
    thread_list.append(t)

# start all threads
for thread in thread_list:
    thread.start()

# wait for them all to finish
for thread in thread_list:
    thread.join()
You should also not write to the same file from multiple threads without a lock, since that can cause problems.
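If you would rather not manage the threads and the lock yourself, a bounded thread pool is an alternative way to run the same loop concurrently. This is only a sketch, not your exact code: it assumes the products, url, and headers names from above, and max_workers=20 is an arbitrary value you would tune.

from concurrent.futures import ThreadPoolExecutor
import requests

def exists_in_app1(element):
    # one request per product; returns (element, found?)
    response = requests.get(url + element, headers=headers)
    return element, response.status_code == 200 and len(response.json()) != 0

with ThreadPoolExecutor(max_workers=20) as pool:   # tune the worker count
    results = list(pool.map(exists_in_app1, products))

with open("successcall.txt", "w") as ok, open("failurecall.txt", "w") as missing:
    for element, found in results:
        (ok if found else missing).write(element + "\n")

Collecting the results and writing the two files once at the end also sidesteps the per-thread locking entirely.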
I am trying to slice the results of an API response to process just the first n values in Python without writing to a file first.
Specifically I want to do analysis on the "front page" from HN, which is just the first 30 items. However the API (https://hacker-news.firebaseio.com/v0/topstories.json?print=pretty) gives you the first 500 results.
Right now I'm pulling top stories and writing to a file, then importing the file and truncating the string:
import json
import requests

# request JSON data from the HN API
topstories = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json')

# write the response to a .txt file named topstories.txt
with open('topstories.txt', 'w') as fd:
    fd.write(topstories.text)

# truncate the text file to the top 30 stories
f = open('topstories.txt', 'r+')
f.truncate(270)
f.close()
This is inelegant and inefficient. I will have to do this again to extract each 8 digit object ID.
How do I process this API return data as much as possible in memory without writing to file?
Suggestion:
User jordanm suggested the code replacement:
fd.write(json.dumps(topstories.json()[:30]))
However that would just move the needle on when I would need to write/read versus doing anything else I want with it.
What you want is the io library
https://docs.python.org/3/library/io.html
Basically:
import io
f = io.StringIO(topstories.text)
f.truncate(270)
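For comparison, here is a purely in-memory sketch along the lines of the json()[:30] suggestion above, with no file involved; the loop over item URLs is just an illustration of what you could do next with the IDs.

import requests

topstories = requests.get('https://hacker-news.firebaseio.com/v0/topstories.json')

# parse the JSON body and keep only the first 30 story IDs in memory
front_page_ids = topstories.json()[:30]

for story_id in front_page_ids:
    item_url = 'https://hacker-news.firebaseio.com/v0/item/{}.json'.format(story_id)
    # item = requests.get(item_url).json()  # fetch each story's details if needed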
I found a tutorial and I'm trying to run this script; I have not worked with Python before.
tutorial
I've already checked what is going on with logging.debug, verified whether it connects to Google, and tried creating the CSV file with other scripts.
from urllib.parse import urlencode, urlparse, parse_qs
from lxml.html import fromstring
from requests import get
import csv

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches:
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            links = page.cssselect('.r a')
            csvfile = '/Users/Work/Desktop/data.csv'
            for row in links:
                raw_url = row.get('href')
                title = row.text_content()
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    csvRow = [userQuery, url[0], title]
                    with open(csvfile, 'a') as data:
                        writer = csv.writer(data)
                        writer.writerow(csvRow)
            print(links)

scrape_run()
The TL;DR of this script is that it does three basic things:
1. Locates and opens your searches.txt file.
2. Uses those keywords and searches the first page of Google for each result.
3. Creates a new CSV file and prints the results (keyword, URLs, and page titles).
Solved
Google added a captcha because I made too many requests; it works when I use mobile internet.
Assuming the links variable is populated and contains data - please verify.
If it's empty, test the API call itself; maybe it returns something different than you expected.
Other than that, I think you just need to tweak your file handling a little.
https://www.guru99.com/reading-and-writing-files-in-python.html
Here you can find some guidelines on file handling in Python.
In my view, you need to make sure you create the file first.
Start with a script that is only able to create a file.
After that, enhance the script so it can write to and append to the file.
From there on, I think you are good to continue with your script.
Also, you would probably prefer opening the file only once instead of on each loop iteration; that could mean much faster execution (see the sketch below).
Let me know if something is not clear.
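As a rough sketch of that last point, here is one way the loop above could open the CSV file once for the whole run instead of once per link; everything else, including the imports, is taken from the script as posted.

def scrape_run():
    with open('/Users/Work/Desktop/searches.txt') as searches, \
         open('/Users/Work/Desktop/data.csv', 'a', newline='') as data:
        writer = csv.writer(data)  # one writer object for the whole run
        for search in searches:
            userQuery = search
            raw = get("https://www.google.com/search?q=" + userQuery).text
            page = fromstring(raw)
            for row in page.cssselect('.r a'):
                raw_url = row.get('href')
                if raw_url.startswith("/url?"):
                    url = parse_qs(urlparse(raw_url).query)['q']
                    writer.writerow([userQuery, url[0], row.text_content()])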
I am fairly new to Python, so please be kind. I am a network administrator but have been tasked with automating several of our processes using Python.
I am trying to take a list of network IDs and plug them into a URL using a for loop.
file = open('networkid.txt', 'r')

def main(file):
    for x in file:
        print(x)

link = ('https://api.meraki.com/api/v0/networks/') + (Network ID) + ('/syslogServers')
Each line in the .txt file contains a network ID, and I need that ID to be injected where (Network ID) is in the script. Then I need the rest of the script (not posted here) to run, and this should continue until all IDs have been exhausted.
The current example layout is not how my script is set up, but bits and pieces are cut out to give you an idea of what I am aiming for.
To clarify the question: how do I reference each line in the text file (each line contains a network ID that I need to inject into the URL), and how do I set up a proper for loop so the process continues until all network IDs in the list have been exhausted?
x contains the network ID after you strip off the newline.
for line in file:
    networkID = line.strip()
    link = 'https://api.meraki.com/api/v0/networks/' + networkID + '/syslogServers'
    # do something with link
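If it helps, here is a rough sketch of how that loop could fit together with the file handling and the request itself. The api_key variable and the X-Cisco-Meraki-API-Key header are assumptions based on the Meraki v0 API, not something from your script, so double-check them against the Meraki documentation.

import requests

api_key = 'your-meraki-api-key'  # hypothetical placeholder

with open('networkid.txt', 'r') as file:
    for line in file:
        networkID = line.strip()
        if not networkID:
            continue  # skip blank lines
        link = 'https://api.meraki.com/api/v0/networks/' + networkID + '/syslogServers'
        # header name assumed from the Meraki v0 API docs
        response = requests.get(link, headers={'X-Cisco-Meraki-API-Key': api_key})
        print(networkID, response.status_code)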
I'm relatively new to Python, and I am trying to build a program that can visit a website using a proxy from a list of proxies in a text file, and continue doing so with each proxy in the file until they're all used. I found some code online and tweaked it to my needs, but when I run the program, the proxies are successfully used, but they don't get used in order. For whatever reason, the first proxy gets used twice in a row, then the second proxy gets used, then the first again, then the third, and so on. It doesn't go in order one by one.
The proxies in the text file are organized as such:
123.45.67.89:8080
987.65.43.21:8080
And so on. Here's the code I am using:
from fake_useragent import UserAgent
import pyautogui
import webbrowser
import time
import random
import requests
import sys
from selenium import webdriver
import os
import re
proxylisttext = 'proxylistlist.txt'
useragent = UserAgent()
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy_type", 1)

def Visiter(proxy1):
    try:
        proxy = proxy1.split(":")
        print('Visit using proxy :', proxy1)
        profile.set_preference("network.proxy.http", proxy[0])
        profile.set_preference("network.proxy.http_port", int(proxy[1]))
        profile.set_preference("network.proxy.ssl", proxy[0])
        profile.set_preference("network.proxy.ssl_port", int(proxy[1]))
        profile.set_preference("general.useragent.override", useragent.random)
        driver = webdriver.Firefox(firefox_profile=profile)
        driver.get('https://www.iplocation.net/find-ip-address')
        time.sleep(2)
        driver.close()
    except:
        print('Proxy failed')
        pass

def loadproxy():
    try:
        get_file = open(proxylisttext, "r+")
        proxylist = get_file.readlines()
        writeused = get_file.write('used')
        count = 0
        proxy = []
        while count < 10:
            proxy.append(proxylist[count].strip())
            count += 1
            for i in proxy:
                Visiter(i)
    except IOError:
        print("\n[-] Error: Check your proxylist path\n")
        sys.exit(1)

def main():
    loadproxy()

if __name__ == '__main__':
    main()
So as I said, this code successfully navigates to the IP-checker site using the proxy, but it doesn't go line by line in order; the same proxy gets used multiple times. More specifically, how can I ensure the program iterates through the proxies one by one, without repeating? I have searched exhaustively for a solution but haven't been able to find one, so any help would be appreciated. Thank you.
Your problem is with these nested loops, which don't appear to be doing what you want:
proxy = []
while count < 10:
    proxy.append(proxylist[count].strip())
    count += 1
    for i in proxy:
        Visiter(i)
The outer loop builds up the proxy list, adding one value each time until there are ten. After each value has been added, the inner loop iterates over the proxy list that has been built so far, visiting each item.
I suspect you want to unnest the loops. That way, the for loop will only run after the while loop has completed, and so it will only visit each proxy once. Try something like this:
proxy = []
while count < 10:
    proxy.append(proxylist[count].strip())
    count += 1

for i in proxy:
    Visiter(i)
You could simplify that into a single loop, if you want. For instance, using itertools.islice to handle the bounds checking, you could do:
for proxy in itertools.islice(proxylist, 10):
    Visiter(proxy.strip())
You could even run that directly on the file object (since files are iterable) rather than calling readlines first to read it into a list, as in the sketch below. (You might then need to add a seek call on the file before writing "used", but you may need that anyway; some OSs don't allow you to mix reads and writes without seeking in between.)
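A minimal sketch of that file-object variant, reusing the names from the question (the surrounding try/except and the write of "used" are left out here):

import itertools

with open(proxylisttext, "r") as get_file:
    # the file is iterated line by line, and islice stops after the first 10 proxies
    for line in itertools.islice(get_file, 10):
        Visiter(line.strip())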
while count < 10:
    proxy.append(proxylist[count].strip())
    count += 1
    for i in proxy:
        Visiter(i)
The for loop within the while loop means that every time you hit proxy.append you'll call Visiter for every item already in proxy. That might explain why you're getting multiple hits per proxy.
As far as the out-of-order issue goes, I'm not sure why readlines() wouldn't maintain the line order of your file, but I'd try something like:
with open('filepath', 'r') as file:
    for line in file:
        do_stuff_with_line(line)
With the above, you don't need to hold the whole file in memory at once either, which can be nice for big files.
Good luck!
I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of those entities, and end up with all the 200,000 entities saved in JSON format in txt files.
The naive way of doing it is to go through the list of 200,000 entities and query them one by one, add the returned JSON to a list, and when it's done, write it all to a text file. Something like:
import simplejson
from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

a = api()
fullEntityList = []
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))

with open("fullEntities.txt", "w") as f:
    simplejson.dump(fullEntityList, f)
Obviously this is not reliable, as 200,000 queries to the API will take about 10 hours or so, so I guess something will cause an error before it gets to write it to the file.
I guess the right way is to write it in chunks, but I'm not sure how to implement that. Any ideas?
Also, I cannot do this with a database.
I would recommend writing them to an SQLite database. This is the way I do it for my own tiny web spider applications: you can query the keys quite easily and check which ones you have already retrieved, so your application can easily continue where it left off, in particular if you get some 1000 new entries added next week.
Do design "recovery" into your application from the beginning. If there is some unexpected exception (say, a timeout due to network congestion), you don't want to have to restart from the beginning, but only retry those queries you have not yet successfully retrieved. At 200,000 queries, an uptime of 99.9% means you have to expect 200 failures!
For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the JSON with zlib before dumping it into a database blob (see the sketch below).
SQLite is a good choice unless your spider runs on multiple hosts at the same time; for a single application, SQLite is perfect.
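A rough sketch of that approach, reusing apiWrapper and listEntities from the question; the database file name and table layout are just illustrative:

import sqlite3
import zlib
import simplejson
from apiWrapper import api
from entities import listEntities

conn = sqlite3.connect("entities.db")
conn.execute("CREATE TABLE IF NOT EXISTS entities (key TEXT PRIMARY KEY, data BLOB)")

a = api()
# keys already in the database, so a restart continues where it left off
done = {row[0] for row in conn.execute("SELECT key FROM entities")}

for entity in listEntities:
    if str(entity) in done:
        continue
    full = a.getFullEntity(entity)
    blob = zlib.compress(simplejson.dumps(full).encode("utf-8"))
    conn.execute("INSERT INTO entities (key, data) VALUES (?, ?)",
                 (str(entity), sqlite3.Binary(blob)))
    conn.commit()  # commit per entity so a failure loses at most one result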
The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc.
See Queue.
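For the simple append-mode option, a minimal sketch (one JSON object per line, reusing the names from the question) could look like this:

import simplejson
from apiWrapper import api
from entities import listEntities

a = api()
with open("fullEntities.txt", "a") as f:
    for entity in listEntities:
        # write each result as soon as it arrives, so a crash only loses the current one
        f.write(simplejson.dumps(a.getFullEntity(entity)) + "\n")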
I'd also use a separate thread that does the file-writing, and use a Queue to keep track of all entities. When I started off, I thought this would be done in 5 minutes, but it turned out to be a little harder. simplejson and all the other such libraries I'm aware of do not support partial writing, so you cannot write one element of a list first and add another one later. So I tried to solve this manually, by writing the [, the separating commas, and the ] to the file myself and dumping each entity separately.
Without being able to check it (as I don't have your api), you could try:
import threading
import Queue
import simplejson

from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename, "a") as f:
            while True:
                try:
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity, f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity, f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        with open(self._filename, "a") as f:
            f.write("]")


a = api()
fullEntityQueue = Queue.Queue(2 * CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    if (i + 1) % CHUNK_SIZE == 0 or i == n_entities - 1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()

writer.join()
writer.finish_file()
What this script does
The main loop still iterates over your list of entities, getting the full information for each. Each entity is then put into a Queue. Every 1000 entities (and at the end of the list) an EntityWriter thread is launched that runs in parallel to the main thread. This EntityWriter gets entities from the Queue and dumps them to the desired output file.
Some additional logic is required to make the JSON a list; as mentioned above, I write the [, the commas, and the ] manually. The resulting file should, in principle, be understood by simplejson when you reload it.