I'm writing API code. It pulls the first token and the first ID from two lists, sends a request, then goes back and pulls the 2nd token and the 2nd ID, and so on. I wrote the loop below, but how do I add a range to the for loop so that it only runs 10 times? How do I add range()?
import requests
import json
token = [line.strip() for line in open("./dosya/tokenler/1.txt")]
ids = [line.strip() for line in open("./dosya/tokenler/data.txt")]
for tokenn, idcek in zip(token, ids):
    headers = {
        'Content-Type': 'application/json',
        'Client-Id': 'xxxxxxxxxxxxxxx',
        'Authorization': tokenn
    }
    data = {
        "to_id": "56822556", "from_id": idcek
    }
    response = requests.post("https://sitelink.com", headers=headers, json=data)
    print(response.text)
If you want your existing for loop to run at most 100 times, you can use enumerate() to get the loop index as well, then break out of the for loop on the 101st iteration (index 100):
for idx, (tokenn, idcek) in enumerate(zip(token, ids)):
    if idx == 100:
        # loop ran 100 times already, exiting loop
        break
    headers = {
        'Content-Type': 'application/json',
        'Client-Id': 'xxxxxxxxxxxxxxx',
        'Authorization': tokenn
    }
    data = {
        "to_id": "56822556", "from_id": idcek
    }
    """existing code"""
If you know in advance what limit you want on the iteration, you can add that as a component of the zip:
lim = 100
for tokenn, idcek, _ in zip(token, ids, range(lim)):
    # etc
You don't need to use the third value (hence the _); zip simply terminates on the shortest input.
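If you would rather not thread a dummy range through the zip, itertools.islice does the same capping; a minimal sketch, assuming the same token and ids lists as above:

from itertools import islice

lim = 100
for tokenn, idcek in islice(zip(token, ids), lim):  # islice stops after lim pairs, or earlier if a list runs out
    ...  # same headers/data/request code as above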
I'm trying to parse all the company names from this webpage. There are around 2431 companies there. However, the approach I've tried below only fetches 1000 results.
This is what I can see about the number of results in the response while going through dev tools:
hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431 <------------------------
nbPages: 1
page: 0
How can I get the rest of the results using requests?
This is what I've tried so far:
import requests
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'
params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.post(url, params=params, json=payload)
    print(len(r.json()['results'][0]['hits']))
As a workaround, you can simulate a search using the alphabet as the search pattern. With the code below you will get all 2431 companies as a dictionary, with the ID as the key and the full company data dictionary as the value.
import requests
import string
params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
                         'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
                         'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
    print(letter)
    payload = {
        "requests": [{
            "indexName": "YCCompany_production",
            "params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
        }]
    }
    r = requests.post(url, params=params, json=payload)
    result.update({h['id']: h for h in r.json()['results'][0]['hits']})
print(len(result))
UPDATE 01-04-2021
After reviewing the "fine print" in the Algolia API documentation, I discovered that the paginationLimitedTo parameter CANNOT BE USED in a query. This parameter can only be used during indexing by the data's owner.
It seems that you can use the query and offset this way:
payload = {"requests":[{"indexName":"YCCompany_production",
"params": "query=&offset=1000&length=500&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit"
"%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
Unfortunately, the paginationLimitedTo index set by the customer will not let you retrieve more than 1000 records via the API.
"hits": [],
"nbHits": 2432,
"offset": 1000,
"length": 500,
"message": "you can only fetch the 1000 hits for this query. You can extend the number of hits returned via the paginationLimitedTo index parameter or use the browse method. You can read our FAQ for more details about browsing: https://www.algolia.com/doc/faq/index-configuration/how-can-i-retrieve-all-the-records-in-my-index",
The browsing bypass method mentioned there requires an Application ID and the Admin API key.
ORIGINAL POST
Based on the Algolia API documentation there is a query hit limit of 1000.
The documentation lists several ways to override or bypass this limit.
Part of the API is paginationLimitedTo, which by default is set to 1000 for performance and "scraping protection."
The syntax is:
'paginationLimitedTo': number_of_records
Another method mentioned in the documentation is setting the parameters offset and length.
offset lets you specify the starting hit (or record)
length sets the number of records returned
You could use these parameters to walk the records, thus potentially not impacting your scraping performance.
For instance, you could scrape in blocks of 500:
records 1-500 (offset=0 and length=500)
records 501-1000 (offset=500 and length=500)
records 1001-1500 (offset=1000 and length=500)
etc...
or
records 1-500 (offset=0 and length=500)
records 500-999 (offset=499 and length=500)
records 999-1498 (offset=998 and length=500)
etc...
The latter approach would produce a few duplicates, which can easily be removed when adding the records to your in-memory storage (list, dictionary, dataframe).
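A rough Python sketch of that offset/length walk, reusing the url and params from the question (note: per the update above, the index's paginationLimitedTo setting will still stop this at 1000 hits unless the owner raises it):

import requests

all_hits = {}
offset, length = 0, 500
while True:
    payload = {"requests": [{
        "indexName": "YCCompany_production",
        "params": "query=&offset={}&length={}&tagFilters=".format(offset, length)
    }]}
    r = requests.post(url, params=params, json=payload)
    hits = r.json()['results'][0]['hits']
    if not hits:
        break  # no more records, or the pagination cap was reached
    all_hits.update({h['id']: h for h in hits})  # keying by id drops duplicates from overlapping blocks
    offset += length
print(len(all_hits))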
----------------------------------------
My system information
----------------------------------------
Platform: macOS
Python: 3.8.0
Requests: 2.25.1
----------------------------------------
Try an explicit limit value in the payload to override the API default. For instance, insert limit=2500 into your request string.
Looks like you need to set the parameter like this to override the default, using index.set_settings with
'paginationLimitedTo': number_of_records
Example use for Python:
index.set_settings({'customRanking': ['desc(followers)']})
Further info: https://www.algolia.com/doc/api-reference/api-methods/set-settings/#examples
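For the paginationLimitedTo case specifically, a minimal sketch (assuming the official algoliasearch Python client and that you own the index, since this is an index setting rather than a query parameter):

# raise the pagination cap on an index you own (requires the Admin API key)
index.set_settings({'paginationLimitedTo': 2500})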
There is another way to solve this problem: you can add &filters=objectID:SomeId to the query params. Algolia allows you to send 1000 different queries in one request.
This body will return you two objects:
{"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query&filters=objectID:271"}, {"indexName":"YCCompany_production","params":"hitsPerPage=1000&query&filters=objectID:5"}]}
You can check the objectID values; they lie in a range from 1 to 30000. Just send the objectIDs from 1 to 30000 in batches, and with only 30 requests you will get all 3602 companies.
Here is my Java code:
public static void main(String[] args) throws IOException {
    System.out.println("Start scraping content...>> " + new Timestamp(new Date().getTime()));
    Set<Integer> allIds = new HashSet<>();
    URL target = new URL("https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%3B%20JS%20Helper%20(3.7.0)&x-algolia-application-id=45BWZJ1SGC&x-algolia-api-key=Zjk5ZmFjMzg2NmQxNTA0NGM5OGNiNWY4MzQ0NDUyNTg0MDZjMzdmMWY1NTU2YzZkZGVmYjg1ZGZjMGJlYjhkN3Jlc3RyaWN0SW5kaWNlcz1ZQ0NvbXBhbnlfcHJvZHVjdGlvbiZ0YWdGaWx0ZXJzPSU1QiUyMnljZGNfcHVibGljJTIyJTVEJmFuYWx5dGljc1RhZ3M9JTVCJTIyeWNkYyUyMiU1RA%3D%3D");
    String requestBody = "{\"requests\":[{\"indexName\":\"YCCompany_production\",\"params\":\"hitsPerPage=1000&query&filters=objectID:24638\"}]}";
    int index = 1;
    List<String> results = new ArrayList<>();
    String bodyIndex = "{\"indexName\":\"YCCompany_production\",\"params\":\"hitsPerPage=1000&query&filters=objectID:%d\"}";
    for (int i = 1; i <= 30; i++) {
        StringBuilder body = new StringBuilder("{\"requests\":[");
        for (int j = 1; j <= 1000; j++) {
            body.append(String.format(bodyIndex, index));
            body.append(",");
            index++;
        }
        body = new StringBuilder(body.substring(0, body.length() - 1));
        body.append("]}");
        HttpURLConnection con = (HttpURLConnection) target.openConnection();
        con.setDoOutput(true);
        con.setRequestMethod(HttpMethod.POST.name());
        con.setRequestProperty(HttpHeaders.CONTENT_TYPE, APPLICATION_JSON);
        OutputStream os = con.getOutputStream();
        os.write(body.toString().getBytes(StandardCharsets.UTF_8));
        os.close();
        con.connect();
        String response = new String(con.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        results.add(response);
    }
    results.forEach(result -> {
        JsonArray array = JsonParser.parseString(result).getAsJsonObject().get("results").getAsJsonArray();
        array.forEach(data -> {
            if (((JsonObject) data).get("nbHits").getAsInt() == 0) {
                return;
            } else {
                allIds.add(((JsonObject) data).get("hits").getAsJsonArray().get(0).getAsJsonObject().get("id").getAsInt());
            }
        });
    });
    System.out.println("Total scraped ids " + allIds.size());
    System.out.println("Finish scraping content...>>>> " + new Timestamp(new Date().getTime()));
}
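If you prefer to stay in Python, roughly the same batching idea looks like this; a sketch that reuses the url and params dict from the earlier Python answer and borrows the 30 x 1000 objectID range and the response shape from the Java code above (both are assumptions, not verified against the live API):

import requests

result = {}
object_id = 1
with requests.Session() as s:
    for _ in range(30):  # 30 requests x 1000 sub-queries covers objectIDs 1..30000
        batch = []
        for _ in range(1000):
            batch.append({
                "indexName": "YCCompany_production",
                "params": "hitsPerPage=1000&query&filters=objectID:{}".format(object_id)
            })
            object_id += 1
        r = s.post(url, params=params, json={"requests": batch})
        for res in r.json()['results']:
            if res.get('nbHits', 0) > 0:
                hit = res['hits'][0]
                result[hit['id']] = hit  # same 'id' key the Java code reads
print("Total scraped ids", len(result))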
I have code that reads a list of numbers that are used as parameters for a Python request:
import requests

def search(number):
    url = "http://localhost:8080/sistem/checkNumberStatus"
    payload = '{\n\"SessionName\":\"POC\",\n\"celFull\":\"' + number + '\"\n}'
    headers = {
        'Content-Type': 'application/json'
    }
    response = requests.request("POST", url, headers=headers, data=payload)

object = open('numbers_wpp.txt', 'r')
for number in object:
    search(number)
But when I print the JSON payload, the output is:
{
"SessionName":"POC",
"celFull":"5512997708936
"
}
{
"SessionName":"POC",
"celFull":"5512997709337
"
}
{
"SessionName":"POC",
"celFull":"5512992161195"
}
When reading the file with 3 numbers, the quotes around the celFull attribute close correctly only on the last loop iteration (the last number), while for the first two the value is broken across lines before the closing quote. This break causes errors in the queries.
If numbers_wpp.txt is a file that has each number on a different line, the following code would iterate over the file line by line.
for number in object:
    search(number)  # <--- number is actually a whole line from your file
Since each number is on its own line, every line except possibly the last ends with a '\n'.
So
123
456
is actually
123\n456
and number is actually '123\n'. This causes your payload to contain a \n before the closing quote.
You can fix this by calling number.strip(), which strips whitespace from both ends of the string.
Alternatively, consider not handcrafting the JSON string and letting the requests library do that for you (see the requests documentation):
def search(number):
    url = "http://localhost:8080/sistem/checkNumberStatus"
    payload = {"SessionName": "POC", "celFull": number}  # <-- plain Python dict
    response = requests.post(url, json=payload)  # json= serializes the dict and sets the Content-Type header

object = open('numbers_wpp.txt', 'r')
for line in object:
    number = line.strip()  # <-- strip the trailing newline; keep it a string, as in the original payload
    search(number)
It is my first time posting here, so forgive me if my question is not up to par. As part of my job duties, I have to run API scripts from time to time, though I really only have a basic understanding of Python.
Below is a while loop:
hasMoreEntries = events['has_more']
while hasMoreEntries:
    url = "https://api.dropboxapi.com/2/team_log/get_events/continue"
    headers = {
        "Authorization": 'Bearer %s' % aTokenAudit,
        "Content-Type": "application/json"
    }
    data = {
        "cursor": events['cursor']
    }
    r = requests.post(url, headers=headers, data=json.dumps(data))
    events = r.json()
    hasMoreEntries = events['has_more']
    for event in events['events']:
        counter += 1
print('member id %s has done %s activities' % (memberId, counter))
From my understanding, the while loop will continuously count events and add to the counter. Because some users have too many events, I was thinking of stopping the counter at 5000, but I'm not sure how to do so. Would adding an if/else somewhere work?
You can add a check in your while condition that the counter is still below the maximum you want it to reach, e.g.:
while hasMoreEntries and counter <= 5000:
    <snip>
Because the counter is already incremented at the end of the while body, you only need to check its value before each loop iteration. Based on the comments from soon and Keerthana, here is my suggestion (I use the get() method just to avoid a KeyError):
has_more_entries = events.get('has_more', None)
while has_more_entries and counter <= 5000:
    url = "https://api.dropboxapi.com/2/team_log/get_events/continue"
    headers = {
        "Authorization": 'Bearer %s' % aTokenAudit,
        "Content-Type": "application/json"
    }
    data = {
        "cursor": events['cursor']
    }
    r = requests.post(url, headers=headers, data=json.dumps(data))
    events = r.json()
    has_more_entries = events.get('has_more', None)
    if events.get('events', None):
        counter += len(events['events'])
You can also take a look at the PEP 8 style guide for Python if you're interested.
I'm connecting to an API that has a limit of 500 rows per call.
This is my code for a single API call (works great):
def getdata(data):
    auth_token = access_token
    hed = {'Authorization': 'Bearer ' + auth_token, 'Accept': 'application/json'}
    urlApi = 'https://..../orders?Offset=0&Limit=499'
    datar = requests.get(urlApi, data=data, headers=hed, verify=True)
    return datar
Now I want to scale it up so it will get me all the records.
This is what I tried to do:
In order to make sure that I have all the rows, I must iterate until there is no more data:
get 1st page
get 2nd page
merge
get 3rd page
merge
etc...
each page is an API call.
This is what I'm trying to do:
def getData(data):
    auth_token = access_token
    value_offset = 0
    hed = {'Authorization': 'Bearer ' + auth_token, 'Accept': 'application/json'}
    datarALL = None
    while True:
        urlApi = 'https://..../orders?Offset=' + str(value_offset) + '&Limit=499'
        responsedata = requests.get(urlApi, data=data, headers=hed, verify=True)
        if responsedata.ok:
            value_offset = value_offset + 499
            # to do: merge the result of the get request
            datarALL = datarALL + responsedata (?)
            # to do: check if response is empty then break out.
    return datarALL
I couldn't find information about how to merge the results of the API calls, nor how to check when I can break out of the loop.
Edit:
To clarify what I'm after:
I can see the results of the API call using:
logger.debug('response is : {0}'.format(datar.json()))
What I want to be able to do:
logger.debug('response is : {0}'.format(datarALL.json()))
and have it show all results from all calls. This requires generating API calls until there is no more data to get.
This is a sample of what the API call returns:
"offset": 0,
"limit": 0,
"total": 0,
"results": [
{
"field1": 0,
"field2": "string",
"field3": "string",
"field4": "string"
}
]
}
In this case, you are almost correct with the idea.
is_valid = True
while is_valid:
    is_valid = False
    ...
    ...
    responsedata = requests.get(urlApi, data=data, headers=hed, verify=True)
    if responsedata.status_code == 200:  # use the status code to check the request status; 200 means a successful call
        responsedata = responsedata.text
        value_offset = value_offset + 499
        # to do: merge the result of the get request
        jsondata = json.loads(responsedata)
        if "results" in jsondata:
            if jsondata["results"]:
                is_valid = True
        if is_valid:
            # concat the lists with the + operator
            datarALL = datarALL + jsondata["results"]
Since I don't know whether "results" still exists once the data runs out, I check at both levels.
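Putting those pieces together, here is a minimal end-to-end sketch; the URL, headers, and 499 page size come from the question, and it assumes (as the answer above does) that 'results' comes back empty once there is no more data:

import requests

def get_all_data(data):
    hed = {'Authorization': 'Bearer ' + access_token, 'Accept': 'application/json'}
    datarALL = []  # merged list of all result rows
    value_offset = 0
    while True:
        urlApi = 'https://..../orders?Offset=' + str(value_offset) + '&Limit=499'
        responsedata = requests.get(urlApi, data=data, headers=hed, verify=True)
        if responsedata.status_code != 200:
            break  # stop on a failed call
        page = responsedata.json().get('results', [])
        if not page:
            break  # empty page: no more data to fetch
        datarALL = datarALL + page  # merge this page into the combined list
        value_offset = value_offset + 499
    return datarALL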
I'm new to Python and to dealing with JSON. I'm trying to grab an array of strings from my database and pass them to an API. I don't know why I'm getting the missing-data error. Can you take a look?
###########################################
rpt_cursor = rpt_conn.cursor()
sql="""SELECT `ContactID` AS 'ContactId' FROM
`BWG_reports`.`bounce_log_dummy`;"""
rpt_cursor.execute(sql)
row_headers=[x[0] for x in rpt_cursor.description] #this will extract row headers
row_values= rpt_cursor.fetchall()
json_data=[]
for result in row_values:
    json_data.append(dict(zip(row_headers, result)))
results_to_load = json.dumps(json_data)
print(results_to_load) # Prints: [{"ContactId": 9}, {"ContactId": 274556}]
headers = {
    'Content-Type': 'application/json',
    'Accept': 'application/json',
}
targetlist = '302'
# This is for their PUT to "add multiple contacts to lists".
api_request_url = ('https://api2.xyz.com/api/list/' + str(targetlist)
                   + '/contactid/Api_Key/' + bwg_apikey)
print(api_request_url) #Prints https://api2.xyz.com/api/list/302/contactid/Api_Key/#####
response = requests.put(api_request_url, headers=headers, data=results_to_load)
print(response) #Prints <Response [200]>
print(response.content) #Prints b'{"status":"error","Message":"ContactId is Required."}'
rpt_conn.commit()
rpt_cursor.close()
###########################################################
Edit for Clarity:
I'm passing it this [{"ContactId": 9}, {"ContactId": 274556}]
and I'm getting this response body b'{"status":"error","Message":"ContactId is Required."}'
The API doc gives this as the form to follow for the request body:
[
{
"ContactId": "string"
}
]
When I manually put this data into their test tool, I get what I want.
[
{
"ContactId": "9"
},
{
"ContactId": "274556"
}
]
Maybe there is something wrong with json.dumps vs json.load? Am I not creating a dict, but rather a string that looks like a dict?
EDIT I FIGURED IT OUT!:
This was dumb.
I needed to define results_to_load = [] as an empty list before loading it with results_to_load = json.dumps(json_data).
Thanks for all the answers and attempts to help.
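As an aside, the string-vs-dict question can be sidestepped by handing requests the Python list itself: data= sends the pre-dumped string as-is, while json= serializes the object and sets the Content-Type header for you. A minimal sketch using the same json_data and api_request_url built above:

# let requests serialize the list of dicts; no json.dumps needed
response = requests.put(api_request_url, headers=headers, json=json_data)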
I would recommend checking the API docs to be sure, but from the error it seems the API wants a single field named contactId whose value is an array of IDs, rather than an array of objects where each object has contactId as a key:
//correct
{
contactId: [9,229]
}
instead of
// not correct
[{contactId:9}, {contactId:229}]
Tweaking this might help:
res = {}
contacts = []
for result in row_values:
    contacts.append(result[0])  # first (only) column of each row
res['contactId'] = contacts
...
...
response = requests.put(api_request_url, headers=headers, json=res)