Can't scrape all the company names from a webpage

Can't scrape all the company names from a webpage - python

I'm trying to parse all the company names from this webpage. There are around 2431 companies in there. However, the way I've tried below can fetches me 1000 results.
This is what I can see about the number of results in response while going through dev tools:
hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431 <------------------------
nbPages: 1
page: 0
How can I get the rest of the results using requests?
I've tried so far:
import requests
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'
params = {
'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
with requests.Session() as s:
s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
r = s.post(url,params=params,json=payload)
print(len(r.json()['results'][0]['hits']))

As a workaround you can simulate search using alphabet as a search pattern. Using code below you will get all 2431 companies as dictionary with ID as a key and full company data dictionary as a value.
import requests
import string
params = {
'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
'x-algolia-application-id': '45BWZJ1SGC',
'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
print(letter)
payload = {
"requests": [{
"indexName": "YCCompany_production",
"params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
}]
}
r = requests.post(url, params=params, json=payload)
result.update({h['id']: h for h in r.json()['results'][0]['hits']})
print(len(result))

UPDATE 01-04-2021
After reviewing the "fine print" in the Algolia API documentation, I discovered that the paginationLimitedTo parameter CANNOT BE USED in a query. This parameter can only be used during indexing by the data's owner.
It seems that you can use the query and offset this way:
payload = {"requests":[{"indexName":"YCCompany_production",
"params": "query=&offset=1000&length=500&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit"
"%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}
Unfortunately, the paginationLimitedTo index set by the customer will not let you retrieve more than 1000 records via the API.
"hits": [],
"nbHits": 2432,
"offset": 1000,
"length": 500,
"message": "you can only fetch the 1000 hits for this query. You can extend the number of hits returned via the paginationLimitedTo index parameter or use the browse method. You can read our FAQ for more details about browsing: https://www.algolia.com/doc/faq/index-configuration/how-can-i-retrieve-all-the-records-in-my-index",
The browsing bypass method mentioned requires an ApplicationID and the AdminAPIKey
ORIGINAL POST
Based on the Algolia API documentation there is a query hit limit of 1000.
The documentation lists several ways to override or bypass this limit.
Part of the API is paginationLimitedTo, which by default is set to 1000 for performance and "scraping protection."
The syntax is:
'paginationLimitedTo': number_of_records
Another method mentioned in the documentation is setting the parameters offset and length.
offset lets you specify the starting hit (or record)
length sets the number of records returned
You could use these parameters to walk the records, thus potentially not impacting your scraping performance.
For instance you could scrape in blocks of 500.
records 1-500 (offset=0 and length=500)
records 501-1001 (offset=500 and length=500)
records 1002-1502 (offset=1001 and length=500)
etc...
or
records 1-500 (offset=0 and length=500)
records 500-1000 (offset=499 and length=500)
records 1000-1500 (offset=999 and length=500)
etc...
The latter one would produces a few duplicates, which could be easily removed when adding them to your in-memory storage (list, dictionary, dataframe).
----------------------------------------
My system information
----------------------------------------
Platform: macOS
Python: 3.8.0
Requests: 2.25.1
----------------------------------------

Try an explicit limit value in the payload to override the API default. For instance, insert limit=2500 into your request string.

Looks like you need to set the param like this to override defaults. With
index.set_settings
'paginationLimitedTo': number_of_records
Example use for Pyhton.
index.set_settings({'customRanking': ['desc(followers)']})
Further Info :- https://www.algolia.com/doc/api-reference/api-methods/set-settings/#examples

There are other way to solve this problem. First you can add &filters=objectID:SomeIds.
Algolia allows you to send 1000 different queries in one request.
This body will return you two objects:
{"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query&filters=objectID:271"}, {"indexName":"YCCompany_production","params":"hitsPerPage=1000&query&filters=objectID:5"}]}
You can check objectID values. Where are some range from 1-30000. Just send random objectIDs from 1-30000 and with only 30 request you will get all 3602 companies.
Here you have my java code:
public static void main(String[] args) throws IOException {
System.out.println("Start scraping content...>> " + new Timestamp(new Date().getTime()));
Set<Integer> allIds = new HashSet<>();
URL target = new URL("https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20JavaScript%20(3.35.1)%3B%20Browser%3B%20JS%20Helper%20(3.7.0)&x-algolia-application-id=45BWZJ1SGC&x-algolia-api-key=Zjk5ZmFjMzg2NmQxNTA0NGM5OGNiNWY4MzQ0NDUyNTg0MDZjMzdmMWY1NTU2YzZkZGVmYjg1ZGZjMGJlYjhkN3Jlc3RyaWN0SW5kaWNlcz1ZQ0NvbXBhbnlfcHJvZHVjdGlvbiZ0YWdGaWx0ZXJzPSU1QiUyMnljZGNfcHVibGljJTIyJTVEJmFuYWx5dGljc1RhZ3M9JTVCJTIyeWNkYyUyMiU1RA%3D%3D");
String requestBody = "{\"requests\":[{\"indexName\":\"YCCompany_production\",\"params\":\"hitsPerPage=1000&query&filters=objectID:24638\"}]}";
int index = 1;
List<String> results = new ArrayList<>();
String bodyIndex = "{\"indexName\":\"YCCompany_production\",\"params\":\"hitsPerPage=1000&query&filters=objectID:%d\"}";
for (int i = 1; i <= 30; i++) {
StringBuilder body = new StringBuilder("{\"requests\":[");
for (int j = 1; j <= 1000; j++) {
body.append(String.format(bodyIndex, index));
body.append(",");
index++;
}
body = new StringBuilder(body.substring(0, body.length() - 1));
body.append("]}");
HttpURLConnection con = (HttpURLConnection) target.openConnection();
con.setDoOutput(true);
con.setRequestMethod(HttpMethod.POST.name());
con.setRequestProperty(HttpHeaders.CONTENT_TYPE, APPLICATION_JSON);
OutputStream os = con.getOutputStream();
os.write(body.toString().getBytes(StandardCharsets.UTF_8));
os.close();
con.connect();
String response = new String(con.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
results.add(response);
}
results.forEach(result -> {
JsonArray array = JsonParser.parseString(result).getAsJsonObject().get("results").getAsJsonArray();
array.forEach(data -> {
if (((JsonObject) data).get("nbHits").getAsInt() == 0) {
return;
} else {
allIds.add(((JsonObject) data).get("hits").getAsJsonArray().get(0).getAsJsonObject().get("id").getAsInt());
}
});
});
System.out.println("Total scraped ids " + allIds.size());
System.out.println("Finish scraping content...>>>> " + new Timestamp(new Date().getTime()));
}

Related

Error 308 Permanent Redirect in X++/C# code

I need to send an API request to SMS API.
My Python code is working but my X++/C# code is not working.
I tried with Postman, it is working as well.
Here's my Python code:
import requests
headers = {"Accept":"application/json"}
data = {"AppSid":"#############YxLtXtaN###",
"SenderID":"####23423#####",
"Body":"This is a test message.",
"Recipient":"###45645######",
"responseType":"JSON",
"CorrelationID":"",
"baseEncode":"true",
"statusCallback":"sent",
"async":"false"}
r = requests.post('http://myapi/rest/SMS/messages', auth=('user#domain.com', 'password'), headers=headers,
data=data)
Here's my X++/C# code:
class Class1
{
public static void main(Args _args)
{
str destinationUrl = 'myapi', requestXml, responseXml;
System.Net.HttpWebRequest request;
System.Net.HttpWebResponse response;
CLRObject clrObj;
System.Byte[] bytes;
System.Text.Encoding utf8;
System.IO.Stream requestStream, responseStream;
System.IO.StreamReader streamReader;
System.Exception ex;
System.Net.WebHeaderCollection httpHeader;
str byteStr;
System.Byte[] byteArray;
System.IO.Stream stream;
System.IO.Stream dataStream;
byteStr = strfmt('%1:%2', "user#domain.com", "password");
requestXml = " {\"AppSid\":\"###########\", \"SenderID\":\"########-AD\", \"Body\":\"This is a test message from ## from X++ Coding Language..\", \"Recipient\":\"######\", \"responseType\":\"JSON\", \"CorrelationID\":\"\", \"baseEncode\":\"true\", \"statusCallback\":\"sent\", \"async\":\"false\"}";
try
{
new InteropPermission(InteropKind::ClrInterop).assert();
httpHeader = new System.Net.WebHeaderCollection();
clrObj = System.Net.WebRequest::Create(destinationUrl);
request = clrObj;
utf8 = System.Text.Encoding::get_UTF8();
bytes = utf8.GetBytes(requestXml);
request.set_KeepAlive(true);
request.set_ContentType("application/xml");
request.AllowAutoRedirect=true;
utf8 = System.Text.Encoding::get_UTF8();
byteArray = utf8.GetBytes(byteStr);
byteStr = System.Convert::ToBase64String(byteArray);
httpHeader.Add("Authorization", 'Basic ' + byteStr);
request.set_ContentType("text/xml; encoding='utf-8'");
request.set_ContentLength(bytes.get_Length());
request.set_Method("POST");
request.set_Headers(httpHeader);
requestStream = request.GetRequestStream();
requestStream.Write(bytes, 0, bytes.get_Length());
response = request.GetResponse();
responseStream = response.GetResponseStream();
streamReader = new System.IO.StreamReader(responseStream);
responseXml = streamReader.ReadToEnd();
info(responseXml);
}
catch (Exception::CLRError)
{
//bp deviation documented
ex = CLRInterop::getLastException().GetBaseException();
error(ex.get_Message());
}
requestStream.Close();
streamReader.Close();
responseStream.Close();
response.Close();
}
}
I'm getting this error:
error code : The remote server returned an error: (308) Permanent
Redirect.

I made two changes.
removed the Authorization by commenting it out
//httpHeader.Add("Authorization", 'Basic ' + byteStr);
Made it HTTPS instead of HTTP. I could see the Location: HTTPS in the exception Response object in Visual Studio.
These 2 things resolved the issue for me.

I don't know about x++, but you could try explicitly allowing redirect on your webRequest object.
webRequest = System.Net.WebRequest::Create('http://myapi/rest/SMS/messages') as System.Net.HttpWebRequest;
webRequest.AllowAutoRedirect = true;
webRequest.MaximumAutomaticRedirections = 1; //Set value according to your requirements

For Range in Zip

I'm writing an api code. He pulls the token and id from the list, then goes back to the beginning and pulls the 2nd token and the 2nd id. I made a loop like this. But how do I add range into the for loop. So I want this loop to turn 10 times. How do I add Range.
import requests
import json
token = [line.strip() for line in open("./dosya/tokenler/1.txt")]
ids = [line.strip() for line in open("./dosya/tokenler/data.txt")]
for tokenn, idcek in zip(token, ids):
headers = {
'Content-Type': 'application/json',
'Client-Id': 'xxxxxxxxxxxxxxx',
'Authorization': tokenn
}
data = {
"to_id": "56822556", "from_id": idcek
}
response = requests.post("https://sitelink.com", headers=headers, json=data)
print(response.text)

If you want your existing for loop to run 100 times, you can use enumerate() to get the "loop index" as well, then stop the for loop on the 101th run (index 100):
for idx, (tokenn, idcek) in enumerate(zip(token, ids)):
if idx == 100:
# loop ran 100 times already, exiting loop
break
headers = {
'Content-Type': 'application/json',
'Client-Id': 'xxxxxxxxxxxxxxx',
'Authorization': tokenn
}
data = {
"to_id": "56822556", "from_id": idcek
}
"""existing code"""

If you know in advance what limit you want on the iteration, you can add that as a component of the zip:
lim = 100
for tokenn, idcek, _ in zip(token, ids, range(lim)):
# etc
You don't need to look at it; the zip will just terminate on the shortest input.

How to receive data through websockets in python

I'm trying to retrieve data programmatically through websockets and am failing due to my limited knowledge around this. On visiting the site at https://www.tradingview.com/chart/?symbol=ASX:RIO I notice one of the websocket messages being sent out is ~m~60~m~{"m":"quote_fast_symbols","p":["qs_p089dyse9tcu","ASX:RIO"]}
My code is as follows:
from websocket import create_connection
import json
ws = create_connection("wss://data.tradingview.com/socket.io/websocket?from=chart%2Fg0l68xay%2F&date=2019_05_27-12_19")
ws.send(json.dumps({"m":"quote_fast_symbols","p"["qs_p089dyse9tcu","ASX:RIO"]}))
result = ws.recv()
print(result)
ws.close()
Result of the print:
~m~302~m~{"session_id":"<0.25981.2547>_nyc2-charts-3-webchart-5#nyc2-compute-3_x","timestamp":1558976872,"release":"registry:5000/tvbs_release/webchart:release_201-106","studies_metadata_hash":"888cd442d24cef23a176f3b4584ebf48285fc1cd","protocol":"json","javastudies":"javastudies-3.44_955","auth_scheme_vsn":2}
I get this result no matter what message I send out, out of the almost multitude of messages that seem to be sent out. I was hoping one of the messages sent back will be the prices info for the low and highs for RIO. Is there other steps I should include to get this data? I understand there might be some form of authorisation needed but I dont know the workflow.

Yes, there is much more to setup and it needs to be done in order. The following example written in Node.js will subscribe to the BINANCE:BTCUSDT real time data and fetch historical 5000 bars on the daily chart.
Ensure you have proper value of the origin field set in header section before connecting. Otherwise your connection request will be rejected by the proxy. I most common ws there is no way to do this. Use faye-websocket instead
const WebSocket = require('faye-websocket')
const ws = new WebSocket.Client('wss://data.tradingview.com/socket.io/websocket', [], {
headers: { 'Origin': 'https://data.tradingview.com' }
});
After connecting you need to setup your data stream. I don't know if all of this commands needs to be performed. This probably can be shrink even more but it works. Basically what you need to do is to create new quote and chart sessions and within these sessions request stream of the data of the prior resolved symbol.
ws.on('open', () => {
const quote_session = 'qs_' + getRandomToken()
const chart_session = 'cs_' + getRandomToken()
const symbol = 'BINANCE:BTCUSDT'
const timeframe = '1D'
const bars = 5000
sendMsg(ws, "set_auth_token", ["unauthorized_user_token"])
sendMsg(ws, "chart_create_session", [chart_session, ""])
sendMsg(ws, "quote_create_session", [quote_session])
sendMsg(ws, "quote_set_fields", [quote_session,"ch","chp","current_session","description","local_description","language","exchange","fractional","is_tradable","lp","lp_time","minmov","minmove2","original_name","pricescale","pro_name","short_name","type","update_mode","volume","currency_code","rchp","rtc"])
sendMsg(ws, "quote_add_symbols",[quote_session, symbol, {"flags":['force_permission']}])
sendMsg(ws, "quote_fast_symbols", [quote_session, symbol])
sendMsg(ws, "resolve_symbol", [chart_session,"symbol_1","={\"symbol\":\""+symbol+"\",\"adjustment\":\"splits\",\"session\":\"extended\"}"])
sendMsg(ws, "create_series", [chart_session, "s1", "s1", "symbol_1", timeframe, bars])
});
ws.on('message', (msg) => { console.log(`RX: ${msg.data}`) })
And finally implementation of the helper methods
const getRandomToken = (stringLength=12) => {
characters = 'abcdefghijklmnopqrstuvwxyz0123456789'
const charactersLength = characters.length;
let result = ''
for ( var i = 0; i < stringLength; i++ ) {
result += characters.charAt(Math.floor(Math.random() * charactersLength))
}
return result
}
const createMsg = (msg_name, paramsList) => {
const msg_str = JSON.stringify({ m: msg_name, p: paramsList })
return `~m~${msg_str.length}~m~${msg_str}`
}
const sendMsg = (ws, msg_name, paramsList) => {
const msg = createMsg(msg_name, paramsList)
console.log(`TX: ${msg}`)
ws.send(createMsg(msg_name, paramsList))
}

Modifying API script to stop while loop when counter > x in Python?

it is my first time posting here so forgive me if my question is not up to par. As part of my job duties, I have to run API scripts from time to time though I really only have a basic understanding of python.
Below is a while loop:
hasMoreEntries = events['has_more'];
while (hasMoreEntries):
url = "https://api.dropboxapi.com/2/team_log/get_events/continue"
headers = {
"Authorization": 'Bearer %s' % aTokenAudit,
"Content-Type": "application/json"
}
data = {
"cursor": events['cursor']
}
r = requests.post(url, headers=headers, data=json.dumps(data))
events = r.json()
hasMoreEntries = events['has_more'];
for event in events['events']:
counter+=1;
print 'member id %s has done %s activites' % (memberId, counter)
From my understanding, the while loop will continuously count events and add to the counter. Because some users have too many events, I was thinking of stopping the counter at 5000 but not sure how to do so. Would adding an if/else somewhere work?

You can add a check that the counter is less than your maximum that you want it to get to in your while condition. e.g:
while hasMoreEntries and counter<=5000:
<snip>

Because you already increased the counter at the end of while, you can just only need to check the value of counter before each loop iteration. And based on comments of soon and Keerthana, here is my suggestion (I use the get() method just to avoid KeyError):
has_more_entries = events.get('has_more', None)
while (has_more_entries and counter<=5000):
url = "https://api.dropboxapi.com/2/team_log/get_events/continue"
headers = {
"Authorization": 'Bearer %s' % aTokenAudit,
"Content-Type": "application/json"
}
data = {
"cursor": events['cursor']
}
r = requests.post(url, headers=headers, data=json.dumps(data))
events = r.json()
has_more_entries = events.get('has_more', None)
if events.get('events', None):
counter += len(events['events'])
You can also take a look at the PEP8 coding style in Python here if you're interested

API not accepting my JSON data from Python

I'm new to Python and dealing with JSON. I'm trying to grab an array of strings from my database and give them to an API. I don't know why I'm getting the missing data error. Can you guys take a look?
###########################################
rpt_cursor = rpt_conn.cursor()
sql="""SELECT `ContactID` AS 'ContactId' FROM
`BWG_reports`.`bounce_log_dummy`;"""
rpt_cursor.execute(sql)
row_headers=[x[0] for x in rpt_cursor.description] #this will extract row headers
row_values= rpt_cursor.fetchall()
json_data=[]
for result in row_values:
json_data.append(dict(zip(row_headers,result)))
results_to_load = json.dumps(json_data)
print(results_to_load) # Prints: [{"ContactId": 9}, {"ContactId": 274556}]
headers = {
'Content-Type': 'application/json',
'Accept': 'application/json',
}
targetlist = '302'
# This is for their PUT to "add multiple contacts to lists".
api_request_url = 'https://api2.xyz.com/api/list/' + str(targetlist)
+'/contactid/Api_Key/' + bwg_apikey
print(api_request_url) #Prints https://api2.xyz.com/api/list/302/contactid/Api_Key/#####
response = requests.put(api_request_url, headers=headers, data=results_to_load)
print(response) #Prints <Response [200]>
print(response.content) #Prints b'{"status":"error","Message":"ContactId is Required."}'
rpt_conn.commit()
rpt_cursor.close()
###########################################################
Edit for Clarity:
I'm passing it this [{"ContactId": 9}, {"ContactId": 274556}]
and I'm getting this response body b'{"status":"error","Message":"ContactId is Required."}'
The API doc gives this as the from to follow for the request body.
[
{
"ContactId": "string"
}
]
When I manually put this data in there test thing I get what I want.
[
{
"ContactId": "9"
},
{
"ContactId": "274556"
}
]
Maybe there is something wrong with json.dumps vs json.load? Am I not creating a dict, but rather a string that looks like a dict?
EDIT I FIGURED IT OUT!:
This was dumb.
I needed to define results_to_load = [] as a dict before I loaded it at results_to_load = json.dumps(json_data).
Thanks for all the answers and attempts to help.

I would recommend you to go and check the API docs to be specific, but from it seems, the API requires a field with the name ContactID which is an array, rather and an array of objects where every object has key as contactId
Or
//correct
{
contactId: [9,229]
}
instead of
// not correct
[{contactId:9}, {contactId:229}]
Tweaking this might help:
res = {}
contacts = []
for result in row_values:
contacts.append(result)
res[contactId] = contacts
...
...
response = requests.put(api_request_url, headers=headers, data=res)

I FIGURED IT OUT!:
This was dumb.
I needed to define results_to_load = [] as an empty dict before I loaded it at results_to_load = json.dumps(json_data).
Thanks for all the answers and attempts to help.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Can't scrape all the company names from a webpage - python

Try an explicit limit value in the payload to override the API default. For instance, insert limit=2500 into your request string.

Related

Error 308 Permanent Redirect in X++/C# code

For Range in Zip

How to receive data through websockets in python

Modifying API script to stop while loop when counter > x in Python?

API not accepting my JSON data from Python

Categories

Resources