output more than limited results from a form request - python

I have the following script that posts a search terms into a form and retrieves results:
import mechanize
url = "http://www.taliesin-arlein.net/names/search.php"
br = mechanize.Browser()
br.set_handle_robots(False) # ignore robots
br.open(url)
br.select_form(name="form")
br["search_surname"] = "*"
res = br.submit()
content = res.read()
with open("surnames.txt", "w") as f:
    f.write(content)
However, the rendered web page, and hence this script, limits the search to 250 results. Is there any way I can bypass this limit and retrieve all results?
Thank you

You could simply iterate over possible prefixes to get around the limit. There are 270,000 names and a limit of 250 results per query, so you need to make at least 1080 requests. There are 26 letters in the alphabet, so if we assume an even distribution we would need a little over 2 letters as a prefix (log(1080)/log(26) is about 2.1); however, the distribution is unlikely to be that even (how many people have surnames starting with ZZ, after all?).
To get around this we use a modified depth first search like so:
import string
import time
import mechanize

def checkPrefix(prefix):
    # Return a list of names with this prefix.
    url = "http://www.taliesin-arlein.net/names/search.php"
    br = mechanize.Browser()
    br.open(url)
    br.select_form(name="form")
    br["search_surname"] = prefix + '*'
    res = br.submit()
    content = res.read()
    return extractSurnames(content)

def extractSurnames(pageText):
    # Write a function to extract the surnames from the HTML; placeholder for now.
    return []

Q = [x for x in string.ascii_lowercase]
listOfSurnames = []
while Q:
    curPrefix = Q.pop()
    print curPrefix
    curSurnames = checkPrefix(curPrefix)
    if len(curSurnames) < 250:
        # Store the surnames; could also write them to a file.
        listOfSurnames += curSurnames
    else:
        # We clearly didn't get all of the names, so we need to subdivide further.
        Q += [curPrefix + x for x in string.ascii_lowercase]
    time.sleep(5)  # Sleep here to avoid overloading the server for other people.
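The extractSurnames helper above is left as a stub. As a rough sketch only, assuming the surnames are rendered in HTML table cells (you would need to inspect the actual results page to confirm the right selector), it could be filled in with BeautifulSoup along these lines:

from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

def extractSurnames(pageText):
    # Assumption: each surname sits in its own table cell; adjust the
    # selector to whatever the real results markup uses.
    soup = BeautifulSoup(pageText, "html.parser")
    return [cell.get_text(strip=True) for cell in soup.find_all("td")]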
Thus we query more deeply in places where there are too many results to be displayed, but we do not query ZZZZ if there are fewer than 250 surnames starting with ZZZ (or shorter). Without knowing how skewed the name distribution is, it is hard to estimate how long this will take, but the 5-second sleep multiplied by 1080 requests is already about 1.5 hours, so you are probably looking at at least half a day, if not longer.
Note: this could be made more efficient by declaring the browser globally and reusing it; however, whether that is appropriate depends on where this code will be placed.
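As a minimal sketch of that variant, the browser is created once at module level and reused inside checkPrefix:

br = mechanize.Browser()

def checkPrefix(prefix):
    # Reuse the single module-level browser instead of creating one per call.
    br.open("http://www.taliesin-arlein.net/names/search.php")
    br.select_form(name="form")
    br["search_surname"] = prefix + '*'
    return extractSurnames(br.submit().read())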

Related

How do you generate a valid youtube URL with python

Is there any way to generate a valid YouTube URL with Python?
import random
import requests
from string import ascii_uppercase, ascii_lowercase, digits

charset = list(ascii_uppercase) + list(ascii_lowercase) + list(digits)

def gen_id():
    res = ""
    for i in range(11):
        res += random.choice(charset)
    return res

youtube_url = "https://www.youtube.com/watch?v=" + gen_id()
resp = requests.get(youtube_url)
print(resp.status_code)
I am using this example to generate random YouTube URLs.
I get response code 200, but no video is found when I try to open the URL in the browser.
I looked at this method but it does not work.
IDs are generated randomly and are not that predictable. They are all supposedly Base64, though, which helps limit the character set (you will probably want to add dashes and underscores to your random generation, since codes like gbhDL8BT_w0 are possible). The only real approach known is generation and then testing, and as some commenters mentioned, this might get rate-limited by YouTube.
There are some additional details provided in this answer to a similar question that might help in doing the generation, or satiating curiosity.
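As a rough sketch of that generate-and-test idea (the oEmbed check below is an assumption about YouTube's current behaviour and may itself be rate-limited; a plain GET of the watch page tends to return 200 even for nonexistent IDs, which is what you are seeing):

import random
from string import ascii_uppercase, ascii_lowercase, digits

import requests

# URL-safe Base64 alphabet: letters, digits, dash and underscore.
charset = list(ascii_uppercase) + list(ascii_lowercase) + list(digits) + ["-", "_"]

def gen_id():
    return "".join(random.choice(charset) for _ in range(11))

def looks_valid(video_id):
    # Assumption: the oEmbed endpoint returns 200 for an existing video
    # and a 4xx status for an unknown ID; verify before relying on it.
    resp = requests.get(
        "https://www.youtube.com/oembed",
        params={"url": "https://www.youtube.com/watch?v=" + video_id, "format": "json"},
    )
    return resp.status_code == 200

video_id = gen_id()
print(video_id, looks_valid(video_id))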
It's not possible to always pick a valid random URL from all the videos YouTube has; not every well-formed sequence is a valid id. You have to check yourself that the URLs you want to choose from randomly are valid. Pick some videos and put them in a list:
import random

myUrls = [
    "https://www.youtube.com/watch?v=...",
    "https://www.youtube.com/watch?v=...",
    ...
]
youtube_url = random.choice(myUrls)

Scrape latitude and longitude locations obtained from Mapbox

I'm working on a divvy dataset project.
I want to scrape information for each suggestion location and comments provided from here http://suggest.divvybikes.com/.
Am I able to scrape this information from Mapbox? It is displayed on a map so it must have the information somewhere.
I visited the page, and logged my network traffic using Google Chrome's Developer Tools. Filtering the requests to view only XHR (XmlHttpRequest) requests, I saw a lot of HTTP GET requests to various REST APIs. These REST APIs return JSON, which is ideal. Only two of these APIs seem to be relevant for your purposes - one is for places, the other for comments associated with those places. The places API's JSON contains interesting information, such as place ids and coordinates. The comments API's JSON contains all comments regarding a specific place, identified by its id. Mimicking those calls is pretty straightforward with the third-party requests module. Fortunately, the APIs don't seem to care about request headers. The query-string parameters (the params dictionary) need to be well-formulated though, of course.
I was able to come up with the following two functions: get_places makes multiple calls to the same API, each time with a different page query-string parameter. It seems that "page" is the term they use internally to split up all their data into different chunks - all the different places/features/stations are split up across multiple pages, and you can only get one page per API call. The while-loop accumulates all places in a giant list, and it keeps going until we receive a response which tells us there are no more pages. Once the loop ends, we return the list of places.
The other function is get_comments, which takes a place id (string) as a parameter. It then makes an HTTP GET request to the appropriate API, and returns a list of comments for that place. This list may be empty if there are no comments.
def get_places():
    import requests
    from itertools import count

    api_url = "http://suggest.divvybikes.com/api/places"
    page_counter = count(1)
    places = []

    for page_nr in page_counter:
        params = {
            "page": str(page_nr),
            "include_submissions": "true"
        }
        response = requests.get(api_url, params=params)
        response.raise_for_status()
        content = response.json()
        places.extend(content["features"])
        if content["metadata"]["next"] is None:
            break

    return places

def get_comments(place_id):
    import requests

    api_url = "http://suggest.divvybikes.com/api/places/{}/comments".format(place_id)
    response = requests.get(api_url)
    response.raise_for_status()
    return response.json()["results"]

def main():
    from operator import itemgetter

    places = get_places()
    place_id = places[12]["id"]
    print("Printing comments for the thirteenth place (id: {})\n".format(place_id))
    for comment in map(itemgetter("comment"), get_comments(place_id)):
        print(comment)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Printing comments for the thirteenth place (id: 107062)
I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.
>>>
For this example, I'm printing all the comments for the 13th place in our list of places. I picked that one because it is the first place which actually has comments (0 - 11 didn't have any comments, most places don't seem to have comments). In this case, this place only had one comment.
EDIT - If you wanted to save the place ids, latitude, longitude and comments in a CSV, you can try changing the main function to:
def main():
    import csv

    print("Getting places...")
    places = get_places()
    print("Got all places.")

    # Note: the API returns GeoJSON-style coordinates, ordered [longitude, latitude].
    fieldnames = ["place id", "longitude", "latitude", "comments"]

    print("Writing to CSV file...")
    with open("output.csv", "w") as file:
        writer = csv.DictWriter(file, fieldnames)
        writer.writeheader()
        num_places_to_write = 25
        for place_nr, place in enumerate(places[:num_places_to_write], start=1):
            print("Writing place #{}/{} with id {}".format(place_nr, num_places_to_write, place["id"]))
            comments = [c["comment"] for c in get_comments(place["id"])]
            row = [place["id"], *place["geometry"]["coordinates"], comments]
            writer.writerow(dict(zip(fieldnames, row)))

    return 0
With this, I got results like:
place id,longitude,latitude,comments
107098,-87.6711076553,41.9718155716,[]
107097,-87.759540081,42.0121073671,[]
107096,-87.747695446,42.0263916146,[]
107090,-87.6642036438,42.0162096564,[]
107089,-87.6609444613,41.8852953922,[]
107083,-87.6007853815,41.8199433342,[]
107082,-87.6355862613,41.8532736671,[]
107075,-87.6210737228,41.8862644836,[]
107074,-87.6210737228,41.8862644836,[]
107073,-87.6210737228,41.8862644836,[]
107065,-87.6499611139,41.9627251578,[]
107064,-87.6136027649,41.8332984674,[]
107062,-87.7073025402,42.0760990584,"[""I contacted Divvy about this five years ago and would like to pick the conversation back up! The Evanston Divvy bikes are regularly spotted in Wilmette and we'd love to expand the system for riders. We could easily have four stations - at the Metra Train Station, and the CTA station, at the lakefront Gillson Park and possibly one at Edens Plaza in west Wilmette. Please, please, please contact me directly. Thanks.""]"
In this case, I used the list-slicing syntax (places[:num_places_to_write]) to only write the first 25 places to the CSV file, just for demonstration purposes. However, after about the first thirteen were written, I got this exception message:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
So, I'm guessing that the comments API doesn't expect to receive so many requests in such a short amount of time. You may have to sleep in the loop for a bit to get around this. It's also possible that the API doesn't care and just happened to time out.
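As a rough sketch of that (the 1-second pause is an arbitrary guess, not a documented rate limit), the write loop inside main could be changed to pause between comment requests:

import time

# Replacing the loop body in main() above; writer, fieldnames, places and
# num_places_to_write are the names already defined there.
for place_nr, place in enumerate(places[:num_places_to_write], start=1):
    comments = [c["comment"] for c in get_comments(place["id"])]
    row = [place["id"], *place["geometry"]["coordinates"], comments]
    writer.writerow(dict(zip(fieldnames, row)))
    time.sleep(1)  # assumed polite delay; tune it if requests still time out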

Handling multiple user URL inputs that then need to be split and processed individually

I'm new to Python, so please go easy on me, senpai; this is probably a simple loop I'm overlooking. Essentially, I'm attempting to have a user input a list of URLs separated by commas, and then each of those URLs gets joined to the end of an API call. I have it working perfectly for one address when I remove the .split, but I'd love to know how to get it to handle multiple user inputs. I tried setting a counter and an upper limit for a loop and having it work that way, but couldn't get it working properly.
import requests
import csv
import os
Domain = input ("Enter the URLS seperated by commas").split(',')
URL = 'https:APIcalladdresshere&' + Domain
r = requests.get(URL)
lm = r.text
j = lm.replace(';',',')
file = open(Domain +'.csv', "w",)
file.write(j)
file.close()
file.close()
print (j)
print (URL)
I unfortunately don't have enough reputation to comment and ask what you mean by it not working properly (I'm guessing you mean something I've mentioned below), but maybe keeping a list of domains and looking for a specific input that breaks the loop (so you don't need an upper limit like you said) would solve your issue. Something like:
Domains = []
while True:
    domain = input("Enter the URLs separated by commas (enter 'exit' to exit): ")
    if 'exit' in domain.lower():
        break
    else:
        Domains.extend(domain.split(','))  # extend keeps Domains a flat list of strings

Urls = []
for domain in Domains:
    URL = 'https:APIcalladdresshere&' + domain
    Urls.append(URL)  # or just: Urls.append('https:APIcalladdresshere&' + domain)
But then your line URL = 'https:APIcalladdresshere&' + Domain will throw a TypeError, because you're trying to add a list to a string (you converted Domain to a list with Domain.split(',')). The loop above works just fine, but if you insist on a single comma-separated input, try:
URL = ['https:APIcalladdresshere&' + d for d in Domain]
where URL is now a list that you can iterate over.
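To tie that back to the original goal of one CSV per domain, a rough sketch might look like the following; the API address is just the placeholder from your question, and the response handling mirrors your original code:

import requests

Domain = input("Enter the URLs separated by commas: ").split(',')

for d in Domain:
    d = d.strip()
    url = 'https:APIcalladdresshere&' + d  # placeholder API address from the question
    r = requests.get(url)
    j = r.text.replace(';', ',')
    with open(d + '.csv', "w") as f:  # one CSV per domain
        f.write(j)
    print(url)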
Hope this helps!

retrieved URLs, trouble building payload to use requests module

I'm a Python novice, thanks for your patience.
I retrieved a web page, using the requests module. I used Beautiful Soup to harvest a few hundred href objects (links). I used uritools to create an array of full URLs for the target pages I want to download.
I don't want everybody who reads this note to bombard the web server with requests, so I'll show a hypothetical example that is realistic for just 2 hrefs. The array looks like this:
hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']
If I were typing these into 100s of lines of code, I understand what to do in order to retrieve each page:
from lxml import html
import requests
url = 'http://ku.edu/pls/WP040/'
payload = {'PT001F01' : '910', 'pf7331' : '11'}
r = requests.get(url, params = payload)
Then get the second page
payload = {'PT001F01' : '910', 'pf7331' : '12'}
r = requests.get(url, params = payload)
And keep typing in payload objects. Not all of the hrefs I'm dealing with are sequential, and not all of the payloads differ only in the last integer.
I want to automate this and I don't see how to create the payloads from the hrefs2 array.
While fiddling with uritools, I find urisplit which can give me the part I need to parse into a payload:
[urisplit(x)[3] for x in hrefs2]
['PT001F01=910&pf7331=11',
'PT001F01=910&pf7331=12']
Each one of those has to be turned into a payload object and I don't understand what to do.
I'm using Python3 and I used uritools because that appears to be the standards-compliant replacement of urltools.
I fell back on shell script to get pages with wget, which does work, but it is so un-Python-ish that I'm asking here for what to do. I mean, this does work:
import subprocess
for i in hrefs2:
    subprocess.call(["wget", i])
You can pass the full url to requests.get() without splitting up the parameters.
>>> requests.get('http://ku.edu/pls/WP040?PT001F01=910&pf7331=12')
<Response [200]>
If for some reason you don't want to do that, you'll need to split up the parameters some how. I'm sure there are better ways to do it, but the first thing that comes to mind is:
a = ['PT001F01=910&pf7331=11',
     'PT001F01=910&pf7331=12']

# list to store all url parameters after they're converted to dicts
urldata = []

# iterate over the list of params
for param in a:
    data = {}
    # split the string into key=value pairs
    for kv in param.split('&'):
        # split the pairs up
        b = kv.split('=')
        # the first part is the key, the second is the value
        data[b[0]] = b[1]
    # After converting every kv pair in the parameter, add the result to the list.
    urldata.append(data)
You could do this with less code, but I wanted to be clear about what's going on. I'm sure there is already a module somewhere out there that does this for you, too.
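In fact, the standard library's urllib.parse can do the splitting for you; as a short sketch (Python 3), it builds the payload dicts directly from the hrefs:

from urllib.parse import urlsplit, parse_qsl

import requests

hrefs2 = ['http://ku.edu/pls/WP040?PT001F01=910&pf7331=11',
          'http://ku.edu/pls/WP040?PT001F01=910&pf7331=12']

url = 'http://ku.edu/pls/WP040/'
for href in hrefs2:
    # parse_qsl turns 'PT001F01=910&pf7331=11' into [('PT001F01', '910'), ('pf7331', '11')]
    payload = dict(parse_qsl(urlsplit(href).query))
    r = requests.get(url, params=payload)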

Cannot decode or read website URL for counting a string

I am trying to perform a search and count of data in a website using the code below; you can see I have added a few extra prints for debugging. Currently the result is always "0", which suggests to me there is some error in reading the page. If I print the variable called html, I can clearly see that all three strings I am searching for are contained in the HTML, yet as previously mentioned none of my debug prints print anything, and the final count simply returns "0". As you can see, I have tried three different methods, with the same problem each time.
import urllib2
import urllib
import re
import json
import mechanize
post_url = "url_of_fishermans_finds"
browser = mechanize.Browser()
browser.set_handle_robots(False)
browser.addheaders = [('User-agent', 'Firefox')]
html = browser.open(post_url).read().decode('UTF-8')
# Attempted method 1
print html.count("SEA BASS")
# Attempted method 2
count = 0
enabled = False
for line in html:
    if 'MAIN FISHERMAN' in line:
        print "found main fisherman"
        enabled = True
    elif 'SEA BASS' in line:
        print "found fish"
        count += 1
    elif 'SECONDARY FISHERMAN' in line:
        print "found secondary fisherman"
        enabled = False
print count
# Attempted method 3
relevant = re.search(r"MAIN FISHERMAN(.*)SECONDARY FISHERMAN", html)[1]
found = relevant.count("SEA BASS")
print found
It is probably something really simple; any comments or help would be greatly appreciated. Kind regards, AEA
Regarding your regular expressions method #3, it appears you aren't grouping your search result prior to running count. I don't have the HTML you're looking at but you may also be running into trouble with your use of the '.' if there are newlines between your two search terms. With these issues in mind, try something like the following to correct these errors (note: in Python 3 syntax):
relevantcompile = re.compile("MAIN FISHERMAN(.*)SECONDARY FISHERMAN", re.DOTALL)
relevantsearch = re.search(relevantcompile, html)
relevantgrouped = relevantsearch.group()
relevantcount = relevantgrouped.count("SEA BASS")
print(relevantcount)
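If the capitalization on the page is not guaranteed (an assumption; check what the page actually contains), the same pattern can also be compiled with re.IGNORECASE:

import re

# Same search as above, but tolerant of capitalization as well as newlines.
relevantcompile = re.compile("MAIN FISHERMAN(.*)SECONDARY FISHERMAN", re.DOTALL | re.IGNORECASE)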
Also, keep in mind the comments above regarding the case sensitivity of regular-expression searches. Hope this helps :)
