What is the easiest way to loop through a series of URLs until there are no more results returned?
If the number of URLs were fixed, e.g. 9, something like the following code would work:
for i in range(1, 10):
    print('http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=' + str(i) + '&sort_order=default')
However, the number of URLs is dynamic, and I get a page saying "Sorry, there are currently no listings in this category." when I overshoot. Example below.
http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page=10&sort_order=default
What is the easiest way to only return pages with results?
Cheers
Steve
# count is an iterator that just keeps going
# from itertools import count
# but I'm not going to use it, because you want to set a reasonable limit
# otherwise you'll loop endlessly if your end condition fails
# requests is third party but generally better than the standard libs
import requests
base_url = 'http://www.trademe.co.nz/browse/categorylistings.aspx?v=list&rptpath=4-380-50-7145-&mcatpath=sports%2fcycling%2fmountain-bikes%2ffull-suspension&page={}&sort_order=default'
for i in range(1, 30):
    result = requests.get(base_url.format(i))
    if result.status_code != 200:
        break
    content = result.content.decode('utf-8')
    # Note, this is actually quite fragile
    # For example, they have 2 spaces between 'no' and 'listings'
    # so looking for 'no listings' would break
    # for a more robust solution be more clever.
    if 'Sorry, there are currently no' in content:
        break
    # do stuff with your content here
    print(i)
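If you would rather not hard-code the upper bound, the same idea can be wrapped in a generator built on itertools.count (mentioned in the comments above) plus a safety cap. This is just a sketch: the pages function name and the cap value are mine, and it reuses the requests import and base_url defined above.

from itertools import count

def pages(base_url, cap=100):
    # yield page content until the 'no listings' page, a bad status, or the cap
    for i in count(1):
        if i > cap:  # safety cap so a broken end condition can't loop forever
            return
        result = requests.get(base_url.format(i))
        if result.status_code != 200:
            return
        content = result.content.decode('utf-8')
        if 'Sorry, there are currently no' in content:
            return
        yield content

for i, content in enumerate(pages(base_url), start=1):
    # do stuff with your content here
    print(i)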
I'm newer to Python so please be easy on me, senpai, since this is probably a simple loop I'm overlooking. Essentially what I'm attempting to do is have a user input a list of URLs separated by commas, then each of those URLs gets joined to the end of an API call. I have it working perfectly when I remove the .split and use a single address, but I'd love to know how to get it to handle multiple user inputs. I tried setting a counter and an upper limit for a loop and having it work that way, but couldn't get it working properly.
import requests
import csv
import os
Domain = input("Enter the URLs separated by commas").split(',')
URL = 'https:APIcalladdresshere&' + Domain
r = requests.get(URL)
lm = r.text
j = lm.replace(';',',')
file = open(Domain +'.csv', "w",)
file.write(j)
file.close()
print (j)
print (URL)
I unfortunately don't have enough reputation to comment and ask what you mean by it not working properly (I'm guessing you mean something I've mentioned down below), but maybe if you keep a list of domains and look for a specific input that breaks the loop (so you don't need an upper limit like you said), that might solve your issue. Something like:
Domains = []
while True:
    domain = input("Enter the URLs separated by commas: (Enter 'exit' to exit)")
    if 'exit' in domain.lower():
        break
    else:
        Domains.extend(domain.split(','))

Urls = []
for domain in Domains:
    URL = 'https:APIcalladdresshere&' + domain
    Urls.append(URL)  # or you could just write Urls.append('https:APIcalladdresshere&' + domain)
But then the line URL = 'https:APIcalladdresshere&' + Domain will throw a TypeError because you're trying to add a list to a string (you converted Domain to a list with Domain.split(',')). The loop above works just fine, but if you insist on comma-separated urls, try:
URL = ['https:APIcalladdresshere&' + d for d in Domain]
where URL is now a list that you can iterate over.
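Putting that together with the rest of your original code, a minimal end-to-end sketch might look like the following ('https:APIcalladdresshere&' is your placeholder prefix, and writing one CSV per domain just mirrors what you already had):

import requests

domains = input("Enter the URLs separated by commas: ").split(',')

for domain in domains:
    domain = domain.strip()
    url = 'https:APIcalladdresshere&' + domain  # placeholder API prefix from the question
    r = requests.get(url)
    j = r.text.replace(';', ',')
    with open(domain + '.csv', 'w') as f:  # one CSV per domain, as in the original
        f.write(j)
    print(url)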
Hope this helps!
I have a list, and while I constantly append elements to it, I want to check that it isn't empty and take elements from it at the same time. Normally we wait for all the elements to be appended to the list and only then take elements from it and do something with them. In that case we lose the time spent waiting for all the elements to be added. What knowledge do I need to acquire to make this happen (multiprocessing, multiprocessing.dummy, asynchronous code)? Sorry, I am still new at this. I think it's better to explain why I want to achieve this kind of effect: the problem comes from a web crawler.
import requests
from model import Document

def add_concrete_content(input_list):
    """input_list data structure [{'url': 'xxx', 'title': 'xxx'}...]"""
    for e in input_list:
        r = requests.get(e['url'])
        html = r.content
        e['html'] = html
    return input_list

def save(input_list):
    for e in input_list:
        Document.create(**e)

if __name__ == '__main__':
    res = add_concrete_content(list)
    save(res)
"""this is I normally do, I save data to mysql or whatever database,but
I think the drawback is I have to wait all the html add to dict and then
save to database, what if I have to deal with tons of data? Can I save
the dict with the html first? Can I save some time? A friend of mine
said this is a typical producer consumer problem, probably gonna use
at least two threads and lock, because without lock, data probably
gonna fall into disorder"""
You're being a little vague, and I think there's a misconception in the way you want things to happen.
You don't need any extra Python rocket science to do what you want:
You can check whether the list is empty simply with if list_: (where list_ is your list).
You can access any element with list_[idx] (where idx is the index of the element). For example, list_[0] gets you the first element of the list, while list_[-1] gets the last one.
That said, you don't have to wait for all the elements to be added to the list if you need to process them as you go. You might be looking for something like this:
def push(list_):
    count = 0
    while True:
        list_.append(count)
        f()
        count += 1
        if count == 1000:
            break

def f():
    print('First element: {}'.format(list_[0]))
    print('Last element: {}'.format(list_[-1]))

if __name__ == '__main__':
    list_ = []
    push(list_)
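If you do want the crawling and the saving to overlap, which is the producer-consumer setup your friend described, queue.Queue already does the locking for you. The sketch below sticks to the assumptions in your own code (requests for fetching, Document.create for saving); the None item is just a sentinel I made up to signal "no more work":

import threading
from queue import Queue

import requests
from model import Document  # same model as in the question

def produce(input_list, q):
    # put each enriched dict on the queue as soon as its html is fetched
    for e in input_list:
        r = requests.get(e['url'])
        e['html'] = r.content
        q.put(e)
    q.put(None)  # sentinel: tells the consumer there is no more work

def consume(q):
    # save items as they arrive instead of waiting for the whole list
    while True:
        e = q.get()
        if e is None:
            break
        Document.create(**e)

if __name__ == '__main__':
    q = Queue()
    items = [{'url': 'http://example.com', 'title': 'example'}]  # placeholder data
    producer = threading.Thread(target=produce, args=(items, q))
    consumer = threading.Thread(target=consume, args=(q,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()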
I'd like to make a program that makes offline copies of math questions from Khan Academy. I have a huge 21.6MB text file that contains data on all of their exercises, but I have no idea how to start analyzing it, much less start pulling the questions from it.
Here is a pastebin containing a sample of the JSON data. If you want to see all of it, you can find it here. Warning for long load time.
I've never used JSON before, but I wrote up a quick Python script to try to load individual "sub-blocks" (or whatever the correct term is) of data.
import sys
import json

exercises = open("exercises.txt", "r+b")
byte = 0
frontbracket = 0
backbracket = 0

while byte < 1000:  # while byte < character we want to read up to
                    # keep at 1000 for testing purposes
    char = exercises.read(1)
    sys.stdout.write(char)
    # Here we decide what to do based on what char we have
    if str(char) == "{":
        frontbracket = byte
        while True:
            char = exercises.read(1)
            if str(char) == "}":
                backbracket = byte
                break
        exercises.seek(frontbracket)
        block = exercises.read(backbracket - frontbracket)
        print "Block is " + str(backbracket - frontbracket) + " bytes long"
        jsonblock = json.loads(block)
        sys.stdout.write(block)
        print jsonblock["translated_display_name"]
        print "\nENDBLOCK\n"
    byte = byte + 1
Ok, the repeated pattern appears to be this: http://pastebin.com/4nSnLEFZ
To get an idea of the structure of the response, you can use JSONLint: copy/paste portions of your string and 'validate'. Even if the portion you copied is not valid, it will still format it into something you can actually read.
First, I used the requests library to pull the JSON for you. It's a super-simple library for things like this. The API is slow to respond because it seems you're pulling everything, but it should work fine.
Once you get a response from the API, you can convert it directly to Python objects using .json(). What you have is essentially a mixture of nested lists and dictionaries that you can iterate through to pull out specific details. In my example below, my_list2 has to use a try/except structure because it seems that some of the entries do not have two items in the list under translated_problem_types; in that case, it will just put 'None' instead. You might have to use trial and error for such things.
Finally, since you haven't used JSON before, it's worth noting that dictionaries do not guarantee the order in which you receive their details. However, in this case the outermost structure is a list, so in theory the order could be consistent, but don't rely on it; we don't know how the list is constructed.
import requests

api_call = requests.get('https://www.khanacademy.org/api/v1/exercises')
json_response = api_call.json()

# Assume we first want to list "author name" with "author key"
# This should loop through the repeated pattern in the pastebin
# access items as a dictionary
my_list1 = []
for item in json_response:
    my_list1.append([item['author_name'], item['author_key']])
print my_list1[0:5]

# Now let's assume we want the 'sha' of the SECOND entry in translated_problem_types
# to also be listed with author name
my_list2 = []
for item in json_response:
    try:
        the_second_entry = item['translated_problem_types'][0]['items'][1]['sha']
    except IndexError:
        the_second_entry = 'None'
    my_list2.append([item['author_name'], item['author_key'], the_second_entry])
print my_list2[0:5]
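Since the end goal is an offline copy, you could also dump the parsed response to a local file once and work from that instead of hitting the API every time; a minimal sketch (the filename is arbitrary):

import json

# write the parsed response to disk once...
with open('exercises.json', 'w') as f:
    json.dump(json_response, f)

# ...and later reload it without touching the API
with open('exercises.json') as f:
    json_response = json.load(f)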
I am having an issue getting all of the data from this site...
The section of the code I cannot get to produce all of the data is "pn"
I am hoping this code would produce these numbers from the site:
58312-GA4
58312-RG4
58312-RR$
I have tried a number of things, from switching the tags and classes to going back and forth with find, findAll, and find_all, and no matter what I try I get only one result.
Any help would be great - thanks
Here is the code:
import urllib.request
from bs4 import BeautifulSoup

theurl = "http://www.colehersee.com/home/grid/cat/14/?"
thepage = urllib.request.urlopen(theurl)
soup = BeautifulSoup(thepage, "html.parser")

for pn in soup.find('table', {"class": "mod_products_grid_listing"}).find_all('span', {"class": "product_code"}):
    pn2 = pn.text

for main in soup.find_all('nav', {"id": "breadcrumb"}):
    main1 = main.text

print(pn2)
print(main1)
You're running the for loop that collects the 'pn' value quite separately from the for loop for the 'main' value. To be specific, by the time your code reaches the second for loop, the first one has already executed in its entirety.
This results in the variable pn2 holding only the last value assigned during that loop.
You might want to do something like
pn2 = []
for pn in soup.find('table', {"class": "mod_products_grid_listing"}).find_all('span', {"class": "product_code"}):
    pn2.append(pn.text)
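pn2 is then a list of every product code rather than just the last one, so you can print them all after the loop, for example:

for code in pn2:
    print(code)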
I have the following Python code:
from urlparse import urlparse

def clean_url(url):
    new_url = urlparse(url)
    if new_url.netloc == '':
        return new_url.path.strip().decode()
    else:
        return new_url.netloc.strip().decode()

print clean_url("http://www.facebook.com/john.doe")
print clean_url("http://facebook.com/john.doe")
print clean_url("facebook.com/john.doe")
print clean_url("www.facebook.com/john.doe")
print clean_url("john.doe")
In each example I take in a string and return it. This is not what I want. I am trying to take each example and always return "http://www.facebook.com/john.doe" even if they just type www.* or just john.doe.
I am fairly new to programming so please be gentle.
I know this answer is a little late to the party, but if this is exactly what you're trying to do, I recommend a slightly different approach. Rather than reinventing the wheel for canonicalizing facebook urls, consider using the work that Google has already done for use with their Social Graph API.
They've already implemented patterns for a number of similar sites, including facebook. More information on that is here:
http://code.google.com/p/google-sgnodemapper/
import urlparse

p = urlparse.urlsplit("john.doe")
# => ('', '', 'john.doe', '', '')
The first element of the tuple (the scheme) should be 'http', the second element (the network location) should be 'www.facebook.com', and you can leave the fourth and fifth elements of the tuple alone. You can then reassemble your URL after processing it.
Just an FYI: to ensure a safe URL segment for 'john.doe' (this may not apply to Facebook, but it's a good rule to know), use urllib.quote(string) to properly escape whitespace, etc.
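To make the reassembly concrete, here is a minimal sketch of the approach described above, using urlsplit/urlunsplit and urllib.quote (Python 2, to match the question's code); it is only one possible way to stitch the pieces together:

import urllib
import urlparse

def clean_url(url):
    # whatever form the input takes, the user name is the last path segment
    user = urlparse.urlsplit(url).path.split('/')[-1]
    user = urllib.quote(user)  # escape whitespace and other unsafe characters
    return urlparse.urlunsplit(('http', 'www.facebook.com', user, '', ''))

print clean_url("http://www.facebook.com/john.doe")
print clean_url("facebook.com/john.doe")
print clean_url("john.doe")
# all three print http://www.facebook.com/john.doe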
I am not very sure I understood what you asked, but you can try this code. I tested it and it works fine, but if you have trouble with it, let me know.
I hope it helps
#!/usr/bin/env python
import urlparse

def clean_url(url):
    url_list = []
    # split values into tuple
    url_tuple = urlparse.urlsplit(url)
    # as tuples are immutable so take this to a list
    # so we can change the values that we need
    counter = 0
    for element in url_tuple:
        url_list.append(element)
    # validate each element individually
    url_list[0] = 'http'
    url_list[1] = 'www.facebook.com'
    # get user name from the original url
    # ** I understood the user is the only value
    # for sure in the url, right??
    user = url.split('/')
    if len(user) == 1:
        # the user was the only value sent
        url_list[2] = user[0]
    else:
        # get the last element of the list
        url_list[2] = user[len(user)-1]
    # convert the list into a tuple and
    # get all the elements together in the url again
    new_url = urlparse.urlunsplit(tuple(url_list))
    return new_url

if __name__ == '__main__':
    print clean_url("http://www.facebook.com/john.doe")
    print clean_url("http://facebook.com/john.doe")
    print clean_url("facebook.com/john.doe")
    print clean_url("www.facebook.com/john.doe")
    print clean_url("john.doe")