I am trying to get the names of members of a group I am a member of. I am able to get the names in the first page but not sure how to go to the next page:
My Code:
url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
json_obj = urllib2.urlopen(url)
data = json.load(json_obj)
for each in data['data']:
    print each['name']
Using the code above I am successfully getting all the names on the first page, but the question is: how do I go to the next page?
In the Graph API Explorer output I can see that the response also includes a paging object with a next URL.
What change does my code need so that it keeps going to the next pages and gets the names of ALL members of the group?
The JSON returned by the Graph API is telling you where to get the next page of data, in data['paging']['next']. You could give something like this a try:
def printNames():
    json_obj = urllib2.urlopen(url)
    data = json.load(json_obj)
    for each in data['data']:
        print each['name']
    return data['paging']['next']  # Return the URL to the next page of data

url = 'https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>'
url = printNames()
print "====END OF PAGE 1===="
url = printNames()
print "====END OF PAGE 2===="
You would need to add checks: for instance, data['paging']['next'] is only present in the JSON object if there is a next page, so you might want your function to return something richer that conveys this, but this should give you the idea.
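For example, here is a minimal sketch of a loop that keeps following paging.next until it disappears (print_all_names is just an illustrative name):

import json
import urllib2

def print_all_names(start_url):
    url = start_url
    while url:
        data = json.load(urllib2.urlopen(url))
        for member in data['data']:
            print member['name']
        # 'next' only appears while more pages remain, so the loop
        # stops once the last page has been printed
        url = data.get('paging', {}).get('next')

print_all_names('https://graph.facebook.com/v2.5/1671554786408615/members?access_token=<MY_CUSTOM_ACCESS_CODE_HERE>')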
from requests_html import HTMLSession

url = 'https://www.walmart.com/search?q=70+inch+tv&page=2&affinityOverride=default'

s = HTMLSession()
r = s.get(url)
r.html.render(sleep=1, timeout=20)
product = r.html.find('div.mb1.ph1.pa0-xl.bb.b--near-white.w-25')
productinfo = []
for item in product.absolute_links:
    # ra = s.get(item)
    # name = ra.html.find('h1',first=True).text
    products = {
        'link': item,
    }
    productinfo.append(products)
print(productinfo)
print(len(productinfo))
Output
for item in product.absolute_links:
AttributeError: 'list' object has no attribute 'absolute_links'
I want to get the link of every product and then scrape some data from this website with the requests-html library, but I'm getting an AttributeError. Please help me check the website's HTML.
But can I solve the captcha and log in via the requests-html library? I'm not super familiar with it.
Neither am I, but you can paste the request from your browser into https://curlconverter.com/ (they also have instructions on how to copy the request) and it will convert it into Python code for a request with headers and cookies that you can paste into your code. The last line of their code will be response = requests.get(...), but you can replace it with r = s.get(...) so that your code can still use requests_html methods like .html.render and .absolute_links (plain requests doesn't parse the HTML).
Just keep in mind that the cookies will expire, likely within a few hours, and that you'll have to copy them from your browser again by then if you want to keep scraping this way.
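For reference, a rough sketch of how the generated headers and cookies would slot into your existing code (the dict contents below are placeholders, not real values):

from requests_html import HTMLSession

# Placeholder values -- paste the actual header/cookie dicts that
# curlconverter.com generates from your copied browser request.
headers = {'user-agent': 'Mozilla/5.0 ...'}
cookies = {'example_cookie': 'value-copied-from-your-browser'}

s = HTMLSession()
r = s.get('https://www.walmart.com/search?q=70+inch+tv&page=2&affinityOverride=default',
          headers=headers, cookies=cookies)
r.html.render(sleep=1, timeout=20)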
for item in product.absolute_links:
AttributeError: 'list' object has no attribute 'absolute_links'
You can only apply .absolute_links to an element, and .find returns a list of elements (unless you specify first=True). Also, .absolute_links returns a set of links (even when that set only contains one link), so you need to either loop through it or convert it to a list and index into it to get at the link(s).
product = r.html.find('div.mb1.ph1.pa0-xl.bb.b--near-white.w-25')
productinfo = []
for prod in product:
    item = prod.absolute_links  # get product link/s
    # ra = s.get(list(item)[0])  # go to first product link
    # name = ra.html.find('h1',first=True).text
    products = {'link': item}
    productinfo.append(products)
Or, to make absolutely sure that you're looping through unique URLs:
product = r.html.find('div.mb1.ph1.pa0-xl.bb.b--near-white.w-25')
prodUrls = set().union(*[d.absolute_links for d in product])  # combine all sets of product links
productinfo = []
for item in prodUrls:
    # ra = s.get(item)
    # name = ra.html.find('h1',first=True).text
    products = {'link': item}
    productinfo.append(products)
By the way, if it doesn't find any products, then of course you won't get any links even if the error goes away, so add a line to print the request status (in case something went wrong there) as well as how many products and links were extracted:
print(r.status_code, r.reason, f' - {len(product)} products and {len(prodUrls)} product links from', r.url)
I am trying to create my first Python web-scraper to automate one task for work - I need to write all vacancies from this website (only for health) to an Excel file. Using a tutorial, I have come up with the following program.
However, in step 6, I receive an error stating: IndexError: list index out of range.
I have tried using start_page = paging[2].text, as I thought that the first page may be the base page, but it results in the same error.
Here are the steps that I followed:
I checked that the website https://iworkfor.nsw.gov.au allows scraping
Imported the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas
Stored the URL as a variable:
base_url = "https://iworkfor.nsw.gov.au/nsw-health-jobs?divisionid=1"
Get the HTML content:
r = requests.get(base_url)
c = r.content
Parse the HTML:
soup = BeautifulSoup(c,"html.parser")
To extract the first and last page numbers
paging = soup.find("div",{"class":"pana jobResultPaging tab-paging-top"}).find_all("a")
start_page = paging[1].text
last_page = paging[len(paging)-2].text
Making an empty list to append all the content:
web_content_list = []
Making page links from the page numbers, crawling through the pages, and extracting the contents from the corresponding tags:
for page_number in range(int(start_page), int(last_page) + 1):
    # To form the url based on page numbers
    url = base_url + "&page=" + str(page_number)
    r = requests.get(url)
    c = r.content
    soup = BeautifulSoup(c, "html.parser")
To extract the Title
vacancies_header = soup.find_all("div", {"class":"box-sec2-left"})
To extract the LHD, Job type and Job Reference number
vacancies_content = soup.find_all("div", {"class":"box-sec2-right"})
To process each vacancy by looping:
for item_header, item_content in zip(vacancies_header, vacancies_content):
    # To store the information to a dictionary
    web_content_dict = {}
    web_content_dict["Title"] = item_header.find("a").text.replace("\r","").replace("\n","")
    web_content_dict["Date Posted"] = item_header.find("span").text
    web_content_dict["LHD"] = item_content.find("h5").text
    web_content_dict["Position Type"] = item_content.find("p").text
    web_content_dict["Job Reference Number"] = item_content.find("span",{"class":"box-sec2-reference"}).text
    # To store the dictionary into the list
    web_content_list.append(web_content_dict)
To make a dataframe with the list
df = pandas.DataFrame(web_content_list)
To write the dataframe to a csv file
df.to_csv("Output.csv")
Ideally, the program will write the data about all vacancies to a CSV file in a nice table with the columns: title, date posted, LHD, Position Type, Job reference number.
The problem is that your initial call to find() returns an empty <div>, and so your subsequent call to find_all returns an empty list:
>div = soup.find("div",{"class":"pana jobResultPaging tab-paging-top"})
>div
<div class="pana jobResultPaging tab-paging-top">
</div>
>div.find_all("a")
[]
Update:
The reason you're unable to parse the contents of the <div> in question (i.e. why it's empty) is that the data retrieved from the server is "paginated" by client-side JavaScript (code running in your browser). Your Python code parses only the HTML returned by the initial request to iworkfor.nsw.gov.au; the data you're actually after (which is what gets turned into "pages") is requested separately by that JavaScript and returned by the server in a format called JSON.
So, the bad news is that the instructions that have been provided to you will not work. You will have to parse the JSON returned by the server and then decode the escaped HTML that it contains.
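In outline, that approach would look something like the sketch below. The JSON URL and the field name here are hypothetical; use your browser's network tab to find the actual request the page's JavaScript makes and the real shape of the response.

import html

import requests
from bs4 import BeautifulSoup

# Hypothetical endpoint spotted in the browser's network tab; the real
# URL, parameters and response layout will differ.
json_url = "https://iworkfor.nsw.gov.au/some-jobs-endpoint?divisionid=1&page=1"
payload = requests.get(json_url).json()

# Assuming the JSON carries the job listings as escaped HTML in one of
# its fields, unescape it and hand it to BeautifulSoup as before.
escaped_html = payload["resultsHtml"]  # hypothetical field name
soup = BeautifulSoup(html.unescape(escaped_html), "html.parser")
vacancies_header = soup.find_all("div", {"class": "box-sec2-left"})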
I am trying to write a program that will collect specific information from an ebay product page and write that information to a text file. To do this I'm using BeautifulSoup and Requests and I'm working with Python 2.7.9.
I've been mostly using this tutorial (Easy Web Scraping with Python) with a few modifications. So far everything works as intended until it writes to the text file. The information is written, just not in the format that I would like.
What I'm getting is this:
{'item_title': u'Old Navy Pink Coat M', 'item_no': u'301585876394', 'item_price': u'US $25.00', 'item_img': 'http://i.ebayimg.com/00/s/MTYwMFgxMjAw/z/Sv0AAOSwv0tVIoBd/$_35.JPG'}
What I was hoping for was something that would be a bit easier to work with.
For example :
New Shirt 5555555555 US $25.00 http://ImageURL.jpg
In other words I want just the scraped text and not the brackets, the 'item_whatever', or the u'.
After a bit of research I suspect my problem is to do with the encoding of the information as its being written to the text file, but I'm not sure how to fix it.
So far I have tried,
def collect_data():
    with open('writetest001.txt','w') as x:
        for product_url in get_links():
            get_info(product_url)
            data = "'{0}','{1}','{2}','{3}'".format(item_data['item_title'],'item_price','item_no','item_img')
            x.write(str(data))
In the hopes that it would make the data easier to format in the way I want. It only resulted in "NameError: global name 'item_data' is not defined" displayed in IDLE.
I have also tried using .split() and .decode('utf-8') in various positions but have only received AttributeErrors or the written outcome does not change.
Here is the code for the program itself.
import requests
import bs4

#Main URL for Harvesting
main_url = 'http://www.ebay.com/sch/Coats-Jackets-/63862/i.html?LH_BIN=1&LH_ItemCondition=1000&_ipg=24&rt=nc'

#Harvests Links from "Main" Page
def get_links():
    r = requests.get(main_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    return [a.attrs.get('href') for a in soup.select('div.gvtitle a[href^=http://www.ebay.com/itm]')]

print "Harvesting Now... Please Wait...\n"
print "Harvested:", len(get_links()), "URLs"
#print (get_links())
print "Finished Harvesting... Scraping will Begin Shortly...\n"

#Scrapes Select Information from each page
def get_info(product_url):
    item_data = {}
    r = requests.get(product_url)
    data = r.text
    soup = bs4.BeautifulSoup(data)
    #Fixes the 'Details about ' problem in the Title
    for tag in soup.find_all('span', {'class':'g-hdn'}):
        tag.decompose()
    item_data['item_title'] = soup.select('h1#itemTitle')[0].get_text()
    #Grabs the Price, if the item is on sale, grabs the sale price
    try:
        item_data['item_price'] = soup.select('span#prcIsum')[0].get_text()
    except IndexError:
        item_data['item_price'] = soup.select('span#mm-saleDscPrc')[0].get_text()
    item_data['item_no'] = soup.select('div#descItemNumber')[0].get_text()
    item_data['item_img'] = soup.find('img', {'id':'icImg'})['src']
    return item_data

#Collects information from each page and write to a text file
write_it = open("writetest003.txt","w","utf-8")
def collect_data():
    for product_url in get_links():
        write_it.write(str(get_info(product_url)) + '\n')

collect_data()
write_it.close()
You were on the right track.
You need a local variable to assign the results of get_info to. The variable item_data you tried to reference only exists within the scope of the get_info function. You can use the same variable name though, and assign the results of the function to it.
There was also an issue in the section you tried with how you were formatting the items: you passed the literal strings 'item_price', 'item_no', and 'item_img' to format() instead of looking them up in the dictionary.
Replace the section you tried with this:
for product_url in get_links():
    item_data = get_info(product_url)
    data = "{0},{1},{2},{3}".format(*(item_data[item] for item in ('item_title','item_price','item_no','item_img')))
    x.write(data)
I am attempting to create a bot that fetches market links from Steam, but I have run into a problem. I was able to return all the data from a single page, but when I attempt to get multiple pages it just gives me copies of the first page, even though I give it working links (e.g. http://steamcommunity.com/market/search?q=appid%3A753#p1 and then http://steamcommunity.com/market/search?q=appid%3A753#p2). I have tested the links and they work in my browser. This is my code.
import urllib2
import random
import time

start_url = "http://steamcommunity.com/market/search?q=appid%3A753"
end_page = 3
urls = []

def get_raw(url):
    req = urllib2.Request(url)
    response = urllib2.urlopen(req)
    return response.read()

def get_market_urls(html):
    index = 0
    while index != -1:
        index = html.find("market_listing_row_link", index + 25)
        beg = html.find("http", index)
        end = html.find('"', beg)
        print html[beg:end]
        urls.append(html[beg:end])

def go_to_page(page):
    return start_url + "#p" + str(page)

def wait(min, max):
    wait_t = random.randint(min, max)
    time.sleep(wait_t)

for i in range(end_page):
    url = go_to_page(i + 1)
    raw = get_raw(url)
    get_market_urls(raw)
Your problem is that you've misunderstood what the URL says.
The number after the hashtag isn't a query string at all; it's the URL fragment, which is never sent to the server, so every one of your requests fetches the same page. On that particular page, client-side JavaScript reads the fragment to decide which page of results to pull in via AJAX.
Anyway, you should look at this URL instead: http://steamcommunity.com/market/search/render/?query=appid%3A753&start=00&count=10. You can play with the start=00&count=10 parameters to get the results you want.
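For example, here is a rough sketch of paging through that render endpoint with the helpers you already defined above; the results_html field name is an assumption, so inspect the JSON you actually get back to confirm its structure.

import json

# Walk the AJAX endpoint the site itself uses, 10 results at a time.
render_url = "http://steamcommunity.com/market/search/render/?query=appid%3A753&start={0}&count=10"

for start in range(0, 30, 10):
    payload = json.loads(get_raw(render_url.format(start)))
    # The listings are assumed to arrive as escaped HTML in 'results_html';
    # reuse the existing parser on that fragment.
    get_market_urls(payload.get("results_html", ""))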
Enjoy.
How can I get the current URL and save it as a string in Python?
I have some code that uses encodedURL = urllib.quote_plus to change the URL inside a for loop that goes through a list. I cannot simply save encodedURL as a new variable, because it is reassigned on every iteration and after the loop it only holds the last item in the list.
My end goal is that I want to get the URL of a hyperlink that the user clicks on, so I can display certain content on that specific URL.
Apologies if I have left out important information. There is too much code and too many modules to post it all here. If you need anything else please let me know.
EDIT: To add more description:
I have a page with a list of user comments about a website. The website name is hyperlinked to the actual site, and there is a "list all comments about this website" link. My goal is that when the user clicks "list all comments about this website", another page opens showing every comment about that website. The problem is that I cannot work out which website they are referring to when they click that link.
Don't know if it helps but this is what I am using:
z = []
for x in S:
    y = list(x)
    z.append(y)

for coms in z:
    url = urllib.quote_plus(coms[2])
    coms[2] = "'Commented on:' <a href='%s'> %s</a> (<a href = 'conversation?page=%s'> all </a>) " % (coms[2], coms[2], url)
    coms[3] += "<br><br>"

deCodedURL = urllib.unquote_plus(url)
text2 = interface.list_comments_page(db, **THIS IS THE PROBLEM**)
page_comments = {
    'comments_page': '<p>%s</p>' % text2,
}
if environ['PATH_INFO'] == '/conversation':
    headers = [('content-type', 'text/html')]
    start_response("200 OK", headers)
    return templating.generate_page(page_comments)
So your problem is you need to parse the URL for the query string, and urllib has some helpers for that:
>>> i
'conversation?page=http://www.google.com/'
>>> urllib.splitvalue(urllib.splitquery(i)[1])
('page', 'http://www.google.com/')
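The documented route (Python 2's urlparse module) gets you the same result, in case you'd rather avoid the undocumented split* helpers:

>>> import urlparse
>>> i
'conversation?page=http://www.google.com/'
>>> urlparse.parse_qs(urlparse.urlparse(i).query)['page'][0]
'http://www.google.com/'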