Scraping data from a website with Infinite Scroll? - python

I am trying to scrape a website for titles as well as other items, but for the sake of brevity, just game titles.
I have tried using Selenium and Beautiful Soup in tandem to grab the titles, but I cannot seem to get all the September releases no matter what I do. In fact, I get some of the August game titles as well. I think it has to do with the fact that the page has no ending (infinite scroll). How would I grab just the September titles? Below is the code I used; I have tried to use scrolling, but I do not think I understand how to use it properly.
EDIT: My goal is to be able to eventually get each month by changing a few lines of code.
from selenium import webdriver
from bs4 import BeautifulSoup
titles = []
chromedriver = 'C:/Users/Chase The Great/Desktop/Podcast/chromedriver.exe'
driver = webdriver.Chrome(chromedriver)
driver.get('https://www.releases.com/l/Games/2019/9/')
res = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()
soup = BeautifulSoup(res, 'lxml')
for title in soup.find_all(class_='calendar-item-title'):
    titles.append(title.text)
I expect to get 133 titles, but I get some August titles plus only part of the September titles, as shown below:
['SubaraCity', 'AER - Memories of Old', 'Vambrace: Cold Soul', 'Agent A: A Puzzle in Disguise', 'Bubsy: Paws on Fire!', 'Grand Brix Shooter', 'Legend of the Skyfish', 'Vambrace: Cold Soul', 'Obakeidoro!', 'Pokemon Masters', 'Decay of Logos', 'The Lord of the Rings: Adventure ...', 'Heave Ho', 'Newt One', 'Blair Witch', 'Bulletstorm: Duke of Switch Edition', 'The Ninja Saviors: Return of the ...', 'Re:Legend', 'Risk of Rain 2', 'Decay of Logos', 'Unlucky Seven', 'The Dark Pictures Anthology: Man ...', 'Legend of the Skyfish', 'Astral Chain', 'Torchlight II', 'Final Fantasy VIII Remastered', 'Catherine: Full Body', 'Root Letter: Last Answer', 'Children of Morta', 'Himno', 'Spyro Reignited Trilogy', 'RemiLore: Lost Girl in the Lands ...', 'Divinity: Original Sin 2 - Defini...', 'Monochrome Order', 'Throne Quest Deluxe', 'Super Kirby Clash', 'Himno', 'Post War Dreams', 'The Long Journey Home', 'Spice and Wolf VR', 'WRC 8', 'Fantasy General II', 'River City Girls', 'Headliner: NoviNews', 'Green Hell', 'Hyperforma', 'Atomicrops', 'Remothered: Tormented Fathers']

It seems to me that in order to get only September, you first want to grab only the section for September:
section = soup.find('section', {'class': 'Y2019-M9 calendar-sections'})
Then, once you have the September section, get all the titles, which are in <a> tags, like this:
for title in section.find_all('a', {'class': 'calendar-item-title subpage-trigg'}):
    titles.append(title.text)
Please note that none of the previous has been tested.
UPDATE:
The problem is that every time you load the page, it gives you only the very first section, which contains only 24 items; to access the rest you have to scroll down (infinite scroll).
If you open the browser developer tools, select Network and then XHR, you will notice that every time you scroll and load the next "page" there is a request with a URL similar to this:
https://www.releases.com/calendar/nextAfter?blockIndex=139&itemIndex=23&category=Games&regionId=us
My guess is that blockIndex is meant for the month and itemIndex is for each page loaded. If you are looking only for the month of September, blockIndex will always be 139 in that request; the challenge is to get the next itemIndex for the next page so you can construct your next request.
The next itemIndex will always be the last itemIndex of the previous request.
I did make a script that does what you want using only requests and BeautifulSoup. Use it at your own discretion; there are some constants that could be extracted dynamically, but I think this could give you a head start:
import json
import requests
from bs4 import BeautifulSoup

DATE_CODE = 'Y2019-M9'
LAST_ITEM_FIRST_PAGE = f'calendar-item col-xs-6 to-append first-item calendar-last-item {DATE_CODE}-None'
LAST_ITEM_PAGES = f'calendar-item col-xs-6 to-append calendar-last-item {DATE_CODE}-None'
INITIAL_LINK = 'https://www.releases.com/l/Games/2019/9/'
BLOCK = 139

titles = []

def get_next_page_link(div: BeautifulSoup):
    # the last item of the current page carries the item-index needed for the next request
    index = div['item-index']
    return f'https://www.releases.com/calendar/nextAfter?blockIndex={BLOCK}&itemIndex={index}&category=Games&regionId=us'

def get_content_from_requests(page_link):
    headers = requests.utils.default_headers()
    headers['User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
    req = requests.get(page_link, headers=headers)
    return BeautifulSoup(req.content, 'html.parser')

def scroll_pages(link: str):
    print(link)
    page = get_content_from_requests(link)
    # collect the titles belonging to the requested month
    for div in page.findAll('div', {'date-code': DATE_CODE}):
        item = div.find('a', {'class': 'calendar-item-title subpage-trigg'})
        if item:
            # print(f'TITLE: {item.getText()}')
            titles.append(item.getText())
    # find the last item of this page to build the next request
    last_index_div = page.find('div', {'class': LAST_ITEM_FIRST_PAGE})
    if not last_index_div:
        last_index_div = page.find('div', {'class': LAST_ITEM_PAGES})
    if last_index_div:
        scroll_pages(get_next_page_link(last_index_div))
    else:
        print(f'Found: {len(titles)} Titles')
        print('No more pages to scroll, finishing...')

scroll_pages(INITIAL_LINK)

with open('titles.json', 'w') as outfile:
    json.dump(titles, outfile)
If your goal is to use Selenium, I think the same principle applies, unless you use Selenium's ability to scroll the page as it is loading (see the sketch below).
Replacing INITIAL_LINK, DATE_CODE and BLOCK accordingly will get you other months as well.
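For reference, here is a minimal, untested sketch of that Selenium route: keep scrolling until the document height stops growing, then parse the rendered HTML. It assumes the same page structure as above (the Y2019-M9 section and calendar-item-title links); adjust the chromedriver path for your machine.

import time

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('C:/path/to/chromedriver.exe')  # adjust to your chromedriver location
driver.get('https://www.releases.com/l/Games/2019/9/')

# scroll until the page height stops changing, i.e. no more items are being loaded
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(1)  # give the next block of items time to load
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

# keep only the September 2019 section, then pull the title links out of it
section = soup.find('section', {'class': 'Y2019-M9'})
titles = [a.text for a in section.find_all('a', {'class': 'calendar-item-title'})]
print(len(titles))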

Related

Can't scrape Bangood site with beautiful soup and selenium

Hi guys, I have run into some problems using Beautiful Soup.
I'm trying to scrape Banggood's website, but, I don't know why, I've only succeeded in scraping the item's name.
Using Selenium I scraped the item's price (only in USD, not in euros).
So I'm asking for your help; I would be very pleased if you knew any way to overcome these problems.
I would like to scrape the name, price in euros, discount, stars and image, but I cannot understand why Beautiful Soup doesn't work.
PS: Obviously I don't want all the functions written for me, just the reason why Beautiful Soup gives all these problems, and an example if you can.
Below is the HTML I want to scrape (with Beautiful Soup if possible).
Thanks for everything!
The link I want to scrape: https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA
<span class="main-price" oriprice-range="0-0" oriprice="22.99">19,48€</span>
<strong class="average-num">4.95</strong>
<img src="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg.webp" id="landingImage" data-large="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg" dpid="left_largerView_image_180411|product|18101211554" data-src="https://imgaz1.staticbg.com/thumb/large/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg" style="height: 100%; transform: translate3d(0px, 0px, 0px);">
These are the functions I'm using.
This one doesn't work:
def take_image_bang(soup):  # Beautiful Soup and json
    img_div = soup.find("div", attrs={"class": 'product-image'})
    imgs_str = img_div.img.get('data-large')  # a string in JSON format
    # convert to a dictionary
    imgs_dict = json.loads(imgs_str)
    print(imgs_dict)
    # each key in the dictionary is a link of an image, and the value shows the size (print the whole dictionary to inspect)
    # num_element = 0
    # first_link = list(imgs_dict.keys())[num_element]
    return imgs_dict
These work (but the price function returns only USD, not euros):
def get_title_bang(soup):  # Beautiful Soup
    try:
        # Outer Tag Object
        title = soup.find("span", attrs={"class": 'product-title-text'})
        # Inner NavigableString Object
        title_value = title.string
        # Title as a string value
        title_string = title_value.strip()
        # # Printing types of values for efficient understanding
        # print(type(title))
        # print(type(title_value))
        # print(type(title_string))
        # print()
    except AttributeError:
        title_string = ""
    return title_string
def get_Bangood_price(driver):  # Selenium
    c = CurrencyConverter()
    prices = driver.find_elements_by_class_name('main-price')
    for price in prices:
        price = price.text.replace("US$", "")
        priceZ = float(price)
        price_EUR = c.convert(priceZ, 'USD', 'EUR')
    return price_EUR
Since you want the price in EUR, the URL needs to be changed; you can set the currency in the URL, as the web page itself does:
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"}
res = requests.get("https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA&DCC=IT&currency=EUR", headers=headers)
soup = BeautifulSoup(res.text, "lxml")
Finding title:
main_div=soup.find("div",class_="product-info")
title=main_div.find("h1",class_="product-title").get_text(strip=True)
print(title)
Output:
ANENG AN8008 Vero RMS Digitale Multimetri Tester di AC DC Corrente Tensione Resistenza Frenquenza CapacitàCOD
For finding reviews:
star = [i.get_text(strip=True) for i in main_div.find("div", class_="product-reviewer").find_all("dd")]
star
Output:
['5 Stella2618 (95.8%)',
'4 Stella105 (3.8%)',
'3 Stella9 (0.3%)',
'2 Stella0 (0.0%)',
'1 Stella2 (0.1%)']
For finding the price and other data, you can get them from the script tag; use json to load it:
import json

data = soup.find("script", attrs={"type": "application/ld+json"}).string.strip().strip(";")
main_data = json.loads(data)
Finding values from it:
price = main_data['offers']['priceCurrency'] + " " + main_data['offers']['price']
image = main_data['image']
print(price, image)
Output:
EUR 19.48 https://imgaz3.staticbg.com/thumb/view/oaupload/banggood/images/1B/ED/b3e9fd47-ebb4-479b-bda2-5979c9e03a11.jpg
For finding the discount price: since prices are updated dynamically, you can call the XHR link and extract the data from its response (here is the url); use a POST request for it.
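For illustration only, a minimal sketch of that POST idea. The endpoint and form fields below are placeholders I made up; copy the real XHR URL and payload from the browser's Network tab (F12 -> Network -> XHR) while the price refreshes.

import requests

headers = {"User-Agent": "Mozilla/5.0"}

# Placeholder values: replace with the actual XHR URL and form data seen in the Network tab.
xhr_url = "https://it.banggood.com/<xhr-endpoint-from-network-tab>"
payload = {"products_id": "1157985", "currency": "EUR"}  # assumed field names

resp = requests.post(xhr_url, data=payload, headers=headers)
print(resp.json())  # inspect the returned JSON for the discount/price fields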
To scrape the data in euros, you need to change your link address and add this to the end of the link:
For EURO add: &currency=EUR
For USD add: &currency=USD
For euro, the link should be:
https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html?rmmds=flashdeals&cur_warehouse=USA&currency=EUR
As another example, if you wish to change the warehouse for the product, change:
For CN change: cur_warehouse=CN
For USA change: cur_warehouse=USA
For PL change: cur_warehouse=PL
These are dynamic variables in the URL that change the webpage depending on their values (a small example of building such URLs follows below).
After this, your second method should work just fine. Happy scraping!
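As a convenience, here is a small sketch of building such URLs with a query-parameter dict in requests; the parameter names are taken from the links above, and the User-Agent is just an example.

import requests

base_url = ("https://it.banggood.com/ANENG-AN8008-True-RMS-Wave-Output-Digital-Multimeter-"
            "AC-DC-Current-Volt-Resistance-Frequency-Capacitance-Test-p-1157985.html")
params = {
    "rmmds": "flashdeals",
    "cur_warehouse": "USA",  # or CN, PL, ...
    "currency": "EUR",       # or USD
}
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(base_url, params=params, headers=headers)
print(response.url)  # the final URL with the query string appended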

Selenium in Python: Run scraping code after all lazy-loading component is loaded

I'm new to Selenium and still have the question below after searching for solutions.
I am trying to access all the links on this website (https://www.ecb.europa.eu/press/pressconf/html/index.en.html).
The individual links get loaded in a "lazy-load" fashion; they load gradually as the user scrolls down the screen.
import re
import time

from selenium import webdriver

driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.ecb.europa.eu/press/pressconf/html/index.en.html")

# scrolling
lastHeight = driver.execute_script("return document.body.scrollHeight")
# print(lastHeight)
pause = 0.5
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause)
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:
        break
    lastHeight = newHeight
    print(lastHeight)

# ---
elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    url = elem.get_attribute("href")
    if re.search(r'is\d+.en.html', url):
        print(url)
However, it only gets the required links from the last lazy-loaded element; everything before it is not obtained because it was never loaded.
I want to make sure all lazy-loaded elements have loaded before executing any scraping code. How can I do that?
Many thanks.
Selenium was not designed for web scraping (although in complicated cases it can be useful). In your case, press F12 -> Network and look at the XHR tab while you scroll down the page. You can see that the queries being added contain the year in their URLs, so the page issues sub-requests as you scroll down and reach other years.
Look at the Response tab to find the divs and classes and build your BeautifulSoup find_all calls.
A simple little loop through the years with requests and bs4 is enough:
import requests as rq
from bs4 import BeautifulSoup as bs

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0"}

resultats = []
for year in range(1998, 2021 + 1, 1):
    url = "https://www.ecb.europa.eu/press/pressconf/%s/html/index_include.en.html" % year
    resp = rq.get(url, headers=headers)
    soup = bs(resp.content, "lxml")
    titles = map(lambda x: x.text, soup.find_all("div", {"class": "title"}))
    subtitles = map(lambda x: x.text, soup.find_all("div", {"class": "subtitle"}))
    dates = map(lambda x: x.text, soup.find_all("dt"))
    zipped = list(zip(dates, titles, subtitles))
    resultats.extend(zipped)
resultats contains:
...
('8 November 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi, President of the ECB, Vítor Constâncio, Vice-President of the ECB, Frankfurt am Main, 8 November 2012'),
('4 October 2012',
'Mario Draghi, Vítor Constâncio:\xa0Introductory statement to the press conference (with Q&A)',
'Mario Draghi, President of the ECB, Vítor Constâncio, Vice-President of the ECB, Brdo pri Kranju, 4 October 2012'),
...

Scraping author names from a website with try/except using Python

I am trying to use Try/Except in order to scrape through different pages of a URL containing author data. I need a set of author names from 10 subsequent pages of this website.
# Import Packages
import requests
import bs4
from bs4 import BeautifulSoup as bs

# Output list
authors = []

# Website Main Page URL
URL = 'http://quotes.toscrape.com/'
res = requests.get(URL)
soup = bs4.BeautifulSoup(res.text, "lxml")

# Get the contents from the first page
for item in soup.select(".author"):
    authors.append(item.text)

page = 1
pagesearch = True

# Get the contents from pages 2-10
while pagesearch:
    # Check if page is available
    try:
        req = requests.get(URL + '/' + 'page/' + str(page) + '/')
        soup = bs(req.text, 'html.parser')
        page = page + 1
        for item in soup.select(".author"):  # Append the author class from the webpage html
            authors.append(item.text)
    except:
        print("Page not found")
        pagesearch == False
        break  # Break if no page is remaining

print(set(authors))  # Print the output as a unique set of author names
The first page doesn't have a page number in its URL, so I treated it separately. I'm using the try/except block to iterate through all of the possible pages, throw an exception, and break the loop when the last page is scanned.
When I run the program, it enters an infinite loop, whereas it should print the "Page not found" message when the pages are over. When I interrupt the kernel, I see the correct result as a list and my exception statement, but nothing before that. I get the following result:
Page not found
{'Allen Saunders', 'J.K. Rowling', 'Pablo Neruda', 'J.R.R. Tolkien', 'Harper Lee', 'J.M. Barrie',
'Thomas A. Edison', 'J.D. Salinger', 'Jorge Luis Borges', 'Haruki Murakami', 'Dr. Seuss', 'George
Carlin', 'Alexandre Dumas fils', 'Terry Pratchett', 'C.S. Lewis', 'Ralph Waldo Emerson', 'Jim
Henson', 'Suzanne Collins', 'Jane Austen', 'E.E. Cummings', 'Jimi Hendrix', 'Khaled Hosseini',
'George Eliot', 'Eleanor Roosevelt', 'André Gide', 'Stephenie Meyer', 'Ayn Rand', 'Friedrich
Nietzsche', 'Mother Teresa', 'James Baldwin', 'W.C. Fields', "Madeleine L'Engle", 'William
Nicholson', 'George R.R. Martin', 'Marilyn Monroe', 'Albert Einstein', 'George Bernard Shaw',
'Ernest Hemingway', 'Steve Martin', 'Martin Luther King Jr.', 'Helen Keller', 'Charles M. Schulz',
'Charles Bukowski', 'Alfred Tennyson', 'John Lennon', 'Garrison Keillor', 'Bob Marley', 'Mark
Twain', 'Elie Wiesel', 'Douglas Adams'}
What can be the reason for this? Thanks.
I think that's because there literally is a page there. The exception would only arise if there were no page to show in the browser at all.
But when you make a request for this one:
http://quotes.toscrape.com/page/11/
the site still returns a page (just with no quotes on it) that bs4 can parse, so no exception is raised and your loop never stops.
How to stop after page 10? You can check for the presence of the Next page button, as sketched below.
Thanks for reading.
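A minimal sketch of that idea (untested; it assumes the pager's "Next" link sits inside an element with class next, which is what the quotes.toscrape.com markup appears to use):

import requests
from bs4 import BeautifulSoup

authors = []
page = 1

while True:
    res = requests.get(f'http://quotes.toscrape.com/page/{page}/')
    soup = BeautifulSoup(res.text, 'html.parser')
    for item in soup.select(".author"):
        authors.append(item.text)
    # Stop as soon as the pager no longer offers a "Next" link
    if soup.select_one('li.next') is None:
        break
    page += 1

print(set(authors))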
Try using the built-in range() function to go from pages 1-10 instead:
import requests
from bs4 import BeautifulSoup

url = "http://quotes.toscrape.com/page/{}/"
authors = []

for page in range(1, 11):
    response = requests.get(url.format(page))
    print("Requesting Page: {}".format(response.url))
    soup = BeautifulSoup(response.content, "html.parser")
    for tag in soup.select(".author"):
        authors.append(tag.text)

print(set(authors))

BeautifulSoup scrape the first title tag in each <li>

I have some code that goes through the cast list of a show or movie on Wikipedia, scraping all the actors' names and storing them. The current code finds all the <a> tags in the list and stores their title attributes. It currently goes:
import requests
from bs4 import BeautifulSoup

URL = input()
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent

Stars = []
for x in section.find_next('ul').find_all('a'):
    title = x.get('title')
    print(title)
    if title is not None:
        Stars.append(title)
    else:
        continue
While this partially works, there are two downsides:
It doesn't work if the actor doesn't have a Wikipedia page hyperlink.
It also scrapes any other hyperlink title it finds. e.g. https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull returns ['Harrison Ford', 'Indiana Jones (character)', 'Bullwhip', 'Cate Blanchett', 'Irina Spalko', 'Bob cut', 'Rosa Klebb', 'From Russia with Love (film)', 'Karen Allen', 'Marion Ravenwood', 'Ray Winstone', 'Sallah', 'List of characters in the Indiana Jones series', 'Sexy Beast', 'Hamstring', 'Double agent', 'John Hurt', 'Ben Gunn (Treasure Island)', 'Treasure Island', 'Courier', 'Jim Broadbent', 'Marcus Brody', 'Denholm Elliott', 'Shia LaBeouf', 'List of Indiana Jones characters', 'The Young Indiana Jones Chronicles', 'Frank Darabont', 'The Lost World: Jurassic Park', 'Jeff Nathanson', 'Marlon Brando', 'The Wild One', 'Holes (film)', 'Blackboard Jungle', 'Rebel Without a Cause', 'Switchblade', 'American Graffiti', 'Rotator cuff']
Is there a way I can get BeautifulSoup to scrape the first two words after each <li>? Or even a better solution for what I am trying to do?
You can use css selectors to grab only the first <a> in a <li>:
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
Example
import requests
from bs4 import BeautifulSoup

URL = 'https://en.wikipedia.org/wiki/Indiana_Jones_and_the_Kingdom_of_the_Crystal_Skull#Cast'
website_url = requests.get(URL).text
soup = BeautifulSoup(website_url, 'lxml')
section = soup.find('span', id='Cast').parent

Stars = []
for x in section.find_next('ul').select('li > a:nth-of-type(1)'):
    Stars.append(x.get('title'))

Stars
Output
['Harrison Ford',
'Cate Blanchett',
'Karen Allen',
'Ray Winstone',
'John Hurt',
'Jim Broadbent',
'Shia LaBeouf']
You can use a regex to fetch all the names from the text content of each <li> and just take the first two; this also fixes the issue where the actor doesn't have a Wikipedia page hyperlink.
import re
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)", <text_content_from_li>)
Example:
text = "Cate Blanchett as Irina Spalko, a villainous Soviet agent. Screenwriter David Koepp created the character."
re.findall("([A-Z]{1}[a-z]+) ([A-Z]{1}[a-z]+)",text)
Output:
[('Cate', 'Blanchett'), ('Irina', 'Spalko'), ('Screenwriter', 'David')]
There is considerable variation in the HTML for the cast section across film listings on Wikipedia. Perhaps look to an API to get this info?
E.g. imdb8 allows for a reasonable number of calls which you could use with the following endpoint
https://imdb8.p.rapidapi.com/title/get-top-cast
There also seems to be a Python IMDb API.
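If you go that route, a minimal sketch using the IMDbPY/Cinemagoer package might look like the following (an assumption on my part, not something from the original answer; install it with pip install imdbpy first):

from imdb import IMDb  # IMDbPY, also published as "cinemagoer"

ia = IMDb()
# Numeric IMDb id without the "tt" prefix, e.g. tt0367882 -> '0367882'
movie = ia.get_movie('0367882')
print(movie['title'])
for person in movie['cast'][:10]:  # first few cast members
    print(person['name'])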
Or choose something with more regular HTML. For example, if you take the IMDb film ids in a list, you can extract the full cast and the main actors from IMDb as follows. To get the shorter cast list, I filter out the rows which occur at/after the text "Rest" within "Rest of cast listed alphabetically:".
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

movie_ids = ['tt0367882', 'tt7126948']
base = 'https://www.imdb.com'

with requests.Session() as s:
    for movie_id in movie_ids:
        link = f'https://www.imdb.com/title/{movie_id}/fullcredits?ref_=tt_cl_sm'
        # print(link)
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        print(soup.select_one('title').text)
        full_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list [href*=name]:has(img)')]
        main_cast = [(i.img['title'], base + i['href']) for i in soup.select('.cast_list tr:not(:has(.castlist_label:contains(cast)) ~ tr, :has(.castlist_label:contains(cast))) [href*=name]:has(img)')]
        df_full = pd.DataFrame(full_cast, columns=['Actor', 'Link'])
        df_main = pd.DataFrame(main_cast, columns=['Actor', 'Link'])
        # print(df_full)
        print(df_main)

Scrapy or BeautifulSoup to scrape links and text from various websites

I am trying to scrape the links from an inputted URL, but it's only working for one URL (http://www.businessinsider.com). How can it be adapted to scrape from any URL that is entered? I am using BeautifulSoup, but is Scrapy better suited for this?
import urllib.request
from bs4 import BeautifulSoup

def WebScrape():
    linktoenter = input('Where do you want to scrape from today?: ')
    url = linktoenter
    html = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(html, "lxml")
    if linktoenter in url:
        print('Retrieving your links...')
        links = {}
        n = 0
        link_title = soup.findAll('a', {'class': 'title'})
        n += 1
        links[n] = link_title
        for eachtitle in link_title:
            print(eachtitle['href'] + "," + eachtitle.string)
    else:
        print('Please enter another Website...')
You could make a more generic scraper, searching for all tags and all links within those tags. Once you have the list of all links, you can use a regular expression or similar to find the links that match your desired structure.
import requests
from bs4 import BeautifulSoup
import re

response = requests.get('http://www.businessinsider.com')
soup = BeautifulSoup(response.content, 'lxml')

# find all tags
tags = soup.find_all()

links = []
# iterate over all tags and extract links
for tag in tags:
    # find all href links
    tmp = tag.find_all(href=True)
    # append the master links list with each link
    for item in tmp:
        if item['href']:
            links.append(item['href'])

# example: filter only careerbuilder links
careerbuilder_links = [link for link in links if re.search(r'[w]{3}\.careerbuilder\.com', link)]
code:
import urllib.request
import bs4

def WebScrape():
    url = input('Where do you want to scrape from today?: ')
    html = urllib.request.urlopen(url).read()
    soup = bs4.BeautifulSoup(html, "lxml")
    title_tags = soup.findAll('a', {'class': 'title'})
    url_titles = [(tag['href'], tag.text) for tag in title_tags]
    if title_tags:
        print('Retrieving your links...')
        for url_title in url_titles:
            print(*url_title)

WebScrape()
out:
Where do you want to scrape from today?: http://www.businessinsider.com
Retrieving your links...
http://www.businessinsider.com/trump-china-drone-navy-2016-12 Trump slams China's capture of a US Navy drone as 'unprecedented' act
http://www.businessinsider.com/trump-thank-you-rally-alabama-2016-12 'This is truly an exciting time to be alive'
http://www.businessinsider.com/how-smartwatch-pioneer-pebble-lost-everything-2016-12 How the hot startup that stole Apple's thunder wound up in Silicon Valley's graveyard
http://www.businessinsider.com/china-will-return-us-navy-underwater-drone-2016-12 Pentagon: China will return US Navy underwater drone seized in South China Sea
http://www.businessinsider.com/what-google-gets-wrong-about-driverless-cars-2016-12 Here's the biggest thing Google got wrong about self-driving cars
http://www.businessinsider.com/sheriff-joe-arpaio-still-wants-to-investigate-obamas-birth-certificate-2016-12 Sheriff Joe Arpaio still wants to investigate Obama's birth certificate
http://www.businessinsider.com/rents-dropping-in-new-york-bubble-pop-2016-12 Rents are finally dropping in New York City, and a bubble might be about to pop
http://www.businessinsider.com/trump-david-friedman-ambassador-israel-2016-12 Trump's ambassador pick could drastically alter 2 of the thorniest issues in the US-Israel relationship
http://www.businessinsider.com/can-hackers-be-caught-trump-election-russia-2016-12 Why Trump's assertion that hackers can't be caught after an attack is wrong
http://www.businessinsider.com/theres-a-striking-commonality-between-trump-and-nixon-2016-12 There's a striking commonality between Trump and Nixon
http://www.businessinsider.com/tesla-year-in-review-2016-12 Tesla's biggest moments of 2016
http://www.businessinsider.com/heres-why-using-uber-to-fill-public-transportation-gaps-is-a-bad-idea-2016-12 Here's why using Uber to fill public transportation gaps is a bad idea
http://www.businessinsider.com/useful-hard-adopt-early-morning-rituals-productive-exercise-2016-12 4 morning rituals that are hard to adopt but could really pay off
http://www.businessinsider.com/most-expensive-champagne-bottles-money-can-buy-2016-12 The 11 most expensive Champagne bottles money can buy
http://www.businessinsider.com/innovations-in-radiology-2016-11 5 innovations in radiology that could impact everything from the Zika virus to dermatology
http://www.businessinsider.com/ge-healthcare-mr-freelium-technology-2016-11 A new technology is being developed using just 1% of the finite resource needed for traditional MRIs
