Python Scraper for Links

import requests
from bs4 import BeautifulSoup

data = requests.get("http://www.basketball-reference.com/leagues/NBA_2014_games.html")
soup = BeautifulSoup(data.content)
soup.find_all("a")
for link in soup.find_all("a"):
    "<a href='%s'>%s</a>" % (link.get("href=/boxscores"), link.text)
I am trying to get the links for the box scores only, then loop over those individual links and organize their data into a CSV. I need to save the links in a list and run a loop over them, but that is where I am stuck, and I am not sure if this is even the proper way to do it.

The idea is to iterate over all links that have an href attribute (the a[href] CSS selector), then loop over those links and construct an absolute link whenever the href value doesn't start with http. Collect all links into a list of lists and use writerows() to dump them to a CSV:
import csv
from urlparse import urljoin

from bs4 import BeautifulSoup
import requests

base_url = 'http://www.basketball-reference.com'
data = requests.get("http://www.basketball-reference.com/leagues/NBA_2014_games.html")
soup = BeautifulSoup(data.content)

links = [[urljoin(base_url, link['href']) if not link['href'].startswith('http') else link['href']]
         for link in soup.select("a[href]")]

with open('output.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(links)
output.csv now contains:
http://www.sports-reference.com
http://www.baseball-reference.com
http://www.sports-reference.com/cbb/
http://www.pro-football-reference.com
http://www.sports-reference.com/cfb/
http://www.hockey-reference.com/
http://www.sports-reference.com/olympics/
http://www.sports-reference.com/blog/
http://www.sports-reference.com/feedback/
http://www.basketball-reference.com/my/auth.cgi
http://twitter.com/bball_ref
...
It is unclear what your output should be, but this is, at least, something you can use as a starting point.
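Since the question asks for the box score links only, a possible refinement of the same idea is to keep only anchors whose href starts with /boxscores/ before writing them out; the /boxscores/ prefix and the Python 3 urllib.parse import are assumptions here, so adjust them to the actual page and interpreter:
import csv
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 3; on Python 2 use `from urlparse import urljoin`

base_url = 'http://www.basketball-reference.com'
data = requests.get(base_url + '/leagues/NBA_2014_games.html')
soup = BeautifulSoup(data.content, 'html.parser')

# Keep only anchors whose href starts with /boxscores/ (assumed URL pattern for box score pages)
box_links = [[urljoin(base_url, a['href'])]
             for a in soup.select("a[href]")
             if a['href'].startswith('/boxscores/')]

with open('boxscores.csv', 'w', newline='') as f:
    csv.writer(f).writerows(box_links)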


How to create list of urls from csv file to iterate?

I am working on a web scraping script that works fine; now I want to replace the single URL with a CSV file that contains thousands of URLs, like this:
url1
url2
url3
...
urlX
The first lines of my web scraping code are basic:
from bs4 import BeautifulSoup
import requests
from csv import writer
url= "HERE THE URL FROM EACH LINE OF THE CSV FILE"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
How can I tell Python to use the URLs from the CSV? I thought about using a dictionary, but I don't really know how to do that. Does anyone have a solution, please? I know it seems very simple to you, but it would be very useful for me.
If this is just a list of URLs, you don't really need the csv module, but here is a solution assuming the URL is in column 0 of the file. You want a csv reader, not a writer, and then it's a simple case of iterating over the rows and taking action. (A plain-text alternative is sketched after the code.)
from bs4 import BeautifulSoup
import requests
import csv

with open("url-collection.csv", newline="") as fileobj:
    for row in csv.reader(fileobj):
        # TODO: add try/except to handle errors
        url = row[0]
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
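If the file really is just one URL per line with no commas, a minimal sketch without the csv module (urls.txt is a hypothetical filename) would be:
from bs4 import BeautifulSoup
import requests

# urls.txt is assumed to contain one URL per line
with open("urls.txt") as fileobj:
    for line in fileobj:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        # ... scrape whatever you need from `soup` here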

Finding tables returns [] with bs4

I am trying to scrape a table from this url: https://cryptoli.st/lists/fixed-supply
I gather that the table I want is in the div class "dataTables_scroll". I use the following code and it only returns an empty list:
from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = requests.get("https://cryptoli.st/lists/fixed-supply")
soup = bs(url.content, 'lxml')
table = soup.find_all("div", {"class": "dataTables_scroll"})
print(table)
Any help would be most appreciated.
Thanks!
The reason is that the response you get from requests.get() does not contain the table data.
It might be loaded on the client side (by JavaScript).
What can you do about this? Using a Selenium webdriver is one possible solution: you can "wait" until the table is loaded and becomes interactive, get the page source with Selenium, and then pass it to bs4 to do the scraping (a sketch is shown after this answer).
You can check the response by writing it to a file:
with open("demofile.html", "w", encoding='utf-8') as f:
    f.write(soup.prettify())
and you will be able to see "...Loading..." where the table is expected.
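Building on the Selenium suggestion above, here is a minimal sketch; it assumes Selenium 4 with a Chrome driver available, and the wait selector is an assumption based on the dataTables_scroll class from the question:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://cryptoli.st/lists/fixed-supply")
    # Wait (up to 15 s) for the JavaScript-rendered table to appear
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.dataTables_scroll table"))
    )
    soup = BeautifulSoup(driver.page_source, "lxml")
    table = soup.find("div", {"class": "dataTables_scroll"})
    print(table is not None)
finally:
    driver.quit()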
I believe the data is loaded from a script tag. I have to go to work so I can't spend more time working out how to appropriately recreate a dataframe from the "|"-delimited data at present, but the following may serve as a starting point for others, as it extracts the relevant entries from the script tag for the table body.
import re
import ast

import requests

r = requests.get('https://cryptoli.st/lists/fixed-supply').text
# Pull out the JavaScript array assigned to cl.coinmainlist.dataraw in the page's script tag
s = re.search(r'cl\.coinmainlist\.dataraw = (\[.*?\]);', r, flags=re.S).group(1)
data = ast.literal_eval(s)
# Each entry is a "|"-delimited string; split it into fields
data = [i.split('|') for i in data]
print(data)
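As a possible continuation of that starting point, the split rows can be loaded straight into a pandas DataFrame; the column labels are not recovered here, since they would have to be read from the page separately:
import pandas as pd

# `data` is the list of "|"-split rows produced above; the column labels are unknown,
# so pandas assigns default integer column names.
df = pd.DataFrame(data)
print(df.shape)
print(df.head())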

Writing URLs via Beautifulsoup to a csv file vertically

I have a project for one of my college classes that requires me to pull all URLs from a page on the U.S. Census Bureau website and store them in a CSV file. For the most part I've figured out how to do that, but for some reason when the data gets appended to the CSV file, all the entries are inserted horizontally. I would expect the data to be arranged vertically, meaning row 1 has the first item in the list, row 2 has the second item, and so on. I have tried several approaches but the data always ends up as a horizontal representation. I am new to Python and obviously don't have a firm enough grasp on the language to figure this out. Any help would be greatly appreciated.
I am parsing the website using Beautifulsoup4 and the request library. Pulling all the 'a' tags from the website was easy enough and getting the URLs from those 'a' tags into a list was pretty clear as well. But when I append the list to my CSV file with a writerow function, all the data ends up in one row as opposed to one separate row for each URL.
import requests
import csv
from bs4 import BeautifulSoup
from pprint import pprint

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

## Create list to append web data to
links = []

# Pull the href from all instances of the <a> tag
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(links)

pprint(links)
Try this:
import requests
import csv
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

## Create list to append web data to
links = []

# Pull the href from all instances of the <a> tag
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    for link in links:
        if isinstance(link, str):
            f.write(link + "\n")
I changed it to check whether a given link was indeed a string and if so, add a newline after it.
Try making a list of lists by appending each URL inside its own list:
links.append([link.get('href')])
Then the csv writer will put each one-element list on its own row with writerows:
writer.writerows(links)
A complete version of this approach is sketched below.
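Putting that suggestion together, here is a minimal sketch; the newline="" argument to open() is an addition here to keep csv from emitting blank lines between rows on Windows:
import csv
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# One one-element list per URL, so writerows() puts each href on its own row
links = [[a.get('href')] for a in soup.find_all('a') if a.get('href')]

with open("htmlTable.csv", "w", newline="") as f:
    csv.writer(f).writerows(links)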

Python Data Scraping: Scraping through the title with a series of href and prettify doesn't work

I am a newbie in Python, and my first attempt is to do some web scraping of a random site. This is my code, and I am confused about how to work around the problem.
I am scraping for the title and the size of each episode, but the row has two href attributes and prettify doesn't work.
This is the code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://1337x.to/popular-tv').text
soup = BeautifulSoup(source, 'lxml')
tvhead = soup.find('tbody')
filename = tvhead.tr.find_all('td',class_='coll-1 name')
print(filename)
Now I want to scrape the title and the file size of each episode, then loop over all of them on that page, and I am confused. Please help.
Before this, I was able to get just the title with this code:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://1337x.to/popular-tv').text
soup = BeautifulSoup(source, 'lxml')

for tvtitle in soup.find_all('td', class_='coll-1 name'):
    a = tvtitle.find_all('a')[1].text
    print(a)
    print()
If I understand your question correctly you are probably trying to achieve something like this:
filenames = tvhead.find_all('td', class_='coll-1 name')
for filename in filenames:
    print(filename.get_text())
Keep in mind that when you use .tr with BeautifulSoup, it gives you only the first tr, similar to what find (as opposed to find_all) would do.
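If the goal is the title and the size together, one possible approach is to walk the rows of the table body and read both cells from each row. The coll-4 class for the size cell is an assumption based on the site's usual column classes, so check the actual markup before relying on it:
from bs4 import BeautifulSoup
import requests

source = requests.get('https://1337x.to/popular-tv').text
soup = BeautifulSoup(source, 'lxml')

for row in soup.select('tbody tr'):
    name_cell = row.find('td', class_='coll-1')
    size_cell = row.find('td', class_='coll-4')  # assumed class for the size column
    if not name_cell or not size_cell:
        continue
    anchors = name_cell.find_all('a')
    if len(anchors) < 2:
        continue
    title = anchors[1].text  # the second <a> carries the visible title
    size = size_cell.get_text(strip=True)  # may include nested text such as a seeder count
    print(title, '-', size)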

How to collect a continuous set of webpages using python?

https://example.net/users/x
Here, x is a number that ranges from 1 to 200000. I want to run a loop to get all the URLs and extract contents from every URL using beautiful soup.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
content = urlopen(re.compile(r"https://example.net/users/[0-9]//"))
soup = BeautifulSoup(content)
Is this the right approach? I have to perform two things.
Get a continuous set of URLs
Extract & store retrieved contents from every page/URL.
UPDATE:
I have to get only one particular value from each of the webpages.
soup = BeautifulSoup(content)
divTag = soup.find_all("div", {"class": "classname"})
for tag in divTag:
    ulTags = tag.find_all("ul", {"class": "classname"})
    for tag in ulTags:
        aTags = tag.find_all("a", {"class": "classname"})
        for tag in aTags:
            name = tag.find('img')['alt']
            print(name)
You could try this:
import urllib2
import shutil

urls = []
for i in range(10):
    urls.append('https://www.example.org/users/' + str(i))

def getUrl(urls):
    for url in urls:
        # Build a file name based only on the url string
        file_name = url.replace('https://', '').replace('.', '_').replace('/', '_')
        response = urllib2.urlopen(url)
        with open(file_name, 'wb') as out_file:
            shutil.copyfileobj(response, out_file)

getUrl(urls)
If you just need the contents of a web page, you could probably use lxml, with which you could parse the content. Something like:
import requests
from lxml import etree

r = requests.get('https://example.net/users/x')
# etree.HTML() tolerates real-world HTML better than etree.fromstring()
dom = etree.HTML(r.text)
# parse something
title = dom.xpath('//h1[@class="title"]')[0].text
Additionally, if you are scraping tens or hundreds of thousands of pages, you might want to look into something like grequests, which lets you make multiple asynchronous HTTP requests; a sketch follows.
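A minimal grequests sketch, assuming the same example.net URL pattern from the question (grequests is a separate install and monkey-patches via gevent, so import it early):
import grequests  # pip install grequests

# Build unsent requests for a small range of user pages (1..200000 in the real case)
urls = ['https://example.net/users/{}'.format(i) for i in range(1, 11)]
unsent = (grequests.get(u) for u in urls)

# Send them concurrently, at most 10 at a time
for response in grequests.map(unsent, size=10):
    if response is not None and response.ok:
        print(response.url, len(response.text))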
