Web Scraping & CSV making - Python

I am doing a project that I am a little stuck on. I am trying to web scrape a website to grab all of the cast and their characters.
I successfully grab the cast and characters from the wiki page and get it to print out their personal Wikipedia pages.
My question is: is it possible to have the program visit each of those Wikipedia pages by using a loop?
Next, I would want to use that loop to scrape each actor's/actress's personal Wikipedia page for their age and where they are from.
After the full code works and outputs what I have described, I need help creating a CSV file and writing the output into it.
Thanks in advance.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://en.wikipedia.org/wiki/Law_%26_Order:_Special_Victims_Unit'
html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')
The above code imports the libraries I am going to use and has the url I am trying to scrape:
right_table = bs.find('table', class_='wikitable plainrowheaders')
table_rows = right_table.find_all('td')
for row in table_rows:
    try:
        if row.a['href'].startswith('/wiki'):
            print(row.get_text())
            print(row.a['href'])
            print('-------------------------')
    except TypeError:
        pass
This is before I added the print statements below. I think there is a way to create a list of the "/wiki/..." links it prints and then loop over that list.
right_table = bs.find('table', class_='wikitable plainrowheaders')
table_rows = right_table.find_all('td')
for row in table_rows:
    try:
        if row.a['href'].startswith('/wiki'):
            print(row.get_text())
            # the href already starts with '/wiki', so join it to the bare domain
            link = 'https://en.wikipedia.org' + row.a['href']
            print(link)
            print('-------------------------')
    except TypeError:
        pass
The above code currently prints the cast and their allocated Wikipedia pages, but I am unsure if that's the right way to do it. I am also printing my results first, before putting it all into the CSV, to check that I am extracting the right data.
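One way to collect those links into a list instead of only printing them is a small variation of the same loop (a sketch; url_list is a name introduced here, matching the loop at the end of the question):

url_list = []
for row in table_rows:
    try:
        if row.a['href'].startswith('/wiki'):
            # store the absolute URL instead of printing it
            url_list.append('https://en.wikipedia.org' + row.a['href'])
    except TypeError:
        pass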
def age_finder(url):
    ...
    return age
In the code above, I am unsure what to put where the "..." is so that it returns the age.
for url in url_list:
    age_finder(url)
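For the body of age_finder, a minimal sketch could fetch each actor's page and read the birth date from the infobox. This assumes the infobox marks the birth date with a span of class "bday" (the microformat Wikipedia infoboxes commonly use); pages without that span return None here:

from datetime import date

def age_finder(url):
    # fetch and parse the actor's personal Wikipedia page
    html = urlopen(url)
    bs = BeautifulSoup(html, 'html.parser')
    # assumption: the infobox exposes the birth date as <span class="bday">YYYY-MM-DD</span>
    bday = bs.find('span', class_='bday')
    if bday is None:
        return None
    born = date.fromisoformat(bday.get_text())
    today = date.today()
    # whole years elapsed since the birth date
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

For the CSV step, the built-in csv module can then write one row per actor once url_list is filled (the file name and columns here are only illustrative):

import csv

with open('cast.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['wikipedia_page', 'age'])
    for url in url_list:
        writer.writerow([url, age_finder(url)])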

Related

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the code below, which gets a large amount of text in which index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and print "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
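For completeness, a runnable version of that one-liner might look like this (a sketch reusing the same requests/BeautifulSoup setup as the answer above; the CSS selector is taken from the line just shown):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx').text
soup = BeautifulSoup(html, 'html.parser')
# select_one returns the first matching element or None
print(soup.select_one('div.field-redshift > div.value > b').text)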
If you view the page source of the URL, you will find that there are two script elements that contain CDATA. The script element you are interested in, however, also has jQuery in it, so you have to select the script element based on that. After that, you need to do some cleaning to get rid of the CDATA tags and the jQuery wrapper. Then, with the help of the json library, convert the JSON data into a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json

page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

how to scrape multiple pages in python with bs4

I have a query: I have been scraping the website "https://www.zaubacorp.com/company-list" but I am not able to scrape the email id from the link given in the table, although I need to scrape Name, Email and Directors from the link in the given table. Can anyone please resolve my issue, as I am a newbie to web scraping using Python with Beautiful Soup and requests?
Thank You
Dieksha
#Scraping the website
#Import a library to query a website
import requests
#Specify the URL
companies_list = "https://www.zaubacorp.com/company-list"
link = requests.get("https://www.zaubacorp.com/company-list").text
#Import BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(link, 'lxml')
soup.table.find_all('a')
all_links = soup.table.find_all('a')
for link in all_links:
    print(link.get("href"))
Well let's break down the website and see what we can do.
First off, I can see that this website is paginated. This means we could be dealing with anything from something as simple as the site using part of the GET query string to determine which page we are requesting, up to an AJAX call that fills the table with new data when you click next. From clicking on the next page and subsequent pages, we are in luck: the website uses a GET query parameter.
Our URL for requesting the webpage to scrape is going to be
https://www.zaubacorp.com/company-list/p-<page_num>-company.html
We are going to write a bit of code that will fill that page num with values ranging from 1 to the last page you want to scrape. In this case, we do not need to do anything special to determine the last page of the table since we can skip to the end and find that it will be page 13,333. This means that we will be making 13,333 page requests to this website to fully collect all of its data.
As for gathering the data from the website we will need to find the table that holds the information and then iteratively select the elements to pull out the information.
In this case we can actually "cheat" a little, since there appears to be only a single tbody on the page. We want to iterate over all the <tr> elements and pull out the text of each cell. I'm going to go ahead and write the sample.
import requests
import bs4

def get_url(page_num):
    # build the paginated URL for the given page number
    page_num = str(page_num)
    return "https://www.zaubacorp.com/company-list/p-" + page_num + "-company.html"

def scrape_row(tr):
    return [td.text for td in tr.find_all("td")]

def scrape_table(table):
    table_data = []
    for tr in table.find_all("tr"):
        table_data.append(scrape_row(tr))
    return table_data

def scrape_page(page_num):
    req = requests.get(get_url(page_num))
    soup = bs4.BeautifulSoup(req.content, "lxml")
    data = scrape_table(soup)
    for line in data:
        print(line)

for i in range(1, 3):
    scrape_page(i)
This code will scrape the first two pages of the website and by just changing the for loop range you can get all 13,333 pages. From here you should be able to just modify the printout logic to save to a CSV.
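For the CSV step, one sketch of how the printout logic might be swapped for Python's built-in csv module, reusing get_url and scrape_table from the code above (the file name and the choice to skip empty rows are mine):

import csv

def scrape_to_csv(first_page, last_page, filename="companies.csv"):
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        for page_num in range(first_page, last_page + 1):
            req = requests.get(get_url(page_num))
            soup = bs4.BeautifulSoup(req.content, "lxml")
            for row in scrape_table(soup):
                if row:  # skip header rows, which contain no td cells
                    writer.writerow(row)

scrape_to_csv(1, 2)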

Scrape texts from multiple websites and save separately in text files

I am a beginner in Python and have been using it for my master's thesis to conduct textual analysis of the gaming industry. I have been trying to scrape reviews from several gaming critic sites.
I used a list of URLs in the code to scrape the reviews and was successful. Unfortunately, I could not write each review to a separate file: as I write the files, I either get only the review from the last URL in the list in all of the files, or all of the reviews in all of the files after changing the indentation. My code follows. Could you kindly suggest what's wrong here?
from bs4 import BeautifulSoup
import requests

urls = ['http://www.playstationlifestyle.net/2018/05/08/ao-international-tennis-review/#/slide/1',
        'http://www.playstationlifestyle.net/2018/03/27/atelier-lydie-and-suelle-review/#/slide/1',
        'http://www.playstationlifestyle.net/2018/03/15/attack-on-titan-2-review-from-a-different-perspective-ps4/#/slide/1']

for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')

for i in range(len(urls)):
    file = open('filename%i.txt' % i, 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
I think you only need one for loop. If I understand correctly, you only want to iterate through urls and store an individual file for each.
Therefore, I would suggest removing the second for statement. You do, though, then need to modify the for url in urls loop so that it also gives you a unique index for the current URL to use for i, and you can use enumerate for that.
Your single for statement would become:
for i, url in enumerate(urls):
I've not tested this myself but this is what I believe should resolve your issue.
I can tell you are a beginner in Python, so I'll post the corrected code first and then explain it.
for i, url in enumerate(urls):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
    file = open('filename{}.txt'.format(i), 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
The reason why you receive only the review from the last URL in all the files: one variable holds only one value, so after the for loop finishes you are left with just the last result (the third one). The results for the first and second URLs have been overwritten:
for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')

Loop over different webpages using Python

I am currently following a course in Big Data but do not understand much of it. For an assignment, I would like to find out which topics are discussed on the TripAdvisor-forum about Amsterdam. I want to create a CSV-file including the topic, the author and the amount of replies per topic. Some questions:
How can I make a list of all the topics? I checked the website source for all the pages, and the topic is always stated after 'onclick="setPID(34603)' and ends with </a>. I tried re.findall(r'onclick="setPID(34603)">(.*?)</a>', post) but it's not working.
The replies are not given in the comment section but in a separate row on the page. How can I make a loop and append all the replies to a new variable?
How do I loop over the first 20 pages? The URL in my code only includes the 1st page, giving 20 topics.
Do I create the CSV file before or after the looping?
Here is my code:
from urllib import request
import re
import csv

topiclist = []
metalist = []
req = request.Request('https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html',
                      headers={'User-Agent': "Mozilla/5.0"})
tekst = request.urlopen(req).read()
tekst = tekst.decode(encoding="utf-8", errors="ignore").replace("\n", " ").replace("\t", " ")
topicsection = re.findall(r'<b><a(.*?)</div>', tekst)
topic = []
for post in topicsection:
    topic.append(re.findall(r'onclick="setPID(34603)">(.*?)</a>', post))
author = []
for post in topicsection:
    author.append(re.findall(r'(.*?)', post))
replies = re.findall(r'<td class="reply rowentry.*?">(.*?)</td>', tekst)
Don't use regular expressions to parse HTML. Use an HTML parser such as BeautifulSoup.
e.g -
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.tripadvisor.com/ShowForum-g188590-i60-Amsterdam_North_Holland_Province.html")
soup = BeautifulSoup(r.content, "html.parser") #or another parser such as lxml
topics = soup.find_all("a", {'onclick': 'setPID(34603)'})
#do stuff
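To address the remaining questions (looping over the first 20 pages and writing a CSV), a hedged sketch follows. The "o<offset>-" segment in the URL is an assumption about how TripAdvisor paginates this forum (check the link behind the "next page" button to confirm), and the cell positions used for author and reply count are likewise assumptions about the forum table's layout. Note that the CSV file is opened once, before the loop, and one row is written per topic:

from bs4 import BeautifulSoup
import requests
import csv

base = ("https://www.tripadvisor.com/ShowForum-g188590-i60-{}"
        "Amsterdam_North_Holland_Province.html")

with open("topics.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "author", "replies"])
    for page in range(20):
        # assumption: pages after the first insert an "o<offset>-" segment, 20 topics per page
        offset = "" if page == 0 else "o{}-".format(page * 20)
        r = requests.get(base.format(offset), headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(r.content, "html.parser")
        for link in soup.find_all("a", {"onclick": "setPID(34603)"}):
            row = link.find_parent("tr")  # the table row that holds this topic
            cells = row.find_all("td") if row else []
            # assumption: author and reply count live in later cells of the same row
            author = cells[2].get_text(strip=True) if len(cells) > 2 else ""
            replies = cells[3].get_text(strip=True) if len(cells) > 3 else ""
            writer.writerow([link.get_text(strip=True), author, replies])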

BeautifulSoup's "find" acting inconsistently (bs4)

I'm scraping the NFL's website for player statistics. I'm having an issue when parsing the web page and trying to get to the HTML table which contains the actual information I'm looking for. I successfully downloaded the page and saved it into the directory I'm working in. For reference, the page I've saved can be found here.
# import relevant libraries
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("1998.html"))
result = soup.find(id="result")
print result
I found that at one point, I ran the code and result printed the correct table I was looking for. Every other time, it doesn't contain anything! I'm assuming this is user error, but I can't figure out what I'm missing. Using "lxml" returned nothing, and I can't get html5lib to work (is that the parsing library?).
Any help is appreciated!
First, you should read the contents of your file before passing it to BeautifulSoup.
soup = BeautifulSoup(open("1998.html").read())
Second, verify manually that the table in question exists in the HTML by printing the contents to screen. The .prettify() method makes the data easier to read.
print soup.prettify()
Lastly, if the element does in fact exist, the following will be able to find it:
table = soup.find('table',{'id':'result'})
A simple test script I wrote cannot reproduce your results.
import urllib
from bs4 import BeautifulSoup

def test():
    # The URL of the page you're scraping.
    url = 'http://www.nfl.com/stats/categorystats?tabSeq=0&statisticCategory=PASSING&conference=null&season=1998&seasonType=REG&d-447263-s=PASSING_YARDS&d-447263-o=2&d-447263-n=1'
    # Make a request to the URL.
    conn = urllib.urlopen(url)
    # Read the contents of the response.
    html = conn.read()
    # Close the connection.
    conn.close()
    # Create a BeautifulSoup object and find the table.
    soup = BeautifulSoup(html)
    table = soup.find('table', {'id': 'result'})
    # Find all rows in the table.
    trs = table.findAll('tr')
    # Print to screen the number of rows found in the table.
    print len(trs)
This outputs 51 every time.
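As a side note, since the question already imports pandas: once the table has been located, pandas.read_html can turn it into a DataFrame directly. A Python 3 sketch against the saved file (the output file name is mine; read_html needs lxml or html5lib installed):

from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup

with open("1998.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

table = soup.find("table", {"id": "result"})
# read_html parses the table markup into a list of DataFrames; take the first
df = pd.read_html(StringIO(str(table)))[0]
df.to_csv("passing_1998.csv", index=False)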
