I am a beginner in Python, and I have been using it for my master's thesis to conduct textual analysis of the gaming industry. I have been trying to scrape reviews from several gaming critic sites.
I used a list of URLs in the code to scrape the reviews, and that part works. Unfortunately, I cannot write each review to a separate file: either every file ends up containing only the review from the last URL in the list, or, after changing the indentation, every file contains all of the reviews. Here is my code. Could you kindly suggest what's wrong with it?
from bs4 import BeautifulSoup
import requests
urls= ['http://www.playstationlifestyle.net/2018/05/08/ao-international-tennis-review/#/slide/1',
'http://www.playstationlifestyle.net/2018/03/27/atelier-lydie-and-suelle-review/#/slide/1',
'http://www.playstationlifestyle.net/2018/03/15/attack-on-titan-2-review-from-a-different-perspective-ps4/#/slide/1']
for url in urls:
    r=requests.get(url).text
    soup= BeautifulSoup(r, 'lxml')

for i in range(len(urls)):
    file=open('filename%i.txt' %i, 'w')
    for article_body in soup.find_all('p'):
        body=article_body.text
        file.write(body)
    file.close()
I think you only need one for loop. If I understand correctly, you only want to iterate through urls and store an individual file for each.
Therefore, I would suggest removing the second for statement. You do, however, need to modify for url in urls so that it also gives you a unique index for the current url to use as i, and enumerate does exactly that.
Your single for statement would become:
for i, url in enumerate(urls):
I've not tested this myself but this is what I believe should resolve your issue.
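For reference (a tiny illustration, not part of the original code), enumerate pairs each item with its running index, which is what provides the i used in the filename:

urls = ['a.html', 'b.html', 'c.html']
for i, url in enumerate(urls):
    print(i, url)   # prints 0 a.html, then 1 b.html, then 2 c.html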
I can tell you are a beginner in Python, so I will post the corrected code first and then explain it.
for i, url in enumerate(urls):
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
    file = open('filename{}.txt'.format(i), 'w')
    for article_body in soup.find_all('p'):
        body = article_body.text
        file.write(body)
    file.close()
The reason you receive only the review from the last URL in every file: one variable holds one value at a time, so after the first for loop finishes you are left with only the last result (the third one). The results of the first and second iterations have been overwritten:
for url in urls:
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'lxml')
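A minimal illustration of that overwriting (plain Python, no scraping involved):

values = ['first', 'second', 'third']
for v in values:
    result = v.upper()   # result is rebound on every pass through the loop

print(result)   # prints 'THIRD' -- only the last iteration's value survives

That is why the fetching, parsing, and file writing all have to happen inside the same loop, as in the corrected code above.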
I am trying to get the links to the individual search results on a website (the National Gallery of Art), but the search URL I request does not load the search results. Here is how I try to do it:
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
I expected the links to the individual results to show up under soup.findAll('a'), but they do not appear; instead, the last link in the output points to an empty search result:
https://www.nga.gov/content/ngaweb/collection-search-result.html
How could I get a list of links, the first of which is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html) etc?
The data is actually generated by an API call that returns a JSON response. Here is the desired list of links.
Code:
import requests
import json
url= 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)
for item in r.json()['results']:
    url = item['url']
    abs_url = f'https://www.nga.gov{url}'
    print(abs_url)
Output:
https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
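A side note (an inference from the endpoint itself, not something stated in the answer): the pageSize__30.pageNumber__1 segment of the URL suggests you can fetch further results by changing the page number, along these lines (untested; the trailing &_=... timestamp is dropped on the assumption it is only a cache-buster):

import requests

base = ('https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/'
        'parList/collectionsearchresu.pageSize__30.pageNumber__{}.json'
        '?artist=C%C3%A9zanne%2C%20Paul')

for page in range(1, 4):   # first three pages, assuming the same URL pattern holds
    r = requests.get(base.format(page))
    for item in r.json()['results']:
        print(f"https://www.nga.gov{item['url']}")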
This seems to work for me:
from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a'):
    print(a['href'])
It returns the href of every a tag in the HTML.
For the links from the search results specifically, those are loaded via AJAX, so you would need something that renders the JavaScript, such as headless Chrome. You can read about one way to implement this here, which fits your use case very closely: http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
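For illustration only (not from the linked article and untested here), a minimal sketch of that idea using Selenium with headless Chrome, assuming the selenium package and a matching ChromeDriver are installed:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")           # run Chrome without opening a window
driver = webdriver.Chrome(options=options)   # assumes ChromeDriver is on your PATH

driver.get('https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul')
# an explicit wait (or a short time.sleep) may be needed so the AJAX results finish loading
html = driver.page_source                    # the HTML after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])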
If you want to ask how to render javascript from python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.
I am trying to scrape a website that has 12 pages with X links on each - I just want to extract all the links and store them for later use.
But there is an awkward problem with extracting links from the pages. To be precise, my output contains only the last link from each of the pages.
I know that this description may sound confusing, so let me show you the code and images:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
    paper_urls.append(ahrefTags)
    print(paper_urls)
    time.sleep(5)
But the problem is that my output looks like [redacted]. Instead of ~80 links, I'm getting only a handful! It looks like my script collects only the last listed link from every generated URL (from the list named "issues" in the code). How can I fix it? I have no idea what the problem could be.
Were you perhaps missing an indentation when appending to paper_urls?
paper_urls = []
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags) # added missing indentation
    print(paper_urls)
    time.sleep(5)
The whole code, after moving the print outside the loop, would look like this:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
import time
#here I tried to make a loop for generating page's URLs, and store URLs in the list "issues"
archive = '[redacted URL]'
issues =[]
#i am going for issues 163-175
for i in range(163,175):
    url_of_issue = archive + '/' + str(i)
    issues.append(url_of_issue)
#now, I want to extract links from generated pages
#idea is simple - loop iterates over the list of URLs/pages and from each issue page get URLS of the listed papers, storing them in the list "paper_urls"
paper_urls =[]
for url in issues:
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html, "html.parser")
    for a in soup.select('.obj_article_summary .title a'):
        ahrefTags=(a['href'])
        paper_urls.append(ahrefTags)
        #print(ahrefTags) #uncomment if you wish to print each and every link by itself
    #time.sleep(5) #uncomment if you wish to add a delay between each request

print(paper_urls)
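If you then want to store the collected links for later use, as the question mentions, one simple way (an addition, not part of the original answer) is to write them to a text file afterwards:

with open("paper_urls.txt", "w") as f:
    f.write("\n".join(paper_urls))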
I'm trying to scrape the synonyms of words from Thesaurus.com (using bs4 and requests). An example page: https://www.thesaurus.com/browse/put?s=t
The problem I'm running into is that there are three different tabs for different meanings of the word, and each one has a different list of synonyms. I want to scrape them all, but I'm not sure how to access the synonyms listed in the non-active tabs. When I print the soup'd HTML and Ctrl+F through it, the synonyms from the non-primary tabs don't seem to be in the HTML at all, which seems to indicate that I cannot simply scrape the page once. Anyway, what's the easiest way to get these other synonyms?
EDIT:
As requested by a comment, here is the code I currently use to scrape a word. This code works perfectly, as I said; I just can't figure out how to scrape from the other tabs on the page.
import requests
from bs4 import BeautifulSoup
results = {}
def scrape_word(word):
    url = baseURL + word
    response = requests.get(url, timeout=5)
    if response.status_code == 404:
        results[word] = []
        return
    soup = BeautifulSoup(response.content, 'html.parser')
    # Look only at the main content div or else you'll get other undesired tags
    soup = soup.find('div', attrs={'class': 'css-1kc5m8x e1qo4u830'})
    # The below variable is the class used by Thesaurus.com for synonyms
    word_class = 'css-133coio etbu2a32'
    # Structure of site is spans hold links which have text of the word
    spanList = soup.find_all('span', attrs={'class': word_class})
    synList = [span.find('a').text for span in spanList]
    results[word] = synList
EDIT 2: A comment by cddt has given me an alternative route to the data I want, but I'm still curious how I could have solved the problem without that approach, if anyone knows.
I have a project for one of my college classes that requires me to pull all URLs from a page on the U.S. Census Bureau website and store them in a CSV file. For the most part I've figured out how to do that, but for some reason, when the data gets written to the CSV file, all the entries are inserted horizontally. I would expect the data to be arranged vertically, meaning row 1 has the first item in the list, row 2 has the second item, and so on. I have tried several approaches but the data always ends up as a horizontal representation. I am new to Python and obviously don't have a firm enough grasp of the language to figure this out. Any help would be greatly appreciated.
I am parsing the website using BeautifulSoup4 and the requests library. Pulling all the 'a' tags from the website was easy enough, and getting the URLs from those 'a' tags into a list was straightforward as well. But when I write the list to my CSV file with the writerow function, all the data ends up in one row rather than one separate row for each URL.
import requests
import csv
from bs4 import BeautifulSoup
from pprint import pprint
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')
## Create Link to append web data to
links = []
# Pull text from all instances of <a> tag within BodyText div
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))
with open("htmlTable.csv", "w") as f:
writer = csv.writer(f)
writer.writerow(links)
pprint(links)
Try this:
import requests
import csv
from bs4 import BeautifulSoup
page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')
## Create Link to append web data to
links = []
# Pull text from all instances of <a> tag within BodyText div
AllLinks = soup.find_all('a')
for link in AllLinks:
    links.append(link.get('href'))

with open("htmlTable.csv", "w") as f:
    writer = csv.writer(f)
    for link in links:
        if isinstance(link, str):
            f.write(link + "\n")
I changed it to check whether a given link was indeed a string and if so, add a newline after it.
Try making a list of lists, by appending the url inside a list
links.append([link.get('href')])
Then the csv writer will put each list on a new line with writerows
writer.writerows(links)
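Putting that together, a minimal end-to-end sketch of the writerows approach (untested here; newline="" is passed so the csv module does not insert blank lines between rows on Windows):

import csv
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.census.gov/programs-surveys/popest.html')
soup = BeautifulSoup(page.text, 'html.parser')

# wrap each href in its own one-element list so every URL becomes its own CSV row
links = [[a.get('href')] for a in soup.find_all('a') if a.get('href')]

with open("htmlTable.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerows(links)   # one row per inner list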
How can I use Beautiful Soup and requests to extract the full text of an article from a website that splits each article across several pages?
for example, this website
http://www.pagebypagebooks.com/F_Scott_Fitzgerald/The_Lees_Of_Happiness/Authors_Note_p1.html
thank you!
Here is a snippet to help you:
import os
import requests

TOC = ["Chapter_I_p%d.html", "Chapter_III_p%d.html", ...]

# get all pages for Chapter I
for i in range(1, 10):  # 10 because I noticed it on the site
    url = "http://www.pagebypagebooks.com/F_Scott_Fitzgerald/The_Lees_Of_Happiness/Chapter_I_p%d.html" % i
    r = requests.get(url)
    if r.status_code not in (200, 201):
        continue
    content = r.content
    with open(os.path.split(url)[-1], "wb") as fin:
        # write the whole content.
        # bs4 can be used here to select 'p' tags...
        fin.write(content)
Note: for each chapter you can loop over the pages as described above.
My approach to scraping paginated content is to identify the pattern between the URLs (here the page number), or to use a bs4 selector to find the link to the next page and fetch it.
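Following the comment in the snippet above, a rough sketch (untested; assuming the story text lives in ordinary p tags, which is an assumption about the page markup) of pulling just the paragraph text with bs4 instead of saving the raw HTML:

import requests
from bs4 import BeautifulSoup

url = "http://www.pagebypagebooks.com/F_Scott_Fitzgerald/The_Lees_Of_Happiness/Authors_Note_p1.html"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")

# join the text of every <p> tag on the page into one block of text
text = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
print(text)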