I have an Apple Music playlist link, as seen here, and I want to get all the song names and write them to a file.
This is what I have tried:
for i in soup.findAll('div', {'class': 'song-name typography-body-tall'}):
    with open("playlist.txt", "w") as f:
        f.write(i)
But nothing is written to the file. Please can I get some help with this? Thanks in advance.
Firstly, make sure you are actually fetching the page:
import requests
import sys
from bs4 import BeautifulSoup
a = requests.get("https://music.apple.com/gb/playlist/eminem-essentials/pl.9200aa618dc24867b2aa7f00466fd404")
soup = BeautifulSoup(a.text,features="html.parser")
Then collect the songs:
songs = soup.findAll('div', {'class': 'song-name typography-body-tall'})
And finally, loop through the results and write them into a file:
song = [song.get_text() for song in songs if song]
original_stdout = sys.stdout
with open('playlist.txt', 'w') as f:
    sys.stdout = f  # redirect print() output into the file
    for idx in range(len(song)):
        print(f'{song[idx]}')
sys.stdout = original_stdout  # restore normal printing
Make sure to import everything I have; importing sys is needed to redirect the print output into the file.
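As a side note, the sys.stdout redirection above works, but you can get the same result by writing to the file handle directly, with no sys import at all. A minimal sketch, using a hypothetical list standing in for the scraped song names:

```python
# Hypothetical list standing in for the scraped song names
songs = ["Lose Yourself", "Stan", "Mockingbird"]

with open('playlist.txt', 'w') as f:
    for name in songs:
        f.write(name + '\n')  # one song per line, same result as the print loop
```

This avoids the risk of leaving sys.stdout pointed at a closed file if an exception occurs mid-loop.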
Besides Beautiful Soup, if you want to scrape the content of a website in more detail, one of the best libraries for the job is Scrapy. An easy way for Scrapy to extract content from a website is with XPath selectors.
Here's the scrapy documentation: https://docs.scrapy.org/en/latest/
XPath tutorial for scraping meta content with Scrapy: https://linuxhint.com/scrapy-with-xpath-selectors/
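To give a flavour of XPath selection without pulling in Scrapy itself, here is a minimal sketch using the standard library's xml.etree.ElementTree, which supports a limited XPath subset (Scrapy's selectors are far more complete, and the HTML snippet here is a made-up stand-in for a scraped page):

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page
doc = ET.fromstring(
    "<html><body>"
    "<div class='song'>Lose Yourself</div>"
    "<div class='song'>Stan</div>"
    "</body></html>"
)

# Limited-XPath query: every div with class='song', anywhere in the tree
names = [div.text for div in doc.findall(".//div[@class='song']")]
print(names)  # → ['Lose Yourself', 'Stan']
```

Real pages are rarely well-formed XML, which is why Scrapy (or lxml) is the better tool for production scraping.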
Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of the files from the maths revision site would be appreciated.
Thanks.
Looking at the page itself: while it may look static, it isn't. The content you are trying to access is gated behind some fancy JavaScript loading. What I've done to assess that is simply save the page BS4 actually got and open it in a text editor:
with open(folder_location + r"\page.html", 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for solutions to your problem: BS4 is not able to execute JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do more complex web scraping.
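A quick way to confirm this kind of problem without opening an editor is to check whether the raw HTML actually contains any of the links you expect. A minimal sketch with the standard library's html.parser; the raw_html string here is a made-up stand-in for response.text from a JS-rendered page:

```python
from html.parser import HTMLParser

class PdfLinkFinder(HTMLParser):
    """Collect href attributes ending in .pdf from raw (pre-JavaScript) HTML."""
    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.endswith(".pdf"):
                    self.pdf_links.append(value)

# Stand-in for response.text; a JS-rendered page shows no .pdf hrefs at all
raw_html = "<html><body><div id='app'><!-- filled in by JS --></div></body></html>"
finder = PdfLinkFinder()
finder.feed(raw_html)
print(finder.pdf_links)  # → [] : the links never reached requests
```

If the list comes back empty while the links are visible in your browser, the content is being injected client-side and you need a JS-capable tool.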
I know that this is a repeated question; however, from all the answers on the web I could not find the solution, as all of them throw errors.
I am simply trying to scrape headers from the web and save them to a txt file.
The scraping code works well; however, it only saves the last string, skipping all the headers before it.
I have tried looping, putting the writing code before the scraping, appending to a list, and different scraping methods, but they all have the same issue.
Please help.
Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip() + "\n")  # write a newline after each headline
Here is the full working code, corrected as per the advice:
from bs4 import BeautifulSoup
import requests

def nytscrap():
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip() + "\n")
This code will throw an error in Jupyter when opening the file; however, when the file is opened outside Jupyter, the headers are saved correctly...
I am working on a web scraping project which involves scraping URLs from a website based on a search term, storing them in a CSV file(under a single column) and finally scraping the information from these links and storing them in a text file.
I am currently stuck with 2 issues.
1. Only the first few links are scraped. I'm unable to extract links from other pages (the website contains a "load more" button). I don't know how to use the XHR object in the code.
2. The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and f.seek(0).
from pprint import pprint
import requests
import lxml
import csv
import urllib2
from bs4 import BeautifulSoup
def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content, "lxml")
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
    results = soup.findAll('a', {'rel': 'bookmark'})
    for r in results:
        if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
            newlinks.append(r["href"])

pprint(get_url_for_search_key('digital advertising'))

with open('ctp_output.csv', 'w+') as f:
    f.write('\n'.join(get_url_for_search_key('digital advertising')))
    f.seek(0)
# Reading the CSV file, scraping the respective content and storing it in a .txt file
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))

with open('ctp_output.txt', 'w+') as f2:
    for tag in soup.find_all('p'):
        f2.write(tag.text.encode('utf-8') + '\n')
Regarding your second problem, your mode is off. You'll need to convert w+ to a+. In addition, your indentation is off.
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
The + suffix will create the file if it doesn't exist. However, w+ will erase all existing contents every time the file is opened. a+, on the other hand, will append to a file if it exists, or create it if it does not.
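The difference is easy to demonstrate with a throwaway file (a minimal sketch; the filename is arbitrary):

```python
# 'w' truncates on every open; 'a' appends to whatever is already there
with open('mode_demo.txt', 'w') as f:
    f.write('first\n')
with open('mode_demo.txt', 'w') as f:   # truncates: 'first' is gone
    f.write('second\n')
with open('mode_demo.txt', 'a') as f:   # appends after 'second'
    f.write('third\n')

with open('mode_demo.txt') as f:
    print(f.read())  # → second\nthird\n
```

This is exactly why opening with 'w' inside a loop leaves you with only the last iteration's output.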
For your first problem, there's no option but to switch to something that can automate clicking browser buttons and the like. You'd have to look at Selenium. The alternative is to manually find that button, extract the URL from its href or text, and then make a second request. I leave that to you.
If there are more pages of results, observe what changes in the URL when you manually click through to the next page of results.
I can almost guarantee that a small piece of the URL will have either a subpage number or some other variable encoded in it that relates directly to the subpage.
Once you have figured out the pattern, fit it into a for loop where you .format() the page number into the URL you want to scrape, and navigate through all the subpages of the results that way.
As for the last subpage number: you have to inspect the HTML of the site you are scraping, find the variable responsible for it, and extract its value. See if there is "class": "Page" or an equivalent in their code; it may contain the number you will need for your for loop.
Unfortunately there is no magic navigate-through-subresults option...
But this gets pretty close :).
Good luck.
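As a concrete sketch of that idea (the URL pattern and page count here are made up; you would substitute whatever your target site actually uses):

```python
# Hypothetical paginated search URL with a {} placeholder for the page number
base_url = "http://www.example.com/results?s=digital+advertising&page={}"
last_page = 3  # in practice, scraped from the pagination element of the first page

page_urls = [base_url.format(page) for page in range(1, last_page + 1)]
for page_url in page_urls:
    # each of these would be fetched with requests.get(page_url) and parsed as before
    print(page_url)
```

The only real work is discovering the pattern and the last page number; the loop itself is trivial.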
I have problems scraping data from certain URLs with Beautiful Soup.
I've successfully made the part where my code opens a text file with a list of URLs and goes through them.
The first problem I encounter is when I want to go through two separate places on the HTML page.
With the code that I have written so far, it only goes through the first "class" and just doesn't search for and scrape the other one that I defined.
The second issue is that I can only get data if I run my script in a terminal with:
python mitel.py > mitel.txt
The output that I get is not the one that I want. I am just looking for two strings from it, but I cannot find a way to extract them.
Finally, there's no way I can get my results written to CSV.
I only get the last string of the last URL from the URL list into my CSV.
Can you assist a TOTAL beginner in Python?
Here's my script:
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

with open('urllist.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll(True, {"class": ["tel-number", "tel-result main"]}):
            finalt = target.text.strip()
            print finalt

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(finalt)
For some reason, I cannot successfully paste the targeted HTML code, so I'll just put a link here to one of the pages; if it's needed, I'll try to paste it somehow, although it's very big and complex.
Targeted URL for scraping
Thank you so much in advance!
Well, I managed to get some results with the help of @furas and Google.
With this code, I can get all "a" elements from the page, and then in MS Excel I was able to get rid of everything that wasn't a name or phone number.
Sorting and the rest of the work was also done in Excel... I guess I am too big of a newbie to accomplish everything in one script.
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

finalt = []

proxy = urllib2.ProxyHandler({'http': 'http://163.158.216.152:80'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)

with open('mater.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll('a'):
            finalt.append(target.text.strip())
print finalt

with open('imena1-50.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in finalt:
        writer.writerow([i])
It also uses a proxy... sort of. I didn't manage to get it to read proxies from a .txt list.
Not bad for a first Python scraper, but far from efficient and from the way I imagine it.
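If you want to pick up the proxy-list idea later, reading the list from a text file is only a few lines. A minimal sketch (Python 3 syntax; proxies.txt is a hypothetical file with one http://host:port address per line):

```python
import random

def load_proxies(path):
    """Read one proxy address per line, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Usage sketch (hypothetical file), feeding the result into the opener above:
# proxies = load_proxies('proxies.txt')
# proxy = urllib2.ProxyHandler({'http': random.choice(proxies)})  # Python 2, as in the code above
```

Picking a random entry per run (or per request) spreads the load across the list.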
Maybe your selector is wrong; try this:
for target in soup.findAll(True, {"class": ["tel-number", "tel-result-main"]}):
I am parsing some content from the web and then saving it to a file. So far I create the filename manually.
Here's my code:
import requests

url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')
with open("html_output_test.html", "wb") as file:
    file.write(html)
How could I automate the process of creating and saving the html file under the following filename, taken from the URL:
The-Google-Way-Revolutionizing-Management (instead of html_output_test)?
This name comes from the original bookstore URL that I posted, which was probably modified to avoid product advertising.
Thanks!
You can use BeautifulSoup to get the title text from the page; I would let requests handle the encoding with .content:
from bs4 import BeautifulSoup
import requests

url = "http://rads.stackoverflow.com/amzn/click/1593271840"
html = requests.get(url).content
print(BeautifulSoup(html).title.text)

with open("{}.html".format(BeautifulSoup(html).title.text), "wb") as file:
    file.write(html)
The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books
For that particular page, if you just want The Google Way: How One Company is Revolutionizing Management As We Know It, the product title is in the class a-size-large:
text = BeautifulSoup(html).find("span", attrs={"class": "a-size-large"}).text
with open("{}.html".format(text), "wb") as file:
    file.write(html)
The link with The-Google-Way-Revolutionizing-Management is in the link tag:
link = BeautifulSoup(html).find("link",attrs={"rel":"canonical"})
print(link["href"])
http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840
So to get that part you need to parse it:
print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management
link = BeautifulSoup(html).find("link", attrs={"rel": "canonical"})
with open("{}.html".format(link["href"].split("/")[3]), "wb") as file:
    file.write(html)
You could parse the web page using Beautiful Soup, get the title of the page, then slugify it and use that as the file name; or generate a random filename with something like os.tmpfile.
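A minimal slugify sketch for the title-based approach (assumes roughly ASCII titles; a real slugifier would also handle accents and length limits):

```python
import re

def slugify(title):
    """Keep only alphanumeric runs, join with hyphens, lower-case the result."""
    words = re.findall(r'[A-Za-z0-9]+', title)
    return '-'.join(words).lower()

print(slugify("The Google Way: How One Company is Revolutionizing Management"))
# → the-google-way-how-one-company-is-revolutionizing-management
```

The result is safe as a filename on Windows as well, since colons and slashes are stripped out.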