I have problems scraping data from a certain URL with Beautiful Soup.
I've successfully written the part where my code opens a text file with a list of URLs and goes through them.
The first problem I encounter is when I want to go through two separate places on the HTML page.
With the code I have written so far, it only goes through the first "class" and just doesn't want to search and scrape the other one that I defined.
The second issue is that I can only get data if I run my script in the terminal with:
python mitel.py > mitel.txt
The output I get is not the one I want. I am just looking for two strings from it, but I cannot find a way to extract them.
Finally, I cannot get my results to write to CSV.
I only get the last string of the last URL from the URL list into my CSV.
Can you assist a TOTAL beginner in Python?
Here's my script:
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
with open('urllist.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll(True, {"class": ["tel-number", "tel-result main"]}):
            finalt = target.text.strip()
            print finalt

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(finalt)
For some reason, I cannot successfully paste the targeted HTML code, so I'll just put a link here to one of the pages, and if it is needed, I'll try to paste it somehow, although it's very big and complex.
Targeted URL for scraping
Thank you so much in advance!
Well, I managed to get some results with the help of @furas and Google.
With this code, I can get all the "a" elements from the page, and then in MS Excel I was able to get rid of everything that wasn't a name or a phone number.
Sorting and the rest of the work was also done in Excel... I guess I am too big of a newbie to accomplish everything in one script.
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
finalt = []
proxy = urllib2.ProxyHandler({'http': 'http://163.158.216.152:80'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
with open('mater.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll('a'):
            finalt.append(target.text.strip())

print finalt

with open('imena1-50.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in finalt:
        writer.writerow([i])
It also uses a proxy... sort of. I didn't manage to get it to load proxies from a .txt list.
Not bad for a first Python scrape, but far from efficient and from the way I imagine it.
Maybe your selector is wrong; try this:
for target in soup.findAll(True, {"class":["tel-number",'tel-result-main']}):
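For the CSV part of the question, a minimal sketch (not the asker's final code; it reuses the corrected class name above and the file names from the question) would collect every match in a list inside the loop and only write the rows afterwards, so nothing gets overwritten:

import urllib2
import csv
from bs4 import BeautifulSoup

results = []
with open('urllist.txt') as inf:
    for url in (line.strip() for line in inf):
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        # match either of the two classes in one pass
        for target in soup.findAll(True, {"class": ["tel-number", "tel-result-main"]}):
            results.append(target.text.strip())

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    for item in results:
        writer.writerow([item])  # writerow expects a sequence, so wrap each string in a list

Python 2 is used here only to match the question's urllib2 code.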
Related
I am working on a web scraping script; it works fine. Now I want to replace the single URL with a CSV file that contains thousands of URLs, like this:
url1
url2
url3
.
.
.
urlX
The first lines of my web scraping code are basic:
from bs4 import BeautifulSoup
import requests
from csv import writer
url= "HERE THE URL FROM EACH LINE OF THE CSV FILE"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
How can I tell Python to use the URLs from the CSV? I was thinking of using a dict, but I don't really know how to do that. Does anyone have a solution, please? I know it seems very simple to you, but it would be very useful for me.
If this is just a list of URLs, you don't really need the csv module. But here is a solution assuming the URL is in column 0 of the file. You want a csv reader, not a writer, and then it's a simple case of iterating the rows and taking action.
from bs4 import BeautifulSoup
import requests
import csv
with open("url-collection.csv", newline="") as fileobj:
for row in csv.reader(fileobj):
# TODO: add try/except to handle errors
url = row[0]
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
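As noted above, if the file really is just one URL per line, plain file iteration works without the csv module; a minimal sketch reusing the same file name:

from bs4 import BeautifulSoup
import requests

with open("url-collection.csv") as fileobj:
    for line in fileobj:
        url = line.strip()
        if not url:  # skip blank lines
            continue
        page = requests.get(url)
        soup = BeautifulSoup(page.content, 'html.parser')
        # ... scrape soup here, inside the loop, so every URL gets processed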
This is my first time posting, so apologies if there are any errors. I currently have a file with a list of URLs, and I am trying to create a Python program which will go to each URL, grab the text from the HTML page, and save it in a .txt file. I am currently using BeautifulSoup to scrape these sites, and many of them are throwing errors which I am unsure how to solve. I am looking for a better way to do this; I have posted my code below.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from urllib.request import Request
import datefinder
from dateutil.parser import parse
import json
import re
import random
import time
import scrapy
import requests
import urllib
import os.path
from os import path
#extracts page contents using beautifulSoup
def page_extract(url):
    req = Request(url,
                  headers={'User-Agent': 'Mozilla/5.0'})
    webpage = uReq(req, timeout=5).read()
    page_soup = soup(webpage, "lxml")
    return page_soup

#opens file that contains the links
file1 = open('links.txt', 'r')
lines = file1.readlines()

#for loop that iterates through the list of urls I have
for i in range(0, len(lines)):
    fileName = str(i)+".txt"
    url = str(lines[i])
    print(i)
    try:
        #if the scraping is successful I would like it to save the text contents in a text file
        #with the text file name being the index
        soup2 = page_extract(url)
        text = soup2.text
        f = open("Politifact Files/"+fileName, "x")
        f.write(str(text))
        f.close()
        print(url)
    except:
        #otherwise save it to another folder which contains all the sites that threw an error
        f = open("Politifact Files Not Completed/"+fileName, "x")
        f.close()
        print("NOT DONE: "+url)
Thanks @Thierry Lathuille and @Dr Pi for your responses. I was able to find a solution to this problem by looking into Python libraries that can scrape the important text off a webpage. I came across one called 'Trafilatura' which is able to accomplish this task. The documentation for this library is at: https://pypi.org/project/trafilatura/.
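For reference, a minimal sketch of how Trafilatura could slot into the loop above (assuming the same 'links.txt' input; check the library's documentation for extraction options):

import trafilatura

with open('links.txt') as file1:
    for i, line in enumerate(file1):
        url = line.strip()
        if not url:
            continue
        downloaded = trafilatura.fetch_url(url)  # returns None if the download fails
        if downloaded is None:
            print("NOT DONE: " + url)
            continue
        text = trafilatura.extract(downloaded)   # main text content, stripped of boilerplate
        if text:
            with open(str(i) + ".txt", "w") as f:
                f.write(text)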
I have an Apple Music link, as seen here, and I want to get all the song names and put them into a file.
This is what I have tried:
for i in soup.findAll('div', {'class':'song-name typography-body-tall'}):
    with open("playlist.txt", "w") as f:
        f.write(i)
But nothing is being written to the file. Please can I get some help with this? Thanks in advance.
Firstly, make sure you are actually scraping the website:
import requests
import sys
from bs4 import BeautifulSoup
a = requests.get("https://music.apple.com/gb/playlist/eminem-essentials/pl.9200aa618dc24867b2aa7f00466fd404")
soup = BeautifulSoup(a.text,features="html.parser")
Then collect the songs:
songs = soup.findAll('div', {'class':'song-name typography-body-tall'})
And finally, loop through them all and write them into a file:
song = [song.get_text() for song in songs if song]
original_stdout = sys.stdout
with open('playlist.txt', 'w') as f:
    sys.stdout = f
    for idx in range(len(song)):
        print(f'{song[idx]}')
sys.stdout = original_stdout
Make sure to import everything I have; importing sys is important for printing everything into the file.
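As an aside, the same result can be had without redirecting sys.stdout, by writing to the file handle directly; a minimal sketch under the same assumptions as above (the selector shown, and that the playlist page serves these divs without JavaScript):

import requests
from bs4 import BeautifulSoup

a = requests.get("https://music.apple.com/gb/playlist/eminem-essentials/pl.9200aa618dc24867b2aa7f00466fd404")
soup = BeautifulSoup(a.text, features="html.parser")

songs = soup.findAll('div', {'class': 'song-name typography-body-tall'})
with open('playlist.txt', 'w') as f:
    for song in songs:
        f.write(song.get_text() + "\n")  # one song name per line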
Besides using Beautiful Soup, if you want to scrape the content of a website in detail, then one of the best libraries in the business is Scrapy. An easy way for Scrapy to crawl content from a website is with XPath selectors.
Here's the Scrapy documentation: https://docs.scrapy.org/en/latest/
XPath tutorial for scraping meta content with Scrapy: https://linuxhint.com/scrapy-with-xpath-selectors/
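For orientation, a minimal Scrapy spider using an XPath selector might look like the sketch below; the spider name, start URL, and XPath are illustrative placeholders, not taken from any question above:

import scrapy

class SongSpider(scrapy.Spider):
    name = "songs"                                 # hypothetical spider name
    start_urls = ["https://example.com/playlist"]  # placeholder URL

    def parse(self, response):
        # select the text of matching nodes with an XPath expression
        for title in response.xpath("//div[@class='song-name']/text()").getall():
            yield {"song": title.strip()}

It can be run without a full project via: scrapy runspider spider.py -o songs.json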
Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
#If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

for link in soup.select("a[href$='.pdf']"):
    #Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of the files from the maths revision site?
Thanks.
Looking at the page itself: while it may look like it is static, it isn't. The content you are trying to access is gated behind some fancy JavaScript loading. What I've done to assess that is simply log the page that BS4 actually got and open it in a text editor:
with open(folder_location+"\page.html", 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for solutions to your problem: BS4 is not able to execute JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do some more complex web scraping.
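One common workaround (a general technique, not something from this answer) is to let a real browser render the JavaScript first, for example with Selenium, and only then hand the HTML to BS4; a minimal sketch, assuming a working ChromeDriver:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()                    # requires ChromeDriver to be installed
driver.get("https://mathsmadeeasy.co.uk/gcse-maths-revision/")
html = driver.page_source                      # HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(len(soup.select("a[href$='.pdf']")))     # same selector as in the question

Depending on how the page loads, an explicit wait may still be needed before reading page_source.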
I know that this is a repeated question; however, from all the answers on the web I could not find the solution, as all of them were throwing errors.
I am simply trying to scrape headlines from the web and save them to a txt file.
The scraping code works well; however, it saves only the last string, dropping all the headlines before the last one.
I have tried looping, putting the writing code before the scraping, appending to a list, etc., and different methods of scraping, but all have the same issue.
Please help.
Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip()+"\n")  # write a newline after each headline
Here is the full working code, corrected as per the advice.
from bs4 import BeautifulSoup
import requests

def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip()+"\n")
This code will throw an error in a Jupyter notebook when opening the file; however, when the file is opened outside Jupyter, the headlines are saved...