I know that this is a repeated question, however from all the answers on the web I could not find the solution, as all of them were throwing errors.
I am simply trying to scrape headlines from the web and save them to a txt file.
The scraping code works well; however, it saves only the last headline, skipping all the ones before it.
I have tried looping, putting the writing code before the scraping, appending to a list, and different scraping methods, but they all have the same issue.
Please help. Here is my code:
def nytscrap():
    from bs4 import BeautifulSoup
    import requests
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        file_object.write(str(headlines.text.strip()))
Every time your for loop runs, it overwrites the headlines variable, so when you get to writing to the file, the headlines variable only stores the last headline. An easy solution to this is to bring the for loop inside your with statement, like so:
with open(filename, 'w') as file_object:
    for headlines in page.find_all("h2"):
        print(headlines.text.strip())
        file_object.write(headlines.text.strip() + "\n")  # write a newline after each headline
Here is the full working code, corrected as per the advice:
from bs4 import BeautifulSoup
import requests

def nytscrap():
    url = "http://www.nytimes.com"
    page = BeautifulSoup(requests.get(url).text, "lxml")
    filename = "NYTHeads.txt"
    with open(filename, 'w') as file_object:
        for headlines in page.find_all("h2"):
            print(headlines.text.strip())
            file_object.write(headlines.text.strip() + "\n")
This code will throw an error in Jupyter when the file is opened there; however, when the file is opened outside Jupyter the headlines are saved...
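As a side note, another way to avoid the overwriting problem is to collect the headlines in a list first and write them all out in one go. A minimal sketch, using the same URL and file name as above (assumes requests and lxml are installed):

from bs4 import BeautifulSoup
import requests

url = "http://www.nytimes.com"
page = BeautifulSoup(requests.get(url).text, "lxml")

# Collect every headline first, then write them all at once.
headlines = [h2.text.strip() for h2 in page.find_all("h2")]

with open("NYTHeads.txt", "w") as file_object:
    file_object.write("\n".join(headlines))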
I'm new to the Python world and wondering how I can scrape data from GitHub into a CSV file, e.g.
https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19#file-pokemon-csv
I'm trying with this code, however it is not very successful. There should definitely be an easier way to do it.
Thank you in advance!
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://gist.github.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
pokemon_table = soup.find('table', class_='highlight tab-size js-file-line-container')
for pokemon in pokemon_table.find_all('tr'):
    name = [pokemon.find('td', class_='blob-code blob-code-inner js-file-line').text]
    with open('output.csv', 'w') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(name)
If you do not mind taking the 'Raw version' of the CSV file, the following code would work:
import requests
response = requests.get("https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/bd584ee6c307cc9fab5ba38916e98a85de9c2ba7/pokemon.csv")
with open("output.csv", "w") as file:
file.write(response.text)
The URL you are using does not link to the CSV's Raw version, but the following does:
https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/bd584ee6c307cc9fab5ba38916e98a85de9c2ba7/pokemon.csv
Edit 1: Just to clarify, you can access the Raw version of that CSV file by pressing the 'Raw' button on the top-right side of the CSV file shown in the link you provided.
Edit 2: Also, it looks like the following URL would also work, and it is shorter and easier to 'build' based on the original URL:
https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/pokemon.csv
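If the goal is to work with the data rather than just save the raw file, the downloaded text can also be parsed directly. A minimal sketch using the standard csv module and the shorter raw URL above (the column names come from the file's own header row):

import csv
import io
import requests

raw_url = "https://gist.githubusercontent.com/simsketch/1a029a8d7fca1e4c142cbfd043a68f19/raw/pokemon.csv"
response = requests.get(raw_url)

# DictReader takes the first row of the CSV as the field names.
reader = csv.DictReader(io.StringIO(response.text))
rows = list(reader)
print(len(rows), "rows parsed")
print(rows[0])  # the first record as a dict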
Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
#If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):os.mkdir(folder_location)
response = requests.get(url)
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Any help as to why the code does not download any of the files from the maths revision site would be appreciated.
Thanks.
Looking at the page itself, while it may look like it is static, it isn't. The content you are trying to access is gated behind some fancy javascript loading. What I've done to assess that is simply logging the page that BS4 actually got and opening it in a text editor:
with open(folder_location + "\\page.html", 'wb') as f:
    f.write(response.content)
By the look of it, the page is replacing placeholders with JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for a solution to your problem: BS4 is not able to execute JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do some more complex web scraping.
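For reference, here is a minimal sketch of that kind of approach using Selenium (an assumption on my part: it requires the selenium package and a matching ChromeDriver; the PDF-link selector is carried over from the original script):

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"

# Let a real browser execute the JavaScript, then hand the rendered HTML to BS4.
driver = webdriver.Chrome()
driver.get(url)
# Depending on the page, an explicit wait may be needed before reading page_source.
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, "html.parser")
pdf_links = [a["href"] for a in soup.select("a[href$='.pdf']")]
print(len(pdf_links), "PDF links found")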
I want to be able to scrape the data from a particular website (https://physionet.org/challenge/2012/set-a/) and the subdirectories like it, while also taking each text file and adding it to a giant csv or excel file so that I might be able to see all the data in one place.
I have deployed the following code, similar to this article, but my code basically downloads all the text files on the page, and stores them in my working directory. And, it honestly just takes too long to run.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
url = 'https://physionet.org/challenge/2012/set-a/'
response = requests.get(url)
response # 200 indicates that it works...
soup = BeautifulSoup(response.text, "html.parser")
for i in range(5, len(soup.findAll('a'))+1):  # 'a' tags are for links
    one_a_tag = soup.findAll('a')[i]
    link = one_a_tag['href']
    download_url = 'https://physionet.org/challenge/2012/set-a/' + link
    urllib.request.urlretrieve(download_url, './' + link[link.find('/132539.txt')+1:])
    time.sleep(1)  # pause the code for a sec
The actual result is just a bunch of text files crowding my working directory; instead of that, I'd like to put them all into one large CSV file.
If you want to save them, but have to do it a bit at a time (maybe you don't have enough RAM to hold everything in at once), then I would just append the files to a master file one by one.
import requests
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
output_file = 'outputfile.txt'
url = 'https://physionet.org/challenge/2012/set-a/'
# Download and find all the links. Check the last 4 characters to verify it's one
# of the files we are looking for
response = requests.get(url, verify=False)
soup = BeautifulSoup(response.text, "html.parser")
links = [a['href'] for a in soup.find_all('a') if a['href'][-4:] == '.txt']
# Clear the current file
with open(output_file, 'w'):
    pass

# Iterate through all the links
for href in links:
    response = requests.get("{}{}".format(url, href), verify=False)
    if response:
        # Open up the output_file in append mode so we can just write to the one file
        with open(output_file, 'a') as f:
            f.write(response.text)
            print(len(response.text.split('\n')))
The one downside to this is that you would have the headers from each text file, but you can change the f.write() to the following to get it without any headers:
f.write("\n".join(response.text.split('\n')[1:]))
If you do have the available RAM, you could read in all the files using a list comprehension then use pandas.concat() to put them in one giant dataframe. Then use df.to_csv() to export it to a file.
import pandas as pd

df = pd.concat([pd.read_csv("{}{}".format(url, href)) for href in links])
df.to_csv(output_file)
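A small variant of the same idea, in case it matters which file each row came from (like the line above, this sketch assumes the .txt files are comma-separated with a header row that pd.read_csv can parse):

import pandas as pd

frames = []
for href in links:
    frame = pd.read_csv("{}{}".format(url, href))
    frame["source_file"] = href  # remember which record each row came from
    frames.append(frame)

df = pd.concat(frames, ignore_index=True)
df.to_csv(output_file, index=False)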
I am working on a web scraping project which involves scraping URLs from a website based on a search term, storing them in a CSV file(under a single column) and finally scraping the information from these links and storing them in a text file.
I am currently stuck with 2 issues.

1. Only the first few links are scraped. I'm unable to extract links from other pages (the website contains a load-more button). I don't know how to use the XHR object in the code.
2. The second half of the code reads only the last link (stored in the CSV file), scrapes the respective information and stores it in a text file. It does not go through all the links from the beginning. I am unable to figure out where I have gone wrong in terms of file handling and f.seek(0).
from pprint import pprint
import requests
import lxml
import csv
import urllib2
from bs4 import BeautifulSoup

def get_url_for_search_key(search_key):
    base_url = 'http://www.marketing-interactive.com/'
    response = requests.get(base_url + '?s=' + search_key)
    soup = BeautifulSoup(response.content, "lxml")
    return [url['href'] for url in soup.findAll('a', {'rel': 'bookmark'})]
    results = soup.findAll('a', {'rel': 'bookmark'})
    for r in results:
        if r.attrs.get('rel') and r.attrs['rel'][0] == 'bookmark':
            newlinks.append(r["href"])

pprint(get_url_for_search_key('digital advertising'))

with open('ctp_output.csv', 'w+') as f:
    f.write('\n'.join(get_url_for_search_key('digital advertising')))
    f.seek(0)
Reading the CSV file, scraping the respective content and storing it in a .txt file:
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'w+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
Regarding your second problem, your mode is off. You'll need to convert w+ to a+. In addition, your indentation is off.
with open('ctp_output.csv', 'rb') as f1:
    f1.seek(0)
    reader = csv.reader(f1)
    for line in reader:
        url = line[0]
        soup = BeautifulSoup(urllib2.urlopen(url))
        with open('ctp_output.txt', 'a+') as f2:
            for tag in soup.find_all('p'):
                f2.write(tag.text.encode('utf-8') + '\n')
The + suffix will create the file if it doesn't exist. However, w+ will erase all contents before writing at each iteration. a+ on the other hand will append to a file if it exists, or create it if it does not.
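A tiny illustration, just to make the mode behaviour concrete (the file names here are only examples):

# 'w' truncates the file on every open, 'a' appends to it.
for word in ["first", "second", "third"]:
    with open("demo_w.txt", "w") as f:
        f.write(word + "\n")  # demo_w.txt ends up containing only "third"
    with open("demo_a.txt", "a") as f:
        f.write(word + "\n")  # demo_a.txt ends up containing all three words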
For your first problem, there's no option but to switch to something that can automate clicking browser buttons and whatnot. You'd have to look at Selenium. The alternative is to manually search for that button, extract the URL from the href or text, and then make a second request. I leave that to you.
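For what it's worth, a minimal sketch of the Selenium route (assumes selenium and a ChromeDriver are installed; the load-more button selector is purely hypothetical and would have to be taken from the actual page):

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

driver = webdriver.Chrome()
driver.get('http://www.marketing-interactive.com/?s=digital+advertising')

# Keep clicking the load-more button until it is no longer present.
while True:
    try:
        button = driver.find_element_by_css_selector('a.load-more')  # hypothetical selector
        button.click()
        time.sleep(2)  # give the new results a moment to load
    except NoSuchElementException:
        break

soup = BeautifulSoup(driver.page_source, 'lxml')
links = [a['href'] for a in soup.findAll('a', {'rel': 'bookmark'})]
driver.quit()
print(len(links))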
If there are more pages with results observe what changes in the URL when you manually click to go to the next page of results.
I can guarantee 100% that a small piece of the URL will have either a subpage number or some other variable encoded in it that strictly relates to the subpage.
Once you have figured out the pattern, you just fit it into a for loop, put a .format() into the URL that you want to scrape, and keep navigating through all the subpages of the results this way.
As to what is the last subpage number - you have to inspect the html code of the site you are scraping and find the variable responsible for it and extract its value. See if there is "class": "Page" or equivalent in their code - it may contain that number that you will need for your for loop.
Unfortunately there is no magic "navigate through subresults" option...
But this gets pretty close :).
Good luck.
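As an illustration of the .format() idea described above, a minimal sketch (the ?page= query parameter and the page count are hypothetical; substitute whatever pattern the real URL actually uses):

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.marketing-interactive.com/?s=digital+advertising&page={}'  # hypothetical pattern
all_links = []

for page in range(1, 6):  # hypothetical number of result pages
    response = requests.get(base_url.format(page))
    soup = BeautifulSoup(response.content, 'lxml')
    all_links.extend(a['href'] for a in soup.findAll('a', {'rel': 'bookmark'}))

print(len(all_links))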
I have problems with scraping data from a certain URL with Beautiful Soup.
I've successfully made the part where my code opens a text file with a list of URLs and goes through them.
The first problem I encounter is when I want to go through two separate places on the HTML page.
With the code that I have written so far, it only goes through the first "class" and just doesn't want to search for and scrape the other one that I defined.
The second issue is that I can only get data if I run my script in the terminal with:
python mitel.py > mitel.txt
The output that I get is not the one that I want. I am just looking for two strings in it, but I cannot find a way to extract them.
Finally, there's no way I can get my results written to a CSV.
I only get the last string of the last URL from the URL list into my CSV.
Can you assist a TOTAL beginner in Python?
Here's my script:
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

with open('urllist.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll(True, {"class": ["tel-number", "tel-result main"]}):
            finalt = target.text.strip()
            print finalt

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(finalt)
For some reason, I cannot successfully paste the targeted HTML code, so I'll just put a link here to one of the pages, and if needed, I'll try to somehow paste it, although it's very big and complex.
Targeted URL for scraping
Thank you so much in advance!
Well, I managed to get some results with the help of @furas and Google.
With this code, I can get all "a" tags from the page, and then in MS Excel I was able to get rid of everything that wasn't a name or a phone number.
Sorting and the rest of the stuff was also done in Excel... I guess I am too big of a newbie to accomplish everything in one script.
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import requests
import sys

reload(sys)
sys.setdefaultencoding('utf-8')

finalt = []

proxy = urllib2.ProxyHandler({'http': 'http://163.158.216.152:80'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)

with open('mater.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll('a'):
            finalt.append(target.text.strip())

print finalt

with open('imena1-50.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in finalt:
        writer.writerow([i])
It also uses a proxy... sort of. I didn't get it to take the proxies from a .txt list.
Not bad for a first Python scrape, but far from efficient and from the way I imagine it.
Maybe your selector is wrong; try this:
for target in soup.findAll(True, {"class": ["tel-number", "tel-result-main"]}):
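To also get the results into the CSV with one row per URL (rather than only the last string), one option is to collect every matched string per page and write them out as a row. A minimal sketch building on the script above (the class names are the ones discussed here; whether each page really yields the name and the phone number in that order would need checking against the real HTML):

import csv
import urllib2
from bs4 import BeautifulSoup

rows = []

with open('urllist.txt') as inf:
    for url in (line.strip() for line in inf):
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        # Collect every matching string from both target classes on this page.
        found = [t.text.strip().encode('utf-8') for t in soup.findAll(True, {"class": ["tel-number", "tel-result-main"]})]
        rows.append([url] + found)

with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(rows)  # one row per URL, all matched strings as columns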