Hello there,
I wrote a Python script using Scrapy to crawl a URL and extract the text I need from the page, so that I can export the results to an Excel file. However, I'm struggling with the last part of the code, the print part. I don't know how to use print properly so that it exports the result of "'name': brickset.css(NAME_SELECTOR).extract_first()" (the text from the URL) into the Excel file. Can someone help me?
I would really appreciate it.
Viktor
import scrapy

class BrickSetSpider(scrapy.Spider):
    name = "brickset_spider"
    start_urls = ['https://bitcointalk.org/index.php?topic=1944505.0']

    def parse(self, response):
        POST_SELECTOR = '.post'
        for brickset in response.css(POST_SELECTOR):
            NAME_SELECTOR = 'div'
            yield {
                'name': brickset.css(NAME_SELECTOR).extract_first(),
            }
import sys
orig_stdout = sys.stdout
f = open('Scrappingtest1.xls', 'a')
sys.stdout = f
print(yield)
sys.stdout = orig_stdout
f.close()
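One way to get the yielded items into an Excel file, without redirecting print at all, is to let Scrapy export them as a feed and convert that file afterwards. This is only a minimal sketch, assuming the spider above is saved as brickset_spider.py and that pandas and openpyxl are installed; the file names are placeholders, not part of the original code:

# Run the spider from the command line and let Scrapy write the yielded
# dicts to a JSON feed:
#
#   scrapy runspider brickset_spider.py -o results.json
#
# Then convert the feed to an Excel workbook with pandas (needs openpyxl):
import pandas as pd

df = pd.read_json('results.json')               # one row per yielded item
df.to_excel('Scrappingtest1.xlsx', index=False)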
I'm trying to pull the average of the temperatures from this API for a bunch of different ZIP codes. I can currently do so by manually changing the ZIP code in the API URL, but I was hoping to be able to loop through a list of ZIP codes, or ask for input, and use those ZIP codes.
However, I'm rather new and have no idea how to add variables to a link, or maybe I'm overcomplicating it. Basically, I was looking for a way to add a variable to the link (or something to the same effect) so I can change it whenever I want.
import urllib.request
import json
out = open("output.txt", "w")
link = "http://api.openweathermap.org/data/2.5/weather?zip={zip-code},us&appid={api-key}"
print(link)
x = urllib.request.urlopen(link)
url = x.read()
out.write(str(url, 'utf-8'))
returnJson = json.loads(url)
print('\n')
print(returnJson["main"]["temp"])
import urllib.request
import json

zipCodes = ['123', '231', '121']
out = open("output.txt", "w")
for i in zipCodes:
    link = "http://api.openweathermap.org/data/2.5/weather?zip=" + i + ",us&appid={api-key}"
    x = urllib.request.urlopen(link)
    url = x.read()
    out.write(str(url, 'utf-8'))
    returnJson = json.loads(url)
    print(returnJson["main"]["temp"])
out.close()
You can achieve what you want by looping through a list of ZIP codes and creating a new URL from each of them.
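If you would rather take the ZIP codes from user input, an f-string keeps the URL readable. This is only a sketch: the fetch_temp helper and the prompt text are my own additions, and {api-key} is still a placeholder you have to fill in yourself:

import urllib.request
import json

API_KEY = "{api-key}"  # placeholder, substitute your real key

def fetch_temp(zip_code):
    # The variable parts are dropped straight into the URL with an f-string.
    link = f"http://api.openweathermap.org/data/2.5/weather?zip={zip_code},us&appid={API_KEY}"
    with urllib.request.urlopen(link) as response:
        return json.loads(response.read())["main"]["temp"]

for zip_code in input("ZIP codes (comma separated): ").split(","):
    print(zip_code.strip(), fetch_temp(zip_code.strip()))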
I have a scraper that takes in a list of URLs, scans them for additional links, follows those links, looks for anything that resembles an email address (using regex), and returns a list of URLs/email addresses.
I currently have it set up in a Jupyter Notebook, so I can easily view the output while testing. The problem is that it takes forever to run, because I'm not limiting the depth of the scraper (per URL).
Ideally, the scraper would go a max of 2-5 pages deep from each start url.
Here's what I have so far:
First, I'm importing my dependencies:
import os, re, csv, scrapy, logging
import pandas as pd
from scrapy.crawler import CrawlerProcess
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from googlesearch import search
from time import sleep
from Urls import URL_List
And I turn off logs and warnings for using Scrapy inside the Jupyter Notebook:
logging.getLogger('scrapy').propagate = False
From there, I extract the URLs from my URL file:
def get_urls():
    urls = URL_List['urls']
    return urls
Then, I set up my spider:
class MailSpider(scrapy.Spider):
    name = 'email'

    def parse(self, response):
I search for links inside each URL's page:
        links = LxmlLinkExtractor(allow=()).extract_links(response)
Then I take in that list of URLs as input, so their source code can be read one by one:
        links = [str(link.url) for link in links]
        links.append(str(response.url))
I send the links from one parse method to another, setting the callback argument that defines which method each requested URL must be sent to:
        for link in links:
            yield scrapy.Request(url=link, callback=self.parse_link)
I then pass the URLs to the parse_link method; this method applies a regex findall to look for emails:
    def parse_link(self, response):
        html_text = str(response.text)
        mail_list = re.findall(r'\w+@\w+\.\w+', html_text)
        dic = {'email': mail_list, 'link': str(response.url)}
        df = pd.DataFrame(dic)
        df.to_csv(self.path, mode='a', header=False)
The google_urls list is passed as an argument when we call the process method to run the spider, and path defines where to save the CSV file.
Then, I save those emails in a CSV file:
def ask_user(question):
    response = input(question + ' y/n' + '\n')
    if response == 'y':
        return True
    else:
        return False

def create_file(path):
    response = False
    if os.path.exists(path):
        response = ask_user('File already exists, replace?')
        if response == False: return
    with open(path, 'wb') as file:
        file.close()
For each website, I make a data frame with columns: [email, link], and append it to a previously created CSV file.
Then, I put it all together:
def get_info(root_file, path):
    create_file(path)
    df = pd.DataFrame(columns=['email', 'link'], index=[0])
    df.to_csv(path, mode='w', header=True)

    print('Collecting urls...')
    google_urls = get_urls()

    print('Searching for emails...')
    process = CrawlerProcess({'USER_AGENT': 'Mozilla/5.0'})
    process.crawl(MailSpider, start_urls=google_urls, path=path)
    process.start()

    print('Cleaning emails...')
    df = pd.read_csv(path, index_col=0)
    df.columns = ['email', 'link']
    df = df.drop_duplicates(subset='email')
    df = df.reset_index(drop=True)
    df.to_csv(path, mode='w', header=True)

    return df
get_urls()
Lastly, I define a keyword and run the scraper:
keyword = input("Who is the client? ")
df = get_info(f'{keyword}_urls.py', f'{keyword}_emails.csv')
On a list of 100 URLs, I got back 44k results matching the email address syntax.
Anyone know how to limit the depth?
Set DEPTH_LIMIT in your spider's custom_settings, like this:
class MailSpider(scrapy.Spider):
    name = 'email'
    custom_settings = {
        "DEPTH_LIMIT": 5
    }

    def parse(self, response):
        pass
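Since the spider in the question is started from a CrawlerProcess, it should be equally possible to pass the setting there instead of in custom_settings; a small sketch (the value 3 is arbitrary):

# DEPTH_LIMIT is an ordinary Scrapy setting, so it can also go into the
# settings dict handed to CrawlerProcess.
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/5.0',
    'DEPTH_LIMIT': 3,
})
process.crawl(MailSpider, start_urls=google_urls, path=path)
process.start()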
I am trying to collect all the poems from the category "Índice general alfabético" on this site: http://amediavoz.com/. That page lists the poem titles, which one has to click to get to the actual poems. Basically, I want to copy all the text of each poem from each of those pages (the text within <p></p> under the XPath "/html/body/blockquote[2]/blockquote" on each page), except for the closing information about the poem, which sits under <i></i> in the HTML code. I would like to save everything in .txt files, either one big one or one per page.
This code is an attempt to do this.
import scrapy

class FirstSpider(scrapy.Spider):
    name = "FirstSpider"
    start_urls = ['http://amediavoz.com/']

    def start_requests(self):
        url = ['http://amediavoz.com/']
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        xp = "//a[@target='_blank']/@href"
        for url in response.xpath(xp).extract():
            page = response.url.split("/")[-2]
            filename = 'Poems=%s.txt' % page
            sub = url.css('blockquote')[1]
            with open(filename, 'wb') as f:
                f.write(sub.xpath('//font/text()').extract())
                self.log('Saved file %s' % filename)
            f.close()
When I run the code I don't get any error message, but I don't get any output either, that is, no text files.
Any help is appreciated.
Sorry, I don't know Spanish. I just roughly extracted the text, so it is not necessarily right. If you can mark which data you need to extract from the HTML, I will help you modify the code.
from simplified_scrapy.spider import Spider, SimplifiedDoc

class FirstSpider(Spider):
    name = 'FirstSpider'
    start_urls = ['http://amediavoz.com/']
    refresh_urls = True

    def extract(self, url, html, models, modelNames):
        try:
            doc = SimplifiedDoc(html)
            if url['url'] == self.start_urls[0]:
                lstA = doc.listA(url=url['url'], start='blockquote', end='La voz de los poetas')
                return [{"Urls": lstA}]
            blockquotes = doc.getElementsByTag('blockquote')
            page = url['url'].split("/")[-1]
            filename = 'data/Poems=%s.txt' % page
            with open(filename, 'w') as f:
                for blockquote in blockquotes:
                    f.write(blockquote.getText('\n'))
                    f.write('\n')
            print('Saved file %s' % filename)
            return True
        except Exception as e:
            print('extract', e)

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(FirstSpider())  # start scraping
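For comparison, here is a rough sketch of the same idea in plain Scrapy: follow each poem link and write the text of the nested blockquote to its own file, skipping anything inside <i>. It is untested against the live site, the spider name is my own, and the XPath simply mirrors the one given in the question, so it may need adjusting per page:

import scrapy

class PoemSpider(scrapy.Spider):
    name = "poems"
    start_urls = ['http://amediavoz.com/']

    def parse(self, response):
        # Follow every link that opens in a new window (the poem pages).
        for href in response.xpath("//a[@target='_blank']/@href").extract():
            yield response.follow(href, callback=self.parse_poem)

    def parse_poem(self, response):
        page = response.url.split("/")[-1].replace(".htm", "")
        filename = 'Poems=%s.txt' % page
        # Take the text inside the nested blockquotes, but drop anything
        # that sits inside an <i> element (the closing notes).
        lines = response.xpath(
            "/html/body/blockquote[2]/blockquote//text()[not(ancestor::i)]"
        ).extract()
        with open(filename, 'w', encoding='utf-8') as f:
            f.write('\n'.join(line.strip() for line in lines if line.strip()))
        self.log('Saved file %s' % filename)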
Python 2.6
I'm trying to parse my PDF files, and one way to do that is to transform them into HTML and extract the headings along with their paragraphs.
So I tried pdf2htmlEX, and it converted my PDF into HTML without disturbing the PDF's formatting. So far I was happy, and I converted the file with these commands:
>> import subprocess
>> path = "/home/administrator/Documents/pdf_file.pdf"
>> subprocess.call(["pdf2htmlEX" , path])
But when I opened the HTML file, it gave me a lot of unnecessary stuff along with my text, and more importantly, my text doesn't have heading tags, just a bunch of divs and spans.
>> f = open('/home/administrator/Documents/pdf_file.html','r')
>> f = f.read()
>> print f
I even tried to access it using BeautifulSoup
>> from bs4 import BeautifulSoup as bs
>> soup = BeautifulSoup(f)
>> soup.find('div', attrs={'class': 'site-content'}).h1
It didn't give me anything, because there were no such tags. I have also tried HTMLParser:
from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class myhtmlparser(HTMLParser):
    def __init__(self):
        self.reset()
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

    def handle_starttag(self, tag, attrs):
        self.NEWTAGS.append(tag)
        self.NEWATTRS.append(attrs)

    def handle_data(self, data):
        self.HTMLDATA.append(data)

    def clean(self):
        self.NEWTAGS = []
        self.NEWATTRS = []
        self.HTMLDATA = []

parser = myhtmlparser()
parser.feed(f)

# Extract data from parser
tags = parser.NEWTAGS
attrs = parser.NEWATTRS
data = parser.HTMLDATA

# Clean the parser
parser.clean()

# Print out our data
#print tags
print data
but none of them gives me what I need. All I want is to extract each heading along with its paragraphs from that HTML file, is that too much to ask... :p I have searched almost every site and read almost everything on this, but all my efforts end in vain. Please guide me on this...
If it's Python 3 and up, it should be:
outputFilename = outputDir + filename.replace(".pdf",".html")
subprocess.run(["pdf2htmlEX",file,outputFilename])
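As for finding the headings once the HTML exists: the question notes that the output has no heading tags, only divs and spans, so one exploratory approach is to group the text by the class attribute of the leaf <div> elements and see which class the headings ended up with. This is only a sketch, written for Python 3 with BeautifulSoup, and it makes no assumption about the particular class names pdf2htmlEX generated:

from collections import defaultdict
from bs4 import BeautifulSoup

with open('/home/administrator/Documents/pdf_file.html') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

by_class = defaultdict(list)
for div in soup.find_all('div'):
    if div.find('div'):
        continue                      # skip container divs, keep leaf text lines
    text = div.get_text(strip=True)
    if text:
        by_class[' '.join(div.get('class', []))].append(text)

for cls, samples in by_class.items():
    print(cls, '->', samples[:3])     # peek at a few lines per class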
This is a small widget that I am designing, meant to 'browse' while circumventing proxy settings. I have been told on Code Review that tempfile would be beneficial here, but I'm struggling to fit it into my program's current logic. Here is the code:
import urllib.request
import webbrowser
import os
import tempfile

location = os.path.dirname(os.path.abspath(__file__))
proxy_handler = urllib.request.ProxyHandler(proxies=None)
opener = urllib.request.build_opener(proxy_handler)

def navigate(query):
    response = opener.open(query)
    html = response.read()
    return html

def parse(data):
    start = str(data)[2:-1]
    lines = start.split('\\n')
    return lines

while True:
    url = input("Path: ")
    raw_data = navigate(url)
    content = parse(raw_data)
    with open('cache.html', 'w') as f:
        f.writelines(content)
    webbrowser.open_new_tab(os.path.join(location, 'cache.html'))
Hopefully someone who has worked with these modules before can help me. The reason that I want to use tempfile is that my program gets raw html, parses it and stores it in a file. This file is overwritten every time a new input comes in, and would ideally be deleted when the program stops running. Also, the file doesn't have to exist when the program initializes so it seems logical from that view also.
Since you are passing the name of the file to webbrowser.open_new_tab(), you should use a NamedTemporaryFile:
cache = tempfile.NamedTemporaryFile()
...
cache.seek(0)
cache.writelines(bytes(line, 'UTF-8') for line in content)
cache.seek(0)
webbrowser.open_new_tab('file://' + cache.name)
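Folding that into the question's loop could look roughly like the sketch below: one named temporary file is kept for the whole session, rewritten on each request, and removed automatically when it is closed. One assumption here is that the browser is allowed to read the file while it is still open, which is generally not the case on Windows:

import urllib.request
import webbrowser
import tempfile

opener = urllib.request.build_opener(urllib.request.ProxyHandler(proxies=None))

def navigate(query):
    return opener.open(query).read()

def parse(data):
    return str(data)[2:-1].split('\\n')

with tempfile.NamedTemporaryFile(suffix='.html') as cache:
    while True:
        url = input("Path: ")
        content = parse(navigate(url))
        cache.seek(0)
        cache.truncate()              # drop whatever the previous page left behind
        cache.writelines(bytes(line, 'UTF-8') for line in content)
        cache.flush()                 # make sure the browser sees the new content
        webbrowser.open_new_tab('file://' + cache.name)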