I have scrapy pulling data from a web page. One issue I've run across is that it pulls a lot of whitespace, so I've elected to use .strip(), as suggested by others. I've run into an issue, though:
if a.strip():
    print a
if b.strip():
    print b
Returns:
a1
b1
.
.
.
But this:
if a.strip():
    aList.append(a)
if b.strip():
    bList.append(b)
print aList, bList
Returns this:
a1
b1
I'm trying to simulate here the whitespace that I remove with .strip(), but you get the point. For whatever reason it adds the whitespace to the list even though I told it not to. I can even print the list inside the if statement and it shows correctly, but when I print outside the if statements it doesn't work as I intended.
Here is my entire code:
# coding: utf-8
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.exporter import CsvItemExporter
import re
import csv
import urlparse
from stockscrape.items import EPSItem
from itertools import izip

class epsScrape(BaseSpider):
    name = "eps"
    allowed_domains = ["investors.com"]
    ifile = open('test.txt', "r")
    reader = csv.reader(ifile)
    start_urls = []
    for row in ifile:
        url = row.replace("\n", "")
        if url == "symbol":
            continue
        else:
            start_urls.append("http://research.investors.com/quotes/nyse-" + url + ".htm")
    ifile.close()

    def parse(self, response):
        f = open("eps.txt", "a+")
        sel = HtmlXPathSelector(response)
        sites = sel.select("//div")
        # items = []
        for site in sites:
            symbolList = []
            epsList = []
            item = EPSItem()
            item['symbol'] = site.select("h2/span[contains(@id, 'qteSymb')]/text()").extract()
            item['eps'] = site.select("table/tbody/tr/td[contains(@class, 'rating')]/span/text()").extract()
            strSymb = str(item['symbol'])
            newSymb = strSymb.replace("[]", "").replace("[u'", "").replace("']", "")
            strEps = str(item['eps'])
            newEps = strEps.replace("[]", "").replace(" ", "").replace("[u'\\r\\n", "").replace("']", "")
            if newSymb.strip():
                symbolList.append(newSymb)
                # print symbolList
            if newEps.strip():
                epsList.append(newEps)
                # print epsList
        print symbolList, epsList
        for symb, eps in izip(symbolList, epsList):
            f.write("%s\t%s\n" % (symb, eps))
        f.close()
strip does not modify the string in-place. It returns a new string with the whitespace stripped.
>>> a = ' foo '
>>> b = a.strip()
>>> a
' foo '
>>> b
'foo'
I figured out what was causing the confusion. It's the location where I declared the variable/list. I was declaring it inside the for loop, so every iteration rewrote it, and an empty list or string gives the same outcome, False, in my if statement.
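For anyone hitting the same thing, here is a minimal sketch of the fix (the strings below are hypothetical stand-ins for the cleaned-up scrapy values): declare the lists once before the loop, and append the value that strip() returns rather than the original.
symbolList = []   # declared once, before the loop
epsList = []

for newSymb, newEps in [("IBM", "  42  "), ("   ", "\r\n")]:  # hypothetical cleaned strings
    symb = newSymb.strip()   # strip() returns a new string; the original is unchanged
    eps = newEps.strip()
    if symb:
        symbolList.append(symb)
    if eps:
        epsList.append(eps)

print symbolList, epsList   # -> ['IBM'] ['42']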
Related
I'm trying to deploy a spider to scrapinghub and cannot figure out how to tackle a data input problem. I need to read IDs from a csv and append them to my start urls as a list comprehension for the spider to crawl:
class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local scrapy method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    start_urls = ['http://www.example.com/PID=' + str(x) for x in csvdata]
The requirements file and pkgutil.get_data parts work, but I'm stuck on converting the data I/O into the list. What's the process for converting the data call into the list comprehension?
EDIT:
Thanks! This got me 90% of the way there!
class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local scrapy method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata)
    raw = csv.reader(csvio)
    # TODO: update code to get exact value from raw
    start_urls = ['http://www.example.com/PID=' + str(x[0]) for x in raw]
The str(x) needed to become str(x[0]) as a quick fix, since the loop was otherwise URL-encoding the list's square brackets, which broke the links:
str(x) resulted in "http://www.example.com/PID=%5B'0001'%5D"
but str(x[0]) gets it out of the list brackets: "http://www.example.com/PID='0001'"
class exampleSpider(scrapy.Spider):
    name = "exampleSpider"

    # local scrapy method to extract data
    # PID = pd.read_csv('resources/PID_list.csv')

    # scrapinghub method
    csvdata = pkgutil.get_data("exampleSpider", "resources/PID_list.csv")
    csvio = StringIO(csvdata)
    raw = csv.reader(csvio)
    # TODO: update code to get exact value from raw
    start_urls = ['http://www.example.com/PID=' + str(x) for x in raw]
You can use StringIO to turn a string into something with a read() method, which csv.reader should be able to handle. I hope this will help you :)
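For reference, a tiny self-contained sketch of that combination (the CSV content here is made up, standing in for what pkgutil.get_data would return):
import csv
from StringIO import StringIO  # Python 2; on Python 3 use io.StringIO

csvdata = "0001\n0002\n0003\n"       # hypothetical stand-in for pkgutil.get_data(...)
raw = csv.reader(StringIO(csvdata))  # StringIO gives csv.reader a file-like object
start_urls = ['http://www.example.com/PID=' + str(x[0]) for x in raw]
print start_urls
# ['http://www.example.com/PID=0001', 'http://www.example.com/PID=0002', 'http://www.example.com/PID=0003']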
I've been working on a Python script that will scrape certain webpages.
The beginning of the script looks like this:
# -*- coding: UTF-8 -*-
import urllib2
import re

database = ''
contents = open('contents.html', 'r')
for line in contents:
    entry = ''
    f = re.search('(?<=a href=")(.+?)(?=\.htm)', line)
    if f:
        entry = f.group(0)
    page = urllib2.urlopen('https://indo-european.info/pokorny-etymological-dictionary/' + entry + '.htm').read()
    m = re.search('English meaning( )+\s+(.+?)</font>', page)
    if m:
        title = m.group(2)
    else:
        title = 'N/A'
This accesses each page and grabs a title from it. Then I have a number of blocks of code that test whether certain text is present in each page; here is an example of one:
abg = re.findall(r'\babg\b', page)
if len(abg) == 0:
    abg = 'N'
else:
    abg = 'Y'
Then, finally, still in the for loop, I add this information to the variable database:
database += '\n' + str('<F>') + str(entry) + '<TITLE="' + str(title) + '"><FQ="N"><SQ="N"><ABG="' + str(abg) + '"></F>'
Note that I have used str() for each variable because I was getting a "can't concatenate strings and lists" error for some reason.
Once the for loop is completed, I write the database variable to a file:
f = open('database.txt', 'wb')
f.write(database)
f.close()
When I run this in the command line, it times out or never completes running. Any ideas as to what might be causing the issue?
EDIT: I fixed it. It seems the program was being slowed down by accumulating every line's result in the database variable across the whole loop. All I had to do to fix the issue was move the write call inside the for loop.
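Roughly, the change looks like this (a sketch only, with the scraping logic from above elided):
f = open('database.txt', 'wb')
for line in contents:
    # ... same regex / urllib2 logic as above, producing entry, title and abg ...
    record = '\n<F>' + str(entry) + '<TITLE="' + str(title) + '"><FQ="N"><SQ="N"><ABG="' + str(abg) + '"></F>'
    f.write(record)   # write each record as it is produced instead of accumulating it
f.close()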
Basically what I'm trying to do is make a program in Python which takes a URL, copies the source, pulls all comments out, and presents them to the user.
import urllib2
import html2text
import PullsCommentsOut.pullscommentsout

url = raw_input('Please input URL with the text you want to analyze: ')
page = urllib2.urlopen(url)
html_content = page.read().decode('utf8')
rendered_content = html2text.html2text(html_content).encode('ascii', 'ignore')

f = open('file_text.txt', 'wb')
f.write(rendered_content)
f.close()

result = PullsCommentsOut.pullscommentsout(html_content)
print result
And my second file, 'PullsCommentsOut'
import re

def pullscommentsout():
    def comment_remover(text):
        def replacer(match):
            s = match.group(0)
            if s.startswith('/'):
                print s
                return " "  # note: a space and not an empty string
            else:
                return s
        pattern = re.compile(
            r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
            re.DOTALL | re.MULTILINE
        )
        return re.sub(pattern, replacer, text)

    fd = open("test.c", "r")
    buf = fd.read()
    comment_remover(buf)
For the life of me I can't figure out why Python thinks I'm not importing the proper module. It doesn't make sense.
I've made a parser written in Python which is doing its job perfectly, except for some duplicates coming through. Moreover, when I open the csv file I can see that every result is surrounded by square brackets. Is there any workaround to get rid of the duplicate data and the square brackets on the fly? Here is what I tried:
import csv
import requests
from lxml import html

def parsingdata(mpg):
    data = set()
    outfile = open('RealYP.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    pg = 1
    while pg <= mpg:
        url = "https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page=" + str(pg)
        page = requests.get(url)
        tree = html.fromstring(page.text)
        titles = tree.xpath('//div[@class="info"]')
        items = []
        for title in titles:
            comb = []
            Name = title.xpath('.//span[@itemprop="name"]/text()')
            Address = title.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
            Phone = title.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
            try:
                comb.append(Name[0])
                comb.append(Address[0])
                comb.append(Phone[0])
            except:
                continue
            items.append(comb)
        pg += 1
        for item in items:
            writer.writerow(item)

parsingdata(3)
Now it is working fine.
Edit: the rectified portion is taken from bjpreisler's answer.
This script removes dups when I am working with a .csv file. Check if this works for you :)
with open(file_out, 'w') as f_out, open(file_in, 'r') as f_in:
    # write rows from in-file to out-file until all the data is written
    checkDups = set()  # set for removing duplicates
    for line in f_in:
        if line in checkDups:
            continue  # skip duplicate
        checkDups.add(line)
        f_out.write(line)
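For clarity, file_in and file_out above are just placeholder names for filenames; for example:
file_in = 'RealYP.csv'           # csv that may contain duplicate rows
file_out = 'RealYP_deduped.csv'  # cleaned copy written by the loop above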
You are currently writing a list (items) to the csv, which is why it is in brackets. To avoid this, use another for loop that could look like this:
for title in titles:
    comb = []
    Name = title.xpath('.//span[@itemprop="name"]/text()')
    Address = title.xpath('.//span[@itemprop="streetAddress" and @class="street-address"]/text()')
    Phone = title.xpath('.//div[@itemprop="telephone" and @class="phones phone primary"]/text()')
    if Name:
        Name = Name[0]
    if Address:
        Address = Address[0]
    if Phone:
        Phone = Phone[0]
    comb.append(Name)
    comb.append(Address)
    comb.append(Phone)
    print comb
    items.append(comb)
pg += 1
for item in items:
    writer.writerow(item)

parsingdata(3)
This should write each item separately to your csv. It turns out the items you were appending to comb were lists themselves, so this extracts them.
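The underlying detail, shown with a tiny made-up snippet: an xpath('.../text()') call returns a list of strings, so indexing with [0] (or the if checks above) is what pulls out the bare value:
from lxml import html

title = html.fromstring('<div><span itemprop="name">Coffee Shop</span></div>')
print title.xpath('.//span[@itemprop="name"]/text()')     # prints ['Coffee Shop'] -- a list
print title.xpath('.//span[@itemprop="name"]/text()')[0]  # prints Coffee Shop -- the bare string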
And the concise version of this scraper I found lately is:
import csv
import requests
from lxml import html

url = "https://www.yellowpages.com/search?search_terms=Coffee%20Shops&geo_location_terms=Los%20Angeles%2C%20CA&page={0}"

def parsingdata(link):
    outfile = open('YellowPage.csv', 'w', newline='')
    writer = csv.writer(outfile)
    writer.writerow(["Name", "Address", "Phone"])
    for page_link in [link.format(i) for i in range(1, 4)]:
        page = requests.get(page_link).text
        tree = html.fromstring(page)
        for title in tree.xpath('//div[@class="info"]'):
            Name = title.findtext('.//span[@itemprop="name"]')
            Address = title.findtext('.//span[@itemprop="streetAddress"]')
            Phone = title.findtext('.//div[@itemprop="telephone"]')
            print([Name, Address, Phone])
            writer.writerow([Name, Address, Phone])

parsingdata(url)
I have a question about parsing HTML tags with Python.
My code looks like:
#!/usr/bin/python
# -*- coding: utf-8 -*-
from lxml import html
import requests
import urllib2
import sys
import re
import time
import urllib
import datetime

def get_web():
    try:
        input_sat = open('rtc.xml', 'w')
        godina = datetime.date.today().strftime("%Y")
        print godina
        mjesec = datetime.date.today().strftime("%m")
        print mjesec
        for x in range(32):
            if x < 1:
                x = x + 1
            var = x
            url = 'http://www.rts.rs/page/tv/sr/broadcast/20/RTS+1.html?month={}&year={}&day={}&type=0'.format(mjesec, godina, var)
            page = requests.get(url)
            tree = html.fromstring(page.text)
            a = tree.xpath('//div[@id="center"]/h1/text()')  # date
            b = tree.xpath('//div[@class="ProgramTime"]/text()')  # time
            c = tree.xpath('//div[@class="ProgramName"]/text()')  # programme name
            e = tree.xpath('//div[@class="ProgramName"]/a[@class="recnik"]/text()')
            for line in zip(a, b, c, e):
                var = line[0]
                print >> input_sat, line + '\n'
    except:
        pass

get_web()
The script works fine and gets tags from a URL, but how can I write them into a file for processing?
When I run my code with a for loop, it doesn't work. I don't know where the problem is.
I rewrote my code, but it won't output what's on the page to a file.
As I understand it, your print usage is incorrect. You have to use the write() method of the file handle, and also encode the text to UTF-8:
for line in zip(a, b, c, e):
    var = line[0]
    input_sat.write(line[0].encode('utf-8') + '\n')
It yields:
Programska šema - sreda, 01. jan 2014
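If the other columns are wanted as well, a possible variation on the same idea (still assuming the a, b, c, e lists from the question) is to encode every field and join the columns with tabs before writing:
for line in zip(a, b, c, e):
    # encode each unicode field and separate the columns with tabs
    input_sat.write('\t'.join(part.encode('utf-8') for part in line) + '\n')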