I am using praw to scrape a subreddit, "RocketLeagueExchange".
I want to check whether a Reddit title contains any of the ignore words. If not, append its title and URL to the list.
When I remove the check for ignore words in the title, it works.
I'd also like to write
if not any(ignorewords in submission.title.lower for ignorewords in submission.title.lower):
but I get an error when using .lower
File "main.py", line 19, in <module>
if not any(ignorewords in submission.title.lower for ignorewords in submission.title.lower):
TypeError: 'builtin_function_or_method' object is not iterable
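The TypeError comes from referencing .lower without calling it: s.lower is a method object, and the generator expression tries to iterate over it. A minimal reproduction outside PRAW:

```python
s = "Hello World"

print(s.lower)    # a bound method object, not a string
print(s.lower())  # the lower-cased string

# Iterating over the method object (as the generator expression does)
# raises the same TypeError as in the traceback above.
try:
    any(x for x in s.lower)
except TypeError as exc:
    print(exc)  # 'builtin_function_or_method' object is not iterable
```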
What I tried:
platform = "Xbox"
item = "tw zomba"
ignorewords = ["pricecheck", "price check", "discussion", "giveaway", "store"]
reddittrades = []

for submission in reddit.subreddit("RocketLeagueExchange").search("{} {}".format(platform, item), limit=10):
    if not any(ignorewords in submission.title for ignorewords in submission.title):
        reddittrades.append(submission.title + submission.url)

print(reddittrades)
I get [] as the output, even though there are clearly many results on Reddit.
I think what you are trying to achieve is the following:
platform = "Xbox"
item = "tw zomba"
ignorewords = ["pricecheck", "price check", "discussion", "giveaway", "store"]
reddittrades = []

for submission in reddit.subreddit("RocketLeagueExchange").search("{} {}".format(platform, item), limit=10):
    # iterate over the ignore words, not over the characters of the title
    if not any(word in submission.title.lower() for word in ignorewords):
        reddittrades.append(submission.title + submission.url)

print(reddittrades)
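The filter can be sanity-checked without PRAW; the titles below are hypothetical stand-ins for submission.title values:

```python
# Hypothetical submission titles (stand-ins for submission.title)
titles = [
    "[XBOX] [H] TW Zomba [W] Credits",
    "Price Check on TW Zomba",
    "Weekly Discussion Thread",
]
ignorewords = ["pricecheck", "price check", "discussion", "giveaway", "store"]

# Keep a title only when none of the ignore words occur in its lower-cased form
kept = [t for t in titles
        if not any(word in t.lower() for word in ignorewords)]

print(kept)  # only the trade offer survives
```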
I'm trying to scrape some data off TripAdvisor.
I'm interested in getting the "Price Range / Cuisine & Meals" of restaurants.
So I use the following XPath to extract each of these 3 lines in the same class:
response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').extract()[1]
I'm doing the test directly in the scrapy shell and it works fine:
scrapy shell https://www.tripadvisor.com/Restaurant_Review-g187514-d15364769-Reviews-La_Gaditana_Castellana-Madrid.html
But when I integrate it into my script, I get the following error:
Traceback (most recent call last):
File "/usr/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Scrapy_TripAdvisor_Restaurant-master/tripadvisor_las_vegas/tripadvisor_las_vegas/spiders/res_las_vegas.py", line 64, in parse_listing
(response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
File "/usr/lib/python3.6/site-packages/parsel/selector.py", line 61, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
I'll paste part of my code and explain it below:
# extract restaurant cuisine
row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])

if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (row_cuisine_card == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None
On TripAdvisor there are two different types of restaurant pages, with two different formats.
The first has an overviewcard class, and the second a card class.
So I want to check if the first (overviewcard) is present; if not, try the second (card); and if neither exists, set the value to None.
But it looks like Python executes both... and since the second one doesn't exist on the page, the script stops.
Could it be an indentation error?
Thanks for your help
Regards
Your second selector (row_cuisine_card) fails because the element does not exist on the page. When you then try to access [1] on the result, it throws an error because the result list is empty.
Assuming you really want item 1, try this:
row_cuisine_overviewcard = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1].get()
# Here we get all the values, even if the result is empty.
row_cuisine_card = \
    response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()

if row_cuisine_overviewcard == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
# Here we first check that the result has more than one item, and only then check the value.
elif len(row_cuisine_card) > 1 and row_cuisine_card[1] == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None
You should apply the same kind of safety checking whenever you try to get a specific index from a selector. In other words, make sure you have a value before you access it.
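That safety check can live in a small helper; safe_get is a hypothetical name, not part of scrapy:

```python
def safe_get(items, index, default=None):
    """Return items[index], or default when the list is too short."""
    return items[index] if len(items) > index else default

# Works with any list, including an empty .getall() result
print(safe_get(["OVERVIEW", "CUISINES"], 1))  # CUISINES
print(safe_get([], 1))                        # None
```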
Your problem is already in this line:
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
You are trying to extract a value from the website that may not exist. In other words, if
response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')
returns no elements or only one, then you cannot access the second element of the returned list (which you try to access with the appended [1]).
I would recommend storing the values that you extract from the website into a local variable first in order to then check whether or not a value that you want has been found. My guess is that the page it breaks on does not have the information you want.
This could roughly look like the following code:
# extract restaurant cuisine
cuisine = None
row_cuisine_overviewcard = None
row_cuisine_card = None

cuisine_overviewcard_sections = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2:
    row_cuisine_overviewcard = cuisine_overviewcard_sections[1]

cuisine_card_sections = \
    response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
if len(cuisine_card_sections) >= 2:
    row_cuisine_card = cuisine_card_sections[1]

if row_cuisine_overviewcard == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif row_cuisine_card == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
Since you only need part of the information, and the first XPath check may already return the correct answer, the code can be beautified a bit:
# extract restaurant cuisine
cuisine = None
cuisine_overviewcard_sections = \
    response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').getall()
if len(cuisine_overviewcard_sections) >= 2 and cuisine_overviewcard_sections[1] == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
else:
    cuisine_card_sections = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall()
    if len(cuisine_card_sections) >= 2 and cuisine_card_sections[1] == "CUISINES":
        cuisine = \
            response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
This way you only do a (potentially expensive) XPath search when it actually is necessary.
I am trying to retrieve tags using boto3 but I constantly run into an "IndexError: list index out of range" error.
My Code:
rds = boto3.client('rds', region_name='us-east-1')
rdsinstances = rds.describe_db_instances()

for rdsins in rdsinstances['DBInstances']:
    rdsname = rdsins['DBInstanceIdentifier']
    arn = "arn:aws:rds:%s:%s:db:%s" % (reg, account_id, rdsname)
    rdstags = rds.list_tags_for_resource(ResourceName=arn)
    if 'MyTag' in rdstags['TagList'][0]['Key']:
        print "Tags exist and the value is:%s" % rdstags['TagList'][0]['Value']
The error that I have is:
Traceback (most recent call last):
File "rdstags.py", line 49, in <module>
if 'MyTag' in rdstags['TagList'][0]['Key']:
IndexError: list index out of range
I also tried using a for loop with an explicit range; that didn't seem to work either.
for i in range(0, 10):
    print rdstags['TagList'][i]['Key']
Any help is appreciated. Thanks!
You should iterate over the list of tags first and compare 'MyTag' with each item independently, something like this:
if 'MyTag' in [tag['Key'] for tag in rdstags['TagList']]:
    print "Tags exist and.........."
or better:
for tag in rdstags['TagList']:
    if tag['Key'] == 'MyTag':
        print "......"
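Both patterns can be exercised without AWS access; the TagList below is an illustrative stand-in for what rds.list_tags_for_resource(...)['TagList'] returns:

```python
# Illustrative TagList, shaped like list_tags_for_resource(...)['TagList']
tag_list = [
    {"Key": "Environment", "Value": "staging"},
    {"Key": "MyTag", "Value": "hello"},
]

# Membership test over all keys at once
if 'MyTag' in [tag['Key'] for tag in tag_list]:
    print("Tags exist")

# Or loop and compare each key individually
for tag in tag_list:
    if tag['Key'] == 'MyTag':
        print("Tags exist and the value is: %s" % tag['Value'])
```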
I use a have_tag function to find a tag in any Boto3 module.
client = boto3.client('rds')
instances = client.describe_db_instances()['DBInstances']

if instances:
    for i in instances:
        arn = i['DBInstanceArn']
        # arn:aws:rds:ap-southeast-1::db:mydbrafalmarguzewicz
        tags = client.list_tags_for_resource(ResourceName=arn)['TagList']
        print(have_tag(tags, 'MyTag'))
        print(tags)
The function that searches the tags:
def have_tag(dictionary: dict, tag_key: str):
    """Search for a tag key."""
    tags = (tag_key.capitalize(), tag_key.lower())
    if dictionary is not None:
        dict_with_owner_key = [tag for tag in dictionary if tag["Key"] in tags]
        if dict_with_owner_key:
            return dict_with_owner_key[0]['Value']
    return None
I'm trying to build a dictionary of keywords and put it into a scrapy item.
'post_keywords':{1: 'midwest', 2: 'i-70',}
The point is that this will all go inside a json object later on down the road. I've tried initializing a new blank dictionary first, but that doesn't work.
Pipeline code:
tag_count = 0
for word, tag in blob.tags:
    if tag == 'NN':
        tag_count = tag_count + 1
        nouns.append(word.lemmatize())

keyword_dict = dict()
key = 0
for item in random.sample(nouns, tag_count):
    word = Word(item)
    key = key + 1
    keyword_dict[key] = word

item['post_keywords'] = keyword_dict
Item:
post_keywords = scrapy.Field()
Output:
Traceback (most recent call last):
File "B:\Mega Sync\Programming\job_scrape\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "B:\Mega Sync\Programming\job_scrape\cl_tech\cl_tech\pipelines.py", line 215, in process_item
item['post_noun_phrases'] = noun_phrase_dict
TypeError: 'unicode' object does not support item assignment
It SEEMS like pipelines behave weirdly, as if they refuse to run all the code in the pipeline unless every item assignment checks out, which makes it look like my initialized dictionaries are never created.
Thanks to MarkTolonen for the help.
My mistake was using the variable name 'item' for two different things.
This works:
for thing in random.sample(nouns, tag_count):
    word = Word(thing)
    key = key + 1
    keyword_dict[key] = word

item['post_keywords'] = keyword_dict
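The root cause can be reproduced without scrapy: reusing item as the loop variable rebinds it to a string, so the later item[...] assignment fails. The plain dict below is a stand-in for the scrapy Item:

```python
item = {}  # stands in for the scrapy Item
nouns = ["midwest", "i-70"]

# Buggy version: the loop variable rebinds `item` to the last string
for item in nouns:
    pass
try:
    item['post_keywords'] = {1: 'midwest'}
except TypeError as exc:
    print(exc)  # str does not support item assignment

# Fixed version: a distinct loop name leaves `item` intact
item = {}
keyword_dict = {}
for key, thing in enumerate(nouns, start=1):
    keyword_dict[key] = thing
item['post_keywords'] = keyword_dict
print(item)  # {'post_keywords': {1: 'midwest', 2: 'i-70'}}
```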
I am trying to get the below program working. It is supposed to find email addresses on a website, but it is breaking. I suspect the problem is with initializing result = [] inside the crawl function. Below is the code:
# -*- coding: utf-8 -*-
import requests
import re
import urlparse

# In this example we're trying to collect e-mail addresses from a website
# Basic e-mail regexp:
# letter/number/dot/comma @ letter/number/dot/comma . letter/number
email_re = re.compile(r'([\w\.,]+@[\w\.,]+\.\w+)')

# HTML <a> regexp
# Matches href="" attribute
link_re = re.compile(r'href="(.*?)"')

def crawl(url, maxlevel):
    result = []

    # Limit the recursion, we're not downloading the whole Internet
    if maxlevel == 0:
        return

    # Get the webpage
    req = requests.get(url)

    # Check if successful
    if req.status_code != 200:
        return []

    # Find and follow all the links
    links = link_re.findall(req.text)
    for link in links:
        # Get an absolute URL for a link
        link = urlparse.urljoin(url, link)
        result += crawl(link, maxlevel - 1)

    # Find all emails on current page
    result += email_re.findall(req.text)

    return result

emails = crawl('http://ccs.neu.edu', 2)

print "Scraped e-mail addresses:"
for e in emails:
    print e
The error I get is below:
C:\Python27\python.exe "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py"
Traceback (most recent call last):
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 41, in <module>
emails = crawl('http://ccs.neu.edu', 2)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
File "C:/Users/Sagar Shah/PycharmProjects/crawler/webcrawler.py", line 35, in crawl
result += crawl(link, maxlevel - 1)
TypeError: 'NoneType' object is not iterable
Process finished with exit code 1
Any suggestions will help. Thanks!
The problem is this:
if maxlevel == 0:
    return
Currently it returns None when maxlevel == 0. You can't concatenate a list with a None object.
You need to return an empty list [] to be consistent.
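The difference is easy to demonstrate without network access; the link graph below is hypothetical, standing in for real HTTP responses:

```python
# Hypothetical site: url -> (outgoing links, e-mail addresses on the page)
pages = {
    "a": (["b"], ["alice@example.com"]),
    "b": ([], ["bob@example.com"]),
}

def crawl(url, maxlevel):
    # The base case returns [] rather than None, so `result += crawl(...)`
    # always concatenates two lists.
    if maxlevel == 0 or url not in pages:
        return []
    links, emails = pages[url]
    result = []
    for link in links:
        result += crawl(link, maxlevel - 1)
    result += emails
    return result

print(crawl("a", 2))  # ['bob@example.com', 'alice@example.com']
print(crawl("a", 0))  # []
```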
I'm trying to tally the number of instances of each top-level domain in a file containing 800K+ top-level-domain strings that I scraped from URLs. In the code below, when I used "if mtlds in ntld:" the results appeared to be correct, but on inspection the "co" and "com", "ca" and "cat" counts are wrong. If I use == or "is" instead, I don't get any matches at all, but an error:
Traceback (most recent call last):
File "checktlds4malware.py", line 111, in
mtlds_line = mtlds.readline()
AttributeError: 'str' object has no attribute 'readline'
tld_file = open(sys.argv[1], 'r')
tld_line = tld_file.readline()

while tld_line:
    #print(tld_line)
    tld_line = tld_line.strip()
    columns = tld_line.split()
    ntld = columns[0]  # get the ICANN TLD
    ntld = ntld.lower()

    mtlds = open('malwaretlds.txt', 'r')
    mtlds_line = mtlds.readline()
    while mtlds_line:
        print(mtlds_line)
        mtlds_line = mtlds_line.strip()
        columns = mtlds_line.split()
        mtlds = columns[0]
        mtlds = mtlds.lower()
        #raw_input()
        # I don't get the error when using "in" not ==
        # but the comparison is not correct.
        if mtlds_line == ntld:
            m_count += 1
            print 'ntld and mtld match: Malware domain count for ', ntld, m_count
        mtlds_line = mtlds.readline()

    print 'Final malware domain count for ', ntld, m_count
This is because within your inner while loop you reassign mtlds to a string. Once you then attempt to call the readline() method on it, you get the error (which is pretty self-explanatory). You have to remember that only outside the scope of your inner while loop does mtlds point to a file.
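One way out is to give the file handle and the extracted string different names; a sketch with illustrative in-memory lines in place of the two files:

```python
# Illustrative lines standing in for malwaretlds.txt
malware_lines = ["com 123", "net 45", "com 6"]

ntld = "com"
m_count = 0
for line in malware_lines:
    mtld = line.split()[0].lower()  # a new name, so the file handle is never clobbered
    if mtld == ntld:
        m_count += 1

print('Final malware domain count for', ntld, m_count)  # 2
```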