Scrapy > IndexError: list index out of range - python

I'm trying to scrape some data from TripAdvisor.
I want to get the "Price Range / Cuisine & Meals" of restaurants.
So I use the following XPath to extract each of these 3 lines, which share the same class:
response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()').extract()[1]
I tested it directly in the Scrapy shell and it works fine:
scrapy shell https://www.tripadvisor.com/Restaurant_Review-g187514-d15364769-Reviews-La_Gaditana_Castellana-Madrid.html
But when I integrate it into my script, I get the following error:
Traceback (most recent call last):
File "/usr/lib64/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/offsite.py", line 29, in process_spider_output
for x in result:
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "/usr/lib64/python3.6/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "/root/Scrapy_TripAdvisor_Restaurant-master/tripadvisor_las_vegas/tripadvisor_las_vegas/spiders/res_las_vegas.py", line 64, in parse_listing
(response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
File "/usr/lib/python3.6/site-packages/parsel/selector.py", line 61, in __getitem__
o = super(SelectorList, self).__getitem__(pos)
IndexError: list index out of range
Here is part of my code, which I explain below:
# extract restaurant cuisine
row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (row_cuisine_card == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None
TripAdvisor restaurant pages come in 2 different types, with 2 different formats:
the first uses an "overviewcard" class, and the second uses a "card" class.
So I want to check whether the first (overviewcard) is present; if not, try the second (card); and if neither is present, set the value to None.
But it looks like Python executes both, and as the second one doesn't exist on the page, the script stops.
Could it be an indentation error?
Thanks for your help
Regards

Your second selector (row_cuisine_card) fails because the element does not exist on the page. When you then try to access [1] on the result, it raises an IndexError because the result list is empty.
Assuming you really want item 1, try this:
row_cuisine_overviewcard = \
    (response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')[1])
# Here we get all the values, even if the result is empty.
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()').getall())
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
# Here we check first if that result has more than 1 item, and then we check the value.
elif (len(row_cuisine_card) > 1 and row_cuisine_card[1] == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
else:
    cuisine = None
You should apply the same kind of safety checking whenever you try to get a specific index from a selector. In other words, make sure you have a value before you access it.
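One way to apply that check consistently is a small helper that returns a default instead of raising. The sketch below is only an illustration: `item_at` is a hypothetical name, not part of Scrapy or parsel, and the sample lists stand in for what `response.xpath(...).getall()` would return. Working on the `.getall()` result also means the values are plain strings, so a comparison such as `== "CUISINES"` compares text rather than a Selector object.

```python
def item_at(values, index, default=None):
    """Return values[index], or default when the list is too short."""
    if 0 <= index < len(values):
        return values[index]
    return default


# Stand-ins for response.xpath('...').getall() on two kinds of pages.
titles_present = ["PRICE RANGE", "CUISINES", "MEALS"]
titles_missing = []

print(item_at(titles_present, 1))  # CUISINES
print(item_at(titles_missing, 1))  # None
```

In a spider, the same call would wrap the list of strings from `.getall()`, so a page without the element simply yields None.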

Your problem already occurs in the check on this line:
row_cuisine_card = \
    (response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')[1])
You are trying to extract a value from the website that may not exist. In other words, if
response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')
returns no element or only one element, then you cannot access the second element in the returned list (which you want to access with the appended [1]).
I would recommend storing the values that you extract from the website into a local variable first in order to then check whether or not a value that you want has been found. My guess is that the page it breaks on does not have the information you want.
This could roughly look like the following code:
# extract restaurant cuisine
cuisine = None
# default to None so the comparisons below are safe when a section is missing
row_cuisine_overviewcard = None
row_cuisine_card = None
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')
if len(cuisine_overviewcard_sections) >= 2:
    row_cuisine_overviewcard = cuisine_overviewcard_sections[1]
cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')
if len(cuisine_card_sections) >= 2:
    row_cuisine_card = cuisine_card_sections[1]
if (row_cuisine_overviewcard == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
elif (row_cuisine_card == "CUISINES"):
    cuisine = \
        response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
Since you don't need the second lookup when the first XPath check already returns the answer, the code can be tidied up a bit:
# extract restaurant cuisine
cuisine = None
cuisine_overviewcard_sections = response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__categoryTitle--14zKt"]/text()')
if len(cuisine_overviewcard_sections) >= 2 and cuisine_overviewcard_sections[1] == "CUISINES":
    cuisine = \
        response.xpath('//div[@class="restaurants-detail-overview-cards-DetailsSectionOverviewCard__tagText--1XLfi"]/text()')[1]
else:
    cuisine_card_sections = response.xpath('//div[@class="restaurants-details-card-TagCategories__categoryTitle--o3o2I"]/text()')
    if len(cuisine_card_sections) >= 2 and cuisine_card_sections[1] == "CUISINES":
        cuisine = \
            response.xpath('//div[@class="restaurants-details-card-TagCategories__tagText--2170b"]/text()')[1]
This way you only do a (potentially expensive) XPath search when it actually is necessary.

Related

How to fill a column of type 'multi-select' in notion-py(Notion API)?

I am trying to create a Telegram bot that will create notes in Notion. For this I use:
notion-py
pyTelegramBotAPI
I connected my Notion account by adding token_v2. After receiving the data for the note that I want to save, I save it to Notion like this:
def make_notion_row():
    collection_view = client.get_collection_view(list_url[temporary_category])  # take collection
    print(temporary_category)
    print(temporary_name)
    print(temporary_link)
    print(temporary_subcategory)
    print(temporary_tag)
    row = collection_view.collection.add_row()  # make row
    row.ssylka = temporary_link  # this is link
    row.nazvanie_zametki = temporary_name  # this is name
    if temporary_category == 0:  # this is category, where do I want to save the note
        row.stil = temporary_subcategory  # this is subcategory
        tags = temporary_tag.split(',')  # temporary_tag is text that has many tags separated by commas. I want to get these tags as an array
        for tag_one in tags:
            add_new_multi_select_value("Теги", tag_one)  # "Теги" is the "Tags" column in Russian. In this situation, tag_one takes on the following values: 'my_hero_academia', 'midoria'
    else:
        row.kategoria = temporary_subcategory
This script works, but the problem is filling in the Tags column, which is of type multi-select.
Since the notion-py README says nothing about filling in a multi-select column,
I used bkiac's function: https://github.com/jamalex/notion-py/issues/51
Here is the function, slightly modified by me:
# imports the function relies on
from random import choice
from uuid import uuid1
art_tags = ['ryuko_matoi', 'kill_la_kill']
def add_new_multi_select_value(prop, value, style=None):
    global temporary_prop_schema
    if style is None:
        style = choice(art_tags)
    collection_schema = collection_view.collection.get(["schema"])
    prop_schema = next(
        (v for k, v in collection_schema.items() if v["name"] == prop), None
    )
    if not prop_schema:
        raise ValueError(
            f'"{prop}" property does not exist on the collection!'
        )
    if prop_schema["type"] != "multi_select":
        raise ValueError(f'"{prop}" is not a multi select property!')
    dupe = next(
        (o for o in prop_schema["options"] if o["value"] == value), None
    )
    if dupe:
        raise ValueError(f'"{value}" already exists in the schema!')
    temporary_prop_schema = prop_schema
    prop_schema["options"].append(
        {"id": str(uuid1()), "value": value, "style": style}
    )
    collection.set("schema", collection_schema)
But it turned out that this function does not work, and gives the following error:
add_new_multi_select_value("Теги","my_hero_academia")
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    add_new_multi_select_value("Теги","my_hero_academia")
  File "C:\Users\laere\OneDrive\Documents\Programming\Other\notion-bot\program\notionbot\test.py", line 53, in add_new_multi_select_value
    collection.set("schema", collection_schema)
  File "C:\Users\laere\AppData\Local\Programs\Python\Python39-32\lib\site-packages\notion\records.py", line 115, in set
    self._client.submit_transaction(
  File "C:\Users\laere\AppData\Local\Programs\Python\Python39-32\lib\site-packages\notion\client.py", line 290, in submit_transaction
    self.post("submitTransaction", data)
  File "C:\Users\laere\AppData\Local\Programs\Python\Python39-32\lib\site-packages\notion\client.py", line 260, in post
    raise HTTPError(
requests.exceptions.HTTPError: Unsaved transactions: Not allowed to edit column: schema
This is an image of my table: link
This is my Telegram chat with the bot: link
Honestly, I don't know how to solve this problem. The question is: how do I fill a column of type 'multi-select'?
I solved this problem using this command:
row.set_property("Категория", temporary_subcategory)
Do not worry if you get an "options ..." error: this can be solved by adding settings for the 'multi-select' field.

How to check if word is not in list

I am using praw to scrape the subreddit "RocketLeagueExchange".
I want to check if the Reddit title has any of the ignorewords. If not, append its title and URL to the list.
When I remove the part that checks whether any of the ignorewords are in the title, it works.
I'd also like to use
if not any(ignorewords in submission.title.lower for ignorewords in submission.title.lower):
but I get an error when using .lower:
File "main.py", line 19, in <module>
if not any(ignorewords in submission.title.lower for ignorewords in submission.title.lower):
TypeError: 'builtin_function_or_method' object is not iterable
What I tried:
platform = "Xbox"
item = "tw zomba"
ignorewords= ["pricecheck","price check","discussion","giveaway","store"]
reddittrades = []
for submission in reddit.subreddit("RocketLeagueExchange").search("{} {}".format(platform, item), limit=10):
    if not any(ignorewords in submission.title for ignorewords in submission.title):
        reddittrades.append(submission.title + submission.url)
print(reddittrades)
I get [] as the output, when there are clearly many results on Reddit.
I think what you are trying to achieve is the following:
platform = "Xbox"
item = "tw zomba"
ignorewords= ["pricecheck","price check","discussion","giveaway","store"]
reddittrades = []
for submission in reddit.subreddit("RocketLeagueExchange").search("{} {}".format(platform, item), limit=10):
    title = submission.title.lower()  # call lower() with parentheses to get a string
    # keep the submission only if none of the ignore words occurs in its title
    if not any(word in title for word in ignorewords):
        reddittrades.append(submission.title + submission.url)
print(reddittrades)
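The TypeError in the question comes from writing `submission.title.lower` without parentheses: that is the method object itself, not the lowercased string, so it cannot be iterated or searched. As a minimal sketch of the intended check, here is the filter as a standalone function on plain strings; the name `has_ignored_word` and the sample titles are made up for illustration.

```python
def has_ignored_word(title, ignorewords):
    """Return True if any ignore word occurs in the title, case-insensitively."""
    lowered = title.lower()  # note the parentheses: lower() returns a new string
    return any(word in lowered for word in ignorewords)


ignorewords = ["pricecheck", "price check", "discussion", "giveaway", "store"]

# Hypothetical titles standing in for submission.title values.
titles = [
    "[Xbox] [H] TW Zomba [W] Credits",
    "[Discussion] Is TW Zomba worth it?",
    "[Xbox] Price Check on TW Zomba",
]

kept = [t for t in titles if not has_ignored_word(t, ignorewords)]
print(kept)
```

Note that this is a substring test, so an ignore word such as "store" would also match a title containing "restored"; splitting the title into words would be stricter, at the cost of missing multi-word phrases like "price check".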

Pandoc Filter via Panflute not Working as Expected

Problem
For a Markdown document I want to filter out all sections whose header titles are not in the list to_keep. A section consists of a header and the body until the next section or the end of the document. For simplicity let's assume that the document only has level 1 headers.
When I make a simple case distinction on whether the current element has been preceded by a header in to_keep and either return None or return [], I get an error. That is, for pandoc --filter filter.py -o output.pdf input.md I get TypeError: panflute.dump needs input of type "panflute.Doc" but received one of type "list" (code, example file and complete error message at the end).
I use Python 3.7.4 and panflute 1.12.5 and pandoc 2.2.3.2.
Question
If I make a more fine-grained distinction on when to return [], it works (function action_working). My question is, why is this more fine-grained distinction necessary? My solution seems to work, but it might well be accidental... How can I get this to work properly?
Files
error
Traceback (most recent call last):
File "filter.py", line 42, in <module>
main()
File "filter.py", line 39, in main
return run_filter(action_not_working, doc=doc)
File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 266, in run_filter
return run_filters([action], *args, **kwargs)
File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 253, in run_filters
dump(doc, output_stream=output_stream)
File "C:\Users\ody_he\AppData\Local\Continuum\anaconda3\lib\site-packages\panflute\io.py", line 132, in dump
raise TypeError(msg)
TypeError: panflute.dump needs input of type "panflute.Doc" but received one of type "list"
Error running filter filter.py:
Filter returned error status 1
input.md
# English
Some cool english text this is!
# Deutsch
Dies ist die deutsche Übersetzung!
# Sources
Some source.
# Priority
**Medium** *[Low | Medium | High]*
# Status
**Open for Discussion** *\[Draft | Open for Discussion | Final\]*
# Interested Persons (mailing list)
- Franz, Heinz, Karl
filter.py
from panflute import *
to_keep = ['Deutsch', 'Status']
keep_current = False
def action_not_working(elem, doc):
    '''For every element we check if it occurs in a section we wish to keep.
    If it is, we keep it and return None (indicating to keep the element unchanged).
    Otherwise we remove the element (return []).'''
    global to_keep, keep_current
    update_keep(elem)
    if keep_current:
        return None
    else:
        return []
def action_working(elem, doc):
    global to_keep, keep_current
    update_keep(elem)
    if keep_current:
        return None
    else:
        if isinstance(elem, Header):
            return []
        elif isinstance(elem, Para):
            return []
        elif isinstance(elem, BulletList):
            return []
def update_keep(elem):
    '''If the element is a header we update keep_current.'''
    global to_keep, keep_current
    if isinstance(elem, Header):
        # Keep if the title of a section is in to_keep
        keep_current = stringify(elem) in to_keep
def main(doc=None):
    return run_filter(action_not_working, doc=doc)
if __name__ == '__main__':
    main()
I think what happens is that panflute calls the action on all elements, including the Doc root element. If keep_current is False when walking the Doc element, it will be replaced by a list. This leads to the error message you are seeing, as panflute expects the root node to always be there.
The updated filter only acts on Header, Para, and BulletList elements, so the Doc root node will be left untouched. You'll probably want to use something more generic like isinstance(elem, Block) instead.
An alternative approach could be to use panflute's load and dump functions directly: load the document into a Doc element, manually iterate over all top-level blocks in doc.content and remove all that are unwanted, then dump the resulting doc back into the output stream.
from panflute import *
to_keep = ['Deutsch', 'Status']
keep_current = False
doc = load()
kept = []
for top_level_block in doc.content:
    # a header decides whether the section that starts here is kept
    if isinstance(top_level_block, Header):
        keep_current = stringify(top_level_block) in to_keep
    if keep_current:
        kept.append(top_level_block)
doc.content = kept
dump(doc)

star wars api => IndexError: list index out of range error

I am working with a Star Wars API from http://swapi.co/api/. I can connect to it and my project is coming along fine. However, I am running into the following error message: IndexError: list index out of range. Looking at other Stack Overflow questions, it appears that this could be an off-by-one error. I am not sure how to fix it in my program. Here is the code:
url = ('http://swapi.co/api/' + str(view))
#Setting up a get request to pull the data from the URL
r = requests.get(url)
if r.status_code == 200:
    print("status_code", r.status_code)
else:
    print("Sorry it appears your connection failed!")
#Storing the API response in a variable
response_dict = r.json()
print("There are currently", response_dict['count'], "items to view")
repo_dicts = response_dict['results']
num = 0
while num < response_dict['count']:
    if view == 'films':
        repo_dict = repo_dicts[num]['title']
        print(str(num) + " " + repo_dict)
    elif view == 'starships':
        repo_dict = repo_dicts[num]['name']
        print(str(num) + " " + repo_dict)
    num += 1
Now the line that is giving me the problem is in that elif view == 'starships' area. Actually if one goes to the API you can see certain categories like films, people, starships etc. All of the categories, except films, have greater than 10 things in them. I also notice that if I go to http://swapi.co/api/starships/4/ there will be no detail found. Could the fact that some of the categories have no data be causing my problem? Thank you for any insight!!
Here is the traceback error message:
Traceback (most recent call last):
File "main.py", line 102, in <module>
main()
File "main.py", line 98, in main
began()
File "main.py", line 87, in began
connect(view)
File "main.py", line 31, in connect
repo_dict = repo_dicts[num]['name']
IndexError: list index out of range
Iterate through the results you actually have, using a for loop like this:
for num, item in enumerate(repo_dicts):
    if view == 'films':
        repo_dict = item['title']
        print(str(num) + " " + repo_dict)
    elif view == 'starships':
        repo_dict = item['name']
        print(str(num) + " " + repo_dict)
The reason is that the API returns only 10 items in response_dict['results'], while response_dict['count'] is 37. Consult the API documentation on why this happens. My guess is that this is pagination.
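If the mismatch is indeed pagination, a common pattern is for each page of the response to carry the URL of the following page. The sketch below assumes the response has a `next` key holding that URL (or None on the last page) next to the `results` and `count` keys seen above; check the API documentation before relying on it. The `fetch` parameter is a hypothetical stand-in for something like `lambda url: requests.get(url).json()`, and the fake pages exist only so the example runs without network access.

```python
def iter_results(first_url, fetch):
    """Yield every item across pages, following the assumed 'next' links."""
    url = first_url
    while url:
        page = fetch(url)
        for item in page["results"]:
            yield item
        url = page.get("next")  # None (or a missing key) ends the loop


# Fake two-page API used only for demonstration.
fake_pages = {
    "page1": {"count": 3, "next": "page2",
              "results": [{"name": "X-wing"}, {"name": "Y-wing"}]},
    "page2": {"count": 3, "next": None,
              "results": [{"name": "TIE fighter"}]},
}

names = [item["name"] for item in iter_results("page1", fake_pages.get)]
for num, name in enumerate(names):
    print(str(num) + " " + name)
```

Because the loop is driven by the items actually received rather than by `count`, it cannot index past the end of a page.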

Not iterating through whole dictionary

So basically, I have an API from which I get several dictionaries/arrays. (http://dev.c0l.in:5984/income_statements/_all_docs)
When getting the financial information for each company from the API (e.g. sector = technology and statement = income), Python is supposed to return 614 technology companies; however, I get this error:
Traceback (most recent call last):
File "C:\Users\samuel\Desktop\Python Project\Mastercopy.py", line 83, in <module>
user_input1()
File "C:\Users\samuel\Desktop\Python Project\Mastercopy.py", line 75, in user_input1
income_statement_fn()
File "C:\Users\samuel\Desktop\Python Project\Mastercopy.py", line 51, in income_statement_fn
if is_response ['sector'] == user_input3:
KeyError: 'sector'
on a random company (usually one of the 550th to 600th ones).
Here is the function for income statements:
def income_statement_fn():
    user_input3 = raw_input("Which sector would you like to iterate through in Income Statement?: ")
    print 'Starting...'
    for item in income_response['rows']:
        is_url = "http://dev.c0l.in:5984/income_statements/" + item['id']
        is_request = urllib2.urlopen(is_url).read()
        is_response = json.loads(is_request)
        if is_response ['sector'] == user_input3:
            csv.writerow([
                is_response['company']['name'],
                is_response['company']['sales'],
                is_response['company']['opening_stock'],
                is_response['company']['purchases'],
                is_response['company']['closing_stock'],
                is_response['company']['expenses'],
                is_response['company']['interest_payable'],
                is_response['company']['interest_receivable']])
            print 'loading...'
    print 'done!'
    print end - start
Any idea what could be causing this error?
(I don't believe that it is the api itself)
Cheers
Well, on testing the url you pass in the urlopen call, with a random number, I got this:
{"error":"not_found","reason":"missing"}
In that case, your function will raise exactly the error you get. If you want your program to handle the error nicely and add a "missing" line instead of actual data, you could do it like this, for instance:
def income_statement_fn():
    user_input3 = raw_input("Which sector would you like to iterate through in Income Statement?: ")
    print 'Starting...'
    for item in income_response['rows']:
        is_url = "http://dev.c0l.in:5984/income_statements/" + item['id']
        is_request = urllib2.urlopen(is_url).read()
        is_response = json.loads(is_request)
        if is_response.get('sector', False) == user_input3:
            csv.writerow([
                is_response['company']['name'],
                is_response['company']['sales'],
                is_response['company']['opening_stock'],
                is_response['company']['purchases'],
                is_response['company']['closing_stock'],
                is_response['company']['expenses'],
                is_response['company']['interest_payable'],
                is_response['company']['interest_receivable']])
            print 'loading...'
        else:
            csv.writerow(['missing data'])
    print 'done!'
    print end - start
The problem seems to be with the final row of your income_response data
{"id":"_design/auth","key":"_design/auth","value":{"rev":"1-3d8f282ec7c26779194caf1d62114dc7"}}
This does not have a sector value. You need to alter your code to handle this line, for example by ignoring any line where the sector key is not present.
You could easily have debugged this with a few print statements - for example insert
print item['id'], is_response.get('sector', None)
into your code before the part that outputs the CSV.
A KeyError means that the key you tried to use does not exist in the dictionary. When checking for a key, it is much safer to use .get(). So you would replace this line:
if is_response['sector'] == user_input3:
With this:
if is_response.get('sector') == user_input3:
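To illustrate the difference, here is a small sketch in Python 3 with made-up documents. The `docs` list and the helper name `rows_for_sector` are hypothetical; the error document mirrors the "not_found" response quoted above. Documents without a `sector` key are skipped rather than raising a KeyError.

```python
def rows_for_sector(docs, sector):
    """Return the company names of documents whose 'sector' matches."""
    names = []
    for doc in docs:
        # .get() returns None when the key is absent, so no KeyError is raised
        if doc.get("sector") == sector:
            names.append(doc["company"]["name"])
    return names


# Made-up documents; the last one has no 'sector' key at all.
docs = [
    {"sector": "technology", "company": {"name": "Acme Software"}},
    {"sector": "healthcare", "company": {"name": "Medi Corp"}},
    {"error": "not_found", "reason": "missing"},
]

print(rows_for_sector(docs, "technology"))
```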
