Scrapy - how to prevent output lines with blank element? - python

Using a very basic Scrapy script, I want to ensure that none of my output lines include a blank item.
That is, say I have the standard
items = []
for list in lists:
    item = TypeItem()
    item['thing1'] = list.select('h1/text()').extract()
    item['thing2'] = list.select('h2/text()').extract()
    item['thing3'] = list.select('h3/text()').extract()
    items.append(item)
return(items)
I want to prevent any csv line that says "thing1,,thing3" or ",thing2," or the like.
(I'm new to stackoverflow, so I don't know if it's appropriate to ask multiple questions at a time, but since they're related, if I could:
Q2: if I put in the check "if item not in items" before items.append(item), would it stop any duplicate full lines, or just duplicate individual items? If the latter, how do I prevent duplicate lines?)

For your Q2, I think it would not stop duplicates because they are objects (instances of classes) and all different. You should subclass it and implement __eq__().
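As a minimal sketch of the `__eq__()` idea, assuming a hypothetical `Row` class that stands in for the scraped item (it is not Scrapy's own item class), a membership test such as `candidate not in rows` compares whole rows once equality is defined:

```python
class Row:
    """Hypothetical stand-in for a scraped item with three fields."""

    def __init__(self, thing1, thing2, thing3):
        self.thing1 = thing1
        self.thing2 = thing2
        self.thing3 = thing3

    def _key(self):
        # The tuple of all fields identifies a full output line.
        return (self.thing1, self.thing2, self.thing3)

    def __eq__(self, other):
        if not isinstance(other, Row):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        # Keep hashing consistent with __eq__ so rows also work in sets.
        return hash(self._key())


rows = []
for candidate in [Row("a", "b", "c"), Row("a", "b", "c"), Row("a", "x", "c")]:
    if candidate not in rows:  # "in" uses __eq__, so whole rows are compared
        rows.append(candidate)

print(len(rows))
```

Without `__eq__()`, the same loop would keep all three rows, because distinct instances compare unequal by default.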
You could achieve that goal after retrieving all elements using the csv parser, couldn't you?
Also, you could save the xpath result to a variable and check if it's blank, like:
thing1 = list.select('h1/text()').extract()[0]
if thing1.strip():
    ...
Also, you could use an additional xpath expression to check that none of your texts will be blank, like:
items = []
for list in lists:
    if list.select('.[h1[text()] and h2[text()] and h3[text()]]'):
        item = TypeItem()
        item['thing1'] = list.select('h1/text()').extract()
        item['thing2'] = list.select('h2/text()').extract()
        item['thing3'] = list.select('h3/text()').extract()
        items.append(item)
return(items)
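The same check can also be done in plain Python after extraction. The sketch below assumes each field holds a list of strings (which is what `extract()` returns) and uses ordinary dicts as stand-ins for the items; `is_complete` is a hypothetical helper name, not part of Scrapy:

```python
def is_complete(item, fields=("thing1", "thing2", "thing3")):
    """Return True only if every field has at least one non-blank string."""
    for field in fields:
        values = item.get(field) or []
        if not any(value.strip() for value in values):
            return False
    return True


# Plain dicts stand in for scraped items; each value mimics extract() output.
scraped = [
    {"thing1": ["a"], "thing2": ["b"], "thing3": ["c"]},
    {"thing1": ["a"], "thing2": [], "thing3": ["c"]},      # h2 missing
    {"thing1": ["  "], "thing2": ["b"], "thing3": ["c"]},  # whitespace only
]

items = [item for item in scraped if is_complete(item)]
print(items)
```

Only the first dict survives, so no output line would contain an empty column.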

Related

Cleaner or easier way to write this?

I'm scraping from here: https://www.usatoday.com/sports/ncaaf/sagarin/ and the page is just a mess of font tags. I've been able to successfully scrape the data that I need, but I'm curious whether I could have written this 'cleaner', for lack of a better word. It just seems silly that I have to use three different temporary lists as I stage the cleanup of the scraped data.
For example, here is my snippet of code that gets the overall rating for each team in the "table" on that page:
import re
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://www.usatoday.com/sports/ncaaf/sagarin/').read()
soup = bs.BeautifulSoup(source, "lxml")
page_source = soup.find("font", {"color": "#000000"})
sagarin_raw_rating_list = page_source.find_all("font", {"color": "#9900ff"})
raw_ratings = sagarin_raw_rating_list[:-1]
temp_list = [element.text for element in raw_ratings]
temp_list_cleanup1 = [element for element in temp_list if element != 'RATING']
temp_list_cleanup2 = re.findall(r"&nbsp\s*(-?\d+\.\d+)", str(temp_list_cleanup1))
final_ratings_list = [element for element in temp_list_cleanup2 if element != home_team_advantage] # This variable is scraped from another piece of code
print(final_ratings_list)
This is for a private program for me and some friends so I'm the only one ever maintaining it, but it just seems a bit convoluted. Part of the problem is the site because I have to do so much work to extract the relevant data.
The main thing I see is that you turn temp_list_cleanup1 into a string somewhat unnecessarily. I don't think there's going to be much of a difference between re.findall on one giant string and re.search on a bunch of smaller strings. After that you can swap out most of the list comprehensions [...] for generator expressions (...). It doesn't eliminate any lines of code, but you don't store extra lists that you won't ever need again.
temp_iter = (element.text for element in raw_ratings)
temp_iter_cleanup1 = (element for element in temp_iter if element != 'RATING')
# search each element individually, rather than one large string
temp_iter_cleanup2 = (re.search(r"&nbsp\s*(-?\d+\.\d+)", element).group(1)
                      for element in temp_iter_cleanup1)
# here do a list comprehension so that you have the scrubbed data stored
final_ratings_list = [element for element in temp_iter_cleanup2 if element != home_team_advantage]
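One thing to watch with the per-element approach is that `re.search` returns `None` when an element has no match, so calling `.group(1)` directly would raise `AttributeError`. The sketch below shows one way to chain generator expressions while skipping non-matching elements. The sample strings, the simplified pattern, and the `extract_ratings` name are assumptions for illustration, not taken from the real page:

```python
import re

# Simplified pattern (an assumption): any signed decimal number in the text.
RATING_RE = re.compile(r"(-?\d+\.\d+)")


def extract_ratings(texts, skip_values=()):
    """Lazily filter and parse rating strings, storing only the final list."""
    matches = (RATING_RE.search(text) for text in texts if text != "RATING")
    values = (match.group(1) for match in matches if match is not None)
    return [value for value in values if value not in skip_values]


# Made-up sample data standing in for the scraped element texts.
sample = ["RATING", "\xa0 98.76", "no number here", "\xa0-3.21", "\xa0 2.50"]
ratings = extract_ratings(sample, skip_values={"2.50"})
print(ratings)
```

Nothing is evaluated until the final list comprehension runs, so no intermediate lists are kept in memory.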

How can I take a text file and create a triple nested list from it with tkinter python

I'm making a program that allows the user to log loot they receive from monsters in an MMO. I have the drop tables for each monster stored in text files. I've tried a few different formats but I still can't pin down exactly how to take that information into python and store it into a list of lists of lists.
The text file is formatted like this
item 1*4,5,8*ns
item 2*2,3*s
item 3*90,34*ns
The item # is the name of the item, the numbers are different quantities that can be dropped, and the s/ns is whether the item is stackable or not stackable in game.
I want the entire drop table of the monster to be stored in a list called currentDropTable so that I can reference the names and quantities of the items to pull photos and log the quantities dropped and stuff.
The list for the above example should look like this
[["item 1", ["4","5","8"], "ns"], ["item 2", ["2","3"], "s"], ["item 3", ["90","34"], "ns"]]
That way, I can reference currentDropTable[0][0] to get the name of an item, or if I want to log a drop of 4 of item 1, I can use currentDropTable[0][1][0].
I hope this makes sense, I've tried the following and it almost works, but I don't know what to add or change to get the result I want.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        currentDropTable.append(item)

dropTableFile = open("droptable.txt", "r").read().split('\n')
convert_drop_table(dropTableFile)
print(currentDropTable)
This prints everything properly except the quantities are still an entity without being a list, so it would look like
[['item 1', '4,5,8', 'ns'], ['item 2', '2,3', 's']...etc]
I've tried nesting another for j in i, split(',') but then that breaks up everything, not just the list of quantities.
I hope I was clear, if I need to clarify anything let me know. This is the first time I've posted on here, usually I can just find another solution from the past but I haven't been able to find anyone who is trying to do or doing what I want to do.
Thank you.
You want to split only the second entity by ',' so you don't need another loop. Since you know that item = i.split('*') returns a list of 3 items, you can simply change your for loop as follows:
for i in list:
    item = i.split('*')
    item[1] = item[1].split(',')
    currentDropTable.append(item)
Here you replace the second element of item with a list of the quantities.
You only need to split the second element of that list.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        item[1] = item[1].split(',')
        currentDropTable.append(item)
The first thing I feel bound to say is that it's usually a good idea to avoid using global variables in any language. Errors involving them can be hard to track down. In fact you could simply omit that function convert_drop_table from your code and do what you need in-line. Then readers aren't obliged to look elsewhere to find out what it does.
And here's yet another way to parse those lines! :) Look for the asterisks then use their positions to select what you want.
currentDropTable = []
with open('droptable.txt') as droptable:
    for line in droptable:
        line = line.strip()
        p = line.find('*')
        q = line.rfind('*')
        # split the middle part on commas so the quantities become a list
        currentDropTable.append([line[0:p], line[1+p:q].split(','), line[1+q:]])
print(currentDropTable)
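Putting the suggestions together, here is a minimal sketch that avoids the global by returning the table from a function. The `parse_drop_table` name and the in-memory sample lines are assumptions for illustration; in practice the lines would come from the drop table file:

```python
def parse_drop_table(lines):
    """Turn 'name*q1,q2*flag' lines into [name, [q1, q2], flag] entries."""
    table = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines, e.g. one left by a trailing newline
            continue
        name, quantities, stackable = line.split('*')
        table.append([name, quantities.split(','), stackable])
    return table


# Sample lines in the format described in the question.
sample_lines = ["item 1*4,5,8*ns", "item 2*2,3*s", "item 3*90,34*ns", ""]
current_drop_table = parse_drop_table(sample_lines)
print(current_drop_table)
```

With this structure, `current_drop_table[0][0]` is the item name and `current_drop_table[0][1][0]` is its first quantity.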

Python appending a list to a list and then clearing it

I have this part of code isolated for testing purposes and this question
noTasks = int(input())
noOutput = int(input())
outputClist = []
outputCList = []
for i in range(0, noTasks):
    for w in range(0, noOutput):
        outputChecked = str(input())
        outputClist.append(outputChecked)
    outputCList.append(outputClist)
    outputClist[:] = []
print(outputCList)
With this code I get the following output:
[[], []]
I can't figure out how to get the output below. I must clear that sublist, or I get something completely wrong:
[["test lol", "here can be more stuff"], ["test 2 lol", "here can be more stuff"]]
In Python everything is an object. A list is an object with elements. You only create one object, outputClist, and keep filling and clearing its contents. In the end, outputCList contains that same list multiple times, and because the last thing you do is clear it, every entry is empty.
Instead, you have to create a new list for every task:
noTasks = int(input())
noOutput = int(input())
output = []
for i in range(noTasks):
    checks = []
    for w in range(noOutput):
        checks.append(input())
    output.append(checks)
print(output)
Instead of passing the contained elements in outputClist to outputCList (not the greatest naming practice either to just have one capitalization partway through be the only difference in variable names), you are passing a reference to the list itself. To get around this important and useful feature of Python that you don't want to make use of, you can pretty easily just pass a new list containing the elements of outputClist by changing this line
outputCList.append(outputClist)
to
outputCList.append(list(outputClist))
or equivalently, as @jonrsharpe states in his comment
outputCList.append(outputClist[:])
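The difference between appending a reference and appending a copy can be seen in a small sketch that replaces `input()` with fixed sample data. The variable names here are illustrative and not taken from the question:

```python
# Appending the same list object each time: every entry is an alias.
shared = []
aliased = []
for words in (["a", "b"], ["c", "d"]):
    shared.extend(words)
    aliased.append(shared)  # stores a reference to the one shared list
    shared[:] = []          # clears that list in place, for every alias
print(aliased)

# Appending a copy each time: each entry keeps its own contents.
buffer = []
copied = []
for words in (["a", "b"], ["c", "d"]):
    buffer.extend(words)
    copied.append(list(buffer))  # stores an independent shallow copy
    buffer[:] = []
print(copied)
```

In the first loop both entries are the very same object (`aliased[0] is aliased[1]`), which is why clearing it empties every entry at once.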

accessing values of a dictionary with duplicate keys

I have a dictionary that looks like this:
reply = {icon:[{name:whatever,url:logo1.png},{name:whatever,url:logo2.png}]}
How do I access logo1.png?
I tried:
print reply[icon][url]
and it gives me an error:
list indices must be integers, not str
EDIT:
Bear in mind sometimes my dictionary changes to this :
reply = {icon:{name:whatever,url:logo1.png}}
I need a general solution which will work for both kinds of dictionaries
EDIT2:
My solution was like this :
try:
    icon = reply['icon']['url']
    print icon
except Exception:
    icon = reply['icon'][0]['url']
    print ipshit,icon
This works but looks horrible. I was wondering if there was an easier way than this
Have you tried this?
reply[icon][0][url]
If you know for sure all the different kinds of responses that you will get, you'll have to write a parser where you're explicitly checking if the values are lists or dicts.
You could try this if it is only the two possibilities that you've described:
def get_icon_url(reply):
    return reply['icon'][0]['url']\
        if type(reply['icon']) is list else reply['icon']['url']
So in this case, icon is the key to a list that holds two dictionaries, each with two key/value pairs. Also, it looks like you might want your keys to be strings (icon = 'icon', name = 'name'), but perhaps they are variables, in which case disregard this. I'm going to use strings below because that seems the most likely to be correct.
So:
reply['icon'] # is equal to a list: []
reply['icon'][0] # is equal to a dictionary: {}
reply['icon'][0]['name'] # is equal to 'whatever'
reply['icon'][0]['url'] # is equal to 'logo1.png'
reply['icon'][1] # is equal to the second dictionary: {}
reply['icon'][1]['name'] # is equal to 'whatever'
reply['icon'][1]['url'] # is equal to 'logo2.png'
You can access elements of those inner dictionaries either by knowing how many items are in the list and referencing them explicitly, as done above, or by iterating through them:
for picture_dict in reply['icon']:
    name = picture_dict['name'] # is equal to 'whatever' on both iterations
    url = picture_dict['url'] # is 'logo1.png' on first iteration, 'logo2.png' on second
Cheers!
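Not so different, but maybe it looks better (catching a specific exception gives finer control; indexing a list with a string raises TypeError, which is the "list indices must be integers" error above):
icon_data = reply['icon']
try:
    icon = icon_data['url']
    print icon
except TypeError:
    icon = icon_data[0]['url']
    print ipshit,icon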
or:
icon_data = reply['icon']
if isinstance(icon_data, list):
    icon_data = icon_data[0]
icon = icon_data['url']
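The `isinstance` check can be wrapped in a small function and tried against both shapes of the reply. This is a sketch in Python 3 syntax that assumes the keys are the strings 'icon' and 'url'; the sample dictionaries are reconstructed from the question:

```python
def get_icon_url(reply):
    """Return the first icon URL whether 'icon' is a dict or a list of dicts."""
    icon_data = reply['icon']
    if isinstance(icon_data, list):
        icon_data = icon_data[0]  # take the first entry of the list form
    return icon_data['url']


list_reply = {'icon': [{'name': 'whatever', 'url': 'logo1.png'},
                       {'name': 'whatever', 'url': 'logo2.png'}]}
dict_reply = {'icon': {'name': 'whatever', 'url': 'logo1.png'}}

print(get_icon_url(list_reply))
print(get_icon_url(dict_reply))
```

Checking the type up front avoids using exceptions for control flow and makes both accepted shapes explicit.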

use slice in for loop to build a list

I would like to build up a list using a for loop and am trying to use a slice notation. My desired output would be a list with the structure:
known_result[i] = (record.query_id, (align.title, align.title,align.title....))
However I am having trouble getting the slice operator to work:
knowns = "output.xml"
i=0
for record in NCBIXML.parse(open(knowns)):
    known_results[i] = record.query_id
    known_results[i][1] = (align.title for align in record.alignment)
    i+=1
which results in:
list assignment index out of range.
I am iterating through a series of sequences using BioPython's NCBIXML module but the problem is adding to the list. Does anyone have an idea on how to build up the desired list either by changing the use of the slice or through another method?
thanks zach cp
(crossposted at Biostar)
You cannot assign a value to a list at an index that doesn't exist. The way to add an element (at the end of the list, which is the common use case) is to use the .append method of the list.
In your case, the lines
known_results[i] = record.query_id
known_results[i][1] = (align.title for align in record.alignment)
should probably be changed to
element=(record.query_id, tuple(align.title for align in record.alignment))
known_results.append(element)
Warning: The code above is untested, so might contain bugs. But the idea behind it should work.
Use:
for record in NCBIXML.parse(open(knowns)):
    known_results[i] = (record.query_id, None)
    known_results[i][1] = (align.title for align in record.alignment)
    i+=1
If I understand you correctly, you want to assign one or more matching align.title values to every record.query_id. So I guess your query_ids are unique and those unique ids are related to some titles. If so, I would suggest a dictionary instead of a list.
A dictionary consists of keys (e.g. record.query_id) and values (e.g. a list of align.title).
catalog = {}
for record in NCBIXML.parse(open(knowns)):
    catalog[record.query_id] = [align.title for align in record.alignment]
To access this catalog you could either iterate through:
for query_id in catalog:
    print catalog[query_id] # returns the title-list for the actual key
or you could access them directly if you know what you're looking for.
query_id = XYZ_Whatever
print catalog[query_id]
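Both suggestions, the list built with append and the dictionary, can be tried without BioPython by using stand-in records. In the sketch below the `Record` and `Alignment` namedtuples and their sample values are hypothetical. They only mimic the `query_id`, `alignment` and `title` attribute names used in the question and are not the real NCBIXML classes:

```python
from collections import namedtuple

# Hypothetical stand-ins for the parsed BLAST records (not the BioPython API).
Alignment = namedtuple("Alignment", "title")
Record = namedtuple("Record", "query_id alignment")

records = [
    Record("query1", [Alignment("hit A"), Alignment("hit B")]),
    Record("query2", [Alignment("hit C")]),
]

# List of (query_id, (title, title, ...)) tuples, built with append.
known_results = []
for record in records:
    titles = tuple(align.title for align in record.alignment)
    known_results.append((record.query_id, titles))

# Dictionary keyed by query_id, for direct lookup.
catalog = {record.query_id: [align.title for align in record.alignment]
           for record in records}

print(known_results)
print(catalog["query2"])
```

The list keeps the records in parse order, while the dictionary is the better fit when titles are looked up by query id.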
