Cleaner or easier way to write this? - python

I'm scraping from here: https://www.usatoday.com/sports/ncaaf/sagarin/ and the page is just a mess of font tags. I've been able to successfully scrape the data that I need, but I'm curious whether I could have written this 'cleaner', I guess, for lack of a better word. It just seems silly that I have to use three different temporary lists as I stage the cleanup of the scraped data.
For example, here is my snippet of code that gets the overall rating for each team in the "table" on that page:
import re
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://www.usatoday.com/sports/ncaaf/sagarin/').read()
soup = bs.BeautifulSoup(source, "lxml")
page_source = soup.find("font", {"color": "#000000"})
sagarin_raw_rating_list = page_source.find_all("font", {"color": "#9900ff"})
raw_ratings = sagarin_raw_rating_list[:-1]
temp_list = [element.text for element in raw_ratings]
temp_list_cleanup1 = [element for element in temp_list if element != 'RATING']
temp_list_cleanup2 = re.findall(r"&nbsp\s*(-?\d+\.\d+)", str(temp_list_cleanup1))
final_ratings_list = [element for element in temp_list_cleanup2 if element != home_team_advantage]  # home_team_advantage is scraped in another piece of code
print(final_ratings_list)
This is for a private program for me and some friends so I'm the only one ever maintaining it, but it just seems a bit convoluted. Part of the problem is the site because I have to do so much work to extract the relevant data.

The main thing I see is that you turn temp_list_cleanup1 into a string somewhat unnecessarily. There isn't going to be much of a difference between re.findall on one giant string and re.search on a bunch of smaller strings. After that, you can swap out most of the list comprehensions [...] for generator expressions (...). It doesn't eliminate any lines of code, but you avoid storing intermediate lists that you will never need again.
temp_iter = (element.text for element in raw_ratings)
temp_iter_cleanup1 = (element for element in temp_iter if element != 'RATING')
# search each element individually, rather than one large string
temp_iter_cleanup2 = (re.search(r"&nbsp\s*(-?\d+\.\d+)", element).group(1)
                      for element in temp_iter_cleanup1)
# here do a list comprehension so that you have the scrubbed data stored
final_ratings_list = [element for element in temp_iter_cleanup2 if element != home_team_advantage]
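If you want to go one step further, the whole pipeline can collapse into a single comprehension. A minimal sketch, assuming Python 3.8+ for the assignment expression and the same raw_ratings and home_team_advantage variables as above:
import re

final_ratings_list = [
    m.group(1)
    for element in raw_ratings
    if element.text != 'RATING'
    # the walrus operator binds the match so we can reuse it in the filter
    and (m := re.search(r"&nbsp\s*(-?\d+\.\d+)", element.text))
    and m.group(1) != home_team_advantage
]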

Related

Nested list or list of string pairs

I have some pairs of strings: the first element contains a name, the second a city of birth.
I use them in web scraping. When I find the appropriate element on the web page, I want a for loop that calls send_keys(name) and then does other operations like click or enter. For the second element from the web page I also want a for loop that calls send_keys(city). How can I do it?
Should I make a list of string pairs or a nested list?
Like:
list_1 = [["Ann", "London"], ["John", "Barcelona"], ["Kate", "Paris"]]
list_2 = [("Ann", "London"), ("John", "Barcelona"), ("Kate", "Paris")]
Which is better if my double iteration should look like this:
for element in list_1:
    el_scraped = driver.find.....
    el_scraped.send_keys(element)
    el_scraped.click()
    for element2 in element:
        el2_scraped = driver.find ....
        el2_scraped.send_keys(element2)
        el2_scraped.click()
I have a problem with the for loop construction. I have only posted some of the operations between one loop and the other. Can someone help me with the for loops and the appropriate list format?
You can store the data in either structure, as long as you access it appropriately.
I don't see any need for a nested for loop.
For the data format in list_1 you can unpack the pairs as below:
for name, city in list_1:
    el_scraped = driver.find.....
    el_scraped.send_keys(name)
    el_scraped.click()
    el2_scraped = driver.find ....
    el2_scraped.send_keys(city)
    el2_scraped.click()

multiple findAll in one for loop

I'm using BeautifulSoup to read some data from a web page.
This code works fine, but I would like to improve it.
How do I make the for loop extract more than one piece of data per iteration? Here I have 3 for loops to get the values:
for elem in bsObj.findAll('div', class_="grad"): ...
for elem in bsObj.findAll('div', class_="ulica"): ...
for elem in bsObj.findAll('div', class_="kada"): ...
How do I change this to work in one for loop? Of course, I'd like a simple solution.
The output can be lists.
My code so far
from bs4 import BeautifulSoup

# get data from a web page into the ``html`` variable here
bsObj = BeautifulSoup(html.read(), 'lxml')
mj = []
adr = []
vri = []
for mjesto in bsObj.findAll('div', class_="grad"):
    print(mjesto.get_text())
    mj.append(mjesto.get_text())
for adresa in bsObj.findAll('div', class_="ulica"):
    print(adresa.get_text())
    adr.append(adresa.get_text())
for vrijeme in bsObj.findAll('div', class_="kada"):
    print(vrijeme.get_text())
    vri.append(vrijeme.get_text())
You can use BeautifulSoup's select method to target your various desired elements, and do whatever you want with them. In this case we are going to simplify the CSS selector pattern by using the :is() pseudo-class, but basically we are searching for any div that has class grad, ulica, or kada. As each element is returned that matches the pattern, we just sort them by which class they correspond to:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

lokacija = "http://www.hep.hr/ods/bez-struje/19?dp=koprivnica&el=124"
datum = "12.02.2019"
lokacija = lokacija + "&datum=" + datum
print(lokacija)
r = requests.get(lokacija)
print(type(str(r)))
print(r.status_code)
html = urlopen(lokacija)
bsObj = BeautifulSoup(html.read(), 'lxml')
print("Datum radova:", datum)
print("HEP područje:", bsObj.h3.get_text())
mj = []
adr = []
vri = []
hep_podrucje = bsObj.h3.get_text()
for el in bsObj.select('div:is(.grad, .ulica, .kada)'):
    if 'grad' in el.get('class'):
        print(el.get_text())
        mj.append(el.get_text())
    elif 'ulica' in el.get('class'):
        print(el.get_text())
        adr.append(el.get_text())
    elif 'kada' in el.get('class'):
        print(el.get_text())
        vri.append(el.get_text())
Note: basic explanation ahead. If you know this, skip directly to the listing of possibilities
To change the code into a loop, you have to look at the part that stays the same and the part that varies. In your case, you find a div, get the text and append it to a list.
The class attribute of the div objects varies each time, and so does the list you append to. A for loop works by having one variable that is assigned a different value each iteration, then executing the code within.
We get a basic structure:
for div_class in <div classes>:
    <stuff to do>
Now, in <stuff to do>, we have a different list each time. We need some way of getting a different list into the loop. For this, there are multiple possibilities:
Put the list into a dict and use item lookup
zip the lists with <div classes> and iterate over them
Both will involve nested loops, the result looking similar to this:
list_1 = []
list_2 = []
list_3 = []
for div_class, the_list in zip(['div_cls1', 'div_cls2', 'div_cls3'], [list_1, list_2, list_3]):
    for elem in bsObj.find_all('div', class_=div_class):
        the_list.append(elem.get_text())
or
lists = {'div_cls1': [], 'div_cls2': [], 'div_cls3': []}
for div_class in lists:  # note: keys MUST match the class of the div elements
    for elem in bsObj.find_all('div', class_=div_class):
        lists[div_class].append(elem.get_text())
Of course, the inner loop could be replaced by list comprehension (works for the dict approach): lists[div_class] = [elem.get_text() for elem in bsObj.find_all('div', class_=div_class)]
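Applied to the actual class names from the question, the dict approach could look like this sketch (bsObj is the soup object from your code):
lists = {'grad': [], 'ulica': [], 'kada': []}
for div_class in lists:
    lists[div_class] = [elem.get_text() for elem in bsObj.find_all('div', class_=div_class)]

# recover the original variable names if the rest of the code expects them
mj, adr, vri = lists['grad'], lists['ulica'], lists['kada']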

How can I take a text file and create a triple nested list from it with tkinter python

I'm making a program that allows the user to log loot they receive from monsters in an MMO. I have the drop tables for each monster stored in text files. I've tried a few different formats but I still can't pin down exactly how to take that information into python and store it into a list of lists of lists.
The text file is formatted like this
item 1*4,5,8*ns
item 2*2,3*s
item 3*90,34*ns
The item # is the name of the item, the numbers are different quantities that can be dropped, and the s/ns is whether the item is stackable or not stackable in game.
I want the entire drop table of the monster to be stored in a list called currentDropTable so that I can reference the names and quantities of the items to pull photos and log the quantities dropped and stuff.
The list for the above example should look like this
[["item 1", ["4","5","8"], "ns"], ["item 2", ["2","3"], "s"], ["item 3", ["90","34"], "ns"]]
That way, I can reference currentDropTable[0][0] to get the name of an item, or if I want to log a drop of 4 of item 1, I can use currentDropTable[0][1][0].
I hope this makes sense, I've tried the following and it almost works, but I don't know what to add or change to get the result I want.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        currentDropTable.append(item)

dropTableFile = open("droptable.txt", "r").read().split('\n')
convert_drop_table(dropTableFile)
print(currentDropTable)
This prints everything properly except that the quantities are still a single string rather than a list, so it would look like
[['item 1', '4,5,8', 'ns'], ['item 2', '2,3', 's']...etc]
I've tried nesting another for j in i with split(','), but then that breaks up everything, not just the list of quantities.
I hope I was clear, if I need to clarify anything let me know. This is the first time I've posted on here, usually I can just find another solution from the past but I haven't been able to find anyone who is trying to do or doing what I want to do.
Thank you.
You want to split only the second entry by ',' so you don't need another loop. Since you know that item = i.split('*') returns a list of 3 items, you can simply change your loop body as follows:
for i in list:
    item = i.split('*')
    item[1] = item[1].split(',')
    currentDropTable.append(item)
Here you replace the second element of item with a list of the quantities.
You only need to split second element from that list.
def convert_drop_table(list):
    global currentDropTable
    currentDropTable = []
    for i in list:
        item = i.split('*')
        item[1] = item[1].split(',')
        currentDropTable.append(item)
The first thing I feel bound to say is that it's usually a good idea to avoid using global variables in any language. Errors involving them can be hard to track down. In fact you could simply omit that function convert_drop_table from your code and do what you need in-line. Then readers aren't obliged to look elsewhere to find out what it does.
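For example, a minimal sketch of the same parse that returns the result instead of mutating a global:
def convert_drop_table(lines):
    table = []
    for line in lines:
        item = line.split('*')
        item[1] = item[1].split(',')  # the quantities become their own list
        table.append(item)
    return table

with open('droptable.txt') as f:
    currentDropTable = convert_drop_table(f.read().splitlines())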
And here's yet another way to parse those lines! :) Look for the asterisks then use their positions to select what you want.
currentDropTable = []
with open('droptable.txt') as droptable:
    for line in droptable:
        line = line.strip()
        p = line.find('*')
        q = line.rfind('*')
        # split the middle slice on ',' so the quantities become their own list
        currentDropTable.append([line[0:p], line[1+p:q].split(','), line[1+q:]])
print(currentDropTable)
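A quick check of the slicing on one sample line from the question:
line = "item 1*4,5,8*ns"
p = line.find('*')   # 6, the first asterisk
q = line.rfind('*')  # 12, the last asterisk
print(line[0:p], line[1+p:q].split(','), line[1+q:])
# item 1 ['4', '5', '8'] ns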

Nested for-loop and appending to empty objects

I am providing values to a website filter in order to generate different HTML, which I then parse. I want to save each page source to a different Python object so that I can keep the data separate; I have a list of empty objects that I will append to.
The challenge is how to append the td elements from a particular HTML source to the corresponding empty object in the list. I need to store the HTML source from each iteration in a separate object, which is itself found in a list.
I will simplify my example:
years = ['2015', '2016']
weeks = ['1', '2']
store = [[], [], [], []]
This gives me 4 sets of html source that I need to capture:
for y in years:
    for w in weeks:
        # I will use y and w in webdriver.select to provide values for the web page filter
I will then use BS to copy page source for each iteration:
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
And then iterate through the particular page source to extract td elements:
counter = 0
# counter provides the index into the store list, so that td elements
# are appended to separate objects
for el in soup.find_all('td'):
    store[counter].append(el.get_text())
counter = counter + 1
I strip the element of HTML characters, and increment counter to move to the next object in the store list.
But the result is that all the td elements get appended to the first object in the list, instead of each HTML source having its own object. What am I missing?
Would it be better to somehow use the map function?
Your statement
counter=counter+1
is not within the for loop.
You need to indent it at the same level as the previous line, so that counter is incremented each time around the loop.
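Putting it together, a sketch of the corrected structure, with names taken from the question and browser assumed to be a Selenium webdriver, so that counter advances once per scraped page:
from bs4 import BeautifulSoup

counter = 0
for y in years:
    for w in weeks:
        # ... select y and w in the web page filter here ...
        soup = BeautifulSoup(browser.page_source, "lxml")
        for el in soup.find_all('td'):
            store[counter].append(el.get_text())
        counter = counter + 1  # now runs once per page, not once in total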

Scrapy - how to prevent output lines with blank element?

Using a very basic Scrapy script, I want to ensure that none of my output lines include a blank item.
That is, say I have the standard
items = []
for list in lists:
    item = TypeItem()
    item['thing1'] = list.select('h1/text()').extract()
    item['thing2'] = list.select('h2/text()').extract()
    item['thing3'] = list.select('h3/text()').extract()
    items.append(item)
return(items)
I want to prevent any csv line that says "thing1,,thing3" or ",thing2," or the like.
(I'm new to stackoverflow, so I don't know if it's appropriate to ask multiple questions at a time, but since they're related, if I could:
Q2: if I put in the check "if item not in items" before items.append(item), would it stop any duplicate full lines, or just duplicate individual items? If the latter, how do I prevent duplicate lines?)
For your Q2: I think it would not stop duplicates, because the items are objects (instances of classes) and are all considered different by default. You should subclass it and implement __eq__().
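To illustrate the mechanism with a plain class standing in for the Scrapy item (a sketch of the idea, not Scrapy's actual item internals): the in operator uses ==, so defining __eq__ makes the duplicate check compare contents rather than identities.
class TypeItem:
    def __init__(self, **fields):
        self.fields = fields

    def __eq__(self, other):
        # compare by field contents instead of the default identity check
        return isinstance(other, TypeItem) and self.fields == other.fields

a = TypeItem(thing1='x', thing2='y')
b = TypeItem(thing1='x', thing2='y')
print(a == b)    # True with __eq__ defined; False without it
print(b in [a])  # True, so "if item not in items" would now skip the duplicate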
You could achieve that goal after retrieving all elements using the csv parser, couldn't you?
Also, you could save the xpath result to a variable and check if it's blank, like:
thing1 = list.select('h1/text()').extract()[0]
if thing1.strip():
    ...
Also, you could use an additional xpath expression to check that none of your texts will be blank, like:
items = []
for list in lists:
    # self::* keeps the row only when all three headings have text
    if list.select('self::*[h1[text()] and h2[text()] and h3[text()]]'):
        item = TypeItem()
        item['thing1'] = list.select('h1/text()').extract()
        item['thing2'] = list.select('h2/text()').extract()
        item['thing3'] = list.select('h3/text()').extract()
        items.append(item)
return(items)
