multiple findAll in one for loop - python

I'm using BeautifulSoup to read some data from a web page.
This code works fine, but I would like to improve it.
How do I make a single for loop extract more than one piece of data per iteration? Here I have 3 for loops to get the values:
for elem in bsObj.findAll('div', class_="grad"): ...
for elem in bsObj.findAll('div', class_="ulica"): ...
for elem in bsObj.findAll('div', class_="kada"): ...
How can I change this to work in one for loop? Of course I'd like a simple solution.
The output can be a list.
My code so far
from bs4 import BeautifulSoup

# get data from a web page into the ``html`` variable here
bsObj = BeautifulSoup(html.read(), 'lxml')
mj=[]
adr=[]
vri=[]
for mjesto in bsObj.findAll('div', class_="grad"):
    print(mjesto.get_text())
    mj.append(mjesto.get_text())
for adresa in bsObj.findAll('div', class_="ulica"):
    print(adresa.get_text())
    adr.append(adresa.get_text())
for vrijeme in bsObj.findAll('div', class_="kada"):
    print(vrijeme.get_text())
    vri.append(vrijeme.get_text())

You can use BeautifulSoup's select method to target your various desired elements, and do whatever you want with them. In this case we are going to simplify the CSS selector pattern by using the :is() pseudo-class, but basically we are searching for any div that has class grad, ulica, or kada. As each element is returned that matches the pattern, we just sort them by which class they correspond to:
from bs4 import BeautifulSoup
import requests

lokacija = "http://www.hep.hr/ods/bez-struje/19?dp=koprivnica&el=124"
datum = "12.02.2019"
lokacija = lokacija + "&datum=" + datum
print(lokacija)
r = requests.get(lokacija)
print(r.status_code)
bsObj = BeautifulSoup(r.text, 'lxml')
print("Datum radova:",datum)
print("HEP područje:",bsObj.h3.get_text())
mj=[]
adr=[]
vri=[]
hep_podrucje=bsObj.h3.get_text()
for el in bsObj.select('div:is(.grad, .ulica, .kada)'):
    if 'grad' in el.get('class'):
        print(el.get_text())
        mj.append(el.get_text())
    elif 'ulica' in el.get('class'):
        print(el.get_text())
        adr.append(el.get_text())
    elif 'kada' in el.get('class'):
        print(el.get_text())
        vri.append(el.get_text())
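If your BeautifulSoup install is older and lacks :is() support (CSS selectors are handled by the soupsieve backend, and :is() needs a reasonably recent version), here is a rough sketch of the same idea without CSS selectors, relying on the fact that find_all accepts a list of classes and returns matches in document order:
for el in bsObj.find_all('div', class_=['grad', 'ulica', 'kada']):
    classes = el.get('class', [])
    if 'grad' in classes:
        mj.append(el.get_text())
    elif 'ulica' in classes:
        adr.append(el.get_text())
    elif 'kada' in classes:
        vri.append(el.get_text())
Since the elements come back in document order, the sorting logic is identical to the select version.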

Note: a basic explanation follows. If you already know this, skip directly to the listing of possibilities.
To change the code into a loop, you have to look at the part that stays the same and the part that varies. In your case, you find a div, get the text and append it to a list.
The class attribute of the div objects varies each time, as does the list you append to. A for loop works by having one variable that is assigned a different value each iteration, then executing the code within.
We get a basic structure:
for div_class in <div classes>:
    <stuff to do>
Now, in <stuff to do>, we have a different list each time. We need some way of getting a different list into the loop. For this, there are multiple possibilities:
Put the list into a dict and use item lookup
zip the lists with <div classes> and iterate over them
Both approaches involve nested loops, with the result looking similar to this:
list_1 = []
list_2 = []
list_3 = []
for div_class, the_list in zip(['div_cls1', 'div_cls2', 'div_cls3'], [list_1, list_2, list_3]):
    for elem in bsObj.find_all('div', class_=div_class):
        the_list.append(elem.get_text())
or
lists = {'div_cls1': [], 'div_cls2': [], 'div_cls3': []}
for div_class in lists:  # note: keys MUST match the class of the div elements
    for elem in bsObj.find_all('div', class_=div_class):
        lists[div_class].append(elem.get_text())
Of course, the inner loop can be replaced by a list comprehension (this works for the dict approach): lists[div_class] = [elem.get_text() for elem in bsObj.find_all('div', class_=div_class)]
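Applied to the classes from the original question, the dict approach plus the comprehension might look like this (a sketch reusing the asker's bsObj and class names):
lists = {'grad': [], 'ulica': [], 'kada': []}
for div_class in lists:
    lists[div_class] = [elem.get_text()
                        for elem in bsObj.find_all('div', class_=div_class)]
mj, adr, vri = lists['grad'], lists['ulica'], lists['kada']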

Related

Can you iterate over only tags with the .children iterator from BeautifulSoup?

I am pulling down an XML file using BeautifulSoup with this code:
dlink = r'https://www.sec.gov/Archives/edgar/data/1040188/000104018820000126/primary_doc.xml'
dreq = requests.get(dlink).content
dsoup = BeautifulSoup(dreq, 'lxml')
There is a level I'm trying to access and then place the elements into a dictionary. I've got it working with this code:
if dsoup.otherincludedmanagerscount.text != '0':
    inclmgr = []
    for i in dsoup.find_all('othermanagers2info'):
        for m in i.find_all('othermanager2'):
            for o in m.find_all('othermanager'):
                imd = {}
                if o.cik:
                    imd['cik'] = o.cik.text
                if o.form13ffilenumber:
                    imd['file_no'] = o.form13ffilenumber.text
                imd['name'] = o.find('name').text
                inclmgr.append(imd)
    comp_dict['incl_mgr'] = inclmgr
I assume it's easier to use the .children or .descendants generators, but every time I run it, I get an error. Is there a way to iterate over only tags using the BeautifulSoup generators?
Something like this?
for i in dsoup.othermanagers2info.children:
    imd['cik'] = i.cik.text

AttributeError: 'NavigableString' object has no attribute 'cik'
Assuming othermanagers2info is a single item, you can produce the same results using a single for loop:
for i in dsoup.find('othermanagers2info').find_all('othermanager'):
    imd = {}
    if i.cik:
        imd['cik'] = i.cik.text
    if i.form13ffilenumber:
        imd['file_no'] = i.form13ffilenumber.text
    imd['name'] = i.find('name').text
    inclmgr.append(imd)
comp_dict['incl_mgr'] = inclmgr
You can also do for i in dsoup.find('othermanagers2info').findChildren():. However, this will produce different results unless you add additional code: it returns a flattened list that includes both parent and child items. You can also pass in a node name.
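To answer the original question directly: the .children generator yields NavigableString nodes (the whitespace between tags) as well as Tag objects, which is what triggers the AttributeError. A sketch that filters the generator down to tags only:
from bs4.element import Tag

for child in dsoup.find('othermanagers2info').children:
    if isinstance(child, Tag):  # skip NavigableString whitespace nodes
        print(child.name)
Alternatively, find_all(True) matches only tags; passing recursive=False restricts it to direct children, much like .children does.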

Return only one string instead of two almost identical ones

I'm trying to get several links from a webpage, but when I print the result I get:
/t54-EXAMPLE-fansub
/t54-EXAMPLE-fansub#55
How can I get only one of those in the output instead of both?
You could do this:
>>> '/t54-EXAMPLE-fansub#55'.split('#') # just to show you the list output
['/t54-EXAMPLE-fansub', '55']
>>> '/t54-EXAMPLE-fansub#55'.split('#')[0]
'/t54-EXAMPLE-fansub'
>>> '/t54-EXAMPLE-fansub'.split('#')[0]
'/t54-EXAMPLE-fansub'
I am assuming you will have a list called "links" that contains all the links you scraped.
links = ["/t54-EXAMPLE-fansub#55","/t54-EXAMPLE-fansub","/t55-EXAMPLE-fansub"]
links = set(map(lambda x:x[:x.index('#')] if '#' in x else x, links))
for link in links:
    print(link)
Note that this changes the type of links to a set, so be careful about that. This code is just an example implementation of what you can do: go through the links, strip the part after the first '#', and build a set so that duplicates are collapsed.
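The standard library also covers this case: urllib.parse.urldefrag splits a URL into the part before the fragment and the fragment itself, so the manual '#' handling can go away. A small sketch with the same example list:
from urllib.parse import urldefrag

links = ["/t54-EXAMPLE-fansub#55", "/t54-EXAMPLE-fansub", "/t55-EXAMPLE-fansub"]
# urldefrag returns a (url, fragment) tuple; keep only the url part
unique_links = {urldefrag(link).url for link in links}
for link in unique_links:
    print(link)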

Cleaner or easier way to write this?

I'm scraping from here: https://www.usatoday.com/sports/ncaaf/sagarin/ and the page is just a mess of font tags. I've been able to successfully scrape the data that I need, but I'm curious whether I could have written this 'cleaner', for lack of a better word. It just seems silly that I have to use three different temporary lists as I stage the cleanup of the scraped data.
For example, here is my snippet of code that gets the overall rating for each team in the "table" on that page:
import re
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://www.usatoday.com/sports/ncaaf/sagarin/').read()
soup = bs.BeautifulSoup(source, "lxml")
page_source = soup.find("font", {"color": "#000000"})
sagarin_raw_rating_list = page_source.find_all("font", {"color": "#9900ff"})
raw_ratings = sagarin_raw_rating_list[:-1]
temp_list = [element.text for element in raw_ratings]
temp_list_cleanup1 = [element for element in temp_list if element != 'RATING']
temp_list_cleanup2 = re.findall(r"&nbsp\s*(-?\d+\.\d+)", str(temp_list_cleanup1))
final_ratings_list = [element for element in temp_list_cleanup2 if element != home_team_advantage]  # home_team_advantage is scraped in another piece of code
print(final_ratings_list)
This is for a private program for me and some friends so I'm the only one ever maintaining it, but it just seems a bit convoluted. Part of the problem is the site because I have to do so much work to extract the relevant data.
The main thing I see is that you turn temp_list_cleanup1 into a string somewhat unnecessarily. I don't think there's going to be much of a difference between re.findall on one giant string and re.search on a bunch of smaller strings. After that, you can swap out most of the list comprehensions [...] for generator expressions (...). It doesn't eliminate any lines of code, but you don't store extra lists that you won't ever need again:
temp_iter = (element.text for element in raw_ratings)
temp_iter_cleanup1 = (element for element in temp_iter if element != 'RATING')
# search each element individually, rather than one large string
temp_iter_cleanup2 = (re.search(r"&nbsp\s*(-?\d+\.\d+)", element).group(1)
                      for element in temp_iter_cleanup1)
# here do a list comprehension so that you have the scrubbed data stored
final_ratings_list = [element for element in temp_iter_cleanup2 if element != home_team_advantage]
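If you are on Python 3.8+, the whole pipeline can also collapse into one comprehension with a precompiled pattern and the walrus operator; a sketch keeping the same filtering logic, assuming raw_ratings and home_team_advantage from above:
import re

rating_re = re.compile(r"&nbsp\s*(-?\d+\.\d+)")

final_ratings_list = [
    m.group(1)
    for element in raw_ratings
    if element.text != 'RATING'
    and (m := rating_re.search(element.text))
    and m.group(1) != home_team_advantage
]
This trades the named intermediate stages for a single expression; whether that is actually cleaner is a matter of taste.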

Why do I need to specify the size of this list? Otherwise it gives a 'list index out of range' error

I am trying to parse a list of urls from a webpage. I did the following things:
Got a list of all "a" tags.
Used a for loop to call get("href") on each tag.
While looping, I kept assigning the returned value to an empty list called links.
But I kept getting an index out of range error. I thought it might be because of the way I was incrementing the index of links, but I am sure that is not the case.
This is the error-prone code:
import urllib
import bs4
url = "http://tellerprimer.ucdavis.edu/pdf/"
response = urllib.urlopen(url)
webpage = response.read()
soup = bs4.BeautifulSoup(webpage, 'html.parser')
i = 0
links = []
for tags in soup.find_all('a'):
    links[i] = str(tags.get('href'))
    i += 1
print i, links
I gave links a fixed length and it fixed it, like so:
links = [0]*89 #89 is the length of soup.find_all('a')
I want to know what was causing this problem.
You are attempting to assign something to a non-existent index. When you create links, you create it as an empty list.
Then you do links[i], but links is empty, so there is no ith index.
The proper way to do this is:
links.append(str(tags.get('href')))
This also means that you can eliminate your i variable. It's not needed.
for tags in soup.find_all('a'):
    links.append(str(tags.get('href')))
print links
This will print all 89 links in your links list.
The list is initially empty, so you're trying to assign values to non-existing index locations in the list.
Use append() to add items to a list:
links = []
for tags in soup.find_all('a'):
    links.append(str(tags.get('href')))
Or use map() instead:
links = map(lambda tags: str(tags.get('href')), soup.find_all('a'))
Or use a list comprehension:
links = [str(tags.get('href')) for tags in soup.find_all('a')]

Nested for-loop and appending to empty objects

I am providing values to a website filter in order to generate different HTML, which I then parse. I want to save each page source to a different Python object in order to keep the data separate, so I have a list of empty lists that I will append to.
The challenge is how to append the td elements from a particular HTML source to the corresponding empty object in the list: the data from each iteration should go into its own separate object inside the list.
I will simplify my example:
years = ['2015', '2016']
weeks = ['1', '2']
store = [[], [], [], []]
This gives me 4 sets of HTML source that I need to capture:
for y in years:
    for w in weeks:
        # I will use y and w in webdriver.select to provide values for the web page filter
I will then use BS to copy page source for each iteration:
html = browser.page_source
soup = BeautifulSoup(html, "lxml")
And then iterate through the particular page source to extract td elements:
counter = 0
to provide an index into the store list, so that the td elements from each page can be appended to a separate empty object. Then:
for el in soup.find_all('td'):
    store[counter].append(el.get_text())
counter = counter + 1
get_text() strips the HTML markup from each element, and incrementing the counter is meant to move on to the next object in the store list.
But the result is that all the td elements get appended to the first object in the list instead of each HTML source having its own object. What am I missing?
Would it be better to somehow use the map function?
Your statement
counter = counter + 1
is not inside your page loops, so it never runs while you are appending.
You need to indent it so that it sits inside the for w loop, directly after the inner for el loop, so that counter is incremented once for each page source (see the sketch below).
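Putting the pieces together, the corrected structure might look something like this (a sketch; the filter-selection step is only indicated by a comment, and webdriver.Chrome() stands in for whichever configured Selenium WebDriver you use):
from bs4 import BeautifulSoup
from selenium import webdriver

years = ['2015', '2016']
weeks = ['1', '2']
store = [[] for _ in range(len(years) * len(weeks))]

browser = webdriver.Chrome()

counter = 0
for y in years:
    for w in weeks:
        # select y and w in the web page filter via the webdriver here
        soup = BeautifulSoup(browser.page_source, "lxml")
        for el in soup.find_all('td'):
            store[counter].append(el.get_text())
        counter = counter + 1  # advance once per page, inside the w loop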
