Using Beautiful Soup to Find Links Before a Certain Letter


I have a BeautifulSoup problem that hopefully you can help me out with.
Currently, I have a website with a lot of links on it. The links lead to pages that contain the data of that item that is linked. If you want to check it out, it's this one: http://ogle.astrouw.edu.pl/ogle4/ews/ews.html. What I ultimately want to accomplish is to print out the links of the data that are labeled with an 'N'. It may not be apparent at first, but if you look closely on the website, some of the data have 'N' after their Star No, and others do not. Afterwards, I use that link to download a file containing the information I need on that data. The website is very convenient because the download URLs only change a bit from data to data, so I only need to change a part of the URL, as you'll see in the code below.
I currently have accomplished the data downloading part. However, this is where you come in. Currently, I need to put in the identification number of the BLG event that I desire. (This will become apparent after you view the code below.) However, the website is consistently updating over time, and having to manually search for 'N' events takes up unnecessary time. I want the Python code to be able to do it for me. My original thoughts on the subject were that I could have BeautifulSoup search through the text for all N's, but I ran into some issues on accomplishing that. I feel like I am not familiar enough with BeautifulSoup to get done what I wish to get done. Some help would be appreciated.
The code I have currently is below. I have put in a range of BLG events that have the 'N' label as an example.
#Retrieve .gz files from URLs
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL))
#Select the desired data numbers
numbers = list(range(974,998))
x=0
for i in numbers:
    numbers[x] = str(i)
    x += 1
print(numbers)
#Get all links and put into list
allLinks = []
for link in soup.find_all('a'):
    list_links = link.get('href')
    allLinks.append(list_links)
#Remove None datatypes from link list
while None in allLinks:
    allLinks.remove(None)
#print(allLinks)
#Remove all links but links to data pages and gets rid of the '.html'
list_Bindices = [i for i, s in enumerate(allLinks) if 'b' in s]
print(list_Bindices)
bLinks = []
for x in list_Bindices:
    bLinks.append(allLinks[x])
bLinks = [s.replace('.html', '') for s in bLinks]
#print(bLinks)
#Create a list of indices for accessing those pages
list_Nindices = []
for x in numbers:
    list_Nindices.append([i for i, s in enumerate(bLinks) if x in s])
#print(type(list_Nindices))
#print(list_Nindices)
nindices_corrected = []
place = 0
while place < (len(list_Nindices)):
    a = list_Nindices[place]
    nindices_corrected.append(a[0])
    place = place + 1
#print(nindices_corrected)
#Get the page names (without the .html) from the indices
nLinks = []
for x in nindices_corrected:
    nLinks.append(bLinks[x])
#print(nLinks)
#Form the URLs for those pages
final_URLs = []
for x in nLinks:
    y = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/"+ x + "/phot.dat"
    final_URLs.append(y)
#print(final_URLs)
#Retrieve the data from the URLs
z = 0
for x in final_URLs:
    name = nLinks[z] + ".dat"
    #print(name)
    urllib.request.urlretrieve(x, name)
    z += 1
#hrm = urllib.request.urlretrieve("ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/blg-0974.tar.gz", "practice.gz")
This piece of code has taken me quite some time to write, as I am not a professional programmer, nor an expert in BeautifulSoup or URL manipulation in any way. In fact, I use MATLAB more than Python. As such, I tend to think in terms of MATLAB, which translates into less efficient Python code. However, efficiency is not what I am searching for in this problem. I can wait the extra five minutes for my code to finish if it means that I understand what is going on and can accomplish what I need to accomplish. Thank you for any help you can offer! I realize this is a fairly multi-faceted problem.

This should do it:
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL), 'html5lib')
Here, I'm using the html5lib parser to parse the page content (html5lib is a separate package and has to be installed alongside bs4).
Next, we'll look through the table, extracting links if the star names have a 'N' in them:
table = soup.find('table')
links = []
for tr in table.find_all('tr', {'class' : 'trow'}):
    td = tr.findChildren()
    if 'N' in td[4].text:
        links.append('http://ogle.astrouw.edu.pl/ogle4/ews/' + td[1].a['href'])
print(links)
Output:
['http://ogle.astrouw.edu.pl/ogle4/ews/blg-0974.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0975.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0976.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0977.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0978.html',
...
]
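To connect this back to the download step in the question, the page links can be turned into the FTP URLs the asker was already building. This is only a minimal sketch: it assumes the ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/<event>/phot.dat pattern from the question still holds, the helper name phot_url is made up for illustration, and nothing is downloaded here.

```python
import posixpath
from urllib.parse import urlparse

# Base of the photometry files, copied from the question (the year is hard-coded there too).
FTP_BASE = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/"


def phot_url(page_link):
    """Map an event page link (ending in e.g. blg-0974.html) to its phot.dat URL."""
    filename = posixpath.basename(urlparse(page_link).path)  # 'blg-0974.html'
    event, _ext = posixpath.splitext(filename)                # 'blg-0974'
    return FTP_BASE + event + "/phot.dat"


# Two of the links printed by the answer above.
links = [
    "http://ogle.astrouw.edu.pl/ogle4/ews/blg-0974.html",
    "http://ogle.astrouw.edu.pl/ogle4/ews/blg-0975.html",
]
final_urls = [phot_url(link) for link in links]
print(final_urls)
```

Each resulting URL could then be passed to urllib.request.urlretrieve(url, name), as in the last loop of the question.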

Related

Python BeautifulSoup - Improve readability of find by Id function?

I would like to improve the readability of the following code, especially lines 8 to 11
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
question1 = str(soup.find(id='i1'))
question1 = question1.split('>')[1].lstrip().split('.')[1]
question1 = question1[1:]
question1 = question1.replace("_", "")
print(question1)
Thanks in advance :)

You could use the following
question1 = soup.find(id='i1').getText().split(".")[1].replace("_","").strip()
to replace lines 8 to 11.
.getText() takes care of removing the html-tags. Rest is pretty much the same.
In Python you can almost always just chain operations. So your code would also be valid as a one-liner:
question1 = str(soup.find(id='i1')).split('>')[1].lstrip().split('.')[1][1:].replace("_", "")
But in most cases it is better to leave the code in a more readable form than to reduce the line-count.
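One thing to watch with split(".")[1] is that it keeps only the text between the first and second period, so a question that itself contains a period would be cut short. A possible alternative, sketched here with str.partition, splits on the first period only. The function name clean_question and the sample strings are hypothetical; the real form text may look different.

```python
def clean_question(text):
    """Strip a leading 'N.' number and any underscore filler from a question."""
    # partition splits on the first period only, so later periods are preserved.
    number, sep, rest = text.partition(".")
    if not sep or not number.strip().isdigit():
        rest = text  # no leading number: keep the text as it is
    return rest.replace("_", "").strip()


# Hypothetical strings standing in for soup.find(id='i1').getText()
print(clean_question("1. What is your name? ____"))
print(clean_question("2. Pick a version, e.g. 3.9 ____"))
```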

Abhinav, it is not very clear what you want to achieve. The script is already very simple, which is a good thing and follows a principle from The Zen of Python:
"Simple is better than complex."
It is also unclear what kind of improvement you mean:
Make it simpler, as in more understandable and clear for human beings?
Make it simpler for the machine to compute, hence improving performance?
Reduce the number of lines of code and follow programming guidelines more closely?
I point this out because next time it would be better to make this explicit in the question. That said, since I don't know exactly what you mean, I came up with an answer that more or less covers all three points:
ANSWER
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# ========= < FUNCTION TO GET ALL QUESTION DYNAMICALLY > ========= #
def clean_string_by_id(page, id):
    content = str(page.find(id=id)) # Get Content of page by different ids
    if content != 'None': # Check if there is actual content or not
        find_question = content.split('>') # NOTE: Split at tags closing
        if len(find_question) >= 2 and find_question[1][0].isdigit(): # NOTE: If len is 1 means that is not the correct element Also we check if the first element is a digit means that is correct
            cleaned_question = find_question[1].split('.')[1].strip() # We get the actual Question and strip it already !
            result = cleaned_question.replace('_', '')
            return result
    else:
        return
# ========= < Scan the entire page Dynamically + add result to a list> ========= #
all_questions = []
for i in range(1, 50): # NOTE: I went up to 50 but there may be many more, I let you test it
    get_question = clean_string_by_id(soup, f'i{i}')
    if get_question: # Append result to list only if there is actual content
        all_questions.append(get_question)
# ========= < show all results > ========= #
for question in all_questions:
    print(question)
NOTE
Here I'm assuming that you want to get all elements from this page, so you don't want to write 2000 variables. As you can see, I left the logic basically the same as yours but wrapped everything in a function instead.
In fact, the steps you followed were pretty good. Yes, you may "improve" the code or make it "smarter", but comprehensible code beats complex code. Also keep in mind that I assumed getting all the 'questions' from that Google Form was your goal.
EDIT
As pointed out by @wuerfelfreak, and as he explains in his answer, further improvement can be achieved by using the getText() function
Hence here the result of the above function using getText:
def clean_string_by_id(page, id):
    content = page.find(id=id)
    if content: # NOTE: Check that the element was found (find() returns None otherwise)
        find_question = content.getText() # NOTE: Text of the element without the HTML tags
        if find_question: # NOTE: An empty string means there is no text, so we skip it
            cleaned_question = find_question.split('.')[1].strip() # Same as before
            result = cleaned_question.replace('_', '')
            return result
Documentations & Guides
Zen of Python
getText
geeksforgeeks.org | isdigit()

How to avoid overwriting data when creating a list? Selenium Webdriver, Python

I want to scrape every page on the following website: https://www.top40.nl/top40/2020/week-34 (for each year and weeknumber) by clicking on the song, then move to 'songinfo' and then scrape all data in the table listed there. For this question, I only scraped the title so far.
This is the URL I use:
url = 'https://www.top40.nl/top40/'
However, when I print the songs_list, it will only return the last title on the website. As such, I believe I am overwriting.
Hopefully someone can explain which mistake(s) I am making. If there is an easier way to scrape the table on each page, I would be very happy to hear it.
Please find my python code below:
for year in range(2015,2016):
    for week in range(1,2):
        page_url = url+str(year) + '/' + 'week-' + str(week)
        driver.get(page_url)
        lists = driver.find_elements_by_xpath("//a[@data-linktype='title']")
        links = []
        for l in lists:
            print(l.get_attribute('href'))
            links.append(l.get_attribute('href'))
        for link in links:
            driver.get(link)
            driver.find_element_by_xpath("//a[@href='#songinfo']").click()
            songs = driver.find_elements_by_xpath(""".//*[@id="songinfo"]/table/tbody/tr[2]/td""")
            songs_list = []
            for s in songs:
                print(s.get_attribute('innerHTML'))
                songs_list.append(s.get_attribute('innerHTML'))

The line songs_list = [] is inside the for link in links loop, so with each new iteration it gets set to an empty list (and then you append to this new, empty list). Once all your loops have ended, you only see the last songs_list that was created.
The simplest fix is to place the songs_list = [] line outside all for loops, e.g.:
songs_list = []
for year in range(2015,2016):
    for week in range(1,2):
        # etc
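The effect of where the list is created can be reproduced without Selenium. The sketch below uses a hypothetical dictionary, pages, as a stand-in for the scraped song pages; only the position of the list initialization differs between the two loops.

```python
# Hypothetical stand-in for the scraped pages: link -> table cells found on it.
pages = {
    "song-a": ["Title A"],
    "song-b": ["Title B"],
    "song-c": ["Title C"],
}

# Buggy pattern: the list is recreated for every link, so only the last one survives.
for link in pages:
    overwritten = []
    for cell in pages[link]:
        overwritten.append(cell)

# Fixed pattern: create the list once, before any loop starts.
songs_list = []
for link in pages:
    for cell in pages[link]:
        songs_list.append(cell)

print(overwritten)  # ['Title C']
print(songs_list)   # ['Title A', 'Title B', 'Title C']
```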

How to access a specific object in a class HTML while web scraping with python

Now, I understand that this may be a simple question, but I don't know anything about HTML and I'm new to web scraping with python. I was wondering if anyone could tell me how to access this specific object in this class on this website (https://sky.lea.moe/stats/Igris/Apple). The specific object I want to access is in HTML below.
'''
Average Skill Level:
32.5 == $0
'''
My current code looks like this and prints out an empty list, and even if it did print, I only want it to print out everything from this specific line of code shown above.
import requests
import bs4
res = requests.get('https://sky.lea.moe/stats/Igris/Apple')
soup = bs4.BeautifulSoup(res.text, 'lxml')
type(soup)
skillAverageList = []
for i in soup.select('.stat-value'):
    skillAverageList.append(i.text)
Any help would be appreciated, hopefully this will further help me understand HTML and python as a whole. Thanks in advance.

import requests
from bs4 import BeautifulSoup
res = requests.get('https://sky.lea.moe/stats/Igris/Apple')
soup = BeautifulSoup(res.text, 'lxml')
print(soup.find("div", {"id":"additional_stats_container"}).find_all("div",class_="additional-stat")[-2].get_text(strip=True))
Output:
Average Skill Level:32.5

elements = soup.find_all("span", class_="stat-name")
# Get the first element that has "Average Skill" in its text
skill = next(i for i in elements if "Average Skill" in i.text)
idx = elements.index(skill) # its index, used to look up the value at the same position
values = soup.find_all("span", class_="stat-value")
value = values[idx] # the name and its value share the same index
print(skill.text + value.text)
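Matching names to values by list position works only while both lists stay aligned. As a rough alternative, each name can be read together with the value from the same container. The markup below is hypothetical and only modelled on the class names used in the answers above; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names used in the answers above.
html = """
<div id="additional_stats_container">
  <div class="additional-stat">
    <span class="stat-name">Total Skill XP: </span><span class="stat-value">1,234</span>
  </div>
  <div class="additional-stat">
    <span class="stat-name">Average Skill Level: </span><span class="stat-value">32.5</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Build a name -> value mapping by reading both spans from the same container.
stats = {}
for block in soup.select("div.additional-stat"):
    name = block.select_one("span.stat-name")
    value = block.select_one("span.stat-value")
    if name is not None and value is not None:
        key = name.get_text(strip=True).rstrip(":")
        stats[key] = value.get_text(strip=True)

print(stats["Average Skill Level"])
```

With a dictionary, the wanted statistic is looked up by name rather than by a position such as [-2].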

Python script extract from HTML

I'm writing a script that scans through a set of links. Within each link the script searches a table for a row. Once found, it increments the variable total_rank, which is the sum of the ranks found on each web page. The rank is equal to the row number.
The code looks like this and is outputting zero:
import requests
from bs4 import BeautifulSoup
import time
url_to_scrape = 'https://www.teamrankings.com/ncb/stats/'
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html.parser")
stat_links = []
for a in soup.select(".chooser-list ul"):
    list_entry = a.findAll('li')
    relative_link = list_entry[0].find('a')['href']
    link = "https://www.teamrankings.com" + relative_link
    stat_links.append(link)
total_rank = 0
for link in stat_links:
    r = requests.get(link)
    soup = BeautifulSoup(r.text, "html.parser")
    team_rows = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer table")
    for row in team_rows:
        if row.findAll('td')[1].text.strip() == 'Oklahoma':
            rank = row.findAll('td')[0].text.strip()
            total_rank = total_rank + rank
    # time.sleep(1)
print total_rank
While debugging, I found that team_rows is empty after the select() call. I've also tried different selectors, for example soup.select(".scroll-wrapper div") and soup.select("#DataTables_Table_0_wrapper div"), and all of them return nothing.

The selector
".tr-table datatable scrollable dataTable no-footer tr"
Selects a <tr> element anywhere under a <no-footer> element anywhere under a <dataTable> element....etc.
I think really "datatable scrollable dataTable no-footer" are classes on your .tr-table? So in that case, they should be joined with the first class with a period. So I believe the final correct selector is:
".tr-table.datatable.scrollable.dataTable.no-footer tr"
UPDATE: the new selector looks like this:
".tr-table.datatable.scrollable.dataTable.no-footer table"
The problem here is that the first part, .tr-table.datatable... refers to the table itself. Assuming you're trying to get the rows of this table:
<table class="tr-table datatable scrollable dataTable no-footer" id="DataTables_Table_0" role="grid">
The proper selector remains the one I originally suggested.

@audiodude's answer is correct, though the suggested selector is not working for me.
You don't need to check every single class of the table element. Here is the working selector:
team_rows = soup.select("table.datatable tr")
Also, if you need to find Oklahoma inside the table, you don't have to iterate over every row and cell. Just search directly for the specific cell and get the previous cell, which contains the rank:
rank = soup.find("td", {"data-sort": "Oklahoma"}).find_previous_sibling("td").get_text()
total_rank += int(rank) # it is important to convert the row number to int
Also note that you are extracting more stats links than you should - looks like the Player Stats links should not be followed since you are focused specifically on the Team Stats. Here is one way to get Team Stats links only:
links_list = soup.find("h2", text="Team Stats").find_next_sibling("ul")
stat_links = ["https://www.teamrankings.com" + a["href"]
              for a in links_list.select("ul.expand-content li a[href]")]
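The difference between the selectors discussed in these answers can be checked on a small piece of HTML. This is only a sketch: the table below is hypothetical and reuses the class list quoted in the first answer, so the counts say nothing about the live site.

```python
from bs4 import BeautifulSoup

# Hypothetical table using the class list quoted in the answer above.
html = """
<table class="tr-table datatable scrollable dataTable no-footer">
  <tr><td>1</td><td data-sort="Kansas">Kansas</td></tr>
  <tr><td>2</td><td data-sort="Oklahoma">Oklahoma</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Spaces mean "descendant element", so this looks for <datatable>, <scrollable>, ... tags.
descendant = soup.select(".tr-table datatable scrollable dataTable no-footer tr")

# Periods chain classes on the same element, so this matches the rows of the table.
compound = soup.select(".tr-table.datatable.scrollable.dataTable.no-footer tr")

# A shorter selector is enough when one class already identifies the table.
short = soup.select("table.datatable tr")

print(len(descendant), len(compound), len(short))

# Ranks come back as text and must be converted before they are summed.
total_rank = 0
cell = soup.find("td", {"data-sort": "Oklahoma"})
if cell is not None:
    total_rank += int(cell.find_previous_sibling("td").get_text())
print(total_rank)
```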

My loop isn't running

I'm writing a code in Python to get all the 'a' tags in a URL using Beautiful soup, then I use the link at position 3 and follow that link, I will repeat this process about 18 times. I included the code below, which has the process repeated 4 times. When I run this code I get the same 4 links in the results. I should get 4 different links. I think there is something wrong in my loop, specifically in the line that says y=url. I need help figuring out what the problem is.
import re
import urllib
from BeautifulSoup import *
list1=list()
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
for i in range (4): # repeat 4 times
    htm2= urllib.urlopen(url).read()
    soup1=BeautifulSoup(htm2)
    tags1= soup1('a')
    for tag1 in tags1:
        x2 = tag1.get('href', None)
        list1.append(x2)
    y= list1[2]
    if len(x2) < 3: # no 3rd link
        break # exit the loop
    else:
        url=y
    print y

You're continuing to add the third link EVER FOUND to your result list. Instead you should be adding the third link OF THAT ITERATION (which is soup('a')[2]), then reassigning your url and going again.
url = 'https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html'
result = []
for i in range(4):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    for link in links:
        result.append(link)
    try:
        third_link = links[2]['href']
    except IndexError: # less than three links
        break
    else:
        url = third_link
    print(url)
This is actually pretty simple in a recursive function:
def get_links(url):
    soup = BeautifulSoup(urllib.urlopen(url).read())
    links = soup('a')
    if len(links) < 3:
        # base case
        return links
    else:
        # recurse on third link
        return links + get_links(links[2]['href'])
You can even modify that to make sure you don't recurse too deep
def get_links(url, times=None):
    '''Returns all <a> tags from `url` and every 3rd link, up to `times` deep

    get_links("protocol://hostname.tld", times=2) -> list

    if times is None, recurse until there are fewer than 3 links to be found
    '''
    def _get_links(url, TTL):
        soup = BeautifulSoup(urllib.urlopen(url).read())
        links = soup('a')
        if (times is not None and TTL >= times) or \
                len(links) < 3:
            # base case
            return links
        else:
            return links + _get_links(links[2]['href'], TTL+1)
    return _get_links(url, 0)

Your current code
y= list1[2]
always picks the URL located at index 2 of list1. Since that list only gets appended to, list1[2] doesn't change. You should instead be selecting a different index each time if you want different URLs. I'm not sure what it is specifically that you're trying to print, but y= list1[-1], for instance, would end up printing the last URL added to the list on that iteration (different each time).
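The core of the fix, using only the links of the current page on each pass, can be tried without any network access. In this sketch the dictionary pages and the helper fetch_links are hypothetical stand-ins for downloading a page and collecting its href values, and it is written for Python 3.

```python
# Hypothetical link graph standing in for real pages: url -> hrefs found on that page.
pages = {
    "start": ["a1", "a2", "page2", "a4"],
    "page2": ["b1", "b2", "page3"],
    "page3": ["c1", "c2", "page4"],
    "page4": ["d1"],  # fewer than three links: the walk stops here
}


def fetch_links(url):
    """Stand-in for downloading `url` and collecting the href of every <a> tag."""
    return pages.get(url, [])


def follow_third_link(url, times):
    """Return the URLs visited by following the third link up to `times` times."""
    visited = []
    for _ in range(times):
        links = fetch_links(url)  # a fresh list for this page only
        if len(links) < 3:
            break
        url = links[2]            # third link of the current page
        visited.append(url)
    return visited


print(follow_third_link("start", 4))  # ['page2', 'page3', 'page4']
```

Because links is rebuilt on every pass, links[2] refers to the page just fetched rather than to the first page ever visited.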
