Python BeautifulSoup - Improve readability of find by Id function? - python

I would like to improve the readability following code, especially lines 8 to 11
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
question1 = str(soup.find(id='i1'))
question1 = question1.split('>')[1].lstrip().split('.')[1]
question1 = question1[1:]
question1 = question1.replace("_", "")
print(question1)
Thanks in advance :)

You could use the following
question1 = soup.find(id='i1').getText().split(".")[1].replace("_","").strip()
to replace lines 8 to 11.
.getText() takes care of removing the html-tags. Rest is pretty much the same.
In python you can almos always just chain operations. So your code would also be valid a a one-liner:
question1 = str(soup.find(id='i1')).split('>')[1].lstrip().split('.')[1][1:].replace("_", "")
But in most cases it is better to leave the code in a more readable form than to reduce the line-count.

Abhinav, is not very clear what you want to achieve, the script is actually already very simple which is a good thing and follow the Pythonic principle of The Zen of Python:
"Simple is better than complex."
Also is not comprehensive of what you actually mean:
Make it more simple as in Understandable and clear for Human beings?
Make it more simple for the machine to compute it, hence improve performance?
Reduce the line of codes and follow more the programming Guidelines?
I point this out because for next time would be better to make it more explicit in the question, having said that, as I don't know exactly what you mean, I come up with an answer that more or less covers all of 3 points:
ANSWER
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# ========= < FUNCTION TO GET ALL QUESTION DYNAMICALLY > ========= #
def clean_string_by_id(page, id):
content = str(page.find(id=id)) # Get Content of page by different ids
if content != 'None': # Check if there is actual content or not
find_question = content.split('>') # NOTE: Split at tags closing
if len(find_question) >= 2 and find_question[1][0].isdigit(): # NOTE: If len is 1 means that is not the correct element Also we check if the first element is a digit means that is correct
cleaned_question = find_question[1].split('.')[1].strip() # We get the actual Question and strip it already !
result = cleaned_question.replace('_', '')
return result
else:
return
# ========= < Scan the entire page Dynamically + add result to a list> ========= #
all_questions = []
for i in range(1, 50): # NOTE: I went up to 50 but there may be many more, I let you test it
get_question = clean_string_by_id(soup, f'i{i}')
if get_question: # Append result to list only if there is actual content
all_questions.append(get_question)
# ========= < show all results > ========= #
for question in all_questions:
print(question)
NOTE
Here I'm assuming that you want to get all elements from this page, hence you don't want to write 2000 variables, as you can see I left the logic basically the same as yours but I wrapped everything in a Function instead.
In fact the steps you follow were pretty good and yes you may "improve it" or make it "smarter" however comprehensible wins complexity. Also take in mind that I assumed that get all the 'questions' from that Google Forms was your goal.
EDIT
As pointed by #wuerfelfreak and as he explains in his answer further improvement can be achived by using getText() function
Hence here the result of the above function using getText:
def clean_string_by_id(page, id):
content = page.find(id=id)
if content: # NOTE: Check if there is actual content or not, same as if len(content) >= 0
find_question = content.getText() # NOTE: Split at tags closing
if find_question: # NOTE: same as do if len(findÑ_question) >= 1: ... If is 0 means that is a empty line so we skip it
cleaned_question = find_question.split('.')[1].strip() # Same as before
result = cleaned_question.replace('_', '')
return result
Documentations & Guides
Zen of Python
getText
geeksforgeeks.org | isdigit()

Related

How to call this function right?

Hello guys I'm currently trying to build a phone_numbers generator with two methods. The first one is phone_numbers and should be called every time you need new generated number. You can get the numbers from next_phone_numbers. This function should also call phone_numbers if there are no numbers available. However I always get an error message when I call it. Here is my code:
def phone_number(self):
request = requests.get('https://www.bestrandoms.com/random-at-phone-number')
soup = BeautifulSoup(request.content, 'html.parser')
self.phone_numbers = soup.select_one('#main > div > div.col-xs-12.col-sm-9.main > ul > textarea').text
self.phone_numbers = self.phone_numbers.splitlines()
for phone in range(len(self.phone_numbers)):
self.phone_numbers[phone] = '+43' + self.phone_numbers[phone][1:].replace(' ', '')
self.phone_numbers.extend(self.phone_numbers)
return self.phone_numbers
#property
def next_phone_number(self):
self.phone_index += 1
if self.phone_index >= len(self.phone_numbers):
self.phone_number()
return self.phone_numbers[self.phone_index]
error-message:
enter image description here
The issue is that the phone_number() function is not extending the self.phone_numbers list enough to cover the fact that the self.phone_index is out of range. Perhaps consider doing this to extend until it is large enough:
#property
def next_phone_number(self):
self.phone_index += 1
while self.phone_index >= len(self.phone_numbers):
self.phone_number()
return self.phone_numbers[self.phone_index]
When you do self.phone_numbers = soup.select_one(...), you're clobbering the old list of phone numbers you had in self.phone_numbers and replacing it with the data you've parsed from your web page. You later refine this into a new list, but it's still not doing what you want in terms of adding new numbers (I've not tried the code, I'd guess you always get the same amount of them from the web).
You should use a different variable name for the new data. That way you can extend the existing list with the new data without overwriting anything you already have:
def phone_number(self):
request = requests.get('https://www.bestrandoms.com/random-at-phone-number')
soup = BeautifulSoup(request.content, 'html.parser')
# these lines all use new local variables instead of clobbering self.phone_numbers
new_data = soup.select_one('#main > div > div.col-xs-12.col-sm-9.main > ul > textarea').text
new_numbers = new_data.splitlines()
new_numbers_reformatted = ['+43' + number[1:].replace(' ', '') for number in new_numbers]
# so now we can extend the list as desired
self.phone_numbers.extend(new_numbers_reformatted)
return self.phone_numbers
It's possible that you'll also need to change the initialization code in your class to make sure it initializes self.phone_numbers to an empty list, if it's not doing that already. The bad behavior of your method might have been covering up a bug if the list was not being created anywhere else.

Count stuck at 7

I am working on a project for school where I am creating a nutrition plan based off our schools nutrition menu. I am trying to create a dictionary with every item and its calorie content but for some reason the loop im using gets stuck at 7 and will never advance the rest of the list. To add to my dictionary. So when I search for a known key (Sour Cream) it throws and error because it is never added to the dictionary. I have also noticed it prints several numbers twice in a row as well double adding them to the dictionary.
edit: have discovered the double printing was from the print statement I had - still wondering about the 7 however
from bs4 import BeautifulSoup
import urllib3
import requests
url = "https://menus.sodexomyway.com/BiteMenu/Menu?menuId=14756&locationId=11870001&whereami=http://mnsu.sodexomyway.com/dining-near-me/university-dining-center"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")
allFood = soup.findAll('a', attrs={'class':'get-nutritioncalculator primary-textcolor'})
allCals = soup.findAll('a', attrs={'class':'get-nutrition primary-textcolor'})
nums = '0123456789'
def printData(charIndex):
for char in allFood[charIndex].contents:
print(char)
for char in allCals[charIndex].contents:
print(char)
def getGoals():
userCalories = int(input("Please input calorie goal for the day (kC): "))
#Display Info (Text/RsbPi)
fullList = {}
def compileFood():
foodCount = 0
for food in allFood:
print(foodCount)
for foodName in allFood[foodCount].contents:
fullList[foodName] = 0
foodCount += 1
print(foodCount)
compileFood()
print(fullList['Sour Cream'])
Any help would be great. Thanks!
Ok first why is this happening:
The reason is because the food on the index 7 is empty. Because it's empty it will never enter your for loop and therefore never increase your foodCount => it will stuck at 7 forever.
So if you would shift your index increase outside of the for loop it would work without a problem.
But you doing something crude here.
You already iterate through the food item and still use an additional variable.
You could solve it smarter this way:
def compileFood():
for food in allFood:
for foodName in food.contents:
fullList[foodName] = 0
With this you don't need to care about an additional variable at all.

Storing Multi-dimensional Lists?

(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague. I'm writing a program which takes in data like what characters you have and what missions are requiring you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the Header's "Questions" / "Titles" ; ie Maximum Level, 50
This is true for everything but the first entry which is the Name, Pronunciation (and I just want to store the name of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'
def make_soup(url):
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
return soup
def get_links(url):
soup = make_soup(url)
a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
links = [urljoin(URLPrefix, a['href'])for a in a_tags] # convert relative url to absolute url
return links
def get_tds(link):
soup = make_soup(link)
#tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
tds = soup.find_all('table', class_="wikia-infobox")
RowArray = []
HeaderArray = []
if tds:
for td in tds:
#print(td.text.strip()) #This is everything
rows = td.findChildren('tr')#[0]
headers = td.findChildren('th')#[0]
for row in rows:
cells = row.findChildren('td')
for cell in cells:
cell_content = cell.getText()
clean_content = re.sub( '\s+', ' ', cell_content).strip()
if clean_content:
RowArray.append(clean_content)
for row in rows:
cells = row.findChildren('th')
for cell in cells:
cell_content = cell.getText()
clean_content = re.sub( '\s+', ' ', cell_content).strip()
if clean_content:
HeaderArray.append(clean_content)
print(HeaderArray)
print(RowArray)
return(RowArray, HeaderArray)
#Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
#print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
#TempFile = open(fullname, 'w') #Read only, Write Only, Append
#TempFile.write("EHLLO")
#TempFile.close()
#print(td.tbody.Series)
#print(td.tbody[Series])
#print(td.tbody["Series"])
#print(td.data-name)
#time.sleep(1)
if __name__ == '__main__':
links = get_links(StartURL)
MainHeaderArray = []
MainRowArray = []
MaxIterations = 60
Iterations = 0
for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
#print("Getting tds calling")
if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
TempRA, TempHA = get_tds(link)
MainHeaderArray.append(TempHA)
MainRowArray.append(TempRA)
MaxIterations -= 1
Iterations += 1
#print(MaxIterations)
if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
break
#print("This is the end ??")
#time.sleep(3)
#jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
print(MainHeaderArray)
#time.sleep(2.5)
#print(MainRowArray)
#time.sleep(2.5)
#print(zip())
TsumName = []
TsumSeries = []
TsumBoxType = []
TsumSkillDescription = []
TsumFullCharge = []
TsumMinScore = []
TsumScoreIncreasePerLevel = []
TsumMaxScore = []
TsumFullUpgrade = []
Iterations = 0
MaxIterations = len(MainRowArray)
while Iterations <= MaxIterations: #This will fire 1 time per Tsum
print(Iterations)
print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
print(MainHeaderArray[Iterations+1][0])
print(MainHeaderArray[Iterations+2][0])
print(MainHeaderArray[Iterations+3][0])
TsumName.append(MainHeaderArray[Iterations][0])
print(MainRowArray[Iterations][1])
#At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
#Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
TsumSeries[Iterations] = MainRowArray[Iterations+1]
TsumBoxType[Iterations] = MainRowArray[Iterations+2]
TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
TsumMinScore[Iterations] = MainRowArray[Iterations+5]
TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
Iterations += 9
print(Iterations)
print("It's Over")
time.sleep(3)
print(TsumName)
print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use the with open file as json_file , write/read (super easy).
Ultimately stored 3 json files. No big deal. Much easier than appending into one big file.

Using Beautiful Soup to Find Links Before a Certain Letter

I have a BeautifulSoup problem that hopefully you can help me out with.
Currently, I have a website with a lot of links on it. The links lead to pages that contain the data of that item that is linked. If you want to check it out, it's this one: http://ogle.astrouw.edu.pl/ogle4/ews/ews.html. What I ultimately want to accomplish is to print out the links of the data that are labeled with an 'N'. It may not be apparent at first, but if you look closely on the website, some of the data have 'N' after their Star No, and others do not. Afterwards, I use that link to download a file containing the information I need on that data. The website is very convenient because the download URLs only change a bit from data to data, so I only need to change a part of the URL, as you'll see in the code below.
I currently have accomplished the data downloading part. However, this is where you come in. Currently, I need to put in the identification number of the BLG event that I desire. (This will become apparent after you view the code below.) However, the website is consistently updating over time, and having to manually search for 'N' events takes up unnecessary time. I want the Python code to be able to do it for me. My original thoughts on the subject were that I could have BeautifulSoup search through the text for all N's, but I ran into some issues on accomplishing that. I feel like I am not familiar enough with BeautifulSoup to get done what I wish to get done. Some help would be appreciated.
The code I have currently is below. I have put in a range of BLG events that have the 'N' label as an example.
#Retrieve .gz files from URLs
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL))
#Select the desired data numbers
numbers = list(range(974,998))
x=0
for i in numbers:
numbers[x] = str(i)
x += 1
print(numbers)
#Get all links and put into list
allLinks = []
for link in soup.find_all('a'):
list_links = link.get('href')
allLinks.append(list_links)
#Remove None datatypes from link list
while None in allLinks:
allLinks.remove(None)
#print(allLinks)
#Remove all links but links to data pages and gets rid of the '.html'
list_Bindices = [i for i, s in enumerate(allLinks) if 'b' in s]
print(list_Bindices)
bLinks = []
for x in list_Bindices:
bLinks.append(allLinks[x])
bLinks = [s.replace('.html', '') for s in bLinks]
#print(bLinks)
#Create a list of indices for accessing those pages
list_Nindices = []
for x in numbers:
list_Nindices.append([i for i, s in enumerate(bLinks) if x in s])
#print(type(list_Nindices))
#print(list_Nindices)
nindices_corrected = []
place = 0
while place < (len(list_Nindices)):
a = list_Nindices[place]
nindices_corrected.append(a[0])
place = place + 1
#print(nindices_corrected)
#Get the page names (without the .html) from the indices
nLinks = []
for x in nindices_corrected:
nLinks.append(bLinks[x])
#print(nLinks)
#Form the URLs for those pages
final_URLs = []
for x in nLinks:
y = "ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/"+ x + "/phot.dat"
final_URLs.append(y)
#print(final_URLs)
#Retrieve the data from the URLs
z = 0
for x in final_URLs:
name = nLinks[z] + ".dat"
#print(name)
urllib.request.urlretrieve(x, name)
z += 1
#hrm = urllib.request.urlretrieve("ftp://ftp.astrouw.edu.pl/ogle/ogle4/ews/2017/blg-0974.tar.gz", "practice.gz")
This piece of code has taken me quite some time to write, as I am not a professional programmer, nor an expert in BeautifulSoup or URL manipulation in any way. In fact, I use MATLAB more than Python. As such, I tend to think in terms of MATLAB, which translates into less efficient Python code. However, efficiency is not what I am searching for in this problem. I can wait the extra five minutes for my code to finish if it means that I understand what is going on and can accomplish what I need to accomplish. Thank you for any help you can offer! I realize this is a fairly muti-faceted problem.
This should do it:
from urllib.request import urlopen
import urllib.request
from bs4 import BeautifulSoup
#Access website
URL = 'http://ogle.astrouw.edu.pl/ogle4/ews/ews.html'
soup = BeautifulSoup(urlopen(URL), 'html5lib')
Here, I'm using the html5lib to parse the url content.
Next, we'll look through the table, extracting links if the star names have a 'N' in them:
table = soup.find('table')
links = []
for tr in table.find_all('tr', {'class' : 'trow'}):
td = tr.findChildren()
if 'N' in td[4].text:
links.append('http://ogle.astrouw.edu.pl/ogle4/ews/' + td[1].a['href'])
print(links)
Output:
['http://ogle.astrouw.edu.pl/ogle4/ews/blg-0974.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0975.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0976.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0977.html', 'http://ogle.astrouw.edu.pl/ogle4/ews/blg-0978.html',
...
]

How to increment between pages using Selenium and BeautifulSoup?

I'm trying to get my code to increment through the pages of this website and I can't seem to get it to loop and increment, instead doing the first page, and giving up. Is there something I'm doing wrong?
if(pageExist is not None):
if(countitup != pageNum):
countitup = countitup + 1
driver.get('http://800notes.com/Phone.aspx/%s/%s' % (tele800,countitup))
delay = 4
scamNum = soup.find_all(text=re.compile(r"Scam"))
spamNum = soup.find_all(text=re.compile(r"Call type: Telemarketer"))
debtNum = soup.find_all(text=re.compile(r"Call type: Debt Collector"))
hospitalNum = soup.find_all(text=re.compile(r"Hospital"))
scamCount = len(scamNum) + scamCount
spamCount = len(spamNum) + spamCount
debtCount = len(debtNum) + debtCount
hospitalCount = len(hospitalNum) + hospitalCount
block = soup.find(text=re.compile(r"OctoNet HTTP filter"))
extrablock = soup.find(text=re.compile(r"returning an unknown error"))
type(block) is str
type(extrablock) is str
if(block is not None or extrablock is not None):
print("\n Damn. Gimme an hour to fix this.")
time.sleep(2000)
Repo: https://github.com/GarnetSunset/Haircuttery/tree/Experimental
pageExist is not None this seems to be the problem.
Since it checks whether the page is None and it will most likely never be none. There is no official way to check for HTTP responses but we can use something like this.
if (soup.find_element_by_xpath('/html/body/p'[contains(text(),'400')])
#this will check if there's a 400 code in the p tag.
or
if ('400' in soup.find_element_by_xpath('/html/body/p[1]').text)
I'm sure there are other ways that one can do this but this is one of those and so that's the only issue here. You can then increment or keep the rest of your code pretty much as soon as you fix the first if .
I might have made some mistakes (syntax) in my code since I'm not testing it but the logic applies), great code tho!
Also instead of
type(block) is str
type(extrablock) is str
the pythonic way is using
isinstace
isinstance(block, str)
isinstance(extrablock, str)
and as for time.sleep you can use WebDriverWait there are two available methods, implicit and explicit wait, please take a look here.

Categories

Resources