I am working on a project for school where I am creating a nutrition plan based off our schools nutrition menu. I am trying to create a dictionary with every item and its calorie content but for some reason the loop im using gets stuck at 7 and will never advance the rest of the list. To add to my dictionary. So when I search for a known key (Sour Cream) it throws and error because it is never added to the dictionary. I have also noticed it prints several numbers twice in a row as well double adding them to the dictionary.
edit: have discovered the double printing was from the print statement I had - still wondering about the 7 however
from bs4 import BeautifulSoup
import urllib3
import requests
url = "https://menus.sodexomyway.com/BiteMenu/Menu?menuId=14756&locationId=11870001&whereami=http://mnsu.sodexomyway.com/dining-near-me/university-dining-center"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html5lib")
allFood = soup.findAll('a', attrs={'class':'get-nutritioncalculator primary-textcolor'})
allCals = soup.findAll('a', attrs={'class':'get-nutrition primary-textcolor'})
nums = '0123456789'
def printData(charIndex):
for char in allFood[charIndex].contents:
print(char)
for char in allCals[charIndex].contents:
print(char)
def getGoals():
userCalories = int(input("Please input calorie goal for the day (kC): "))
#Display Info (Text/RsbPi)
fullList = {}
def compileFood():
foodCount = 0
for food in allFood:
print(foodCount)
for foodName in allFood[foodCount].contents:
fullList[foodName] = 0
foodCount += 1
print(foodCount)
compileFood()
print(fullList['Sour Cream'])
Any help would be great. Thanks!
Ok first why is this happening:
The reason is because the food on the index 7 is empty. Because it's empty it will never enter your for loop and therefore never increase your foodCount => it will stuck at 7 forever.
So if you would shift your index increase outside of the for loop it would work without a problem.
But you doing something crude here.
You already iterate through the food item and still use an additional variable.
You could solve it smarter this way:
def compileFood():
for food in allFood:
for foodName in food.contents:
fullList[foodName] = 0
With this you don't need to care about an additional variable at all.
Related
I'm trying to iterate through a list of NFL QBs (over 100) and add create a list of links that I will use later.
The links follow a standard format, however if there are multiple players with the same name (such as 'Josh Allen') the link format needs to change.
I've been trying to do this with different nested while/for loops with Try/Except with little to no success. This is what I have so far:
test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list=[]
name_int = 0
for names in test:
try:
q_b_name = names.split()
link1=q_b_name[1][0].capitalize()
link2=q_b_name[1][0:4].capitalize()+q_b_name[0][0:2].capitalize()+f'0{name_int}'
q_b = pd.read_html(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
q_b1 = q_b[0]
#filter_status is a function that only works with QB data
df = filter_stats(q_b1)
#triggers the try if the link wasn't a QB
df.head(5)
empty_list.append(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
except:
#adds one to the variable to change the link to find the proper QB link
name_int += 1
The result only appends the final correct link. I need to append each correct link to the empty list.
Still a beginner in Python and trying to challenge myself with different projects. Thanks!
As stated, the try/except will work in that it will try the code under the try block. If at any point within that block it fails or raises and exception/error, it goes and executes the block of code under the except.
There are better ways to go about this problem (for example, I'd use BeautifulSoup to simply check the html for the "QB" position), but since you are a beginner, I think trying to learn this process will help you understand the loops.
So what this code does:
1 It formats your player name into the link format.
2 We initialize a while loop that will it will enter
3 It gets the table.
4a) It enters a function that checks if the table contains 'passing'
stats by looking at the column headers.
4b) If it finds 'passing' in the column, it will return a True statement to indicate it is a "QB" type of table (keep in mind sometimes there might be runningbacks or other positions who have passing stats, but we'll ignore that). If it returns True, the while loop will stop and go to the next name in your test list
4c) If it returns False, it'll increment your name_int and check the next one
5 To take care of a case where it never finds a QB table, the while loop will go to False if it tries 10 iterations
Code:
import pandas as pd
def check_stats(q_b1):
for col in q_b1.columns:
if 'passing' in col.lower():
return True
return False
test = ['Josh Allen', 'Lamar Jackson', 'Derek Carr']
empty_list=[]
for names in test:
name_int = 0
q_b_name = names.split()
link1=q_b_name[1][0].capitalize()
qbStatsInTable = False
while qbStatsInTable == False:
link2=q_b_name[1][0:4].capitalize()+q_b_name[0][0:2].capitalize()+f'0{name_int}'
url = f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/'
try:
q_b = pd.read_html(url, header=0)
q_b1 = q_b[0]
except Exception as e:
print(e)
break
#Check if "passing" in the table columns
qbStatsInTable = check_stats(q_b1)
if qbStatsInTable == True:
print(f'{names} - Found QB Stats in {link1}/{link2}/gamelog/')
empty_list.append(f'https://www.pro-football-reference.com/players/{link1}/{link2}/gamelog/')
else:
name_int += 1
if name_int == 10:
print(f'Did not find a link for {names}')
qbStatsInTable = False
Output:
print(empty_list)
['https://www.pro-football-reference.com/players/A/AlleJo02/gamelog/', 'https://www.pro-football-reference.com/players/J/JackLa00/gamelog/', 'https://www.pro-football-reference.com/players/C/CarrDe02/gamelog/']
I would like to improve the readability following code, especially lines 8 to 11
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
question1 = str(soup.find(id='i1'))
question1 = question1.split('>')[1].lstrip().split('.')[1]
question1 = question1[1:]
question1 = question1.replace("_", "")
print(question1)
Thanks in advance :)
You could use the following
question1 = soup.find(id='i1').getText().split(".")[1].replace("_","").strip()
to replace lines 8 to 11.
.getText() takes care of removing the html-tags. Rest is pretty much the same.
In python you can almos always just chain operations. So your code would also be valid a a one-liner:
question1 = str(soup.find(id='i1')).split('>')[1].lstrip().split('.')[1][1:].replace("_", "")
But in most cases it is better to leave the code in a more readable form than to reduce the line-count.
Abhinav, is not very clear what you want to achieve, the script is actually already very simple which is a good thing and follow the Pythonic principle of The Zen of Python:
"Simple is better than complex."
Also is not comprehensive of what you actually mean:
Make it more simple as in Understandable and clear for Human beings?
Make it more simple for the machine to compute it, hence improve performance?
Reduce the line of codes and follow more the programming Guidelines?
I point this out because for next time would be better to make it more explicit in the question, having said that, as I don't know exactly what you mean, I come up with an answer that more or less covers all of 3 points:
ANSWER
import requests
from bs4 import BeautifulSoup
URL = 'https://docs.google.com/forms/d/e/1FAIpQLSd5tU8isVcqd02ymC2n952LC2Nz_FFPd6NT1lD4crDeSsJi2w/viewform?usp=sf_link'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# ========= < FUNCTION TO GET ALL QUESTION DYNAMICALLY > ========= #
def clean_string_by_id(page, id):
content = str(page.find(id=id)) # Get Content of page by different ids
if content != 'None': # Check if there is actual content or not
find_question = content.split('>') # NOTE: Split at tags closing
if len(find_question) >= 2 and find_question[1][0].isdigit(): # NOTE: If len is 1 means that is not the correct element Also we check if the first element is a digit means that is correct
cleaned_question = find_question[1].split('.')[1].strip() # We get the actual Question and strip it already !
result = cleaned_question.replace('_', '')
return result
else:
return
# ========= < Scan the entire page Dynamically + add result to a list> ========= #
all_questions = []
for i in range(1, 50): # NOTE: I went up to 50 but there may be many more, I let you test it
get_question = clean_string_by_id(soup, f'i{i}')
if get_question: # Append result to list only if there is actual content
all_questions.append(get_question)
# ========= < show all results > ========= #
for question in all_questions:
print(question)
NOTE
Here I'm assuming that you want to get all elements from this page, hence you don't want to write 2000 variables, as you can see I left the logic basically the same as yours but I wrapped everything in a Function instead.
In fact the steps you follow were pretty good and yes you may "improve it" or make it "smarter" however comprehensible wins complexity. Also take in mind that I assumed that get all the 'questions' from that Google Forms was your goal.
EDIT
As pointed by #wuerfelfreak and as he explains in his answer further improvement can be achived by using getText() function
Hence here the result of the above function using getText:
def clean_string_by_id(page, id):
content = page.find(id=id)
if content: # NOTE: Check if there is actual content or not, same as if len(content) >= 0
find_question = content.getText() # NOTE: Split at tags closing
if find_question: # NOTE: same as do if len(findÑ_question) >= 1: ... If is 0 means that is a empty line so we skip it
cleaned_question = find_question.split('.')[1].strip() # Same as before
result = cleaned_question.replace('_', '')
return result
Documentations & Guides
Zen of Python
getText
geeksforgeeks.org | isdigit()
I am trying to create a program that takes user input of a suburb and then returns a list of good cafes in that suburb.
The webpage I am scraping has a number of good cafe lists for various suburbs but has not got a list for every suburb where I live.
What I have done so far is create the code to get a list of suburbs, then created the code to scrape the webpage for the "best of" cafes based on using a f-string. My old code meant I would have to manually enter every suburb that the website has a page for as elif statements. Like this:
def cafe_search():
user_suburb = input("What Suburb?")
if user_suburb == "Thornbury":
print(get_cafes("thornbury"))
elif user_suburb == "Northcote":
print(get_cafes("northcote"))
elif user_suburb == "Carlton":
print(get_cafes("carlton"))
But I am trying to find a way to use a "suburb_list" I pull from wikipedia and then match it with the user's input to add to the f-string expression and then check whether that suburb has a cafe listing. I trying to do this with this f-string:
f"https://www.broadsheet.com.au/melbourne/guides/best-cafes-{user_suburb}"
I am trying to use the any() function to do this... not sure how successful that will be? I would be really grateful for any tips. PS, I am pretty new to all of this and this is my first project so my question may be a bit clumsy and my code inefficient, apologies!
#import stuff to open and scrape websites
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests
from requests import get
#suburbs
#open suburb listing
url_suburbs = "https://en.wikipedia.org/wiki/List_of_Melbourne_suburbs"
html_suburbs = urlopen(url_suburbs)
soup_suburb_list = BeautifulSoup(html_suburbs, 'html.parser')
type(soup_suburb_list)
#grab suburb names
suburbs_containers = soup_suburb_list.select(".mw-parser-output > ul")
suburbs = []
for container in suburbs_containers:
suburb_list = container.find_all('a')
for suburb in suburb_list:
suburbs.append(suburb.text)
#cafes
def get_cafes(user_suburb):
#open url
url_cafes = f"https://www.broadsheet.com.au/melbourne/guides/best-cafes-{user_suburb}"
html_cafes = urlopen(url_cafes)
#create beautiful soup object for cafes
soup_cafe_list = BeautifulSoup(html_cafes, 'html.parser')
type(soup_cafe_list)
#grab cafe names
cafe_names = soup_cafe_list.find_all("h2", class_= "venue-title")
print (cafe_names)
#function to search cafes
def cafe_search():
user_suburb = input("What Suburb?")
if user_suburb == any(suburbs):
print(get_cafes("user_suburb"))
any(mylist) returns True if anything in mylist is True in the Python way of testing True - it’s like an or on mylist values. Similarly all(Mylist) is like an and on the values in mylist. See docs.python.org/3.8/library/functions.html#any
So for your code to check:
if user_suburb == any(suburbs):
any(suburbs) will return True if suburbs is not zero-length, and user_suburb is not the value True, so the test will always fail.
More on Python's testing for True is here, see below Truth Value Testing - worth reading! - https://docs.python.org/3.8/library/stdtypes.html
So you can't use any() but the in operator works against a list, so change:
if user_suburb == any(suburbs)
to
if user_suburb in suburbs
You’ll probably need to be careful that case is consistent as in is case-sensitive - may be simplest to lowercase everything going into suburbs and to lowercase user_suburb before using in
To lowercase the suburbs list change e.g.
suburbs.append(suburb.text)
to
suburbs.append(suburb.text.lower())
and change the check to:
if user_suburb.lower() in suburbs:
I've written a script in python using openpyxl to get some names and its correspoding values from Sheet1 and use them as parameters meant to be passed in an url to make it a valid url. The problem is when I run my script, it keeps on printing urls even when there are only 5 of them in Sheet1. So far my knowledge goes, the way I defined max row is accurate. How the max row becomes unlimited?
This is the script:
import requests
from openpyxl import load_workbook
wb = load_workbook('ReverseSearch.xlsx')
ws = wb['Sheet1']
def search_name(session,query,query1):
session.headers = {"User-Agent":"Mozilla/5.0"}
res = session.get(url.format(query,query1))
print(res.url)
if __name__ == '__main__':
url = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"
for row in range(2, ws.max_row + 1): #I used row 2 cause there are headers in row 1
key = ws.cell(row=row,column=1).value
key1 = ws.cell(row=row,column=2).value
session = requests.Session()
search_name(session,key,key1)
names I used:
café claude
sears fine food
chaat cafe
bean bag coffee house
primo patio cafe
values I used:
3392129
473113343
18528177
12192803
641231
I'm supposed to get only 5 links (fully qualified) but Im getting blank urls when there are no parameters left.
https://www.yellowpages.com/san-francisco-ca/mip/cafe-claude-3392129?lid=3392129
https://www.yellowpages.com/san-francisco-ca/mip/sears-fine-food-473113343?lid=473113343
https://www.yellowpages.com/san-francisco-ca/mip/chaat-cafe-18528177?lid=18528177
https://www.yellowpages.com/san-francisco-ca/mip/bean-bag-coffee-house-12192803?lid=12192803
https://www.yellowpages.com/san-francisco-ca/mip/primo-patio-cafe-641231?lid=641231
https://www.yellowpages.com/los-angeles-ca/mip/None-None
https://www.yellowpages.com/los-angeles-ca/mip/None-None
https://www.yellowpages.com/los-angeles-ca/mip/None-None
I wish my script stops when 5 links are printed.
Btw, this is how the url looks like:
url = "https://www.yellowpages.com/los-angeles-ca/mip/{}-{}"
I would put this as a comment but I don't have enough rep.
My first trouble shooting step would be to check what you get if you do?
print(ws.max_row)
Does it print 7?
If it prints a bigger number it might be counting empty rows in your document, in that case you would need to check the content of your cells a break the loop
(Code below)
I'm scraping a website and the data I'm getting back is in 2 multi-dimensional arrays. I'm wanting everything to be in a JSON format because I want to save this and load it in again later when I add "tags".
So, less vague. I'm writing a program which takes in data like what characters you have and what missions are requiring you to do (you can complete multiple at once if the attributes align), and then checks that against a list of attributes that each character fulfills and returns a sorted list of the best characters for the context.
Right now I'm only scraping character data but I've already "got" the attribute data per character - the problem there was that it wasn't sorted by name so it was just a randomly repeating list that I needed to be able to look up. I still haven't quite figured out how to do that one.
Right now I have 2 arrays, 1 for the headers of the table and one for the rows of the table. The rows contain the "Answers" for the Header's "Questions" / "Titles" ; ie Maximum Level, 50
This is true for everything but the first entry which is the Name, Pronunciation (and I just want to store the name of course).
So:
Iterations = 0
While loop based on RowArray length / 9 (While Iterations <= that)
HeaderArray[0] gives me the name
RowArray[Iterations + 1] gives me data type 2
RowArray[Iterations + 2] gives me data type 3
Repeat until Array[Iterations + 8]
Iterations +=9
So I'm going through and appending these to separate lists - single arrays like CharName[] and CharMaxLevel[] and so on.
But I'm actually not sure if that's going to make this easier or not? Because my end goal here is to send "CharacterName" and get stuff back based on that AND be able to send in "DesiredTraits" and get "CharacterNames who fit that trait" back. Which means I also need to figure out how to store that category data semi-efficiently. There's over 80 possible categories and most only fit into about 10. I don't know how I'm going to store or load that data.
I'm assuming JSON is the best way? And I'm trying to keep it all in one file for performance and code readability reasons - don't want a file for each character.
CODE: (Forgive me, I've never scraped anything before + I'm actually somewhat new to Python - just got it 4? days ago)
https://pastebin.com/yh3Z535h
^ In the event anyone wants to run this and this somehow makes it easier to grab the raw code (:
import time
import requests, bs4, re
from urllib.parse import urljoin
import json
import os
target_dir = r"D:\00Coding\Js\WebScraper" #Yes, I do know that storing this in my Javascript folder is filthy
fullname = os.path.join(target_dir,'TsumData.txt')
StartURL = 'http://disneytsumtsum.wikia.com/wiki/Skill_Upgrade_Chart'
URLPrefix = 'http://disneytsumtsum.wikia.com'
def make_soup(url):
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
return soup
def get_links(url):
soup = make_soup(url)
a_tags = soup.find_all('a', href=re.compile(r"^/wiki/"))
links = [urljoin(URLPrefix, a['href'])for a in a_tags] # convert relative url to absolute url
return links
def get_tds(link):
soup = make_soup(link)
#tds = soup.find_all('li', class_="category normal") #This will give me the attributes / tags of each character
tds = soup.find_all('table', class_="wikia-infobox")
RowArray = []
HeaderArray = []
if tds:
for td in tds:
#print(td.text.strip()) #This is everything
rows = td.findChildren('tr')#[0]
headers = td.findChildren('th')#[0]
for row in rows:
cells = row.findChildren('td')
for cell in cells:
cell_content = cell.getText()
clean_content = re.sub( '\s+', ' ', cell_content).strip()
if clean_content:
RowArray.append(clean_content)
for row in rows:
cells = row.findChildren('th')
for cell in cells:
cell_content = cell.getText()
clean_content = re.sub( '\s+', ' ', cell_content).strip()
if clean_content:
HeaderArray.append(clean_content)
print(HeaderArray)
print(RowArray)
return(RowArray, HeaderArray)
#Output = json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1)
#print(json.dumps([dict(zip(RowArray, row_2)) for row_2 in HeaderArray], indent=1))
#TempFile = open(fullname, 'w') #Read only, Write Only, Append
#TempFile.write("EHLLO")
#TempFile.close()
#print(td.tbody.Series)
#print(td.tbody[Series])
#print(td.tbody["Series"])
#print(td.data-name)
#time.sleep(1)
if __name__ == '__main__':
links = get_links(StartURL)
MainHeaderArray = []
MainRowArray = []
MaxIterations = 60
Iterations = 0
for link in links: #Specifically I'll need to return and append the arrays here because they're being cleared repeatedly.
#print("Getting tds calling")
if Iterations > 38: #There are this many webpages it'll first look at that don't have the data I need
TempRA, TempHA = get_tds(link)
MainHeaderArray.append(TempHA)
MainRowArray.append(TempRA)
MaxIterations -= 1
Iterations += 1
#print(MaxIterations)
if MaxIterations <= 0: #I don't want to scrape the entire website for a prototype
break
#print("This is the end ??")
#time.sleep(3)
#jsonized = map(lambda item: {'Name':item[0], 'Series':item[1]}, zip())
print(MainHeaderArray)
#time.sleep(2.5)
#print(MainRowArray)
#time.sleep(2.5)
#print(zip())
TsumName = []
TsumSeries = []
TsumBoxType = []
TsumSkillDescription = []
TsumFullCharge = []
TsumMinScore = []
TsumScoreIncreasePerLevel = []
TsumMaxScore = []
TsumFullUpgrade = []
Iterations = 0
MaxIterations = len(MainRowArray)
while Iterations <= MaxIterations: #This will fire 1 time per Tsum
print(Iterations)
print(MainHeaderArray[Iterations][0]) #Holy this gives us Mickey ;
print(MainHeaderArray[Iterations+1][0])
print(MainHeaderArray[Iterations+2][0])
print(MainHeaderArray[Iterations+3][0])
TsumName.append(MainHeaderArray[Iterations][0])
print(MainRowArray[Iterations][1])
#At this point it will, of course, crash - that's because I only just realized I needed to append AND I just realized that everything
#Isn't stored in a list as I thought, but rather a multi-dimensional array (as you can see below I didn't know this)
TsumSeries[Iterations] = MainRowArray[Iterations+1]
TsumBoxType[Iterations] = MainRowArray[Iterations+2]
TsumSkillDescription[Iterations] = MainRowArray[Iterations+3]
TsumFullCharge[Iterations] = MainRowArray[Iterations+4]
TsumMinScore[Iterations] = MainRowArray[Iterations+5]
TsumScoreIncreasePerLevel[Iterations] = MainRowArray[Iterations+6]
TsumMaxScore[Iterations] = MainRowArray[Iterations+7]
TsumFullUpgrade[Iterations] = MainRowArray[Iterations+8]
Iterations += 9
print(Iterations)
print("It's Over")
time.sleep(3)
print(TsumName)
print(TsumSkillDescription)
Edit:
tl;dr my goal here is to be like
"For this Mission Card I need a Blue Tsum with high score potential, a Monster's Inc Tsum for a bunch of games, and a Male Tsum for a long chain.. what's the best Tsum given those?" and it'll be like "SULLY!" and automatically select it or at the very least give you a list of Tsums. Like "These ones match all of them, these ones match 2, and these match 1"
Edit 2:
Here's the command Line Output for the code above:
https://pastebin.com/vpRsX8ni
Edit 3: Alright, just got back for a short break. With some minor looking over I see what happened - my append code is saying "Append this list to the array" meaning I've got a list of lists for both the Header and Row arrays that I'm storing. So I can confirm (for myself at least) that these aren't nested lists per se but they are definitely 2 lists, each containing a single list at every entry. Definitely not a dictionary or anything "special case" at least. This should help me quickly find an answer now that I'm not throwing "multi-dimensional list" around my google searches or wondering why the list stuff isn't working (as it's expecting 1 value and gets a list instead).
Edit 4:
I need to simply add another list! But super nested.
It'll just store the categories that the Tsum has as a string.
so Array[10] = ArrayOfCategories[Tsum] (which contains every attribute in string form that the Tsum has)
So that'll be ie TsumArray[10] = ["Black", "White Gloves", "Mickey & Friends"]
And then I can just use the "Switch" that I've already made in order to check them. Possibly. Not feeling too well and haven't gotten that far yet.
Just use the with open file as json_file , write/read (super easy).
Ultimately stored 3 json files. No big deal. Much easier than appending into one big file.