All Links not retrieved from webpage -- python

All Links not retrieved from webpage -- python - python

I would like the user to give search type choices , search text and then display all the links in the resulting webpage . but i am not able to retrieve the resulting links(only home link is retrieved) from the webpage (http://djraag.net/bollywood/searchbycat.php)
from bs4 import BeautifulSoup
import urllib.request
import re
print(" 1 : Album \n 2 : Track \n 3 : Artist \n ")
count = 0
while (count == 0):
temp_str = input('Enter yout search type : ')
temp = int(temp_str)
if (temp >= 1)&(temp <= 3):
search_no = temp
count = 1
else :
print("Invalid Input")
if search_no == 1 :
search_type1 = "album"
search_type = str(search_type1)
elif search_no == 2:
search_type1 = "track"
search_type = str(search_type1)
else:
search_type1 = "artist"
search_type = str(search_type1)
Search = input("Search : ")
url_temp = "http://djraag.net/bollywood/searchbycat.php?search="+Search+"&type="+search_type+"&cat_id=5&submit=Submit"
url = urllib.request.urlopen(url_temp)
content = url.read()
soup = BeautifulSoup(content ,"html.parser")
for a in soup.findAll('a',href=True):
if re.findall('http',a['href']):
print ("URL:", a['href'])

Remove the line
if re.findall('http',a['href']):
from code and try again.

Related

How do I scrape different data-stats that live under the same div using BeautifulSoup?

from bs4 import BeautifulSoup
import requests
first = ()
first_slice = ()
last = ()
def askname():
global first
first = input(str("First Name of Player?"))
global last
last = input(str("Last Name of Player?"))
print("Confirmed, loading up " + first + " " + last)
# asks user for player name
askname()
first_slice_result = (first[:2])
last_slice_result = (last[:5])
print(first_slice_result)
print(last_slice_result)
# slices player's name so it can match the format bref uses
first_slice_resultA = str(first_slice_result)
last_slice_resultA = str(last_slice_result)
first_last_slice = last_slice_resultA + first_slice_resultA
lower = first_last_slice.lower() + "01"
start_letter = (last[:1])
lower_letter = (start_letter.lower())
# grabs the letter bref uses for organization
print(lower)
source = requests.get('https://www.basketball-reference.com/players/' + lower_letter + '/' + lower + '.html').text
soup = BeautifulSoup(source, 'lxml')
tbody = soup.find('tbody')
pergame = tbody.find(class_="full_table")
classrite = tbody.find(class_="right")
tr_body = tbody.find_all('tr')
# lprint(pergame)
for td in tbody:
print(td.get_text)
print("done")
get = str(input("What stat? \nCheck commands.txt for statistic names. \n"))
for trb in tr_body:
print(trb.get('id'))
print("\n")
th = trb.find('th')
print(th.get_text())
print(th.get('data-stat'))
row = {}
for td in trb.find_all('td'):
row[td.get('data-stat')] = td.get_text()
print(row[get])
So I have this program that scrapes divs based on their given a "data-stat" value. (pg_per_mp etc)
However right now I can only get that data-stat value from either assigning it a variable or getting it from an input. I would like to make a list of data-stats and grab all the values from each data-stat in the list.
for example
list = [fga_per_mp, fg3_per_mp, ft_per_mp]
for x in list:
print(x)
In a perfect world, the script would take each value of the list and scrape the website for the assigned stat.
I tried editing line 66 - 79 to:
get = [fga_per_mp, fg3_per_mp]
for trb in tr_body:
print(trb.get('id'))
print("\n")
th = trb.find('th')
print(th.get_text())
print(th.get('data-stat'))
row = {}
for td in trb.find_all('td'):
for x in get():
row[td.get('data-stat')] = td.get_text()
.. but of course that wouldn't work. Any help?

I would avoid hard coding the player id as it may not always follow that same pattern. What I would do is pull in the player names ad Ids (since the site provides it), then using something like fuzzywuzzy to match player name input (in case for typos and what not.
Once you get that, it's just a matter of pulling out the specific <td> tage with the chosen data-stat
from bs4 import BeautifulSoup
import requests
import pandas as pd
#pip install fuzzywuzzy
from fuzzywuzzy import process
#pip install choice
import choice
def askname():
playerNameInput = input(str("Enter the player's name -> "))
return playerNameInput
# Get all player IDs
player_df = pd.read_csv('https://www.basketball-reference.com/short/inc/sup_players_search_list.csv', header=None)
player_df = player_df.rename(columns={0:'id',
1:'playerName',
2:'years'})
playersList = list(player_df['playerName'])
# asks user for player name
playerNameInput = askname()
# Find closest matches
search_match = pd.DataFrame(process.extract(f'{playerNameInput}', playersList))
search_match = search_match.rename(columns={0:'playerName',1:'matchScore'})
matches = pd.merge(search_match, player_df, how='inner', on='playerName').drop_duplicates().reset_index(drop=True)
choices = [': '.join(x) for x in list(zip(matches['playerName'], matches['years']))]
# Choice the match
playerChoice = choice.Menu(choices).ask()
playerName, years = playerChoice.split(': ')
# Get that match players id
match = player_df[(player_df['playerName'] == playerName) & (player_df['years'] == years)]
baseUrl = 'https://www.basketball-reference.com/players'
playerId = match.iloc[0]['id']
url = f'{baseUrl}/{playerId[0]}/{playerId}.html'
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
soup = BeautifulSoup(html, 'html.parser')
statList = ['fga_per_mp', 'fg3_per_mp', 'ft_per_mp', 'random']
for stat in statList:
try:
statTd = soup.find('td', {'data-stat':stat})
print(statTd['data-stat'], statTd.text)
except:
print(f'{stat} stat not found')

Need Wikipedia web scraper to continuously ask for user input

I need the below code to ask for user input again, after executing and showing results. I guess a while loop would be best but not sure how to do it as have BeautifulSoup and requests library in use.
Any help would be greatly appreciated.
from bs4 import BeautifulSoup
user_input = input("Enter article:")
response = requests.get("https://en.wikipedia.org/wiki/" + user_input)
soup = BeautifulSoup(response.text, "html.parser")
list = []
count = 0
IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
"User talk:", "Module:", "Help:", "File:", "Portal:", "#", "About this", ".ogg", "disambiguation", "Edit section"]
for tag in soup.select('div.mw-parser-output a:not(.infobox a)'):
if count <= 10:
title = tag.get("title", "")
if not any(x in title for x in IGNORE) and title != "":
count = count + 1
print(title)
list.append(title)
else:
break

Use function with return statement
Example
import requests
from bs4 import BeautifulSoup
IGNORE = ["Wikipedia:", "Category:", "Template:", "Template talk:", "User:",
"User talk:", "Module:", "Help:", "File:", "Portal:", "#", "About this", ".ogg", "disambiguation",
"Edit section"]
def get_user_input():
user_input = input("Enter article:")
if len(str(user_input)) > 0:
return get_response(user_input)
else:
return get_user_input()
def get_response(user_input):
response = requests.get("https://en.wikipedia.org/wiki/" + user_input)
soup = BeautifulSoup(response.text, "html.parser")
title_list = []
count = 0
for tag in soup.select('div.mw-parser-output a:not(.infobox a)'):
if count <= 10:
title = tag.get("title", "")
if not any(x in title for x in IGNORE) and title != "":
count = count + 1
print(title)
title_list.append(title)
print(title_list)
else:
return get_user_input()
if __name__ == '__main__':
get_user_input()

Extract specific text from a list in Python

I am trying to extract certain information from a long list of text do display it nicely but i cannot seem to figure out how exactly to tackle this problem.
My text is as follows:
"(Craw...Crawley\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\n**Hotstage**\n **248236**\n\n\n\n\n\n\n\n\n\n\n\n\n\nCosta Collect...Costa Coffee (Bedf...Bedford\n\n\n\n\n\n\n08:00\n\n\n\n \n\n\n**Hotstage**\n **247962**\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Sheffield Qu...Sheffield\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\nHotstage\n 247971\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Brentford...BRENTFORD\n\n\n\n\n\n\n08:00\n\n\n\n\n\n\nHotstage\n 248382\n\n\n\n\n\n\n\n\n\n\n\n\n\nKFC - Acrelec Deployment...KFC - Newport"
I would like to extract what is highlighted.
I'm thinking the solution is simple and maybe I am not storing the information properly or not extracting it properly.
This is my code
from bs4 import BeautifulSoup
import requests
import re
import time
def main():
url = "http://antares.platinum-computers.com/schedule.htm"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
response.close()
# Get
tech_count = 0
technicians = [] #List to hold technicians names
xcount = 0
test = 0
name_links = soup.find_all('td', {"class": "resouce_on"}) #Get all table data with class name "resource on".
# iterate through html data and add them to "technicians = []"
for i in name_links:
technicians.append(str(i.text.strip())) # append value to dictionary
tech_count += 1
print("Found: " + str(tech_count) + " technicians + 1 default unallocated.")
for t in technicians:
print(xcount,t)
xcount += 1
test = int(input("choose technician: "))
for link in name_links:
if link.find(text=re.compile(technicians[test])):
jobs = []
numbers = []
unique_cr = []
jobs.append(link.parent.text.strip())
for item in jobs:
for subitem in item.split():
if(subitem.isdigit()):
numbers.append(subitem)
for number in numbers:
if number not in unique_cr:
unique_cr.append(number)
print ("tasks for technician " + str(technicians[test]) + " are as follows")
for cr in unique_cr:
print (jobs)
if __name__ == '__main__':
main()

It's fairly simple:
myStr = "your complicated text"
words = mystr.split("\n")
niceWords = []
for word in words:
If "**"in word:
niceWords.append(word.replace("**", "")
print(niceWords)

I'm trying to make an API in python that receives data in JSON

I'm trying to make an API which depending on which store the user likes and what specific item they like it will retrieve the data in JSON, for example, in Tesco search for Nutella, it works perfectly to find when I do it for just 1 store but when I did it for more than 1 it get errors.
I've tried a for loop and if loop, I've gotten better results
from bs4 import BeautifulSoup libraries to do this.
My code:
import requests
from bs4 import BeautifulSoup
num = int(input('''press 1 for tesco
press 2 for morrisions
press 3 for sainsbury's
'''))
print("please enter the food you want to search for: ")
flag = num%2
txt = input("")
if flag == 1:
tesco_url = requests.get('https://www.tesco.com/groceries/en-GB/search?query=' + txt).text
elif flag == 2:
morrisons_url = requests.get('https://groceries.morrisons.com/search?entry=' + txt).text
elif flag == 3:
sainsburys_url = requests.get('https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=OYaxfyCjgnRApUa%2FS%2BjRmgHslGfDEUtd3xECMndoz2f9gvq5KRuP8TuhW4m1jnUT%2FJU3fBivUiAIozuhmBLJJJQe6gcTedPJTASuwsZfLkt49e%2FYAPxDyWxCjeiyFxNN5WjSEdcW7LMdmfJbn3TmGVBZKVIxqu1zUw7IT8Qo2afgUyuCpJPcxPbmc2gWJMpi#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=' + txt).text
soup = BeautifulSoup(flag, 'lxml')
print(soup.prettify())
The error I'm getting:
Traceback (most recent call last):
File "My API.py", line 23, in <module>
soup = BeautifulSoup(flag, 'lxml')
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/bs4/__init__.py", line 245, in __init__
elif len(markup) <= 256 and (
TypeError: object of type 'int' has no len()

BeautifulSoup requires XML, which you probably forgot to update when adding more stores. Also I'm not sure why you have flag = num%2, as flag will then only be values [0, 1]. If you replace flag with num, it should work as expected.
import requests
from bs4 import BeautifulSoup
num = int(input('''press 1 for tesco
press 2 for morrisions
press 3 for sainsbury's
'''))
print("please enter the food you want to search for: ")
flag = num # <--- notice this
txt = input("")
if flag == 1:
markup = requests.get('https://www.tesco.com/groceries/en-GB/search?query=' + txt).text
elif flag == 2:
markup = requests.get('https://groceries.morrisons.com/search?entry=' + txt).text
elif flag == 3:
markup = requests.get('https://www.sainsburys.co.uk/webapp/wcs/stores/servlet/SearchDisplayView?catalogId=10123&langId=44&storeId=10151&krypto=OYaxfyCjgnRApUa%2FS%2BjRmgHslGfDEUtd3xECMndoz2f9gvq5KRuP8TuhW4m1jnUT%2FJU3fBivUiAIozuhmBLJJJQe6gcTedPJTASuwsZfLkt49e%2FYAPxDyWxCjeiyFxNN5WjSEdcW7LMdmfJbn3TmGVBZKVIxqu1zUw7IT8Qo2afgUyuCpJPcxPbmc2gWJMpi#langId=44&storeId=10151&catalogId=10123&categoryId=&parent_category_rn=&top_category=&pageSize=60&orderBy=RELEVANCE&searchTerm=' + txt).text
soup = BeautifulSoup(markup, 'lxml')
print(soup.prettify())

Using beautifulsoup to get prices from craigslist

I am new to coding in python (maybe a couple of days in) and basically learning of other people's code on stackoverflow. The code I am trying to write uses beautifulsoup to get the pid and the corresponding price for motorcycles on craigslist. I know there are many other ways of doing this but my current code looks like this:
from bs4 import BeautifulSoup
from urllib2 import urlopen
u = ""
count = 0
while (count < 9):
site = "http://sfbay.craigslist.org/mca/" + str(u)
html = urlopen(site)
soup = BeautifulSoup(html)
postings = soup('p',{"class":"row"})
f = open("pid.txt", "a")
for post in postings:
x = post.getText()
y = post['data-pid']
prices = post.findAll("span", {"class":"itempp"})
if prices == "":
w = 0
else:
z = str(prices)
z = z[:-8]
w = z[24:]
filewrite = str(count) + " " + str(y) + " " +str(w) + '\n'
print y
print w
f.write(filewrite)
count = count + 1
index = 100 * count
print "index is" + str(index)
u = "index" + str(index) + ".html"
It works fine and as I keep learning i plan to optimize it. The problem I have right now, is that entries without price are still showing up. Is there something obvious that I am missing.
thanks.

The problem is how you're comparing prices. You say:
prices = post.findAll("span", {"class":"itempp"})
In BS .findAll returns a list of elements. When you're comparing price to an empty string, it will always return false.
>>>[] == ""
False
Change if prices == "": to if prices == [] and everything should be fine.
I hope this helps.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

All Links not retrieved from webpage -- python - python

Remove the line if re.findall('http',a['href']): from code and try again.

Related

How do I scrape different data-stats that live under the same div using BeautifulSoup?

Need Wikipedia web scraper to continuously ask for user input

Extract specific text from a list in Python

I'm trying to make an API in python that receives data in JSON

Using beautifulsoup to get prices from craigslist

Categories

Resources