I am trying to build a simple 'stock-checker' for a T-shirt I want to buy. Here is the link: https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069
As you can see, I am presented with 'Coming Soon' text, whereas if an item is in stock it will usually show 'Add To Cart'.
I thought the simplest way would be to use requests and BeautifulSoup to isolate this <button> tag and read its text. If it eventually says 'Add To Cart', I will then write the code to email/message myself that it's back in stock.
However, here's the code I have so far, and you'll see that the response says the text contains 'Add To Cart', which is not what the website actually shows.
import requests
import bs4

URL = 'https://yesfriends.co/products/mens-t-shirt-black?variant=40840532689069'

def check_stock(url):
    page = requests.get(url)
    soup = bs4.BeautifulSoup(page.content, "html.parser")
    buttons = soup.find_all('button', {'name': 'add'})
    return buttons

if __name__ == '__main__':
    buttons = check_stock(URL)
    print(buttons[0].text)
All the data is available as JSON in a <script> tag, so we need to get this and extract the information we need. Let's use a simple slice by indexes to get clean JSON:
import requests
import json

url = 'https://yesfriends.co/products/mens-t-shirt-black'
response = requests.get(url)

# Slice out the JSON object that follows 'product:' in the embedded script
index_start = response.text.index('product:', 0) + len('product:')
index_finish = response.text.index(', }', index_start)
json_obj = json.loads(response.text[index_start:index_finish])

for variant in json_obj['variants']:
    available = 'IN STOCK' if variant['available'] else 'OUT OF STOCK'
    print(variant['id'], variant['option1'], available)
OUTPUT:
40840532623533 XXS OUT OF STOCK
40840532656301 XS OUT OF STOCK
40840532689069 S OUT OF STOCK
40840532721837 M OUT OF STOCK
40840532754605 L OUT OF STOCK
40840532787373 XL OUT OF STOCK
40840532820141 XXL OUT OF STOCK
40840532852909 3XL IN STOCK
40840532885677 4XL OUT OF STOCK
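To tie this back to the original goal of polling one specific variant and notifying yourself, the same extraction can be wrapped in a small check. This is only a sketch reusing the slicing approach above; the email/message step is left as a placeholder since that part isn't specified in the question.

import requests
import json

PRODUCT_URL = 'https://yesfriends.co/products/mens-t-shirt-black'
VARIANT_ID = 40840532689069  # the variant from the original link (size S)

def variant_available(url, variant_id):
    # Pull the embedded product JSON and check the 'available' flag for one variant
    response = requests.get(url)
    index_start = response.text.index('product:') + len('product:')
    index_finish = response.text.index(', }', index_start)
    product = json.loads(response.text[index_start:index_finish])
    return any(v['id'] == variant_id and v['available'] for v in product['variants'])

if __name__ == '__main__':
    if variant_available(PRODUCT_URL, VARIANT_ID):
        print('Back in stock!')  # hook your email/message notification in here
    else:
        print('Still out of stock.')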
I want to retrieve a financial dataset from a website which has a login. I've managed to log in using requests and access the HTML:
import requests
from bs4 import BeautifulSoup
import pandas as pd

s = requests.session()
login_data = dict(email='my login', password='password')
s.post('*portal website with/login*', data=login_data)
r = s.get('*website with financial page*')
print(r.content)

# work on r as it's a direct link to the finance page
soup = BeautifulSoup(r.text, 'html.parser')  # returns the html of the finance page
The above code allows me to log in and get the html from the correct page.
headers = []

# finds all the headers.
for i in table.find_all('th'):
    title = i.text.strip()
    headers.append(title)

df = pd.DataFrame(columns=headers)
print(df)
This block finds the table and gets the column headers, which are printed as:
Columns: [Date, Type, Type, Credit, Debit, Outstanding, Case File, ]
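For completeness, the table object used above isn't defined in the snippet; presumably it comes from a lookup on the parsed page, something like the following (a sketch only; the actual tag, id or class used by the portal isn't shown in the question):

# Hypothetical lookup: adjust the selector to whatever the portal's markup actually uses
table = soup.find('table')  # e.g. soup.find('table', {'id': '...'}) or a class-based lookup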
The next part is the problem. When I attempt to retrieve the financials using the following code:
for row in table.find_all('tr')[1:]:
    data = row.find_all('td')
    row_data = [td.text.strip() for td in data]
    print(row_data)
it returns this:
['"Loading Please Wait..."']
The HTML of the site looks like this:
[screenshot: HTML of the site I want to scrape]
http://comp20008-jh.eng.unimelb.edu.au:9889/main/
Hi, so I'm trying to get the list of all the URLs and main headings from articles given on an HTML page. The link above is the main page, and clicking the 'next article' link leads to the next article with a link like this:
http://comp20008-jh.eng.unimelb.edu.au:9889/main/Hodg001.html
"Hodg001.html" href which continues until 147th article. This page has a 'next article' link that leads to the next article and so on.
I'm trying to extract the url and the heading from each article and create a dataframe with to save into a csv file. I'm totally clueless and don't know how to proceed now
import requests
from bs4 import BeautifulSoup

base_url = 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/'
req = requests.get(base_url)
print(req)

soup = BeautifulSoup(req.text, 'html.parser')
print(soup.prettify())
print(soup.h1)

links = soup.findAll("a")
print(links)
headings = soup.findAll("h1")
print(headings)

for link in links:
    print(link.get("href"))  # only gets 1

for i in headings:
    print(i)  # doesn't work
Can anyone please explain how I can proceed? I can provide more information if needed.
Do you mean something like this?
In the code below a few things are happening:
set a base_url, as the links are relative links, not absolute
keep track of the next_url for the while loop
the next_link_class is just a placeholder to find which <p> tag is needed
data will contain the links and headings
csv_path is the path to your export file
Next we tell the script to keep on fetching links and extracting information as long as next_url is populated.
After it's done, it will write the data to the provided path for the csv file.
I must admit I didn't let it run through to the end, so you may need to handle it differently if no next link is available on the HTML page. But this should get you well on your way.
import csv
import bs4
import requests

url_base = 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/'
next_url = 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Hodg001.html'
next_link_class = 'nextLink'
csv_path = 'export.csv'

data = []
link_list = []  # keeps track of the pages that have been visited

while next_url:
    link_list.append(next_url)
    soup = bs4.BeautifulSoup(requests.get(next_url).content, 'html.parser')
    try:
        next_url = url_base + soup.find('p', {'class': next_link_class}).a['href']
    except AttributeError:
        # This exception should be thrown when the last page is reached.
        # The loop should break at that point and dump the data to the csv-file
        break
    d = {
        'url': next_url,
        'heading': soup.find('h1').string
    }
    data.append(d)
    print(d)

with open(csv_path, 'w', newline='') as f:
    c = csv.DictWriter(f, fieldnames=data[0].keys())
    c.writeheader()
    c.writerows(data)
The code above will provide you with a list of dictionaries of links and headings, and save it all to the given csv_path:
In [20]: data
Out[20]:
[{'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Vick002.html',
  'heading': 'Hodgson shoulders England blame'},
 {'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Yach003.html',
  'heading': 'Vickery out of Six Nations'},
 {'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Lapo004.html',
  'heading': 'Yachvili savours France comeback'},
 {'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Lews005.html',
  'heading': 'Laporte tinkers with team'},
 {'url': 'http://comp20008-jh.eng.unimelb.edu.au:9889/main/Fumi006.html',
  'heading': 'Lewsey puzzle over disallowed try'}]
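Since the question mentions building a dataframe before saving to csv, the same data list could also be written out with pandas instead of the csv module (a small sketch, assuming pandas is installed):

import pandas as pd

# 'data' is the list of {'url': ..., 'heading': ...} dicts collected in the loop above
df = pd.DataFrame(data, columns=['url', 'heading'])
df.to_csv('export.csv', index=False)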
I'm currently writing a Python script and part of it gets winshares from the first 4 seasons of every player's career in the NBA draft between 2005 and 2015. I've been messing around with this for almost 2 hours (getting increasingly frustrated), but I've been unable to get the Win Shares for the individual players. I'm trying to use the "Advanced" table at the following link as a test case: https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none
When getting the players' names from the draft pages I had no problems, but I've tried so many iterations of the following code and have had no success in accessing the td element the stat is in.
playerSoup = BeautifulSoup(playerHtml)
playertr = playerSoup.find_all("table", id = "advanced").find("tbody").findAll("tr")
playerws = playertr.findAll("td")[21].getText()
This page uses JavaScript to add the tables, but it doesn't read the data from the server. All tables are already in the HTML, but inside comments <!-- ... -->.
Using BeautifulSoup you can find all comments and then check which one has the text "Advanced". Then you can parse that comment as normal HTML with BeautifulSoup:
import requests
from bs4 import BeautifulSoup
from bs4 import Comment

url = 'https://www.basketball-reference.com/players/b/bogutan01.html#advanced::none'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Find all commented-out chunks of HTML in the page
all_comments = soup.find_all(string=lambda text: isinstance(text, Comment))

for item in all_comments:
    if "Advanced" in item:
        adv = BeautifulSoup(item, 'html.parser')
        playertr = adv.find("table", id="advanced")
        if not playertr:
            # print('skip')
            continue  # skip comment without table - go back to `for`
        playertr = playertr.find("tbody").findAll("tr")
        playerws = adv.find_all("td")[21].getText()
        print('playertr:', playertr)
        print('playerws:', playerws)
        for row in playertr:
            if row:
                print(row.find_all('th')[0].text)
                all_td = row.find_all('td')
                print([x.text for x in all_td])
                print('--')
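If you only need the Win Shares value for each season (for example the first 4 seasons mentioned in the question), you can pick the cell out of each row instead of relying on a fixed index. This is only a sketch; it assumes the cells carry a data-stat="ws" attribute, which appears to be how basketball-reference labels the Win Shares column:

# Reusing `playertr` (the list of <tr> rows of the Advanced table) from the loop above
win_shares = []
for row in playertr[:4]:  # first 4 seasons
    ws_cell = row.find('td', {'data-stat': 'ws'})  # assumed attribute name, verify against the page source
    if ws_cell:
        win_shares.append(ws_cell.get_text())
print(win_shares)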
I am trying to scrape this page (http://www.arohan.in/branch-locator.php), in which when I select the state and city, an address is displayed, and I have to write the state, city and address to a csv/excel file. I am able to reach till this step; now I am stuck.
Here is my code:
from selenium import webdriver
from selenium.webdriver.support.ui import Select, WebDriverWait

chrome_path = r"C:\Users\IBM_ADMIN\Downloads\chromedriver_win32\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://www.arohan.in/branch-locator.php")

select = Select(driver.find_element_by_name('state'))
select.select_by_visible_text('Bihar')
drop = Select(driver.find_element_by_name('branch'))
city_option = WebDriverWait(driver, 5).until(lambda x: x.find_element_by_xpath("//select[@id='city1']/option[text()='Gaya']"))
city_option.click()
Is selenium necessary? It looks like you can use URLs to arrive at what you want: http://www.arohan.in/branch-locator.php?state=Assam&branch=Mirza.
Get a list of the state/branch combinations, then use BeautifulSoup to get the info from each page.
In a slightly organized manner:
import requests
from bs4 import BeautifulSoup

link = "http://www.arohan.in/branch-locator.php?"

def get_links(session, url, payload):
    session.headers["User-Agent"] = "Mozilla/5.0"
    res = session.get(url, params=payload)
    soup = BeautifulSoup(res.text, "lxml")
    item = [item.text for item in soup.select(".address_area p")]
    print(item)

if __name__ == '__main__':
    for st, br in zip(['Bihar', 'West Bengal'], ['Gaya', 'Kolkata']):
        payload = {
            'state': st,
            'branch': br
        }
        with requests.Session() as session:
            get_links(session, link, payload)
Output:
['Branch', 'House no -10/12, Ward-18, Holding No-12, Swarajpuri Road, Near Bank of Baroda, Gaya Pin 823001(Bihar)', 'N/A', 'N/A']
['Head Office', 'PTI Building, 4th Floor, DP Block, DP-9, Salt Lake City Calcutta, 700091', '+91 33 40156000', 'contact#arohan.in']
A better approach would be to avoid using selenium. Selenium is useful when you need JavaScript processing to render the HTML; in your case this is not needed, as the required information is already contained within the HTML.
What is needed is to first make a request to get a page containing all of the states. Then for each state, request the list of branches. Then for each state/branch combination, a URL request can be made to get the HTML containing the address. This happens to be contained in the second <li> entry following a <ul class='address_area'> entry:
from bs4 import BeautifulSoup
import requests
import csv
import time

# Get a list of available states
r = requests.get('http://www.arohan.in/branch-locator.php')
soup = BeautifulSoup(r.text, 'html.parser')
state_select = soup.find('select', id='state1')
states = [option.text for option in state_select.find_all('option')[1:]]

# Open an output CSV file
with open('branch addresses.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['State', 'Branch', 'Address'])

    # For each state determine the available branches
    for state in states:
        r_branches = requests.post('http://www.arohan.in/Ajax/ajax_branch.php', data={'ajax_state': state})
        soup = BeautifulSoup(r_branches.text, 'html.parser')

        # For each branch, request a page containing the address
        for option in soup.find_all('option')[1:]:
            time.sleep(0.5)  # Reduce server loading
            branch = option.text
            print("{}, {}".format(state, branch))
            r_branch = requests.get('http://www.arohan.in/branch-locator.php', params={'state': state, 'branch': branch})
            soup_branch = BeautifulSoup(r_branch.text, 'html.parser')
            ul = soup_branch.find('ul', class_='address_area')

            if ul:
                address = ul.find_all('li')[1].get_text(strip=True)
                row = [state, branch, address]
                csv_output.writerow(row)
            else:
                print(soup_branch.title)
Giving you an output CSV file starting:
State,Branch,Address
West Bengal,Kolkata,"PTI Building, 4th Floor,DP Block, DP-9, Salt Lake CityCalcutta, 700091"
West Bengal,Maheshtala,"Narmada Park, Par Bangla,Baddir Bandh Bus Stop,Opp Lane Kismat Nungi Road,Maheshtala,Kolkata- 700140. (W.B)"
West Bengal,ShyamBazar,"First Floor, 6 F.b.T. Road,Ward No.-6,Kolkata-700002"
You should slow the script down using a time.sleep(0.5) to avoid too much loading on the server.
Note: [1:] is used as the first item in the drop down lists is not a branch or state, but a Select Branch entry.