Scraping all tables from a webpage using Python - python

I want to use BeautifulSoup to get all the tables at this link https://www.investing.com/indices/indices-futures. After that, I want to get the titles in the Index column, and the links of those titles.
I want what's in the first column only.
For example:
title        href
Dow Jones    /indices/us-30-futures
S&P 500      /indices/us-spx-500-futures
...
Mini DAX     /indices/mini-dax-futures
...
VSTOXX Mini  /indices/vstoxx-mini
I use the following code:
url = "https://www.investing.com/indices/indices-futures"
req = requests.get(url, headers=urlheader)
soup = BeautifulSoup(req.content, "lxml")
table = soup.find('div', id="cross_rates_container")

for a in table.find_all('a', href=True):
    print(a['title'], a['href'])
I can see the table variable, but I can't seem to access the title (which contains the index name) and href (which contains the link).
What's wrong with it, and how can I get all the tables' entries at once?

You can iterate over <td> elements and get the <a> link under them.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://www.investing.com/indices/indices-futures'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
print('{:<30} {}'.format('Title', 'URL'))
for a in soup.select('td.plusIconTd > a'):
    print('{:<30} {}'.format(a.text, 'https://www.investing.com' + a['href']))
Prints:
Title                          URL
Dow Jones                      https://www.investing.com/indices/us-30-futures
S&P 500                        https://www.investing.com/indices/us-spx-500-futures
Nasdaq                         https://www.investing.com/indices/nq-100-futures
SmallCap 2000                  https://www.investing.com/indices/smallcap-2000-futures
S&P 500 VIX                    https://www.investing.com/indices/us-spx-vix-futures
DAX                            https://www.investing.com/indices/germany-30-futures
CAC 40                         https://www.investing.com/indices/france-40-futures
FTSE 100                       https://www.investing.com/indices/uk-100-futures
Euro Stoxx 50                  https://www.investing.com/indices/eu-stocks-50-futures
FTSE MIB                       https://www.investing.com/indices/italy-40-futures
SMI                            https://www.investing.com/indices/switzerland-20-futures
IBEX 35                        https://www.investing.com/indices/spain-35-futures
ATX                            https://www.investing.com/indices/austria-20-futures
WIG20                          https://www.investing.com/indices/poland-20-futures
AEX                            https://www.investing.com/indices/netherlands-25-futures
BUX                            https://www.investing.com/indices/hungary-14-futures
RTS                            https://www.investing.com/indices/rts-cash-settled-futures
... and so on.
EDIT: Screenshot with <td> elements:

Related

How can I get my python code to scrape the correct part of a website?

I am trying to get Python to scrape a page on Mississippi's state legislature website. My goal is to scrape a page and add what I've scraped into a new CSV. My command prompt doesn't give me errors, but I am only scraping a " symbol and that is it. Here is what I have so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['http://www.legislature.ms.gov/legislation/all-measures/']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict = [item.text for item in soup.select('tbody')]

df = pd.DataFrame.from_dict(temp_dict, orient='index').transpose()
df.to_csv('3-New Bills.csv')
I believe the problem is with line 13:
temp_dict = [item.text for item in soup.select('tbody')]
What should I replace 'tbody' with in this code to see all of the bills? Thank you so much for your help.
EDIT: Please see Sergey K's comment below for a more elegant solution.
That table is being loaded in an iframe, so you would have to scrape that iframe's source for the data. The following code will return a dataframe with 3 columns (measure, shorttitle, author):
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
list_for_df = []
r = requests.get('http://billstatus.ls.state.ms.us/2022/pdf/all_measures/allmsrs.xml', headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
for x in soup.select('msrgroup'):
    list_for_df.append((x.measure.text.strip(), x.shorttitle.text.strip(), x.author.text.strip()))
df = pd.DataFrame(list_for_df, columns = ['measure', 'short_title', 'author'])
df
Result:
  measure                                        short_title       author
0    HB 1  Use of technology portals by those on probatio...  Bell (65th)
1    HB 2  Youth court records; authorize judge to releas...  Bell (65th)
2    HB 3  Sales tax; exempt retail sales of severe weath...  Bell (65th)
3    HB 4  DPS; require to establish training component r...  Bell (65th)
4    HB 5  Bonds; authorize issuance to assist City of Ja...  Bell (65th)
..    ...                                                ...          ...
You can add more data to that table, like measurelink, authorlink, action, etc - whatever is available in the xml document tags.
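As a sketch of that, here's how extra tags could be pulled out when present. The toy XML below stands in for the real allmsrs.xml feed; the measurelink and action tag names are the ones mentioned above, and find() returns None for any tag that is absent:

```python
from bs4 import BeautifulSoup

# Toy <msrgroup> payload standing in for the real allmsrs.xml feed;
# the measurelink/action tag names are taken from the note above.
xml = """
<msrgroup>
  <measure>HB 1</measure>
  <shorttitle>Example short title</shorttitle>
  <author>Bell (65th)</author>
  <measurelink>/2022/html/HB/HB0001.htm</measurelink>
  <action>Died in committee</action>
</msrgroup>
"""

soup = BeautifulSoup(xml, 'html.parser')
rows = []
for x in soup.select('msrgroup'):
    row = {
        'measure': x.measure.text.strip(),
        'short_title': x.shorttitle.text.strip(),
        'author': x.author.text.strip(),
    }
    # Optional tags: find() returns None when a tag is missing.
    for extra in ('measurelink', 'action'):
        tag = x.find(extra)
        row[extra] = tag.text.strip() if tag else None
    rows.append(row)

print(rows[0])
```

The None guard matters on the real feed, since not every msrgroup is guaranteed to carry every optional tag.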
Try get_text instead
https://beautiful-soup-4.readthedocs.io/en/latest/#get-text
temp_dict = [item.get_text() for item in soup.select('tbody')]
IIRC, .text only shows the direct child text, not the text of descendant tags. See XPath - Difference between node() and text() (which I think applies here for .text as well: it is the child text node, not other child nodes).
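For what it's worth, a quick check on a toy fragment suggests that in BeautifulSoup specifically, .text does include descendant text and behaves the same as .get_text() (the XPath distinction above applies to libraries like lxml rather than bs4):

```python
from bs4 import BeautifulSoup

# Toy fragment: text split between a direct child node and a nested tag.
soup = BeautifulSoup('<tbody><tr><td>outer <b>inner</b></td></tr></tbody>',
                     'html.parser')
tbody = soup.find('tbody')

print(repr(tbody.text))        # descendant text is included
print(repr(tbody.get_text()))  # same string
```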

Price comparison - python

Hi guys, I am trying to create a program in Python that compares prices from websites, but I can't get the prices. I have managed to get the title of the product and the quantity using the code below.
page = requests.get(urls[7],headers=Headers)
soup = BeautifulSoup(page.text, 'html.parser')
title = soup.find("h1",{"class" : "Titlestyles__TitleStyles-sc-6rxg4t-0 fDKOTS"}).get_text().strip()
quantity = soup.find("li", class_="quantity").get_text().strip()
total_price = soup.find('div', class_='Pricestyles__ProductPriceStyles-sc-118x8ec-0 fzwZWj price')
print(title)
print(quantity)
print(total_price)
I am trying to get the price from this website (I am creating a program to look for diaper prices, lol): https://www.drogasil.com.br/fralda-huggies-tripla-protecao-tamanho-m.html
The price is not coming through; even when I get the text, it always says that it's NoneType.
Some of the information is built up via javascript from data stored in <script> sections in the HTML. You can access this directly by searching for it and using Python's JSON library to decode it into a Python structure. For example:
from bs4 import BeautifulSoup
import requests
import json
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}
url = 'https://www.drogasil.com.br/fralda-huggies-tripla-protecao-tamanho-m.html'
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, 'html.parser')
script = soup.find('script', type='application/ld+json')
data = json.loads(script.text)
title = data['name']
total_price = data['offers']['price']
quantity = soup.find("li", class_="quantity").get_text().strip()
print(title)
print(quantity)
print(total_price)
Giving you:
HUGGIES FRALDAS DESCARTAVEL INFANTIL TRIPLA PROTECAO TAMANHO M COM 42 UNIDADES
42 Tiras
38.79
I recommend you add print(data) to see what other information is available.
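That inspection step can be sketched offline with a toy schema.org Product payload (a stub only; the real page's JSON-LD will carry more fields than this):

```python
import json

# Toy ld+json payload shaped like a schema.org Product block;
# field names beyond name/offers are a stub, not the real page's data.
payload = """
{
  "@type": "Product",
  "name": "Example diaper pack",
  "offers": {"@type": "Offer", "price": "38.79", "priceCurrency": "BRL"}
}
"""

data = json.loads(payload)
print(json.dumps(data, indent=2))   # pretty-print to see everything available
print(sorted(data.keys()))
```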

Scraping website for image path (not downloading the image, just getting a clickable link) but image URL is parsed in scraped text

I'm trying to scrape this website for the image URLs, but the image URL that was previously visible in Chrome's inspect element is no longer available in the scraped output, as seen in the block of HTML text below:
<div class="productImage" data-qa-id="productImagePLP_Running Low Top Sneaker Black/Rose Gold "><div class="sc-1xjgu8-0 jRkpWF"><div class="sc-1xjgu8-1 gCPKVp"><svg fill="none" height="22" viewbox="0 0 22 22" width="22" xmlns="http://www.w3.org/2000/svg"><path d="M14.2113 0.741972C13.3401 0.393483 12.3994 0.219238 11.4583 0.219238C10.901 0.219238 10.3433 0.289037 9.78569 0.393483L7.53809 4.26151C8.46153 3.75635 9.48942 3.51231 10.5525 3.52989C11.197 3.52989 11.8244 3.617 12.4343 3.79125L14.2113 0.741972Z" fill="#B2B8CA"></path><path d="M0.708008 11.1439C0.708008 16.7197 5.44726 21.0582 10.9706 21.0582C16.7556 21.0582 21.425 16.3885 21.425 10.7085C21.425 7.38056 19.8222 4.4533 17.435 2.50171L15.6925 5.51608C17.2258 6.82292 18.1146 8.73961 18.1146 10.7607C18.1146 14.6288 14.9084 17.7998 10.9706 17.7998C7.03278 17.7998 3.84441 14.6115 3.84441 10.6736C3.84441 10.6736 3.84441 10.6736 3.84441 10.6561C3.84441 10.1858 3.87906 9.71528
3.96618 9.26209L0.708008 11.1439Z" fill="#B2B8CA"></path></svg>
chrome's inspect element
<img width="100%" height="100%" src="https://z.nooncdn.com/products/tr:n-t_240/v1603717104/N41330370V_2.jpg" alt="Running Low Top Sneaker Black/Rose Gold ">
I'm trying to scrape the src attribute.
Is there a way to get around this? I've tried to form the URL myself using other attributes, but that did not work. I'll add the relevant code and website link below.
code:
page = requests.get(URL, headers=header)
soup = BeautifulSoup(page.content, 'html.parser')
divs = soup.find_all('div', class_="productContainer")
print(divs[0])
website link: https://www.noon.com/egypt-en/search?q=shoes
The page is loaded dynamically, so requests alone doesn't see that content. However, the data is available in JSON format on the page, which you can extract using the built-in json module.
import json
import requests
from bs4 import BeautifulSoup
URL = "https://www.noon.com/egypt-en/search?q=shoes"
headers = {
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")
json_data = json.loads(soup.find("script", {"id": "__NEXT_DATA__"}).string)
for data in json_data["props"]["pageProps"]["catalog"]["hits"]:
    price = data["sale_price"] or data["price"]
    print(data["name"])
    print(price)
    print("-" * 80)
Output:
Running Low Top Sneaker Black/Rose Gold
1247
--------------------------------------------------------------------------------
Asweemove Running Shoes Black/White
1076
--------------------------------------------------------------------------------
Leather Half Boots Dark Blue
250
--------------------------------------------------------------------------------
...
...
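The extraction above can be exercised offline against a stub page. The field names (name, price, sale_price) are the ones the answer's code reads, but this stub is only a sketch of the real payload's shape:

```python
import json
from bs4 import BeautifulSoup

# Stub of a Next.js page: the data lives in a <script id="__NEXT_DATA__"> tag.
html = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"catalog": {"hits": [
  {"name": "Running Low Top Sneaker", "price": 1399, "sale_price": 1247}
]}}}}
</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
json_data = json.loads(soup.find("script", {"id": "__NEXT_DATA__"}).string)
hit = json_data["props"]["pageProps"]["catalog"]["hits"][0]
print(hit["name"], hit["sale_price"] or hit["price"])
```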

BeautifulSoup struggling to scrape the Listings Detail Page

I'm still a rookie in the Python world. I'm trying to build a scraper that will be useful in my daily work routine, but I'm stuck at a particular point:
My goal is to scrape a real estate website. I'm using BeautifulSoup, and I manage to get the parameters on the list pages without problems. But when I enter the listing details page, I can't scrape any data.
My code:
from bs4 import BeautifulSoup
import requests
url = "https://timetochoose.co.ao/?search-listings=true"
headers = {'User-Agent': 'whatever'}
response = requests.get(url, headers=headers)
print(response)
data = response.text
print(data)
soup = BeautifulSoup(data, 'html.parser')
anuncios = soup.find_all("div", {"class": "grid-listing-info"})
for anuncios in anuncios:
    titles = anuncios.find("a", {"class": "listing-link"}).text
    location = anuncios.find("p", {"class": "location muted marB0"}).text
    link = anuncios.find("a", {"class": "listing-link"}).get("href")
    anuncios_response = requests.get(link)
    anuncios_data = anuncios_response.text
    anuncios_soup = BeautifulSoup(anuncios_data, 'html.parser')
    conteudo = anuncios_soup.find("div", {"id": "listing-content"}).text
    print("Título", titles, "\nLocalização", location, "\nLink", link, "\nConteudo", conteudo)
For example, I'm not getting anything in the "conteudo" variable. I've tried to get different data from the details page, like the price or the number of rooms, but it always fails and I just get "None".
I've been searching for an answer since yesterday afternoon, but I can't see where I'm failing. I manage to get the parameters on the upper pages without problems, but when I reach the listing details page level, it just fails.
If someone could point out what I'm doing wrong, I would be grateful. Thanks in advance for the time you take to read my question.
To get the correct page you need to set the User-Agent HTTP header.
For example:
import requests
from bs4 import BeautifulSoup
main_url = 'https://timetochoose.co.ao/?search-listings=true'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
def print_info(url):
    soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
    print(soup.select_one('#listing-content').get_text(strip=True, separator='\n'))

soup = BeautifulSoup(requests.get(main_url, headers=headers).content, 'html.parser')

for a in soup.select('a.listing-featured-image'):
    print(a['href'])
    print_info(a['href'])
    print('-' * 80)
Prints:
https://timetochoose.co.ao/listings/loja-rua-rei-katiavala-luanda/
Avenida brasil , Rua katiavala
Maculusso
Loja com 90 metros quadrados
2 andares
1 wc
Frente a estrada
Arrendamento  mensal 500.000 kz Negociável
--------------------------------------------------------------------------------
https://timetochoose.co.ao/listings/apertamento-t3-rua-cabral-montcada-maianga/
Apartamento T3 maianga
1  suíte com varanda
2 quartos com varanda
1 wc
1 sala comum grande
1 cozinha
Tanque de  agua
Predio limpo
Arrendamento 350.000  akz Negociável
--------------------------------------------------------------------------------
...and so on.

Can't find and process text taken out of an HTML

I'm trying to search a webpage for the "Spanish" content but can't get it at all.
This is the code I have so far:
from bs4 import BeautifulSoup
import requests
import re
url = 'http://www.autotaskstatus.net/'
r = requests.get(url)
estado = r.status_code
r = r.content
soup = BeautifulSoup(r, "html.parser")
data = soup.find_all('span', attrs={'class':'name'})[1]
pais = 'Spanish'
data.get_text()
print(data.text)
I have the "pais" var there so it can be replaced by an input, letting the user search for the country they want.
The only data I get with a 1 there is "Limited Release", but if I go with a 0 I can't filter the results at all.
I have been searching all over the Internet and couldn't find anyone with this same problem, so I can't find a solution.
I am using Python 3.6
Edit: since people seemed to find this unclear, I'll explain it now.
What I have on the page is (just a part):
<div data-component-id="fp5s6cp13l47"
class="component-inner-container status-green "
data-component-status="operational"
data-js-hook="">
<span class="name">
Concord
</span>
<span class="tooltip-base tool" title="https://concord.centrastage.net">?</span>
<span class="component-status">
Operational
</span>
So "Spanish" sits in a <span class="name"> just like "Concord", and what I want to pull out is the "Spanish" (and later on the "Operational"), which will be in a var so it can later be changed to any country there.
You can get the Spanish server status using this approach:
from bs4 import BeautifulSoup
import requests
URL = 'http://www.autotaskstatus.net/'
with requests.session() as s:
    s.headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0'}
    r = s.get(URL)

soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all('div', attrs={'class': 'component-inner-container'})
pais = 'Spanish'
print([d.find('span', {'class': 'name'}).text.strip() + ' - ' + d.find('span', {'class': 'component-status'}).text.strip() for d in data if pais in d.text])
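That one-liner can be unpacked into a plain loop for readability. Here it runs against toy markup modeled on the snippet in the question, with "Spanish" substituted for "Concord" (a sketch, not the live status page):

```python
from bs4 import BeautifulSoup

# Toy markup modeled on the component snippet shown in the question.
html = """
<div class="component-inner-container status-green">
  <span class="name">Spanish</span>
  <span class="component-status">Operational</span>
</div>
<div class="component-inner-container status-green">
  <span class="name">Concord</span>
  <span class="component-status">Operational</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
pais = 'Spanish'
results = []
for d in soup.find_all('div', attrs={'class': 'component-inner-container'}):
    if pais in d.text:  # keep only the component whose text mentions the country
        name = d.find('span', {'class': 'name'}).text.strip()
        status = d.find('span', {'class': 'component-status'}).text.strip()
        results.append(name + ' - ' + status)

print(results)
```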
