Take only the second element that has the same name in BeautifulSoup - python

I'm scraping a website for college work, and I am having trouble getting only the second text in a span. I have seen that you can use the code below to get the text:
gross = container.find_all('span', attrs = {'name':'nv'})
print(gross)
This is the result:
[<span data-value="845875" name="nv">845.875</span>, <span data-value="335.451.311" name="nv">$335.45M</span>]
How do I get only the value of the second data-value, in a way that I can reuse for other spans?

Try this.
gross = container.find_all('span', attrs = {'name':'nv', 'data-value':'335.451.311'})
print(gross)
If the data-value keeps changing, you have no choice but to index the unfiltered list from the question: there, gross[1] is the second span.
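A minimal sketch of the index-based approach, assuming the two spans from the question are wrapped in a sample `div` (the wrapper and the variable names `spans` and `second` are illustrative, not from the original page):

```python
from bs4 import BeautifulSoup

# Sample markup built from the spans shown in the question.
html = (
    '<div>'
    '<span data-value="845875" name="nv">845.875</span>'
    '<span data-value="335.451.311" name="nv">$335.45M</span>'
    '</div>'
)
container = BeautifulSoup(html, 'html.parser')

# 'name' must go through attrs, because find_all's own `name`
# parameter refers to the tag name.
spans = container.find_all('span', attrs={'name': 'nv'})

# Guard against containers that have fewer than two matching spans.
second = spans[1] if len(spans) > 1 else None
if second is not None:
    print(second['data-value'])  # attribute value
    print(second.get_text())     # visible text
```

The length check matters when the same code runs over many containers, since some of them may lack the second span.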

Related

BeautifulSoup4: Extracting tables, now how do I exclude certain tags and bits of information I do not want

I'm trying to extract coin names, prices, and market caps from coinmarketcap.com. I first tried using soup.find_all to search for certain tags with a specific class, but it always picked up information I didn't need or want. So instead I used find_all to search for 'td' and planned to use a for loop to look for specific class names, append those to a new list, and then print that list, but it prints a data type for some reason.
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'
for i in coin_table:
    if class_value in coin_table:
        list.append(i)
print(list)
But this returns:
<class 'list'>
to the console even though I'm not asking to see the data type. I'm very new to BeautifulSoup and coding in general, so sorry if this is a very basic question. I'm still trying to get my head around all of this.
As @RJAdriaansen mentioned, you don't need to scrape the website when it provides an API. Here is how to do it with the requests library:
import requests
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,tags,platform,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
response = requests.get(url)
data = response.json()
This will give you JSON data. Now you can grab everything you need by accessing the correct keys:
final_list = []
temp = []
for each_crypto in data['data']['cryptoCurrencyList']:
    temp.append(each_crypto['name'])
    # each_crypto['quotes'] is a list of price and market cap entries for each crypto
    for quote in each_crypto['quotes']:
        # assuming you want the USD price of each crypto
        if quote['name'] == "USD":
            temp.append(quote['price'])
            temp.append(quote['marketCap'])
    final_list.append(temp)
    temp = []
Final result would look like this:
[
['Bitcoin', 34497.01819639692, 646704595579.0485],
['Ethereum', 2195.11816422801, 255815488972.87268],
['Tether', 1.0003936138399, 62398426501.02027],
['Binance Coin', 294.2550537711805, 45148405357.003],
...
]
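As for why the original loop printed `<class 'list'>`: `list` in that code is Python's built-in type rather than a variable, and the condition tests the class string against the whole result list, so nothing is ever appended. A minimal sketch of the intended filter follows, assuming the sample table below stands in for the real page. The class names are copied from the question and may no longer match the live site; `wanted` and `matches` are illustrative names.

```python
from bs4 import BeautifulSoup

# Sample markup; the real page structure is an assumption.
html = """
<table><tr>
<td class="sc-1eb5slv-0 iJjGCS">Bitcoin</td>
<td class="other">ignored</td>
<td class="sc-1eb5slv-0 iJjGCS">Ethereum</td>
</tr></table>
"""
soup = BeautifulSoup(html, 'html.parser')

wanted = {'sc-1eb5slv-0', 'iJjGCS'}
matches = []  # a real list object, not the built-in `list` type
for cell in soup.find_all('td'):
    # cell.get('class') is a list of class names, or None if absent
    if wanted.issubset(cell.get('class') or []):
        matches.append(cell.get_text(strip=True))

print(matches)
```

Checking each cell's own classes, instead of testing against `coin_table`, is what makes the filter select anything at all.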

How to get the positions of get_text in Beautiful Soup

I am trying to store the result of my get_text in variables.
I am filtering my HTML to find the information I need. A field I want to extract, for example the filing number ("Número de radicado"), can appear several times. This is how the information I get is displayed:
<span cetxt\"="" class='\"rSpnValor' vidc0='\"74922\"'>74922</span>
<span cetxt\"="" class='\"rSpnValor' vidc0='\"75005\"'>75005</span>
With get_text it would look like this:
74922
75005
I share a bit of my code:
def getValBySpanName(name):
    dataArray = soup.find_all('div', {'class': '\\\"rDivDatosAseg'})
    for data in dataArray:
        data_container = data
        spans_data = data_container.find_all("span")
        info = []
        if spans_data[0].get_text() == name:
            container_values = spans_data[1].get_text()
            return container_values
file_number = getValBySpanName('Número de radicado')
print(file_number)
The problem is that I only get the first value, "74922", as the result. I need a way to store each value in a variable, one by one (I will then insert this data into SQL).
I tried to go through them with a for loop, but it iterates over the characters of the first value, something like '7,4,9,2,2'.
If I understand you correctly, you are probably looking for something like this:
dataArray = soup.select('div.\\"rSpnValor span')
container_values = []
for data in dataArray:
    container_values.append(data.text)
print(container_values)
Output:
['74922', '75005']
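The function in the question returns from inside the loop, so only the first match comes back, and looping over that returned string yields single characters. Here is a sketch that collects every match into a list instead. It assumes clean class names (`rDivDatosAseg`, without the escaped quotes seen in the question's HTML) and uses a hypothetical helper name, `get_values_by_span_name`. The sample markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup with unescaped class names (an assumption).
html = """
<div class="rDivDatosAseg"><span>Número de radicado</span><span class="rSpnValor">74922</span></div>
<div class="rDivDatosAseg"><span>Número de radicado</span><span class="rSpnValor">75005</span></div>
<div class="rDivDatosAseg"><span>Otro campo</span><span class="rSpnValor">abc</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')


def get_values_by_span_name(soup, name):
    """Return the value of every block whose first span equals `name`."""
    values = []
    for div in soup.find_all('div', class_='rDivDatosAseg'):
        spans = div.find_all('span')
        if len(spans) >= 2 and spans[0].get_text(strip=True) == name:
            values.append(spans[1].get_text(strip=True))
    return values  # return after the loop, not inside it


file_numbers = get_values_by_span_name(soup, 'Número de radicado')
for number in file_numbers:
    print(number)  # each item is a whole value, ready for one SQL insert
```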

Is there any function to extract (get) a value of an attribute using bs4

I need to extract an attribute value. I searched the web and could not find a solution. The only thing I found was to use a CSS selector (select_one). But the problem is that I need to get ALL the values of the attribute. Here is the HTML:
<span data-name="BLABLABLA" data-id="40423" data-volume="18.643.727" class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-tooltip="BLABLABLA"></span>
I need to get the data-id value (here it is 40423). But there are 3 more spans like this one. How do I get all the values, given that they all share the tag (span) and the attribute (data-id)?
I tried something like this:
DataNames = soup.findAll('span',attrs = {'data-id':True} )
for value in DataNames:
    data_names.append(value.span['data-id'])
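Try this. Each item returned by find_all is already the span, so read the attribute from it directly instead of going through .span:
data_names = []
DataNames = soup.find_all('span', attrs={'data-id': True})
for element in DataNames:
    data_names.append(element['data-id'])
I haven't tested it. If you post the link you're trying to get this from, I could probably help more.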
For someone who is still looking for a solution:
from bs4 import BeautifulSoup
html = """<span data-name="BLABLABLA" data-id="40423" data-volume="18.643.727" class="alertBellGrayPlus js-plus-icon genToolTip oneliner" data-tooltip="BLABLABLA"></span>"""
soup = BeautifulSoup(html, 'html.parser')
results = soup.find_all("span")
for result in results:
    # .get returns None instead of raising if the attribute is missing
    print(result.get('data-id'))
The output will be:
40423
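With several spans on the page, the same idea can be written as a list comprehension. This is a sketch over invented sample markup. Only the first `data-id` value comes from the question; the other spans are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample: two spans with data-id, one without.
html = """
<span data-name="A" data-id="40423"></span>
<span data-name="B" data-id="40424"></span>
<span class="no-id"></span>
"""
soup = BeautifulSoup(html, 'html.parser')

# attrs={'data-id': True} matches only spans that have the attribute,
# so indexing with ['data-id'] cannot raise KeyError here.
data_ids = [span['data-id'] for span in soup.find_all('span', attrs={'data-id': True})]
print(data_ids)
```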

How can I assign scraped data to two different variables with the same class?

I am scraping a website with the following HTML:
I have the following code:
import requests
from bs4 import BeautifulSoup
URL = 'https://texastech.com/sports/baseball/stats/2019/oregon/boxscore/14317#play-by-play'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='inning-all')
innings = results.find_all('table', class_='play-by-play')
for innings in innings:
    situation = innings.find('caption')
    away_team = innings.find('th', class_='text-center')
    home_team = innings.find('th', class_='text-center')
    print(away_team)
    print(home_team)
The issue I am running into is that I want to assign the first 'text-center' with the content 'ORE' to the away_team variable while assigning the 'text-center' with the content 'TTU' to the home_team variable.
When I run my code, it assigns 'ORE' to both variables which logically makes sense. I just can't seem to wrap my head around how to select the 'second' 'text-center' and assign it to home_team.
Any suggestions as to how I can accomplish this while neither table heading has a distinguishing class?
Thank you for your time and if there is anything I can add to clarify my question, don't hesitate.
You have this problem because find returns only the first match, which in your case is ORE. Call find_all on each table to get a list, then index it to get the first and second match.
You also have a mistake in your for loop: the loop variable overwrites the innings list:
for innings in innings: <-
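Both points can be sketched together as follows. The markup below is a guess at the page structure (the question's HTML was not preserved), built only from the ids, classes, and team codes mentioned in the thread. The loop variable is renamed to `inning` so it no longer shadows the list.

```python
from bs4 import BeautifulSoup

# Assumed structure; only the id, class names, and ORE/TTU come from the thread.
html = """
<div id="inning-all">
  <table class="play-by-play">
    <caption>Top of 1st</caption>
    <tr><th class="text-center">ORE</th><th class="text-center">TTU</th></tr>
  </table>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
results = soup.find(id='inning-all')
innings = results.find_all('table', class_='play-by-play')

for inning in innings:  # distinct name, so `innings` stays intact
    headers = inning.find_all('th', class_='text-center')
    if len(headers) >= 2:
        away_team = headers[0].get_text(strip=True)  # first match
        home_team = headers[1].get_text(strip=True)  # second match
        print(away_team, home_team)
```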

Unable to get the output in an organized manner

I've written a script in Python to scrape some item names, along with the review texts and reviewers connected to each item name, from a webpage using its API. My script below does this only partially. I need the output in an organized manner.
For example, in each item name there are multiple review texts and reviewer names connected to it. I wish to get them along the columns like:
Name review text reviewer review text reviewer -----
Basically, I can't work out how to use the for loop I have already defined in the right way within my script. Also, a few item names do not have any reviews or reviewers, so the code breaks when it doesn't find any.
This is my approach so far:
import requests
url = "https://eatstreet.com/api/v2/restaurants/{}?yelp_site="
res = requests.get("https://eatstreet.com/api/v2/locales/madison-wi/restaurants")
for item in res.json():
    itemid = item['id']
    req = requests.get(url.format(itemid))
    name = req.json()['name']
    for texualreviews in req.json()['yelpReviews']:
        reviews = texualreviews['message']
        reviewer = texualreviews['reviewerName']
        print(f'{name}\n{reviews}\n{reviewer}\n')
If I use the print statement outside the for loop, it only gives me a single review and reviewer.
Any help to fix that will be highly appreciated.
You need to append each review and reviewer name to a list to display them as you wish.
Try the following code.
review_data = dict()
review_data['name'] = req.json()['name']
review_data['reviews'] = []
for texualreviews in req.json()['yelpReviews']:
    review_sub_data = {'review': texualreviews['message'], 'reviewer': texualreviews['reviewerName']}
    review_data['reviews'].append(review_sub_data)
# Output: {'name': 'xxx', 'reviews':[{'review':'xxx', 'reviewer': 'xxx'}, {'review':'xxx', 'reviewer': 'xxx'}]}
Hope this helps!
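To get the "name, review text, reviewer, review text, reviewer ..." layout from the question, and to survive items without reviews, each restaurant can be flattened into one row. This is a sketch over invented sample dictionaries that mimic the keys used in the thread (`name`, `yelpReviews`, `message`, `reviewerName`). It makes no network call, and `build_row` is a hypothetical helper name.

```python
def build_row(restaurant):
    """Flatten one restaurant dict into [name, review, reviewer, ...]."""
    row = [restaurant.get('name', '')]
    # `or []` covers both a missing key and an explicit null value
    for review in restaurant.get('yelpReviews') or []:
        row.append(review.get('message', ''))
        row.append(review.get('reviewerName', ''))
    return row


# Invented sample data standing in for the parsed API responses.
sample = [
    {'name': 'Cafe A', 'yelpReviews': [
        {'message': 'Great', 'reviewerName': 'Ann'},
        {'message': 'Okay', 'reviewerName': 'Bob'},
    ]},
    {'name': 'Cafe B'},  # no reviews at all
]

rows = [build_row(restaurant) for restaurant in sample]
for row in rows:
    print(row)
```

Rows built this way have different lengths, so they can be passed to `csv.writer(...).writerow` one at a time if a spreadsheet-style file is the goal.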
