How can I assign scraped data to two different variables with the same class? - python

I am scraping a website and have the following code:
import requests
from bs4 import BeautifulSoup

URL = 'https://texastech.com/sports/baseball/stats/2019/oregon/boxscore/14317#play-by-play'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find(id='inning-all')
innings = results.find_all('table', class_='play-by-play')

for innings in innings:
    situation = innings.find('caption')
    away_team = innings.find('th', class_='text-center')
    home_team = innings.find('th', class_='text-center')
    print(away_team)
    print(home_team)
The issue I am running into is that I want to assign the first 'text-center' heading (containing 'ORE') to the away_team variable and the second 'text-center' heading (containing 'TTU') to the home_team variable.
When I run my code, it assigns 'ORE' to both variables, which logically makes sense. I just can't wrap my head around how to select the second 'text-center' and assign it to home_team.
Any suggestions as to how I can accomplish this when neither table heading has a distinguishing class?
Thank you for your time, and if there is anything I can add to clarify my question, don't hesitate to ask.

You have this problem because find returns only the first match, which in your case is 'ORE'. Use inning.find_all to get a list, then use indexes to get the first and second matches.
You also have a mistake in your for loop: you overwrite the innings variable:
for innings in innings: <-
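For example, a minimal sketch of the fixed loop (renaming the loop variable to inning so it no longer shadows the list, and assuming each play-by-play table contains exactly the two 'text-center' headings described):
for inning in innings:
    situation = inning.find('caption')
    headers = inning.find_all('th', class_='text-center')
    away_team = headers[0]  # first 'text-center' heading: ORE
    home_team = headers[1]  # second 'text-center' heading: TTU
    print(away_team.get_text(), home_team.get_text())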

Related

BeautifulSoup4: Extracting tables, now how do I exclude certain tags and bits of information I do not want

Trying to extract coin names, price, and market cap from coinmarketcap.com. I first tried using soup.find_all to search for certain tags with a specific class, but it always picked up information I didn't need or want. So instead I used find_all to search for 'td' and then planned on using a for loop to look for specific class names, append those to a new list, and print that list, but it returns a data type for some reason.
coin_table = soup.find_all('td')
class_value = 'sc-1eb5slv-0 iJjGCS'
for i in coin_table:
    if class_value in coin_table:
        list.append(i)
print(list)
But this returns:
<class 'list'>
to the console, even though I'm not asking to see the data type. Very new to BeautifulSoup and coding in general, so sorry if this is a very basic question. Still trying to get my head around all of this.
As @RJAdriaansen mentioned, you don't need to scrape the website when they provide an API. Here is how you do it with the requests library:
import requests
url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,tags,platform,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
response = requests.get(url)
data = response.json()
This will give you json data. Now you can grab all you need by accessing correct keys:
final_list = []
temp = []
for each_crypto in data['data']['cryptoCurrencyList']:
    temp.append(each_crypto['name'])
    # each_crypto['quotes'] gives you a list with the price and market cap of each crypto
    for quote in each_crypto['quotes']:
        # assuming you want to get the USD price of each crypto
        if quote['name'] == "USD":
            temp.append(quote['price'])
            temp.append(quote['marketCap'])
    final_list.append(temp)
    temp = []
Final result would look like this:
[
['Bitcoin', 34497.01819639692, 646704595579.0485],
['Ethereum', 2195.11816422801, 255815488972.87268],
['Tether', 1.0003936138399, 62398426501.02027],
['Binance Coin', 294.2550537711805, 45148405357.003],
...
]

Why can't I use a variable in a results.find_all?

I'm trying to do a pretty basic web scrape but I'd like to be able to use variables so I don't have to keep repeating code for multiple pages.
example line of code:
elems = results.find_all("span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P")
I would like it to read as:
elems = results.find_all(<variable>)
and then have my variable be:
'"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
However, when I do this, I get no results. I've included the rest of the function below. Does anyone know why this will not work?
EDIT:
I've also tried splitting it up as in the example below, but I still get the same issue:
elems = results.find_all(variable1 , class_=variable2)
variable1 = '"span"'
variable2 = '"st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"'
code:
def get_name(sym, result, elem, name):
    url = URL + sym
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find(id=result)
    elems = results.find_all(elem)
    for elem in elems:
        name_elem = elem.find(name)
        print(name_elem.text)

get_name('top', "app", '"span",class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"', '"span","st_3lrv4Jo"')
The find_all method takes more than one parameter.
You are passing everything as a single string in the first argument, so the method looks for a tag with that literal name and finds nothing.
You will need to split your variable '"span", class_="st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"' into two variables, without the embedded quotes:
elem = "span" and class_value = "st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"
and in your code it will look like:
elems = results.find_all(elem, class_=class_value)
Just a few more things:
According to the documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all, the class filter can also be given as an attrs dict with a list of class values (matching elements whose class matches any of them), so your call can also look like:
results.find_all(elem, {'class': ['st_2XVIMK7', 'st_8u0ePN3', 'st_2oUi2Vb', 'st_3kXJm4P']})
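Putting this together, a minimal sketch of the variable-based call (assuming results was already located with soup.find, as in the question's function):
tag_name = "span"
class_value = "st_2XVIMK7 st_8u0ePN3 st_2oUi2Vb st_3kXJm4P"
elems = results.find_all(tag_name, class_=class_value)  # plain strings, no extra quotes inside
for elem in elems:
    print(elem.text)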

Take only the second element that has the same name in BeautifulSoup

I'm scraping a website for college work, and I am having trouble getting only the second text in a span. I have seen that you can use the code below to get the text:
gross = container.find_all('span', attrs = {'name':'nv'})
print(gross)
I have as a result this:
[<span data-value="845875" name="nv">845.875</span>, <span data-value="335.451.311" name="nv">$335.45M</span>]
How do I get only the value contained in the second data-value, in a way that can be replicated for other spans?
Try this:
gross = container.find_all('span', attrs={'name': 'nv', 'data-value': '335.451.311'})
print(gross)
If this data-value keeps changing, then you don't have any other choice but to use gross[1].
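For example, a small sketch based on the sample output in the question (assuming each container yields its two spans in the same order):
gross = container.find_all('span', attrs={'name': 'nv'})
second = gross[1]                # the second <span name="nv">
print(second.get_text())         # $335.45M
print(second.get('data-value'))  # 335.451.311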

multiple findAll in one for loop

I'm using BeautifulSoup to read some data from a web page.
This code works fine, but I would like to improve it.
How do I make the for loop extract more than one piece of data per iteration? Here I have 3 for loops to get values from:
for elem in bsObj.findAll('div', class_="grad"): ...
for elem in bsObj.findAll('div', class_="ulica"): ...
for elem in bsObj.findAll('div', class_="kada"): ...
How do I change this to work in one for loop? Of course, I'd like a simple solution.
The output can be lists.
My code so far
from bs4 import BeautifulSoup
# get data from a web page into the ``html`` variable here
bsObj = BeautifulSoup(html.read(),'lxml')
mj = []
adr = []
vri = []

for mjesto in bsObj.findAll('div', class_="grad"):
    print(mjesto.get_text())
    mj.append(mjesto.get_text())
for adresa in bsObj.findAll('div', class_="ulica"):
    print(adresa.get_text())
    adr.append(adresa.get_text())
for vrijeme in bsObj.findAll('div', class_="kada"):
    print(vrijeme.get_text())
    vri.append(vrijeme.get_text())
You can use BeautifulSoup's select method to target your various desired elements, and do whatever you want with them. In this case we are going to simplify the CSS selector pattern by using the :is() pseudo-class, but basically we are searching for any div that has class grad, ulica, or kada. As each element is returned that matches the pattern, we just sort them by which class they correspond to:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import requests

lokacija = "http://www.hep.hr/ods/bez-struje/19?dp=koprivnica&el=124"
datum = "12.02.2019"
lokacija = lokacija + "&datum=" + datum
print(lokacija)

r = requests.get(lokacija)
print(type(str(r)))
print(r.status_code)

html = urlopen(lokacija)
bsObj = BeautifulSoup(html.read(), 'lxml')

print("Datum radova:", datum)
print("HEP područje:", bsObj.h3.get_text())

mj = []
adr = []
vri = []
hep_podrucje = bsObj.h3.get_text()

for el in bsObj.select('div:is(.grad, .ulica, .kada)'):
    if 'grad' in el.get('class'):
        print(el.get_text())
        mj.append(el.get_text())
    elif 'ulica' in el.get('class'):
        print(el.get_text())
        adr.append(el.get_text())
    elif 'kada' in el.get('class'):
        print(el.get_text())
        vri.append(el.get_text())
Note: basic explanation ahead. If you already know this, skip directly to the listing of possibilities.
To change the code into a loop, you have to look at the part that stays the same and the part that varies. In your case, you find a div, get the text, and append it to a list.
The class attribute of the div varies each time, and so does the list you append to. A for loop works by having one variable that is assigned a different value on each iteration, then executing the code within.
We get a basic structure:
for div_class in <div classes>:
    <stuff to do>
Now, in <stuff to do>, we have a different list each time. We need some way of getting a different list into the loop. For this, there are multiple possibilities:
Put the list into a dict and use item lookup
zip the lists with <div classes> and iterate over them
Both approaches involve nested loops, the result looking similar to this:
list_1 = []
list_2 = []
list_3 = []

for div_class, the_list in zip(['div_cls1', 'div_cls2', 'div_cls3'], [list_1, list_2, list_3]):
    for elem in bsObj.find_all('div', class_=div_class):
        the_list.append(elem.get_text())
or
lists = {'div_cls1': [], 'div_cls2': [], 'div_cls3': []}
for div_class in lists:  # note: keys MUST match the class of the div elements
    for elem in bsObj.find_all('div', class_=div_class):
        lists[div_class].append(elem.get_text())
Of course, the inner loop could be replaced by list comprehension (works for the dict approach): lists[div_class] = [elem.get_text() for elem in bsObj.find_all('div', class_=div_class)]
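Applied to the classes from the question, a compact sketch of the dict approach might look like this (assuming bsObj is parsed as in the question; mj, adr and vri then correspond to lists['grad'], lists['ulica'] and lists['kada']):
lists = {'grad': [], 'ulica': [], 'kada': []}
for div_class in lists:
    lists[div_class] = [elem.get_text() for elem in bsObj.find_all('div', class_=div_class)]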

How can I avoid an array resetting after a callback?

I want to scrape review data from a website using Scrapy. The code is given below.
The problem is that each time the program goes to the next page, it starts at the beginning of parse (due to the callback) and resets records[]. The array is empty again, and every review that was saved in records[] is lost. As a result, when I open my csv file, I only get the reviews of the last page.
What I want is for all the data to be stored in my csv file, so that records[] does not keep resetting each time the next page is requested. I can't put the line records = [] before the parse method, because then the array is not defined.
Here is my code:
def parse(self, response):
    records = []
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()
        if not votes:
            votes = "none"
        records.append((rating, votes, rtext))
    print(records)
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)
    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')
Moving the records declaration into the method signature exploits a common gotcha in Python, outlined here in the Python docs. In this instance, however, the weird behavior of instantiating lists in a method declaration works in your favor.
Python's default arguments are evaluated once when the function is defined, not each time the function is called (as they are in, say, Ruby). This means that if you use a mutable default argument and mutate it, you have mutated that object for all future calls to the function as well.
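A tiny self-contained illustration of that behavior (hypothetical names, purely to show the gotcha):
def append_to(item, bucket=[]):  # the default list is created once, at definition time
    bucket.append(item)
    return bucket

print(append_to(1))  # [1]
print(append_to(2))  # [1, 2] - the first call's data is still there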
def parse(self, response, records=[]):
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        rtext = r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first()
        rating = r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first()
        votes = r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first()
        if not votes:
            votes = "none"
        records.append((rating, votes, rtext))
    print(records)
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        nextPage = response.urljoin(nextPage)
        yield scrapy.Request(url=nextPage)
    import pandas as pd
    df = pd.DataFrame(records, columns=['rating', 'votes', 'rtext'])
    df.to_csv('ama.csv', sep='|', index=False, encoding='utf-8')
The above method is a little weird. A more general solution would be to simply use a global variable. Here is a post going over how to use globals.
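A rough, self-contained sketch of the global idea (parse_page is a hypothetical stand-in for the Scrapy callback):
records = []  # module-level, created once

def parse_page(reviews):
    # appending mutates the module-level list; nothing resets between calls
    for review in reviews:
        records.append(review)

parse_page(['review from page 1'])
parse_page(['review from page 2'])
print(records)  # ['review from page 1', 'review from page 2']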
Here parse is a callback that is called again for every page. Try defining records globally, or write an appender function and call it to append values.
Also, Scrapy is capable of generating the CSV itself. Here's my little experiment with scraping - https://gist.github.com/lisitsky/c4aac52edcb7abfd5975be067face1bb
So you can load the data to csv, and then pandas will read it.
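For reference, a minimal sketch of that approach: yield one dict per review instead of collecting them in a list, and let Scrapy's built-in feed export write the file (the XPaths are the ones from the question):
def parse(self, response):
    for r in response.xpath('//div[contains(@class, "a-section review")]'):
        yield {
            'rating': r.xpath('.//span[contains(@class, "a-icon-alt")]/text()').extract_first(),
            'votes': r.xpath('normalize-space(.//span[contains(@class, "review-votes")]/text())').extract_first() or "none",
            'rtext': r.xpath('.//div[contains(@class, "a-row review-data")]').extract_first(),
        }
    nextPage = response.xpath('//li[contains(@class, "a-last")]/a/@href').extract_first()
    if nextPage:
        yield scrapy.Request(url=response.urljoin(nextPage))
Running the spider with scrapy crawl <spider name> -o ama.csv then appends every yielded item to the file across all pages.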
