LiveScore BeautifulSoup Python

I am using BeautifulSoup to parse this site:
http://www.livescore.com/soccer/champions-league/
I'm looking to get the links for the rows with numbers:
FT Zenit St. Petersburg 3 - 0 Standard Liege
The 3 - 0 is a link; what I want to do is find every link with numbers (so not fixtures like
15:45 APOEL Nicosia ? - ? Paris Saint Germain
), so I can load these links and parse out the minute data (<td class="min">).
EDIT: I'm now able to get the links, like this:
import urllib2, re, bs4
sitioweb = urllib2.urlopen('http://www.livescore.com/soccer/champions-league/').read()
soup = bs4.BeautifulSoup(sitioweb)
href_tags = soup.find_all('a', {'class': "scorelink"})
links = []
for x in xrange(1, len(href_tags)):
    insert = href_tags[x].get("href")
    links.append(insert)
print links
Now my problem is the following: I want to write all of this into a DB (like sqlite), together with the minute in which each goal was scored (information I can get from the link), but that is only possible when the score is not ? - ?, since in that case no goal has been made.
I hope you can understand me...
Best regards and thanks a lot for your help,
Marco

The following search matches only your links:
import re
links = soup.find_all('a', class_='scorelink', href=True,
                      text=re.compile(r'\d+ - \d+'))
The search is limited to:
<a> tags
with the class scorelink
a non-empty href attribute
and the link text containing two numbers separated by a dash.
Extracting just the links is then trivial:
score_urls = [link['href'] for link in soup.find_all(
    'a', class_='scorelink', href=True, text=re.compile(r'\d+ - \d+'))]
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> from pprint import pprint
>>> soup = BeautifulSoup(requests.get('http://www.livescore.com/soccer/champions-league/').content)
>>> [link['href'] for link in soup.find_all('a', class_='scorelink', href=True, text=re.compile('\d+ - \d+'))]
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/', '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/', '/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/', '/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/', '/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/', '/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/', '/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/', '/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/', '/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/', '/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/', '/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
>>> pprint(_)
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
'/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/',
'/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/',
'/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/',
'/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/',
'/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/',
'/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/',
'/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/',
'/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/',
'/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
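Since the question also asks about writing the results to a DB, here is a minimal sqlite sketch built on score_urls; the database file, table, and column names are illustrative assumptions, not anything taken from the site:
import sqlite3
# Illustrative schema: one row per finished-match URL (names are assumptions)
conn = sqlite3.connect('scores.db')
conn.execute('CREATE TABLE IF NOT EXISTS matches (url TEXT PRIMARY KEY)')
conn.executemany('INSERT OR IGNORE INTO matches (url) VALUES (?)',
                 [(url,) for url in score_urls])
conn.commit()
conn.close()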

This is fairly easy to do outside of BeautifulSoup's search arguments. Just find all the links first, filter out the ones whose text is ? - ?, then get the href attribute from each item in the filtered list. See below.
In [1]: from bs4 import BeautifulSoup as bsoup
In [2]: import requests as rq
In [3]: url = "http://www.livescore.com/soccer/champions-league/"
In [4]: r = rq.get(url)
In [5]: bs = bsoup(r.text)
In [6]: links = bs.find_all("a", class_="scorelink")
In [7]: links
Out[7]:
[<a class="scorelink" href="/soccer/champions-league/group-a/atletico-madrid-vs-malmo-ff/1-1821150/" onclick="return false;">? - ?</a>,
<a class="scorelink" href="/soccer/champions-league/group-a/olympiakos-vs-juventus/1-1821151/" onclick="return false;">? - ?</a>,
...
In [8]: links_clean = [link for link in links if link.get_text() != "? - ?"]
In [9]: links_clean
Out[9]:
[<a class="scorelink" href="/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/" onclick="return false;">0 - 1</a>,
<a class="scorelink" href="/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/" onclick="return false;">3 - 0</a>,
...
In [10]: links_final = [link["href"] for link in links_clean]
In [11]: links_final
Out[11]:
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
...
Extracting the minutes from each link is, of course, up to you.
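For what it's worth, a rough sketch of that last step, reusing the bsoup and rq aliases from above; it leans on the question's statement that the minute data lives in <td class="min">, and the rest of the match-page structure is an untested assumption:
base = "http://www.livescore.com"
for path in links_final:
    match_page = bsoup(rq.get(base + path).text)
    # The question says each goal's minute sits in <td class="min">;
    # whether the cells need further cleanup depends on the actual markup.
    minutes = [td.get_text(strip=True) for td in match_page.find_all("td", class_="min")]
    print(path, minutes)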

Is there any convenient way to get the index of a subsection in a page?

It is convenient to use "index-x" to quickly locate a subsection in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
gives the 3rd subsection of the page.
When I want to share the location of a subsection with others, how can I get the index in a convenient way?
For instance, how can I get the index of the {m,n} subsection without counting up from index-0?
With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index. Note that select_one returns only the first match; use a list comprehension with select if you want all matches (see the sketch after the code below).
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
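The all-matches variant mentioned above would look something like this (a sketch; it assumes every matching dl carries an id):
# Every dl whose <span class="pre"> text contains "{m,n}", not just the first
all_ids = [dl.get('id') for dl in soup.select('dl:has(.pre:contains("{m,n}"))')]
print(all_ids)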
Any version: if you want a dictionary that maps special characters to indices, you can do the following. Thanks to @zoe for spotting the error in my dictionary comprehension.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')])
                 for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
You're looking for index-7.
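With the indices mapping built above, the lookup becomes a plain dictionary access (assuming the page still numbers its subsections the same way):
# Look up the anchor for the {m,n} quantifier directly
print(indices['{m,n}'])  # e.g. 'index-7'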
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode())
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.

How to scrape only certain tags that are all in the same class?

I'm creating a program to scrape all the names and abilities of the characters from this website. The li tags that contain the information I need are mixed in with other li tags that are not needed.
I have tried selecting different classes, but that won't work.
Here is my code:
import bs4, requests, lxml, re, time, os
from bs4 import BeautifulSoup as soup

def webscrape():
    res = requests.get('https://www.usgamer.net/articles/15-11-2017-skyrim-guide-for-xbox-one-and-ps4-which-races-and-character-builds-are-the-best')
    soup = bs4.BeautifulSoup(res.text, 'lxml')
    races_list = soup.find_all("li < strong")  # "li < strong" is not a valid find_all argument, which is the problem
    races_list_text = [f.text.strip() for f in races_list]
    print(races_list_text)
    time.sleep(1)

webscrape()
It is expected to print out all the races and their corresponding information.
You can use the following:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.usgamer.net/articles/15-11-2017-skyrim-guide-for-xbox-one-and-ps4-which-races-and-character-builds-are-the-best')
soup = bs(r.content, 'lxml')
# one list of tuples
race_info = [(item.text, item.next_sibling) for item in soup.select('h2 ~ ul strong')]
# separate lists
races, abilities = zip(*race_info)
A dictionary might be nicer, in which case you can do:
race_info = dict((item.text, item.next_sibling) for item in soup.select('h2 ~ ul strong'))
The ~ is a general sibling combinator:
The ~ combinator selects siblings. This means that the second element
follows the first (though not necessarily immediately), and both share
the same parent.
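For a quick sanity check of the dictionary version (the exact keys depend on the live page):
for race, ability in race_info.items():
    print(race, '->', ability)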

Hashtags python html

I want to extract all the hashtags from a given website.
For example, "I love #stack overflow because #people are very #helpful!"
This should pull the 3 hashtags into a table.
In the website I am targeting there is a table with a #tag and its description,
so we can find, e.g., #love followed by "this hashtag speaks about love".
This is my work:
# import the library used to query a website
import urllib2
# specify the url
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
# query the website and return the html to the variable 'page'
page = urllib2.urlopen(wiki)
# import the BeautifulSoup functions to parse the data returned from the website
from bs4 import BeautifulSoup
# parse the html in the 'page' variable, and store it in BeautifulSoup format
soup = BeautifulSoup(page, "lxml")
print soup.prettify()
s = soup.get_text()
import re
re.findall("#(\w+)", s)
I have issues with the output.
The first one is that the output looks like this:
[u'eeeeee',
u'333333',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'222222',
u'AASTGrandRoundsacute'
The output concatenates the hashtag with the first word of the description. Compared to the example I gave before, the output would be 'lovethis'.
How can I extract only the word of the hashtag itself?
Thank you
I think there's no need to use regex to parse the text you get from the page; you can use BeautifulSoup itself for that. I'm using Python 3.6 in the code below, just to show the entire code, but the important line is hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'}). Notice all hashtags in the table sit in a td tag with the id attribute tweetchatlist_hashtag, so calling .findAll is the way to go here:
import requests
import re
from bs4 import BeautifulSoup
wiki = "https://www.symplur.com/healthcare-hashtags/tweet-chats/all"
page = requests.get(wiki).text
soup = BeautifulSoup(page, "lxml")
hashtags = soup.findAll('td', {'id':'tweetchatlist_hashtag'})
Now let's have a look at the first item of our list:
>>> hashtags[0]
<td id="tweetchatlist_hashtag" itemprop="location">#AASTGrandRounds</td>
So we see that what we really want is the value of the title attribute of the a tag inside it:
>>> hashtags[0].a['title']
'#AASTGrandRounds'
To proceed to get a list of all hashtags using list comprehension:
>>> lst = [hashtag.a['title'] for hashtag in hashtags]
If you are not used to list comprehension syntax, the line above is equivalent to this:
>>> lst = []
>>> for hashtag in hashtags:
...     lst.append(hashtag.a['title'])
lst is then the desired output; here are the first 20 items of the list:
>>> lst[:20]
['#AASTGrandRounds', '#abcDrBchat', '#addictionchat', '#advocacychat', '#AetnaMyHealthy', '#AlzChat', '#AnatQ', '#anzOTalk', '#AskAvaility', '#ASPChat', '#ATtalk', '#autchat', '#AXSChat', '#ayacsm', '#bcceu', '#bccww', '#BCSM', '#benurse', '#BeTheDifference', '#bioethx']
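As an aside on the original problem: concatenations like 'lovethis' happen because get_text() joins adjacent text nodes with no separator. If you do want to keep the regex approach, passing a separator avoids that; a small sketch based on the code above:
# A space between text nodes keeps each hashtag separate from its description
s = soup.get_text(separator=' ')
tags = re.findall(r'#(\w+)', s)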

Find and extract curr_id number from Investing

I need to know the curr_id to submit to investing.com using Python so I can extract historic data for a number of currencies/commodities. As in the example below, I'm able to extract all the scripts, but I cannot figure out how to find the correct script index that contains curr_id and extract the digits '2103'. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup

#URL
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
#OPEN URL
r = requests.get(url)
#DETERMINE FORMAT
soup = BeautifulSoup(r.content, 'html.parser')
#FIND TABLE WITH VALUES IN soup
curr_data = soup.find_all('script', {'type': 'text/javascript'})
UPDATE
I did it like this:
g_data_string = str(curr_data)
if 'curr_id' in g_data_string:
    print('success')
    start = g_data_string.find('curr_id') + 9
    end = g_data_string.find('curr_id') + 13
    print(g_data_string[start:end])
But I'm sure there is a better way to do it.
You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
    print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that would match one or more digits.
You don't actually need a regex; you can get the id by extracting the value attribute from the input tag with name=item_ID:
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.

Using Regular Expressions With Python to Get Value Buried in HTML5

I'm trying to use BeautifulSoup and RE to get a specific value from Yahoo Finance. I can't figure out exactly how to get it. I'll paste some code I have along with the HTML and unique selector I got.
I just want this number in here, "7.58," but the problem is that the class of this column is the same as many other ones in the same element.
<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td><td class="yfnc_tabledata1">7.58</td>
Here is the selector Google gave me...
yfncsumtab > tbody > tr:nth-child(2) > td.yfnc_modtitlew1 > table:nth-child(10) > tbody > tr > td > table > tbody > tr:nth-child(8) > td.yfnc_tabledata1
Here is some template code I'm using to test different things, but I'm very new to regular expressions and can't find a way to extract that number after "Diluted EPS (ttm):".
from bs4 import BeautifulSoup
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
soup = BeautifulSoup(res.text, 'html.parser')
body = soup.findAll('td')
print (body)
Thanks!
You could find by text Diluted EPS (ttm): first:
soup.find('td', text='Diluted EPS (ttm):').parent.find('td', attrs={'class': 'yfnc_tabledata1'})
If using regex, please try:
>>> import re
>>> text = '<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td><td class="yfnc_tabledata1">7.58</td>'
>>> re.findall('Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', text)
['7.58']
UPDATE: Here is sample code using requests and re:
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
print re.findall('Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', res.text)
Output:
[u'7.58']
Thanks for answering my question. I was able to use two ways to get the desired value. The first way is this.
from bs4 import BeautifulSoup
import requests
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
soup = BeautifulSoup(res.text, 'html.parser')
eps = soup.find('td', text='Diluted EPS (ttm):').parent.find('td', attrs={'class': 'yfnc_tabledata1'})
for i in eps:
    print(i)
Here is the second way...
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
print (re.findall('Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', res.text.strip()))
I don't quite understand it all yet, but this is a great start, with two different ways to approach it as I move forward with this aspect of the project. Really appreciate your assistance!
