I need the curr_id number to submit requests with Python to investing.com and extract historical data for a number of currencies/commodities, as in the example below. I'm able to extract all the scripts, but I cannot figure out how to find the index of the script that contains curr_id and then extract the digits '2103'. Example: I need the code to find 2103.
import requests
from bs4 import BeautifulSoup

# URL of the historical-data page
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
# Open the URL
r = requests.get(url)
# Parse the response
soup = BeautifulSoup(r.content, 'html.parser')
# Find the script elements that may contain curr_id
curr_data = soup.find_all('script', {'type': 'text/javascript'})
UPDATE
I did it like this:
g_data_string = str(g_data)  # g_data holds the script elements found above
if 'curr_id' in g_data_string:
    print('success')
    start = g_data_string.find('curr_id') + 9
    end = g_data_string.find('curr_id') + 13
    print(g_data_string[start:end])
But I'm sure there is a better way to do it.
You can use a regular expression pattern as a text argument to find a specific script element. Then, search inside the text of the script using the same regular expression:
import re
import requests
from bs4 import BeautifulSoup
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
pattern = re.compile(r"curr_id: (\d+)")
script = soup.find('script', text=pattern)
match = pattern.search(script.text)
if match:
    print(match.group(1))
Prints 2103.
Here (\d+) is a capturing group that matches one or more digits.
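For illustration, here is the same pattern run against a made-up stand-in for the script text (the sample string is invented; only the curr_id: 2103 part mirrors the real page):

import re

pattern = re.compile(r"curr_id: (\d+)")
# Invented sample; the real page embeds curr_id: 2103 inside a large script
sample = "var params = { curr_id: 2103, other: 1 };"
match = pattern.search(sample)
if match:
    print(match.group(1))  # -> 2103, the digits captured by (\d+)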
You don't actually need a regex; you can get the id by extracting the value attribute from the input tag with name=item_ID:
In [6]: from bs4 import BeautifulSoup
In [7]: import requests
In [8]: r = requests.get("http://www.investing.com/currencies/usd-brl-historical-data").content
In [9]: soup = BeautifulSoup(r, "html.parser")
In [10]: soup.select_one("input[name=item_ID]")["value"]
Out[10]: u'2103'
You could also look for the id starting with item_id:
In [11]: soup.select_one("input[id^=item_id]")["value"]
Out[11]: u'2103'
Or look for the first div with the pair_id attribute:
In [12]: soup.select_one("div[pair_id]")["pair_id"]
Out[12]: u'2103'
There are actually numerous ways to get it.
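For instance, one more way in the same spirit, a sketch that skips HTML parsing entirely and applies the curr_id regex from the first answer to the raw page text (assuming the page still embeds curr_id: inside a script):

import re
import requests

html = requests.get('http://www.investing.com/currencies/usd-brl-historical-data').text
match = re.search(r'curr_id:\s*(\d+)', html)
if match:
    print(match.group(1))  # 2103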
I am trying to remove the quotes from my re.findall output using Python 3. I tried suggestions from various forums, but nothing worked as expected, so I finally thought of asking here myself.
My code:
import requests
from bs4 import BeautifulSoup
import re
import time
price = []
while True:
    url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = soup.prettify()
    for p in data:
        match = re.findall('\d*\.?\d+', data)
        print("ETH/USDT", match)
        price.append(match)
    break
Output of match gives: ['143.19000000']. I would like it to be like [143.19000000], but I cannot figure out how to do this.
Another problem I am encountering is that each match is appended to price as its own list, so the output of price looks like, for example, [[a], [b], [c]]. I would like it to be [a, b, c]. I am having a bit of trouble solving these two problems.
Thanks :)
Parse the response from requests.get() as JSON, rather than using BeautifulSoup:
import requests
url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
response = requests.get(url)
response.raise_for_status()
data = response.json()
print(data["price"])
To get floats instead of strings:
float_match = [float(el) for el in match]
To get a list instead of a list of lists:
for el in float_match:
    price.append(el)
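Putting both fixes together, a minimal sketch of the polling loop (the iteration count and sleep interval are arbitrary choices, not from the original code):

import time

import requests

url = "https://api.binance.com/api/v3/ticker/price?symbol=ETHUSDT"
price = []
for _ in range(3):  # poll a few times; use `while True` for continuous polling
    response = requests.get(url)
    response.raise_for_status()
    data = response.json()
    price.append(float(data["price"]))  # a float, appended flat -- no nested lists
    time.sleep(1)  # arbitrary pause between requests
print("ETH/USDT", price)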
It is convenient to use "index-x" to quickly locate a sub-section in a page.
For instance,
https://docs.python.org/3/library/re.html#index-2
gives the 3rd sub-section of that page.
When I want to share the location of a sub-section with others, how can I get its index in a convenient way?
For instance, how can I get the index of the {m,n} sub-section without counting up from index-0?
With bs4 4.7.1 you can use :has and :contains to target a specific text string and return the index (note that select_one returns the first match; use a list comprehension with select if you want to return all matches):
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
index = soup.select_one('dl:has(.pre:contains("{m,n}"))')['id']
print(index)
Any version: if you want a dictionary that maps the special characters to indices. Thanks to @zoe for spotting the error in my dictionary comprehension:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://docs.python.org/3/library/re.html')
soup = bs(r.content, 'lxml')
mappings = dict([(item['id'], [i.text for i in item.select('dt .pre')]) for item in soup.select('[id^="index-"]')])
indices = {i: k for (k, v) in mappings.items() for i in v}
You're looking for index-7.
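For example, a quick lookup against the resulting dict (the exact anchor can shift whenever the docs are regenerated):

print(indices['{m,n}'])  # index-7 at the time of writing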
You can download the HTML of the page and get all the possible values of index-something with the following code:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get('https://docs.python.org/3/library/re.html')
soup = BeautifulSoup(r.content.decode(), 'html.parser')
result = [t['id'] for t in soup.find_all(id=re.compile(r'index-\d+'))]
print(result)
Output:
['index-0', 'index-1', 'index-2', 'index-3', 'index-4', 'index-5', 'index-6', 'index-7', 'index-8', 'index-9', 'index-10', 'index-11', 'index-12', 'index-13', 'index-14', 'index-15', 'index-16', 'index-17', 'index-18', 'index-19', 'index-20', 'index-21', 'index-22', 'index-23', 'index-24', 'index-25', 'index-26', 'index-27', 'index-28', 'index-29', 'index-30', 'index-31', 'index-32', 'index-33', 'index-34', 'index-35', 'index-36', 'index-37', 'index-38']
The t objects in the list comprehension contain the HTML of the tags whose id matches the regex.
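Since the goal is sharing a location, a short follow-on sketch (reusing result from above) that turns each anchor into a full URL:

base = 'https://docs.python.org/3/library/re.html'
urls = ['{}#{}'.format(base, anchor) for anchor in result]
print(urls[2])  # https://docs.python.org/3/library/re.html#index-2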
I want to scrape data from websites using Beautiful Soup and requests, and I've gotten as far as extracting the data I want; now I want to filter it:
from bs4 import BeautifulSoup
import requests
url = "website.com"
keyword = "22222"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'lxml')
for article in soup.find_all('a'):
    for a in article:
        if article.has_attr('data-variant-code'):
            print(article.get("data-variant-code"))
Let's say this prints the following:
11111
22222
33333
How can I filter this so it only returns me the "22222"?
Assuming that article.get("data-variant-code") prints 11111, 22222, 33333, you can simply use an if statement:
for article in soup.find_all('a'):
    for a in article:
        if article.has_attr('data-variant-code'):
            x = article.get("data-variant-code")
            if x == '22222':
                print(x)
If you want to print the 2nd group of characters in a string delimited by spaces, you can split the string using a space as the delimiter. This gives you a list of strings; then access the 2nd item of the list.
For example:
print(article.get("data-variant-code").split(" ")[1])
result: 22222
I'm trying to use BeautifulSoup and RE to get a specific value from Yahoo Finance. I can't figure out exactly how to get it. I'll paste some code I have along with the HTML and unique selector I got.
I just want this number in here, "7.58," but the problem is that the class of this column is the same as many other ones in the same element.
<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td><td class="yfnc_tabledata1">7.58</td>
Here is the selector Google gave me...
yfncsumtab > tbody > tr:nth-child(2) > td.yfnc_modtitlew1 > table:nth-child(10) > tbody > tr > td > table > tbody > tr:nth-child(8) > td.yfnc_tabledata1
Here is some template code I'm using to test different things, but I'm very new to regular expressions and can't find a way to extract the number that follows "Diluted EPS (ttm):".
from bs4 import BeautifulSoup
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
soup = BeautifulSoup(res.text, 'html.parser')
body = soup.findAll('td')
print (body)
Thanks!
You could find by text Diluted EPS (ttm): first:
soup.find('td', text='Diluted EPS (ttm):').parent.find('td', attrs={'class': 'yfnc_tabledata1'})
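In context, a self-contained sketch using the HTML fragment from the question (class names as they appeared on the old Yahoo page):

from bs4 import BeautifulSoup

html = ('<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td>'
        '<td class="yfnc_tabledata1">7.58</td></tr>')
soup = BeautifulSoup(html, 'html.parser')
label = soup.find('td', text='Diluted EPS (ttm):')
value = label.parent.find('td', attrs={'class': 'yfnc_tabledata1'})
print(value.text)  # 7.58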
If using regex, please try:
>>> import re
>>> text = '<tr><td class="yfnc_tablehead1" width="74%">Diluted EPS (ttm):</td><td class="yfnc_tabledata1">7.58</td>'
>>> re.findall(r'Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', text)
['7.58']
UPDATE: Here is the sample code using requests and re (Python 2):
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
print re.findall('Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', res.text)
Output:
[u'7.58']
Thanks for answering my question. I was able to use two ways to get the desired value. The first way is this.
from bs4 import BeautifulSoup
import requests
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
soup = BeautifulSoup(res.text, 'html.parser')
eps = soup.find('td', text='Diluted EPS (ttm):').parent.find('td', attrs={'class': 'yfnc_tabledata1'})
for i in eps:
    print(i)
Here is the second way...
import requests
import re
sess = requests.Session()
res = sess.get('http://finance.yahoo.com/q/ks?s=MMM+Key+Statistics')
print (re.findall('Diluted\s+EPS\s+\(ttm\).*?>([\d.]+)<', res.text.strip()))
I don't quite understand it all yet, but this is a great start: two different ways to understand it and to move forward with this aspect of the project. Really appreciate your assistance!
I am using BeautifulSoup to parse this site:
http://www.livescore.com/soccer/champions-league/
I'm looking to get the links for the rows with numbers:
FT Zenit St. Petersburg 3 - 0 Standard Liege
The 3 - 0 is a link; what I want to do is find every link with numbers (so not results like
15:45 APOEL Nicosia ? - ? Paris Saint Germain
), so I can then load these links and parse out the minute data (<td class="min">).
EDIT: Now I'm able to get the links, like this:
import urllib2, re, bs4
sitioweb = urllib2.urlopen('http://www.livescore.com/soccer/champions-league/').read()
soup = bs4.BeautifulSoup(sitioweb)
href_tags = soup.find_all('a', {'class': "scorelink"})
links = []
for x in xrange(1, len(href_tags)):
    insert = href_tags[x].get("href")
    links.append(insert)
print links
Now my problem is the following: I want to write all of this into a DB (like sqlite), together with the minute in which each goal was scored (information I can get from each link), but that is only possible when the score is not ? - ?, since in that case no goal has been made.
I hope you can understand me...
Best regards and thanks a lot for your help,
Marco
The following search matches only your links:
import re
links = soup.find_all('a', class_='scorelink', href=True,
text=re.compile('\d+ - \d+'))
The search is limited to:
<a> tags
with the class scorelink
a non-empty href attribute
and the link text containing two digits separated by a dash.
Extracting just the links is then trivial:
score_urls = [link['href'] for link in soup.find_all(
'a', class_='scorelink', href=True, text=re.compile('\d+ - \d+'))]
Demo:
>>> from bs4 import BeautifulSoup
>>> import requests
>>> from pprint import pprint
>>> soup = BeautifulSoup(requests.get('http://www.livescore.com/soccer/champions-league/').content)
>>> [link['href'] for link in soup.find_all('a', class_='scorelink', href=True, text=re.compile('\d+ - \d+'))]
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/', '/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/', '/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/', '/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/', '/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/', '/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/', '/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/', '/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/', '/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/', '/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/', '/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
>>> pprint(_)
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
'/soccer/champions-league/qualifying-round/apoel-nicosia-vs-aab/1-1801432/',
'/soccer/champions-league/qualifying-round/bate-borisov-vs-slovan-bratislava/1-1801436/',
'/soccer/champions-league/qualifying-round/celtic-vs-maribor/1-1801428/',
'/soccer/champions-league/qualifying-round/fc-porto-vs-lille/1-1801444/',
'/soccer/champions-league/qualifying-round/arsenal-vs-besiktas/1-1801438/',
'/soccer/champions-league/qualifying-round/athletic-bilbao-vs-ssc-napoli/1-1801446/',
'/soccer/champions-league/qualifying-round/bayer-leverkusen-vs-fc-koebenhavn/1-1801442/',
'/soccer/champions-league/qualifying-round/malmo-ff-vs-salzburg/1-1801430/',
'/soccer/champions-league/qualifying-round/pfc-ludogorets-razgrad-vs-steaua-bucuresti/1-1801434/']
Fairly easy to do outside of the BeautifulSoup query itself. Just find all the links first, then filter out the ones whose text is ? - ?, then get the href attribute from each item in the cleaned list. See below.
In [1]: from bs4 import BeautifulSoup as bsoup
In [2]: import requests as rq
In [3]: url = "http://www.livescore.com/soccer/champions-league/"
In [4]: r = rq.get(url)
In [5]: bs = bsoup(r.text)
In [6]: links = bs.find_all("a", class_="scorelink")
In [7]: links
Out[7]:
[<a class="scorelink" href="/soccer/champions-league/group-a/atletico-madrid-vs-malmo-ff/1-1821150/" onclick="return false;">? - ?</a>,
<a class="scorelink" href="/soccer/champions-league/group-a/olympiakos-vs-juventus/1-1821151/" onclick="return false;">? - ?</a>,
...
In [8]: links_clean = [link for link in links if link.get_text() != "? - ?"]
In [9]: links_clean
Out[9]:
[<a class="scorelink" href="/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/" onclick="return false;">0 - 1</a>,
<a class="scorelink" href="/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/" onclick="return false;">3 - 0</a>,
...
In [10]: links_final = [link["href"] for link in links_clean]
In [11]: links_final
Out[11]:
['/soccer/champions-league/group-e/cska-moscow-vs-manchester-city/1-1821202/',
'/soccer/champions-league/qualifying-round/zenit-st-petersburg-vs-standard-liege/1-1801440/',
...
Extracting the minutes from each link is, of course, up to you.
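For what it's worth, a hedged sketch of that last step, assuming the site (as it was then) used relative hrefs and marked minutes with <td class="min">:

import requests
from bs4 import BeautifulSoup

base = 'http://www.livescore.com'
for href in links_final:  # the filtered links from above
    match_soup = BeautifulSoup(requests.get(base + href).content, 'html.parser')
    minutes = [td.get_text(strip=True) for td in match_soup.find_all('td', class_='min')]
    print(href, minutes)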