I want to scrape data from charts on this page: http://188.166.44.172/match/live-stats/100941310
I tried requests and bs4 but failed to get any data; I also tried selenium and got no data either.
Here's the code using requests:
import requests
from bs4 import BeautifulSoup

u = "http://188.166.44.172/match/live-stats/100941310"
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:76.0) Gecko/20100101 Firefox/76.0'}

session = requests.Session()
r = session.get(u, timeout=30, headers=headers)
soup = BeautifulSoup(r.content, 'html.parser')

for i in soup.find_all('rect'):
    if i.has_attr("onmouseover"):
        text = i.get('onmouseover')
        print(text)
And the code using selenium:
from selenium import webdriver
from bs4 import BeautifulSoup

u = "http://188.166.44.172/match/live-stats/100941310"
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(executable_path=r"C:/chromedriver.exe", options=options)
driver.get(u)
soup = BeautifulSoup(driver.page_source, 'html.parser')

for i in soup.find_all('rect'):  # I also tried soup.select('*')
    if i.has_attr("onmouseover"):
        text = i.get('onmouseover')
        print(text)
Is there any way to scrape data from those charts using Python?
The reason you're not getting anything back is that the charts are generated dynamically by JavaScript, so with a tool like bs4 you never see the rendered SVG elements.
However, the data for the charts is embedded in the HTML. You could parse that and plot.
Here's how:
import ast
import re

import requests
import matplotlib.pyplot as plt

target_url = "http://188.166.44.172/match/live-stats/100941310"
page_source = requests.get(target_url).text

# The chart data sits in the page source as a JS array assignment,
# which ast.literal_eval can parse once the regex extracts it.
raw_attack_data = ast.literal_eval(
    re.search(r"var all_attack = (\[.*\])", page_source).group(1),
)

# Keep the second item of each [time, value] pair; dict entries are icon markers.
all_attack = [i[1] for i in raw_attack_data if isinstance(i, list)]

plt.plot(all_attack, label="attack")
plt.legend(loc="lower right")
plt.show()
This should give you a simple line plot of the attack values.
As I've said, everything you need is in the source code, so you'd just have to play around with the values.
In the source, each stat is stored as a JS array of inner lists, where the first value of the inner list is the game time and the second value is the stat that's plotted on the charts.
Note that some arrays also contain values in {}. Those are the icon markers on the charts. You can filter them out with isinstance(i, list), since the {} entries parse as dicts, as I've shown above.
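To make that filtering concrete, here's a minimal sketch on a made-up snippet (the values and keys are hypothetical; only the shape mimics the embedded arrays):

import ast

# Made-up string mimicking one embedded array: inner lists are data points,
# the dict stands in for one of the {} icon markers on the chart.
raw = "[[1, 0], [2, 5], {'minute': 3, 'icon': 'goal'}, [4, 7]]"
parsed = ast.literal_eval(raw)

# isinstance(i, list) keeps the data points and drops the dict marker.
points = [i for i in parsed if isinstance(i, list)]
print(points)  # [[1, 0], [2, 5], [4, 7]]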
EDIT:
Yes, it's possible to get division and team info, as everything is in the HTML. I've reworked the initial answer a bit and came up with this:
import ast
import re

import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt

target_url = "http://188.166.44.172/match/live-stats/100941310"
page_source = requests.get(target_url).text


def get_match_info(html_source: str) -> list:
    return [
        i.getText(strip=True) for i
        in BeautifulSoup(html_source, "lxml").select("h1 a")
    ]


def get_stats(html_source: str, search_str: str) -> tuple:
    raw_data = ast.literal_eval(
        re.search(fr"var {search_str} = (\[.*\])", html_source).group(1),
    )
    filtered = [i[1] for i in raw_data if isinstance(i, list)]
    game_time = [i[0] for i in raw_data if isinstance(i, list)]
    return game_time, filtered
division, home, away = get_match_info(page_source)
time_, attack_home = get_stats(page_source, "dangerous_home")
_, attack_away = get_stats(page_source, "dangerous_away")
plt.suptitle(f"{division} - {home} v {away}")
plt.ylabel("Attack")
plt.xlabel("Game time")
plt.plot(time_, attack_home, color="blue", label=home)
plt.plot(time_, attack_away, color="black", label=away)
plt.legend(loc="lower right")
plt.show()
This produces a titled plot of both teams' attack values over game time.
Related
I am writing a script that will scrape a newsletter for URLs. There are some URLs in the newsletter that are irrelevant (e.g. links to articles, mailto links, social links, etc.). I added some logic to remove those links, but for some reason not all of them are being removed. Here is my code:
from os import remove
from turtle import clear
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")

termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

for link in termSheetLinks:
    if "fortune.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "forbes.com" in link in termSheetLinks:
        termSheetLinks.remove(link)
    if "twitter.com" in link in termSheetLinks:
        termSheetLinks.remove(link)

print(termSheetLinks)
When I ran it most recently, this was my output, despite trying to remove all links containing "fortune.com":
['https://fortune.com/company/blackstone-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://fortune.com/company/tpg?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://casproviders.org/asd-guidelines/', 'https://fortune.com/company/carlyle-group?utm_source=email&utm_medium=newsletter&utm_campaign=term-sheet&utm_content=2022080907am', 'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5', 'mailto:termsheet#fortune.com', 'https://www.afresh.com/', 'https://www.geopagos.com/', 'https://montana-renewables.com/', 'https://descarteslabs.com/', 'https://www.dealer-pay.com/', 'https://www.sequeldm.com/', 'https://pueblo-mechanical.com/', 'https://dealcloud.com/future-proof-your-firm/', 'https://apartmentdata.com/', 'https://www.irobot.com/', 'https://www.martin-bencher.com/', 'https://cell-matters.com/', 'https://www.lever.co/', 'https://www.sigulerguff.com/']
Any help would be greatly appreciated!
You do not need a regex for this, in my opinion. Instead of removing the URLs afterwards, append only those that do not contain your substrings, e.g. with a list comprehension:
[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a") if not any(x in companyURL.get('href') for x in ["fortune.com","forbes.com","twitter.com"])]
Example
from bs4 import BeautifulSoup
import requests
termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")
myList = ["fortune.com","forbes.com","twitter.com"]
[companyURL.get('href') for companyURL in htmlParser.select("table#templateBody p > a")
if not any(x in companyURL.get('href') for x in myList)]
Output
['https://casproviders.org/asd-guidelines/',
'https://ir.carlyle.com/static-files/433abb19-8207-4632-b173-9606698642e5',
'https://www.afresh.com/',
'https://www.geopagos.com/',
'https://montana-renewables.com/',
'https://descarteslabs.com/',
'https://www.dealer-pay.com/',
'https://www.sequeldm.com/',
'https://pueblo-mechanical.com/',
'https://dealcloud.com/future-proof-your-firm/',
'https://apartmentdata.com/',
'https://www.irobot.com/',
'https://www.martin-bencher.com/',
'https://cell-matters.com/',
'https://www.lever.co/',
'https://www.sigulerguff.com/']
Removing the links after the for loop has finished will not skip any entries (removing items while you iterate mutates the list underneath the iterator; see the sketch after the code below).
from os import remove
from turtle import clear
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd

termSheet = "https://fortune.com/newsletter/termsheet"
html = requests.get(termSheet)
htmlParser = BeautifulSoup(html.text, "html.parser")

termSheetLinks = []

for companyURL in htmlParser.select("table#templateBody p > a"):
    termSheetLinks.append(companyURL.get('href'))

lRemove = []
for link in termSheetLinks:
    if "fortune.com" in link:
        lRemove.append(link)
    if "forbes.com" in link:
        lRemove.append(link)
    if "twitter.com" in link:
        lRemove.append(link)

for l in lRemove:
    termSheetLinks.remove(l)

print(termSheetLinks)
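To see why removing during iteration skips entries, here's a minimal sketch with made-up links:

links = ["keep1", "drop1", "drop2", "keep2"]

for link in links:
    if link.startswith("drop"):
        # Removing shifts the remaining items left; the iterator then
        # steps past "drop2" without ever visiting it.
        links.remove(link)

print(links)  # ['keep1', 'drop2', 'keep2'] - "drop2" survived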
from bs4 import BeautifulSoup
import requests

url = "https://bararanonline.com/letter/%D5%A1?page=1"
response = requests.get(url)
soup = BeautifulSoup(response.content, "lxml")
words = soup.find_all('a', "word-href")

for word in words:
    print(word.text)
So, I got the first page. Now I want to scrape information from all pages by substituting the page number into the URL (page={}), but I can't figure out how to do it.
Thanks in advance.
Simply define a for loop and set your range() parameters:
from bs4 import BeautifulSoup
import requests

url = "https://bararanonline.com/letter/%D5%A1?page="
words = []

for i in range(1, 3):
    response = requests.get(f'{url}{i}')
    # or, as Olvin Roght mentioned, by setting params:
    # response = requests.get("https://bararanonline.com/letter/ա", params={"page": i})
    soup = BeautifulSoup(response.content, "lxml")
    words.extend([word.text.strip() for word in soup.find_all('a', "word-href")])
words
An alternative is to use a while loop - the example starts at page 207 just to show that it stops when there is no next page, but you can change that if you like:
from bs4 import BeautifulSoup
import requests

url = "https://bararanonline.com/letter/%D5%A1?page=207"
words = []

while True:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "lxml")
    words.extend([word.text.strip() for word in soup.find_all('a', "word-href")])

    if a := soup.select_one('a[rel="next"]'):
        url = a['href']
    else:
        break
words
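Note that the a := soup.select_one(...) walrus assignment requires Python 3.8+; on older versions, assign the result to a variable on its own line first and then test it.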
I have built a list which contains hrefs from a website and I want to randomly select one of these links. How can I do that?
from bs4 import BeautifulSoup
import urllib.request
import requests
import re
import random

url = "https://www.formula1.com/en/latest.html"
articles = []
respone = urllib.request.urlopen(url)
soup = BeautifulSoup(respone, 'lxml')

def getItems():
    for a in soup.findAll('a', attrs={'href': re.compile("/en/latest/article.")}):
        articles = a['href']
        x = random.choice(articles)
        print(x)
That code runs, but it only picks a random character from an href string rather than a random link from the full list.
You're very close to the answer. You just need to do this:
from bs4 import BeautifulSoup
import urllib.request
import requests
import re
import random

url = "https://www.formula1.com/en/latest.html"
articles = []
respone = urllib.request.urlopen(url)
soup = BeautifulSoup(respone, 'lxml')

def getItems():
    for a in soup.findAll('a', attrs={'href': re.compile("/en/latest/article.")}):
        articles.append(a['href'])
    x = random.choice(articles)
    print(x)

getItems()
The changes are:
We add each article to the articles array.
The random choice is now done after the loop, rather than inside the loop.
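One caveat worth knowing: random.choice raises an IndexError on an empty sequence, so if no hrefs match the regex the call fails. A small guard (the helper name is just for illustration):

import random

def pick_random(items):
    # random.choice raises IndexError on an empty sequence, so check first.
    if not items:
        return None
    return random.choice(items)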
I want to check if any of the excluded sites show up. I can get it to work with just one site, but as soon as I make it a list, it errors at if donts in thingy:
TypeError: 'in <string>' requires string as left operand, not tuple
This is my code:
import requests
from bs4 import BeautifulSoup
from lxml import html, etree
import sys
import re

url = ("http://stackoverflow.com")
donts = ('stackoverflow.com', 'stackexchange.com')

r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.select('a[href*="http"]'):
    thingy = (link.get('href'))
    thingy = str(thingy)
    if donts in thingy:
        pass
    else:
        print(thingy)
import requests
from bs4 import BeautifulSoup
from lxml import html, etree
import sys
import re

url = ("http://stackoverflow.com")
donts = ('stackoverflow.com', 'stackexchange.com')

r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.select('a[href*="http"]'):
    thingy = (link.get('href'))
    thingy = str(thingy)
    if thingy in donts:
        print(thingy)
    else:
        pass
The test should be string in tuple (is this href one of the donts entries), not tuple in string.
The crux of your problem is how you're searching your excluded list:
excluded = ("a", "b", "c")
links = ["a", "d", "e"]

for site in links:
    if site not in excluded:  # We want to know if the site is in the excluded list
        print(f"Site not excluded: {site}")
Reverse the order of your elements and this should work fine. I've inverted your logic here so you can skip the unnecessary pass.
As a side note, this is one reason clear variable names can help - they will help you reason about what the logic should be doing. Especially in Python where ergonomics like in exist, this is very useful.
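Another option, since the excluded names appear as substrings inside longer hrefs, is to substring-match each href against every entry with any():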
import requests
from bs4 import BeautifulSoup
from lxml import html, etree
import sys
import re

url = ("http://stackoverflow.com")
donts = ('stackoverflow.com', 'stackexchange.com')

r = requests.get(url, timeout=6, verify=True)
soup = BeautifulSoup(r.content, 'html.parser')

for link in soup.select('a[href*="http"]'):
    thingy = (link.get('href'))
    thingy = str(thingy)
    if any(d in thingy for d in donts):
        pass
    else:
        print(thingy)
I'm trying to collect block data, which forms a small table, from a webpage. Please see my code below.
import requests
import re
import json
import sys
import os
import time
from lxml import html, etree
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.investing.com/instruments/OptionsDataAjax'
params = {'pair_id': 525,        # SPX
          'date': 1536555600,    # 2018-9-4
          'strike': 'all',       # all prices
          'callspots': 'calls',  # 'call_andputs',
          'type': 'analysis',    # webpage viewer
          'bringData': 'true',
          }
headers = {'User-Agent': 'Chrome/39.0.2171.95 Safari/537.36'}

def R(text, end='\n'): print('\033[0;31m{}\033[0m'.format(text), end=end)
def G(text, end='\n'): print('\033[0;32m{}\033[0m'.format(text), end=end)

page = requests.get(url, params=params, headers=headers)
if page.status_code != 200:
    R('ERROR CODE:{}'.format(page.status_code))
    G('Problem in connection!')
    sys.exit()
else:
    G('OK')

soup = BeautifulSoup(page.content, 'lxml')
spdata = json.loads(soup.text)
print(spdata['data'])
This result, spdata['data'], gives me a str; I just want to extract the following blocks from that str. There are many such data blocks in it, all with the same format.
SymbolSPY180910C00250000
Delta0.9656
Imp Vol0.2431
Bid33.26
Gamma0.0039
Theoretical33.06
Ask33.41
Theta-0.0381
Intrinsic Value33.13
Volume0
Vega0.0617
Time Value-33.13
Open Interest0
Rho0.1969
Delta / Theta-25.3172
I use json and BeautifulSoup here; maybe a regular expression would help, but I don't know much about re. Any approach to get the result is appreciated. Thanks.
Add this after your code:
regex = r"((SymbolSPY[1-9]*):?\s*)(.*?)\n[^\S\n]*\n[^\S\n]*"
for match in re.finditer(regex, spdata['data'], re.MULTILINE | re.DOTALL):
for line in match.group().splitlines():
print (line.strip())
Outputs
OK
SymbolSPY180910C00245000
Delta0.9682
Imp Vol0.2779
Bid38.26
Gamma0.0032
Theoretical38.05
Ask38.42
Theta-0.0397
Intrinsic Value38.13
Volume0
Vega0.0579
Time Value-38.13
Open Interest0
Rho0.1934
Delta / Theta-24.3966
SymbolSPY180910P00245000
Delta-0.0262
Imp Vol0.2652
...
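If you'd rather work with the fields as structured data, here's a minimal sketch (the helper name and the split rule are assumptions based only on the field layout shown above) that turns one printed block into a dict by splitting each line roughly at the first digit or minus sign:

import re

def block_to_dict(block: str) -> dict:
    out = {}
    for line in block.splitlines():
        # Split lines like "Delta0.9656" or "Delta / Theta-25.3172" into a
        # name part (letters, spaces, slashes) and the numeric value part.
        m = re.match(r"([A-Za-z /]+?)(-?\d.*)", line.strip())
        if m:
            out[m.group(1).strip()] = m.group(2)
    return out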