I need to parse
http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0
If you view the source of the above URL, you will find the values I am trying to extract.
Expected Output:
fvRequests= css
fvRequests=7
import re
import urllib2

if __name__ == "__main__":
    url = 'http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0'

    # http request
    response = urllib2.urlopen(url)
    html = response.read()
    response.close()

    # finding values in html
    results = re.findall(r'fvRequests\.setValue\(\d+, \d+, \'?(.*?)\'?\);', html)
    keys = results[::2]
    values = results[1::2]

    # creating a dictionary
    output = dict(zip(keys, values))
    print output
The idea is to locate the script with BeautifulSoup and use a regular expression pattern to find the fvRequests.setValue() calls and extract the value of the third argument:
import re
from bs4 import BeautifulSoup
import requests
pattern = re.compile(r"fvRequests\.setValue\(\d+, \d+, '?(\w+)'?\);")
response = requests.get("http://www.webpagetest.org/breakdown.php?test=150325_34_0f581da87c16d5aac4ecb7cd07cda921&run=2&cached=0")
soup = BeautifulSoup(response.content, "html.parser")
script = soup.find("script", text=lambda x: x and "fvRequests.setValue" in x).text
print(re.findall(pattern, script))
Prints:
[u'css', u'7', u'flash', u'0', u'font', u'0', u'html', u'14', u'image', u'80', u'js', u'35', u'other', u'14']
You can go further and pack the list into a dict (solution taken from here), where data is the flat list returned by re.findall():
dict(zip(*([iter(data)] * 2)))
which would produce:
{
    'image': '80',
    'flash': '0',
    'js': '35',
    'html': '14',
    'font': '0',
    'other': '14',
    'css': '7'
}
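For reference, the iter/zip trick pairs consecutive items because both arguments to zip() are the same iterator, so each pair consumes two items from the list. A minimal sketch (here data stands for the flat list that re.findall() returned above):

data = ['css', '7', 'flash', '0', 'font', '0']

# Both zip() arguments are the SAME iterator, so every pair
# pulled by zip() consumes two consecutive items.
it = iter(data)
paired = dict(zip(it, it))  # same effect as dict(zip(*([iter(data)] * 2)))

print(paired)  # {'css': '7', 'flash': '0', 'font': '0'}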
I'm trying to make a testing project that scrapes info from a specific site, but with no success.
I followed some tutorials I found, and even a post on Stack Overflow. After all this, I'm stuck!
Help me, stepbrothers; I'm a new Python programmer and I can't move forward with my project.
More info: this is a lottery website that I was trying to scrape to do some analysis and get a lucky number.
I have followed these tutorials:
https://towardsdatascience.com/how-to-collect-data-from-any-website-cb8fad9e9ec5
https://beautiful-soup-4.readthedocs.io/en/latest/
Using BeautifulSoup in order to find all "ul" and "li" elements
You all have my gratitude!
from bs4 import BeautifulSoup as bs
import requests
import html5lib
#import urllib3  # another attempt to make another request to the url ------ failed

url = '''https://loterias.caixa.gov.br/Paginas/Mega-Sena.aspx'''

# another try to take results from the <ul>, but I get no qualified results == None
def parse_ul(elem):  # https://stackoverflow.com/questions/50338108/using-beautifulsoup-in-order-to-find-all-ul-and-li-elements
    result = {}
    for sub in elem.find_all('li', recursive=False):
        if sub.li is None:
            continue
        data = {k: v for k, v in sub.attrs.items()}
        if sub.ul is not None:
            # recurse down
            data['children'] = parse_ul(sub.ul)
        result[sub.li.get_text(strip=True)] = data
    return result

page = requests.get(url)  # taking info from the website
print(page.encoding)  # == UTF-8
soup = bs(page.content, features="lxml")  # parses all info from the url == BeautifulSoup
numbers = soup.find(id='ulDezenas')  # searches the content for this specific id // another try: soup.find('ul', {'class': ''})
result = parse_ul(soup)  # try to parse the info, but none is found EVEN WITH THE ORIGINAL ONE
print(numbers)  # The result is below:
'''<ul class="numbers megasena" id="ulDezenas">
<li ng-repeat="dezena in resultado.listaDezenas ">{{dezena.length > 2 ? dezena.slice(1) : dezena}}</li>
</ul>'''
print(result)  # == "{}" nothing found

#with open('''D:\Documents\python\_abretesesame.txt''', 'wb') as fd:
#    for chunk in page.iter_content(chunk_size=128):
#        fd.write(chunk)
# ======= printing the document (HTML) to a file; still no success in getting the numbers
The main issue is that the content is rendered dynamically by JavaScript, but you can get the information via another URL:
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
This will give you the following JSON:
{'tipoJogo': 'MEGA_SENA', 'numero': 2468, 'nomeMunicipioUFSorteio': 'SÃO PAULO, SP', 'dataApuracao': '02/04/2022', 'valorArrecadado': 158184963.0, 'valorEstimadoProximoConcurso': 3000000.0, 'valorAcumuladoProximoConcurso': 0.0, 'valorAcumuladoConcursoEspecial': 36771176.89, 'valorAcumuladoConcurso_0_5': 33463457.98, 'acumulado': False, 'indicadorConcursoEspecial': 1, 'dezenasSorteadasOrdemSorteio': ['022', '041', '053', '042', '035', '057'], 'listaResultadoEquipeEsportiva': None, 'numeroJogo': 2, 'nomeTimeCoracaoMesSorte': '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'tipoPublicacao': 3, 'observacao': '', 'localSorteio': 'ESPAÇO DA SORTE', 'dataProximoConcurso': '06/04/2022', 'numeroConcursoAnterior': 2467, 'numeroConcursoProximo': 2469, 'valorTotalPremioFaixaUm': 0.0, 'numeroConcursoFinal_0_5': 2470, 'listaDezenas': ['022', '035', '041', '042', '053', '057'], 'listaDezenasSegundoSorteio': None, 'listaMunicipioUFGanhadores': [{'posicao': 1, 'ganhadores': 1, 'municipio': 'SANTOS', 'uf': 'SP', 'nomeFatansiaUL': '', 'serie': ''}], 'listaRateioPremio': [{'faixa': 1, 'numeroDeGanhadores': 1, 'valorPremio': 122627171.8, 'descricaoFaixa': '6 acertos'}, {'faixa': 2, 'numeroDeGanhadores': 267, 'valorPremio': 34158.18, 'descricaoFaixa': '5 acertos'}, {'faixa': 3, 'numeroDeGanhadores': 20734, 'valorPremio': 628.38, 'descricaoFaixa': '4 acertos'}], 'id': None, 'ultimoConcurso': True, 'exibirDetalhamentoPorCidade': True, 'premiacaoContingencia': None}
Simply extract listaDezenas and process it in a list comprehension:
[n if len(n) < 2 else n[1:] for n in jsonData['listaDezenas']]
Result will be:
['22', '35', '41', '42', '53', '57']
Example
import requests
jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()
print([n if len(n) < 2 else n[1:] for n in jsonData['listaDezenas']])
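As a small variation (not part of the original answer), you could strip the leading zeros numerically instead of slicing, which gives the same result for these values:

import requests

jsonData = requests.get('https://servicebus2.caixa.gov.br/portaldeloterias/api/megasena/').json()

# int() drops the leading zero and str() converts back: '022' -> '22'
print([str(int(n)) for n in jsonData['listaDezenas']])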
I send POST data using the requests library in Python, but I can't get the result of the form. I wonder if it's because the request is too fast.
The action URL of the form is the same page. When I fill in the form manually and submit it, the result appears in a div on the page. But when I use requests in Python, the result div is empty, even though the response status code is 200. What should I do to obtain the result?
My code is below:
import requests
import time
from time import sleep

url = "https://******"
data = {
    'year': '1973',
    'month': '03',
    'name': 'chae'
}
res = requests.post(url, data)
print(res)  # status code 200
print(res.text)
Any advice would be appreciated.
This code gives me the information:
import requests
import time
from time import sleep

url = "https://efine.go.kr/licen/truth/licenTruth.do"
# data = {'checkPage': '1', 'flag': '', 'regYear': '1973', 'regMonth': '03', 'regDate': '01', 'name': '채승완',
#         'licenNo0': '11', 'licenNo1': '91', 'licenNo2': '822161', 'licenNo3': '12'}
data = {
    "checkPage": "2",
    "flag": "searchPage",
    "regYear": "1973",
    "regMonth": "03",
    "regDate": "01",
    "name": "채승완",
    "licenNo0": "11",
    "licenNo1": "91",
    "licenNo2": "822161",
    "licenNo3": "12",
    "ghostNo": "2161",
}
res = requests.post(url, data=data)
print(res.text)
# result contains "전산 자료와 일치 합니다.식별번호가 일치하지 않습니다."
# ("It matches the computerized records. The identification number does not match.")
Try using requests.Session() instead of requests; it worked for me.
I've always done it this way. Please check out https://requests.readthedocs.io/en/master/user/advanced/ and let me know if it was helpful.
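For completeness, a minimal sketch of the Session approach (the URL and form fields are taken from the answer above; the preliminary GET is an assumption about what the site needs, since some form pages set cookies before accepting a POST):

import requests

url = "https://efine.go.kr/licen/truth/licenTruth.do"
# the full form fields from the answer above
data = {
    "checkPage": "2",
    "flag": "searchPage",
    "regYear": "1973",
    "regMonth": "03",
    "regDate": "01",
    "name": "채승완",
    "licenNo0": "11",
    "licenNo1": "91",
    "licenNo2": "822161",
    "licenNo3": "12",
    "ghostNo": "2161",
}

with requests.Session() as session:
    # a preliminary GET lets the session pick up any cookies
    # the server sets before it will accept the POST
    session.get(url)
    res = session.post(url, data=data)
    print(res.status_code)
    print(res.text)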
I've created a script in Python using the requests module to get the titles of the different items populated upon initiating a search on duckduckgo.com. My search keyword is cricket. My script parses the titles from the first page flawlessly.
Website address
I'm having trouble parsing the titles from the next pages, as two of the params fields increase weirdly, as in 's': '0' and 'dc': '-27'. The rest of the fields are static.
To parse titles from the first page, I tried like below (working):
import requests
from bs4 import BeautifulSoup

URL = "https://duckduckgo.com/html/"

params = {
    'q': 'python',
    's': '0',
    'nextParams': '',
    'v': 'l',
    'o': 'json',
    'dc': '-27',
    'api': 'd.js',
    'kl': 'us-en'
}

resp = requests.post(URL, data=params, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "lxml")
for title in soup.select(".result__body .result__a"):
    print(title.text)
Those two fields of the params increase like below:
1st page:
's': '0'
'dc': '-27'
2nd page:
's': '30'
'dc': '27'
Third page:
's': '80'
'dc': '76'
Fourth page:
's': '130'
'dc': '126'
How can I scrape the titles from the next pages as well?
The params for the next page are held as hidden form fields in the POST response each time:
import requests
from bs4 import BeautifulSoup

URL = "https://duckduckgo.com/html/"

params = {
    'q': 'python',
    's': '0',
    'nextParams': '',
    'v': 'l',
    'o': 'json',
    'dc': '0',
    'api': 'd.js',
    'kl': 'us-en'
}

with requests.Session() as s:
    while True:
        resp = s.post(URL, data=params, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(resp.text, "lxml")
        for title in soup.select(".result__body .result__a"):
            print(title.text)
        for i in soup.select('form:not(.header__form) [type=hidden]'):  # update params based on response
            params[i['name']] = i['value']
        if not soup.select_one('[value=Next]'):
            break
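If you run this unattended, it may be worth capping the number of pages and pausing between requests. A sketch of that variation (the page limit and delay are arbitrary choices, not part of the original answer):

import time

import requests
from bs4 import BeautifulSoup

URL = "https://duckduckgo.com/html/"
params = {
    'q': 'python', 's': '0', 'nextParams': '', 'v': 'l',
    'o': 'json', 'dc': '0', 'api': 'd.js', 'kl': 'us-en'
}
MAX_PAGES = 3  # arbitrary cap for this sketch

with requests.Session() as s:
    for _ in range(MAX_PAGES):
        resp = s.post(URL, data=params, headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(resp.text, "lxml")
        for title in soup.select(".result__body .result__a"):
            print(title.text)
        if not soup.select_one('[value=Next]'):
            break
        # carry the hidden form fields into the next request
        for i in soup.select('form:not(.header__form) [type=hidden]'):
            params[i['name']] = i['value']
        time.sleep(1)  # be polite between requests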
I am new to Python, Scrapy and JSON. I am trying to scrape the JSON response from the URL shown in the code below, but it is showing an error. The code I used is:
import re
import json

import scrapy


class BlackSpider(scrapy.Spider):
    name = 'black'
    start_urls = ['https://appworld.blackberry.com/cas/content/2360/reviews/2.17.2?page=1&pagesize=100&sortby=newest&callback=_content_2360_reviews_2_17_2&_=1499161778751']

    def parse(self, response):
        # strip the JSONP callback wrapper before decoding
        data = re.findall('(\{.+\})\);', response.body_as_unicode())
        a = json.loads(data[0])
        item = MyItem()  # assumes MyItem is defined in the project's items.py
        item["Reviews"] = a["reviews"][4]["review"]
        return item
The error it is showing is:
ValueError: No JSON object could be decoded
The response you are getting is a JavaScript function call with some JSON inside it:
_content_2360_reviews_2_17_2(\r\n{"some":"json"}]});\r\n
To extract the data from this you can use simple regex solution:
import re
import json
data = re.findall('(\{.+\})\);', response.body_as_unicode())
json.loads(data[0])
It translates to: select everything between { and } that ends with );
Edit: the results I'm getting with this:
{'platform': None,
 'reviews': [{'createdDate': '2017-07-04',
              'model': 'London',
              'nickname': 'aravind14-92362',
              'rating': 6,
              'review': 'Very bad ',
              'title': 'My WhatsApp no update '}],
 'totalReviews': 569909,
 'version': '2.17.2'}
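If you'd rather not depend on the exact trailing ); in a regex, an alternative (a sketch, not part of the original answer) is to slice between the first ( and the last ) of the JSONP wrapper:

import json

# An illustrative JSONP payload: callback_name({...json...});
body = '_content_2360_reviews_2_17_2({"reviews": [{"review": "Very bad"}]});'

# Take everything between the first '(' and the last ')'.
start = body.index('(') + 1
end = body.rindex(')')
data = json.loads(body[start:end])

print(data["reviews"][0]["review"])  # Very bad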
I am having difficulty parsing the XML file below using lxml:
>>> _file = "qv.xml"
file content:
<document reference="suspicious-document00500.txt">
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="128" this_length="2503" source_reference="source-document00500.txt" source_offset="138339" source_length="2503"/>
<feature name="plagiarism" type="artificial" obfuscation="none" this_offset="8593" this_length="1582" source_reference="source-document00500.txt" source_offset="49473" source_length="1582"/>
</document>
Here is my attempt:
>>> from lxml.etree import XMLParser, parse
>>> parsefile = parse(_file)
>>> print parsefile
Output: <lxml.etree._ElementTree object at 0x000000000642E788>
The output is the memory location of the lxml object, while I am after the actual file content, i.e.:
Desired output = {'document reference': 'suspicious-document00500.txt', 'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
Any ideas on how to get the desired output? Thanks.
Here's one way of getting the desired outputs:
from lxml import etree

def main():
    doc = etree.parse('qv.xml')
    root = doc.getroot()
    print root.attrib
    for item in root:
        print item.attrib

if __name__ == "__main__":
    main()
Output:
{'reference': 'suspicious-document00500.txt'}
{'this_offset': '128', 'obfuscation': 'none', 'source_length': '2503', 'name': 'plagiarism', 'this_length': '2503', 'source_reference': 'source-document00500.txt', 'source_offset': '138339', 'type': 'artificial'}
{'this_offset': '8593', 'obfuscation': 'none', 'source_length': '1582', 'name': 'plagiarism', 'this_length': '1582', 'source_reference': 'source-document00500.txt', 'source_offset': '49473', 'type': 'artificial'}
It works fine with the contents you gave.
You might want to read this to see how etree represents XML objects.
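If you specifically want the single merged dictionary from your desired output, one way (a sketch building on the answer above) is to combine the root element's attributes with each feature's attributes:

from lxml import etree

doc = etree.parse('qv.xml')
root = doc.getroot()

for feature in root:
    # merge the document-level reference with this feature's attributes
    merged = {'document reference': root.get('reference')}
    merged.update(feature.attrib)
    print(merged)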