I have a problem with Turkish characters in Python 3.5.
You can see the issue in the pictures. How can I fix this?
My code is below. You can see in the last rows that print(blink1.text) gives the character problem, but print("çÇğĞıİuÜoÖşŞ") has no problem, even though they are all the same characters.
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.ensonhaber.com/son-dakika")
soup = BeautifulSoup(r.text)
for tag in soup.find_all("ul", attrs={"class": "ui-list"}):
    for link1 in tag.find_all('li'):
        for link2 in link1.find_all('a', href=True):
            print("www.ensonhaber.com" + link2['href'])
            print("\n")
            print(link2['title'])
        for link3 in link1.find_all('span', attrs={"class": "spot"}):
            # summary part: print(link3.text)
            print("\n")
        rbodysite = "http://www.ensonhaber.com" + link2['href']
        rbody = requests.get(rbodysite)
        soupbody = BeautifulSoup(rbody.text)
        for btag in soupbody.find_all("article", attrs={"class": ""}):
            for blink1 in btag.find_all("p"):
                print(blink1.text)
print("çÇğĞıİuÜoÖşŞ")
My output :
Hangi Åehirde çekildiÄi bilinmeyen videoda bir çocuk, ailesiyle yolculuk yaparken gördüÄü trafik polisinin üÅüdüÄünü düÅünerek gözyaÅlarına boÄuldu. Trafik polisi, yanına gelen çocuÄu "Ben üÅümüyorum" diyerek teselli etti.
çÇğĞıİuÜoÖşŞ
The problem is almost certainly a wrong code page. Python is codepage agnostic, and neither print nor BeautifulSoup is going to fix it for you.
The site seems to serve all its pages in UTF-8, so I think your terminal uses something else. I don't know which character set has ı, but the locations of the corrupted characters and their values suggest Windows-1254. You need to convert (iconv-style), but you first need to read the meta tag <meta charset= because it won't always be UTF-8. On the other hand, you also need to know your terminal's encoding, and that's harder to get.
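One thing that is easy to rule out on the request side: when the server doesn't declare a charset, requests falls back to ISO-8859-1 for r.text, which mangles Turkish letters before BeautifulSoup ever sees them. A minimal sketch of a safer pattern, letting BeautifulSoup detect the charset from the raw bytes (the terminal encoding is still a separate issue):

import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ensonhaber.com/son-dakika")
# r.content is the raw bytes; BeautifulSoup can pick up the charset from the
# page's <meta charset=...> tag itself instead of trusting r.text's guess.
soup = BeautifulSoup(r.content, "html.parser")
for p in soup.find_all("p"):
    print(p.text)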
I'm trying to download a csv file from here: Link after clicking on "Acesse todos os negócios realizados até o momento", which is in blue next to an image of a cloud with an arrow.
I do know how to solve the problem with selenium, but it's such a heavy library that I'd like to learn other solutions (especially faster ones). My main idea was to use requests, since I think it's the fastest approach.
My code:
import requests
url="https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
r=requests.get(url,allow_redirects=True)
r.text
r.text is a string of 459432 characters and gives the following output (only part of it is shown here):
RGF0YS9Ib3JhIGRhIHVsdGltYSBhdHVhbGl6YWNhbzogMTEvMTEvMjAyMiAxNTo1MDoxNApJbnN0cnVtZW50byBGaW5hbmNlaXJvO0VtaXNzb3I7Q29kaWdvIElGO1F1YW50aWRhZGUgTmVnb2NpYWRhO1ByZWNvIE5lZ29jaW87Vm9sdW1lIEZpbmFuY2Vpcm8gUiQ7VGF4YSBOZWdvY2lvO09yaWdlbSBOZWdvY2lvO0hvcmFyaW8gTmVnb2NpbztEYXRhIE5lZ29jaW87Q29kLiBJZGVudGlmaWNhZG9yIGRvIE5lZ29jaW87Q29kLiBJc2luO0RhdGEgTGlxdWlkYWNhbwpDUkE7RUNPQUdST1NFQztDUkEwMTcwMDJCRTsxMDc7MTMxMyw3MDUxMjAwMDsxNDA1NjYsNDU7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6NDk7MTEvMTEvMjAyMjsjOTc0MzQ0ODY7QlJFQ09BQ1JBMVozOzExLzExLzIwMjIKQ1JJO09QRUFTRUNVUklUSVpBRE9SQTsxOEcwNjg3NTIxOzIzNTsxMjQ4LDY5MTcwNDAwOzI5MzQ0Miw1NTssMDAwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NDtCUlJCUkFDUkk0WTU7MTEvMTEvMjAyMgpERUI7UEVUUk9CUkFTO1BFVFIxNjs3NjsxMTU1LDkzNTgwODAwOzg3ODUxLDEyOzcsMzUwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Nzo0OTsxMS8xMS8yMDIyOyM5NzQzNDQ4NTtCUlBFVFJEQlMwMDE7MTEvMTEvMjAyMgpERUI7VkFMRTtDVlJEQTY7Mjk7MzQsMDAwMDAwMDA7OTg2LDAwOywwMDAwO1ByZS1yZWdpc3RybyAtIFZvaWNlOzE1OjQ3OjA0OzExLzExLzIwMjI7Izk3NDMzOTM2O0JSVkFMRURCUzAyODsxMS8xMS8yMDIyCkRFQjtWQUxFO0NWUkRBNjsyOTszMywzMDAwMDAwMDs5NjUsNzA7LDAwMDA7UHJlLXJlZ2lzdHJvIC0gVm9pY2U7MTU6NDc6MDQ7MTEvMTEvMjAyMjsjOTc0MzM5Mzc7QlJWQUxFREJTMDI4OzExLzExLzIwMjIKREVCO0VRVUFUT1JJQUxUUkFOU01JO0VRVUExMTs1OTsxMDA3LDg0NDQ1MzAwOzU5NDYyLDgyOzcsMDMwMDtQcmUtcmVnaXN0cm8gLSBWb2ljZTsxNTo0Njo0MDsxMS8xMS8yMDIyOyM5NzQzMzYxNDtCUkVR...
I don't know what this string is or what to do with it. Is it encoded? Should I call another function with it? Am I calling the wrong link? Should I just try another approach? Is selenium not that bad?
Any help is appreciated. Thank you!
Extra info:
From devtools I found that it's calling these JavaScript functions:
The onclick is just this: <a href="#" onclick="carregarDownloadArquivo('')">
carregarDownloadArquivo (I don't know JavaScript; I tried to extract only this specific function from the js):
function carregarDownloadArquivo(n) {
    var t = n != null && n != "" ? n : getParameterByName("data") == null ? "" : getParameterByName("data");
    $("#divloadArquivo").show();
    $.ajax({url: "/NegociosRealizados/Registro/DownloadArquivoDiretorio?data=" + t}).then(function (t, i, r) {
        var u;
        if (t != "null" && t != "") {
            var f = convertBase64(t), e = window.navigator.userAgent, o = e.indexOf("MSIE ");
            (n == null || n == "") && (n = retornaDataHoje());
            u = n + "_NEGOCIOSBALCAO.CSV";
            o > 0 || !!navigator.userAgent.match(/Trident.*rv\:11\./) ? DownloadArquivoIE(u, f, r.getAllResponseHeaders()) : DownloadArquivo(u, f, r.getAllResponseHeaders());
        }
        $("#divloadArquivo").hide();
    });
}
Extra functions needed to understand carregarDownloadArquivo:
function DownloadArquivoIE(n, t, i) { var r = new Blob(t, {type: i}); navigator.msSaveBlob(r, n); }
function DownloadArquivo(n, t, i) { var r = new Blob(t, {type: i}); saveAs(r, n); }
function getParameterByName(n, t) {
    t || (t = window.location.href);
    n = n.replace(/[\[\]]/g, "\\$&");
    var r = new RegExp("[?&]" + n + "(=([^&#]*)|&|#|$)"), i = r.exec(t);
    return i ? i[2] ? decodeURIComponent(i[2].replace(/\+/g, " ")) : "" : null;
}
I'm not so sure about the ajax and send calls, and I also don't know how their code is downloading the csv file.
Try to decode the base64-encoded response from the server:
import base64
import requests

url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
# The endpoint returns the CSV payload as one base64-encoded string
t = base64.b64decode(requests.get(url).text)
print(t.decode("utf-8"))
Prints:
Data/Hora da ultima atualizacao: 11/11/2022 16:38:20
Instrumento Financeiro;Emissor;Codigo IF;Quantidade Negociada;Preco Negocio;Volume Financeiro R$;Taxa Negocio;Origem Negocio;Horario Negocio;Data Negocio;Cod. Identificador do Negocio;Cod. Isin;Data Liquidacao
DEB;STAGENEBRA;MSGT23;2200;980,96524400;2158123,54;7,8701;Pre-registro - Voice;16:34:56;11/11/2022;#97441063;BRMSGTDBS050;12/11/2022
CRA;VERTSEC;CRA022006N5;306;972,49346900;297583,00;,0000;Pre-registro - Voice;16:34:49;11/11/2022;#97441055;BRVERTCRA2S9;11/11/2022
CRI;VIRGOSEC;21L0823062;96;1012,81900600;97230,62;,0000;Pre-registro - Voice;16:33:21;11/11/2022;#97441034;BRIMWLCRIAF2;11/11/2022
CRI;VIRGOSEC;21L0823062;356;1012,81900600;360563,57;,0000;Pre-registro - Voice;16:32:55;11/11/2022;#97441028;BRIMWLCRIAF2;11/11/2022
COE;;IT5322K6C2H;10;1000,00000000;10000,00;;Registro;16:31:52;11/11/2022;#2022111113338428;;11/11/2022
...and so on.
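If the goal is an actual .csv file on disk like the one the page's JavaScript saves, here is a minimal sketch; the file name is only an assumption modeled on the n + "_NEGOCIOSBALCAO.CSV" line in carregarDownloadArquivo:

import base64
from datetime import date

import requests

url = "https://bvmf.bmfbovespa.com.br/NegociosRealizados/Registro/DownloadArquivoDiretorio?data="
raw = base64.b64decode(requests.get(url).text)

# Same naming scheme as the site's JS: <date>_NEGOCIOSBALCAO.CSV
filename = date.today().strftime("%d-%m-%Y") + "_NEGOCIOSBALCAO.CSV"
with open(filename, "wb") as f:
    f.write(raw)

From there it is a regular semicolon-separated CSV with decimal commas, so it can be loaded with the csv module or with pandas.read_csv(..., sep=';', decimal=',').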
I am trying to crawl wordreference, but I am not succeeding.
The first problem I have encountered is that a big part of the page is loaded via JavaScript, but that shouldn't be much of a problem because I can see what I need in the source code.
So, for example, I want to extract for a given word, the first two meanings, so in this url: http://www.wordreference.com/es/translation.asp?tranword=crane I need to extract grulla and grúa.
This is my code:
import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
doc = lh.parse(urllib2.urlopen(url))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i
The result is that I get an empty list.
I have tried to crawl it with scrapy too, with no success. I am not sure what is going on; the only way I have been able to crawl it is using curl, but that is sloppy. I want to do it in an elegant way, with Python.
Thank you very much
It looks like you need a User-Agent header to be sent, see Changing user agent on urllib2.urlopen.
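For reference, a minimal sketch of the urllib2 route with an explicit User-Agent (the header value here is just an example string):

import lxml.html as lh
import urllib2

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
# Send a browser-like User-Agent so the site doesn't reject the default urllib2 one
req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
doc = lh.parse(urllib2.urlopen(req))
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print i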
Also, just switching to requests would do the trick (it automatically sends the python-requests/version User Agent by default):
import lxml.html as lh
import requests

url = 'http://www.wordreference.com/es/translation.asp?tranword=crane'
response = requests.get(url)
doc = lh.fromstring(response.content)
trans = doc.xpath('//td[@class="ToWrd"]/text()')
for i in trans:
    print(i)
Prints:
grulla
grúa
plataforma
...
grulla blanca
grulla trompetera
I am using Python 3.x. While using urllib.request to download a webpage, I am getting a lot of \n in between. I am trying to remove them using the methods given in other threads on the forum, but I am not able to do so. I have used the strip() function and the replace() function... but no luck! I am running this code in Eclipse. Here is my code:
import urllib.request

#Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""

raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I am not able to spot the reason for all the \n in the raw_html variable.
Your download_page() function corrupts the html (the str() call); that is why you see \n (two characters: \ and n) in the output. Don't use .replace() or a similar workaround; fix the download_page() function instead:
from urllib.request import urlopen
with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding e.g., to get it from Content-Type http header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See A good way to get the charset/encoding of an HTTP response in Python.
If the server doesn't pass a charset in the Content-Type header, then there are complex rules for figuring out the character encoding of an html5 document, e.g., it may be specified inside the html document itself: <meta charset="utf-8"> (you would need an html parser to get it).
If you read the html correctly then you shouldn't see literal characters \n in the page.
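Putting the two steps together, a minimal sketch of a fixed download_page() that returns text instead of a stringified bytes object (assuming the server's declared charset is correct, with utf-8 as the fallback):

from urllib.request import urlopen

def download_page(url):
    with urlopen(url) as response:
        # Use the charset declared in the Content-Type header, default to utf-8
        encoding = response.headers.get_content_charset('utf-8')
        return response.read().decode(encoding)

raw_html = download_page("http://www.zseries.in")
print(raw_html)  # real line breaks now, no literal \n sequences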
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
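If you do want error handling, a sketch using urlopen (FancyURLopener is deprecated in Python 3 anyway) that targets the specific exception instead of a bare except:

import urllib.request
import urllib.error

def download_page(url):
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8')
    except urllib.error.URLError as exc:
        # Now you can at least see why the download failed
        print("download failed:", exc)
        return ""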
It seems like they are literal \n characters, so I suggest you do this:
raw_html2 = raw_html.replace('\\n', '')
I'd like to convert between CJK character variants in Python 3.3. That is, I need to get 價 (Korean) from 价 (Chinese), and 価 (Japanese) from 價. Is there an external module for that?
Unihan information
The Unihan page about 價 provides a simplified variant (vs. traditional), but doesn't seem to give the Japanese/Korean ones. So...
CJKlib
I would recommend having a look at CJKlib, which has a feature section called Variants stating:
Z-variant forms, which only differ in typeface
[update] Z-variant
Your sample character 價 (U+50F9) doesn't have a z-variant. However 価 (U+4FA1) has a kZVariant to U+50F9 價. This seems weird.
Further reading
Package documentation is available on Python.org/pypi/cjklib ;
Z-variant form definition.
Here is a relatively complete conversion table. You can dump it to json for later use:
import requests
from bs4 import BeautifulSoup as BS
import json

def gen(soup):
    for tr in soup.select('tr'):
        tds = tr.select('td.tdR4')
        if len(tds) == 6:
            yield tds[2].string, tds[3].string

uri = 'http://www.kishugiken.co.jp/cn/code10d.html'
soup = BS(requests.get(uri).content, 'html5lib')
d = {}
for hanzi, kanji in gen(soup):
    a = d.get(hanzi, [])
    a.append(kanji)
    d[hanzi] = a

print(json.dumps(d, indent=4))
The code and its output are in this gist.
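Once the table is dumped to JSON, using it is just a dictionary lookup. A minimal sketch, assuming the dump was saved as variants.json (the file name is hypothetical):

import json

with open('variants.json', encoding='utf-8') as f:
    hanzi_to_kanji = json.load(f)  # each value is a list of Japanese variants

def to_japanese(text):
    # Replace each character with its first listed variant, leave it unchanged otherwise
    return ''.join(hanzi_to_kanji.get(ch, [ch])[0] for ch in text)

print(to_japanese('价'))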
I have a problem with website encoding. I made a program to scrape a website, but I haven't been successful in changing the encoding of the content I read. My code is:
import sys,os,glob,re,datetime,optparse
import urllib2
from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup
#from utility import *
sTargetEncoding = "utf-8"
page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding
ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content
document = BSXPathEvaluator(ucontent)
print "ORIGINAL ENCODING: " + document.originalEncoding
I used an external library (BSXPath, an extension of BeautifulSoup), and document.originalEncoding prints the encoding of the website, not the utf-8 encoding that I tried to convert to.
Does anyone have a suggestion?
Thanks
Well, there is no guarantee that the encoding declared in the HTTP headers is the same as the one specified inside the HTML itself. This can happen either because of misconfiguration on the server side, or because the charset definition inside the HTML is simply wrong. There is really no fully automatic way to detect the right encoding. I suggest checking the HTML manually for the right encoding (e.g. iso-8859-1 vs. utf-8 can be told apart easily) and then hardcoding the encoding manually inside your app.
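A minimal sketch of that hardcoding approach, assuming manual inspection showed the page really is utf-8 (the URL is the placeholder from the question):

import urllib2

page_to_process = "http://www.xxxx.com"
content = urllib2.urlopen(page_to_process).read()

# Encoding chosen by inspecting the page by hand, not taken from the headers
hardcoded_encoding = "utf-8"
ucontent = content.decode(hardcoded_encoding)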