I have spent the whole day looking for a way to display Arabic letters with Scrapy, and nothing has worked for me! I am scraping an Arabic website, but I am not getting the Arabic text in the right format.
Here is what I actually get when I save the results to a CSV file:
"بطل ليÙربول القديم" يرد على أنصار "الريدز"
Here is my function:
def parse_details(self, response):
    vars = ArticlesItem()
    vars["title"] = response.css("h1.sna_content_heading::text").extract_first().strip()
    vars["article_summary"] = response.css("span.article-summary").extract_first().strip()
    vars["article_content"] = [i.strip() for i in response.css("div.article-body p::text").extract()]
    vars["tags"] = [i.strip() for i in response.css("div.article-tags h2.tags::text").extract()]
    yield vars
I tried adding encode("utf-8"), but I am still not getting the right format:
vars["title"] = ...extract_first().strip().encode("utf-8")
Now I am getting something like this:
b'\xd8\xa8\xd8\xb1\xd9\x82\xd9\x85 "\xd9\x85\xd8\xb0\xd9\x87'
b'\xd9\x84".. \xd8\xa8\xd9\x86\xd8\xb2\xd9\x8a\xd9\x85\xd8\xa9 \xd9'
b'\x8a\xd8\xaa\xd9\x81\xd9\x88\xd9\x82 \xd8\xb9\xd9\x84\xd9\x89'
b' \xd9\x85\xd9\x8a\xd8\xb3\xd9\x8a \xd9\x88\xd8\xb1\xd9\x88'
b'\xd9\x86\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88 \xd9\x88\xd8\xb5\xd9'
b'\x84\xd8\xa7\xd8\xad'
If I take the data you report in your question and assign it to a variable, like this:
>>> a = (b'\xd8\xa8\xd8\xb1\xd9\x82\xd9\x85 "\xd9\x85\xd8\xb0\xd9\x87'
b'\xd9\x84".. \xd8\xa8\xd9\x86\xd8\xb2\xd9\x8a\xd9\x85\xd8\xa9 \xd9'
b'\x8a\xd8\xaa\xd9\x81\xd9\x88\xd9\x82 \xd8\xb9\xd9\x84\xd9\x89'
b' \xd9\x85\xd9\x8a\xd8\xb3\xd9\x8a \xd9\x88\xd8\xb1\xd9\x88'
b'\xd9\x86\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88 \xd9\x88\xd8\xb5\xd9'
b'\x84\xd8\xa7\xd8\xad')
and I then decode these bytes on the (reasonable) assumption that they are UTF-8:
>>> a.decode()
'برقم "مذهل".. بنزيمة يتفوق على ميسي ورونالدو وصلاح'
It seems to me that you are getting back exactly what you were expecting, just not in the form you expected: those are the UTF-8 bytes of the correct Arabic text.
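If the goal is a CSV that opens cleanly in Excel, one likely fix (an assumption on my part, not something shown in the thread) is to tell Scrapy to export feeds as UTF-8 with a byte-order mark, which Excel uses to auto-detect the encoding. A minimal sketch:

```python
# settings.py (sketch): export feeds as UTF-8 with a BOM.
FEED_EXPORT_ENCODING = "utf-8-sig"

# "utf-8-sig" simply prepends the UTF-8 byte-order mark:
data = "بطل ليفربول".encode("utf-8-sig")
print(data[:3])  # b'\xef\xbb\xbf' -- the BOM Excel looks for
```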
Since @Gallaecio wanted me to write an answer to my own question, here is what I did:
1. Open an empty Excel sheet.
2. Go to Data.
3. Choose From Text/CSV.
4. Under File Origin, change it from 1252: Western European (Windows) to 65001: Unicode (UTF-8). Now I can read the Arabic text!
5. Load!
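The steps above are needed because Excel was decoding the UTF-8 bytes with the legacy Windows-1252 codepage. The mojibake from the question can be reproduced in a couple of lines (a sketch):

```python
# Reading UTF-8 bytes as cp1252 produces exactly this kind of mojibake.
original = "بطل"  # "hero" in Arabic
mojibake = original.encode("utf-8").decode("cp1252")
print(mojibake)  # Ø¨Ø·Ù„
```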
Problem
I'm running a Dataflow job with these steps: reading a txt file from Cloud Storage using Dataflow/Beam's apache_beam.io.textio.ReadFromText(), which uses StrUtf8Coder (UTF-8) by default, and then loading the data into Postgres using StringIteratorIO with copy_from.
The data comes out of the PCollection element by element, and some elements look like this:
line = "some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
Then I need to load it into Postgres (the delimiter here is "|"). The problem is elements like this one, because Postgres tries to decode them and I get: 'invalid byte sequence for encoding "UTF8"':
From F\226 we get F\x96.
This backslash is not visible, so I cannot simply replace it like this:
line.replace("\\", "\\\\")
I'm using Python 3.8.
I have tried repr() and encode("unicode_escape").decode().
Also, every line contains different elements, so the next one might have r\456 instead.
I can catch and change it with a regex, but only if I use a raw string; I'm not sure how to treat a regular string as raw once it is already in a variable.
import re
line = r"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
updated = re.sub(r"([a-zA-Z])\\(\d*)", r"\1\\\\\2", line)
print(updated)
$ some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|
Goal
Add an extra backslash whenever a backslash is followed by something, so the line needs to look like this:
line = "some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|"
Thanks for any help!
If you're able to read the file in binary or select the encoding, you could get a better starting point. This is how to do it in binary:
>>> line = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
>>> line.decode('cp1252')
'some information|more information S\\\\H, F–|DIST|local|app\\\\\\lock\\|'
This is how to decode the whole file:
f = open('file.txt', encoding='cp1252')
f.read()
The encoding CP-1252 is Microsoft's legacy Windows superset of Latin-1.
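As a quick sanity check on the byte from the question (a sketch): \226 is octal for 0x96, which cannot stand alone in UTF-8 but maps to an en dash in CP-1252, which is why decoding with that codepage succeeds:

```python
# Octal \226 == 0x96: invalid as a standalone UTF-8 byte,
# but an en dash (U+2013) in CP-1252.
raw = b"F\226"
print(raw.decode("cp1252"))  # F followed by an en dash
```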
In the following code, I am converting an image file into a string depending on the choices of radio buttons:
def convert_now(self):
    self.img_data = ""
    self.img_data_encoded = ""
    file1 = open(self.filedict, 'rb')
    self.img_data = file1.read()
    # RADIO_BUTTONS CHOICES, Convert to: 0-ascii, 1-base64, 2-Hex
    v = self.rvar.get()
    if v == 0:
        self.img_data_encoded = self.img_data
    elif v == 1:
        self.img_data_encoded = base64.b64encode(self.img_data)  # (!)
    elif v == 2:
        self.img_data_encoded = base64.b16encode(self.img_data)  # (!!)
I get the base64 string from the image file using the line marked (!) and save it to a string named "st". Then I get the hex string using the line marked (!!).
The problem is that when I compare the results from the code above (using "st", the base64 string from the code) with the results from this website, https://www.branah.com/ascii-converter, they don't match at all.
Did I code something wrong?
No, it doesn't seem like you did something wrong. Your code looks fine. Also, for me, both b64encode and b16encode produce the same output as the webpage you linked to. Here is an example.
>>> import base64
>>> base64.b64encode(b"test")
b'dGVzdA=='
>>> base64.b16encode(b"test")
b'74657374'
You can compare this to the results on the webpage. They match. So, there must be something else wrong. b64encode and b16encode work fine.
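As an extra local sanity check (a sketch), both encodings round-trip losslessly, so any encoded output can always be decoded back to the exact original bytes:

```python
import base64

payload = bytes(range(256))  # stand-in for an image file's bytes

# Both codecs are lossless: decode(encode(x)) == x for any bytes.
assert base64.b64decode(base64.b64encode(payload)) == payload
assert base64.b16decode(base64.b16encode(payload)) == payload
print("round-trips OK")
```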
I know there are a lot of encoding/decoding topics here, and I have tried for hours, but I'm still not able to solve this, hence my question:
I have a string of hexadecimal values which, in the end, is text that I want to write to a text file with the correct encoding.
hexvalues = "476572E2A47465646574656B746F72"
In the end, the (german) result should be "Gerätedetektor"
Currently I am using binascii.unhexlify() for the decoding, but it still doesn't show the "ä" like it's supposed to; instead, I get:
>>> result = binascii.unhexlify(hexvalues)
Gerâ¤tedetektor
I tried result.decode("utf-8") and many other things, but either the script crashes or it still doesn't return what I want to see.
In the end, I want to write the word in the correct way to a file.
Any help would be highly appreciated!
Edit:
As I wrote before, I tried many things, so it is hard to give THE one piece of code I'm using, but here is an excerpt from the current version:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import binascii

resultfile = "text_GER.txt"
fpx = open(resultfile, 'wb')
hexvalues = "476572E2A47465646574656B746F72"
result = binascii.unhexlify(hexvalues)
result = result.decode("utf-8")
print(result)
fpx.write(result)
This version makes the script crash, with no indication of why. If I skip
result = result.decode("utf-8")
then my print of result looks like this:
b'Ger\xe2\xa4tedetektor'
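For what a decoder sees here (a sketch, assuming the target word really is "Gerätedetektor"): in UTF-8 the "ä" is the two bytes C3 A4, but the hex string contains E2 A4, which opens a three-byte UTF-8 sequence that the following "t" (0x74) cannot complete, hence the crash in decode(). With corrected bytes the same code works:

```python
import binascii

# "ä" in UTF-8 is C3 A4, not the E2 A4 found in the question's hex string.
print("ä".encode("utf-8").hex())  # c3a4

# A corrected hex string (hypothetical) decodes cleanly:
fixed = binascii.unhexlify("476572C3A47465646574656B746F72")
print(fixed.decode("utf-8"))  # Gerätedetektor
```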
I'm using this program, and all the tweets I'm getting look like this (because they are in Arabic):
"text": "\\u0637\\u0627\\u0644\\u0628\\u0629 \\u062c\\u0633\\u0645\\u0647\\u0627 \\u062c\\u0628\\u0627\\u0631 \\u062a\\u062a\\u062e\\u062f \\u0645\\u0646 \\u0627\\u0644\\u0634\\u0627\\u0631\\u0639 \\u0648 \\u062a\\u062a\\u0646\\u0627\\u0643..\\n\\n\\u0633\\u0643\\u0633_\\u0627\\u062c\\u0646\\u0628\\u064a\\n\\u0645
I asked a question about it and got the answer here.
The question is: where in the program should I use ensure_ascii=False so that it writes the Arabic tweets correctly? I don't know where to put it.
You need to modify twitter_search.py.
Replace every
json.dump(<something>, fd)
with
json.dump(<something>, fd, ensure_ascii=False)
You'll also need to replace every file descriptor with a UTF-8 one:
import codecs
...
...
fd = codecs.open("/tmp/lol", "w", "utf-8")
If you're processing the results with Python, another approach is to unescape the ASCII string:
s = '\\u0637\\u0627\\u0644\\u0628\\u0629...'
print(s.encode("utf-8").decode("unicode_escape"))
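Since the escaped text comes from JSON in the first place, another option (a sketch) is to let the json module interpret the escapes, which sidesteps the unicode_escape pitfalls with non-ASCII input:

```python
import json

escaped = '"\\u0637\\u0627\\u0644\\u0628\\u0629"'  # a JSON string literal
print(json.loads(escaped))  # the Arabic word, properly decoded
```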
I am trying to load email messages that I copied into RTF files (as my training data).
I load the directory containing the files using the sklearn function:
sklearn.datasets.load_files
corpus = sklearn.datasets.load_files(<path>, shuffle=False)
When I print corpus.data, the first 6000 characters or so are \x00\x00\x00\x01Bud1\x00\x00\x10\x00\x00\x00\x08. Then the actual message text is displayed, but interleaved with characters such as \cf0 \expnd0\expndtw0\kerning0\nHey,\\ in the middle of the text.
I should mention that some of the text contains German characters as well as English.
What could be the problem here?
Best
OK. The documentation for this function says:
If you leave encoding equal to None, then the content will be made of bytes instead of Unicode, and you will not be able to use most functions in sklearn.feature_extraction.text.
Without knowing the encoding of your files, you might want to try:
sklearn.datasets.load_files(<path>, shuffle=False, encoding='utf-8')
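load_files also takes a decode_error parameter ('strict', 'ignore', or 'replace') alongside encoding, which maps onto Python's own errors argument to bytes.decode. A stdlib sketch of what 'replace' does with bytes that aren't valid UTF-8, like the Bud1 header from the question (which looks like a stray .DS_Store file, though that is an assumption):

```python
# Undecodable bytes become U+FFFD instead of raising -- the same behavior
# you'd get from load_files(..., encoding="utf-8", decode_error="replace").
raw = b"\x00\x00\x00\x01Bud1\xfcHey"
print(raw.decode("utf-8", errors="replace"))
```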