I'm using this program, and all the tweets I'm getting look like this (because they are in Arabic):
"text": "\\u0637\\u0627\\u0644\\u0628\\u0629 \\u062c\\u0633\\u0645\\u0647\\u0627 \\u062c\\u0628\\u0627\\u0631 \\u062a\\u062a\\u062e\\u062f \\u0645\\u0646 \\u0627\\u0644\\u0634\\u0627\\u0631\\u0639 \\u0648 \\u062a\\u062a\\u0646\\u0627\\u0643..\\n\\n\\u0633\\u0643\\u0633_\\u0627\\u062c\\u0646\\u0628\\u064a\\n\\u0645
I had a question about it and got the answer here.
The question is: where can I use ensure_ascii=False in the program so that it writes the Arabic tweets correctly? I don't know in which place I need to put it.
You need to modify twitter_search.py. Replace every
json.dump(<something>, fd)
with
json.dump(<something>, fd, ensure_ascii=False)
You'll also need to replace all the <file_descriptor> objects with UTF-8 ones:
import codecs
...
...
fd = codecs.open("/tmp/lol", "w", "utf-8")
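Putting the two changes together, a minimal sketch (the output path and tweet dict are made up for illustration; modern Python's built-in open with an encoding argument replaces codecs.open):

```python
import json

# Hypothetical tweet dict; the key point is the non-ASCII text value.
tweet = {"text": "طالبة"}

# Open the file in text mode with an explicit UTF-8 encoding and
# disable ASCII escaping, so Arabic characters are written as-is.
with open("/tmp/tweet.json", "w", encoding="utf-8") as fd:
    json.dump(tweet, fd, ensure_ascii=False)

# The file now contains the Arabic characters themselves,
# not \uXXXX escape sequences.
with open("/tmp/tweet.json", encoding="utf-8") as fd:
    print(fd.read())
```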
If you're processing the results with Python, another approach would be to unescape the escaped string directly (safe here because the escaped string is pure ASCII):
s = '\\u0637\\u0627\\u0644\\u0628\\u0629...'
print(s.encode('ascii').decode('unicode_escape'))
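Since the \uXXXX escapes come from JSON in the first place, a sketch of an alternative is to let the json module do the unescaping, by wrapping the escaped text in quotes so it parses as a JSON string literal:

```python
import json

# Escaped text exactly as it appears in the raw JSON output.
s = '\\u0637\\u0627\\u0644\\u0628\\u0629'

# json.loads decodes the \uXXXX sequences back to real characters.
print(json.loads('"' + s + '"'))
```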
Problem
I'm running a Dataflow job with these steps: reading a txt file from Cloud Storage with Dataflow/Beam's apache_beam.io.textio.ReadFromText(), which uses StrUtf8Coder (UTF-8) by default, and then loading it into Postgres using StringIteratorIO with copy_from.
Data comes from the PCollection element by element; some elements look like this:
line = "some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
After that, I need to load it into Postgres (the delimiter here is "|"). The problem is these kinds of elements: Postgres tries to decode them, and I get 'invalid byte sequence for encoding "UTF8"':
from F\226 we end up with F\x96
This backslash is not visible, so I cannot just replace it like this:
line.replace("\\", "\\\\")
Using Python 3.8.
I have tried repr() and encode("unicode_escape").decode().
Also, every line has different elements, so the next one could contain, say, r\456.
I can catch and change it with a regex only if I use a raw string, but I'm not sure how to treat a regular string as raw once it is already in a variable.
import re
line = r"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
updated = re.sub(r"([a-zA-Z])\\(\d*)", r"\1\\\\\2", line)
print(updated)
$ some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|
Goal
Add an extra backslash wherever a backslash is followed by something, so the line needs to look like this:
line = "some information|more information S\\\\\H, F\\226|DIST|local|app\\\\\\lock\\|"
Thanks for any help!
If you're able to read the file in binary or select the encoding, you could get a better starting point. This is how to do it in binary:
>>> line = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"
>>> line.decode('cp1252')
'some information|more information S\\\\H, F–|DIST|local|app\\\\\\lock\\|'
This is how to decode the whole file:
f = open('file.txt', encoding='cp1252')
f.read()
The encoding CP-1252 (Windows-1252) is Microsoft's legacy Latin-1-style encoding.
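Applying that to the problematic element from the question: once the bytes are decoded as cp1252, the 0x96 byte becomes a real character and the string re-encodes cleanly as UTF-8 before the copy_from step. A sketch of just the decode/encode step:

```python
# The raw bytes as they come off the file, with the stray 0x96 byte
# (\226 is the octal escape for 0x96).
raw = b"some information|more information S\\\\H, F\226|DIST|local|app\\\\\lock\|"

# cp1252 maps 0x96 to the en dash, giving a valid str...
text = raw.decode("cp1252")

# ...which can then be encoded as real UTF-8 before copy_from,
# so Postgres never sees an invalid byte sequence.
utf8_bytes = text.encode("utf-8")
print(text)
```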
I have spent the whole day looking for a way to display Arabic letters with Scrapy, and nothing has worked for me! I am scraping an Arabic website, but I am not getting the Arabic text in the right format.
Here is what I actually get when I save the results to a CSV file:
"بطل ليÙربول القديم" يرد على أنصار "الريدز"
Here is my function:
def parse_details(self, response):
    vars = ArticlesItem()
    vars["title"] = response.css("h1.sna_content_heading::text").extract_first().strip()
    vars["article_summary"] = response.css("span.article-summary").extract_first().strip()
    vars["article_content"] = [i.strip() for i in response.css("div.article-body p::text").extract()]
    vars["tags"] = [i.strip() for i in response.css("div.article-tags h2.tags::text").extract()]
    yield vars
I tried adding encode("utf-8"), but I am still not getting the right format:
vars["title"] = ...extract_first().strip().encode("utf-8")
Instead, I am getting something like this:
b'\xd8\xa8\xd8\xb1\xd9\x82\xd9\x85 "\xd9\x85\xd8\xb0\xd9\x87'
b'\xd9\x84".. \xd8\xa8\xd9\x86\xd8\xb2\xd9\x8a\xd9\x85\xd8\xa9 \xd9'
b'\x8a\xd8\xaa\xd9\x81\xd9\x88\xd9\x82 \xd8\xb9\xd9\x84\xd9\x89'
b' \xd9\x85\xd9\x8a\xd8\xb3\xd9\x8a \xd9\x88\xd8\xb1\xd9\x88'
b'\xd9\x86\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88 \xd9\x88\xd8\xb5\xd9'
b'\x84\xd8\xa7\xd8\xad'
If I take the data you report in your question, and assign it to a variable, like this:
>>> a = (b'\xd8\xa8\xd8\xb1\xd9\x82\xd9\x85 "\xd9\x85\xd8\xb0\xd9\x87'
b'\xd9\x84".. \xd8\xa8\xd9\x86\xd8\xb2\xd9\x8a\xd9\x85\xd8\xa9 \xd9'
b'\x8a\xd8\xaa\xd9\x81\xd9\x88\xd9\x82 \xd8\xb9\xd9\x84\xd9\x89'
b' \xd9\x85\xd9\x8a\xd8\xb3\xd9\x8a \xd9\x88\xd8\xb1\xd9\x88'
b'\xd9\x86\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88 \xd9\x88\xd8\xb5\xd9'
b'\x84\xd8\xa7\xd8\xad')
and I then decode these bytes on the (reasonable) assumption that they are UTF-8:
>>> a.decode()
'برقم "مذهل".. بنزيمة يتفوق على ميسي ورونالدو وصلاح'
it seems to me that you are getting back exactly the data you expect, just not presented in the way you expect it.
Since @Gallaecio wanted me to write an answer to my own question, here is what I did:
1- Open an empty Excel sheet.
2- Go to Data.
3- Choose From Text/CSV.
4- Under File Origin, change it from 1252: Western European (Windows) to 65001: Unicode (UTF-8); now I can read the Arabic text!
5- Load!
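Alternatively, the manual import step can be avoided by writing the CSV with a UTF-8 byte-order mark, which Excel uses to detect the encoding automatically. In Scrapy this can be done with the FEED_EXPORT_ENCODING = "utf-8-sig" setting; a plain-Python sketch of the same idea (the file name is made up, and the sample row is the decoded title from the answer above):

```python
import csv

# 'utf-8-sig' writes a UTF-8 byte-order mark first, which Excel uses
# to open the file as UTF-8 instead of assuming 1252 Western European.
with open("articles.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["title"])
    writer.writerow(['برقم "مذهل".. بنزيمة يتفوق على ميسي ورونالدو وصلاح'])
```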
I know there are a lot of encoding/decoding topics here, and I have tried for hours, but I'm still not able to solve it. Hence I want to raise a question:
I have a string of hexadecimal values which, in the end, is text that I want to write to a text file with the correct encoding.
hexvalues = "476572E2A47465646574656B746F72"
In the end, the (German) result should be "Gerätedetektor".
Currently I am using binascii.unhexlify() for the decoding, but it still doesn't show the "ä" as it's supposed to; instead I get:
>> result = binascii.unhexlify(hexvalues)
Gerâ¤tedetektor
I tried result.decode("utf-8") and a lot of other things, but either the script crashes or it still doesn't return what I want to see.
In the end, I want to write the word in the correct form to a file.
Any help would be highly appreciated!
Edit:
As I wrote before, I tried many things, so it is hard to give THE one piece of code I'm using, but here is an excerpt from the current version:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import binascii
resultfile = "text_GER.txt"
fpx = open(resultfile, 'wb')
hexvalues = "476572E2A47465646574656B746F72"
result= binascii.unhexlify(hexvalues )
result= result.decode("utf-8")
print(result)
fpx.write(result)
This one makes the script crash, with no further indication of why.
If I skip
result = result.decode("utf-8")
then the print of result looks like this:
b'Ger\xe2\xa4tedetektor'
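For reference, a short diagnostic sketch of what is happening here: the UTF-8 decode crashes because the byte sequence E2 A4 74 is not valid UTF-8, while in Latin-1 those two bytes are "â¤", which is exactly the garbled output shown. An actual "ä" would be E4 in Latin-1 or C3 A4 in UTF-8, which suggests the hex data itself is already corrupted at the source:

```python
import binascii

hexvalues = "476572E2A47465646574656B746F72"
raw = binascii.unhexlify(hexvalues)

# UTF-8 rejects the sequence: E2 starts a 3-byte character, but the
# following byte 0x74 ('t') is not a valid continuation byte.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 decode fails:", e)

# Latin-1 maps E2 -> 'â' and A4 -> '¤', reproducing the garbage:
print(raw.decode("latin-1"))  # Gerâ¤tedetektor

# "Gerätedetektor" would encode to E4 (Latin-1) or C3 A4 (UTF-8) instead:
print("Gerätedetektor".encode("latin-1").hex().upper())
```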
I have a Python script as a Subversion pre-commit hook, and I am running into problems with UTF-8-encoded text in the commit messages. For example, if the input character is "å", the output is "?\195?\165". What would be the easiest way to replace those character parts with the corresponding byte values? A regexp doesn't work, as I need to process each element and then merge them back together.
code sample:
infoCmd = ["/usr/bin/svnlook", "info", sys.argv[1], "-t", sys.argv[2]]
info = subprocess.Popen(infoCmd, stdout=subprocess.PIPE).communicate()[0]
info = info.replace("?\\195?\\166", "æ")
I do the same thing in my code, and you should be able to use:
...
u_changed_path = unicode(changed_path, 'utf-8')
...
When using the approach above, I've only run into issues with characters like line feeds. If you post some code, it could help.
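The snippet above is Python 2 (unicode() no longer exists in Python 3). In Python 3 the same idea is to decode svnlook's byte output once, as UTF-8, instead of patching "?\195?\165" sequences afterwards. A runnable sketch, using a stand-in subprocess in place of the svnlook call:

```python
import subprocess
import sys

# Stand-in for the svnlook call in the question: any subprocess
# that writes UTF-8 bytes to stdout.
cmd = [sys.executable, "-c", r"import sys; sys.stdout.buffer.write(b'\xc3\xa5')"]

# communicate() returns bytes; decoding them once as UTF-8 yields a
# str in which "å" is a single character, not "?\195?\165".
raw = subprocess.Popen(cmd, stdout=subprocess.PIPE).communicate()[0]
info = raw.decode("utf-8")
print(info)  # å
```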
OK, so I am trying to write a Python script for XChat that will let me type "/hookcommand filename" and then print that file line by line into my IRC buffer.
EDIT: Here is what I have now
__module_name__ = "scroll.py"
__module_version__ = "1.0"
__module_description__ = "script to scroll contents of txt file on irc"

import xchat, random, os, glob, string

def gg(ascii):
    ascii = glob.glob("F:\irc\as\*.txt")
    for textfile in ascii:
        f = open(textfile, 'r')

def gg_cb(word, word_eol, userdata):
    ascii = gg(word[0])
    xchat.command("msg %s %s" % (xchat.get_info('channel'), ascii))
    return xchat.EAT_ALL

xchat.hook_command("gg", gg_cb, help="/gg filename to use")
Well, your first problem is that you're referring to a variable ascii before you define it:
ascii = gg(ascii)
Try making that:
ascii = gg(word[0])
Next, you're opening each file returned by glob... only to do absolutely nothing with them. I'm not going to give you the code for this: please try to work out what it's doing or not doing for yourself. One tip: the xchat interface is an extra complication. Try to get it working in plain Python first, then connect it to xchat.
There may well be other problems - I don't know the xchat api.
When you say "not working", try to specify exactly how it's not working. Is there an error message? Does it do the wrong thing? What have you tried?
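Following the tip above about getting it working in plain Python first, one possible sketch of a gg() that actually returns the file's contents (the directory default is the path from the question; note the raw string, since "\a" in a normal string literal is a bell character):

```python
import glob
import os

# Plain-Python version of gg(): find <name>.txt in the ascii directory
# and return its lines, so the callback has something to send.
def gg(name, directory=r"F:\irc\as"):
    for textfile in glob.glob(os.path.join(directory, "*.txt")):
        if os.path.splitext(os.path.basename(textfile))[0] == name:
            with open(textfile, "r") as f:
                return [line.rstrip("\n") for line in f]
    return []  # no matching file found
```

Once this works on its own, gg_cb could loop over the returned lines and send each one with xchat.command.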