I'm getting some Facebook posts that contain a mixture of English and a non-English language (Khmer, to be exact).
Here's how the non-English text is displayed when I print the data to screen or save it to a file: \u178a\u17c2\u179b\u1787\u17b6\u17a2\u17d2. I would rather have it displayed as ឈឹម បញ្ចពណ៌. (Note: this is not a translation of the previous escape sequence.)
Try this if you want to save the info in a file:
import codecs
string = 'ឈឹម បញ្ចពណ៌'
with codecs.open('yourfile', 'w', encoding='utf-8') as f:
    f.write(string)
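If you are on Python 3, the built-in open() takes the encoding argument directly, so codecs isn't needed:
# Python 3 equivalent of the codecs version above
with open('yourfile', 'w', encoding='utf-8') as f:
    f.write('ឈឹម បញ្ចពណ៌')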
This should be it:
print(u'\u1787\u17b6\u17a2\u17d2') #python3
print u'\u1787\u17b6\u17a2\u17d2' #python2.7
Output: ជាអ្
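The \uXXXX text is just the escaped representation (repr) of the string, which is also what you see when you print a whole list or dict of posts; printing the string itself renders the characters:
# Python 2: repr() shows the escapes, print shows the characters
s = u'\u1787\u17b6\u17a2\u17d2'
print repr(s)  # u'\u1787\u17b6\u17a2\u17d2'
print s        # ជាអ្ (on a console that can render Khmer)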
In PyCharm I added:
(at top) # -*- coding: utf-8 -*-
import sys
import json
reload(sys)  # Python 2 only
sys.setdefaultencoding('utf8')
s = json.dumps(posts['data'], ensure_ascii=False)
json_file.write(s.decode('utf-8'))
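Note that reload(sys) / sys.setdefaultencoding exists only in Python 2 and is generally discouraged. A minimal sketch that avoids the hack, assuming the same posts dict and a hypothetical posts.json output file:
import io
import json
s = json.dumps(posts['data'], ensure_ascii=False)
with io.open('posts.json', 'w', encoding='utf-8') as json_file:
    # dumps may return str or unicode in Python 2; unicode() covers both
    json_file.write(unicode(s))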
My requirement is to read some data from a MySQL database and then write it in JSON format to a file. However, while writing to the file, the Unicode data comes out garbled.
Actual Unicode Name: ぎぎぎは.exe
Name written to file: ã<81>Žã<81>Žã<81>Žã<81>¯.exe
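(For reference, that garbled form is the classic signature of UTF-8 bytes being read back as Latin-1/cp1252: ぎ is the three UTF-8 bytes E3 81 8E, and 0xE3 is ã in Latin-1.)
# -*- coding: utf-8 -*-
# The UTF-8 bytes of ぎ begin with 0xE3, which Latin-1 displays as ã
print(repr(u'ぎ'.encode('utf-8')))  # '\xe3\x81\x8e'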
My database charset is set to utf8. I am opening the connection like below:
MySQLdb.connect(host="XXXXX", user="XXXXX", passwd="XXXX", cursorclass=cursors.SSCursor, charset='utf8', use_unicode=True)
And the outfile is opened as below:
for r in data:
    with open("XX.json", 'w') as out:
        d = {}
        d['name'] = r[0]
        d['type'] = 'Work'
        out.write('%s\n' % json.dumps(d, indent=0, ensure_ascii=False).replace('\n', ''))
This is working, but as mentioned above unicode data is getting garbled.
If I do type(r[0]), it comes back as 'str'.
If your solution involves the codecs.open function with encoding='utf-8', then please help me add decode/encode wherever required; that method needs all data to be unicode.
I am lost in the plethora of solutions available online, but none of them works perfectly for me :(
OS Details: CentOS 6
>>> import sys
>>> sys.getfilesystemencoding()
'UTF-8'
>>>
Try io.open:
# -*- coding: utf8 -*-
import json
import io
with io.open('b.txt', 'at', encoding='utf8') as json_file:
    json_file.write(json.dumps({u'ぎぎぎは': 0}, ensure_ascii=False, encoding='utf8'))
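On Python 3 the same approach works with the built-in open() (which is io.open), but json.dumps no longer accepts an encoding argument there:
# Python 3 version of the same answer
import json
with open('b.txt', 'at', encoding='utf8') as json_file:
    json_file.write(json.dumps({'ぎぎぎは': 0}, ensure_ascii=False))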
It's my first time writing here, so I hope I'm doing everything all right.
I'm using Python 3.5 on Windows 10, and I'm trying to "sync" music from iTunes to my Android device. Basically, I'm reading the iTunes library XML file and getting all the file locations (so I can copy/paste them onto my phone), but I have problems with songs containing foreign characters.
import getpass
import re
import os
from urllib.parse import unquote

user = getpass.getuser()
ITUNES_LIB_PATH = "C:\\Users\\%s\\Music\\Itunes\\iTunes Music Library.xml" % user
ITUNES_SONGS_FILE = "ya.txt"

def write(file, what, newline=True):
    with open(file, 'a', encoding="utf8") as f:
        if not os.path.isfile(what):
            print("Issue locating file %s\n" % what)
        if newline:
            what += "\n"
        f.write(what)

def get_songs(file=ITUNES_LIB_PATH):
    with open(file, 'r', encoding="utf8") as f:
        f = f.read()
        songs_location = re.findall("<key>Location</key><string>file://localhost/(.*?)</string>", f)
        for song in songs_location:
            song = unquote(song.replace("/", '\\'))
            write(ITUNES_SONGS_FILE, song)

get_songs()
Output:
Issue locating file C:\Users\Dymy\Desktop\Media\Norin & Rad - Bird Is The Word.mp3
How should I handle that "&" in the file name?
There are a couple of related issues in your code, e.g., unescaped XML character references and hardcoded character encodings caused by using regular expressions to parse XML. To fix them, use an XML parser such as xml.etree.ElementTree, or use a more specific library such as pyitunes (I haven't tried it).
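For illustration, a minimal sketch with xml.etree.ElementTree, assuming the usual iTunes plist layout in which each <key>Location</key> element is immediately followed by its <string> value; the parser decodes entities such as &amp; for you:
import xml.etree.ElementTree as ET
from urllib.parse import unquote

tree = ET.parse(ITUNES_LIB_PATH)  # same path constant as in the question
elems = list(tree.getroot().iter())
for i, el in enumerate(elems):
    if el.tag == "key" and el.text == "Location":
        # the value element follows its key; &amp; is already decoded here
        print(unquote(elems[i + 1].text))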
I have an HTML file called test.html that contains the single word בדיקה.
I open test.html and print its content using this block of code:
file = open("test.html", "r")
print file.read()
but it prints ??????. Why does this happen and how can I fix it?
BTW, when I open a text file it works fine.
Edit: I'd tried this:
>>> import codecs
>>> f = codecs.open("test.html",'r')
>>> print f.read()
?????
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
print f.read()
Try something like this; without the encoding argument, codecs.open behaves like a plain open.
I encountered this problem today as well. I am using Windows, where the system language is Chinese by default, so someone may encounter this Unicode error similarly. Simply add encoding='utf-8':
with open("test.html", "r", encoding='utf-8') as f:
text= f.read()
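The reason the locale matters: without an explicit encoding, Python 3's open() falls back to the locale's preferred encoding, which you can check yourself:
import locale
# Typically prints 'cp936' on Chinese-locale Windows, which is why
# reading a UTF-8 file there fails without encoding='utf-8'
print(locale.getpreferredencoding())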
You can make use of the following code:
from __future__ import division, unicode_literals
import codecs
from bs4 import BeautifulSoup
f=codecs.open("test.html", 'r', 'utf-8')
document= BeautifulSoup(f.read()).get_text()
print(document)
If you want to delete all the blank lines in between and get all the words as a string (also avoiding special characters and numbers), then also include:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stop = set(stopwords.words('english'))  # stopword list; `stop` was undefined in the original
docwords = word_tokenize(document)
for line in docwords:
    line = line.rstrip()
    if line:
        if re.match("^[A-Za-z]*$", line):
            if line not in stop and len(line) > 1:
                st = st + " " + line
print(st)
*define st as a string initially, like st=""
You can read an HTML page using urllib.
#python 2.x
import urllib
page = urllib.urlopen("your path ").read()
print page
Use codecs.open with the encoding parameter.
import codecs
f = codecs.open("test.html", 'r', 'utf-8')
CODE:
import codecs
path = "D:\\Users\\html\\abc.html"
file = codecs.open(path, "r", "utf-8")  # pass the encoding so read() returns decoded text
file1 = file.read()
You can simply use requests:
import requests
page = requests.get(url)
print(page.text)  # .text is decoded using the encoding the server declares
You can use urllib in Python 3, same as
https://stackoverflow.com/a/27243244/4815313 with a few changes.
#python3
import urllib.request
page = urllib.request.urlopen("/path/").read()
print(page)
About saving in UTF-8
I have a Python script which consumes the Twitter API and saves some tweets in JSON structures.
When it comes to text containing characters like á, é, í, ó or ú, the text isn't saved properly; instead each character is replaced by a pattern like \u00f1:
This is my whole script:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import twitter
import json
from api.search import *
from api.tpt import *
from tweet.tweet import *
from time import *
WORLD_WOE_ID = 23424900
trendingTopics = startTrendsDictionary()
twitterAPI = createAPIConsumer()
print "Twitter API started"
while True:
    mexican_trends = twitterAPI.trends.place(_id=WORLD_WOE_ID)
    print "Current trends fetched"
    with open('../files/trends/trends ' + strftime("%c"), 'w') as trendsFile:
        trendsFile.write(json.dumps(mexican_trends, indent=1))
    print "Current trends at +- {t} saved".format(t=strftime("%c"))
    statuses = harvestTrendingTopicTweets(twitterAPI, trendingTopics, 100, 10)
    print "Harvest done"
    with open('../files/tweets/unclassified ' + strftime("%c"), 'w') as statusesFile:
        statusesFile.write(json.dumps(statuses, indent=1).encode('utf-8'))
    print "File saved"
    print "We're going to wait in order not to fed up Twitter API"
    sleep(2400)
    print "OK, it was enough waiting, here we go again"
I thought that both:
# -*- coding: utf-8 -*-
.encode('utf-8')
would solve it, but they don't.
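For reference, those escapes come from json.dumps itself: its default ensure_ascii=True replaces every non-ASCII character with a \uXXXX sequence no matter how the file is opened. A minimal Python 2 sketch:
# -*- coding: utf-8 -*-
import json
print json.dumps({'char': u'ñ'})                      # {"char": "\u00f1"}
print json.dumps({'char': u'ñ'}, ensure_ascii=False)  # {"char": "ñ"}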
About reading in UTF-8
When it comes to reading, I'm trying:
import json
with open('file', 'r', buffering=1) as f:
    tweetsJSON = json.load(f)
    for category in tweetsJSON:
        for trend in tweetsJSON[category]:
            for t in tweetsJSON[category][trend]:
                print t['text']
                print
In this case, when printed to the console, *I can see all those letters displayed properly.*
So why, when I open the saved files with a text editor (Sublime Text, in my case), don't they look OK?
I have this code:
# -*- coding: utf-8 -*-
import codecs
prefix = u"а"
rus_file = "rus_names.txt"
output = "rus_surnames.txt"
with codecs.open(rus_file, 'r', 'utf-8') as infile:
    with codecs.open(output, 'a', 'utf-8') as outfile:
        for line in infile.readlines():
            outfile.write(line + prefix)
And it gives me something like Chinese text in the output file. Even when I try outfile.write(line) it gives me the same garbage in the output. I just don't get it.
The purpose: I have a huge file of male surnames, and I need to produce the same file with female surnames. In Russian it looks like this: Ivanov - Ivanova | Иванов - Иванова
Try
lastname = str(line+prefix, 'utf-8')
outfile.write(lastname)
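(Since codecs.open already returns decoded unicode lines, there is nothing left for str(..., 'utf-8') to decode here; as the follow-up below shows, the real culprit was the stale output file.)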
So @AndreyAtapin was partially right. I had been appending lines to a file which already contained my previous mistakes with the Chinese-looking characters; even flushing the file didn't help. But when I deleted it and let the script create it once again, it worked! Thanks.