Some issue with Unicode encoding - python

I am trying to open and parse a JSON file using a Python script and write its content into another JSON file after formatting it the way I want. My source JSON file contains the character \" which I want to replace with a blank. I don't have any issue parsing or creating the new file; the only issue is that the character is not getting replaced by a blank. How do I do it? I achieved the same task earlier, but the document had no such character that time.
Here is my code:
doubleQuote = "\""
try:
    destination = open("TodaysHtScrapedItemsOutput.json","w") # open JSON file for output
except IOError:
    pass
with open('TodaysHtScrapedItems.json') as f: # load json file
    data = json.load(f)
    print "file successfully loaded"
for dataobj in data:
    for news in data[cnt]["body"]:
        news = news.encode("utf-8")
        if(news.find(doubleQuote) != -1): # if doublequotes found in first body tag
            # print "found double quote"
            news.replace(doubleQuote,"")
        if(news != ""):
            my_news = my_news + " " + news
    destination.write("{\"body\":"+ "\""+my_news+"\"}"+"\n")
    my_news = ""
    cnt = cnt + 1

Some things to try:
You should read and write the JSON files in binary mode, so "w" becomes "wb" and you open the input with "rb".
You can define your search string as unicode:
doubleQuote = u'"'
You can look up the integer value of a character with:
ord(u'"')
I get 34 as a response. The reverse function is chr(34). Are the double quotes you are looking for the same double quotes the JSON contains? See here for details.
You don't need the if check to test whether news contains the '"'; doing a replace on news is enough, as in the sketch below.
Try these steps and let me know if it still doesn't work.
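For example, a minimal sketch of the last two points, reusing the question's names (data, cnt and the surrounding loop are assumed to be set up as in the question):
doubleQuote = u'"'                         # search string defined as unicode
for news in data[cnt]["body"]:
    news = news.replace(doubleQuote, u"")  # replace unconditionally; no find() check needed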

str.replace doesn't change the original string, so you need to assign the result back to news:
if(news.find(doubleQuote) != -1): # if doublequotes found in first body tag
    # print "found double quote"
    news = news.replace(doubleQuote,"")

Related

How to retrieve the last characters from a remote text file using Python?

I am trying to get the digits from the second last column in this txt file:
url: http://services.swpc.noaa.gov/text/wing-kp.txt
I only need the last value in the second last column, at the very end of the file.
I have tried a few different sample codes in Python 3(.4?).
This code only gets me a specific number of characters starting from the beginning of the file:
# coding: utf-8
import urllib.request

req = urllib.request.Request('http://services.swpc.noaa.gov/text/wing-kp.txt')
with urllib.request.urlopen(req) as response:
    the_page = response.read(100)
print (the_page)
I have tried the .seek function, but it returned a value I could not recognize.
In the following code I first tried to use .seek directly on the web page, but that didn't work, so I then tried to save the file first and read from it, with no/limited success:
# coding: utf-8
import urllib.request

req = urllib.request.Request('http://services.swpc.noaa.gov/text/wing-kp.txt')
with urllib.request.urlopen(req) as response:
    open('data.txt', 'wb').write(urllib.request.urlopen(req).read())
file = open('data.txt', 'rb+')
data = file.seek(-5, 2)
file.close()
print (data)
If you only need the second last value, you could do it like this:
file = open('data.txt' , 'rb+')
data = file.readlines()
file.close()
data = [i for i in str(data[-1]).strip().split(" ") if i != ''][-2]
With file.readlines() we get a list of all the lines, where we can take the last one by indexing with [-1]. Then we simply split on whitespace and build a new list of all non-empty strings, so the second last column ends up as the second last element of the list. This assumes that there is no whitespace inside the values of the last two columns, and it does not work for parsing all columns, since other data like the dates is also separated by whitespace.
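As a shorter variant, str.split() with no argument splits on any run of whitespace and already drops empty strings, so the list comprehension can be avoided (a sketch, assuming the data was saved to data.txt as above):
with open('data.txt') as f:
    value = f.read().splitlines()[-1].split()[-2]  # last line, second last column
print(value)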
Using requests rather than urllib. Assumes that you don't need the file on disk:
import requests
url = "http://services.swpc.noaa.gov/text/wing-kp.txt"
data = [x for x in requests.get(url).content.rstrip().split("\n")[-1].split(" ") if x][-2]
Command line version because why not? :)
$ python -c 'import requests; print [x for x in requests.get("http://services.swpc.noaa.gov/text/wing-kp.txt").content.rstrip().split("\n")[-1].split(" ") if x][-2]'
2.33
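Note that both requests snippets above are written for Python 2 (.content is a str there and print is a statement). A rough Python 3 equivalent uses response.text, which is already decoded:
import requests

url = "http://services.swpc.noaa.gov/text/wing-kp.txt"
text = requests.get(url).text                       # str in Python 3
value = text.rstrip().splitlines()[-1].split()[-2]  # second last column of the last line
print(value)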

Writing Persian text into a text file in a way that can be read back in Python

I have developed a simple program which sends a request to a Persian web server and gets the source code of the main page. Then I convert it to a string, use file.open(new_file, 'w') and paste the string into it.
When I print the string in the Python IDLE I can see the right words in Persian, but the text file created in the directory is written with strings like \xd9\x8a\xd8\xb9\n.
Here is the code:
import urllib.request as ul
import sys
url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()
string = str(resp)
create_file(filename, string) # this function creates a text file on the desktop
I also used:
file.open(new_file , 'w' , encoding = 'utf-8')
string = resp.encode('utf-8')
But nothing changed. Any help would be appreciated.
So look at your code:
>>> resp = ul.urlopen(url).read()
>>> type(resp)
<class 'bytes'>
resp has the type bytes. In the next line you have used:
string = str(resp)
But you forgot to set the encoding. The right command is:
string = str(resp, encoding="utf-8")
Now you get the right string and can write it directly to your file.
Your second solution is also wrong: you must use decode instead of encode.
string = resp.decode('utf-8')
Decode the website content before writing it into the file:
import urllib.request as ul

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()
string = resp.decode()  # decode bytes to str (UTF-8 by default)
f = open("a.txt", 'w')
f.write(string)
f.close()               # make sure the buffer is flushed to disk
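Putting both fixes together, a minimal sketch with explicit encodings and a context manager (assuming the site serves UTF-8, which the successful decode suggests):
import urllib.request as ul

url = 'http://www.uut.ac.ir/'
resp = ul.urlopen(url).read()  # bytes
text = resp.decode('utf-8')    # bytes -> str
with open('a.txt', 'w', encoding='utf-8') as f:
    f.write(text)              # the file now contains readable Persian text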

Writing on text file, accents and special characters not displaying correctly

Here's what I'm doing: I'm web crawling, for my personal use, a website to copy the text and put the chapters of a book into text format, then transform them automatically to PDF with another program and put them in my cloud. Everything is fine until this happens: special characters are not copied correctly; for example, the apostrophe shows up as \xe2\x80\x99 in the text file and the dash shows up as \xe2\x80\x93. I used this (Python 3):
for text in soup.find_all('p'):
    texta = text.text
    f.write(str(str(texta).encode("utf-8")))
    f.write('\n')
Since I had a bug where reading those characters just stopped my program, I encoded everything to UTF-8 and converted it back to a string with Python's str() method.
I will post the whole code in case anyone has a better solution to my problem. Here's the part that crawls the website from page 1 to max_pages; you can modify it on line 21 to get more or fewer chapters of the book:
import requests
from bs4 import BeautifulSoup

def crawl_ATG(max_pages):
    page = 1
    while page <= max_pages:
        x = page
        url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
        source = requests.get(url)
        chapter = source.content
        soup = BeautifulSoup(chapter.decode('utf-8', 'ignore'), 'html.parser')
        f = open('atg_chapter' + str(x) + '.txt', 'w+')
        for text in soup.find_all('p'):
            texta = text.text
            f.write(str(str(texta).encode("utf-8")))
            f.write('\n')
        f.close
        page += 1

crawl_ATG(10)
I will clean up the useless first lines that get copied later, once I have a solution to this problem. Thank you.
The easiest way I found to fix this problem is adding encoding="utf-8" to the open function:
with open('file.txt', 'w', encoding='utf-8') as file:
    file.write('ñoño')
For some reason, you (wrongly) end up with UTF-8 encoded data in a Python 3 string. The real cause is probably that you should not decode source.content yourself; BeautifulSoup can take the raw content directly and work out the encoding:
url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-' + str(x) + "/"
source = requests.get(url)
chapter = source.content
soup = BeautifulSoup(chapter, 'html.parser')
If that is not enough, that is, if you still have ’ and – (Unicode u'\u2019' and u'\u2013') displayed as \xe2\x80\x99 and \xe2\x80\x93, that could be caused by the HTML page not correctly declaring its encoding. In that case you should first encode the text to a byte string with the latin1 encoding, and then decode it as UTF-8:
chapter = source.content.encode('latin1', 'ignore').decode('utf8', 'ignore')
soup = BeautifulSoup(chapter, 'html.parser')
Demonstration:
>>> t = u'\xe2\x80\x99 \xe2\x80\x93'
>>> t = t.encode('latin1').decode('utf8')
>>> t
u'\u2019 \u2013'
>>> print(t)
’ –
The only error I can spot is:
str(texta).encode("utf-8")
Here you are forcing a conversion to str and then encoding it. It should be replaced with:
texta.encode("utf-8")
EDIT:
The error stems from the server not giving the correct encoding for the page, so requests assumes 'ISO-8859-1'. As noted in this bug report, it is a deliberate decision.
Luckily, the chardet library correctly detects the 'utf-8' encoding, so you can do:
source.encoding = source.apparent_encoding
chapter = source.text
And there won't be any need to manually decode the text in chapter, since requests uses the detected encoding to decode the content for you.
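Putting the EDIT together with the question's loop, a minimal sketch for a single chapter (the URL pattern is the question's; apparent_encoding relies on chardet's detection, so treat it as a best guess):
import requests
from bs4 import BeautifulSoup

url = 'http://www.wuxiaworld.com/atg-index/atg-chapter-1/'
source = requests.get(url)
source.encoding = source.apparent_encoding  # trust the detected encoding over the server header
soup = BeautifulSoup(source.text, 'html.parser')
with open('atg_chapter1.txt', 'w', encoding='utf-8') as f:
    for p in soup.find_all('p'):
        f.write(p.text + '\n')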

Unable to decode HTML page with urllib.request

I've written the following piece of code which fetches a URL and saves the HTML to a text file. However, I have two issues:
Most importantly, it does not save € and £ from the HTML as such. This is likely a decoding issue which I've tried to fix, but so far without success.
The following code also does not replace the "\n" in the HTML with "". This isn't as important to me, but I am curious as to why it is not working.
Any ideas?
import urllib.request

while True: # this is an infinite loop
    with urllib.request.urlopen('WEBSITE_URL') as f:
        fDecoded = f.read().decode('utf-8')
    data = str(fDecoded .read()).replace('\n', '') # does not seem to work?
    myfile = open("TestFile.txt", "r+")
    myfile.write(data)
    print ('----------------')
When you do this:
fDecoded = f.read().decode('utf-8')
fDecoded is already of type str: you are reading the byte string from the request and decoding it into str using the utf-8 encoding.
Then after this you cannot call:
str(fDecoded .read()).replace('\n', '')
str has no method read(), and you do not actually need to convert it to str again. Just do:
data = fDecoded.replace('\n', '')
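Putting it together, a minimal sketch of the corrected loop body (WEBSITE_URL stays the question's placeholder; 'w' replaces 'r+' so the output file does not have to exist beforehand):
import urllib.request

with urllib.request.urlopen('WEBSITE_URL') as f:
    data = f.read().decode('utf-8').replace('\n', '')  # decode once, then replace
with open('TestFile.txt', 'w', encoding='utf-8') as myfile:
    myfile.write(data)  # € and £ survive because the file is written as UTF-8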

Getting "newline inside string" while reading the csv file in Python?

I have this utils.py file in my Django project:
def range_data(ip):
    r = []
    f = open(os.path.join(settings.PROJECT_ROOT, 'static', 'csv ',
             'GeoIPCountryWhois.csv'))
    for num, row in enumerate(csv.reader(f)):
        if row[0] <= ip <= row[1]:
            r.append([r[4]])
            return r
        else:
            continue
    return r
Here the ip parameter is just an IPv4 address. I am using the open source MaxMind GeoIPCountryWhois.csv file.
Some starting content of GeoIPCountryWhois.csv:
"1.0.0.0","1.0.0.255","16777216","16777471","AU","Australia"
"1.0.1.0","1.0.3.255","16777472","16778239","CN","China"
"1.0.4.0","1.0.7.255","16778240","16779263","AU","Australia"
"1.0.8.0","1.0.15.255","16779264","16781311","CN","China"
"1.0.16.0","1.0.31.255","16781312","16785407","JP","Japan"
"1.0.32.0","1.0.63.255","16785408","16793599","CN","China"
"1.0.64.0","1.0.127.255","16793600","16809983","JP","Japan"
"1.0.128.0","1.0.255.255","16809984","16842751","TH","Thailand"
I have also read about the issue, but didn't find anything very understandable. Would you please help me solve that error?
With my method in utils, I am checking the country name for the IP address passed to the method.
I had a similar problem earlier today: there was an end quote missing from a line, and the solution was to instruct the reader to perform no special processing of quote characters (quoting=csv.QUOTE_NONE), as in the sketch below.
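For illustration, a minimal sketch of that workaround; note that with QUOTE_NONE the surrounding quote characters stay in the field values, so they have to be stripped by hand:
import csv

with open('GeoIPCountryWhois.csv') as f:
    for row in csv.reader(f, quoting=csv.QUOTE_NONE):
        print(row[5].strip('"'))  # country name, with the literal quotes removed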
You can preprocess the CSV by normalizing the line endings (replacing \r\n with \n), like below:
import csv

content = open("GeoIPCountryWhois.csv", "r").read().replace('\r\n', '\n')
with open("GeoIPCountryWhois2.csv", "w") as g:
    g.write(content)
Then use GeoIPCountryWhois2.csv for the csv reader.
A wild guess: using a lineterminator may solve your problem:
for num, row in enumerate(csv.reader(f, lineterminator='\n')):
See also: http://docs.python.org/lib/csv-fmt-params.html
You must open your files as binary:
def range_data(ip):
    r = []
    f = open(os.path.join(settings.PROJECT_ROOT, 'static', 'csv ',
             'GeoIPCountryWhois.csv'), 'rb')
    for num, row in enumerate(csv.reader(f)):
        # Your things.
Note the 'rb' mode there; otherwise the file could be opened with native line endings, and the CSV reader doesn't handle the various forms very well. Certainly the copy of GeoIPCountryWhois.csv that I downloaded has clean \n line endings.
This is documented for the .reader() method:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
If, however, your csv file is so corrupted as to still contain unexpected newline characters in unexpected places, use this file subclass instead as a stop-gap measure:
class CleanlinesFile(file):
    def next(self):
        line = super(CleanlinesFile, self).next()
        return line.replace('\r', '').replace('\n', '') + '\n'
This class guarantees there will be no newlines anywhere in the returned results except as the very last character (just the way the csv module wants it). Use it instead of the open call; the 'rb' mode modifier becomes optional in this case:
def range_data(ip):
    r = []
    f = CleanlinesFile(os.path.join(settings.PROJECT_ROOT, 'static', 'csv ',
                       'GeoIPCountryWhois.csv'))
    for num, row in enumerate(csv.reader(f)):
        # Your things.
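As an aside, this answer targets Python 2 (the file builtin, the 'b' flag). Under Python 3 the csv module instead wants text files opened with newline=''; a minimal sketch:
import csv

with open('GeoIPCountryWhois.csv', newline='') as f:
    for num, row in enumerate(csv.reader(f)):
        pass  # Your things.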
