Python how to "ignore" ascii text? - python

I'm trying to scrape some stuff off a page using selenium, but some of the text has non-ASCII characters in it, so I get this:
f.write(database_text.text)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 1462: ordinal not in range(128)
I was wondering, is there any way to simply ignore the non-ASCII characters?
Thanks!
print("â")
I'm not looking to write it in my text file, but ignore it.
note: It's not just "â" it has other chars like that also.
window_before = driver.window_handles[0]
nmber_one = 1
f = open(str(unique_filename) + ".txt", 'w')
for i in range(5, 37):
    time.sleep(3)
    driver.find_element_by_xpath("""/html/body/center/table[2]/tbody/tr[2]/td/table/tbody/tr""" + "[" + str(i) + "]" + """/td[2]/a""").click()
    time.sleep(3)
    driver.switch_to.window(driver.window_handles[nmber_one])
    nmber_one = nmber_one + 1
    database_text = driver.find_element_by_xpath("/html/body/pre")
    f = open(str(unique_filename) + ".txt", 'w',)
    f.write(database_text.text)
    driver.switch_to.window(window_before)
import uuid
import io
unique_filename = uuid.uuid4()
which generates a new filename, well it should anyway, it worked before.

The problem is that some of the text is not ASCII. database_text.text is likely Unicode text (in Python 2 you can do print type(database_text.text) to verify) and contains non-English characters. If you are on Windows, it may instead be "codepage" text, which depends on how your user account is configured.
Often, one wants to store text like this as UTF-8, so open your output file accordingly:
import io
text = u"â"
with io.open('somefile.txt', 'w', encoding='utf-8') as f:
    f.write(text)
If you really do want to just drop the non-ASCII characters from the file completely, you can set up an error policy:
text = u"ignore funky â character"
with io.open('somefile.txt', 'w', encoding='ascii', errors='ignore') as f:
    f.write(text)
In the end, you need to choose what representation you want to use for non-ascii (roughly speaking, non-English) text.
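To make that choice concrete, here is a small Python 3 sketch (not from the original answer) comparing a few of the standard error policies that str.encode() accepts. The variable names are just for illustration.

```python
# The sample text contains one non-ASCII character, U+00E2 (a with circumflex).
text = "ignore funky \u00e2 character"

# "ignore" silently drops anything that cannot be encoded.
dropped = text.encode("ascii", errors="ignore")

# "replace" substitutes a question mark for each unencodable character.
replaced = text.encode("ascii", errors="replace")

# "backslashreplace" keeps the information as a Python-style escape.
escaped = text.encode("ascii", errors="backslashreplace")

# "xmlcharrefreplace" keeps it as an XML/HTML numeric character reference.
xml_ref = text.encode("ascii", errors="xmlcharrefreplace")

for label, value in [("ignore", dropped), ("replace", replaced),
                     ("backslashreplace", escaped), ("xmlcharrefreplace", xml_ref)]:
    print(label, value)
```

The same policy names can be passed as errors=... to open() or io.open() when writing a file.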

A try/except block would work, though note that it skips the whole write for that page rather than just the offending characters:
try:
    f.write(database_text.text)
except UnicodeEncodeError:
    pass

Related

Transposing files from columns to rows for multiple files

I have approximately 200 files (plus more in the future) that I need to transpose data from columns into rows. I'm a microbiologist, so coding isn't my forte (have worked with Linux and R in the past). One of my computer science friends was trying to help me write code in Python, but I have never used it before today.
The files are in .lvm format, and I'm working on a Mac. Items with 2 stars on either side are paths that I've hidden to protect my privacy.
The for loop is where I've been getting the error, but I'm not sure if that's where my problem lies or if it's something else.
This is the Python code I've been working on:
import os
lvm_directory = "/Users/**path**"
output_file = "/Users/**path**/Transposed.lvm"
newFile = True
output_delim = "\t"
for filename in os.listdir(lvm_directory):
    header = []
    data = []
    f = open(lvm_directory + "/" + filename)
    for l in f:
        sl = l.split()
        if (newFile):
            header += [sl[1]]
    f.close()
This is the error message I've been getting and I can't figure out how to work through it:
  File "<pyshell#97>", line 5, in <module>
    for l in f:
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 345: invalid continuation byte
The rest of the code after this error is as follows, but I haven't worked through it yet due to the above error:
    if (newFile):
        f = open(output_file, 'w')
        f.write(output_delim.join(header))
        newFile = False
    else:
        f = open(output_file, 'a')
    f.write("\n"+output_delim.join(data))
    f.close()
Looks like your files use an encoding other than the default UTF-8. It can't be plain ASCII, since byte 0xd6 is outside the ASCII range (and a pure-ASCII file would decode as UTF-8 without error); a single-byte encoding such as Latin-1 or cp1252 is a more likely guess. You'd use something like:
with open(lvm_directory + "/" + filename, encoding="latin-1") as f:
    for l in f:
        # rest of your code here
It's generally more "pythonic" to use a with statement to handle resource management (i.e. opening and closing a file), hence the with approach demonstrated above. If the text looks wrong with Latin-1, see if another encoding works. There are command-line tools like chardet that can help you identify the file's encoding.
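If you don't know the encoding up front, one rough approach is to try a short list of candidates in order and keep the first one that decodes the whole file. The sketch below is only an illustration: read_lines_with_fallback and the candidate list are made up for this example, and the demo file is a stand-in for a real .lvm file.

```python
import os
import tempfile


def read_lines_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (lines, encoding) for the first candidate that decodes the file."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return f.read().splitlines(), enc
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError("none of the candidate encodings could decode " + path)


# Demo with a throwaway file containing byte 0xd6, which is invalid as UTF-8 here.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.lvm")
with open(demo_path, "wb") as f:
    f.write(b"Time\t\xd6l\n0.0\t1.5\n")

demo_lines, demo_encoding = read_lines_with_fallback(demo_path)
print(demo_encoding, demo_lines)
```

Keep in mind that latin-1 accepts every possible byte, so it never fails; a successful decode only means the bytes were readable, not that the guess was right. Check a few decoded lines by eye.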

Continuing for loop after exception in Python

So first of all I saw similar questions, but nothing worked/wasn't applicable to my problem.
I'm writing a program that is taking in a Text file with a lot of search queries to be searched on Youtube. The program is iterating through the text file line by line. But these have special UTF-8 characters that cannot be decoded. So at a certain point the program stops with a
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
As I cannot check every line of my entries, I want it to catch the error, print the line it was working on and continue from that point.
As the error is not happening inside my for loop, but rather in the for statement itself, I don't know how to write a try...except statement for it.
This is the code:
import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
     open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encounted with Line: " + line)
This is the full error message, to see that the for loop itself is causing the problem.
Traceback (most recent call last):
  File "ytbysearchtolinks.py", line 6, in <module>
    for line in infh:
  File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
If you need an example of input I'm working with: https://pastebin.com/LEkwdU06
The try-except-block looks correct and should allow you to catch all occurring exceptions.
The usage of unidecode probably won't help you because non-ASCII characters must be percent-encoded in URLs.
One solution is to use urllib's quote() function. As per documentation:
Replace special characters in string using the %xx escape.
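As a quick illustration of what quote() does to a search term, here is a small sketch. build_search_url is a hypothetical helper written for this example; it also strips the trailing newline that each line read from a file carries.

```python
from urllib.parse import quote


def build_search_url(query):
    """Percent-encode a search query and append it to the results URL."""
    # strip() removes the newline left over from reading the line from a file.
    return "https://www.youtube.com/results?search_query=" + quote(query.strip())


encoded = quote("caf\u00e9 music")
url = build_search_url("caf\u00e9 music\n")
print(encoded)
print(url)
```

quote() encodes the text as UTF-8 by default before escaping it, which is why the accented letter becomes two %xx escapes.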
This is what works for me with the input you've provided:
import urllib.request
from urllib.parse import quote
import re
with open('out.txt', 'r', encoding='utf-8') as infh,\
     open("links.txt", "w") as outfh:
    for line in infh:
        search_keyword = quote(line)
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])
EDIT:
After thinking about it, I believe you are running into the following problem:
You are running the code on Windows, and apparently, Python will try to open the file with cp1252 encoding when on Windows, while the file that you shared is in UTF-8 encoding:
$ file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators
This would explain the exception you are getting and why it's not being caught by your try-except-block: it's raised while the for statement reads and decodes the next line of the file, which happens outside the try block.
Make sure that you are using encoding='utf-8' when opening the file.
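If some lines might still be undecodable and you want to report them and carry on (as the question asks), one option is to read the file in binary mode and decode each line yourself, so the decode step sits inside a try block. This is only a sketch: iter_decoded_lines is a made-up helper, and the demo file stands in for out.txt.

```python
import os
import tempfile


def iter_decoded_lines(path, encoding="utf-8"):
    """Yield (line_number, text) for decodable lines; report and skip the rest."""
    with open(path, "rb") as fh:
        for number, raw in enumerate(fh, start=1):
            try:
                yield number, raw.decode(encoding).rstrip("\r\n")
            except UnicodeDecodeError as exc:
                print("Skipping line %d: %s" % (number, exc))


# Demo: the second line contains bytes that are not valid UTF-8.
demo_path = os.path.join(tempfile.mkdtemp(), "out.txt")
with open(demo_path, "wb") as f:
    f.write(b"first\n\xff\xfe bad\nthird\n")

good_lines = list(iter_decoded_lines(demo_path))
print(good_lines)
```

Splitting on newline bytes like this is fine for UTF-8 and single-byte encodings, but would not be correct for something like UTF-16.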
I ran your code and didn't have any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

How to extract String from a Unicoded JSONObject in Python?

I'm getting the error below when I try to parse a string containing Unicode characters such as the ' symbol and emojis:
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f33b' in position 19: ordinal not in range(128)
Sample Object:
{"user":{"name":"\u0e2a\u0e31\u0e48\u0e07\u0e14\u0e48\u0e27\u0e19 \u0e2b\u0e21\u0e14\u0e44\u0e27 \u0e40\u0e14\u0e23\u0e2a\u0e41\u0e1f\u0e0a\u0e31\u0e48\u0e19\u0e21\u0e32\u0e43\u0e2b\u0e21\u0e48 \u0e23\u0e32\u0e04\u0e32\u0e40\u0e1a\u0e32\u0e46 \u0e2a\u0e48\u0e07\u0e17\u0e31\u0e48\u0e27\u0e44\u0e17\u0e22 \u0e44\u0e14\u0e49\u0e02\u0e2d\u0e07\u0e0a\u0e31\u0e27\u0e23\u0e4c\u0e08\u0e49\u0e32 \u0e2a\u0e19\u0e43\u0e08\u0e15\u0e34\u0e14\u0e15\u0e48\u0e2d\u0e2a\u0e2d\u0e1a\u0e16\u0e32\u0e21 Is it","tag":"XYZ"}}
I'm able to extract tag value, but I'm unable to extract name value.
Here is my code:
dict = json.loads(json_data)
print('Tag - ' + dict['user']['tag'])
print('Name - ' + dict['user']['name'])
You can save the data in CSV file format, which can also be opened using Excel. When you open a file with open(filename, "w") and no encoding argument, Python uses the platform's default encoding, which may be ASCII (as the error message suggests it is in your case), and writing other Unicode characters then raises UnicodeEncodeError. To store Unicode data, open the file with UTF-8 encoding.
mydict = json.loads(json_data) # or whatever dictionary it is...
# Open the file with UTF-8 encoding, most important step
f = open("userdata.csv", "w", encoding='utf-8')
f.write(mydict['user']['name'] + ", " + mydict['user']['tag'] + "\n")
f.close()
Feel free to change the code based on the data you have.
That's it...
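A slightly more defensive variant of the same idea, sketched here with the standard csv module so that commas or quotes inside the name don't break the columns. The JSON string is a shortened stand-in for the sample object, and the output path is a temporary file chosen for the example.

```python
import csv
import json
import os
import tempfile

# Shortened stand-in for the sample object; the name contains Thai text and a comma.
json_data = '{"user": {"name": "\\u0e2a\\u0e31\\u0e48\\u0e07, Is it", "tag": "XYZ"}}'
record = json.loads(json_data)

csv_path = os.path.join(tempfile.mkdtemp(), "userdata.csv")

# encoding="utf-8" lets any Unicode text be written; newline="" is what the
# csv module documentation asks for when opening files.
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([record["user"]["name"], record["user"]["tag"]])

# Read it back to confirm the round trip.
with open(csv_path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```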

Python: UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0: ordinal not in range(128)

I'm currently having an issue with my Python 3 code.
replace_line('Products.txt', line, tenminus_str)
is the line I'm trying to make work with UTF-8. However, when I try to do this the way I did with other lines, I get errors such as "no attribute" ones, and when I try to add, for example...
.decode("utf8")
...to the end of it, I still get errors saying that it is using ASCII. I also tried other methods that worked with other lines, such as adding io. in front and adding a comma with
encoding = 'utf8'
The function that I am using for replace_line is:
def replace_line(file_name, line_num, text):
    lines = open(file_name, 'r').readlines()
    lines[line_num] = text
    out = open(file_name, 'w')
    out.writelines(lines)
    out.close()
How would I fix this issue? Please note that I'm very new to Python and not advanced enough to do debugging well.
EDIT: Different fix to this question than 'duplicate'
EDIT 2:I have another error with the function now.
  File "FILELOCATION", line 45, in refill
    replace_line('Products.txt', str(line), tenminus_str)
  File "FILELOCATION", line 6, in replace_line
    lines[line_num] = text
TypeError: list indices must be integers, not str
What does this mean and how do I fix it?
Change your function to:
def replace_line(file_name, line_num, text):
    with open(file_name, 'r', encoding='utf8') as f:
        lines = f.readlines()
    lines[line_num] = text
    with open(file_name, 'w', encoding='utf8') as out:
        out.writelines(lines)
encoding='utf8' will decode your UTF-8 file correctly.
with automatically closes the file when its block is exited.
Since your file started with \xef it likely has a UTF-8-encoding byte order mark (BOM) character at the beginning. The above code will maintain that on output, but if you don't want it use utf-8-sig for the input encoding. Then it will be automatically removed.
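To see the difference between the two codecs, here is a small sketch that writes a throwaway file starting with a UTF-8 BOM and reads it back both ways. The file name and contents are made up for the demonstration.

```python
import codecs
import os
import tempfile

# Create a small file that starts with the UTF-8 byte order mark (EF BB BF).
bom_path = os.path.join(tempfile.mkdtemp(), "Products.txt")
with open(bom_path, "wb") as f:
    f.write(codecs.BOM_UTF8 + "apple\n".encode("utf-8"))

# Plain utf-8 keeps the BOM as an invisible U+FEFF character on the first line.
with open(bom_path, encoding="utf-8") as f:
    with_bom = f.readline()

# utf-8-sig strips the BOM while decoding.
with open(bom_path, encoding="utf-8-sig") as f:
    without_bom = f.readline()

print(repr(with_bom), repr(without_bom))
```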
The codecs module is just what you need.
import codecs
def replace_line(file_name, line_num, text):
    f = codecs.open(file_name, 'r', encoding='utf-8')
    lines = f.readlines()
    lines[line_num] = text
    f.close()
    w = codecs.open(file_name, 'w', encoding='utf-8')
    w.writelines(lines)
    w.close()
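To handle encoding problems, you can try adding the following settings at the top of your script. Note that this only works in Python 2; reload() is not a builtin and sys.setdefaultencoding() does not exist in Python 3: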
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
Type = sys.getfilesystemencoding()
Try adding encoding='utf8' if you are reading a file:
with open("../file_path", encoding='utf8') as f:
    # your code

Cannot execute auto-generated Python script encoded in UTF8-sig

I am using a Python script to take some text from the internet and put it as comments to another Python script which the first one generates.
Originally I was simply using open() to create the new Python script and write() to print to it.
outputFile = open(fileName, 'w')
outputFile.write('#!/usr/bin/python\n')
outputFile.write('\n')
outputFile.write('# ' + lineFromTheInternet + '\n')
outputFile.write('print \'Hello, World!\'\n')
This works most of the time, the new script is generated and I can run it. However, sometimes the text that I am taking from the internet has Unicode characters and gives me problems (UnicodeEncodeError: 'ascii' codec can't encode character u'\xd7' in position 55: ordinal not in range(128)). I replaced the code then to:
outputFile = codecs.open(fileName, 'w', 'utf-8-sig')
outputFile.write('#!/usr/bin/python\n')
outputFile.write('\n')
outputFile.write('# ' + lineFromTheInternet + '\n')
outputFile.write('print \'Hello, World!\'\n')
And this would generate the file correctly, but when I try to execute it I get ./autogenerated.py: line 1: #!/usr/bin/python: No such file or directory
This has to be the encoding, since it's the only thing changing, but I do not know how to solve it.
Linux or Windows? This works on Windows. Make sure to write Unicode strings to the file opened with codecs.open:
#!/usr/bin/python2
import codecs
with codecs.open('y.py', 'w', 'utf-8-sig') as outputFile:
    outputFile.write(u'#!/usr/bin/python2\n')
    outputFile.write(u'\n')
    outputFile.write(u'# ' + u'Syst\xe9m' + u'\n')
    outputFile.write(u'print \'Hello, World!\'\n')
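AFAIK, Linux may not like the UTF-8 BOM: the kernel only recognizes a shebang when #! are the very first two bytes of the file, and the BOM puts three bytes in front of them. Try removing it and declaring the encoding instead, e.g. #coding:utf8 at the top of the file: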
#!/usr/bin/python2
import codecs
with codecs.open('y.py', 'w', 'utf8') as outputFile:
    outputFile.write(u'#!/usr/bin/python2\n')
    outputFile.write(u'#coding:utf8\n')
    outputFile.write(u'\n')
    outputFile.write(u'# ' + u'Syst\xe9m' + u'\n')
    outputFile.write(u'print \'Hello, World!\'\n')
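For what it's worth, a rough Python 3 equivalent might look like the sketch below. It assumes the generated script is also meant to run under Python 3, where UTF-8 is already the default source encoding, so no coding line is needed. write_script, the shebang line and the sample comment are choices made for this example.

```python
import os
import stat
import tempfile


def write_script(path, comment):
    """Write a small Python 3 script as UTF-8 without a BOM and mark it executable."""
    # encoding="utf-8" (not "utf-8-sig") keeps "#!" as the first two bytes;
    # newline="\n" avoids CRLF line endings, which can also break the shebang.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("#!/usr/bin/env python3\n")
        f.write("\n")
        f.write("# " + comment + "\n")
        f.write("print('Hello, World!')\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)


script_path = os.path.join(tempfile.mkdtemp(), "autogenerated.py")
write_script(script_path, "Syst\u00e9m \u00d7")

# Check that the file really starts with the shebang and not with a BOM.
with open(script_path, "rb") as f:
    first_bytes = f.read(2)
print(first_bytes)
```

If the text from the internet can contain line breaks, it would need to be split or escaped first, otherwise part of it would end up outside the comment.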
