How do I read and print a whole .txt file using Python? - python

I am totally new to Python, and I am supposed to write a program that can read a whole .txt file and print it. The file is a long article in my first language (Norwegian). I have three versions that should do the same thing, but all of them raise errors. I have tried in both PyCharm and Eclipse with PyDev installed, and I get the same errors in both.
from sys import argv
import pip._vendor.distlib.compat
script, dev = argv
txt = open(dev)
print("Here's your file %r:" % dev)
print(txt.read())
print("Type the filename again:")
file_again = pip._vendor.distlib.compat.raw_input("> ")
txt_again = open(file_again)
print(txt_again.read())
But this gets the errors:
Traceback (most recent call last):
File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/1A.py", line 5, in <module>
script, dev = argv
ValueError: not enough values to unpack (expected 2, got 1)
Again, I am new to Python, and I searched around but didn't find a solution.
My next attempt was this:
# -*- coding: utf-8 -*-
import sys, traceback
fr = open('dev.txt', 'r')
text = fr.read()
print(text)
But this gets these errors:
Traceback (most recent call last):
File "/Users/vebjornbergaplass/Documents/Python eclipse/oblig1/src/1A/v2.py", line 6, in <module>
text = fr.read()
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
I do not understand why it doesn't work.
My third attempt looks like this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("dev.txt", help="dev.txt")
args = parser.parse_args()
if args.filename:
    with open('dev.txt') as f:
        for line in f:
            name, _ = line.strip().split('\t')
            print(name)
And this gets the errors:
usage: v3.py [-h] dev.txt
v3.py: error: the following arguments are required: dev.txt
Any help as to why these don't work is welcome.
Thank you in advance :D

Since the 2nd approach is the simplest, I'll stick to it.
You stated that the contents of dev.txt are Norwegian, which means the file includes non-ASCII characters like Æ, Ø, Å etc. The Python interpreter is trying to tell you this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 26: ordinal not in range(128)
It cannot interpret the byte 0xC3 = 195 (decimal) as an ASCII character, because ASCII is limited to 128 different characters (byte values 0 to 127).
I'll assume you're using UTF-8 as the encoding, but if not, change the encoding parameter in line 2.
# -*- coding: utf-8 -*-
fr = open('dev.txt', 'r', encoding='utf-8')
text = fr.read()
print(text)
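To see why the ascii codec gives up at 0xc3, here is a small sketch. The sample letter is only an illustration and is not taken from dev.txt; it assumes the file was saved as UTF-8, where Ø is stored as two bytes and the first one is 0xc3.

```python
# Illustrative sample only; any Norwegian letter outside ASCII behaves the same way.
sample = "\u00d8"  # the letter Ø

# UTF-8 stores Ø as two bytes; the first one is 0xc3 (195), outside ASCII's 0-127 range.
raw = sample.encode("utf-8")
print(repr(raw))

# Decoding those bytes as ASCII fails in the same way as in the traceback.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError as exc:
    ascii_ok = False
    print("ascii failed:", exc)

# Decoding with the encoding that produced the bytes gives the letter back.
decoded = raw.decode("utf-8")
print(decoded == sample)
```

The same thing happens inside open(): the bytes on disk are fine, and only the codec chosen for decoding them differs.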
If you do not know your encoding, you can find it out via your editor or use Python to guess it.
Your terminal could also cause the error when it's not configured to print Unicode characters or map them correctly. You might want to take a look at this question and its answers.
After working with a file, it is recommended to close it. You can either do that manually via fr.close() or make Python do it automatically:
with open('dev.txt', 'r', encoding='utf-8') as fr:
    # fr is closed automatically when leaving this code block
    text = fr.read()
print(text)

file = open("File.txt", "r")
a = str(file.read())
print(a)
Is this what you were looking for?

For example:
with open("fileA.txt", "r") as fileA:
    for line in fileA:
        print(line, end="")  # each line already ends with a newline

This is a possible solution:
f = open("textfile.txt", "r")
lines = f.readlines()
for line in lines:
    print(line)
f.close()
Save it as, for example, myscript.py and execute it:
python /path/to/myscript.py
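The first attempt in the question failed with "not enough values to unpack" because the script was started without a filename argument, so argv held only the script name. Below is a minimal sketch of one way to handle that. The helper names pick_filename and read_text and the fallback "dev.txt" are hypothetical choices for this example, and UTF-8 is assumed as the file encoding.

```python
import sys


def pick_filename(argv, default="dev.txt"):
    """Return the filename given on the command line, or a default.

    `default` is a hypothetical fallback chosen for this sketch.
    """
    # argv[0] is the script name; a filename, if given, is argv[1].
    if len(argv) >= 2:
        return argv[1]
    return default


def read_text(path, encoding="utf-8"):
    """Read the whole file as text, decoding it with the given encoding."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()


if __name__ == "__main__":
    filename = pick_filename(sys.argv)
    try:
        print(read_text(filename))
    except (OSError, UnicodeError) as exc:
        # Missing file, wrong encoding, or a terminal that cannot print the text.
        print("Could not print", filename, "->", repr(exc))
```

Running it as python /path/to/myscript.py dev.txt passes the filename explicitly; in PyCharm or Eclipse the argument would go into the run configuration.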

Related

Continuing for loop after exception in Python

So first of all I saw similar questions, but nothing worked/wasn't applicable to my problem.
I'm writing a program that takes in a text file with a lot of search queries to be searched on YouTube. The program iterates through the text file line by line. But some lines have special UTF-8 characters that cannot be decoded, so at a certain point the program stops with a
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
As I cannot check every line of my entries, I want it to catch the error, print the line it was working on and continue from that point.
As the error is not happening inside my for loop but rather in the for statement itself, I don't know how to write a try...except statement.
This is the code:
import urllib.request
import re
from unidecode import unidecode
with open('out.txt', 'r') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        try:
            clean = unidecode(line)
            search_keyword = clean
            html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
            video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
            outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
            #print("https://www.youtube.com/watch?v=" + video_ids[0])
        except:
            print("Error encounted with Line: " + line)
This is the full error message, to see that the for loop itself is causing the problem.
Traceback (most recent call last):
File "ytbysearchtolinks.py", line 6, in <module>
for line in infh:
File "C:\Users\nfeyd\AppData\Local\Programs\Python\Python36\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 1826: character maps to <undefined>
If you need an example of input I'm working with: https://pastebin.com/LEkwdU06
The try-except-block looks correct and should allow you to catch all occurring exceptions.
The usage of unidecode probably won't help you because non-ASCII characters must be encoded in a specific way in URLs, see, e.g., here.
One solution is to use urllib's quote() function. As per documentation:
Replace special characters in string using the %xx escape.
This is what works for me with the input you've provided:
import urllib.request
from urllib.parse import quote
import re
with open('out.txt', 'r', encoding='utf-8') as infh,\
        open("links.txt", "w") as outfh:
    for line in infh:
        # strip the trailing newline, then percent-encode the query
        search_keyword = quote(line.strip())
        html = urllib.request.urlopen("https://www.youtube.com/results?search_query=" + search_keyword)
        video_ids = re.findall(r"watch\?v=(\S{11})", html.read().decode())
        outfh.write("https://www.youtube.com/watch?v=" + video_ids[0] + "\n")
        print("https://www.youtube.com/watch?v=" + video_ids[0])
EDIT:
After thinking about it, I believe you are running into the following problem:
You are running the code on Windows, and apparently, Python will try to open the file with cp1252 encoding when on Windows, while the file that you shared is in UTF-8 encoding:
$ file out.txt
out.txt: UTF-8 Unicode text, with CRLF line terminators
This would explain the exception you are getting and why it's not being caught by your try-except block: the decoding happens when the for statement reads the next line from the file, which is outside the try block.
Make sure that you are using encoding='utf-8' when opening the file.
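If some lines really are undecodable and the loop should carry on past them, one possible approach (a sketch, not the original poster's code) is to read the file as bytes and decode each line inside the loop body, where a try/except can catch the failure. The helper name decode_lines and the sample data are made up for illustration.

```python
def decode_lines(raw_lines, encoding="utf-8"):
    """Yield (line_number, text) for decodable lines; report and skip the rest."""
    for number, raw in enumerate(raw_lines, start=1):
        try:
            yield number, raw.decode(encoding).rstrip("\r\n")
        except UnicodeDecodeError as exc:
            # The failure now happens inside the loop body, so the loop carries on.
            print("Skipping line", number, "->", repr(raw), exc.reason)


# In-memory stand-in for the byte lines a file opened with mode 'rb' would yield.
sample = [b"first query\r\n", b"bad \x81 byte\r\n", "caf\u00e9\r\n".encode("utf-8")]
good = list(decode_lines(sample))
print(len(good), "lines decoded")
```

With a real file this would be used as `with open('out.txt', 'rb') as infh:` followed by `for number, text in decode_lines(infh):`, since a file opened in binary mode yields its lines as bytes.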
I ran your code and didn't have any problems. Have you created a virtual environment with virtualenv and installed all the packages you use?

python script not encoding to utf-8

I have this Python 3 script to read a json file and save it as csv. It works fine except for special characters like \u00e9. So Montr\u00e9al should be written as Montréal, but it is giving me MontrÃ©al instead.
import csv
import json
ifilename = 'business.json'
ofilename = 'business.csv'
json_lines = [json.loads( l.strip() ) for l in open(ifilename).readlines() ]
OUT_FILE = open(ofilename, "w", newline='', encoding='utf-8')
root = csv.writer(OUT_FILE)
root.writerow(["business_id","name","neighborhood","address","city","state"])
json_no = 0
for l in json_lines:
    root.writerow([l["business_id"],l["name"],l["neighborhood"],l["address"],l["city"],l["state"]])
    json_no += 1
print('Finished {0} lines'.format(json_no))
OUT_FILE.close()
It turns out the csv file was displaying correctly when opening it with Notepad++ but not with Excel. So I had to import the csv file with Excel and specify 65001: Unicode (UTF-8).
Thanks for the help.
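If the CSV is meant to be opened in Excel directly, one option is to write it with the utf-8-sig codec, which puts a byte order mark at the start of the file; Excel generally uses that mark to recognize UTF-8. This is a sketch under that assumption, with made-up sample rows and a temporary output path rather than the real business.json data.

```python
import csv
import os
import tempfile

# Made-up sample rows standing in for the converted JSON records.
rows = [["business_id", "city"], ["abc123", "Montr\u00e9al"]]

# Hypothetical output path in a temp directory, used only for this sketch.
out_path = os.path.join(tempfile.mkdtemp(), "business.csv")

# 'utf-8-sig' writes a byte order mark (EF BB BF) before the UTF-8 data.
with open(out_path, "w", newline="", encoding="utf-8-sig") as out_file:
    csv.writer(out_file).writerows(rows)

# Read the raw bytes back to confirm the mark is there.
with open(out_path, "rb") as check:
    raw = check.read()
print(repr(raw[:3]))
```

Programs that read the file back in Python can open it with encoding='utf-8-sig' as well, which strips the mark again.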
Try using this at the top of the file
# -*- coding: utf-8 -*-
Consider this example:
# -*- coding: utf-8 -*-
import sys
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
string_demo="Montréal"
print(string_demo)
reload(sys) # just in python2.x
sys.setdefaultencoding('UTF8') # just in python2.x
print("my default encoding is : {0}".format(sys.getdefaultencoding()))
print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
In my case, the output is like this if i run in python2.x:
my default encoding is : ascii
Montréal
my default encoding is : UTF8
('Montr\xc3\xa9al', <type 'str'>)
but when i comment out the reload and setdefaultencoding lines, my output is like this:
my default encoding is : ascii
Montréal
my default encoding is : ascii
Traceback (most recent call last):
File "test.py", line 12, in <module>
print(str(string_demo.encode('utf8')), type(string_demo.encode('utf8')))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 5: ordinal not in range(128)
It's most likely a problem with the editor; when Python itself hits an encoding error, it raises an exception.

import utf-8 csv in python3.x - special german characters

I have been trying to import a csv file containing special characters (ä ö ü).
In Python 2.x all special characters were handled automatically without the need to specify an encoding attribute in the open command.
I can't figure out how to get this to work in Python 3.x.
import csv
f = open('sample_1.csv', 'rU', encoding='utf-8')
csv_f = csv.reader(f, delimiter=';')
bla = list(csv_f)
print(type(bla))
print(bla[0])
print(bla[1])
print(bla[2])
print()
print(bla[3])
Console output (Sublime Build python3)
<class 'list'>
['\ufeffCat1', 'SEO Meta Text']
['Damen', 'Damen----']
['Damen', 'Damen-Accessoires-Beauty-Geschenk-Sets-']
Traceback (most recent call last):
File "/Users/xxx/importer_tree.py", line 13, in <module>
print(bla[3])
UnicodeEncodeError: 'ascii' codec can't encode character '\xf6' in position 37: ordinal not in range(128)
input sample_1.csv (excel file saved as utf-8 csv)
Cat1;SEO Meta Text
Damen;Damen----
Damen;Damen-Accessoires-Beauty-Geschenk-Sets-
Damen;Damen-Accessoires-Beauty-Körperpflege-
Männer;Männer-Sport-Sportschuhe-Trekkingsandalen-
Männer;Männer-Sport-Sportschuhe-Wanderschuhe-
Männer;Männer-Sport-Sportschuhe--
Is this only an output format issue, or am I also importing the data wrongly?
How can I print out "Männer"?
Thank you for your help/guidance!
Thank you to juanpa-arrivillaga and to this answer: https://stackoverflow.com/a/44088439/9059135
The issue is due to my Sublime settings:
sys.stdout.encoding returns US-ASCII
In the terminal the same command returns UTF-8.
Setting up the build system in Sublime properly will solve the issue.
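A quick way to check this from inside a script is to look at sys.stdout.encoding. The sketch below also shows one possible workaround, wrapping a binary stream in a UTF-8 text writer; the helper name utf8_writer is made up, and the demonstration writes to an in-memory buffer rather than the real stdout.

```python
import io
import sys

# Shows which encoding print() will use; under a misconfigured build system
# this may report an ASCII codec instead of UTF-8.
print(getattr(sys.stdout, "encoding", None))


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration on an in-memory buffer instead of the real stdout.
buffer = io.BytesIO()
out = utf8_writer(buffer)
print("M\u00e4nner", file=out)
out.flush()
written = buffer.getvalue()
print(repr(written))
```

In a real script the equivalent would be `sys.stdout = utf8_writer(sys.stdout.buffer)`, or on Python 3.7 and later `sys.stdout.reconfigure(encoding='utf-8')`. Setting the environment variable PYTHONIOENCODING=utf-8 in the build system has a similar effect without touching the code.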

Program (twitter bot) works on Windows machine, but not on Linux machine [duplicate]

I was trying to read a file in Python 2.7, and it was read perfectly. The problem appears when I execute the same program in Python 3.4; I get this error:
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
Also, when I run the program on Windows (with Python 3.4), the error doesn't appear. The first line of the document is:
Codi;Codi_lloc_anonim;Nom
and the code of my program is:
def lectdict(filename,colkey,colvalue):
    f = open(filename,'r')
    D = dict()
    for line in f:
        if line == '\n': continue
        D[line.split(';')[colkey]] = D.get(line.split(';')[colkey],[]) + [line.split(';')[colvalue]]
    f.close()
    return D
Traduccio = lectdict('Noms_departaments_centres.txt',1,2)
In Python2,
f = open(filename,'r')
for line in f:
reads lines from the file as bytes.
In Python3, the same code reads lines from the file as strings. Python3
strings are what Python2 calls unicode objects. These are bytes decoded
according to some encoding. When no encoding is passed to open(), Python3 uses the platform's preferred encoding (locale.getpreferredencoding()). That is utf-8 here, while on Windows it is usually a legacy code page such as cp1252, which is why the error did not appear there.
The error message
'utf-8' codec can't decode byte 0xf2 in position 424: invalid continuation byte
shows Python3 is trying to decode the bytes as utf-8. Since there is an error, the file apparently does not contain utf-8 encoded bytes.
To fix the problem you need to specify the correct encoding of the file:
with open(filename, encoding=enc) as f:
    for line in f:
        ...  # process the decoded line
If you do not know the correct encoding, you could run this program to simply
try all the encodings known to Python. If you are lucky there will be an
encoding which turns the bytes into recognizable characters. Sometimes more
than one encoding may appear to work, in which case you'll need to check and
compare the results carefully.
# Python3
import pkgutil
import os
import encodings
def all_encodings():
    modnames = set(
        [modname for importer, modname, ispkg in pkgutil.walk_packages(
            path=[os.path.dirname(encodings.__file__)], prefix='')])
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)
filename = '/tmp/test'
encodings = all_encodings()
for enc in encodings:
    try:
        with open(filename, encoding=enc) as f:
            # print the encoding and the first 500 characters
            print(enc, f.read(500))
    except Exception:
        pass
OK, I did what @unutbu told me. The result was a long list of encodings, one of which is cp1250, so I changed:
f = open(filename,'r')
to
f = open(filename,'r', encoding='cp1250')
as @triplee suggested. And now I can read my files.
In my case I can't change the encoding because my file really is UTF-8 encoded. But some rows are corrupted and cause the same error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd0 in position 7092: invalid continuation byte
My solution was to open the file in binary mode:
open(filename, 'rb')
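Binary mode hands back bytes, which still have to be decoded somewhere. An alternative worth considering for a mostly valid UTF-8 file is the errors parameter of open(). The sketch below builds a small made-up file with one corrupted byte and reads it with errors="replace"; the file name and contents are hypothetical.

```python
import os
import tempfile

# Build a small file that is mostly UTF-8 but has one corrupted byte:
# 0xd0 followed by a space is not a valid UTF-8 sequence. Sample data is made up.
path = os.path.join(tempfile.mkdtemp(), "corrupt.txt")
with open(path, "wb") as handle:
    handle.write("Codi;Nom\n".encode("utf-8") + b"1;Bar\xd0 lona\n")

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
with open(path, "r", encoding="utf-8", errors="replace") as handle:
    lines = handle.read().splitlines()

for line in lines:
    print(ascii(line))
```

The replacement character makes the damaged rows easy to find afterwards; errors="ignore" would drop the bad bytes silently instead.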

python-write to file (ignore non-ascii chars)

I am on Linux and I want to write a string (in utf-8) to a txt file. I tried many ways, but I always got an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 36: ordinal not in range(128)
Is there any way to write only ASCII characters to the file and ignore non-ASCII characters?
My code:
# -*- coding: UTF-8-*-
import os
import sys
def __init__(self, dirname, speaker, file, exportFile):
    text_file = open(exportFile, "a")
    text_file.write(speaker.encode("utf-8"))
    text_file.write(file.encode("utf-8"))
    text_file.close()
Thank you.
You can use the codecs module:
import codecs
text_file = codecs.open(exportFile,mode='a',encoding='utf-8')
text_file.write(...)
Try using the codecs module.
# -*- coding: UTF-8-*-
import codecs
def __init__(self, dirname, speaker, file, exportFile):
    with codecs.open(exportFile, "a", 'utf-8') as text_file:
        # codecs encodes on write, so pass unicode text (decode byte strings first)
        text_file.write(speaker)
        text_file.write(file)
Also, beware that your file variable has a name which shadows the builtin file type.
Finally, I would suggest you have a look at http://www.joelonsoftware.com/articles/Unicode.html to better understand what is unicode, and one of these pages (depending on your python version) to understand how to use it in Python:
http://docs.python.org/2/howto/unicode
http://docs.python.org/3/howto/unicode.html
You could decode your input string before writing it:
text = speaker.decode("utf8")
with open(exportFile, "a") as text_file:
    text_file.write(text.encode("utf-8"))
    text_file.write(file.encode("utf-8"))
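None of the snippets above actually drops the non-ASCII characters, which is what the question asked for. A minimal sketch of that, assuming Python 3 and text that is already a str, is to encode to ASCII with errors="ignore". The helper name to_ascii, the sample text, and the temporary file path are made up for illustration.

```python
import os
import tempfile


def to_ascii(text):
    """Drop every character that has no ASCII representation."""
    return text.encode("ascii", errors="ignore").decode("ascii")


# Made-up sample containing the Norwegian letters å, æ and ø.
cleaned = to_ascii("bl\u00e5b\u00e6r og r\u00f8d gr\u00f8t")
print(cleaned)

# Hypothetical export file; appending ASCII-only text cannot raise an encode error.
export_path = os.path.join(tempfile.mkdtemp(), "export.txt")
with open(export_path, "a", encoding="ascii") as text_file:
    text_file.write(cleaned + "\n")
```

Dropping characters loses information, so writing the file as UTF-8 is usually preferable when the consumer of the file can handle it.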
