I used Selenium to scrape some sentences and then print the result to a .txt file, but it cannot display the apostrophe (') and writes a strange character instead:
Original sentence:
I don't think so.
In the .txt file:
I don? think so.
I have already specified the encoding as "utf-8" for the .txt file, so what should I do?
You need to open the file for writing with the "utf-8" encoding:
with open("file.txt", "w", encoding="utf-8") as file:
    file.write("your_text")
Hope it helps you!
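For context, the ? typically appears when a character such as the curly apostrophe (U+2019) is encoded to a charset that cannot represent it. A minimal sketch of that failure mode (an illustration, not from the original answer):
# Encoding a curly apostrophe with errors="replace" substitutes "?",
# which is exactly the kind of artifact described in the question.
s = "I don\u2019t think so."
print(s.encode("ascii", errors="replace").decode("ascii"))  # I don?t think so.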
I'm working with the requests module to scrape text from a website and store it into a txt file using a method like below:
r = requests.get(url)
with open("file.txt","w") as filename:
    filename.write(r.text)
With this method, say "送分200000" was the only string that requests got from the url; it would then be stored in file.txt as the escape sequence below.
\u9001\u5206200000
When I grab the string from file.txt later on, the string doesn't convert back to "送分200000" and instead remains at "\u9001\u5206200000" when I try to print it out. For example:
with open("file.txt", "r") as filename:
    mystring = filename.readline()
    print(mystring)
Output:
"\u9001\u5206200000"
Is there a way for me to convert this string and others like it back to their original strings with unicode characters?
Yes, let file.txt content be
\u9001\u5206200000
then
with open("file.txt","rb") as f:
    content = f.read()
text = content.decode("unicode_escape")
print(text)
output
送分200000
If you want to know more, read the Text Encodings section of the built-in codecs module docs.
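A note beyond the original answer: in Python 3 you can also pass the codec straight to open(). Be aware that "unicode_escape" assumes Latin-1 for any bytes that are not part of an escape, so this is only safe for ASCII files that contain \uXXXX escapes. A sketch:
# Read the escaped file directly with the unicode_escape codec.
# Caveat: non-escape bytes are decoded as Latin-1, so this is only
# appropriate when the file itself is plain ASCII.
with open("file.txt", "r", encoding="unicode_escape") as f:
    text = f.read()
print(text)  # 送分200000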
It's better to use the io module for that. Try and adapt the following code for your problem.
import io
with io.open(filename, 'r', encoding='utf8') as f:
    text = f.read()
# process Unicode text
with io.open(filename, 'w', encoding='utf8') as f:
    f.write(text)
Taken from https://www.tutorialspoint.com/How-to-read-and-write-unicode-UTF-8-files-in-Python
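Worth noting: in Python 3, io.open is simply an alias for the built-in open, so the io import is not strictly needed there; this is equivalent:
# In Python 3, io.open IS the built-in open, so the import can be dropped
with open(filename, 'r', encoding='utf8') as f:
    text = f.read()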
I am guessing you are using Windows. When you open a file without specifying an encoding, you get the platform default, which on most Western-locale Windows systems is Windows-1252 (cp1252). Specify the encoding when you open the file:
with open("file.txt","w", encoding="UTF-8") as filename:
filename.write(r.text)
with open("file.txt", "r", encoding="UTF-8") as filename:
mystring = filename.readline()
print(mystring)
That works as you expect regardless of platform.
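If you want to see which default your platform would have used, this check (a sketch, not part of the original answer) prints the encoding that open() falls back to when none is given:
# Show the encoding used when open() gets no encoding argument
import locale
print(locale.getpreferredencoding(False))  # e.g. 'cp1252' on many Windows setups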
I'm attempting to extract article information using the python newspaper3k package and then write to a CSV file. While the info is downloaded correctly, I'm having issues with the output to CSV. I don't think I fully understand unicode, despite my efforts to read about it.
from newspaper import Article, Source
import csv
first_article = Article(url="http://www.bloomberg.com/news/articles/2016-09-07/asian-stock-futures-deviate-as-s-p-500-ends-flat-crude-tops-46")
first_article.download()
if first_article.is_downloaded:
    first_article.parse()
    first_article.nlp()  # nlp is a method; without the parentheses it never runs

article_array = []
collate = {}
collate['title'] = first_article.title
collate['content'] = first_article.text
collate['keywords'] = first_article.keywords
collate['url'] = first_article.url
collate['summary'] = first_article.summary
print(collate['content'])
article_array.append(collate)

keys = article_array[0].keys()

with open('bloombergtest.csv', 'w') as output_file:
    csv_writer = csv.DictWriter(output_file, keys)
    csv_writer.writeheader()
    csv_writer.writerows(article_array)
When I print collate['content'], which is first_article.text, the console outputs the article's content just fine; everything shows up correctly, apostrophes and all. But when I write to the CSV, the content cell has odd characters in it. For example:
“At the end of the day, Europe’s economy isn’t in great shape, inflation doesn’t look exciting and there are a bunch of political risks to reckon with.
So far I have tried:
with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file:
to no avail. I also tried utf-16 instead of utf-8, but that just resulted in the cells being written in an odd order: the output text looked correct, but the cells weren't laid out correctly in the CSV. I've also tried .encode('utf-8') on various variables, but nothing has worked.
What's going on? Why would the console print the text correctly, while the CSV file has odd characters? How can I fix this?
Add encoding='utf-8-sig' to open(). Excel requires the UTF-8-encoded BOM (Byte Order Mark, U+FEFF) signature at the start of a file to interpret it as UTF-8; otherwise it assumes the default localized encoding.
Changing with open('bloombergtest.csv', 'w', encoding='utf-8') as output_file: to with open('bloombergtest.csv', 'w', encoding='utf-8-sig') as output_file: worked, as recommended by Leon and Mark Tolonen.
That's most probably a problem with the software that you use to open or print the CSV file - it doesn't "understand" that CSV is encoded in UTF-8 and assumes ASCII, latin-1, ISO-8859-1 or a similar encoding for it.
You can aid that software in recognizing the CSV file's encoding by placing a BOM sequence in the beginning of your file (which, in general, is not recommended for UTF-8).
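To make the difference concrete, here is a small sketch (an illustration, with a made-up file name) showing that 'utf-8-sig' writes the three BOM bytes while plain 'utf-8' does not; newline='' is the csv module's documented recommendation when opening the output file:
# 'utf-8-sig' prepends the BOM bytes EF BB BF; plain 'utf-8' would not
with open('bom_demo.csv', 'w', encoding='utf-8-sig', newline='') as f:
    f.write('title,content\n')
print(open('bom_demo.csv', 'rb').read()[:3])  # b'\xef\xbb\xbf'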
I'm having a problem understanding why my Python program does what it does when reading (first) lines from files and adding the lines to a list. For some reason the first line has to be empty, or it won't be read correctly; and even when it is empty, Python doesn't treat it as empty.
The thing is, I have two types of files:
First file is in the form:
text:more text
another text:and more
and the second file in the form:
text_file.txt
anothertext_file.txt
Both files are UTF-8 encoded text files. The first line of each file ("text" and "text_file.txt" respectively) gets added to a list in my program, but any code that, for example, tries to say
if something == "text":
    ...
will not get executed, even when "something" is the same as "text".
So I'm assuming that somewhere along the way my computer writes some invisible code at the beginning of the text file, and that makes the first line not what it appears to be. Maybe? I have actually found a workaround for the problem, simply by adding an empty line and an if clause when reading the file line by line:
if not "." in line:
    ...
and in the other filetype:
if not ":" in line:
    ...
Those if clauses work and my program does what it's supposed to (as long as I always add an empty line to the beginning of the file), but I haven't been able to find the real reason why my program behaves this way. I would also like to avoid this kind of workaround if there's an easier solution that doesn't involve editing all my files and adding if clauses to my code.
Would appreciate any help understanding what's happening here!
Edit: as you people have been asking for my code, here it is:
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))
This does not work properly. Also I tried it like mxds said,
filelist = []
with open("filename.txt", "r", encoding="UTF-8") as f:
    lines = f.readlines()
for line in lines:
    filelist.append(line.rstrip("\n"))
and this does not work either. The problem affects only the first character of the first line of each file.
Edit2:
It seems the problem is a Byte Order Mark at the beginning of my text files. After some quick googling I didn't find out how to remove it. I'm creating my files with plain Windows Notepad.
Final edit:
Apparently Notepad is not a real text editor. I guess I'll just swap from Notepad to Notepad++ to avoid this problem. However, just in case I still have to handle my files in Notepad: if I open a text file in Notepad and add some text to it, will it add a BOM, or does it only do that when the file is created?
Looks like you've already done the legwork on this, but according to How to make Notepad to save text in UTF-8 without BOM?, the best answer is not to use Notepad (but Notepad++ is ok). :)
Alternatively, you can strip the BOM in Python by decoding with the "utf-8-sig" codec (here line is a bytes object):
line = line.decode("utf-8-sig").encode("utf-8")
See https://docs.python.org/3/library/codecs.html:
To increase the reliability with which a UTF-8 encoding can be
detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls
"utf-8-sig") for its Notepad program: Before any of the Unicode
characters is written to the file, a UTF-8 encoded BOM (which looks
like this as a byte sequence: 0xef, 0xbb, 0xbf) is written.
...
On decoding utf-8-sig will skip those three bytes if they appear as the first three bytes in the file. In UTF-8, the use of the BOM is discouraged and should generally be avoided.
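Applied to the code from the question, that boils down to one changed argument (a sketch, reusing the question's file name):
filelist = []
# "utf-8-sig" decodes like UTF-8 but silently skips a leading BOM if present
with open("filename.txt", "r", encoding="utf-8-sig") as f:
    for line in f:
        filelist.append(line.rstrip("\n"))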
A classic approach to reading text files in Python is:
with open(fname, 'r') as f:
    lines = f.readlines()
After which you can process the lines like this:
for line in lines:
    # do something with line...
As other comments have hinted, you may want to make sure this works first. It would help if you post your current code for review.
I just had a similar issue: Python's readlines() reported invalid characters at the head of the first line (the BOM). I tried every suggestion I could google, with no luck.
I came up with a simple trick: add a blank line as the first line of the text file, then skip every line that is no longer than that first line:
if len(line[i]) > len(line[0]):
    # do things
else:
    # skip this line
In my case, len(line[0]) was 4 (the BOM characters plus the newline), and all other lines were longer than 4.
I'll preface this by indicating I'm using Python 2.7.3 (x64) on Windows 7, with lxml 2.3.6.
I have a little, odd problem I'm hoping somebody can help with. I haven't found a solution online; perhaps I'm not searching for the right thing.
Anyway, I have a problem where I'm programmatically building some XML with lxml, then outputting this to a text file. The problem is that lxml is converting carriage returns to the text &#13;, almost like URL encoding - but I'm not using HTML, I'm using XML.
For example, I have a simple text file created in Notepad, like this:
This
is
my
text
I then build some xml and add this text into the xml:
from lxml import etree
textstr = ""
fh = open("mytext.txt", "rb")
for line in fh:
    textstr += line
root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)
The problem here is the output of the print looks like this:
<root><some_element>This&#13;
is&#13;
my&#13;
text</some_element></root>
For my purposes the line breaks are fine, but the &#13; entities are not.
What I have been able to figure out is that this is happening because I'm opening the text file in binary mode "rb" (which I actually need to do, as my app is indexing a large text file). If I don't open the file in binary mode "r", then the output does not contain &#13; (but of course, then my indexing doesn't work).
I've also tried changing the etree.tostring to:
print etree.tostring(root, method="xml")
However there is no difference in the output.
Now, I CAN dump the XML to a string and then replace the &#13; artifacts; however, I was hoping for a more elegant solution, because the text files I parse are not under my control and I'm worried that other parts of the text file might be converted to this style of encoding without my knowledge.
Does anyone know a way of preventing this encoding from happening?
Windows uses \r\n to represent a line ending, Unix uses \n.
This will remove the \r at the end of each line, if there is one (so the code also works with Unix text files). It removes at most one \r per line, so any \r elsewhere in a line is preserved.
import re
textstr = ""
with open("mytext.txt", "rb") as fh:
for line in fh:
textstr += re.sub(r'\r$', '', line)
print(repr(textstr))
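Note that under Python 3 (the question uses 2.7), a file opened in "rb" yields bytes, so the pattern and replacement must be bytes as well. A sketch of the same logic:
# The same \r-stripping under Python 3, where "rb" mode yields bytes
import re

textstr = b""
with open("mytext.txt", "rb") as fh:
    for line in fh:
        textstr += re.sub(rb'\r$', b'', line)
print(repr(textstr))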
Whenever I try to open a .csv file with the python command
fread = open('input.csv', 'r')
it always opens the file with spaces between every single character. I'm guessing it's something wrong with the text file because I can open other text files with the same command and they are loaded correctly. Does anyone know why a text file would load like this in python?
Thanks.
Update
Ok, I got it with the help of Jarret Hardie's post. This is the code that I used to convert the file to ASCII:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
mytext = mytext.encode('ascii', 'ignore')
fwrite = open('input-ascii.csv', 'wb')
fwrite.write(mytext)
Thanks!
The post by recursive is probably right... the contents of the file are likely encoded with a multi-byte charset. If this is, in fact, the case, you can likely read the file in Python itself without having to convert it outside of Python first.
Try something like:
fread = open('input.csv', 'rb').read()
mytext = fread.decode('utf-16')
The 'b' flag ensures the file is read as binary data. You'll need to know (or guess) the original encoding... in this example, I've used utf-16, but YMMV. This will convert the file to unicode. If you truly have a file with multi-byte chars, I don't recommend converting it to ascii as you may end up losing a lot of the characters in the process.
EDIT: Thanks for uploading the file. There are two bytes at the front of the file which indicate that it does, indeed, use a wide charset. If you're curious, open the file in a hex editor as some have suggested... you'll see something in the text version like 'I.D.|.' (etc). The dot is the extra byte for each char.
The code snippet above seems to work on my machine with that file.
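If you would rather detect the encoding than guess it, a BOM check along these lines (a sketch using the codecs BOM constants, not part of the original answer) can help:
# Use the leading BOM, if any, to pick a codec before decoding
import codecs

raw = open('input.csv', 'rb').read()
if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
    mytext = raw.decode('utf-16')     # the utf-16 codec strips the BOM itself
elif raw.startswith(codecs.BOM_UTF8):
    mytext = raw.decode('utf-8-sig')  # utf-8-sig skips the UTF-8 BOM
else:
    mytext = raw.decode('utf-8')      # assumed fallback when no BOM is found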
The file is encoded in some unicode encoding, but you are reading it as ascii. Try to convert the file to ascii before using it in python.
Isn't CSV just a plain text file with values separated by commas? Try opening it with a text editor to see if the file is correctly formed.
To read an encoded file, you can simply replace open with codecs.open.
import codecs
fread = codecs.open('input.csv', 'r', 'utf-16')
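The object it returns reads like a normal file, already decoded; for example (a sketch):
# Iterate over the decoded lines just as with a regular file object
for line in fread:
    print(line.rstrip('\n'))
fread.close()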
It never occurred to me, but as truppo said, it must be something wrong with the file.
Try opening the file in Excel/BrOffice Calc and saving it as CSV again.
If the problem persists, try a subset of the data: first 10 / last 10 / intermediate 10 lines of the file.
Open the file in binary mode, 'rb'. Check it in a hex editor and look for null padding ('00'). Open the file in something like the Scintilla Text Editor to check which characters are present in the file.
Here's the quick and easy way, especially if Python won't parse the input correctly:
sed 's/ \(.\)/\1/g'
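For reference, a rough Python equivalent of that sed command (a sketch; it drops the space shown in front of each character, which is how the extra padding bytes often render):
# Rough Python equivalent of: sed 's/ \(.\)/\1/g'
import re

with open('input.csv', 'r') as f:
    cleaned = re.sub(r' (.)', r'\1', f.read())
print(cleaned)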