How to write Chinese characters to a file in Python

I'm walking through a directory and want to write all file names into a file. Here's the piece of code:
import os
with open("c:/Users/me/filename.txt", "a") as d:
    for dir, subdirs, files in os.walk("c:/temp"):
        for f in files:
            fname = os.path.join(dir, f)
            print fname
            d.write(fname + "\n")
d.close()
The problem I have is, there are some files that are named in Chinese characters. By using print, I can see the file name correctly in console, but in the target file, it's just a mess... I've tried to open the file like open(u"c:/Users/me/filename.txt", "a"), but it did not work. I also tried to write fname.decode("utf-16"), still does not work...

In Python 2, it's a good idea to use codecs.open() if you're dealing with encodings other than ASCII. That way, you don't need to manually encode everything you write. Also, os.walk() should be passed a Unicode string if you're expecting non-ASCII characters in the filenames:
import codecs
import os
with codecs.open("c:/Users/me/filename.txt", "a", encoding="utf-8") as d:
    for dir, subdirs, files in os.walk(u"c:/temp"):
        for f in files:
            fname = os.path.join(dir, f)
            print fname
            d.write(fname + "\n")
No need to call d.close(), the with block already takes care of that.

Encode fname before you write it to the file. This assumes fname is a unicode object, for example because os.walk() was given a Unicode path:
d.write(fname.encode('utf8') + '\n')

The key is to tell python to prepare the file for being used in "utf-8" format. I wonder why python doesn't assume utf-8 by default. Anyway, try the following:
with open("c:/Users/me/filename.txt", "a", encoding='utf-8') as d:
    for dir, subdirs, files in os.walk("c:/temp"):
        ...
I am using Python 3.5. Be aware that the built-in open() in Python 2.7 does not accept the "encoding" argument (io.open() and codecs.open() do). But the idea is to tell Python the encoding in advance, rather than fighting with the encoding of each string later.
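A minimal Python 3 sketch of the complete loop, assuming the output file lives outside the directory being walked. The function name write_file_list is hypothetical and only used for illustration:

```python
import os


def write_file_list(root, out_path):
    """Append the full path of every file under root to out_path as UTF-8.

    Returns the number of paths written.
    """
    count = 0
    # Declaring the encoding once means every write is encoded for us.
    with open(out_path, "a", encoding="utf-8") as out:
        for dirpath, _subdirs, files in os.walk(root):
            for name in files:
                out.write(os.path.join(dirpath, name) + "\n")
                count += 1
    return count
```

With the paths from the question, this would be called as write_file_list("c:/temp", "c:/Users/me/filename.txt").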

To successfully write Chinese characters in Python 2, you have to do the following:
Open the file using the codecs library, which lets you provide the encoding parameter, and set it to a Unicode encoding such as UTF-8.
Write the string as a unicode object rather than as a byte string.
The corrected code would be the following:
import codecs
import os
with codecs.open("c:/Users/me/filename.txt", "a", encoding='utf-8') as d:
    for dir, subdirs, files in os.walk("c:/temp"):
        for f in files:
            fname = os.path.join(dir, f)
            print fname
            # decode() assumes the byte-string names are UTF-8 encoded
            d.write(fname.decode('utf-8') + "\n")
Note
The same problem does not exist in python 3 so you should also consider making your script python 3 compatible.

with open("xyz.xml", "w", encoding='utf-8-sig') as f:
worked for me.
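The "utf-8-sig" codec writes a UTF-8 byte order mark (BOM) at the start of the file, which some Windows programs use to detect UTF-8. A small sketch that shows the effect; the helper names are hypothetical:

```python
import codecs


def write_utf8_with_bom(path, text):
    """Write text to path as UTF-8, preceded by a byte order mark."""
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(text)


def starts_with_bom(path):
    """Return True if the file begins with the three UTF-8 BOM bytes."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8
```

Reading the file back with encoding="utf-8-sig" strips the BOM again, whereas plain "utf-8" would leave it as the first character of the text.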

Related

How to write a file if the filename contains /?

I need to loop over some stock tickers and write one output file per ticker:
without changing the filename
without the computer thinking that it's the wrong path because of the /
What's the solution? I tried:
for ticker in ['MSFT', 'BF/B', 'AAPL']:
    file_name = "{}_prices.csv".format(ticker)
    with open("/Users/my_mac/download/{}".format(file_name), "w") as f:
        f.write('whatever')
The above code doesn't work because Python treats BF as a directory and B_prices.csv as the filename. I asked ChatGPT for suggestions, but it could not give me a working solution. It told me to:
use the os.path.join function to join the components of the file path
encode the file name using urllib.parse.quote() function, which will encode the special characters in the file name, including the slash /
replace the '/' character with an underscore ('_') or any other character that is a valid filename character.
and other methods that don't work.
I can't change the filename because it's the stock ticker; it has to stay the same so that other code can run on it afterwards.
ChatGPT also suggested the code below, but it didn't work either.
file_name = "{}_prices.csv".format(ticker)
file_path = "/Users/my_mac/download/{}".format(file_name)
directory = os.path.dirname(file_path)
if not os.path.exists(directory):
    os.makedirs(directory)
with open(file_path, "w") as f:
    f.write(response.text)
What's the solution?
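For context: / is the path separator on macOS and Linux and cannot appear inside a single file name, so the name on disk has to differ from the ticker in some way. A minimal sketch of the percent-encoding suggestion listed above, assuming the later code can decode the file name back to the ticker. The helper names ticker_to_filename and filename_to_ticker are hypothetical:

```python
from urllib.parse import quote, unquote

SUFFIX = "_prices.csv"


def ticker_to_filename(ticker):
    """Percent-encode the ticker so it is safe as a single file name."""
    # safe="" makes quote() encode "/" as well (it is left alone by default).
    return quote(ticker, safe="") + SUFFIX


def filename_to_ticker(file_name):
    """Recover the original ticker from a name made by ticker_to_filename."""
    if not file_name.endswith(SUFFIX):
        raise ValueError("unexpected file name: {}".format(file_name))
    return unquote(file_name[:-len(SUFFIX)])
```

The mapping is reversible, so the ticker itself stays unchanged in the program and only its on-disk spelling differs.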

I can't write to a file in python when using the absolute path of the file

I have created a script to write to a file in python:
a_file = open("file:///C:/Users/xdo/OneDrive/Desktop/Javascript/read%20and%20write/testfileTryToOVERWRITEME.txt", "a+")
a_file.write("hello")
The absolute path of the file is: file:///C:/Users/xdo/OneDrive/Desktop/Javascript/read%20and%20write/testfileTryToOVERWRITEME.txt
However, the program does not write(append) to the file. I can run the program, but nothing happens to the file. The strange thing is that it works if I put the file in the same directory as the script and run the script using the location "testfileTryToOVERWRITEME.txt". That is:
a_file= open("testfileTryToOVERWRITEME.txt", "a+")
a_file.write("hello")
This works 100% and appends to the file. But when I use the absolute path of the file, it never works. What is wrong?
Edit
I tried everything and it still doesn't work
My code:
a_file= open("C://Users//xdo//OneDrive//Desktop//Javascript//read%20and%20write//testfileTryToOVERWRITEME.txt", "a+")
a_file.write("hello")
a_file.close()
This did not work. I also tried:
a_file= open("C:/Users/xdo/OneDrive/Desktop/Javascript/read%20and%20write/testfileTryToOVERWRITEME.txt", "a+")
a_file.write("hello")
a_file.close()
This did not work
Edit (finally works)
It finally works. I replaced the "%20" with a regular space " " and used the pathlib module like this:
from pathlib import Path
filename = Path("C:/Users/qqWha/OneDrive/Desktop/Javascript/read and write/testfileTryToOVERWRITEME.txt")
f = open(filename, 'a+')
f.write("Hello")
And now it writes to the file.
It also works using "with". Like this:
with open("c:/users/xdo/OneDrive/Desktop/Javascript/read and write/testfileTryToOVERWRITEME.txt", "a+") as file:
    file.write("hello")
Try using "with". Also, replace the %20 with a space. Python's open() does not decode URL escapes such as %20, and spaces in a path are not a problem, as in the example below.
with open("c:/users/xdo/OneDrive/Desktop/Javascript/read and write/testfile.txt", "a+") as file:
    file.write("hello")
In this case, if the file doesn't exist, it will create it. The only thing that would stop this is if there are permissions issues.
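If the starting point is a file:/// URL copied from a browser, it can be converted to a path that open() understands. A rough sketch under the assumption that the URL has no host part; the function name file_url_to_path is hypothetical:

```python
from urllib.parse import unquote, urlparse


def file_url_to_path(url):
    """Turn a file:/// URL into a plain path with literal spaces."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError("not a file URL: {}".format(url))
    # unquote() turns %20 back into a space (and other escapes likewise).
    path = unquote(parsed.path)
    # A Windows drive path arrives as "/C:/..."; drop the leading slash.
    if len(path) >= 3 and path[0] == "/" and path[2] == ":":
        path = path[1:]
    return path
```

The result can then be passed to open() or to pathlib.Path.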
You do not need the file:/// prefix, and open() does not decode %20, so write the spaces literally. Forward slashes work in Windows paths; doubled slashes are generally tolerated but unnecessary.
f = open('C:/Users/xdo/OneDrive/Desktop/Javascript/read and write/testfileTryToOVERWRITEME.txt', 'a+')
f.write("writing some text")
f.close()
Alternatively, you can use the pathlib module:
from pathlib import Path
filename = Path("C:/Users/xdo/OneDrive/Desktop/Javascript/read and write/testfileTryToOVERWRITEME.txt")
f = open(filename, 'a+')
f.write("Hello")
f.close()
If the problem still exists, try another absolute path such as "C:/Users/xdo/OneDrive/Desktop/testfileTryToOVERWRITEME.txt".

Read all the text files in a folder and change a character in a string if it presents

I have a folder of CSV-formatted documents with a .arw extension. The files are named 1.arw, 2.arw, 3.arw, etc.
I would like to write code that reads all the files, replaces every forward slash / with a dash -, and finally creates new files with the replaced character.
The code I wrote as follows:
for i in range(1,6):
    my_file=open("/path/"+str(i)+".arw", "r+")
    str=my_file.read()
    if "/" not in str:
        print("There is no forwardslash")
    else:
        str_new = str.replace("/","-")
        print(str_new)
        f = open("/path/new"+str(i)+".arw", "w")
        f.write(str_new)
    my_file.close()
But I get an error saying:
'str' object is not callable.
How can I make it work for all the files in a folder? Apparently my for loop does not work.
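The actual error is that you are replacing the built-in str with your own variable of the same name, and then trying to call the built-in str() after that. The first iteration works, but on the second iteration str(i) calls your string instead of the built-in.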
Simply renaming the variable fixes the immediate problem, but you really want to refactor the code to avoid reading the entire file into memory.
import logging
import os
for i in range(1, 6):
    seen_slash = False
    input_filename = "/path/" + str(i) + ".arw"
    output_filename = "/path/new" + str(i) + ".arw"
    # Stream line by line instead of reading the whole file into memory
    with open(input_filename, "r") as infile, open(output_filename, "w") as outfile:
        for line in infile:
            if not seen_slash and "/" in line:
                seen_slash = True
            line_new = line.replace("/", "-")
            print(line_new.rstrip('\n'))  # don't duplicate newline
            outfile.write(line_new)
    if not seen_slash:
        logging.warning("{0}: No slash found".format(input_filename))
        os.unlink(output_filename)
Using logging instead of print for error messages helps because you keep standard output (the print output) separate from the diagnostics (the logging output). Notice also how the diagnostic message includes the name of the file we found the problem in.
Going back and deleting the output filename when you have examined the entire input file and not found any slashes is a mild wart, but should typically be more efficient.
This is how I would do it:
for i in range(1, 6):
    with open(str(i) + '.arw', 'r') as f:
        data = f.readlines()
    # str.replace returns a new string, so keep the result
    data = [element.replace('/', '-') for element in data]
    with open(str(i) + '.arw', 'w') as f:
        for element in data:
            f.write(element)
This assumes from your post that you know how many files you have (range(1, 6) covers 1.arw through 5.arw). Note that it overwrites the original files.
If you don't know how many files you have, you can use the os module to find the files in the directory.
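A sketch of that last idea, using pathlib rather than os to discover the files. It assumes the files are UTF-8 text and that output files get a "new" prefix as in the question; the function name replace_slashes_in_folder is hypothetical:

```python
from pathlib import Path


def replace_slashes_in_folder(folder, pattern="*.arw"):
    """Write a 'new'-prefixed copy of each matching file that contains '/'.

    Returns the list of files written.
    """
    written = []
    for src in sorted(Path(folder).glob(pattern)):
        if src.name.startswith("new"):
            continue  # skip output files from an earlier run
        text = src.read_text(encoding="utf-8")
        if "/" not in text:
            continue  # nothing to replace, so no new file is created
        dst = src.with_name("new" + src.name)
        dst.write_text(text.replace("/", "-"), encoding="utf-8")
        written.append(dst)
    return written
```

Unlike the streaming version above, this reads each file fully into memory, which is simpler but only reasonable for small files.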

Python directory crawler to scan all kinds of files and search for keyword

I am trying to create a directory crawler to search for specific keywords in all files inside a folder and all its subfolders. This is what I have so far (in this case I am looking for keyword 'olofx'):
import os
rootDir = os.getcwd()
def scan_file(filename, dirname):
    print(os.path.join(dirname,filename))
    contains = False
    if("olofx" in filename):
        contains = True
    else:
        with open(os.path.join(dirname,filename)) as f:
            lines = f.readlines()
            for l in lines:
                #print(l)
                if("olofx" in l):
                    contains = True
                    break
    if contains:
        print("yes")
for dirName, subdirList, fileList in os.walk(rootDir):
    for fname in fileList:
        scan_file(fname, dirName)
The problem is that when I reach one of my sample Excel files, the characters are unreadable.
Here is some of the output for the Excel file:
;���+͋�۳�L���P!�/��KdocProps/core.xml �(���_K�0���C�{�v�9Cہʞ
n(���v
6H�ݾ�i���|Lι��sI���:��VJ' �#1ͅ�h�^�s9O��VP�8�(//r���6`��r���7c�v ���
I have worked with openpyxl and I know I can use it to read Excel files, but I want one script that reads all kinds of files: Word, Excel, PDF, etc. Is there any way to represent files' contents regardless of file type?
Thank you
Your code assumes that the content of your files is available as plain text.
Unfortunately, for many file types this is not the case. Office documents (.docx, .xlsx) are basically XML documents inside a ZIP archive. That means the text content is stored compressed, so when you read the file bytes as plain text, the content is not recognisable.
You will need the necessary tools to interpret each of your file types correctly. There are libraries for this. One that I found is https://textract.readthedocs.io/en/stable/ but I have no experience with it.
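For the ZIP-based Office formats specifically, the standard library can look inside the archive. A rough sketch, assuming a raw byte search of the XML parts is good enough; it can miss keywords that the document splits across several XML elements, and it does nothing for PDF or the older .doc/.xls formats. The function name zip_document_contains is hypothetical:

```python
import zipfile


def zip_document_contains(path, keyword):
    """Return True if any XML part of a ZIP-based document contains keyword.

    path may be a file name or a binary file-like object.
    """
    needle = keyword.encode("utf-8")
    with zipfile.ZipFile(path) as archive:
        for name in archive.namelist():
            # archive.read() returns the decompressed bytes of one member.
            if name.endswith(".xml") and needle in archive.read(name):
                return True
    return False
```

zipfile.is_zipfile(path) can be used beforehand to decide whether a file should go through this function or be scanned as plain text.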
It seems that your script is saved with a different encoding than your files, which are likely UTF-8 encoded.
Try adding the following lines at the very beginning of your script:
#!/usr/bin/env python
#-*- coding: utf-8 -*-
You may also check the following answer: Character Encoding, XML, Excel, python

Cannot open filename that has umlaute in python 2.7 & django 1.9

I am trying to go through every file in a directory, but my code crashes every time it meets a file that has an umlaut in its name, like ä.txt.
The shortened code:
import codecs
import os
for filename in os.listdir(WATCH_DIRECTORY):
    with codecs.open(filename, 'rb', 'utf-8') as rawdata:
        data = rawdata.readline()
        # ...
And then I get this:
IOError: [Errno 2] No such file or directory: '\xc3\xa4.txt'
I've tried to encode/decode the filename variable with .encode('utf-8'), .decode('utf-8') and both combined. This usually leads to "ascii cannot decode blah blah"
I also tried unicode(filename) with and without encode/decode.
Soooo, kinda stuck here :)
You are opening relative filenames; you need to make them absolute paths.
This has nothing really to do with encodings; both Unicode strings and byte strings will work, especially when sourced from os.listdir().
However, os.listdir() produces just the base filename, not a path, so add that back in:
for filename in os.listdir(WATCH_DIRECTORY):
    fullpath = os.path.join(WATCH_DIRECTORY, filename)
    with codecs.open(fullpath, 'rb', 'utf-8') as rawdata:
        # ...
By the way, I recommend you use the io.open() function rather than codecs.open(). The io module is the new Python 3 I/O framework, and is a lot more robust than the older codecs module.
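Put together with io.open(), the loop might look like the sketch below. It assumes the files are UTF-8 text; the function name read_first_lines is hypothetical:

```python
import io
import os


def read_first_lines(directory):
    """Map each file name in directory to the first line of that file."""
    first_lines = {}
    for filename in os.listdir(directory):
        # listdir() returns bare names, so join them back onto the directory.
        fullpath = os.path.join(directory, filename)
        if not os.path.isfile(fullpath):
            continue  # skip subdirectories
        with io.open(fullpath, "r", encoding="utf-8") as rawdata:
            first_lines[filename] = rawdata.readline()
    return first_lines
```

In Python 2, passing a unicode directory (for example u"watch") makes os.listdir() return unicode names, which avoids mixing byte strings and Unicode later on.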
