Open multiple files from a file - python

I need to open a file that contains multiple absolute file paths.
EX:
Layer 1 = C:\User\Files\Menu\Menu.snt
Layer 2 = C:\User\Files\N0 - Vertical.snt
The problem is that when I try to open C:\User\Files\Menu\Menu.snt python doesn't like \U or \N
I could open using r"C:\User\Files\Menu\Menu.snt" but I can't automate this process.
file = open("config.txt", "r").read()
list = []
for line in file.split("\n"):
    list.append(open(line.split("=",1)[1]).read())
It prints out:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 33-34: malformed \N character escape

The backslash character \ is used as an escape character by the Python interpreter inside string literals, in order to represent special characters.
For example, \n is a "New Line" character, like you would get from pressing the Return key on your keyboard.
So if you are trying to read something like newFolder1\newFolder2, the interpreter reads it as:
newFolder1
ewFolder2
where the New Line character has been inserted between the two lines of text.
You already mentioned one workaround: using raw strings like r'my\folder\structure' and I'm a little curious why this can't be automated.
If you can automate it, you could try replacing all instances of a single backslash (\) with a double backslash (\\) in your file paths and that should work.
Alternatively, you can try looking in the os module and dynamically building your paths using os.path.join(), along with the os.sep constant.
One final point: if you want each file as a list of lines rather than one long string, you can replace:
list.append(open(line.split("=",1)[1]).read())
by
list.append(open(line.split("=",1)[1]).readlines())

Here is my solution:
file = open("config.txt", "r").readlines()
list = [open(x.split("=")[1].strip(), 'r').read() for x in file]
readlines() creates a list that contains all the lines in the file, so there is no need to split the whole string.
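One point worth adding: backslash escapes are only processed in string literals written in Python source code. Text read from a file at run time is left exactly as it is, so the paths stored in config.txt need neither an r prefix nor doubled backslashes. Below is a minimal sketch, assuming every useful line has the form "name = path". The helper name read_layer_paths is made up for illustration, and the demo writes a throwaway config file instead of touching real paths.

```python
import os
import tempfile


def read_layer_paths(config_path):
    """Return {name: path} for every 'name = path' line in the config file."""
    paths = {}
    with open(config_path, "r") as config:
        for line in config:
            if "=" not in line:
                continue  # skip blank or malformed lines
            name, path = line.split("=", 1)
            paths[name.strip()] = path.strip()
    return paths


# Demo with a temporary config file. The Windows paths are only text here;
# they are parsed but never opened.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.txt")
    with open(config_path, "w") as f:
        f.write("Layer 1 = C:\\User\\Files\\Menu\\Menu.snt\n")
        f.write("Layer 2 = C:\\User\\Files\\N0 - Vertical.snt\n")
    layers = read_layer_paths(config_path)

print(layers["Layer 2"])  # C:\User\Files\N0 - Vertical.snt
```

Each value in the returned dictionary can then be passed straight to open().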

Related

Trouble with opening docx file, seems to be Unicode issue

I am new to Python and this is my first small project.
I am having trouble passing a file path to open a Word document. I copied and pasted the path from my command prompt, but the error below appears when I use it. How do I convert the command prompt to UTF-8 or find the directory in Unicode?
#After importing necessary modules for the project, I access the file
from docx import Document
import pandas as pd
import docx
doc = Document('C:\Users\trisy\OneDrive\Desktop\classes\SP_22_courses\CS1110\pye_files\kw_txt.docx')
#Error message
doc = Document('C:\Users\xxx\OneDrive\Desktop\classes\SP_22_courses\xxx\pye_files\kw_txt.docx')
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
The problem is caused by the backslashes in that pathname, combined with certain other characters.
In Python, putting \x in a string can have special behavior depending on what x is.
For example, \n does not mean "backslash n"; it means a newline character.
\U is one of these special cases.
To get around this, you have two options:
Use "raw strings": put an r before the string, as in r'C:\Users\...'. The r tells Python that backslashes should have no special meaning.
Use forward slashes in the file path, as in 'C:/Users/...'. These work even on Windows.
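The two options can be compared side by side in a short sketch. The path below is a made-up example, and PureWindowsPath is used only so the comparison behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# Hypothetical path, written three ways.
raw = r"C:\Users\someone\notes.txt"        # raw string: backslashes kept as-is
escaped = "C:\\Users\\someone\\notes.txt"  # every backslash doubled
forward = "C:/Users/someone/notes.txt"     # forward slashes

print(raw == escaped)  # True
print(PureWindowsPath(forward) == PureWindowsPath(raw))  # True
```

The raw and escaped spellings produce the identical string, and the forward-slash spelling names the same Windows path.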

chdir modifying the path in Python

I've got a program that reads strings with special characters (used in Spanish) from a file. I then use chdir to change to a directory whose name is that string.
For example in a file called "names.txt" I got the following
Tableta
Música
.
.
etc
That file is encoded in utf-8 so that I read it from python as follows
f=open("names.txt","r",encoding="utf-8")
names=f.readlines()
f.close()
And it does read all successfully
print(names)
output:
['Tableta\n','Música\n', ...etc]
Problem arises when I want to change to the first directory (the first name 'Tableta',without the newline character)
chdir(names[0][:-1])
I get the following error
FileNotFoundError: [WinError 2] The system cannot find the file specified: "\ufeffTableta"
And it does only happen with the first name, which was very odd to me. With other names it is able to change directories whether they have special characters or not
I supposed it had to do something with the encoding, because of that '\ufeff' extra added character. So I changed the "names.txt" file to ANSI coding and deleted all the special characters so that I could read it with python, and it worked. But thing is that I need to have that file in utf-8 coding so that I can read special characters. Is there any way I can fix this? Why is python adding that '\ufeff' character to the string and only with the first name?
Your file "names.txt" has a Byte Order Mark (BOM). To remove it, open the file with the following encoding:
f = open("names.txt", encoding="utf-8-sig")
As a side note, it is safer to strip a file name: names[0].strip() instead of names[0][:-1].
The beginning of your file has the unicode BOM. Skip first character when reading your file or open it with utf-8-sig encoding.
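The difference between the two encodings can be seen in a small sketch. It assumes the file starts with a UTF-8 BOM, and it writes a temporary copy of names.txt so nothing real is touched.

```python
import codecs
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "names.txt")
    # Write the BOM first, then the UTF-8 text.
    with open(path, "wb") as f:
        f.write(codecs.BOM_UTF8 + "Tableta\nMúsica\n".encode("utf-8"))

    # Plain utf-8 keeps the BOM as the first character of the first line.
    with open(path, encoding="utf-8") as f:
        plain = [line.strip() for line in f]
    # utf-8-sig drops the BOM if it is present.
    with open(path, encoding="utf-8-sig") as f:
        stripped = [line.strip() for line in f]

print(repr(plain[0]))     # '\ufeffTableta'
print(repr(stripped[0]))  # 'Tableta'
```

The BOM appears only once, at the very start of the file, which is why only the first name is affected. Note that strip() does not remove it, because '\ufeff' does not count as whitespace.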

python csv files read and upload

I want to upload all the csv files that meet certain condition in a directory to a database. But I encounter an error at the beginning of my code.
mypath = "D:\user\01367564\Project Coordinator\Database Trying\all data csv"
csv_name_reg = r'^[0-9]{11}_HKG_[0-9]{14}_v2-0.csv$'
The error is below
File "D:\user\01367564\Project Coordinator\Database Trying\Upload_CA_Manifest.py", line 9
mypath = "D:\user\01367564\Project Coordinator\Database Trying\all data csv"
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \uXXXX escape
Can you help me? Thank you.
Python reads the \u in "D:\user" as the start of a \uXXXX Unicode escape, so your path looks like it is meant to contain a Unicode character. Please note that on Windows you have three options for paths:
Raw strings
mypath = r"D:\user\01367564\Project Coordinator\Database Trying\all data csv"
Escaped backslashes
mypath = "D:\\user\\01367564\\Project Coordinator\\Database Trying\\all data csv"
Forward slashes
mypath = "D:/user/01367564/Project Coordinator/Database Trying/all data csv"
In Python, a backslash escape is a "\" inside a string followed by one or more characters.
Some notable ones are "\n" and "\t", which are newline and tab. A backslash followed by a character that is not a recognized escape is left in the string unchanged, backslash included. "\\" becomes a single "\" in the resulting string, as you can see when you print it.
The escape Python thinks you're using is the Unicode escape "\uXXXX", triggered by the "\u" in "D:\user". To fix this, all you need to do is replace each backslash with a double backslash "\\". So this string will work: "D:\\user\\01367564\\Project Coordinator\\Database Trying\\all data csv"
For a full list of Python Backslash Escapes look at the Python Docs.
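In this particular path, \u is only the first problem. The first three octal digits in \01367564 and the \a in \all are valid escapes too, so in a normal string literal they would be replaced silently rather than reported. A short sketch of what happens to those pieces:

```python
# "\013" is an octal escape (vertical tab) and "\a" is the bell character,
# so neither keeps its backslash in a normal string literal.
print(repr("\01367564"))      # '\x0b67564'
print(repr("\all data csv"))  # '\x07ll data csv'

# A raw string keeps every backslash of the original path.
raw = r"D:\user\01367564\Project Coordinator\Database Trying\all data csv"
print(raw.count("\\"))  # 5
```

This is why doubling only the backslash in front of "user" would not be enough for this path.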

Python - Must add r when opening a file

I have several .py files and I can open my file everywhere, except in my test.py file (I test scripts and functions there) instead of this:
file = open("C:\Users\User\Desktop\key_values.txt", "r")
I need to use this (with r) to avoid error:
file = open(r"C:\Users\User\Desktop\key_values.txt", "r")
I get this error: (when I try to open a file without r in my test.py script)
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Any idea why this is happening?
Backslash is an escape character, so you can include characters like "\n" (new line) and "\t" (tab). The r before the string means "my backslashes are not escape characters".
Interestingly, it looks like your string "C:\Users\User\Desktop\key_values.txt" works OK in Python 2 because none of the backslashes are part of anything looking like a known escape sequence. But in Python 3, "\UXXXXXXXX" (a \U followed by eight hex digits) indicates a Unicode character, and "\Users" does not fit that pattern. So maybe that is why some of your Python files can cope and some can't.
The other answers are OK, but this is a time-saving trick:
Try using slashes instead of backslashes:
file = open("C:/Users/User/Desktop/key_values.txt", "r")
It works in Windows. Tried with Python 2.7
Hope this helps

Removing unknown characters from a text file

I have a large number of files containing data I am trying to process using a Python script.
The files are in an unknown encoding, and if I open them in Notepad++ they contain numerical data separated by a load of 'null' characters (represented as NULL in white on black background in Notepad++).
In order to handle this, I separate the file by the null character \x00 and retrieve only numerical values using the following script:
import os

stripped_data=[]
for root,dirs,files in os.walk(PATH):
    for rawfile in files:
        (dirName, fileName)= os.path.split(rawfile)
        (fileBaseName, fileExtension)=os.path.splitext(fileName)
        h=open(os.path.join(root, rawfile),'r')
        line=h.read()
        for raw_value in line.split('\x00'):
            try:
                test=float(raw_value)
                stripped_data.append(raw_value.strip())
            except ValueError:
                pass
However, there are sometimes other unrecognised characters in the file (as far as I have found, always at the very beginning) - these show up in Notepad++ as 'EOT', 'SUB' and 'ETX'. They seem to interfere with the processing of the file in Python - the file appears to end at those characters, even though there is clearly more data visible in Notepad++.
How can I remove all non-ASCII characters from these files prior to processing?
You are opening the file in text mode. On Windows, that means the first Ctrl-Z character (0x1A, which Notepad++ shows as SUB) is treated as an end-of-file marker. Specify 'rb' instead of 'r' in open().
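A minimal sketch of the binary-mode approach, assuming the values are separated by NUL bytes as described in the question. The function name read_numbers and the sample bytes are made up for illustration.

```python
import os
import tempfile


def read_numbers(path):
    """Return the float values found between NUL bytes in a file."""
    with open(path, "rb") as handle:
        data = handle.read()  # binary mode: control bytes do not end the read
    values = []
    for chunk in data.split(b"\x00"):
        # Drop any non-ASCII bytes before trying to parse the chunk.
        text = chunk.decode("ascii", errors="ignore").strip()
        try:
            values.append(float(text))
        except ValueError:
            pass  # empty chunk or control characters such as EOT/SUB/ETX
    return values


# Demo: leading EOT, SUB and ETX bytes, then NUL-separated numbers.
payload = b"\x04\x1a\x03" + b"\x00" + b"\x00\x00".join([b"1.5", b"-2", b"42"])
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.dat")
    with open(sample, "wb") as f:
        f.write(payload)
    numbers = read_numbers(sample)

print(numbers)  # [1.5, -2.0, 42.0]
```

Because the file is read as bytes, the SUB byte at the start is just another chunk that fails to parse, and the rest of the data is still reached.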
I don't know if this will work for sure, but you could try using the I/O functions in the codecs module:
import codecs
inFile = codecs.open(<SAME ARGS AS 'OPEN'>, 'utf-8')
for line in inFile:
    do_stuff()
You can treat the inFile just like a normal file object.
This may or may not help you, but it probably will.
[EDIT]
Basically you'll replace: h=open(os.path.join(root, rawfile),'r') with h=codecs.open(os.path.join(root, rawfile),'r', 'utf-8')
The file.read() function will read until EOF.
As you said it stops too early you want to continue reading the file even when hitting an EOF.
Make sure to stop when you have read the entire file. You can do this by checking the position in the file via file.tell() when hitting an EOF and stopping when you hit the file-size (read file-size prior to reading).
As this is rather complex, you may want to iterate over the bytes of the file instead.
To remove non-ASCII characters, you can either use a whitelist of specific characters or check each byte you read against a range you define.
E.g. is the byte between 0x30 and 0x39 (an ASCII digit)? Then keep it, save it somewhere, or add it to a string.
See an ASCII table.
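The whitelist idea can be sketched as follows. The set of allowed bytes is an assumption (digits plus the characters that can appear in a decimal number), and keep_allowed is a made-up helper name.

```python
# Bytes allowed in a decimal number: digits 0x30-0x39 plus sign, point, exponent.
ALLOWED = set(b"0123456789+-.eE")


def keep_allowed(data, allowed=ALLOWED):
    """Return only the bytes of `data` that appear in `allowed`."""
    return bytes(b for b in data if b in allowed)


# Demo: EOT, SUB and ETX bytes in front of a value, NUL byte after it.
raw = b"\x04\x1a\x03" + b"12.5" + b"\x00"
print(keep_allowed(raw))  # b'12.5'
```

If the NUL separators are still needed to split the values afterwards, the byte 0x00 can be added to the allowed set.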
