Problem reading text files without extensions in python

Problem reading text files without extensions in python - python

I have written a piece of a code which is supposed to read the texts inside several files which are located in a directory. These files are basically text files but they do not have any extensions.But my code is not able to read them:
corpus_path = 'Reviews/'
for infile in glob.glob(os.path.join(corpus_path,'*.*')):
review_file = open(infile,'r').read()
print review_file
To test if this code works, I put a dummy text file, dummy.txt. which worked because it has extension. But i don't know what should be done so files without the extensions could be read.
can someone help me? Thanks

Glob patterns don't work the same way as wildcards on the Windows platform. Just use * instead of *.*. i.e. os.path.join(corpus_path,'*'). Note that * will match every file in the directory - if that's not what you want then you can revise the pattern accordingly.
See the glob module documentation for more details.

Just use * instead of *.*.
The latter requires an extension to be present (more precisely, there needs to be a dot in the filename), the former doesn't.

You could search for * instead of *.*, but this will match every file in your directory.
Fundamentally, this means that you will have to handle cases where the file you are opening is not a text file.

it seems that you need
from os import listdir
from filename in ( fn for fn in listdir(corpus_path) if '.' not in fn):
# do something
you could write
from os import listdir
for fn in listdir(corpus_path):
if '.' not in fn:
# do something
but the former with a generator spares one indentation level

Related

prevent getfiles from seeing .DS and other hidden files

I am currently working on a python project on my macintosh, and from time to time I get unexpected errors, because .DS or other files, which are not visible to the "not-root-user" are found in folders. I am using the following command
filenames = getfiles.getfiles(file_directory)
to retreive information about the amount and name of the files in a folder. So I was wondering, if there is a possibility to prevent the getfiles command to see these types of files, by for example limiting its right or the extensions which it can see (all files are of .txt format)
Many thanks in advance!

In your case, I would recommend you switch to the Python standard library glob.
In your case, if all files are of .txt format, and they are located in the directory of /sample/directory/, you can use the following script to get the list of files you wanted.
from glob import glob
filenames = glob.glob("/sample/directory/*.txt")
You can easily use regular expressions to match files and filter out files you do not need. More details can be found from Here.
Keep in mind that with regular expression, you could do much more complicated pattern matching than the above example to handle your future needs.
Another good example of using glob to glob multiple extensions can be found from Here.
If you only want to get the basenames of those files, you can always use standard library os to extract basenames from the full paths.
import os
file_basenames = [os.path.basename(full_path) for full_path in filenames]

There isn't an option to filter within getfiles, but you could filter the list after.
Most-likely you will want to skip all "dot files" ("system files", those with a leading .), which you can accomplish with code like the following.
filenames = [f for f in ['./.a', './b'] if not os.path.basename(f).startswith('.')]

Welcome to Stackoverflow.
You might find the glob module useful. The glob.glob function takes a path including wildcards and returns a list of the filenames that match.
This would allow you to either select the files you want, like
filenames = glob.glob(os.path.join(file_directory, "*.txt")
Alternatively, select the files you don't want, and ignore them:
exclude_files = glob.glob(os.path.join(file_directory, ".*"))
for filename in getfiles.getfiles(file_directory):
if filename in exclude_files:
continue
# process the file

How can I read files with similar names on python, rename them and then work with them?

I've already posted here with the same question but I sadly I couldn't come up with a solution (even though some of you guys gave me awesome answers but most of them weren't what I was looking for), so I'll try again and this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them to more than one thing I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is to use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files I want to rename them as I don't want to use them like './path/etc'.
So what I wanted was to write a loop that could get these files and rename them inside the script so I can use these files with the new name in other functions (outside the loop).
So instead of having to this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
dict = {}
for filename in os.listdir('.'):
if fnmatch.fnmatch(filename, 'thing*.txt'):
examples[filename[:6]] = filename
This does work but I can't use the dictionary key outside the loop.

If I understand correctly, you try to fetch files with similar names (at least a re-occurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in files:
new_path = create_new_path(file) # possibly split the file name, change directory and/or filename
os.rename(file, new_path)
The glob library allows for searching files with * wildcards and makes it hence possible to search for files with a specific pattern. It lists all the files in a certain directory (or multiple directories if you include a * wildcard as a directory). When you iterate over the files, you could either directly work with the input of the files (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you would need to generate a new path - so you would have to write the create_new_path function that takes the old path and creates a new one.

Since python 3.4 you should be using the built-in pathlib package instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
file_dest = str(file_src.resolve()).replace("ds", "d_")
shutil.move(file_src, file_dest)

you can use
import os
path='.....' # path where these files are located
path1='.....' ## path where you want these files to store
i=1
for file in os.listdir(path):
if file.endswith(end='.txt'):
os.rename(path + "/" + file, path1 + "/"+str(i) + ".txt")
i+=1
it will rename all the txt file in the source folder to 1,2,3,....n.txt

Need to read every file of directory for particular word around 172 directories,we have done for single directory

Here is the below code we have developed for single directory of files
from os import listdir
with open("/user/results.txt", "w") as f:
for filename in listdir("/user/stream"):
with open('/user/stream/' + filename) as currentFile:
text = currentFile.read()
if 'checksum' in text:
f.write('current word in ' + filename[:-4] + '\n')
else:
f.write('NOT ' + filename[:-4] + '\n')
I want loop for all directories
Thanks in advance

If you're using UNIX you can use grep:
grep "checksum" -R /user/stream
The -R flag allows for a recursive search inside the directory, following the symbolic links if there are any.

My suggestion is to use glob.
The glob module allows you to work with files. In the Unix universe, a directory is / should be a file so it should be able to help you with your task.
More over, you don't have to install anything, glob comes with python.
Note: For the following code, you will need python3.5 or greater
This should help you out.
import os
import glob
for path in glob.glob('/ai2/data/prod/admin/inf/**', recursive=True):
# At some point, `path` will be `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error`
if not os.path.isdir(path):
# Check the `id` of the file
# Do things with the file
# If there are files inside `/ai2/data/prod/admin/inf/inf_<$APP>_pvt/error` you will be able to access them here
What glob.glob does is, it Return a possibly-empty list of path names that match pathname. In this case, it will match every file (including directories) in /user/stream/. If these files are not directories, you can do whatever you want with them.
I hope this will help you!
Clarification
Regarding your 3 point comment attempting to clarify the question, especially this part we need to put appi dynamically in that path then we need to read all files inside that directory
No, you do not need to do this. Please read my answer carefully and please read glob documentation.
In this case, it will match every file (including directories) in /user/stream/
If you replace /user/stream/ with /ai2/data/prod/admin/inf/, you will have access to every file in /ai2/data/prod/admin/inf/. Assuming your app ids are 1, 2, 3, this means, you will have access to the following files.
/ai2/data/prod/admin/inf/inf_1_pvt/error
/ai2/data/prod/admin/inf/inf_2_pvt/error
/ai2/data/prod/admin/inf/inf_3_pvt/error
You do not have to specify the id, because you will be iterating over all files. If you do need the id, you can just extract it from the path.
If everything looks like this, /ai2/data/prod/admin/inf/inf_<$APP>_pvt/error, you can get the id by removing /ai2/data/prod/admin/inf/ and taking everything until you encounter _.

Python operating on files in a folder - 'for file in folder'

I know a folder's path, and for every file in the folder I would like to do some operations. So essentially what I'm looking for is a for file in folder type of code that gives me access to the files in variables.
What is the Python way of doing this?
Thanks
EDIT - example: my folder will contain a bunch of XML files, and I have a python routine already to parse them into variables I need.

This will allow you to access and print all the file names in your current directory:
import os
for filename in os.listdir('.'):
print filename
The os module contains much more information about the various functions available. The os.listdir() function can also take any other paths you want to specify.

Does the glob library look helpful?
It will perform some pattern matching, and accepts both absolute and relative addresses.
>>> import glob
>>> for file in glob.glob("*.xml"): # only loops over XML documents
print file

For people coming at this from a python version 3.5 or later, we now have the superior os.scandir() which has tremendous performance improvements over os.listdir()
For more information about the improvements/benefits, check out https://benhoyt.com/writings/scandir/

How can I find path to given file?

I have a file, for example "something.exe" and I want to find path to this file
How can I do this in python?

Perhaps os.path.abspath() would do it:
import os
print os.path.abspath("something.exe")
If your something.exe is not in the current directory, you can pass any relative path and abspath() will resolve it.

use os.path.abspath to get a normalized absolutized version of the pathname
use os.walk to get it's location
import os
exe = 'something.exe'
#if the exe just in current dir
print os.path.abspath(exe)
# output
# D:\python\note\something.exe
#if we need find it first
for root, dirs, files in os.walk(r'D:\python'):
for name in files:
if name == exe:
print os.path.abspath(os.path.join(root, name))
# output
# D:\python\note\something.exe

if you absolutely do not know where it is, the only way is to find it starting from root c:\
import os
for r,d,f in os.walk("c:\\"):
for files in f:
if files == "something.exe":
print os.path.join(r,files)
else, if you know that there are only few places you store you exe, like your system32, then start finding it from there. you can also make use of os.environ["PATH"] if you always put your .exe in one of those directories in your PATH variable.
for p in os.environ["PATH"].split(";"):
for r,d,f in os.walk(p):
for files in f:
if files == "something.exe":
print os.path.join(r,files)

Just to mention, another option to achieve this task could be the subprocess module, to help us execute a command in terminal, like this:
import subprocess
command = "find"
directory = "/Possible/path/"
flag = "-iname"
file = "something.foo"
args = [command, directory, flag, file]
process = subprocess.run(args, stdout=subprocess.PIPE)
path = process.stdout.decode().strip("\n")
print(path)
With this we emulate passing the following command to the Terminal:
find /Posible/path -iname "something.foo".
After that, given that the attribute stdout is binary string, we need to decode it, and remove the trailing "\n" character.
I tested it with the %timeit magic in spyder, and the performance is 0.3 seconds slower than the os.walk() option.
I noted that you are in Windows, so you may search for a command that behaves similar to find in Unix.
Finally, if you have several files with the same name in different directories, the resulting string will contain all of them. In consequence, you need to deal with that appropriately, maybe using regular expressions.

This is really old thread, but might be useful to someone who stumbles across this. In python 3, there is a module called "glob" which takes "egrep" style search strings and returns system appropriate pathing (i.e. Unix\Linux and Windows).
https://docs.python.org/3/library/glob.html
Example usage would be:
results = glob.glob('./**/FILE_NAME')
Then you get a list of matches in the result variable.

Uh... This question is a bit unclear.
What do you mean "have"? Do you have the name of the file? Have you opened it? Is it a file object? Is it a file descriptor? What???
If it's a name, what do you mean with "find"? Do you want to search for the file in a bunch of directories? Or do you know which directory it's in?
If it is a file object, then you must have opened it, reasonably, and then you know the path already, although you can get the filename from fileob.name too.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.