This question already has answers here and was closed 10 years ago as a possible duplicate.
In python, how can I check if a filename ends in '.html' or '_files'?
import os

path = '/Users/Marjan/Documents/Nothing/Costco'
print(path)
names = os.listdir(path)
print(len(names))
for name in names:
    print(name)
Here is the code I've been using; it lists all the names in this directory in the terminal. A few filenames in this folder (Costco) don't end with .html or _files, and I need to pick them out. The only issue is that there are over 2,500 filenames. I need help with code that will search through this path and pick out all the filenames that don't end with .html or _files. Thanks, guys.
for name in names:
    if name.endswith('.html') or name.endswith('_files'):
        continue
    # do stuff
Usually os.path.splitext() would be more appropriate if you needed the extension of a file, but in this case endswith() is perfectly fine.
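To see the difference, a minimal sketch of what os.path.splitext() returns: it only splits at the last dot, so a suffix like '_files' (which has no dot) would never be treated as an extension, which is why endswith() fits here.

```python
import os

# splitext splits at the last dot only
print(os.path.splitext('report.html'))   # ('report', '.html')

# '_files' contains no dot, so splitext reports no extension at all --
# endswith() is the right tool for that kind of suffix
print(os.path.splitext('report_files'))  # ('report_files', '')
```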
A little shorter than ThiefMaster's suggestion:
for name in [x for x in names if not x.endswith(('.html', '_files'))]:
    # do stuff
This question already has answers here:
glob exclude pattern
(12 answers)
Closed 3 years ago.
I'm using python glob.glob("*.json"). The script returns a list of json files, but after applying some operations it creates a new json file. If I run the same script again, glob picks up this new file in the list as well...
glob.glob("*.json")
Output:
['men_pro_desc_zalora.json',
'man_pro_desc_Zalando.json',
'man_pro_desc_nordstrom.json']
End of code:
with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
After running, the new file merged_file.json has been added, so if I run glob.glob("*.json") again it returns:
['men_pro_desc_zalora.json',
'man_pro_desc_Zalando.json',
'man_pro_desc_nordstrom.json',
'merged_file.json']
You can make the pattern less exclusive, as some comments mention, by doing something like glob.glob('*_*_*_*.json'). More details can be found at https://docs.python.org/3.5/library/glob.html#glob.glob.
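To see why a stricter pattern like '*_*_*_*.json' skips the merged file, you can check matching directly with fnmatch (the module glob uses for pattern matching); the filenames below are the ones from the question:

```python
import fnmatch

files = ['men_pro_desc_zalora.json',
         'man_pro_desc_Zalando.json',
         'man_pro_desc_nordstrom.json',
         'merged_file.json']

# '*_*_*_*.json' requires at least three underscores, which the
# scraped files have but 'merged_file.json' does not
kept = [f for f in files if fnmatch.fnmatch(f, '*_*_*_*.json')]
print(kept)
```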
This is never very clean, though, and since glob patterns aren't full regular expressions they aren't very expressive. Since ordering doesn't seem important here, you could instead do something like
excludedFiles = ['merged_file.json']
includedFiles = glob.glob('*.json')
# other code here
print(list(set(includedFiles) - set(excludedFiles)))
That answers your question; however, I think a better solution to your problem is to separate your raw data and generated files into different directories. That's generally good practice when you're doing ad hoc work with data.
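A minimal sketch of that layout (the 'raw' and 'output' directory names are made up for illustration): glob only over the input directory and write generated files elsewhere, so re-running the script can never pick up its own output.

```python
import glob
import json
import os

os.makedirs('raw', exist_ok=True)      # scraped input json lives here
os.makedirs('output', exist_ok=True)   # generated files live here

result = []
for path in glob.glob(os.path.join('raw', '*.json')):
    with open(path) as f:
        result.append(json.load(f))

# the merged file goes to 'output', outside the globbed directory
with open(os.path.join('output', 'merged_file.json'), 'w') as outfile:
    json.dump(result, outfile)
```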
If you want to remove only the latest file added, then you can try this code.
import os
import glob

jsonFiles = []
jsonPattern = os.path.join('*.json')
fileList = glob.glob(jsonPattern)
for file in fileList:
    jsonFiles.append(file)
print(jsonFiles)

latestFile = max(jsonFiles, key=os.path.getctime)
print(latestFile)

jsonFiles.remove(latestFile)
print(jsonFiles)
Output:
['man_pro_desc_nordstrom.json', 'man_pro_desc_Zalando.json', 'men_pro_desc_zalora.json', 'merged_file.json']
merged_file.json
['man_pro_desc_nordstrom.json', 'man_pro_desc_Zalando.json', 'men_pro_desc_zalora.json']
This question already has answers here:
How to list only top level directories in Python?
(21 answers)
Closed 2 years ago.
How can I get Python to output only directories via os.listdir, while specifying which directory to list via raw_input?
What I have:
import os

file_to_search = raw_input("which file to search?\n> ")

dirlist = []
for filename in os.listdir(file_to_search):
    if os.path.isdir(filename):
        dirlist.append(filename)
print(dirlist)
Now this actually works if I input (via raw_input) the current working directory. However, if I put in anything else, the list comes back empty. I tried to divide and conquer the problem, but individually every piece of code works as intended.
That's expected: os.listdir only returns the names of the files/dirs, not full paths, so os.path.isdir can't find those objects unless you're running in that very directory.
You have to join the scanned directory to each name to compute the full path for it to work:
for filename in os.listdir(file_to_search):
    if os.path.isdir(os.path.join(file_to_search, filename)):
        dirlist.append(filename)
Note the list comprehension version:
dirlist = [filename for filename in os.listdir(file_to_search) if os.path.isdir(os.path.join(file_to_search, filename))]
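On Python 3 the same filter can also be written with pathlib, which joins the paths for you; a sketch (the helper name is made up), since Path.iterdir yields full Path objects on which is_dir works directly:

```python
from pathlib import Path

def list_subdirs(root):
    """Return the names of the immediate subdirectories of root."""
    return [p.name for p in Path(root).iterdir() if p.is_dir()]

print(list_subdirs('.'))
```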
This question already has answers here:
Extracting extension from filename in Python
(33 answers)
Getting file extension using pattern matching in python
(6 answers)
Closed 5 years ago.
I have this pattern:
dir1/dir2/.log.gz
dir1/dir2/a.log.gz
dir1/dir2/a.py
dir1/dir2/*.gzip.tar
I want to get the filename or path and the extension, e.g.:
(name,extension)=(dir1/dir2/,.log.gz)
(name,extension)=(dir1/dir2/a,.log.gz)
(name,extension)=(dir1/dir2/a,.py)
(name,extension)=(dir1/dir2/,.gzip.tar)
I tried:
re.findall(r'(.*).*\.?(.*)', path)
but it doesn't work perfectly.
If you just want the file's name and extension:
import os

# path = 'C:/Users/Me/some_file.tar.gz'
temp = os.path.splitext(path)
var = (os.path.basename(temp[0]), temp[1])
print(var)
# ('some_file.tar', '.gz')
It's worth noting that files with "dual" extensions need to be split recursively if you want all of them. For example, .tar.gz is a gzip file that happens to be an archive as well, but a single splitext call only peels off the final .gz.
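A sketch of that recursion (the helper name is made up): keep applying os.path.splitext until the extension comes back empty, then reassemble the pieces.

```python
import os

def split_all_exts(path):
    """Strip extensions one at a time until none are left."""
    exts = []
    root, ext = os.path.splitext(path)
    while ext:
        exts.append(ext)
        root, ext = os.path.splitext(root)
    return root, ''.join(reversed(exts))

print(split_all_exts('C:/Users/Me/some_file.tar.gz'))
# ('C:/Users/Me/some_file', '.tar.gz')
```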
General strategy: find the first '.'; everything before it is the path, everything after it is the extension.
def get_path_and_extension(filename):
    index = filename.find('.')
    return filename[:index], filename[index + 1:]
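Note that find() on the whole path misfires when a directory name contains a dot, and returns -1 when there is no dot at all. A sketch that avoids both problems (the function name here is made up) by splitting only the final path component:

```python
import os

def split_first_ext(filename):
    """Split at the first dot of the last path component."""
    head, tail = os.path.split(filename)
    index = tail.find('.')
    if index == -1:                      # no extension at all
        return filename, ''
    return os.path.join(head, tail[:index]), tail[index:]

print(split_first_ext('dir1/dir2/a.log.gz'))   # ('dir1/dir2/a', '.log.gz')
```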
This question already has answers here:
How can I check the extension of a file?
(14 answers)
Closed 5 years ago.
Is there a way in Python to check a file name to see if its extension is included in the name? My current workaround is simply to check whether the name contains a '.', and to add an extension if it doesn't... this obviously won't catch files with a '.' but no extension in the name (i.e. 12.10.13_file). Anyone have any ideas?
'12.10.13_file' as a filename does have '13_file' as its file extension, at least as far as the file system is concerned.
But, instead of finding the last . yourself, use os.path.splitext:
import os
fileName, fileExtension = os.path.splitext('/path/yourfile.ext')
# Results in:
# fileName = '/path/yourfile'
# fileExtension = '.ext'
If you want to exclude certain extensions, you could blacklist those after you've used the above.
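For example, a small filter on top of splitext (the blacklist entries here are arbitrary illustrations, not anything from the question):

```python
import os

BLACKLIST = {'.tmp', '.bak'}   # extensions to treat as "not real" -- examples only

def needs_extension(name):
    """True if the name has no extension, or only a blacklisted one."""
    ext = os.path.splitext(name)[1]
    return ext == '' or ext in BLACKLIST

print(needs_extension('notes.txt'))    # False
print(needs_extension('backup.bak'))   # True
```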
You can use libmagic via https://pypi.python.org/pypi/python-magic to determine "file types." It's not 100% perfect, but a whole lot of files can be accurately classified this way, and then you can decide your own rules, such as .txt for text files, .pdf for PDFs, etc.
Don't think in terms of finding files with or without extensions--think of it in terms of classifying your files based on their content, ignoring their current names.
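A stdlib-only sketch of that idea, checking a couple of well-known magic numbers by hand (python-magic covers vastly more types; the three signatures below are just common examples):

```python
# map a few well-known leading byte signatures to a label
SIGNATURES = {
    b'%PDF-': 'pdf',        # PDF files start with "%PDF-"
    b'\x1f\x8b': 'gzip',    # gzip magic bytes
    b'PK\x03\x04': 'zip',   # zip local file header
}

def classify(data):
    """Guess a file type from its first bytes, or 'unknown'."""
    for magic, label in SIGNATURES.items():
        if data.startswith(magic):
            return label
    return 'unknown'

print(classify(b'%PDF-1.7 ...'))   # pdf
```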
This question already has answers here:
Is there a built in function for string natural sort?
(23 answers)
Closed 9 years ago.
I have a number of files in a folder with names following the convention:
0.1.txt, 0.15.txt, 0.2.txt, 0.25.txt, 0.3.txt, ...
I need to read them one by one and manipulate the data inside them. Currently I open each file with the command:
import os

# This is the path where all the files are stored.
folder_path = '/home/user/some_folder/'

# Open one of the files,
for data_file in os.listdir(folder_path):
    ...
Unfortunately this reads the files in no particular order (not sure how it picks them), and I need to read them starting with the one whose filename is the smallest number, then the next larger one, and so on until the last one.
A simple example using sorted(), which returns a new sorted list:
import os

# This is the path where all the files are stored.
folder_path = 'c:\\'

# Open one of the files,
for data_file in sorted(os.listdir(folder_path)):
    print(data_file)
You can read more about sorted() in the Python docs.
Edit, for natural sorting: if you are looking for natural sorting, see this great post by @unutbu.
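For filenames like the ones in the question, where the whole stem is a number, a numeric sort key is enough; a sketch, assuming every name really has the form '<number>.txt':

```python
names = ['0.3.txt', '0.1.txt', '0.25.txt', '0.15.txt', '0.2.txt']

# strip the '.txt' suffix and compare the remainder as a float;
# this assumes every filename matches '<number>.txt'
names.sort(key=lambda name: float(name[:-len('.txt')]))
print(names)   # ['0.1.txt', '0.15.txt', '0.2.txt', '0.25.txt', '0.3.txt']
```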