How to exclude some files by name using glob.glob("")? [duplicate] - python

This question already has answers here:
glob exclude pattern
(12 answers)
Closed 3 years ago.
I'm using python glob.glob("*.json"). The script returns a list of JSON files, but after applying some operations it creates a new JSON file. If I run the same script again, it adds this new file to the list...
glob.glob("*.json")
Output:
['men_pro_desc_zalora.json',
'man_pro_desc_Zalando.json',
'man_pro_desc_nordstrom.json']
End of code:
with open("merged_file.json", "w") as outfile:
    json.dump(result, outfile)
After running, the file merged_file.json has been added, so if I run glob.glob("*.json") again it returns:
['men_pro_desc_zalora.json',
'man_pro_desc_Zalando.json',
'man_pro_desc_nordstrom.json',
'merged_file.json']

You can make the pattern more specific, as some comments mention, by doing something like glob.glob('*_*_*_*.json'). More details can be found here: https://docs.python.org/3.5/library/glob.html#glob.glob.
This isn't very clean, though, and since glob patterns aren't full regular expressions they aren't very expressive. Since ordering doesn't seem important, you could do something like
excludedFiles = ['merged_file.json']
includedFiles = glob.glob('*.json')
# other code here
print list(set(includedFiles) - set(excludedFiles))
That answers your question; however, I think a better solution to your problem is to separate your raw data and generated files into different directories. That's generally good practice when you're doing ad hoc work with data.
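For example, a minimal sketch of that layout, assuming hypothetical raw_data/ and output/ directories:
import glob
import json
import os

# Hypothetical layout: raw inputs live in raw_data/, generated files go to
# output/ (both directories assumed to exist).
result = []
for name in glob.glob(os.path.join('raw_data', '*.json')):
    with open(name) as fp:
        result.append(json.load(fp))

# The merged file lands in output/, so re-running the glob above never sees it.
with open(os.path.join('output', 'merged_file.json'), 'w') as outfile:
    json.dump(result, outfile)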

If you want to remove only the latest file added, then you can try this code.
import os
import glob

jsonFiles = []
jsonPattern = os.path.join('*.json')
fileList = glob.glob(jsonPattern)
for file in fileList:
    jsonFiles.append(file)
print jsonFiles

latestFile = max(jsonFiles, key=os.path.getctime)
print latestFile

jsonFiles.remove(latestFile)
print jsonFiles
Output:
['man_pro_desc_nordstrom.json', 'man_pro_desc_Zalando.json', 'men_pro_desc_zalora.json', 'merged_file.json']
merged_file.json
['man_pro_desc_nordstrom.json', 'man_pro_desc_Zalando.json', 'men_pro_desc_zalora.json']
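One caveat: os.path.getctime is creation time on Windows but inode change time on Unix. If you want "latest" to mean last modified on every platform, a sketch using os.path.getmtime instead:
import glob
import os

jsonFiles = glob.glob('*.json')
# getmtime (last modification time) means the same thing on all platforms.
latestFile = max(jsonFiles, key=os.path.getmtime)
jsonFiles.remove(latestFile)
print(jsonFiles)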

Related

How to search files with multiple extensions in a specific path using python [duplicate]

This question already has answers here:
Python glob multiple filetypes
(40 answers)
Closed 5 years ago.
I have written a small piece of Python code to read files with the .txt extension from a specific path. I would like to do the same thing to search for files with multiple extensions, more or less with the same code and only small modifications. Of course, I am not looking for a catch-all wildcard like *, which would match all extensions. Any help would be appreciated.
varlogpath = "C:/Users/vveldand/Office/Fetcher/Projects/LOG-PARSER/var/log/*.txt"
outputfile = open(wrfilename, "a")
files=glob.glob(varlogpath)
You could do something like this:
# put file extensions into a list
fileext = [".txt", ".log", ".csv"]
files = []
for ext in fileext:
    varlogpath = "C:/Users/vveldand/Office/Fetcher/Projects/LOG-PARSER/var/log/*" + ext
    # extend (rather than assign) so matches for earlier extensions are kept
    files.extend(glob.glob(varlogpath))
outputfile = open(wrfilename, "a")
Alternatively, with a list comprehension:
wrfilename = os.path.join(wrscriptpath, 'TiC_Timeline_Report.txt')
varlogpath = "C:/Users/vveldand/Office/Fetcher/Projects/LOG-PARSER/var/log/*.txt"
outputfile = open(wrfilename, "a")
files = [f for ext in list_of_extensions_you_want_to_search
         for f in glob.glob(varlogpath.replace(".txt", ext))]
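Another option is one glob per extension, flattened with itertools.chain; a sketch assuming a hypothetical extensions list:
import glob
import itertools
import os

log_dir = "C:/Users/vveldand/Office/Fetcher/Projects/LOG-PARSER/var/log"
extensions = [".txt", ".log", ".csv"]  # hypothetical list of extensions

# Chain one glob result per extension into a single flat list.
files = list(itertools.chain.from_iterable(
    glob.glob(os.path.join(log_dir, '*' + ext)) for ext in extensions))
print(files)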

Read files sequentially in order [duplicate]

This question already has answers here:
Is there a built in function for string natural sort?
(23 answers)
Closed 9 years ago.
I have a number of files in a folder with names following the convention:
0.1.txt, 0.15.txt, 0.2.txt, 0.25.txt, 0.3.txt, ...
I need to read them one by one and manipulate the data inside them. Currently I open each file with the command:
import os

# This is the path where all the files are stored.
folder_path = '/home/user/some_folder/'

# Open one of the files.
for data_file in os.listdir(folder_path):
    ...
Unfortunately this reads the files in no particular order (I'm not sure how it picks them), and I need to read them starting with the one whose filename is the smallest number, then the next larger one, and so on until the last one.
A simple example using sorted(), which returns a new sorted list:
import os
# This is the path where all the files are stored.
folder_path = 'c:\\'
# Open one of the files.
for data_file in sorted(os.listdir(folder_path)):
    print data_file
You can read more at the docs.
Edit for natural sorting:
If you are looking for natural sorting, you can see this great post by @unutbu.
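Since the filenames in this particular question are decimal numbers, sorting by the numeric value of the stem is an alternative to a general natural sort; a sketch, assuming every file in the folder follows that naming scheme:
import os

folder_path = '/home/user/some_folder/'

def numeric_key(name):
    # '0.15.txt' -> '0.15' -> 0.15, so 0.15.txt sorts between 0.1.txt and 0.2.txt.
    return float(os.path.splitext(name)[0])

for data_file in sorted(os.listdir(folder_path), key=numeric_key):
    print(data_file)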

How to run a script on all *.txt files in current directory? [duplicate]

This question already has answers here:
Find all files in a directory with extension .txt in Python
(25 answers)
Closed 9 years ago.
I am trying to run the script below on all *.txt files in the current directory. Currently it processes only the test.txt file and prints blocks of text based on a regular expression. What would be the quickest way of scanning the current directory for *.txt files and running the script on all of them? Also, how could I include the lines containing 'word1' and 'word3', since currently the script prints only the content between those two lines? I would like to print the whole block.
#!/usr/bin/env python
import os, re

file = 'test.txt'
with open(file) as fp:
    for result in re.findall('word1(.*?)word3', fp.read(), re.S):
        print result
I would appreciate any advice or suggestions on how to improve the code above, e.g. its speed when running on a large set of text files.
Thank you.
Use glob.glob:
import os, re
import glob

pattern = re.compile('word1(.*?)word3', flags=re.S)
for file in glob.glob('*.txt'):
    with open(file) as fp:
        for result in pattern.findall(fp.read()):
            print result
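To address the second part of the question (printing the whole block, including 'word1' and 'word3'), it should be enough to move the delimiters inside a capturing group; a sketch:
import glob
import re

# Capturing the delimiters themselves makes findall return the whole block.
pattern = re.compile('(word1.*?word3)', flags=re.S)
for name in glob.glob('*.txt'):
    with open(name) as fp:
        for result in pattern.findall(fp.read()):
            print(result)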
Inspired by the answer of falsetru, I rewrote my code to make it more generic.
Now the files to explore:
can be described either by a string passed as the second argument, which will be used by glob(),
or by a function written specifically for this purpose, in case the set of desired files can't be described with a glob pattern,
and may be in the current directory if no third argument is passed,
or in a specified directory if its path is passed as the third argument.
import re, glob
from itertools import ifilter
from os import getcwd, listdir, path
from inspect import isfunction

regx = re.compile('^[^\n]*word1.*?word3.*?$', re.S|re.M)

G = '\n\n'\
    'MWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMWMW\n'\
    'MWMWMW %s\n'\
    'MWMWMW %s\n'\
    '%s%s'

def search(REGX, how_to_find_files, dirpath='',
           G=G, sepm='\n======================\n'):
    if dirpath == '':
        dirpath = getcwd()
    if isfunction(how_to_find_files):
        # Join with dirpath so the isfile test and open() also work
        # when a directory other than the cwd is passed.
        gen = ifilter(how_to_find_files,
                      ifilter(path.isfile,
                              (path.join(dirpath, nm) for nm in listdir(dirpath))))
    elif isinstance(how_to_find_files, str):
        gen = glob.glob(path.join(dirpath, how_to_find_files))
    for fn in gen:
        with open(fn) as fp:
            found = REGX.findall(fp.read())
        if found:
            yield G % (dirpath, path.basename(fn),
                       sepm, sepm.join(found))

# Example of searching in .txt files

#============ one use ===================
def select(fn):
    return fn[-4:] == '.txt'

print ''.join(search(regx, select))

#============= another use ==============
print ''.join(search(regx, '*.txt'))
The advantage of chaining the processing of several files through a succession of generators is that the final ''.join() builds a single string that is written out in one go, whereas printing several individual strings one after the other takes longer because of the display interrupts between prints.

How can I stop Python's gzip from making sub directories? [duplicate]

This question already has answers here:
Python gzip folder structure when zipping single file
(4 answers)
Closed 9 years ago.
This is probably a simple mistake on my part, but I can't quite figure out how to compress a file without getting lots of sub-directories.
Here is how I am doing it:
import gzip

f_in = open(r'C:\cygwin\home\User\Stuff\MoreStuff\file.csv', 'r')
gzip_file_name = r'C:\cygwin\home\User\Stuff\MoreStuff\file.csv.gz'
f_out = gzip.open(gzip_file_name, 'w')
f_out.writelines(f_in)
f_out.close()
The problem is, when I decompress that .gz file, I don't get just the csv file, but rather a long chain of directories that finally end with the csv file.
e.g. cygwin\home\User\Stuff\MoreStuff\file.csv
My workaround looks a bit like this:
current_dir = os.getcwd()
os.chdir(r'C:\cygwin\home\User\Stuff\MoreStuff')
f_in = open('file.csv', 'r')
gzip_file_name = 'file.csv.gz'
f_out = gzip.open(gzip_file_name, 'w')
f_out.writelines(f_in)
f_out.close()
os.chdir(current_dir)
I don't know if it is a good idea to keep changing the current directory (especially since I might have multiple files to compress).
So, is there a way to not create those sub-directories? (I couldn't find anything that discusses this in the official docs.)
Note: I am using Windows, but I do need this to be portable. I am also using Python 2.4.
Thanks for your time.
Edit: I see the sub-directories when I open the compressed file in WinRAR or even in 7-Zip. If I do it with chdir, then I no longer see those sub-directories.
The link to the previous question that crayzeewulf provided worked just fine.
This is likely only a problem in older distributions of Python. According to that diff (also provided by crayzeewulf), this was changed in newer versions, so you likely won't be able to reproduce this problem in Python 2.7.
Thanks for your help everyone.
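For anyone stuck on an affected Python version, a sketch of a chdir-free workaround: open the output file yourself and tell gzip.GzipFile to record only the base name in the header (paths taken from the question; written without with-blocks so it also runs on Python 2.4):
import gzip
import os
import shutil

src = r'C:\cygwin\home\User\Stuff\MoreStuff\file.csv'

# Record only 'file.csv' in the gzip header, not the full path.
raw_out = open(src + '.gz', 'wb')
f_out = gzip.GzipFile(filename=os.path.basename(src), mode='wb', fileobj=raw_out)
f_in = open(src, 'rb')
shutil.copyfileobj(f_in, f_out)
f_in.close()
f_out.close()
raw_out.close()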

Python code, extracting extensions [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
In python, how can I check if a filename ends in '.html' or '_files'?
import os

path = '/Users/Marjan/Documents/Nothing/Costco'
print path

names = os.listdir(path)
print len(names)
for name in names:
    print name
Here is the code I've been using; it lists all the names in this directory in the terminal. A few of the filenames in this folder (Costco) don't end with .html or _files, and I need to pick them out. The only issue is that there are over 2,500 filenames. I need help with code that will search through this path and pick out all the filenames that don't end with .html or _files. Thanks guys
for name in names:
    if name.endswith('.html') or name.endswith('_files'):
        continue
    # do stuff
Usually os.path.splitext() would be more appropriate if you needed the extension of a file, but in this case endswith() is perfectly fine.
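For reference, a quick illustration of what os.path.splitext() returns, with a hypothetical filename:
import os

root, ext = os.path.splitext('report.html')  # hypothetical filename
print(root)  # report
print(ext)   # .html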
A little shorter than ThiefMaster's suggestion:
for name in [x for x in names if not x.endswith(('.html', '_files'))]:
    # do stuff
