I wrote a DataFrame to CSV in PySpark, and I got the following files in the output directory:
._SUCCESS.crc
.part-00000-6cbfdcfd-afff-4ded-802c-6ccd67f3804a-c000.csv.crc
part-00000-6cbfdcfd-afff-4ded-802c-6ccd67f3804a-c000.csv
How do I keep only the CSV file in the directory and delete the rest of the files, using Python?
import os
directory = "/path/to/directory/with/files"
files_in_directory = os.listdir(directory)
filtered_files = [file for file in files_in_directory if not file.endswith(".csv")]
for file in filtered_files:
    path_to_file = os.path.join(directory, file)
    os.remove(path_to_file)
First, you list all files in the directory. Then you keep only those that do not end with .csv. Finally, you remove the files left in that list.
Try iterating over the files in the directory, and then os.remove only those files that do not end with .csv.
import os
dir_path = "path/to/the/directory/containing/files"
dir_list = os.listdir(dir_path)
for item in dir_list:
    if not item.endswith(".csv"):
        os.remove(os.path.join(dir_path, item))
You can also have fun with a list comprehension for this:
import os
dir_path = 'output/'
[os.remove(os.path.join(dir_path, item)) for item in os.listdir(dir_path) if not item.endswith('.csv')]
I would recommend using pathlib (Python >= 3.4) and the built-in set type to subtract the CSV filenames from the set of all files. I would argue this is easy to read, fast to process, and a good Pythonic solution.
>>> from pathlib import Path
>>> p = Path('/path/to/directory/with/files')
>>> # Get all file names
>>> # https://stackoverflow.com/a/65025567/4865723
>>> set_all_files = set(filter(Path.is_file, p.glob('**/*')))
>>> # Get all csv filenames (BUT ONLY with lower case suffix!)
>>> set_csv_files = set(filter(Path.is_file, p.glob('**/*.csv')))
>>> # Create a file list without csv files
>>> set_files_to_delete = set_all_files - set_csv_files
>>> # Iterate over that set and delete the files
>>> for file_name in set_files_to_delete:
...     Path(file_name).unlink()
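Note that the glob pattern above only matches the lower-case suffix. If you also need to keep files ending in .CSV, a small variation (just a sketch) compares the suffix case-insensitively instead of globbing for it:
>>> set_csv_files = {f for f in set_all_files if f.suffix.lower() == '.csv'}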
import os

for (root, dirs, files) in os.walk('Test', topdown=True):
    for name in files:
        fp = os.path.join(root, name)
        if name.endswith(".csv"):
            pass
        else:
            os.remove(fp)
What is the advantage of os.walk? It also walks every subdirectory of the directory you pass in, not just the top level.
I want to get a list of the names of all PDF files in the folder that contains my Python script.
Now I have this code:
files = [f for f in os.listdir('.') if os.path.isfile(f)]
for f in files:
    e = (len(files) - 1)
The problem is that this code finds all files in the folder (including the .py script), so my "fix" was to make sure my script sorts last in the folder (zzzz.py) and then subtract the last entry of the list, which is my script.
I have tried many snippets to find only .pdf files, but this is the closest I have got.
Use the glob module:
>>> import glob
>>> glob.glob("*.pdf")
['308301003.pdf', 'Databricks-how-to-data-import.pdf', 'emr-dg.pdf', 'gfs-sosp2003.pdf']
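If the PDFs can also sit in subfolders, recursive globbing works too (assuming Python 3.5+):
>>> glob.glob("**/*.pdf", recursive=True)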
Use glob on the directory directly to find all your pdf files:
from os import path
from glob import glob
def find_ext(dr, ext):
    return glob(path.join(dr, "*.{}".format(ext)))
Demo:
In [2]: find_ext(".","py")
Out[2]:
['./server.py',
'./new.py',
'./ffmpeg_split.py',
'./clean_download.py',
'./bad_script.py',
'./test.py',
'./settings.py']
If you want the option of ignoring case:
from os import path
from glob import glob
def find_ext(dr, ext, ig_case=False):
    if ig_case:
        ext = "".join("[{}]".format(ch + ch.swapcase()) for ch in ext)
    return glob(path.join(dr, "*." + ext))
Demo:
In [4]: find_ext(".","py",True)
Out[4]:
['./server.py',
'./new.py',
'./ffmpeg_split.py',
'./clean_download.py',
'./bad_script.py',
'./test.py',
'./settings.py',
'./test.PY']
You can use endswith:
files = [f for f in os.listdir('.') if os.path.isfile(f) and f.endswith('.pdf')]
You simply need to filter the names of files, looking for the ones that end with ".pdf", right?
files = [f for f in os.listdir('.') if os.path.isfile(f)]
files = filter(lambda f: f.endswith(('.pdf','.PDF')), files)
Now your files variable contains only the names of files ending with .pdf or .PDF :)
To get all PDF files recursively:
import os
all_files = []
for dirpath, dirnames, filenames in os.walk("."):
    for filename in [f for f in filenames if f.endswith(".pdf")]:
        all_files.append(os.path.join(dirpath, filename))
You may also use the following,
files = filter(
    lambda f: os.path.isfile(f) and f.lower().endswith(".pdf"),
    os.listdir(".")
)
file_list = list(files)
Or, in one line:
list(filter(lambda f: os.path.isfile(f) and f.lower().endswith(".pdf"), os.listdir(".")))
You may or may not convert the filter object to a list with list(), depending on whether you need to iterate over it more than once.
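If you prefer pathlib, a small sketch in the same spirit as the pathlib answer to the previous question (the '.' directory is just a placeholder):
from pathlib import Path

pdfs = [p.name for p in Path('.').iterdir() if p.suffix.lower() == '.pdf']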
I want to merge files from multiple zip files that share a common prefix into a single zip file.
I have a folder "temp" containing some .zip files and some other files:
filename1_160645.zip
filename1_165056.zip
filename1_195326.zip
filename2_120528.zip
filename2_125518.zip
filename3_171518.zip
test.xlsx
filename19_161518.zip
I have the following dataframe df_filenames containing the filename prefixes:
filename_prefix
filename1
filename2
filename3
If there are multiple .zip files in the temp folder with the same prefix that exists in the dataframe df_filenames, I want to merge the contents of those files.
For example, filename1_160645.zip has the following contents:
1a.csv
1b.csv
and filename1_165056.zip contains the following:
1d.csv
and filename1_195326.zip contains the following:
1f.csv
After merging the contents of the other two files into filename1_160645.zip,
the contents of filename1_160645.zip will be:
1a.csv
1b.csv
1d.csv
1f.csv
At the end, only the following files will remain in the temp folder:
filename1_160645.zip
filename2_120528.zip
filename3_171518.zip
test.xlsx
filename19_161518.zip
I have written the following code, but it's not working:
import os
import zipfile as zf
import pandas as pd
df_filenames=pd.read_excel('filename_prefix.xlsx')
#Get the list of all the filenames in the temp folder
lst_fnames=os.listdir(r'C:\Users\XYZ\Downloads\temp')
#take only .zip files
lst_fnames=[fname for fname in lst_fnames if fname.endswith('.zip')]
#take distinct prefixes in the dataframe
df_prefixes=df_filenames['filename_prefix'].unique()
for prefix in df_prefixes:
    #this list will contain zip files with the same prefixes
    lst=[]
    #total count of files in the lst
    count=0
    for fname in lst_fnames:
        if prefix in fname:
            #print(prefix)
            lst.append(fname)
    #print(lst)
    #if the list has more than 1 zip files, merge them
    if len(lst)>1:
        print(lst)
        with zf.ZipFile(lst[0], 'a') as f1:
            print(f1.filename)
            for f in lst[1:]:
                with zf.ZipFile(path+'\\'+f, 'r') as f:
                    print(f.filename) #getting entire path of the file here, not just filename
                    [f1.writestr(t[0], t[1].read()) for t in ((n, f.open(n)) for n in f.namelist())]
            print(f1.namelist())
After merging the contents of the files whose names contain filename1 into filename1_160645.zip,
the contents of filename1_160645.zip should be:
1a.csv
1b.csv
1d.csv
1f.csv
but nothing has changed when I double-click filename1_160645.zip.
Basically, 1a.csv, 1b.csv, 1d.csv and 1f.csv are not part of filename1_160645.zip.
I would use shutil for a higher-level view when dealing with archive files. Additionally, pathlib gives nice methods/attributes for a given filepath. Combined with a groupby, we can easily extract the target files that are related to each other.
import itertools
import shutil
from pathlib import Path
import pandas as pd
filenames = pd.read_excel('filename_prefix.xlsx')
prefixes = filenames['filename_prefix'].unique()
path = Path.cwd() # or change to Path('path/to/desired/dir/')
zip_files = (file for file in path.iterdir() if file.suffix == '.zip')
target_files = sorted(file for file in zip_files
                      if any(file.stem.startswith(pre) for pre in prefixes))
file_groups = itertools.groupby(target_files, key=lambda x: x.stem.split('_')[0])

for _, group in file_groups:
    first, *rest = group
    if not rest:
        continue
    temp_dir = path / first.stem
    temp_dir.mkdir()
    shutil.unpack_archive(first, extract_dir=temp_dir)
    for item in rest:
        shutil.unpack_archive(item, extract_dir=temp_dir)
        item.unlink()
    shutil.make_archive(temp_dir, 'zip', temp_dir)
    shutil.rmtree(temp_dir)
I am trying to import multiple CSV files, and when I run the code below it does work:
allfiles = glob.glob('*.csv')
allfiles
However, this results in:
['file_0.csv',
'file_1.csv',
'file_10.csv',
'file_100.csv',
'file_101.csv',
...
]
As you can see, the files are not sorted numerically. What I want is for the numbers in the file names to be in ascending order:
['file_0.csv',
'file_1.csv',
'file_2.csv',
'file_3.csv',
...
]
How do I solve the problem?
This is another way to do it. It sorts by the length of the file name string:
import glob
all_files = glob.glob('*.csv')
def sort_with_length(file_name):
    return len(file_name)

new_files = sorted(all_files, key=sort_with_length)
print("Old files:")
print(all_files)
print("New files:")
print(new_files)
Sample output:
Old files:
['file1.csv', 'file101.csv', 'file102.csv', 'file2.csv', 'file201.csv', 'file3.csv']
New files:
['file1.csv', 'file2.csv', 'file3.csv', 'file101.csv', 'file102.csv', 'file201.csv']
allfiles = glob.glob('*.csv')
allfiles.sort(key= lambda x: int(x.split('_')[1].split('.')[0]))
You can't do that with glob alone; you need to sort the resulting files yourself, by the integer each file name contains:
import re

allfiles = glob.iglob('*.csv')
allfiles_sorted = sorted(allfiles, key=lambda x: int(re.search(r'\d+', x).group()))
Also note that I've used glob.iglob instead of glob.glob, since there is no need to build an intermediate list where an iterator does the job fine.
You can also check out natsort:
from natsort import natsorted
allfiles=natsorted(allfiles)
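For example, a quick sketch of the expected ordering:
>>> from natsort import natsorted
>>> natsorted(['file_10.csv', 'file_2.csv', 'file_1.csv'])
['file_1.csv', 'file_2.csv', 'file_10.csv']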
os.listdir() gives the list of files in the folder, and sorted will sort it:
import os
sortedlist = sorted(os.listdir())
EDIT: just specify key=len to sort by the length of each element:
sorted(os.listdir(), key=len)
-Root
--A
---1,2
--B
---3
I am trying to get a list of lists of paths based on subdirs:
[['Root/A/1','Root/A/2'],['Root/B/3']]
I tried using os.walk but I couldn't get it to work. I can get all the files in one giant list, but I can't split them by subdirectory:
fullList = []
for root, dirs, files in os.walk(dir):
    for name in files:
        fullList.append(os.path.join(root, name))
You want to have a list of lists, but you create a list of strings. You'll need to create each of the interior lists and put them all together into one master list.
This program might do what you want:
import os
from pprint import pprint
def return_list_of_paths(dir='.'):
    return [[os.path.join(root, file) for file in files]
            for root, dirs, files in os.walk(dir)
            if files]
pprint(return_list_of_paths("ROOT"))
Or, if you don't care for list comprehensions:
import os
from pprint import pprint
def return_list_of_paths(dir='.'):
    fullList = []
    for root, dirs, files in os.walk(dir):
        if files:
            oneList = []
            for file in files:
                oneList.append(os.path.join(root, file))
            fullList.append(oneList)
    return fullList
pprint(return_list_of_paths("ROOT"))
I need to iterate through a folder and find every instance where the filenames are identical (except for extension) and then zip (preferably using tarfile) each of these into one file.
So I have 5 files named "example1", each with a different file extension. I need to bundle them together and output them as "example1.tar" or something similar.
This would be easy enough with a simple for loop such as:
import tarfile
from glob import glob

tar = tarfile.open('example1.tar', "w")
for output in glob('example1*'):
    tar.add(output)
tar.close()
However, there are 300 "example" file groups, and I need to iterate through each one and its associated 5 files to make this work. This is way over my head. Any advice greatly appreciated.
The pattern you're describing generalizes to MapReduce. I found a simple implementation of MapReduce online; an even simpler version is:
def map_reduce(data, mapper, reducer):
    d = {}
    for elem in data:
        key, value = mapper(elem)
        d.setdefault(key, []).append(value)
    for key, grp in d.items():
        d[key] = reducer(key, grp)
    return d
You want to group all files by their name without the extension, which you can get from os.path.splitext(fname)[0]. Then, you want to make a tarball out of each group by using the tarfile module. In code, that is:
import os
import tarfile
def make_tar(basename, files):
    tar = tarfile.open(basename + '.tar', 'w')
    for f in files:
        tar.add(f)
    tar.close()

map_reduce(os.listdir('.'),
           lambda x: (os.path.splitext(x)[0], x),
           make_tar)
Edit: If you want to group files in different ways, you just need to modify the second argument to map_reduce. The code above groups files that have the same value for the expression os.path.splitext(x)[0]. So to group by the base file name with all the extensions stripped off, you could replace that expression with strip_all_ext(x) and add:
def strip_all_ext(path):
    head, tail = os.path.split(path)
    basename = tail.split(os.extsep)[0]
    return os.path.join(head, basename)
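The call would then look like this (a sketch reusing the make_tar reducer from above):
map_reduce(os.listdir('.'),
           lambda x: (strip_all_ext(x), x),
           make_tar)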
You could do this:
list all files in the directory
create a dictionary where the basename is the key and all the extensions are values
then tar all the files by dictionary key
Something like this:
import os
import tarfile
from collections import defaultdict
myfiles = os.listdir(".") # List of all files
totar = defaultdict(list)
# now fill the defaultdict with entries; basename as keys, extensions as values
for name in myfiles:
    base, ext = os.path.splitext(name)
    totar[base].append(ext)

# iterate through all the basenames
for base in totar:
    files = [base + ext for ext in totar[base]]
    # now tar all the files in the list "files"
    tar = tarfile.open(base + ".tar", "w")
    for item in files:
        tar.add(item)
    tar.close()
You have two problems. Solve them separately.
1. Finding matching names. Use a collections.defaultdict.
2. Creating tar files after you find the matching names. You've got that pretty well covered.
So, solve problem 1 first.
Use glob to get all the names. Use os.path.basename to split the path and basename. Use os.path.splitext to split the name and extension.
A dictionary of lists can be used to save all files that have the same name.
Is that what you're doing in part 1?
Part 2 is putting the files into tar archives. For that, you've got most of the code you need.
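A minimal sketch of part 1, assuming the files sit in the current directory and grouping them with collections.defaultdict:
import os
from collections import defaultdict
from glob import glob

groups = defaultdict(list)
for name in glob('*'):
    # split off the extension and use the remaining base name as the group key
    base, ext = os.path.splitext(os.path.basename(name))
    groups[base].append(name)

# groups now maps each base name to the list of files sharing it,
# ready to feed into the tar-writing loop from the question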
Try using the glob module: http://docs.python.org/library/glob.html
#! /usr/bin/env python
import os
import tarfile

tarfiles = {}
for f in os.listdir('files'):
    prefix = f[:f.rfind('.')]
    if prefix in tarfiles:
        tarfiles[prefix] += [f]
    else:
        tarfiles[prefix] = [f]

for k, v in tarfiles.items():
    tf = tarfile.open('%s.tar.gz' % k, 'w:gz')
    for f in v:
        # add each file from the 'files' directory under its bare name
        tf.add(os.path.join('files', f), arcname=f)
    tf.close()
import os
import tarfile
allfiles = {}
for filename in os.listdir("."):
    basename = '.'.join(filename.split(".")[:-1])
    if basename not in allfiles:
        allfiles[basename] = [filename]
    else:
        allfiles[basename].append(filename)

for basename, filenames in allfiles.items():
    if len(filenames) < 2:
        continue
    tardata = tarfile.open(basename + ".tar", "w")
    for filename in filenames:
        tardata.add(filename)
    tardata.close()