My goal is to append 9 Excel files together that exist in different directories. I have a directory tree with the following structure:
Big Folder
├── folder_1/
│   ├── file1.xls
│   ├── file2.xls
│   └── file3.xls
├── folder_2/
│   ├── file4.xls
│   ├── file5.xls
│   └── file6.xls
└── folder_3/
    ├── file7.xls
    ├── file8.xls
    └── file9.xls
I successfully wrote a loop that appends file1, file2, and file3 together within folder_1. My idea is to nest this loop inside another loop that iterates through each folder as a list. I'm currently trying to use os.walk to accomplish this, but am running into the following error in folder_1:
[Errno 2] No such file or directory
Do community members have recommendations on how to extend this loop to execute in each directory? Thanks!
It is hard for me to know how you have implemented the program without seeing some sort of code to work with; however, I believe you have misused the os.walk() method, so please read about it here.
I would use the os.walk() method the following way to get the paths to various files in the current directory and its subdirectories.
import os
all_files = [(path, files) for path, dirs, files in os.walk(".")]
and then get all the files that end with ".xls" like so
all_xls_files = [
    os.path.join(path, xls_file)
    for (path, xls_files_list) in all_files
    for xls_file in xls_files_list
    if xls_file.endswith(".xls")
]
this is equivalent to
all_xls_files = []
for (path, xls_files_list) in all_files:
    for xls_file in xls_files_list:
        if xls_file.endswith(".xls"):
            all_xls_files.append(os.path.join(path, xls_file))
Once you obtain a list of Excel files with their paths, you can open them like this:
with open("my_output_file", "w") as output_file:
    for file in all_xls_files:
        with open(file) as f:
            # Do your append here
            pass
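Since .xls workbooks are binary, the text-mode open() skeleton above may not combine them as expected. A hedged alternative sketch using pandas (assuming pandas plus an Excel engine are installed, that every file shares the same columns, and with combined.xlsx as a placeholder output name):

```python
import os

import pandas as pd

# Collect every .xls file under the current directory tree
xls_paths = [
    os.path.join(path, name)
    for path, dirs, names in os.walk(".")
    for name in names
    if name.endswith(".xls")
]

# Read each workbook and stack the rows into a single DataFrame
if xls_paths:
    combined = pd.concat(
        (pd.read_excel(p) for p in xls_paths), ignore_index=True
    )
    combined.to_excel("combined.xlsx", index=False)
```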
I have a file directory that looks something like this. I have a larger directory, but am showing this one just for explanation purposes:
.
├── a.txt
├── b.txt
├── foo
│   ├── w.txt
│   └── a.txt
└── moo
    ├── cool.csv
    ├── bad.csv
    └── more
        └── wow.csv
I want to write a recursive function to get year counts for the files in each subdirectory of this directory.
I want the code to basically check if it's a directory or file. If it's a directory then I want to call the function again and get counts until there's no more subdirectories.
I have the following code (which keeps breaking my kernel when I test it). There's probably a logic error as well, I would think:
import os
import pandas as pd

dir_path = 'S:\\Test'

def getFiles(dir_path):
    contents = os.listdir(dir_path)
    # check if content is directory or not
    for file in contents:
        if os.path.isdir(os.path.join(dir_path, file)):
            # get everything inside subdirectory
            getFiles(dir_path=os.path.join(dir_path, file))
        # it's a file
        else:
            # do something to get the year of the file and put it in a list or something
            pass
    # at the end create pandas data frame and return
Expected output would be a pandas dataframe that looks something like this..
Subdir 2020 2021 2022
foo 0 1 1
moo 0 2 0
more 1 0 0
How can I do this in Python?
EDIT:
Just realized os.walk() is probably extremely useful for my case here.
Trying to figure out a solution with os.walk() instead of doing it the long way..
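A sketch of the os.walk() route. One assumption here: "year" is taken from each file's modification time via os.path.getmtime, since the question doesn't say where the year comes from; the function name is made up.

```python
import os
import time

import pandas as pd

def year_counts(dir_path):
    # Map each subdirectory name to a {year: count} dict
    counts = {}
    for root, dirs, files in os.walk(dir_path):
        for name in files:
            year = time.localtime(
                os.path.getmtime(os.path.join(root, name))
            ).tm_year
            subdir = os.path.basename(root)
            counts.setdefault(subdir, {})
            counts[subdir][year] = counts[subdir].get(year, 0) + 1
    # Rows are subdirectories, columns are years, missing cells become 0
    return pd.DataFrame.from_dict(counts, orient="index").fillna(0).astype(int)
```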
In order to sort some files into folders, I have to get the number (a sort of ID) of both the folders (named p.X, where p. is fixed and X is a number that can range from 1 to 200150) and the files (named PX_N.gmspr, where P is fixed, X is the ID number of the folder, and N is an identifier of the file, which can be 2, 3, 6, 8, 9, A or H).
An example would be p.24 and P24_2.gmspr, P24_3.gmspr, P24_6.gmspr, P24_8.gmspr, P24_9.gmspr, P24_A.gmspr and P24_H.gmspr; the goal is to move all the P24_N.gmspr files to p.24.
The PX_N.gmspr files are in a different folder than the target p.X folders. A little os.chdir and os.rename and the files can be moved easily, so I believe that is not a problem.
What I want is to obtain the X number from the filename to compare with the folder number, ignoring both the P and the _N.gmspr part.
Whereas I can obtain the folder number via foldername.split(".",1)[1], I don't really know how to do it for the file number.
To sum up, I want to move some files called PX_N.gmspr to another folder identified almost the same way, p.X.
Any idea? Thank you!!!
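For what it's worth, a minimal sketch of pulling the X out of both names with plain string methods (assuming the PX_N.gmspr and p.X naming is strict; the function names are made up):

```python
def file_id(filename):
    # "P24_2.gmspr" -> take the chunk before "_", drop the leading "P"
    return filename.split("_", 1)[0][1:]

def folder_id(foldername):
    # "p.24" -> everything after "p."
    return foldername.split(".", 1)[1]

assert file_id("P24_2.gmspr") == folder_id("p.24") == "24"
```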
EDIT:
Regarding the answer given, I have to clarify what I am trying to do, especially the file and folder format:
Mother folder
├── Unclassified
│   └── All the PX_N.gmspr files that have to be moved to other folders; X is a number ID that ranges from 1 to 200150 (but not exactly 200150, it is just an ID) and N can only be 2, 3, 6, 9, A or H, nothing more. In total 15435 elements, with each X having one of the 6 possible N gmspr files.
├── First Folder
│   └── p.X folders (X from 1 to 151); the aim is to select all the PX_N.gmspr files whose X number matches the p.X of the folder and move them to each folder.
├── Second Folder
│   └── p.X folders (X from 152 to 251, plus p.602 to p.628, p.823, p.824, p.825, p.881 and p.882)
└── Third Folder
    └── p.X folders (X from 252 to 386, plus p.585, p.586 and p.587)
There are some other folders to sort the rest of the 15435 files.
I am currently reading about regexes; unluckily for me, it is the first time I have actually had to use them.
EDIT, CAUSE SOLVED: the point was to play with regexes and get only the numbers, but then, as nested lists appeared, only the first number was useful.
This is the perfect job for regexes.
First, let's create a temporary dir and fill it with some files to demonstrate.
from pathlib import Path
from random import choices, randint
from string import ascii_letters
from tempfile import TemporaryDirectory

tmpdir = TemporaryDirectory()
for i in range(4):
    n = randint(1, 999)
    for i in range(randint(1, 5)):
        Path(
            tmpdir.name, f"P{n}.{''.join(choices(ascii_letters, k=10))}"
        ).touch()
Now we have 4 types of file (PN.), with between 1 and 5 files of each type.
Then we just need to iterate through those files, extract the N from the file name with the regex P(\d+)\..+, and finally create the destination dir and move the file.
from pathlib import Path
import re

dir_re = re.compile(r"P(\d+)\..+")
for filepath in Path(tmpdir.name).iterdir():
    m = dir_re.match(filepath.name)
    dirpath = filepath.parent / f"p.{m.group(1)}"
    if not dirpath.is_dir():
        dirpath.mkdir()
    filepath.rename(dirpath / filepath.name)
For instance, a flat temp directory is now sorted like this.
/var/folders/lf/z7ftpkws0vn7svq8n212czm40000gn/T/tmppve5_m1u/
├── p.413
│ └── P413.yJvxPtuzfz
├── p.705
│ ├── P705.DbwPyiFxum
│ ├── P705.FVwMuSqFms
│ ├── P705.PZyGIQEqSG
│ ├── P705.baRrkcNaZR
│ └── P705.tZKFTKwDah
├── p.794
│ ├── P794.CQTBgXOckQ
│ ├── P794.JNoKsUtgRU
│ └── P794.iSdrdohKYq
└── p.894
└── P894.XbzFxnqYOY
And finally, cleanup the temporary directory.
tmpdir.cleanup()
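Since the question's files are named PX_N.gmspr rather than PN.&lt;random suffix&gt; as in the demo above, the regex would need a small tweak. A sketch, with the character class for N following the list in the question:

```python
import re

# X = folder ID; N restricted to the identifiers listed in the question
gmspr_re = re.compile(r"P(\d+)_[23689AH]\.gmspr")

m = gmspr_re.match("P24_2.gmspr")
print(m.group(1))  # -> 24
```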
As per the title, I am having trouble accessing certain Excel files in another folder based upon their filenames.
I have a folder containing a bunch of Excel files which share a common name, but each has an individual datestamp appended as a suffix in the format /files/filename_%d%m%Y.xlsx, giving me a directory like this:
├── files
│ ├── filename_10102021.xlsx
│ ├── filename_07102021.xlsx
│ ├── filename_11102021.xlsx
│ └── filename_14102021.xlsx
├── notebooks
│ └── my_notebook.ipynb
From the my_notebook.ipynb file, I would like to navigate to the files directory, and get the 2 most recent excel files according to the suffixed date and open them in the notebook as pandas dfs so I can compare the columns for any differences. In the directory I provided above, the 2 files I would get are filename_14102021.xlsx and filename_11102021.xlsx but would like this solution to work dynamically as the files folder gets new files with new dates as time goes on. (so hardcoding those two names would not work)
My first thought is to do something like:
import os
import sys
import pandas as pd
sys.path.append('../files')
files = sorted(os.listdir(), reverse=True)
most_recent_df = pd.read_excel(files[0], engine='openpyxl', index_col=0)
second_most_recent_df = pd.read_excel(files[1], engine='openpyxl', index_col=0)
and then do my comparison between the dataframes.
However, this code fails to do what I want: even with sys.path.append, the os.listdir function returns a listing of the notebooks directory, which tells me the problem lies in these two lines:
sys.path.append('../files')
files = sorted(os.listdir(), reverse=True)
How do I fix my code to move into the files directory so that a list of all the excel files is returned?
thank you!
It should work directly using
files = sorted(os.listdir(r'path\to\folder'), reverse=True)
IMO you don't need to use sys.path.append: it only affects where Python looks for modules to import, not the working directory that os.listdir() reads.
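One caveat: reverse=True sorts lexically, and with a %d%m%Y suffix the day comes first, so files from different months won't order chronologically (e.g. filename_01112021.xlsx sorts before filename_14102021.xlsx). A sketch that parses the date instead (the helper name and folder argument are assumptions):

```python
import os
from datetime import datetime

def newest_two(folder):
    # Parse the ddmmyyyy suffix so the ordering is chronological, not lexical
    def stamp(name):
        date_part = name.rsplit("_", 1)[1].split(".")[0]
        return datetime.strptime(date_part, "%d%m%Y")

    xlsx = [f for f in os.listdir(folder) if f.endswith(".xlsx")]
    xlsx.sort(key=stamp, reverse=True)
    return [os.path.join(folder, f) for f in xlsx[:2]]
```

The two returned paths can then be passed to pd.read_excel as in the question.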
If folder X is empty, I would like to delete X.
If folder Y only contains folder X and other folders that are empty, I would like to delete Y.
If folder Z only contains folders like X and/or Y, I would like to delete Z.
How do I do this recursively for anything under a specific dir, with Python?
I tried something like the following, but it is only able to identify folders like X, not folders like Y or Z.
from pathlib import Path
import logging

logger = logging.getLogger(__name__)

folder = '/home/abc'
for path in Path(folder).glob("**/*"):
    if path.is_dir() and len(list(path.iterdir())) == 0:
        logger.info(f"remove {path}")
May be a bit verbose, but this seems to do the job.
def dir_empty(path):
    empty = True
    for item in path.glob('*'):
        if item.is_file():
            empty = False
        if item.is_dir() and not dir_empty(item):
            empty = False
    if empty:
        path.rmdir()  # Remove this line if you only want the check, without deleting
    return empty
from pathlib import Path
dir_empty(Path('Z'))
os.rmdir() will fail on any directory with contents. So one method here is to just rmdir every directory from bottom to top while suppressing the OSError exception which is thrown when attempting to remove a non-empty directory. All empty ones will be removed, all with contents will remain. Technically, checking if a directory is empty before attempting the removal is a race condition (though typically, a harmless one).
Let's take this filesystem with 2 files in it.
testtree/
├── a
│   └── aa
│       └── filea
├── b
│   ├── bb
│   └── fileb
└── c
    └── cc
And run this:
import os
from pathlib import Path
from contextlib import suppress

for root, dirs, _ in os.walk("testtree", topdown=False):
    for d in dirs:
        with suppress(OSError):
            os.rmdir(Path(root, d))
Then the tree is transformed to
testtree/
├── a
│   └── aa
│       └── filea
└── b
    └── fileb
bb, cc, and c were all removed.
The problem here is that glob("**/*") appears to do a preorder traversal. In other words, it returns a parent folder before returning any children.
You need to do a postorder traversal instead. In your example, if you delete a folder like X, then Y just becomes the same as X.
You could do this with manual recursion. But if you want to use glob(), you need to reverse the items returned. You can do this with reversed(list(Path(folder).glob("**/*"))).
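A sketch of that reversed-glob idea (the folder path is the placeholder from the question; note the list() call, since glob() returns a generator):

```python
from pathlib import Path

folder = "/home/abc"  # placeholder from the question

# Reversing a preorder listing yields children before their parents,
# so a parent emptied by earlier deletions is checked afterwards
for path in reversed(list(Path(folder).glob("**/*"))):
    if path.is_dir() and not any(path.iterdir()):
        path.rmdir()
```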
In my Python3 program I need to delete files and folders that are older than X days. I know there are many similar questions here, but in my case I don't need to check the modification times of these folders and files. Instead I have the following folder structure:
/root_folder/<year>/<month>/<day>/<files>
So for example something like this:
.
└── 2020
    ├── 04
    │   └── 30
    │       ├── file.1
    │       └── file.2
    └── 05
        ├── 14
        │   ├── file.1
        │   └── file.2
        ├── 19
        ├── 21
        │   └── file.1
        └── 22
            ├── file.1
            ├── file.2
            └── file.3
What I want now is to delete all the folders, and their files, that represent a date older than X days. I have created a solution, but coming from Java it seems to me that it is not very Pythonic, or that it could be solved more easily in Python. Can you Python experts guide me a bit here, of course taking into account "jumps" over months and years?
Not a python expert here either, but here's something simple:
Find the oldest date that you want to keep. Anything older than this will be deleted. Let's say it is 28/04/2020.
From that date, you can build a string "/root_folder/2020/04/28"
List all the files; if their path (as a string) is less than the string from the previous step, you can delete them all.
Example:
import os

files = []
# r=root, d=directories, f=files
for r, d, f in os.walk(path):
    for file in f:
        if '.txt' in file:
            files.append(os.path.join(r, file))
Source of that snippet: https://mkyong.com/python/python-how-to-list-all-files-in-a-directory/
Now, you can do:
for f in files:
    if f < date_limit:
        os.remove(f)
Note: This is not optimal
It deletes file by file, but the moment you enter the if you could just delete the whole folder where this file is (but then the list of files points to files that have been deleted).
You actually don't care about the files. You could apply the logic to folders alone and remove them recursively.
Update: doing both steps as we browse the folders:
import shutil

for r, d, f in os.walk(path):
    if r < date_limit:
        print(f"Deleting {r}")
        shutil.rmtree(r)
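The date_limit string itself can be derived from the cutoff date; a sketch (assuming the /root_folder/&lt;year&gt;/&lt;month&gt;/&lt;day&gt; layout from the question, with root_folder and the 30-day cutoff as placeholders):

```python
from datetime import date, timedelta

root_folder = "/root_folder"  # placeholder
days_to_keep = 30  # example cutoff

cutoff = date.today() - timedelta(days=days_to_keep)
# Zero-padded %m/%d keeps the lexical comparison consistent with the layout
date_limit = f"{root_folder}/{cutoff:%Y/%m/%d}"
```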
Glob your paths to get your filepaths in an array, then run something like this (below). Good luck!
import os
import time

def is_file_access_older_than(file_path, seconds, from_time=None):
    """based on st_atime --> https://docs.python.org/3/library/os.html#os.stat_result.st_atime"""
    if not from_time:
        from_time = time.time()
    if (from_time - os.stat(file_path).st_atime) > seconds:
        return file_path
    return False
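A possible way to wire that up with glob, repeating the helper so the sketch is self-contained (the pattern and the 30-day threshold are placeholders):

```python
import glob
import os
import time

def is_file_access_older_than(file_path, seconds, from_time=None):
    """Return file_path if it was last accessed more than `seconds` ago, else False."""
    if not from_time:
        from_time = time.time()
    if (from_time - os.stat(file_path).st_atime) > seconds:
        return file_path
    return False

# Collect candidate paths recursively, keep files not accessed in ~30 days
candidates = glob.glob("/root_folder/**/*", recursive=True)
stale = [
    p for p in candidates
    if os.path.isfile(p) and is_file_access_older_than(p, 30 * 24 * 3600)
]
```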