Python: accessing Excel files from a Jupyter notebook in a separate folder

As per the title, I am having trouble accessing certain Excel files in another folder based on their filenames.
I have a folder containing a bunch of Excel files which share a common name, but each has an individual datestamp appended as a suffix in the format files/filename_%d%m%Y.xlsx, giving me a directory like this:
├── files
│   ├── filename_10102021.xlsx
│   ├── filename_07102021.xlsx
│   ├── filename_11102021.xlsx
│   └── filename_14102021.xlsx
├── notebooks
│   └── my_notebook.ipynb
From the my_notebook.ipynb file, I would like to navigate to the files directory, get the two most recent Excel files according to the suffixed date, and open them in the notebook as pandas DataFrames so I can compare the columns for any differences. In the directory above, the two files I would get are filename_14102021.xlsx and filename_11102021.xlsx, but I would like this solution to work dynamically as the files folder gains new files over time (so hardcoding those two names would not work).
My first thought is to do something like:
import os
import sys
import pandas as pd
sys.path.append('../files')
files = sorted(os.listdir(), reverse=True)
most_recent_df = pd.read_excel(files[0], engine='openpyxl', index_col=0)
second_most_recent_df = pd.read_excel(files[1], engine='openpyxl', index_col=0)
and then do my comparison between the dataframes.
However, this code fails to do what I want: even with sys.path.append, os.listdir returns a listing of the notebooks directory, which tells me the problem lies in these two lines:
sys.path.append('../files')
files = sorted(os.listdir(), reverse=True)
How do I fix my code to move into the files directory so that a list of all the excel files is returned?
Thank you!

It should work directly using
files = sorted(os.listdir(r'path\to\folder'), reverse=True)
You don't need sys.path.append here: it only changes where Python searches for module imports, not where os.listdir looks, which defaults to the current working directory.
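One caveat worth adding (my addition, not part of the answer above): reverse-sorting the raw filenames only happens to work for this listing, because the %d%m%Y stamp puts the day first; filename_01112021.xlsx (1 Nov) would sort before filename_14102021.xlsx (14 Oct) despite being more recent. A sketch that parses the datestamp instead, assuming every file follows the filename_<ddmmyyyy>.xlsx pattern:

```python
import os
from datetime import datetime

def two_most_recent(folder):
    # Parse the ddmmyyyy suffix so files sort chronologically,
    # not lexicographically.
    def stamp(name):
        stem = os.path.splitext(name)[0]          # e.g. 'filename_14102021'
        return datetime.strptime(stem.split('_')[-1], '%d%m%Y')

    files = [f for f in os.listdir(folder) if f.endswith('.xlsx')]
    files.sort(key=stamp, reverse=True)
    return [os.path.join(folder, f) for f in files[:2]]
```

The two returned paths can then be passed straight to pd.read_excel as in the question.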

Related

How to get the output file by running Python file in Visual Studio Code?

I am a beginner Python user and chose Visual Studio Code as my editor. Recently I wrote a Python file that identifies all the file/directory names at the same level and then outputs .txt files listing all the names that match my rule.
Last month, when I ran this Python file from Visual Studio Code, the output files appeared in the parent folder (one level up). But today there are no output files after running it from Visual Studio Code. Because of this, I double-clicked the Python file to run it without Visual Studio Code, and the output files appeared at the same level as my Python file.
So my questions are:
How do I ensure the output files are produced when running a Python file with Visual Studio Code?
How do I generate the output files at the same level as the Python file being run?
Code:
import os

CurrentScriptDir = os.path.dirname(os.path.realpath(__file__))
All_DirName = []
for root, dirs, files in os.walk(CurrentScriptDir):
    for each_dir in dirs:
        All_DirName.append(each_dir)

for Each_DirName in All_DirName:
    Each_DirName_Split = Each_DirName.split('_')
    if Each_DirName_Split[3] == 'twc':
        unitname = "_".join(Each_DirName_Split[0:-1])
        with open(unitname + ".txt", "a") as file:
            file.write(Each_DirName + "_K3" + "\n")
            file.close()
    else:
        next
tl;dr:
Below is how I would write your program while adhering to the original code's flow. An explanation follows, and I can update this answer if you provide more details.
To avoid confusion with paths, I would suggest simply requiring the user to provide one when running the script. The path provided by the user is the path that gets scanned, and is also the location of all text files the script creates; the cwd and the location of the script file are then irrelevant.
import os
import sys

# Usage:
#   python Program.py <path>

def find_twc_folders(path):
    for root, dirs, files in os.walk(path):
        for dir in dirs:
            parts = dir.split('_')
            if len(parts) == 4 and parts[3] == 'twc':  # 'a_twc', 'a_b_c_twc_d', etc. are skipped
                with open(os.path.join(path, dir[:-4] + '.txt'), 'a') as file:  # substring with '_twc' removed
                    file.write(dir + '_K3\n')

if __name__ == '__main__':
    if len(sys.argv) > 1:
        find_twc_folders(sys.argv[1])
    else:
        find_twc_folders(os.path.dirname(os.path.realpath(__file__)))
(EDIT: Changed to use the script's directory if the program is called with no args.)
Folder setup:
Given the following directory setup, with your current working directory (cwd) in the VSCode terminal being one level above root:
PS C:\Users\there\source\repos\SO\75241788> tree /f
C:.
├───.vscode
└───root
│ Program.py
│
├───0_duplicate_path_twc
├───1a_one_two_three
│ ├───0_duplicate_path_twc
│ ├───2a_one_two_three
│ │ ├───0_duplicate_path_twc
│ │ ├───3a_one_two_three
│ │ └───3b_one_two_twc
│ └───2b_one_two_twc
├───1b_one_two_twc
│ ├───2a_one_two_three
│ ├───2b_one_two_three
│ ├───2c_one_two_twc
│ └───2d_one_two_twc
└───1c_one_two_twc
A dry run gives us the following, after replacing the actual file operations with print():
PS C:\Users\there\source\repos\SO\75241788> python root/Program.py
CurrentScriptDir: C:\Users\there\source\repos\SO\75241788\root
in "0_duplicate_path_twc" # <- in top level directory
in "1a_one_two_three"
in "1b_one_two_twc"
open 1b_one_two.txt
print: 1b_one_two_twc_K3\n
in "1c_one_two_twc"
open 1c_one_two.txt
print: 1c_one_two_twc_K3\n
in "0_duplicate_path_twc" # <- in sub level directory
in "2a_one_two_three"
# ...
In the current implementation, you are only pushing the directory name into your array, not the full path. A relative path that is unqualified will be considered rooted under the cwd by the OS, so your script will create all files at the location you see in your terminal to the left of the >.
Operating on folder names alone in this manner also means identical-named folders at different levels will result in multiple (duplicate) entries being added to the same file.
Code fixes
The final else in your program is unnecessary, as your for loop does that anyway. As mentioned by @rioV8, next is being used incorrectly here too; he also pointed out there is no need to close the file in this case, since with does that for you.
As it stands, removing the unneeded All_DirName array, removing the last three lines mentioned above, moving your join operation inline, and prepending your file paths with CurrentScriptDir results in:
import os

CurrentScriptDir = os.path.dirname(os.path.realpath(__file__))
for root, dirs, files in os.walk(CurrentScriptDir):
    for each_dir in dirs:
        Each_DirName_Split = each_dir.split('_')
        # todo: check length > 3 first (or) compare last index instead
        if Each_DirName_Split[3] == 'twc':
            unitname = "_".join(Each_DirName_Split[0:-1])
            with open(os.path.join(CurrentScriptDir, unitname + '.txt'), 'a') as file:
                file.write(each_dir + '_K3\n')
Running it in the aforementioned setup will walk all folders found in the folder the script is located in, saving all files to that same folder as well.
EDIT: Added os.path.join(CurrentScriptDir, ...) in the previous code example to ensure the files are written next to the source program, regardless of current working directory.

Recursive Function to get all files from main folder and subdirectories inside it in Python

I have a file directory that looks something like this. I have a larger directory, but showing this one just for explanation purposes:
.
├── a.txt
├── b.txt
├── foo
│   ├── w.txt
│   └── a.txt
└── moo
    ├── cool.csv
    ├── bad.csv
    └── more
        └── wow.csv
I want to write a recursive function to get year counts for files within each subdirectory within this directory.
I want the code to basically check if it's a directory or file. If it's a directory then I want to call the function again and get counts until there's no more subdirectories.
I have the following code (which keeps crashing my kernel when I test it). There's probably a logic error in it as well, I would think:
import os
import pandas as pd

dir_path = 'S:\\Test'

def getFiles(dir_path):
    contents = os.listdir(dir_path)
    # check if content is directory or not
    for file in contents:
        if os.path.isdir(os.path.join(dir_path, file)):
            # get everything inside subdirectory
            getFiles(dir_path=os.path.join(dir_path, file))
        # it's a file
        else:
            # do something to get the year of the file and put it in a list or something
            pass
    # at the end create pandas data frame and return
Expected output would be a pandas DataFrame that looks something like this:
Subdir 2020 2021 2022
foo 0 1 1
moo 0 2 0
more 1 0 0
How can I do this in Python?
EDIT:
Just realized os.walk() is probably extremely useful for my case here.
Trying to figure out a solution with os.walk() instead of doing it the long way.
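For what it's worth, a minimal os.walk() sketch of the counting step (my addition, no answer was recorded here), under the assumption that "the year of the file" means its modification year, since the question doesn't specify where the year comes from:

```python
import os
from datetime import datetime

import pandas as pd

def year_counts(dir_path):
    # Count files per (subdirectory, modification year).
    counts = {}
    for root, dirs, files in os.walk(dir_path):
        subdir = os.path.basename(root)
        for f in files:
            mtime = os.path.getmtime(os.path.join(root, f))
            year = datetime.fromtimestamp(mtime).year
            counts.setdefault(subdir, {})
            counts[subdir][year] = counts[subdir].get(year, 0) + 1
    # Rows are subdirectories, columns are years; missing combinations become 0.
    return pd.DataFrame.from_dict(counts, orient='index').fillna(0).astype(int)
```

os.walk visits every subdirectory on its own, so no explicit recursion (and no kernel crash from runaway recursion) is needed.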

How to make a config.ini for python project?

I have created a Python project that updates an Excel data file for my work. The problem is I need to create a config file so that other people can use my project without changing the code.
My code:
import pandas as pd
import datetime

timestr = datetime.date.today().strftime('%d%m%Y')

Buyerpath = 'https://asd.cvs'
Sellerpath = 'https://dsa.csv'
Onlinepath = 'https://sda.csv'

Totalpath = pd.DataFrame({'BuyValue': Buyerpath.Buytotal,
                          'SellerValue': Sellerpath.Sellertotal,
                          'OnlineValue': Onlinepath.onlinetotal})
Totalpath.to_excel(index=False, excel_writer=r'C:\Users\Mike\Desktop\Result\ResultTotal' + timestr + '.xlsx')
I need the config file to allow other people to use my Python code and save the Excel output in the output folder.
I think what you're asking is: how do I hide strings/variables that are specific to my local machine? Correct?
The easiest way to do this is with another Python file:
my-project
├── __init__.py
├── main.py
└── secrets.py
Put the strings into your secrets.py file, then import the values into your main.py file:
from .secrets import secret_val
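Since the question title asks about config.ini specifically: the standard-library configparser module covers this too, and keeps the configuration out of Python entirely. A sketch, with illustrative section and key names (not from the question):

```python
import configparser

# config.ini, placed next to the script, might look like:
#
#   [paths]
#   buyer = https://asd.cvs
#   seller = https://dsa.csv
#   output_dir = C:\Users\Mike\Desktop\Result

def load_paths(ini_path='config.ini'):
    # Read the [paths] section so users edit the .ini, not the code.
    config = configparser.ConfigParser()
    config.read(ini_path)
    return config['paths']
```

Other people then point output_dir at their own folder in config.ini instead of editing the hardcoded C:\Users\Mike\... path in the script.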

Tensorflow load many CSVs to `tf.data.Dataset` and use directory as label

I'm trying to load CSV data into a tensorflow Dataset object, but don't know how to associate the label with the CSV files given my directory structure.
I've got a directory structure like:
gesture_data/
├── train/
│   └── gesture{0001..9999}/  <- each directory name is the label
│       └── {timestamp}.txt   <- each file is an observation associated with that label
├── test/
└── valid/
Despite having a .txt extension, all the files
gesture_data/{test,train,valid}/gesture{0001..9999}/*.txt are CSV files, with a format like:
│ File: train/gesture0002/2022-05-24T01:59:08.244689+02:00.txt
───────┼─────────────────────────────────────────────────────────────
1 │ 0,391,478,528,374,495,471,405,471,438,396,510,473,401,475,192,383,516,501,412,496,453,395,496,445,376,479,470,402,488,445
2 │ 19,402,488,514,371,494,471,407,472,441,390,514,475,406,488,185,395,499,496,399,488,451,409,490,463,382,490,467,403,487,467
3 │ 40,404,490,526,372,484,487,408,472,441,395,506,477,406,474,193,398,496,504,414,493,459,405,476,446,393,495,467,399,473,447
4 │ 56,400,491,525,370,479,486,386,457,439,383,511,466,406,473,192,398,505,503,411,476,450,412,494,461,389,491,467,397,483,392
5 │ 82,391,478,524,371,483,486,408,473,437,394,513,456,410,483,186,397,500,494,398,491,442,402,490,468,386,495,452,386,491,409
... about 200 more lines after this
Where the first value on a line is milliseconds since the start of recording, and after that are 30 sensor readings taken at that millisecond offset.
Each file is one observation, and the directory the file is in is the label of that observation. So all the files under gesture0001 should have the label gesture0001, all the files under gesture0002 should have the label gesture0002, and so on.
I can't see how to do that easily without making my own custom mapping, but this seems like a common data format and directory structure so I'd imagine there'd be an easier way to do it?
Currently I read in the files like:
gesture_ds = tf.data.experimental.make_csv_dataset(
    file_pattern="../gesture_data/train/*/*.txt",
    header=False,
    column_names=['millis'] + fingers,  # `fingers` is an array labeling each of the sensor measurements
    batch_size=10,
    num_epochs=1,
    num_parallel_reads=20,
    shuffle_buffer_size=10000
)
But I don't know how to label the data from here. I found the label_name parameter to make_csv_dataset but that requires the label name to be one of the columns of the CSV file.
I can restructure the CSV file to include the label name as a column, but I'm expecting a lot of data and don't want to bloat the files if I can possibly help it.
Thanks!
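No answer was recorded here, but one common approach (an assumption on my part, not from the question) is to derive each label from the file's parent directory and pair it with the parsed CSV, e.g. by listing files with tf.data.Dataset.list_files and mapping a parser over them. The label extraction itself is plain path parsing:

```python
from pathlib import Path

def label_for(path):
    # The label is the name of the directory that contains the file,
    # e.g. 'gesture_data/train/gesture0002/<timestamp>.txt' -> 'gesture0002'.
    return Path(path).parent.name
```

Inside a tf.data pipeline the same split would be done with tf.strings.split on the filename tensor, since Python path functions can't run on symbolic tensors.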

Append Excel Files in Multiple Directories in Python

My goal is to append 9 excel files together that exist in different directories. I have a directory tree with the following structure:
Big Folder
|
├── folder_1/
| ├── file1.xls
| ├── file2.xls
| └── file3.xls
|
├── folder_2/
| ├── file4.xls
| ├── file5.xls
| └── file6.xls
|
├── folder_3/
| ├── file7.xls
| ├── file8.xls
| └── file9.xls
I successfully wrote a loop that appends file1, file2, and file3 together within folder_1. My idea is to nest this loop inside another loop that iterates over each folder. I'm currently trying to use os.walk to accomplish this, but am running into the following error in folder_1:
[Errno 2] No such file or directory
Do community members have recommendations on how to extend this loop to execute in each directory? Thanks!
It is hard for me to know how you have implemented the program without being given some code to work with; however, I believe you have misused the os.walk() method. Please read about it here.
I would use the os.walk() method in the following way to get the paths to the various files in the current directory and its subdirectories:
import os
all_files = [(path, files) for path, dirs, files in os.walk(".")]
and then get all the files which end with ".xls", like so:
all_xls_files = [
    os.path.join(path, xls_file)
    for (path, xls_files_list) in all_files
    for xls_file in xls_files_list
    if xls_file.endswith(".xls")
]
This is equivalent to:
all_xls_files = []
for (path, xls_files_list) in all_files:
    for xls_file in xls_files_list:
        if xls_file.endswith(".xls"):
            all_xls_files.append(os.path.join(path, xls_file))
Once you obtain the list of Excel files with their paths, you can open them with:
with open("my_output_file", "w") as output_file:
    for file in all_xls_files:
        with open(file) as f:
            pass  # Do your append here
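One caution (my addition): .xls workbooks are binary, so appending them as text with open()/write would produce a corrupt file. A sketch that stacks them with pandas instead, assuming all nine files share the same columns:

```python
import os

import pandas as pd

def collect_excel_paths(root):
    # Walk the whole tree and gather every .xls/.xlsx path.
    return sorted(
        os.path.join(path, f)
        for path, dirs, files in os.walk(root)
        for f in files
        if f.endswith(('.xls', '.xlsx'))
    )

def append_excels(root):
    # Read each workbook and stack the rows into a single frame.
    frames = [pd.read_excel(p) for p in collect_excel_paths(root)]
    return pd.concat(frames, ignore_index=True)
```

append_excels('Big Folder').to_excel('combined.xlsx', index=False) would then write the merged result (reading legacy .xls files requires the xlrd package).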
