Python importing csv files within subfolders

Is there a way of importing all the files within folder1? Each csv file is contained within a subfolder. Below is the file structure.
C:/downloads/folder1 > tree /F
C:.
│   tree
│
├───2020-06
│       test1.csv
│
├───2020-07
│       test2.csv
│
├───2020-08
│       test3.csv
│
├───2020-09
│       test4.csv
I'm aware of glob, below, to take all files within a folder. However can this be used for subfolders?
import glob
import pandas as pd
# Get a list of all the csv files
csv_files = glob.____('*.csv')
# List comprehension that loads all the files
dfs = [pd.read_csv(____) for ____ in ____]
# List comprehension that looks at the shape of all DataFrames
print(____)

Use the recursive keyword argument of the glob.glob() method:
glob.glob('**\\*.csv', recursive=True)
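As a minimal sketch of how the recursive pattern could plug into the question's workflow, the helper below collects every CSV under a root folder. The name find_csv_files is made up for illustration, and the demo builds a throwaway tree that only mimics the layout from the question.

```python
import glob
import os
import tempfile


def find_csv_files(root):
    """Return every .csv file under root, at any depth, in sorted order."""
    # '**' matches zero or more directory levels, but only when recursive=True.
    pattern = os.path.join(root, "**", "*.csv")
    return sorted(glob.glob(pattern, recursive=True))


# Demo on a temporary tree shaped like the one in the question.
with tempfile.TemporaryDirectory() as root:
    for month, name in [("2020-06", "test1.csv"), ("2020-07", "test2.csv")]:
        os.makedirs(os.path.join(root, month))
        with open(os.path.join(root, month, name), "w") as fh:
            fh.write("a,b\n1,2\n")
    found = [os.path.relpath(p, root) for p in find_csv_files(root)]

print(found)
```

The returned list can be fed straight into the list comprehension from the question, for example dfs = [pd.read_csv(f) for f in find_csv_files('C:/downloads/folder1')].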

You can use os.walk to visit every subfolder and collect the required files. Here's a code sample:
import os
import pandas as pd
path = '<Insert Path>'
file_extension = '.csv'
csv_file_list = []
# os.walk yields one (root, dirs, files) tuple per directory in the tree
for root, dirs, files in os.walk(path):
    for name in files:
        if name.endswith(file_extension):
            file_path = os.path.join(root, name)
            csv_file_list.append(file_path)
dfs = [pd.read_csv(f) for f in csv_file_list]

I found this on Kite's website:
import glob
path = "./directory/src_folder"
text_files = glob.glob(path + "/**/*.txt", recursive = True)
print(text_files)
Output:
['./directory/src_folder/src_file.txt', './directory/src_folder/subdirectory/subdirectory_file.txt']

Related

Using glob recursion to get sub directories and files containing CSVs

I am trying to concat multiple CSVs that live in subfolders of my parent directory.
/ParentDirectory
│
│
├───SubFolder 1
│       test1.csv
│
├───SubFolder 2
│       test2.csv
│
├───SubFolder 3
│       test3.csv
│       test4.csv
│
├───SubFolder 4
│       test5.csv
When I do
import pandas as pd
import glob
files = glob.glob('/ParentDirectory/*.csv', recursive=True)
df = pd.concat([pd.read_csv(fp) for fp in files], ignore_index=True)
I get ValueError: No objects to concatenate.
But if I select a specific sub folder, it works:
files = glob.glob('/ParentDirectory/SubFolder 3/*.csv', recursive=True)
How come glob isn't able to go down a directory and get the CSVs within each folder of the parent directory?

Try:
files = glob.glob('/ParentDirectory/**/*.csv', recursive=True)

files = glob.glob('/ParentDirectory/*/*.csv')
It doesn't need to be recursive for that pattern, but does need a wildcard for the subdirectory.
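To make the difference between the patterns concrete, here is a small sketch that runs three of them against a temporary tree. The helper names and the file layout are invented for the demo; only the glob behaviour is the point.

```python
import glob
import os
import tempfile


def matched_names(parent, pattern, **kwargs):
    """Return the sorted base names that pattern matches under parent."""
    hits = glob.glob(os.path.join(parent, pattern), **kwargs)
    return sorted(os.path.basename(h) for h in hits)


with tempfile.TemporaryDirectory() as parent:
    # One CSV directly in the parent, one a level down, one two levels down.
    layout = [
        "top.csv",
        os.path.join("SubFolder 1", "test1.csv"),
        os.path.join("SubFolder 1", "deeper", "test2.csv"),
    ]
    for rel in layout:
        full = os.path.join(parent, rel)
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()

    # No '**' in the pattern, so recursive=True changes nothing.
    flat = matched_names(parent, "*.csv", recursive=True)
    # A single '*' directory component: exactly one level down.
    one_level = matched_names(parent, "*/*.csv")
    # '**' with recursive=True: zero or more levels down.
    any_depth = matched_names(parent, "**/*.csv", recursive=True)

print(flat)
print(one_level)
print(any_depth)
```

In this layout only the '**' pattern picks up all three files, which is why the original call with '/ParentDirectory/*.csv' returned an empty list.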

Zipping files to the same folder level

This thread here advises using shutil to zip files:
import shutil
shutil.make_archive(output_filename, 'zip', dir_name)
This zips everything in dir_name and maintains the folder structure in it. Is it possible to use this same library to remove all sub-folders and just zip all files in dir_name into the same level? Or must I introduce a separate code chunk to first consolidate the files? For example, this is a hypothetical folder structure:
\dir_name
    \dir1
        \cat1
            file1.txt
            file2.txt
        \cat2
            file3.txt
    \dir2
        \cat3
            file4.txt
Output zip should just contain:
file1.txt
file2.txt
file3.txt
file4.txt

shutil.make_archive does not have a way to do what you want without copying files to another directory, which is inefficient. Instead you can use a compression library directly similar to the linked answer you provided. Note this doesn't handle name collisions!
import zipfile
import os
with zipfile.ZipFile('output.zip','w',zipfile.ZIP_DEFLATED,compresslevel=9) as z:
    for path,dirs,files in os.walk('dir_name'):
        for file in files:
            full = os.path.join(path,file)
            z.write(full,file) # write the file, but with just the file's name not full path
# print the files in the zipfile
with zipfile.ZipFile('output.zip') as z:
    for name in z.namelist():
        print(name)
Given:
dir_name
├───dir1
│   ├───cat1
│   │       file1.txt
│   │       file2.txt
│   │
│   └───cat2
│           file3.txt
│
└───dir2
    └───cat3
            file4.txt
Output:
file1.txt
file2.txt
file3.txt
file4.txt
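The answer above notes that name collisions are not handled. One possible way to deal with them, shown here only as a sketch, is to append a counter to any name that is already taken. The function name zip_flat and the _1, _2 suffix scheme are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile
import zipfile


def zip_flat(src_dir, zip_path):
    """Zip every file under src_dir into the top level of zip_path."""
    used = set()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as z:
        for path, dirs, files in os.walk(src_dir):
            dirs.sort()  # fix the walk order so the renaming is deterministic
            for file in sorted(files):
                stem, ext = os.path.splitext(file)
                arcname, n = file, 1
                # On a clash, try file_1.txt, file_2.txt, ... until a name is free.
                while arcname in used:
                    arcname = f"{stem}_{n}{ext}"
                    n += 1
                used.add(arcname)
                z.write(os.path.join(path, file), arcname)


# Demo: two different folders both contain a file1.txt.
with tempfile.TemporaryDirectory() as work:
    src = os.path.join(work, "dir_name")
    for rel in ["dir1/cat1/file1.txt", "dir1/cat2/file3.txt", "dir2/cat3/file1.txt"]:
        full = os.path.join(src, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        with open(full, "w") as fh:
            fh.write(rel)
    # Keep the archive outside dir_name so it does not end up zipping itself.
    zip_flat(src, os.path.join(work, "output.zip"))
    with zipfile.ZipFile(os.path.join(work, "output.zip")) as z:
        names = sorted(z.namelist())

print(names)
```

Both copies of file1.txt survive in the archive, the second one under a suffixed name.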

import glob
import os
import shutil
import tempfile
# The root directory to search
path = r'dir_name/'
# List all *.txt files in the root directory and its subdirectories
file_paths = [file_path
              for root_path, _, _ in os.walk(path)
              for file_path in glob.glob(os.path.join(root_path, '*.txt'))]
# Create a temporary directory to copy your files into
with tempfile.TemporaryDirectory() as tmp:
    for file_path in file_paths:
        # Get the basename of the file
        basename = os.path.basename(file_path)
        # Copy the file to the temporary directory
        shutil.copyfile(file_path, os.path.join(tmp, basename))
    # Zip the temporary directory to the working directory
    shutil.make_archive('output', 'zip', tmp)
This will create an output.zip file in the current working directory. The temporary directory is deleted when the end of the context manager is reached.

Iterate excel files and output in one folder in Python

I have a folder and subfolders structure as follows:
D:/src
├─ xyz.xlsx
├─ dist
│ ├─ xyz.xlsx
│ ├─ xxx.zip
│ └─ xxy.xlsx
├─ lib
│ ├─ xy.rar
│ └─ xyx.xlsx
├─ test
│ ├─ xyy.xlsx
│ ├─ x.xls
│ └─ xyz.xlsx
I want to extract all Excel files (xls or xlsx) from the source directory and its subdirectories, drop duplicates based on the Excel file names, and put all the unique files in the D:/dst directory. How can I get the following result in Python? Thanks.
Expected result:
D:/dst
├─ xyz.xlsx
├─ xxy.xlsx
├─ xyx.xlsx
├─ xyy.xlsx
├─ x.xls
Here is what I have tried:
import os
for root, dirs, files in os.walk(src, topdown=False):
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(os.path.join(root, file))
            try:
                df0 = pd.read_excel(os.path.join(root, file))
                #print(df0)
            except:
                continue
            df1 = pd.DataFrame(columns = [columns_selected])
            df1 = df1.append(df0, ignore_index = True)
            print(df1)
            df1.to_excel('test.xlsx', index = False)

I think this will do what you want:
import os
import shutil
src = os.path.abspath(r'.\_src')
dst = os.path.abspath(r'.\_dst')
os.makedirs(dst, exist_ok=True)  # ensure dst exists as a folder before copying into it
wanted = {'.xls', '.xlsx'}
copied = set()  # file names already copied, used to skip duplicates
for root, dirs, filenames in os.walk(src, topdown=False):
    for filename in filenames:
        ext = os.path.splitext(filename)[1]
        if ext in wanted and filename not in copied:
            src_filepath = os.path.join(root, filename)
            shutil.copy(src_filepath, dst)
            copied.add(filename)

If you use glob.glob, you don't also need os.walk, and vice versa. But since glob only matches one pattern at a time and has no way to denote an optional extra 'x' in the extension, you'll either need to run the glob loop twice, once for each extension, or use glob.glob('D:\\src\\*.xls*'), which could also match '*.xlsm', etc. Note that this pattern only looks at the top level of D:\src.
For each file matched, use shutil.move:
import glob
import os
import shutil
for file in glob.glob('D:\\src\\*.xls*'):
    shutil.move(file, 'D:\\dst\\' + os.path.basename(file))
With os.walk, you can do each extension check with fnmatch.fnmatch in the same loop:
import fnmatch
import os
import shutil
for root, dirs, files in os.walk('D:\\src'):
    for file in files:
        if fnmatch.fnmatch(file, '*.xls') or fnmatch.fnmatch(file, '*.xlsx'):
            shutil.move(f'{root}\\{file}', f'D:\\dst\\{file}')
            # shutil.move(root + '\\' + file, 'D:\\dst\\' + file)
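For comparison, a pathlib-based variant of the same idea might look like the sketch below. It copies rather than moves, matches the extension case-insensitively, and lets the first file found with a given name win. The function name copy_unique_excel and the demo layout are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path

EXCEL_SUFFIXES = {".xls", ".xlsx"}


def copy_unique_excel(src, dst):
    """Copy each distinctly named Excel file under src into dst."""
    src, dst = Path(src), Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    # sorted() materialises the listing first, so the copy order is stable.
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.suffix.lower() not in EXCEL_SUFFIXES:
            continue
        target = dst / path.name
        if not target.exists():  # skip names that were already copied
            shutil.copy2(path, target)
            copied.append(path.name)
    return copied


# Demo on a temporary tree loosely based on the question's D:/src layout.
with tempfile.TemporaryDirectory() as work:
    src, dst = Path(work, "src"), Path(work, "dst")
    for rel in ["xyz.xlsx", "dist/xyz.xlsx", "dist/xxx.zip", "test/x.xls"]:
        target = src / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    copied = copy_unique_excel(src, dst)
    dst_names = sorted(p.name for p in dst.iterdir())

print(dst_names)
```

Duplicates are decided purely by file name here, as in the question, so two different workbooks that share a name are still treated as one.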

How to get filepath directory and use it to read my excel file? (Mac)

I'm creating a basketball data visualization app, and I've already completed the GUI, now just trying to import my database which is an excel file. I'm using pandas, and when I run this code, I get the "No such file or directory" error. I understand I must get the filepath, but how do I do this (Mac OS X) and implement it to direct my code to my file?
I tried directly copying and pasting the filepath with path = r'C:(insert path here)'
#Basketball DataVis (Data Visualization)
#pylint:disable = W0614
#By Robert Smith
#Import
import tkinter
import os
import pandas as pd
from tkinter import *
from PIL import Image, ImageTk
from pandas import *
#Import the excel file to use as a database
data = pd.read_excel("nbadata.xlsx", sheetname= "Sheet1")

Easiest way is to open an instance of the terminal and then drag the file into the terminal screen - this will print the path which you can then use in your script.
Note that mac filepaths don't begin with C:
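Once you know the folder, one way to avoid typing the full path by hand is to build it with pathlib and fail early with a readable message. This is only a sketch: the helper name resolve_data_file is hypothetical, and where nbadata.xlsx actually lives on your machine is something you have to supply.

```python
import tempfile
from pathlib import Path


def resolve_data_file(name, base_dir="~"):
    """Return the absolute path of name inside base_dir, or raise if it is missing."""
    # expanduser() turns a leading "~" into the current user's home directory.
    path = Path(base_dir).expanduser() / name
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; the current working directory is {Path.cwd()}"
        )
    return path.resolve()


# Demo with a temporary folder standing in for the real data folder.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "nbadata.xlsx").touch()
    found = resolve_data_file("nbadata.xlsx", folder)
    try:
        resolve_data_file("missing.xlsx", folder)
    except FileNotFoundError as err:
        message = str(err)

print(found.name)
print(message)
```

If the workbook were in your Documents folder, for instance, the call could become pd.read_excel(resolve_data_file("nbadata.xlsx", "~/Documents")).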

If you don't know where your xlsx file is (so you can't give a relative or absolute path), but you know its exact name and the root directory it lives under, I suggest a recursive search.
In this scenario, pass the root path and the file name to the recursive function and it will return a list of the absolute paths of all matching files.
You can then take the first item from that list if you are sure no other file has the same name, or print the list to the console and refine your search.
I found this method best in my case; a simple example follows.
Directory structure:
H:\RishikeshAgrawani\Projects\GenWork\Python3\try\test>tree . /f
Folder PATH listing for volume New Volume
Volume serial number is C867-828E
H:\RISHIKESHAGRAWANI\PROJECTS\GENWORK\PYTHON3\TRY\TEST
│   tree
│
├───c
│       docs.txt
│
├───cpp
│       docs.md
│
├───data
│       nbadata.xlsx
│
├───js
│       docs.js
│
├───matlab
│       docs.txt
│
├───py
│   │   docs.py
│   │
│   └───docs
│           docs.txt
│
└───r
        docs.md
Here is the recursive implementation; please have a look and try it.
import os
import pandas as pd
def search_file_and_get_abspaths(path, filename):
    """
    Description
    ===========
    - Gives list of absolute paths of matched file names by performing recursive search
    - [] will be returned if there is no such file under the given path
    """
    matched_paths = []
    if os.path.isdir(path):
        files = os.listdir(path)
        for file in files:
            fullpath = os.path.join(path, file)
            if os.path.isdir(fullpath):
                # Recursive search in child directories
                matched_paths += search_file_and_get_abspaths(fullpath, filename)
            elif os.path.isfile(fullpath):
                # Compare whole names so 'mydocs.txt' does not match 'docs.txt'
                if file == filename and fullpath not in matched_paths:
                    matched_paths.append(fullpath)
    return matched_paths
if __name__ == "__main__":
    # Test case 1 (Multiple files example)
    matched_paths = search_file_and_get_abspaths(r'H:\RishikeshAgrawani\Projects\GenWork\Python3\try\test', 'docs.txt')
    print(matched_paths)
    # ['H:\\RishikeshAgrawani\\Projects\\GenWork\\Python3\\try\\test\\c\\docs.txt', 'H:\\RishikeshAgrawani\\Projects\\GenWork\\Python3\\try\\test\\matlab\\docs.txt', 'H:\\RishikeshAgrawani\\Projects\\GenWork\\Python3\\try\\test\\py\\docs\\docs.txt']
    # Test case 2 (Single file example)
    matched_paths2 = search_file_and_get_abspaths(r'H:\RishikeshAgrawani\Projects\GenWork\Python3\try\test', 'nbadata.xlsx')
    print(matched_paths2)
    if matched_paths2:
        xlsx_path = matched_paths2[0] # If your file name is unique then it will only be 1
        print(xlsx_path) # H:\RishikeshAgrawani\Projects\GenWork\Python3\try\test\data\nbadata.xlsx
        # Older pandas versions spelled this keyword sheetname
        data = pd.read_excel(xlsx_path, sheet_name="Sheet1")
    else:
        print("Path does not exist")
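The same search can probably be written more briefly with pathlib, whose rglob method walks the tree for you. The sketch below is an alternative to the hand-written recursion, not the answer's own method; find_by_name is a made-up name, and note that rglob treats its argument as a glob pattern, so names containing characters like [ or * would need escaping.

```python
import tempfile
from pathlib import Path


def find_by_name(root, filename):
    """Return absolute paths of every file called exactly filename under root."""
    return sorted(
        str(p.resolve()) for p in Path(root).rglob(filename) if p.is_file()
    )


# Demo on a temporary tree loosely based on the listing above.
with tempfile.TemporaryDirectory() as root:
    for rel in ["c/docs.txt", "c/mydocs.txt", "py/docs/docs.txt", "data/nbadata.xlsx"]:
        target = Path(root, rel)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    hits = find_by_name(root, "docs.txt")
    xlsx_hits = find_by_name(root, "nbadata.xlsx")

print(len(hits), len(xlsx_hits))
```

The first element of xlsx_hits, when the list is not empty, is the path you would hand to pd.read_excel.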

zip file and avoid directory structure

I have a Python script that zips a file (new.txt):
tofile = "/root/files/result/"+file
targetzipfile = "new.zip" # This is how I want my zip to look like
zf = zipfile.ZipFile(targetzipfile, mode='w')
try:
    #adding to archive
    zf.write(tofile)
finally:
    zf.close()
When I do this I get the zip file. But when I try to unzip the file I get the text file inside of a series of directories corresponding to the path of the file i.e I see a folder called root in the result directory and more directories within it, i.e. I have
/root/files/result/new.zip
and when I unzip new.zip I have a directory structure that looks like
/root/files/result/root/files/result/new.txt
Is there a way I can zip such that when I unzip I only get new.txt?
In other words I have /root/files/result/new.zip and when I unzip new.zip, it should look like
/root/files/results/new.txt

The ZipFile.write() method takes an optional arcname argument that specifies what the name of the file should be inside the zip file.
You need to change the destination name, otherwise the archive will duplicate the directory structure. Use arcname to avoid it. Try it like this:
import os
import zipfile
def zip(src, dst):
    zf = zipfile.ZipFile("%s.zip" % (dst), "w", zipfile.ZIP_DEFLATED)
    abs_src = os.path.abspath(src)
    for dirname, subdirs, files in os.walk(src):
        for filename in files:
            absname = os.path.abspath(os.path.join(dirname, filename))
            # Strip the source prefix so the archive name is relative to src
            arcname = absname[len(abs_src) + 1:]
            print('zipping %s as %s' % (os.path.join(dirname, filename),
                                        arcname))
            zf.write(absname, arcname)
    zf.close()
zip("src", "dst")
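For the single-file case in the question, a smaller sketch may be enough: pass the bare file name as arcname. The helper name zip_single_file and the temporary folder layout are assumptions for the demo.

```python
import os
import tempfile
import zipfile


def zip_single_file(file_path, zip_path):
    """Store file_path in zip_path under its bare file name, with no folders."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(file_path, arcname=os.path.basename(file_path))


# Demo: the source file sits several folders deep, as in the question.
with tempfile.TemporaryDirectory() as work:
    nested = os.path.join(work, "root", "files", "result")
    os.makedirs(nested)
    source = os.path.join(nested, "new.txt")
    with open(source, "w") as fh:
        fh.write("hello\n")
    archive = os.path.join(nested, "new.zip")
    zip_single_file(source, archive)
    with zipfile.ZipFile(archive) as zf:
        names = zf.namelist()

print(names)  # ['new.txt']
```

Unzipping such an archive yields new.txt directly, without recreating the root/files/result folders.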

Change
zf.write(tofile)
to
zf.write(tofile, zipfile_dir)
where zipfile_dir is the name the file should have inside the archive, for example:
zf.write("/root/files/result/root/files/result/new.txt", "/root/files/results/new.txt")

To illustrate most clearly,
directory structure:
/Users
└── /user
    ├── /pixmaps
    │   ├── pixmap_00.raw
    │   ├── pixmap_01.raw
    │   ├── /jpeg
    │   │   ├── pixmap_00.jpg
    │   │   └── pixmap_01.jpg
    │   └── /png
    │       ├── pixmap_00.png
    │       └── pixmap_01.png
    ├── /docs
    ├── /programs
    ├── /misc
    .
    .
    .
Directory of interest: /Users/user/pixmaps
First attempt
import os
import zipfile
TARGET_DIRECTORY = "/Users/user/pixmaps"
ZIPFILE_NAME = "CompressedDir.zip"
def zip_dir(directory, zipname):
    """
    Compress a directory (ZIP file).
    """
    if os.path.exists(directory):
        outZipFile = zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED)
        for dirpath, dirnames, filenames in os.walk(directory):
            for filename in filenames:
                filepath = os.path.join(dirpath, filename)
                outZipFile.write(filepath)
        outZipFile.close()
if __name__ == '__main__':
    zip_dir(TARGET_DIRECTORY, ZIPFILE_NAME)
ZIP file structure:
CompressedDir.zip
.
└── /Users
    └── /user
        └── /pixmaps
            ├── pixmap_00.raw
            ├── pixmap_01.raw
            ├── /jpeg
            │   ├── pixmap_00.jpg
            │   └── pixmap_01.jpg
            └── /png
                ├── pixmap_00.png
                └── pixmap_01.png
Avoiding the full directory path
def zip_dir(directory, zipname):
    """
    Compress a directory (ZIP file).
    """
    if os.path.exists(directory):
        outZipFile = zipfile.ZipFile(zipname, 'w', zipfile.ZIP_DEFLATED)
        # The root directory within the ZIP file.
        rootdir = os.path.basename(directory)
        for dirpath, dirnames, filenames in os.walk(directory):
            for filename in filenames:
                # Write the file named filename to the archive,
                # giving it the archive name 'arcname'.
                filepath = os.path.join(dirpath, filename)
                parentpath = os.path.relpath(filepath, directory)
                arcname = os.path.join(rootdir, parentpath)
                outZipFile.write(filepath, arcname)
        outZipFile.close()
if __name__ == '__main__':
    zip_dir(TARGET_DIRECTORY, ZIPFILE_NAME)
ZIP file structure:
CompressedDir.zip
.
└── /pixmaps
    ├── pixmap_00.raw
    ├── pixmap_01.raw
    ├── /jpeg
    │   ├── pixmap_00.jpg
    │   └── pixmap_01.jpg
    └── /png
        ├── pixmap_00.png
        └── pixmap_01.png

The arcname parameter in the write method specifies what will be the name of the file inside the zipfile:
import os
import zipfile
# 1. Create a zip file which we will write files to
zip_file = "/home/username/test.zip"
zipf = zipfile.ZipFile(zip_file, 'w', zipfile.ZIP_DEFLATED)
# 2. Write files found in "/home/username/files/" to the test.zip
files_to_zip = "/home/username/files/"
for file_to_zip in os.listdir(files_to_zip):
    file_to_zip_full_path = os.path.join(files_to_zip, file_to_zip)
    # arcname argument specifies what will be the name of the file inside the zipfile
    zipf.write(filename=file_to_zip_full_path, arcname=file_to_zip)
zipf.close()

You can isolate just the file name of your source files using:
name_file_only= name_full_path.split(os.sep)[-1]
For example, if name_full_path is /root/files/results/myfile.txt, then name_file_only will be myfile.txt. To zip myfile.txt to the root of the archive zf, you can then use:
zf.write(name_full_path, name_file_only)

Check out the documentation for ZipFile.write.
ZipFile.write(filename[, arcname[, compress_type]]) Write the file
named filename to the archive, giving it the archive name arcname (by
default, this will be the same as filename, but without a drive letter
and with leading path separators removed)
https://docs.python.org/2/library/zipfile.html#zipfile.ZipFile.write
Try the following:
import zipfile
import os
filename = 'foo.txt'
# Using os.path.join is better than using '/' it is OS agnostic
path = os.path.join(os.path.sep, 'tmp', 'bar', 'baz', filename)
zip_filename = os.path.splitext(filename)[0] + '.zip'
zip_path = os.path.join(os.path.dirname(path), zip_filename)
# If you need exception handling wrap this in a try/except block
with zipfile.ZipFile(zip_path, 'w') as zf:
    # Store the file under its bare name, without the leading directories
    zf.write(path, filename)
The bottom line is that if you do not supply an archive name then the filename is used as the archive name and it will contain the full path to the file.

It is much simpler than expected. I passed "file_to_be_zipped.txt" as the arcname parameter, so the folders do not appear in my final zipped file:
import zipfile
# Raw strings stop \n and \f in the Windows paths from being read as escape sequences
mmpk_zip_file = zipfile.ZipFile(r"c:\Destination_folder_name\newzippedfilename.zip", mode='w', compression=zipfile.ZIP_DEFLATED)
mmpk_zip_file.write(r"c:\Source_folder_name\file_to_be_zipped.txt", "file_to_be_zipped.txt")
mmpk_zip_file.close()

We can use this (it relies on the zip command-line tool being installed):
import os
# single File
os.system(f"cd {destinationFolder} && zip fname.zip fname")
# directory
os.system(f"cd {destinationFolder} && zip -r folder.zip folder")
This works for me.

Specify the arcname argument of the write method as follows:
tofile = "/root/files/result/"+file
NewRoot = "files/result/"
zf.write(tofile, arcname=tofile.split(NewRoot)[1])
More info:
ZipFile.write(filename, arcname=None, compress_type=None,
compresslevel=None)
https://docs.python.org/3/library/zipfile.html

I faced the same problem and solved it with writestr. You can use it like this:
zipObject.writestr(<filename> , <file data, bytes or string>)
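A minimal sketch of writestr in use: because you pass the archive name and the data yourself, no directory structure from disk can leak into the zip. The file names and contents below are invented for the demo, and an in-memory buffer stands in for a real file.

```python
import io
import zipfile

buffer = io.BytesIO()  # in-memory archive, so nothing touches the disk
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as zf:
    # First argument: name inside the archive. Second: str or bytes to store.
    zf.writestr("new.txt", "hello from writestr\n")
    zf.writestr("data/raw.bin", b"\x00\x01\x02")

# Reopen the same buffer for reading to check what was stored.
with zipfile.ZipFile(buffer) as zf:
    names = zf.namelist()
    text = zf.read("new.txt").decode("utf-8")

print(names)
print(text)
```

To zip an existing file this way you would read its bytes first and pass them as the second argument, which means the whole file is held in memory.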

If you want an elegant way to do it with pathlib you can use it this way:
from pathlib import Path
import zipfile
def zip_dir(path_to_zip: Path):
    path_to_zip = Path(path_to_zip)
    zip_file = path_to_zip.with_suffix('.zip')
    # The with block closes the archive even if writing fails
    with zipfile.ZipFile(zip_file, 'w', zipfile.ZIP_DEFLATED) as z:
        # rglob('*') also finds files without a dot in their name
        for f in path_to_zip.rglob('*'):
            if f.is_file():
                z.write(f, arcname=f.relative_to(path_to_zip))

To get rid of the absolute path, I came up with this:
import os
import zipfile
from pathlib import Path
def create_zip(root_path, file_name, ignored=[], storage_path=None):
    """Create a ZIP
    This function creates a ZIP file of the provided root path.
    Args:
        root_path (str): Root path to start from when picking files and directories.
        file_name (str): File name to save the created ZIP file as.
        ignored (list): A list of files and/or directories that you want to ignore. This
            selection is applied in root directory only.
        storage_path: If provided, ZIP file will be placed in this location. If None, the
            ZIP will be created in root_path
    """
    if storage_path is not None:
        zip_root = os.path.join(storage_path, file_name)
    else:
        zip_root = os.path.join(root_path, file_name)
    zipf = zipfile.ZipFile(zip_root, 'w', zipfile.ZIP_DEFLATED)
    def iter_subtree(path, layer=0):
        # iter the directory
        path = Path(path)
        for p in path.iterdir():
            if layer == 0 and p.name in ignored:
                continue
            zipf.write(p, str(p).replace(root_path, '').lstrip('/'))
            if p.is_dir():
                iter_subtree(p, layer=layer+1)
    iter_subtree(root_path)
    zipf.close()
Maybe it isn't the most elegant solution, but this works. If we just use p.name when providing the file name to write() method, then it doesn't create the proper directory structure.
Moreover, if it's needed to ignore the selected directories or files from the root path, this ignores those selections too.

This is an example I used. I have one excel file, Treport where I am using python + pandas in my dowork function to create pivot tables, etc. for each of the companies in CompanyNames. I create a zip file of the csv and a non-zip file so I can check as well.
The writer specifies the path where I want my .xlsx to go and for my zip files, I specify that in the zip.write(). I just specify the name of the xlsx file that was recently created, and that is what gets zipped up, not the whole directory. Beforehand I was just specifying 'writer' and would zip up the whole directory. This allows me to zip up just the recently created excel file.
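import pandas as pd
from zipfile import ZipFile
Treport = 'TestReportData.csv'
CompanyNames = ['Company1','Company2','Company3']
for CompName in CompanyNames:
    strcomp = str(CompName)
    # Path of the report to create. Each company gets a unique file.
    xlsx_path = f"C:\\Users\\MyUser\\Documents\\{strcomp}addReview.xlsx"
    writer = pd.ExcelWriter(xlsx_path, engine='xlsxwriter')
    DoWorkFunction(CompName, Treport, writer)  # the author's own function, not shown here
    writer.close()  # writes the workbook to disk (older pandas versions used writer.save())
    with ZipFile(f"C:\\Users\\MyUser\\Documents\\{strcomp}addR.zip", 'w') as zf:
        # Pass the file's path, not the writer object; the second argument is the name inside the zip
        zf.write(xlsx_path, f"{strcomp}addReview.xlsx")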
