Reading .csv and .xlsx files in a directory using DataFrames - Python

How to read all files in a directory if the directory contains .csv and .xlsx files?
I tried:
read_files = Path(path).rglob("*.csv","*.xlsx")
all_files = [pd.read_excel(file) for file in read_files]
But it is not working. How can I achieve it?

The rglob method of pathlib.Path accepts exactly one pattern. You can (somewhat inefficiently) loop twice:
pathobj = Path(path)
# rglob returns a generator, so materialize each result before concatenating
read_files = list(pathobj.rglob("*.csv")) + list(pathobj.rglob("*.xlsx"))
or refactor the code to use a different traversal function - os.walk comes to mind, or do an rglob on all files (or all files with a dot in their name) and extract only the ones which match either of your patterns:
read_files = filter(lambda x: x.suffix in ('.xlsx', '.csv'), Path(path).rglob("*.*"))
As the documentation already tells you, the pattern syntax supported by rglob is the same as for the fnmatch module.
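For completeness, here is a minimal sketch of the os.walk route mentioned above (assuming path names the directory to search):
import os

matched = []
for folder, _sub_folders, files in os.walk(path):
    for name in files:
        if name.endswith((".csv", ".xlsx")):  # endswith accepts a tuple of suffixes
            matched.append(os.path.join(folder, name))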
... Though as it turns out, using read_excel for CSV files is probably not the way to go anyway; try
pathobj = Path(path)
all_files = [pd.read_csv(file) for file in pathobj.rglob("*.csv")]
all_files.extend(pd.read_excel(file) for file in pathobj.rglob("*.xlsx"))

Related

Efficiently finding a file with an unknown extension

I have a problem that feels easy, but I cannot come up with a satisfying solution.
I have a file structure with a directory containing a very large number of files. The file names are just their index with an unknown extension. For example, the 10th file is "10.pdf" and the 42nd file is "42.png". There can be many different extensions.
I need to access the i-th file from python, given index i but not knowing the extension. This will happen a lot, so I should be able to do it efficiently.
Here are the partial solutions I could think about:
I can glob the pattern f"{i}.*"
However, I think glob will check every file in the directory? This will be very slow for a large number of files.
I can save and preload the full name in a dict, in a JSON file like {..., 10: "10.pdf", ...}
This works, but I have to load and keep track of another heavy object. This feels wrong somehow...
If I have a list of all allowed extensions, I can just test all possibilities. This feels weird and unnecessary, but that's my best guess for now.
What do you think? Is one of those proposals the correct way to do it?
As I understand it, you only need the file name instead of the full filename+extension. So one way is to strip the extension from each file, for example:
import os
path = r"Enter your folder's path here"
file_dict = {}
for file in os.listdir(path):
    if os.path.isfile(os.path.join(path, file)):  # os.listdir returns both files and folders
        file_name, ext = os.path.splitext(file)
        print(file_name, ext)
For example, if your file is '10.pdf' then file_name='10' and ext='.pdf'. Then you can add it to a dictionary for later lookups:
file_dict[file_name] = os.path.join(path, file)
Another way is using regular expressions, i.e. the re module. If you have a pattern (even a complex one), re is the tool for the job. You write your desired pattern, for example:
import os
import re
path = r"Enter your folder's path here"
file_dict = {}
for file in os.listdir(path):
    if os.path.isfile(os.path.join(path, file)):
        mo = re.search(r'(.*)(\..*)', file)  # group 1: name, group 2: extension including the dot
        if mo:
            file_name, ext = mo.groups()
            print(file_name, ext)
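If lookup speed is the main concern, here is a minimal sketch of the dict idea from the question, built in one pass with os.scandir (assuming the directory contents don't change while the program runs):
import os

def build_index(path):
    # one scan over the directory; every subsequent lookup is O(1)
    index = {}
    for entry in os.scandir(path):
        if entry.is_file():
            stem, _ext = os.path.splitext(entry.name)
            index[stem] = entry.path
    return index

index = build_index(r"Enter your folder's path here")
print(index.get("10"))  # e.g. the full path to '10.pdf'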

Pandas: How to read xlsx files from a folder matching only specific names

I have a folder full of Excel files and I have to read only 3 files from that folder and put them into individual DataFrames.
File1: Asterix_New file_Jan2020.xlsx
File2: Asterix_Master file_Jan2020.xlsx
File3: Asterix_Mapping file_Jan2020.xlsx
I am aware of the below syntax, which finds xlsx files in a folder, but I am not sure how to restrict it to specific keywords - in this case, names starting with "Asterix_":
files_xlsx = [f for f in files if f[-4:] == "xlsx"]
I am also trying to put each Excel file into an individual DataFrame, but without success:
for i in files_xlsx:
    df[i] = pd.read_excel(files_xlsx[0])
Any suggestions are appreciated.
I suggest using pathlib. If all the files are in a folder:
from pathlib import Path
from fnmatch import fnmatch
folder = Path('name of folder')
Search for the files using iterdir plus fnmatch (or glob, shown further below); fnmatch also lets you include files whose extensions are in capital letters.
iterdir allows you to iterate through the files in the folder.
name is a pathlib attribute that gives you the name of the file as a string.
Lower-casing the name with str.lower() ensures that uppercase extensions such as XLSX are also captured.
excel_only_files = [xlsx for xlsx in folder.iterdir()
                    if fnmatch(xlsx.name.lower(), 'asterix_*.xlsx')]
OR
#you'll have to test this, i did not put it through any tests
excel_only_files = list(folder.rglob('Asterix_*.[xX][lL][sS][xX]'))
from there, you can run a list comprehension to read your files:
dataframes = [pd.read_excel(f) for f in excel_only_files]
Use glob.glob to do your pattern matches
import glob
for i in glob.glob('Asterix_*.xlsx'):
    ...
First generate the list of files you want to read using glob (based on #cup's answer) and then read each one into a DataFrame with a list comprehension.
import pandas as pd
import glob
my_df_list = [pd.read_excel(f) for f in glob.iglob('Asterix_*.xlsx')]
Depending on what you want to achieve, you can also use a dict to allow for key-value pairs.
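As an illustration, a sketch of the dict version (the file names are taken from the question, and the files are assumed to sit in the working directory):
import glob
import pandas as pd

# map each matching file name to its own DataFrame
dfs = {f: pd.read_excel(f) for f in glob.iglob('Asterix_*.xlsx')}
master = dfs.get('Asterix_Master file_Jan2020.xlsx')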
At the end of the if statement you need to add another condition for files which also contain 'Asterix_':
files_xlsx = [f for f in files if f[-4:] == "xlsx" and "Asterix_" in f]
The f[-4:] == "xlsx" is to make sure the last 4 characters of the file name are xlsx and "Asterix_" in f makes sure that "Asterix_" exists anywhere in the file name.
To then read these using pandas, try:
for file in files_xlsx:
    df = pd.read_excel(file)
    print(df)
That should print the DataFrame read from each Excel file.
If you have read in the file names, you can make sure that it starts with and ends with the desired strings by using this list comprehension:
files = ['filea.txt', 'fileb.xlsx', 'filec.xlsx', 'notme.txt']
files_xlsx = [f for f in files if f.startswith('file') and f.endswith('xlsx')]
files_xlsx # ['fileb.xlsx', 'filec.xlsx']
The list comprehension says, "Give me all the files that start with file AND end with xlsx."

How can I read files with similar names in Python, rename them and then work with them?

I've already posted here with the same question, but sadly I couldn't come up with a solution (some of you gave me awesome answers, but most of them weren't what I was looking for), so I'll try again, this time giving more information about what I'm trying to do.
So, I'm using a program called GMAT to get some outputs (.txt files with numerical values). These outputs have different names, but because I'm using them for more than one thing I'm getting something like this:
GMATd_1.txt
GMATd_2.txt
GMATf_1.txt
GMATf_2.txt
Now, what I need to do is to use these outputs as inputs in my code. I need to work with them in other functions of my script, and since I will have a lot of these .txt files I want to rename them as I don't want to use them like './path/etc'.
So what I wanted was to write a loop that could get these files and rename them inside the script so I can use these files with the new name in other functions (outside the loop).
So instead of having to do this individually:
GMATds1= './path/GMATd_1.txt'
GMATds2= './path/GMATd_2.txt'
I wanted to write a loop that would do that for me.
I've already tried using a dictionary:
import os
import fnmatch
examples = {}
for filename in os.listdir('.'):
    if fnmatch.fnmatch(filename, 'thing*.txt'):
        examples[filename[:6]] = filename
This does work but I can't use the dictionary key outside the loop.
If I understand correctly, you are trying to fetch files with similar names (at least a recurring pattern) and rename them. This can be accomplished with the following code:
import glob
import os
all_files = glob.glob('path/to/directory/with/files/GMAT*.txt')
for file in all_files:
    new_path = create_new_path(file)  # possibly split the file name, change directory and/or filename
    os.rename(file, new_path)
The glob library allows for searching files with * wildcards and makes it hence possible to search for files with a specific pattern. It lists all the files in a certain directory (or multiple directories if you include a * wildcard as a directory). When you iterate over the files, you could either directly work with the input of the files (as you apparently intend to do) or rename them as shown in this snippet. To rename them, you would need to generate a new path - so you would have to write the create_new_path function that takes the old path and creates a new one.
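As an illustration, here is one possible create_new_path; the renaming scheme (GMATd_1.txt becomes GMATds1.txt) is only an assumption, so adjust it to the naming you actually want:
import os

def create_new_path(old_path):
    # hypothetical scheme: drop the underscore, e.g. GMATd_1.txt -> GMATds1.txt
    folder, name = os.path.split(old_path)
    return os.path.join(folder, name.replace("d_", "ds"))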
Since Python 3.4 you should be using the built-in pathlib package instead of os or glob.
from pathlib import Path
import shutil
for file_src in Path("path/to/files").glob("GMAT*.txt"):
    file_dest = str(file_src.resolve()).replace("ds", "d_")
    shutil.move(file_src, file_dest)
You can use:
import os
path = '.....'   # path where these files are located
path1 = '.....'  # path where you want these files to be stored
i = 1
for file in os.listdir(path):
    if file.endswith('.txt'):
        os.rename(path + "/" + file, path1 + "/" + str(i) + ".txt")
        i += 1
It will rename all the .txt files in the source folder to 1.txt, 2.txt, ..., n.txt.

How to read all the files of the same type in a directory tree in Python?

I am using the TUH EEG Seizure corpus and I would like to easily get all the files with the same extension using Python. I saw that post but I don't know whether it can be applied to a whole directory tree.
Thanks in advance
glob lets you select files matching a pattern automatically, but only in a single directory. os.walk is the tool to use for browsing a full hierarchy, but it has no provision for filtering file names on a specific pattern, so you have to apply the filtering by hand. You could do:
import os
# enter your real data here
top_folder = ...
extension = ...
# and let's browse:
for folder, sub_folders, files in os.walk(top_folder):
    for file in files:
        if file.endswith(extension):
            full_path = os.path.join(folder, file)
            # apply your processing to full_path
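If you prefer pathlib, an equivalent sketch (top_folder and extension are the same placeholders as above) lets rglob do the recursion for you:
from pathlib import Path

for full_path in Path(top_folder).rglob('*' + extension):
    ...  # apply your processing to full_path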

How can I get the file names and contents of a whole directory?

I have a directory named main that contains two files: a text file named alex.txt that only has 100 as its contents, and another file named mark.txt that has 400.
I want to create a function that will go into the directory, and take every file name and that file's contents and store them (into a dict?). So the end result would look something like this:
{'alex.txt': '100', 'mark.txt': '400'}
What would be the best way of doing this for large amounts of files?
This looks like a good job for os.walk
d = {}
for path, dirs, fnames in os.walk(top):
    for fname in fnames:
        visit = os.path.join(path, fname)
        with open(visit) as f:
            d[visit] = f.read()
This solution will also recurse into subdirectories if they are present.
Using a dictionary looks like the way to go.
You can use os.listdir to get a list of the files in your directory. Then, iterate on the files, opening each of them, reading its input and storing them in your dictionary.
If your main directory has some subdirectories, you may want to use the os.walk function to process them recursively. Stick to os.listdir otherwise.
Note that an item of os.listdir is relative to main. You may want to add the path to main before opening the file. In that case, use os.path.join(path_to_main, f) where f is an item of os.listdir.
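Putting that together, a short sketch of the os.listdir approach just described (path_to_main is a placeholder for wherever main lives):
import os

path_to_main = 'main'  # placeholder
contents = {}
for f in os.listdir(path_to_main):
    full = os.path.join(path_to_main, f)
    if os.path.isfile(full):  # skip subdirectories
        with open(full) as fh:
            contents[f] = fh.read()
# contents == {'alex.txt': '100', 'mark.txt': '400'}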
import os
bar = {}
[bar.update({i: open(i, 'r').read()}) for i in os.listdir('.')]
or (via mgilson)
bar = dict((i, open(i).read()) for i in os.listdir('.'))
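A pathlib variant of the same idea, which also skips subdirectories and closes each file for you:
from pathlib import Path

bar = {p.name: p.read_text() for p in Path('.').iterdir() if p.is_file()}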
