Pandas: Reading files with regex - python

I am trying to read multiple Excel files using wildcards and put each one in a separate dataframe with pandas.
I have read the base path and will use the following to access the subdirectories:
>>>inputs_path
'C:/Users/ABC/Downloads/Input'
>>>path1 = os.chdir(inputs_path + "/path1")
>>>fls=glob.glob("*.*")
>>>fls
['Zambia_W4.xlsm',
'Australia_W4.xlsx',
'France_W4.xlsx',
'Japan_W3.xlsm',
'India_W3.xlsx',
'Italy_W3.xlsx',
'MEA_W5.xlsx',
'NE_W5.xlsm',
'Russia_W5.xlsx',
'Spain_W2.xlsx']
>>>path2 = os.chdir(inputs_path + "/path2")
>>>fls=glob.glob("*.*")
>>>fls
['Today.xlsm',
'Yesterday.xlsx',
'Tomorrow.xlsx']
Right now I am reading them as follows:
>>>df_italy = pd.read_excel("Italy_W3.xlsx",sheet_name='Sheet1')
>>>df_russia = pd.read_excel("Russia_W5.xlsx",sheet_name='Sheet3')
>>>df_france_1 = pd.read_excel("France_W4.xlsx",sheet_name='Sheet1', usecols = 'M, Q', skiprows=4)
>>>df_spain = pd.read_excel("Spain_W2.xlsx",sheet_name='Sheet2',usecols = 'T:U', skiprows=30 )
>>>df_ne = pd.read_excel("NE_W5.xlsm",sheet_name='Sheet2',usecols = 'N,P', skiprows=4 )
>>>df_ne_c = pd.read_excel("NE_W5.xlsm",sheet_name='Sheet1',usecols = 'H:J', skiprows=141 )
Since I have the filenames in the list fls, is there a way I could use that list to read the files without hard-coding the actual filenames, since the filenames change with the week number?
Also, it's mandatory to keep the dataframe names as shown above while reading the Excel files.
I am looking to read the files as
>>>df_italy = pd.read_excel("Italy*.xlsx",sheet_name='Sheet1')
Is there any way to do this?

If your files always have a _ to split on, you could create a dictionary with the split value as the key and the file path as the value.
Let's use pathlib, which was added in Python 3.4, as it's easier to use with file systems.
Regex matching on the file name:
Assuming your dictionary is created as above, with file names as keys and paths as values, we could do this. You'll need to extend the function to deal with multiple file matches.
import re
from pathlib import Path

location = Path(inputs_path)  # directory holding the weekly files
file_dict = {file.stem: file for file in location.glob('*.xls*')}  # '*.xls*' also catches the .xlsm files

# assume the numbers are paths.
files = {'Zambia_W4.xlsm': 2,
         'Australia_W4.xlsx': 5,
         'France_W4.xlsx': 0,
         'Japan_W3.xlsm': 7,
         'India_W3.xlsx': 2,
         'Italy_W3.xlsx': 6,
         'MEA_W5.xlsx': 7,
         'NE_W5.xlsm': 4,
         'Russia_W5.xlsx': 3,
         'Spain_W2.xlsx': 5}

def file_name_match(file_dict, pattern):
    for name, source in file_dict.items():
        if re.search(pattern, name, flags=re.IGNORECASE):
            return source

file_name_match(files, 'italy')
output: 6
df = pd.read_excel(file_name_match(file_dict, 'italy'), sheet_name=...)
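If several files can match the same pattern (say, one country across multiple weeks), one possible extension, a sketch rather than part of the original answer, is to collect every hit:
def file_name_matches(file_dict, pattern):
    # hypothetical variant: return all matching paths instead of only the first
    return [source for name, source in file_dict.items()
            if re.search(pattern, name, flags=re.IGNORECASE)]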

It might be feasible to simply populate a dictionary of dataframes like this:
my_dfs = {}
for f in fls:
    my_dfs[f.split(".")[0]] = pd.read_excel(f, ...)
You can also use a for loop to just run the job you need for each file, which shouldn't require knowledge of the file name. It's also possible to read all the spreadsheets into one df, with an additional column holding the corresponding file name for each row, as sketched below.
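A minimal sketch of that last idea, assuming pandas is imported as pd and the files in fls share compatible columns:
frames = []
for f in fls:
    df = pd.read_excel(f)
    df['source_file'] = f   # record which file each row came from
    frames.append(df)
combined = pd.concat(frames, ignore_index=True)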

The code below assumes you have several files for each country, and need to sort them to find the latest week.
import glob
import os
import re
def find_country_file(country_name):
    all_country_files = glob.glob(os.path.join(inputs_path, '{0}_W*.*'.format(country_name)))
    week_numbers = [re.search('W([0-9]+)', x) for x in all_country_files]
    week_numbers = [int(x.group(1)) for x in week_numbers if x is not None]
    latest_week_number = sorted(week_numbers, reverse=True)[0]
    latest_country_file = [x for x in all_country_files if 'W{0}.'.format(latest_week_number) in x][0]
    return os.path.basename(latest_country_file)
df_italy = pd.read_excel(find_country_file('Italy') , sheet_name='Sheet1')
df_russia = pd.read_excel(find_country_file('Russia'), sheet_name='Sheet3')
df_france_1 = pd.read_excel(find_country_file('France'),sheet_name='Sheet1', usecols = 'M, Q', skiprows=4)
df_spain = pd.read_excel(find_country_file('Spain'),sheet_name='Sheet2',usecols = 'T:U', skiprows=30 )
df_ne = pd.read_excel(find_country_file('NE'),sheet_name='Sheet2',usecols = 'N,P', skiprows=4 )
df_ne_c = pd.read_excel(find_country_file('NE'),sheet_name='Sheet1',usecols = 'H:J', skiprows=141)
The function find_country_file searches for all files with the country name in the path, uses a regex to pull out the week number, sorts the numbers to find the highest, and then returns the file name from the glob of all country files that matches the latest week found.

Related

Python Dataframe find the file type, choose the correct pd.read_ and merge them

I have a list of files to be imported into the dataframe.
Code:
# the list contains each dataset name followed by the column name used to match
# all the datasets; this list keeps changing, and even the file formats change.
# These dataset file names are provided by the user, and they are unique.
# First: find the file extension format and select the appropriate pd.read_ to import
# Second: merge the dataframes on the index
# in the below list,
file_list = ['dataset1.csv','datetime','dataset2.xlsx','timestamp']
df = pd.DataFrame()
for i in range(0, len(file_list), 2):
    # find the file type first
    # presently, I don't know how to find the file type; so
    file_type = 'csv'
    # second: merge the dataframe into the existing dataframe on the index
    tdf = pd.DataFrame()
    if file_type == 'csv':
        tdf = pd.read_csv('%s'%(file_list[i]))
    if file_type == 'xlsx':
        tdf = pd.read_excel('%s'%(file_list[i]))
    tdf.set_index('%s'%(file_list[i+1]),inplace=True)
    # Merge dataframe with the existing dataframe
    df = df.merge(tdf,right_index=True,left_index=True)
I reached this far. Is there any module available to find the file type directly? I found magic, but it has issues on import. Also, can you suggest a better approach to merging the files?
Update: Working solution
Inspired by the ljdyer answer below, I came up with the following, which is working perfectly:
def find_file_type_import(file_name):
    # Total file extensions possible for importing data
    file_type = {'csv': 'pd.read_csv(file_name)',
                 'xlsx': 'pd.read_excel(file_name)',
                 'txt': 'pd.read_csv(file_name)',
                 'parquet': 'pd.read_parquet(file_name)',
                 'json': 'pd.read_json(file_name)'
                 }
    df = [eval(val) for key, val in file_type.items()
          if file_name.endswith(key)][0]
    return df
df = find_file_type_import(file_list[0])
This is working perfectly. Thank you for your valuable suggestions. Also, is the use of eval here good practice or not?
The file type is just the three or four letters at the end of the file name, so the simplest way to do this would just be:
if file_list[i].endswith('csv'):
etc.
Other common options would be os.path.splitext or the suffix attribute of a Path object, from the built-in os and pathlib libraries respectively.
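For example, both of the following work on the file names from the question:
import os
from pathlib import Path

os.path.splitext('dataset1.csv')    # ('dataset1', '.csv')
Path('dataset2.xlsx').suffix        # '.xlsx'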
The way you are merging looks fine, but I'm not sure why you are using percent notation for the parameters to read_, set_index, etc. The elements of your list are just strings anyway, so for example
tdf = pd.read_csv('%s'%(file_list[i]))
could just be:
tdf = pd.read_csv(file_list[i])
(Answer to follow-up question)
Really nice idea to use a dict! It is generally considered good practice to avoid eval wherever possible, so here's an alternative option with the pandas functions themselves as dictionary values. I also suggest a prettier syntax for your list comprehension with exactly one element based on this answer and some clearer variable names:
def find_file_type_import(file_name):
    # Total file extensions possible for importing data
    read_functions = {'csv': pd.read_csv,
                      'xlsx': pd.read_excel,
                      'txt': pd.read_csv,
                      'parquet': pd.read_parquet,
                      'json': pd.read_json}
    [df] = [read(file_name) for file_ext, read in read_functions.items()
            if file_name.endswith(file_ext)]
    return df
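Usage is unchanged from your working solution, e.g. df = find_file_type_import(file_list[0]).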
You can use glob (or even just os) to retrieve the list of files from a part of their name. Since you guarantee the uniqueness of the file name irrespective of the extension, there will be only one match (otherwise just add a loop that iterates over the retrieved elements).
Once you have the full file name (which clearly has the extension), just do a split() and take the last element, which corresponds to the file extension.
Then, you can read the dataframe with the appropriate function.
Here is an example of code:
from glob import glob

file_list = [
    'dataset0',    # corresponds to dataset0.csv
    'dataset1',    # corresponds to dataset1.xlsx
    'dataset2.a'
]

for file in file_list:
    files_with_curr_name = glob(f'*{file}*')
    if len(files_with_curr_name) > 0:
        # take the first element, the uniqueness of the file name being guaranteed
        full_file_name = files_with_curr_name[0]
        # extract the file extension (string after the dot, so the last element of split)
        file_type = full_file_name.split(".")[-1]
        if file_type == 'csv':
            print(f'Read {full_file_name} as csv')
            # df = pd.read_csv(full_file_name)
        elif file_type == 'xlsx':
            print(f'Read {full_file_name} as xlsx')
        else:
            print(f"Don't read {full_file_name}")
Output will be:
Read dataset0.csv as csv
Read dataset1.xlsx as xlsx
Don't read dataset2.a
Using pathlib and a switch dict to call functions.
from pathlib import Path

import pandas as pd

def main(files: list) -> None:
    caller = {
        ".csv": read_csv,
        ".xlsx": read_excel,
        ".pkl": read_pickle
    }
    for file in get_path(files):
        print(caller.get(file.suffix)(file))

def get_path(files: list) -> list:
    file_path = [x for x in Path.home().rglob("*") if x.is_file()]
    return [x for x in file_path if x.name in files]

def read_csv(file: Path) -> pd.DataFrame:
    return pd.read_csv(file)

def read_excel(file: Path) -> pd.DataFrame:
    return pd.read_excel(file)

def read_pickle(file: Path) -> pd.DataFrame:
    return pd.read_pickle(file)

if __name__ == "__main__":
    files_to_read = ["spam.csv", "ham.pkl", "eggs.xlsx"]
    main(files_to_read)
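One caveat worth noting (an observation on the answer above, not part of it): caller.get returns None for a suffix with no registered reader, so calling the result raises TypeError. A small guard inside main avoids that:
    for file in get_path(files):
        reader = caller.get(file.suffix)
        if reader is not None:   # skip files with no registered reader
            print(reader(file))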

How to import multiple excel files and manipulate them individually

I have to analyze 13 different Excel files and I want to read them all in Jupyter at once, instead of reading them all individually. I also want to be able to access the contents individually. So far I have this:
path = r"C:\Users\giova\PycharmProjects\DAEB_prijzen\data"
filenames = glob.glob(path + "\*.xlsx")
df_list = []
for file in filenames:
df = pd.read_excel(file, usecols=['Onderhoudsordernr.', 'Oorspronkelijk aantal', 'Bedrag (LV)'])
print(file)
print(df)
df_list.append(df)
When I'm running the code it seems to be one big list, with some data missing, which I don't want. Can anyone help? :(
This seems like a problem that can be solved with a for loop and a dictionary.
Read the path location of your files:
path = 'C:/your path'
paths = os.listdir(path)
Initialize an empty dictionary:
my_files = {}
for i, p in enumerate(paths):
    my_files[i] = pd.read_excel(os.path.join(path, p))  # listdir returns bare names, so join with the folder
Then you can access your files individually by simply calling the key in the dictionary:
my_files[i]
where i = 0, 1, ..., 12.
Alternatively, if you want to assign a name to each file, you can either create a list of names or derive the names from the file paths with some slice/regex work on the strings (a sketch of the second case follows the first).
Assuming the first case:
names = ['excel1', ...]
for name, p in zip(names, paths):
    my_files[name] = pd.read_excel(os.path.join(path, p))
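And a minimal sketch of the second case, deriving each key from the file name itself (assuming the same path and paths as above):
for p in paths:
    name = os.path.splitext(p)[0]   # file name without its extension
    my_files[name] = pd.read_excel(os.path.join(path, p))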

An elegant way of reading multiple pandas DataFrames and assigning dataframes names in Python using Pandas

Excuse my question, I know this is trivial, but for some reason I am not getting it right. Reading dataframes one by one is highly inefficient, especially if you have a lot of dataframes to read. Remember DRY - DO NOT REPEAT YOURSELF.
So here is my approach:
files = ["company.csv", "house.csv", "taxfile.csv", "reliablity.csv", "creditloan.csv", "medicalfunds.csv"]
DataFrameName = ["company_df", "house_df", "taxfile_df", "reliablity_df", "creditloan_df", "medicalfunds_df"]
for file in files:
    for df in DataFrameName:
        df = pd.read_csv(file)
This only gives me df as one of the frames; I am not sure which one, but I guess the last. How can I read through the csv files and store them under the dataframe names in DataFrameName?
My goal:
To have 6 dataframes loaded in the workspace, named as specified in DataFrameName
For example company_df holds the data from "company.csv"
You could set up:
DataFrameDic = {"company":[], "house":[], "taxfile":[], "reliablity":[], "creditloan":[], "medicalfunds":[]}

for key in DataFrameDic:
    DataFrameDic[key] = pd.read_csv(key+'.csv')
This should return a dictionary containing the dataframes.
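Each dataframe is then reachable by its base name, for example:
company_df = DataFrameDic['company']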
Something like this:
files = [
    "company.csv",
    "house.csv",
    "taxfile.csv",
    "reliablity.csv",
    "creditloan.csv",
    "medicalfunds.csv",
]

DataFrameName = [
    "company_df",
    "house_df",
    "taxfile_df",
    "reliablity_df",
    "creditloan_df",
    "medicalfunds_df",
]

dfs = {}
for name, file in zip(DataFrameName, files):
    dfs[name] = pd.read_csv(file)
zip lets you iterate two lists at the same time, so you can get both the name and the filename.
You'll end up with a dict of DataFrames
Using pathlib we can create a generator expression, then build a dictionary with the file name as the key and the dataframe as the value.
With pathlib we can use the .glob method to grab all the csv's in a target path.
Replace "\tmp\files" with the path to your files; if you're using Windows, use a raw string or escape the slashes.
from pathlib import Path

trg_files = (f for f in Path(r"\tmp\files").glob("*.csv"))
dataframe_dict = {f"{file.stem}_df": pd.read_csv(file) for file in trg_files}

print(dataframe_dict.keys())
'company_df'
print(dataframe_dict['company_df'])
Dictionaries are the way, since you can name their contents dynamically.
names = ["company", "house", "taxfile", "reliablity", "creditloan", "medicalfunds"]
dataframes = {}

for name in names:
    dataframes[f"{name}_df"] = pd.read_csv(f"{name}.csv")
The fact that you have a nice naming convention lets us easily append the _df or .csv part to the name when needed.

python 2.7 rename values within list

When I import several csvs and save them within an array, the name of every imported file becomes q=[Dataframe, Dataframe, Dataframe, Dataframe, Dataframe, Dataframe, Dataframe, Dataframe]. I would like to change the names, using the name of each file as the base.
files_array_Q = []
files_array_F = []
files_array_MRG = []

for files in files_import_Q:
    qs_matrix = pd.read_csv(files, delimiter=" ", header=None)
    files_array_Q.append(qs_matrix)

for files in files_import_F:
    in_fam = pd.read_csv(files, delimiter=" ", header=None)
    files_array_F.append(in_fam)
For example, reading files named file1.Q, file2.Q, file3.Q, file4.Q should give the array files_array_Q = [file1, file2, file3, file4].
You can use the Pandas string split function. This solution assumes that the file names do not have extraneous periods.
q_list = pd.DataFrame(['file1.Q', 'file2.Q', 'file3.Q'], columns=['files'])

# split into file, type
q_split = pd.DataFrame(q_list.files.str.split('.', 1).tolist())

# now get only the first column:
q_name_only = q_split[q_split.columns[0]]
You can combine these two steps into one line using Pandas iloc to choose the column by its integer location:
q_name_only = pd.DataFrame(q_list.files.str.split('.',1).tolist()).iloc[:,0]
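An equivalent spelling on the same q_list, using the .str accessor to take the first piece directly and skip the intermediate DataFrame:
q_name_only = q_list['files'].str.split('.', n=1).str[0]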

Manipulating the values of each file in a folder using a dictionary and loop

How do I go about manipulating each file in a folder based on values pulled from a dictionary? Basically, say I have x files in a folder. I use pandas to reformat the dataframe, add a column which includes the date of the report, and save the new file under the same base name plus the date.
import pandas as pd
from pathlib2 import Path
import os

source = Path("Users/Yay/AlotofFiles/April")
items = os.listdir(source)
d_dates = {'0401': '04/1/2019', '0402': '4/2/2019', '0403': '04/03/2019'}

for item in items:
    for key, value in d_dates.items():
        df = pd.read_excel(item, header=None)
        df.columns = ['A', 'B', 'C']
        df = df[df['A'].str.contains("Awesome")]
        df['Date'] = value
        file_basic = "retrofile"
        short_date = key
        xlsx = ".xlsx"
        file_name = file_basic + short_date + xlsx
        df.to_excel(file_name)
I want each file to be unique and categorized by date. In this case, I would want three files, for example "retrofile0401.xlsx", each with a column containing "04/01/2019" and only the data relevant to the original file.
The actual result: the loop processes each individual item, creates three different files with those values, moves on to the next file, and repeats, replacing the previous iteration, until I am left with three files that are copies of the last file. The only differences are that each file has a different date and a different name. The naming is what I want, but it's duplicating the data from the last file.
If I remove the second loop, it works the way I want, but then there's no way of categorizing the files based on the values I made in the dictionary.
Try the following. I'm only making input filenames explicit to make clear what's going on. You can continue to use yours from the source.
input_filenames = [
    'retrofile0401_raw.xlsx',
    'retrofile0402_raw.xlsx',
    'retrofile0403_raw.xlsx',]

date_dict = {
    '0401': '04/1/2019',
    '0402': '4/2/2019',
    '0403': '04/03/2019'}

for filename in input_filenames:
    date_key = filename[9:13]
    df = pd.read_excel(filename, header=None)
    df.columns = ['A', 'B', 'C']   # restore the column names, as in the question
    df = df[df['A'].str.contains("Awesome")]
    df['Date'] = date_dict[date_key]
    df.to_excel('retrofile{date_key}.xlsx'.format(date_key=date_key))
filename[9:13] takes characters #9-12 from the filename. Those are the ones that correspond to your date codes.
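If the 'retrofile' prefix ever changes length, a regex is a safer way to pull out the date code than a fixed slice; a small sketch assuming the same naming:
import re

date_key = re.search(r'(\d{4})', 'retrofile0401_raw.xlsx').group(1)   # '0401'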
