How to import multiple Excel files and manipulate them individually - Python

I have to analyze 13 different Excel files and I want to read them all into Jupyter at once, instead of reading them all individually. I also want to be able to access the contents individually. So far I have this:
path = r"C:\Users\giova\PycharmProjects\DAEB_prijzen\data"
filenames = glob.glob(path + "\*.xlsx")
df_list = []
for file in filenames:
df = pd.read_excel(file, usecols=['Onderhoudsordernr.', 'Oorspronkelijk aantal', 'Bedrag (LV)'])
print(file)
print(df)
df_list.append(df)
When I run the code, the output looks like one big list with some data missing, which I don't want. Can anyone help? :(

This seems like a problem that can be solved with a for loop and a dictionary.
Read the path location of your files:
import os

path = 'C:/your path'
paths = os.listdir(path)
Initialize an empty dictionary and fill it in a loop (note that os.listdir returns bare filenames, so join them back onto path before reading):
my_files = {}
for i, p in enumerate(paths):
    my_files[i] = pd.read_excel(os.path.join(path, p))
Then you can access your files individually simply by calling the key in the dictionary:
my_files[i]
where i = 0, 1, ..., 12 (enumerate starts counting at 0).
Alternatively, if you want to assign a name to each file, you can either create a list of names or derive one from each filepath with some slice/regex function on the strings.
Assuming the first case:
names = ['excel1', ...]
for name, p in zip(names, paths):
    my_files[name] = pd.read_excel(os.path.join(path, p))
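For the second case (deriving the name from the filepath), a minimal sketch; the splitext/basename combination is one option among several:
import os
import pandas as pd

my_files = {}
for p in paths:
    # e.g. 'prices_2021.xlsx' -> 'prices_2021': the filename without its extension
    name = os.path.splitext(os.path.basename(p))[0]
    my_files[name] = pd.read_excel(os.path.join(path, p))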

Related

How to read several xlsx-files in a folder into a pandas dataframe

I have a folder containing 48 xlsx files, of which only 22 are relevant. The names of these 22 files have no structure; the only thing they have in common is that they start with data. I would love to access these files and read them all into a dataframe. Doing this manually with the line
df = pd.read_excel(filename, engine='openpyxl')
takes too long.
The table structure is similar but not always exactly the same. How can I solve this problem?
import os
import pandas as pd

dfs = {}

def get_files(extension, location):
    xlsx_list = []
    for root, dirs, files in os.walk(location):
        for t in files:
            if t.endswith(extension):
                xlsx_list.append(t)  # bare name: assumes the files sit directly in `location`
    return xlsx_list

file_list = get_files('.xlsx', '.')
for filename in file_list:
    dfs[filename] = pd.read_excel(filename, engine='openpyxl')
print(dfs)
Each element of dfs, e.g. dfs['file_name_here.xlsx'], accesses the dataframe returned by the corresponding read_excel call.
EDIT: you can add additional criteria to filter the xlsx files at the line if t.endswith(extension):. You can check the beginning of the filename too, e.g. if t.startswith('data'):, or combine them: if t.startswith('data') and t.endswith(extension):
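A sketch of the walk with both filters in place (the prefix parameter is my addition, defaulting to the 'data' prefix from the question; it collects full paths so files in subfolders load too):
import os

def get_data_files(extension, location, prefix='data'):
    # collect files that both start with `prefix` and end with `extension`
    matches = []
    for root, dirs, files in os.walk(location):
        for t in files:
            if t.startswith(prefix) and t.endswith(extension):
                matches.append(os.path.join(root, t))
    return matches

relevant = get_data_files('.xlsx', '.')  # only the 22 files named data*.xlsx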

Pandas: Reading files with regex

I am trying to read multiple Excel files using wildcards and to put them into separate dataframes with pandas.
I have read the base path and will be using the following to access the subdirectories:
>>>inputs_path
'C:/Users/ABC/Downloads/Input'
>>>path1 = os.chdir(inputs_path + "/path1")
>>>fls=glob.glob("*.*")
>>>fls
['Zambia_W4.xlsm',
'Australia_W4.xlsx',
'France_W4.xlsx',
'Japan_W3.xlsm',
'India_W3.xlsx',
'Italy_W3.xlsx',
'MEA_W5.xlsx',
'NE_W5.xlsm',
'Russia_W5.xlsx',
'Spain_W2.xlsx']
>>>path2 = os.chdir(inputs_path + "/path2")
>>>fls=glob.glob("*.*")
>>>fls
['Today.xlsm',
'Yesterday.xlsx',
'Tomorrow.xlsx']
Right now I am reading them as follows:
>>>df_italy = pd.read_excel("Italy_W3.xlsx",sheet_name='Sheet1')
>>>df_russia = pd.read_excel("Russia_W5.xlsx",sheet_name='Sheet3')
>>>df_france_1 = pd.read_excel("France_W4.xlsx",sheet_name='Sheet1', usecols = 'M, Q', skiprows=4)
>>>df_spain = pd.read_excel("Spain_W2.xlsx",sheet_name='Sheet2',usecols = 'T:U', skiprows=30 )
>>>df_ne = pd.read_excel("NE_W5.xlsm",sheet_name='Sheet2',usecols = 'N,P', skiprows=4 )
>>>df_ne_c = pd.read_excel("NE_W5.xlsm",sheet_name='Sheet1',usecols = 'H:J', skiprows=141 )
Since I have the filenames in the list fls, is there a way I could use that list to read the files without having to use the actual filename, since the filename changes with the week number?
It is also mandatory to keep the dataframe names as shown above while reading the Excel files.
I am looking to read the file as
>>>df_italy = pd.read_excel("Italy*.xlsx",sheet_name='Sheet1')
Is there any way to do this?
If your files always have a _ to split on, you could create a dictionary with the split value as the key and the file path as the value.
Let's use pathlib, which was added in Python 3.4+, as it's easier to use with file systems.
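A minimal sketch of that split-based dictionary, assuming inputs_path from the question and one file per country in the folder (with several weeks present, later globbed files would overwrite earlier keys); '*.xls*' catches both .xlsx and .xlsm:
from pathlib import Path
import pandas as pd

location = Path(inputs_path) / "path1"
# 'Italy_W3.xlsx' -> key 'Italy', value the full path
file_dict = {f.stem.split('_')[0]: f for f in location.glob('*.xls*')}
df_italy = pd.read_excel(file_dict['Italy'], sheet_name='Sheet1')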
Regex matching on the filename:
Assuming your dictionary is created as above with filenames as keys and paths as values, we could do this. You'll need to extend the function to deal with multiple file matches.
import re
from pathlib import Path

file_dict = {file.stem: file for file in location.glob('*.xlsx')}
# assume the numbers are paths.
files = {'Zambia_W4.xlsm': 2,
         'Australia_W4.xlsx': 5,
         'France_W4.xlsx': 0,
         'Japan_W3.xlsm': 7,
         'India_W3.xlsx': 2,
         'Italy_W3.xlsx': 6,
         'MEA_W5.xlsx': 7,
         'NE_W5.xlsm': 4,
         'Russia_W5.xlsx': 3,
         'Spain_W2.xlsx': 5}
def file_name_match(file_dict, pattern):
    for name, source in file_dict.items():
        if re.search(pattern, name, flags=re.IGNORECASE):
            return file_dict.get(name)

file_name_match(files, 'italy')
# output: 6

df = pd.read_excel(file_name_match(file_dict, 'italy'), sheet_name=...)
It might be feasible to simply populate a dictionary of dataframes like this:
my_dfs = {}
for f in fls:
    # key on the filename without its extension, e.g. 'Italy_W3'
    my_dfs[f.split(".")[0]] = pd.read_excel(f)
You can also use a for loop just to run the job you need for each file, which shouldn't require knowledge of the file name. Alternatively, you can read all the spreadsheets into one df and make sure there is an additional column holding the corresponding file name for each row, as sketched below.
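A sketch of that single-DataFrame variant (source_file is a column name I chose for illustration):
import pandas as pd

frames = []
for f in fls:
    df = pd.read_excel(f)
    df['source_file'] = f  # remember which spreadsheet each row came from
    frames.append(df)
combined = pd.concat(frames, ignore_index=True)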
The code below assumes you have several files for each country, and need to sort them to find the latest week.
import glob
import os
import re
def find_country_file(country_name):
    all_country_files = glob.glob(os.path.join(inputs_path, '{0}_W*.*'.format(country_name)))
    week_numbers = [re.search('W([0-9]+)', x) for x in all_country_files]
    week_numbers = [int(x.group(1)) for x in week_numbers if x is not None]
    latest_week_number = sorted(week_numbers, reverse=True)[0]
    latest_country_file = [x for x in all_country_files if 'W{0}.'.format(latest_week_number) in x][0]
    return os.path.basename(latest_country_file)
df_italy = pd.read_excel(find_country_file('Italy'), sheet_name='Sheet1')
df_russia = pd.read_excel(find_country_file('Russia'), sheet_name='Sheet3')
df_france_1 = pd.read_excel(find_country_file('France'), sheet_name='Sheet1', usecols='M, Q', skiprows=4)
df_spain = pd.read_excel(find_country_file('Spain'), sheet_name='Sheet2', usecols='T:U', skiprows=30)
df_ne = pd.read_excel(find_country_file('NE'), sheet_name='Sheet2', usecols='N,P', skiprows=4)
df_ne_c = pd.read_excel(find_country_file('NE'), sheet_name='Sheet1', usecols='H:J', skiprows=141)
The find_country_file method searches for all files with the country name in the path, uses regex to pull out the week number, sorts to find the highest number, and then returns the file name from the glob of all country files that matches the latest week found.

Reading text files from folders and subfolders and creating a pandas dataframe with each file's text as one observation

I have the following architecture of the text files in the folders and subfolders.
I want to read them all and create a df. I am using this code, but it doesn't work well for me: the text that comes in is not what I expect, and the number of files read does not match my count.
import csv
import glob
import pandas as pd

l = [pd.read_csv(filename, header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T
for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE) for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
(screenshot of the folder structure)
Assuming those directories contain txt files in which the information has the same structure in all of them:
import os
import pandas as pd

path = '/path/to/directory/of/directories/'
rows = []
for directory in os.listdir(path):
    full_dir = os.path.join(path, directory)  # os.listdir returns bare names
    if os.path.isdir(full_dir):
        for filename in os.listdir(full_dir):
            with open(os.path.join(full_dir, filename)) as f:
                rows.append({'observation': f.read()})
df = pd.DataFrame(rows, columns=['observation'])
Once all your files have been iterated over, df should be the DataFrame containing all the information from your different txt files.
You can do that using a for loop. But before that, you need to give a sequential name to all the files, like 'fil_0' within 'fol_0', 'fil_1' within 'fol_1', 'fil_2' within 'fol_2' and so on. That facilitates the use of a for loop:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need to use all the files at once
    # otherwise, read and process one file at a time:
    # df = pd.read_csv(name)
It will automatically create dataframes for each file.

How to assign loaded tables to different variables in a loop in Python?

I am trying to import n tables in a loop, giving a different name to each table.
I can write the code to import the tables, but I can't give a different name to each table.
The path to find the files (csv):
totalbaseURL = ('path')
Find all the csv files:
folders = os.listdir(totalbaseURL)
i.e., folders becomes ["house","cars","food","pubs"].
folders contains the names of the csv files.
for f in folders:
    path = totalbaseURL + f
    table = pd.read_csv(path, delimiter=';')
What could I put instead of "table"? I would like to use the names in folders: (house, cars, food, pubs).
Use a dict:
folders = ["house", "cars", "food", "pubs"]
di = {}
for f in folders:
    path = totalbaseURL + f
    di[f] = pd.read_csv(path, delimiter=';')
You can use a dict to achieve it:
dataframe_dict = {}
totalbaseURL = ('path')
folders = os.listdir(totalbaseURL)
for f in folders:
    path = totalbaseURL + f
    dataframe_dict[f] = pd.read_csv(path, delimiter=';')
print(dataframe_dict)
Output:
{"house": DataFrame1, "cars": DataFrame2, "food": DataFrame3, "pubs": DataFrame4}
import os

for f in folders:
    path = os.path.join(totalbaseURL, f)
    globals()[f] = pd.read_csv(path, delimiter=';')
This will create global variables house, cars, food, and pubs. Note that this trick only works at module level: inside a function, writing to locals() has no effect in CPython, so prefer the dictionary approach above there.

Loop through files in a directory, add a date column in pandas

All of my files have the following titles and they stretch back a few years. I want to be able to read each file and then add the date from the file name as a column.
Filetype as of 2015-04-01.csv
import os
import pandas as pd

path = 'C:\\Users\\'
filelist = os.listdir(path)   # all of the .csv files I am working with
file_count = len(filelist)    # I thought I could do a for loop and use this as the range
df = pd.Series(filelist)      # I just added this because I couldn't get the date from a list
date_name = df.str[15:-4]     # this gives me the date
So what I have tried is:
for file in filelist:
    df = pd.read_csv(file)
Now I want to take the date_name from the file name and add a column called Date. Every file is exactly the same, but I want to track changes over time, and the only date is found in the name of the file.
Then I will append it.
path = 'C:\\Users\\'
filelist = glob.glob(path + "/*.csv")
frame = pd.DataFrame()
list = []
for file in filelist:
    df = pd.read_csv(file)
    list.append(df)
frame = pd.concat(list)
How can I add the date_name to the file/dataframe? 1) Read the file, 2) Add the date column based on the file name, 3) Read the next file, 4) Add the date column, 5) Append, 6) Repeat for all files in the path
Edit:
I think I got something to work - is this the best way? Can someone explain what the list = [] line is doing here?
path = 'C:\\Users\\'
filelist = os.listdir(path)
list = []
frame = pd.DataFrame()
for file in filelist:
    df2 = pd.read_csv(path + file)
    date_name = file[15:-4]
    df2['Date'] = date_name
    list.append(df2)
frame = pd.concat(list)
This seems like a reasonable way to do it. pd.concat takes a list of pandas objects and concatenates them; append adds each frame to the list as you loop through the files. I see two things to change, though.
First, you don't need frame = pd.DataFrame(). It isn't doing anything, since you are appending the dataframes to the list.
Second, I'd rename the variable list to something else, maybe frames, as that is descriptive of the contents and doesn't already mean something in Python.
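Putting both suggestions together, a sketch of the same loop (identical logic, just without the unused frame initialization and with frames instead of list):
import os
import pandas as pd

path = 'C:\\Users\\'
filelist = os.listdir(path)

frames = []
for file in filelist:
    df2 = pd.read_csv(path + file)
    df2['Date'] = file[15:-4]  # date pulled from 'Filetype as of YYYY-MM-DD.csv'
    frames.append(df2)
frame = pd.concat(frames)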
