I am trying to read multiple Excel files into a data frame, but I can't seem to find a way to keep the file name as a column referencing where each row came from. I also need to filter on the name of the Excel file and its creation date before calling read_excel (there are so many files that I do not want to read the ones I don't need). This is what I have:
import os
import pandas as pd

res = []
for root, dirs, files in os.walk('.../Minutes/', topdown=True):
    if len(files) > 0:
        res.extend(zip([root]*len(files), files))

df = pd.DataFrame(res, columns=['Path', 'File_Name'])
df['FullDir'] = df.Path + '\\' + df.File_Name
list_ = []
for f in df["FullDir"]:
    data = pd.read_excel(f, sheet_name=1)
    list_.append(data)

df2 = pd.concat(list_)
df2
What I would like as an output
A B filename File Date Created
0 a a File1 1-1-2018
1 b b File1 1-1-2018
2 c c File2 2-1-2018
3 a a File2 2-1-2018
Any help would be greatly appreciated!!
You can use concat with keys, then reset_index:
res = []
for root, dirs, files in os.walk('.../Minutes/', topdown=True):
    if len(files) > 0:
        res.extend(zip([root]*len(files), files))

df = pd.DataFrame(res, columns=['Path', 'File_Name'])
df['FullDir'] = df.Path + '\\' + df.File_Name
Assuming the code above works as expected:
list_ = []
for f in df["FullDir"]:
    data = pd.read_excel(f, sheet_name=1)
    list_.append(data)

df2 = pd.concat(list_, keys=df.File_Name.values.tolist()).reset_index(level=0)
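The question also asked about filtering on the file name and creation date before reading anything. A minimal sketch of that step, assuming the filters are a `.xlsx` suffix and a 30-day cutoff (a temporary directory with dummy files stands in for `.../Minutes/`; note that `os.path.getctime` is creation time on Windows but metadata-change time on Unix):

```python
import os
import tempfile
from datetime import datetime, timedelta

# Build the file list first (as above), then filter it *before* any
# read_excel call, so unwanted files are never opened.
with tempfile.TemporaryDirectory() as top:
    for name in ('Minutes_Jan.xlsx', 'Minutes_Feb.xlsx', 'notes.txt'):
        open(os.path.join(top, name), 'w').close()  # dummy stand-in files

    res = []
    for root, dirs, files in os.walk(top, topdown=True):
        res.extend(zip([root] * len(files), files))

    cutoff = datetime.now() - timedelta(days=30)
    keep = [os.path.join(root, f) for root, f in res
            if f.endswith('.xlsx')  # name filter
            and datetime.fromtimestamp(os.path.getctime(os.path.join(root, f))) > cutoff]
    selected = sorted(os.path.basename(p) for p in keep)
print(selected)
```

Only the paths in `keep` would then go through the read_excel loop.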
Related
I'm a newbie in Python and need help with this piece of code. I did a lot of searching to get to this stage but couldn't fix it on my own. Thanks in advance for your help.
What I'm trying to do: I have to compare 100+ CSV files in a folder, and not all of them have the same number of columns or the same column names. So I'm trying to use Python to read the headers of each file and write them to a CSV file in an output folder.
I got to this point, but I'm not sure if I'm even on the right path:
import pandas as pd
import glob

path = r'C:\Users\user1\Downloads\2016GAdata' # use your path
all_files = glob.glob(path + "/*.csv")

list1 = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    list1.append(df)

frame = pd.concat(list1, axis=0, ignore_index=True)
print(frame)
thanks for your help!
You can create a dictionary whose keys are the filenames and whose values are the dataframe columns. Creating a dataframe from this dictionary gives you the filenames as the index and the column names as the values:
d = {}
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    d[filename] = df.columns

frame = pd.DataFrame.from_dict(d, orient='index')
0 1 2 3
file1 Fruit Date Name Number
file2 Fruit Date Name None
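With `frame` built that way, one row per file, you can write it out or spot the files missing a given column. A small sketch with made-up header lists standing in for the ones read from disk (`file1.csv`/`file2.csv` are hypothetical names):

```python
import pandas as pd

# Toy stand-ins for the headers collected in the loop above
d = {
    'file1.csv': ['Fruit', 'Date', 'Name', 'Number'],
    'file2.csv': ['Fruit', 'Date', 'Name'],
}
frame = pd.DataFrame.from_dict(d, orient='index')  # shorter rows padded with NaN

# e.g. find the files that lack a given column
missing_number = [f for f, cols in d.items() if 'Number' not in cols]
# frame.to_csv('headers.csv') would write the comparison table to a file
print(missing_number)
```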
L = [('X1',"A"), ('X2',"B"), ('X3',"C")]
for i in range(len(L)):
    path = os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I can't create several dataframes from several Excel files, because I don't know how to create the variable names.
I'll need a result that looks like this :
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved :
d = {}
for i, value in L:
    path = os.path.join(value + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    df = pd.read_excel(xls, 'Sheet1')
    key = 'df-' + str(i)
    d[key] = df
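The dict then replaces the dynamic variables: `d['df-X1']` plays the role X1 was meant to play. A toy sketch, with tiny in-memory frames standing in for the sheets read from A.xlsx, B.xlsx:

```python
import pandas as pd

# Stand-ins for the frames read from the Excel files
L = [('X1', 'A'), ('X2', 'B')]
d = {'df-' + i: pd.DataFrame({'source': [value]}) for i, value in L}

# Look the frame up by key instead of inventing a variable name
print(d['df-X1']['source'].iloc[0])
```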
Main pull:
I would approach this by reading everything into 1 dataframe (loop over files, and concat):
import os
import pandas as pd

files = []  # generate list for files to go into
path_of_directory = "path/to/folder/"
for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

output_data = []  # blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)

total = pd.concat(output_data, ignore_index=True, sort=True)
Then:
From there you can interrogate the df by using df.loc[df['name'] == 'choice']
Or (in keeping with your question):
You could then split into a dictionary of dataframes, based on this column. This is the best approach...
import copy

# wrapped in a function so the return statement is valid
def split_frame(df, column):
    dictionary = {}
    df[column] = df[column].astype(str)
    col_values = df[column].unique()
    for value in col_values:
        key_name = 'df' + str(value)
        dictionary[key_name] = copy.deepcopy(df)
        dictionary[key_name] = dictionary[key_name][df[column] == value]
        dictionary[key_name].reset_index(inplace=True, drop=True)
    return dictionary
The reason for this approach is discussed here:
Create new dataframe in pandas with dynamic names also add new column, which basically says that dynamic naming of dataframes is bad and this dict approach is best.
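For what it's worth, `groupby` does the same split more compactly, without the deepcopy. A toy run (column name and values are made up here):

```python
import pandas as pd

# One frame per unique 'name' value, keyed the same way as above
df = pd.DataFrame({'name': ['a.xlsx', 'a.xlsx', 'b.xlsx'], 'val': [1, 2, 3]})
dictionary = {'df' + value: group.reset_index(drop=True)
              for value, group in df.groupby('name')}
print(sorted(dictionary))
```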
This might help.
files_xls = ['all your excel filenames go here']

list_ = []
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    list_.append(data)
# DataFrame.append was removed in pandas 2.0, so collect and concat instead
df = pd.concat(list_, ignore_index=True)
print(df)
I have a for loop that imports all of the Excel files in the directory and merges them into a single dataframe. However, I want to create a new column where each row takes the string of the filename of the Excel file it came from.
Here is my import and merge code:
path = os.getcwd()
files = os.listdir(path)

df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1','col2'])
    df = df.append(data)
For example if first Excel file is named "file1.xlsx", I want all rows from that file to have value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx" I want all rows from that file to have value file2.xlsx. Notice that there is no real pattern of the Excel files, and I just use those names as an example.
Many thanks
Create the new column inside the loop:
dfs = []
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1','col2'])
    data['col3'] = f
    dfs.append(data)
# DataFrame.append was removed in pandas 2.0, so collect and concat instead
df = pd.concat(dfs, ignore_index=True)
Another possible solution with a list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1','col2']).assign(col3=f)
       for f in files]
df = pd.concat(dfs)
I have the following structure of text files in folders and subfolders.
I want to read them all and create a df. I am using this code, but it doesn't work well for me: the text is not what I expect, and the number of files read doesn't match my count.
l = [pd.read_csv(filename, header=None, encoding='iso-8859-1') for filename in glob.glob("2018_01_01/*.txt")]
main_df = pd.concat(l, axis=1)
main_df = main_df.T

for i in range(2):
    l = [pd.read_csv(filename, header=None, encoding='iso-8859-1', quoting=csv.QUOTE_NONE) for filename in glob.glob(str(foldernames[i+1]) + '/' + '*.txt')]
    df = pd.concat(l, axis=1)
    df = df.T
    main_df = pd.merge(main_df, df)
Assuming those directories contain txt files whose information has the same structure in all of them:
import os
import pandas as pd

path = '/path/to/directory/of/directories/'
frames = []
for directory in os.listdir(path):
    full_dir = os.path.join(path, directory)  # join with path so isdir works from any cwd
    if os.path.isdir(full_dir):
        for filename in os.listdir(full_dir):
            with open(os.path.join(full_dir, filename)) as f:
                observation = f.read()
            frames.append(pd.DataFrame({'observation': [observation]}))

# DataFrame.append was removed in pandas 2.0, so collect and concat instead
df = pd.concat(frames, ignore_index=True)
Once all your files have been iterated, df should be the DataFrame containing all the information in your different txt files.
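The same traversal can be written with pathlib, whose `rglob` walks every subfolder in one call. A sketch, with a temporary tree standing in for the real folder structure:

```python
import tempfile
from pathlib import Path
import pandas as pd

with tempfile.TemporaryDirectory() as root:
    # Build a stand-in subfolder with two txt files
    sub = Path(root) / '2018_01_01'
    sub.mkdir()
    (sub / 'a.txt').write_text('first observation')
    (sub / 'b.txt').write_text('second observation')

    # rglob('*.txt') finds every txt file at any depth under root
    rows = [{'observation': p.read_text(), 'file': p.name}
            for p in sorted(Path(root).rglob('*.txt'))]
    df = pd.DataFrame(rows)
print(df)
```

Keeping the file name in a column (as in the other answers) makes it easy to trace each observation back to its source.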
You can do that using a for loop. But before that, you need to give sequential names to all the files, like 'fil_0' within 'fol_0', 'fil_1' within 'fol_1', 'fil_2' within 'fol_2' and so on. That would facilitate the use of a for loop:
import pandas as pd

dataframes = []
for var in range(1000):
    name = "fol_" + str(var) + "/fil_" + str(var) + ".txt"
    dataframes.append(pd.read_csv(name))  # if you need all the files at once
    # otherwise, read them one by one:
    # df = pd.read_csv(name)
It will automatically create dataframes for each file.
I have several Excel files whose filenames are differentiated by different dates. I have to concatenate all these files, with the dates from the filenames as the index. I have written the following code below:
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\"
fileName = glob.glob(os.path.join(path, "*.xlsx"))
df = (pd.read_excel(f, header=None, sheetname = "YTD Summary_4") for f in fileName)
k = (re.search("([0-9]{1,2}\-[0-9]{1,2}\-[0-9]{4})", fileName))
concatenated_df = pd.concat(df, index=k)
concatenated_df.to_csv('tableau7.csv')
What I have done here is first define a directory, then assign all the xlsx files to fileName. I read each file into a dataframe, used a regular expression to get the date from the filename, and assigned it to the variable k. Then I concatenate the files to get the output CSV file. But the code gives an error: TypeError: expected string or bytes-like object. Can somebody tell me what I am doing wrong?
You can use:
#simplify for add *.xlsx to path
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\*.xlsx"
fileName = glob.glob(path)
#create list of DataFrames dfs (sheetname was renamed sheet_name in newer pandas)
dfs = [pd.read_excel(f, header=None, sheet_name="YTD Summary_4") for f in fileName]

#add parameter keys for filenames, remove second level of multiindex
concatenated_df = pd.concat(dfs, keys=fileName).reset_index(level=1, drop=True)

#extract dates and convert to DatetimeIndex
pat = r'(\d{1,2}-\d{1,2}-\d{4})'
concatenated_df.index = pd.to_datetime(concatenated_df.index.str.extract(pat, expand=False))
print (concatenated_df)
A little mod:
path = r"C:\\Users\\atcs\\Desktop\\data science\\files\\1-Danny Jones KPI's\\Source\\*.xlsx"
fileName = glob.glob(path)

l = []
for f in fileName:
    df = pd.read_excel(f, header=None, sheet_name="YTD Summary_4")
    df['date'] = f
    l.append(df)

concatenated_df = pd.concat(l).set_index('date')
concatenated_df.to_csv('tableau7.csv')
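That leaves the full filename as the index; since the question wanted the dates themselves, the m-d-yyyy part can still be pulled out of the index and parsed. A sketch with toy filenames (the date format is assumed from the question's regex):

```python
import pandas as pd

# Stand-in for an index of filenames containing dates
idx = pd.Index(['KPI 1-1-2018.xlsx', 'KPI 2-1-2018.xlsx'])

# Extract the date substring, then parse it as month-day-year
dates = pd.to_datetime(idx.str.extract(r'(\d{1,2}-\d{1,2}-\d{4})', expand=False),
                       format='%m-%d-%Y')
print(dates)
```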