Say I have a folder folder1 containing Excel files whose filenames share the same structure: city, building name, and ID. I want to save these parts in a dataframe and then write them to an Excel file. Note that I also need to append the Excel filenames from other folders to the result.
bj-LG center-101012.xlsx
sh-ABC tower-1010686.xlsx
bj-Jinzhou tower-101018.xlsx
gz-Zijin building-101012.xls
...
The first method I tried:
import os
import pandas as pd
from pandas import DataFrame, ExcelWriter
path = os.getcwd()
file = [".".join(f.split(".")[:-1]) for f in os.listdir() if os.path.isfile(f)] #exclude files' extension
city = file.split('-')[0]
projectName = file.split('-')[1]
projectID = file.split('-')[2]
#print(city)
df = pd.DataFrame(columns = ['city', 'building name', 'id'])
df['city'] = city
df['building name'] = projectName
df['id'] = projectID
writer = pd.ExcelWriter("C:/Users/User/Desktop/test.xlsx", engine='xlsxwriter')
df.to_excel(writer, index = False)
writer.save()
Problem:
Traceback (most recent call last):
File "<ipython-input-203-c09878296e72>", line 9, in <module>
city = file.split('-')[0]
AttributeError: 'list' object has no attribute 'split'
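The traceback happens because the list comprehension produces a list, while split is a string method, so it has to be applied to each element. A minimal sketch, using sample names from the question:

```python
# `file` in the question is a list of names, so .split must run per element
files = ["bj-LG center-101012", "sh-ABC tower-1010686"]
cities = [f.split("-")[0] for f in files]
print(cities)  # ['bj', 'sh']
```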
My second method:
for root, directories, files in os.walk(path):
    #print(root)
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(file)
            city = file.split('-')[0]
            projectName = file.split('-')[1]
            projectID = file.split('-')[2]
            #print(city)
            df = pd.DataFrame(columns = ['city', 'building name', 'id'])
            df['city'] = city
            df['building name'] = projectName
            df['id'] = projectID
writer = pd.ExcelWriter("C:/Users/User/Desktop/test.xlsx", engine='xlsxwriter')
df.to_excel(writer, index = False)
writer.save()
I got an empty test.xlsx file. How could I make it work? Thanks.
This splits off the file extension, then unpacks the split into the variables.
It then creates a dictionary and appends the dictionary to the dataframe.
import pandas as pd

files = [
    "bj-LG center-101012.xlsx",
    "sh-ABC tower-1010686.xlsx",
    "bj-Jinzhou tower-101018.xlsx",
    "gz-Zijin building-101012.xls"]
df = pd.DataFrame()
for file in files:
    filename = file.split(".")[0]
    city, projectName, projectID = filename.split("-")
    d = {'city': city, 'projectID': projectID, 'projectName': projectName}
    df = df.append(d, ignore_index=True)  # DataFrame.append was removed in pandas 2.0; use pd.concat there
df.to_excel('summary.xlsx')
Method 2 is close.
You need to create the dataframe before the for loops. After your variable assignments, make a dictionary of the variables and append it to the dataframe.
There is also probably a better way to find your file list using glob, but I will just work with what you have already done.
df = pd.DataFrame()
for root, directories, files in os.walk(path):
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(file)
            city = file.split('-')[0]
            projectName = file.split('-')[1]
            projectID = file.split('-')[2]
            # append data inside the inner loop
            d = {'city': city, 'building name': projectName, 'id': projectID}
            df = df.append(d, ignore_index=True)
writer = pd.ExcelWriter("C:/Users/User/Desktop/test.xlsx", engine='xlsxwriter')
df.to_excel(writer, index = False)
writer.save()
This works now, thanks to the hint about glob from @Dan Wisner:
import os
import pandas as pd
from glob import glob
fileNames = [os.path.splitext(val)[0] for val in glob('*.xlsx') + glob('*.xls')]  # '+' collects both extensions; 'or' would keep only the first non-empty list
df = pd.DataFrame({'fileNames': fileNames})
df[['city', 'name', 'id']] = df['fileNames'].str.split('-', n=2, expand=True)
del df['fileNames']
writer = pd.ExcelWriter("C:/Users/User/Desktop/test1.xlsx", engine='xlsxwriter')
df.to_excel(writer, index = False)
writer.save()
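A pathlib-based variant of the same split logic, shown here on the sample names from the question (in practice the list would come from Path('folder1').glob('*.xls*')):

```python
from pathlib import Path
import pandas as pd

# Path.stem drops the extension; these sample paths mirror the question's files
paths = [Path("bj-LG center-101012.xlsx"), Path("gz-Zijin building-101012.xls")]
df = pd.DataFrame({'fileNames': [p.stem for p in paths]})
df[['city', 'name', 'id']] = df['fileNames'].str.split('-', n=2, expand=True)
print(df[['city', 'name', 'id']])
```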
Related
How can I select the last rows in text files with a for loop?
This is my first idea for the code:
import glob
import pandas as pd
path = input("Insert location:")
file_list = glob.glob(path + "/*.txt")
txt_list = []
for file in file_list:
    txt_list.append(pd.read_csv(file))
for file in file_list:
    txt_list[-7::3]
excl_merged = pd.concat(txt_list, ignore_index=True)
excl_merged.to_excel('Total.xlsx', index=False)
Your code is incorrect. Here is a version that should work:
import glob
import pandas as pd
path = input("Insert location:")
file_list = glob.glob(path + "/*.txt")
df_list = []
for file in file_list:
    df = pd.read_csv(file)
    df_list.append(df.tail(3))  # last 3 rows from each file dataframe
excl_merged = pd.concat(df_list, ignore_index=True)
excl_merged.to_excel('Total.xlsx', index=False)
Explanation: the tail() method returns the last several rows (the count is passed as an argument) of a dataframe.
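For example:

```python
import pandas as pd

df = pd.DataFrame({'a': range(10)})
last3 = df.tail(3)          # keeps the rows where a == 7, 8, 9
print(last3['a'].tolist())  # [7, 8, 9]
```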
How do I make my df.to_excel function write to an output path? After my script runs, I do not see the files in the output_path directory I have defined.
import pandas as pd
from openpyxl import load_workbook
import os
import datetime
output_path = 'C:/Users/g/Desktop/autotranscribe/python/Processed'
path = 'C:/Users/g/Desktop/autotranscribe/python/Matching'
cols_to_drop = ['PSI ID','PSIvet Region','PSIvet region num','Fax','County']
column_name_update_map = {'Account name': 'Company Name','Billing address':'Address','Billing city':'City','Billing State':'State'}
for file in os.listdir("C:/Users/g/Desktop/autotranscribe/python/Matching"):
    if file.startswith("PSI") and "(updated headers)" not in file:
        dfs = pd.read_excel(file, sheet_name=None, skiprows=5)
        output = dict()
        for ws, df in dfs.items():
            if ws.startswith("Cancelled Members"): df = df.drop('Active date', axis=1)
            if any(ws.startswith(x) for x in ["New Members","PVCC"]):
                continue
            #if ws in ["New Members 03.22","PVCC"]: #sheetstoavoid
            temp = df
            dt = pd.to_datetime(os.path.getctime(os.path.join(path,file)),unit="s").replace(nanosecond=0)
            output[ws] = temp
        writer = pd.ExcelWriter(f'{file.replace(".xlsx","")} (updated headers).xlsx')
        for ws, df in output.items():
            df.to_excel(writer, index=None, sheet_name=ws)
        writer.save()
        writer.close()
I tried df.to_excel(writer, output_path, index=None, sheet_name=ws), but I get an error:
File "", line 36, in
df.to_excel(writer,output_path, index=None, sheet_name=ws)
TypeError: to_excel() got multiple values for argument 'sheet_name'.
A few comments:
The function os.listdir() only returns "unqualified" file names, so before using file, we need to prepend path using something like input_file_name = f'{path}/{file}'.
Similarly, pd.ExcelWriter() will need a qualified file name (that is, including the path as well as the "unqualified" file name), which we can get by doing this: output_file_name = f'{output_path}/{file.replace(".xlsx","")} (updated headers).xlsx'.
There are some elements of the code in your question that may not be getting used, but rather than comment on or change those, I provide a working version with minimal changes below.
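Put together, the two qualified names from the comments above look like this (the file name is a hypothetical example):

```python
path = 'C:/Users/g/Desktop/autotranscribe/python/Matching'
output_path = 'C:/Users/g/Desktop/autotranscribe/python/Processed'
file = 'PSI 123.xlsx'  # hypothetical input file name

input_file_name = f'{path}/{file}'
output_file_name = f'{output_path}/{file.replace(".xlsx","")} (updated headers).xlsx'
print(output_file_name)
```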
I created directories named Matching and Processed. I placed a file named PSI 123.xlsx in Matching with a tab named Cancelled Members containing the following:
will skip
will skip
will skip
will skip
will skip
Col1 Col2 Col3 Active date
xx NY 110 789
I then ran the following modification to your code (note the changes to output_path and path for testing purposes in my environment):
import pandas as pd
from openpyxl import load_workbook
import os
import datetime
#output_path = 'C:/Users/g/Desktop/autotranscribe/python/Processed'
#path = 'C:/Users/g/Desktop/autotranscribe/python/Matching'
output_path = './Processed'
path = './Matching'
cols_to_drop = ['PSI ID','PSIvet Region','PSIvet region num','Fax','County']
column_name_update_map = {'Account name': 'Company Name','Billing address':'Address','Billing city':'City','Billing State':'State'}
for file in os.listdir(path):
    if file.startswith("PSI") and "(updated headers)" not in file:
        input_file_name = f'{path}/{file}'
        dfs = pd.read_excel(input_file_name, sheet_name=None, skiprows=5)
        output = dict()
        for ws, df in dfs.items():
            if ws.startswith("Cancelled Members") and 'Active date' in df.columns: df = df.drop('Active date', axis=1)
            if any(ws.startswith(x) for x in ["New Members","PVCC"]):
                continue
            #if ws in ["New Members 03.22","PVCC"]: #sheetstoavoid
            temp = df
            dt = pd.to_datetime(os.path.getctime(os.path.join(path,file)),unit="s").replace(nanosecond=0)
            output[ws] = temp
        output_file_name = f'{output_path}/{file.replace(".xlsx","")} (updated headers).xlsx'
        writer = pd.ExcelWriter(output_file_name)
        for ws, df in output.items():
            df.to_excel(writer, index=None, sheet_name=ws)
        writer.save()
        writer.close()
After running, the code had created a new file in Processed named PSI 123 (updated headers).xlsx with sheets named as in the input. The sheet Cancelled Members contained the following:
Address State Zip Status Status.1 Date Partner
Col1 Col2 Col3
xx NY 110
I have 100 csv files. I want to print particular columns from all the csv files along with each file name. With this code I can print all of the csv files.
path = r'F:\11 semister\TPC_MEMBER'
all_files = glob.glob(path + "/*.csv")
dataStorage = {}
for filename in all_files:
    name = os.path.basename(filename).split(".csv")[0]
    dataStorage[name] = pd.read_csv(filename)
    print(name)
dataStorage
Maybe you want this.
import pandas as pd
import numpy as np
import glob
path = r'folderpath' #provide your folder path where your csv files are stored.
all_csv= glob.glob(path + "/*.csv")
li = []
for filename in all_csv:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
data_frame = pd.concat(li, axis=0, ignore_index=True)
data_frame['columnname'] # enter the name of your dataframe's column.
print(data_frame)
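The concat above drops the information about which file each row came from. If that matters, one hedged variant is to tag each frame with a source column before concatenating (the frame contents here are made-up stand-ins for pd.read_csv results):

```python
import pandas as pd

# stand-ins for frames returned by pd.read_csv; 'member' is a placeholder column
frames = {
    'file1': pd.DataFrame({'member': ['a', 'b']}),
    'file2': pd.DataFrame({'member': ['c']}),
}
tagged = [df.assign(source=name) for name, df in frames.items()]
data_frame = pd.concat(tagged, axis=0, ignore_index=True)
print(data_frame)
```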
I am reading in multiple files and adding them to a list:
import pandas as pd
import glob
import ntpath
path = r'C:\Folder1\Folder2\Folder3\Folder3'
all_files = glob.glob(path + "/*.dat") #.dat files only
mylist = []
for filename in all_files:
    name = ntpath.basename(filename) # for renaming the DF
    name = name.replace('.dat','') # remove extension
    try:
        name = pd.read_csv(filename, sep='\t', engine='python')
        mylist.append(name)
    except:
        print(f'File not read:(unknown)')
Now I want to just display the DFs in this list.
This is what I've tried:
for thing in mylist:
    print(thing.name)
AttributeError: 'DataFrame' object has no attribute 'name'
And
for item in mylist:
    print(item)
But that just prints the whole DF content.
name = pd.read_csv(filename, sep='\t', engine='python')
mylist.append(name)
Here, name is a dataframe, not the name of your dataframe.
To add name to your dataframe, use
df = pd.read_csv(filename, sep='\t', engine='python')
df_name="Sample name"
mylist.append({'data':df, 'name':df_name})
>>> print(thing['name'])
Sample name
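End to end, the list-of-dicts idea looks like this (sample data replaces the real .dat reads):

```python
import pandas as pd

mylist = []
for df_name in ['fileA', 'fileB']:        # stand-ins for the .dat base names
    df = pd.DataFrame({'x': [1, 2]})      # stand-in for pd.read_csv(...)
    mylist.append({'data': df, 'name': df_name})

for thing in mylist:
    print(thing['name'])                  # prints fileA then fileB
```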
You can use a dictionary for that.
Writing to dict:
import pandas as pd
import glob
import ntpath
path = r'C:\Folder1\Folder2\Folder3\Folder3'
all_files = glob.glob(path + "/*.dat") #.dat files only
mydict = {}
for filename in all_files:
    name = ntpath.basename(filename) # for renaming the DF
    name = name.replace('.dat','') # remove extension
    try:
        mydict[name] = pd.read_csv(filename, sep='\t', engine='python')
    except:
        print(f'File not read:(unknown)')
To read a df (say filename1) again:
df = mydict['filename1']
or to iterate over all df's in mydict:
for df in mydict.values():
    # use df...
or:
for key in mydict:
    print(key)
    df = mydict[key]
    # use df...
Each folder has a csv for each month of the year (1.csv, 2.csv, 3.csv, etc.), and the script creates a dataframe combining the 9th column of all 12 csv's into an xlsx sheet named concentrated.xlsx. It works, but only for one directory at a time.
files = glob('2014/*.csv')
sorted_files = natsorted(files)
def read_9th(fn):
    return pd.read_csv(fn, usecols=[9], names=headers)
big_df = pd.concat([read_9th(fn) for fn in sorted_files], axis=1)
writer = pd.ExcelWriter('concentrated.xlsx', engine='openpyxl')
big_df.to_excel(writer,'2014')
writer.save()
Is it possible to create a dataframe automatically for each directory without having to manually create one for each folder like this:
files14 = glob('2014/*.csv')
files15 = glob('2015/*.csv')
sorted_files14 = natsorted(files14)
sorted_files15 = natsorted(files15)
def read_9th(fn):
    return pd.read_csv(fn, usecols=[9], names=headers)
big_df = pd.concat([read_9th(fn) for fn in sorted_files14], axis=1)
big_df1 = pd.concat([read_9th(fn) for fn in sorted_files15], axis=1)
writer = pd.ExcelWriter('concentrated.xlsx', engine='openpyxl')
big_df.to_excel(writer,'2014')
big_df1.to_excel(writer,'2015')
writer.save()
If you get a list of the folders that you want to process, e.g.
folders = os.listdir('.')
# or
folders = ['2014', '2015', '2016']
You could do something like:
writer = pd.ExcelWriter('concentrated.xlsx', engine='openpyxl')
for folder in folders:
    files = glob('%s/*.csv' % folder)
    sorted_files = natsorted(files)
    big_df = pd.concat([read_9th(fn) for fn in sorted_files], axis=1)
    big_df.to_excel(writer, folder)
writer.save()
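To avoid typing the folder list by hand, entries that look like four-digit years could be filtered out of a directory listing (a sketch; the sample list stands in for os.listdir('.')):

```python
import re

# stand-in for os.listdir('.'); only year-like names survive the filter
entries = ['2014', '2015', '2016', 'concentrated.xlsx', 'notes']
folders = sorted(e for e in entries if re.fullmatch(r'\d{4}', e))
print(folders)  # ['2014', '2015', '2016']
```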