Merging Excel files overwrites the first column in Python using Pandas - python

I have a lot of Excel files, and I want to append them using the following code:
import pandas as pd
import glob
import os
import openpyxl

df = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, 'Sheet1')
    data.index = [os.path.basename(f)] * len(data)
    df.append(data)
df = pd.concat(df)

writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'Sheet1')
writer.save()
The Excel files have this structure: (screenshot omitted)
The output is the following: (screenshot omitted)
Why does Python alter the first column when concatenating Excel files?

I think you need:
df = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, 'Sheet1')
    name = os.path.basename(f)
    # create a MultiIndex so the original index is not overwritten
    data.index = pd.MultiIndex.from_product([[name], data.index], names=('files', 'orig'))
    df.append(data)

# reset the index to turn the MultiIndex levels into columns
df = pd.concat(df).reset_index()
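To see what the MultiIndex buys you, here is a minimal, self-contained sketch with toy DataFrames standing in for the Excel files (the file names are hypothetical):

```python
import pandas as pd

# Toy frames standing in for two Excel files (hypothetical names).
a = pd.DataFrame({'col': [1, 2]})
b = pd.DataFrame({'col': [3, 4]})

parts = []
for name, data in [('file1.xlsx', a), ('file2.xlsx', b)]:
    data = data.copy()
    # Wrap the original RangeIndex in a MultiIndex so it survives the concat.
    data.index = pd.MultiIndex.from_product([[name], data.index],
                                            names=('files', 'orig'))
    parts.append(data)

df = pd.concat(parts).reset_index()
print(df)
```

The original row numbers end up in the orig column instead of being overwritten by the file name.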
Another solution is to use the keys parameter of concat:
files = glob.glob("*.xlsx")
names = [os.path.basename(f) for f in files]
dfs = [pd.read_excel(f, 'Sheet1') for f in files]
df = pd.concat(dfs, keys=names).rename_axis(('files','orig')).reset_index()
Which is the same as:
df = []
names = []
for f in glob.glob("*.xlsx"):
    df.append(pd.read_excel(f, 'Sheet1'))
    names.append(os.path.basename(f))
df = pd.concat(df, keys=names).rename_axis(('files', 'orig')).reset_index()
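On toy data (hypothetical file names again), keys builds the outer index level in one step:

```python
import pandas as pd

a = pd.DataFrame({'col': [1, 2]})
b = pd.DataFrame({'col': [3, 4]})

# keys= labels each input frame, creating a two-level index that
# rename_axis names ('files', 'orig') and reset_index turns into columns.
df = (pd.concat([a, b], keys=['file1.xlsx', 'file2.xlsx'])
        .rename_axis(('files', 'orig'))
        .reset_index())
print(df)
```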
Finally, write to Excel with no index and no column names:
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,'Sheet1', index=False, header=False)
writer.save()

Related

Loop through multiple files in a directory and save into a single excel file in different sheets

I have a group of CSV files saved in a folder that I want to loop through, convert into pandas DataFrames, perform a series of operations on, and then save into a single Excel file with each DataFrame in its own sheet.
This is my code so far.
from pathlib import Path
import pandas as pd

dir_b = r'/Desktop/MyProjects'
writer = pd.ExcelWriter('Compiled File.xlsx')
for csv in Path(dir_b).glob('*.csv'):
    df_list = []
    df = pd.read_csv(csv, encoding='ISO-8859-1', engine='python', delimiter=',')
    car_column = df.pop('car')
    df.insert(9, 'car', car_column)
    df_list.append(df)
    for i, df in enumerate(df_list):
        df.to_excel(writer, sheet_name='Sheet' + str(i + 1), index=False)
writer.save()
Everything seems to work except the saving to the Excel file. There is no error when I run the code, but the final Excel file shows only 1 sheet from only 1 DataFrame.
You are emptying your list on every iteration. Try this:
from pathlib import Path
import pandas as pd

dir_b = r'/Desktop/MyProjects'
writer = pd.ExcelWriter('Compiled File.xlsx')
df_list = []
for csv in Path(dir_b).glob('*.csv'):
    df = pd.read_csv(csv, encoding='ISO-8859-1', engine='python', delimiter=',')
    car_column = df.pop('car')
    df.insert(9, 'car', car_column)
    df_list.append(df)
for i, df in enumerate(df_list):
    df.to_excel(writer, sheet_name='Sheet' + str(i + 1), index=False)
writer.save()
The df_list definition should be outside the for loop; otherwise it gets emptied on every iteration:
df_list = []
for csv in Path(dir_b).glob('*.csv'):
    df = pd.read_csv(csv, encoding='ISO-8859-1', engine='python', delimiter=',')
    car_column = df.pop('car')
    df.insert(9, 'car', car_column)
    df_list.append(df)
for i, df in enumerate(df_list):
    df.to_excel(writer, sheet_name='Sheet' + str(i + 1), index=False)
writer.save()

Need to pick 'second column' from multiple csv files and save all 'second columns' in one csv file

So I have 366 CSV files, and I want to copy their second columns and write them into a new CSV file. I need code for this task. I have tried some of the code available here, but nothing works. Please help.
Assuming all the second columns are the same length, you can simply loop through all the files: read each one, pull out its second column, and build a new DataFrame along the way.
import pandas as pd

filenames = ['test.csv', ....]
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df
new_df.to_csv('new_csv.csv', index=False)
Or, collecting the filenames with glob:

import glob
import pandas as pd

filenames = glob.glob(r'D:/CSV_FOLDER' + "/*.csv")
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df
new_df.to_csv('new_csv.csv', index=False)
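The pattern can be sketched without touching the disk; the in-memory DataFrames below stand in for the parsed CSV files (names and contents are hypothetical):

```python
import pandas as pd

frames = {
    'jan.csv': pd.DataFrame({'a': [1, 2], 'b': [10, 20]}),
    'feb.csv': pd.DataFrame({'a': [3, 4], 'b': [30, 40]}),
}

new_df = pd.DataFrame()
for filename, df in frames.items():
    # iloc[:, 1] picks the second column by position, whatever its name.
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = df.iloc[:, 1]

print(new_df)
```

As in the answer above, this assumes every second column has the same length.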
This can be accomplished with glob and pandas:
import glob
import pandas as pd

mylist = [f for f in glob.glob("*.csv")]

df = pd.read_csv(mylist[0])        # create the dataframe from the first csv
df = pd.DataFrame(df.iloc[:, 1])   # only keep the 2nd column
for x in mylist[1:]:               # loop through the rest of the csv files doing the same
    t = pd.read_csv(x)
    colName = pd.DataFrame(t.iloc[:, 1]).columns
    df[colName] = pd.DataFrame(t.iloc[:, 1])
df.to_csv('output.csv', index=False)

Iterate through excel files and sheets and concatenate in Python

Say I have a folder with multiple Excel files (extension xlsx or xls); they share the same header columns a, b, c, d, e, except for some empty sheets in several files.
I want to iterate over all the files and sheets (except the empty sheets) and concatenate them into one sheet of one output file output.xlsx.
I have iterated through all the Excel files and appended them into one file, but how can I iterate through all the sheets of each file when a file has more than one sheet?
I need to integrate the two blocks of code below into one. Thanks for your help.
import os
import glob
import numpy as np
import pandas as pd

path = os.getcwd()
files = os.listdir(path)
files

df = pd.DataFrame()

# method 1
excel_files = [f for f in files if f[-4:] == 'xlsx' or f[-3:] == 'xls']
excel_files
for f in excel_files:
    data = pd.read_excel(f)
    df = df.append(data)

# method 2
for f in glob.glob("*.xlsx" or "*.xls"):
    data = pd.read_excel(f)
    df = df.append(data, ignore_index=True)

# save the data frame
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'sheet1')
writer.save()
For a single file, to concatenate multiple sheets:
file = pd.ExcelFile('file.xlsx')
names = file.sheet_names # read all sheet names
df = pd.concat([file.parse(name) for name in names])
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)
files

excel_files = [file for file in files if '.xls' in file]
excel_files

def create_df_from_excel(file_name):
    file = pd.ExcelFile(file_name)
    names = file.sheet_names
    return pd.concat([file.parse(name) for name in names])

df = pd.concat(
    [create_df_from_excel(xl) for xl in excel_files]
)

# save the data frame
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer, 'sheet1')
writer.save()
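Regarding the empty sheets mentioned in the question: pd.read_excel(path, sheet_name=None) returns a dict of DataFrames keyed by sheet name, and the empty ones can be filtered out before concatenating. A sketch with an in-memory dict standing in for that result (the sheet names are hypothetical):

```python
import pandas as pd

# What pd.read_excel(path, sheet_name=None) would return for one workbook.
sheets = {
    'Sheet1': pd.DataFrame({'a': [1], 'b': [2]}),
    'Empty':  pd.DataFrame(),                      # an empty sheet
    'Sheet2': pd.DataFrame({'a': [3], 'b': [4]}),
}

# Keep only the non-empty sheets, then stack them.
df = pd.concat([s for s in sheets.values() if not s.empty], ignore_index=True)
print(df)
```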

Importing multiple excel files into Python, merge and apply filename to a new column

I have a for loop that imports all of the Excel files in the directory and merges them into a single DataFrame. However, I want to create a new column where each row holds the filename of the Excel file it came from.
Here is my import and merge code:
import os
import pandas as pd

path = os.getcwd()
files = os.listdir(path)

df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    df = df.append(data)
For example, if the first Excel file is named "file1.xlsx", I want all rows from that file to have the value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx", I want all rows from that file to have the value file2.xlsx. Notice that there is no real pattern in the Excel file names; I just use those names as an example.
Many thanks
Create the new column inside the loop:
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    data['col3'] = f
    df = df.append(data)
Another possible solution with a list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2']).assign(col3=f)
       for f in files]
df = pd.concat(dfs)
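On toy data, assign tags each frame with its source name before the concat (the file names here are made up):

```python
import pandas as pd

frames = {
    'file1.xlsx': pd.DataFrame({'col1': [1], 'col2': [2]}),
    'file2.xlsx': pd.DataFrame({'col1': [3], 'col2': [4]}),
}

# assign returns a copy with the extra column, so the comprehension stays terse.
df = pd.concat([d.assign(col3=f) for f, d in frames.items()], ignore_index=True)
print(df)
```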

Get file created date - add to dataframes column on read_csv

I need to pull many (hundreds of) CSVs into a pandas DataFrame, and I need to add the date each file was created in a column as each CSV is read in. I can obtain the date of creation for a CSV file using this call:
time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime('/path/file.csv')))
As an FYI, this is the command I am using to read in the CSVs:
path1 = r'/path/'
all_files_standings = glob.glob(path1 + '/*.csv')
standings = pd.concat((pd.read_csv(f, low_memory=False, usecols=[7, 8, 9]) for f in all_files_standings))
I tried running this call (which worked):
dt_gm = [time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime('/path/file.csv')))]
So then I tried expanding it:
dt_gm = [time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(f) for f in all_files_standings))]
and I get this error:
TypeError: an integer is required (got type generator)
How can I resolve this?
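The immediate cause of the TypeError is bracket placement: the generator expression ends up inside os.path.getmtime, which expects a single path. Moving the for clause outside the strftime call gives a working list comprehension; a sketch using throwaway temporary files (note os.path.getmtime is strictly the last-modified time, not the creation time):

```python
import os
import tempfile
import time

# Create a couple of throwaway CSV files to stand in for all_files_standings.
tmpdir = tempfile.mkdtemp()
all_files_standings = []
for name in ('a.csv', 'b.csv'):
    path = os.path.join(tmpdir, name)
    with open(path, 'w') as fh:
        fh.write('x,y\n1,2\n')
    all_files_standings.append(path)

# One strftime per file: the for clause iterates over files, not inside getmtime.
dt_gm = [time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(f)))
         for f in all_files_standings]
print(dt_gm)
```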
If the different files have the same columns and you want to append the files as rows:
import os
import time
import pandas as pd

# list of files you want to read
files = ['one.csv', 'two.csv']
column_names = ['c_1', 'c_2', 'c_3']

all_dataframes = []
for file_name in files:
    df_temp = pd.read_csv(file_name, delimiter=',', header=None)
    df_temp.columns = column_names
    df_temp['creation_time'] = time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(file_name)))
    df_temp['file_name'] = file_name
    all_dataframes.append(df_temp)
df = pd.concat(all_dataframes, axis=0, ignore_index=True)
df
output: (screenshot omitted)
If you want to append the different files by columns:
all_dataframes = []
for idx, file_name in enumerate(files):
    df_temp = pd.read_csv(file_name, delimiter=',', header=None)
    column_prefix = 'f_' + str(idx) + '_'
    df_temp.columns = [column_prefix + c for c in column_names]
    df_temp[column_prefix + 'creation_time'] = time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(file_name)))
    all_dataframes.append(df_temp)
pd.concat(all_dataframes, axis=1)
output: (screenshot omitted)
