dropping columns in multiple excel spreadsheets - python

Is there a way in Python to drop columns in multiple Excel files? I.e., I have a folder with several xlsx files. Each file has about 5 columns (date, value, latitude, longitude, region). I want to drop all columns except date and value in each Excel file.

Let's say you have a folder with multiple Excel files:
import pandas as pd
from pathlib import Path

folder = Path('excel_files')
xlsx_only_files = list(folder.rglob('*.xlsx'))

def process_files(xls_file):
    # stem is a pathlib attribute that gives just the filename
    # without the parent path or the suffix
    filename = xls_file.stem
    # sheet_name=None makes read_excel return a dictionary,
    # with the sheet names as the keys
    # usecols lets you read in only the relevant columns
    df = pd.read_excel(xls_file, usecols=['date', 'value'], sheet_name=None)
    df_cleaned = [data.assign(sheetname=sheetname, filename=filename)
                  for sheetname, data in df.items()]
    return df_cleaned

# process_files returns a list of frames per file, so flatten before concatenating
combo = [df for xlsx in xlsx_only_files for df in process_files(xlsx)]
final = pd.concat(combo, ignore_index=True)
Let me know how it goes.

I suggest you define the columns you want to keep as a list and then select them as a new dataframe.
# after opening the excel file as
df = pd.read_excel(...)
keep_cols = ['date', 'value']
df = df[keep_cols]  # keep only the selected columns; this returns a dataframe
df.to_excel(...)
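Putting the two answers together, a minimal end-to-end sketch (assuming each file has a single sheet and that overwriting the originals in place is acceptable):
import pandas as pd
from pathlib import Path

folder = Path('excel_files')  # folder name assumed, as in the answer above
for xls_file in folder.rglob('*.xlsx'):
    # read only the two columns to keep
    df = pd.read_excel(xls_file, usecols=['date', 'value'])
    # write back, replacing the original file
    df.to_excel(xls_file, index=False)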

Related

Reading multiple excel files in pyspark/pandas

I have a scenario where I need to read an Excel spreadsheet with multiple sheets inside and process each sheet separately.
The sheets inside the Excel workbook are named something like [sheet1, data1, data2, data3, summary, reference, other_info, old_records].
I need to read only the sheets [reference, data1, data2, data3].
I can hardcode the name reference, which is static every time, but the names data1, data2, data3 are not static: there may be data1 only, or data1, data2, or (e.g.) data1, data2, …, data(n).
Whatever the count of the sheets is, it will remain the same across all files (e.g. it is not allowed to have Data1, Data2 in one file and Data1, Data2, Data3 in another; just to clarify the requirement).
I can check the names by using the following code:
readall = [key for key in pd.read_excel(path, sheet_name=None) if 'Data' in key]
for n in range(0, len(readall)):
    sheetname = readall[n]
    dfname = df_list[n]  # trying to create dynamic dataframes so that we can create separate tables at the end
for s in allsheets:
    sheetname = s
    data_df = readfile(path, s, "'Data1!C5'")  # function to read the excel file into a dataframe
    df_ref = readreference(path, s, "'Reference!A1'")
df_ref is the same for all sheets in a workbook, and data_df is joined with the reference. (Just adding this as info: there is some other processing that needs to be done as well, which I have already done.)
The above is sample code to read a particular sheet.
My problem is:
I have multiple Excel files (around 100 files) to read.
Matching sheets from all files should be combined together, e.g. 'Data1' from file1 should be combined with Data1 from file2, Data1 from file3, and so on. Similarly, Data2 from all files should be combined into a separate dataframe (all sheets have the same columns).
Separate delta tables should be created for each tab, e.g. the table for Data1 should be something like Customers_Data1, the table for Data2 should be Customers_Data2, and so on.
Any help on this please?
Thanks
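For the dynamic Data1…Data(n) names, one option (a sketch, assuming the same 'Data' naming convention as in the snippet above) is to list the tab names up front with pd.ExcelFile, which avoids reading every sheet's data just to learn the names:
import pandas as pd

# sheet_names lists the tabs without loading any rows
all_sheets = pd.ExcelFile(path).sheet_names
readall = ['reference'] + [s for s in all_sheets if 'Data' in s]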
Resolved my issue through the following code.
final_dflist = []
sheet_names = [['Data1', 'CUSTOMER_DATA1'], ['Data2', 'CUSTOMER_DATA2']]
for shname in sheet_names:
    final_df = spark.createDataFrame([], final_schema)
    print(f'Initializing final df - record count: {final_df.count()}')
    sheetname = shname[0]
    dfname = shname[1]
    print(f'Processing: {sheetname}')
    for file in historic_files:
        fn = file.rsplit('/', 1)[-1]
        fpath = '/dbfs' + file
        print(f'Reading file: {fn} --> {sheetname}')
        indx_df = pd.read_excel(fpath, sheet_name='Index', skiprows=None)
        for index, row in indx_df.iterrows():
            if row[0] == 'Data Item Description':
                row_index = 'A' + str(index + 2)
        df_index = read_index(file, 'Index', row_index)
        df_index = df_index.selectExpr('`Col 1` as co_1', 'Col2', 'Col3', 'Col4')
        df_data = read_data(file, sheetname, 'A10')
        # Join Data with index here
        # Drop null columns from final df
        df = combine_df.na.drop("all")
    exec(f'{dfname} = final_df.select("*")')
    final_dflist.append([dfname, final_df])
    print(f'Data Frame created: {dfname}')
    print(f'final df - record count: {final_df.count()}')
Any suggestions to improve this?
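One suggestion, sketched here under the assumption that read_data and the join logic stay as above: collect the per-sheet results in a dict keyed by the target table name instead of creating variables with exec, which is fragile and hard to debug:
from functools import reduce

final_dfs = {}  # maps target table name -> combined Spark DataFrame
for sheetname, dfname in sheet_names:
    # read the same sheet from every historic file (read_data as above)
    parts = [read_data(f, sheetname, 'A10') for f in historic_files]
    # union the per-file frames; all sheets share the same columns
    final_dfs[dfname] = reduce(lambda a, b: a.unionByName(b), parts)
    print(f'Data Frame created: {dfname}')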

Write Dataframe row to excel sheet using Pandas

How do I save a returned row from a dataframe into an Excel sheet?
Story: I am working with a large txt file (1.7M rows) containing postal codes for Canada. I created a dataframe and extracted the values I need into it. One column of the dataframe is the province id (df['PID']). I created a list of the unique values found in that PID column, and am successfully creating the (13) sheets, each named after the unique PID, in a new Excel spreadsheet.
Problem: Each sheet only contains the headers, and not the values of the rows.
I am having trouble writing the matching rows to the sheets. Here is my code:
import pandas as pd
# parse text file into dataframe
path = 'the_file.txt'
df = pd.read_csv(path, sep='\t', header=None, names=['ORIG', 'PID','PCODE'], encoding='iso-8859-1')
# extract characters to fill values
df['ORIG'] = df['ORIG']
df['PID'] = df['ORIG'].str[11:13].astype(int)
df['PCODE'] = df['ORIG'].str[:6]
# create list of unique province ID's
prov_ids = df['PID'].unique().tolist()
prov_ids_string = map(str, prov_ids)
# create new excel file
writer = pd.ExcelWriter('CanData.xlsx', engine='xlsxwriter')
for id in prov_ids_string:
    mydf = df.loc[df.PID==id]
    # NEED TO WRITE VALUES FROM ROW INTO SHEET HERE*
    mydf.to_excel(writer, sheet_name=id)
writer.save()
I know where the writing should happen, but I haven't gotten the correct result. How can I write only the rows which have matching PID's to their respective sheets?
Thank you
The following should work:
import pandas as pd
import xlsxwriter
# parse text file into dataframe
# extract characters to fill values
df['ORIG'] = df['ORIG']
df['PID'] = df['ORIG'].str[11:13].astype(int)
df['PCODE'] = df['ORIG'].str[:6]
# create list of unique province ID's
prov_ids = df['PID'].unique().tolist()
#prov_ids_string = map(str, prov_ids)
# create new excel file
writer = pd.ExcelWriter('./CanData.xlsx', engine='xlsxwriter')
for idx in prov_ids:
    mydf = df.loc[df.PID==idx]
    # the rows with matching PIDs are written here
    mydf.to_excel(writer, sheet_name=str(idx))
writer.save()
For example data:
df = pd.DataFrame()
df['ORIG'] = ['aaaaaa111111111111111111111',
              'bbbbbb2222222222222222222222']
df['ORIG'] = df['ORIG']
df['PID'] = df['ORIG'].str[11:13].astype(int)
df['PCODE'] = df['ORIG'].str[:6]
print(df)
In my sheet '11', I have the 'aaaaaa…' row (PID 11); the 'bbbbbb…' row lands in sheet '22'.
Kr.
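A side note for current pandas: ExcelWriter.save() was removed in pandas 2.0, so on recent versions the writer is best used as a context manager, which saves the file when the block exits. The same loop, using the variables above:
with pd.ExcelWriter('CanData.xlsx', engine='xlsxwriter') as writer:
    for idx in prov_ids:
        df.loc[df.PID == idx].to_excel(writer, sheet_name=str(idx))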

merge multiple excel files in one dataframe

I have a lot of Excel files (200+), and all of them have the same format.
The file paths are saved in this list:
dir_list = ['all', 'files']
I want to combine all of them into one single df.
Below is what I want to select from each Excel file into the new df:
used_col = ['Dimension', 'Length', 'Customer']
df_x = pd.read_excel(file, sheet_name='Tabelle1', skiprows=3, skipinitialspace=True, usecols=used_col)
How can I do that?
You are close; you need to use concat to create a single df from all the files.
tmp = []
used_col = ['Dimension', 'Length', 'Customer']
for file in dir_list:
    df_x = pd.read_excel(file, sheet_name='Tabelle1', skiprows=3, skipinitialspace=True, usecols=used_col)
    tmp.append(df_x)
final_df = pd.concat(tmp)
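If the row numbers from the individual files don't matter, it is worth passing ignore_index=True so the combined frame gets a clean 0..n-1 index:
final_df = pd.concat(tmp, ignore_index=True)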

Importing multiple excel files into Python, merge and apply filename to a new column

I have a for loop that imports all of the Excel files in the directory and merges them together into a single dataframe. However, I want to create a new column where each row holds the filename of the Excel file it came from.
Here is my import and merge code:
path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    df = df.append(data)
For example, if the first Excel file is named "file1.xlsx", I want all rows from that file to have the value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx", I want all rows from that file to have the value file2.xlsx. Notice that there is no real pattern to the Excel file names; I just use those names as an example.
Many thanks
Create the new column in the loop:
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    data['col3'] = f
    df = df.append(data)
Another possible solution with list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2']).assign(col3=f)
       for f in files]
df = pd.concat(dfs)
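Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas only the concat-based version will run.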

Convert multiple xlsm files automatically to multiple csv files by using pandas

I have 300 raw data files (.xlsm) and want to extract the useful data and turn them into CSV files as input for a subsequent neural network. For now I am trying to implement this with 10 files as an example. I have successfully extracted the information I need, but I don't know how to convert the results to CSV files with the same names. For a single file we can use df.to_csv, but how about for all the files? With a for loop?
import glob
import pandas as pd
import numpy as np
import csv
import os
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
directory = '/Beispiel'
for files in excel_files:
    data = pd.read_excel(files)
    # selecting the block of rows and columns you need
    list_of_dfs = pd.DataFrame(data.values[0:600, 12:26],
                               columns=data.columns[12:26]).drop(['Sauberkeit', 'Temparatur'], axis=1)
    # converting pandas dataframe columns to numeric: string into float
    cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
            'TempB', 'TempC', 'Modul1', 'Modul2',
            'Modul3', 'Modul4', 'Modul5', 'Modul6']
    list_of_dfs[cols] = list_of_dfs[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    # filling down through missing data in each column
    for fec in list_of_dfs[cols]:
        list_of_dfs[fec].fillna(method='ffill', inplace=True)
    csvfilename = files.split('/')[-1].split('.')[0] + '.csv'
    newtempfile = os.path.join(directory, csvfilename)
    print(newtempfile)
    print(list_of_dfs.head(2))
Problem is solved:
folder_name = 'Beispiel'
csvfilename = files.split('/')[-1].split('.')[0] + '.csv'  # change into a csv filename
newtempfile = os.path.join(folder_name, csvfilename)
# verify the directory exists
if not os.path.exists(folder_name):
    os.makedirs(folder_name)  # if not, create it
print(newtempfile)
list_of_dfs.to_csv(newtempfile, index=False)
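On Python 3 the existence check can be folded into the call itself:
os.makedirs(folder_name, exist_ok=True)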
The easiest way of doing this is to get the filename from the Excel file's path and then use the os.path.join() method to save it to the directory you want.
directory = "C:/Test"
for file in excel_files:
    csvfilename = os.path.basename(file).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
Since you already have the Excel df you want to push into the CSV file, just add the above code to the loop and change the output file to newtempfile, and that should do it:
df.to_csv(newtempfile, index=False)
Hope this helps. :)
Updated Code:
cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
        'TempB', 'TempC', 'Modul1', 'Modul2',
        'Modul3', 'Modul4', 'Modul5', 'Modul6']
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
for file in excel_files:
    data = pd.read_excel(file, usecols=cols)  # import only the columns you need into the dataframe
    csvfilename = os.path.basename(file).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
    # converting pandas dataframe columns to numeric: string into float
    data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    data[cols] = data[cols].fillna(method='ffill')  # fill down through missing data
    data.to_csv(newtempfile, index=False)
