Extracting an excel file path from another excel file - python

I have a file called 'workbooks_to_process.xlsx' with a column that contains the paths of the following Excel files:
**files_paths_2_process** (column header)
c:/work/file01.xlsx
c:/work/file02.xlsx
c:/work/file03.xlsx
...
c:/work/file0m.xlsx
On the other hand, in Python/pandas:
df_0 = pd.read_excel('workbooks_to_process.xlsx')  # No issue
list_of_paths = df_0['files_paths_2_process'].tolist()  # No issue (note the quotes around the column name)
The following is what I want to do (in an iterative process):
itr = list_of_paths[3] # or [0], [1], [n] etc
df_1 = pd.read_excel(itr)
Is there any method to accomplish the above?
Thanks!

For iterating through all files in a folder and all sheets in those files, try this:
import pandas as pd
import os

# collect the full path of every file under the reference folder
file_list = [os.path.join(r, file) for r, d, f in os.walk("C:\\Users\\ref_folder\\") for file in f]

for file in file_list:
    f = pd.ExcelFile(file)
    sheet_names = f.sheet_names
    for sheet in sheet_names:
        dataframe = pd.read_excel(f, sheet_name=sheet)
This gives you a dataframe for every sheet, and it works for workbooks with a single sheet too.
You can match each filename against the filenames in your Excel column and, if it matches, read it into a dataframe, as sketched below. I feel this is the most generalized way to iterate through files in a folder and read them as dataframes.
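A minimal sketch of that matching step, assuming file_list was built with os.walk as above and list_of_paths holds the paths from the question's workbook column:
import os
import pandas as pd

# Filenames listed in the workbook column (basenames, so folder differences don't matter)
wanted = {os.path.basename(p) for p in list_of_paths}

dataframes = {}
for path in file_list:
    name = os.path.basename(path)
    if name in wanted:  # only read files that appear in the workbook column
        dataframes[name] = pd.read_excel(path)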
Hope that helps.

Try this:
for itr in range(len(list_of_paths)):
    df_1 = pd.read_excel(list_of_paths[itr])
    ...
    ...
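Equivalently, and a bit more idiomatically, you can iterate over the paths themselves:
for path in list_of_paths:
    df_1 = pd.read_excel(path)
    ...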


Merging csv files into one (columnwise) in Python

I have many .csv files like this (with one column):
[screenshot of a single-column .csv file]
I'd like to merge them into one .csv file, so that each column contains one of the csv files' data. The headings (when converted to a spreadsheet) should be, per column: first the number of minutes extracted from the file name, then the first word behind "export_" in the name, and then the whole name of the file.
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried joining only 2 files, but I have no idea how to do it with more files without writing everything down manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv', 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']
df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)

print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to put all the single-column .csv files one beside the other in a single output file (which you can then open in Excel).
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path) if file.endswith('.csv')]

list_of_dfs = []
for file in csv_files:
    temp = pd.read_csv(os.path.join(csv_path, file), header=0, names=['Header'])
    # 'export_Control 37C 4h_...' -> the part between the first two underscores is 'Control 37C 4h'
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])  # e.g. '4h'
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])   # e.g. 'Control'
    file_name = pd.DataFrame([[file]], columns=['Header'])
    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col + 1) for col in final.columns]
final.to_csv(os.path.join(csv_path, 'output.csv'), index=False)
final
For example, with three .csv files in the folder, running the code above yields a single output.csv with three columns, one per input file. [Output screenshots in Jupyter and in Excel omitted.]
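If you would rather have those three values as real column headers instead of extra data rows, a pandas MultiIndex is one option. A sketch under the same filename assumptions as above (the folder path is a placeholder):
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
frames, headers = [], []
for file in sorted(os.listdir(csv_path)):
    if not file.endswith('.csv'):
        continue
    parts = file.split('_')[1].split()  # e.g. 'Control 37C 4h' -> ['Control', '37C', '4h']
    frames.append(pd.read_csv(os.path.join(csv_path, file), header=0, names=['value']))
    headers.append((parts[2], parts[0], file))  # (time, first word, full file name)

final = pd.concat(frames, axis=1)
final.columns = pd.MultiIndex.from_tuples(headers)  # three header rows in the output
final.to_csv(os.path.join(csv_path, 'output_multiheader.csv'))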

create single dataframe from multiple csv sources and create single excel file from dataframe

I need help with my Python code.
The goal is:
read in between 100 and 200 CSV files that are in a folder
copy a variable from position (2,2) in each CSV file
create the sum of all values of column 17 in every CSV
collect those values in a dataframe
create a new Excel file
transfer the dataframe into the Excel file
My attempt was the following code:
# import necessary libraries
import pandas as pd
import os
import glob

# use glob to get all the csv files
# in the folder
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=';', skiprows=2, usecols=[2,16], header=None)
    # ID
    ID = (df.loc[2][2])
    # sum of col. 16
    dat_Verbr = df[16].sum()
    # data in single dataframe
    df4 = pd.DataFrame({'SIM-Karte': ID, 'Datenverbrauch': dat_Verbr}, index=[0,1,2,3,4,5])

# Specify the name of the excel file
file_name = 'Auswertung.xlsx'
# saving the excelsheet
concatenated.to_excel(file_name)
print(' record successfully exported into Excel File')
Unfortunately, it doesn't work. The problem is that only the first ID and first sum are exported to the Excel file.
How can I handle the index when building a single dataframe? I don't know the exact number of csv files, only that it is somewhere between 100 and 200.
I'm a beginner with Python.
Can someone help me please?
You can use the updated code below. One assumption I made is that there is data in all of rows 1 through 16. If your file has just ;;;;... in the first row, read_csv sometimes makes a mistake. Also, since skiprows=1 is used, the value in row 1, column 17 (if present) is not added to the sum; you will need to change the code if that value should be included. The rest I have corrected/changed so the code works. Note that in to_excel I have used index=False, as I didn't think you need the index to be added; remove it if you want to see the index as well.
# use glob to get all the csv files
# in the folder
import os, glob
import pandas as pd

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))

# data in a single dataframe
df4 = pd.DataFrame(columns=['SIM-Karte', 'Datenverbrauch'])

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=';', skiprows=1, usecols=[1,16], header=None)
    # ID
    ID = df.iloc[0][1]
    # sum of col. 16
    dat_Verbr = df[16].sum()
    # append one row per file
    df4.loc[len(df4.index)] = [ID, dat_Verbr]

# Specify the name of the excel file
file_name = 'Auswertung.xlsx'
# saving the excel sheet
df4.to_excel(file_name, index=False)
print(' record successfully exported into Excel File')
Output Excel file, with 3 files in the folder. [Screenshot omitted.]
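As a side note: growing a dataframe row-by-row with df4.loc[len(df4.index)] = ... works, but it re-allocates on every append. With 100-200 files it hardly matters; still, the more idiomatic pattern is to collect plain rows in a list and build the dataframe once, e.g.:
rows = []
for f in csv_files:
    df = pd.read_csv(f, sep=';', skiprows=1, usecols=[1,16], header=None)
    rows.append([df.iloc[0][1], df[16].sum()])

# build the dataframe in one go instead of appending row by row
df4 = pd.DataFrame(rows, columns=['SIM-Karte', 'Datenverbrauch'])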

How to read several xlsx-files in a folder into a pandas dataframe

I have a folder. In this folder are 48 xlsx files, but only 22 of them are relevant. The names of these 22 files have no structure; the only thing they have in common is that the filenames start with data. I would love to access these files and read them all into a dataframe. Doing this manually for each file with the line
df = pd.read_excel(filename, engine='openpyxl')
takes too long.
The table structure is similar but not always exactly the same. How can I solve this problem?
import os
import pandas as pd

dfs = {}

def get_files(extension, location):
    xlsx_list = []
    for root, dirs, files in os.walk(location):
        for t in files:
            if t.endswith(extension):
                xlsx_list.append(os.path.join(root, t))  # keep the full path so read_excel finds the file
    return xlsx_list

file_list = get_files('.xlsx', '.')

for filename in file_list:
    df = pd.read_excel(filename, engine='openpyxl')
    dfs[filename] = df

print(dfs)
Each element in dfs, e.g. dfs['file_name_here.xlsx'], accesses the dataframe produced by read_excel for that file.
EDIT: you can add additional criteria to filter the xlsx files at the line if t.endswith(extension):. You can check the beginning of the file name with if t.startswith('data'): too, or combine them: if t.startswith('data') and t.endswith(extension):
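To go from that dict of dataframes to the single dataframe the question asks for, pd.concat works, as long as the 22 relevant files have a compatible enough column layout (columns missing from some files come back as NaN). A sketch using pathlib for the name filter:
from pathlib import Path
import pandas as pd

# only files whose names start with 'data'
frames = {f.name: pd.read_excel(f, engine='openpyxl') for f in Path('.').glob('data*.xlsx')}

# one dataframe, with the source file name kept as an extra index level
combined = pd.concat(frames, names=['source_file'])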

Use Python to combine excel files from folder and keep the original formatting

I have already searched for this problem on the Internet, but found no satisfying solution.
I have lots of Excel files with different formatting in one folder.
The requirement is to combine all the Excel files into different sheets of one Excel file, where each sheet name is the individual file's name; meanwhile, it should keep the original formatting of each file.
I can already use pandas to combine all the files, but the formatting is changed after writing to Excel.
How can I keep the formatting, including the font, alignment, background and so on? Any suggestions? Thanks.
import pandas as pd
import os

# 1. List files in the folder:
path = r'C:\Users\h290602\Desktop\SAP'
files = os.listdir(path)

# 2. Pick the Excel files
files_xls = [f for f in files if f.endswith(".xlsx") or f.endswith(".xls")]

# 3. Initialize an empty dataframe
df = pd.DataFrame()

# 4. Loop over the list of files, writing each to its own sheet
save_path = os.path.join(path, 'results.xlsx')
result = pd.ExcelWriter(save_path)
for f in files_xls:
    excel_file_name = f.split('.')[0]
    if '~$' in f:
        f = f.replace('~$', '')
    excel_path = os.path.join(path, f)
    df = pd.read_excel(excel_path)
    df.to_excel(result, excel_file_name, index=False)
result.save()
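pandas only round-trips cell values, so the formatting is lost by design. To actually preserve fonts, fills, alignment and so on, you have to copy cells at the openpyxl level. Below is a minimal sketch of that idea (it copies values plus the common style attributes; merged cells, column widths, images and charts would need extra handling):
import os
from copy import copy
from openpyxl import Workbook, load_workbook

path = r'C:\Users\h290602\Desktop\SAP'
files_xls = [f for f in os.listdir(path) if f.endswith('.xlsx') and not f.startswith('~$')]

target = Workbook()
target.remove(target.active)  # drop the default empty sheet

for f in files_xls:
    src_ws = load_workbook(os.path.join(path, f)).active  # first sheet, like the pandas version
    dst_ws = target.create_sheet(title=f.split('.')[0][:31])  # Excel caps sheet names at 31 chars
    for row in src_ws.iter_rows():
        for cell in row:
            new_cell = dst_ws.cell(row=cell.row, column=cell.column, value=cell.value)
            if cell.has_style:
                new_cell.font = copy(cell.font)
                new_cell.fill = copy(cell.fill)
                new_cell.border = copy(cell.border)
                new_cell.alignment = copy(cell.alignment)
                new_cell.protection = copy(cell.protection)
                new_cell.number_format = cell.number_format

target.save(os.path.join(path, 'results.xlsx'))
If you are on Windows with Excel installed, another route is to drive Excel itself (e.g. via win32com and Worksheet.Copy), which preserves everything including charts and images.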

Printing columns from a CSV file into an excel file with python

I am trying to come up with a script that will let me read all csv files larger than 62 bytes, print two columns from each into a separate excel file, and create a list.
The following is one of the csv files:
FileUUID Table RowInJSON JSONVariable Error Notes SQLExecuted
ff3ca629-2e9c-45f7-85f1-a3dfc637dd81 lng02_rpt_b_calvedets 1 Duplicate entry 'ETH0007805440544' for key 'nosameanimalid' INSERT INTO lng02_rpt_b_calvedets(farmermobile,hh_id,rpt_b_calvedets_rowid,damidyesno,damid,calfdam_id,damtagid,calvdatealv,calvtype,calvtypeoth,easecalv,easecalvoth,birthtyp,sex,siretype,aiprov,othaiprov,strawidyesno,strawid) VALUES ('0974502779','1','1','0','ETH0007805440544','ETH0007805470547',NULL,'2017-09-16','1',NULL,'1',NULL,'1','2','1',NULL,NULL,NULL,NULL,NULL,'0',NULL,NULL,NULL,NULL,NULL,NULL,'0',NULL,'Tv',NULL,NULL,'Et','23',NULL,'5',NULL,NULL,NULL,'0','0')
This is my attempt at solving the problem:
import csv
import glob
import os

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, "w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Please help point me in the right direction, or suggest any alternative solution.
pandas does a lot of what you are trying to achieve:
import pandas as pd
# Read a csv file to a dataframe
df = pd.read_csv("<path-to-csv>")
# Filter two columns
columns = ["FileUUID", "Table"]
df = df[columns]
# Combine multiple dataframes
df_combined = pd.concat([df1, df2, df3, ...])
# Output dataframe to excel file
df_combined.to_excel("<output-path>", index=False)
To loop through all csv files larger than 62 bytes, you can use glob.glob() and os.stat():
import os
import glob

dataframes = []
for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    if os.stat(csvfile).st_size > 62:  # st_size is in bytes
        dataframes.append(pd.read_csv(csvfile))
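Putting the two pieces together (a sketch; the column names are taken from the sample file above, and pd.read_csv may need a sep= argument if your files are not comma-delimited):
import glob
import os
import pandas as pd

dataframes = []
for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    if os.stat(csvfile).st_size > 62:  # skip files of 62 bytes or less
        df = pd.read_csv(csvfile)
        dataframes.append(df[["FileUUID", "Table"]])  # keep just the two wanted columns

pd.concat(dataframes).to_excel("<output-path>", index=False)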
Use the standard csv module. Don't re-invent the wheel.
https://docs.python.org/3/library/csv.html
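For completeness, a csv-only version of the asker's loop (a sketch: the size check mirrors the 62-byte requirement, and opening files with newline='' is the documented way to avoid stray blank rows on Windows):
import csv
import glob
import os

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*.csv')):
    if os.stat(infile).st_size <= 62:  # skip files of 62 bytes or less
        continue
    with open(infile, 'r', newline='') as source, \
         open(infile + '.out', 'w', newline='') as result:
        writr = csv.writer(result)
        for r in csv.reader(source):
            writr.writerow((r[4], r[2]))  # same two columns the asker picked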
