I have many .csv files like this (with one column):
[screenshot of one of the single-column .csv files]
I'd like to merge them into one .csv file, so that each column contains the data of one of the csv files. The headings should look like this (when converted to a spreadsheet):
[screenshot of the desired headings] (the first number is the number of minutes extracted from the file name, the second is the first word in the file name after "export_", and the third is the whole name of the file).
I'd like to work in Python.
Can someone please help me with this? I am new to Python.
Thank you very much.
I tried to join only 2 files, but I have no idea how to do it with more files without writing everything down manually. Also, I don't know how to extract the headings from the file names:
import pandas as pd

file_list = ['export_Control 37C 4h_Single Cells_Single Cells_Single Cells.csv', 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv']

df = pd.DataFrame()
for file in file_list:
    temp_df = pd.read_csv(file)
    df = pd.concat([df, temp_df], axis=1)

print(df)
df.to_csv('output2.csv', index=False)
Assuming that your .csv files all have a header and the same number of rows, you can use the code below to place all the single-column .csv files next to one another in a single Excel worksheet.
import os
import pandas as pd

csv_path = r'path_to_the_folder_containing_the_csvs'
csv_files = [file for file in os.listdir(csv_path)]

list_of_dfs = []
for file in csv_files:
    # the single data column of the file
    temp = pd.read_csv(os.path.join(csv_path, file), header=0, names=['Header'])

    # the three header rows derived from the file name
    time_number = pd.DataFrame([[file.split('_')[1].split()[2]]], columns=['Header'])
    file_title = pd.DataFrame([[file.split('_')[1].split()[0]]], columns=['Header'])
    file_name = pd.DataFrame([[file]], columns=['Header'])

    out = pd.concat([time_number, file_title, file_name, temp]).reset_index(drop=True)
    list_of_dfs.append(out)

final = pd.concat(list_of_dfs, axis=1, ignore_index=True)
final.columns = ['Column' + str(col+1) for col in final.columns]
final.to_csv(os.path.join(csv_path, 'output.csv'), index=False)

final
For example, with three .csv files, running the code above yields the following:
[output screenshots in Jupyter and in Excel]
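For clarity, this is how the three header values come out of one of the example file names from the question; a minimal sketch that just restates the split logic used in the loop above:

file = 'export_Control 37C 0 min_Single Cells_Single Cells_Single Cells.csv'

middle = file.split('_')[1]       # 'Control 37C 0 min'
time_number = middle.split()[2]   # '0'       -> the number of minutes
file_title = middle.split()[0]    # 'Control' -> first word after "export_"
print(time_number, file_title, file)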
I need help with my Python code.
The goal is:
read in between 100 and 200 CSV files that are in a folder
copy a variable from position (2,2) in each CSV file
compute the sum of all values in column 17 of every CSV
collect these values in a dataframe
create a new Excel file
write the dataframe to the Excel file
My attempt was the following code:
# import necessary libraries
import pandas as pd
import os
import glob

# use glob to get all the csv files
# in the folder
path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=';', skiprows=2, usecols=[2, 16], header=None)

    # ID
    ID = df.loc[2][2]

    # sum of col. 16
    dat_Verbr = df[16].sum()

    # data in single dataframe
    df4 = pd.DataFrame({'SIM-Karte': ID, 'Datenverbrauch': dat_Verbr}, index=[0, 1, 2, 3, 4, 5])

# Specify the name of the excel file
file_name = 'Auswertung.xlsx'

# saving the excelsheet
concatenated.to_excel(file_name)
print(' record successfully exported into Excel File')
Unfortunately, it doesn't work.
The problem is that only the first ID and the first sum end up in the Excel file.
How can I handle the index when building a single dataframe? I don't know the exact number of csv files, only that it is somewhere between 100 and 200.
I'm a beginner with Python.
Can someone help me please?
You can use the updated code below. One assumption I made is that there is data in all rows 1 through 16. If your file has just ;;;;... in the first row, read_csv sometimes gets confused. Also, since skiprows=1 is used, the value in row 1, column 17 (if present) is not included in the sum; you will need to change the code if it should be. The rest I have corrected/changed so the code works. Note that in to_excel I used index=False because I didn't think you need the index column; remove it if you want to see the index as well.
# use glob to get all the csv files
# in the folder
import os, glob
import pandas as pd

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))

# data in single dataframe
df4 = pd.DataFrame(columns=['SIM-Karte', 'Datenverbrauch'])

# loop over the list of csv files
for f in csv_files:
    # read the csv file
    df = pd.read_csv(f, sep=';', skiprows=1, usecols=[1, 16], header=None)

    # ID
    ID = df.iloc[0][1]

    # sum of col. 16
    dat_Verbr = df[16].sum()

    # append one row per file
    df4.loc[len(df4.index)] = [ID, dat_Verbr]

# Specify the name of the excel file
file_name = 'Auswertung.xlsx'

# saving the excelsheet
df4.to_excel(file_name, index=False)
print(' record successfully exported into Excel File')
Output Excel file (I had 3 files in the folder):
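As a side note on the design: growing the frame row by row with df4.loc[len(df4.index)] re-allocates on every append. Collecting the rows in a plain list and building the DataFrame once is usually faster; a minimal sketch of that variant, under the same assumptions about the file layout as above:

import os, glob
import pandas as pd

rows = []
for f in glob.glob(os.path.join(os.getcwd(), "*.csv")):
    df = pd.read_csv(f, sep=';', skiprows=1, usecols=[1, 16], header=None)
    rows.append({'SIM-Karte': df.iloc[0][1], 'Datenverbrauch': df[16].sum()})

df4 = pd.DataFrame(rows, columns=['SIM-Karte', 'Datenverbrauch'])
df4.to_excel('Auswertung.xlsx', index=False)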
I have a folder. In this folder there are 48 xlsx files, but only 22 of them are relevant. The names of these 22 files have no structure; the only thing they have in common is that the filenames start with data. I would love to access these files and read them all into a dataframe. Doing this manually with the code line
df = pd.read_excel(filename, engine='openpyxl')
takes too long.
The table structure is similar but not always exactly the same. How can I manage to solve this problem?
import os
import pandas as pd

dfs = {}

def get_files(extension, location):
    xlsx_list = []
    for root, dirs, files in os.walk(location):
        for t in files:
            if t.endswith(extension):
                xlsx_list.append(t)
    return xlsx_list

file_list = get_files('.xlsx', '.')

for filename in file_list:
    df = pd.read_excel(filename, engine='openpyxl')
    dfs[filename] = df

print(dfs)
Each element in dfs, e.g. dfs['file_name_here.xlsx'], gives you the dataframe returned by read_excel for that file.
EDIT: you can add additional criteria to filter the xlsx files at the line if t.endswith(extension): you can also check the beginning of the file name with if t.startswith('data'):, or combine them: if t.startswith('data') and t.endswith(extension):
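Since the goal in the question is a single combined dataframe of only the files that start with data, a minimal sketch along those lines (assuming the workbooks sit directly in the current folder and have compatible columns):

import os
import pandas as pd

folder = '.'
frames = []
for name in os.listdir(folder):
    # keep only the relevant workbooks: names starting with "data" and ending in ".xlsx"
    if name.startswith('data') and name.endswith('.xlsx'):
        frames.append(pd.read_excel(os.path.join(folder, name), engine='openpyxl'))

# stack them into one dataframe; columns missing in some files become NaN
df_all = pd.concat(frames, ignore_index=True)
print(df_all.shape)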
I am trying to come up with a script that will let me read all csv files greater than 62 bits, print two columns into a separate Excel file, and create a list.
The following is one of the csv files:
FileUUID Table RowInJSON JSONVariable Error Notes SQLExecuted
ff3ca629-2e9c-45f7-85f1-a3dfc637dd81 lng02_rpt_b_calvedets 1 Duplicate entry 'ETH0007805440544' for key 'nosameanimalid' INSERT INTO lng02_rpt_b_calvedets(farmermobile,hh_id,rpt_b_calvedets_rowid,damidyesno,damid,calfdam_id,damtagid,calvdatealv,calvtype,calvtypeoth,easecalv,easecalvoth,birthtyp,sex,siretype,aiprov,othaiprov,strawidyesno,strawid) VALUES ('0974502779','1','1','0','ETH0007805440544','ETH0007805470547',NULL,'2017-09-16','1',NULL,'1',NULL,'1','2','1',NULL,NULL,NULL,NULL,NULL,'0',NULL,NULL,NULL,NULL,NULL,NULL,'0',NULL,'Tv',NULL,NULL,'Et','23',NULL,'5',NULL,NULL,NULL,'0','0')
This is my attempt at solving this problem:
import os
import glob
import csv

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*csv')):
    output = infile + '.out'
    with open(infile, 'r') as source:
        readr = csv.reader(source)
        with open(output, "w") as result:
            writr = csv.writer(result)
            for r in readr:
                writr.writerow((r[4], r[2]))
Please help point me in the right direction, or suggest an alternative solution.
pandas does a lot of what you are trying to achieve:
import pandas as pd
# Read a csv file to a dataframe
df = pd.read_csv("<path-to-csv>")
# Filter two columns
columns = ["FileUUID", "Table"]
df = df[columns]
# Combine multiple dataframes
df_combined = pd.concat([df1, df2, df3, ...])
# Output dataframe to excel file
df_combined.to_excel("<output-path>", index=False)
To loop through all csv files > 62 bits, you can use glob.glob() and os.stat():
import os
import glob
import pandas as pd

dataframes = []
for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    if os.stat(csvfile).st_size > 62:
        dataframes.append(pd.read_csv(csvfile))
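Putting those pieces together, a minimal end-to-end sketch; the folder and output paths are placeholders, and the size threshold is interpreted as bytes (which is what os.stat().st_size actually reports):

import os
import glob
import pandas as pd

columns = ["FileUUID", "Table"]
dataframes = []

for csvfile in glob.glob("<csv-folder-path>/*.csv"):
    # only files larger than 62 bytes
    if os.stat(csvfile).st_size > 62:
        df = pd.read_csv(csvfile)
        dataframes.append(df[columns])

# combine everything and write a single Excel file
df_combined = pd.concat(dataframes, ignore_index=True)
df_combined.to_excel("<output-path>.xlsx", index=False)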
Use the standard csv module. Don't re-invent the wheel.
https://docs.python.org/3/library/csv.html
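For completeness, a minimal sketch of the same idea with the standard csv module; the column positions 4 and 2 follow the attempt in the question, and the size threshold is again treated as bytes:

import csv
import glob
import os

path = 'csvs/'
for infile in glob.glob(os.path.join(path, '*.csv')):
    # skip small files
    if os.stat(infile).st_size <= 62:
        continue
    with open(infile, newline='') as source, open(infile + '.out', 'w', newline='') as result:
        writr = csv.writer(result)
        for r in csv.reader(source):
            writr.writerow((r[4], r[2]))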
I have 300 raw data files (.xlsm) and want to extract the useful data and turn it into csv files as input for a subsequent neural network. For now I am trying to implement this with 10 files as an example. I have successfully extracted the information I need, but I don't know how to convert the files to csv files with the same names. For a single file we can use df.to_csv, but how about all of them? With a for loop?
import glob
import pandas as pd
import numpy as np
import csv
import os

excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
directory = '/Beispiel'

for files in excel_files:
    data = pd.read_excel(files)

    # getting the list of rows and columns you need
    list_of_dfs = pd.DataFrame(data.values[0:600:, 12:26],
                               columns=data.columns[12:26]).drop(['Sauberkeit', 'Temparatur'], axis=1)

    # converting pandas dataframe columns to numeric: string into float
    cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
            'TempB', 'TempC', 'Modul1', 'Modul2',
            'Modul3', 'Modul4', 'Modul5', 'Modul6']
    list_of_dfs[cols] = list_of_dfs[cols].apply(pd.to_numeric, errors='coerce', axis=1)

    # Filling down from a column through missing data
    for fec in list_of_dfs[cols]:
        list_of_dfs[fec].fillna(method='ffill', inplace=True)

    csvfilename = files.split('/')[-1].split('.')[0] + '.csv'
    newtempfile = os.path.join(directory, csvfilename)

    print(newtempfile)
    print(list_of_dfs.head(2))
The problem is solved:
folder_name = 'Beispiel'

csvfilename = files.split('/')[-1].split('.')[0] + '.csv'  # change into csv files
newtempfile = os.path.join(folder_name, csvfilename)

# Verify if directory exists
if not os.path.exists(folder_name):
    os.makedirs(folder_name)  # If not, create it

print(newtempfile)
list_of_dfs.to_csv(newtempfile, index=False)
The easiest way of doing this is to get the filename from the excel and then use the os.path.join() method to save it to the directory you want.
directory = "C:/Test"
for files in excel_files:
    csvfilename = os.path.basename(files).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
Since you already have the excel df you want to push into the csv file, just add the above code to the loop and change the output csv file to 'newtempfile' and that should do it.
df.to_csv(newtempfile)
Hope this helps. :)
Updated Code:
cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
        'TempB', 'TempC', 'Modul1', 'Modul2',
        'Modul3', 'Modul4', 'Modul5', 'Modul6']

excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')

for file in excel_files:
    data = pd.read_excel(file, usecols=cols)  # import only the columns you need into the dataframe
    csvfilename = os.path.basename(file).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)

    # converting pandas dataframe columns to numeric: string into float
    data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    # fill missing values downwards
    data[cols] = data[cols].fillna(method='ffill')

    data.to_csv(newtempfile, index=False)
I have a file called 'workbooks_to_process.xlsx' with a column that contains the following excel files' paths:
**files_paths_2_process** (column header)
c:/work/file01.xlsx
c:/work/file02.xlsx
c:/work/file03.xlsx
………………….
c:/work/file0m.xlsx
On the other hand, in Python pandas:
df_0 = pd.read_excel('workbooks_to_process.xlsx') # No issue
list_of_paths = df_0['files_paths_2_process'].tolist() # No issue
Following is what I want to do (in an iterative process):
itr = list_of_paths[3] # or [0], [1], [n] etc
df_1 = pd.read_excel(itr)
Is there any method to accomplish the above?
Thanks!
For iterating through all files in a folder and all sheets in those files, try this:
import pandas as pd
import os

file_list = [os.path.join(r, file) for r, d, f in os.walk("C:\\Users\\ref_folder\\") for file in f]

for file in list(file_list):
    f = pd.ExcelFile(file)
    sheet_names = f.sheet_names
    for i in list(sheet_names):
        dataframe = pd.read_excel(f, i)
This gives you a dataframe for every sheet, and it works for workbooks with only 1 sheet too.
You can match each filename against the filenames in your Excel column, and if it matches, read the df. I feel this is the most general way to iterate through files in a folder and read them as dataframes.
Hope that helps.
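A minimal sketch of that matching step, assuming list_of_paths from the question holds the paths read out of 'workbooks_to_process.xlsx':

import os
import pandas as pd

# file names listed in the workbook column (taken from the question's list_of_paths)
wanted = {os.path.basename(p) for p in list_of_paths}

dfs = {}
for r, d, f in os.walk("C:\\Users\\ref_folder\\"):
    for name in f:
        # read a workbook only if it is one of the listed files
        if name in wanted:
            dfs[name] = pd.read_excel(os.path.join(r, name))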
Try this
for itr in range(len(list_of_paths)):
    df_1 = pd.read_excel(list_of_paths[itr])
    ...
    ...
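A small note on the loop style: since the numeric index itself isn't used, iterating over the list directly reads a bit more naturally (a minimal sketch, same assumed list_of_paths):

import pandas as pd

for path in list_of_paths:
    df_1 = pd.read_excel(path)
    # ... process df_1 here ...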