Append filename to column header when reading multiple CSV files - Python

I want to read multiple .csv files and append the datetime part of their filename to the column header. Each csv file contains data acquired at a specific datetime. Each filename has the following format:
yyyy-mm-dd_hh-mm-ss_someothertext
Each file contains only one column of data.
I successfully import multiple files as a list of dataframes as follows:
import pandas as pd
import glob

path = r'C:\Users\...'  # path to the folder with the CSV files
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)
I then concatenate the files into one dataframe such that each column contains the data from one of the files:
frame = pd.concat(li, axis=1, ignore_index=True)
However, this is where I lose the filename information. The column headers are now just a series of numbers. My question is: how can I append the datetime portion of each filename to its respective column header in frame?
The closest I have got is being able to append the whole filename, not just the datetime part, in a roundabout way by transposing frame, adding the whole filename as a new column, transposing back, then setting the filename row as the header row...
import os

frame = pd.DataFrame.transpose(frame)
frame['filename'] = os.path.basename(filename)
frame = pd.DataFrame.transpose(frame)
frame.reset_index(drop=True)
frame.columns = frame.iloc[6628]  # row 6628 is where the row with the filenames ends up after transposing
This seems terribly inefficient though and ends up with the whole filename as the header rather than just the datetime part.

This would be my suggested approach: squeeze each one-column DataFrame down to a Series and use a regex to pull the date out of the filename:
import re
import os
import glob
import pandas as pd

path = r'C:\Users\....'
files = glob.glob(os.path.join(path, '*.csv'))

li = []
for file in files:
    name = os.path.basename(file)  # get the filename
    date = re.search(r'\d{4}-\d{2}-\d{2}', name).group(0)  # extract yyyy-mm-dd from the filename
    # read the file, squeeze the single column down to a Series, and rename it to the date
    # (read_csv's squeeze=True keyword was removed in pandas 2.0, so call .squeeze() instead)
    li.append(pd.read_csv(file, index_col=None, header=0).squeeze('columns').rename(date))

frame = pd.concat(li, axis=1)
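If you have already built frame with ignore_index=True, as in the question, a minimal alternative sketch is to collect the datetime strings separately and assign them as the column labels afterwards. The wider pattern below also keeps the hh-mm-ss part, so two files from the same day still get distinct headers:

import re
import os

# one label per file, in the same order as all_files
dates = [re.search(r'\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}', os.path.basename(f)).group(0)
         for f in all_files]
frame.columns = dates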

Related

Import multiple sas files in Python and then row bind

I have over 20 SAS (sas7bdat) files, all with the same columns, that I want to read in Python.
I need an iterative process to read all the files and rbind (row-bind) them into one big df.
This is what I have so far, but it throws an error saying there are no objects to concatenate.
import pyreadstat
import glob
import os

path = r'C:\Users\myfolder'  # or a unix / linux / mac path
all_files = glob.glob(os.path.join(path, "/*.sas7bdat"))

li = []
for filename in all_files:
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename, chunksize=10000, usecols=cols)
    for df, meta in reader:
        li.append(df)
frame = pd.concat(li, axis=0)
I found this answer to read in csv files helpful: Import multiple CSV files into pandas and concatenate into one DataFrame
So if one has SAS data files that are too big and plans to append all of them into one df, then:

li = []
# reading in chunks keeps the RAM usage bounded
for filename in all_files:
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename, chunksize=10000, usecols=cols)
    for df, meta in reader:
        li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
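As for the "no objects to concatenate" error: os.path.join(path, "/*.sas7bdat") is likely the culprit, because a second argument that starts with a slash makes os.path.join discard the folder part of path, so the glob matches nothing and li stays empty. A corrected end-to-end sketch (assuming cols lists the columns you need):

import os
import glob
import pandas as pd
import pyreadstat

path = r'C:\Users\myfolder'
all_files = glob.glob(os.path.join(path, "*.sas7bdat"))  # no leading slash in the pattern

cols = ['col1', 'col2']  # hypothetical: the columns you actually need
li = []
for filename in all_files:
    reader = pyreadstat.read_file_in_chunks(pyreadstat.read_sas7bdat, filename,
                                            chunksize=10000, usecols=cols)
    for df, meta in reader:
        li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)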

Process .csv files successively and extract rows whenever a specific column has non-empty cells

I am developing code to process multiple .csv files in a for loop and then extract (into new .csv files) only the rows that have non-empty string cells in a specific column named "20210-2.0". The non-empty cells all contain the same string as the column name (i.e. 20210-2.0). Here is a screenshot showing part of the csv file:
https://uoe-my.sharepoint.com/:i:/g/personal/gpapanas_ed_ac_uk/EayBblFTHmVJvRfsB6h8Vr4B09IfjQ2L1I5OQKUN2p5wzw?e=2gXW61
import pandas as pd
import glob
import os

path = './'
all_files = glob.glob(path + "/*.csv")

li = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None)
    li.append(df)

    df = li[li['20201-2.0'].notnull()]
    print('extracting info from csv...')
    print(df)

    # You can now export all outcomes in new csv files
    file_name = filename + 'new' + '.csv'
    save_path = os.path.abspath(os.path.join(path, file_name))
    print('saving ...')
    export_csv = df.to_csv(save_path, index=None)
I get the following error:
df = li[li['20201-2.0'].notnull()]
TypeError: list indices must be integers or slices, not str
Inside your loop, filter the dataframe right after you read the file, before storing it in the list:

    df = pd.read_csv(filename, index_col=None, header=0)  # header=0 tells pandas the first row holds the column names; your '20201-2.0' value is a column name, right?
    df = df[df['20201-2.0'].notnull()]  # keep only the rows where the column named '20201-2.0' is populated
    li.append(df)  # store that filtered dataframe in the list li
I also noticed that when saving the new csv file you are appending both a "new" string and a ".csv" string to each filename variable.
Have you run this code? Doesn't it save your file as "something.csvnew.csv"?
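Putting it together, a minimal per-file sketch (assuming the column really is named '20201-2.0' and you want one filtered output file per input):

import os
import glob
import pandas as pd

path = './'
for filename in glob.glob(os.path.join(path, "*.csv")):
    df = pd.read_csv(filename, header=0)
    df = df[df['20201-2.0'].notnull()]  # rows with a non-empty cell in that column
    stem, ext = os.path.splitext(os.path.basename(filename))
    save_path = os.path.join(path, stem + '_new' + ext)  # e.g. data.csv -> data_new.csv
    df.to_csv(save_path, index=False)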

Grabbing a single Excel worksheet from multiple workbooks into a pandas dataframe and saving this

I need to extract an Excel worksheet from multiple workbooks, save it to a dataframe, and in turn save that dataframe.
I have a spreadsheet that is generated at the end of each month (e.g.
June 2019.xlsx, May 2019.xlsx, April 2019.xlsx).
I need to grab the worksheet 'Sheet1' from each of these workbooks and convert it to a dataframe (df1).
I would like to have this dataframe saved.
As a nice-to-have, I would also like some way to just append the next month's data after the initial 'data grab'.
I'm relatively new to this, so I haven't made much progress.
import os
import glob
import pandas as pd
import xlrd
import json
import io
import flatten_json

files = glob.glob('/Users/ngove/Documents/Python Scripts/2019/*.xlsx')
dfs = {}
for f in files:
    dfs[os.path.splitext(os.path.basename(f))[0]] = pd.read_excel(f)
You can drop all of your files in a directory (e.g. the current directory). Then collect all the Excel filenames in a list (e.g. files_xls), iterate over them, and use pandas.read_excel to get the respective dataframes (e.g. list_frames).
Below, you can find an example:
import os
import pandas as pd

path = os.getcwd()        # get the current directory
files = os.listdir(path)  # get all the files in the current directory

# keep only the xls or xlsm files (this depends on you)
files_xls = [f for f in files if f[-3:] == 'xls' or f[-4:] == 'xlsm']

list_frames = []
for f in files_xls:
    print("Processing file: %s" % f)
    try:
        # the following gives you the dataframe;
        # the read_excel parameters depend on your data format
        data = pd.read_excel(f, 'Sheet1', header=0, index_col=None)
    except Exception as e:
        print("Skipping %s: %s" % (f, e))  # skip unreadable files instead of crashing
        continue
    list_frames.append(data)

# at the end you can concat your data if you want and remove any duplicates
df = pd.concat(list_frames, sort=False).fillna(0)
df = df.drop_duplicates()

# and save the combined result (the context manager closes the file for you;
# ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter("your_title" + ".xlsx", engine='xlsxwriter') as writer:
    df.to_excel(writer, sheet_name="Sheet1", index=False)
I hope this helps.
I interpreted your statement that you want to save the dataframe as meaning you want to save it as a combined Excel file. This will combine all files in the specified folder that end in xlsx.
import os
import pandas as pd

os.chdir("H:/Python/Reports/")  # edit this to be your path
path = os.getcwd()
files = os.listdir(path)
files_xlsx = [f for f in files if f[-4:] == 'xlsx']

frames = []
for f in files_xlsx:
    data = pd.read_excel(f, 'Sheet1')
    frames.append(data)
df = pd.concat(frames, ignore_index=True)  # DataFrame.append was removed in pandas 2.0

with pd.ExcelWriter('Combined_Data.xlsx') as writer:
    df.to_excel(writer, 'Sheet1', index=False)
You could update the code to grab all 2019 files by changing the one line to this:
files_xlsx = [f for f in files if f[-9:] == '2019.xlsx']
I referenced this question for most of the code, updated it for xlsx, and added the file-save portion.
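For the nice-to-have of appending the next month after the initial grab, a minimal sketch (assuming Combined_Data.xlsx already exists and the new workbook, here hypothetically 'July 2019.xlsx', follows the same layout):

import pandas as pd

existing = pd.read_excel('Combined_Data.xlsx', 'Sheet1')
new_month = pd.read_excel('July 2019.xlsx', 'Sheet1')  # hypothetical next-month file
combined = pd.concat([existing, new_month], ignore_index=True)
with pd.ExcelWriter('Combined_Data.xlsx') as writer:
    combined.to_excel(writer, 'Sheet1', index=False)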

Convert multiple xlsm files automatically to multiple csv files by using pandas

I have 300 raw data files (.xlsm) and want to extract the useful data and turn it into csv files as input for a subsequent neural network. For now I am trying this out with 10 files as an example. I have successfully extracted the information I need, but I don't know how to convert the results to csv files with the same names. For a single file we can use df.to_csv, but how do I do this for all the files? With a for loop?
import glob
import pandas as pd
import numpy as np
import csv
import os

excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')
directory = '/Beispiel'

for files in excel_files:
    data = pd.read_excel(files)
    # getting the list of rows and columns you need
    list_of_dfs = pd.DataFrame(data.values[0:600, 12:26],
                               columns=data.columns[12:26]).drop(['Sauberkeit', 'Temparatur'], axis=1)
    # converting pandas dataframe columns to numeric: string into float
    cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
            'TempB', 'TempC', 'Modul1', 'Modul2',
            'Modul3', 'Modul4', 'Modul5', 'Modul6']
    list_of_dfs[cols] = list_of_dfs[cols].apply(pd.to_numeric, errors='coerce', axis=1)
    # filling down from a column through missing data
    for fec in list_of_dfs[cols]:
        list_of_dfs[fec].fillna(method='ffill', inplace=True)

    csvfilename = files.split('/')[-1].split('.')[0] + '.csv'
    newtempfile = os.path.join(directory, csvfilename)
    print(newtempfile)
    print(list_of_dfs.head(2))
The problem is solved:

folder_name = 'Beispiel'
csvfilename = files.split('/')[-1].split('.')[0] + '.csv'  # change into a csv filename
newtempfile = os.path.join(folder_name, csvfilename)
# verify the directory exists
if not os.path.exists(folder_name):
    os.makedirs(folder_name)  # if not, create it
print(newtempfile)
list_of_dfs.to_csv(newtempfile, index=False)
The easiest way of doing this is to take the filename from the Excel path and then use the os.path.join() method to save into the directory you want:

directory = "C:/Test"
for files in excel_files:
    csvfilename = os.path.basename(files).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)

Since you already have the dataframe you want to push into the csv file, just add the above to the loop and write it out to newtempfile:

    list_of_dfs.to_csv(newtempfile, index=False)

Hope this helps. :)
Updated code:

import os
import glob
import pandas as pd

cols = ['KonzA', 'KonzB', 'KonzC', 'TempA',
        'TempB', 'TempC', 'Modul1', 'Modul2',
        'Modul3', 'Modul4', 'Modul5', 'Modul6']

directory = "C:/Test"
excel_files = glob.glob('../../Versuch/Versuche/RohBeispiel/*.xlsm')

for file in excel_files:
    data = pd.read_excel(file, usecols=cols)  # import only the columns you need into the dataframe
    csvfilename = os.path.basename(file).replace('.xlsm', '.csv')
    newtempfile = os.path.join(directory, csvfilename)
    # converting pandas dataframe columns to numeric: string into float
    data[cols] = data[cols].apply(pd.to_numeric, errors='coerce')
    data[cols] = data[cols].ffill()  # fill down through missing data
    data.to_csv(newtempfile, index=False)
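A portability note on the filename handling above: files.split('/') breaks on Windows paths, which use backslashes. A small sketch using pathlib, which handles both separators (the input path here is hypothetical):

from pathlib import Path

file = Path('../../Versuch/Versuche/RohBeispiel/example.xlsm')  # hypothetical input file
csvfilename = file.with_suffix('.csv').name  # 'example.csv'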

Concatenate csv files in python by ascending order of filenames

I need to concatenate csv files with the same column headers in python. The csv files with the following filenames should be concatenated in the order shown below (ascending order of filename):
AB201602.csv
AB201603.csv
AB201604.csv
AB201605.csv
AB201606.csv
AB201607.csv
AB201608.csv
AB201610.csv
AB201612.csv
I would like to keep the column headers only from the first file. Any idea?
I tried the code below, but it combined the csv files in arbitrary filename order and truncated half of the column header names. Thanks.
csvfiles = glob.glob('/home/c/*.csv')
wf = csv.writer(open('/home/c/output.csv', 'wb'), delimiter=',')
for files in csvfiles:
    rd = csv.reader(open(files, 'r'), delimiter=',')
    rd.next()
    for row in rd:
        print(row)
        wf.writerow(row)
Using @Gokul's comment and pandas:

import pandas as pd
import glob

csvfiles = sorted(glob.glob('/home/c/*.csv'))  # sorted() gives the ascending filename order

frames = [pd.read_csv(f) for f in csvfiles]
df = pd.concat(frames, ignore_index=True)  # DataFrame.append was removed in pandas 2.0
df.to_csv('newfile.csv', index=False)
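If you would rather stay with the csv module, a sketch that keeps the header only from the first file (assuming the files live in /home/c as in the question):

import csv
import glob

csvfiles = sorted(glob.glob('/home/c/*.csv'))  # ascending filename order

with open('/home/c/output.csv', 'w', newline='') as out:
    wf = csv.writer(out)
    for i, name in enumerate(csvfiles):
        with open(name, 'r', newline='') as f:
            rd = csv.reader(f)
            header = next(rd)
            if i == 0:
                wf.writerow(header)  # header from the first file only
            for row in rd:
                wf.writerow(row)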
