Get file created date - add to dataframes column on read_csv - python

I need to pull many (hundreds of) CSVs into a pandas DataFrame, and add a column with each file's creation date as it is read in. I can obtain that date for a single CSV file using this call:
time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime('/path/file.csv')))
For reference, this is the command I am using to read in the CSVs:
path1 = r'/path/'
all_files_standings = glob.glob(path1 + '/*.csv')
standings = pd.concat((pd.read_csv(f, low_memory=False, usecols=[7, 8, 9]) for f in all_files_standings))
I tried running this call (which worked):
dt_gm = [time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime('/path/file.csv')))]
So then I tried expanding it:
dt_gm = [time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(f) for f in all_files_standings))]
and I get this error:
TypeError: an integer is required (got type generator)
How can I resolve this?
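For what it's worth, the generator in the failing call is being passed to `os.path.getmtime`, which expects a single path; the fix is to wrap the whole expression in the list comprehension. A minimal sketch, using throwaway temp files in place of the real CSV paths:

```python
import os
import tempfile
import time

# Throwaway CSV files standing in for the real glob results.
tmpdir = tempfile.mkdtemp()
all_files_standings = []
for name in ('one.csv', 'two.csv'):
    path = os.path.join(tmpdir, name)
    with open(path, 'w') as fh:
        fh.write('a,b\n1,2\n')
    all_files_standings.append(path)

# The comprehension must wrap the whole expression, so that
# os.path.getmtime receives one path at a time, not a generator:
dt_gm = [time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(f)))
         for f in all_files_standings]
```

This yields one formatted date string per file, which can then be assigned to a column as each frame is read.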

If the different files have the same columns and you would like to append them as rows:
import pandas as pd
import time
import os
# list of files you want to read
files = ['one.csv', 'two.csv']
column_names = ['c_1', 'c_2', 'c_3']
all_dataframes = []
for file_name in files:
    df_temp = pd.read_csv(file_name, delimiter=',', header=None)
    df_temp.columns = column_names
    df_temp['creation_time'] = time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(file_name)))
    df_temp['file_name'] = file_name
    all_dataframes.append(df_temp)
df = pd.concat(all_dataframes, axis=0, ignore_index=True)
df
If you want to append the different files by columns:
all_dataframes = []
for idx, file_name in enumerate(files):
    df_temp = pd.read_csv(file_name, delimiter=',', header=None)
    column_prefix = 'f_' + str(idx) + '_'
    df_temp.columns = [column_prefix + c for c in column_names]
    df_temp[column_prefix + 'creation_time'] = time.strftime('%m/%d/%Y', time.gmtime(os.path.getmtime(file_name)))
    all_dataframes.append(df_temp)
pd.concat(all_dataframes, axis=1)

Related

How to add CSV file names as Column Headers in a dataframe using pandas?

I'm trying to concatenate specific columns from all the CSV files in a directory. I can do that and produce a resulting CSV file, but since I don't know which columns belong to which CSV file, I would like to make each column's header the name of the CSV file it came from.
For eg.
CSVFile1:
Col1|Col2
CSVFile2:
Col1|Col2
CSVMergeFile:
CSVFile1|CSVFile2|CSVFile1|CSVFile2
Col1    |Col1    |Col2    |Col2
The following is the code I'm using to concatenate the columns:
import pandas as pd
import glob
p = input("Enter folder path :")
n = int(input("Enter number of columns: "))
col = []
for i in range(0, n):
    ele = int(input())
    col.append(ele)
path = f'{p}'
all_files = glob.glob(path + "/*.csv")
li = []
for filename in all_files:
    df = pd.read_csv(filename, usecols=col, index_col=False, header=0)
    li.append(df)
frame = pd.concat(li, axis=1, ignore_index='False')
Any suggestions?
You can use the CSV filename as a key and concatenate within a dict comprehension. This will create an index level containing the filename.
from pathlib import Path
import pandas as pd
p = input("Enter folder path :")
# change .stem to .name if you want to keep the `.csv` extension.
dfs = {file.stem: pd.read_csv(file, usecols=col, header=0)
       for file in Path(p).glob('*.csv')}
df = pd.concat(dfs)
dfs will be a dictionary with each file name as the key and the corresponding dataframe as the value.
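As a minimal illustration of what pd.concat does with a dict (stand-in frames here, not read from disk): the keys become the outer level of a MultiIndex, so every row stays traceable to its source file.

```python
import pandas as pd

# Stand-in frames; in the answer these come from pd.read_csv.
dfs = {'CSVFile1': pd.DataFrame({'Col1': [1, 2]}),
       'CSVFile2': pd.DataFrame({'Col1': [3, 4]})}
df = pd.concat(dfs)

# The dict keys form the outer index level:
sources = list(df.index.get_level_values(0).unique())
```

`sources` here is `['CSVFile1', 'CSVFile2']`, and selecting one file's rows is just `df.loc['CSVFile1']`.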

How to create variables and read several excel files in a loop with pandas?

L=[('X1',"A"),('X2',"B"),('X3',"C")]
for i in range(len(L)):
    path = os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0]) = pd.read_excel(xls, 'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I cannot create several dataframes for several Excel files, because I don't know how to create the variables.
I'll need a result that looks like this :
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved :
d = {}
for i, value in L:
    path = os.path.join(value + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    df = pd.read_excel(xls, 'Sheet1')
    key = 'df-' + str(i)
    d[key] = df
I would approach this by reading everything into 1 dataframe (loop over files, and concat):
import os
import pandas as pd
files = [] #generate list for files to go into
path_of_directory = "path/to/folder/"
for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))
output_data = [] # blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)
total = pd.concat(output_data, ignore_index=True, sort=True)
Then you can interrogate the combined df by using df.loc[df['name'] == 'choice']
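A small sketch of that filtering step, with made-up rows in place of the files read from disk ('a.xlsx' and 'b.xlsx' are hypothetical filenames):

```python
import pandas as pd

# 'total' as built above, but with made-up rows.
total = pd.DataFrame({'col1': [1, 2, 3],
                      'name': ['a.xlsx', 'a.xlsx', 'b.xlsx']})

# Select only the rows that came from one source file:
subset = total.loc[total['name'] == 'a.xlsx']
```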
Or (in keeping with your question):
You could then split into a dictionary of dataframes, based on this column. This is the best approach...
import copy

dictionary = {}
column = 'name'  # the filename column added above
df[column] = df[column].astype(str)
col_values = df[column].unique()
for value in col_values:
    key_name = 'df' + str(value)
    dictionary[key_name] = copy.deepcopy(df)
    dictionary[key_name] = dictionary[key_name][df[column] == value]
    dictionary[key_name].reset_index(inplace=True, drop=True)
The reason for this approach is discussed in "Create new dataframe in pandas with dynamic names also add new column", which argues that dynamically naming dataframes is bad practice and that this dict approach is best.
This might help.
files_xls = ['all your excel filenames go here']
df = pd.DataFrame()
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat instead
print(df)

Need to pick 'second column' from multiple csv files and save all 'second columns' in one csv file

So I have 366 CSV files and I want to copy their second columns and write them into a new CSV file. I need code for this job; I tried some code available here but nothing worked. Please help.
Assuming all the 2nd columns are the same length, you could simply loop through all the files. Read them, save the 2nd column to memory and construct a new df along the way.
filenames = ['test.csv', ....]
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df
new_df.to_csv('new_csv.csv', index=False)
filenames = glob.glob(r'D:/CSV_FOLDER' + "/*.csv")
new_df = pd.DataFrame()
for filename in filenames:
    df = pd.read_csv(filename)
    second_column = df.iloc[:, 1]
    new_df[f'SECOND_COLUMN_{filename.upper()}'] = second_column
    del df
new_df.to_csv('new_csv.csv', index=False)
This can accomplished with glob and pandas:
import glob
import pandas as pd
mylist = [f for f in glob.glob("*.csv")]
df = pd.read_csv(mylist[0])       # create the dataframe from the first csv
df = pd.DataFrame(df.iloc[:, 1])  # only keep 2nd column
for x in mylist[1:]:              # loop through the rest of the csv files doing the same
    t = pd.read_csv(x)
    colName = pd.DataFrame(t.iloc[:, 1]).columns
    df[colName] = pd.DataFrame(t.iloc[:, 1])
df.to_csv('output.csv', index=False)

Importing multiple excel files into Python, merge and apply filename to a new column

I have a for loop that imports all of the Excel files in the directory and merges them into a single dataframe. However, I want to create a new column where each row contains the filename of the Excel file it came from.
Here is my import and merge code:
path = os.getcwd()
files = os.listdir(path)
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    df = df.append(data)
For example, if the first Excel file is named "file1.xlsx", I want all rows from that file to have the value file1.xlsx in col3 (a new column). If the second Excel file is named "file2.xlsx", I want all of its rows to have the value file2.xlsx. Note that there is no real pattern to the filenames; I just use these names as an example.
Many thanks
Create new column in loop:
df = pd.DataFrame()
for f in files:
    data = pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2'])
    data['col3'] = f
    df = df.append(data)
Another possible solution with list comprehension:
dfs = [pd.read_excel(f, 'Sheet1', header=None, names=['col1', 'col2']).assign(col3=f)
       for f in files]
df = pd.concat(dfs)

Merged Excel files overwriting first column in Python using Pandas

I have a lot of Excel files, and I want to append them using the following code:
import pandas as pd
import glob
import os
import openpyxl
df = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, 'Sheet1')
    data.index = [os.path.basename(f)] * len(data)
    df.append(data)
df = pd.concat(df)
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,'Sheet1')
writer.save()
Why does python alter the first column when concatenating excel files?
I think you need:
df = []
for f in glob.glob("*.xlsx"):
    data = pd.read_excel(f, 'Sheet1')
    name = os.path.basename(f)
    # create a MultiIndex so the original index is not overwritten
    data.index = pd.MultiIndex.from_product([[name], data.index], names=('files', 'orig'))
    df.append(data)
# reset the index to turn the MultiIndex levels into columns
df = pd.concat(df).reset_index()
Another solution is use parameter keys in concat:
files = glob.glob("*.xlsx")
names = [os.path.basename(f) for f in files]
dfs = [pd.read_excel(f, 'Sheet1') for f in files]
df = pd.concat(dfs, keys=names).rename_axis(('files','orig')).reset_index()
Which is the same as:
df = []
names = []
for f in glob.glob("*.xlsx"):
    df.append(pd.read_excel(f, 'Sheet1'))
    names.append(os.path.basename(f))
df = pd.concat(df, keys=names).rename_axis(('files', 'orig')).reset_index()
Finally, write to Excel with no index and no column names:
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,'Sheet1', index=False, header=False)
writer.save()
