Merge all dataframes together in a loop - python

I have several CSV files in a certain path and I would like to combine them all into one DataFrame. So far I have done this laboriously, reading each file into its own variable and collecting the results by hand.
Is there a way to do everything in a for loop, so I don't have to write df1 = pd.read_csv(CSV_FILES[0]) and frames = [df1, df2, df3, df4]? As soon as I try to read the files in a for loop, I get an error.
How can I improve this code so that it does not refer to the individual entries such as CSV_FILES[0], but does everything in a loop?
from pathlib import Path

import pandas as pd

PATH = ''

def find_csv(path):
    csv_files = []
    print("Looking for files at ", path)
    for file in Path(path).glob('*.csv'):
        csv_files.append(str(file))
    print("Found ", len(csv_files), " csv files")
    return csv_files

CSV_FILES = find_csv(PATH)

df1 = pd.read_csv(CSV_FILES[0])
df2 = pd.read_csv(CSV_FILES[1])
df3 = pd.read_csv(CSV_FILES[2])
df4 = pd.read_csv(CSV_FILES[3])

frames = [df1, df2, df3, df4]
df = pd.concat(frames)

You can build a list of DataFrames directly; change:
csv_files.append(str(file))
to:
csv_files.append(pd.read_csv(str(file)))
And then join them together:
df = pd.concat(CSV_FILES)
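Putting the pieces together, the whole read-and-concat can be done in one pass. A minimal sketch (the helper name concat_csvs is mine, and it assumes all files share the same columns):

```python
from pathlib import Path

import pandas as pd

def concat_csvs(path):
    # Read every CSV under `path` (sorted for a stable order) and
    # stack them into a single DataFrame with a fresh index.
    frames = [pd.read_csv(f) for f in sorted(Path(path).glob("*.csv"))]
    return pd.concat(frames, ignore_index=True)
```

sorted() keeps the row order reproducible across runs, since glob order is filesystem-dependent.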


Select specific column from multiple csv files, then merge those columns into single file using pandas

I am trying to select a specific column, with the header "Average", from multiple csv files. Then take the "Average" column from each of those multiple csv files and merge them into a new csv file.
I left the comments in to show the other ways I tried to accomplish this:
procdir = r"C:\Users\ChromePnP\Desktop\exchange\processed"
collected = os.listdir(procdir)
flist = list(collected)
flist.sort()

#exclude first file in list
rest_of_files = flist[1:]

for f in rest_of_files:
    get_averages = pd.read_csv(f, usecols=['Average'])
    #df1 = pd.DataFrame(f)
    #df2 = pd.DataFrame(rundata_file)
    #get_averages = pd.read_csv(f)
    #for col in ['Average']:
    #    get_averages[col].to_csv(f_template)
    got_averages = pd.merge(get_averages, right_on='Average')
    got_averages.to_csv("testfile.csv", index=False)
EDIT:
I was able to get the columns I wanted, and they will print. However now the saved file only has a single average column from the loop, instead of saving all the columns selected in the loop.
rest_of_files = flist[1:]
#f.sort()
print(rest_of_files)

for f in rest_of_files:
    get_averages = pd.read_csv(f)
    df1 = pd.DataFrame(get_averages)
    got_averages = df1.loc[:, ['Average']]
    print(got_averages)
    f2_temp = pd.read_csv(rundata_file)
    df2 = pd.DataFrame(f2_temp)
    merge_averages = pd.concat([df2, got_averages], axis=1)
    merge_averages.to_csv(rundata_file, index=False)
Either you use pd.merge, passing both the left and the right DataFrame as arguments:
got_averages = pd.merge(got_averages, get_averages, right_on='Average')
Or you use the DataFrame's own .merge method:
got_averages = got_averages.merge(get_averages, right_on='Average')
Keep in mind you need to initialize got_averages (as an empty DataFrame, for instance) before using it in your for loop.
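For what the question ultimately wants, one Average column collected per file, a merge isn't really needed; pd.concat with axis=1 lines the columns up side by side. A sketch under that reading (collect_averages and the per-file column renaming are my own choices, not from the original code):

```python
import pandas as pd

def collect_averages(files):
    # Read only the 'Average' column from each CSV and rename it after
    # its source file so the output columns stay distinguishable.
    cols = [pd.read_csv(f, usecols=["Average"])["Average"].rename(str(f))
            for f in files]
    return pd.concat(cols, axis=1)
```

The result can then be written once, after the loop, with .to_csv("testfile.csv", index=False), instead of overwriting the file on every iteration.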

Loop through old and new versions of files

I am trying to create a .csv containing records that are different between old and new csv files. I have successfully accomplished this with a single such pair using
old_df = 'file1_old.csv'
new_df = 'file1_new.csv'

df1 = pd.read_csv(old_df)
df2 = pd.read_csv(new_df)

df1['flag'] = 'old'
df2['flag'] = 'new'

df = pd.concat([df1, df2])
dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
dups_dropped.to_csv('difference.csv', index=False)
I am struggling to wrap my mind around how to scale this with a loop (?) to output a csv for each new pairing if the new v. old file names input are of the same convention, for instance:
file1_new, file1_old
file2_new, file2_old
file3_new, file3_old
so that the output is
file1_difference.csv
file2_difference.csv
file3_difference.csv
Thoughts? Much appreciated
Using a simple for loop with f-strings to format the filenames should work:
for i in range(1, 11):  # replace 11 with the number of file pairs you have + 1
    old_df = f'file{i}_old.csv'
    new_df = f'file{i}_new.csv'
    df1 = pd.read_csv(old_df)
    df2 = pd.read_csv(new_df)
    df1['flag'] = 'old'
    df2['flag'] = 'new'
    df = pd.concat([df1, df2])
    dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
    dups_dropped.to_csv(f'file{i}_difference.csv', index=False)
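If the file count shouldn't be hardcoded, the pairs can also be derived from whatever _old files are actually on disk. A sketch under that assumption (diff_pair is a hypothetical helper name of my own):

```python
import glob

import pandas as pd

def diff_pair(old_path, new_path, out_path):
    # Flag each row's origin, then keep only rows that appear in
    # exactly one of the two files.
    df1 = pd.read_csv(old_path).assign(flag="old")
    df2 = pd.read_csv(new_path).assign(flag="new")
    combined = pd.concat([df1, df2])
    diff = combined.drop_duplicates(
        combined.columns.difference(["flag"]), keep=False)
    diff.to_csv(out_path, index=False)

# Derive the pairs from the old files actually present:
for old in glob.glob("file*_old.csv"):
    diff_pair(old, old.replace("_old.csv", "_new.csv"),
              old.replace("_old.csv", "_difference.csv"))
```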

Saving pandas dataframe as csv and overwrite existing file

I always have two dataframes, coming from different directories, whose filenames share the same last four digits. The directory paths are:
dir1 = "path/to/files1/"
dir2 = "path/to/files2/"
Then I use a loop to load and concatenate the dataframes that belong together into a dataframe df.
# For each file in the first directory
for i in os.listdir(dir1):
    # For each file in the second directory
    for j in os.listdir(dir2):
        # If the last 4 digits of filename match (ignoring file extension)
        if i[-8:-4] == j[-8:-4]:
            # Load CSVs into pandas
            print(i[-12:-4] + ' CPU Analysis')
            print('\n')
            df1 = pd.read_csv(dir1 + i, delimiter=',')
            df2 = pd.read_csv(dir2 + j, delimiter=';')
            df = pd.concat([df1, df2])
What I now want to do is to store df in dir1 using the same filename as before, i.e. I want to overwrite the existing file in dir1 and save as csv.
So, I think I should use something like this at the end of the loop:
df.to_csv(dir1, i[:-4])
But I am not sure about this.
I think it is possible here to join the path components with +:
df = pd.concat([df1, df2])
df.to_csv(dir1 + i[:-4] + '.csv', index=False)
Or use f-strings:
df = pd.concat([df1, df2])
df.to_csv(f'{dir1}{i[:-4]}.csv', index=False)
But if you need the original extension, use the same path as for reading the file:
df = pd.concat([df1, df2])
df.to_csv(dir1 + i, index=False)
df = pd.concat([df1, df2])
df.to_csv(f'{dir1}{i}', index=False)
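The nested loop can also be flattened by indexing the second directory's files by their 4-character suffix first. This sketch (merge_matching is a name of my own; it assumes the suffix convention from the question and uses os.path.join rather than string concatenation, so trailing slashes don't matter) writes the combined frame back over the file in dir1:

```python
import os

import pandas as pd

def merge_matching(dir1, dir2):
    # Index dir2's files by the 4 characters before the extension,
    # so each dir1 file needs one dictionary lookup instead of a
    # full scan of dir2.
    by_suffix = {j[-8:-4]: j for j in os.listdir(dir2)}
    for i in os.listdir(dir1):
        j = by_suffix.get(i[-8:-4])
        if j is None:
            continue
        df1 = pd.read_csv(os.path.join(dir1, i), delimiter=",")
        df2 = pd.read_csv(os.path.join(dir2, j), delimiter=";")
        # Overwrite the dir1 copy with the concatenated result.
        pd.concat([df1, df2]).to_csv(os.path.join(dir1, i), index=False)
```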

Append Dataframe inside the loop - Python

I am trying to append to a dataframe inside the loop after reading each file, but it still does not append the full dataset.
columns = list(df)
data = []

for file in glob.glob("*.html"):
    df = pd.read_html(file)[2]
    zipped_date = zip(columns, df.values)
    a_dictionary = dict(zipped_date)
    data.append(a_dictionary)
    full_df = full_df.append(data, False)
Maybe create a list of dataframes inside the loop and then concat them:
data = []
for file in glob.glob("*.html"):
    data.append(pd.read_html(file)[2])

full_df = pd.concat(data, ignore_index=True)
Or use pd.concat with a list comprehension:
df = pd.concat([pd.read_html(file)[2] for file in glob.glob("*.html")])

How to create variables and read several excel files in a loop with pandas?

L = [('X1', "A"), ('X2', "B"), ('X3', "C")]

for i in range(len(L)):
    path = os.path.join(L[i][1] + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    ''.join(L[i][0]) = pd.read_excel(xls, 'Sheet1')
File "<ipython-input-1-6220ffd8958b>", line 6
''.join(L[i][0])=pd.read_excel(xls,'Sheet1')
^
SyntaxError: can't assign to function call
I have a problem with pandas: I cannot create several dataframes from several Excel files, because I don't know how to create the variable names.
I need a result that looks like this:
X1 will have dataframe of A.xlsx
X2 will have dataframe of B.xlsx
.
.
.
Solved:
d = {}
for i, value in L:
    path = os.path.join(value + '.xlsx')
    book = load_workbook(path)
    xls = pd.ExcelFile(path)
    df = pd.read_excel(xls, 'Sheet1')
    key = 'df-' + str(i)
    d[key] = df
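Once the loop has run, each frame is retrieved by its key. A tiny illustration (the one-column DataFrames here are placeholders standing in for the real pd.read_excel results):

```python
import pandas as pd

L = [('X1', "A"), ('X2', "B")]

d = {}
for i, value in L:
    # In the real code this frame comes from pd.read_excel;
    # a placeholder frame stands in for it here.
    d['df-' + i] = pd.DataFrame({"source": [value]})

# X1's data is then reached via its key:
x1 = d['df-X1']
```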
Main pull:
I would approach this by reading everything into 1 dataframe (loop over files, and concat):
import os

import pandas as pd

files = []  # generate list for files to go into
path_of_directory = "path/to/folder/"

for dirname, dirnames, filenames in os.walk(path_of_directory):
    for filename in filenames:
        files.append(os.path.join(dirname, filename))

output_data = []  # blank list for building up dfs
for name in files:
    df = pd.read_excel(name)
    df['name'] = os.path.basename(name)
    output_data.append(df)

total = pd.concat(output_data, ignore_index=True, sort=True)
From there you can interrogate the combined frame with total.loc[total['name'] == 'choice']
Or (in keeping with your question):
You could then split into a dictionary of dataframes, based on this column. This is the best approach...
import copy

def split_by_column(df, column):
    dictionary = {}
    df[column] = df[column].astype(str)
    col_values = df[column].unique()
    for value in col_values:
        key_name = 'df' + str(value)
        dictionary[key_name] = copy.deepcopy(df)
        dictionary[key_name] = dictionary[key_name][df[column] == value]
        dictionary[key_name].reset_index(inplace=True, drop=True)
    return dictionary
The reason for this approach is discussed in "Create new dataframe in pandas with dynamic names also add new column", which basically says that dynamically naming dataframes is bad and that this dict approach is best.
This might help.
files_xls = ['all your excel filenames go here']
df = pd.DataFrame()

for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)

print(df)
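One caveat about this last snippet: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the loop needs pd.concat instead. An equivalent sketch (the reader parameter is my own addition so the same helper also works for CSVs):

```python
import pandas as pd

def combine_sheets(paths, reader=pd.read_excel):
    # Read each file and stack the results; pd.concat replaces the
    # removed df = df.append(data) pattern.
    return pd.concat([reader(p) for p in paths], ignore_index=True)
```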
