Append DataFrame inside the loop - Python

I am trying to append DataFrames inside a loop after reading each file, but the full dataset is still not being appended.
columns = list(df)
data = []
for file in glob.glob("*.html"):
    df = pd.read_html(file)[2]
    zipped_date = zip(columns, df.values)
    a_dictionary = dict(zipped_date)
    data.append(a_dictionary)
    full_df = full_df.append(data, False)

Maybe create a list of DataFrames inside the loop and then concat them:
data = []
for file in glob.glob("*.html"):
    data.append(pd.read_html(file)[2])
full_df = pd.concat(data, ignore_index=True)

Use pd.concat:
df = pd.concat([pd.read_html(file)[2] for file in glob.glob("*.html")])
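Both answers rely on the same accumulate-then-concat pattern. A minimal, self-contained sketch of it, with in-memory frames standing in for the tables that `pd.read_html(file)[2]` would return (the sample data is invented for illustration):

```python
import pandas as pd

# Stand-ins for the tables that pd.read_html(file)[2] would return per file
tables = [
    pd.DataFrame({"a": [1, 2], "b": [3, 4]}),
    pd.DataFrame({"a": [5], "b": [6]}),
]

data = []
for df in tables:      # in the real code: for file in glob.glob("*.html")
    data.append(df)    # data.append(pd.read_html(file)[2])

# One concat at the end keeps every table; nothing is overwritten
full_df = pd.concat(data, ignore_index=True)
```

With `ignore_index=True` the result gets a fresh 0..n-1 index instead of repeating each table's own index.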

Related

Iterate and Concat multiple Dataframe pandas DF python

I have the code below for a pandas operation that parses a JSON, picks certain columns, and concats them at axis 1:
df_columns_raw_1 = df_tables_normalized['columns'][1]
df_columns_normalize_1 = pd.json_normalize(df_columns_raw_1)
df_colName_1 = df_columns_normalize_1['columnName']
df_table_1 = df_columns_normalize_1['tableName']
df_colLen_1 = df_columns_normalize_1['columnLength']
df_colDataType_1 = df_columns_normalize_1['columnDatatype']
result_1 = pd.concat([df_table_1, df_colName_1,df_colLen_1,df_colDataType_1], axis=1)
bigdata = pd.concat([result_1, result_2....result_500], ignore_index=True, sort=False)
I need to iterate and automate the above code so that the bigdata variable concats everything up to the result_500 df, instead of writing it out manually for all the dfs.
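One way to automate the repeated blocks is to loop over the entries of `df_tables_normalized['columns']`, select the four columns from each normalized frame, and concat once at the end. A sketch under the assumption that each entry is a list of records as `pd.json_normalize` expects (the sample data below is invented):

```python
import pandas as pd

# Hypothetical stand-in for df_tables_normalized['columns']:
# each entry is a list of records, as pd.json_normalize expects
columns_series = [
    [{"columnName": "id", "tableName": "t1",
      "columnLength": 10, "columnDatatype": "int"}],
    [{"columnName": "name", "tableName": "t2",
      "columnLength": 50, "columnDatatype": "str"}],
]

wanted = ["tableName", "columnName", "columnLength", "columnDatatype"]
results = []
for raw in columns_series:          # real code: iterate df_tables_normalized['columns']
    normalized = pd.json_normalize(raw)
    results.append(normalized[wanted])  # select and order the columns in one step

bigdata = pd.concat(results, ignore_index=True, sort=False)
```

Selecting `normalized[wanted]` replaces the four separate `df_colName_1`-style variables and the per-result `axis=1` concat in one step.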

Select specific column from multiple csv files, then merge those columns into single file using pandas

I am trying to select a specific column, with the header "Average", from multiple csv files. Then take the "Average" column from each of those multiple csv files and merge them into a new csv file.
I left the comments in to show the other ways I tried to accomplish this:
procdir = r"C:\Users\ChromePnP\Desktop\exchange\processed"
collected = os.listdir(procdir)
flist = list(collected)
flist.sort()
# exclude first file in list
rest_of_files = flist[1:]
for f in rest_of_files:
    get_averages = pd.read_csv(f, usecols=['Average'])
    # df1 = pd.DataFrame(f)
    # df2 = pd.DataFrame(rundata_file)
    # get_averages = pd.read_csv(f)
    # for col in ['Average']:
    #     get_averages[col].to_csv(f_template)
    got_averages = pd.merge(get_averages, right_on='Average')
    got_averages.to_csv("testfile.csv", index=False)
EDIT:
I was able to get the columns I wanted, and they will print. However, now the saved file only has a single Average column from the loop, instead of all the columns selected in the loop.
rest_of_files = flist[1:]
# f.sort()
print(rest_of_files)
for f in rest_of_files:
    get_averages = pd.read_csv(f)
    df1 = pd.DataFrame(get_averages)
    got_averages = df1.loc[:, ['Average']]
    print(got_averages)
    f2_temp = pd.read_csv(rundata_file)
    df2 = pd.DataFrame(f2_temp)
    merge_averages = pd.concat([df2, got_averages], axis=1)
    merge_averages.to_csv(rundata_file, index=False)
Either use pd.merge, passing both the left and right frames, as specified here:
got_averages = pd.merge(got_averages, get_averages, on='Average')
Or use the DataFrame .merge method, doc here:
got_averages = got_averages.merge(get_averages, on='Average')
Keep in mind you need to initialize got_averages (as an empty DataFrame, for instance) before using it in your for loop.
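Since the goal is one Average column per input file placed side by side, collecting the columns in a list and concatenating along `axis=1` is often simpler than repeated merges. A sketch with in-memory frames standing in for the CSV files (data and the `Average_{i}` naming are invented for illustration):

```python
import pandas as pd

# Stand-ins for pd.read_csv(f, usecols=['Average']) on each file
frames = [
    pd.DataFrame({"Average": [1.0, 2.0]}),
    pd.DataFrame({"Average": [3.0, 4.0]}),
]

collected = []
for i, df in enumerate(frames):                 # real code: for f in rest_of_files
    col = df["Average"].rename(f"Average_{i}")  # rename to avoid duplicate headers
    collected.append(col)

# One column per input file, aligned row by row
got_averages = pd.concat(collected, axis=1)
```

Renaming each column before the concat keeps the headers distinguishable; otherwise every column in the output would be called "Average".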

Loop through old and new versions of files

I am trying to create a .csv containing records that are different between old and new csv files. I have successfully accomplished this with a single such pair using
old_df = 'file1_old.csv'
new_df = 'file1_new.csv'
df1 = pd.read_csv(old_df)
df2 = pd.read_csv(new_df)
df1['flag'] = 'old'
df2['flag'] = 'new'
df = pd.concat([df1, df2])
dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
dups_dropped.to_csv('difference.csv', index=False)
I am struggling to wrap my mind around how to scale this with a loop (?) to output a csv for each new pairing, assuming the new vs. old file names follow the same convention, for instance:
file1_new, file1_old
file2_new, file2_old
file3_new, file3_old
so that the output is
file1_difference.csv
file2_difference.csv
file3_difference.csv
Thoughts? Much appreciated
Using a simple for loop with f-strings to help format the filenames should work:
for i in range(1, 11):  # replace 11 with the number of files you have + 1
    old_df = f'file{i}_old.csv'
    new_df = f'file{i}_new.csv'
    df1 = pd.read_csv(old_df)
    df2 = pd.read_csv(new_df)
    df1['flag'] = 'old'
    df2['flag'] = 'new'
    df = pd.concat([df1, df2])
    dups_dropped = df.drop_duplicates(df.columns.difference(['flag']), keep=False)
    dups_dropped.to_csv(f'file{i}_difference.csv', index=False)
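The flag-and-drop-duplicates idea itself can be checked on tiny in-memory frames (the data below is invented):

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["a", "b", "c"]})  # "old" file
df2 = pd.DataFrame({"id": [1, 2, 4], "val": ["a", "b", "d"]})  # "new" file
df1["flag"] = "old"
df2["flag"] = "new"

df = pd.concat([df1, df2])
# keep=False drops every row that appears in both files (ignoring the flag),
# leaving only the records that differ between old and new
diff = df.drop_duplicates(subset=df.columns.difference(["flag"]), keep=False)
```

Here rows (1, a) and (2, b) exist in both files and are dropped entirely; (3, c) survives flagged "old" and (4, d) survives flagged "new".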

Pandas - Loop through sheets

I have 5 sheets and created a script that does numerous formatting steps. I tested it per sheet, and it works perfectly.
import numpy as np
import pandas as pd
FileLoc = r'C:\T.xlsx'
Sheets = ['Alex','Elvin','Gerwin','Jeff','Joshua',]
df = pd.read_excel(FileLoc, sheet_name= 'Alex', skiprows=6)
df = df[df['ENDING'] != 0]
df = df.head(30).T
df = df[~df.index.isin(['Unnamed: 2','Unnamed: 3','Unnamed: 4','ENDING' ,3])]
df.index.rename('STORE', inplace=True)
df['index'] = df.index
df2 = df.melt(id_vars=['index', 2 ,0, 1] ,value_name='SKU' )
df2 = df2[df2['variable']!= 3]
df2['SKU2'] = np.where(df2['SKU'].astype(str).fillna('0').str.contains('ALF|NOB|MET'),df2.SKU, None)
df2['SKU2'] = df2['SKU2'].ffill()
df2 = df2[~df2[0].isnull()]
df2 = df2[df2['SKU'] != 0]
df2[1] = pd.to_datetime(df2[1]).dt.date
df2.to_excel(r'C:\test.xlsx', index=False)
but when I assigned a list via sheet_name=Sheets it always produced a KeyError: 'ENDING'. This is the part of the code:
Sheets = ['Alex','Elvin','Gerwin','Jeff','Joshua',]
df = pd.read_excel(FileLoc,sheet_name='Sheets',skiprows=6)
Is there a proper way to do this, like looping?
My expected result is to execute the formatting that I have created and consolidate it into one excel file.
NOTE: All sheets have the same format.
In using the read_excel method, if you give the parameter sheet_name=None, this will give you a dict (an OrderedDict in older pandas versions) with the sheet names as keys and the corresponding DataFrame as the value. So you can apply this and loop through the dictionary using .items().
The code would look something like this,
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
for key, value in dfs.items():
    # apply logic to value
If you wish to combine the data from the sheets, collect the processed frames in a list and pd.concat them once at the end (DataFrame.append was deprecated and removed in pandas 2.0). That would look something like this,
parts = []
dfs = pd.read_excel('your-excel.xlsx', sheet_name=None)
for key, value in dfs.items():
    # apply logic to value, which is a DataFrame
    parts.append(value)
combined_df = pd.concat(parts, ignore_index=True)
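Since running this needs a real workbook, the dict that sheet_name=None returns can be simulated with in-memory frames to show the loop-and-combine pattern (sheet names and data invented; the ENDING filter stands in for the per-sheet formatting, and the STORE column is added here only to track which sheet a row came from):

```python
import pandas as pd

# Stand-in for pd.read_excel(FileLoc, sheet_name=None, skiprows=6),
# which returns {sheet name: DataFrame}
dfs = {
    "Alex":  pd.DataFrame({"ENDING": [1, 0], "x": [10, 20]}),
    "Elvin": pd.DataFrame({"ENDING": [2, 3], "x": [30, 40]}),
}

parts = []
for sheet, value in dfs.items():
    # the per-sheet formatting goes here; .assign avoids mutating a slice
    value = value[value["ENDING"] != 0].assign(STORE=sheet)
    parts.append(value)

combined_df = pd.concat(parts, ignore_index=True)
```

The Alex sheet loses its ENDING == 0 row, both Elvin rows survive, and one concat at the end consolidates everything into a single frame.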

Python: Pandas dataframe - data overwritten instead of concatenated

I want to extract data from several .csv files and combine them into one big dataframe in pandas. To do this I created one dataframe that should be filled with the data of the incoming dataframes.
final_df = DataFrame(columns=['Column1', 'Column2', 'Column3'])
for file in glob.glob("file.csv"):
    name_csv = str(file)
    logfile = pd.read_csv(name_csv, skip_blank_lines=False)
    df = DataFrame(logfile, columns=['Column1', 'Column2', 'Column3'])
    concat = pd.concat([final_df, df])
However, with every iteration through the loop, the previously extracted data is overwritten. How can I solve this problem?
You are not using the result of pd.concat at all. The variable concat is just overwritten in each iteration, so it only ever holds a partial data frame.
You need to first append all the dfs to a list and then use concat:
Also an improvement to read_csv: logfile is already a df, so it's better to use the names parameter directly.
dfs = []
for file in glob.glob("*.csv"):
    logfile = pd.read_csv(str(file),
                          skip_blank_lines=False,
                          names=['Column1', 'Column2', 'Column3'])
    dfs.append(logfile)
concat = pd.concat(dfs)
Or use list comprehension:
dfs = [pd.read_csv(str(file),
                   skip_blank_lines=False,
                   names=['Column1', 'Column2', 'Column3'])
       for file in glob.glob("*.csv")]
concat = pd.concat(dfs)
You should create a list of the dfs and concat them all at the end:
concat_list = []
for file in glob.glob("file.csv"):
    name_csv = str(file)
    logfile = pd.read_csv(name_csv, skip_blank_lines=False)
    df = DataFrame(logfile, columns=['Column1', 'Column2', 'Column3'])
    concat_list.append(df)
final_df = pd.concat(concat_list)
