Python: Pandas dataframe - data overwritten instead of concatenated

I want to extract data from several .csv files and combine them into one big dataframe in pandas. To do this I created one dataframe that should be filled with the data of the incoming dataframes.
final_df = DataFrame(columns=['Column1','Column2','Column3'])
for file in glob.glob("file.csv"):
    name_csv = str(file)
    logfile = pd.read_csv(name_csv, skip_blank_lines = False)
    df = DataFrame(logfile, columns=['Column1','Column2','Column3'])
    concat = pd.concat([final_df,df])
However, with every iteration through the loop, the previously extracted data is overwritten. How can I solve this problem?

You are not using the result of pd.concat at all. The variable concat is just thrown away in each iteration, but it would be the partial data frame.

First append every df to a list, then call concat once on that list.
A small improvement to read_csv as well: logfile is already a DataFrame, so there is no need to wrap it in DataFrame again; pass the column names through the names parameter instead. Note that when names is given, the first line of each file is read as data unless you also pass header=0.
dfs = []
for file in glob.glob("*.csv"):
    logfile = pd.read_csv(str(file),
                          skip_blank_lines = False,
                          names = ['Column1','Column2','Column3'])
    dfs.append(logfile)
concat = pd.concat(dfs)
Or use a list comprehension:
dfs = [pd.read_csv(str(file),
                   skip_blank_lines = False,
                   names = ['Column1','Column2','Column3']) for file in glob.glob("*.csv")]
concat = pd.concat(dfs)
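
To check the list-then-concat pattern without touching real data, here is a minimal self-contained sketch. The file names (part_0.csv and so on) and the sample values are invented for illustration, and the files are written to a temporary directory so the example can run anywhere.

```python
import glob
import os
import tempfile

import pandas as pd

columns = ["Column1", "Column2", "Column3"]

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three small sample files (hypothetical data, one row each).
    for i in range(3):
        sample = pd.DataFrame([[i, i * 10, i * 100]], columns=columns)
        sample.to_csv(os.path.join(tmp_dir, f"part_{i}.csv"), index=False)

    # Collect one DataFrame per file, then concatenate once at the end.
    paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, ignore_index=True)

print(combined.shape)
```

Passing ignore_index=True gives the combined frame a fresh 0..n-1 index instead of repeating each file's own row numbers.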

You should create a list of the dfs and concat them all at the end:
concat_list = []
for file in glob.glob("file.csv"):
    name_csv = str(file)
    logfile = pd.read_csv(name_csv, skip_blank_lines = False)
    df = DataFrame(logfile, columns=['Column1','Column2','Column3'])
    concat_list.append(df)
final_df = pd.concat(concat_list)

Related

Python: use the arguments in enumerate as variable names for the nested loop

I would like to generate two data frames (and subsequently export them to CSV) from two CSV files. I came up with the following (incomplete) code, which focuses on dealing with a.csv. I create an empty data frame (df_a) to store rows from the iterrows iteration (df_b is missing).
The problem is that I do not know how to process b.csv without manually describing all the variables of the empty dataframes in advance (i.e. df_a = pd.DataFrame(columns=['start', 'end']) and df_b = pd.DataFrame(columns=['start', 'end'])).
I hope I can use the arguments of enumerate (i.e. the content of file) as variables (i.e. something like df_file) for the data frames (instead of df_a and df_b).
list_files = ["a.csv", "b.csv"]
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_a = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_a = pd.concat([df_a, df_dicts], ignore_index=True)
    df_a_csv = df_a.to_csv('df_a.csv')
Ideally, it could look a bit like this (note: file is used as a part of the variable name df_file):
list_files = ["a.csv", "b.csv"]
for i, file in enumerate(list_files):
    df = pd.read_csv(file)
    # Create empty data frame to store data for each iteration below
    df_file = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_file = pd.concat([df_file, df_dicts], ignore_index=True)
    df_file_csv = df_file.to_csv('df_' + file + '.csv')
Different approaches are also welcome. I just need to save the dataframe outcome for each input file. Many Thanks!

SomeFunction(var) aside, can you get the result you seek without pandas for the most part?
import csv
## -----------
## mocked
## -----------
def SomeFunction(var):
    # Placeholder: return a mapping with the two keys read below
    return {"column1": None, "column2": None}
## -----------
list_files = ["a.csv", "b.csv"]
for file_path in list_files:
    # Read every input row and collect the results
    with open(file_path, "r", newline="") as file_in:
        results = []
        for row in csv.DictReader(file_in):
            df_new = SomeFunction(row['name'])
            start, end = df_new['column1'], df_new['column2']
            results.append({"start": start, "end": end})
    # Write the results for this input file to df_<input name>
    with open(f"df_{file_path}", "w", newline="") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=["start", "end"])
        writer.writeheader()
        writer.writerows(results)
Note that you can also stream rows from the input to the output if you would rather not read them all into memory.
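
A rough sketch of that streaming variant follows, under a few assumptions: some_function is a hypothetical stand-in for SomeFunction, stream_file is an illustrative helper name, and the sample input is written to a temporary directory so the snippet runs on its own.

```python
import csv
import os
import tempfile


def some_function(name):
    # Hypothetical stand-in for SomeFunction: returns a mapping with the
    # two keys that the surrounding code reads.
    return {"column1": f"{name}_start", "column2": f"{name}_end"}


def stream_file(in_path, out_path):
    # Process one input row at a time and write its result immediately,
    # so the full result set is never held in memory.
    with open(in_path, newline="") as file_in, \
            open(out_path, "w", newline="") as file_out:
        writer = csv.DictWriter(file_out, fieldnames=["start", "end"])
        writer.writeheader()
        for row in csv.DictReader(file_in):
            result = some_function(row["name"])
            writer.writerow({"start": result["column1"], "end": result["column2"]})


with tempfile.TemporaryDirectory() as tmp_dir:
    in_path = os.path.join(tmp_dir, "a.csv")
    out_path = os.path.join(tmp_dir, "df_a.csv")
    # Sample input with a single "name" column (made-up values).
    with open(in_path, "w", newline="") as f:
        f.write("name\nx\ny\n")
    stream_file(in_path, out_path)
    with open(out_path, newline="") as f:
        output_rows = list(csv.DictReader(f))

print(output_rows)
```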

There are many things we could comment, but I understand that you are concerned about not having to specify the loop for a and for b, given that you already are doing it in list_files.
If this is the issue, what about doing something like this?
# CHANGED list only the stem of the base name, we will use them for many things
file_name_stems = ["a", "b"]
# CHANGED we save a dictionary for the dataframes
dataframes = {}
# CHANGED did you really need the enumerate?
for file_stem in file_name_stems:
    filename = file_stem + ".csv"
    df = pd.read_csv(filename)
    # Create empty data frame to store data for each iteration below
    # CHANGED let's use df_x as a generic name. Knowing your code, you will surely find better names
    df_x = pd.DataFrame(columns=['start', 'end'])
    for index, row in df.iterrows():
        var = df.loc[index, 'name']
        df_new = SomeFunction(var)
        # Append a new row to the empty data frame
        dicts = {'start': df_new['column1'], 'end': df_new['column2']}
        df_dicts = pd.DataFrame([dicts])
        df_x = pd.concat([df_x, df_dicts], ignore_index=True)
    # CHANGED and now, we print to the file
    df_x.to_csv(f'df_{file_stem}.csv')
    # CHANGED and save it to a dictionary in case you need it
    # (to_csv returns None when given a path, so store the dataframe itself)
    dataframes[file_stem] = df_x
So, instead of listing the exact filenames, you can list the stems of their names, and then compose both the source filename and the output filename from each stem.
Another option could be to list the source filenames and replace some part of the filename to generate the output filename:
list_files = ["a.csv", "b.csv"]
for filename in list_files:
    # ...
    output_file_name = filename.replace(".csv", "_df.csv")
    # this produces "a_df.csv" and "b_df.csv"
Does either of these solve your problem? :)
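
As a further variation, pathlib can derive the output name from the input name without string replacement. This is only a sketch; the df_ prefix follows the naming used in the question, and output_names is an illustrative variable.

```python
from pathlib import Path

list_files = ["a.csv", "b.csv"]

# Map each input file to an output name built from its stem
# (the file name without its suffix).
output_names = {name: f"df_{Path(name).stem}.csv" for name in list_files}
print(output_names)
```

Unlike str.replace, the stem only drops the final suffix, so a ".csv" that happens to appear in the middle of a file name is left alone.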

Select specific column from multiple csv files, then merge those columns into single file using pandas

I am trying to select a specific column, with the header "Average", from multiple csv files. Then take the "Average" column from each of those multiple csv files and merge them into a new csv file.
I left the comments in to show the other ways I tried to accomplish this:
procdir = r"C:\Users\ChromePnP\Desktop\exchange\processed"
collected = os.listdir(procdir)
flist = list(collected)
flist.sort()
#exclude first files in list
rest_of_files = flist[1:]
for f in rest_of_files:
    get_averages = pd.read_csv(f, usecols = ['Average'])
    #df1 = pd.DataFrame(f)
    # df2 = pd.DataFrame(rundata_file)
    #get_averages = pd.read_csv(f)
    #for col in ['Average']:
        #get_averages[col].to_csv(f_template)
    got_averages = pd.merge(get_averages, right_on = 'Average')
got_averages.to_csv("testfile.csv", index=False)
EDIT:
I was able to get the columns I wanted, and they will print. However now the saved file only has a single average column from the loop, instead of saving all the columns selected in the loop.
rest_of_files = flist[1:]
#f.sort()
print(rest_of_files)
for f in rest_of_files:
    get_averages = pd.read_csv(f)
    df1 = pd.DataFrame(get_averages)
    got_averages = df1.loc[:, ['Average']]
    print(got_averages)
f2_temp = pd.read_csv(rundata_file)
df2 = pd.DataFrame(f2_temp)
merge_averages = pd.concat([df2, got_averages], axis=1)
merge_averages.to_csv(rundata_file, index=False)

Either use pd.merge and pass both the left and the right dataframe:
got_averages = pd.merge(got_averages, get_averages, right_on = 'Average')
Or use the .merge method of the dataframe:
got_averages = got_averages.merge(get_averages, right_on = 'Average')
Note that right_on has to be paired with left_on (or left_index=True); when both frames share the column name, on = 'Average' is the simpler spelling.
Keep in mind that you need to initialize got_averages (as an empty dataframe, for instance) before using it in your for loop.
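
If the goal is only to place each file's Average column side by side, pd.concat with axis=1 may be simpler than merging. The sketch below uses that alternative technique rather than the merge calls; the in-memory frames and the names run1.csv and run2.csv are hypothetical stand-ins for what pd.read_csv would return.

```python
import pandas as pd

# Hypothetical stand-ins for the frames that pd.read_csv would return.
frames_by_file = {
    "run1.csv": pd.DataFrame({"Average": [1.0, 2.0], "Other": [9, 9]}),
    "run2.csv": pd.DataFrame({"Average": [3.0, 4.0], "Other": [8, 8]}),
}

# Pick the "Average" column of each frame and rename it after its source
# file, so the columns stay distinguishable after concatenation.
average_columns = [
    frame["Average"].rename(name) for name, frame in frames_by_file.items()
]

# axis=1 places the Series next to each other, aligned on the row index.
merged_averages = pd.concat(average_columns, axis=1)
print(merged_averages)
```

The result can then be written once, after the loop, with merged_averages.to_csv(..., index=False).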

Append Dataframe inside the loop - Python

I am trying to append to a dataframe inside a loop after reading each file, but the result still does not contain the full dataset.
columns = list(df)
data= []
for file in glob.glob("*.html"):
    df = pd.read_html(file)[2]
    zipped_date = zip(columns , df.values)
    a_dictionary = dict(zipped_date)
    data.append(a_dictionary)
full_df = full_df .append(data, False)

Maybe create a list of dataframes inside the loop and then concat them:
data = []
for file in glob.glob("*.html"):
    data.append( pd.read_html(file)[2] )
full_df = pd.concat(data, ignore_index=True)

Use pd.concat:
df = pd.concat([pd.read_html(file)[2] for file in glob.glob("*.html")])

dropping columns in multiple excel spreadsheets

Is there a way in Python to drop columns in multiple Excel files? I have a folder with several xlsx files. Each file has about 5 columns (date, value, latitude, longitude, region). I want to drop all columns except date and value in each Excel file.

Let's say you have a folder with multiple excel files:
from pathlib import Path
import pandas as pd
folder = Path('excel_files')
xlsx_only_files = list(folder.rglob('*.xlsx'))
def process_files(xls_file):
    # stem is a pathlib attribute that gives
    # just the filename, without the parent or the suffix
    filename = xls_file.stem
    # sheet_name=None ensures the data is read in as a dictionary
    # with the sheet name as the key
    # usecols allows you to read in only the relevant columns
    df = pd.read_excel(xls_file, usecols = ['date','value'], sheet_name = None)
    df_cleaned = [data.assign(sheetname=sheetname,
                              filename = filename)
                  for sheetname, data in df.items()
                  ]
    return df_cleaned
# process_files returns one list per workbook, so flatten before concatenating
combo = [frame for xlsx in xlsx_only_files for frame in process_files(xlsx)]
final = pd.concat(combo, ignore_index = True)
Let me know how it goes

I suggest defining the columns you want to keep as a list and then selecting them into a new dataframe.
# after opening the Excel file as
df = pd.read_excel(...)
keep_cols = ['date', 'value']
df = df[keep_cols] # keeps only the selected columns; the result is still a dataframe
df.to_excel(...)

Create a new dataframe out of dozens of df.sum() series

I have several pandas DataFrames of the same format, with five columns.
I would like to sum the values of each one of these dataframes using df.sum(). This will create a Series for each DataFrame, with one entry for each of the five columns.
My problem is how to take these Series and create another DataFrame, one column being the filename, the other columns being the five columns above from df.sum().
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = []
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    df['filename'] = str(filename)
    df = df.sum()
    newdf.append(df)
newdf = pd.concat(newdf, ignore_index=True)
This approach doesn't work, unfortunately. 'df['filename'] = str(filename)' throws a TypeError, and creating the new dataframe newdf doesn't parse correctly.
How would one do this correctly?
How do you take a number of pandas.Series objects and create a DataFrame?

Try in this order:
1. Create an empty list, say list_of_series.
2. For every file:
   - load it into a data frame, then save the sum in a series s
   - add an element to s: s['filename'] = your_filename
   - append s to list_of_series
3. Finally, concatenate (and transpose if needed):
final_df = pd.concat(list_of_series, axis = 1).T
Code
Preparation:
import glob
import numpy as np
import pandas as pd
l_df = [pd.DataFrame(np.random.rand(3,5), columns = list("ABCDE")) for _ in range(5)]
for i, df in enumerate(l_df):
    df.to_csv(str(i)+'.txt', index = False)
Files *.txt are comma separated and contain headers.
! cat 1.txt
A,B,C,D,E
0.18021800981245173,0.29919271590063656,0.09527248614484807,0.9672038093199938,0.07655003742768962
0.35422759068109766,0.04184770882952815,0.682902924462214,0.9400817219440063,0.8825581077493059
0.3762875793116358,0.4745731412494566,0.6545473610147845,0.7479829630649761,0.15641907539706779
And, indeed, the rest is quite similar to what you did (I add the file name to the summed series, not to the data frame; otherwise sum() would concatenate the file name once per row):
files = glob.glob('*.txt')
print(files)
['3.txt', '0.txt', '4.txt', '2.txt', '1.txt']
list_of_series = []
for f in files:
    df = pd.read_csv(f)
    s = df.sum()
    s['filename'] = f
    list_of_series.append(s)
final_df = pd.concat(list_of_series, axis = 1).T
print(final_df)
          A         B          C        D        E filename
0    1.0675   2.20957    1.65058  1.80515  2.22058    3.txt
1  0.642805   1.36248  0.0237625  1.87767  1.63317    0.txt
2   1.68678   1.26363   0.835245  2.05305  1.01829    4.txt
3   1.22748   2.09256   0.785089  1.87852  2.05043    2.txt
4  0.910733  0.815614    1.43272  2.65527  1.11553    1.txt

To answer this specific question:
"@ThomasTu How do I go from a list of Series with 'Filename' as a column to a dataframe? I think that's the problem---I don't understand this"
It's essentially what you have now, but instead of appending to an empty list, you grow a dataframe one row at a time. DataFrame.append has no inplace keyword and was removed in pandas 2.0, so reassign newdf with pd.concat on each iteration. Set the filename on the summed Series rather than on the dataframe, so that sum() only sees the numeric columns.
import pandas as pd
import glob
batch_of_dataframes = glob.glob("*.txt")
newdf = pd.DataFrame()
for filename in batch_of_dataframes:
    df = pd.read_csv(filename)
    # Sum first, then attach the file name to the resulting Series
    s = df.sum()
    s['filename'] = str(filename)
    # Turn the Series into a one-row frame and add it to the result
    newdf = pd.concat([newdf, s.to_frame().T], ignore_index=True)
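
Another way to go from many Series to one DataFrame is to collect the sums in a dict keyed by file name and build the frame in one step. This is a sketch under assumptions: the in-memory frames and the names 0.txt and 1.txt are hypothetical stand-ins for the files that pd.read_csv would load.

```python
import pandas as pd

# Hypothetical in-memory frames standing in for the files read with pd.read_csv.
frames = {
    "0.txt": pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}),
    "1.txt": pd.DataFrame({"A": [10.0, 20.0], "B": [30.0, 40.0]}),
}

# Each dict value is the Series returned by df.sum(); the dict keys become
# columns, so transposing gives one row per file.
sums = pd.DataFrame({name: df.sum() for name, df in frames.items()}).T

# Turn the file names (currently the index) into a regular column.
sums = sums.rename_axis("filename").reset_index()
print(sums)
```

Because the file name never passes through sum(), the numeric columns keep a numeric dtype.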
