Appending DataFrames in Loop - python

Goal: append DataFrames in a loop to get one combined dataframe.
df_base = pd.DataFrame(columns=df_col.columns)
file_path = 'DATA/'
filenames = ['savedrecs-23.txt', 'savedrecs-25.txt', 'savedrecs-24.txt']
For-Loop:
for file in filenames:
    path = file_path + file
    doc = codecs.open(path, 'rU', 'UTF-8')
    df_add = pd.read_csv(doc, sep='\t')
    res = df_base.append(df_add)
res.shape
Expected Outcome:
(15, 67) ; all three data frames merged into one dataframe
Current Outcome:
(5, 67) ; just returns the last dataframe in the loop.

res = df_base.append(df_add)
Pandas' append function does not modify the object it is called on. It returns a new object that contains the rows of the added dataframe appended onto the rows of the original dataframe.
Since you never modify df_base, your output is just the frame from the last file, appended to the empty df_base dataframe.
Note that the pandas documentation doesn't recommend iteratively appending dataframes together. Instead, "a better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once." (with an example given)
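That recommended pattern can be sketched as follows; the in-memory StringIO objects are stand-ins for the question's three tab-separated files (real code would open 'DATA/savedrecs-23.txt' and so on):

```python
import io
import pandas as pd

# Stand-ins for the three tab-separated files from the question.
files = [
    io.StringIO("a\tb\n1\t2\n"),
    io.StringIO("a\tb\n3\t4\n"),
    io.StringIO("a\tb\n5\t6\n"),
]

frames = []                                 # collect each file's frame in a plain list
for f in files:
    frames.append(pd.read_csv(f, sep='\t'))

res = pd.concat(frames, ignore_index=True)  # one concatenation at the end
```

With real files the loop body would be `frames.append(pd.read_csv(file_path + file, sep='\t'))`; the key point is that nothing is concatenated until after the loop.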

Related

Amending dataframe from a generator that reads multiple excel files

My question ultimately is - is it possible to amend inplace each dataframe of a generator of dataframes?
I have a series of excel files in a folder that each have a table in the same format. Ultimately I want to concatenate each file into one large dataframe. They all have unique column headers but share the same indices (historical dates, possibly across different time frames), so I want to concatenate the dataframes aligned by date. So I first created a generator function to create dataframes from each 'Data1' worksheet in the excel files:
all_files = glob.glob(os.path.join(path, "*"))
df_from_each_file = (pd.read_excel(f,'Data1') for f in all_files) #generator comprehension
The below code is the formatting that needs to be done to each dataframe so that I can concatenate them correctly in my final line. I changed the index to the date column but there are also some rows that contain data that is not relevant.
def format_ABS(df):
    df.drop(labels=range(0, 9), axis=0, inplace=True)
    df.set_index(df.iloc[:, 0], inplace=True)
    df.drop(df.columns[0], axis=1, inplace=True)
However this doesn't work when I place the function within a generator comprehension (as I am amending all the dataframes in place). The generator produced has no objects. Why doesn't the below line work? Is it because it can only loop through the generator once?
format_df = (format_ABS(x) for x in df_from_each_file)
but
format_ABS(next(df_from_each_file))
does work on each individual dataframe
The final product is then the below
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
I have gotten what I wanted by assigning index_col=0 in the pd.read_excel line, but it got me thinking about generators and amending dataframes in general.
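A minimal sketch of what is going on: a function that mutates in place returns None, so a comprehension over it collects Nones (and iterating it exhausts the source generator). The functions below are hypothetical stand-ins, using rename in place of the question's drop/set_index steps:

```python
import pandas as pd

def format_inplace(df):
    # stand-in for format_ABS: mutates df, returns None
    df.rename(columns={'x': 'y'}, inplace=True)

gen = (pd.DataFrame({'x': [i]}) for i in range(3))
results = [format_inplace(df) for df in gen]   # [None, None, None]; gen is now exhausted

def format_abs(df):
    # returning the frame lets a comprehension collect real objects
    return df.rename(columns={'x': 'y'})

frames = [format_abs(pd.DataFrame({'x': [i]})) for i in range(3)]
combined = pd.concat(frames, ignore_index=True)
```

So the fix for the generator version is to have the formatting function return the frame and consume the generator exactly once, feeding the results (not the exhausted generator) to pd.concat.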

Why appending a pandas DataFrame to a python list convert the resulting df in Series but assigning it works as expected?

I'm filtering a big dataframe in subsequent steps and want to temporarily store the filtered-out rows in a list so I can work with them later.
When I append the filtered dataframe to the list (i.e. temp.append(df[df.isna().any(axis=1)])), the item is stored as a pandas Series, while if I assign it to the same list it appears as a dataframe (as expected):
check = []
check[0] = pdo[pdo.isnull().any(axis=1)]
check.append(pdo[pdo.isnull().any(axis=1)])
type(check[0]), type(check[1])
Out: (pandas.core.frame.DataFrame, pandas.core.series.Series)
Is your full line of code the following?
temp = temp.append(df[df.isna().any(axis=1)])
#^^^^^^
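In other words, Python's list.append never changes the type of what it stores; if a Series is showing up, temp is most likely a DataFrame being reassigned as in the line above, not a list. A minimal check:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, None], 'b': [2.0, 3.0]})

temp = []                                # a plain Python list
temp.append(df[df.isna().any(axis=1)])   # list.append stores the object as-is

# still a DataFrame, never a Series
print(type(temp[0]))
```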

How to export a list of df to separate csv files

I am trying to export a list of pandas dataframes to individual csv files.
I have currently got this
import pandas as pd
import numpy as np

data = {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}
df = pd.DataFrame(data, columns=["a", "b"])
df = np.array_split(df, 3)
I have tried:
for i in df:
    i.to_csv((r'df.csv'))
However this doesn't output all the sub-dataframes, only the last one.
How do I get this to output all the df, with the outputted csv having the names df1.csv, df2.csv, and df3.csv?
It did output all three. It's just that the second one overwrote the first, and likewise with the last one. You have to write them out to three separate filenames.
To achieve this, we need to modify the string based on where we are in the loop. The easiest way to do this is with a counter on the loop. Since the variable 'i' is normally reserved for such counters, I'm going to rename your dummy variable to _df. Don't get confused by it. To get a counter in the loop, we use enumerate.
for i, _df in enumerate(df):
    print(i)
    filename = 'df' + str(i) + '.csv'
    _df.to_csv(filename)  # I think the extra parentheses are unnecessary?
edit: Just to note, the advantage of this over specifying all the filenames in a list is that you don't need to know the length of the list in advance. It is also helpful if you do know the length but it is big. If it's 3, you know it's 3, and that won't change, then you can specify the filenames as suggested elsewhere.
you can use floor division on the index then use a groupby to create your individual frames.
for data, group in df.groupby(df.index // 3):
    group.to_csv(f"df{data+1}.csv")
You're writing each data frame to the same file, 'df.csv'. In your for loop, you can specify both the dataframes to save and the files to save them to with zip().
>>> for i, outfile in zip(df, ["df1.csv", "df2.csv", "df3.csv"]):
...     i.to_csv(outfile)
You can do this particular task a number of ways. Here's a loop with enumerate() so you don't have to write the whole list of filenames.
>>> for j, frame in enumerate(df):
...     frame.to_csv(f"df{j+1}.csv")
Data is getting replaced in the same file with each iteration.
Try :
for i, value in enumerate(df):
    value.to_csv('/path/to/folder/df' + str(i) + '.csv')

How do i convert list into python dataframe

I have used a for loop for text extraction from images, and I am getting errors when converting the list into a pandas dataframe.
info = []
for item in dirs:
    if os.path.isfile(path + item):
        for a in x:
            img = Image.open(path + item)
            crop = img.crop(a)
            text = pytesseract.image_to_string(crop)
            info.append(text)
df = pd.DataFrame([info], colnames=['col1','col2'])
df
Expected result: data store in dataframe row wise.
Yes, the list is not a list of two items; I have 14 predefined columns.
Here it is another code
for i in range(len(info)):
    df.loc[i] = [info for n in range(14)]
Please check documentation for .DataFrame
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
The line in which you create your dataframe
df = pd.DataFrame([info], colnames=['col1','col2']
is missing a parenthesis at the end, uses colnames instead of columns, has unnecessary square brackets around your list, and creates two columns where you only need one.
Please mention the exact error
There are two problems here I think.
First, you are passing [info] to DataFrame although info is already a list; you can pass the list as it is.
Second, you are trying to build a DataFrame with two columns via colnames=['col1','col2'], but the keyword is columns, not colnames.
I think that's the problem. Your list is not a list of two-item lists (like [[a, b], [c, d]]). Just use:
df = pd.DataFrame(info, columns=['col1'])
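If the list really is a flat sequence with 14 values per image, as the comments suggest, it can be chunked into rows before constructing the frame. The values and the column count here are stand-ins taken from the question:

```python
import pandas as pd

# Toy stand-in for the OCR output: a flat list with 14 values per image.
info = [f'text{k}' for k in range(28)]   # 2 images x 14 fields
n_cols = 14                              # the 14 predefined columns from the question

# Slice the flat list into one 14-element row per image.
rows = [info[i:i + n_cols] for i in range(0, len(info), n_cols)]
df = pd.DataFrame(rows, columns=[f'col{j + 1}' for j in range(n_cols)])
```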

Alternative to concat for inserting records in a dataframe

I have a for loop of 90,000 iterations. Each iteration cooks a row and at the end of the loop, I want to have a dataframe with all 90K rows.
The way I am doing it now as follows - In each iteration, I store the row as a dataframe called 'sum_df' and use concat to insert each row into the dataframe called output_df. Like below -
output_df = pd.concat([output_df, sum_df], sort=False)
However, this concat call seems to be inefficient and is slowing down the execution. What is a better way to do this?
I store the row as a dataframe and use concat to insert each row
into the dataframe called output_df.
Your pre-processing is the cause of the inefficiency. Concatenating dataframes is expensive relative to appending to a list of lists. So do not store each row as a dataframe. Assuming you can convert your "row" into a single list:
LoL = []
for item in some_iterable:
    lst = func(item)  # func is a function which returns a list from item
    LoL.append(lst)   # append to list of lists
df = pd.DataFrame(LoL)  # construct dataframe from list of lists
Or more succinctly:
df = pd.DataFrame([func(item) for item in some_iterable])
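As a concrete illustration of the list-of-lists pattern, with func as a toy stand-in for whatever builds each row:

```python
import pandas as pd

def func(item):
    # toy stand-in for the per-iteration row-building logic
    return [item, item ** 2]

# one DataFrame construction at the end, no per-row concat
df = pd.DataFrame([func(i) for i in range(3)], columns=['x', 'x_squared'])
```

Constructing the frame once from plain Python lists avoids the quadratic cost of re-copying output_df on every one of the 90,000 iterations.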
