Concatenate pandas DataFrames generated with a loop - python

I am creating a new DataFrame named data_day, containing new features, for each day extrapolated from the day-timestamp of a previous DataFrame df.
My new data_day DataFrames are 30 independent DataFrames that I need to concatenate/append at the end into a single DataFrame (final_data_day).
The for loop over the days is defined as follows:
num_days=len(list_day)
#list_day= random.sample(list_day,num_days_to_simulate)
data_frame = pd.DataFrame()
for i, day in enumerate(list_day):
    print('*** ',day,' ***')
    data_day=df[df.day==day]
    .....................
final_data_day = pd.concat()
Hope I was clear. Mine is basically a problem of append/concatenation of data-frames generated in a non-trivial for loop

Pandas concat takes a list of dataframes. If you can generate a list of dataframes with your looping function, once you are finished you can concatenate the list together:
data_day_list = []
for i, day in enumerate(list_day):
    data_day = df[df.day==day]
    data_day_list.append(data_day)
final_data_day = pd.concat(data_day_list)
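A minimal runnable sketch of this pattern. The toy `df`, the `value` column and the `value_doubled` feature are assumptions made up for illustration; they stand in for the real per-day processing:

```python
import pandas as pd

# Toy stand-in for the original df; every column except 'day' is an assumption.
df = pd.DataFrame({
    "day": ["mon", "mon", "tue", "wed", "wed"],
    "value": [1, 2, 3, 4, 5],
})
list_day = ["mon", "tue", "wed"]

data_day_list = []
for day in list_day:
    # .copy() so that adding columns does not touch the original df
    data_day = df[df["day"] == day].copy()
    # Hypothetical "new feature" computed for this day
    data_day["value_doubled"] = data_day["value"] * 2
    data_day_list.append(data_day)

# One concat at the end instead of one per iteration
final_data_day = pd.concat(data_day_list, ignore_index=True)
print(final_data_day)
```

Passing `ignore_index=True` is optional; it simply renumbers the rows of the combined frame from 0.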

Exhausting a generator is more elegant (if not more efficient) than appending to a list. For example:
def yielder(df, list_day):
    for i, day in enumerate(list_day):
        yield df[df['day'] == day]
final_data_day = pd.concat(list(yielder(df, list_day)))
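A self-contained sketch of the generator variant on a toy `df` (the data is invented for illustration). Each sub-DataFrame keeps its original row labels unless `ignore_index=True` is passed to `pd.concat`:

```python
import pandas as pd

# Toy data; only the 'day' column name comes from the question.
df = pd.DataFrame({"day": [1, 1, 2, 3], "x": [10, 20, 30, 40]})
list_day = [1, 3]

def yielder(df, list_day):
    # Lazily produce one sub-DataFrame per requested day
    for day in list_day:
        yield df[df["day"] == day]

# The generator is exhausted into a list, then concatenated once
final_data_day = pd.concat(list(yielder(df, list_day)))
print(final_data_day)
```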

Repeatedly appending or concatenating pd.DataFrames inside a loop is slow. You can collect the rows in a list in the interim and then create the final pd.DataFrame at the end with pd.DataFrame.from_records(), e.g.:
interim_list = []
for i, (k, g) in enumerate(df.groupby('*name of your date column here*')):
    if i % 1000 == 0 and i != 0:
        print('iteration: {}'.format(i))  # just tells you where you are in iteration
    # add your "new features" here...
    for v in g.values:
        interim_list.append(v)
# here you want to specify the resulting df's column list...
df_final = pd.DataFrame.from_records(interim_list, columns=['a','list','of','columns'])

Related

Iterate through different dataframes and apply a function to each one

I have 4 different dataframes containing time series data that all have the same structure.
My goal is to take each individual dataframe and pass it through a function I have defined that will group them by datestamp, sum the columns and return a new dataframe with the columns I want. So in total I want 4 new dataframes that have only the data I want.
I just looked through this post:
Loop through different dataframes and perform actions using a function
but applying this did not change my results.
Here is my code:
I am putting the dataframes in a list so I can iterate through them
dfs = [vds, vds2, vds3, vds4]
This is my function I want to pass each dataframe through:
def VDS_pre(df):
    df = df.groupby(['datestamp','timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date','timestamp':'Time','det_vol': 'VolumeVDS'})
    df = df[['Date','Time','VolumeVDS']]
    return df
This is the loop I made to iterate through my dataframe list and pass each one through my function:
for df in dfs:
    df = VDS_pre(df)
However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did. Thanks for the help!
"However once I go through my loop and go to print out the dataframes, they have not been modified and look like they initially did."
Yes, that is exactly what happens. The reason they have not been modified is:
Assigning to the loop variable inside a for item in lst: loop only rebinds the name item. It affects neither lst nor the variables from which the items of lst got their values, as the following code demonstrates:
v1=1; v2=2; v3=3
lst = [v1,v2,v3]
for item in lst:
    item = 0
print(lst, v1, v2, v3) # gives: [1, 2, 3] 1 2 3
To achieve the result you expect to obtain you can use a list comprehension and the list unpacking feature of Python:
vds,vds2,vds3,vds4=[VDS_pre(df) for df in [vds,vds2,vds3,vds4]]
or the following code, which uses a list of strings holding the variable names of the dataframes (it works at module level, but exec and eval are generally best avoided):
sdfs = ['vds', 'vds2', 'vds3', 'vds4']
for sdf in sdfs:
    exec(f'{sdf} = VDS_pre(eval(sdf))')
Now printing vds, vds2, vds3 and vds4 will output the modified dataframes.
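If the frames do not have to live in separate variables, a dictionary keyed by name avoids both the unpacking and `exec`. This is only a sketch: `VDS_pre` is taken from the question, while `make_frame` and the toy data are hypothetical helpers invented for the example:

```python
import pandas as pd

def VDS_pre(df):
    # Same processing as in the question
    df = df.groupby(['datestamp', 'timestamp']).sum().reset_index()
    df = df.rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
    return df[['Date', 'Time', 'VolumeVDS']]

def make_frame(offset):
    # Hypothetical helper that builds a small frame with the question's columns
    return pd.DataFrame({
        'datestamp': ['2020-01-01', '2020-01-01', '2020-01-02'],
        'timestamp': ['00:00', '00:00', '00:00'],
        'det_vol': [1 + offset, 2 + offset, 3 + offset],
    })

frames = {'vds': make_frame(0), 'vds2': make_frame(10)}

# Rebuild the dictionary so every name now maps to its processed frame
frames = {name: VDS_pre(frame) for name, frame in frames.items()}
print(frames['vds'])
```

Each processed frame is then available as `frames['vds']`, `frames['vds2']`, and so on.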
Most pandas frame operations return a new copy of the data. Your snippet stores the result in the loop variable df, which is never written back to your initial list. This is why nothing is stored after execution.
If you don't need to keep the original frames, you can simply overwrite them:
for i, df in enumerate(dfs):
    dfs[i] = VDS_pre(df)
If you do need them, use a second list and append the results to it:
l = []
for df in dfs:
    df2 = VDS_pre(df)
    l.append(df2)
Or, even better, use a list comprehension to rewrite this snippet as a single line of code.
Either way, the result of your processing is now stored.
Additionally, if your frames have the same structure and can be merged into a single frame, you may consider concatenating them first and then applying your function to the result. That would be the idiomatic pandas way.
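A sketch of the "concatenate first, then process once" idea. The `source` label and the toy data are assumptions added for illustration; the label keeps the rows of each original frame distinguishable so that the grouping does not mix them:

```python
import pandas as pd

# Toy frames with the question's column names; the values are invented.
vds = pd.DataFrame({'datestamp': ['d1', 'd1'], 'timestamp': ['t1', 't1'], 'det_vol': [1, 2]})
vds2 = pd.DataFrame({'datestamp': ['d1', 'd2'], 'timestamp': ['t1', 't1'], 'det_vol': [5, 7]})

# Tag each row with the frame it came from, then stack everything once
combined = pd.concat([vds, vds2], keys=['vds', 'vds2'], names=['source', 'row'])
combined = combined.reset_index(level='source')

# A single groupby over the combined frame replaces the per-frame loop
result = (
    combined.groupby(['source', 'datestamp', 'timestamp'], as_index=False)['det_vol']
    .sum()
    .rename(columns={'datestamp': 'Date', 'timestamp': 'Time', 'det_vol': 'VolumeVDS'})
)
print(result)
```

If separate frames are still needed afterwards, they can be recovered by filtering on `source`.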

Efficient way to append dataframes below each other

I have the following part of code:
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:
Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
Stacking these DataFrames below each other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis = 1, ignore_index = True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea what I should try?
First of all, avoid using pd.concat to build the main dataframe inside a for loop: every call copies all the data accumulated so far, so the total cost grows quadratically with the number of iterations. The problem you are facing is that pd.concat should receive a list of Series or DataFrames, but you are passing [df_final, concat], which is a list containing two elements: one DataFrame and one list of Series. Finally, it seems you want to stack the per-batch results vertically, so the outer concatenation should use axis=0, not axis=1.
Therefore, I suggest the following:
df_final = []
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis = 1, ignore_index = True)
    df_final.append(concat)
df_final = pd.concat(df_final, axis = 0, ignore_index = True)
return df_final
Note that pd.concat receives a list of dataframes and not a list that contains a list inside of it! Also, this approach is way faster since the pd.concat inside the for loop doesn't get bigger every iteration :)
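One caveat: with `axis=1`, `ignore_index=True` relabels the columns 0 to 3, so the statistic names are lost. The sketch below is a variation that keeps the names by using named aggregation per batch. The `chunk` helper is a hypothetical stand-in for the asker's function, the data is invented, and the request count is taken as the group size:

```python
import pandas as pd

def chunk(df, n):
    # Hypothetical stand-in for the asker's chunk(): yields n rows at a time
    for start in range(0, len(df), n):
        yield df.iloc[start:start + n]

# Toy log data using the column names from the question
df = pd.DataFrame({
    'clientip': ['a', 'a', 'b', 'a', 'b', 'b'],
    'name': ['ua1', 'ua2', 'ua1', 'ua1', 'ua1', 'ua3'],
    'bytes': [100, 300, 50, 10, 20, 40],
    'response': [200, 404, 200, 403, 500, 404],
})

parts = []
for batch in chunk(df, 3):
    stats = batch.groupby('clientip').agg(
        unique_request=('bytes', 'size'),       # rows (requests) per client
        unique_ua=('name', 'nunique'),          # distinct user agents per client
        reply_length_avg=('bytes', 'mean'),     # mean reply length per client
        response4xx=('response', lambda s: s.astype(str).str.startswith('4').sum()),
    )
    parts.append(stats.reset_index())           # keep clientip as a regular column

# Single vertical concat after the loop
df_final = pd.concat(parts, axis=0, ignore_index=True)
print(df_final)
```

Because each batch is aggregated independently, a client that appears in several batches gets one row per batch, exactly as in the loop above.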
I hope it helps!

Pandas/Python append or concat dataframes being created in a for loop

I have a process which I am able to loop through for values held in a list, but it overwrites the final dataframe with each loop and I would like to append or concat the result of the loops into one dataframe.
For example given below I can see 'dataframe' will populate initially with result of 'blah1', then when process finishes it has the result of 'blah2'
listtoloop = ['blah1','blah2']
for name in listtoloop:
    some process happens here resulting in
    dataframe = result of above process
The typical pattern used for this is to create a list of DataFrames, and only at the end of the loop, concatenate them into a single DataFrame. This is usually much faster than appending new rows to the DataFrame after each step, as you are not constructing a new DataFrame on every iteration.
Something like this should work:
listtoloop = ['blah1','blah2']
dfs = []
for name in listtoloop:
    # some process happens here resulting in
    # dataframe = result of above process
    dfs.append(dataframe)
final = pd.concat(dfs, ignore_index=True)
Put your results in a list and then append the list to the df as a new row, making sure the list is in the same order as the df's columns:
listtoloop = ['blah1','blah2']
df = pd.DataFrame(columns=["A","B"])
for name in listtoloop:
    ## processes here
    to_append = [5, 6]
    df_length = len(df)
    df.loc[df_length] = to_append
data_you_need=pd.DataFrame()
listtoloop = ['blah1','blah2']
for name in listtoloop:
    ## some process happens here resulting in
    ## dataframe = result of above process
    # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
    data_you_need = pd.concat([data_you_need, dataframe], ignore_index=True)

My dataframe has many (192) columns. How to select two columns at time?

My dataframe is like df.columns= ['Time1','Pmpp1','Time2',..........,'Pmpp96']. I want to select two successive columns at a time, for example Time1 and Pmpp1 together.
My code is:
for i,j in zip(df.columns,df.columns[1:]):
    print(i,j)
My present output is:
Time1 Pmpp1
Pmpp1 Time2
Time2 Pmpp2
Expected output is:
Time1 Pmpp1
Time2 Pmpp2
Time3 Pmpp3
You're zipping the list with the same list starting from the second element, which is not what you want. You want to zip the elements at even positions (0, 2, ...) with those at odd positions (1, 3, ...). For example, you could replace your code with:
for i, j in zip(df.columns[::2], df.columns[1::2]):
    print(i, j)
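A small runnable sketch of the same slicing on a toy frame (the values are invented), which also shows how each pair of names can be used to select a two-column sub-DataFrame:

```python
import pandas as pd

# Toy frame with the alternating Time/Pmpp layout from the question
df = pd.DataFrame({
    'Time1': [0, 1], 'Pmpp1': [10.0, 11.0],
    'Time2': [0, 1], 'Pmpp2': [20.0, 21.0],
})

# Columns at even positions paired with columns at odd positions
pairs = list(zip(df.columns[::2], df.columns[1::2]))
print(pairs)

for time_col, pmpp_col in pairs:
    sub = df[[time_col, pmpp_col]]  # two-column DataFrame for this pair
    print(sub)
```

This relies on the columns strictly alternating; if the order is not guaranteed, matching by name is safer.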
As an alternative to integer positional slicing, you can use str.startswith to create 2 index objects. Then use zip to iterate over them pairwise:
df = pd.DataFrame(columns=['Time1', 'Pmpp1', 'Time2', 'Pmpp2', 'Time3', 'Pmpp3'])
times = df.columns[df.columns.str.startswith('Time')]
pmpps = df.columns[df.columns.str.startswith('Pmpp')]
for i, j in zip(times, pmpps):
    print(i, j)
Time1 Pmpp1
Time2 Pmpp2
Time3 Pmpp3
In this kind of scenario, it might make sense to reshape your DataFrame. So instead of selecting two columns at a time, you have a DataFrame with the two columns that ultimately represent your measurements.
First, you make a list of DataFrames, where each one only has a Time and Pmpp column:
dfs = []
for i in range(1,97):
    tmp = df[['Time{0}'.format(i),'Pmpp{0}'.format(i)]].copy() # Copy so the slice can be modified safely
    tmp.columns = ['Time', 'Pmpp'] # Standardize column names
    tmp['n'] = i # Remember measurement number
    dfs.append(tmp) # Keep with our cleaned dataframes
Then you can join them together into a new DataFrame that has three columns.
new_df = pd.concat(dfs, ignore_index=True, sort=False)
This should be a much more manageable shape for your data.
>>> new_df.columns
[Time, Pmpp, n]
Now you can iterate through the rows in this DataFrame and get the values for your expected output:
for i, row in new_df.iterrows():
    print(i, row.n, row.Time, row.Pmpp)
It also will make it easier to use the rest of pandas to analyze your data.
new_df.Pmpp.mean()
new_df.describe()
After a series of trials, I got it. My code is given below:
for a in range(0,len(df.columns),2):
    print(df.columns[a],df.columns[a+1])
My output is:
DateTime A016.Pmp_ref
DateTime.1 A024.Pmp_ref
DateTime.2 A040.Pmp_ref
DateTime.3 A048.Pmp_ref
DateTime.4 A056.Pmp_ref
DateTime.5 A064.Pmp_ref
DateTime.6 A072.Pmp_ref
DateTime.7 A080.Pmp_ref
DateTime.8 A096.Pmp_ref
DateTime.9 A120.Pmp_ref
DateTime.10 A124.Pmp_ref
DateTime.11 A128.Pmp_ref

Pandas append() error with two dataframes

When I try to append two or more dataframes and write the result to a CSV, the output comes out in a staggered, waterfall-like format.
dataset = pd.read_csv('testdata.csv')
for i in segment_dist:
    for j in step:
        print_msg = str(i) + ":" + str(j)
        print("\n",i,":",j,"\n")
        temp = pd.DataFrame(estimateRsq(dataset,j,i),columns=[print_msg])
        csv = csv.append(temp)
csv.to_csv('output.csv',encoding='utf-8', index=False)
estimateRsq() returns an array. I think this code snippet should be enough to help me out.
The format I am getting in output.csv is:
Please help: how can I shift the contents up so that they start from index 1?
From df.append documentation:
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
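If you want to add columns to the right, use pd.concat with axis=1 (i.e. horizontally):
list_of_dfs = [first_df, second_df, ...]
pd.concat(list_of_dfs, axis=1)
Note that with axis=1 the rows are aligned on the index, so if the indexes of the dataframes don't match, reset them first (for example with reset_index(drop=True)). Passing ignore_index=True here would only relabel the columns 0, 1, ..., not realign the rows.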
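A small sketch of how mismatched row labels produce a staggered result with `axis=1`, and how resetting each index first lines the columns up. The two frames and their `'1:1'` / `'1:2'` column names are invented to mimic the question's per-combination columns:

```python
import pandas as pd

# Two single-column frames whose rows carry different index labels
first_df = pd.DataFrame({'1:1': [0.1, 0.2]}, index=[0, 1])
second_df = pd.DataFrame({'1:2': [0.3, 0.4]}, index=[2, 3])

# Rows are aligned by label, so non-matching labels are padded with NaN
staggered = pd.concat([first_df, second_df], axis=1)

# Resetting each index first places the columns side by side from row 0
aligned = pd.concat(
    [d.reset_index(drop=True) for d in [first_df, second_df]], axis=1
)
print(staggered)
print(aligned)
```

Writing `aligned` with `to_csv(..., index=False)` then gives one column per combination with no empty leading cells.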
Build a list of dataframes, then concatenate
pd.DataFrame.append is expensive relative to list.append + a single call of pd.concat.
Therefore, you should aggregate to a list of dataframes and then use pd.concat on this list:
lst = []
for i in segment_dist:
    # do something
    temp = pd.DataFrame(...)
    lst.append(temp)
df = pd.concat(lst, ignore_index=True, axis=0)
df.to_csv(...)
