Efficient way to append dataframes below each other - python

I have the following piece of code:
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
where I am extracting some values based on some columns of the DataFrame batch. Since the initial DataFrame df can be quite large, I need to find an efficient way of doing the following:
Putting together the results of the for loop in a new DataFrame with columns unique_request, unique_ua, reply_length_avg and response4xx at each iteration.
Stacking these DataFrames below each other at each iteration.
I tried to do the following:
df_final = pd.DataFrame()
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = [unique_request, unique_ua, reply_length_avg, response4xx]
    df_final = pd.concat([df_final, concat], axis=1, ignore_index=True)
return df_final
But I am getting the following error:
TypeError: cannot concatenate object of type '<class 'list'>'; only Series and DataFrame objs are valid
Any idea what I should try?

First of all, avoid using pd.concat to grow the main dataframe inside a for loop, as it gets quadratically slower: each iteration re-copies all the rows accumulated so far. The problem you are facing is that pd.concat should receive a list of Series/DataFrames, but you are passing [df_final, concat], which is a list containing two elements: one dataframe and one list of Series. Finally, since you want to stack the results vertically, axis should be 0, not 1.
Therefore, I suggest you to do the following:
df_final = []
for batch in chunk(df, n):
    unique_request = batch.groupby('clientip')['clientip'].count()
    unique_ua = batch.groupby('clientip')['name'].nunique()
    reply_length_avg = batch.groupby('clientip')['bytes'].mean()
    response4xx = batch.groupby('clientip')['response'].apply(lambda x: x.astype(str).str.startswith(str(4)).sum())
    concat = pd.concat([unique_request, unique_ua, reply_length_avg, response4xx], axis=1, ignore_index=True)
    df_final.append(concat)
df_final = pd.concat(df_final, axis=0, ignore_index=True)
return df_final
Note that pd.concat receives a flat list of Series/DataFrames, not a list with another list nested inside it! This approach is also much faster, since the concat that builds df_final runs only once, after the loop, instead of on an ever-growing frame at every iteration :)
I hope it helps!
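As a side note, here is a toy sketch (with made-up data) of why the two patterns differ: growing a frame with pd.concat inside the loop re-copies everything accumulated so far on every iteration, while collecting the pieces in a list copies each row only once at the end.

import numpy as np
import pandas as pd

# toy pieces standing in for the per-batch results
parts = [pd.DataFrame(np.random.rand(100, 4)) for _ in range(500)]

# quadratic: each iteration copies all previously accumulated rows
slow = pd.DataFrame()
for p in parts:
    slow = pd.concat([slow, p], ignore_index=True)

# linear: one copy at the end
fast = pd.concat(parts, ignore_index=True)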

Related

Concatenate two pandas dataframe and follow a sequence of uid

I have two pandas dataframes with the following data (in CSV):
#list1
poke_id,symbol
0,BTC
1,ETB
2,USDC
#list2
5,SOL
6,XRP
I am able to concatenate them into one dataframe using the following code:
df = pd.concat([df1, df2], ignore_index = True)
df = df.reset_index(drop = True)
df['poke_id'] = df.index
df = df[['poke_id','symbol']]
which gives me the output (in CSV):
poke_id,symbol
0,BTC
1,ETB
2,USDC
3,SOL
4,XRP
Is there any other way to do the same? Re-indexing the whole dataframe of ~4000 entries just to add ~100 more seems pointless and cumbersome. How can I make it pick the highest poke_id from list 1 (dataframe 1) and simply continue numbering (i + 1) for the entries in list 2?
Your solution is good; it is possible to simplify it:
df = pd.concat([df1, df2], ignore_index = True).rename_axis('poke_id').reset_index()
You can also use indexes to pick out the data you want from the dataframe; this method lets you take specific slices of the dataframe, although it is not efficient when you need large amounts of data.
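If you literally want to continue numbering from the highest existing poke_id instead of rebuilding the whole index, here is a minimal sketch (assuming df1 holds the ~4000 existing rows and df2 the ~100 new ones, as in the question):

start = df1['poke_id'].max() + 1
df2['poke_id'] = range(start, start + len(df2))  # renumber only the new rows
df = pd.concat([df1, df2], ignore_index=True)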

Write two dataframes to two different columns in CSV

I converted two arrays into two dataframes and would like to write them to a CSV file as two separate columns. There are no common columns in the dataframes. I tried the solutions below, as well as others from Stack Exchange, but did not get the desired result. Solution 2 raises no error, but it prints all the data into one column. I am guessing there is a problem with how the arrays are converted to dataframes? I basically want the two columns of values, Frequency and PSD, exported to CSV. How do I do that?
Solution 1:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_BP_frq['tmp'] = 1
df_BP_psd['tmp'] = 1
df_500 = pd.merge(df_BP_frq, df_BP_psd, on=['tmp'], how='outer')
df_500 = df_500.drop('tmp', axis=1)
Error: Unable to allocate 2.00 TiB for an array with shape (274870566961,) and data type int64
Solution 2:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_500 = df_BP_frq.merge(df_BP_psd, left_on='Frequency', right_on='PSD', how='outer')
No Error.
Result: The PSD values are all 0 and are seen below the frequency values in the lower rows.
Solution 3:
df_BP_frq = pd.DataFrame(freq_BP[L_BP], columns=['Frequency'])
df_BP_psd = pd.DataFrame(PSDclean_BP[L_BP], columns=['PSD'])
df_500 = pd.merge(df_BP_frq, df_BP_psd, on='tmp').ix[:, ('Frequency','PSD')]
Error: KeyError: 'tmp'
Exporting to csv using:
df_500.to_csv("PSDvalues500.csv", index = False, sep=',', na_rep = 'N/A', encoding = 'utf-8')
You can directly store the arrays as columns of the dataframe. If the lengths of both arrays are the same, the following method works:
df_500 = pd.DataFrame()
df_500['Frequency'] = freq_BP[L_BP]
df_500['PSD'] = PSDclean_BP[L_BP]
If the lengths of the arrays are different, convert them to Series and then add them as columns, as follows. This fills the empty cells in the dataframe with NaN:
df_500 = pd.DataFrame()
df_500['Frequency'] = pd.Series(freq_BP[L_BP])
df_500['PSD'] = pd.Series(PSDclean_BP[L_BP])
From your question, what I understood is that you have two arrays that you want to store as separate columns of one dataframe, and then save that dataframe to CSV.
Creating two NumPy arrays of equal length:
import numpy as np
import pandas as pd

n1 = np.arange(2, 100, 0.01)
n2 = np.arange(3, 101, 0.01)
Creating an empty dataframe and storing the above arrays as its columns:
n = pd.DataFrame()
n['feq'] = n1
n['psd'] = n2
Storing into CSV:
n.to_csv(r"C:\...\dataframe.csv", index=False)
If the arrays are of unequal length, convert them to Series first and then store them in the empty dataframe.
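Alternatively, since the question already builds the two single-column frames, a one-line sketch: pd.concat along axis=1 places them side by side and pads the shorter one with NaN.

df_500 = pd.concat([df_BP_frq, df_BP_psd], axis=1)
df_500.to_csv("PSDvalues500.csv", index=False, na_rep='N/A')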

Concatenate two dataframes with different row indices

I want to concatenate two data frames of the same length, by adding a column to the first one (df).
But because certain df rows are being filtered, it seems the index isn't matching.
import io
import pandas as pd

df = pd.read_csv(io.StringIO(uploaded['customer.csv'].decode('utf-8')), sep=";")
df["Margin"] = df["Sales"] - df["Cost"]
df = df.loc[df["Margin"] > -100000]
df = df.loc[df["Sales"] > 1000]
df.reindex()
df
This returns a dataframe that keeps the original, now non-consecutive, index of the filtered rows. So this operation:
customerCluster = pd.concat([df, clusters], axis = 1, ignore_index= True)
print(customerCluster)
is returning misaligned rows padded with NaN, because concat aligns on the index.
So I've tried reindex and the argument ignore_index=True, as you can see in the code snippets above.
Thanks for all the answers. If anyone encounters the same problem, the solution I found was this:
customerID = df["CustomerID"]
customerID = customerID.reset_index(drop=True)
df = df.reset_index(drop=True)
So, basically, the indexes of both data frames are now matching, thus:
customerCluster = pd.concat((customerID, clusters), axis = 1)
This will concatenate correctly the two data frames.
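Equivalently, the resets can be folded into the concat call in one line (note that df.reindex() without arguments just returns a copy, which is why it had no effect earlier):

customerCluster = pd.concat([df.reset_index(drop=True), clusters.reset_index(drop=True)], axis=1)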

pandas apply function to each group (output is not really an aggregation)

I have a list of time series (one pandas dataframe per device) and want to calculate the matrix profile for each device's time series.
One option is to iterate over all the devices, which seems slow.
A second option would be to group by device and apply a UDF. The problem is that the UDF returns rows 1:1, i.e. not a single scalar value per group: the output has the same number of rows as the input.
Is it still possible to somehow vectorize this calculation for each group when 1:1 (or at least non-scalar) values are returned?
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)
print('***************************')

# slow version retaining all the rows
for g in df.bar.unique():
    print(g)
    this_group = df[df.bar == g]
    # perform a UDF which needs to have all the values per group
    # i.e. for real I want to calculate the matrixprofile for each time-series of a device
    this_group['result'] = this_group.baz.apply(lambda x: 1)
    display(this_group)

print('***************************')

def my_non_scalar1_1_agg_function(x):
    display(pd.DataFrame(x))
    return x

# neatly vectorized application of a non_scalar function
# but this fails as: Must produce aggregated value
df = df.groupby(['bar']).baz.agg(my_non_scalar1_1_agg_function)
display(df)
display(df)
For functions applied to each distinct group that return non-scalar values (i.e., not aggregations), you need to iterate the method across the groups and then compile the results together.
Therefore, consider a list or dict comprehension over groupby(), followed by concat. Be sure the method accepts and returns a full data frame, series, or ndarray.
# LIST COMPREHENSION
df_list = [ myfunction(sub) for index, sub in df.groupby(['group_column']) ]
final_df = pd.concat(df_list)
# DICT COMPREHENSION
df_dict = { index: myfunction(sub) for index, sub in df.groupby(['group_column']) }
final_df = pd.concat(df_dict, ignore_index=True)
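For example, with the question's toy frame, where myfunction is a hypothetical 1:1 UDF standing in for the real matrix-profile computation:

def myfunction(sub):
    sub = sub.copy()
    sub['result'] = sub['baz'].rank()  # placeholder 1:1 transformation
    return sub

final_df = pd.concat([myfunction(sub) for _, sub in df.groupby('bar')])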
Indeed this (see also the link above in the comments) is a way to get it to work faster / in the desired way. Perhaps there is an even better alternative:
import pandas as pd

df = pd.DataFrame({
    'foo': [1, 2, 3], 'baz': [1.1, 0.5, 4], 'bar': [1, 2, 1]
})
display(df)

grouped_df = df.groupby(['bar'])
altered = []
for index, subframe in grouped_df:
    display(subframe)
    subframe = subframe  # obviously we need to apply the UDF here - not the idempotent operation (= doing nothing)
    altered.append(subframe)
    print(index)
    # print(subframe)
pd.concat(altered, ignore_index=True)
# pd.DataFrame(altered)
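For what it's worth, groupby().apply() also accepts a UDF that returns a full DataFrame per group and concatenates the per-group results automatically; a minimal sketch with the same placeholder UDF:

def udf(group):
    group = group.copy()
    group['result'] = 1  # replace with the real matrix-profile computation
    return group

out = df.groupby('bar', group_keys=False).apply(udf)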

Pandas append() error with two dataframes

When I try to append two or more dataframes and output the result to a CSV, the output comes out in a waterfall (staircase) format.
dataset = pd.read_csv('testdata.csv')
for i in segment_dist:
    for j in step:
        print_msg = str(i) + ":" + str(j)
        print("\n", i, ":", j, "\n")
        temp = pd.DataFrame(estimateRsq(dataset, j, i), columns=[print_msg])
        csv = csv.append(temp)
csv.to_csv('output.csv', encoding='utf-8', index=False)
estimateRsq() returns an array. I think this code snippet should be enough to help me out.
In output.csv, each new column's values start below the previous column's instead of in the same rows. Please help: how can I shift the contents up so every column starts from the first row?
From df.append documentation:
Append rows of other to the end of this frame, returning a new
object. Columns not in this frame are added as new columns.
If you want to add the columns side by side, use pd.concat with axis=1 (meaning horizontally):
list_of_dfs = [first_df, second_df, ...]
pd.concat(list_of_dfs, axis=1)
You may want to add parameter ignore_index=True if indexes in dataframes don't match.
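Adapted to the question's loop (assuming estimateRsq, segment_dist, step, and dataset from the post), a sketch that collects each column in a list and concatenates once horizontally:

cols = []
for i in segment_dist:
    for j in step:
        name = str(i) + ":" + str(j)
        cols.append(pd.DataFrame(estimateRsq(dataset, j, i), columns=[name]))

pd.concat(cols, axis=1).to_csv('output.csv', encoding='utf-8', index=False)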
Build a list of dataframes, then concatenate
pd.DataFrame.append is expensive relative to list.append + a single call of pd.concat.
Therefore, you should accumulate the dataframes in a list and then call pd.concat once on that list:
lst = []
for i in segment_dist:
    # do something
    temp = pd.DataFrame(...)
    lst.append(temp)

df = pd.concat(lst, ignore_index=True, axis=0)
df.to_csv(...)
