Pandas: Calculate mean, var of similar columns grouped together - python

Trying to do an analysis of network trace data using pandas. I have read the dump file and created the following DataFrame:
So to detect the individual flows in the DataFrame data2, I have grouped the entire DataFrame according to ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service'] using the following piece of code:
flow = ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service']
grp1 = data2.groupby(flow, sort=False)
So when I run grp1.size() on the first twenty rows of data2, I get the following output:
What I would like to do now is calculate the mean and variance of ip_len and packet_len, and the mean of the interpacket arrival times (using the timestamps of packets belonging to the same flow).
How can I accomplish this in pandas so that the resulting dataframe contains the statistics of each flow, i.e. columns for ip_src, ip_dst, sport, dport, ip_proto, service, and the mean and variance values described above? I have tried both the agg and apply methods, but haven't been able to make it work. Thanks in advance!

data2.groupby(['colName1','colName2']).mean()
should do the job.
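That covers the means; for the variances and the interpacket arrival times as well, here is a sketch using named aggregations (pandas 0.25+), assuming the packet timestamps live in a column called ts (rename to match your frame):

flow = ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service']
flow_stats = data2.groupby(flow, sort=False).agg(
    ip_len_mean=('ip_len', 'mean'),
    ip_len_var=('ip_len', 'var'),
    packet_len_mean=('packet_len', 'mean'),
    packet_len_var=('packet_len', 'var'),
    # mean interpacket arrival time: sort each flow's timestamps,
    # diff them, then average the gaps
    iat_mean=('ts', lambda s: s.sort_values().diff().mean()),
).reset_index()  # turn the flow keys back into ordinary columns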

Related

pandas computing new column as an average of two other conditions

So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
So, I need to compute a new variable called avg_temp_ar_mensal which represents the average temperature of a city in a month. City in this dataset is represented as estacao and month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average of the temperature of the city and month. But it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
It works, but it is wrong: it calculates the average over every city in the dataset, and I don't want that because it will add noise to my data. I need to separate each temperature based on month and city and then calculate the mean.
The dataframe returned by groupby is smaller than the initial dataframe, which is why your code runs into an error.
There are two ways to solve this problem. The first is to use transform, which broadcasts the per-group mean back to the original rows:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dataframe dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...); besides, df2['mes', 'estacao'] is not valid indexing, which is why that line fails.
Instead, perform the groupby on the columns you need and broadcast the result back with transform, so the final output is a Series aligned with the original dataframe:
df['new_column'] = df.groupby(['city_column', 'month_column'])['temp_column'].transform('mean')
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df.

How to make processing of a Pandas groupby object more efficient?

"""
I have a data frame of millions of rows that I did .groupby() on.
I'd like to retrieve the rows containing the nlargest value for each id and tissue combination.
Also, I need to generate another df containing the mean value for each id and tissue combination.
Although I'm using a powerful Linux server, the process has been running for more than 24 hours, so I'm looking for a more efficient strategy. I spent hours on stackoverflow but failed to apply the solutions to my particular problem.
Thank you in advance for helping me out.
"""
import pandas as pd

df = pd.DataFrame({'id': ['g1','g1','g1','g1','g2','g2','g2','g2','g2','g2'],
                   'Trans': ['g1.1','g1.2','g1.3','g1.4','g2.1','g2.2','g2.3','g2.2','g2.1','g2.1'],
                   'Tissue': ['Lf','Lf','Lf','pc','Pol','Pol','Pol','Ant','Ant','m2'],
                   'val': [0.0948,1.5749,1.8904,0.8673,2.1089,2.5058,4.5722,0.7626,3.1381,2.723]})
print(df)
df_highest = pd.DataFrame(columns=df.columns)  # brand new df that will contain the rows of interest
for grpID, data in df.groupby(['id', 'Tissue']):
    highest = data.nlargest(1, 'val')
    df_highest = df_highest.append(highest)  # append returns a new frame, so reassign it
df_highest.to_csv('out.txt', sep='\t', index=False)
If you are trying to get the largest value for each id and tissue combination, try this code.
df_highest = df.loc[df.groupby(['id','Tissue'])['val'].idxmax()]
And this will give you the mean for each id and Tissue combination:
df_mean = df.groupby(['id','Tissue']).agg({'val': 'mean'})
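As a side note, the slow part of the original code is appending to a dataframe one group at a time inside the loop; both results can come from a single grouped object. A sketch:

g = df.groupby(['id', 'Tissue'])['val']
df_highest = df.loc[g.idxmax()]                  # row holding the largest val per group
df_mean = g.mean().reset_index(name='val_mean')  # per-group means as a flat dataframe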

Z-score calculation/standardisation using pandas

I came across this video and it bugs me.
Essentially, at 5:50, they calculate the Z-score for the whole data frame with the following snippet:
df_z = (df - df.describe().T['mean']) / df.describe().T['std']
It is a neat and beautiful line.
However, df.describe().T looks like this and df looks like this:
df.describe().T takes the df column names as its index and the describe statistics as its columns, so df.describe().T['mean'] and df.describe().T['std'] are two individual Series indexed by the column names, while df is an ordinary pd.DataFrame with a numerical index and the column names in the usual places.
My question is: how does that line make sense when the two shapes don't match at all? In particular, how is every observation (x_i) matched with its column's mean and std?
Thank you.
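For context, what makes that line work is pandas' alignment rules: in DataFrame-minus-Series arithmetic, the Series index is matched against the DataFrame's columns and the result is broadcast down the rows. A minimal sketch with a hypothetical two-column frame:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# describe().T is indexed by the original column names, so ['mean'] and
# ['std'] are Series whose index is ['a', 'b']
mean = df.describe().T['mean']
std = df.describe().T['std']

# each Series entry is matched to the df column with the same label,
# then broadcast down that column's rows
df_z = (df - mean) / std
print(df_z)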

How to calculate based on multiple conditions using Python data frames?

I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas dataframes to analyze the data.
What I want to do in column D is to calculate the annual change for the values in column C, for each year, for each ID.
I can do this in Excel: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue blank, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
You can also speed things up with the assumption that things are sorted, because it isn't necessary to group at all in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you'll need to assign it to a column or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
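The masked variant from above can be assigned the same way:

df['AnnChange'] = df.Cash.pct_change().mask(df.ID != df.ID.shift())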

How to apply a low-pass filter of 5Hz to a pandas dataframe?

I have a pandas.DataFrame indexed by time, as seen below. The other column contains data recorded from a device measuring current. I want to filter the second column with a low-pass filter with a 5 Hz cutoff to eliminate high-frequency noise. I want to get a dataframe back, but I do not mind if it changes type while the filter is applied (numpy array, etc.).
In [18]: print df.head()
Time
1.48104E+12 1.1185
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
1.48104E+12 0.8168
I am graphing this data with df.plot(legend=True, use_index=False, color='red') but would like to graph the filtered data instead.
I am using pandas 0.18.1 but I can change.
I have visited https://oceanpython.org/2013/03/11/signal-filtering-butterworth-filter/ and many other sources of similar approaches.
Perhaps I am over-simplifying this, but you can create a simple condition, build a new dataframe from the rows that pass the filter, and then graph that new dataframe; basically just reducing the dataframe to the records that meet the condition. I admit I do not know what the exact cutoff value should be, but let's assume your second column is named "Frequency":
condition = df["Frequency"] < 1.0
low_pass_df = df[condition]
low_pass_df.plot(legend=True, use_index=False, color='red')
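Note that this drops rows by value rather than filtering in the frequency domain. For a true 5 Hz low-pass along the lines of the Butterworth link in the question, here is a sketch with scipy, assuming a known sampling rate fs and a hypothetical value column named 'current':

from scipy.signal import butter, filtfilt

fs = 100.0     # assumed sampling rate of the device in Hz; replace with the real value
cutoff = 5.0   # desired cutoff frequency in Hz

# design a 4th-order Butterworth low-pass filter (Wn is normalised to the Nyquist frequency)
b, a = butter(N=4, Wn=cutoff / (fs / 2), btype='low')

# filtfilt runs the filter forwards and backwards, so the output has no phase shift
df['current_filtered'] = filtfilt(b, a, df['current'].values)
df['current_filtered'].plot(legend=True, use_index=False, color='red')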
