How to make processing of a Pandas groupby object more efficient? - python

"""
I have a data frame of millions of rows that I did .groupby() on.
I'd like to retrieve the rows containing the nlargest value for each id and Tissue combination.
I also need to generate another df containing the mean value for each id and Tissue combination.
Although I'm using a powerful Linux server, the process has been running for more than 24 hours, so I'm looking for a more efficient strategy. I spent hours on Stack Overflow but failed to apply the solutions to my particular problem.
Thank you in advance for helping me out.
"""
import pandas as pd

df = pd.DataFrame({'id': ['g1','g1','g1','g1','g2','g2','g2','g2','g2','g2'],
                   'Trans': ['g1.1','g1.2','g1.3','g1.4','g2.1','g2.2','g2.3','g2.2','g2.1','g2.1'],
                   'Tissue': ['Lf','Lf','Lf','pc','Pol','Pol','Pol','Ant','Ant','m2'],
                   'val': [0.0948,1.5749,1.8904,0.8673,2.1089,2.5058,4.5722,0.7626,3.1381,2.723]})
print(df)

df_highest = pd.DataFrame(columns=df.columns)  # brand new df that will contain the rows of interest
for grpID, data in df.groupby(['id','Tissue']):
    highest = data.nlargest(1, 'val')
    df_highest = df_highest.append(highest)  # append returns a new frame, so it must be reassigned
df_highest.to_csv('out.txt', sep='\t', index=False)

If you are trying to get the row with the largest value for each id and Tissue combination, try this code.
df_highest = df.loc[df.groupby(['id','Tissue'])['val'].idxmax()]
This will give you the mean for each id and Tissue combination.
df_mean = df.groupby(['id','Tissue']).agg({'val': 'mean'})
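For completeness, here is a minimal vectorized sketch that puts both answers together on the question's sample data, replacing the per-group Python loop entirely; the output file names are just illustrative.
import pandas as pd

df = pd.DataFrame({'id': ['g1','g1','g1','g1','g2','g2','g2','g2','g2','g2'],
                   'Trans': ['g1.1','g1.2','g1.3','g1.4','g2.1','g2.2','g2.3','g2.2','g2.1','g2.1'],
                   'Tissue': ['Lf','Lf','Lf','pc','Pol','Pol','Pol','Ant','Ant','m2'],
                   'val': [0.0948,1.5749,1.8904,0.8673,2.1089,2.5058,4.5722,0.7626,3.1381,2.723]})

# rows holding the largest val per (id, Tissue), in a single vectorized pass
df_highest = df.loc[df.groupby(['id','Tissue'])['val'].idxmax()]

# mean val per (id, Tissue)
df_mean = df.groupby(['id','Tissue'], as_index=False)['val'].mean()

df_highest.to_csv('out_highest.txt', sep='\t', index=False)  # illustrative file names
df_mean.to_csv('out_mean.txt', sep='\t', index=False)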

Related

Min of a column if another column is above average

I'm doing some Python exercises and I'm stuck with a question.
I'm using the following Titanic dataframe: https://drive.google.com/file/d/1NEHvlUMTNPusHZvHUFTqeUR_9yY1tHVz/view
Now I need to find the minimum value of the column 'Age' for each class of 'Pclass' for the passengers that paid a fare ('Fare') above the average.
Using this I can get the minimum age by group, but how can I add the 'above average Fare' condition to this?
df.groupby('Pclass')['Age'].min()
You can:
find the mean fare
filter to the rows above it
use pivot_table to get the minimum value of 'Age' for each class of 'Pclass'
avrg_Fare = df['Fare'].mean()
df = df.loc[df['Fare'] > avrg_Fare]
PVT_min_age = df.pivot_table(index='Pclass', aggfunc={'Age': 'min'}).reset_index()
Give this a shot
average_fare = df['Fare'].mean()
df.query("Fare > @average_fare").groupby('Pclass').agg({'Age': ['min']})
See also: Grouping by with Where conditions in Pandas
I may have some syntax errors since it's been a while since I've done pandas; if anyone sees a problem, please correct it.
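A minimal end-to-end sketch of the same idea, assuming the Titanic data from the linked file has been loaded into df with the Fare, Pclass and Age columns mentioned in the question (the file name below is hypothetical).
import pandas as pd

df = pd.read_csv('titanic.csv')  # hypothetical local copy of the linked dataset

average_fare = df['Fare'].mean()
above_avg = df[df['Fare'] > average_fare]               # passengers who paid above the mean fare
min_age_per_class = above_avg.groupby('Pclass')['Age'].min()
print(min_age_per_class)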

pandas computing a new column as an average of two other conditions

So I have this dataset of temperatures. Each line describes the temperature in Celsius measured by hour in a day.
I need to compute a new variable called avg_temp_ar_mensal which represents the average temperature of a city in a month. In this dataset the city is represented as estacao and the month as mes.
I'm trying to do this using pandas. The following line of code is the one I'm trying to use to solve this problem:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes', 'estacao']).mean()
The goal of this code is to store in a new column the average temperature per city and month, but it doesn't work. If I try the following line of code:
df2['avg_temp_ar_mensal'] = df2['temp_ar'].groupby(df2['mes']).mean()
it works, but it is wrong: it calculates the mean across every city in the dataset, which adds noise to my data. I need to separate each temperature by month and city and then calculate the mean.
The dataframe produced by groupby is smaller than the initial dataframe; that is why your code runs into an error.
There are two ways to solve this problem. The first one is using transform:
df['avg_temp_ar_mensal'] = df.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
The second is to create a new dfn from the groupby and then merge it back into df:
dfn = df.groupby(['mes', 'estacao'])['temp_ar'].mean().reset_index(name='average')
df = pd.merge(df, dfn, on=['mes', 'estacao'], how='left')
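A small self-contained sketch of the transform approach, using the question's column names (mes, estacao, temp_ar) on made-up values, just to show that the result aligns row by row with the original frame.
import pandas as pd

# toy data for illustration only; values are made up
df2 = pd.DataFrame({'estacao': ['A', 'A', 'A', 'B', 'B'],
                    'mes': [1, 1, 2, 1, 1],
                    'temp_ar': [20.0, 22.0, 25.0, 18.0, 19.0]})

# transform returns one value per original row, so the assignment aligns on the index
df2['avg_temp_ar_mensal'] = df2.groupby(['mes', 'estacao'])['temp_ar'].transform('mean')
print(df2)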
You are calling groupby on a single column when you do df2['temp_ar'].groupby(...). This doesn't make much sense, since within a single column there is nothing to group by.
Instead, you have to perform the groupby on all the columns you need. Also, make sure that the final output is a Series and not a DataFrame:
df['new_column'] = df[['city_column', 'month_column', 'temp_column']].groupby(['city_column', 'month_column']).mean()['temp_column']
This should do the trick if I understand your dataset correctly. If not, please provide a reproducible version of your df

Detecting bad information (python/pandas)

I am new to Python and pandas, and I was wondering whether pandas can filter out information within a dataframe that is otherwise inconsistent. For example, imagine that I have a dataframe with 2 columns: (1) product code and (2) unit of measurement. The same product code in column 1 may repeat several times and there are several different product codes. I would like to filter out the product codes for which there is more than 1 unit of measurement for the same product code. Ideally, when this happens the filter would bring back all instances of such a product code, not just the instance in which the unit of measurement is different. To add more color to my request, the real objective here is to identify the product codes which have inconsistent units of measurement, as the same product code should always have the same unit of measurement in all instances.
Thanks in advance!!
First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either upload this, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for each product code is the correct one. You could get this by doing
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x:x.value_counts().index[0]).to_dict()
Then you can get a column that is the 'correct' unit of measurement
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
Try this:
Sample df:
df12 = pd.DataFrame({'Product Code': ['A','A','A','A','B','B','C','C','D','E'],
                     'Unit of Measurement': ['x','x','y','z','w','w','q','r','a','c']})
Group by and count every (Product Code, Unit of Measurement) pair:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0: 'count'})
Drop all rows where the Product Code is repeated:
new.drop_duplicates(subset=['Product Code'], keep=False)
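Because the question also asks to bring back every row of an inconsistent product code, here is an alternative sketch (not part of the answers above) that uses transform('nunique') on the same sample df12 to keep all rows whose code has more than one unit.
import pandas as pd

df12 = pd.DataFrame({'Product Code': ['A','A','A','A','B','B','C','C','D','E'],
                     'Unit of Measurement': ['x','x','y','z','w','w','q','r','a','c']})

# number of distinct units seen for each product code, broadcast back to every row
n_units = df12.groupby('Product Code')['Unit of Measurement'].transform('nunique')

# all rows belonging to product codes with more than one unit of measurement
inconsistent_rows = df12[n_units > 1]
print(inconsistent_rows)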

How to calculate based on multiple conditions using Python data frames?

I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas dataframes to analyze the data.
What I want to do in column D is to calculate the annual change for the values in column C, for each year and each ID.
I can do this in Excel: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted:
df.groupby('ID').Cash.pct_change()
However, you can speed things up under the assumption that things are sorted, because it isn't necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you'll need to assign it to a column or create a new dataframe with the new column:
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
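A small self-contained sketch of both variants; the ID and Cash column names follow the answer above, and the values are purely illustrative.
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 1, 2, 2],
                   'Cash': [100.0, 110.0, 121.0, 50.0, 60.0]})

# grouped version: pct_change restarts at each ID, so the first row of each ID is NaN
df['AnnChange'] = df.groupby('ID').Cash.pct_change()

# equivalent when rows are already sorted by ID: mask out the first row of each ID
df['AnnChange_fast'] = df.Cash.pct_change().mask(df.ID != df.ID.shift())
print(df)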

Pandas: Calculate mean, var of similar columns grouped together

Trying to do an analysis of network trace data using pandas. I have read the dump file and created the following DataFrame:
So to detect the individual flows in the DataFrame data2, I have grouped the entire DataFrame according to ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service'] using the following piece of code:
flow = ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service']
grp1 = data2.groupby(flow, sort=False)
So when I do grp1.size() of the first twenty rows of data2, I get the following information:
What I would like to do now is to calculate the mean and variance of ip_len and packet_len, and the mean of the inter-packet arrival times (using the timestamps of packets belonging to the same flow).
How can I accomplish this in pandas so that the dataframe I get contains the statistics of each flow, i.e. the columns should contain ip_src, ip_dst, sport, dport, ip_proto, service, and the mean & var values calculated as described above? I have tried both the agg and apply methods, but haven't been able to do it. Thanks in advance!
data2.groupby(['colName1','colName2']).mean()
should do the job.
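A hedged sketch of how the per-flow statistics could be assembled with agg and a per-group apply. The toy data2 below only stands in for the real trace, and the 'timestamp' column name is an assumption, since the question does not show which column holds the packet times.
import pandas as pd

# toy stand-in for data2; values are made up, columns follow the question
data2 = pd.DataFrame({'ip_src': ['10.0.0.1']*3 + ['10.0.0.2']*2,
                      'ip_dst': ['10.0.0.9']*5,
                      'sport': [1111]*3 + [2222]*2,
                      'dport': [80]*5,
                      'ip_proto': ['tcp']*5,
                      'service': ['http']*5,
                      'ip_len': [60, 1500, 1500, 60, 800],
                      'packet_len': [74, 1514, 1514, 74, 814],
                      'timestamp': [0.00, 0.10, 0.25, 0.05, 0.30]})

flow = ['ip_src', 'ip_dst', 'sport', 'dport', 'ip_proto', 'service']
grp1 = data2.groupby(flow, sort=False)

# mean and variance of the length columns, one row per flow
len_stats = grp1.agg({'ip_len': ['mean', 'var'], 'packet_len': ['mean', 'var']})
len_stats.columns = ['_'.join(c) for c in len_stats.columns]  # flatten the column MultiIndex

# mean inter-packet arrival time per flow ('timestamp' is an assumed column name)
iat = grp1['timestamp'].apply(lambda t: t.sort_values().diff().mean()).rename('iat_mean')

flow_stats = len_stats.join(iat).reset_index()
print(flow_stats)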
