Performant alternative to constructing a dataframe by applying repeated pivots - python

I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| --------- | ------ | - | - | - |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results computed from this one. As a simple example, the new dataframe should contain one row per sample with the average information across its runs:
| sample_id | avg_x | avg_y | avg_z |
| --------- | ----- | ----- | ----- |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
import pandas as pd

pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    # index/columns/values assumed from the example tables above
    pivot = df_sample.pivot_table(index='sample_id', columns='run_id',
                                  values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. This involves creating more columns than
    # existed in the initial df_samples dataframe.
    pivots.append(pivot)
# create the new dataframe
df_new = pd.concat(pivots)
So my first question is: if I want to create a new dataframe that consists of repeated pivots of another dataframe, is there a way to do that all at once with a single pivot call instead of calling it iteratively? If there is, is it more performant?
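For reference, a minimal sketch of what a single call over the whole frame might look like, assuming the loop above only exists to split rows by sample_id and the value columns are x, y and z:

import pandas as pd

# One pivot over the entire frame: sample_id becomes the index and run_id is
# spread into columns, so no per-sample loop or concat is needed.
wide = df_samples.pivot_table(index='sample_id', columns='run_id',
                              values=['x', 'y', 'z'], aggfunc='mean')

# If only the per-sample averages are needed, a plain groupby is simpler still.
avgs = df_samples.groupby('sample_id')[['x', 'y', 'z']].mean().add_prefix('avg_')

A single vectorised call like this generally beats building many small pivots and concatenating them, because the grouping work is done once rather than once per sample in Python.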
My second question involves the more complicated case: is it possible to perform the pivots all at once when the new dataframe also grows extra columns, i.e. it might look like
| s_id | avg_x | avg_y | avg_z | new_feature_1 | new_feature_2 |
| ---- | ----- | ----- | ----- | ------------- | ------------- |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
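As a hedged sketch of that case (f and g below are placeholders standing in for the question's feature functions, not real APIs), one groupby.apply pass can produce the averages and the extra features together:

import numpy as np
import pandas as pd

def f(values):
    # placeholder for the question's f(x11, x12, ..., z12)
    return np.sum(values)

def g(values):
    # placeholder for the question's g(avg_x1, ..., avg_z2)
    return np.prod(values)

def build_features(group):
    # group holds every run belonging to one sample_id
    avgs = group[['x', 'y', 'z']].mean().add_prefix('avg_')
    out = avgs.to_dict()
    out['new_feature_1'] = f(group[['x', 'y', 'z']].to_numpy().ravel())
    out['new_feature_2'] = g(avgs.to_numpy())
    return pd.Series(out)

result = df_samples.groupby('sample_id').apply(build_features)

If the feature functions can be vectorised, computing them column-wise on the wide pivot from the first sketch will be faster, since apply still runs Python code once per group.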
Aside: I am looking for a good resource on working with large pandas dataframes, performantly constructing new ones, and performing queries. I am almost always able to get the result I want using pandas, but my implementations are often not efficient and resemble how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe that involves some theory I do not know about dataframes and tables. A recommendation for a resource would be good. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a resource recommendation.

Related

Pandas: filter on grouped and aggregated dataframe

I have a dataframe which is based on a read-in Excel list. The data has multiple columns and rows with one unique identifier. I want to plot the data through a PyQt interface based on some user selection (checkboxes), but I cannot select one unique row for plotting.
The data looks like this:
| Experiment | Data 1 | Data 2 |
| -------- | ------ | -------- |
| Exp1 | 0 | 1 |
| Exp2 | 0 | 2 |
| Exp3 | 1 | 2 |
| Exp1 | 1 | 2 |
| Exp3 | 2 | 2 |
After
df.groupby('Experiment').agg(list)
I get this:
| Experiment | Data 1 | Data 2 |
| ---------- | ------ | -------- |
| Exp1 | [0, 1] | [1, 2] |
| Exp2 | [0] | [2] |
| Exp3 | [1, 2] | [2, 2] |
I can use this for plotting e.g. with pyqtgraph. However, after the user makes a selection, only that specific experiment is supposed to be plotted (e.g. Exp3).
I tried filtering on the aggregated list with
.filter(lambda x: x['Culture ID']=='Exp3')
but it raises "'function' object is not iterable" and I have a feeling this is the wrong approach.
Is there a way for me to get, for example, the index of the experiment name (e.g. Exp3) so that I can access it that way, or can someone explain how I could filter or access one of the rows based on the string/experiment key?
df.groupby('Experiment').agg(list).query('index == "Exp3"')
output:
            Data 1  Data 2
Experiment
Exp3        [1, 2]  [2, 2]
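An alternative sketch with the same grouped frame: because Experiment becomes the index after the groupby, a plain .loc lookup works as well:

grouped = df.groupby('Experiment').agg(list)
row = grouped.loc['Exp3']       # Series holding the two lists for Exp3
subset = grouped.loc[['Exp3']]  # one-row DataFrame, convenient for plotting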

pyspark create all possible combinations of column values of a dataframe

I want to get all the possible combinations of size 2 of a column in pyspark dataframe.
My pyspark dataframe looks like
| id |
| -- |
| 1 |
| 2 |
| 3 |
| 4 |
For the above input, I want to get the output as
| id1 | id2 |
| --- | --- |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 2 | 3 |
and so on..
One way would be to collect the values into a Python iterable (list, pandas df) and use itertools.combinations to generate all combinations.
import itertools
from pyspark.sql import functions as F

values = df.select(F.collect_list('id')).first()[0]
combns = list(itertools.combinations(values, 2))
However, I want to avoid collecting the dataframe column to the driver since the rows can be extremely large. Is there a better way to achieve this using spark APIs?
You can use the crossJoin method and then keep only the rows with id1 < id2, which drops duplicate pairs as well as self-pairs.
df = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
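A minimal end-to-end sketch of that answer, assuming a single-column dataframe named df as in the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,), (4,)], ['id'])

# Self cross join, then keep each unordered pair exactly once.
pairs = df.toDF('id1').crossJoin(df.toDF('id2')).filter('id1 < id2')
pairs.show()

Everything stays distributed, so nothing is collected to the driver.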

Using pandas apply to pass in both a row and the entire dataframe with it [duplicate]

This question already has an answer here:
Pandas - Finding percent contributed by each group
(1 answer)
Closed 2 years ago.
I have a df and I want to create some new cols with it. How would I use the apply function to both pass in the row, and the entire df with it? I need the entire df to do some filtering, and the data is subject to the values in each row.
Or maybe I don't need to use apply, but that's the first thing that came to my mind. Thank you and all help is appreciated!
Ex of df:
+----+--------+--------+
| ID | Family | Amount |
+----+--------+--------+
| 1 | A | 2 |
| 2 | A | 10 |
| 3 | B | 4 |
| 4 | B | 7 |
+----+--------+--------+
Result:
+----+--------+--------+-----------+------------+
| ID | Family | Amount | Total_Fam | Id_Percent |
+----+--------+--------+-----------+------------+
| 1 | A | 2 | 12 | .166 |
| 2 | A | 10 | 12 | .833 |
| 3 | B | 4 | 11 | .363 |
| 4 | B | 7 | 11 | .636 |
+----+--------+--------+-----------+------------+
First, group by Family and use transform to compute the family total; then you can divide Amount by the new column directly.
import numpy as np

df['Total_Fam'] = df.groupby('Family')['Amount'].transform(np.sum)
df['Id_Percent'] = df['Amount'] / df['Total_Fam']
df
Using apply on a single column passes each value individually. If you use apply on the entire dataframe (with axis=1), each call sees a whole row, so you can use all columns. As you can see in the example below, df['new_2'] is made by applying a function to the dataset, and I do not need to pass the df to it.
import pandas as pd
import seaborn as sns

df = sns.load_dataset('iris')
df['new'] = df['species'].apply(lambda x: x[:2])

def sumIsMore(dataframe):
    x = dataframe['sepal_length']
    y = dataframe['sepal_width']
    return x + y >= 8.5

df['new_2'] = df.apply(sumIsMore, axis=1)
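To address the title question literally, a small sketch of passing both the row and the whole dataframe through apply; the helper row_vs_df and the percentage logic below are illustrative, not part of the answers above:

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Family': ['A', 'A', 'B', 'B'],
                   'Amount': [2, 10, 4, 7]})

def row_vs_df(row, whole_df):
    # the full frame arrives via the lambda's closure below
    family_total = whole_df.loc[whole_df['Family'] == row['Family'], 'Amount'].sum()
    return row['Amount'] / family_total

df['Id_Percent'] = df.apply(lambda row: row_vs_df(row, df), axis=1)

The groupby/transform answer above remains the faster route; the closure pattern is mainly useful when the per-row logic cannot be expressed as a group aggregation.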

Python Pandas datasets - Having integer values in a new column by making a dictionary

I am trying to output, in a new column, integer values (labels/classes) based on the labels of another column in my dataset. Currently I do this by creating a new boolean column for each class and then using those to build the class column with numerical values. I was trying to do it with a dictionary instead, which I think is a better and faster way.
If I run a code like this:
x = df['Item_Type'].value_counts()
item_type_mapping = {}
item_list = x.index
for i in range(0, len(item_list)):
    item_type_mapping[item_list[i]] = i
It generates the dictionary, but then if I run:
df['Item_Type']=df['Item_Type'].map(lambda x:item_type_mapping[x])
or
df['New_column']=[item_type_mapping[item] for item in data.Item_Type]
It raises a KeyError: None.
Does anybody know why this occurs? I find it strange, since the dictionary has been created and I can see it through my variables.
Thanks
Edit 1
@Fourier
Simply, I have this column:
| Item_type|
| -------- |
| Nino |
| Nino |
| Nino |
| Pasquale |
| Franco |
| Franco |
and then I need the same column or a new one to display:
| Item_type| New_column |
| -------- | ---------- |
| Nino | 1 |
| Nino | 1 |
| Nino | 1 |
| Pasquale | 2 |
| Franco | 3 |
| Franco | 3 |
Your code works for me, but what you're trying to do is already provided by pandas as Categorical data.
df = pd.DataFrame({'Item_Type': list('abca')})
df['New_column'] = df.Item_Type.astype('category').cat.codes
Result:
Item_Type New_column
0 a 0
1 b 1
2 c 2
3 a 0
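If an explicit dictionary is still preferred, a hedged sketch of an equivalent mapping; Series.map with a dict leaves unmatched labels as NaN instead of raising the KeyError that a lambda lookup would:

import pandas as pd

df = pd.DataFrame({'Item_Type': list('abca')})

# Build {label: integer} in one pass, then map it onto the column.
item_type_mapping = {v: i for i, v in enumerate(df['Item_Type'].unique())}
df['New_column'] = df['Item_Type'].map(item_type_mapping)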

Pandas - applying groupings and counts to multiple columns in order to generate/change a dataframe

I'm sure what I'm trying to do is fairly simple for those with better knowledge of pandas, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea of how to achieve this..
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DF is loaded from an SQL Alchemy query object, and then I want to manipulate it to get the result I described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the parameter sort=False is also helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value       Total  a  b  c
Date
01/01/2016      3  1  1  1
The difference between size and count is:
size counts NaN values, count does not.
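A hedged alternative sketch using pd.crosstab, which builds the same counts table directly:

import pandas as pd

counts = pd.crosstab(df['Date'], df['Value'])
counts.insert(0, '#of triggers', counts.sum(axis=1))
print(counts)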
