I have a dataframe which looks like:
| User    | Text  | Effort  |
|---------|-------|---------|
| user122 | TextA | 2 Weeks |
| user124 | TextB | 2 Weeks |
| user125 | TextC | 3 Weeks |
| user126 | TextD | 2 Weeks |
| user126 | TextE | 2 Weeks |
My goal is to group the table by Effort and get the count of unique users per group. I am able to do that with:
df.groupby(['Effort']).agg({"User": pd.Series.nunique})
This results in the following table:
| Effort  | User |
|---------|------|
| 2 Weeks | 3    |
| 3 Weeks | 1    |
However, by doing so I lose the Text column information. Another solution I tried was to keep the first occurrence of that column, but I am still unhappy because I lose something along the way.
Question
Is there any way in which I can keep my initial dataframe, without losing any rows or columns, but at the same time still group by Effort?
The best option you have here is transform, if you ask me. This way you keep the shape of your original data but still get the result of a groupby.
df['Nunique'] = df.groupby('Effort')['User'].transform('nunique')
User Text Effort Nunique
0 user122 TextA 2 Weeks 3
1 user124 TextB 2 Weeks 3
2 user125 TextC 3 Weeks 1
3 user126 TextD 2 Weeks 3
4 user126 TextE 2 Weeks 3
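For completeness, here is a self-contained sketch reproducing the example above; the dtypes, the literal values and the summary column name are assumptions for illustration only.
import pandas as pd
# Reconstruction of the example frame from the question.
df = pd.DataFrame({
    'User':   ['user122', 'user124', 'user125', 'user126', 'user126'],
    'Text':   ['TextA', 'TextB', 'TextC', 'TextD', 'TextE'],
    'Effort': ['2 Weeks', '2 Weeks', '3 Weeks', '2 Weeks', '2 Weeks'],
})
# transform broadcasts the per-group nunique back onto every original row.
df['Nunique'] = df.groupby('Effort')['User'].transform('nunique')
# The one-row-per-group summary is still available from the same groupby if needed.
summary = df.groupby('Effort')['User'].nunique().reset_index(name='UniqueUsers')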
Related
I'm working with a panel dataset that contains many days' worth of info on each ID number. One variable records the number of months in which the client did something.
I want to find the clients that only reached 1 month, i.e. the clients that never reached month 2, 3, etc.
Here is a sample of my data. The date column is in str format.
Client| Date | Months
1 | 04/01/2019 | 1
1 | 05/01/2019 | 1
1 | 06/01/2019 | 2
2 | 11/01/2019 | 1
2 | 12/01/2019 | 1
2 | 13/01/2019 | 1
2 | 14/01/2019 | 1
3 | 20/01/2019 | 1
3 | 21/01/2019 | 2
3 | 22/01/2019 | 2
3 | 23/01/2019 | 2
3 | 24/01/2019 | 3
3 | 25/01/2019 | 3
3 | 26/01/2019 | 3
In this example only client 2 would be selected. I would like to make a list (or something similar) to store the client numbers that follow this rule.
The code I tried was
df.loc[df["Months"] == 1, "Client"].unique()
which didn't give me what I wanted (it returns every client that ever had month 1, not only the clients that never went past month 1).
Any ideas are very much appreciated!
Perhaps something like this:
s = df.set_index('Client')['Months'].eq(1).groupby(level=0).all()
s[s].index
Result:
Int64Index([2], dtype='int64', name='Client')
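An equivalent reading, sketched under the assumption that Months starts at 1 for every client: a client "only reached 1 month" exactly when the maximum of Months is 1, so the same set of clients falls out of a groupby max.
# Clients whose highest recorded Months value is 1.
only_month_one = df.groupby('Client')['Months'].max().eq(1)
only_month_one[only_month_one].index.tolist()  # [2] for the sample data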
Get the rows where there is only one distinct Months value, then filter:
df.loc[df.groupby(["Client"]).Months.transform("nunique").eq(1)]
Client Date Months
3 2 11/01/2019 1
4 2 12/01/2019 1
5 2 13/01/2019 1
6 2 14/01/2019 1
If you just want the Client number:
df.loc[df.groupby(["Client"]).Months.transform("nunique").eq(1), "Client"].unique()[0]
OR
df.groupby("Client").Months.nunique().loc[lambda x: x == 1].index[0]
Let me begin by noting that this question is very close to this question about getting non-zero values for each column in a pandas dataframe, but in addition to getting the values, I would also like to know the row from which each value was drawn. (And, ultimately, I would like to be able to re-use the code to find columns in which a non-zero value occurred some x number of times.)
What I have is a dataframe with a count of words for a given year of documents:
|Year / Term | word1 | word2 | word3 | ... | wordn |
|------------|-------|-------|-------|-----|-------|
| 2001 | 23 | 0 | 0 | | 0 |
| 2002 | 0 | 0 | 12 | | 0 |
| 2003 | 0 | 42 | 34 | | 0 |
| year(n) | 0 | 0 | 0 | | 45 |
So for word1 I would like to get both 23 and 2001 -- this could be as a tuple or as a dictionary. (It doesn't matter so long as I can work through the data.) And ultimately, I would like very much to be able to discover that word3 enjoyed a two-year span of usage.
FTR, the dataframe has only 16 rows, but it has a lot of columns. If there's an answer to this question already available, revealing the weakness of my search-fu, I will take the scorn as my just due.
In your case, melt then groupby:
df.melt('Year / Term').loc[lambda x: x['value'] != 0].groupby('variable')['value'].apply(tuple)
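The question also asked for the year alongside each non-zero count. A sketch of one way to keep it, assuming the index column is literally named 'Year / Term':
# Melt to long form, drop zero counts, then collect (year, count) pairs per word.
pairs = (
    df.melt('Year / Term', var_name='word', value_name='count')
      .loc[lambda x: x['count'] != 0]
      .groupby('word')
      .apply(lambda g: list(zip(g['Year / Term'], g['count'])))
)
# For the sample data, pairs['word3'] would be [(2002, 12), (2003, 34)].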
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results from this. As a simple example, the new dataframe should contain one row per sample with the averages across its runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in df_samples['sample_id'].unique():
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index='sample_id', columns='run_id',
                                  values=['x', 'y', 'z'], aggfunc='mean')
    # Add some other features. This involves creating more columns than
    # existed in the initial df_samples dataframe.
    pivots.append(pivot)
# create the new dataframe
df_result = pd.concat(pivots)
So my first question is: if I want to create a new dataframe which consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of having to call it iteratively? If there is, is it more performant?
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe will also grow in dimensions, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes, constructing new ones performantly, and performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory on dataframes and tables that I do not know. A recommendation for a resource would be welcome. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that covers the two use cases above will be accepted with or without a recommendation for a resource.
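For the first question, here is a sketch of how the loop could collapse into a single groupby, assuming x, y and z are the literal column names in df_samples; the extra per-sample features would then come from further aggregations or a follow-up apply over the groups.
# One groupby replaces the per-sample loop; named aggregation labels the columns.
df_result = df_samples.groupby('sample_id').agg(
    avg_x=('x', 'mean'),
    avg_y=('y', 'mean'),
    avg_z=('z', 'mean'),
).reset_index()
Custom columns such as new_feature_1 could then come from a separate groupby('sample_id').apply(f) merged back on sample_id, with f being whatever per-sample function is needed.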
I'm sure what I'm trying to do is fairly simple for those with better knowledge of pandas, but I'm simply stuck at transforming:
+---------+------------+-------+
| Trigger | Date | Value |
+---------+------------+-------+
| 1 | 01/01/2016 | a |
+---------+------------+-------+
| 2 | 01/01/2016 | b |
+---------+------------+-------+
| 3 | 01/01/2016 | c |
+---------+------------+-------+
...etc, into:
+------------+---------------------+---------+---------+---------+
| Date | #of triggers | count a | count b | count c |
+------------+---------------------+---------+---------+---------+
| 01/01/2016 | 3 | 1 | 1 | 1 |
+------------+---------------------+---------+---------+---------+
| 02/01/2016 | 5 | 2 | 1 | 2 |
+------------+---------------------+---------+---------+---------+
... and so on
The issue is, I've got no bloody idea of how to achieve this.
I've scoured SO, but I can't seem to find anything that applies to my specific case.
I presume I'd have to group it all by date, but then once that is done, what do I need to do to get the remaining columns?
The initial DataFrame is loaded from an SQLAlchemy query object, and then I want to manipulate it to get the result described above. How would one do this?
Thanks
df.groupby(['Date','Value']).count().unstack(level=-1)
You can use GroupBy.size with unstack; the groupby parameter sort=False can also be helpful:
df1 = df.groupby(['Date','Value'])['Value'].size().unstack(fill_value=0)
df1['Total'] = df1.sum(axis=1)
cols = df1.columns[-1:].union(df1.columns[:-1])
df1 = df1[cols]
print (df1)
Value Total a b c
Date
01/01/2016 3 1 1 1
The difference between size and count is:
size counts NaN values, count does not.
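pd.crosstab is another compact way to get the same shape; a sketch, with the '# of triggers' label mirroring the desired output:
# Counts of each Value per Date, plus a per-date total of triggers.
out = pd.crosstab(df['Date'], df['Value'])
out.insert(0, '# of triggers', out.sum(axis=1))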
I am trying to select a grouped average.
a1_avg = session.query(func.avg(Table_A.a1_value).label('a1_avg'))\
.filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))\
.group_by(Table_A.a1_group)
I have tried a few different iterations of this query, and this is as close as I can get to what I need. I am fairly certain the group_by is creating the issue, but I am unsure how to correctly implement the query using SQLAlchemy. The table structure and expected output:
TABLE A
A1_ID | A1_VALUE | A1_DATE | A1_LOC | A1_GROUP
1 | 5 | 2011-10-05 | 5 | 6
2 | 15 | 2011-10-14 | 5 | 6
3 | 2 | 2011-10-21 | 6 | 7
4 | 20 | 2011-11-15 | 4 | 8
5 | 6 | 2011-10-27 | 6 | 7
EXPECTED OUTPUT
A1_LOC | A1_GROUP | A1_AVG
5 | 6 | 10
6 | 7 | 4
I would guess that you are just missing the group identifier (a1_group) in the result. Also (if I understand your model correctly), you need to add a group_by clause for the a1_loc column as well:
edit-1: updated the query per the question's clarification
a1_avg = (
    session.query(Table_A.a1_loc, Table_A.a1_group,
                  func.avg(Table_A.a1_value).label('a1_avg'))
    .filter(between(Table_A.a1_date, '2011-10-01', '2011-10-30'))
    # .filter(Table_A.a1_id == '12')  # note: you do NOT need this
    .group_by(Table_A.a1_loc)         # note: you NEED this
    .group_by(Table_A.a1_group)
)
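Running the query would then yield one row per (a1_loc, a1_group) pair; a small usage sketch:
# Each result row carries the location, the group and the computed average.
for loc, group, avg_value in a1_avg.all():
    print(loc, group, avg_value)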