Combining pandas rows based on condition [duplicate] - python

This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Concatenate strings from several rows using Pandas groupby
(8 answers)
Closed 3 years ago.
Given a Pandas Dataframe df, with column names 'Session', and 'List':
Can I group together the 'List' values for the same values of 'Session'?
My Approach
I've tried solving the problem by creating a new DataFrame and iterating through the rows of the initial DataFrame while maintaining a session counter that I increment whenever the session changes.
If it hasn't changed, I append that row's List value followed by a comma.
Whenever the session changes, I use strip to remove the extra trailing comma.
Initial DataFrame
Session List
0 1 a
1 1 b
2 1 c
3 2 d
4 2 e
5 3 f
Required DataFrame
Session List
0 1 a,b,c
1 2 d,e
2 3 f
Can someone suggest something more efficient or simple?
Thank you in advance.

Use groupby, agg, and reset_index:
>>> df.groupby('Session')['List'].agg(','.join).reset_index()
Session List
0 1 a,b,c
1 2 d,e
2 3 f

Related

Pandas Drop Duplicates And Store Duplicates [duplicate]

This question already has answers here:
How do I get a list of all the duplicate items using pandas in python?
(13 answers)
Closed 2 months ago.
I use pandas.DataFrame.drop_duplicates to find duplicates in a DataFrame. It removes the duplicates from the DataFrame, which works well. However, I would like to know which rows were removed.
Is there a way to save those rows in a new list before removing them?
Unfortunately, I have found no information on this in the pandas documentation.
Thanks for the answer.
Use the duplicated function to identify the duplicated rows. By default (keep='first') the first occurrence is marked False and every later occurrence is marked True. Filtering the original data with this mask shows which rows are kept and which are dropped.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
You can use duplicated and boolean indexing with groupby.agg to keep the list of duplicates:
m = df.duplicated('group')
dropped = df[m].groupby(df['group'])['value'].agg(list)
print(dropped)
df = df[~m]
print(df)
Output:
# print(dropped)
group
A [2]
B [4, 5]
C [7]
Name: value, dtype: object
# print(df)
group value
0 A 1
2 B 3
5 C 6
Used input:
group value
0 A 1
1 A 2
2 B 3
3 B 4
4 B 5
5 C 6
6 C 7

How can I select distinct values into a pivot using pandas? [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Pivoting a Pandas Dataframe containing strings - 'No numeric types to aggregate' error
(3 answers)
Closed 1 year ago.
Pandas question:
If I have this dataframe:
Member  Value  Group
1       a      AC
1       c      AC
1       d      DF
2       b      AC
2       e      DF
which I would like to transform (using pivot?) into a DataFrame showing the occurrences of the individual elements of each group, like:
x  AC  DF
1  ac  d
2  b   e
I run into "Index contains duplicate values, cannot reshape" if I try:
pivot(index='Member', columns=['Group'], values='Value')
I feel confused over something seemingly trivial. Can somebody help?
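One possible way around the duplicate-index error is pivot_table with a string-joining aggregation, since pivot_table aggregates repeated (Member, Group) pairs instead of rejecting them. This is only a sketch: the DataFrame is rebuilt from the table in the question, and the names df and result are assumptions.

```python
import pandas as pd

# Data rebuilt from the table in the question (assumed column names).
df = pd.DataFrame({
    "Member": [1, 1, 1, 2, 2],
    "Value": ["a", "c", "d", "b", "e"],
    "Group": ["AC", "AC", "DF", "AC", "DF"],
})

# pivot() raises on duplicate (Member, Group) pairs;
# pivot_table() aggregates them, here by concatenating the strings.
result = df.pivot_table(
    index="Member",
    columns="Group",
    values="Value",
    aggfunc=lambda s: "".join(s),
)
print(result)
```

Joining with "" reproduces the "ac" cell from the desired output; a separator such as "," could be used instead if the values need to stay distinguishable.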

Two column DataFrame to transition table (pivot) [duplicate]

This question already has answers here:
Get statistics for each group (such as count, mean, etc) using pandas GroupBy?
(9 answers)
How can I pivot a dataframe?
(5 answers)
Closed 3 years ago.
I have a pandas dataframe with two columns. I want to measure the transition count, that is, the number of times that each unique first-column value is paired with each unique second-column value. This should be a pivot or pivot_table, but I am stuck. In the code pasted below, trial is the input dataframe, and ans is the answer dataframe that I would like to obtain by manipulating the trial dataframe.
I did not spot a similar dataframe question which has only two columns. The others used pivot on a third table where a mean or sum aggfunc were used. This is a case where there are only two columns, and I want to count the transitions. The other questions also used numerical columns where aggregation is possible. I want to count the columns for a non-numeric value.
If there is a similar question, would be very helpful if someone can point me to it.
trial=pd.DataFrame({'col1':list('AABCCCDDDD'),'col2':list('XYXXXYYXZZ')})
index col1 col2
0 A X
1 A Y
2 B X
3 C X
4 C X
5 C Y
6 D Y
7 D X
8 D Z
9 D Z
ans=pd.DataFrame({'col1':list('ABCD'),'X':[1,1,2,1],'Y':[1,0,1,1],'Z':[0,0,0,2]})
ans.set_index('col1')
col1 X Y Z
A 1 1 0
B 1 0 0
C 2 1 0
D 1 1 2
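For counting how often each col1 value occurs together with each col2 value, pd.crosstab may be the simplest fit, because it needs no third column and no numeric data. A minimal sketch using the trial DataFrame from the question (the name counts is an assumption):

```python
import pandas as pd

trial = pd.DataFrame({"col1": list("AABCCCDDDD"), "col2": list("XYXXXYYXZZ")})

# Cross-tabulate the two columns: rows are col1 values, columns are col2 values,
# and each cell holds the number of rows with that pair.
counts = pd.crosstab(trial["col1"], trial["col2"])
print(counts)
```

Pairs that never occur are filled with 0, which matches the ans table above.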

How can I drop rows in a dataframe efficiently if a specific column contains a substring [duplicate]

This question already has answers here:
Pandas filtering for multiple substrings in series
(3 answers)
Closed 4 years ago.
I tried
df = df[~df['event.properties.comment'].isin(['Extra'])]
The problem is that this only drops a row if the column is exactly 'Extra', and I need to drop the rows that contain it even as a substring.
Any help?
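Use str.contains, which matches substrings; several patterns can be combined with the regex "or" operator (|). For your requirement, keep only the rows whose text does not contain "Extra". The fillna('') call makes missing values count as non-matches.
Example df: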
vals ids
0 1 ~
1 2 bball
2 3 NaN
3 4 Extra text
df[~df.ids.fillna('').str.contains('Extra')]
Out:
vals ids
0 1 ~
1 2 bball
2 3 NaN
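If several substrings need to be excluded at once, the patterns can be joined with | as mentioned above. This is a sketch on the same example data; the unwanted list is hypothetical, and re.escape is used so that characters such as "~" or "." are matched literally.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "vals": [1, 2, 3, 4],
    "ids": ["~", "bball", None, "Extra text"],
})

# Hypothetical list of substrings whose rows should be dropped.
unwanted = ["Extra", "~"]

# Build one regex alternation, escaping any special characters.
pattern = "|".join(re.escape(s) for s in unwanted)

# na=False treats missing values as non-matches, so those rows are kept.
mask = df["ids"].str.contains(pattern, na=False)
filtered = df[~mask]
print(filtered)
```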

Error subsetting a data frame in python [duplicate]

This question already has an answer here:
Python - splitting dataframe into multiple dataframes based on column values and naming them with those values [duplicate]
(1 answer)
Closed 4 years ago.
I am learning python and pandas and am having trouble overcoming an error while trying to subset a data frame.
I have an input data frame:
df0-
Index Group Value
1 A 10
2 A 15
3 B 20
4 C 10
5 C 10
df0.dtypes-
Group object
Value float64
That I am trying to split out into unique values based off of the Group column. With the output looking something like this:
df1-
Index Group Value
1 A 10
2 A 15
df2-
Index Group Value
3 B 20
df3-
Index Group Value
4 C 10
5 C 10
So far I have written this code to subset the input:
UniqueGroups = df0['Group'].unique().tolist()
OutputFrame = {}
for x in UniqueAgencies:
    ReturnFrame[str('ConsolidateReport_')+x] = UniqueAgencies[df0['Group']==x]
The code above returns the following error, which I can't quite wrap my head around. Can anyone point me in the right direction?
*** TypeError: list indices must be integers or slices, not str
You can use groupby to split the DataFrame by the Group column:
for _, g in df0.groupby('Group'):
    print(g)
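To keep the pieces under names like the ones in the question, the groups can be collected into a dictionary. The TypeError arises because the question's code indexes a plain Python list with a boolean mask; the mask has to be applied to df0 itself, or groupby can do the split directly. A sketch with df0 rebuilt from the question (the name frames is an assumption):

```python
import pandas as pd

# Input rebuilt from the question's df0.
df0 = pd.DataFrame(
    {"Group": ["A", "A", "B", "C", "C"], "Value": [10.0, 15.0, 20.0, 10.0, 10.0]},
    index=[1, 2, 3, 4, 5],
)

# One sub-DataFrame per unique Group value, keyed by a prefixed name.
frames = {"ConsolidateReport_" + name: g for name, g in df0.groupby("Group")}

for key, frame in frames.items():
    print(key)
    print(frame)
```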
