I have a table:
ID  Component  Revenue
1   4          10
1   5          20
2   4          15
3   6          30
and I'd like to group by ID, creating a column with a dictionary or list as such:
ID  Grouped
1   [[4, 10], [5, 20]]
2   [4, 15]
3   [6, 30]
I know using
df.groupby(['ID']).Component.apply(list).reset_index()
will do so for one column, but I'm not sure how to do it for multiple columns.
You can use:
(df.groupby(['ID'])[['Component', 'Revenue']]
.apply(lambda d: d.to_numpy().tolist())
.reset_index(name='Grouped')
)
output:
ID Grouped
0 1 [[4, 10], [5, 20]]
1 2 [[4, 15]]
2 3 [[6, 30]]
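If you would rather have an actual dictionary per ID, mapping Component to Revenue, a minimal sketch along the same lines:
(df.groupby('ID')[['Component', 'Revenue']]
 .apply(lambda d: dict(zip(d['Component'], d['Revenue'])))
 .reset_index(name='Grouped')
)
# ID 1 becomes {4: 10, 5: 20}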
I have two dataframes, one has a column that contains a list of values and the other one has some values.
I want to filter the main df if one of the values in the second df exists in the main df column.
Code:
import pandas as pd
A = pd.DataFrame({'index':[0,1,2,3,4], 'vals':[[1,2],[5,4],[7,1,26],['-'],[9,8,5]]})
B = pd.DataFrame({'index':[4,7], 'val':[1,8]})
print(A)
print(B)
print(B['val'].isin(A['vals'])) # Will not work since it's comparing an element to a list
result = pd.DataFrame({'index':[0,2,4], 'vals':[[1,2],[7,1,26],[9,8,5]]})
Dataframe A:
index  vals
0      [1, 2]
1      [5, 4]
2      [7, 1, 26]
3      [-]
4      [9, 8, 5]
Dataframe B:
index  val
4      1
7      8
Result:
index  vals
0      [1, 2]
2      [7, 1, 26]
4      [9, 8, 5]
You can explode your vals column then compute the intersection:
>>> A.loc[A['vals'].explode().isin(B['val']).loc[lambda x: x].index]
index vals
0 0 [1, 2]
2 2 [7, 1, 26]
4 4 [9, 8, 5]
Detail about explode:
>>> A['vals'].explode()
0 1
0 2
1 5
1 4
2 7 # not in B -|
2 1 # in B | -> keep index 2
2 26 # not in B -|
3 -
4 9
4 8
4 5
Name: vals, dtype: object
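If you prefer an explicit boolean mask, the same exploded Series can be reduced back to one flag per row (a sketch equivalent to the above, since explode preserves the original index):
mask = A['vals'].explode().isin(B['val']).groupby(level=0).any()
result = A[mask]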
You can use:
# mask rows based on the intersection between each row's list and the values in B
mask = A['vals'].apply(lambda a: len(set(a) & set(B['val']))) > 0
result = A[mask]
print(result)
Output:
index vals
0 0 [1, 2]
2 2 [7, 1, 26]
4 4 [9, 8, 5]
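A slightly leaner variant of the same idea builds the set of B's values once instead of once per row (a sketch):
b_vals = set(B['val'])
result = A[A['vals'].apply(lambda a: not b_vals.isdisjoint(a))]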
I have two data frames, as below:
Data Frame 1
Data Frame 2
I would like to merge these two data frames into something like the result shown below. I tried to use pd.merge and join as follows:
frames = pd.merge(df1, df2, how='outer', on=['apple_id', 'apple_wgt_colour', 'apple_wgt_no_colour'])
But the result is not what I expected. Can anyone help?
You can do it using concat() and groupby(). First concat the two dataframes, then group by apple_id; because you want to sum the corresponding values from apple_wgt_colour and apple_wgt_no_colour, use agg() to sum them at the end.
# Generate the two example dataframes from the question.
df1 = pd.DataFrame(
{
'apple_id': [1, 2, 3],
'apple_wgt_1': [9, 16, 8],
'apple_wgt_colour': [9, 6, 8],
'apple_wgt_no_colour': [0, 10, 13],
}
)
df2 = pd.DataFrame(
{
'apple_id': [1, 2, 3],
'apple_wgt_2': [9, 16, 8],
'apple_wgt_colour': [9, 6, 8],
'apple_wgt_no_colour': [0, 10, 13],
}
)
print(df1)
print(df2)
apple_id apple_wgt_1 apple_wgt_colour apple_wgt_no_colour
0 1 9 9 0
1 2 16 6 10
2 3 8 8 13
apple_id apple_wgt_2 apple_wgt_colour apple_wgt_no_colour
0 1 9 9 0
1 2 16 6 10
2 3 8 8 13
The following code produces the result you want:
frames = pd.concat([df1, df2]).groupby('apple_id', as_index=False).agg('sum')
# to change column order as you want
frames = frames[['apple_id', 'apple_wgt_1', 'apple_wgt_2', 'apple_wgt_colour', 'apple_wgt_no_colour']]
print(frames)
apple_id apple_wgt_1 apple_wgt_2 apple_wgt_colour apple_wgt_no_colour
0 1 9.0 9.0 18 0
1 2 16.0 16.0 12 20
2 3 8.0 8.0 16 26
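For reference, you can also stay close to your original merge attempt: merge on apple_id alone, then add the overlapping columns together. A minimal sketch (the _x/_y suffixes are pandas' merge defaults):
merged = pd.merge(df1, df2, on='apple_id')
for col in ['apple_wgt_colour', 'apple_wgt_no_colour']:
    merged[col] = merged[f'{col}_x'] + merged[f'{col}_y']
    merged = merged.drop(columns=[f'{col}_x', f'{col}_y'])
print(merged)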
I am trying to get the count of each group in a column, ignoring the null values.
dataset:
A B C
1 null 12
1 xx 13
1 yy 14
2 xx 15
2 yy 16
2 zz 12
3 xx 12
3 null 12
3 null 12
expected output:
A B
1 2
2 3
3 1
Code used: df.groupby(['A'])['B'].apply(lambda x: x.notnull().count())
The pandas library has a built-in pair of functions that can count the number of items in a group created via the .groupby() method. There should be no need for the .apply() method or the lambda in your code.
from pandas import DataFrame
data = [[1, None, 12],
[1, 'xx', 13],
[1, 'yy', 14],
[2, 'xx', 15],
[2, 'yy', 16],
[2, 'zz', 12],
[3, 'xx', 12],
[3, None, 12],
[3, None, 12]]
df = DataFrame(data, columns=['A', 'B', 'C'])
COUNT
Count will count rows that are non-null.
In [11]: df.groupby(['A'])['B'].count()
Out[11]:
A
1 2
2 3
3 1
Name: B, dtype: int64
SIZE
Size, on the other hand, counts all elements of the group, whether they are null or not.
In [12]: df.groupby(['A'])['B'].size()
Out[12]:
A
1 3
2 3
3 3
Name: B, dtype: int64
Try this
df.groupby("A").count()
I'm very new to pandas and I'm trying to sum the elements of the lists in a single column of a dataframe, but I can't find a way to do so.
The dataframe looks something like this:
index codes
0 [19, 19]
1 [3, 4]
2 [20, 5, 3]
3 NaN
4 [1]
5 NaN
6 [14, 2]
What I'm trying to get is:
index codes total
0 [19, 19] 38
1 [3, 4] 7
2 [20, 5, 3] 28
3 NaN 0
4 [1] 1
5 NaN 0
6 [14, 2] 16
However, the values in codes were obtained using str.findall('-(\d+)') on a different column, so they are lists of strings, not lists of ints.
Any help would be much appreciated, thanks.
I would use str.extractall() instead of str.findall():
# replace orig_column with the correct column name
df['total'] = (df['orig_column'].str.extractall(r'-(\d+)')
               .astype(int)
               .groupby(level=0).sum()  # sum the matches for each original row
               .reindex(df.index, fill_value=0)
               )
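For illustration, extractall returns one row per match, with a MultiIndex whose first level is the original row label, which is why the matches are summed with groupby(level=0). A small sketch with a hypothetical source column:
s = pd.Series(['-19-19', '-3-4'])
print(s.str.extractall(r'-(\d+)'))
#           0
#   match
# 0 0      19
#   1      19
# 1 0       3
#   1       4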
If you really want to use your current codes column:
df['total'] = df['codes'].explode().astype(float).groupby(level=0).sum()
Output:
index codes total
0 0 [19, 19] 38
1 1 [3, 4] 7
2 2 [20, 5, 3] 28
3 3 NaN 0
4 4 [1] 1
5 5 NaN 0
6 6 [14, 2] 16
Try df['total'] = df['codes'].apply(lambda x: int(np.nansum(x))) if you want int type output.
Try df['total'] = df['codes'].apply(lambda x: np.nansum(x)) otherwise.
Both assume import numpy as np.
df['total'] = (
    df.codes.apply(lambda x: sum(int(e) for e in x) if isinstance(x, list) else 0)
)
If I want to create a new DataFrame with several columns, I can add all the columns at once -- for example, as follows:
data = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df = pd.DataFrame(data)
But now suppose farther down the road I want to add a set of additional columns to this DataFrame. Is there a way to add them all simultaneously, as in
additional_data = {'col_3': [8, 9, 10, 11],
'col_4': [12, 13, 14, 15]}
#Below is a made-up function of the kind I desire.
df.add_data(additional_data)
I'm aware I could do this:
for key, value in additional_data.items():
    df[key] = value
Or this:
df2 = pd.DataFrame(additional_data, index=df.index)
df = pd.merge(df, df2, left_index=True, right_index=True)
I was just hoping for something cleaner. If I'm stuck with these two options, which is preferred?
Pandas has had the assign method since 0.16.0. You can use it on dataframes like:
In [1506]: df1.assign(**df2)
Out[1506]:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
or you can use the dictionary directly:
In [1507]: df1.assign(**additional_data)
Out[1507]:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
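Note that assign returns a new DataFrame rather than modifying df1 in place, so reassign if you want to keep the result:
df1 = df1.assign(**additional_data)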
What you need is the join function:
df1.join(df2, how='outer')
# or simply:
df1.join(df2)  # this also works
Example:
data = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
df1 = pd.DataFrame(data)
additional_data = {'col_3': [8, 9, 10, 11],
'col_4': [12, 13, 14, 15]}
df2 = pd.DataFrame(additional_data)
df1.join(df2, how='outer')
output:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
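An equivalent sketch with pd.concat, which also aligns on the index:
pd.concat([df1, df2], axis=1)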
If you don't want to create a new DataFrame from additional_data, you can use something like this:
>>> additional_data = [[8, 9, 10, 11], [12, 13, 14, 15]]
>>> df['col3'], df['col4'] = additional_data
>>> df
col_1 col_2 col3 col4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
It's also possible to do something like this, but it creates a new DataFrame rather than modifying the existing one in place:
>>> additional_header = ['col_3', 'col_4']
>>> additional_data = [[8, 9, 10, 11], [12, 13, 14, 15]]
>>> import numpy as np
>>> df = pd.DataFrame(
...     data=np.concatenate((df.values.T, additional_data)).T,
...     columns=np.concatenate((df.columns, additional_header)),
... )
>>> df
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
All you need to do is create the new columns with data from the additional dataframe.
data = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
additional_data = {'col_3': [8, 9, 10, 11],
'col_4': [12, 13, 14, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(additional_data)
df[df2.columns] = df2
df now looks like:
col_1 col_2 col_3 col_4
0 0 4 8 12
1 1 5 9 13
2 2 6 10 14
3 3 7 11 15
Indices from the original dataframe will be used as if you had performed an in-place left join. Data from the original dataframe in columns with a matching name in the additional dataframe will be overwritten.
For example:
data = {'col_1': [0, 1, 2, 3],
'col_2': [4, 5, 6, 7]}
additional_data = {'col_2': [8, 9, 10, 11],
'col_3': [12, 13, 14, 15]}
df = pd.DataFrame(data)
df2 = pd.DataFrame(additional_data, index=[0,1,2,4])
df[df2.columns] = df2
df now looks like:
col_1 col_2 col_3
0 0 8 12
1 1 9 13
2 2 10 14
3 3 NaN NaN
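If you want the additional data matched by position rather than by index label, one sketch is to drop df2's index before assigning:
df[df2.columns] = df2.reset_index(drop=True)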