I want to group my dataframe and concatenate the values/strings from the other columns together. From this:
Year Letter Number Note Text
0 2022 a 1 8 hi
1 2022 b 1 7 hello
2 2022 a 1 6 bye
3 2022 b 3 5 joe
To this:
Column
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
I tried some things with groupby, apply() and agg(), but I can't get it to work:
df.groupby(['Year', 'Letter']).agg(lambda x: '|'.join(x))
Output:
Text
Year Letter
2022 a hi|bye
b hello|joe
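For reference, here is a minimal reconstruction of the example dataframe (the dtypes are assumed from the printout), so the snippets below can be run as-is:
import pandas as pd

df = pd.DataFrame({'Year':   [2022, 2022, 2022, 2022],
                   'Letter': ['a', 'b', 'a', 'b'],
                   'Number': [1, 1, 1, 3],
                   'Note':   [8, 7, 6, 5],
                   'Text':   ['hi', 'hello', 'bye', 'joe']})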
You can first join the values per row (converted to strings with DataFrame.astype and DataFrame.agg), then aggregate with a join in GroupBy.agg:
df1 = (df.assign(Text=df[['Number','Note','Text']].astype(str).agg('|'.join, axis=1))
         .groupby(['Year', 'Letter'])['Text']
         .agg('; '.join)
         .to_frame())
print(df1)
Text
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
Or create a custom lambda function for GroupBy.apply:
f = lambda x: '; '.join('|'.join(y) for y in x.astype(str).to_numpy())
df1 = (df.groupby(['Year', 'Letter'])[['Number','Note','Text']].apply(f)
         .to_frame(name='Text'))
print(df1)
Text
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
If you need to join all columns except the grouping columns:
grouped = ['Year','Letter']
df1 = (df.assign(Text=df[df.columns.difference(grouped, sort=False)]
                       .astype(str).agg('|'.join, axis=1))
         .groupby(grouped)['Text']
         .agg('; '.join)
         .to_frame())
Or, with the lambda approach:
grouped = ['Year','Letter']
f = lambda x: '; '.join('|'.join(y) for y in x.astype(str).to_numpy())
df1 = (df.groupby(grouped)[df.columns.difference(grouped, sort=False)].apply(f)
         .to_frame(name='Text'))
Using groupby.apply:
cols = ['Year', 'Letter']
(df.groupby(cols)
.apply(lambda d: '; '.join(d.drop(columns=cols) # or slice the columns here
.astype(str)
.agg('|'.join, axis=1)))
.to_frame(name='Column')
)
Output:
Column
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
I want to merge several strings in a dataframe based on a groupby in pandas.
This is my code so far:
import pandas as pd
from io import StringIO
data = StringIO("""
"name1","hej","2014-11-01"
"name1","du","2014-11-02"
"name1","aj","2014-12-01"
"name1","oj","2014-12-02"
"name2","fin","2014-11-01"
"name2","katt","2014-11-02"
"name2","mycket","2014-12-01"
"name2","lite","2014-12-01"
""")
# load string as stream into dataframe
df = pd.read_csv(data, header=None, names=["name","text","date"], parse_dates=[2])
# add column with month
df["month"] = df["date"].apply(lambda x: x.month)
I want the end result to look like this:
    name  month         text
0  name1     11       hej,du
1  name1     12        aj,oj
2  name2     11     fin,katt
3  name2     12  mycket,lite
I don't get how I can use groupby and apply some sort of concatenation of the strings in the column "text". Any help appreciated!
You can group by the 'name' and 'month' columns, then call transform, which returns data aligned to the original df, and apply a lambda where we join the text entries:
In [119]:
df['text'] = df[['name','text','month']].groupby(['name','month'])['text'].transform(lambda x: ','.join(x))
df[['name','text','month']].drop_duplicates()
Out[119]:
name text month
0 name1 hej,du 11
2 name1 aj,oj 12
4 name2 fin,katt 11
6 name2 mycket,lite 12
I subset the original df by passing a list of the columns of interest, df[['name','text','month']], and then call drop_duplicates.
EDIT: actually I can just call apply and then reset_index:
In [124]:
df.groupby(['name','month'])['text'].apply(lambda x: ','.join(x)).reset_index()
Out[124]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
Update: the lambda is unnecessary here:
In[38]:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Out[38]:
name month text
0 name1 11 hej,du
1 name1 12 aj,oj
2 name2 11 fin,katt
3 name2 12 mycket,lite
We can group by the 'name' and 'month' columns, then call the agg() function of pandas DataFrame objects.
The aggregation functionality provided by agg() allows multiple statistics to be calculated per group in one calculation, as sketched after the snippet below.
df.groupby(['name', 'month'], as_index=False).agg({'text': ' '.join})
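As a sketch of the "multiple statistics" point, named aggregation can compute several results per group in one agg() call (the output names text and n_rows here are my own):
df.groupby(['name', 'month'], as_index=False).agg(
    text=('text', ' '.join),   # concatenated strings per group
    n_rows=('text', 'size'),   # row count per group
)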
The answer by EdChum provides you with a lot of flexibility, but if you just want to concatenate strings into a column of list objects you can also:
output_series = df.groupby(['name','month'])['text'].apply(list)
If you want to concatenate your "text" in a list:
df.groupby(['name', 'month'], as_index = False).agg({'text': list})
For me the above solutions were close but added some unwanted \n's, so here's a modified version:
df.groupby(['name', 'month'])['text'].apply(lambda text: ''.join(text.to_string(index=False))).str.replace('\\n', '', regex=True).reset_index()
Try this line of code:
df.groupby(['name','month'])['text'].apply(','.join).reset_index()
Although this is an old question, just in case: I used the code below and it seems to work like a charm.
text = ''.join(df[df['date'].dt.month==8]['text'])
Thanks to all the other answers, the following is probably the most concise and feels more natural. Using df.groupby("X")["A"].agg() aggregates over one or many selected columns.
df = pd.DataFrame({'A': ['a', 'a', 'b', 'c', 'c'],
                   'B': ['i', 'j', 'k', 'i', 'j'],
                   'X': [1, 2, 2, 1, 3]})
A B X
a i 1
a j 2
b k 2
c i 1
c j 3
df.groupby("X", as_index=False)["A"].agg(' '.join)
X A
1 a c
2 a b
3 c
df.groupby("X", as_index=False)[["A", "B"]].agg(' '.join)
X A B
1 a c i i
2 a b j k
3 c j
I am trying to convert a dataframe that has lists of various sizes, for example something like this:
d={'A':[1,2,3],'B':[[1,2,3],[3,5],[4]]}
df = pd.DataFrame(data=d)
df
to something like this:
d1={'A':[1,2,3],'B-1':[1,0,0],'B-2':[1,0,0],'B-3':[1,1,0],'B-4':[0,0,1],'B-5':[0,1,0]}
df1 = pd.DataFrame(data=d1)
df1
Thank you for the help
explode the lists, then get_dummies and sum over the original index (use max [credit to @JonClements] if you want true 0/1 dummies rather than counts, in case a value can appear more than once in a list). Then join the result back:
dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).sum().add_prefix('B-')
# .sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
#dfB = pd.get_dummies(df['B'].explode()).groupby(level=0).max().add_prefix('B-')
df = pd.concat([df['A'], dfB], axis=1)
# A B-1 B-2 B-3 B-4 B-5
#0 1 1 1 1 0 0
#1 2 0 0 1 0 1
#2 3 0 0 0 1 0
You can use pop to remove the column you explode, so you don't need to specify df[list_of_all_columns_except_B] in the concat:
df = pd.concat([df, pd.get_dummies(df.pop('B').explode()).groupby(level=0).sum().add_prefix('B-')],
               axis=1)
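For what it's worth, a minimal alternative sketch using pd.crosstab (assuming the original df with the 'B' column still intact) produces the same counts:
# explode the lists, then cross-tabulate the original row index against the values
s = df['B'].explode()
dfB = pd.crosstab(s.index, s).add_prefix('B-')
out = pd.concat([df['A'], dfB], axis=1)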
I am validating my dataframe with the code below:
df = df[(df[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)) &
((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4)) &
(df[['network_url', 'formulary_url', 'sbc_download_url', 'treatment_cost_calculator_url']].astype(str).apply(lambda x: (x.str.contains(r'\A(https?:\/\/)([a-zA-Z0-9\-_])*(\.)*([a-zA-Z0-9\-]+)\.([a-zA-Z\.]{2,5})(\.*.*)?\Z')) | x.isin(['nan'])).all(axis=1)) &
(df[['promotional_label']].astype(str).apply(lambda x: (x.str.len() <= 65) | x.isin(['nan'])).all(axis=1)) &
# (df[['sort_rank_override']].astype(str).apply(lambda x: (x.str.isdigit()) | x.isin(['nan'])).all(axis=1)) &
((df['hios_plan_identifier'].notnull()) & (df['hios_plan_identifier'].str.len() >= 10) & (df['hios_plan_identifier'].str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z'))) &
(df['type'].isin(['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan'])) &
(df['price_period'].isin(['Monthly', 'Yearly'])) &
(df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']))]
# (df[['composite_rating']].astype(str).apply(lambda x: (x.str.isin(['True', 'False']) & x.isnotin(['nan'])).all(axis=1)))]
This throws
KeyError: "['name'] not in index"
when a column is not present in my dataframe. I need to handle this for all the columns. How can I efficiently add a check so that each validation runs only when its column is present?
You can use intersection:
L = ['name', 'issuer_id', 'service_area_id']
cols = df.columns.intersection(L)
(df[cols].notnull().all(axis=1))
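One detail worth showing in a sketch: if none of the listed columns exist, df[cols] selects zero columns and all(axis=1) returns True for every row, so the mask degrades gracefully (df_valid is a hypothetical name):
L = ['name', 'issuer_id', 'service_area_id']
cols = df.columns.intersection(L)
mask = df[cols].notnull().all(axis=1)  # all-True Series when cols is empty
df_valid = df[mask]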
EDIT:
df = pd.DataFrame({
'name':list('abcdef'),
'plan_year':[2015,2015,2015,5,5,4],
})
print (df)
name plan_year
0 a 2015
1 b 2015
2 c 2015
3 d 5
4 e 5
5 f 4
The idea is to first create a dictionary of valid values for each column:
valid = {'name':'a',
'issuer_id':'a',
'service_area_id':'a',
'plan_year':2015,
...}
Then filter the dictionary down to the columns missing from the DataFrame, assign them to the original DataFrame, and create a new one:
d1 = {k: v for k, v in valid.items() if k not in df.columns}
print (d1)
{'issuer_id': 'a', 'service_area_id': 'a'}
df1 = df.assign(**d1)
print (df1)
name plan_year issuer_id service_area_id
0 a 2015 a a
1 b 2015 a a
2 c 2015 a a
3 d 5 a a
4 e 5 a a
5 f 4 a a
Finally, filter:
m1 = (df1[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1))
m2 = ((df1['plan_year'].notnull()) &
(df1['plan_year'].astype(str).str.isdigit()) &
(df1['plan_year'].astype(str).str.len() == 4))
df1 = df1[m1 & m2]
print (df1)
name plan_year issuer_id service_area_id
0 a 2015 a a
1 b 2015 a a
2 c 2015 a a
Lastly, you can remove the helper columns:
df1 = df1[m1 & m2].drop(columns=list(d1))
print (df1)
name plan_year
0 a 2015
1 b 2015
2 c 2015
Add another variable called columns and filter it to the ones that exist in df:
columns = ['name', 'issuer_id', 'service_area_id']
existing = [i for i in columns if i in df.columns]
df = df[(df[existing]...
EDIT
You could also assign each condition to a variable and use it later like this:
cond1 = df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']) if 'is_age_29_plan' in df.columns else True
Then, use the cond1 in your filtering statement.
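Generalizing that idea, here is a minimal sketch (the helper name mask_if_present is hypothetical) that runs each check only when its column exists and otherwise falls back to an all-True mask:
import pandas as pd

def mask_if_present(df, col, check):
    # apply the check to the column if present, else pass every row
    if col in df.columns:
        return check(df[col])
    return pd.Series(True, index=df.index)

m_year = mask_if_present(df, 'plan_year',
                         lambda s: s.notnull() & s.astype(str).str.isdigit())
m_type = mask_if_present(df, 'type',
                         lambda s: s.isin(['MetalPlan', 'MedicarePlan']))
df = df[m_year & m_type]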
I have a dataframe such as:
label column1
a 1
a 2
b 6
b 4
I would like to make a dataframe with a new column that holds, for each row, the value of column1 from the other row with the same label. Such as:
label column1 column2
a 1 2
a 2 1
b 6 4
b 4 6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a','a','b','b'],
                  'column1': [1,2,6,4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop the index added by apply
print(y)
You can try the code block below:
# create the Dataframe
df = pd.DataFrame({'label': ['a','a','b','b'],
                   'column1': [1,2,6,4]})
# group by label, taking the first and last row of each group
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()
# concat those groups to create column2
df2 = (pd.concat([b, a])
         .sort_values(by='label')
         .rename(columns={'column1':'column2'})
         .reset_index(drop=True))
# merge with the original Dataframe on the index
# (passing on='label' together with left_index/right_index raises a MergeError)
df = df.merge(df2, left_index=True, right_index=True,
              suffixes=('', '_y'))[['label','column1','column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# create dataframe
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})
# iterate over the dataframe, identify the matching label and the other value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # set the value on the new column (set_value was removed in pandas 1.0; use .at)
    df.at[index, 'column2'] = newvalue
df.head()
You can use groupby with apply to create a new Series in reverse order:
df['column2'] = df.groupby('label')["column1"] \
.apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
column1 label column2
0 1 a 2
1 2 a 1
2 6 b 4
3 4 b 6
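As a side note, a shorter variant of the same idea (my own sketch, not from the answers above) uses transform, which aligns the reversed values back to the original index automatically:
# reverse column1 within each label group
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.to_numpy()[::-1])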