Aggregating string columns using pandas GroupBy - python

I have a DF such as the following:
df =
vid pos value sente
1 a A 21
2 b B 21
3 b A 21
3 a A 21
1 d B 22
1 a C 22
1 a D 22
2 b A 22
3 a A 22
Now I want to combine all rows with the same values for sente and vid into one row, with the entries in value joined by a space (" "):
df2 =
vid pos value sente
1 a A 21
2 b B 21
3 b a A A 21
1 d a a B C D 22
2 b A 22
3 a A 22
I suppose a modification of this should do the trick:
df2 = df.groupby("sente").agg(lambda x: " ".join(x))
But I can't seem to figure out how to add the second column to the statement.

Groupers can be passed as a list. Furthermore, you can simplify your solution a bit by dropping the lambda; it isn't needed.
df.groupby(['vid', 'sente'], as_index=False, sort=False).agg(' '.join)
vid sente pos value
0 1 21 a A
1 2 21 b B
2 3 21 b a A A
3 1 22 d a a B C D
4 2 22 b A
5 3 22 a A
Some other notes: specifying as_index=False means your groupers will be present as columns in the result (rather than as the index, which is the default). Furthermore, sort=False keeps the groups in their original order of appearance instead of sorting by the group keys.
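For comparison, a minimal self-contained sketch (reconstructing the example frame from the question) showing the effect of those two flags:

import pandas as pd

df = pd.DataFrame({
    'vid':   [1, 2, 3, 3, 1, 1, 1, 2, 3],
    'pos':   ['a', 'b', 'b', 'a', 'd', 'a', 'a', 'b', 'a'],
    'value': ['A', 'B', 'A', 'A', 'B', 'C', 'D', 'A', 'A'],
    'sente': [21, 21, 21, 21, 22, 22, 22, 22, 22],
})

# Default: the groupers become a (sorted) MultiIndex
out_default = df.groupby(['vid', 'sente']).agg(' '.join)

# as_index=False keeps vid/sente as columns; sort=False keeps the original group order
out_flat = df.groupby(['vid', 'sente'], as_index=False, sort=False).agg(' '.join)
print(out_flat)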

As of this edit, @cᴏʟᴅsᴘᴇᴇᴅ's answer is way better.
Fun way! This only works because the values are single characters.
df.set_index(['sente', 'vid']).sum(level=[0, 1]).applymap(' '.join).reset_index()
sente vid pos value
0 21 1 a A
1 21 2 b B
2 21 3 b a A A
3 22 1 d a a B C D
4 22 2 b A
5 22 3 a A
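The single-character restriction comes from how this works: .sum() on strings concatenates them, and ' '.join then puts a space between every character of the concatenated result:

' '.join('b' + 'a')     # -> 'b a'
' '.join('bb' + 'aa')   # -> 'b b a a'  (spaces end up inside multi-character values)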
somewhat ok answer
df.set_index(['sente', 'vid']).groupby(level=[0, 1]).apply(
lambda d: pd.Series(d.to_dict('l')).str.join(' ')
).reset_index()
sente vid pos value
0 21 1 a A
1 21 2 b B
2 21 3 b a A A
3 22 1 d a a B C D
4 22 2 b A
5 22 3 a A
not recommended
df.set_index(['sente', 'vid']).add(' ') \
.sum(level=[0, 1]).applymap(str.strip).reset_index()
sente vid pos value
0 21 1 a A
1 21 2 b B
2 21 3 b a A A
3 22 1 d a a B C D
4 22 2 b A
5 22 3 a A
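This last variant, by contrast, does handle multi-character values: a trailing space is appended to every string before summing, and str.strip removes the one left at the end, roughly:

('bb ' + 'aa ').strip()   # -> 'bb aa'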

Related

Mark repeated id with a-b relationship in dataframe

I'm trying to create a relationship between repeated IDs in a dataframe. Take 91, for example: it is repeated 4 times, so for the first 91 entry the first column should be set to A and the other column to B; for the next 91 row, first becomes B and other becomes C; then C and D, and so on. The same relationship should hold for all duplicated IDs.
For IDs that are not repeated, first will be marked as A.
id  first  other
11  0      0
09  0      0
91  0      0
91  0      0
91  0      0
91  0      0
15  0      0
15  0      0
12  0      0
01  0      0
01  0      0
01  0      0
Expected output:
id  first  other
11  A      0
09  A      0
91  A      B
91  B      C
91  C      D
91  D      E
15  A      B
15  B      C
12  A      0
01  A      B
01  B      C
01  C      D
I'm using df.iterrows() for this, but the code is getting very messy and will be slow as the dataset grows. Is there an easier way of doing it?
You can perform the mapping by using a per-group cumcount as the source:
from string import ascii_uppercase
# mapping dictionary
# this is an example, you can use any mapping
d = dict(enumerate(ascii_uppercase))
# {0: 'A', 1: 'B', 2: 'C'...}
g = df.groupby('id')
c = g.cumcount()
m = g['id'].transform('size').gt(1)
df['first'] = c.map(d)
df.loc[m, 'other'] = c[m].add(1).map(d)
Output:
id first other
0 11 A 0
1 9 A 0
2 91 A B
3 91 B C
4 91 C D
5 91 D E
6 15 A B
7 15 B C
8 12 A 0
9 1 A B
10 1 B C
11 1 C D
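A self-contained version of the above (a sketch; the frame is rebuilt from the question, with the ids kept as strings so the leading zeros survive):

import pandas as pd
from string import ascii_uppercase

df = pd.DataFrame({'id': ['11', '09', '91', '91', '91', '91',
                          '15', '15', '12', '01', '01', '01'],
                   'first': 0, 'other': 0})

d = dict(enumerate(ascii_uppercase))     # {0: 'A', 1: 'B', 2: 'C', ...}
g = df.groupby('id')
c = g.cumcount()                         # 0, 1, 2, ... within each id
m = g['id'].transform('size').gt(1)      # True only for ids that repeat

df['first'] = c.map(d)                   # A, B, C, ... per occurrence
df.loc[m, 'other'] = c[m].add(1).map(d)  # the next letter, for repeated ids only
print(df)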
Given:
id
0 12
1 9
2 91
3 91
4 91
5 91
6 15
7 15
8 12
9 1
10 1
11 1
Doing:
# Count ids per group
df['first'] = df.groupby('id').cumcount()
# convert to letters and make other col
m = df.groupby('id').filter(lambda x: len(x)>1).index
df.loc[m, 'other'] = df['first'].add(66).apply(chr)
df['first'] = df['first'].add(65).apply(chr)
# fill in missing with 0
df['other'] = df['other'].fillna(0)
Output:
id first other
0 11 A 0
1 9 A 0
2 91 A B
3 91 B C
4 91 C D
5 91 D E
6 15 A B
7 15 B C
8 12 A 0
9 1 A B
10 1 B C
11 1 C D

How do I Add a sequence column to a Dask dataframe?

I have the following dask dataframe
a b c
1 a 30
1 a 11
2 b 99
2 b 55
3 c 21
4 d 21
I want to sequence the duplicate rows based on the size of each row's c field. Below is example output:
a b c seq
1 a 30 2
1 a 11 1
2 b 99 2
2 b 55 1
3 c 21 1
4 d 21 1
Is there an easy way to do this in dask?
Before you ask, I'm replicating an existing process and I don't know why the duplicate rows are sequenced using the c field.
Try with rank
df['new'] = df.groupby('a')['c'].rank().astype(int)
Out[29]:
0 2
1 1
2 2
3 1
4 1
5 1
Name: c, dtype: int32
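To sanity-check the ranking logic on a small pandas frame first (a sketch; whether the identical expression runs unchanged on a dask dataframe depends on your dask version):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 2, 3, 4],
                   'b': ['a', 'a', 'b', 'b', 'c', 'd'],
                   'c': [30, 11, 99, 55, 21, 21]})

# rank rows within each 'a' group by the value of 'c'
df['seq'] = df.groupby('a')['c'].rank().astype(int)
print(df)   # seq is 2, 1, 2, 1, 1, 1, as in the expected output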

Pandas: Add rows in the groups of a dataframe

I have a data frame as follows:
df = pd.DataFrame({"date": [1,2,5,6,2,3,4,5,1,3,4,5,6,1,2,3,4,5,6],
"variable": ["A","A","A","A","B","B","B","B","C","C","C","C","C","D","D","D","D","D","D"]})
date variable
0 1 A
1 2 A
2 5 A
3 6 A
4 2 B
5 3 B
6 4 B
7 5 B
8 1 C
9 3 C
10 4 C
11 5 C
12 6 C
13 1 D
14 2 D
15 3 D
16 4 D
17 5 D
18 6 D
In this data frame, there are 4 values in the variable column: A, B, C, D. My goal is for each variable to have all of the dates 1 to 6 in the date column.
But currently, a few values in the date column are missing for some variables. I tried grouping them and filling each value with a counter, but sometimes more than one date is missing (for example, in variable A the dates 3 and 4 are missing). Also, the counter made my code terribly slow, as I have a couple of thousand rows.
Is there a faster and smarter way to do this without using a counter?
The desired output should be as follows:
date variable
0 1 A
1 2 A
2 3 A
3 4 A
4 5 A
5 6 A
6 1 B
7 2 B
8 3 B
9 4 B
10 5 B
11 6 B
12 1 C
13 2 C
14 3 C
15 4 C
16 5 C
17 6 C
18 1 D
19 2 D
20 3 D
21 4 D
22 5 D
23 6 D
itertools.product
from itertools import product
pd.DataFrame([*product(
range(df.date.min(), df.date.max() + 1),
sorted({*df.variable})
)], columns=df.columns)
date variable
0 1 A
1 1 B
2 1 C
3 1 D
4 2 A
5 2 B
6 2 C
7 2 D
8 3 A
9 3 B
10 3 C
11 3 D
12 4 A
13 4 B
14 4 C
15 4 D
16 5 A
17 5 B
18 5 C
19 5 D
20 6 A
21 6 B
22 6 C
23 6 D
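If the row order from the desired output matters (all of A's dates first, then B's, and so on), the result can be sorted afterwards; assuming the frame built above is stored in out:

out = out.sort_values(['variable', 'date'], ignore_index=True)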
Using groupby + reindex
(df.groupby('variable', as_index=False)
   .apply(lambda g: g.set_index('date').reindex([1, 2, 3, 4, 5, 6]).ffill().bfill())
   .reset_index(level=1))
Output:
date variable
0 1 A
0 2 A
0 3 A
0 4 A
0 5 A
0 6 A
1 1 B
1 2 B
1 3 B
1 4 B
1 5 B
1 6 B
2 1 C
2 2 C
2 3 C
2 4 C
2 5 C
2 6 C
3 1 D
3 2 D
3 3 D
3 4 D
3 5 D
3 6 D
This is more of a workaround, but it should work:
df.groupby(by=['variable']).agg({'date': lambda s: list(range(1, 7))}).explode('date').reset_index()

how to subtract all pandas dataframe elements with each other easier way?

let's say I have a dataframe like this
name time
a 10
b 30
c 11
d 13
now I want a new dataframe like this
name1 name2 time_diff
a a 0
a b -20
a c -1
a d -3
b a 20
b b 0
b c 19
b d 17
.....
.....
d d 0
Nested for loops or a lambda function can be used, but once the number of elements goes above 200 the loops take too long to finish; I always end up interrupting the process. Does someone know a pandas way, or something quicker and easier? The shape of my dataframe is 1600x2.
Solution with itertools:
import itertools
d=pd.DataFrame(list(itertools.product(df.name,df.name)),columns=['name1','name2'])
dic = dict(zip(df.name,df.time))
d['time_diff']=d.name1.map(dic)-d.name2.map(dic)
print(d)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
First build a cross join using merge with a helper column, then compute the difference and select only the necessary columns:
df = df.assign(A=1)
df = pd.merge(df, df, on='A', suffixes=('1','2'))
df['time_diff'] = df['time1'] - df['time2']
df = df[['name1','name2','time_diff']]
print (df)
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
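As a side note, in pandas 1.2 and later the helper column is not needed, because merge supports a cross join directly; the same idea, sketched (assuming df is still the original name/time frame):

out = pd.merge(df, df, how='cross', suffixes=('1', '2'))
out['time_diff'] = out['time1'] - out['time2']
out = out[['name1', 'name2', 'time_diff']]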
Another solution with MultiIndex.from_product and reindex by first and second level:
df = df.set_index('name')
mux = pd.MultiIndex.from_product([df.index, df.index], names=['name1','name2'])
df = (df['time'].reindex(mux, level=0)
.sub(df.reindex(mux, level=1)['time'])
.rename('time_diff')
.reset_index())
Another way would be df.apply:
df=pd.DataFrame({'col':['a','b','c','d'],'col1':[10,30,11,13]})
index = pd.MultiIndex.from_product([df['col'], df['col']], names = ["name1", "name2"])
res=pd.DataFrame(index = index).reset_index()
res['time_diff']=df.apply(lambda x: x['col1']-df['col1'],axis=1).values.flatten()
O/P:
name1 name2 time_diff
0 a a 0
1 a b -20
2 a c -1
3 a d -3
4 b a 20
5 b b 0
6 b c 19
7 b d 17
8 c a 1
9 c b -19
10 c c 0
11 c d -2
12 d a 3
13 d b -17
14 d c 2
15 d d 0
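For larger inputs (the question mentions a 1600x2 frame), a NumPy broadcasting variant is another option; a sketch on the same example data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'], 'time': [10, 30, 11, 13]})

t = df['time'].to_numpy()
names = df['name'].to_numpy()
n = len(df)

out = pd.DataFrame({
    'name1': np.repeat(names, n),                     # a a a a b b b b ...
    'name2': np.tile(names, n),                       # a b c d a b c d ...
    'time_diff': (t[:, None] - t[None, :]).ravel(),   # all pairwise differences
})
print(out)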

Python convert variables to cases

I'm trying to transform a DataFrame from this
id track var1 text1 var1 text2
1 1 10 a 11 b
2 1 17 b 19 c
3 2 20 c 33 d
Into this:
id track var text
1 1 10 a
1 1 11 b
2 1 17 b
2 1 19 c
3 2 20 c
3 2 33 d
I'm trying pandas' stack() method, but it seems to treat every column as a value column and does not keep the fixed ones (i.e. id and track).
Any ideas?
Try with wide_to_long
df.columns=['id','track','var1','text1','var2','text2']
pd.wide_to_long(df,['var','text'],i=['id','track'],j='drop').reset_index(level=[0,1])
Out[238]:
id track var text
drop
1 1 1 10 a
2 1 1 11 b
1 2 1 17 b
2 2 1 19 c
1 3 2 20 c
2 3 2 33 d
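A self-contained version of the same approach (a sketch; as in the answer, the duplicated column names from the question are replaced up front):

import pandas as pd

df = pd.DataFrame([[1, 1, 10, 'a', 11, 'b'],
                   [2, 1, 17, 'b', 19, 'c'],
                   [3, 2, 20, 'c', 33, 'd']],
                  columns=['id', 'track', 'var1', 'text1', 'var2', 'text2'])

out = (pd.wide_to_long(df, ['var', 'text'], i=['id', 'track'], j='drop')
         .sort_index()                 # group the rows back by id/track
         .reset_index(level=[0, 1])
         .reset_index(drop=True))
print(out)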
