Pandas dataframe groupby - python

I am a beginner in Pandas, so please bear with me. I know this is a very basic question.
I am working with pandas on the following dataframe:
x y w
1 2 5
1 2 7
3 4 3
5 4 8
3 4 5
5 9 9
And I want the following output:
x y w
1 2 5,7
3 4 3,5
5 4 8
5 9 9
Can anyone tell me how to do this using pandas groupby?

You can use groupby with apply and ','.join:
# if the dtype of column w is not string, convert it first
print(type(df.at[0,'w']))
<class 'numpy.int64'>
df['w'] = df['w'].astype(str)
print(df.groupby(['x','y'])['w'].apply(','.join).reset_index())
x y w
0 1 2 5,7
1 3 4 3,5
2 5 4 8
3 5 9 9
If you have duplicates, use drop_duplicates:
print(df)
x y w
0 1 2 5
1 1 2 5
2 1 2 5
3 1 2 7
4 3 4 3
5 5 4 8
6 3 4 5
7 5 9 9
df['w'] = df['w'].astype(str)
print(df.groupby(['x','y'])['w'].apply(lambda x: ','.join(x.drop_duplicates())).reset_index())
x y w
0 1 2 5,7
1 3 4 3,5
2 5 4 8
3 5 9 9
Or a modified version of EdChum's solution:
print(df.groupby(['x','y'])['w'].apply(lambda x: ','.join(x.astype(str).drop_duplicates())).reset_index())
x y w
0 1 2 5,7
1 3 4 3,5
2 5 4 8
3 5 9 9
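
For reference, here is a minimal, self-contained script (column names and values taken from the question) that reproduces the grouped join end to end; the intermediate assign is my own spelling, not part of the answer above:
import pandas as pd

# sample data from the question
df = pd.DataFrame({'x': [1, 1, 3, 5, 3, 5],
                   'y': [2, 2, 4, 4, 4, 9],
                   'w': [5, 7, 3, 8, 5, 9]})

# cast w to string so ','.join can concatenate, then join within each (x, y) group
out = (df.assign(w=df['w'].astype(str))
         .groupby(['x', 'y'])['w']
         .apply(','.join)
         .reset_index())
print(out)
#    x  y    w
# 0  1  2  5,7
# 1  3  4  3,5
# 2  5  4    8
# 3  5  9    9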

You can groupby on columns 'x' and 'y' and apply a lambda on the 'w' column; if required, cast the dtype using astype:
In [220]:
df.groupby(['x','y'])['w'].apply(lambda x: ','.join(x.astype(str)))
Out[220]:
x y
1 2 5,7
3 4 3,5
5 4 8
9 9
Name: w, dtype: object
In [221]:
df.groupby(['x','y'])['w'].apply(lambda x: ','.join(x.astype(str))).reset_index()
Out[221]:
x y w
0 1 2 5,7
1 3 4 3,5
2 5 4 8
3 5 9 9
EDIT
On your modified sample:
In [237]:
df.groupby(['x','y'])['w'].apply(lambda x: ','.join(x.unique().astype(str))).reset_index()
Out[237]:
x y w
0 1 2 5,7
1 3 4 3,5
2 5 4 8
3 5 9 9
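A note on the design choice: agg also accepts a plain callable per group, so the same result can be written with agg instead of apply. This alternative spelling is my own addition, not from the answers above:
# agg with a callable; astype(str) applied inside so it works on the numeric column
df.groupby(['x', 'y'])['w'].agg(lambda s: ','.join(s.astype(str))).reset_index()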

Related

Pandas: get rows with consecutive column values and add a counter row

I need to go through a large DataFrame and select consecutive rows with matching values in a column, e.g. in the DataFrame below, selecting on column x:
col row x y
1 1 1 1
2 2 2 2
6 3 3 8
9 2 3 4
5 3 3 9
4 9 4 4
5 5 5 1
3 7 5 2
6 6 6 6
The results output would be:
col row x y
6 3 3 8
9 2 3 4
5 3 3 9
5 5 5 1
3 7 5 2
Not sure how to do this.
IIUC, use boolean indexing with a mask of the consecutive values:
m = df['x'].eq(df['x'].shift())
df[m|m.shift(-1, fill_value=False)]
Output:
col row x y
2 6 3 3 8
3 9 2 3 4
4 5 3 3 9
6 5 5 5 1
7 3 7 5 2
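
A minimal runnable sketch of the same idea (data taken from the question), with the two halves of the mask spelled out:
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 6, 9, 5, 4, 5, 3, 6],
                   'row': [1, 2, 3, 2, 3, 9, 5, 7, 6],
                   'x':   [1, 2, 3, 3, 3, 4, 5, 5, 6],
                   'y':   [1, 2, 8, 4, 9, 4, 1, 2, 6]})

# m is True where x repeats the previous row's value
m = df['x'].eq(df['x'].shift())
# keep a row if it repeats the previous value (m) or is repeated by the next one
print(df[m | m.shift(-1, fill_value=False)])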

How to add multiindex columns to existing df, preserving original index

I start with:
df
0 1 2 3 4
0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
and want to end up with:
df
0 1 2 3 4
A B C
1 2 0 5 0 0 2 6
1 9 6 5 8 6
2 8 9 4 2 1
3 2 5 8 9 6
4 8 8 8 0 8
where A and B are known after df creation, and C is the original
index of the df.
MWE:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(10, size=(5, 5)))
df_a = 1
df_b = 2
breakpoint()
What I have in mind, but it gives an unhashable type error:
df.reindex([df_a, df_b, df.index])
Try with pd.MultiIndex.from_product:
df.index = pd.MultiIndex.from_product(
    [[df_a], [df_b], df.index], names=['A', 'B', 'C'])
df
Out[682]:
0 1 2 3 4
A B C
1 2 0 7 0 1 9 9
1 0 4 7 3 2
2 7 2 0 0 4
3 5 5 6 8 4
4 1 4 9 8 1
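
An alternative worth knowing (my suggestion, not from the answer above): pd.concat with a tuple key prepends the two outer levels in one call, and names can also label the preserved original index as C:
# the tuple key becomes levels A and B; the original index survives as level C
df = pd.concat([df], keys=[(df_a, df_b)], names=['A', 'B', 'C'])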

I want to generate a new column in a pandas dataframe, counting "edges" in another column

I have a dataframe looking like this:
A B....X
1 1 A
2 2 B
3 3 A
4 6 K
5 7 B
6 8 L
7 9 M
8 1 N
9 7 B
1 6 A
7 7 A
That is, some "rising edges" occur from time to time in column X (in this example the edge is X == 'B').
What I need is a new column Y which increments every time the value B occurs in X:
A B....X Y
1 1 A 0
2 2 B 1
3 3 A 1
4 6 K 1
5 7 B 2
6 8 L 2
7 9 M 2
8 1 N 2
9 7 B 3
1 6 A 3
7 7 A 3
In SQL I would use some trick like sum(case when x=B then 1 else 0) over ... rows between first and previous. How can I do this in Pandas?
Use cumsum:
df['Y'] = (df.X == 'B').cumsum()
Out[8]:
A B X Y
0 1 1 A 0
1 2 2 B 1
2 3 3 A 1
3 4 6 K 1
4 5 7 B 2
5 6 8 L 2
6 7 9 M 2
7 8 1 N 2
8 9 7 B 3
9 1 6 A 3
10 7 7 A 3
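
One caveat, which is my addition rather than part of the answer: cumsum as written increments on every 'B' row. If 'B' can appear on consecutive rows and only the transition into 'B' should count as a rising edge, compare with the previous row first:
# count only transitions into 'B'; a run of consecutive B's counts once
is_b = df['X'].eq('B')
df['Y'] = (is_b & ~is_b.shift(fill_value=False)).cumsum()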

pandas select FROM .... TO

I have a dataframe
C V S D LOC
1 2 3 4 X
5 6 7 8
1 2 3 4
5 6 7 8 Y
9 10 11 12
How can I select the rows from LOC X to Y and export them to another CSV?
Use idxmax to get the first index where each condition is True:
df = df.loc[(df['LOC'] == 'X').idxmax():(df['LOC'] == 'Y').idxmax()]
print (df)
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
In [133]: df.loc[df.index[df.LOC=='X'][0]:df.index[df.LOC=='Y'][0]]
Out[133]:
C V S D LOC
0 1 2 3 4 X
1 5 6 7 8 NaN
2 1 2 3 4 NaN
3 5 6 7 8 Y
PS: this will select all rows between the first occurrence of X and the first occurrence of Y.
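To cover the second half of the question, writing the slice to another CSV is a one-liner with to_csv; the filename here is only a placeholder:
# export the selected slice; 'selected_rows.csv' is an illustrative name
selected = df.loc[(df['LOC'] == 'X').idxmax():(df['LOC'] == 'Y').idxmax()]
selected.to_csv('selected_rows.csv', index=False)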

Converting one to many mapping dictionary to Dataframe

I have a dictionary as follows:
d = {1: array([2, 3]), 2: array([8, 4, 5]), 3: array([6, 7, 8, 9])}
As depicted, here the values for each key are variable length arrays.
Now I want to convert it to DataFrame. So the output looks like:
A B
1 2
1 3
2 8
2 4
2 5
3 6
3 7
3 8
3 9
I used pd.DataFrame(d), but it does not handle one-to-many mapping. Any help would be appreciated.
Use the Series constructor with str.len to get the lengths of the arrays, then build a new DataFrame with numpy.repeat, numpy.concatenate and Index.values:
import numpy as np
import pandas as pd

d = {1: np.array([2, 3]), 2: np.array([8, 4, 5]), 3: np.array([6, 7, 8, 9])}
a = pd.Series(d)
l = a.str.len()
df = pd.DataFrame({'A': np.repeat(a.index.values, l), 'B': np.concatenate(a.values)})
print(df)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
pd.DataFrame(
    [[k, v] for k, a in d.items() for v in a.tolist()],
    columns=['A', 'B']
)
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
Setup
d = {1: np.array([2,3]), 2: np.array([8,4,5]), 3: np.array([6,7,8,9])}
Here's my version:
(pd.DataFrame.from_dict(d, orient='index').rename_axis('A')
   .stack()
   .reset_index(name='B')
   .drop('level_1', axis=1)
   .astype('int'))
Out[63]:
A B
0 1 2
1 1 3
2 2 8
3 2 4
4 2 5
5 3 6
6 3 7
7 3 8
8 3 9
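
On pandas 0.25 or newer, Series.explode gives a shorter equivalent; a sketch under that version assumption:
# requires pandas >= 0.25 for Series.explode
out = (pd.Series(d)
         .explode()
         .rename_axis('A')
         .reset_index(name='B')
         .astype(int))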
