If I have a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['A', 'B', 'C', 'D'],
... ['E', 'B', 'C']
... ])
>>> df
0 1 2 3
0 A B C D
1 E B C None
>>>
I would like to transform the dataframe to a two-column format:
x, y
-----
A, B
B, C
C, D
E, B
B, C
For each row, going from left to right, take each pair of neighboring values and make a pair of them.
It is a kind of from-to relation if you consider each row as a path.
How can I do this transformation?
We can do explode with zip:
s = pd.DataFrame(
    df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1)
      .explode()
      .tolist()
)
Out[336]:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
Update
s = df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1).explode()
s = pd.DataFrame(s.tolist(), index=s.index)
s
Out[340]:
0 1
0 A B
0 B C
0 C D
1 E B
1 B C
Pre-preparing the data could help too:
import pandas as pd
inp = [['A', 'B', 'C', 'D'],
['E', 'B', 'C']]
# Convert beforehand: pair up neighbors within each row
inp2 = [[[i[k], i[k + 1]] for k in range(len(i) - 1)] for i in inp]
inp2 = inp2[0] + inp2[1]  # flatten the two rows into one list
df = pd.DataFrame(inp2)
print(df)
Output:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
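The same zip trick works on the plain lists before ever building a DataFrame; a minimal sketch:

```python
import pandas as pd

rows = [['A', 'B', 'C', 'D'],
        ['E', 'B', 'C']]

# zip a row with itself shifted by one to get the (from, to) pairs
pairs = [pair for row in rows for pair in zip(row, row[1:])]
df = pd.DataFrame(pairs, columns=['x', 'y'])
print(df)
```

Because the pairing happens on the raw lists, no dropna is needed for the ragged rows.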
Related
I have a df in the following form
import pandas as pd
df = pd.DataFrame({'col1' : [1,1,1,2,2,3,3,4],
'col2' : ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3' : ['x', 'y', 'z', 'p','q','r','s','t']
})
col1 col2 col3
0 1 a x
1 1 b y
2 1 c z
3 2 a p
4 2 b q
5 3 a r
6 3 b s
7 4 a t
df2 = df.groupby(['col1','col2'])['col3'].sum()
df2
col1 col2
1 a x
b y
c z
2 a p
b q
3 a r
b s
4 a t
Now I want to add padded 0 rows to each col1 index where any of a, b, c, d is missing, so the expected output should be:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Use unstack + reindex + stack:
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
out:
col1 col2
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
dtype: object
Complete Working Example:
import pandas as pd
df = pd.DataFrame({
'col1': [1, 1, 1, 2, 2, 3, 3, 4],
'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't']
})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()
out = (
df2.unstack(fill_value=0)
.reindex(columns=['a', 'b', 'c', 'd'], fill_value=0)
.stack()
)
print(out)
Here's another way using pd.MultiIndex.from_product, then reindex:
mindx = pd.MultiIndex.from_product([df2.index.levels[0], [*'abcd']])
df2.reindex(mindx, fill_value=0)
Output:
col1
1 a x
b y
c z
d 0
2 a p
b q
c 0
d 0
3 a r
b s
c 0
d 0
4 a t
b 0
c 0
d 0
Name: col3, dtype: object
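For completeness, the from_product approach as a self-contained script. The names= argument is an addition here so the result keeps the col1/col2 index names from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 1, 1, 2, 2, 3, 3, 4],
    'col2': ['a', 'b', 'c', 'a', 'b', 'a', 'b', 'a'],
    'col3': ['x', 'y', 'z', 'p', 'q', 'r', 's', 't'],
})
df2 = df.groupby(['col1', 'col2'])['col3'].sum()

# Build the full (col1, col2) grid, then reindex onto it,
# padding the missing combinations with 0
mindx = pd.MultiIndex.from_product([df2.index.levels[0], list('abcd')],
                                   names=['col1', 'col2'])
out = df2.reindex(mindx, fill_value=0)
print(out)
```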
I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
{
'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
'time': [1, 4, 3, 5, 2, 6, 7],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
}
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already done in the question), then remove the rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
A similar solution that instead drops the last row of each id group via duplicated:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
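Put together as a self-contained script (using the first, dropna-based variant):

```python
import pandas as pd

mydf = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
    'time': [1, 4, 3, 5, 2, 6, 7],
    'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
}).sort_values(['id', 'time'], ascending=False)

# Within each id (already ordered by time, descending), the next row
# holds the previous val in time, i.e. the last_val
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
out = mydf.dropna(subset=['last_val'])
print(out)
```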
This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 3 years ago.
I have a DataFrame like:
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B':['d', 'e', 'f'], 'C':[1,2,3], 'D':[4,5,6]})
A B C D
a d 1 4
b e 2 5
c f 3 6
I have to expand columns C and D treating A and B as keys. The output should look like:
A B key val
a d C 1
a d D 4
b e C 2
b e D 5
c f C 3
c f D 6
I have coded this as:
df_new = pd.DataFrame()
list_to_expand = ['C', 'D']
for index, row in df.iterrows():
for l in list_to_expand:
df_new = df_new.append(
pd.DataFrame({'A': row['A'],'B': row['B'], 'key': l, 'val': row[l]},
index=[0]))
I need a vectorized version of this but couldn't find a suitable function. Please note that the list of columns can grow, e.g. ['C', 'D', 'E', ...]. I am using Python 3 and pandas.
You want DataFrame.melt:
df.melt(id_vars=['A', 'B'], var_name='key', value_name='val')
A B key val
0 a d C 1
1 b e C 2
2 c f C 3
3 a d D 4
4 b e D 5
5 c f D 6
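Note that melt orders the result column by column; if the row order shown in the question matters, sort afterwards (a small addition, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# melt, then restore the per-row grouping from the question
out = (df.melt(id_vars=['A', 'B'], var_name='key', value_name='val')
         .sort_values(['A', 'B', 'key'])
         .reset_index(drop=True))
print(out)
```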
It seems odd that after deleting a column, I cannot add it back with the same name. I create a simple dataframe with multi-level columns, add a new column specifying only the level-0 name, and then delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, and it seems like it has no effect and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? There is no such issue if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1
The problem is that the MultiIndex levels are not removed after calling del:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution for removing them is MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to assign with a tuple, which selects the MultiIndex column explicitly:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
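A minimal reproduction of the stale-level problem and the remove_unused_levels fix, using tuple indexing throughout (behaviour checked against recent pandas; the original question used 0.20.1):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['e', 'f', 'g']])

df[('d', '')] = df[('c', 'g')] + 2   # add a column under level-0 name 'd'
del df['d']                          # the column is gone...

assert 'd' in df.columns.levels[0]   # ...but the level value 'd' lingers

df.columns = df.columns.remove_unused_levels()
assert 'd' not in df.columns.levels[0]

df[('d', '')] = df[('c', 'g')] + 2   # re-adding now works as expected
print(df)
```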
What would be the most efficient way to solve this problem?
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
i_need = pd.DataFrame(data={
'id': ['A','A','A','B','B','B','C', 'C'],
'v' : ['s','m','l','1','2','3','k','g']
})
I thought about creating a new df and appending the records to it while iterating over i_have, but as the number of rows grows, that can take a while.
Use numpy.repeat with numpy.concatenate for flattening:
import numpy as np

# create lists by splitting on ','
splitted = i_have['v'].str.split(',')
# get the length of each list
lens = splitted.str.len()
df = pd.DataFrame({'id': np.repeat(i_have['id'], lens),
                   'v': np.concatenate(splitted)})
print (df)
id v
0 A s
0 A m
0 A l
1 B 1
1 B 2
1 B 3
2 C k
2 C g
Thanks to piRSquared for the solution for repeating multiple columns:
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'id1': ['A1', 'B1', 'C1'],
'v' : [ 's,m,l', '1,2,3', 'k,g']
})
print (i_have)
id id1 v
0 A A1 s,m,l
1 B B1 1,2,3
2 C C1 k,g
splitted = i_have['v'].str.split(',')
lens = splitted.str.len()
df = i_have.loc[i_have.index.repeat(lens)].assign(v=np.concatenate(splitted))
print (df)
id id1 v
0 A A1 s
0 A A1 m
0 A A1 l
1 B B1 1
1 B B1 2
1 B B1 3
2 C C1 k
2 C C1 g
If you have multiple columns, then first split the data on ',' with expand=True (thank you piRSquared), then stack and ffill, i.e.
i_have = pd.DataFrame(data={
'id': ['A', 'B', 'C'],
'v' : [ 's,m,l', '1,2,3', 'k,g'],
'w' : [ 's,8,l', '1,2,3', 'k,g'],
'x' : [ 's,0,l', '1,21,3', 'ks,g'],
'y' : [ 's,m,l', '11,2,3', 'ks,g'],
'z' : [ 's,4,l', '1,2,32', 'k,gs'],
})
i_want = (i_have.apply(lambda x: x.str.split(',', expand=True).stack())
                .reset_index(level=1, drop=True)
                .ffill())
If the values are not equally sized, then:
i_want = (i_have.apply(lambda x: x.str.split(',', expand=True).stack())
                .reset_index(level=1, drop=True))
i_want['id'] = i_want['id'].ffill()
Output i_want
id v w x y z
0 A s s s s s
1 A m 8 0 m 4
2 A l l l l l
3 B 1 1 1 11 1
4 B 2 2 21 2 2
5 B 3 3 3 3 32
6 C k k ks ks k
7 C g g g g gs
Here's another way
In [1667]: (i_have.set_index('id').v.str.split(',').apply(pd.Series)
            .stack().reset_index(name='v').drop(columns='level_1'))
Out[1667]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g
As pointed out in a comment, str.split(expand=True) avoids the apply(pd.Series) step:
In [1672]: (i_have.set_index('id').v.str.split(',', expand=True)
            .stack().reset_index(name='v').drop(columns='level_1'))
Out[1672]:
id v
0 A s
1 A m
2 A l
3 B 1
4 B 2
5 B 3
6 C k
7 C g
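On pandas 0.25+, Series.explode covers the single-column case directly; a short sketch:

```python
import pandas as pd

i_have = pd.DataFrame({
    'id': ['A', 'B', 'C'],
    'v': ['s,m,l', '1,2,3', 'k,g'],
})

# Split each string into a list, then emit one row per list element
out = (i_have.assign(v=i_have['v'].str.split(','))
             .explode('v')
             .reset_index(drop=True))
print(out)
```

explode repeats the other columns automatically, so the same pattern extends to the id/id1 example above without extra work.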