I have a DataFrame like:
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B':['d', 'e', 'f'], 'C':[1,2,3], 'D':[4,5,6]})
A B C D
a d 1 4
b e 2 5
c f 3 6
I have to expand columns C and D treating A and B as keys. The output should look like:
A B key val
a d C 1
a d D 4
b e C 2
b e D 5
c f C 3
c f D 6
I have coded this as:
df_new = pd.DataFrame()
list_to_expand = ['C', 'D']
for index, row in df.iterrows():
    for l in list_to_expand:
        df_new = df_new.append(
            pd.DataFrame({'A': row['A'], 'B': row['B'], 'key': l, 'val': row[l]},
                         index=[0]))
I need to optimize my code with a vectorized approach but couldn't find a suitable function. Please note that the list of columns can grow, i.e. ['C', 'D', 'E', ...]. I am using Python 3 and pandas.
You want DataFrame.melt:
df.melt(id_vars=['A', 'B'], var_name='key', value_name='val')
A B key val
0 a d C 1
1 b e C 2
2 c f C 3
3 a d D 4
4 b e D 5
5 c f D 6
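Since the list of value columns can grow, note that melt treats every column not listed in id_vars as a value column by default; value_vars is only needed to restrict to a subset. A minimal sketch of both forms:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': ['d', 'e', 'f'],
                   'C': [1, 2, 3], 'D': [4, 5, 6]})

# All non-id columns are melted automatically, so a new column 'E'
# would be picked up without any code change
out = df.melt(id_vars=['A', 'B'], var_name='key', value_name='val')

# Restrict to an explicit subset when needed
out_c = df.melt(id_vars=['A', 'B'], value_vars=['C'],
                var_name='key', value_name='val')
```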
Pretty new to Stack Overflow and data munging altogether, so apologies if this is an overly simple or previously asked question.
Say I have data as below:
index = list('ABCDEF')
values = [1,2,3,4,5,6]
test = pd.Series(values, index = index)
A 1
B 2
C 3
D 4
E 5
F 6
and want to create something like below, where the number of times each index value is repeated is given by its value in the original Series:
0 A
1 B
2 B
3 C
4 C
5 C
6 D
7 D
8 D
9 D
10 E
11 E
12 E
13 E
14 E
15 F
16 F
17 F
18 F
19 F
20 F
I have written the following code, but feel that looping defeats the whole purpose of using pandas. If anyone knows of a simpler, more elegant solution, please share:
aggr = pd.Series([])
for index, value in zip(test.index.values, test):
    to_append = pd.Series(list(index * value))
    aggr = aggr.append(to_append, ignore_index=True)
Cheers
You can use Index.repeat on the index:
pd.Series(test.index.repeat(test))
0 A
1 B
2 B
3 C
4 C
5 C
6 D
7 D
8 D
9 D
10 E
11 E
12 E
13 E
14 E
15 F
16 F
17 F
18 F
19 F
20 F
In the general case, outside pandas, you can generate this with a list comprehension, which you then might want to flatten. Since we are using pandas, we can make good use of explode() to flatten the nested list (zipping index and values keeps it correct even when the values are not 1, 2, 3, ...):
[[i] * v for i, v in zip(index, values)]
Outputs:
[['A'],
 ['B', 'B'],
 ['C', 'C', 'C'],
 ['D', 'D', 'D', 'D'],
 ['E', 'E', 'E', 'E', 'E'],
 ['F', 'F', 'F', 'F', 'F', 'F']]
Therefore, passing it to pd.Series() and using explode():
pd.Series([[i] * v for i, v in zip(index, values)]).explode()
Outputs:
0 A
1 B
1 B
2 C
2 C
2 C
3 D
3 D
3 D
3 D
4 E
4 E
4 E
4 E
4 E
5 F
5 F
5 F
5 F
5 F
5 F
dtype: object
Use Index.repeat.
You can transform the index to a Series (to_series) or a DataFrame (to_frame), and give it a name by passing name='...' to either method:
>>> test.index.repeat(test).to_series().reset_index(drop=True)
0 A
1 B
2 B
3 C
4 C
5 C
6 D
7 D
8 D
9 D
10 E
11 E
12 E
13 E
14 E
15 F
16 F
17 F
18 F
19 F
20 F
dtype: object
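For comparison, the same repetition can be done directly in NumPy when only the array of labels is needed, since np.repeat also accepts per-element counts (a small sketch, not tied to pandas):

```python
import numpy as np
import pandas as pd

test = pd.Series([1, 2, 3, 4, 5, 6], index=list('ABCDEF'))

# np.repeat with an array of counts mirrors Index.repeat
arr = np.repeat(test.index.to_numpy(), test.to_numpy())
```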
If I have a dataframe:
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['A', 'B', 'C', 'D'],
... ['E', 'B', 'C']
... ])
>>> df
0 1 2 3
0 A B C D
1 E B C None
>>>
I would like to transform the dataframe to a two-column format:
x, y
-----
A, B
B, C
C, D
E, B
B, C
For each row, from left to right, take each pair of neighboring values.
It is kind of a from-to relationship if you consider each row as a path.
How to do the transformation?
We can use explode with zip:
s = pd.DataFrame(df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1).explode().tolist())
Out[336]:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
Update
s = df.apply(lambda x: list(zip(x.dropna()[:-1], x.dropna()[1:])), axis=1).explode()
s = pd.DataFrame(s.tolist(), index=s.index)
s
Out[340]:
0 1
0 A B
0 B C
0 C D
1 E B
1 B C
Pre-preparing the data could help too:
import pandas as pd
inp = [['A', 'B', 'C', 'D'],
['E', 'B', 'C']]
# Convert beforehand
inp2 = [[[i[k], i[k+1]] for k in range(len(i)-1)] for i in inp]
inp2 = inp2[0] + inp2[1]
df = pd.DataFrame(inp2)
print(df)
Output:
0 1
0 A B
1 B C
2 C D
3 E B
4 B C
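The per-row pairing in both answers can also be written as one flattened comprehension that handles any number of rows and any row lengths, avoiding the hard-coded inp2[0] + inp2[1] concatenation (a self-contained sketch):

```python
import pandas as pd

rows = [['A', 'B', 'C', 'D'], ['E', 'B', 'C']]

# Pair each value with its right-hand neighbour, row by row
pairs = [(row[k], row[k + 1]) for row in rows for k in range(len(row) - 1)]
out = pd.DataFrame(pairs, columns=['x', 'y'])
```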
I have mydf below, which I have sorted on a dummy time column and the id:
mydf = pd.DataFrame(
{
'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
'time': [1, 4, 3, 5, 2, 6, 7],
'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g']
}
).sort_values(['id', 'time'], ascending=False)
mydf
id time val
5 C 6 f
3 C 5 d
1 B 4 b
2 B 3 c
6 A 7 g
4 A 2 e
0 A 1 a
I want to add a column (last_val) which, for each unique id, holds the latest val based on the time column. Entries for which there is no last_val can be dropped. The output in this example would look like:
mydf
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
Any ideas?
Use DataFrameGroupBy.shift after sort_values(['id', 'time'], ascending=False) (already applied in the question), then remove rows with missing values with DataFrame.dropna:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
Similar solution, only removing the last duplicated row per id (which has no last_val) instead:
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf[mydf['id'].duplicated(keep='last')]
print (mydf)
id time val last_val
5 C 6 f d
1 B 4 b c
6 A 7 g e
4 A 2 e a
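Putting the first variant together as a runnable sketch:

```python
import pandas as pd

mydf = pd.DataFrame({
    'id': ['A', 'B', 'B', 'C', 'A', 'C', 'A'],
    'time': [1, 4, 3, 5, 2, 6, 7],
    'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
}).sort_values(['id', 'time'], ascending=False)

# With time descending within each id, shift(-1) pulls the
# previous-in-time val onto each row; the oldest row gets NaN
mydf['last_val'] = mydf.groupby('id')['val'].shift(-1)
mydf = mydf.dropna(subset=['last_val'])
```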
I am trying to convert a dataframe to long form.
The dataframe I am starting with:
df = pd.DataFrame([['a', 'b'],
['d', 'e'],
['f', 'g', 'h'],
['q', 'r', 's', 't']])
df = df.rename(columns={0: "Key"})
Key 1 2 3
0 a b None None
1 d e None None
2 f g h None
3 q r s t
The number of columns is not specified; there may be more than 4. There should be a new row for each value after the key.
This gets what I need; however, it seems there should be a way to do this without having to drop null values:
new_df = pd.melt(df, id_vars=['Key'])[['Key', 'value']]
new_df = new_df.dropna()
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q s
11 q t
Option 1
You should be able to do this with set_index + stack:
df.set_index('Key').stack().reset_index(level=0, name='value').reset_index(drop=True)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
If you don't want to keep resetting the index, then use an intermediate variable and create a new DataFrame:
v = df.set_index('Key').stack()
pd.DataFrame({'Key' : v.index.get_level_values(0), 'value' : v.values})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
The essence here is that stack automatically gets rid of NaNs by default (you can disable that by setting dropna=False).
Option 2
More performance with np.repeat and NumPy's version of pd.DataFrame.stack:
i = df.pop('Key').values
j = df.values.ravel()
pd.DataFrame({'Key': i.repeat(df.count(axis=1)), 'value': j[pd.notnull(j)]})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
By using melt (I do not think dropna creates more 'trouble' here):
df.melt('Key').dropna().drop(columns='variable')
Out[809]:
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q s
11 q t
And if without dropna:
s = df.fillna('').set_index('Key').sum(axis=1).apply(list)
pd.DataFrame({'Key': s.reindex(s.index.repeat(s.str.len())).index, 'value': s.sum()})
Out[862]:
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
With a comprehension
This assumes the key is the first element of the row.
pd.DataFrame(
[[k, v] for k, *r in df.values for v in r if pd.notna(v)],
columns=['Key', 'value']
)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
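The comprehension, made self-contained with the question's data:

```python
import pandas as pd

df = pd.DataFrame([['a', 'b'],
                   ['d', 'e'],
                   ['f', 'g', 'h'],
                   ['q', 'r', 's', 't']]).rename(columns={0: 'Key'})

# Unpack each row as key + rest, keeping only the non-null values
out = pd.DataFrame(
    [[k, v] for k, *r in df.values for v in r if pd.notna(v)],
    columns=['Key', 'value']
)
```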
It seems odd that after deleting a column, I cannot add it back with the same name. I create a simple dataframe with multi-level columns, add a new column using only a level-0 name, and then delete it.
>>> import pandas as pd
>>> df = pd.DataFrame([[1,2,3],[4,5,6]])
>>> df.columns=[['a','b','c'],['e','f','g']]
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
>>> df['d'] = df.c+2
>>> print(df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
>>> del df['d']
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Now I try to add it again, and it seems like it has no effect and no error or warning is shown.
>>> df['d'] = df.c+2
>>> print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Is this expected behaviour? Should I file a bug report with the pandas project? There is no such issue if I add the 'd' column with both levels specified, like this:
df['d', 'x'] = df.c+2
Thanks,
PS: Python is 2.7.14 and pandas 0.20.1
The problem is that your MultiIndex levels are not removed after calling del:
del df['d']
print(df)
a b c
e f g
0 1 2 3
1 4 5 6
Check columns:
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['e', 'f', 'g', '']],
labels=[[0, 1, 2], [0, 1, 2]])
The solution is to remove them with MultiIndex.remove_unused_levels:
df.columns = df.columns.remove_unused_levels()
print (df.columns)
MultiIndex(levels=[['a', 'b', 'c'], ['e', 'f', 'g']],
labels=[[0, 1, 2], [0, 1, 2]])
df['d'] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
Another solution is to reassign using a tuple, which is needed to select a MultiIndex column:
df[('d', '')] = df.c+2
print (df)
a b c d
e f g
0 1 2 3 5
1 4 5 6 8
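A compact, runnable recap of the fix (assigning with a tuple so the sketch behaves the same on current pandas versions as well):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
df.columns = pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['e', 'f', 'g']])

df[('d', '')] = df[('c', 'g')] + 2  # add the column
del df['d']                         # a level-0 key removes it again

# 'd' is gone from the frame but lingers in the index levels,
# which is what blocks re-adding it by level-0 name alone
assert 'd' in df.columns.levels[0]

df.columns = df.columns.remove_unused_levels()
```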