I am trying to convert a dataframe to long form.
The dataframe I am starting with:
df = pd.DataFrame([['a', 'b'],
['d', 'e'],
['f', 'g', 'h'],
['q', 'r', 'e', 't']])
df = df.rename(columns={0: "Key"})
Key 1 2 3
0 a b None None
1 d e None None
2 f g h None
3 q r e t
The number of columns is not specified, there may be more than 4. There should be a new row for each value after the key
This gets what I need, however, it seems there should be a way to do this without having to drop null values:
new_df = pd.melt(df, id_vars=['Key'])[['Key', 'value']]
new_df = new_df.dropna()
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q e
11 q t​
Option 1
You should be able to do this with set_index + stack:
df.set_index('Key').stack().reset_index(level=0, name='value').reset_index(drop=True)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
If you don't want to keep resetting the index, then use an intermediate variable and create a new DataFrame:
v = df.set_index('Key').stack()
pd.DataFrame({'Key' : v.index.get_level_values(0), 'value' : v.values})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
The essence here is that stack automatically gets rid of NaNs by default (you can disable that by setting dropna=False).
Option 2
More performance with np.repeat and numpy's version of pd.DataFrame.stack:
i = df.pop('Key').values
j = df.values.ravel()
pd.DataFrame({'Key' : v.repeat(df.count(axis=1)), 'value' : j[pd.notnull(j)]
})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
By using melt(I do not think dropna create more 'trouble' here)
df.melt('Key').dropna().drop('variable',1)
Out[809]:
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q s
11 q t
And if without dropna
s=df.fillna('').set_index('Key').sum(1).apply(list)
pd.DataFrame({'Key': s.reindex(s.index.repeat(s.str.len())).index,'value':s.sum()})
Out[862]:
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
With a comprehension
This assumes the key is the first element of the row.
pd.DataFrame(
[[k, v] for k, *r in df.values for v in r if pd.notna(v)],
columns=['Key', 'value']
)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q s
6 q t
Related
I have a dataframe:
df = [A B C D E_p0 E_p1 E_p2 K_p0 K_p1 K_2
a 2 r 4 3 6 1 9 5 1
e g 1 d 5 8 2 7 1 4]
And I want to group columns based on the prefix and aggregate them by a function, such as mean or max or rms.
So, for example if my function is max, the output is:
df = [A B C D E K
a 2 r 4 6 9
e g 1 d 8 7 ]
You can convert columns without separator to index and then grouping with lambda function per columns with aggregate function like max:
m = df.columns.str.contains('_')
df = (df.set_index(df.columns[~m].tolist())
.groupby(lambda x: x.split('_')[0], axis=1)
.max()
.reset_index())
print (df)
A B C D E K
0 a 2 r 4 6 9
1 e g 1 d 8 7
Solution with custom function:
def rms(x):
return np.sqrt(np.sum(x**2, axis=1)/len(x.columns))
m = df.columns.str.contains('_')
df1 = (df.set_index(df.columns[~m].tolist())
.groupby(lambda x: x.split('_')[0], axis=1)
.agg(rms)
.reset_index())
print (df1)
A B C D E K
0 a 2 r 4 3.915780 5.972158
1 e g 1 d 5.567764 4.690416
This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 2 years ago.
I am trying to obtain pairwise counts of two column variables using pandas. I have a dataframe of two columns in the following format:
col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h
What I would like to get as output would be the following matrix of counts, for e.g.:
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
I am getting totally confused with pandas iterating over columns, rows, indexes and such. Appreciate some guidance here.
Pandas often has simple functions built in - in this case, you want crosstab:
pd.crosstab(dat['col1'], dat['col2'])
full code:
import pandas as pd
from io import StringIO
x = '''col1 col2
a e
b g
c h
d f
a g
b h
c f
d e
a f
b g
c g
d h
a e
b e
c g
d h
b h'''
dat = pd.read_csv(StringIO(x), sep = '\s+')
pd.crosstab(dat['col1'], dat['col2'])
You're looking for a crosstab:
count_matrix = pd.crosstab(index=df["col1"], columns=df["col2"])
print(count_matrix)
col2 e f g h
col1
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you don't like the column/index names in (e.g. still seeing "col1" and "col2"), then you can remove them with rename_axis:
count_matrix = count_matrix.rename_axis(index=None, columns=None)
print(count_matrix)
e f g h
a 2 1 1 0
b 1 0 2 2
c 0 1 2 1
d 1 1 0 2
If you want that all together in one snippet:
count_matrix = (pd.crosstab(index=df["col1"], columns=df["col2"])
.rename_axis(index=None, columns=None))
I have the following data frames,
data frame- 1 (named as df1)
index A B C
1 q a w
2 e d q
3 r f r
4 t g t
5 y j o
6 i k p
7 j w k
8 i o u
9 a p v
10 o l a
data frame- 2 (named as df2)
index C
3 a
7 b
9 c
10 d
I tried to replace the rows for specific indexes in the column "C" using the data frame - 2 for the data frame - 1 but I got the following result after using the below code:
df1['C'] = df2
Output:
index A B C
1 q a NaN
2 e d NaN
3 r f a
4 t g NaN
5 y j NaN
6 i k NaN
7 j w b
8 i o NaN
9 a p c
10 o l d
But I want something like this,
Expected output:
index A B C
1 q a w
2 e d q
3 r f a
4 t g t
5 y j o
6 i k p
7 j w b
8 i o u
9 a p c
10 o l d
So clearly I don't need NaN values in column "C" instead I want the values to remain as it is. (I mean it should change only for that particular index value).
Please let me know the solution.
Thanks in advance!
Assuming index is the actual index column, we can do loc:
df1.loc[df2.index, 'C'] = df2['C']
Or even more simple with:
df1.update(df2)
Output:
A B C
index
1 q a w
2 e d q
3 r f a
4 t g t
5 y j o
6 i k p
7 j w b
8 i o u
9 a p c
10 o l d
Try this
for idx, row in df2.iterrows():
df1.at[idx, 'C'] = row['C']
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'A': list('aaabdcde'),
'B': list('smnipiuy'),
'C': list('zzzqqwll')
}
)
df2 = pd.DataFrame(
{
'mapcol': list('abpppozl')
}
)
A B C
0 a s z
1 a m z
2 a n z
3 b i q
4 d p q
5 c i w
6 d u l
7 e y l
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
Now I want to create an additional column in df1 which should be filled with values coming from the columns A, B and C respectively, depending on whether their values can be found in df2['mapcol']. If the values in one row can be found in more than one column, they should be first used from A, then B and then C, so my expected outcome looks like this:
A B C final
0 a s z a # <- values can be found in A and C, but A is preferred
1 a m z a # <- values can be found in A and C, but A is preferred
2 a n z a # <- values can be found in A and C, but A is preferred
3 b i q b # <- value can be found in A
4 d p q p # <- value can be found in B
5 c i w NaN # none of the values can be mapped
6 d u l l # value can be found in C
7 e y l l # value can be found in C
A straightforward implementation could look like this (filling the column final iteratively using fillna in the preferred order):
preferred_order = ['A', 'B', 'C']
df1['final'] = np.nan
for col in preferred_order:
df1['final'] = df1['final'].fillna(df1[col][df1[col].isin(df2['mapcol'])])
which gives the desired outcome.
Does anyone see a solution that avoids the loop?
you can use where and isin on the full dataframe df1 to mask the value not in the df2, then reorder with the preferred_order and bfill along the column, keep the first column with iloc
preferred_order = ['A', 'B', 'C']
df1['final'] = (df1.where(df1.isin(df2['mapcol'].to_numpy()))
[preferred_order]
.bfill(axis=1)
.iloc[:, 0]
)
print (df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
Use:
order = ['A', 'B', 'C'] # order of columns
d = df1[order].isin(df2['mapcol'].tolist()).loc[lambda x: x.any(axis=1)].idxmax(axis=1)
df1.loc[d.index, 'final'] = df1.lookup(d.index, d)
Details:
Use DataFrame.isin and filter the rows using boolean masking with DataFrame.any along axis=1 then use DataFrame.idxmax along axis=1 to get column names names associated with max values along axis=1.
print(d)
0 A
1 A
2 A
3 A
4 B
6 C
7 C
dtype: object
Use DataFrame.lookup to lookup the values in df1 corresponding to the index and columns of d and assign this values to column final:
print(df1)
A B C final
0 a s z a
1 a m z a
2 a n z a
3 b i q b
4 d p q p
5 c i w NaN
6 d u l l
7 e y l l
I have two dataframes like this:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
'A': list('abdcde'),
'B': ['s', np.nan, 'h', 'j', np.nan, 'g']
}
)
df2 = pd.DataFrame(
{
'mapcol': list('abpppozl')
}
)
A B
0 a s
1 b NaN
2 d h
3 c j
4 d NaN
5 e g
mapcol
0 a
1 b
2 p
3 p
4 p
5 o
6 z
7 l
I would now like to fill B in df1 using the values of df2['mapcol'], however not using the actual index but - in this case - just the first two entries of df2['mapcol']. So, instead of b and p that correspond to index 1 and 4, respectively, I would like to use the values a and b.
One way of doing it would be to construct a dictionary with the correct indices and values:
df1['B_filled_incorrect'] = df1['B'].fillna(df2['mapcol'])
ind = df1[df1['B'].isna()].index
# reset_index is required as we might have a non-numerical index
val = df2.reset_index().loc[:len(ind-1), 'mapcol'].values
map_dict = dict(zip(ind, val))
df1['B_filled_correct'] = df1['B'].fillna(map_dict)
A B B_filled_incorrect B_filled_correct
0 a s s s
1 b NaN b a
2 d h h h
3 c j j j
4 d NaN p b
5 e g g g
which gives the desired output.
Is there a more straightforward way that avoids the creation of all these intermediate variables?
position fill you can assign the value via the loc and convert fill value to list
df1.loc[df1.B.isna(),'B']=df2.mapcol.iloc[:df1.B.isna().sum()].tolist()
df1
Out[232]:
A B
0 a s
1 b a
2 d h
3 c j
4 d b
5 e g