I have the following data frames.
Data frame 1 (named df1):
index A B C
1 q a w
2 e d q
3 r f r
4 t g t
5 y j o
6 i k p
7 j w k
8 i o u
9 a p v
10 o l a
Data frame 2 (named df2):
index C
3 a
7 b
9 c
10 d
I tried to replace the values at specific indexes in column "C" of data frame 1 with the values from data frame 2, but I got the following result after using the code below:
df1['C'] = df2
Output:
index A B C
1 q a NaN
2 e d NaN
3 r f a
4 t g NaN
5 y j NaN
6 i k NaN
7 j w b
8 i o NaN
9 a p c
10 o l d
But I want something like this:
Expected output:
index A B C
1 q a w
2 e d q
3 r f a
4 t g t
5 y j o
6 i k p
7 j w b
8 i o u
9 a p c
10 o l d
So clearly I don't want NaN values in column "C"; instead, I want the existing values to remain as they are and change only at the particular indexes present in df2.
Please let me know the solution.
Thanks in advance!
Assuming index is the actual index, we can use loc:
df1.loc[df2.index, 'C'] = df2['C']
Or even simpler with:
df1.update(df2)
Output:
A B C
index
1 q a w
2 e d q
3 r f a
4 t g t
5 y j o
6 i k p
7 j w b
8 i o u
9 a p c
10 o l d
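For reference, a minimal reproducible sketch of both approaches, assuming the index column shown above is the actual DataFrame index:
import pandas as pd

df1 = pd.DataFrame({'A': list('qertyijiao'),
                    'B': list('adfgjkwopl'),
                    'C': list('wqrtopkuva')},
                   index=range(1, 11))
df2 = pd.DataFrame({'C': list('abcd')}, index=[3, 7, 9, 10])

# Option 1: label-based assignment, touching only df2's index
df1.loc[df2.index, 'C'] = df2['C']
# Option 2: update aligns on index and columns, modifies df1 in place,
# and skips NaN values in df2 by default
df1.update(df2)
Either line alone produces the expected output.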
Try this:
# write df2's values into df1 one cell at a time
for idx, row in df2.iterrows():
    df1.at[idx, 'C'] = row['C']
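Note that iterating row by row works, but it is typically much slower than the vectorized loc/update approaches above once the frames get large.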
I am trying to convert a dataframe to long form.
The dataframe I am starting with:
df = pd.DataFrame([['a', 'b'],
['d', 'e'],
['f', 'g', 'h'],
['q', 'r', 'e', 't']])
df = df.rename(columns={0: "Key"})
Key 1 2 3
0 a b None None
1 d e None None
2 f g h None
3 q r e t
The number of columns is not fixed; there may be more than 4. There should be a new row for each value after the key.
This gets what I need; however, it seems there should be a way to do this without having to drop null values:
new_df = pd.melt(df, id_vars=['Key'])[['Key', 'value']]
new_df = new_df.dropna()
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q e
11 q t
Option 1
You should be able to do this with set_index + stack:
df.set_index('Key').stack().reset_index(level=0, name='value').reset_index(drop=True)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q e
6 q t
If you don't want to keep resetting the index, then use an intermediate variable and create a new DataFrame:
v = df.set_index('Key').stack()
pd.DataFrame({'Key' : v.index.get_level_values(0), 'value' : v.values})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q e
6 q t
The essence here is that stack automatically gets rid of NaNs by default (you can disable that by setting dropna=False).
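As a quick sketch of that default, keeping the dropped rows makes them visible:
stacked = df.set_index('Key').stack(dropna=False)
print (stacked[stacked.isnull()])  # exactly the rows that stack() drops by default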
Option 2
Better performance with np.repeat and NumPy's version of pd.DataFrame.stack:
i = df.pop('Key').values
j = df.values.ravel()
pd.DataFrame({'Key': i.repeat(df.count(axis=1)), 'value': j[pd.notnull(j)]})
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q e
6 q t
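The speed-up here comes from working on the raw NumPy values and skipping pandas' index-alignment machinery; the trade-off is that you lose the labeled intermediate Series that Option 1 gives you.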
By using melt (I do not think dropna creates more 'trouble' here):
df.melt('Key').dropna().drop(columns='variable')
Out[809]:
Key value
0 a b
1 d e
2 f g
3 q r
6 f h
7 q e
11 q t
And if you want to avoid dropna entirely:
s = df.fillna('').set_index('Key').sum(axis=1).apply(list)
pd.DataFrame({'Key': s.reindex(s.index.repeat(s.str.len())).index, 'value': s.sum()})
Out[862]:
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q e
6 q t
With a comprehension
This assumes the key is the first element of the row.
pd.DataFrame(
[[k, v] for k, *r in df.values for v in r if pd.notna(v)],
columns=['Key', 'value']
)
Key value
0 a b
1 d e
2 f g
3 f h
4 q r
5 q e
6 q t
I have a dataframe like this one (basically two columns: the first contains a blogger id and the second contains followers):
blogger follower
A c
A d
A e
A f
A g
A h
A i
A j
A k
B c
B f
B g
B l
B m
B n
B o
B p
B q
B r
B s
B t
B k
C a
C k
C r
C g
C t
C c
C p
C y
C z
C w
What I want to get is a square matrix with all-to-all intersection count, like this:
A B C
A - 4 3
B 4 - 6
C 3 6 -
I'm not a skilled pandas user, and all I achieved was doing this with two loops and np.intersect1d, which I believe is not efficient. I've been trying to play with pivot_table(), crosstab() and groupby() with no luck, so unfortunately there is no code to share. Maybe someone here knows an efficient solution?
Perform a self-merge, followed by a crosstab operation.
i = df.merge(df, on='follower')
j = pd.crosstab(i.blogger_x, i.blogger_y)
j
blogger_y A B C
blogger_x
A 9 4 3
B 4 13 6
C 3 6 10
Of course, the diagonal isn't -, but that's easy to fix:
j = j.astype(object)
j.values[[np.arange(j.shape[0])] * 2] = '-'  # blank out the diagonal
j
blogger_y A B C
blogger_x
A - 4 3
B 4 - 6
C 3 6 -
Note that this ruins performance, because your columns are now object type, which is the only way to mix values of different types in the same column.
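As an aside, a merge-free sketch of the same counts: build a blogger-by-follower indicator table with crosstab and multiply it by its transpose; entry (i, j) is the number of followers bloggers i and j share, and the diagonal holds each blogger's own follower count.
b = pd.crosstab(df['blogger'], df['follower'])
m = b.dot(b.T)  # (i, j) = number of shared followers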
I have a dataframe df1 that looks like this:
df1 = pd.DataFrame({'A':[0,5,4,8,9,0,7,6],
'B':['a','s','d','f','g','h','j','k'],
'C':['XX','XX','XX','YY','YY','WW','ZZ','ZZ']})
My goal is to group the elements according to the values contained in column C, so that rows having the same value share the same index (which must contain the value stored in C). Therefore the output should look like this:
A B
XX 0 a
5 s
4 d
YY 8 f
9 g
WW 0 h
ZZ 7 j
6 k
I tried to use the command df.groupby('C') but it returns the following object:
<pandas.core.groupby.DataFrameGroupBy object at 0x000000001A9D4860>
Can you suggest an elegant and smart way to achieve my goal?
Note: I think my question is somehow related to multi-indexing.
It seems you need DataFrame.set_index:
df2 = df1.set_index('C')
print (df2)
A B
C
XX 0 a
XX 5 s
XX 4 d
YY 8 f
YY 9 g
WW 0 h
ZZ 7 j
ZZ 6 k
print (df2.loc['XX'])
A B
C
XX 0 a
XX 5 s
XX 4 d
If you need a MultiIndex from columns C and A:
df3 = df1.set_index(['C', 'A'])
print (df3)
B
C A
XX 0 a
5 s
4 d
YY 8 f
9 g
WW 0 h
ZZ 7 j
6 k
print (df3.loc['XX'])
B
A
0 a
5 s
4 d
I think you are looking for pivot_table, i.e.:
pd.pivot_table(df1, values='A', index=['C','B'])
Output:
A
C B
WW h 0
XX a 0
d 4
s 5
YY f 8
g 9
ZZ j 7
k 6
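One caveat, given the goal of keeping values as-is: pivot_table aggregates duplicate (C, B) pairs (mean by default) and sorts the index, whereas set_index is a purely cosmetic, lossless regrouping:
df1.set_index(['C', 'B'])  # no aggregation, original row order preserved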
I want to select, into a new dataframe, the columns that contain the value 'C'.
protein 1 2 3 4 5
prot1 C M D F A
prot2 C D A M A
prot3 C C D F A
prot4 S D F C L
prot5 S D A I L
So i want to have this:
protein 1 2 4
prot1 C M F
prot2 C D M
prot3 C C F
prot4 S D C
prot5 S D I
The number of columns can be n; I found only examples where I must specify the column name, and I can't do that here. The script should check column by column.
In [22]: df[['protein']].join(df[df.columns[df.eq('C').any()]])
Out[22]:
protein 1 2 4
0 prot1 C M F
1 prot2 C D M
2 prot3 C C F
3 prot4 S D C
4 prot5 S D I
Use:
np.random.seed(123)
n = np.random.choice(['C','M','D', '-'], size=(3,10))
n[:,0] = ['a','b','w']
foo = pd.DataFrame(n)
print (foo)
0 1 2 3 4 5 6 7 8 9
0 a M D D C D D M - D
1 b M D M C M D - M C
2 w C - M - D M C C C
mask = foo.eq('C').any()
# always keep the label column (column 0) in the output
mask.loc[0] = True
# filter
print (foo.loc[:,mask])
0 1 4 7 8 9
0 a M C M - D
1 b M C - M C
2 w C - C C C
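If 'protein' is a regular column as in the question, another sketch of the same idea is to park it in the index first, so the boolean mask only ever sees the letter columns:
out = df.set_index('protein').loc[:, lambda d: d.eq('C').any()].reset_index()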
I am using Python 2.7 with Pandas on a Windows 10 machine.
I have an n by n DataFrame where:
1) The index represents people's names
2) The column headers are the same people's names in the same order
3) Each cell of the DataFrame is the average number of times they email each other each day.
How would I transform that DataFrame into a DataFrame with 3 columns, where:
1) Column 1 would be the index of the n by n DataFrame
2) Column 2 would be the column headers of the n by n DataFrame
3) Column 3 would be the cell value corresponding to those two names from the index, column header combination of the n by n DataFrame
Edit
Apologies for not providing an example of what I am looking for. I would like to take df1 and turn it into rel_df, as in the code below.
import pandas as pd
from itertools import permutations
df1 = pd.DataFrame()
df1['index'] = ['a', 'b','c','d','e']
df1.set_index('index', inplace = True)
df1['a'] = [0,1,2,3,4]
df1['b'] = [1,0,2,3,4]
df1['c'] = [4,1,0,3,4]
df1['d'] = [5,1,2,0,4]
df1['e'] = [7,1,2,3,0]
##df of all relationships to build
flds = pd.Series(df1.index)  # the same labels appear in the index and the columns
combos = list(permutations(flds, 2))
rel_df = pd.DataFrame.from_records(data = combos, columns = ['fld1','fld2'])
rel_df['value'] = [1,4,5,7,1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4]
>>> print df1
a b c d e
index
a 0 1 4 5 7
b 1 0 1 1 1
c 2 2 0 2 2
d 3 3 3 0 3
e 4 4 4 4 0
>>> print rel_df
fld1 fld2 value
0 a b 1
1 a c 4
2 a d 5
3 a e 7
4 b a 1
5 b c 1
6 b d 1
7 b e 1
8 c a 2
9 c b 2
10 c d 2
11 c e 2
12 d a 3
13 d b 3
14 d c 3
15 d e 3
16 e a 4
17 e b 4
18 e c 4
19 e d 4
Use melt:
df1 = df1.reset_index()
pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
(If in your actual code you're explicitly setting the index as you do here, just skip that step rather than doing the reset_index; melt doesn't work on an index.)
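To land on the exact rel_df layout, a small follow-up sketch that renames the columns and drops the diagonal self-pairs (which melt keeps):
out = pd.melt(df1, id_vars='index', value_vars=df1.columns.tolist()[1:])
out.columns = ['fld1', 'fld2', 'value']
out = out[out['fld1'] != out['fld2']]  # drop the zero a-a, b-b, ... diagonal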
# Flatten your dataframe.
df = df1.stack().reset_index()
# Remove duplicates (e.g. fld1 = 'a' and fld2 = 'a').
df = df.loc[df.iloc[:, 0] != df.iloc[:, 1]]
# Rename columns.
df.columns = ['fld1', 'fld2', 'value']
>>> df
fld1 fld2 value
1 a b 1
2 a c 4
3 a d 5
4 a e 7
5 b a 1
7 b c 1
8 b d 1
9 b e 1
10 c a 2
11 c b 2
13 c d 2
14 c e 2
15 d a 3
16 d b 3
17 d c 3
19 d e 3
20 e a 4
21 e b 4
22 e c 4
23 e d 4
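If you also want rel_df's clean 0-based row numbering, finish with a reset:
df = df.reset_index(drop=True)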