I have two data frames:
In [14]: rep1
Out[14]:
x y z
A 1 2 3
B 4 5 6
C 1 1 2
In [15]: rep2
Out[15]:
x y z
A 7 3 4
B 3 3 3
created with this code:
import pandas as pd
rep1 = pd.DataFrame.from_dict({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [1, 1, 2]}, orient='index', columns=['x', 'y', 'z'])
rep2 = pd.DataFrame.from_dict({'A': [7, 3, 4], 'B': [3, 3, 3]}, orient='index', columns=['x', 'y', 'z'])
What I want to do then is to mesh rep1 and rep2 so that the result is something like this:
gene rep1 rep2 type
A 1 7 x
B 4 3 x
A 2 3 y
B 5 3 y
A 3 4 z
B 6 3 z
row C is skipped because it is not shared by rep1 and rep2.
How can I achieve that?
This does it:
df = pd.concat([rep1.stack(), rep2.stack()], axis=1).reset_index().dropna()
df.columns = ['GENE', 'TYPE', 'REP1', 'REP2']
df.sort_values(by=['TYPE', 'GENE'], inplace=True)
Concatenate the stacked data frames on axis=1. Resetting the index gets back the gene and type columns; dropna takes care of the nulls produced for gene C. Finally, assign the correct column names.
returns:
GENE TYPE REP1 REP2
0 A x 1 7
3 B x 4 3
1 A y 2 3
4 B y 5 3
2 A z 3 4
5 B z 6 3
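A variant of the stack/concat approach above (my own sketch, not from the original answer): passing join='inner' to concat drops gene C at the join itself, so no dropna is needed:

```python
import pandas as pd

rep1 = pd.DataFrame([[1, 2, 3], [4, 5, 6], [1, 1, 2]],
                    index=['A', 'B', 'C'], columns=['x', 'y', 'z'])
rep2 = pd.DataFrame([[7, 3, 4], [3, 3, 3]],
                    index=['A', 'B'], columns=['x', 'y', 'z'])

# stack() turns each frame into a Series indexed by (gene, type);
# join='inner' keeps only the index entries present in both, dropping C.
df = pd.concat([rep1.stack(), rep2.stack()], axis=1, join='inner').reset_index()
df.columns = ['GENE', 'TYPE', 'REP1', 'REP2']
df = df.sort_values(['TYPE', 'GENE']).reset_index(drop=True)
```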
>>> import numpy as np
>>> c1 = rep1.values.T.flatten()
>>> c2 = rep2.values.T.flatten()
>>> c3 = np.vstack((rep1.columns.values, rep2.columns.values)).T.flatten()
>>> pd.DataFrame(np.vstack((c1,c2,c3)).T)
0 1 2
0 1 7 x
1 4 3 x
2 2 3 y
3 5 3 y
4 3 4 z
5 6 3 z
Edit: When I was answering this, the question did not have row C at all. Now things are more complicated, but I'll leave this here anyway.
Related
I have a dataframe with multilevel headers for the columns like this:
name 1 2 3 4
x y x y x y x y
A 1 4 3 7 2 1 5 2
B 2 2 6 1 4 5 1 7
How can I calculate the mean for 1x, 2x and 3x, but not 4x?
I tried:
df['mean']= df[('1','x'),('2','x'),('3','x')].mean()
This did not work; it raises a KeyError. I would like to get:
name 1 2 3 4 mean
x y x y x y x y
A 1 4 3 7 2 1 5 2 2
B 2 2 6 1 4 5 1 7 4
Is there a way to calculate the mean while keeping the first column header as an integer?
Here is one possible solution:
import pandas as pd
iterables = [[1, 2, 3, 4], ["x", "y"]]
array = [
    [1, 4, 3, 7, 2, 1, 5, 2],
    [2, 2, 6, 1, 4, 5, 1, 7],
]
index = pd.MultiIndex.from_product(iterables)
df = pd.DataFrame(array, index=["A", "B"], columns=index)
df["mean"] = df.xs("x", level=1, axis=1).loc[:,1:3].mean(axis=1)
print(df)
1 2 3 4 mean
x y x y x y x y
A 1 4 3 7 2 1 5 2 2.0
B 2 2 6 1 4 5 1 7 4.0
Steps:
Select all the "x"-columns with df.xs("x", level=1, axis=1)
Select only columns 1 to 3 with .loc[:,1:3]
Calculate the mean value with .mean(axis=1)
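An alternative sketch (my addition, not from the answer above) that keeps the integer level and selects both levels in one step with pd.IndexSlice:

```python
import pandas as pd

iterables = [[1, 2, 3, 4], ["x", "y"]]
array = [[1, 4, 3, 7, 2, 1, 5, 2], [2, 2, 6, 1, 4, 5, 1, 7]]
df = pd.DataFrame(array, index=["A", "B"],
                  columns=pd.MultiIndex.from_product(iterables))

# IndexSlice slices columns 1..3 at level 0 and 'x' at level 1 at once;
# this works here because from_product yields a lexsorted column index.
idx = pd.IndexSlice
df["mean"] = df.loc[:, idx[1:3, "x"]].mean(axis=1)
```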
Given the following data frame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': ['a', 'a', 'b', 'a', 'a', 'a']})
df
A B
0 A a
1 A a
2 A b
3 B a
4 B a
5 B a
I'd like to create column 'C', which numbers the rows within each group in columns A and B like this:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
I've tried this so far:
df['C']=df.groupby(['A','B'])['B'].transform('rank')
...but it doesn't work!
Use groupby/cumcount:
In [25]: df['C'] = df.groupby(['A','B']).cumcount()+1; df
Out[25]:
A B C
0 A a 1
1 A a 2
2 A b 1
3 B a 1
4 B a 2
5 B a 3
Use the groupby.rank function. Here is a working example:
df = pd.DataFrame({'C1':['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})
df
  C1  C2
0  a   1
1  a   2
2  a   3
3  b   4
4  b   5
df["RANK"] = df.groupby("C1")["C2"].rank(method="first", ascending=True)
df
  C1  C2  RANK
0  a   1   1.0
1  a   2   2.0
2  a   3   3.0
3  b   4   1.0
4  b   5   2.0
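As a side-by-side sketch (my addition, not from either answer): when there are no ties within a group, rank(method='first') and cumcount() + 1 produce the same numbering, so either answer works here:

```python
import pandas as pd

df = pd.DataFrame({'C1': ['a', 'a', 'a', 'b', 'b'], 'C2': [1, 2, 3, 4, 5]})

# rank numbers by the values in C2; cumcount numbers by row position.
# With strictly increasing C2 within each group, the two coincide.
by_rank = df.groupby('C1')['C2'].rank(method='first', ascending=True)
by_count = df.groupby('C1').cumcount() + 1
```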
I have a dataframe and want to sort all columns independently in descending or ascending order.
import pandas as pd
data = {'a': [5, 2, 3, 6],
'b': [7, 9, 1, 4],
'c': [1, 5, 4, 2]}
df = pd.DataFrame.from_dict(data)
a b c
0 5 7 1
1 2 9 5
2 3 1 4
3 6 4 2
When I use sort_values() for this it does not work as expected (to me) and only sorts one column:
foo = df.sort_values(by=['a', 'b', 'c'], ascending=[False, False, False])
a b c
3 6 4 2
0 5 7 1
2 3 1 4
1 2 9 5
I can get the desired result if I use the solution from this answer which applies a lambda function:
bar = df.apply(lambda x: x.sort_values().values)
print(bar)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
But this looks a bit heavy-handed to me.
What's actually happening in the sort_values() example above and how can I sort all columns in my dataframe in a pandas-way without the lambda function?
You can use numpy.sort with the DataFrame constructor:
import numpy as np

df1 = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)
print (df1)
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
EDIT:
For descending order:
arr = df.values
arr.sort(axis=0)
arr = arr[::-1]
print (arr)
[[6 9 5]
[5 7 4]
[3 4 2]
[2 1 1]]
df1 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df1)
a b c
0 6 9 5
1 5 7 4
2 3 4 2
3 2 1 1
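A compact variant of the descending version (my own sketch): reuse numpy.sort and just reverse the row order, without sorting in place:

```python
import numpy as np
import pandas as pd

data = {'a': [5, 2, 3, 6], 'b': [7, 9, 1, 4], 'c': [1, 5, 4, 2]}
df = pd.DataFrame(data)

# Sort each column ascending, then flip the rows for descending order
df1 = pd.DataFrame(np.sort(df.values, axis=0)[::-1],
                   index=df.index, columns=df.columns)
```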
sort_values sorts the entire data frame by the columns you pass to it. In your first example you sort by ['a', 'b', 'c']: first by 'a', then (to break ties in 'a') by 'b', and finally by 'c'.
Notice how, after sorting by 'a', each row stays intact. This is the expected result.
With the lambda you pass each column to sort_values on its own, so every column is sorted independently, and that's why the second approach sorts the columns as you would expect. In this case, the rows are broken apart.
If you don't want to use lambda nor numpy you can get around using this:
pd.DataFrame({x: df[x].sort_values().values for x in df.columns.values})
Output:
a b c
0 2 1 1
1 3 4 2
2 5 7 4
3 6 9 5
Given the pandas dataframe below:
print(df)
A B C
X 1 2 3
Y 4 5 6
Z 7 8 9
I need to create a simple interaction network file, or SIF file, of the format:
node1 xx node2
node1 xx node2
node1 yy node2
.
.
.
Where each line is one interaction from the df: row label, value, column label. Below is an iterative (and naive) approach to writing such a file:
with open('interaction.sif', 'w') as sif:
    for row in df.index:
        for col in df.columns:
            sif.write('{}\t{}\t{}\n'.format(row, df.at[row, col], col))
The inefficient code above produces the desired SIF file for the dataframe df:
X 1 A
X 2 B
X 3 C
Y 4 A
Y 5 B
Y 6 C
Z 7 A
Z 8 B
Z 9 C
Is there a dataframe method to write to a csv or table, for example, in the format above? Or is there a way to vectorize this operation?
You need stack with reset_index:
df = df.stack().reset_index()
df.columns = list('ABC')
df = df[['A','C','B']]
print (df)
A C B
0 X 1 A
1 X 2 B
2 X 3 C
3 Y 4 A
4 Y 5 B
5 Y 6 C
6 Z 7 A
7 Z 8 B
8 Z 9 C
And then DataFrame.to_csv:
print (df.to_csv(sep='\t', index=None, header=None))
X 1 A
X 2 B
X 3 C
Y 4 A
Y 5 B
Y 6 C
Z 7 A
Z 8 B
Z 9 C
df.to_csv('interaction.sif', sep='\t', index=None, header=None)
The function you're most likely looking for is stack,
which on its own gives the following result:
df = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 5, 8], 'C':[3, 6, 9]}, index=['X', 'Y', 'Z'])
df.stack()
X A 1
B 2
C 3
Y A 4
B 5
C 6
Z A 7
B 8
C 9
dtype: int64
which can then be easily exported to CSV:
df.stack().to_csv('sample_unordered.csv', sep='\t')
But since the order of the columns matters to you, this requires a bit more data manipulation:
df1 = df.stack().reset_index()
df1.loc[:, ['level_0', 0 ,'level_1']].to_csv('sample_ordered.csv', sep='\t', header=False, index=False)
Alternative solution would be using melt function:
df2 = pd.melt(df.reset_index(), id_vars=['index'], value_vars=['A', 'B', 'C']).sort_values('index')
df2[['index', 'value', 'variable']].to_csv('sample_melt.csv', sep='\t', header=False, index=False)
I have the following data frame and need to repeat the whole frame once for each value in a set of group labels. That is, given
import numpy as np
import pandas as pd

test3 = pd.DataFrame(data={'x': [1, 2, 3, 4, np.nan], 'y': ['a', 'a', 'a', 'b', 'b']})
test3
x y
0 1 a
1 2 a
2 3 a
3 4 b
4 NaN b
I need to do something like this, but more performant:
test3['group'] = np.nan
groups = ['a', 'b']
dfs = []
for group in groups:
    temp = test3.copy()
    temp['group'] = group
    dfs.append(temp)
pd.concat(dfs)
That is, the expected output is:
x y group
0 1 a a
1 2 a a
2 3 a a
3 4 b a
4 NaN b a
0 1 a b
1 2 a b
2 3 a b
3 4 b b
4 NaN b b
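The loop above can be collapsed into a single concat call (a sketch of my own, not from the original thread): concatenating a dict keyed by group label adds the label as an outer index level, which is then turned back into a column:

```python
import numpy as np
import pandas as pd

test3 = pd.DataFrame(data={'x': [1, 2, 3, 4, np.nan],
                           'y': ['a', 'a', 'a', 'b', 'b']})
groups = ['a', 'b']

# pd.concat with a dict adds the dict keys as an outer index level named
# 'group'; reset_index('group') moves that level into a regular column.
result = pd.concat({g: test3 for g in groups},
                   names=['group']).reset_index('group')
result = result[['x', 'y', 'group']]  # match the expected column order
```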