I have the following dataframe:
import pandas as pd
foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
id time col_id col_a col_b col_c
0 1 1 ffp 1 -1 10
1 1 2 ffp 2 -2 20
2 1 3 ffp 3 -3 30
3 2 1 hie 4 -4 40
4 2 2 hie 5 -5 50
5 2 3 ttt 6 -6 60
I would like to create a new column in foo that takes the value of either col_a, col_b, or col_c, depending on the value of col_id.
I am doing the following:
import numpy as np

foo['col'] = np.where(foo.col_id == "ffp", foo.col_a,
                      np.where(foo.col_id == "hie", foo.col_b, foo.col_c))
which gives
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
Since I have a lot of columns, I was wondering if there is a cleaner way to do this, for example by using a dictionary:
dict_cols_matching = {"ffp" : "col_a", "hie": "col_b", "ttt": "col_c"}
Any ideas?
You can map the dictionary onto col_id, then perform an indexing lookup:
import numpy as np
idx, cols = pd.factorize(foo['col_id'].map(dict_cols_matching))
foo['col'] = foo.reindex(cols, axis=1).to_numpy()[np.arange(len(foo)), idx]
Output:
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
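An equivalent sketch of the same indexing-lookup idea, using `Index.get_indexer` instead of `factorize`/`reindex` (assuming the `dict_cols_matching` mapping from the question):

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# translate each row's col_id into a column position, then pick one cell per row
col_pos = foo.columns.get_indexer(foo['col_id'].map(dict_cols_matching))
foo['col'] = foo.to_numpy()[np.arange(len(foo)), col_pos]
```

Either way, the heavy lifting is one vectorized fancy-indexing step rather than a per-row Python loop.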
Alternatively, use the np.select function, which maps a list of conditions to a list of choices:
foo['col'] = np.select([foo.col_id.eq("ffp"), foo.col_id.eq("hie"), foo.col_id.eq("ttt")],
                       [foo.col_a, foo.col_b, foo.col_c])
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
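Since the question asks for a dictionary-driven version, the condition and choice lists for `np.select` can also be built directly from `dict_cols_matching` (a sketch of the same approach):

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# one condition and one choice column per dictionary entry
conditions = [foo['col_id'].eq(key) for key in dict_cols_matching]
choices = [foo[col] for col in dict_cols_matching.values()]
foo['col'] = np.select(conditions, choices)
```

Adding a new mapping is then a one-line change to the dictionary instead of another nested condition.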
You can use a lambda function to select the column based on col_id. Note that this method depends on the order of the columns: adjust the offset 3 if the column order changes.
import pandas as pd
import numpy as np
foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
idSet = np.unique(foo['col_id'].to_numpy()).tolist()
foo['col'] = foo.apply(lambda x: x[idSet.index(x.col_id)+3], axis=1)
display(foo)
Output
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
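To avoid the positional `+3` offset entirely, the same row-wise idea can use the question's dictionary directly (slower than the vectorized answers above, but independent of column order):

```python
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# look up the target column name per row, then take that cell
foo['col'] = foo.apply(lambda row: row[dict_cols_matching[row['col_id']]], axis=1)
```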
You might use reset_index in combination with a row-wise apply:
foo[["col_id"]].reset_index().apply(lambda u: foo.loc[u["index"],dict_cols_matching[u["col_id"]]], axis=1)
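A plain-Python equivalent of that row-wise lookup, written as a list comprehension over `col_id` (a sketch using the question's data and `dict_cols_matching`):

```python
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# for each row label i and its col_id value c, fetch the mapped cell
foo['col'] = [foo.at[i, dict_cols_matching[c]] for i, c in foo['col_id'].items()]
```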
Here is my data frame:
data={'first':[5,4,3,2,3], 'second':[1,2,3,4,5]}
df= pd.DataFrame(data)
first second
5 1
4 2
3 3
2 4
3 5
I want a third column that keeps a running balance: 0 - 5 + 1 = -4, then -4 - 4 + 2 = -6, then -6 - 3 + 3 = -6, and so on. (Sorry, my English is not very good.)
first second third
5 1 -4 #0-5(first)+1(second)= balance -4
4 2 -6 #-4(balance)-4(first)+2(second)= balance -6
3 3 -6
2 4 -4
3 5 -2
You can subtract first from second and take the cumsum (cumulative sum):
df['third'] = (df['second']-df['first']).cumsum()
Output:
first second third
0 5 1 -4
1 4 2 -6
2 3 3 -6
3 2 4 -4
4 3 5 -2
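To see why the cumulative sum gives the running balance, here is the same computation written as the explicit loop described in the question (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'first': [5, 4, 3, 2, 3], 'second': [1, 2, 3, 4, 5]})

balance = 0
third = []
for f, s in zip(df['first'], df['second']):
    balance = balance - f + s  # previous balance, minus 'first', plus 'second'
    third.append(balance)
df['third'] = third
# same result as: (df['second'] - df['first']).cumsum()
```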
Sorry if this has already been answered somewhere!
I am trying to format an array in numpy to a data frame in pandas, which I have done like so:
# array
a = [[' ' '0' 'A' 'T' 'G']
['0' 0 0 0 0]
['G' 0 -3 -3 5]
['G' 0 -3 -6 2]
['A' 0 5 0 -3]
['A' 0 5 2 -3]
['T' 0 0 10 5]
['G' 0 -3 5 15]]
# Output data frame using pandas
0 1 2 3 4
0 0 A T G
1 0 0 0 0 0
2 G 0 -3 -3 5
3 G 0 -3 -6 2
4 A 0 5 0 -3
5 A 0 5 2 -3
6 T 0 0 10 5
7 G 0 -3 5 15
# Output I want
0 A T G
0 0 0 0 0
G 0 -3 -3 5
G 0 -3 -6 2
A 0 5 0 -3
A 0 5 2 -3
T 0 0 10 5
G 0 -3 5 15
Any advice on how to do this would be appreciated! :)
Declare the first row to be column names and the first column to be row names:
df = pd.DataFrame(data=a[1:], columns=a[0]).set_index(' ')
df.index.name = None
#   0  A   T   G
#0  0  0   0   0
#G  0 -3  -3   5
#G  0 -3  -6   2
#A  0  5   0  -3
#A  0  5   2  -3
#T  0  0  10   5
#G  0 -3   5  15
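If `a` is a NumPy object array (as the printed output suggests), the same split can be done by slicing, which also lets you cast the numeric body to integers; a sketch:

```python
import numpy as np
import pandas as pd

a = np.array([[' ', '0', 'A', 'T', 'G'],
              ['0', 0, 0, 0, 0],
              ['G', 0, -3, -3, 5],
              ['G', 0, -3, -6, 2],
              ['A', 0, 5, 0, -3],
              ['A', 0, 5, 2, -3],
              ['T', 0, 0, 10, 5],
              ['G', 0, -3, 5, 15]], dtype=object)

# first row -> column labels, first column -> index, the rest -> integer body
df = pd.DataFrame(a[1:, 1:].astype(int), index=a[1:, 0], columns=a[0, 1:])
```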
I wrote this code that computes time since a sign change (from positive to negative or vice versa) in data frame columns.
df = pd.DataFrame({'x': [1, -4, 5, 1, -2, -4, 1, 3, 2, -4, -5, -5, -6, -1]})
for column in df.columns:
    days_since_sign_change = [0]
    for k in range(1, len(df[column])):
        last_different_sign_index = np.where(np.sign(df[column][:k]) != np.sign(df[column][k]))[0][-1]
        days_since_sign_change.append(abs(last_different_sign_index - k))
    new_col = column + '_days_since_sign_change'
    df[new_col] = days_since_sign_change
    # negate where the value is negative, so the sign of "days_since_sign_change"
    # also indicates whether the change was from - to + or from + to -
    # (.loc avoids the chained-assignment pitfall of df[new_col][mask] = ...)
    df.loc[df[column] < 0, new_col] *= -1
In [302]: df
Out[302]:
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 1 2
4 -2 -1
5 -4 -2
6 1 1
7 3 2
8 2 3
9 -4 -1
10 -5 -2
11 -5 -3
12 -6 -4
13 -1 -5
Issue: with large datasets (150,000 * 50,000), the python code is extremely slow. How can I speed this up?
You can use cumcount:
s = (df.groupby(df.x.gt(0).astype(int).diff().ne(0).cumsum())
       .cumcount().add(1)
     * df.x.gt(0).replace({True: 1, False: -1}))
s.iloc[0]=0
s
Out[645]:
0 0
1 -1
2 1
3 2
4 -1
5 -2
6 1
7 2
8 3
9 -1
10 -2
11 -3
12 -4
13 -5
dtype: int64
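The one-liner packs several steps together; broken into named pieces (same logic, assuming the question's df), it reads:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, -4, 5, 1, -2, -4, 1, 3, 2, -4, -5, -5, -6, -1]})

sign = np.sign(df['x'])                  # +1 or -1 per row
run_id = sign.ne(sign.shift()).cumsum()  # new label each time the sign flips
s = sign.groupby(run_id).cumcount().add(1) * sign  # 1-based position in run, signed
s.iloc[0] = 0                            # the first row has no previous sign change
```

Everything is vectorized or a single grouped pass, which is what makes it fast on large frames.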
You can surely do this without a loop. Create a sign column with -1 if the value in x is less than or equal to 0 and 1 otherwise. Then group that column by runs of consecutive equal values (a new group starts whenever the sign differs from the previous row) and take the cumulative sum within each group.
df['x_days_since_sign_change'] = (df['x'] > 0).astype(int).replace(0, -1)
df.iloc[0, 1] = 0
groups = (df['x_days_since_sign_change'] != df['x_days_since_sign_change'].shift()).cumsum()
# cumsum only the sign column, so the original x column is left untouched
df['x_days_since_sign_change'] = df.groupby(groups)['x_days_since_sign_change'].cumsum()
df
    x  x_days_since_sign_change
0   1                         0
1  -4                        -1
2   5                         1
3   1                         2
4  -2                        -1
5  -4                        -2
6   1                         1
7   3                         2
8   2                         3
9  -4                        -1
10 -5                        -2
11 -5                        -3
12 -6                        -4
13 -1                        -5
Does pandas have an analogue to dplyr's filter() operation?
Basically I'd like to be able to remove rows based on a predicate.
I can of course do df = df[condition], but that doesn't compose as nicely as method chaining.
Use query.
Consider the dataframe df
df = pd.DataFrame(
np.random.randint(-5, 6, (10, 10)),
columns=list('ABCDEFGHIJ'))
df
A B C D E F G H I J
0 0 4 -1 1 -3 -1 -4 -5 -1 2
1 -4 2 -1 0 5 -1 1 -3 1 4
2 3 -2 3 -2 -4 5 1 1 0 -2
3 1 4 -5 4 -3 -3 -3 -3 -4 4
4 -3 4 4 5 -2 -3 -1 3 3 -1
5 0 0 -1 -1 2 2 5 -4 -1 -1
6 -2 1 2 0 -1 -1 1 0 4 -4
7 5 2 5 2 3 2 3 -3 1 1
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can easily pipeline operations based on filtering conditions:
df.query('A < 0')
A B C D E F G H I J
1 -4 2 -1 0 5 -1 1 -3 1 4
4 -3 4 4 5 -2 -3 -1 3 3 -1
6 -2 1 2 0 -1 -1 1 0 4 -4
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can include multiple conditions
df.query('A < 0 & B < -1')
A B C D E F G H I J
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can do many cool things
df.query('-3 < A < 3 & H * J > 0')
A B C D E F G H I J
5 0 0 -1 -1 2 2 5 -4 -1 -1
8 -2 -5 1 4 0 -1 4 4 -5 3
And each call returns a DataFrame, so the result feeds directly into the next operation in the chain.
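As a sketch of the chaining style the question is after, `query` composes cleanly with `assign` and further `query` calls because each step returns a DataFrame (the derived column `K` here is purely illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so this sketch is reproducible
df = pd.DataFrame(np.random.randint(-5, 6, (10, 10)), columns=list('ABCDEFGHIJ'))

result = (df
          .query('A < 0')                       # keep rows where A is negative
          .assign(K=lambda d: d['A'] + d['B'])  # derive a new column mid-chain
          .query('K < 0'))                      # filter again on the derived column
```

This is roughly the pandas analogue of a dplyr `filter() %>% mutate() %>% filter()` pipeline.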