I have the following dataframe:
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
id time col_id col_a col_b col_c
0 1 1 ffp 1 -1 10
1 1 2 ffp 2 -2 20
2 1 3 ffp 3 -3 30
3 2 1 hie 4 -4 40
4 2 2 hie 5 -5 50
5 2 3 ttt 6 -6 60
I would like to create a new column in foo which takes the value of either col_a, col_b, or col_c, depending on the value of col_id.
I am doing the following:
import numpy as np

foo['col'] = np.where(foo.col_id == "ffp", foo.col_a,
                      np.where(foo.col_id == "hie", foo.col_b, foo.col_c))
which gives
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
Since I have a lot of columns, I was wondering if there is a cleaner way to do this, for example using a dictionary:
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}
Any ideas?
You can map the dictionary values onto col_id, then perform an indexing lookup:
import numpy as np

# map each col_id to its matching column name, then factorize into
# positional codes (idx) and the unique column names (cols)
idx, cols = pd.factorize(foo['col_id'].map(dict_cols_matching))
# align the frame to those columns and pick one value per row
foo['col'] = foo.reindex(cols, axis=1).to_numpy()[np.arange(len(foo)), idx]
Output:
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
With np.select, you can pair a list of conditions with a list of choices:
foo['col'] = np.select([foo.col_id.eq("ffp"), foo.col_id.eq("hie"), foo.col_id.eq("ttt")],
                       [foo.col_a, foo.col_b, foo.col_c])
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
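Since you already have dict_cols_matching, the same np.select call can also be built from the dictionary instead of being spelled out; a minimal sketch, assuming the mapping from the question:
conditions = [foo.col_id.eq(k) for k in dict_cols_matching]
choices = [foo[v] for v in dict_cols_matching.values()]
foo['col'] = np.select(conditions, choices)
This keeps the condition and choice lists in sync automatically if more columns are added to the mapping.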
You can use a lambda function to select the column based on your id, but this method depends on the order of the columns; adjust the offset of 3 if you change the order.
import pandas as pd
import numpy as np
foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
# sorted unique col_id values; their order happens to match col_a/col_b/col_c
idSet = np.unique(foo['col_id'].to_numpy()).tolist()
# the offset of 3 skips the id, time and col_id columns
foo['col'] = foo.apply(lambda x: x.iloc[idSet.index(x.col_id) + 3], axis=1)
display(foo)  # display() assumes a Jupyter/IPython environment
Output
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
You might use reset_index in combination with a row-wise apply:
foo[["col_id"]].reset_index().apply(lambda u: foo.loc[u["index"], dict_cols_matching[u["col_id"]]], axis=1)
I have a CSV file with a column something like this:
Comulative
-1
-3
-4
1
2
5
-1
-4
-8
1
3
5
10
I would like to add another column that counts rows since the last sign shift, to get something like this:
Comulative  Score
-1          1
-3          2
-4          3
1           1
2           2
5           3
-1          1
-4          2
-8          3
1           1
3           2
5           3
10          4
In my original CSV file, the Comulative column usually keeps the same sign for roughly 100 to 500 rows; here, for clarity, it changes much more often!
Can you tell me the best way to do this?
Get the sign with numpy.sign, then use a custom groupby with cumcount:
# get sign
s = np.sign(df['Comulative'])
# group by consecutive signs
group = s.ne(s.shift()).cumsum()
# enumerate
df['Score'] = s.groupby(group).cumcount().add(1)
NB. if you want to consider 0 as part of the positive numbers, use s = df['Comulative'].gt(0).
output:
Comulative Score
0 -1 1
1 -3 2
2 -4 3
3 1 1
4 2 2
5 5 3
6 -1 1
7 -4 2
8 -8 3
9 1 1
10 3 2
11 5 3
12 10 4
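If the column comes straight from your CSV, the same pipeline applies directly after reading the file; a sketch with an assumed filename and output path:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')  # hypothetical filename
s = np.sign(df['Comulative'])
df['Score'] = s.groupby(s.ne(s.shift()).cumsum()).cumcount().add(1)
df.to_csv('data_with_score.csv', index=False)  # hypothetical output path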
mozway's use of np.sign is far prettier, but for a more drawn-out answer, you could do something like this:
# Marks True every time the previous value is positive and the current
# is negative, or vice versa.
groups = (df.Comulative.lt(0) & df.Comulative.shift().gt(0)
          | df.Comulative.gt(0) & df.Comulative.shift().lt(0)).cumsum()
df['Score'] = df.groupby(groups).cumcount().add(1)
Output:
Comulative Score
0 -1 1
1 -3 2
2 -4 3
3 1 1
4 2 2
5 5 3
6 -1 1
7 -4 2
8 -8 3
9 1 1
10 3 2
11 5 3
12 10 4
Sorry if this has already been answered somewhere!
I am trying to turn a numpy array into a pandas data frame, which I have done like so:
# array
a = [[' ' '0' 'A' 'T' 'G']
['0' 0 0 0 0]
['G' 0 -3 -3 5]
['G' 0 -3 -6 2]
['A' 0 5 0 -3]
['A' 0 5 2 -3]
['T' 0 0 10 5]
['G' 0 -3 5 15]]
# Output data frame using pandas
0 1 2 3 4
0 0 A T G
1 0 0 0 0 0
2 G 0 -3 -3 5
3 G 0 -3 -6 2
4 A 0 5 0 -3
5 A 0 5 2 -3
6 T 0 0 10 5
7 G 0 -3 5 15
# Output I want
0 A T G
0 0 0 0 0
G 0 -3 -3 5
G 0 -3 -6 2
A 0 5 0 -3
A 0 5 2 -3
T 0 0 10 5
G 0 -3 5 15
Any advice on how to do this would be appreciated! :)
Declare the first row to be column names and the first column to be row names:
df = pd.DataFrame(data=a[1:], columns=a[0]).set_index(' ')
df.index.name = None
# 0 A T G
#0 0 0 0 0
#G 0 -3 -3 5
#G 0 -3 -6 2
#A 0 5 0 -3
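One caveat: because the source array holds strings, every column of the resulting frame is object dtype; if you want real numbers, you could convert afterwards, e.g. with pd.to_numeric (a small sketch):
df = df.apply(pd.to_numeric)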
I wrote this code that computes time since a sign change (from positive to negative or vice versa) in data frame columns.
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, -4, 5, 1, -2, -4, 1, 3, 2, -4, -5, -5, -6, -1]})

for column in df.columns:
    days_since_sign_change = [0]
    for k in range(1, len(df[column])):
        last_different_sign_index = np.where(np.sign(df[column][:k]) != np.sign(df[column][k]))[0][-1]
        days_since_sign_change.append(abs(last_different_sign_index - k))
    df[column + '_days_since_sign_change'] = days_since_sign_change
    # this final stage lets the "days_since_sign_change" column also indicate whether
    # the sign changed from negative to positive or from positive to negative
    df[column + '_days_since_sign_change'][df[column] < 0] = df[column + '_days_since_sign_change'] * -1
In [302]: df
Out[302]:
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 1 2
4 -2 -1
5 -4 -2
6 1 1
7 3 2
8 2 3
9 -4 -1
10 -5 -2
11 -5 -3
12 -6 -4
13 -1 -5
Issue: with large datasets (150,000 * 50,000), the python code is extremely slow. How can I speed this up?
You can use cumcount:
s = (df.groupby(df.x.gt(0).astype(int).diff().ne(0).cumsum())
       .cumcount().add(1) * df.x.gt(0).replace({True: 1, False: -1}))
s.iloc[0]=0
s
Out[645]:
0 0
1 -1
2 1
3 2
4 -1
5 -2
6 1
7 2
8 3
9 -1
10 -2
11 -3
12 -4
13 -5
dtype: int64
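To attach the result to the frame, you could then assign it back, e.g.:
df['x_days_since_sign_change'] = s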
You can surely do this without a loop. Create a sign column holding -1 where the value in x is negative and 1 otherwise, group that column on runs where the sign differs from the previous row, and take the cumulative sum within each group.
df['x_days_since_sign_change'] = (df['x'] > 0).astype(int).replace(0, -1)
df.iloc[0,1] = 0
df.groupby((df['x_days_since_sign_change'] != df['x_days_since_sign_change'].shift()).cumsum()).cumsum()
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 6 2
4 -2 -1
5 -6 -2
6 1 1
7 4 2
8 6 3
9 -4 -1
10 -9 -2
11 -14 -3
12 -20 -4
13 -21 -5
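Note that the groupby cumsum above also accumulates x itself, which is why the x column in this output differs from the input. To leave x untouched, you could cumulate only the sign column; a sketch of that variant:
groups = (df['x_days_since_sign_change'] != df['x_days_since_sign_change'].shift()).cumsum()
df['x_days_since_sign_change'] = df.groupby(groups)['x_days_since_sign_change'].cumsum()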
I am trying to understand the difference between these two statements
dataframe['newColumn'] = 'stringconst'
and
for x in y:
    if x == "value":
        csv = pd.read_csv(StringIO(table), header=None, names=None)
        dataframe['newColumn'] = csv[0]
In the first case pandas populates all the rows with the constant value, but in the second case it populates only the first row and assigns NaN to rest of the rows. Why is this? How can I assign the value in the second case to all the rows in the dataframe?
Because csv[0] is not a scalar value. It's a pd.Series, and assignment with a pd.Series aligns on the index (the whole point of pandas); you are probably getting NaN everywhere except the first row because only the first row's index aligns with the DataFrame's index. So, consider two data frames (note: they are copies, except that the index of the second is shifted by 20):
>>> df
0 1 2 3 4
0 4 -5 -1 0 3
1 -2 -2 1 3 4
2 1 2 4 4 -4
3 -5 2 -3 -5 1
4 -5 -3 1 1 -1
5 -4 0 4 -3 -4
6 -2 -5 -3 1 0
7 4 0 0 -4 -4
8 -4 4 -2 -5 4
9 1 -2 4 3 0
>>> df2
0 1 2 3 4
20 4 -5 -1 0 3
21 -2 -2 1 3 4
22 1 2 4 4 -4
23 -5 2 -3 -5 1
24 -5 -3 1 1 -1
25 -4 0 4 -3 -4
26 -2 -5 -3 1 0
27 4 0 0 -4 -4
28 -4 4 -2 -5 4
29 1 -2 4 3 0
>>> df['new'] = df[1]
>>> df
0 1 2 3 4 new
0 4 -5 -1 0 3 -5
1 -2 -2 1 3 4 -2
2 1 2 4 4 -4 2
3 -5 2 -3 -5 1 2
4 -5 -3 1 1 -1 -3
5 -4 0 4 -3 -4 0
6 -2 -5 -3 1 0 -5
7 4 0 0 -4 -4 0
8 -4 4 -2 -5 4 4
9 1 -2 4 3 0 -2
>>> df['new2'] = df2[1]
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 NaN
1 -2 -2 1 3 4 -2 NaN
2 1 2 4 4 -4 2 NaN
3 -5 2 -3 -5 1 2 NaN
4 -5 -3 1 1 -1 -3 NaN
5 -4 0 4 -3 -4 0 NaN
6 -2 -5 -3 1 0 -5 NaN
7 4 0 0 -4 -4 0 NaN
8 -4 4 -2 -5 4 4 NaN
9 1 -2 4 3 0 -2 NaN
So, one thing you can do to assign the whole column is to simply assign the values:
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 NaN
1 -2 -2 1 3 4 -2 NaN
2 1 2 4 4 -4 2 NaN
3 -5 2 -3 -5 1 2 NaN
4 -5 -3 1 1 -1 -3 NaN
5 -4 0 4 -3 -4 0 NaN
6 -2 -5 -3 1 0 -5 NaN
7 4 0 0 -4 -4 0 NaN
8 -4 4 -2 -5 4 4 NaN
9 1 -2 4 3 0 -2 NaN
>>> df['new2'] = df2[1].values
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 -5
1 -2 -2 1 3 4 -2 -2
2 1 2 4 4 -4 2 2
3 -5 2 -3 -5 1 2 2
4 -5 -3 1 1 -1 -3 -3
5 -4 0 4 -3 -4 0 0
6 -2 -5 -3 1 0 -5 -5
7 4 0 0 -4 -4 0 0
8 -4 4 -2 -5 4 4 4
9 1 -2 4 3 0 -2 -2
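Equivalently, you could strip the misaligned index before assigning, e.g. with reset_index(drop=True) or the newer to_numpy():
df['new2'] = df2[1].to_numpy()             # same as .values in modern pandas
# or
df['new2'] = df2[1].reset_index(drop=True)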
Or, if you want to broadcast a single value (here, the first value in the first column of df2), select that scalar with iloc or another selector and then assign it:
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 -5
1 -2 -2 1 3 4 -2 -2
2 1 2 4 4 -4 2 2
3 -5 2 -3 -5 1 2 2
4 -5 -3 1 1 -1 -3 -3
5 -4 0 4 -3 -4 0 0
6 -2 -5 -3 1 0 -5 -5
7 4 0 0 -4 -4 0 0
8 -4 4 -2 -5 4 4 4
9 1 -2 4 3 0 -2 -2
>>> df['newest'] = df2.iloc[0,0]
>>> df
0 1 2 3 4 new new2 newest
0 4 -5 -1 0 3 -5 -5 4
1 -2 -2 1 3 4 -2 -2 4
2 1 2 4 4 -4 2 2 4
3 -5 2 -3 -5 1 2 2 4
4 -5 -3 1 1 -1 -3 -3 4
5 -4 0 4 -3 -4 0 0 4
6 -2 -5 -3 1 0 -5 -5 4
7 4 0 0 -4 -4 0 0 4
8 -4 4 -2 -5 4 4 4 4
9 1 -2 4 3 0 -2 -2 4