Adding a column in pandas using a variable - python

I am trying to understand the difference between these two statements
dataframe['newColumn'] = 'stringconst'
and
for x in y:
    if x == "value":
        csv = pd.read_csv(StringIO(table), header=None, names=None)
        dataframe['newColumn'] = csv[0]
In the first case pandas populates all the rows with the constant value, but in the second case it populates only the first row and assigns NaN to rest of the rows. Why is this? How can I assign the value in the second case to all the rows in the dataframe?

Because csv[0] is not a scalar value. It's a pd.Series, and assignment with a pd.Series aligns by index (the whole point of pandas). You get NaN everywhere except the first row because only the first row's index label exists in the pd.DataFrame's index. So, consider two data-frames (note: they are copies of each other except for the index, which is shifted by 20):
>>> df
0 1 2 3 4
0 4 -5 -1 0 3
1 -2 -2 1 3 4
2 1 2 4 4 -4
3 -5 2 -3 -5 1
4 -5 -3 1 1 -1
5 -4 0 4 -3 -4
6 -2 -5 -3 1 0
7 4 0 0 -4 -4
8 -4 4 -2 -5 4
9 1 -2 4 3 0
>>> df2
0 1 2 3 4
20 4 -5 -1 0 3
21 -2 -2 1 3 4
22 1 2 4 4 -4
23 -5 2 -3 -5 1
24 -5 -3 1 1 -1
25 -4 0 4 -3 -4
26 -2 -5 -3 1 0
27 4 0 0 -4 -4
28 -4 4 -2 -5 4
29 1 -2 4 3 0
>>> df['new'] = df[1]
>>> df
0 1 2 3 4 new
0 4 -5 -1 0 3 -5
1 -2 -2 1 3 4 -2
2 1 2 4 4 -4 2
3 -5 2 -3 -5 1 2
4 -5 -3 1 1 -1 -3
5 -4 0 4 -3 -4 0
6 -2 -5 -3 1 0 -5
7 4 0 0 -4 -4 0
8 -4 4 -2 -5 4 4
9 1 -2 4 3 0 -2
>>> df['new2'] = df2[1]
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 NaN
1 -2 -2 1 3 4 -2 NaN
2 1 2 4 4 -4 2 NaN
3 -5 2 -3 -5 1 2 NaN
4 -5 -3 1 1 -1 -3 NaN
5 -4 0 4 -3 -4 0 NaN
6 -2 -5 -3 1 0 -5 NaN
7 4 0 0 -4 -4 0 NaN
8 -4 4 -2 -5 4 4 NaN
9 1 -2 4 3 0 -2 NaN
So, one thing you can do to assign the whole column, bypassing index alignment, is to simply assign the underlying values:
>>> df['new2'] = df2[1].values
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 -5
1 -2 -2 1 3 4 -2 -2
2 1 2 4 4 -4 2 2
3 -5 2 -3 -5 1 2 2
4 -5 -3 1 1 -1 -3 -3
5 -4 0 4 -3 -4 0 0
6 -2 -5 -3 1 0 -5 -5
7 4 0 0 -4 -4 0 0
8 -4 4 -2 -5 4 4 4
9 1 -2 4 3 0 -2 -2
Or, if you want to broadcast a single value (say, the first value in the first column), select that scalar using iloc or another selector and then assign it:
>>> df['newest'] = df2.iloc[0,0]
>>> df
0 1 2 3 4 new new2 newest
0 4 -5 -1 0 3 -5 -5 4
1 -2 -2 1 3 4 -2 -2 4
2 1 2 4 4 -4 2 2 4
3 -5 2 -3 -5 1 2 2 4
4 -5 -3 1 1 -1 -3 -3 4
5 -4 0 4 -3 -4 0 0 4
6 -2 -5 -3 1 0 -5 -5 4
7 4 0 0 -4 -4 0 0 4
8 -4 4 -2 -5 4 4 4 4
9 1 -2 4 3 0 -2 -2 4
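Applied to the question's loop, the fix is the same: pick the scalar out of the parsed frame instead of assigning the Series. A minimal sketch, where `table` and `dataframe` are hypothetical stand-ins for the question's variables:

```python
import pandas as pd
from io import StringIO

# Hypothetical stand-ins for the question's `table` and `dataframe`.
table = "constvalue"
dataframe = pd.DataFrame({'a': [1, 2, 3]})

csv = pd.read_csv(StringIO(table), header=None, names=None)

# csv[0] is a Series indexed 0..n-1; assigning it aligns by index and fills
# only matching labels. Selecting the scalar broadcasts to every row instead:
dataframe['newColumn'] = csv.iloc[0, 0]
```

`csv[0].values` would also work when the lengths match, but the scalar is the right tool when the parsed frame has a single value.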

Related

How to use dictionary on np.where clause in pandas

I have the following dataframe
import pandas as pd
foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
id time col_id col_a col_b col_c
0 1 1 ffp 1 -1 10
1 1 2 ffp 2 -2 20
2 1 3 ffp 3 -3 30
3 2 1 hie 4 -4 40
4 2 2 hie 5 -5 50
5 2 3 ttt 6 -6 60
I would like to create a new col in foo, which will take the value of either col_a or col_b or col_c, depending on the value of col_id.
I am doing the following:
foo['col'] = np.where(foo.col_id == "ffp", foo.col_a,
                      np.where(foo.col_id == "hie", foo.col_b, foo.col_c))
which gives
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
Since I have a lot of columns, I was wondering if there is a cleaner way to do that, with using a dictionary for example:
dict_cols_matching = {"ffp" : "col_a", "hie": "col_b", "ttt": "col_c"}
Any ideas ?
You can map the values of the dictionary onto col_id, then perform an indexing lookup:
import numpy as np
idx, cols = pd.factorize(foo['col_id'].map(dict_cols_matching))
foo['col'] = foo.reindex(cols, axis=1).to_numpy()[np.arange(len(foo)), idx]
Output:
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
Alternatively, use np.select, which maps a condition list onto a choice list:
foo['col'] = np.select([foo.col_id.eq("ffp"), foo.col_id.eq("hie"), foo.col_id.eq("ttt")],
                       [foo.col_a, foo.col_b, foo.col_c])
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
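Since the question asks for a dictionary-driven version, the np.select condition and choice lists can be built directly from dict_cols_matching, so adding a new id/column pair only requires a new dictionary entry. A sketch:

```python
import numpy as np
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# Build the condition list and choice list from the dictionary entries.
conds = [foo['col_id'].eq(key) for key in dict_cols_matching]
choices = [foo[col] for col in dict_cols_matching.values()]
foo['col'] = np.select(conds, choices)
```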
You can use a lambda function to select the column based on the id, but this method depends on the order of the columns: adjust the offset 3 if you change the order.
import pandas as pd
import numpy as np
foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'time': [1, 2, 3, 1, 2, 3],
                    'col_id': ['ffp', 'ffp', 'ffp', 'hie', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4, 5, 6],
                    'col_b': [-1, -2, -3, -4, -5, -6],
                    'col_c': [10, 20, 30, 40, 50, 60]})
idSet = np.unique(foo['col_id'].to_numpy()).tolist()
foo['col'] = foo.apply(lambda x: x[idSet.index(x.col_id)+3], axis=1)
display(foo)
Output
id time col_id col_a col_b col_c col
0 1 1 ffp 1 -1 10 1
1 1 2 ffp 2 -2 20 2
2 1 3 ffp 3 -3 30 3
3 2 1 hie 4 -4 40 -4
4 2 2 hie 5 -5 50 -5
5 2 3 ttt 6 -6 60 60
You might use a reset_index in combination with a row-wise apply:
foo[["col_id"]].reset_index().apply(
    lambda u: foo.loc[u["index"], dict_cols_matching[u["col_id"]]], axis=1)
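The reset_index detour can also be skipped, since apply(axis=1) already hands each row to the lambda as a Series that the mapped column name can index directly. A minimal sketch (like any row-wise apply, this is slower than the vectorized options on large frames):

```python
import pandas as pd

foo = pd.DataFrame({'col_id': ['ffp', 'ffp', 'hie', 'ttt'],
                    'col_a': [1, 2, 3, 4],
                    'col_b': [-1, -2, -3, -4],
                    'col_c': [10, 20, 30, 40]})
dict_cols_matching = {"ffp": "col_a", "hie": "col_b", "ttt": "col_c"}

# Each row arrives as a Series, so the mapped column name indexes it directly.
foo['col'] = foo.apply(lambda r: r[dict_cols_matching[r['col_id']]], axis=1)
```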

How to get balance value in a new column pandas

here is my data frame
data={'first':[5,4,3,2,3], 'second':[1,2,3,4,5]}
df= pd.DataFrame(data)
first second
5 1
4 2
3 3
2 4
3 5
and I want a third column holding a running balance: 0 - 5 + 1 = -4, then -4 - 4 + 2 = -6, then -6 - 3 + 3 = -6, and so on.
first second third
5 1 -4 #0-5(first)+1(second)= balance -4
4 2 -6 #-4(balance)-4(first)+2(second)= balance -6
3 3 -6
2 4 -4
3 5 -2
You can subtract first from second and take the cumsum (cumulative sum):
df['third'] = (df['second']-df['first']).cumsum()
output:
first second third
0 5 1 -4
1 4 2 -6
2 3 3 -6
3 2 4 -4
4 3 5 -2
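As a sanity check, the one-liner reproduces the row-by-row recurrence described in the question (balance minus first plus second):

```python
import pandas as pd

df = pd.DataFrame({'first': [5, 4, 3, 2, 3], 'second': [1, 2, 3, 4, 5]})
df['third'] = (df['second'] - df['first']).cumsum()

# The recurrence from the question: balance = balance - first + second
balance, expected = 0, []
for f, s in zip(df['first'], df['second']):
    balance = balance - f + s
    expected.append(balance)
```

Both give [-4, -6, -6, -4, -2]; cumsum works because each balance is just the sum of all (second - first) differences so far.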

Numpy array to Pandas data frame formatting

Sorry if this has already been answered somewhere!
I am trying to convert a NumPy array into a pandas DataFrame, which I have done like so:
# array (object dtype)
a = np.array([[' ', '0', 'A', 'T', 'G'],
              ['0', 0, 0, 0, 0],
              ['G', 0, -3, -3, 5],
              ['G', 0, -3, -6, 2],
              ['A', 0, 5, 0, -3],
              ['A', 0, 5, 2, -3],
              ['T', 0, 0, 10, 5],
              ['G', 0, -3, 5, 15]], dtype=object)
# Output data frame using pandas
0 1 2 3 4
0 0 A T G
1 0 0 0 0 0
2 G 0 -3 -3 5
3 G 0 -3 -6 2
4 A 0 5 0 -3
5 A 0 5 2 -3
6 T 0 0 10 5
7 G 0 -3 5 15
# Output I want
0 A T G
0 0 0 0 0
G 0 -3 -3 5
G 0 -3 -6 2
A 0 5 0 -3
A 0 5 2 -3
T 0 0 10 5
G 0 -3 5 15
Any advice on how to do this would be appreciated! :)
Declare the first row to be column names and the first column to be row names:
df = pd.DataFrame(data=a[1:], columns=a[0]).set_index(' ')
df.index.name = None
# 0 A T G
#0 0 0 0 0
#G 0 -3 -3 5
#G 0 -3 -6 2
#A 0 5 0 -3
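A self-contained version of the answer, assuming `a` is an object-dtype array whose first row holds the column labels and whose first column holds the row labels:

```python
import numpy as np
import pandas as pd

a = np.array([[' ', '0', 'A', 'T', 'G'],
              ['0', 0, 0, 0, 0],
              ['G', 0, -3, -3, 5],
              ['G', 0, -3, -6, 2]], dtype=object)

# First row -> column names; the ' ' column -> index, with its name cleared.
df = pd.DataFrame(data=a[1:], columns=a[0]).set_index(' ')
df.index.name = None
```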

Count cumulative and sequential values of the same sign in Pandas series

I wrote this code that computes time since a sign change (from positive to negative or vice versa) in data frame columns.
df = pd.DataFrame({'x': [1, -4, 5, 1, -2, -4, 1, 3, 2, -4, -5, -5, -6, -1]})
for column in df.columns:
    days_since_sign_change = [0]
    for k in range(1, len(df[column])):
        last_different_sign_index = np.where(np.sign(df[column][:k]) != np.sign(df[column][k]))[0][-1]
        days_since_sign_change.append(abs(last_different_sign_index - k))
    df[column + '_days_since_sign_change'] = days_since_sign_change
    # make the count negative where the value is negative, so the column also
    # indicates whether the sign changed from negative to positive or vice versa
    df[column + '_days_since_sign_change'][df[column] < 0] *= -1
In [302]:df
Out[302]:
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 1 2
4 -2 -1
5 -4 -2
6 1 1
7 3 2
8 2 3
9 -4 -1
10 -5 -2
11 -5 -3
12 -6 -4
13 -1 -5
Issue: with large datasets (150,000 * 50,000), the python code is extremely slow. How can I speed this up?
You can use cumcount:
s = (df.groupby(df.x.gt(0).astype(int).diff().ne(0).cumsum())
       .cumcount().add(1)
     * df.x.gt(0).replace({True: 1, False: -1}))
s.iloc[0] = 0
s
Out[645]:
0 0
1 -1
2 1
3 2
4 -1
5 -2
6 1
7 2
8 3
9 -1
10 -2
11 -3
12 -4
13 -5
dtype: int64
You can surely do this without a loop. Create a sign column holding -1 where the value in x is negative and 1 otherwise. Then group on the cumulative count of sign changes (rows where the sign differs from the previous row start a new group) and take the cumulative sum within each group.
df['x_days_since_sign_change'] = (df['x'] > 0).astype(int).replace(0, -1)
df.iloc[0,1] = 0
df.groupby((df['x_days_since_sign_change'] != df['x_days_since_sign_change'].shift()).cumsum()).cumsum()
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 6 2
4 -2 -1
5 -6 -2
6 1 1
7 4 2
8 6 3
9 -4 -1
10 -9 -2
11 -14 -3
12 -20 -4
13 -21 -5

How to delete rows from a pandas DataFrame with method chaining?

Does pandas have an analogue to dplyr's filter() operation?
Basically, I'd like to be able to remove rows based on a predicate.
I can of course do df = df[condition], but that doesn't compose as nicely as method chaining.
Use query.
Consider the dataframe df
df = pd.DataFrame(
    np.random.randint(-5, 6, (10, 10)),
    columns=list('ABCDEFGHIJ'))
df
A B C D E F G H I J
0 0 4 -1 1 -3 -1 -4 -5 -1 2
1 -4 2 -1 0 5 -1 1 -3 1 4
2 3 -2 3 -2 -4 5 1 1 0 -2
3 1 4 -5 4 -3 -3 -3 -3 -4 4
4 -3 4 4 5 -2 -3 -1 3 3 -1
5 0 0 -1 -1 2 2 5 -4 -1 -1
6 -2 1 2 0 -1 -1 1 0 4 -4
7 5 2 5 2 3 2 3 -3 1 1
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can easily pipeline operations based on filtering conditions
df.query('A < 0')
A B C D E F G H I J
1 -4 2 -1 0 5 -1 1 -3 1 4
4 -3 4 4 5 -2 -3 -1 3 3 -1
6 -2 1 2 0 -1 -1 1 0 4 -4
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can include multiple conditions
df.query('A < 0 & B < -1')
A B C D E F G H I J
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can do many cool things
df.query('-3 < A < 3 & H * J > 0')
A B C D E F G H I J
5 0 0 -1 -1 2 2 5 -4 -1 -1
8 -2 -5 1 4 0 -1 4 4 -5 3
And it all gets returned as a dataframe to enable the next operation
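Because each query returns a DataFrame, filters slot into a longer chain alongside other methods. A sketch (the column names and the derived column D are illustrative):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame(np.random.randint(-5, 6, (10, 3)), columns=list('ABC'))

# Filters compose with assign/reset_index in one fluent pipeline.
result = (df.query('A > 0')
            .query('B != 0')
            .assign(D=lambda d: d['A'] + d['B'])
            .reset_index(drop=True))
```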
