Sorry if this has already been answered somewhere!
I am trying to convert a NumPy array to a pandas DataFrame, which I have done like so:
# array
a = [[' ', '0', 'A', 'T', 'G'],
     ['0', 0, 0, 0, 0],
     ['G', 0, -3, -3, 5],
     ['G', 0, -3, -6, 2],
     ['A', 0, 5, 0, -3],
     ['A', 0, 5, 2, -3],
     ['T', 0, 0, 10, 5],
     ['G', 0, -3, 5, 15]]
# Output data frame using pandas
   0  1   2   3   4
0     0   A   T   G
1  0  0   0   0   0
2  G  0  -3  -3   5
3  G  0  -3  -6   2
4  A  0   5   0  -3
5  A  0   5   2  -3
6  T  0   0  10   5
7  G  0  -3   5  15
# Output I want
   0  A   T   G
0  0  0   0   0
G  0 -3  -3   5
G  0 -3  -6   2
A  0  5   0  -3
A  0  5   2  -3
T  0  0  10   5
G  0 -3   5  15
Any advice on how to do this would be appreciated! :)
Declare the first row to be column names and the first column to be row names:
df = pd.DataFrame(data=a[1:], columns=a[0]).set_index(' ')
df.index.name = None
#    0  A   T   G
# 0  0  0   0   0
# G  0 -3  -3   5
# G  0 -3  -6   2
# A  0  5   0  -3
# A  0  5   2  -3
# T  0  0  10   5
# G  0 -3   5  15
I wrote this code that computes time since a sign change (from positive to negative or vice versa) in data frame columns.
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, -4, 5, 1, -2, -4, 1, 3, 2, -4, -5, -5, -6, -1]})

for column in df.columns:
    days_since_sign_change = [0]
    for k in range(1, len(df[column])):
        # index of the most recent row whose sign differs from row k
        last_different_sign_index = np.where(
            np.sign(df[column][:k]) != np.sign(df[column][k]))[0][-1]
        days_since_sign_change.append(abs(last_different_sign_index - k))
    df[column + '_days_since_sign_change'] = days_since_sign_change
    # flip the count to negative where the value itself is negative, so the
    # "days_since_sign_change" column also indicates whether the sign changed
    # from negative to positive or from positive to negative
    df.loc[df[column] < 0, column + '_days_since_sign_change'] *= -1
In [302]: df
Out[302]:
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 1 2
4 -2 -1
5 -4 -2
6 1 1
7 3 2
8 2 3
9 -4 -1
10 -5 -2
11 -5 -3
12 -6 -4
13 -1 -5
Issue: with large datasets (150,000 rows × 50,000 columns), this Python code is extremely slow. How can I speed it up?
You can use cumcount: label each run of consecutive same-sign values, count the 1-based position within the run, and negate the count for negative runs:
s = (df.groupby(df.x.gt(0).astype(int).diff().ne(0).cumsum())
       .cumcount().add(1)
     * df.x.gt(0).replace({True: 1, False: -1}))
s.iloc[0] = 0
s
Out[645]:
0 0
1 -1
2 1
3 2
4 -1
5 -2
6 1
7 2
8 3
9 -1
10 -2
11 -3
12 -4
13 -5
dtype: int64
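To see how that one-liner works, it can be unpacked step by step (a sketch; the intermediate names are mine):
sign = df.x.gt(0)                                  # True for positive values
run_id = sign.astype(int).diff().ne(0).cumsum()    # new group id at every sign change
position = df.groupby(run_id).cumcount().add(1)    # 1-based position within each run
s = position * sign.replace({True: 1, False: -1})  # negate counts in negative runs
s.iloc[0] = 0                                      # the first row has no prior sign change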
You can surely do this without a loop. Create a sign column that holds -1 if the value in x is negative and 1 otherwise. Then group the rows into runs of equal sign (a new group starts wherever the sign differs from the previous row) and take the cumulative sum within each group.
df['x_days_since_sign_change'] = (df['x'] > 0).astype(int).replace(0, -1)
df.iloc[0, 1] = 0
run_id = (df['x_days_since_sign_change'] != df['x_days_since_sign_change'].shift()).cumsum()
df.groupby(run_id).cumsum()
x x_days_since_sign_change
0 1 0
1 -4 -1
2 5 1
3 6 2
4 -2 -1
5 -6 -2
6 1 1
7 4 2
8 6 3
9 -4 -1
10 -9 -2
11 -14 -3
12 -20 -4
13 -21 -5
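Note that groupby(run_id).cumsum() cumulatively sums every column, which is why the x column in the output above is cumsummed within each run as well. To leave x untouched, restrict the cumulative sum to the helper column (a sketch, reusing run_id from above):
df['x_days_since_sign_change'] = df.groupby(run_id)['x_days_since_sign_change'].cumsum()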
I have a DataFrame that looks like the one below.
T$QOOR
3
14
12
-6
-19
9
I want to split the positive and negative values into new columns.
sls_item['SALES'] = sls_item['T$QOOR'].apply(lambda x: x if x >= 0 else 0)
sls_item['RETURN'] = sls_item['T$QOOR'].apply(lambda x: x*-1 if x < 0 else 0)
The result will be as below.
T$QOOR  SALES  RETURN
     3      3       0
    14     14       0
    12     12       0
    -6      0       6
   -19      0      19
     9      9       0
Any better and cleaner way to do so other than using apply?
Solution with clip_lower and clip_upper; mul is used for the multiplication by -1:
sls_item['SALES'] = sls_item['T$QOOR'].clip_lower(0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip_upper(0).mul(-1)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
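Note that clip_lower and clip_upper were deprecated and later removed in newer pandas releases; the equivalent with clip is:
sls_item['SALES'] = sls_item['T$QOOR'].clip(lower=0)
sls_item['RETURN'] = sls_item['T$QOOR'].clip(upper=0).mul(-1)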
Use where or numpy.where:
sls_item['SALES'] = sls_item['T$QOOR'].where(lambda x: x >= 0, 0)
sls_item['RETURN'] = sls_item['T$QOOR'].where(lambda x: x < 0, 0) * -1
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
mask = sls_item['T$QOOR'] >= 0
sls_item['SALES'] = np.where(mask, sls_item['T$QOOR'], 0)
sls_item['RETURN'] = np.where(~mask, sls_item['T$QOOR'] * -1, 0)
print (sls_item)
T$QOOR SALES RETURN
0 3 3 0
1 14 14 0
2 12 12 0
3 -6 0 6
4 -19 0 19
5 9 9 0
assign + where
df.assign(po=df.where(df['T$QOOR'] > 0, 0), ne=df.where(df['T$QOOR'] < 0, 0))
Out[1355]:
T$QOOR ne po
0 3 0 3
1 14 0 14
2 12 0 12
3 -6 -6 0
4 -19 -19 0
5 9 0 9
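The same idea can be written against the column as a Series rather than the whole frame, which also works when sls_item has more columns than T$QOOR (a sketch):
s = sls_item['T$QOOR']
sls_item = sls_item.assign(po=s.where(s > 0, 0), ne=s.where(s < 0, 0))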
I am trying to understand the difference between these two statements
dataframe['newColumn'] = 'stringconst'
and
for x in y:
    if x == "value":
        csv = pd.read_csv(StringIO(table), header=None, names=None)
        dataframe['newColumn'] = csv[0]
In the first case pandas populates all the rows with the constant value, but in the second case it populates only the first row and assigns NaN to the rest of the rows. Why is this? How can I assign the value in the second case to all the rows in the dataframe?
Because csv[0] is not a scalar value. It's a pd.Series, and when you assign a pd.Series, pandas tries to align it by index (the whole point of pandas); you are probably getting NaN everywhere except the first row because only the first row's index aligns with the pd.DataFrame's index. So, consider two data frames (note: they are copies of each other, except that the index of the second is shifted by 20):
>>> df
0 1 2 3 4
0 4 -5 -1 0 3
1 -2 -2 1 3 4
2 1 2 4 4 -4
3 -5 2 -3 -5 1
4 -5 -3 1 1 -1
5 -4 0 4 -3 -4
6 -2 -5 -3 1 0
7 4 0 0 -4 -4
8 -4 4 -2 -5 4
9 1 -2 4 3 0
>>> df2
0 1 2 3 4
20 4 -5 -1 0 3
21 -2 -2 1 3 4
22 1 2 4 4 -4
23 -5 2 -3 -5 1
24 -5 -3 1 1 -1
25 -4 0 4 -3 -4
26 -2 -5 -3 1 0
27 4 0 0 -4 -4
28 -4 4 -2 -5 4
29 1 -2 4 3 0
>>> df['new'] = df[1]
>>> df
0 1 2 3 4 new
0 4 -5 -1 0 3 -5
1 -2 -2 1 3 4 -2
2 1 2 4 4 -4 2
3 -5 2 -3 -5 1 2
4 -5 -3 1 1 -1 -3
5 -4 0 4 -3 -4 0
6 -2 -5 -3 1 0 -5
7 4 0 0 -4 -4 0
8 -4 4 -2 -5 4 4
9 1 -2 4 3 0 -2
>>> df['new2'] = df2[1]
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 NaN
1 -2 -2 1 3 4 -2 NaN
2 1 2 4 4 -4 2 NaN
3 -5 2 -3 -5 1 2 NaN
4 -5 -3 1 1 -1 -3 NaN
5 -4 0 4 -3 -4 0 NaN
6 -2 -5 -3 1 0 -5 NaN
7 4 0 0 -4 -4 0 NaN
8 -4 4 -2 -5 4 4 NaN
9 1 -2 4 3 0 -2 NaN
So, one thing you can do to assign the whole column is to simply assign the values:
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 NaN
1 -2 -2 1 3 4 -2 NaN
2 1 2 4 4 -4 2 NaN
3 -5 2 -3 -5 1 2 NaN
4 -5 -3 1 1 -1 -3 NaN
5 -4 0 4 -3 -4 0 NaN
6 -2 -5 -3 1 0 -5 NaN
7 4 0 0 -4 -4 0 NaN
8 -4 4 -2 -5 4 4 NaN
9 1 -2 4 3 0 -2 NaN
>>> df['new2'] = df2[1].values
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 -5
1 -2 -2 1 3 4 -2 -2
2 1 2 4 4 -4 2 2
3 -5 2 -3 -5 1 2 2
4 -5 -3 1 1 -1 -3 -3
5 -4 0 4 -3 -4 0 0
6 -2 -5 -3 1 0 -5 -5
7 4 0 0 -4 -4 0 0
8 -4 4 -2 -5 4 4 4
9 1 -2 4 3 0 -2 -2
Or, if you want to assign the first value of df2's first column to the whole new column, actually select that value as a scalar using iloc (or another selector) and then assign it:
>>> df
0 1 2 3 4 new new2
0 4 -5 -1 0 3 -5 -5
1 -2 -2 1 3 4 -2 -2
2 1 2 4 4 -4 2 2
3 -5 2 -3 -5 1 2 2
4 -5 -3 1 1 -1 -3 -3
5 -4 0 4 -3 -4 0 0
6 -2 -5 -3 1 0 -5 -5
7 4 0 0 -4 -4 0 0
8 -4 4 -2 -5 4 4 4
9 1 -2 4 3 0 -2 -2
>>> df['newest'] = df2.iloc[0,0]
>>> df
0 1 2 3 4 new new2 newest
0 4 -5 -1 0 3 -5 -5 4
1 -2 -2 1 3 4 -2 -2 4
2 1 2 4 4 -4 2 2 4
3 -5 2 -3 -5 1 2 2 4
4 -5 -3 1 1 -1 -3 -3 4
5 -4 0 4 -3 -4 0 0 4
6 -2 -5 -3 1 0 -5 -5 4
7 4 0 0 -4 -4 0 0 4
8 -4 4 -2 -5 4 4 4 4
9 1 -2 4 3 0 -2 -2 4
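Another option that ignores the mismatched index is to reset it on the right-hand side so it lines up with df again (a sketch):
df['new2'] = df2[1].reset_index(drop=True)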
Does pandas have an analogue to dplyr's filter() operation?
Basically I'd like to be able to remove rows based on a predicate.
I can of course do df = df[condition], but that doesn't compose as nicely as method chaining.
Use query.
Consider the DataFrame df:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(-5, 6, (10, 10)),
    columns=list('ABCDEFGHIJ'))
df
A B C D E F G H I J
0 0 4 -1 1 -3 -1 -4 -5 -1 2
1 -4 2 -1 0 5 -1 1 -3 1 4
2 3 -2 3 -2 -4 5 1 1 0 -2
3 1 4 -5 4 -3 -3 -3 -3 -4 4
4 -3 4 4 5 -2 -3 -1 3 3 -1
5 0 0 -1 -1 2 2 5 -4 -1 -1
6 -2 1 2 0 -1 -1 1 0 4 -4
7 5 2 5 2 3 2 3 -3 1 1
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can easily pipeline operations based on filtering conditions
df.query('A < 0')
A B C D E F G H I J
1 -4 2 -1 0 5 -1 1 -3 1 4
4 -3 4 4 5 -2 -3 -1 3 3 -1
6 -2 1 2 0 -1 -1 1 0 4 -4
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can include multiple conditions
df.query('A < 0 & B < -1')
A B C D E F G H I J
8 -2 -5 1 4 0 -1 4 4 -5 3
9 -3 -2 -5 0 -5 -2 -2 2 0 -1
You can do many cool things
df.query('-3 < A < 3 & H * J > 0')
A B C D E F G H I J
5 0 0 -1 -1 2 2 5 -4 -1 -1
8 -2 -5 1 4 0 -1 4 4 -5 3
And it all gets returned as a DataFrame, ready for the next operation in the chain.
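For example, filters and further steps chain directly (a sketch; the assigned column K is made up for illustration):
(df.query('A < 0')
   .query('B < -1')
   .assign(K=lambda d: d['A'] * d['B']))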
I have the following column in a DataFrame:
0
0
0
0
0
5
I would like to check for values greater than a threshold. If one is found, set it to zero and move up by the difference (value - threshold), placing the threshold value at the new position. Say threshold = 3; the resulting column then has to be:
0
0
0
3
0
0
Any idea for a fast transformation?
For this DataFrame:
df
Out:
A
0 0
1 0
2 0
3 0
4 0
5 5
6 0
7 0
8 0
9 0
10 6
11 0
12 0
threshold = 3
above_threshold = df['A'] > threshold
# place the threshold value (value - threshold) rows above each overshooting row...
df.loc[df[above_threshold].index - (df.loc[above_threshold, 'A'] - threshold).values, 'A'] = threshold
# ...then zero out the overshooting rows themselves
df.loc[above_threshold, 'A'] = 0
df
Out:
A
0 0
1 0
2 0
3 3
4 0
5 0
6 0
7 3
8 0
9 0
10 0
11 0
12 0
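The same idea wrapped in a function (a minimal sketch; shift_overshoot is a name introduced here, and it assumes a default RangeIndex because the label arithmetic is positional):
import pandas as pd

df = pd.DataFrame({'A': [0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 6, 0, 0]})  # the example data from above

def shift_overshoot(s, threshold):
    # Zero values above `threshold` and place `threshold` at the row
    # (value - threshold) positions earlier. Assumes a default RangeIndex.
    out = s.copy()
    over = out > threshold
    target = out.index[over] - (out[over] - threshold).values
    out[target] = threshold
    out[over] = 0
    return out

df['A'] = shift_overshoot(df['A'], 3)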