I am adding a new function that converts a DataFrame to a lower triangle if it is an upper triangle, and vice versa. The data I am using always has the first two rows filled in the first column only.
I tried the solution from this question: Pandas: convert upper triangular dataframe by shifting rows to the left
Data :
0 1 2 3
0 1.000000 NaN NaN NaN
1 0.421655 NaN NaN NaN
2 0.747064 5.000000 NaN NaN
3 0.357616 0.631622 8.000000 NaN
which should be turned into:
Data :
0 1 2 3
0 NaN 8.000000 0.631622 0.357616
1 NaN NaN 5.000000 0.747064
2 NaN NaN NaN 0.421655
3 NaN NaN NaN 1.000000
You just need to reverse the order of the rows and columns:
yourdf=df.iloc[::-1,::-1]
yourdf
Out[94]:
3 2 1 0
3 NaN 8.0 0.631622 0.357616
2 NaN NaN 5.000000 0.747064
1 NaN NaN NaN 0.421655
0 NaN NaN NaN 1.000000
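If you also want the axis labels to read 0..3 as in the expected output (iloc keeps the reversed labels 3..0), a small follow-up sketch:
yourdf.index = range(len(yourdf))        # rows relabelled 0..3
yourdf.columns = range(yourdf.shape[1])  # columns relabelled 0..3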
Since NumPy is installed alongside pandas anyway, numpy.flip is another, arguably more readable, option:
In [722]: df
Out[722]:
0 1 2 3
0 1.000000 NaN NaN NaN
1 0.421655 NaN NaN NaN
2 0.747064 5.000000 NaN NaN
3 0.357616 0.631622 8.0 NaN
In [724]: import numpy as np
In [725]: df_flip = pd.DataFrame(np.flip(df.values))
In [726]: df_flip
Out[726]:
0 1 2 3
0 NaN 8.0 0.631622 0.357616
1 NaN NaN 5.000000 0.747064
2 NaN NaN NaN 0.421655
3 NaN NaN NaN 1.000000
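Since the question asks for a function, here is a minimal sketch of a wrapper (the name flip_triangle is just illustrative) that mirrors either orientation and reuses the original labels:
import numpy as np
import pandas as pd

def flip_triangle(frame):
    # Mirror both axes: upper triangular becomes lower triangular and vice versa.
    # Reusing the original labels gives the 0..3 axes shown in the expected output.
    return pd.DataFrame(np.flip(frame.to_numpy()), index=frame.index, columns=frame.columns)

flip_triangle(df)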
I have a data frame df and I want to create multiple lags of column A.
I should be able to use the .assign() method and a dictionary comprehension, I think.
However, all lags are the longest lag with my solution below, even though the dictionary comprehension itself creates the correct lags.
Also, can someone explain why I need the ** just before my dictionary comprehension?
import numpy as np
import pandas as pd
df = pd.DataFrame({'A': np.arange(5)})
df.assign(**{'lag_' + str(i): lambda x: x['A'].shift(i) for i in range(1, 5+1)})
A lag_1 lag_2 lag_3 lag_4 lag_5
0 0 NaN NaN NaN NaN NaN
1 1 NaN NaN NaN NaN NaN
2 2 NaN NaN NaN NaN NaN
3 3 NaN NaN NaN NaN NaN
4 4 NaN NaN NaN NaN NaN
The dictionary comprehension itself creates the correct lags.
{'lag_' + str(i): df['A'].shift(i) for i in range(1, 5+1)}
{'lag_1': 0 NaN
1 0.0
2 1.0
3 2.0
4 3.0
Name: A, dtype: float64,
'lag_2': 0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
Name: A, dtype: float64,
'lag_3': 0 NaN
1 NaN
2 NaN
3 0.0
4 1.0
Name: A, dtype: float64,
'lag_4': 0 NaN
1 NaN
2 NaN
3 NaN
4 0.0
Name: A, dtype: float64,
'lag_5': 0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: A, dtype: float64}
Just pass the dict you already built to assign, without the lambdas. The lambdas all produce the longest lag because they close over the loop variable i and are only called later, inside assign, by which point i has its final value of 5 (the usual late-binding behaviour of closures). The ** is needed because assign takes the new columns as keyword arguments, so the dict has to be unpacked into name=value pairs.
out = df.assign(**{'lag_' + str(i): df['A'].shift(i) for i in range(1, 5+1)})
Out[65]:
A lag_1 lag_2 lag_3 lag_4 lag_5
0 0 NaN NaN NaN NaN NaN
1 1 0.0 NaN NaN NaN NaN
2 2 1.0 0.0 NaN NaN NaN
3 3 2.0 1.0 0.0 NaN NaN
4 4 3.0 2.0 1.0 0.0 NaN
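If you do want to keep the lambdas (for example to defer evaluation until assign runs), bind the loop variable as a default argument so each lambda captures its own i; a sketch of that fix:
out = df.assign(**{'lag_' + str(i): (lambda x, i=i: x['A'].shift(i)) for i in range(1, 5 + 1)})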
For each row I would like to set all values to NaN after the appearance of the first NaN. E.g.:
a b c
1 2 3 4
2 nan 2 nan
3 3 nan 23
Should become this:
a b c
1 2 3 4
2 nan nan nan
3 3 nan nan
So far I only know how to do this with apply and a for loop over each column per row, and it's very slow!
Check with cumprod; it builds a mask that is True only up to the first NaN in each row:
df=df.where(df.notna().cumprod(axis=1).eq(1))
a b c
1 2.0 3.0 4.0
2 NaN NaN NaN
3 3.0 NaN NaN
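To see why this works, here is the intermediate mask for the example frame (a sketch): notna() marks valid cells, and cumprod(axis=1) drops to 0 at the first NaN in each row and stays 0 afterwards.
mask = df.notna().cumprod(axis=1).eq(1)
# row 1: [True, True, True]    -> nothing changes
# row 2: [False, False, False] -> the leading NaN blanks the whole row
# row 3: [True, False, False]  -> everything after the first NaN is dropped
df = df.where(mask)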
Hi, I have the following dataframe:
z a b c
a 1 NaN NaN
ss NaN 2 NaN
cc 3 NaN NaN
aa NaN 4 NaN
ww NaN 5 NaN
ss NaN NaN 6
aa NaN NaN 7
g NaN NaN 8
j 9 NaN NaN
I would like to create a new column d that looks like this:
z a b c d
a 1 NaN NaN 1
ss NaN 2 NaN 2
cc 3 NaN NaN 3
aa NaN 4 NaN 4
ww NaN 5 NaN 5
ss NaN NaN 6 6
aa NaN NaN 7 7
g NaN NaN 8 8
j 9 NaN NaN 9
The numbers are not integers; they are np.float64. Integers are used here just for a clear example; you may assume values like 32065431243556.62 or 763835218962767.8. Thank you for your help.
We can replace the NaNs with 0 and sum across each row.
df['d'] = df[['a', 'b', 'c']].fillna(0).sum(axis=1)
In fact, fillna isn't necessary; sum skips NaN elements by default (skipna=True).
I'm a Python newcomer as well, and I suggest reading the pandas cookbook first.
The code is:
df['Total'] = df[['a', 'b', 'c']].sum(axis=1).astype(int)
You can use pd.DataFrame.ffill over axis=1:
df['D'] = df.ffill(axis=1).iloc[:, -1].astype(int)
print(df)
a b c D
0 1.0 NaN NaN 1
1 NaN 2.0 NaN 2
2 3.0 NaN NaN 3
3 NaN 4.0 NaN 4
4 NaN 5.0 NaN 5
5 NaN NaN 6.0 6
6 NaN NaN 7.0 7
7 NaN NaN 8.0 8
8 9.0 NaN NaN 9
Of course, if you have float values, int conversion is not required.
If there is only one non-NaN value per row among a, b, and c, as in the example, you can also drop the NaNs in each row and assign the remaining value to column d:
df['d'] = df[['a', 'b', 'c']].apply(lambda row: row.dropna().iloc[0], axis=1)
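Another way to pick the first non-NaN value across a, b, c is to back-fill along the columns and take the first one; a sketch, assuming (as in the example) that each row has at least one value among a, b, c:
df['d'] = df[['a', 'b', 'c']].bfill(axis=1).iloc[:, 0]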
import pandas as pd

x = pd.DataFrame(index=pd.date_range(start="2017-1-1", end="2017-1-13"),
                 columns="a b c".split())
# .ix is gone from current pandas; positional .iloc reproduces the same frame.
x.iloc[0:2, 0] = 1   # column "a", Jan 01-02
x.iloc[5:10, 0] = 1  # column "a", Jan 06-10
x.iloc[9:12, 1] = 1  # column "b", Jan 10-12
x.iloc[1:3, 2] = 1   # column "c", Jan 02-03
x.iloc[5, 2] = 1     # column "c", Jan 06
a b c
2017-01-01 1 NaN NaN
2017-01-02 1 NaN 1
2017-01-03 NaN NaN 1
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 1 NaN 1
2017-01-07 1 NaN NaN
2017-01-08 1 NaN NaN
2017-01-09 1 NaN NaN
2017-01-10 1 1 NaN
2017-01-11 NaN 1 NaN
2017-01-12 NaN 1 NaN
2017-01-13 NaN NaN NaN
Given the above dataframe, x, I want to return the average length of the runs of consecutive 1s within each of the columns a, b, and c. The average for each column is taken over the number of blocks that contain consecutive 1s.
For example, column a should give the average of 2 and 5, which is 3.5: there are 2 consecutive 1s between Jan-01 and Jan-02, then 5 consecutive 1s between Jan-06 and Jan-10, 2 blocks of 1s in total. Similarly, column b gives 3, because its only run of consecutive 1s is between Jan-10 and Jan-12. Finally, column c gives the average of 2 and 1, which is 1.5.
Expected output of the toy example:
a b c
3.5 3 1.5
Use mask + apply with value_counts, and finally, find the mean of your counts -
x.eq(1)\
.ne(x.eq(1).shift())\
.cumsum(0)\
.mask(x.ne(1))\
.apply(pd.Series.value_counts)\
.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
Details
First, label every run of consecutive values in your dataframe -
i = x.eq(1).ne(x.eq(1).shift()).cumsum(0)
i
a b c
2017-01-01 1 1 1
2017-01-02 1 1 2
2017-01-03 2 1 2
2017-01-04 2 1 3
2017-01-05 2 1 3
2017-01-06 3 1 4
2017-01-07 3 1 5
2017-01-08 3 1 5
2017-01-09 3 1 5
2017-01-10 3 2 5
2017-01-11 4 2 5
2017-01-12 4 2 5
2017-01-13 4 3 5
Now, keep only those group values whose cells were originally 1 in x -
j = i.mask(x.ne(1))
j
a b c
2017-01-01 1.0 NaN NaN
2017-01-02 1.0 NaN 2.0
2017-01-03 NaN NaN 2.0
2017-01-04 NaN NaN NaN
2017-01-05 NaN NaN NaN
2017-01-06 3.0 NaN 4.0
2017-01-07 3.0 NaN NaN
2017-01-08 3.0 NaN NaN
2017-01-09 3.0 NaN NaN
2017-01-10 3.0 2.0 NaN
2017-01-11 NaN 2.0 NaN
2017-01-12 NaN 2.0 NaN
2017-01-13 NaN NaN NaN
Now, apply value_counts across each column -
k = j.apply(pd.Series.value_counts)
k
a b c
1.0 2.0 NaN NaN
2.0 NaN 3.0 2.0
3.0 5.0 NaN NaN
4.0 NaN NaN 1.0
And just find the column-wise mean -
k.mean(0)
a 3.5
b 3.0
c 1.5
dtype: float64
As a handy note, if you want to, for example, find the mean count only for runs of more than n consecutive 1s (say, n = 1 here), then you can filter on k's values quite easily -
k[k > 1].mean(0)
a 3.5
b 3.0
c 2.0
dtype: float64
Let's try:
x.apply(lambda s: s.groupby(s.ne(1).cumsum()).sum().mean())
Output:
a 3.5
b 3.0
c 1.5
dtype: float64
Apply the lambda function to each column of the dataframe. Inside it, s.ne(1).cumsum() increments at every non-1 value, so each run of consecutive 1s shares a single group label; sum() then counts the 1s in each group (NaNs contribute nothing) and mean() averages those counts.
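To see what the grouper does, here is the intermediate label series for column a (a sketch). Note that on newer pandas versions a group containing only NaNs may sum to 0 instead of NaN and drag the mean down; if that happens, passing min_count=1 to sum() keeps those groups out of the mean.
groups = x['a'].ne(1).cumsum()
# Jan 01-02 -> 0, Jan 03 -> 1, Jan 04 -> 2, Jan 05-10 -> 3, Jan 11 -> 4, Jan 12 -> 5, Jan 13 -> 6
# Each run of consecutive 1s shares one label (0 and 3 here), so the group sums are the run lengths 2 and 5.
x['a'].astype(float).groupby(groups).sum(min_count=1).mean()   # 3.5 on recent pandas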
This utilizes cumsum, shift, and an xor mask.
b = x.cumsum()
c = b.shift(-1)
b_masked = b[b.isnull() ^ c.isnull()]
b_masked.max() / b_masked.count()
a 3.5
b 3.0
c 1.5
dtype: float64
First do b = x.cumsum()
a b c
0 1.0 NaN NaN
1 2.0 NaN 1.0
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 3.0 NaN 3.0
6 4.0 NaN NaN
7 5.0 NaN NaN
8 6.0 NaN NaN
9 7.0 1.0 NaN
10 NaN 2.0 NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Then, shift b upward: c = b.shift(-1). Then we create an XOR mask with b.isnull() ^ c.isnull(). The mask is True at the last row of each run of 1s (b non-null, shifted c null), so only the final cumulative value of each run survives; the other True positions fall on rows where b is NaN, so indexing back into b adds nothing there. We use a small example to illustrate:
b c b.isnull() ^ c.isnull() b[b.isnull() ^ c.isnull()]
NaN 1 True NaN
1 2 False NaN
2 NaN True 2
NaN NaN False NaN
The full b[b.isnull() ^ c.isnull()] looks like
a b c
0 NaN NaN NaN
1 2.0 NaN NaN
2 NaN NaN 2.0
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN 3.0
6 NaN NaN NaN
7 NaN NaN NaN
8 NaN NaN NaN
9 7.0 NaN NaN
10 NaN NaN NaN
11 NaN 3.0 NaN
12 NaN NaN NaN
Because we did cumsum in the first place, we only need the maximum and the number of non-NaN in each column to calculate the mean.
Thus, we do b[b.isnull() ^ c.isnull()].max() / b[b.isnull() ^ c.isnull()].count()
You could use a regex:
import re
import numpy as np

p = r'1+'
counts = {
    # use m as the loop variable so the DataFrame x is not shadowed
    c: np.mean(
        [len(m) for m in re.findall(p, ''.join(map(str, x[c].values)))]
    )
    for c in ['a', 'b', 'c']
}
This works because each column can be thought of as a string over the alphabet {1, nan}: 1+ matches every run of adjacent 1s, re.findall returns those runs as strings, and the mean of their lengths is the answer. It relies on each cell stringifying to a bare '1', which holds for the object-dtype example frame above; float columns would render as '1.0' and break the runs.
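If your real data is float-typed, a slightly more defensive encoding step keeps the idea intact; a sketch:
counts = {
    # map each cell to '1' or '0' explicitly instead of relying on str()
    c: np.mean([len(m) for m in re.findall(r'1+', ''.join('1' if v == 1 else '0' for v in x[c]))])
    for c in ['a', 'b', 'c']
}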
I have a python pandas DataFrame that looks like this:
A B C ... ZZ
2008-01-01 00 NaN NaN NaN ... 1
2008-01-02 00 NaN NaN NaN ... NaN
2008-01-03 00 NaN NaN 1 ... NaN
... ... ... ... ... ...
2012-12-31 00 NaN 1 NaN ... NaN
and I can't figure out how to get a subset of the DataFrame where there is one or more '1' in it, so that the final df should be something like this:
B C ... ZZ
2008-01-01 00 NaN NaN ... 1
2008-01-03 00 NaN 1 ... NaN
... ... ... ... ...
2012-12-31 00 1 NaN ... NaN
That is, removing all rows and columns that do not have a 1 in them.
I tried this, which seems to remove the rows with no 1:
df_filtered = df[df.sum(1)>0]
And then tried to remove the columns with:
df_filtered = df_filtered[df.sum(0)>0]
but I get this error after the second line:
IndexingError('Unalignable boolean Series key provided')
Do it with loc:
In [90]: df
Out[90]:
0 1 2 3 4 5
0 1 NaN NaN 1 1 NaN
1 NaN NaN NaN NaN NaN NaN
2 1 1 NaN NaN 1 NaN
3 1 NaN 1 1 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
0 1 2 3 4
0 1 NaN NaN 1 1
2 1 1 NaN NaN 1
3 1 NaN 1 1 NaN
Here's why you get that error:
Let's say I have the following frame, df (similar to yours):
In [112]: df
Out[112]:
a b c d e
0 0 1 1 NaN 1
1 NaN NaN NaN NaN NaN
2 0 0 0 NaN 0
3 0 0 1 NaN 1
4 1 1 1 NaN 1
5 0 0 0 NaN 0
6 1 0 1 NaN 0
When I sum down the rows (df.sum() defaults to axis=0, giving one value per column) and threshold at 0, I get:
In [113]: row_sum = df.sum()
In [114]: row_sum > 0
Out[114]:
a True
b True
c True
d False
e True
dtype: bool
Since the index of row_sum is the columns of df, it doesn't make sense in this case to try to use the values of row_sum > 0 to fancy-index into the rows of df, since their row indices are not aligned and they cannot be aligned.
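So the minimal fix for the original two-step attempt is to route the column filter through .loc, which aligns the boolean against the columns; a sketch:
df_filtered = df[df.sum(1) > 0]                  # drop rows with no 1
df_filtered = df_filtered.loc[:, df.sum(0) > 0]  # drop columns with no 1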
Alternatively, to remove all-NaN rows and columns you can use .any() too:
In [1680]: df
Out[1680]:
0 1 2 3 4 5
0 1.0 NaN NaN 1.0 1.0 NaN
1 NaN NaN NaN NaN NaN NaN
2 1.0 1.0 NaN NaN 1.0 NaN
3 1.0 NaN 1.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
0 1 2 3 4
0 1.0 NaN NaN 1.0 1.0
2 1.0 1.0 NaN NaN 1.0
3 1.0 NaN 1.0 1.0 NaN
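Equivalently, since a row or column with no 1 is entirely NaN here, dropna with how='all' along both axes gives the same result (a sketch, assuming the frame really contains only 1s and NaNs):
df.dropna(how='all').dropna(axis=1, how='all')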