I have a DataFrame with two years of monthly data Y. I need a second column Y_avg holding the monthly climatology so that I can subtract it from Y.
               Y  Y_avg
T          X
2000-01-31 1  51     63
           2  52     64
2000-02-29 1  53     65
           2  54     66
2000-03-31 1  55     67
           2  56     68
2000-04-30 1  57     69
           2  58     70
2000-05-31 1  59     71
           2  60     72
2000-06-30 1  61     73
           2  62     74
2000-07-31 1  63     75
           2  64     76
2000-08-31 1  65     77
           2  66     78
2000-09-30 1  67     79
           2  68     80
2000-10-31 1  69     81
           2  70     82
2000-11-30 1  71     83
           2  72     84
2000-12-31 1  73     85
           2  74     86
2001-01-31 1  75     63
           2  76     64
2001-02-28 1  77     65
           2  78     66
2001-03-31 1  79     67
           2  80     68
2001-04-30 1  81     69
           2  82     70
2001-05-31 1  83     71
           2  84     72
2001-06-30 1  85     73
           2  86     74
2001-07-31 1  87     75
           2  88     76
2001-08-31 1  89     77
           2  90     78
2001-09-30 1  91     79
           2  92     80
2001-10-31 1  93     81
           2  94     82
2001-11-30 1  95     83
           2  96     84
2001-12-31 1  97     85
           2  98     86
This is my temporary solution:
import numpy as np
# label each (month, X) pair 1..24, repeated for the second year
f = np.tile(np.arange(1, 25), 2)
df['Y_avg'] = np.tile(df.groupby(f).mean().values.ravel(), 2)
But how can I do that more efficiently?
Thanks for the help!
So you want the Y_avg to be the mean by X and the month of T, right? Assuming the T level of your MultiIndex is a DatetimeIndex, use
# group by the calendar month of the T level and by the X level
gb = df['Y'].groupby([df.index.get_level_values(0).month,
                      pd.Grouper(level=1)])
df['Y_avg'] = gb.transform('mean')
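Since the stated goal is to subtract the climatology, the anomaly can then be computed in one more step (the column name Y_anom below is just an illustrative choice, not something from the question):
# anomaly = observation minus its monthly climatology
df['Y_anom'] = df['Y'] - df['Y_avg']
# or directly in a single transform, without keeping Y_avg around
df['Y_anom'] = gb.transform(lambda s: s - s.mean())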
First of all, I had a hard time recreating the DataFrame by copy-pasting the data, so for anyone who wants to answer the question, you can recreate the example with the following code:
import pandas as pd
# Create a date range, convert to list and duplicate
T = pd.date_range("2000-01-31", "2001-12-31", freq="M").tolist() * 2
# Create a list of repeated [1, 2] to match length of T
X = [1, 2] * (len(T) // 2)
Y = range(51, 99)
index = pd.MultiIndex.from_arrays([sorted(T), X], names=("T", "X"))
df = pd.DataFrame({"Y": Y}, index=index)
Then, to calculate the mean of Y with respect to the level T, you can use the following code:
Y_avg = df.Y.mean(level="T")
df = df.join(Y_avg, on="T", rsuffix="_avg")
First, you can calculate the mean with respect to a given index level using the level parameter of the Series mean method. Then you can perform a standard DataFrame join to merge the Y_avg series with the DataFrame on the "T" index. Please note that you must provide a suffix (rsuffix in this case) to properly deal with the column names.
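As a small aside, if you are on a recent pandas release (where, as far as I know, the level argument of Series.mean has been removed), the same computation can be written with an explicit groupby:
# equivalent to df.Y.mean(level="T"), without the level= argument
Y_avg = df.Y.groupby(level="T").mean()
df = df.join(Y_avg, on="T", rsuffix="_avg")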
I have the following dataframe:
Patient  HR  02  PaO2  Hgb
1        62  94  73    31
1        64  93  73    34
1        62  92  73    31
2        64  90  84    42
3        62  95  75    30
3        70  97  77    29
Each row for a patient indicates an hourly observation. So, patient 1 has three observations, patient 2 has one observation and patient 3 has two observations. I'm trying to find a way to pad each patient group so that they are the same size (the same number of observations) as I'm trying to use this data for an LSTM. I'm not sure what the best way to do this would be though. I was wondering if anyone had any ideas?
The output would hopefully look like this:
Patient  HR  02  PaO2  Hgb
1        62  94  73    31
1        64  93  73    34
1        62  92  73    31
2        64  90  84    42
2        0   0   0     0
2        0   0   0     0
3        62  95  75    30
3        70  97  77    29
3        0   0   0     0
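For anyone who wants to try the answer below, here is a minimal sketch that rebuilds the input frame from the top of the question (the column name 02 is kept exactly as printed above):
import pandas as pd

df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 3, 3],
    "HR":      [62, 64, 62, 64, 62, 70],
    "02":      [94, 93, 92, 90, 95, 97],
    "PaO2":    [73, 73, 73, 84, 75, 77],
    "Hgb":     [31, 34, 31, 42, 30, 29],
})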
Reindex your original data to a pandas.MultiIndex built from Patient and the cumulative count within each patient group:
# index by (Patient, running observation number within that patient)
df = df.set_index(["Patient", df.groupby("Patient").cumcount()])
# full grid: every patient crossed with every observation number
index = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
# pad missing rows with 0, drop the counter level, restore Patient as a column
output = df.reindex(index, fill_value=0).reset_index(level=1, drop=True).reset_index()
>>> output
Patient HR 02 PaO2 Hgb
0 1 62 94 73 31
1 1 64 93 73 34
2 1 62 92 73 31
3 2 64 90 84 42
4 2 0 0 0 0
5 2 0 0 0 0
6 3 62 95 75 30
7 3 70 97 77 29
8 3 0 0 0 0
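As a quick sanity check, each patient group in the padded frame should now have the same number of rows (three here, the size of the largest original group):
# every group should report the same size after padding
output.groupby("Patient").size()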
Compare one pair of columns and write the result as a label to a new column, then compare another pair of columns and write that result to the same new column
The df is as shown:
a b c d length
18 32 76 75 8
64 63 76 64 9
55 84 98 45 0
72 92 87 65 0
76 83 23 56 0
36 87 97 12 11
As shown in the dummy dataframe, I am comparing columns in sequence:
filtering if b > a
filtering if d > c
filtering if length is 0
My code is as follows:
df['status_flag'] = np.where(df['b']>=df['a'], "Filtered out based on b>a", None)
df['status_flag'] = np.where(df['d']>=df['c'], "Filtered out based on d>c", None)
df['status_flag'] = np.where(df['length']==0, "Filtered out based on length", None)
This yields the following output:
a b c d length new
18 32 76 75 8
64 68 76 94 9
55 84 98 99 0 "Filtered out based on length"
72 92 87 65 0
76 83 23 56 0 "Filtered out based on length"
36 87 97 100 11
Basically, each np.where call overwrites the strings set by the previous one with None. How can I do this a different way?
Expected output:
a b c d length new
18 32 76 75 8 "Filtered out based on b>a"
64 68 76 94 9 "Filtered out based on d>c"
55 84 98 99 0 "Filtered out based on length"
72 92 87 65 0 "Filtered out based on d>c"
76 83 23 56 0 "Filtered out based on length"
36 87 97 100 11 "Passed all filters"
You can accomplish this with the following:
# Apply filters in the reverse order to get the sequence you want
df['new'] = 'Passed all filters'
df.loc[df.b > df.a, 'new'] = 'Filtered out based on b>a'
df.loc[df.d > df.c, 'new'] = 'Filtered out based on d>c'
df.loc[df.length == 0, 'new'] = 'Filtered out based on length'
print(df)
a b c d length new
0 18 32 76 75 8 Filtered out based on b>a
1 64 63 76 64 9 Passed all filters
2 55 84 98 45 0 Filtered out based on length
3 72 92 87 65 0 Filtered out based on length
4 76 83 23 56 0 Filtered out based on length
5 36 87 97 12 11 Filtered out based on b>a
Note: this uses the first data frame given, which differs from the one used in your example. Using that one gives the following result:
a b c d length new
0 18 32 76 75 8 Filtered out based on b>a
1 64 68 76 94 9 Filtered out based on d>c
2 55 84 98 99 0 Filtered out based on length
3 72 92 87 65 0 Filtered out based on length
4 76 83 23 56 0 Filtered out based on length
5 36 87 97 100 11 Filtered out based on d>c
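If you would rather stay close to the np.where style from the question, numpy.select is a natural fit; here is a sketch with the same column names and labels (conditions are evaluated in order, so the highest-priority filter goes first):
import numpy as np

conditions = [
    df['length'] == 0,    # highest priority
    df['d'] > df['c'],
    df['b'] > df['a'],
]
choices = [
    "Filtered out based on length",
    "Filtered out based on d>c",
    "Filtered out based on b>a",
]
df['new'] = np.select(conditions, choices, default="Passed all filters")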
I just have a quick question: when working with an Excel spreadsheet as a pandas DataFrame, how would one get the last cell value of every single column? I'm having quite some difficulty with this; I know the last index can be found with len(), but I can't quite wrap my head around it. Thank you, any help would be greatly appreciated.
If you want the last cell of a dataframe, meaning the bottom-right cell, then you can use .iloc:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(1, 101).reshape((10, -1)))
df
Output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9 10
1 11 12 13 14 15 16 17 18 19 20
2 21 22 23 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37 38 39 40
4 41 42 43 44 45 46 47 48 49 50
5 51 52 53 54 55 56 57 58 59 60
6 61 62 63 64 65 66 67 68 69 70
7 71 72 73 74 75 76 77 78 79 80
8 81 82 83 84 85 86 87 88 89 90
9 91 92 93 94 95 96 97 98 99 100
Use .iloc with -1 index selection on both rows and columns.
df.iloc[-1,-1]
Output:
100
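If what you actually need is the last value of every column, as the question suggests, selecting the last row with .iloc returns one value per column as a Series indexed by column name:
# last row of the frame: the final cell of each column
df.iloc[-1]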
DataFrame.head(n) gets the top n results from the dataframe. DataFrame.tail(n) gets the bottom n results from the dataframe.
If your dataframe is named df, you could use df.tail(1) to get the last row of the dataframe. The returned value is also a dataframe.
The request is simple: I want to select all rows which contain a value greater than a threshold.
If I do it like this:
df[(df > threshold)]
I get these rows, but values below that threshold are simply NaN. How do I avoid selecting these rows?
There is absolutely no need for the double transposition - you can simply call any along the column index (supplying 1 or 'columns') on your Boolean matrix.
df[(df > threshold).any(1)]
Example
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))
>>> df
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
2 37 2 55 68 16 14 93 14 71 84
3 67 45 79 75 27 94 46 43 7 40
4 61 65 73 60 67 83 32 77 33 96
>>> df[(df > 95).any(1)]
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
4 61 65 73 60 67 83 32 77 33 96
Transposing as your self-answer does is just an unnecessary performance hit.
df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))
# standard way
%timeit df[(df > 95).any(1)]
1 loop, best of 3: 8.48 s per loop
# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
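A small readability note: spelling out the axis keyword makes the intent explicit (and, if I recall correctly, newer pandas releases expect it as a keyword anyway):
# same filter, with the axis named explicitly
df[(df > 95).any(axis=1)]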
This is actually very simple:
df[df.T[(df.T > 0.33)].any()]
I have the following DataFrame in pandas:
Student-ID Last-name First-name HW1 HW2 HW3 HW4 HW5 M1 M2 Final
59118211 Alf Brian 96 90 88 93 96 78 60 59.0
59260567 Anderson Jill 73 83 96 80 84 80 52 42.5
59402923 Archangel Michael 99 80 60 94 98 41 56 0.0
59545279 Astor John 93 88 97 100 55 53 53 88.9
59687635 Attach Zach 69 75 61 65 91 90 63 69.0
I want to sum only those columns which have "HW" in their name. Any suggestions on how I can do that?
Note: The number of columns containing HW may differ. So I can't reference them directly.
You could call df.filter(regex='HW') to select the columns whose names contain 'HW' and then apply the sum row-wise via sum(axis=1):
In [23]: df
Out[23]:
StudentID Lastname Firstname HW1 HW2 HW3 HW4 HW5 HW6 HW7 M1
0 59118211 Alf Brian 96 90 88 93 96 97 88 10
1 59260567 Anderson Jill 73 83 96 80 84 99 80 100
2 59402923 Archangel Michael 99 80 60 94 98 73 97 50
3 59545279 Astor John 93 88 97 100 55 96 86 60
4 59687635 Attach Zach 69 75 61 65 91 89 82 55
5 59829991 Bake Jake 56 0 77 78 0 79 0 10
In [24]: df.filter(regex='HW').sum(axis=1)
Out[24]:
0 648
1 595
2 601
3 615
4 532
5 290
dtype: int64
John's solution - using df.filter() - is more elegant, but you could also consider a list comprehension ...
df[[x for x in df.columns if 'HW' in x]].sum(axis=1)
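Either way, the row-wise sum can be assigned straight back to the frame as a new column (the name HW_total is just an illustrative choice):
# filter(like='HW') keeps every column whose name contains the substring 'HW'
df['HW_total'] = df.filter(like='HW').sum(axis=1)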