My DataFrame, called df_results, looks like this:
Cust Code Price
=====================
1 A 98
1 B 25
1 C 74
1 D 55
1 E 15
1 F 32
1 G 71
2 A 10
2 K 52
2 M 33
2 S 14
99 K 10
99 N 24
99 S 26
99 A 49
99 W 50
99 J 52
99 Q 55
99 U 68
99 C 73
99 Z 74
99 P 82
99 E 92
. . .
. . .
. . .
I am trying to break each customer's prices into categories by percentile.
Cust 99 prices are 10 24 26 49 50 52 55 68 73 74 82 92
Therefore, for this customer the percentile cut points are:
25% ==> 31.75
50% ==> 53.5
75% ==> 73.75
100% ==> 92
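(For reference, these cut points follow the exclusive (n+1) percentile definition rather than NumPy's default linear interpolation, which would give 43.25 for the 25% mark; with NumPy >= 1.22 they can be reproduced like this:)
import numpy as np
prices = [10, 24, 26, 49, 50, 52, 55, 68, 73, 74, 82, 92]
print(np.percentile(prices, [25, 50, 75, 100], method="weibull"))
# [31.75 53.5  73.75 92.  ]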
The prices for each customer then have to be averaged within the percentile bucket they belong to:
Cust Code Price Percentile Perc_Value Perc_Avg
=====================================================
99 K 10 25% 31.75 20
99 N 24 25% 31.75 20
99 S 26 25% 31.75 20
99 A 49 50% 53.5 50.33
99 W 50 50% 53.5 50.33
99 J 52 50% 53.5 50.33
99 Q 55 75% 73.75 65.33
99 U 68 75% 73.75 65.33
99 C 73 75% 73.75 65.33
99 Z 74 100% 92 82.67
99 P 82 100% 92 82.67
99 E 92 100% 92 82.67
I managed to do that by looping over the DataFrame multiple times, which is not efficient, and I believe there must be a better solution.
Is there a better way to do this?
EDIT
I tried using a lambda function.
Step 1: find Percentile_Value
df_results["Percentile_Value"] = df_results.apply(lambda x: np.percentile(x["Price"],25), axis=1)
but this did not give me any values; it just repeated Price into Percentile_Value as-is. (With axis=1 the lambda receives one row at a time, so np.percentile is asked for the 25th percentile of a single number and simply returns that number unchanged.)
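For a vectorized alternative, one option is to combine groupby with pd.qcut: qcut assigns each price to a quartile bucket within its customer, transform('mean') averages inside each bucket, and np.percentile supplies the cut points. A minimal sketch, assuming hypothetical bucket labels matching the table above and NumPy >= 1.22 for method='weibull':
import numpy as np
import pandas as pd

# Rebuild the Cust 99 slice of df_results for illustration
df_results = pd.DataFrame({
    "Cust": [99] * 12,
    "Code": list("KNSAWJQUCZPE"),
    "Price": [10, 24, 26, 49, 50, 52, 55, 68, 73, 74, 82, 92],
})

# Quartile bucket (0-3) for each price, computed per customer
bucket = df_results.groupby("Cust")["Price"].transform(
    lambda s: pd.qcut(s, 4, labels=False)
)
df_results["Percentile"] = bucket.map({0: "25%", 1: "50%", 2: "75%", 3: "100%"})

# Cut point of each row's bucket, using the exclusive (n+1) definition
def cut_points(s):
    edges = np.percentile(s, [25, 50, 75, 100], method="weibull")
    return edges[pd.qcut(s, 4, labels=False)]

df_results["Perc_Value"] = df_results.groupby("Cust")["Price"].transform(cut_points)

# Mean price within each (customer, bucket) group
df_results["Perc_Avg"] = df_results.groupby(["Cust", "Percentile"])["Price"].transform("mean")

print(df_results)   # e.g. Code K -> 25%, 31.75, 20.0
Note that qcut's default linear quantiles produce the same four groups here because twelve values split evenly 3-3-3-3; with heavily tied prices you may need duplicates='drop'.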
Related
I'm trying to change/eliminate the 1's that run along the diagonal of a correlation matrix so that, when I take the average of each row, the 1's don't affect the mean.
Let's say I have the dataset,
A B C D E F
0 45 100 58 78 80 35
1 49 80 80 104 58 20
2 49 80 65 78 79 20
3 65 100 80 159 83 45
4 65 123 78 115 100 50
5 45 122 84 100 85 20
6 60 120 78 44 105 55
7 62 80 109 48 78 25
8 63 39 85 65 79 25
9 80 52 100 50 103 30
10 80 43 78 64 120 60
11 60 60 130 43 135 45
12 80 50 111 59 115 50
13 82 65 130 63 78 90
14 83 58 85 80 45 80
15 100 64 100 65 30 70
When I do dfcorr = df.corr() and display dfcorr, I get
A B C D E F
A 1.000000 0.842125 0.834808 0.832773 0.844158 0.806787
B 0.842125 1.000000 0.847606 0.907595 0.818668 0.863645
C 0.834808 0.847606 1.000000 0.718199 0.804671 0.582033
D 0.832773 0.907595 0.718199 1.000000 0.884236 0.878421
E 0.844158 0.818668 0.804671 0.884236 1.000000 0.718668
F 0.806787 0.863645 0.582033 0.878421 0.718668 1.000000
I want all the 1's to be dropped so that if I want to take the mean of each of the rows, the 1's won't affect them.
If you are working with it as a DataFrame, this will work:
df = pd.DataFrame({'c1': [1, 0, 0.3, 0.4], 'c2': [0.2, 1, 0.6, 0.4], 'c3': [0.1, 0, 1, 0.4], 'c4': [0.7, 0.2, 0.2, 1]})
df.where(df != 1).mean(axis=1)
This only works correctly if all 1's are on the diagonal.
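If you cannot rule out a genuine off-diagonal correlation of exactly 1.0, a more robust sketch is to mask the diagonal positions themselves rather than the value 1 (assuming numpy is imported as np):
dfcorr = df.corr()
mask = np.eye(len(dfcorr), dtype=bool)   # True only on the diagonal
dfcorr.mask(mask).mean(axis=1)           # row means with the diagonal excluded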
I have the following DataFrame:
1-A-873 2-A-129 3-A-123
12/12/20 45 32 41
13/12/20 94 56 87
14/12/20 12 42 84
15/12/20 73 24 25
Each column represents a piece of equipment. Each piece of equipment has a size that is declared in the code:
size_1A = 5
size_2A = 3
size_3A = 7
Every column needs to be divided by its equipment size, i.e. (value / size).
This is what I am using:
df["1A-NewValue"] = df["1-A-873"] / 1A
df["2A-NewValue"] = df["2-A-129"] / 2A
df["3A-NewValue"] = df["3-A-123"] / 3A
End result:
1-A-873 2-A-129 3-A-123 1A-NewValue 2A-NewValue 3A-NewValue
12/12/20 45 32 41 9 10.67 5.86
13/12/20 94 56 87 18.8 18.67 12.43
14/12/20 12 42 84 2.4 14 12
15/12/20 73 24 25 14.6 8 3.57
This works perfectly and does what I want by adding three extra columns at the end of the DataFrame.
However, this will become tedious later on if the number of equipment columns grows to 250 instead of 3: I would need 250 lines for the equipment sizes and another 250 for the formulas.
Naturally the first thing that comes to mind is a for loop, but is there a more pandas-way of doing this efficiently?
Thanks!
You can create a dictionary, rename the column names by splitting on '-' and joining the first two parts so they match the dictionary keys, then divide:
d = {'1A': 5, '2A': 3, '3A': 7}
# '1-A-873' -> '1A', so the renamed columns line up with the dictionary keys
f = lambda x: ''.join(x.split('-')[:2])
# divide the renamed copy by the sizes, tag the new columns, join back
df = df.join(df.rename(columns=f).div(d).add_suffix(' NewValue'))
print(df)
1-A-873 2-A-129 3-A-123 1A NewValue 2A NewValue 3A NewValue
12/12/20 45 32 41 9.0 10.666667 5.857143
13/12/20 94 56 87 18.8 18.666667 12.428571
14/12/20 12 42 84 2.4 14.000000 12.000000
15/12/20 73 24 25 14.6 8.000000 3.571429
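To see what the rename achieves, apply f to a column name by hand; the first two '-'-separated parts are joined, so the result matches the keys of d and .div(d) knows which size to divide each column by:
print(f('1-A-873'))   # '1A'
print(f('2-A-129'))   # '2A'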
I have a DataFrame with two years of monthly data, Y. I need a second column, Y_avg, holding the climatology (the mean across both years for each month and X) so that I can subtract the two:
Y Y_avg
T X
2000-01-31 1 51 63
2 52 64
2000-02-29 1 53 65
2 54 66
2000-03-31 1 55 67
2 56 68
2000-04-30 1 57 69
2 58 70
2000-05-31 1 59 71
2 60 72
2000-06-30 1 61 73
2 62 74
2000-07-31 1 63 75
2 64 76
2000-08-31 1 65 77
2 66 78
2000-09-30 1 67 79
2 68 80
2000-10-31 1 69 81
2 70 82
2000-11-30 1 71 83
2 72 84
2000-12-31 1 73 85
2 74 86
2001-01-31 1 75 63
2 76 64
2001-02-28 1 77 65
2 78 66
2001-03-31 1 79 67
2 80 68
2001-04-30 1 81 69
2 82 70
2001-05-31 1 83 71
2 84 72
2001-06-30 1 85 73
2 86 74
2001-07-31 1 87 75
2 88 76
2001-08-31 1 89 77
2 90 78
2001-09-30 1 91 79
2 92 80
2001-10-31 1 93 81
2 94 82
2001-11-30 1 95 83
2 96 84
2001-12-31 1 97 85
2 98 86
This is my temporary solution:
# label the 48 rows 1..24, 1..24 so the two years pair up month by month
f = np.tile(np.arange(1, 25), 2)
df['Y_avg'] = np.tile(df.groupby(f).mean().values.ravel(), 2)
But how can I do that more efficiently?
Thanks for the help!
So you want the Y_avg to be the mean by X and the month of T, right? Assuming the T level of your MultiIndex is a DatetimeIndex, use
gb = df['Y'].groupby([df.index.get_level_values(0).month,
                      pd.Grouper(level=1)])
df['Y_avg'] = gb.transform('mean')
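As a quick sanity check against the expected table (assuming df carries the MultiIndex shown in the question):
# January / X=1 holds Y = 51 (2000) and 75 (2001), so the climatology is 63
print(df.loc[(pd.Timestamp("2000-01-31"), 1), "Y_avg"])   # 63.0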
First of all, I had a hard time recreating the DataFrame by copy-pasting the data, so
for anyone who wants to answer the question, you can recreate the example with the following code:
import pandas as pd
# Create a date range, convert to list and duplicate
T = pd.date_range("2000-01-31", "2001-12-31", freq="M").tolist() * 2
# Create a list of repeated [1, 2] to match length of T
X = [1, 2] * (len(T) // 2)
Y = range(51, 99)
index = pd.MultiIndex.from_arrays([sorted(T), X], names=("T", "X"))
df = pd.DataFrame({"Y": Y}, index=index)
Then, to calculate the mean of Y with respect to level T, you can use the following code:
Y_avg = df.Y.groupby(level="T").mean()   # Series.mean(level=...) was removed in pandas 2.0
df = df.join(Y_avg, on="T", rsuffix="_avg")
First, you calculate the mean of Y grouped by the "T" level of the index. Then you perform a standard DataFrame join to merge the Y_avg series back into the DataFrame on the "T" index. Please note that you must provide a suffix (rsuffix in this case) so the two Y columns get distinct names.
The request is simple: I want to select all rows which contain a value greater than a threshold.
If I do it like this:
df[(df > threshold)]
I get all the rows back, with values below the threshold turned into NaN. How do I select only the rows that contain at least one value above the threshold?
There is absolutely no need for the double transposition - you can simply call any along the columns (axis=1 or axis='columns') on your Boolean matrix.
df[(df > threshold).any(axis=1)]
Example
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))
>>> df
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
2 37 2 55 68 16 14 93 14 71 84
3 67 45 79 75 27 94 46 43 7 40
4 61 65 73 60 67 83 32 77 33 96
>>> df[(df > 95).any(axis=1)]
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
4 61 65 73 60 67 83 32 77 33 96
Transposing as your self-answer does is just an unnecessary performance hit.
df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))
# standard way
%timeit df[(df > 95).any(axis=1)]
1 loop, best of 3: 8.48 s per loop
# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
This is actually very simple:
df[df.T[(df.T > 0.33)].any()]
I have the following DataFrame in pandas:
Student-ID Last-name First-name HW1 HW2 HW3 HW4 HW5 M1 M2 Final
59118211 Alf Brian 96 90 88 93 96 78 60 59.0
59260567 Anderson Jill 73 83 96 80 84 80 52 42.5
59402923 Archangel Michael 99 80 60 94 98 41 56 0.0
59545279 Astor John 93 88 97 100 55 53 53 88.9
59687635 Attach Zach 69 75 61 65 91 90 63 69.0
I want to sum only those columns whose names contain "HW". Any suggestions on how I can do that?
Note: The number of columns containing HW may differ, so I can't reference them directly.
You could call df.filter(regex='HW') to select the columns whose names match 'HW', and then sum row-wise via .sum(axis=1):
In [23]: df
Out[23]:
StudentID Lastname Firstname HW1 HW2 HW3 HW4 HW5 HW6 HW7 M1
0 59118211 Alf Brian 96 90 88 93 96 97 88 10
1 59260567 Anderson Jill 73 83 96 80 84 99 80 100
2 59402923 Archangel Michael 99 80 60 94 98 73 97 50
3 59545279 Astor John 93 88 97 100 55 96 86 60
4 59687635 Attach Zach 69 75 61 65 91 89 82 55
5 59829991 Bake Jake 56 0 77 78 0 79 0 10
In [24]: df.filter(regex='HW').sum(axis=1)
Out[24]:
0 648
1 595
2 601
3 615
4 532
5 290
dtype: int64
John's solution - using df.filter() - is more elegant, but you could also consider a list comprehension ...
df[[x for x in df.columns if 'HW' in x]].sum(axis=1)
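If you prefer a Boolean mask over regex filtering, the string methods on the column index give an equivalent one-liner (a sketch; same result as the two approaches above):
df.loc[:, df.columns.str.contains('HW')].sum(axis=1)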