I have a dataframe column which specifies how many times a user has performed an activity.
e.g.:
>>> df['ActivityCount']
Users ActivityCount
User0 220
User1 190
User2 105
User3 109
User4 271
User5 265
...
User95 64
User96 15
User97 168
User98 251
User99 278
Name: ActivityCount, Length: 100, dtype: int32
>>> activities = sorted(df['ActivityCount'].unique())
[9, 15, 16, 17, 20, 23, 25, 26, 28, 31, 33, 34, 36, 38, 39, 43, 49, 57, 59, 64, 65, 71, 76, 77, 78,
83, 88, 94, 95, 100, 105, 109, 110, 111, 115, 116, 117, 120, 132, 137, 138, 139, 140, 141, 144, 145, 148, 153, 155, 157, 162, 168, 177, 180, 182, 186, 190, 192, 194, 197, 203, 212, 213, 220, 223, 231, 232, 238, 240, 244, 247, 251, 255, 258, 260, 265, 268, 269, 271, 272, 276, 278, 282, 283, 285, 290]
According to their ActivityCount, I have to divide users into 5 different categories, e.g. A, B, C, D and E.
The ActivityCount range varies from time to time. In the above example it's approximately 9-290 (the lowest and highest of the series), but it could be 5-500 or 5-30.
In the above example, I can take the max number of activities, divide it by 5 to get a bin width of 58 (290/5), and categorize each user accordingly: Range A: 0-58, Range B: 59-116, Range C: 117-174, etc.
Is there any other way to achieve this using pandas or numpy, so that I can directly categorize the column into the given categories?
Expected output:
>>> df
Users ActivityCount Category/Range
User0 220 D
User1 190 D
User2 105 B
User3 109 B
User4 271 E
User5 265 E
...
User95 64 B
User96 15 A
User97 168 C
User98 251 E
User99 278 E
The natural way to do that would be to split the full range of values into 5 equal-width intervals and then assign each row to the bin it falls into. Luckily, pandas allows you to do that easily:
df["category"] = pd.cut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])
The output is something like:
Activity Category
34 115 b
15 43 a
57 192 d
78 271 e
26 88 b
6 25 a
55 186 d
63 220 d
1 15 a
76 268 e
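Note that pd.cut gives bins of equal width rather than equal population; if you would rather have roughly the same number of users in each category, pd.qcut does quantile-based binning instead. A minimal sketch:
df["category"] = pd.qcut(df.Activity, 5, labels=["a", "b", "c", "d", "e"])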
An alternative view - clustering
In the pd.cut method above, we've split the data into 5 bins whose widths are equal. An alternative, more sophisticated approach would be to split the data into 5 clusters and aim to have the data points in each cluster as similar to each other as possible. In machine learning, this is known as a clustering problem.
One classic clustering algorithm is k-means. It's typically used for data with multiple dimensions (e.g. monthly activity, age, gender, etc.), so a single one-dimensional feature like this is a very simplistic case of clustering.
In this case, k-means clustering can be done in the following way:
import numpy as np
import pandas as pd
from scipy.cluster.vq import vq, kmeans, whiten

df = pd.DataFrame({"Activity": l})  # l is the list of activity counts shown above
features = np.array([[x] for x in df.Activity])  # reshape to an (n, 1) feature matrix
whitened = whiten(features)  # normalise the feature to unit variance
codebook, distortion = kmeans(whitened, 5)  # find 5 cluster centroids
code, dist = vq(whitened, codebook)  # assign each observation to its nearest centroid
df["Category"] = code
And the output looks like:
Activity Category
40 138 1
79 272 0
72 255 0
13 38 3
41 139 1
65 231 0
26 88 2
59 197 4
76 268 0
45 145 1
A couple of notes:
The labels of the categories are arbitrary and carry no ordering; in the output above, label '0' happens to correspond to higher activity than label '1'.
I didn't migrate the labels from 0-4 to A-E. This can easily be done using pandas' map.
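For instance, a sketch of that mapping, assuming the codebook and code arrays from above: rank the cluster centroids so the label order follows activity, then map the ranks to letters.
rank = codebook.ravel().argsort().argsort()  # rank of each cluster's centroid, low to high
df["Category"] = pd.Series(rank[code]).map(dict(enumerate("ABCDE")))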
Try the below solution:
df['Categ'] = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'))
It creates the Categ column as the result of dividing ActivityCount into 5 bins, labelled A through E. The borders of the bins are set by dividing the full range into 5 subranges of equal size.
You can also see the borders of each bin, calling:
pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)[1]
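For instance, a sketch that captures both the categories and the bin edges in one call:
categ, bin_edges = pd.cut(df.ActivityCount, bins=5, labels=list('ABCDE'), retbins=True)
print(bin_edges)  # the 6 edges of the 5 bins, spanning roughly min..max of ActivityCount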
Related
I have a dataframe as follows. Each cell in the "Values" column holds a list of elements. I want to visualize the distribution of the values from the "Values" column using histograms, either stacked in rows or separated by colours (Area_code).
How can I get the values and construct the histograms in plotly? Any other ideas are also welcome. Thank you.
Area_code Values
0 New_York [999, 54, 231, 43, 177, 313, 212, 279, 199, 267]
1 Dallas [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316]
2 XXX [560]
3 YYY [884, 13]
4 ZZZ [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]
If you reshape your data, this would be a perfect case for px.histogram. And from there you can opt between several outputs like sum, average and count through the histfunc argument:
fig = px.histogram(df, x = 'Area_code', y = 'Values', histfunc='sum')
fig.show()
You haven't specified what kind of output you're aiming for, but I'll leave it up to you to change the argument for histfunc and see which option suits your needs best.
I'm often inclined to urge users to rethink their entire data process, but I'm just going to assume that there are good reasons why you're stuck with what seems like a pretty weird setup in your dataframe. The snippet below contains a complete data munging process to reshape your data from your setup into a so-called long format:
Area_code Values
0 New_York 999
1 New_York 54
2 New_York 231
3 New_York 43
4 New_York 177
5 New_York 313
6 New_York 212
7 New_York 279
8 New_York 199
9 New_York 267
10 Dallas 915
11 Dallas 183
12 Dallas 2326
13 Dallas 316
14 Dallas 206
15 Dallas 31
16 Dallas 317
17 Dallas 26
18 Dallas 31
19 Dallas 56
20 Dallas 316
21 XXX 560
22 YYY 884
23 YYY 13
24 ZZZ 203
And this is a perfect format for many of the great functionalites of plotly.express.
Complete code:
import plotly.graph_objects as go
import plotly.express as px
import pandas as pd
# data input
df = pd.DataFrame({'Area_code': {0: 'New_York', 1: 'Dallas', 2: 'XXX', 3: 'YYY', 4: 'ZZZ'},
'Values': {0: [999, 54, 231, 43, 177, 313, 212, 279, 199, 267],
1: [915, 183, 2326, 316, 206, 31, 317, 26, 31, 56, 316],
2: [560],
3: [884, 13],
4: [203, 1066, 453, 266, 160, 109, 45, 627, 83, 685, 120, 410, 151, 33, 618, 164, 496]}})
# data munging
areas = []
value = []
for i, row in df.iterrows():
    for val in row['Values']:
        areas.append(row['Area_code'])
        value.append(val)
df = pd.DataFrame({'Area_code': areas,
                   'Values': value})
# plotly
fig = px.histogram(df, x = 'Area_code', y = 'Values', histfunc='sum')
fig.show()
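As a side note, recent pandas versions (0.25+) can do the same reshaping without an explicit loop via DataFrame.explode; a minimal sketch:
df_long = df.explode('Values').reset_index(drop=True)  # one row per (Area_code, value) pair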
I have the following values:
student_list = [521, 597, 624, 100]  # IDs of the students
grade_list = [[99, 73, 97, 98], [98, 71, 70, 99]]  # one inner list per exercise, holding one grade per student
My goal is to return a multidimensional array that, for each student, holds the max grade they got across all exercises.
desired output example:
[[521 597 624 100] [ 99 73 97 99]]
[521 597 624 100] - the IDs of the students
[ 99 73 97 99] - the maximum grade per student: for the first student the highest between 99 and 98 is 99, the next is 73, and so on.
How can I return it using NumPy? I have looked at argmax() but am not sure how to put it together.
You can try np.amax:
import numpy as np

grade_list = np.array([[99, 73, 97, 98], [98, 71, 70, 49]])
np.amax(grade_list, axis=0)
output:
array([99, 73, 97, 98])
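To also get the two-row structure from the question (IDs on top, maxima below), one option is to stack the two arrays; a sketch, assuming the IDs as a NumPy array:
student_list = np.array([521, 597, 624, 100])
np.vstack([student_list, np.amax(grade_list, axis=0)])
# array([[521, 597, 624, 100],
#        [ 99,  73,  97,  98]])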
I have a quite complex question about how to add a new column with conditions for each group. Here is the example dataframe:
df = pd.DataFrame({
'id': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'BB',
'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC'],
'From_num': [80, 68, 751, 'Issued', 32, 68, 126, 'Issued', 'Missed', 105, 68, 114, 76, 68, 99, 'Missed'],
'To_num':[99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 68, 114, 76, 68, 99],
})
id From_num To_num
0 AA 80 99
1 AA 68 80
2 AA 751 68
3 AA Issued 751
4 BB 32 105
5 BB 68 32
6 BB 126 68
7 BB Issued 126
8 BB Missed 49
9 CC 105 324
10 CC 68 105
11 CC 114 68
12 CC 76 114
13 CC 68 76
14 CC 99 68
15 CC Missed 99
I have a 'flag' number, 68. In each group, every row at or above the row where the flag appears in the 'From_num' column should be tagged "Forward" in the new column, and every row at or below the row where it appears in the 'To_num' column should be labelled "Back". However, the hardest situation is: if the flag number appears more than once in each column, the rows between those occurrences should be labelled "Forward&Back"; see the df and the expected result below.
Expected result
id From_num To_num Direction
0 AA 80 99 Forward
1 AA 68 80 Forward
2 AA 751 68 Back
3 AA Issued 751 Back
4 BB 32 105 Forward
5 BB 68 32 Forward
6 BB 126 68 Back
7 BB Issued 126 Back
8 BB Missed 49 Back
9 CC 105 324 Forward
10 CC 68 105 Forward
11 CC 114 68 Forward&Back # From line 11 to 13, flag # 68 appears more than once
12 CC 76 114 Forward&Back # so the line 11, 12 and 13 labelled "Forward&Back"
13 CC 68 76 Forward&Back
14 CC 99 68 Back
15 CC Missed 99 Back
I tried to write many loops, but they all failed to produce the expected result. So if anyone has ideas, please help. Hopefully the question is clear. Many thanks!
I've done it without "real" looping:
preserve the row numbers (reset_index())
construct a new data frame holding only the records that contain the flag (68)
the simple logic for "Forward" and "Back" is based on the row being before or after the first sighting of 68
"Forward&Back" occurs when the flag is sighted more than twice, for rows from the 2nd sighting up to (but not including) the last one
def direction(r):
    # row numbers where the flag appears in this row's group
    flagrow = df2[df2["id"] == r["id"]]["index"].values
    if r["index"] <= flagrow[0]:
        val = "Forward"
    elif r["index"] > flagrow[0]:
        val = "Back"
    # multiple sightings: rows between the 2nd and the last sighting
    if len(flagrow) > 2 and r["index"] >= flagrow[1] and r["index"] < flagrow[-1]:
        val = "Forward&Back"
    return val
df = pd.DataFrame({
'id': ['AA', 'AA', 'AA', 'AA', 'BB', 'BB', 'BB', 'BB', 'BB',
'CC', 'CC', 'CC', 'CC', 'CC', 'CC', 'CC'],
'From_num': [80, 68, 751, 'Issued', 32, 68, 126, 'Issued', 'Missed', 105, 68, 114, 76, 68, 99, 'Missed'],
'To_num':[99, 80, 68, 751, 105, 32, 68, 126, 49, 324, 105, 68, 114, 76, 68, 99],
})
df = df.reset_index()
df2 = df[(df.From_num==68) | (df.To_num==68)].copy()
df["Direction"] = df.apply(lambda r: direction(r), axis=1)
df
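Note that reset_index() leaves a helper index column in df; if it's not wanted in the final output, it can be dropped afterwards:
df = df.drop(columns='index')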
I have a dataframe with the following information:
ticker date close gap
0 BHP 1981-07-31 0.945416 -0.199458
1 BHP 1981-08-31 0.919463 -0.235930
2 BHP 1981-09-30 0.760040 -0.434985
3 BHP 1981-10-30 0.711842 -0.509136
4 BHP 1981-11-30 0.778578 -0.428161
.. ... ... ... ...
460 BHP 2019-11-29 38.230000 0.472563
461 BHP 2019-12-31 38.920000 0.463312
462 BHP 2020-01-31 39.400000 0.459691
463 BHP 2020-02-28 33.600000 0.627567
464 BHP 2020-03-31 28.980000 0.784124
I developed the following code to find the rows where 'gap' crosses 0:
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
This returns:
array([ 52, 54, 57, 75, 79, 86, 93, 194, 220, 221, 234, 235, 236,
238, 245, 248, 277, 379, 381, 382, 383, 391, 392, 393, 395, 396],
dtype=int64)
I need to be able to do the following:
calculate the number of months between points where 'gap' crosses 0
remove items where the number of months is <12
average the remaining months
However, I don't know how to turn this ndarray into something useful that I can make the calculations from. When I try:
pd.DataFrame(zero_crossings)
I get the following df, which only returns the index:
0
0 52
1 54
2 57
3 75
4 79
5 86
.. ..
Please help...
Just extended your code a bit to get the zero_crossings into the original dataframe as required.
import pandas as pd
import numpy as np
BHP_data = pd.DataFrame({'gap': [-0.199458, 0.472563, 0.463312, 0.493318, -0.509136, 0.534985, 0.784124]})
BHP_data['zero_crossings'] = 0
zero_crossings = np.where(np.diff(np.sign(BHP_data['gap'])))[0]
print(zero_crossings) # [0 3 4]
# Updates the column to 1 based on the 0 crossing
BHP_data.loc[zero_crossings, 'zero_crossings'] = 1
print(BHP_data)
Output
gap zero_crossings
0 -0.199458 1
1 0.472563 0
2 0.463312 0
3 0.493318 1
4 -0.509136 1
5 0.534985 0
6 0.784124 0
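From there, the follow-up steps in the question can be sketched directly on the zero_crossings array, assuming one row per month as in the sample data:
months_between = np.diff(zero_crossings)          # gaps in rows ~ months between crossings
long_gaps = months_between[months_between >= 12]  # keep only gaps of 12 months or more
average_gap = long_gaps.mean()                    # average of the remaining gaps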
I have two data frames like below and I want to calculate the correlation coefficient.
It works fine when both columns are complete with actual values. But when they are not, zero is taken as the value when calculating the correlation coefficient.
For example, Addison's and Caden's weights are 0. Jack and Noah don't have weights. I want to exclude them from the calculation.
(In my tries, it seems only the overlapping index positions are considered, i.e. Jack and Noah are automatically excluded - is that right?)
How can I include only the people with non-zero values in the calculation?
Thank you.
import pandas as pd
Weight = {'Name': ["Abigail","Addison","Aiden","Amelia","Aria","Ava","Caden","Charlotte","Chloe","Elijah"],
'Weight': [10, 0, 12, 20, 25, 10, 0, 18, 16, 13]}
df_wt = pd.DataFrame(Weight)
Score = {'Name': ["Abigail","Addison","Aiden","Amelia","Aria","Ava","Caden","Charlotte","Chloe","Elijah", "Jack", "Noah"],
'Score': [360, 476, 345, 601, 604, 313, 539, 531, 507, 473, 450, 470]}
df_sc = pd.DataFrame(Score)
print(df_wt.Weight.corr(df_sc.Score))
Masking and taking non-zero values and common index:
df_wt.set_index('Name', inplace=True)
df_sc.set_index('Name', inplace=True)
mask = df_wt['Weight'].ne(0)
common_index = df_wt.loc[mask, :].index
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
0.923425144491911
If both dataframes contain zeros, then:
mask1 = df_wt['Weight'].ne(0)
mask2 = df_sc['Score'].ne(0)
common_index = df_wt.loc[mask1, :].index.intersection(df_sc.loc[mask2, :].index)
df_wt.loc[common_index, 'Weight'].corr(df_sc.loc[common_index, 'Score'])
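As a variant, Series.corr aligns on the index and skips NaN pairs, so the zeros could arguably be turned into NaN instead of building an explicit common index; a sketch (with 'Name' already set as the index, as above):
import numpy as np

s_wt = df_wt['Weight'].replace(0, np.nan)  # zeros become NaN and are dropped by corr
s_wt.corr(df_sc['Score'])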
Use map to add the new column, remove the 0 rows by boolean indexing, and last apply your solution on the same DataFrame:
df_wt['Score'] = df_wt['Name'].map(df_sc.set_index('Name')['Score'])
df_wt = df_wt[df_wt['Weight'].ne(0)]
print (df_wt)
Name Weight Score
0 Abigail 10 360
2 Aiden 12 345
3 Amelia 20 601
4 Aria 25 604
5 Ava 10 313
7 Charlotte 18 531
8 Chloe 16 507
9 Elijah 13 473
print (df_wt.Weight.corr(df_wt.Score))
0.923425144491911