Generate count column for IDs in a Pandas DataFrame - python

Here how you can generate the dummy version of my Pandas DataFrame:
import pandas as pd
usr_id = [121,121,121,121,135,135,135,135,135,135,135,135,135]
ses_id = [95,95,95,108,97,97,97,97,98,98,98,101,101]
que_id = [1,8,15,23,1,42,9,5,7,9,10,17,20]
df = pd.DataFrame(list(zip(usr_id, ses_id, que_id)),
columns =['usr_id', 'ses_id', 'que_id'])
usr_id
ses_id
que_id
121
95
1
121
95
8
121
95
15
121
108
23
135
97
1
135
97
42
135
97
9
135
97
5
135
98
7
135
98
9
135
98
10
135
101
17
135
101
20
A user can attempt multiple sessions where each sesssion can have varying number of multiple questions. I need to create two columns which will number the session and question i.e, (session number or question number 1, 2, 3...) for each indiviudal users. Something like this:
usr_id
ses_id
que_id
ses_no
que_no
121
95
1
1
1
121
95
8
1
2
121
95
15
1
3
121
108
23
2
1
135
97
1
1
1
135
97
42
1
2
135
97
9
1
3
135
97
5
1
4
135
98
7
2
1
135
98
9
2
2
135
98
10
2
3
135
101
17
3
1
135
101
20
3
2
So session_id 95 was the first session usr_id 121 attempted within which he attempted three questions que_id 1, 8 & 15. Next session attempted by the same user is ses_id 108 with only 1 question que_id 23. Another user, usr_id 135 atempted it's first session recorded as ses_id 97 in which he attempted four questions que_id 1, 42, 9 & 5. The second session from the same user now is ses_id 98 and so on.
I managed to generate the 'que_no' using the following:
df['que_no'] = df.groupby('ses_id').cumcount()+1
But could't find a way to do the same for ses_no.
I am also having an idea of using .shift() to compare whether there is a change in 'usr_id' and/or 'ses_id and some how apply a count logic on the output. Something like this:
i = df.usr_id
j = df.sess_id
i_shift_ne = i.ne(i.shift())
j_shift_ne = j.ne(j.shift())
Not sure whether this idea will work or not also I am pretty sure there has to be a smarter way of doing this. It will be great if we can make it happen using pandas library itself.

IIUC use custom lambda function per usr_id with factorize:
df['ses_no'] = df.groupby('usr_id')['ses_id'].transform(lambda x: pd.factorize(x)[0]) + 1
#if values are sorted
#df['ses_no'] = df.groupby('usr_id')['ses_id'].rank(method='dense').astype(int)
df['que_no'] = df.groupby(['usr_id','ses_no']).cumcount()+1
print (df)
usr_id ses_id que_id ses_no que_no
0 121 95 1 1 1
1 121 95 8 1 2
2 121 95 15 1 3
3 121 108 23 2 1
4 135 97 1 1 1
5 135 97 42 1 2
6 135 97 9 1 3
7 135 97 5 1 4
8 135 98 7 2 1
9 135 98 9 2 2
10 135 98 10 2 3
11 135 101 17 3 1
12 135 101 20 3 2

Related

Read online excel file with a specific sheet and only selected columns

I have to read through pandas the CTG.xls file from the following path:
https://archive.ics.uci.edu/ml/machine-learning-databases/00193/.
From this file I have to select the sheet Data. Moreover I have to select from column K to the column AT of the file. So at the end one have a dataset with these column:
["LB","AC","FM","UC","DL","DS","DP","ASTV","MSTV","ALTV" ,"MLTV" ,"Width","Min","Max" ,"Nmax","Nzeros","Mode","Mean" ,"Median" ,"Variance" ,"Tendency" ,"CLASS","NSP"]
How can I do this using the read function in pandas?
Use:
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00193/CTG.xls'
df = pd.read_excel(url, sheet_name='Data', skipfooter=3)
df = df.drop(columns=df.filter(like='Unnamed').columns)
df.columns = df.iloc[0].to_list()
df = df[1:].reset_index(drop=True)
Output
LB AC FM UC DL DS DP ASTV MSTV ALTV MLTV Width Min Max Nmax Nzeros Mode Mean Median Variance Tendency CLASS NSP
0 120 0 0 0 0 0 0 73 0.5 43 2.4 64 62 126 2 0 120 137 121 73 1 9 2
1 132 0.00638 0 0.00638 0.00319 0 0 17 2.1 0 10.4 130 68 198 6 1 141 136 140 12 0 6 1
2 133 0.003322 0 0.008306 0.003322 0 0 16 2.1 0 13.4 130 68 198 5 1 141 135 138 13 0 6 1
3 134 0.002561 0 0.007682 0.002561 0 0 16 2.4 0 23 117 53 170 11 0 137 134 137 13 1 6 1
4 132 0.006515 0 0.008143 0 0 0 16 2.4 0 19.9 117 53 170 9 0 137 136 138 11 1 2 1
... ... ... ... ... ... .. .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ..
2121 140 0 0 0.007426 0 0 0 79 0.2 25 7.2 40 137 177 4 0 153 150 152 2 0 5 2
2122 140 0.000775 0 0.006971 0 0 0 78 0.4 22 7.1 66 103 169 6 0 152 148 151 3 1 5 2
2123 140 0.00098 0 0.006863 0 0 0 79 0.4 20 6.1 67 103 170 5 0 153 148 152 4 1 5 2
2124 140 0.000679 0 0.00611 0 0 0 78 0.4 27 7 66 103 169 6 0 152 147 151 4 1 5 2
2125 142 0.001616 0.001616 0.008078 0 0 0 74 0.4 36 5 42 117 159 2 1 145 143 145 1 0 1 1
[2126 rows x 23 columns]

Pandas assign points ranging 5-1 depending upon timestamp, most recent get max point

While designing recommendation system, I have stumbled upon a case where voting or something similar is required for collaborative filtering implementation.
But in our system we don't have any field for rating/voting. I am willing to deduce similar kind of rating/voting depending upon the timestamp on which user watched the show.
This is what view-history looks like
subscriber_id content_id timestamp
1 123 1576833135000
1 124 1576833140000
1 125 1576833145000
1 126 1576833150000
1 127 1576833155000
1 128 1576833160000
1 129 1576833165000
1 130 1576833170000
1 131 1576833175000
1 132 1576833180000
2 123 1576833135000
2 124 1576833140000
2 125 1576833145000
2 126 1576833150000
2 127 1576833155000
2 128 1576833160000
2 129 1576833165000
2 130 1576833170000
2 131 1576833175000
2 132 1576833180000
2 133 1576833185000
2 134 1576833190000
2 135 1576833195000
2 136 1576833200000
2 137 1576833205000
2 138 1576833210000
2 139 1576833215000
2 140 1576833220000
2 141 1576833225000
2 142 1576833230000
2 143 1576833235000
2 144 1576833240000
I want to assign a number to each of these entries, ranging from 5-1(5 being most recent), I have implemented the rank system, but it is not working for the range.
df1['rank'] = df1.sort_values(['subscriber_id','timestamp']) \
.groupby(['subscriber_id'])['timestamp'] \
.rank(method='max').astype(int)
Expected Output:
subscriber_id content_id timestamp rating
1 123 1576833135000 1
1 124 1576833140000 1
1 125 1576833145000 2
1 126 1576833150000 2
1 127 1576833155000 3
1 128 1576833160000 3
1 129 1576833165000 4
1 130 1576833170000 4
1 131 1576833175000 5
1 132 1576833180000 5
2 123 1576833135000 1
2 124 1576833140000 1
2 125 1576833145000 1
2 126 1576833150000 1
2 127 1576833155000 2
2 128 1576833160000 2
2 129 1576833165000 2
2 130 1576833170000 2
2 131 1576833175000 3
2 132 1576833180000 3
2 133 1576833185000 3
2 134 1576833190000 3
2 135 1576833195000 4
2 136 1576833200000 4
2 137 1576833205000 4
2 138 1576833210000 4
2 139 1576833215000 4
2 140 1576833220000 5
2 141 1576833225000 5
2 142 1576833230000 5
2 143 1576833235000 5
2 144 1576833240000 5
Any help would be much appreciated!
Now it make sense. Solution is to create list with ranks based on modulo value from dividing number of data for selected user by 5. There You go :)
import pandas as pd
from io import StringIO
data = StringIO("""
content_id subscriber_id timestamp
123 1 1576833135000
124 1 1576833140000
125 1 1576833145000
126 1 1576833150000
127 1 1576833155000
128 1 1576833160000
129 1 1576833165000
130 1 1576833170000
131 1 1576833175000
132 1 1576833180000
123 2 1576833135000
124 2 1576833140000
125 2 1576833145000
126 2 1576833150000
127 2 1576833155000
128 2 1576833160000
129 2 1576833165000
130 2 1576833170000
131 2 1576833175000
132 2 1576833180000
133 2 1576833185000
134 2 1576833190000
135 2 1576833195000
136 2 1576833200000
137 2 1576833205000
138 2 1576833210000
139 2 1576833215000
140 2 1576833220000
141 2 1576833225000
142 2 1576833230000
143 2 1576833235000
144 2 1576833240000
""")
# load data into data frame
df = pd.read_csv(data, sep=' ')
# get unique users
user_list = df['subscriber_id'].unique()
# collect results
results = pd.DataFrame(columns=['content_id','subscriber_id','timestamp','rating'])
for user in user_list:
# select data range for one user
df2 = df[df['subscriber_id'] == user]
items_numer = df2.shape[0]
modulo_remider = items_numer % 5
ranks_repeat = int(items_numer / 5)
# create rating list based on modulo
if modulo_remider > 0:
rating = []
for i in range(1, 6, 1):
l = [i for j in range(ranks_repeat)]
for number in l:
rating.append(number)
if modulo_remider == 1:
rating.insert(rating.index(5), 5)
if modulo_remider == 2:
rating.insert(rating.index(4), 4)
rating.insert(rating.index(5), 5)
if modulo_remider == 3:
rating.insert(rating.index(3), 3)
rating.insert(rating.index(4), 4)
rating.insert(rating.index(5), 5)
if modulo_remider == 4:
rating.insert(rating.index(2), 2)
rating.insert(rating.index(3), 3)
rating.insert(rating.index(4), 4)
rating.insert(rating.index(5), 5)
df2.insert(3, 'rating', rating, True)
else:
rating = []
for i in range(1, 6, 1):
l = [i for j in range(ranks_repeat)]
for number in l:
rating.append(number)
df2.insert(3, 'rating', rating, True)
# collect results
results = results.append(df2)
Result:
content_id subscriber_id timestamp rating
0 123 1 1576833135000 1
1 124 1 1576833140000 1
2 125 1 1576833145000 2
3 126 1 1576833150000 2
4 127 1 1576833155000 3
5 128 1 1576833160000 3
6 129 1 1576833165000 4
7 130 1 1576833170000 4
8 131 1 1576833175000 5
9 132 1 1576833180000 5
10 123 2 1576833135000 1
11 124 2 1576833140000 1
12 125 2 1576833145000 1
13 126 2 1576833150000 1
14 127 2 1576833155000 2
15 128 2 1576833160000 2
16 129 2 1576833165000 2
17 130 2 1576833170000 2
18 131 2 1576833175000 3
19 132 2 1576833180000 3
20 133 2 1576833185000 3
21 134 2 1576833190000 3
22 135 2 1576833195000 4
23 136 2 1576833200000 4
24 137 2 1576833205000 4
25 138 2 1576833210000 4
26 139 2 1576833215000 4
27 140 2 1576833220000 5
28 141 2 1576833225000 5
29 142 2 1576833230000 5
30 143 2 1576833235000 5
31 144 2 1576833240000 5

Get row numbers based on column values from numpy array

I am new to numpy and need some help in solving my problem.
I read records from a binary file using dtypes, then I am selecting 3 columns
df = pd.DataFrame(np.array([(124,90,5),(125,90,5),(126,90,5),(127,90,0),(128,91,5),(129,91,5),(130,91,5),(131,91,0)]), columns = ['atype','btype','ctype'] )
which gives
atype btype ctype
0 124 90 5
1 125 90 5
2 126 90 5
3 127 90 0
4 128 91 5
5 129 91 5
6 130 91 5
7 131 91 0
'atype' is of no interest to me for now.
But what I want is the row numbers when
(x,90,5) appears in 2nd and 3rd columns
(x,90,0) appears in 2nd and 3rd columns
when (x,91,5) appears in 2nd and 3rd columns
and (x,91,0) appears in 2nd and 3rd columns
etc
There are 7 variables like 90,91,92,93,94,95,96 and correspondingly there will be values of either 5 or 0 in the 3rd column.
The entries are 1 million. So is there anyway to find out these without a for loop.
Using pandas you could try the following.
df[(df['btype'].between(90, 96)) & (df['ctype'].isin([0, 5]))]
Using your example. if some of the values are changed, such that df is
atype btype ctype
0 124 90 5
1 125 90 5
2 126 0 5
3 127 90 100
4 128 91 5
5 129 0 5
6 130 91 5
7 131 91 0
then using the solution above, the following is returned.
atype btype ctype
0 124 90 5
1 125 90 5
4 128 91 5
6 130 91 5
7 131 91 0

Selecting rows which match condition of group

I have a Pandas DataFrame df which looks as follows:
ID Timestamp x y
1 10 322 222
1 12 234 542
1 14 22 523
2 55 222 76
2 56 23 87
2 58 322 5436
3 100 322 345
3 150 22 243
3 160 12 765
3 170 78 65
Now, I would like to keep all rows where the timestamp is between 12 and 155. This I could do by df[df["timestamp"] >= 12 & df["timestamp"] <= 155]. But I would like to have only rows included where all timestamps in the corresponding ID group are within the range. So in the example above it should result in the following dataframe:
ID Timestamp x y
2 55 222 76
2 56 23 87
2 58 322 5436
For ID == 1 and ID == 3 not all timestamps of the rows are in the range that's why they are not included.
How can this be done?
You can combine groupby("ID") and filter:
df.groupby("ID").filter(lambda x: x.Timestamp.between(12, 155).all())
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436
Use transform with groupby and using all() to check if all items in the group matches the condition:
df[df.groupby('ID').Timestamp.transform(lambda x: x.between(12,155).all())]
ID Timestamp x y
3 2 55 222 76
4 2 56 23 87
5 2 58 322 5436

Issue combining columns in Dataframe?

I have the following dataframe:
Obj BIT BIT BIT GAS GAS GAS OIL OIL OIL
Date
2007-01-03 18 7 0 184 35 2 52 14 0
2007-01-09 43 3 0 249 35 2 68 11 1
2007-01-16 60 6 0 254 35 5 72 13 1
2007-01-23 69 11 1 255 43 2 81 6 0
2007-01-30 74 8 0 263 29 4 69 9 0
2007-02-06 78 6 1 259 34 2 79 6 0
2007-02-14 76 9 1 263 24 2 70 10 1
2007-02-20 85 7 0 241 20 6 72 4 0
2007-02-27 79 6 0 242 35 3 68 7 0
2007-03-06 68 14 0 225 26 2 57 10 1
How can I sum each of the 9 columns into 3 columns. "BIT","GAS" and "OIL"
This is the code for the dataframe which basically just gets me a cross section from a larger df I want:
ABrigsA = ndfAB.xs(['BIT','GAS','OIL'],axis=1)
Any suggestions?
Assuming that you want to sum similarly-named columns, you can use groupby [tutorial docs]:
>>> df.groupby(level=0, axis='columns').sum()
Obj BIT GAS OIL
Date
2007-01-03 25 221 66
2007-01-09 46 286 80
2007-01-16 66 294 86
2007-01-23 81 300 87
2007-01-30 82 296 78
2007-02-06 85 295 85
2007-02-14 86 289 81
2007-02-20 92 267 76
2007-02-27 85 280 75
2007-03-06 82 253 68

Categories

Resources