Counting the number of ratings without a loop in Python

In Python, given a DataFrame of ratings:
import pandas as pd
path = 'ratings_ml100k.csv'
data = pd.read_csv(path,sep= ',')
print(data)
user_id item_id rating
28422 100 690 4
32020 441 751 4
15819 145 265 5
where the items are:
print(itemsTrain)
[ 690 751 265 ..., 1650 1447 1507]
For each item, I would like to compute the number of ratings. Is there any way to do this without resorting to a loop? All ideas are appreciated.
data is a pandas DataFrame. The desired output should look like this:
pop =
item_id rating_count
690 120
751 10
265 159
... ...
Note that itemsTrain contains the unique item_ids in the rating dataset data.

You can do it this way (np below is numpy, i.e. import numpy as np, used only for the random sample data):
In [200]: df = pd.DataFrame(np.random.randint(0,8,(15,2)),columns=['id', 'rating'])
In [201]: df
Out[201]:
id rating
0 4 6
1 0 1
2 2 4
3 2 5
4 2 7
5 3 5
6 6 1
7 4 3
8 4 3
9 3 2
10 2 4
11 7 7
12 3 1
13 2 7
14 7 3
In [202]: df.groupby('id').rating.count()
Out[202]:
id
0 1
2 5
3 3
4 3
6 1
7 2
Name: rating, dtype: int64
If you want the result as a DataFrame (you can also name the count column as you wish):
In [206]: df.groupby('id').rating.count().to_frame('count').reset_index()
Out[206]:
id count
0 0 1
1 2 5
2 3 3
3 4 3
4 6 1
5 7 2
You can also count the number of unique ratings:
In [203]: df.groupby('id').rating.nunique()
Out[203]:
id
0 1
2 3
3 3
4 2
6 1
7 2
Name: rating, dtype: int64
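As a side note (not part of the answer above): value_counts gives the same per-id counts in a single call, sorted descending by count, and count() only counts non-null ratings, so use size() if you want plain row counts per group:
df['id'].value_counts()     # same counts per id, sorted descending by count
df.groupby('id').size()     # row counts per id, including rows with NaN rating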

You can use df.groupby() to group the rows by item_id and then use count() to count the ratings in each group:
# df is your dataframe
# groupby('item_id') groups the rows by the "item_id" feature
# .rating.count() then counts the (non-null) rating values within each group
df.groupby('item_id').rating.count()
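Applied to the data and itemsTrain from the original question, a minimal sketch producing the requested pop frame could look like this (column names are taken from the post; the reindex step is optional and simply follows the order of itemsTrain):
pop = (
    data.groupby('item_id')['rating']
        .count()
        .reindex(itemsTrain, fill_value=0)   # optional: keep the itemsTrain order, 0 if unrated
        .rename_axis('item_id')              # make sure the index keeps its name after reindex
        .reset_index(name='rating_count')
)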

Related

Pandas Dataframe: For loop that adds a fixed integer if the value exists in previous rows

For the following dataframe
df = pd.DataFrame({'Rounds':[1000,1000,1000,1000,3000,3000,4000,5000,6000,6000]})
I would like a loop such that, if a value already exists in previous rows, a fixed integer (in this case 25) is added to it, producing:
df = pd.DataFrame({'Rounds':[1000,1025,1050,1075,3000,3025,4000,5000,6000,6025]})
Initially I tried
for i in df.index:
    if df.iat[i,1] == df.iloc[i-1,1]:
        df.iat[i,1] = df.iat[i-1,1]+25
The problem is that it doesn't work for more than two equal values in a column, and I would also like to refer to the column by its name "Rounds" instead of by its position.
You need groupby.cumcount:
df['Rounds'] += df.groupby('Rounds').cumcount()*25
output:
Rounds
0 1000
1 1025
2 1050
3 1075
4 3000
5 3025
6 4000
7 5000
8 6000
9 6025
intermediate:
df.groupby('Rounds').cumcount()
0 0
1 1
2 2
3 3
4 0
5 1
6 0
7 0
8 0
9 1
dtype: int64
Use groupby + cumcount:
df["Rounds"] += df.groupby(df["Rounds"]).cumcount() * 25
print(df)
Output
Rounds
0 1000
1 1025
2 1050
3 1075
4 3000
5 3025
6 4000
7 5000
8 6000
9 6025
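Both answers rely on the same cumcount idea; here is a self-contained sketch using just the question's data, runnable as-is:
import pandas as pd

df = pd.DataFrame({'Rounds': [1000, 1000, 1000, 1000, 3000, 3000, 4000, 5000, 6000, 6000]})

# cumcount() numbers the repeats of each value 0, 1, 2, ..., so the n-th duplicate
# gets n * 25 added -- this also handles three or more equal values
df['Rounds'] += df.groupby('Rounds').cumcount() * 25

print(df['Rounds'].tolist())
# [1000, 1025, 1050, 1075, 3000, 3025, 4000, 5000, 6000, 6025]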

Make a unique enumeration for a concatenated pandas df

I have some dataframes where data is tagged in groups, let's say as such:
df1 = pd.DataFrame({'id':[1,3,7, 10,30, 70, 100, 300], 'name':[1,1,1,1,1,1,1,1], 'tag': [1,1,1, 2,2,2, 3,3]})
df2 = pd.DataFrame({'id':[2,5,6, 20, 50, 200, 500, 600], 'name': [2,2,2,2,2,2,2,2], 'tag':[1,1,1, 2, 2, 3,3,3]})
df3 = pd.DataFrame({'id':[4, 8, 9, 40, 400, 800, 900], 'name': [3,3,3,3,3,3,3], 'tag':[1,1,1, 2, 3, 3,3]})
In each dataframe, the tag is attributed in ascending order of ids (so bigger ids will have equal or bigger tags).
My wish is to recalculate tags in the concatenated dataframe,
df = pd.concat([df1, df2, df3])
so that the tags follow the ascending order of the id of the first element of each group. So the group starting with id=1 will be tagged 1 (that is, ids 1, 3, 7), the group starting with id=2 will be tagged 2 (that is, ids 2, 5, 6), the group starting with 4 will be tagged 3, the group starting with 10 will be tagged 4, and so on.
I did manage to get a (complicated!) solution:
1) Get the first row of each group, put those in a dataframe, sort by id and create the new tags:
dff = pd.concat([df1.groupby('tag').first(), df2.groupby('tag').first(), df3.groupby('tag').first()])
dff = dff.sort(['id'])
dff = dff.reset_index()
dff['new_tags'] = dff.index +1
2) Concatenate this dataframe with the initial ones, drop_duplicates so as to keep the newly tagged rows, order by group, then propagate the new tags:
df = pd.concat([dff, df1, df2, df3])
df = df.drop_duplicates(subset=['id', 'tag', 'name'])
df = df.sort(['name', 'tag'])
df = df.fillna(method = 'pad')
The new tags are exactly what is needed, but my solution seems too complicated. Would you have a suggestion on how to make it simpler? I think I must be missing something!
Thanks in advance,
M.
Using pd.concat + keys; I break down the steps:
df=pd.concat([df1,df2,df3],keys=[0,1,2])
df=df.reset_index(level=0)  # get the level=0 index
df=df.sort_values(['tag','level_0'])  # sort the values
df['New']=(df['tag'].diff().ne(0)|df['level_0'].diff().ne(0)).cumsum()
df
Out[110]:
level_0 id name tag New
0 0 1 1 1 1
1 0 3 1 1 1
2 0 7 1 1 1
0 1 2 2 1 2
1 1 5 2 1 2
2 1 6 2 1 2
0 2 4 3 1 3
1 2 8 3 1 3
2 2 9 3 1 3
3 0 10 1 2 4
4 0 30 1 2 4
5 0 70 1 2 4
3 1 20 2 2 5
4 1 50 2 2 5
3 2 40 3 2 6
6 0 100 1 3 7
7 0 300 1 3 7
5 1 200 2 3 8
6 1 500 2 3 8
7 1 600 2 3 8
4 2 400 3 3 9
5 2 800 3 3 9
6 2 900 3 3 9
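The New column uses the usual change-point trick: a new group starts whenever either tag or the source frame (level_0) changes, and cumsum of those change flags numbers the groups. A tiny illustration of the trick on its own, with made-up values:
import pandas as pd

s = pd.Series([1, 1, 2, 2, 2, 3])
# diff().ne(0) is True at every change (and at the first row); cumsum numbers the runs
print(s.diff().ne(0).cumsum().tolist())   # [1, 1, 2, 2, 2, 3]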
Once concatenated, you can groupby the columns 'tag' and 'name' and use transform with first on the column 'id'. Then sort_values this series and cumsum wherever the diff is non-zero, such as:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
df['new'] = (df.groupby(['tag','name'])['id'].transform('first')
               .sort_values().diff().ne(0.).cumsum())
and you get the expected output:
id name tag new
0 1 1 1 1
1 2 2 1 2
2 3 1 1 1
3 4 3 1 3
4 5 2 1 2
5 6 2 1 2
6 7 1 1 1
7 8 3 1 3
8 9 3 1 3
9 10 1 2 4
10 20 2 2 5
11 30 1 2 4
12 40 3 2 6
...
EDIT: to avoid using groupby, you can use drop_duplicates and its index to get the indices of the first ids, create the column new with incremental values using loc and range, and then ffill after sort_values to fill in the values:
df = pd.concat([df1, df2, df3]).sort_values('id').reset_index(drop=True)
list_ind = df.drop_duplicates(['name','tag']).index
df.loc[list_ind,'new'] = range(1,len(list_ind)+1)
df['new'] = df.sort_values(['tag','name'])['new'].ffill().astype(int)
and you get the same result
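Another compact option, not taken from the answers above: a sketch using rank on the first id of each group, assuming ids are sorted within each input frame so the first row of a group also carries its smallest id:
import pandas as pd

df = pd.concat([df1, df2, df3], ignore_index=True)

# smallest id of each original (name, tag) group, broadcast back to every row
first_id = df.groupby(['name', 'tag'])['id'].transform('min')

# dense-rank those first ids: groups get numbered 1, 2, 3, ... in ascending id order
df['new'] = first_id.rank(method='dense').astype(int)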

Drop rows after maximum value in a grouped Pandas dataframe

I've got a date-ordered dataframe that can be grouped. What I am attempting to do is groupby a variable (Person), determine the maximum (weight) for each group (person), and then drop all rows that come after (date) the maximum.
Here's an example of the data:
df = pd.DataFrame({'Person': [1,1,1,1,1,2,2,2,2,2], 'Date': ['1/1/2015','2/1/2015','3/1/2015','4/1/2015','5/1/2015','6/1/2011','7/1/2011','8/1/2011','9/1/2011','10/1/2011'], 'MonthNo': [1,2,3,4,5,1,2,3,4,5], 'Weight': [100,110,115,112,108,205,210,211,215,206]})
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
3 4/1/2015 4 1 112
4 5/1/2015 5 1 108
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
9 10/1/2011 5 2 206
Here's what I want the result to look like:
Date MonthNo Person Weight
0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
5 6/1/2011 1 2 205
6 7/1/2011 2 2 210
7 8/1/2011 3 2 211
8 9/1/2011 4 2 215
I think it's worth noting that there can be disjoint start dates and the maximum may appear at different times.
My idea was to find the maximum for each group, obtain the MonthNo the maximum occurred in for that group, and then discard any rows with a MonthNo greater than the max-weight MonthNo. So far I've been able to obtain the max by group, but cannot get past doing a comparison based on that.
Please let me know if I can edit/provide more information, haven't posted many questions here! Thanks for the help, sorry if my formatting/question isn't clear.
Using idxmax with groupby
df.groupby('Person', sort=False).apply(
    lambda x: x.reset_index(drop=True)
               .iloc[:x.reset_index(drop=True).Weight.idxmax() + 1, :]
)
Out[131]:
Date MonthNo Person Weight
Person
1 0 1/1/2015 1 1 100
1 2/1/2015 2 1 110
2 3/1/2015 3 1 115
2 0 6/1/2011 1 2 205
1 7/1/2011 2 2 210
2 8/1/2011 3 2 211
3 9/1/2011 4 2 215
You can use groupby.transform with idxmax. The first 2 steps may not be necessary depending on how your dataframe is structured.
# convert Date to datetime
df['Date'] = pd.to_datetime(df['Date'])
# sort by Person and Date to make index usable for next step
df = df.sort_values(['Person', 'Date']).reset_index(drop=True)
# filter for index less than idxmax transformed by group
df = df[df.index <= df.groupby('Person')['Weight'].transform('idxmax')]
print(df)
Date MonthNo Person Weight
0 2015-01-01 1 1 100
1 2015-02-01 2 1 110
2 2015-03-01 3 1 115
5 2011-06-01 1 2 205
6 2011-07-01 2 2 210
7 2011-08-01 3 2 211
8 2011-09-01 4 2 215
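If you prefer not to rely on the positional index at all, a boolean-mask variant of the same idea is possible; this is only a sketch, and it assumes the rows are already in date order within each Person, as in the example data:
# flag rows that come strictly after the first per-Person maximum
after_max = (
    df.groupby('Person')['Weight']
      .transform(lambda s: s.eq(s.max()).cummax().shift(fill_value=False))
      .astype(bool)
)
result = df[~after_max]
print(result)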

Sort strings of mixed types and different lengths

I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'A': ['286a2', '17', '286a1', '373', '200b', '150'], 'B': range(6)})
A B
0 286a2 0
1 17 1
2 286a1 2
3 373 3
4 200b 4
5 150 5
which I want to sort according to A. When I do this using
df.sort_values(by='A')
I obtain
A B
5 150 5
1 17 1
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
which is almost correct: I would like to have 17 before 150 but don't know how to do this as those entries are not just values but actual strings consisting of numerical values and letters. Is there a way to do this?
EDIT
About the pattern of the entries:
It is always a numeric value first of arbitrary length, then it can be followed by characters, which can be followed by numerical values again.
You can replace the letters with a ., cast to float and use the result as the index, then sort_index:
df.index = df['A'].str.replace('[a-zA-Z]+', '.', regex=True).astype(float)
df = df.sort_index().reset_index(drop=True)
print (df)
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
Another variant of jezrael's:
In [1706]: df.assign(
    A_=df.A.str.replace('[/\D]', '.', regex=True).astype(float)  # or '[a-zA-Z]+'
).sort_values(by='A_').drop('A_', axis=1)
Out[1706]:
A B
1 17 1
5 150 5
4 200b 4
2 286a1 2
0 286a2 0
3 373 3
Or you can try natsort:
from natsort import natsorted, ns
df.set_index('A').reindex(natsorted(df.A, key=lambda y: y.lower())).reset_index()
Out[395]:
A B
0 17 1
1 150 5
2 200b 4
3 286a1 2
4 286a2 0
5 373 3
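If you would rather avoid the extra dependency, here is a sketch built around the pattern described in the EDIT (leading number, optional letters, optional trailing number); the regex and the handling of the parts are assumptions based on that description and require every value to match it:
# split each value into (leading number, letters, trailing number) and sort on the parts
parts = df['A'].str.extract(r'^(\d+)([a-zA-Z]*)(\d*)$')
parts[0] = parts[0].astype(int)
parts[2] = parts[2].replace('', '0').astype(int)   # an empty trailing part sorts first
df_sorted = df.loc[parts.sort_values([0, 1, 2]).index]
print(df_sorted)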

How to remove an ugly row in a pandas DataFrame

So I am filling dataframes from 2 different files. While those 2 files should have the same structure (the values should be different, though), the resulting dataframes look different. When printing them I get:
a b c d
0 70402.14 70370.602112 0.533332 98
1 31362.21 31085.682726 1.912552 301
... ... ... ... ...
753919 64527.16 64510.008206 0.255541 71
753920 58077.61 58030.943621 0.835758 152
a b c d
index
0 118535.32 118480.657338 0.280282 47
1 49536.10 49372.999416 0.429902 86
... ... ... ... ...
753970 52112.95 52104.717927 0.356051 116
753971 37044.40 36915.264944 0.597472 165
So in the second dataframe there is that "index" row that doesn't make any sense to me, and it causes trouble in my following code. I neither wrote the code that fills the files into the dataframes nor did I create those files. So I am rather interested in checking whether such a row exists and how I might remove it. Does anyone have an idea about this?
The second dataframe has an index level named "index".
You can remove the name with
df.index.name = None
For example,
In [126]: df = pd.DataFrame(np.arange(15).reshape(5,3))
In [128]: df.index.name = 'index'
In [129]: df
Out[129]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
In [130]: df.index.name = None
In [131]: df
Out[131]:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The dataframe might have picked up the name "index" if you used reset_index and set_index like this:
In [138]: df.reset_index()
Out[138]:
index 0 1 2
0 0 0 1 2
1 1 3 4 5
2 2 6 7 8
3 3 9 10 11
4 4 12 13 14
In [140]: df.reset_index().set_index('index')
Out[140]:
0 1 2
index
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
4 12 13 14
The index is just the first column: it numbers the rows by default, but you can change it in a number of ways (e.g. filling it with values from one of the columns).
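Tying this back to the question (checking whether that row exists and removing it), a minimal sketch:
# only clear the index name if it is actually set
if df.index.name is not None:
    df.index.name = None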
