I'm using the MovieLens dataset with pandas in Python. I need to print the contents of u.data, a tab-separated file, as a matrix in the following layout:
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating Rating
UserID2 Rating Rating Rating
I've already been through the following links:
One - the dataset is much too large to put into a Series
Two - transposing rows is not mentioned
Three - tried reindex so as to get NaN values in one column
Four - df.iloc and df.ix didn't work either
I need the output to show the rating, or NaN where a user has not rated a movie, for every movie and user:
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating NaN
UserID2 Rating NaN Rating
P.S. I won't mind having solutions with numpy, crab, recsys, csv or any other python package
EDIT 1 - Sorted the data and exported it, but got an additional field
df2 = df.sort_values(['UserID','MovieID'])
print(type(df2))
df2.to_csv("sorted.csv")
print(df2)
This produces the following sorted.csv file:
,UserID,MovieID,Rating,TimeStamp
32236,1,1,5,874965758
23171,1,2,3,876893171
83307,1,3,4,878542960
62631,1,4,3,876893119
47638,1,5,3,889751712
5533,1,6,5,887431973
70539,1,7,4,875071561
31650,1,8,1,875072484
20175,1,9,5,878543541
13542,1,10,3,875693118
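The unnamed first column in sorted.csv appears to be the DataFrame index, which to_csv writes by default. A minimal sketch, assuming a small frame built from three of the rows above and an in-memory buffer standing in for the file:

```python
import io
import pandas as pd

# A few rows in the same shape as the sorted output above.
df = pd.DataFrame(
    {
        "UserID": [1, 1, 1],
        "MovieID": [3, 1, 2],
        "Rating": [4, 5, 3],
        "TimeStamp": [878542960, 874965758, 876893171],
    }
)

df2 = df.sort_values(["UserID", "MovieID"])

# index=False leaves out the row labels, so no unnamed first column is written.
buffer = io.StringIO()  # stand-in for "sorted.csv"
df2.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real file the same argument applies: df2.to_csv("sorted.csv", index=False).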
EDIT 2 - As asked in Comments
Here's the format of Data in u.data file which acts as input
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
One method:
Use pivot_table. If there is only one value per user and movie ID, aggfunc doesn't matter; if there are multiple values, choose your aggregation.
df.pivot_table(values='Rating',index='UserID',columns='MovieID', aggfunc='mean')
Second method (no duplicate userid, movieid records):
df.set_index(['UserID','MovieID'])['Rating'].unstack()
Third method (no duplicate userid, movieid records):
df.pivot(index='UserID',columns='MovieID',values='Rating')
Fourth method (like the first you can choose your aggregation method):
df.groupby(['UserID','MovieID'])['Rating'].mean().unstack()
Output:
MovieID 1 2 3 4 5 6 7 8 9 10
UserID
1 5 3 4 3 3 5 4 1 5 3
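For completeness, here is a small end-to-end sketch of the first method, starting from tab-separated text. The column names follow the ones used in the question; the rows for user 2 are made up for illustration so that the NaN cells are visible.

```python
import io
import pandas as pd

# Stand-in for u.data: tab-separated, no header row.
# The rows for user 2 are invented purely to illustrate missing ratings.
raw = (
    "1\t1\t5\t874965758\n"
    "1\t2\t3\t876893171\n"
    "2\t1\t4\t878542960\n"
    "2\t3\t2\t875071561\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["UserID", "MovieID", "Rating", "TimeStamp"],
)

# One row per user, one column per movie; unrated combinations become NaN.
matrix = df.pivot_table(values="Rating", index="UserID", columns="MovieID", aggfunc="mean")
print(matrix)
```

To read the real file, replace io.StringIO(raw) with the path to u.data.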
Related
I'm working with a dataframe that contains several observations of BMW cars. The thing is that in the model columns I got several models like
Model
320
420
425
335
325
118
Z4
.
.
.
I want to replace each model number with its series: models that start with 1 become serie1, models that start with 2 become serie2, and so on. I've already looked at str.contains(pat = '1'), but I still don't know how to apply it to the whole column.
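pandas has a pandas.Series.str.replace method which can be used for this purpose. Match the whole value rather than only the leading digit (otherwise "118" becomes "serie118"), and pass regex=True, since recent pandas versions treat the pattern as a literal string by default. You would do something like:
df['Model'].str.replace(r'^1.*', 'serie1', regex=True)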
If I understand you correctly, you want to get the first character from the Model column:
df["Serie"] = "Serie " + df["Model"].str[0]
print(df)
Prints:
Model Serie
0 320 Serie 3
1 420 Serie 4
2 425 Serie 4
3 335 Serie 3
4 325 Serie 3
5 118 Serie 1
6 Z4 Serie Z
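If models that do not start with a digit (such as Z4) should be left alone, one possible variation is to extract the leading digit with a regular expression. This is only a sketch; keeping "Z4" unchanged is an assumption about the desired result, not something the question states.

```python
import pandas as pd

df = pd.DataFrame({"Model": ["320", "420", "425", "335", "325", "118", "Z4"]})

# Capture a leading digit; rows without one (e.g. "Z4") yield NaN here.
first_digit = df["Model"].str.extract(r"^(\d)", expand=False)

# Build the label where a digit was found, otherwise keep the original model name
# (assumption: non-numeric models are left as they are).
df["Serie"] = ("serie" + first_digit).fillna(df["Model"])
print(df)
```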
I have a dataframe with two columns
df
hr count
2 53
3 1586
4 890
5 833
6 209
I want to compute each value's percentage of the column total and store the result in a new column. Currently I do it this way:
df['avg'] = (df['count'] / df['count'].sum()) * 100
df
hr count avg
2 53 1.484178
3 1586 44.413330
4 890 24.922991
5 833 23.326799
6 209 5.852702
I want to do this with a built-in function like mean(). How can I achieve this with a built-in function?
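As far as I know, pandas has no single method that returns each value's share of the column total, so the closest "built-in" form is the same arithmetic written with Series methods. A minimal sketch using the data above (the column name pct is my own choice):

```python
import pandas as pd

df = pd.DataFrame({"hr": [2, 3, 4, 5, 6], "count": [53, 1586, 890, 833, 209]})

# Each value divided by the column total, times 100, using Series.div and Series.mul.
# Bracket access is needed because df.count is a DataFrame method.
df["pct"] = df["count"].div(df["count"].sum()).mul(100)
print(df)
```

Note that mean() would return a single number for the whole column rather than one value per row.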
I need to merge some rows in a dataframe because I want to code [sequential association rule] mining in Python.
How can I merge the data, and which algorithm should I use in Python?
Apriori? FP-growth?
I can't find a [sequential association rule] implementation using Apriori in Python.
The examples I found use R.
There are 250 places to visit, 116,807 unique id numbers and 1.7 million rows in total. Each id also has a country_code (111 countries, which I will group into 10), so I will need to merge them once more.
Previous Data
index date_ymd id visit_nm country
1 20170801 123123 seoul 460
2 20170801 123123 tokyo 460
3 20170801 124567 seoul 440
4 20170802 123123 osaka 460
5 20170802 123123 seoul 460
... ... ... ...
What I need
index Transaction visit_nm country
1 20170801123123 {seoul,tokyo} 460
2 20170802123123 {osaka,seoul} 460
From what I understand of the data, use groupby with agg:
s=pd.Series(df.date_ymd.astype(str)+df.id.astype(str),name='Transaction')
(df.groupby(s)
.agg({'visit_nm':lambda x: set(x),'country':'first'}).reset_index())
Transaction visit_nm country
0 20170801123123 {seoul, tokyo} 460
1 20170801124567 {seoul} 440
2 20170802123123 {osaka, seoul} 460
Also you could use:
df['Transaction'] = df['date_ymd'].map(str)+df['id'].map(str)
df.groupby('Transaction').agg({'visit_nm': lambda x: set(x), 'country': 'first'}).reset_index()
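Since the goal is sequential rules, the order of visits within a transaction may matter, and a set discards it. A hedged variation that collects the visits into a list in row order instead, using the sample rows from the question:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date_ymd": [20170801, 20170801, 20170801, 20170802, 20170802],
        "id": [123123, 123123, 124567, 123123, 123123],
        "visit_nm": ["seoul", "tokyo", "seoul", "osaka", "seoul"],
        "country": [460, 460, 440, 460, 460],
    }
)

# Transaction key: date followed by id, both as strings.
df["Transaction"] = df["date_ymd"].astype(str) + df["id"].astype(str)

# list keeps the visits in their original row order within each transaction.
result = (
    df.groupby("Transaction")
    .agg(visit_nm=("visit_nm", list), country=("country", "first"))
    .reset_index()
)
print(result)
```

Whether row order really reflects visit order depends on how the data was recorded, which the question does not say.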
The Scenario
My dataset was in the following format, which I refer to as ACTUAL FORMAT:
uid iid rat tmp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013
When passing it to another function (KMeans clustering), it needs to be in the following format, which I created with a pivot and refer to as MATRIX FORMAT:
uid 1 2 3 4
4 4.3320762062 4.3407749532 4.3111995162 4.3411425423
5 4 3 2.1952622349 3.1913491995
6 4 3.4233243638 3.8255108621 3.948791424
7 4.4983411706 4.0477240538 4.0241460801 5
8 4.1773004578 4.0191412859 4.0442369862 4.1754642909
9 4.2733984521 4.2797130861 4.2682723131 4.2816986988
15 1 3.0554789259 3.2279546684 3.1282278957
16 5 4.3473697565 4.0675394438 5
The Problem:
Now, since I need the result (the MATRIX FORMAT data) to be passed back to the first algorithm, I need to convert it to the OLD FORMAT (the ACTUAL FORMAT above).
Conversion:
To convert from the OLD FORMAT to the MATRIX FORMAT I did:
Pivot_Matrix = source_data.pivot(values='rat', index='uid', columns='iid')
I tried reversing and interchanging the values to get back to the OLD FORMAT, which apparently failed. Is there any way to convert the MATRIX FORMAT back to the OLD FORMAT?
You need stack, then rename_axis to name the resulting index levels (these become the column names), and finally reset_index:
df = df.stack().rename_axis(('uid','iid')).reset_index(name='rat')
print (df.head())
uid iid rat
0 4 1 4.332076
1 4 2 4.340775
2 4 3 4.311200
3 4 4 4.341143
4 5 1 4.000000
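A round trip on a tiny frame shows the two conversions side by side. This is a sketch: the first three rows come from the sample data and the fourth is made up so that one user has two ratings. The explicit dropna() is there because, as far as I know, whether stack() drops empty cells on its own depends on the pandas version.

```python
import pandas as pd

# Long ("OLD") format; the last row is invented for illustration.
source_data = pd.DataFrame(
    {
        "uid": [196, 186, 22, 196],
        "iid": [242, 302, 377, 302],
        "rat": [3, 3, 1, 4],
    }
)

# Long -> matrix: one row per uid, one column per iid, NaN where unrated.
pivot_matrix = source_data.pivot(values="rat", index="uid", columns="iid")

# Matrix -> long: stack the columns into rows, drop the unrated cells,
# name the two index levels, and turn them back into ordinary columns.
long_again = (
    pivot_matrix.stack()
    .dropna()
    .rename_axis(["uid", "iid"])
    .reset_index(name="rat")
)
print(long_again)
```

The ratings come back as floats because the matrix had to hold NaN; the tmp column is not recoverable, since the pivot never carried it.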
I have a dataframe in which under the column "component_id", I have component_ids repeating several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id
0 0 942606 CHEMBL1518722 688422 103668 103668 4891
1 0 942606 CHEMBL1518722 688422 103668 103668 4891
2 0 942606 CHEMBL1518722 688721 78 78 286
3 0 942606 CHEMBL1518722 688721 78 78 286
4 0 942606 CHEMBL1518722 688779 103657 103657 5140
component_synonym
0 LMN1
1 LMNA
2 LGR3
3 TSHR
4 MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene under different names). I want to find the 20 most frequently hit genes, so I ran value_counts on the "component_id" column and got something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to the component_id that is present the most number of times?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can use a count column to sort the rows and then drop it, i.e.:
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')
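For the second part of the question (keeping only the first occurrence of each component_id), drop_duplicates can be applied after the sort. A minimal sketch on made-up rows; the ids and synonyms are borrowed from the head() output above, but the frequencies are invented:

```python
import pandas as pd

# Illustrative data only: the counts per component_id are invented.
df = pd.DataFrame(
    {
        "component_id": [4891, 286, 4891, 5140, 286, 4891],
        "component_synonym": ["LMN1", "LGR3", "LMNA", "MAPT", "TSHR", "LMN1"],
    }
)

# Frequency of each row's component_id, aligned with the original rows.
freq = df.groupby("component_id")["component_id"].transform("count")

# A stable sort keeps the original row order among rows with equal frequency.
df_sorted = (
    df.assign(freq=freq)
    .sort_values("freq", ascending=False, kind="stable")
    .drop(columns="freq")
)

# Keep only the first row seen for each component_id, then take the top 20.
first_only = df_sorted.drop_duplicates(subset="component_id", keep="first")
top20 = first_only.head(20)
print(top20)
```

Using assign here avoids leaving a helper column on the original df.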