I need to merge some data in a dataframe because I am going to implement sequential association rule mining in Python.
How can I merge the data, and which algorithm should I use in Python?
Apriori? FP-Growth?
I can't find a sequential association rule implementation using Apriori in Python.
The examples I find use R.
There are 250 visit places and 116,807 unique IDs, for about 1.7 million rows in total. Also, each ID has a country_code (111 countries, which I will classify into 10 groups), so I will merge on that once more.
Previous Data
index date_ymd id visit_nm country
1 20170801 123123 seoul 460
2 20170801 123123 tokyo 460
3 20170801 124567 seoul 440
4 20170802 123123 osaka 460
5 20170802 123123 seoul 460
... ... ... ...
What I need
index Transaction visit_nm country
1 20170801123123 {seoul,tokyo} 460
2 20170802123123 {osaka,seoul} 460
From what I understood from the data, use groupby with agg:
s = pd.Series(df.date_ymd.astype(str) + df.id.astype(str), name='Transaction')
(df.groupby(s)
   .agg({'visit_nm': lambda x: set(x), 'country': 'first'})
   .reset_index())
Transaction visit_nm country
0 20170801123123 {seoul, tokyo} 460
1 20170801124567 {seoul} 440
2 20170802123123 {osaka, seoul} 460
You could also use:
df['Transaction'] = df['date_ymd'].map(str) + df['id'].map(str)
df.groupby('Transaction').agg({'visit_nm': lambda x: set(x), 'country': 'first'}).reset_index()
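As for the mining step itself: pandas has no built-in sequential-rule miner, but classic (non-sequential) Apriori and FP-Growth are available in the mlxtend package, and truly sequential patterns can be mined with e.g. the prefixspan package. Below is a minimal sketch of the Apriori route, assuming mlxtend is installed and that the merged frame produced above is named out; min_support and min_threshold are placeholder values to tune.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# turn each visit_nm set into a list of items per transaction
transactions = out['visit_nm'].apply(list).tolist()

# one-hot encode the transactions into a boolean item matrix
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# mine frequent itemsets, then derive association rules from them
itemsets = apriori(onehot, min_support=0.01, use_colnames=True)
rules = association_rules(itemsets, metric='confidence', min_threshold=0.5)
print(rules.head())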
Hello, I am doing my assignment and I have encountered a question that I can't answer. The question asks me to create another DataFrame, df_urban, consisting of all columns of the original dataset but containing only applicants with Urban status in their Property_Area attribute (excluding Rural and Semiurban) and with ApplicantIncome of at least S$10,000, then to reset the row index and display the last 10 rows of this DataFrame.
My code, however, meets neither the criterion of ApplicantIncome of at least 10,000 nor that of Urban-only status in the area:
df_urban = df
df_urban.iloc[-10:[11]]
I was wondering what the solution to the question is.
You can use the & operator to combine multiple column conditions:
df_urban = df[(df[col1] == <condition>) & (df[col2] >= <condition>)]
The following is a simple code snippet performing a proof of principle: extracting a subset of the primary data frame to produce a subset data frame of only "Urban" locations.
import pandas as pd
df=pd.read_csv('Applicants.csv',delimiter='\t')
print(df)
df_urban = df[(df['Property_Area'] == 'Urban')]
print(df_urban)
Using a simple hand-built CSV file, here is a sample of the output.
ApplicantIncome CoapplicantIncome LoanAmount Loan_Term Credit_History Property_Area
0 4583 1508 128000 360 1 Rural
1 1222 0 55000 360 1 Rural
2 8285 0 64000 360 1 Urban
3 3988 1144 75000 360 1 Rural
4 2588 0 84700 360 1 Urban
5 5248 0 48550 360 1 Rural
6 7488 0 111000 360 1 SemiUrban
7 3252 1112 14550 360 1 Rural
8 1668 0 67500 360 1 Urban
ApplicantIncome CoapplicantIncome LoanAmount Loan_Term Credit_History Property_Area
2 8285 0 64000 360 1 Urban
4 2588 0 84700 360 1 Urban
8 1668 0 67500 360 1 Urban
Hope that helps.
Regards.
See below. I leave it to you to work out how to reset the index. You might want to look at .tail() to display the last rows.
df_urban = df[(df['ApplicantIncome'] >= 10000) & (df['Property_Area'] == 'Urban')]
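For reference, a minimal sketch combining those hints, assuming the column names shown in the sample data ("at least S$10,000" reads as >= 10000):

# filter, then reset the row index and show the last 10 rows
df_urban = df[(df['ApplicantIncome'] >= 10000) &
              (df['Property_Area'] == 'Urban')].reset_index(drop=True)
print(df_urban.tail(10))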
Edited my previous question:
I want to distinguish each device (FOUR types) attached to a particular building's particular elevator (represented by height).
As there are no unique IDs for the devices, I want to identify them and assign a unique ID to each by grouping on ('BldgID', 'BldgHt', 'Device') to identify any particular 'Device'.
I then want to count their testing results, i.e. how many times a device failed (NG) out of the total number of tests (NG + OK) on any particular date, over an entire duration of a few months.
The original dataframe looks like this:
BldgID BldgHt Device Date Time Result
1074 34.0 790 2018/11/20 10:30 OK
1072 31.0 780 2018/11/19 11:10 NG
1072 36.0 780 2018/11/17 05:30 OK
1074 10.0 790 2018/11/19 06:10 OK
1074 10.0 790 2018/12/20 11:50 NG
1076 17.0 760 2018/08/15 09:20 NG
1076 17.0 760 2018/09/20 13:40 OK
As 'Time' is irrelevant, I dropped it. I want to find the number of NG results per day for each set (consisting of 'BldgID', 'BldgHt', 'Device').
# aggregate both functions in a single groupby
df1 = mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])\
            ['Result'].agg([('NG', lambda x: (x == 'NG').sum()),
                            ('ALL', 'count')]).round(2).reset_index()
# create New_ID by inserting a Series, zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1),
              index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Now the filtered DataFrame looks like:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 2
8 002 1076 17.0 760 2018/11/20 1 1
If I group by ['BldgID', 'BldgHt', 'Device', 'Date'] then I get the per-day 'NG' count.
But that treats every day separately, whereas if I assign 'unique' IDs I can plot how the unique devices behave from one day to the next.
If I group by ['BldgID', 'BldgHt', 'Device'] then I get the overall 'NG' count for that set (or unique device), which is not my goal.
What I want to achieve is:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
001 1072 31.0 780 2018/11/19 1 2
1072 31.0 780 2018/12/30 3 4
002 1076 17.0 760 2018/11/20 1 1
1076 17.0 760 2018/09/20 2 4
003 1072 36.0 780 2018/08/15 1 3
Any tips would be very much appreciated.
Use:
import numpy as np
import pandas as pd

# aggregate both functions in a single groupby
df1 = mel_df.groupby(['BldgID', 'BldgHt', 'Device', 'Date'])\
            ['Result'].agg([('NG', lambda x: (x == 'NG').sum()),
                            ('ALL', 'count')]).round(2).reset_index()
# keep only rows with at least one NG
mel_df2 = df1[df1.NG != 0]
# keep only the first row per Date
mel_df2 = mel_df2.drop_duplicates('Date')
# create New_ID by inserting a Series, zero-filled to 3 digits
s = pd.Series(np.arange(1, len(mel_df2) + 1), index=mel_df2.index).astype(str).str.zfill(3)
mel_df2.insert(0, 'New_ID', s)
Output from data from question:
print (mel_df2)
New_ID BldgID BldgHt Device Date NG ALL
1 001 1072 31.0 780 2018/11/19 1 1
8 002 1076 17.0 760 2018/11/20 1 1
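Note that drop_duplicates('Date') keeps only one row per calendar date, so two different devices tested on the same day would collide. If the goal is instead one ID per (BldgID, BldgHt, Device) set shared across all of its dates, as in the desired output above, a sketch using groupby().ngroup() (which numbers each group consecutively) may be closer:

# assumes df1 is the aggregated frame built above
mel_df2 = df1[df1.NG != 0].copy()
group_no = mel_df2.groupby(['BldgID', 'BldgHt', 'Device']).ngroup() + 1
mel_df2.insert(0, 'New_ID', group_no.astype(str).str.zfill(3))
print(mel_df2)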
I have a dataframe like so:
Year RS Team RS_target
1 1962 599 WSA
2 1962 774 STL
3 1963 747 WSA
4 1963 725 STL
5 1964 702 WSA
6 1964 800 STL
I'd like to create a new column (RS_target) that will hold the RS value for the next year (e.g. for index 1: Year = 1962, RS = 599, RS_target = 747). The aim is to get next year's RS for the same team and place that value in the new column "RS_target".
I've been trying a combination of conditionals and apply(), but I'm having trouble getting the output I want. I'm looking for an efficient alternative method, or any other way to get the desired outcome. Thanks!
You need to first apply DataFrame.groupby() on the Team column and then use shift(-1) to get the next RS value for each team.
df = pd.DataFrame({'Year':[1962,1962,1963,1963,1964,1964], 'RS':[599,774,747,725,702,800], 'Team':['WSA','STL','WSA','STL','WSA','STL']})
df['RS_Target'] = df.groupby('Team')['RS'].shift(-1)
print(df)
Output:
Year RS Team RS_Target
0 1962 599 WSA 747.0
1 1962 774 STL 725.0
2 1963 747 WSA 702.0
3 1963 725 STL 800.0
4 1964 702 WSA NaN
5 1964 800 STL NaN
EDIT:
If your Year column contains values in random order, sort it as below before applying the groupby operation:
df.sort_values(['Year'], inplace=True)
I have a dataframe in which under the column "component_id", I have component_ids repeating several times.
Here is what the df looks like:
In [82]: df.head()
Out[82]:
index molregno chembl_id assay_id tid tid component_id
0 0 942606 CHEMBL1518722 688422 103668 103668 4891
1 0 942606 CHEMBL1518722 688422 103668 103668 4891
2 0 942606 CHEMBL1518722 688721 78 78 286
3 0 942606 CHEMBL1518722 688721 78 78 286
4 0 942606 CHEMBL1518722 688779 103657 103657 5140
component_synonym
0 LMN1
1 LMNA
2 LGR3
3 TSHR
4 MAPT
As can be seen, the same component_id can be linked to various component_synonyms (essentially the same gene, but different names). I wanted to find the frequency of each gene, since I want to identify the top 20 most frequently hit genes, and therefore I performed value_counts on the column "component_id". I get something like this:
In [84]: df.component_id.value_counts()
Out[84]:
5432 804
3947 402
5147 312
3 304
2693 294
75 282
Name: component_id, dtype: int64
Is there a way for me to order the entire dataframe according to how often each component_id appears?
And also, is it possible for my dataframe to contain only the first occurrence of each component_id?
Any advice would be greatly appreciated!
I think you can make use of a count column to sort the rows, then drop it, i.e.:
df['count'] = df.groupby('component_id')['component_id'].transform('count')
df_sorted = df.sort_values(by='count', ascending=False).drop(columns='count')
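For the second part, keeping only the first occurrence of each component_id, drop_duplicates does it; a minimal sketch, applied after the sort above:

# keep only the first row per component_id, preserving the frequency order
df_first = df_sorted.drop_duplicates(subset='component_id', keep='first')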
I'm using the MovieLens dataset in Python pandas. I need to print the matrix of u.data, a tab-separated file, in the following way:
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating Rating
UserID2 Rating Rating Rating
I've already been through the following links:
One - the dataset is much too huge to put into a Series
Two - transposing a row is not mentioned
Three - tried reindex so as to get NaN values in one column
Four - df.iloc and df.ix didn't work either
I need output that shows me the rating, and NaN (when not rated), for each movie with respect to each user.
NULL MovieID1 MovieID2 MovieID3
UserID1 Rating Rating NaN
UserID2 Rating NaN Rating
P.S. I won't mind solutions using numpy, crab, recsys, csv or any other Python package.
EDIT 1 - Sorted the data and exported it, but got an additional field:
df2 = df.sort_values(['UserID', 'MovieID'])
print(type(df2))
df2.to_csv("sorted.csv")
print(df2)
This produces the following sorted.csv file:
,UserID,MovieID,Rating,TimeStamp
32236,1,1,5,874965758
23171,1,2,3,876893171
83307,1,3,4,878542960
62631,1,4,3,876893119
47638,1,5,3,889751712
5533,1,6,5,887431973
70539,1,7,4,875071561
31650,1,8,1,875072484
20175,1,9,5,878543541
13542,1,10,3,875693118
EDIT 2 - As asked in the comments, here's the format of the data in the u.data file, which acts as input:
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
One method:
Use pivot_table. If there is one value per user and movie ID then aggfunc doesn't matter; however, if there are multiple values, then choose your aggregation.
df.pivot_table(values='Rating',index='UserID',columns='MovieID', aggfunc='mean')
Second method (no duplicate userid, movieid records):
df.set_index(['UserID','MovieID'])['Rating'].unstack()
Third method (no duplicate userid, movieid records):
df.pivot(index='UserID',columns='MovieID',values='Rating')
Fourth method (like the first, you can choose your aggregation method):
df.groupby(['UserID','MovieID'])['Rating'].mean().unstack()
Output:
MovieID 1 2 3 4 5 6 7 8 9 10
UserID
1 5 3 4 3 3 5 4 1 5 3
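For completeness, an end-to-end sketch, assuming u.data has the four tab-separated columns shown in EDIT 2 and no header row:

import pandas as pd

# u.data columns: user id, movie id, rating, timestamp (tab-separated, no header)
cols = ['UserID', 'MovieID', 'Rating', 'TimeStamp']
df = pd.read_csv('u.data', sep='\t', names=cols)

# users as rows, movies as columns, NaN where a user did not rate a movie
matrix = df.pivot_table(values='Rating', index='UserID',
                        columns='MovieID', aggfunc='mean')
print(matrix.head())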