Get a subset of columns by row value in pandas - python
I have a DataFrame of users and their ratings for movies:
userId movie1 movie2 movie3 movie4 movie5 movie6
0 4.1 NaN 1.0 NaN 2.1 NaN
1 3.1 1.1 3.4 1.4 NaN NaN
2 2.8 NaN 1.7 NaN 3.0 NaN
3 NaN 5.0 NaN 2.3 NaN 2.1
4 NaN NaN NaN NaN NaN NaN
5 2.3 NaN 2.0 4.0 NaN NaN
There isn't actually a userId column in the DataFrame; it's just being used as the index.
From this DataFrame, I'm trying to make another DataFrame that only contains movies that have been rated by a specific user. For example, if I wanted a new DataFrame of only the movies rated by the user with userId == 0, the output would be a DataFrame with:
userId movie1 movie3 movie5
0 4.1 1.0 2.1
1 3.1 3.4 NaN
2 2.8 1.7 3.0
3 NaN NaN NaN
4 NaN NaN NaN
5 2.3 2.0 NaN
I know how to iterate over the columns, but I don't know how to select the columns I want by checking a row value.
You can use the .loc accessor to select the row for the given userId, then use notna to create a boolean mask marking the columns that do not contain NaN for that user, and finally use this mask to filter the columns:
userId = 0  # specify the userId here
df_user = df.loc[:, df.loc[userId].notna()]
Details:
>>> df.loc[userId].notna()
movie1 True
movie2 False
movie3 True
movie4 False
movie5 True
movie6 False
Name: 0, dtype: bool
>>> df.loc[:, df.loc[userId].notna()]
movie1 movie3 movie5
userId
0 4.1 1.0 2.1
1 3.1 3.4 NaN
2 2.8 1.7 3.0
3 NaN NaN NaN
4 NaN NaN NaN
5 2.3 2.0 NaN
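For reuse, the same idea fits in a small helper. Here is a minimal, self-contained sketch that rebuilds the example frame from the question; the function name rated_by is my own, not part of the answer above:

import pandas as pd
import numpy as np

def rated_by(df, user_id):
    """Return df restricted to the movies that user_id has rated."""
    mask = df.loc[user_id].notna()  # boolean Series over the movie columns
    return df.loc[:, mask]          # keep all rows, only the masked columns

df = pd.DataFrame(
    {"movie1": [4.1, 3.1, 2.8, np.nan, np.nan, 2.3],
     "movie2": [np.nan, 1.1, np.nan, 5.0, np.nan, np.nan],
     "movie3": [1.0, 3.4, 1.7, np.nan, np.nan, 2.0],
     "movie4": [np.nan, 1.4, np.nan, 2.3, np.nan, 4.0],
     "movie5": [2.1, np.nan, 3.0, np.nan, np.nan, np.nan],
     "movie6": [np.nan, np.nan, np.nan, 2.1, np.nan, np.nan]})
df.index.name = "userId"
print(rated_by(df, 0))  # columns movie1, movie3, movie5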
Another approach:
import pandas as pd

user0 = df.iloc[0, :]      # select the first row
flags = user0.notna()      # flag the non-NaN values
flags = flags.tolist()     # convert to a list instead of a Series
newdf = df.iloc[:, flags]  # all rows, and only the columns where flags are True
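Side note: the tolist() conversion is there because, as far as I recall, .iloc will not accept a boolean Series as a mask (a plain list or array is fine). The equivalent .loc form takes the Series directly, so the conversion can be dropped; a small sketch under the same df assumption:

user0 = df.iloc[0]                # first row as a Series
newdf = df.loc[:, user0.notna()]  # .loc accepts the boolean Series as-is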
Select the row for the userId of interest with .loc into a new DataFrame, keeping only the relevant columns by dropping those that hold NaN.
Then pd.concat that DataFrame with the rows of the other userIds, restricted to the columns (movies) of the userId you selected:
user = 0  # set your userId
a = df.loc[[user]].dropna(axis=1)                # the user's row, NaN columns dropped
b = pd.concat([a, df.drop(a.index)[a.columns]])  # other users, same columns
Which prints:
b
movie1 movie3 movie5
userId
0 4.10 1.00 2.10
1 3.10 3.40 NaN
2 2.80 1.70 3.00
3 NaN NaN NaN
4 NaN NaN NaN
5 2.30 2.00 NaN
Note that I have set the index to be userId as you specified.
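For what it's worth, since a.columns already names the movies to keep, I believe plain column selection would give the same subset in one step, with rows in df's original order rather than the selected user first:

b_alt = df[a.columns]  # same columns for every row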
Related
Replace Column with single value but LEAVE NaNs
I'm trying to replace an entire column with a single value; however, I want to leave the NaNs in place. How do I go about doing that? Let's say for column 'Q1' I would like to replace every value with 1 but leave every row that has NaN alone. In the end, every row of 'Q1' that has a numeric value would now hold the value 1, and every row that has NaN would still be NaN.

     Q1    Q2    Q3    Q4
0   NaN   NaN  1.33   NaN
1   NaN   NaN   NaN  1.35
2  0.93   NaN   NaN   NaN
3   NaN  1.08   NaN   NaN
4   NaN   NaN  1.28   NaN
...
In [13]: df
Out[13]:
     Q1   Q2
0   NaN  1.0
1   NaN  2.0
2  0.93  NaN
3   NaN  3.0
4   NaN  4.0

In [14]: df.loc[~df.Q1.isna(), 'Q1'] = 1

In [15]: df
Out[15]:
    Q1   Q2
0  NaN  1.0
1  NaN  2.0
2  1.0  NaN
3  NaN  3.0
4  NaN  4.0
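If you'd rather not assign through .loc, the same effect can be had with Series.mask, which replaces values wherever the condition holds. A minimal sketch assuming the same df:

# set Q1 to 1 wherever it is not NaN, leave the NaNs untouched
df['Q1'] = df['Q1'].mask(df['Q1'].notna(), 1)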
Python pivot DataFrame without index columns
df.pivot(columns='colname', values='value') is introducing nulls in the resultant DataFrame.

Initial DF:

       colname    value
0    bathrooms      1.0
1    bathrooms      2.0
2    bathrooms      1.0
3    bathrooms      2.0
4  property_id  82671.0

Result:

colname  addr_street  bathrooms  bedrooms  lat  lng  parking_space  property_id
0                NaN        1.0       NaN  NaN  NaN            NaN          NaN
1                NaN        2.0       NaN  NaN  NaN            NaN          NaN
2                NaN        1.0       NaN  NaN  NaN            NaN          NaN

I just want a DataFrame where the unique values of 'colname' in the initial df are the columns and each holds its corresponding values (like it happens for bathrooms).
If I understand, you want a groupby and concatenation, not pivot:

df = pd.concat(
    {k: g.reset_index(drop=True) for k, g in df.groupby('colname')['value']},
    axis=1)

df

   bathrooms  property_id
0        1.0      82671.0
1        2.0          NaN
2        1.0          NaN
3        2.0          NaN
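An alternative reshaping that should give the same frame, sketched under the same df assumption: number the rows within each colname group with cumcount, then pivot on that counter. The helper column name idx is my own:

# within-group row number becomes the new index, colname the columns
out = (df.assign(idx=df.groupby('colname').cumcount())
         .pivot(index='idx', columns='colname', values='value'))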
Counting rows with NaN
I have the following DataFrame:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN

What I have to do is count the rows that are exactly the same, including the NaN values. The problem is the following: I use groupby, but it ignores NaN values, i.e. it does not take them into account when counting, which is why I am not getting a correct count of the exact repetitions among those rows. My code is the following:

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    # This line should count my repeated rows
    s = aux.groupby(data.columns.tolist(), as_index=False).transform('size')
    return x

If I print the x variable, I get this result, which shows all the repeated rows:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
13  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
17  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
31  1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN
44  1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
47  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
51  3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
53  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN

Now I have to count which rows of my x result are exactly the same. This should be my correct output:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff  num_reps
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN         4
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0         2
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0         3
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN         2

Here is my problem: groupby ignores NaN values, which is why other similar posts about this problem can't help me. Thanks
If your dataframe's name is df, you can count the number of duplicate rows using just one line of code:

sum(df.duplicated(keep=False))

If you want to drop duplicate rows, use the drop_duplicates method (documentation).

Example:

# data.csv
col1,col2,col3
a,3,NaN    # duplicate
b,9,4      # duplicate
c,12,5
a,3,NaN    # duplicate
b,9,4      # duplicate
d,19,20
a,3,NaN    # duplicate - 5 duplicate rows

Importing data.csv and dropping duplicate rows (by default the first instance of a duplicated row is kept):

import pandas as pd

df = pd.read_csv("data.csv")
print(df.drop_duplicates())

# Output
  col1  col2  col3
0    a     3   NaN
1    b     9   4.0
2    c    12   5.0
5    d    19  20.0

To count the duplicate rows, use the DataFrame's duplicated method with keep set to False (documentation). As mentioned above, you can simply do this with sum(df.duplicated(keep=False)). Here's a longer way that demonstrates what the duplicated method does:

duplicate_rows = df.duplicated(keep=False)
print(duplicate_rows)

# count the number of duplicates, i.e. the number of True values in
# the duplicate_rows boolean series
number_of_duplicates = sum(duplicate_rows)
print("Number of duplicate rows:")
print(number_of_duplicates)

# Output
# The duplicate_rows pandas series from df.duplicated(keep=False)
0     True
1     True
2    False
3     True
4     True
5    False
6     True
dtype: bool

# The count from sum(df.duplicated(keep=False))
Number of duplicate rows:
5
I just solved it. The problem, as I said, was that groupby doesn't accept NaN values. So what I did to solve it was change all NaN values with the fillna(0) function; it turns every NaN into 0, and now the comparison works properly. Here is my new function, working correctly:

from pandas import DataFrame

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    aux = data[data.duplicated(keep=False)]
    x = data[data.duplicated(keep=False)].drop_duplicates()
    s = (aux.fillna(0)
            .groupby(data.columns.tolist()).size()
            .reset_index().rename(columns={0: 'count'}))
    x['num_reps'] = s['count'].tolist()[::-1]
    return x
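For what it's worth, on pandas 1.1 and later the fillna workaround shouldn't be needed: groupby accepts dropna=False, which keeps NaN as a valid grouping key. A sketch of just the counting step under that assumption:

# count exact duplicate rows, treating NaN as equal to NaN (pandas >= 1.1)
s = aux.groupby(data.columns.tolist(), dropna=False).size()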
Duplicated rows from .csv and count - python
I have to get the number of times that a complete line appears repeated in my DataFrame, then show only those lines that have repetitions, with a last column showing how many times each of those lines appears. Input values for creating the correct output table:

dur,wage1,wage2,wage3,cola,hours,pension,stby_pay,shift_diff,educ_allw,holidays,vacation,ldisab,dntl,ber,hplan,agr
2,4.5,4.0,?,?,40,?,?,2,no,10,below average,no,half,?,half,bad
2,2.0,2.0,?,none,40,none,?,?,no,11,average,yes,none,yes,full,bad
3,4.0,5.0,5.0,tc,?,empl_contr,?,?,?,12,generous,yes,none,yes,half,good
1,2.0,?,?,tc,40,ret_allw,4,0,no,11,generous,no,none,no,none,bad
1,6.0,?,?,?,38,?,8,3,?,9,generous,?,?,?,?,good
2,2.5,3.0,?,tcf,40,none,?,?,?,11,below average,?,?,yes,?,bad
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good
1,2.1,?,?,tc,40,ret_allw,2,3,no,9,below average,yes,half,?,none,bad
1,2.8,?,?,none,38,empl_contr,2,3,no,9,below average,yes,half,?,none,bad
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.3,4.4,?,?,38,?,?,4,?,12,generous,?,full,?,full,good
1,2.8,?,?,?,35,?,?,2,?,12,below average,?,?,?,?,good
2,2.0,2.5,?,?,35,?,?,6,yes,12,average,?,?,?,?,good
1,5.7,?,?,none,40,empl_contr,?,4,?,11,generous,yes,full,?,?,good
2,4.5,4.0,?,none,40,?,?,4,?,12,average,yes,full,yes,half,good
3,3.5,4.0,4.6,none,36,?,?,3,?,13,generous,?,?,yes,full,good
3,3.7,4.0,5.0,tc,?,?,?,?,yes,?,?,?,?,yes,?,good
3,2.0,3.0,?,tcf,?,empl_contr,?,?,yes,?,?,yes,half,yes,?,good

I just have to keep those rows that are totally equal. This is the result table:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff  num_reps
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN         4
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0         2
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0         3
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN         2

As you can see in this table, we keep, for example, the line with index 6 because lines 6 and 17 of the input are exactly the same. With my current code:

def detect_duplicates(data):
    x = DataFrame(columns=data.columns.tolist() + ["num_reps"])
    x = data[data.duplicated(keep=False)].drop_duplicates()
    return x

I get the rows correctly; however, I do not know how to count the repeated rows and then add the count in the 'num_reps' column at the end of the table. This is my result, without the last column that counts the number of repeated rows:

    dur  wage1  wage2  wage3  cola  hours     pension  stby_pay  shift_diff
6   3.0    2.0    3.0    NaN   tcf    NaN  empl_contr       NaN         NaN
8   1.0    2.8    NaN    NaN  none   38.0  empl_contr       2.0         3.0
9   1.0    5.7    NaN    NaN  none   40.0  empl_contr       NaN         4.0
43  2.0    2.5    3.0    NaN   NaN   40.0        none       NaN         NaN

How can I perform a correct count, based on the equality of all the data in the columns, then add it and show it in the 'num_reps' column?
Delete rows in dataframe based on column values
I need to rid myself of all rows with a null value in column C. Here is the code:

infile = "C:\****"
df = pd.read_csv(infile)

  A    B    C    D
  1    1  NaN    3
  2    3    7  NaN
  4    5  NaN    8
  5  NaN    4    9
NaN    1    2  NaN

There are two basic methods I have attempted.

Method 1 (source: How to drop rows of Pandas DataFrame whose value in certain columns is NaN):

df.dropna()

The result is an empty DataFrame, which makes sense because there is an NaN value in every row.

df.dropna(subset=[3])

For this method I tried to play around with the subset value, using both the column index number and the column name. The DataFrame is still empty.

Method 2 (source: Deleting DataFrame row in Pandas based on column value):

df = df[df.C.notnull()]

Still results in an empty DataFrame! What am I doing wrong?
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 1, np.nan, 3],
                   [2, 3, 7, np.nan],
                   [4, 5, np.nan, 8],
                   [5, np.nan, 4, 9],
                   [np.nan, 1, 2, np.nan]],
                  columns=['A', 'B', 'C', 'D'])
df = df[df['C'].notnull()]
df
It's just proof that your method 2 works properly (at least with pandas 0.18.0):

In [100]: df
Out[100]:
     A    B    C    D
0  1.0  1.0  NaN  3.0
1  2.0  3.0  7.0  NaN
2  4.0  5.0  NaN  8.0
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [101]: df.dropna(subset=['C'])
Out[101]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [102]: df[df.C.notnull()]
Out[102]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN

In [103]: df = df[df.C.notnull()]

In [104]: df
Out[104]:
     A    B    C    D
1  2.0  3.0  7.0  NaN
3  5.0  NaN  4.0  9.0
4  NaN  1.0  2.0  NaN