Compare entire rows for equality if some condition is satisfied - python
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. Using pandas I have found where in the data these scores exist, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie's row contains exactly the values [1,0,4]. How can I do that?
I will have a large set of data, so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easier if you set name as the index column directly when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want on its own, because it does an element-wise comparison and returns a DataFrame of True/False values, one per cell. So you need to aggregate each row with .all(axis=1), which finds the rows where every comparison was True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
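For reference, here is a small self-contained sketch that wraps the same idea in a helper function, assuming name is the index as above. It accepts a plain list like the x = [1,0,4] from the question; the function name find_matching_names is just illustrative:

import pandas as pd
from io import StringIO

csv_data = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(csv_data), index_col='name')

def find_matching_names(df, scores):
    # scores is a plain list; comparing it to df matches positionally, column by column
    mask = (df == scores).all(axis=1)
    return df[mask].index.tolist()

print(find_matching_names(df, [1, 0, 4]))  # ['Charlie']
print(find_matching_names(df, [9, 9, 9]))  # [] -> "not found"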
You can use DataFrame.loc, which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
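If the number of match columns can vary, the same boolean mask can be built programmatically. This is a sketch of my own rather than part of the answer above; it assumes df is the question's dataframe with name as a regular column and that x lists the scores in the same order as match_cols:

import numpy as np

x = [1, 0, 4]
match_cols = ['match1', 'match2', 'match3']

# AND together one equality check per column
mask = np.logical_and.reduce([df[col] == val for col, val in zip(match_cols, x)])
print(df.loc[mask, 'name'])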
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])
# Loop that prints out the name(s) based on the index of mydf
# (if more than one name matches, it prints all of them; if only one matches, it prints just that one)
for i in range(len(mydf)):
    print(mydf['name'].iloc[i])
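If you just want the names without an explicit loop, a shorter variant of the same idea (a sketch reusing the df and x defined above) is:

mask = (df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])
print(df.loc[mask, 'name'].tolist())  # ['Charlie']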
You can use this. Here data is your DataFrame; change the name accordingly if yours is different. Considering [1,0,4] is of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the data is of object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
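If there are many match columns, writing each comparison by hand gets tedious. Here is a hedged sketch of the same idea that builds the filter from a list of values; it assumes data is the same DataFrame as above and that cols is ordered like values:

import pandas as pd

values = [1, 0, 4]
cols = ['match1', 'match2', 'match3']

# Start with an all-True mask and AND in one condition per column
mask = pd.Series(True, index=data.index)
for col, val in zip(cols, values):
    mask &= (data[col] == val)

print(list(data[mask].index))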
Related
Search for a string in the first 3 characters of a dataframe column
In this data frame, some of the notes in the NOTES column start with the word "PRE"; for those rows I want to set a new column to yes, otherwise no. I wrote the code below for this, but it is not working.

import pandas as pd

df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO",
                              "SCREEN_MAMO",
                              "> Unable to connect internet>4G Compatible>Set",
                              "No>Not Barred>Active>No>Available>Others>",
                              "Internet Not Working>>>Unable To Connect To"]})
df1['NOTES'].astype(str)

for i in df1['NOTES']:
    if i[:3] == 'PRE':
        df1['new'] = 'yes'
    else:
        df1['new'] = 'No'

df1
Set df1['new'] using a list comprehension and a ternary expression:

df1['new'] = ['yes' if i[:3] == 'PRE' else 'no' for i in df1['NOTES']]

When setting dataframe columns this way, you need to set them to whole lists, not individual elements. For a case-insensitive check:

df1['new'] = ['yes' if i[:3].upper() == 'PRE' else 'no' for i in df1['NOTES']]
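As an alternative to the list comprehension, the same check can be vectorized with pandas string methods; this is a sketch using Series.str.startswith and numpy.where rather than the ternary approach above:

import numpy as np

df1['new'] = np.where(df1['NOTES'].str.startswith('PRE'), 'yes', 'no')

# case-insensitive variant
df1['new'] = np.where(df1['NOTES'].str.upper().str.startswith('PRE'), 'yes', 'no')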
You can use a list to append the values and then add it as a dataframe column. Code -

import pandas as pd

df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO",
                              "SCREEN_MAMO",
                              "> Unable to connect internet>4G Compatible>Set",
                              "No>Not Barred>Active>No>Available>Others>",
                              "Internet Not Working>>>Unable To Connect To"]})
df1['NOTES'].astype(str)

data = []
for i in df1['NOTES']:
    if i[:3] == 'PRE':
        data.append('yes')
    else:
        data.append('no')

df1['new'] = data
The code that you posted updates all the values of the 'new' column with 'yes' or 'no' on every iteration, because df1['new'] = 'yes' assigns a single value to the whole column rather than to one row. Try the following:

import pandas as pd

df1 = pd.DataFrame({'NOTES': ...})
df1['NOTES'].astype(str)

new = ['*' for i in range(len(df1['NOTES']))]
for i in range(len(df1['NOTES'])):
    if df1['NOTES'][i][0:3] == "PRE":
        new[i] = 'Yes'
    else:
        new[i] = 'No'

df1['new'] = new
How to drop duplicates ignoring one column
I have a DataFrame with multiple columns, and the last column is a timestamp which I want Python to ignore. I've used drop_duplicates(subset=...) but it does not work, as it returns literally the same DataFrame. This is what the DataFrame looks like:

   id     name  features            timestamp
1  34233  Bob   athletics           04-06-2022
2  23423  John  mathematics         03-06-2022
3  34233  Bob   english_literature  06-06-2022
4  23423  John  mathematics         10-06-2022
...

And these are the data types when doing df.dtypes:

id            int64
name         object
features     object
timestamp    object

Lastly, this is the piece of code I used:

df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)

The idea is to keep track of changes based on the timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed for John; however, I want to keep both rows for Bob as his features changed from athletics to english_literature. Does that make sense?

EDIT: This is the full code:

"""
db_data contains 10 records
new_data contains 12 records but I know only 5 are needed based on the logic I want to implement
"""
db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_csv("new_data.csv")

# Checking columns match
# This prints "matching"
if db_data.columns == new_data.columns:
    print("matching")

df = pd.concat([db_data, new_data], axis=1)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)

# This outputs 22 but should have printed 15
print(len(df))

TEST: I've done a test, but it has puzzled me even more. I've created a separate table in the db, loaded the csv file new_data.csv into it, and then used read_sql to get it back into a DataFrame. Surprisingly, this works. However, I do not want to take this unnecessary extra step, and I am puzzled about why it works. I've checked the data types and they match.

db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_sql("SELECT * FROM test", engine)

# Checking columns match
# This still prints "matching"
if db_data.columns == new_data.columns:
    print("matching")

df = pd.concat([db_data, new_data], axis=1)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)

# This is the right output... in other words, it worked.
print(len(df))
The remove method of a list returns None; that's why the returned dataframe is unchanged. You can do as follows:

Create the list of columns for the subset:

col_subset = df.columns.tolist()

Remove timestamp:

col_subset.remove('timestamp')

Use the col_subset list in the drop_duplicates() function:

df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
Try this:

consider = [x for x in df.columns if x != "timestamp"]
df.drop_duplicates(subset=consider).reset_index(drop=True)

(You don't need tolist() or keep="first" here; keep="first" is already the default.)
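As a quick sanity check, the approach above can be run on the sample rows from the question (a minimal sketch; the id/name/features/timestamp values are copied from the example table):

import pandas as pd

df = pd.DataFrame({
    'id':        [34233, 23423, 34233, 23423],
    'name':      ['Bob', 'John', 'Bob', 'John'],
    'features':  ['athletics', 'mathematics', 'english_literature', 'mathematics'],
    'timestamp': ['04-06-2022', '03-06-2022', '06-06-2022', '10-06-2022'],
})

consider = [x for x in df.columns if x != "timestamp"]
out = df.drop_duplicates(subset=consider).reset_index(drop=True)
print(out)  # the second John/mathematics row is dropped; both Bob rows are kept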
If I understood you correctly, this code would do:

df.drop_duplicates(subset='features', keep='first').reset_index()
Python remove everything after specific string and loop through all rows in multiple columns in a dataframe
I have a file full of URL paths like the one below, spanning 4 columns in a dataframe that I am trying to clean:

Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]

I want to remove everything after a specific string, which I defined as string1, and I would like to loop through all 4 columns of the dataframe defined as df_MasterData:

string1 = "&FolderCTID"

import pandas as pd

df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']

for i in cols:
    # Objective: delete string1 = "&FolderCTID" and everything after it
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]

I searched Google and found similar solutions, but none of them work. Can any guru shed some light on this? Any assistance is appreciated.

Added below are a few example rows in columns A and B for these URLs:

Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']

Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it: first declare a variable with your target columns, then use stack() and str.split to get your target output; finally, unstack and reassign the output to your original df.

cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"

df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)

If you want to replace these columns in your target df, then simply do:

df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
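Another option, my own suggestion rather than part of the answer above, is a vectorized regex replace that cuts each cell at string1 and keeps everything before it; re.escape guards against any regex metacharacters in the marker string. It assumes df holds the four URL columns (df_MasterData in the question):

import re

string1 = "&FolderCTID"
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']

# Drop string1 and everything after it in each cell of the selected columns
df[cols_to_slice] = df[cols_to_slice].replace(re.escape(string1) + '.*', '', regex=True)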
You should first get the index of the string using:

indexes = len(string1) + df_MasterData[i].str.find(string1)  # this selects the final location of the string

# If you don't want string1 included in the result, use this one instead:
indexes = df_MasterData[i].str.find(string1)

Now do:

df_MasterData[i] = df_MasterData[i].str[:indexes]
Counting the repeated values in one column based on another column
Using pandas, I am dealing with the following CSV data:

f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win

For this part of the raw data, I was trying to return something like:

Column1_name -- t -- count of nowin = 0
Column1_name -- t -- count of won = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1

Based on this idea ("get dataframe row count based on conditions"), I was thinking of doing something like this:

print(df[df.target == 'won'].count())

However, this always returns the same number of "won" rows based on the last column, without taking into consideration whether the first column is an "f" or a "t". In other words, I was hoping to use something from pandas that would work like a "group by" in SQL, grouping based on, for example, the 1st and last columns. Should I keep pursuing this idea, or should I simply start using for loops?

If you need it, here is the rest of my code:

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url, names=[
    'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
    'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
    'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])

features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
            'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
            'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']

# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
#    item[:,0]

num_of_won = 0
num_of_nowin = 0
for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1

print(num_of_won)
print(num_of_nowin)

print(df[df.target == 'won'].count())

#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -

outdf = df.apply(lambda x: pd.crosstab(index=df.target, columns=x).to_dict())

Basically, we go over each feature column and make a crosstab of it against the target column. Hope this helps! :)
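If you only need the first column against the target, a more direct sketch of the same crosstab idea (here 'bkblk' is the first feature column from the question's names list, and df is the question's dataframe) would be:

# Counts of each target outcome for each value ('t'/'f') of the first column
print(pd.crosstab(df['bkblk'], df['target']))

# The same counts expressed as a SQL-style group by
print(df.groupby(['bkblk', 'target']).size())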
pandas fillna is not working on subset of the dataset
I want to impute the missing values for df['box_office_revenue'] with the median specified by df['release_date'] == x and df['genre'] == y. Here is my median finder function below.

def find_median(df, year, genre, col_year, col_rev):
    median = df[(df[col_year] == year) & (df[col_rev].notnull()) & (df[genre] > 0)][col_rev].median()
    return median

The median function works. I checked. I did the code below since I was getting some CopyValue error.

pd.options.mode.chained_assignment = None  # default='warn'

I then go through the years and genres, with col_name = ['is_drama', 'is_horror', etc]:

i = df['release_year'].min()
while (i < df['release_year'].max()):
    for genre in col_name:
        median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
        df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
    print(i)
    i += 1

However, nothing changed!

len(df['box_office_revenue'].isnull())

The output was 35527, meaning none of the null values in df['box_office_revenue'] had been filled. Where did I go wrong? Here is a quick look at the data: (the other columns are just binary variables)
You mentioned:

I did the code below since I was getting some CopyValue error...

The warning is important. You did not give your data, so I cannot actually check, but the problem is likely due to:

df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(..)

Let's break this down. First you select some rows with:

df[(df['release_year'] == i) & (df[genre] > 0)]

Then from that, you select a column with:

...['box_office_revenue']

And now you have a problem... Why? The problem is that when you selected some rows (i.e. not all), pandas was forced to create a copy of your dataframe. You then select a column of the copy! Then you fillna() on the copy. Not super useful.

How do I fix it? Select the column first:

df['box_office_revenue'][(df['release_year'] == i) & (df[genre] > 0)].fillna(..)

By selecting the entire column first, pandas is not forced to make a copy, and thus subsequent operations should work as desired.
This is not elegant, but I think it works. Basically, I calculate the means conditioned on genre and year, and then join the data to a dataframe containing the imputing values. Then, wherever the revenue data is null, I replace the null with the imputed value.

import pandas as pd
import numpy as np

# Fake data
rev = np.random.normal(size=10_000, loc=20)
rev_ix = np.random.choice(range(rev.size), size=100)
rev[rev_ix] = np.NaN

year = np.random.choice(range(1950, 2018), replace=True, size=10_000)
genre = np.random.choice(list('abc'), size=10_000, replace=True)

df = pd.DataFrame({'rev': rev, 'year': year, 'genre': genre})

imputing_vals = df.groupby(['year', 'genre']).mean()

s = df.set_index(['year', 'genre'])
s.rev.isnull().any()  # True

# Creates dataframe with new column containing the means
s = s.join(imputing_vals, rsuffix='_R')
s.loc[s.rev.isnull(), 'rev'] = s.loc[s.rev.isnull(), 'rev_R']

new_df = s['rev'].reset_index()
new_df.rev.isnull().any()  # False
This URL describing chained assignments seems useful for such a case: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

As seen in the above URL, instead of doing (in your for loop):

for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)

you can try:

for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df.loc[(df['release_year'] == i) & (df[genre] > 0) & (df['box_office_revenue'].isnull()), 'box_office_revenue'] = median