Compare entire rows for equality if some condition is satisfied - python
Let's say I have the following data of a match in a CSV file:
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4
I'm writing a Python program. Somewhere in my program I have the scores collected for a match stored in a list, say x = [1,0,4]. Using pandas I have found where in the data these scores exist, and I can print "found" or "not found". However, I want my code to print out which name these scores correspond to. In this case the program should output "Charlie", since Charlie's row contains exactly the values [1,0,4]. How can I do that?
I will have a large set of data, so I must be able to tell which name corresponds to the numbers I pass to the program.
Yes, here's how to compare entire rows in a dataframe:
df[(df == x).all(axis=1)].index # where x is the pd.Series we're comparing to
Also, it makes life easier if you set name as the index column directly when you read in the CSV.
import pandas as pd
from io import StringIO
df = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(df), index_col='name')
x = pd.Series({'match1':1, 'match2':0, 'match3':4})
Now you can see that doing df == x, or equivalently df.eq(x), is not quite what you want on its own, because it does an element-wise comparison and returns a DataFrame of True/False values, one per cell. So you need to aggregate each row with .all(axis=1), which finds the rows where every comparison was True...
df.eq(x).all(axis=1)
df[ (df == x).all(axis=1) ]
# match1 match2 match3
# name
# Charlie 1 0 4
...and then finally since you only want the name of such rows:
df[ (df == x).all(axis=1) ].index
# Index(['Charlie'], dtype='object', name='name')
df[ (df == x).all(axis=1) ].index.tolist()
# ['Charlie']
which is what you wanted. (I only added the spaces inside the expression for clarity).
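For reference, here is a small self-contained sketch that wraps the same idea in a helper function, assuming name is the index as above. It accepts a plain list like the x = [1,0,4] from the question; the function name find_matching_names is just illustrative:

import pandas as pd
from io import StringIO

csv_data = """\
name,match1,match2,match3
Alice,2,4,3
Bob,2,3,4
Charlie,1,0,4"""
df = pd.read_csv(StringIO(csv_data), index_col='name')

def find_matching_names(df, scores):
    # scores is a plain list; comparing it to df matches positionally, column by column
    mask = (df == scores).all(axis=1)
    return df[mask].index.tolist()

print(find_matching_names(df, [1, 0, 4]))  # ['Charlie']
print(find_matching_names(df, [9, 9, 9]))  # [] -> "not found"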
You can use DataFrame.loc, which would work like this:
print(df.loc[(df.match1 == 1) & (df.match2 == 0) & (df.match3 == 4), 'name'])
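If the number of match columns can vary, the same boolean mask can be built programmatically. This is a sketch of my own rather than part of the answer above; it assumes df is the question's dataframe with name as a regular column and that x lists the scores in the same order as match_cols:

import numpy as np

x = [1, 0, 4]
match_cols = ['match1', 'match2', 'match3']

# AND together one equality check per column
mask = np.logical_and.reduce([df[col] == val for col, val in zip(match_cols, x)])
print(df.loc[mask, 'name'])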
Maybe try something like this:
import pandas as pd
import numpy as np
# Makes sample data
match1 = np.array([2,2,1])
match2 = np.array([4,4,0])
match3 = np.array([3,3,4])
name = np.array(['Alice','Bob','Charlie'])
df = pd.DataFrame({'name': name, 'match1': match1, 'match2': match2, 'match3': match3})
df
# example of the list you want to get the data from
x=[1,0,4]
#x=[2,4,3]
# should return the name Charlie as well as the index (based on the values in the list x)
df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])]
# Makes a new dataframe out of the above
mydf = pd.DataFrame(df['name'].loc[(df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])])
# Loop that prints out the name(s) based on the index of mydf
# (if more than one name matches, it prints all of them; if only one matches, it prints just that one)
for i in range(len(mydf)):
    print(mydf['name'].iloc[i])
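If you just want the names without an explicit loop, a shorter variant of the same idea (a sketch reusing the df and x defined above) is:

mask = (df['match1'] == x[0]) & (df['match2'] == x[1]) & (df['match3'] == x[2])
print(df.loc[mask, 'name'].tolist())  # ['Charlie']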
You can use this. Here data is your DataFrame; change the name accordingly if yours is different. Considering [1,0,4] is of int type:
data = data[(data['match1'] == 1) & (data['match2'] == 0) & (data['match3'] == 4)].index
print(data[0])
If the data is of object (string) type, then use this:
data = data[(data['match1'] == "1") & (data['match2'] == "0") & (data['match3'] == "4")].index
print(data[0])
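If there are many match columns, writing each comparison by hand gets tedious. Here is a hedged sketch of the same idea that builds the filter from a list of values; it assumes data is the same DataFrame as above and that cols is ordered like values:

import pandas as pd

values = [1, 0, 4]
cols = ['match1', 'match2', 'match3']

# Start with an all-True mask and AND in one condition per column
mask = pd.Series(True, index=data.index)
for col, val in zip(cols, values):
    mask &= (data[col] == val)

print(list(data[mask].index))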
Related
Search for a string in the first 3 characters of a dataframe column
In this data frame, some of the notes in the NOTES column start with the word "PRE"; for those rows I want to set a new column to yes, otherwise no. I wrote the code below for this, but it is not working.

import pandas as pd

df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO",
                              "SCREEN_MAMO",
                              "> Unable to connect internet>4G Compatible>Set",
                              "No>Not Barred>Active>No>Available>Others>",
                              "Internet Not Working>>>Unable To Connect To"]})
df1['NOTES'].astype(str)

for i in df1['NOTES']:
    if i[:3] == 'PRE':
        df1['new'] = 'yes'
    else:
        df1['new'] = 'No'

df1
Set df1['new'] using a list comprehension and a ternary expression:

df1['new'] = ['yes' if i[:3] == 'PRE' else 'no' for i in df1['NOTES']]

When setting dataframe columns this way, you need to set them to whole lists, not individual elements. For a case-insensitive check:

df1['new'] = ['yes' if i[:3].upper() == 'PRE' else 'no' for i in df1['NOTES']]
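As an alternative to the list comprehension, the same check can be vectorized with pandas string methods; this is a sketch using Series.str.startswith and numpy.where rather than the ternary approach above:

import numpy as np

df1['new'] = np.where(df1['NOTES'].str.startswith('PRE'), 'yes', 'no')

# case-insensitive variant
df1['new'] = np.where(df1['NOTES'].str.upper().str.startswith('PRE'), 'yes', 'no')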
You can use a list to append the values and then add it as a dataframe column. Code -

import pandas as pd

df1 = pd.DataFrame({'NOTES': ["PREPAID_HOME_SCREEN_MAMO",
                              "SCREEN_MAMO",
                              "> Unable to connect internet>4G Compatible>Set",
                              "No>Not Barred>Active>No>Available>Others>",
                              "Internet Not Working>>>Unable To Connect To"]})
df1['NOTES'].astype(str)

data = []
for i in df1['NOTES']:
    if i[:3] == 'PRE':
        data.append('yes')
    else:
        data.append('no')

df1['new'] = data
The code that you posted updates all the values of the 'new' column with 'yes' or 'no' on every iteration, because df1['new'] = 'yes' assigns a single value to the whole column rather than to one row. Try the following:

import pandas as pd

df1 = pd.DataFrame({'NOTES': ...})
df1['NOTES'].astype(str)

new = ['*' for i in range(len(df1['NOTES']))]
for i in range(len(df1['NOTES'])):
    if df1['NOTES'][i][0:3] == "PRE":
        new[i] = 'Yes'
    else:
        new[i] = 'No'

df1['new'] = new
How to drop duplicates ignoring one column
I have a DataFrame with multiple columns, and the last column is a timestamp which I want Python to ignore. I've used drop_duplicates(subset=...) but it does not work, as it returns literally the same DataFrame. This is what the DataFrame looks like:

   id     name  features            timestamp
1  34233  Bob   athletics           04-06-2022
2  23423  John  mathematics         03-06-2022
3  34233  Bob   english_literature  06-06-2022
4  23423  John  mathematics         10-06-2022
...

And these are the data types when doing df.dtypes:

id            int64
name         object
features     object
timestamp    object

Lastly, this is the piece of code I used:

df.drop_duplicates(subset=df.columns.tolist().remove("timestamp"), keep="first").reset_index(drop=True)

The idea is to keep track of changes based on the timestamp IF there are changes to the other columns. For instance, I don't want to keep row 4 because nothing has changed for John; however, I want to keep both rows for Bob as his features changed from athletics to english_literature. Does that make sense?

EDIT: This is the full code:

"""
db_data contains 10 records
new_data contains 12 records but I know only 5 are needed based on the logic I want to implement
"""
db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_csv("new_data.csv")

# Checking columns match
# This prints "matching"
if db_data.columns == new_data.columns:
    print("matching")

df = pd.concat([db_data, new_data], axis=1)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)

# This outputs 22 but should have printed 15
print(len(df))

TEST: I've done a test, but it has puzzled me even more. I've created a separate table in the db, loaded the csv file new_data.csv into it, and then used read_sql to get it back into a DataFrame. Surprisingly, this works. However, I do not want to take this unnecessary extra step, and I am puzzled about why it works. I've checked the data types and they match.

db_data = pd.read_sql("SELECT * FROM subscribed", engine)
new_data = pd.read_sql("SELECT * FROM test", engine)

# Checking columns match
# This still prints "matching"
if db_data.columns == new_data.columns:
    print("matching")

df = pd.concat([db_data, new_data], axis=1)
consider = [x for x in df.columns if x != "timestamp"]
df = df.drop_duplicates(subset=consider).reset_index(drop=True)

# This is the right output... in other words, it worked.
print(len(df))
The remove method of a list returns None; that's why the returned dataframe is unchanged. You can do as follows:

Create the list of columns for the subset:

col_subset = df.columns.tolist()

Remove timestamp:

col_subset.remove('timestamp')

Use the col_subset list in the drop_duplicates() function:

df.drop_duplicates(subset=col_subset, keep="first").reset_index(drop=True)
Try this:

consider = [x for x in df.columns if x != "timestamp"]
df.drop_duplicates(subset=consider).reset_index(drop=True)

(You don't need tolist() or keep="first" here; keep="first" is already the default.)
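As a quick sanity check, the approach above can be run on the sample rows from the question (a minimal sketch; the id/name/features/timestamp values are copied from the example table):

import pandas as pd

df = pd.DataFrame({
    'id':        [34233, 23423, 34233, 23423],
    'name':      ['Bob', 'John', 'Bob', 'John'],
    'features':  ['athletics', 'mathematics', 'english_literature', 'mathematics'],
    'timestamp': ['04-06-2022', '03-06-2022', '06-06-2022', '10-06-2022'],
})

consider = [x for x in df.columns if x != "timestamp"]
out = df.drop_duplicates(subset=consider).reset_index(drop=True)
print(out)  # the second John/mathematics row is dropped; both Bob rows are kept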
If I understood you correctly, this code would do:

df.drop_duplicates(subset='features', keep='first').reset_index()
Python remove everything after specific string and loop through all rows in multiple columns in a dataframe
I have a file full of URL paths like the one below, spanning 4 columns in a dataframe that I am trying to clean:

Path1 = ["https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID\
=0x012000EDE8B08D50FC3741A5206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D"]

I want to remove everything after a specific string, which I defined as string1, and I would like to loop through all 4 columns of the dataframe defined as df_MasterData:

string1 = "&FolderCTID"

import pandas as pd

df_MasterData = pd.read_excel(FN_MasterData)
cols = ['Column_A', 'Column_B', 'Column_C', 'Column_D']

for i in cols:
    # Objective: delete string1 = "&FolderCTID" and everything after it
    # Method 1
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[0]
    # Method 2
    df_MasterData[i] = df_MasterData[i].str.split(string1).str[1].str.strip()
    # Method 3
    df_MasterData[i] = df_MasterData[i].str.split(string1)[:-1]

I searched Google and found similar solutions, but none of them work. Can any guru shed some light on this? Any assistance is appreciated.

Added below are a few example rows in columns A and B for these URLs:

Column_A = ['https://contentspace.global.xxx.com/teams/Australia/NSW/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FNSW%2FDocuments%2FIn%20Scope%2FA%20I%20TOPPER%20GROUP&FolderCTID=\
0x01200016BC4CE0C21A6645950C100F37A60ABD&View=%7B64F44840%2D04FE%2D4341%2D9FAC%2D902BB54E7F10%7D',\
'https://contentspace.global.xxx.com/teams/Australia/Victoria/Documents/Forms/AllItems.aspx?RootFolder\
=%2Fteams%2FAustralia%2FVictoria%2FDocuments%2FIn%20Scope&FolderCTID=0x0120006984C27BA03D394D9E2E95FB\
893593F9&View=%7B3276A351%2D18C1%2D4D32%2DADFF%2D54158B504FCC%7D']

Column_B = ['https://contentspace.global.xxx.com/teams/Australia/WA/Documents/Forms/AllItems.aspx?\
RootFolder=%2Fteams%2FAustralia%2FWA%2FDocuments%2FIn%20Scope&FolderCTID=0x012000EDE8B08D50FC3741A5\
206CD23377AB75&View=%7B287FFF9E%2DD60C%2D4401%2D9ECD%2DC402524F1D4A%7D',\
'https://contentspace.global.xxx.com/teams/Australia/QLD/Documents/Forms/AllItems.aspx?RootFolder=%\
2Fteams%2FAustralia%2FQLD%2FDocuments%2FIn%20Scope%2FAACO%20GROUP&FolderCTID=0x012000E689A6C1960E8\
648A90E6EC3BD899B1A&View=%7B6176AC45%2DC34C%2D4F7C%2D9027%2DDAEAD1391BFC%7D']
This is how I would do it: first declare a variable with your target columns, then use stack() and str.split to get your target output; finally, unstack and reassign the output to your original df.

cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']
string1 = "&FolderCTID"

df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)

If you want to replace these columns in your target df, then simply do:

df[cols_to_slice] = df[cols_to_slice].stack().str.split(string1, expand=True)[0].unstack(1)
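Another option, my own suggestion rather than part of the answer above, is a vectorized regex replace that cuts each cell at string1 and keeps everything before it; re.escape guards against any regex metacharacters in the marker string. It assumes df holds the four URL columns (df_MasterData in the question):

import re

string1 = "&FolderCTID"
cols_to_slice = ['Column_A', 'Column_B', 'Column_C', 'Column_D']

# Drop string1 and everything after it in each cell of the selected columns
df[cols_to_slice] = df[cols_to_slice].replace(re.escape(string1) + '.*', '', regex=True)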
You should first get the index of the string using:

indexes = len(string1) + df_MasterData[i].str.find(string1)  # this selects the final location of the string

# If you don't want string1 included in the result, use this one instead:
indexes = df_MasterData[i].str.find(string1)

Now do:

df_MasterData[i] = df_MasterData[i].str[:indexes]
Counting the repeated values in one column based on another column
Using pandas, I am dealing with the following CSV data:

f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win

For this part of the raw data, I was trying to return something like:

Column1_name -- t -- count of nowin = 0
Column1_name -- t -- count of won = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1

Based on this idea ("get dataframe row count based on conditions"), I was thinking of doing something like this:

print(df[df.target == 'won'].count())

However, this always returns the same number of "won" rows based on the last column, without taking into consideration whether the first column is an "f" or a "t". In other words, I was hoping to use something from pandas that would work like a "group by" in SQL, grouping based on, for example, the 1st and last columns. Should I keep pursuing this idea, or should I simply start using for loops?

If you need it, here is the rest of my code:

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url, names=[
    'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
    'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
    'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])

features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
            'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
            'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']

# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
#    item[:,0]

num_of_won = 0
num_of_nowin = 0
for item in df.target:
    if item == 'won':
        num_of_won = num_of_won + 1
    else:
        num_of_nowin = num_of_nowin + 1

print(num_of_won)
print(num_of_nowin)

print(df[df.target == 'won'].count())

#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -

outdf = df.apply(lambda x: pd.crosstab(index=df.target, columns=x).to_dict())

Basically, we go over each feature column and make a crosstab of it against the target column. Hope this helps! :)
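If you only need the first column against the target, a more direct sketch of the same crosstab idea (here 'bkblk' is the first feature column from the question's names list, and df is the question's dataframe) would be:

# Counts of each target outcome for each value ('t'/'f') of the first column
print(pd.crosstab(df['bkblk'], df['target']))

# The same counts expressed as a SQL-style group by
print(df.groupby(['bkblk', 'target']).size())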
pandas fillna is not working on subset of the dataset
I want to impute the missing values for df['box_office_revenue'] with the median specified by df['release_date'] == x and df['genre'] == y. Here is my median finder function below.

def find_median(df, year, genre, col_year, col_rev):
    median = df[(df[col_year] == year) & (df[col_rev].notnull()) & (df[genre] > 0)][col_rev].median()
    return median

The median function works. I checked. I did the code below since I was getting some CopyValue error.

pd.options.mode.chained_assignment = None  # default='warn'

I then go through the years and genres, with col_name = ['is_drama', 'is_horror', etc]:

i = df['release_year'].min()
while (i < df['release_year'].max()):
    for genre in col_name:
        median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
        df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)
    print(i)
    i += 1

However, nothing changed!

len(df['box_office_revenue'].isnull())

The output was 35527, meaning none of the null values in df['box_office_revenue'] had been filled. Where did I go wrong? Here is a quick look at the data: (the other columns are just binary variables)
You mentioned:

I did the code below since I was getting some CopyValue error...

The warning is important. You did not give your data, so I cannot actually check, but the problem is likely due to:

df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(..)

Let's break this down. First you select some rows with:

df[(df['release_year'] == i) & (df[genre] > 0)]

Then from that, you select a column with:

...['box_office_revenue']

And now you have a problem... Why? The problem is that when you selected some rows (i.e. not all), pandas was forced to create a copy of your dataframe. You then select a column of the copy! Then you fillna() on the copy. Not super useful.

How do I fix it? Select the column first:

df['box_office_revenue'][(df['release_year'] == i) & (df[genre] > 0)].fillna(..)

By selecting the entire column first, pandas is not forced to make a copy, and thus subsequent operations should work as desired.
This is not elegant, but I think it works. Basically, I calculate the means conditioned on genre and year, and then join the data to a dataframe containing the imputing values. Then, wherever the revenue data is null, I replace the null with the imputed value.

import pandas as pd
import numpy as np

# Fake data
rev = np.random.normal(size=10_000, loc=20)
rev_ix = np.random.choice(range(rev.size), size=100)
rev[rev_ix] = np.NaN

year = np.random.choice(range(1950, 2018), replace=True, size=10_000)
genre = np.random.choice(list('abc'), size=10_000, replace=True)

df = pd.DataFrame({'rev': rev, 'year': year, 'genre': genre})

imputing_vals = df.groupby(['year', 'genre']).mean()

s = df.set_index(['year', 'genre'])
s.rev.isnull().any()  # True

# Creates dataframe with new column containing the means
s = s.join(imputing_vals, rsuffix='_R')
s.loc[s.rev.isnull(), 'rev'] = s.loc[s.rev.isnull(), 'rev_R']

new_df = s['rev'].reset_index()
new_df.rev.isnull().any()  # False
This URL describing chained assignments seems useful for such a case: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#evaluation-order-matters

As seen in the above URL, instead of doing (in your for loop):

for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df[(df['release_year'] == i) & (df[genre] > 0)]['box_office_revenue'].fillna(median, inplace=True)

you can try:

for genre in col_name:
    median = find_median(df, i, genre, 'release_year', 'box_office_revenue')
    df.loc[(df['release_year'] == i) & (df[genre] > 0) & (df['box_office_revenue'].isnull()), 'box_office_revenue'] = median