I have a CSV file with a column of integers I'm reading, and I want to terminate the program if there is a duplicate value in that column, along with displaying the value that was found to be a duplicate. I am currently able to detect duplicates and terminate the program using:
for x in df.duplicated(['projectID']):  # projectID is the column header
    if x == True:
        sys.exit("ERROR: there is a duplicate projectID in the csv file. Terminating Program.")
but I want a way to tell the user which value is duplicated. This is where I'm stuck; I have no idea how to do so. I know there can be multiple duplicates, but I'm content with saying
sys.exit("ERROR: {0} is a duplicate projectID in the csv file. Terminating Program.".format(x))
for the first duplicate integer it finds. Any ideas how the code would look?
CSV would look something like:
projectName, projectID
Alpha,1
Beta,2
Gamma,3
Delta,1
so the value '1' is a duplicate which I would like to display to the user.
Here's a way to do that:
if df.projectID.duplicated().any():
    print("There are some duplicates:")
    print(f"The first duplicate value of 'projectID' is {df[df.projectID.duplicated()].projectID.iloc[0]}")
The output is:
There are some duplicates:
The first duplicate value of 'projectID' is 1
To explain the last line, here is the full expression:
df[df.projectID.duplicated()].projectID.iloc[0]
It's comprised of the following pieces:
Step 1: df.projectID.duplicated() - produces a Boolean Series marking which values are duplicates.
Step 2: df[<step-1>] - reduces the DataFrame to only the rows that are indeed duplicates.
Step 3: <step-2>.projectID - extracts the projectID Series from the reduced DataFrame.
Step 4: <step-3>.iloc[0] - takes the value at the first position of the duplicate projectID Series. This is the value you'd like to print.
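Putting the pieces together with the sys.exit call from the question, a minimal sketch might look like this (the projects.csv filename is just a placeholder for whatever file you read):

import sys
import pandas as pd

df = pd.read_csv("projects.csv", skipinitialspace=True)  # placeholder filename

duplicated_ids = df[df.projectID.duplicated()].projectID
if not duplicated_ids.empty:
    # Report only the first duplicated projectID, as the question requests
    sys.exit("ERROR: {0} is a duplicate projectID in the csv file. Terminating Program.".format(duplicated_ids.iloc[0]))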
I have a dataframe that might look like this:
print(df_selection_names)
name
0 fatty red meat, like prime rib
0 grilled
I have another dataframe, df_everything, with columns called name, suggestion and a lot of other columns. I want to find all the rows in df_everything with a name value matching the name values from df_selection_names so that I can print the values for each name and suggestion pair, e.g., "suggestion1 is suggested for name1", "suggestion2 is suggested for name2", etc.
I've tried several ways to get cell values from a dataframe and to search for values within a row, including
# number of items in df_selection_names = df_selection_names.shape[0]
# so, in other words, we are looping through all the items the user selected
for i in range(df_selection_names.shape[0]):
    # get the cell value in the 'name' column and row i using the at() accessor
    sel = df_selection_names.at[i, 'name']
    # this line finds the rows in df_everything whose 'name' equals sel
    row = df_everything[df_everything['name'] == sel]
but everything I tried gives me ValueErrors. This post leads me to think I may be
way off, but I'm feeling pretty confused about everything at this point!
You can use pandas.Series.isin (https://pandas.pydata.org/docs/reference/api/pandas.Series.isin.html?highlight=isin#pandas.Series.isin) to keep only the rows of df_everything whose name appears in df_selection_names:
df_everything[df_everything['name'].isin(df_selection_names["name"])]
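From there, printing the name/suggestion pairs is a loop over the filtered rows. A minimal sketch, using small made-up DataFrames with the column names from the question:

import pandas as pd

# Made-up stand-ins for the real data, just to illustrate the shape
df_selection_names = pd.DataFrame({"name": ["fatty red meat, like prime rib", "grilled"]})
df_everything = pd.DataFrame({
    "name": ["grilled", "steamed", "fatty red meat, like prime rib"],
    "suggestion": ["charcoal", "bamboo basket", "horseradish"],
})

matches = df_everything[df_everything["name"].isin(df_selection_names["name"])]
for _, row in matches.iterrows():
    print(f"{row['suggestion']} is suggested for {row['name']}")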
I just started working with pandas for a personal side project of mine. I've imported data from a CSV, cleaned it up, and now want to use data from the CSV in the rest of my code. I want to ask a user for input, and if the user input matches an entry inside the list (which is a column inside the data), I want to get data for that instance (from the same row but other columns in the df).
I can get it to compare using the "in" statement, but I don't think that will work when I expand the functionality.
How would I go about looping through the list from the df to return a value from that list if it exists and then be able to return other values in the same row of the df?
import pandas as pd
import re
import math

housing_Data = pd.read_csv("/Users/saads/Downloads/DP_LIVE_26072020053911478.csv")  # File to get data
housing_Data = housing_Data.drop(['INDICATOR', 'MEASURE', 'FREQUENCY', 'Flag Codes'], axis=1)  # Removing unwanted columns
print(housing_Data.columns)  # Just to test if working

user_Country = str(input("What Country are you in"))  # User input

def getCountry():  # Function to compare user input to elements inside the list
    if user_Country in housing_Data.LOCATION.unique():
        print(user_Country)
    else:
        print("Sorry we don't have information about that country")

getCountry()
Maybe what you want is all rows matching a location?
matching_rows = housing_Data[housing_Data.LOCATION == user_Country]
print(matching_rows)
Maybe? I'm not entirely sure.
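To then read other values from those rows, a small sketch (assuming columns such as 'TIME' and 'Value' are still present after the drop; substitute whatever print(housing_Data.columns) actually shows):

matching_rows = housing_Data[housing_Data.LOCATION == user_Country]
if matching_rows.empty:
    print("Sorry we don't have information about that country")
else:
    for _, row in matching_rows.iterrows():
        # 'TIME' and 'Value' are assumed column names; adjust to your CSV
        print(user_Country, row['TIME'], row['Value'])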
If you are in a hurry, you could also append the iterated values from the source to a new file (e.g. newfile.txt) and iterate over that more comfortably. Hopefully it helps.
I have a dataframe filled with twitter data. The columns are:
row_id : Int
content : String
mentions : [String]
value : Int
So for every tweet I have its row id in the dataframe, the content of the tweet, the mentions used in it (for example: '#foo') as an array of strings, and a value that I calculated based on the content of the tweet.
An example of a row would be:
row_id : 12
content : 'Game of Thrones was awful'
mentions : ['#hbo', '#tv', '#dissapointment', '#whatever']
value: -0.71
So what I need is a way to do the following 3 things:
find all rows that contain the mention '#foo' in the mentions-field
find all rows that ONLY contain the mention '#foo' in the mentions-field
the above two, but checking for an array of strings instead of only one handle
If anyone could help me with this, or even just point me in the right direction, that'd be great.
Let's call your DataFrame df.
For the first task you can use:
result = df[(pd.DataFrame(df['mentions'].tolist()) == '#foo').any(axis=1)]
Here, pd.DataFrame(df['mentions'].tolist()) creates a new DataFrame where each row is a tweet and each column holds one of its mentions.
Then == '#foo' generates a Boolean dataframe containing True where the mentions are '#foo'.
Finally .any(axis=1) returns a Boolean index whose elements are True if any element in the row is True.
I think with this help you can manage to solve the rest for yourself.
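As an illustration, here is a small sketch on toy data; the exact-match check for the second task is one possible approach, not part of the answer above:

import pandas as pd

# Toy rows in the shape described in the question
df = pd.DataFrame({
    "row_id": [12, 13],
    "content": ["Game of Thrones was awful", "loved it"],
    "mentions": [["#hbo", "#tv"], ["#foo"]],
    "value": [-0.71, 0.4],
})

# Task 1: rows whose mentions contain '#foo'
contains_foo = df[(pd.DataFrame(df["mentions"].tolist()) == "#foo").any(axis=1)]

# Task 2: rows whose mentions are exactly ['#foo']
only_foo = df[df["mentions"].apply(lambda m: m == ["#foo"])]

print(contains_foo)
print(only_foo)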
I am trying to use a for loop to assign one of two values to a column based on the value of another column. I created a list of the items that should get one value, using else to assign the other. However, my code is only assigning the else value to the column. I also tried elif and it did not work. Here is my code:
# create list of aggressive reasons
aggressive = ['AGGRESSIVE - ANIMAL', 'AGGRESSIVE - PEOPLE', 'BITES']

# create new column assigning 'Aggressive' or 'Not Aggressive'
for reason in top_dogs_reason['Reason']:
    if reason in aggressive:
        top_dogs_reason['Aggression'] = 'Aggressive'
    else:
        top_dogs_reason['Aggression'] = 'Not Aggressive'
My new column top_dogs_reason['Aggression'] only has the value of Not Aggressive. Can someone please tell me why?
You should be using .loc to assign things like this; it isolates the part of the dataframe you want to update. The first line assigns 'Aggressive' where the "Reason" column has a value contained in the list aggressive. The second line handles the rows where it does not.
top_dogs_reason.loc[top_dogs_reason['Reason'].isin(aggressive), 'Aggression'] = 'Aggressive'
top_dogs_reason.loc[~top_dogs_reason['Reason'].isin(aggressive), 'Aggression'] = 'Not Aggressive'
Or in one line, as Roganjosh explained, with np.where (this needs import numpy as np), which works much like an Excel IF statement: if the reason is in aggressive, give us "Aggressive", otherwise "Not Aggressive", and assign that to the "Aggression" column:
top_dogs_reason['Aggression'] = np.where(top_dogs_reason['Reason'].isin(aggressive), "Aggressive", "Not Aggressive")
Or anky_91's answer, which uses .map to map values. This is an effective way to feed a dictionary to a pandas Series: for each value in the Series it looks up the key in the dictionary and returns the corresponding value:
top_dogs_reason['Aggression'] = top_dogs_reason['Reason'].isin(aggressive).map({True: 'Aggressive', False: 'Not Aggressive'})
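For instance, a quick check of the np.where version on made-up rows (the data here is invented for illustration only):

import numpy as np
import pandas as pd

top_dogs_reason = pd.DataFrame({"Reason": ["BITES", "STRAY", "AGGRESSIVE - PEOPLE"]})
aggressive = ['AGGRESSIVE - ANIMAL', 'AGGRESSIVE - PEOPLE', 'BITES']

top_dogs_reason['Aggression'] = np.where(
    top_dogs_reason['Reason'].isin(aggressive), "Aggressive", "Not Aggressive"
)
print(top_dogs_reason)
# BITES and AGGRESSIVE - PEOPLE come out Aggressive, STRAY comes out Not Aggressive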
We know that sheet.max_column gives the maximum column number of the whole sheet. But my question is how to find the maximum column number in a particular row?
Suppose there is a saved Excel file which is not empty and I want to know 3rd row's max column number of that Excel file. So how can I find that out?
I had the same issue and solved it with a simple for loop and an if statement. I check whether each cell is empty, and if it is I assign a value. 'A' is the column; you can change it as you wish. Here is what I do:
import openpyxl

workbook = openpyxl.load_workbook('RankingInfo.xlsx')
sheet = workbook.get_sheet_by_name('Sheet1')

for i in range(1, 20000):
    if sheet['A' + str(i)].value is None:
        # isbn comes from the answerer's own context; every empty cell in column A gets this value
        sheet['A' + str(i)] = isbn

workbook.save('RankingInfo.xlsx')
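To get back to the original question (the last used column of, say, row 3), a minimal sketch could walk that row's cells and remember the last non-empty one; note that in recent openpyxl versions cell.column is the 1-based column index as an integer:

import openpyxl

workbook = openpyxl.load_workbook('RankingInfo.xlsx')  # filename reused from the answer above
sheet = workbook['Sheet1']

row_number = 3
max_col_in_row = 0
for cell in sheet[row_number]:          # sheet[3] yields the cells of row 3
    if cell.value is not None:
        max_col_in_row = cell.column    # column index of the last non-empty cell seen so far
print(f"Row {row_number} has data up to column {max_col_in_row}")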