Pandas Dataframe showing duplicates by multiple columns - python

I have a large DF with 3m rows and 16 columns. I am trying to find duplicates based on certain columns only, so I want to subset the data to rows that have exactly the same values in 6 of the columns. I want to keep all rows that are part of a duplicate group.
pp19952017[pp19952017.duplicated(subset=['Postcode', 'Property Type','Street','Town/City', 'District', 'County'],keep=False)]
Edit:
Here is an example of what is most likely a single property, but which won't show up when I check for duplicates because not every column is the same and a few cells differ. I want a list of duplicates so I can see how the same properties have increased in price.
15, ARMINGER ROAD, LONDON, HAMMERSMITH AND FULHAM, W12 7BA, GREATER LONDON
and
15, ARMINGER ROAD, LONDON, LONDON, HAMMERSMITH AND FULHAM, W12 7BA, GREATER LONDON
Unfortunately, this gives me nearly every line. I've checked manually and there aren't that many duplicates, so I'm a bit stuck as to how to find them.
As this data runs from 1995 to the present day, the way it was recorded has changed over time, so I can only attempt to use this subset of columns to find the duplicates.
Thanks in advance for any help.
Solution:
I think I found a way to do it: I concatenated the relevant columns that contain the repeating data and used that new concatenated column as the check for duplication. It is a workaround, but it does what I want.
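A minimal sketch of that workaround, reusing the column names from the question and assuming pp19952017 is the DataFrame already loaded (the name of the concatenated key column, dup_key, is made up here):
import pandas as pd
cols = ['Postcode', 'Property Type', 'Street', 'Town/City', 'District', 'County']
# Build a single key out of the relevant columns
pp19952017['dup_key'] = pp19952017[cols].astype(str).agg('|'.join, axis=1)
# Keep every row whose key appears more than once
dupes = pp19952017[pp19952017.duplicated(subset='dup_key', keep=False)]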

The duplicated method is working as intended; the problem is with how it is used. You are passing keep=False, which marks every record in a duplicate group as a duplicate, so all of those records are returned.
e.g. with (Adam, 30, NY), (Adam, 35, NY), (Adam, 30, MA), finding duplicates based on name and state with keep=False will return 2 rows, because both matching rows are marked as duplicates.
If you pass keep='first' or keep='last', every duplicate except the first (or last) occurrence is marked, so only 1 row is returned.
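To illustrate with the toy data from the example above:
import pandas as pd
df = pd.DataFrame({'name': ['Adam', 'Adam', 'Adam'],
                   'age': [30, 35, 30],
                   'state': ['NY', 'NY', 'MA']})
# keep=False flags every member of a duplicate group -> returns 2 rows
print(df[df.duplicated(subset=['name', 'state'], keep=False)])
# keep='first' flags only the later occurrences -> returns 1 row
print(df[df.duplicated(subset=['name', 'state'], keep='first')])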

Related

How to parse batches of flagged rows and keep the row satisfying some conditions in a Pandas dataframe?

I have a dataframe containing duplicates, that are flagged by a specific variable. The df looks like this:
[screenshot of the dataframe]
The idea is that the rows to keep and their duplicates are stacked in batches (a pair, or more if there are many duplicates) and identified by the "duplicate" column. I would like, for each batch, to keep the row satisfying one condition: it has to be the row with the smallest number of empty cells. For Alice, for instance, it should be the second row (and not the one flagged "keep").
The difficulty also lies in the fact that I cannot group by the "name", "lastname" or "phone" column, because they are not always filled (the duplicates are computed on these 3 concatenated columns by an ML algorithm).
Unlike already posted questions I've seen (how do I remove rows with duplicate values of columns in pandas data frame?), here the condition for selecting the row to keep is not fixed (like keeping the first or last row within the batch of duplicates) but depends on how complete the rows are in each batch.
How can I parse the dataframe according to this "duplicate" column and, within each batch, extract the row I want?
I tried to assign a unique label to each batch, in order to iterate over these labels, but it fails.
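A hedged sketch of the batch-label idea mentioned above, assuming the batches are stored consecutively, the flag column is literally named "duplicate", and the first row of each batch carries a "keep" flag (all of these are guesses from the description):
import pandas as pd
df = pd.DataFrame({
    'name': ['Alice', None, 'Bob', 'Bob'],
    'lastname': [None, 'Smith', None, 'Jones'],
    'phone': [None, '0123', '0456', '0789'],
    'duplicate': ['keep', 'dup', 'keep', 'dup'],
})
# A new batch starts at every "keep" flag, so a running count labels the batches
df['batch'] = (df['duplicate'] == 'keep').cumsum()
# Within each batch, keep the row with the fewest empty cells
df['n_empty'] = df[['name', 'lastname', 'phone']].isna().sum(axis=1)
best = df.loc[df.groupby('batch')['n_empty'].idxmin()]
print(best.drop(columns=['batch', 'n_empty']))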

How to check dataframe column that contains value lower than 2500

I want to check how many values are lower than 2500
1) Using .count():
df[df.price<2500]["price"].count()
2) Using .value_counts():
df[df.price<2500]["price"].value_counts()
The first one gives 27540 and the second 2050. Which one is the correct count?
Definitely not 2050; analyze your histogram.
The value_counts method assigns only one row to each distinct value, together with the number of times it occurs. So it looks as if there are 2050 different prices, but once you count the duplicates there are many more rows below 2500.
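A quick illustration of the difference on toy data; the boolean-sum form at the end is simply another way to get the same count:
import pandas as pd
df = pd.DataFrame({'price': [1000, 1000, 1500, 2000, 3000]})
below = df[df.price < 2500]['price']
print(below.count())            # 4 -> number of rows with price below 2500
print(below.value_counts())     # 3 entries -> number of distinct prices below 2500
print((df.price < 2500).sum())  # 4 -> same answer as .count()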

Pandas deleting partly duplicate rows with wrong values in specific columns

I have a large dataframe from a CSV file which has a few dozen columns. I concatenated another CSV file to the original. The second file has exactly the same structure, but a particular column may have incorrect values. I want to delete the rows which are duplicates except for this one wrong column. For example, in the data below the last row should be removed. (The names of the specimens (Albert, etc.) are unique.) I have been struggling to find a way of deleting only the rows with the wrong value, without risking deleting the correct rows.
0 Albert alive
1 Newton alive
2 Galileo alive
3 Copernicus dead
4 Galileo dead
...
Any help would be greatly appreciated!
You could use this to determine whether a name is mentioned more than once:
df['RN'] = df.groupby(['Name']).cumcount() + 1
You can also expand it to include more columns in the groupby if there are further restrictions you want to put on the duplicates:
df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1
The advantage I like with this approach is that it gives you more control over the RN selection if you need it, e.g. df.loc[df['RN'] > 2].
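A minimal sketch of how that row number can be applied to the specimen data above (the column names 'Name' and 'Status' are guesses):
import pandas as pd
df = pd.DataFrame({
    'Name': ['Albert', 'Newton', 'Galileo', 'Copernicus', 'Galileo'],
    'Status': ['alive', 'alive', 'alive', 'dead', 'dead'],
})
# Number each occurrence of a name; rows appended from the second file
# show up as the second (or later) occurrence
df['RN'] = df.groupby('Name').cumcount() + 1
# Keep only the first occurrence of each name
cleaned = df.loc[df['RN'] == 1].drop(columns='RN')
print(cleaned)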

How to hot encode a dataframe column with multiple strings?

I am currently working on building a regressor model to predict the food delivery time.
This is the dataframe with a few observations.
If you observe, the Cuisines column contains many strings. I used the code
pd.get_dummies(data.Cuisines.str.split(',',expand=True),prefix='c')
This helped me split the strings and hot encode them; however, there is a new issue to deal with.
I merged the dataframe and the dummies. fastfood appears in the 1st and 3rd rows. The expected output was a single fastfood column with value 1 in the first and third rows; however, two fastfood columns are created: one (the 4th column) for the first row and another (the 15th column) for the third row.
Can someone help me get a single fastfood column with value 1 in the first and third rows, and similarly for the other cuisines?
The two Fast Food columns differ by a trailing space. You probably want to try:
data.Cuisines.str.get_dummies(',\s*')
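If you prefer not to rely on the separator pattern, an alternative is to normalise the separators first and then one-hot encode on a plain comma; a sketch on made-up data:
import pandas as pd
data = pd.DataFrame({'Cuisines': ['Fast Food, Chinese', 'Pizza', 'Chinese,Fast Food']})
# Strip the stray spaces around commas so identical cuisines don't end up
# in different dummy columns, then encode
dummies = (data['Cuisines']
           .str.replace(r'\s*,\s*', ',', regex=True)
           .str.get_dummies(','))
print(dummies)  # one 'Fast Food' column with 1s in rows 0 and 2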

How to calculate based on multiple conditions using Python data frames?

I have an Excel data file with thousands of rows and columns.
I am using Python and have started using pandas dataframes to analyze the data.
What I want to do in column D is calculate the annual change of the values in column C, for each year and each ID.
I can do this in Excel: if the org ID is the same as that in the prior row, calculate the annual change (leaving the cells highlighted in blue empty, because that's the first period for that particular ID). I don't know how to do this using Python. Can anyone help?
Assuming the dataframe is already sorted
df.groupby('ID').Cash.pct_change()
However, you can speed things up if you assume things are sorted, because it isn't necessary to group in order to calculate the percentage change from one row to the next:
df.Cash.pct_change().mask(
    df.ID != df.ID.shift()
)
These should produce the column values you are looking for. In order to add the column, you’ll need to assign to a column or create a new dataframe with the new column
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
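A small worked example on made-up data, using the 'ID' and 'Cash' column names from above:
import pandas as pd
df = pd.DataFrame({
    'ID': [1, 1, 1, 2, 2],
    'Year': [2018, 2019, 2020, 2019, 2020],
    'Cash': [100.0, 110.0, 99.0, 50.0, 75.0],
})
# Annual change per ID; the first period of each ID stays NaN
df['AnnChange'] = df.groupby('ID').Cash.pct_change()
print(df)
# ID 1: NaN, 0.10, -0.10   ID 2: NaN, 0.50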
