Pandas deleting partly duplicate rows with wrong values in specific columns

Pandas deleting partly duplicate rows with wrong values in specific columns - python

I have a large dataframe from a csv file which has a few dozen columns. I have another csv file which I concatenated to the original. Now, the second file has exactly the same structure but a particular column may have incorrect values. I want to delete rows which are duplicates that have this one wrong column. For example in the below the last row should be removed. (The names of the specimens (Albert, etc.) are unique). I have been struggling to find a way of deleting only the data which has the wrong value, without risking deleting the correct row.
0 Albert alive
1 Newton alive
2 Galileo alive
3 Copernicus dead
4 Galileo dead
...
Any help would be greatly appreciated!

You could use this to determine if a name is mentioned more than 1 time
df['RN'] = df.groupby(['Name']).cumcount() + 1
You can also expand it out to have more columns in the "groupby" to see if there are any more limitations you want to put on the duplicates
df['RN'] = df.groupby(['Name', 'Another Column']).cumcount() + 1
The advantage I like with this is it gives you more control over the RN selection if you needed to df.loc[df['RN'] > 2].

Related

I need help concatenating 1 csv file and 1 pandas dataframe together without duplicates

My code currently looks like this:
df1 = pd.DataFrame(statsTableList)
df2 = pd.read_csv('StatTracker.csv')
result = pd.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
I get an error and I'm not sure why.
The goal of my program is to pull data from an API, and then write it all to a file for analyzing. df1 is the lets say the first 100 games written to the csv file as the first version. df2 is me reading back those first 100 games the second time around and comparing it to that of df1 (new data, next 100 games) to check for duplicates and delete them.
The part that is not working is the drop duplicates part. It gives me an error of unhashable list, I would assume that's because its two dataframes that are lists of dictionaries. The goal is to pull 100 games of data, and then pull the next 50, but if I pull number 100 again, to drop that one, and just add 101-150 and then add it all to my csv file. Then if I run it again, to pull 150-200, but drop 150 if its a duplicate, etc etc..

Based from your explanation, you can use this one liner to find unique values in df1:
df_diff = df1[~df1.apply(tuple,1)\
.isin(df2.apply(tuple,1))]
This code checks if the rows is exists in another dataframe. To do the comparision it converts each row to tuple (apply tuple conversion along 1 (row) axis).
This solution is indeed slow because its compares each row inside df1 to all rows in df2. So it has time complexity n^2.
If you want more optimised version, try to use pandas built in compare method
df1.compare(df2)

Find which column has unique values that can help distinguish the rows with Pandas

I have the following dataframe, which contains 2 rows:
index name food color number year hobby music
0 Lorenzo pasta blue 5 1995 art jazz
1 Lorenzo pasta blue 3 1995 art jazz
I want to write a code that will be able to tell me which column is the one that can distinguish between the these two rows.
For example , in this dataframe, the column "number" is the one that distinguish between the two rows.
Unti now I have done this very simply by just go over column after column using iloc and see the values.
duplicates.iloc[:,3]
>>>
0 blue
1 blue
It's important to take into account that:
This should be for loop, each time I check it on new generated dataframe.
There may be nore than 2 rows which I need to check
There may be more than 1 column that can distinguish between the rows.
I thought that the way to check such a thing will be something like take each time one column, get the unique values and check if they are equal to each other ,similarly to this:
for n in np.arange(0,len(df.columns)):
tmp=df.iloc[:,n]
and then I thought to compare if all the values are similar to each other on the temporal dataframe, but here I got stuck because sometimes I have many rows and also I need.
My end goal: to be able to check inside for loop to identify the column that has different values in each row of the temporaldtaframe, hence can help to distinguish between the rows.

You can apply the duplicated method on all columns:
s = df.apply(pd.Series.duplicated).any()
s[~s].index
Output: ['number']

Splitting column of a really big dataframe in two (or more) new cols

Problem
Hey there! I'm having some trouble trying to split one column of my dataframe in two (or even more) new columns. I think this depends on the fact that the dataframe I'm working with comes from a really big csv file, almost 10gb worth of space. Once it is loaded into a Pandas dataframe, this is represented by ~60mil of rows and 5 cols.
Example
Initially, the dataframes looks something like this:
In [1]: df
Out[1]:
category other_col
0 animal.cat 5
1 animal.dog 3
2 clothes.shirt.sports 6
3 shoes.laces 1
4 None 0
I want to first remove the rows of the df for which the category is not defined (i.e., the last one), and then split the category column in three new columns based on where the dot appears: one for the main category, one for the first subcategory and another one for the last subcategory (if that actually exists). Finally, I want to merge the whole dataframe back together.
In other words, this is what I want to obtain:
In [2]: df_after
Out[2]:
other_col main_cat sub_category_1 sub_category_2
0 5 animal cat None
1 3 animal dog None
2 6 clothes shirt sports
3 1 shoes laces None
My approach
My approach for this was the following:
df = df[df['category'].notnull()]
df_wt_cat = df.drop(columns=['category'])
df_cat_subcat = df['category'].str.split('.', expand=True).rename(columns={0: 'main_cat', 1: 'sub_category_1', 2: 'sub_category_2', 3: 'sub_category_3'})
df_after = pd.concat([df_wt_cat, df_cat_subcat], axis=1)
which seems to work just fine with small datasets, but it sucks up too much memory when this is applied on a dataframe that big and the Jupyter kernel just dies.
I've tried to read the dataframe in chunks, but I'm not really sure how should I proceed after that; I've obviously tried searching this kind of problem here on stack overflow, but I didn't manage to find anything useful.
Any help is appreciated!

split and join methods do the job:
results = df['category'].str.split(".", expand = True))
df_after = df.join(results)
after doing that you can freely filter resulting dataframe

sqlite, filter rows with dynamic number of keys, but only if they have the same value in a specific column?

I am brand new to sqlite (and databases in general). I have done a ton of reading both here and elsewhere and am unable to find this specific problem. People tend to want counts, or duplicates. I need to filter.
I have a database with 3 columns (and a few hundred thousand entries)
column1 column2 column3
abc 123 ##$
egf 456 $%#
abc 321 !##
kop 123 &$%
pok 321 ^$#
and so on.
What I am trying to do is this. I need to retrieve all possible combinations of a list. For example
[123, 321]
all possible combos would be
[123],[321],[123,321]
I do not know what input can possibly be, it can be more than 2 strings, and so the combinations list can grow pretty fast. For single entries above, like 123, 321, it works out of the gate, the thing I am trying to get to work is with more than 1 value in a list.
So I am dynamically generating the select statement
sqlquery = "SELECT fileloc, frequency FROM words WHERE word=?"
while numOfVariables < len(list):
sqlquery += " or word=?"
numOfVariables += 1
This generates the query, then I execute it with
cursor.execute(sqlquery,tuple(list))
Which works. It finds me all rows with any of those combinations.
Now I need one more thing, I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
So in the above example it would select rows 1 and 3 because their column2 has the values I am interested in, and their column1 is the same. But column 4 would not be selected even though it has value we want. Because it's column1 does not match 321's column1. Same thing for row 5, again even though its one of the values we need, it's column1 doesnt match 123's.
From things Ive been able to find, people compare against specific value by using GROUP BY. But in my case I do not know what that value may be. All I care about is if its the same between the rows or not.
I am sorry if my explanation is not clear. I have never used mysql before this week so I dont know all the technical terms.
But basically I need the functionality of (pseudo code):
if (column2 is 123 or 321) and 123.column1 == 321.column1:
count
else:
dont count
I have a feeling this can be done by first moving whatever matches 123 or 321 into a new table. Then going through that table and only keeping records that have both 123 and 321 with the same column1 value. But I am not sure how to do this or if its the proper approach? Because this thing is going to scale pretty quick, if there are 5 inputs, then the rows that are kept is if there is one row to account for each input and all of their column1 is the same. (So rows would be saved in sets of 5).
Thank you.
(I am using Python 2.7.15)

You wrote:
"I need to retrieve all possible combinations of a list"
"Now I need one more thing, I need it to ONLY select them if their column1 is the same (I do not know what this value may be).
Use self-join for this purpose:
SELECT W1.column2, W2.column2
FROM words W1
JOIN words W2 ON W1.column1 = W2.column1
Correct me if I miss something in your question but this 3 lines must be sufficient.
Python looks as irrelevant for your question. It could be solved in pure SQL

How to combine multiple rows of data into a single sting per group

To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings to a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The amount of rows used for 'description' isn't consistent. Sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifted up, and then iterating in some way. I haven't found a solution that matches what I'm trying to do, though. This is where I'm at so far:
#Add column f description stuff and shift up a row for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did. If not, just use the value in 'Y'. Carry that logic across but stopping if it encountered an NA in any subsequent columns. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.

'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')
df.fillna(method = 'ffill').groupby([
'name',
'date'
]).description.apply(lambda x : ', '.join(x)).to_frame(name = 'description')

I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.