Keep unique values with only 1 instance - python

I have the following dataset:
Col_A Amounts
0 A 100
1 B 200
2 C 500
3 D 100
4 E 500
5 F 300
The output I am trying to achieve is to remove every row whose "Amounts" value is duplicated, and to keep only the rows whose value occurs exactly once.
Desired Output:
Col_A Amounts
1 B 200
5 F 300
I have tried to use the following with no luck:
df_1.drop_duplicates(subset=['Amounts'])
This removes the duplicates; however, it still keeps one row for each value that occurs more than once.
Using the pandas .unique function also produces a similarly undesired output.

You are close; you need keep=False to remove all duplicates in the Amounts column:
print (df.drop_duplicates(subset=['Amounts'], keep=False))
Col_A Amounts
1 B 200
5 F 300
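
For completeness, a runnable sketch of that one-liner (the frame below is reconstructed from the table in the question):
import pandas as pd

# Reconstructed from the question's table.
df = pd.DataFrame({
    'Col_A': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Amounts': [100, 200, 500, 100, 500, 300],
})

# keep=False drops every row whose Amounts value occurs more than once.
print(df.drop_duplicates(subset=['Amounts'], keep=False))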

Less straightforward than the previous answer, but if you want to be able to keep the rows whose value appears n times, you can use value_counts() as a mask and keep only the rows whose value appears exactly / at least / fewer than n times:
import pandas as pd

data = {'Col_1': ['A', 'B', 'C', 'D', 'E', 'F'],
        'Amounts': [100, 200, 500, 100, 500, 300]}
df = pd.DataFrame(data)

n = 1
# Keep only the rows whose Amounts value occurs at most n times.
mask = df.Amounts.value_counts()
df[df.Amounts.isin(mask.index[mask.lt(n + 1)])]
outputs:
Col_1 Amounts
1 B 200
5 F 300
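
For the other cases mentioned above (exactly / at least / fewer than n times), only the comparison on the mask needs to change; a small sketch reusing the same idea:
counts = df.Amounts.value_counts()

df[df.Amounts.isin(counts.index[counts.eq(n)])]  # value appears exactly n times
df[df.Amounts.isin(counts.index[counts.ge(n)])]  # value appears at least n times
df[df.Amounts.isin(counts.index[counts.lt(n)])]  # value appears fewer than n times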

Pandas grouping with filtering on other columns

I have the following dataframe in Pandas:
name  value  in  out
A        50   1    0
A       -20   0    1
B       150   1    0
C        10   1    0
D       500   1    0
D      -250   0    1
E       800   1    0
There are at most 2 observations for each name: one for in and one for out.
If there is only an in record for a name, there is just one observation for it.
You can create this dataset with this code:
data = {
    'name': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
    'values': [50, -20, 150, 10, 500, -250, 800],
    'in': [1, 0, 1, 1, 1, 0, 1],
    'out': [0, 1, 0, 0, 0, 1, 0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the value column for each name, but only if the name has both an in and an out record. In other words, only when a unique name has exactly 2 rows.
The result should look like this:
name  value
A        30
D       250
If I run the following code, I get all the results without the filtering based on in and out.
df.groupby('name').sum()
name  value
A        30
B       150
C        10
D       250
E       800
How can I add the aforementioned filtering based on these columns?
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also use .query('name==2') above if you like, but since a name can occur at most twice, .query('name>1') returns the same result.
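
If aggregating the grouping column itself feels awkward (or warns in your pandas version), an alternative sketch using named aggregation counts the values column instead:
out = (
    df.groupby('name')
      .agg(n=('values', 'count'), values=('values', 'sum'))
      .query('n == 2')[['values']]
)
print(out)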
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
   .loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250

Why are 2 different lengths shown when using value_counts() and shape[0]?

I am trying to find out how many records there are, and I thought there were 2 ways to show the total number of records. However, they show different lengths. Why is this happening?
I listed both ways below; to elaborate further, one line uses the .shape[0] attribute while the other uses .value_counts().
df.loc[(df['rental_store_city'] == 'Woodridge') & (df['film_rental_duration'] > 5)].shape[0]
output: 3186
df.loc[(df['rental_store_city'] == 'Woodridge') & (df['film_rental_duration'] > 5)].value_counts()
output: (image) a Series of length 3153
It's because value_counts groups duplicate rows together and counts them; the extra duplicates are collapsed, which makes the result shorter.
As you can see in the documentation:
Return a Series containing counts of unique rows in the DataFrame.
Example:
>>> df = pd.DataFrame({'a': [1, 2, 1, 3]})
>>> df
a
0 1
1 2
2 1
3 3
>>> df.value_counts()
a
1 2
3 1
2 1
dtype: int64
>>>
As you can see, the duplicates made the resulting Series shorter.
If you want to get the length of the dataframe, don't use value_counts; use len:
len(df.loc[(df['rental_store_city'] == 'Woodridge') & (df['film_rental_duration'] > 5)])
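
To make the relationship between the two numbers concrete, a small sketch (column names and figures taken from the question; note that DataFrame.value_counts drops rows containing NaN by default):
filtered = df.loc[(df['rental_store_city'] == 'Woodridge')
                  & (df['film_rental_duration'] > 5)]

filtered.shape[0]              # total number of rows, e.g. 3186
len(filtered)                  # same as shape[0]
filtered.value_counts().size   # number of *distinct* rows, e.g. 3153
filtered.value_counts().sum()  # sums the counts back up to the (non-NaN) row total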

Using Numpy to filter two dataframes

I have two data frames. They are structured like this:
df a
Letter  ID
A        3
B        4

df b
Letter  ID  Value
A        3    100
B        4    300
B        4    100
B        4    150
A        3    200
A        3    400
For each combination of Letter and ID in df a, I need to take the corresponding Value entries from df b and run an outlier function on them.
Currently df a has over 40,000 rows and df b has around 4,500,000 rows.
a['Results'] = a.apply(
    lambda x: outliers(b[(b['Letter'] == x['Letter']) & (b['ID'] == x['ID'])]['Value'].to_list()),
    axis=1
)
As you can imagine, this is taking forever. Is there some mistake I'm making, or something that can improve this code?
I'd first aggregate every combination of [Letter, ID] in df_b into a list using .groupby, then merge with df_a and apply your outliers function afterwards. Should be faster:
df_a["results"] = df_a.merge(
df_b.groupby(["Letter", "ID"])["Value"].agg(list),
left_on=["Letter", "ID"],
right_index=True,
how="left",
)["Value"].apply(outliers)
print(df_a)
You can also first merge the datasets a and b, then group by Letter and ID and aggregate Value with the outlier function:
pd.merge(a, b, how="inner", on=['Letter', 'ID']).groupby(['Letter', 'ID'])['Value'].agg(outliers).reset_index()
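
If outliers expects a plain list (as in the question's code), a slightly fuller sketch of the same idea might look like this (column names taken from the question):
result = (
    pd.merge(a, b, how="inner", on=["Letter", "ID"])
      .groupby(["Letter", "ID"])["Value"]
      .agg(lambda s: outliers(s.to_list()))
      .reset_index(name="Results")
)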

How to divide a column by the number of rows with equal id in a dataframe?

I have a DataFrame that looks like this:
Id  Price
1     300
1     300
1     300
2     400
2     400
3     100
My goal is to divide the price for each observation by the number of rows with the same Id number. The expected output would be:
Id  Price
1     100
1     100
1     100
2     200
2     200
3     100
However, I am having some issues finding the most efficient way to conduct this operation. I did manage to do it using the code below, but it takes more than 5 minutes to run (as I have roughly 200k observations):
# For each row in the dataset, get the number of rows with the same Id and store them in a list
sum_of_each_id = []
for i in df['Id'].to_numpy():
    sum_of_each_id.append(len(df[df['Id'] == i]))

# Creating an auxiliary column in the dataframe, with the number of rows associated to each Id
df['auxiliar'] = sum_of_each_id

# Dividing the price by the number of rows with the same Id
df['Price'] = df['Price'] / df['auxiliar']
Could you please let me know what would be the best way to do this?
Try groupby with transform:
Make groups on the basis of Id using groupby('Id').
Get the count of values in each group for every row using transform('count').
Divide df['Price'] by that series of counts.
df = pd.DataFrame({"Id":[1,1,1,2,2,3],"Price":[300,300,300,400,400,100]})
df["new_Price"] = (df["Price"]/df.groupby("Id")["Price"].transform("count")).astype('int')
print(df)
Id Price new_Price
0 1 300 100
1 1 300 100
2 1 300 100
3 2 400 200
4 2 400 200
5 3 100 100
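
One small caveat, assuming Price could contain missing values: 'count' ignores NaNs within a group, while 'size' counts every row. A one-line sketch if you want the latter:
# 'size' counts all rows in each Id group, including rows where Price is NaN.
df["new_Price"] = df["Price"] / df.groupby("Id")["Price"].transform("size")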
import pandas as pd
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "price": [300, 300, 300, 400, 400, 100]})
df.set_index("id") / df.groupby("id").count()
Explanation:
df.groupby("id").count() calculates the number of rows with the same id. The resulting DataFrame will have id as its index.
df.set_index("id") sets the id column as the index.
Then we simply divide the two frames, and pandas matches the numbers by the index.
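
If you want the result back as ordinary columns rather than an id-indexed frame, you can reset the index afterwards, e.g.:
# Divide, then turn the id index back into a column (a small usage sketch of the approach above).
out = (df.set_index("id") / df.groupby("id").count()).reset_index()
print(out)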

How unique is each row based on 3-4 columns?

I am going to merge two datasets soon by 3 columns.
The hope is that there are no (or few) repeats of the 3-column groups in the original dataset. I would like to produce something that says approximately how unique each row is, like maybe some kind of frequency plot (which might not work as I have a very large dataset), or a table that displays the average frequency for each 0.5 million rows, or something like that.
Is there a way to determine how unique each row is compared to the other rows?
1 2 3
A 100 B
A 200 B
A 200 B
For the above data frame, I would like to say that each row is unique.
1 2 3
A 200 B
A 200 B
A 100 B
For this data set, rows 1 and 2 are not unique. I don't want to drop one, but I am hoping to quantify/weight the number of non-unique rows.
The problem is that my dataframe is 14,000,000 lines long, so I need a way to show how unique each row is on a set this big.
Assuming you are using pandas, here's one possible way:
import pandas as pd
# Setup, which you can probably skip since you already have the data.
cols = ["1", "2", "3"]
rows = [
    ["A", 200, "B"],
    ["A", 200, "B"],
    ["A", 100, "B"],
]
df1 = pd.DataFrame(rows, columns=cols)
# Get focus column values before adding a new column.
key_columns = df1.columns.values.tolist()
# Add a line column
df1["line"] = 1
# Set new column to cumulative sum of line values.
df1["match_count"] = df1.groupby(key_columns )['line'].apply(lambda x: x.cumsum())
# Drop line column.
df1.drop("line", axis=1, inplace=True)
Print results
print(df1)
Output -
1 2 3 match_count
0 A 200 B 1
1 A 200 B 2
2 A 100 B 1
Return only unique rows:
# We only want results where the count is less than 2,
# because we have our key columns saved, we can just return those
# and not worry about 'match_count'
df_unique = df1.loc[df1["match_count"] < 2, key_columns]
print(df_unique)
Output -
1 2 3
0 A 200 B
2 A 100 B
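
Separately from the running count above, if what you want is the total number of occurrences of each 3-column combination attached to every row (closer to a per-row 'how unique is this' weight), a groupby-size sketch using the same column names:
# Count each (1, 2, 3) combination once, then broadcast the count back onto every row.
freq = df1.groupby(key_columns).size().reset_index(name="freq")
df1 = df1.merge(freq, on=key_columns, how="left")
# freq == 1 marks rows whose 3-column combination appears only once in the whole frame.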
