Assuming a df as follows:
Product Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
2 5
2 6
2 7
3 1
3 2
3 3
4 1
4 2
4 3
I would like to keep only those Products that appear more than 3 times and drop the others.
In the above example, after I do
df.groupby(['Product']).size()
I get the following output:
1 4
2 7
3 3
4 3
and based on this, from my main df, I would like to retain only Products 1 & 2.
Expected output:
Product Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
2 5
2 6
2 7
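For reference, the sample frame can be rebuilt with this minimal sketch (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'Product': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    'Time':    [1, 2, 3, 4, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 1, 2, 3],
})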
Use GroupBy.transform to return a Series the same size as the original, which makes filtering by boolean indexing possible:
df = df[df.groupby(['Product'])['Product'].transform('size') > 3]
print (df)
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
Details:
a = df.groupby(['Product'])['Product'].transform('size')
b = a > 3
print (df.assign(size=a, filter=b))
Product Time size filter
0 1 1 4 True
1 1 2 4 True
2 1 3 4 True
3 1 4 4 True
4 2 1 7 True
5 2 2 7 True
6 2 3 7 True
7 2 4 7 True
8 2 5 7 True
9 2 6 7 True
10 2 7 7 True
11 3 1 3 False
12 3 2 3 False
13 3 3 3 False
14 4 1 3 False
15 4 2 3 False
16 4 3 3 False
If the DataFrame is not large, here is an alternative with DataFrameGroupBy.filter:
df = df.groupby(['Product']).filter(lambda x: len(x) > 3)
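To see what filter does under the hood, this small sketch iterates the same groups and prints what the lambda receives and whether each group passes:
for name, group in df.groupby('Product'):
    # filter keeps the rows of every group for which the function returns True
    print(name, len(group), len(group) > 3)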
Alternatively, use transform('size') after grouping, check which counts are greater than (gt) 3, and use the result to perform boolean indexing on your DataFrame:
df[df.groupby('Product').Time.transform('size').gt(3)]
Product Time
0 1 1
1 1 2
2 1 3
3 1 4
4 2 1
5 2 2
6 2 3
7 2 4
8 2 5
9 2 6
10 2 7
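An equivalent without groupby, assuming you only need counts per Product value, is to map value_counts back onto the column and index with the result (a sketch):
counts = df['Product'].value_counts()
df[df['Product'].map(counts).gt(3)]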
You can also do this if you prefer to keep the count in an explicit helper column and use boolean indexing:
g = df.groupby('Product')
df['c'] = g['Time'].transform('count') # new column holding the per-group count
df2 = df[df['c'] > 3]
print(df)
Product Time c
0 1 1 4
1 1 2 4
2 1 3 4
3 1 4 4
4 2 1 7
5 2 2 7
6 2 3 7
7 2 4 7
8 2 5 7
9 2 6 7
10 2 7 7
11 3 1 3
12 3 2 3
13 3 3 3
14 4 1 3
15 4 2 3
16 4 3 3
print(df2)
Product Time c
0 1 1 4
1 1 2 4
2 1 3 4
3 1 4 4
4 2 1 7
5 2 2 7
6 2 3 7
7 2 4 7
8 2 5 7
9 2 6 7
10 2 7 7
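If you do not want to keep the helper column afterwards, drop it:
df2 = df2.drop(columns='c')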
I have a pandas DataFrame with three columns, and I want to drop all rows where the unique combination
of df['person'], df['id'], and df['day'] occurs only twice or less. Is there a simple way to do this in pandas?
[In]:
person id day
1 2 1
1 2 1
1 2 1
1 2 1
1 1 1
1 1 1
1 1 1
1 0 1
1 2 2
2 2 2
2 2 2
2 2 2
1 3 1
1 3 1
1 3 1
1 0 1
2 2 2
[Out]:
person id day
1 2 1
1 2 1
1 2 1
1 1 1
1 1 1
1 1 1
2 2 2
2 2 2
2 2 2
1 3 1
1 3 1
1 3 1
2 2 2
We can use transform to build a new helper column Info:
df['Info']=df.groupby(list(df)).id.transform('count')
df
Out[444]:
person id day Info
0 1 2 1 4
1 1 2 1 4
2 1 2 1 4
3 1 2 1 4
4 1 1 1 3
5 1 1 1 3
6 1 1 1 3
7 1 0 1 2
8 1 2 2 1
9 2 2 2 4
10 2 2 2 4
11 2 2 2 4
12 1 3 1 3
13 1 3 1 3
14 1 3 1 3
15 1 0 1 2
16 2 2 2 4
Then you can do
df[df.Info > 2].drop(columns='Info')
Out[447]:
person id day
0 1 2 1
1 1 2 1
2 1 2 1
3 1 2 1
4 1 1 1
5 1 1 1
6 1 1 1
9 2 2 2
10 2 2 2
11 2 2 2
12 1 3 1
13 1 3 1
14 1 3 1
16 2 2 2
Or use DataFrameGroupBy.filter:
df.groupby(['person', 'id', 'day']).filter(lambda x: x.shape[0] > 2)
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.filter.html
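The transform approach from above also works here without the helper column, as a one-liner (a sketch of the same idea):
df[df.groupby(['person', 'id', 'day'])['id'].transform('size') > 2]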
I have data like this:
republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
from source.
I would like to change all the distinct values in the data (DataFrame) into numeric values in the most efficient way.
In the above example I would like to transform republican -> 1, democrat -> 2, y -> 3, n -> 4 and ? -> 5 (or NULL).
I tried to use the following:
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
However, I'm not sure whether using Pandas would be more efficient, or whether there are better solutions. (This should be generic to any source of data.)
Here is how I load the data into a DataFrame using Pandas:
import pandas as pd
file_path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data'
df = pd.read_csv(file_path, header=None)
You can flatten the values with ravel, factorize them, and reshape the codes back to the original shape:
v = df.values
f = pd.factorize(v.ravel())[0].reshape(v.shape)
pd.DataFrame(f)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 1 2 1 2 2 2 1 1 1 2 3 2 2 2 1 2
1 0 1 2 1 2 2 2 1 1 1 1 1 2 2 2 1 3
2 4 3 2 2 3 2 2 1 1 1 1 2 1 2 2 1 1
3 4 1 2 2 1 3 2 1 1 1 1 2 1 2 1 1 2
4 4 2 2 2 1 2 2 1 1 1 1 2 3 2 2 2 2
5 4 1 2 2 1 2 2 1 1 1 1 1 1 2 2 2 2
6 4 1 2 1 2 2 2 1 1 1 1 1 1 3 2 2 2
7 0 1 2 1 2 2 2 1 1 1 1 1 1 2 2 3 2
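pd.factorize assigns codes in order of first appearance, and it also returns the unique values alongside the codes, so you can recover the mapping if you need it (a minimal sketch):
codes, uniques = pd.factorize(v.ravel())
lookup = dict(enumerate(uniques)) # code -> original value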
Use replace on the whole DataFrame to make the mappings. You can first pass a dictionary of known mappings for the values that need to remain consistent, then collect the remaining distinct values from the dataset and map them to, say, values from 100 upwards.
For example, ? here is not in the known mappings, so it gets the value 100:
mappings = {'republican':1, 'democrat':2, 'y':3, 'n':4}
unknown = set(pd.unique(df.values.ravel())) - set(mappings.keys())
mappings.update([v, c] for c, v in enumerate(unknown, start=100))
df.replace(mappings, inplace=True)
Giving you:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 1 4 3 4 3 3 3 4 4 4 3 100 3 3 3 4 3
1 1 4 3 4 3 3 3 4 4 4 4 4 3 3 3 4 100
2 2 100 3 3 100 3 3 4 4 4 4 3 4 3 3 4 4
3 2 4 3 3 4 100 3 4 4 4 4 3 4 3 4 4 3
4 2 3 3 3 4 3 3 4 4 4 4 3 100 3 3 3 3
5 2 4 3 3 4 3 3 4 4 4 4 4 4 3 3 3 3
6 2 4 3 4 3 3 3 4 4 4 4 4 4 100 3 3 3
7 1 4 3 4 3 3 3 4 4 4 4 4 4 3 3 100 3
A more generalized version would be:
mappings = {v:c for c, v in enumerate(sorted(set(pd.unique(df.values.ravel()))), start=1)}
df.replace(mappings, inplace=True)
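If you would rather turn ? into NULL, as the question suggests, map it to NaN directly (a sketch using only the known values):
import numpy as np
mappings = {'republican': 1, 'democrat': 2, 'y': 3, 'n': 4, '?': np.nan}
df.replace(mappings, inplace=True)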
You can also factorize the transposed values, so that codes are assigned in order of first appearance down the columns:
v = df.values
a, b = v.shape
f = pd.factorize(v.T.ravel())[0].reshape(b,a).T
df = pd.DataFrame(f)
print (df)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 0 2 4 2 4 4 4 2 2 2 4 3 4 4 4 2 4
1 0 2 4 2 4 4 4 2 2 2 2 2 4 4 4 2 3
2 1 3 4 4 3 4 4 2 2 2 2 4 2 4 4 2 2
3 1 2 4 4 2 3 4 2 2 2 2 4 2 4 2 2 4
4 1 4 4 4 2 4 4 2 2 2 2 4 3 4 4 4 4
5 1 2 4 4 2 4 4 2 2 2 2 2 2 4 4 4 4
6 1 2 4 2 4 4 4 2 2 2 2 2 2 3 4 4 4
7 0 2 4 2 4 4 4 2 2 2 2 2 2 4 4 3 4