I have a basic Python question.
I have a pandas dataframe like this:
ID | Name | User_id
---+------+--------
 1 | John | 10
 2 | Tom  | 11
 3 | Sam  | 12
 4 | Ben  | 13
 5 | Jen  | 10
 6 | Tim  | 11
 7 | Sean | 14
 8 | Ana  | 15
 9 | Sam  | 12
10 | Ben  | 13
I want to get the names and user ids that share the same User_id value, counting distinct names only (Sam and Ben merely repeat with their own User_id, so they don't qualify). So I would like the output to look something like this:
John Jen 10
Tom Tim 11
IIUC you could do it this way: groupby on 'User_id' and then filter the result:
In [54]:
group = df.groupby('User_id')['Name'].unique()  # distinct names per User_id
In [55]:
group[group.apply(lambda x: len(x) > 1)]  # keep ids shared by 2+ distinct names
Out[55]:
User_id
10 [John, Jen]
11 [Tom, Tim]
Name: Name, dtype: object
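If you want the output formatted exactly as in the question (names joined by spaces, then the id), a small follow-up sketch reusing group from above:
shared = group[group.apply(len) > 1]  # same filter as above
for uid, names in shared.items():
    print(' '.join(names), uid)
This prints John Jen 10 and Tom Tim 11.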
How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name', as_index=False).apply(lambda x: x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
You can shuffle all rows using, for example, the numpy function random.permutation, then groupby 'Name' and take the first N rows of each group:
import numpy as np

df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
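With a hypothetical N = 2, the same line yields two random picks per Name:
df.iloc[np.random.permutation(len(df))].groupby('Name').head(2)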
You can get the distinct names using unique (note that this returns the names only, not whole rows):
df['Name'].unique()
Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicate names, keeping the first (now random) occurrence:
df.drop_duplicates(subset=['Name'])
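Note that sample() does not modify df in place, so the two steps are best chained; the same idea as a one-liner, assuming df as above:
df.sample(frac=1).drop_duplicates(subset='Name')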
Alternatively, drop_duplicates on its own:
df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This should help, but it is not a random choice; it keeps the first row for each name.
How about using the random module? Like this. Import your provided data:
import random

import pandas as pd

df = pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name,
colname = df.columns[random.randint(1, 3)]
and in this run it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course, I could have condensed this to:
print(df[df.columns[random.randint(1, 3)]])
I'm trying to split a Pandas DataFrame into multiple separate DataFrames where one of the columns is evenly distributed among the resulting DataFrames. For example, I'd want the following DataFrame split into two distinct DataFrames, where each one contains two records of each sector, selected at random.
So a df that looks like this:
id Name Sector
1 John A
2 Steven A
3 Jane A
4 Kyle A
5 Ashley B
6 Ken B
7 Tom B
8 Peter B
9 Elaine C
10 Tom C
11 Adam C
12 Simon C
13 Stephanie D
14 Jan D
15 Marsha D
16 David D
17 Drew E
18 Kit E
19 Corey E
20 James E
This would yield two DataFrames; one could look like this, while the other consists of the remaining records.
id Name Sector
1 John A
2 Steven A
7 Tom B
8 Peter B
10 Tom C
11 Adam C
13 Stephanie D
16 David D
19 Corey E
20 James E
I know np.array_split(df, 2) will get me partway there, but it may not distribute the sectors evenly, which I need.
(Edited for clarity)
Update per comments and updated question:
df_1 = df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2))  # 2 random rows per sector
df_2 = df[~df.index.isin(df_1.index)]  # everything not drawn into df_1
print(df_1)
id Name Sector
2 3 Jane A
3 4 Kyle A
7 8 Peter B
5 6 Ken B
11 12 Simon C
9 10 Tom C
12 13 Stephanie D
15 16 David D
19 20 James E
17 18 Kit E
print(df_2)
id Name Sector
0 1 John A
1 2 Steven A
4 5 Ashley B
6 7 Tom B
8 9 Elaine C
10 11 Adam C
13 14 Jan D
14 15 Marsha D
16 17 Drew E
18 19 Corey E
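If you need the split to be reproducible, sample() also accepts a random_state seed; a minimal variant of the same code, with a hypothetical seed of 42:
df_1 = df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2, random_state=42))
df_2 = df[~df.index.isin(df_1.index)]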
Here is a "funky" method, using sequential numbering and random sampling:
df['grp'] = df.groupby('Sector')['Sector']\
              .transform(lambda x: x.notna().cumsum().sample(frac=1))  # number rows 1..n per sector, then shuffle
dd = dict(tuple(df.groupby('grp')))  # one DataFrame per grp value
Output:
dd[1]
id Name Sector grp
0 1 John A 1
4 5 Ken B 1
6 7 Elaine C 1
dd[2]
id Name Sector grp
2 3 Jane A 2
5 6 Tom B 2
7 8 Tom C 2
dd[3]
id Name Sector grp
1 2 Steven A 3
3 4 Ashley B 3
8 9 Adam C 3
Details:
Create a sequence of numbers in each sector group starting from 1,
then randomize that number within the group to create a grouping key,
grp.
Use grp to groupby then create a dictionary, with keys for each grp.
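The same effect can be had without transform by shuffling first and then numbering with cumcount; a minimal sketch, assuming df has a Sector column as above:
df = df.sample(frac=1).reset_index(drop=True)    # shuffle all rows
df['grp'] = df.groupby('Sector').cumcount() + 1  # 1, 2, ... within each sector
dd = dict(tuple(df.groupby('grp')))              # one DataFrame per grp value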
Here's my way: you can group by Sector and randomly select from each group in a loop, using the sample function:
for x, i in df.groupby('Sector'):
    print(i.sample())
If you need multiple random selections, pass sample the number of items you want. For example:
for x, i in df.groupby('Sector'):
    print(i.sample(2))
will return 2 random values from each group.
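To collect the picks into a single DataFrame instead of printing them, a sketch assuming df as above:
import pandas as pd

picks = pd.concat(g.sample() for _, g in df.groupby('Sector'))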
I have table A with 3 columns. The val column has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, alex and john take value 20 and sam takes value 100.
A = [ id  name    val
      1   jack    10
      2   mec     20
      3   alex
      4   john
      5   sam     250
      6   tom     100
      7   sam
      8   hellen  300 ]
You can try taking the data in as a pandas DataFrame and using the built-in function fillna() to solve your problem. For example,
df = # your data
df.fillna(method='pad')
This would return a DataFrame like:
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas fillna documentation for more information.
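Note that recent pandas releases deprecate the method= argument of fillna(); the equivalent forward fill, assuming df as above, is:
df['val'] = df['val'].ffill()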
Let's say I have the following pandas DataFrame:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13], ['Bob', '#'], ['Bob', '#'], ['Bob', '#']]
df = pd.DataFrame(data,columns=['Name','Age'], dtype=float)
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob #
4 Bob #
5 Bob #
So, there are odd rows in the DataFrame for Bob, namely rows 3, 4, and 5. These values are consistently #, not 12. Row 1 shows that Bob should be 12, not #.
In this example, it's straightforward to fix this with replace():
df = df.replace("#", 12)
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob 12
4 Bob 12
5 Bob 12
However, this wouldn't work for larger dataframes, e.g.
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob #
4 Bob #
5 Bob #
6 Clarke #
whereby row 6 should be 6 Clarke 13.
How does one replace any # in Age with the correct integer, as given in other rows with the same Name? That is, wherever a # exists, look up another row with the same Name value and use its Age.
Try this:
d = df[df['Age'] != '#'].set_index('Name')['Age']  # valid Name -> Age lookup
df['Age'] = df['Name'].replace(d)                  # map each name to its age
Output:
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob 12
4 Bob 12
5 Bob 12
6 Clarke 13
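Note that replace() leaves unmatched values untouched, so a name whose ages are all # would keep the name string itself in the Age column.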
You want to use the valid values to fill the invalid ones? In that case, use map:
v = df.assign(Age=pd.to_numeric(df['Age'], errors='coerce')).dropna()  # rows with a numeric Age
df['Age'] = df['Name'].map(v.set_index('Name').Age)                    # look up each name's age
df
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
3 Bob 12.0
4 Bob 12.0
5 Bob 12.0
6 Clarke 13.0
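If the same name could carry a valid age in more than one row, de-duplicate before building the lookup, since .map() against a Series with a non-unique index raises an error; a sketch assuming v as above:
df['Age'] = df['Name'].map(v.drop_duplicates('Name').set_index('Name').Age)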
I have a Pandas dataframe. My question is: how do I group all the sellers (indicated under sellerUserName) for each date? For example, for any date, e.g. 29/03/2018, I want to retrieve a count of all the unique sellers.
ScrapeDate sellerUserName
0 29/03/2018 BOB
1 29/03/2018 BOB
2 29/03/2018 BOB
3 29/03/2018 MARY
4 29/03/2018 IAN
5 29/03/2018 ANISA
6 30/03/2018 BOB
7 30/03/2018 BOB
8 30/03/2018 BOB
9 30/03/2018 KARL
10 30/03/2018 KARL
11 30/03/2018 IAN
12 01/04/2018 NGI
13 01/04/2018 NICEE
So the output dataframe should be
ScrapeDate No.of Sellers
0 29/03/2018 4
1 30/03/2018 3
2 01/04/2018 2
Just use nunique:
df.groupby('ScrapeDate')['sellerUserName'].nunique()
Out[38]:
ScrapeDate
01/04/2018 2
29/03/2018 4
30/03/2018 3
Name: sellerUserName, dtype: int64
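To match the layout in the question (original date order and a renamed count column), a sketch assuming df as above:
df.groupby('ScrapeDate', sort=False)['sellerUserName'].nunique().reset_index(name='No.of Sellers')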