I have a basic Python question.
I have a pandas dataframe like this:
ID | Name | User_id
---+------+--------
 1 | John | 10
 2 | Tom  | 11
 3 | Sam  | 12
 4 | Ben  | 13
 5 | Jen  | 10
 6 | Tim  | 11
 7 | Sean | 14
 8 | Ana  | 15
 9 | Sam  | 12
10 | Ben  | 13
I want to get the names and user ids that share the same User_id value, counting distinct names only (Sam and Ben merely repeat with their own User_id, so they don't qualify). So I would like the output to look something like this:
John Jen 10
Tom Tim 11
IIUC you could do it this way: groupby on 'User_id' and then filter the result:
In [54]:
group = df.groupby('User_id')['Name'].unique()  # distinct names per User_id
In [55]:
group[group.apply(lambda x: len(x) > 1)]  # keep ids shared by 2+ distinct names
Out[55]:
User_id
10 [John, Jen]
11 [Tom, Tim]
Name: Name, dtype: object
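If you want the output formatted exactly as in the question (names joined by spaces, then the id), a small follow-up sketch reusing group from above:
shared = group[group.apply(len) > 1]  # same filter as above
for uid, names in shared.items():
    print(' '.join(names), uid)
This prints John Jen 10 and Tom Tim 11.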
How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name', as_index=False).apply(lambda x: x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
You can shuffle all rows using, for example, the numpy function random.permutation, then groupby 'Name' and take the first N rows of each group:
import numpy as np

df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
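With a hypothetical N = 2, the same line yields two random picks per Name:
df.iloc[np.random.permutation(len(df))].groupby('Name').head(2)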
You can get the distinct names using unique (note that this returns the names only, not whole rows):
df['Name'].unique()
Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicate names, keeping the first (now random) occurrence:
df.drop_duplicates(subset=['Name'])
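Note that sample() does not modify df in place, so the two steps are best chained; the same idea as a one-liner, assuming df as above:
df.sample(frac=1).drop_duplicates(subset='Name')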
Alternatively, drop_duplicates on its own:
df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This should help, but it is not a random choice; it keeps the first row for each name.
How about using the random module? Like this. Import your provided data:
import random

import pandas as pd

df = pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name,
colname = df.columns[random.randint(1, 3)]
and in this run it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course, I could have condensed this to:
print(df[df.columns[random.randint(1, 3)]])
I'm trying to split a Pandas DataFrame into multiple separate DataFrames where one of the columns is evenly distributed among the resulting DataFrames. For example, I'd want the following DataFrame split into two distinct DataFrames, where each one contains two records of each sector, selected at random.
So a df that looks like this:
id Name Sector
1 John A
2 Steven A
3 Jane A
4 Kyle A
5 Ashley B
6 Ken B
7 Tom B
8 Peter B
9 Elaine C
10 Tom C
11 Adam C
12 Simon C
13 Stephanie D
14 Jan D
15 Marsha D
16 David D
17 Drew E
18 Kit E
19 Corey E
20 James E
This would yield two DataFrames; one could look like this, while the other consists of the remaining records.
id Name Sector
1 John A
2 Steven A
7 Tom B
8 Peter B
10 Tom C
11 Adam C
13 Stephanie D
16 David D
19 Corey E
20 James E
I know np.array_split(df, 2) will get me partway there, but it may not distribute the sectors evenly, which I need.
(Edited for clarity)
Update per comments and updated question:
df_1 = df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2))  # 2 random rows per sector
df_2 = df[~df.index.isin(df_1.index)]  # everything not drawn into df_1
print(df_1)
id Name Sector
2 3 Jane A
3 4 Kyle A
7 8 Peter B
5 6 Ken B
11 12 Simon C
9 10 Tom C
12 13 Stephanie D
15 16 David D
19 20 James E
17 18 Kit E
print(df_2)
id Name Sector
0 1 John A
1 2 Steven A
4 5 Ashley B
6 7 Tom B
8 9 Elaine C
10 11 Adam C
13 14 Jan D
14 15 Marsha D
16 17 Drew E
18 19 Corey E
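If you need the split to be reproducible, sample() also accepts a random_state seed; a minimal variant of the same code, with a hypothetical seed of 42:
df_1 = df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2, random_state=42))
df_2 = df[~df.index.isin(df_1.index)]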
Here is a "funky" method, using sequential numbering and random sampling:
df['grp'] = df.groupby('Sector')['Sector']\
              .transform(lambda x: x.notna().cumsum().sample(frac=1))  # number rows 1..n per sector, then shuffle
dd = dict(tuple(df.groupby('grp')))  # one DataFrame per grp value
Output:
dd[1]
id Name Sector grp
0 1 John A 1
4 5 Ken B 1
6 7 Elaine C 1
dd[2]
id Name Sector grp
2 3 Jane A 2
5 6 Tom B 2
7 8 Tom C 2
dd[3]
id Name Sector grp
1 2 Steven A 3
3 4 Ashley B 3
8 9 Adam C 3
Details:
Create a sequence of numbers in each sector group starting from 1,
then randomize that number within the group to create a grouping key,
grp.
Use grp to groupby then create a dictionary, with keys for each grp.
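The same effect can be had without transform by shuffling first and then numbering with cumcount; a minimal sketch, assuming df has a Sector column as above:
df = df.sample(frac=1).reset_index(drop=True)    # shuffle all rows
df['grp'] = df.groupby('Sector').cumcount() + 1  # 1, 2, ... within each sector
dd = dict(tuple(df.groupby('grp')))              # one DataFrame per grp value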
Here's my way: you can group by Sector and randomly select from each group in a loop, using the sample function:
for x, i in df.groupby('Sector'):
    print(i.sample())
If you need multiple random selections, pass sample the number of items you want. For example:
for x, i in df.groupby('Sector'):
    print(i.sample(2))
will return 2 random values from each group.
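To collect the picks into a single DataFrame instead of printing them, a sketch assuming df as above:
import pandas as pd

picks = pd.concat(g.sample() for _, g in df.groupby('Sector'))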
I have table A with 3 columns. The val column has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, alex and john take value 20 and sam takes value 100.
A = [ id  name    val
      1   jack    10
      2   mec     20
      3   alex
      4   john
      5   sam     250
      6   tom     100
      7   sam
      8   hellen  300 ]
You can try taking the data in as a pandas DataFrame and using the built-in function fillna() to solve your problem. For example,
df = # your data
df.fillna(method='pad')
This would return a DataFrame like:
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to the pandas fillna documentation for more information.
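Note that recent pandas releases deprecate the method= argument of fillna(); the equivalent forward fill, assuming df as above, is:
df['val'] = df['val'].ffill()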
Let's say I have the following pandas DataFrame:
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13], ['Bob', '#'], ['Bob', '#'], ['Bob', '#']]
df = pd.DataFrame(data,columns=['Name','Age'], dtype=float)
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob #
4 Bob #
5 Bob #
So, there are odd rows in the DataFrame for Bob, namely rows 3, 4, and 5. These values are consistently #, not 12. Row 1 shows that Bob should be 12, not #.
In this example, it's straightforward to fix this with replace():
df = df.replace("#", 12)
print(df)
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob 12
4 Bob 12
5 Bob 12
However, this wouldn't work for larger dataframes, e.g.
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob #
4 Bob #
5 Bob #
6 Clarke #
whereby row 6 should be 6 Clarke 13.
How does one replace any # in Age with the correct integer, as given in other rows with the same Name? That is, wherever a # exists, look up another row with the same Name value and use its Age.
Try this:
d = df[df['Age'] != '#'].set_index('Name')['Age']  # valid Name -> Age lookup
df['Age'] = df['Name'].replace(d)                  # map each name to its age
Output:
Name Age
0 Alex 10
1 Bob 12
2 Clarke 13
3 Bob 12
4 Bob 12
5 Bob 12
6 Clarke 13
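Note that replace() leaves unmatched values untouched, so a name whose ages are all # would keep the name string itself in the Age column.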
You want to use the valid values to fill the invalid ones? In that case, use map:
v = df.assign(Age=pd.to_numeric(df['Age'], errors='coerce')).dropna()  # rows with a numeric Age
df['Age'] = df['Name'].map(v.set_index('Name').Age)                    # look up each name's age
df
Name Age
0 Alex 10.0
1 Bob 12.0
2 Clarke 13.0
3 Bob 12.0
4 Bob 12.0
5 Bob 12.0
6 Clarke 13.0
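If the same name could carry a valid age in more than one row, de-duplicate before building the lookup, since .map() against a Series with a non-unique index raises an error; a sketch assuming v as above:
df['Age'] = df['Name'].map(v.drop_duplicates('Name').set_index('Name').Age)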
I have a Pandas dataframe. My question is: how do I group all the sellers (indicated under sellerUserName) for each date? For example, for any date, e.g. 29/03/2018, I want to retrieve a count of all the unique sellers.
ScrapeDate sellerUserName
0 29/03/2018 BOB
1 29/03/2018 BOB
2 29/03/2018 BOB
3 29/03/2018 MARY
4 29/03/2018 IAN
5 29/03/2018 ANISA
6 30/03/2018 BOB
7 30/03/2018 BOB
8 30/03/2018 BOB
9 30/03/2018 KARL
10 30/03/2018 KARL
11 30/03/2018 IAN
12 01/04/2018 NGI
13 01/04/2018 NICEE
So the output dataframe should be
ScrapeDate No.of Sellers
0 29/03/2018 4
1 30/03/2018 3
2 01/04/2018 2
Just use nunique:
df.groupby('ScrapeDate')['sellerUserName'].nunique()
Out[38]:
ScrapeDate
01/04/2018 2
29/03/2018 4
30/03/2018 3
Name: sellerUserName, dtype: int64
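To match the layout in the question (original date order and a renamed count column), a sketch assuming df as above:
df.groupby('ScrapeDate', sort=False)['sellerUserName'].nunique().reset_index(name='No.of Sellers')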