I have the following Python Pandas Dataframe:
Name Sales Qty
0 JOHN BARNES 10
1 John Barnes 5
2 John barnes 4
3 Peter K. 4
4 Peter K 6
5 Peter Krammer 5
6 Charles 3
7 CHARLES 2
8 Julie Moore 3
9 Julie moore 7
And many more, with the same kind of name-spelling variations.
I would like to combine the rows with similar values, such that I have the following Dataframe:
Name Sales Qty
0 John Barnes 19
1 Peter Krammer 15
2 Charles 5
3 Julie Moore 10
and many more
How should I do this?
The requirements are vague, as you can see in the comments, but I've tabulated the totals as far as I can tell. I normalized each name by lowercasing it and removing the period, summed Sales per normalized name, and then converted the result back to title case with str.title().
import pandas as pd
import io
data = '''
Name Sales
0 "JOHN BARNES" 10
1 "John Barnes" 5
2 "John barnes" 4
3 "Peter K." 4
4 "Peter K" 6
5 "Peter Krammer" 5
6 "Charles" 3
7 "CHARLES" 2
8 "Julie Moore" 3
9 "Julie moore" 7
'''
df = pd.read_csv(io.StringIO(data), sep='\s+')
df['lower'] = df['Name'].str.lower()
# regex=False: treat '.' as a literal dot, not the regex "any character"
df['lower'] = df['lower'].str.replace('.', '', regex=False)
new = df.groupby('lower')['Sales'].sum().reset_index()
new['lower'] = new['lower'].str.title()
new
lower Sales
0 Charles 5
1 John Barnes 19
2 Julie Moore 10
3 Peter K 10
4 Peter Krammer 5
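The approach above still leaves "Peter K" and "Peter Krammer" as separate rows, since they differ by more than case and punctuation. If you also want those folded together, one option (not part of the answer above) is fuzzy matching with difflib.get_close_matches against a list of canonical spellings; the canonical list below is a hypothetical choice you would have to supply yourself:

```python
import difflib
import pandas as pd

df = pd.DataFrame({
    'Name': ['JOHN BARNES', 'John Barnes', 'John barnes', 'Peter K.',
             'Peter K', 'Peter Krammer', 'Charles', 'CHARLES',
             'Julie Moore', 'Julie moore'],
    'Sales': [10, 5, 4, 4, 6, 5, 3, 2, 3, 7],
})

# Hypothetical canonical spellings to map the variants onto
canonical = ['John Barnes', 'Peter Krammer', 'Charles', 'Julie Moore']

def to_canonical(name):
    # Normalize case and punctuation, then pick the closest canonical name
    cleaned = name.replace('.', '').title()
    matches = difflib.get_close_matches(cleaned, canonical, n=1, cutoff=0.5)
    return matches[0] if matches else cleaned

df['Name'] = df['Name'].map(to_canonical)
out = df.groupby('Name', sort=False)['Sales'].sum().reset_index()
print(out)
```

The cutoff is a judgment call: too low and distinct people get merged, too high and variants stay apart, so check the result by hand on real data.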
How can I randomly select one row from each group (column Name) in the following dataframe:
Distance Name Time Order
1 16 John 5 0
4 31 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
2 32 Peter 2 0
5 26 Peter 4 1
Expected result:
Distance Name Time Order
4 31 John 9 1
0 23 Kate 3 0
2 32 Peter 2 0
You can use a groupby on the Name column and apply sample:
df.groupby('Name',as_index=False).apply(lambda x:x.sample()).reset_index(drop=True)
Distance Name Time Order
0 31 John 9 1
1 15 Kate 7 1
2 32 Peter 2 0
You can shuffle all rows using, for example, the NumPy function random.permutation, then group by Name and take the first N rows from each group:
import numpy as np

df.iloc[np.random.permutation(len(df))].groupby('Name').head(1)
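If your pandas version is 1.1 or newer, GroupBy.sample does this in one call, which is simpler than shuffling by hand:

```python
import pandas as pd

df = pd.DataFrame({
    'Distance': [16, 31, 23, 15, 32, 26],
    'Name': ['John', 'John', 'Kate', 'Kate', 'Peter', 'Peter'],
    'Time': [5, 9, 3, 7, 2, 4],
    'Order': [0, 1, 0, 1, 0, 1],
})

# One random row per Name; requires pandas >= 1.1
picked = df.groupby('Name').sample(n=1)
print(picked)
```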
You can achieve that using unique:
df['Name'].unique()
Shuffle the dataframe:
df.sample(frac=1)
And then drop duplicated rows:
df.drop_duplicates(subset=['Name'])
df.drop_duplicates(subset='Name')
Distance Name Time Order
1 16 John 5 0
0 23 Kate 3 0
2 32 Peter 2 0
This should help, but it's not a random choice; it keeps the first row for each name.
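The two steps can be chained: after the shuffle, the "first" row drop_duplicates keeps is a random one, so a single expression does the job:

```python
import pandas as pd

df = pd.DataFrame({
    'Distance': [16, 31, 23, 15, 32, 26],
    'Name': ['John', 'John', 'Kate', 'Kate', 'Peter', 'Peter'],
    'Time': [5, 9, 3, 7, 2, 4],
    'Order': [0, 1, 0, 1, 0, 1],
})

# Shuffle, then keep the first (now random) row per Name
one_per_name = df.sample(frac=1).drop_duplicates(subset='Name')
print(one_per_name)
```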
How about using the random module? Like this:
Import your provided data,
df=pd.read_csv('random_data.csv', header=0)
which looks like this,
Distance Name Time Order
1 16 John 5 0
4 3 John 9 1
0 23 Kate 3 0
3 15 Kate 7 1
then get a random column name,
import random

colname = df.columns[random.randint(1, 3)]
and here it happened to select 'Name':
print(df[colname])
1 John
4 John
0 Kate
3 Kate
Name: Name, dtype: object
Of course I could have condensed this to,
print(df[df.columns[random.randint(1, 3)]])
I'm trying to split a Pandas DataFrame into multiple separate DataFrames where one of the columns is evenly distributed among the resulting DataFrames. For example, if I wanted the following DataFrame split into 3 distinct DataFrames where each one contains one record of each sector (selected at random).
So a df that looks like this:
id Name Sector
1 John A
2 Steven A
3 Jane A
4 Kyle A
5 Ashley B
6 Ken B
7 Tom B
8 Peter B
9 Elaine C
10 Tom C
11 Adam C
12 Simon C
13 Stephanie D
14 Jan D
15 Marsha D
16 David D
17 Drew E
18 Kit E
19 Corey E
20 James E
Would yield two DataFrames, one of which could look like this, while the other consists of the remaining records.
id Name Sector
1 John A
2 Steven A
7 Tom B
8 Peter B
10 Tom C
11 Adam C
13 Stephanie D
16 David D
19 Corey E
20 James E
I know np.array_split(df, 2) will get me part way there, but it may not evenly distribute the sectors like I need.
(Edited for clarity)
Update per comments and updated question:
df_1=df.groupby('Sector', as_index=False, group_keys=False).apply(lambda x: x.sample(n=2))
df_2 = df[~df.index.isin(df_1.index)]
print(df_1)
id Name Sector
2 3 Jane A
3 4 Kyle A
7 8 Peter B
5 6 Ken B
11 12 Simon C
9 10 Tom C
12 13 Stephanie D
15 16 David D
19 20 James E
17 18 Kit E
print(df_2)
id Name Sector
0 1 John A
1 2 Steven A
4 5 Ashley B
6 7 Tom B
8 9 Elaine C
10 11 Adam C
13 14 Jan D
14 15 Marsha D
16 17 Drew E
18 19 Corey E
Here is a "funky" method, using sequential numbering and random sampling:
df['grp'] = df.groupby('Sector')['Sector']\
.transform(lambda x: x.notna().cumsum().sample(frac=1))
dd = dict(tuple(df.groupby('grp')))
Output:
dd[1]
id Name Sector grp
0 1 John A 1
4 5 Ken B 1
6 7 Elaine C 1
dd[2]
id Name Sector grp
2 3 Jane A 2
5 6 Tom B 2
7 8 Tom C 2
dd[3]
id Name Sector grp
1 2 Steven A 3
3 4 Ashley B 3
8 9 Adam C 3
Details:
Create a sequence of numbers in each sector group starting from 1,
then randomize that number within the group to create a grouping key,
grp.
Use grp to group by, then create a dictionary with a key for each grp.
Here's my way: you can group by Sector and randomly select from each group in a loop using the sample function:
for x, i in df.groupby('Sector'):
print(i.sample())
If you need multiple random selections, pass sample the number of items you want. For example:
for x, i in df.groupby('Sector'):
print(i.sample(2))
will return 2 random values from each group.
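If you want the per-group samples gathered into one DataFrame rather than printed, you can concatenate the pieces (a small extension of the loop above):

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 7),
    'Name': ['John', 'Steven', 'Ashley', 'Ken', 'Elaine', 'Tom'],
    'Sector': ['A', 'A', 'B', 'B', 'C', 'C'],
})

# One random row per Sector, collected into a single DataFrame
picked = pd.concat(g.sample(1) for _, g in df.groupby('Sector'))
print(picked)
```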
I have table A with 3 columns. The column (val) has some empty values. The question is: is there any way to fill the empty values based on the previous value using Python? For example, alex and john take value 20 and sam takes value 100.
A = [ id name val
1 jack 10
2 mec 20
3 alex
4 john
5 sam 250
6 tom 100
7 sam
8 hellen 300]
You can read your data into a pandas DataFrame and use the built-in ffill() to forward-fill the gaps (fillna(method='pad') is the older, now-deprecated spelling). For example,
df = ...  # your data
df.ffill()
Would return a dataframe like,
id name val
0 1 jack 10
1 2 mec 20
2 3 alex 20
3 4 john 20
4 5 sam 250
5 6 tom 100
6 7 sam 100
7 8 hellen 300
You can refer to this page for more information.
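A runnable sketch of the forward fill on the question's data, with NaN standing in for the blank cells:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': range(1, 9),
    'name': ['jack', 'mec', 'alex', 'john', 'sam', 'tom', 'sam', 'hellen'],
    'val': [10, 20, np.nan, np.nan, 250, 100, np.nan, 300],
})

# Forward fill: each empty cell takes the previous non-empty value
df['val'] = df['val'].ffill()
print(df)
```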
I have a list of names like this:
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
I want to create a dataframe that has all the names in the names list with their corresponding numbers from this dataframe grouped; for the names that do not have corresponding numbers there should be an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Do we have any built-in functions to get this done?
You can use GroupBy with str.join, then reindex with your names list:
res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
.reindex(names).fillna('*').reset_index()
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
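An equivalent without the lambda: cast Number to str first, then agg with ','.join and reindex with a fill value (same names list as in the question):

```python
import pandas as pd

names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
df = pd.DataFrame({
    'Number': range(1, 11),
    'Names': ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse',
              'Barsa', 'Barsa', 'Talyo', 'Jon', 'Zidane'],
})

# Join the stringified numbers per name, then align to the full names list
res = (df.assign(Number=df['Number'].astype(str))
         .groupby('Names')['Number'].agg(','.join)
         .reindex(names, fill_value='*')
         .reset_index())
print(res)
```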
I have a DataFrame like the following
df = pd.DataFrame( {'Item':['A','A','A','B','B','C','C','C','C'],
'Name': ['Tom','John','Paul','Tom','Frank','Tom', 'John', 'Richard', 'James'],
'Total':[3,3,3,2,2,4,4,4,4]})
print(df)
Item Name Total
A Tom 3
A John 3
A Paul 3
B Tom 2
B Frank 2
C Tom 4
C John 4
C Richard 4
C James 4
I want to create a network of collaboration which is normalized over the total collaborations between each pair and the number of Names on the same Item. In the end, I would like something like:
df1
Name Name1 Item Total
Tom John A 3
Tom John C 4
Tom Paul A 3
Tom Frank B 2
Tom Richard C 4
Tom James C 4
John Paul A 3
John Richard C 4
Richard James C 4
I think this gets what you want. I used groupby to group by the Item that connects two Names, and itertools.combinations within each group:
from itertools import combinations

cnxns = []
for k, g in df.groupby('Item'):
    cnxns.extend((n1, n2, k, len(g)) for n1, n2 in combinations(g['Name'], 2))
pd.DataFrame(cnxns, columns=['Name', 'Name1', 'Item', 'Total'])
Name Name1 Item Total
0 Tom John A 3
1 Tom Paul A 3
2 John Paul A 3
3 Tom Frank B 2
4 Tom John C 4
5 Tom Richard C 4
6 Tom James C 4
7 John Richard C 4
8 John James C 4
9 Richard James C 4
There's probably a better method out there, but this should do what you ask.
The only difference between my output and your desired output is that I included (John, James, C, 4), but maybe you wanted that (assuming I understood the question correctly)?