Merging pandas columns (many-to-one) - python

I have two dataframes:
df1
ID
1
2
3
4
5
6
7
8
9
10
df2:
Name Count
raj 2
dinesh 3
sachin 3
glen 2
Now I want to create a third dataframe based on df1, with a second column "Owner" inserted, where 2 rows are assigned to raj, 3 to dinesh, 3 to sachin and 2 to glen. The third dataframe will look like this:
df3:
ID Owner
1 raj
2 raj
3 dinesh
4 dinesh
5 dinesh
6 sachin
7 sachin
8 sachin
9 glen
10 glen
I'll highly appreciate all your help.

It seems you need numpy.repeat, but the sum of all Count values must equal the length of df1:
df1['Owner'] = np.repeat(df2['Name'].values, df2['Count'].values)
print (df1)
ID Owner
0 1 raj
1 2 raj
2 3 dinesh
3 4 dinesh
4 5 dinesh
5 6 sachin
6 7 sachin
7 8 sachin
8 9 glen
9 10 glen
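As a self-contained sketch of the same idea, the snippet below rebuilds the two frames from the question and checks the length requirement before assigning. The explicit check and the use of assign (which returns a new frame instead of modifying df1) are additions for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"ID": range(1, 11)})
df2 = pd.DataFrame({"Name": ["raj", "dinesh", "sachin", "glen"],
                    "Count": [2, 3, 3, 2]})

# np.repeat produces sum(Count) values, so that total has to match len(df1);
# otherwise the column assignment fails with a length-mismatch error.
if df2["Count"].sum() != len(df1):
    raise ValueError("sum of Count must equal the number of rows in df1")

# assign() returns a new frame and leaves df1 untouched.
df3 = df1.assign(
    Owner=np.repeat(df2["Name"].to_numpy(), df2["Count"].to_numpy())
)
print(df3)
```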

Related

Pandas - Combine rows with similar values (name spelling variations)

I have the following Python Pandas Dataframe:
Name Sales Qty
0 JOHN BARNES 10
1 John Barnes 5
2 John barnes 4
3 Peter K. 4
4 Peter K 6
5 Peter Krammer 5
6 Charles 3
7 CHARLES 2
8 Julie Moore 3
9 Julie moore 7
10
And many more, with same name spelling variations.
I would like to combine the rows with similar values, such that I have the following Dataframe:
Name Sales Qty
0 John Barness 19
1 Peter Krammer 15
2 Charles 5
3 Julie Moore 10
and many more
How should I do this?
The requirements are vague, as you can see in the comments, but I've tabulated the totals as far as I can tell. I grouped the names after lowercasing them and removing the period, summed the sales, and then converted the names to title case with str.title().
import pandas as pd
import io
data = '''
Name Sales
0 "JOHN BARNES" 10
1 "John Barnes" 5
2 "John barnes" 4
3 "Peter K." 4
4 "Peter K" 6
5 "Peter Krammer" 5
6 "Charles" 3
7 "CHARLES" 2
8 "Julie Moore" 3
9 "Julie moore" 7
'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
df['lower'] = df['Name'].str.lower()
df['lower'] = df['lower'].str.replace('.', '', regex=False)
new = df.groupby('lower')['Sales'].sum().reset_index()
new['lower'] = new['lower'].str.title()
new
lower Sales
0 Charles 5
1 John Barnes 19
2 Julie Moore 10
3 Peter K 10
4 Peter Krammer 5
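Lowercasing and stripping the period cannot tell that "Peter K" and "Peter Krammer" are the same person, which is why they stay separate above. One possible way to merge them, sketched here under the assumption that you can maintain a small alias table by hand, is to map the normalised spellings to a canonical name before grouping. The aliases dictionary is hypothetical and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["JOHN BARNES", "John Barnes", "John barnes", "Peter K.", "Peter K",
             "Peter Krammer", "Charles", "CHARLES", "Julie Moore", "Julie moore"],
    "Sales": [10, 5, 4, 4, 6, 5, 3, 2, 3, 7],
})

# Hypothetical alias table: normalised spelling -> canonical spelling.
aliases = {"peter k": "peter krammer"}

key = (df["Name"].str.lower()
                 .str.replace(".", "", regex=False)  # literal period, not a regex
                 .str.strip()
                 .replace(aliases))                  # swaps whole values only

result = (df.groupby(key)["Sales"].sum()
            .rename_axis("Name")
            .reset_index())
result["Name"] = result["Name"].str.title()
print(result)
```

For spelling variations that cannot be listed in advance, fuzzy string matching would be needed instead; that is outside what this sketch covers.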

Turn 4 columns into two

In a Jupyter notebook, using pandas, I have a csv with 4 columns.
Names Number Names2 Number2
Jim 2 Greg 5
Meek 4 Drake 6
NaN 12 Tim 3
Neri 1 Nan 9
There are no duplicates between the two Name columns, but there are NaNs.
I am looking to:
Create 2 new columns that append the 4 columns
Remove the NaNs in the process
Where there are NaN names, remove the associated number as well.
Desired Output
Names Number Names2 Number2 - NameList NumberList
Jim 2 Greg 5 Jim 2
Meek 4 Drake 6 Meek 4
NaN 12 Tim 3 Neri 1
Neri 1 Nan 9 Greg 5
Drake 6
Tim 3
I have tried using .append, but whenever I append, my new NameList column ends up being the same length as one of the original columns, or the NaNs stay.
This looks like pd.wide_to_long with a little modification on the first set of Names and Number column:
d = dict(zip(['Names','Number'],['Names1','Number1']))
(pd.wide_to_long(df.rename(columns=d).reset_index()
,['Names','Number'],'index','v')
.dropna(subset=['Names']).reset_index(drop=True))
Names Number
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
You can try this:
df = df.replace('Nan', np.nan)
df1 = pd.concat([pd.concat([df['Names'], df['Names2']]), pd.concat([df['Number'], df['Number2']])], axis=1).dropna().rename(columns={0: 'Nameslist', 1: 'Numberlist'}).reset_index().drop(columns=['index'])
print(df1)
Nameslist Numberlist
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
When you want to concatenate while ignoring the column names and index, numpy can be a handy tool:
tmp = pd.DataFrame(np.concatenate(
[df[['Names', 'Number']].dropna().values,
df[['Names2', 'Number2']].dropna().values]),
columns=['NameList', 'NumberList'])
It gives:
NameList NumberList
0 Jim 2
1 Meek 4
2 Neri 1
3 Greg 5
4 Drake 6
5 Tim 3
You can now concatenate on axis=1:
pd.concat([df, tmp], axis=1)
which gives as expected:
Names Number Names2 Number2 NameList NumberList
0 Jim 2.0 Greg 5.0 Jim 2
1 Meek 4.0 Drake 6.0 Meek 4
2 NaN 12.0 Tim 3.0 Neri 1
3 Neri 1.0 NaN 9.0 Greg 5
4 NaN NaN NaN NaN Drake 6
5 NaN NaN NaN NaN Tim 3
Try this:
(pd.concat([df,
pd.DataFrame(
{x.replace("2", ""): df.pop(x)
for x in ['Names2', 'Number2']})])) \
.replace('Nan', np.nan).dropna()
Output:
Names Number
0 Jim 2
1 Meek 4
3 Neri 1
0 Greg 5
1 Drake 6
2 Tim 3
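The output above keeps the original row labels, so the index contains duplicates (0, 1, 3, 0, 1, 2), and df.pop also removes the second pair of columns from df. A non-mutating variant that produces a clean 0..n-1 index might look like the sketch below. It assumes the missing names are real NaN values; if the csv contains the literal string 'Nan', replace it first as shown in the answers above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Names": ["Jim", "Meek", np.nan, "Neri"],
    "Number": [2, 4, 12, 1],
    "Names2": ["Greg", "Drake", "Tim", np.nan],
    "Number2": [5, 6, 3, 9],
})

first = df[["Names", "Number"]]
# Rename the second pair so both halves share column labels before stacking.
second = df[["Names2", "Number2"]].rename(
    columns={"Names2": "Names", "Number2": "Number"})

stacked = (pd.concat([first, second], ignore_index=True)
             .dropna(subset=["Names"])   # drops the number along with a NaN name
             .reset_index(drop=True))
print(stacked)
```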

Pandas Dataframe retrieve unique column

I have a Pandas dataframe. My question is how do I group all the sellers (indicated under sellerUserName) for each date. For example, for any date, e.g. 29/03/2018, I want to retrieve the count of unique sellers.
ScrapeDate sellerUserName
0 29/03/2018 BOB
1 29/03/2018 BOB
2 29/03/2018 BOB
3 29/03/2018 MARY
4 29/03/2018 IAN
5 29/03/2018 ANISA
6 30/03/2018 BOB
7 30/03/2018 BOB
8 30/03/2018 BOB
9 30/03/2018 KARL
10 30/03/2018 KARL
11 30/03/2018 IAN
12 01/04/2018 NGI
13 01/04/2018 NICEE
So the output dataframe should be
ScrapeDate No.of Sellers
0 29/03/2018 4
1 30/03/2018 3
2 01/04/2018 2
Just use nunique:
df.groupby('ScrapeDate')['sellerUserName'].nunique()
Out[38]:
ScrapeDate
01/04/2018 2
29/03/2018 4
30/03/2018 3
Name: sellerUserName, dtype: int64
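Note that the result above is a Series sorted by the date strings, so 01/04/2018 comes first. To get a dataframe in chronological order with the column name from the question, one option is to group on parsed dates, as in this sketch. The day/month/year format is an assumption based on the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "ScrapeDate": ["29/03/2018"] * 6 + ["30/03/2018"] * 6 + ["01/04/2018"] * 2,
    "sellerUserName": ["BOB", "BOB", "BOB", "MARY", "IAN", "ANISA",
                       "BOB", "BOB", "BOB", "KARL", "KARL", "IAN",
                       "NGI", "NICEE"],
})

# Parse the strings so groups sort chronologically, not alphabetically.
dates = pd.to_datetime(df["ScrapeDate"], format="%d/%m/%Y")

out = (df.groupby(dates)["sellerUserName"]
         .nunique()
         .reset_index(name="No.of Sellers"))

# Convert the dates back to the original string format for display.
out["ScrapeDate"] = out["ScrapeDate"].dt.strftime("%d/%m/%Y")
print(out)
```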

Creating dataframe from another dataframe and list

I have a list of names like this:
names = ['Josh', 'Jon', 'Adam', 'Barsa', 'Fekse', 'Bravo', 'Talyo', 'Zidane']
and I have a dataframe like this:
Number Names
0 1 Josh
1 2 Jon
2 3 Adam
3 4 Barsa
4 5 Fekse
5 6 Barsa
6 7 Barsa
7 8 Talyo
8 9 Jon
9 10 Zidane
I want to create a dataframe that has all the names in the names list, with the corresponding numbers from this dataframe grouped together. For names that do not have corresponding numbers, there should be an asterisk, like below:
Names Number
Josh 1
Jon 2,9
Adam 3
Barsa 4,6,7
Fekse 5
Bravo *
Talyo 8
Zidane 10
Are there any built-in functions to get this done?
You can use GroupBy with str.join, then reindex with your names list:
res = df.groupby('Names')['Number'].apply(lambda x: ','.join(map(str, x))).to_frame()\
.reindex(names).fillna('*').reset_index()
print(res)
Names Number
0 Josh 1
1 Jon 2,9
2 Adam 3
3 Barsa 4,6,7
4 Fekse 5
5 Bravo *
6 Talyo 8
7 Zidane 10
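For reference, here is a self-contained variant of the same approach that can be run as-is. It converts the numbers to strings up front and passes fill_value to reindex instead of calling fillna afterwards; both are stylistic choices in this sketch rather than requirements.

```python
import pandas as pd

names = ["Josh", "Jon", "Adam", "Barsa", "Fekse", "Bravo", "Talyo", "Zidane"]
df = pd.DataFrame({
    "Number": range(1, 11),
    "Names": ["Josh", "Jon", "Adam", "Barsa", "Fekse",
              "Barsa", "Barsa", "Talyo", "Jon", "Zidane"],
})

res = (df["Number"].astype(str)            # str.join needs strings
         .groupby(df["Names"])
         .agg(",".join)                    # one comma-separated string per name
         .reindex(names, fill_value="*")   # names absent from df get "*"
         .rename_axis("Names")
         .reset_index())
print(res)
```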

How to correctly sort a multi-indexed pandas DataFrame

I have a multi-indexed pandas dataframe that looks like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
5 1 1.143599
2 1.151358
3 1.272172
10 1 1.765615
2 1.779330
3 1.752246
20 1 1.685807
2 1.688354
3 1.614013
..... ....
0 4 2.111466
5 1.933589
6 1.336527
5 4 2.006936
5 2.040884
6 1.430818
10 4 1.398334
5 1.594028
6 1.684037
20 4 1.529750
5 1.721385
6 1.608393
(Note that I've only posted one antibody; there are many analogous entries under the antibody index, but they all have the same format.) Although I've left out the entries in the middle for the sake of space, you can see that I have 6 experimental repeats, but they are not organized properly. My question is: how would I get the DataFrame to group all the repeats together? The output would look something like this:
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
..... ....
Thanks in advance
I think you need sort_index:
df = df.sort_index(level=[0,1,2])
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
Or you can omit the level parameter:
df = df.sort_index()
print (df)
Antibody Time Repeats
Akt 0 1 1.988053
2 1.855905
3 1.416557
4 2.111466
5 1.933589
6 1.336527
5 1 1.143599
2 1.151358
3 1.272172
4 2.006936
5 2.040884
6 1.430818
10 1 1.765615
2 1.779330
3 1.752246
4 1.398334
5 1.594028
6 1.684037
20 1 1.685807
2 1.688354
3 1.614013
4 1.529750
5 1.721385
6 1.608393
Name: col, dtype: float64
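A small runnable sketch of the same call, using a handful of values from the question inserted in a deliberately shuffled order, shows that sort_index orders the rows lexicographically by all index levels. The reduced sample and the series name col are only for illustration.

```python
import pandas as pd

# Rows arrive with repeats 1, 4 and 2 interleaved across time points.
index = pd.MultiIndex.from_tuples(
    [("Akt", 0, 1), ("Akt", 5, 1), ("Akt", 0, 4), ("Akt", 5, 4),
     ("Akt", 0, 2), ("Akt", 5, 2)],
    names=["Antibody", "Time", "Repeats"],
)
s = pd.Series([1.988053, 1.143599, 2.111466, 2.006936, 1.855905, 1.151358],
              index=index, name="col")

# With no arguments, sort_index sorts by every level from left to right.
sorted_s = s.sort_index()
print(sorted_s)

# is_monotonic_increasing confirms the index is now fully sorted.
print(sorted_s.index.is_monotonic_increasing)
```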
