Pandas dataframe group by list - python

I have the following dataframe with names of people and their abbreviations. The aim is to perform name disambiguation:
Names Abb
0 Michaele Frendu [Mic, Fre]
1 Lucam Zamit [Luc, Zam]
2 magistro Johanne Luckys [Joh, Luc]
3 Albano Fava [Alb, Fav]
4 Augustino Bagliu [Aug, Bag]
5 Lucas Zamit [Luc, Zam]
6 Jngabellavit [Jng]
7 Micheli Frendu [Mic, Fre]
8 Luce [Luc]
9 Far [Far]
Can I group by the list column, i.e. group rows 0 and 7, and rows 1 and 5? Later on I was going to do something similar with just the first names.

If you want to group by a list column, you need to convert it to tuples first, because lists aren't hashable:
def func(x):
    print(x)
    # some code
    return x

df1 = df.groupby(df['Abb'].apply(tuple)).apply(func)
Names Abb
3 Albano Fava [Alb, Fav]
Names Abb
3 Albano Fava [Alb, Fav]
Names Abb
4 Augustino Bagliu [Aug, Bag]
Names Abb
9 Far [Far]
Names Abb
6 Jngabellavit [Jng]
Names Abb
2 magistro Johanne Luckys [Joh, Luc]
Names Abb
8 Luce [Luc]
Names Abb
1 Lucam Zamit [Luc, Zam]
5 Lucas Zamit [Luc, Zam]
Names Abb
0 Michaele Frendu [Mic, Fre]
7 Micheli Frendu [Mic, Fre]

Or use map:
df.groupby(df['Abb'].map(tuple)).do_something
The conversion is needed because lists aren't hashable objects.
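For the later step of grouping on just the first names, the tuple conversion isn't needed, since a scalar key is already hashable. A minimal sketch, assuming the first whitespace-separated token of Names is the first name:
# group on the first word of Names (assumed to be the first name)
df.groupby(df['Names'].str.split().str[0]).apply(func)
# or group on the first element of each Abb list
df.groupby(df['Abb'].str[0]).apply(func)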


filter rows from data where column salary has string datatype

id name salary
0 1 shyam 10000
1 2 ram 20000
2 3 ravi abc
3 4 abhay 30000
4 5 karan fgh
expected:
id name salary
2 3 ravi abc
4 5 karan fgh
We can use str.contains as follows:
df_out = df[(df["name"].str.contains(r'^[A-Za-z]+$', regex=True)) &
(df["salary"].str.contains(r'^[A-Za-z]+$', regex=True))]
The above logic will only match rows for which both the name and salary columns contain only alpha characters.
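An alternative sketch, not from the answer above: since the real test is whether salary parses as a number, you can coerce with pd.to_numeric and keep the rows that fail to parse:
import pandas as pd

# non-numeric salaries become NaN under errors='coerce', so isna() flags the string rows
df_out = df[pd.to_numeric(df["salary"], errors="coerce").isna()]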

Retrieve the numbers from the file corresponding to the given regions specified in the file

Below is my dataframe:
Sno Name Region Num
0 1 Rubin Indore 79744001550
1 2 Rahul Delhi 89824304549
2 3 Rohit Noida 91611611478
3 4 Chirag Delhi 85879761557
4 5 Shan Bharat 95604535786
5 6 Jordi Russia 80777784005
6 7 El Russia 70008700104
7 8 Nino Spain 87707101233
8 9 Mark USA 98271377772
9 10 Pattinson Hawk Eye 87888888889
I need to retrieve the numbers from the given CSV file and store them region-wise. My current approach:
delhi_list = []
for i in range(len(data)):
    if data.loc[i]['Region'] == 'Delhi':
        delhi_list.append(data.loc[i]['Num'])
I am getting the right results, but I want to collect the data into a Python dictionary. Can I do that?
IIUC, you can use groupby, apply the list aggregation then use to_dict:
data.groupby('Region')['Num'].apply(list).to_dict()
[out]
{'Bharat': [95604535786],
'Delhi': [89824304549, 85879761557],
'Hawk Eye': [87888888889],
'Indore': [79744001550],
'Noida': [91611611478],
'Russia': [80777784005, 70008700104],
'Spain': [87707101233],
'USA': [98271377772]}
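The resulting dict can then be indexed by region directly, for example:
region_nums = data.groupby('Region')['Num'].apply(list).to_dict()
print(region_nums['Delhi'])
# [89824304549, 85879761557]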

how to slice between two elements in a pandas series

I have a Series containing names with their nationalities in parentheses.
I want a column that contains just each individual's nationality, without the parentheses, keeping the same index.
0 LOMBARDI Domingo (URU)
1 MACIAS Jose (ARG)
2 TEJADA Anibal (URU)
3 WARNKEN Alberto (CHI)
4 REGO Gilberto (BRA)
5 CRISTOPHE Henry (BEL)
6 MATEUCCI Francisco (URU)
7 MACIAS Jose (ARG)
8 LANGENUS Jean (BEL)
9 TEJADA Anibal (URU)
10 SAUCEDO Ulises (BOL)
I have tried applying .split(' ')[2] to the Series,
but got "'Series' object has no attribute 'split'".
You need to use the str accessor on the Series:
df.name.str.split('(').str[1].str[:-1]
Output:
0 URU
1 ARG
2 URU
3 CHI
4 BRA
5 BEL
6 URU
7 ARG
8 BEL
9 URU
10 BOL
Name: name, dtype: object
Using extract:
s.str.extract(r'.*\((.*)\).*', expand=True)[0]
Output:
0 URU
1 ARG
2 URU
3 CHI
Name: 0, dtype: object
Using slice. This may not be optimal, since it assumes the right-hand end of the string has a constant width, but it's another possible solution:
df.name.str.slice(start=-4).str[:-1]
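Since the question asks to keep the same index, note that all of these return a Series aligned to the original index, so the result can be assigned straight back. A minimal sketch using extract with expand=False, which returns a Series rather than a DataFrame:
# keep only the text between the parentheses; the index is unchanged
nationality = df['name'].str.extract(r'\((\w+)\)', expand=False)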

python pandas merge two or more lines of text into one line

I have a data frame with text data like below,
   name   address          number
1  Bob    bob              No.56
2         #gmail.com
3  Carly  carly#world.com  No.90
4  Gorge  greg#yahoo
5         .com
6                          No.100
and want to make it like this frame:
   name   address          number
1  Bob    bob#gmail.com    No.56
2  Carly  carly#world.com  No.90
3  Gorge  greg#yahoo.com   No.100
I am using pandas to read the file, but I am not sure how to use merge or concat.
Assuming the name column consists of unique values:
print(df)
name address number
0 Bob bob No.56
1 NaN #gmail.com NaN
2 Carly carly#world.com No.90
3 Gorge greg#yahoo NaN
4 NaN .com NaN
5 NaN NaN No.100
df['name'] = df['name'].ffill()
print(df.fillna('').groupby(['name'], as_index=False).sum())
name address number
0 Bob bob#gmail.com No.56
1 Carly carly#world.com No.90
2 Gorge greg#yahoo.com No.100
You may need ffill(), bfill(), [::-1], .groupby('name').apply(lambda x: ' '.join(x['address'])), strip(), lstrip(), rstrip(), replace() and the like to extend the code above to more complicated data.
If you want to convert a data frame of six rows (with possible NaN entries in each column), there is no single direct pandas method for that.
You will need some code to assign values in the name column, so that pandas knows the split rows bob and #gmail.com belong to the same user Bob.
You can fill each empty entry in the name column with its preceding user via the fillna or ffill methods; see pandas dataframe missing data.
df['name'] = df['name'].ffill()
# gives
name address number
0 Bob bob No.56
1 Bob #gmail.com
2 Carly carly#world.com No.90
3 Gorge greg#yahoo
4 Gorge .com
5 Gorge No.100
Then you can use groupby with sum as the aggregation function, filling the NaNs with empty strings first so the string concatenation works:
df.fillna('').groupby(['name']).sum().reset_index()
# gives
name address number
0 Bob bob#gmail.com No.56
1 Carly carly#world.com No.90
2 Gorge greg#yahoo.com No.100
You may find converting between NaN and white space useful, see Replacing blank values (white space) with NaN in pandas and pandas.DataFrame.fillna.
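On newer pandas versions, where sum() over object columns can behave differently, a hedged equivalent is to aggregate with an explicit string join after filling the NaNs:
# forward-fill names, then concatenate each group's string pieces explicitly
df['name'] = df['name'].ffill()
out = df.fillna('').groupby('name', as_index=False).agg(''.join)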

get the distinct column values and union the dataframes

I am trying to convert this SQL statement:
SELECT distinct table1.[Name],table1.[Phno]
FROM table1
union
select distinct table2.[Name],table2.[Phno] from table2
UNION
select distinct table3.[Name],table3.[Phno] from table3;
Now I have 3 dataframes: table1, table2 and table3.
table1
Name Phno
0 Andrew 6175083617
1 Andrew 6175083617
2 Frank 7825942358
3 Jerry 3549856785
4 Liu 9659875695
table2
Name Phno
0 Sandy 7859864125
1 Nikhil 9526412563
2 Sandy 7859864125
3 Tina 7459681245
4 Surat 9637458725
table3
Name Phno
0 Patel 9128257489
1 Mary 3679871478
2 Sandra 9871359654
3 Mary 3679871478
4 Hali 9835167465
Now I need to get the distinct values of these dataframes and union them, so that the output is:
sample output
Name Phno
0 Andrew 6175083617
1 Frank 7825942358
2 Jerry 3549856785
3 Liu 9659875695
4 Sandy 7859864125
5 Nikhil 9526412563
6 Tina 7459681245
7 Surat 9637458725
8 Patel 9128257489
9 Mary 3679871478
10 Sandra 9871359654
11 Hali 9835167465
I tried to get the unique values for one dataframe table1 as shown below:
table1_unique = pd.unique(table1.values.ravel()) #which gives me
table1_unique
array(['Andrew', 6175083617L, 'Frank', 7825942358L, 'Jerry', 3549856785L,
'Liu', 9659875695L], dtype=object)
But I get them as a flat array. I even tried converting it to a dataframe using:
table1_unique1 = pd.DataFrame(table1_unique)
table1_unique1
0
0 Andrew
1 6175083617
2 Frank
3 7825942358
4 Jerry
5 3549856785
6 Liu
7 9659875695
How do I get the unique rows of each dataframe, so that I can concat them as in my sample output? Hope this is clear. Thanks!!
a = table1[['Name', 'Phno']].drop_duplicates()
b = table2[['Name', 'Phno']].drop_duplicates()
c = table3[['Name', 'Phno']].drop_duplicates()
result = pd.concat([a, b, c]).drop_duplicates().reset_index(drop=True)
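The final drop_duplicates mirrors SQL's UNION, which de-duplicates across all the inputs rather than within each table only. The same thing can be written as a single chain:
result = (pd.concat([table1, table2, table3])[['Name', 'Phno']]
          .drop_duplicates()
          .reset_index(drop=True))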
