I am trying to convert this SQL statement to pandas:
SELECT DISTINCT table1.[Name], table1.[Phno] FROM table1
UNION
SELECT DISTINCT table2.[Name], table2.[Phno] FROM table2
UNION
SELECT DISTINCT table3.[Name], table3.[Phno] FROM table3;
Now I have three dataframes: table1, table2, and table3.
table1
Name Phno
0 Andrew 6175083617
1 Andrew 6175083617
2 Frank 7825942358
3 Jerry 3549856785
4 Liu 9659875695
table2
Name Phno
0 Sandy 7859864125
1 Nikhil 9526412563
2 Sandy 7859864125
3 Tina 7459681245
4 Surat 9637458725
table3
Name Phno
0 Patel 9128257489
1 Mary 3679871478
2 Sandra 9871359654
3 Mary 3679871478
4 Hali 9835167465
Now I need to get the distinct rows of each dataframe, union them, and get this output:
sample output
Name Phno
0 Andrew 6175083617
1 Frank 7825942358
2 Jerry 3549856785
3 Liu 9659875695
4 Sandy 7859864125
5 Nikhil 9526412563
6 Tina 7459681245
7 Surat 9637458725
8 Patel 9128257489
9 Mary 3679871478
10 Sandra 9871359654
11 Hali 9835167465
I tried to get the unique values for one dataframe table1 as shown below:
table1_unique = pd.unique(table1.values.ravel()) #which gives me
table1_unique
array(['Andrew', 6175083617L, 'Frank', 7825942358L, 'Jerry', 3549856785L,
'Liu', 9659875695L], dtype=object)
But I get them as a flat array, because ravel() mixes the names and numbers into one sequence. I even tried converting the array to a dataframe using:
table1_unique1 = pd.DataFrame(table1_unique)
table1_unique1
0
0 Andrew
1 6175083617
2 Frank
3 7825942358
4 Jerry
5 3549856785
6 Liu
7 9659875695
How do I get the unique rows of each dataframe, so that I can concat them as per my sample output? Hope this is clear. Thanks!!
Select the two columns, drop the duplicate rows within each table, then concatenate:
a = table1[['Name', 'Phno']].drop_duplicates()
b = table2[['Name', 'Phno']].drop_duplicates()
c = table3[['Name', 'Phno']].drop_duplicates()
result = pd.concat([a, b, c])
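SQL UNION also removes duplicates across the three tables, and the sample output has a fresh 0-based index, so a fuller sketch (assuming the three dataframes shown above) would de-duplicate once more after the concat:
import pandas as pd

# drop_duplicates() after the concat mirrors UNION's cross-table de-duplication;
# reset_index(drop=True) gives the clean 0..11 index of the sample output
result = (pd.concat([table1, table2, table3])
            .drop_duplicates()
            .reset_index(drop=True))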
Related: given the dataframe below, keep only the rows whose salary is alphabetic rather than numeric.
id name salary
0 1 shyam 10000
1 2 ram 20000
2 3 ravi abc
3 4 abhay 30000
4 5 karan fgh
expected:
id name salary
2 3 ravi abc
4 5 karan fgh
We can use str.contains as follows:
df_out = df[(df["name"].str.contains(r'^[A-Za-z]+$', regex=True)) &
(df["salary"].str.contains(r'^[A-Za-z]+$', regex=True))]
The above logic only matches rows for which both the name and salary columns consist entirely of alphabetic characters: the anchored pattern ^[A-Za-z]+$ requires the whole value to be letters.
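An alternative sketch, assuming the salary column is stored as strings: coerce it to numeric and keep the rows where the conversion fails.
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'name': ['shyam', 'ram', 'ravi', 'abhay', 'karan'],
                   'salary': ['10000', '20000', 'abc', '30000', 'fgh']})

# errors='coerce' turns non-numeric salaries into NaN,
# so the NaN rows are exactly the alphabetic salaries
df_out = df[pd.to_numeric(df['salary'], errors='coerce').isna()]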
DOB Name
0 1956-10-30 Anna
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry
6 1972-05-04 Kate
In a dataframe like the one above, I have duplicate names. I want to add the suffix '_0' to a name if its DOB is before 1990 and the name is duplicated.
I am expecting a result like this
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I am using the following
df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0')
But I am getting this result
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 NaN
2 2001-09-09 NaN
3 1993-01-15 NaN
4 1999-05-02 NaN
5 1962-12-17 Jerry_0
6 1972-05-04 NaN
How can I add a suffix to a Name that is duplicated and whose DOB is before 1990?
The problem in your df['Name'] = df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))].Name.apply(lambda x: x+'_0') is that df[(df['DOB'] < '01-01-1990') & (df['Name'].isin(['Anna','Jerry']))] is a filtered dataframe with fewer rows than the original. When you assign it back, the rows that were filtered out have no corresponding value in the filtered dataframe, so they become NaN.
You can try mask instead (note the date literal below is written as '1990-01-01' so the string comparison matches the YYYY-MM-DD format of the DOB column):
m = (df['DOB'] < '1990-01-01') & df['Name'].duplicated(keep=False)
df['Name'] = df['Name'].mask(m, df['Name']+'_0')
You can use masks and boolean indexing:
# is the year before 1990?
m1 = pd.to_datetime(df['DOB']).dt.year.lt(1990)
# is the name duplicated?
m2 = df['Name'].duplicated(keep=False)
# if both conditions are True, add '_0' to the name
df.loc[m1&m2, 'Name'] += '_0'
output:
DOB Name
0 1956-10-30 Anna_0
1 1993-03-21 Jerry
2 2001-09-09 Peter
3 1993-01-15 Anna
4 1999-05-02 James
5 1962-12-17 Jerry_0
6 1972-05-04 Kate
I have a pandas dataframe that includes a "Name" column. Strings in the Name column may contain "Joe", "Bob", or "Joe Bob". I want to add a column for the type of person: just Joe, just Bob, or Both.
I was able to do this by creating boolean columns, turning them into strings, combining the strings, and then replacing the values. It just...didn't feel very elegant! I am new to Python...is there a better way to do this?
My original dataframe:
df = pd.DataFrame(data= [['Joe Biden'],['Bobby Kennedy'],['Joe Bob Briggs']], columns = ['Name'])
             Name
0       Joe Biden
1   Bobby Kennedy
2  Joe Bob Briggs
I added two boolean columns to find names:
df['Joe'] = df.Name.str.contains('Joe')
df['Joe'] = df.Joe.astype('int')
df['Bob'] = df.Name.str.contains('Bob')
df['Bob'] = df.Bob.astype('int')
Now my dataframe looks like this:
df = pd.DataFrame(data= [['Joe Biden',1,0],['Bobby Kennedy',0,1],['Joe Bob Briggs',1,1]], columns = ['Name','Joe', 'Bob'])
             Name  Joe  Bob
0       Joe Biden    1    0
1   Bobby Kennedy    0    1
2  Joe Bob Briggs    1    1
But what I really want is one "Type" column with categorical values: Joe, Bob, or Both.
To do that, I added a column to combine the booleans, then I replaced the values:
df["Type"] = df["Joe"].astype(str) + df["Bob"].astype(str)
             Name  Joe  Bob Type
0       Joe Biden    1    0   10
1   Bobby Kennedy    0    1    1
2  Joe Bob Briggs    1    1   11
df['Type'] = df.Type.astype('str')
df['Type'].replace({'11': 'Both', '10': 'Joe', '1': 'Bob'}, inplace=True)
             Name  Joe  Bob  Type
0       Joe Biden    1    0   Joe
1   Bobby Kennedy    0    1   Bob
2  Joe Bob Briggs    1    1  Both
This feels clunky. Anyone have a better way?
Thanks!
You can use np.select to create the Type column.
You need to order your condlist correctly, from the most specific condition to the broadest, because np.select returns the choice for the first condition that matches.
import numpy as np

df['Type'] = np.select([df['Name'].str.contains('Joe') & df['Name'].str.contains('Bob'),
                        df['Name'].str.contains('Joe'),
                        df['Name'].str.contains('Bob')],
                       choicelist=['Both', 'Joe', 'Bob'])
Output:
>>> df
Name Type
0 Joe Biden Joe
1 Bobby Kennedy Bob
2 Joe Bob Briggs Both
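One caveat: if a name contains neither 'Joe' nor 'Bob', np.select falls back to its default value of 0. A small sketch passing an explicit default instead (the 'Neither' label is just an illustration):
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[['Joe Biden'], ['Bobby Kennedy'], ['Joe Bob Briggs']],
                  columns=['Name'])

has_joe = df['Name'].str.contains('Joe')
has_bob = df['Name'].str.contains('Bob')

# conditions run from most specific (both names) to broadest;
# rows matching none of them get default= instead of np.select's 0
df['Type'] = np.select([has_joe & has_bob, has_joe, has_bob],
                       ['Both', 'Joe', 'Bob'],
                       default='Neither')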
I am new to Python and trying to move some of my work from Excel to Python, and wanted an Excel SUMIFS equivalent in pandas, for example something like:
SUMIFS(F:F, D:D, "<="&C2, B:B, B2, F:F, ">"&0)
In my case, I have 6 columns: a unique trade ID, an issuer, a trade date, a release date, a trader, and a quantity. I want to add a column which shows the sum of the quantity available for release at each row, something like the below:
A B C D E F G
ID Issuer TradeDate ReleaseDate Trader Quantity SumOfAvailableRelease
1 Horse 1/1/2012 13/3/2012 Amy 7 0
2 Horse 2/2/2012 15/5/2012 Dave 2 0
3 Horse 14/3/2012 NaN Dave -3 7
4 Horse 16/5/2012 NaN John -4 9
5 Horse 20/5/2012 10/6/2012 John 2 9
6 Fish 6/6/2013 20/6/2013 John 11 0
7 Fish 25/6/2013 9/9/2013 Amy 4 11
8 Fish 8/8/2013 15/9/2013 Dave 5 11
9 Fish 25/9/2013 NaN Amy -3 20
Usually, in Excel, I just drag the SUMIFS formula down the whole column and it works; I am not sure how to do that in Python.
Many thanks!
What you could do is a df.where. For example you could say
Qdf = df.where(df["Quantity"] >= 5)
and then do your sum. I don't know exactly what you want since I have zero knowledge of Excel, but I hope this helps.
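For a closer SUMIFS equivalent, here is a minimal sketch, assuming the dates are parsed as datetimes (day first) and the column names above: for each row, it sums the positive quantities of the same issuer whose release date falls on or before that row's trade date.
import pandas as pd

df = pd.DataFrame({
    'Issuer': ['Horse'] * 5 + ['Fish'] * 4,
    'TradeDate': pd.to_datetime(
        ['1/1/2012', '2/2/2012', '14/3/2012', '16/5/2012', '20/5/2012',
         '6/6/2013', '25/6/2013', '8/8/2013', '25/9/2013'], dayfirst=True),
    'ReleaseDate': pd.to_datetime(
        ['13/3/2012', '15/5/2012', None, None, '10/6/2012',
         '20/6/2013', '9/9/2013', '15/9/2013', None], dayfirst=True),
    'Quantity': [7, 2, -3, -4, 2, 11, 4, 5, -3],
})

def sum_available(row):
    # SUMIFS conditions: same Issuer, ReleaseDate <= this row's TradeDate,
    # Quantity > 0; NaT release dates fail the comparison and drop out
    m = ((df['Issuer'] == row['Issuer'])
         & (df['ReleaseDate'] <= row['TradeDate'])
         & (df['Quantity'] > 0))
    return df.loc[m, 'Quantity'].sum()

# one SUMIFS per row, like dragging the formula down the column
df['SumOfAvailableRelease'] = df.apply(sum_available, axis=1)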
I want to compare the name column in two dataframes df1 and df2, output the matching rows from dataframe df1, and store the result in a new dataframe df3. How do I do this in pandas?
df1
place name qty unit
NY Tom 2 10
TK Ron 3 15
Lon Don 5 90
Hk Sam 4 49
df2
place name price
PH Tom 7
TK Ron 5
Result:
df3
place name qty unit
NY Tom 2 10
TK Ron 3 15
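For reference, a sketch constructing the two example frames used in the options below:
import pandas as pd

df1 = pd.DataFrame({'place': ['NY', 'TK', 'Lon', 'Hk'],
                    'name': ['Tom', 'Ron', 'Don', 'Sam'],
                    'qty': [2, 3, 5, 4],
                    'unit': [10, 15, 90, 49]})
df2 = pd.DataFrame({'place': ['PH', 'TK'],
                    'name': ['Tom', 'Ron'],
                    'price': [7, 5]})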
Option 1
Using df.isin:
In [1362]: df1[df1.name.isin(df2.name)]
Out[1362]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 2
Performing an inner-join with df.merge:
In [1365]: df1.merge(df2.name.to_frame())
Out[1365]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
Option 3
Using df.eq (note this works here only because the matching names happen to share the same index positions in both frames; eq compares element-wise by index, so Options 1 and 2 are more robust):
In [1374]: df1[df1.name.eq(df2.name)]
Out[1374]:
place name qty unit
0 NY Tom 2 10
1 TK Ron 3 15
You want something called an inner join.
df1.merge(df2, on='name')
  place_x name  qty  unit place_y  price
0      NY  Tom    2    10      PH      7
1      TK  Ron    3    15      TK      5
The _x and _y suffixes appear when the same column name exists in both dataframes being merged.