Python - Sort a Pandas Dataframe twice

I would like to sort a Pandas dataframe on two columns, the same way Excel does. Given the following df:
Name Date
John 13/01
Mike 13/01
John 15/01
John 14/01
Mike 12/01
When adding the following code:
df=df.sort_values(['Date','Name'], ascending=[True, True])
I would expect the following result:
Name Date
John 13/01
John 14/01
John 15/01
Mike 12/01
Mike 13/01
I'm getting nothing close to this result with the code above. Any idea where the mistake is?
Many thanks!

You need to swap the columns, because you want to sort first by Name and then by Date. ascending=[True, True] can also be removed, because ascending is the default:
df = df.sort_values(['Name','Date'])
print (df)
Name Date
0 John 13/01
3 John 14/01
2 John 15/01
4 Mike 12/01
1 Mike 13/01
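
For reference, a minimal self-contained version (a sketch assuming pandas is installed; note the Date values are plain strings, which only sort correctly here because they share one zero-padded format; for real date ordering, convert with pd.to_datetime first):
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Mike', 'John', 'John', 'Mike'],
                   'Date': ['13/01', '13/01', '15/01', '14/01', '12/01']})

# Sort by Name first, then by Date within each Name (ascending is the default)
df = df.sort_values(['Name', 'Date'])
print(df)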

Related

Match two columns in dataframe

I have two columns in dataframe df
ID Name
AXD2 SAM S
AXD2 SAM
SCA4 JIM
SCA4 JIM JONES
ASCQ JOHN
I need the output to contain each unique ID matched with its first Name only:
ID Name
AXD2 SAM S
SCA4 JIM
ASCQ JOHN
Any suggestions?
You can use groupby with agg to take the first Name for each ID:
df.groupby(['ID']).agg(first_name=('Name', 'first')).reset_index()
Use drop_duplicates:
out = df.drop_duplicates('ID', ignore_index=True)
print(out)
# Output
ID Name
0 AXD2 SAM S
1 SCA4 JIM
2 ASCQ JOHN
You can use cumcount() to keep the first Name that appears for each ID:
df['RN'] = df.groupby(['ID']).cumcount() + 1
df = df.loc[df['RN'] == 1]
df[['ID', 'Name']]
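
As a runnable sketch of the groupby route (one detail worth noting: groupby sorts its keys alphabetically by default, so sort=False is passed here to reproduce the first-appearance order shown in the expected output):
import pandas as pd

df = pd.DataFrame({'ID': ['AXD2', 'AXD2', 'SCA4', 'SCA4', 'ASCQ'],
                   'Name': ['SAM S', 'SAM', 'JIM', 'JIM JONES', 'JOHN']})

# sort=False keeps IDs in first-appearance order; 'first' takes the first Name per ID
out = df.groupby('ID', sort=False).agg(Name=('Name', 'first')).reset_index()
print(out)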

Python/Pandas - If Column A equals X or Y, then assign value from Col C. If not, then assign Col B. How to write in Python?

I'm having trouble formulating in Pandas a statement that would be very simple in Excel. I have a dataframe sample as follows:
colA colB colC
10 0 27:15 John Doe
11 0 24:33 John Doe
12 1 29:43 John Doe
13 Inactive John Doe None
14 N/A John Doe None
Obviously the dataframe is much larger than this, with 10,000+ rows, so I'm trying to find an efficient way to do this. I want to create a column that checks whether colA is equal to 0 or 1. If so, new_col should take the value from colC; if not, from colB. In Excel, I would simply create a new column (new_col) and write
=IF(OR(A2=0,A2=1),C2,B2)
And then drag fill the entire sheet.
I'm sure this is fairly simple but I cannot for the life of me figure this out.
Result should look like this
colA colB colC new_col
10 0 27:15 John Doe John Doe
11 0 24:33 John Doe John Doe
12 1 29:43 John Doe John Doe
13 Inactive John Doe None John Doe
14 N/A John Doe None John Doe
np.where should do the trick.
df['new_col'] = np.where(df['colA'].isin([0, 1]), df['colC'], df['colB'])
Here is a solution that appends your results to a list according to your conditions, then adds the list back into the dataframe as a new colD column.
your_results = []
for i, data in enumerate(df["colA"]):
    if data == 0 or data == 1:
        your_results.append(df["colC"][i])
    else:
        your_results.append(df["colB"][i])
df["colD"] = your_results

Check if pandas column contains text in another dataframe and replace values

I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545', since these names appear in my df2. I tried using str.contains, but with it I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (see the pandas documentation for the loc method) together with the Series.str.contains method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
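
A reproducible version of that loop; one caveat to hedge: str.contains does substring (regex) matching, so a short name like 'jo' would match several usernames, and names containing regex metacharacters should be escaped, e.g. with re.escape:
import re
import pandas as pd

df1 = pd.DataFrame({'userName': ['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName': ['alice', 'peter', 'john', 'francis', 'joe', 'carol']})

for real_name in df2['realName']:
    # re.escape guards against regex metacharacters inside a real name
    mask = df1['userName'].str.contains(re.escape(real_name))
    df1.loc[mask, 'userName'] = real_name

print(df1)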

Fill dataframe nan values from a join

I am trying to map owners to an IP address through the use of two tables, df1 and df2. df1 contains the IP list to be mapped and df2 contains an IP, an alias, and the owner. After running a join on the IP column, I get a half-joined dataframe. Most of the remaining data can be joined by replacing the NaN values via a join on the Alias column, but I can't figure out how to do it.
My initial thoughts were to try nesting pd.merge inside fillna(), but it won't accept a dataframe. Any help would be greatly appreciated.
df1 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.0.102', '192.18.0.103', '192.18.0.104']})
df2 = pd.DataFrame({'IP': ['192.18.0.100', '192.18.0.101', '192.18.1.206', '192.18.1.218', '192.18.1.118'],
                    'Alias': ['192.18.1.214', '192.18.1.243', '192.18.0.102', '192.18.0.103', '192.18.1.180'],
                    'Owner': ['Smith, Jim', 'Bates, Andrew', 'Kline, Jenny', 'Hale, Fred', 'Harris, Robert']})
new_df = pd.DataFrame(pd.merge(df1, df2[['IP', 'Owner']], on='IP', how= 'left'))
Expected output is:
IP Owner
192.18.0.100 Smith, Jim
192.18.0.101 Bates, Andrew
192.18.0.102 Kline, Jenny
192.18.0.103 Hale, Fred
192.18.0.104 nan
No need to merge; just pull the data where the condition is satisfied. This is faster than a merge and less complicated. (Note that the comparison below is row by row, so it relies on the two frames lining up positionally, as they happen to in this example.)
condition = (df1['IP'] == df2['IP']) | (df1['IP'] == df2['Alias'])
df1['Owner'] = np.where(condition, df2['Owner'], np.nan)
print(df1)
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
Try this one:
lookup = pd.concat([df2[['IP', 'Owner']],
                    df2[['Alias', 'Owner']].rename(columns={'Alias': 'IP'})]).drop_duplicates()
new_df = pd.merge(df1, lookup, on='IP', how='left')
The result:
>>> new_df
IP Owner
0 192.18.0.100 Smith, Jim
1 192.18.0.101 Bates, Andrew
2 192.18.0.102 Kline, Jenny
3 192.18.0.103 Hale, Fred
4 192.18.0.104 NaN
Let's melt then use map:
df1['IP'].map(df2.melt('Owner').set_index('value')['Owner'])
Output:
0 Smith, Jim
1 Bates, Andrew
2 Kline, Jenny
3 Hale, Fred
4 NaN
Name: IP, dtype: object
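
To write the mapped owners back onto df1, continuing from the df1/df2 defined in the question (a small usage sketch; melt stacks the IP and Alias columns into a single 'value' column, so one lookup Series covers both):
# Build one Series mapping every IP or Alias to its Owner, then map df1's IPs through it
lookup = df2.melt('Owner').set_index('value')['Owner']
df1['Owner'] = df1['IP'].map(lookup)
print(df1)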

Pivot pandas dataframe with dates and showing counts per date

I have the following pandas DataFrame: (currently ~500 rows):
merged_verified =
Last Verified Verified by
0 2016-07-11 John Doe
1 2016-07-11 John Doe
2 2016-07-12 John Doe
3 2016-07-11 Mary Smith
4 2016-07-12 Mary Smith
I am attempting to pivot_table() it to receive the following:
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Currently I'm running
merged_verified = merged_verified.pivot_table(index=['Verified by'], values=['Last Verified'], aggfunc='count')
which gives me close to what I need, but not exactly:
Last Verified
Verified by
John Doe 3
Mary Smith 2
I've tried a variety of things with the parameters, but none of it worked. The result above is the closest I've come to what I need. I read somewhere that I would need to add an additional column of dummy values (1's) that I could then add up, but that seems counter-intuitive for what I believe to be a simple DataFrame layout.
You can add the columns parameter and aggregate by len:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
                                              columns=['Last Verified'],
                                              values=['Last Verified'],
                                              aggfunc=len)
print (merged_verified)
Last Verified  2016-07-11  2016-07-12
Verified by
John Doe                2           1
Mary Smith              1           1
Or you can also omit values:
merged_verified = merged_verified.pivot_table(index=['Verified by'],
                                              columns=['Last Verified'],
                                              aggfunc=len)
print (merged_verified)
Last Verified 2016-07-11 2016-07-12
Verified by
John Doe 2 1
Mary Smith 1 1
Use groupby, value_counts, and unstack:
merged_verified.groupby('Last Verified')['Verified by'].value_counts().unstack(0)
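
A runnable sketch of that groupby route, using size in place of value_counts (they count the same pairs), with the sample frame from the question:
import pandas as pd

merged_verified = pd.DataFrame(
    {'Last Verified': ['2016-07-11', '2016-07-11', '2016-07-12',
                       '2016-07-11', '2016-07-12'],
     'Verified by': ['John Doe', 'John Doe', 'John Doe',
                     'Mary Smith', 'Mary Smith']})

# Count each (Verified by, Last Verified) pair, then pivot dates into columns
counts = (merged_verified.groupby(['Verified by', 'Last Verified'])
          .size()
          .unstack(fill_value=0))
print(counts)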
Timing
Example dataframe: a large dataframe with 1 million rows (100 dates x 10,000 generated names):
import numpy as np
from string import ascii_lowercase

letters = list(ascii_lowercase)  # assumed; 'letters' was undefined in the original

idx = pd.MultiIndex.from_product(
    [
        pd.date_range('2016-03-01', periods=100),
        pd.DataFrame(np.random.choice(letters, (10000, 10))).sum(1)
    ], names=['Last Verified', 'Verified by'])
merged_verified = idx.to_series().reset_index()[idx.names]
