Checking unique values for a variable in a different column - Python

I currently have a dataframe which looks like this:
Owner Vehicle_Color
0 James Red
1 Peter Green
2 James Blue
3 Sally Blue
4 Steven Red
5 James Blue
6 James Red
7 Peter Blue
And I am trying to verify whether each Owner has one or multiple vehicle colors assigned to them. Keeping in mind that my dataframe has more than a million owner entries (which can be duplicates), what would be the best solution?

One way may be to use groupby and nunique:
df.groupby('Owner')['Vehicle_Color'].nunique()
Results:
Owner
James 2
Peter 2
Sally 1
Steven 1
Name: Vehicle_Color, dtype: int64
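To go from those counts to just the owners who have more than one color, the resulting Series can be filtered; a minimal sketch building on the groupby above:
counts = df.groupby('Owner')['Vehicle_Color'].nunique()
multi_color_owners = counts[counts > 1]  # owners with 2+ distinct colors
print(multi_color_owners)
# Owner
# James    2
# Peter    2
# Name: Vehicle_Color, dtype: int64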

Related

Pandas Number of Unique Values from 2 Fields

I am trying to find the number of unique values across 2 fields combined; a typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique values for each column separately, in this case Last and First, not the composite.
df[['Last Name','First Name']].nunique()
Thanks!
Groupby both columns first, and then count the groups with ngroups:
>>> df.groupby(['First Name', 'Last Name']).ngroups
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the length of that frame will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
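An equivalent way to get the same number, if value_counts feels indirect, is to drop duplicate pairs and count what is left; a minimal sketch on the same frame:
len(df[['Last Name','First Name']].drop_duplicates())
# 5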

How to append a new row in a dataframe by searching for an existing column value without iterating?

I'm trying to find the best way to create two new rows in place of one existing row whenever a certain value is contained in a column.
Example Dataframe
Index  Person  Drink_Order
1      Sam     Jack and Coke
2      John    Coke
3      Steve   Dr. Pepper
I'd like to search the DataFrame for Jack and Coke, remove it and add 2 new records as Jack and Coke are 2 different drink sources.
Index  Person  Drink_Order
2      John    Coke
3      Steve   Dr. Pepper
4      Sam     Jack Daniels
5      Sam     Coke
Example code that I want to replace, since my understanding is that you should never modify the rows you are iterating over:
for index, row in df.loc[df['Drink_Order'].str.contains('Jack and Coke')].iterrows():
    df.loc[len(df)] = [row['Person'], 'Jack Daniels']
    df.loc[len(df)] = [row['Person'], 'Coke']
df = df[df['Drink_Order'] != 'Jack and Coke']
Split on ' and '. That will result in a list. Explode the list so each element appears as an individual row. Then conditionally rename Jack to Jack Daniels:
import numpy as np

df = df.assign(Drink_Order=df['Drink_Order'].str.split(' and ')).explode('Drink_Order')
df['Drink_Order'] = np.where(df['Drink_Order'].str.contains('Jack'), 'Jack Daniels', df['Drink_Order'])
Index Person Drink_Order
0 1 Sam Jack Daniels
0 1 Sam Coke
1 2 John Coke
2 3 Steve Dr. Pepper
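Note that explode keeps the original index, which is why both Sam rows above still show index 0; if a clean sequential index is wanted afterwards, one extra call takes care of it:
df = df.reset_index(drop=True)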

Groupby for 3 columns

I would like to use the groupby function based on 3 columns. The first column has the surname of each family, the second column has the name of each individual in those families, and the third column has the animal each individual owns. I want to find, for each person (name and surname) who has a cat or dog, how many cats or dogs that individual has.
My data looks like
Family SubFamily Animal
Smith Karen Cat
Smith Karen Cow
Smith Karen Dog
Jackson Jason Dog
I tried
merged_family.groupby(["Family","Animal","SubFamily"]).size().loc[:,'Cat'].loc[:,'Dog']
The result might be
Family SubFamily Cat Dog
Smith Karen 1 1
or something similar
It did not work. Could you help me?
I think this is a better task for pivot_table:
(df_merged.query("Animal.isin(['Cat', 'Dog'])")
    .pivot_table(columns='Animal', index=['Family', 'SubFamily'], aggfunc='size')
    .fillna(0)
    .reset_index()
    .rename_axis(None, axis=1))
# Family SubFamily Cat Dog
# 0 Jackson Jason 0.0 1.0
# 1 Smith Karen 1.0 1.0
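Since fillna leaves the counts as floats (the 0.0 and 1.0 above), an astype call can be slotted into the same chain if integer counts are preferred; a small variation on the chain above:
(df_merged.query("Animal.isin(['Cat', 'Dog'])")
    .pivot_table(columns='Animal', index=['Family', 'SubFamily'], aggfunc='size')
    .fillna(0)
    .astype(int)  # counts as integers instead of 0.0 / 1.0
    .reset_index()
    .rename_axis(None, axis=1))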

Find events recurring every week

I am trying to find the keys that are recurring at a weekly cadence in a set of events, similar to the following:
_index _time key
0 2018-12-01T23:59:56.000+0000 mike
1 2018-12-04T23:59:36.000+0000 mike
2 2018-12-13T23:59:05.000+0000 mike
3 2018-12-20T23:57:45.000+0000 mike
4 2018-12-31T23:57:21.000+0000 jerry
5 2018-12-31T23:57:15.000+0000 david
6 2018-12-31T23:55:13.000+0000 tom
7 2018-12-31T23:54:28.000+0000 mike
8 2018-12-31T23:54:21.000+0000 john
I have tried creating groups by date, using the following:
df = [g for n, g in df.groupby(pd.Grouper(key='_time',freq='W'), as_index=False)]
but have been unable to find the intersection of the various groups using: set.intersection(), reduce & pd.merge, and df.join
One approach: group by key, then check whether each name shows up in every week:
# label each event with its year-week, then count distinct weeks per key
s = df['_time'].dt.strftime('%Y-%W').groupby(df['key']).nunique()
nweek = df['_time'].dt.strftime('%Y-%W').nunique()
s[s == nweek]
key
mike    5
Name: _time, dtype: int64
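If _time is still a plain string column (as it often is straight out of a log export), it has to be parsed before the .dt accessor will work; a one-line preparation step, assuming the frame above:
df['_time'] = pd.to_datetime(df['_time'])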

How to combine two dataframes and have unique key column using Pandas?

I have two dataframes with the same columns that I need to combine:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
and
first_name last_name
0 Billy Bonder
1 Brian Black
2 Bran Balwner
When I do this:
df_new = pd.concat([df1, df2])
I get this:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
0 Billy Bonder
1 Brian Black
2 Bran Balwner
Is there a way to have the left column have a unique number like this?
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
3 Billy Bonder
4 Brian Black
5 Bran Balwner
If not, how can I add a new key column with numbers from 1 to whatever the row count is?
As said earlier by @MaxU, you can use ignore_index=True.
Passing ignore_index=True after the [dataframe1, dataframe2] list tells concat to discard the original indexes and build a fresh one running from 0 to the combined row count minus 1.
You can check that indexes are not repeated with the parameter verify_integrity=True; it raises a ValueError if they are (you never know when you'll have to check).
But be careful, because that check can be a little slow depending on the size of your DataFrames.
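A minimal sketch of what that looks like with the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'first_name': ['Alex', 'Amy', 'Allen'],
                    'last_name': ['Anderson', 'Ackerman', 'Ali']})
df2 = pd.DataFrame({'first_name': ['Billy', 'Brian', 'Bran'],
                    'last_name': ['Bonder', 'Black', 'Balwner']})

# ignore_index=True throws away each frame's 0-2 index and numbers the rows 0-5
df_new = pd.concat([df1, df2], ignore_index=True)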
