I've looked into agg/apply/transform after groupby, but none of them seem to meet my need.
Here is an example DF:
df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that, for each person, he/she consumes food in sequential order.
Now I would like to group by the person column and create a DataFrame that contains food pairs from two neighbouring days/times (as shown below).
Note the day column is only here for example purposes, so its values should not be used. It only indicates that the food column is in sequential order. In my real data, it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.
Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?
Create the new column with DataFrameGroupBy.shift and then remove rows with missing values in food_next with DataFrame.dropna:
df = (df_seq.assign(food_next=df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print(df)
person day food food_next
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
3 Lucy 1 fish pork
4 Lucy 4 pork venison
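If the real rows aren't already ordered within each person (the question says the real sequencing column is a datetime), sorting first keeps the shift aligned with the sequence. A minimal sketch, assuming the datetime column is still named day as in the example:

# sort within each person by the datetime column before shifting
df = (df_seq.sort_values(['person', 'day'])
            .assign(food_next=lambda d: d.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))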
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person` == '{person}' and `day` > {day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
person day food next_food
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
2 Tom 3 chicken Mystery
3 Lucy 1 fish pork
4 Lucy 4 pork venison
5 Lucy 6 venison Mystery
Give it a try; I'm not convinced you'll see a vast performance improvement, but it'd be interesting to find out.
I have a DataFrame with more than 11,000 observations. I want to loop through 2 of its several columns. Below are the 2 columns that I want to loop through:
df1:
            EWZ sales_territory
0       90164.0           north
1      246794.0           north
2      216530.0           north
3       80196.0           north
4       12380.0           north
11002     224.0            east
11003    1746.0            east
11004    7256.0            east
11005     439.0            east
11006   13724.0            east
The data shown here is the first 5 and last 5 observations of the columns.
The sales_territory column has north, south, east and west observations. EWZ is the population size.
I want to select all east rows that have the same population value, and likewise the north, south and west rows with the same population value, and append them to a variable. I.e. if there are 20 north rows that have 20,000 as the population size, I want to select them. Same with the others.
I tried using nested if statements but, frankly speaking, I don't know how to specify the condition for the EWZ column. I tried iterrows(), but I could not find my way out.
How do I figure this out?
You don't have to use a for loop. You can use:
df1.groupby(['EWZ', 'sales_territory']).apply(your_function)
and achieve the desired result. If you want to get a list of unique values then you can just drop duplicates using df1[['EWZ', 'sales_territory']].drop_duplicates()
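If the goal is simply to collect every block of rows that share the same population and territory, a minimal sketch could be (the groups dictionary and east_20000 are just illustrative names):

# collect all rows that share the same (EWZ, sales_territory) pair
groups = {key: grp for key, grp in df1.groupby(['EWZ', 'sales_territory'])}

# e.g. all eastern rows with a population of 20000, if such a group exists
east_20000 = groups.get((20000.0, 'east'))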
If you don't care about the EWZ values while selecting, you can use 4 loc statements like:
df1.loc[df1['sales_territory'] == 'north', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'east', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'south', 'new_column'] = value
df1.loc[df1['sales_territory'] == 'west', 'new_column'] = value
I have a df, like
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
2 C Coco Gym Beer
... ... ... ... ...
I want to select and get a new df whose rows contain 'Park'.
Desired result:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
1 B Tea Restaurant Park
...
And another new df whose rows contain 'Gym'.
Desired results:
Person 1st 2nd 3rd
0 A Park Gym Supermarket
2 C Coco Gym Beer
...
How could I do it?
There is no problem selecting 'Park' in one column, df[df['1st'] == 'Park'], but I have problems selecting from multiple columns (1st, 2nd, 3rd, etc.).
You can perform "or" operations in pandas by using the pipe |, so in this specific case, you could try:
df_filtered = df[(df['1st'] == 'Park') | (df['2nd'] == 'Park') | (df['3rd'] == 'Park')]
Alternatively, you could use the .any() function with the argument axis=1, which will keep a row where any of the columns match:
df_filtered = df[df[['1st', '2nd', '3rd']].isin(['Park']).any(axis=1)]
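Applying the same pattern once per keyword gives the two frames asked for in the question (note the match is case-sensitive, so 'Park' and 'Gym' must be capitalised as in the data):

cols = ['1st', '2nd', '3rd']
df_park = df[df[cols].isin(['Park']).any(axis=1)]  # rows containing 'Park'
df_gym = df[df[cols].isin(['Gym']).any(axis=1)]    # rows containing 'Gym'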
Let's suppose I have a Series (or DataFrame) s1, for example a list of all Universities and Colleges in the USA:
University
0 Searcy Harding University
1 Angwin Pacific Union College
2 Fairbanks University of Alaska Fairbanks
3 Ann Arbor University of Michigan
And another Series (or DataFrame) s2, for example a list of all cities in the USA:
City
0 Searcy
1 Angwin
2 New York
3 Ann Arbor
And my desired output (basically an intersection of s1 and s2):
Uni City
0 Searcy
1 Angwin
2 Fairbanks
3 Ann Arbor
The thing is: I'd like to create a Series that consists of cities, but only those that have a university/college. My very first thought was to remove the "University" or "College" parts from s1, but it turns out that is not enough, as in the case of Angwin Pacific Union College. Then I thought of keeping only the first word, but that excludes Ann Arbor.
Finally, I got a Series of all the cities, s2, and now I'm trying to use it as a filter (something similar to .contains() or .isin()), so that if a string in s1 (a Uni name) contains any of the elements of s2 (a city name), only the city name is returned.
My question is: how to do it in a neat way?
I would try to build a list comprehension of cities that are contained in at least one university name:
pd.Series([i for i in s2 if s1.str.contains(i).any()], name='Uni City')
With your example data it gives:
0 Searcy
1 Angwin
2 Ann Arbor
Name: Uni City, dtype: object
Data Used
s=pd.DataFrame({'University':['Searcy Harding University','Angwin Pacific Union College','Fairbanks University of Alaska Fairbanks','Ann Arbor University of Michigan']})
s2=pd.DataFrame({'City':['Searcy','Angwin','Fairbanks','Ann Arbor']})
Convert s2.City to a set of unique city names:
st = set(s2.City.unique().tolist())
Calculate s['Uni City'] using next() to return the first city name contained in each university string, falling back to np.nan (requires import numpy as np) when there is no match:
s['Uni City'] = s['University'].apply(lambda x: next((i for i in st if i in x), np.nan))
Outcome
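With the sample data above, this should give:

                                  University   Uni City
0                  Searcy Harding University     Searcy
1               Angwin Pacific Union College     Angwin
2  Fairbanks University of Alaska Fairbanks   Fairbanks
3           Ann Arbor University of Michigan  Ann Arbor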
I have a dataframe like this:
Destinations
Paris,Oslo, Paris,Milan, Athens,Amsterdam
Boston,New York, Boston,London, Paris,New York
Nice,Paris, Milan,Paris, Nice,Milan
I want to get the following dataframe (without space between the cities):
Destinations_2 no_destinations
Paris,Oslo,Milan,Athens,Amsterdam 5
Boston,New York,London,Paris 4
Nice,Paris,Milan 3
How to remove duplicates within a cell?
You can use a list comprehension, which is faster than using apply() (replace Col with the original column name):
df['no_destinations'] = [len(set([a.strip() for a in i.split(',')])) for i in df['Col']]
print(df)
Col no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
df['no_destinations'] = df.Destinations.str.split(',').apply(set).apply(len)
If there are spaces in between, use:
df.Destinations.str.split(',').apply(lambda x: list(map(str.strip, x))).apply(set).apply(len)
Output
Destinations  no_destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam 5
1 Boston,New York, Boston,London, Paris,New York 4
2 Nice,Paris, Milan,Paris, Nice,Milan 3
# your data:
import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
'Boston,New York, Boston,London, Paris,New York',
'Nice,Paris, Milan,Paris, Nice,Milan']}
df = pd.DataFrame(data)
>>>
Destinations
0 Paris,Oslo, Paris,Milan, Athens,Amsterdam
1 Boston,New York, Boston,London, Paris,New York
2 Nice,Paris, Milan,Paris, Nice,Milan
First: make every row of your column a list.
df.Destinations = df.Destinations.apply(lambda x: x.replace(', ', ',').split(','))
>>>
Destinations
0 [Paris, Oslo, Paris, Milan, Athens, Amsterdam]
1 [Boston, New York, Boston, London, Paris, New York]
2 [Nice, Paris, Milan, Paris, Nice, Milan]
Second: remove dups from the lists.
df.Destinations = df.Destinations.apply(lambda x: list(dict.fromkeys(x)))
# or: df.Destinations = df.Destinations.apply(lambda x: list(set(x)))
>>>
Destinations
0 [Paris, Oslo, Milan, Athens, Amsterdam]
1 [Boston, New York, London, Paris]
2 [Nice, Paris, Milan]
Finally, create the desired columns:
df['no_destinations'] = df.Destinations.apply(lambda x: len(x))
df['Destinations_2'] = df.Destinations.apply(lambda x: ','.join(x))
All steps use apply and lambda functions; you can chain or nest them together if you want, as sketched below.
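For instance, the three steps could be collapsed into a single pass (a sketch that assumes df still holds the raw comma-separated strings; deduped is just an intermediate Series introduced here):

# split, strip and dedup in one go, then derive both columns
deduped = df['Destinations'].apply(lambda x: list(dict.fromkeys(x.replace(', ', ',').split(','))))
df['no_destinations'] = deduped.apply(len)
df['Destinations_2'] = deduped.apply(','.join)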
All the previous answers have addressed only one part of your problem, i.e. showing the unique count (no_destinations). Let me try to answer both of your queries.
The idea below is to apply a method to the Destinations column that returns two Series named Destinations_2 and no_destinations, containing the unique elements separated by commas with no spaces, and the count of unique elements, respectively.
import pandas as pd
data = {'Destinations': ['Paris,Oslo, Paris,Milan, Athens,Amsterdam',
'Boston,New York, Boston,London, Paris,New York',
'Nice,Paris, Milan,Paris, Nice,Milan'
]}
def remove_dups(x):
    data = set(x.replace(" ", "").split(','))
    return pd.Series([','.join(data), len(data)], index=['Destinations_2', 'no_destinations'])
df = pd.DataFrame.from_dict(data)
df[['Destinations_2', 'no_destinations']] = df['Destinations'].apply(remove_dups)
print(df.head())
Output:
Note: As you are not concerned with the order, I have used set above. If you need to maintain the order, you will have to replace set with some other logic to remove duplicates.
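One order-preserving replacement for set is dict.fromkeys, which keeps first-seen order; a sketch of the same helper with that swap (remove_dups_ordered is just an illustrative name):

def remove_dups_ordered(x):
    # dict.fromkeys keeps only the first occurrence of each city, in order
    data = list(dict.fromkeys(x.replace(" ", "").split(',')))
    return pd.Series([','.join(data), len(data)],
                     index=['Destinations_2', 'no_destinations'])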