groupby comma-separated values in single DataFrame column python/pandas

As an example, let's say I have a python pandas DataFrame that is the following:
# PERSON THINGS
0 Joe Candy Corn, Popsicles
1 Jane Popsicles
2 John Candy Corn, Ice Packs
3 Lefty Ice Packs, Hot Dogs
I would like to use the pandas groupby functionality to have the following output:
THINGS COUNT
Candy Corn 2
Popsicles 2
Ice Packs 2
Hot Dogs 1
I generally understand the following groupby command:
df.groupby(['THINGS']).count()
But the output is not by individual item, but by the entire string. I think I understand why this is, but it's not clear to me how to best approach the problem to get the desired output instead of the following:
THINGS PERSON
Candy Corn, Ice Packs 1
Candy Corn, Popsicles 1
Ice Packs, Hot Dogs 1
Popsicles 1
Does pandas have a function like the LIKE in SQL, or am I thinking about how to do this wrong in pandas?
Any assistance appreciated.

Create a Series by joining all the strings and splitting them back into individual items, then use value_counts:
In [292]: pd.Series(df.THINGS.str.cat(sep=', ').split(', ')).value_counts()
Out[292]:
Popsicles 2
Ice Packs 2
Candy Corn 2
Hot Dogs 1
dtype: int64

You need to split THINGS on ',', flatten the resulting lists, and count the values.
pd.Series([item.strip() for sublist in df['THINGS'].str.split(',') for item in sublist]).value_counts()
Output:
Candy Corn 2
Popsicles 2
Ice Packs 2
Hot Dogs 1
dtype: int64
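If you are on pandas 0.25 or later, Series.explode gives the same counts without building the flattened list by hand; a minimal sketch (the frame is rebuilt here so the snippet runs standalone):
import pandas as pd

df = pd.DataFrame({
    'PERSON': ['Joe', 'Jane', 'John', 'Lefty'],
    'THINGS': ['Candy Corn, Popsicles', 'Popsicles',
               'Candy Corn, Ice Packs', 'Ice Packs, Hot Dogs'],
})

# split each cell into a list, give each item its own row, then count
counts = df['THINGS'].str.split(', ').explode().value_counts()
print(counts)
As for the side question: Series.str.contains is the closest pandas analogue to SQL's LIKE, but it filters whole rows rather than splitting them apart, which is why a split-and-count approach is needed here.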


In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?

I've looked into agg/apply/transform after groupby, but none of them seem to meet my need.
Here is an example DF:
df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that, for each person, he/she consumes food in sequential orders.
Now I would like to group by the person col, and create a DataFrame which contains food pairs for two neighboring days/time (as shown below).
Note the day column is only for example purposes here, so its values should not be used directly. It only indicates that the food column is in sequential order. In my real data, it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.
Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?
Create new column by DataFrameGroupBy.shift and then remove rows with missing values in food_next by DataFrame.dropna:
df = (df_seq.assign(food_next=df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print(df)
person day food food_next
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
3 Lucy 1 fish pork
4 Lucy 4 pork venison
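Since the question says the real day column is a datetime, it may be worth sorting each group before shifting; a minimal sketch under that assumption:
# Sort so shift(-1) pairs each meal with the chronologically next one,
# even if the input rows arrive out of order.
df = (df_seq.sort_values(['person', 'day'])
            .assign(food_next=lambda d: d.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))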
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person`=='{person}' and `day`>{day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
person day food next_food
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
2 Tom 3 chicken Mystery
3 Lucy 1 fish pork
4 Lucy 4 pork venison
5 Lucy 6 venison Mystery
Give it a try, though since get_next_food re-queries the whole frame for every row, I'm not convinced you'll see a vast performance improvement; it'd be interesting to find out.

How to divide a list to allocate it to another dataframe based on sum of values?

I have two dataframes for example:
First dataframe contains the name and kind of chocolate they want:
Name    Chocolate
Kirti   Nutella
Rahul   Lindt
Sam     Lindt
Joy     Lindt
Mrinal  Kit Kat
Sai     Lindt
The second dataframe contains each shop and the availability of each item in that shop:
Shop    Chocolate  Count
Shop 1  Lindt      2
Shop 2  Lindt      3
Shop 1  Nutella    5
The end result that I'm looking for should return a dataframe which indicates which shop the people can go to.
Rahul, Sam, Joy and Sai are 4 people who want Lindt. 2 of them can go to Shop 1 and the other 2 can go to Shop 2 to ensure everyone can get Lindt chocolate.
Now we can randomly assign 2 of them to Shop 1 and 2 of them to Shop 2.
Similarly with other chocolates and resulting dataframe will be
Name    Chocolate  Shop
Kirti   Nutella    Shop 1
Rahul   Lindt      Shop 1
Sam     Lindt      Shop 1
Joy     Lindt      Shop 2
Mrinal  Kit Kat    NA
Sai     Lindt      Shop 2
In the above case, Mrinal doesn't get assigned any shop because no shop has Kit Kat available.
I've been trying to do a VLOOKUP-style mapping in Python using map, but all people who want Lindt get assigned Shop 2. I want to assign them in a way that divides the quantity available across shops so that as many people as possible get chocolate.
Here's the code that I wrote as of now:
df_demand = pd.DataFrame({'Name': ['Kirti', 'Rahul', 'Sam', 'Joy', 'Mrinal', 'Sai'],
                          'Chocolate': ['Nutella', 'Lindt', 'Lindt', 'Lindt', 'Kit-Kat', 'Lindt']})
df_inventory = pd.DataFrame({'Shop': ['Shop1', 'Shop2', 'Shop1'],
                             'Chocolate': ['Lindt', 'Lindt', 'Nutella'],
                             'Count': [2, 3, 5]})
df_inventory = df_inventory.sort_values(by=['Count'], ascending=False, kind="mergesort")
df_inventory = df_inventory.drop_duplicates(subset="Chocolate")
df_inv1 = df_inventory.set_index('Chocolate').to_dict()['Shop']
df_demand['Shop'] = df_demand['Chocolate'].map(df_inv1)
Output of above code (every Lindt request maps to Shop2, since drop_duplicates keeps only a single shop per chocolate):
     Name Chocolate   Shop
0   Kirti   Nutella  Shop1
1   Rahul     Lindt  Shop2
2     Sam     Lindt  Shop2
3     Joy     Lindt  Shop2
4  Mrinal   Kit-Kat    NaN
5     Sai     Lindt  Shop2
One way to do this is to count up each chocolate's supply slots and then use that running number to merge each person's request with the corresponding shop.
df = pd.DataFrame(
    [['Shop1', 'Lindt', 1],
     ['Shop1', 'Milka', 1],
     ['Shop2', 'Lindt', 3],
     ['Shop3', 'Lindt', 3],
     ['Shop3', 'Milka', 3]],
    columns=['Shop', 'Chocolate', 'Count'])
dk = pd.DataFrame(
    [['Alfred', 'Milka'],
     ['Berta', 'Milka'],
     ['Charlie', 'Milka'],
     ['Darius', 'Milka'],
     ['Emil', 'Milka'],
     ['George', 'Lindt'],
     ['Francois', 'Milka']],
    columns=['Name', 'Chocolate'])
# running total of slots per chocolate, and the slot range each shop covers
df['max_satisfaction'] = df.groupby('Chocolate')['Count'].cumsum()
df['min_satisfaction'] = df['max_satisfaction'] - df['Count']
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'], x['max_satisfaction'])),
    axis=1)
df = df.explode('satisfies')
df['satisfies'] = df['satisfies'].astype(int)  # explode leaves object dtype
# number each request per chocolate, then match it to the covering shop
dk['request_number'] = dk.groupby('Chocolate').cumcount()
dk = dk.merge(df, how='left',
              left_on=['Chocolate', 'request_number'],
              right_on=['Chocolate', 'satisfies'])
dk[['Name', 'Chocolate', 'Shop']]
Note that this solution will be quite expensive if the shops have way more supply than demand. However, a limit to prevent the explosion of df could easily be implemented, as sketched below.
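For what it's worth, a minimal sketch of that limit, assuming the df and dk frames from above are still in scope: cap each shop's slot range at the total demand for its chocolate, so explode never emits more rows than there are requests. It would replace the satisfies line above:
# Cap the slot range at actual demand so explode() stays small even
# when supply vastly exceeds demand.
demand = dk.groupby('Chocolate').size()  # number of requests per chocolate
df['satisfies'] = df.apply(
    lambda x: list(range(x['min_satisfaction'],
                         min(x['max_satisfaction'],
                             demand.get(x['Chocolate'], 0)))),
    axis=1)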

Pandas combine multiple columns (with NoneType)

My apologies if this has been asked/answered before but I couldn't find this an answer to my problem after some time searching.
Very simply put, I would like to combine multiple columns into one, separated with a ,
The problem is that some cells are empty (NoneType)
And when combining them I get either:
TypeError: ('sequence item 3: expected str instance, NoneType found', 'occurred at index 0')
or
When .map(str) is added, it literally adds 'None' for every NoneType value (as kinda expected)
Let's say I have a production dataframe looking like
       0      1     2
1   Rice
2  Beans   Rice
3   Milk  Beans  Rice
4  Sugar   Rice
What I would like is a single column with the values
Production
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
With some searching and tweaking I added this code:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x), axis=1)
Which produces problem 1
or changed it like this:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x.map(str)), axis=1)
Which produces problem 2
Maybe it's good to add that I'm very new and kinda exploring Pandas/Python right now. So any help or push in the right direction is much appreciated!
pd.Series.str.cat should work here
df
Out[43]:
0 1 2
1 Rice NaN NaN
2 Beans Rice NaN
3 Milk Beans Rice
4 Sugar Rice NaN
df.apply(lambda x: x.str.cat(sep=', '), axis=1)
Out[44]:
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
dtype: object
You can use str.join after transforming NaN values to empty strings:
res = df.fillna('').apply(lambda x: ', '.join(filter(None, x)), axis=1)
print(res)
0 Rice
1 Beans, Rice
2 Milk, Beans, Rice
3 Sugar, Rice
dtype: object
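For completeness, a third minimal sketch, assuming the same NaN-holding frame: stack() drops missing cells on its own, so no fillna or filter step is needed.
# stack() silently drops NaN, leaving only the real values in each row
res = df.stack().groupby(level=0).agg(', '.join)
print(res)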

Python Access Labels of Sklearn CountVectorizer

Here is my df after cleaning:
number summary cleanSummary
0 1-123 he loves ice cream love ice cream
1 1-234 she loves ice love ice
2 1-345 i hate avocado hate avocado
3 1-123 i like skim milk like skim milk
As you can see, there are two records that have the same number. Now I'll create and fit the vectorizer.
cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1,1), analyzer='word')
cv.fit(df['cleanSummary'])
Now I'll transform.
freq = cv.transform(df['cleanSummary'])
Now if I take a look at freq...
freq = sum(freq).toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
freq
frequency
0 1
1 1
2 1
3 2
4 1
5 2
6 1
7 1
...there doesn't seem to be a logical way to access the original number. I have tried methods of looping through each row, but this runs into problems because of the potential for multiple summaries per number. A loop using a grouped df...
def extractFeatures(groupedDF, textCol):
    features = pd.DataFrame()
    for id, group in groupedDF:
        freq = cv.transform(group[textCol])
        freq = sum(freq).toarray()[0]
        freq = pd.DataFrame(freq, columns=['frequency'])
        dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
        dfinner['number'] = id
        dfinner = dfinner.join(freq)
        features = features.append(dfinner)
    return features
...works, but the performance is terrible (i.e. 12 hours to run through 45,000 one-sentence documents).
If I change
freq = sum(freq).toarray()[0]
to
freq = freq.toarray()
I get an array of frequencies for each ngram for each document. This is good, but then it doesn't allow me to push that array of lists into a dataframe. And I still wouldn't be able to access number.
How do I access the original number labels for each ngram without looping over a grouped df? My desired result is:
number ngram frequency
1-123 love 1
1-123 ice 1
1-123 cream 1
1-234 love 1
1-234 ice 1
1-345 hate 1
1-345 avocado 1
1-123 like 1
1-123 skim 1
1-123 milk 1
Edit: this is somewhat of a revisit to this question: Convert CountVectorizer and TfidfTransformer Sparse Matrices into Separate Pandas Dataframe Rows. However, after implementing the method described in that answer, I face memory issues for a large corpus, so it doesn't seem scalable.
freq = cv.fit_transform(df.cleanSummary)
dtm = pd.DataFrame(freq.toarray(), columns=cv.get_feature_names(), index=df.number).stack()
dtm[dtm > 0]
number
1-123 cream 1
ice 1
love 1
1-234 ice 1
love 1
1-345 avocado 1
hate 1
1-123 like 1
milk 1
skim 1
dtype: int64
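To go from that stacked Series to the exact number/ngram/frequency layout in the question, a small follow-up sketch (note that scikit-learn >= 1.0 renamed get_feature_names() to get_feature_names_out(), so newer versions need that spelling above):
# flatten the MultiIndex Series into the desired three-column frame
result = dtm[dtm > 0].rename('frequency').reset_index()
result.columns = ['number', 'ngram', 'frequency']
print(result)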

Pandas: Fill in missing indexes with specific ordered values that are already in column.

I have extracted a one-column dataframe with specific values. Now this is what the dataframe looks like:
Commodity
0 Cocoa
4 Coffee
6 Maize
7 Rice
10 Sugar
12 Wheat
Now I want to fill each missing index with the value above it in the column, so it should look something like this:
Commodity
0 Cocoa
1 Cocoa
2 Cocoa
3 Cocoa
4 Coffee
5 Coffee
6 Maize
7 Rice
8 Rice
9 Rice
10 Sugar
11 Sugar
12 Wheat
I don't seem to get anything from the pandas documentation Working with Text Data. Thanks for your help!
I create a new index with pd.RangeIndex. It works like range so I need to pass it a number one greater than the max number in the current index.
df.reindex(pd.RangeIndex(df.index.max() + 1)).ffill()
Commodity
0 Cocoa
1 Cocoa
2 Cocoa
3 Cocoa
4 Coffee
5 Coffee
6 Maize
7 Rice
8 Rice
9 Rice
10 Sugar
11 Sugar
12 Wheat
First expand the index to include all numbers
s = pd.Series(['Cocoa', 'Coffee', 'Maize', 'Rice', 'Sugar', 'Wheat'],
              index=[0, 4, 6, 7, 10, 12], name='Commodity')
s = s.reindex(range(s.index.max() + 1))
Then forward-fill the gaps with the value above
s.ffill()
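For what it's worth, reindex can also do the fill in one step through its method argument; a minimal sketch on the same s:
# reindex and forward-fill in a single call
s.reindex(range(s.index.max() + 1), method='ffill')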
