My apologies if this has been asked/answered before, but I couldn't find an answer to my problem after some time searching.
Very simply put, I would like to combine multiple columns into one, separated with a comma.
The problem is that some cells are empty (NoneType)
And when combining them I get either:
TypeError: ('sequence item 3: expected str instance, NoneType found', 'occurred at index 0')
or
When I add .map(str), it literally adds 'None' for every NoneType value (as kinda expected)
Let's say I have a production dataframe looking like
   0      1      2
1  Rice
2  Beans  Rice
3  Milk   Beans  Rice
4  Sugar  Rice
What I would like is a single column with the values
Production
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
With some searching and tweaking I added this code:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x), axis=1)
Which produces problem 1
or changed it like this:
testColumn = productionFrame.iloc[::].apply(lambda x: ', '.join(x.map(str)), axis=1)
Which produces problem 2
Maybe it's good to add that I'm very new and kinda exploring Pandas/Python right now. So any help or push in the right direction is much appreciated!
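For reference, a minimal sketch that reconstructs a frame like the one above (the exact layout is an assumption; the blank cells are None):
import pandas as pd

# hypothetical reconstruction of the production frame; blank cells are None
productionFrame = pd.DataFrame(
    [['Rice', None, None],
     ['Beans', 'Rice', None],
     ['Milk', 'Beans', 'Rice'],
     ['Sugar', 'Rice', None]],
    index=[1, 2, 3, 4],
)

# joining without handling None raises the TypeError quoted above:
# productionFrame.apply(lambda x: ', '.join(x), axis=1)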
pd.Series.str.cat should work here
df
Out[43]:
0 1 2
1 Rice NaN NaN
2 Beans Rice NaN
3 Milk Beans Rice
4 Sugar Rice NaN
df.apply(lambda x: x.str.cat(sep=', '), axis=1)
Out[44]:
1 Rice
2 Beans, Rice
3 Milk, Beans, Rice
4 Sugar, Rice
dtype: object
You can use str.join after transforming NaN values to empty strings:
res = df.fillna('').apply(lambda x: ', '.join(filter(None, x)), axis=1)
print(res)
0 Rice
1 Beans, Rice
2 Milk, Beans, Rice
3 Sugar, Rice
dtype: object
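For completeness, a stack-based variant should produce the same strings, since stack drops missing values by default (an untested sketch):
res = df.stack().groupby(level=0).agg(', '.join)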
Related
I am working with sequential and frequent pattern mining. I was given this type of dataset to do the task, and I am told to make a sequence from the dataset before processing.
This is a sample taken from the dataset, in table format. The table in .csv format is available at: https://drive.google.com/file/d/1j1rEy4Q600y_oym23cG3m3NNWuNvIcgG/view?usp=sharing
User  Item 1  Item 2  Item 3  Item 4   Item 5  Item 6
A     milk    cake    citrus
B     cheese  milk    bread   cabbage  carrot
A     tea     juice   citrus  salmon
B     apple   orange
B     cake
At first, I think I have to read the csv file into a Pandas DataFrame. I have no problem with that; what I want to ask is, how is it possible to produce a result like this with a dataframe?
Expected result 1: the group of items bought by one user in one transaction is grouped into one tuple
User  Transactions
A     (milk cake citrus)(tea juice citrus salmon)
B     (cheese milk bread cabbage carrot)(apple orange)(cake)
Expected result 2: the items purchased by a user are not grouped per transaction.
User  Transactions
A     milk, cake, citrus, tea, juice, citrus, salmon
B     cheese, milk, bread, cabbage, carrot, apple, orange, cake
My question is: how do I make those dataframes? I've tried the solution from this article: How to group dataframe rows into list in pandas groupby, but it was still not successful.
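In case the linked CSV is unavailable, here is a sketch of the sample data as a frame (column names assumed from the table above):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'User':   ['A', 'B', 'A', 'B', 'B'],
    'Item 1': ['milk', 'cheese', 'tea', 'apple', 'cake'],
    'Item 2': ['cake', 'milk', 'juice', 'orange', np.nan],
    'Item 3': ['citrus', 'bread', 'citrus', np.nan, np.nan],
    'Item 4': [np.nan, 'cabbage', 'salmon', np.nan, np.nan],
    'Item 5': [np.nan, 'carrot', np.nan, np.nan, np.nan],
    'Item 6': [np.nan] * 5,
})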
In order to get the first result:
out = (df.set_index('User')
         .apply(lambda x: tuple(x[x.notna()].tolist()), axis=1)
         .groupby(level=0)
         .agg(list)
         .reset_index(name='Transactions'))
Out[95]:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
For the second result, which is easier than the previous one:
import numpy as np

df.set_index('User').replace('', np.nan).stack().groupby(level=0).agg(','.join)
Out[97]:
User
A milk,cake,citrus,tea,juice,citrus,salmon
B cheese,milk,bread,cabbage,carrot,apple,orange,...
dtype: object
Let's start with the second one:
(df.set_index('User')
.stack()
.groupby(level=0).apply(list)
.rename('Transactions')
.reset_index()
)
output:
User Transactions
0 A [milk, cake, citrus, tea, juice, citrus, salmon]
1 B [cheese, milk, bread, cabbage, carrot, apple, ...
To get the first one, one just needs to add a new column:
(df.assign(group=df.groupby('User').cumcount())
.set_index(['User', 'group'])
.stack()
.groupby(level=[0,1]).apply(tuple)
.groupby(level=0).apply(list)
.rename('Transactions')
.reset_index()
)
output:
User Transactions
0 A [(milk, cake, citrus), (tea, juice, citrus, sa...
1 B [(cheese, milk, bread, cabbage, carrot), (appl...
import pandas as pd
df = pd.read_csv('sampletable.csv')
df['Transactions'] = '(' + df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=' '), axis=1) + ')'
df = df.groupby(['User'])['Transactions'].apply(lambda x: ''.join(x)).reset_index()
print(df)
output:
User Transactions
0 A (milk cake citrus)(tea juice citrus salmon)
1 B (cheese milk bread cabbage carrot)(apple orange)(cake)
For the second output, use this:
df = pd.read_csv('sampletable.csv')
df['a'] = df[['Item 1','Item 2','Item 3','Item 4','Item 5','Item 6']].apply(lambda x: x.str.cat(sep=', '), axis=1)
df = df.groupby(['User'])['a'].apply(lambda x: ', '.join(x)).reset_index()
print(df)
I have a list of strings looking like this:
strings = ['apple', 'pear', 'grapefruit']
and I have a data frame containing id and text values like this:
id  value
1   The grapefruit is delicious! But the pear tastes awful.
2   I am a big fan og apple products
3   The quick brown fox jumps over the lazy dog
4   An apple a day keeps the doctor away
Using pandas, I would like to create a filter that gives me only the id and value for the rows whose value contains one or more of the strings, together with a column showing which of the strings it contains, like this:
id  value                                                     value contains substrings
1   The grapefruit is delicious! But the pear tastes awful.  grapefruit, pear
2   I am a big fan og apple products                         apple
4   An apple a day keeps the doctor away                     apple
How would I write this using pandas?
Use .str.findall:
df['fruits'] = df['value'].str.findall('|'.join(strings)).str.join(', ')
df[df.fruits != '']
id value fruits
0 1 The grapefruit is delicious! But the pear tast... grapefruit, pear
1 2 I am a big fan og apple products apple
3 4 An apple a day keeps the doctor away apple
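One caveat: if the search strings can contain regex metacharacters, it is probably safer to escape them before building the pattern, e.g.:
import re

pattern = '|'.join(map(re.escape, strings))
df['fruits'] = df['value'].str.findall(pattern).str.join(', ')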
I have a dataframe that looks like this:
Beverage      Ingredients                                                          Ingredients_Tokens
Orange Juice  Orange Juice Concentrate, Orange Pulp                                [orange, juice, concentrate, orange, pulp]
Root Beer     Sugar, Water, Caramel Color                                          [sugar, water, caramel, color]
...           ...                                                                  ...
Apple Juice   INGREDIENTS: CONTAINS PURE FILTERED WATER, CONCENTRATED APPLE JUICE  [pure, filtered, water, concentrated, apple, juice]
I want to take the Ingredients_Tokens field and create a flag field for each token that appears more than 20 times in the whole dataframe, so that my final dataframe has all of the beverages and whether they contain each of those tokens, like:
Beverage      Token_Orange  Token_Sugar  Token_Water  ...  Token_Apple
Orange Juice  1             0            0                 0
Root Beer     0             1            1                 0
...
Apple Juice   0             0            1                 1
I tried a loop that creates the Token variable and then stores it, something like this (47 is the total number of tokens):
df = pd.DataFrame()
for i in range(0, 47):
    T['Token'] = T['Ingredients_Tokens'][i]
    df = df.append([Q])
df = pd.DataFrame(df)
But I am not sure where to go from here.
One option if you're on one of the more recent versions of pandas is to use .explode:
In [167]: df
Out[167]:
thing ingredients
0 oj [orange, juice, pulp]
1 root beer [roots, beer]
In [168]: df.explode("ingredients").set_index("ingredients", append=True).unstack().notnull()
Out[168]:
thing
ingredients beer juice orange pulp roots
0 False True True True False
1 True False False False True
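From there, one way to get the 0/1 flag columns with the >20 frequency cutoff might be the following (a sketch using the question's column names; crosstab counts token occurrences per beverage, and clip turns the counts into flags):
exploded = df.explode('Ingredients_Tokens')
counts = exploded['Ingredients_Tokens'].value_counts()
keep = counts[counts > 20].index  # tokens appearing more than 20 times
exploded = exploded[exploded['Ingredients_Tokens'].isin(keep)]
flags = (pd.crosstab(exploded['Beverage'], exploded['Ingredients_Tokens'])
           .clip(upper=1)  # counts -> 0/1 flags
           .add_prefix('Token_'))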
As an example, let's say I have a python pandas DataFrame that is the following:
# PERSON THINGS
0 Joe Candy Corn, Popsicles
1 Jane Popsicles
2 John Candy Corn, Ice Packs
3 Lefty Ice Packs, Hot Dogs
I would like to use the pandas groupby functionality to have the following output:
THINGS COUNT
Candy Corn 2
Popsicles 2
Ice Packs 2
Hot Dogs 1
I generally understand the following groupby command:
df.groupby(['THINGS']).count()
But the output is not by individual item, but by the entire string. I think I understand why this is, but it's not clear to me how to best approach the problem to get the desired output instead of the following:
THINGS PERSON
Candy Corn, Ice Packs 1
Candy Corn, Popsicles 1
Ice Packs, Hot Dogs 1
Popsicles 1
Does pandas have a function like the LIKE in SQL, or am I thinking about how to do this wrong in pandas?
Any assistance appreciated.
Create a series by splitting words, and use value_counts
In [292]: pd.Series(df.THINGS.str.cat(sep=', ').split(', ')).value_counts()
Out[292]:
Popsicles 2
Ice Packs 2
Candy Corn 2
Hot Dogs 1
dtype: int64
You need to split THINGS on ', ', flatten the series, and count the values.
pd.Series([item.strip() for sublist in df['THINGS'].str.split(',') for item in sublist]).value_counts()
Output:
Candy Corn 2
Popsicles 2
Ice Packs 2
Hot Dogs 1
dtype: int64
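On pandas 0.25+, str.split plus explode reads a bit more directly (a sketch, equivalent to the answers above):
df['THINGS'].str.split(', ').explode().value_counts()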
This question is related to another question I had posted.
Pandas - check if a string column in one dataframe contains a pair of strings from another dataframe
My goal is to check if two different columns of a dataframe contain a pair of string values and if the condition is met, then extract one of the values.
I have two dataframes like this:
df1 = pd.DataFrame({
    'consumption': ['squirrelate apple', 'monkey likesapple', 'monkey banana gets',
                    'badger/getsbanana', 'giraffe eats grass', 'badger apple.loves',
                    'elephant is huge', 'elephant/eats/', 'squirrel.digsingrass'],
    'name': ['apple', 'appleisred', 'banana is tropical', 'banana is soft',
             'lemon is sour', 'washington apples', 'kiwi', 'bananas', 'apples']
})
df2 = pd.DataFrame({
    'food': ['apple', 'apple', 'banana', 'banana'],
    'creature': ['squirrel', 'badger', 'monkey', 'elephant']
})
In [187]:df1
Out[187]:
consumption name
0 squirrelate apple apple
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples
6 elephant is huge kiwi
7 elephant/eats/ bananas
8 squirrel.digsingrass apples
In[188]: df2
Out[188]:
creature food
0 squirrel apple
1 badger apple
2 monkey banana
3 elephant banana
What I want to do is test whether 'apple' occurs in df1['name'] and 'squirrel' occurs in df1['consumption'], and if both conditions are met, extract 'squirrel' from df1['consumption'] into a new column df1['creature']. The result should look like:
Out[189]:
consumption creature name
0 squirrelate apple squirrel apple
1 monkey likesapple NaN appleisred
2 monkey banana gets monkey banana is tropical
3 badger/getsbanana NaN banana is soft
4 giraffe eats grass NaN lemon is sour
5 badger apple.loves badger washington apples
6 elephant is huge NaN kiwi
7 elephant/eats/ elephant bananas
8 squirrel.digsingrass NaN apples
If there were no paired-value constraint, I could have done something simple like:
np.where((df1['consumption'].str.contains(<creature_string>, case=False)) & (df1['name'].str.contains(<food_string>, case=False)), df1['consumption'].str.extract(<creature_string>), np.nan)
but I must check for pairs, so I tried to make a dictionary with food as keys and creatures as values, then build a string of all the creatures for a given food key and look for those using str.contains:
unique_food = df2.food.unique()
food_dict = {elem: pd.DataFrame for elem in unique_food}
for key in food_dict.keys():
    food_dict[key] = df2[:][df2.food == key]

# create key:value pairs of food key and creature strings
food_strings = {}
for key, values in food_dict.items():
    food_strings.update({key: '|'.join(map(str, list(food_dict[key]['creature'].unique())))})
In[199]: food_strings
Out[199]: {'apple': 'squirrel|badger', 'banana': 'monkey|elephant'}
The problem comes when I now try to apply str.contains:
for key, value in food_strings.items():
    np.where((df1['name'].str.contains('(' + food_strings[key] + ')', case=False)) &
             (df1['consumption'].str.contains('(' + food_strings[value] + ')', case=False)),
             df1['consumption'].str.extract('(' + food_strings[value] + ')'), np.nan)
I get a KeyError:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-62-7ab718066040> in <module>()
1 for key, value in food_strings.items():
2 np.where((df1['name'].str.contains('('+food_strings[key]+')', case = False)) &
----> 3 (df1['consumption'].str.contains('('+food_strings[value]+')', case = False)), df1['consumption'].str.extract('('+food_strings[value]+')'), np.nan)
KeyError: 'squirrel|badger'
When I just try for only the value and not the key, it works for the first key:value pair but not the second:
for key in food_strings.keys():
    df1['test'] = np.where(df1['consumption'].str.contains('(' + food_strings[key] + ')', case=False),
                           df1['consumption'].str.extract('(' + food_strings[key] + ')', expand=False),
                           np.nan)

df1
Out[196]:
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred NaN
2 monkey banana gets banana is tropical NaN
3 badger/getsbanana banana is soft badger
4 giraffe eats grass lemon is sour NaN
5 badger apple.loves washington apples badger
6 elephant is huge kiwi NaN
7 elephant/eats/ bananas NaN
8 squirrel.digsingrass apples squirrel
I got the ones matching apple with squirrel|badger, but missed banana with monkey|elephant.
Can someone please help?
d1 = df1.dropna()
d2 = df2.dropna()

sump = d1.consumption.values.tolist()
name = d1.name.values.tolist()
cret = d2.creature.values.tolist()
food = d2.food.values.tolist()

# boolean matrix: entry (i, j) is True when creature j appears in
# consumption i and food j appears in name i
check = np.array(
    [
        [c in s and f in n for c, f in zip(cret, food)]
        for s, n in zip(sump, name)
    ]
)

# dotting the boolean matrix with the creature column concatenates the
# matched names (True * 'squirrel' == 'squirrel', False * 'squirrel' == '')
# create a new series with the index of `d1` where we dropped na
# then reindex with `df1.index` prior to `assign`
test = pd.Series(check.dot(d2[['creature']].values).ravel(), d1.index)
test = test.reindex(df1.index, fill_value='')

df1.assign(test=test)
consumption name test
0 squirrelate apple apple squirrel
1 monkey likesapple appleisred
2 monkey banana gets banana is tropical monkey
3 badger/getsbanana banana is soft
4 giraffe eats grass lemon is sour
5 badger apple.loves washington apples badger
6 elephant is huge kiwi
7 elephant/eats/ bananas elephant
8 squirrel.digsingrass apples squirrel