I have the dataframe below:
details = {
'container_id' : [1, 2, 3, 4, 5, 6 ],
'container' : ['black box', 'orange box', 'blue box', 'black box','blue box', 'white box'],
'fruits' : ['apples, black currant', 'oranges','peaches, oranges', 'apples','apples, peaches, oranges', 'black berries, peaches, oranges, apples'],
}
# creating a DataFrame object
df = pd.DataFrame(details)
I want to find the frequency of each individual fruit, as a list of counts.
I tried this code
df['fruits'].str.split(expand=True).stack().value_counts()
but the word black is counted as a separate token (twice), instead of one count for black currant and one for black berries.
You can do it the way you did, but you have to specify the delimiter. Be aware that splitting leaves leading whitespace unless your delimiter is a comma followed by a space; to be safe, add an extra step with str.strip.
df['fruits'].str.split(',', expand=False).explode().str.strip().value_counts()
Or, closer to your original attempt (you can also chain str.strip after the stack call if you want to):
df['fruits'].str.split(', ', expand=True).stack().value_counts()
Output:
apples 4
oranges 4
peaches 3
black currant 1
black berries 1
Name: fruits, dtype: int64
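For completeness, here is a sketch of the stack-based variant with the extra str.strip step mentioned above:
df['fruits'].str.split(',', expand=True).stack().str.strip().value_counts()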
Specify the comma separator followed by an optional space:
df['fruits'].str.split(r',\s?', expand=True).stack().value_counts()
OUTPUT:
apples 4
oranges 4
peaches 3
black currant 1
black berries 1
dtype: int64
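Note: on newer pandas (1.4+), a multi-character pattern like this is treated as a regular expression, and you can make that explicit with regex=True; using \s* also swallows any amount of whitespace. A sketch:
df['fruits'].str.split(r',\s*', regex=True, expand=True).stack().value_counts()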
I have the following pandas dataframe:
import pandas as pd
foo_dt = pd.DataFrame({'var_1': ['filter coffee', 'american cheesecake', 'espresso coffee', 'latte tea'],
'var_2': ['coffee', 'coffee black', 'tea', 'strawberry cheesecake']})
and the following dictionary:
foo_colors = {'coffee': 'brown', 'cheesecake': 'white', 'tea': 'green'}
I want to add two columns to foo_dt (color_var_1 and color_var_2), holding the foo_colors value whose key appears in the text of var_1 or var_2, respectively.
EDIT
In other words, for every key in foo_colors, check whether it is contained in each of the columns var_1 and var_2, and write the corresponding dictionary value into the respective column (color_var_1 and color_var_2).
My desired resulting dataframe looks like this:
var_1 var_2 color_var_1 color_var_2
0 filter coffee coffee brown brown
1 american cheesecake coffee black white brown
2 espresso coffee tea brown green
3 latte tea strawberry cheesecake green white
Any idea how I can do this?
Use Series.str.extract to get the first matched substring, with a regex pattern built by joining the dict keys with |, then map the matches to values with Series.map:
pat = '|'.join(r"\b{}\b".format(x) for x in foo_colors)
for c in ['var_1', 'var_2']:
    foo_dt[f'color_{c}'] = foo_dt[c].str.extract(f'({pat})', expand=False).map(foo_colors)
print(foo_dt)
var_1 var_2 color_var_1 color_var_2
0 filter coffee coffee brown brown
1 american cheesecake coffee black white brown
2 espresso coffee tea brown green
3 latte tea strawberry cheesecake green white
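For reference, the assembled pattern (given Python 3.7+ insertion-ordered dicts) looks like:
print(pat)
# \bcoffee\b|\bcheesecake\b|\btea\b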
I am getting NaN values when trying to take the left part of a string in a pandas dataframe, where the slice end depends on the length of the cell in another column of the dataframe:
Example of df:
Phrase             Color
Paul like red      red
Mike like green    green
John like blue     blue
My objective is to obtain a series with the first part of each phrase, i.e. the part before "like {Color}".
Here it would be:
|First Name|
| Paul |
| Mike |
| John |
I tried the code below:
df["First Name"] = df["Phrase"].str[:- df["Color"].str.len() - 6]
But I keep getting NaN results. It seems the per-row length of the colors cannot be passed into the str[:-x] slice.
Can someone help me understand what is happening here and find a solution?
Thanks a lot. Have a nice day.
Consider the df below:
In [128]: df = pd.DataFrame({'Phrase':['Paul like red', 'Mike like green', 'John like blue', 'Mark like black'], 'Color':['red', 'green', 'blue', 'brown']})
In [129]: df
Out[129]:
Phrase Color
0 Paul like red red
1 Mike like green green
2 John like blue blue
3 Mark like black brown
Use numpy.where:
In [134]: import numpy as np
In [132]: df['First Name'] = np.where(df.apply(lambda x: x['Color'] in x['Phrase'], 1), df.Phrase.str.split().str[0], np.nan)
In [133]: df
Out[133]:
Phrase Color First Name
0 Paul like red red Paul
1 Mike like green green Mike
2 John like blue blue John
3 Mark like black brown NaN
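As an aside (not part of the original answer), the same check can be written without apply, using a plain list comprehension:
# take the first word when the row's Color occurs in its Phrase, else NaN
df['First Name'] = [p.split()[0] if c in p else np.nan
                    for p, c in zip(df['Phrase'], df['Color'])]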
Let's break this down and try to understand what's going on. .str returns a pandas.Series.str accessor (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.html), and you want to slice it using a vector.
So basically you are trying to do pandas.Series.str[:<some_vector>], where <some_vector> is -df["Color"].str.len() - 6.
Unfortunately, pandas offers no way to slice with a vector; you can check all the available string methods here: https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html
So we are restricted to using pandas.Series.str[: some_value]
Now since this some_value changes for every row, you can use the .apply method over each row as follows:
df = pd.DataFrame({
'Phrase': ['Paul like red', 'Mike like green', 'John like blue'],
'Color': ['red', 'green', 'blue']
})
>>>
Phrase Color
0 Paul like red red
1 Mike like green green
2 John like blue blue
def func(x):
    # slice off ' like ' (6 characters) plus the color length from the end
    return x['Phrase'][:-len(x['Color']) - 6]
df['First Name'] = df.apply(func, axis=1)
>>>
print (df)
Phrase Color First Name
0 Paul like red red Paul
1 Mike like green green Mike
2 John like blue blue John
Here I have used the same logic, but passed the slice bound as a per-row scalar via .apply.
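As a side note, if every phrase really has the form "<name> like <color>", a regex extraction avoids the length arithmetic entirely; a sketch, assuming that structure:
# capture everything before ' like '
df['First Name'] = df['Phrase'].str.extract(r'^(.*)\s+like\b', expand=False)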
I have two excel sheets. One contains summaries and the other contains categories with potential filter words. I need to assign categories to the first dataframe if any element matches in the second dataframe.
I have attempted to expand the list in the second dataframe and map by matching the terms to any words in the first dataframe.
Data for the test.
import pandas as pd
data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}
data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
print(df1)
Bucket Summary
0 basket This is a basket of red apples. They are sour.
1 bushel We found a bushel of fruit. They are red and s...
2 peck There is a peck of pears that taste sweet. The...
3 box We have a box of plums. They are sour and have...
print(df2)
Category Filters
0 Fruit apple, pear, plum, grape
1 Color red, purple, green
These lines convert the Category column of the table to a list for later use.
category_list = df2['Category'].values
category_list = list(set(category_list))
Attempt to match the text.
for item in category_list:
    item = df2.loc[df2['Category'] == item]
    filter_list = item['Filters'].values
    filter_list = list(set(filter_list))
    df1 = df1[df1['Summary'].isin(filter_list)]
I want the first dataframe to have categories assigned to it separated by a comma.
Result:
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit We have a box of plums. They are sour and have...
I hope this is clear. I have been banging my head against it for a week now.
Thank you in advance
Use pandas.Series.str.contains to check Filters with a loop:
df2['Filters'] = [key.replace(' ', '') for key in df2['Filters']]
df2['Filters'] = df2['Filters'].apply(lambda x: x.split(','))
Fruit = pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Fruit']]).any()
Color = pd.DataFrame([df1['Summary'].str.contains(key) for key in df2.set_index('Category')['Filters']['Color']]).any()
print(Fruit)
print(Color)
0 True
1 False
2 True
3 True
dtype: bool
0 True
1 True
2 True
3 False
dtype: bool
Then use np.where with Series.str.cat to get your dataframe output:
import numpy as np

df1['Fruit'] = np.where(Fruit, 'Fruit', '')
df1['Color'] = np.where(Color, 'Color', '')
df1['Category'] = df1['Fruit'].str.cat(df1['Color'], sep=', ')
df1 = df1[['Bucket', 'Category', 'Summary']]
print(df1)
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel , Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit, We have a box of plums. They are sour and have...
To generalize to n category filters:
df2['Filters'] = [key.replace(' ', '') for key in df2['Filters']]
df2['Filters'] = df2['Filters'].apply(lambda x: x.split(','))
Categories = [pd.Series(np.where(pd.DataFrame([df1['Summary'].str.contains(key)
                                               for key in df2.set_index('Category')['Filters'][category_filter]]).any(),
                                 category_filter, ''))
              for category_filter in df2['Category']]
df1['Category'] = Categories[0].str.cat(Categories[1:], sep=', ')
df1 = df1.reindex(columns=['Bucket', 'Category', 'Summary'])
print(df1)
Bucket Category Summary
0 basket Fruit, Color This is a basket of red apples. They are sour.
1 bushel , Color We found a bushel of fruit. They are red and s...
2 peck Fruit, Color There is a peck of pears that taste sweet. The...
3 box Fruit, We have a box of plums. They are sour and have...
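The empty matches leave stray separators (", Color", "Fruit,") in Category; one possible cleanup, not part of the original answer, is to strip them afterwards:
df1['Category'] = df1['Category'].str.strip(', ')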
This is my attempt using a regex pattern and pandas string functions. First, the filters are joined with "|" to build a regex pattern with one capture group per category; str.findall then returns a tuple per match with one slot per group, and the index of the non-empty slot identifies the matched category.
import pandas as pd
data1 = {'Bucket':['basket', 'bushel', 'peck', 'box'], 'Summary':['This is a basket of red apples. They are sour.', 'We found a bushel of fruit. They are red and sweet.', 'There is a peck of pears that taste sweet. They are very green.', 'We have a box of plums. They are sour and have a great color.']}
data2 = {'Category':['Fruit', 'Color'], 'Filters':['apple, pear, plum, grape', 'red, purple, green']}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
pat = df2.Filters.str.replace(", ", "|").str.replace("(.*)", "(\\1)").str.cat(sep="|")
found = df1.Summary.str.findall(pat) \
           .apply(lambda x: [i for m in x for i, k in enumerate(m) if k != ""])
## for pandas 0.25 and above
# found= found.explode()
# for pandas below 0.25
found = found.apply(lambda x: pd.Series(x)).unstack().reset_index(level=0, drop=True).dropna()
found.name = "Cat_ID"
result = df1.merge(found, left_index=True, right_index=True) \
.merge(df2["Category"], left_on="Cat_ID", right_index=True).drop("Cat_ID", axis=1)
result = result.groupby(result.index).agg({"Bucket":"min", "Summary": "min", "Category": lambda x: ", ".join(x)})
result
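On newer pandas (0.25+), roughly the same result can be reached more directly by exploding the filter lists and testing substring containment; a sketch, not the original approach:
# one row per (Category, filter word)
df2x = df2.assign(Filters=df2['Filters'].str.split(', ')).explode('Filters')
# for each summary, join the categories whose filter words occur in it
df1['Category'] = [', '.join(df2x.loc[[f in s for f in df2x['Filters']], 'Category'].unique())
                   for s in df1['Summary']]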
Assuming I have two dataframes:
sub = pd.DataFrame(['Little Red', 'Grow Your', 'James Bond', 'Tom Brady'])
text = pd.DataFrame(['Little Red Corvette must Grow Your ego', 'Grow Your Beans', 'James Dean and his Little Red coat', 'I love pasta'])
One contains various subjects, and the other contains text from which I should be able to extract the subjects.
I want the output of text dataframe to be:
Text | Subjects
Little Red Corvette must Grow Your ego | Little Red, Grow Your
Grow Your Beans | Grow Your
James Dean and his Little Red coat | Little Red
I love pasta | NaN
Any idea how I can achieve this?
I was looking at this question: Check if words in one dataframe appear in another (python 3, pandas)
but it is not exactly my desired output. Thank you.
Use str.findall with a pattern built by joining all values of sub with |, wrapping each in regex word boundaries:
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
text['new'] = text[0].str.findall(pat).str.join(', ')
print (text)
0 new
0 Little Red Corvette must Grow Your ego Little Red, Grow Your
1 Grow Your Beans Grow Your
2 James Dean and his Little Red coat Little Red
3 I love pasta
If you want NaN for non-matched values, use loc:
pat = '|'.join(r"\b{}\b".format(x) for x in sub[0])
lists = text[0].str.findall(pat)
m = lists.astype(bool)
text.loc[m, 'new'] = lists.loc[m].str.join(',')
print (text)
0 new
0 Little Red Corvette must Grow Your ego Little Red,Grow Your
1 Grow Your Beans Grow Your
2 James Dean and his Little Red coat Little Red
3 I love pasta NaN
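If the matching should ignore case, str.findall also accepts regex flags, e.g.:
import re
lists = text[0].str.findall(pat, flags=re.IGNORECASE)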
I have a dataframe df containing information about car brands. For instance,
df['Car_Brand'][1]
'HYUNDAI '
where every entry has the same length, len(df['Car_Brand'][1]) == 30. There can also be entries containing only white spaces.
df['Car_Brand']
0 TOYOTA
1 HYUNDAI
2
3
4
5 OPEL
6
7 JAGUAR
where
df['Car_Brand'][2]
' '
I would like to drop all whitespace-only entries from the dataframe and trim the trailing spaces from the others. Finally:
df['Car_Brand'][1]
'HYUNDAI '
becomes
df['Car_Brand'][1]
'HYUNDAI'
I started to remove the white spaces this way:
tmp = df['Car_Brand'].str.replace(" ","")
Use str.strip and convert to bool to filter out the empty ones:
df['Car_Brand'] = df['Car_Brand'].str.strip()
df = df[df['Car_Brand'].astype(bool)]
It seems you need:
s = df['Car_Brand']
s1 = s[s != ''].reset_index(drop=True)
#if multiple whitespaces
#s1 = s[s.str.strip() != ''].reset_index(drop=True)
print (s1)
0 TOYOTA
1 HYUNDAI
2 OPEL
3 JAGUAR
Name: Car_Brand, dtype: object
If the blanks are whitespace rather than empty strings:
s = df[~df['Car_Brand'].str.contains(r'^\s+$')]
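Putting both steps together, trimming first and then dropping the rows that end up empty, a minimal sketch:
df['Car_Brand'] = df['Car_Brand'].str.strip()
df = df[df['Car_Brand'] != '']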