Construct index/membership vectors from lists of items - python

Say you have the following baskets:
basket1 = ['apple', 'orange', 'banana']
basket2 = ['orange', 'grape']
basket3 = ['banana', 'grape', 'kiwi', 'orange']
baskets = [basket1, basket2, basket3]
And your goal is to create the following datastructure:
pd.DataFrame({'apple': {'basket1': 1,'basket2': 0,'basket3': 0 }, 'orange': {'basket1': 1,'basket2': 1,'basket3': 1 }, 'banana': {'basket1': 1,'basket2': 0,'basket3': 1 }, 'grape': {'basket1': 0,'basket2': 1,'basket3': 1 }, 'kiwi': {'basket1': 0,'basket2': 0,'basket3': 1 } })
Which looks like a table of 0/1 flags, with one row per basket and one column per fruit.
I know there's Counter from collections and bincount from numpy, which you could leverage if you just wanted a binary table like the one above, but say you wanted to put some other value at each of these points:
For example, say that instead of a 1, at each point, you wanted to put the weight of the fruit, which you happen to have in another table:
pd.DataFrame({'weight': {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}})
And the result you want is:
pd.DataFrame({'apple': { 'basket1': 3, 'basket2': 0, 'basket3': 0 }, 'orange': { 'basket1': 3, 'basket2': 3, 'basket3': 3 }, 'banana': { 'basket1': 2, 'basket2': 0, 'basket3': 2 }, 'grape': { 'basket1': 0, 'basket2': 1, 'basket3': 1 }, 'kiwi': { 'basket1': 0, 'basket2': 0, 'basket3': 2 } })
How would you go about writing such an operation cleanly? I'm not quite sure how to go about performing this operation efficiently or well.

Assuming you start out with a pd.DataFrame and a dict:
In [37]: df1
Out[37]:
apple banana grape kiwi orange
basket1 1 1 0 0 1
basket2 0 0 1 0 1
basket3 0 1 1 1 1
In [38]: mapper = {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}
Then simply:
In [39]: for colname in df1:
...:     df1[colname] = df1[colname]*mapper[colname]
...:
In [40]: df1
Out[40]:
apple banana grape kiwi orange
basket1 3 2 0 0 3
basket2 0 0 1 0 3
basket3 0 2 1 2 3
Or even more simply, you can multiply a pd.DataFrame by a pd.Series (i.e. a "column" of a dataframe); pandas aligns the Series index with the DataFrame's column labels:
In [5]: df2 = pd.DataFrame({'weight': {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}})
In [6]: mapper = df2.squeeze() # convert to series
In [7]: df1*mapper
Out[7]:
apple banana grape kiwi orange
basket1 3 2 0 0 3
basket2 0 0 1 0 3
basket3 0 2 1 2 3
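The multiplication above relies on label alignment rather than position. A minimal sketch of that behaviour, assuming a cut-down three-fruit table (the names membership, weights and weighted are illustrative, not from the original answer):

```python
import pandas as pd

# Binary membership matrix: rows are baskets, columns are fruits.
membership = pd.DataFrame(
    {"apple": [1, 0, 0], "banana": [1, 0, 1], "orange": [1, 1, 1]},
    index=["basket1", "basket2", "basket3"],
)

# Weights deliberately listed in a different order than the columns.
weights = pd.Series({"orange": 3, "apple": 3, "banana": 2})

# Same as membership * weights: Series labels are matched to column labels.
weighted = membership.mul(weights, axis="columns")
print(weighted)
```

Because the match is by label, the order of the weights does not matter, but a fruit missing from the weights would yield NaN in its column.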
Or starting from scratch:
In [8]: basket1 = ['apple', 'orange', 'banana']
...: basket2 = ['orange', 'grape']
...: basket3 = ['banana', 'grape', 'kiwi', 'orange']
...:
...: baskets = [basket1, basket2, basket3]
...:
In [9]: fruitvolume = {'apple': 3, 'orange':3, 'banana':2, 'grape':1, 'kiwi':2}
Then simply:
In [12]: data = [{item:fruitvolume[item] for item in basket} for basket in baskets]
In [13]: data
Out[13]:
[{'apple': 3, 'banana': 2, 'orange': 3},
{'grape': 1, 'orange': 3},
{'banana': 2, 'grape': 1, 'kiwi': 2, 'orange': 3}]
In [14]: pd.DataFrame(data)
Out[14]:
apple banana grape kiwi orange
0 3.0 2.0 NaN NaN 3
1 NaN NaN 1.0 NaN 3
2 NaN 2.0 1.0 2.0 3
But now you'll have to do some munging...
In [16]: df = pd.DataFrame(data).fillna(0).astype(int)
In [17]: df
Out[17]:
apple banana grape kiwi orange
0 3 2 0 0 3
1 0 0 1 0 3
2 0 2 1 2 3
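If you also want the basket names as row labels, as in the question, one possible variation is to key the baskets by name first. This is only a sketch; the dict-of-lists layout and the names rows and result are assumptions for illustration:

```python
import pandas as pd

baskets = {
    "basket1": ["apple", "orange", "banana"],
    "basket2": ["orange", "grape"],
    "basket3": ["banana", "grape", "kiwi", "orange"],
}
weights = {"apple": 3, "orange": 3, "banana": 2, "grape": 1, "kiwi": 2}

# One inner dict per basket, mapping each fruit it contains to its weight.
rows = {
    name: {fruit: weights[fruit] for fruit in items}
    for name, items in baskets.items()
}

# Fruits absent from a basket come out as NaN, so fill with 0 and cast back to int.
result = pd.DataFrame.from_dict(rows, orient="index").fillna(0).astype(int)
print(result)
```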

Related

How to convert column values into new columns showing frequency

I created a new dataframe by splitting a column and expanding it.
I now want to convert the dataframe to create new columns for every value and only display the frequency of the value.
I wrote an example below.
Example dataframe:
import pandas as pd
import numpy as np
df= pd.DataFrame({0:['cake','fries', 'ketchup', 'potato', 'snack'],
1:['fries', 'cake', 'potato', np.nan, 'snack'],
2:['ketchup', 'cake', 'potatos', 'snack', np.nan],
3:['potato', np.nan,'cake', 'ketchup',np.nan],
'index':['james','samantha','ashley','tim', 'mo']})
df.set_index('index')
Expected output:
output = pd.DataFrame({'cake': [1, 2, 1, 0, 0],
'fries': [1, 1, 0, 0, 0],
'ketchup': [1, 0, 1, 1, 0],
'potatoes': [1, 0, 2, 1, 0],
'snack': [0, 0, 0, 1, 2],
'index': ['james', 'samantha', 'asheley', 'tim', 'mo']})
output.set_index('index')
Based on the description of what you want, you would need a crosstab on the reshaped data:
df2 = df.reset_index().melt('index')
out = pd.crosstab(df2['index'], df2['value'].str.lower())
This, however, doesn't match the provided output.
Output:
value apple berries cake chocolate drink fries fruits ketchup potato potatoes snack
index
Ashley 0 0 0 0 0 0 0 1 1 0 1
James 0 1 1 0 0 1 1 0 0 0 0
Mo 0 0 0 1 0 0 1 1 0 1 0
samantha 1 0 0 1 0 1 0 0 0 0 0
tim 0 0 0 0 1 0 0 0 0 0 1
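For reference, a small self-contained sketch of the melt-then-crosstab idea. The two-row frame and the names wide, long_df and counts are made up for illustration and are not the asker's data:

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    {
        0: ["cake", "fries"],
        1: ["fries", "cake"],
        2: ["ketchup", "cake"],
        3: ["potato", np.nan],
    },
    index=pd.Index(["james", "samantha"], name="person"),
)

# Long format: one row per (person, value) pair; empty cells are dropped.
long_df = wide.reset_index().melt(id_vars="person").dropna(subset=["value"])

# Count how often each value occurs for each person.
counts = pd.crosstab(long_df["person"], long_df["value"])
print(counts)
```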

Creating a new column based on other columns from another dataframe

I have 2 dataframes:
df1
Name Apples Pears Grapes Peachs
James 3 5 5 2
Harry 1 0 2 9
Will 20 2 7 3
df2
Class User Factor
A Harry 3
A Will 2
A James 5
B NaN 4
I want to create a new column in df2 called Total which is a list of all the columns for each user in df1, multiplied by the Factor for that user - this should only be done if they are in Class A.
This is how the final df should look
df2
Class User Factor Total
A Harry 3 [3,0,6,27]
A Will 2 [40,4,14,6]
A James 5 [15,25,25,10]
B NaN 4
This is what I tried:
df2['Total'] = list(df1.Name.isin((df2.User) and (df2.Class==A)) * df2.Factor)
This will do what your question asks:
df2 = df2[df2.Class=='A'].join(df.set_index('Name'), on='User').set_index(['Class','User'])
df2['Total'] = df2.apply(lambda x: list(x * x.Factor)[1:], axis=1)
df2 = df2.reset_index()[['Class','User','Factor','Total']]
Full test code:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=[
x.strip() for x in 'Name Apples Pears Grapes Peachs'.split()], data =[
['James', 3, 5, 5, 2],
['Harry', 1, 0, 2, 9],
['Will', 20, 2, 7, 3]])
print(df)
df2 = pd.DataFrame(columns=[
x.strip() for x in 'Class User Factor'.split()], data =[
['A', 'Harry', 3],
['A', 'Will', 2],
['A', 'James', 5],
['B', np.nan, 4]])
print(df2)
df2 = df2[df2.Class=='A'].join(df.set_index('Name'), on='User').set_index(['Class','User'])
df2['Total'] = df2.apply(lambda x: list(x * x.Factor)[1:], axis=1)
df2 = df2.reset_index()[['Class','User','Factor','Total']]
print(df2)
Input:
Name Apples Pears Grapes Peachs
0 James 3 5 5 2
1 Harry 1 0 2 9
2 Will 20 2 7 3
Class User Factor
0 A Harry 3
1 A Will 2
2 A James 5
3 B NaN 4
Output
Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [40, 4, 14, 6]
2 A James 5 [15, 25, 25, 10]
You can use:
# Columns holding the fruit counts
cols = ['Apples', 'Pears', 'Grapes', 'Peachs']
# First lookup
factor = df2[df2['Class'] == 'A'].set_index('User')['Factor']
df1['Total'] = df1[cols].mul(df1['Name'].map(factor), axis=0).agg(list, axis=1)
# Second lookup
df2['Total'] = df2['User'].map(df1.set_index('Name')['Total'])
Output:
>>> df2
Class User Factor Total
0 A Harry 3 [3, 0, 6, 27]
1 A Will 2 [40, 4, 14, 6]
2 A James 5 [15, 25, 25, 10]
3 B NaN 4 NaN
>>> df1
Name Apples Pears Grapes Peachs Total
0 James 3 5 5 2 [15, 25, 25, 10]
1 Harry 1 0 2 9 [3, 0, 6, 27]
2 Will 20 2 7 3 [40, 4, 14, 6]
One-liner masochists, greetings ;)
df2['Total'] = pd.Series(df1.sort_values(by='Name').reset_index(drop=True).iloc[:,1:5]\
.mul(df2[df2.Class == 'A'].sort_values(by='User')['Factor'].reset_index(drop=True), axis=0)\
.values.tolist())
df2
Output:
  Class   User  Factor        Total
0     A  Harry       3     3,0,6,27
1     A   Will       2  15,25,25,10
2     A  James       5    40,4,14,6
3     B    NaN       4          NaN
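Note that assigning by position after sorting both frames can pair a list with the wrong user when df2 is not itself sorted by User (in the output above, Will and James carry each other's lists). A hedged sketch that looks each user up by name instead; the helper total_for and the variable counts are hypothetical names introduced for this example:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Name": ["James", "Harry", "Will"],
    "Apples": [3, 1, 20],
    "Pears": [5, 0, 2],
    "Grapes": [5, 2, 7],
    "Peachs": [2, 9, 3],
})
df2 = pd.DataFrame({
    "Class": ["A", "A", "A", "B"],
    "User": ["Harry", "Will", "James", np.nan],
    "Factor": [3, 2, 5, 4],
})

counts = df1.set_index("Name")  # fruit counts, looked up by user name

def total_for(cls, user, factor):
    """Return the scaled counts as a list, or NaN when the row does not qualify."""
    if cls != "A" or user not in counts.index:
        return np.nan
    return (counts.loc[user] * factor).tolist()

df2["Total"] = pd.Series(
    [total_for(c, u, f) for c, u, f in zip(df2["Class"], df2["User"], df2["Factor"])],
    index=df2.index,
)
print(df2)
```

This trades the one-liner for an explicit lookup, so the result does not depend on the row order of either frame.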

Count how many times a value occurs in a pandas data frame based on a condition

I'm trying to calculate how many times a value occurs in a specific location inside a data frame.
As an example I use this data frame:
import pandas as pd
d = {'Fruit': ['Apple', 'Apple', 'Apple', 'Onion', 'Onion', 'Onion', 'Onion', 'Pear', 'Pear', 'Pear',
'Pear', 'Pear'],
'Country': ['USA', 'SUI', 'USA', 'SUI', 'USA', 'SUI', 'SUI', 'USA', 'USA', 'USA', 'SUI', 'SUI']}
df = pd.DataFrame(data=d)
I do not understand how I can calculate for example how many Apples there are in USA and SUI and add this to a 'Count' column.
The output should look something like this:
import pandas as pd
d = {'Fruit': ['Apple', 'Apple', 'Apple', 'Onion', 'Onion', 'Onion', 'Onion', 'Pear', 'Pear', 'Pear', 'Pear', 'Pear'],
'Country': ['USA', 'SUI', 'USA', 'SUI', 'USA', 'SUI', 'SUI', 'USA', 'USA', 'USA', 'SUI', 'SUI'],
'Count': [2, 1, 2, 3, 1, 3, 3, 3, 3, 3, 2, 2]}
df = pd.DataFrame(data=d)
I would know how to count the values themselves (how many apples occur in the Fruit column) but not how to add this condition to the calculation.
Thanks for your help in advance.
Try Groupby transform:
df['counts'] = df.groupby(['Fruit', 'Country'])['Country'].transform('size')
df:
Fruit Country counts
0 Apple USA 2
1 Apple SUI 1
2 Apple USA 2
3 Onion SUI 3
4 Onion USA 1
5 Onion SUI 3
6 Onion SUI 3
7 Pear USA 3
8 Pear USA 3
9 Pear USA 3
10 Pear SUI 2
11 Pear SUI 2
You can use groupby followed by a join, like this:
fruit_counts = df.groupby(["Fruit", "Country"]).size().rename("Count")
df.join(fruit_counts, on=["Fruit", "Country"])
Output:
Fruit Country Count
0 Apple USA 2
1 Apple SUI 1
2 Apple USA 2
3 Onion SUI 3
4 Onion USA 1
5 Onion SUI 3
6 Onion SUI 3
7 Pear USA 3
8 Pear USA 3
9 Pear USA 3
10 Pear SUI 2
11 Pear SUI 2
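Both approaches attach the size of each (Fruit, Country) group to every row of that group. A short sketch on a reduced frame (six rows chosen for illustration; via_transform and via_join are hypothetical names) that runs them side by side:

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Apple", "Apple", "Onion", "Onion", "Pear"],
    "Country": ["USA", "SUI", "USA", "SUI", "SUI", "USA"],
})

# transform returns one value per original row, in the original order.
via_transform = df.groupby(["Fruit", "Country"])["Country"].transform("size")

# size() returns one value per group; join broadcasts it back onto the rows.
group_sizes = df.groupby(["Fruit", "Country"]).size().rename("Count")
via_join = df.join(group_sizes, on=["Fruit", "Country"])["Count"]

print(via_transform.tolist())
print(via_join.tolist())
```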

Dataframe: How to group values in a column to create index for pivot

Hi everyone,
I'm super new at this so I'm looking for some help.
Please consider the following dataframe:
fruit sales price
0 lemon .. ..
1 orange .. ..
2 carrot .. ..
3 potato .. ..
4 pineapple .. ..
5 mango .. ..
Let's say that the fruit column can be categorized in the following way:
lemon + orange = citrus;
carrot + potato = tuber;
pineapple + mango = tropical.
After that, I would like to use this new grouping as an index for a pivot table, in order to see the average price or sales in a "citrus/tuber/tropical" segmentation.
In the dataframe I'm trying to apply this logic to, there are too many values to make a dictionary.
Any help would be greatly appreciated :)
You can create a dict for map and use groupby with the mean aggregation:
#sample data
df = pd.DataFrame({
'price': [4, 7, 3, 4, 1, 4],
'sales': [1, 5, 1, 2, 6, 3],
'model': ['lemon', 'orange', 'carrot', 'potato', 'pineapple', 'mango']})
print (df)
model price sales
0 lemon 4 1
1 orange 7 5
2 carrot 3 1
3 potato 4 2
4 pineapple 1 6
5 mango 4 3
#dict of mapping
d1 = {'citrus': ['lemon', 'orange'],
'tuber':['carrot', 'potato'],
'tropical':['pineapple', 'mango']}
#it is necessary to swap values with keys and expand them into a new dict
d = {k: oldk for oldk, oldv in d1.items() for k in oldv}
print (d)
{'pineapple': 'tropical', 'potato': 'tuber', 'mango': 'tropical',
'lemon': 'citrus', 'orange': 'citrus', 'carrot': 'tuber'}
s = df['model'].map(d)
df1 = df.groupby(s)['sales'].mean().reset_index()
print (df1)
model sales
0 citrus 3.0
1 tropical 4.5
2 tuber 1.5
Similar solution with set_index, but then it is necessary to change the column names:
df1 = df.set_index('model').groupby(d)['sales'].mean().reset_index()
df1.columns= ['model','sales']
print (df1)
model sales
0 citrus 3.0
1 tropical 4.5
2 tuber 1.5
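Since the question asks for a pivot table, the mapped category can also be stored as a column and used as the pivot index. A minimal sketch, assuming a column called category and a helper dict item_to_group (both names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "fruit": ["lemon", "orange", "carrot", "potato", "pineapple", "mango"],
    "price": [4, 7, 3, 4, 1, 4],
    "sales": [1, 5, 1, 2, 6, 3],
})
groups = {
    "citrus": ["lemon", "orange"],
    "tuber": ["carrot", "potato"],
    "tropical": ["pineapple", "mango"],
}

# Invert the group -> items dict into an item -> group lookup.
item_to_group = {item: group for group, items in groups.items() for item in items}

# Label each row with its category, then average price and sales per category.
df["category"] = df["fruit"].map(item_to_group)
summary = df.pivot_table(index="category", values=["price", "sales"], aggfunc="mean")
print(summary)
```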

How to transform list of dictionaries with jagged arrays to a DataFrame

Fairly new to Pandas. I'm trying to get a json data set which looks like this
data = [
{'id':1, 'colors':['red', 'blue']},
{'id':2, 'colors':['red', 'blue', 'green']},
{'id':3, 'colors':['orange', 'blue', 'orange']},
]
into a Pandas DataFrame which looks like this
import pandas as pd
df = pd.DataFrame({'id':[1,2,3],
'blue':[1,1,1],
'green':[0,1,0],
'orange':[0,0,2],
'red':[1,1,0]})
df
blue green id orange red
0 1 0 1 0 1
1 1 1 2 0 1
2 1 0 3 2 0
Where the columns are 'id' and the unique colors and the rows are the ids and the count of each color in each original dictionary. How do I do this?
Figured it out.
df = [pd.DataFrame(e) for e in data]
df = pd.concat(df)
df = df.pivot_table(index=['id'], columns=['colors'], aggfunc=len).fillna(0)
df
blue green id orange red
0 1 0 1 0 1
1 1 1 2 0 1
2 1 0 3 2 0
>>> ids = [item['id'] for item in data]
>>> col = [item['colors'] for item in data]
>>> ids = np.repeat(ids, list(map(len, col)))
>>> col = [a for item in col for a in item]
>>> df = pd.DataFrame({'id':ids, 'colors':col})
>>> df
colors id
0 red 1
1 blue 1
2 red 2
3 blue 2
4 green 2
5 orange 3
6 blue 3
7 orange 3
>>> df.groupby(['id', 'colors']).size().unstack().fillna(0)
colors blue green orange red
id
1 1 0 0 1
2 1 1 0 1
3 1 0 2 0
You may call .reset_index() at the end to have id as a column instead of the index.
I would first get the list of all unique colors using set().union() on the colors lists and then create the new dictionary using collections.defaultdict. Example -
In [10]: data = [
....: {'id':1, 'colors':['red', 'blue']},
....: {'id':2, 'colors':['red', 'blue', 'green']},
....: {'id':3, 'colors':['orange', 'blue', 'orange']},
....: ]
In [11]:
In [11]: from collections import defaultdict
In [24]: colorlist = list(set().union(*(x['colors'] for x in data)))
In [25]: colorlist
Out[25]: ['orange', 'blue', 'red', 'green']
In [26]: newd = defaultdict(list)
In [30]: for x in data:
....:     xid = x['id']
....:     newd['id'].append(xid)
....:     for elem in colorlist:
....:         if elem in x['colors']:
....:             newd[elem].append(xid)
....:         else:
....:             newd[elem].append(0)
....:
In [31]: newd
Out[31]: defaultdict(<class 'list'>, {'id': [1, 2, 3], 'orange': [0, 0, 3], 'blue': [1, 2, 3], 'red': [1, 2, 0], 'green': [0, 2, 0]})
In [32]: df = pd.DataFrame(newd)
In [33]: df
Out[33]:
blue green id orange red
0 1 0 1 0 1
1 2 2 2 0 2
2 3 0 3 3 0
You can use a comprehension that builds one dict per record, for a faster alternative:
In [2494]: pd.DataFrame(dict(zip(r['colors']+['id'], [1]*len(r['colors'])+[r['id']]))
for r in data).fillna(0)
Out[2494]:
blue green id orange red
0 1 0.0 1 0.0 1.0
1 1 1.0 2 0.0 1.0
2 1 0.0 3 1.0 0.0
Timings
In [2492]: %timeit pd.DataFrame(dict(zip(r['colors']+['id'], [1]*len(r['colors'])+[r['id']])) for r in data).fillna(0)
1000 loops, best of 3: 863 µs per loop
In [2493]: %timeit pd.concat([pd.DataFrame(e) for e in data]).pivot_table(index=['id'], columns=['colors'], aggfunc=len).fillna(0)
100 loops, best of 3: 7.79 ms per loop
Details
In [2495]: [dict(zip(r['colors']+['id'], [1]*len(r['colors'])+[r['id']])) for r in data]
Out[2495]:
[{'blue': 1, 'id': 1, 'red': 1},
{'blue': 1, 'green': 1, 'id': 2, 'red': 1},
{'blue': 1, 'id': 3, 'orange': 1}]
If using pandas 0.25 and up, explode is a convenience method:
df = (pd.DataFrame(data)
      .assign(v=1)
      .explode('colors')
      .groupby(['id', 'colors']).sum()
      .unstack(1, fill_value=0)
      .reset_index(col_level=1)
      )
df.columns = df.columns.droplevel(0)
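A possible shorter variant combines explode with pd.crosstab, which counts repeated colors and avoids the MultiIndex cleanup. This is a sketch rather than part of the original answers; long_df and counts are illustrative names:

```python
import pandas as pd

data = [
    {"id": 1, "colors": ["red", "blue"]},
    {"id": 2, "colors": ["red", "blue", "green"]},
    {"id": 3, "colors": ["orange", "blue", "orange"]},
]

# One row per (id, color) pair; repeated colors stay as repeated rows.
long_df = pd.DataFrame(data).explode("colors")

# Count each color per id, then turn id back into a regular column.
counts = pd.crosstab(long_df["id"], long_df["colors"]).reset_index()
counts.columns.name = None
print(counts)
```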
