Extract name of sum in Python Pandas

I have the following DataFrame called df:
KEY_ID  READY  STEADY     GO
001     Yes    Maybe      123
002     No     Maybe      123
003     Yes    Sometimes  234
004     Yes    Later      234
005     No     Sometimes  345
I use df.count() to see how many times a value is filled in, which is 5 for every column:
KEY_ID    5
READY     5
STEADY    5
GO        5
But I would also like to see how often each value in the column STEADY occurs. I do this with abc = df['STEADY'].value_counts(), which gives me:
Sometimes    2
Maybe        2
Later        1
With a for loop I can extract the counts from abc, which I just created with value_counts():

for i in abc:
    print(i)

However, I tried several methods, including

for i, j in enumerate(abc):
    print(i); print(j)

to also get the names Sometimes, Maybe, Later, since I don't want to type them manually. How do I extract the names that belong to the value_counts() values?
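For what it's worth, the names are the index of the Series that value_counts() returns, so they can be read off directly; a minimal sketch:

abc = df['STEADY'].value_counts()

# The labels live on the Series index
print(abc.index.tolist())  # ['Sometimes', 'Maybe', 'Later'] (tie order may vary)

# Or iterate name/count pairs together
for name, count in abc.items():
    print(name, count)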

Are you looking for groupby()?
import pandas as pd
lst = [['Apple', 1], ['Orange', 1], ['Apple', 2], ['Orange', 1], ['Apple', 3], ['Orange', 1]]
df = pd.DataFrame(lst)
df.columns = ['fruit', 'amount']
df.groupby('fruit').sum()
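For reference, this should produce something like the following (fruit becomes the index):

        amount
fruit
Apple        6
Orange       3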

import pandas as pd
rowdata = [['Apple', 1], ['Orange', 1], ['Apple', 2], ['Orange', 1], ['Apple', 3], ['Orange', 1]]
df = pd.DataFrame(rowdata)
df.groupby(0).sum()

This gives a DataFrame like the one below (column 0 becomes the index):

        1
0
Apple   6
Orange  3

But plain df.sum() gives this instead, because summing the string column concatenates it:

0    AppleOrangeAppleOrangeAppleOrange
1                                    9
dtype: object

I assume the first result is what you're after.

IIUC:
In [339]: df
Out[339]:
     name  val
0   Apple    1
1  Orange    1
2   Apple    2
3  Orange    1
4   Apple    3
5  Orange    1

In [340]: df.groupby('name', as_index=False)['val'].sum()
Out[340]:
     name  val
0   Apple    6
1  Orange    3

In [341]: df.groupby('name', as_index=False)['val'].sum()['name']
Out[341]:
0     Apple
1    Orange
Name: name, dtype: object

In [342]: df.groupby('name', as_index=False)['val'].sum()['name'].tolist()
Out[342]: ['Apple', 'Orange']

It seems you want to filter first by boolean indexing with isin:

print (df)
        A  B
0   Peach  3
1    Pear  6
2   Apple  1
3  Orange  1
4   Apple  2
5  Orange  1
6   Apple  3
7  Orange  1

df1 = df[df['A'].isin(['Apple','Orange'])]
print (df1)
        A  B
2   Apple  1
3  Orange  1
4   Apple  2
5  Orange  1
6   Apple  3
7  Orange  1

Then groupby and aggregate sum:

df2 = df1.groupby('A', as_index=False)['B'].sum()
print (df2)
        A  B
0   Apple  6
1  Orange  3

Another solution is to groupby and aggregate first, and then select only the wanted values by list:

df1 = df.groupby('A')['B'].sum()
df2 = df1.loc[['Apple','Orange']].reset_index()
print (df2)
        A  B
0   Apple  6
1  Orange  3

Related

How to expand dictionaries in rows of pandas dataframe with unique column names?

I have a dataframe with rows as dictionaries as below:

                Col1       A        B
0  {'A': 1, 'B': 23}   apple   carrot
1  {'A': 3, 'B': 35}  banana  spinach

I want to expand Col1 such that the dataframe looks like this:

   Col1.A  Col1.B       A        B
0       1      23   apple   carrot
1       3      35  banana  spinach

How can I do this using pandas in Python? Please let me know if there is another way as well.
I tried using pd.explode, but the new column names end up duplicated. How do I avoid this?
df["Col1.A"] = df["Col1"].map(lambda x: x["A"])
df["Col1.B"] = df["Col1"].map(lambda x: x["B"])
df.drop("Col1", axis=1, inplace=True)
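If some of the dictionaries might lack a key, a hedged tweak (my addition) uses dict.get so missing keys become NaN instead of raising:

df["Col1.A"] = df["Col1"].map(lambda x: x.get("A"))
df["Col1.B"] = df["Col1"].map(lambda x: x.get("B"))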
As a generic method that doesn't require knowledge of the dictionary keys:

df = (pd.json_normalize(df.pop('Col1'))
        .add_prefix('Col1.')
        .join(df))

Or, if you don't want to alter df:

out = (pd.json_normalize(df['Col1'])
         .add_prefix('Col1.')
         .join(df.drop(columns='Col1')))

Output:

   Col1.A  Col1.B       A        B
0       1      23   apple   carrot
1       3      35  banana  spinach
To convert them to columns, you can use:

Col1 = df['Col1'].apply(pd.Series)

Result:

   A   B
0  1  23
1  3  35

Then, if you want, you can add this to your dataframe like this:

Col1.join(df.drop(columns='Col1'), lsuffix='_Col1')

Output:

   A_Col1  B_Col1       A        B
0       1      23   apple   carrot
1       3      35  banana  spinach
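A performance note (my addition): building the expanded frame straight from a list of dicts is usually much faster than apply(pd.Series) on large data; a minimal sketch, assuming Col1 holds plain dicts:

# pd.DataFrame on a list of dicts expands the keys into columns
expanded = pd.DataFrame(df['Col1'].tolist(), index=df.index).add_prefix('Col1.')
out = expanded.join(df.drop(columns='Col1'))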

Pandas, find the number of times a combination of rows appears under a different column ID

I have a dataset that looks as follows:
df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
df
   purchase    item
0         1   apple
1         1  banana
2         2   apple
3         2  banana
4         2    pear
5         3   apple
And I need an output such as:

   item_1  item_2  purchase
0   apple  banana         2
1   apple    pear         1
2  banana    pear         1

A table counting how many times a combination of two fruits was purchased in the same purchase.
In this example's first row the values are apple, banana, 2 because there are two purchases (see column purchase), purchase ID 1 and purchase ID 2, where the person bought both apple and banana. The second row is apple, pear, 1 because there is only one purchase (purchase ID 2) where the person bought both apple and pear.
My code so far:

df = pd.DataFrame({'purchase': [1, 1, 2, 2, 2, 3],
                   'item': ['apple', 'banana', 'apple', 'banana', 'pear', 'apple']})
dummies = pd.get_dummies(df['item'])
df2 = pd.concat([df['purchase'], dummies], axis=1)

This creates a table like this:

   purchase  apple  banana  pear
0         1      1       0     0
1         1      0       1     0
2         2      1       0     0
3         2      0       1     0
4         2      0       0     1
5         3      1       0     0

Now I don't know how to proceed to get the wanted result (and I'm aware my output is far from the wanted one). I tried some group-bys, but it didn't work.
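For what it's worth, the dummy-variable attempt above can be finished with a co-occurrence matrix; a sketch (my addition, not one of the answers below), reusing df2 from the question:

import numpy as np

m = df2.groupby('purchase').max().astype(int)   # one row per purchase, 1 = item bought
co = m.T.dot(m)                                 # item-by-item co-occurrence counts
upper = pd.DataFrame(np.triu(co.to_numpy(), k=1),  # keep each unordered pair once
                     index=co.index, columns=co.columns)
result = (upper.stack()
               .loc[lambda s: s > 0]
               .rename_axis(['item_1', 'item_2'])
               .reset_index(name='purchase'))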
This is probably not the most efficient, but it seems to get the job done:

In [3]: from itertools import combinations

In [4]: combos = df.groupby("purchase")["item"].apply(lambda row: list(combinations(row, 2))).explode().value_counts()

In [5]: combos.reset_index()
Out[5]:
             index  item
0  (apple, banana)     2
1    (apple, pear)     1
2   (banana, pear)     1

From there,

In [6]: pd.DataFrame([[*x, y] for x, y in zip(combos.index, combos)], columns=["item_1", "item_2", "combo_qty"])
Out[6]:
   item_1  item_2  combo_qty
0   apple  banana          2
1   apple    pear          1
2  banana    pear          1
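One caveat worth adding (my note, not part of the answer): combinations() respects the order in which items appear, so (banana, apple) and (apple, banana) would count as different pairs if the items within a purchase are not sorted. A hedged fix normalizes each group first:

combos = (df.groupby("purchase")["item"]
            .apply(lambda row: list(combinations(sorted(set(row)), 2)))
            .explode().dropna().value_counts())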
Here is another take that uses the behavior of join with a duplicated index:

df2 = df.set_index("purchase")
df2 = df2.join(df2, rsuffix="_other") \
         .groupby(["item", "item_other"]) \
         .size().rename("count").reset_index()
result = df2[df2.item < df2.item_other].reset_index(drop=True)
#      item item_other  count
# 0   apple     banana      2
# 1   apple       pear      1
# 2  banana       pear      1
I get around a 10x speedup over the built-in combinations approach in the following benchmark:

import numpy as np

num_orders = 200
max_order_size = 10
num_items = 50
purchases = np.repeat(np.arange(num_orders),
                      np.random.randint(1, max_order_size, num_orders))
items = np.random.randint(1, num_items, size=purchases.size)
test_df = pd.DataFrame({
    "purchase": purchases,
    "item": items,
})
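To reproduce such a comparison, here is a minimal timing sketch over test_df (the function names and wrapping are my additions, the bodies mirror the two answers above):

import time
from itertools import combinations

def with_combinations(data):
    return (data.groupby("purchase")["item"]
                .apply(lambda row: list(combinations(row, 2)))
                .explode().value_counts())

def with_join(data):
    d = data.set_index("purchase")
    d = (d.join(d, rsuffix="_other")
          .groupby(["item", "item_other"])
          .size().rename("count").reset_index())
    return d[d.item < d.item_other].reset_index(drop=True)

for fn in (with_combinations, with_join):
    start = time.perf_counter()
    fn(test_df)
    print(fn.__name__, time.perf_counter() - start)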

How to assign a new descriptive column while concatenating dataframes

I have two data frames that I want to concatenate in Python. However, I want to add another column, type, in order to distinguish which frame each row came from.
Here is my sample data:

import pandas as pd
df = pd.DataFrame({'numbers': [1, 2, 3], 'colors': ['red', 'white', 'blue']},
                  columns=['numbers', 'colors'])
df1 = pd.DataFrame({'numbers': [7, 9, 9], 'colors': ['yellow', 'brown', 'blue']},
                   columns=['numbers', 'colors'])
pd.concat([df, df1])
This code will give me the following result:

   numbers  colors
0        1     red
1        2   white
2        3    blue
0        7  yellow
1        9   brown
2        9    blue
but what I would like to get is as follows:

   numbers  colors    type
0        1     red   first
1        2   white   first
2        3    blue   first
0        7  yellow  second
1        9   brown  second
2        9    blue  second

The type column is going to help me differentiate between the values of the two data frames.
Can anyone help me with this, please?
Use DataFrame.assign for new columns:

df = pd.concat([df.assign(typ='first'), df1.assign(typ='second')])
print (df)
   numbers  colors     typ
0        1     red   first
1        2   white   first
2        3    blue   first
0        7  yellow  second
1        9   brown  second
2        9    blue  second
Using a list comprehension:

df = pd.concat([d.assign(typ=f'id{i}') for i, d in enumerate([df, df1])], ignore_index=True)

   numbers  colors  typ
0        1     red  id0
1        2   white  id0
2        3    blue  id0
3        7  yellow  id1
4        9   brown  id1
5        9    blue  id1
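A third variant (my addition, not from the answers above): pd.concat can label the sources itself via keys=, which lands in the index and can then be moved into a column:

# keys= adds an outer index level named 'typ'; reset_index moves it into a column
df = pd.concat([df, df1], keys=['first', 'second'], names=['typ', None])
df = df.reset_index(level='typ').reset_index(drop=True)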

Find the frequency of words in a dataframe from a list

import pandas as pd

fruits = ['apple', 'banana', 'cherries', 'dragonfruit', 'elderberry']
data = {'name': ['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'],
        'favorite_fruit': ['apple banana cherries', 'banana cherries dragonfruit',
                           'cherries dragonfruit', 'dragonfruit', 'apple elderberry']}
df = pd.DataFrame(data, columns=['name', 'favorite_fruit'])

I want to count the frequency of every fruit in the list within the df.
Expected output (df2):

Fruit       | Frequency
Apple       | 2
Banana      | 2
Cherries    | 3
Dragonfruit | 3
Elderberry  | 1

The code df.favorite_fruit.str.split(expand=True).stack().value_counts() works for a small DataFrame. If df.favorite_fruit contains thousands of rows of different fruit combinations, how do I find only the frequency of the words in the list?
Maybe this is a loophole answer, but you can just filter out the values from the answer you already described. So if you start with this:

>>> df2 = df.favorite_fruit.str.split(expand=True).stack()
>>> df2
0  0          apple
   1         banana
   2       cherries
1  0         banana
   1       cherries
   2    dragonfruit
2  0       cherries
   1    dragonfruit
3  0    dragonfruit
4  0          apple
   1     elderberry
dtype: object

You could use isin to limit the data to the ones in the target list:

>>> target = ['apple', 'banana']
>>> df2[df2.isin(target)].value_counts()
banana    2
apple     2
dtype: int64

Or even after your original answer:

>>> df.favorite_fruit.str.split(expand=True).stack().value_counts().loc[target]
apple     2
banana    2
dtype: int64

If the issue is that the expand and stack operations are costly with that much data, then maybe this won't be satisfactory. But I think it can still be faster than loop-based answers.
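One caveat (my addition): if a word in target never occurs in the data, .loc raises a KeyError on newer pandas, so reindexing with a fill value is safer:

# Missing target words become 0 instead of raising
df.favorite_fruit.str.split(expand=True).stack().value_counts().reindex(target, fill_value=0)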
Perhaps a bit of a roundabout way of doing it, but if your favorite_fruit column is always space delimited, something like this should work:

import pandas as pd

fruits = ['apple', 'banana', 'cherries', 'dragonfruit', 'elderberry']
data = {'name': ['Alpha', 'Bravo', 'Charlie', 'Delta', 'Echo'],
        'favorite_fruit': ['apple banana cherries', 'banana cherries dragonfruit',
                           'cherries dragonfruit', 'dragonfruit', 'apple elderberry']}
df = pd.DataFrame(data, columns=['name', 'favorite_fruit'])

# Tally each word while walking the rows
counts = {}
for _, row in df.iterrows():
    for item in row['favorite_fruit'].split(' '):
        counts[item] = counts.get(item, 0) + 1

new_df = pd.DataFrame({'fruit': list(counts), 'frequency': list(counts.values())})
print(new_df)

This prints out the following:

         fruit  frequency
0        apple          2
1       banana          2
2     cherries          3
3  dragonfruit          3
4   elderberry          1
Try using the explode function after splitting:

df.favorite_fruit.str.split().explode().value_counts()

cherries       3
dragonfruit    3
banana         2
apple          2
elderberry     1
Name: favorite_fruit, dtype: int64
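And to limit the result to the fruits in the list, as originally asked, a hedged variant (my addition) chains isin onto the exploded Series, with fruits being the list from the question:

(df.favorite_fruit.str.split().explode()
   .loc[lambda s: s.isin(fruits)]
   .value_counts())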

Pandas DataFrame from list/dict/list

I have some data in this form:

a = [{'table': 'a', 'field': ['apple', 'pear']},
     {'table': 'b', 'field': ['grape', 'berry']}]

I want to create a dataframe that looks like this:

   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b

When I try this:

pd.DataFrame.from_records(a)

I get this:

            field table
0   [apple, pear]     a
1  [grape, berry]     b

I'm using a loop to restructure my original data, but I think there must be a more straightforward and simpler method.
You can use a list comprehension to concatenate a series of dataframes, one for each dictionary in a.

>>> pd.concat([pd.DataFrame({'table': d['table'],  # Per @piRSquared, for simplification.
...                          'field': d['field']})
...            for d in a]).reset_index(drop=True)
   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b
Option 1
comprehension

pd.DataFrame([{'table': d['table'], 'field': f} for d in a for f in d['field']])

   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b

Option 2
reconstruct

import numpy as np

d1 = pd.DataFrame(a)
pd.DataFrame(dict(
    table=d1.table.repeat(d1.field.str.len()),
    field=np.concatenate(d1.field)
)).reset_index(drop=True)

   field table
0  apple     a
1   pear     a
2  grape     b
3  berry     b

Option 3
Rubik's Cube

pd.DataFrame(a).set_index('table').field.apply(pd.Series) \
    .stack().reset_index('table', name='field').reset_index(drop=True)

  table  field
0     a  apple
1     a   pear
2     b  grape
3     b  berry
Or you can try using pd.wide_to_long. I would like to use lreshape, but it is undocumented and I personally don't recommend it... T_T

a = [{'table': 'a', 'field': ['apple', 'pear']},
     {'table': 'b', 'field': ['grape', 'berry']}]
df = pd.DataFrame.from_records(a)
df[['Field1', 'Field2']] = df.field.apply(pd.Series)
pd.wide_to_long(df, ['Field'], 'table', 'lol').reset_index().drop('lol', axis=1).sort_values('table')

Out[74]:
  table  Field
0     a  apple
2     a   pear
1     b  grape
3     b  berry
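As a side note (my addition, not part of the answers above): on pandas 0.25+, DataFrame.explode turns this into a one-liner. A minimal sketch:

import pandas as pd

a = [{'table': 'a', 'field': ['apple', 'pear']},
     {'table': 'b', 'field': ['grape', 'berry']}]

# explode unpacks each list element of 'field' into its own row
out = pd.DataFrame(a).explode('field').reset_index(drop=True)
print(out)
#   table  field
# 0     a  apple
# 1     a   pear
# 2     b  grape
# 3     b  berry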
