Pandas Dataframe explode List, add new columns and count values - python

I'm a little bit stuck. I have a DataFrame with a list in a column.
id  list
1   [a, b]
2   [a,a,a,b]
3   c,b,b
4   [c,a]
5   [f,f,b]
I have the values, a, b, c, d, e, f in general.
I want to count whether two values appear together in a list, and also whether a value appears more than once in that list.
I want to use that to create a heatmap with all values on the x and y axes, showing the counts where e.g. a appears in a list with itself x times, or a and b appear together x times.
I tried this so far, but it is not exactly the solution I want.
Make new columns and count values
df['a'] = df['list'].explode().str.contains('a').groupby(level=0).any().astype('int')
df['b'] = df['list'].explode().str.contains('b').groupby(level=0).any().astype('int')
df['c'] = df['list'].explode().str.contains('c').groupby(level=0).any().astype('int')
df['d'] = df['list'].explode().str.contains('d').groupby(level=0).any().astype('int')
df['e'] = df['list'].explode().str.contains('e').groupby(level=0).any().astype('int')
df['f'] = df['list'].explode().str.contains('f').groupby(level=0).any().astype('int')
Here I hit the first problem: I create a new df whose rows are named after the list values and which counts the values in the list, but I also get a count when a value appears only once in the list.
make x axis
df_explo = df.explode(['list'],ignore_index=True)
get sum of all
df2=df_explo.groupby(['list']).agg({'a':'sum','b':'sum','c':'sum','d':'sum','e':'sum','f':'sum'}).reset_index()
set index to list
df3 = df2.set_index('list')
create heatmap
sns.heatmap(df3,cmap='RdYlGn_r', linewidths=0.5,annot=True,fmt="d")

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from itertools import combinations
data = [
['a', 'b'],
['a', 'a', 'a', 'b'],
['b', 'b', 'b'],
['c', 'a'],
['f', 'f', 'b']
]
letters = ['a', 'b', 'c', 'd', 'e', 'f']
duplicate_occurrences = pd.DataFrame(0, index=[0], columns=letters)
co_occurrences = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    duplicates = [k for k, v in Counter(l).items() if v > 1]
    for d in duplicates:
        duplicate_occurrences[d] += 1
    co = list(combinations(set(l), 2))
    for a, b in co:
        co_occurrences.loc[a, b] += 1
        co_occurrences.loc[b, a] += 1
plt.figure(figsize=(7, 1))
sns.heatmap(duplicate_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Duplicate Occurrence Counts')
plt.show()
sns.heatmap(co_occurrences, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
plt.title('Co-Occurrence Counts')
plt.show()
The first plot shows how often each letter occurs at least twice in a list, the second shows how often each pair of letters occurs together in a list.
In case you want to plot the duplicate occurrences on the diagonal, you could do it e.g. as follows:
df = pd.DataFrame(0, index=letters, columns=letters)
for l in data:
    for k, v in Counter(l).items():
        if v > 1:
            df.loc[k, k] += 1
    for a, b in combinations(set(l), 2):
        df.loc[a, b] += 1
        df.loc[b, a] += 1
sns.heatmap(df, cmap='RdYlGn_r', linewidths=0.5, annot=True, fmt="d")
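The same combined matrix can also be built without explicit pair loops. The following is a minimal sketch, assuming the same `data` and `letters` as above; the names `counts`, `present`, `repeated` and `matrix` are illustrative and not part of the original answer. It multiplies a presence indicator matrix by its own transpose to get the pair counts, then overwrites the diagonal with the number of lists in which each letter is repeated.

```python
from collections import Counter

import numpy as np
import pandas as pd

data = [
    ['a', 'b'],
    ['a', 'a', 'a', 'b'],
    ['b', 'b', 'b'],
    ['c', 'a'],
    ['f', 'f', 'b'],
]
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# counts.loc[i, x] = how many times letter x appears in list i
counts = (
    pd.DataFrame([dict(Counter(l)) for l in data], columns=letters)
    .fillna(0)
    .astype(int)
)

present = (counts > 0).astype(int)   # 1 if the letter occurs in the list at all
repeated = (counts > 1).astype(int)  # 1 if the letter occurs at least twice

# Off-diagonal entries: number of lists that contain both letters
co = present.to_numpy().T @ present.to_numpy()
# Diagonal entries: number of lists in which the letter is repeated
np.fill_diagonal(co, repeated.sum().to_numpy())

matrix = pd.DataFrame(co, index=letters, columns=letters)
print(matrix)
```

The resulting `matrix` can be passed to `sns.heatmap` in the same way as `df` above.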

Related

insert python list in all rows new pd.Dataframe column

I have python list:
my_list = [1, 'V']
I have pd.Dataframe:
A B C
0 f v b
1 f i n
2 f i m
I need to create new column in my dataframe with value = my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, Python lists can be cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting the type of my_list? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but they did not meet my expectations
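You can try np.repeat, with the repeat count set to the number of rows (taken from the DataFrame's shape). Wrap my_list in an object array and repeat along axis 0, so that each row receives the whole list rather than its flattened elements:
import numpy as np
import pandas as pd
my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
# np.repeat(my_list, n) alone would return 2*n flattened values and fail on assignment
df['new_col'] = np.repeat(np.array([my_list], dtype=object), df.shape[0], axis=0).tolist()
This repeats my_list once for every row of the DataFrame.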
You can do it by creating a new array with my_list through hstack and then forming a new DataFrame. The code below has been tested and works fine.
import numpy as np
import pandas as pd
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1,a2)))
Edit: Another code that has been tested is:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))
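If the goal is the layout shown in the question, with the whole list stored in a single column D, a plain list comprehension may be enough. This is a minimal sketch, assuming the question's column names A, B, C; it creates one copy of my_list per row so that the cells do not share a single list object.

```python
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'], 'B': ['v', 'i', 'i'], 'C': ['b', 'n', 'm']})

# One fresh copy of my_list per row; the outer list has exactly len(df) items,
# so pandas stores each inner list as one cell of the new column.
df['D'] = [list(my_list) for _ in range(len(df))]
print(df)
```

Copying with `list(my_list)` means that mutating one cell later does not silently change the others.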

Pandas Multi-index set value based on three different condition

The objective is to create a new multi-index column based on three conditions on column B.
Condition for B
if B < -1:
    CONDITION_B = 'L'
elif B < 0:
    CONDITION_B = 'l'
else:
    CONDITION_B = 'g'
Naively, I thought we could simply create two different masks and replace the values as suggested
# Handle CONDITION_B='l' and CONDITION_B='g'
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
and then
# CONDITION_B='L'
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
As expected, this will throw an error
TypeError: sequence item 1: expected str instance, bool found
May I know how to handle the three different conditions?
Expected output
ONE TWO
B B
g L
l l
l g
g l
L L
The code to produce the error is
import pandas as pd
import numpy as np
np.random.seed(3)
arrays = [np.hstack([['One']*2, ['Two']*2]) , ['A', 'B', 'A', 'B']]
columns = pd.MultiIndex.from_arrays(arrays)
df= pd.DataFrame(np.random.randn(5, 4), columns=list('ABAB'))
df.columns = columns
idx = pd.IndexSlice
mask_2 = df.loc[:,idx[:,'B']]<0
appenddf_2=mask_2.replace({True:'g',False:'l'}).rename(columns={'A':'iv'},level=1)
mask_33 = df.loc[:,idx[:,'B']]<-0.1
appenddf_2=mask_33.replace({True:'G'}).rename(columns={'A':'iv'},level=1)
IIUC:
np.select() is ideal in this case:
conditions=[
df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),
df.loc[:,idx[:,'B']].lt(-1),
df.loc[:,idx[:,'B']].ge(0)
]
labels=['l','L','g']
out=pd.DataFrame(np.select(conditions,labels),columns=df.loc[:,idx[:,'B']].columns)
OR
via np.where():
s=np.where(df.loc[:,idx[:,'B']].lt(0) & df.loc[:,idx[:,'B']].gt(-1),'l',np.where(df.loc[:,idx[:,'B']].lt(-1),'L','g'))
out=pd.DataFrame(s,columns=df.loc[:,idx[:,'B']].columns)
output of out:
One Two
B B
0 g L
1 l l
2 l g
3 g l
4 L L
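A slightly more defensive variant of the np.select() approach is sketched below. It assumes the same df as in the question (rebuilt here with MultiIndex.from_product, which yields the same column layout); the names `b` and `values` are illustrative. It relies on np.select() picking the first matching condition, so the stricter test comes first, and it passes a string `default` because the implicit default of 0 may not combine with string labels on newer NumPy versions.

```python
import numpy as np
import pandas as pd

np.random.seed(3)
columns = pd.MultiIndex.from_product([['One', 'Two'], ['A', 'B']])
df = pd.DataFrame(np.random.randn(5, 4), columns=columns)

idx = pd.IndexSlice
b = df.loc[:, idx[:, 'B']]   # only the B sub-columns
values = b.to_numpy()

# The first matching condition wins, so test the stricter bound first
conditions = [values < -1, values < 0]
labels = ['L', 'l']

out = pd.DataFrame(
    np.select(conditions, labels, default='g'),  # everything else is 'g'
    index=b.index,
    columns=b.columns,
)
print(out)
```

Unlike the version with three explicit conditions, this also assigns a label when a value is exactly -1.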
I don't fully understand what you want to do but try something like this:
df = pd.DataFrame({'B': [ 0, -1, -2, -2, -1, 0, 0, -1, -1, -2]})
df['ONE'] = np.where(df['B'] < 0, 'l', 'g')
df['TWO'] = np.where(df['B'] < -1, 'L', df['ONE'])
df = df.set_index(['ONE', 'TWO'])
Output result:
>>> df
         B
ONE TWO
g   g    0
l   l   -1
    L   -2
    L   -2
    l   -1
g   g    0
    g    0
l   l   -1
    l   -1
    L   -2

A column in my dataframe does not seem to correspond to the input List (python)

I want to assign one of the columns of my dataframe to a list. I used the code below.
listone = [['a', 'b', 'c'], ['m', 'g'], ['h'], ['y', 't', 'r']]
df['Letter combinations'] = listone
The 'Letter combinations' column in the DataFrame doesn't correspond to the list; instead it seems to assign random elements to each row in the column. I was wondering whether this method indexes the elements differently, causing a change in the order, or whether there is something wrong with my code. Any help would be appreciated!
Edit: Here is my complete code
listone = [[a, b, c], [m, g], [h], [y, t, r]]
numbers = [1, 2, 3, 4]
my_matrix = {'Numbers': numbers}
sample = pd.DataFrame(my_matrix)
sample['Letter combinations'] = listone
sample
My output looks like:
```
Numbers Letter combination
0 1 [b]
1 2 [m, g]
2 3 []
3 4 [r]
```
You need to turn listone into a Series, i.e.:
sample['Letter combinations'] = pd.Series(listone)
sample
Numbers Letter combinations
0 1 [a, b, c]
1 2 [m, g]
2 3 [h]
3 4 [y, t, r]

How to filter dataframe for column with lists contains value [duplicate]

This question already has answers here:
Python & Pandas: How to query if a list-type column contains something?
(7 answers)
Closed 1 year ago.
We have a DataFrame with lists in one column. I couldn't find an easy way to filter the DataFrame for rows whose lists contain a given value.
df = pd.DataFrame({'lists':[['a', 'c'], ['a', 'b', 'd'], ['c', 'd']]})
For example, I need only the rows that contain 'a' in their lists.
I managed to do it only through apply.
df[df.lists.apply(lambda x: True if 'a' in x else False)]
>>> lists
>>>0 [a, c]
>>>1 [a, b, d]
Is there anything like .isin(), but the other way around?
What is the best way to get the needed rows?
Thanks.
The simplest way is to use apply with in:
df1 = df[df.lists.apply(lambda x: 'a' in x)]
Alternatively, build a DataFrame from the lists and check that, although it is a bit more complicated:
df1 = df[pd.DataFrame(df.lists.values.tolist()).eq('a').any(axis=1)]
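Another solution is to use str.join with str.contains (this is a substring match on the joined string, so it is only reliable when no value is contained in another):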
df1 = df[df.lists.str.join(',').str.contains('a')]
print (df1)
lists
0 [a, c]
1 [a, b, d]
Boolean indexing via a list comprehension is one way:
df = pd.DataFrame({'lists':[['a', 'c'], ['a', 'b', 'd'], ['c', 'd']]})
df[['a' in x for x in df['lists'].values]]
# lists
# 0 [a, c]
# 1 [a, b, d]
Some performance benchmarking:
df = pd.DataFrame({'lists':[['a', 'c'], ['a', 'b', 'd'], ['c', 'd']]})
df = pd.concat([df]*100000)
def jez1(df):
    return df[df.lists.apply(lambda x: 'a' in x)]
def jez2(df):
    return df[pd.DataFrame(df.lists.values.tolist()).eq('a').any(axis=1)]
def jez3(df):
    return df[df.lists.str.join(',').str.contains('a')]
def jp(df):
    return df[['a' in x for x in df['lists'].values]]
%timeit jez1(df) # 87ms
%timeit jez2(df) # 122ms
%timeit jez3(df) # 416ms
%timeit jp(df) # 53ms
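For completeness, the same filter can be written with Series.explode, which was not part of the benchmark above and whose speed is not measured here. This is a minimal sketch that assumes the DataFrame has a unique index, since the mask is grouped back by index label (the concatenated benchmark frame has repeated labels and would need a reset_index first).

```python
import pandas as pd

df = pd.DataFrame({'lists': [['a', 'c'], ['a', 'b', 'd'], ['c', 'd']]})

# explode() yields one row per list element and repeats the original index label,
# so grouping by that label tells us whether any element of the list equals 'a'
mask = df['lists'].explode().eq('a').groupby(level=0).any()

filtered = df[mask]
print(filtered)
```

This compares whole elements, so it does not have the substring problem of the str.contains variant.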

Iterating over rows to add items to dictionary

I have a dataframe with a column that contains lists. I want to
A) Find all unique values of the lists
B) Make a dictionary with a format {uniquevalue : [indexA, indexB,...]}, where the indices correspond to the index of a dataframe row that contains uniquevalue.
I have done A, but my code for B creates a dictionary in which every key has all the indexes, regardless of whether the value is contained in the row or not. Could you please help?
import pandas as pd
df = pd.read_excel(io = 'links.xlsx')
unique_list = []
for row in df['relevant_links']:
    row_list = row.split(sep = ', ')
    unique_list.extend(row_list)
unique_set = set(unique_list)
unique_dict = dict.fromkeys(unique_set, [])
print(unique_dict.keys())
row_idx = 0
for row in df['relevant_links']:
    [unique_dict[i].append(row_idx) for i in str(row).split(', ') if i in unique_dict]
    row_idx += 1
I think you can use:
df = pd.DataFrame({'relevant_links':['a, c, v','a, r, e','e, t','e, r']})
print (df)
relevant_links
0 a, c, v
1 a, r, e
2 e, t
3 e, r
#create Series
s = df['relevant_links'].str.split(', ', expand=True).stack()
#groupby by unique links, create list and then dict
unique_dict = s.reset_index(name='val').groupby('val')['level_0'].apply(list).to_dict()
print (unique_dict)
{'v': [0], 't': [2], 'r': [1, 3], 'e': [1, 2, 3], 'a': [0, 1], 'c': [0]}
unique_set = s.unique().tolist()
print (unique_set)
['a', 'c', 'v', 'r', 'e', 't']
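As for why the loop in the question fills every key with all indexes: dict.fromkeys(keys, []) stores the same list object under every key, so each append is visible through all of them. The sketch below first shows that effect on a tiny dictionary (the name `shared` is illustrative) and then a loop-based fix using collections.defaultdict, assuming the sample data from the answer above rather than the original Excel file.

```python
from collections import defaultdict

import pandas as pd

# The pitfall: every key points at one and the same list
shared = dict.fromkeys(['x', 'y'], [])
shared['x'].append(1)
print(shared)  # both keys show the appended value

df = pd.DataFrame({'relevant_links': ['a, c, v', 'a, r, e', 'e, t', 'e, r']})

# defaultdict(list) creates a separate list the first time each key is seen
unique_dict = defaultdict(list)
for row_idx, row in df['relevant_links'].items():
    for link in str(row).split(', '):
        unique_dict[link].append(row_idx)

unique_dict = dict(unique_dict)
print(unique_dict)
```

A dict comprehension such as `{k: [] for k in unique_set}` would avoid the shared list as well, if the original two-pass structure is preferred.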
