assign hash to row of categorical data in pandas - python

So I have many pandas data frames with 3 columns of categorical variables:
D F False
T F False
D F False
T F False
The first and second columns can each take one of three values and the third is binary, so there is a grand total of 18 possible rows (not all combinations may be represented in each data frame).
I would like to assign a number 1-18 to each row, so that rows with the same combination of factors are assigned the same number and vice versa (no hash collisions).
What is the most efficient way to do this in pandas?
Below, all_combination_df is a data frame with every possible combination of the factors. I am trying to turn a data frame such as big_df into a Series of such unique numbers.
import pandas, itertools

def expand_grid(data_dict):
    """Create a dataframe from every combination of given values."""
    rows = itertools.product(*data_dict.values())
    return pandas.DataFrame.from_records(rows, columns=data_dict.keys())

all_combination_df = expand_grid(
    {'variable_1': ['D', 'A', 'T'],
     'variable_2': ['C', 'A', 'B'],
     'variable_3': [True, False]})

big_df = pandas.concat([all_combination_df, all_combination_df, all_combination_df])

UPDATE: as #user189035 mentioned in the comments, it's much better to use the categorical dtype, as it will save a lot of memory.
I would try to use the factorize method:
In [112]: df['category'] = \
...: pd.Categorical(
...: pd.factorize((df.a + '~' + df.b + '~' + (df.c*1).astype(str)))[0])
...:
In [113]: df
Out[113]:
a b c category
0 A X True 0
1 B Y False 1
2 A X True 0
3 C Z False 2
4 A Z True 3
5 C Z True 4
6 B Y False 1
7 C Z False 2
In [114]: df.dtypes
Out[114]:
a object
b object
c bool
category category
dtype: object
Explanation: this is a simple way to glue all columns into a single series:
In [115]: df.a + '~' + df.b + '~' + (df.c*1).astype(str)
Out[115]:
0 A~X~1
1 B~Y~0
2 A~X~1
3 C~Z~0
4 A~Z~1
5 C~Z~1
6 B~Y~0
7 C~Z~0
dtype: object
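Note that factorize numbers combinations in the order they first appear, so the same combination can receive different codes in different data frames. If you need stable numbers 1-18 everywhere, as the question asks, a minimal sketch (my addition, reusing all_combination_df and big_df from the question) is to merge against the table of all combinations and take its row index as the code:
# build a lookup table: one row per possible combination, coded 1..18
lookup = all_combination_df.reset_index(drop=True).copy()
lookup['code'] = lookup.index + 1
# left-merge any data frame against the lookup to get collision-free codes
coded = big_df.merge(lookup, how='left',
                     on=['variable_1', 'variable_2', 'variable_3'])
codes = coded['code']  # same combination -> same number, in every data frame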

Without taking into account issues of efficiency, this would find duplicate rows and give you a dictionary (similar to the question here).
import pandas as pd, numpy as np

# Define data
d = np.array([["D", "T", "D", "T", "U"],
              ["F", "F", "F", "J", "K"],
              [False, False, False, False, True]])
df = pd.DataFrame(d.T)

# Find and remove duplicate rows
df_nodupe = df[~df.duplicated()]

# Make a dictionary of the unique rows
df_nodupe.T.to_dict('list')
{0: ['D', 'F', 'False'],
1: ['T', 'F', 'False'],
3: ['T', 'J', 'False'],
4: ['U', 'K', 'True']}
Otherwise, you could use map, like so:
import pandas as pd, numpy as np

# Define data
d = np.array([["D", "T", "D", "T", "U"],
              ["F", "F", "F", "J", "K"],
              [False, False, False, False, True]])
df = pd.DataFrame(d.T)
df.columns = ['x', 'y', 'z']

# Define your dictionary of interest
dd = {('D', 'F', 'False'): 0,
      ('T', 'F', 'False'): 1,
      ('T', 'J', 'False'): 2,
      ('U', 'K', 'True'): 3}

# Create a tuple of the row values (list() is needed on Python 3,
# where zip returns an iterator)
df['tupe'] = list(zip(df.x, df.y, df.z))

# Create a new column based on the row values
df['new_category'] = df.tupe.map(dd)
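If you'd rather not type dd by hand, it can be generated from every possible combination with itertools.product; a sketch, where the level lists are assumptions inferred from the sample data (substitute your own factor levels):
import itertools

# assumed factor levels; note the third column holds the strings 'False'/'True',
# because building df from a mixed numpy array coerced the booleans to strings
levels_x = ['D', 'T', 'U']
levels_y = ['F', 'J', 'K']
levels_z = ['False', 'True']
dd = {combo: i for i, combo in
      enumerate(itertools.product(levels_x, levels_y, levels_z))}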

Related

insert python list in all rows new pd.Dataframe column

I have python list:
my_list = [1, 'V']
I have pd.Dataframe:
A B C
0 f v b
1 f i n
2 f i m
I need to create a new column in my dataframe where every value is my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, Python lists can be cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting my_list's type? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but they did not meet my expectations
You can repeat the list once per row; the number of rows can be read from the shape of the dataframe. (Note that np.repeat(my_list, df.shape[0]) repeats each element separately, giving six scalars instead of three lists, so plain list multiplication is used here instead.)
my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
df['new_col'] = [my_list] * df.shape[0]
This repeats my_list as many times as there are rows in the DataFrame.
You can do it by creating a new array with my_list through hstack and then forming a new DataFrame. The code below has been tested and works (note that numpy coerces the 1 to the string '1' here, since an array cannot mix ints and strings):
import numpy as np
import pandas as pd

a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1, a2)))
Edit: Another code that has been tested is:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))
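For comparison, a pandas-only sketch (my addition) that sidesteps numpy's string coercion entirely, since np.array([1, 'V']) turns the 1 into the string '1':
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'], 'B': ['v', 'i', 'i'], 'C': ['b', 'n', 'm']})
# one list object per row; element types are preserved (1 stays an int)
df['D'] = pd.Series([my_list] * len(df), index=df.index)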

Pandas groupby multiple columns, but treat separately (don't want unique combinations)

Say I have the following dataframe
c1  c2  c3
p   x   1
n   x   2
n   y   1
p   y   2
p   y   1
n   x   2
etc. I then want this in the following format:
p  n  x  y
4  5  5  4
i.e., I want to sum column 3 for each group in columns 1 and 2, but I don't want the unique combinations of columns 1 and 2 (which is what grouping by both columns and summing the third would give). Is there any way to do this using groupby?
As Karan said, just call groupby on each of your label columns separately, then concatenate (and transpose) the results:
import pandas as pd

df = pd.DataFrame([['p', 'x', 1],
                   ['n', 'x', 2],
                   ['n', 'y', 1],
                   ['p', 'y', 2],
                   ['p', 'y', 1],
                   ['n', 'x', 2]])
df.columns = ['c1', 'c2', 'c3']

sums1 = df.groupby('c1').sum()
sums2 = df.groupby('c2').sum()
sums = pd.concat([sums1, sums2]).T
sums
n p x y
c3 5 4 5 4
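A compact alternative (my sketch, not from the answer) is to melt both label columns into one, so a single groupby covers them, reusing df from above:
# reshape so the c1 and c2 values share one column, then group once
melted = df.melt(id_vars='c3', value_vars=['c1', 'c2'], value_name='label')
melted.groupby('label')['c3'].sum()
# label
# n    5
# p    4
# x    5
# y    4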

split an array into a list of arrays

How can I split a 2D array by a grouping variable and return a list of arrays (the order is important)?
To show expected outcome, the equivalent in R can be done as
> (A = matrix(c("a", "b", "a", "c", "b", "d"), nr=3, byrow=TRUE)) # input
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
[3,] "b" "d"
> (split.data.frame(A, A[,1])) # output
$a
[,1] [,2]
[1,] "a" "b"
[2,] "a" "c"
$b
[,1] [,2]
[1,] "b" "d"
EDIT: To clarify: I'd like to split the array/matrix A into a list of multiple arrays, based on the unique values in the first column. That is, split A into one array where the first column has an a, and another where the first column has a b.
I have tried a Python equivalent of R's split function, but it gives three arrays:
import numpy as np
import itertools

A = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
b = A[:, 0]

def split(x, f):
    return list(itertools.compress(x, f)), list(itertools.compress(x, (not i for i in f)))

split(A, b)
([array(['a', 'b'], dtype='<U1'),
array(['a', 'c'], dtype='<U1'),
array(['b', 'd'], dtype='<U1')],
[])
I also tried numpy.split, as np.split(A, b), but it needs integers. I thought I might be able to use How to convert strings into integers in Python? to convert the letters to integers, but even when I pass integers, it doesn't split as expected:
c = np.transpose(np.array([1,1,2]))
np.split(A, c) # returns 4 arrays
Can this be done? Thanks.
EDIT: please note that this is a small example, and the number of groups may be greater than two and they may not be ordered.
You can use pandas:
import pandas as pd
import numpy as np
a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
listofdfs = {}
for n, g in pd.DataFrame(a).groupby(0):
    listofdfs[n] = g

listofdfs['a'].values
Output:
array([['a', 'b'],
['a', 'c']], dtype=object)
And,
listofdfs['b'].values
Output:
array([['b', 'd']], dtype=object)
Or, you could use itertools.groupby (note that it only merges consecutive elements, so sort by the key first if the groups aren't contiguous; see the sketch after the output below):
import numpy as np
from itertools import groupby
l = [np.stack(list(g)) for k, g in groupby(a, lambda x: x[0])]
l[0]
Output:
array([['a', 'b'],
['a', 'c']], dtype='<U1')
And,
l[1]
Output:
array([['b', 'd']], dtype='<U1')
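Since itertools.groupby only merges consecutive runs, here is a sketch (my addition) that first sorts the rows by the key column, so non-contiguous groups still come out whole:
import numpy as np

A = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
order = np.argsort(A[:, 0], kind='stable')  # bring equal keys together
A_sorted = A[order]
# the first occurrence of each unique key marks a group boundary
_, first = np.unique(A_sorted[:, 0], return_index=True)
groups = np.split(A_sorted, first[1:])
# [array([['a', 'b'], ['a', 'c']], ...), array([['b', 'd']], ...)]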
If I understand your question, you can do simple slicing, as in:
a = np.array([["a", "b"], ["a", "c"], ["b", "d"]])
x,y=a[:2,:],a[2,:]
x
array([['a', 'b'],
['a', 'c']], dtype='<U1')
y
array(['b', 'd'], dtype='<U1')

Creating a subset of array from another array : Python

I have a basic question regarding working with arrays:
a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
I) Is there a short way to count the number of times 'c' in a corresponds to 0, 1, and 2 in b, how often 'b' in a corresponds to each of them, and so on?
II) How do I create new arrays c (a subset of a) and d (a subset of b) that contain only those elements whose corresponding element in a is 'c'?
In [10]: p = ['a', 'b', 'c', 'a', 'c', 'a']
In [11]: q = [1, 2, 1, 3, 3, 1]
In [12]: z = list(zip(p, q))  # list() needed on Python 3, where zip returns an iterator
In [13]: z
Out[13]: [('a', 1), ('b', 2), ('c', 1), ('a', 3), ('c', 3), ('a', 1)]
In [14]: counts = {}
In [15]: for pair in z:
...: if pair in counts.keys():
...: counts[pair] += 1
...: else:
...: counts[pair] = 1
...:
In [16]: counts
Out[16]: {('a', 1): 2, ('a', 3): 1, ('b', 2): 1, ('c', 1): 1, ('c', 3): 1}
In [17]: sub_p = []
In [18]: sub_q = []
In [19]: for i, element in enumerate(p):
...: if element == 'a':
...: sub_p.append(element)
...: sub_q.append(q[i])
In [20]: sub_p
Out[20]: ['a', 'a', 'a']
In [21]: sub_q
Out[21]: [1, 3, 1]
Explanation
zip takes two lists and runs a figurative zipper between them, resulting in a list of tuples.
I've used a simplistic approach: a map/dictionary that makes note of how many times it has seen each char-int pair.
Then I build two sub-lists, which you can modify to use the character in question and figure out what it maps to.
Alternative methods
As abarnert suggested, you could use a Counter from collections instead.
Or you could just use the count method on z, e.g. z.count(('a', 1)). Or you could use a defaultdict.
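For completeness, a sketch of the Counter variant mentioned above:
from collections import Counter

p = ['a', 'b', 'c', 'a', 'c', 'a']
q = [1, 2, 1, 3, 3, 1]
# Counter tallies each (char, int) pair in one pass
counts = Counter(zip(p, q))
# Counter({('a', 1): 2, ('b', 2): 1, ('c', 1): 1, ('a', 3): 1, ('c', 3): 1})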
The questions are a bit vague, but here's a quick method (some would call it dirty) using Pandas, though something written without recourse to Pandas may be preferable.
import pandas as pd
#create OP's lists
a= [ 'c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b= [ 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
#dump lists to a Pandas DataFrame
df = pd.DataFrame({'a':a, 'b':b})
Question 1
provided I interpreted it correctly, you can cross-tabulate the two arrays with pd.crosstab(df.a, df.b).stack(). Cross-tabulation counts the number of times each number corresponds to a particular letter; .stack turns the output of .crosstab into a more legible format.
#question 1
pd.crosstab(df.a, df.b).stack()
Out[9]:
a b
a 0 3
1 2
2 2
b 0 4
1 3
2 0
c 0 4
1 0
2 0
dtype: int64
Question 2
Here, I use Pandas' boolean indexing to select only the elements of array a that correspond to the value 'c'. df.a == 'c' returns True for every value of a that is 'c' and False otherwise, and df.loc[df.a == 'c', 'a'] returns the values of a for which the boolean condition is True.
c = df.loc[df.a == 'c', 'a']
d = df.loc[df.a == 'c', 'b']
In [15]: c
Out[15]:
0 c
6 c
11 c
16 c
Name: a, dtype: object
In [16]: d
Out[16]:
0 0
6 0
11 0
16 0
Name: b, dtype: int64
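The same boolean-mask idea works directly on NumPy arrays, if you'd rather skip pandas; a minimal sketch using the question's data:
import numpy as np

a_arr = np.array(['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a',
                  'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b'])
b_arr = np.array([0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1])

mask = a_arr == 'c'  # True wherever a is 'c'
c = a_arr[mask]      # array(['c', 'c', 'c', 'c'], dtype='<U1')
d = b_arr[mask]      # array([0, 0, 0, 0])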
Python lists (https://www.tutorialspoint.com/python/python_lists.htm) have a count method.
I suggest you first zip both lists, as said in the comments, then count occurrences of the tuple ('c', 0), of ('c', 1), and of ('c', 2); that is basically what you need for (I).
For (II), if I understood you correctly, take the zipped lists and filter them with lambda pair: pair[0] == 'c', as sketched below.
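A minimal sketch of that zip-and-filter approach, using the question's lists:
a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
pairs = list(zip(a, b))

# (I) how many times 'c' in a lines up with each value of b
c_counts = {k: pairs.count(('c', k)) for k in (0, 1, 2)}  # {0: 4, 1: 0, 2: 0}

# (II) keep only the positions where the element of a is 'c'
kept = list(filter(lambda pair: pair[0] == 'c', pairs))
c_sub = [x for x, _ in kept]  # subset of a
d_sub = [y for _, y in kept]  # subset of b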

How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging, that seems to be buggy to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
#cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for this behavior is the different set of categories for each column:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
so if you replace with a value that already exists in both columns' categories, it works:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution, you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories contains all possible values for all columns...
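A sketch of that fix on the question's data: give both columns one shared category set up front; per the answer's explanation, replacing with any value from that set then works:
import pandas as pd

cats = ['a', 'b', 'c', 'd']  # union of all values the columns may hold
df = pd.DataFrame({
    's1': pd.Categorical(['a', 'b', 'c'], categories=cats),
    's2': pd.Categorical(['a', 'c', 'd'], categories=cats),
})
df.replace('a', 'd')  # no error: 'd' is a known category in both columns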
