Create a matrix from a list of key-value pairs - python

I have a NumPy array of name-value pairs, both of which are strings. Every name and every value can occur multiple times, and I would like to convert the pairs to a binary matrix.
The columns represent the values and the rows represent the keys/names; a cell set to 1 means that particular name-value pair is present.
E.g
I have
A : aa
A : bb
A : cc
B : bb
C : aa
and I want to convert it to
   aa bb cc
A   1  1  1
B   0  1  0
C   1  0  0
I have some code that does this but I was wondering if there is an easier/out of the box way of doing this with numpy or some other library.
This is my code so far:
resources = set(result[:, 1])
resourcesDict = {}
i = 0
for r in resources:
    resourcesDict[r] = i
    i = i + 1
clients = set(result[:, 0])
clientsDict = {}
i = 0
for c in clients:
    clientsDict[c] = i
    i = i + 1
arr = np.zeros((len(clientsDict), len(resourcesDict)), dtype='bool')
for line in result[:, 0:2]:
    arr[clientsDict[line[0]], resourcesDict[line[1]]] = True
and result contains the following:
array([["a","aa"],["a","bb"],..]

I feel that using pandas.DataFrame.pivot is the best way:
>>> df = pd.DataFrame({'foo': ['one','one','one','two','two','two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6]})
>>> df
foo bar baz
0 one A 1
1 one B 2
2 one C 3
3 two A 4
4 two B 5
5 two C 6
Or you can load your pair list using:
>>> df = pd.read_csv('ratings.csv')
Then
>>> df.pivot(index='foo', columns='bar', values='baz')
A B C
one 1 2 3
two 4 5 6
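Applied to the name-value pairs from the question, a related pandas call is pd.crosstab, which counts co-occurrences directly and does not need a separate values column. This is only a sketch, not the pivot call shown above: the column names 'name' and 'value' are made up for illustration, and clip(upper=1) is added so repeated pairs still give 0/1 entries.

```python
import numpy as np
import pandas as pd

# Pairs in the same layout as `result` in the question
result = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                   ['B', 'bb'], ['C', 'aa']])

# 'name' and 'value' are hypothetical column labels
df = pd.DataFrame(result, columns=['name', 'value'])

# crosstab counts each (name, value) pair; clip caps counts at 1
matrix = pd.crosstab(df['name'], df['value']).clip(upper=1)
print(matrix)
```

The row and column labels stay attached to the result, so matrix.loc['B', 'bb'] looks up a pair by name.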

You probably have something like:
m_dict = {'A': ['aa', 'bb', 'cc'], 'B': ['bb'], 'C': ['aa']}
I would go like this:
from collections import defaultdict
res = {}
for k, v in m_dict.items():
    res[k] = defaultdict(int)
    for col in v:
        # index by the individual value, not by the whole list
        res[k][col] = 1
Edit:
given your format, it would probably be more along the lines of:
m_array = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]
res = defaultdict(lambda: defaultdict(int))
for k, v in m_array:
    res[k][v] = 1
which both give:
>>> res['A']['aa']
1
>>> res['B']['aa']
0
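If a dense matrix is needed rather than nested lookups, the nested dictionary can be flattened afterwards. A minimal sketch, assuming the rows and columns should simply be sorted alphabetically (the names rows, cols and matrix are illustrative):

```python
from collections import defaultdict

m_array = [['A', 'aa'], ['A', 'bb'], ['A', 'cc'], ['B', 'bb'], ['C', 'aa']]

res = defaultdict(lambda: defaultdict(int))
for k, v in m_array:
    res[k][v] = 1

rows = sorted(res)                       # row labels (keys)
cols = sorted({v for _, v in m_array})   # column labels (values)

# .get avoids inserting missing keys into the inner defaultdicts
matrix = [[res[r].get(c, 0) for c in cols] for r in rows]
print(matrix)
```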

This is a job for np.unique. It is not clear what format your data is in, but you need to get two 1-D arrays, one with the keys, another with the values, e.g.:
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
['B', 'bb'], ['C', 'aa']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = np.zeros((len(rows), len(cols)), dtype=int)  # np.int was removed in NumPy 1.24
out[row_idx, col_idx] += 1
>>> out
array([[1, 1, 1],
[0, 1, 0],
[1, 0, 0]])
>>> rows
array(['A', 'B', 'C'],
dtype='|S2')
>>> cols
array(['aa', 'bb', 'cc'],
dtype='|S2')
If you have no repeated key-value pairs, this code will work just fine. If there are repetitions, I would suggest abusing scipy's sparse module:
import scipy.sparse as sps
kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)
out = sps.coo_matrix((np.ones_like(row_idx), (row_idx, col_idx))).toarray()
>>> out
array([[1, 2, 1],
[0, 1, 0],
[1, 0, 0]])
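A sketch of two NumPy-only variants for data with repeated pairs, assuming the same kvp layout as above: plain assignment when only presence matters (the binary matrix the question asks for), and np.add.at when the counts are wanted without going through scipy.

```python
import numpy as np

kvp = np.array([['A', 'aa'], ['A', 'bb'], ['A', 'cc'],
                ['B', 'bb'], ['C', 'aa'], ['A', 'bb']])
keys, values = kvp.T
rows, row_idx = np.unique(keys, return_inverse=True)
cols, col_idx = np.unique(values, return_inverse=True)

# Presence only: assigning 1 is unaffected by repeated pairs
binary = np.zeros((len(rows), len(cols)), dtype=int)
binary[row_idx, col_idx] = 1

# Counts: np.add.at accumulates once per occurrence of an index pair
counts = np.zeros((len(rows), len(cols)), dtype=int)
np.add.at(counts, (row_idx, col_idx), 1)

print(binary)
print(counts)
```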

d = {'A': ['aa', 'bb', 'cc'], 'C': ['aa'], 'B': ['bb']}
rows = 'ABC'
cols = ('aa', 'bb', 'cc')
# header line with the column labels
print(' ', ' '.join(cols))
for row in rows:
    print(row, ' ', end=' ')
    for col in cols:
        # 1 if the pair (row, col) exists, else 0
        print(' 1' if col in d.get(row) else ' 0', end=' ')
    print()
>>> aa bb cc
>>> A 1 1 1
>>> B 0 1 0
>>> C 1 0 0

Related

Insert a Python list into every row of a new pd.DataFrame column

I have python list:
my_list = [1, 'V']
I have pd.Dataframe:
A B C
0 f v b
1 f i n
2 f i m
I need to create new column in my dataframe with value = my_list:
A B C D
0 f v b [1, 'V']
1 f i n [1, 'V']
2 f i m [1, 'V']
As far as I understand, Python lists can be cell values, because df.groupby with apply(list) produces them:
df = df.groupby(['A', 'B'], group_keys=True)['C'].apply(list).reset_index(name='H')
A B H
0 f i [n, m]
1 f v [b]
Is it possible without converting the type of my_list? What is the easiest way to do that?
I tried:
df['D'] = my_list
df['D'] = pd.Series(my_list)
but neither did what I expected.
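You can try using np.repeat, with its repeats parameter set to the number of rows, which is available from the shape of the DataFrame. Wrap my_list in a one-element object array first; passed directly, np.repeat flattens the list and repeats each element separately, which gives 2 × rows values and a length-mismatch error on assignment.
my_list = [1, 'V']
df = pd.DataFrame({'col1': ['f', 'f', 'f'], 'col2': ['v', 'i', 'i'], 'col3': ['b', 'n', 'm']})
cell = np.empty(1, dtype=object)  # one slot that holds the whole list
cell[0] = my_list
df['new_col'] = np.repeat(cell, df.shape[0])
This repeats my_list itself once per row of the DataFrame.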
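For the goal stated in the question (the same list in every row), a plain-Python sketch also works without NumPy: build one list per row and assign the resulting list of lists. The column names follow the question's frame; copying with list(my_list) is a choice made here so that rows do not share one list object.

```python
import pandas as pd

my_list = [1, 'V']
df = pd.DataFrame({'A': ['f', 'f', 'f'],
                   'B': ['v', 'i', 'i'],
                   'C': ['b', 'n', 'm']})

# One independent copy of my_list per row; the outer list has len(df) entries
df['D'] = [list(my_list) for _ in range(len(df))]
print(df)
```

Because the outer list has exactly one entry per row, pandas stores each inner list as a single cell value.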
You can do it by creating a new array with my_list through hstack and then forming a new DataFrame. The code below has been tested and works fine.
import numpy as np
import pandas as pd
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.array([1, 'V']).repeat(3).reshape(2,3).transpose()
df = pd.DataFrame(np.hstack((a1,a2)))
Edit: Another code that has been tested is:
import pandas as pd
import numpy as np
a1 = np.array([['f','v','b'], ['f','i','n'], ['f','i','m']])
a2 = np.squeeze(np.dstack((np.array(1).repeat(3), np.array('V').repeat(3))))
df = pd.DataFrame(np.hstack((a1,a2)))

select element in a column A while column B does not have a value

I want to select the products that have no row where x is 0.
Input:
test = pd.DataFrame(
[
['a', 0],
['a', 3],
['a', 4],
['b', 3],
['b', 2],
['c', 1],
['d', 0]
]
)
test.columns = ['product', 'x']
test.query("select distinct (product) where x not in (0) ")
Expected outcome:
b,c
How to do this in both pandas and SQL?
In SQL, you would use:
select product
from t
group by product
having min(x) > 0;
This works assuming x is never negative. A more general formulation is:
having sum(case when x = 0 then 1 else 0 end) = 0
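The same HAVING-style condition can be mirrored in pandas by counting the zero rows per product and keeping the products whose count is 0. This is a sketch of that idea using the question's data; zero_counts and products are illustrative names.

```python
import pandas as pd

test = pd.DataFrame(
    [['a', 0], ['a', 3], ['a', 4], ['b', 3], ['b', 2], ['c', 1], ['d', 0]],
    columns=['product', 'x'],
)

# Per product, count the rows where x == 0 (the SUM(CASE ...) part)
zero_counts = test['x'].eq(0).groupby(test['product']).sum()

# Keep products whose zero count is 0 (the HAVING ... = 0 part)
products = zero_counts[zero_counts == 0].index.tolist()
print(products)
```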
In your case, pandas can do this with isin:
test.loc[~test['product'].isin(test.loc[test.x.eq(0),'product']),'product'].unique()
Out[41]: array(['b', 'c'], dtype=object)
Or do with set
set(test['product'].tolist())-set(test.loc[test.x.eq(0),'product'].tolist())
Out[47]: {'b', 'c'}
If you want to filter your dataframe, you can use groupby with .any():
test[~test.groupby('product')['x'].transform(lambda x: x.eq(0).any())]
Output:
product x
b 3
b 2
c 1
If you only want to see unique values you can add ['product'].unique().tolist() at the end of the code which I pasted above.
Then we have the output:
['b', 'c']

Creating a subset of array from another array : Python

I have a basic question regarding working with arrays:
a= ([ c b a a b b c a a b b c a a b a c b])
b= ([ 0 1 0 1 0 0 0 0 2 0 1 0 2 0 0 1 0 1])
I) Is there a short way to count the number of times 'c' in a corresponds to 0, 1, and 2 in b, 'b' in a corresponds to 0, 1, and 2, and so on?
II) How do I create new arrays c (a subset of a) and d (a subset of b) that contain only those elements where the corresponding element in a is 'c'?
In [10]: p = ['a', 'b', 'c', 'a', 'c', 'a']
In [11]: q = [1, 2, 1, 3, 3, 1]
In [12]: z = list(zip(p, q))
In [13]: z
Out[13]: [('a', 1), ('b', 2), ('c', 1), ('a', 3), ('c', 3), ('a', 1)]
In [14]: counts = {}
In [15]: for pair in z:
    ...:     if pair in counts.keys():
    ...:         counts[pair] += 1
    ...:     else:
    ...:         counts[pair] = 1
    ...:
In [16]: counts
Out[16]: {('a', 1): 2, ('a', 3): 1, ('b', 2): 1, ('c', 1): 1, ('c', 3): 1}
In [17]: sub_p = []
In [18]: sub_q = []
In [19]: for i, element in enumerate(p):
    ...:     if element == 'a':
    ...:         sub_p.append(element)
    ...:         sub_q.append(q[i])
In [20]: sub_p
Out[20]: ['a', 'a', 'a']
In [21]: sub_q
Out[21]: [1, 3, 1]
Explanation
zip takes two lists and runs a figurative zipper between them, resulting in a list of tuples.
I've used a simplistic approach: I'm just maintaining a map/dictionary that makes note of how many times it has seen each char-int tuple.
Then I make 2 sublists, which you can modify to use the character in question and figure out what it maps to.
Alternative methods
As abarnert suggested, you could use a Counter from collections instead.
Or you could just use the count method on z, e.g. z.count(('a', 1)). Or you can use a defaultdict instead.
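The Counter alternative mentioned above could look roughly like this. It is a sketch using the same p and q as the session; the list comprehension for the sublist is an addition here, not part of the original answer.

```python
from collections import Counter

p = ['a', 'b', 'c', 'a', 'c', 'a']
q = [1, 2, 1, 3, 3, 1]

# Counter tallies each (char, int) tuple produced by zip
counts = Counter(zip(p, q))
print(counts[('a', 1)])

# Values of q whose matching entry in p is 'a'
sub_q = [n for ch, n in zip(p, q) if ch == 'a']
print(sub_q)
```

A pair that never occurs simply counts as 0, so no key check is needed before reading counts.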
The questions are a bit vague but here's a quick method (some would call it dirty) using Pandas though I think something written without recourse to Pandas should be preferred.
import pandas as pd
#create OP's lists
a= [ 'c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b= [ 0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]
#dump lists to a Pandas DataFrame
df = pd.DataFrame({'a':a, 'b':b})
Question 1
provided I interpreted it correctly, you can cross-tabulate the two arrays:
pd.crosstab(df.a, df.b).stack(). Cross-tabulate basically counts the number of times each number corresponds to a particular letter. .stack is a command to turn output from .crosstab into a more legible format.
# question 1
pd.crosstab(df.a, df.b).stack()
Out[9]:
a  b
a  0    3
   1    2
   2    2
b  0    4
   1    3
   2    0
c  0    4
   1    0
   2    0
dtype: int64
Question 2
Here, I use Pandas boolean indexing ability to only select the elements in array a that correspond to value 'c'. So df.a=='c' will return True for every value in a that is 'c' and False otherwise. df.loc[df.a=='c','a'] will return values from a for which the boolean statement was true.
c = df.loc[df.a == 'c', 'a']
d = df.loc[df.a == 'c', 'b']
In [15]: c
Out[15]:
0 c
6 c
11 c
16 c
Name: a, dtype: object
In [16]: d
Out[16]:
0 0
6 0
11 0
16 0
Name: b, dtype: int64
Python lists (https://www.tutorialspoint.com/python/python_lists.htm) have a count method.
I suggest you first zip both lists, as said in the comments, and then count the occurrences of the tuple ('c', 1) and the occurrences of the tuple ('c', 0) and sum them up; that is basically what you need for (I).
For (II), if I understood you correctly, you have to take the zipped lists and apply filter to them with lambda x: x[0] == 'c'
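A short sketch of the zip, count and filter steps described above, using the question's data. It assumes the filter should keep pairs whose letter is 'c'; the names pairs, only_c, c and d are illustrative.

```python
a = ['c', 'b', 'a', 'a', 'b', 'b', 'c', 'a', 'a',
     'b', 'b', 'c', 'a', 'a', 'b', 'a', 'c', 'b']
b = [0, 1, 0, 1, 0, 0, 0, 0, 2, 0, 1, 0, 2, 0, 0, 1, 0, 1]

# zip must be materialised as a list before .count can be used
pairs = list(zip(a, b))

# (I) occurrences of specific (letter, number) tuples
c_zero = pairs.count(('c', 0))
c_one = pairs.count(('c', 1))

# (II) keep only pairs whose letter is 'c', then split them back apart
only_c = list(filter(lambda x: x[0] == 'c', pairs))
c = [letter for letter, _ in only_c]
d = [number for _, number in only_c]
print(c_zero, c_one, c, d)
```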

How to replace values in multiple categoricals in a pandas DataFrame

I want to replace certain values in a dataframe containing multiple categoricals.
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
If I apply .replace on a single column, the result is as expected:
>>> df.s1.replace('a', 1)
0 1
1 b
2 c
Name: s1, dtype: object
If I apply the same operation to the whole dataframe, an error is shown (short version):
>>> df.replace('a', 1)
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
During handling of the above exception, another exception occurred:
ValueError: Wrong number of dimensions
If the dataframe contains integers as categories, the following happens:
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
>>> df.replace(1, 3)
s1 s2
0 3 3
1 2 3
2 3 4
But,
>>> df.replace(1, 2)
ValueError: Wrong number of dimensions
What am I missing?
Without digging, that seems to be buggy to me.
My Work Around
pd.DataFrame.apply with pd.Series.replace
This has the advantage that you don't need to mess with changing any types.
df = pd.DataFrame({'s1': [1, 2, 3], 's2': [1, 3, 4]}, dtype='category')
df.apply(pd.Series.replace, to_replace=1, value=2)
s1 s2
0 2 2
1 2 3
2 3 4
Or
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.apply(pd.Series.replace, to_replace='a', value=1)
s1 s2
0 1 1
1 b c
2 c d
#cᴏʟᴅsᴘᴇᴇᴅ's Work Around
df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']}, dtype='category')
df.applymap(str).replace('a', 1)
s1 s2
0 1 1
1 b c
2 c d
The reason for this behavior is that each column has a different set of categories:
In [224]: df.s1.cat.categories
Out[224]: Index(['a', 'b', 'c'], dtype='object')
In [225]: df.s2.cat.categories
Out[225]: Index(['a', 'c', 'd'], dtype='object')
So if you replace with a value that is already a category in both columns, it will work:
In [226]: df.replace('d','a')
Out[226]:
s1 s2
0 a a
1 b c
2 c a
As a solution, you might want to make your columns categorical manually, using:
pd.Categorical(..., categories=[...])
where categories would contain all possible values across all columns.
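A minimal sketch of that suggestion: collect every value that appears in any column and pass the combined list as categories for each column. The name all_values is illustrative, and whether df.replace then behaves as hoped depends on the pandas version, which is not checked here.

```python
import pandas as pd

df = pd.DataFrame({'s1': ['a', 'b', 'c'], 's2': ['a', 'c', 'd']})

# Union of the values seen in any column, in a stable order
all_values = sorted(set(df['s1']).union(df['s2']))

# Give every column the same, complete set of categories
for col in df.columns:
    df[col] = pd.Categorical(df[col], categories=all_values)

print(df['s1'].cat.categories.tolist())
print(df['s2'].cat.categories.tolist())
```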

How to get back the index after groupby in pandas

I am trying to find the record with the maximum value among the first records of each group after a groupby, and delete it from the original dataframe.
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
print(df)
t = df.groupby('item_id').first() #lost track of the index
desired_row = t[t.cost == t.cost.max()]
#delete this row from df
cost
item_id
d 5
I need to keep track of desired_row and delete this row from df and repeat the process.
What is the best way to find and delete the desired_row?
I am not sure of a general way, but this will work in your case since you are taking the first item of each group (it would also easily work on the last). In fact, because of the general nature of split-aggregate-combine, I don't think this is easily achievable without doing it yourself.
gb = df.groupby('item_id', as_index=False)
>>> gb.groups # Index locations of each group.
{'a': [0, 1], 'b': [2, 3, 4], 'c': [5], 'd': [6]}
# Get the first index location from each group using a dictionary comprehension.
subset = {k: v[0] for k, v in gb.groups.items()}
df2 = df.iloc[list(subset.values())]
# These are the first items in each groupby.
>>> df2
cost item_id
0 1 a
5 1 c
2 1 b
6 5 d
# Exclude any items from above where the cost is equal to the max cost across the first item in each group.
>>> df[~df.index.isin(df2[df2.cost == df2.cost.max()].index)]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
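Another way to keep the original index, sketched here as an alternative rather than the method above, is groupby(...).head(1): unlike first(), it returns the selected rows with their original index labels, so idxmax gives a label that can be passed straight to drop. The names first_rows, idx and remaining are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5]})

# First row of each group, with the original index preserved
first_rows = df.groupby('item_id').head(1)

# Index label of the first-row record with the highest cost
idx = first_rows['cost'].idxmax()

# Drop that record from the original frame
remaining = df.drop(index=idx)
print(idx)
print(remaining)
```

To repeat the process, the same three steps can be run again on remaining.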
Try this:
import pandas as pd
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
'cost': [1, 2, 1, 1, 3, 1, 5]})
t=df.drop_duplicates(subset=['item_id'],keep='first')
desired_row = t[t.cost == t.cost.max()]
df[~df.index.isin([desired_row.index[0]])]
Out[186]:
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
Or using a "not in" test (~isin).
Consider this df with a few more rows:
df = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd', 'd', 'd'],
                   'cost': [1, 2, 1, 1, 3, 1, 5, 1, 7]})
df[~df.cost.isin(df.groupby('item_id').first().max().tolist())]
cost item_id
0 1 a
1 2 a
2 1 b
3 1 b
4 3 b
5 1 c
7 1 d
8 7 d
Overview: Create a dataframe from a dictionary. Group by item_id and find the max value. Enumerate over the grouped result and use the numeric position to look up the corresponding item_id label. Create a result_df dataframe if you desire.
df_temp = pd.DataFrame({'item_id': ['a', 'a', 'b', 'b', 'b', 'c', 'd'],
                        'cost': [1, 2, 1, 1, 3, 1, 5]})
grouped = df_temp.groupby(['item_id'])['cost'].max()
rows = []
for key, value in enumerate(grouped):
    index = grouped.index[key]
    rows.append({'item_id': index, 'cost': value})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
result_df = pd.DataFrame(rows, columns=['item_id', 'cost'])
print(result_df.head(5))
