This is my first question on Stack Overflow.
I have a pandas DataFrame like this:
       a  b  c  d
one    0  1  2  3
two    4  5  6  7
three  8  9  0  1
four   2  1  1  5
five   1  1  8  9
I want to extract the (column name, value) pairs whose value is 1, keeping each row as a separate list:
[ [(b,1.0)], [(d,1.0)], [(b,1.0),(c,1.0)], [(a,1.0),(b,1.0)] ]
I want to use the gensim Python library, which requires a corpus in this form.
Is there any smart way to do this, or to apply gensim directly to pandas data?
Many gensim functions accept numpy arrays, so there may be a better way...
In [11]: is_one = np.where(df == 1)
In [12]: is_one
Out[12]: (array([0, 2, 3, 3, 4, 4]), array([1, 3, 1, 2, 0, 1]))
In [13]: df.index[is_one[0]], df.columns[is_one[1]]
Out[13]:
(Index([u'one', u'three', u'four', u'four', u'five', u'five'], dtype='object'),
Index([u'b', u'd', u'b', u'c', u'a', u'b'], dtype='object'))
To group by row, you could use iterrows:
from itertools import repeat
In [21]: [list(zip(df.columns[np.where(row == 1)], repeat(1.0)))
    ...:  for label, row in df.iterrows()
    ...:  if 1 in row.values]  # if you don't want empty [] for rows without 1
Out[21]:
[[('b', 1.0)],
[('d', 1.0)],
[('b', 1.0), ('c', 1.0)],
[('a', 1.0), ('b', 1.0)]]
In Python 2 the list call is not required, since zip already returns a list.
Another way would be
In [1652]: [[(c, 1) for c in x[x].index] for _, x in df.eq(1).iterrows() if x.any()]
Out[1652]: [[('b', 1)], [('d', 1)], [('b', 1), ('c', 1)], [('a', 1), ('b', 1)]]
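For larger frames, a variant based on stack avoids iterrows entirely. A minimal sketch, assuming the same df as above:

s = df.stack()    # long format: (row, column) -> value
ones = s[s == 1]  # keep only the entries equal to 1
corpus = [[(col, 1.0) for col in grp.index.get_level_values(1)]
          for _, grp in ones.groupby(level=0, sort=False)]
# [[('b', 1.0)], [('d', 1.0)], [('b', 1.0), ('c', 1.0)], [('a', 1.0), ('b', 1.0)]]

sort=False keeps the groups in row order, and rows without any 1 (like 'two') simply produce no group.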
Related
I need to do something very similar to this question: Pandas convert dataframe to array of tuples
The difference is I need to get not only a single list of tuples for the entire DataFrame, but a list of lists of tuples, sliced based on some column value.
Supposing this is my data set:
   t_id  A     B
0  AAAA  1   2.0
1  AAAA  3   4.0
2  AAAA  5   6.0
3  BBBB  7   8.0
4  BBBB  9  10.0
...
I want to produce as output:
[[(1,2.0), (3,4.0), (5,6.0)],[(7,8.0), (9,10.0)]]
That is, one list for 'AAAA', another for 'BBBB' and so on.
I've tried with two nested for loops. It seems to work, but it is taking too long (actual data set has ~1M rows):
result = []
for t in df['t_id'].unique():
    tuple_list = []
    for x in df[df['t_id'] == t].iterrows():
        row = x[1][['A', 'B']]
        tuple_list.append(tuple(row))
    result.append(tuple_list)
Is there a faster way to do it?
You can group by column t_id, iterate through the groups, and convert each sub-DataFrame into a list of tuples:
[g[['A', 'B']].to_records(index=False).tolist() for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]
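For reference, a self-contained version with the sample data above (column names taken from the question):

import pandas as pd

df = pd.DataFrame({'t_id': ['AAAA', 'AAAA', 'AAAA', 'BBBB', 'BBBB'],
                   'A': [1, 3, 5, 7, 9],
                   'B': [2.0, 4.0, 6.0, 8.0, 10.0]})
result = [g[['A', 'B']].to_records(index=False).tolist()
          for _, g in df.groupby('t_id')]
# [[(1, 2.0), (3, 4.0), (5, 6.0)], [(7, 8.0), (9, 10.0)]]

Since to_records builds the tuples without a Python-level loop over rows, this is typically much faster than nested iterrows on ~1M rows.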
I think this should work too:
import pandas as pd
import itertools
df = pd.DataFrame({"A": [1, 2, 3, 1], "B": [2, 2, 2, 2], "C": ["A", "B", "C", "B"]})
tuples_in_df = sorted(tuple(df.to_records(index=False)), key=lambda x: x[0])
output = [[tuple(x)[1:] for x in group] for _, group in itertools.groupby(tuples_in_df, lambda x: x[0])]
print(output)
Out:
[[(2, 'A'), (2, 'B')], [(2, 'B')], [(2, 'C')]]
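Note that the sorted call matters here: itertools.groupby only groups consecutive items with equal keys, so the records must be sorted by the grouping key first.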
I have a numpy array that looks something like this:
h = array([['string1', 1],
           ['string2', 1],
           ['string3', 1],
           ['string4', 3],
           ['string5', 4],
           ['string6', 2],
           ['string7', 2],
           ['string8', 4],
           ['string9', 3],
           ['string0', 2]])
In the second column, I would like to change all occurrences of 1 to 3, all occurrences of 3 to 2, and all occurrences of 4 to 1.
Obviously, if I try to do the replacements sequentially in place, I will get the wrong result, because
h[h[:, 1] == 1, 1] = 3
h[h[:, 1] == 3, 1] = 2
will change all the original 1's into 2's.
The matrix can be up to 50,000 elements long, and the values to change might vary.
I was looking at a similar question here, but it was turning all digits to 0, and the answers were specific to that.
Is there a way to simultaneously change all these occurrences or am I going to have to find another way?
You can use a look-up table and advanced indexing:
import numpy as np

A = np.rec.fromarrays([np.array("The quick brown fox jumps over the lazy dog .".split()), np.array([1,1,1,3,4,2,2,4,3,2])])
A
# rec.array([('The', 1), ('quick', 1), ('brown', 1), ('fox', 3),
# ('jumps', 4), ('over', 2), ('the', 2), ('lazy', 4), ('dog', 3),
# ('.', 2)],
# dtype=[('f0', '<U5'), ('f1', '<i8')])
LU = np.arange(A['f1'].max()+1)
LU[[1,3,4]] = 3,2,1
A['f1'] = LU[A['f1']]
A
# rec.array([('The', 3), ('quick', 3), ('brown', 3), ('fox', 2),
# ('jumps', 1), ('over', 2), ('the', 2), ('lazy', 1), ('dog', 2),
# ('.', 2)],
# dtype=[('f0', '<U5'), ('f1', '<i8')])
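The same idea on a plain 1-D integer array, as a minimal sketch (assuming the values are small non-negative integers, so the table stays small):

import numpy as np

v = np.array([1, 1, 1, 3, 4, 2, 2, 4, 3, 2])
LU = np.arange(v.max() + 1)   # identity mapping 0, 1, ..., max
LU[[1, 3, 4]] = 3, 2, 1       # remap 1 -> 3, 3 -> 2, 4 -> 1
v = LU[v]                     # one advanced-indexing pass applies the table
# array([3, 3, 3, 2, 1, 2, 2, 1, 2, 2])

Because the whole table is built before any value is replaced, there is no risk of the replacements chaining into each other the way sequential in-place assignment does.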
You can either use map directly, or use the more efficient numpy.vectorize to turn a mapping function into a function that can be applied to the array directly:
import numpy as np
mapping = {
1: 3,
3: 4,
4: 1
}
a = np.array([1, 2, 3, 4, 5, 1, 2, 3, 4, 5])
mapping_func = np.vectorize(lambda x: mapping[x] if x in mapping else x)
b = mapping_func(a)
print(a)
print(b)
Result:
[1 2 3 4 5 1 2 3 4 5]
[3 2 4 1 5 3 2 4 1 5]
Note that you don't have to use a dict or a lambda function. Your function could be any normal function that takes the data type of your source array as input (int in this case) and returns the data type of the target array.
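For instance, a plain named function works just as well. A small sketch (remap is a name introduced here for illustration):

def remap(x):
    # return the replacement for x, or x itself if it has no mapping
    return mapping.get(x, x)

b = np.vectorize(remap)(a)
# [3 2 4 1 5 3 2 4 1 5]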
The best way to do it is to use a dict to map values. Doing so requires a vectorized function:
import numpy as np
a = np.array([[1, 1], [1, 2], [1, 3]])
>>> a
array([[1, 1],
[1, 2],
[1, 3]])
dic = {3:2,2:3}
vfunc = np.vectorize(lambda x:dic[x] if x in dic else x)
a[:,1] = vfunc(a[:,1])
>>> a
array([[1, 1],
[1, 3],
[1, 2]])
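Be aware that np.vectorize is essentially a Python-level loop under the hood, so it gains little speed on large arrays. A sketch of a fully vectorized alternative for the same replacement with np.select, assuming we start again from the original a:

col = a[:, 1]
a[:, 1] = np.select([col == 3, col == 2], [2, 3], default=col)

np.select evaluates all conditions against the original values before writing anything back, so the swaps cannot chain into each other.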
I'm working with the pandas module in Python. In my script I'm trying to take each value of a DataFrame and transform it into another DataFrame in which every value is replaced by a run of 1's, repeated as many times as the value, like:
   A  B
a  1  2
b  2  3
to:
   A  B
a  1  1
a     1
b  1  1
b  1  1
b     1
The problem is that these aren't fixed values. I want the script to take a "model" created by me and output something like the results above. Moreover, the DataFrame can have anywhere from 1 to 30 columns, and likewise for rows.
Here is a solution based on standard 2D lists. Adaptation to pandas' DataFrame is straightforward:
lst = [['a', 1, 2, 2], ['b', 2, 3, 2], ['c', 4, 0, 2]]
table = []
for cols in lst:
    # size is the longest run needed for this row's counts
    name, size = cols[0], max(cols[1:])
    # turn each count into a column of 1's padded with 0's up to size
    row = [[1]*col + [0]*(size-col) for col in cols[1:]]
    table.extend(list(zip([name]*size, *row)))
Here is the final content of table:
>>> from pprint import pprint
>>> pprint(table)
[('a', 1, 1, 1),
('a', 0, 1, 1),
('b', 1, 1, 1),
('b', 1, 1, 1),
('b', 0, 1, 0),
('c', 1, 0, 1),
('c', 1, 0, 1),
('c', 1, 0, 0),
('c', 1, 0, 0)]
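As claimed above, moving this to pandas is straightforward. A hedged sketch, with the column labels 'A', 'B', 'C' invented here for illustration:

import pandas as pd

df = pd.DataFrame(table, columns=['name', 'A', 'B', 'C']).set_index('name')

Each tuple in table becomes one row, with the repeated name as the index label.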
What is the best way to get all pairwise combinations of a bunch of indexes? We are looking to do this with the intention of running chi-squared tests; I might be re-inventing the wheel here. So for the following DataFrame
index  value
a        1.0
b        2.0
c        4.0
I would want to get the following out
group  value
a,b      3.0
b,c      6.0
c,a      5.0
First you need to import itertools:
import itertools
In [32]:
indices = [indices[0] + ',' + indices[1] for indices in list(itertools.combinations(df.index , 2))]
indices
Out[32]:
['a,b', 'a,c', 'b,c']
In [31]:
values = [values[0] + values[1] for values in list(itertools.combinations(df.value , 2))]
values
Out[31]:
[3.0, 5.0, 6.0]
In [36]:
pd.DataFrame(data = values , index=indices , columns=['values'])
Out[36]:
values
a,b 3
a,c 5
b,c 6
In my opinion you should use combinations from itertools.
>>> from itertools import combinations
>>> datas = {'a': 1, 'b': 2, 'c': 3}
>>> list(combinations(datas.keys(), 2))
[('a', 'c'), ('a', 'b'), ('c', 'b')]
>>> index_combination = combinations(datas.keys(), 2)
>>> for indexes in index_combination:
... print indexes , sum([datas[index] for index in indexes])
...
('a', 'c') 4
('a', 'b') 3
('c', 'b') 5
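The same idea in Python 3 syntax (print as a function; note that dict key order differs between interpreter versions, so the pair order may vary):

from itertools import combinations

datas = {'a': 1, 'b': 2, 'c': 3}
for indexes in combinations(datas, 2):
    print(indexes, sum(datas[index] for index in indexes))
# ('a', 'b') 3
# ('a', 'c') 4
# ('b', 'c') 5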
Looking for a fast way to get a row of a pandas DataFrame into an OrderedDict without using a list. Lists are fine, but with large data sets it will take too long. I am using the fiona GIS reader, and the rows are OrderedDicts with the schema giving the data type. I use pandas to join data. In many cases the rows will have different types, so I was thinking that turning it into a numpy array with type string might do the trick.
This is implemented in pandas 0.21.0+ in the to_dict function with the into parameter:
from collections import OrderedDict
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
print (df)
a b
0 1 2
1 3 4
d = df.to_dict(into=OrderedDict, orient='index')
print (d)
OrderedDict([(0, OrderedDict([('a', 1), ('b', 2)])), (1, OrderedDict([('a', 3), ('b', 4)]))])
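Individual rows can then be looked up by index label, for example:

d[0]        # OrderedDict([('a', 1), ('b', 2)])
d[0]['b']   # 2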
Unfortunately you can't just do an apply (since it fits the result back into a DataFrame):
In [1]: df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
In [2]: df
Out[2]:
a b
0 1 2
1 3 4
In [3]: from collections import OrderedDict
In [4]: df.apply(OrderedDict)
Out[4]:
a b
0 1 2
1 3 4
But you can use a list comprehension with iterrows:
In [5]: [OrderedDict(row) for i, row in df.iterrows()]
Out[5]: [OrderedDict([('a', 1), ('b', 2)]), OrderedDict([('a', 3), ('b', 4)])]
If whatever you're passing the result to can accept a generator rather than a list, this will usually be more efficient:
In [6]: (OrderedDict(row) for i, row in df.iterrows())
Out[6]: <generator object <genexpr> at 0x10466da50>
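For example, the generator can be consumed one row at a time without materializing the whole list (handle_row is a hypothetical consumer, named here for illustration):

for d in (OrderedDict(row) for _, row in df.iterrows()):
    handle_row(d)  # hypothetical: process each row dict as it is produced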