Checking which rows contain a value efficiently

I am trying to write a function that checks for the presence of a value in a row across columns. I have a script that does this by iterating through columns, but I am worried that this will be inefficient when used on large datasets.
Here is my current code:
import pandas as pd
a = [1, 2, 3, 4]
b = [2, 3, 3, 2]
c = [5, 6, 1, 3]
d = [1, 0, 0, 99]
df = pd.DataFrame({'a': a,
                   'b': b,
                   'c': c,
                   'd': d})
cols = ['a', 'b', 'c', 'd']
df['e'] = False
for col in cols:
    # parentheses matter: == binds more loosely than the arithmetic/bitwise operators
    df['e'] = df['e'] | (df[col] == 1)
print(df)
result:
a b c d e
0 1 2 5 1 True
1 2 3 6 0 False
2 3 3 1 0 True
3 4 2 3 99 False
As you can see, column e keeps record of whether the value "1" exists in that row. I was wondering if there was a better/more efficient way of achieving these results.

You can compare every value in the data frame with 1 and then check whether any comparison in each row is true (axis=1):
df['e'] = df.eq(1).any(axis=1)
df
# a b c d e
#0 1 2 5 1 True
#1 2 3 6 0 False
#2 3 3 1 0 True
#3 4 2 3 99 False
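If only some columns should be searched, the same idea can be wrapped in a small helper. This is a minimal sketch: the function name rows_containing and its cols parameter are made up for illustration and are not part of pandas.

```python
import pandas as pd

def rows_containing(df, value, cols=None):
    """Return a boolean Series that is True where any selected column equals value."""
    # Hypothetical helper: `cols` limits the search to a subset of columns.
    subset = df if cols is None else df[cols]
    return subset.eq(value).any(axis=1)

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [2, 3, 3, 2],
                   'c': [5, 6, 1, 3],
                   'd': [1, 0, 0, 99]})

# Restrict the check to the original data columns, then store the result.
mask = rows_containing(df, 1, cols=['a', 'b', 'c', 'd'])
df['e'] = mask
```

Passing the column list explicitly also keeps the result column e itself out of the comparison if the line is run a second time.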

Python supports 'in', and 'not in'.
EXAMPLE:
>>> a = [1, 2, 5, 1]
>>> b = [2, 3, 6, 0]
>>> c = [5, 6, 1, 3]
>>> d = [1, 0, 0, 99]
>>> 1 in a
True
>>> 1 not in a
False
>>> 99 in d
True
>>> 99 not in d
False
By using this, you don't have to iterate over the array by yourself for this case.
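One caveat when carrying this over to pandas: on a Series, the in operator tests the index labels rather than the values. The short sketch below uses a made-up Series named row to show the difference.

```python
import pandas as pd

row = pd.Series([5, 6, 7])     # index labels are 0, 1, 2

label_hit = 1 in row           # True: 1 is an index label, not a value
value_miss = 5 in row          # False: 5 is a value, but not an index label
value_hit = 5 in row.values    # True: searches the underlying values
```

So for a membership test on the data itself, use row.values (or a comparison such as row.eq(5).any()) rather than in on the Series.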

Related

Pandas DataFrame filter by multiple column criterias and multiple intervals

I have checked several answers but found no luck so far.
My dataset is like this:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
Location Place Value1 Value2
A 1 1 1
A 2 1 1
A 3 2 2
B 4 3 3
C 2 4 4
C 3 5 5
and I have a list of intervals:
A: [0, 1]
A: [3, 5]
B: [1, 3]
C: [1, 4]
C: [6, 10]
Now I want to keep only the rows whose Place falls within one of the intervals listed for their Location. So the desired output will be:
Location Place Value1 Value2
A 1 1 1
A 3 2 2
C 2 4 4
C 3 5 5
I know that I can chain multiple between conditions with |, but I have a really long list of intervals, so entering the conditions manually is not feasible. I also considered a for loop that slices the data by location first, but I think there could be a more efficient way.
Thank you for your help.
Edit: Currently the list of intervals is just strings like this
A 0 1
A 3 5
B 1 3
C 1 4
C 6 10
but I would like to slice them into list of dicts. Better structure for it is also welcome!

First define dataframe df and filters dff:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
dff = pd.DataFrame({'Location':['A','A','B','C','C'],
'fPlace':[[0,1], [3, 5], [1, 3], [1, 4], [6, 10]]})
dff[['p1', 'p2']] = pd.DataFrame(dff["fPlace"].to_list())
now dff is:
Location fPlace p1 p2
0 A [0, 1] 0 1
1 A [3, 5] 3 5
2 B [1, 3] 1 3
3 C [1, 4] 1 4
4 C [6, 10] 6 10
where fPlace has been split into a lower bound p1 and an upper bound p2, which together define the filter to apply to Place. Next:
df.merge(dff).query('Place >= p1 and Place <= p2').drop(columns = ['fPlace','p1','p2'])
result:
Location Place Value1 Value2
0 A 1 1 1
5 A 3 2 2
7 C 2 4 4
9 C 3 5 5
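Since the question's edit says the intervals currently exist only as strings such as "A 0 1", here is a minimal sketch of building the filter frame directly from that text. The variable name raw and the assumption of whitespace-separated fields are illustrative, not taken from the question.

```python
import pandas as pd

# Assumed input format: one interval per line, "Location lower upper".
raw = """A 0 1
A 3 5
B 1 3
C 1 4
C 6 10"""

rows = [line.split() for line in raw.splitlines()]
dff = pd.DataFrame(rows, columns=['Location', 'p1', 'p2'])
dff = dff.astype({'p1': int, 'p2': int})   # bounds were parsed as strings

df = pd.DataFrame({
    'Location': ['A', 'A', 'A', 'B', 'C', 'C'],
    'Place': [1, 2, 3, 4, 2, 3],
    'Value1': [1, 1, 2, 3, 4, 5],
    'Value2': [1, 1, 2, 3, 4, 5],
})

# Same merge-then-filter idea as above, without the intermediate fPlace column.
out = (df.merge(dff, on='Location')
         .query('Place >= p1 and Place <= p2')
         .drop(columns=['p1', 'p2'])
         .reset_index(drop=True))
```

Note that a row matching two overlapping intervals would appear twice in out; the sample intervals do not overlap, so that case does not arise here.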

Prerequisites:
# presumed setup for your intervals:
intervals = {
"A": [
[0, 1],
[3, 5],
],
"B": [
[1, 3],
],
"C": [
[1, 4],
[6, 10],
],
}
Actual solution:
x = df["Location"].map(intervals).explode().str
l, r = x[0], x[1]
res = df["Place"].loc[l.index].between(l, r)
res = res.loc[res].index.unique()
res = df.loc[res]
Outputs:
>>> res
Location Place Value1 Value2
0 A 1 1 1
2 A 3 2 2
4 C 2 4 4
5 C 3 5 5

how does pandas.Series.str.get method work?

I have a pandas.Series named matches like this:
When I called pandas.Series.str.get method on it, it returns a new Series with its values all NaN:
I have read the document pandas.Series.str.get, but still can't understand it.

str.get(i) returns the element at position i of each value; with i=1 it returns the second element of each iterable, which is the same as str[1]:
df = pd.DataFrame({"A": [[1,2,3], [0,1,3]], "B":['aswed','yuio']})
print (df)
A B
0 [1, 2, 3] aswed
1 [0, 1, 3] yuio
df['C'] = df['A'].str.get(1)
df['C1'] = df['A'].str[1]
df['D'] = df['B'].str.get(1)
df['D1'] = df['B'].str[1]
print (df)
A B C C1 D D1
0 [1, 2, 3] aswed 2 2 s s
1 [0, 1, 3] yuio 1 1 u u
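The original Series is not shown, so the cause of the all-NaN result can only be guessed. One possibility is that the requested position does not exist in the elements, in which case str.get yields NaN for that row. A small sketch with made-up data:

```python
import pandas as pd

s = pd.Series([[1, 2, 3], [0]])

# Position 1 exists in the first list but not in the second.
second = s.str.get(1)
first_value = second.iloc[0]           # 2
missing = pd.isna(second.iloc[1])      # True: no element at position 1
```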

Pandas: Compare every two rows and output result to a new dataframe

import pandas as pd
df1 = pd.DataFrame({'ID':['i1', 'i2', 'i3'],
'A': [2, 3, 1],
'B': [1, 1, 2],
'C': [2, 1, 0],
'D': [3, 1, 2]})
df1 = df1.set_index('ID')
df1.head()
A B C D
ID
i1 2 1 2 3
i2 3 1 1 1
i3 1 2 0 2
df2 = pd.DataFrame({'ID':['i1-i2', 'i1-i3', 'i2-i3'],
'A': [2, 1, 1],
'B': [1, 1, 1],
'C': [1, 0, 0],
'D': [1, 2, 1]})
df2 = df2.set_index('ID')
df2
A B C D
ID
i1-i2 2 1 1 1
i1-i3 1 1 0 2
i2-i3 1 1 0 1
Given a data frame as df1, I want to compare every two different rows, and get the smaller value at each column, and output the result to a new data frame like df2.
For example, to compare i1 row and i2 row, get new row i1-i2 as 2, 1, 1, 1
Please advise what is the best way of pandas to do that.

Try this:
from itertools import combinations
import numpy as np
import pandas as pd
# assumes df1 is indexed by ID, i.e. df1 = df1.set_index('ID')
v = df1.values
r = pd.DataFrame([np.minimum(v[t[0]], v[t[1]])
                  for t in combinations(np.arange(len(df1)), 2)],
                 columns=df1.columns,
                 index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
A B C D
(i1, i2) 2 1 1 1
(i1, i3) 1 1 0 2
(i2, i3) 1 1 0 1
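For larger frames, the Python-level loop over pairs can be replaced by NumPy index arrays. This is a sketch of a swapped-in technique (np.triu_indices), not the method used above; it also builds the "i1-i2" style labels the question asked for.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID': ['i1', 'i2', 'i3'],
                    'A': [2, 3, 1],
                    'B': [1, 1, 2],
                    'C': [2, 1, 0],
                    'D': [3, 1, 2]}).set_index('ID')

v = df1.to_numpy()
# Row positions of every pair (i, j) with i < j.
i, j = np.triu_indices(len(df1), k=1)
labels = [f'{a}-{b}' for a, b in zip(df1.index[i], df1.index[j])]

# Element-wise minimum of the two rows in each pair, all pairs at once.
r = pd.DataFrame(np.minimum(v[i], v[j]), columns=df1.columns, index=labels)
```

The number of pairs grows quadratically with the number of rows, so memory use is the limiting factor for either approach.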

How to make a multi-dimensional column into a single valued vector for training data in sklearn pandas

I have a data set in which certain column is a combination of couple of independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
The column is therefore composed of multiple values, and I need to pass it to a machine learning algorithm. I cannot simply map each combination to a single value like:
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector as
id age marks
1 5 1
2 7 2
3 4 3
and then pass it to the algorithm, because the number of such combinations would be unbounded and the mapping might not capture the real meaning of the data.
How should I handle a situation where an individual feature is a combination of multiple features?
Note:
The values in the marks column are just examples; it could be any list of values: a list of integers, a list of strings, or a string composed of multiple strings separated by commas.

You can pd.factorize tuples
Assuming marks is a list
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
setup df
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)

UPDATE: I think we can use CountVectorizer in this case:
assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
OLD answer:
you can first convert your list to string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
this will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
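If the goal is one indicator column per individual mark (as in the CountVectorizer result), a pandas-only alternative is Series.str.get_dummies. The sketch below assumes marks holds lists of strings, as in the setup used for the CountVectorizer example; the names indicators and features are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'age': [5, 7, 4, 11],
                   'marks': [['3', '6', '7'], ['1', '2'],
                             ['34', '78', '2'], ['3', '6', '7']]})

# Join each list into one delimited string, then expand into 0/1 columns.
indicators = df['marks'].str.join('|').str.get_dummies(sep='|')

# Combine the indicator columns with the remaining numeric features.
features = pd.concat([df[['id', 'age']], indicators], axis=1)
```

Unlike factorizing whole combinations, this keeps the information that rows sharing some marks are partially similar.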

To access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whatever is most appropriate for the function you need to call), use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
list(zip(*df.values))  # zip returns an iterator in Python 3
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> list(zip(*df.values))
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'] = df['c'].apply(np.mean)  # assign back so the column is actually replaced
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5

Been trying to build new data frame from existing data frame and series

I'm trying to write a loop that does the following:
df_f.ix[0] = df_n.loc[0]
df_f.ix[1] = h[0]
df_f.ix[2] = df_n.loc[1]
df_f.ix[3] = h[1]
df_f.ix[4] = df_n.loc[2]
df_f.ix[5] = h[2]
...
df_f.ix[94778] = df_n.loc[47389]
df_f.ix[94779] = h[47389]
Basically, row 1 (and all the rows incremented by 2) of data frame df_f is equal to row 1 of data frame df_n (and its rows incremented by 1) and row 2 (and the rows incremented by 2) of df_f is equal to row 1 (and its rows incremented by 1) of series h. And so on...Can anyone help?

You don't necessarily need loops. You can just create a new list of data from your existing data frame/series and then make that into a new DataFrame:
import pandas as pd
#example data
df_n = pd.DataFrame([1,2, 3, 4,5])
h = pd.Series([99, 98, 97, 96, 95])
new_data = [None] * (len(df_n) * 2)
new_data[::2] = df_n.loc[:, 0].values
new_data[1::2] = h.values
new_df = pd.DataFrame(new_data)
In [135]: new_df
Out[135]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
If you really want a loop that will do it you could create an empty data frame like so:
other_df = pd.DataFrame([None] * (len(df_n) * 2))
y = 0
for x in range(len(df_n)):  # xrange in Python 2
    other_df.loc[y] = df_n.loc[x]
    y += 1
    other_df.loc[y] = h[x]
    y += 1
In [136]: other_df
Out[136]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
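Another way to interleave the two, sketched here as an alternative rather than as part of the answer above, is to concatenate them and sort by the original index with a stable sort, so that each df_n row stays ahead of the h value that shares its index.

```python
import pandas as pd

# Same example data as above.
df_n = pd.DataFrame([1, 2, 3, 4, 5])
h = pd.Series([99, 98, 97, 96, 95])

# Stack df_n on top of h (as a one-column frame with the same column label 0).
stacked = pd.concat([df_n, h.to_frame(name=0)])

# A stable sort keeps the df_n row before the h row for each shared index value.
interleaved = stacked.sort_index(kind='mergesort').reset_index(drop=True)
```

This relies on df_n and h sharing the same index values; if they do not, reset both indexes first.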

This is easy to do in Numpy. You can retrieve the data from a Pandas Dataframe using df.values.
>>> import numpy as np
>>> import pandas as pd
>>> df_a, df_b = pd.DataFrame([1, 2, 3, 4]), pd.DataFrame([5, 6, 7, 8])
>>> df_a
0
0 1
1 2
2 3
3 4
>>> df_b
0
0 5
1 6
2 7
3 8
>>> np_a, np_b = df_a.values, df_b.values
>>> np_a
array([[1],
[2],
[3],
[4]])
>>> np_b
array([[5],
[6],
[7],
[8]])
>>> np_c = np.hstack((np_a, np_b))
>>> np_c
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
>>> np_c = np_c.flatten()
>>> np_c
array([1, 5, 2, 6, 3, 7, 4, 8])
>>> df_c = pd.DataFrame(np_c)
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
All of this in one line, given df_a and df_b:
>>> df_c = pd.DataFrame(np.hstack((df_a.values, df_b.values)).flatten())
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
Edit:
If you have more than one column, which is the general case,
>>> df_a = pd.DataFrame([[1, 2], [3, 4]])
>>> df_b = pd.DataFrame([[5, 6], [7, 8]])
>>> df_a
0 1
0 1 2
1 3 4
>>> df_b
0 1
0 5 6
1 7 8
>>> np_a = df_a.values
>>> np_a = np_a.reshape(np_a.shape[0], 1, np_a.shape[1])
>>> np_a
array([[[1, 2]],
[[3, 4]]])
>>> np_b = df_b.values
>>> np_b = np_b.reshape(np_b.shape[0], 1, np_b.shape[1])
>>> np_b
array([[[5, 6]],
[[7, 8]]])
>>> np_c = np.concatenate((np_a, np_b), axis=1)
>>> np_c
array([[[1, 2],
[5, 6]],
[[3, 4],
[7, 8]]])
>>> np_c = np_c.reshape(np_c.shape[0] * np_c.shape[2], np_c.shape[1])
>>> np_c
array([[1, 2],
[5, 6],
[3, 4],
[7, 8]])
>>> df_c = pd.DataFrame(np_c)
