Checking which rows contain a value efficiently - python

I am trying to write a function that checks for the presence of a value in a row across columns. I have a script that does this by iterating through columns, but I am worried that this will be inefficient when used on large datasets.
Here is my current code:
import pandas as pd
a = [1, 2, 3, 4]
b = [2, 3, 3, 2]
c = [5, 6, 1, 3]
d = [1, 0, 0, 99]
df = pd.DataFrame({'a': a,
'b': b,
'c': c,
'd': d})
cols = ['a', 'b', 'c', 'd']
df['e'] = 0
for col in cols:
df['e'] = df['e'] + df[col] == 1
print(df)
result:
a b c d e
0 1 2 5 1 True
1 2 3 6 0 False
2 3 3 1 0 True
3 4 2 3 99 False
As you can see, column e keeps record of whether the value "1" exists in that row. I was wondering if there was a better/more efficient way of achieving these results.

You can check if values in the data frame is one and see if any is true in a row (with axis=1):
df['e'] = df.eq(1).any(1)
df
# a b c d e
#0 1 2 5 1 True
#1 2 3 6 0 False
#2 3 3 1 0 True
#3 4 2 3 99 False

Python supports 'in', and 'not in'.
EXAMPLE:
>>> a = [1, 2, 5, 1]
>>> b = [2, 3, 6, 0]
>>> c = [5, 6, 1, 3]
>>> d = [1, 0, 0, 99]
>>> 1 in a
True
>>> 1 not in a
False
>>> 99 in d
True
>>> 99 not in d
False
By using this, you don't have to iterate over the array by yourself for this case.

Related

Pandas DataFrame filter by multiple column criterias and multiple intervals

I have checked several answers but found no luck so far.
My dataset is like this:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
Location Place Value1 Value2
A 1 1 1
A 2 1 1
A 3 2 2
B 4 3 3
C 2 4 4
C 3 5 5
and I have a list of intervals:
A: [0, 1]
A: [3, 5]
B: [1, 3]
C: [1, 4]
C: [6, 10]
Now I want that every row that have Location equal to that of the filter list, should have the Place in range of the filter. So the desired output will be:
Location Place Value1 Value2
A 1 1 1
A 3 2 2
C 2 4 4
C 3 5 5
I know that I can chain multiple between conditions by | , but I have a really long list of intervals so manually enter the condition is not feasible. I also consider forloop to slice the data by location first, but I think there could be more efficient way.
Thank you for your help.
Edit: Currently the list of intervals is just strings like this
A 0 1
A 3 5
B 1 3
C 1 4
C 6 10
but I would like to slice them into list of dicts. Better structure for it is also welcome!
First define dataframe df and filters dff:
df = pd.DataFrame({
'Location':['A', 'A', 'A', 'B', 'C', 'C'],
'Place':[1, 2, 3, 4, 2, 3],
'Value1':[1, 1, 2, 3, 4, 5],
'Value2':[1, 1, 2, 3, 4, 5]
}, columns = ['Location','Place','Value1','Value2'])
dff = pd.DataFrame({'Location':['A','A','B','C','C'],
'fPlace':[[0,1], [3, 5], [1, 3], [1, 4], [6, 10]]})
dff[['p1', 'p2']] = pd.DataFrame(dff["fPlace"].to_list())
now dff is:
Location fPlace p1 p2
0 A [0, 1] 0 1
1 A [3, 5] 3 5
2 B [1, 3] 1 3
3 C [1, 4] 1 4
4 C [6, 10] 6 10
where fPlace transformed to lower and upper bounds p1 and p2 indicates filters that should be applied to Place. Next:
df.merge(dff).query('Place >= p1 and Place <= p2').drop(columns = ['fPlace','p1','p2'])
result:
Location Place Value1 Value2
0 A 1 1 1
5 A 3 2 2
7 C 2 4 4
9 C 3 5 5
Prerequisites:
# presumed setup for your intervals:
intervals = {
"A": [
[0, 1],
[3, 5],
],
"B": [
[1, 3],
],
"C": [
[1, 4],
[6, 10],
],
}
Actual solution:
x = df["Location"].map(intervals).explode().str
l, r = x[0], x[1]
res = df["Place"].loc[l.index].between(l, r)
res = res.loc[res].index.unique()
res = df.loc[res]
Outputs:
>>> res
Location Place Value1 Value2
0 A 1 1 1
2 A 3 2 2
4 C 2 4 4
5 C 3 5 5

how does pandas.Series.str.get method work?

I have a pandas.Series named matches like this:
When I called pandas.Series.str.get method on it, it returns a new Series with its values all NaN:
I have read the document pandas.Series.str.get, but still can't understand it.
It return second element from iterable, it is same as str[1]:
df = pd.DataFrame({"A": [[1,2,3], [0,1,3]], "B":['aswed','yuio']})
print (df)
A B
0 [1, 2, 3] aswed
1 [0, 1, 3] yuio
df['C'] = df['A'].str.get(1)
df['C1'] = df['A'].str[1]
df['D'] = df['B'].str.get(1)
df['D1'] = df['B'].str[1]
print (df)
A B C C1 D D1
0 [1, 2, 3] aswed 2 2 s s
1 [0, 1, 3] yuio 1 1 u u

Pandas: Compare every two rows and output result to a new dataframe

import pandas as pd
df1 = pd.DataFrame({'ID':['i1', 'i2', 'i3'],
'A': [2, 3, 1],
'B': [1, 1, 2],
'C': [2, 1, 0],
'D': [3, 1, 2]})
df1.set_index('ID')
df1.head()
A B C D
ID
i1 2 1 2 3
i2 3 1 1 1
i3 1 2 0 2
df2 = pd.DataFrame({'ID':['i1-i2', 'i1-i3', 'i2-i3'],
'A': [2, 1, 1],
'B': [1, 1, 1],
'C': [1, 0, 0],
'D': [1, 1, 1]})
df2.set_index('ID')
df2
A B C D
ID
i1-i2 2 1 1 1
i1-i3 1 1 0 1
i2-i3 1 1 0 1
Given a data frame as df1, I want to compare every two different rows, and get the smaller value at each column, and output the result to a new data frame like df2.
For example, to compare i1 row and i2 row, get new row i1-i2 as 2, 1, 1, 1
Please advise what is the best way of pandas to do that.
Try this:
from itertools import combinations
v = df1.values
r = pd.DataFrame([np.minimum(v[t[0]], v[t[1]])
for t in combinations(np.arange(len(df1)), 2)],
columns=df1.columns,
index=list(combinations(df1.index, 2)))
Result:
In [72]: r
Out[72]:
A B C D
(i1, i2) 2 1 1 1
(i1, i3) 1 1 0 2
(i2, i3) 1 1 0 1

How to make a multi-dimensional column into a single valued vector for training data in sklearn pandas

I have a data set in which certain column is a combination of couple of independent values, as in the example below:
id age marks
1 5 3,6,7
2 7 1,2
3 4 34,78,2
Thus the column by itself is composed of multiple values, and I need to pass the vector into a machine learning algorithm , I cannot really combine the values to assign a single value like :
3,6,7 => 1
1,2 => 2
34,78,2 => 3
making my new vector as
id age marks
1 5 1
2 7 2
3 4 3
and then subsequently pass it to the algorithm , as the number of such combination would be infinite and also that might not really capture the real meaning of the data.
how to handle such situation where individual feature is a combination of multiple features.
Note :
the values in column marks are just examples, it could be anything a list of values. it could be list of integer or list of string , string composed of multiple stings separated by commas
You can pd.factorize tuples
Assuming marks is a list
df
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 5 [3, 6, 7]
Apply tuple and factorize
df.assign(new=pd.factorize(df.marks.apply(tuple))[0] + 1)
id age marks new
0 1 5 [3, 6, 7] 1
1 2 7 [1, 2] 2
2 3 4 [34, 78, 2] 3
3 4 5 [3, 6, 7] 1
setup df
df = pd.DataFrame([
[1, 5, ['3', '6', '7']],
[2, 7, ['1', '2']],
[3, 4, ['34', '78', '2']],
[4, 5, ['3', '6', '7']]
], [0, 1, 2, 3], ['id', 'age', 'marks']
)
UPDATE: I think we can use CountVectorizer in this case:
assuming we have the following DF:
In [33]: df
Out[33]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [34]: %paste
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import TreebankWordTokenizer
vect = CountVectorizer(ngram_range=(1,1), stop_words=None, tokenizer=TreebankWordTokenizer().tokenize)
X = vect.fit_transform(df.marks.apply(' '.join))
r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names())
## -- End pasted text --
Result:
In [35]: r
Out[35]:
1 2 3 34 6 7 78
0 0 0 1 0 1 1 0
1 1 1 0 0 0 0 0
2 0 1 0 1 0 0 1
3 0 0 1 0 1 1 0
OLD answer:
you can first convert your list to string and then categorize it:
In [119]: df
Out[119]:
id age marks
0 1 5 [3, 6, 7]
1 2 7 [1, 2]
2 3 4 [34, 78, 2]
3 4 11 [3, 6, 7]
In [120]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [121]: df
Out[121]:
id age marks new
0 1 5 [3, 6, 7] 0
1 2 7 [1, 2] 1
2 3 4 [34, 78, 2] 2
3 4 11 [3, 6, 7] 0
In [122]: df.dtypes
Out[122]:
id int64
age int64
marks object
new category
dtype: object
this will also work if marks is a column of strings:
In [124]: df
Out[124]:
id age marks
0 1 5 3,6,7
1 2 7 1,2
2 3 4 34,78,2
3 4 11 3,6,7
In [125]: df['new'] = pd.Categorical(pd.factorize(df.marks.str.join('|'))[0])
In [126]: df
Out[126]:
id age marks new
0 1 5 3,6,7 0
1 2 7 1,2 1
2 3 4 34,78,2 2
3 4 11 3,6,7 0
Tp access them as either [[x, y, z], [x, y, z]] or [[x, x], [y, y], [z, z]] (whatever is most appropriate for the function you need to call) then use:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2, 3, 4], b=[3, 4, 3, 4], c=[[1,2,3], [1,2], [], [2]]))
df.values
zip(*df.values)
where
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
2 3 3 []
3 4 4 [2]
>>> df.values
array([[1, 3, [1, 2, 3]],
[2, 4, [1, 2]],
[3, 3, []],
[4, 4, [2]]], dtype=object)
>>> zip(*df.values)
[(1, 2, 3, 4), (3, 4, 3, 4), ([1, 2, 3], [1, 2], [], [2])]
To convert a column try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(dict(a=[1, 2], b=[3, 4], c=[[1,2,3], [1,2]]))
df['c'].apply(lambda x: np.mean(x))
before:
>>> df
a b c
0 1 3 [1, 2, 3]
1 2 4 [1, 2]
after:
>>> df
a b c
0 1 3 2.0
1 2 4 1.5

Been trying to build new data frame from existing data frame and series

I'm trying to write a loop that does the following:
df_f.ix[0] = df_n.loc[0]
df_f.ix[1] = h[0]
df_f.ix[2] = df_n.loc[1]
df_f.ix[3] = h[1]
df_f.ix[4] = df_n.loc[2]
df_f.ix[5] = h[2]
...
df_f.ix[94778] = df_n.loc[47389]
df_f.ix[94779] = h[47389]
Basically, row 1 (and all the rows incremented by 2) of data frame df_f is equal to row 1 of data frame df_n (and its rows incremented by 1) and row 2 (and the rows incremented by 2) of df_f is equal to row 1 (and its rows incremented by 1) of series h. And so on...Can anyone help?
You don't necessarily need loops... You can just create a new list of data from your existing data frame/series and then make that int a new DataFrame
import pandas as pd
#example data
df_n = pd.DataFrame([1,2, 3, 4,5])
h = pd.Series([99, 98, 97, 96, 95])
new_data = [None] * (len(df_n) * 2)
new_data[::2] = df_n.loc[:, 0].values
new_data[1::2] = h.values
new_df = pd.DataFrame(new_data)
In [135]: new_df
Out[135]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
If you really want a loop that will do it you could create an empty data frame like so:
other_df = pd.DataFrame([None] * (len(df_n) * 2))
y = 0
for x in xrange(len(df_n)):
other_df.loc[y] = df_n.loc[x]
y+=1
other_df.loc[y] = h[x]
y+=1
In [136]: other_df
Out[136]:
0
0 1
1 99
2 2
3 98
4 3
5 97
6 4
7 96
8 5
9 95
This is easy to do in Numpy. You can retrieve the data from a Pandas Dataframe using df.values.
>>> import numpy as np
>>> import pandas as pd
>>> df_a, df_b = pd.DataFrame([1, 2, 3, 4]), pd.DataFrame([5, 6, 7, 8])
>>> df_a
0
0 1
1 2
2 3
3 4
>>> df_b
0
0 5
1 6
2 7
3 8
>>> np_a, np_b = df_a.values, df_b.values
>>> np_a
array([[1],
[2],
[3],
[4]])
>>> np_b
array([[5],
[6],
[7],
[8]])
>>> np_c = np.hstack((np_a, np_b))
>>> np_c
array([[1, 5],
[2, 6],
[3, 7],
[4, 8]])
>>> np_c = np_c.flatten()
>>> np_c
array([1, 5, 2, 6, 3, 7, 4, 8])
>>> df_c = pd.DataFrame(np_c)
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
All of this in one line, given df_a and df_b:
>>> df_c = pd.DataFrame(np.hstack((df_a.values, df_b.values)).flatten())
>>> df_c
0
0 1
1 5
2 2
3 6
4 3
5 7
6 4
7 8
Edit:
If you have more than one column, which is the general case,
>>> df_a = pd.DataFrame([[1, 2], [3, 4]])
>>> df_b = pd.DataFrame([[5, 6], [7, 8]])
>>> df_a
0 1
0 1 2
1 3 4
>>> df_b
0 1
0 5 6
1 7 8
>>> np_a = df_a.values
>>> np_a = np_a.reshape(np_a.shape[0], 1, np_a.shape[1])
>>> np_a
array([[[1, 2]],
[[3, 4]]])
>>> np_b = df_b.values
>>> np_b = np_b.reshape(np_b.shape[0], 1, np_b.shape[1])
>>> np_b
array([[[5, 6]],
[[7, 8]]])
>>> np_c = np.concatenate((np_a, np_b), axis=1)
>>> np_c
array([[[1, 2],
[5, 6]],
[[3, 4],
[7, 8]]])
>>> np_c = np_c.reshape(np_c.shape[0] * np_c.shape[2], np_c.shape[1])
>>> np_c
array([[1, 2],
[5, 6],
[3, 4],
[7, 8]])
>>> df_c = pd.DataFrame(np_c)

Categories

Resources