This question already has answers here: Accessing every 1st element of Pandas DataFrame column containing lists (5 answers). Closed 1 year ago.
I have a dataframe like this:
text emotion
0 working add oil [1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0]
1 you're welcome [0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0]
7 off to face my exam now [0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, ...
12 no, i'm so not la! i want to sleeeeeeeeeeep. [0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, ...
151 i try to register on ebay. when i enter my hom... [1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, ...
18 Swam 6050 yards on just a yogurt for breakfast... [0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, ...
19 Alright! [0, 0, 1, 1, 0, 0, 0, 0]
120 Visiting gma. It's getting cold [0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, ...
22 You are very missed [0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, ...
345 ...LOL! You mean Rhode Island...close enough [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, ...
How can I keep only the first number in the emotion column, to get data like this?
text emotion
0 working add oil 1
1 you're welcome 0
7 off to face my exam now 0
12 no, i'm so not la! i want to sleeeeeeeeeeep. 0
151 i try to register on ebay. when i enter my hom... 1
18 Swam 6050 yards on just a yogurt for breakfast... 0
19 Alright! 0
120 Visiting gma. It's getting cold 0
22 You are very missed 0
345 ...LOL! You mean Rhode Island...close enough 0
If "emotion" column is a list and not string:
df["emotion"] = df["emotion"].apply(lambda x: x[0])
print(df)
Prints:
text emotion
0 working add oil 1
1 you're welcome 0
2 off to face my exam now 0
3 no, i'm so not la! i want to sleeeeeeeeeeep. 0
4 i try to register on ebay. when i enter my hom... 1
5 Swam 6050 yards on just a yogurt for breakfast... 0
6 Alright! 0
7 Visiting gma. It's getting cold 0
8 You are very missed 0
9 ...LOL! You mean Rhode Island...close enough 0
If it's a string, you can first convert it to a list using ast.literal_eval:
from ast import literal_eval
df["emotion"] = df["emotion"].apply(literal_eval)
# and then:
df["emotion"] = df["emotion"].apply(lambda x: x[0])
Related
I have an issue writing code for a problem that updates each element of an array based on the majority vote of its past 5 elements, including the element itself.
Explanation: Suppose we have an array of binary elements: arr = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
According to the problem, we have to find the majority value of the past 5 elements, including the element itself. For the first 4 elements, the new updated value is zero, and the value at index 4 is then based on the majority vote of the past elements, that is [0, 1, 0, 0, 0]. The majority value for index 4 is '0', as the count of '0' is greater than the count of '1' in the given window. Similarly, we do the same for all the elements.
Input arr = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
Output arr = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Use:
arr = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]
out = [int(sum(arr[max(i-5, 0):i]) >= 3) for i in range(1, len(arr)+1)]
print(out)
# Output
[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
Variation:
out = [int(sum(arr[max(i-4, 0):i+1]) >= 3) for i in range(len(arr))]
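The same majority vote can also be phrased as a rolling window; a sketch of my own, assuming pandas is acceptable here:
import pandas as pd

# min_periods=1 lets the first few windows be shorter, like the slices above
s = pd.Series(arr)
out = (s.rolling(window=5, min_periods=1).sum() >= 3).astype(int).tolist()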
I have a dataframe that looks like this:
feature target
0 2 0
1 0 0
2 0 0
3 0 0
4 1 0
... ... ...
33208 1 0
33209 0 0
33210 2 0
33211 2 0
33212 1 0
In the feature column there are 3 classes (0, 1, 2) and in the target column there are two classes (0, 1). If I group the dataframe by these two columns, I get:
df.groupby(['feature', 'target']).size()
feature target
0 0 4282
1 81
1 0 8537
1 37
2 0 20161
1 115
dtype: int64
Each feature class has 0s and 1s as target values. I need to find a way of sampling these values; my intention is to have something like this at the end:
new_df.groupby(['feature', 'target']).size()
feature target
0 0 4282
1 81
1 0 4282
1 37
2 0 4282
1 115
dtype: int64
I need to sample the number of target values for each feature class. Any suggestions?
You have different distributions, depending on the value of feature.
You need to sample n values from a distribution conditioned on the value of feature: given that there are two possible outcomes, this is a binomial distribution problem.
The approach shown below should also handle the situation where target is not necessarily (0, 1): it could be anything (win vs. lose, team A vs. team B, and so forth), as far as I can see:
import numpy as np
import pandas as pd
# this just reproduces your grouped end state
df = pd.DataFrame({"feature":[0, 0, 1, 1, 2, 2], "target":[0, 1, 0, 1, 0, 1], "number":[4282, 81, 4282, 37, 4282, 115]})
df = df.set_index(["feature", "target"])
def sample_values(feature, sample_size):
    # select one of the distributions by feature
    df_sub = df.loc[feature]
    (event1, number1), (event2, number2) = zip(df_sub.index, df_sub["number"].tolist())
    return [event2 if np.random.binomial(1, number2/(number1+number2)) == 1 else event1
            for _ in range(sample_size)]
print(sample_values(2, 100))
OUTPUT
[1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
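If the goal is instead to downsample the raw rows so each feature keeps 4282 target-0 rows (my reading of the desired output), a groupby/sample sketch could look like this; raw_df is a hypothetical name for the original 33213-row frame:
# downsample target==0 within each feature to the smallest group size
n = raw_df[raw_df["target"] == 0].groupby("feature").size().min()  # 4282 here
zeros = (raw_df[raw_df["target"] == 0]
         .groupby("feature", group_keys=False)
         .apply(lambda g: g.sample(n, random_state=0)))
new_df = pd.concat([zeros, raw_df[raw_df["target"] == 1]])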
Suppose I have an (m x n) 2-d numpy array containing just 0s and 1s. I want to "smooth" the array by running, for example, a 3x3 kernel over the array and taking the majority value within that kernel. For values at the edges, I would just ignore the "missing" values.
For example, let's say the array looked like
import numpy as np
x = np.array([[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[0, 0, 1, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
Starting at the top-left "1", a 3x3 kernel centered at that element would be missing the first row and first column. The way I want to treat that is to just ignore the missing values and consider the remaining 2x2 matrix:
1 0
0 0
In this case, the majority value is 0, so set that element to 0. Repeating this for all elements, the resulting 2-d array I would want is:
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 1 0
0 0 1 1 1 1 0 0
0 0 0 0 0 0 0 0
How do I accomplish this?
You can use skimage.filters.rank.majority to assign to each value the most occurring one within its neighborhood. The 3x3 kernel can be defined using skimage.morphology.square:
from skimage.filters.rank import majority
from skimage.morphology import square
majority(x.astype('uint8'), square(3))
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)
Note: you'll need a recent stable version of scikit-image for majority.
I ended up doing something like this (which is based on How do I use scipy.ndimage.filters.generic_filter?):
import numpy as np
import scipy.ndimage.filters
import scipy.stats as scs

def filter_most_common_element(a, w_k=np.ones(shape=(3, 3))):
    """
    Creating a function for scipy.ndimage.generic_filter.
    See https://docs.scipy.org/doc/scipy/reference/generated/scipy.ndimage.generic_filter.html
    for more information on generic filters.
    This filter takes a kernel of np.ones() to find the most common element in the array.
    Based off of https://stackoverflow.com/questions/61197364/smoothing-a-2-d-numpy-array-with-a-kernel
    """
    a = a.reshape(w_k.shape)
    a = np.multiply(a, w_k)
    # See https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.stats.mode.html
    most_common_element = scs.mode(a, axis=None)[0][0]
    return most_common_element
x = np.array([[1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 0, 1, 1, 0],
[0, 0, 1, 0, 1, 1, 1, 0],
[0, 1, 1, 1, 1, 0, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
out = scipy.ndimage.filters.generic_filter(x, filter_most_common_element, footprint=np.ones((3,3)),mode='constant',cval=0.0)
out
array([[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 1, 1, 0, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 1, 0],
[0, 0, 1, 1, 1, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0]])
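For reference, the same edge handling can be sketched with plain convolutions (my own variant, not from either answer): count the 1s and the valid cells per window separately, then compare. Ties are broken toward 0 here, which is an assumption:
import numpy as np
from scipy.signal import convolve2d

kernel = np.ones((3, 3))
ones_in_window = convolve2d(x, kernel, mode='same')                  # number of 1s per window
cells_in_window = convolve2d(np.ones_like(x), kernel, mode='same')   # window size, smaller at edges
smoothed = (ones_in_window * 2 > cells_in_window).astype(int)        # strict majority of 1s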
I have a dataframe
df = pd.DataFrame({'Binary_List': [[0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1]]})
df
Binary_List
0 [0, 0, 1, 0, 0, 0, 0]
1 [0, 1, 0, 0, 0, 0, 0]
2 [0, 0, 1, 1, 0, 0, 0]
3 [0, 0, 0, 0, 1, 1, 1]
I want to apply a function to each list without using apply, because apply is very slow when running on a large dataset.
def count_one(lst):
    index = [i for i, e in enumerate(lst) if e != 0]
    # some more steps
    return len(index)
df['Value'] = df['Binary_List'].apply(lambda x: count_one(x))
df
Binary_List Value
0 [0, 0, 1, 0, 0, 0, 0] 1
1 [0, 1, 0, 0, 0, 0, 0] 1
2 [0, 0, 1, 1, 0, 0, 0] 2
3 [0, 0, 0, 0, 1, 1, 1] 3
I tried using this, but saw no improvement:
vfunc = np.vectorize(count_one)
df['Value'] = vfunc(df['Binary_List'])
And this gives me an error:
df['Value'] = count_one(df['Binary_List'])
You can try DataFrame.explode:
df.explode('Binary_List').reset_index().groupby('index').sum()
Binary_List
index
0 1
1 1
2 2
3 3
Also you can do:
pd.Series([np.array(key).sum() for key in df['Binary_List']])
0 1
1 1
2 2
3 3
dtype: int64
To count the 1s in each list, you can convert the column to string and use the str accessor, like below:
df = pd.DataFrame({'Binary_List': [[0, 0, 1, 0, 0, 0, 0],
[0, 1, 0, 0, 0, 0, 0],
[0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1]]})
df["Binary_List"].astype(np.str).str.count("1")
I would like to understand how pd.qcut() selects where to put extra items when numItems % numBins != 0. For example, I wrote this code to check how 0-9 extra items are distributed in a decile setting:
import pandas as pd

for i in range(10):
    a = pd.qcut(pd.Series(range(i + 10)), 10, labels=False).value_counts().loc[list(range(10))].tolist()
    a = [x - 1 for x in a]
    print(str(i), 'extra:', a)
0 extra: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 extra: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
2 extra: [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
3 extra: [1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
4 extra: [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
5 extra: [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
6 extra: [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]
7 extra: [1, 1, 0, 1, 1, 0, 1, 0, 1, 1]
8 extra: [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
9 extra: [1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
Of course, this will change as numItems and numBins change. Do you have any insight into how the algorithm selects where to put the extra items? It appears that it tries to balance them in some way.
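As far as I understand it, qcut places its bin edges at linearly interpolated quantiles of the data, so the extra items fall wherever those fractional edges land between ranks; a sketch reproducing the edges for the 12-item case above:
import numpy as np

# fractional cut points for 12 items in 10 bins; the first and last bins
# span two integers each, matching the '2 extra: [1, 0, ..., 0, 1]' row
edges = np.quantile(range(12), np.linspace(0, 1, 11))
print(edges)  # [ 0.   1.1  2.2  3.3  4.4  5.5  6.6  7.7  8.8  9.9 11. ]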