I have a vector which contains 10 values of sample 1 and 25 values of sample 2.
Fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))
I want to create a stratified output vector where:
sample 1 is split 80/20: 8 values of 1 and 2 values of 0;
sample 2 is split 80/20: 20 values of 1 and 5 values of 0.
The expected output will be :
Output = np.array((0,1,1,1,0,1,1,1,1,0,1,1,1,0,1,1,1,0,1,0,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1))
How can I automate this? I can't use the sampling functions from scikit-learn because this is not for a machine learning experiment.
Here is one way to get your desired result, with reproducible output. We draw random index values for each of the two groups from the input (fact) array, without replacement. Then we create a new output array, assigning 1s at the drawn indices and 0s everywhere else.
import numpy as np
from numpy.random import RandomState

rng = RandomState(123)  # fixed seed for reproducibility

fact = np.array(
    (2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2),
    dtype='int8'
)

# Draw 8 of the 10 indices of group 1 and 20 of the 25 indices of group 2.
idx_arr = np.hstack(
    (
        rng.choice(np.argwhere(fact == 1).flatten(), 8, replace=False),
        rng.choice(np.argwhere(fact == 2).flatten(), 20, replace=False),
    )
)

# Start from all zeros and set the drawn positions to 1.
out = np.zeros_like(fact, dtype='int8')
np.put(out, idx_arr, 1)
print(out)
# [0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 1 1]
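To automate the split for any number of groups and any fraction, here is a minimal sketch using the newer np.random.default_rng; the helper name stratified_flags and the rounding of group sizes are my own assumptions, not part of the answer above.

import numpy as np

def stratified_flags(labels, frac=0.8, seed=123):
    # Hypothetical helper: per label, mark ~frac of the entries with 1.
    rng = np.random.default_rng(seed)
    out = np.zeros(len(labels), dtype='int8')
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        n_ones = round(frac * len(idx))  # 8 of 10 and 20 of 25 for this input
        out[rng.choice(idx, n_ones, replace=False)] = 1
    return out

fact = np.array((2,2,2,2,1,2,1,1,2,2,2,1,2,2,2,1,2,2,2,1,2,2,1,1,2,1,2,2,2,2,2,2,1,2,2))
print(stratified_flags(fact))  # 8 ones among the 1-group, 20 among the 2-group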
I want to find the count of previous rows that have a greater value than the current row in a column, and store it in a new column. It would be like a rolling countif that looks back to the beginning of the column. The desired example output below shows the given Value column and the Count column I want to create.
Desired Output:
Value Count
5 0
7 0
4 2
12 0
3 4
4 3
1 6
I plan on using this code with a large DataFrame, so the fastest approach is appreciated.
We can use np.subtract.outer from numpy, take the lower triangle with np.tril, mark the values that are less than 0, and sum per row:
a = np.sum(np.tril(np.subtract.outer(df.Value.values, df.Value.values), k=0) < 0, axis=1)
# results in array([0, 0, 2, 0, 4, 3, 6])
df['Count'] = a
IMPORTANT: this only works with pandas < 1.0.0 and the error seems to be a pandas bug. An issue is already created at https://github.com/pandas-dev/pandas/issues/35203
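A variant of the same idea builds the boolean comparison matrix directly instead of subtracting (this reformulation is mine, not part of the original answer). Note that both versions allocate an n-by-n matrix, so memory grows quadratically with the column length:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [5, 7, 4, 12, 3, 4, 1]})
v = df['Value'].to_numpy()

# Entry (i, j) is True when row j's value exceeds row i's value;
# k=-1 keeps only strictly earlier rows (j < i).
earlier_greater = np.tril(v[None, :] > v[:, None], k=-1)
df['Count'] = earlier_greater.sum(axis=1)
print(df['Count'].tolist())  # [0, 0, 2, 0, 4, 3, 6]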
We can do this with expanding and applying a function which checks for values that are higher than the last element in the expanding array.
import pandas as pd
import numpy as np
# setup
df = pd.DataFrame([5,7,4,12,3,4,1], columns=['Value'])
# calculate countif
# raw=True passes a plain numpy array, so x[-1] is the current (last) element
df['Count'] = df.Value.expanding(1).apply(lambda x: np.sum(np.where(x > x[-1], 1, 0)), raw=True).astype('int')
Input
Value
0 5
1 7
2 4
3 12
4 3
5 4
6 1
Output
Value Count
0 5 0
1 7 0
2 4 2
3 12 0
4 3 4
5 4 3
6 1 6
A plain nested-loop version of the same countif:
values = df['Value'].tolist()
counts = []
for i in range(len(values)):
    count = 0
    # how many earlier values are greater than the current one?
    for j in values[:i]:
        if values[i] < j:
            count += 1
    counts.append(count)
df['Count'] = counts
The generator below will do what you need (it assumes the values are non-negative integers, since it tallies counts by value). You may be able to optimize it further if needed.
def generator(data):
    count_dict = {}  # how many times each value has been seen so far
    m = max(data)
    for v in data:
        count_dict[v] = count_dict.get(v, 0) + 1
        # count previously seen values strictly greater than v
        yield sum(count_dict.get(j, 0) for j in range(v + 1, m + 1))

d = [1, 5, 7, 3, 5, 8]
foo = generator(d)
result = [b for b in foo]
print(result)  # [0, 0, 0, 2, 1, 0]
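Since the asker wants speed on a large DataFrame, here is a further sketch (my addition) that keeps a sorted list of the values seen so far and uses bisect to count the strictly greater ones. List insertion still makes it quadratic in the worst case, but it avoids the n-by-n matrix and is typically fast in practice:

import bisect

def count_prev_greater(values):
    seen = []  # sorted list of values seen so far
    out = []
    for v in values:
        # elements to the right of bisect_right(seen, v) are strictly greater
        out.append(len(seen) - bisect.bisect_right(seen, v))
        bisect.insort(seen, v)
    return out

print(count_prev_greater([5, 7, 4, 12, 3, 4, 1]))  # [0, 0, 2, 0, 4, 3, 6]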
I was wondering if anyone could help me with how to make a bar chart to show the frequencies of values in a Pandas Series.
I start with a Pandas DataFrame of shape (2000, 7), and from there I extract the last column. The column is shape (2000,).
The entries in the Series that I mentioned vary from 0 to 17, each with different frequencies, and I tried to plot them using a bar chart but faced some difficulties. Here is my code:
# First, I counted the number of occurrences.
count = np.zeros(max(data_val))
for i in range(count.shape[0]):
    for j in range(data_val.shape[0]):
        if (i == data_val[j]):
            count[i] = count[i] + 1
'''
This gives us
count = array([192., 105., ... 19.])
'''
temp = np.arange(0, 18, 1)  # Array for the x-axis.
plt.bar(temp, count)
I am getting an error on the last line of code, saying that the objects cannot be broadcast to a single shape.
What I ultimately want is a bar chart where each bar corresponds to an integer value from 0 to 17, and the height of each bar (i.e. the y-axis) represents the frequencies.
Thank you.
UPDATE
I decided to post the fixed code using the suggestions that people were kind enough to give below, in case anybody facing similar issues finds it useful in the future.
data = pd.read_csv("./data/train.csv") # Original data is a (2000, 7) DataFrame
# data contains 6 feature columns and 1 target column.
# Separate the design matrix from the target labels.
X = data.iloc[:, :-1]
y = data['target']
'''
The next line of code uses pandas.Series.value_counts() on y in order to count
the number of occurrences for each label, and then proceeds to sort these according to
index (i.e. label).
You can also use pandas.DataFrame.sort_values() instead if you're interested in sorting
according to the number of frequencies rather than labels.
'''
y.value_counts().sort_index().plot.bar(x='Target Value', y='Number of Occurrences')
There was no need for for loops; the methods built into the pandas library suffice.
The specific methods mentioned in the answers are pandas.Series.value_counts(), pandas.Series.sort_index(), and pandas.Series.plot.bar().
I believe you need value_counts with Series.plot.bar:
df = pd.DataFrame({
'a':[4,5,4,5,5,4],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[1,1,6,1,6,5],
})
print (df)
a b c d
0 4 7 1 1
1 5 8 3 1
2 4 9 5 6
3 5 4 7 1
4 5 2 1 6
5 4 3 0 5
df['d'].value_counts(sort=False).plot.bar()
If some values may be missing and you need to set them to 0, add reindex:
df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0).plot.bar()
Detail:
print (df['d'].value_counts(sort=False))
1 3
5 1
6 2
Name: d, dtype: int64
print (df['d'].value_counts(sort=False).reindex(np.arange(18), fill_value=0))
0 0
1 3
2 0
3 0
4 0
5 1
6 2
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 0
Name: d, dtype: int64
Here's an approach using Seaborn
import numpy as np
import pandas as pd
import seaborn as sns
s = pd.Series(np.random.choice(17, 10))
s
# 0 10
# 1 13
# 2 12
# 3 0
# 4 0
# 5 5
# 6 13
# 7 9
# 8 11
# 9 0
# dtype: int64
val, cnt = np.unique(s, return_counts=True)
val, cnt
# (array([ 0, 5, 9, 10, 11, 12, 13]), array([3, 1, 1, 1, 1, 1, 2]))
sns.barplot(x=val, y=cnt)
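For completeness, a plain numpy/matplotlib sketch (my addition): np.bincount counts every integer from 0 up to minlength-1 in one call, and it also sidesteps the off-by-one in the original loop, where np.zeros(max(data_val)) allocates one slot too few (the values 0 through 17 need 18 slots):

import numpy as np
import matplotlib.pyplot as plt

data_val = np.random.randint(0, 18, size=2000)  # stand-in for the real column
count = np.bincount(data_val, minlength=18)     # frequencies of 0..17
plt.bar(np.arange(18), count)
plt.xlabel('Target Value')
plt.ylabel('Number of Occurrences')
plt.show()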
I have used 2-D convolution to generate some statistics on conditions of local patterns. To be complete: I'm working with images, the value 0.5 is my 'gray-screen', and unfortunately I cannot apply masks before this step (a dependence on some other packages). I want to add new objects to my image, but each should overlap at least 75% non-gray-screen pixels. Assuming the new object is square, I mask the image into gray-screen versus the rest, then do a 2-D convolution with an n-by-n matrix filled with 1s, which gives me the number of gray-screen pixels in each patch. This all works, so I have a matrix of suitable places to put my new object. How do I efficiently pick a random one from this matrix?
Here is a small example with a 5x5 image and a 2x2 convolution matrix, where I want a random coordinate in the last matrix with a 1 (meaning there is at most one 0.5 in that patch).
Image:
1 0.5 0.5 0 1
0.5 0.5 0 1 1
0.5 0.5 1 1 0.5
0.5 1 0 0 1
1 1 0 0 1
Convolution matrix:
1 1
1 1
Convoluted image:
3 3 1 0
4 2 0 1
3 1 0 1
1 0 0 0
Conditioned on <= 1:
0 0 1 1
0 0 1 1
0 1 1 1
1 1 1 1
How do I get a uniformly distributed coordinate of the 1s efficiently?
np.where and np.random.randint should do the trick:
# grab the indices of the ones
x, y = np.where(convoluted_image <= 1)
# choose one index uniformly at random
i = np.random.randint(len(x))
random_pos = [x[i], y[i]]
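Putting the whole pipeline together for the 5x5 example, here is a sketch; scipy.signal.convolve2d stands in for whatever convolution routine the asker actually used:

import numpy as np
from scipy.signal import convolve2d

img = np.array([
    [1, 0.5, 0.5, 0, 1],
    [0.5, 0.5, 0, 1, 1],
    [0.5, 0.5, 1, 1, 0.5],
    [0.5, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
])

gray = (img == 0.5).astype(int)  # 1 where the pixel is gray-screen
# 2x2 all-ones kernel: each output cell is the number of gray pixels in that patch
counts = convolve2d(gray, np.ones((2, 2), dtype=int), mode='valid')

x, y = np.where(counts <= 1)   # patches with at most one gray pixel
i = np.random.randint(len(x))  # uniform over the valid patches
random_pos = [x[i], y[i]]
print(random_pos)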
Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination[1] to mark, within each run of consecutive equal values, only the first 2 positions (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby"
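To see how the run labelling works, here are the intermediate pieces of the one-liner for the sample column:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})

run_id = (df.A != df.A.shift()).cumsum()    # label each consecutive run
print(run_id.tolist())                      # [1, 1, 1, 2, 3, 3, 3, 3, 4, 5]

pos_in_run = df.groupby(run_id).cumcount()  # 0-based position within each run
print(pos_in_run.tolist())                  # [0, 1, 2, 0, 0, 1, 2, 3, 0, 0]

# keep only the first two positions of each run, then & with A itself
df['B'] = ((pos_in_run <= 1) & df.A).astype(int)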
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy(copy=True)  # .as_matrix() in pre-1.0 pandas
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
import numpy

def trim_runs(array, cutoff):
    # zero out elements once a run of 1s exceeds `cutoff`
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
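For the sample column, this reproduces the desired B:

import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].to_numpy(copy=True), cutoff=2)
print(df['B'].tolist())  # [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]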
For my thesis I am working on a machine learning project in Python which includes feature extraction from text. As a start I am trying to implement bi-grams using scikit-learn.
Right now, when I process my data through CountVectorizer, I get an array of just 1s and sometimes a bit more. E.g.:
`[[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]]`
I want to use these bi-grams to predict my target variable, which is categorical.
When I now execute my code, Python returns that the shapes of my two arrays are not identical.
`[[1 3 2 ..., 1 1 1]] [ 0. 0. 1. 0. 0.]`
Can someone tell me what I am doing wrong? I am using this command for the bi-grams. The first part is inside a loop over every text (film plot) in the dataset.
plottext = [ row[8] ]
wordvec = CountVectorizer(ngram_range=(2,2), analyzer='word')
plotvec = wordvec.fit_transform(plottext).toarray()
matrix_terms = np.array(wordvec.get_feature_names())
matrix_freq = np.asarray(plotvec.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])
target = { 'Age': row[4] }
data.append((final_matrix, target))
# Convert categorical target variable to Y
(X, Ycat) = zip(*data)
vec = DictVectorizer(sparse=False)
Y = vec.fit_transform(Ycat)
#Extract textual features from plot
return (X, Y)
The error message I get:
ValueError: could not broadcast input array from shape (2,830) into shape (2)
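For reference, a minimal sketch of the usual CountVectorizer pattern (my addition, not part of the original question): it fits a single vocabulary over the whole corpus, so every row of X has the same width; the example texts here are made up:

from sklearn.feature_extraction.text import CountVectorizer

plots = ["a boy meets a dragon", "the dragon returns to the sea"]  # made-up corpus
wordvec = CountVectorizer(ngram_range=(2, 2), analyzer='word')
X = wordvec.fit_transform(plots)  # one sparse row per plot, identical widths
print(X.shape)                    # (2, number of distinct bi-grams)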