Randomly Sample in Python with a Certain Distribution

I want to create a dataframe with two columns: an id column that repeats each of the ids 1-100 three times, and an 'age' column where I randomly sample ages 0-14 17% of the time, ages 15-64 65% of the time, and ages 65-100 18% of the time.
Example DF:
id age
1 21
1 21
1 21
2 45
2 45
2 45
3 64
3 64
3 64
Code I have so far:
import numpy as np
import pandas as pd
N = 100
R = 3
d = {'id': np.repeat(np.arange(1, N + 1), R)}
df = pd.DataFrame(d)
I'm stuck on how to simulate the age though.
How can I do this?

You can apply numpy.random.randint to each range, drawing a fixed number of ages per range (17, 65 and 18 out of 100). The upper bound of randint is exclusive, so it has to be one more than the largest age in each range:
df['ages'] = np.repeat(np.concatenate([np.random.randint(0, 15, 17),
                                       np.random.randint(15, 65, 65),
                                       np.random.randint(65, 101, 18)]), R)
print(df)
If needed, the concatenated array can also be shuffled with np.random.shuffle before the ages are repeated with np.repeat:
ages = np.concatenate([np.random.randint(0, 15, 17),
                       np.random.randint(15, 65, 65),
                       np.random.randint(65, 101, 18)])
np.random.shuffle(ages)
df['ages'] = np.repeat(ages, R)
id ages
0 1 11
1 1 11
2 1 11
3 2 3
4 2 3
5 2 3
6 3 12
7 3 12
8 3 12
9 4 8
10 4 8
11 4 8
12 5 10
13 5 10
14 5 10
.. ... ...
285 96 70
286 96 70
287 96 70
288 97 83
289 97 83
290 97 83
291 98 70
292 98 70
293 98 70
294 99 98
295 99 98
296 99 98
297 100 92
298 100 92
299 100 92

I suggest this method:
import pandas as pd
import numpy as np
ids = np.repeat(range(1, 101), 3)
age_choices = [(np.arange(0, 15), 0.17), (np.arange(15, 65), 0.65), (np.arange(65, 101), 0.18)]
ages = np.concatenate([np.random.choice(choice[0], size=int(len(ids)*choice[1]), replace=True) for choice in age_choices])
df = pd.DataFrame({'id': ids, 'age': ages})
print(df.head(30))
which gives
id age
0 1 2
1 1 13
2 1 8
3 2 14
4 2 0
5 2 14
6 3 7
7 3 6
8 3 9
9 4 13
10 4 9
11 4 6
12 5 7
13 5 6
14 5 12
15 6 12
16 6 4
17 6 2
18 7 0
19 7 10
20 7 4
21 8 10
22 8 8
23 8 1
24 9 10
25 9 5
26 9 13
27 10 8
28 10 13
29 10 4

Maybe something like:
import numpy as np
import pandas as pd
N = 100
R = 3
ids = np.arange(1, N + 1)
# Assuming max age of 99
possible_ages = np.arange(100)
# Number of distinct ages per group: 0-14 (15), 15-64 (50), 65-99 (35)
sizes = np.array([15, 50, 35])
percentages = np.array([17, 65, 18])
ages = np.random.choice(possible_ages, size=N, p=np.repeat(percentages / sizes / 100, sizes))
df = pd.DataFrame({
"id": np.repeat(ids, R),
"age": np.repeat(ages, R)
})
Alternatively you could sample the age group using your specified percentages first, then uniformly sample from the obtained group after.
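The two-stage alternative mentioned above could look roughly like the following. This is only a sketch: the names bounds, probs and groups are made up for illustration, the seed is there purely for reproducibility, and it assumes ages up to and including 100 as in the question.

```python
import numpy as np
import pandas as pd

N = 100  # number of distinct ids
R = 3    # repetitions per id

rng = np.random.default_rng(0)  # seed is an arbitrary choice for reproducibility

# Inclusive [low, high] age bounds per group, and the probability of each group.
bounds = np.array([[0, 14], [15, 64], [65, 100]])
probs = [0.17, 0.65, 0.18]

# Stage 1: pick a group index for every id according to the percentages.
groups = rng.choice(len(bounds), size=N, p=probs)

# Stage 2: draw uniformly inside the chosen group.
# endpoint=True makes the upper bound inclusive.
ages = rng.integers(bounds[groups, 0], bounds[groups, 1], endpoint=True)

df = pd.DataFrame({
    "id": np.repeat(np.arange(1, N + 1), R),
    "age": np.repeat(ages, R),
})
print(df.head(9))
```

Unlike the fixed-count approach, the number of ids per age group varies from run to run here; only the expected proportions are 17%, 65% and 18%.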

Related

Pandas fill dataframe with count of values within a range from another dataframe

I currently have two dataframes, df_ages and df_count:
In [1]: df_ages
Out [1]:
Enrolled Age
1 Y 44
2 Y 35
3 N 37
4 Y 55
5 N 26
6 Y 19
7 N 18
8 N 49
9 Y 26
10 Y 25
11 Y 25
12 Y 32
13 Y 25
14 N 50
15 N 58
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25
2 26 35
3 36 45
4 46 55
5 56 65
I am looking for code to populate the counts column of df_count with the number of people whose age falls within the Min and Max range of each row.
The percentage column should hold that count as a percentage of the total number of entries.
The desired resulting output is shown below:
In [2]: df_count
Out [2]:
Min Max counts percentage
1 18 25 5 33.3
2 26 35 4 26.7
3 36 45 2 13.3
4 46 55 3 20.0
5 56 65 1 6.7

You can use apply on the rows together with Series.between (which includes both endpoints by default):
df_count['counts'] = df_count.apply(lambda row: df_ages['Age'].between(row['Min'], row['Max']).sum(), axis=1)
df_count['percentage'] = df_count['counts'].div(len(df_ages)).mul(100).round(1)
print(df_count)
Min Max counts percentage
0 18 25 5 33.3
1 26 35 4 26.7
2 36 45 2 13.3
3 46 55 3 20.0
4 56 65 1 6.7
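For larger frames, the row-wise apply can be replaced by a single NumPy broadcast comparison. The following is a sketch on the sample data from the question (the Enrolled column is left out because it is not needed for the counts); in_range is a name chosen here for illustration.

```python
import numpy as np
import pandas as pd

df_ages = pd.DataFrame(
    {"Age": [44, 35, 37, 55, 26, 19, 18, 49, 26, 25, 25, 32, 25, 50, 58]}
)
df_count = pd.DataFrame({"Min": [18, 26, 36, 46, 56],
                         "Max": [25, 35, 45, 55, 65]})

# Shape (n_ages, 1) compared against shape (n_ranges,) broadcasts to
# a boolean matrix of shape (n_ages, n_ranges).
ages = df_ages["Age"].to_numpy()[:, None]
in_range = (ages >= df_count["Min"].to_numpy()) & (ages <= df_count["Max"].to_numpy())

# Summing over the age axis counts how many ages fall in each range.
df_count["counts"] = in_range.sum(axis=0)
df_count["percentage"] = (df_count["counts"] / len(df_ages) * 100).round(1)
print(df_count)
```

This trades memory for speed: the intermediate boolean matrix has one cell per (age, range) pair.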

Labeling by period

my dataset
name day value
A 7 88
A 15 101
A 21 121
A 29 56
B 21 131
B 30 78
B 35 102
C 8 80
C 16 101
...
I am trying to plot the values against these days, but there are too many unique day values, so I want to label them by period.
I would like to label them consistently.
Is there a quick way to label the days by cutting them every 7 days (one week)?
For example, days 1-7 = week 1, days 8-14 = week 2, and so on.
The output I want:
name day value week
A 7 88 1
A 15 101 3
A 21 121 3
A 29 56 5
B 21 131 3
B 30 78 5
B 35 102 5
C 8 80 2
C 16 101 3
Thank you for reading.

Subtract 1, use integer division by 7, then add 1:
df['week'] = (df['day'] - 1) // 7 + 1
print (df)
name day value week
0 A 7 88 1
1 A 15 101 3
2 A 21 121 3
3 A 29 56 5
4 B 21 131 3
5 B 30 78 5
6 B 35 102 5
7 C 8 80 2
8 C 16 101 3
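To see why the subtraction is needed, here is a small self-contained sketch on the sample data. The naive_week series is a hypothetical comparison added only for illustration; it shows what happens without the "- 1".

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "day": [7, 15, 21, 29, 21, 30, 35, 8, 16],
    "value": [88, 101, 121, 56, 131, 78, 102, 80, 101],
})

# Shift to 0-based days so that days 1-7 map to 0, days 8-14 map to 1, ...
# and then shift the result back to 1-based week numbers.
df["week"] = (df["day"] - 1) // 7 + 1

# Without the "- 1", every multiple of 7 is pushed into the following week:
# day 7 would become week 2 instead of week 1.
naive_week = df["day"] // 7 + 1

print(df)
print(naive_week.tolist())
```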

Pandas, subtract columns Dataframe in loop

I am new to pandas. I have a DataFrame with 6 columns, and I would like to write a for loop that does the following:
- create a new column (nc1)
- nc1 = column 1 - column 2
I want to repeat this for all columns, so the last one would be:
ncx = column 5 - column 6
I can subtract columns like this:
df['nc'] = df.Column1 - df.Column2
but this is not useful in a loop, since I always have to type the column names.
Can someone tell me how to refer to columns by number?
Thank you!

In [26]: import numpy as np
...: import random
...: import pandas as pd
...:
...: A = pd.DataFrame(np.random.randint(100, size=(5, 6)))
In [27]: A
Out[27]:
0 1 2 3 4 5
0 82 13 17 58 68 67
1 81 45 15 11 20 63
2 0 84 34 60 90 34
3 59 28 46 96 86 53
4 45 74 14 10 5 12
In [28]: for i in range(0, 5):
...:     A[(i + 6)] = A[i] - A[(i + 1)]
...:
...:
...: A
...:
Out[28]:
0 1 2 3 4 5 6 7 8 9 10
0 82 13 17 58 68 67 69 -4 -41 -10 1
1 81 45 15 11 20 63 36 30 4 -9 -43
2 0 84 34 60 90 34 -84 50 -26 -30 56
3 59 28 46 96 86 53 31 -18 -50 10 33
4 45 74 14 10 5 12 -29 60 4 5 -7
In [29]: nc = 1 #The first new column
...: A[(nc + 5)] #outputs the first new column
Out[29]:
0 69
1 36
2 -84
3 31
4 -29
Here you don't need to refer to a column by name, only by its number, so you can write a simple function that returns column n + 5.
Something like this:
In [31]: def call_new_column(n):
...:     return(A[(n + 5)])
...:
...:
...: call_new_column(2)
Out[31]:
0 -4
1 30
2 50
3 -18
4 60
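If the columns have names rather than integer labels, one way to avoid typing them is to iterate over consecutive pairs of df.columns. This is a sketch with made-up column names (a to f) and the nc1, nc2, ... naming from the question.

```python
import pandas as pd

# Hypothetical data and column names, used only for illustration.
df = pd.DataFrame({
    "a": [10, 20], "b": [1, 2], "c": [5, 5],
    "d": [3, 8], "e": [7, 0], "f": [2, 2],
})

# Take a snapshot of the original columns so that the new columns
# added inside the loop are not picked up by the iteration.
original_cols = list(df.columns)

# zip pairs each column with its right-hand neighbour: (a, b), (b, c), ...
for k, (left, right) in enumerate(zip(original_cols, original_cols[1:]), start=1):
    df[f"nc{k}"] = df[left] - df[right]

print(df)
```

Purely positional access is also possible with df.iloc[:, i], which selects the i-th column regardless of its label.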

Pandas: compute numerous columns of percentage values

I'm failing to loop through the values of select dataframe columns in order to create new columns representing percentage values. Reproducible example:
data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)
df
Respondents answer_1 answer_2 answer_3
0 90 51 11 3
1 43 15 12 8
2 89 15 14 4
3 89 61 40 0
4 67 16 36 2
5 88 14 78 7
6 73 15 12 10
7 78 1 0 11
8 62 0 26 6
9 101 16 78 7
The aim is to compute the percentage for each of the answer columns against the total respondents. For example, for the new answer_1 column (let's name it answer_1_perc) the first value would be 57 (because 51 is about 57% of 90), and the next value would be 35 (15 is about 35% of 43). Then there would be answer_2_perc and answer_3_perc columns.
I've written so many iterations of the following code my head's spinning.
for columns in df.iloc[:, 1:4]:
    for i in columns:
        i_name = 'percentage_' + str(columns)
        i_group = ([i] / df['Respondents'] * 100)
        df[i_name] = i_group
What is the best way to do this? I need to use an iterative method as my actual data has 25 answer columns rather than the 3 shown in this example.

You almost had it. Note that the Respondents column contains string values, which I converted to numbers before running the following:
In [172]:
for col in df.columns[1:4]:
    i_name = 'percentage_' + col
    i_group = (df[col] / df['Respondents']) * 100
    df[i_name] = i_group
df
Out[172]:
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693
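The conversion step mentioned in this answer is not shown, so here is one possible way to do it, as a sketch: pd.to_numeric turns the mixed int/str Respondents column into a numeric dtype. Selecting the answer columns by their answer_ prefix is an assumption made here so that the loop scales to 25 columns without hard-coding positions.

```python
import pandas as pd

data = {'Respondents': [90, 43, 89, '89', '67', '88', '73', '78', '62', '101'],
        'answer_1': [51, 15, 15, 61, 16, 14, 15, 1, 0, 16],
        'answer_2': [11, 12, 14, 40, 36, 78, 12, 0, 26, 78],
        'answer_3': [3, 8, 4, 0, 2, 7, 10, 11, 6, 7]}
df = pd.DataFrame(data)

# The mixed int/str column is stored as dtype object; make it numeric first.
df['Respondents'] = pd.to_numeric(df['Respondents'])

# Assumption: every answer column starts with 'answer_'.
answer_cols = [c for c in df.columns if c.startswith('answer_')]
for col in answer_cols:
    df['percentage_' + col] = df[col] / df['Respondents'] * 100

print(df.round(1))
```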

I recommend using div and concat:
df['Respondents'] = df['Respondents'].astype(float)
df_pct = (df.drop('Respondents', axis=1)
.div(df['Respondents'], axis=0)
.mul(100)
.rename(columns=lambda col: 'percentage_' + col)
)
pd.concat([df, df_pct], axis=1)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90.0 51 11 3 56.666667
1 43.0 15 12 8 34.883721
2 89.0 15 14 4 16.853933
3 89.0 61 40 0 68.539326
4 67.0 16 36 2 23.880597
5 88.0 14 78 7 15.909091
6 73.0 15 12 10 20.547945
7 78.0 1 0 11 1.282051
8 62.0 0 26 6 0.000000
9 101.0 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693

Another solution: divide the desired columns by the Respondents column with div, then assign the result to the new column names (this assumes Respondents has already been converted to a numeric dtype):
print ('percentage_' + df.columns[1:4])
Index(['percentage_answer_1', 'percentage_answer_2', 'percentage_answer_3'], dtype='object')
df['percentage_' + df.columns[1:4]] = df.iloc[:, 1:4].div(df.Respondents, axis=0) * 100
print (df)
Respondents answer_1 answer_2 answer_3 percentage_answer_1 \
0 90 51 11 3 56.666667
1 43 15 12 8 34.883721
2 89 15 14 4 16.853933
3 89 61 40 0 68.539326
4 67 16 36 2 23.880597
5 88 14 78 7 15.909091
6 73 15 12 10 20.547945
7 78 1 0 11 1.282051
8 62 0 26 6 0.000000
9 101 16 78 7 15.841584
percentage_answer_2 percentage_answer_3
0 12.222222 3.333333
1 27.906977 18.604651
2 15.730337 4.494382
3 44.943820 0.000000
4 53.731343 2.985075
5 88.636364 7.954545
6 16.438356 13.698630
7 0.000000 14.102564
8 41.935484 9.677419
9 77.227723 6.930693

Split output of loop by columns used as input

Hi I'm relatively new to Python and am currently working on trying to measure the width of features in an image. The resolution of my image is 1m so measuring the width should be easier. I've managed to select certain columns or rows of the image and extract the necessary data using loops and such. My code is below:
subset = imarray[:,::500]#(imarray.shape[1]/2):(imarray.shape[1]/2)+1]
subset[(subset > 0) & (subset <= 17)] = 1
subset[(subset > 17)] = 0
width = []
count = 0
for i in np.arange(subset.shape[1]):
    column = subset[:,i]
    for value in column:
        if (value == 1):
            count += 1
            width.append(count)
            width_arr = np.array(width).astype('uint8')
        else:
            count = 0
final = np.split(width_arr, np.argwhere(width_arr == 1).flatten())
final2 = [x for x in final if x.size > 0]
width2 = []
for array in final2:
    width2.append(max(array))
width2 = np.array(width2).astype('uint8')
print(width2)
I can't figure out how to split the output up so it shows the results for each column or row individually. Instead all I've been able to do is to append the data to an empty list and here's the output for that:
[ 70 35 4 2 5 36 4 5 2 51 97 4 228 3 21 47 7 21
23 58 126 4 111 2 2 5 3 2 18 15 6 19 3 3 12 15
6 8 2 4 6 88 122 24 14 49 73 57 74 6 179 8 3 2
6 3 184 9 3 19 24 3 2 2 3 255 30 8 191 33 127 5
3 27 112 2 24 2 5 2 10 30 10 6 37 2 38 6 12 17
44 67 23 5 101 10 9 4 6 4 255 136 5 255 255 255 255 26
255 235 148 4 255 199 3 2 114 87 255 109 69 12 41 20 30 57
72 89 32]
So these are the widths of the features in all the columns appended together. How do I use my loop or another method to split these up into individual numpy arrays representing each column I've sliced out of the original?
It seems like I am almost there but I can't seem to figure that last step out and it's driving me nuts.
Thanks in advance for your help!
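One possible approach, sketched under the assumption that a "width" is the length of a consecutive run of 1s within a single column of the thresholded array: compute the run lengths per column and collect one array per column in a list. The helper run_lengths and the small mask array are hypothetical stand-ins for the real image data.

```python
import numpy as np

def run_lengths(column):
    """Return the length of each consecutive run of 1s in a 1-D 0/1 array."""
    # Cast to a signed integer type so that np.diff can produce -1
    # (an unsigned type such as uint8 would wrap around instead).
    column = np.asarray(column, dtype=int)
    # Pad with zeros so runs touching either edge still have a start and an end.
    padded = np.concatenate(([0], column, [0]))
    changes = np.diff(padded)
    starts = np.flatnonzero(changes == 1)   # 0 -> 1 transitions
    ends = np.flatnonzero(changes == -1)    # 1 -> 0 transitions
    return ends - starts

# Tiny stand-in for the thresholded subset: 5 rows, 3 sampled columns.
mask = np.array([[1, 0, 1],
                 [1, 0, 1],
                 [0, 1, 1],
                 [1, 1, 0],
                 [1, 0, 1]], dtype='uint8')

# One array of widths per column, kept separate instead of appended together.
widths_per_column = [run_lengths(mask[:, j]) for j in range(mask.shape[1])]
for j, widths in enumerate(widths_per_column):
    print(j, widths)
```

A list is used because each column can contain a different number of runs, so the results do not fit a rectangular array.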
