Creating multiple subsets from single data frame, without replacement - python

I am trying to create 10 different subsets of 5 Members without replacement from this data (in Python):
Member CIN Needs Assessment Network Enrolled
117 CS38976K 1 1
118 GN31829N 1 1
119 GD98216H 1 1
120 VJ71307A 1 1
121 OX22563R 1 1
122 YW35494W 1 1
123 QX20765B 1 1
124 NO50548K 1 1
125 VX90647K 1 1
126 RG21661H 1 1
127 IT17216C 1 1
128 LD81088I 1 1
129 UZ49716O 1 1
130 UA16736M 1 1
131 GN07797S 1 1
132 TN64827F 1 1
133 MZ23779M 1 1
134 UG76487P 1 1
135 CY90885V 1 1
136 NZ74233H 1 1
137 CB59280X 1 1
138 LI89002Q 1 1
139 LO64230I 1 1
140 NY27508Q 1 1
141 GU30027P 1 1
142 XJ75065T 1 1
143 OW40240P 1 1
144 JQ23187C 1 1
145 PQ45586F 1 1
146 IM59460P 1 1
147 OU17576V 1 1
148 KL75129O 1 1
149 XI38543M 1 1
150 PO09602E 1 1
151 PS27561N 1 1
152 PC63391R 1 1
153 WR70847S 1 1
154 XL19132L 1 1
155 ZX27683R 1 1
156 MZ63663M 1 1
157 FT35723P 1 1
158 NX90823W 1 1
159 SC16809F 1 1
160 TX83955R 1 1
161 JA79273O 1 1
162 SK66781D 1 1
163 UK69813N 1 1
164 CX01143B 1 1
165 MT45485A 1 1
166 LJ25921O 1 1
I have tried many variations of random.sample() inside a for _ in range() loop, but nothing is working, and nothing I have found on Stack Overflow so far gives me the result I need.

Here is a solution using pandas.
Say that master is your master dataframe created with pandas, you can do:
shuffled = master.sample(frac=1)
This creates a copy of your master dataframe with rows randomly reordered. See this answer on stackoverflow or the docs for the sample method.
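If you want the same shuffle on every run, sample also accepts a seed; a small optional addition, not part of the original answer:
shuffled = master.sample(frac=1, random_state=42)  # fixed seed gives a reproducible shuffle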
Then you can simply build 10 smaller dataframes of five rows going in order.
subsets = []
for i in range(10):
    subdf = shuffled.iloc[(i*5):(i+1)*5]
    subsets.append(subdf)
subsets is the list containing your small dataframes. Do:
for sub in subsets:
    print(sub)
to print them all and verify by eye that there are no repetitions.
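Putting the whole answer together, here is a self-contained sketch; the toy master frame and column values are made up for illustration, so swap in your real data:
import pandas as pd

# stand-in for the real master dataframe (50 members)
master = pd.DataFrame({
    "Member": range(117, 167),
    "Needs Assessment": 1,
    "Network Enrolled": 1,
})

shuffled = master.sample(frac=1, random_state=0)   # shuffle once, no replacement
subsets = [shuffled.iloc[i*5:(i+1)*5] for i in range(10)]

# sanity check: 10 subsets of 5 rows, and no member appears twice
assert sum(len(s) for s in subsets) == 50
assert not pd.concat(subsets)["Member"].duplicated().any()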

This seems like a combination problem. Here is a solution:
First create your list, say L, and decide the size of each subset, say r. After that, here is the code:
from itertools import combinations
combinations(L, r)
However, if you don't want to fix the subset size in advance, you can use the random module as follows:
import random
from itertools import combinations
r = random.randint(a, b)
combinations(L, r)
Here r is a random integer between a and b, and combinations(L, r) then yields every subset of L with r elements. If you want to do that 10 times, you can wrap it in a for loop, as sketched below.
I hope that works for you.
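As a concrete version of that loop (the list L and the bounds a and b here are placeholders; note that drawing a single random subset per iteration is exactly what random.sample does):
import random
from itertools import combinations

L = list(range(20))   # placeholder data
a, b = 3, 6           # placeholder bounds for the subset size

# all subsets of a fixed size r
r = 4
all_r_subsets = list(combinations(L, r))

# ten subsets of random size, each drawn without replacement within itself
for _ in range(10):
    r = random.randint(a, b)
    print(random.sample(L, r))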

Let's assume that lines is a variable holding the rows of your dataset as a list (random.sample needs a sequence, so materialise an iterator with list() first). Then:
from random import sample
# Chunk length
chunk_len = 2
# Number of chunks
num_of_chunks = 5
# Get the sample with data for all chunks. It guarantees us that there will
# be no repetitions
random_sample = sample(lines, num_of_chunks*chunk_len)
# Construct the list with chunks
result = [random_sample[i::num_of_chunks] for i in range(num_of_chunks)]
result
Will return:
[['123 QX20765B 1 1',
'118 GN31829N 1 1'],
['127 IT17216C 1 1',
'122 YW35494W 1 1'],
['138 LI89002Q 1 1',
'126 RG21661H 1 1'],
['120 VJ71307A 1 1',
'121 OX22563R 1 1'],
['143 OW40240P 1 1',
'142 XJ75065T 1 1']]
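If lines is actually a pandas DataFrame rather than a list of strings, the same trick can be applied to row positions instead; a sketch, assuming a DataFrame called df with at least 50 rows:
from random import sample

chunk_len = 5
num_of_chunks = 10
positions = sample(range(len(df)), num_of_chunks * chunk_len)   # 50 distinct row positions
chunks = [df.iloc[positions[i::num_of_chunks]] for i in range(num_of_chunks)]
Each chunk is then a 5-row DataFrame, and no row appears in more than one chunk.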

Related

Counting frequencies of a list of words in each row in a data frame in python

I would like to ask a question about how to create new column names for an existing data frame from a list of column names. I was counting verb frequencies in each string in a data frame. The verb list looks as below:
<bound method DataFrame.to_dict of verb
0 agree
1 bear
2 care
3 choose
4 be>
The code below runs, but the output is the total frequency of all the words combined, instead of a separate column for each word in the word list.
#ver.1 code
import pandas as pd
verb = pd.read_csv('cog_verb.csv')
df2 = pd.DataFrame(df.answer_id)
for x in verb:
    df2[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))
The code was updated reflecting the helpful comment by Drakax, as below:
#updated code
for x in verb:
    df2.to_dict()[f'count_{x}'] = lemma.str.count('|'.join(r"\b{}\b".format(x)))
but both versions produce the same output:
<bound method DataFrame.to_dict of answer_id count_verb
0 312 91
1 1110 123
2 2700 102
3 2764 217
4 2806 182
.. ... ...
321 33417 336
322 36558 517
323 37316 137
324 37526 119
325 45683 1194
[326 rows x 2 columns]>
----- updated info----
As advised by Drakax, I add the first data frame below.
df.to_dict
<bound method DataFrame.to_dict of answer_id text
0 312 ANON_NAME_0\n Here are a few instructions for ...
1 1110 October16,2006 \nDear Dad,\n\n I am going to g...
2 2700 My Writing Habits\n I do many things before I...
3 2764 My Ideas about Writing\n I have many ideas bef...
4 2806 I've main habits for writing and I sure each o...
.. ... ...
321 33417 ????????????????????????\n???????????????? ?? ...
322 36558 In this world, there are countless numbers of...
323 37316 My Friend's Room\nWhen I was kid I used to go ...
324 37526 ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ...
325 45683 Primary and Secondary Education in South Korea...
[326 rows x 2 columns]>
The totals above are computed correctly, but I want each word's frequency in its own column.
I appreciate any help you can provide. Many thanks in advance!
It still seems a bit of a mess, but I think I've understood what you want; you can adapt/update your code from mine:
1. This step is only to set up test data; it creates a new DF of randomly generated strings:
import pandas as pd
from pandas._testing import rands_array
randstr = rands_array(10, 10)
df = pd.DataFrame(data=randstr, columns=["randstr"])
df
index  randstr     count
0      20uDmHdBL5  1
1      E62AeycGdy  1
2      tHz99eI8BC  1
3      iZLXfs7R4k  1
4      bURRiuxHvc  2
5      lBDzVuB3z9  1
6      GuIZHOYUr5  1
7      k4wVvqeRkD  1
8      oAIGt8pHbI  1
9      N3BUMfit7a  2
2. Then to count the occurrences of your desired regex simply do this:
reg = ['a','e','i','o','u']  # this is where you store your verbs
def count_reg(df):
    for i in reg:
        df[i] = df['randstr'].str.count(i)
    return df
count_reg(df)
index  randstr     a  e  i  o  u
0      h2wcd5yULo  0  0  0  1  0
1      uI400TZnJl  0  0  0  0  1
2      qMiI7morYG  0  0  1  1  0
3      f6Aw6AH3TL  0  0  0  0  0
4      nJ0h9IsDn6  0  0  0  0  0
5      tWyNxnzLwv  0  0  0  0  0
6      V4sTYcPsiB  0  0  1  0  0
7      tSgni67247  0  0  1  0  0
8      sUZn3L08JN  0  0  0  0  0
9      qDiG3Zynk0  0  0  1  0  0
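Carried back to the original question, the same pattern produces one count column per verb with word-boundary matching; a sketch, assuming df has the answer_id and text columns shown above and verb has a verb column (so lemma in the original code corresponds to df['text']):
import pandas as pd

verbs = verb['verb'].tolist()      # ['agree', 'bear', 'care', 'choose', 'be', ...]
df2 = pd.DataFrame({'answer_id': df['answer_id']})
for v in verbs:
    df2[f'count_{v}'] = df['text'].str.count(rf"\b{v}\b")
Each verb then ends up in its own count_ column instead of a single aggregated total.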

Sample Pandas Dataframe with equal number based on binary column

I have a dataframe with a data column, and a value column, as in the example below. The value column is always binary, 0 or 1:
data,value
173,1
1378,0
926,0
643,0
1279,0
472,0
706,0
1345,0
1167,1
1401,1
1236,0
447,1
1204,1
398,0
714,0
734,0
1732,0
98,0
1696,0
160,0
1611,0
274,1
562,0
625,0
1028,0
1766,0
511,0
1691,0
898,1
I need to sample the dataset so that I end up with an equal number of both values. So if there are fewer rows of class 1, I need to use that class as the reference size; in turn, if there are fewer rows of class 0, I need to use that one.
Any clues on how to do this? I'm working in a Jupyter notebook with Python 3.6 (I cannot upgrade).
Sample data
import pandas as pd

data = [173,926,634,706,398]
value = [1,0,0,1,0]
df = pd.DataFrame({"data": data, "value": value})
print(df)
# data value
# 0 173 1
# 1 926 0
# 2 634 0
# 3 706 1
# 4 398 0
Filter to two DFs
ones = df[df['value'] == 1]
zeros = df[df['value'] == 0]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
# data value
# 1 926 0
# 2 634 0
# 4 398 0
Truncate as required
Find the minimum and then truncate it (take n first rows)
if len(ones) <= len(zeros):
    zeros = zeros.iloc[:len(ones), :]
else:
    ones = ones.iloc[:len(zeros), :]
print(ones)
print()
print()
print(zeros)
# data value
# 0 173 1
# 3 706 1
#
#
# data value
# 1 926 0
# 2 634 0
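If a random pick per class is preferred over just taking the first rows, one option is to sample each class down to the smaller count and recombine; a sketch building on the ones/zeros frames above (random_state is only there for reproducibility):
n = min(len(ones), len(zeros))
balanced = pd.concat([ones.sample(n, random_state=0),
                      zeros.sample(n, random_state=0)])
print(balanced)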
Group your dataframe by values, and then take a sample of the smallest count from each group.
grouped = df.groupby(['value'])
smallest = grouped.size().min()   # smallest group size, as a plain integer
try:  # pandas 1.1.0+
    print(grouped.sample(smallest))
except AttributeError:  # pre-1.1.0: sample each group via apply
    print(grouped.apply(lambda g: g.sample(smallest)))
Output:
data value
25 1766 0
3 643 0
10 1236 0
1 1378 0
14 714 0
6 706 0
24 1028 0
8 1167 1
9 1401 1
0 173 1
12 1204 1
11 447 1
28 898 1
21 274 1
This should do it.
df.groupby('value').sample(df.groupby('value').size().min())
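A small variation of the same one-liner avoids computing the groupby twice (DataFrameGroupBy.sample needs pandas 1.1 or newer):
grouped = df.groupby('value')
balanced = grouped.sample(grouped.size().min(), random_state=0)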

Modify Python list of numpy arrays inside method without passing entire list

I have a standard Python list of numpy arrays because the items don't all have the same dimensions. I convert it to uint8 (huge amounts of hex values), then perform operations to normalise the dimensions of each array, so that I can convert the list into a multi-dimensional numpy array.
To achieve this I need to split the work into a few methods to keep the code readable, but when I pass one numpy array from the list into a method I also have to pass the entire list and the loop index:
def push_bytes(messages, message, start_byte, i):
    messages[i] = np.insert(messages[i], start_byte, 0, 0)
I call this method many times, so I would like to avoid passing the entire list and the index, and instead do something like this:
def push_bytes(message, start_byte):
    message = np.insert(message, start_byte, 0, 0)
I believe this doesn't work because message = ... creates a new numpy array rather than pointing at the original one. Is there a way I can modify the original array without having to pass the entire list and an index?
Sample data:
messages = [
[ 5 1 0 0 0 47 69 222 10 221 242 132 0 0 0 79 0 0 ]
[ 5 1 0 0 0 27 68 222 10 86 7 133 95 126 220 38 0 ]
[ 5 1 0 0 45 48 0 0 7 10 86 7 133 95 126 220 30 0 0 0 79 0 0 ]
[ 5 1 0 0 0 47 69 222 10 129 10 133 95 126 220 5 0 0 0 75 0 0 ]
[ 5 1 0 0 17 39 0 0 112 66 222 10 129 10 133 ]
[ 5 1 0 0 7 69 222 10 138 0 0 55 0 0 0 79 0 0 ]
[ 5 222 10 138 10 133 95 126 0 0 24 0 0 0 79 0 0 ]
[ 17 39 0 0 232 66 222 10 138 10 133 0 0 0 0 0 93 0 0 ]
]
A function like:
def push_bytes(messages, i, start_byte):
    messages[i] = np.insert(messages[i], start_byte, 0, 0)
will modify (replace actually) the i'th element of the messages list. What's being passed is a reference to the list. There's no copying or other expensive stuff.
The np.insert function does create a new array, and a reference to that new array is placed in the messages list.
In:
def push_bytes(message, start_byte):
    message = np.insert(message, start_byte, 0, 0)
the message = ... line assigns the new array to the message variable, but that variable is local to the function and does not change anything outside it. You have to add a
return message
and call it as
messages[i] = push_bytes(messages[i], start_byte)
to modify the element of the messages list.
I think the two functions will have similar execution times, since they just pass references, and do not require different calculations or copies. The second is, for most purposes, cleaner, since it doesn't assume anything about messages; it just does the new array creation (I assume there's more to this than a simple call to np.insert).
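A small self-contained example of the difference, with toy arrays and hypothetical function names:
import numpy as np

messages = [np.array([5, 1, 0]), np.array([5, 1, 0, 0])]

def push_bytes_inplace(messages, i, start_byte):
    # rebinding the list slot is visible to the caller
    messages[i] = np.insert(messages[i], start_byte, 0)

def push_bytes_local(message, start_byte):
    # rebinding a local name is not visible to the caller,
    # so the new array must be returned
    return np.insert(message, start_byte, 0)

push_bytes_inplace(messages, 0, 1)
print(messages[0])                                   # [5 0 1 0]

messages[1] = push_bytes_local(messages[1], 1)
print(messages[1])                                   # [5 0 1 0 0]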

how to get the top n values and corresponding column headers to append to a Pandas dataframe

I have the results of a TensorFlow multi-class prediction, and I have been able to get the top value for each row and its corresponding column header (which is the most likely predicted class) to append to the original data for further analysis, like so:
The original results df with the prediction odds looks roughly like the following, but with 260 columns. The column headers are the first row of ints; the likelihoods are in rows 0, 1, and so on, for millions of rows.
0 1 2 3 4 5 6 7 8 9 10 11 ....... 259
0 8.840584e-08 0.000115 0.000210 0.001662 0.002789
1 0.000312 0.000549 0.002412 0.000630 0.000077
The code that worked to get the top value (contained in the row) is:
eval_datan['odds']=predsdf.max(axis=1) #gets the largest value in the row
And to get the corresponding column header and append it to the original DF:
eval_datan['pred']=predsdf.idxmax(axis=1) #gets the column header for the largest value
I can't figure out how to get the top n (the top 5, say) and add them to the original DF.
The result currently looks like:
agegrp gender race marital_status region ccs1 ccs2 ccs3 ccs4 ccs5 odds pred
0 272 284 298 288 307 101 164 53 98 200 0.066987 102
1 272 285 300 290 307 204 120 147 258 151 0.196983 47
2 272 284 298 289 307 197 2 39 253 259 0.109894 259
So what I want is the top 5 preds and the top 5 odds appended to the end of the original data.
I've looked at nlargest in pandas, but so far no luck.
You can pick your top N features by changing the variable n below.
import pandas as pd
df = pd.read_table('your_sample_data.txt', delimiter=r'\s+')
n = 3  # Top N features
frames = []
df.T.apply(lambda x: frames.append(x.sort_values(ascending=False).head(n).index.tolist()), axis=0)
print(df)
print(df.join(pd.DataFrame(frames, columns=['ccs{}'.format(i+1) for i in range(n)])))
0 1 2 3 4
0 8.840584e-08 0.000115 0.000210 0.001662 0.002789
1 3.120000e-04 0.000549 0.002412 0.000630 0.000077
0 1 2 3 4 ccs1 ccs2 ccs3
0 8.840584e-08 0.000115 0.000210 0.001662 0.002789 4 3 2
1 3.120000e-04 0.000549 0.002412 0.000630 0.000077 2 3 1
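Another route that stays close to the max/idxmax pattern from the question is numpy's argsort, which yields the top-n column labels and the corresponding odds in one pass; a sketch, assuming the prediction frame is called predsdf and its row index lines up with eval_datan:
import numpy as np
import pandas as pd

n = 5
order = np.argsort(-predsdf.values, axis=1)[:, :n]             # positions of the top n per row
top_preds = pd.DataFrame(predsdf.columns.values[order],
                         index=predsdf.index,
                         columns=[f'pred{i+1}' for i in range(n)])
top_odds = pd.DataFrame(np.take_along_axis(predsdf.values, order, axis=1),
                        index=predsdf.index,
                        columns=[f'odds{i+1}' for i in range(n)])
eval_datan = eval_datan.join([top_preds, top_odds])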

Python Pandas Feature Generation as aggregate function

I have a pandas df which is more or less like
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
.....
This DF contains a couple of million points. I am trying to generate some descriptors to incorporate the time nature of the data. The idea is that for each line I should create a window of length x going back in the data and count the occurrences of the particular key in the window. I did an implementation, but according to my estimate, for 23 different windows the calculation would run for 32 days. Here is the code:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
There are multiple windows of different lengths. I have an uneasy feeling, however, that iterating row by row is probably not the smartest way to do this aggregation. Is there a way to implement it so that it runs faster?
On a toy example data frame, you can achieve about a 7x speedup by using apply() instead of iterrows().
Here's some sample data, expanded a bit from OP to include multiple key values:
ID key dist
0 1 57 1
1 2 22 1
2 3 12 1
3 4 45 1
4 5 94 1
5 6 36 1
6 7 38 1
7 8 94 1
8 9 94 1
9 10 38 1
import pandas as pd
df = pd.read_clipboard()
Based on these data, and the counting criteria defined by OP, we expect the output to be:
key dist window
ID
1 57 1 0
2 22 1 0
3 12 1 0
4 45 1 0
5 94 1 0
6 36 1 0
7 38 1 0
8 94 1 1
9 94 1 2
10 38 1 1
Using OP's approach:
def features_wind2(inp):
    all_window = inp
    all_window['window1'] = 0
    for index, row in all_window.iterrows():
        lid = index
        lid1 = lid - 200
        pid = row['key']
        row['window1'] = all_window.query('index < %d & index > %d & key == %d' % (lid, lid1, pid)).count()[0]
    return all_window
print('old solution: ')
%timeit features_wind2(df)
old solution:
10 loops, best of 3: 25.6 ms per loop
Using apply():
def compute_window(row):
    # when using apply(), .name gives the row index
    # pandas indexing is inclusive, so take index-1 as cut_idx
    cut_idx = row.name - 1
    key = row.key
    # count the number of instances key appears in df, prior to this row
    # (.loc replaces the removed .ix; with an integer index it behaves the same here)
    return sum(df.loc[:cut_idx, 'key'] == key)
print('new solution: ')
%timeit df['window1'] = df.apply(compute_window, axis='columns')
new solution:
100 loops, best of 3: 3.71 ms per loop
Note that with millions of records, this will still take awhile, and the relative performance gains will likely be diminished somewhat compared to this small test case.
UPDATE
Here's an even faster solution, using groupby() and cumsum(). I made some sample data that seems roughly in line with the provided example, but with 10 million rows. The computation finishes in well under a second, on average:
# sample data
import numpy as np
import pandas as pd
N = int(1e7)
idx = np.arange(N)
keys = np.random.randint(1,100,size=N)
dists = np.ones(N).astype(int)
df = pd.DataFrame({'ID':idx,'key':keys,'dist':dists})
df = df.set_index('ID')
Now performance testing:
%timeit df['window'] = df.groupby('key').cumsum().subtract(1)
1 loop, best of 3: 755 ms per loop
Here's enough output to show that the computation is working:
dist key window
ID
0 1 83 0
1 1 4 0
2 1 87 0
3 1 66 0
4 1 31 0
5 1 33 0
6 1 1 0
7 1 77 0
8 1 49 0
9 1 49 1
10 1 97 0
11 1 36 0
12 1 19 0
13 1 75 0
14 1 4 1
Note: To revert ID from index to column, use df.reset_index() at the end.
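One caveat: the cumsum trick counts all earlier occurrences of a key, not just those in the last x rows. If the count really has to be bounded by a window (the 200-row look-back from the question), a per-key searchsorted over row positions is one vectorised option; a sketch, not from the original answer, with x and window_x as assumed names:
import numpy as np
import pandas as pd

x = 200  # window length, in rows

def bounded_count(positions):
    # positions: row positions of one key, in increasing order; for each occurrence,
    # count the earlier occurrences that fall strictly within the previous x rows
    p = positions.to_numpy()
    left = np.searchsorted(p, p - x, side='right')
    return np.arange(len(p)) - left

pos = pd.Series(np.arange(len(df)), index=df.index)   # row position of every record
df['window_x'] = pos.groupby(df['key']).transform(bounded_count)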
