How to select records randomly from a DataFrame? - python

I have the following single-column pandas DataFrame called y. The column is named 0 (zero).
y =
1
0
0
1
0
1
1
2
0
1
1
2
2
2
2
1
0
0
I want to select N row indices of records per y value. In the above example, there are 6 records of 0, 7 records of 1 and 5 records of 2.
I need to select 4 records from each of these 3 groups.
Below I provide my code. However, this code always selects the first N (e.g. 4) records per class. I need the selection to be done randomly over the whole dataset.
How can I do it?
idx0 = []
idx1 = []
idx2 = []
for i in range(0, len(y[0])):
    if y[0].iloc[i] == 0 and len(idx0) <= 4:
        idx0.append(i)
    if y[0].iloc[i] == 1 and len(idx1) <= 4:
        idx1.append(i)
    if y[0].iloc[i] == 2 and len(idx2) <= 4:
        idx2.append(i)
Update:
The expected outcome is a list of indices, not the filtered DataFrame y.
n = 4
a = y.groupby(0).apply(lambda x: x.sample(n)).reset_index(1).\
rename(columns={'level_1':'indices'}).reset_index(drop=True).groupby(0)['indices'].\
apply(list).reset_index()
cls = 0  # 'class' is a reserved word in Python, so use a different name
idx = a.loc[2].tolist()[cls]
y.values[idx]  # THIS RETURNS WRONG CLASSES IN SOME CASES
0
1   # <- WRONG
0
0

Use groupby() with df.sample():
n=4
df.groupby('Y').apply(lambda x: x.sample(n)).reset_index(drop=True)
Y
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 2
9 2
10 2
11 2
EDIT, for index:
df.groupby('Y').apply(lambda x: x.sample(n)).reset_index(1).\
rename(columns={'level_1':'indices'}).reset_index(drop=True).groupby('Y')['indices'].\
apply(list).reset_index()
Y indices
0 0 [4, 1, 17, 16]
1 1 [0, 6, 10, 5]
2 2 [13, 14, 7, 11]
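On pandas 1.1 or newer there is a shorter route (a sketch, not part of the answer above, reusing its df/'Y' naming): GroupBy.sample keeps the original index, so the sampled labels can be collected directly:
import pandas as pd

n = 4
sampled = df.groupby('Y').sample(n=n)                      # requires pandas >= 1.1
idx_per_class = sampled.groupby('Y').apply(lambda g: g.index.tolist())
print(idx_per_class)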

Using
import numpy as np

idx0, idx1, idx2 = [np.random.choice(y.index.values, 4, replace=False).tolist()
                    for _, y in df.groupby(0)]
idx0
Out[48]: [1, 2, 16, 8]
In more detail:
s=pd.Series([1,0,1,0,2],index=[1,3,4,5,9])
idx = [1, 4]  # both anky's answer and mine return index labels
s.loc[idx]    # using .loc with index labels is correct
Out[59]:
1 1
4 1
dtype: int64
s.values[idx]  # indexing .values positionally with index labels is wrong
Out[60]: array([0, 2], dtype=int64)

Supposing column "y" belongs to a dataframe "df" and you want to select N=4 random rows:
for i in np.unique(df.y).astype(int):
    print(df.y[np.random.choice(np.where(df.y == np.unique(df.y)[i])[0], 4)])
You will get:
10116 0
329 0
4709 0
5630 0
Name: y, dtype: int32
382 1
392 1
9124 1
383 1
Name: y, dtype: int32
221 2
443 2
4235 2
5322 2
Name: y, dtype: int32
Edited, to get index:
pd.concat([df.y[np.random.choice(np.where(df.y==np.unique(df.y)[i])[0],4)] for i in np.unique(df.y).astype(int)],axis=0)
You will get:
10116 0
329 0
4709 0
5630 0
382 1
392 1
9124 1
383 1
221 2
443 2
4235 2
5322 2
Name: y, dtype: int32
To get a nested list of indices:
[df.y[np.random.choice(np.where(df.y == np.unique(df.y)[i])[0], 4)].index.tolist() for i in np.unique(df.y).astype(int)]
You will get:
[[10116,329,4709,5630],[382,392,9124,383],[221,443,4235,5322]]
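One caveat worth noting (not mentioned in the answer): np.random.choice samples with replacement by default, so the same row can be drawn twice. A sketch of the same idea with replace=False, reusing the df.y naming from above:
import numpy as np

picks = [np.random.choice(np.where(df.y == v)[0], 4, replace=False) for v in np.unique(df.y)]
idx_lists = [df.index[p].tolist() for p in picks]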

N = 4
y.loc[y[0]==0].sample(N)
y.loc[y[0]==1].sample(N)
y.loc[y[0]==2].sample(N)
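Since the question asks for row indices rather than the filtered rows, a possible extension of this answer (an assumption, not stated by its author) is to take .index from each sample:
N = 4
idx0 = y.loc[y[0] == 0].sample(N).index.tolist()
idx1 = y.loc[y[0] == 1].sample(N).index.tolist()
idx2 = y.loc[y[0] == 2].sample(N).index.tolist()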

Related

Count until condition is reached in Pandas

I need some input from you. The idea is that I would like to see how long (in rows) it takes before a new value shows up in column SUB_B1, and a new value in SUB_B2, i.e. how many steps there are between SUB_A1 and SUB_B1, and between SUB_A2 and SUB_B2.
I have structured the data something like this (I sort by the result column in descending order within each A; after that I split the B and A index labels and place them in new columns):
df.sort_values(['A','result'], ascending=[True,False]).set_index(['A','B'])
                 result  SUB_A1  SUB_A2  SUB_B1  SUB_B2
A      B
10_125 10_173  0.903257      10     125      10     173
       10_332  0.847333      10     125      10     332
       10_243  0.842802      10     125      10     243
       10_522  0.836335      10     125      10     522
       58_941  0.810760      10     125      58     941
...                 ...     ...     ...     ...     ...
10_173 10_125  0.903257      10     173      10     125
       58_941  0.847333      10     173      58     941
       1_941   0.842802      10     173       1     941
       96_512  0.836335      10     173      96     512
       10_513  0.810760      10     173      10     513
This is what I have done so far (edit: I think I need to iterate over values[], but I haven't managed to loop beyond the first rows yet):
def func(group):
    if group.SUB_A1.values[0] == group.SUB_B1.values[0]:
        group.R1.values[0] = 1
    else:
        group.R1.values[0] = 0
    if group.SUB_A1.values[0] == group.SUB_B1.values[1] and group.R1.values[0] == 1:
        group.R1.values[1] = 2
    else:
        group.R1.values[1] = 0

df['R1'] = 0
df.groupby('A').apply(func)
Expected outcome:
                 result  SUB_B1  SUB_B2  R1  R2
A      B
10_125 10_173  0.903257      10     173   1   0
       10_332  0.847333      10     332   2   0
       10_243  0.842802      10     243   3   0
       10_522  0.836335      10     522   4   0
       58_941  0.810760      58     941   0   0
...                 ...     ...     ...  ..  ..
Are you looking for something like this:
Sample dataframe:
df = pd.DataFrame(
    {"SUB_A": [1, -1, -2, 3, 3, 4, 3, 6, 6, 6],
     "SUB_B": [1, 2, 3, 3, 3, 3, 4, 6, 6, 6]},
    index=pd.MultiIndex.from_product([range(1, 3), range(1, 6)], names=("A", "B"))
)
SUB_A SUB_B
A B
1 1 1 1
2 -1 2
3 -2 3
4 3 3
5 3 3
2 1 4 3
2 3 4
3 6 6
4 6 6
5 6 6
Now this
equal = df.SUB_A == df.SUB_B
df["R"] = equal.groupby(equal.groupby("A").diff().fillna(True).cumsum()).cumsum()
leads to
SUB_A SUB_B R
A B
1 1 1 1 1
2 -1 2 0
3 -2 3 0
4 3 3 1
5 3 3 2
2 1 4 3 0
2 3 4 0
3 6 6 1
4 6 6 2
5 6 6 3
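The same pattern can be mapped back onto the question's own columns. A sketch only, assuming df is indexed by A and B as in the post and that R1 should count consecutive SUB_A1 == SUB_B1 matches within each A group:
# flag rows where SUB_A1 matches SUB_B1
equal = (df['SUB_A1'] == df['SUB_B1']).astype(int)
a = df.index.get_level_values('A')
# a new run starts wherever the match flag changes within an A group
run_id = equal.ne(equal.groupby(a).shift()).astype(int).groupby(a).cumsum()
# cumulative count inside each run; non-matching rows stay at 0
df['R1'] = equal.groupby([a, run_id]).cumsum()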
Try using pandas.DataFrame.iterrows and pandas.DataFrame.shift.
You can iterate through the dataframe and compare the current row with the previous one, then add some condition:
df['SUB_A2_last'] = df['SUB_A2'].shift()
count = 0
# fill the new column with zeros
df['count_series'] = 0
for index, row in df.iterrows():
    subA = row['SUB_A2']
    subA_last = row['SUB_A2_last']
    if subA == subA_last:
        count += 1
    else:
        count = 0
    df.loc[index, 'count_series'] = count
Then repeat for the B column. A better approach may be possible using pandas.DataFrame.apply and a custom function.
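For reference, a fully vectorized sketch of the same running count without iterrows (using shift plus groupby/cumsum rather than apply; count_series and SUB_A2 are the names used in this answer):
# count consecutive repeats of SUB_A2, resetting to 0 whenever the value changes
same_as_prev = df['SUB_A2'].eq(df['SUB_A2'].shift())
run_id = (~same_as_prev).cumsum()                          # label each run of repeated values
df['count_series'] = same_as_prev.astype(int).groupby(run_id).cumsum()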
Puh! Super! Thanks for the input you guys
def func(group):
    for each in range(len(group)):
        if group.SUB_A1.values[0] == group.SUB_B1.values[each]:
            group.R1.values[each] = each + 1
            continue
        elif group.SUB_A1.values[0] == group.SUB_B1.values[each] and group.R1.values[each] == each + 1:
            group.R1.values[each] = each + 1
        else:
            group.R1.values[each] = 0
    return group

df['R1'] = 0
df.groupby('A').apply(func)

Determine if Values are within range based on pandas DataFrame column

I am trying to determine whether or not a given value in a row of a DataFrame is within two other columns from a separate DataFrame, or if that estimate is zero.
import pandas as pd
df = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7, 8], [-2, 10, 11, 13], [5, 6, 8, 9]],
                  columns=['lo1', 'up1', 'lo2', 'up2'])
lo1 up1 lo2 up2
0 -1 2 1 3
1 4 6 7 8
2 -2 10 11 13
3 5 6 8 9
df2 = pd.DataFrame([[1, 3], [4, 6], [5, 8], [10, 2]],
                   columns=['pe1', 'pe2'])
pe1 pe2
0 1 3
1 4 6
2 5 8
3 10 2
To be more clear, is it possible to develop a for-loop or use a function that can look at pe1 and its corresponding values and determine if they are within lo1 and up1, if lo1 and up1 cross zero, and if pe1=0? I am having a hard time coding this in Python.
EDIT: I'd like the output to be something like:
m1 m2
0 0 3
1 4 0
2 0 0
3 0 0
The only pe values that fall within their corresponding lo and up columns are in the first row, second column, and the second row, first column.
You can concatenate the two dataframes along the horizontal axis and then use np.where. This behaves similarly to the where used by RJ Adriaansen.
import pandas as pd
import numpy as np
# Data
df1 = pd.DataFrame([[-1, 2, 1, 3], [4, 6, 7,8], [-2, 10, 11, 13], [5, 6, 8, 9]],
columns=['lo1', 'up1','lo2', 'up2'])
df2 = pd.DataFrame([[1, 3], [4, 6] , [5, 8], [10, 2,]],
columns=['pe1', 'pe2'])
# concatenate dfs
df = pd.concat([df1, df2], axis=1)
where now df looks like
lo1 up1 lo2 up2 pe1 pe2
0 -1 2 1 3 1 3
1 4 6 7 8 4 6
2 -2 10 11 13 5 8
3 5 6 8 9 10 2
Finally we use np.where and between
for k in [1, 2]:
    df[f"m{k}"] = np.where(
        df[f"pe{k}"].between(df[f"lo{k}"], df[f"up{k}"]) & df[f"lo{k}"].gt(0),
        df[f"pe{k}"],
        0)
and the result is
lo1 up1 lo2 up2 pe1 pe2 m1 m2
0 -1 2 1 3 1 3 0 3
1 4 6 7 8 4 6 4 0
2 -2 10 11 13 5 8 0 0
3 5 6 8 9 10 2 0 0
You can create a boolean mask for the required condition. For pe1 that would be:
value in lo1 is smaller or equal to pe1
value in up1 is larger or equal to pe1
value in lo1 is larger than 0
This would make this mask:
(df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0)
which returns:
0 False
1 True
2 False
3 False
dtype: bool
Now you can use where to keep the values where the mask is True and replace those that aren't with 0:
df2['pe1'] = df2['pe1'].where((df['lo1'] <= df2['pe1']) & (df['up1'] >= df2['pe1']) & (df['lo1'] > 0), other=0)
df2['pe2'] = df2['pe2'].where((df['lo2'] <= df2['pe2']) & (df['up2'] >= df2['pe2']) & (df['lo2'] > 0), other=0)
Result:
   pe1  pe2
0    0    3
1    4    0
2    0    0
3    0    0
To loop all columns:
for i in df2.columns:
    nr = i[2:]  # remove the first two characters to get the number, then use it to match the columns in the other df
    df2[i] = df2[i].where((df[f'lo{nr}'] <= df2[i]) & (df[f'up{nr}'] >= df2[i]) & (df[f'lo{nr}'] > 0), other=0)

convert values in a series to either one of two values

I have a series y which has values between -3 and 3.
I want to convert numbers that are above 0 to 1 and numbers that are less than or equal to zero to 0.
What is the best way to do this?
I wrote the code below, but it doesn't give me the expected output. The first line works; however, after running the second line, the values that were 1 change to something random, which I don't understand.
import numpy as np
y_final = np.where(y > 0, 1, y).tolist()
y_final = np.where(y <= 0, 0, y).tolist()
I think you need Series.clip if values are integers:
y = pd.Series(range(-3, 4))
print (y)
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
print (y.clip(lower=0, upper=1))
0 0
1 0
2 0
3 0
4 1
5 1
6 1
dtype: int64
In your solution it is possible to simplify by setting both 1 and 0 in a single np.where call (the second call starts again from the original y, which is why the 1s written by the first call are lost):
y_final = np.where(y > 0, 1, 0)
print (y_final)
[0 0 0 0 1 1 1]
Or convert the greater-than-0 boolean mask to integers:
y_final = y.gt(0).astype(int)
#alternative
#y_final = (y > 0).astype(int)
print (y_final)
0 0
1 0
2 0
3 0
4 1
5 1
6 1
dtype: int32
You can also use a simple map:
numbers = range(-3,4)
print(list(map(lambda n: 1 if n > 0 else 0, numbers)))

Conditional length of a binary data series in Pandas

Having a DataFrame with the following column:
df['A'] = [1,1,1,0,1,1,1,1,0,1]
What would be the best vectorized way to control the length of "1"-series by some limiting value? Let's say the limit is 2, then the resulting column 'B' must look like:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
One fully-vectorized solution is to use the shift-groupby-cumsum-cumcount combination [1] to indicate where consecutive runs are shorter than 2 (or whatever limiting value you like). Then, & this new boolean Series with the original column:
df['B'] = ((df.groupby((df.A != df.A.shift()).cumsum()).cumcount() <= 1) & df.A)\
.astype(int) # cast the boolean Series back to integers
This produces the new column in the DataFrame:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
[1] See the pandas cookbook; the section on grouping, "Grouping like Python's itertools.groupby".
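A small variation of the same idea (a sketch, not part of the original answer) pulls the limit out into a variable so the hard-coded <= 1 does not need editing for other run lengths:
limit = 2
runs = (df.A != df.A.shift()).cumsum()                     # label each run of identical A values
df['B'] = ((df.groupby(runs).cumcount() < limit) & df.A.astype(bool)).astype(int)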
Another way (checking if previous two are 1):
In [443]: df = pd.DataFrame({'A': [1,1,1,0,1,1,1,1,0,1]})
In [444]: limit = 2
In [445]: df['B'] = list(map(lambda x: df['A'][x] if x < limit else int(not all(y == 1 for y in df['A'][x - limit:x])), range(len(df))))
In [446]: df
Out[446]:
A B
0 1 1
1 1 1
2 1 0
3 0 0
4 1 1
5 1 1
6 1 0
7 1 0
8 0 0
9 1 1
If you know that the values in the series will all be either 0 or 1, I think you can use a little trick involving convolution. Make a copy of your column (which need not be a Pandas object, it can just be a normal Numpy array)
a = df['A'].to_numpy()
and convolve it with a sequence of 1's that is one longer than the cutoff you want, then chop off the last cutoff elements. E.g. for a cutoff of 2, you would do
long_run_count = numpy.convolve(a, [1, 1, 1])[:-2]
The resulting array, in this case, gives the number of 1's that occur in the 3 elements prior to and including that element. If that number is 3, then you are in a run that has exceeded length 2. So just set those elements to zero.
a[long_run_count > 2] = 0
You can now assign the resulting array to a new column in your DataFrame.
df['B'] = a
To turn this into a more general method:
import numpy

def trim_runs(array, cutoff):
    a = numpy.asarray(array)
    a[numpy.convolve(a, numpy.ones(cutoff + 1))[:-cutoff] > cutoff] = 0
    return a
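For instance, on the question's data the helper might be used like this (a sketch; passing a plain list keeps numpy.asarray from mutating the original column in place):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]})
df['B'] = trim_runs(df['A'].tolist(), 2)
print(df)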

Change dataframe pandas based on one series

I have data and have converted it using a pandas DataFrame:
import pandas as pd

d = [
    (1, 70399, 0.988375133622),
    (1, 33919, 0.981573492596),
    (1, 62461, 0.981426807114),
    (579, 1, 0.983018778374),
    (745, 1, 0.995580488899),
    (834, 1, 0.980942505189)
]
df_new = pd.DataFrame(e, columns=['source_target']).sort_values(['source_target'], ascending=[True])
and I need to build a series for mapping the source and target columns onto new values:
e = []
for x in d:
    e.append(x[0])
    e.append(x[1])
e = list(set(e))

df_new = pd.DataFrame(e, columns=['source_target'])
df_new.source_target = (df_new.source_target.diff() != 0).cumsum() - 1
new_ser = pd.Series(df_new.source_target.values, index=new_source_old).drop_duplicates()
so I get this series:
source_target
1 0
579 1
745 2
834 3
33919 4
62461 5
70399 6
dtype: int64
I have tried to change the dataframe df_beda based on the new_ser series using:
df_beda.target = df_beda.target.mask(df_beda.target.isin(new_ser), df_beda.target.map(new_ser)).astype(int)
df_beda.source = df_beda.source.mask(df_beda.source.isin(new_ser), df_beda.source.map(new_ser)).astype(int)
but the result is:
source target weight
0 0 70399 0.988375
1 0 33919 0.981573
2 0 62461 0.981427
3 579 0 0.983019
4 745 0 0.995580
5 834 0 0.980943
It's wrong; the ideal result is:
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
Maybe someone can show me where my mistake is.
Thanks
If the order doesn't matter, you can do the following. Avoid a for loop unless it's absolutely necessary.
import numpy as np

uniq_vals = np.unique(df_beda[['source', 'target']])
map_dict = dict(zip(uniq_vals, range(len(uniq_vals))))
df_beda[['source', 'target']] = df_beda[['source', 'target']].replace(map_dict)
print(df_beda)
source target weight
0 0 6 0.988375
1 0 4 0.981573
2 0 5 0.981427
3 1 0 0.983019
4 2 0 0.995580
5 3 0 0.980943
If you want to roll back, you can create an inverse map from the original one, because it is guaranteed to be a 1-to-1 mapping.
inverse_map = {v: k for k, v in map_dict.items()}
df_beda[['source', 'target']] = df_beda[['source', 'target']].replace(inverse_map)
print(df_beda)
source target weight
0 1 70399 0.988375
1 1 33919 0.981573
2 1 62461 0.981427
3 579 1 0.983019
4 745 1 0.995580
5 834 1 0.980943
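A shorter variant of the same relabeling (a sketch, not from the answer above, assuming df_beda holds the source/target/weight columns shown) uses pd.factorize on the stacked source/target values; the uniques it returns double as the reverse map:
import pandas as pd

stacked = df_beda[['source', 'target']].stack()            # one long Series of node labels
codes, uniques = pd.factorize(stacked, sort=True)          # codes follow the sorted unique order
df_beda[['source', 'target']] = pd.Series(codes, index=stacked.index).unstack()
# uniques[i] recovers the original label for code i, so the mapping can be reversed later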
