I'm trying to even up a dataset for machine learning. There are great answers for how to sample a dataframe with two values in a column (a binary choice).
In my case I have many values in column x. I want an equal number of records in the dataframe where
x is 0 or not 0,
or, in a more complicated example, where the value in x is 0, 5 or any other value.
Examples
x
0 5
1 5
2 5
3 0
4 0
5 9
6 18
7 3
8 5
**For the first**
I have 2 rows where x = 0 and 7 where x != 0. The result should balance this up and be 4 rows: the two with x = 0 and 2 where x != 0 (randomly selected). Preserving the same index for the sake of illustration:
1 5
3 0
4 0
6 18
**For the second**
I have 2 rows where x = 0, 4 rows where x = 5 and 3 rows where x != 0 and x != 5. The result should balance this up and be 6 rows in total: two for each condition. Preserving the same index for the sake of illustration:
1 5
3 0
4 0
5 9
6 18
8 5
I've done examples with 2 conditions & 3 conditions. A solution that generalises to more would be good. It is better if it detects the minimum number of rows (for 0 in this example) so I don't need to work this out first before writing the condition.
How do I do this with pandas? Can I pass a custom function to .groupby() to do this?
IIUC, you could groupby on the condition whether "x" is 0 or not and sample the smallest-group-size number of entries from each group:
g = df.groupby(df['x']==0)['x']
out = g.sample(n=g.count().min()).sort_index()
(An example) output:
1 5
3 0
4 0
5 9
Name: x, dtype: int64
For the second case, we could use numpy.select and numpy.unique to get the groups (the rest is essentially the same as above):
import numpy as np
groups = np.select([df['x']==0, df['x']==5], [1,2], 3)
g = df.groupby(groups)['x']
out = g.sample(n=np.unique(groups, return_counts=True)[1].min()).sort_index()
An example output:
2 5
3 0
4 0
5 9
7 3
8 5
Name: x, dtype: int64
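To generalise this to any number of conditions without working out the smallest group by hand, the numpy.select approach above could be wrapped in a small helper. This is only a minimal sketch: the function name balance_by_conditions and its random_state parameter are made up for illustration, and it assumes pandas 1.1 or later, where groupby(...).sample is available. Rows that match none of the conditions end up in a final "other" group.

```python
import numpy as np
import pandas as pd


def balance_by_conditions(df, conditions, random_state=None):
    """Sample the same number of rows from each condition group.

    Hypothetical helper: `conditions` is a list of boolean Series aligned
    with `df`; rows matching none of them form one extra "other" group.
    """
    # Label each row with the position of the first condition it matches;
    # unmatched rows get the label len(conditions).
    labels = np.select(conditions, list(range(len(conditions))),
                       default=len(conditions))
    labels = pd.Series(labels, index=df.index)

    # The smallest group size is detected automatically.
    n = int(labels.value_counts().min())

    return (df.groupby(labels)
              .sample(n=n, random_state=random_state)
              .sort_index())


df = pd.DataFrame({'x': [5, 5, 5, 0, 0, 9, 18, 3, 5]})
balanced = balance_by_conditions(df, [df['x'] == 0, df['x'] == 5],
                                 random_state=0)
print(balanced)
```

Passing a single condition, [df['x'] == 0], gives the two-group case from the question; adding more conditions to the list adds more groups.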
IIUC, you want any two non-zero records:
mask = df['x'].eq(0)
pd.concat([df[mask], df[~mask].sample(mask.sum())]).sort_index()
Output:
x
1 5
2 5
3 0
4 0
Part II:
mask0 = df['x'].eq(0)
mask5 = df['x'].eq(5)
pd.concat([df[mask0],
df[mask5].sample(mask0.sum()),
df[~(mask0 | mask5)].sample(mask0.sum())]).sort_index()
Output:
x
2 5
3 0
4 0
6 18
7 3
8 5
I have a dataframe:
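# s and index were not shown in the original; s is reconstructed from the table below, default index used
s = pd.Series([1,2,2,3,3,6])
t = pd.Series([2,4,6,8,10,12])
df1 = pd.DataFrame(s,columns = ["MUL1"])
df1["MUL2"] = t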
MUL1 MUL2
0 1 2
1 2 4
2 2 6
3 3 8
4 3 10
5 6 12
and another dataframe:
# default index used; the original index variable was not shown
u = pd.Series([1,2,3,6])
v = pd.Series([2,8,10,12])
df2 = pd.DataFrame(u,columns = ["MUL3"])
df2["MUL4"] = v
Now I want a new dataframe which looks like the following:
MUL6 MUL7
0 1 2
1 2 8
2 2 8
3 3 10
4 3 10
5 6 12
By combining the first 2 dataframes.
I have tried the following:
X1 = df1.to_numpy()
X2 = df2.to_numpy()
list = []
for i in range(X1.shape[0]):
    for j in range(X2.shape[0]):
        if X1[i, -1] == X2[j, -1]:
            list.append(X2[X1[i, -1]==X2[j, -1], -1])
I was trying to convert the dataframes to numpy arrays so I can iterate through them to get a new array that I can convert back to a dataframe. But the size of the new dataframe is not equal to the size of the first dataframe. I would appreciate any help. Thanks.
Although the details of the logic are cryptic, I believe that you want a merge:
(df1[['MUL1']].rename(columns={'MUL1': 'MUL6'})
.merge(df2.rename(columns={'MUL3': 'MUL6', 'MUL4': 'MUL7'}),
on='MUL6', how='left')
)
output:
MUL6 MUL7
0 1 2
1 2 8
2 2 8
3 3 10
4 3 10
5 6 12
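If the intended logic really is a lookup of MUL4 by the value in MUL1 (the same guess the merge above makes), a Series.map lookup is another way to express it. A minimal sketch, assuming the values in MUL3 are unique and using the frames as reconstructed from the tables in the question:

```python
import pandas as pd

# Frames rebuilt from the tables shown in the question.
df1 = pd.DataFrame({'MUL1': [1, 2, 2, 3, 3, 6],
                    'MUL2': [2, 4, 6, 8, 10, 12]})
df2 = pd.DataFrame({'MUL3': [1, 2, 3, 6],
                    'MUL4': [2, 8, 10, 12]})

# Lookup table: index holds the MUL3 keys, values hold MUL4.
lookup = df2.set_index('MUL3')['MUL4']

out = pd.DataFrame({
    'MUL6': df1['MUL1'],
    'MUL7': df1['MUL1'].map(lookup),  # one lookup per row of df1
})
print(out)
```

Because map works row by row on df1, the result always has as many rows as df1, which is the property the loop attempt was missing. Keys in MUL1 that do not appear in MUL3 would come out as NaN.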
I have a four-column data frame, given as follows: column zero consists of text labels chosen from a list ['A','B','C','D'], with possible repetitions. Columns one and two are labelled start and stop, where the former is less than the latter, and column three, intensity, is a float. For each label, none of the corresponding intervals [start,stop] overlap.
A simple example is given by:
import numpy as np
import pandas as pd
labels=['A','B','C','D']
d = {'label': ['A','B','A','C','D','B','A'],'start': [1, 2,6,4,1,8,12], 'stop':
[4,4,9,6,7,11,16],'intensity':[8,2,4,6,7,1,5]}
df = pd.DataFrame(data=d)
print(df)
label start stop intensity
0 A 1 4 8
1 B 2 4 2
2 A 6 9 4
3 C 4 6 6
4 D 1 7 7
5 B 8 11 1
6 A 12 16 5
I wish to create a matrix, M, having four (=len(labels)) rows and 16 columns. (The number of columns must be at least the maximum entry in df['stop']. Whether it's larger doesn't matter.) For each integer k between 0 and 6, the index of df['label'][k] in labels specifies a row of my matrix M. The entries in columns d['start'][k] to d['stop'][k] of this row should all equal d['intensity'][k]. All other entries of M equal zero.
For example, label A corresponds to rows 0, 2, and 6 of df. In row 0 of M, entries in columns 1-4 equal 8, entries in columns 6-9 equal 4, and entries in columns 12-16 equal 5.
I'd like to do this in the most Pythonic way, using list operations and at most one loop.
Here's a solution:
MAX = df['stop'].max()
new_df = pd.DataFrame(df.groupby('label').apply(lambda g: sum(g.apply(lambda x: np.isin(np.arange(MAX), np.arange(x['start']-1, x['stop'])).astype(int)*x['intensity'], axis=1))).tolist(), index=labels)
Output:
>>> new_df
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
A 8 8 8 8 0 4 4 4 4 0 0 5 5 5 5 5
B 0 2 2 2 0 0 0 1 1 1 1 0 0 0 0 0
C 0 0 0 6 6 6 0 0 0 0 0 0 0 0 0 0
D 7 7 7 7 7 7 7 0 0 0 0 0 0 0 0 0
Another way, using explode:
df['range'] = df.apply(lambda r: list(range(r['start'], r['stop']+1)), axis=1)
df.explode('range').set_index(['label', 'range'])[['intensity']].unstack()
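Since the question allows one loop, a plain NumPy version may be the easiest to read. The sketch below is one possible reading of the requirement, not a definitive implementation: it assumes columns are numbered from 1 in the question and therefore maps column c to array position c - 1, which matches the 16-column output of the first answer. The names row_of and M are chosen for illustration.

```python
import numpy as np
import pandas as pd

labels = ['A', 'B', 'C', 'D']
df = pd.DataFrame({
    'label': ['A', 'B', 'A', 'C', 'D', 'B', 'A'],
    'start': [1, 2, 6, 4, 1, 8, 12],
    'stop': [4, 4, 9, 6, 7, 11, 16],
    'intensity': [8, 2, 4, 6, 7, 1, 5],
})

# Map each label to its row number in M.
row_of = {label: i for i, label in enumerate(labels)}

# One row per label, one column per position up to the largest stop.
M = np.zeros((len(labels), df['stop'].max()), dtype=float)

# Single loop: fill the inclusive interval [start, stop] of each record.
for label, start, stop, intensity in df.itertuples(index=False):
    M[row_of[label], start - 1:stop] = intensity

print(M)
```

The intervals for a given label do not overlap, so plain assignment is enough and no summing is needed.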
I have a list containing numbers: x = (1,2,3,4,5,6,7,8)
I also have a DataFrame with 1000+ rows.
What I need is to assign the numbers in the list to a new column, so that rows 1-8 contain the numbers 1-8 and then the sequence starts again: row 9 should contain the number 1, and so on.
It seems really easy, but somehow I cannot manage to do this.
Here are two possible ways (example here with 3 items to repeat):
with numpy.tile
import numpy as np
import pandas as pd
df = pd.DataFrame({'col': range(10)})
x = (1,2,3)
df['newcol'] = np.tile(x, len(df)//len(x)+1)[:len(df)]
with itertools
from itertools import cycle, islice
df = pd.DataFrame({'col': range(10)})
x = (1,2,3)
df['newcol'] = list(islice(cycle(x), len(df)))
input:
col
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
output:
col newcol
0 0 1
1 1 2
2 2 3
3 3 1
4 4 2
5 5 3
6 6 1
7 7 2
8 8 3
9 9 1
from math import ceil
df['new_column'] = (x*(ceil(len(df)/len(x))))[:len(df)]
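The same repeating pattern can also be written with modular arithmetic, which avoids building a longer sequence and then trimming it. A minimal sketch with the 3-item example from above; the variable name positions is only illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': range(10)})
x = (1, 2, 3)

# Row i takes element i % len(x), so the sequence wraps around on its own.
positions = np.arange(len(df)) % len(x)
df['newcol'] = np.asarray(x)[positions]
print(df)
```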
How can I divide a column into 5 groups based on the column's sorted values,
and add a column with the group number?
For example:
import pandas as pd
df = pd.DataFrame({'x1':[1,2,3,4,5,6,7,8,9,10]})
and I want to add a column like this:
You probably want to look at pd.cut, and set the argument bins to an integer of however many groups you want, and the labels argument to False (to return integer indicators of your groups instead of ranges):
df['add_col'] = pd.cut(df['x1'], bins=5, labels=False) + 1
>>> df
x1 add_col
0 1 1
1 2 1
2 3 2
3 4 2
4 5 3
5 6 3
6 7 4
7 8 4
8 9 5
9 10 5
Note that the + 1 is only there so that your groups are numbered 1 to 5, as in your desired output. If you omit the + 1, they will be numbered 0 to 4.
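pd.cut splits the value range into equal-width intervals. If "5 groups by sorted value" is instead meant as five groups with the same number of rows, which is an assumption about the question, pd.qcut may be the better fit. For evenly spaced data the two agree; the skewed sample below (made up for illustration) shows where they differ:

```python
import pandas as pd

# Illustrative data with one large outlier.
df = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})

# Equal-width bins: the outlier stretches the range, so most rows share bin 1.
df['by_width'] = pd.cut(df['x1'], bins=5, labels=False) + 1

# Quantile bins: each group gets the same number of rows.
df['by_count'] = pd.qcut(df['x1'], q=5, labels=False) + 1

print(df)
```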
I want to update values in one pandas data frame based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the “key” for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
As you can see, it replaced y and z in df_a with the respective columns in df_b based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I’d like to do the replacement (in the real problem, I have to update a dataset with a new dataset, taking the values from a fourth column wherever two or three columns match between the two)?
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
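One point that may help here: DataFrame.update aligns rows on the index labels, not on any particular column. In the example both frames have the default 0-4 index, which happens to equal x, so it only looks as if x were the key. A possible way to get the merge-replace behaviour is therefore to move the key columns into the index before calling update. The helper below is a minimal sketch with a made-up name (update_on) and made-up sample values; it assumes each key combination occurs at most once in each frame.

```python
import pandas as pd


def update_on(left, right, keys, cols):
    """Return a copy of `left` with `cols` overwritten from `right`
    wherever the `keys` columns match (hypothetical helper)."""
    left_indexed = left.set_index(keys)           # new frame, `left` is untouched
    right_indexed = right.set_index(keys)[cols]   # only the columns to replace
    left_indexed.update(right_indexed)            # aligns on the key index, in place
    # Restore the key columns and the original column order.
    return left_indexed.reset_index()[list(left.columns)]


df_a = pd.DataFrame({'x': range(5),
                     'y': range(4, -1, -1),
                     'z': [0.1, 0.2, 0.3, 0.4, 0.5]})
df_b = pd.DataFrame({'y': [0, 1, 2],
                     'z': [10.0, 11.0, 12.0]})

# Replace z where y matches; x is left alone.
result = update_on(df_a, df_b, keys=['y'], cols=['z'])
print(result)
```

Passing several names in keys, for example keys=['a', 'b'], matches on the combination of those columns, and rows of left with no match in right keep their old values.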