Pandas oversampling ragged sequential data - python

I'm trying to use pandas to oversample my ragged data (sequences with different lengths).
Given the following data samples:
import pandas as pd
x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,0,1,0,0,0]})
Data (groups are separated with --- for convenience):
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
Targets:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
I would like to oversample the minority class to balance the data. In the sample above, target 1 is the minority class, with 2 samples (ids 1 and 3).
I'm looking for a way to oversample the data so the results would be:
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
-----------------
13 7 11
14 7 12 Replica of id 1
15 7 13
-----------------
16 8 33
17 8 34 Replica of id 3
18 8 35
19 8 36
And the targets would be balanced:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
6 7 1
7 8 1
With exactly 4 positive and 4 negative samples.

You can use:
import numpy as np
import pandas as pd

x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],
                  'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
#more general sample
y = pd.DataFrame({'id':[1,2,3,4,5,6,7],'target':[1,0,1,0,0,0,0]})

#count how many values of 1 or 0 must be repeated to balance target
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

#create helper df with fresh ids and add it to y
y1 = pd.DataFrame({'id':range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                   'target':new})
y2 = pd.concat([y, y1], ignore_index=True)
print (y2)

#filter rows of the minority class (first value of new)
add = y[y['target'].eq(new[0])]
#repeat the minority ids by np.tile (np.repeat would also work),
#attach the new helper ids from y1 and merge the features from x
add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
          .head(len(new))
          .assign(new = y1['id'].tolist())
          .merge(x, on='id', how='left')
          .drop('id', axis=1)
          .rename(columns={'new':'id'}))
#append the replicated groups to x
x2 = pd.concat([x, add], ignore_index=True)
print (x2)
The solution above only works for imbalanced data; if the data may sometimes already be balanced, add a check:
#balanced sample
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,1,1,0,0,0]})

#count how many values of 1 or 0 must be repeated to balance target
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

if len(new) > 0:
    #create helper df with fresh ids and add it to y
    y1 = pd.DataFrame({'id':range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target':new})
    y2 = pd.concat([y, y1], ignore_index=True)
    print (y2)

    #filter rows of the minority class (first value of new)
    add = y[y['target'].eq(new[0])]
    #repeat the minority ids by np.tile, attach the new ids and merge x
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new = y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new':'id'}))
    #append the replicated groups to x
    x2 = pd.concat([x, add], ignore_index=True)
    print (x2)
else:
    print ('y is already balanced')
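A minimal alternative sketch (assuming x and y exactly as defined in the question): replicate the minority-class groups wholesale with a plain loop and pd.concat, cycling over the minority ids if more replicas are needed than there are minority groups:
import itertools
import pandas as pd

minority_ids = y.loc[y['target'] == 1, 'id'].tolist()        # ids 1 and 3
n_extra = int((y['target'] == 0).sum()) - len(minority_ids)  # replicas needed for balance
next_id = y['id'].max() + 1

new_x, new_y = [], []
for i, old_id in enumerate(itertools.islice(itertools.cycle(minority_ids), n_extra)):
    grp = x[x['id'] == old_id].copy()
    grp['id'] = next_id + i                                   # each replica gets a fresh id
    new_x.append(grp)
    new_y.append({'id': next_id + i, 'target': 1})

x2 = pd.concat([x, *new_x], ignore_index=True)
y2 = pd.concat([y, pd.DataFrame(new_y)], ignore_index=True)
This is slower than the vectorized approach above on large data, but it makes the "copy a whole group under a new id" step explicit.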

Related

Functional approach to group DataFrame columns into MultiIndex

Is there a simpler functional way to group columns into a MultiIndex?
# Setup
l = [...]
l2,l3,l4 = do_things(l, [2,3,4])
d = {2:l2, 3:l3, 4:l4}
# Or,
l = l2 = l3 = l4 = list(range(20))
Problems with my approaches:
# Cons:
# * Complicated
# * Requires multiple iterations over the dictionary to occur
# in the same order. This is guaranteed as the dictionary is
# unchanged but I'm not happy with the implicit dependency.
df = pd.DataFrame\
( zip(*d.values())
, index=l
, columns=pd.MultiIndex.from_product([["group"], d.keys()])
).rename_axis("x").reset_index().reset_index()
# Cons:
# * Complicated
# * Multiple assignments
df = pd.DataFrame(d, index=l).rename_axis("x")
df.columns = pd.MultiIndex.from_product([["group"],df.columns])
df = df.reset_index().reset_index()
I'm looking for something like:
df =\
( pd.DataFrame(d, index=l)
. rename_axis("x")
. group_columns("group")
. reset_index().reset_index()
)
Result:
index x group
2 3 4
0 0 2 0 0 0
1 1 2 0 0 0
2 2 2 0 0 0
3 3 2 0 0 0
4 4 1 0 0 0
5 5 2 0 0 0
6 6 1 0 0 0
7 7 2 0 0 0
8 8 4 0 1 1
9 9 4 0 1 1
10 10 4 0 1 1
11 11 0 0 1 1
12 12 1 0 1 1
13 13 1 0 1 1
14 14 3 1 2 2
15 15 1 1 2 2
16 16 1 1 2 3
17 17 1 1 2 3
18 18 4 1 2 3
19 19 3 1 2 3
20 20 4 1 2 3
21 21 4 1 2 3
22 22 4 1 2 3
23 23 4 1 2 3
It is probably easiest just to reformat the dictionary and pass it to the DataFrame constructor:
# Sample Data
import numpy as np
import pandas as pd

size = 5
lst = np.arange(size) + 10
d = {2: lst, 3: lst + size, 4: lst + (size * 2)}

df = pd.DataFrame(
    # Add group level by changing keys to tuples
    {('group', k): v for k, v in d.items()},
    index=lst
)
Output:
group
2 3 4
10 10 15 20
11 11 16 21
12 12 17 22
13 13 18 23
14 14 19 24
Notice that the tuples get interpreted as a MultiIndex automatically.
This can be followed with whatever chain of operations desired:
df = pd.DataFrame(
    {('group', k): v for k, v in d.items()},
    index=lst
).rename_axis('x').reset_index().reset_index()
df:
index x group
2 3 4
0 0 10 10 15 20
1 1 11 11 16 21
2 2 12 12 17 22
3 3 13 13 18 23
4 4 14 14 19 24
It is also possible to combine steps and generate the complete DataFrame directly:
df = pd.DataFrame({
    ('index', ''): pd.RangeIndex(len(lst)),
    ('x', ''): lst,
    **{('group', k): v for k, v in d.items()}
})
df:
index x group
2 3 4
0 0 10 10 15 20
1 1 11 11 16 21
2 2 12 12 17 22
3 3 13 13 18 23
4 4 14 14 19 24
Naturally any combination of dictionary comprehension and pandas operations can be used.
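If the goal is to keep everything in one chain, close to the group_columns step the question sketches, a small additional sketch (using the same d and lst as above) is to let pd.concat add the top column level from a dict key:
df = (
    pd.concat({'group': pd.DataFrame(d, index=lst)}, axis=1)
      .rename_axis('x')
      .reset_index()
      .reset_index()
)
This yields the same index / x / group layout, with the MultiIndex level coming from the key passed to pd.concat.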

Create multiple columns from one column (with the same data)

I have this column (similar but with a lot more entries)
import pandas as pd
numbers = range(1,16)
sequence = []
for number in numbers:
    sequence.append(number)
df = pd.DataFrame(sequence).rename(columns={0: 'sequence'})
and I want to distribute the same values into many more columns periodically (and automatically), to get something like this (but with many more values)
Thanks
Use reshape with -1 so that one dimension is inferred automatically, then transpose to get 5 rows with the values running down each column:
import numpy as np

numbers = range(1,16)
df = pd.DataFrame(np.array(numbers).reshape(-1, 5).T)
print (df)
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If the number of values does not divide evenly into N rows, here is a possible solution (the array is padded with the fill value, 0 here):
L = range(1,22)
N = 5
filled = 0
arr = np.full(((len(L) - 1)//N + 1)*N, filled)
arr[:len(L)] = L
df = pd.DataFrame(arr.reshape((-1, N)).T)
print(df)
0 1 2 3 4
0 1 6 11 16 21
1 2 7 12 17 0
2 3 8 13 18 0
3 4 9 14 19 0
4 5 10 15 20 0
Use pandas.Series.values.reshape with the desired number of rows and columns:
pd.DataFrame(df.sequence.values.reshape(5, -1))
If you'd like to reshape after reading the dataframe, then:
df = pd.DataFrame(df.to_numpy().reshape(5,-1))
num_cols = 3
result = pd.DataFrame(df.sequence.to_numpy().reshape(-1, num_cols, order="F"))
For a given number of columns (e.g. 3 here), this reshapes df.sequence to (total_number_of_values / num_cols, num_cols), where the first dimension is inferred with -1. The Fortran order ("F") fills column by column, so the numbers "go down first", to get:
>>> result
0 1 2
0 1 6 11
1 2 7 12
2 3 8 13
3 4 9 14
4 5 10 15
If num_cols = 5, then
>>> result
0 1 2 3 4
0 1 4 7 10 13
1 2 5 8 11 14
2 3 6 9 12 15

How can one duplicate columns N times in DataFrame?

I have a dataframe with one column and I would like to get a Dataframe with N columns all of which will be identical to the first one. I can simply duplicate it by:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns, so that doesn't work.
One important thing: the figures in the columns should change, for instance if the first column is 0, the nth column is n and the previous one is n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second line, without touching all the data in the dataframe, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
Say that you have a df with a column named 'company_name' that consists of 8 companies:
df = pd.DataFrame({"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0,1000):
    df['company_name'+str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. The names of the duplicated columns will be the same as the original one, plus an integer suffix at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2','company_name..N'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient way is to index with DataFrame.loc instead of using an outer loop:
n = 3
new_df = df.loc[:, ['column_duplicate']*n +
                df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*len(suffix_tup) +
                 not_dup_cols]
            .set_axis(list(map(lambda suffix: f'column_duplicate_{suffix}',
                               suffix_tup)) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate']*n +
                 not_dup_cols]
            .set_axis(list(map(lambda x: f'column_duplicate_{x}', range(n))) +
                      not_dup_cols, axis=1)
          )
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
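For the follow-up point in the question that the figures should change from copy to copy (the nth column holding value + n), a minimal hedged sketch is to broadcast the column against np.arange; the column name 'Column' and n below are only illustrative:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column': [0, 1, 2]})
n = 5
# each row value v becomes v, v+1, ..., v+n-1 across the copies
shifted = df['Column'].to_numpy()[:, None] + np.arange(n)
out = pd.DataFrame(shifted, columns=[f'Column_{i}' for i in range(n)], index=df.index)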

Python Dataframe GroupBy Function

I am having a hard time understanding what the code below does. I initially thought it was counting the unique appearances of the values in (weight, age) and (weight, height); however, when I ran this example, I found out it was doing something else.
import pandas as pd

data = [[0,33,15,4],[1,44,12,3],[0,44,12,5],[1,33,15,4],[0,77,13,4],[1,33,15,4],[1,99,40,7],[0,58,45,4],[1,11,13,4]]
df = pd.DataFrame(data,columns=["Lbl","Weight","Age","Height"])
print (df)
def group_fea(df,key,target):
    '''
    Adds columns for feature combinations
    '''
    tmp = df.groupby(key, as_index=False)[target].agg({
        key+target + '_nunique': 'nunique',
    }).reset_index()
    del tmp['index']
    print("****{}****".format(target))
    return tmp
#Add feature combinations
feature_key = ['Weight']
feature_target = ['Age','Height']
for key in feature_key:
    for target in feature_target:
        tmp = group_fea(df,key,target)
        df = df.merge(tmp,on=key,how='left')
print (df)
Lbl Weight Age Height
0 0 33 15 4
1 1 44 12 3
2 0 44 12 5
3 1 33 15 4
4 0 77 13 4
5 1 33 15 4
6 1 99 40 7
7 0 58 45 4
8 1 11 13 4
****Age****
****Height****
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1
I want to understand what the values in WeightAge_nunique and WeightHeight_nunique mean.
The value of WeightAge_nunique on a given row is the number of unique Ages that have the same Weight. The corresponding thing is true of WeightHeight_nunique. E.g., for people of Weight=44, there is only 1 unique age (12), hence WeightAge_nunique=1 on those rows, but there are 2 unique Heights (3 and 5), hence WeightHeight_nunique=2 on those same rows.
You can see that this happens because the grouping function groups by the "key" column (Weight), then performs the "nunique" aggregation function on the "target" column (either Age or Height).
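A quick way to check this (a small sketch using the df from the question, not part of the original code) is to run the underlying aggregation directly:
# number of unique Ages / Heights within each Weight group
print(df.groupby('Weight')['Age'].nunique())
print(df.groupby('Weight')['Height'].nunique())
# e.g. Weight 44 has 1 unique Age (12) but 2 unique Heights (3 and 5)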
Let us try transform
g = df.groupby('Weight').transform('nunique')
df['WeightAge_nunique'] = g['Age']
df['WeightHeight_nunique'] = g['Height']
df
Out[196]:
Lbl Weight Age Height WeightAge_nunique WeightHeight_nunique
0 0 33 15 4 1 1
1 1 44 12 3 1 2
2 0 44 12 5 1 2
3 1 33 15 4 1 1
4 0 77 13 4 1 1
5 1 33 15 4 1 1
6 1 99 40 7 1 1
7 0 58 45 4 1 1
8 1 11 13 4 1 1

How to check if the value is between two consecutive rows in dataframe or numpy array?

I need to write a code that checks if a certain value is between 2 consecutive rows, for example:
row <50 < next row
meaning if the value is between row and its consecutive row.
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
The output is:
A
0 67
1 78
2 53
3 44
4 84
5 2
6 63
7 13
8 56
9 24
What I'd like to do is to check if (let's say I have a set value) "50" is between all consecutive rows.
Say, we check if 50 is between 67 and 78 and then between 78 and 53, obviously the answer is no, therefore in column B the result would be 0.
Now, if we check if 50 is between 53 and 44, then we'll get 1 in column B and we'll use cumsum() to count how many times the value of 50 is between consecutive rows in column A.
UPDATE: Let's say, if I have column C where I have 2 categories only: 1 and 2. How would I ensure that the check is performed within each of the categories separately? In other words, the check is reset once the category changes?
The desired output is:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1
Greatly appreciate your help.
Let's just subtract 50 from the series and check for a sign change:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[67,78,53,44,84,2,63,13,56,24]}, columns=list('A'))
s = df['A'] - 50
df['count'] = np.sign(s).diff().fillna(0).ne(0).cumsum()
print(df)
Output:
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
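To make the one-liner easier to follow, here is a small sketch (same df as above) that spells out the intermediate steps:
s = df['A'] - 50
sign = np.sign(s)                      # +1 above 50, -1 below 50
crossed = sign.diff().fillna(0).ne(0)  # True where the sign flips vs the previous row
df['count'] = crossed.cumsum()         # running count of the crossings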
This should work; the mask is True exactly when the current value and the previous value lie on opposite sides of 50:
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 67 0
1 78 0
2 53 0
3 44 1
4 84 2
5 2 3
6 63 4
7 13 5
8 56 6
9 24 7
or
df = pd.DataFrame(np.random.randint(0,100,size=(10, 1)), columns=list('A'))
what = ((df.A < 50) | (50 > df.A.shift())) & ((df.A > 50) | (50 < df.A.shift()))
df['count'] = what.astype(int).cumsum()
A count
0 45 0
1 53 1
2 44 2
3 87 3
4 47 4
5 13 4
6 20 4
7 89 5
8 81 5
9 53 5
Would your second output look like this:
df
A B C
0 67 0 1
1 78 0 1
2 53 0 1
3 44 1 2
4 84 2 1
5 2 3 2
6 63 4 1
7 13 5 2
8 56 6 1
9 24 7 1
df_new = df
what = ((df_new.A < 50) | (50 > df_new.A.shift())) & ((df_new.A > 50) | (50 < df_new.A.shift())) & ((df_new.C == df_new.C.shift() ))
df['count'] = what.astype(int).cumsum()
df
Output:
A B C count
0 67 0 1 0
1 78 0 1 0
2 53 0 1 0
3 44 1 2 0
4 84 2 1 0
5 2 3 2 0
6 63 4 1 0
7 13 5 2 0
8 56 6 1 0
9 24 7 1 1
