Splitting a DataFrame in Python

I have a DataFrame df:
Type ID QTY_1 QTY_2 RES_1 RES_2
X 1 10 15 y N
X 2 12 25 N N
X 3 25 16 Y Y
X 4 14 62 N Y
X 5 21 75 Y Y
Y 1 10 15 y N
Y 2 12 25 N N
Y 3 25 16 Y Y
Y 4 14 62 N N
Y 5 21 75 Y Y
I want to split this into two DataFrames, each containing the QTY column for rows whose corresponding RES column is Y.
Below is my expected result
df1=
Type ID QTY_1
X 1 10
X 3 25
X 5 21
Y 1 10
Y 3 25
Y 5 21
df2 =
Type ID QTY_2
X 3 16
X 4 62
X 5 75
Y 3 16
Y 5 75

You can do this:
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.isin(['Y', 'y'])]
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.isin(['Y', 'y'])]
or
df1 = df[['Type', 'ID', 'QTY_1']].loc[df.RES_1.str.lower() == 'y']
df2 = df[['Type', 'ID', 'QTY_2']].loc[df.RES_2.str.lower() == 'y']
Output:
>>> df1
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
>>> df2
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75
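To reproduce, the sample frame can be built like this (a sketch of the data shown in the question):
import pandas as pd

df = pd.DataFrame({
    'Type': ['X'] * 5 + ['Y'] * 5,
    'ID': [1, 2, 3, 4, 5] * 2,
    'QTY_1': [10, 12, 25, 14, 21] * 2,
    'QTY_2': [15, 25, 16, 62, 75] * 2,
    'RES_1': ['y', 'N', 'Y', 'N', 'Y'] * 2,
    'RES_2': ['N', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'N', 'Y'],
})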

Use a dictionary
It's good practice to use a dictionary for a variable number of variables. Although in this case there may be only a couple of categories, you benefit from organized data. For example, you can access RES_1 data via dfs[1].
dfs = {i: df.loc[df['RES_' + str(i)].str.lower() == 'y', ['Type', 'ID', 'QTY_' + str(i)]]
       for i in range(1, 3)}
print(dfs)
{1: Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21,
2: Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75}
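Each split is then just a dictionary lookup, so the approach scales to any number of RES/QTY pairs:
print(dfs[1])   # rows where RES_1 is Y/y, with QTY_1
print(dfs[2])   # rows where RES_2 is Y/y, with QTY_2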

You need:
df1 = df.loc[(df['RES_1']=='Y') | (df['RES_1']=='y')].drop(['QTY_2', 'RES_1', 'RES_2'], axis=1)
df2 = df.loc[(df['RES_2']=='Y') | (df['RES_2']=='y')].drop(['QTY_1', 'RES_1', 'RES_2'], axis=1)
print(df1)
print(df2)
Output:
Type ID QTY_1
0 X 1 10
2 X 3 25
4 X 5 21
5 Y 1 10
7 Y 3 25
9 Y 5 21
Type ID QTY_2
2 X 3 16
3 X 4 62
4 X 5 75
7 Y 3 16
9 Y 5 75

Related

Pandas oversampling ragged sequential data

I'm trying to use pandas to oversample my ragged data (groups with different lengths).
Given the following data samples:
import pandas as pd
x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
y = pd.DataFrame({'id':[1,2,3,4,5,6],'target':[1,0,1,0,0,0]})
Data (groups are separated with --- for convenience):
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
Targets:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
I would like to balance the minority class. In the sample above, target 1 is the minority class with 2 samples, for ids 1 & 3.
I'm looking for a way to oversample the data so the results would be:
id f1
0 1 11
1 1 12
2 1 13
-----------
3 2 22
4 2 22
-----------
5 3 33
6 3 34
7 3 35
8 3 36
-----------
9 4 44
-----------
10 5 55
-----------
11 6 66
12 6 66
-----------------
13 7 11
14 7 12 Replica of id 1
15 7 13
-----------------
16 8 33
17 8 34 Replica of id 3
18 8 35
19 8 36
And the targets would be balanced:
id target
0 1 1
1 2 0
2 3 1
3 4 0
4 5 0
5 6 0
6 7 1
7 8 1
With exactly 4 positive and 4 negative samples.
You can use:
import numpy as np
import pandas as pd

x = pd.DataFrame({'id':[1,1,1,2,2,3,3,3,3,4,5,6,6],
                  'f1':[11,12,13,22,22,33,34,35,36,44,55,66,66]})
# more general sample
y = pd.DataFrame({'id':[1,2,3,4,5,6,7], 'target':[1,0,1,0,0,0,0]})

# how many repeats of each target value are needed to balance the classes
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

# create a helper df with fresh ids and add it to y
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                   'target': new})
y2 = pd.concat([y, y1], ignore_index=True)
print(y2)

# filter the minority-class rows (first value of new)
add = y[y['target'].eq(new[0])]
# repeat their ids with np.tile (np.repeat would also work),
# attach the fresh ids from y1 and merge the features from x
add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
          .head(len(new))
          .assign(new=y1['id'].tolist())
          .merge(x, on='id', how='left')
          .drop('id', axis=1)
          .rename(columns={'new': 'id'}))
# add the replicas to x
x2 = pd.concat([x, add], ignore_index=True)
print(x2)
The solution above only works for unbalanced data; if the data may sometimes already be balanced, add a check:
# balanced sample
y = pd.DataFrame({'id':[1,2,3,4,5,6], 'target':[1,1,1,0,0,0]})

# how many repeats of each target value are needed to balance the classes
s = y['target'].value_counts()
s1 = s.rsub(s.max())
new = s1.index.repeat(s1).tolist()

if len(new) > 0:
    # create a helper df with fresh ids and add it to y
    y1 = pd.DataFrame({'id': range(y['id'].max() + 1, y['id'].max() + len(new) + 1),
                       'target': new})
    y2 = pd.concat([y, y1], ignore_index=True)
    print(y2)

    # filter the minority-class rows (first value of new)
    add = y[y['target'].eq(new[0])]
    # repeat ids with np.tile, attach the fresh ids and merge the features from x
    add = (add.loc[np.tile(add.index, (len(new) // len(add)) + 1), ['id']]
              .head(len(new))
              .assign(new=y1['id'].tolist())
              .merge(x, on='id', how='left')
              .drop('id', axis=1)
              .rename(columns={'new': 'id'}))
    # add the replicas to x
    x2 = pd.concat([x, add], ignore_index=True)
    print(x2)
else:
    print('y is already balanced')
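For comparison, a more compact sketch of the same idea (my own variant, not from the original answer): find the minority class, cycle through its ids until the deficit is covered, and append the replicas under fresh ids.
import numpy as np
import pandas as pd

counts = y['target'].value_counts()
deficit = counts.max() - counts.min()   # extra samples needed

if deficit:
    minority = counts.idxmin()
    # minority ids, cycled in case deficit > number of available ids
    src_ids = y.loc[y['target'].eq(minority), 'id']
    src_ids = src_ids.iloc[np.arange(deficit) % len(src_ids)]
    fresh = range(y['id'].max() + 1, y['id'].max() + deficit + 1)
    # replicate targets and features under the fresh ids
    y2 = pd.concat([y, pd.DataFrame({'id': fresh, 'target': minority})],
                   ignore_index=True)
    reps = [x[x['id'].eq(i)].assign(id=new_id)
            for i, new_id in zip(src_ids, fresh)]
    x2 = pd.concat([x] + reps, ignore_index=True)
else:
    print('y is already balanced')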

Add label-column to DataFrame

I have two DataFrames for example
df1:
0 1 2 3
a 1 2 3 4
b 10 20 30 40
c 100 200 300 400
------------------
df2:
0
0 x
1 y
2 z
Now I want to combine both like:
df_new:
value label
0 1 x
1 2 x
2 3 x
3 4 x
0 10 y
1 20 y
2 30 y
3 40 y
0 100 z
1 200 z
2 300 z
3 400 z
I wrote some really awkward code like:
import numpy as np
import pandas as pd

df_new = pd.DataFrame()
for i, j in zip(df1.index, df2.index):
    x = df1.loc[i]
    y = df2.loc[j]
    label = np.full(x.shape[0], y)
    df = pd.DataFrame({'value': x, 'label': label})
    df_new = pd.concat([df_new, df], axis=0)
print(df_new)
But I can imagine that there is a pandas-function like pd.melt or something which can do that better for bigger scale.
If both DataFrames have the same length, you can set the index of df1 from column 0 of df2, reshape with DataFrame.stack, and finally do some post-processing:
df = (df1.set_index(df2[0])
         .stack()
         .reset_index(level=1, drop=True)
         .rename_axis('lab')
         .reset_index(name='val')[['val', 'lab']])
print(df)
val lab
0 1 x
1 2 x
2 3 x
3 4 x
4 10 y
5 20 y
6 30 y
7 40 y
8 100 z
9 200 z
10 300 z
11 400 z
A solution with DataFrame.melt, joining the second df to the first with DataFrame.join (this variant assumes df2 carries two label columns, 0 and 1, as the output below shows):
df = (df1.reset_index(drop=True)
         .join(df2.add_prefix('label'))
         .melt(['label0', 'label1'], ignore_index=False)
         .sort_index(ignore_index=True)
         .drop('variable', axis=1)[['value', 'label0', 'label1']])
print(df)
value label0 label1
0 1 x xx
1 2 x xx
2 3 x xx
3 4 x xx
4 10 y yy
5 20 y yy
6 30 y yy
7 40 y yy
8 100 z zz
9 200 z zz
10 300 z zz
11 400 z zz
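For larger frames, a NumPy-based sketch avoids the reshape machinery entirely (assuming df1 and df2 as in the question, with one label in df2 per row of df1):
import numpy as np
import pandas as pd

df_new = pd.DataFrame({
    'value': df1.to_numpy().ravel(),                      # row-major flatten
    'label': np.repeat(df2[0].to_numpy(), df1.shape[1]),  # one label per row of df1
})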

pandas dataframe wide to long

I have a dataframe like so:
id type count_x count_y count_z sum_x sum_y sum_z
1 A 12 1 6 34 43 25
1 B 4 5 8 12 37 28
Now I want to transform it by grouping by id and type, then transforming from wide to long like so:
id type variable value calc
1 A x 12 count
1 A y 1 count
1 A z 6 count
1 B x 4 count
1 B y 5 count
1 B z 8 count
1 A x 34 sum
1 A y 43 sum
1 A z 25 sum
1 B x 12 sum
1 B y 37 sum
1 B z 28 sum
How can I achieve this?
Try using melt:
res = pd.melt(df,id_vars=['id', 'type'])
res[['calc', 'variable']] = res.variable.str.split('_', expand=True)
    id type variable  value   calc
0    1    A        x     12  count
1    1    B        x      4  count
2    1    A        y      1  count
3    1    B        y      5  count
4    1    A        z      6  count
5    1    B        z      8  count
6    1    A        x     34    sum
7    1    B        x     12    sum
8    1    A        y     43    sum
9    1    B        y     37    sum
10   1    A        z     25    sum
11   1    B        z     28    sum
Update:
Using stack:
df1 = (df.set_index(['id', 'type']).stack().rename('value').reset_index())
df1 = df1.drop('level_2', axis=1).join(
          df1['level_2'].str.split('_', n=1, expand=True)
                        .rename(columns={0: 'calc', 1: 'variable'}))
    id type  value   calc variable
0    1    A     12  count        x
1    1    A      1  count        y
2    1    A      6  count        z
3    1    A     34    sum        x
4    1    A     43    sum        y
5    1    A     25    sum        z
6    1    B      4  count        x
7    1    B      5  count        y
8    1    B      8  count        z
9    1    B     12    sum        x
10   1    B     37    sum        y
11   1    B     28    sum        z
You can use a combination of melt and str.split():
df = pd.DataFrame({'id': [1,1], 'type': ['A', 'B'], 'count_x':[12,4], 'count_y': [1,5], 'count_z': [6,8], 'sum_x': [34, 12], 'sum_y': [43, 37], 'sum_z': [25, 28]})
df_melt = df.melt(id_vars=['id', 'type'])
df_melt[['calc', 'variable']] = df_melt['variable'].str.split("_", expand=True)
df_melt
id type variable value calc
0 1 A x 12 count
1 1 B x 4 count
2 1 A y 1 count
3 1 B y 5 count
4 1 A z 6 count
5 1 B z 8 count
6 1 A x 34 sum
7 1 B x 12 sum
8 1 A y 43 sum
9 1 B y 37 sum
10 1 A z 25 sum
11 1 B z 28 sum
Assuming your pandas DataFrame is df_wide, you can get the desired result in df_long as:
df_long = df_wide.melt(id_vars=['id', 'type'], value_vars=['count_x', 'count_y', 'count_z', 'sum_x', 'sum_y', 'sum_z'])
df_long['calc'] = df_long['variable'].apply(lambda x: x.split('_')[0])
df_long['variable'] = df_long['variable'].apply(lambda x: x.split('_')[1])
You could reshape the data using pivot_longer from pyjanitor:
df.pivot_longer(
    index=['id', 'type'],
    # names of the new columns; note the order:
    # the first part of each split goes to `calc`,
    # the second goes to `variable`
    names_to=['calc', 'variable'],
    values_to='value',
    # delimiter for the column names
    names_sep='_')
id type calc variable value
0 1 A count x 12
1 1 B count x 4
2 1 A count y 1
3 1 B count y 5
4 1 A count z 6
5 1 B count z 8
6 1 A sum x 34
7 1 B sum x 12
8 1 A sum y 43
9 1 B sum y 37
10 1 A sum z 25
11 1 B sum z 28
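Another option is pd.wide_to_long, which splits the stubname (count/sum) from the suffix (x/y/z) natively; a sketch, with stack folding the count and sum columns into the calc level (row order will differ from the melt-based outputs):
import pandas as pd

res = (pd.wide_to_long(df, stubnames=['count', 'sum'],
                       i=['id', 'type'], j='variable',
                       sep='_', suffix=r'\w+')
         .stack()
         .rename_axis(['id', 'type', 'variable', 'calc'])
         .reset_index(name='value'))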

How can one duplicate columns N times in DataFrame?

I have a DataFrame with one column and I would like to get a DataFrame with N columns, all identical to the first one. I can duplicate it once with:
df[['new column name']] = df[['column name']]
but I have to make more than 1000 identical columns, which is why this doesn't work. One important thing: the figures in the columns should change; for instance, if the first column is 0, the nth column is n and the previous one is n-1.
If it's a single column, you can transpose it, replicate it with pd.concat, and transpose back to the original format. This avoids looping and should be faster. You can then change the column names in a second step, without touching all the data in the DataFrame, which would be the most expensive part performance-wise:
import pandas as pd
df = pd.DataFrame({'Column':[1,2,3,4,5]})
Original dataframe:
Column
0 1
1 2
2 3
3 4
4 5
df = pd.concat([df.T]*1000).T
Output:
Column Column Column Column ... Column Column Column Column
0 1 1 1 1 ... 1 1 1 1
1 2 2 2 2 ... 2 2 2 2
2 3 3 3 3 ... 3 3 3 3
3 4 4 4 4 ... 4 4 4 4
4 5 5 5 5 ... 5 5 5 5
[5 rows x 1000 columns]
df.columns = ['Column'+'_'+str(i) for i in range(1000)]
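If performance matters at this scale, building the whole block in NumPy first is a reasonable sketch; the + np.arange(n) line is one guess at the question's "figures should change" requirement (column i holds value + i):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Column': [1, 2, 3, 4, 5]})
n = 1000
block = np.tile(df[['Column']].to_numpy(), n)   # shape (5, n): n copies side by side
block = block + np.arange(n)                    # optional: shift the i-th copy by i
wide = pd.DataFrame(block, columns=[f'Column_{i}' for i in range(n)])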
Say you have a df with a column 'company_name' that consists of 8 companies:
df = pd.DataFrame({"company_name":{"0":"Telia","1":"Proximus","2":"Tmobile","3":"Orange","4":"Telefonica","5":"Verizon","6":"AT&T","7":"Koninklijke"}})
company_name
0 Telia
1 Proximus
2 Tmobile
3 Orange
4 Telefonica
5 Verizon
6 AT&T
7 Koninklijke
You can use a loop and range to determine how many identical columns to create, and do:
for i in range(0, 1000):
    df['company_name' + str(i)] = df['company_name']
which results in the shape of the df:
df.shape
(8, 1001)
i.e. it replicated the same column 1000 times. Each duplicated column gets the original name plus an integer suffix at the end:
'company_name', 'company_name0', 'company_name1', 'company_name2', ..., 'company_nameN'
df
A B C
0 x x x
1 y x z
Duplicate column "C" 5 times using df.assign:
n = 5
df2 = df.assign(**{f'C{i}': df['C'] for i in range(1, n+1)})
df2
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
Set n to 1000 to get your desired output.
You can also directly assign the result back:
df[[f'C{i}' for i in range(1, n+1)]] = df[['C']*n].to_numpy()
df
A B C C1 C2 C3 C4 C5
0 x x x x x x x x
1 y x z z z z z z
I think the most efficient approach is to index with DataFrame.loc instead of using an outer loop:
n = 3
new_df = df.loc[:, ['column_duplicate'] * n +
                   df.columns.difference(['column_duplicate']).tolist()]
print(new_df)
column_duplicate column_duplicate column_duplicate other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
If you want to add a suffix:
suffix_tup = ('a', 'b', 'c')
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate'] * len(suffix_tup) + not_dup_cols]
            .set_axis([f'column_duplicate_{suffix}' for suffix in suffix_tup] +
                      not_dup_cols, axis=1))
print(new_df)
print(new_df)
column_duplicate_a column_duplicate_b column_duplicate_c other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19
or add an index
n = 3
not_dup_cols = df.columns.difference(['column_duplicate']).tolist()
new_df = (df.loc[:, ['column_duplicate'] * n + not_dup_cols]
            .set_axis([f'column_duplicate_{x}' for x in range(n)] +
                      not_dup_cols, axis=1))
print(new_df)
print(new_df)
column_duplicate_0 column_duplicate_1 column_duplicate_2 other
0 0 0 0 10
1 1 1 1 11
2 2 2 2 12
3 3 3 3 13
4 4 4 4 14
5 5 5 5 15
6 6 6 6 16
7 7 7 7 17
8 8 8 8 18
9 9 9 9 19

pandas dataframe enumerate rows that passed a filter

I have a large data frame, and I'd like to add a column which is -1 if the row did not pass a filter, or an index if it passed the filter.
For example, in the data frame
b f j passed new_index
1 12 5 6 Y 0
2 4 99 2 Y 1
3 10 77 16 N -1
4 4 99 2 Y 2
5 10 77 16 N -1
6 4 99 2 Y 3
7 10 77 16 N -1
The column new_index is the one I added, based on column passed.
How do I do this without iterrows?
I created a series bool4 which is True where passed == Y and False otherwise, and tried:
df.loc[bool4, 'new_index'] = df.loc[bool4, 'new_index'].apply([lambda i: i for i in range(sum(bool4))])
But it does not update the new_index column (leaves it empty).
Let's use eq, cumsum, add, and mask:
df['new_index'] = df.passed.eq('Y').cumsum().add(-1).mask(df.passed == 'N', -1)
Output:
b f j passed new_index
1 12 5 6 Y 0
2 4 99 2 Y 1
3 10 77 16 N -1
4 4 99 2 Y 2
5 10 77 16 N -1
6 4 99 2 Y 3
7 10 77 16 N -1
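Equivalently, a short sketch with np.where reads quite directly (assumes the same df):
import numpy as np

mask = df['passed'].eq('Y')
# cumsum counts passing rows 1, 2, 3, ...; subtract 1 for a zero-based index
df['new_index'] = np.where(mask, mask.cumsum() - 1, -1)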
