Select certain columns based on multiple criteria in pandas - python

I have the following dataset:
import pandas as pd

my_df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                      'type': ['corp', 'smb', 'smb', 'corp', 'mid'],
                      'sales': [34567, 2190, 1870, 22000, 10000],
                      'sales_roi': [.10, .21, .22, .15, .16],
                      'sales_pct': [.38, .05, .08, .30, .20],
                      'sales_ln': [4.2, 2.1, 2.0, 4.1, 4],
                      'cost_pct': [22000, 1000, 900, 14000, 5000],
                      'flag': [0, 1, 0, 1, 1],
                      'gibberish': ['bla', 'ble', 'bla', 'ble', 'bla'],
                      'tech': ['lnx', 'mst', 'mst', 'lnx', 'mc']})
my_df['type'] = pd.Categorical(my_df.type)
my_df
id type sales sales_roi sales_pct sales_ln cost_pct flag gibberish tech
0 1 corp 34567 0.10 0.38 4.2 22000 0 bla lnx
1 2 smb 2190 0.21 0.05 2.1 1000 1 ble mst
2 3 smb 1870 0.22 0.08 2.0 900 0 bla mst
3 4 corp 22000 0.15 0.30 4.1 14000 1 ble lnx
4 5 mid 10000 0.16 0.20 4.0 5000 1 bla mc
And I want to filter out all variables that end in "_pct" or "_ln", or that are named "gibberish" or "tech". This is what I have tried:
df_selected = my_df.loc[:, ~my_df.columns.str.endswith('_pct') &
                           ~my_df.columns.str.endswith('_ln') &
                           ~my_df.columns.str.contains('gibberish','tech')]
But it returns an unwanted column ("tech"):
id type sales sales_roi flag tech
0 1 corp 34567 0.10 0 lnx
1 2 smb 2190 0.21 1 mst
2 3 smb 1870 0.22 0 mst
3 4 corp 22000 0.15 1 lnx
4 5 mid 10000 0.16 1 mc
This is the expected result:
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
Please consider that I have to deal with hundreds of variables and this is just an example of what I need.

The "tech" column slips through because str.contains takes a single pattern; its second positional argument is the case flag, so 'tech' is silently swallowed rather than used for matching. str.endswith, on the other hand, accepts a tuple of suffixes, so just put everything you want to exclude in a single tuple and then filter:
my_df[my_df.columns[~my_df.columns.str.endswith(('_pct','_ln','gibberish','tech'))]]
id type sales sales_roi flag
0 1 corp 34567 0.10 0
1 2 smb 2190 0.21 1
2 3 smb 1870 0.22 0
3 4 corp 22000 0.15 1
4 5 mid 10000 0.16 1
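If you prefer a single regex over a tuple of suffixes, str.contains can express the same selection with an alternation pattern; a minimal sketch on the example frame:
# str.contains takes one regex; anchor the suffixes with $ and name the exact columns
mask = my_df.columns.str.contains(r'_pct$|_ln$|^gibberish$|^tech$')
df_selected = my_df.loc[:, ~mask]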

I would do it like this:
criterion = ["_pct", "_ln", "gibberish", "tech"]

for column in my_df:
    for criteria in criterion:
        if criteria in column:
            my_df = my_df.drop(column, axis=1)
Of course you can change the if condition to endswith or something else of your choice.
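The same loop can also be collapsed into a single drop call; a sketch assuming the same criterion list as above:
# drop every column whose name contains any of the criterion substrings
my_df = my_df.drop(columns=[c for c in my_df.columns
                            if any(crit in c for crit in criterion)])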

Related

Map numeric data into bins in Pandas dataframe for separate groups using dictionaries

I have a pandas dataframe as follows:
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
polyid value
0 1 0.56
1 1 0.59
2 1 0.62
3 1 0.83
4 2 0.85
5 2 0.01
6 2 0.79
7 3 0.37
8 3 0.99
9 3 0.48
10 3 0.55
11 3 0.06
I need to reclassify the 'value' column separately for each 'polyid'. For the reclassification, I have two dictionaries. One with the bins that contain the information on how I want to cut the 'values' for each 'polyid' separately:
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
And one with the ids with which I want to label the resulting bins:
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
I tried to get this answer to work for my use case. The best I could come up with is applying pd.cut to each 'polyid' subset and then concatenating all subsets back into one dataframe with pd.concat:
import pandas as pd

def reclass_df_dic(df, bins_dic, names_dic, bin_key_col, val_col, name_col):
    df_lst = []
    for key in df[bin_key_col].unique():
        bins = bins_dic[key]
        names = names_dic[key]
        sub_df = df[df[bin_key_col] == key]
        sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
        df_lst.append(sub_df)
    return pd.concat(df_lst)
df = pd.DataFrame(data = [[1,0.56],[1,0.59],[1,0.62],[1,0.83],[2,0.85],[2,0.01],[2,0.79],[3,0.37],[3,0.99],[3,0.48],[3,0.55],[3,0.06]],columns=['polyid','value'])
bins_dic = {1:[0,0.6,0.8,1], 2:[0,0.2,0.9,1], 3:[0,0.5,0.6,1]}
ids_dic = {1:[1,2,3], 2:[1,2,3], 3:[1,2,3]}
df = reclass_df_dic(df, bins_dic, ids_dic, 'polyid', 'value', 'id')
This results in my desired output:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
However, the line:
sub_df[name_col] = pd.cut(df[val_col], bins, labels=names)
raises the warning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
which I am unable to resolve using .loc. Also, I suspect there is a more efficient way of doing this that avoids looping over each category?
A simpler solution would be to use groupby and apply a custom function on each group. In this case, we can define a function reclass that obtains the correct bins and ids and then uses pd.cut:
def reclass(group, name):
    bins = bins_dic[name]
    ids = ids_dic[name]
    return pd.cut(group, bins, labels=ids)

df['id'] = df.groupby('polyid')['value'].apply(lambda x: reclass(x, x.name))
Result:
polyid value id
0 1 0.56 1
1 1 0.59 1
2 1 0.62 2
3 1 0.83 3
4 2 0.85 2
5 2 0.01 1
6 2 0.79 2
7 3 0.37 1
8 3 0.99 3
9 3 0.48 1
10 3 0.55 2
11 3 0.06 1
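As a side note, the SettingWithCopyWarning in the original loop comes from assigning a new column to a slice of df; taking an explicit copy of the subset (and cutting the subset's own values rather than df's) usually silences it. A sketch of the two relevant lines inside the loop:
sub_df = df[df[bin_key_col] == key].copy()   # explicit copy, no chained-assignment warning
sub_df[name_col] = pd.cut(sub_df[val_col], bins, labels=names)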

Generate new column based on values in another column and their index

In the df underneath, I want to reorder the values of column 'cdf_X' based on columns 'A' and 'X'. Columns 'X' and 'cdf_X' are connected, so whenever a value from 'X' appears in column 'A', the corresponding 'cdf_X' value should be placed at that row of 'A' in a new column 'cdf_A'. (Values don't occur twice in a column.)
Example: 'X'=3 at index 0 -> cdf_X=0.05 at index 0 -> '3' appears in column 'A' at index 4 -> cdf_A at index 4 = cdf_X at index 0
Initial df:
A X cdf_X
0 7 3 0.05
1 4 4 0.15
2 11 7 0.27
3 9 9 0.45
4 3 11 0.69
5 13 13 1.00
Desired df:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
Tried code:
import pandas as pd

df = pd.DataFrame({"A": [7, 4, 11, 9, 3, 13],
                   "cdf_X": [0.05, 0.15, 0.27, 0.45, 0.69, 1.00],
                   "X": [3, 4, 7, 9, 11, 13]})
df.loc[:, 'cdf_A'] = df['cdf_X'].where(df['A'] == df['X'])

print(df)
Check with map:
df['cdf_A'] = df.A.map(df.set_index('X')['cdf_X'])
I think you need replace:
df['cdf_A'] = df.A.replace(df.set_index('X').cdf_X)
Out[989]:
A X cdf_X cdf_A
0 7 3 0.05 0.27
1 4 4 0.15 0.15
2 11 7 0.27 0.69
3 9 9 0.45 0.45
4 3 11 0.69 0.05
5 13 13 1.00 1.00
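The two answers differ only in how values of A that never appear in X would be handled: map returns NaN for unmatched keys, while replace leaves the original value in place. A small sketch contrasting the two, using the question's column names:
lookup = df.set_index('X')['cdf_X']
df['cdf_A'] = df['A'].map(lookup)        # unmatched keys become NaN
# df['cdf_A'] = df['A'].replace(lookup)  # unmatched keys keep their original A value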

Add a column of normalised values based on sections of a dataframe column

I have a dataframe of several hundred thousand rows, in the following format:
time_elapsed cycle
0 0.00 1
1 0.50 1
2 1.00 1
3 1.30 1
4 1.50 1
5 0.00 2
6 0.75 2
7 1.50 2
8 3.00 2
I want to create a third column giving, for each row, how far through its cycle that time instance is, as a percentage of that cycle's maximum time_elapsed (resetting at the next time_elapsed = 0). Something like:
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 75
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
I'm not fussed about the number of decimal places, I've just excluded them for ease here.
I started going along this route, but I keep getting errors.
data['percentage'] = data['time_elapsed'].sub(data.groupby(['cycle'])['time_elapsed'].transform(lambda x: x*100/data['time_elapsed'].max()))
I think it's the lambda function causing errors, but I'm not sure what I should do to change it. Any help is much appreciated :)
Use Series.div for division instead of sub for subtraction; the solution then simplifies to getting only the max per group, multiplying by 100 with Series.mul, rounding with Series.round if necessary, and finally converting to integers with Series.astype:
s = data.groupby(['cycle'])['time_elapsed'].transform('max')
data['percentage'] = data['time_elapsed'].div(s).mul(100).round().astype(int)
print (data)
time_elapsed cycle percentage
0 0.00 1 0
1 0.50 1 33
2 1.00 1 67
3 1.30 1 87
4 1.50 1 100
5 0.00 2 0
6 0.75 2 25
7 1.50 2 50
8 3.00 2 100
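For reference, transform('max') broadcasts each cycle's maximum back onto every row of that cycle, which is what allows the element-wise division; on the example data the intermediate series looks like this:
s = data.groupby(['cycle'])['time_elapsed'].transform('max')
# s is aligned with data: rows of cycle 1 (index 0-4) all hold 1.5,
# rows of cycle 2 (index 5-8) all hold 3.0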

Ranking and subranking rows in pandas

I have the following dataset, which I would like to rank by region and also by store type (within each region).
Is there a slick way of coding these 2 columns in python?
Data:
print (df)
Region ID Location store Type ID Brand share
0 1 Warehouse 1.97
1 1 Warehouse 0.24
2 1 Super Centre 0.21
3 1 Warehouse 0.13
4 1 Mini Warehouse 0.10
5 1 Super Centre 0.07
6 1 Mini Warehouse 0.04
7 1 Super Centre 0.02
8 1 Mini Warehouse 0.02
9 10 Warehouse 0.64
10 10 Mini Warehouse 0.18
11 10 Warehouse 0.13
12 10 Warehouse 0.09
13 10 Super Centre 0.07
14 10 Mini Warehouse 0.03
15 10 Mini Warehouse 0.02
16 10 Super Centre 0.02
Use GroupBy.cumcount:
df['RegionRank'] = df.groupby('Region ID')['Brand share'].cumcount() + 1
cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = df.groupby(cols)['Brand share'].cumcount() + 1
print (df)
Region ID Location store Type ID Brand share RegionRank StoreTypeRank
0 1 Warehouse 1.97 1 1
1 1 Warehouse 0.24 2 2
2 1 Super Centre 0.21 3 1
3 1 Warehouse 0.13 4 3
4 1 Mini Warehouse 0.10 5 1
5 1 Super Centre 0.07 6 2
6 1 Mini Warehouse 0.04 7 2
7 1 Super Centre 0.02 8 3
8 1 Mini Warehouse 0.02 9 3
9 10 Warehouse 0.64 1 1
10 10 Mini Warehouse 0.18 2 1
11 10 Warehouse 0.13 3 2
12 10 Warehouse 0.09 4 3
13 10 Super Centre 0.07 5 1
14 10 Mini Warehouse 0.03 6 2
15 10 Mini Warehouse 0.02 7 3
16 10 Super Centre 0.02 8 2
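Note that cumcount simply numbers rows in the order they already appear, so this assumes the frame is pre-sorted by Brand share (descending) within each region; if it is not, sort first, for example:
df = df.sort_values(['Region ID', 'Brand share'], ascending=[True, False])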
Or GroupBy.rank, but it returns the same rank for tied values:
df['RegionRank'] = (df.groupby('Region ID')['Brand share']
                      .rank(method='dense', ascending=False)
                      .astype(int))

cols = ['Location store Type ID', 'Region ID']
df['StoreTypeRank'] = (df.groupby(cols)['Brand share']
                         .rank(method='dense', ascending=False)
                         .astype(int))
print (df)
Region ID Location store Type ID Brand share RegionRank StoreTypeRank
0 1 Warehouse 1.97 1 1
1 1 Warehouse 0.24 2 2
2 1 Super Centre 0.21 3 1
3 1 Warehouse 0.13 4 3
4 1 Mini Warehouse 0.10 5 1
5 1 Super Centre 0.07 6 2
6 1 Mini Warehouse 0.04 7 2
7 1 Super Centre 0.02 8 3
8 1 Mini Warehouse 0.02 8 3
9 10 Warehouse 0.64 1 1
10 10 Mini Warehouse 0.18 2 1
11 10 Warehouse 0.13 3 2
12 10 Warehouse 0.09 4 3
13 10 Super Centre 0.07 5 1
14 10 Mini Warehouse 0.03 6 2
15 10 Mini Warehouse 0.02 7 3 <-same value .02
16 10 Super Centre 0.02 7 2 <-same value .02

Merge two DataFrames based on columns and values of a specific column with Pandas in Python 3.x

Hello, I have a problem for which I am unable to implement a solution.
I have the following two DataFrames:
>>> df1
A B date
1 1 01-2016
2 1 02-2017
1 2 03-2017
2 2 04-2020
>>> df2
A B 01-2016 02-2017 03-2017 04-2020
1 1 0.10 0.22 0.55 0.77
2 1 0.20 0.12 0.99 0.125
1 2 0.13 0.15 0.15 0.245
2 2 0.33 0.1 0.888 0.64
What i want is following DataFrame:
>>> df3
A B date value
1 1 01-2016 0.10
2 1 02-2017 0.12
1 2 03-2017 0.15
2 2 04-2020 0.64
I already tried the following:
summarize_dates = self.summarize_specific_column(data=df1, column='date')
for date in summarize_dates:
    left_on = np.append(left_on, date)
    right_on = np.append(right_on, merge_columns.upper())

result = pd.merge(left=df2, right=df1,
                  left_on=left_on, right_on=right_on,
                  how='right')
print(result)
This does not work. Can you help me and suggest a more comfortable implementation? Many thanks in advance!
You can melt df2 and then merge using the default 'inner' merge:
df3 = df1.merge(df2.melt(id_vars = ['A', 'B'], var_name='date'))
A B date value
0 1 1 01-2016 0.10
1 2 1 02-2017 0.12
2 1 2 03-2017 0.15
3 2 2 04-2020 0.64
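For clarity, melt first turns df2 into long form (one row per A/B/date combination), and the default inner merge on the shared columns A, B and date then keeps only the combinations listed in df1; the first few melted rows look roughly like this:
df2.melt(id_vars=['A', 'B'], var_name='date').head()
#    A  B     date  value
# 0  1  1  01-2016   0.10
# 1  2  1  01-2016   0.20
# 2  1  2  01-2016   0.13
# 3  2  2  01-2016   0.33
# 4  1  1  02-2017   0.22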
Using lookup
df1['value']=df2.set_index(['A','B']).lookup(df1.set_index(['A','B']).index,df1.date)
df1
Out[228]:
A B date value
0 1 1 01-2016 0.10
1 2 1 02-2017 0.12
2 1 2 03-2017 0.15
3 2 2 04-2020 0.64
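Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so on newer versions the melt/merge approach above is the safer option; if an index-based variant is still wanted, something along these lines should work (a sketch; pd.MultiIndex.from_frame needs pandas 0.24+):
import numpy as np

# align df2's rows with df1 via the (A, B) index, then pick each row's date column by position
wide = df2.set_index(['A', 'B']).reindex(pd.MultiIndex.from_frame(df1[['A', 'B']]))
rows = np.arange(len(df1))
cols = wide.columns.get_indexer(df1['date'])
df1['value'] = wide.to_numpy()[rows, cols]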
