Drop pandas DF rows having majority of 0's - python

I have a dataset as shown below.
I want to drop rows like 4, 5 & 7, where the majority of columns (but not all) are 0. At the same time, I don't want to drop rows like 0 and 1, which have only a few entries that are 0.

First, create a column counting the zeros in each row:
df['no_of_zeros'] = (df == 0).astype(int).sum(axis=1)
Then define how many zeros are acceptable per row and filter the dataframe accordingly:
df = df[df['no_of_zeros'] < 3].drop(['no_of_zeros'], axis=1)
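For example, a minimal sketch on a small hypothetical frame (the threshold of 3 zeros is arbitrary and should be tuned to your data):
import pandas as pd

df = pd.DataFrame({'A': [5, 0, 0], 'B': [3, 0, 7], 'C': [0, 0, 2], 'D': [1, 4, 0]})
df['no_of_zeros'] = (df == 0).astype(int).sum(axis=1)  # zeros per row: 1, 3, 2
df = df[df['no_of_zeros'] < 3].drop(['no_of_zeros'], axis=1)
print(df)
#    A  B  C  D
# 0  5  3  0  1
# 2  0  7  2  0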

Here is one way:
import pandas as pd
df = pd.DataFrame([[0, 1, 2, 3, 4],
                   [0, 0, 0, 1, 2]],
                  columns=['A', 'B', 'C', 'D', 'E'])
df = df[~((df == 0).astype(int).sum(axis=1) > len(df.columns) / 2)]
#    A  B  C  D  E
# 0  0  1  2  3  4

Assuming "majority" means "more than half of the columns", this works:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'c2': {0: 76, 1: 45, 2: 47, 3: 92, 4: 0, 5: 0, 6: 26, 7: 0, 8: 71},
...: 'c3': {0: 0, 1: 3, 2: 6, 3: 9, 4: 0, 5: 0, 6: 12, 7: 0, 8: 15},
...: 'c4': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
...: 'c5': {0: 23, 1: 0, 2: 23, 3: 23, 4: 0, 5: 0, 6: 23, 7: 0, 8: 23},
...: 'c6': {0: 65, 1: 25, 2: 62, 3: 26, 4: 52, 5: 22, 6: 65, 7: 0, 8: 69},
...: 'c7': {0: 12, 1: 12, 2: 12, 3: 12, 4: 12, 5: 12, 6: 12, 7: 12, 8: 12},
...: 'c8': {0: 0, 1: 0, 2: 8, 3: 9, 4: 0, 5: 0, 6: 4, 7: 0, 8: 4},
...: 'cl': {0: 5, 1: 7, 2: 8, 3: 15, 4: 0, 5: 0, 6: 2, 7: 0, 8: 5},
...: 'sr': {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6, 7: 7, 8: 8}})
...:
In [3]: df
Out[3]:
   c2  c3  c4  c5  c6  c7  c8  cl  sr
0  76   0   1  23  65  12   0   5   0
1  45   3   1   0  25  12   0   7   1
2  47   6   1  23  62  12   8   8   2
3  92   9   1  23  26  12   9  15   3
4   0   0   1   0  52  12   0   0   4
5   0   0   1   0  22  12   0   0   5
6  26  12   1  23  65  12   4   2   6
7   0   0   1   0   0  12   0   0   7
8  71  15   1  23  69  12   4   5   8
In [4]: df[((df == 0).sum(axis=1) <= len(df.columns) / 2)]
Out[4]:
   c2  c3  c4  c5  c6  c7  c8  cl  sr
0  76   0   1  23  65  12   0   5   0
1  45   3   1   0  25  12   0   7   1
2  47   6   1  23  62  12   8   8   2
3  92   9   1  23  26  12   9  15   3
6  26  12   1  23  65  12   4   2   6
8  71  15   1  23  69  12   4   5   8

Related

Is there a way to return value (date) in dataframe from multiple column?

Currently I'm trying to return a date to another column based on multiple column values, as shown in the dataframe below. The expected result is in the df['return_date'] column, which holds the date of the last row before another column's value flips to '1'.
DATE       column_a  column_b  column_c  column_d  return_date
1/1/2023   0         0         1         0         NaN
2/1/2023   0         0         1         0         NaN
3/1/2023   0         0         1         0         3/1/2023
4/1/2023   0         1         0         0         NaN
5/1/2023   0         1         0         0         NaN
6/1/2023   0         1         0         0         NaN
7/1/2023   0         1         0         0         7/1/2023
8/1/2023   1         0         0         0         NaN
9/1/2023   1         0         0         0         9/1/2023
10/1/2023  0         0         0         1         NaN
I want to learn to use groupby with multiple columns, but I'm not quite familiar with it yet. Can anyone help me?
This should work for your case. You need to detect when a row changes with respect to the next one, and assign the date in that row.
import pandas as pd

data = {'DATE': {0: '1/1/2023',
                 1: '2/1/2023',
                 2: '3/1/2023',
                 3: '4/1/2023',
                 4: '5/1/2023',
                 5: '6/1/2023',
                 6: '7/1/2023',
                 7: '8/1/2023',
                 8: '9/1/2023',
                 9: '10/1/2023'},
        'column_a': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 1, 8: 1, 9: 0},
        'column_b': {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1, 6: 1, 7: 0, 8: 0, 9: 0},
        'column_c': {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
        'column_d': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 1}}
df = pd.DataFrame.from_dict(data)
# a cell that goes from 0 to 1 on the next row shows up as -1 in diff(-1)
row_change = df.filter(like='column').diff(-1).eq(-1).any(axis=1)
df.loc[row_change, 'return_date'] = df.loc[row_change, 'DATE']
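A quick sanity check of which rows get flagged, using the sample data above:
print(df.index[row_change].tolist())
# [2, 6, 8]
These are the last rows of each run, so return_date is filled with 3/1/2023, 7/1/2023 and 9/1/2023, matching the expected output.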
You can use groupby in combination with .cumcount() to get the required index, and then pull the date from the DATE column using that index.
Code:
import pandas as pd
df = pd.DataFrame({
    'DATE': ['1/1/2023', '2/1/2023', '3/1/2023', '4/1/2023', '5/1/2023', '6/1/2023', '7/1/2023', '8/1/2023', '9/1/2023', '10/1/2023'],
    'column_a': [0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
    'column_b': [0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    'column_c': [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    'column_d': [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
})
idx = df[df.groupby(['column_a', 'column_b', 'column_c', 'column_d']).cumcount() == 0].index - 1
df.loc[idx[1:], 'return_date'] = df.loc[idx[1:], 'DATE']
print(df)
Output
        DATE  column_a  column_b  column_c  column_d return_date
0   1/1/2023         0         0         1         0         NaN
1   2/1/2023         0         0         1         0         NaN
2   3/1/2023         0         0         1         0    3/1/2023
3   4/1/2023         0         1         0         0         NaN
4   5/1/2023         0         1         0         0         NaN
5   6/1/2023         0         1         0         0         NaN
6   7/1/2023         0         1         0         0    7/1/2023
7   8/1/2023         1         0         0         0         NaN
8   9/1/2023         1         0         0         0    9/1/2023
9  10/1/2023         0         0         0         1         NaN

Calculating the proportion of total 'yes' values in a group

I have a dataframe that looks like this:
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster
1    1000   1005   6     7      13     Y           3                 6     36346
1    1007   10012  3     1      4      N           3                 6     36346
1    10014  10020  0     1      1      Y           3                 6     36346
2    33532  33554  1     1      2      N           1                 2     22123
cluster is an ID assigned to each row; in this case, cluster 36346 has 3 "sites".
In this cluster, two of the sites are in the control (in_control == Y).
I want to create an additional column which tells me what proportion of the sites are in the control, i.e. (sum(in_control == Y) for a cluster) / sites_in_cluster.
In this example, cluster 36346 has two rows with in_control == Y and 3 sites_in_cluster, so cluster_sites_in_control would be 2/3 = 0.66, whereas cluster 22123 has only one site, which isn't in the control, so it would be 0/1 = 0.
chr  start  end    plus  minus  total  in_control  sites_in_cluster  mean  cluster  cluster_sites_in_control
1    1000   1005   6     7      13     Y           3                 6     36346    0.66
1    1007   10012  3     1      4      N           3                 6     36346    0.66
1    10014  10020  0     1      1      Y           3                 6     36346    0.66
2    33532  33554  1     1      2      N           1                 2     22123    0.00
I have created code which seemingly accomplishes this; however, it seems extremely roundabout and I'm certain there's a better solution out there:
# %%
import pandas as pd
#get the number of sites in a control that are 'Y'
number_in_control = pd.DataFrame(intersect_in_control.groupby(['cluster']).in_control.value_counts().unstack(fill_value=0).loc[:,'Y'])
#get the number of breaksites for that cluster
number_of_breaksites = pd.DataFrame(intersect_in_control.groupby(['cluster'])['sites_in_cluster'].count())
#combine these two dataframes
combined_dataframe = pd.concat([number_in_control.reset_index(drop=False), number_of_breaksites.reset_index(drop=True)], axis=1)
#calculate the desired column
combined_dataframe["proportion_in_control"] = combined_dataframe["Y"]/combined_dataframe["sites_in_cluster"]
#left join this new dataframe to the original whilst dropping undesired columns.
cluster_in_control = intersect_in_control.merge((combined_dataframe.drop(["Y","sites_in_cluster"], axis = 1)), on='cluster', how='left')
11 rows of the df as example data:
{'chr': {0: 'chr14',
1: 'chr2',
2: 'chr1',
3: 'chr10',
4: 'chr17',
5: 'chr17',
6: 'chr2',
7: 'chr2',
8: 'chr2',
9: 'chr1',
10: 'chr1'},
'start': {0: 23016497,
1: 133031338,
2: 64081726,
3: 28671025,
4: 45219225,
5: 45219225,
6: 133026750,
7: 133026761,
8: 133026769,
9: 1510391,
10: 15853061},
'end': {0: 23016501,
1: 133031342,
2: 64081732,
3: 28671030,
4: 45219234,
5: 45219234,
6: 133026755,
7: 133026763,
8: 133026770,
9: 1510395,
10: 15853067},
'plus_count': {0: 2,
1: 0,
2: 5,
3: 1,
4: 6,
5: 6,
6: 14,
7: 2,
8: 0,
9: 2,
10: 4},
'minus_count': {0: 6,
1: 7,
2: 1,
3: 5,
4: 0,
5: 0,
6: 0,
7: 0,
8: 2,
9: 3,
10: 1},
'count': {0: 8, 1: 7, 2: 6, 3: 6, 4: 6, 5: 6, 6: 14, 7: 2, 8: 2, 9: 5, 10: 5},
'in_control': {0: 'N',
1: 'N',
2: 'Y',
3: 'N',
4: 'Y',
5: 'Y',
6: 'N',
7: 'Y',
8: 'N',
9: 'Y',
10: 'Y'},
'total_breaks': {0: 8,
1: 7,
2: 6,
3: 6,
4: 6,
5: 6,
6: 18,
7: 18,
8: 18,
9: 5,
10: 5},
'sites_in_cluster': {0: 1,
1: 1,
2: 1,
3: 1,
4: 1,
5: 1,
6: 3,
7: 3,
8: 3,
9: 1,
10: 1},
'mean_breaks_per_site': {0: 8.0,
1: 7.0,
2: 6.0,
3: 6.0,
4: 6.0,
5: 6.0,
6: 6.0,
7: 6.0,
8: 6.0,
9: 5.0,
10: 5.0},
'cluster': {0: 22665,
1: 24664,
2: 3484,
3: 13818,
4: 23640,
5: 23640,
6: 24652,
7: 24652,
8: 24652,
9: 48,
10: 769}}
Thanks in advance for any help :)
For the percentage, it is possible to simplify the solution by taking the mean of the boolean column per group; to create the new column, use GroupBy.transform. This works well because True values are processed like 1:
df['cluster_sites_in_control'] = (df['in_control'].eq('Y')
.groupby(df['cluster']).transform('mean'))
print (df)
  chr  start    end  plus  minus  total in_control  sites_in_cluster  mean  \
0   1   1000   1005     6      7     13          Y                 3     6
1   1   1007  10012     3      1      4          N                 3     6
2   1  10014  10020     0      1      1          Y                 3     6
3   2  33532  33554     1      1      2          N                 1     2

   cluster  cluster_sites_in_control
0    36346                  0.666667
1    36346                  0.666667
2    36346                  0.666667
3    22123                  0.000000

Pandas group by selected dates

I have a dataframe that is very similar to this dataframe:
index  date       month
0      2019-12-1  12
1      2020-03-1  3
2      2020-07-1  7
3      2021-02-1  2
4      2021-09-1  9
And I want to combine all dates by mapping each one to the closest of a set of months. The months are normalized like this:
Months            Normalized month
3, 4, 5           4
6, 7, 8, 9        8
1, 2, 10, 11, 12  12
So the output will be:
index  date       month
0      2019-12-1  12
1      2020-04-1  4
2      2020-08-1  8
3      2020-12-1  12
4      2021-08-1  8
You can iterate through the DataFrame and use replace to change the dates.
import pandas as pd

df = pd.DataFrame(data={'date': ["2019-12-1", "2020-03-1", "2020-07-1", "2021-02-1", "2021-09-1"],
                        'month': [12, 3, 7, 2, 9]})

# use .loc to avoid chained assignment, and take the month segment from each
# row's own date (the original indexed df["date"][0] on every iteration)
for index, row in df.iterrows():
    if row['month'] in [3, 4, 5]:
        df.loc[index, 'month'] = 4
        df.loc[index, 'date'] = row['date'].replace(row['date'][5:7], "04")
    elif row['month'] in [6, 7, 8, 9]:
        df.loc[index, 'month'] = 8
        df.loc[index, 'date'] = row['date'].replace(row['date'][5:7], "08")
    else:
        df.loc[index, 'month'] = 12
        df.loc[index, 'date'] = row['date'].replace(row['date'][5:7], "12")
You can try creating a dictionary of normalized months:
norm_month_dict = {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8, 1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
then use this dictionary to map month values to their respective normalized month values:
df['normalized_month'] = df['month'].map(norm_month_dict)
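Putting it together, a minimal sketch that also splices the normalized month back into the date string (an assumption on my part, since the snippet above only handles the month column; note this does not roll the year back when January/February map to December):
import pandas as pd

df = pd.DataFrame({'date': ['2019-12-1', '2020-03-1', '2020-07-1', '2021-02-1', '2021-09-1'],
                   'month': [12, 3, 7, 2, 9]})
norm_month_dict = {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8,
                   1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
df['month'] = df['month'].map(norm_month_dict)
# replace the two-digit month segment (positions 5-6 of 'YYYY-MM-D') with the normalized month
df['date'] = df['date'].str[:5] + df['month'].astype(str).str.zfill(2) + df['date'].str[7:]
print(df)
#         date  month
# 0  2019-12-1     12
# 1  2020-04-1      4
# 2  2020-08-1      8
# 3  2021-12-1     12
# 4  2021-08-1      8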
You need to construct a dictionary from the second dataframe (assuming df1 and df2):
d = (
df2.assign(Months=df2['Months'].str.split(', '))
.explode('Months').astype(int)
.set_index('Months')['Normalized month'].to_dict()
)
# {3: 4, 4: 4, 5: 4, 6: 8, 7: 8, 8: 8, 9: 8, 1: 12, 2: 12, 10: 12, 11: 12, 12: 12}
Then map the values:
df1['month'] = df1['month'].map(d)
output:
   index       date  month
0      0  2019-12-1     12
1      1  2020-03-1      4
2      2  2020-07-1      8
3      3  2021-02-1     12
4      4  2021-09-1      8

In the data frame of probabilities over time return first column name where value is < .5 for each row

Given a pandas data frame like the following, where the column names are the time, the rows are the subjects, and the values are probabilities, return the column name (or time) at which the probability first drops below .50 for each subject. The probabilities always descend from 1 to 0. I have tried looping through the data frame, but it is not computationally efficient.
subject id  0  1         2         3         4         5         6         7         …  669       670       671
1           1  0.997913  0.993116  0.989017  0.976157  0.973078  0.968056  0.963685  …  0.156092  0.156092  0.156092
2           1  0.990335  0.988685  0.983145  0.964912  0.958     0.952     0.946995  …  0.148434  0.148434  0.148434
3           1  0.996231  0.990571  0.985775  0.976809  0.972736  0.969633  0.966116  …  0.17037   0.17037   0.17037
4           1  0.997129  0.994417  0.991054  0.978795  0.974216  0.96806   0.963039  …  0.15192   0.15192   0.15192
5           1  0.997728  0.993598  0.986641  0.98246   0.977371  0.972874  0.96816   …  0.154545  0.154545  0.154545
6           1  0.998134  0.995564  0.989901  0.986941  0.982313  0.972951  0.969645  …  0.17473   0.17473   0.17473
7           1  0.995681  0.994131  0.990401  0.974494  0.967941  0.961859  0.956636  …  0.144753  0.144753  0.144753
8           1  0.997541  0.994904  0.991941  0.983389  0.979375  0.973158  0.966358  …  0.158763  0.158763  0.158763
9           1  0.992253  0.989064  0.979258  0.955747  0.948842  0.942899  0.935784  …  0.150291  0.150291  0.150291
Goal Output
subject id  time prob < .5
1           100
2           99
3           34
4           19
5           600
6           500
7           222
8           111
9           332
Since the probabilities are always descending, you can do this:
>>> df.set_index("subject id").gt(.98).sum(1)
subject id
1    4
2    4
3    4
4    4
5    5
6    6
7    4
8    5
9    3
dtype: int64
note: I'm using .98 instead of .5 because I'm using only a portion of the data.
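If you want the matching column label rather than a count (they coincide here only because the columns are named 0, 1, 2, …), a follow-up sketch, assuming every row eventually drops below the threshold:
probs = df.set_index("subject id")
counts = probs.gt(.98).sum(1)
# the count of values above the threshold is the position of the first value below it
first_below = counts.map(lambda i: probs.columns[i])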
Data used
{'subject id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9},
'0': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1},
'1': {0: 0.997913,
1: 0.990335,
2: 0.996231,
3: 0.997129,
4: 0.997728,
5: 0.998134,
6: 0.995681,
7: 0.997541,
8: 0.992253},
'2': {0: 0.993116,
1: 0.988685,
2: 0.990571,
3: 0.994417,
4: 0.993598,
5: 0.995564,
6: 0.994131,
7: 0.994904,
8: 0.989064},
'3': {0: 0.989017,
1: 0.983145,
2: 0.985775,
3: 0.991054,
4: 0.986641,
5: 0.989901,
6: 0.990401,
7: 0.991941,
8: 0.979258},
'4': {0: 0.976157,
1: 0.964912,
2: 0.976809,
3: 0.978795,
4: 0.98246,
5: 0.986941,
6: 0.974494,
7: 0.983389,
8: 0.955747},
'5': {0: 0.973078,
1: 0.958,
2: 0.972736,
3: 0.974216,
4: 0.977371,
5: 0.982313,
6: 0.967941,
7: 0.979375,
8: 0.948842},
'6': {0: 0.968056,
1: 0.952,
2: 0.969633,
3: 0.96806,
4: 0.972874,
5: 0.972951,
6: 0.961859,
7: 0.973158,
8: 0.942899},
'7': {0: 0.963685,
1: 0.946995,
2: 0.966116,
3: 0.963039,
4: 0.96816,
5: 0.969645,
6: 0.956636,
7: 0.966358,
8: 0.935784}}
If I understand correctly, I think this is what you are looking for:
df.where(df.lt(.5)).idxmax(axis=1)
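For instance, with the sample portion above (set the index first, and use the .98 threshold again since these rows never fall below .5):
>>> probs = df.set_index("subject id")
>>> probs.where(probs.lt(.98)).idxmax(axis=1)
subject id
1    4
2    4
3    4
4    4
5    5
6    6
7    4
8    5
9    3
dtype: object
Since the probabilities are descending, the largest value below the threshold is also the first one, which is why idxmax lands on the right column.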

ValueError: could not broadcast input array from shape (5) into shape (7)

In this code:
import pandas as pd
myj='{"columns":["tablename","alias_tablename","real_tablename","dbname","finalcost","columns","pri_col"],"index":[0,1],"data":[["b","b","vip_banners","openx","",["id","name","adlink","wap_link","ipad_link","iphone_link","android_link","pictitle","target","starttime","endtime","weight_limit","weight","introduct","isbutton","sex","tag","gomethod","showtype","version","warehouse","areaid","textpic","smallpicture","group","service_provider","channels","chstarttime","chendtime","tzstarttime","tzendtime","status","editetime","shownum","wap_version","ipad_version","iphone_version","android_version","showtime","template_id","app_name","acid","ab_test","ratio","ab_tset_type","acid_type","key_name","phone_models","androidpad_version","is_delete","ugep_group","author","content","rule_id","application_id","is_default","district","racing_id","public_field","editor","usp_expression","usp_group","usp_php_expression","is_pic_category","is_custom_finance","midwhitelist","is_freeshipping","resource_id","usp_property","always_display","pushtime","is_pmc","version_type","is_plan","loop_pic_frame_id","plan_personal_id","personal_id","is_img_auto","banner_type","ext_content"],"id"],["a","a","vip_adzoneassoc","openx","",["id","zone_id","ad_id","weight"],"id"]]}'
df=pd.read_json(myj, orient='split')
bl=['is_delete,status,author', 'endtime', 'banner_type', 'id', 'starttime', 'status,endtime','weight']
al= ['zone_id,ad_id', 'zone_id,ad_id,id', 'ad_id', 'id', 'zone_id']
#
#bl=['add_time', 'allot_time', 'create_time', 'end_pay_time', 'start_pay_time', 'order_status,update_time', 'order_type,order_status,add_time', 'order_type,order_status,end_pay_time', 'wms_flag,order_status,is_first,order_date', 'last_update_time', 'order_code', 'order_date', 'order_sn', 'parent_sn', 'id', 'user_id', 'wms_flag,order_date']
#al=['area_id', 'last_update_time', 'mobile', 'parent_sn', 'id', 'transport_number', 'parent_sn']
def get_index(row):
    print(row)
    if row.tablename == 'b':
        return bl
    else:
        return al
        # return ['is_delete,status,author', 'endtime', 'banner_type', 'id', 'starttime', 'status,endtime', 'weight']

df['index_cols'] = df.apply(get_index, axis=1)
I run into this error:
ValueError: could not broadcast input array from shape (5) into shape
(7)
Instead, if I use the commented-out bl and al, everything runs fine.
Also if I use
bl=['is_delete,status,author', 'endtime', 'banner_type', 'id', 'starttime', 'status,endtime']
it runs fine too. What's the problem?
In pandas 0.22.0, a list returned from the apply method is used to construct a new dataframe when its length equals the number of columns in the initial dataframe.
For example:
>>> df = pd.DataFrame([range(100), range(100)])
>>> df
   0  1  2  3  4  5  6  7  8  9 ...  90  91  92  93  94  95  96  97  98  99
0  0  1  2  3  4  5  6  7  8  9 ...  90  91  92  93  94  95  96  97  98  99
1  0  1  2  3  4  5  6  7  8  9 ...  90  91  92  93  94  95  96  97  98  99
You can return a list in apply and get a dataframe:
>>> df.apply(lambda x:(x+1).values.tolist(), axis=1)
0 1 2 3 4 5 6 7 8 9 ... 90 91 92 93 94 95 96 97 98 99
0 1 2 3 4 5 6 7 8 9 10 ... 91 92 93 94 95 96 97 98 99 100
1 1 2 3 4 5 6 7 8 9 10 ... 91 92 93 94 95 96 97 98 99 100
but if the length does not match the dimension:
>>> df.apply(lambda x:(x+1).values.tolist()[:99], axis=1)
0    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
we get a Series.
And if you return lists of different lengths, the result depends on the first one. When the first list does not match the number of columns:
>>> df.apply(lambda x: [1] * 99 if x.name == 0 else [0] * 100, axis=1)
0    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
1    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
it works ok. But this one, where the first list matches the dimension while the next does not (like in your case),
>>> df.apply(lambda x: [1] * 100 if x.name == 0 else [0] * 99, axis=1)
raises an error.
In pandas 0.23 you get a Series either way:
>>> df.apply(lambda x:(x+1).values.tolist(), axis=1)
0    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
1    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...
>>> df.apply(lambda x:(x+1).values.tolist()[:9], axis=1)
0    [1, 2, 3, 4, 5, 6, 7, 8, 9]
1    [1, 2, 3, 4, 5, 6, 7, 8, 9]
This problem does not apply to tuples in pandas 0.22.0:
>>> df.apply(lambda x:(1,) * 9 if x.name==0 else (0,) * 10, axis=1)
0    (1, 1, 1, 1, 1, 1, 1, 1, 1)
1    (0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
You can use this fact in your case:
bl = ('is_delete,status,author', 'endtime', 'banner_type',
'id', 'starttime', 'status,endtime', 'weight')
al = ('zone_id,ad_id', 'zone_id,ad_id,id', 'ad_id', 'id', 'zone_id')
>>> df.apply(get_index, axis=1)
0    (is_delete,status,author, endtime, banner_type...
1    (zone_id,ad_id, zone_id,ad_id,id, ad_id, id, z...
dtype: object
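Alternatively, in pandas 0.23+ you can ask apply for a reduced (Series) result explicitly via result_type, which sidesteps the broadcasting attempt (a sketch, assuming that version is available):
>>> df['index_cols'] = df.apply(get_index, axis=1, result_type='reduce')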
