Pandas filtering on max range - python

I'm working on a text mining problem and am using Pandas for the text processing. From the following example I need to pick only those rows which have the max span (end - start) within the same category (cat).
Given this dataframe:
name start end cat
0 coumadin 0 8 DRUG
1 albuterol 18 27 DRUG
2 albuterol sulfate 18 35 DRUG
3 sulfate 28 35 DRUG
4 2.5 36 39 STRENGTH
5 2.5 mg 36 42 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
7 0.083 50 55 STRENGTH
8 0.083 % 50 57 STRENGTH
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
10 solution 59 67 FORM
11 solution for nebulization 59 84 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
14 neb 98 101 ROUTE
15 neb inhalation 98 112 ROUTE
16 inhalation 102 112 ROUTE
17 q4h 113 116 FREQUENCY
18 every 118 123 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
I need to get the following:
name start end cat
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
11 solution for nebulization 59 84 FORM
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
23 dyspnea 147 154 REASON
What I tried is to group by the category and then compute the max difference (end - start). However, I got stuck on how to find the max span for the same entity within the category. I guess it should not be very tricky.
COMMENT
Thank you all for the suggestions, but I need ALL possible entities within each category. For example, in DRUG there are two relevant drugs, coumadin and albuterol sulfate, and some fragments of them (albuterol and sulfate). I need to remove only the fragments (albuterol and sulfate) while keeping coumadin and albuterol sulfate. The same logic applies to the other categories.
For example, rows 4-8 are all bits of a complete row 9, thus I need to keep only row 9. Rows 1 and 3 are parts of the row 2, thus I need to keep row 2 (in addition to row 0). Etc.
Obviously, all constituents ('bits') lie within the max range, but the problem is to find the max (or unifying) range of the same entity and its constituents.
COMMENT 2
A possible solution could be: find all overlapping intervals within the same category cat and pick the largest. I'm trying to implement this, but no luck so far.
Possible Solution
I sorted the rows by the start column ascending and the end column descending:
df.sort_values(by=[1,2], ascending=[True, False])
0 1 2 3
0 coumadin 0 8 DRUG
2 albuterol sulfate 18 35 DRUG
1 albuterol 18 27 DRUG
3 sulfate 28 35 DRUG
9 2.5 mg /3 ml (0.083 %) 36 58 STRENGTH
6 2.5 mg /3 ml 36 48 STRENGTH
5 2.5 mg 36 42 STRENGTH
4 2.5 36 39 STRENGTH
8 0.083 % 50 57 STRENGTH
7 0.083 50 55 STRENGTH
11 solution for nebulization 59 84 FORM
10 solution 59 67 FORM
12 nebulization 72 84 ROUTE
13 one (1) 90 97 FREQUENCY
15 neb inhalation 98 112 ROUTE
14 neb 98 101 ROUTE
16 inhalation 102 112 ROUTE
21 q4h (every 4 hours) as needed 113 142 FREQUENCY
20 q4h (every 4 hours) 113 132 FREQUENCY
17 q4h 113 116 FREQUENCY
19 every 4 hours 118 131 FREQUENCY
18 every 118 123 FREQUENCY
22 as needed 133 142 FREQUENCY
23 dyspnea 147 154 REASON
This puts the relevant row first within each group of overlapping rows; however, I still need to filter out the irrelevant rows....
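To finish that idea, here is a minimal sketch (mine, not from the thread), assuming the columns carry the names start/end from the first table rather than the integer labels above. Because the rows are sorted by start ascending and end descending, an enclosing interval always precedes the intervals it contains, so a row is a fragment exactly when its end does not reach past the furthest end already kept:

import pandas as pd

def keep_maximal_spans(df):
    # Enclosing intervals sort before their fragments.
    ordered = df.sort_values(['start', 'end'], ascending=[True, False])
    keep, max_end = [], -1
    for idx, row in ordered.iterrows():
        # Every earlier row starts at or before this one, so this row
        # is a fragment iff its end stays within the furthest end so far.
        if row['end'] > max_end:
            keep.append(idx)
            max_end = row['end']
    return df.loc[keep]

result = keep_maximal_spans(df)

Note that this deliberately ignores cat: the expected output drops nebulization (ROUTE) because it sits inside solution for nebulization (FORM), so containment appears to be checked across categories. On the example data this keeps exactly rows 0, 2, 9, 11, 13, 15, 21 and 23.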

I have tried this on a sample of your df:
Create a sample df:
import pandas as pd
Name = ['coumadin','albuterol','albuterol sulfate','sulfate']
Cat = ['D', 'D', 'D', 'D']
Start = [0, 18, 18, 28]
End = [8, 27, 33,35]
ID = [1,2,3,4]
df = pd.DataFrame(data = list(zip(ID,Name,Start,End,Cat)), \
columns=['ID','Name','Start','End','Cat'])
Make a function which will help in identifying the names which are similar
def matcher(x):
    res = df.loc[df['Name'].str.contains(x, regex=False, case=False), 'ID']
    return ','.join(res.astype(str))
Apply this function to the values of the Name column:
df['Matches'] = df['Name'].apply(matcher)  # Matches holds the IDs of all rows whose Name contains this row's Name; maximal names match only themselves
ID Name Start End Cat Matches
0 1 coumadin 0 8 D 1
1 2 albuterol 18 27 D 2,3
2 3 albuterol sulfate 18 33 D 3
3 4 sulfate 28 35 D 3,4
Count the number of IDs in Matches:
df['Count'] = df.Matches.apply(lambda x: len(x.split(',')))
Keep the rows where Count is 1, as these are the rows whose Name is not contained in any other row's Name:
df = df[df.Count == 1]
ID Name Start End Cat Matches Count
0 1 coumadin 0 8 D 1 1
2 3 albuterol sulfate 18 33 D 3 1
You can then remove unnecessary columns :)
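One caveat worth adding (my note, not part of the answer above): matcher scans the whole frame, so a short name could also match a row from a different category. A hedged variant that restricts the search to rows sharing the same Cat:

def matcher(x, cat):
    # Only consider rows from the same category.
    same_cat = df[df['Cat'] == cat]
    res = same_cat.loc[same_cat['Name'].str.contains(x, regex=False, case=False), 'ID']
    return ','.join(res.astype(str))

df['Matches'] = df.apply(lambda row: matcher(row['Name'], row['Cat']), axis=1)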

Related

Binning continuous variables in columns based on time of first column

I am trying to bin values in columns as the average of every 5 rows, i.e. rows 1-5, 6-10 and so on, using Python.
My df dataset looks like this:
Unnamed: 0 C00_zscore C01_zscore C02_zscore
1 3 5 6
2 4 36 65
3 56 98 62
4 89 52 35
5 32 74 30
6 55 22 35
7 68 23 31
8 97 65 15
9 2 68 1
10 13 54 300
11
Ideally, the result should look like this:
bin C00_binned C01_binned C02_binned
1 36.8 53 39.6
2 47 46.4 76.4
Take the 1-based row number, subtract one, and floor-divide by the bin size; this is the row's bin (add 1 if you want the bins numbered from 1). In your case, the row number starts at 1 and you want bins of size 5.
bin_num = (row_num - 1) // bin_size
Now that each row has a bin_num, group by it then do the calculations.
df['bin_num'] = (df['Unnamed: 0'] - 1) // 5
df.groupby('bin_num').mean()
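Putting it together, a minimal runnable sketch using the numbers from the question (the bin labels 1 and 2 reproduce the expected output):

import pandas as pd

df = pd.DataFrame({
    'Unnamed: 0': range(1, 11),
    'C00_zscore': [3, 4, 56, 89, 32, 55, 68, 97, 2, 13],
    'C01_zscore': [5, 36, 98, 52, 74, 22, 23, 65, 68, 54],
    'C02_zscore': [6, 65, 62, 35, 30, 35, 31, 15, 1, 300],
})

bin_size = 5
# 1-based row numbers 1-5 -> bin 1, rows 6-10 -> bin 2, ...
df['bin'] = (df['Unnamed: 0'] - 1) // bin_size + 1
print(df.groupby('bin').mean().drop(columns='Unnamed: 0'))
# bin 1: 36.8  53.0  39.6
# bin 2: 47.0  46.4  76.4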

Sum row values of all columns where column names meet string match condition

I have the following dataset:
ID Length Width Range_CAP Capacity_CAP
0 1 33 25 16 50
1 2 34 22 11 66
2 3 22 12 15 42
3 4 46 45 66 54
4 5 16 6 23 75
5 6 21 42 433 50
I basically want to sum the row values of only those columns whose names match a string (in this case, all columns with _CAP at the end of their name), and store the sum in a new column.
So that I end up with a dataframe that looks something like this:
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
I first tried to use the solution recommended in this question here:
Summing columns in Dataframe that have matching column headers
However, that solution doesn't work for me, since it sums columns that have the exact same name (so a simple groupby can accomplish the result), whereas I am trying to sum columns matching a specific string only.
Code to recreate above sample dataset:
import pandas as pd

data1 = [['1', 33, 25, 16, 50], ['2', 34, 22, 11, 66],
         ['3', 22, 12, 15, 42], ['4', 46, 45, 66, 54],
         ['5', 16, 6, 23, 75], ['6', 21, 42, 433, 50]]
df = pd.DataFrame(data1, columns=['ID', 'Length', 'Width', 'Range_CAP', 'Capacity_CAP'])
Let us use filter:
df['CAP_SUM'] = df.filter(like='CAP').sum(1)
Out[86]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
If other columns have CAP elsewhere in the name, anchor the regex to the _CAP suffix:
df.filter(regex='_CAP$').sum(1)
Out[92]:
0 66
1 77
2 57
3 120
4 98
5 483
dtype: int64
One approach is:
df['CAP_SUM'] = df.loc[:, df.columns.str.endswith('_CAP')].sum(1)
print(df)
Output
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
The expression:
df.columns.str.endswith('_CAP')
creates a boolean mask whose values are True if and only if the column name ends with _CAP. As an alternative, use filter with the following regex:
df['CAP_SUM'] = df.filter(regex='_CAP$').sum(1)
print(df)
Output (of filter)
ID Length Width Range_CAP Capacity_CAP CAP_SUM
0 1 33 25 16 50 66
1 2 34 22 11 66 77
2 3 22 12 15 42 57
3 4 46 45 66 54 120
4 5 16 6 23 75 98
5 6 21 42 433 50 483
You may try this:
columnstxt = df.columns
df['sum'] = 0
for i in columnstxt:
    if i.find('_CAP') != -1:
        df['sum'] = df['sum'] + df[i]

Dask DataFrame calculate mean within multi-column groupings

I have a data frame as shown in the image below. What I want to do is take the mean along the column 'trial': for every subject, condition and sample (e.g. when all three of these columns equal one), take the average of the data along the trial column (100 rows).
What I have done in pandas is the following:
sub_erp_pd = pd.DataFrame()
for j in range(1, 4):
    sub_c = subp[subp['condition'] == j]
    for i in range(1, 3073):
        sub_erp_pd = sub_erp_pd.append(sub_c[sub_c['sample'] == i].mean(), ignore_index=True)
But this takes a lot of time, so I am thinking of using Dask instead of Pandas. In Dask, however, I am having an issue creating an empty data frame, the way one creates an empty data frame in pandas and appends data to it.
image of data frame
As suggested by @edesz, I made changes to my approach.
EDIT
%%time
sub_erp = pd.DataFrame()
for subno in progressbar.progressbar(range(1, 82)):
    try:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno), header=None)
    except:
        sub = pd.read_csv('../input/data/{}.csv'.format(subno), header=None)
    sub_erp = sub_erp.append(sub.groupby(['condition', 'sample'], as_index=False).mean())
Reading a file with pandas takes 13.6 seconds, while reading the same file with Dask takes 61.3 ms. But in Dask I am having trouble with the appending step.
NOTE - The original question was titled Create an empty dask dataframe and append values to it.
If I understand correctly, you need to:
- use groupby (read more here) in order to group by the subject, condition and sample columns; this will gather all rows which have the same value in each of these three columns into a single group
- take the average using .mean(); this will give you the mean within each group
Generate some dummy data
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 3)),
                  columns=['trial', 'condition', 'sample'])
df.insert(0,'subject',[1]*10 + [2]*30 + [5]*60)
print(df.head())
subject trial condition sample
0 1 71 96 34
1 1 2 89 66
2 1 90 90 81
3 1 93 43 18
4 1 29 82 32
Pandas approach
Aggregate and take mean
df_grouped = df.groupby(['subject','condition','sample'], as_index=False)['trial'].mean()
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
Dask approach
Step 1. Imports
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
Step 2. Convert Pandas DataFrame to Dask DataFrame, using .from_pandas
ddf = dd.from_pandas(df, npartitions=2)
Step 3. Aggregate and take mean
ddf_grouped = (
    ddf.groupby(['subject', 'condition', 'sample'])['trial']
    .mean()
    .reset_index(drop=False)
)
with ProgressBar():
    df_grouped = ddf_grouped.compute()
[ ] | 0% Completed | 0.0s
[########################################] | 100% Completed | 0.1s
print(df_grouped.head(15))
subject condition sample trial
0 1 18 24 89
1 1 43 18 93
2 1 67 47 81
3 1 82 32 29
4 1 85 28 97
5 1 88 13 48
6 1 89 59 23
7 1 89 66 2
8 1 90 81 90
9 1 96 34 71
10 2 0 81 19
11 2 2 39 58
12 2 2 59 94
13 2 5 42 13
14 2 9 42 4
IMPORTANT NOTE: This answer does not create an empty Dask DataFrame and append values to it in order to calculate a mean within groupings of subject, condition and sample. Instead, it provides an alternate approach (using groupby) to obtain the same end result.
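As a further aside (my addition, not from the original answer): rather than reading the 81 files one by one and appending, Dask can read them all in a single call with a glob pattern, which sidesteps the append problem entirely. A sketch, assuming the per-subject CSVs share the same column layout (pass names=[...] to read_csv if the files really have no header row):

import dask.dataframe as dd

# One partitioned frame over all per-subject files.
ddf = dd.read_csv('../input/data/*.csv')
result = ddf.groupby(['condition', 'sample'])['trial'].mean().compute()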

Find average of every column in a dataframe, grouped by column, excluding one value

I have a Dataframe like the one presented below:
CPU Memory Disk Label
0 21 28 29 0
1 46 53 55 1
2 48 45 49 2
3 48 52 50 3
4 51 54 55 4
5 45 50 56 5
6 50 83 44 -1
What I want is to group by Label and find the average for each label. So far I have
dataset.groupby('Label')['CPU', 'Memory', 'Disk'].mean()
which works just fine and gets the results as follows:
Label CPU Memory Disk
-1 46.441176 53.882353 54.176471
0 48.500000 58.500000 60.750000
1 45.000000 51.000000 60.000000
2 54.000000 49.000000 56.000000
3 55.000000 71.500000 67.500000
4 53.000000 70.000000 71.000000
5 21.333333 30.000000 30.666667
The only thing I haven't yet found is how to exclude everything that is labeled as -1. Is there a way to do that?
You could filter the dataframe before grouping:
# Exclude rows with Label == -1
dataset = dataset.loc[dataset['Label'] != -1]
# Group by on the filtered result; select the columns as a list
dataset.groupby('Label')[['CPU', 'Memory', 'Disk']].mean()
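Equivalently (a small variation, not in the original answer), the filter and the aggregation can be chained in one expression; selecting the columns with a list also avoids the deprecated tuple-style selection:

dataset[dataset['Label'] != -1].groupby('Label')[['CPU', 'Memory', 'Disk']].mean()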

Split output of loop by columns used as input

Hi, I'm relatively new to Python and am currently working on measuring the width of features in an image. The resolution of my image is 1 m, so measuring the width should be easy. I've managed to select certain columns or rows of the image and extract the necessary data using loops. My code is below:
import numpy as np

subset = imarray[:, ::500]  # or e.g. imarray[:, (imarray.shape[1] / 2):(imarray.shape[1] / 2) + 1]
subset[(subset > 0) & (subset <= 17)] = 1
subset[subset > 17] = 0

width = []
count = 0
for i in np.arange(subset.shape[1]):
    column = subset[:, i]
    for value in column:
        if value == 1:
            count += 1
            width.append(count)
            width_arr = np.array(width).astype('uint8')
        else:
            count = 0

final = np.split(width_arr, np.argwhere(width_arr == 1).flatten())
final2 = [x for x in final if x.size > 0]
width2 = []
for array in final2:
    width2.append(max(array))
width2 = np.array(width2).astype('uint8')
print(width2)
I can't figure out how to split the output up so it shows the results for each column or row individually. Instead, all I've been able to do is append the data to an empty list; here's the output for that:
[ 70 35 4 2 5 36 4 5 2 51 97 4 228 3 21 47 7 21
23 58 126 4 111 2 2 5 3 2 18 15 6 19 3 3 12 15
6 8 2 4 6 88 122 24 14 49 73 57 74 6 179 8 3 2
6 3 184 9 3 19 24 3 2 2 3 255 30 8 191 33 127 5
3 27 112 2 24 2 5 2 10 30 10 6 37 2 38 6 12 17
44 67 23 5 101 10 9 4 6 4 255 136 5 255 255 255 255 26
255 235 148 4 255 199 3 2 114 87 255 109 69 12 41 20 30 57
72 89 32]
So these are the widths of the features in all of the columns, appended together. How do I use my loop, or another method, to split these up into individual numpy arrays representing each column I've sliced out of the original?
It seems like I'm almost there, but I can't figure out that last step, and it's driving me nuts.
Thanks in advance for your help!
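This question has no answer in the thread, but one possible sketch (mine, untested against the original image data) is to collect the run lengths per column instead of in one flat list:

import numpy as np

widths_per_column = []
for i in range(subset.shape[1]):
    column = subset[:, i]
    runs = []    # lengths of consecutive runs of 1s in this column
    count = 0
    for value in column:
        if value == 1:
            count += 1
        else:
            if count:
                runs.append(count)
            count = 0
    if count:    # a run touching the bottom edge of the image
        runs.append(count)
    widths_per_column.append(np.array(runs, dtype='uint8'))

for i, runs in enumerate(widths_per_column):
    print(i, runs)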
