Find overlapping or nested rows in Python

I have a simple DataFrame:
start end
0 30 40
1 45 55
2 50 60
3 53 64
4 65 70
5 75 80
6 77 85
7 80 83
8 90 120
9 95 100
10 105 110
Notice that some rows lie entirely inside another row, and others partly overlap. I want to merge these intervals so that the DataFrame becomes:
start end
0 30 40
1 45 64
2 65 70
3 75 85
4 90 120

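Use a custom generator function with the DataFrame constructor:
import pandas as pd

# Adapted from https://stackoverflow.com/a/5679899/2901002
def merge(times):
    # Normalise each pair to (start, end) and sort by start,
    # so that the first interval seen is really the earliest one.
    times = sorted(sorted(t) for t in times)
    saved = list(times[0])
    for st, en in times:
        if st <= saved[1]:
            # Overlaps (or touches) the current interval: extend it.
            saved[1] = max(saved[1], en)
        else:
            # Gap found: emit the finished interval and start a new one.
            yield tuple(saved)
            saved[0] = st
            saved[1] = en
    yield tuple(saved)

df1 = pd.DataFrame(merge(df[['start','end']].to_numpy()), columns=['start','end'])
print(df1)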
start end
0 30 40
1 45 64
2 65 70
3 75 85
4 90 120
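
As an alternative to the generator above, here is a minimal vectorized sketch. It is not part of the original answer; it assumes the same `start`/`end` columns as the question, and the names `prev_max_end` and `group_id` are only illustrative. A row opens a new group when its start is greater than every end seen so far.

```python
import pandas as pd

df = pd.DataFrame({
    "start": [30, 45, 50, 53, 65, 75, 77, 80, 90, 95, 105],
    "end":   [40, 55, 60, 64, 70, 80, 85, 83, 120, 100, 110],
})

# Sort by start so that overlapping rows are adjacent.
df_sorted = df.sort_values("start").reset_index(drop=True)

# Largest end among all earlier rows (NaN for the first row).
prev_max_end = df_sorted["end"].cummax().shift()

# A row starts a new group when it begins after everything before it has ended.
group_id = (df_sorted["start"] > prev_max_end).cumsum().rename("group")

merged = (
    df_sorted.groupby(group_id)
    .agg(start=("start", "min"), end=("end", "max"))
    .reset_index(drop=True)
)
print(merged)
```

Like the generator version, this treats intervals that merely touch (start equal to a previous end) as overlapping.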

Related

How do I make each group within a dataframe the same size?

I have the following dataframe:
Patient  HR  02  PaO2  Hgb
1        62  94  73    31
1        64  93  73    34
1        62  92  73    31
2        64  90  84    42
3        62  95  75    30
3        70  97  77    29
Each row for a patient indicates an hourly observation. So, patient 1 has three observations, patient 2 has one observation and patient 3 has two observations. I'm trying to find a way to pad each patient group so that they are the same size (the same number of observations) as I'm trying to use this data for an LSTM. I'm not sure what the best way to do this would be though. I was wondering if anyone had any ideas?
The output would hopefully look like this:
Patient  HR  02  PaO2  Hgb
1        62  94  73    31
1        64  93  73    34
1        62  92  73    31
2        64  90  84    42
2        0   0   0     0
2        0   0   0     0
3        62  95  75    30
3        70  97  77    29
3        0   0   0     0
Reindex your original data to a pandas.MultiIndex on the Patient and Cumulative Count:
df = df.set_index(["Patient", df.groupby("Patient").cumcount()])
index = pd.MultiIndex.from_product(df.index.levels, names=df.index.names)
output = df.reindex(index, fill_value=0).reset_index(level=1, drop=True).reset_index()
>>> output
Patient HR 02 PaO2 Hgb
0 1 62 94 73 31
1 1 64 93 73 34
2 1 62 92 73 31
3 2 64 90 84 42
4 2 0 0 0 0
5 2 0 0 0 0
6 3 62 95 75 30
7 3 70 97 77 29
8 3 0 0 0 0
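
Since the goal is an LSTM input, the padded frame can then be reshaped into a 3-D array. The following is only a sketch under assumptions: the sample data is retyped from the question, the names `padded` and `sequences` are illustrative, and the (patients, time steps, features) layout is assumed to be what the model expects.

```python
import pandas as pd

df = pd.DataFrame({
    "Patient": [1, 1, 1, 2, 3, 3],
    "HR":      [62, 64, 62, 64, 62, 70],
    "02":      [94, 93, 92, 90, 95, 97],
    "PaO2":    [73, 73, 73, 84, 75, 77],
    "Hgb":     [31, 34, 31, 42, 30, 29],
})

# Pad every patient to the same number of rows, as in the answer above.
padded = df.set_index(["Patient", df.groupby("Patient").cumcount()])
full_index = pd.MultiIndex.from_product(padded.index.levels, names=padded.index.names)
padded = padded.reindex(full_index, fill_value=0)

# Rows are ordered patient by patient, so a plain reshape groups them correctly.
n_patients = padded.index.get_level_values(0).nunique()
n_steps = len(padded) // n_patients
sequences = padded.to_numpy().reshape(n_patients, n_steps, padded.shape[1])
print(sequences.shape)
```

Note that zero padding is indistinguishable from a real reading of 0, so a model may need a mask to ignore the padded steps.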

Python: tidy data, how can I transform this table as I want?

I need to transform my table from a wide format to a long table. The table has measurements over time, let's say mass over time: m0, m1, m2 etc. so it looks like this:
ID | Age | m0 | m1 | m2 | m3
1 67 72 69 66 67
2 70 80 81 79 77
3 72 69 69 70 70
How I want it is:
ID | Age | time | m
1 67 0 72
1 67 1 69
1 67 2 66
1 67 3 67
2 70 0 80
2 70 1 81
2 70 2 79
2 70 3 77
...
I appreciate any help! Thank you in advance.
Cheers.
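You can use the pandas melt method here:
result = df.melt(id_vars=['ID', 'Age'], value_vars=['m0', 'm1', 'm2', 'm3'])
result.columns = ['ID', 'Age', 'time', 'm']
# Strip the 'm' prefix and convert the time step to an integer.
result['time'] = result['time'].str.replace('m', '', regex=False).astype(int)
# Sort by ID and time; sorting by Age alone does not guarantee the row order within each ID.
result = result.sort_values(['ID', 'time']).reset_index(drop=True)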
print(result)
ID Age time m
0 1 67 0 72
1 1 67 1 69
2 1 67 2 66
3 1 67 3 67
4 2 70 0 80
5 2 70 1 81
6 2 70 2 79
7 2 70 3 77
8 3 72 0 69
9 3 72 1 69
10 3 72 2 70
11 3 72 3 70
Alternative method using pd.wide_to_long:
# j="time" names the new column directly, so no renaming is needed afterwards.
result = pd.wide_to_long(df, stubnames=["m"], i=["ID", "Age"], j="time").reset_index()
result = result.sort_values(['ID', 'time']).reset_index(drop=True)
print(result)
ID Age time m
0 1 67 0 72
1 1 67 1 69
2 1 67 2 66
3 1 67 3 67
4 2 70 0 80
5 2 70 1 81
6 2 70 2 79
7 2 70 3 77
8 3 72 0 69
9 3 72 1 69
10 3 72 2 70
11 3 72 3 70
If there are more variables like m, list them all in stubnames.
pd.wide_to_long documentation: https://pandas.pydata.org/docs/reference/api/pandas.wide_to_long.html
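
To illustrate the point about several stubnames, here is a small sketch. The second variable `h` and its values are hypothetical and not from the question; only the `m` columns mirror the original table.

```python
import pandas as pd

wide = pd.DataFrame({
    "ID":  [1, 2],
    "Age": [67, 70],
    "m0":  [72, 80],
    "m1":  [69, 81],
    "h0":  [170, 165],  # hypothetical second measurement
    "h1":  [171, 165],
})

# Each stub ("m", "h") becomes one value column; the numeric suffix becomes "time".
long_df = (
    pd.wide_to_long(wide, stubnames=["m", "h"], i=["ID", "Age"], j="time")
    .reset_index()
    .sort_values(["ID", "time"])
    .reset_index(drop=True)
)
print(long_df)
```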

Last cell in a column dataframe from excel using pandas

I just had a quick question. How would one go about getting the last cell value of every column of an Excel spreadsheet when working with it as a DataFrame in pandas? I'm having quite some difficulty with this. I know the index can be found with len(), but I can't quite wrap my head around it. Thank you, any help would be greatly appreciated.
If by "last cell" you mean the bottom-right cell of the DataFrame, you can use .iloc:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1,101).reshape((10,-1)))
df
Output:
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9 10
1 11 12 13 14 15 16 17 18 19 20
2 21 22 23 24 25 26 27 28 29 30
3 31 32 33 34 35 36 37 38 39 40
4 41 42 43 44 45 46 47 48 49 50
5 51 52 53 54 55 56 57 58 59 60
6 61 62 63 64 65 66 67 68 69 70
7 71 72 73 74 75 76 77 78 79 80
8 81 82 83 84 85 86 87 88 89 90
9 91 92 93 94 95 96 97 98 99 100
Use .iloc with -1 index selection on both rows and columns.
df.iloc[-1,-1]
Output:
100
DataFrame.head(n) gets the first n rows of the dataframe. DataFrame.tail(n) gets the last n rows of the dataframe.
If your dataframe is named df, you could use df.tail(1) to get the last row of the dataframe, which holds the last cell of every column. The returned value is also a dataframe.
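
Spreadsheet columns often have different lengths, in which case the last row contains NaN for the shorter columns. The sketch below is an assumption about what the asker may need rather than part of the answers above; the frame and the names `last_row` and `last_valid` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Column "a" is shorter than column "b", as can happen after reading a spreadsheet.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [4.0, 5.0, 6.0]})

# Last row as a Series: one entry per column, NaN where the column ended early.
last_row = df.iloc[-1]

# Forward-fill first, so each column's last entry is its last non-missing value.
last_valid = df.ffill().iloc[-1]
print(last_valid)
```

A column that is entirely empty still yields NaN with this approach.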

How to select all rows which contain values greater than a threshold?

The request is simple: I want to select all rows which contain a value greater than a threshold.
If I do it like this:
df[(df > threshold)]
I get these rows, but values below that threshold are simply NaN. How do I avoid selecting these rows?
There is absolutely no need for the double transposition - you can simply call any along the column axis (supplying axis=1 or axis='columns') on your Boolean matrix.
df[(df > threshold).any(axis=1)]
Example
>>> df = pd.DataFrame(np.random.randint(0, 100, 50).reshape(5, 10))
>>> df
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
2 37 2 55 68 16 14 93 14 71 84
3 67 45 79 75 27 94 46 43 7 40
4 61 65 73 60 67 83 32 77 33 96
>>> df[(df > 95).any(axis=1)]
0 1 2 3 4 5 6 7 8 9
0 45 53 89 63 62 96 29 56 42 6
1 0 74 41 97 45 46 38 39 0 49
4 61 65 73 60 67 83 32 77 33 96
Transposing as your self-answer does is just an unnecessary performance hit.
df = pd.DataFrame(np.random.randint(0, 100, 10**8).reshape(10**4, 10**4))
# standard way
%timeit df[(df > 95).any(axis=1)]
1 loop, best of 3: 8.48 s per loop
# transposing
%timeit df[df.T[(df.T > 95)].any()]
1 loop, best of 3: 13 s per loop
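
The difference between the two indexing forms can be seen on a tiny frame. This is an illustrative sketch with made-up data: indexing with a Boolean DataFrame keeps the shape and blanks out non-matching cells, while indexing with a Boolean Series selects whole rows.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 2], "b": [3, 1, 9]})
threshold = 4

# Boolean DataFrame mask: same shape as df, NaN wherever the condition is False.
masked = df[df > threshold]

# Boolean Series mask: True for rows where at least one value exceeds the threshold.
row_mask = (df > threshold).any(axis=1)
selected = df[row_mask]

print(masked)
print(selected)
```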
This is actually very simple:
df[df.T[(df.T > 0.33)].any()]

Find column with the highest value (pandas)

I have a Pandas dataframe with several columns that range from 0 to 100. I would like to add a column on to the dataframe that contains the name of the column from among these that has the greatest value for each row. So:
one two three four COLUMN_I_WANT_TO_CREATE
5 40 12 19 two
90 15 58 23 one
74 95 34 12 two
44 81 22 97 four
10 59 59 44 [either two or three, selected randomly]
etc.
Bonus points if the solution can resolve ties randomly.
You can use idxmax with parameter axis=1:
print(df)
one two three four
0 5 40 12 19
1 90 15 58 23
2 74 95 34 12
3 44 81 22 97
df['COLUMN_I_WANT_TO_CREATE'] = df.idxmax(axis=1)
print(df)
one two three four COLUMN_I_WANT_TO_CREATE
0 5 40 12 19 two
1 90 15 58 23 one
2 74 95 34 12 two
3 44 81 22 97 four
Resolving ties between maximum values randomly is more complicated.
First select all the maximum values in a row with x[(x == x.max())]. You then need their index labels, to which sample can be applied. Because sample works only on a Series, the index is converted to a Series with to_series. Finally, select the first value of the shuffled Series with iloc:
print(df)
one two three four
0 5 40 12 19
1 90 15 58 23
2 74 95 34 12
3 44 81 22 97
4 10 59 59 44
5 59 59 59 59
6 10 59 59 59
7 59 59 59 59
# first run
df['COL'] = df.apply(lambda x: x[(x == x.max())].index.to_series().sample(frac=1).iloc[0], axis=1)
print(df)
one two three four COL
0 5 40 12 19 two
1 90 15 58 23 one
2 74 95 34 12 two
3 44 81 22 97 four
4 10 59 59 44 three
5 59 59 59 59 one
6 10 59 59 59 two
7 59 59 59 59 three
# one of the next runs
df['COL'] = df.apply(lambda x: x[(x == x.max())].index.to_series().sample(frac=1).iloc[0], axis=1)
print(df)
one two three four COL
0 5 40 12 19 two
1 90 15 58 23 one
2 74 95 34 12 two
3 44 81 22 97 four
4 10 59 59 44 two
5 59 59 59 59 one
6 10 59 59 59 three
7 59 59 59 59 four
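
For larger frames, the row-wise apply can be avoided. The following is a sketch of one possible vectorized alternative, not the answer's method: it gives every cell a random key, hides the keys of non-maximal cells, and takes the position of the largest remaining key per row. The names `rng`, `is_max`, and `random_keys` are illustrative, and the seed is fixed only to make runs repeatable.

```python
import numpy as np
import pandas as pd

value_cols = ["one", "two", "three", "four"]
df = pd.DataFrame(
    [[5, 40, 12, 19],
     [90, 15, 58, 23],
     [74, 95, 34, 12],
     [44, 81, 22, 97],
     [10, 59, 59, 44],
     [59, 59, 59, 59]],
    columns=value_cols,
)

rng = np.random.default_rng(0)

# True where a cell equals its row maximum (several per row when there are ties).
is_max = df[value_cols].eq(df[value_cols].max(axis=1), axis=0).to_numpy()

# Random keys lie in [0, 1); non-maximal cells get -1 so they can never win.
random_keys = rng.random(is_max.shape)
choice = np.where(is_max, random_keys, -1.0).argmax(axis=1)

df["COL"] = np.array(value_cols)[choice]
print(df)
```

Because the keys are drawn independently and uniformly, each tied column in a row is equally likely to be chosen.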
