pandas DataFrame: match a dataframe and a dict by intervals - python

I have a question about DataFrames. I have a DataFrame in which each row is a 0.1 s interval together with the features that belong to that interval. I want to add a column that contains the prediction from a previous algorithm (is this interval silence or sound?). I have a dictionary containing all predicted silence intervals per audio recording. My DataFrame looks like this; here the df is filtered on audio_id==0 and ordered by interval_x.
audio_id interval_x interval_y predicted_value
0 0 0.579367 0.679367 0
1 0 0.679367 0.779367 0
2 0 0.779367 0.879367 0
3 0 0.879367 0.979367 0
4 0 0.979367 1.079367 0
... ... ... ... ...
518 0 50.805830 50.905830 0
519 0 50.905830 51.005830 0
520 0 51.005830 51.105830 0
521 0 51.105830 51.205830 0
522 0 51.205830 51.212938 0
My dictionary containing the silence intervals looks like this:
{'0': [[1.4501383219954658, 2.058138321995466],
[3.298138321995466, 4.762138321995465],
[7.682138321995467, 8.266138321995465],
[11.266138321995466, 11.938138321995465],
[13.242138321995466, 13.706138321995466],
[16.73013832199547, 17.82613832199547],
[24.53813832199547, 25.130138321995467],
[26.394138321995467, 27.042138321995466],
[28.21013832199547, 28.722138321995466]],
'1': [[0.0, 0.31253968253968023],
[4.296539682539681, 5.040539682539681],
[8.64053968253968, 9.296539682539679],
etc for each audiofile.
What is an efficient way to do this?

Here's a solution, using merge_asof to match intervals to their closest silent times. d is the dictionary from the question, and intervals is the data frame.
import pandas as pd

# Flatten the dict into one row per silent interval
silent_times = pd.DataFrame.from_records(
    [(file, from_time, to_time) for file, values in d.items()
     for from_time, to_time in values],
    columns=["audio_id", "from_time", "to_time"])
silent_times.audio_id = silent_times.audio_id.astype(int)
parts = []
for inx in intervals.audio_id.unique():
    intervals_slice = intervals[intervals.audio_id == inx]
    silent_times_slice = silent_times[silent_times.audio_id == inx]
    # merge_asof expects both frames to be sorted by their merge keys
    t = pd.merge_asof(intervals_slice, silent_times_slice,
                      left_on="interval_x", right_on="from_time")
    # an interval is silent only if it lies entirely inside the matched silence
    t.loc[(t.interval_x >= t.from_time) & (t.interval_y <= t.to_time), "predicted_value"] = 1
    parts.append(t)
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
res = pd.concat(parts)
The result for the dataframe from the question and for these silent intervals:
d = {'0': [
[1.4501383219954658, 2.058138321995466],
[3.298138321995466, 4.762138321995465],
[7.682138321995467, 8.266138321995465],
[50.01, 51.01]
],
'1': [
[0.0, 0.31253968253968023],
[4.296539682539681, 5.040539682539681],
[8.64053968253968, 9.296539682539679]]}
is as follows:
print(res[["audio_id_x", "interval_x", "interval_y", "predicted_value"]])
audio_id_x interval_x interval_y predicted_value
0 0 0.579367 0.679367 0
1 0 0.679367 0.779367 0
2 0 0.779367 0.879367 0
3 0 0.879367 0.979367 0
4 0 0.979367 1.079367 0
5 0 50.805830 50.905830 1
6 0 50.905830 51.005830 1
7 0 51.005830 51.105830 0
8 0 51.105830 51.205830 0
9 0 51.205830 51.212938 0
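As a possible variant of the approach above (a sketch, not part of the original answer), merge_asof also accepts a by argument, which matches within each audio_id and so avoids the explicit loop. The function name flag_silence, the argument name silences, and the toy data are made up for illustration; the sketch assumes audio_id is an integer column in the intervals frame.

```python
import pandas as pd

def flag_silence(intervals: pd.DataFrame, silences: dict) -> pd.DataFrame:
    """Return intervals with predicted_value = 1 where the interval lies inside a silence."""
    # One row per silent span; dict keys are strings, so cast them to int
    silent_times = pd.DataFrame(
        [(int(audio_id), start, end)
         for audio_id, spans in silences.items()
         for start, end in spans],
        columns=["audio_id", "from_time", "to_time"],
    ).sort_values("from_time")
    # merge_asof needs both sides sorted by the "on" keys; "by" handles the grouping
    ordered = intervals.sort_values("interval_x")
    merged = pd.merge_asof(ordered, silent_times,
                           left_on="interval_x", right_on="from_time",
                           by="audio_id")
    # Rows without an earlier silence get NaN bounds, and the comparisons are then False
    inside = (merged["interval_x"] >= merged["from_time"]) & \
             (merged["interval_y"] <= merged["to_time"])
    merged["predicted_value"] = inside.astype(int)
    return (merged.drop(columns=["from_time", "to_time"])
                  .sort_values(["audio_id", "interval_x"])
                  .reset_index(drop=True))

# Toy data (hypothetical), two recordings
intervals = pd.DataFrame({
    "audio_id":   [0, 0, 0, 1, 1],
    "interval_x": [1.4, 1.5, 2.0, 0.0, 0.3],
    "interval_y": [1.5, 1.6, 2.1, 0.1, 0.4],
})
silences = {"0": [[1.45, 2.05]], "1": [[0.0, 0.31]]}
result = flag_silence(intervals, silences)
print(result)
```

Like the loop version, this only checks the most recent silence that starts at or before interval_x, so an interval that merely overlaps a silence is not flagged.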

Related

How to split a list using two nested conditions

Basically I have list of 0s and 1s. Each value in the list represents a data sample from an hour. Thus, if there are 24 0s and 1s in the list that means there are 24 hours, or a single day. I want to capture the first time the data cycles from 0s to 1s back to 0s in a span of 24 hours (or vice versa from 1s to 0s back to 1s).
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1]
expected output:
# D
signal = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0]
output = [0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0]
# ^ cycle.1:day.1 |dayline ^cycle.1:day.2
In the output list, a 1 marks the position in the signal list where a cycle is completed; every other position is 0. There should be only one cycle per day, which is why each day contains a single 1.
I don't know how to split this list accordingly, so can someone please help?
It seems to me that what you are trying to do is first split your data into blocks of 24, and then find either the first rising edge or the first falling edge, depending on the first hour in that block.
Below I have tried to distill my understanding of what you are trying to accomplish into the following function. It takes in a numpy.array containing zeros and ones, as in your example. It checks what the first hour in the day is, and decides what type of edge to look for.
It detects an edge by using np.diff. This gives us an array containing -1's, 0's, and 1's. We then look for the first index of either a -1 (falling edge) or a 1 (rising edge). The function returns that index. If no such edge is found, it returns nothing when the day is all zeros or all ones, and the index of the last element otherwise.
For more info, see the docs for the numpy features used here: np.diff, np.ndarray.nonzero, np.array_split
import numpy as np
def get_cycle_index(day):
    '''
    returns the first index of a cycle defined by nipun vats
    if no cycle is found returns nothing
    '''
    first_hour = day[0]
    if first_hour == 0:
        edgetype = -1
    else:
        edgetype = 1
    edges = np.diff(np.r_[day, day[-1]])
    if (edges == edgetype).any():
        return (edges == edgetype).nonzero()[0][0]
    elif (day.sum() == day.size) or day.sum() == 0:
        return
    else:
        return day.size - 1
Below is an example of how you might use this function in your case.
import numpy as np
_data = [1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
#_data = np.random.randint(0,2,280, dtype='int')
data = np.array(_data, 'int')
# split the data into a set of 'day' blocks
blocks = np.array_split(data, np.arange(24, data.size, 24))
_output = []
for i, day in enumerate(blocks):
    print(f'day {i}')
    buffer = np.zeros(day.size, dtype='int')
    print('\tsignal:', *day, sep=' ')
    cycle_index = get_cycle_index(day)
    # compare against None explicitly so that a valid index of 0 would not be skipped
    if cycle_index is not None:
        buffer[cycle_index] = 1
    print('\toutput:', *buffer, sep=' ')
    _output.append(buffer)
output = np.concatenate(_output)
print('\nfinal output:\n', *output, sep=' ')
This yields the following output:
day 0
signal: 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0
output: 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 1
signal: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
output: 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
day 2
signal: 0 0 0 0 0 0
output: 0 0 0 0 0 0
final output:
0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
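To see the edge detection in isolation, here is a minimal sketch of what np.diff produces on a short 0/1 array. The array values and the names rising and falling are made up for illustration.

```python
import numpy as np

day = np.array([1, 1, 0, 0, 1, 1, 0])

# diff[i] = day[i + 1] - day[i]: -1 marks a falling edge, +1 a rising edge, 0 no change
edges = np.diff(day)

# Positions of the last sample before each change
rising = np.flatnonzero(edges == 1)
falling = np.flatnonzero(edges == -1)

print(edges)    # [ 0 -1  0  1  0 -1]
print(rising)   # [3]
print(falling)  # [1 5]
```

Note that the reported index is that of the sample just before the value changes, which is why the answer's output marks index 9 for a signal that switches back to 1 at index 10.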

Looping over a pandas column and creating a new column if it meets conditions

I have a pandas dataframe and I want to loop over the last column "n" times based on a condition.
import random
import pandas as pd
p = 0.5
df = pd.DataFrame()
start = []
for i in range(5):
    if random.random() < p:
        start.append("0")
    else:
        start.append("1")
df['start'] = start
print(df['start'])
Essentially, I want to loop over the final column "n" times and if the value is 0, change it to 1 with probability p so the results become the new final column. (I am simulating on-off every time unit with probability p).
e.g. after one iteration, the dataframe would look something like:
0 0
0 1
1 1
0 0
0 1
after two:
0 0 1
0 1 1
1 1 1
0 0 0
0 1 1
What is the best way to do this?
Sorry if I am asking this wrong, I have been trying to google for a solution for hours and coming up empty.
Like this. Each iteration appends a new column named 1, 2, ...
# continue from question code ...
# colname is 1, 2, ...
for col in range(1, 5):
    tmp = []
    for i in range(5):
        # check the current final column
        if df.iloc[i, col - 1] == "0":
            # a "0" switches to "1" with probability p, as described in the question
            if random.random() < p:
                tmp.append("1")
            else:
                tmp.append("0")
        else:  # == "1": stays on
            tmp.append("1")
    # append new col
    df[str(col)] = tmp
print(df)
# initial
s
0 0
1 1
2 0
3 0
4 0
# result
s 1 2 3 4
0 0 0 1 1 1
1 0 0 0 0 1
2 0 0 1 1 1
3 1 1 1 1 1
4 0 0 0 0 0
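For larger frames, the same simulation can be sketched without the inner Python loop by drawing one random number per row and step. This is an illustrative variant, not the answer's code: the function name simulate_switch_on and the seed argument are made up, and it stores integers 0/1 instead of the strings used in the question.

```python
import numpy as np
import pandas as pd

def simulate_switch_on(n_rows, n_steps, p, seed=None):
    """Each step, every row that is 0 switches to 1 with probability p; 1 stays 1."""
    rng = np.random.default_rng(seed)
    # Initial column mirrors the question's setup: 0 with probability p, else 1
    state = (rng.random(n_rows) >= p).astype(int)
    columns = {"start": state}
    for step in range(1, n_steps + 1):
        switch = rng.random(n_rows) < p
        # np.where returns a new array, so earlier columns are not modified
        state = np.where((state == 0) & switch, 1, state)
        columns[str(step)] = state
    return pd.DataFrame(columns)

sim = simulate_switch_on(n_rows=5, n_steps=4, p=0.5, seed=0)
print(sim)
```

Building all columns in a dict and constructing the DataFrame once avoids repeatedly inserting columns into an existing frame.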

Mean (likelihood) encoding

I have a dataset called "data" with categorical values I'd like to encode with mean (likelihood/target) encoding rather than label encoding.
My dataset looks like:
data.head()
ID X0 X1 X10 X100 X101 X102 X103 X104 X105 ... X90 X91 X92 X93 X94 X95 X96 X97 X98 X99
0 0 k v 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 6 k t 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
2 7 az w 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
3 9 az t 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
4 13 az v 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 1 0 1 0
5 rows × 377 columns
I've tried:
# Select categorical features
cat_features = data.dtypes == 'object'
# Define function
def mean_encoding(df, cols, target):
    for c in cols:
        means = df.groupby(c)[target].mean()
        df[c].map(means)
    return df
# Encode
data = mean_encoding(data, cat_features, target)
which raises:
KeyError: False
I've also tried:
# Define function
def mean_encoding(df, target):
    for c in df.columns:
        if df[c].dtype == 'object':
            means = df.groupby(c)[target].mean()
            df[c].map(means)
    return df
which raises:
KeyError: 'Columns not found: 87.68, 87.43, 94.38, 72.11, 73.7, 74.0,
74.28, 76.26,...
I've concatenated the train and test datasets into one called "data", and saved the train target before dropping it from the dataset:
target = train.y
split = len(train)
data = pd.concat(objs=[train, test])
data = data.drop('y', axis=1)
data.shape
Help would be appreciated. Thanks.
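I think you are not selecting the categorical columns correctly. cat_features = data.dtypes == 'object' does not give you column names; it gives a boolean Series indicating whether each column is of object dtype. Iterating over it yields True/False values, which results in KeyError: False.
You can select the categorical columns with
mycolumns = data.columns
numerical_columns = data._get_numeric_data().columns
cat_features = list(set(mycolumns) - set(numerical_columns))
or
cat_features = data.select_dtypes(['object']).columns
The rest of your code stays nearly the same. Two changes are needed: the result of map has to be assigned back to the column, and because y was dropped from data, the means have to be computed by grouping the target Series rather than by looking up a column in df:
# Define function
def mean_encoding(df, cols, target):
    # target is a Series holding y for the first len(target) rows of df (the training rows)
    n_train = len(target)
    for c in cols:
        means = target.groupby(df[c].iloc[:n_train].to_numpy()).mean()
        df[c] = df[c].map(means)
    return df
# Encode
data = mean_encoding(data, cat_features, target)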
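A self-contained sketch of mean encoding on toy data may make the mechanics clearer. The frames train and test, the function name mean_encode, and the choice to fill categories unseen in training with the global target mean are assumptions made for illustration; they are not part of the question or the answer.

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature and a numeric target
train = pd.DataFrame({"X0": ["k", "k", "az", "az"],
                      "y":  [10.0, 20.0, 30.0, 50.0]})
test = pd.DataFrame({"X0": ["az", "k", "new"]})

def mean_encode(train_col, target, other_col):
    """Encode both columns with per-category target means learned on the training rows only."""
    means = target.groupby(train_col).mean()
    fallback = target.mean()  # assumption: unseen categories get the global mean
    return train_col.map(means), other_col.map(means).fillna(fallback)

train_enc, test_enc = mean_encode(train["X0"], train["y"], test["X0"])
print(train_enc.tolist())  # [15.0, 15.0, 40.0, 40.0]
print(test_enc.tolist())   # [40.0, 15.0, 27.5]
```

Learning the means from the training rows only keeps test rows from influencing the encoding, which matters here because the question concatenates train and test before encoding.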

Return rows based off the most recent increase in value from other columns python

It is hard to state this question succinctly in a title.
I have a pandas df that contains integer columns and a key column. Whenever a value is present in the key column, I want to return the most recent row in which any of the integer columns increased.
For the df below, the key column is [Area]. When X is in [Area], I want to find the most recent increase in the integers in columns ['ST_A','PG_A','ST_B','PG_B'].
import pandas as pd
d = ({
    'ST_A' : [0,0,0,0,0,1,1,1,1],
    'PG_A' : [0,0,0,1,1,1,2,2,2],
    'ST_B' : [0,1,1,1,1,1,1,1,1],
    'PG_B' : [0,0,0,0,0,0,0,1,1],
    'Area' : ['','','X','','X','','','','X'],
    })
df = pd.DataFrame(data = d)
Output:
ST_A PG_A ST_B PG_B Area
0 0 0 0 0
1 0 0 1 0
2 0 0 1 0 X
3 0 1 1 0
4 0 1 1 0 X
5 1 1 1 0
6 1 2 1 0
7 1 2 1 1
8 1 2 1 1 X
I tried to use df = df.loc[(df['Area'] == 'X')] but this returns the rows where X is situated. I need something that uses X to return the most recent row where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B'].
I have also tried:
cols = ['ST_A','PG_A','ST_B','PG_B']
df[cols] = df[cols].diff()
df = df.fillna(0.)
df = df.loc[(df[cols] == 1).any(axis=1)]
This returns all rows where there was an increase in Columns ['ST_A','PG_A','ST_B','PG_B']. Not the most recent increase before X in ['Area'].
Intended Output:
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
Does this question make sense or do I need to simplify it?
I believe you can use NumPy here via np.searchsorted:
import numpy as np
increases = np.where(df.iloc[:, :-1].diff().gt(0).max(1))[0]
marks = np.where(df['Area'].eq('X'))[0]
idx = increases[np.searchsorted(increases, marks) - 1]
res = df.iloc[idx]
print(res)
ST_A PG_A ST_B PG_B Area
1 0 0 1 0
3 0 1 1 0
7 1 2 1 1
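The searchsorted step can be wrapped in a small function, sketched below under a few assumptions: the name last_increase_before and its parameters are made up, and marks that have no earlier increase are dropped (in the snippet above, such a mark would wrap around to the last increase because of the index -1).

```python
import numpy as np
import pandas as pd

def last_increase_before(df, value_cols, key_col, key="X"):
    """For each row where key_col == key, return the latest earlier row with an increase."""
    # Row positions where any value column went up compared to the previous row
    increases = np.flatnonzero(df[value_cols].diff().gt(0).any(axis=1).to_numpy())
    marks = np.flatnonzero(df[key_col].eq(key).to_numpy())
    # searchsorted gives the insertion point; one step left is the last increase before the mark
    pos = np.searchsorted(increases, marks) - 1
    pos = pos[pos >= 0]  # assumption: skip marks with no earlier increase
    return df.iloc[increases[pos]]

df = pd.DataFrame({
    "ST_A": [0, 0, 0, 0, 0, 1, 1, 1, 1],
    "PG_A": [0, 0, 0, 1, 1, 1, 2, 2, 2],
    "ST_B": [0, 1, 1, 1, 1, 1, 1, 1, 1],
    "PG_B": [0, 0, 0, 0, 0, 0, 0, 1, 1],
    "Area": ["", "", "X", "", "X", "", "", "", "X"],
})
res = last_increase_before(df, ["ST_A", "PG_A", "ST_B", "PG_B"], "Area")
print(res)
```

On the question's data the increases occur at rows 1, 3, 5, 6 and 7, and the marks at rows 2, 4 and 8 map to rows 1, 3 and 7.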
This is not efficient, but it works; it is a bigger chunk of code and somewhat slow:
indexes = np.where(df['Area'] == 'X')[0].tolist()
indexes2 = list(map((1).__add__, np.where(df[df.columns[:-1]].sum(axis=1) < df[df.columns[:-1]].shift(-1).sum(axis=1).sort_index())[0].tolist()))
l = []
for i in indexes:
    if min(indexes2, key=lambda x: abs(x - i)) in l:
        l.append(min(indexes2, key=lambda x: abs(x - i)) - 2)
    else:
        l.append(min(indexes2, key=lambda x: abs(x - i)))
print(df.iloc[l].sort_index())
Output:
Area PG_A PG_B ST_A ST_B
1 0 0 0 1
3 1 0 0 1
7 2 1 1 1

Python: read a '.dat' file with a different number of columns on each line

I need to extract some data from a .dat file, which I usually do with
import numpy as np
file = np.loadtxt('blablabla.dat')
Here my data are not separated by a specific delimiter; instead each column has a predefined width (number of characters), and some lines have no value for some columns.
Here is a sample to make this clear:
3 0 36 0 0 0 0 0 0 0 99.
-2 0 0 0 0 0 0 0 0 0 99.
2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
-5 0 0 0 0 0 0 0 0 0 99.
99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5
My little code above raises this error:
# Convert each value according to its column and store
ValueError: Wrong number of columns at line 3
Does someone have an idea about how to collect this kind of data?
numpy.genfromtxt seems to be what you want; you can specify the field width of each column, and it treats missing data as NaN.
For this case:
import numpy as np
data = np.genfromtxt('blablabla.dat',delimiter=[2,3,4,3,3,2,3,4,5,3,8,5])
If you want to keep information in the string part of the file, you could read twice and specify the usecols parameter:
import numpy as np
number_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(0,1,2,3,4,5,6,7,8,9,11))
string_data = np.genfromtxt('blablabla.dat', delimiter=[2,3,4,3,3,2,3,4,5,3,8,5],
                            usecols=(10), dtype=str)
What you essentially need is the list of character positions that are empty in every row, since those "columns" serve as delimiters.
This will get you started:
In [108]: table = ''' 3 0 36 0 0 0 0 0 0 0 99.
.....: -2 0 0 0 0 0 0 0 0 0 99.
.....: 2 0 0 0 0 0 0 0 0 0 .LA.0?. 3.
.....: 5 0 0 0 0 2 4 0 0 0 .SAS7?. 99.
.....: -5 0 0 0 0 0 0 0 0 0 99.
.....: 99 0 0 0 0 0 0 0 0 0 .S..3*. 3.5'''.split('\n')
In [109]: from functools import reduce  # reduce is no longer a builtin in Python 3
In [110]: max_row_len = max(len(row) for row in table)
In [117]: spaces = reduce(lambda res, row: res.intersection(idx for idx, c in enumerate(row) if c == ' '), table, set(range(max_row_len)))
This code starts from the set of all character positions of the longest row, and reduce keeps only the positions that hold a space in every row.
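Building on that idea, here is a sketch of how the all-space positions could be turned into field boundaries and used to slice each row. The function name split_fixed_width and the toy table are made up for illustration. Unlike the snippet above, rows are first padded to a common width, so a value missing at the end of a short row counts as blank.

```python
from functools import reduce
from itertools import groupby

def split_fixed_width(rows):
    """Split rows on the character positions that are blank in every row."""
    width = max(len(row) for row in rows)
    padded = [row.ljust(width) for row in rows]
    # Positions that hold a space in every row act as column separators
    blank = reduce(
        lambda acc, row: acc & {i for i, ch in enumerate(row) if ch == " "},
        padded,
        set(range(width)),
    )
    # Group consecutive non-blank positions into (start, stop) field spans
    spans = []
    for is_blank, group in groupby(range(width), key=lambda i: i in blank):
        if not is_blank:
            positions = list(group)
            spans.append((positions[0], positions[-1] + 1))
    return [[row[a:b].strip() for a, b in spans] for row in padded]

# Toy fixed-width table (hypothetical); the string column is empty in the first row
table = [
    " 3  0 36      99.",
    "-2  0  0 .LA.  3.",
    "99  0  0 .S.. 3.5",
]
fields = split_fixed_width(table)
for row in fields:
    print(row)
```

Missing values come back as empty strings, so they can be mapped to NaN or None before converting the numeric fields. A column that is blank in every row would disappear, so this only works when each column has a value in at least one row.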
