Split Pandas Dataframe Using Criteria - python

I have time series data with a column that counts the seconds that something has been running. All values are divisible by 30, but the count sometimes skips values (it may jump from 30 to 90). The column can also reset while the process is running, setting the count back to 30. How would I break the data up into the individual chunks of runtime?
For example: if the numbers in the column are 30, 60, 120, 150, 30, 60, 90, 30, 60, how would I split the dataframe into the full sequences with no resets?
That is, 30, 60, 120, 150 in one dataframe, 30, 60, 90 in the next, and 30, 60 in the last. At the end, I need to take the max of each dataframe and add them together (that part I could figure out).

Using @RSale's input:
import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
d = dict(tuple(df.groupby(df['data'].eq(30).cumsum())))
d is a dictionary of three dataframes:
d[1]:
data
0 30
1 60
2 120
3 150
d[2]:
data
4 30
5 60
6 90
And d[3]:
data
7 30
8 60
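For the last step in the question (take the max of each chunk and add them up), the chunks do not have to be stored as separate dataframes first. A minimal sketch, assuming the same `data` column and that every run restarts at exactly 30; the names `run_id`, `per_run_max` and `total` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})

# Every 30 marks the start of a new run, so the running count of 30s
# labels the rows of the first run 1, the second run 2, and so on.
run_id = df['data'].eq(30).cumsum()

# Maximum of each run, then the sum of those maxima.
per_run_max = df.groupby(run_id)['data'].max()
total = per_run_max.sum()

print(per_run_max.tolist())  # [150, 90, 60]
print(total)                 # 300
```

If a run can restart at a value other than 30, this grouping key would need to change, since it only looks for the literal value 30.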

Not very elegant, but it gets the job done.
Loop through an array. Add array to a list when a number is smaller than the one before. Remove the saved array from the list and repeat until no change is detected.
numpy & recursive
import numpy as np
a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []
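def split(a, y):
    for count, val in enumerate(a):
        if count == 0:
            pass
        elif val < a[count-1]:
            # val is smaller than its predecessor: a reset happened here
            y.append(a[:count])
            a = a[count:]
            if len(a) > 0 and sorted(a) != list(a):
                # the remainder still contains a reset, so split it again
                split(a, y)
            else:
                y.append(a)
                a = []
            return y
    return y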
y = split(a,y)
print(y)
>>[array([ 30, 60, 120, 150]), array([30, 60, 90]), array([30, 60])]
print([max(lis) for lis in y])
>>[150,90,60]
This does not treat 30 as the starting point; a new chunk starts at whatever value follows a reset, i.e. wherever a number is smaller than the one before it.
Or using diff to find the change points.
numpy & diff version
import numpy as np
a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])
y = []
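def split(a, y):
    # positions i where a[i+1] < a[i], i.e. the last element before each reset
    a_diff = np.where(np.diff(a) < 0)[0]
    # "> 0" rather than "> 1", so an array with exactly one reset is split too
    while len(a_diff) > 0:
        y.append(a[:a_diff[0]+1])
        a = a[a_diff[0]+1:]
        a_diff = np.where(np.diff(a) < 0)[0]
    y.append(a)
    return y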
y = split(a,y)
print(y)
print([max(lis) for lis in y])
>>[array([ 30, 60, 120, 150]), array([30, 60, 90]), array([30, 60])]
>>[150, 90, 60]
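The same change points can also be handed to `np.split`, which removes the loop entirely. A short sketch under the same assumption as above (a reset is any value smaller than its predecessor); `starts`, `chunks` and `maxima` are illustrative names:

```python
import numpy as np

a = np.array([30, 60, 120, 150, 30, 60, 90, 30, 60])

# np.diff(a) < 0 is True at position i when a[i+1] < a[i].
# Adding 1 gives the index at which each new chunk starts.
starts = np.flatnonzero(np.diff(a) < 0) + 1

# np.split cuts the array at the given indices.
chunks = np.split(a, starts)

maxima = [int(c.max()) for c in chunks]
print(maxima)       # [150, 90, 60]
print(sum(maxima))  # 300
```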
pandas & DataFrame version
import pandas as pd
df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})
y = []
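def split(df, y):
    a = df['data']
    # positions (relative to a) of the last row before each reset
    a_diff = [count for count, val in enumerate(a.diff().iloc[1:]) if val < 0]
    # "> 0" rather than "> 1", so data with exactly one reset is split too
    while len(a_diff) > 0:
        y.append(a.iloc[:a_diff[0]+1])
        a = a.iloc[a_diff[0]+1:]
        a_diff = [count for count, val in enumerate(a.diff().iloc[1:]) if val < 0]
    y.append(a)
    return y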
y = split(df,y)
print(y)
print([max(lis) for lis in y])
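In pandas the diff idea can also be combined with `groupby`, so that no manual slicing is needed. A sketch, assuming a reset is any row whose value is lower than the previous row's (so it also covers runs that do not restart at exactly 30); `run_id`, `chunks` and `run_max` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'data': [30, 60, 120, 150, 30, 60, 90, 30, 60]})

# A negative difference means the counter dropped on that row.
# The first row has no predecessor (NaN), which compares as False.
run_id = df['data'].diff().lt(0).cumsum()

# One sub-frame per run, in order; the original row index is kept.
chunks = [group for _, group in df.groupby(run_id)]

run_max = df.groupby(run_id)['data'].max()
print(run_max.tolist())  # [150, 90, 60]
```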

Related

Combination sum higher than and lower than

I'm just starting out in programming. I was given an exercise: generate combinations of 10 numbers from a set of numbers, sum each combination, keep the ones whose sum is less than 800 and higher than 700, and print each result with its combination (print all combinations).
For example, if the set of numbers is 10,20,30,40,50,60,70,80,90,100, I need to generate sets of 10 numbers using the numbers I set, and the sum of each combination needs to be less than 560 and higher than 500.
10+20+30+40+50+60+70+80+90+100 = 550
10+20+30+40+50+40+100+80+90+90 = 550
..
I started some code in Python, but I'm a little stuck: how can I sum the combinations?
import itertools
myList = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
for i in range(len(myList)):
    for combinations in itertools.combinations(myList, i):
        print(combinations)
sum(e for e in combinations if e >= 550)
You're very close, but you need to filter based on whether sum(e) is in the desired range, not whether e itself is.
>>> from itertools import combinations
>>> myList = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
>>> [e for i in range(len(myList) + 1) for e in combinations(myList, i) if 500 < sum(e) < 560]
[(20, 40, 50, 60, 70, 80, 90, 100), (30, 40, 50, 60, 70, 80, 90, 100), (10, 20, 30, 50, 60, 70, 80, 90, 100), (10, 20, 40, 50, 60, 70, 80, 90, 100), (10, 30, 40, 50, 60, 70, 80, 90, 100), (20, 30, 40, 50, 60, 70, 80, 90, 100), (10, 20, 30, 40, 50, 60, 70, 80, 90, 100)]
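Each combination is only available one at a time inside your inner for loop, so that is where it has to be summed: you can do it right after you print it, simply with sum(combinations). Your sum() statement is outside the loop, so it is not evaluated for every combination, and it filters the individual elements (e >= 550) rather than testing the total.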
Something like:
import itertools
myList = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# len(myList) + 1 so that the full 10-element combination is included as well
for i in range(len(myList) + 1):
    for combinations in itertools.combinations(myList, i):
        if 500 < sum(combinations) < 560:
            print(combinations, sum(combinations))
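One of the examples in the question repeats values (40 and 90 each appear twice) and both examples use exactly 10 numbers, which `itertools.combinations` cannot produce because it never reuses an element. If that reading of the exercise is right (this is an assumption, the question does not say so explicitly), `itertools.combinations_with_replacement` may be a closer fit. A sketch using the bounds 500 and 560 from the example; `my_list` and `matches` are illustrative names:

```python
from itertools import combinations_with_replacement

my_list = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Every selection of exactly 10 values from my_list, repeats allowed,
# order ignored. Keep only those whose sum is strictly between the bounds.
matches = [
    combo
    for combo in combinations_with_replacement(my_list, 10)
    if 500 < sum(combo) < 560
]

# Print a few of them together with their sums.
for combo in matches[:3]:
    print(combo, sum(combo))
```

There are many such selections, so printing all of them produces a long list; the slice above is only there to keep the output short.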

how to find max, min and average by using if technique in a 2d array in python

x=[[80,59,34,89],[31,11,47,64],[29,56,13,91],[55,61,48,0],[75,78,81,91]]
I want to find the maximum, minimum and average value of the above 2d array.
You can use the numpy module to find the min and max values easily:
import numpy as np
x = np.array([[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 91], [55, 61, 48, 0], [75, 78, 81, 91]])
minValue = np.min(x)
maxValue = np.max(x)
print(minValue)
print(maxValue)
If you need to find them without built-in methods, you can use an approach as follows:
x = [[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 91], [55, 61, 48, 0], [75, 78, 81, 91]]
minValue = x[0][0]
maxValue = x[0][0]
sumAll = 0
count = 0
for inner in x:
    for each in inner:
        if each > maxValue: maxValue = each
        if each < minValue: minValue = each
        sumAll += each
        count += 1
average = sumAll / count
In this approach, each value is compared against the current min and max. At the same time, the sum and the element count are accumulated to calculate the average.
You can get the maximum, minimum and average of a 2D array using map, like this:
def Average(lst):
    return sum(lst) / len(lst)
x=[[80,59,34,89],[31,11,47,64],[29,56,13,91],[55,61,48,0],[75,78,81,91]]
maximum = max(map(max, x))  # 91
minimum = min(map(min, x))  # 0
# mean of the row means; this equals the overall mean because all rows have the same length
average = Average(list(map(lambda idx: sum(idx)/float(len(idx)), x)))  # 54.65
You can use numpy to flatten the 2d array into a 1d array.
import numpy as np
x=[[80,59,34,89],[31,11,47,64],[29,56,13,91],[55,61,48,0],[75,78,81,91]]
x = np.array(x)
print(max(x.flatten()))
print(min(x.flatten()))
print(sum(x.flatten())/ len(x.flatten()))
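Flattening is optional here: the array methods `min`, `max` and `mean` reduce over the whole array by default, and take an `axis` argument when per-row or per-column values are wanted. A short sketch on the same data; `row_max` and `col_min` are illustrative names:

```python
import numpy as np

x = np.array([[80, 59, 34, 89], [31, 11, 47, 64], [29, 56, 13, 91],
              [55, 61, 48, 0], [75, 78, 81, 91]])

# Without an axis argument the reduction covers every element.
print(x.min(), x.max(), x.mean())

# axis=1 reduces along each row, axis=0 along each column.
row_max = x.max(axis=1)
col_min = x.min(axis=0)
print(row_max.tolist())  # [89, 64, 91, 61, 91]
print(col_min.tolist())  # [29, 11, 13, 0]
```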

Python appending a list to dataframe as element

My dataframe is given below
df =
index data1
0 20
1 30
2 40
I want to add a new column in which each element is a list.
My expected output is
df =
index data1 list_data
0 20 [200,300,90]
1 30 [200,300,90,78,90]
2 40 [1200,2300,390,89,78]
My present code:
df['list_data'] = []
df['list_data'].loc[0] = [200,300,90]
Present output:
raise ValueError('Length of values does not match length of index')
ValueError: Length of values does not match length of index
You can use pd.Series for your problem
import pandas as pd
lis = [[200, 300, 90], [200, 300, 90, 78, 90], [1200, 2300, 390, 89, 78]]
lis = pd.Series(lis)
df['list_data'] = lis
This gives the following output
index data1 list_data
0 0 20 [200, 300, 90]
1 1 30 [200, 300, 90, 78, 90]
2 2 40 [1200, 2300, 390, 89, 78]
Try using loc this way:
df['list_data'] = ''
df.loc[0, 'list_data'] = [200,300,90]
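Assigning a list to a single cell can raise a ValueError with some indexers and pandas versions, because the list may be interpreted as several values to spread over several cells. A sketch of one alternative, assuming the column is created with object dtype and the cells are then set one at a time with `.at`; the dataframe is rebuilt here from the question's `data1` values:

```python
import pandas as pd

df = pd.DataFrame({'data1': [20, 30, 40]})

# Create the column with object dtype so that a cell can hold a Python list.
df['list_data'] = pd.Series([None] * len(df), index=df.index, dtype=object)

# .at addresses exactly one cell: (row label, column label).
df.at[0, 'list_data'] = [200, 300, 90]
df.at[1, 'list_data'] = [200, 300, 90, 78, 90]
df.at[2, 'list_data'] = [1200, 2300, 390, 89, 78]

print(df)
```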

Finding where a value lands between two numbers in Python

I have a problem where I need to determine where a value lands between other values. This is an awfully long question, but it's a convoluted problem (at least to me).
The simplest presentation of the problem can be seen with the following data:
I have a value of 24.0. I need to determine where that value lands within six 'ranges'. The ranges are: 10, 20, 30, 40, 50, 60. I need to calculate where along the ranges, the value lands. I can see that it lands between 20 and 30. A simple if statement can find that for me.
My if statement for checking if the value is between 20 and 30 would be:
if value >=20 and value <=30:
Pretty simple stuff.
What I'm having trouble with is when I try to rank the output.
As an example, let's say that each range value is given an integer representation: 10=1, 20=2, 30=3, 40=4, 50=5, 60=6, 70=7. Additionally, let's say that if the value is less than the midpoint between two values, it is assigned the rank output of the lower value. For example, my value of 24 is between 20 and 30, so it should be ranked as a "2".
This in and of itself is fairly straightforward with this example, but using real world data, I have ranges and values like the following:
Value = -13 with Ranges = 5,35,30,25,-25,-30,-35
Value = 50 with Ranges = 5,70,65,60,40,35,30
Value = 6 with Ranges = 1,40,35,30,5,3,0
Another wrinkle - the orders of the ranges matter. In the above, the first range number equates to a ranking of 1, the second to a ranking of 2, etc as I mentioned a few paragraphs above.
The negative numbers in the range values were causing trouble until I decided to use a percentile ranking, which gets rid of the negative values altogether. To do this, I am using an answer from "Map each list value to its corresponding percentile", like this:
y=[stats.percentileofscore(x, a, 'rank') for a in x]
where x is the ranges AND the value I'm checking. Running the value=6 values above through this results in y being:
from scipy import stats
x = [1, 40, 35, 30, 5, 3, 0, 6]
y=[stats.percentileofscore(x, a, 'rank') for a in x]
Looking at "y", we see it as:
[25.0, 100.0, 87.5, 75.0, 50.0, 37.5, 12.5, 62.5]
What I need to do now is compare that last value (62.5) with the other values to see what the final ranking will be (rankings of 1 through 7) according to the following ranking map:
1=25.0
2=100.0
3=87.5
4=75.0
5=50.0
6=37.5
7=12.5
If the value lies between two of the values, it should be assigned the lower rank. In this example, the 62.5 value would have a final ranking value of 4 because it sits between 75.0 (rank=4) and 50.0 (rank=5).
If I take 'y' and break it out and use those values in multiple if/else statements it works for some but not all (the -13 example does not work correctly).
My question is this:
How can I programmatically analyze any value/range set to find the final ranking without building an enormous if/elif structure? Here are a few sample sets. Rankings are in order of presentation below (first value in Ranges =1 , second = 2, etc etc)
Value = -13 with Ranges = 5, 35, 30, 25, -25, -30, -35 --> Rank = 4
Value = 50 with Ranges = 5, 70, 65, 60, 40, 35, 30 --> Rank = 4
Value = 6 with Ranges = 1, 40, 35, 30, 5, 3,0 --> Rank = 4
Value = 24 with Ranges = 10, 20, 30, 40, 50, 60, 70 --> Rank = 2
Value = 2.26 with Ranges = 0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95 --> Rank = 7
Value = 31 with Ranges = 10, 20, 30, 40, 60, 70, 80 --> Rank = 3
I may be missing something very easy within python to do this...but I've bumped my head on this wall for a few days with no progress.
Any help/pointers are appreciated.
def checker(term):
    return term if term >= 0 else abs(term)+1e10
l1, v1 = [5, 35, 30, 25, -25, -30, -35], -13 # Desired: 4
l2, v2 = [5, 70, 65, 60, 40, 35, 30], 50 # Desired: 4
l3, v3 = [1, 40, 35, 30, 5, 3, 0], 6 # Desired: 4
l4, v4 = [10, 20, 30, 40, 50, 60, 70], 24 # Desired: 2
l5, v5 = [0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95], 2.26 # Desired: 7
l6, v6 = [10, 20, 30, 40, 60, 70, 80], 31 # Desired: 3
Result:
>>> print(*(sorted(l_+[val], key=checker).index(val) for
... l_, val in zip((l1,l2,l3,l4,l5,l6),(v1,v2,v3,v4,v5,v6))), sep='\n')
4
4
4
2
7
3
Taking the first example of -13.
y = [5, 35, 30, 25, -25, -30, -35]
value_to_check = -13
max_rank = len(y) # Default value in case no range found (as per 2.26 value example)
for ii in range(len(y)-1,0,-1):
    if (y[ii] <= value_to_check <= y[ii-1]) or (y[ii] >= value_to_check >= y[ii-1]):
        max_rank = ii
        break
>>> max_rank
4
In function form:
def get_rank(y, value_to_check):
    max_rank = len(y) # Default value in case no range found (as per 2.26 value example)
    # scan neighbouring pairs from the end of the list towards the start
    for ii in range(len(y)-1,0,-1):
        if (y[ii] <= value_to_check <= y[ii-1]) or (y[ii] >= value_to_check >= y[ii-1]):
            max_rank = ii
            break
    return max_rank
When you call:
>>> get_rank(y, value_to_check)
4
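The answer above only demonstrates the -13 case. A sketch of a Python 3 check against all six sample sets from the question follows; `rank_between` and `samples` are hypothetical names, and the function restates the same backwards scan over neighbouring pairs rather than introducing a different method:

```python
def rank_between(ranges, value):
    """Return the 1-based rank of the last neighbouring pair enclosing value."""
    # Walk the pairs (ranges[i-1], ranges[i]) from the end of the list.
    for i in range(len(ranges) - 1, 0, -1):
        low, high = sorted((ranges[i - 1], ranges[i]))
        if low <= value <= high:
            return i
    # No enclosing pair found: fall back to the highest rank.
    return len(ranges)


# (ranges, value, desired rank) taken from the question.
samples = [
    ([5, 35, 30, 25, -25, -30, -35], -13, 4),
    ([5, 70, 65, 60, 40, 35, 30], 50, 4),
    ([1, 40, 35, 30, 5, 3, 0], 6, 4),
    ([10, 20, 30, 40, 50, 60, 70], 24, 2),
    ([0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95], 2.26, 7),
    ([10, 20, 30, 40, 60, 70, 80], 31, 3),
]

for ranges, value, desired in samples:
    print(value, 'Desired:', desired, 'Actual:', rank_between(ranges, value))
```

The fallback to the highest rank only matches the 2.26 sample; the question does not say what should happen for a value below every range, so that case is left as is.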
This correctly finds the answer for all your data:
def get_rank(l,n):
    mindiff = float('inf')
    minindex = -1
    for i in range(len(l) - 1):
        if l[i] <= n <= l[i + 1] or l[i + 1] <= n <= l[i]:
            diff = abs(l[i + 1] - l[i])
            if diff < mindiff:
                mindiff = diff
                minindex = i
    if minindex != -1:
        return minindex + 1
    if n > max(l):
        return len(l)
    return 1
>>> test()
[5, 35, 30, 25, -25, -30, -35] -13 Desired: 4 Actual: 4
[5, 70, 65, 60, 40, 35, 30] 50 Desired: 4 Actual: 4
[1, 40, 35, 30, 5, 3, 0] 6 Desired: 4 Actual: 4
[10, 20, 30, 40, 50, 60, 70] 24 Desired: 2 Actual: 2
[0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95] 2.26 Desired: 7 Actual: 7
[10, 20, 30, 40, 60, 70, 80] 31 Desired: 3 Actual: 3
For completeness, here is my test() function, but you only need get_rank for what you are doing:
>>> def test():
...     lists = [[[5, 35, 30, 25, -25, -30, -35],-13,4],[[5, 70, 65, 60, 40, 35, 30],50,4],[[1, 40, 35, 30, 5, 3,0],6,4],[[10, 20, 30, 40, 50, 60, 70],24,2],[[0.1, 0.55, 0.65, 0.75, 1.75, 1.85, 1.95],2.26,7],[[10, 20, 30, 40, 60, 70, 80],31,3]]
...     for l,n,desired in lists:
...         print(l,n,'Desired:',desired,'Actual:',get_rank(l,n))

take sum of ints preserving specific information

I have a list of ints
list = [25, 50, 70, 32, 10, 20, 50, 40, 30]
And I would like to sum up the ints (from left to right) if their sum is smaller than 99. Let's say I write this output to a list; then this list should look like this:
#75 because 25+50 = 75. 25+50+70 would be > 99
new_list = [75, 70, 62, 90, 30]
#70 because 70+32 > 99
#62 because 32+10+20 = 62. 32+10+20+50 would be > 99
But that is not all. I want to save the ints the sum was made from as well. So what I actually want to have is a data structure that looks like this:
list0 = [ [(25,50),75], [(70),70], [(32, 10, 20),62], [(50, 40),90], [(30),30] ]
How can I do this?
Use a separate list to track your numbers:
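inputlist = [25, 50, 70, 32, 10, 20, 50, 40, 30]
results = []
result = []
for num in inputlist:
    if sum(result) + num < 100:
        result.append(num)
    else:
        # adding num would exceed the limit: store the current group and start a new one
        results.append([tuple(result), sum(result)])
        result = [num]
if result:
    # store the last, still open group
    results.append([tuple(result), sum(result)])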
For your sample input, this produces:
[[(25, 50), 75], [(70,), 70], [(32, 10, 20), 62], [(50, 40), 90], [(30,), 30]]
You can use a generator for this:
l = [25, 50, 70, 32, 10, 20, 50, 40, 30]
def sum_iter(lst):
    s = 0
    t = tuple()
    for i in lst:
        if s + i <= 99:
            s += i
            t += (i,)
        else:
            yield t, s
            s = i
            t = (i,)
    else:
        yield t, s
res = [[t, s] for t, s in sum_iter(l)]
On your data result is:
[[(25, 50), 75], [(70,), 70], [(32, 10, 20), 62], [(50, 40), 90], [(30,), 30]]
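Both answers hard-code the threshold (`< 100` in one, `<= 99` in the other, which is the same thing for ints). A sketch with the threshold as a parameter, assuming "a group may sum to at most `limit`" is the intended rule; `group_by_limit` and `limit` are hypothetical names:

```python
def group_by_limit(numbers, limit=99):
    """Yield (group, total) pairs whose totals do not exceed limit."""
    group = []
    for n in numbers:
        # Close the current group if adding n would push it past the limit.
        if group and sum(group) + n > limit:
            yield tuple(group), sum(group)
            group = []
        group.append(n)
    # Emit the last, still open group.
    if group:
        yield tuple(group), sum(group)


data = [25, 50, 70, 32, 10, 20, 50, 40, 30]
print([[t, s] for t, s in group_by_limit(data)])
# [[(25, 50), 75], [(70,), 70], [(32, 10, 20), 62], [(50, 40), 90], [(30,), 30]]
```

A single number larger than `limit` ends up in a group of its own here; the question does not cover that case, so this is only one possible choice.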
