I have a dataset like
x y
1 0.34
2 0.3432
3 0.32
4 0.35
5 0.323
6 0.3623
7 0.345
8 0.32
9 0.31
10 0.378
11 0.34
12 0.33
13 0.31
14 0.33
15 0.34
For this dataset I want to go through the y values and record the length of every run of consecutive values above a cutoff, keeping a run only if its length is above M.
The cutoff and M will be system arguments.
So if the cutoff is 0.32 and M is 1 it will print out a list like
[2, 4, 3, 2]
Logic: the first two values in the second column are above 0.32 and the run's length is greater than M=1, hence it printed out 2; then 4, 3, 2, and so on.
I need help writing the condition so that if x > cutoff and the length of a broken run is > M, it prints out the lengths of the broken frames (the same output as above). Any help?
The structure should look like the following (I am not sure how to place the argument in place of XXX):
import sys

def get_input(filename):
    with open(filename) as f:
        next(f)  # skip the first line
        input_list = []
        for line in f:
            input_list.append(float(line.split()[1]))
    return input_list

def countwanted(input_list, wantbroken, cutoff, M):
    def whichwanted(x):
        if wantbroken:
            return x > cutoff
        else:
            return x < cutoff
    # XXX I think here I need to add the criteria for M, but I'm not sure how?

filename = sys.argv[1]
wantbroken = (sys.argv[2] == 'b' or sys.argv[2] == 'B')
cutoff = float(sys.argv[3])
M = int(sys.argv[4])
input_list = get_input(filename)
broken, lifebroken = countwanted(input_list, True, cutoff, M)
#closed, lifeclosed = countwanted(input_list, False, cutoff, M)
print(lifebroken)
#print(lifeclosed)
Or maybe there is a simpler way to write it.
You are OK with using numpy, which makes life a lot easier.
First off, let's take a look at the file loader. np.loadtxt can do the same thing in one line.
y = np.loadtxt(filename, skiprows=1, usecols=1)
Now create a mask of the values that make up your above-threshold runs:
b = (y > cutoff) # I think you can figure out how to switch the sense of the test
The rest is easy, and is based on this question:
b = np.r_[0, b, 0]        # pad the ends so runs touching the boundaries are counted
d = np.diff(b)            # find changes in state: +1 where a run starts, -1 where it ends
start, = np.where(d > 0)  # convert switch up to start indices
end, = np.where(d < 0)    # convert switch down to end indices
lengths = end - start     # get the lengths (renamed to avoid shadowing the built-in len)
Now you can apply M to lengths:
result = lengths[lengths > M]  # strictly longer than M, as the question requires
If you want to work with lists, itertools.groupby also offers a good solution:
import itertools as it

grouper = it.groupby(y, key=lambda x: x > cutoff)
result = [n for n in (len(list(group)) for key, group in grouper if key) if n > M]
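Putting the pieces together, here is one way the skeleton above could be completed with the itertools approach. This is a sketch under two assumptions: that broken is meant to be the count of qualifying runs, and that the test is strictly greater than M, as the question's wording suggests.

import sys
import itertools as it

def get_input(filename):
    with open(filename) as f:
        next(f)  # skip the header line
        return [float(line.split()[1]) for line in f]

def countwanted(input_list, wantbroken, cutoff, M):
    def whichwanted(x):
        return x > cutoff if wantbroken else x < cutoff
    # group consecutive values by whether they pass the test, then keep
    # the lengths of the passing runs that are strictly longer than M
    lengths = [len(list(group))
               for key, group in it.groupby(input_list, key=whichwanted)
               if key]
    lengths = [n for n in lengths if n > M]
    return len(lengths), lengths

filename = sys.argv[1]
wantbroken = (sys.argv[2] == 'b' or sys.argv[2] == 'B')
cutoff = float(sys.argv[3])
M = int(sys.argv[4])
input_list = get_input(filename)
broken, lifebroken = countwanted(input_list, wantbroken, cutoff, M)
print(lifebroken)  # [2, 4, 3, 2] for the sample data with cutoff=0.32, M=1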
I am trying to run a test on the first 50 elements of a list, then the next 50 elements, and so on, based on certain conditions. My list contains 630,000 elements fed from df_. This is my attempt:
For a dataframe: df_
distance
0.5
10.4
0.5
14.4
0.15
100.4
0.25
12.4
mylist_data = list()
mylist1_data = list()
for index, row in df_.iterrows():
    mylist = row.distance
    mylist_data.append(mylist)
    mylist1 = row.day_night
    mylist1_data.append(mylist1)
    if len(mylist_data) == 50:
        xmean = np.mean(mylist_data)
        ymean = np.mean(mylist1_data)
        :
        :
        print(index)
Thanks for your immense help!
How about this?
groups = df_.groupby(pd.cut(df_.index, int(630_000 / 50)))
for interval, sub_df in groups:
    xmean = sub_df['distance'].mean()
    ymean = sub_df['day_night'].mean()
    print(f'doing my test for indices {sub_df.index[0]} : {sub_df.index[-1]}')
Here - for each group you have the sub-dataframe! (and you don't have to iterate through rows, which is very inefficient).
pd.cut returns a "categorical array-like object representing the respective bin" for each row of df_. It takes the number of bins as an argument: int(630_000 / 50).
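If the index is the default RangeIndex (0, 1, 2, ...), a hedged alternative is to integer-divide the row position by the chunk size, which avoids computing bin edges; a sketch, assuming numpy is available:

import numpy as np

# consecutive chunks of 50 rows: positions 0-49 -> group 0, 50-99 -> group 1, ...
for chunk_id, sub_df in df_.groupby(np.arange(len(df_)) // 50):
    xmean = sub_df['distance'].mean()
    ymean = sub_df['day_night'].mean()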
I have a pandas dataframe nike that looks like this:
rise1 run1 position
1 0.82 1
3 1.64 2
5 3.09 3
7 5.15 4
8 7.98 5
15 11.12 6
I am trying to make a function that calculates grade (rise/run) and returns it as a pandas series. I want to use X points ahead of the current position minus X points behind the current position to calculate grade (i.e. if X = 2, the grade at position 4 is (15-3)/(11.12-1.64)).
def get_grade(dataf, X=2):  # note: the original default X=n fails because n is undefined
    grade = pd.Series(data=None, index=range(dataf.shape[0]), dtype=float)
    for i in range(X, dataf.shape[0] - X):
        rise = dataf.loc[i + X, 'rise1'] - dataf.loc[i - X, 'rise1']
        run = dataf.loc[i + X, 'run1'] - dataf.loc[i - X, 'run1']
        if np.isclose(rise, 0) or np.isclose(run, 0):
            grade[i] = 0
        elif rise / run > 1:
            grade[i] = 1
        elif rise / run < -1:
            grade[i] = -1
        else:
            grade[i] = rise / run
    return grade

get_grade(nike, X=2)
When I call the function, nothing happens; the code executes but nothing appears. What might I be doing wrong? Apologies if this is unclear; I am very new to coding, with a limited vocabulary in this area.
You have to assign the function's return value to a variable and then print/display that variable, e.g. df = get_grade(nike, X=2) followed by print(df). Or put a print call inside your function:
def test_function():
    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 3, 2, 1]})
    return df

df = test_function()
print(df)
Or
def test_print_function():
    df = pd.DataFrame({"col1": [1, 2, 3, 4], "col2": [4, 3, 2, 1]})
    print(df)

test_print_function()
The way you are working is suboptimal. In general, repeatedly using a for loop + .loc in pandas is a signal that you're not taking advantage of the framework.
My suggestion is to use a rolling window, and apply your calculations:
WINDOW = 2
rolled = df[['rise1', 'run1']].rolling(2 * WINDOW + 1, center=True) \
    .apply(lambda s: s.iloc[0] - s.iloc[-1])
# s.iloc[0] - s.iloc[-1] is "behind minus ahead", but the sign cancels in the ratio below
print(rolled['rise1'] / rolled['run1'])
0 NaN
1 NaN
2 0.977654
3 1.265823
4 NaN
5 NaN
dtype: float64
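If you also want the question's zero guard and the clamping to [-1, 1], a tentative follow-up on the same rolled frame (my addition, not part of the original answer):

import numpy as np

grade = rolled['rise1'] / rolled['run1']
grade = grade.where(~np.isclose(rolled['run1'], 0), 0)  # grade is 0 where run is ~0
grade = grade.clip(-1, 1)                               # cap the slope at +/-1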
Now, as to your specific problem, I cannot reproduce it: copying and pasting your code into a brand-new notebook works fine, though apparently it doesn't yield the results you want (i.e. you don't find (15-3)/(11.12-1.64) as you intended).
I am trying to get the longest possible list of five-position members, each position taking a value 1 to 5, satisfying the condition that any two members of the list cannot share the same value in more than one position (index). I.e., 11111 and 12222 are permitted (only the 1 at index 0 is shared), but 11111 and 11222 are not permitted (same value at indices 0 and 1).
I have tried a brute-force attack, starting with the complete list of permutations, 3125 members, and walking through the list element by element, rejecting the ones that do not match the criteria, in several steps:
step one: testing elements 2 to 3125 against element 1, getting a new, shorter list L';
step two: testing elements 3 to N' against element 2', getting a shorter list yet, L'';
and so on.
I get a 17-member solution, perfectly valid. The problem is that:
I know there are at least two valid 25-member solutions, found by sheer good luck;
the solution from this brute-force method depends strongly on the initial order of the 3125-member list, so by shuffling the L0 list I have found solutions of 12 to 21 members, but I have never hit the 25-member solutions.
Could anyone please shed some light on the problem? Thank you.
This is my approach so far
import csv
import random

maxv = 0
soln = 0
for p in range(0, 1):  # intended to run multiple times
    z = -1
    while True:
        z = z + 1
        file1 = 'Step' + "%02d" % (z + 0) + '.csv'
        file2 = 'Step' + "%02d" % (z + 1) + '.csv'
        nextdata = []
        with open(file1, 'r') as csv_file:
            data = list(csv.reader(csv_file))
        #if file1 == 'Step00.csv':  # related to the p loop
        #    random.shuffle(data)
        i = 0
        while i <= z:
            nextdata.append(data[i])
            i = i + 1
        for j in range(z, len(data)):
            matches = 0  # renamed from sum, which shadowed the built-in
            for k in range(0, 5):
                if data[z][k] == data[j][k]:
                    matches = matches + 1
            if matches < 2:
                nextdata.append(data[j])
        with open(file2, 'w', newline='') as ofile:  # csv in Python 3 wants text mode, not 'wb'
            writer = csv.writer(ofile)
            writer.writerows(nextdata)
        if len(nextdata) < z + 1 + 1:
            if (z + 1) >= maxv:
                maxv = z + 1
                print(maxv)
                with open("Solution" + "%02d" % soln + '.csv', 'w', newline='') as ofile:
                    writer = csv.writer(ofile)
                    writer.writerows(nextdata)
                soln = soln + 1
            break
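For comparison, the same greedy filtering can be run entirely in memory, without the intermediate Step*.csv files. This is a minimal sketch, not the original code: generating the 3125 candidates with itertools.product and using 100 random restarts are my own choices.

import itertools
import random

def greedy_solution(rng):
    candidates = list(itertools.product(range(1, 6), repeat=5))  # all 3125 tuples
    rng.shuffle(candidates)
    chosen = []
    for cand in candidates:
        # keep cand only if it shares at most one position with every kept tuple
        if all(sum(a == b for a, b in zip(cand, kept)) <= 1 for kept in chosen):
            chosen.append(cand)
    return chosen

best = max((greedy_solution(random.Random(seed)) for seed in range(100)), key=len)
print(len(best))  # greedy restarts typically land around 17-21, rarely the optimal 25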
Here is a Picat model for the problem (as I understand it): http://hakank.org/picat/longest_subset_of_five_positions.pi It uses constraint modelling and a SAT solver.
Edit: Here is a MiniZinc model: http://hakank.org/minizinc/longest_subset_of_five_positions.mzn
The model (predicate go/0) checks lengths from 2 to 100. All lengths between 2 and 25 have at least one solution (probably a lot more), so 25 is the longest subsequence. Here is one length-25 solution:
{1,1,1,3,4}
{1,2,5,1,5}
{1,3,4,4,1}
{1,4,2,2,2}
{1,5,3,5,3}
{2,1,3,2,1}
{2,2,4,5,4}
{2,3,2,1,3}
{2,4,1,4,5}
{2,5,5,3,2}
{3,1,2,5,5}
{3,2,3,4,2}
{3,3,5,2,4}
{3,4,4,3,3}
{3,5,1,1,1}
{4,1,4,1,2}
{4,2,1,2,3}
{4,3,3,3,5}
{4,4,5,5,1}
{4,5,2,4,4}
{5,1,5,4,3}
{5,2,2,3,1}
{5,3,1,5,2}
{5,4,3,1,4}
{5,5,4,2,5}
There are a lot of different length-25 solutions (the predicate go2/0 checks that).
Here is the complete model (edited from the file above):
import sat.

main => go.

%
% Test all lengths from 2..100.
% 25 is the longest.
%
go ?=>
  nolog,
  foreach(M in 2..100)
    println(check=M),
    if once(check(M,_X)) then
      println(M=ok)
    else
      println(M=not_ok)
    end,
    nl
  end,
  nl.
go => true.

%
% Check if there is a solution with M numbers
%
check(M, X) =>
  N = 5,
  X = new_array(M,N),
  X :: 1..5,
  foreach(I in 1..M, J in I+1..M)
    % at most 1 same number in the same position
    sum([X[I,K] #= X[J,K] : K in 1..N]) #<= 1,
    % symmetry breaking: sort the sub sequence
    lex_lt(X[I],X[J])
  end,
  solve([ff,split],X),
  foreach(Row in X)
    println(Row)
  end,
  nl.
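To sanity-check any candidate list in Python, here is a small helper of my own (not part of the Picat model):

def is_valid(rows):
    """True if every pair of rows agrees in at most one position."""
    return all(sum(a == b for a, b in zip(r1, r2)) <= 1
               for i, r1 in enumerate(rows) for r2 in rows[i + 1:])

print(is_valid([(1, 1, 1, 3, 4), (1, 2, 5, 1, 5), (1, 3, 4, 4, 1)]))  # True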
My problem is as follows:
I have a file with a list of intervals:
1 5
2 8
9 12
20 30
And a range of
0 200
I would like to compute a kind of intersection that reports the positions [start end] lying between my intervals, inside the given range.
For example:
8 9
12 20
30 200
Besides ideas on how to tackle this, it would also be nice to read some thoughts on optimization, since, as always, the input files are going to be huge.
This solution works as long as the intervals are ordered by their start point, and it does not require creating a list as big as the total range.
code
with open("0.txt") as f:
t=[x.rstrip("\n").split("\t") for x in f.readlines()]
intervals=[(int(x[0]),int(x[1])) for x in t]
def find_ints(intervals, mn, mx):
next_start = mn
for x in intervals:
if next_start < x[0]:
yield next_start,x[0]
next_start = x[1]
elif next_start < x[1]:
next_start = x[1]
if next_start < mx:
yield next_start, mx
print list(find_ints(intervals, 0, 200))
output:
(in the case of the example you gave)
[(0, 1), (8, 9), (12, 20), (30, 200)]
Rough algorithm:
create an array of booleans, all set to false: seen = [False]*200
iterate over the input file; for each line start end, set seen[start] .. seen[end] to True
once done, you can trivially walk the array to find the unused intervals.
In terms of optimisations, if the list of input ranges is sorted on start number, then you can track the highest seen number and use that to filter ranges as they are processed -
e.g. something like
for (start, end) in input:
    if end <= lowest_unseen:
        continue  # skip ranges already fully covered
    if start < lowest_unseen:
        start = lowest_unseen
    ...
which (ignoring the cost of the original sort) should make the whole thing O(n) - you go through the array once to tag seen/unseen and once to output unseens.
Seems I'm feeling nice. Here is the (unoptimised) code, assuming your input file is called input
seen = [False] * 200

with open('input', 'r') as file:
    rows = file.readlines()

for row in rows:
    (start, end) = row.split(' ')
    print("%s %s" % (start, end))
    for x in range(int(start) - 1, int(end) - 1):
        seen[x] = True

print(seen[0:10])

in_unseen_block = False
start = 1
for x in range(1, 200):
    val = seen[x - 1]
    if val and not in_unseen_block:
        continue
    if not val and in_unseen_block:
        continue
    # Must be at a change point.
    if val:
        # we have reached the end of the block
        print("%s %s" % (start, x))
        in_unseen_block = False
    else:
        # start of new block
        start = x
        in_unseen_block = True

# Handle end block
if in_unseen_block:
    print("%s %s" % (start, 200))
I'm leaving the optimizations as an exercise for the reader.
If you make a note every time one of your input intervals opens or closes, you can put the keys of opens and closes together, sort them into an ordered set, and then treat each adjacent pair of numbers as an interval in its own right; all of the logic can then focus on these intervals as discrete chunks.
myRange = range(201)
intervals = [(1, 5), (2, 8), (9, 12), (20, 30)]
opens = {}
closes = {}

def open(index):
    if index not in opens:
        opens[index] = 0
    opens[index] += 1

def close(index):
    if index not in closes:
        closes[index] = 0
    closes[index] += 1

for start, end in intervals:
    if end > start:  # Making sure to exclude empty intervals, which can be problematic later
        open(start)
        close(end)

# Sort all the interval-endpoints that we really need to look at
oset = {0: None, 200: None}
for k in opens.keys():
    oset[k] = None
for k in closes.keys():
    oset[k] = None
relevant_indices = sorted(oset.keys())

# Find the clear ranges
state = 0
results = []
for i in range(len(relevant_indices) - 1):
    start = relevant_indices[i]
    end = relevant_indices[i + 1]
    start_state = state
    if start in opens:
        start_state += opens[start]
    if start in closes:
        start_state -= closes[start]
    end_state = start_state
    if end in opens:
        end_state += opens[end]
    if end in closes:
        end_state -= closes[end]
    state = end_state
    if start_state == 0:
        result_start = start
        result_end = end
        results.append((result_start, result_end))

for start, end in results:
    print(str(start) + " " + str(end))
This outputs:
0 1
8 9
12 20
30 200
The intervals don't need to be sorted.
This question seems to be a duplicate of Merging intervals in Python.
If I understood the problem correctly, you have a list of intervals (1 5; 2 8; 9 12; 20 30) and a range (0 200), and you want to get the positions outside your intervals but inside the given range. Right?
There's a Python library that can help you on that: python-intervals (also available from PyPI using pip). Disclaimer: I'm the maintainer of that library.
Assuming you import this library as follows:
import intervals as I
It's quite easy to get your answer. Basically, you first want to create a disjunction of intervals based on the ones you provide:
inters = I.closed(1, 5) | I.closed(2, 8) | I.closed(9, 12) | I.closed(20, 30)
Then you compute the complement of these intervals, to get everything that is "outside":
compl = ~inters
Then you intersect the complement with [0, 200], as you want to restrict the points to that interval:
print(compl & I.closed(0, 200))
This results in:
[0,1) | (8,9) | (12,20) | (30,200]
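If you need plain (start, end) tuples downstream, iterating over the atomic pieces of the result should work; a sketch, with the caveat that the attribute names (.lower, .upper) assume python-intervals 1.x:

result = compl & I.closed(0, 200)
pairs = [(atomic.lower, atomic.upper) for atomic in result]
print(pairs)  # expected: [(0, 1), (8, 9), (12, 20), (30, 200)]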
DataFrame.resample() works only with time-series data, and I cannot find a way of getting every nth row from non-time-series data. What is the best method?
I'd use iloc, which takes a row/column slice, both based on integer position and following normal python syntax. If you want every 5th row:
df.iloc[::5, :]
Though #chrisb's accepted answer does answer the question, I would like to add to it the following.
A simple method I use to get the nth data or drop the nth row is the following:
df1 = df[df.index % 3 != 0]  # Excludes every 3rd row starting from 0
df2 = df[df.index % 3 == 0]  # Selects every 3rd row starting from 0
This arithmetic-based sampling enables even more complex row selections, as shown below.
This assumes, of course, that you have an index column of ordered, consecutive integers starting at 0.
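For instance (my illustration, not part of the original answer), the same modulo trick can keep two out of every five rows:

df3 = df[df.index % 5 < 2]  # keeps rows 0, 1, 5, 6, 10, 11, ...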
There is an even simpler solution to the accepted answer that involves directly invoking df.__getitem__.
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For example, to get every 2 rows, you can do
df[::2]
a b c
0 x x x
2 x x x
4 x x x
There's also GroupBy.first/GroupBy.head; you group on the index:
df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')
df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)
a b c
0 x x x
1 x x x
2 x x x
The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do
# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()
a b c
0 x x x
1 x x x
2 x x x
Adding reset_index() to metastableB's answer means you only need to assume that the rows are ordered and consecutive.
df1 = df[df.reset_index().index % 3 != 0] # Excludes every 3rd row starting from 0
df2 = df[df.reset_index().index % 3 == 0] # Selects every 3rd row starting from 0
df.reset_index().index will create an index that starts at 0 and increments by 1, allowing you to use the modulo easily.
I had a similar requirement, but I wanted the n'th item in a particular group. This is how I solved it.
groups = data.groupby(['group_key'])
selection = groups['index_col'].apply(lambda x: x % 3 == 0)
subset = data[selection]
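A hedged alternative for "every nth row within each group", assuming positional counting is wanted rather than a test on index_col values, is GroupBy.cumcount:

# number the rows 0, 1, 2, ... within each group, then keep every 3rd one
subset = data[data.groupby('group_key').cumcount() % 3 == 0]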
A solution I came up with when using the index was not viable (possibly the multi-gig .csv was too large, or I missed some technique that would allow me to reindex without crashing).
Walk through one row at a time and add the nth row to a new dataframe.
import pandas as pd
from csv import DictReader

def make_downsampled_df(filename, interval):
    with open(filename, 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        column_names = csv_dict_reader.fieldnames
        # collect every nth row in a list; building the frame once at the end is
        # far cheaper than per-row DataFrame.append, which was removed in pandas 2.0
        rows = [row for index, row in enumerate(csv_dict_reader)
                if index % interval == 0]
    return pd.DataFrame(rows, columns=column_names)
df.drop(labels=df[df.index % 3 != 0].index, axis=0)  # keep every 3rd row (index divisible by 3)