Today I'm requesting help with a Python script that I'm writing; I'm using the CSV module to parse a large document with about 1,100 rows, and from each row it's pulling a Case_ID, a unique number that no other row has. For example:
['10215', '10216', '10277', '10278', '10279', '10280', '10281', '10282', '10292', '10293',
'10295', '10296', '10297', '10298', '10299', '10300', '10301', '10302', '10303', '10304',
'10305', '10306', '10307', '10308', '10309', '10310', '10311', '10312', '10313', '10314',
'10315', '10316', '10317', '10318', '10319', '10320', '10321', '10322', '10323', '10324',
'10325', '10326', '10344', '10399', '10400', '10401', '10402', '10403', '10404', '10405',
'10406', '10415', '10416', '10417', '10418', '10430', '10448', '10492', '10493', '10494',
'10495', '10574', '10575', '10576', '10577', '10578', '10579', '10580', '10581', '10582',
'10583', '10584', '10585', '10586', '10587', '10588', '10589', '10590', '10591', '10592',
'10593', '10594', '10595', '10596', '10597', '10598', '10599', '10600', '10601', '10602',
'10603', '10604', '10605', '10606', '10607', '10608', '10609', '10610', '10611', '10612',
'10613', '10614', '10615', '10616', '10617', '10618', '10619', '10620', '10621', '10622',
'10623', '10624', '10625', '10626', '10627', '10628', '10629', '10630', '10631', '10632',
'10633', '10634', '10635', '10636', '10637', '10638', '10639', '10640', '10641', '10642',
'10643', '10644', '10645', '10646', '10647', '10648', '10649', '10650', '10651', '10652',
'10653', '10654', '10655', '10656', '10657', '10658', '10659', '10707', '10708', '10709',
'10710', '10792', '10793', '10794', '10795', '10908', '10936', '10937', '10938', '10939',
'11108', '11109', '11110', '11111', '11112', '11113', '11114', '11115', '11116', '11117',
'11118', '11119', '11120', '11121', '11122', '11123', '11124', '11125', '11126', '11127',
'11128', '11129', '11130', '11131', '11132', '11133', '11134', '11135', '11136', '11137',
'11138', '11139', '11140', '11141', '11142', '11143', '11144', '11145', '11146', '11147',
'11148', '11149', '11150', '11151', '11152', '11153', '11154', '11155', '11194', '11195',
'11196', '11197', '11198', '11199', '11200', '11201', '11202', '11203', '11204', '11205',
'11206', '11207', '11208', '11209', '11210', '11211', '11212', '11213', '11214', '11215',
'11216', '11217', '11218', '11219', '11220', '11221', '11222', '11223', '11224', '11225',
'11226', '11227', '11228', '11229', '11230', '11231', '11232', '11233', '11234', '11235',
'10101', '10102', '10800', '11236']
As you can see, this list is quite an eyeful, so I'd like to include a small little function in my script that can reduce all of the sequential ranges down to hyphenated bookends of a sort, for example 10,277 - 10,282.
Thanks to all for any help included! Have a great day.
Doable. Let's see if this can be done with pandas.
import pandas as pd
data = ['10215', '10216', '10277', ...]
# Load data as series.
s = pd.Series(data)
# Find all consecutive rows with a difference of one
# and bin them into groups using `cumsum`.
v = s.astype(int).diff().bfill().ne(1).cumsum()
# Use `groupby` and `apply` to condense the consecutive numbers into ranges.
# This is only done if the group size is >1.
ranges = (
s.groupby(v).apply(
lambda x: '-'.join(x.values[[0, -1]]) if len(x) > 1 else x.item()).tolist())
print (ranges)
['10215-10216',
'10277-10282',
'10292-10293',
'10295-10326',
'10344',
'10399-10406',
'10415-10418',
'10430',
'10448',
'10492-10495',
'10574-10659',
'10707-10710',
'10792-10795',
'10908',
'10936-10939',
'11108-11155',
'11194-11235',
'10101-10102',
'10800',
'11236']
Your data must be sorted for this to work.
You can just use a simple loop here with the following logic:
Create a list to store the ranges (ranges).
Iterate over the values in your list (l)
If ranges is empty, append a list with the first value in l to ranges
Otherwise if the difference between the current and previous value is 1, append the current value to the last list in ranges
Otherwise append a list with the current value to ranges
Code:
l = ['10215', '10216', '10277', '10278', '10279', '10280', ...]
ranges = []
for x in l:
if not ranges:
ranges.append([x])
elif int(x)-prev_x == 1:
ranges[-1].append(x)
else:
ranges.append([x])
prev_x = int(x)
Now you can compute your final ranges by concatenating the first and last element of each list in ranges (if there are at least 2 elements).
final_ranges = ["-".join([r[0], r[-1]] if len(r) > 1 else r) for r in ranges]
print(final_ranges)
#['10215-10216',
# '10277-10282',
# '10292-10293',
# '10295-10326',
# '10344',
# '10399-10406',
# '10415-10418',
# '10430',
# '10448',
# '10492-10495',
# '10574-10659',
# '10707-10710',
# '10792-10795',
# '10908',
# '10936-10939',
# '11108-11155',
# '11194-11235',
# '10101-10102',
# '10800',
# '11236']
This also assumes your data is sorted. You could simplify the code to combine items 3 and 5.
For purely educational purposes (this is much more inefficient that the loop above), here's the same thing using map and reduce:
from functools import reduce
def myreducer(ranges, x):
if not ranges:
return [[x]]
elif (int(x) - int(ranges[-1][-1]) == 1):
return ranges[:-1] + [ranges[-1]+[x]]
else:
return ranges + [[x]]
final_ranges = map(
lambda r: "-".join([r[0], r[-1]] if len(r) > 1 else r),
reduce(myreducer, l, [])
)
There is also the pynumparser package:
import pynumparser
pynumparser.NumberSequence().encode([1, 2, 3, 5, 6, 7, 8, 10])
# result: '1-3,5-8,10'
pynumparser.NumberSequence().parse('1-3,5-8,10')
# result: (1, 2, 3, 5, 6, 7, 8, 10)
Related
The code :
import pandas as pd
import numpy as np
import csv
data = pd.read_csv("/content/NYC_temperature.csv", header=None,names = ['temperatures'])
np.cumsum(data['temperatures'])
printcounter = 0
list_30 = [15.22]#first temperature , i could have also added it by doing : list_30.append(i)[0] since it's every 30 values but doesn't append the first one :)
list_2 = [] #this is for the values of the subtraction (for the second iteration)
for i in data['temperatures']:
if (printcounter == 30):
list_30.append(i)
printcounter = 0
printcounter += 1
**for x in list_30:
substract = list_30[x] - list_30[x+1]**
list_2.append(substraction)
print(max(list_2))
Hey guys ! i'm really having trouble with the black part.
**for x in list_30:
substract = list_30[x] - list_30[x+1]**
I'm trying to iterate over the elements and sub stracting element x with the next element (x+1) but the following error pops out TypeError: 'float' object is not iterable. I have also tried to iterate using x instead of list_30[x] but then when I use next(x) I have another error.
for x in list_30: will iterate on list_30, and affect to x, the value of the item in the list, not the index in the list.
for your case you would prefer to loop on your list with indexes:
index = 0
while index < len(list_30):
substract = list_30[index] - list_30[index + 1]
edit: you will still have a problem when you will reach the last element of list_30 as there will be no element of list_30[laste_index + 1],
so you should probably stop before the end with while index < len(list_30) -1:
in case you want the index and the value, you can do:
for i, v in enumerate(list_30):
substract = v - list_30[i + 1]
but the first one look cleaner i my opinion
if you`re trying to find ifference btw two adjacent elements of an array (like differentiate it), you shoul probably use zip function
inp = [1, 2, 3, 4, 5]
delta = []
for x0,x1 in zip(inp, inp[1:]):
delta.append(x1-x0)
print(delta)
note that list of deltas will be one shorter than the input
I want to get border of data in a list using python
For example I have this list :
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
I want a code that return data borders. for example:
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
^ ^ ^ ^
b = get_border_index(a)
print(b)
output:
[0,4,7,12]
How can I implement get_border_index(lst: list) -> list function?
The scalable answer that also works for very long lists or arrays is to use np.diff. In that case you should avoid a for loop at all costs.
import numpy as np
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
a = np.array(a)
# this is unequal 0 if there is a step
d = np.diff(a)
# boolean array where the steps are
is_step = d != 0
# get the indices of the steps (first one is trivial).
ics = np.where(is_step)
# get the first dimension and shift by one as you want
# the index of the element right of the step
ics_shift = ics[0] + 1
# and if you need a list
ics_list = ics_shift.tolist()
print(ics_list)
You can use for loop with enumerate
def get_border_index(a):
last_value = None
result = []
for i, v in enumerate(a):
if v != last_value:
last_value = v
result.append(i)
return result
a = [1,1,1,1,4,4,4,6,6,6,6,6,1,1,1]
b = get_border_index(a)
print(b)
Output
[0, 4, 7, 12]
This code will check if an element in the a list is different then the element before and if so it will append the index of the element to the result list.
I have a numpy array with these values:
[10620.5, 11899., 11879.5, 13017., 11610.5]
import Numpy as np
array = np.array([10620.5, 11899, 11879.5, 13017, 11610.5])
I would like to get values that are "close" (in this instance, 11899 and 11879) and average them, then replace them with a single instance of the new number resulting in this:
[10620.5, 11889, 13017, 11610.5]
the term "close" would be configurable. let's say a difference of 50
the purpose of this is to create Spans on a Bokah graph, and some lines are just too close
I am super new to python in general (a couple weeks of intense dev)
I would think that I could arrange the values in order, and somehow grab the one to the left, and right, and do some math on them, replacing a match with the average value. but at the moment, I just dont have any idea yet.
Try something like this, I added a few extra steps, just to show the flow:
the idea is to group the data into adjacent groups, and decide if you want to group them or not based on how spread they are.
So as you describe you can combine you data in sets of 3 nummbers and if the difference between the max and min numbers are less than 50 you average them, otherwise you leave them as is.
import pandas as pd
import numpy as np
arr = np.ravel([1,24,5.3, 12, 8, 45, 14, 18, 33, 15, 19, 22])
arr.sort()
def reshape_arr(a, n): # n is number of consecutive adjacent items you want to compare for averaging
hold = len(a)%n
if hold != 0:
container = a[-hold:] #numbers that do not fit on the array will be excluded for averaging
a = a[:-hold].reshape(-1,n)
else:
a = a.reshape(-1,n)
container = None
return a, container
def get_mean(a, close): # close = how close adjacent numbers need to be, in order to be averaged together
my_list=[]
for i in range(len(a)):
if a[i].max()-a[i].min() > close:
for j in range(len(a[i])):
my_list.append(a[i][j])
else:
my_list.append(a[i].mean())
return my_list
def final_list(a, c): # add any elemts held in the container to the final list
if c is not None:
c = c.tolist()
for i in range(len(c)):
a.append(c[i])
return a
arr, container = reshape_arr(arr,3)
arr = get_mean(arr, 5)
final_list(arr, container)
You could use fuzzywuzzy here to gauge the ratio of cloesness between 2 data sets.
See details here: http://jonathansoma.com/lede/algorithms-2017/classes/fuzziness-matplotlib/fuzzing-matching-in-pandas-with-fuzzywuzzy/
Taking Gustavo's answer and tweaking it to my needs:
def reshape_arr(a, close):
flag = True
while flag is not False:
array = a.sort_values().unique()
l = len(array)
flag = False
for i in range(l):
previous_item = next_item = None
if i > 0:
previous_item = array[i - 1]
if i < (l - 1):
next_item = array[i + 1]
if previous_item is not None:
if abs(array[i] - previous_item) < close:
average = (array[i] + previous_item) / 2
flag = True
#find matching values in a, and replace with the average
a.replace(previous_item, value=average, inplace=True)
a.replace(array[i], value=average, inplace=True)
if next_item is not None:
if abs(next_item - array[i]) < close:
flag = True
average = (array[i] + next_item) / 2
# find matching values in a, and replace with the average
a.replace(array[i], value=average, inplace=True)
a.replace(next_item, value=average, inplace=True)
return a
this will do it if I do something like this:
candlesticks['support'] = reshape_arr(supres_df['support'], 150)
where candlesticks is the main DataFrame that I am using and supres_df is another DataFrame that I am massaging before I apply it to the main one.
it works, but is extremely slow. I am trying to optimize it now.
I added a while loop because after averaging, the averages can become close enough to average out again, so I will loop again, until it doesn't need to average anymore. This is total newbie work, so if you see something silly, please comment.
This question already has answers here:
Merging Overlapping Intervals
(4 answers)
Closed last year.
I am trying to solve a question where in overlapping intervals need to be merged.
The question is:
Given a collection of intervals, merge all overlapping intervals.
For example, Given [1,3],[2,6],[8,10],[15,18], return [1,6],[8,10],[15,18].
I tried my solution:
# Definition for an interval.
# class Interval:
# def __init__(self, s=0, e=0):
# self.start = s
# self.end = e
class Solution:
def merge(self, intervals):
"""
:type intervals: List[Interval]
:rtype: List[Interval]
"""
start = sorted([x.start for x in intervals])
end = sorted([x.end for x in intervals])
merged = []
j = 0
new_start = 0
for i in range(len(start)):
if start[i]<end[j]:
continue
else:
j = j + 1
merged.append([start[new_start], end[j]])
new_start = i
return merged
However it is clearly missing the last interval as:
Input : [[1,3],[2,6],[8,10],[15,18]]
Answer :[[1,6],[8,10]]
Expected answer: [[1,6],[8,10],[15,18]]
Not sure how to include the last interval as overlap can only be checked in forward mode.
How to fix my algorithm so that it works till the last slot?
Your code implicitly already assumes the starts and ends to be sorted, so that sort could be left out. To see this, try the following intervals:
intervals = [[3,9],[2,6],[8,10],[15,18]]
start = sorted([x[0] for x in intervals])
end = sorted([x[1] for x in intervals]) #mimicking your start/end lists
merged = []
j = 0
new_start = 0
for i in range(len(start)):
if start[i]<end[j]:
continue
else:
j = j + 1
merged.append([start[new_start], end[j]])
new_start = i
print(merged) #[[2, 9], [8, 10]]
Anyway, the best way to do this is probably recursion, here shown for a list of lists instead of Interval objects.
def recursive_merge(inter, start_index = 0):
for i in range(start_index, len(inter) - 1):
if inter[i][1] > inter[i+1][0]:
new_start = inter[i][0]
new_end = inter[i+1][1]
inter[i] = [new_start, new_end]
del inter[i+1]
return recursive_merge(inter.copy(), start_index=i)
return inter
sorted_on_start = sorted(intervals)
merged = recursive_merge(sorted_on_start.copy())
print(merged) #[[2, 10], [15, 18]]
I know the question is old, but in case it might help, I wrote a Python library to deal with (set of) intervals. Its name is portion and makes it easy to merge intervals:
>>> import portion as P
>>> inputs = [[1,3],[2,6],[8,10],[15,18]]
>>> # Convert each input to an interval
>>> intervals = [P.closed(a, b) for a, b in inputs]
>>> # Merge these intervals
>>> merge = P.Interval(*intervals)
>>> merge
[1,6] | [8,10] | [15,18]
>>> # Output as a list of lists
>>> [[i.lower, i.upper] for i in merge]
[[1,6],[8,10],[15,18]]
Documentation can be found here: https://github.com/AlexandreDecan/portion
We can have intervals sorted by the first interval and we can build the merged list in the same interval list by checking the intervals one by one not appending to another one so. we increment i for every interval and interval_index is current interval check
x =[[1,3],[2,6],[8,10],[15,18]]
#y = [[1,3],[2,6],[8,10],[15,18],[19,25],[20,26],[25,30], [32,40]]
def merge_intervals(intervals):
sorted_intervals = sorted(intervals, key=lambda x: x[0])
interval_index = 0
#print(sorted_intervals)
for i in sorted_intervals:
if i[0] > sorted_intervals[interval_index][1]:
interval_index += 1
sorted_intervals[interval_index] = i
else:
sorted_intervals[interval_index] = [sorted_intervals[interval_index][0], i[1]]
#print(sorted_intervals)
return sorted_intervals[:interval_index+1]
print(merge_intervals(x)) #-->[[1, 6], [8, 10], [15, 18]]
#print ("------------------------------")
#print(merge_intervals(y)) #-->[[1, 6], [8, 10], [15, 18], [19, 30], [32, 40]]
This is very old now, but in case anyone stumbles across this, I thought I'd throw in my two cents, since I wasn't completely happy with the answers above.
I'm going to preface my solution by saying that when I work with intervals, I prefer to convert them to python3 ranges (probably an elegant replacement for your Interval class) because I find them easy to work with. However, you need to remember that ranges are half-open like everything else in Python, so the stop coordinate is not "inside" of the interval. Doesn't matter for my solution, but something to keep in mind.
My own solution:
# Start by converting the intervals to ranges.
my_intervals = [[1, 3], [2, 6], [8, 10], [15, 18]]
my_ranges = [range(start, stop) for start, stop in my_intervals]
# Next, define a check which will return True if two ranges overlap.
# The double inequality approach means that your intervals don't
# need to be sorted to compare them.
def overlap(range1, range2):
if range1.start <= range2.stop and range2.start <= range1.stop:
return True
return False
# Finally, the actual function that returns a list of merged ranges.
def merge_range_list(ranges):
ranges_copy = sorted(ranges.copy(), key=lambda x: x.stop)
ranges_copy = sorted(ranges_copy, key=lambda x: x.start)
merged_ranges = []
while ranges_copy:
range1 = ranges_copy[0]
del ranges_copy[0]
merges = [] # This will store the position of ranges that get merged.
for i, range2 in enumerate(ranges_copy):
if overlap(range1, range2): # Use our premade check function.
range1 = range(min([range1.start, range2.start]), # Overwrite with merged range.
max([range1.stop, range2.stop]))
merges.append(i)
merged_ranges.append(range1)
# Time to delete the ranges that got merged so we don't use them again.
# This needs to be done in reverse order so that the index doesn't move.
for i in reversed(merges):
del ranges_copy[i]
return merged_ranges
print(merge_range_list(my_ranges)) # --> [range(1, 6), range(8, 10), range(15, 18)]
Make pairs for every endpoint: (value; kind = +/-1 for start or end of interval)
Sort them by value. In case of tie choose paie with -1 first if you need to merge intervals with coinciding ends like 0-1 and 1-2
Make CurrCount = 0, walk through sorted list, adding kind to CurrCount
Start new resulting interval when CurrCount becomes nonzero, finish interval when CurrCount becomes zero.
Late to the party, but here is my solution. I typically find recursion with an invariant easier to conceptualize. In this case, the invariant is that the head is always merged, and the tail is always waiting to be merged, and you compare the last element of head with the first element of tail.
One should definitely use sorted with the key argument rather than using a list comprehension.
Not sure how efficient this is with slicing and concatenating lists.
def _merge(head, tail):
if tail == []:
return head
a, b = head[-1]
x, y = tail[0]
do_merge = b > x
if do_merge:
head_ = head[:-1] + [(a, max(b, y))]
tail_ = tail[1:]
return _merge(head_, tail_)
else:
head_ = head + tail[:1]
tail_ = tail[1:]
return _merge(head_, tail_)
def merge_intervals(lst):
if len(lst) <= 1:
return lst
lst = sorted(lst, key=lambda x: x[0])
return _merge(lst[:1], lst[1:])
I have a list that contains sublists with 3 values and I need to print out a list that looks like:
I also need to compare the third column values with eachother to tell if they are increasing or decreasing as you go down.
bb = 3.9
lowest = 0.4
#appending all the information to a list
allinfo= []
while bb>=lowest:
everything = angleWithPost(bb,cc,dd,ee)
allinfo.append(everything)
bb-=0.1
I think the general idea for finding out whether or not the third column values are increasing or decreasing is:
#Checking whether or not Fnet are increasing or decreasing
ii=0
while ii<=(10*(bb-lowest)):
if allinfo[ii][2]>allinfo[ii+1][2]:
abc = "decreasing"
elif allinfo[ii][2]<allinfo[ii+1][2]:
abc = "increasing"
ii+=1
Then when i want to print out my table similar to the one above.
jj=0
while jj<=(10*(bb-lowest))
print "%8.2f %12.2f %12.2f %s" %(allinfo[jj][0], allinfo[jj][1], allinfo[jj][2], abc)
jj+=1
here is the angle with part
def chainPoints(aa,DIS,SEG,H):
#xtuple x chain points
n=0
xterms = []
xterm = -DIS
while n<=SEG:
xterms.append(xterm)
n+=1
xterm = -DIS + n*2*DIS/(SEG)
#
#ytuple y chain points
k=0
yterms = []
while k<=SEG:
yterm = H + aa*m.cosh(xterms[k]/aa) - aa*m.cosh(DIS/aa)
yterms.append(yterm)
k+=1
return(xterms,yterms)
#
#
def chainLength(aa,DIS,SEG,H):
xterms, yterms = chainPoints(aa,DIS,SEG,H)# using x points and y points from the chainpoints function
#length of chain
ff=1
Lterm=0.
totallength=0.
while ff<=SEG:
Lterm = m.sqrt((xterms[ff]-xterms[ff-1])**2 + (yterms[ff]-yterms[ff-1])**2)
totallength += Lterm
ff+=1
return(totallength)
#
def angleWithPost(aa,DIS,SEG,H):
xterms, yterms = chainPoints(aa,DIS,SEG,H)
totallength = chainLength(aa,DIS,SEG,H)
#Find the angle
thetaradians = (m.pi)/2. + m.atan(((yterms[1]-yterms[0])/(xterms[1]-xterms[0])))
#Need to print out the degrees
thetadegrees = (180/m.pi)*thetaradians
#finding the net force
Fnet = abs((rho*grav*totallength))/(2.*m.cos(thetaradians))
return(totallength, thetadegrees, Fnet)
Review this Python2 implementation which uses map and an iterator trick.
from itertools import izip_longest, islice
from pprint import pprint
data = [
[1, 2, 3],
[1, 2, 4],
[1, 2, 3],
[1, 2, 5],
]
class AddDirection(object):
def __init__(self):
# This default is used if the series begins with equal values or has a
# single element.
self.increasing = True
def __call__(self, pair):
crow, nrow = pair
if nrow is None or crow[-1] == nrow[-1]:
# This is the last row or the direction didn't change. Just return
# the direction we previouly had.
inc = self.increasing
elif crow[-1] > nrow[-1]:
inc = False
else:
# Here crow[-1] < nrow[-1].
inc = True
self.increasing = inc
return crow + ["Increasing" if inc else "Decreasing"]
result = map(AddDirection(), izip_longest(data, islice(data, 1, None)))
pprint(result)
The output:
pts/1$ python2 a.py
[[1, 2, 3, 'Increasing'],
[1, 2, 4, 'Decreasing'],
[1, 2, 3, 'Increasing'],
[1, 2, 5, 'Increasing']]
Whenever you want to transform the contents of a list (in this case the list of rows), map is a good place where to begin thinking.
When the algorithm requires data from several places of a list, offsetting the list and zipping the needed values is also a powerful technique. Using generators so that the list doesn't have to be copied, makes this viable in real code.
Finally, when you need to keep state between calls (in this case the direction), using an object is the best choice.
Sorry if the code is too terse!
Basically you want to add a 4th column to the inner list and print the results?
#print headers of table here, use .format for consistent padding
previous = 0
for l in outer_list:
if l[2] > previous:
l.append('increasing')
elif l[2] < previous:
l.append('decreasing')
previous = l[2]
#print row here use .format for consistent padding
Update for list of tuples, add value to tuple:
import random
outer_list = [ (i, i, random.randint(0,10),)for i in range(0,10)]
previous = 0
allinfo = []
for l in outer_list:
if l[2] > previous:
allinfo.append(l +('increasing',))
elif l[2] < previous:
allinfo.append(l +('decreasing',))
previous = l[2]
#print row here use .format for consistent padding
print(allinfo)
This most definitely can be optimized and you could reduce the number of times you are iterating over the data.