Sum total by two conditions in list of lists - python

cust_id = semi_final_df['0_x'].tolist()
date = semi_final_df[1].tolist()
total_amount = semi_final_df[0].tolist()
prod_num = semi_final_df['0_y'].tolist()
prod_deduped = []
quant_cleaned = []
product_net_amount = []
cust_id_final = []
date_final = []
for row in total_amount:
quant_cleaned.append(float(row))
for unique_prodz in prod_num:
if unique_prodz not in prod_deduped:
prod_deduped.append(unique_prodz)
for unique_product in prod_deduped:
indices = [i for i, x in enumerate(prod_num) if x == unique_product]
product_total = 0
for index in indices:
product_total += quant_cleaned[index]
product_net_amount.append(product_total)
first_index = prod_num.index(unique_product)
cust_id_final.append(cust_id[first_index])
date_final.append(date[first_index])
Above code calculates sum amount by one condition in order to sum the total on an invoice.
The data had multiple lines but shared the same invoice/product number.
Problem:
I need to modify the below code so that I can sum by unique product and unique date.
I have given it a go but I am getting a value error -
saying x, y is not in a list
As per my understanding the issue lies in the fact that I am zipping two de-duped lists together of different lengths and then I am attempting to loop through the result inline.
This line causes the error
for i,[x, y] in enumerate(zipped_list):
Any help would be sincerely appreciated. Here is the second batch of code with comments.
from itertools import zip_longest
#I have not included the code for the three lists below but you can assume they are populated as these are the lists that I will be #working off of. They are of the same length.
prod_numbers = []
datesz = []
converted_quant = []
#Code to dedupe date and product which will end up being different lengths. These two lists are populated by the two for loops below
prod_deduped = []
dates_deduped = []
for unique_prodz in prod_numbers:
if unique_prodz not in prod_deduped:
prod_deduped.append(unique_prodz)
for unique_date in datesz:
if unique_date not in dates_deduped:
dates_deduped.append(unique_date)
#Now for the fun part. Time to sum by date and product. The three lists below are empty until we run the code
converted_net_amount = []
prod_id_final = []
date_final = []
#I zipped the list together using itertools which I imported at the top
for unique_product, unique_date in zip_longest(prod_deduped, dates_deduped, fillvalue = ''):
indices = []
zipped_object = zip(prod_numbers, datesz)
zipped_list = list(zipped_object)
for i,[x, y] in enumerate(zipped_list):
if x == unique_product and y == unique_date:
indices.append(i)
converted_total = 0
for index in indices:
converted_total += converted_quant[index]
converted_net_amount.append[converted_total]
first_index = zipped_list.index([unique_product, unique_date])
prod_id_final.append(prod_numbers[first_index])
date_final.append(datesz[first_index])

from collections import defaultdict
summed_dictionary = defaultdict(int)
for x, y, z in list:
summed_dictionary[(x,y)] += z
Using defaultdict should solve your problem and is a lot easier on the eyes than all your code above. I saw this on reddit this morning and figured you crossposted. Credit to the guy from reddit on /r/learnpython

Related

Finding similar numbers in a list and getting the average

I currently have the numbers above in a list. How would you go about adding similar numbers (by nearest 850) and finding average to make the list smaller.
For example I have the list
l = [2000,2200,5000,2350]
In this list, i want to find numbers that are similar by n+500
So I want all the numbers similar by n+500 which are 2000,2200,2350 to be added and divided by the amount there which is 3 to find the mean. This will then replace the three numbers added. so the list will now be l = [2183,5000]
As the image above shows the numbers in the list. Here I would like the numbers close by n+850 to all be selected and the mean to be found
It seems that you look for a clustering algorithm - something like K-means.
This algorithm is implemented in scikit-learn package
After you find your K means, you can count how many of your data were clustered with that mean, and make your computations.
However, it's not clear in your case what is K. You can try and run the algorithm for several K values until you get your constraints (the n+500 distance between the means)
You can use:
import numpy as np
l = np.array([2000,2200,5000,2350])
# find similar numbers (that are within each 500 fold)
similar = l // 500
# for each similar group get the average and convert it to integer (as in the desired output)
new_list = [np.average(l[similar == num]).astype(int) for num in np.unique(similar)]
print(new_list)
Output:
[2183, 5000]
Step 1:
list = [5620.77978515625,
7388.43017578125,
7683.580078125,
8296.6513671875,
8320.82421875,
8557.51953125,
8743.5,
9163.220703125,
9804.7939453125,
9913.86328125,
9940.1396484375,
9951.74609375,
10074.23828125,
10947.0419921875,
11048.662109375,
11704.099609375,
11958.5,
11964.8232421875,
12335.70703125,
13103.0,
13129.529296875,
16463.177734375,
16930.900390625,
17712.400390625,
18353.400390625,
19390.96484375,
20089.0,
34592.15625,
36542.109375,
39478.953125,
40782.078125,
41295.26953125,
42541.6796875,
42893.58203125,
44578.27734375,
45077.578125,
48022.2890625,
52535.13671875,
58330.5703125,
61597.91796875,
62757.12890625,
64242.79296875,
64863.09765625,
66930.390625]
Step 2:
seen = [] #to log used indices pairs
diff_dic = {} #to record indices and diff
for i,a in enumerate(list):
for j,b in enumerate(list):
if i!=j and (i,j)[::-1] not in seen:
seen.append((i,j))
diff_dic[(i,j)] = abs(a-b)
keys = []
for ind, diff in diff_dic.items():
if diff <= 850:
keys.append(ind)
uniques_k = [] #to record unique indices
for pair in keys:
for key in pair:
if key not in uniques_k:
uniques_k.append(key)
import numpy as np
list_arr = np.array(list)
nearest_avg = np.mean(list_arr[uniques_k])
list_arr = np.delete(list_arr, uniques_k)
list_arr = np.append(list_arr, nearest_avg)
list_arr
output:
array([ 5620.77978516, 34592.15625, 36542.109375, 39478.953125, 48022.2890625, 52535.13671875, 58330.5703125 , 61597.91796875, 62757.12890625, 66930.390625 , 20566.00205365])
You just need a conditional list comprehension like this:
l = [2000,2200,5000,2350]
n = 2000
a = [ (x) for x in l if ((n -250) < x < (n + 250)) ]
Then you can average with
np.mean(a)
or whatever method you prefer.

python interpolation of some datapoints in dataset / merging lists

In an .xlsx file there is logged machine data in a way that is not suitable for further calculations. Meaning I've got a file that contains depth data of a cutting tool. Each depth increment comes with several further informations like pressure, rotational speed, forces and many more.
As you can see in some datapoints the resolution of the depth parameter (0.01) is insufficient, as other parameters are updated more often. So I want to interpolate between two consecutive depth datapoints.
What is important to know, this effect doesn't occure on each depth. When the cutting tool moves fast, everything is fine.
Here is also an example file.
So I just need to interpolate values of the depth, when the differnce between two consecutive depth datapoints is 0.01
I've tried the following approach:
Open as dataframe, rename, drop NaN, convert to list
count identical depths in list and transfer them to dataframe
calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0"
Divide delta depth by number of time steps if 0.009 < delta depth < 0.011 -->interpolated depth
empty List of Lists with the number of elements of the sublist corresponding to the duration
Pass values from interpolated depth to the respective sublists --> List 1
Transfer elements from delta_depth to sublists --> Liste 2
Merge List 1 and List 2
Flatten the Lists
replace the original depth value by the interpolated values in dataframe
It looks like this, but at point 8 (merging) I don't get what I need:
import pandas as pd
from itertools import groupby
from itertools import zip_longest
import matplotlib.pyplot as plt
import numpy as np
#open and rename of some columns
df_raw=pd.read_excel(open('---.xlsx', 'rb'), sheet_name='---')
df_raw=df_raw.rename(columns={"---"})
#drop NaN
df_1=df_raw.dropna(subset=['depth'])
#convert to list
li = df_1['depth'].tolist()
#count identical depths in list and transfer them to dataframe
df_count = pd.DataFrame.from_records([[i, len([*group])] for i, group in groupby(li)])
df_count = df_count.rename(columns={0: "depth", 1: "duration"})
#calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0".
df_count["delta_depth"] = df_count["depth"].diff()
df_count=df_count.fillna(0)
#Divide delta depth by number of time steps if 0.009 < delta depth < 0.011
df_count["inter_depth"] = np.where(np.logical_and(df_count['delta_depth'] > 0.009, df_count['delta_depth'] < 0.011),df_count["delta_depth"] / df_count["duration"],0)
li2=df_count.values.tolist()
li_depth = df_count['depth'].tolist()
li_delta = df_count['delta_depth'].tolist()
li_duration = df_count['duration'].tolist()
li_inter = df_count['inter_depth'].tolist()
#empty List of Lists with the number of elements of the sublist corresponding to the duration
out=[]
for number in li_duration:
out.append(li_inter[:number])
#Pass values from interpolated depth to the respective sublists --> Liste 1
out = [[i]*j for i, j in zip(li_inter, [len(j) for j in out])]
#Transfer elements from delta_depth to sublists --> Liste 2
def extractDigits(lst):
return list(map(lambda el:[el], lst))
lst=extractDigits(li_delta)
#Merge list 1 and list 2
list1 = out
list2 = lst
new_list = []
for l1, l2 in zip_longest(list1, list2, fillvalue=[]):
new_list.append([y if y else x for x, y in zip_longest(l1, l2)])
new_list
After merging the first elements of the sublists the original depth values are followed by the interpolated values. But the sublists should contain only interpolated values.
Now I have the following questions:
is there in general a better approach to this problem?
How could I solve the problem with merging, or...
... find a way to override the wrong first elements in the sublists
The desired result would look something like this.
Any help would be much appreciated, as I'm very unexperienced in python and totally stuck.
I am sure someone could write something prettier, but I think this will work just fine:
Edited to some kinda messy scripting. I think this will do what you need it to though
_list_helper1 = df["Depth [m]"].to_list()
_list_helper1.insert(0, 0)
_list_helper1.insert(0, 0)
_list_helper1 = _list_helper1[:-2]
df["helper1"] = _list_helper1
_list = df["Depth [m]"].to_list() # grab all depth values
_list.insert(0, 0) # insert a value at the beginning to offset from original col
_list = _list[0:-1] # Delete the very last item
df["helper"] = _list # add the list to a helper col which is now offset
df["delta depth"] = df["Depth [m]"] - df["helper"] # subtract helper col from original
_id = 0
for i in range(len(df)):
if df.loc[i, "Depth [m]"] == df.loc[i, "helper"]:
break_val = df.loc[i, "Depth [m]"]
break_val_2 = df.loc[i+1, "Depth [m]"]
if break_val_2 == break_val:
df.loc[i, "IDcol"] = _id
df.loc[i+1, "IDcol"] = _id
else:
_id += 1
depth = df["IDcol"].to_list()
depth = list(dict.fromkeys(depth))
depth = [x for x in depth if str(x) != 'nan']
increments = []
for i in depth:
_df = df.copy()
_df = _df[_df["IDcol"] == i]
_df.reset_index(inplace=True, drop=True)
div_by = len(_df)
increment = _df.loc[0, "helper"] - _df.loc[0, "helper1"]
_df["delta depth"] = increment / div_by
_increment = increment / div_by
base_value = _df.loc[0, "Depth [m]"]
for y in range(div_by):
_df.loc[y, "Depth [m]"] = base_value + ((y + 1) * _increment)
increments.append(_df)
df["IDcol"] = df["IDcol"].fillna("KEEP")
df = df[df["IDcol"] == "KEEP"]
increments.append(df)
df = pd.concat(increments)
df = df.fillna(0)
df = df[["index", "Depth [m]", "delta depth", "IDcol"]] # and whatever other cols u want

how to select pieces of 2 lists using np.sign?

I have 2 separate lists which I would like to select some pieces of both. In the first variable set, I need to select based on its sign then when I could have zero crossing, I should select from both lists based on the indices coming from zero_crossing.
I could define the function to have zero_crossing indices but I do not know how to select from both lists using the indices.
def func_1(arr1):
for i, a in enumerate(arr1):
zero_crossings = np.where(np.diff(np.sign(arr1)))[0]
return zero_crossings
def func_2(arr1,arr2):
res_list1 = []
res_list2 = []
for i in (zero_crossings):
res_list1.append(arr1[i:i+1])
res_list2.append(arr2[i:i+1])
zero_crossing = [3,12,18]
list1 = [-12,-14,-16,-10,1,3,6,2,,1,5,5,3,-1,-12,-2,-3,-5,-3,3,2,5,2]
list2 = [0.00040409 0.00041026 0.0004162 ... 0.00116538 0.001096 0.00102431]
The expected results:
new_list_1 = [list1[0:3]+list1[3:12]+list1[12:18]]
new_list_1 = [list2[0:3]+list2[3:12]+list2[12:18]]
for i in range(len(zero_crossing)):
list_{%d} = []
list_i = list1[zero_crossing[i]:zero_crossing[i+1]]
I want to use list1 to see where we have a change in sign then through the list of indices of sign changing, try to select the values of both lists.
All efforts will be appreciated.

How do I use a while loop to access all the 2nd elements of lists which are the values stored in a dictionary?

If I have a dictionary like this, filled with similar lists, how can I apply a while loo tp extract a list that prints that second element:
racoona_valence={}
racoona_valence={"rs13283416": ["7:87345874365-839479328749+","BOBB7"],\}
I need to print the part that says "BOBB7" for 2nd element of the lists in a larger dictionary. There are ten key-value pairs in it, so I am starting it like so, but unsure what to do because all the examples I can find don't relate to my problem:
n=10
gene_list = []
while n>0:
Any help greatly appreciated.
Well, there's a bunch of ways to do it depending on how well-structured your data is.
racoona_valence={"rs13283416": ["7:87345874365-839479328749+","BOBB7"], "rs13283414": ["7:87345874365-839479328749+","BOBB4"]}
output = []
for key in racoona_valence.keys():
output.append(racoona_valence[key][1])
print(output)
other_output = []
for key, value in racoona_valence.items():
other_output.append(value[1])
print(other_output)
list_comprehension = [value[1] for value in racoona_valence.values()]
print(list_comprehension)
n = len(racoona_valence.values())-1
counter = 0
gene_list = []
while counter<=n:
gene_list.append(list(racoona_valence.values())[n][1])
counter += 1
print(gene_list)
Here is a list comprehension that does what you want:
second_element = [x[1] for x in racoona_valence.values()]
Here is a for loop that does what you want:
second_element = []
for value in racoona_valence.values():
second_element.append(value[1])
Here is a while loop that does what you want:
# don't use a while loop to loop over iterables, it's a bad idea
i = 0
second_element = []
dict_values = list(racoona_valence.values())
while i < len(dict_values):
second_element.append(dict_values[i][1])
i += 1
Regardless of which approach you use, you can see the results by doing the following:
for item in second_element:
print(item)
For the example that you gave, this is the output:
BOBB7

how to sum adjacent tuples/list

I apologise for the terrible description and if this is a duplicated, i have no idea how to phrase this question. Let me explain what i am trying to do. I have a list consisting of 0s and 1s that is 3600 elements long (1 hour time series data). i used itertools.groupby() to get a list of consecutive keys. I need (0,1) to be counted as (1,1), and be summed with the flanking tuples.
so
[(1,8),(0,9),(1,5),(0,1),(1,3),(0,3)]
becomes
[(1,8),(0,9),(1,5),(1,1),(1,3),(0,3)]
which should become
[(1,8),(0,9),(1,9),(0,3)]
right now, what i have is
def counter(file):
list1 = list(dict[file]) #make a list of the data currently working on
graph = dict.fromkeys(list(range(0,3601))) #make a graphing dict, x = key, y = value
for i in list(range(0,3601)):
graph[i] = 0 # set all the values/ y from none to 0
for i in list1:
graph[i] +=1 #populate the values in graphing dict
x,y = zip(*graph.items()) # unpack graphing dict into list, x = 0 to 3600 and y = time where it bite
z = [(x[0], len(list(x[1]))) for x in itertools.groupby(y)] #make a new list z where consecutive y is in format (value, count)
z[:] = [list(i) for i in z]
for i in z[:]:
if i == [0,1]:
i[0]=1
return(z)
dict is a dictionary where the keys are filenames and the values are a list of numbers to be used in the function counter(). and this gives me something like this but much longer
[[1,8],[0,9],[1,5], [1,1], [1,3],[0,3]]
edits:
solved it with the help of a friend,
while (0,1) in z:
idx=z.index((0,1))
if idx == len(z)-1:
break
z[idx] = (1,1+z[idx-1][1] + z[idx+1][1])
del z[idx+1]
del z[idx-1]
Not sure what exactly is that you need. But this is my best attempt of understanding it.
def do_stuff(original_input):
new_original = []
new_original.append(original_input[0])
for el in original_input[1:]:
if el == (0, 1):
el = (1, 1)
if el[0] != new_original[-1][0]:
new_original.append(el)
else:
(a, b) = new_original[-1]
new_original[-1] = (a, b + el[1])
return new_original
# check
print (do_stuff([(1,8),(0,9),(1,5),(0,1),(1,3),(0,3)]))

Categories

Resources