I am new to Python. I need to count the number of duplicates, delete the duplicate rows, and write each value's duplicate count into a new column. Below is my code:
import pandas as pd
from openpyxl import load_workbook

filepath = '/Users/jordanliu/Desktop/test/testA.xlsx'
data = load_workbook(filepath)
sku = data.active

duplicate_column = []
for x in range(sku.max_row):
    duplicate_count = 0
    for i in range(x):
        if sku.cell(row=i + 2, column=1).value == sku.cell(row=x + 2, column=1).value:
            duplicate_count = duplicate_column[i] + 1
            sku.cell(row=i + 2, column=1).value = 0
    duplicate_column.append(duplicate_count)

for x in range(len(duplicate_column)):
    sku.cell(row=x + 2, column=3).value = duplicate_column[x]

for y in range(sku.max_row):
    y = y + 1
    if sku.cell(row=y, column=1).value == 0:
        sku.delete_rows(y, 1)

data.save(filepath)
I originally tried pandas, but because the execution time was extraordinarily long I switched to openpyxl, which doesn't seem to help much either. Many other posts suggest using CSV, but since it's the writing process that takes the majority of the time, I don't think that would help much.
Can someone please provide me some help here?
for x in range(sku.max_row):
    duplicate_count = 0
    for i in range(x):
        if sku.cell(row=i + 2, column=1).value == sku.cell(row=x + 2, column=1).value:
            duplicate_count = duplicate_column[i] + 1
            sku.cell(row=i + 2, column=1).value = 0
In this portion you are rechecking the same values over and over. Assuming the values are meant to be unique, which is how I read your code, you should instead keep a cache in a hashed type (dict or set) and do the subsequent lookups against it instead of calling sku.cell every time.
So it would be something like:
xl_cache = {}         # key is the cell value, value is the row of its first occurrence
duplicate_count = {}  # key is the row of the first occurrence, value is its duplicate count
delete_set = set()    # rows to delete later
for x in range(2, sku.max_row + 1):  # data starts on row 2
    x_val = sku.cell(row=x, column=1).value
    if x_val in xl_cache:  # not the first time we see this value
        duplicate_count[xl_cache[x_val]] += 1  # increase the duplicate count of the original row
        delete_set.add(x)
    else:
        xl_cache[x_val] = x     # remember the row of the first occurrence
        duplicate_count[x] = 0  # no duplicates seen yet
So now you have a dictionary of originals with their duplicate counts, and you need to go back, delete the rows you don't want, and write the duplicate counts into the sheet. Go backwards through the range, starting at your max row and reducing by 1; check for a delete first, otherwise write the duplicate count.
y = sku.max_row
for i in range(y, 1, -1):  # stop before the header row
    if i in delete_set:
        sku.delete_rows(i, 1)
    else:
        sku.cell(row=i, column=3).value = duplicate_count[i]
In theory, this would only traverse your range twice in total, and lookups from the cache would be O(1) on average. You need to traverse this in reverse to maintain row order as you delete rows.
Since I don't actually have your sample data I can't test this code completely, so there could be minor issues, but I tried to use the structures from your code to make it easy for you to use.
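For comparison, the whole dedupe-and-count can also be done with vectorized pandas operations, which avoids the cell-by-cell writes entirely. This is only a rough sketch, assuming the duplicated values sit in the first column of the sheet and the count should land in a new column:
import pandas as pd

filepath = '/Users/jordanliu/Desktop/test/testA.xlsx'
df = pd.read_excel(filepath)

sku_col = df.columns[0]  # assumption: the duplicated values are in the first column

# number of extra occurrences of each value, then keep only the first occurrence
df['duplicate_count'] = df.groupby(sku_col)[sku_col].transform('size') - 1
df = df.drop_duplicates(subset=sku_col, keep='first')

df.to_excel(filepath, index=False)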
An .xlsx file contains machine data logged in a way that is not suitable for further calculations. I've got a file with depth data of a cutting tool; each depth increment comes with several further pieces of information such as pressure, rotational speed, forces and more.
In some datapoints the resolution of the depth parameter (0.01) is insufficient, as other parameters are updated more often, so I want to interpolate between two consecutive depth datapoints.
It is important to know that this effect doesn't occur at every depth; when the cutting tool moves fast, everything is fine.
Here is also an example file.
So I just need to interpolate the depth values when the difference between two consecutive depth datapoints is 0.01.
I've tried the following approach:
1. Open as dataframe, rename, drop NaN, convert to list
2. Count identical depths in the list and transfer them to a dataframe
3. Calculate the delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0"
4. Divide delta depth by the number of time steps if 0.009 < delta depth < 0.011 --> interpolated depth
5. Create an empty list of lists, with the number of elements of each sublist corresponding to the duration
6. Pass values from the interpolated depth to the respective sublists --> List 1
7. Transfer elements from delta_depth to the sublists --> List 2
8. Merge List 1 and List 2
9. Flatten the lists
10. Replace the original depth values by the interpolated values in the dataframe
It looks like this, but at point 8 (merging) I don't get what I need:
import pandas as pd
from itertools import groupby
from itertools import zip_longest
import matplotlib.pyplot as plt
import numpy as np
#open and rename of some columns
df_raw=pd.read_excel(open('---.xlsx', 'rb'), sheet_name='---')
df_raw=df_raw.rename(columns={"---"})
#drop NaN
df_1=df_raw.dropna(subset=['depth'])
#convert to list
li = df_1['depth'].tolist()
#count identical depths in list and transfer them to dataframe
df_count = pd.DataFrame.from_records([[i, len([*group])] for i, group in groupby(li)])
df_count = df_count.rename(columns={0: "depth", 1: "duration"})
#calculate Delta between depth i and depth i-1 (i.e. to the predecessor), replace NaN with "0".
df_count["delta_depth"] = df_count["depth"].diff()
df_count=df_count.fillna(0)
#Divide delta depth by number of time steps if 0.009 < delta depth < 0.011
df_count["inter_depth"] = np.where(np.logical_and(df_count['delta_depth'] > 0.009, df_count['delta_depth'] < 0.011),df_count["delta_depth"] / df_count["duration"],0)
li2=df_count.values.tolist()
li_depth = df_count['depth'].tolist()
li_delta = df_count['delta_depth'].tolist()
li_duration = df_count['duration'].tolist()
li_inter = df_count['inter_depth'].tolist()
#empty List of Lists with the number of elements of the sublist corresponding to the duration
out = []
for number in li_duration:
    out.append(li_inter[:number])
#Pass values from interpolated depth to the respective sublists --> Liste 1
out = [[i]*j for i, j in zip(li_inter, [len(j) for j in out])]
#Transfer elements from delta_depth to sublists --> Liste 2
def extractDigits(lst):
    return list(map(lambda el: [el], lst))
lst = extractDigits(li_delta)
#Merge list 1 and list 2
list1 = out
list2 = lst
new_list = []
for l1, l2 in zip_longest(list1, list2, fillvalue=[]):
    new_list.append([y if y else x for x, y in zip_longest(l1, l2)])
new_list
After merging, the first element of each sublist is the original depth value, followed by the interpolated values. But the sublists should contain only interpolated values.
Now I have the following questions:
Is there, in general, a better approach to this problem?
How could I solve the problem with merging, or ...
... find a way to override the wrong first elements in the sublists?
The desired result would look something like this.
Any help would be much appreciated, as I'm very inexperienced in Python and totally stuck.
I am sure someone could write something prettier, but I think this will work just fine. It ended up as some kind of messy scripting, but I think it will do what you need it to:
# helper1: depth values shifted down by two rows relative to the original column
_list_helper1 = df["Depth [m]"].to_list()
_list_helper1.insert(0, 0)
_list_helper1.insert(0, 0)
_list_helper1 = _list_helper1[:-2]
df["helper1"] = _list_helper1
_list = df["Depth [m]"].to_list() # grab all depth values
_list.insert(0, 0) # insert a value at the beginning to offset from original col
_list = _list[0:-1] # Delete the very last item
df["helper"] = _list # add the list to a helper col which is now offset
df["delta depth"] = df["Depth [m]"] - df["helper"] # subtract helper col from original
_id = 0
for i in range(len(df)):
    if df.loc[i, "Depth [m]"] == df.loc[i, "helper"]:
        break_val = df.loc[i, "Depth [m]"]
        break_val_2 = df.loc[i + 1, "Depth [m]"]
        if break_val_2 == break_val:
            df.loc[i, "IDcol"] = _id
            df.loc[i + 1, "IDcol"] = _id
        else:
            _id += 1
depth = df["IDcol"].to_list()
depth = list(dict.fromkeys(depth))
depth = [x for x in depth if str(x) != 'nan']
increments = []
for i in depth:
    _df = df.copy()
    _df = _df[_df["IDcol"] == i]
    _df.reset_index(inplace=True, drop=True)
    div_by = len(_df)
    increment = _df.loc[0, "helper"] - _df.loc[0, "helper1"]
    _df["delta depth"] = increment / div_by
    _increment = increment / div_by
    base_value = _df.loc[0, "Depth [m]"]
    for y in range(div_by):
        _df.loc[y, "Depth [m]"] = base_value + ((y + 1) * _increment)
    increments.append(_df)
df["IDcol"] = df["IDcol"].fillna("KEEP")
df = df[df["IDcol"] == "KEEP"]
increments.append(df)
df = pd.concat(increments)
df = df.fillna(0)
df = df[["index", "Depth [m]", "delta depth", "IDcol"]] # and whatever other cols u want
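For reference, a much shorter route is possible if plain linear interpolation between repeated depths is acceptable: keep the first value of each run of identical depths, blank out the repeats, and let pandas fill them towards the next distinct value. This is only a sketch, it assumes the depth column is named "Depth [m]" as above, and it interpolates every run of repeats rather than only those where the step is exactly 0.01, so that condition would still need to be added if it matters:
import pandas as pd

# df is assumed to already hold the logged data with a "Depth [m]" column
depth = df["Depth [m]"]

# True for every row that repeats the previous depth value
is_repeat = depth.eq(depth.shift())

# blank out the repeats and interpolate linearly towards the next distinct depth;
# trailing repeats at the very end of the file simply keep the last known value
df["Depth interp"] = depth.mask(is_repeat).interpolate(method="linear")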
I need to find a more efficient solution for the following problem:
Given a dataframe with 4 variables in each row, I need to find the list of 8 elements that covers all the variables of a row for the maximum number of rows.
A working, but very slow, solution is to create a second dataframe containing all possible combinations (basically a permutation without repetition), then loop through every combination and compare it with the initial dataframe. The number of matching rows is counted and added to the second dataframe.
import numpy as np
import pandas as pd
from itertools import combinations
df = pd.DataFrame(np.random.randint(0,20,size=(100, 4)), columns=list('ABCD'))
df = 'x' + df.astype(str)
listofvalues = df['A'].tolist()
listofvalues.extend(df['B'].tolist())
listofvalues.extend(df['C'].tolist())
listofvalues.extend(df['D'].tolist())
listofvalues = list(dict.fromkeys(listofvalues))
possiblecombinations = list(combinations(listofvalues, 6))
dfcombi = pd.DataFrame(possiblecombinations, columns = ['M','N','O','P','Q','R'])
dfcombi['List'] = dfcombi.M.map(str) + ',' + dfcombi.N.map(str) + ',' + dfcombi.O.map(str) + ',' + dfcombi.P.map(str) + ',' + dfcombi.Q.map(str) + ',' + dfcombi.R.map(str)
dfcombi['Count'] = ''
for x, row in dfcombi.iterrows():
    comparelist = row['List'].split(',')
    pointercounter = df.index[(df['A'].isin(comparelist) == True) & (df['B'].isin(comparelist) == True) & (df['C'].isin(comparelist) == True) & (df['D'].isin(comparelist) == True)].tolist()
    row['Count'] = len(pointercounter)
I assume there must be a way to avoid the for loop and replace it with some vectorized lookup, I just cannot figure out how.
Thanks!
Your code can be rewritten as:
# working with integers is much better than strings
enums, codes = df.stack().factorize()
# encodings of df
s = [set(x) for x in enums.reshape(-1,4)]
# possible combinations
from itertools import combinations, product
possiblecombinations = np.array([set(x) for x in combinations(range(len(codes)), 6)])
# count the combination with issubset
ret = [0]*len(possiblecombinations)
for a, (i, b) in product(s, enumerate(possiblecombinations)):
    ret[i] += a.issubset(b)
# the combination with maximum count
max_combination = possiblecombinations[np.argmax(ret)]
# in code {0, 3, 4, 5, 17, 18}
# and in values:
codes[list(max_combination)]
# Index(['x5', 'x15', 'x12', 'x8', 'x0', 'x6'], dtype='object')
All that took about 2 seconds, as opposed to your code, which took around 1.5 minutes.
cust_id = semi_final_df['0_x'].tolist()
date = semi_final_df[1].tolist()
total_amount = semi_final_df[0].tolist()
prod_num = semi_final_df['0_y'].tolist()
prod_deduped = []
quant_cleaned = []
product_net_amount = []
cust_id_final = []
date_final = []
for row in total_amount:
    quant_cleaned.append(float(row))

for unique_prodz in prod_num:
    if unique_prodz not in prod_deduped:
        prod_deduped.append(unique_prodz)

for unique_product in prod_deduped:
    indices = [i for i, x in enumerate(prod_num) if x == unique_product]
    product_total = 0
    for index in indices:
        product_total += quant_cleaned[index]
    product_net_amount.append(product_total)
    first_index = prod_num.index(unique_product)
    cust_id_final.append(cust_id[first_index])
    date_final.append(date[first_index])
The code above sums the amount by a single condition in order to get the total per invoice.
The data has multiple lines that share the same invoice/product number.
Problem:
I need to modify the below code so that I can sum by unique product and unique date.
I have given it a go, but I am getting a ValueError
saying x, y is not in a list.
As far as I understand, the issue is that I am zipping two de-duped lists of different lengths together and then attempting to loop through the result inline.
This line causes the error
for i,[x, y] in enumerate(zipped_list):
Any help would be sincerely appreciated. Here is the second batch of code with comments.
from itertools import zip_longest
#I have not included the code for the three lists below, but you can assume they are populated, as these are the lists that I will be working off of. They are of the same length.
prod_numbers = []
datesz = []
converted_quant = []
#Code to dedupe date and product which will end up being different lengths. These two lists are populated by the two for loops below
prod_deduped = []
dates_deduped = []
for unique_prodz in prod_numbers:
    if unique_prodz not in prod_deduped:
        prod_deduped.append(unique_prodz)

for unique_date in datesz:
    if unique_date not in dates_deduped:
        dates_deduped.append(unique_date)
#Now for the fun part. Time to sum by date and product. The three lists below are empty until we run the code
converted_net_amount = []
prod_id_final = []
date_final = []
#I zipped the list together using itertools which I imported at the top
for unique_product, unique_date in zip_longest(prod_deduped, dates_deduped, fillvalue=''):
    indices = []
    zipped_object = zip(prod_numbers, datesz)
    zipped_list = list(zipped_object)
    for i, [x, y] in enumerate(zipped_list):
        if x == unique_product and y == unique_date:
            indices.append(i)
    converted_total = 0
    for index in indices:
        converted_total += converted_quant[index]
    converted_net_amount.append[converted_total]
    first_index = zipped_list.index([unique_product, unique_date])
    prod_id_final.append(prod_numbers[first_index])
    date_final.append(datesz[first_index])
from collections import defaultdict

summed_dictionary = defaultdict(int)
# x = product, y = date, z = amount; iterate the three source lists together
for x, y, z in zip(prod_numbers, datesz, converted_quant):
    summed_dictionary[(x, y)] += z
Using defaultdict should solve your problem and is a lot easier on the eyes than all your code above. I saw this on reddit this morning and figured you crossposted. Credit to the guy from reddit on /r/learnpython
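If the three output lists from the original code are still needed, they can be rebuilt from the dictionary afterwards. A minimal sketch, assuming summed_dictionary was filled as above:
converted_net_amount = []
prod_id_final = []
date_final = []

# each key is a (product, date) pair, each value is the summed amount for that pair
for (product, date), total in summed_dictionary.items():
    prod_id_final.append(product)
    date_final.append(date)
    converted_net_amount.append(total)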
I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row on, I want to use the previous row's value to iterate through the cx function.
import pandas as pd

data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = cx(df['x'])
The final dataframe
I am not sure how to do this.
Thank you for your help
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be previous row
    return z

depth_cal = depth_cal(df['depth'])  # how to set d as previous row
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put list in a column

df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. It does not put the depth_cal list properly into a column.
2. In the depth_cal function, I want d to be the previous row.
Thank you
I would do this by just using a loop to generate your new data. It might not be ideal if the data is particularly huge, but it's a quick operation. Let me know how you get on with this:
import pandas as pd

data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

# work on a plain list, seeded with the starting value
res = data['depth']
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1

df['new_depth'] = res
print(df)
To get
   depth  new_depth
0      1      -5.63
1      2      17.89
2      3     -52.67
3      4     159.01
4      5    -476.03
So, among the selected values, I want to calculate the median value.
arcpy.env.workspace = r"Database Connections\local.sde"
pLoc = "local.DBO.Parcels"
luLoc = "local.DBO.Land_Use"
luFields = ["MedYrBlt","MedVal","OCCount"]
arcpy.MakeFeatureLayer_management(pLoc,"cities_lyr")
arcpy.SelectLayerByAttribute_management("cities_lyr", "NEW_SELECTION", "YrBlt > 1000")
From the selected cities_lyr I want to calculate the mean value of the YrBlt field.
with arcpy.da.SearchCursor(luLoc, ["OID@", "SHAPE@", luFields[0], luFields[1], luFields[2]]) as cursor:
    for row in cursor:
        if arcpy.Exists('in_memory/stats'):
            arcpy.Delete_management(r'in_memory/stats')
        arcpy.SelectLayerByLocation_management('cities_lyr', select_features=row[1])
        arcpy.Statistics_analysis('cities_lyr', 'in_memory/stats', 'YrBlt MEAN', 'OBJECTID')
Here comes a question:
I just want to see the mean value; how can I do that?
The fields in luFields = ["MedYrBlt", "MedVal", "OCCount"] are going to be used later and are not important for now.
Append values to an empty array and then calculate mean of that array. For example:
# Create array & cycle through years, append values to array
yrArray = []
for row in cursor:
    # assumes a classic arcpy.SearchCursor row; with an arcpy.da cursor, index the field position instead
    val = row.getValue("YrBlt")
    yrArray.append(val)

# get sum of all values in array
x = 0
for i in yrArray:
    x += i

# get average by dividing above sum by the length of the array
meanYrBlt = x / len(yrArray)
On another note it may be beneficial to separate these processes out into their own classes. For example:
class arrayAvg:
    def __init__(self, array):
        x = 0
        for i in array:
            x += i  # sum the values
        arrayLength = len(array)
        arrayAvg = x / arrayLength
        self.avg = arrayAvg
        self.count = arrayLength
This way you can reuse the code by calling:
yrBltAvg = arrayAvg(yrArray)
avg = yrBltAvg.avg #returns average
count = yrBltAvg.count #returns count
The second portion is unnecessary, but allows you to take advantage of object oriented programming, and you can expand upon that throughout the program.
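Another option, since the question already builds a statistics table with arcpy.Statistics_analysis, is to read the mean straight back out of that in_memory table with a search cursor. This is only a sketch; the name of the output statistic field (typically something like MEAN_YrBlt) is an assumption and should be checked against the table that actually gets created:
import arcpy

# "MEAN_YrBlt" is the assumed name of the statistic field produced by
# Statistics_analysis for 'YrBlt MEAN' -- verify it on the real output table
with arcpy.da.SearchCursor('in_memory/stats', ['MEAN_YrBlt']) as stats_cursor:
    for stats_row in stats_cursor:
        print(stats_row[0])  # mean YrBlt for the current selection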