I have a loop that runs through a DataFrame A of 30,000 rows, updates another DataFrame B, and uses the updated B for further iterations. It's taking too much time, and I want to make it faster. Any ideas?
for x in range(0, dataframeA.shape[0]):
    AuthorizationID_Temp = dataframeA["AuthorizationID"].iloc[x]
    Auth_BeginDate = dataframeA["BeginDate"].iloc[x]
    Auth_EndDate = dataframeA["EndDate"].iloc[x]
    BeginDate_Temp = pd.to_datetime(Auth_BeginDate).date()
    # assumed: EndDate_Temp is derived the same way as BeginDate_Temp (it is used in the filters below);
    # Units_Temp and EndDate_Temp_DO come from elsewhere in the full script
    EndDate_Temp = pd.to_datetime(Auth_EndDate).date()
    ScriptsFlag = dataframeA["ScriptsFlag"].iloc[x]
    Legacy_PlacementID = dataframeA["Legacy_PlacementID"].iloc[x]
    Legacy_AncillaryServicesID = dataframeA["Legacy_AncillaryServicesID"].iloc[x]
    ProviderID_Temp = dataframeA["ProviderID"].iloc[x]
    SRSProcode_Temp = dataframeA["SRSProcode"].iloc[x]
    Rate_Temp = dataframeA["Rate"].iloc[x]
    Scripts2["BeginDate1_SC"] = pd.to_datetime(Scripts2["BeginDate_SC"]).dt.date
    Scripts2["EndDate1_SC"] = pd.to_datetime(Scripts2["EndDate_SC"]).dt.date
    Scripts_New_Modified1 = Scripts2.loc[
        (Scripts2["ScriptsFlag_SC"].isin(["N", "M"]))
        & (Scripts2["AuthorizationID_SC"] == AuthorizationID_Temp)
        & (Scripts2["ProviderID_SC"] == ProviderID_Temp)
        & (Scripts2["SRSProcode_SC"] == SRSProcode_Temp),
        :,
    ]
    Scripts_New_Modified = Scripts_New_Modified1.loc[
        (Scripts_New_Modified1["BeginDate1_SC"] == BeginDate_Temp)
        & (Scripts_New_Modified1["EndDate1_SC"] == EndDate_Temp)
        & (Scripts_New_Modified1["Rate_SC"] == Rate_Temp),
        "AuthorizationID_SC",
    ]
    if ScriptsFlag == "M":
        if Legacy_PlacementID is not None:
            InsertA = insertA(AuthorizationID_Temp, BeginDate_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
            dataframeB = dataframeB.append(InsertA)
            print("ScriptsTemp6 shape is {}".format(dataframeB.shape))
        # else:
        #     ScriptsTemp6 = ScriptsTemp5.copy()
        #     print('ScriptsTemp6 shape is {}'.format(ScriptsTemp6.shape))
        if Legacy_AncillaryServicesID is not None:
            InsertB = insertB(AuthorizationID_Temp, BeginDate_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
            dataframeB = dataframeB.append(InsertB)
            print("ScriptsTemp7 shape is {}".format(dataframeB.shape))
    dataframe_New = dataframeB.loc[
        (dataframeB["ScriptsFlag"] == "N")
        & (dataframeB["AuthorizationID"] == AuthorizationID_Temp)
        & (dataframeB["ProviderID"] == ProviderID_Temp)
        & (dataframeB["SRSProcode"] == SRSProcode_Temp),
        :,
    ]
    dataframe_New1 = dataframe_New.loc[
        (pd.to_datetime(dataframe_New["BeginDate"]).dt.date == BeginDate_Temp)
        & (pd.to_datetime(dataframe_New["EndDate"]).dt.date == EndDate_Temp_DO)
        & (dataframe_New["Rate"] == Rate_Temp),
        "AuthorizationID",
    ]
    # PLAATN = dataframeA.copy()
    Insert1 = insert1(dataframe_New1, BeginDate_Temp, AuthorizationID_Temp, EndDate_Temp, Units_Temp, EndDate_Temp_DO)
    if Insert1.shape[0] > 0:
        dataframeB = dataframeB.append(Insert1.iloc[0])
    # else:
    #     ScriptsTemp8 = ScriptsTemp7
    print("ScriptsTemp8 shape is {}".format(dataframeB.shape))
    dataframe_modified1 = dataframeB.loc[
        (dataframeB["ScriptsFlag"] == "M")
        & (dataframeB["AuthorizationID"] == AuthorizationID_Temp)
        & (dataframeB["ProviderID"] == ProviderID_Temp)
        & (dataframeB["SRSProcode"] == SRSProcode_Temp),
        :,
    ]
    dataframe_modified = dataframe_modified1.loc[
        (dataframe_modified1["BeginDate"] == BeginDate_Temp)
        & (dataframe_modified1["EndDate"] == EndDate_Temp_DO)
        & (dataframe_modified1["Rate"] == Rate_Temp),
        "AuthorizationID",
    ]
    Insert2 = insert2(
        dataframe_modified,
        Scripts_New_Modified,
        AuthorizationID_Temp,
        BeginDate_Temp,
        EndDate_Temp,
        Units_Temp,
        EndDate_Temp_DO,
    )
    if Insert2.shape[0] > 0:
        dataframeB = dataframeB.append(Insert2.iloc[0])
dataframeA has 30,000 rows. New rows from dataframeA must be inserted into dataframeB on every iteration (30,000 iterations), and the updated dataframeB is used in the middle of each iteration for the filtering conditions. insertA and insertB are two functions that do additional filtering. Because of all this, the loop takes far too long to run for 30,000 rows. Please provide suggestions for reducing the loop's execution time.
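One general observation (not a diagnosis of the full script): DataFrame.append copies the entire frame on every call, so 30,000 appends cost quadratic time. Where a new row is not needed by the in-loop filters, the usual fix is to collect plain dicts in a list and build the frame once; since this question states the updated dataframeB is consulted mid-iteration, the pattern below would apply only to rows whose insertion can be deferred. A minimal sketch with hypothetical columns:
import pandas as pd

pending_rows = []  # collect plain dicts instead of calling DataFrame.append per row

for x in range(30000):
    # ... per-row filtering logic as above would go here ...
    pending_rows.append({"AuthorizationID": x, "Rate": 0.0})  # hypothetical columns

# one build at the end: O(n) instead of O(n^2) for repeated appends
dataframeB = pd.DataFrame(pending_rows)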
Related
As shown in the code posted below in the section DecoupleGridCellsProfilerLoopsPool, run() is called as many times as there are items in self.__listOfLoopDecouplers, and it works as it is supposed to; I mean, the parallelization is working duly.
As shown in the same section, DecoupleGridCellsProfilerLoopsPool.pool.map returns results and I populate some lists. Let's discuss the list named self.__iterablesOfZeroCoverageCell: it contains a number of objects of type gridCellInnerLoopsIteratorsForZeroCoverageModel.
After that, I create the pool ZeroCoverageCellsProcessingPool with the code posted below as well.
The problem I am facing is that the parallelized code in ZeroCoverageCellsProcessingPool is very slow, and the visualisation of the CPU tasks shows that no processes are working in parallel, as shown in the video at the URL posted below.
I was suspicious about pickling issues when parallelizing the code in ZeroCoverageCellsProcessingPool, so I removed the entire body of run() in ZeroCoverageCellsProcessingPool; however, this showed no change in the behaviour of the parallelized code.
The URL posted below also shows how the parallelized method of ZeroCoverageCellsProcessingPool behaves.
Given the code posted below, please let me know why the parallelization does not work for the code in ZeroCoverageCellsProcessingPool.
Output URL: please click the link
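One quick way to test the pickling suspicion (a sketch; `task` stands for one element of the iterable handed to pool.map) is to measure how large and slow a single serialized task is:
import pickle
import time

def report_pickle_cost(task):
    # Pool.map pays this serialization cost once per mapped item.
    t0 = time.time()
    payload = pickle.dumps(task)
    print(f"pickled size: {len(payload) / 1e6:.2f} MB, time: {time.time() - t0:.3f} s")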
DecoupleGridCellsProfilerLoopsPool:
def postTask(self):
    self.__postTaskStartTime = time.time()
    with Pool(processes=int(config['MULTIPROCESSING']['proceses_count'])) as DecoupleGridCellsProfilerLoopsPool.pool:
        self.__chunkSize = PoolUtils.getChunkSize(lst=self.__listOfLoopDecouplers, cpuCount=int(config['MULTIPROCESSING']['cpu_count']))
        logger.info(f"DecoupleGridCellsProfilerLoopsPool.self.__chunkSize(task per processor):{self.__chunkSize}")
        for res in DecoupleGridCellsProfilerLoopsPool.pool.map(self.run, self.__listOfLoopDecouplers, chunksize=self.__chunkSize):
            if res[0] is not None and res[1] is None and res[2] is None:
                self.__iterablesOfNoneZeroCoverageCell.append(res[0])
            elif res[1] is not None and res[0] is None and res[2] is None:
                self.__iterablesOfZeroCoverageCell.append(res[1])
            elif res[2] is not None and res[0] is None and res[1] is None:
                self.__iterablesOfNoDataCells.append(res[2])
            else:
                raise Exception(f"WTF.")
        DecoupleGridCellsProfilerLoopsPool.pool.join()
    assert len(self.__iterablesOfNoneZeroCoverageCell) + len(self.__iterablesOfZeroCoverageCell) + len(self.__iterablesOfNoDataCells) == len(self.__listOfLoopDecouplers)
    zeroCoverageCellsProcessingPool = ZeroCoverageCellsProcessingPool(self.__devModeForWSAWANTIVer2, self.__iterablesOfZeroCoverageCell)
    zeroCoverageCellsProcessingPool.postTask()
def run(self, param: LoopDecoupler):
    row = param.getRowValue()
    col = param.getColValue()
    elevationsTIFFWindowedSegmentContents = param.getElevationsTIFFWindowedSegment()
    verticalStep = param.getVericalStep()
    horizontalStep = param.getHorizontalStep()
    mainTIFFImageDatasetContents = param.getMainTIFFImageDatasetContents()
    NDVIsTIFFWindowedSegmentContentsInEPSG25832 = param.getNDVIsTIFFWindowedSegmentContentsInEPSG25832()
    URLOrFilePathForElevationsTIFFDatasetInEPSG25832 = param.getURLOrFilePathForElevationsTIFFDatasetInEPSG25832()
    threshold = param.getThreshold()
    rowsCnt = 0
    colsCnt = 0
    pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt = 0
    pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = int(config['window']['width']) * int(config['window']['height'])
    pixelsWithNoDataValueInTIFFImageDatasetCnt = int(config['window']['width']) * int(config['window']['height'])
    _pixelsValuesSatisfyThresholdInNoneZeroCoverageCell = []
    _pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell = []
    _pixelsValuesInNoDataCell = []
    gridCellInnerLoopsIteratorsForNoneZeroCoverageModel = None
    gridCellInnerLoopsIteratorsForZeroCoverageModel = None
    gridCellInnerLoopsIteratorsForNoDataCellsModel = None
    for x in range(row, row + verticalStep):
        if rowsCnt == verticalStep:
            rowsCnt = 0
        for y in range(col, col + horizontalStep):
            if colsCnt == horizontalStep:
                colsCnt = 0
            pixelValue = mainTIFFImageDatasetContents[0][x][y]
            # windowIOUtils.writeContentsToFile(windowIOUtils.getPathToOutputDir()+"/"+config['window']['file_name']+".{0}".format(config['window']['file_extension']), "pixelValue:{0}\n".format(pixelValue))
            if pixelValue >= float(threshold):
                pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt += 1
                _pixelsValuesSatisfyThresholdInNoneZeroCoverageCell.append(elevationsTIFFWindowedSegmentContents[0][rowsCnt][colsCnt])
            elif (pixelValue < float(threshold)) and (pixelValue > float(config['TIFF']['no_data_value'])):
                pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt -= 1
                _pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell.append(elevationsTIFFWindowedSegmentContents[0][rowsCnt][colsCnt])
            elif pixelValue <= float(config['TIFF']['no_data_value']):
                pixelsWithNoDataValueInTIFFImageDatasetCnt -= 1
                _pixelsValuesInNoDataCell.append(elevationsTIFFWindowedSegmentContents[0][rowsCnt][colsCnt])
            else:
                raise Exception("WTF.Exception: unhandled condition for pixel value: {0}".format(pixelValue))
            # _pixelCoordinatesInWindow.append([x,y])
            colsCnt += 1
        rowsCnt += 1
    '''Grid-cell classification'''
    if pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt > 0:
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel = GridCellInnerLoopsIteratorsForNoneZeroCoverageModel()
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setRowValue(row)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setColValue(col)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setVericalStep(verticalStep)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setHorizontalStep(horizontalStep)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setMainTIFFImageDatasetContents(mainTIFFImageDatasetContents)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setNDVIsTIFFWindowedSegmentContentsInEPSG25832(NDVIsTIFFWindowedSegmentContentsInEPSG25832)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setURLOrFilePathForElevationsTIFFDatasetInEPSG25832(URLOrFilePathForElevationsTIFFDatasetInEPSG25832)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setPixelsValuesSatisfyThresholdInTIFFImageDatasetCnt(pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt)
        gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setPixelsValuesSatisfyThresholdInNoneZeroCoverageCell(_pixelsValuesSatisfyThresholdInNoneZeroCoverageCell)
    elif pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt < (int(config['window']['width']) * int(config['window']['height'])) and pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt >= 0:
        gridCellInnerLoopsIteratorsForZeroCoverageModel = GridCellInnerLoopsIteratorsForZeroCoverageModel()
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setRowValue(row)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setColValue(col)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setVericalStep(verticalStep)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setHorizontalStep(horizontalStep)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setMainTIFFImageDatasetContents(mainTIFFImageDatasetContents)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setNDVIsTIFFWindowedSegmentContentsInEPSG25832(NDVIsTIFFWindowedSegmentContentsInEPSG25832)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setURLOrFilePathForElevationsTIFFDatasetInEPSG25832(URLOrFilePathForElevationsTIFFDatasetInEPSG25832)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt(pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsWithNoDataValueInTIFFImageDatasetCnt(pixelsWithNoDataValueInTIFFImageDatasetCnt)
        gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsValuesDoNotSatisfyThresholdInZeroCoverageCell(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell)
    elif pixelsWithNoDataValueInTIFFImageDatasetCnt == 0:
        gridCellInnerLoopsIteratorsForNoDataCellsModel = GridCellInnerLoopsIteratorsForNoDataCellsModel()
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setRowValue(row)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setColValue(col)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setVericalStep(verticalStep)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setHorizontalStep(horizontalStep)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setMainTIFFImageDatasetContents(mainTIFFImageDatasetContents)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setNDVIsTIFFWindowedSegmentContentsInEPSG25832(NDVIsTIFFWindowedSegmentContentsInEPSG25832)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setURLOrFilePathForElevationsTIFFDatasetInEPSG25832(URLOrFilePathForElevationsTIFFDatasetInEPSG25832)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setPixelsWithNoDataValueInTIFFImageDatasetCnt(pixelsWithNoDataValueInTIFFImageDatasetCnt)
        gridCellInnerLoopsIteratorsForNoDataCellsModel.setPixelsValuesInNoDataCell(_pixelsValuesInNoDataCell)
        if gridCellInnerLoopsIteratorsForZeroCoverageModel is not None:
            gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsWithNoDataValueInTIFFImageDatasetCnt(pixelsWithNoDataValueInTIFFImageDatasetCnt)
    else:
        raise Exception(f"WTF.")
    return gridCellInnerLoopsIteratorsForNoneZeroCoverageModel, gridCellInnerLoopsIteratorsForZeroCoverageModel, gridCellInnerLoopsIteratorsForNoDataCellsModel
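As an aside on the inner double loop above: if mainTIFFImageDatasetContents[0] is a NumPy array (an assumption; rasterio reads do return ndarrays), the three per-pixel counts can be taken over the whole window at once with boolean masks. A minimal sketch with a hypothetical classify_window helper:
import numpy as np

def classify_window(window: np.ndarray, threshold: float, no_data_value: float):
    # Count the three pixel classes over the whole window in vectorized form
    # (assumes threshold > no_data_value, as in the loop above).
    satisfy_cnt = int(np.count_nonzero(window >= threshold))
    no_data_cnt = int(np.count_nonzero(window <= no_data_value))
    below_threshold_cnt = window.size - satisfy_cnt - no_data_cnt
    return satisfy_cnt, below_threshold_cnt, no_data_cnt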
ZeroCoverageCellsProcessingPool:
def postTask(self):
    self.__postTaskStartTime = time.time()
    """to collect results per each row"""
    resAllCellsForGridCellsClassifications = []
    # NDVIs
    resAllCellsForNDVITIFFDetailsForZeroCoverageCell = []
    # area of coverage
    resAllCellsForAreaOfCoverageForZeroCoverageCell = []
    # interception
    resAllCellsForInterceptionForZeroCoverageCell = []
    # fourCornersOfWindowInEPSG25832
    resAllCellsForFourCornersOfWindowInEPSG25832ZeroCoverageCell = []
    # outFromEPSG25832ToEPSG4326-lists
    resAllCellsForOutFromEPSG25832ToEPSG4326ForZeroCoverageCells = []
    # fourCornersOfWindowsAsGeoJSON
    resAllCellsForFourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell = []
    # calculatedCenterPointInEPSG25832
    resAllCellsForCalculatedCenterPointInEPSG25832ForZeroCoverageCell = []
    # centerPointsOfWindowInImageCoordinatesSystem
    resAllCellsForCenterPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell = []
    # pixelValuesOfCenterPoints
    resAllCellsForPixelValuesOfCenterPointsForZeroCoverageCell = []
    # centerPointOfKeyWindowAsGeoJSONInEPSG4326
    resAllCellsForCenterPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell = []
    # centerPointInEPSG4326
    resAllCellsForCenterPointInEPSG4326ForZeroCoveringCell = []
    # average heights
    resAllCellsForAverageHeightsForZeroCoverageCell = []
    # pixels values
    resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCell = []
    # area of coverage
    resAllCellsForAreaOfCoverageForZeroCoverageCell = []
    resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = []
    noneKeyWindowCnt = 0
    # center points as string
    centerPointsAsStringForZeroCoverageCell = ""
    with Pool(processes=int(config['MULTIPROCESSING']['proceses_count'])) as ZeroCoverageCellsProcessingPool.pool:
        self.__chunkSize = PoolUtils.getChunkSize(lst=self.__iterables, cpuCount=int(config['MULTIPROCESSING']['cpu_count']))
        logger.info(f"ZeroCoverageCellsProcessingPool.self.__chunkSize(task per processor):{self.__chunkSize}")
        for res in ZeroCoverageCellsProcessingPool.pool.map(func=self.run, iterable=self.__iterables, chunksize=self.__chunkSize):
            resAllCellsForGridCellsClassifications.append(res[0])
            # NDVIs
            resAllCellsForNDVITIFFDetailsForZeroCoverageCell.append(res[1])
            # area of coverage
            resAllCellsForAreaOfCoverageForZeroCoverageCell.append(res[2])
            # interception
            resAllCellsForInterceptionForZeroCoverageCell.append(res[3])
            # fourCornersOfWindowInEPSG25832
            resAllCellsForFourCornersOfWindowInEPSG25832ZeroCoverageCell.append(res[4])
            # outFromEPSG25832ToEPSG4326-lists
            resAllCellsForOutFromEPSG25832ToEPSG4326ForZeroCoverageCells.append(res[5])
            # fourCornersOfWindowsAsGeoJSONInEPSG4326
            resAllCellsForFourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell.append(res[6])
            # calculatedCenterPointInEPSG25832
            resAllCellsForCalculatedCenterPointInEPSG25832ForZeroCoverageCell.append(res[7])
            # centerPointsOfWindowInImageCoordinatesSystem
            resAllCellsForCenterPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell.append(res[8])
            # pixelValuesOfCenterPoints
            resAllCellsForPixelValuesOfCenterPointsForZeroCoverageCell.append(res[9])
            # centerPointInEPSG4326
            resAllCellsForCenterPointInEPSG4326ForZeroCoveringCell.append(res[10])
            # centerPointOfKeyWindowAsGeoJSONInEPSG4326
            resAllCellsForCenterPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell.append(res[11])
            # average heights
            resAllCellsForAverageHeightsForZeroCoverageCell.append(res[12])
            # pixels values
            resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCell.append(res[13])
            # pixelsValues cnt
            resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt.append(res[14])
            noneKeyWindowCnt += res[15]
            # centerPoints-As-String
            if res[16] is not None:
                centerPointsAsStringForZeroCoverageCell += str(res[16])
        assert noneKeyWindowCnt == len(self.__iterables)
        ZeroCoverageCellsProcessingPool.pool.close()
        ZeroCoverageCellsProcessingPool.pool.terminate()
        ZeroCoverageCellsProcessingPool.pool.join()
    return
def run(self, params: GridCellInnerLoopsIteratorsForZeroCoverageModel):
    if params is not None:
        logger.info(f"Processing zero coverage cell #(row{params.getRowValue()},col:{params.getColValue()})")
        row = params.getRowValue()
        col = params.getColValue()
        mainTIFFImageDatasetContents = params.getMainTIFFImageDatasetContents()
        NDVIsTIFFWindowedSegmentContentsInEPSG25832 = params.getNDVIsTIFFWindowedSegmentContentsInEPSG25832()
        URLOrFilePathForElevationsTIFFDatasetInEPSG25832 = params.getURLOrFilePathForElevationsTIFFDatasetInEPSG25832()
        datasetElevationsTIFFInEPSG25832 = rasterio.open(URLOrFilePathForElevationsTIFFDatasetInEPSG25832, 'r')
        _pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell = params.getPixelsValuesDoNotSatisfyThresholdInZeroCoverageCell()
        pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = params.getPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt()
        countOfNoDataCells = params.getPixelsWithNoDataValueInTIFFImageDatasetCnt()
        outFromEPSG25832ToEPSG4326ForZeroCoverageCells = []
        fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell = []
        ndviTIFFDetailsForZeroCoverageCell = NDVITIFFDetails(None, None, None).getNDVIValuePer10mX10m()
        """area of coverage per grid-cell"""
        areaOfCoverageForZeroCoverageCell = None
        """interception"""
        interceptionForZeroCoverageCell = None
        CntOfNDVIsWithNanValueInZeroCoverageCell = 0
        fourCornersOfWindowInEPSG25832ZeroCoverageCell = None
        outFromEPSG25832ToEPSG4326ForZeroCoverageCells = []
        fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell = []
        calculatedCenterPointInEPSG25832ForZeroCoverageCell = None
        centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell = None
        pixelValuesOfCenterPointsOfZeroCoverageCell = None
        centerPointInEPSG4326ForZeroCoveringCell = None
        centerPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell = None
        centerPointsAsStringForZeroCoverageCell = None
        """average heights"""
        averageHeightsForZeroCoverageCell = None
        gridCellClassifiedAs = GridCellClassifier.ZERO_COVERAGE_CELL.value
        cntOfNoneKeyWindow = 1
        ndviTIFFDetailsForZeroCoverageCell = NDVITIFFDetails(ulX=row // int(config['ndvi']['resolution_height']), ulY=col // int(config['ndvi']['resolution_width']), dataset=NDVIsTIFFWindowedSegmentContentsInEPSG25832).getNDVIValuePer10mX10m()
        """area of coverage per grid-cell"""
        areaOfCoverageForZeroCoverageCell = round(AreaOfCoverageDetails(pixelsCount=(int(config['window']['width']) * int(config['window']['height'])) - (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt + countOfNoDataCells)).getPercentageOfAreaOfCoverage(), 2)
        """interception"""
        if math.isnan(ndviTIFFDetailsForZeroCoverageCell):
            # ndviTIFFDetailsForZeroCoverageCell = 0
            CntOfNDVIsWithNanValueInZeroCoverageCell = 1
            interceptionForZeroCoverageCell = config['sentinel_values']['interception']
        else:
            Indvi = INDVI()
            Ic = Indvi.calcInterception(ndviTIFFDetailsForZeroCoverageCell)
            Pc = areaOfCoverageForZeroCoverageCell, """percentage of coverage"""
            Pnc = float((int(config['window']['width']) * int(config['window']['height'])) - areaOfCoverageForZeroCoverageCell), """percentage of non-coverage"""
            Inc = float(config['interception']['noneCoverage']), """interception of none-coverage"""
            I = (float(Pc[0]) * Ic) + float(Pnc[0] * Inc[0])
            interceptionForZeroCoverageCell = round(I, 2)
            if I != 10 and I != float('nan'):
                logger.error(f"ndviTIFFDetailsForZeroCoverageCell:{ndviTIFFDetailsForZeroCoverageCell}")
                logger.error(f"I:{I}")
        fourCornersOfWindowInEPSG25832ZeroCoverageCell = RasterIOPackageUtils.convertFourCornersOfWindowFromImageCoordinatesToCRSByCoordinatesOfCentersOfPixelsMethodFor(row, col, int(config['window']['height']), int(config['window']['width']), datasetElevationsTIFFInEPSG25832)
        for i in range(0, len(fourCornersOfWindowInEPSG25832ZeroCoverageCell)):
            # fourCornersOfKeyWindowInEPSG4326.append(RasterIOPackageUtils.convertCoordsToDestEPSGForDataset(fourCornersOfWindowInEPSG25832[i], datasetElevationsTIFFInEPSG25832, destEPSG=4326))
            outFromEPSG25832ToEPSG4326ForZeroCoverageCells.append(OSGEOUtils.fromEPSG25832ToEPSG4326(fourCornersOfWindowInEPSG25832ZeroCoverageCell[i]))  # resultant coords order is lat,lon but it must be lon,lat; thus out[1] is lat and out[0] is lon
        """fourCornersOfWindowsAsGeoJSONInEPSG4326"""
        fourCornersOfWindowInEPSG4326 = []
        for i in range(0, len(outFromEPSG25832ToEPSG4326ForZeroCoverageCells)):
            fourCornersOfWindowInEPSG4326.append(([outFromEPSG25832ToEPSG4326ForZeroCoverageCells[i][1]], [outFromEPSG25832ToEPSG4326ForZeroCoverageCells[i][0]]))
        fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell.append(jsonUtils.buildFeatureCollectionAsGeoJSONForFourCornersOfKeyWindow(fourCornersOfWindowInEPSG4326[0], fourCornersOfWindowInEPSG4326[1], fourCornersOfWindowInEPSG4326[2], fourCornersOfWindowInEPSG4326[3], areaOfCoverageForZeroCoverageCell))
        # debugIOUtils.writeContentsToFile(debugIOUtils.getPathToOutputDir()+"/"+"NDVIsPer10mX10mForKeyWindow"+config['window']['file_name']+".{0}".format(config['window']['file_extension']),"{0}\n".format(NDVIsPer10mX10mForKeyWindow))
        """
        building a geojson object for a point ("center-point") to visualize it
        """
        calculatedCenterPointInEPSG25832ForZeroCoverageCell = MiscUtils.calculateCenterPointsGivenLLOfGridCell(fourCornersOfWindowInEPSG25832ZeroCoverageCell[1])  # lower-left corner
        centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell = RasterIOPackageUtils.convertFromCRSToImageCoordinatesSystemFor(calculatedCenterPointInEPSG25832ForZeroCoverageCell[0], calculatedCenterPointInEPSG25832ForZeroCoverageCell[1], datasetElevationsTIFFInEPSG25832)
        pixelValuesOfCenterPointsOfZeroCoverageCell = mainTIFFImageDatasetContents[0][centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell[0]][centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell[1]]
        centerPointInEPSG4326ForZeroCoveringCell = RasterIOPackageUtils.convertCoordsToDestEPSGForDataset(calculatedCenterPointInEPSG25832ForZeroCoverageCell, datasetElevationsTIFFInEPSG25832, destEPSG=4326)
        centerPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell = jsonUtils.buildGeoJSONForPointFor(centerPointInEPSG4326ForZeroCoveringCell)
        """average heights"""
        averageHeightsForZeroCoverageCell = round(MiscUtils.getAverageFor(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell), 2)
        assert len(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell) > 0 and (len(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell) <= (int(config['window']['width']) * int(config['window']['height'])))
        """the following code block is for assertion only"""
        if self.__devModeForWSAWANTIVer2 == config['DEVELOPMENT_MODE']['debug']:
            assert pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt >= 0 and (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt < (int(config['window']['width']) * int(config['window']['height'])))
            assert (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt + countOfNoDataCells) == (int(config['window']['width']) * int(config['window']['height']))
            print(f"profiling for gridCellClassifiedAs:{gridCellClassifiedAs}....>pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt:{pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt}")
            print(f"profiling for gridCellClassifiedAs:{gridCellClassifiedAs}....>countOfNoDataCells:{countOfNoDataCells}")
            pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = (int(config['window']['width']) * int(config['window']['height'])) - (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt + countOfNoDataCells)
            assert pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt == 0, (f"WTF.")
            print(f"profiling for gridCellClassifiedAs:{gridCellClassifiedAs}....>computed pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt:{pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt}")
            print(f"\n")
        centerPointAsTextInWKTInEPSG3857 = CoordinatesUtils.buildWKTPointFormatForSinglePointFor(calculatedCenterPointInEPSG25832ForZeroCoverageCell[0], calculatedCenterPointInEPSG25832ForZeroCoverageCell[1])
        s = centerPointAsTextInWKTInEPSG3857.replace("POINT", "")
        s = s.replace("(", "")
        s = s.replace(")", "")
        s = s.strip()
        s = s.split(" ")
        centerPointsAsStringForZeroCoverageCell = s[0] + "\t" + s[1] + "\n"
        centerPointsAsStringForZeroCoverageCell = centerPointsAsStringForZeroCoverageCell.replace('\'', "")
        return gridCellClassifiedAs, ndviTIFFDetailsForZeroCoverageCell, areaOfCoverageForZeroCoverageCell, interceptionForZeroCoverageCell, fourCornersOfWindowInEPSG25832ZeroCoverageCell, outFromEPSG25832ToEPSG4326ForZeroCoverageCells, fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell, calculatedCenterPointInEPSG25832ForZeroCoverageCell, centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell, pixelValuesOfCenterPointsOfZeroCoverageCell, centerPointInEPSG4326ForZeroCoveringCell, centerPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell, averageHeightsForZeroCoverageCell, np.array(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell).tolist(), pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt, cntOfNoneKeyWindow, centerPointsAsStringForZeroCoverageCell, CntOfNDVIsWithNanValueInZeroCoverageCell
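For what it's worth, a pool can look completely serial when each mapped item drags a large array through pickling (here every model object carries mainTIFFImageDatasetContents and the NDVI segment). A standalone sketch (toy data, not the classes above) that reproduces the effect:
import time
import numpy as np
from multiprocessing import Pool

def work(arg):
    big_array, i = arg
    # Trivial compute; runtime is dominated by transferring big_array to the worker.
    return float(big_array.sum()) + i

if __name__ == "__main__":
    big_array = np.zeros((2000, 2000))  # roughly 32 MB pickled per task
    tasks = [(big_array, i) for i in range(16)]
    t0 = time.time()
    with Pool(processes=4) as pool:
        results = pool.map(work, tasks)
    print(f"elapsed: {time.time() - t0:.2f} s")  # compare with calling work() in a plain loop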
I am trying to create an indicator that will find all the divergences between two signals.
The output of the function so far looks like this (screenshot omitted).
But the problem is that it is painfully slow when I try to use it with long signals. Could any of you help me make it faster, if possible?
My code:
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
from scipy.signal import argrelextrema
from sortedcontainers import SortedSet


def find_divergence(price: pd.Series, indicator: pd.Series, width_divergence: int, order: int):
    div = pd.DataFrame(index=range(price.size), columns=[
        f"Bullish_{width_divergence}_{order}",
        f"Berish_{width_divergence}_{order}"
    ])
    div[f'Bullish_idx_{width_divergence}_{order}'] = False
    div[f'Berish_idx_{width_divergence}_{order}'] = False

    def calc_argrelextrema(price_: np.ndarray):
        return argrelextrema(price_, np.less_equal, order=order)[0]

    price_ranges = []
    for i in range(len(price)):
        price_ranges.append(price.values[0:i + 1])

    f = []
    with ThreadPoolExecutor(max_workers=16) as exe:
        for i in price_ranges:
            f.append(exe.submit(calc_argrelextrema, i))

    prices_lows = SortedSet()
    for r in concurrent.futures.as_completed(f):
        data = r.result()
        for d in reversed(data):
            if d not in prices_lows:
                prices_lows.add(d)
            else:
                break

    price_lows_idx = pd.Series(prices_lows)

    for idx_1 in range(price_lows_idx.size):
        min_price = price[price_lows_idx[idx_1]]
        min_indicator = indicator[price_lows_idx[idx_1]]
        for idx_2 in range(idx_1 + 1, idx_1 + width_divergence):
            if idx_2 >= price_lows_idx.size:
                break
            if price[price_lows_idx[idx_2]] < min_price:
                min_price = price[price_lows_idx[idx_2]]
            if indicator[price_lows_idx[idx_2]] < min_indicator:
                min_indicator = indicator[price_lows_idx[idx_2]]
            consistency_price_rd = min_price == price[price_lows_idx[idx_2]]
            consistency_indicator_rd = min_indicator == indicator[price_lows_idx[idx_1]]
            consistency_price_hd = min_price == price[price_lows_idx[idx_1]]
            consistency_indicator_hd = min_indicator == indicator[price_lows_idx[idx_2]]
            diff_price = price[price_lows_idx[idx_1]] - price[price_lows_idx[idx_2]]  # should be neg
            diff_indicator = indicator[price_lows_idx[idx_1]] - indicator[price_lows_idx[idx_2]]  # should be pos
            is_regular_divergence = diff_price > 0 and diff_indicator < 0
            is_hidden_divergence = diff_price < 0 and diff_indicator > 0
            if is_regular_divergence and consistency_price_rd and consistency_indicator_rd:
                div.at[price_lows_idx[idx_2], f'Bullish_{width_divergence}_{order}'] = (price_lows_idx[idx_1], price_lows_idx[idx_2])
                div.at[price_lows_idx[idx_2], f'Bullish_idx_{width_divergence}_{order}'] = True
            elif is_hidden_divergence and consistency_price_hd and consistency_indicator_hd:
                div.at[price_lows_idx[idx_2], f'Berish_{width_divergence}_{order}'] = (price_lows_idx[idx_1], price_lows_idx[idx_2])
                div.at[price_lows_idx[idx_2], f'Berish_idx_{width_divergence}_{order}'] = True
    return div
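The dominant cost in this function is calling calc_argrelextrema on every prefix of the series, which is O(n) calls over arrays of growing length (quadratic work overall), and the GIL keeps the thread pool from helping. If evaluating the extrema once over the full series is acceptable for the use case (an assumption; it changes lookahead behaviour relative to the prefix-by-prefix version), the entire thread pool can be replaced by a single call. A minimal sketch:
import numpy as np
from scipy.signal import argrelextrema

def price_lows_once(price_values: np.ndarray, order: int) -> np.ndarray:
    # Indices of local minima computed once over the full array,
    # instead of one argrelextrema call per prefix.
    return argrelextrema(price_values, np.less_equal, order=order)[0]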
Working for a single symbol:
todate = zerodha.get_trade_day(datetime.now().astimezone(to_india) - timedelta(days=0))
fromdate = zerodha.get_trade_day(datetime.now().astimezone(to_india) - timedelta(days=5))
symbol = "ZINC20MAYFUT"
instype = "MCX"
Timeinterval = "5minute"
tradeDir = 0  # neutral
while True:
    histdata1 = zerodha.get_history(symbol, fromdate, todate, Timeinterval, instype)
    df = pd.DataFrame(histdata1)
    df = heikinashi(df)
    df = bollinger_bands(df, field='h_close', period=20, numsd=2)
    df1 = pd.DataFrame(df, columns=['date', 'volume', 'close', 'h_close', 'middle_band', 'upper_band'])
    pp = pd.DataFrame(df1.tail(3))
    print(pp)
    dfCToList = pp['h_close'].tolist()
    dfCList = list(pp['h_close'])
    dfHValues = pp['h_close'].values
    dfBMValues = pp['middle_band'].values
    H_last = dfHValues[2]  # tail 1
    BM_last = dfBMValues[2]  # tail 1
    if H_last > BM_last and (tradeDir == 0 or tradeDir == -1):
        print("buy")
        tradeDir = 1  # up
    if H_last < BM_last and (tradeDir == 0 or tradeDir == 1):
        print("SELL")
        tradeDir = -1  # down
    # pdb.set_trace()
Question: when the conditions are met, it prints "BUY"/"SELL" again and again. I want it to print just a single time, when the condition is first met.
todate = zerodha.get_trade_day(datetime.now().astimezone(to_india) - timedelta(days=0))
fromdate = zerodha.get_trade_day(datetime.now().astimezone(to_india) - timedelta(days=5))
tradeDir = 0  # neutral

def script():
    global tradeDir
    ## for historical data ##
    symbol = ["ZINC20MAYFUT", "CRUDEOIL20MAYFUT", "GOLD20JUNFUT"]
    instype = "MCX"
    Timeinterval = "5minute"
    for symbol in symbol:
        histdata1 = zerodha.get_history(symbol, fromdate, todate, Timeinterval, instype)
        df = pd.DataFrame(histdata1)
        df = heikinashi(df)
        df = bollinger_bands(df, field='h_close', period=20, numsd=2)
        df1 = pd.DataFrame(df, columns=['date', 'volume', 'close', 'h_close', 'middle_band', 'upper_band'])
        pp = pd.DataFrame(df1.tail(3))
        print(pp)
        dfCToList = pp['h_close'].tolist()
        dfCList = list(pp['h_close'])
        dfHValues = pp['h_close'].values
        dfBMValues = pp['middle_band'].values
        H_last = dfHValues[2]  # tail 1
        BM_last = dfBMValues[2]  # tail 1
        if H_last > BM_last and (tradeDir == 0 or tradeDir == -1):
            print("buy")
            tradeDir = 1  # up
        if H_last < BM_last and (tradeDir == 0 or tradeDir == 1):
            print("SELL")
            tradeDir = -1  # down
        # pdb.set_trace()

while True:
    try:
        script()
    except Exception as e:
        sleep(2)
        continue
When the conditions are met, it prints "BUY"/"SELL" again and again. I want it to print just once, the first time the condition is met, in the full script, and the script should keep running continuously.
If you want the code to stop looping after the first time it prints "buy" or "SELL", you just need to add a break statement after each of the prints (inside the scope of the containing if blocks).
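If instead the script must keep running (as the question asks) rather than stop at the first signal, note that the multi-symbol version shares a single tradeDir across all three instruments, so one symbol's signal suppresses the others'. A sketch (my reading of the intent, not the original code) that keeps per-symbol state so each symbol prints only on a direction change:
tradeDir = {}  # per-symbol direction: 1 = up, -1 = down, absent/0 = neutral

def update_direction(symbol: str, h_last: float, bm_last: float) -> None:
    prev = tradeDir.get(symbol, 0)
    if h_last > bm_last and prev in (0, -1):
        print(f"{symbol}: buy")
        tradeDir[symbol] = 1
    elif h_last < bm_last and prev in (0, 1):
        print(f"{symbol}: SELL")
        tradeDir[symbol] = -1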
Are there any good libraries/tools in Python for generating synthetic time series data from existing sample data? For example, I have sales data from January-June and would like to generate synthetic time series samples for July-December (keeping time series factors intact, like trend, seasonality, etc.).
Leaving the question about the quality of such data aside, here is a simple approach: you can use a Gaussian distribution to generate synthetic data based off a sample. Below is the critical part.
import numpy as np

# x: the original sample, an np.array of features
feature_means = np.mean(x, axis=1)
feature_std = np.std(x, axis=1)
random_normal_feature_values = np.random.normal(feature_means, feature_std)
Here is the fully functioning code I used:
def generate_synthetic_data(sample_dataset, window_mean, window_std, fixed_window=None, variance_range=1, sythesize_ratio=2, forced_reverse=False):
    synthetic_data = pd.DataFrame(columns=sample_dataset.columns)
    synthetic_data.insert(len(sample_dataset.columns), "synthesis_seq", [], True)
    for k in range(sythesize_ratio):
        if len(synthetic_data) >= len(sample_dataset) * sythesize_ratio:
            break
        # this loop generates a set that resembles the entire dataset
        country_synthetic = pd.DataFrame(columns=synthetic_data.columns)
        if fixed_window is not None:
            input_sequence_len = fixed_window
        else:
            input_sequence_len = int(np.random.normal(window_mean, window_std))
        # population data change
        country_data_i = sample_dataset
        if len(country_data_i) < input_sequence_len:
            continue
        feature_length = configuration['feature_length']  # number of features to be randomized
        country_data_array = country_data_i.to_numpy()
        country_data_array = country_data_array.T[:feature_length]
        country_data_array = country_data_array.reshape(feature_length, len(country_data_i))
        x = country_data_array[:feature_length].T
        reversed = np.random.normal(0, 1) > 0
        if reversed:
            x = x[::-1]
        sets = 0
        x_list = []
        dict_x = dict()
        for i in range(input_sequence_len):
            array_len = ((len(x) - i) - ((len(x) - i) % input_sequence_len)) + i
            if array_len <= 0:
                continue
            sets = int(array_len / input_sequence_len)
            if sets <= 0:
                continue
            x_temp = x[i:array_len].T.reshape(sets, feature_length, input_sequence_len)
            uniq_keys = np.array([i + (input_sequence_len * k) for k in range(sets)])
            x_temp = x_temp.reshape(feature_length, sets, input_sequence_len)
            arrays_split = np.hsplit(x_temp, sets)
            dict_x.update(dict(zip(uniq_keys, arrays_split)))
        temp_x_list = [dict_x[i].T for i in sorted(dict_x.keys())]
        temp_x_list = np.array(temp_x_list).squeeze()
        feature_means = np.mean(temp_x_list, axis=1)
        feature_std = np.std(temp_x_list, axis=1) / variance_range
        random_normal_feature_values = np.random.normal(feature_means, feature_std).T
        random_normal_feature_values = np.round(random_normal_feature_values, 0)
        random_normal_feature_values[random_normal_feature_values < 0] = 0
        if reversed:
            random_normal_feature_values = random_normal_feature_values.T[::-1]
            random_normal_feature_values = random_normal_feature_values.T
        for i in range(len(random_normal_feature_values)):
            country_synthetic[country_synthetic.columns[i]] = random_normal_feature_values[i]
        country_synthetic['synthesis_seq'] = k
        synthetic_data = synthetic_data.append(country_synthetic, ignore_index=True)
    return synthetic_data
for i in range(1):
    directory_name = '/synthetic_' + str(i)
    mypath = source_path + '/cleaned' + directory_name
    if not os.path.exists(mypath):
        os.mkdir(mypath)
    data = generate_synthetic_data(original_data, window_mean=0, window_std=0, fixed_window=2, variance_range=10**i, sythesize_ratio=1)
    synthetic_data.append(data)
    # data.to_csv(mypath+'/synthetic_'+str(i)+'_dt31_05_.csv', index=False)
    print('synth step : ', i, ' len : ', len(synthetic_data))
Good luck!
Background: I have a DataFrame whose values I need to update using some very specific conditions. The original implementation I inherited used a lot of nested if statements wrapped in a for loop, obfuscating what was going on. With readability primarily in mind, I rewrote it into this:
# Other Widgets
df.loc[(
    (df.product == 0) &
    (df.prod_type == 'OtherWidget') &
    (df.region == 'US')
), 'product'] = 5

# Supplier X - All clients
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'X')
), 'product'] = 6

# Supplier Y - Client A
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'A')
), 'product'] = 1

# Supplier Y - Client B
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'B')
), 'product'] = 3

# Supplier Y - Client C
df.loc[(
    (df.product == 0) &
    (df.region.isin(['UK', 'US'])) &
    (df.supplier == 'Y') &
    (df.client == 'C')
), 'product'] = 4
Problem: This works well and makes the conditions clear (in my opinion), but I'm not entirely happy because it's taking up a lot of space. Is there any way to improve this from a readability/conciseness perspective?
Per EdChum's recommendation, I created a mask for the conditions. The code below goes a bit overboard in terms of masking, but it gives the general sense.
prod_0 = (df.product == 0)
ptype_OW = (df.prod_type == 'OtherWidget')
rgn_UKUS = (df.region.isin(['UK', 'US']))
rgn_US = (df.region == 'US')
supp_X = (df.supplier == 'X')
supp_Y = (df.supplier == 'Y')
clnt_A = (df.client == 'A')
clnt_B = (df.client == 'B')
clnt_C = (df.client == 'C')

df.loc[(prod_0 & ptype_OW & rgn_US), 'product'] = 5
df.loc[(prod_0 & rgn_UKUS & supp_X), 'product'] = 6
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_A), 'product'] = 1
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_B), 'product'] = 3
df.loc[(prod_0 & rgn_UKUS & supp_Y & clnt_C), 'product'] = 4
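Another concise option (a sketch reusing the masks defined above) is np.select, which maps an ordered list of conditions to values in one statement; the first matching condition wins, mirroring the sequential .loc updates:
import numpy as np

conditions = [
    prod_0 & ptype_OW & rgn_US,
    prod_0 & rgn_UKUS & supp_X,
    prod_0 & rgn_UKUS & supp_Y & clnt_A,
    prod_0 & rgn_UKUS & supp_Y & clnt_B,
    prod_0 & rgn_UKUS & supp_Y & clnt_C,
]
choices = [5, 6, 1, 3, 4]
# rows matching no condition keep their current product value
df['product'] = np.select(conditions, choices, default=df['product'])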