Hi, I am writing code to drop a certain row in pandas if a dictionary is empty. That dictionary is empty when it doesn't find the words it was searching for in that PDF.
Here is my code -
allcapex = pd.read_csv("C:\Shodh by Arthavruksha\CorporateAnnouncements\Test 2021 - CA.csv",index_col = 0)
allcapex.drop(['Type', 'Topic','Date','Time','DateTime','Sector'], axis = 1,inplace = True)
allcapex = allcapex[allcapex['Source'].str.contains("-") == False]
allcapex['Content'] = allcapex['Source'].apply(lambda link: urltotext(link))
allcapex = allcapex[allcapex['Content'].str.contains("none") == False]
allcapex['Sentence'] = ''
allcapex['Date'] = ''
allcapex['Type'] = ''
allcapex['Value'] = ''
print(allcapex)
i = 0
for text in allcapex['Content']:
doc = nlp(text)
for sent in doc.sents:
capexmatches = findmatch(sent,['capex', 'capacity expansion', 'Capacity expansion', 'CAPEX', 'Capacity Expansion', 'Capex'],'CAPEX','CAPEX')
typematches = findmatch(sent,['Greenfield','greenfield', 'brownfield','Brownfield', 'de-bottlenecking', 'De-bottlenecking'],'Type','Type')
valuematches = findmatch(sent,['Crore', 'Cr','crore', 'cr'],'Value','Value')
datematches = findmatch(sent,['2020', '2021','2022', '2023','2024', '2025', 'FY21', 'FY22', 'FY23', 'FY24', 'FY25','FY26','Q1FY22','Q2FY22','Q3FY22','Q4FY22','Q1FY23','Q2FY23','Q3FY23','Q4FY23','Q1FY24','Q2FY24','Q3FY24','Q4FY24','Q1FY25','Q2FY25','Q3FY25','Q4FY25'],'Date','Date')
if not capexmatches:
link = allcapex['Source'].iloc[i]
allcapex = allcapex.drop(link,axis=1)
else:
company = allcapex['Company Name'].iloc[i]
link = allcapex['Source'].iloc[i]
capextype = getext(typematches)
capexvalue = getext(valuematches)
capexdate = getext(datematches)
allcapex.loc[len(allcapex.index)] = [link,company,capexdate,capextype,capexvalue]
i += 1
print(allcapex)
The error is in this line - allcapex = allcapex.drop(link,axis=1) - and the error is this - KeyError: "['https://archives.nseindia.com/corporate/PIONDIST_31122021225817_intimation_NimishShah_appointment_signed.pdf'] not found in axis"
My Dataframe is this -
Source,Company Name
1. https://archives.nseindia.com/corporate/PIONDIST_31122021225817_intimation_NimishShah_appointment_signed.pdf,Pioneer Distilleries Limited
2. https://archives.nseindia.com/corporate/EQUITAS_31122021225634_311221EHLPostalballotresultsSIGNED.pdf,Equitas Holdings Limited
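For reference, DataFrame.drop(label, axis=1) looks for a column named label, which is why passing the PDF link raises the KeyError; dropping the current row means passing its index label (axis=0, the default). A minimal sketch of that fix, reusing the loop counter i from the code above (the exact placement inside the loop is an assumption):

# hedged sketch: drop the i-th row by its index label instead of treating the link as a column name
row_label = allcapex.index[i]          # index label of the current row
allcapex = allcapex.drop(row_label)    # axis defaults to 0, i.e. drop a row
# or, equivalently, keep every row whose Source is not this link:
# allcapex = allcapex[allcapex['Source'] != link]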
Related
As shown in the code posted below in section DecoupleGridCellsProfilerLoopsPool, run() is called as many times as there are items in self.__listOfLoopDecouplers, and it works as it is supposed to; that is, the parallelization works correctly.
As shown in the same section, DecoupleGridCellsProfilerLoopsPool.pool.map returns results and I populate some lists. Let's focus on the list self.__iterablesOfZeroCoverageCell; it contains a number of objects of type gridCellInnerLoopsIteratorsForZeroCoverageModel.
After that, I created the pool ZeroCoverageCellsProcessingPool with the code posted below as well.
The problem I am facing is that the parallelized code in ZeroCoverageCellsProcessingPool is very slow, and the visualization of the CPU tasks shows that no processes are working in parallel, as shown in the video at the URL posted below.
I suspected pickling issues when parallelizing the code in ZeroCoverageCellsProcessingPool, so I removed the entire body of run() in ZeroCoverageCellsProcessingPool; however, this made no change in the behaviour of the parallelized code.
The URL posted below also shows how the parallelized method of ZeroCoverageCellsProcessingPool behaves.
Given the code posted below, please let me know why the parallelization does not work for the code in ZeroCoverageCellsProcessingPool.
output url:please click the link
output url
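Before digging further, a quick way to see whether the work is actually spread across processes is to log os.getpid() inside the worker; here is a minimal, self-contained sketch of the idea (the NumPy payload below is only a stand-in for the large TIFF segments carried by each GridCellInnerLoopsIteratorsForZeroCoverageModel, which have to be pickled to every worker and can dominate the runtime):

import os
import time
from multiprocessing import Pool

import numpy as np

def worker(payload):
    # report which process handles the task; if all PIDs are identical, nothing runs in parallel
    start = time.time()
    result = float(payload.sum())  # stand-in for the real per-cell work
    return os.getpid(), time.time() - start, result

if __name__ == '__main__':
    # one large array per task mimics the big windowed segments that must be pickled to each worker
    tasks = [np.zeros((1000, 1000)) for _ in range(8)]
    with Pool(processes=4) as pool:
        for pid, duration, _ in pool.map(worker, tasks, chunksize=1):
            print(f"pid={pid} task took {duration:.3f}s")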
DecoupleGridCellsProfilerLoopsPool
def postTask(self):
self.__postTaskStartTime = time.time()
with Pool(processes=int(config['MULTIPROCESSING']['proceses_count'])) as DecoupleGridCellsProfilerLoopsPool.pool:
self.__chunkSize = PoolUtils.getChunkSize(lst=self.__listOfLoopDecouplers,cpuCount=int(config['MULTIPROCESSING']['cpu_count']))
logger.info(f"DecoupleGridCellsProfilerLoopsPool.self.__chunkSize(task per processor):{self.__chunkSize}")
for res in DecoupleGridCellsProfilerLoopsPool.pool.map(self.run,self.__listOfLoopDecouplers,chunksize=self.__chunkSize):
if res[0] is not None and res[1] is None and res[2] is None:
self.__iterablesOfNoneZeroCoverageCell.append(res[0])
elif res[1] is not None and res[0] is None and res[2] is None:
self.__iterablesOfZeroCoverageCell.append(res[1])
elif res[2] is not None and res[0] is None and res[1] is None:
self.__iterablesOfNoDataCells.append(res[2])
else:
raise Exception (f"WTF.")
DecoupleGridCellsProfilerLoopsPool.pool.join()
assert len(self.__iterablesOfNoneZeroCoverageCell)+len(self.__iterablesOfZeroCoverageCell)+len(self.__iterablesOfNoDataCells) == len(self.__listOfLoopDecouplers)
zeroCoverageCellsProcessingPool = ZeroCoverageCellsProcessingPool(self.__devModeForWSAWANTIVer2,self.__iterablesOfZeroCoverageCell)
zeroCoverageCellsProcessingPool.postTask()
def run(self,param:LoopDecoupler):
row = param.getRowValue()
col = param.getColValue()
elevationsTIFFWindowedSegmentContents = param.getElevationsTIFFWindowedSegment()
verticalStep = param.getVericalStep()
horizontalStep = param.getHorizontalStep()
mainTIFFImageDatasetContents = param.getMainTIFFImageDatasetContents()
NDVIsTIFFWindowedSegmentContentsInEPSG25832 = param.getNDVIsTIFFWindowedSegmentContentsInEPSG25832()
URLOrFilePathForElevationsTIFFDatasetInEPSG25832 = param.getURLOrFilePathForElevationsTIFFDatasetInEPSG25832()
threshold = param.getThreshold()
rowsCnt = 0
colsCnt = 0
pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt = 0
pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = int(config['window']['width']) * int(config['window']['height'])
pixelsWithNoDataValueInTIFFImageDatasetCnt = int(config['window']['width']) * int(config['window']['height'])
_pixelsValuesSatisfyThresholdInNoneZeroCoverageCell = []
_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell = []
_pixelsValuesInNoDataCell = []
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel = None
gridCellInnerLoopsIteratorsForZeroCoverageModel = None
gridCellInnerLoopsIteratorsForNoDataCellsModel = None
for x in range(row,row + verticalStep):
if rowsCnt == verticalStep:
rowsCnt = 0
for y in range(col,col + horizontalStep):
if colsCnt == horizontalStep:
colsCnt = 0
pixelValue = mainTIFFImageDatasetContents[0][x][y]
# windowIOUtils.writeContentsToFile(windowIOUtils.getPathToOutputDir()+"/"+config['window']['file_name']+".{0}".format(config['window']['file_extension']), "pixelValue:{0}\n".format(pixelValue))
if pixelValue >= float(threshold):
pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt+=1
_pixelsValuesSatisfyThresholdInNoneZeroCoverageCell.append(elevationsTIFFWindowedSegmentContents[0][rowsCnt][colsCnt])
elif ((pixelValue < float(threshold)) and (pixelValue > float(config['TIFF']['no_data_value']))):
pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt-=1
_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell.append(elevationsTIFFWindowedSegmentContents[0][rowsCnt][colsCnt])
elif (pixelValue <= float(config['TIFF']['no_data_value'])):
pixelsWithNoDataValueInTIFFImageDatasetCnt-=1
_pixelsValuesInNoDataCell.append(elevationsTIFFWindowedSegmentContents[0][rowsCnt][colsCnt])
else:
raise Exception ("WTF.Exception: unhandled condition for pixel value: {0}".format(pixelValue))
# _pixelCoordinatesInWindow.append([x,y])
colsCnt+=1
rowsCnt+=1
'''Grid-cell classfication'''
if (pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt > 0):
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel = GridCellInnerLoopsIteratorsForNoneZeroCoverageModel()
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setRowValue(row)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setColValue(col)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setVericalStep(verticalStep)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setHorizontalStep(horizontalStep)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setMainTIFFImageDatasetContents(mainTIFFImageDatasetContents)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setNDVIsTIFFWindowedSegmentContentsInEPSG25832(NDVIsTIFFWindowedSegmentContentsInEPSG25832)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setURLOrFilePathForElevationsTIFFDatasetInEPSG25832(URLOrFilePathForElevationsTIFFDatasetInEPSG25832)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setPixelsValuesSatisfyThresholdInTIFFImageDatasetCnt(pixelsValuesSatisfyThresholdInTIFFImageDatasetCnt)
gridCellInnerLoopsIteratorsForNoneZeroCoverageModel.setPixelsValuesSatisfyThresholdInNoneZeroCoverageCell(_pixelsValuesSatisfyThresholdInNoneZeroCoverageCell)
elif (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt < (int(config['window']['width']) * int(config['window']['height'])) and pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt >= 0):
gridCellInnerLoopsIteratorsForZeroCoverageModel = GridCellInnerLoopsIteratorsForZeroCoverageModel()
gridCellInnerLoopsIteratorsForZeroCoverageModel.setRowValue(row)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setColValue(col)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setVericalStep(verticalStep)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setHorizontalStep(horizontalStep)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setMainTIFFImageDatasetContents(mainTIFFImageDatasetContents)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setNDVIsTIFFWindowedSegmentContentsInEPSG25832(NDVIsTIFFWindowedSegmentContentsInEPSG25832)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setURLOrFilePathForElevationsTIFFDatasetInEPSG25832(URLOrFilePathForElevationsTIFFDatasetInEPSG25832)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt(pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsWithNoDataValueInTIFFImageDatasetCnt(pixelsWithNoDataValueInTIFFImageDatasetCnt)
gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsValuesDoNotSatisfyThresholdInZeroCoverageCell(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell)
elif (pixelsWithNoDataValueInTIFFImageDatasetCnt == 0):
gridCellInnerLoopsIteratorsForNoDataCellsModel = GridCellInnerLoopsIteratorsForNoDataCellsModel()
gridCellInnerLoopsIteratorsForNoDataCellsModel.setRowValue(row)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setColValue(col)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setVericalStep(verticalStep)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setHorizontalStep(horizontalStep)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setMainTIFFImageDatasetContents(mainTIFFImageDatasetContents)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setNDVIsTIFFWindowedSegmentContentsInEPSG25832(NDVIsTIFFWindowedSegmentContentsInEPSG25832)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setURLOrFilePathForElevationsTIFFDatasetInEPSG25832(URLOrFilePathForElevationsTIFFDatasetInEPSG25832)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setPixelsWithNoDataValueInTIFFImageDatasetCnt(pixelsWithNoDataValueInTIFFImageDatasetCnt)
gridCellInnerLoopsIteratorsForNoDataCellsModel.setPixelsValuesInNoDataCell(_pixelsValuesInNoDataCell)
if gridCellInnerLoopsIteratorsForZeroCoverageModel is not None:
gridCellInnerLoopsIteratorsForZeroCoverageModel.setPixelsWithNoDataValueInTIFFImageDatasetCnt(pixelsWithNoDataValueInTIFFImageDatasetCnt)
else:
raise Exception (f"WTF.")
return gridCellInnerLoopsIteratorsForNoneZeroCoverageModel,gridCellInnerLoopsIteratorsForZeroCoverageModel,gridCellInnerLoopsIteratorsForNoDataCellsModel
ZeroCoverageCellsProcessingPool:
def postTask(self):
self.__postTaskStartTime = time.time()
"""to collect results per each row
"""
resAllCellsForGridCellsClassifications = []
# NDVIs
resAllCellsForNDVITIFFDetailsForZeroCoverageCell = []
# area of coverage
resAllCellsForAreaOfCoverageForZeroCoverageCell = []
# interception
resAllCellsForInterceptionForZeroCoverageCell = []
# fourCornersOfWindowInEPSG25832
resAllCellsForFourCornersOfWindowInEPSG25832ZeroCoverageCell = []
# outFromEPSG25832ToEPSG4326-lists
resAllCellsForOutFromEPSG25832ToEPSG4326ForZeroCoverageCells = []
# fourCornersOfWindowsAsGeoJSON
resAllCellsForFourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell = []
# calculatedCenterPointInEPSG25832
resAllCellsForCalculatedCenterPointInEPSG25832ForZeroCoverageCell = []
# centerPointsOfWindowInImageCoordinatesSystem
resAllCellsForCenterPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell = []
# pixelValuesOfCenterPoints
resAllCellsForPixelValuesOfCenterPointsForZeroCoverageCell = []
# centerPointOfKeyWindowAsGeoJSONInEPSG4326
resAllCellsForCenterPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell = []
# centerPointInEPSG4326
resAllCellsForCenterPointInEPSG4326ForZeroCoveringCell = []
# average heights
resAllCellsForAverageHeightsForZeroCoverageCell = []
# pixels values
resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCell = []
# area Of Coverage
resAllCellsForAreaOfCoverageForZeroCoverageCell = []
resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = []
noneKeyWindowCnt=0
# center points as string
centerPointsAsStringForZeroCoverageCell = ""
with Pool(processes=int(config['MULTIPROCESSING']['proceses_count'])) as ZeroCoverageCellsProcessingPool.pool:
self.__chunkSize = PoolUtils.getChunkSize(lst=self.__iterables,cpuCount=int(config['MULTIPROCESSING']['cpu_count']))
logger.info(f"ZeroCoverageCellsProcessingPool.self.__chunkSize(task per processor):{self.__chunkSize}")
for res in ZeroCoverageCellsProcessingPool.pool.map(func=self.run,iterable=self.__iterables,chunksize=self.__chunkSize):
resAllCellsForGridCellsClassifications.append(res[0])
# NDVIs
resAllCellsForNDVITIFFDetailsForZeroCoverageCell.append(res[1])
# area of coverage
resAllCellsForAreaOfCoverageForZeroCoverageCell.append(res[2])
# interception
resAllCellsForInterceptionForZeroCoverageCell.append(res[3])
# fourCornersOfWindowInEPSG25832
resAllCellsForFourCornersOfWindowInEPSG25832ZeroCoverageCell.append(res[4])
# outFromEPSG25832ToEPSG4326-lists
resAllCellsForOutFromEPSG25832ToEPSG4326ForZeroCoverageCells.append(res[5])
# fourCornersOfWindowsAsGeoJSONInEPSG4326
resAllCellsForFourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell.append(res[6])
# calculatedCenterPointInEPSG25832
resAllCellsForCalculatedCenterPointInEPSG25832ForZeroCoverageCell.append(res[7])
# centerPointsOfWindowInImageCoordinatesSystem
resAllCellsForCenterPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell.append(res[8])
# pixelValuesOfCenterPoints
resAllCellsForPixelValuesOfCenterPointsForZeroCoverageCell.append(res[9])
# centerPointInEPSG4326
resAllCellsForCenterPointInEPSG4326ForZeroCoveringCell.append(res[10])
# centerPointOfKeyWindowAsGeoJSONInEPSG4326
resAllCellsForCenterPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell.append(res[11])
# average heights
resAllCellsForAverageHeightsForZeroCoverageCell.append(res[12])
# pixels values
resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCell.append(res[13])
# pixelsValues cnt
resAllCellsForPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt.append(res[14])
noneKeyWindowCnt +=res[15]
# centerPoints-As-String
if (res[16] is not None):
centerPointsAsStringForZeroCoverageCell+=str(res[16])
assert noneKeyWindowCnt == len(self.__iterables)
ZeroCoverageCellsProcessingPool.pool.close()
ZeroCoverageCellsProcessingPool.pool.terminate()
ZeroCoverageCellsProcessingPool.pool.join()
return
def run(self,params:GridCellInnerLoopsIteratorsForZeroCoverageModel):
if params is not None:
logger.info(f"Processing zero coverage cell #(row{params.getRowValue()},col:{params.getColValue()})")
row = params.getRowValue()
col = params.getColValue()
mainTIFFImageDatasetContents = params.getMainTIFFImageDatasetContents()
NDVIsTIFFWindowedSegmentContentsInEPSG25832 = params.getNDVIsTIFFWindowedSegmentContentsInEPSG25832()
URLOrFilePathForElevationsTIFFDatasetInEPSG25832 = params.getURLOrFilePathForElevationsTIFFDatasetInEPSG25832()
datasetElevationsTIFFInEPSG25832 = rasterio.open(URLOrFilePathForElevationsTIFFDatasetInEPSG25832,'r')
_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell = params.getPixelsValuesDoNotSatisfyThresholdInZeroCoverageCell()
pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = params.getPixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt()
countOfNoDataCells = params.getPixelsWithNoDataValueInTIFFImageDatasetCnt()
outFromEPSG25832ToEPSG4326ForZeroCoverageCells = []
fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell = []
ndviTIFFDetailsForZeroCoverageCell = NDVITIFFDetails(None,None,None).getNDVIValuePer10mX10m()
"""area of coverage per grid-cell"""
areaOfCoverageForZeroCoverageCell = None
""""interception"""
interceptionForZeroCoverageCell = None
CntOfNDVIsWithNanValueInZeroCoverageCell = 0
fourCornersOfWindowInEPSG25832ZeroCoverageCell = None
outFromEPSG25832ToEPSG4326ForZeroCoverageCells = []
fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell = []
calculatedCenterPointInEPSG25832ForZeroCoverageCell = None
centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell = None
pixelValuesOfCenterPointsOfZeroCoverageCell = None
centerPointInEPSG4326ForZeroCoveringCell = None
centerPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell = None
centerPointsAsStringForZeroCoverageCell = None
"""average heights"""
averageHeightsForZeroCoverageCell = None
gridCellClassifiedAs = GridCellClassifier.ZERO_COVERAGE_CELL.value
cntOfNoneKeyWindow = 1
ndviTIFFDetailsForZeroCoverageCell = NDVITIFFDetails(ulX=row//int(config['ndvi']['resolution_height']),ulY=col//int(config['ndvi']['resolution_width']),dataset=NDVIsTIFFWindowedSegmentContentsInEPSG25832).getNDVIValuePer10mX10m()
"""area of coverage per grid-cell"""
areaOfCoverageForZeroCoverageCell = round(AreaOfCoverageDetails(pixelsCount=(int(config['window']['width']) * int(config['window']['height'])) - (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt + countOfNoDataCells)).getPercentageOfAreaOfCoverage(),2)
""""interception"""
if math.isnan(ndviTIFFDetailsForZeroCoverageCell):
# ndviTIFFDetailsForZeroCoverageCell = 0
CntOfNDVIsWithNanValueInZeroCoverageCell = 1
interceptionForZeroCoverageCell = config['sentinel_values']['interception']
else:
Indvi = INDVI()
Ic = Indvi.calcInterception(ndviTIFFDetailsForZeroCoverageCell)
Pc=areaOfCoverageForZeroCoverageCell,"""percentage of coverage"""
Pnc=float((int(config['window']['width'])*int(config['window']['height'])) - areaOfCoverageForZeroCoverageCell),"""percentage of non-coverage"""
Inc=float(config['interception']['noneCoverage']),"""interception of none-coverage"""
I=(float(Pc[0])*(Ic))+float((Pnc[0]*Inc[0]))
interceptionForZeroCoverageCell = round(I,2)
if I != 10 and I != float('nan'):
logger.error(f"ndviTIFFDetailsForZeroCoverageCell:{ndviTIFFDetailsForZeroCoverageCell}")
logger.error(f"I:{I}")
fourCornersOfWindowInEPSG25832ZeroCoverageCell = RasterIOPackageUtils.convertFourCornersOfWindowFromImageCoordinatesToCRSByCoordinatesOfCentersOfPixelsMethodFor(row,col,int(config['window']['height']),int(config['window']['width']),datasetElevationsTIFFInEPSG25832)
for i in range(0,len(fourCornersOfWindowInEPSG25832ZeroCoverageCell)):
# fourCornersOfKeyWindowInEPSG4326.append(RasterIOPackageUtils.convertCoordsToDestEPSGForDataset(fourCornersOfWindowInEPSG25832[i],datasetElevationsTIFFInEPSG25832,destEPSG=4326))
outFromEPSG25832ToEPSG4326ForZeroCoverageCells.append(OSGEOUtils.fromEPSG25832ToEPSG4326(fourCornersOfWindowInEPSG25832ZeroCoverageCell[i])) # resultant coords order is in form of lat,lon and it must be in lon,lat.thus, out[1]-lat out[0]-lon
"""fourCornersOfWindowsAsGeoJSONInEPSG4326"""
fourCornersOfWindowInEPSG4326 = []
for i in range(0,len(outFromEPSG25832ToEPSG4326ForZeroCoverageCells)):
fourCornersOfWindowInEPSG4326.append(([outFromEPSG25832ToEPSG4326ForZeroCoverageCells[i][1]],[outFromEPSG25832ToEPSG4326ForZeroCoverageCells[i][0]]))
fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell.append(jsonUtils.buildFeatureCollectionAsGeoJSONForFourCornersOfKeyWindow(fourCornersOfWindowInEPSG4326[0],fourCornersOfWindowInEPSG4326[1],fourCornersOfWindowInEPSG4326[2],fourCornersOfWindowInEPSG4326[3],areaOfCoverageForZeroCoverageCell))
# debugIOUtils.writeContentsToFile(debugIOUtils.getPathToOutputDir()+"/"+"NDVIsPer10mX10mForKeyWindow"+config['window']['file_name']+".{0}".format(config['window']['file_extension']),"{0}\n".format(NDVIsPer10mX10mForKeyWindow))
"""
building geojson object for a point "center-point" to visualize it.
"""
calculatedCenterPointInEPSG25832ForZeroCoverageCell = MiscUtils.calculateCenterPointsGivenLLOfGridCell(fourCornersOfWindowInEPSG25832ZeroCoverageCell[1])#lower-left corner
centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell = RasterIOPackageUtils.convertFromCRSToImageCoordinatesSystemFor(calculatedCenterPointInEPSG25832ForZeroCoverageCell[0],calculatedCenterPointInEPSG25832ForZeroCoverageCell[1],datasetElevationsTIFFInEPSG25832)
pixelValuesOfCenterPointsOfZeroCoverageCell = mainTIFFImageDatasetContents[0][centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell[0]][centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell[1]]
centerPointInEPSG4326ForZeroCoveringCell = RasterIOPackageUtils.convertCoordsToDestEPSGForDataset(calculatedCenterPointInEPSG25832ForZeroCoverageCell,datasetElevationsTIFFInEPSG25832,destEPSG=4326)
centerPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell = jsonUtils.buildGeoJSONForPointFor(centerPointInEPSG4326ForZeroCoveringCell)
"""average heights"""
averageHeightsForZeroCoverageCell = round(MiscUtils.getAverageFor(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell),2)
assert len(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell) > 0 and (len(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell) <= (int(config['window']['width']) * int(config['window']['height'])) )
"""the following code block is for assertion only"""
if self.__devModeForWSAWANTIVer2 == config['DEVELOPMENT_MODE']['debug']:
assert pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt >= 0 and (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt < (int(config['window']['width']) * int(config['window']['height'])) )
assert (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt+countOfNoDataCells) == (int(config['window']['width']) * int(config['window']['height']))
print(f"profiling for gridCellClassifiedAs:{gridCellClassifiedAs}....>pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt:{pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt}")
print(f"profiling for gridCellClassifiedAs:{gridCellClassifiedAs}....>countOfNoDataCells:{countOfNoDataCells}")
pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt = (int(config['window']['width']) * int(config['window']['height'])) - (pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt + countOfNoDataCells)
assert pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt == 0, (f"WTF.")
print(f"profiling for gridCellClassifiedAs:{gridCellClassifiedAs}....>computed pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt:{pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt}")
print(f"\n")
centerPointAsTextInWKTInEPSG3857 = CoordinatesUtils.buildWKTPointFormatForSinglePointFor(calculatedCenterPointInEPSG25832ForZeroCoverageCell[0],calculatedCenterPointInEPSG25832ForZeroCoverageCell[1])
s = centerPointAsTextInWKTInEPSG3857.replace("POINT","")
s = s.replace("(","")
s = s.replace(")","")
s = s.strip()
s = s.split(" ")
centerPointsAsStringForZeroCoverageCell = s[0] + "\t" + s[1] + "\n"
centerPointsAsStringForZeroCoverageCell = centerPointsAsStringForZeroCoverageCell.replace('\'',"")
return gridCellClassifiedAs,ndviTIFFDetailsForZeroCoverageCell,areaOfCoverageForZeroCoverageCell,interceptionForZeroCoverageCell,fourCornersOfWindowInEPSG25832ZeroCoverageCell,outFromEPSG25832ToEPSG4326ForZeroCoverageCells,fourCornersOfWindowsAsGeoJSONInEPSG4326ForZeroCoverageCell,calculatedCenterPointInEPSG25832ForZeroCoverageCell,centerPointsOfWindowInImageCoordinatesSystemForZeroCoverageCell,pixelValuesOfCenterPointsOfZeroCoverageCell,centerPointInEPSG4326ForZeroCoveringCell,centerPointOfKeyWindowAsGeoJSONInEPSG4326ForZeroCoverageCell,averageHeightsForZeroCoverageCell,np.array(_pixelsValuesDoNotSatisfyThresholdInZeroCoverageCell).tolist(),pixelsValuesDoNotSatisfyThresholdInTIFFImageDatasetCnt,cntOfNoneKeyWindow,centerPointsAsStringForZeroCoverageCell,CntOfNDVIsWithNanValueInZeroCoverageCell
When trying to scrape multiple pages of this website, I get no content in return. I usually check to make sure all the lists I'm creating are of equal length, but all are coming back as len = 0.
I've used similar code to scrape other websites, so why does this code not work correctly?
Some solutions I've tried that haven't worked for my purposes: requests.Session(), as suggested in this answer, and .json, as suggested here.
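For what it's worth, a quick check of what requests actually receives (a diagnostic sketch, not part of the script below) shows whether the table exists in the raw HTML at all:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://sejmsenat2019.pkw.gov.pl/sejmsenat2019/en/wyniki/sejm/okr/1")
soup = BeautifulSoup(r.text, 'html.parser')
print(len(soup.find_all('table')))  # 0 would mean the tables are rendered client-side by JavaScript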
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
from time import sleep
from random import randint
from googletrans import Translator
translator = Translator()
rg = []
ctr_n = []
ctr = []
yr = []
mn = []
sub = []
cst_n = []
cst = []
mag = []
pty_n = []
pty = []
can = []
pev1 = []
vot1 = []
vv1 = []
ivv1 = []
to1 = []
cv1 = []
cvs1 = []
pv1 = []
pvs1 = []
pev2 = []
vot2 = []
vv2 = []
ivv2 = []
to2 = []
cv2 = []
cvs2 =[]
pv2 = []
pvs2 = []
seat = []
no_info = []
manual = []
START_PAGE = 1
END_PAGE = 42
for page in range(START_PAGE, END_PAGE + 1):
page = requests.get("https://sejmsenat2019.pkw.gov.pl/sejmsenat2019/en/wyniki/sejm/okr/" + str(page))
page.encoding = page.apparent_encoding
if not page:
pass
else:
soup = BeautifulSoup(page.text, 'html.parser')
tbody = soup.find_all('table', class_='table table-borderd table-striped table-hover dataTable no-footer clickable right2 right4')
sleep(randint(2,10))
for container in tbody:
col1 = container.find_all('tr', {'data-id':'26079'})
for info in col1:
col_1 = info.find_all('td')
for data in col_1:
party = data[0]
party_trans = translator.translate(party)
pty_n.append(party_trans)
pvotes = data[1]
pv1.append(pvotes)
pshare = data[2]
pvs1.append(pshare)
mandates = data[3]
seat.append(mandates)
col2 = container.find_all('tr', {'data-id':'26075'})
for info in col2:
col_2 = info.find_all('td')
for data in col_2:
party2 = data[0]
party_trans2 = translator.translate(party2)
pty_n.append(party_trans2)
pvotes2 = data[1]
pv1.append(pvotes2)
pshare2 = data[2]
pvs1.append(pshare2)
mandates2 = data[3]
seat.append(mandates2)
col3 = container.find_all('tr', {'data-id':'26063'})
for info in col3:
col_3 = info.find_all('td')
for data in col_3:
party3 = data[0].text
party_trans3 = translator.translate(party3)
pty_n.extend(party_trans3)
pvotes3 = data[1].text
pv1.extend(pvotes3)
pshare3 = data[2].text
pvs1.extend(pshare3)
mandates3 = data[3].text
seat.extend(mandates3)
col4 = container.find_all('tr', {'data-id':'26091'})
for info in col4:
col_4 = info.find_all('td',recursive=True)
for data in col_4:
party4 = data[0]
party_trans4 = translator.translate(party4)
pty_n.extend(party_trans4)
pvotes4 = data[1]
pv1.extend(pvotes4)
pshare4 = data[2]
pvs1.extend(pshare4)
mandates4 = data[3]
seat.extend(mandates4)
col5 = container.find_all('tr', {'data-id':'26073'})
for info in col5:
col_5 = info.find_all('td')
for data in col_5:
party5 = data[0]
party_trans5 = translator.translate(party5)
pty_n.extend(party_trans5)
pvotes5 = data[1]
pv1.extend(pvotes5)
pshare5 = data[2]
pvs1.extend(pshare5)
mandates5 = data[3]
seat.extend(mandates5)
col6 = container.find_all('tr', {'data-id':'26080'})
for info in col6:
col_6 = info.find_all('td')
for data in col_6:
party6 = data[0]
party_trans6 = translator.translate(party6)
pty_n.extend(party_trans6)
pvotes6 = data[1]
pv1.extend(pvotes6)
pshare6 = data[2]
pvs1.extend(pshare6)
mandates6 = data[3]
seat.extend(mandates6)
#### TOTAL VOTES ####
tfoot = soup.find_all('tfoot')
for data in tfoot:
fvote = data.find_all('td')
for info in fvote:
votefinal = info.find(text=True).get_text()
fvoteindiv = [votefinal]
fvotelist = fvoteindiv * (len(pty_n) - len(vot1))
vot1.extend(fvotelist)
#### CONSTITUENCY NAMES ####
constit = soup.find_all('a', class_='btn btn-link last')
for data in constit:
names = data.get_text()
names_clean = names.replace("Sejum Constituency no.","")
names_clean2 = names_clean.replace("[","")
names_clean3 = names_clean2.replace("]","")
namesfinal = names_clean3.split()[1]
constitindiv = [namesfinal]
constitlist = constitindiv * (len(pty_n) - len(cst_n))
cst_n.extend(constitlist)
#### UNSCRAPABLE INFO ####
region = 'Europe'
reg2 = [region]
reglist = reg2 * (len(pty_n) - len(rg))
rg.extend(reglist)
country = 'Poland'
ctr2 = [country]
ctrlist = ctr2 * (len(pty_n) - len(ctr_n))
ctr_n.extend(ctrlist)
year = '2019'
yr2 = [year]
yrlist = yr2 * (len(pty_n) - len(yr))
yr.extend(yrlist)
month = '10'
mo2 = [month]
molist = mo2 * (len(pty_n) - len(mn))
mn.extend(molist)
codes = ''
codes2 = [codes]
codeslist = codes2 * (len(pty_n) - len(manual))
manual.extend(codeslist)
noinfo = '-990'
noinfo2 = [noinfo]
noinfolist = noinfo2 * (len(pty_n) - len(no_info))
no_info.extend(noinfolist)
print(len(rg), len(pty_n), len(pv1), len(pvs1), len(no_info), len(vot1), len(cst_n))
poland19 = pd.DataFrame({
'rg' : rg,
'ctr_n' : ctr_n,
'ctr': manual,
'yr' : yr,
'mn' : mn,
'sub' : manual,
'cst_n': cst_n,
'cst' : manual,
'mag': manual,
'pty_n': pty_n,
'pty': manual,
'can': can,
'pev1': no_info,
'vot1': vot1,
'vv1': vot1,
'ivv1': no_info,
'to1': no_info,
'cv1': no_info,
'cvs1': no_info,
'pv1': cv1,
'pvs1': cvs1,
'pev2': no_info,
'vot2': no_info,
'vv2': no_info,
'ivv2': no_info,
'to2': no_info,
'cv2': no_info,
'cvs2': no_info,
'pv2' : no_info,
'pvs2' : no_info,
'seat' : manual
})
print(poland19)
poland19.to_csv('poland_19.csv')
As commented, you probably need to use Selenium. You could replace the requests lib and the request statements with something like this:
from selenium import webdriver
wd = webdriver.Chrome('pathToChromeDriver') # or any other Browser driver
wd.get(url) # instead of requests.get()
soup = BeautifulSoup(wd.page_source, 'html.parser')
You need to follow the instructions to install and implement the selenium lib at this link: https://selenium-python.readthedocs.io/
Note: I tested your code with Selenium and I was able to get the table you were looking for, but selecting it with class_=... does not work for some reason.
Instead, browsing the scraped data, I found that the table has an id attribute, so maybe also try this instead:
tbody = soup.find_all('table', id="DataTables_Table_0")
And again, do the GET requests with the Selenium lib.
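Putting those pieces together, here is a minimal self-contained sketch of the Selenium-based scrape (it mirrors the driver setup above; the driver path, the short sleep for the page to render, and the page range are assumptions to adapt to your setup):

from time import sleep

from selenium import webdriver
from bs4 import BeautifulSoup

wd = webdriver.Chrome('pathToChromeDriver')  # or any other browser driver, as above
rows = []
for page_no in range(1, 43):
    wd.get("https://sejmsenat2019.pkw.gov.pl/sejmsenat2019/en/wyniki/sejm/okr/" + str(page_no))
    sleep(2)  # give the DataTable a moment to render
    soup = BeautifulSoup(wd.page_source, 'html.parser')
    table = soup.find('table', id="DataTables_Table_0")
    if table is None:
        continue  # page came back without the results table
    for tr in table.find_all('tr'):
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        if cells:
            rows.append(cells)
wd.quit()
print(len(rows), "rows scraped")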
Hope that was helpful :)
Cheers
I'm working on a deep learning project where I use a Bidirectional Attention Flow model (an AllenNLP pretrained model) to make a question answering system. It uses the SQuAD dataset. The BiDAF model extracts the answer span from a paragraph. Is there any way to determine the confidence score (accuracy) or any other metric of the answer extracted by the model?
I have used the evaluate subcommand from the allennlp package, but it only determines the score of the model after testing. I was hoping there is an easier way to solve this with another such command.
Attaching the code and the terminal output below.
from rake_nltk import Rake
from string import punctuation
from nltk.corpus import stopwords
from allennlp.predictors.predictor import Predictor
import spacy
import wikipedia
import re
import requests
from requests_html import HTMLSession
from bs4 import BeautifulSoup
import traceback
from nltk.stem import SnowballStemmer
from nltk.util import ngrams
from math import log10
from flask import Flask, request, jsonify, render_template
from gevent.pywsgi import WSGIServer
import time
import multiprocessing as mp
from gtts import gTTS
import os
NLP = spacy.load('en_core_web_md')
stop = stopwords.words('english')
symbol = r"""!#$%^&*();:\n\t\\\"!\{\}\[\]<>-\?"""
stemmer = SnowballStemmer('english')
wikipedia.set_rate_limiting(True)
session = HTMLSession()
results = 5
try:
predictor = Predictor.from_path("bidaf-model-2017.09.15-charpad.tar.gz")
except:
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo-model-2018.11.30-charpad.tar.gz")
try:
srl = Predictor.from_path('srl-model-2018.05.25.tar.gz')
except:
srl = Predictor.from_path('https://s3-us-west-2.amazonaws.com/allennlp/models/bert-base-srl-2019.06.17.tar.gz')
key = Rake(min_length=1, stopwords=stop, punctuations=punctuation, max_length=6)
wh_words = "who|what|how|where|when|why|which|whom|whose|explain".split('|')
stop.extend(wh_words)
session = HTMLSession()
output = mp.Queue()
def termFrequency(term, doc):
normalizeTermFreq = re.sub('[\[\]\{\}\(\)]', '', doc.lower()).split()
normalizeTermFreq = [stemmer.stem(i) for i in normalizeTermFreq]
dl = len(normalizeTermFreq)
normalizeTermFreq = ' '.join(normalizeTermFreq)
term_in_document = normalizeTermFreq.count(term)
#len_of_document = len(normalizeTermFreq )
#normalized_tf = term_in_document / len_of_document
normalized_tf = term_in_document
return normalized_tf, normalizeTermFreq, dl#, n_unique_term
def inverseDocumentFrequency(term, allDocs):
num_docs_with_given_term = 0
for doc in allDocs:
if term in doc:
num_docs_with_given_term += 1
if num_docs_with_given_term > 0:
total_num_docs = len(allDocs)
idf_val = log10(((total_num_docs+1) / num_docs_with_given_term))
term_split = term.split()
if len(term_split) == 3:
if len([term_split[i] for i in [0, 2] if term_split[i] not in stop]) == 2:
return idf_val*1.5
return idf_val
return idf_val
else:
return 0
def sent_formation(question, answer):
tags_doc = NLP(question)
tags_doc_cased = NLP(question.title())
tags_dict_cased = {i.lower_:i.pos_ for i in tags_doc_cased}
tags_dict = {i.lower_:i.pos_ for i in tags_doc}
question_cased = []
for i in question[:-1].split():
if tags_dict[i] == 'PROPN' or tags_dict[i] == 'NOUN':
question_cased.append(i.title())
else:
question_cased.append(i.lower())
question_cased.append('?')
question_cased = ' '.join(question_cased)
#del tags_dict,tags_doc, tags_doc_cased
pre = srl.predict(question_cased)
verbs = []
arg1 = []
for i in pre['verbs']:
verbs.append(i['verb'])
if 'B-ARG1' in i['tags']:
arg1.append((i['tags'].index('B-ARG1'), i['tags'].count('I-ARG1'))\
if not pre['words'][i['tags'].index('B-ARG1')].lower() in wh_words else \
(i['tags'].index('B-ARG2'), i['tags'].count('I-ARG2')))
arg1 = arg1[0] if arg1 else []
if not arg1:
verb_idx = pre['verbs'][0]['tags'].index('B-V')
verb = pre['words'][verb_idx] if pre['words'][verb_idx] != answer.split()[0].lower() else ''
subj_uncased = pre['words'][verb_idx+1:] if pre['words'][-1] not in symbol else \
pre['words'][verb_idx+1:-1]
else:
verb = ' '.join(verbs)
subj_uncased = pre['words'][arg1[0]:arg1[0]+arg1[1]+1]
conj = ''
if question.split()[0].lower() == 'when':
conj = ' on' if len(answer.split()) > 1 else ' in'
subj = []
for n, i in enumerate(subj_uncased):
if tags_dict_cased[i.lower()] == 'PROPN' and tags_dict[i.lower()] != 'VERB' or n == 0:
subj.append(i.title())
else:
subj.append(i.lower())
subj[0] = subj[0].title()
print(subj)
print(pre)
subj = ' '.join(subj)
sent = "{} {}{} {}.".format(subj, verb, conj, answer if answer[-1] != '.' else answer[:-1])
return sent
class extractAnswer:
def __init__(self):
self.wiki_error = (wikipedia.exceptions.DisambiguationError,
wikipedia.exceptions.HTTPTimeoutError,
wikipedia.exceptions.WikipediaException)
self.article_title = None
# symbol = """!#$%^&*();:\n\t\\\"!\{\}\[\]<>-\?"""
def extractAnswer_model(self, passage, question, s=0.4, e=0.3, wiki=False):
if type(passage) == list:
passage = " ".join(passage)
if not question[-1] == '?':
question = question+'?'
pre = predictor.predict(passage=passage, question=question)
if wiki:
if max(pre['span_end_probs']) > 0.5:
s = 0.12
elif max(pre['span_end_probs']) > 0.4:
s = 0.13
elif max(pre['span_end_probs']) > 0.35:
s = 0.14
if max(pre['span_start_probs']) > 0.5:
e = 0.12
elif max(pre['span_start_probs']) > 0.4:
e = 0.14
elif max(pre['span_start_probs']) > 0.3:
e = 0.15
if max(pre['span_start_probs']) > s and max(pre['span_end_probs']) > e:
key.extract_keywords_from_text(question)
ques_key = [stemmer.stem(i) for i in ' '.join(key.get_ranked_phrases())]
key.extract_keywords_from_text(passage)
pass_key = [stemmer.stem(i) for i in ' '.join(key.get_ranked_phrases())]
l = len(ques_key)
c = 0
for i in ques_key:
if i in pass_key:
c += 1
if c >= l/2:
print(max(pre['span_start_probs']),
max(pre['span_end_probs']))
if wiki:
return pre['best_span_str'], max(pre['span_start_probs']) + max(pre['span_end_probs'])
try:
ans = sent_formation(question, pre['best_span_str'])
except:
ans = pre['best_span_str']
print(traceback.format_exc())
return ans
print(ques_key, c, l)
print(max(pre['span_start_probs']), max(pre['span_end_probs']))
return 0, 0
else:
print(max(pre['span_start_probs']), max(pre['span_end_probs']), pre['best_span_str'])
return 0, 0
def wiki_search_api(self, query):
article_list = []
try:
article_list.extend(wikipedia.search(query, results=results))
print(article_list)
return article_list
except self.wiki_error:
params = {'search': query, 'profile': 'engine_autoselect',
'format': 'json', 'limit': results}
article_list.extend(requests.get('https://en.wikipedia.org/w/api.php?action=opensearch',
params=params).json()[1])
return article_list
except:
print('Wikipedia search error!')
print(traceback.format_exc())
return 0
def wiki_passage_api(self, article_title, article_list, output):
# Disambiguation_title = {}
try:
passage = wikipedia.summary(article_title)
output.put((article_title, self.passage_pre(passage)))
except wikipedia.exceptions.DisambiguationError as e:
print(e.options[0], e.options)
Disambiguation_pass = {}
for p in range(2 if len(e.options) > 1 else len(e.options)):
params = {'search':e.options[p], 'profile':'engine_autoselect', 'format':'json'}
article_url = requests.get('https://en.wikipedia.org/w/api.php?action=opensearch',
params=params).json()
if not article_url[3]:
continue
article_url = article_url[3][0]
r = session.get(article_url)
soup = BeautifulSoup(r.html.raw_html)
print(soup.title.string)
article_title_dis = soup.title.string.rsplit('-')[0].strip()
if article_title_dis in article_list:
print('continue')
continue
try:
url = "https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro&explaintext&redirects=1&titles={}".format(article_title_dis)
passage = requests.get(url).json()['query']['pages']
for i in passage.keys():
if 'extract' in passage[i]:
Disambiguation_pass[article_title_dis] = self.passage_pre(passage[i]['extract'])
except wikipedia.exceptions.HTTPTimeoutError:
passage = wikipedia.summary(article_title_dis)
Disambiguation_pass[article_title_dis] = self.passage_pre(passage)
except:
Disambiguation_pass[article_title_dis] = ''
continue
output.put((article_title, Disambiguation_pass))
except:
output.put((article_title, ''))
print(traceback.format_exc())
def sorting(self, article, question, topic):
processes = [mp.Process(target=self.wiki_passage_api, args=(article[x], article, output))\
for x in range(len(article))]
for p in processes:
p.start()
for p in processes:
p.join(timeout=3)
results_p = [output.get() for p in processes]
article_list = []
passage_list = []
for i, j in results_p:
if type(j) != dict and j:
article_list.append(i)
passage_list.append(j)
elif type(j) == dict and j:
for k, l in j.items():
if l:
article_list.append(k)
passage_list.append(l)
normalize_passage_list = []
start = time.time()
keywords = " ".join(self.noun+self.ques_key+[topic.lower()])
keywords = re.sub('[{0}]'.format(symbol), ' ', keywords).split()
question = question+' '+topic
ques_tokens = [stemmer.stem(i.lower()) for i in question.split() \
if i.lower() not in wh_words]
print(ques_tokens)
keywords_bigram = [' '.join(i) for i in list(ngrams(ques_tokens, 2)) \
if i[0] not in stop and i[1] not in stop]
if len(ques_tokens) > 3:
keywords_trigram = [' '.join(i) for i in list(ngrams(ques_tokens, 3)) \
if (i[0] in stop) + (i[2] in stop) + (i[1] in stop) < 3]
else:
keywords_trigram = []
if len(ques_tokens) > 5:
keywords_4gram = [' '.join(i) for i in list(ngrams(ques_tokens, 4)) \
if (i[0] in stop) + (i[2] in stop) +(i[1] in stop)+(i[3] in stop) < 4]
else:
keywords_4gram = []
keywords_unigram = list(set([stemmer.stem(i.lower()) for i in keywords \
if i.lower() not in stop]))
keywords = keywords_unigram+list(set(keywords_bigram))+keywords_trigram+keywords_4gram
tf = []
if not passage_list:
return 0
pass_len = []
#n_u_t=[]
#key_dict = {i: keywords.count(i) for i in keywords}
print('Extraction complete')
#remove_pass={}
#for n,i in enumerate(passage_list):
#if len(i)<200 or not i:
#remove_pass[article_list[n]]=i
#print(n, article_list[n])
#passage_list=[i for i in passage_list if i not in remove_pass.values()]
#article_list=[i for i in article_list if i not in remove_pass.keys()]
passage_list_copy = passage_list.copy()
article_list_copy = article_list.copy()
for i in range(len(passage_list_copy)):
if passage_list.count(passage_list_copy[i]) > 1:
passage_list.remove(passage_list_copy[i])
article_list.remove(article_list_copy[i])
print('Copy:', article_list_copy[i])
del passage_list_copy
del article_list_copy
for n, i in enumerate(passage_list):
temp_tf = {}
c = 0
for j in keywords:
temp_tf[j], temp_pass, temp_len = termFrequency(j, i + ' ' + article_list[n])
if temp_tf[j]:
c += 1
normalize_passage_list.append(temp_pass)
pass_len.append(temp_len)
temp_tf['key_match'] = c
tf.append(temp_tf)
print(pass_len)
print(keywords)
idf = {}
for i in keywords:
idf[i] = inverseDocumentFrequency(i, normalize_passage_list)
#print(tf, idf)
tfidf = []
#b=0.333 #for PLN
b, k = 0.75, 1.2 #for BM25
avg_pass_len = sum(pass_len)/len(pass_len)
#pivot=sum(n_u_t)/len(n_u_t)
for n, i in enumerate(tf):
tf_idf = 0
#avg_tf=sum(i.values())/len(i)
key_match_ratio = i['key_match']/len(keywords)
for j in keywords:
#tf_idf+=idf[j]*((log(1+log(1+i[j])))/(1-b+(b*pass_len[n]/avg_pass_len))) #PLN
tf_idf += idf[j]*(((k+1)*i[j])/(i[j]+k*(1-b+(b*pass_len[n]/avg_pass_len)))) #BM25
tfidf.append(tf_idf*key_match_ratio)
tfidf = [i/sum(tfidf)*100 for i in tfidf if any(tfidf)]
if not tfidf:
return 0, 0, 0, 0, 0
print(tfidf)
print(article_list, len(passage_list))
if len(passage_list) > 1:
sorted_tfidf = sorted(tfidf, reverse=1)
idx1 = tfidf.index(sorted_tfidf[0])
passage1 = passage_list[idx1]
#article_title=
tfidf1 = sorted_tfidf[0]
idx2 = tfidf.index(sorted_tfidf[1])
passage2 = passage_list[idx2]
article_title = (article_list[idx1], article_list[idx2])
tfidf2 = sorted_tfidf[1]
else:
article_title = 0
tfidf2 = 0
if passage_list:
passage1 = passage_list[0]
tfidf1 = tfidf[0]
passage2 = 0
else:
passage1 = 0
passage2 = 0
tfidf1, tfidf2 = 0, 0
end = time.time()
print('TFIDF time:', end-start)
return passage1, passage2, article_title, tfidf1, tfidf2
def passage_pre(self, passage):
#passage=re.findall("[\da-zA-z\.\,\'\-\/\–\(\)]*", passage)
passage = re.sub('\n', ' ', passage)
passage = re.sub('\[[^\]]+\]', '', passage)
passage = re.sub('pronunciation', '', passage)
passage = re.sub('\\\\.+\\\\', '', passage)
passage = re.sub('{.+}', '', passage)
passage = re.sub(' +', ' ', passage)
return passage
def wiki(self, question, topic=''):
if not question:
return 0
question = re.sub(' +', ' ', question)
question = question.title()
key.extract_keywords_from_text(question)
self.ques_key = key.get_ranked_phrases()
doc = NLP(question)
self.noun = [str(i).lower() for i in doc.noun_chunks if str(i).lower() not in wh_words]
print(self.ques_key, self.noun)
question = re.sub('[{0}]'.format(symbol), ' ', question)
if not self.noun + self.ques_key:
return 0
article_list = None
question = question.lower()
if self.noun:
if len(self.noun) == 2 and len(" ".join(self.noun).split()) < 6:
#question1=question
self.noun = " ".join(self.noun).split()
if self.noun[0] in stop:
self.noun.pop(0)
self.noun = question[question.index(self.noun[0]):question.index(self.noun[-1]) \
+len(self.noun[-1])+1].split()
#del question1
print(self.noun)
article_list = self.wiki_search_api(' '.join(self.noun))
if self.ques_key and not article_list:
article_list = self.wiki_search_api(self.ques_key[0])
if not article_list:
article_list = self.wiki_search_api(' '.join(self.ques_key))
if not article_list:
print('Article not found on wikipedia.')
return 0, 0
article_list = list(set(article_list))
passage1, passage2, article_title, tfidf1, tfidf2 = self.sorting(article_list,
question, topic)
if passage1:
ans1, conf1 = self.extractAnswer_model(passage1, question, s=0.20, e=0.20, wiki=True)
else:
ans1, conf1 = 0, 0
if ans1:
conf2 = 0
if len(ans1) > 600:
print(ans1)
print('Repeat')
ans1, conf1 = self.extractAnswer_model(ans1, question, s=0.20, e=0.20, wiki=True)
threshhold = 0.3 if not ((tfidf1- tfidf2) <= 10) else 0.2
if round(tfidf1- tfidf2) < 5:
threshhold = 0
if (tfidf1- tfidf2) > 20:
threshhold = 0.35
if (tfidf1- tfidf2) > 50:
threshhold = 1
if (passage2 and conf1 < 1.5) or (tfidf1 - tfidf2) < 10:
ans2, conf2 = self.extractAnswer_model(passage2, question, s=0.20, e=0.20,
wiki=True) if passage2 else (0, 0)
title = 0
if round(conf1, 2) > round(conf2, 2) - threshhold:
print('ans1')
ans = ans1
title = article_title[0] if article_title else 0
else:
print('ans2')
title = article_title[1] if article_title else 0
ans = ans2
if not question[-1] == '?':
question = question+'?'
try:
ans = sent_formation(question, ans)
except:
print(traceback.format_exc())
print(ans, '\n', '\n', article_title)
return ans, title
extractor = extractAnswer()
app = Flask(__name__)
@app.route("/", methods=["POST", "get"])
@app.route("/ans")
def ans():
start = time.time()
question = request.args.get('question')
topic = request.args.get('topic')
passage = request.args.get('passage')
if not question:
return render_template('p.html')
if not topic:
topic = ''
if passage:
answer = extractor.extractAnswer_model(passage, question)
else:
answer, title = extractor.wiki(question, topic)
end = time.time()
if answer:
mytext = str(answer)
language = 'en'
myobj = gTTS(text=mytext, lang=language, slow=False)
myobj.save("welcome.mp3")
# prevName = 'welcome.mp3'
#newName = 'static/welcome.mp3'
#os.rename(prevName,newName)
return render_template('pro.html', answer=answer)
else:
return jsonify(Status='E', Answer=answer, Time=end-start)
@app.route("/audio_del/", methods=["POST", "get"])
def audio_del():
return render_template('p.html');
@app.route("/audio_play/", methods=["POST", "get"])
def audio_play():
os.system("mpg321 welcome.mp3")
return render_template('white.html')
if __name__ == "__main__":
PORT = 7091
HTTP_SERVER = WSGIServer(('0.0.0.0', PORT), app)
print('Running on',PORT, '...')
HTTP_SERVER.serve_forever()
Output in the terminal for a question I've asked: https://i.stack.imgur.com/6pyv5.jpg
I came across a possible solution to this after looking deeply into the output returned by the model. Although it is probably not something you can rely on with full accuracy, it seemed to do the job in my case:
Note that the text answer, "best_span_str", is always a substring of the passage. It spans the range that is stored in "best_span".
i.e., "best_span" contains the start and end index of the answer.
Now, the output data contains a property named "span_end_probs".
"span_end_probs" contains a list of values that correspond to all the words present in the text input.
If you look closely at various inputs, the value is always at its maximum at one of the indexes within the start-to-end range that "best_span" contains. This value seemed to be very similar to the confidence level that we need. Let's call this value score. All you need to do now is try some inputs and find a suitable way to use this score as a metric.
e.g.: if you need a threshold value for some application, you can try a number of test inputs and find the value that is most accurate. In my case, this was around 0.35.
i.e., if the score is less than 0.35, it prints "Answer not found", and if it is greater than or equal to 0.35, it prints the string in "best_span_str".
Here's my code snippet:
from allennlp.predictors.predictor import Predictor
passage = '--INPUT PASSAGE--'
question = '--INPUT QUESTION--'
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bidaf-elmo.2021-02-11.tar.gz")
output = predictor.predict(
passage = passage,
question = question
)
score = max(output["span_end_probs"])
if score < 0.35:
print('Answer not found')
else:
print(output["best_span_str"])
You can readily see the example input and output here.
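If you would rather have a single score that takes both ends of the span into account (the question's own extractAnswer_model sums the two maxima), a small variant of the same idea, continuing from the snippet above, is:

# combine the start and end maxima into one score (roughly in the 0-2 range)
score = max(output["span_start_probs"]) + max(output["span_end_probs"])
if score < 0.7:  # threshold is an assumption; tune it on your own test inputs
    print('Answer not found')
else:
    print(output["best_span_str"])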
I am creating a Word document from data using python-docx. I can create all the rows and cells with no problem, but in some cases, when the current record from the database has content in the comment field, I need to add a new line to display that long content.
I tried appending a paragraph, but the result is that the comment is appended after the table, and I need it to be added below the current table row.
I think the solution is to append a table row with all cells merged, but I can't find documentation on how to do so.
This is the code where I generate the docx file:
class OperationDOCXView(viewsets.ViewSet):
exclude_from_schema = True
def list(self, request):
from ReportsManagerApp.controllers import Operations2Controller
self.profile_id = request.query_params['profile_id']
self.operation_date = request.query_params['operation_date']
self.operation_type = request.query_params['operation_type']
self.format = request.query_params['doc_format']
operation_report_controller = Operations2Controller(self.profile_id, self.operation_date, self.operation_type)
context = operation_report_controller.get_context()
if self.format == 'json':
return Response(context)
else:
word_doc = self.get_operation_word_file(request, context)
return Response("{}{}{}".format(request.get_host(), settings.MEDIA_URL, word_doc))
def get_operation_word_file(self, request, context):
import unicodedata
from django.core.files import File
from django.urls import reverse
from docx import Document
from docx.shared import Inches, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH  # needed for the alignment constants used below
operation_type = {
'arrival': 'Llegadas',
'departure': 'Salidas',
'hotel': 'Hotel-Hotel',
'tour': 'Tours',
}
weekdays = {
'0': 'LUNES',
'1': 'MARTES',
'2': 'MIÉRCOLES',
'3': 'JUEVES',
'4': 'VIERNES',
'5': 'SÁBADO',
'6': 'DOMINGO',
}
titles = ['Booking', 'Nombre', '#', 'Vuelo', 'Hr', 'P Up', 'Traslado', 'Circuito', 'Priv?', 'Agencia', '']
widths = [Inches(1), Inches(2), Inches(0.5), Inches(1), Inches(1), Inches(1), Inches(2), Inches(3), Inches(0.5), Inches(3), Inches(0.5)]
document = Document()
section = document.sections[-1]
section.top_margin = Inches(0.5)
section.bottom_margin = Inches(0.5)
section.left_margin = Inches(0.3)
section.right_margin = Inches(0.2)
style = document.styles['Normal']
font = style.font
font.name ='Arial'
font.size = Pt(10)
company_paragraph = document.add_heading("XXXX TTOO INC")
company_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
description_paragraph = document.add_paragraph("Operación de {} del día {}".format(operation_type[self.operation_type], self.operation_date))
description_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
operation_date = self.get_operation_date().date()
operation_week_day = operation_date.weekday()
day_paragraph = document.add_paragraph(weekdays[str(operation_week_day)])
day_paragraph.alignment = WD_ALIGN_PARAGRAPH.CENTER
for provider_unit, transfers in context.items():
provider_unit_paragraph = document.add_paragraph(provider_unit)
provider_unit_paragraph.style.font.size = Pt(10)
provider_unit_paragraph.style.font.bold = False
table = document.add_table(rows=1, cols=11)
hdr_cells = table.rows[0].cells
runs = []
for i in range(len(hdr_cells)):
runs.append(self.get_hdr_cells_run(hdr_cells[i], titles[i]))
for row in table.rows:
for idx, width in enumerate(widths):
row.cells[idx].width = width
adults = 0
minors = 0
for transfer in transfers:
# table = document.add_table(rows=1, cols=11)
row_cells = table.add_row().cells
row_cells[0].text = transfer['booking']
row_cells[1].text = transfer['people']
row_cells[2].text = transfer['pax']
flight = transfer.get("flight","") if transfer.get("flight","") is not None else ""
row_cells[3].text = flight
flight_time = self.get_flight_time(flight) if flight != '' else ''
row_cells[4].text = flight_time
row_cells[5].text = transfer['pickup_time'].strftime('%H:%M') if transfer['pickup_time'] is not None else ''
row_cells[6].text = transfer['place']
row_cells[7].text = transfer['roundtrip']
row_cells[8].text = transfer['is_private']
row_cells[9].text = transfer['agency']
people = transfer['pax'].split('.')
adults = adults + int(people[0])
minors = minors + int(people[1])
if transfer['comment'] is not None:
document.add_paragraph("Comentarios: {}".format(transfer['comment']))
for row in table.rows:
for idx, width in enumerate(widths):
row.cells[idx].width = width
for cell in row.cells:
paragraphs = cell.paragraphs
for paragraph in paragraphs:
for run in paragraph.runs:
font = run.font
font.size = Pt(8)
row_cells = table.add_row().cells
row_cells[10].text = "{}.{}".format(adults, minors)
current_directory = settings.MEDIA_DIR
file_name = "Operaciones {} {}.docx".format(self.operation_type, self.operation_date)
document.save("{}{}".format(current_directory, file_name))
return file_name
def get_flight_time(self, flight):
from OperationsManagerApp.models import Flight
operation_types = {
'arrival': 'ARRIVAL',
'departure': 'DEPARTURE'
}
operation_date = datetime.strptime(self.operation_date, '%Y-%m-%d')
try:
flight = Flight.objects.get(flight_type=operation_types[self.operation_type], number=flight)
except:
return ''
else:
weekday_times = {
'0': flight.time_monday,
'1': flight.time_tuesday,
'2': flight.time_wednesday,
'3': flight.time_thursday,
'4': flight.time_friday,
'5': flight.time_saturday,
'6': flight.time_sunday,
}
weekday_time = weekday_times[str(operation_date.weekday())]
return weekday_time.strftime('%H:%M') if weekday_time is not None else ''
def get_hdr_cells_run(self, hdr_cells, title):
from docx.shared import Pt
new_run = hdr_cells.paragraphs[0].add_run(title)
new_run.bold = True
new_run.font.size = Pt(8)
return new_run
def get_operation_date(self):
date_array = self.operation_date.split('-')
day = int(date_array[2])
month = int(date_array[1])
year = int(date_array[0])
operation_date = datetime(year, month, day)
return operation_date
One approach is to add a paragraph to one of the cells:
cell.add_paragraph(transfer['comment'])
This will place it in the right position with respect to the row it belongs to, rather than after the table.
If that would take too much room for a single cell that already has another data item in it and you want to add a row, you'll need to account for that when you allocate the table. But assuming you get that worked out, merging the cells is easy:
row.cells[0].merge(row.cells[-1])
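For completeness, here is a minimal sketch of the merged-comment-row variant, using the table and transfer objects from the question's loop (placing it right after the data row is filled is an assumption):

# add an extra row immediately after the data row and merge it across all columns
if transfer['comment'] is not None:
    comment_cells = table.add_row().cells
    merged_cell = comment_cells[0].merge(comment_cells[-1])  # one full-width cell
    merged_cell.text = "Comentarios: {}".format(transfer['comment'])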
Sorry for the unsophisticated question title but I need help desperately:
My objective at work is to create a script that pulls all the records from the ExactTarget Salesforce Marketing Cloud API. I have successfully set up the API calls and imported the data into DataFrames.
The problem I am running into is two-fold: I need to keep pulling records until "Results_Message" in my code stops reading "MoreDataAvailable", and I need to set up logic that allows me to control the date either from within the API call or by parsing the DataFrame.
My code is getting stuck at line 44, where "print Results_Message" keeps looping on the string "MoreDataAvailable".
Here is my code so far; on lines 94 and 95 you will see my attempt at parsing the date directly from the DataFrame, with no luck, and no luck on line 32 where I have specified the date:
import ET_Client
import pandas as pd
AggreateDF = pd.DataFrame()
Data_Aggregator = pd.DataFrame()
#Start_Date = "2016-02-20"
#End_Date = "2016-02-25"
#retrieveDate = '2016-07-25T13:00:00.000'
Export_Dir = 'C:/temp/'
try:
debug = False
stubObj = ET_Client.ET_Client(False, debug)
print '>>>BounceEvents'
getBounceEvent = ET_Client.ET_BounceEvent()
getBounceEvent.auth_stub = stubObj
getBounceEvent.search_filter = {'Property' : 'EventDate','SimpleOperator' : 'greaterThan','Value' : '2016-02-22T13:00:00.000'}
getResponse1 = getBounceEvent.get()
ResponseResultsBounces = getResponse1.results
Results_Message = getResponse1.message
print(Results_Message)
#EventDate = "2016-05-09"
print "This is orginial " + str(Results_Message)
#print ResponseResultsBounces
i = 1
while (Results_Message == 'MoreDataAvailable'):
#if i > 5: break
print Results_Message
results1 = getResponse1.results
#print(results1)
i = i + 1
ClientIDBounces = []
partner_keys1 = []
created_dates1 = []
modified_date1 = []
ID1 = []
ObjectID1 = []
SendID1 = []
SubscriberKey1 = []
EventDate1 = []
EventType1 = []
TriggeredSendDefinitionObjectID1 = []
BatchID1 = []
SMTPCode = []
BounceCategory = []
SMTPReason = []
BounceType = []
for BounceEvent in ResponseResultsBounces:
ClientIDBounces.append(str(BounceEvent['Client']['ID']))
partner_keys1.append(BounceEvent['PartnerKey'])
created_dates1.append(BounceEvent['CreatedDate'])
modified_date1.append(BounceEvent['ModifiedDate'])
ID1.append(BounceEvent['ID'])
ObjectID1.append(BounceEvent['ObjectID'])
SendID1.append(BounceEvent['SendID'])
SubscriberKey1.append(BounceEvent['SubscriberKey'])
EventDate1.append(BounceEvent['EventDate'])
EventType1.append(BounceEvent['EventType'])
TriggeredSendDefinitionObjectID1.append(BounceEvent['TriggeredSendDefinitionObjectID'])
BatchID1.append(BounceEvent['BatchID'])
SMTPCode.append(BounceEvent['SMTPCode'])
BounceCategory.append(BounceEvent['BounceCategory'])
SMTPReason.append(BounceEvent['SMTPReason'])
BounceType.append(BounceEvent['BounceType'])
df1 = pd.DataFrame({'ClientID': ClientIDBounces, 'PartnerKey': partner_keys1,
'CreatedDate' : created_dates1, 'ModifiedDate': modified_date1,
'ID':ID1, 'ObjectID': ObjectID1,'SendID':SendID1,'SubscriberKey':SubscriberKey1,
'EventDate':EventDate1,'EventType':EventType1,'TriggeredSendDefinitionObjectID':TriggeredSendDefinitionObjectID1,
'BatchID':BatchID1,'SMTPCode':SMTPCode,'BounceCategory':BounceCategory,'SMTPReason':SMTPReason,'BounceType':BounceType})
#print df1
#df1 = df1[(df1.EventDate > "2016-02-20") & (df1.EventDate < "2016-02-25")]
#AggreateDF = AggreateDF[(AggreateDF.EventDate > Start_Date) and (AggreateDF.EventDate < End_Date)]
print(df1['ID'].max())
AggreateDF = AggreateDF.append(df1)
print(AggreateDF.shape)
#df1 = df1[(df1.EventDate > "2016-02-20") and (df1.EventDate < "2016-03-25")]
#AggreateDF = AggreateDF[(AggreateDF.EventDate > Start_Date) and (AggreateDF.EventDate < End_Date)]
print("Final Aggregate DF is: " + str(AggreateDF.shape))
#EXPORT TO CSV
AggreateDF.to_csv(Export_Dir +'DataTest1.csv')
#with pd.option_context('display.max_rows',10000):
#print (df_masked1.shape)
#print df_masked1
except Exception as e:
print 'Caught exception: ' + str(e.message)
print e
Before my code parses the data, the original format I get is a SOAP response; this is what it looks like (below). Is it possible to directly parse records based on EventDate from the SOAP response?
}, (BounceEvent){
Client =
(ClientID){
ID = 1111111
}
PartnerKey = None
CreatedDate = 2016-05-12 07:32:20.000937
ModifiedDate = 2016-05-12 07:32:20.000937
ID = 1111111
ObjectID = "1111111"
SendID = 1111111
SubscriberKey = "aaa#aaaa.com"
EventDate = 2016-05-12 07:32:20.000937
EventType = "HardBounce"
TriggeredSendDefinitionObjectID = "aa111aaa"
BatchID = 1111111
SMTPCode = "1111111"
BounceCategory = "Hard bounce - User Unknown"
SMTPReason = "aaaa"
BounceType = "immediate"
Hope this makes sense; this is my desperate plea for help.
Thank you in advance!
You don't seem to be updating Results_Message in your loop, so it's always going to have the value it gets in line 29: Results_Message = getResponse1.message. Unless there's code involved that you didn't share, that is.
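A minimal sketch of the continuation pattern, assuming the FuelSDK-style getMoreResults() call (check the exact method name against the version of the client library you are using):

getResponse1 = getBounceEvent.get()
ResponseResultsBounces = getResponse1.results
Results_Message = getResponse1.message
while Results_Message == 'MoreDataAvailable':
    # ... build df1 from ResponseResultsBounces and append it to AggreateDF here ...
    getResponse1 = getBounceEvent.getMoreResults()  # fetch the next batch
    ResponseResultsBounces = getResponse1.results
    Results_Message = getResponse1.message  # refresh the loop condition so the while can terminate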