I have a list of Wikipedia pages related to some entities and I want to select only geographical places and locations (cities, provinces, but also regions, mountains, rivers and so on).
I can easily select pages with coordinates, but this is not enough, since many places in Wikipedia are not associated with coordinates. I guess I should use labels from Wikidata, but I have never used them and I am a bit lost with the Python API. For example, if I use wptools:
import wptools
page = wptools.page('Indianapolis')
print(page.get_wikidata())
I obtain this:
www.wikidata.org (wikidata) Indianapolis
www.wikidata.org (labels) Q1000136|P1830|P421|Q1093829|P163|Q2579...
www.wikidata.org (labels) Q537853|P281|P949|Q2494513|Q3166162|Q18...
www.wikidata.org (labels) P1036|Q499547|P1997|P31|P17|P268|Q62049...
en.wikipedia.org (imageinfo) File:IndianapolisC12.png
Indianapolis (en) data
{
aliases: <list(10)> Circle City, Indy, Naptown, Crossroads of Am...
claims: <dict(61)> P1082, P227, P1151, P31, P17, P131, P163, P41...
description: <str(109)> city in and county seat of Marion County...
image: <list(1)> {'file': 'File:IndianapolisC12.png', 'kind': 'w...
label: Indianapolis
labels: <dict(145)> Q1000136, P1830, P421, Q1093829, P163, Q2579...
modified: <dict(1)> wikidata
requests: <list(5)> wikidata, labels, labels, labels, imageinfo
title: Indianapolis
what: county seat
wikibase: Q6346
wikidata: <dict(61)> population (P1082), GND ID (P227), topic's ...
wikidata_pageid: 7459
wikidata_url: https://www.wikidata.org/wiki/Q6346
}
How can I extract only the labels?
I suppose there exists a label like "THIS IS A LOCATION", but how do I use it?
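There is no single "THIS IS A LOCATION" label, but the "instance of" property (P31) among the claims plays that role: it lists the Q-ids of the classes the item belongs to. A minimal sketch along those lines, assuming page.data['claims'] maps property IDs to value lists as the dump above suggests (the set of place classes is a hand-picked assumption, not exhaustive):
import wptools

page = wptools.page('Indianapolis', silent=True)
page.get_wikidata()

# P31 ("instance of") lists the Q-ids of the classes this item belongs to
instance_of = page.data['claims'].get('P31', [])

# hand-picked place classes (assumption): city, city of the United States,
# mountain, river
PLACE_CLASSES = {'Q515', 'Q1093829', 'Q8502', 'Q4022'}
if PLACE_CLASSES & set(instance_of):
    print('Indianapolis looks like a geographical place')
A robust check would also have to follow Wikidata's subclass-of chains (P279), e.g. with a SPARQL query against query.wikidata.org, since P31 only gives the direct classes.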
I have the following test_string from which I need to obtain the actual URL.
Test string (partly shown):
An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Desired output for this part of test_string:
https://www.sciencedirect.com/science/article/pii/S0010218020301346
I have been trying to obtain this with the MWE given below applied to many strings, but it gives only one URL.
MWE
from urlparse import urlparse, parse_qs
import re
from re import search
test_string = '''
Production, Properties, and Applications of ALPHA-Terpineol
<http://scholar.google.com/scholar_url?url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&hl=en&sa=X&d=12771069332921982368&scisig=AAGBfm1tFjLUm7GV1DRnuYCzvR4uGWq9Cg&nossl=1&oi=scholaralrt&hist=v2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
A Sales, L de Oliveira Felipe, JL Bicas
Abstract ALPHA-Terpineol (CAS No. 98-55-5) is a tertiary monoterpenoid
alcohol widely
and commonly used in the flavors and fragrances industry for its sensory
properties.
It is present in different natural sources, but its production is mostly
based on ...
Save
<http://scholar.google.com/citations?update_op=email_library_add&info=oB2z7uTzO7EJ&citsig=AMD79ooAAAAAYLfmix3sQyUWnFrHeKYZxuK31qlqlbCh&hl=en>
Twitter
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=tw&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g>
Facebook
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=fb&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g>
An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Butanol/diesel blend is considered as a very promising alternative fuel
with
agreeable combustion and emission performance in engines. This paper
intends to
further investigate its autoignition characteristics with the combination
of a heated =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3DE=
27Gd756Qj4J&citsig=3DAMD79ooAAAAAYImDxwWCwd5S5xIogWp9RTavFRMtTDgS&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
Using NMR spectroscopy to investigate the role played by copper in prion
diseases.
<http://scholar.google.com/scholar_url?url=3Dhttps://europepmc.org/article/=
med/32328835&hl=3Den&sa=3DX&d=3D16122276072657817806&scisig=3DAAGBfm1AE6Kyl=
jWO1k0f7oBnKFClEzhTMg&nossl=3D1&oi=3Dscholaralrt&hist=3Dv2Y_3P0AAAAJ:179499=
55323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
RA Alsiary, M Alghrably, A Saoudi, S Al-Ghamdi=E2=80=A6 - =E2=80=A6 and of =
the Italian
Society of =E2=80=A6, 2020
Prion diseases are a group of rare neurodegenerative disorders that develop
as a
result of the conformational conversion of normal prion protein (PrPC) to
the disease-
associated isoform (PrPSc). The mechanism that actually causes disease
remains =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3Dz=
pCMKavUvd8J&citsig=3DAMD79ooAAAAAYImDx3r4gltEWBAkhl0g2POsXB9Qn4Lk&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
'''
regex = re.compile('(http://scholar.*?)&')
url_all = regex.findall(test_string)
citation_url = []
for i in url_all:
    if search('scholar.google.com', i):
        qs = parse_qs(urlparse(i).query).values()
        if search('http', str(qs[0])):
            citation_url.append(qs[0])
print citation_url
Present output
https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
Desired output
https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
https://www.sciencedirect.com/science/article/pii/S0010218020301346
https://europepmc.org/article/med/32328835
How do I handle URL text wrapping with equals signs and extract the redirect URL in Python?
You could match either a question mark or an ampersand with the character class [&?]. Looking at the example data, for the url= part you can add optional newlines and an optional equals sign and adjust accordingly.
Some URLs start with 3D; you can make that part optional using a non-capturing group (?:3D)?
Then capture in group 1: http followed by all characters except &
\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)
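A minimal sketch applying that pattern (my own wrapper, not part of the answer): the captured group can still contain the "=" soft line breaks that the email wrapping inserts, so they are stripped afterwards. Note that the pattern also matches the url= parameter of the Twitter/Facebook share links, so duplicates may need filtering.
import re

pattern = re.compile(
    r'\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)')

for m in pattern.finditer(test_string):
    # undo the quoted-printable soft line breaks ("=\n") left inside the capture
    url = m.group(1).replace('=\n', '').replace('\n', '')
    print(url)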
See this regex pattern; I think it might help to extract the redirect URL:
(http:\/\/scholar[\w.\/=&?]*)[?]?u[=]?rl=([\w\:.\/\-=]+)
Also see this example: https://regex101.com/r/dmkF3h/3
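Another option, separate from both answers above: the =3D sequences and the trailing = soft line breaks are quoted-printable encoding from the email, so you could decode the whole string first with the standard-library quopri module and then reuse the urlparse/parse_qs logic from the MWE. A sketch, assuming Python 3:
import quopri
import re
from urllib.parse import urlparse, parse_qs

decoded = quopri.decodestring(test_string.encode()).decode('utf-8', 'replace')
citation_url = []
# only the scholar_url links carry the target article in their url= parameter
for link in re.findall(r'http://scholar\.google\.com/scholar_url\?[^>\s]+', decoded):
    target = parse_qs(urlparse(link).query).get('url', [])
    if target and target[0] not in citation_url:
        citation_url.append(target[0])
print(citation_url)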
I have a list of names of Fortune 500 companies.
Here is an example: [Abbott Laboratories, Progressive, Arrow Electronics, Kraft Heinz, Plains GP Holdings, Gilead Sciences, Mondelez International, Northrop Grumman]
Now I want to get the complete Wikipedia URL for each element in the list.
For example, after searching the name on Google or Wikipedia,
it should give me back a list of Wikipedia URLs like:
https://en.wikipedia.org/wiki/Abbott_Laboratories (this is only one example)
The biggest problem is looking for possible sites and only selecting the one belonging to the company.
One somewhat wrong way would be just appending the company name to the wiki URL and hoping that it works. That results in a) it works (like Abbott Laboratories), b) it produces a page, but not the right one (Progressive, which should be Progressive_Corporation), or c) it produces no result at all.
companies = [
    "Abbott Laboratories", "Progressive", "Arrow Electronics", "Kraft Heinz", "Plains GP Holdings",
    "Gilead Sciences", "Mondelez International", "Northrop Grumman"
]
url = "https://en.wikipedia.org/wiki/%s"
for company in companies:
    print(url % company.replace(" ", "_"))
Another (way better) option would be using the wikipedia package (https://pypi.org/project/wikipedia/) and its built-in search function. The problem of selecting the right page still remains, so you basically have to do this by hand (or build a good automatic selection, like searching for the word "company"; see the sketch after the code below).
companies = [
    "Abbott Laboratories", "Progressive", "Arrow Electronics", "Kraft Heinz", "Plains GP Holdings",
    "Gilead Sciences", "Mondelez International", "Northrop Grumman"
]
import wikipedia
for company in companies:
    options = wikipedia.search(company)
    print(company, options)
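As a rough sketch of such an automatic selection (the keyword heuristic and the helper name are my own, not a tested recipe): fetch a one-sentence summary of each candidate and prefer the first one that mentions "company" or "corporation".
import wikipedia

def company_url(company):
    # prefer the first search hit whose summary mentions "company" or
    # "corporation"; fall back to the top hit if none does
    options = wikipedia.search(company)
    for option in options:
        try:
            summary = wikipedia.summary(option, sentences=1, auto_suggest=False)
        except wikipedia.exceptions.WikipediaException:
            continue
        if 'company' in summary.lower() or 'corporation' in summary.lower():
            return wikipedia.page(option, auto_suggest=False).url
    return wikipedia.page(options[0], auto_suggest=False).url if options else None

for company in companies:
    print(company, company_url(company))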
I have plotted a graph in Python using folium/Leaflet with search. The problem I am facing is that it highlights only one result on the map even if there are multiple matching results.
For example, if I search by name, it works fine, as there is usually only one person with that full name.
But if I search by city, for example 'Delhi', it highlights only one marker instead of 7 or 10.
In my screenshot, only one marker (circled in red) is highlighted as the result, but there are others from the same city that are not highlighted.
And how can I change highlighting properties to something more noticeable?
Here is the code:
import base64

import folium
from folium import IFrame
from folium.plugins import Search

######################################################################
# Part 1 - Creating the map
m_search = folium.Map(location=[28.6003435, 77.21952300000001], zoom_start=11)

#######################################################################
# Part 2 - Creating folium markers on the map with HTML text, images,
# etc. dynamically for multiple points using a for-loop
for plot_numb in range(gdf.shape[0]):
    icon = folium.Icon(color="blue", icon="cloud", icon_color='white')
    tooltip = 'Click to view more about: ' + gdf.iloc[plot_numb, 0]
    var_name = gdf.iloc[plot_numb, 0]
    var_loc = gdf.iloc[plot_numb, 2]
    pic = base64.b64encode(open('Images/' + gdf.iloc[plot_numb, 5], 'rb').read()).decode()
    html = f'''<img ALIGN="Right" src="data:image/png;base64,{pic}">\
<strong>Name: </strong>{var_name}<br /><br />\
<strong>Location: </strong>{var_loc}<br /><br />\
'''
    iframe = IFrame(html, width=300 + 180, height=300)
    popup = folium.Popup(iframe, max_width=650)
    marker = folium.Marker(location=gdf.iloc[plot_numb, 1], popup=popup,
                           tooltip=tooltip, icon=icon).add_to(m_search)

########################################################################
# Part 3 - Creating markers using GeoJson and adding search;
# creating a folium GeoJson object from our GeoDataFrame
pointgeo = folium.GeoJson(gdf, name='group on map', show=False,
                          tooltip=folium.GeoJsonTooltip(fields=['Name', 'Relation', 'City'],
                                                        aliases=['Name', 'Relation', 'City'],
                                                        localize=True)).add_to(m_search)

# Add a search box to the map that references the GeoJson layer with some
# different parameters passed to the arguments
pointsearch = Search(layer=pointgeo, geom_type='Point',
                     placeholder='Search for contacts', collapsed=False,
                     search_label='Name').add_to(m_search)

# To add a LayerControl, add the line below
folium.LayerControl().add_to(m_search)
m_search
Below is an example of the gdf GeoDataFrame:
Name Location City Relation Relation Detail Image Lat Lon geometry
0 abc [28.562193, 77.387073] Noida Cousin Cousin 1.png 28.562193 77.387073 POINT (77.387073 28.562193)
1 def [28.565282027743955, 77.44913935661314] Noida Cousin Cousin Brother 2.png 28.565282 77.449139 POINT (77.44913935661314 28.56528202774395)
3 ghi [28.6206996683473, 77.42576122283936] Noida Cousin Cousin Brother 4.png 28.620700 77.425761 POINT (77.42576122283936 28.6206996683473)
I need some help here, as I am new to coding and unable to figure this out.
I am trying to read the text from a PDF file. This file is part of a generated report. I am able to read the text in the file, but it comes out very garbled. What I want is to get each line in the PDF file as an item in a list, eventually, but you can see that the field names and entries get all mixed up. An example of the PDF I am trying to import can be found here, and below is the code that I am trying to use to get the lines.
import PyPDF2
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

filename = 'U:/PLAN/BCUBRICH/Python/Network Plan/Page 1 from AMP380_1741500.pdf'

def getPDFContent(filename):
    content = ""
    p = open(filename, "rb")
    pdf = PyPDF2.PdfFileReader(p)
    num_pages = pdf.getNumPages()
    for i in range(0, num_pages):
        content += pdf.getPage(i).extractText() + '\n'
    # content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

content = getPDFContent(filename)
Here is the output I get:
Out:'''UNITED STATES ENVIRONMENTAL PROTECTION AGENCYAIR QUALITY SYSTEMSITE DESCRIPTION REPORTApr. 25, 2019Site ID: 49-003-0003
Site Name: Brigham City
Local ID: BR
140 W.FISHBURN DRIVE, BRIGHAM CITY, UTStreet Address: City: Brigham City
Utah Zip Code: 84302
State: Box ElderCounty: Monitoring PointLocation Description: SuburbanLocation Setting: Interpolation-MapColl. Method:ResidentialLand Use: 20000819Date Established: Date Terminated: 20190130Last Updated: HQ Eval. Date:Regional Eval. Date: UtahAQCR : Ogden-Clearfield, UTCBSA: Salt Lake City-Provo-Orem, UTCSA: Met. Site ID:Direct Met Site: On-Site Met EquipType Met Site: Dist to Met. Site(m): Local Region: Urban Area: Not in an urban area
EPA Region: Denver
17411City Population: Dir. to CBD: Dist. to City(km): 3000Census Block: 3Block Group: 960701Census Tract: 1Congressional District: Class 1 Area: +41.492707Site Latitude: -112.018863Site Longitude: MountainTime Zone: UTM Zone: UTM Northing: UTM Easting: Accuracy: 60.73
Datum: WGS84
Scale: 24000
Point/Line/Area: Point 1,334.0Vertical Measure(m): 0Vert Accuracy: UnknownVert Datum : Vert Method: Unknown
Owning Agency: 1113 Utah Department Of Environmental Quality SITE COMMENTS SITE FOR OZONE, PM2.5, AND MET ACTIVE MONITOR TYPES Primary Monitor Periods # of Parameter Code Poc Begin Date End Date Monitor Type Monitors 42602 1 20180126 OTHER 2 44201 1 20010501 SLAMS 16 88101 1 20000819 20141231 88101 1 20160101 20161231 88101 1 20180101 88101 3 20170101 20171231 88101 4 20150101 20151231 TANGENT ROADS Road Traffic Traffic Compass Number Road Name Count Year Traffic Volume Source Road Type Sector 1 FISHBURN DRIVE 450 2000 LOCAL ST OR HY S Page 1 of 77
'''
For example, I would like the eighth item in the list to be
State: Utah Zip Code: 84302 County: Box Elder
but what I get is
Utah Zip Code: 84302 State: Box ElderCounty:
These kinds of mix-ups happen throughout the document.
This is merely an explanation of why that happens, not a solution, but it is too long for a comment, so it became an answer...
The reason for this weird order is that the text chunks in the document are drawn in that order.
If you dig into the PDF and look at the content stream, you find this segment responsible for the example line you picked:
/TD <</MCID 12 >>BDC
-47.25 -1.685 Td
(Utah )Tj
28.125 0 Td
[(Zip Code: )-190(84302)]TJ
-32.06 -0 Td
(State: )Tj
EMC
/TD <</MCID 13 >>BDC
56.81 0 Td
(Box Elder)Tj
-5.625 0 Td
(County: )Tj
EMC
You probably don't understand the instructions, but you can see that the strings (in round brackets (...)) come exactly in the order you observe in the output
Utah Zip Code: 84302 State: Box ElderCounty:
instead of the desired
State: Utah Zip Code: 84302 County: Box Elder
The Td instructions in-between make the text insertion point jump back and forth to achieve the different appearance in a viewer.
Apparently your text extraction method merely retrieves the strings from the content stream in the order they are drawn and ignores the actual locations at which they are drawn. For a proper text extraction, therefore, you have to change the method you use. As I don't really know PyPDF2 myself, I cannot say whether this library offers different text extraction methods to turn to or whether you have to resort to a different library.
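For example, pdfminer.six (a different library, not PyPDF2) performs layout analysis that orders text by its position on the page rather than by drawing order; a minimal sketch, assuming the file from the question:
from pdfminer.high_level import extract_text

# layout analysis reorders the text chunks by their page position
text = extract_text('Page 1 from AMP380_1741500.pdf')
lines = [line for line in text.splitlines() if line.strip()]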
I've used a number of PDF-to-text methods to extract text from PDF documents. For one particular type of PDF I have, neither pyPDF nor pdfMiner does a good job extracting the text. However, http://www.convertpdftotext.net/ does it (almost) perfectly.
I discovered that the pdf I'm using has some transparent text in it, and it is getting merged into the other text.
Some examples of the blocks of text I get back are:
12324 35th Ed. 01-MAR-12 Last LNM: 14/12 NAD 83 14/12 Corrective Object of Corrective
ChartTitle: Intracoastal Waterway Sandy Hook to Little Egg Harbor Position
C HAActRionT N Y -NJ - S A N D Y H OO K ATcO tionLI T TLE EGG HARBOR. Page/Side: N/A
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true.
Bearings RoEf LlighOCtAT seEc tors aSrehre towwsbuardry th Re ivligher Ct fhroanmn seel Lawighartde.d B Theuoy 5no minal range of lights is expressedf roin mna 4u0tic-24al -mi46les.56 0(NNM ) unless othe0r7w4is-00e n-o05te.d8.8 0 W
to 40-24-48.585N 074-00-05.967W
and
12352 33rd Ed. 01-MAR-11 Last LNM: 03/12 NAD 83 04/12 . . l . . . . Corrective Object of Corrective ChartTitle: Shinnecock Bay to East Rockaway Inlet Position C HAActRionT S H IN N E C OC K B A Y TO AcEtionAS T ROCKAWAY INLET. Page/Side: N/A (Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are givCGenD 0in 1 degrees clockwise from 000 true. (BTeeamringp) s DoEf LlighETtE s ectors aSretat toew Baoratd Ctheh anlighnet lf Droaym beseacoawanr 3d. The nominal range of lights is expressedf roin mna 4u0tic-37al -mi11les.52 0(NNM ) unless othe0r7w3is-29e n-5o3te.d76. 0 W
and I have discovered that the "ghost text" is ALWAYS the following:
Corrective Object of Corrective Position
Action Action
(Temp) indicates that the chart correction action is temporary in nature. Courses and bearings are given in degrees clockwise from 000 true.
Bearings of light sectors are toward the light from seaward. The nominal range of lights is expressed in nautical miles (NM) unless otherwise noted.
In the 2nd example I posted, the text I want (with the ghost text removed) is:
12352 33rd Ed. 01-Mar-11 Last LNM:03/12 NAD 83 04/12
Chart Title:Shinnecock Bay to East Rockaway Inlet. Page/Side:N/A
CGD01
(Temp) DELETE State Boat Channel Daybeacon 3 from 40-37-11.520N 073-29-53.760W
This problem occurs just once per document, and does not appear to be totally consistent (as seen above). I am wondering if one of you wizards could think of a way to remove the ghosted text (I don't need/want it) using Python. If I had been using pyPDF, I would have used a regex to rip it out during the conversion to text. Unfortunately, since I'm starting out with a text file from the website listed above, the damage has already been done. I'm at a bit of a loss.
Thanks for reading.
EDIT:
The solution to this problem looks like it will be more complex than the rest of the application, so I'm going to withdraw my request for help.
I very much appreciate the thought put into it by those who have contributed.
Given that the ghost text can be split up in seemingly unpredictable ways, I don't think there is a simple automatic way of removing it that would not have false positives. What you need is almost human-level pattern recognition. :-)
What you could try is exploiting the format of these kinds of messages. Roughly:
<number> <number>[rn]d Ed. <date> Last LNM:<mm>/<yy> NAD <date2>
Chart Title:<text>. Page/Side:<N/A or number(s)> CGD<number> <text>
<position>
Using this you could pluck out the nonsense from the predictable elements, and then, if you have a list of chart names ('Shinnecock Bay to East Rockaway Inlet') and descriptive words (like 'State', 'Boat', 'Daybeacon'), you might be able to reconstruct the original words by finding the smallest Levenshtein distance between the mangled words in the two text blocks and those in your word lists; a small sketch of that follows.
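A sketch of that matching step, assuming a word list is available (the list below is hypothetical, and difflib's ratio-based matching stands in for a true Levenshtein distance):
import difflib

KNOWN_WORDS = ['State', 'Boat', 'Channel', 'Daybeacon', 'Shinnecock']  # hypothetical word list

def unmangle(word, cutoff=0.6):
    # return the closest known word, or the input unchanged if nothing is close enough
    matches = difflib.get_close_matches(word, KNOWN_WORDS, n=1, cutoff=cutoff)
    return matches[0] if matches else word

print(unmangle('Sretat'))  # ideally recovers 'State'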
If you can install the poppler software, you could try to use pdftotext with the -layout option to keep the formatting of the original PDF as much as possible. That might make your problem disappear.
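Invoked from Python, that would look roughly like this (a sketch, assuming pdftotext is on the PATH):
import subprocess

# -layout preserves the physical page layout in the text output
subprocess.run(['pdftotext', '-layout', 'input.pdf', 'output.txt'], check=True)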
You could recursively find all possible ways that your pattern "Corrective Object of Corrective Position Action ..." can be contained within your mangled text.
Then you can unmangle the text for each of these possible paths, run some sort of spellcheck over them, and choose the one with the fewest spelling mistakes. Or since you know roughly where each substring should appear, you can use that as a heuristic.
Or you could simply use the first path.
Here is that idea as runnable Python (the original untested pseudocode, fixed up with a helper and an offset parameter so the recorded indices stay absolute):
def findAllOccurences(mangledText, letter):
    # get all indices in mangledText that contain letter
    return [i for i, c in enumerate(mangledText) if c == letter]

def findPaths(mangledText, pattern, path, offset=0):
    if len(pattern) == 0:  # end of pattern
        return [path]
    nextLetter = pattern[0]
    locations = findAllOccurences(mangledText, nextLetter)
    allPaths = []
    for loc in locations:
        paths = findPaths(mangledText[loc+1:], pattern[1:],
                          path + (offset + loc,), offset + loc + 1)
        allPaths.extend(paths)
    return allPaths  # if no locations for the next letter exist, allPaths will be empty
Then you can call it like this (optionally remove all spaces from your search pattern, unless you are certain they are all included in the mangled text):
allPossiblePaths = findPaths(yourMangledText, "Corrective Object...", ())
Then allPossiblePaths should contain a list of all the possible ways your pattern could be contained in your mangled text.
Each entry is a tuple with the same length as the pattern, containing the indices at which the corresponding letters of the pattern occur in the search text.