IPTCInfo keywords error - not all keywords displayed - python

I am trying to get the keywords from the metadata of an image using the IPTCInfo3 python library. The keyword tags for the images on windows (right clicking the image, going to the details tab) are:
Capital Cities; Travel; Tourism; Building Exterior; Banking; City Life; Arranging; Downtown District; Dusk; City of London; London - England; Working; Skyscraper; Office Building; Global Business; Illuminated; Glass - Material; Modern; International Landmark; Famous Place; Business; Finance; Architecture; Travel Destinations; Urban Scene; Outdoors; England; UK; Europe; Reflection; Sunset; Sunrise - Dawn; Sky; Thames River; River; Tower; District; Urban Skyline; Cityscape; shard; Capital Cities,Travel,Tourism,Building Exterior,Banking,City Lif;
However I am only getting the last part of the keywords list:
Capital Cities,Travel,Tourism,Building Exterior,Banking,City Lif;
This is my python script:
from iptcinfo3 import IPTCInfo
info = IPTCInfo("C:/Users/Dave/Desktop/ImageTest.jpg")
print(info['keywords'])
However, if I remove Capital Cities,Travel,Tourism,Building Exterior,Banking,City Lif; from the tags, it then outputs all the keywords, not just that one entry. It seems that this line in the tags seperated by , rather than ; is throwing it off. Is there a work around this to display all the keywords or is it nesting it in a way I can't seem to output it correctly?
This is the output from my script exactly in the terminal:
[b'Capital Cities,Travel,Tourism,Building Exterior,Banking,City Lif']

Related

Python package to extract sentence from a textfile based on keyword

I need a python package that could get the related sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J Oppenheimer -
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimer were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds.
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I tried searching a lot but nothing comes closer to what I want and I don't know much about NLP vectorization techniques. It would be great if someone please suggest some package if they know(or exist).
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
I am pretty sure a Module exists that could do this for you, you could try and make it yourself by parsing through the text and creating words like: ["date of birth", "born", "birth date", etc] and you do this for multiple fields. This would thus allow you to find information that would be available.
The idea is:
you grab your text or whatever u have,
you grab what you are looking for (example date of birth)
You then assign a date of birth to a list of similar words,
you look through ur file to see if you find a sentence that has that in it.
I am pretty sure there is no module, maybe I am wrong but smth like this should work.
The task you describe looks like Information Retrieval. Given a query (the keywords) the model should return a list of document (the sentences) that best matches the query.
This is essentially what the response using fuzzywuzzy is suggesting. But maybe just counting the number of occurences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme, that gives high scores to words that are specific to a document with respect to a set of document (the corpus).
This results in every document having a vector associated, you will then be able to sort the documents according to their similarity to the query vector. SO Answer to do that

Want to extract text from a text or pdf file as different paragraphs

Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to be able to write a program such that it follows the given constraints. Be wary of the fact that this is only a single file i have like 40k files and it should run on all the files. All the files have some difference but the basic format for every file is the same.
Constraints.
It should start the text extraction process from after the "metadata" . Metadata is the data about the file from the starting of the file i.e " In the high court of gujarat" till Oral Judgment. In all the files i have , there are various POINTS after the string ends. So i need all these points as a separate paragraph ( see the text has 2 points , i need it in different paragraphs ).
Check the lines in italics, these are the panes in the text/pdf file. I need to remove these as these donot have any meaning to the text content i want.
These files are both available in TEXT or PDF format so i can use either. But i am new to python so i dont know how and where to start. I just have basic knowledge in python.
This data is going to be made into a "corpus" for further processes in building a huge expert system so you know what needs to be done i hope.
Read the official python docs!
Start with python's basic str type and its methods. One of its methods, find, will find substrings in your text.
Use the python slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
meta_start = 'In the high court of gujarat'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
# text is found, extract it
text1 = text[meta_start + len(meta_start):meta_end - 1]
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
As to italics and other text formatting, you won't ever be able to mark it out in plain text (unless you have some 'meta' markers, like e.g. [i] tags).

How to get the country code and national phone number after parsing using the phonenumbers package?

I'm making an application to clean cellphone number. I'm using the phonenumbers package. So I used phonenumbers.parse(Cell No, Country Initials)
and on console it looks like this:
Country Code: ### National Number: ###
I was planning to just delete some text to get the area code and national number but I remembered that the area code and national number character length is different in other countries.
Is there a way to get just the area code and national numbers seperatly
phonenumbers.parse returns a PhoneNumber object.
After y = phonenumbers.parse("020 8366 1177", "GB"), you can access the attributes by e.g. y.country_code or set them by y.country_code = 49 etc.
Extraction of the area code is a bit tricky as not all countries have the concept of an area code. See this code snippet of how to correctly get the area code and national subscriber number separately.

Python Exact Match- Absolute exact match

Based on the code I have I am trying to find an exact match to any of the job positions listed in the input.
INPUT
this is str contains specific MATCH
dfp1[dfp1.index.str.match('Teacher|Dentist|General Manager|District Manager|Bus Driver|Team Lead|Dancer')]
Output is:
Teacher
Teacher, Middle
Teacher, High
Dentist, Sanford
Dentist
General Manager
General Manager, Dollar Tree
Team Lead
Dancer, 10th
Dancer
Dancer, Previous
I do not want anything extra other than the exact job position I put in the input. I want to specifically see only Teacher or Dentist or General Manager or District Manager or Bus Driver or Team Lead or Dancer.
I am not sure what my code is missing for it to display the job titles and no others.
Fixed your regex. You need to add a ^ at the beginning and a $ at the end.
dfp1[dfp1.index.str.match('^(Teacher|Dentist|General Manager|District Manager|Bus Driver|Team Lead|Dancer)$')]

matching commas in BeautifulSoup

I've tested my regex with Pythex and it works as it's supposed to:
The HTML:
Something Very Important (SVI) 2013 Sercret Information, Big Company
Name (LBCN) Catalog Number BCN2013R18 and BSSN 3-55564-789-Y, was
developed as part of the SUP 2012 Something Task force was held in
conjunction with *SEM 2013, the second joint conference on study of
banana hand grenades and gorilla tactics (Association of Ape Warfare
Studies) interest groups BUDDY HOLLY and LION KING. It is comprised of
one hairy object containing 750 gross stories told in the voice of
Morgan Freeman and his trusty sidekick Michelle Bachman.
My regex:
,[\s\w()-]+,
When used with Pythex it selects the area I'm looking for, which is between the 2 commas in the paragraph:
Something Very Important (SVI) 2013 Sercret Information , Big
Company Name (LBCN) Catalog Number BCN2013R18 and BSSN
3-55564-789-Y, was developed as part of the SUP 2012 Something Task
force was held in conjunction with <a href="http://justaURL.com">*SEM
2013</a>, the second joint
conference on study of banana hand grenades and gorilla tactics
(Association of Ape Warfare Studies) interest groups BUDDY HOLLY and
LION KING. It is comprised of one hairy object containing 750 gross
stories told in the voice of Morgan Freeman and his trusty sidekick
Michelle Bachman.
However when I use BeautifulSoup's text regex:
print HTML.body.p.find_all(text=re.compile('\,[\s\w()-]+\,'))
I'm returned this instead of the area between the commas:
[u'Something Very Important (SVI) 2013 Sercret Information, Big Company Name (LBCN) Catalog Number BCN2013R18 and BSSN 3-55564-789-Y, was developed as part of the SUP 2012 Something Task force was held in conjunction with ']
I've also tried escaping the commas but to no luck. Beautiful soup just wants to return the whole <p> instead of the regex that I specified. Also I noticed that it returns the paragraph up until that link in the middle. Is this a problem with how I'm using BeautifulSoup or is this a regex problem?
BeautifulSoup uses the regular expression to search for matching elements. That whole text node matches your search.
You still then have to extract the part you want; BeautifulSoup does not do this for you. You could just reuse your regex here:
expression = re.compile('\,[\s\w()-]+\,')
textnode = HTML.body.p.find_all(text=expression)
print expression.search(textnode).group(0)

Categories

Resources