How does MySQL order textual data? - python

This is the query result from a dataset of articles ordered by article title ascending with limit 10 in MySQL. The collation is set to utf8_unicode_ci.
'GTA 5' Sells $800 Million In One Day
'Infinity Blade III' hits the App Store ahead of i...
‘Have you lurked her texts?’ How the directors of ...
‘Second Moon’ by Katie Paterson now on a journey a...
"Do not track" effort in trouble as Digital Advert...
"Forbes": Bill Gates wciąż najbogatszym obywatelem...
"Here Is Something False: You Only Live Once"
“That's The Dumbest Thing I've Ever Heard Of.”
[Introduction to Special Issue] The Future Is Now
1 Great Dividend You Can Buy Right Now
I thought ordering works by taking the position of each character in the encoding table.
For example, ' is 39 and " is 34 in Unicode, but the curly apostrophe ‘ and the curly double quote “ have much higher positions. From my understanding, ‘ and “ shouldn't make it into the result at all, and " should be at the top. I'm clearly missing something here.
My goal is to order this data by title in Python and get the same result as if the data had been ordered in MySQL.

The gist of it is that, to get more natural sort orders, MySQL uses the Unicode Collation Algorithm for utf8_unicode_ci. It compares collation weights rather than code points, which would (probably) place “ next to " and ‘ next to ' when sorting.
Unfortunately this is not simple to emulate in Python, as the algorithm is non-trivial. You can look for a wrapper library like PyICU to do the hard work, although I can't guarantee it will reproduce MySQL's order exactly.
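To make the difference concrete, here is a minimal sketch that sorts a few of the titles from the question by code point, which is what Python's built-in sorted() does. The helper sort_with_icu is a hypothetical name; it assumes the third-party PyICU package and an en_US locale, and its result may still differ from MySQL's utf8_unicode_ci, since the two may use different collation tables.

```python
titles = [
    "'GTA 5' Sells $800 Million In One Day",
    "\u2018Second Moon\u2019 by Katie Paterson",
    '"Do not track" effort in trouble',
    "\u201cThat's The Dumbest Thing I've Ever Heard Of.\u201d",
    "[Introduction to Special Issue] The Future Is Now",
    "1 Great Dividend You Can Buy Right Now",
]

# Plain sorted() compares code points: " (34) < ' (39) < 1 (49) < [ (91),
# and the curly quotes (U+2018, U+201C) land at the very end.
by_code_point = sorted(titles)


def sort_with_icu(items):
    """Sort with an ICU collator, or return None when PyICU is not installed.

    Hypothetical helper; the en_US locale is an assumption.
    """
    try:
        import icu  # third-party package "PyICU"
    except ImportError:
        return None
    collator = icu.Collator.createInstance(icu.Locale("en_US"))
    return sorted(items, key=collator.getSortKey)


by_collation = sort_with_icu(titles)

for title in by_code_point:
    print(title)
```

If PyICU is available, comparing by_collation against the MySQL output on a sample of real titles is the safest way to check whether the two orders agree.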

Related

Python package to extract sentence from a textfile based on keyword

I need a python package that could get the related sentence from a text, based on the keywords provided.
For example, below is the Wikipedia page of J.J Oppenheimer -
Early life
Childhood and education
J. Robert Oppenheimer was born in New York City on April 22, 1904,[note 1][7] to Julius Oppenheimer, a wealthy Jewish textile importer who had immigrated to the United States from Germany in 1888, and Ella Friedman, a painter.
Julius came to the United States with no money, no baccalaureate studies, and no knowledge of the English language. He got a job in a textile company and within a decade was an executive with the company. Ella was from Baltimore.[8] The Oppenheimers were non-observant Ashkenazi Jews.[9]
The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico.
Oppenheimer later remarked that it brought to mind words from the Bhagavad Gita: "Now I am become Death, the destroyer of worlds."
If my passed string is - "JJ Oppenheimer birth date", it should return "J. Robert Oppenheimer was born in New York City on April 22, 1904"
If my passed string is - "JJ Openheimer Trinity test", it should return "The first atomic bomb was successfully detonated on July 16, 1945, in the Trinity test in New Mexico"
I searched a lot, but nothing comes close to what I want, and I don't know much about NLP vectorization techniques. It would be great if someone could suggest a package, if one exists.
You could use fuzzywuzzy.
fuzz.ratio(search_text, sentence).
This gives you a score of how similar two strings are.
https://github.com/seatgeek/fuzzywuzzy
A module may well exist that does this for you, but you could also try to build it yourself: parse the text and create lists of related words like ["date of birth", "born", "birth date", etc.], and do this for multiple fields. This would let you find whatever information is available.
The idea is:
you grab your text, or whatever you have,
you grab what you are looking for (for example, date of birth),
you then assign date of birth a list of similar words,
you look through your file to see if you find a sentence that contains one of them.
I am not aware of a module for this; maybe I am wrong, but something like this should work.
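The steps above can be sketched as follows. This is only an illustration: the FIELD_KEYWORDS table, the function names, and the naive sentence splitter are all assumptions, not part of any existing package.

```python
import re

# Hypothetical synonym table: each field maps to phrases that signal it.
FIELD_KEYWORDS = {
    "birth date": ["born", "birth date", "date of birth"],
    "trinity test": ["trinity test"],
}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? when the preceding character is a
    # lowercase letter or digit, so an initial such as "J." is not split.
    return [s.strip() for s in re.split(r"(?<=[a-z0-9][.!?])\s+", text) if s.strip()]


def find_sentences(text, field):
    """Return the sentences that contain any phrase listed for the field."""
    keywords = FIELD_KEYWORDS.get(field.lower(), [field.lower()])
    return [s for s in split_sentences(text)
            if any(k in s.lower() for k in keywords)]


text = (
    "J. Robert Oppenheimer was born in New York City on April 22, 1904. "
    "Ella was from Baltimore. "
    "The first atomic bomb was successfully detonated on July 16, 1945, "
    "in the Trinity test in New Mexico."
)

print(find_sentences(text, "birth date"))
print(find_sentences(text, "trinity test"))
```

The quality of the result depends entirely on how complete the synonym lists are, which is the main cost of this approach.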
The task you describe looks like Information Retrieval. Given a query (the keywords), the model should return a list of documents (the sentences) that best match the query.
This is essentially what the response using fuzzywuzzy is suggesting. But maybe just counting the number of occurrences of the query words in each sentence is enough (and more efficient).
The next step would be to use Tf-Idf. It is a weighting scheme that gives high scores to words that are specific to a document with respect to a set of documents (the corpus).
This results in every document having an associated vector; you will then be able to sort the documents according to their similarity to the query vector. An existing SO answer covers how to do that.
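As a rough illustration of the Tf-Idf ranking described above, here is a minimal sketch using only the standard library. The function names, the smoothed idf formula, and the use of cosine similarity are choices made for this example; a library such as scikit-learn would normally do this work.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters and digits only.
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_sentences(query, sentences):
    """Return (score, sentence) pairs, best match first."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each word appears.
    df = Counter()
    for counts in docs:
        df.update(counts.keys())
    # Smoothed idf (an assumption of this sketch): rarer words weigh more.
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

    def vectorize(counts):
        # Words unknown to the corpus are dropped.
        return {w: tf * idf[w] for w, tf in counts.items() if w in idf}

    def cosine(a, b):
        dot = sum(v * b.get(w, 0.0) for w, v in a.items())
        norm_a = math.sqrt(sum(v * v for v in a.values()))
        norm_b = math.sqrt(sum(v * v for v in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = vectorize(Counter(tokenize(query)))
    scored = [(cosine(q, vectorize(d)), s) for d, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


sentences = [
    "J. Robert Oppenheimer was born in New York City on April 22, 1904.",
    "Ella was from Baltimore.",
    "The first atomic bomb was successfully detonated on July 16, 1945, "
    "in the Trinity test in New Mexico.",
]

best_score, best_sentence = rank_sentences("Oppenheimer Trinity test", sentences)[0]
print(best_sentence)
```

Note that a query such as "birth date" would not match "born" here, since plain Tf-Idf only sees exact words; stemming or a synonym list would be needed for that.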

Want to extract text from a text or pdf file as different paragraphs

Check the following text piece
IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
R/CRIMINAL APPEAL NO. 251 of 2009
FOR APPROVAL AND SIGNATURE:
HONOURABLE MR.JUSTICE R.P.DHOLARIA
==========================================================
1 Whether Reporters of Local Papers may be allowed to see the judgment ?
2 To be referred to the Reporter or not ?
3 Whether their Lordships wish to see the fair copy of the judgment ?
4 Whether this case involves a substantial question of law as to the interpretation of the Constitution of India or any order made thereunder ?
========================================================== STATE OF GUJARAT,S M RAO,FOOD INSPECTOR,OFFICE OF THE Versus DHARMESHBHAI NARHARIBHAI GANDHI ========================================================== Appearance: MS HB PUNANI, APP (2) for the Appellant(s) No. 1 MR DK MODI(1317) for the Opponent(s)/Respondent(s) No. 1 ==========================================================
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
Date : 12/03/2019
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal under section 378(1)
(3) of the Code of Criminal Procedure, 1973
against the judgment and order of acquittal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge and Judicial Magistrate, First Class,
Nadiad in Food Case No.1 of 2007.
The short facts giving rise to the
present appeal are that on 10.11.2006 at about
18.00 hours, the complainant visited the place of
the respondent accused situated at Juna
Makhanpura, Rabarivad, Nadiad along with panch
witness and the respondent was found dealing in
provisional items. The complainant identified
himself as a Food Inspector and after giving
intimation in Form No.6 has purchased muddamal
sample of mustard seeds in the presence of the
panchas for the purpose of analysis. Thereafter,
the complainant Food Inspector has divided the
said sample in equal three parts and after
completing formalities of packing and sealing
obtained signatures of the vendor and panchas and
out of the said three parts, one part was sent to
the Public Analyst, Vadodara for analysis and
remaining two parts were sent to the Local Health
Authority, Gandhinagar. Thereafter, the Public
Analyst forwarded his report. In the said report,
it is stated that the muddamal sample of mustard
seeds is misbranded which is in breach of the
provisions of the Food Adulteration Act, 1954
(for short “the Act”) and the Rules framed
thereunder. It is alleged that, therefore, the
sample of mustard seeds was misbranded and,
thereby, the accused has committed the offence.
**Page 2 of 12
R/CR.A/251/2009* JUDGMENT*
Hence, the complaint came to be lodged against
the respondent accused.
I want to write a program that follows the constraints below. Bear in mind that this is only a single file; I have about 40k files and it should run on all of them. The files differ somewhat, but the basic format of every file is the same.
Constraints:
It should start extracting text after the "metadata". The metadata is the information about the file that runs from the start of the file, i.e. "In the high court of gujarat", to "Oral Judgment". In all the files I have, there are various numbered points after the metadata ends. I need each of these points as a separate paragraph (the text above has 2 points, and I need them in different paragraphs).
Check the lines in italics: these are the page markers in the text/PDF file. I need to remove them, as they do not carry any meaning for the text content I want.
The files are available in both text and PDF format, so I can use either. But I am new to Python, so I don't know how and where to start. I just have basic knowledge of Python.
This data is going to be made into a "corpus" for further processing in building a large expert system, so I hope that makes clear what needs to be done.
Read the official python docs!
Start with python's basic str type and its methods. One of its methods, find, will find substrings in your text.
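Use Python's slicing notation to extract the portion of text you need, e.g.
text = """YOUR TEXT HERE..."""
# str.find is case-sensitive, so the markers must match the file's capitalization
meta_start = 'IN THE HIGH COURT OF GUJARAT'
meta_end = 'ORAL JUDGMENT'
pos1 = text.find(meta_start)
pos2 = text.find(meta_end)
if pos2 > pos1 and pos1 > -1:
    # both markers were found: slice by position, not by the marker strings
    text1 = text[pos1 + len(meta_start):pos2]
    # the judgment itself starts right after the end marker
    body = text[pos2 + len(meta_end):]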
After that, you can go ahead and save your extracted text to a database.
Of course, a better and more complicated solution would be to use regular expressions, but that's another story -- try finding the right way for yourself!
As to italics and other text formatting, you won't ever be able to mark it out in plain text (unless you have some 'meta' markers, like e.g. [i] tags).
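For the regular-expression route mentioned above, a minimal sketch might look like the following. It assumes that every point starts with a number, a period, and a space at the beginning of a line, and that the page markers always look like "Page 1 of 12" and "R/CR.A/251/2009 JUDGMENT" (optionally wrapped in asterisks); the function and pattern names are made up for this example, and real files may need looser patterns.

```python
import re

# Assumed shapes of the page markers, with optional surrounding asterisks.
PAGE_LINE = re.compile(r"^\**Page \d+ of \d+\**$")
CASE_LINE = re.compile(r"^\**R/CR\.A/\d+/\d+\**\s+\**JUDGMENT\**$")
# A new point starts with digits, a period, then whitespace ("1. The ...").
# Dates such as "17.11.2008" do not match because no space follows the period.
POINT_START = re.compile(r"^\d+\.\s")


def extract_points(text, marker="ORAL JUDGMENT"):
    """Return the numbered points after the marker, one string per point."""
    pos = text.find(marker)
    if pos == -1:
        return []
    points = []
    for line in text[pos + len(marker):].splitlines():
        line = line.strip()
        if not line or PAGE_LINE.match(line) or CASE_LINE.match(line):
            continue  # drop blank lines and page markers
        if POINT_START.match(line) or not points:
            points.append(line)        # start a new paragraph
        else:
            points[-1] += " " + line   # continue the current paragraph
    return points


sample = """IN THE HIGH COURT OF GUJARAT AT AHMEDABAD
CORAM: HONOURABLE MR.JUSTICE R.P.DHOLARIA
ORAL JUDGMENT
1. The appellant State of Gujarat has
preferred the present appeal dated
Page 1 of 12
R/CR.A/251/2009 JUDGMENT
17.11.2008 rendered by learned 2nd Additional
Civil Judge.
2. The short facts giving rise to the
present appeal are as follows.
"""

for point in extract_points(sample):
    print(point)
    print()
```

The same function can then be applied to each of the files in a loop, writing one paragraph per point.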

Analysing English text with some French name

I'm dealing with Victor Hugo's well-known novel "Les Miserables".
A part of my project is to detect the existence of each of the novel's character in a sentence and count them. This can be done easily by something like this:
from collections import OrderedDict
def character_frequency(character_per_sentences_dict):
    characters_frequency = OrderedDict()
    for k, v in character_per_sentences_dict.items():
        if len(v) != 0:
            characters_frequency[k] = len(v)
    # characters_in_vol is not defined in this excerpt
    return characters_frequency, characters_in_vol
This piece of code works well for all of the characters except "Èponine".
I also read the text with the following piece of code:
import codecs
from nltk.tokenize import sent_tokenize
with open(path_to_volume + '.txt', 'r', encoding='latin1') as fp:
    novel = ' '.join(fp.readlines())
# Tokenize sentences and calculate the number of sentences
sentences = sent_tokenize(novel)
num_volume = path_to_volume.split("-v")[-1]
I should add that the spelling of "Èponine" is the same everywhere.
Any idea what's going on?
Here is a sample in which this name appears:
" ONE SHOULD ALWAYS BEGIN BY ARRESTING THE VICTIMS
At nightfall, Javert had posted his men and had gone into ambush himself between the trees of the Rue de la Bar­rieredes-Gobelins which faced the Gorbeau house, on the other side of the boulevard. He had begun operations by opening his pockets, and dropping into it the two young girls who were charged with keeping a watch on the ap­proaches to the den. But he had only caged Azelma. As for Èponine, she was not at her post, she had disappeared, and he had not been able to seize her. Then Javert had made a point and had bent his ear to waiting for the signal agreed upon. The comings and goings of the fiacres had greatly agi­tated him. At last, he had grown impatient, and, sure that there was a nest there, sure of being in luck, having recog­nized many of the ruffians who had entered, he had finally decided to go upstairs without waiting for the pistol-shot."
I agree with @BoarGules that there is likely a more efficient and effective way to approach this problem. With that said, I'm not sure what your problem is here. Python fully supports Unicode. You can "just do it" in terms of using Unicode in your program logic with Python's standard string operations and libraries.
For example, this works:
#!/usr/bin/env python
import requests
names = [
    u'Éponine',
    u'Cosette'
]
# Retrieve Les Misérables from Project Gutenberg
t = requests.get("http://www.gutenberg.org/files/135/135-0.txt").text
for name in names:
    # str.count returns the number of non-overlapping occurrences
    c = t.count(name)
    print("{}: {}".format(name, c))
Results:
Éponine: 81
Cosette: 1004
I obviously don't have the text you have, so I don't know whether the problem is how it is encoded or how it is being read. I can't test that without having it. In this code, I get the source text off the internet. My point is just that non-ASCII characters should not pose any impediment to you as long as your inputs are reasonable.
All of the time to run this is spent reading the text. I think even if you added dozens of names, it wouldn't add up to a noticeable delay on any decent computer. So this method works just fine.
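Since the cause could not be tested, here is a small sketch of two things that can make such a count come out as zero; whether either applies to the asker's file is an assumption. The question spells the name with È (grave accent) while the code above searches for É (acute accent), and the question reads the file as latin1, which garbles the name if the file is actually UTF-8. The helper count_name is a hypothetical name.

```python
import unicodedata

acute = "\u00c9ponine"  # É, E with acute accent (spelling used in the answer above)
grave = "\u00c8ponine"  # È, E with grave accent (spelling used in the question)

# The two names look alike but are different strings.
looks_alike_but_differs = acute != grave

# UTF-8 bytes decoded as latin1 no longer contain the name at all.
utf8_bytes = acute.encode("utf-8")
misread = utf8_bytes.decode("latin1")
properly_read = utf8_bytes.decode("utf-8")


def count_name(text, name):
    """Count occurrences after normalizing both strings to NFC.

    Normalizing makes "E" + combining accent equal to the single character É.
    """
    def nfc(s):
        return unicodedata.normalize("NFC", s)
    return nfc(text).count(nfc(name))


# A text that stores the accent as a separate combining character.
decomposed = "As for E\u0301ponine, she was not at her post."

print(looks_alike_but_differs)
print(acute in misread, acute in properly_read)
print(decomposed.count(acute), count_name(decomposed, acute))
```

Printing repr() of a line that contains the name is a quick way to see which code points are really in the file.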

Parsing txt file in python where it is hard to split by delimiter

I am new to python, and am wondering if anyone can help me with some file loading.
The situation is that I have some text files and I'm trying to do sentiment analysis. Here's the text file. It is split into three categories: <department>, <user>, <review>
Here are some sample data:
men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working
I want to make into this
<category> <user> <review>
I have 50k lines of these data.
I have tried to load the file directly into numpy, but it fails with an empty separator error. I looked on Stack Overflow, but I couldn't find a situation that covers a varying number of delimiters. For instance, I can never know how many spaces there are in my data set.
My biggest problem is: how do you count the delimiters and assign the pieces to columns? Is there a way to split each line into three categories, <department>, <user>, <review>? Bear in mind that the review data can contain arbitrary commas and spaces, which I can't control, so the system must be smart enough to cope!
Any ideas? Is there a way to tell Python that, once the user field has been read, everything after it falls under review?
With data like this I'd just use split() with the maxsplit argument:
If maxsplit is given, at most maxsplit splits are done (thus, the list will have at most maxsplit+1 elements).
Example:
from io import StringIO  # Python 3; in Python 2 this was the StringIO module
s = StringIO("""men peter123 the pants are too tight for my liking!
kids georgel i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it
health kksd1 the health pills is drowsy by nature, please take care and do not drive after you eat the pills
office ty7d1 the printer came on time, the only problem with it is with the duplex function which i suspect its not really working""")
for line in s:
    # maxsplit=2: split on the first two runs of whitespace only
    category, user, review = line.split(None, 2)
    print("category: {} - user: {} - review: '{}'".format(
        category, user, review.strip()))
The output is:
category: men - user: peter123 - review: 'the pants are too tight for my liking!'
category: kids - user: georgel - review: 'i really like this toy, it keeps my kid entertained for days! It is affordable and comes on time, i strongly recommend it'
category: health - user: kksd1 - review: 'the health pills is drowsy by nature, please take care and do not drive after you eat the pills'
category: office - user: ty7d1 - review: 'the printer came on time, the only problem with it is with the duplex function which i suspect its not really working'
For reference:
https://docs.python.org/2/library/stdtypes.html#str.split
What about doing it sort of manually:
data = []
for line in input_data:
    # split() with no argument splits on any run of whitespace
    tmp_split = line.split()
    # get the first part (dept)
    dept = tmp_split[0]
    # get the 2nd part
    user = tmp_split[1]
    # everything after is the review - put spaces in between each piece
    review = " ".join(tmp_split[2:])
    data.append([dept, user, review])
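Either approach can be wrapped in a small function that also copes with blank or incomplete lines, which are likely to turn up somewhere in 50k lines. This is only a sketch; the function name parse_reviews and the dictionary keys are made up for the example.

```python
def parse_reviews(lines):
    """Turn raw lines into records with department, user and review fields."""
    records = []
    for line in lines:
        # At most two splits: the third piece keeps its spaces and commas.
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank lines and lines with a missing field
        department, user, review = parts
        records.append({
            "department": department,
            "user": user,
            "review": review.strip(),
        })
    return records


raw_lines = [
    "men peter123 the pants are too tight for my liking!\n",
    "\n",
    "kids   georgel    i really like this toy, it keeps my kid entertained\n",
    "office ty7d1\n",
]

for record in parse_reviews(raw_lines):
    print(record)
```

Because a file object yields its lines when iterated, the same function can be called as parse_reviews(open(path)) for a real file.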

How can I parse email text for components like <salutation><body><signature><reply text> etc?

I'm writing an application that analyzes emails and it would save me a bunch of time if I could use a python library that would parse email text down into named components like <salutation><body><signature><reply text> etc.
For example, the following text "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\nOn Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\nHow about we get together ..." would be parsed as
Salutation: "Hi Dave,\n"
Body: "Lets meet up this Tuesday\n"
Signature: "Cheers, Tom\n\n"
Reply Text: "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: ..."
I know there's no perfect solution for this kind of problem, but even a library that does good approximation would help. Where can I find one?
https://github.com/Trindaz/EFZP
This provides functionality posed in the original question, plus fair recognition of email zones as they commonly appear in email written by native English speakers from common email clients like Outlook and Gmail.
If you score each line based on the types of words it contains, you may get a fairly good indication.
E.g. a line with greeting words near the start is the salutation (salutations may also contain phrases that refer to the past, e.g. "it was good to see you last time").
A body will typically contain words such as "movie", "concert", etc. It will also contain verbs (go to, run, walk, etc.), question marks, and offers (e.g. want to, can we, should we, prefer...).
Check out http://nodebox.net/code/index.php/Linguistics#verb_conjugation
http://ogden.basic-english.org/
http://osteele.com/projects/pywordnet/
The signature will contain closing words.
If you find a data source that has messages of the structure you want, you could do some frequency analysis to see how often each word occurs in each section.
Each word would get a score of the form [salutation score, body score, signature score, ..].
E.g. "hello" could occur 900 times in the salutation, 10 times in the body, and 3 times in the signature.
This means "hello" would be assigned [900, 10, 3, ..],
and "cheers" might be assigned [10, 3, 100, ..].
Now you will have a large list of about 500,000 words.
Words that don't have a large range aren't useful.
E.g. "catch" might have [100, 101, 80, ..], a range of 21
("it was good to catch up", "wanna go catch a fish", "catch you later"). "Catch" can occur anywhere.
Now you can reduce the number of words down to about 10,000.
Next, give each line a score of the same form, [salutation score, body score, signature score, ..].
This score is calculated by adding the score vectors of each word in the line.
E.g. the sentence "hello cheers for giving me your number" could be:
[900, 10, 3, ..] + [10, 3, 100, ..] + .. + .. = [900+10+.., 10+3+.., 3+100+.., ..]
= [1023, 900, 500, ..], say.
Because the biggest number is in the first position, the salutation score, this sentence is a salutation.
So to decide which component one of your lines belongs to, add up the scores of its words in this way.
Good luck. There is always a trade-off between computational complexity and accuracy. If you can find a good set of words and build a good model to base your calculations on, it will help.
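The scoring scheme above can be sketched in a few lines. The two word vectors are the illustrative numbers from the answer, limited to three sections; a real table would come from frequency analysis of a labelled data source, and the function names are made up for this example.

```python
SECTIONS = ("salutation", "body", "signature")

# Illustrative counts from the text: [salutation, body, signature].
WORD_SCORES = {
    "hello": [900, 10, 3],
    "cheers": [10, 3, 100],
}


def score_line(line):
    """Add up the score vectors of every known word in the line."""
    total = [0] * len(SECTIONS)
    for word in line.lower().split():
        word = word.strip(".,!?")
        for i, score in enumerate(WORD_SCORES.get(word, ())):
            total[i] += score
    return total


def classify_line(line):
    """Return the section with the highest score, or None if no word is known."""
    total = score_line(line)
    if max(total) == 0:
        return None
    return SECTIONS[total.index(max(total))]


print(score_line("hello cheers for giving me your number"))
print(classify_line("hello cheers for giving me your number"))
print(classify_line("Cheers, Tom"))
```

With only two words in the table most lines will be unclassified; the approach only becomes useful once the table holds the thousands of words the answer describes.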
The first approach that comes to mind (not necessarily the best...) would be to start off by using split. Here's a little bit of code:
linearray=emailtext.split('\n')
Now you have an array of strings, each one like a paragraph,
so linearray[0] would contain the salutation.
Deciding where the reply text starts is a little more tricky. I noticed that there is a double newline just before it, so maybe search for that from the back and hope that the last one indicates the start of the reply text.
Or store some signature words you might expect and search for those from the front, like cheers, regards, and whatever else.
Once you figure out where the signature is, the rest is easy.
Hope this helped.
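Put together, the split-based heuristic might look like this minimal sketch. The list of sign-off words and the function name are assumptions for the example, and the double-newline rule will misfire on emails whose body contains blank lines.

```python
# Assumed sign-off words; extend the tuple for real data.
SIGNOFF_WORDS = ("cheers", "regards", "thanks")


def split_email(emailtext):
    """Split an email into salutation, body, signature and reply text."""
    # The last double newline is taken as the start of the reply text.
    reply_at = emailtext.rfind("\n\n")
    if reply_at == -1:
        main, reply = emailtext, ""
    else:
        main, reply = emailtext[:reply_at], emailtext[reply_at + 2:]

    linearray = main.split("\n")
    salutation = linearray[0]
    # The first line after the salutation that starts with a sign-off word
    # is taken as the start of the signature.
    sig_at = len(linearray)
    for i, line in enumerate(linearray[1:], start=1):
        if line.lower().startswith(SIGNOFF_WORDS):
            sig_at = i
            break
    return {
        "salutation": salutation,
        "body": "\n".join(linearray[1:sig_at]),
        "signature": "\n".join(linearray[sig_at:]),
        "reply": reply,
    }


emailtext = (
    "Hi Dave,\nLets meet up this Tuesday\nCheers, Tom\n\n"
    "On Sunday, 15 May 2011 at 5:02 PM, Dave Trindall wrote: Hey Tom,\n"
    "How about we get together ..."
)

for name, value in split_email(emailtext).items():
    print(name, "->", repr(value))
```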
I built a pretty cheap API for this actually to parse the contact data from signatures of emails and email chains. It's called SigParser. You can see the Swagger docs here for it.
Basically you send it a header 'x-api-key' with a JSON body like so and it parses all the contacts in the reply chain of an email.
{
  "subject": "Thanks for meeting...",
  "from_address": "bgates@example.com",
  "from_name": "Bill Gates",
  "htmlbody": "<div>Hi, good seeing you the other day.</div><div>--</div><div>Bill Gates</div><div>Cell 777-444-8888</div>LinkedInTwitter",
  "plainbody": "Hi, good seeing you the other day. \r\n--\r\nBill Gates\r\nCell 777-444-8888",
  "date": "Mon, 28 May 2018 23:33:40 +0000 (UTC)"
}
