I have a database from which I read. I want to identify the language in a specific cell, defined by its column. I am using the langid library for Python.
I read from my database like this:
import sqlite3
import langid

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute('''SELECT tags FROM sometable''')
for row in selecter:  # iterate through all the rows in the db
    # print(type(row))  # tuple
    rf = str(row)
    # print(type(rf))  # string
    lan = langid.classify("{}".format(rf))
Technically, it works. It identifies the languages used and later on (not displayed here) writes the identified language back into the database.
So, now comes the weird part.
I wanted to double check some results manually. So I have these words:
a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
When I perform the language identification on the database, it writes Portuguese into the database.
But, performing it like this:
a = "shadow party people bw music mer white man black france men art nature monochrome french fun shoe sand nikon europe noir noiretblanc sable playa poetic nb ombre shade contraste plage blanc saxophone dunkerque nord homme musique saxo artiste artistique musicien chaussure blancandwhite d90 saxophoniste zyudcoote"
lan = langid.classify(a)
Well, that returns French. Apart from the fact that the text is neither French nor Portuguese, why do I get different results?!
In the loop, row is bound to a tuple with a single item, i.e. ('tags',), where 'tags' stands for the list of words. str(row) therefore (in Python 3) returns "('tags',)", and it is this string (including the single quotes, comma and parentheses) that is being passed to langid.classify(). If you are using Python 2, the string becomes "(u'tags',)".
Now, I am not sure that this explains the different language detection result, and my testing in Python 2 shows that it doesn't, but it is an obvious difference between database-sourced data and plain string data.
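As a minimal sketch of the more robust approach (assuming the same database, table and column names as in the question), pass the tuple's first item to the classifier instead of the stringified tuple:
import sqlite3
import langid

connector = sqlite3.connect("somedb.db")
selecter = connector.cursor()
selecter.execute('''SELECT tags FROM sometable''')
for row in selecter:
    # row is a 1-tuple like ('tags',); classify its first item, not str(row)
    lan = langid.classify(row[0])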
Possibly there is some encoding issue coming into play. How is the data stored in the database? Text data should be UTF-8 encoded.
import re
#list of names to identify in input strings
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort() # sorts normally by alphabetical order (optional)
result_list.sort(key=len, reverse=True) # sorts by descending length
#example 1
input_text = "Melissa went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so Thomas Edd is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker."
# In this example 2, it is almost the same; however, some of the names are already encapsulated
# under the ((PERS)name) structure and should not be encapsulated again.
input_text = "((PERS)Melissa) went for a walk in the park, then Melisa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 2
for i in result_list:
    input_text = re.sub(r"\(\(PERS\)" + r"(" + str(i) + r")" + r"\)",
                        lambda m: (f"((PERS){m[1]})"),
                        input_text)
print(repr(input_text)) # --> output
Note that the names must meet certain conditions to be identified: they must be surrounded by whitespace (\s*the searched name\s*), or be at the beginning ((?:(?<=\s)|^)) and/or at the end of the input string.
It may also be the case that a name is followed by a comma, for example "Ada White, Melissa and Louis went shopping" or if spaces are accidentally omitted "Ada White,Melissa and Louis went shopping".
For this reason it is important to allow for the possibility that a name appears immediately after [.,;].
Cases where the names should NOT be encapsulated, would be for example...
"the Edd's business"
"The whitespace"
"the pasteurization process takes time"
"Those White-spaces in that text are unnecessary"
since in these cases the name is preceded or followed by another word or word fragment that should not be part of the name being searched for.
For examples 1 and 2 (note that example 2 is the same as example 1 but already has some encapsulated names and you have to prevent them from being encapsulated again), you should get the following output.
"((PERS)Melissa) went for a walk in the park, then ((PERS)Melisa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker."
You can use lookarounds to exclude names that are already encapsulated and those followed by ', ), a word character or -:
import re
result_list = ['Thomas Edd', 'Melissa Clark', 'Ada White', 'Louis Pasteur', 'Edd Thomas', 'Clark Melissa', 'White Eda', 'Pasteur Louis', 'Thomas', 'Melissa', 'Ada', 'Louis', 'Edd', 'Clark', 'White', 'Pasteur']
result_list.sort(key=len, reverse=True) # sorts by descending length
input_text = "((PERS)Melissa) went for a walk in the park, then Melissa Clark went to the cosmetics store. There Thomas showed her a wide variety of cosmetic products. Edd Thomas is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as Edd is always honest with his customers. White is a new client who came to Edd's business due to the good social media reviews she saw from Melissa, her co-worker." #example 1
pat = re.compile(rf"(?<!\(PERS\))({'|'.join(result_list)})(?!['\w)-])")
input_text = re.sub(pat, r'((PERS)\1)', input_text)
Output:
((PERS)Melissa) went for a walk in the park, then ((PERS)Melissa Clark) went to the cosmetics store. There ((PERS)Thomas) showed her a wide variety of cosmetic products. ((PERS)Edd Thomas) is a great salesman, even so ((PERS)Thomas Edd) is a skilled but responsible salesman, as ((PERS)Edd) is always honest with his customers. ((PERS)White) is a new client who came to Edd's business due to the good social media reviews she saw from ((PERS)Melissa), her co-worker.
Of course you can refine the content of your lookahead based on further edge cases.
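For instance, one possible refinement (my own suggestion, not part of the answer above) is a second lookbehind so a name is not matched when it is glued to a preceding word character, as in "McEdd":
import re

result_list = ['Thomas Edd', 'Edd']  # shortened list for the demo
result_list.sort(key=len, reverse=True)

# the extra lookbehind (?<![\w'-]) rejects a match preceded by a
# word character, apostrophe or hyphen
pat = re.compile(rf"(?<!\(PERS\))(?<![\w'-])({'|'.join(result_list)})(?!['\w)-])")
print(re.sub(pat, r'((PERS)\1)', "McEdd met Edd."))
# -> McEdd met ((PERS)Edd).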
I am trying to extract the relevant address info from a string and discard the garbage.
So this:
al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Asesoría Jurídica – Protección de Datos)
Should be:
Avenida de Burgos, 109 - 28050 Madrid
What I've done:
I am using stanza NER to find locations from text.
After that, I am using the indexes of the found entities to get the full address.
For example: if Madrid (a Spanish city) is found at text[120:128], I will extract the string text[60:143] (60 characters before the entity and 15 after) to get the full address.
My current code is:
##
##STANZA NER FOR LOCATIONS
##
!pip install stanza
# download the Spanish model
import stanza
stanza.download('es')

# create and run the NER pipeline
nlp = stanza.Pipeline(lang='es', processors='tokenize,ner')
text = 'al domicilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Asesoría Jurídica – Protección de Datos) '
doc = nlp(text)
# print results of the NER tagger
print(*[ent for ent in doc.ents if ent.type == "LOC"], sep='\n')
print(*[text[int(ent.start_char) - 60:int(ent.end_char) + 15] for ent in doc.ents if ent.type == "LOC"], sep='\n')
After this, in this particular case (which should be reproducible), I get the following address:
cilio social de la Compañía, Avenida de Burgos, 109 - 28050 Madrid (Indicar Aseso
This contains extra "garbage" info: "cilio social de la Compañía," at the start and "(Indicar Aseso" at the end.
In the next part of the process, I am using the libpostal library to parse the address, as follows:
!pip install postal
from postal.parser import parse_address
parse_address('Avenida de Burgos, 109 - 28050 Madrid')
This works reliably in most cases, but only with clean addresses:
[('avenida de burgos', 'road'),
('109', 'house_number'),
('28050', 'postcode'),
('madrid', 'city')]
So, to sum up, I am searching for another technique apart from regex to help me discard garbage info from addresses (libraries which do this, if they exist, or a new NLP approach ...).
Thanks
For US address extraction from bulk text:
For US addresses in bulks of text I have pretty good luck, though not perfect results, with the regex below. It won't work on many of the oddity-type addresses, and it only captures the first 5 digits of the zip.
Explanation:
([0-9]{1,6}) - a string of 1-6 digits to start off
(.{5,75}) - any character, 5-75 times. I looked at the addresses I was interested in, and the vast majority were over 5 and under 60 characters for the address line 1, address line 2 and city.
(BIG LIST OF AMERICAN STATES AND ABBREVIATIONS) - this is to match on states. It assumes state names will be Title Case.
.{1,2} - designed to accommodate the many permutations of ,\s or just \s between the state and the zip
([0-9]{5}) - captures the first 5 digits of the zip.
text = "is an individual maintaining a residence at 175 Fox Meadow, Orchard Park, NY 14127. 2. other,"
address_regex = r"([0-9]{1,6})(.{5,75}?)((?:Ala(?:(?:bam|sk)a)|American Samoa|Arizona|Arkansas|(?:^(?!Baja )California)|Colorado|Connecticut|Delaware|District of Columbia|Florida|Georgia|Guam|Hawaii|Idaho|Illinois|Indiana|Iowa|Kansas|Kentucky|Louisiana|Maine|Maryland|Massachusetts|Michigan|Minnesota|Miss(?:(?:issipp|our)i)|Montana|Nebraska|Nevada|New (?:Hampshire|Jersey|Mexico|York)|North (?:(?:Carolin|Dakot)a)|Ohio|Oklahoma|Oregon|Pennsylvania|Puerto Rico|Rhode Island|South (?:(?:Carolin|Dakot)a)|Tennessee|Texas|Utah|Vermont|Virgin(?:ia| Island(s?))|Washington|West Virginia|Wisconsin|Wyoming|A[KLRSZ]|C[AOT]|D[CE]|FL|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|P[AR]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])).{1,2}([0-9]{5})"
addresses = re.findall(address_regex, text)
addresses is then: [('175', ' Fox Meadow, Orchard Park, ', 'NY', '', '14127')]
You can combine these and remove spaces like so:
for address in addresses:
    out_address = " ".join(address)
    out_address = " ".join(out_address.split())
To then break this into a proper line 1, line 2, etc., I suggest using an address validation API like Google or Lob. These can take a string and break it into parts. There are also some Python solutions for this, like usaddress.
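A minimal sketch of the usaddress route, assuming the package is installed (pip install usaddress); its tag() function returns an OrderedDict of labeled components plus a detected address type:
import usaddress

# tag() normalizes the string into labeled parts; it raises
# usaddress.RepeatedLabelError on ambiguous input
parts, address_type = usaddress.tag("175 Fox Meadow, Orchard Park, NY 14127")
print(address_type)  # e.g. 'Street Address'
print(parts)         # OrderedDict([('AddressNumber', '175'), ...])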
For outside the US
To port this to other locations, I suggest considering how this could be adapted to your region. If you have a finite number of states and a similar structure, try looking for someone who has already built the "state" regex for your country.
The head of my dataframe looks like this
import re

for i in df.index:
    txt = df.loc[i]["tweet"]
    txt = re.sub(r'#[A-Z0-9a-z_:]+', '', txt)         # remove hashtags
    txt = re.sub(r'^[RT]+', '', txt)                  # remove leading RT tags
    txt = re.sub('https?://[A-Za-z0-9./]+', '', txt)  # remove URLs
    txt = re.sub("[^a-zA-Z]", " ", txt)               # replace non-letters with spaces
    df.at[i, "tweet"] = txt
However, running this does not remove the 'RT' tags. In addition, I would like to remove the leading 'b' as well.
Raw result tweet column:
b Yal suppose you would people waiting for a tub of paint and garden furniture the league is gone and any that thinks anything else is a complete tool of a human who really needs to get down off that cloud lucky to have it back for
b RT watching porn aftern normal people is like no turn it off they don xe x x t love each other
b RT If not now when nn
b Used red wine as a chaser for Captain Morgan xe x x s Fun times
b RT shackattack Hold the front page s Lockdown property project sent me up the walls
Your regular expression is not working because the sign ^ anchors the match at the beginning of the string, but the two characters you want to remove are not at the beginning.
If you change r'^[RT]+' to r'[RT]+', the two letters will be removed. But be careful, because all other matches will be removed too, i.e. every R and T anywhere in the tweet.
If you want to remove the letter b as well, try r'^b\s([RT]+)?'.
I suggest you try it yourself on https://regex101.com/
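A minimal sketch of that suggestion on one of the sample tweets (the variant r'^b\s(RT\s)?' is mine; it also consumes the space after RT):
import re

tweet = "b RT If not now when nn"
# strip the leading 'b ' and, if present, the 'RT ' that follows it
cleaned = re.sub(r'^b\s(RT\s)?', '', tweet)
print(cleaned)  # -> "If not now when nn"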
I want to insert quotes ("") around the date and the text in each line of the file input.txt. Here is my input file:
created_at : October 9, article : ISTANBUL — Turkey is playing a risky game of chicken in its negotiations with NATO partners who want it to join combat operations against the Islamic State group — and it’s blowing back with violence in Turkish cities. As the Islamic militants rampage through Kurdish-held Syrian territory on Turkey’s border, Turkey says it won’t join the fight unless the U.S.-led coalition also goes after the government of Syrian President Bashar Assad.
created_at : October 9, article : President Obama chairs a special meeting of the U.N. Security Council last month. (Timothy A. Clary/AFP/Getty Images) When it comes to President Obama’s domestic agenda and his maneuvers to (try to) get things done, I get it. I understand what he’s up to, what he’s trying to accomplish, his ultimate endgame. But when it comes to his foreign policy, I have to admit to sometimes thinking “whut?” and agreeing with my colleague Ed Rogers’s assessment on the spate of books criticizing Obama’s foreign policy stewardship.
I want to put quotes around the date and text as follows:
created_at : "October 9", article : "ISTANBUL — Turkey is playing a risky game of chicken in its negotiations with NATO partners who want it to join combat operations against the Islamic State group — and it’s blowing back with violence in Turkish cities. As the Islamic militants rampage through Kurdish-held Syrian territory on Turkey’s border, Turkey says it won’t join the fight unless the U.S.-led coalition also goes after the government of Syrian President Bashar Assad".
created_at : "October 9", article : "President Obama chairs a special meeting of the U.N. Security Council last month. (Timothy A. Clary/AFP/Getty Images) When it comes to President Obama’s domestic agenda and his maneuvers to (try to) get things done, I get it. I understand what he’s up to, what he’s trying to accomplish, his ultimate endgame. But when it comes to his foreign policy, I have to admit to sometimes thinking “whut?” and agreeing with my colleague Ed Rogers’s assessment on the spate of books criticizing Obama’s foreign policy stewardship".
Here is my code, which finds the index of the comma (after the date) and the index of "article"; using these, I want to insert quotes around the date. I also want to insert quotes around the text, but how do I do this?
f = open("input.txt", "r")
for line in f:
article_pos = line.find("article")
print article_pos
comma_pos = line.find(",")
print comma_pos
While you can do this with low-level operations like find and slicing, that's really not the easy or idiomatic way to do it.
First, I'll show you how to do it your way:
comma_pos = line.find(", ")
first_colon_pos = line.find(" : ")
second_colon_pos = line.find(" : ", comma_pos)
line = (line[:first_colon_pos+3] +
        '"' + line[first_colon_pos+3:comma_pos] + '"' +
        line[comma_pos:second_colon_pos+3] +
        '"' + line[second_colon_pos+3:] + '"')
But you can more easily just split the line into bits, munge those bits, and join them back together:
dateline, article = line.split(', ', 1)
key, value = dateline.split(' : ')
dateline = '{} : "{}"'.format(key, value)
key, value = article.split(' : ')
article = '{} : "{}"'.format(key, value)
line = '{}, {}'.format(dateline, article)
And then you can take the repeated parts and refactor them into a simple function so you don't have to write the same thing twice (which may come in handy if you later need to write it four times).
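A minimal sketch of that refactoring (the helper name quote_value is mine):
line = 'created_at : October 9, article : ISTANBUL — Turkey is playing ...'

def quote_value(part):
    # split 'key : value' once and put quotes around the value
    key, value = part.split(' : ', 1)
    return '{} : "{}"'.format(key, value)

dateline, article = line.split(', ', 1)
line = '{}, {}'.format(quote_value(dateline), quote_value(article))
print(line)
# -> created_at : "October 9", article : "ISTANBUL — Turkey is playing ..."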
It's even easier using a regular expression, but that might not be as easy to understand for a novice:
line = re.sub(r'(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)', r'\1"\2"\3"\4"', line)
This works by capturing everything up to the first : (and any spaces after it) in one group, then everything from there to the first comma in a second group, and so on:
(.*?:\s*)(.*?)(\s*,.*?:\s*)(.*)
Notice that the regex has the advantage that I can say "any spaces after it" very simply, while with find or split I had to explicitly specify that there was exactly one space on either side of the colon and one after the comma, because searching for "0 or more spaces" is a lot harder without some way to express it like \s*.
You could also take a look at the regex library re.
E.g.
>>> import re
>>> print(re.sub(r'created_at:\s(.*), article:\s(.*)',
... r'created_at: "\1", article: "\2"',
... 'created_at: October 9, article: ...'))
created_at: "October 9", article: "..."
The first param to re.sub is the pattern you are trying to match. The parens () capture the matches and can be used in the second argument with \1. The third argument is the line of text.
I have a caret-delimited file. The only carets in the file are delimiters -- there are none in text. Several of the fields are free text fields and contain embedded newline characters. This makes parsing the file very difficult. I need the newline characters at the end of the records, but I need to remove them from the fields with text.
This is open source maritime piracy data from the Global Integrated Shipping Information System. Here are three records, preceded by the header row. In the first, the boat name is NORMANNIA, in the second, it is "Unkown" and in the third, it is KOTA BINTANG.
ship_name^ship_flag^tonnage^date^time^imo_num^ship_type^ship_released_on^time_zone^incident_position^coastal_state^area^lat^lon^incident_details^crew_ship_cargo_conseq^incident_location^ship_status_when_attacked^num_involved_in_attack^crew_conseq^weapons_used_by_attackers^ship_parts_raided^lives_lost^crew_wounded^crew_missing^crew_hostage_kidnapped^assaulted^ransom^master_crew_action_taken^reported_to_coastal_authority^reported_to_which_coastal_authority^reporting_state^reporting_intl_org^coastal_state_action_taken
NORMANNIA^Liberia^24987^2009-09-19^22:30^9142980^Bulk carrier^^^Off Pulau Mangkai,^^South China Sea^3° 04.00' N^105° 16.00' E^Eight pirates armed with long knives and crowbars boarded the ship underway. They broke into 2/O cabin, tied up his hands and threatened him with a long knife at his throat. Pirates forced the 2/O to call the Master. While the pirates were waiting next to the Master’s door, they seized C/E and tied up his hands. The pirates rushed inside the Master’s cabin once it was opened. They threatened him with long knives and crowbars and demanded money. Master’s hands were tied up and they forced him to the aft station. The pirates jumped into a long wooden skiff with ship’s cash and crew personal belongings and escaped. C/E and 2/O managed to free themselves and raised the alarm^Pirates tied up the hands of Master, C/E and 2/O. The pirates stole ship’s cash and master’s, C/E & 2/O cash and personal belongings^In international waters^Steaming^5-10 persons^Threat of violence against the crew^Knives^^^^^^^^SSAS activated and reported to owners^^Liberian Authority^^ICC-IMB Piracy Reporting Centre Kuala Lumpur^-
Unkown^Marshall Islands^19846^2013-08-28^23:30^^General cargo ship^^^Cam Pha Port^Viet Nam^South China Sea^20° 59.92' N^107° 19.00' E^While at anchor, six robbers boarded the vessel through the anchor chain and cut opened the padlock of the door to the forecastle store. They removed the turnbuckle and lashing of the forecastle store's rope hatch. The robbers escaped upon hearing the alarm activated when they were sighted by the 2nd officer during the turn-over of duty watch keepers.^"There was no injury to the crew however, the padlock of the door to the forecastle store and the rope hatch were cut-opened.
Two centre shackles and one end shackle were stolen"^In port area^At anchor^5-10 persons^^None/not stated^Main deck^^^^^^^-^^^Viet Nam^"ReCAAP ISC via ReCAAP Focal Point (Vietnam)
ReCAAP ISC via Focal Point (Singapore)"^-
KOTA BINTANG^Singapore^8441^2002-05-12^15:55^8021311^Bulk carrier^^UTC^^^South China Sea^^^Seven robbers armed with long knives boarded the ship, while underway. They broke open accommodation door, held hostage a crew member and forced the Master to open his cabin door. They then tied up the Master and crew member, forced them back onto poop deck from where the robbers jumped overboard and escaped in an unlit boat^Master and cadet assaulted; Cash, crew belongings and ship's cash stolen^In territorial waters^Steaming^5-10 persons^Actual violence against the crew^Knives^^^^^^2^^-^^Yes. SAR, Djakarta and Indonesian Naval Headquarters informed^^ICC-IMB PRC Kuala Lumpur^-
You'll notice that the first and third records are fine and easy to parse. The second record, "Unkown," has some nested newline characters.
How should I go about removing the nested newline characters (but not those at the end of the records) in a python script (or otherwise, if there's an easier way) so that I can import this data into SAS?
Load the data into a string a and then do
import re
newa = re.sub('\n', '', a)
and there will be no newlines in newa. If you instead use
newa = re.sub('\n(?!$)', '', a)
it leaves the newlines at the end of the lines but strips the rest.
I see you've tagged this as regex, but I would recommend using the builtin CSV library to parse this. The CSV library will parse the file correctly, keeping newlines where it should.
Python CSV Examples: http://docs.python.org/2/library/csv.html
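A minimal sketch of that approach (Python 3 shown; the filename events.csv is taken from the code below). Note that the multi-line fields in the sample data are quoted, which is exactly what lets the csv module keep them together:
import csv

with open("events.csv", newline="") as f:
    reader = csv.reader(f, delimiter="^", quotechar='"')
    header = next(reader)
    for record in reader:
        # each record arrives as one list of 34 fields, even when a
        # quoted field spans several physical lines; flatten those here
        cleaned = [field.replace("\n", " ") for field in record]
        print(cleaned[0])  # ship_name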
I solved the problem by counting the number of delimiters encountered and manually switching to a new record when I reached the number associated with a single record. I then stripped all of the newline characters and wrote the data back out to a new file. In essence, it's the original file with the newline characters stripped from the fields but with a newline character at the end of each record. Here's the code:
f = open("events.csv", "r")
carets_per_record = 33
final_file = []
temp_file = []
temp_str = ''
temp_cnt = 0
building = False
for i, line in enumerate(f):
# If there are no carets on the line, we are building a string
if line.count('^') == 0:
building = True
# If we are not building a string, then set temp_str equal to the line
if building is False:
temp_str = line
else:
temp_str = temp_str + " " + line
# Count the number of carets on the line
temp_cnt = temp_str.count('^')
# If we do not have the proper number of carets, then we are building
if temp_cnt < carets_per_record:
building = True
# If we do have the proper number of carets, then we are finished
# and we can push this line to the list
elif temp_cnt == carets_per_record:
building = False
temp_file.append(temp_str)
# Strip embedded newline characters from the temp file
for i, item in enumerate(temp_file):
final_file.append(temp_file[i].replace('\n', ''))
# Write the final_file list out to a csv final_file
g = open("new_events.csv", "wb")
# Write the lines back to the file
for item in enumerate(final_file):
# item is a tuple, so we get the content part and append a new line
g.write(item[1] + '\n')
# Close the files we were working with
f.close()
g.close()