Convert in utf16 - python

I am crawling several websites and extract the names of the products. In some names there are errors like this:
Malecon 12 Jahre 0,05 ltr.<br>Reserva Superior
Bols Watermelon Lik\u00f6r 0,7l
Hayman\u00b4s Sloe Gin
Ron Zacapa Edici\u00f3n Negra
Havana Club A\u00f1ejo Especial
Caol Ila 13 Jahre (G&M Discovery)
How can I fix that?
I am using xpath and re.search to get the names.
In every Python file, this is the first code: # -*- coding: utf-8 -*-
Edit:
This is the sourcecode, how I get the information.
if '"articleName":' in details:
closer_to_product = details.split('"articleName":', 1)[1]
closer_to_product_2 = closer_to_product.split('"imageTitle', 1)[0]
if debug_product == 1:
print('product before try:' + repr(closer_to_product_2))
try:
found_product = re.search(f'{'"'}(.*?)'f'{'",'}'closer_to_product_2).group(1)
except AttributeError:
found_product = ''
if debug_product == 1:
print('cleared product: ', '>>>' + repr(found_product) + '<<<')
if not found_product:
print(product_detail_page, found_product)
items['products'] = 'default'
else:
items['products'] = found_product
Details
product_details = information.xpath('/*').extract()
product_details = [details.strip() for details in product_details]

Where is a problem (Python 3.8.3)?
import html
strings = [
'Bols Watermelon Lik\u00f6r 0,7l',
'Hayman\u00b4s Sloe Gin',
'Ron Zacapa Edici\u00f3n Negra',
'Havana Club A\u00f1ejo Especial',
'Caol Ila 13 Jahre (G&M Discovery)',
'Old Pulteney \\u00b7 12 Years \\u00b7 40% vol',
'Killepitsch Kr\\u00e4uterlik\\u00f6r 42% 0,7 L']
for str in strings:
print( html.unescape(str).
encode('raw_unicode_escape').
decode('unicode_escape') )
Bols Watermelon Likör 0,7l
Hayman´s Sloe Gin
Ron Zacapa Edición Negra
Havana Club Añejo Especial
Caol Ila 13 Jahre (G&M Discovery)
Old Pulteney · 12 Years · 40% vol
Killepitsch Kräuterlikör 42% 0,7 L
Edit Use .encode('raw_unicode_escape').decode('unicode_escape') for doubled Reverse Solidi, see Python Specific Encodings

Related

Python Regex Capturing Multiple Matches in separate observations

I am trying to create variables location; contract items; contract code; federal aid using regex on the following text:
PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E
This text is from one word file; I'll be looping my code on multiple doc files. In the text above are three location; contract items; contract code; federal aid pairs. But when I use regex to create variables, only the first instance of each pair is included.
The code I have right now is:
# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword
all_bod = []
all_cn = []
all_location = []
all_fedaid = []
all_contractcode = []
all_contractitems = []
all_file = []
text = ' PAGE 1
BID OPENING DATE 07/25/18 FROM 0.2 MILES WEST OF ICE HOUSE 07/26/18 CONTRACT NUMBER 03-2F1304 ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '
LOCATION 03-ED-50-39.5/48.7 DIVISION HIGHWAY ROAD 44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS FEDERAL AID ACNH-P050-(146)E
PAGE 1
BID OPENING DATE 07/25/18 IN EL DORADO COUNTY AT VARIOUS 07/26/18 CONTRACT NUMBER 03-2H6804 LOCATIONS ALONG ROUTES 49 AND 193 CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR 13 CONTRACT ITEMS
TREE REMOVAL FEDERAL AID NONE
PAGE 1
BID OPENING DATE 07/25/18 IN LOS ANGELES, INGLEWOOD AND 07/26/18 CONTRACT NUMBER 07-296304 CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '
LOCATION 07-LA-405-R21.5/26.3 ROAD UNDERCROSSING 55 CONTRACT ITEMS
ROADWAY SAFETY IMPROVEMENT FEDERAL AID ACIM-405-3(056)E'
bod1 = re.search('BID OPENING DATE \s+ (\d+\/\d+\/\d+)', text)
bod2 = re.search('BID OPENING DATE\n\n(\d+\/\d+\/\d+)', text)
if not(bod1 is None):
bod = bod1.group(1)
elif not(bod2 is None):
bod = bod2.group(1)
else:
bod = 'NA'
all_bod.append(bod)
# creating contract number
cn1 = re.search('CONTRACT NUMBER\n+(.*)', text)
cn2 = re.search('CONTRACT NUMBER\s+(.........)', text)
if not(cn1 is None):
cn = cn1.group(1)
elif not(cn2 is None):
cn = cn2.group(1)
else:
cn = 'NA'
all_cn.append(cn)
# location
location1 = re.search('LOCATION \s+\S+', text)
location2 = re.search('LOCATION \n+\S+', text)
if not(location1 is None):
location = location1.group(0)
elif not(location2 is None):
location = location2.group(0)
else:
location = 'NA'
all_location.append(location)
# federal aid
fedaid = re.search('FEDERAL AID\s+\S+', text)
fedaid = fedaid.group(0)
all_fedaid.append(fedaid)
# contract code
contractcode = re.search('CONTRACT CODE\s+\S+', text)
contractcode = contractcode.group(0)
all_contractcode.append(contractcode)
# contract items
contractitems = re.search('\d+ CONTRACT ITEMS', text)
contractitems = contractitems.group(0)
all_contractitems.append(contractitems)
This code parses the only first instance of these variables in the text.
contract-number
location
contract-items
contract-code
federal-aid
03-2F1304
03-ED-50-39.5/48.7
44
A
ACNH-P050-(146)E
But, I am trying to figure out a way to get all possible instances in different observations.
contract-number
location
contract-items
contract-code
federal-aid
03-2F1304
03-ED-50-39.5/48.7
44
A
ACNH-P050-(146)E
03-2H6804
03-ED-0999-VAR
13
C
NONE
07-296304
07-LA-405-R21.5/26.3
55
B
ACIM-405-3(056)E
The all_variables in the code are for looping over multiple word files - we can ignore that if we want :).
Any leads would be super helpful. Thanks so much!
import re
data = []
df = pd.DataFrame()
regex_contract_number =r"(?:CONTRACT NUMBER\s+(?P<contract_number>\S+?)\s)"
regex_location = r"(?:LOCATION\s+(?P<location>\S+))"
regex_contract_items = r"(?:(?P<contract_items>\d+)\sCONTRACT ITEMS)"
regex_federal_aid =r"(?:FEDERAL AID\s+(?P<federal_aid>\S+?)\s)"
regex_contract_code =r"(?:CONTRACT CODE\s+\'(?P<contract_code>\S+?)\s)"
regexes = [regex_contract_number,regex_location,regex_contract_items,regex_federal_aid,regex_contract_code]
for regex in regexes:
for match in re.finditer(regex, text):
data.append(match.groupdict())
df = pd.concat([df, pd.DataFrame(data)], axis=1)
data = []
df

spacy remove only org and person names

I have written the below function which removes all named entities from text. How could I modify it to remove only org and person names? I don't want to remove 6 from $6 from below. Thanks
import spacy
sp = spacy.load('en_core_web_sm')
def NER_removal(text):
document = sp(text)
text_no_namedentities = []
ents = [e.text for e in document.ents]
for item in document:
if item.text in ents:
pass
else:
text_no_namedentities.append(item.text)
return (" ".join(text_no_namedentities))
NER_removal("John loves to play at Sofi stadium at 6.00 PM and he earns $6")
'loves to play at stadium at 6.00 PM and he earns $'
I think item.ent_type_ will be useful here.
import spacy
sp = spacy.load('en_core_web_sm')
def NER_removal(text):
document = sp(text)
text_no_namedentities = []
# define ent types not to remove
ent_types_to_stay = ["MONEY"]
ents = [e.text for e in document.ents]
for item in document:
# add condition to leave defined ent types
if all((item.text in ents, item.ent_type_ not in ent_types_to_stay)):
pass
else:
text_no_namedentities.append(item.text)
return (" ".join(text_no_namedentities))
print(NER_removal("John loves to play at Sofi stadium at 6.00 PM and he earns $6"))
# loves to play at Sofi stadium at 6.00 PM and he earns $ 6

Looking at the next word

I would like to know how I can find a word which has the next one with the first letter capitalised.
For example:
ID Testo
141 Vivo in una piccola città
22 Gli Stati Uniti sono una grande nazione
153 Il Regno Unito ha votato per uscire dall'Europa
64 Hugh Laurie ha interpretato Dr. House
12 Mi piace bere birra.
My expected output would be:
ID Testo Estratte
141 Vivo in una piccola città []
22 Gli Stati Uniti sono una grande nazione [Gli Stati, Stati Uniti]
153 Il Regno Unito ha votato per uscire dall'Europa [Il Regno, Regno Unito]
64 Hugh Laurie ha interpretato Dr. House [Hugh Laurie, Dr House]
12 Mi piace bere birra. []
To extract letter capitalised I do:
df['Estratte'] = df['Testo'].str.findall(r'\b([A-Z][a-z]*)\b')
However this column collect only single words since the code does not look at the next word.
Could you please tell me which condition I should add to look at the next word?
Sometime regex is not always good , let us try split with explode
s=df.Testo.str.split(' ').explode()
s2=s.groupby(level=0).shift(-1)
assign=(s + ' ' + s2)[s.str.istitle() & s2.str.isttimeitle()].groupby(level=0).agg(list)
Out[244]:
1 [Gli Stati, Stati Uniti]
2 [Il Regno, Regno Unito]
3 [Hugh Laurie, Dr. House]
Name: Testo, dtype: object
df['New']=assign
# notice after assign the not find row will be assign as NaN
Maybe you could use my code below
def getCapitalize(myStr):
words = myStr.split()
for i in range(0, len(words) - 1):
if (words[i][0].isupper() and words[i+1][0].isupper()):
yield f"{words[i]} {words[i+1]}"
This function will create a generator and you will have to convert to a list or wtv
import re
import pandas as pd
x = {141 : 'Vivo in una piccola città', 22: 'Gli Stati Uniti sono una grande nazione',
153 : 'Il Regno Unito ha votato per uscire dall\'Europa', 64 : 'Hugh Laurie ha interpretato Dr. House', 12 :'Mi piace bere birra.'}
df = pd.DataFrame(x.items(), columns = ['id', 'testo'])
caps = []
vals = df.testo
for string in vals:
string = string.split(' ')
string = string[1:]
string = ' '.join(string)
caps.append(re.findall('([A-Z][a-z]+)', string))
df['Estratte'] = caps```
Why not match a word starting with capital letter but not at the start of line
df.Testo.str.findall('(?<!^)([A-Z]\w+)')
or
df.Testo.str.findall('(?<!^)[A-Z][a-z]+')
0 []
1 [Stati, Uniti]
2 [Regno, Unito, Europa]
3 [Laurie, Dr, House]
4 []
I think the simplest is to use regex, search (pattern-space-pattern), with overlapping:
import regex as re
df['Estratte'] = df.Testo.apply(lambda x: re.findall('[A-Z][a-z]+[ ][A-Z][a-z]+', x, overlapped=True))

html scraping using python topboxoffice list from imdb website

URL: http://www.imdb.com/chart/?ref_=nv_ch_cht_2
I want you to print top box office list from above site (all the movies' rank, title, weekend, gross and weeks movies in the order)
Example output:
Rank:1
title: godzilla
weekend:$93.2M
Gross:$93.2M
Weeks: 1
Rank: 2
title: Neighbours
This is just a simple way to extract those entities by BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
url = "http://www.imdb.com/chart/?ref_=nv_ch_cht_2"
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data, 'html.parser')
rows = page.findAll("tr", {'class': ['odd', 'even']})
for tr in rows:
for data in tr.findAll("td", {'class': ['titleColumn', 'weeksColumn','ratingColumn']}):
print data.get_text()
P.S.-Arrange according to your will.
There is no need to scrape anything. See the answer I gave here.
How to scrape data from imdb business page?
The below Python script will give you, 1) List of Top Box Office movies from IMDb 2) And also the List of Cast for each of them.
from lxml.html import parse
def imdb_bo(no_of_movies=5):
bo_url = 'http://www.imdb.com/chart/'
bo_page = parse(bo_url).getroot()
bo_table = bo_page.cssselect('table.chart')
bo_total = len(bo_table[0][2])
if no_of_movies <= bo_total:
count = no_of_movies
else:
count = bo_total
movies = {}
for i in range(0, count):
mo = {}
mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
m_page = parse(mo['url']).getroot()
m_casttable = m_page.cssselect('table.cast_list')
flag = 0
mo['cast'] = []
for cast in m_casttable[0]:
if flag == 0:
flag = 1
else:
m_starname = cast[1][0][0].text_content().strip()
mo['cast'].append(m_starname)
movies[i] = mo
return movies
if __name__ == '__main__':
no_of_movies = raw_input("Enter no. of Box office movies to display:")
bo_movies = imdb_bo(int(no_of_movies))
for k,v in bo_movies.iteritems():
print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
print 'URL: '+v['url']
print 'Weekend: '+v['weekend']
print 'Gross: '+v['gross']
print 'Weeks: '+v['weeks']
print 'Cast: '+', '.join(v['cast'])
print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long

Python RE ( In a word to check first letter is case sensitive and rest all case insensitive)

In the below case i want to match string "Singapore" where "S" should always be capital and rest of the words may be in lower or in uppercase. but in the below string "s" is in lower case and it gets matched in search condition. can any body let me know how to implement this?
import re
st = "Information in sinGapore "
if re.search("S""(?i)(ingapore)" , st):
print "matched"
Singapore => matched
sIngapore => notmatched
SinGapore => matched
SINGAPORE => matched
As commented, the Ugly way would be:
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "SingaPore")
<_sre.SRE_Match object at 0x10cea84a8>
>>> re.search("S[iI][Nn][Gg][Aa][Pp][Oo][Rr][Ee]" , "Information in sinGapore")
The more elegant way would be matching Singapore case-insensitive, and then checking that the first letter is S:
reg=re.compile("singapore", re.I)
>>> s="Information in sinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
False
>>> s="Information in SinGapore"
>>> reg.search(s) and reg.search(s).group()[0]=='S'
True
Update
Following your comment - you could use:
reg.search(s).group().startswith("S")
Instead of:
reg.search(s).group()[0]==("S")
If it seems more readable.
Since you want to set a GV code according to the catched phrase (unique name or several name blank separated, I know that), there must be a step in which the code is choosen in a dictionary according to the catched phrase.
So it's easy to make a profit of this step to perform the test on the first letter (must be uppercased) or the first name in the phrase that no regex is capable of.
I choosed certain conditions to constitute the test. For example, a dot in a first name is not mandatory, but uppercased letters are. These conditions will be easily changed at need.
EDIT 1
import re
def regexize(cntry):
def doot(x):
return '\.?'.join(ch for ch in x) + '\.?'
to_join = []
for c in cntry:
cspl = c.split(' ',1)
if len(cspl)==1: # 'Singapore','Austria',...
to_join.append('(%s)%s'
% (doot(c[0]), doot(c[1:])))
else: # 'Den LMM','LMM Den',....
to_join.append('(%s) +%s'
% (doot(cspl[0]),
doot(cspl[1].strip(' ').lower())))
pattern = '|'.join(to_join).join('()')
return re.compile(pattern,re.I)
def code(X,CNTR,r = regexize):
r = regexize(CNTR)
for ma in r.finditer(X):
beg = ma.group(1).split(' ')[0]
if beg==ma.group(1):
GV = countries[beg[0]+beg[1:].replace('.','').lower()] \
if beg[0].upper()==beg[0] else '- bad match -'
else:
try:
k = (ki for ki in countries.iterkeys()
if beg.replace('.','')==ki.split(' ')[0]).next()
GV = countries[k]
except StopIteration:
GV = '- bad match -'
yield ' {!s:15} {!s:^13}'.format(ma.group(1), GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'
print '\n'.join(res for res in code(s,countries))
EDIT 2
I improved the code. It's shorter and more readable.
The instruction assert(.....] is to verify that the keys of the dictionaru are well formed for the purpose.
import re
def doot(x):
return '\.?'.join(ch for ch in x) + '\.?'
def regexize(labels,doot=doot,
wg2 = '(%s) *( %s)',wnog2 = '(%s)(%s)',
ri = re.compile('(.(?!.*? )|[^ ]+)( ?) *(.+\Z)')):
to_join = []
modlabs = {}
for K in labels.iterkeys():
g1,g2,g3 = ri.match(K).groups()
to_join.append((wg2 if g2 else wnog2)
% (doot(g1), doot(g3.lower())))
modlabs[g1+g2+g3.lower()] = labels[K]
return (re.compile('|'.join(to_join), re.I), modlabs)
def code(X,labels,regexize = regexize):
reglab,modlabs = regexize(labels)
for ma in reglab.finditer(X):
a,b = tuple(x for x in ma.groups() if x)
k = (a + b.lower()).replace('.','')
GV = modlabs[k] if k in modlabs else '- bad match -'
yield ' {!s:15} {!s:^13}'.format(a+b, GV)
countries = {'Singapore':'SG','Austria':'AU',
'Swiss':'CH','Chile':'CL',
'Den LMM':'DN','LMM Den':'LM'}
assert(all('.' not in k and
(k.count(' ')==1 or k[0].upper()==k[0])
for k in countries))
s = (' Singapore SIngapore SiNgapore SinGapore'
' SI.Ngapore SIngaPore SinGaporE SinGAPore'
' SINGaporE SiNg.aPoR singapore sIngapore'
' siNgapore sinGapore sINgap.ore sIngaPore'
' sinGaporE sinGAPore sINGaporE siNgaPoRe'
' Austria Aus.trIA aUSTria AUSTRiA'
' Den L.M.M Den Lm.M DEn Lm.M.'
' DEN L.MM De.n L.M.M. Den LmM'
' L.MM DEn LMM DeN LM.m Den')
print '\n'.join(res for res in code(s,countries))
You could write a simple lambda to generate the ugly-but-all-re-solution:
>>> leading_cap_re = lambda s: s[0].upper() + ''.join('[%s%s]' %
(c.upper(),c.lower())
for c in s[1:])
>>> leading_cap_re("Singapore")
'S[Ii][Nn][Gg][Aa][Pp][Oo][Rr][Ee]'
For multi-word cities, define a string-splitting version:
>>> leading_caps_re = lambda s : r'\s+'.join(map(leading_cap_re,s.split()))
>>> print leading_caps_re('Kuala Lumpur')
K[Uu][Aa][Ll][Aa]\s+L[Uu][Mm][Pp][Uu][Rr]
Then your code could just be:
if re.search(leading_caps_re("Singapore") , st):
...etc...
and the ugliness of the RE would be purely internal.
interestingly
/((S)((?i)ingapore))/
Does the right thing in perl but doesn't seem to work as needed in python. To be fair the python docs spell it out clearly, (?i) alters the whole regexp
This is the BEST answer:
(?-i:S)(?i)ingapore
ClickHere for proof:

Categories

Resources