I'm building a Django API. The user uploads an invoice, I render it to HTML with PDF.js, and the user labels the invoice number, the buyer name and the buyer address. I then take the JSON response and generate 100 PDFs from the uploaded PDF, after replacing the labeled values with fake ones produced by the Faker library in Python.
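For context, the replacement values come from Faker, roughly like this (a minimal sketch; the field names and the fake_invoice_fields helper are illustrative, not the exact keys of my JSON):

from faker import Faker

fake = Faker()

# one fake record per generated PDF; keys mirror the fields the user labels in the front end
def fake_invoice_fields():
    return {
        "invoice_number": fake.bothify(text="INV-#####"),
        "buyer_name": fake.name(),
        "buyer_address": fake.address().replace("\n", ", "),
    }

records = [fake_invoice_fields() for _ in range(100)]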
I got the task working, but only for PDF 1.3. The content-stream object I modify in a PDF 1.3 file looks like this:
7 0 obj
<< /Length 676 >>
stream
2 J
BT
0 0 0 rg
/F1 0027 Tf
57.3750 722.2800 Td
( Simple PDF File 2 ) Tj
ET
BT
/F1 0010 Tf
69.2500 688.6080 Td
( ...continued from page 1. Yet more text. And more text. And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 676.6560 Td
( And more text. And more text. And more text. And more text. And more ) Tj
ET
BT
/F1 0010 Tf
69.2500 664.7040 Td
( text. Oh, how boring typing this stuff. But not as boring as watching ) Tj
ET
BT
/F1 0010 Tf
69.2500 652.7520 Td
( paint dry. And more text. And more text. And more text. And more text. ) Tj
ET
BT
/F1 0010 Tf
69.2500 640.8000 Td
( Boring. More, a little more text. The end, and just as well. ) Tj
ET
endstream
endobj
but it doesn't work for other PDFs (1.4, 1.7 and so on).
The corresponding object in a PDF 1.7 file:
32 0 obj
<</Filter/FlateDecode/Length 421>>stream
... (FlateDecode-compressed binary stream data, not human-readable) ...
endstream
endobj
I have tried many options and libraries, but nothing works.
I am looking for a way to modify PDF 1.7 files whose stream objects are FlateDecode-compressed like the 32 0 obj example above.
For PDF 1.3 I used the solution below, applied to the uncompressed 7 0 obj content stream shown at the top.
The solution: replace the text directly inside the PDF 1.3 object's content stream. That works because the stream is plain text, but it does not carry over to the compressed streams.
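The only structural difference I can see between the two objects is the /Filter/FlateDecode entry: the 1.7 stream is the same kind of content stream, just zlib-compressed. Below is a minimal sketch of the direction I'm thinking of, not a working solution: it assumes the stream uses a single FlateDecode filter and that the labeled text survives as a literal string in the decoded bytes (replace_in_flate_stream is a hypothetical helper, not an existing API):

import zlib

def replace_in_flate_stream(raw: bytes, old: bytes, new: bytes) -> bytes:
    # raw = the bytes between 'stream' and 'endstream' of a /FlateDecode object
    decoded = zlib.decompress(raw)         # undo /FlateDecode
    modified = decoded.replace(old, new)   # same replacement that works for PDF 1.3
    return zlib.compress(modified)         # re-apply /FlateDecode

After re-compressing, the /Length entry of the stream dictionary and the xref offsets have to be rewritten, since object sizes change; and even in the decoded stream the text may be split across several Tj/TJ operators or use a subset font encoding, so a plain byte replacement may not find it. That is why a PDF library that exposes decoded stream data and rebuilds the file structure (for example pikepdf) may be the safer route.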
Related
I am using Python, BeautifulSoup and lxml to extract some information from the site, but I am not getting any output, nor any errors. Here's the code:
from bs4 import BeautifulSoup
import requests
html_file = requests.get('https://www.covid19india.org').text
soup = BeautifulSoup(html_file)
states = soup.find_all('div',class_='state-name fadeInUp')
print(states)
The data is loaded from an external source via JavaScript, so BeautifulSoup doesn't see it. You can use this example of how to load the data via the requests module:
import json
import requests
url = "https://api.covid19india.org/v4/min/data.min.json"
data = requests.get(url).json()
# uncomment to print all data:
# print(json.dumps(data, indent=4))
# print some data
print(
    "{:<3} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
        "STATE", "CONFIRMED", "DECEASED", "RECOVERED", "TESTED", "VACCINATED"
    )
)

for k, v in data.items():
    print(
        "{:<5} {:<10} {:<10} {:<10} {:<10} {:<10}".format(
            k,
            v["total"]["confirmed"],
            v["total"]["deceased"],
            v["total"]["recovered"],
            v["total"]["tested"],
            v["total"]["vaccinated"],
        )
    )
Prints:
STATE CONFIRMED DECEASED RECOVERED TESTED VACCINATED
AN 6789 98 6417 381349 117175
AP 1542079 9904 1323019 18435149 7857020
AR 23159 89 20339 525589 318092
AS 359640 2588 302889 9982316 3620118
BR 681199 4339 627548 28735144 9363116
CH 57737 680 51382 477484 302008
CT 941366 12391 852529 8508024 6812740
DL 1412959 22831 1354445 18595993 4970999
DN 9822 4 9167 72410 133590
GA 143192 2302 121562 778828 466488
GJ 780471 9469 686581 20763702 15176798
HP 175384 2638 141198 1806900 2303829
HR 728607 7317 666893 8585262 5198191
JH 324884 4714 293659 7998874 3790724
JK 263905 3465 210547 8140036 2905270
KA 2367742 24207 1829276 28453442 11799162
KL 2293633 6995 1979919 18555023 8597282
LA 17025 172 15264 237486 128352
LD 5756 20 4051 108986 29058
MH 5527092 86618 5070801 32441776 20465193
ML 27755 414 20480 527955 424005
MN 42565 661 35606 675645 403249
MP 757119 7394 682100 9148503 9533515
MZ 9732 30 7450 354526 301219
NL 19593 258 14149 207977 246656
OR 668422 2483 567382 11118984 6964278
PB 528676 12888 452318 8556117 4519318
PY 93167 1295 73936 971544 239864
RJ 903418 7475 764137 10118617 15708235
SK 12527 220 8916 108170 230968
TG 547727 3085 500247 14402251 5522361
TN 1770988 19598 1476761 25949517 7280700
TR 44356 452 37037 836629 1527207
TT 26285069 295508 23059017 324417870 191879503
UP 1659212 18760 1534176 46111719 15813654
UT 307566 5600 233266 4443937 2743665
WB 1229805 14054 1083570 11786397 12932811
You are facing this issue because the response you get with requests is not the same as the page you see rendered in the browser (the content is filled in by JavaScript).
Hint:
you can use this to display the HTML you actually received:
from IPython.core.display import display, HTML
display(HTML(html_file))
I'm learning how to scrape data from websites using BeautifulSoup, and I'm trying to scrape movie links and some data about each movie from the YTS website, but I'm stuck. I wrote a script that scrapes the Tech Specs data for two movie qualities, but some movies have two or more qualities in the Tech Specs area, and to select each one I have to write separate code for every quality. How can I create a for or while loop to scrape all of the data?
import requests
from bs4 import BeautifulSoup
m_r = requests.get('https://yts.mx/movies/suicide-squad-2016')
m_page = BeautifulSoup(m_r.content, 'html.parser')
#------------------ Name, Date, Category ----------------
m_det = m_page.find_all('div', class_='hidden-xs')
m_detail = m_det[4]
m_name = m_detail.contents[1].string
m_date = m_detail.contents[3].string
m_category = m_detail.contents[5].string
print(m_name)
print(m_date)
print(m_category)
#------------------ Download Links ----------------
m_li = m_page.find_all('p', {'class':'hidden-xs hidden-sm'})
m_link = m_li[0]
m_link_720 = m_link.contents[3].get('href')
print(m_link_720)
m_link_1080 = m_link.contents[5].get('href')
print(m_link_1080)
#-------------------- File Size & Language -------------------------
tech_spec = m_page.find_all('div', class_='row')
s_size = tech_spec[6].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in s_size:
    s_size = s_size.replace('MB', '').strip()
    print(s_size)
elif 'GB' in s_size:
    s_size = float(s_size.replace('GB', '').strip())
    s_size = s_size * 1024
    print(s_size)
#--------- Small file language -----------
s_lan = tech_spec[6].contents[5].contents[2].strip()
print(s_lan)
b_size = tech_spec[8].contents[1].contents[1]
#-----------Convert file size to MB-----------
if 'MB' in b_size:
    b_size = b_size.replace('MB', '').strip()
    print(b_size)
elif 'GB' in b_size:
    b_size = float(b_size.replace('GB', '').strip())
    b_size = b_size * 1024
    print(b_size)
#--------- Big file language -----------
b_lan = tech_spec[8].contents[5].contents[2].strip()
print(b_lan)
This script will get all information for each movie quality:
import requests
from bs4 import BeautifulSoup
url = 'https://yts.mx/movies/suicide-squad-2016'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tech_quality, tech_info in zip(soup.select('.tech-quality'), soup.select('.tech-spec-info')):
    print('Tech Quality:', tech_quality.get_text(strip=True))
    file_size, resolution, language, rating = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(1) > div')]
    subtitles, fps, runtime, peers_seeds = [td.get_text(strip=True, separator=' ') for td in tech_info.select('div.row:nth-of-type(2) > div')]
    print('File size:', file_size)
    print('Resolution:', resolution)
    print('Language:', language)
    print('Rating:', rating)
    print('Subtitles:', tech_info.select_one('div.row:nth-of-type(2) > div:nth-of-type(1)').a['href'] if subtitles else '-')
    print('FPS:', fps)
    print('Runtime:', runtime)
    print('Peers/Seeds:', peers_seeds)
    print('-' * 80)
Prints:
Tech Quality: 3D.BLU
File size: 1.88 GB
Resolution: 1920*800
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 8 / 35
--------------------------------------------------------------------------------
Tech Quality: 720p.BLU
File size: 999.95 MB
Resolution: 1280*720
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 61 / 534
--------------------------------------------------------------------------------
Tech Quality: 1080p.BLU
File size: 2.06 GB
Resolution: 1920*1080
Language: English 2.0
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 3 min
Peers/Seeds: P/S 80 / 640
--------------------------------------------------------------------------------
Tech Quality: 2160p.BLU
File size: 5.82 GB
Resolution: 3840*1600
Language: English 5.1
Rating: PG - 13
Subtitles: -
FPS: 23.976 fps
Runtime: 2 hr 2 min
Peers/Seeds: P/S 49 / 110
--------------------------------------------------------------------------------
I'm struggling with Tesseract OCR.
I have a blood examination image; it is a table with indentation. Although Tesseract recognizes the characters very well, the structure isn't preserved in the final output. For example, look at the lines below "Emocromo con formula" (Eng. translation: blood count with formula), which are indented. I want to preserve that indentation.
I read the other related discussions and I found the option preserve_interword_spaces=1. The result became slightly better but as you can see, it isn't perfect.
Any suggestions?
Update:
I tried Tesseract v5.0 and the result is the same.
Code:
Tesseract version is 4.0.0.20190314
from PIL import Image
import pytesseract
# Preserve interword spaces is set to 1, oem = 1 is LSTM,
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
# default_config = r'-c -l eng+ita'
extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)
print(extracted_text)
# saving to a txt file
with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)
Result with comparison: (images omitted)
GitHub: I have created a GitHub repository if you want to try it yourself.
Thanks for your help and your time.
The image_to_data() function provides much more information: for each word it returns its bounding rectangle, and you can use that.
Tesseract segments the image into blocks automatically. You can then sort the blocks by their vertical position, and for each block find the mean character width (which depends on the block's recognized font). Then, for each word in the block, check whether it is close to the previous one; if not, add spaces accordingly. I'm using pandas to ease the calculations, but it isn't strictly necessary. Don't forget that the result should be displayed using a monospaced font.
import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd
custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)
# clean up blanks
df1 = df[(df.conf!='-1')&(df.text!=' ')&(df.text!='')]
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num'] == block]
    sel = curr[curr.text.str.len() > 3]
    char_w = (sel.width / sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left'] / char_w > prev_left + 1:
            added = int(ln['left'] / char_w) - prev_left
            text += ' ' * added
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)
This code will produce the following output:
ssseeess+ SERVIZIO SANITARIO REGIONALE Pagina 2 di3
seoeeeees EMILIA-RROMAGNA
©2888 800
©9868 6 006 : pe ‘ ‘ "
«ee ##e#ecee Azienda Unita Sanitaria Locale di Modena
Seat se ces Amends Ospedaliero-Universitaria Policlinico di Modena
Dipartimento interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica
Direttore dr. T.Trenti
Ospedale Civile S.Agostino-Estense
S.C. Medicina di Laboratorio
S.S. Patologia Clinica - Corelab
Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015
Responsabile dr.ssa M.Varani
Richiesta (CDA): 49/073914 Data di accettazione: 18/12/2018
Data di check-in: 18/12/2018 10:27:06
Referto del 18/12/2018 16:39:53
Provenienza: D4-cp sassuolo
Sig.
Data di Nascita:
Domicilio:
ANALISI RISULTATO __UNITA'DI MISURA VALORI DI RIFERIMENTO
Glucosio 95 mg/dl (70 - 110 )
Creatinina 1.03 mg/dl ( 0.50 - 1.40 )
eGFR Filtrato glomerulare stimato >60 ml/min Cut-off per rischio di I.R.
7 <60. Il calcolo é€ riferito
Equazione CKD-EPI ad una superfice corporea
Standard (1,73 mq)x In Caso
di etnia afroamericana
moltiplicare per il fattore
1,159.
Colesterolo 212 * mg/dl < 200 v.desiderabile
Trigliceridi 106 mg/dl < 180 v.desiderabile
Bilirubina totale 0.60 mg/dl ( 0.16 - 1.10 )
Bilirubina diretta 0.10 mg/dl ( 0.01 - 0.3 )
GOT - AST 17 U/L (1-37)
GPT - ALT ay U/L (1- 40 )
Gamma-GT 15 U/L (1-55)
Sodio 142 mEq/L ( 136 - 146 )
Potassio 4.3 mEq/L (3.5 - 5.3)
Vitamina B12 342 pg/ml ( 200 - 960 )
TSH 5.47 * ulU/ml (0.35 - 4.94 )
FT4 9.7 pg/ml (7 = 15)
Urine chimico fisico morfologico
u-Colore giallo paglierino
u-Peso specifico 1.012 ( 1.010 - 1.027 )
u-pH 5.5 (5.5 - 6.5)
u-Glucosio assente mg/dl assente
u-Proteine assente mg/dl (0 -10 )
u-Emoglobina assente mg/dl assente
u-Corpi chetonici assente mg/dl assente
u-Bilirubina assente mg/dl assente
u-Urobilinogeno 0.20 mg/dl (0- 1.0 )
sedimento non significativo
Il Laureato:
Dott. CRISTINA ROTA
Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante
Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513;
D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10.
Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it)
i! Laureato: Dr. CRISTINA ROTA
1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
I have the following example text / tweet:
RT @trader $AAPL 2012 is o´o´o´o´o´pen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO
I want to follow the procedure of Table 1 in Li, T, van Dalen, J, & van Rees, P.J. (Pieter Jan). (2017). More than just noise? Examining the information content of stock microblogs on financial markets. Journal of Information Technology. doi:10.1057/s41265-016-0034-2 in order to clean up the tweet.
They clean the tweet up in such a way that the final result is:
{RT|123456} {USER|56789} {TICKER|AAPL} {NUMBER|2012} notooopen nottalk patent {COMPANY|GOOG} notdefinetli treatment {HASH|samsung} {EMOTICON|POS} haha {URL}
I use the following script to tokenize the tweet based on the regex:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

emoticon_string = r"""
    (?:
      [<>]?
      [:;=8]                      # eyes
      [\-o\*\']?                  # optional nose
      [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
      |
      [\)\]\(\[dDpP/\:\}\{@\|\\]  # mouth
      [\-o\*\']?                  # optional nose
      [:;=8]                      # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"""
    ,
    # Twitter username:
    r"""(?:@[\w_]+)"""
    ,
    # Hashtags:
    r"""(?:\#+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Cashtags:
    r"""(?:\$+[\w_]+[\w\'_\-]*[\w_]+)"""
    ,
    # Remaining word types:
    r"""
    (?:[+\-]?\d+[,/.:-]\d+[+\-]?)  # Numbers, including fractions, decimals.
    |
    (?:[\w_]+)                     # Words without apostrophes or dashes.
    |
    (?:\.(?:\s*\.){1,})            # Ellipsis dots.
    |
    (?:\S)                         # Everything else that isn't whitespace.
    """
)

word_re = re.compile(r"""(%s)""" % "|".join(regex_strings), re.VERBOSE | re.I | re.UNICODE)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = word_re.findall(s)
        if not self.preserve_case:
            words = map((lambda x: x if emoticon_re.search(x) else x.lower()), words)
        return words


if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh url_that_cannot_be_posted_on_SO'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
This yields the following output:
rt
@trader
$aapl
2012
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
#samsung
got
:-)
heh
url_that_cannot_be_posted_on_SO
How can I adjust this script to get:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|url_that_cannot_be_posted_on_SO}
Thanks in advance for helping me out big time!
You really need to use named capturing groups (mentioned by thebjorn), and use groupdict() to get name-value pairs upon each match. It requires some post-processing though:
All pairs where the value is None must be discarded
If the self.preserve_case is false the value can be turned to lower case at once
If the group name is WORD, ELLIPSIS or ELSE the values are added to words as is
If the group name is HASHTAG, CASHTAG, USER or URL, the values are first stripped of the $, # and @ chars at the start and then added to words as a {<GROUP_NAME>|<VALUE>} item
All other matches are added to words as {<GROUP_NAME>|<VALUE>} item.
Note that \w matches underscores by default, so [\w_] = \w. I optimized the patterns a little bit.
Here is a fixed code snippet:
import re

emoticon_string = r"""
    (?P<EMOTICON>
      [<>]?
      [:;=8]              # eyes
      [-o*']?             # optional nose
      [][()dDpP/:{}@|\\]  # mouth
      |
      [][()dDpP/:}{@|\\]  # mouth
      [-o*']?             # optional nose
      [:;=8]              # eyes
      [<>]?
    )"""

regex_strings = (
    # URL:
    r"""(?P<URL>https?://(?:[-a-zA-Z0-9_$@.&+!*(),]|%[0-9a-fA-F][0-9a-fA-F])+)"""
    ,
    # Twitter username:
    r"""(?P<USER>@\w+)"""
    ,
    # Hashtags:
    r"""(?P<HASHTAG>\#+\w+[\w'-]*\w+)"""
    ,
    # Cashtags:
    r"""(?P<CASHTAG>\$+\w+[\w'-]*\w+)"""
    ,
    # Remaining word types:
    r"""
    (?P<NUMBER>[+-]?\d+(?:[,/.:-]\d+[+-]?)?)  # Numbers, including fractions, decimals.
    |
    (?P<WORD>\w+)                             # Words without apostrophes or dashes.
    |
    (?P<ELLIPSIS>\.(?:\s*\.)+)                # Ellipsis dots.
    |
    (?P<ELSE>\S)                              # Everything else that isn't whitespace.
    """
)

word_re = re.compile(r"""({}|{})""".format(emoticon_string, "|".join(regex_strings)), re.VERBOSE | re.I | re.UNICODE)
#print(word_re.pattern)
emoticon_re = re.compile(regex_strings[1], re.VERBOSE | re.I | re.UNICODE)

######################################################################

class Tokenizer:
    def __init__(self, preserve_case=False):
        self.preserve_case = preserve_case

    def tokenize(self, s):
        try:
            s = str(s)
        except UnicodeDecodeError:
            s = str(s).encode('string_escape')
            s = unicode(s)
        # Tokenize:
        words = []
        for x in word_re.finditer(s):
            for key, val in x.groupdict().items():
                if val:
                    if not self.preserve_case:
                        val = val.lower()
                    if key in ['WORD', 'ELLIPSIS', 'ELSE']:
                        words.append(val)
                    elif key in ['HASHTAG', 'CASHTAG', 'USER', 'URL']:  # Add more here if needed
                        words.append("{{{}|{}}}".format(key, re.sub(r'^[@#$]+', '', val)))
                    else:
                        words.append("{{{}|{}}}".format(key, val))
        return words


if __name__ == '__main__':
    tok = Tokenizer(preserve_case=False)
    test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com'
    tokenized = tok.tokenize(test)
    print("\n".join(tokenized))
With test = ' RT @trader $AAPL 2012 is oooopen to ‘Talk’ about patents with GOOG definitely not the treatment #samsung got:-) heh http://some.site.here.com', it outputs:
rt
{USER|trader}
{CASHTAG|aapl}
{NUMBER|2012}
is
oooopen
to
‘
talk
’
about
patents
with
goog
definitely
not
the
treatment
{HASHTAG|samsung}
got
{EMOTICON|:-)}
heh
{URL|http://some.site.here.com}
See the regex demo online.
A similar question is asked here (Python XML Parsing), but I could not get to the content I am interested in.
I need to extract all the information enclosed in a patent-classification tag whenever its classification-scheme element has the attribute scheme="CPC". There are multiple such elements, and they are enclosed inside the patent-classifications tag.
In the example given below, there are three such values: C 07 K 16 22 I , A 61 K 2039 505 A and C 07 K 2317 21 A
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="/3.0/style/exchange.xsl"?>
<ops:world-patent-data xmlns="http://www.epo.org/exchange" xmlns:ops="http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">
<ops:meta name="elapsed-time" value="21"/>
<exchange-documents>
<exchange-document system="ops.epo.org" family-id="39103486" country="US" doc-number="2009234106" kind="A1">
<bibliographic-data>
<publication-reference>
<document-id document-id-type="docdb">
<country>US</country>
<doc-number>2009234106</doc-number>
<kind>A1</kind>
<date>20090917</date>
</document-id>
<document-id document-id-type="epodoc">
<doc-number>US2009234106</doc-number>
<date>20090917</date>
</document-id>
</publication-reference>
<classifications-ipcr>
<classification-ipcr sequence="1">
<text>C07K 16/ 44 A I </text>
</classification-ipcr>
</classifications-ipcr>
<patent-classifications>
<patent-classification sequence="1">
<classification-scheme office="" scheme="CPC"/>
<section>C</section>
<class>07</class>
<subclass>K</subclass>
<main-group>16</main-group>
<subgroup>22</subgroup>
<classification-value>I</classification-value>
</patent-classification>
<patent-classification sequence="2">
<classification-scheme office="" scheme="CPC"/>
<section>A</section>
<class>61</class>
<subclass>K</subclass>
<main-group>2039</main-group>
<subgroup>505</subgroup>
<classification-value>A</classification-value>
</patent-classification>
<patent-classification sequence="7">
<classification-scheme office="" scheme="CPC"/>
<section>C</section>
<class>07</class>
<subclass>K</subclass>
<main-group>2317</main-group>
<subgroup>92</subgroup>
<classification-value>A</classification-value>
</patent-classification>
<patent-classification sequence="1">
<classification-scheme office="US" scheme="UC"/>
<classification-symbol>530/387.9</classification-symbol>
</patent-classification>
</patent-classifications>
</bibliographic-data>
</exchange-document>
</exchange-documents>
</ops:world-patent-data>
Install BeautifulSoup if you don't have it:
$ pip install beautifulsoup4
Try this:
from bs4 import BeautifulSoup

xml = open('example.xml', 'rb').read()
bs = BeautifulSoup(xml)

# find patent-classification
patents = bs.findAll('patent-classification')

# filter the ones with CPC
for pa in patents:
    if pa.find('classification-scheme', {'scheme': 'CPC'}):
        print(pa.getText())
You can use Python's standard xml module:
import xml.etree.ElementTree as ET

root = ET.parse('a.xml').getroot()
for node in root.iterfind(".//{http://www.epo.org/exchange}classification-scheme[@scheme='CPC']/.."):
    data = []
    for d in node:
        if d.text:
            data.append(d.text)
    print(' '.join(data))