I am creating a script to extract text from a scanned PDF and build a JSON dictionary for later import into MongoDB. The issue I have run into is that tesseract-ocr, via the textract module, successfully extracts all the text, but when Python reads it all of the whitespace on the PDF gets turned into '\n', making it very hard to extract the information I need.
I have tried cleaning it up with a bunch of lines of code, but it is still not very readable, and the cleanup strips out all the colons, which I feel would make identifying the keys and values a lot easier.
import re

stringedText = str(text)                                   # text is bytes, so str() keeps literal "\n" escapes
cleanText = ' '.join(rmStop).replace('\n', '')             # rmStop is the token list from the nltk step below
splitText = re.split(r'\W+', cleanText)
casedText = [word.lower() for word in splitText]
cleanOne = [word for word in casedText if word != 'n']     # drop stray "n" tokens left over from "\n"
dexStop = cleanOne.index("od260")
dexStart = cleanOne.index("sheet")
clean = cleanOne[dexStart + 1:dexStop]
I am still left with quite a bit of unclean, almost over-processed data, and at this point I don't know how to use it.
This is how I extracted the data:
text = textract.process(filename, method="tesseract", language="eng")
I have tried nltk as well; that took out some data and made it a little easier to read, but there are still a lot of \n sequences muddling up the data.
Here is the nltk code:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stringedText = str(text)
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(stringedText)
rmStop = [i for i in tokens if i not in stop_words]   # filter with the stop word set defined above
Here is what I get from the first cleanup I tried:
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
From that I need a JSON object that looks like:
"lacz-rp" : {
"Date" : "21-feb-2019",
"Sequence" : "gatctctaccatggcgcacatttccccgaaaagtgc"
"Order No." : "15775199"
"Ref No." : "207335463"
}
and so on. I am just not sure what to do. I can also provide the raw output; this is what it looked like before I touched it. The data above is all the information I need to build a complete record.
b' \n\nIDT\nINTEGRATED DNA TECHNOLOGIES,\nOLIGONUCLEOTIDE SPECIFICATION SHEET\n\n \n\n21-Feb-2019\n\nSequence - LacZ-RP\n\n5\'- GAT CTC TAC CAT GGC GCA CAT TTC CCC GAA AAG TGC -3\'\n\nOrder No. 15775199\nref.No. 207335463\n\n25 nmole DNA Oligo, 36 bases\n\n \n\nProperties Amount Of Oligo Shipped To\nTm (50mM NaCl)*:66.8 \xc2\xb0C 11.0= 32.6 DAVID COOK\nGC Content: 52.8% D260 mmoles KANSAS STATE UNIVERSITY-BIOTECHNO.\n\nMolecular Weight: 10,965.1\nnmoles/OD260: 3.0\nug/OD260: 32.6\nExt. Coefficient: 336,200 L/(mole-cm)\n\nSecondary Structure Calcul\n\n \n\nns\nLowest folding free energy (kcal/mole): -3.53 at 25 \xc2\xb0C\n\nStrongest Folding Tm: 46.6 \xc2\xb0C\n\n \n\nOligo Base Types Quantity\n\nDi eo\nModifications and Services Quantity\nStandard Desalting 7\n\nMfg. 1D 289505556\n\n207335463 ~<<IDT\nD.cooK,\n\n2eosoesse 2uren20%9\n\n207335463 ~XIDT\nD.cooK,\n\n \n \n \n\n \n\nINSTRUCTIONS\n\n.d contents may appear as either a translucent film or a white powder.\nice does not affect the quality of the oligo,\n\n\xe2\x80\x9cPlease centrifuge tubes prior to opening. Some of the product may have been\ndislodged during shipping.\n\n\xe2\x80\x9cThe Tm shown takes no account of Mg?* and dNTP concentrations. Use the\nOligoAnalyzer\xc2\xae Program at www.idtdna.com/scitools to calculate accurate Tm for\nyour reaction conditions.\n\nFor 100 |M: add 326 [iL\n\nBURT HALL #207\n\nMANHATTAN, KS 66506\n\nUSA\n\n7855321362\n\nCustomer No. 378741 PO No.06BF3000\n\nDisclaimer\n\nSee on reverse page notes (I) (Il) & (lll) for usage, label\nlicense, and product warranties\n\x0cUse Restrictions: Oligonucleotides and nucleic acid products are manufactured and sold by IDT for the\ncustomer\'s research purposes only. Resale of IDT products requires the express written consent of IDT.\nUnless pursuant to a separate signed agreement by authorized IDT officials, IDT products are not sold\nfor (and have not been approved) for use in any clinical, diagnostic or therapeutic applications.\nObtaining license or approval to use IDT products in proprietary applications or in any non-research\n(clinical) applications is the customer\'s exclusive responsibility. DT will not be responsible or liable for\nany losses, costs, expenses, or other forms of liability arising out of the unauthorized or unlicensed use\nof IDT products. Purchasers of IDT products shall indemnify and hold IDT harmless for any and all\ndamages and/or liability, however characterized, related to the unauthorized or unlicensed use of IDT\nproducts. Under no circumstances shall IDT be liable for any consequential damages, resulting from\nany use (approved or otherwise) of IDT products. All orders received by IDT, and all sales of IDT\nproducts are made subject to the aforementioned use restrictions and customer indemnification of IDT.\n\nGeneral Warranty: IDT\'s products are guaranteed to meet or exceed our published specifications for\nidentity, purity and yield as measured under normal laboratory conditions. If our product fails to meet\nsuch specifications, IDT will promptly replace the product. 
A// other warranties are hereby expressly\ndisclaimed, including but not limited to, the implied warranties of merchantability and fitness for a\nparticular purpose, and any warranty that the products, or the use of products, manufactured by IDT will\nnot infringe the patents of one or more third-partiesAll orders received by IDT, and all sales of IDT\nproducts are made subject to the aforementioned disclaimers of warranties.\n\nSee http://www.idtdna.com/Catalog/Usage/Page1.aspx for further details\na) Cy Dyes: The purchase of this Product includes a limited non-exclusive sublicense under U.S\n\nPatent Nos. 5 556 959 and 5 808 044 and foreign equivalent patents and other foreign and U.S\ncounterpart applications to use the amidites in the Product to perform research. NO OTHER\nLICENSE IS GRANTED EXPRESSLY, IMPLIEDLY OR BY ESTOPPEL. Use of the Product for\ncommercial purposes is strictly prohibited without written permission from Amersham Biosciences\nCorp. For information concerning availability of additional licenses to practice the patented\nmethodologies, please contact Amersham Biosciences Corp, Business Licensing Manager,\nAmersham Place, Little Chalfont, Bucks, HP79NA.\n\nb) \xe2\x80\x94 BHQ: Black Hole Quencher, BHQ-0, BHQ-1, BHQ-2 and BHQ-3 are registered trademarks of\nBiosearch Technologies, Inc., Novato, California, U.S.A Patents are currently pending for the BHQ\ntechnology and such BHQ technology is licensed by the manufacturer pursuant to an agreement\nwith BTI and these products are sold exclusively for research and development use only. They\nmay not be used for human veterinary in vitro or clinical diagnostic purposes and they may not be\nre-sold, distributed or re-packaged. For information on licensing programs to permit use for human\nor veterinary in vitro or clinical diagnostic purposes, please contact Biosearch at\nlicensing#biosearchtech.com.\n\nc) MPI dyes: MPI dyes. This product is provided under license from Molecular Probes, Inc., for\nresearch use only, and is covered by pending and issued patents.\n\nd) Molecular Beacons: Molecular Beacons. This product is sold under license from the Public Health\nResearch Institute only for use in the purchaser\'s research and development activities.\n\ne) ddRNAi: This product is sold solely for use for research purposes in fields other than plants. This\nproduct is not transferable. If the purchaser is not willing to accept the conditions of this label\nlicense, supplier is willing to accept the return of the unopened product and provide the purchaser\nwith a full refund. However if the product is opened, then the purchaser agrees to be bound by the\nconditions of this limited use statement. This product is sold by supplier under license from\nBenitec Australia Ltd and CSIRO as co-owners of U.S Patent No. 6,573,099 and foreign\ncounterparts. For information regarding licenses to these patents for use of ddRNAi as a\ntherapeutic agent or as a method to treat/prevent human disease, please contact Benitec at\nlicensing#benitec.com. For the use of ddRNAi in other fields, please contact CSIRO at\nwww.pi.csiro.au/RNAi.\n\x0cf)\n\n9)\n\nh)\n\nk)\n\n))\n\nm)\n\nn)\n\nDicer Substrate RNAi:\n\n* These products are not for use in humans or non-human animals and may not be used for\nhuman or veterinary diagnostic, prophylactic or therapeutic purposes. 
Sold under license of\npatents pending jointly assigned to IDT and the City of Hope Medical Center.\n\nThis product is licensed under European Patents 1144623, 121945 and foreign equivalents\nfrom Alnylam Pharmaceuticals, Inc., Cambridge, USA and is provided only for use in\nacademic and commercial research whose purpose is to elucidate gene function, including\nresearch to validate potential gene products and pathways for drug discovery and\ndevelopment and to screen non-siRNA based compounds (but excluding the evaluation or\ncharacterization of this product as the potential basis for a siRNA based drug) and not for\nany other commercial purposes. Information about licenses for commercial use (including\ndiscovery and development of siRNA-based drugs) is available from Alnylam\nPharmaceuticals, Inc., 300 Third Street, Cambridge MA 02142, USA\n\nLicense under U.S. Patent # 6506559; Domestic and Foreign Progeny; including European\nPatent Application # 98964202\n\nLNAs: Protected by US. Pat No. 6,268,490 and foreign applications and patents owned or\ncontrolled by Exiqon A/S. For Research Use Only. Not for resale or for therapeutic use or use in\nhumans\n\nOther siRNA duplexes: This product is provided under license from Molecular Probes, Inc., for\nresearch use only, and is covered by pending and issued patents.\n\nAcrydite: IDT is licensed under U.S Patent Number 6,180,770 and 5,932,711 to sell this product\nfor use solely in the purchaser\'s own life sciences research and development activities. Resale, or\nuse of this product in clinical or diagnostic applications, or other commercial applications, requires\nseparate license from Mosaic, Inc.\n\nlso-Bases: Licensed under EraGen, Inc. United States Patents Number 5,432,272; 6,001,983;\n6,037,120; and 6,140,496. For research use Only.\n\nDig: Licensed from Roche Diagnostics GmbH\n\n5\' Nuclease Assay: The 5\' Nuclease Assay and other homogenous amplification methods used in\nconnection with the Polymerase Chain Reaction (PCR) process are covered by patents owned by\nRoche Molecular Systems, Inc. and F. Hoffman La-Roche Ltd (Roche). No license to use the 5"\nNuclease Assay or any Roche patented homogenous amplification process is conveyed expressly\nor by implication to the purchaser by the purchase of the above listed products or any other IDT\nproducts.\n\nlowa Black\xc2\xae FQ and RQ: lowa Black is a registered trademark of IDT, and lowa Black-labeled\noligos are covered by pending patents owned and controlled by IDT.\n\nIRDye\xc2\xae 700 and IRDye\xc2\xae 800: IRDye\xc2\xae 700 and IRDye\xc2\xae 800 are products manufactured under\nlicense from LI-COR\xc2\xae Biosciences, which expressly excludes the right to use this product in\nQPCR or AFLP applications.\n\x0c'
You can convert those \n escapes into real newlines. Use the following:
formatted_text = text.replace('\\n', '\n')
This will replace the escaped newlines with actual newlines in the output.
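Going one step further, a minimal sketch (assuming the raw bytes shown above): textract returns bytes, so decoding them instead of calling str() keeps real newlines, and the labelled fields can then be pulled out with regexes on the decoded text. The field patterns here are guesses based on the sample output:
import re
import textract

raw = textract.process(filename, method="tesseract", language="eng")
decoded = raw.decode("utf-8")   # real newlines instead of literal "\n" escapes

record = {
    "Date": re.search(r"\d{1,2}-\w{3}-\d{4}", decoded).group(0),
    "Order No.": re.search(r"Order No\.\s*(\d+)", decoded).group(1),
    "Ref No.": re.search(r"ref\.?\s*No\.?\s*(\d+)", decoded, re.I).group(1),
}
seq = re.search(r"5'-\s*([ACGT ]+?)\s*-3'", decoded)
if seq:
    record["Sequence"] = seq.group(1).replace(" ", "").lower()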
Related
I have the following test_string from which I need to obtain the actual URL.
Test string (partly shown):
An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Desired output for part of test_string
https://www.sciencedirect.com/science/article/pii/S0010218020301346
I have been trying to obtain this with the MWE given below applied to many strings, but it gives only one URL.
MWE
from urllib.parse import urlparse, parse_qs
import re
from re import search
test_string = '''
Production, Properties, and Applications of ALPHA-Terpineol
<http://scholar.google.com/scholar_url?url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&hl=en&sa=X&d=12771069332921982368&scisig=AAGBfm1tFjLUm7GV1DRnuYCzvR4uGWq9Cg&nossl=1&oi=scholaralrt&hist=v2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
A Sales, L de Oliveira Felipe, JL Bicas
Abstract ALPHA-Terpineol (CAS No. 98-55-5) is a tertiary monoterpenoid
alcohol widely
and commonly used in the flavors and fragrances industry for its sensory
properties.
It is present in different natural sources, but its production is mostly
based on ...
Save
<http://scholar.google.com/citations?update_op=email_library_add&info=oB2z7uTzO7EJ&citsig=AMD79ooAAAAAYLfmix3sQyUWnFrHeKYZxuK31qlqlbCh&hl=en>
Twitter
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=tw&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g>
Facebook
<http://scholar.google.com/scholar_share?hl=en&oi=scholaralrt&ss=fb&url=https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf&rt=Production,+Properties,+and+Applications+of+%CE%B1-Terpineol&scisig=AAGBfm0yXFStqItd97MUyPT5nRKLjPIK6g>
An experimental and modeling study of autoignition characteristics of
butanol/diesel blends over wide temperature ranges
<http://scholar.google.com/scholar_url?url=3Dhttps://www.sciencedirect.com/=
science/article/pii/S0010218020301346&hl=3Den&sa=3DX&d=3D448628313728630325=
1&scisig=3DAAGBfm26Wh2koXdeGZkQxzZbenQYFPytLQ&nossl=3D1&oi=3Dscholaralrt&hi=
st=3Dv2Y_3P0AAAAJ:17949955323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
Y Qiu, W Zhou, Y Feng, S Wang, L Yu, Z Wu, Y Mao=E2=80=A6 - Combustion and =
Flame,
2020
Butanol/diesel blend is considered as a very promising alternative fuel
with
agreeable combustion and emission performance in engines. This paper
intends to
further investigate its autoignition characteristics with the combination
of a heated =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3DE=
27Gd756Qj4J&citsig=3DAMD79ooAAAAAYImDxwWCwd5S5xIogWp9RTavFRMtTDgS&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://www.sciencedirect.com/science/article/pii/S0010218020301346&rt=
=3DAn+experimental+and+modeling+study+of+autoignition+characteristics+of+bu=
tanol/diesel+blends+over+wide+temperature+ranges&scisig=3DAAGBfm19DOLNm3-Fl=
WaO0trAxZkeidxYWg>
Using NMR spectroscopy to investigate the role played by copper in prion
diseases.
<http://scholar.google.com/scholar_url?url=3Dhttps://europepmc.org/article/=
med/32328835&hl=3Den&sa=3DX&d=3D16122276072657817806&scisig=3DAAGBfm1AE6Kyl=
jWO1k0f7oBnKFClEzhTMg&nossl=3D1&oi=3Dscholaralrt&hist=3Dv2Y_3P0AAAAJ:179499=
55323429043383:AAGBfm1nUe-t2q_4mKFiHSHFEAo0A4rRSA>
RA Alsiary, M Alghrably, A Saoudi, S Al-Ghamdi=E2=80=A6 - =E2=80=A6 and of =
the Italian
Society of =E2=80=A6, 2020
Prion diseases are a group of rare neurodegenerative disorders that develop
as a
result of the conformational conversion of normal prion protein (PrPC) to
the disease-
associated isoform (PrPSc). The mechanism that actually causes disease
remains =E2=80=A6
[image: Save]
<http://scholar.google.com/citations?update_op=3Demail_library_add&info=3Dz=
pCMKavUvd8J&citsig=3DAMD79ooAAAAAYImDx3r4gltEWBAkhl0g2POsXB9Qn4Lk&hl=3Den>
[image:
Twitter]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dtw&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
[image:
Facebook]
<http://scholar.google.com/scholar_share?hl=3Den&oi=3Dscholaralrt&ss=3Dfb&u=
rl=3Dhttps://europepmc.org/article/med/32328835&rt=3DUsing+NMR+spectroscopy=
+to+investigate+the+role+played+by+copper+in+prion+diseases.&scisig=3DAAGBf=
m1RidyRD-x2FOemP6iqCsr-6GAVKA>
'''
regex = re.compile('(http://scholar.*?)&')
url_all = regex.findall(test_string)
citation_url = []
for i in url_all:
    if search('scholar.google.com', i):
        qs = list(parse_qs(urlparse(i).query).values())
        if qs and search('http', str(qs[0])):
            citation_url.append(qs[0])
print(citation_url)
Present output
https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
Desired output
https://link.springer.com/content/pdf/10.1007/s11947-020-02461-6.pdf
https://www.sciencedirect.com/science/article/pii/S0010218020301346
https://europepmc.org/article/med/32328835
How do I handle the URL text wrapping with equals signs and extract the redirect URL in Python?
You could match either a question mark or an ampersand using the character class [&?]. Looking at the example data, for the url= part you can add optional newlines and an optional equals sign to absorb the soft line wraps, and adjust accordingly.
Some URLs start with 3D; you can make that part optional using a non-capturing group (?:3D)?.
Then capture in group 1, matching http followed by all characters except &:
\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)
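For reference, a quick sketch applying that pattern to the test string and then undoing the quoted-printable soft wraps; the clean-up replacements are assumptions based on the sample data:
import re

pattern = re.compile(
    r"\bhttp://scholar\.google\.com.*?[&?]\n?u=?\n?r\n?l\n?=(?:3D)?(http[^&]+)")
for m in pattern.finditer(test_string):
    # Undo the soft line breaks ("=\n") and quoted-printable "=3D" escapes.
    # Note: the share links repeat each article URL, so duplicates can appear.
    url = m.group(1).replace("=\n", "").replace("=3D", "=")
    print(url)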
This regex pattern may also help to extract the redirect URI:
(http:\/\/scholar[\w.\/=&?]*)[?]?u[=]?rl=([\w\:.\/\-=]+)
Also see the example here: https://regex101.com/r/dmkF3h/3
I have got the first 50 rows from the table but can't seem to get them all:
url='https://definitivehc.maps.arcgis.com/home/item.html?id=1044bb19da8d4dbfb6a96eb1b4ebf629&view=list&showFilters=false#data'
browser = webdriver.Chrome(r"C:\Users\scrape\chromedriver")
browser.get(url)
time.sleep(25)
rows_in_table = browser.find_elements_by_xpath('//table[@class="dgrid-row-table"]//tr[td]')
for element in rows_in_table:
    print(element.text.replace('\n', '||'))
Result (snippet of the first 50 data points):
Atmore Community Hospital||Short Term Acute Care Hospital||Atmore||AL||36502||Escambia||Alabama||01||053||01053||51||49||6||6||0.00||0.36||2||2.00||46
Gadsden Regional Medical Center||Short Term Acute Care Hospital||Gadsden||AL||35903||Etowah||Alabama||01||055||01055||346||222||40||40||0.00||0.73||124||8.00||47
Riverview Regional Medical Center||Short Term Acute Care Hospital||Gadsden||AL||35901||Etowah||Alabama||01||055||01055||281||256||45||45||0.00||0.26||25||6.00||48
Fayette Medical Center (FKA Weimer Medical Center)||Short Term Acute Care Hospital||Fayette||AL||35555||Fayette||Alabama||01||057||01057||61||45||8||8||0.00||0.23||16||2.00||49
Russellville Hospital||Short Term Acute Care Hospital||Russellville||AL||35653||Franklin||Alabama||01||059||01059||100||49||7||7||0.00||0.40||51||2.00||50
Thanks for your help.
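For what it's worth, a hedged sketch of one way the remaining rows might be loaded: dgrid-style tables usually render rows on demand as you scroll, so you can scroll the last visible row into view and re-collect until the row count stops growing. The scrolling trigger and the waits are assumptions, and this reuses browser and time from the snippet above:
seen = 0
while True:
    rows = browser.find_elements_by_xpath('//table[@class="dgrid-row-table"]//tr[td]')
    if len(rows) == seen:
        break                       # no new rows appeared; assume the table is fully loaded
    seen = len(rows)
    browser.execute_script("arguments[0].scrollIntoView();", rows[-1])   # trigger lazy loading
    time.sleep(2)

for element in rows:
    print(element.text.replace('\n', '||'))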
I'm trying to display my data in a Dash DataTable, but I get this error:
Invalid argument `data` passed into DataTable.
Expected an array.
Was supplied type `object`.
My data (Amazon comments):
[
{
"ratings": {
"\n 5 star\n ": "\n ",
"\n 4 star\n ": "\n ",
"\n 3 star\n ": "\n ",
"\n 2 star\n ": "\n ",
"\n 1 star\n ": "\n "
},
"reviews": [
{
"review_text": "The \"MPD Digital (TM) USA Made Ham CB Radio GMRS Repeater Transmission MILSPEC M17/ 163A RG-213/U (RG8/U) Coaxial Cable with UHF PL259 Connectors, 12 inches \", I having two products of like nature from MPD, the older of the two being lightly used between transmitter and SWR meter,... to being placed in storage when not in use, is a disappointing, if not marginal product given the high level of promoting by MPD (Times Microwave) relative to product component selection, product design, to manufacturing processes and procedures. i) The advertised description for this coax cable is a contradiction. This RF related product is either an RG-213/U or RG-8/U and not both types of RF transmission cable simultaneously. When measured at 50MHz, the Velocity of Propagation (VP) for Foam High Density Polyethylene (FHDPE) dielectric RG-213/U is 66% with 1.5dB loss/100ft where as FHDPE dielectric RG-8/U, the VP is 85% with 1.2dB loss/100ft. ii) The cable ends do not use high grade Amphenol, PL-259 connectors as they did in the past for this product line, the MPD cable received most recently from Kimberly Distribution, being produced with cheap and short (function of cost reducing engineering to basic disrespect, presuming the end user will not know the difference), Chinese made PL-259 connectors (see photo) and per such, have a smaller mating surface with the coax cable, thus less robust attachment. iii) The cable markings for product type, e.g. RG-213/U, are not of a permanent nature, the printing being neither molded into the cable outer jacket or laser burned, the label wearing off with ease from cleaning or use (see photo). Because of such low persistent printing method, less I check the purchase order or some other user is tasked, cannot properly determine with any great assurance, cable type (RF properties), thus VP (velocity of propagation), signal loss, SWR,... As a side note, since the printing on the outer jacket of the coax cable is easy to remove (not permanent), leaving no other identifiers present (mystery cable), that this electrical product maybe considered contraband per contract, code and insurance underwriters for applications in which UL standards compliant electrical components e.g. DoD, Hospitals, Schools, Civil Defense, Fire and Police,... are required or mandated. iv) The 12\" inch (1' ft) coax cables by MPD (Times Microwave) from current to past are not consistent in overall length as seen from simple side by side compassion (see photo and arrogantly rotated by disrespectful Amazon). There seeming to be some manufacturing (procedural and process controls) confusion at MPD relative to how to make consistent measurements (cuts) of the product e.g. is the coax conductor measured to 12\" or the entire path length (connector end point to connector end point) is measured to 12\". The MPD (Time Microwave) cable with contradicting advertising along with low end build quality is an indication that the leadership at MPD has prioritized cutting corners on raw component selection and manufacturing methods for the sake of cost reduction and or higher margins and earnings, as appose to being driven to manufacture a high end and excellent analytical product for discriminating customers. Remember: \"do a job, big or small, do it right, or not at all\". 
Less processes and procedures are adjusted and rigorously adhered to by Times Microwave (MPD), do not recommend this coax cable, from contradicting description thus EE (Electrical Engineering) properties, to raw component selection thus ultimately integrated build quality to product assurance and type, for demanding and disciplined RF applications. Minus 0.25 for inconsistent process and procedure for measuring total cable path length. Minus 0.25 for contradicting product description, thus electrical (RF) properties of the coax cable. Minus 0.50 for using Chinese made PL-259 connectors rather than high grade Amphenol connectors like in the past. Minus 1.0 for having easy to remove, non permanent type markings (printing) on the transmission coax cable jacket. Park McGraw JPL, Spacecraft Soldering Course Certified Experimental Physicist, Former US Navy, NASA Fellow Former CEO Class C Electrical Contracting Firm, Life Safety Industry Former Instructor, Basic Electricity and Electronics, University of Hawai`i Hilo Former Member Technical Staff and Process Engineering Mgr, Laser and Sensor Products Center, Northrop Grumman (Space Park)",
"review_posted_date": None,
"review_header": "Uses Chinese Made PL-259 Connectors, Cable Type Printing Wears Off Easily, Contradicting Description",
"review_rating": "3.0 ",
"review_author": "Directed Energy"
},
{
"review_text": "I ordered this coax for Amateur Radio use. RG-213 is rugged coax, much better than many of the air dielectric cables which are actually somewhat fragile. It is double shielded. The connectors are first quality and their installation appears first class. The price is very competitive. I have no connection with browning but I have come to respect their quality products and the value they provide. I have several of their antenna mounts and they are by far the best I have ever seen. In my 40+ years as a ham I learned long ago you buy the best because it lasts. I will confirm the specs on the cable and post any issues here when I get a chance, but I can tell from the packaging, connected quality, weight, and feel the manufacturer intended to deliver a first class product and spared no expense in doing so. I have bought other coax on Amazon but this was the most impressive so far re packaging. I never buy coax on ebay, been burned too often. This piece is intended to allow me to run my qrp cw rig by the pool but I may someday want to run higher power or use it for another antenna run to the shack. I probably own 3000 feet of coax installed at various sites but had none to spare nearby, so it was either gas or coax. The coax arrived in a sturdy box, it was sealed in moisture proof plastic bagging, and uniformly coiled and bound with quality cable ties. It's obvious the manufacturer takes good care of their inventory and this assembly is not made in someone's garage! W7CCE",
"review_posted_date": None,
"review_header": "Excellent quality coax, connectors",
"review_rating": "5.0 ",
"review_author": "Merciless"
},
{
"review_text": "The coax was working well for six months or so. When I recently unscrewed the PL-259 from my radio, the entire connector came off the coax. The crimp sleeve slid off and the center conductor came completely out of the connector. There was solder on the inner pin of the connector, but it never reached the wire itself, so only the crimp sleeve was holding the wire in, and apparently the crimp sleeve wasn't crimped very well.",
"review_posted_date": None,
"review_header": "Connector Failed",
"review_rating": "1.0 ",
"review_author": "RichG"
}
],
"url": "http://www.amazon.com/dp/product-reviews/B00Y7H39IW/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=14",
"name": "MPD Digital USA Made Ham CB Radio GMRS Repeater Transmission MILSPEC M17/ 163A RG-213/U (RG8/U) Coaxial Cable with Soldered Silver UHF PL-259 Connectors, 12 inches",
"price": "$14.99"
}]
My code:
external_stylesheets = ['https://codepen.io/chriddyp/pen/bWLwgP.css']

app = dash.Dash(__name__, external_stylesheets=external_stylesheets)

app.layout = html.Div([
    dcc.Upload(
        id='upload-data',
        children=html.Div([
            'Drag and Drop or ',
            html.A('Select Files')
        ]),
        style={
            'width': '20%',
            'height': '40px',
            'lineHeight': '60px',
            'borderWidth': '1px',
            'borderStyle': 'dashed',
            'borderRadius': '5px',
            'textAlign': 'center',
            'margin': '10px'
        },
        # Allow multiple files to be uploaded
        multiple=True
    ),
    html.Div(id='output-data-upload'),
])

def update_output(contents, filename):
    .
    .
    .
    dt = pd.DataFrame(data_extract)
    return html.Div([
        html.H5(filename),
        dash_table.DataTable(
            style_data={
                'whiteSpace': 'normal',
                'height': 'auto'
            },
            data=dt.to_dict('list'),
            columns=[{'name': i, 'id': i} for i in dt.columns]
        )
    ])

@app.callback(Output('output-data-upload', 'children'),
              [Input('upload-data', 'contents')],
              [State('upload-data', 'filename')])
def parse_contents(list_of_contents, list_of_names):
    if list_of_contents is not None:
        children = [
            update_output(c, n) for c, n in
            zip(list_of_contents, list_of_names)]
        return children

if __name__ == '__main__':
    app.run_server(debug=True)
I upload a CSV file to generate data_extract in the update_output() function, which I want to display in my DataTable.
I convert data_extract into a pandas DataFrame, then try to pass my data with data=dt.to_dict('list'). I also tried different arguments.
I guess you only need the review part of the data? If so, and dt is that part of the JSON, you can use
data=dt.to_dict('records'),
because your structure is [{column -> value}, … , {column -> value}]; see the pandas reference.
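A small illustrative example (not from the original post) of the difference between the two orientations:
import pandas as pd

df = pd.DataFrame([{'review_header': 'Connector Failed', 'review_rating': '1.0'}])
df.to_dict('list')      # {'review_header': ['Connector Failed'], ...} -- a dict, which DataTable rejects
df.to_dict('records')   # [{'review_header': 'Connector Failed', 'review_rating': '1.0'}] -- a list of row dicts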
Given are two lists of strings.
One contains the names of organisations (mostly universities) from all around the world, not only written in English but always using the Latin alphabet.
The other list contains mostly full addresses in which strings (organisations) from the first list may occur.
An Example:
addresses = [
    "Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
    "Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076",
    "Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
    "Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
    "Computer Science Department, University of California, Santa Barbara, USA 93106",
    "Fraunhofer IAIS, Sankt Augustin, Germany",
    "Department of Computer Science, Cornell University, Ithaca, NY",
    "University of Wisconsin-Madison"
]
organisations = [
    "Catholic University of Leuven",
    "Fraunhofer IAIS",
    "Cornell University of Ithaca",
    "Tübingener Max Plank Institut"
]
As you can see the desired mapping would be:
"Department of Computer Science, Katholieke Universiteit Leuven, Leuven, Belgium",
--> Catholic University of Leuven
"Machine Learning and Computational Biology Research Group, Max Planck Institutes Tübingen, Tübingen, Germany 72076",
--> Max Plank Institut Tübingen
"Department of Computer Science and Engineering, University of Washington, Seattle, USA 98185",
--> --
"Knowledge Discovery Department, Fraunhofer IAIS, Sankt Augustin, Germany 53754",
--> Fraunhofer IAIS
"Computer Science Department, University of California, Santa Barbara, USA 93106",
"Fraunhofer IAIS, Sankt Augustin, Germany",
--> Fraunhofer IAIS
"Department of Computer Science, Cornell University, Ithaca, NY"
--> "Cornell University of Ithaca",
"University of Wisconsin-Madison",
--> --
My thinking was to use some kind of distance algorithm to calculate the similarity of the strings, since I cannot just look for an organisation string inside an address with a plain "in" check: it could be written slightly differently in different places. So my first guess was the difflib module, especially the difflib.get_close_matches() function, to select for every address the closest string from the organisations list. But I am not confident that the results will be accurate enough, and I don't know how high I should set the ratio, which seems to be a similarity measure.
Before spending too much time on the difflib module, I thought I would ask the more experienced people here whether this is the right approach, or whether there is a tool or method better suited to my problem. Thanks!
PS: I don't need an optimal solution.
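For reference, the difflib idea from the question could be tried like this (the cutoff value here is only a guess and would need tuning):
import difflib

for address in addresses:
    # n=1 returns at most the single closest organisation above the cutoff ratio (0.0-1.0)
    match = difflib.get_close_matches(address, organisations, n=1, cutoff=0.4)
    print(address, '-->', match[0] if match else '--')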
Use the following as your string distance function (instead of plain levenshtein distance):
def strdist(s1, s2):
    # Compare only "significant" words (longer than 3 characters).
    words1 = set(w for w in s1.split() if len(w) > 3)
    words2 = set(w for w in s2.split() if len(w) > 3)
    # For every word in s1, the edit distance to its closest word in s2;
    # `levenshtein` is any edit-distance function, e.g. from the python-Levenshtein package.
    scores = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    # Count near-matches (edit distance <= 3): more shared words means a smaller distance.
    n_shared_words = len([s for s in scores if s <= 3])
    return -n_shared_words
Then use the Munkres assignment algorithm, since there appears to be a 1:1 mapping between organisations and addresses.
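A minimal sketch of that assignment step, assuming the strdist function above and using scipy's implementation of the Munkres/Hungarian algorithm:
import numpy as np
from scipy.optimize import linear_sum_assignment

cost = np.array([[strdist(a, o) for o in organisations] for a in addresses])
rows, cols = linear_sum_assignment(cost)    # minimises the total distance
for r, c in zip(rows, cols):
    if cost[r, c] < 0:                      # at least one shared word
        print(addresses[r], '-->', organisations[c])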
You can use Soundex or Metaphone to translate each name into a list of phonemes, then compare the most similar lists.
Here is a Python implementation of the double-metaphone algorithm.
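As a small illustration (using the jellyfish library, an assumption on my part; any Soundex/Metaphone implementation would do), phonetic codes make differently spelled variants comparable word by word:
import jellyfish

for word in "Katholieke Universiteit Leuven".split():
    # Compare these codes against the codes of the organisation names.
    print(word, jellyfish.metaphone(word), jellyfish.soundex(word))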
Most web applications have a Location field, in which users may enter a location of their choice.
How would you classify users into different countries, based on the location entered?
For example, I used the Stack Overflow dump of users.xml and extracted users' names, reputation, and location:
['Jeff Atwood', '12853', 'El Cerrito, CA']
['Jarrod Dixon', '1114', 'Morganton, NC']
['Sneakers OToole', '200', 'Unknown']
['Greg Hurlman', '5327', 'Halfway between the boardwalk and Six Flags, NJ']
['Power-coder', '812', 'Burlington, Ontario, Canada']
['Chris Jester-Young', '16509', 'Durham, NC']
['Teifion', '7024', 'Wales']
['Grant', '3333', 'Georgia']
['TimM', '133', 'Alabama']
['Leon Bambrick', '2450', 'Australia']
['Coincoin', '3801', 'Montreal']
['Tom Grochowicz', '125', 'NJ']
['Rex M', '12822', 'US']
['Dillie-O', '7109', 'Prescott, AZ']
['Pete', '653', 'Reynoldsburg, OH']
['Nick Berardi', '9762', 'Phoenixville, PA']
['Kandis', '39', '']
['Shawn', '4248', 'philadelphia']
['Yaakov Ellis', '3651', 'Israel']
['redwards', '21', 'US']
['Dave Ward', '4831', 'Atlanta']
['Liron Yahdav', '527', 'San Rafael, CA']
['Geoff Dalgas', '648', 'Corvallis, OR']
['Kevin Dente', '1619', 'Oakland, CA']
['Tom', '3316', '']
['denny', '573', 'Winchester, VA']
['Karl Seguin', '4195', 'Ottawa']
['Bob', '4652', 'US']
['saniul', '2352', 'London, UK']
['saint_groceon', '1087', 'Houston, TX']
['Tim Boland', '192', 'Cincinnati Ohio']
['Darren Kopp', '5807', 'Woods Cross, UT']
using the following Python script:
from xml.etree import ElementTree

root = ElementTree.parse('SO Export/so-export-2009-05/users.xml').getroot()
items = ['DisplayName', 'Reputation', 'Location']

def loop1():
    for count, i in enumerate(root):
        det = [i.get(x) for x in items]
        print(det)
        if count > 30:
            break

loop1()
What is the simplest way to classify people into different countries? Are there any ready lookup tables available that provide me an output saying X location belongs to Y country?
The lookup table need not be totally accurate. Reasonably accurate answers are obtained by querying the location string on Google, or better still, Wolfram Alpha.
Your best bet is to use a geocoding API like geopy (some examples).
The Google Geocoding API, for example, will return the country in the CountryNameCode field of the response.
With just this one location field the number of false matches will probably be relatively high, but maybe it is good enough.
If you have server logs, you could also try looking up the user's IP address with an IP geocoder (more information and pointers on Wikipedia).
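A hedged sketch of the geopy route with today's API (the original answer predates it; Nominatim and its addressdetails flag are the assumptions here):
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="country-classifier-demo")
location = geolocator.geocode("El Cerrito, CA", addressdetails=True)
if location is not None:
    print(location.raw["address"].get("country"))    # e.g. "United States"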
Force users to specify country, because you'll have to deal with ambiguities. This would be the right way.
If that's not possible, at least make your best guess in conjunction with their IP address.
For example, ['Grant', '3333', 'Georgia']
Is this Georgia, USA?
Or is this the Republic of Georgia?
If their IP address suggests somewhere in Central Asia or Eastern Europe, then chances are it's the Republic of Georgia. If it's North America, chances are pretty good they mean Georgia, USA.
Note that IP-address-to-country mappings aren't 100% accurate, and the database needs to be updated regularly. In my opinion, far too much trouble.