How to scrape fixed-width files in Python?

In Python 3 I have a series of links to "fixed-width files". They are websites with public information about companies; each line holds information about one company.
Example links:
http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
and
http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214RO
I have these links in a dictionary. The key is the name of the region of the country in which the companies are and the value is the link
for chave, valor in dict_val.items():
    print(f'Region of country: {chave} - and link with information: {valor}')
Region of country: Acre - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
Region of country: Espírito Santo - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214ES
...
I want to read these links (fixed-width files) and save the content to a CSV file. Example content:
0107397388000155ASSOCIACAO CULTURAL
02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA
0101904573000102ABREU E SILVA COMERCIO DE MEDICAMENTOS LTDA-ME - ME
02019045730001022 49JETEBERSON OLIVEIRA DE ABREU
02019045730001022 49LUZINETE SANTOS DA SILVA ABREU
0101668652000161CONSELHO ESCOLAR DA ESCOLA ULISSES GUIMARAES
02016686520001612 10REGINA CLAUDIA RAMOS DA SILVA PESSOA
0101631137000107FORTERM * REPRESENTACOES E COMERCIO LTDA
02016311370001072 49ANTONIO MARCOS GONCALVES
02016311370001072 22IVANEIDE BERNARDO DE MENEZES
But to fill the rows of the CSV columns I need to split and test each line of these "fixed-width files".
I must follow rules like these:
1. If the line begins with "01", it is a line with the company's registration number and its name. Example: "0107397388000155ASSOCIACAO CULTURAL"
1.1 - The "01" indicates this
1.2 - The next 14 positions on the line are the company code - starts at position 3 and ends at 16 - (07397388000155)
1.3 - The following 150 positions are the company name - starts at position 17 and ends at 166 - (ASSOCIACAO CULTURAL)
and
2. If the line starts with "02", it has information about the partners of the company. Example: "02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA"
2.1 - The "02" indicates this
2.2 - The next fourteen positions are the company registration code - starts at position 3 and ends at 16 - (07397388000155)
2.3 - The next number is a member identifier code, which can be 1, 2 or 3 - starts and ends at position 17 - (2)
2.4 - The next fourteen positions are another code identifying the member - starts at position 18 and ends at 31 - ("" - in this case it is empty)
2.5 - The next two positions are another code identifying the member - starts at position 32 and ends at 33 - (16)
2.6 - And the 150 final positions are the name of the partner - starts at position 34 and ends at 183 - (MARIA DO SOCORRO RODRIGUES ALVES BRAGA)
In this case, would one possible strategy be to save each link as a TXT file and then try to separate the positions?
Or is there a better way to parse a fixed-width file?

You can take a look at any of the URL-fetching modules. I recommend Requests, although you can use urllib, which comes bundled with Python.
With that in mind, you can get the text from the page, and seeing as it doesn't require a login of any form, with requests it would simply be a matter of:
import requests
r = requests.get('Your link from receita.fazenda.gov.br')
page_text = r.text
Read more in the Quickstart section of requests. I'll leave the 'position-separating' to you.
Hint: Use regex.
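For the position-separating step, here is a minimal sketch using plain string slicing per the rules in the question (an alternative to regex; the URL and CSV filename are placeholders):
import csv
import requests

r = requests.get('Your link from receita.fazenda.gov.br')  # placeholder URL
with open('empresas.csv', 'w', newline='') as out:  # placeholder output name
    writer = csv.writer(out)
    for line in r.text.splitlines():
        if line.startswith('01'):
            # positions 3-16: company code; positions 17-166: company name
            writer.writerow(['01', line[2:16], line[16:166].strip()])
        elif line.startswith('02'):
            # positions 3-16: company code; 17: member type; 18-31: member code;
            # 32-33: qualifier code; 34-183: partner name
            writer.writerow(['02', line[2:16], line[16:17], line[17:31].strip(),
                             line[31:33], line[33:183].strip()])
Remember that 1-based positions in the rules become 0-based slices in Python, so "starts at 3, ends at 16" is line[2:16].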

Using scrapy it's possible to read the content from the link as a stream and process it without saving it to a file. Documentation for scrapy is here.
There's also a related question here: How do you open a file stream for reading using Scrapy?
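As a rough sketch of that idea (the spider name is arbitrary, only the "01" company lines are handled, and the slicing follows the rules from the question):
import scrapy

class FixedWidthSpider(scrapy.Spider):
    name = 'fixedwidth'  # placeholder name
    start_urls = ['http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC']

    def parse(self, response):
        for line in response.text.splitlines():
            if line.startswith('01'):
                # company record: code at positions 3-16, name at 17-166
                yield {'cnpj': line[2:16], 'name': line[16:166].strip()}
You could run it with scrapy runspider and the -o flag to get a CSV directly.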

Related

Why does "IndexError: list index out of range" appear? I did some tests but still can't find out

I'm working on a simple Python project to practice. I'm trying to retrieve data from a file and run a test on a value.
In my case I retrieve the data as a table from a file, and I test the last value of each row; if the test passes, I add the whole line to another file.
Here my data
AE300812 AFROUKH HAMZA 21 admis
AE400928 VIEGO SAN 22 refuse
AE400599 IBN KHYAT mohammed 22 admis
B305050 BOUNNEDI SALEM 39 refuse
Here is my code:
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
contenu = fichier.read()
tab = contenu.split()
for i in range(0,len(tab),5):
    if tab[i+4]=="admis":
        fichier2.write(tab[i]+" "+tab[i+1]+" "+tab[i+2]+" "+tab[i+3]+" "+tab[i+4]+" "+"\n")
fichier.close()
And here is the error:
if tab[i+4]=="admis":
IndexError: list index out of range
You look at tab[i+4], so you have to make sure you stop the loop before that, e.g. with range(0, len(tab)-4, 5). The step=5 alone does not guarantee that you have a full "block" of 5 elements left.
But why does this occur, since each of the lines has 5 elements? They don't! Notice how one line has 6 elements (maybe a double name?), so if you just read and then split, you will get out of sync with the lines. Better to iterate over the lines, and then split each line individually. Also, the actual separator seems to be either a tab \t or double spaces; it's not entirely clear from your data. Just split() will split at any whitespace.
Something like this (not tested):
fichier = open("concours.txt","r")
fichier2 = open("admis.txt","w")
for line in fichier:
    tab = line.strip().split("  ")  # actual separator seems to be tab or double-space
    if tab[4]=="admis":
        fichier2.write(tab[0]+" "+tab[1]+" "+tab[2]+" "+tab[3]+" "+tab[4]+" "+"\n")
fichier.close()
fichier2.close()
Depending on what you actually want to do, you might also try this:
with open("concours.txt","r") as fichier, open("admis.txt","w") as fichier2:
for line in fichier:
if line.strip().endswith("admis"):
fichier2.write(line)
This should just copy the admis lines to the second file, with the original double-space separator.

Parsing a GenBank file format with biopython's SeqIO

I'm trying to parse a protein GenBank file format. Here's an example file (example.protein.gpff):
LOCUS NP_001346895 208 aa linear PRI 20-JAN-2018
DEFINITION intercellular adhesion molecule 2 precursor [Cercocebus atys].
ACCESSION NP_001346895
VERSION NP_001346895.1
DBSOURCE REFSEQ: accession NM_001359966.1
KEYWORDS RefSeq.
SOURCE Cercocebus atys (sooty mangabey)
ORGANISM Cercocebus atys
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Cercopithecidae; Cercopithecinae; Cercocebus.
REFERENCE 1 (residues 1 to 208)
AUTHORS Palesch D, Bosinger SE, Tharp GK, Vanderford TH, Paiardini M,
Chahroudi A, Johnson ZP, Kirchhoff F, Hahn BH, Norgren RB, Patel
NB, Sodora DL, Dawoud RA, Stewart CB, Seepo SM, Harris RA, Liu Y,
Raveendran M, Han Y, English A, Thomas GWC, Hahn MW, Pipes L, Mason
CE, Muzny DM, Gibbs RA, Sauter D, Worley K, Rogers J and Silvestri
G.
TITLE Sooty mangabey genome sequence provides insight into AIDS
resistance in a natural SIV host
JOURNAL Nature 553 (7686), 77-81 (2018)
PUBMED 29300007
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
KY308194.1.
##Evidence-Data-START##
Transcript exon combination :: KY308194.1 [ECO:0000332]
RNAseq introns :: single sample supports all introns
SAMN02045730, SAMN03085078
[ECO:0000348]
##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..208
/organism="Cercocebus atys"
/db_xref="taxon:9531"
Protein 1..208
/product="intercellular adhesion molecule 2 precursor"
/calculated_mol_wt=21138
sig_peptide 1..19
/inference="COORDINATES: ab initio prediction:SignalP:4.0"
/calculated_mol_wt=1999
Region 24..109
/region_name="ICAM_N"
/note="Intercellular adhesion molecule (ICAM), N-terminal
domain; pfam03921"
/db_xref="CDD:252248"
Region 112..>167
/region_name="Ig"
/note="Immunoglobulin domain; cl11960"
/db_xref="CDD:325142"
CDS 1..208
/gene="ICAM2"
/coded_by="NM_001359966.1:1..627"
/db_xref="GeneID:105590766"
ORIGIN
1 mssfgfgtlt malfalvccs gsdekafevh mrleklivkp kesfevncst tcnqpevggl
61 etslnkilll eqtqwkhyli snishdtvlw chftcsgkqk smssnvsvyq pprqvfltlq
121 ptwvavgksf tiecrvpave pldsltlsll rgsetlhsqt frkaapalpv lrelgmkfiq
181 lcprrglagt mppsrpwcpa athwsqgc
//
LOCUS NP_001280013 406 aa linear MAM 22-JAN-2018
DEFINITION 26S proteasome regulatory subunit 8 [Dasypus novemcinctus].
ACCESSION NP_001280013 XP_004456848
VERSION NP_001280013.1
DBSOURCE REFSEQ: accession NM_001293084.1
KEYWORDS RefSeq.
SOURCE Dasypus novemcinctus (nine-banded armadillo)
ORGANISM Dasypus novemcinctus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Xenarthra; Cingulata; Dasypodidae; Dasypus.
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
AAGV03083693.1.
On May 28, 2014 this sequence version replaced XP_004456848.1.
Sequence Note: The RefSeq transcript and protein were derived from
genomic sequence to make the sequence consistent with the reference
genome assembly. The genomic coordinates used for the transcript
record were based on alignments.
##Evidence-Data-START##
RNAseq introns :: mixed/partial sample support SAMN00668203,
SAMN00744121 [ECO:0000350]
##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..406
/organism="Dasypus novemcinctus"
/db_xref="taxon:9361"
Protein 1..406
/product="26S proteasome regulatory subunit 8"
/calculated_mol_wt=45495
Region 4..404
/region_name="RPT1"
/note="ATP-dependent 26S proteasome regulatory subunit
[Posttranslational modification, protein turnover,
chaperones]; COG1222"
/db_xref="CDD:224143"
CDS 1..406
/gene="PSMC5"
/coded_by="NM_001293084.1:1..1221"
/db_xref="GeneID:101445299"
ORIGIN
1 maldgpeqme leegkagsgl rqyylskiee lqlivndksq nlrrlqaqrn elnakvrllr
61 eelqllqeqg syvgevvram dkkkvlvkvh pegkfvvdvd knidindvtp ncrvalrnds
121 ytlhkilpnk vdplvslmmv ekvpdstyem iggldkqike ikevielpvk hpelfealgi
181 aqpkgvllyg ppgtgktlla ravahhtdct firvsgselv qkfigegarm vrelfvmare
241 hapsiifmde idsigssrle ggsggdsevq rtmlellnql dgfeatknik vimatnridi
301 ldsallrpgr idrkiefppp neearldilk ihsrkmnltr ginlrkiael mpgasgaevk
361 gvcteagmya lrerrvhvtq edfemavakv mqkdseknms ikklwk
//
The format has repeating records (separated by //), where each record is a protein. Each record has several sections, among them a FEATURES section with several fixed fields, such as source, CDS, and Region, whose values hold information specific to that record.
I'm interested in using biopython's SeqIO to parse this file into a dataframe which lists, for each record ID, the values of gene, db_xref, and coded_by from its CDS field, the organism and db_xref values from its source field, and the db_xref value from its Region field. The CDS and source fields appear only once in the FEATURES section of a record, while the Region field may appear several times.
My unsuccessful attempt so far looks like this:
from Bio import SeqIO
import re

filename = "example.protein.gpff"
for record in SeqIO.parse(filename, "genbank"):
    for feature in record.features:
        if feature.type == "CDS":
            symbol = feature.qualifiers.get("gene", ["???"])[0]
            gene_id = feature.qualifiers.get("db_xref", ["???"])[0]
            gene_id = re.sub('GeneID:', '', gene_id)
            transcript_id = feature.qualifiers.get("coded_by", ["???"])[0]
            transcript_id = re.sub(':.*', '', transcript_id)
        if feature.type == "source":
            species_name = feature.qualifiers.get("organism", ["???"])[0]
            species_id = feature.qualifiers.get("db_xref", ["???"])[0]
            species_id = re.sub('taxon:', '', species_id)
        if feature.type == "Region":
            cdd_id = feature.qualifiers.get("db_xref", ["???"])[0]
            cdd_id = re.sub('CDD:', '', cdd_id)
            print("%s,%s,%s,%s,%s,%s,%s" % (record.id, cdd_id, transcript_id, symbol, gene_id, species_name, species_id))
The resulting dataframe I'd like to obtain (for the example.protein.gpff above) is:
record_id CDS_coded_by CDS_db_xref CDS_gene source_organism source_db_xref Region_db_xref
1 NP_001346895 NM_001359966.1:1..627 GeneID:105590766 ICAM2 Cercocebus atys taxon:9531 CDD:252248
2 NP_001346895 NM_001359966.1:1..627 GeneID:105590766 ICAM2 Cercocebus atys taxon:9531 CDD:325142
3 NP_001280013 NM_001293084.1:1..1221 GeneID:101445299 PSMC5 Dasypus novemcinctus taxon:9361 CDD:224143
Check out the genbank-parser library. It accepts a GenBank filename and a batch size; next_batch yields as many records as batch_size specifies.
It seems like the easiest way to deal with this file format is to convert it to JSON (for example, using Bio) and then read it with various JSON parsers (like the rjson package in R, which parses a JSON file to a list of records).
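Building on the attempt above, a minimal sketch that collects the single-occurrence CDS and source qualifiers per record first and then emits one row per Region feature; pandas is assumed for the dataframe, and the "???" fallbacks follow the original code:
import pandas as pd
from Bio import SeqIO

rows = []
for record in SeqIO.parse("example.protein.gpff", "genbank"):
    info = {"record_id": record.id}
    region_xrefs = []
    for feature in record.features:
        if feature.type == "CDS":
            info["CDS_gene"] = feature.qualifiers.get("gene", ["???"])[0]
            info["CDS_db_xref"] = feature.qualifiers.get("db_xref", ["???"])[0]
            info["CDS_coded_by"] = feature.qualifiers.get("coded_by", ["???"])[0]
        elif feature.type == "source":
            info["source_organism"] = feature.qualifiers.get("organism", ["???"])[0]
            info["source_db_xref"] = feature.qualifiers.get("db_xref", ["???"])[0]
        elif feature.type == "Region":
            region_xrefs.append(feature.qualifiers.get("db_xref", ["???"])[0])
    for xref in region_xrefs:  # one output row per Region feature
        rows.append({**info, "Region_db_xref": xref})

df = pd.DataFrame(rows)
print(df)
Collecting all features of a record before emitting rows avoids the problem that CDS appears after Region in the file, so nothing is undefined when the rows are built.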

Replace Single Character in a line of a Text file with Python

I have a text file in which every line currently ends with the same character (N), which is used to track the progress the system makes. I want to change the end character to "Y" in case the program ends via an error or other interruption, so that upon restarting, the program will search until it finds a line whose end character is "N" and begin working from there. Below is my code as well as a sample from the text file.
UPDATED CODE:
def GeoCode():
    f = open("geocodeLongLat.txt", "a")
    with open("CstoGC.txt",'r') as file:
        print("Geocoding...")
        new_lines = []
        for line in file.readlines():
            check = line.split('~')
            print(check)
            if 'N' in check[-1]:
                geolocator = Nominatim()
                dot_number, entry_name, PHY_STREET, PHY_CITY, PHY_STATE, PHY_ZIP = check[0], check[1], check[2], check[3], check[4], check[5]
                address = PHY_STREET + " " + PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                f.write(dot_number + '\n')
                try:
                    location = geolocator.geocode(address)
                    f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                except AttributeError:
                    try:
                        address = PHY_CITY + " " + PHY_STATE + " " + PHY_ZIP
                        location = geolocator.geocode(address)
                        f.write(dot_number + "," + entry_name + "," + str(location.longitude) + "," + str(location.latitude) + "\n")
                    except AttributeError:
                        print("Cannot Geocode")
                check[-1] = check[-1].replace('N','Y')
            new_lines.append('~'.join(check))
    with open('CstoGC.txt','r+') as file:  # IMPORTANT to open in 'r+' mode, as 'w/w+' will truncate your file!
        for line in new_lines:
            file.writelines(line)
    f.close()
Output:
2967377~DARIN COLE~22112 TWP RD 209~ALVADA~OH~44802~Y
WAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
143608~LARRY A PETERSON & DONNA M PETERSON~W6359 450TH AVE~ELLSWORTH~WI~54011~N
635528~JAMES E WEBB~3926 GREEN ROAD~SPRINGFIELD~TN~37172~N
805496~WAYNE MLADY~22272 135TH ST~CRESCO~IA~52136~N
704996~SAVINA C MUNIZ~814 W LA QUINTA DR~PHARR~TX~78577~N
893169~BINDEWALD MAINTENANCE INC~213 CAMDEN DR~SLIDELL~LA~70459~N
948130~LOGISTICIZE LTD~861 E PERRY ST~PAULDING~OH~45879~N
438760~SMOOTH OPERATORS INC~W8861 CREEK ROAD~DARIEN~WI~53114~N
518872~A B C RELOCATION SERVICES INC~12 BOCKES ROAD~HUDSON~NH~03051~N
576143~E B D ENTERPRISES INC~29 ROY ROCHE DRIVE~WINNIPEG~MB~R3C 2E6~N
968264~BRIAN REDDEMANN~706 WESTGOR STREET~STORDEN~MN~56174-0220~N
721468~QUALITY LOGISTICS INC~645 LEONARD RD~DUNCAN~SC~29334~N
As you can see I am already keeping track of which line I am at just by using x. Should I use something like file.readlines()?
Sample of text document:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~N
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~N
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~N
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~N
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~N
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~N
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~N
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
Thank you!
Edit: updated code thanks to @idlehands
There are a few ways to do this.
Option #1
My original thought was to use the tell() and seek() method to go back a few steps but it quickly shows that you cannot do this conveniently when you're not opening the file in bytes and definitely not in a for loop of readlines(). You can see the reference threads here:
Is it possible to modify lines in a file in-place?
How to solve "OSError: telling position disabled by next() call"
The investigation led to this piece of code:
with open('file.txt','rb+') as file:
    line = file.readline() # initiate the loop
    while line: # continue while line is not None
        print(line)
        check = line.split(b'~')[-1]
        if check.startswith(b'N'): # carriage return is expected for each line, strip it
            # ... do stuff ... #
            file.seek(-len(check), 1) # place the buffer at the check point
            file.write(check.replace(b'N', b'Y')) # replace "N" with "Y"
        line = file.readline() # read next line
In the first referenced thread, one of the answers mentions this could lead to potential problems, and directly modifying the bytes in the buffer while reading it is probably considered a bad idea™. A lot of pros will probably scold me for even suggesting it.
Option #2a
(if file size is not horrendously huge)
with open('file.txt','r') as file:
    new_lines = []
    for line in file.readlines():
        check = line.split('~')
        if 'N' in check[-1]:
            # ... do stuff ... #
            check[-1] = check[-1].replace('N','Y')
        new_lines.append('~'.join(check))

with open('file.txt','r+') as file:  # IMPORTANT to open in 'r+' mode, as 'w/w+' will truncate your file!
    for line in new_lines:
        file.writelines(line)
This approach loads all the lines into memory first, so you do the modification in memory but leave the buffer alone. Then you reload the file and write the lines that were changed. The caveat is that technically you are rewriting the entire file line by line - not just the string N even though it was the only thing changed.
Option #2b
Technically you could open the file as r+ mode from the onset and then after the iterations have completed do this (still within the with block but outside of the loop):
# ... new_lines.append('~'.join(check)) #
file.seek(0)
for line in new_lines:
    file.writelines(line)
I'm not sure what distinguishes this from Option #1 since you're still reading and modifying the file in the same go. If someone more proficient in IO/buffer/memory management wants to chime in please do.
The disadvantage of Options 2a/b is that you always end up storing and rewriting every line in the file, even if only a few lines need to be updated from 'N' to 'Y'.
Results (for all solutions):
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~Y
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~Y
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~Y
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~Y
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~Y
And if you were to say, encountered a break at the line starting with 220940, the file would become:
570772~CORPORATE BANK TRANSIT OF KENTUCKY INC~3157 HIGHWAY 64 SUITE 100~EADS~TN~38028~Y
384767~MILLER FARMS TRANS LLC~1103 COURT ST~BEDFORD~IA~50833~Y
986150~R G S TRUCKING LTD~1765 LOMBARDIE DRIVE~QUESNEL~BC~V2J 4A8~Y
1012987~DONALD LARRY KIVETT~4509 LANSBURY RD~GREENSBORO~NC~27406-4509~Y
735308~ALZEY EXPRESS INC~2244 SOUTH GREEN STREET~HENDERSON~KY~42420~Y
870337~RIES FARMS~1613 255TH AVENUE~EARLVILLE~IA~52057~Y
148428~P R MASON & SON LLC~HWY 70 EAST~WILLISTON~NC~28589~Y
220940~TEXAS MOVING CO INC~908 N BOWSER RD~RICHARDSON~TX~75081-2869~N
854042~ARMANDO ORTEGA~6590 CHERIMOYA AVENUE~FONTANA~CA~92337~N
940587~DIAMOND A TRUCKING INC~192285 E COUNTY ROAD 55~HARMON~OK~73832~N
1032455~INTEGRITY EXPRESS LLC~380 OLMSTEAD AVENUE~DEPEW~NY~14043~N
889931~DUNSON INC~33 CR 3581~FLORA VISTA~NM~87415~N
There are pros and cons to these approaches. Try and see which one fits your use case the best.
I would read the entire input file into a list and .pop() the lines off one at a time. In case of an error, put the popped item back in the list and overwrite the input file. This way it will always be up to date and you won't need any other logic.
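A minimal sketch of that idea, assuming the CstoGC.txt layout above; process_line is a hypothetical stand-in for the geocoding work:
def process_line(line):
    # hypothetical placeholder: geocode the record and write the output row
    pass

with open('CstoGC.txt') as f:
    pending = f.readlines()

try:
    while pending:
        line = pending.pop(0)
        process_line(line)
except Exception:
    pending.insert(0, line)  # put the failed line back
finally:
    # the input file now holds only the lines not yet processed
    with open('CstoGC.txt', 'w') as f:
        f.writelines(pending)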

Extracting part of text from a file in Python

I have a collection of text files that are of the form:
Sponsor : U of NC Charlotte
U N C C Station
Charlotte, NC 28223 704/597-2000
NSF Program : 1468 MANUFACTURING MACHINES & EQUIP
Fld Applictn: 0308000 Industrial Technology
56 Engineering-Mechanical
Program Ref : 9146,MANU,
Abstract :
9500390 Patterson This award supports a new concept in precision metrology,
the Extreme Ultraviolet Optics Measuring Machine (EUVOMM). The goals for this
system when used to measure optical surfaces are a diameter range of 250 mm
with a lateral accuracy of 3.3 nm rms, and a depth range of 7.5 mm w
there's more text above and below the snippet. I want to be able to do the following, for each text file:
store the NSF program, and Fld Applictn numbers in a list, and store the associated text in another list
so, in the above example I want the following, for the i-th text file:
y_num[i] = 1468, 0308000, 56
y_txt[i] = MANUFACTURING MACHINES & EQUIP, Industrial Technology, Engineering-Mechanical
Is there a clean way to do this in python? I prefer python since I am using os.walk to parse all the text files stored in subdirectories.
file = open("file","r")
for line in file.readlines():
    if "NSF" in line:
        values = line.split(":")
    elif "Fld" in line:
        values1 = line.split(":")
So values and values1 hold the specific values you are interested in.
You can try something like:
yourtextlist = yourtext.split(':')
numbers = []
for part in yourtextlist:
    l = part.split()
    try:
        numbers.append(int(l[0]))
    except (ValueError, IndexError):  # skip parts that are empty or don't start with a number
        pass
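Since the layout of these NSF award files is fixed, a sketch along these lines may be cleaner than ad-hoc splitting. The root directory name 'awards' is a placeholder, and the block-continuation logic assumes continuation lines are indented as in the original files:
import os
import re

# matches an optional "NSF Program :" / "Fld Applictn:" prefix, then a number and its text
line_re = re.compile(r'^(?:NSF Program\s*:|Fld Applictn:)?\s*(\d+)\s+(.+?)\s*$')

y_num, y_txt = [], []
for dirpath, dirnames, filenames in os.walk('awards'):  # placeholder root directory
    for name in filenames:
        nums, txts = [], []
        in_block = False
        with open(os.path.join(dirpath, name)) as fh:
            for line in fh:
                if line.startswith(('NSF Program', 'Fld Applictn')):
                    in_block = True   # start of a section we care about
                elif not line.startswith(' '):
                    in_block = False  # a new unindented field ends the section
                if in_block:
                    m = line_re.match(line)
                    if m:
                        nums.append(m.group(1))
                        txts.append(m.group(2))
        y_num.append(nums)
        y_txt.append(txts)
For the snippet above, this yields nums = ['1468', '0308000', '56'] and txts = ['MANUFACTURING MACHINES & EQUIP', 'Industrial Technology', 'Engineering-Mechanical'] for that file.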

How do I decode garbled text from the Library of Congress?

I am making a z39.50 search in Python, but have a problem with decoding the search results.
The first search result for "harry potter" is apparently a Hebrew version of the book.
How can I turn this into Unicode?
This is the minimal code I use to get a post:
#!/usr/bin/env python
# encoding: utf-8
from PyZ3950 import zoom
from PyZ3950 import zmarc
conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName = 'VOYAGER'
query = zoom.Query('CCL', 'ti="HARRY POTTER"')
res = conn.search(query)
print "%d hits:" % len(res)
for r in res[:1]:
    print unicode( r.data )
Running the script results in "UnicodeDecodeError: 'ascii' codec can't decode byte 0xf2 in position 788: ordinal not in range(128)"
r.data.decode('windows-1255').encode('utf-8')
You'll have to figure out the correct encoding they used, and put that instead of 'windows-1255' (which might work, if you're right about the Hebrew guess).
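If you want a starting point for the guess, the chardet library (an assumption; it is not mentioned in this thread) can estimate an encoding from raw bytes, which might at least support or rule out the Hebrew theory:
import chardet

guess = chardet.detect(r.data)  # r.data is the raw bytes from the hit
print guess  # e.g. {'encoding': ..., 'confidence': ...}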
I'm trying to reproduce your problem, but am getting into the Python equivalent of "DLL Hell". Please specify what version of each of (Python, PyZ3950, and PLY) that you are using.
You will note from the error message that there are 788 bytes of ASCII before you get a non-ASCII byte. That doesn't sound like Hebrew/Arabic/Greek/Cyrillic/etc., which use non-ASCII bytes to represent the characters most often used in those languages.
Instead of print unicode(r.data), do print type(r.data), repr(r.data) and edit your question to show the results.
Update I managed to get it running with the latest versions of PyZ3950 and PLY with Python 2.6 -- needed from ply import lex instead of import lex in PyZ3950/ccl.py (and likewise fixed import yacc).
Here are the results of dumping hit 0 and hit 200:
>>> print repr(res[0].data)
"01688cam 22003614a 45000010009000000050017000090080041000260350018000670350020
00085906004500105925004400150955002400194010001700218020001500235040001300250041
00130026305000180027610000270029488000540032124000330037524501270040888001620053
52460070006972600092007678800200008593000029010594900019010888800045011077000029
01152880006301181700002901244880005301273\x1e16012113\x1e20091209015332.0\x1e091
208s2008 is a 000 1 heb \x1e \x1fa(DLC)16012909\x1e \x1fa(DLC)200
9664431\x1e \x1fa0\x1fbibc\x1fcorignew\x1fd3\x1fencip\x1ff20\x1fgy-nonroman\x1e
0 \x1faacquire\x1fb1 shelf copies\x1fxpolicy default\x1e \x1fbcd06 2009-12-08 I
BC\x1e \x1fa 2009664431\x1e \x1fa965511564X\x1e \x1faDLC\x1fcDLC\x1e1 \x1fah
eb\x1fheng\x1e00\x1faPZ40.R685\x1fbH+\x1e1 \x1f6880-01\x1faRowling, J. K.\x1e1 \
x1f6100-01/(2/r‏\x1fa\x1b(2xelipb, b\x1b(B'\x1b(2i. wi.\x1b(B\x1e10\x1faH
arry Potter and ??.\x1flHebrew\x1e10\x1f6880-02\x1faHari Po\xf2ter \xf2ve-misdar
\xb0of ha-\xf2hol ? /\x1fcG'e. \xf2Ke. Roling ; me-Anglit, Gili Bar-Hilel Samu
; iyurim, Mery Granpreh.\x1e10\x1f6245-02/(2/r‏\x1fa‏\x1b(2d`xi te
hx e........‏ /\x1b(B\x1fc‏\x1b(2b\x1b(B'\x1b(2i. wi. xelipb ; n`p
bliz, bili ax\x1b(B-\x1b(2dll qne ; `iexim, nxi bx`ptxd.\x1b(B\x1e1 \x1fiTitle o
n t.p. verso:\x1faHarry Potter and the order of the phoenix ?\x1e \x1f6880-03\x
1faTel-Aviv :\x1fbYedi\xb0ot a\xf2haronot :\x1fbSifre \xf2hemed :\x1fbSifre \xb0
Aliyat ha-gag,\x1fcc[2008]\x1e \x1f6260-03/(2/r‏\x1fa‏\x1b(2zl\x1
b(B-\x1b(2`aia‏ :\x1b(B\x1fb\x1b(2icirez `gxepez :‏\x1b(B\x1fb&#x2
00f;\x1b(2qtxi gnc :‏\x1b(B\x1fb‏\x1b(2qtxi rliiz dbb,‏\x1b
(B\x1fc‏‪[2008]‬\x1e \x1fa887 p. :\x1fbill. ;\x1fc21 cm.\x
1e0 \x1f6880-04\x1faProzah\x1e0 \x1f6490-04/(2/r‏\x1fa‏\x1b(2txefd
\x1b(B\x1e1 \x1f6880-05\x1faBar-Hilel, Gili.\x1e1 \x1f6700-05/(2/r‏\x1fa&
#x200f;\x1b(2ax\x1b(B-\x1b(2dll qne, bili.\x1b(B\x1e1 \x1f6880-06\x1faGrandPr\xe
2e, Mary.\x1e1 \x1f6700-06/(2/r‏\x1fa‏\x1b(2bx`ptxd, nxi.\x1b(B\x1
e\x1d"
>>> print repr(res[200].data)
"01427cam 22003614a 45000010009000000050017000090080041000269060045000679250044
00112955017900156010001700335020001800352020001500370035002400385040001800409042
00140042705000220044110000280046324501160049126000760060730000200068344000350070
35040041007386500018007796500013007976500017008106500041008276000019008686000039
00887600004800926710005900974923003201033\x1e14882660\x1e20070925153312.0\x1e070
607s2007 ie b 000 0 eng d\x1e \x1fa7\x1fbcbc\x1fccopycat\x1fd3\x1fe
ncip\x1ff20\x1fgy-gencatlg\x1e0 \x1faacquire\x1fb2 shelf copies\x1fxpolicy defau
lt\x1e \x1fanb05 2007-06-07 z-processor ; nb05 2007-06-07 to HLCD for processin
g;\x1falk21 2007-08-09 to sh00\x1fish21 2007/09-18 (telework)\x1fesh49 2007-09-2
0 to BCCD\x1fesh45 2007-09-25 (Revised)\x1e \x1fa 2007390561\x1e \x1fa9780955
492617\x1e \x1fa0955492610\x1e \x1fa(OCoLC)ocn129545188\x1e \x1faVYF\x1fcVYF\
x1fdDLC\x1e \x1falccopycat\x1e00\x1faBT1105\x1fb.H44 2007\x1e1 \x1faHederman, M
ark Patrick.\x1e10\x1faHarry Potter and the Da Vinci code :\x1fb'Thunder of a Ba
ttle fought in some other Star' /\x1fcMark Patrick Hederman.\x1e \x1faDublin :\
x1fbDublin Centre for the Study of the Platonic Tradition,\x1fc2007.\x1e \x1fa3
8 p. ;\x1fc21 cm.\x1e 0\x1faPlatonic Centre pamphlets ;\x1fv2\x1e \x1faIncludes
bibliographical references.\x1e 0\x1faChristianity.\x1e 0\x1faMystery.\x1e 0\x1
faImagination.\x1e 0\x1faPotter, Harry (Fictitious character)\x1e10\x1faRowling,
J. K.\x1e10\x1faBrown, Dan,\x1fd1964-\x1ftDa Vinci code.\x1e10\x1faYeats, W. B.
\x1fq(William Butler),\x1fd1865-1939.\x1e2 \x1faDublin Centre for the Study of t
he Platonic Tradition.\x1e \x1fd20070411\x1fn565079784\x1fsKennys\x1e\x1d"
You will notice that there are quite a few of \x1e and \x1f in the "ASCII" part before the part where it blew up. There's also a \x1d at the end of each dump. (GROUP|UNIT|RECORD) SEPARATORs, perhaps. You will also notice that the second output also looks like gobbledegook but it's not mentioning Hebrew.
Conclusion: Forget Hebrew. Forget Unicode -- that stuff is NOT the result of sensible_unicode_text.encode("any_known_encoding"). Z3950 reeks of punched cards and magnetic drums and tapes. If it knows about Unicode, it's not evident in that data.
Looks like you need to read the ZOOM API docs that come with PyZ3950, and that will lead you to the ZOOM docs ... good luck.
Update 2
>>> r0 = res[0]
>>> dir(r0)
['__doc__', '__init__', '__module__', '__str__', '_rt', 'data', 'databaseName',
'get_field', 'get_fieldcount', 'is_surrogate_diag', 'syntax']
>>> r0.syntax
'USMARC'
>>>
Looks like you need to understand MARC
Update 3 Noticed BIDI stuff like ‏‪[2008]‬ in the first dump ... so you'll end up with Unicode eventually, AFTER you drop down through the levels of the docs working out what's wrapped in what ... again, good luck!
You need to convert the MARC data for this.
You can use the code below:
from pymarc import MARCReader

temp_list = []
for i in range(0, 2):  # you can use len(res) here for all results
    temp_list.append(res[i].data)
for i in range(0, 2):  # you can use len(res) here for all results
    reader = MARCReader(temp_list[i])
    for record in reader:  # renamed from i to avoid shadowing the outer loop variable
        print record.title(), record.author()
