I'm trying to parse a protein GenBank format file. Here's an example file (example.protein.gpff):
LOCUS NP_001346895 208 aa linear PRI 20-JAN-2018
DEFINITION intercellular adhesion molecule 2 precursor [Cercocebus atys].
ACCESSION NP_001346895
VERSION NP_001346895.1
DBSOURCE REFSEQ: accession NM_001359966.1
KEYWORDS RefSeq.
SOURCE Cercocebus atys (sooty mangabey)
ORGANISM Cercocebus atys
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Cercopithecidae; Cercopithecinae; Cercocebus.
REFERENCE 1 (residues 1 to 208)
AUTHORS Palesch D, Bosinger SE, Tharp GK, Vanderford TH, Paiardini M,
Chahroudi A, Johnson ZP, Kirchhoff F, Hahn BH, Norgren RB, Patel
NB, Sodora DL, Dawoud RA, Stewart CB, Seepo SM, Harris RA, Liu Y,
Raveendran M, Han Y, English A, Thomas GWC, Hahn MW, Pipes L, Mason
CE, Muzny DM, Gibbs RA, Sauter D, Worley K, Rogers J and Silvestri
G.
TITLE Sooty mangabey genome sequence provides insight into AIDS
resistance in a natural SIV host
JOURNAL Nature 553 (7686), 77-81 (2018)
PUBMED 29300007
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
KY308194.1.
##Evidence-Data-START##
Transcript exon combination :: KY308194.1 [ECO:0000332]
RNAseq introns :: single sample supports all introns
SAMN02045730, SAMN03085078
[ECO:0000348]
##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..208
/organism="Cercocebus atys"
/db_xref="taxon:9531"
Protein 1..208
/product="intercellular adhesion molecule 2 precursor"
/calculated_mol_wt=21138
sig_peptide 1..19
/inference="COORDINATES: ab initio prediction:SignalP:4.0"
/calculated_mol_wt=1999
Region 24..109
/region_name="ICAM_N"
/note="Intercellular adhesion molecule (ICAM), N-terminal
domain; pfam03921"
/db_xref="CDD:252248"
Region 112..>167
/region_name="Ig"
/note="Immunoglobulin domain; cl11960"
/db_xref="CDD:325142"
CDS 1..208
/gene="ICAM2"
/coded_by="NM_001359966.1:1..627"
/db_xref="GeneID:105590766"
ORIGIN
1 mssfgfgtlt malfalvccs gsdekafevh mrleklivkp kesfevncst tcnqpevggl
61 etslnkilll eqtqwkhyli snishdtvlw chftcsgkqk smssnvsvyq pprqvfltlq
121 ptwvavgksf tiecrvpave pldsltlsll rgsetlhsqt frkaapalpv lrelgmkfiq
181 lcprrglagt mppsrpwcpa athwsqgc
//
LOCUS NP_001280013 406 aa linear MAM 22-JAN-2018
DEFINITION 26S proteasome regulatory subunit 8 [Dasypus novemcinctus].
ACCESSION NP_001280013 XP_004456848
VERSION NP_001280013.1
DBSOURCE REFSEQ: accession NM_001293084.1
KEYWORDS RefSeq.
SOURCE Dasypus novemcinctus (nine-banded armadillo)
ORGANISM Dasypus novemcinctus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Xenarthra; Cingulata; Dasypodidae; Dasypus.
COMMENT VALIDATED REFSEQ: This record has undergone validation or
preliminary review. The reference sequence was derived from
AAGV03083693.1.
On May 28, 2014 this sequence version replaced XP_004456848.1.
Sequence Note: The RefSeq transcript and protein were derived from
genomic sequence to make the sequence consistent with the reference
genome assembly. The genomic coordinates used for the transcript
record were based on alignments.
##Evidence-Data-START##
RNAseq introns :: mixed/partial sample support SAMN00668203,
SAMN00744121 [ECO:0000350]
##Evidence-Data-END##
FEATURES Location/Qualifiers
source 1..406
/organism="Dasypus novemcinctus"
/db_xref="taxon:9361"
Protein 1..406
/product="26S proteasome regulatory subunit 8"
/calculated_mol_wt=45495
Region 4..404
/region_name="RPT1"
/note="ATP-dependent 26S proteasome regulatory subunit
[Posttranslational modification, protein turnover,
chaperones]; COG1222"
/db_xref="CDD:224143"
CDS 1..406
/gene="PSMC5"
/coded_by="NM_001293084.1:1..1221"
/db_xref="GeneID:101445299"
ORIGIN
1 maldgpeqme leegkagsgl rqyylskiee lqlivndksq nlrrlqaqrn elnakvrllr
61 eelqllqeqg syvgevvram dkkkvlvkvh pegkfvvdvd knidindvtp ncrvalrnds
121 ytlhkilpnk vdplvslmmv ekvpdstyem iggldkqike ikevielpvk hpelfealgi
181 aqpkgvllyg ppgtgktlla ravahhtdct firvsgselv qkfigegarm vrelfvmare
241 hapsiifmde idsigssrle ggsggdsevq rtmlellnql dgfeatknik vimatnridi
301 ldsallrpgr idrkiefppp neearldilk ihsrkmnltr ginlrkiael mpgasgaevk
361 gvcteagmya lrerrvhvtq edfemavakv mqkdseknms ikklwk
//
The format has repeating records (separated by //), where each record is a protein. Each record has several sections, among them a FEATURES section with several fixed fields, such as source, CDS, and Region, whose values hold information specific to that record.
I'm interested in using Biopython's SeqIO to parse this file into a dataframe which lists, for each record ID, the gene, db_xref, and coded_by values from its CDS field, the organism and db_xref values from its source field, and the db_xref value from each of its Region fields. The CDS and source fields appear only once in the FEATURES section of a record, whereas the Region field may appear several times.
My unsuccessful attempt so far looks like this:
from Bio import SeqIO
import re

filename = "example.protein.gpff"
for record in SeqIO.parse(filename, "genbank"):
    for feature in record.features:
        if feature.type == "CDS":
            symbol = feature.qualifiers.get("gene", ["???"])[0]
            gene_id = feature.qualifiers.get("db_xref", ["???"])[0]
            gene_id = re.sub('GeneID:', '', gene_id)
            transcript_id = feature.qualifiers.get("coded_by", ["???"])[0]
            transcript_id = re.sub(':.*', '', transcript_id)
        if feature.type == "source":
            species_name = feature.qualifiers.get("organism", ["???"])[0]
            species_id = feature.qualifiers.get("db_xref", ["???"])[0]
            species_id = re.sub('taxon:', '', species_id)
        if feature.type == "Region":
            cdd_id = feature.qualifiers.get("db_xref", ["???"])[0]
            cdd_id = re.sub('CDD:', '', cdd_id)
    print("%s,%s,%s,%s,%s,%s,%s" % (record.id, cdd_id, transcript_id, symbol, gene_id, species_name, species_id))
The resulting dataframe I'd like to obtain (for the example.protein.gpff above) is:
record_id CDS_coded_by CDS_db_xref CDS_gene source_organism source_db_xref Region_db_xref
1 NP_001346895 NM_001359966.1:1..627 GeneID:105590766 ICAM2 Cercocebus atys taxon:9531 CDD:252248
2 NP_001346895 NM_001359966.1:1..627 GeneID:105590766 ICAM2 Cercocebus atys taxon:9531 CDD:325142
3 NP_001280013 NM_001293084.1:1..1221 GeneID:101445299 PSMC5 Dasypus novemcinctus taxon:9361 CDD:224143
Check out the genbank-parser library. It accepts a GenBank filename and a batch size; next_batch yields as many records as batch_size specifies.
Seems like the easiest way to deal with this file format is to convert it to JSON (for example, using Biopython) and then read it with various JSON parsers (like the rjson package in R, which parses a JSON file into a list of records).
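Alternatively, sticking with Biopython's SeqIO as in the question, here is a minimal sketch that collects one output row per Region feature, assuming pandas for the final dataframe. It uses record.name rather than record.id, since the desired output shows the accession without its version suffix:

import pandas as pd
from Bio import SeqIO

rows = []
for record in SeqIO.parse("example.protein.gpff", "genbank"):
    # The question guarantees exactly one source and one CDS per record.
    source = next(f for f in record.features if f.type == "source")
    cds = next(f for f in record.features if f.type == "CDS")
    common = {
        "record_id": record.name,
        "CDS_coded_by": cds.qualifiers.get("coded_by", ["???"])[0],
        "CDS_db_xref": cds.qualifiers.get("db_xref", ["???"])[0],
        "CDS_gene": cds.qualifiers.get("gene", ["???"])[0],
        "source_organism": source.qualifiers.get("organism", ["???"])[0],
        "source_db_xref": source.qualifiers.get("db_xref", ["???"])[0],
    }
    # One row per Region feature, so records with several domains
    # produce several rows.
    for feature in record.features:
        if feature.type == "Region":
            row = dict(common)
            row["Region_db_xref"] = feature.qualifiers.get("db_xref", ["???"])[0]
            rows.append(row)

df = pd.DataFrame(rows)
print(df)

For the example file above this should produce the three rows shown in the desired dataframe.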
Related
I am a beginner in Python, so apologies if this is a basic query. I would like to save the values of ENTRY, NAME, SYMBOL, and PATHWAY in their respective variables, wherever they are available. I am having an issue saving a value when it continues on a following line that starts with spaces. How can I get that information?
The ENTRY variable should hold '102724788 CDS T01001', and PATHWAY should hold both of its lines.
ENTRY 102724788 CDS T01001
NAME (RefSeq) proline dehydrogenase 1, mitochondrial
SYMBOL ARVBSH8
ORTHOLOGY K00318 proline dehydrogenase [EC:1.5.5.2]
ORGANISM hsa Homo sapiens (human)
PATHWAY hsa00330 Arginine and proline metabolism
hsa01100 Metabolic pathways
BRITE KEGG Orthology (KO) [BR:hsa00001]
09100 Metabolism
09105 Amino acid metabolism
00330 Arginine and proline metabolism
102724788
Enzymes [BR:hsa01000]
1. Oxidoreductases
1.5 Acting on the CH-NH group of donors
1.5.5 With a quinone or similar compound as acceptor
1.5.5.2 proline dehydrogenase
102724788
POSITION 22
MOTIF Pfam: Pro_dh HrpB2
DBLINKS NCBI-GeneID: 102724788
NCBI-ProteinID: NP_001355178
Ensembl: ENSG00000277196
///
ENTRY 112268355 CDS T01001
NAME (RefSeq) killer cell immunoglobulin-like receptor 3DS1-like
ORGANISM hsa Homo sapiens (human)
POSITION 19
MOTIF Pfam: ig Ig_2 Ig_3
DBLINKS NCBI-GeneID: 112268355
NCBI-ProteinID: NP_001355183
///
data = {}
is_pathway = False
for line in reader.lines():
    line_list = line.split()  # split the line into whitespace-separated fields
    if is_pathway:
        data['PATHWAY'] += " " + " ".join(line_list)
        is_pathway = False
    elif line.startswith('PATHWAY'):
        data['PATHWAY'] = " ".join(line_list[1:])
        is_pathway = True
    ...
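Since the pasted example has lost its leading whitespace, here is a minimal sketch assuming the real KEGG flat-file layout, where the field name occupies the first 12 columns, continuation lines begin with whitespace, and /// terminates each entry:

def parse_kegg(path):
    entries = []
    fields = {}
    key = None
    with open(path) as f:
        for line in f:
            if line.startswith("///"):     # end of one entry
                entries.append(fields)
                fields, key = {}, None
            elif line[:12].strip():        # a new field name starts here
                key = line[:12].strip()
                fields[key] = [line[12:].rstrip()]
            elif key:                      # continuation of the last field
                fields[key].append(line.strip())
    return entries

entries = parse_kegg("kegg.txt")  # hypothetical filename
print(entries[0]["ENTRY"])        # the '102724788 CDS T01001' line
print(entries[0]["PATHWAY"])      # both PATHWAY lines, as requested

Each field value comes back as a list of lines, so PATHWAY keeps both of its lines while single-line fields like SYMBOL are one-element lists.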
I'm a biologist and I need to extract information from a text file.
I have a file with plain text like this:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
aa|amino acid|0.99818
Ad|adenovirus|0.96935
MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the ID, the second line is the title, the third line is usually the abstract (sometimes it contains abbreviations), and the last lines (if there are any) are abbreviation lines: two leading spaces, then the abbreviation, its meaning, and a number, separated by "|". For example:
GA|general anesthesia|0.99818
Then there is a blank line and it starts again: ID, Title, Abstract, Abbreviations or ID, Title, Abbreviations, Abstract.
And I need to take this data and convert it to a TSV file like this:
12018411 TAI timed artificial insemination
8406022 aa amino acid
8406022 Ad adenovirus
... ... ...
First column ID, second column Abbreviation, and third column Meaning of this abbreviation.
I tried converting to a DataFrame first and then to TSV, but I don't know how to extract the information from the text with the structure I need.
I also tried this code:
from collections import namedtuple
import pandas as pd

Item = namedtuple('Item', 'ID Abbreviation')
items = []
with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
    lines = f.readlines()
    for line in lines:
        if line == '\n':
            ID = ¿nextline?  # this is where I'm stuck
        if line.startswith("  "):
            Abbreviation = line
            items.append(Item(ID, Abbreviation))
df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line, and the code fails because sometimes there are blank lines in the middle, between the abstract and the title.
I'm using Python 3.8.
Thank you very much in advance.
Assuming test.txt has your input data, I used simple file-read functions to process it -

file1 = open('test.txt', 'r')
lines = file1.readlines()
file1.close()

outputlines = []
outputline = ""
counter = 0
for l in lines:
    if l.strip() == "":          # blank line: a new record starts next
        outputline = ""
        counter = 0
    elif counter == 0:           # first line of a record is the ID
        outputline = l.strip() + "\t"
        counter = counter + 1
    elif counter == 1:           # second line is the title; skip it
        counter = counter + 1
    else:
        # abbreviation lines have three "|"-separated fields and start
        # with two spaces; abstract lines do not
        fields = l.split("|")
        if len(fields) == 3 and l[0:2] == "  ":
            outputlines.append(outputline + fields[0].strip() + "\t" + fields[1].strip() + "\n")
        counter = counter + 1

file1 = open('myfile.txt', 'w')
file1.writelines(outputlines)
file1.close()
Here the file is read line by line; a counter is kept and reset whenever there is a blank line, so the ID is read from the line just after a blank one. If a row has three "|"-separated fields and begins with two spaces, it is exported together with the ID.
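If you'd rather go through a DataFrame as in your first attempt, the same parsed triples can be handed to pandas. A short sketch, assuming rows is a list of (id, abbreviation, meaning) tuples collected by a loop like the one above:

import pandas as pd

# rows: list of (id, abbreviation, meaning) tuples collected while parsing
df = pd.DataFrame(rows, columns=["ID", "Abbreviation", "Meaning"])
df.to_csv("abbreviations.tsv", sep="\t", index=False, header=False)  # hypothetical output name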
Using the code below, I read a large number of XML files (around 300,000) into a nested dictionary, which I want to write to a single CSV file. On my first attempt I used a pandas DataFrame as an intermediary. The dictionary is fully constructed, but during the last step, converting it to CSV, I get exit code 137 (interrupted by signal 9: SIGKILL).
(I found that building a nested dictionary instead of appending to a DataFrame is by far the quickest option.)
Any idea how I can manage to write into a single CSV while circumventing this error? Is there a way to free up some memory somewhere in between? (One possible approach is sketched after the XML example below.)
Thanks!
#Import packages.
import pandas as pd
from lxml import etree
import os
from os import listdir
from os.path import isfile, join
from tqdm import tqdm
from datetime import datetime
from collections import defaultdict

#Set options for displaying results
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

def run(file, content):
    data = etree.parse(file)
    #get all paths from the XML
    get_path = lambda x: data.getpath(x)
    paths = list(map(get_path, data.getroot().getiterator()))
    content = ""
    content = [
        data.getroot().xpath(path)
        for path in paths
    ]
    get_text = lambda x: x.text
    content = [list(map(get_text, i)) for i in content]
    content = dict(zip(paths, content))
    content = {
        content["/clinical_study/id_info/nct_id"][0]: content
    }
    dict_final.update(content)

def write_csv(df_name, csv):
    df_name.to_csv(csv, sep=";")

#######RUN######
mypath = '/Users/Documents/AllPublicXML'
folder_all = os.listdir(mypath)
dict_final = {}
df_final = pd.DataFrame()

for folder in tqdm(folder_all):
    mypath2 = mypath + "/" + folder
    print(folder)
    if os.path.isdir(mypath2):
        file = [f for f in listdir(mypath2) if isfile(join(mypath2, f))]
        output = "./Output/" + folder + ".csv"
        for x in tqdm(file):
            dir = mypath2 + "/" + x
            #output = "./Output/"+x+".csv"
            dict_name = x.split(".", 1)[0]
            try:
                run(dir, dict_name)
            except:
                log = open("log.txt", "a+")
                log.write(str(datetime.now()) + ": Error in file " + x + "\r \n")
                pass
        log = open("log.txt", "a+")
        log.write(str(datetime.now()) + ": " + folder + " written succesfully \r \n")

df_final = pd.DataFrame.from_dict(dict_final, orient='index')
write_csv(df_final, "./Output/final_csv.csv")
log.close()
The XMLs look like this:
<clinical_study>
<!--
This xml conforms to an XML Schema at:
https://clinicaltrials.gov/ct2/html/images/info/public.xsd
-->
<required_header>
<download_date>
ClinicalTrials.gov processed this data on March 20, 2020
</download_date>
<link_text>Link to the current ClinicalTrials.gov record.</link_text>
<url>https://clinicaltrials.gov/show/NCT03261284</url>
</required_header>
<id_info>
<org_study_id>2017-P-032</org_study_id>
<nct_id>NCT03261284</nct_id>
</id_info>
<brief_title>
D-dimer to Guide Anticoagulation Therapy in Patients With Atrial Fibrillation
</brief_title>
<acronym>DATA-AF</acronym>
<official_title>
D-dimer to Determine Intensity of Anticoagulation to Reduce Clinical Outcomes in Patients With Atrial Fibrillation
</official_title>
<sponsors>
<lead_sponsor>
<agency>Wuhan Asia Heart Hospital</agency>
<agency_class>Other</agency_class>
</lead_sponsor>
</sponsors>
<source>Wuhan Asia Heart Hospital</source>
<oversight_info>
<has_dmc>Yes</has_dmc>
<is_fda_regulated_drug>No</is_fda_regulated_drug>
<is_fda_regulated_device>No</is_fda_regulated_device>
</oversight_info>
<brief_summary>
<textblock>
This was a prospective, three arms, randomized controlled study.
</textblock>
</brief_summary>
<detailed_description>
<textblock>
D-dimer testing is performed in AF Patients receiving warfarin therapy (target INR:1.5-2.5) in Wuhan Asia Heart Hospital. Patients with elevated d-dimer levels (>0.5ug/ml FEU) were SCREENED AND RANDOMIZED to three groups at a ratio of 1:1:1. First, NOAC group,the anticoagulant was switched to Dabigatran (110mg,bid) when elevated d-dimer level was detected during warfarin therapy.Second,Higher-INR group, INR was adjusted to higher level (INR:2.0-3.0) when elevated d-dimer level was detected during warfarin therapy. Third, control group, patients with elevated d-dimer levels have no change in warfarin therapy. Warfarin is monitored once a month by INR ,and dabigatran dose not need monitor. All patients were followed up for 24 months until the occurrence of endpoints, including bleeding events, thrombotic events and all-cause deaths.
</textblock>
</detailed_description>
<overall_status>Enrolling by invitation</overall_status>
<start_date type="Anticipated">March 1, 2019</start_date>
<completion_date type="Anticipated">May 30, 2020</completion_date>
<primary_completion_date type="Anticipated">February 28, 2020</primary_completion_date>
<phase>N/A</phase>
<study_type>Interventional</study_type>
<has_expanded_access>No</has_expanded_access>
<study_design_info>
<allocation>Randomized</allocation>
<intervention_model>Parallel Assignment</intervention_model>
<primary_purpose>Treatment</primary_purpose>
<masking>None (Open Label)</masking>
</study_design_info>
<primary_outcome>
<measure>Thrombotic events</measure>
<time_frame>24 months</time_frame>
<description>
Stroke, DVT, PE, Peripheral arterial embolism, ACS etc.
</description>
</primary_outcome>
<primary_outcome>
<measure>hemorrhagic events</measure>
<time_frame>24 months</time_frame>
<description>cerebral hemorrhage,Gastrointestinal bleeding etc.</description>
</primary_outcome>
<secondary_outcome>
<measure>all-cause deaths</measure>
<time_frame>24 months</time_frame>
</secondary_outcome>
<number_of_arms>3</number_of_arms>
<enrollment type="Anticipated">600</enrollment>
<condition>Atrial Fibrillation</condition>
<condition>Thrombosis</condition>
<condition>Hemorrhage</condition>
<condition>Anticoagulant Adverse Reaction</condition>
<arm_group>
<arm_group_label>DOAC group</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
<description>
Patients with elevated d-dimer levels was switched to DOAC (dabigatran 150mg, bid).
</description>
</arm_group>
<arm_group>
<arm_group_label>Higher-INR group</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
<description>
Patients' target INR was adjusted from 1.5-2.5 to 2.0-3.0 by adding warfarin dose.
</description>
</arm_group>
<arm_group>
<arm_group_label>Control group</arm_group_label>
<arm_group_type>No Intervention</arm_group_type>
<description>
Patients continue previous strategy without change.
</description>
</arm_group>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Dabigatran Etexilate 150 MG [Pradaxa]</intervention_name>
<description>Dabigatran Etexilate 150mg,bid</description>
<arm_group_label>DOAC group</arm_group_label>
<other_name>Pradaxa</other_name>
</intervention>
<intervention>
<intervention_type>Drug</intervention_type>
<intervention_name>Warfarin Pill</intervention_name>
<description>Add warfarin dose according to INR values.</description>
<arm_group_label>Higher-INR group</arm_group_label>
</intervention>
<eligibility>
<criteria>
<textblock>
Inclusion Criteria: - Patients with non-valvular atrial fibrillation - Receiving warfarin therapy Exclusion Criteria: - Patients who had suffered from recent (within 3 months) myocardial infarction, ischemic stroke, deep vein thrombosis, cerebral hemorrhages, or other serious diseases. - Those who had difficulty in compliance or were unavailable for follow-up.
</textblock>
</criteria>
<gender>All</gender>
<minimum_age>18 Years</minimum_age>
<maximum_age>75 Years</maximum_age>
<healthy_volunteers>No</healthy_volunteers>
</eligibility>
<overall_official>
<last_name>Zhenlu ZHANG, MD,PhD</last_name>
<role>Study Director</role>
<affiliation>Wuhan Asia Heart Hospital</affiliation>
</overall_official>
<location>
<facility>
<name>Zhang litao</name>
<address>
<city>Wuhan</city>
<state>Hubei</state>
<zip>430022</zip>
<country>China</country>
</address>
</facility>
</location>
<location_countries>
<country>China</country>
</location_countries>
<verification_date>March 2019</verification_date>
<study_first_submitted>August 22, 2017</study_first_submitted>
<study_first_submitted_qc>August 23, 2017</study_first_submitted_qc>
<study_first_posted type="Actual">August 24, 2017</study_first_posted>
<last_update_submitted>March 6, 2019</last_update_submitted>
<last_update_submitted_qc>March 6, 2019</last_update_submitted_qc>
<last_update_posted type="Actual">March 7, 2019</last_update_posted>
<responsible_party>
<responsible_party_type>Sponsor</responsible_party_type>
</responsible_party>
<keyword>D-dimer</keyword>
<keyword>Nonvalvular atrial fibrillation</keyword>
<keyword>Direct thrombin inhibitor</keyword>
<keyword>INR</keyword>
<condition_browse>
<!--
CAUTION: The following MeSH terms are assigned with an imperfect algorithm
-->
<mesh_term>Atrial Fibrillation</mesh_term>
<mesh_term>Thrombosis</mesh_term>
<mesh_term>Hemorrhage</mesh_term>
</condition_browse>
<intervention_browse>
<!--
CAUTION: The following MeSH terms are assigned with an imperfect algorithm
-->
<mesh_term>Warfarin</mesh_term>
<mesh_term>Dabigatran</mesh_term>
<mesh_term>Fibrin fragment D</mesh_term>
</intervention_browse>
<!--
Results have not yet been posted for this study
-->
</clinical_study>
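One way to avoid the memory spike is to drop the intermediate DataFrame entirely and append each parsed record to the CSV as it is produced. Below is a minimal sketch, assuming columns is a fixed list of the XML paths you want as headers and flatten() is a hypothetical helper returning the per-file {path: text} dict (essentially the body of run() above, returning the dict instead of updating a global):

import csv

with open("./Output/final_csv.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=columns, delimiter=";",
                            extrasaction="ignore")
    writer.writeheader()
    for xml_file in all_xml_files:   # e.g. collected with os.walk
        # flatten() is the hypothetical per-file parser described above;
        # each row is written out immediately, so only one parsed file
        # needs to be held in memory at a time
        writer.writerow(flatten(xml_file))

Since dict_final never has to exist, peak memory stays roughly bounded by the largest single XML file.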
In Python 3 I have a series of links to "fixed-width files". They are web pages with public information about companies; each line holds information about one company.
Example links:
http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
and
http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214RO
I have these links in a dictionary. The key is the name of the region of the country in which the companies are located, and the value is the link:
for chave, valor in dict_val.items():
    print(f'Region of country: {chave} - and link with information: {valor}')
Region of country: Acre - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214AC
Region of country: Espírito Santo - and link with information: http://idg.receita.fazenda.gov.br/orientacao/tributaria/cadastros/cadastro-nacional-de-pessoas-juridicas-cnpj/consultas/download/F.K03200UF.D71214ES
...
I want to read these links (fixed-width files) and save the content to a CSV file. Example content:
0107397388000155ASSOCIACAO CULTURAL
02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA
0101904573000102ABREU E SILVA COMERCIO DE MEDICAMENTOS LTDA-ME - ME
02019045730001022 49JETEBERSON OLIVEIRA DE ABREU
02019045730001022 49LUZINETE SANTOS DA SILVA ABREU
0101668652000161CONSELHO ESCOLAR DA ESCOLA ULISSES GUIMARAES
02016686520001612 10REGINA CLAUDIA RAMOS DA SILVA PESSOA
0101631137000107FORTERM * REPRESENTACOES E COMERCIO LTDA
02016311370001072 49ANTONIO MARCOS GONCALVES
02016311370001072 22IVANEIDE BERNARDO DE MENEZES
But to fill the rows of the CSV columns I need to split and test each line of these fixed-width files.
I must follow rules like these:
1. If the line begins with "01", it is a line with the company's registration number and its name. Example: "0107397388000155ASSOCIACAO CULTURAL"
1.1 - The "01" indicates this
1.2 - The next 14 positions on the line are the company code - starts at position 3 and ends at 16 - (07397388000155)
1.3 - The following 150 positions are the company name - starts at position 17 and ends at 166 - (ASSOCIACAO CULTURAL)
and
2. If the line starts with "02", it has information about the partners of the company. Example: "02073973880001552 16MARIA DO SOCORRO RODRIGUES ALVES BRAGA"
2.1 - The "02" indicates this
2.2 - The next fourteen positions are the company registration code - starts at position 3 and ends at 16 (07397388000155)
2.3 - The next number is a member identifier code, which can be 1, 2 or 3 - starts and ends at position 17 - (2)
2.4 - The next fourteen positions are another code identifying the member - starts at position 18 and ends at 31 - ("" - empty in this case)
2.5 - The next two positions are another code identifying the member - starts at position 32 and ends at 33 (16)
2.6 - And the 150 final positions are the name of the partner - starts at position 34 and ends at 183 (MARIA DO SOCORRO RODRIGUES ALVES BRAGA)
In this case, would one possible strategy be to save each link as a TXT file and then try to separate the positions?
Or is there a better way to scrape fixed-width files?
You can take a look at any of the URL-fetching modules. I recommend Requests, although you can use urllib, which comes bundled with Python.
With that in mind, you can get the text from the page, and since it doesn't require a login of any form, with requests it is simply a matter of:
import requests
r = requests.get('Your link from receita.fazenda.gov.br')
page_text = r.text
Read more in the Quickstart section of requests. I'll leave the 'position-separating' to you.
Hint: Use regex.
Using scrapy it's possible to read the content from the link as a stream and process it without saving to file. Documentation for scrapy is here
There's also a related question here: How do you open a file stream for reading using Scrapy?
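Once you have the page text, the position-separating step can be done with plain string slicing following the rules in the question (Python slices are 0-based, so "positions 3 to 16" becomes line[2:16]). A rough sketch, assuming the dict_val dictionary of links from the question and a hypothetical output file name:

import csv
import requests

def parse_lines(text):
    for line in text.splitlines():
        if line.startswith("01"):
            # company record: registration code, then company name
            yield ["01", line[2:16], line[16:166].strip(), "", "", ""]
        elif line.startswith("02"):
            # partner record: registration code, member identifier,
            # two more member codes, then the partner's name
            yield ["02", line[2:16], line[16:17], line[17:31].strip(),
                   line[31:33], line[33:183].strip()]

with open("empresas.csv", "w", newline="") as out:   # hypothetical name
    writer = csv.writer(out)
    for chave, valor in dict_val.items():
        r = requests.get(valor)
        for row in parse_lines(r.text):
            writer.writerow([chave] + row)

The two record types carry different fields, so the first column tags each row with its type; adjust the column layout to whatever shape your CSV needs.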
I have a collection of text files that are of the form:
Sponsor : U of NC Charlotte
U N C C Station
Charlotte, NC 28223 704/597-2000
NSF Program : 1468 MANUFACTURING MACHINES & EQUIP
Fld Applictn: 0308000 Industrial Technology
56 Engineering-Mechanical
Program Ref : 9146,MANU,
Abstract :
9500390 Patterson This award supports a new concept in precision metrology,
the Extreme Ultraviolet Optics Measuring Machine (EUVOMM). The goals for this
system when used to measure optical surfaces are a diameter range of 250 mm
with a lateral accuracy of 3.3 nm rms, and a depth range of 7.5 mm w
There's more text above and below the snippet. I want to be able to do the following, for each text file:
store the NSF Program and Fld Applictn numbers in one list, and the associated text in another list
so, in the above example I want the following, for the i-th text file:
y_num[i] = 1468, 0308000, 56
y_txt[i] = MANUFACTURING MACHINES & EQUIP, Industrial Technology, Engineering-Mechanical
Is there a clean way to do this in Python? I prefer Python since I am using os.walk to parse all the text files stored in subdirectories.
file = open( "file","r")
for line in file.readlines():
if "NSF" in line:
values= line.split(":")
elif "Fld" in line:
values1 = line.split(":")
So values and values1 has the specific values which you are intetested
You can try something like:

yourtextlist = yourtext.split(':')
numbers = []
for slice in yourtextlist:
    l = slice.split()
    try:
        numbers.append(int(l[0]))
    except ValueError:
        pass
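Both snippets above miss the indented continuation lines (the "56 Engineering-Mechanical" line belongs to Fld Applictn). Here is a sketch that also collects those, assuming the labels shown in the snippet, that continuation lines are indented in the real files, and a hypothetical all_files list gathered with os.walk:

import re

def parse_nsf(path):
    nums, txts = [], []
    in_fld = False                     # inside a Fld Applictn block?
    with open(path) as f:
        for line in f:
            m = re.match(r"NSF Program\s*:\s*(\S+)\s+(.*)", line)
            if m:
                nums.append(m.group(1))
                txts.append(m.group(2).strip())
                in_fld = False
                continue
            m = re.match(r"Fld Applictn\s*:\s*(\S+)\s+(.*)", line)
            if m:
                nums.append(m.group(1))
                txts.append(m.group(2).strip())
                in_fld = True
                continue
            m = re.match(r"\s+(\d+)\s+(.*)", line) if in_fld else None
            if m:                      # an indented continuation line
                nums.append(m.group(1))
                txts.append(m.group(2).strip())
            elif line.strip():         # any other label ends the block
                in_fld = False
    return nums, txts

y_num, y_txt = [], []
for path in all_files:                 # hypothetical list from os.walk
    nums, txts = parse_nsf(path)
    y_num.append(nums)
    y_txt.append(txts)

For the snippet above this gives y_num[i] = ['1468', '0308000', '56'] and the three matching text strings in y_txt[i].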