How to write multiple lines in a single line? - python

How can I write multiple lines in a single line? My inputs are like this:
HOXC11
HOXC11, HOX3H, MGC4906
human, Homo sapiens
HOXB6
HOXB6, HOX2, HU-2, HOX2B, Hox-2.2
human, Homo sapiens
HOXB13
HOXB13
human, Homo sapiens
PAX5
PAX5, BSAP
human, Homo sapiens
I need to make it into a single line like this:
HOXC11 HOXC11, HOX3H, MGC4906 human, Homo sapiens
HOXB6 HOXB6, HOX2, HU-2, HOX2B, Hox-2.2 human, Homo sapiens
HOXB13 HOXB13 human, Homo sapiens

Assuming your input is from a file, let's call it homosapiens.txt, you can go from the specified input to the desired output as follows:
with open('homosapiens.txt', 'r') as f:
    for line in f:
        line = line.rstrip('\n')  # drop the trailing newline so the comparison below can match
        if line == 'human, Homo sapiens':
            print line  # this will print and go to a newline
        elif line:
            print line,  # the comma after line suppresses the newline (Python 2 syntax)
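For Python 3, where the trailing-comma form of print is not available, the same idea can be sketched by collecting lines until the terminating line is seen. This is only a sketch: the function name join_records and the inline sample list are illustrative assumptions, not part of the original answer.

```python
def join_records(lines, terminator="human, Homo sapiens"):
    """Join consecutive non-empty lines into one string per record."""
    records, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        current.append(line)
        if line == terminator:
            # the terminator closes the current record
            records.append(" ".join(current))
            current = []
    if current:
        # keep a trailing record that never reached the terminator
        records.append(" ".join(current))
    return records


# Hypothetical in-memory sample standing in for the lines of homosapiens.txt.
sample = [
    "HOXC11\n",
    "HOXC11, HOX3H, MGC4906\n",
    "human, Homo sapiens\n",
    "HOXB13\n",
    "HOXB13\n",
    "human, Homo sapiens\n",
]
for record in join_records(sample):
    print(record)
```

Passing an open file object instead of the list should work the same way, since both yield one line at a time.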

textInput = textInput.rstrip('\n')

parsing the xml file using pandas

So, I'm trying to extract information from this xml --
<bdb:getTargetByCompoundResponse xmlns:bdb="http://ws.bindingdb.org/xsd">
<bdb:smile>C[C#H]1[C#H](C)CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C#](C)(C(=O)O)[C##H]5CC[C#]43C)[C#H]12</bdb:smile>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1 AuxInfo=1/1/N:4,1,8,19,12,33,25,5,30,21,6,20,31,9,10,14,3,2,13,15,29,22,34,17,26,7,18,11,32,24,16,23,27,28/E:(33,34)/it:im/rA:34CC.oC.eCCCC.oCCCC.oCCCCOC.eC.eCCCC.oOC.oCCOOC.eCCC.eCC.e/rB:s1;s2;s3;s3;s5;s6;s7;s7;s9;s10;s11;s11;d-13;s14;d15;s15;s17;s18;s18;s20;s21;s22;s22;s24;s24;d26;s26;s18s24;s29;s30;s11s17s31;s32;s2s7s13;/rC:2.8737,-5.8026,0;2.1037,-7.1363,0;.5637,-7.1363,0;-.2063,-5.8026,0;-.2063,-8.47,0;.5637,-9.8037,0;2.1037,-9.8037,0;1.3337,-11.1374,0;2.8737,-11.1374,0;4.4137,-11.1374,0;5.1837,-9.8037,0;5.9537,-11.1374,0;4.4137,-8.47,0;5.1837,-7.1363,0;6.7237,-7.1363,0;7.4937,-5.8026,0;7.4937,-8.47,0;9.0337,-8.47,0;8.2637,-7.1363,0;9.8037,-7.1363,0;11.3437,-7.1363,0;12.1137,-8.47,0;13.6537,-8.47,0;11.3437,-9.8037,0;11.0763,-11.3203,0;12.7908,-10.3304,0;13.9705,-9.3405,0;13.0582,-11.847,0;9.8037,-9.8037,0;9.0337,-11.1374,0;7.4937,-11.1374,0;6.7237,-9.8037,0;6.8847,-11.3352,0;2.8737,-8.47,0;</bdb:inchi>
<bdb:hit>7</bdb:hit>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Polyunsaturated fatty acid 5-lipoxygenase</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>3000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prolyl endopeptidase</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>36320</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prostaglandin E synthase</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>3000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prostaglandin G/H synthase 1</bdb:target>
<bdb:species>Ovis aries (Sheep)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>>40000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Prostaglandin G/H synthase 2</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>>40000</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Tyrosine-protein phosphatase non-receptor type 1</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>8040</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
<bdb:affinities>
<bdb:monomerid>50241261</bdb:monomerid>
<bdb:inhibitor>BDBM50241261</bdb:inhibitor>
<bdb:target>Tyrosine-protein phosphatase non-receptor type 2</bdb:target>
<bdb:species>Homo sapiens (Human)</bdb:species>
<bdb:affinity_type>IC50</bdb:affinity_type>
<bdb:affinity>9450</bdb:affinity>
<bdb:smiles>C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C##](C)([C##H]5CC[C##]34C)C(O)=O)[C##H]2[C#H]1C</bdb:smiles>
<bdb:inchi>InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)19(23(26)18(17)2)16-20(31)24-27(4)12-10-22(32)30(7,25(33)34)21(27)9-13-29(24,28)6/h16-18,21-24,32H,8-15H2,1-7H3,(H,33,34)/t17-,18+,21-,22-,23+,24-,26-,27+,28-,29-,30-/m1/s1</bdb:inchi>
<bdb:tanimoto>1.00000</bdb:tanimoto>
</bdb:affinities>
</bdb:getTargetByCompoundResponse>
But I'm getting the following error-
xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.
I tried this code
smile = 'C[C#H]1[C#H](C)CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C##]5(C)CC[C##H](O)[C#](C)(C(=O)O)[C##H]5CC[C#]43C)[C#H]12'
api_binding = requests.get(f'https://bindingdb.org/axis2/services/BDBService/getTargetByCompound?smiles={smile}&cutoff=1')
df = pd.read_xml(api_binding.text, xpath = ".//bdb", namespaces = {"bdb":"https://ws.bindingdb.org/xsd"})
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
You can build the DataFrame from all the elements and then filter the result. For example:
df = pd.read_xml(str_xml, xpath = ".//bdb:*", namespaces = {"bdb":"http://ws.bindingdb.org/xsd"})
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
result
result would be:
3 Polyunsaturated fatty acid 5-lipoxygenase
7 None
13 Prolyl endopeptidase
17 None
If you remove xpath, you'll read it in almost as expected. We know from the output that there should be 7 rows (i.e. <bdb:hit> shows 7). The first three tags seem to contain only some metadata about what it returned:
<bdb:smile>
<bdb:inchi>
<bdb:hit>
These make up rows 0-2 of the dataframe.
Your actual data starts on row 3, or wherever we see the <bdb:affinities> tag. Based on this pattern, we can keep only the data where target is not missing. You can choose any other column where you know there will always be data, but in this case, we're going to stick with target.
df = pd.read_xml(api_binding.text, namespaces = {"bdb":"http://ws.bindingdb.org/xsd"}).dropna(subset=['target'])
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
Output:
3 Polyunsaturated fatty acid 5-lipoxygenase
4 Prolyl endopeptidase
5 Prostaglandin E synthase
7 Prostaglandin G/H synthase 2
8 Tyrosine-protein phosphatase non-receptor type 1
9 Tyrosine-protein phosphatase non-receptor type 2
There are a couple things going on here that I think the other answers missed, so I will try to give a complete overview and explanation.
In the sample code you've tried, you are querying xpath=".//bdb", which looks for any element beneath the root node with a tag of "bdb". There are no such elements (whose tag is just bdb), so you get ValueError: xpath does not return any nodes. simpleApp's answer correctly identifies this issue, and suggests adding an asterisk (i.e. xpath=".//bdb:*") to tell xpath to look for any element in the bdb: namespace. However, this returns all of the various nested elements in the XML, since in your case everything is prefixed with bdb:, so we lose the structural information contained in the hierarchy of the XML content, leading to a mess of a DataFrame.
Stu Sztukowski's answer suggests dropping the xpath kwarg completely, which will make pandas more or less do what you want, except then you get unwanted rows resulting from the <bdb:smile/>, <bdb:inchi/>, and <bdb:hit/> elements at the top that you have to clean up yourself after you create the DataFrame.
Also, you need to use http in your namespaces dictionary, not https, since the namespace mapping for bdb (shown in first line/root of your XML content) is http://ws.bindingdb.org/xsd, with no s.
A clean solution that will get you exactly the DataFrame that you want is:
df = pd.read_xml(api_binding.text, xpath = "./bdb:affinities",
                 namespaces = {"bdb": "http://ws.bindingdb.org/xsd"})
result = df.loc[df["species"] == "Homo sapiens (Human)", "target"]
You can use either xpath = "./bdb:affinities" or xpath = ".//bdb:affinities" (single or double slash) since in this case all of the bdb:affinities elements are right below the root anyway, so the two syntaxes do the same thing.
This solution yields a df that looks like:
monomerid inhibitor target species affinity_type affinity smiles inchi tanimoto
0 50241261 BDBM50241261 Polyunsaturated fatty acid 5-lipoxygenase Homo sapiens (Human) IC50 3000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
1 50241261 BDBM50241261 Prolyl endopeptidase Homo sapiens (Human) IC50 36320 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
2 50241261 BDBM50241261 Prostaglandin E synthase Homo sapiens (Human) IC50 3000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
3 50241261 BDBM50241261 Prostaglandin G/H synthase 1 Ovis aries (Sheep) IC50 >40000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
4 50241261 BDBM50241261 Prostaglandin G/H synthase 2 Homo sapiens (Human) IC50 >40000 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
5 50241261 BDBM50241261 Tyrosine-protein phosphatase non-receptor type 1 Homo sapiens (Human) IC50 8040 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
6 50241261 BDBM50241261 Tyrosine-protein phosphatase non-receptor type 2 Homo sapiens (Human) IC50 9450 C[C##H]1CC[C#]2(C)CC[C#]3(C)C(=CC(=O)[C##H]4[C... InChI=1S/C30H46O4/c1-17-8-11-26(3)14-15-28(5)1... 1.0
And a result that looks like:
0 Polyunsaturated fatty acid 5-lipoxygenase
1 Prolyl endopeptidase
2 Prostaglandin E synthase
4 Prostaglandin G/H synthase 2
5 Tyrosine-protein phosphatase non-receptor type 1
6 Tyrosine-protein phosphatase non-receptor type 2
Name: target, dtype: object
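To see the effect of the namespace URI without pandas or a network call, here is a minimal sketch using the standard library's xml.etree.ElementTree. The XML string is a trimmed-down, hypothetical stand-in for the BindingDB response (two affinities entries only), and the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for the response shown in the question (an assumption).
xml_text = """<bdb:getTargetByCompoundResponse xmlns:bdb="http://ws.bindingdb.org/xsd">
  <bdb:hit>2</bdb:hit>
  <bdb:affinities>
    <bdb:target>Prolyl endopeptidase</bdb:target>
    <bdb:species>Homo sapiens (Human)</bdb:species>
  </bdb:affinities>
  <bdb:affinities>
    <bdb:target>Prostaglandin G/H synthase 1</bdb:target>
    <bdb:species>Ovis aries (Sheep)</bdb:species>
  </bdb:affinities>
</bdb:getTargetByCompoundResponse>"""

root = ET.fromstring(xml_text)
good_ns = {"bdb": "http://ws.bindingdb.org/xsd"}   # matches the xmlns on the root
bad_ns = {"bdb": "https://ws.bindingdb.org/xsd"}   # extra "s": a different namespace

rows = root.findall("./bdb:affinities", good_ns)
missing = root.findall("./bdb:affinities", bad_ns)

# Keep only the targets whose species is human.
human_targets = [
    row.findtext("bdb:target", namespaces=good_ns)
    for row in rows
    if row.findtext("bdb:species", namespaces=good_ns) == "Homo sapiens (Human)"
]
print(len(rows), len(missing), human_targets)
```

The prefix in the dictionary is only a local alias; what has to match the document is the URI it maps to, which is why the https variant finds nothing.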

Removing whitespaces using regex python

I am trying to amend each line of a file to remove any parts beginning with the character '(' or containing a number/character in square brackets i.e.'[2]':
f = open('/Users/name/Desktop/university_towns.txt',"r")
listed = []
import re
for i in f.readlines():
    if i.find(r'\(.*?\)\n'):
        here = re.sub(r'\(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \(.*?\)\n'):
        here = re.sub(r' \(.*?\)\[.*?\]\n', "", i)
        listed.append(here)
    elif i.find(r' \[.*?\]\n'):
        here = re.sub(r' \[.*?\]\n', "", i)
        listed.append(here)
    else:
        here = re.sub(r'\[.*?\]\n', "", i)
        listed.append(here)
A sample of my input data:
Platteville (University of Wisconsin–Platteville)[2]
River Falls (University of Wisconsin–River Falls)[2]
Stevens Point (University of Wisconsin–Stevens Point)[2]
Waukesha (Carroll University)
Whitewater (University of Wisconsin–Whitewater)[2]
Wyoming[edit]
Laramie (University of Wyoming)[5]
A sample of my output data:
Platteville
River Falls
Stevens Point
Waukesha (Carroll University)
Whitewater
Wyoming[edit]
Laramie
However, I do not want the parts such as '(Carroll University)' or '[edit]'.
How can I amend my formula?
I would be so grateful if anyone could give me any advice!
You can do:
import re
with open(ur_file) as f_in:
    for line in f_in:
        if m:=re.search(r'^([^([]+)', line): # Python 3.8+
            print(m.group(1))
If your Python is older than 3.8 and lacks the walrus operator:
with open(ur_file) as f_in:
    for line in f_in:
        m=re.search(r'^([^([]+)', line)
        if m:
            print(m.group(1))
Prints:
Platteville
River Falls
Stevens Point
Waukesha
Whitewater
Wyoming
Laramie
The regex explained:
^([^([]+)
^            start of the line
 ^      ^    capture group
  ^   ^      character class
   ^         class of characters OTHER THAN ( and [
       ^     + means one or more
Here is the regex on Regex101
Use this RegEx instead:
\(.*\)|\[.*\]
Like so:
re.sub(r'\(.*\)|\[.*\]', '', i)
This will substitute anything in parentheses (\(.*\)) or (|) anything in square brackets (\[.*\]) with an empty string.
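As a small runnable sketch of the substitution approach, the variant below uses non-greedy quantifiers and also consumes the whitespace in front of each bracketed part, so no trailing space is left behind. The pattern and the sample list are assumptions chosen for illustration, not the exact regex from the answer above.

```python
import re

# A few lines taken from the sample input in the question.
lines = [
    "Waukesha (Carroll University)",
    "Laramie (University of Wyoming)[5]",
    "Wyoming[edit]",
]

# Optional leading whitespace, then either (...) or [...], matched non-greedily.
pattern = re.compile(r"\s*(\(.*?\)|\[.*?\])")

cleaned = [pattern.sub("", line).strip() for line in lines]
print(cleaned)
```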
If you are after a vectorised solution, which is often faster and more readable than a loop, try:
Data
df=pd.DataFrame({'text':['Platteville (University of Wisconsin–Platteville)[2]','River Falls (University of Wisconsin–River Falls)[2]','Stevens Point (University of Wisconsin–Stevens Point)[2]','Waukesha (Carroll University)','Whitewater (University of Wisconsin–Whitewater)[2]','Wyoming[edit]','Wyoming[edit]']})
Regex extract
df['name']=df.text.str.extract(r'([A-Za-z\s+]+(?=\(|\[))')
Regex Breakdown
[A-Za-z\s+]+ captures one or more uppercase letters, lowercase letters, whitespace characters, or literal + signs
(?=\(|\[) is a lookahead requiring that the captured text is immediately followed by the special character ( or the special character [

How do I eliminate all the parenthesis in a txt files?

I have a txt file, single COLUMN, taken from excel, of the following type:
AMANDA (LOUDLY SPEAKING)
JEFF
STEVEN (TEASINGLY)
AMANDA
DOC BRIAN GREEN
As output I want:
AMANDA
JEFF
STEVEN
AMANDA
DOC BRIAN GREEN
I tried a for loop over the whole column, and then:
if (str[i] == '('):
    return str.split('(')
but it's clearly not working.
Do you have any possible solution? I would then need an output file as my original txt, so with each name for each line in a single column.
Thanks everyone!
(I am using PyCharm 3.2)
I'd use a regex in this situation. The . matches any character, the *? selects zero or more of them (as few as possible), and the escaped \( and \) check that the match sits between parentheses.
import re
with open("mytext.txt","r") as fi, open("out.txt", "w") as fo:
    for line in fi:
        # \s* also removes the whitespace in front of the opening parenthesis
        fo.write(re.sub(r"\s*\(.*?\)", "", line))
You can split the string into a list using a regular expression that matches either everything in parentheses or a full word, remove all elements from the list which contain parentheses, and then join the list into a string again. The advantage is that there will be no double spaces in the result string where a word in parentheses was removed.
import re
text = "AMANDA (LOUDLY SPEAKING) JEFF STEVEN (TEASINGLY) AMANDA DOC BRIAN GREEN"
words = re.findall(r"\(.*?\)|[^\s]+", text)
print(" ".join([x for x in words if "(" not in x]))
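Since the question asks for an output file with one name per line, here is a hedged sketch of the line-by-line variant. It uses io.StringIO objects as stand-ins for the real input and output files, and the function name strip_parentheticals is an illustrative assumption.

```python
import io
import re

# Optional whitespace followed by a parenthesised remark, matched non-greedily.
PAREN = re.compile(r"\s*\(.*?\)")


def strip_parentheticals(src, dst):
    """Copy lines from src to dst with parenthesised remarks removed."""
    for line in src:
        cleaned = PAREN.sub("", line.rstrip("\n")).rstrip()
        dst.write(cleaned + "\n")


# In-memory stand-ins for the input and output text files (an assumption).
src = io.StringIO("AMANDA (LOUDLY SPEAKING)\nJEFF\nSTEVEN (TEASINGLY)\nDOC BRIAN GREEN\n")
dst = io.StringIO()
strip_parentheticals(src, dst)
print(dst.getvalue(), end="")
```

With real files, the two StringIO objects would be replaced by handles from open(..., "r") and open(..., "w").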

splitting strings after certain characters

I would like to split my string after certain characters are found.
identifier = filecontent_id[0].split("SV=")[0]
I have this, but this "deletes" everything before "SV=" and I would like for it to "delete" everything 1 character after it. For example, it would "delete" everything after "SV=1" but I did not put 1 there because it doesn't always equal 1. The string is:
>tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ
and I am trying to only get:
>tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1
A regex might be better, but the below works:
SPLIT="SV="
line=">tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ"
head, tail = line.split(SPLIT, 1)  # split once, at the first "SV="
print(head + SPLIT + tail[0])  # keep exactly one character after "SV="
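The regex alternative mentioned above could look roughly like this. It is a sketch that assumes the value after "SV=" is made of digits (possibly more than one), which goes slightly beyond the single character the question asks for.

```python
import re

line = ">tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ"

# Everything from the start of the line up to "SV=" plus the digits that follow.
match = re.search(r"^.*?SV=\d+", line)

# Fall back to the unchanged line if no "SV=<digits>" is present.
header = match.group(0) if match else line
print(header)
```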

How would I get numbers from the beginning of a line of text, split them and print them out

Here is my dilemma: I'm writing an application in Python that will allow me to search a flat file (KJV bible.txt) for particular strings, and return the line number, book, and string searched for. However, I would also like to return the chapter and verse in which the string was found. That calls for me going to the beginning of the line and getting the chapter and verse number. I'm a Python neophyte, and am currently still reading through the Python tutorial by Guido van Rossum. This is something I'm trying to accomplish for a bible study group; something portable that can run in the cmd module almost anywhere. I appreciate any help ... Thanks. Below is an excerpt from an example of a Bible chapter:
Daniel
1:1 In the third year of the reign of Jehoiakim king of Judah came
Nebuchadnezzar king of Babylon unto Jerusalem, and besieged it.
Say I searched for 'Jehoiakim' and one of the search results was the first line above. I would like to go to the numbers that precede this line (in this case 1:1) and get the chapter (1) and verse (1) and print them to the screen.
1:2 And the Lord gave Jehoiakim king of Judah into his hand, with part
of the vessels of the house of God: which he carried into the land of
Shinar to the house of his god; and he brought the vessels into the
treasure house of his god.
Code:
import os
import sys
import re
word_search = raw_input(r'Enter a word to search: ')
book = open("KJV.txt", "r")
first_lines = {36: 'Genesis', 4812: 'Exodus', 8867: 'Leviticus', 11749: 'Numbers', 15718: 'Deuteronomy',
               18909: 'Joshua', 21070: 'Judges', 23340: 'Ruth', 23651: 'I Samuel', 26641: 'II Samuel',
               29094: 'I Kings', 31990: 'II Kings', 34706: 'I Chronicles', 37378: 'II Chronicles',
               40502: 'Ezra', 41418: 'Nehemiah', 42710: 'Esther', 43352: 'Job', 45937: 'Psalms', 53537: 'Proverbs',
               56015: 'Ecclesiastes', 56711: 'The Song of Solomon', 57076: 'Isaiah', 61550: 'Jeremiah',
               66480: 'Lamentations', 66961: 'Ezekiel', 71548: 'Daniel' }
for ln, line in enumerate(book):
    if word_search in line:
        first_line = max(l for l in first_lines if l < ln)
        bibook = first_lines[first_line]
        template = "\nLine: {0}\nString: {1}\nBook: {2}\n"
        output = template.format(ln, line, bibook)
        print output
Do a single split on whitespace, then split on :.
passage, text = line.split(None, 1)
chapter, verse = passage.split(':')
Use a regular expression: r'(\d+):(\d+)'
After finding a match (match = re.match(r'(\d+):(\d+)', line)), you can find the chapter in group 1 (chapter = match.group(1)) and the verse in group 2.
Use this code:
for ln, line in enumerate(book):
    match = re.match(r'(\d+):(\d+)', line)
    if match:
        # remember the most recent chapter and verse for continuation lines
        chapter, verse = match.group(1), match.group(2)
    if word_search in line:
        ...
        print 'Book %s %s:%s ...%s...' % (bibook, chapter, verse, line)
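The idea of remembering the last chapter and verse seen, so that hits on continuation lines still get a reference, can be sketched in Python 3 as follows. The function name search_verses and the in-memory sample are illustrative assumptions, and the book lookup from the question is left out to keep the sketch short.

```python
import re

# A verse line starts with "<chapter>:<verse>" followed by whitespace.
VERSE_START = re.compile(r"(\d+):(\d+)\s")


def search_verses(lines, word):
    """Return (line_number, chapter, verse) for every line containing word."""
    chapter = verse = None
    hits = []
    for line_number, line in enumerate(lines, start=1):
        match = VERSE_START.match(line)
        if match:
            # a new verse begins here; continuation lines keep these values
            chapter, verse = int(match.group(1)), int(match.group(2))
        if word in line:
            hits.append((line_number, chapter, verse))
    return hits


# Hypothetical excerpt standing in for the lines of KJV.txt.
sample = [
    "Daniel",
    "1:1 In the third year of the reign of Jehoiakim king of Judah came",
    "Nebuchadnezzar king of Babylon unto Jerusalem, and besieged it.",
    "1:2 And the Lord gave Jehoiakim king of Judah into his hand, with part",
    "of the vessels of the house of God: which he carried into the land of",
]
print(search_verses(sample, "Jehoiakim"))
print(search_verses(sample, "vessels"))
```

The second search shows the point of carrying the values forward: "vessels" only appears on a continuation line, yet it is still reported under chapter 1, verse 2.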
