How to extract certain paragraph from text file - python

def extract_book_info(self):
    books_info = []
    for file in os.listdir(self.book_folder_path):
        title = "None"
        author = "None"
        release_date = "None"
        last_update_date = "None"
        language = "None"
        producer = "None"
        with open(self.book_folder_path + file, 'r', encoding = 'utf-8') as content:
            book_info = content.readlines()
        for lines in book_info:
            if lines.startswith('Title'):
                title = lines.strip().split(': ')
            elif lines.startswith('Author'):
                try:
                    author = lines.strip().split(': ')
                except IndexError:
                    author = 'Empty'
            elif lines.startswith('Release date'):
                release_date = lines.strip().split(': ')
            elif lines.startswith('Last updated'):
                last_update_date = lines.strip().split(': ')
            elif lines.startswith('Produce by'):
                producer = lines.strip().split(': ')
            elif lines.startswith('Language'):
                language = lines.strip().split(': ')
            elif lines.startswith('***'):
                pass
        books_info.append(Book(title, author, release_date, last_update_date, producer, language, self.book_folder_path))
    with open(self.book_info_path, 'w', encoding="utf-8") as book_file:
        for book_info in books_info:
            book_file.write(book_info.__str__() + "\n")
I am using this code to try to extract the book title, author, release_date,
last_update_date, language, producer and book_path.
This is the output I get:
['Title', 'The Adventures of Sherlock Holmes'];;;['Author', 'Arthur Conan Doyle'];;;None;;;None;;;None;;;['Language', 'English'];;;data/books_data/;;;
This is the output I should get, and I would like to know which method I should use to achieve it:
The Adventures of Sherlock Holmes;;;Arthur Conan Doyle;;;November29,2002;;;May20,2019;;;English;;;
This is an example of the input:
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover

str.split gives you a list as a result, but you are assigning that whole list to a variable that should hold a single value.
'Title: Sherlock Holmes'.split(': ') # => ['Title', 'Sherlock Holmes']
From what I can gather from your requirement, you want the second element of the split every time. You can get it by unpacking:
...
for lines in book_info:
    if lines.startswith('Author'):
        _, author = lines.strip().split(': ')
    elif...
Be careful: the unpacking raises a ValueError if the split does not produce exactly two elements, and indexing the result with [1] raises an IndexError if there is no second element. (That is presumably what the try around the author field in your code is meant to guard against.)
Also, avoid calling __str__ directly. That's what the str() function calls for you anyway. Use that instead.
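A minimal sketch of this idea, assuming the header lines look like the sample input in the question. The function name parse_header and the dictionary layout are hypothetical, not part of the original code. Note that startswith is case-sensitive, so prefixes such as 'Release date' or 'Produce by' never match 'Release Date' or 'Produced by' in the sample, which is why those fields stay "None" in the question's output. Splitting with maxsplit=1 keeps a value intact even if it contains a further ': '.

```python
def parse_header(lines):
    """Collect 'Key: value' pairs from the lines before the '***' marker."""
    info = {}
    for line in lines:
        line = line.strip()
        if line.startswith('***'):
            break  # the header ends where the book text starts
        # Lines such as '[Most recently updated: ...]' are wrapped in brackets
        if line.startswith('[') and line.endswith(']'):
            line = line[1:-1]
        # maxsplit=1 splits at the first ': ' only
        parts = line.split(': ', 1)
        if len(parts) == 2:
            key, value = parts
            info[key] = value
    return info


sample = [
    'Title: The Adventures of Sherlock Holmes\n',
    'Author: Arthur Conan Doyle\n',
    'Release Date: November 29, 2002 [eBook #1661]\n',
    '[Most recently updated: May 20, 2019]\n',
    'Language: English\n',
    '*** START OF THE PROJECT GUTENBERG EBOOK ***\n',
]
header = parse_header(sample)
print(header['Title'])
print(header.get('Produced by', 'None'))
```

Looking fields up with header.get(key, 'None') keeps the "None" placeholder from the question for missing fields. Trimming the "[eBook #1661]" suffix from the release date is left out of this sketch.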

Related

How to do search by option to search from files? [closed]

I'm a beginner trying to build a simple library management system using Python. Users can search a book from a list of many books stored in a text file. Here is an example of what is in the text file:
Author: J.K Rowling
Title: Harry Potter and the Deathly Hollow
Keywords: xxxx
Published by: xxxx
Published year: xxxx
Author: Stephen King
Title: xxxx
Keywords: xxxx
Published by: xxxx
Published year: xxxx
Author: J.K Rowling
Title: Harry Potter and the Half Blood Prince
Keywords: xxxx
Published by: xxxx
Published year: xxxx
This is where it gets difficult for me. There is a Search by Author option for the user to search books. When the user searches for an author (e.g. J.K Rowling), I want the program to output all of the related components (Author, Title, Keywords, Published by, Published year) for every matching book (in this case, there are two J.K Rowling books). This is the last piece of the program, and I am having a lot of difficulty with it. Please help me, and thank you all in advance.
Is it possible for you to store the data as a JSON file instead of a plain text file? It could be a better alternative, since you could easily access all the values by key and search through them as well.
{
    "Harry Potter and the Deathly Hollow": {
        "Author": "J.K Rowling",
        "Keywords": "xxxx",
        "Published by": "xxxx",
        "Published year": "xxxx"
    },
    "Example 2": {
        "Author": "Stephen King",
        "Keywords": "xxxx",
        "Published by": "xxxx",
        "Published year": "xxxx"
    }
}
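As a rough sketch of how such a JSON layout could be searched, assuming titles are the keys as in the example. The data is inlined as a string here for illustration, and the function name books_by_author is hypothetical; in practice the content would be read from a file with json.load.

```python
import json

raw = """
{
    "Harry Potter and the Deathly Hollow": {"Author": "J.K Rowling", "Keywords": "xxxx"},
    "Example 2": {"Author": "Stephen King", "Keywords": "xxxx"},
    "Harry Potter and the Half Blood Prince": {"Author": "J.K Rowling", "Keywords": "xxxx"}
}
"""


def books_by_author(library, author):
    """Return {title: details} for every book whose author matches, ignoring case."""
    wanted = author.strip().lower()
    return {
        title: details
        for title, details in library.items()
        if details.get("Author", "").lower() == wanted
    }


library = json.loads(raw)
matches = books_by_author(library, "j.k rowling")
for title, details in matches.items():
    print(title, details)
```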
You can iterate through the lines of the text file like this:
with open(r"path\to\text_file.txt", "r") as books:
    lines = books.readlines()
for index in range(len(lines)):
    line = lines[index]
Now, get the author of each book by splitting the line on the ":" character and testing whether the first part == "Author". Then take the second part of the split string and strip the "\n" [newline] and " " characters from both ends, so that no extra whitespace interferes with the search. I would also recommend lowercasing both the author name and the search query so that capitalisation does not matter. Test whether the result is equal to the search query:
    if line.split(":")[0] == "Author" and\
       line.split(":")[1].strip("\n ").lower() == search_query.lower():
Then, inside this if block, print out all the required information about this book.
Completed code:
search_query = "J.K Rowling"
with open(r"books.txt", "r") as books:
    lines = books.readlines()
for index in range(len(lines)):
    line = lines[index]
    if line.split(":")[0] == "Author" and line.split(":")[1].strip("\n ").lower() == search_query.lower():
        # Print the Author line and the four lines of details that follow it
        print(*lines[index: index + 5], sep="")
Generally, a lot of programming problems can be broken down into a three-step process:
1. Read the input into an internal data structure
2. Do processing as required
3. Write the output
This problem seems like quite a good fit for that pattern:
1. In the first part, read the text file into an in-memory list of either dictionaries or objects (depending on what's expected by your course)
2. In the second part, search the in-memory list according to the search criteria; this will result in a shorter list containing the results
3. In the third part, print out the results neatly
It would be reasonable to put these into three separate functions, and to attack each of them separately.
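The three steps could be sketched roughly as follows. The function names read_books, search_by_author and format_book are hypothetical, and the sketch assumes that every record in the file starts with its Author line, as in the sample from the question. The sample text is inlined so the example runs on its own; with a real file, the lines would come from open(...).

```python
def read_books(lines):
    """Step 1: turn 'Key: value' lines into a list of dictionaries, one per book."""
    books = []
    for line in lines:
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        key, value = key.strip(), value.strip()
        if key == "Author":
            books.append({})  # assumption: an Author line starts a new record
        if books:
            books[-1][key] = value
    return books


def search_by_author(books, author):
    """Step 2: keep only the books whose author matches, ignoring case."""
    wanted = author.strip().lower()
    return [book for book in books if book.get("Author", "").lower() == wanted]


def format_book(book):
    """Step 3: render one book as a block of 'Key: value' lines."""
    return "\n".join(f"{key}: {value}" for key, value in book.items())


sample = """Author: J.K Rowling
Title: Harry Potter and the Deathly Hollow
Keywords: xxxx
Author: Stephen King
Title: xxxx
Author: J.K Rowling
Title: Harry Potter and the Half Blood Prince
""".splitlines()

results = search_by_author(read_books(sample), "J.K Rowling")
for book in results:
    print(format_book(book))
    print()
```

Keeping the steps separate means the search can later be extended to other fields (for example Title or Keywords) without touching the reading or printing code.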
# Read the details from the file, e.g. books.txt
with open("books.txt", "r") as fd:
    lines = fd.read()
# Split the text on the word "Author". The split removes that word, so add it back to each piece. The result is the bookdetails list.
bookdetails = ["Author" + line for line in lines.split("Author")[1:]]
# Author name to search for
authorName = "J.K Rowling"
# Keep the books by the given author, and split each one into a list of its lines.
result = [book.splitlines() for book in bookdetails if "Author: " + authorName in book]
print(result)
If you will always receive this format of the file and you want to transform it into a dictionary:
def read_author(file):
    data = dict()
    with open(file, "r") as f:
        li = f.read().split("\n")
    for e in li:
        if ":" in e:
            # Split on the first colon only and trim the surrounding spaces
            key, value = e.split(":", 1)
            data[key.strip()] = value.strip()
    return data['Author']
Note: The text file sometimes has empty lines so I check if the line contains the colon (:) before transforming it into a dict.
Then if you want a more generic method you can pass the KEY of the element you want:
def read_info(file, key):
    data = dict()
    with open(file, "r") as f:
        li = f.read().split("\n")
    for e in li:
        if ":" in e:
            # Split on the first colon only and trim the surrounding spaces
            k, value = e.split(":", 1)
            data[k.strip()] = value.strip()
    return data[key]
Separating the reading as follows makes the code more modular:
class BookInfo:
    def __init__(self, file) -> None:
        self.file = file
        self.data = None
    def __read_file(self):
        # Read and parse the file only once, on first use
        if self.data is None:
            with open(self.file, "r") as f:
                li = f.read().split("\n")
            self.data = dict()
            for e in li:
                if ":" in e:
                    key, value = e.split(":", 1)
                    self.data[key.strip()] = value.strip()
    def read_author(self):
        self.__read_file()
        return self.data['Author']
Then create objects for each book:
info = BookInfo("book.txt")
print(info.read_author())

Split and save text string on scrapy

I need to split a substring off a string. This is the exact source text:
Article published on: Tutorial
I want to delete "Article published on:" and leave only
Tutorial
so that I can save it.
I tried with:
category = items[1]
category.split('Article published on:','')
and with
for p in articles:
    bodytext = p.xpath('.//text()').extract()
    joined_text = ''
    # loop in categories
    for each_text in text:
        stripped_text = each_text.strip()
        if stripped_text:
            # all the categories together
            joined_text += ' ' + stripped_text
    joined_text = joined_text.split('Article published on:','')
    items.append(joined_text)
if not is_phrase:
    title = items[0]
    category = items[1]
    print('title = ', title)
    print('category = ', category)
and this doesn't work. What am I missing?
The error with this code:
TypeError: 'str' object cannot be interpreted as an integer
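The TypeError comes from the second argument: str.split expects an integer there (maxsplit), not a string. It seems that you meant to use replace instead of split. You probably also forgot to assign the result, which is needed because strings are immutable:
category = category.replace('Article published on:', '').strip()
split also works, as long as you split on the colon and take the second part:
category = category.split(':')[1].strip()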
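A small self-contained sketch comparing the two approaches on the sample text from the question. The variable names are chosen for illustration only.

```python
text = 'Article published on: Tutorial'
prefix = 'Article published on:'

# replace() removes the prefix; strip() drops the space left behind
by_replace = text.replace(prefix, '').strip()

# split() with maxsplit=1 cuts at the first colon only
by_split = text.split(':', 1)[1].strip()

# The second argument of str.split must be an integer (maxsplit),
# so passing '' raises the TypeError from the question
try:
    text.split(prefix, '')
    raised = False
except TypeError:
    raised = True

print(by_replace)
print(by_split)
print(raised)
```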

Trying to format text when pulling from webpage HTML

I've created a basic counter for words in a song, but am having trouble formatting the album title and artist name from a given page on this lyrics website. Here's an example of what I am focused on:
I want to format it in this way:
Album Title: [Album Title] (Release_year)
Artist: [Artist Name]
I'm running into two problems:
The album title isn't enclosed in its own tag, so if I call the h1 tag I get both the album name, release year and artist name. How do I call them separately, or how do I break them up when calling them?
The album name has two blank lines and two blank spaces included in the string. How do I get rid of them? The release year prints right next to the album title, which is exactly what I'm looking for, but I can't get the album title to format properly.
This is what I currently have:
song_artist = soup.find("a",{"class":"artist"}).get_text()
album_title = soup.find("h1",{"class":"album_name"}).get_text()
print("Album Title: " + str(album_title))
print("Song Artist: " + str(song_artist.title()))
which produces:
Thank you!!
album_title = soup.find("h1",{"class":"album_name"}).find(text=True).strip()
album_year = soup.find("span",{"class":"release_year"}).get_text().strip()
print('Album Title: {} {}'.format(album_title, album_year))

XLRD/ Entrez: Search through Pubmed and extract the counts

I am working on a project that requires me to search through pubmed using inputs from an Excel spreadsheet and print counts of the results. I have been using xlrd and entrez to do this job. Here is what I have tried.
I need to search through pubmed using the name of the author, his/her medical school, a range of years, and his/her mentor's name, which are all in an Excel spreadsheet. I have used xlrd to turn each column with the required information into lists of strings.
import xlrd
sheet = xlrd.open_workbook("HEENT.xlsx").sheet_by_index(0)
med_name = []
for row in sheet.col(2):
    med_name.append(row)
med_school = []
for row in sheet.col(3):
    med_school.append(row)
mentor = []
for row in sheet.col(9):
    mentor.append(row)
I have managed to print the counts of my specific queries using Entrez.
from Bio import Entrez
Entrez.email = "your@email.edu"
handle = Entrez.egquery(term="Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) ")
handle_1 = Entrez.egquery(term = "Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) AND Leoard P. Byk")
handle_2 = Entrez.egquery(term = "Jennifer Runch AND ((2012[Date - Publication] : 2017[Date - Publication])) AND Southern Illinois University School of Medicine")
record = Entrez.read(handle)
record_1 = Entrez.read(handle_1)
record_2 = Entrez.read(handle_2)
pubmed_count = []
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        pubmed_count.append(row["Count"])
for row in record_1["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        pubmed_count.append(row["Count"])
for row in record_2["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        pubmed_count.append(row["Count"])
print(pubmed_count)
>>>['3', '0', '0']
The problem is that I need to replace the student name ("Jennifer Runch") with the next student name in the list of student names("med_name"), the medical school with the next school, and the current mentor's name with the next mentor's name from the list.
I think I should write a for loop after declaring my email to pubmed, but I am not sure how to link the two blocks of code together. Does anyone know of an efficient way to connect the two blocks of code or know how to do this with a more efficient way than the one I have tried?
Thank you!
You got most of the code in place. It just needed to be modified slightly.
Assuming your table looks like this:
Jennifer Bunch |Southern Illinois University School of Medicine|Leonard P. Rybak
Philipp Robinson|Stanford University School of Medicine |Roger Kornberg
you could use the following code
import xlrd
from Bio import Entrez
Entrez.email = "your@email.edu"
# Note: xlrd 2.0 and later read only .xls files; an .xlsx file needs an older xlrd or another reader
sheet = xlrd.open_workbook("HEENT.xlsx").sheet_by_index(0)
# Each entry holds [student name, medical school, mentor] for one row of the table
search_terms = list()
for row in range(0, sheet.nrows):
    search_terms.append([sheet.cell_value(row, 0), sheet.cell_value(row,1), sheet.cell_value(row, 2)])
pubmed_counts = list()
for search_term in search_terms:
    handle = Entrez.egquery(term="{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) ".format(search_term[0]))
    handle_1 = Entrez.egquery(term = "{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[2]))
    handle_2 = Entrez.egquery(term = "{0} AND ((2012[Date - Publication] : 2017[Date - Publication])) AND {1}".format(search_term[0], search_term[1]))
    record = Entrez.read(handle)
    record_1 = Entrez.read(handle_1)
    record_2 = Entrez.read(handle_2)
    # One fixed slot per query, so an empty result cannot shift the positions
    pubmed_count = ['', '', '']
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[0] = row["Count"]
    for row in record_1["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[1] = row["Count"]
    for row in record_2["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            pubmed_count[2] = row["Count"]
    print(pubmed_count)
    pubmed_counts.append(pubmed_count)
Output
['3', '0', '0']
['1', '0', '0']
The only required modification is to fill the queries in from variables using format.
Some other modifications which are not necessary but might be helpful:
- loop over the Excel sheet only once
- store the pubmed_count in a predefined list, because if values come back empty the size of the output will vary, making it hard to guess which value belongs to which query
- everything could be optimized and prettified even further, e.g. by storing the queries in a list and looping over them, which would give less code repetition, but for now it does the job.
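The idea of storing the queries in a list and looping over them could be sketched like this. The names build_queries, count_row and count_query are hypothetical; count_query stands for a function that would wrap the Entrez.egquery and Entrez.read calls from the code above and return the pubmed count for one query string. A stand-in is used here so that the sketch runs without network access.

```python
DATE_RANGE = "((2012[Date - Publication] : 2017[Date - Publication]))"


def build_queries(name, school, mentor):
    """Return the three query strings for one spreadsheet row."""
    base = "{0} AND {1}".format(name, DATE_RANGE)
    return [
        base,                                # student only
        "{0} AND {1}".format(base, mentor),  # student and mentor
        "{0} AND {1}".format(base, school),  # student and medical school
    ]


def count_row(row, count_query):
    """Run every query for one row through count_query, keeping the order."""
    return [count_query(query) for query in build_queries(*row)]


def fake_count(query):
    """Stand-in for the real PubMed lookup; always reports zero results."""
    return "0"


row = ("Jennifer Bunch",
       "Southern Illinois University School of Medicine",
       "Leonard P. Rybak")
counts = count_row(row, fake_count)
print(counts)
```

Because the result list is built in the same order as the queries, each position still corresponds to a known query, which keeps the benefit of the predefined list.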

Removed the default content in nested expression

I am using the Pyparsing module and its nestedExpr function.
I want to specify a delimiter for the content argument of nestedExpr instead of the default whitespace-delimited content.
Suppose I have a text such as the following:
text = "{{Infobox | birth_date = {{birth date and age|mf=yes|1981|1|31}}| birth_place = ((Memphis, Tennessee|Memphis)), ((Tennessee)), U.S.| instrument = ((Beatboxing)), guitar, keyboards, vocalsprint expr.parse| genre = ((Pop music|Pop)), ((contemporary R&B|R&B))| occupation = Actor, businessman, record producer, singer| years_active = 1992–present| label = ((Jive Records|Jive)), ((RCA Records|RCA)), ((Zomba Group of Companies|Zomba))| website = {{URL|xyz.com|Official website}} }}"
When I call nestedExpr('{{','}}').parseString(text), I need the output to be the following list:
['Infobox | birth_date =' ,['birth date and age|mf=yes|1981|1|31'],'| birth_place = ((Memphis, Tennessee|Memphis)), ((Tennessee)), U.S.| instrument = ((Beatboxing)), guitar, keyboards, vocalsprint expr.parse| genre = ((Pop music|Pop)), ((contemporary R&B|R&B))| occupation = Actor, businessman, record producer, singer| years_active = 1992–present| label = ((Jive Records|Jive)), ((RCA Records|RCA)), ((Zomba Group of Companies|Zomba))| website =',[ 'URL|xyz.com|Official website' ]]
How can I specify ',' or '|' as the delimiter instead of whitespace? I tried passing the characters, but it didn't work.
