Split and save text string on scrapy

Split and save text string on scrapy - python

I need split a substring from a string, exactly this source text:
Article published on: Tutorial
I want delete "Article published on:" And leave only
Tutorial
, so i can save this
i try with:
category = items[1]
category.split('Article published on:','')
and with
for p in articles:
bodytext = p.xpath('.//text()').extract()
joined_text = ''
# loop in categories
for each_text in text:
stripped_text = each_text.strip()
if stripped_text:
# all the categories together
joined_text += ' ' + stripped_text
joined_text = joined_text.split('Article published on:','')
items.append(joined_text)
if not is_phrase:
title = items[0]
category = items[1]
print('title = ', title)
print('category = ', category)
and this don't works, what im missing?
error with this code:
TypeError: 'str' object cannot be interpreted as an integer

You probably just forgot to assign the result:
category = category.replace('Article published on:', '')
Also it seems that you meant to use replace instead of split. The latter also works though:
category = category.split(':')[1]

Related

How to extract certain paragraph from text file

def extract_book_info(self):
books_info = []
for file in os.listdir(self.book_folder_path):
title = "None"
author = "None"
release_date = "None"
last_update_date = "None"
language = "None"
producer = "None"
with open(self.book_folder_path + file, 'r', encoding = 'utf-8') as content:
book_info = content.readlines()
for lines in book_info:
if lines.startswith('Title'):
title = lines.strip().split(': ')
elif lines.startswith('Author'):
try:
author = lines.strip().split(': ')
except IndexError:
author = 'Empty'
elif lines.startswith('Release date'):
release_date = lines.strip().split(': ')
elif lines.startswith('Last updated'):
last_update_date = lines.strip().split(': ')
elif lines.startswith('Produce by'):
producer = lines.strip().split(': ')
elif lines.startswith('Language'):
language = lines.strip().split(': ')
elif lines.startswith('***'):
pass
books_info.append(Book(title, author, release_date, last_update_date, producer, language, self.book_folder_path))
with open(self.book_info_path, 'w', encoding="utf-8") as book_file:
for book_info in books_info:
book_file.write(book_info.__str__() + "\n")
I was using this code tried to extract the book title , author , release_date ,
last_update_date, language, producer, book_path).
This the the output I achieve:
['Title', 'The Adventures of Sherlock Holmes'];;;['Author', 'Arthur Conan Doyle'];;;None;;;None;;;None;;;['Language', 'English'];;;data/books_data/;;;
This is the output I should achieved.
May I know what method I should used to achieve the following output
The Adventures of Sherlock Holmes;;;Arthur Conan Doyle;;;November29,2002;;;May20,2019;;;English;;;
This is the example of input:
Title: The Adventures of Sherlock Holmes
Author: Arthur Conan Doyle
Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: May 20, 2019]
Language: English
Character set encoding: UTF-8
Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez
*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***
cover

str.split gives you a list as a result. You're using it to assign to a single value instead.
'Title: Sherlock Holmes'.split(':') # => ['Title', 'Sherlock Holmes']
What I can gather from your requirement you want to access the second element from the split every time. You can do so by:
...
for lines in book_info:
if lines.startswith('Author'):
_, author = lines.strip().split(':')
elif...
Be careful since this can throw an IndexError if there is no second element in a split result. (That's why there's a try on the author param in your code)
Also, avoid calling __str__ directly. That's what the str() function calls for you anyway. Use that instead.

Trying to format text when pulling from webpage HTML

I've created a basic counter for words in a song, but am having trouble formatting the album title and artist name from a given page on this lyrics website. Here's an example of what I am focused on:
I want to format it in this way:
Album Title: [Album Title] (Release_year)
Artist: [Artist Name]
I'm running into two problems:
The album title isn't enclosed in its own tag, so if I call the h1 tag I get both the album name, release year and artist name. How do I call them separately, or how do I break them up when calling them?
The album name has two blank lines and two blank spaces included in the string. How do I get rid of them? The release year prints right next to the album title, which is exactly what I'm looking for, but I cant get the album title to format properly.
This is what I currently have:
song_artist = soup.find("a",{"class":"artist"}).get_text()
album_title = soup.find("h1",{"class":"album_name"}).get_text()
print "Album Title: " + str(album_title)
print "Song Artist: " + str(song_artist.title())
which produces:
Thank you!!

album_title = soup.find("h1",{"class":"album_name"}).find(text=True).strip()
album_year = soup.find("span",{"class":"release_year"}).get_text().strip()
print 'Album Title: {} {}'.format(album_title, album_year)

change a key to a dictionary in google app engine - solved using a different approach

I'm having trouble changing a key value to a dictionary value
def get(self):
#Get all the Subjects
subjects = ndb.gql('SELECT name,order FROM Subject ORDER BY order ASC')
values = {'subjects':subjects}
#Get all the Contents
for subject in subjects:
contents = ndb.gql('SELECT * FROM Content WHERE ANCESTOR IS :1 ORDER BY order ASC',subject.key)
values[subject.name] = contents #***HERE is the issue***
Rather than getting a dictionary
value = {key:value}
I'm trying to get
value = {{key:value}:value}
Thanks in advance for any suggestions!
EDIT:
When I try
values['subject':subject.name] = contents
I get the error
TypeError: unhashable type
Solved: with a different approach:
def get(self):
#Get all the Subjects
subjects = ndb.gql('SELECT name,order FROM Subject ORDER BY order ASC')
values = {'subjects':subjects}
#Get all the Contents
values['contents'] = []
for subject in subjects:
#Formatting HTML output
subjectAll = subject.name + ' ' + subject.order
contents = ndb.gql('SELECT name,order FROM Content WHERE ANCESTOR IS :1 ORDER BY order ASC',subject.key)
values['contents'].append(subjectAll)
for content in contents:
#Formatting HTML output
contentAll = content.name + ' ' + content.order
values['contents'].append(contentAll)

Removed the default content in nested expression

I am using Pyparsing module and the nestedExpr function in it.
I want to give a delimitter instead of the default whitespace-delimited in the content argument of nestedexpr function.
If I have a text such as the following
text = "{{Infobox | birth_date = {{birth date and age|mf=yes|1981|1|31}}| birth_place = ((Memphis, Tennessee|Memphis)), ((Tennessee)), U.S.| instrument = ((Beatboxing)), guitar, keyboards, vocalsprint expr.parse| genre = ((Pop music|Pop)), ((contemporary R&B|R&B))| occupation = Actor, businessman, record producer, singer| years_active = 1992–present| label = ((Jive Records|Jive)), ((RCA Records|RCA)), ((Zomba Group of Companies|Zomba))| website = {{URL|xyz.com|Official website}} }}"
When I give nestedExpr('{{','}}').parseString(text) I need the output as the following list:
['Infobox | birth_date =' ,['birth date and age|mf=yes|1981|1|31'],'| birth_place = ((Memphis, Tennessee|Memphis)), ((Tennessee)), U.S.| instrument = ((Beatboxing)), guitar, keyboards, vocalsprint expr.parse| genre = ((Pop music|Pop)), ((contemporary R&B|R&B))| occupation = Actor, businessman, record producer, singer| years_active = 1992–present| label = ((Jive Records|Jive)), ((RCA Records|RCA)), ((Zomba Group of Companies|Zomba))| website =',[ 'URL|xyz.com|Official website' ]]
How can I give a ',' or '|' as the delimmiter instead of the whitespace-delimited characters? I tried giving the characters but it didnt work.

How to fix a KeyError in django?

print user_dic[id] displays the right result PersonA. This is when I input the id manually.
user_stream = {u'2331449': u'PersonB', u'17800013': u'PersonA'}
user_dic= {}
for item in user_stream:
user_dic[item['id']] = item['name']
id = '17800013'
print user_dic[id] #returns the right value
However, when I try to put the user_id through a for loop that iterates through json I get an error: KeyError at 17800013 for the line name = user_dic[user_id]. I don't understand why the user_dic[id] works when manually inputting the id, but user_dic[user_id] doesn't work when going through the for loop even though the input is the same.
#right fql query
fql_query = "SELECT created_time, post_id, actor_id, type, updated_time, attachment FROM stream WHERE post_id in (select post_id from stream where ('video') in attachment AND source_id IN ( SELECT uid2 FROM friend WHERE uid1=me()) limit 100)"
fql_var = "https://api.facebook.com/method/fql.query?access_token=" + token['access_token'] + "&query=" + fql_query + "&format=json"
data = urllib.urlopen(fql_var)
fb_stream = json.loads(data.read())
fb_feed = []
for post in fb_stream:
user_id = post["actor_id"]
name = user_dic[user_id] #this is the line giving me trouble
title = post["attachment"]["name"]
link = post["attachment"]["href"]
video_id = link[link.find('v=')+2 : link.find('v=')+13]
fb_feed.append([user_id, name, title, video_id])

There is no need for user_dic. What you are doing in first part is just a redundant work and you are also doing it wrong. Your user_stream is already in a form how you wanted it. Your first part should contain this line:
user_stream = {u'2331449': u'PersonB', u'17800013': u'PersonA'}
And in second part (at line where you are facing problem) you should do:
name = user_stream[user_id]
If you think that you will face KeyError then dict has a method .get, which returns None if the Key is not found. You can specify your value instead of None to return if there is KeyError
name = user_stream.get('user_id')
#returns None by default
name = user_stream.get('user_id', '')
#returns empty string now
#on both cases exception will not raised

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Split and save text string on scrapy - python

You probably just forgot to assign the result: category = category.replace('Article published on:', '') Also it seems that you meant to use replace instead of split. The latter also works though: category = category.split(':')[1]

Related

How to extract certain paragraph from text file

Trying to format text when pulling from webpage HTML

change a key to a dictionary in google app engine - solved using a different approach

Removed the default content in nested expression

How to fix a KeyError in django?

Categories

Resources