String Manipulation for a text extracted with a css selector - python

I wrote some code to extract several pieces of information about movies. I have a problem manipulating one of the strings:
'\n'
'\t\t\t\t85\xa0mins \xa0\n'
'\t\t\t\t\n'
'\t\t\t\t\tMore details at\n'
'\t\t\t\t\t'
I want to extract the duration of the movie, which, in this case, is the number 85.
I don't really know how to extract it since the format is pretty weird. My web scraping program yields items as dictionaries, for example:
{'film_director': ['Alfred Hitchcock'],
'film_rating': ['4.0'],
'film_time': '\n'
'\t\t\t\t81\xa0mins \xa0\n'
'\t\t\t\t\n'
'\t\t\t\t\tMore details at\n'
'\t\t\t\t\t',
'film_title': ['Rope'],
'film_year': ['1948']}
I have tried splitting it, but it doesn't seem to work. Any other ideas?

film_dict = {'film_director': ['Alfred Hitchcock'],
'film_rating': ['4.0'],
'film_time': '\n'
'\t\t\t\t81\xa0mins \xa0\n'
'\t\t\t\t\n'
'\t\t\t\t\tMore details at\n'
'\t\t\t\t\t',
'film_title': ['Rope'],
'film_year': ['1948']}
film_time = film_dict['film_time'].replace('\t', '')[:7]
print(film_time)
The film_time line takes the film time value, removes the tab characters, and then truncates the result. The replace method replaces each '\t' with an empty string, which removes it. [:7] then keeps the first seven characters (indices 0 through 6), which here gives '\n81\xa0min': the number is in there, but so are the leading newline and part of "mins", so the value still needs cleaning before it can be used as a number.
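One way to pull out just the number is a regular expression. This is a minimal sketch, assuming the duration always appears as digits followed by "mins"; the helper name extract_minutes is made up for this example.

```python
import re

def extract_minutes(raw_text):
    """Return the film duration in minutes as an int, or None if not found."""
    # \s also matches the non-breaking space (\xa0) between the number and "mins"
    match = re.search(r"(\d+)\s*mins?", raw_text)
    return int(match.group(1)) if match else None

film_time = ('\n'
             '\t\t\t\t81\xa0mins \xa0\n'
             '\t\t\t\t\n'
             '\t\t\t\t\tMore details at\n'
             '\t\t\t\t\t')
print(extract_minutes(film_time))  # 81
```

Because the pattern looks for the digits directly, it does not depend on how many tabs or newlines surround the value.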

Related

Splitting elements within a list and separate strings, then counting the length

If I have several lines of text, such that
"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child taking up her elders in that manner.
Be seated somewhere; and until you can speak pleasantly, remain silent."
I mounted into the window- seat: gathering up my feet, I sat cross-legged, like a Turk; and, having drawn the red moreen curtain nearly close, I was shrined in double retirement.
and I want to split the 'string' or sentences for each line by the ";" punctuation, I would do
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    words_split = words.split(";")
However, now I would get lists of strings such as
['"Jane, I don\'t like cavillers or questioners', 'besides, there is something truly forbidding in a child taking up her elders in that manner.']
['Be seated somewhere', 'and until you can speak pleasantly, remain silent."']
['I mounted into the window- seat: gathering up my feet, I sat cross-legged, like a Turk', 'and, having drawn the red moreen curtain nearly close, I was shrined in double retirement.']
So it has now created two separate elements in this list.
How would I actually separate this list?
I know I need a 'for' loop because it needs to process all the lines. I will need to use another 'split' method; however, I have tried "\n" as well as ',' but it does not produce an answer, and Python says "AttributeError: 'list' object has no attribute 'split'". What does this mean?
Once I have separate strings, I want to calculate the length of each string, so I would use len(), etc.
You can iterate through the list of sentence parts like this:
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    for sentence_part in words.split(";"):
        print(sentence_part)  # prints each element of the list
        print(len(sentence_part))  # prints the length of that sentence part
Alternatively, if you just need the length of each part:
for line in open("jane_eyre_sentences.txt"):
    words = line.strip("\n")
    sentence_part_lengths = [len(sentence_part) for sentence_part in words.split(";")]
Edit: With further information from your second post.
for count, line in enumerate(open("jane_eyre_sentences.txt")):
    words = line.strip("\n")
    if ";" in words:
        words_split = words.split(";")
        number_of_words_per_split = [(x, len(x.split())) for x in words_split]
        print("Line {}: ".format(count), number_of_words_per_split)
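The AttributeError in the question comes from calling split() on the list that the first split() returned; only the individual strings inside that list have a split() method. A small sketch using one of the sample sentences (the variable names are only illustrative):

```python
line = "Be seated somewhere; and until you can speak pleasantly, remain silent."

parts = line.split(";")  # a list of two strings
# parts.split(",") would raise AttributeError, because parts is a list.
# Work on each string inside the list instead:
lengths = [len(part.strip()) for part in parts]      # characters per part
word_counts = [len(part.split()) for part in parts]  # words per part

print(lengths)      # [19, 50]
print(word_counts)  # [3, 8]
```

The strip() call removes the space that is left at the start of each part after splitting on ";".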

How can I get the nth element of String for list of list in python?

I have my txt file something like this.
[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"]
[1, "I want to write a . I think I will.\n"]
[2, "#va_stress broke my twitter..\n"]
[3, "\" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"]
[4, "aww great "Picture to burn"\n"]
I have some code that needs to access the 2nd element of each array. When I use the code from "Get the nth element from the inner list of a list of lists in Python", it gives me individual characters rather than the entire string.
What would be the best way to loop over the data and get the second element?
My code is something like this.
All the tweets are in the tweets[] list.
cluster = []
for idx, cls in enumerate(km.labels_):
    if cls == 1:
        # printing cluster 2 data.
        # print tweets from the tweets array. like the entire line. But I
        # want to get the String here not the entire line.
        print(tweets[idx])
        cluster.append(tweets[idx])
Here, idx is used to pick specific entries, so tweets[idx] prints specific entries from the text file. But it prints the entire line, like [2, "#va_stress broke my twitter..\n"], and I want the string element only.
I guess what you want is the string from each list. I am assuming you already have a list of lists, with one inner list per line of your text file.
You can then apply this to get the values:
list_of_strings = list(map(lambda x: x[1], list_of_lists))
You can try either of these methods:
data = [[0, "we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....\n"],
        [1, "I want to write a . I think I will.\n"],
        [2, "#va_stress broke my twitter..\n"],
        [3, "\" \"Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry\n"],
        [4, "aww great \"Picture to burn\"\n"]]
print(list(map(lambda x: x[1].strip(), data)))
or
print([i[1].strip() for i in data])
output:
['we break dance not hearts by Short Stack is my ringtone.... i LOVE that !!!.....', 'I want to write a . I think I will.', '#va_stress broke my twitter..', '" "Y must people insist on talking about stupid politics on the comments of a bubblegum pop . Sorry', 'aww great "Picture to burn"']
Try this:
data = eval('[' + (open('file.txt').read().replace('\n', ', ')[:-2]) + ']')
result = []
for i in data:
    result.append(i[1])
The last 3 lines are pretty obvious, but here's what the first one does:
open('file.txt').read() opens the file and gets the contents.
.replace('\n', ', ')[:-2] replaces each newline with ', ' and then drops the final ', ', so that the text is formatted like the inside of a list. Skip the [:-2] if the last line doesn't end in a newline.
'[' + ... + ']' adds [ and ] to complete the list formatting.
data = eval(...) creates the list and assigns it to data.
Just in case, here is what the final lines do:
An empty list is created and assigned to result.
i is assigned each value in data in turn for the following line,
which appends the second value of each list within data to result.
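If each entry in tweets is still the raw line of text rather than a list, indexing it returns single characters, which may be what the question ran into. A hedged alternative to eval is ast.literal_eval, which only evaluates Python literals. The sketch below assumes every line is a valid Python list literal (lines with unescaped inner quotes, like samples 3 and 4, would need cleaning first); second_elements is a made-up helper name.

```python
import ast

def second_elements(lines):
    """Parse each non-empty line as a Python literal and return its second item."""
    result = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = ast.literal_eval(line)  # evaluates literals only, never arbitrary code
        result.append(record[1].strip())
    return result

# Raw strings, so that \n stays as the two characters found in the text file
sample_lines = [
    r'[1, "I want to write a . I think I will.\n"]',
    r'[2, "#va_stress broke my twitter..\n"]',
]
print(second_elements(sample_lines))
# ['I want to write a . I think I will.', '#va_stress broke my twitter..']
```

With a real file, the same function could be called as second_elements(open('file.txt')), since a file object yields one line at a time.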

Python 3, if str is present replace with str [duplicate]

I'm writing a fairly simple Python program to find and download videos from a particular site. I would like my script to name the file after the page title, except that the page title contains various strings I would like to remove. For example,
The title is:
The Big Bang Theory S09E15 720p HDTV X264-DIMENSION
but the titles are not always consistent. For example,
The title is:
Triple 9 2016 READNFO HDRip AC3-EVO
How can I replace strings if they are present?
Maybe create a list or dictionary of possible strings and if they are present then remove them (or replace with empty string)? I have tried and tried to find an answer but cannot find anything that helps my situation.
Basically if "HDTV", "HDRip", "720p", "X264", etc are present then replace them otherwise carry on?
Simple example:
string = 'The Big Bang Theory S09E15 720p HDTV X264-DIMENSION'
replacements = {'720p': '1080p'}  # format 'substring': 'replacement'
for key, value in replacements.items():
    if key in string:
        string = string.replace(key, value)  # str.replace returns a new string
The only problem with this arises when the text you want to replace can also be part of another word. For example, if you want to replace 'an' with 'a', the string in this example becomes 'The Big Bag Theory ... '. To fix this, I would try breaking the string up into a list of words and comparing each word to the dictionary entries.
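The word-by-word idea could be sketched as follows. It assumes words are separated by whitespace and that a word must match a dictionary key exactly; replace_words is a hypothetical helper, not part of any library.

```python
def replace_words(text, replacements):
    """Replace whole whitespace-separated words using a {'word': 'replacement'} dict."""
    # dict.get(word, word) leaves any word without an entry unchanged
    return " ".join(replacements.get(word, word) for word in text.split())

print(replace_words("The Big Bang Theory an example", {"an": "a"}))
# The Big Bang Theory a example
print(replace_words("The Big Bang Theory S09E15 720p HDTV", {"720p": "1080p"}))
# The Big Bang Theory S09E15 1080p HDTV
```

A limitation of this approach is that a word with attached punctuation, such as "X264-DIMENSION", counts as a single word and will not match a key like "X264".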
for undesired_word in ("HDTV", "HDRip", "720p", "X264"):
    title = title.replace(undesired_word, "")
title = 'The Big Bang Theory S09E15 720p HDTV X264-DIMENSION'
if 'HDTV' in title:
    title = title.replace('HDTV', '')
Not very Pythonic, but it will do what you want.
Kevin's answer will work for you, but just in case you find yourself wanting to use a regex:
import re
string_to_replace = ["HDTV", "HDRip", "720p", "X264"]
regex_string = r"|".join(string_to_replace)
S = "The Big Bang Theory S09E15 720p HDTV X264-DIMENSION"
new_string = re.sub(regex_string, "", S, flags=re.I)
print(new_string)
prints:
The Big Bang Theory S09E15   -DIMENSION
Also, as you will notice, the spaces that followed the removed strings are still there. If you do not want that, you can change string_to_replace to include the trailing spaces, like so: ["HDTV ", "HDRip ", "720p ", "X264 "]. Note that "X264 " then no longer matches, because in this title X264 is followed by a hyphen rather than a space, so the output becomes:
The Big Bang Theory S09E15 X264-DIMENSION
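A variation that avoids both the leftover spaces and the missed "X264" is to match the tags as whole words and then collapse the remaining whitespace. This is only a sketch: strip_tags is a made-up name, and the tag list is the one from the question.

```python
import re

def strip_tags(title, tags):
    """Remove the given tags (as whole words, case-insensitively) and tidy the spaces."""
    # re.escape guards against tags that contain regex metacharacters
    pattern = r"\b(?:" + "|".join(re.escape(tag) for tag in tags) + r")\b"
    cleaned = re.sub(pattern, "", title, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()  # collapse runs of whitespace

tags = ["HDTV", "HDRip", "720p", "X264"]
print(strip_tags("The Big Bang Theory S09E15 720p HDTV X264-DIMENSION", tags))
# The Big Bang Theory S09E15 -DIMENSION
print(strip_tags("Triple 9 2016 READNFO HDRip AC3-EVO", tags))
# Triple 9 2016 READNFO AC3-EVO
```

The \b word boundaries keep a tag from matching inside a longer word, and a hyphen counts as a boundary, which is why "X264" is still removed here.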

How to use text.split() and retain blank (empty) lines

New to Python, need some help with my program. I have code that takes in an unformatted text document, does some formatting (sets the page width and the margins), and outputs a new text document. My entire code works fine except for this function, which produces the final output.
Here is the segment of the problem code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0
    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: in case you are wondering about the ranges, segment[0] is the beginning of the document and segment[1] is the end. My problem is that when I copy text to words (in words = text.split()), my blank lines are not retained. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.

There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!
First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
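Put together, the split, process, and rejoin steps might look like the sketch below. The reflow function and its greedy line filling are assumptions made for illustration; they are not the asker's process() function.

```python
import re

def reflow(text, width):
    """Re-wrap each paragraph to the given width, keeping blank lines between paragraphs."""
    paragraphs = re.split(r"\n\n+", text.strip())
    wrapped = []
    for paragraph in paragraphs:
        lines, current = [], ""
        for word in paragraph.split():
            candidate = word if not current else current + " " + word
            if len(candidate) > width and current:
                lines.append(current)  # current line is full, start a new one
                current = word
            else:
                current = candidate
        if current:
            lines.append(current)
        wrapped.append("\n".join(lines))
    return "\n\n".join(wrapped)  # put the paragraph breaks back in

sample = "This is my substitute for pistol and ball.\n\nThere now is your insular city."
print(reflow(sample, 20))
```

The standard library's textwrap.fill could replace the inner loop if no custom margin handling is needed.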
Use a regexp to find the words and the blank lines rather than splitting:
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)
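For illustration, here is what that kind of pattern returns on a short made-up string. The blank line survives as its own '\n\n' token, so it can be written back out when the text is reassembled.

```python
import re

text = "first paragraph here\n\nsecond one"
tokens = re.findall(r"\S+|\n\n", text)
print(tokens)
# ['first', 'paragraph', 'here', '\n\n', 'second', 'one']

# Single newlines inside a paragraph are skipped, like any other whitespace
print(re.findall(r"\S+|\n\n", "one\ntwo"))
# ['one', 'two']
```

Note that with this pattern a run of three newlines yields a single '\n\n' token and a run of four yields two, so input with irregular spacing may need normalising first.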

Python parsing

I'm trying to parse the title tag in an RSS 2.0 feed into three different variables for each entry in that feed. Using ElementTree I've already parsed the RSS so that I can print each title [minus the trailing )] with the code below:
feed = getfeed("http://www.tourfilter.com/dallas/rss/by_concert_date")
for item in feed:
    print(repr(item.title[0:-1]))
I include that because, as you can see, the item.title is a repr() data type, which I don't know much about.
A particular repr(item.title[0:-1]) printed in the interactive window looks like this:
'randy travis (Billy Bobs 3/21'
'Michael Schenker Group (House of Blues Dallas 3/26'
The user selects a band and I hope to, after parsing each item.title into 3 variables (one each for band, venue, and date... or possibly an array or I don't know...) select only those related to the band selected. Then they are sent to Google for geocoding, but that's another story.
I've seen some examples of regex and I'm reading about them, but it seems very complicated. Is it? I thought maybe someone here would have some insight as to exactly how to do this in an intelligent way. Should I use the re module? Does it matter that the output is currently repr()s? Is there a better way? I was thinking I'd use a loop like this (this is my pseudoPython, just kind of notes I'm writing):
list = bandRaw,venue,date,latLong
for item in feed:
    parse item.title for bandRaw, venue, date
    if bandRaw == str(band)
        send venue name + ", Dallas, TX" to google for geocoding
        return lat,long
        list = list + return character + bandRaw + "," + venue + "," + date + "," + lat + "," + long
    else
In the end, I need to have the chosen entries in a .csv (comma-delimited) file looking like this:
band,venue,date,lat,long
randy travis,Billy Bobs,3/21,1234.5678,1234.5678
Michael Schenker Group,House of Blues Dallas,3/26,4321.8765,4321.8765
I hope this isn't too much to ask. I'll be looking into it on my own, just thought I should post here to make sure it got answered.
So, the question is, how do I best parse each repr(item.title[0:-1]) in the feed into the 3 separate values that I can then concatenate into a .csv file?
Don't let regex scare you off... it's well worth learning.
Given the examples above, you might try putting the trailing parenthesis back in, and then using this pattern:
import re
s = 'Michael Schenker Group (House of Blues Dallas 3/26)'  # example title with the ")" restored
pat = re.compile(r'([\w\s]+)\(([\w\s]+)(\d+/\d+)\)')
info = pat.match(s)
print(info.groups())
('Michael Schenker Group ', 'House of Blues Dallas ', '3/26')
To get at each group individually, just call group() on the info object (the first two groups keep their trailing space, so strip them):
print(info.group(1))  # or info.groups()[0]
print('"%s","%s","%s"' % (info.group(1).strip(), info.group(2).strip(), info.group(3)))
"Michael Schenker Group","House of Blues Dallas","3/26"
The hard thing about regex in this case is making sure you account for all the possible characters in the title. If there are characters other than letters, digits, underscores, and whitespace in the 'Michael Schenker Group' part, you'll have to adjust the regex for that part to allow them.
The pattern above breaks down as follows, read left to right:
([\w\s]+) : Match any word or whitespace characters (the plus symbol indicates that there should be one or more such characters). The parentheses mean that the match will be captured as a group. This is the "Michael Schenker Group " part. \w already covers letters, digits, and underscores; if there can be dashes or other punctuation here, you'll want to add them between the square brackets, which list the possible characters for the set.
\( : A literal parenthesis. The backslash escapes the parenthesis, since otherwise it counts as a regex command. This is the "(" part of the string.
([\w\s]+) : Same as the one above, but this time matches the "House of Blues Dallas " part. In parentheses so it will be captured as the second group.
(\d+/\d+) : Matches the digits 3 and 26 with a slash in the middle. In parentheses so it will be captured as the third group.
\) : A literal closing parenthesis, matching the ")" at the end of the title.
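As a hedged variation on the pattern broken down above, named groups and re.VERBOSE let the same structure be written with comments, and using .+? instead of [\w\s]+ accepts punctuation in band and venue names. The name parse_title and the optional trailing parenthesis are assumptions made for this sketch.

```python
import re

TITLE_RE = re.compile(r"""
    (?P<band>.+?)\s*      # band name: everything before the opening parenthesis
    \(                    # literal opening parenthesis
    (?P<venue>.+?)\s+     # venue: everything up to the whitespace before the date
    (?P<date>\d+/\d+)     # date such as 3/26
    \)?\s*$               # optional closing parenthesis, then end of string
""", re.VERBOSE)

def parse_title(title):
    """Return (band, venue, date), or None if the title does not fit the pattern."""
    match = TITLE_RE.match(title)
    if match is None:
        return None
    return match.group("band"), match.group("venue"), match.group("date")

print(parse_title("randy travis (Billy Bobs 3/21"))
# ('randy travis', 'Billy Bobs', '3/21')
print(parse_title("Michael Schenker Group (House of Blues Dallas 3/26)"))
# ('Michael Schenker Group', 'House of Blues Dallas', '3/26')
```

Returning None for titles that do not match makes it easy to skip malformed feed entries in the calling loop.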
The Python intro to regex is quite good, and you might want to spend an evening going over it: http://docs.python.org/library/re.html#module-re. Also, check Dive Into Python, which has a friendly introduction: http://diveintopython3.ep.io/regular-expressions.html.
EDIT: See zacherates below, who has some nice edits. Two heads are better than one!
Regular expressions are a great solution to this problem:
>>> import re
>>> s = 'Michael Schenker Group (House of Blues Dallas 3/26'
>>> re.match(r'(.*) \((.*) (\d+/\d+)', s).groups()
('Michael Schenker Group', 'House of Blues Dallas', '3/26')
As a side note, you might want to look at the Universal Feed Parser for handling the RSS parsing as feeds have a bad habit of being malformed.
Edit
In regards to your comment... The strings occasionally being wrapped in "s rather than 's has to do with the fact that you're using repr. The repr of a string is usually delimited with 's, unless that string contains one or more 's, where instead it uses "s so that the 's don't have to be escaped:
>>> "Hello there"
'Hello there'
>>> "it's not its"
"it's not its"
Notice the different quote styles.
Regarding the repr(item.title[0:-1]) part, not sure where you got that from but I'm pretty sure you can simply use item.title. All you're doing is removing the last char from the string and then calling repr() on it, which does nothing.
Your code should look something like this:
import re
import geocoders # from GeoPy
import feedparser # from www.feedparser.org
us = geocoders.GeocoderDotUS()
feedurl = "http://www.tourfilter.com/dallas/rss/by_concert_date"
feed = feedparser.parse(feedurl)
# band is the band name selected by the user (defined elsewhere)
lines = []
for entry in feed.entries:
    m = re.search(r'(.*) \((.*) (\d+/\d+)\)', entry.title)
    if m:
        bandRaw, venue, date = m.groups()
        if band == bandRaw:
            place, (lat, lng) = us.geocode(venue + ", Dallas, TX")
            # lat and lng are numbers, so convert them before joining
            lines.append(",".join([band, venue, date, str(lat), str(lng)]))
result = "\n".join(lines)
EDIT: replaced list with lines as the var name. list is a builtin and should not be used as a variable name. Sorry.
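For the final .csv output, the standard csv module handles quoting for you, which matters if a venue name ever contains a comma. A minimal sketch, using the sample row from the question; rows_to_csv is a made-up helper and the coordinates are the question's placeholder values, not real geocoding results.

```python
import csv
import io

def rows_to_csv(rows):
    """Return CSV text with a header line followed by one line per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["band", "venue", "date", "lat", "long"])
    writer.writerows(rows)  # numbers are converted to text automatically
    return buffer.getvalue()

rows = [("randy travis", "Billy Bobs", "3/21", 1234.5678, 1234.5678)]
print(rows_to_csv(rows))
# band,venue,date,lat,long
# randy travis,Billy Bobs,3/21,1234.5678,1234.5678
```

To write straight to disk, the same writer can be given a file opened with open("concerts.csv", "w", newline="") in place of the StringIO buffer (the file name is just an example).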
