Say I have a sentence such as:
The bird flies at night and has a very large wing span.
My goal is to split the string so that the result comes out to be:
and has a very large wing
I've tried using split(), however, my efforts have not been successful. How can I split the string into pieces, and delete the beginning part of the string and the end part?
import re
text = 'The bird flies at night and has a very large wing span.'
result = re.split(r'.+?(?=and)|(?<=wing).+?', text)[1]
print(result)
Output:
and has a very large wing
I guess this is the best way to do what you want:
s = "The bird flies at night and has a very large wing span."
and_position = s.find("and")  # index of the first "and" in the string
wing_position = s.find("wing")  # index of the first "wing" in the string
result = s[and_position:wing_position + len("wing")]  # slice from "and" through the end of "wing"
If you're not familiar with Python slicing, read more here.
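Both answers above assume that "and" and "wing" actually occur in the string. As a minimal sketch, the slicing approach can be wrapped in a helper that returns None when either marker is missing. The helper name extract_between is made up for illustration:

```python
def extract_between(text, start, end):
    """Return text from the first `start` through the first `end` after it."""
    start_pos = text.find(start)
    if start_pos == -1:
        return None  # start marker not found
    # Only look for the end marker after the start marker.
    end_pos = text.find(end, start_pos + len(start))
    if end_pos == -1:
        return None  # end marker not found after the start marker
    return text[start_pos:end_pos + len(end)]


s = "The bird flies at night and has a very large wing span."
print(extract_between(s, "and", "wing"))  # and has a very large wing
```

Searching for the end marker only after the start marker avoids an empty or reversed slice when the end marker also appears earlier in the string.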
I am new to Python, so apologies for the simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words.
I have a variable, text, which contains a lot of text.
I did:
test = text.split()
sorted(test)
As a result, I get a list that starts with symbols like $ and numbers.
How do I get only the words and print the first N of them?
I'm assuming that by "word" you mean strings that consist only of alphabetical characters. In that case, you can use filter() to first get rid of the unwanted strings, turn the result into a set, sort it, and then take the first five.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Keep only the strings that consist entirely of letters
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 unique words in sorted order
print(sorted(set(words))[:5])
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case.
For now, we'll go with the regex ^[A-Za-z']+$, which means the string must contain only letters and '. You may add more to this regex according to what you deem to be "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Keep only the strings that consist of letters and apostrophes
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 unique words in sorted order
print(sorted(set(words))[:5])
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind, however, that this gets tricky when you have a string like hi! What's your name?. Both hi! and name? are words, but they are not fully alphabetic. The trick is to split the text in such a way that you get hi instead of hi! and name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question.
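As a rough sketch of that idea, re.findall can pull out runs of letters directly, so trailing punctuation never becomes part of a token. The pattern and the helper name first_words are assumptions for illustration, and this is far from a full tokenizer (hyphenated words and non-ASCII letters are not handled):

```python
import re


def first_words(text, n=5):
    """Return the first n unique words of text in alphabetical order."""
    # A "word" here is a run of ASCII letters that may contain inner apostrophes.
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text.lower())
    return sorted(set(tokens))[:n]


print(first_words("hi! What's your name?"))
# ['hi', 'name', "what's", 'your']
```

Lowercasing first means that What's and what's count as the same word; drop the .lower() call if case should be preserved.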
I am a newbie here, so apologies for any mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
test2 = sorted([j for j in test if j.isalpha()])
print(test2[:5])
You can slice the sorted list to get the first 5 elements:
sorted(test)[:5]
or, if you are looking only for words:
sorted([i for i in test if i.isalpha()])[:5]
or with a regex (this needs import re, and it keeps any string that contains at least one letter):
sorted([i for i in test if re.search(r"[a-zA-Z]", i)])[:5]
Slicing a list returns all elements up to, but not including, a given index, in this case 5.
I've checked the site for an answer to this question and exhausted Google and my own patience trying to answer it myself, so here it is. Happy to be pointed to the answer if this is a dupe.
So I have a long regex--nothing complicated, just a bunch of simple conditions piped together. I'm using it to remove the piped words from the beginnings and ends of named entities I've extracted from news article data. The use case is, many of the names have these short words within them (think Centers for Disease Control and Prevention) but I want to remove the words when they appear at the beginning or end of the name. E.g., I don't want "Centers for Disease Control" counted differently from "the Centers for Disease Control" for obvious reasons.
I used this regex string on a large (>1M) list of named entities in Python 3.7.2 using the following code (file here):
import re
with open('pnames.csv','r') as f:
    named_entities = f.read().splitlines()
print(len([i for i in named_entities if i == 'the wall street journal']))
# 146
short_words = r"^and\s|\sand$|^at\s|\sat$|^by\s|\sby$|^for\s|\sfor$|^in\s|\sin$|^of\s|\sof$|^on\s|\son$|^the\s|\sthe$|^to\s|\sto$"
cleaned_entities = [re.sub(short_words,"",i)
                    for i
                    in named_entities]
print(len([i for i in cleaned_entities
           if i == 'the wall street journal']))
# 80 (huh, should be 0. Let me try again...)
cleaned_entities2 = [re.sub(short_words,"",i)
                     for i
                     in cleaned_entities]
print(len([i for i in cleaned_entities2
           if i == 'the wall street journal']))
# 1 (better, but still unexpected. One more time...)
cleaned_entities3 = [re.sub(short_words,"",i)
                     for i
                     in cleaned_entities2]
print(len([i for i in cleaned_entities3
           if i == 'the wall street journal']))
# 0 (this is what I expected on the first run!)
My question is, why doesn't the regex remove all the matching substrings in one pass? i.e., why is len([i for i in cleaned_entities if i == 'the wall street journal']) not equal to 0? Why does it take multiple runs to finish the job?
Things I've tried:
Restarting Spyder
Running the same code in Python 3.7.2, Python 3.6.2, and equivalent code in R 3.4.2 (the Pythons gave the exact same results, and R gave different numbers but I still had to run it several times to get to zero)
Running the code only on the substrings that match the regex (same result)
Running the code only on the strings that equal "the wall street journal" (works in one pass)
Substituting the regex "^the " in the above code (fixes all matches in one pass)
So yeah, any ideas would be helpful.
Your regular expression will only ever remove one layer of unwanted words per pass. So if you had a sentence such as:
and and at by in of the the wall street journal at the by on the
it would need many passes to completely remove everything.
The expression can be rearranged to use + to match one or more occurrences, as follows:
import re
with open('pnames2.csv','r') as f:
    named_entities = f.read().splitlines()
print(len([i for i in named_entities if i == 'the wall street journal']))
# 146
short_words = r"^((and|at|by|for|in|of|on|the|to)\s)+|(\s(and|at|by|for|in|of|on|the|to))+$"
re_sw = re.compile(short_words)
cleaned_entities = [re_sw.sub("", i) for i in named_entities]
print(len([i for i in cleaned_entities if i == 'the wall street journal']))
# 0
The process can be sped up slightly by pre-compiling the regular expression. It would be even faster if you applied it to the whole file rather than on a line-by-line basis.
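To see the "one layer per pass" effect in isolation, here is a small sketch using a made-up entity that starts with two short words. The sample string is an assumption; the two patterns are condensed forms of the one from the question and the one from this answer:

```python
import re

words = "and|at|by|for|in|of|on|the|to"

# Exactly one short word at the start or at the end (as in the question).
single = re.compile(rf"^(?:{words})\s|\s(?:{words})$")
# One or more short words at the start or at the end (as in this answer).
repeated = re.compile(rf"^(?:(?:{words})\s)+|(?:\s(?:{words}))+$")

entity = "of the wall street journal"

once = single.sub("", entity)   # first pass strips only the leading "of "
twice = single.sub("", once)    # second pass strips the now-leading "the "
print(once)                     # the wall street journal
print(twice)                    # wall street journal
print(repeated.sub("", entity)) # wall street journal
```

Because ^ can match only at the very start of the string, a single call to sub() removes at most one leading word with the first pattern, which is why entries such as "of the wall street journal" turn into "the wall street journal" after the first pass.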
So I have this textfile, and in that file it goes like this... (just a bit of it)
"The truest love that ever heart
Felt at its kindled core
Did through each vein in quickened start
The tide of being pour
Her coming was my hope each day
Her parting was my pain
The chance that did her steps delay
Was ice in every vein
I dreamed it would be nameless bliss
As I loved loved to be
And to this object did I press
As blind as eagerly
But wide as pathless was the space
That lay our lives between
And dangerous as the foamy race
Of ocean surges green
And haunted as a robber path
Through wilderness or wood
For Might and Right and Woe and Wrath
Between our spirits stood
I dangers dared I hindrance scorned
I omens did defy
Whatever menaced harassed warned
I passed impetuous by
On sped my rainbow fast as light
I flew as in a dream
For glorious rose upon my sight
That child of Shower and Gleam"
Now, I need to calculate the total length of the words without the letter 'e' in each line of text. So the first line should give 4, then 5, then 17, etc.
My current code is
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    for word in line_strip_split:
        if "e" not in word:
            word_e = word
            print (len(word_e))
My explanation is: strip each line and split it on spaces, so it becomes ['Felt','at','its','kindled','core'], etc. Then we look at each word individually, so that we can skip the words containing 'e'. So we want the words without e, then print the length of the string.
HOWEVER, this handles each word separately, so the lengths are printed one per word instead of being added together for each line, and the answer becomes "4 / 2 / 3".
Try this:
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    words_with_no_e = []
    for word in line_strip_split:
        if "e" not in word:
            # Add words without e to a new list
            words_with_no_e.append(word)
    # ''.join() returns all the elements of the list concatenated
    # len() counts the length of the result
    print(len(''.join(words_with_no_e)))
It appends all the words without e to a new list for each line, concatenates them, and then prints the length of the result.
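The same per-line total can also be written with sum() over a generator expression, which avoids building the intermediate list. A minimal sketch, using the first three lines of the poem inline instead of reading textname.txt (the function name is made up for illustration):

```python
lines = [
    "The truest love that ever heart",
    "Felt at its kindled core",
    "Did through each vein in quickened start",
]


def length_without_e(line):
    """Total number of characters in the words of line that contain no 'e'."""
    return sum(len(word) for word in line.split() if "e" not in word)


for line in lines:
    print(length_without_e(line))
# prints 4, 5 and 17
```

Like the original, this only checks for a lowercase 'e'; use "e" not in word.lower() if words such as Eagle should be excluded too.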
I'm writing a fairly simple Python program to find and download videos from a particular site. I would like my script to name the file using the page title, except the page title contains various strings I would like to remove, e.g.:
The title is:
The Big Bang Theory S09E15 720p HDTV X264-DIMENSION
but the titles are not always consistent, e.g.:
The title is:
Triple 9 2016 READNFO HDRip AC3-EVO
How can I replace strings if they are present?
Maybe create a list or dictionary of possible strings and if they are present then remove them (or replace with empty string)? I have tried and tried to find an answer but cannot find anything that helps my situation.
Basically if "HDTV", "HDRip", "720p", "X264", etc are present then replace them otherwise carry on?
Simple example:
string = 'The Big Bang Theory S09E15 720p HDTV X264-DIMENSION'
replacements = {'720p': '1080p'}  # format 'substring': 'replacement'
for key, value in replacements.items():
    if key in string:
        # str.replace returns a new string, so assign the result back
        string = string.replace(key, value)
The only problem with this arises when the substring you want to replace could be part of another word. For example, if you wanted to replace 'an' with 'a', the string in this example would become 'The Big Bag Theory ... '. To fix this, I would try breaking the string up into a set of words and comparing the words to the dictionary entries.
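One way to sketch the whole-word idea is to split the title on whitespace and look each word up in the dictionary. The replacements mapping and the helper name replace_words are illustrative assumptions:

```python
def replace_words(title, replacements):
    """Replace whole whitespace-separated words that appear in replacements."""
    words = title.split()
    # dict.get falls back to the original word when there is no replacement.
    return " ".join(replacements.get(word, word) for word in words)


title = "The Big Bang Theory S09E15 720p HDTV X264-DIMENSION"
print(replace_words(title, {"720p": "1080p", "an": "a"}))
# The Big Bang Theory S09E15 1080p HDTV X264-DIMENSION
```

Here "Bang" is left alone because only whole words are compared. The flip side is that "X264-DIMENSION" counts as a single word, so a key of "X264" would not match it.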
for undesired_word in ("HDTV", "HDRip", "720p", "X264"):
    title = title.replace(undesired_word, "")
title = 'The Big Bang Theory S09E15 720p HDTV X264-DIMENSION'
if 'HDTV' in title:
    title = title.replace('HDTV', '')
Not very Pythonic, but it will do what you want.
Kevin's answer will work for you, but just in case you find yourself wanting to use a regex:
import re
string_to_replace = ["HDTV", "HDRip", "720p", "X264"]
regex_string = r"|".join(string_to_replace)
S = "The Big Bang Theory S09E15 720p HDTV X264-DIMENSION"
new_string = re.sub(regex_string, "", S, flags=re.I)
print(new_string)
prints:
The Big Bang Theory S09E15   -DIMENSION
Also, as you will notice, the spaces that followed the strings you replaced are still there. If you do not want that, you can change string_to_replace to include the spaces, like so: ["HDTV ", "HDRip ", "720p ", "X264 "], and this would result in the output being:
The Big Bang Theory S09E15 X264-DIMENSION
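A possible variation, sketched under the assumption that leftover whitespace should be tidied up as well: escape each tag with re.escape, remove it, then collapse runs of whitespace. The tag list is the one from above; the helper name clean_title is hypothetical:

```python
import re

TAGS = ["HDTV", "HDRip", "720p", "X264"]
# re.escape guards against tags that contain regex metacharacters.
TAG_RE = re.compile("|".join(re.escape(tag) for tag in TAGS), flags=re.IGNORECASE)


def clean_title(title):
    """Remove known tags, then collapse repeated whitespace."""
    without_tags = TAG_RE.sub("", title)
    return re.sub(r"\s+", " ", without_tags).strip()


print(clean_title("The Big Bang Theory S09E15 720p HDTV X264-DIMENSION"))
# The Big Bang Theory S09E15 -DIMENSION
print(clean_title("Triple 9 2016 READNFO HDRip AC3-EVO"))
# Triple 9 2016 READNFO AC3-EVO
```

This removes X264 even when it is glued to "-DIMENSION", which the trailing-space variant above does not.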
New to Python and need some help with my program. I have code that takes in an unformatted text document, does some formatting (sets the page width and the margins), and outputs a new text document. My entire code works fine except for this function, which produces the final output.
Here is the segment of the problem code:
def process(document, pagewidth, margins, formats):
    res = []
    onlypw = []
    pwmarg = []
    count = 0
    marg = 0
    for segment in margins:
        for i in range(count, segment[0]):
            res.append(document[i])
        text = ''
        foundmargin = -1
        for i in range(segment[0], segment[1]+1):
            marg = segment[2]
            text = text + '\n' + document[i].strip(' ')
        words = text.split()
Note: segment[0] means the beginning of the document, and segment[1] just means the end of the document, if you are wondering about the range. My problem is that when I copy text to words (in words = text.split()), it does not retain my blank lines. The output I should be getting is:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me.

There now is your insular city of the Manhattoes, belted
round by wharves as Indian isles by coral reefs--commerce
surrounds it with her surf.
And what my current output looks like:
This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I
quietly take to the ship. There is nothing surprising in
this. If they but knew it, almost all men in their degree,
some time or other, cherish very nearly the same feelings
towards the ocean with me. There now is your insular city of
the Manhattoes, belted round by wharves as Indian isles by
coral reefs--commerce surrounds it with her surf.
I know the problem happens when I copy text to words, since it doesn't keep the blank lines. How can I make sure it copies the blank lines plus the words?
Please let me know if I should add more code or more detail!
First split on at least 2 newlines, then split on words:
import re
paragraphs = re.split('\n\n+', text)
words = [paragraph.split() for paragraph in paragraphs]
You now have a list of lists, one per paragraph; process these per paragraph, after which you can rejoin the whole thing into new text with double newlines inserted back in.
I've used re.split() to support paragraphs being delimited by more than 2 newlines; you could use a simple text.split('\n\n') if there are ever only going to be exactly 2 newlines between paragraphs.
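Putting that together, here is a minimal sketch of the split, process, rejoin cycle. It uses textwrap.fill as a stand-in for the asker's own page-width and margin logic, which is an assumption; the function name reflow, the sample text, and the width are made up:

```python
import re
import textwrap


def reflow(text, width):
    """Re-wrap each paragraph to width, keeping one blank line between paragraphs."""
    paragraphs = re.split(r"\n\n+", text.strip())
    # Normalise whitespace inside each paragraph, then wrap it on its own.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


sample = (
    "This is my substitute for\npistol and ball.\n\n"
    "There now is your insular city\nof the Manhattoes."
)
print(reflow(sample, 30))
```

Because each paragraph is wrapped separately, words from one paragraph can never flow into the next, and the double newline is put back only when the pieces are rejoined.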
Use a regexp to find the words and the blank lines rather than splitting:
import re
m = re.compile(r'(\S+|\n\n)')
words = m.findall(text)