The goal of the function is to split a single string into multiple lines to make it more readable. It should replace with a newline the first space found after at least n characters (counted from the beginning of the string, or from the last "\n" inserted into the string).
Assumption:
you can assume there is no \n in the input string
Example
Marcus plays soccer in the afternoon
f(10) should result in
Marcus plays\nsoccer in\nthe afternoon
The first space in Marcus plays soccer in the afternoon is skipped because Marcus is only 6 chars long. We then put a \n after plays and start counting again. The space after soccer is therefore skipped, etc.
So far tried
import re

def replace_space_w_newline_every_n_chars(n,s):
    return re.sub("(?=.{"+str(n)+",})(\s)", "\\1\n", s, 0, re.DOTALL)
inspired by this
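Try replacing
(.{9}.*?)\s
with
$1\n
(The quantifier is 9 rather than 10 because, in the expected output for f(10), the line soccer in has only 9 characters before the space that gets replaced.)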
Check it out here.
Example:
>>> import re
>>> s = 'Marcus plays soccer in the afternoon'
>>> re.sub(r'(.{9}.*?)\s', r'\1\n', s)
'Marcus plays\nsoccer in\nthe afternoon'
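The same substitution can be wrapped in a function that takes n as a parameter. This is only a sketch: the name wrap_after_n is made up for illustration, n is assumed to be at least 1, and the n - 1 in the quantifier is inferred from the f(10) example, where .{9} reproduces the expected output.

```python
import re

def wrap_after_n(s, n):
    """Replace the first whitespace after at least n - 1 characters with a newline.

    Hypothetical helper; the n - 1 offset is an assumption taken from the
    f(10) example, not something the question states explicitly.
    """
    # .{n-1} consumes the minimum run, .*? lazily extends to the next whitespace.
    pattern = r'(.{%d}.*?)\s' % (n - 1)
    return re.sub(pattern, r'\1\n', s)

print(repr(wrap_after_n('Marcus plays soccer in the afternoon', 10)))
```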
Related
How do I remove non-English words (words that contain both letters and numbers) from the text in a DataFrame column?
Example:
df['text']
'the interiors nrd studio | happy mothers day ”there is no influence so powerful as that of the mother.” —sara josepha hale... happy mother’s day mom & to all the mothers around the world! lots of light natasha
0wet3bxtfl'
'but still missing you every day happy mothers day francis mcclafferty (mccool) 9wlhju7cxf'
From the above 2 rows I need to remove the words '0wet3bxtfl' and '9wlhju7cxf'.
The example keeps some strings that would not be found in a list of English words ("nrd", "mcclafferty", "mccool") while removing "0wet3bxtfl" and "9wlhju7cxf". The expected result is therefore probably best achieved by removing any non-whitespace sequence that contains either a letter followed by a digit or a digit followed by a letter (together with any spaces that follow), without regard to whether the words are "English" or not.
The following would do this:
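import re
...
pattern = r'[^\s]*(\d[a-zA-Z]|[a-zA-Z]\d)[^\s]* *'
# re.sub works on a single string, so apply it to each row of the column
filtered = df['text'].apply(lambda text: re.sub(pattern, '', text))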
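As a quick check on plain strings (no DataFrame involved), the pattern can be tried on shortened versions of the two rows. The helper name drop_mixed_tokens is hypothetical, and \S is used here as an equivalent spelling of [^\s].

```python
import re

# Any run of non-whitespace that contains a digit next to a letter,
# together with the spaces that follow it.
MIXED = re.compile(r'\S*(?:\d[a-zA-Z]|[a-zA-Z]\d)\S* *')

def drop_mixed_tokens(text):
    """Hypothetical helper: remove tokens that mix letters and digits."""
    return MIXED.sub('', text).strip()

rows = [
    'lots of light natasha 0wet3bxtfl',
    'happy mothers day francis mcclafferty (mccool) 9wlhju7cxf',
]
for row in rows:
    print(drop_mixed_tokens(row))
```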
I have a string:
s = "Now is the time for all good men to come to the aid of their country. Time is of the essence friends."
I want to divide s into substrings 25 characters in length like this:
"Now is the time for all\n",
"good men to come to the\n",
"aid of their country.\n",
"Time is of the essence,\n",
"friends."
or
Now is the time for all
good men to come to the
aid of their country.
Time is of the essence,
friends.
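using spaces to pad each line 'equally', starting from the left, so that every substring is 25 characters long.
Using split() I can divide the string s into a list of lines of at most 25 characters each:
d = []        # finished lines
current = []  # words of the line being built
length = 0    # length of the current line, including single spaces
for word in s.split():
    # a word needs one extra space unless it starts the line
    needed = len(word) + (1 if current else 0)
    if length + needed <= 25:
        current.append(word)
        length += needed
    else:
        d.append(' '.join(current))
        current = [word]
        length = len(word)
if current:
    d.append(' '.join(current))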
result:
d=['Now is the time for all', 'good men to come to the', 'aid of their country.', 'Time is of the essence', 'friends.']
Then use this list d to build the string t
I don't understand how I can pad in between the words in each group of words to reach a length of 25. It involves some circular method but I haven't found that method yet.
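One possible way to do the padding, sketched under a few assumptions: the words of a line fit within the target width, extra spaces go to the leftmost gaps first, and a line with a single word is left unpadded. The name justify_line is hypothetical. Instead of cycling over the gaps one space at a time, divmod gives the same distribution in one step.

```python
def justify_line(words, width):
    """Hypothetical helper: pad the gaps between words, leftmost first,
    so that the joined line is exactly `width` characters long."""
    if len(words) < 2:
        return ' '.join(words)  # no gaps to pad
    gaps = len(words) - 1
    total_spaces = width - sum(len(w) for w in words)
    # Every gap gets `base` spaces; the first `extra` gaps get one more.
    base, extra = divmod(total_spaces, gaps)
    parts = []
    for i, word in enumerate(words[:-1]):
        parts.append(word + ' ' * (base + (1 if i < extra else 0)))
    parts.append(words[-1])
    return ''.join(parts)

for line in ['Now is the time for all', 'good men to come to the']:
    print(repr(justify_line(line.split(), 25)))
```

Whether the last line of the text (here friends.) should also be padded is a design choice; this sketch leaves that decision to the caller.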
So I have this textfile, and in that file it goes like this... (just a bit of it)
"The truest love that ever heart
Felt at its kindled core
Did through each vein in quickened start
The tide of being pour
Her coming was my hope each day
Her parting was my pain
The chance that did her steps delay
Was ice in every vein
I dreamed it would be nameless bliss
As I loved loved to be
And to this object did I press
As blind as eagerly
But wide as pathless was the space
That lay our lives between
And dangerous as the foamy race
Of ocean surges green
And haunted as a robber path
Through wilderness or wood
For Might and Right and Woe and Wrath
Between our spirits stood
I dangers dared I hindrance scorned
I omens did defy
Whatever menaced harassed warned
I passed impetuous by
On sped my rainbow fast as light
I flew as in a dream
For glorious rose upon my sight
That child of Shower and Gleam"
Now, I need to calculate the total length of the words without the letter 'e' in each line of text. So the first line should give 4, then 5, then 17, etc.
My current code is
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    for word in line_strip_split:
        if "e" not in word:
            word_e = word
            print (len(word_e))
My reasoning is: strip each line and split it on whitespace, so it becomes ['Felt','at','its','kindled','core'], etc. Splitting lets me look at each word individually when filtering out the words that contain 'e'. For the words without 'e', I then print the length of the string.
HOWEVER, this handles each word separately, so the lengths are not added up per line. Each word's length is printed on its own, and the answer becomes "4 / 2 / 3".
Try this:
for line in open("textname.txt"):
    line_strip = line.strip()
    line_strip_split = line_strip.split()
    words_with_no_e = []
    for word in line_strip_split:
        if "e" not in word:
            # Add words without 'e' to a new list
            words_with_no_e.append(word)
    # ''.join() returns all the elements of the list concatenated
    # len() counts the length of the resulting string
    print(len(''.join(words_with_no_e)))
For each line, it appends all the words without 'e' to a new list, concatenates them, and prints the length of the result.
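The per-line total can also be computed without building the intermediate list, by summing the word lengths directly. A minimal sketch, with a made-up function name and the first three lines of the poem hard-coded instead of read from the file:

```python
def length_without_e(line):
    """Total length of the words in `line` that contain no lowercase 'e'."""
    return sum(len(word) for word in line.split() if "e" not in word)

sample = [
    "The truest love that ever heart",
    "Felt at its kindled core",
    "Did through each vein in quickened start",
]
for line in sample:
    print(length_without_e(line))
```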
I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to match only strings of 3 capitalized words or fewer, immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter the length or the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?
Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
([A-Z][a-z]+ ?){1,3} - Capitalized word repeating 1 to 3 times separated by space
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - space separator of capitalized words (_ is a space), ? for single word w/o ending space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
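In Python, group 2 can be pulled out with re.finditer. The snippet below is a sketch that runs the pattern on a shortened version of the dummy sentence; the variable names are illustrative only.

```python
import re

# Opening quote, 1 to 3 capitalized words, then the same quote again.
pattern = re.compile(r'''(["'])((?:[A-Z][a-z]+ ?){1,3})\1''')

# Shortened version of the dummy sentence from the question.
txt = ('Saul "Canelo" Alvarez and Floyd \'Money\' Mayweather. '
       '"Here Goes a String". \'Money Mayweather\'. '
       '"This IS a" "Three Word String".')

# findall would return (quote, words) tuples, so take group 2 explicitly.
names = [m.group(2) for m in pattern.finditer(txt)]
print(names)
```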
Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall('(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1
If a segment starting with one particular word does not end with another particular word, skip it. Here is my string:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
I want to print all the words between john and dead, death or died, and count them.
If a john is not followed by died, dead or death before the next john, skip it and start again from that next john.
my code :
x = re.sub(r'[^\w]', ' ', x) # replace all dots, commas, special symbols with spaces
for i in re.findall(r'(?<=john)' + '(.*?)' + '(?=dead|died|death)', x):
    print(i)
    print(len([word for word in i.split()]))
my output:
got shot
2
with his john got killed or
6
with his wife
3
output which i want:
got shot
2
got killed or
3
with his wife
3
I don't know where I am making a mistake.
This is just a sample input; I have to check 20,000 inputs at a time.
You can use this negative lookahead regex:
>>> for i in re.findall(r'(?<=john)(?:(?!john).)*?(?=dead|died|death)', x):
...     print(i.strip())
...     print(len([word for word in i.split()]))
...
got shot
2
got killed or
3
with his wife
3
Instead of your .*?, this regex uses (?:(?!john).)*?, which lazily matches 0 or more characters while checking at every position that john does not start there, so a match can never run across another john.
I also suggest using word boundaries to make it match complete words:
re.findall(r'(?<=\bjohn\b)(?:(?!\bjohn\b).)*?(?=\b(?:dead|died|death)\b)', x)
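The difference between the two quantifiers is easiest to see on a tiny input. The string below is made up for illustration; it is not from the question.

```python
import re

s = 'john a john b dead'

# Plain lazy match: starts after the first john and runs across the second one.
naive = re.findall(r'(?<=john)(.*?)(?=dead)', s)

# Guarded match: each consumed character is checked so that it does not start
# another john, which forces the match to begin after the last john.
guarded = re.findall(r'(?<=john)((?:(?!john).)*?)(?=dead)', s)

print(naive)
print(guarded)
```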
I assume you want to start over when another john follows in your string before dead|died|death occurs.
Then, you can split your string by the word john and start matching on the resulting parts afterwards:
x = 'john got shot dead. john with his .... ? , john got killed or died in 1990. john with his wife dead or died'
x = re.sub(r'\W+', ' ', re.sub(r'[^\w ]', '', x)).strip()
for e in x.split('john'):
    m = re.match('(.+?)(dead|died|death)', e)
    if m:
        print(m.group(1))
        print(len(m.group(1).split()))
yields:
got shot
2
got killed or
3
with his wife
3
Also, note that after the replacements I propose here (before splitting and matching), the string looks like this:
john got shot dead john with his john got killed or died in 1990 john with his wife dead or died
I.e., there are no multiple whitespaces left in a sequence. You manage this by splitting by a whitespace later, but I feel this is a bit cleaner.