find string of arbitrary length before a known string

find string of arbitrary length before a known string - python

Just say I have a string such as:
Lecture/NNP/B-NP/O delivered/VBD/B-VP/O at/IN/B-PP/B-PNP the/DT/B-NP/I-PNP UNESCO/NNP/I-NP/I-PNP House/NNP/I-NP/I-PNP in/IN/B-PP/B-PNP Paris/NNP-LOC/B-NP/I-PNP
I want to pull out every word which occurs before "/NNP/". This would mean my output is
Lecture, UNESCO, House
I tried:
print re.findall(r'/NNP/',string) then working backwards but I can't make it arbitrary. There is always a blank space leading the word which might help.
Edit: removed error in list.

Try this:
s = 'Lecture/NNP/B-NP/O delivered/VBD/B-VP/O at/IN/B-PP/B-PNP the/DT/B-NP/I-PNP UNESCO/NNP/I-NP/I-PNP House/NNP/I-NP/I-PNP in/IN/B-PP/B-PNP Paris/NNP-LOC/B-NP/I-PNP'
re.findall(r'(\S+)/NNP/', s)
=> ['Lecture', 'UNESCO', 'House']

Forward lookahead.
>>> re.findall('(?:\s|^)[^/]+(?=/NNP/)', 'Lecture/NNP/B-NP/O delivered/VBD/B-VP/O at/IN/B-PP/B-PNP the/DT/B-NP/I-PNP UNESCO/NNP/I-NP/I-PNP House/NNP/I-NP/I-PNP in/IN/B-PP/B-PNP Paris/NNP-LOC/B-NP/I-PNP')
['Lecture', 'UNESCO', 'House']

Related

how do i use string.replace() to replace only when the string is exactly matching

I have a dataframe with a list of poorly spelled clothing types. I want them all in the same format , an example is i have "trous" , "trouse" and "trousers", i would like to replace the first 2 with "trousers".
I have tried using string.replace but it seems its getting the first "trous" and changing it to "trousers" as it should and when it gets to "trouse", it works also but when it gets to "trousers" it makes "trousersersers"! i think its taking the strings which contain trous and trouse and trousers and changing them.
Is there a way i can limit the string.replace to just look for exactly "trous".
here's what iv troied so far, as you can see i have a good few changes to make, most of them work ok but its the likes of trousers and t-shirts which have a few similar changes to be made thats causing the upset.
newTypes=[]
for string in types:
underwear = string.replace(('UNDERW'), 'UNDERWEAR').replace('HANKY', 'HANKIES').replace('TIECLI', 'TIECLIPS').replace('FRAGRA', 'FRAGRANCES').replace('ROBE', 'ROBES').replace('CUFFLI', 'CUFFLINKS').replace('WALLET', 'WALLETS').replace('GIFTSE', 'GIFTSETS').replace('SUNGLA', 'SUNGLASSES').replace('SCARVE', 'SCARVES').replace('TROUSE ', 'TROUSERS').replace('SHIRT', 'SHIRTS').replace('CHINO', 'CHINOS').replace('JACKET', 'JACKETS').replace('KNIT', 'KNITWEAR').replace('POLO', 'POLOS').replace('SWEAT', 'SWEATERS').replace('TEES', 'T-SHIRTS').replace('TSHIRT', 'T-SHIRTS').replace('SHORT', 'SHORTS').replace('ZIP', 'ZIP-TOPS').replace('GILET ', 'GILETS').replace('HOODIE', 'HOODIES').replace('HOODZIP', 'HOODIES').replace('JOGGER', 'JOGGERS').replace('JUMP', 'SWEATERS').replace('SWESHI', 'SWEATERS').replace('BLAZE ', 'BLAZERS').replace('BLAZER ', 'BLAZERS').replace('WC', 'WAISTCOATS').replace('TTOP', 'T-SHIRTS').replace('TROUS', 'TROUSERS').replace('COAT', 'COATS').replace('SLIPPE', 'SLIPPERS').replace('TRAINE', 'TRAINERS').replace('DECK', 'SHOES').replace('FLIP', 'SLIDERS').replace('SUIT', 'SUITS').replace('GIFTVO', 'GIFTVOUCHERS')
newTypes.append(underwear)
types = newTypes

Assuming you're okay with not using string.replace(), you can simply do this:
lst = ["trousers", "trous" , "trouse"]
for i in range(len(lst)):
if "trous" in lst[i]:
lst[i] = "trousers"
print(lst)
# Prints ['trousers', 'trousers', 'trousers']
This checks if the shortest substring, trous, is part of the string, and if so converts the entire string to trousers.

Use a dict for string to be replaced:
d={
'trous': 'trouser',
'trouse': 'trouser',
# ...
}
newtypes=[d.get(string,string) for string in types]
d.get(string,string) will return string if string is not in d.

Python list in double index How to split it ety.origin list

Hi Everyone I have python to find origin of a word so I got result in list's How I want it to separate or split it with comma (,).
origin=ety.origins(wordtodo)
print(origin)
>>[Word(how, Middle English (1100-1500) [enm]), Word(haugr, Old Norse [non])]
in the result I want text inside (...) braket's and store into different variable
e.g.
forigin=(how, Middle English (1100-1500) [enm])
and
sorigin=(haugr, Old Norse [non])

forigin = repr(origin[0])[4:]
sorigin = repr(origin[1])[4:]

Author of ety here 🙂
ety.origins returns a list of Word objects.
Get properties of a Word with specific fields .word and .language, or use .pretty to get a string version of the word/language in format {word} ({lang}) - e.g. 'how (Middle English (1100-1500))'

Python string grouping?

Basically, I print a long message but I want to group all of those words into 5 character long strings.
For example "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner." I want to make that
"iPhon 6isn' tsimp lybig ger-i t'sbe terri never yway. Large r,yet drama tical lythi nner. "

As suggested by #vaultah, this is achieved by splitting the string by a space and joining them back without spaces; then using a for loop to append the result of a slice operation to an array. An elegant solution is to use a comprehension.
text = "iPhone 6 isn’t simply bigger — it’s better in every way. Larger, yet dramatically thinner."
joined_text = ''.join(text.split())
splitted_to_six = [joined_text[char:char+6] for char in range(0,len(joined_text),6)]
' '.join(splitted_to_six)
I'm sure you can use the re module to get back dashes and apostrophes as they're meant to be

Simply do the following.
import re
sentence="iPhone 6 isn't simply bigger - it's better in every way. Larger, yet dramatically thinner."
sentence = re.sub(' ', '', sentence)
count=0
new_sentence=''
for i in sentence:
if(count%5==0 and count!=0):
new_sentence=new_sentence+' '
new_sentence=new_sentence+i
count=count+1
print new_sentence
Output:
iPhon e6isn 'tsim plybi gger- it'sb etter ineve ryway .Larg er,ye tdram atica llyth inner .

Replace string is not changing value

I am trying to replace any i's in a string with capital I's. I have the following code:
str.replace('i ','I ')
However, it does not replace anything in the string. I am looking to include a space after the I to differentiate between any I's in words and out of words.
Thanks if you can provide help!
The exact code is:
new = old.replace('i ','I ')
new = old.replace('-i-','-I-')

new = old.replace('i ','I ')
new = old.replace('-i-','-I-')
You throw away the first new when you assign the result of the second operation over it.
Either do
new = old.replace('i ','I ')
new = new.replace('-i-','-I-')
or
new = old.replace('i ','I ').replace('-i-','-I-')
or use regex.

I think you need something like this.
>>> import re
>>> s = "i am what i am, indeed."
>>> re.sub(r'\bi\b', 'I', s)
'I am what I am, indeed.'
This only replaces bare 'i''s with I, but the 'i''s that are part of other words are left untouched.
For your example from comments, you may need something like this:
>>> s = 'i am sam\nsam I am\nThat Sam-i-am! indeed'
>>> re.sub(r'\b(-?)i(-?)\b', r'\1I\2', s)
'I am sam\nsam I am\nThat Sam-I-am! indeed'

strip sides of a string in python

I have a list like this:
Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]
I want to strip the unwanted characters using python so the list would look like:
Tomato
Populus trichocarpa
I can do the following for the first one:
name = ">Tomato4439"
name = name.strip(">1234567890")
print name
Tomato
However, I am not sure what to do with the second one. Any suggestion would be appreciated.

given:
s='Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]'
this:
s = s.split()
[s[0].strip('0123456789,'), s[-2].replace('[',''), s[-1].replace(']','')]
will give you
['Tomato', 'Populus', 'trichocarpa']
It might be worth investigating regular expressions if you are going to do this frequently and the "rules" might not be that static as regular expressions are much more flexible dealing with the data in that case. For the sample problem you present though, this will work.

import re
a = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
re.sub(r"^([A-Za-z]+).+\[([^]]+)\]$", r"\1 \2", a)
This gives
'Tomato Populus trichocarpa'

If the strings you're trying to parse are consistent semantically, then your best option might be classifying the different "types" of strings you have, and then creating regular expressions to parse them using python's re module.

>>> import re
>>> line = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
>>> match = re.match("^([a-zA-Z]+).*\[([a-zA-Z ]+)\].*",line)
>>> match.groups()
('Tomato', 'Populus trichocarpa')
edited to not include the [] on the 2nd part... this should work for any thing that matches the pattern of your query (eg starts with name, ends with something in []) it would also match
"Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa apples]" for example

Previous answers were simpler than mine, but:
Here is one way to print the stuff that you don't want.
tag = "Tomato4439, >gi|224089052|ref|XP_002308615.1| predicted protein [Populus trichocarpa]"
import re, os
find = re.search('>(.+?) \[', tag).group(1)
print find
Gives you
gi|224089052|ref|XP_002308615.1| predicted protein
Then you can use the replace function to remove that from the original string. And the translate function to remove the extra unwanted characters.
tag2 = tag.replace(find, "")
tag3 = str.translate(tag2, None, ">[],")
print tag3
Gives you
Tomato4439 Populus trichocarpa

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

find string of arbitrary length before a known string - python

Try this: s = 'Lecture/NNP/B-NP/O delivered/VBD/B-VP/O at/IN/B-PP/B-PNP the/DT/B-NP/I-PNP UNESCO/NNP/I-NP/I-PNP House/NNP/I-NP/I-PNP in/IN/B-PP/B-PNP Paris/NNP-LOC/B-NP/I-PNP' re.findall(r'(\S+)/NNP/', s) => ['Lecture', 'UNESCO', 'House']

Forward lookahead. >>> re.findall('(?:\s|^)[^/]+(?=/NNP/)', 'Lecture/NNP/B-NP/O delivered/VBD/B-VP/O at/IN/B-PP/B-PNP the/DT/B-NP/I-PNP UNESCO/NNP/I-NP/I-PNP House/NNP/I-NP/I-PNP in/IN/B-PP/B-PNP Paris/NNP-LOC/B-NP/I-PNP') ['Lecture', 'UNESCO', 'House']

Related

how do i use string.replace() to replace only when the string is exactly matching

Python list in double index How to split it ety.origin list

Python string grouping?

Replace string is not changing value

strip sides of a string in python

Categories

Resources