I have this list with part of speech tags and their specifics: ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'LID(bep,stan,rest)']. As you can see there are no spaces between the characters, so it can be seen as one word.
Now I need a new list with only the part of speech tags, like this ['VNW', 'WW', 'LID'].
I tried removing the brackets and everything in them with a regex like this: pattern = re.compile(r'(.*)').
I also tried to match only the capital letters, but I can't get it right. Suggestions?
A regular expression is not needed in this case. Split on '(' and take only the first part.
>>> 'VNW(pers,pron,nomin,red,2v,ev)'.split('(')
['VNW', 'pers,pron,nomin,red,2v,ev)']
>>> 'VNW(pers,pron,nomin,red,2v,ev)'.split('(')[0]
'VNW'
>>> xs = ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)',
...       'LID(bep,stan,rest)']
>>> [x.split('(')[0] for x in xs]
['VNW', 'WW', 'LID']
Some of the possible solutions are:
Removing Brackets using loop
l = ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'LID(bep,stan,rest)']
for i in range(len(l)):
    # positions of the opening and closing bracket
    i1, i2 = l[i].find('('), l[i].find(')')
    l[i] = l[i][:i1] + l[i][i2+1:]
print(l)
Using Regex
import re
pattern = r'\([^)]*\)'
l = ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'LID(bep,stan,rest)']
for i in range(len(l)):
    l[i] = re.sub(pattern, '', l[i])
print(l)
Output: ['VNW', 'WW', 'LID']
Short solution using str.find() function:
l = ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'LID(bep,stan,rest)']
result = [i[:i.find('(')] for i in l]
result contents:
['VNW', 'WW', 'LID']
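One caveat with the str.find() version: find() returns -1 when the string contains no '(', and slicing with [:-1] then silently drops the last character. A minimal sketch of a variant that tolerates such entries, using str.partition(). The bare 'ADJ' entry is a made-up example, not taken from the question:

```python
tags = ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'ADJ']  # 'ADJ' is hypothetical

# find() gives -1 for 'ADJ', so the slice becomes 'ADJ'[:-1]
with_find = [t[:t.find('(')] for t in tags]

# partition() returns the whole string as the first item when '(' is absent
with_partition = [t.partition('(')[0] for t in tags]

print(with_find)       # ['VNW', 'WW', 'AD']
print(with_partition)  # ['VNW', 'WW', 'ADJ']
```

If every tag is guaranteed to have brackets, the two behave the same.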
For example:
In [102]: s=['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'LID(bep,stan,rest)']
In [103]: [x.split('(', 1)[0] for x in s]
Out[103]: ['VNW', 'WW', 'LID']
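Since the question also asked about matching only the capital letters: the pattern r'(.*)' does not work because unescaped parentheses form a capturing group rather than matching literal brackets. A small sketch of the capital-letter idea with re.match(), assuming every tag starts with one or more ASCII capitals (an assumption based on the three sample tags):

```python
import re

tags = ['VNW(pers,pron,nomin,red,2v,ev)', 'WW(pv,tgw,met-t)', 'LID(bep,stan,rest)']

# re.match() only matches at the start of the string
leading_caps = re.compile(r'[A-Z]+')

pos = []
for tag in tags:
    m = leading_caps.match(tag)
    if m:  # skip entries that do not start with a capital letter
        pos.append(m.group())

print(pos)  # ['VNW', 'WW', 'LID']
```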
Related
I have a list like this:
my_list=["'-\\n'",
"'81\\n'",
"'-\\n'",
"'0913\\n'",
"'Assistant nursing\\n'",
"'0533\\n'",
"'0895 Astronomy\\n'",
"'0533\\n'",
"'Astrophysics\\n'",
"'0532\\n'"]
Is there any way to delete everything from this list except words?
Output:
my_list=['Assistant nursing',
'Astronomy',
'Astrophysics',]
I know, for example, that if I want to remove integers in string form I can do this:
no_integers = [x for x in my_list if not (x.isdigit()
               or x[0] == '-' and x[1:].isdigit())]
but it doesn't work well enough.
The non-regex solution:
You can start by stripping off the characters '-\\n, then keep only the characters that are alphabetic (using str.isalpha) or a space, then filter out the substrings that are empty ''. You may need to strip off the white space characters in the end, whic
>>> list(filter(lambda x: x!='', (''.join(j for j in i.strip('\'-\\\\n') if j.isalpha() or j==' ').strip() for i in my_list)))
['Assistant nursing', 'Astronomy', 'Astrophysics']
If you want to use regex, you can use the pattern '([A-Za-z].*?)\\\\n' with re.findall, then filter out the elements that are empty lists; finally, you can flatten the list.
>>> import re
>>> list(filter(lambda x: x, [re.findall('([A-Za-z].*?)\\\\n', i) for i in my_list]))
[['Assistant nursing'], ['Astronomy'], ['Astrophysics']]
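The flattening step mentioned above is not shown in the answer. One way to do it, as a sketch, is itertools.chain.from_iterable(). The pattern is the same one written as a raw string, and my_list here is a shortened copy of the list from the question:

```python
import re
from itertools import chain

my_list = ["'-\\n'", "'81\\n'", "'Assistant nursing\\n'", "'0895 Astronomy\\n'",
           "'Astrophysics\\n'", "'0532\\n'"]

# a letter, then lazily everything up to a literal backslash followed by n
pattern = re.compile(r'([A-Za-z].*?)\\n')

# findall() returns [] for entries without words, so chaining drops them
words = list(chain.from_iterable(pattern.findall(s) for s in my_list))
print(words)  # ['Assistant nursing', 'Astronomy', 'Astrophysics']
```

Because empty results contribute nothing to the chain, the separate filter step is no longer needed.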
With regular expressions:
import re
# my_list is the list from the question above
# remove the literal \n sequence, -, digits and ' symbols
my_new_list = [re.sub(r"\\n|[\d\-']", '', s).strip() for s in my_list]
# remove empty strings
my_new_list = [s for s in my_new_list if s != '']
print(my_new_list)
Output
['Assistant nursing', 'Astronomy', 'Astrophysics']
How to remove the part with "_" and numbers connected together in a string using Python?
For example,
Input: ['apple_3428','red_458','D30','green']
Expected output: ['apple','red','D30','green']
Thanks!
This should work:
my_list = ['apple_3428','red_458','D30','green']
new_list = []
for el in my_list:
    new_list.append(el.split('_')[0])
new_list will be ['apple', 'red', 'D30', 'green'].
Basically you split every element of my_list (which are supposed to be strings) and then you take the first, i.e. the part before the _. If _ is not present, the string will not be split.
Using regular expressions with re.sub:
import re
[re.sub(r"_\d+$", "", x) for x in ['apple_3428','red_458','D30','green']]
# ['apple', 'red', 'D30', 'green']
This will strip an underscore followed by only digits from the end of a string.
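The split and regex approaches give the same result on the sample input but differ once a name contains more than one underscore. A small sketch of the difference; 'green_apple_12' and 'v_2_final' are made-up inputs, not from the question:

```python
import re

items = ['apple_3428', 'D30', 'green_apple_12', 'v_2_final']  # last two are hypothetical

# keeps only the text before the first underscore
by_split = [x.split('_')[0] for x in items]

# removes only a trailing underscore-plus-digits suffix
by_regex = [re.sub(r'_\d+$', '', x) for x in items]

print(by_split)  # ['apple', 'D30', 'green', 'v']
print(by_regex)  # ['apple', 'D30', 'green_apple', 'v_2_final']
```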
I am not sure which behavior is needed, so here are a few options.
A list comprehension is also preferable to map + lambda and is more Pythonic; see "List comprehension vs map".
\d+ stands for at least one digit
\d* stands for zero or more digits
>>> import re
>>> list(map(lambda x: re.sub(r'_\d+$', '', x), ['green_', 'green_458aaa']))
['green_', 'green_458aaa']
>>> list(map(lambda x: re.sub(r'_\d*', '', x), ['green_', 'green_458aaa']))
['green', 'greenaaa']
>>> list(map(lambda x: re.sub(r'_\d+', '', x), ['green_', 'green_458aaa']))
['green_', 'greenaaa']
>>> list(map(lambda x: x.split('_', 1)[0], ['green_', 'green_458aaa']))
['green', 'green']
Try this:
output_list = [x.split('_')[0] for x in input_list]
input_list = ['apple_3428','red_458','D30','green']
output_list = []
for i in input_list:
    output_list.append(i.split('_', 1)[0])
You can simply split the string.
What is the best way to replace every string in the list?
For example if I have a list:
a = ['123.txt', '1234.txt', '654.txt']
and I would like to have:
a = ['123', '1234', '654']
Assuming that sample input is similar to what you actually have, use os.path.splitext() to remove file extensions:
>>> import os
>>> a = ['123.txt', '1234.txt', '654.txt']
>>> [os.path.splitext(item)[0] for item in a]
['123', '1234', '654']
Use a list comprehension as follows:
a = ['123.txt', '1234.txt', '654.txt']
answer = [item.replace('.txt', '') for item in a]
print(answer)
Output
['123', '1234', '654']
Assuming that all your strings end with '.txt', just slice the last four characters off.
>>> a = ['123.txt', '1234.txt', '654.txt']
>>> a = [x[:-4] for x in a]
>>> a
['123', '1234', '654']
This will also work if you have some weird names like 'some.txtfile.txt'
You could split on the . separator and take the first item:
In [486]: [x.split('.')[0] for x in a]
Out[486]: ['123', '1234', '654']
Another way to do this:
a = [x[: -len("txt")-1] for x in a]
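These approaches agree on the sample input but diverge on awkward names. A short comparison on 'some.txtfile.txt', the name mentioned in one of the answers; strip_txt is a hypothetical helper added here for illustration:

```python
import os

name = 'some.txtfile.txt'

print(name.replace('.txt', ''))    # 'somefile'      (every '.txt' is removed)
print(name.split('.')[0])          # 'some'          (everything after the first dot is lost)
print(name[:-4])                   # 'some.txtfile'
print(os.path.splitext(name)[0])   # 'some.txtfile'

# slicing blindly removes four characters, so guard it when the suffix may be absent
def strip_txt(s):
    return s[:-4] if s.endswith('.txt') else s

print(strip_txt('notes'))          # 'notes'
```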
What is the best way to replace every string in the list?
That completely depends on how you define 'best'. I, for example, like regular expressions:
import re
a = ['123.txt', '1234.txt', '654.txt']
answer = [re.sub(r'^(\w+)\..*', r'\g<1>', item) for item in a]
#print(answer)
#['123', '1234', '654']
Depending on the content of the strings, you could adjust it:
\w+ vs [0-9]+ for only digits
\..* vs \.txt if all strings end with .txt
data.colname = [item.replace('anythingtoreplace', 'desiredoutput') for item in data.colname]
Please note that here 'data' is the DataFrame and 'colname' is the name of a column in it. The same approach works if you want to remove spaces from a string or number. This was quite useful for me. It does not change the data type of the column, so you might have to do that separately if required.
I have a list of strings like such,
['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
Given a keyword list like ['for', 'or', 'and'], I want to parse the list into another list where, if a keyword occurs in a string, that string is split into multiple parts.
For example, the above set would be split into
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Currently I've split each inner string by underscore and have a for loop looking for an index of a key word, then recombining the strings by underscore. Is there a quicker way to do this?
>>> [re.split(r"_(?:f?or|and)_", s) for s in l]
[['happy_feet'],
['happy_hats', 'cats'],
['sad_fox', 'mad_banana'],
['sad_pandas', 'happy_cats', 'people']]
To combine them into a single list, you can use
result = []
for s in l:
    result.extend(re.split(r"_(?:f?or|and)_", s))
>>> pat = re.compile("_(?:%s)_"%"|".join(sorted(split_list,key=len)))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
will give you the desired output for the example dataset provided.
Actually, with the _ delimiters you don't really need to sort by length, so you could just do
>>> pat = re.compile("_(?:%s)_"%"|".join(split_list))
>>> list(itertools.chain.from_iterable(pat.split(line) for line in data))
You could use a regular expression:
from itertools import chain
import re
pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
result = list(chain.from_iterable(pattern.split(w) for w in input_list))
The pattern is dynamically created from your list of keywords. The string 'happy_hats_for_cats' is split on '_for_':
>>> re.split(r'_for_', 'happy_hats_for_cats')
['happy_hats', 'cats']
but because we actually produced a set of alternatives (using the | metacharacter) you get to split on any of the keywords:
>>> re.split(r'_(?:for|or|and)_', 'sad_pandas_and_happy_cats_for_people')
['sad_pandas', 'happy_cats', 'people']
Each split result gives you a list of strings (just one if there was nothing to split on); using itertools.chain.from_iterable() lets us treat all those lists as one long iterable.
Demo:
>>> from itertools import chain
>>> import re
>>> keywords = ['for', 'or', 'and']
>>> input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
>>> pattern = re.compile(r'_(?:{})_'.format('|'.join([re.escape(w) for w in keywords])))
>>> list(chain.from_iterable(pattern.split(w) for w in input_list))
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
Another way of doing this, using only built-in methods, is to replace every occurrence of the keywords in ['for', 'or', 'and'] in each string with a replacement string, for example _1_ (it could be any string), and then, at the end of each iteration, to split on this replacement string:
l = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana','sad_pandas_and_happy_cats_for_people']
replacement_s = '_1_'
lookup = ['for', 'or', 'and']
lookup = [x.join('_'*2) for x in lookup] #Changing to: ['_for_', '_or_', '_and_']
results = []
for i, item in enumerate(l):
    for s in lookup:
        if s in item:
            l[i] = l[i].replace(s, replacement_s)
    results.extend(l[i].split(replacement_s))
OUTPUT:
['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
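For comparison, here is a regex-free sketch closer to what the question describes (splitting on underscores and regrouping), using itertools.groupby(). The function name split_on_keywords is made up for this example. Unlike the '_for_' style patterns, it also drops a keyword that sits at the very start or end of a string:

```python
from itertools import groupby

def split_on_keywords(strings, keywords):
    """Split each underscore-separated string wherever a keyword token occurs."""
    stop = set(keywords)
    out = []
    for s in strings:
        tokens = s.split('_')
        # consecutive non-keyword tokens form one group; keyword groups are skipped
        for is_keyword, group in groupby(tokens, key=lambda t: t in stop):
            if not is_keyword:
                out.append('_'.join(group))
    return out

input_list = ['happy_feet', 'happy_hats_for_cats', 'sad_fox_or_mad_banana',
              'sad_pandas_and_happy_cats_for_people']
print(split_on_keywords(input_list, ['for', 'or', 'and']))
# ['happy_feet', 'happy_hats', 'cats', 'sad_fox', 'mad_banana', 'sad_pandas', 'happy_cats', 'people']
```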
Is there a better way to pull A and F from this: A13:F20
a="A13:F20"
import re
pattern = re.compile(r'\D+\d+\D+')
matches = re.search(pattern, a)
num = matches.group(0)
print(num[0])
print(num[-1])
output
A
F
note: the digits are of unknown length
You don't have to use regular expressions, or re at all. Assuming you want just letters to remain, you could do something like this:
a = "A13:F20"
a = ''.join(filter(lambda x: x.isalpha(), a))  # 'AF'
I'd do it like this:
>>> re.findall(r'[a-z]', a, re.IGNORECASE)
['A', 'F']
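If the references can have multi-letter columns, matching one letter at a time splits them apart. A small sketch with a + quantifier; 'AA13:AB200' is a made-up input, not from the question:

```python
import re

ref = "AA13:AB200"  # hypothetical range with multi-letter column names

single = re.findall(r'[a-z]', ref, re.IGNORECASE)
grouped = re.findall(r'[a-z]+', ref, re.IGNORECASE)

print(single)   # ['A', 'A', 'A', 'B']
print(grouped)  # ['AA', 'AB']
```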
Use a simple list comprehension as a filter to keep only the letters from the string.
print([char for char in input_string if char.isalpha()])
# ['A', 'F']
You could use re.sub:
>>> a="A13.F20"
>>> re.sub(r'[^A-Z]', '', a) # Remove everything apart from A-Z
'AF'
>>> re.sub(r'[A-Z]', '', a) # Remove A-Z
'13.20'
>>>
If you're working with strings that all have the same format, you can just cut out substrings:
a="A13:F20"
print(a[0], a[4])
More on python slicing in this answer:
Is there a way to substring a string in Python?
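Fixed indices such as a[0] and a[4] stop working when the number of digits varies, which the question says is the case. A minimal sketch that avoids both fixed positions and re, by splitting on ':' and stripping trailing digits; column_letters is a hypothetical helper name:

```python
import string

def column_letters(cell_range):
    """Return the letter part of each cell in a range such as 'A13:F20'."""
    # rstrip() removes any run of trailing digit characters from each cell
    return [cell.rstrip(string.digits) for cell in cell_range.split(':')]

print(column_letters("A13:F20"))     # ['A', 'F']
print(column_letters("A1:F123456"))  # ['A', 'F']
```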