Using regex to remove substrings from list items in python

I'm sure this must be a duplicate question, but I can't find an answer anywhere. I have a list of strings like this:
['>ctg7180000016561_3757\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561_3824\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561_4513\nT\n']
And all I want to do is remove the numbers after the underscore, so in this example the output would be:
['>ctg7180000016561\nAAAAATTTAGTTAAAACTATAACATTAGCTTGTCAAGCTAAAATTACTATGTAAGTAGTAATTTTTA\n', '>ctg7180000016561\nATCCCTCAAATAGCACCCATTAACTGATTATCCTTATTCTTAATATTCACCACCTCTCTCCTAATATTTAGAGCTTCTAACTATTTCTTTATCATGTACCCCCCCAAAAAATCTGTTTTTTATAAAAAAACTAGTATAAATAACTGATCATGATAACTAACCTCTTTTCGTCTTTCGACCCCTCTACTAACTTAAATACTAACTTTAACTGAGTTAGGACTATCCTCGGGGTGGCTGTAATCCCGAGGATATTTTGGATTATCCCCTCGCGTTTCTCCCTGCTTTGAATAAAACTTATCAGTACTCTTCACAAAGAATTCAAAGTCCTTGTTAACAACAAAAAATCCCAAGGCAGAACCCTAATCCTGATTTCCTTATTTTCTATTATTTTATTTAATAACTTCATAGGACTATTCCCATATATTTTCACATCCACAAGTCACATAGTATTAACCCTGTCCCTGGCTCTCCCCATATGACTAAGATTTATATTGTATGGGTGGGTAAATAATACAACCCACATGCTAGCCCATCTAGTACCCCAAGGAACCCCTGCCGTTCTAATACCATTTATGGTGTGTATTGAAACAATCAGAAATGTTATCCGACCCGGCACCCTGGCAATCCGGCTATCCGCAAATATAATTGCAGGACACCTACTAATAACCCTTCTAGGTAACACGGGAAAC\n', '>ctg7180000016561\nT\n']
I am using a regex and the pattern matches perfectly, but I can't work out how to actually remove the substrings. My code so far is:
pattern = re.compile('_[0-9]*')
for x in SequenceList:
    re.sub(pattern, '', x)
I'm aware that this is just changing the variable x, but even when I just print x within the for loop the pattern isn't removed. How do I actually remove the pattern and alter the list?
Thank you and sorry if this is already answered somewhere!

Strings are immutable, so re.sub returns a new string instead of modifying x, and your loop throws that result away. Instead, you can use a list comprehension to create a new list with the replaced strings, like this:
import re
pattern = re.compile(r"_\d+")
print([pattern.sub("", item) for item in SequenceList])
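To make the point about the discarded return value concrete, here is a minimal sketch. The name sequence_list and the shortened records are assumptions standing in for the question's data. It shows two options: building a new list, or overwriting the contents of the existing list with slice assignment.

```python
import re

# Shortened stand-in for the FASTA-style records in the question.
sequence_list = [
    ">ctg7180000016561_3757\nAAAAATTTAG\n",
    ">ctg7180000016561_4513\nT\n",
]

# "_" followed by one or more digits; "+" avoids matching a bare underscore.
pattern = re.compile(r"_\d+")

# Option 1: build a new list and leave the original untouched.
cleaned = [pattern.sub("", item) for item in sequence_list]

# Option 2: replace the contents of the existing list object in place.
sequence_list[:] = cleaned

print(cleaned)
```

Slice assignment is only needed if other code holds a reference to the same list object; otherwise rebinding the name to the new list is enough.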

Related

Convert list of string to dict - Remove extra comma [duplicate]

I am trying to create a dictionary from a list of strings. My attempt to convert this list of string to list of dictionary is as below:
author_dict = [[dict(map(str.strip, s.split(':')) for s in author_transform.split(','))] for author_transform in list_of_strings]
Everything was working fine until I encountered this piece of string:
[[country:United States,affiliation:University of Maryland, Baltimore County,name:tim oates,id:2217452330,gridid:grid.266673.0,affiliationid:79272384,order:2],........,[]]
As this string has an extra comma (,) in the middle of the intended value of the affiliation key, my list is getting split at the wrong place. Is there a way (or idea) I can use to avoid this kind of situation?
If it is not possible, any suggestions on how I can ignore this kind of list?

I would solve this by using a regular expression for splitting. This way you can split only on those commas that are followed by a colon without another comma in between.
In your code, replace
author_transform.split(',')
with
re.split(',(?=[^,]+:)', author_transform)
(And don’t forget to import re, of course.)
So, the whole code snippet becomes this:
author_dict = [
    [
        dict(map(str.strip, s.split(':'))
             for s in re.split(',(?=[^,]+:)', author_transform))
    ]
    for author_transform in list_of_strings
]
I took the liberty of reformatting the code, so the structure of the list comprehensions becomes clear.
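As a rough, self-contained check of the lookahead split, the sketch below runs it on one record from the question. The input is an assumption: the surrounding brackets are stripped, and the names field_separator and author_dicts are made up for illustration. It also builds a flat list of dicts rather than the nested list in the original snippet.

```python
import re

# Hypothetical input: one record from the question, without the surrounding brackets.
list_of_strings = [
    "country:United States,affiliation:University of Maryland, Baltimore County,"
    "name:tim oates,id:2217452330,gridid:grid.266673.0,affiliationid:79272384,order:2"
]

# Split only on commas that are followed by "key:" with no other comma in between.
field_separator = re.compile(r",(?=[^,]+:)")

author_dicts = [
    # split(":", 1) keeps any later colons inside the value.
    dict(map(str.strip, field.split(":", 1)) for field in field_separator.split(record))
    for record in list_of_strings
]

print(author_dicts[0]["affiliation"])
```

This still misfires if a value itself contains a comma followed by text and a colon, so it is a heuristic for this data shape rather than a general parser.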

How to check if a line contains a string in Python

I'm trying to check if a subString exists in a string using regular expression.
RE : re_string_literal = '^"[a-zA-Z0-9_ ]+"$'
The thing is, I don't want to match just any substring. I'm reading data from a file, and one of the lines has this text:
cout<<"Hello"<<endl;
I just want to check if there's a string inside the line and if yes, store it in a list.
I have tried the re.match method but it only works if we have to match a pattern, but in this case, I just want to check if a string exists or not, if yes, store it somewhere.
re_string_lit = '^"[a-zA-Z0-9_ ]+"$'
text = 'cout<<"Hello World!"<<endl;'
re.match(re_string_lit,text)
It doesn't output anything.
In simple words,
I just want to extract everything inside ""

If you just want to extract everything inside "", then string splitting would be a much simpler way of doing things.
>>> a = 'something<<"actualString">>something,else'
>>> b = a.split('"')[1]
>>> b
'actualString'
The above example only handles a single quoted string (no more than two double quotes) per line. To handle more, iterate over the pieces returned by split, where every odd-indexed piece is the content of a quoted string as long as the quotes are balanced, or apply a much simpler regular expression to each piece.

This worked for me:
re.search('"(.+?)"', 'cout<<"Hello"<<endl')
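The original attempt returns nothing because the ^ and $ anchors require the whole line to be a quoted string, and the character class does not include "!". A minimal sketch of collecting every quoted string from several lines follows; the sample lines and the names string_literal and found are assumptions for illustration.

```python
import re

# A double quote, then any run of non-quote characters (captured), then a closing quote.
string_literal = re.compile(r'"([^"]*)"')

# Hypothetical lines, standing in for the lines read from the file.
lines = [
    'cout<<"Hello World!"<<endl;',
    'int x = 0;',
    'cout<<"a"<<"b";',
]

found = []
for line in lines:
    # findall returns the captured group for every match, or [] if there is none.
    found.extend(string_literal.findall(line))

print(found)
```

This does not handle escaped quotes inside a string literal, which would need a more careful pattern.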

How can I join different segments of a list?

I'm having trouble in a school project because I don't know how to join elements of a list in segments. Here's an example: Let's say I have the following list:
list = ["T","h","i","s","I","s","A","L","i","s","t",]
How could I join this list so that the program outputs the following?:
Output: ["This","Is","A","List"]

Assuming list is your input, and without giving you the answer outright since it's a school project you should do yourself, here are some hints.
You'll want to check if a character is uppercase to know when the start of a word is. With python, you can use isupper() (ex: 'C'.isupper() would return True).
Python strings are iterable.
You can add a character to the end of a string using += (ex: myWord += 'a')
You can add a string to a list using append (ex: myList.append(myWord))
Remember this is a learning experience and there's no real value to being given the answer outright, if that's what you were hoping for. Best of luck and welcome to StackOverflow.

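You can use a regex for this:
import re
letters = ["T","h","i","s","I","s","A","L","i","s","t"]  # renamed from list to avoid shadowing the built-in
# The capturing group makes re.split return the matched words; the filter drops the empty strings in between
sep = [s for s in re.split("([A-Z][^A-Z]*)", ''.join(letters)) if s]
print(sep)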
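A possibly simpler variant of the same idea is re.findall, which returns the matches directly so no empty strings need filtering. This is a sketch, assuming the input always starts with an uppercase letter; the names letters and words are made up for the example.

```python
import re

letters = ["T", "h", "i", "s", "I", "s", "A", "L", "i", "s", "t"]

# One uppercase letter followed by any number of non-uppercase characters is one word.
words = re.findall(r"[A-Z][^A-Z]*", "".join(letters))

print(words)
```

Unlike the re.split version, any characters before the first uppercase letter are silently dropped here.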

Find two of the same character in a string with regular expressions

This is in reference to a question I asked before here
I received a solution to the problem in that question but ended up needing to go with regex for this particular part.
I need a regular expression to search a string for two identical vowels in a row, such as the "oo" in "took" or the "ee" in "bees", and replace each pair with a single copy of that vowel followed by a colon (:).
Some examples of expected behavior:
"took" should become "to:k"
"waaeek" should become "wa:e:k"
"raaag" should become "ra:ag"
Thank you for the help.

Try this:
re.sub(r'([aeiou])\1', r'\1:', str)
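Here is a small runnable sketch of the backreference approach applied to the examples from the question. The helper name mark_double_vowels and the extra word "read" are assumptions added for illustration.

```python
import re

# ([aeiou]) captures one vowel; \1 requires the very same vowel to follow it.
double_vowel = re.compile(r"([aeiou])\1")

def mark_double_vowels(word):
    """Replace each pair of identical vowels with one vowel plus a colon."""
    return double_vowel.sub(r"\1:", word)

for word in ["took", "waaeek", "raaag", "read"]:
    print(word, "->", mark_double_vowels(word))
```

The word "read" is left unchanged because "ea" is two different vowels, which is exactly what the backreference enforces.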

Search for ([aeiou])\1 and replace it with \1:
I don't know about python, but you should be able to make the regex case insensitive and global with something like /([aeiou])\1/gi

What NOT to do:
As noted, this will match any two vowels together. Leaving this answer as an example of what NOT to do. The correct answer (in this case) is to use backreferences as mentioned in numerous other answers.
import re
data = ["took", "waaeek", "raaag"]
for s in data:
    print(re.sub(r'([aeiou]){2}', r'\1:', s))
This matches exactly two occurrences ({2}) of any member of the set [aeiou] and replaces them with the last vowel captured by the parentheses (), inserted via \1, followed by a ':'. Because the two vowels need not be the same, it would also rewrite pairs such as "ae" or "ou".
Output:
to:k
wa:e:k
ra:ag

You'll need to use a back reference in your search expression. Try something like: ([a-z])+\1 (or ([a-z])\1 for just a double).

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?

The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-grouping matches, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
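The difference can be seen side by side in a short sketch. The sample text and variable names are assumptions; the addresses keep the '#' used throughout the question.

```python
import re

text = "To: support#company.com, 1234567#tickets.company.com"

# Capturing group: findall returns only what the group matched
# (an empty string when the optional group did not take part in the match).
with_group = re.findall(r"\w+#(tickets\.)?company\.com", text)

# Non-capturing group: findall returns the whole match.
without_group = re.findall(r"\w+#(?:tickets\.)?company\.com", text)

print(with_group)
print(without_group)
```

So the first address is in fact matched by the original pattern; findall just reports the empty group for it, which is easy to mistake for no match.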

I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'

Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True

There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
# Raw string so the backslashes reach the regex engine unchanged
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))
