splitting strings after certain characters

I would like to split my string after certain characters are found.
identifier = filecontent_id[0].split("SV=")[0]
I have this, but it "deletes" everything from "SV=" onward, and I would like it to "delete" everything from one character after "SV=" onward. For example, it would "delete" everything after "SV=1", but I did not hard-code the 1 because the value is not always 1. The string is:
>tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ
and I am trying to only get:
>tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1

A regex might be better, but the below works
SPLIT = "SV="
line = ">tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ"
# Keep the text before "SV=", the marker itself, and the first character after it
print(line.split(SPLIT)[0] + SPLIT + line.split(SPLIT)[1][0])
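A regex version might look like the sketch below. It assumes the value after "SV=" is always one or more digits, which the question does not state outright, so unlike the split approach it would also keep a multi-digit version such as "SV=12".

```python
import re

line = (">tr|A0A024RAP8|A0A024RAP8_HUMAN HCG2009644, isoform CRA_b "
        "OS=Homo sapiens GN=KLRC4-KLRK1 PE=4 SV=1MGWIRGRRSRHSWEMSEFHNYNLDLKKSDFSTRWQ")

# Lazily match up to the first "SV=", then take the digits that follow it
# (assumption: the version number consists only of digits).
match = re.match(r".*?SV=\d+", line)
header = match.group(0) if match else line  # fall back to the whole line
print(header)
```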

Related

How to remove a specific pattern from re.findall() results

I have a re.findall() searching for a pattern in python, but it returns some undesired results and I want to know how to exclude them. The text is below, I want to get the names, and my statement (re.findall(r'([A-Z]{4,} \w. \w*|[A-Z]{4,} \w*)', text)) is returning this:
'ERIN E. SCHNEIDER',
'MONIQUE C. WINKLER',
'JASON M. HABERMEYER',
'MARC D. KATZ',
'JESSICA W. CHAN',
'RAHUL KOLHATKAR',
'TSPU or taken',
'TSPU or the',
'TSPU only',
'TSPU was',
'TSPU and']
I want to get rid of the "TSPU" pattern items. Does anyone know how to do it?
JINA L. CHOI (NY Bar No. 2699718)
ERIN E. SCHNEIDER (Cal. Bar No. 216114) schneidere#sec.gov
MONIQUE C. WINKLER (Cal. Bar No. 213031) winklerm#sec.gov
JASON M. HABERMEYER (Cal. Bar No. 226607) habermeyerj#sec.gov
MARC D. KATZ (Cal. Bar No. 189534) katzma#sec.gov
JESSICA W. CHAN (Cal. Bar No. 247669) chanjes#sec.gov
RAHUL KOLHATKAR (Cal. Bar No. 261781) kolhatkarr#sec.gov
The Investor Solicitation Process Generally Included a Face-to-Face Meeting, a Technology Demonstration, and a Binder of Materials [...]

You can use
\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?
See this regex demo. Details:
\b - a word boundary (else, the regex may "catch" a part of a word that contains TSPU)
(?!TSPU\b) - a negative lookahead that fails the match if there is TSPU string followed with a non-word char or end of string immediately to the right of the current location
[A-Z]{4,} - four or more uppercase ASCII letters
(?:(?:\s+\w\.)?\s+\w+)? - an optional occurrence of:
  (?:\s+\w\.)? - an optional occurrence of one or more whitespaces, a word char and a literal . char
  \s+ - one or more whitespaces
  \w+ - one or more word chars.
In Python, you can use
re.findall(r'\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?', text)
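As a quick check, the pattern can be run against a few lines shaped like the text in the question. The sample string here is made up for illustration and is not the asker's full document.

```python
import re

# Illustrative sample only: two name lines plus a sentence mentioning TSPU.
text = (
    "ERIN E. SCHNEIDER (Cal. Bar No. 216114)\n"
    "RAHUL KOLHATKAR (Cal. Bar No. 261781)\n"
    "The TSPU was shown and TSPU only."
)

pattern = r"\b(?!TSPU\b)[A-Z]{4,}(?:(?:\s+\w\.)?\s+\w+)?"
names = re.findall(pattern, text)
print(names)  # ['ERIN E. SCHNEIDER', 'RAHUL KOLHATKAR']
```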

You can do some simple filtering. If your list is called results:
removed_TSPU_results = list(filter(lambda result: not result.startswith("TSPU"), results))

SAS/Python: Find any spaces followed by a non-space string and replace space with a different value

I have data that looks like this:
1937 Paredes 3-1
1939 Suazo 2-0
1941 Fernandez 4-0
1944 Wilchez 2-1
…
2017 Miralles 5-7
I want to read each line as a line of text. Find any instance of a space followed by a number, character, or any non-space symbol. Replace the space that precedes that number, character, or any non-space symbol with a "|" in the following manner:
1937 |Paredes |3-1
1939 |Suazo |2-0
1941 |Fernandez |4-0
1944 |Wilchez |2-1
...
2017 |Miralles |5-7
Any idea how to do this in SAS or Python?

You might use re.sub matching a space and assert a non whitespace char to the right:
import re
test_str = ("1937 Paredes 3-1\n\n"
            "1939 Suazo 2-0\n\n"
            "1941 Fernandez 4-0\n\n"
            "1944 Wilchez 2-1")
# Replace each space that is directly followed by a non-whitespace character
result = re.sub(r" (?=\S)", "|", test_str)
if result:
    print(result)
Output
1937|Paredes|3-1
1939|Suazo|2-0
1941|Fernandez|4-0
1944|Wilchez|2-1
Or match a run of one or more whitespace chars other than newlines:
result = re.sub(r"[^\S\r\n]+(?=\S)", "|", test_str)
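The two patterns differ only when columns are separated by more than one space. A small sketch, using a made-up line with extra spaces:

```python
import re

# Hypothetical input: three spaces after the year, two after the name.
line = "1937   Paredes  3-1"

# Only the last space of each run is followed by a non-space, so the
# earlier spaces in the run are kept.
single = re.sub(r" (?=\S)", "|", line)

# The whole run of non-newline whitespace is replaced by one "|".
collapsed = re.sub(r"[^\S\r\n]+(?=\S)", "|", line)

print(single)     # 1937  |Paredes |3-1
print(collapsed)  # 1937|Paredes|3-1
```

The first form matches the "1937 |Paredes |3-1" shape asked for in the question; the second drops the padding entirely.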

I don't understand the need to preserve the other spaces. Why not just remove them all?
data _null_;
infile 'have.txt' truncover;
file 'want.txt' dsd dlm='|';
input (var1-var3) (:$100.);
put var1-var3;
run;
Results
1937|Paredes|3-1
1939|Suazo|2-0
1941|Fernandez|4-0
1944|Wilchez|2-1
2017|Miralles|5-7
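The same "remove all the spaces" idea can be sketched in Python by splitting each line on whitespace and joining the pieces with "|". This assumes no field contains a space of its own; the rows are copied from the question.

```python
rows = [
    "1937 Paredes 3-1",
    "1939 Suazo 2-0",
    "2017 Miralles 5-7",
]

# str.split() with no argument splits on any run of whitespace,
# so repeated spaces between columns collapse automatically.
piped = ["|".join(row.split()) for row in rows]
for row in piped:
    print(row)
```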

Regular expression in Python, 2-3 numbers then 2 letters

I am trying to do autodetection of bra size in a list of clothes. While I managed to extract only the bra items, I am now looking at extracting the size information and I think I am almost there (thanks to the stackoverflow community). However, there is a particular case that I could not find on another post.
I am using:
regexp = re.compile(r' \d{2,3} ?[a-fA-F]([^bce-zBCE-Z]|$)')
So
Possible white space if not at the beginning of the description
two or three numbers
Another possible white space or not
Any letters (lower or upper case) between A and F
and then another letter for the two special cases AA and FF, or the end of the string.
My question is: is there a way to require the second letter to match the first letter (AA or FF)? In my case, my code outputs some BA and CA sizes, which do not exist.
Examples:
Not working:
"bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain" return "42 ba" instead of not found
"puma, sport-bh, strl: 34cd, svart/grå", I guess the customer meant c/d
Working fine:
"victoria's secret, bh, strl: 32c, gul/vit" returns "32 c"
"pink victorias secret bh 75dd burgundy" returns "75 dd"
Thanks!

You might use
\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])
Explanation
\d{2,3} ? Match 2-3 digits and an optional space
([a-fA-F])\1? Capture a-fA-F in group 1 followed by an optional backreference to group 1
(?![a-fA-F]) Negative lookahead, assert what is on the right is not a-fA-F
Regex demo
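A short sketch of how this pattern behaves on the four descriptions from the question. The variable names are illustrative only.

```python
import re

size_re = re.compile(r"\d{2,3} ?([a-fA-F])\1?(?![a-fA-F])")

samples = [
    "bh sexig top matchande h&m genomskinligt parti svart detaljer 42 basic plain",
    "puma, sport-bh, strl: 34cd, svart/grå",
    "victoria's secret, bh, strl: 32c, gul/vit",
    "pink victorias secret bh 75dd burgundy",
]

# Collect the matched size text, or None when no size is found.
found = []
for description in samples:
    m = size_re.search(description)
    found.append(m.group(0) if m else None)

print(found)  # [None, None, '32c', '75dd']
```

Note that the backreference \1 is case-sensitive, so a mixed-case pair such as "Dd" is not treated as a double letter.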

Regex to match strings in quotes that contain only 3 or less capitalized words

I've searched and searched, but can't find any relief for my regex woes.
I wrote the following dummy sentence:
Watch Joe Smith Jr. and Saul "Canelo" Alvarez fight Oscar de la Hoya and Genaddy Triple-G Golovkin for the WBO belt GGG. Canelo Alvarez and Floyd 'Money' Mayweather fight in Atlantic City, New Jersey. Conor MacGregor will be there along with Adonis Superman Stevenson and Mr. Sugar Ray Robinson. "Here Goes a String". 'Money Mayweather'. "this is not a-string", "this is not A string", "This IS a" "Three Word String".
I'm looking for a regular expression that will return the following when used in Python 3.6:
Canelo, Money, Money Mayweather, Three Word String
The regex that has gotten me the closest is:
(["'])[A-Z](\\?.)*?\1
I want it to only match strings of 3 capitalized words or less immediately surrounded by single or double quotes. Unfortunately, so far it seems to match any string in quotes, no matter what the length, no matter what the content, as long as it begins with a capital letter.
I've put a lot of time into trying to hack through it myself, but I've hit a wall. Can anyone with stronger regex kung-fu give me an idea of where I'm going wrong here?

Try to use this one: (["'])((?:[A-Z][a-z]+ ?){1,3})\1
(["']) - opening quote
((?:[A-Z][a-z]+ ?){1,3}) - capitalized word repeated 1 to 3 times, separated by spaces (captured as group 2)
[A-Z] - capital char (word beginning char)
[a-z]+ - non-capital chars (end of word)
_? - optional space separator between capitalized words (_ stands for a space); it is optional so that a single word needs no trailing space
{1,3} - 1 to 3 times
\1 - closing quote, same as opening
Group 2 is what you want.
Match 1
Full match 29-37 `"Canelo"`
Group 1. 29-30 `"`
Group 2. 30-36 `Canelo`
Match 2
Full match 146-153 `'Money'`
Group 1. 146-147 `'`
Group 2. 147-152 `Money`
Match 3
Full match 318-336 `'Money Mayweather'`
Group 1. 318-319 `'`
Group 2. 319-335 `Money Mayweather`
Match 4
Full match 398-417 `"Three Word String"`
Group 1. 398-399 `"`
Group 2. 399-416 `Three Word String`
RegEx101 Demo: https://regex101.com/r/VMuVae/4
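In Python, the same pattern could be used roughly as follows. Because it has two capture groups, re.findall would return tuples, so this sketch uses re.finditer and reads group 2. The text is a shortened version of the dummy sentence from the question.

```python
import re

# Shortened version of the dummy sentence from the question.
txt = (
    'Saul "Canelo" Alvarez and Floyd \'Money\' Mayweather. '
    '"Here Goes a String". \'Money Mayweather\'. '
    '"this is not A string", "This IS a" "Three Word String".'
)

pattern = re.compile(r"""(["'])((?:[A-Z][a-z]+ ?){1,3})\1""")

# Group 1 is the quote character, group 2 the quoted words.
quoted = [m.group(2) for m in pattern.finditer(txt)]
print(quoted)  # ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
```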

Working with the text you've provided, I would try to use regular expression lookaround to get the words surrounded by quotes and then apply some conditions on those matches to determine which ones meet your criterion. The following is what I would do:
[p for p in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt) if all(x.istitle() for x in p.split(' ')) and len(p.split(' ')) <= 3]
txt is the text you've provided here. The output is the following:
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']
Cleaner:
matches = []
for m in re.findall(r'(?<=[\'"])[\w ]{2,}(?=[\'"])', txt):
    # keep only candidates made of at most 3 title-cased words
    if all(x.istitle() for x in m.split(' ')) and len(m.split(' ')) <= 3:
        matches.append(m)
print(matches)
# ['Canelo', 'Money', 'Money Mayweather', 'Three Word String']

Here's my go at it: ([\"'])(([A-Z][^ ]*? ?){1,3})\1

How to write multiple lines in a single line?

How can I write multiple lines in a single line? My inputs are like this:
HOXC11
HOXC11, HOX3H, MGC4906
human, Homo sapiens
HOXB6
HOXB6, HOX2, HU-2, HOX2B, Hox-2.2
human, Homo sapiens
HOXB13
HOXB13
human, Homo sapiens
PAX5
PAX5, BSAP
human, Homo sapiens
I need to make it into a single line like this:
HOXC11 HOXC11, HOX3H, MGC4906 human, Homo sapiens
HOXB6 HOXB6, HOX2, HU-2, HOX2B, Hox-2.2 human, Homo sapiens
HOXB13 HOXB13 human, Homo sapiens

Assuming your input is from a file, let's call it homosapiens.txt, you can go from the specified input to the desired output as follows:
with open('homosapiens.txt', 'r') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so the comparison can succeed
        if line == 'human, Homo sapiens':
            print(line)  # this will print and go to a newline
        elif line:
            print(line, end=' ')  # end=' ' suppresses the newline
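The same idea can be sketched as a function that works on any iterable of lines rather than a file. The name join_records and its terminator parameter are made up for this example; it assumes every record ends with the "human, Homo sapiens" line, as in the question's input.

```python
def join_records(lines, terminator="human, Homo sapiens"):
    """Join consecutive lines into one record, closing it at `terminator`."""
    records, current = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        current.append(line)
        if line == terminator:
            records.append(" ".join(current))
            current = []
    if current:  # keep a trailing record that has no terminator
        records.append(" ".join(current))
    return records


sample = """HOXC11
HOXC11, HOX3H, MGC4906
human, Homo sapiens
PAX5
PAX5, BSAP
human, Homo sapiens
"""

joined = join_records(sample.splitlines())
for record in joined:
    print(record)
```

Returning a list instead of printing directly makes the result easy to write to a file or check in a test.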

textInput = textInput.rstrip('\n')
