Edit lines in a list

I have read my text file (about 10 lines) into a list. Now I want to cut off the first part of each line.
To be precise, the first 5 words should be cut off.
How exactly do I do this?
Edit:
I read my text file like this:
with open("test.txt", "r") as file:
    list = []
    for line in file:
        list += [line.strip()]
print(list)
If I only have one line (a single string), this works for me:
newlist = " ".join(list.split(" ")[5:])
print(newlist)
But how can I do this with a list of 10 lines?

Python strings have a split() method that splits a string on a delimiter; by default, it splits at every whitespace run. Then, to cut the first 5 words, you can either copy all the other items in the list starting from index 5, or delete indexes 0-4 from the list.

Perhaps something along the lines of:
text = []
with open('input.txt') as f:
    for l in f.readlines():
        words = l.split()
        text.append(words[5:])
Obviously you should do all sorts of error checking here but the gist should be there.
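If you want each trimmed line back as a single string rather than a list of words, you can rejoin the slice; a minimal sketch on inline sample lines (the actual file contents are assumed):

```python
# Drop the first five words of each line, then rejoin the rest
# into a single string per line.
lines = [
    "one two three four five keep this part",
    "a b c d e and the rest stays",
]
trimmed = [" ".join(line.split()[5:]) for line in lines]
print(trimmed)  # ['keep this part', 'and the rest stays']
```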

If you want to remove all first 5 words from all lines in the file, you don't have to read it line by line and then split it.
You can read the whole file, then use re.sub to remove the first 5 words surrounded by spaces.
import re

pattern = r"(?m)^[^\S\n]*(?:\S+[^\S\n]+){4}\S+[^\S\n]*"
with open('test.txt') as f:
    print(re.sub(pattern, "", f.read()))
The pattern matches:
(?m) Inline modifier that turns on multiline mode
^ Start of line (because of the multiline flag)
[^\S\n]* Match optional leading whitespace excluding newlines
(?:\S+[^\S\n]+){4} Repeat 4 times: 1+ non-whitespace chars followed by 1+ spaces
\S+ Match 1+ non-whitespace chars
[^\S\n]* Match optional trailing whitespace excluding newlines
See a regex101 demo for the matches.
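As a quick sanity check, the same substitution can be run on an inline sample (two made-up lines) instead of test.txt:

```python
import re

# The (?m) flag makes ^ match at the start of every line, so the
# first five words of each line are stripped in one re.sub call.
pattern = r"(?m)^[^\S\n]*(?:\S+[^\S\n]+){4}\S+[^\S\n]*"
sample = "one two three four five rest of line\nsix seven eight nine ten tail"
result = re.sub(pattern, "", sample)
print(result)  # "rest of line\ntail"
```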


regex multiline matching in python

I want to filter for 'here is a sample' and all the lines afterwards, until 2 newlines:
Here is my file (you can use it as a logfile):
here is a sample text
random line1
here is a sample text
random line2
random line3
random line4
should not match
random line 6
here is a sample
random line 5
I tried:
\r?\n?(here is a sample).*\r?\n?(.*)
With that I only capture the next line; each time I repeat the last part '\r?\n?(.*)' I get one more line.
My question: what regex do I need in order to match all lines until I see 2 newlines?
If you want to match all lines until you have 2 newlines, but also want to match the last occurrence if there are no 2 newlines:
^here is a sample.*(?:\n(?!\n).*)*
The pattern matches:
^ Start of line (with the multiline flag)
here is a sample.* Match literally, then the rest of the line
(?: Non capture group to repeat as a whole part
\n(?!\n) Match a newline and assert that it is not directly followed by another newline
.* Match the rest of the line
)* Close the non capture group and optionally repeat it
Regex demo
If there should be 2 newlines present, you can use a capture group for the part that you want to keep, and match the 2 newlines to make sure that they are present.
^(here is a sample.*(?:\n(?!\n).*)*)\n\n
Regex demo
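A short sketch of how the first pattern could be applied in Python, on a shortened version of the sample log (re.MULTILINE is assumed so that ^ matches at the start of every line):

```python
import re

log = ("here is a sample text\n"
       "random line1\n"
       "\n"
       "should not match\n"
       "here is a sample\n"
       "random line 5")
# Each match runs from a 'here is a sample' line down to (but not
# including) the first blank line, or to the end of the string.
blocks = re.findall(r"^here is a sample.*(?:\n(?!\n).*)*", log, re.MULTILINE)
print(blocks[0])  # here is a sample text\nrandom line1
print(blocks[1])  # here is a sample\nrandom line 5
```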

how to write a regular expression to match a small part of a repeating pattern?

I have the following pattern to match :
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
For some context, it's part of a larger file, which contains many similar patterns separated by commas:
(10,'more random stuff 21325','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page'),
(11,'more random stuff 1nyny5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','subcat'),
(14,'more random stuff 21dd5','random stuff','2014-10-26 04:50:23','','uca-default-u-kn','page')
my goal is to ditch all patterns ending with 'page' and to keep the rest. For that, I'm trying to use
regular expressions to identify those patterns. Here is the one I came up with so far:
"\(.*?,\'page\'\)"
However, it's not working as expected.
In the following python code, I use this regex, and replace every match with an empty string :
import re
txt = "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),"
txt += "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),"
txt += "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),"
txt += "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),"
txt += "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),"
txt += "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),"
txt += "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
new_txt = re.sub("\(.*?,\'page\'\)", "",txt)
I was expecting that new_txt would contain all patterns ending with 'subcat' and remove all
patterns ending with 'page'; however, I obtain:
new_txt = ,,,,
What's happening here ? How can I change my regex to obtain the desired result ?
We might be tempted to do a regex replacement here, but that would basically always leave open edge cases, as @Wiktor has correctly pointed out in a comment below. Instead, a more foolproof approach is to use re.findall and simply extract every tuple which does not end in 'page'. Here is an example:
parts = re.findall(r"\(\d+,'[^']*?'(?:,'[^']*?'){4},'(?!page')[^']*?'\),?", txt)
print(''.join(parts))
This prints:
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),(15,'Anti-fascism','DL.8:NB�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
The regex pattern used above just matches a leading number, followed by 5 singly quoted terms, and then a sixth singly quoted term which is not 'page'. Then, we join the tuples in the output list to form a string.
What happens is that you concatenate the strings into one long line, and each (lazy) match of \(.*?,'page'\) then runs up to the next occurrence of ,'page'), swallowing everything in between and leaving only the trailing commas.
Another workaround might be using a list of the strings, and join them with a newline instead of concatenating them.
Then use your pattern matching an optional comma and newline at the end to remove the line, leaving the ones that end with subcat
import re

lines = [
    "(10,'Redirects_from_moves','*..2NN:,#2.FBHRP:D6ܽ�','2014-10-26 04:50:23','','uca-default-u-kn','page'),",
    "(11,'Redirects_with_old_history','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','page'),",
    "(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),",
    "(13,'Anarchism','random_stuff','2020-01-23 13:27:44',' ','uca-default-u-kn','page'),",
    "(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
    "(15,'Anti-fascism','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),",
    "(16,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page'),",
    "(17,'Articles_containing_French-language_text','*D*L.8:NB\r�','2020-01-23 13:27:44','','uca-default-u-kn','page')"
]
new_txt = re.sub(r"\(.*,'page'\)(?:,\n)?", "", '\n'.join(lines))
print(new_txt)
print(new_txt)
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Or you can use a list comprehension to keep the lines that do not match the pattern.
result = [line for line in lines if not re.match(r"\(.*,'page'\),?$", line)]
print('\n'.join(result))
Output
(12,'Unprintworthy_redirects','*..2NN:,#2.FBHRP:D6ܽ�','2010-08-26 22:38:36','','uca-default-u-kn','subcat'),
(14,'Anti-capitalism','random_stuff','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
�','2020-01-23 13:27:44','','uca-default-u-kn','subcat'),
Another option to match the parts that end with 'page') for the example data:
\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?
The pattern matches:
\(\d+, Match ( followed by 1+ digits and a comma
[^)]* Optionally match any char except )
(?: Non capture group
\)(?!,\s*\(\d+,)[^)]* Only match a ) when not directly followed by the pattern ,\s*\(\d+, which matches the start of the parts in the example data
)* Close group and optionally repeat
,'page'\),? Match ,'page') with an optional comma
Regex demo
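A minimal sketch of that last pattern with re.sub, on made-up data where one quoted field even contains a closing parenthesis:

```python
import re

pattern = r"\(\d+,[^)]*(?:\)(?!,\s*\(\d+,)[^)]*)*,'page'\),?"
txt = "(10,'has ) inside','page'),(11,'b','subcat')"
# Only the tuple ending in 'page') is removed; the ')' inside the
# quoted field does not terminate the match early.
cleaned = re.sub(pattern, "", txt)
print(cleaned)  # (11,'b','subcat')
```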

No luck finding regex pattern python

I am having no luck getting anything from this regex search.
I have a text file that looks like this:
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
I want to extract the lines that begin with "REF*23*" and ending with the "~"
txtfile = open(i + fileName, "r")
for line in txtfile:
    line = line.rstrip()
    p = re.findall(r'^REF*23*.+~', line)
    print(p)
But this gives me nothing. As much as I'd like to dig deep into regex with Python, I need a quick solution to this. What I'm eventually wanting is just the digits between the last "*" and the "~". Thanks
You do not really need a regex if the only task is to extract the lines that begin with "REF*23*" and ending with the "~":
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line)
print(results)
If you need to get the digit chunks:
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        if line.startswith('REF*23*') and line.endswith('~'):
            results.append(line[7:-1])  # Just grab the slice
See non-regex approach demo.
NOTES
In a regex, * must be escaped to match a literal asterisk.
Since you read line by line, re.findall(r'^REF*23*.+~', line) makes little sense, as re.findall is meant to get multiple matches while you expect at most one per line.
Your regex is not anchored on the right; you need $ or \Z to match ~ at the end of the line. So, if you want to use a regex, it would look like
m = re.search(r'^REF\*23\*(.*)~$', line)
if m:
    results.append(m.group(1))  # To grab just the contents between delimiters
    # or
    results.append(line)  # To get the whole line
See this Python demo
In your case, you search for lines that start and end with fixed text, so there is no need to use a regex.
Edit as an answer to the comment
Another text file is a very long unbroken line with hardly any spaces. I need to find where a section begins with REF*0F* and ends with ~, with the number I want in between.
You may read the file line by line and grab all occurrences of 1+ digits between REF*0F* and ~:
results = []
with open(fileName, "r") as txtfile:
    for line in txtfile:
        res = re.findall(r'REF\*0F\*(\d+)~', line)
        if len(res):
            results.extend(res)
print(results)
You can entirely use string functions to get only the digits (though a simple regex might be easier to understand, really):
raw = """
REF*0F*452574437~
REF*1L*627783972~
REF*23*526344060~
REF*6O*1024817112~
DTP*336*D8*20140623~
DTP*473*D8*20191001~
DTP*474*D8*20191031~
DTP*473*D8*20191101~
"""
result = [digits[:-1]
          for line in raw.split("\n") if line.startswith("REF*23*") and line.endswith("~")
          for splitted in [line.split("*")]
          for digits in [splitted[-1]]]
print(result)
This yields
['526344060']
* is a special character in regex, so you have to escape it, as @The Fourth Bird points out. You are using a raw string, which means you don't have to escape chars for Python's string parsing, but you still have to escape them for the regex engine.
r'^REF\*23\*.+~'
or
'^REF\\*23\\*.+~'
# '\\*' -> '\*' by Python string
# '\*' matches '*' literally by regex engine
will work. Having to escape things twice leads to the Leaning Toothpick Syndrome. Using a raw string means you have to escape only once, "saving some trees" in this regard.
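The difference is easy to demonstrate; a small sketch comparing the escaped and unescaped patterns on one sample line:

```python
import re

line = "REF*23*526344060~"
# Escaped: \* matches a literal asterisk, so the line matches.
print(bool(re.match(r'^REF\*23\*.+~', line)))  # True
# Unescaped: * quantifies the preceding 'F' and '3', so after 'RE'
# the engine looks for a '2' and never finds the literal asterisks.
print(bool(re.match(r'^REF*23*.+~', line)))    # False
```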
Additional changes
You might also want to put parens around .+ to capture the group, if you want to extract its contents. Also change findall to match, unless you expect multiple matches per line.
results = []
with open(i + fileName, "r") as txtfile:
    for line in txtfile:
        line = line.rstrip()
        p = re.match(r'^REF\*23\*(.+)~', line)
        if p:
            results.append(int(p.group(1)))
Consider using a regex tester such as this one.

python regex command to extract data excluding comment line

I need to extract data in a data file beginning with the letter "U" or "L", and exclude comment lines beginning with the character "/".
Example:
/data file FLG.dat
UAB-AB LRD1503 / reminder latches
I used a regex pattern in my Python program, but it only captures the comment lines, not the data lines beginning with those letters.
You can use ^([UL].+?)(?:/.*|)$. Code:
import re
s = """/data file FLG.dat
UAB-AB LRD1503 / reminder latches
LAB-AB LRD1503 / reminder latches
SAB-AB LRD1503 / reminder latches"""
lines = re.findall(r"^([UL].+?)(?:/.*|)$", s, re.MULTILINE)
If you want to delete the spaces at the end of the string, you can use a list comprehension with re.finditer:
lines = [match.group(1).strip() for match in re.finditer(r"^([UL].+)/.*$", s, re.MULTILINE)]
OR you can edit the regular expression so it does not include the spaces before the slash, ^([UL].+?)(?:\s*/.*|)$:
lines = re.findall(r"^([UL].+?)(?:\s*/.*|)$", s, re.MULTILINE)
In case the comments in your data lines are optional, here's a regular expression that covers both types, lines with or without a comment.
The regular expression for that is r"^([UL][^/]*)"
(edited; the original RE was r"^([UL][^/]*)(/.*)?$")
The first group is the data you want to extract; the 2nd (optional) group would catch the comment, if any.
This example code prints only the 2 valid data lines.
import re

lines = ["/data file FLG.dat",
         "UAB-AB LRD1503 / reminder latches",
         "UAB-AC LRD1600",
         "MAB-AD LRD1700 / does not start with U or L"
         ]
datare = re.compile(r"^([UL][^/]*)")
matches = (match.group(1).strip() for match in (datare.match(line) for line in lines) if match)
for match in matches:
    print(match)
Note how match.group(1).strip() extracts the first group of your RE, and strip() removes any trailing spaces from the match.
Also note that you can replace lines in this example with a file handle and it would work the same way.
If the matches = line looks too complicated, it's an efficient way of writing this:
for line in lines:
    match = datare.match(line)
    if match:
        print(match.group(1).strip())

Remove a specific pattern in fasta sequences

I have a fasta file like this:
>IWB12626
AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAA[T/C]GACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA
>IWB49383
CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTC[A/G]ATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC
I need to remove the square brackets "[]", the "/", and the nucleotide that follows the "/" symbol, i.e. basically choosing the first of the two variants. This is my script, but I don't know how to specify that exactly one letter should be removed after the /.
with open('myfile.fasta') as f:
    with open('outfile.fasta', 'w') as out:
        for line in f:
            if line.startswith('>'):
                out.write(line)
            else:
                out.write(line.translate(None, '[/a-z0-9]'))
My expected output:
>IWB12626
AACTTGAGGGACGTGCAGCTAAGGGAGGACTACTATCCAGCACCGGAGAATGACATGATGATCACAGAGATGCGGGCTGAATCTTGCCTCCGGTTTGAGCA
>IWB49383
CMGCTCATTTCTGCCGGGCTCGATAGCTGCCCTGTTCTTGAGAAGATCTCAATTAAGGTGGAGGGCGATCTCCGGACTTGTCCGCGTCCATTTCACGGGTC
You could use the re.sub function:
import re

with open('myfile.fasta') as f:
    with open('outfile.fasta', 'w') as out:
        for line in f:
            if line.startswith('>'):
                out.write(line)
            else:
                out.write(re.sub(r'[\[\]]|/.', '', line))
/. matches / and also the character following the forward slash. [\[\]] is a character class which matches the [ or ] symbols. | is the alternation operator (logical OR), usually used to combine two patterns. So replacing all the matched characters with an empty string gives you the desired output.
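A minimal sketch of that substitution on an inline sample sequence (a shortened, made-up variant of the fasta data):

```python
import re

seq = "AACTTGAGG[T/C]GACATG[A/G]ATT"
# '[\[\]]' removes the brackets, '/.' removes the slash plus the
# second variant, leaving only the first nucleotide of each pair.
print(re.sub(r'[\[\]]|/.', '', seq))  # AACTTGAGGTGACATGAATT
```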
