Simple regex problem: Removing all new lines from a file - python

I'm becoming acquainted with python and am creating problems in order to help myself learn the ins and outs of the language. My next problem comes as follows:
I have copied and pasted a huge slew of text from the internet, but the copy and paste added several new lines to break up the huge string. I wish to programmatically remove all of these and turn the string back into a giant blob of characters. This is obviously a job for regex (I think), and parsing through the file and removing all instances of the newline character sounds like it would work, but it doesn't seem to be going over all that well for me.
Is there an easy way to go about this? It seems rather simple.

The two main alternatives: read everything in as a single string and remove newlines:
clean = open('thefile.txt').read().replace('\n', '')
or, read line by line, removing the newline that ends each line, and join it up again:
clean = ''.join(l[:-1] for l in open('thefile.txt'))
The former alternative is probably faster, but, as always, I strongly recommend you MEASURE speed (e.g., use python -mtimeit) in cases of your specific interest, rather than just assuming you know how performance will be. REs are probably slower, but, again: don't guess, MEASURE!
So here are some numbers for a specific text file on my laptop:
$ python -mtimeit -s"import re" "re.sub('\n','',open('AV1611Bible.txt').read())"
10 loops, best of 3: 53.9 msec per loop
$ python -mtimeit "''.join(l[:-1] for l in open('AV1611Bible.txt'))"
10 loops, best of 3: 51.3 msec per loop
$ python -mtimeit "open('AV1611Bible.txt').read().replace('\n', '')"
10 loops, best of 3: 35.1 msec per loop
The file is a version of the KJ Bible, downloaded and unzipped from here (I do think it's important to run such measurements on one easily fetched file, so others can easily reproduce them!).
Of course, a few milliseconds more or less on a file of 4.3 MB, 34,000 lines, may not matter much to you one way or another; but as the fastest approach is also the simplest one (far from an unusual occurrence, especially in Python;-), I think that's a pretty good recommendation.

I wouldn't use a regex for simply replacing newlines - I'd use string.replace(). Here's a complete script:
f = open('input.txt')
contents = f.read()
f.close()
new_contents = contents.replace('\n', '')
f = open('output.txt', 'w')
f.write(new_contents)
f.close()
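The same script can also be written with with blocks, which close the files for you automatically; this is purely a style variant of the code above:
with open('input.txt') as f:
    contents = f.read()
with open('output.txt', 'w') as f:
    f.write(contents.replace('\n', ''))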

import re
re.sub(r"\n", "", file_contents_here)

I know this is a python learning problem, but if you're ever trying to do this from the command-line, there's no need to write a python script. Here are a couple of other ways:
cat $FILE | tr -d '\n'
awk '{printf("%s", $0)}' $FILE
Neither of these has to read the entire file into memory, so if you've got an enormous file to process, they might be better than the python solutions provided.
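That said, a Python version can also avoid holding the whole file in memory by streaming it line by line; a rough sketch (the file names here are placeholders):
with open('thefile.txt') as src, open('clean.txt', 'w') as dst:
    for line in src:
        dst.write(line.rstrip('\n'))  # drop the trailing newline, keep the rest of the line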

Old question, but since it was in my search results for a similar query, and no one has mentioned the python string methods strip() / lstrip() / rstrip(), I'll just add that for posterity (and anyone who prefers not to use re when not necessary):
old = open('infile.txt')
new = open('outfile.txt', 'w')
stripped = [line.strip() for line in old]
old.close()
new.write("".join(stripped))
new.close()

All the examples using <string>.replace('\n', '') show the correct method to remove all newlines.
If you are interested in removing all redundant new lines for debugging etc., here is how:
import re
re.sub(r"(\n)\1{2,}", "", _your_string).strip()

Related

How to concatenate sequences in the same multiFASTA files and then print result to a new FASTA file?

I have a folder with over 50 FASTA files, each containing anywhere from 2 to 8 FASTA sequences; here's an example:
testFOR.id_AH004930.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGA
>AH004930|2:237-401_Miopithecus_talapoin
GGGT
>AH004930|2:502-580_Miopithecus_talapoin
CTTTGCT
>AH004930|2:681-747_Miopithecus_talapoin
GGTG
testFOR.id_M95099.fasta
>M95099|1:1-90_Homo_sapien
TCTTTGC
>M95099|1:100-243_Homo_sapien
ATGGTCTTTGAA
They're all grouped based on their ID number (in this case AH004930 and M95099), which I've managed to extract from the original raw multiFASTA file using the very handy seqkit code found HERE.
What I am aiming to do is:
Use cat to put these sequences together within the file like this:
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
(I'm not fussed about the nucleotide position, I'm fussed about the ID and species name!)
Print this result out into a new FASTA file.
Ideally I'd really like to have all of these 50 files condensed into 1 FASTA that I can then go ahead and filter/align:
GENE_L.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTG
>M95099|1:1-90_Homo_sapien
TCTTTGCATGGTCTTTGAA
....
So far I have found a way to achieve what I want, but only one file at a time (using this code: cat myfile.fasta | sed -e '1!{/^>.*/d;}' | sed ':a;N;$!ba;s/\n//2g' > output.fasta, for which I've sadly lost the link so I can't credit the author), but a lot of these file names are very similar, so it's inevitable that if I did it manually, I'd miss some/it would be way too slow.
I have tried to put this into a loop and it's kind of there! But what it does is cat each FASTA file and put it into a new one, BUT it only keeps the first header, leaving me with a massive stitched-together sequence:
for FILE in *; do cat *.fasta| sed -e '1!{/^>.*/d;}'| sed ':a;N;$!ba;s/\n//2g' > output.fasta; done
output.fasta
>AH004930|2:1-128_Miopithecus_talapoin
ATGAGGGTCTTTGCTGGTGTCTTTGCATGGTCTTTGAAGGTCTTTGAAATGAGTGGT...
I wondered if making a loop similar to the one HERE would be any good but I am really unsure how to get it to print each header once it opens a new file.
How can I cat these sequences, print them into a new file and still keep these headers?
I would really appreciate any advice on where I've gone wrong in the loop and any solutions suitable for a zsh shell! I'm open to any python or linux solution. Thank you kindly in advance
This might work for you (GNU sed):
sed -s '1h;/>/d;H;$!d;x;s/\n//2g' file1 file2 file3 ...
Set -s to treat each file separately.
Copy the first line.
Delete any other lines containing >.
Append all other lines to the first.
Delete these lines except for the last.
At the end of the file, swap in the copy and remove all newlines except the first.
Repeat for all files.
Alternative for non-GNU seds:
for file in *.fasta; do sed '1h;/>/d;H;$!d;x;s/\n//2g' "$file"; done
N.B. macOS sed may need the commands to be put into a script and invoked using the -f option, or split into several pieces using the -e option (less the ; separators); your mileage may vary.
Or perhaps:
for file in file?; do sed $'1h;/>/d;H;$!d;x;s/\\n/#/;s/\\n//g;s/#/\\n/' "$file"; done
Not sure I understand exactly your issue, but if you simply want to concatenate contents from many files to a single file I believe the (Python) code below should work:
import os
input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'output.fasta'
with open(output_file, 'w') as outfile:
    for file_name in os.listdir(input_folder):
        if not file_name.endswith('.fasta'):  # skip non-FASTA files
            continue
        file_path = os.path.join(input_folder, file_name)
        with open(file_path, 'r') as inpfile:
            outfile.write(inpfile.read())
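If the sequences inside each file also need to be joined under that file's first header, as the question describes, here is a rough sketch along the same lines; the folder and output names are placeholders, and it assumes each .fasta file should yield exactly one record headed by its first header line:
import os

input_folder = 'path/to/your/folder/with/fasta/files'
output_file = 'GENE_L.fasta'

with open(output_file, 'w') as outfile:
    for file_name in sorted(os.listdir(input_folder)):
        if not file_name.endswith('.fasta'):
            continue
        header = None
        seq_parts = []
        with open(os.path.join(input_folder, file_name)) as inpfile:
            for line in inpfile:
                line = line.strip()
                if line.startswith('>'):
                    if header is None:
                        header = line  # keep only the first header of each file
                elif line:
                    seq_parts.append(line)  # collect the sequence chunks
        if header is not None:
            outfile.write(header + '\n' + ''.join(seq_parts) + '\n')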

Running grep through Python - doesn't work

I have some code like this:
f = open("words.txt", "w")
subprocess.call(["grep", p, "/usr/share/dict/words"], stdout=f)
f.close()
I want to grep the MacOs dictionary for a certain pattern and write the results to words.txt. For example, if I want to do something like grep '\<a.\>' /usr/share/dict/words, I'd run the above code with p = "'\<a.\>'". However, the subprocess call doesn't seem to work properly and words.txt remains empty. Any thoughts on why that is? Also, is there a way to apply regex to /usr/share/dict/words without calling a grep-subprocess?
edit:
When I run grep '\<a.\>' /usr/share/dict/words in my terminal, I get words like:
aa
ad
ae
ah
ai
ak
al
am
an
ar
as
at
aw
ax
ay
as results in the terminal (or a file if I redirect them there). This is what I expect words.txt to have after I run the subprocess call.
As @woockashek already commented, you are not getting any results because there are no hits on '\<a.\>' in your input file. You are probably actually hoping to find hits for \<a.\> but then obviously you need to omit the single quotes, which are messing you up.
Of course, Python knows full well how to look for a regex in a file.
import re
rx = re.compile(r'\ba.\b')
with open('/usr/share/dict/words', 'r') as reader, open('words.txt', 'w') as writer:
    for line in reader:
        if rx.search(line):
            print(line, file=writer, end='')
The single quotes here are part of Python's string syntax, just like the single quotes on the command line are part of the shell's syntax. In neither case are they part of the actual regular expression you are searching for.
The subprocess.Popen documentation vaguely alludes to the frequently overlooked fact that the shell's quoting is not necessary or useful when you don't have shell=True (which usually you should avoid anyway, for this and other reasons).
Python unfortunately doesn't support \< and \> as word boundary operators, so we have to use (the functionally equivalent) \b instead.
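Putting that together, the original subprocess approach should also work once the shell quotes are dropped from the pattern; a minimal sketch (writing to words.txt as in the question):
import subprocess

p = r"\<a.\>"  # the pattern exactly as grep should see it, no shell quotes
with open("words.txt", "w") as f:
    subprocess.call(["grep", p, "/usr/share/dict/words"], stdout=f)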
The standard input and output channels for the process started by call() are bound to the parent’s input and output. That means the calling program cannot capture the output of the command. Use check_output() to capture the output for later processing:
import subprocess

output = subprocess.check_output(['grep', p, '/usr/share/dict/words'])
with open("words.txt", "wb") as f:  # check_output returns bytes
    f.write(output)
print(output)
PS: I hope it works, I can't check the answer because I don't have macOS to try it.

Print lines between line numbers from a large file

I have a very large text file, more than 30 GB in size. For some reason, I want to read the lines between 1000000 and 2000000 and compare them with a user input string. If a line matches, I need to write its content to another file.
I know how to read a file line by line.
input_file = open('file.txt', 'r')
for line in input_file:
    print line
But if the size of the file is large, it really affects performance, right? How do I address this in an optimized way?
You can use itertools.islice:
from itertools import islice
with open('file.txt') as fin:
    lines = islice(fin, 1000000, 2000000)  # or whatever ranges
    for line in lines:
        pass  # do something with each line
Of course, if your lines are fixed length, you can use that to fin.seek() directly to the start of the line. Otherwise, the approach above still has to read n lines until islice starts producing output, but it is really just a convenient way to limit the range.
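For that fixed-length-line case, the seek is simple arithmetic; a small sketch, assuming every line (newline included) is exactly record_len bytes, which you would need to verify for your file:
record_len = 80  # assumed bytes per line, including the newline
with open('file.txt', 'rb') as fin:
    fin.seek(record_len * 1000000)   # jump straight to the start of line 1,000,000
    for _ in range(1000000):         # then read the next million lines
        line = fin.readline()
        # compare line with the user input string here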
You could use linecache.
Let me cite from the docs: "The linecache module allows one to get any line from any file, while attempting to optimize internally, using a cache, the common case where many lines are read from a single file.":
import linecache
for i in xrange(1000000, 2000000):
    print linecache.getline('file.txt', i)
Do all your lines have the same size? If that were the case you could probably use seek() to directly jump to the first line you are interested into. Otherwise, you're going to have to iterate through the entire file because there is no way of telling in advance where each line starts:
input_file = open('file.txt', 'r')
for index, line in enumerate(input_file):
    # Assuming you start counting from zero
    if 1000000 <= index <= 2000000:
        print line
For small files, the linecache module can be useful.
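Since the upper bound is known, the enumerate() loop above can also stop as soon as it passes line 2,000,000 instead of scanning the remainder of the 30 GB file; a small variation on the same idea:
with open('file.txt') as input_file:
    for index, line in enumerate(input_file):
        if index > 2000000:
            break                    # nothing left to do past the range
        if index >= 1000000:
            print(line, end='')      # or compare/write it as needed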
If you're on Linux, have you considered using the os.system or commands Python modules to directly execute shell commands like sed, awk, head or tail to do this?
Running the command: os.system("tail -n+50000000 test.in | head -n10")
will read lines 50,000,000 to 50,000,010 from the file test.in. This post on Stack Overflow discusses different ways of calling commands; if performance is key, there may be more efficient methods than os.system.
This discussion on unix.stackexchange discusses in-depth how to select specific ranges of a text file using the command line:
100,000,000-line file generated by seq 100000000 > test.in
Reading lines 50,000,000-50,000,010
Tests in no particular order
real time as reported by bash's builtin time
The combination of tail and head, or using sed, seems to offer the quickest solutions.
4.373 4.418 4.395 tail -n+50000000 test.in | head -n10
5.210 5.179 6.181 sed -n '50000000,50000010p;57890010q' test.in
5.525 5.475 5.488 head -n50000010 test.in | tail -n10
8.497 8.352 8.438 sed -n '50000000,50000010p' test.in
22.826 23.154 23.195 tail -n50000001 test.in | head -n10
25.694 25.908 27.638 ed -s test.in <<<"50000000,50000010p"
31.348 28.140 30.574 awk 'NR<57890000{next}1;NR==57890010{exit}' test.in
51.359 50.919 51.127 awk 'NR >= 57890000 && NR <= 57890010' test.in
Generally, you cannot just jump to line number x in a file, because text lines can have variable length, so each can occupy anything between one and a gazillion bytes.
However, if you expect to seek in those files very often, you can index them, remembering in a separate file at which byte, let's say, every thousandth line starts. Then you can open the file and use file.seek() to go to the part of the file you are interested in, and start iterating from there.
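A rough sketch of that indexing idea (keeping the offsets in a plain Python list here for brevity; as described, you could instead save them to a separate index file):
# Build the index once: byte offset of every 1000th line.
offsets = []
with open('file.txt', 'rb') as f:
    pos = 0
    for line_no, line in enumerate(f):
        if line_no % 1000 == 0:
            offsets.append(pos)
        pos += len(line)

# Later: jump straight to line 1,000,000 and iterate from there.
with open('file.txt', 'rb') as f:
    f.seek(offsets[1000000 // 1000])
    for line in f:
        pass  # process lines, counting until you reach 2,000,000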
The best way I found is:
lines_data = []
text_arr = multilinetext.split('\n')
for i in range(line_number_begin, line_number_end):
    lines_data.append(text_arr[i])  # index into the split lines, not the raw string

Can I read and write file in one line with Python?

with ruby I can
File.open('yyy.mp4', 'w') { |f| f.write(File.read('xxx.mp4')) }
Can I do this using Python?
Sure you can:
with open('yyy.mp4', 'wb') as f:
    f.write(open('xxx.mp4', 'rb').read())
Note the binary mode flag there (b): since you are copying over mp4 contents, you don't want python to reinterpret newlines for you.
That'll take a lot of memory if xxx.mp4 is large. Take a look at the shutil.copyfile function for a more memory-efficient option:
import shutil
shutil.copyfile('xxx.mp4', 'yyy.mp4')
Python is not about writing ugly one-liner code.
Check the documentation of the shutil module - in particular the copyfile() method.
http://docs.python.org/library/shutil.html
You want to copy a file; do not manually read and then write bytes. Use file copy functions, which are generally much better and more efficient for a number of reasons in this simple case.
If you want a true one-liner, you can replace line-breaks by semi-colons:
import shutil; shutil.copyfile("xxx.mp4","yyy.mp4")
Avoid this! I did that once to speed up an extremely specific case that had nothing to do with Python itself, but rather with the presence of line-breaks in my python -c "Put 🐍️ code here" command line and the way Meson handles it.

Are there a set of simple scripts to manipulate csv files available somewhere?

I am looking for a few scripts which would allow me to manipulate generic csv files...
typically something like:
add-row FILENAME INSERT_ROW
get-row FILENAME GREP_ROW
replace-row FILENAME GREP_ROW INSERT_ROW
delete-row FILENAME GREP_ROW
where
FILENAME the name of a csv file, with the first row containing headers, "" used to delimit strings which might contain ','
GREP_ROW a string of pairs field1=value1[,fieldN=valueN,...] used to identify a row based on its fields values in a csv file
INSERT_ROW a string of pairs field1=value1[,fieldN=valueN,...] used to replace(or add) the fields of a row.
preferably in python using the csv package...
ideally leveraging python to associate each field as a variable and allowing more advanced GREP rules like fieldN > XYZ...
Perl has a tradition of in-place editing derived from the unix philosophy.
We could, for example, write a simple add-row-by-num.pl command as follows:
#!/usr/bin/perl -pi
BEGIN { $ln=shift; $line=shift; }
print "$line\n" if $ln==$.;
close ARGV if eof;
Replace the third line by $_="$line\n" if $ln==$.; to replace lines. Eliminate the $line=shift; and replace the third line by $_ = "" if $ln==$.; to delete lines.
We could write a simple add-row-by-regex.pl command as follows:
#!/usr/bin/perl -pi
BEGIN { $regex=shift; $line=shift; }
print "$line\n" if /$regex/;
Or simply the perl command perl -pi -e 'print "LINE\n" if /REGEX/;' FILES. Again, we may replace the print with $_="$line\n" or $_ = "" for replace or delete, respectively.
We do not need the close ARGV if eof; line anymore because we do not need to reset the $. counter after each file is processed.
Is there some reason the ordinary unix grep utility does not suffice? Recall the regular expression (PATTERN){n} matches PATTERN exactly n times, i.e. (\s*\S+\s*,){6}(\s*777\s*,) demands a 777 in the 7th column.
There is even a perl regular expression to transform your fieldN=value pairs into this regular expression, although I'd use split, map, and join myself.
Btw, File::Inplace provides inplace editing for file handles.
Perl has the DBD::CSV driver, which lets you access a CSV file as if it were an SQL database. I've played with it before, but haven't used it extensively, so I can't give a thorough review of it. If your needs are simple enough, this may work well for you.
App::CCSV does some of that.
The usual way in Python is to use the csv.reader to load the data into a list of tuples, then do your add/replace/get/delete operations on that native python object, and then use csv.writer to write the file back out.
In-place operations on CSV files wouldn't make much sense anyway. Since the records are not typically of fixed length, there is no easy way to insert, delete, or modify a record without moving all the other records at the same time.
That being said, Python's fileinput module has a mode for in-place file updates.
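For what it's worth, a rough sketch of that read-modify-write pattern with the csv module; the data.csv file name and the field1/field2 column names are only placeholders, and the matching logic is far simpler than the GREP_ROW syntax described in the question:
import csv

def load(filename):
    # Read the whole CSV into memory as a list of dicts keyed by the header row.
    with open(filename, newline='') as f:
        reader = csv.DictReader(f)
        return reader.fieldnames, list(reader)

def save(filename, fieldnames, rows):
    # Write everything back out, header first.
    with open(filename, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

fieldnames, rows = load('data.csv')

# get-row: keep only the rows matching a field=value pair
matches = [r for r in rows if r['field1'] == 'value1']

# replace-row: update the matching rows, then rewrite the file
for r in rows:
    if r['field1'] == 'value1':
        r['field2'] = 'new value'
save('data.csv', fieldnames, rows)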
