question = "question:"
# QUESTION
f = open('dopeit.rtf', encoding="utf8", errors='ignore')
line = f.readline()
while line:
    if question in line:
        newline = line.replace('Question: ', '"')
        print(newline + '"', end=",")
    # use readline() to read next line
    line = f.readline()
f.close()
The output is something like this:
","Who directed Star Wars?
","Who was the only non Jedi in the original Star Wars trilogy to use a lightsaber?
","What kind of flower was enchanted and dying in Beauty and the Beast?
","Which is the longest movie ever made?
I want it to be like this:
"Who directed Star Wars?","Who was the only non Jedi in the original Star Wars trilogy to use a lightsaber?","What kind of flower was enchanted and dying in Beauty and the Beast?","Which is the longest movie ever made?
So how can I make these changes? I tried using the "end" keyword argument of print(), but the output still gets pushed to the next line. Am I doing something wrong?
readline() keeps the trailing "\n" at the end of each line, and that newline is what pushes the output onto the next line regardless of end=",". Strip it as part of the replace:

newline = line.replace('Question: ', '"').replace("\n", "")
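With that change in place, the whole loop looks like this (a minimal sketch using the same file name and search string as in the question):

question = "question:"

f = open('dopeit.rtf', encoding="utf8", errors='ignore')
line = f.readline()
while line:
    if question in line:
        # strip the trailing newline so end="," keeps everything on one line
        newline = line.replace('Question: ', '"').replace("\n", "")
        print(newline + '"', end=",")
    line = f.readline()
f.close()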
I have two input files and I want to mix them and output the result into a third file. In the following I will use a toy example to explain the format of the files and the desired output. Each file contains a repeated 4-line pattern (each repetition with a different sequence), and I only include a single 4-line block here:
input file 1:
#readheader1
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
input file 2:
#readheader2
AATTAATT
+
FFFFFFFF
...
desired output:
#readheader1_AATTAATT
ACACACACACACACACACACACACACACACACACACACACACACACACACACAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
...
So I want to attach, with an underscore, the small sequence found in the second line of every four-line block of the second file to the first line of the corresponding four-line block of the first file, and simply copy the 2nd, 3rd, and 4th lines of every four-line block of the first file, as is, into the output.
I am looking for any script (linux bash, python, c++, etc) that can optimize what I have below:
I wrote this code to do the task, but I found it to be slow (takes more than a day for inputs of size 60 GB and 15 GB); note that the input files are in fastq.gz format so I open them using gzip:
...
r1_file = gzip.open(r1_file_name, 'r')                  # input file 1
i1_file = gzip.open(i1_file_name, 'r')                  # input file 2
out_file_R1 = gzip.open('_R1_barcoded.fastq.gz', 'wb')  # output file
r1_header = ''
r1_seq = ''
r1_orient = ''
r1_qual = ''
i1_seq = ''
cnt = 1
with gzip.open(r1_file_name, 'r') as r1_file:
    for r1_line in r1_file:
        if cnt == 1:
            r1_header = str.encode(r1_line.decode("ascii").split(" ")[0])
            next(i1_file)
        if cnt == 2:
            r1_seq = r1_line
            i1_seq = next(i1_file)
        if cnt == 3:
            r1_orient = r1_line
            next(i1_file)
        if cnt == 4:
            r1_qual = r1_line
            next(i1_file)
            out_4line = r1_header + b'_' + i1_seq + r1_seq + r1_orient + r1_qual
            out_file_R1.write(out_4line)
            cnt = 0
        cnt += 1
i1_file.close()
out_file_R1.close()
Now that I have the two outputs made using two datasets, I wish to interleave the output files: 4 lines from the first file, 4 lines from the second file, 4 lines from the first, and so on (see the rough sketch below for what I mean).
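Conceptually, the interleaving I am after is something like this rough sketch (file names here are placeholders, and it assumes both outputs consist of complete 4-line records):

import gzip

with gzip.open('out1.fastq.gz', 'rt') as f1, \
     gzip.open('out2.fastq.gz', 'rt') as f2, \
     gzip.open('combined.fastq.gz', 'wt') as out:
    while True:
        chunk1 = [f1.readline() for _ in range(4)]   # next record from file 1
        chunk2 = [f2.readline() for _ in range(4)]   # next record from file 2
        if not chunk1[0] and not chunk2[0]:          # both inputs exhausted
            break
        out.writelines(chunk1 + chunk2)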
Using the paste utility (from GNU coreutils) and GNU sed:
paste file1 file2 |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
If files are gzipped then use:
paste <(gzip -dc file1.gz) <(gzip -dc file2.gz) |
sed -E 'N; s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/; N; N; s/\t[^\n]*//g' > file.out
Note: This assumes no tab characters in file1 and file2
Explanation: assume that file1 and file2 contain these lines:
File1:
Header1
ACACACACAC
XX
FFFFFFFFFFFF
File2:
Header2
AATTAATT
YY
GGGGGG
After the paste command, lines are merged, separated by TABs:
Header1\tHeader2
ACACACACAC\tAATTAATT
XX\tYY
FFFFFFFFFFFF\tGGGGGG
The \t above denotes a tab character. These lines are fed to sed. sed reads the first line, and the pattern space becomes
Header1\tHeader2
The N command adds a newline to the pattern space, then appends the next line (ACACACACAC\tAATTAATT) of input to the pattern space. Pattern space becomes
Header1\tHeader2\nACACACACAC\tAATTAATT
and is matched against regex \t.*\n([^\t]*)\t(.*) as denoted below.
Header1\tHeader2\nACACACACAC\tAATTAATT
       ||^^^^^^^||^^^^^^^^^^||^^^^^^^^
       \t   .*  \n ([^\t]*) \t  (.*)
       ||       ||    \1    ||   \2
The \n denotes a newline character. Then the matching part is replaced with _\2\n\1 by the s/\t.*\n([^\t]*)\t(.*)/_\2\n\1/ command. Pattern space becomes
Header1_AATTAATT\nACACACACAC
The two N commands read the next two lines. Now pattern space is
Header1_AATTAATT\nACACACACAC\nXX\tYY\nFFFFFFFFFFFF\tGGGGGG
The s/\t[^\n]*//g command removes all parts between a TAB (inclusive) and newline (exclusive). After this operation the final pattern space is
Header1_AATTAATT\nACACACACAC\nXX\nFFFFFFFFFFFF
which is printed out as
Header1_AATTAATT
ACACACACAC
XX
FFFFFFFFFFFF
I have a txt file like this:
pid,party,state,res
SC5,Republican,NY,Donald Trump 45%-Marco Rubio 18%-John Kasich 18%-Ted Cruz 11%
TB1,Republican,AR,Ted Cruz 27%-Marco Rubio 23%-Donald Trump 23%-Ben Carson 11%
FX2,Democratic,MI,Hillary Clinton 61%-Bernie Sanders 34%
BN1,Democratic,FL,Hillary Clinton 61%-Bernie Sanders 30%
PB2,Democratic,OH,Hillary Clinton 56%-Bernie Sanders 35%
What I want to do is check that the percentages in each "res" add up to 100%.
def addPoll(pid, party, state, res, filetype):
    with open('Polls.txt', 'a+') as file:  # open file temporarily for writing and reading
        lines = file.readlines()  # get all lines from file
        file.seek(0)
        next(file)  # go to next line --
        # this is supposed to skip the 1st line with pid/party/state/res
        for line in lines:  # loop
            line = line.split(',', 3)[3]
            y = line.split()
            print y
            #else:
                #file.write(pid + "," + party + "," + state + "," + res + "\n")
    #file.close()
    return "pass"

print addPoll("123", "Democratic", "OH", "bla bla 50%-Asd ASD 50%", 'f')
So in my code I manage to split on the last ',' and put the result into a list, but I'm not sure how I can get only the numbers out of that text.
You can use regex to find all the numbers:
import re

for line in lines:
    numbers = re.findall(r'\d+', line)
    numbers = [int(n) for n in numbers]
    print(sum(numbers))
This will print
0 # no numbers in the first line
97
85
97
92
93
The re.findall() method finds all substrings matching the specified pattern, which in this case is \d+, meaning any continuous string of digits. This returns a list of strings, which we cast to a list of ints, then take the sum.
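If you want to count only the percentage figures (note that \d+ also picks up stray digits such as the 5 in a pid like SC5), you could anchor the pattern to the percent sign. A small variation of the same idea:

import re

for line in lines:
    # only digits that are immediately followed by a percent sign
    percents = [int(n) for n in re.findall(r'(\d+)%', line)]
    print(sum(percents))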
It seems like what you have is CSV. Instead of trying to parse that on your own, Python already has a builtin parser that will give you back nice dictionaries (so you can do line['res']):
import csv

with open('Polls.txt') as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Do something with row['res']
        pass
For the # Do something part, you can either parse the field manually (it appears to be structured): split('-') and then rsplit(' ', 1) each '-'-separated part (the last piece should be the percent). If you're trying to enforce a format, then I'd definitely go this route, but regexes are also a fine solution for quickly pulling out what you want. You'll want to read up on them, but in your case, you want \d+%:
# Manually parse (throws IndexError if there isn't a space separating candidate name and %)
percents = [candidate.rsplit(' ', 1)[1] for candidate in row['res'].split('-')]
if not all(p.endswith('%') for p in percents):
    # Handle bad percent (not ending in %)
    pass
else:
    # Throws ValueError if any of the percents aren't integers
    percents = [int(p[:-1]) for p in percents]
    if sum(percents) != 100:
        # Handle bad total
        pass
Or with regex:
percents = [int(match.group(1)) for match in re.finditer(r'(\d+)%', row['res'])]
if sum(percents) != 100:
    # Handle bad total here
    pass
Regex is certainly shorter, but the former will enforce more strict formatting requirements on row['res'] and will allow you to later extract things like candidate names.
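For instance, inside the same DictReader loop you could pull the candidate names out alongside the percents with the same rsplit idea (a sketch; it assumes the field is well formed):

pairs = [candidate.rsplit(' ', 1) for candidate in row['res'].split('-')]
names = [name for name, percent in pairs]            # e.g. ['Hillary Clinton', 'Bernie Sanders']
percents = [int(p.rstrip('%')) for name, p in pairs]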
Also some random notes:
You don't need to open with 'a+' unless you plan to append to the file, 'r' will do (and 'r' is implicit, so you don't have to specify it).
Instead of next() use a for loop!
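Putting the pieces of this answer together, a rough version of the whole check could look like the hypothetical checkPolls() below (a sketch only; it keeps your Polls.txt name and the sum-to-100 rule, and just prints a message instead of real error handling):

import csv
import re

def checkPolls(filename='Polls.txt'):
    with open(filename) as f:                  # 'r' mode is the default
        for row in csv.DictReader(f):          # the header row is handled for us
            percents = [int(m.group(1)) for m in re.finditer(r'(\d+)%', row['res'])]
            if sum(percents) != 100:
                print('poll %s does not add up to 100%%' % row['pid'])

checkPolls()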
The file has the following format:
Component_name - version - author#email.com - multi-line comment with new lines and other white space characters
\t ...continue multi-line comment
Component_name2 - version - author2#email.com - possibly multi-line comment with new lines and other white space characters
Component_name - version - author#email.com - possibly multi-line comment with new lines and other white space characters 2
Component_name - version - author2#email.com - possibly multi-line comment with new lines and other white space characters 2
and so on...
After parsing, the output should be grouped by component_name:
output = [
"component_name" -> ["version - author#email.com - comment 1", "version - author#email.com - comment 2", ...],
"component_name2" -> [...],
...
]
Currently, this is what I have so far to parse it:
import re
from itertools import groupby

reTemp = r"[\w\_\-]*( \- )(\d*\.?){3}( \- )[\w\d\_\-\.\#]*( \- )[\S ]*"
numData = 4
reFormat = re.compile(reTemp)
textFileLines = textFile.split("\n")
temp = [x.split(" - ", numData - 1) for x in textFileLines if re.search(reFormat, x)]
m = filter(None, temp) # remove all empty lists
group = groupby(m, lambda y: y[0].strip())
This works well for single-line comments but fails with multi-line comments. Also, I am not sure if regex is the right tool for this. Is there a better/more Pythonic way to do this?
EDIT:
Multi-line comments continue on a new line that starts with a tab (\t); see the first entry above
Comments are Git commit messages and can contain JSON or code
Entries are separated by a newline character
I've had to deal with structured data files like this and ended up writing a state machine to parse the file. Something like this (rough pseudocode):
for line in file:
    if line matches new_record_regex:
        records.append(record)
        record = {"version": field1, "author": field2, "comment": field3}
    else:
        record["comment"] += line
You might want to formalize the file format as a grammar and then use one of the many parsers / parser generators Python has to offer to interpret the file according to the grammar.
I tried to read lines like below:
A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1
xQ,1xT
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x
H,1xY
A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1
xT
Each EVEN line is a continuation of the preceding ODD line, but the two are split by a newline followed by a long run of spaces ("\n" plus several dozen "\s"), so I want to replace that '\n\s(n)' sequence with '' and join the EVEN line back onto the end of the ODD line.
FOR EXAMPLE:
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x
H,1xY
TO
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1xH,1xY
CODE:
import os
import sys
import re
lines=["A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1"," xQ,1xT","A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1x"," H,1xY","A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1"," xT"]
for i in lines:
    print i.replace(" ", "")
Here, I just replaced spaces with an empty string, but I didn't figure out how to join those replaced EVEN lines onto the end of the ODD lines.
So could someone help me do this?
Thanks in advance.
Hi guys, first of all, many thanks for your kind replies. I tried all the suggested ways, but the following one works correctly:
WILD = open("INPUT.txt", 'r')
merged = []
for line in WILD:
    if line.startswith(" "):
        merged[-1] += line.strip()
    else:
        merged.append(line.replace("\n", ""))
OUTPUT:
A:129 Tyr -P- 9 - - - 10xR,4xG,3xD,3xK,2xP,2xV,2xY,1xE,1xI,1xL,1xM,1xN,1xQ,1xT
A:181 Ser -P- 8 - - - 9xR,9xS,8xG,4xT,3xD,3xL,3xQ,3xV,2xK,2xM,1xA,1xF,1xH,1xY
A:50 His --- 9 - - - 17xL,9xA,4xK,3xI,3xR,3xV,2xN,2xS,1xC,1xE,1xH,1xQ,1xT
Instead of that replace statement, you can just use str.strip to strip away whitespace at the beginning or the end of the string. Also, you can use zip to iterate pairs of lines.
for x, y in zip(l[::2], l[1::2]):
    print "".join([x, y.strip()])
Or use next to get the next line if this is an iterator, like a file.
for x in iterator:
    y = next(iterator)
    print "".join([x, y.strip()])
Both ways, all the even lines (0, 2, ...) go to x and all the odd ones (1, 3, ...) to y.
Of course, this is assuming that all the entries in the list/file are spanning exactly two lines.
If they can span an arbitrary number of lines (just one, or two, or maybe five), then this will get more complicated. In this case, you might try something like this:
merged = []
for line in lines:
    if line.startswith(" "):
        merged[-1] += line.strip()
    else:
        merged.append(line)
Note: If those are indeed lines from a file, you might have to apply strip to all the lines, i.e. also x.strip() and merged.append(line.strip()), as each line will be terminated by \n, which you probably want to get rid of.
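For example, reading straight from a file (a sketch; INPUT.txt is the file name used elsewhere in this thread):

merged = []
with open("INPUT.txt") as f:
    for line in f:
        if line.startswith(" "):
            merged[-1] += line.strip()   # continuation line: append to the previous entry
        else:
            merged.append(line.strip())  # new entry, with the trailing \n removed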
Read the entire file as a single string, then replace each stretch of wrapping whitespace with a single tab:
import re

filepointer = open("INPUT.txt")
text = filepointer.read()
text = re.sub(r"\n\s{20,}", "\t", text)
This matches and removes sequences of a newline followed by 20 or more spaces, replacing them with a tab. (That way I don't have to count the precise number of spaces, and the program still works if some lines are slightly different).
If you don't want a tab between the joined lines, just use a space (" ") instead of "\t".
And if you must have the result as a list of lines, split text afterwards:
merged = text.splitlines()
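Putting those steps together (same INPUT.txt name and 20-space threshold as above; swap "\t" for " " if you prefer a space between the joined parts):

import re

with open("INPUT.txt") as filepointer:
    text = filepointer.read()

# a newline followed by 20 or more whitespace characters marks a wrapped line
text = re.sub(r"\n\s{20,}", "\t", text)
merged = text.splitlines()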