I'm new to Python and right now I'm out of ideas.
What I'm trying to do: I have a file like this.
Example:
254 578 name1 *--21->28--* secname1
854 548 name2 *--21->28--* secname2
944 785 name3 *--21->28--* secname3
1025 654 name4 *--21->28--* secname4
Between the fields there are a lot of spaces, and I want to remove specific spaces between "name*" and "secname*" in each row. As marked in the example, I don't know how to remove the characters/spaces at positions 21 to 28.
What I have so far:
fobj_in = open("85488_66325_R85V54.txt")
fobj_out = open("85488_66325_R85V54.txt","w")
for line in fobj_in:
fobj_in.close()
fobj_out.close()
At the end it should look like:
254 578 name1 secname1
854 548 name2 secname2
944 785 name3 secname3
1025 654 name4 secname4
To remove characters at specific index positions, use slicing:
for line in open('85488_66325_R85V54.txt'):
    newline = line[:21] + line[29:]
    print(newline, end='')  # the line already ends with '\n'
This removes the characters at indices 21 through 28 (which are all whitespace in your example).
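The slicing idea can be wrapped in a small helper. This is only a sketch: the name `remove_columns` and the inclusive `start`/`stop` convention are assumptions made for illustration, not part of the original answer.

```python
def remove_columns(line, start, stop):
    """Return line without the characters at indices start..stop (inclusive)."""
    # Everything before start, plus everything after stop.
    return line[:start] + line[stop + 1:]


sample = "0123456789"
print(remove_columns(sample, 3, 5))
```

For the file in the question, the call would be `remove_columns(line, 21, 28)`, provided the gap really sits at those positions on every line.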
Just split the line and pop the element you don't need.
fobj_in = open('85488_66325_R85V54','r')
fobj_out = open('85488_66325_R85V54.txt', 'a')
for line in fobj_in:
    items = line.split()
    # drop the fourth field (the marker column shown in the example)
    items.pop(3)
    fobj_out.write(' '.join(items)+'\n')
fobj_in.close()
fobj_out.close()
You can just use the string object's split method, like so:
f = open('my_file.txt', 'r')
data = f.readlines()
final_data = []
for line in data:
    bits = line.split()
    final_data.append([bits[0], bits[1], bits[2], bits[4]])
Basically I'm just illustrating how to use that split method to break each line into individual chunks, at which point you can do whatever you wish, like print all of those bits and selectively discard one of the columns.
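If the gap between the two names consists only of whitespace (an assumption about the real file, since the marker in the example may just be an annotation), splitting and re-joining is enough and no column has to be discarded. A minimal sketch, with `squeeze` as a hypothetical helper name:

```python
def squeeze(line):
    # str.split() with no arguments splits on any run of whitespace and
    # ignores the trailing newline, so joining the pieces with a single
    # space leaves exactly one space between fields.
    return " ".join(line.split())


sample = "254 578 name1        secname1\n"
print(squeeze(sample))
```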
I can suggest a robust method to correct the input line.
#!/usr/bin/env ipython
# -----------------------------------
line = '254 578 name1 *--21->28--* secname1'
# -----------------------------------
def correctline(line, marker='*'):
    skipping = False   # True while we are between two markers
    lineout = ''
    for val in line:
        if val == marker:
            skipping = not skipping   # toggle at every marker
            continue
        if not skipping:
            lineout = lineout + val
    # -----------------------------------
    # collapse runs of spaces into a single space
    while '  ' in lineout:
        lineout = lineout.replace('  ', ' ')
    return lineout
# ------------------------------------
print(correctline(line))
Basically, it loops through the characters of the input line. When it finds a marker, it skips everything up to the next marker, and finally it collapses runs of spaces into a single space.
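The same marker-skipping idea can also be sketched with the standard `re` module instead of a character loop. This swaps in regular expressions for the original technique; the function name `strip_marked` is hypothetical, and it assumes markers always come in pairs on a line.

```python
import re


def strip_marked(line, marker="*"):
    m = re.escape(marker)
    # Remove everything from one marker up to the next one (non-greedy).
    without_marked = re.sub(m + ".*?" + m, "", line)
    # Collapse runs of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", without_marked)


print(strip_marked("254 578 name1 *--21->28--* secname1"))
```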
If the names are of varying lengths and you don't want to just remove a fixed number of spaces between them, you can search for blank characters to find where secname begins and name ends:
# open file in "read" mode
fobj_in = open("85488_66325_R85V54.txt", "r")
# use readlines to create a list, each member containing a line of 85488_66325_R85V54.txt
lines = fobj_in.readlines()
fobj_in.close()
# For each line, search from the end backwards for the first " " char.
# When this char is found, create first_name, which holds the part of the
# line from here onwards, and remaining_line, which holds the part up to
# this point. Then search backwards for a non-" " char and drop the blank spaces.
# remaining_line and first_name can then be concatenated back together using
# + with the desired number of spaces between them (in this case 12).
for line_number, line in enumerate(lines):
    first_name_found = False
    new_line_created = False
    for i in range(1, len(line)):
        # compare strings with == and !=, not with "is"
        if line[-i] == " " and not first_name_found:
            first_name = line[-i+1:]
            remaining_line = line[:-i+1]
            first_name_found = True
            for j in range(len(remaining_line)):
                if remaining_line[-j-1] != " " and not new_line_created:
                    new_line = remaining_line[0:-j] + " "*12 + first_name
                    new_line_created = True
                    lines[line_number] = new_line
Then just write lines to 85488_66325_R85V54.txt.
You could try to do it as follows:
for line in fobj_in:
    setstring = line
    # note: as written, this removes every space in the line
    print(setstring.replace(" ", ""))
I have a script that puts the line that starts with #Solution 1 in a new file together with the name of the input file. But I want to add the piece belonging to Major from the input file. Can someone please help me to figure out how to get the piece of text?
The script now:
#!/usr/bin/env python3
import os
dr = "/home/nwalraven/Result_pgx/Runfolder/Runres_Aldy"
outdr = "/home/nwalraven/Result_pgx/Runfolder/Aldy_res_txt"
tag = ".aldy"
for f in os.listdir(dr):
    if f.endswith(tag):
        print(f)
        new_file_name = f.split('_')[0]+'.txt' # get the name of the file before the '_' and add '.txt' to it
        with open(dr+"/"+f) as file:
            for line in file.readlines():
                f
                if line.startswith("#Solution 1"):
                    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
                        new_file.write(f.split('.')[0] + "\n")
                        new_file.write(line + "\n")
                if line.startswith("#Solution 2"):
                    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
                        new_file.write(line + "\n")
                        # Dutch: "Multiple solutions found! Check the Aldy file"
                        print("Meerdere oplossingen gevonden! Check Aldy bestand" )
The input:
file = EMQN3-S3_COMT.aldy
#Sample Gene SolutionID Major Minor Copy Allele Location Type Coverage Effect dbSNP Code Status
#Solution 1: *Met, *ValB
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 0 Met 19950234 C>T 530 H62= rs4633
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 0 Met 19951270 G>A 651 V158M rs4680
EMQN3-S3 COMT 1 *Met/*ValB Met;ValB 1 ValB
file = EMQN3-S3_CYP2B6.aldy
#Sample Gene SolutionID Major Minor Copy Allele Location Type Coverage Effect dbSNP Code Status
#Solution 1: *1.001, *1.001
EMQN3-S3 CYP2B6 1 *1/*1 1.001;1.001 0 1.001
EMQN3-S3 CYP2B6 1 *1/*1 1.001;1.001 1 1.001
The result it gives right now:
EMQN3-S3_COMT.aldy
#Solution 1: *Met, *ValB
EMQN3-S3_CYP2B6.aldy
#Solution 1: *1.001, *1.001
The result I need:
EMQN3-S3_COMT.aldy
#Solution 1: *Met/*ValB
EMQN3-S3_CYP2B6.aldy
#Solution 1: *1/*1
If you print out the line, you could use a regular expression to replace text before printing the line.
On the other hand, if you know it always starts with a fixed number of chars, then it's easier and faster to edit the line manually.
With regex:
# Importing regular expressions
import re
# Setting up regex replacement to replace ", " with "/"
regex = r", "
replacement = "/"
...
# Format the line before writing it
line_formatted = re.sub(regex, replacement, line)
new_file.write(line_formatted + "\n") # edited
...
Try to replace this part of your script:
...
if line.startswith("#Solution 1"):
    with open(outdr+"/"+new_file_name,"a",newline='\n') as new_file:
        new_file.write(f.split('.')[0] + "\n")
        solution = "/".join([x.strip().split(".")[0] for x in line.split(",")])
        new_file.write(solution + "\n")
...
It will do the following:
split the string into two tokens, based on the comma
strip them
remove the decimal part (if any) from the token
rejoin the string using the slash.
Hope it helps.
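The split / strip / truncate / join steps listed above can be tried in isolation on the two header lines from the question. This is a sketch only; `format_solution` is a hypothetical helper name. Note that it cuts each token at its first period, so it assumes a period only ever introduces the decimal part.

```python
def format_solution(line):
    # Split on the comma, trim whitespace (including the newline),
    # keep only the text before the first period, then rejoin with "/".
    parts = [token.strip().split(".")[0] for token in line.split(",")]
    return "/".join(parts)


print(format_solution("#Solution 1: *Met, *ValB\n"))
print(format_solution("#Solution 1: *1.001, *1.001\n"))
```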
I have a file with lines in this format:
CALSPHERE 1
1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996
2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319
CALSPHERE 2
1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990
2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421
..etc.
I would like to parse this into a dictionary of the format:
{CALSPHERE 1:(1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996, 2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319),
CALSPHERE 2:(1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990, 2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421),...}
I'm puzzled as to how to parse this, so that every third line is the key, with the following two lines forming a tuple for the value. What would be the best way to do this in python?
I've attempted to add some logic for "every third line" though it seems kind of convoluted; something like
with open(r"file") as f:
    i = 3
    for line in f:
        if i % 3 == 0:
            key = line
        else:
            #not sure what to do with the next lines here
            pass
If your file always has the same layout (i.e. the 'CALSPHERE' line, or any other line that you want as your dictionary key, followed by two lines), you can achieve what you want with something like this:
with open(filename) as file:
    lines = file.read().splitlines()
d = dict()
for i in range(0, len(lines), 3):
    d[lines[i].strip()] = (lines[i + 1], lines[i + 2])
Output:
{
'CALSPHERE 1': ('1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996', '2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319'),
'CALSPHERE 2': ('1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990', '2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421')
}
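The fixed-stride approach raises an `IndexError` if the file ends with an incomplete group. A slightly more defensive sketch is shown here; `parse_tle` is a hypothetical name, and dropping blank lines and a trailing partial group is an assumption about what is wanted.

```python
def parse_tle(lines):
    # Ignore blank lines so stray empty lines do not shift the grouping.
    cleaned = [ln.strip() for ln in lines if ln.strip()]
    result = {}
    # Stop early enough that i + 2 is always a valid index.
    for i in range(0, len(cleaned) - 2, 3):
        result[cleaned[i]] = (cleaned[i + 1], cleaned[i + 2])
    return result


sample = [
    "CALSPHERE 1\n",
    "1 00900U 64063C 20161.15561498 .00000210 00000-0 21550-3 0 9996\n",
    "2 00900 90.1544 28.2623 0029666 80.8701 43.4270 13.73380512769319\n",
    "CALSPHERE 2\n",
    "1 00902U 64063E 20161.16836122 .00000025 00000-0 23933-4 0 9990\n",
    "2 00902 90.1649 30.9038 0019837 126.9344 3.6737 13.52683749559421\n",
]
print(parse_tle(sample))
```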
Assuming that your content is in file.txt, you can use the following.
It should work for any number of CALSPHERE occurrences and for a varying number of entries between them.
with open('file.txt') as inp:
    buffer = []
    for line in inp:
        # remove newline
        copy = line.replace('\n','')
        # check if next entry
        if 'CALSPHERE' in copy:
            buffer.append([])
        # add line
        buffer[-1].append(copy)
# put the output into dictionary
res = {}
for chunk in buffer:
    # safety check
    if len(chunk) > 1:
        res[chunk[0]] = tuple( chunk[1:] )
print(res)
I want to read some files with Python that contain certain data I need.
The structure of the files is like this:
NAME : a280
COMMENT : drilling problem (Ludwig)
TYPE : TSP
DIMENSION: 280
EDGE_WEIGHT_TYPE : EUC_2D
NODE_COORD_SECTION
1 288 149
2 288 129
3 270 133
4 256 141
5 256 157
6 246 157
7 236 169
8 228 169
9 228 161
So, the file starts with a few lines that contain data I need, then there are some random lines I do NOT need, and then there are lines with numerical data that I do need. I read everything that I need to read just fine.
However, my problem is that I cannot find a way to bypass the random number of lines that are sandwiched between the data I need. The lines from file to file can be 1, 2 or more. It would be silly to hardcode some f.readline() commands in there to bypass this.
I have thought of some regular expression to check if the line is starting with a string, in order to bypass it, but I'm failing.
In other words, there can be more lines like "NODE_COORD_SECTION" that I don't need in my data.
Any help is highly appreciated.
Well, you can simply check whether each line is valid (stuff you need), and if it is not, just skip it. For example:
for line in f:
    line_list = line.split()
    if not line_list:
        continue  # blank line
    if line_list[0].rstrip(':') in ['NAME', 'COMMENT', 'TYPE', ...]:
        pass  # header line you need: handle it here
    elif len(line_list) == 3 and all(s.isdigit() for s in line_list):
        pass  # line with three integer fields: handle it here
    else:
        continue  # anything else, e.g. NODE_COORD_SECTION: skip it
It would be nice if you added some formatting to the "lines of your file" and showed some code, but here's what I would try.
I would first define a list of strings that indicate a valid line, then split the current line into a list of strings and check whether the first element matches any element of that list.
If the first string doesn't match any of the valid strings, I would check whether the first element is an integer, and so on...
current_line = 'LINE OF TEXT FROM FILE'
VALID_WORDS = ['VALID_STR1','VALID_STR2','VALID_STR3']
elems = current_line.split()
valid_line = False
if elems[0] in VALID_WORDS:
    # If the first str is in the list of valid words,
    # continue...
    valid_line = True
elif len(elems) == 3:
    # If it's not in the list of valid words BUT has 3
    # elements, check if the first one is an int
    try:
        int(elems[0])
        valid_line = True
    except ValueError:
        valid_line = False
if valid_line:
    # Your thing
    pass
else:
    # Not a valid line: skip it (use "continue" when this runs inside a loop)
    pass
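Putting the validity check into a loop might look like the sketch below. It assumes, based on the sample file in the question, that header lines contain a colon and that data lines have exactly three numeric fields; `parse_tsp` is a hypothetical name, and any other line (such as `NODE_COORD_SECTION`) is silently skipped.

```python
def parse_tsp(lines):
    header = {}
    coords = []
    for raw in lines:
        line = raw.strip()
        if ":" in line:
            # "KEY : value" header line
            key, _, value = line.partition(":")
            header[key.strip()] = value.strip()
            continue
        parts = line.split()
        if len(parts) == 3:
            try:
                node, x, y = int(parts[0]), float(parts[1]), float(parts[2])
            except ValueError:
                continue  # three fields, but not numeric: skip
            coords.append((node, x, y))
        # every other line is ignored
    return header, coords


sample = [
    "NAME : a280\n",
    "COMMENT : drilling problem (Ludwig)\n",
    "TYPE : TSP\n",
    "DIMENSION: 280\n",
    "EDGE_WEIGHT_TYPE : EUC_2D\n",
    "NODE_COORD_SECTION\n",
    "1 288 149\n",
    "2 288 129\n",
]
header, coords = parse_tsp(sample)
print(header["NAME"], len(coords))
```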
I have some structured data in a text file:
Parse.txt
name1
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
Some of the detail4 entries have no data, which is shown as "-":
name2
detail:
aaaaaaaa
bbbbbbbb
cccccccc
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
How do I parse the data to get the elements below detail1, detail2 and detail3 for only the records with an empty detail4?
So far I have partially working code, but the problem is that it returns each item 40 times. Please help.
Code:
data = []
with open("parse.txt","r",encoding="utf-8") as text_file:
    for line in text_file:
        data.append(line)
det4li = []
finali = []
for elem,det4 in zip(data,data[1:]):
    if "detail4" in elem:
        det4li.append(det4)
        if "-" in det4:
            for elem1,det1,det2,det3 in zip(data,data[1:],data[3:],data[5:]):
                if "detail1:" in elem1:
                    finali.append(det1.strip() + "," + det2.strip() + "," + det3)
Current Output: 40 records of dddddddd,eeeeeeee,ffffffff
Desired Output: dddddddd,eeeeeeee,ffffffff
Don't try to look ahead. Look behind, by storing preceding data:
final = []
with open("parse.txt","r",encoding="utf-8") as text_file:
    section = {}
    last_header = None
    for line in text_file:
        line = line.strip()
        if line.startswith('detail'):
            # detail line, record for later use
            last_header = line.rstrip(':')
        elif not last_header:
            # name line, store as such
            section['name'] = line
        else:
            section[last_header] = line
            if last_header == 'detail4':
                # section complete, process
                if line == '-':
                    # A section we want to keep
                    final.append(section)
                # reset section data
                section, last_header = {}, None
This has the added advantage that you now don't need to read the whole file into memory. If you turn this into a generator (by putting it into a function and replacing the final.append(section) line with yield section), you can even process those matching sections as you read the file without sacrificing readability.
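A sketch of the generator variant described above. The function name `sections_with_empty_detail4` is hypothetical, and it takes any iterable of lines (an open file works) so it can be tried without a file on disk. Skipping blank lines is an added assumption that the original loop does not make.

```python
def sections_with_empty_detail4(lines):
    section, last_header = {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith("detail"):
            # header line, remember which field the next value belongs to
            last_header = line.rstrip(":")
        elif last_header is None:
            section["name"] = line
        else:
            section[last_header] = line
            if last_header == "detail4":
                if line == "-":
                    yield section
                # start collecting the next section
                section, last_header = {}, None


sample = """name1
detail:
aaaaaaaa
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
gggggggg
name2
detail:
aaaaaaaa
detail1:
dddddddd
detail2:
eeeeeeee
detail3:
ffffffff
detail4:
-
""".splitlines()

for s in sections_with_empty_detail4(sample):
    print(",".join([s["detail1"], s["detail2"], s["detail3"]]))
```

With a real file, the call would be `sections_with_empty_detail4(text_file)` inside the `with open(...)` block.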
The user can delete and add data in result.txt: PersonA might not exist, and PersonQ might exist one time but not the next. How do I read the data from the file, split it into lines, and calculate the total/average when I don't know which persons exist in the file at any given time?
PersonA;342;454;559;
PersonB;444;100;545;
PersonC;332;567;491;
PersonD;142;612;666;
I want to present it like this:
PersonA 342 454 559 TOTAL AVERAGE
PersonB 444 100 545 TOTAL AVERAGE
PersonC 332 567 491 TOTAL AVERAGE
PersonD 142 612 666 TOTAL AVERAGE
What can I write after this to get it right?
def show_result():
    text_file = open('result.txt', 'r')
    for line in text_file:
        if ';' in line:
            line2 = line.split(";")
            print line2
I want to use this:
line_total = sum(map(int, line2[1:]))
line_average = line_total / len(line2[1:])
But I receive this error message:
File "C:\Users\HKI\Desktop\test3.py", line 32, in show_result
line_total = sum(map(int, line2[1:]))
ValueError: invalid literal for int() with base 10: ''
I don't want to use pandas or similar.
All the lines contain a trailing ';', which adds an empty string to the end of the split result. Attempting to convert the empty string '' to int raises that error.
You should strip the trailing semicolon before splitting:
line = line.strip() # remove new line character and white spaces
line2 = line.rstrip(';').split(";")
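Combined with the total and average from the question, the stripped split could be used as in the sketch below. It is written for Python 3 (the question's `print line2` is Python 2, where `/` on two integers would truncate); `summarize` is a hypothetical helper name, and treating a line with no scores as average 0.0 is an assumption.

```python
def summarize(line):
    # Drop the newline and the trailing ';' before splitting.
    fields = line.strip().rstrip(";").split(";")
    name = fields[0]
    scores = [int(value) for value in fields[1:]]
    total = sum(scores)
    # Guard against a line that has a name but no scores.
    average = total / len(scores) if scores else 0.0
    return name, scores, total, average


name, scores, total, average = summarize("PersonA;342;454;559;\n")
print(name, *scores, total, round(average, 1))
```

Because each line is handled on its own, it does not matter which persons are present in result.txt on a given run.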
Just change line2 = line.split(";") to line2 = line.split(";")[:-1], then it will work.