I am a Python beginner. Imagine I have a .txt file which contains
123456789 1234 apple\wasdsa\sgfgf\sgf\rgfd.csv
124555669 6547 mango\sdf\hjt\sthsdth\eth.txt
564984565 58475 ksfjk\hjkf\tkohj\fdgs.opp
and the list continues. I need to format it like this, with a header and with serial numbers that keep incrementing with each line:
Sr.no. MD5 Size Path
1 123456789 1234 apple\wasdsa\sgfgf\sgf\rgfd.csv
2 124555669 6547 mango\sdf\hjt\sthsdth\eth.txt
3 564984565 58475 ksfjk\hjkf\tkohj\fdgs.opp
I am not able to overwrite it in the same .txt file, and I am also not able to generate the serial numbers. Please help me.
You could use
data = """
123456789 1234 test123
124555669 6547 test456
564984565 58475 test789
"""
header = "Sr.no. MD5 Size Path\n"
output = header + "\n".join(
    "{}\t{}".format(line_number, line)
    for line_number, line in enumerate(
        (item for item in data.split("\n") if item), 1))
print(output)
Which would yield
Sr.no. MD5 Size Path
1 123456789 1234 test123
2 124555669 6547 test456
3 564984565 58475 test789
The question is whether those escape characters are really present in the actual string.
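To overwrite the same file, which the question also asks about, one approach is to read the lines first and then reopen the file in write mode. A minimal sketch, using a hypothetical file name files.txt and sample data in place of the real content:

```python
# Sample data standing in for the real file (hypothetical path "files.txt").
with open("files.txt", "w") as f:
    f.write("123456789 1234 test123\n"
            "124555669 6547 test456\n")

header = "Sr.no. MD5 Size Path\n"

# Read the original lines first, then reopen the same file in
# write mode so the numbered version replaces the old content.
with open("files.txt") as f:
    lines = [line.strip() for line in f if line.strip()]

with open("files.txt", "w") as f:
    f.write(header)
    for number, line in enumerate(lines, 1):
        f.write("{}\t{}\n".format(number, line))
```

Reading everything before reopening in "w" mode matters: opening for writing truncates the file immediately.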
I am working on a Linux system using Python 3 with a file in .psl format, common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv

def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:
            # Extract only rows matching chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right, and additional brackets, quotes and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement.
awk '$4 == 9' input.psl > output.psl
But in Python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        splitted_line = line.split()
        if splitted_line[3] == '9':
            out_line = '\t'.join(splitted_line)
            write_psl.write(out_line + "\n")
write_psl.close()
I have a .txt file with 3 different columns. The first one is just numbers. The second one is numbers from 0 to 7. The final one is a sentence. I want to keep them in different lists so I can match them by their numbers, and I want to write a function for this. How can I separate them into different lists without disrupting them?
The example of .txt:
1234 0 my name is
6789 2 I am coming
2346 1 are you new?
1234 2 Who are you?
1234 1 how's going on?
And I have to keep them like this:
----1----
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
----2----
2346 1 are you new?
----3-----
6789 2 I am coming
What I've tried so far:
inputfile=open('input.txt','r').read()
m_id=[]
p_id=[]
packet_mes=[]
input_file=inputfile.split(" ")
print(input_file)
input_file=line.split()
m_id=[int(x) for x in input_file if x.isdigit()]
p_id=[x for x in input_file if not x.isdigit()]
With your current approach, you are reading the entire file as a single string and splitting it on spaces; you'd much rather split on newlines instead, because each record sits on its own line. Furthermore, you're not separating your data into its three columns properly.
You have 3 columns. You can split each line into 3 parts using str.split(None, 2). The None means split on any run of whitespace, and the 2 limits the number of splits so the sentence stays intact. Each group will be stored as a key-list pair inside a dictionary. Here I use an OrderedDict in case you need to maintain order, but you can just as easily declare o = {} as a normal dictionary with the same grouping (but no order!).
from collections import OrderedDict

o = OrderedDict()
with open('input.txt', 'r') as f:
    for line in f:
        i, j, k = line.strip().split(None, 2)
        o.setdefault(i, []).append([int(i), int(j), k])

print(dict(o))
{'1234': [[1234, 0, 'my name is'],
[1234, 2, 'Who are you?'],
[1234, 1, "how's going on?"]],
'6789': [[6789, 2, 'I am coming']],
'2346': [[2346, 1, 'are you new?']]}
Always use the with...as context manager when working with file I/O - it makes for clean code. Also, note that for larger files, iterating over each line is more memory efficient.
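Building on the same dictionary, a sketch (with the sample rows inlined as a list in place of the file) that prints the groups in the ----1---- style the question asks for:

```python
from collections import OrderedDict

# Inline sample rows standing in for the lines of input.txt.
lines = ["1234 0 my name is",
         "6789 2 I am coming",
         "2346 1 are you new?",
         "1234 2 Who are you?",
         "1234 1 how's going on?"]

o = OrderedDict()
for line in lines:
    i, j, k = line.strip().split(None, 2)
    o.setdefault(i, []).append([int(i), int(j), k])

# Number the groups, and sort each group's rows by the second column.
for number, key in enumerate(sorted(o), 1):
    print("----{}----".format(number))
    for row in sorted(o[key], key=lambda r: r[1]):
        print(row[0], row[1], row[2])
```

Note the keys sort as strings here; convert them to int in the sorted() call if numeric order matters.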
Maybe you want something like this:
import re

# Collect data from input file
h = {}
with open('input.txt', 'r') as f:
    for line in f:
        res = re.match(r"^(\d+)\s+(\d+)\s+(.*)$", line)
        if res:
            if res.group(1) not in h:
                h[res.group(1)] = []
            h[res.group(1)].append((res.group(2), res.group(3)))

# Output result
for i, x in enumerate(sorted(h.keys())):
    print("-------- %s -----------" % (i + 1))
    for y in sorted(h[x]):
        print("%s %s %s" % (x, y[0], y[1]))
The result is as follow (add more newlines if you like):
-------- 1 -----------
1234 0 my name is
1234 1 how's going on?
1234 2 Who are you?
-------- 2 -----------
2346 1 are you new?
-------- 3 -----------
6789 2 I am coming
It's based on regexes (module re in Python). This is a good tool when you want to match simple line-based patterns.
Here it relies on whitespace as the column separator, but it can just as easily be adapted for fixed-width columns.
The results are collected in a dictionary of lists, each list containing tuples (pairs) of position and text.
The items are sorted at output time rather than as they are collected.
The code below is a bit ugly, but it's quite easy to understand.
raw = []
with open("input.txt", "r") as file:
    for x in file:
        raw.append(x.strip().split(None, 2))
raw = sorted(raw)

title = raw[0][0]
refined = []
cluster = []
for x in raw:
    if x[0] == title:
        cluster.append(x)
    else:
        refined.append(cluster)
        cluster = []
        title = x[0]
        cluster.append(x)
refined.append(cluster)

for number, group in enumerate(refined, 1):
    print("-" * 10 + str(number) + "-" * 10)
    for line in group:
        print(*line)
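A tidier variant of the same clustering, assuming the file fits in memory, is itertools.groupby on the sorted rows (shown here with the rows inlined as a list instead of read from the file):

```python
from itertools import groupby

# Inline sample rows standing in for x.strip().split(None, 2) per file line.
raw = [["1234", "0", "my name is"],
       ["6789", "2", "I am coming"],
       ["2346", "1", "are you new?"]]

# groupby clusters *consecutive* rows sharing a key, so the rows
# must be sorted by that key first.
refined = [list(group) for _, group in
           groupby(sorted(raw), key=lambda row: row[0])]

for number, group in enumerate(refined, 1):
    print("-" * 10 + str(number) + "-" * 10)
    for line in group:
        print(*line)
```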
I am fairly new to Python. I have a text file containing many blocks of data in the following format, along with other unnecessary blocks.
NOT REQUIRED :: 123
Connected Part-1:: A ~$
Connected Part-3:: B ~$
Connector Location:: 100 200 300 ~$
NOT REQUIRED :: 456
Connected Part-2:: C ~$
I wish to extract the info (A, B, C, 100 200 300) corresponding to each property (Connected Part-1, Connector Location) and store it as a list to use later. I have prepared the following code, which reads the file, cleans each line, and stores it as a list.
import fileinput

with open('C:/Users/file.txt') as f:
    content = f.readlines()

for line in content:
    if 'Connected Part-1' in line or 'Connected Part-3' in line:
        if 'Connected Part-1' in line:
            connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
            print('PART_1:', connected_part_1)
        if 'Connected Part-3' in line:
            connected_part_3 = [s.strip(' \n ~ $ Connected Part -3 ::') for s in content]
            print('PART_3:', connected_part_3)
    if 'Connector Location' in line:
        # removing unwanted characters and converting into the list
        content_clean_1 = [s.strip('\n ~ $ Connector Location::') for s in content]
        # converting a single string item in list to a string
        s = " ".join(content_clean_1)
        # splitting the string and converting into a list
        weld_location = s.split(" ")
        print('POSITION', weld_location)
here is the output
PART_1: ['A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
POSITION ['d', 'Part-1::', 'A', '\t\tConnector', 'Location::', '100.00', '200.00', '300.00', '\t\tConnected', 'Part-3::', 'C~\t']
PART_3: ['1:: A', '\t\tConnector Location:: 100.00 200.00 300.00', '\t\tConnected Part-3:: C~\t']
From the output of this program, I may conclude that, since 'content' holds all the text in the file, the program is not processing individual lines. Instead it is treating all the text as a single string. Could anyone please help in this case?
I am expecting following output:
PART_1: ['A']
PART_3: ['C']
POSITION: ['100.00', '200.00','300.00']
(Note) When I am using individual files containing a single line of data, it works fine. Sorry for such a long question.
I will try to make it clear, and show how I would do it without regex. First of all, the biggest issue with the code presented is that when using str.strip, the entire content list is being processed:
connected_part_1 = [s.strip(' \n ~ $ Connected Part -1 ::') for s in content]
content is the entire list of file lines; I think you simply want something like:
connected_part_1 = [line.strip(' \n ~ $ Connected Part -1 ::')]
How to parse the file is a bit subjective, but given the file format posted as input, I would do it like this:
templatestr = "{}: {}"

with open('inputreadlines.txt') as f:
    content = f.readlines()

for line in content:
    label, value = line.split('::')
    ltokens = label.split()
    if ltokens[0] == 'Connected':
        print(templatestr.format(
            ltokens[-1],            # the last word of the label
            value.split()[:-1]))    # the split value without the last word '~$'
    elif ltokens[0] == 'Connector':
        print(value.split()[:-1])   # the split value without the last word '~$'
    else:  # NOT REQUIRED
        pass
You can use str.strip to remove the trailing '~$' characters instead of dropping the last token as in the example.
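For instance, a minimal sketch of cleaning one value with str.strip, which removes any of the listed characters from both ends of the string:

```python
# One sample line from the question's file format.
line = "Connected Part-1:: A ~$\n"

label, value = line.split("::")
# strip() removes spaces, '~', '$' and the newline from both ends,
# leaving just the payload between the markers.
cleaned = value.strip(" ~$\n")
print(cleaned)
```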
What my text is:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = first label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
What I want:
$TITLE = XXXX YYYY
1 $SUBTITLE= XXXX YYYY ANSA
2 $LABEL = new label
3 $DISPLACEMENTS
4 $MAGNITUDE-PHASE OUTPUT
5 $SUBCASE ID = 30411
The code I am using:
import re

fo = open("test5.txt", "r+")
num_lines = sum(1 for line in open('test5.txt'))
count = 1
while count <= num_lines:
    line1 = fo.readline()
    j = line1[17:72]
    j1 = re.findall('\d+', j)
    k = map(int, j1)
    if k == [30411]:
        count1 = count - 4
        line2 = fo.readlines()[count1]
        r1 = line2[10:72]
        r11 = str(r1)
        r2 = "new label"
        r22 = str(r2)
        newdata = line2.replace(r11, r22)
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(newdata)
    else:
        f1 = open("output7.txt", 'a')
        lines = f1.writelines(line1)
    count = count + 1
The problem is in the writing of the line. Once 30411 is found, the code has to go 3 lines back and change the label to the new one. The output file should have all lines the same as before except the label line. But it is not writing properly. Can anyone help?
Apart from many blood-curdling but noncritical problems, you are calling readlines() in the middle of an iteration using readline(), causing you to read lines not from the beginning of the file but from the current position of the fo handle, i.e. after the line containing 30411.
You need to open the input file again with a separate handle or (better) store the last 4 lines in memory instead of rereading the one you need to change.
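A sketch of the in-memory approach: read all the lines once, find the line containing 30411, and rewrite the $LABEL line three lines earlier. The sample data is inlined from the question; the 3-line offset and the marker text are assumptions taken from it, so adjust them to the real file:

```python
# Inline sample standing in for the real test5.txt.
with open("test5.txt", "w") as f:
    f.write("$TITLE = XXXX YYYY\n"
            "$SUBTITLE= XXXX YYYY ANSA\n"
            "$LABEL = first label\n"
            "$DISPLACEMENTS\n"
            "$MAGNITUDE-PHASE OUTPUT\n"
            "$SUBCASE ID = 30411\n")

with open("test5.txt") as f:
    lines = f.readlines()

# When the marker line is found, replace the line 3 positions earlier.
for i, line in enumerate(lines):
    if "30411" in line:
        lines[i - 3] = "$LABEL = new label\n"

with open("output7.txt", "w") as f:
    f.writelines(lines)
```

Holding the whole file in a list sidesteps the file-position problem entirely, since no handle is being read from two places at once.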
I have a text file listing the sizes of all files with the extension *.AAA on different servers. I would like to extract the filename + size for each server where the file is bigger than 20 GB. I know how to extract a line from a file and display it, but here is my example and what I would like to achieve.
The example of the file itself:
Pad 1001
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:07 AM 894,889,984 File1.AAA
05/25/2015 07:18 AM 25,673,969,664 File2.AAA
02/11/2016 02:07 AM 17,879,040 File3.AAA
05/25/2015 07:18 AM 12,386,304 File4.AAA
10/13/2008 10:29 AM 1,186,988,032 File3.AAA_oct13
02/15/2016 11:15 AM 2,799,263,744 File5.AAA
6 File(s) 30,585,376,768 bytes
0 Dir(s) 28,585,127,936 bytes free
Pad 1002
Volume in drive \\192.168.0.101\c$ has no label.
Volume Serial Number is XXXX-XXXX
Directory of \\192.168.0.101\c$\TESTUSER\
02/11/2016 02:08 AM 1,379,815,424 File1.AAA
02/11/2016 02:08 AM 18,542,592 File3.AAA
02/15/2016 12:41 AM 853,659,648 File5.AAA
3 File(s) 2,252,017,664 bytes
0 Dir(s) 49,306,902,528 bytes free
Here is what I would like as my output, the Pad # and the file that is bigger than 20 GB:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA
I will eventually put this in an Excel spreadsheet, but that part I know how to do.
Any Ideas?
Thank you
The following should get you started:
import re

output = []
with open('input.txt') as f_input:
    text = f_input.read()

for pad, block in re.findall(r'(Pad \d+)(.*?)(?=Pad|\Z)', text, re.M + re.S):
    file_list = re.findall(r'^(.*? +([0-9,]+) +.*?\.AAA\w*?)$', block, re.M)
    for line, length in file_list:
        length = int(length.replace(',', ''))
        if length > 2e10:  # Or your choice of what 20GB is
            output.append((pad, line))

print(output)
This would display a list with one tuple entry as follows:
[('Pad 1001', '05/25/2015 07:18 AM 25,673,969,664 File2.AAA')]
[EDIT] Here is my approach:
import re

result = []
with open('txtfile.txt', 'r') as f:
    content = [line.strip() for line in f.readlines()]

for line in content:
    m = re.findall(r'\d{2}/\d{2}/\d{4}\s+\d{2}:\d{2}\s+(A|P)M\s+([0-9,]+)\s+((?!.AAA).)*.AAA((?!.AAA).)*', line)
    if line.startswith('Pad') or m and int(m[0][1].replace(',', '')) > 20 * 1024 ** 3:
        result.append(line)

print(re.sub(r'Pad\s+\d+$', '', ' '.join(result)))
Output is:
Pad 1001 05/25/2015 07:18 AM 25,673,969,664 File2.AAA