I'm wondering if there's a simple method for deleting particular rows and columns in Python. Apologies if this is a trivial question.
To give some context, I'm currently writing a script to automate a series of Linux commands (specifically CIAO Chandra telescope analysis commands), part of which saves the output of a certain command to a .dat file. At present the output includes some rows and columns which I don't want in there...
E.g. the data currently looks like this:
Data for Table Block HISTOGRAM
--------------------------------------------------------------------------------
ROW CELL RCENTRE RHALFWIDTH AREA COUNTS SUR_BRI
1 1 1.016260150 1.016260150 12.9783552105 0 0
2 1 3.048780450 1.016260150 38.9350656315 1.0 0.02568378873336
3 1 5.081300750 1.016260150 64.8917760526 1.0 0.01541027324001
4 1 7.113821050 1.016260150 90.8484864736 1.0 0.01100733802858
5 1 9.146341350 1.016260150 116.8051968946 0 0
6 1
-------------------------------------
-------------------------------------
I want to remove the first few rows, which comprise the "Data for Table Block HISTOGRAM" line and the dashes, and also the first two columns, which begin with "ROW" and "CELL".
Thanks in advance
Assuming your separator is a tab, or that you don't need exactly the same spacing between columns at the end, and assuming the interesting lines begin with a number (as shown in your example), you could write something like this:
def cutFile(fname, firstWantedColumn):
    """Keep wanted lines and columns: detect lines beginning with a number and keep the requested columns."""
    with open(fname, "r") as f:
        lf = f.readlines()
    txtOut = ""
    for l in lf:
        if l[:1].isdigit():  # detect if the first char is a digit (also guards empty lines)
            txtOut += "\t".join(l.split()[firstWantedColumn:]) + "\n"
    # writes e.g. "myFile_cut.dat"
    g = open(".".join(fname.split(".")[:-1]) + "_cut." + fname.split(".")[-1], "w")
    g.write(txtOut)
    g.close()

cutFile("myFile.dat", 2)
Edit: this is a brute-force solution; maybe you were talking about a more advanced one-liner solution, but I'm not sure that exists.
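For what it's worth, the core of that loop does compress into a comprehension. A minimal sketch under the same assumptions (lines of interest start with a digit, the first two columns are dropped; the file names are examples only):

# Minimal sketch, same assumptions as above; file names are examples only.
with open("myFile.dat") as f:
    rows = ["\t".join(line.split()[2:]) for line in f if line[:1].isdigit()]
with open("myFile_cut.dat", "w") as g:
    g.write("\n".join(rows) + "\n")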
I have a text file that looks something like this:
original:
0 1 2 3 4 5
1 3
3 1
5 4
4 5
expected output:
SET : {0,1,2,3,4,5}
RELATION:{(1,3),(3,1),(5,4),(4,5)}
REFLEXIVE : NO
SYMMETRIC : YES
Part of the task is to print the first line inside curly braces, and the rest within one giant pair of curly braces with each binary pair in parentheses. I am still a beginner, but I wanted to know: is there some way in Python to make one loop that treats the first line differently than the rest?
Try this, where filename.txt is your file:
with open("filename.txt", "r") as file:
set_firstline = []
first_string = file.readline()
list_of_first_string = list(first_string)
for i in range(len(list_of_first_string)):
if str(i) in first_string:
set_firstline.append(i)
print(set_firstline)
OUTPUT: [0, 1, 2, 3, 4, 5]
I'm new as well, so I hope I can help you.
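On the broader question of treating the first line differently from the rest: a common pattern is to consume the first line with readline() and then loop over the remaining lines of the same file handle. A hedged sketch, assuming whitespace-separated values and the output format shown in the question:

# Sketch only: readline() takes the first line, the for-loop gets the rest.
with open("filename.txt") as f:
    elements = f.readline().split()                              # first line: the set
    pairs = [tuple(line.split()) for line in f if line.strip()]  # the rest: the relation
print("SET : {" + ",".join(elements) + "}")
print("RELATION:{" + ",".join("(%s,%s)" % p for p in pairs) + "}")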
I have 2 files, where file 1 has the lines below and file 2 has the following lines, with some million records. Now I want to search for the file 1 entries in file 2 and generate a report, in a new file, with the sum of the 2nd column next to the corresponding pattern.
File 1 entries:
/dataset1
/dataset2
File 2 entries:
12 5 /opt/dataset1
6 0 /opt/dataset2
5 8 /dataset1
I'm looking for the sum of the 2nd column values, with the pattern next to each other:
13 /dataset1
0 /dataset2
thank you
CS
I would first process File 1 and create a regex with the following format:
\d+\s+(\d+)\s+\S*(\/dataset1|\/dataset2)
After creating the regex, just use re.findall to find all the relevant information, and sum all the matches. It should be easy...
Of course, the regex doesn't have a fixed format; you would need to generate it from the lines of the first file. Something like this:
def generate_regex(file1_lines):
    regex = r"\d+\s+(\d+)\s+\S*("        # capture the 2nd column, then the patterns
    for line in file1_lines:
        line = line.replace("/", r"\/")  # escape the slashes for the regex
        regex += line.strip() + "|"
    regex = regex[:-1] + ")"             # swap the trailing "|" for the closing paren
    return regex
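A hedged sketch of how the pieces could fit together, using the generate_regex above (the file names are placeholders, and the summing logic is my reading of the expected output):

import re
from collections import defaultdict

# Sketch: build the regex from file 1, then sum the captured 2nd-column
# values per matched pattern in file 2. File names are placeholders.
with open("file1.txt") as f1:
    regex = generate_regex(f1.readlines())

sums = defaultdict(int)
with open("file2.txt") as f2:
    for value, pattern in re.findall(regex, f2.read()):
        sums[pattern] += int(value)

with open("report.txt", "w") as out:
    for pattern, total in sums.items():
        out.write("%d %s\n" % (total, pattern))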
I have a list of around 10000 3-gram terms in a .txt file. I want to match these terms within multiple .GHC files under a directory and count the occurrences of each term.
One of these files looks like this:
ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b kernel32.dll+0xb511 kernel32.dll+0x16d4f
I want the resulting output to be like this in a dataframe:
N_gram_term_1 N_gram_term_2 ............ N_gram_term_n
2 1 0
3 2 4
3 0 3
The 2nd line here indicates that N_gram_term_1 appeared 2 times in the first file, N_gram_term_2 1 time, and so on.
The 3rd line indicates that N_gram_term_1 appeared 3 times in the second file, N_gram_term_2 2 times, and so on.
If I need to be more clear about something, please let me know.
I am sure implementations exist for this purpose, perhaps in sklearn. A simple implementation from scratch, though, would be:
import sys
import pandas

d = {}  # dictionary with 1st key = file and 2nd key = 3-gram
for file in sys.argv[1:]:    # these are all the files to be analyzed
    d[file] = {}             # the value here is a nested dictionary
    with open(file) as f:    # opening each file in turn
        for line in f:       # going through every row of the file
            g = line.strip()
            if g in d[file]:
                d[file][g] += 1
            else:
                d[file][g] = 1

print(pandas.DataFrame(d).T)
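Note that the snippet above counts whole lines. Since the .GHC files in the question hold space-separated tokens on a single line, a sketch that actually forms consecutive 3-grams might look like this (the term file name, the "data/" directory, and the space-joined term format are assumptions, not from the question):

import glob
import pandas

# Hedged sketch: count each 3-gram term from terms.txt in every .GHC file.
# "terms.txt" and the "data/" directory are assumptions.
with open("terms.txt") as f:
    terms = [line.strip() for line in f if line.strip()]

rows = []
for path in glob.glob("data/*.GHC"):
    with open(path) as f:
        tokens = f.read().split()
    # All consecutive 3-grams in the file, joined by single spaces.
    trigrams = [" ".join(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    counts = {}
    for t in trigrams:
        counts[t] = counts.get(t, 0) + 1
    rows.append([counts.get(term, 0) for term in terms])

print(pandas.DataFrame(rows, columns=terms))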
I am working on a Linux system, using Python 3, with a file in the .psl format common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv

def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:
            # Extract only rows matching the chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the input file, with no additional quotes or commas added. I can't seem to get this part right, and additional brackets, quotes and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement:
awk '$4 == 9' input.psl > output.psl
But in Python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        splitted_line = line.split()
        if splitted_line[3] == '9':
            out_line = '\t'.join(splitted_line)
            write_psl.write(out_line + "\n")
write_psl.close()
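Since the question already uses the csv module, an alternative sketch that lets csv.writer handle the tab-delimited output (same file names as above):

import csv

# Sketch: csv.writer emits plain tab-separated rows and adds no quotes,
# as long as the fields themselves contain no tabs or quote characters.
with open("input.psl") as fin, open("output.psl", "w", newline="") as fout:
    writer = csv.writer(fout, delimiter='\t')
    for row in csv.reader(fin, delimiter='\t'):
        if row[3] == '9':
            writer.writerow(row)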
I have previously found a way to count prefixes, as shown below, so is there a similar way for suffixes which is so obvious I'm missing it completely?
for i in range(0, len(hardprefix)):
    if len(word) > len(hardprefix[i]):
        if word.startswith(hardprefix[i]):
            hardprefixcount += 1
            break
I need this code to use the first column of the file and count the number of occurrences of a set array of suffixes found within those words.
This is what I have so far:
for i in range(0, len(easysuffix)):
    if len(word) > len(easysuffix[i]):
        if word.endswith(easysuffix[i]):
            easysuffixcount += 1
            break
Below is a sample of my data from the csv file, with the arrays containing the suffixes below that:
on 1
only 4
our 1
own 1
part 7
piece 4
pieces 4
place 1
pressed 1
riot 1
september 1
shape 3
hardsuffix = ['ism']
easysuffix = ['ity', 'esome', 'ece']
Your input file is tab-delimited CSV, so you can use the csv module to process it:
import csv

suffixes = ['ity', 'esome', 'ece']

with open('input.csv') as words:
    suffix_count = 0
    reader = csv.reader(words, delimiter='\t')
    for word, _ in reader:
        if any(word.endswith(suffix) for suffix in suffixes):
            suffix_count += 1

print("Found {} suffix(es)".format(suffix_count))