Count the occurrences of 3-gram terms within multiple files - python

I have a list of around 10,000 3-gram terms in a .txt file. I want to match these terms within multiple .GHC files under a directory and count the occurrences of each term.
One of these files looks like this:
ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b ntdll.dll+0x1e8bd ntdll.dll+0x11a7 ntdll.dll+0x1e6f4 kernel32.dll+0xaa7f kernel32.dll+0xb50b kernel32.dll+0xb511 kernel32.dll+0x16d4f
I want the resulting output to be like this in a dataframe:
N_gram_term_1  N_gram_term_2  ...  N_gram_term_n
2              1              ...  0
3              2              ...  4
3              0              ...  3
The 2nd line here indicates that N_gram_term_1 appeared 2 times in the first file, N_gram_term_2 once, and so on.
The 3rd line indicates that N_gram_term_1 appeared 3 times in the second file, N_gram_term_2 twice, and so on.
If I need to be more clear about something, please let me know.

I am sure there are ready-made implementations for this purpose, perhaps in sklearn. A simple implementation from scratch, though, would be:
import sys
import pandas

d = {}  # outer key = file name, inner key = 3-gram term
for file in sys.argv[1:]:  # these are all the files to be analyzed
    d[file] = {}  # the value here is a nested dictionary of counts
    with open(file) as f:  # opening each file in turn
        for line in f:  # going through every row of the file
            g = line.strip()
            if g in d[file]:
                d[file][g] += 1
            else:
                d[file][g] = 1

print(pandas.DataFrame(d).T)
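For the sklearn route, CountVectorizer accepts a fixed vocabulary and an n-gram range, which fits this task. A minimal sketch, assuming the terms live one per line in a file called terms.txt (a hypothetical name), the .GHC files sit under a hypothetical logs/ directory, and each term is a whitespace-separated token triple as in the sample above:
import glob
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

with open("terms.txt") as f:  # hypothetical file holding the ~10,000 terms
    terms = list(dict.fromkeys(line.strip() for line in f if line.strip()))

files = sorted(glob.glob("logs/*.GHC"))  # assumed directory layout
texts = [open(p).read() for p in files]

# Tokens like "ntdll.dll+0x1e8bd" contain '.' and '+', so split on
# whitespace instead of the default word pattern.
vec = CountVectorizer(vocabulary=terms, ngram_range=(3, 3),
                      tokenizer=str.split, lowercase=False)
counts = vec.transform(texts)  # no fit needed with a fixed vocabulary

df = pd.DataFrame(counts.toarray(), columns=vec.get_feature_names_out(),
                  index=files)
print(df)
Each row is one file and each column one of the terms, with true zeros where a term never occurs; note that the from-scratch version above leaves NaN for terms it never sees.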

Related

Using python to search for strings in a file and use the output to group the content of a second file

I tried writing Python code that searches for one or more strings in file1.txt and then makes a change to the findall output (e.g., changes cap0001 to 1). Next, the code uses the modified output to group the content of file2.txt based on matches to the column "capNo" in File2.txt.
File1.txt:
>cap00001 supr2
x2shh qewrrw
dsfff rggfdd
>cap00002 supr5
dadamic adertsy
waeee ddccmet
File2.txt
Ref capNo qual
AM1 1 Good
AM8 1 Good
AM7 2 Poor
AM2 2 Good
AM9 2 Good
AM6 3 Poor
AM1 3 Poor
AM2 3 Good
Required output:
capNo counts
1 2
2 3
The following code did not work for me:
import re
With open("File1.txt","r") as InFile1:
    for line in InFile1:
        match=re.findall(r'cap\d+',line)
        if len(match) > 0:
            match=match.remove(cap0000)
With open("File2.txt","r") as InFile2:
    df=InFile2.read()
df2=df.groupby(match)["capNo"].value_counts()
print(df2)
How can I get this code working? Thanks
Change the Withs to with.
Call the read function, e.g.:
with open('File1.txt') as f:
    InFile1 = f.read()
    # Do something with InFile1
In your code df is a string - you can't call groupby on it (did you mean to convert it to a pandas DataFrame?)
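Putting those fixes together, a working sketch with pandas might look like this (it assumes the exact file layouts shown above, and that File2.txt is whitespace-delimited with a header row):
import re
import pandas as pd

# Collect the capNo values referenced in File1.txt: 'cap00001' -> 1, etc.
with open("File1.txt") as f1:
    cap_nos = [int(n) for n in re.findall(r"cap(\d+)", f1.read())]

# Read File2.txt, keep only rows whose capNo appears in File1.txt,
# and count the remaining rows per capNo.
df = pd.read_csv("File2.txt", sep=r"\s+")
counts = df[df["capNo"].isin(cap_nos)].groupby("capNo").size().rename("counts")
print(counts)
For the sample data this prints 2 for capNo 1 and 3 for capNo 2, matching the required output.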

Need to find a line pattern in another file and get the sum of the corresponding value in a particular column using python

I have 2 files: file 1 has the entries below, and file 2 has a few million records like the lines shown. Now I want to search for the file 1 entries in file 2 and generate a report in a new file with the sum of the 2nd column next to the corresponding pattern.
File 1 entries:
/dataset1
/dataset2
File 2 entries:
12 5 /opt/dataset1
6 0 /opt/dataset2
5 8 /dataset1
I am looking for the sum of the 2nd-column values, with the pattern next to each sum:
13 /dataset1
0 /dataset2
Thank you.
I would first process File 1 and create a regex with the following format:
\d\s+(\d)\s+\S*(\/dataset1|\/dataset2)
After creating the regex, just use re.findall to find all the relevant information, and sum all the matches. It should be easy...
Of course, the regex doesn't have a fixed format; you would need to generate it according to the lines of the first file. Something like this:
def generate_regex(file1_lines):
    regex = r"\d\s+(\d)\s+\S*("
    for line in file1_lines:
        line = line.replace("/", r"\/")
        regex += line.strip() + "|"
    regex = regex[:-1] + ")"
    return regex
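To complete the picture, the generated pattern could be applied like this (a sketch assuming the File1.txt/File2.txt names from the question; note the answer's (\d) group only captures a single-digit value in the 2nd column, so it would need (\d+) for larger numbers):
import re
from collections import defaultdict

with open("File1.txt") as f1:
    pattern = generate_regex(f1.readlines())

sums = defaultdict(int)
with open("File2.txt") as f2:
    # each match is a (value, pattern) tuple from the two capture groups
    for value, name in re.findall(pattern, f2.read()):
        sums[name] += int(value)

for name, total in sums.items():
    print(total, name)  # e.g. "13 /dataset1" and "0 /dataset2"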

Python print .psl format without quotes and commas

I am working on a Linux system using python3 with a file in the .psl format common in genetics. This is a tab-separated file that contains some cells with comma-separated values. A small example file with some of the features of a .psl is below.
input.psl
1 2 3 x read1 8,9, 2001,2002,
1 2 3 mt read2 8,9,10 3001,3002,3003
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
I need to filter this file to extract only regions of interest. Here, I extract only rows with a value of 9 in the fourth column.
import csv
def read_psl_transcripts():
    psl_transcripts = []
    with open("input.psl") as input_psl:
        csv_reader = csv.reader(input_psl, delimiter='\t')
        for line in csv_reader:  # iterate the reader, not the raw file
            # Extract only rows matching the chromosome of interest
            if '9' == line[3]:
                psl_transcripts.append(line)
    return psl_transcripts
I then need to be able to print or write these selected lines in a tab-delimited format matching the format of the input file, with no additional quotes or commas added. I can't seem to get this part right; additional brackets, quotes, and commas are always added. Below is an attempt using print().
outF = open("output.psl", "w")
for line in read_psl_transcripts():
    print(str(line).strip('"\''), sep='\t')
Any help is much appreciated. Below is the desired output.
1 2 3 9 read3 8,9,10,11 4001,4002,4003,4004
1 2 3 9 read4 8,9,10,11 4001,4002,4003,4004
You might be able to solve your problem with a simple awk statement:
awk '$4 == 9' input.psl > output.psl
But with python you could solve it like this:
write_psl = open("output.psl", "w")
with open("input.psl") as file:
    for line in file:
        splitted_line = line.split()
        if splitted_line[3] == '9':
            out_line = '\t'.join(splitted_line)
            write_psl.write(out_line + "\n")
write_psl.close()
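Another option, staying with the csv module the question already uses: csv.writer with a tab delimiter writes the filtered rows back out without adding brackets or quotes (a sketch reusing read_psl_transcripts() from above):
import csv

with open("output.psl", "w", newline="") as out:
    writer = csv.writer(out, delimiter="\t")
    writer.writerows(read_psl_transcripts())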

run a search over the items of a list, and save each search into a file

I have a data.dat file that has 3 columns; the 3rd column is just the numbers 1 to 6 repeated again and again:
(In reality, column 3 has numbers from 1 to 1917, but for a minimal working example, let's stick to 1 to 6.)
# Title
127.26 134.85 1
127.26 135.76 2
127.26 135.76 3
127.26 160.97 4
127.26 160.97 5
127.26 201.49 6
125.88 132.67 1
125.88 140.07 2
125.88 140.07 3
125.88 165.05 4
125.88 165.05 5
125.88 203.06 6
137.20 140.97 1
137.20 140.97 2
137.20 148.21 3
137.20 155.37 4
137.20 155.37 5
137.20 184.07 6
I would like to:
1) extract the lines that contain 1 in the 3rd column and save them to a file called mode_1.dat.
2) extract the lines that contain 2 in the 3rd column and save them to a file called mode_2.dat.
3) extract the lines that contain 3 in the 3rd column and save them to a file called mode_3.dat.
.
.
.
6) extract the lines that contain 6 in the 3rd column and save them to a file called mode_6.dat.
In order to accomplish this, I have:
a) defined a variable factor = 6
b) created a one_to_factor list that has the numbers 1 to 6
c) used a re.search statement to extract the lines for each value of one_to_factor; the %s is the i from the one_to_factor list
d) appended these results to an empty LINES list.
However, this does not work. I cannot manage to extract the lines that contain i in the 3rd column and save them to a file called mode_i.dat.
I would appreciate it if you could help me.
import re

factor = 6
one_to_factor = range(1, factor + 1)
LINES = []
f_2 = open('data.dat', 'r')
for line in f_2:
    for i in one_to_factor:
        if re.search(r' \b%s$' % i, line):
            print('line = ', line)
            LINES.append(line)
print('LINES =', LINES)
I would do it like this:
no regexes, just use str.split() to split according to whitespace
use last item (the digit) of the current line to generate the filename
use a dictionary to open the file the first time, and reuse the handle for subsequent matches (write title line at file open)
close all handles in the end
code:
title_line = "# Vol \t Freq \t Mod \n"
handles = dict()
next(f_2)  # skip title
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    # create the file the first time its id is encountered
    if filename not in handles:
        handles[filename] = open(filename, "w")
        handles[filename].write(title_line)  # write title
    handles[filename].write(line)
# close all files
for v in handles.values():
    v.close()
EDIT: that's the fastest way, but the problem is that if you have too many suffixes (like in your real example), you'll get a "too many open files" exception. So for this case, there's a slightly less efficient method which works too:
import glob, os

# pre-processing: clean up old files if any
for f in glob.glob("mode_*.dat"):
    os.remove(f)

next(f_2)  # skip title
s = set()
title_line = "# Vol \t Freq \t Mod \n"
for line in f_2:
    toks = line.split()
    filename = "mode_{}.dat".format(toks[-1])
    with open(filename, "a") as f:
        if filename not in s:
            s.add(filename)
            f.write(title_line)
        f.write(line)
It basically opens the file in append mode, writes the lines, and closes the file.
(The set is used to detect the first write to each file, so the title can be written before the data.)
There's a directory cleanup first to ensure that no data is left over from a previous computation: append mode expects that no file exists, and if the input data set changes, there could be an identifier not present in the new dataset, leaving an "orphan" file from the previous run.
First, instead of looping over your one_to_factor, you can get the index in one step:
index = line[-1] # last character on the line
Then, you can check if index is in your one_to_factor list.
You could then create a dictionary of lists to store your lines.
Something like:
{ "1" : [line1, line7, ...],
  "2" : ...,
}
And then you can use the key of the dictionary to create the file and populate it with lines.
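A sketch of that dictionary-of-lists approach (one caveat: on a raw line, line[-1] is usually the trailing newline, so the sketch takes the last whitespace-separated token instead; the file name and column layout are as in the question):
from collections import defaultdict

one_to_factor = {str(i) for i in range(1, 7)}
lines_by_index = defaultdict(list)

with open("data.dat") as f:
    next(f)  # skip the title line
    for line in f:
        index = line.split()[-1]  # last token, not the last character
        if index in one_to_factor:
            lines_by_index[index].append(line)

# one output file per index, populated from the collected lists
for index, lines in lines_by_index.items():
    with open("mode_{}.dat".format(index), "w") as out:
        out.writelines(lines)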

How to count the suffixes in words contained in a csv file?

I have previously found a way to count prefixes, as shown below. Is there a similar way for suffixes that is so obvious I'm missing it completely?
for i in range(0, len(hardprefix)):
    if len(word) > len(hardprefix[i]):
        if word.startswith(hardprefix[i]):
            hardprefixcount += 1
            break
I need this code to use the first column of the file and count how many of those words end with one of a set array of suffixes.
This is what I have so far:
for i in range(0, len(easysuffix)):
    if len(word) > len(easysuffix[i]):
        if word.endswith(easysuffix[i]):
            easysuffixcount += 1
            break
Below is a sample of my data from the csv file, with the suffix arrays below that:
on 1
only 4
our 1
own 1
part 7
piece 4
pieces 4
place 1
pressed 1
riot 1
september 1
shape 3
hardsuffix = ['ism']
easysuffix = ['ity', 'esome', 'ece']
Your input file is tab-delimited CSV, so you can use the csv module to process it.
import csv

suffixes = ['ity', 'esome', 'ece']
with open('input.csv') as words:
    suffix_count = 0
    reader = csv.reader(words, delimiter='\t')
    for word, _ in reader:
        if any(word.endswith(suffix) for suffix in suffixes):
            suffix_count += 1

print("Found {} suffix(es)".format(suffix_count))
