Divide Every Other Line Between Two Files - python

I am currently trying to automate a process that alerts me if we see a sudden change in the number of VMs running month over month. Here is what the data looks like:
January.txt      February.txt
Web Fleet        Web Fleet
100              112
Proxy Fleet      Proxy Fleet
25               22
Beta Fleet       Beta Fleet
12               10
I basically want to open the two files and have Python divide each line with numbers between the two files. From there, I can say "if <= 1 then alarm", that kind of thing. But I can't seem to figure out how to tell it to do every other line between two different files. Normally I would do this in bash, but I am trying to keep the entire process in an existing Python script that already generates the files and performs some other tasks on this data.
Here is how I got it sort of working with bash:
paste January.txt February.txt | awk 'NR%2==0' | awk '{ print $1 / $2 }'
EDIT: The data is always in the same order: Web Fleet always at the top, Proxy Fleet always second, and so on.

Code -
with open('January.txt', 'r') as f1, open('February.txt', 'r') as f2:
    for x, y in zip(f1.read().splitlines()[1::2], f2.read().splitlines()[1::2]):
        print(float(x) / int(y))
Output -
0.8928571428571429
1.1363636363636365
1.2

It's probably simpler to read the contents of both files upfront, zip them together, then iterate moving 2 lines at a time:
with open("jan.txt") as file1:
with open("feb.txt") as file2:
lines = zip(file1.readlines(), file2.readlines())
for line1, line2 in lines[1::2]:
val1 = float(line1.strip())
val2 = float(line2.strip())
print val1/val2
The [1::2] bit means to start from index 1 and move 2 items at a time.
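For example, a quick illustration of that slice on the January column from the sample data:
lines = ["Web Fleet", "100", "Proxy Fleet", "25", "Beta Fleet", "12"]
print(lines[1::2])   # ['100', '25', '12'] -- every second item, starting at index 1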
An alternative implementation that doesn't read the file contents upfront:
with open("jan.txt") as file1:
with open("feb.txt") as file2:
while True:
file1.readline()
file2.readline()
line1 = file1.readline()
line2 = file2.readline()
if line1 == "" or line2 == "":
break
val1 = float(line1.strip())
val2 = float(line2.strip())
print val1/val2
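The "if <= 1 then alarm" step can then sit on top of the same pairing. A minimal sketch, assuming the fleet names and counts alternate in the same order in both files (per the edit) and using 1 as the alarm threshold mentioned in the question:
with open("January.txt") as f1, open("February.txt") as f2:
    jan = f1.read().splitlines()
    feb = f2.read().splitlines()

# names sit at even indices (0, 2, 4, ...), counts at odd indices (1, 3, 5, ...)
for name, x, y in zip(jan[0::2], jan[1::2], feb[1::2]):
    ratio = float(x) / float(y)
    if ratio <= 1:   # "if <= 1 then alarm" from the question
        print("ALARM: {} ratio {:.2f}".format(name, ratio))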

Related

How to optimize checking IP addresses and getting a country?

Using Python 3.9.2 on Win10, I'm trying to get the country for IP addresses in a log file (about 65000 lines). I have a .csv containing IP ranges (22000 lines) and the respective countries, looking like:
[...]
2.16.9.0,2.16.9.255,DE,Germany
2.16.11.0,2.16.11.255,FR,France
2.16.12.0,2.16.13.255,CH,Switzerland
2.16.23.0,2.16.23.255,DE,Germany
2.16.30.0,2.16.33.255,DE,Germany
2.16.34.0,2.16.34.255,FR,France
[...]
I'm using Python's ipaddress module and iterate through the list of ranges, checking whether the current IP falls within a range to get the country. Before that, I check that two conditions are true.
My goal is to count how many connections came from each of the three countries. An example:
import ipaddress
import csv

with open(PATH) as logfile:
    logfile_lines = [line.split('\t') for line in logfile]
with open(PATH, 'r') as ipdaten:
    ipdaten_lines = [line.split(',') for line in ipdaten]

streams_france = 0
for line in logfile_lines:
    line2 = int(line[9])
    stream = str(line[3])
    iplog = line[1]
    ipobj = ipaddress.ip_address(iplog)
    # [...]
    if line2 > 60 and stream == "stream2":
        for ips in ipdaten_lines:
            if ipobj >= ipaddress.IPv4Address(ips[0]) and ipobj <= ipaddress.IPv4Address(ips[1]):
                land = ips[3]
                if land == "France\n":
                    streams_france += 1
                break
    # [...]
The code works, but it is very slow; after well over an hour it is still running. For line2 > 60 and stream == "stream2" there are about 9000 cases in which both are true.
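One common way to speed this up is to avoid scanning all 22000 ranges for every IP: load the ranges once into a list sorted by the integer value of the start address, then locate each IP with bisect. A minimal sketch, assuming the ranges do not overlap; the file name ip_ranges.csv and the Counter for per-country totals are illustrative, not from the question:
import bisect
import csv
import ipaddress
from collections import Counter

# build a sorted lookup table once: (start_int, end_int, country)
ranges = []
with open("ip_ranges.csv", newline="") as f:   # illustrative file name
    for start, end, code, country in csv.reader(f):
        ranges.append((int(ipaddress.IPv4Address(start)),
                       int(ipaddress.IPv4Address(end)),
                       country))
ranges.sort()
starts = [r[0] for r in ranges]

counts = Counter()

def country_of(ip_str):
    ip = int(ipaddress.IPv4Address(ip_str))
    i = bisect.bisect_right(starts, ip) - 1   # last range starting at or below ip
    if i >= 0 and ip <= ranges[i][1]:
        return ranges[i][2]
    return None
Each lookup is then O(log n), and a call like counts[country_of(iplog)] += 1 could replace the inner for loop inside the line2 > 60 and stream == "stream2" check.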

Obtain TSV from text with a specific pattern

I'm a biologist and I need to extract information from a text file.
I have a file with plain text like this:
12018411
Comparison of two timed artificial insemination (TAI) protocols for management of first insemination postpartum.
TAI|timed artificial insemination|0.999808
Two estrus-synchronization programs were compared and factors influencing their success over a year were evaluated. All cows received a setup injection of PGF2alpha at 39 +/- 3 d postpartum. Fourteen days later they received GnRH, followed in 7 d by a second injection of PGF2alpha. Cows (n = 523) assigned to treatment 1 (modified targeted breeding) were inseminated based on visual signs of estrus at 24, 48, or 72 h after the second PGF2alpha injection. Any cow not observed in estrus was inseminated at 72 h. Cows (n = 440) assigned to treatment 2 received a second GnRH injection 48 h after the second PGF2alpha, and all were inseminated 24 h later. Treatment, season of calving, multiple birth, estrual status at insemination, number of occurrences of estrus before second PGF2alpha, prophylactic use of PGF2alpha, retained fetal membranes, and occurrence of estrus following the setup PGF2alpha influenced success. Conception rate was 31.2% (treatment 1) and 29.1% (treatment 2). A significant interaction occurred between protocol and estrual status at insemination. Cows in estrus at insemination had a 45.8% (treatment 1) or 35.4% (treatment 2) conception rate. The conception rate for cows not expressing estrus at insemination was 19.2% (treatment 1) and 27.7% (treatment 2). Provided good estrous detection exists, modified targeted breeding can be as successful as other timed artificial insemination programs. Nutritional, environmental, and management strategies to reduce postpartum disorders and to minimize the duration of postpartum anestrus are critical if synchronization schemes are used to program first insemination after the voluntary waiting period.
8406022
Deletion of the beta-turn/alpha-helix motif at the exon 2/3 boundary of human c-Myc leads to the loss of its immortalizing function.
The protein product (c-Myc) of the human c-myc proto-oncogene carries a beta-turn/alpha-helix motif at the exon2/exon3 boundary. The amino acid (aa) sequence and secondary structure of this motif are highly conserved among several nuclearly localized oncogene products, c-Myc, N-Myc, c-Fos, SV40 large T and adenovirus (Ad) Ela. Removal of this region from Ad E1a results in the loss of the transforming properties of the virus without destroying its known transregulatory functions. In order to analyse whether deletion of the above-mentioned region from c-Myc has a similar effect on its transformation activity, we constructed a deletion mutant (c-myc delta) lacking the respective aa at the exon2/exon3 boundary. In contrast to the c-myc wild-type gene product, constitutive expression of c-myc delta does not lead to the immortalization of primary mouse embryo fibroblast cells (MEF cells). This result indicates that c-Myc and Ad El a share a common domain which is involved in the transformation process by both oncogenes.
aa|amino acid|0.99818
Ad|adenovirus|0.96935
MEF cells|mouse embryo fibroblast cells|0.994648
The first line is the ID, the second line is the title, the third line is usually the abstract (sometimes an abbreviation line comes first), and the last lines (if there are any) are abbreviations: two leading spaces, the abbreviation, the meaning, and a number. You can see:
GA|general anesthesia|0.99818
Then there is a blank line and it starts again: ID, Title, Abstract, Abbreviations or ID, Title, Abbreviations, Abstract.
And I need to take this data and convert it to a TSV file like this:
12018411    TAI    timed artificial insemination
8406022     aa     amino acids
8406022     Ad     adenovirus
...         ...    ...
First column: ID; second column: abbreviation; third column: the meaning of the abbreviation.
I tried converting it first into a DataFrame and then to TSV, but I don't know how to extract the information from the text with the structure I need.
And I tried with this code too:
from collections import namedtuple
import pandas as pd

Item = namedtuple('Item', 'ID')
items = []
with open("identify_abbr-out.txt", "r", encoding='UTF-8') as f:
    lines = f.readlines()
    for line in lines:
        if line == '\n':
            ID = ¿nextline?
        if line.startswith("  "):
            Abbreviation = line
            items.append(Item(ID, Abbreviation))
df = pd.DataFrame.from_records(items, columns=['ID', 'Abbreviation'])
But I don't know how to read the next line, and the code doesn't work because sometimes there are blank lines in the middle, between the abstract and the title.
I'm using python 3.8
Thank you very much in advance.
Assuming test.txt has your input data, I used simple file-read functions to process the data:
file1 = open('test.txt', 'r')
Lines = file1.readlines()

outputlines = []
outputline = ""
counter = 0
for l in Lines:
    if l.strip() == "":
        outputline = ""
        counter = 0
    elif counter == 0:
        outputline = outputline + l.strip() + "|"
        counter = counter + 1
    elif counter == 1:
        counter = counter + 1
    else:
        if len(l.split("|")) == 3 and l[0:2] == "  ":
            outputlines.append(outputline + l.strip() + "\n")
        counter = counter + 1

file1 = open('myfile.txt', 'w')
file1.writelines(outputlines)
file1.close()
Here the file is read line by line; a counter is kept and reset whenever there is a blank line, and the ID is read from the line immediately after the blank. If a row has three "|"-separated fields and begins with two spaces, it is exported together with the ID.
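The lines written above are "|"-separated rather than tab-separated; since the question asks for a TSV of ID, abbreviation and meaning, a small variation can collect tuples and write them with csv.writer. A sketch under the same assumptions (blank line between records, abbreviation lines starting with two spaces); the output name abbreviations.tsv is made up:
import csv

rows = []
current_id = None
expect_id = True

with open("identify_abbr-out.txt", encoding="UTF-8") as f:
    for line in f:
        if line.strip() == "":
            expect_id = True   # the next non-blank line is a new ID
        elif expect_id:
            current_id = line.strip()
            expect_id = False
        elif line.startswith("  ") and len(line.split("|")) == 3:
            abbr, meaning, _score = line.strip().split("|")
            rows.append((current_id, abbr, meaning))

with open("abbreviations.tsv", "w", newline="") as out:
    csv.writer(out, delimiter="\t").writerows(rows)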

Create multiple files based on user input in python

I am new to Python and I've written code that creates configuration files for my application. The code works for 2 IPs, but the user may input more IPs, and for each additional IP another config file is needed. There are authentication servers, and there can only be 1 or 2 of them.
I am passing input to the Python code via a file named "inputfile"; below is what it looks like:
EnterIp_list: ip_1 ip_2
authentication_server: as_1 as_2
Below is how the final configuration files are created:
configfile1:             configfile2:
App_ip: ip_1             App_ip: ip_2
app_number: 1            app_number: 2
authen_server: as_1      authen_server: as_2
Below is what the Python 3 code looks like:
def createconfig(filename, app_ip, app_number, authen_server):
    with open(filename, 'w') as inf:
        inf.write("App_ip=" + app_ip + "\n")
        inf.write("app_number=" + str(app_number) + "\n")
        inf.write("authen_server=" + authen_server)
with open("inputfile") as f:
for line in f:
if EnterIP_list in line:
a= line.split("=")
b = a[1].split()
if authentiation_server in line:
c= line.split("=")
d=c[1].split()
createconfig(configfile1, b[0], 1, d[0])
createconfig(configfile2, b[1], 2, d[1])
Users are free to input as many IPs as they wish. Can someone please suggest what needs to be done to make the code more generic and robust so that it works for any number of input IPs? The value of app_number also increases with each new IP added.
There will always be two authentication servers and they are assigned round robin, e.g. the third app IP will be associated with "as_1" again.
You just need to iterate over your IP list in b. Be aware that your current code only works for the last line of your "inputfile"; as long as there is only one line, that's OK.
with open("inputfile") as f:
for line in f:
a= line.split("=")
b = a[1].split()
app_count = 1
for ip in b:
createconfig("configfile%s" % app_count , ip, app_count)
app_count += 1
Edit: Solution updated regarding your code change.
with open("inputfile") as f:
for line in f:
if EnterIP_list in line:
ips = line.split("=")[1].split()
if authentiation_server in line:
auth_servers = line.split("=")[1].split()
app_count = 1
for ip, auth_server in zip(ips, auth_servers):
createconfig("configfile%s" % app_count , ip, app_count, auth_server)
app_count += 1
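One caveat: zip() stops at the shorter sequence, so with only two authentication servers this produces at most two config files. To honour the round-robin rule for any number of IPs, a sketch of a variation that cycles through the auth servers (reusing createconfig as defined in the question):
from itertools import cycle

with open("inputfile") as f:
    for line in f:
        if "EnterIP_list" in line:
            ips = line.split("=")[1].split()
        if "authentication_server" in line:
            auth_servers = line.split("=")[1].split()

# pair every IP with an auth server, reusing as_1, as_2, as_1, ... round robin
for app_count, (ip, auth_server) in enumerate(zip(ips, cycle(auth_servers)), start=1):
    createconfig("configfile%s" % app_count, ip, app_count, auth_server)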
A not-so-great way of doing it without modifying too much of your code would be to remove the last two createconfig() calls and instead do it in a loop once you have b, as follows:
with open("inputfile") as f:
for line in f:
a= line.split("=")
b = a[1].split()
for app_number in b:
createconfig("configfile{}".format(app_number), b[app_number], app_number)

Python: Speeding up script with inputs of > billion rows

I've seen a lot of tips going around on how to speed up Python code or make it more efficient. I've tried some things on the code below, like changing global variables to local variables whenever possible, using .format to create strings instead of concatenating strings, and trying not to create multiple variables. Still, this script takes 1h25 to run. I have two input files:
1) A BED file: a two-column (tab-delimited) file with a number or code in the first column and numbers in the second column. It has ~2 billion lines, and each combination of the two values is unique (it has all the positions in a genome; the first column is the chromosome, the second is the position):
1 1
1 2
1 3
1 4
...
2) A complex file, where the first ~3000 lines are a header starting with #, followed by entries with, again, a combination of number/code + number in the first two columns. These two columns make the link with the first file (1 1 in file 1 is the same as 1 1 in file 2). This file has ~22 million rows. Here is an example of the first three lines:
1 1 . G . 32.9939 . DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923 GT:PL:GQ 0/1:0:36
1 2 . T . 32.9939 . DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923 GT:PL:GQ ./.:0:36
1 3 . C . 32.9939 . DP=1;MQ0F=0;AF1=0;AC1=0;DP4=1,0,0,0;MQ=60;FQ=-29.9923 GT:PL:GQ 1/1:0:36
Question: I want to filter out rows in the first file if the corresponding rows in the second file have 0/0, 0/1 or 1/1 (the 4th possibility is ./.) in the last column (so I need to parse the last column to reach those three characters).
The added complexity is that file #2 has to be read through a pipe from another program, because it's compressed in a specific way by that program (opening this file takes a long time on its own, but there's nothing I can do about that...).
Call: program view file2.vcf.gz | my_script.py file1.bed
import sys
import re
import time

start_time = time.time()

def make_vcf_dict(vcf):
    mydict = {}
    for line in (line for line in vcf if not line.startswith("#")):
        line = line.strip().split()
        genotype = line[-1].split(':')[0]
        motif = re.compile('\./\.')
        if motif.match(genotype) is None:
            mydict.setdefault('{}:{}'.format(line[0], line[1]), '{}:{}'.format(line[0], line[1]))
    return mydict

def create_output_bed(bed, data):
    print "creating output"
    for line in (line for line in data if line.startswith('#CHROM')):
        output_name = '{}_mask_positions.bed'.format(line.strip().split()[-1])
        print output_name
        output = open(output_name, 'w')
    print "making dictionary"
    for line in bed:
        line = line.strip().split()
        # creating the same entry as in dict:
        region = '{}:{}'.format(line[0], line[1])
        if region not in mydict:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"
data = sys.stdin.readlines()  # .readlines here is key!!
mydict = make_vcf_dict(data)
# read the bed file:
print "Writing output"
create_output_bed(open(sys.argv[1], 'r'), data)
print("--- %s seconds ---" % (time.time() - start_time))
I was wondering if there is a more efficient way to deal with this entirely? Not making a dictionary, or splitting my file? I have a 32-core server to deal with this and little experience with scripting...
Thank you!
If the second file has only a few million rows (not billions like the first), then I expect the data to fit in memory.
I have a 32 core server to deal with this
Parallelizing it won't help you much because the main bottleneck is the disk, not the CPU, unless the data were distributed among many files on different disks.
However, you do have some improvements you can make:
Move the regex compilation outside the loop (motif=re.compile('\./\.')).
Use set instead of dict.
Avoid the format, just use a tuple.
Don't read all the lines beforehand.
Avoid going over stdin twice.
Avoid doing anything twice.
import sys
import re
import time

start_time = time.time()

def make_vcf(vcf_input):
    output = set()
    motif = re.compile('\./\.')
    for line in vcf_input:
        line = line.strip().split()
        if line[0].startswith('#CHROM'):
            output_name = '{}_mask_positions.bed'.format(line[-1])
            continue
        elif line[0].startswith("#"):
            continue
        genotype = line[-1].split(':')[0]
        if motif.match(genotype) is None:
            output.add((line[0], line[1]))
    return output_name, output

def create_output_bed(output_name, vcf, bed):
    print "creating output:", output_name
    output = open(output_name, 'w')
    print "making dictionary"
    for line in bed:
        line = line.strip().split()
        # creating the same entry as in dict:
        region = line[0], line[1]
        if region not in vcf:
            output.write('{}\t{}\n'.format(line[0], line[1]))
    output.close()
    bed.close()
    return

print "reading data"
output_name, vcf = make_vcf(sys.stdin)
# read the bed file:
print "Writing output"
create_output_bed(output_name, vcf, open(sys.argv[1], 'r'))
print("--- %s seconds ---" % (time.time() - start_time))

Search and sort data from several files

I have a set of 1000 text files with names in_s1.txt, in_s2.txt, and so on. Each file contains millions of rows and each row has 7 columns, like:
ccc245 1 4 5 5 3 -12.3
For me the most important values are those in the first and seventh columns; the pairs like ccc245, -12.3.
What I need to do is find, across all the in_sXXXX.txt files, the 10 cases with the lowest seventh-column values, and I also need to know in which file each value is located. I need something like:
FILE           1st_col   7th_col
in_s540.txt    ccc3456   -9000.5
in_s520.txt    ccc488    -723.4
in_s12.txt     ccc34     -123.5
in_s344.txt    ccc56     -45.6
I was thinking about using Python and bash for this purpose, but so far I haven't found a practical approach. All I know how to do is:
concatenate all in_ files in IN.TXT
search the lowest values there using: for i in IN.TXT ; do sort -k6n $i | head -n 10; done
given the 1st_col and 7th_col values of the top ten list, use them to filter the in_s files, using grep -n VALUE in_s*, so I get for each value the name of the file
It works, but it is a bit tedious. I wonder about a faster approach using only bash or Python or both, or another language better suited for this.
Thanks
In Python, use the nsmallest function in the heapq module -- it's designed for exactly this kind of task.
Example (tested) for Python 2.5 and 2.6:
import heapq, glob

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield fname, items[0], float(items[6])
        f.close()

result = heapq.nsmallest(10, my_iterable(), lambda x: x[2])
print result
Update after above answer accepted
Looking at the source code for Python 2.6, it appears that there's a possibility that it does list(iterable) and works on that ... if so, that's not going to work with a thousand files each with millions of lines. If the first answer gives you MemoryError etc, here's an alternative which limits the size of the list to n (n == 10 in your case).
Note: 2.6 only; if you need it for 2.5 use a conditional heapreplace() as explained in the docs. Uses heappush() and heappushpop() which don't have the key arg :-( so we have to fake it.
import glob
from heapq import heappush, heappushpop
from pprint import pprint as pp

def my_iterable():
    for fname in glob.glob("in_s*.txt"):
        f = open(fname, "r")
        for line in f:
            items = line.split()
            yield -float(items[6]), fname, items[0]
        f.close()

def homegrown_nlargest(n, iterable):
    """Ensures heap never has more than n entries"""
    heap = []
    for item in iterable:
        if len(heap) < n:
            heappush(heap, item)
        else:
            heappushpop(heap, item)
    return heap

result = homegrown_nlargest(10, my_iterable())
result = sorted(result, reverse=True)
result = [(fname, fld0, -negfld6) for negfld6, fname, fld0 in result]
pp(result)
I would:
take the first 10 items,
sort them, and then
for every line read from the files, insert the element into that top 10
in case its value is lower than the highest one in the current top 10
(keeping it sorted for performance).
I wouldn't post the complete program here as it looks like homework.
Yes, if it weren't ten, this would not be optimal.
Try something like this in python:
import os

min_values = []

def add_to_min(file_name, one, seven):
    # checks to see if the 7th column is lower than the existing values
    if len(min_values) < 10:
        min_values.append((seven, file_name, one))
    elif seven < max(min_values)[0]:
        # let's remove the biggest value
        min_values.sort()
        min_values.pop()
        # and add the new value tuple
        min_values.append((seven, file_name, one))

# loop through all the files
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f.readlines():
        columns = line.split()
        add_to_min(file_name, columns[0], float(columns[6]))

# print answers
for (seven, file_name, one) in min_values:
    print file_name, one, seven
Haven't tested it, but it should get you started.
Version 2, just runs the sort a single time (after a prod by S. Lott):
values = []

# loop through all the files and make a long list of all the rows
for file_name in os.listdir(<dir>):
    f = open(file_name)
    for line in f.readlines():
        columns = line.split()
        values.append((float(columns[6]), file_name, columns[0]))

# sort values, print the 10 smallest
values.sort()
for (seven, file_name, one) in values[:10]:
    print file_name, one, seven
Just re-read your question: with millions of rows, you might run out of RAM...
A small improvement to your shell solution:
$ cat in.txt
in_s1.txt
in_s2.txt
...
$ cat in.txt | while read i
do
cat $i | sed -e "s/^/$i /" # add filename as first column
done |
sort -n -k8 | head -10 | cut -d" " -f1,2,8
This might be close to what you're looking for:
for file in *; do sort -k7n "$file" | head -n 10 | cut -f1,7 -d " " | sed "s/^/$file /" > "${file}.out"; done
cat *.out | sort -k3n | head -n 10 > final_result.out
If your files are millions of lines, you might want to consider using "buffering". The script below goes through those millions of lines, each time comparing field 7 with the values in the buffer. If a value is smaller than one in the buffer, that buffer entry is replaced by the new lower value.
for file in in_*.txt
do
awk -vt=$t 'NR<=10{
c=c+1
val[c]=$7
tag[c]=$1
}
NR>10{
for(o=1;o<=c;o++){
if ( $7 <= val[o] ){
val[o]=$7
tag[o]=$1
break
}
}
}
END{
for(i=1;i<=c;i++){
print val[i], tag[i] | "sort"
}
}' $file
done
