Extract parts of chunked data in Python

I have data that looks like this:
INFO : Reading PDB list file 'model3.list'
INFO : Successfully read 10 / 10 PDBs from list file 'model3.list'
INFO : Successfully read 10 Chain structures
INFO : Processed 40 of 45 MAXSUBs
INFO : CPU time = 0.02 seconds
INFO : ======================================
INFO : 3D-Jury (Threshold: > 10 pairs # > 0.200)
INFO : ======================================
INFO : Rank Model Pairs File
INFO : 1 : 1 151 pdbs2/model.165.pdb
INFO : 2 : 7 145 pdbs2/model.150.pdb
INFO : 3 : 6 144 pdbs2/model.144.pdb
INFO : 4 : 9 142 pdbs2/model.125.pdb
INFO : 5 : 4 137 pdbs2/model.179.pdb
INFO : 6 : 8 137 pdbs2/model.191.pdb
INFO : 7 : 10 137 pdbs2/model.147.pdb
INFO : 8 : 3 135 pdbs2/model.119.pdb
INFO : 9 : 5 131 pdbs2/model.118.pdb
INFO : 10 : 2 129 pdbs2/model.128.pdb
INFO : ======================================
INFO : Pairwise single linkage clustering
INFO : ======================================
INFO : Hierarchical Tree
INFO : ======================================
INFO : Node Item 1 Item 2 Distance
INFO : 0 : 6 1 0.476 pdbs2/model.144.pdb pdbs2/model.165.pdb
INFO : -1 : 7 4 0.484 pdbs2/model.150.pdb pdbs2/model.179.pdb
INFO : -2 : 9 2 0.576 pdbs2/model.125.pdb pdbs2/model.128.pdb
INFO : -3 : -2 0 0.598
INFO : -4 : 10 -3 0.615 pdbs2/model.147.pdb
INFO : -5 : -1 -4 0.618
INFO : -6 : 8 -5 0.620 pdbs2/model.191.pdb
INFO : -7 : 3 -6 0.626 pdbs2/model.119.pdb
INFO : -8 : 5 -7 0.629 pdbs2/model.118.pdb
INFO : ======================================
INFO : 1 Clusters # Threshold 0.800 (0.8)
INFO : ======================================
INFO : Item Cluster
INFO : 1 : 1 pdbs2/model.165.pdb
INFO : 2 : 1 pdbs2/model.128.pdb
INFO : 3 : 1 pdbs2/model.119.pdb
INFO : 4 : 1 pdbs2/model.179.pdb
INFO : 5 : 1 pdbs2/model.118.pdb
INFO : 6 : 1 pdbs2/model.144.pdb
INFO : 7 : 1 pdbs2/model.150.pdb
INFO : 8 : 2 pdbs2/model.191.pdb
INFO : 9 : 2 pdbs2/model.125.pdb
INFO : 10 : 2 pdbs2/model.147.pdb
INFO : ======================================
INFO : Centroids
INFO : ======================================
INFO : Cluster Centroid Size Spread
INFO : 1 : 1 10 0.566 pdbs2/model.165.pdb
INFO : 2 : 10 3 0.777 pdbs2/model.147.pdb
INFO : ======================================
This is one chunk of a file that contains many more like it. Each chunk begins with the line
INFO : Reading PDB list file 'model3.list'
What I want to do is extract this part of each chunk:
INFO : ======================================
INFO : Cluster Centroid Size Spread
INFO : 1 : 1 10 0.566 pdbs2/model.165.pdb
INFO : 2 : 10 3 0.777 pdbs2/model.147.pdb
INFO : ======================================
At the end of the day I want a dictionary that looks like this:
{1:"10 pdbs2/model.165.pdb",
2:"3 pdbs2/model.147.pdb"}
Namely, with the cluster number as the key and the cluster size plus the model file name as the value.
What's the way to achieve that in Python?
I'm stuck with this code:
import csv
import json
import os
import argparse
import re

def main():
    """docstring for main"""
    file = "max_cluster_output.txt"
    with open(file, 'r') as tsvfile:
        tabreader = csv.reader(tsvfile, delimiter=' ')
        for line in tabreader:
            linelen = len(line)
            if "Centroids" in line:
                print(line)
            #if linelen >= 32 and linelen <= 34:
            #    print(linelen, line)

if __name__ == '__main__':
    main()

I would do this using regexes. I would have an outer loop that:
reads lines until it finds "INFO : Reading PDB list file"
reads lines until it finds "INFO : Cluster Centroid Size Spread"
and an inner loop that:
creates dictionary entries from each subsequent line, until the line no longer matches
INFO : <number> : <number> <number> <number> <string>
It would look something like this (not tested):
import re

FILENAME = "foo.txt"
info = {}

with open(FILENAME) as f:
    for line in f:
        # outer loop: scan ahead to the start of the next chunk
        if not re.match(r"^INFO\s+:\s+Reading PDB list file", line):
            continue
        # scan ahead to the centroid table header
        for line in f:
            if re.match(r"^INFO\s+:\s+Cluster\s+Centroid\s+Size\s+Spread", line):
                break
        # We're up to the data:
        # INFO : Cluster-number : Centroid-number Size-number Spread-number File-string
        for line in f:
            match = re.match(r"^INFO\s+:\s+(?P<Cluster>\d+)\s+:\s+\d+\s+(?P<Size>\d+).*\s(?P<FileName>\S+)$", line)
            if match:
                info[match.group("Cluster")] = "%s %s" % (match.group("Size"), match.group("FileName"))
            else:
                break

print("done")
print(info)
This code is here just to show the types of things to use (looping, file iterators, breaking, regexes)... it's by no means necessarily the most elegant way (and it is not tested).
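One small tweak of mine if you want integer keys exactly as in your example output (this assumes the cluster numbers are always plain integers):
# convert the captured cluster number so the dict looks like
# {1: "10 pdbs2/model.165.pdb", 2: "3 pdbs2/model.147.pdb"}
info[int(match.group("Cluster"))] = "%s %s" % (match.group("Size"), match.group("FileName"))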

Related

How to parse Log file to object list

I'm working with ROS log data.
Multiple objects are saved in Log file like this:
header:
  seq: 2
  stamp:
    secs: 1596526199
    nsecs: 140017032
  frame_id: ''
level: 2
name: "/replicator_node"
msg: "Replicator node dumping to /tmp/replicator_dumps"
file: "replicator_node.py"
function: "__init__"
line: 218
topics: [/move_mongodb_entries/status, /move_mongodb_entries/goal, /move_mongodb_entries/result,
  /move_mongodb_entries/cancel, /rosout, /move_mongodb_entries/feedback]
header:
  seq: 2
  stamp:
    secs: 1596526198
    nsecs: 848793029
  frame_id: ''
level: 2
name: "/mongo_server"
msg: "2020-08-04T09:29:58.848+0200 [initandlisten] connection accepted from 127.0.0.1:58672\
  \ #1 (1 connection now open)"
file: "mongodb_server.py"
function: "_mongo_loop"
line: 139
topics: [/rosout]
As you can see, not every value is on the same line as its name.
I want to parse it into a list of objects, so that I could access it like this:
object[1].msg would give me:
"2020-08-04T09:29:58.848+0200 [initandlisten] connection accepted from 127.0.0.1:58672 #1 (1 connection now open)"
Also, sometimes the file name is something like \home\nfoo\foo.py, and the \n ends up as a line break in the log file:
file: "\home
foo\foo.py"
It's an interesting exercise... Assuming that the structure is really consistent for all log entries, you can try something like this - pretty convoluted, but it works for the example in the question:
ros = """[your log above]"""
def manage_lists_2(log_ind, list_1, list_2, mystr):
if log_ind == 0:
list_1.append(mystr.split(':')[0].strip())
list_2[-log_ind].append(mystr.split(':')[1].strip())
m_keys2 = []
m_key_vals2 = [[],[]]
header_keys2 = []
header_key_vals2 = [[],[]]
stamp_keys2 = []
stamp_key_vals2 = [[],[]]
for log in logs:
for l in log.splitlines():
if l[0]!=" ":
items = [m_keys2, m_key_vals2]
elif l[0:3] != " ":
items = [header_keys2, header_key_vals2]
else:
items = [stamp_keys2, stamp_key_vals2]
manage_lists_2(logs.index(log), items[0], items[1], l)
for val in m_key_vals2:
for a, b, in zip(m_keys2,val):
print(a, ": ",b)
if a == "header":
for header_key in header_keys2:
print('\t',header_key,':',header_key_vals2[m_keys2.index(a)][header_keys2.index(header_key)])
if header_key == "stamp":
for stamp_key in stamp_keys2:
print('\t\t',stamp_key,':',stamp_key_vals2[m_keys2.index(a)][stamp_keys2.index(stamp_key)])
print('---')
Output:
header :
    seq : 2
    stamp :
        secs : 1596526199
        nsecs : 140017032
    frame_id : 'one id'
level : 2
name : "/replicator_node"
msg : "Replicator node dumping to /tmp/replicator_dumps"
file : "replicator_node.py"
function : "__init__"
line : 218
topics : [/move_mongodb_entries/status, /move_mongodb_entries/goal, /move_mongodb_entries/result, /move_mongodb_entries/cancel, /rosout, /move_mongodb_entries/feedback]
---
header :
    seq : 2
    stamp :
        secs : 1596526199
        nsecs : 140017032
    frame_id : 'one id'
level : 3
name : "/mongo_server"
msg : "2020-08-04T09
file : "mongodb_server.py"
function : "_mongo_loop"
line : 139
topics : [/rosout]
Having gone through that, I would recommend that - if you are going to do this on a regular basis - you find a way to store the data in xml format; it's a natural fit for it.
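Alternatively (my addition, not part of the answer above): each record is close enough to YAML that PyYAML can parse it directly. A minimal sketch, assuming pyyaml is installed, every record starts with a top-level header: line, and values contain only valid YAML escapes (the \home case from the question would need pre-cleaning first):
import yaml  # pip install pyyaml

class LogEntry:
    """Wrap a dict so fields read as attributes: entry.msg, entry.level, ..."""
    def __init__(self, mapping):
        self.__dict__.update(mapping)

def parse_ros_log(text):
    # one YAML document per record; each record begins at a "header:" line
    chunks = ("header:" + c for c in text.split("header:") if c.strip())
    return [LogEntry(yaml.safe_load(c)) for c in chunks]

objects = parse_ros_log(ros)
print(objects[1].msg)  # YAML string folding rejoins the wrapped msg line automatically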

Selenium return empty images after few images

I am working with Selenium and I want to get the product images. The problem is that Selenium works for the first 21 images; after that, it returns empty placeholder URLs like below.
1 : https://photo.venus.com/im/19230307.jpg?preset=dept
2 : https://photo.venus.com/im/18097354.jpg?preset=dept
3 : https://photo.venus.com/im/19230311.jpg?preset=dept
4 : https://photo.venus.com/im/19234200.jpg?preset=dept
5 : https://photo.venus.com/im/17307902.jpg?preset=dept
6 : https://photo.venus.com/im/19305650.jpg?preset=dept
7 : https://photo.venus.com/im/19060456.jpg?preset=dept
8 : https://photo.venus.com/im/18295767.jpg?preset=dept
9 : https://photo.venus.com/im/19102600.jpg?preset=dept
10 : https://photo.venus.com/im/19230297.jpg?preset=dept
11 : https://photo.venus.com/im/16181113.jpg?preset=dept
12 : https://photo.venus.com/im/19101047.jpg?preset=dept
13 : https://photo.venus.com/im/19150290.jpg?preset=dept
14 : https://photo.venus.com/im/19042244.jpg?preset=dept
15 : https://photo.venus.com/im/19230329.jpg?preset=dept
16 : https://photo.venus.com/im/19101040.jpg?preset=dept
17 : https://photo.venus.com/im/17000870.jpg?preset=dept
18 : https://photo.venus.com/im/19100952.jpg?preset=dept
19 : https://photo.venus.com/im/19183658.jpg?preset=dept
20 : https://photo.venus.com/im/19102243.jpg?preset=dept
21 : https://photo.venus.com/im/18176590.jpg?preset=dept
22 : data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC
23 : data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC
24 : data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC
25 : data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsQAAA7EAZUrDhsAAAANSURBVBhXYzh8+PB/AAffA0nNPuCLAAAAAElFTkSuQmCC
26 : ...
I even used time.sleep, but it has not worked. Any ideas would be appreciated.
Here is also my code:
import os
import time
import urllib.request
from selenium import webdriver

driver = webdriver.Chrome()  # driver setup assumed; not shown in the original question
directory = 'images'         # output folder assumed; not shown in the original question
image_id = 1

url = 'https://www.venus.com/products.aspx?BRANCH=7~63~'
driver.get(url)
product_container_ls = driver.find_elements_by_class_name('product-container')
for prd in product_container_ls:
    # Finding elements of images by class name
    image_lm = prd.find_element_by_class_name('main')
    # The url to image
    image_url = image_lm.get_attribute('src')
    print(image_id, ': ', image_url)
    # Image path (the original used an undefined image_name; the running id is assumed here)
    image_path = os.path.join(directory, f'{image_id}.jpg')
    # Getting and saving the image
    urllib.request.urlretrieve(image_url, image_path)
    image_id += 1
    time.sleep(3)
driver.quit()
Thanks!
Look for the data-original attribute rather than src, since that is how they lazy-load the images. I modified the following variable and got all the images:
image_url = image_lm.get_attribute('data-original')
Here is a sample of my print out for that variable:
https://photo.venus.com/im/18235739.jpg?preset=dept
https://photo.venus.com/im/19034244.jpg?preset=dept
https://photo.venus.com/im/17199949.jpg?preset=dept
https://photo.venus.com/im/19121197.jpg?preset=dept
https://photo.venus.com/im/18235918.jpg?preset=dept
https://photo.venus.com/im/18366410.jpg?preset=dept
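Putting it together with the question's loop (a sketch; the fallback to src is my addition, for images that have already loaded eagerly):
for prd in product_container_ls:
    image_lm = prd.find_element_by_class_name('main')
    # prefer the lazy-load attribute, fall back to src when it is absent
    image_url = image_lm.get_attribute('data-original') or image_lm.get_attribute('src')
    print(image_id, ': ', image_url)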

Empty json object when read text file

I'm testing a Python script on text data. The script runs and returns a valid JSON file when the text is included in the script itself, but I get an empty JSON object when I run the script against a separate text file.
The output is only an empty JSON file:
{
"ospf": []
}
The code below returns an empty JSON object when it reads the text file:
import json

result = {}
l = []
with open('data.txt') as myf:
    for i in myf:
        if i:
            p = [parameter for parameter in i.split("*")]
            for line, x in enumerate(p[0].split("\n")):
                if x and "Ls id" in x:
                    ls_id, ip = x.split(": ")
                    ls_id = ls_id.strip()
                    ip = ip.strip()
                    for y in p[1:]:
                        if y and "P-2-P" in y:
                            temp = {ls_id: ip}
                            for items in y.split("\n"):
                                try:
                                    key, value = items.split(": ")
                                    key = key.strip()
                                    value = value.strip()
                                    temp[key] = value
                                except ValueError:
                                    pass
                            l.append(temp)
result["ospf"] = l
print(json.dumps(result, indent=2))
with open('data.json', 'w') as json_file:
    json.dump(result, json_file)
When executed with the text included in the script as data, the code below works fine, no problem:
import json
data = '''
Type : Router
Ls id : 1.1.1.2
Adv rtr : 1.1.1.2
Ls age : 201
Len : 84
Link count: 5
* Link ID: 1.1.1.2
Data : 255.255.255.255
Link Type: StubNet
Metric : 1
Priority : Medium
* Link ID: 1.1.1.4
Data : 192.168.100.34
Link Type: P-2-P
Metric : 1
* Link ID: 192.168.100.33
Data : 255.255.255.255
Link Type: StubNet
Metric : 1
Priority : Medium
* Link ID: 1.1.1.1
Data : 192.168.100.53
Link Type: P-2-P
Metric : 1
* Link ID: 192.168.100.54
Data : 255.255.255.255
Link Type: StubNet
Metric : 1
Priority : Medium

Type : Router
Ls id : 1.1.1.1
Adv rtr : 1.1.1.1
Ls age : 1699
Len : 96
Options : ASBR E
seq# : 80008d72
chksum : 0x16fc
Link count: 6
* Link ID: 1.1.1.1
Data : 255.255.255.255
Link Type: StubNet
Metric : 1
Priority : Medium
* Link ID: 1.1.1.1
Data : 255.255.255.255
Link Type: StubNet
Metric : 12
Priority : Medium
* Link ID: 1.1.1.3
Data : 192.168.100.26
Link Type: P-2-P
Metric : 10
* Link ID: 192.168.100.25
Data : 255.255.255.255
Link Type: StubNet
Metric : 10
Priority : Medium
* Link ID: 1.1.1.2
Data : 192.168.100.54
Link Type: P-2-P
Metric : 10
* Link ID: 192.168.100.53
Data : 255.255.255.255
Link Type: StubNet
Metric : 10
Priority : Medium'''
result = {}
l = []
for i in data.split("\n\n"):
    if i:
        p = [parameter for parameter in i.split("*")]
        for line, x in enumerate(p[0].split("\n")):
            if x and "Ls id" in x:
                ls_id, ip = x.split(": ")
                ls_id = ls_id.strip()
                ip = ip.strip()
                for y in p[1:]:
                    if y and "P-2-P" in y:
                        temp = {ls_id: ip}
                        for items in y.split("\n"):
                            try:
                                key, value = items.split(": ")
                                key = key.strip()
                                value = value.strip()
                                temp[key] = value
                            except ValueError:
                                pass
                        l.append(temp)
result["ospf"] = l
print(json.dumps(result, indent=2))
with open('data.json', 'w') as json_file:
    json.dump(result, json_file)
I'm not sure where I went wrong. Please advise. Thank you.
A simple workaround would be to concatenate the file into one large string; then your code works as expected. (The file-reading version fails because for i in myf iterates line by line, so each i is a single line: the "Ls id" line and its "P-2-P" links never appear in the same chunk, and p[1:] is always empty.) This is definitely not a clean answer, but it lets you leave the rest of your code unchanged.
import json

result = {}
l = []
with open('data.txt') as myf:
    a = ''.join(myf)
for i in a.split("\n\n"):
    if i:
        p = [parameter for parameter in i.split("*")]
        for line, x in enumerate(p[0].split("\n")):
            if x and "Ls id" in x:
                ls_id, ip = x.split(": ")
                ls_id = ls_id.strip()
                ip = ip.strip()
                for y in p[1:]:
                    if y and "P-2-P" in y:
                        temp = {ls_id: ip}
                        for items in y.split("\n"):
                            try:
                                key, value = items.split(": ")
                                key = key.strip()
                                value = value.strip()
                                temp[key] = value
                            except ValueError:
                                pass
                        l.append(temp)
result["ospf"] = l
print(json.dumps(result, indent=2))
with open('data.json', 'w') as json_file:
    json.dump(result, json_file)
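An equivalent and slightly more direct way to build the one large string (same behavior as ''.join(myf)):
with open('data.txt') as myf:
    a = myf.read()  # read the whole file at once, blank record separators included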

Using text file data, classification and make other text file in python

Using Python, I want to separate out some data from a file.
The file is a plain-text file with no tabs, only a single space between fields.
Here is an example file:
//test.txt
Class name age room fund.
13 A 25 B101 300
12 B 21 B102 200
9 C 22 B103 200
13 D 25 B102 100
20 E 23 B105 100
13 F 25 B103 300
11 G 25 B104 100
13 H 22 B101 300
I want to take only the lines containing specific data,
class: 13, fund: 300
and save them to another text file.
If the code works, the resulting text file would be:
//new_test.txt
Class name age room fund.
13 A 25 B101 300
13 F 25 B103 300
13 H 22 B101 300
Thanks.
Hk
This should do.
with open('new_test.txt', 'w') as new_file:
    with open('test.txt') as file:
        print(file.readline(), end='', file=new_file)
        for line in file:
            arr = line.strip().split()
            if arr[0] == '13' and arr[-1] == '300':
                print(line, end='', file=new_file)
However, you should include your code when asking a question. It ensures that the purpose of this site is served.
If you want to filter your data:
def filter_data(src_file, dest_file, filters):
    data = []
    with open(src_file) as read_file:
        header = [h.lower().strip('.') for h in read_file.readline().split()]
        for line in read_file:
            values = line.split()
            row = dict(zip(header, values))
            data.append(row)
            for k, v in filters.items():
                if data and row.get(k, None) != v:
                    data.pop()
                    break
    with open(dest_file, 'w') as write_file:
        write_file.write(' '.join(header) + '\n')
        for row in data:
            write_file.write(' '.join(row.values()) + '\n')

my_filters = {
    "class": "13",
    "fund": "300"
}
filter_data(src_file='test.txt', dest_file='new_test.txt', filters=my_filters)
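For a filter like this, the standard csv module also fits; a minimal sketch of mine under the same assumptions (space-delimited file, class in the first column, fund in the last):
import csv

with open('test.txt') as src, open('new_test.txt', 'w', newline='') as dst:
    reader = csv.reader(src, delimiter=' ')
    writer = csv.writer(dst, delimiter=' ')
    writer.writerow(next(reader))  # copy the header line through unchanged
    for row in reader:
        if row and row[0] == '13' and row[-1] == '300':
            writer.writerow(row)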

python return list from linux command output

I am new to Python and I'm learning rapidly, but this is beyond my current level of understanding. I'm trying to pull the output of the Linux command apcaccess into a list in Python.
apcaccess is a Linux command that reports the status of an APC UPS. The output is this:
$ apcaccess
APC : 001,035,0933
DATE : 2014-11-12 13:38:27 -0500
HOSTNAME : doormon
VERSION : 3.14.10 (13 September 2011) debian
UPSNAME : UPS
CABLE : USB Cable
DRIVER : USB UPS Driver
UPSMODE : Stand Alone
STARTTIME: 2014-11-12 12:28:00 -0500
MODEL : Back-UPS ES 550G
STATUS : ONLINE
LINEV : 118.0 Volts
LOADPCT : 15.0 Percent Load Capacity
BCHARGE : 100.0 Percent
TIMELEFT : 46.0 Minutes
MBATTCHG : 5 Percent
MINTIMEL : 3 Minutes
MAXTIME : 0 Seconds
SENSE : Medium
LOTRANS : 092.0 Volts
HITRANS : 139.0 Volts
ALARMDEL : 30 seconds
BATTV : 13.6 Volts
LASTXFER : No transfers since turnon
NUMXFERS : 2
XONBATT : 2014-11-12 12:33:35 -0500
TONBATT : 0 seconds
CUMONBATT: 53 seconds
XOFFBATT : 2014-11-12 12:33:43 -0500
STATFLAG : 0x07000008 Status Flag
SERIALNO : 4B1335P17084
BATTDATE : 2013-08-28
NOMINV : 120 Volts
NOMBATTV : 12.0 Volts
FIRMWARE : 904.W1 .D USB FW:W1
END APC : 2014-11-12 13:38:53 -0500
I've tried different iterations of Popen such as:
def check_apc_ups():
    output = subprocess.Popen("apcaccess", stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
    x1, x2, x3, x4, x5 = output  # this fails: a Popen object cannot be unpacked into lines
I would like to be able to pull each line into a list or tuple containing all of the fields, and then only display/print what I need, such as TIMELEFT and BCHARGE.
Any help would be greatly appreciated.
There are already answers showing how to get the output of the command into Python.
It is not clear what you are going to do with the output. Maybe a dictionary (dict) is better for you than a list:
# stolen from Hackaholic's answer
import subprocess

child = subprocess.Popen('apcaccess', stdout=subprocess.PIPE,
                         universal_newlines=True)  # text mode, so msg is str not bytes
msg, err = child.communicate()

# now create the dict:
myDict = {}
# for i in msg.split("\n"):  # loop over lines
for i in msg.splitlines():    # EDIT: See comments
    splitted = i.split(":", 1)  # list like ["HOSTNAME ", " doormon"]; maxsplit=1
                                # keeps colons inside values (dates, times) intact
    # remove leading & trailing spaces, add to dict
    myDict[splitted[0].strip()] = splitted[1].strip()

# Now, you can easily access the items:
print(myDict["SERIALNO"])
print(myDict["STATUS"])
print(myDict["BATTV"])

for k in myDict.keys():
    print(k + " = " + myDict[k])
from subprocess import check_output

out = check_output(["apcaccess"], universal_newlines=True)  # text mode, so lines are str
spl = [ele.split(":", 1) for ele in out.splitlines()]
d = {k.rstrip(): v.lstrip() for k, v in spl}
print(d['BCHARGE'])
print(d["TIMELEFT"])
100.0 Percent
46.0 Minutes
Wrapped in a function:
from subprocess import check_output

def get_apa():
    out = check_output(["apcaccess"], universal_newlines=True)
    spl = [ele.split(":", 1) for ele in out.splitlines()]
    d = {k.rstrip(): v.lstrip() for k, v in spl}
    return d

output = get_apa()
print(output['BCHARGE'])
100.0 Percent
To print all key/value pairings:
for k, v in get_apa().items():
    print("{} = {}".format(k, v))
What you need is the subprocess module:
import subprocess

child = subprocess.Popen('apcaccess', stdout=subprocess.PIPE)
msg, err = child.communicate()
print(msg.split())
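On Python 3.7+, subprocess.run is the usual entry point; a minimal sketch of the same dict-building idea (assuming apcaccess is on the PATH):
import subprocess

result = subprocess.run(["apcaccess"], capture_output=True, text=True)
status = {k.strip(): v.strip()
          for k, v in (line.split(":", 1) for line in result.stdout.splitlines())}
print(status.get("TIMELEFT"))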
