Export TensorBoard (with PyTorch) data into CSV with Python

I have TensorBoard data and want to download all of the CSV files behind it, but I could not find anything in the official documentation. From Stack Overflow, I found only this question, which is 7 years old and about TensorFlow, while I am using PyTorch.
We can do this manually: as the screenshot shows, there is an option in the UI. I wonder whether we can do the same via code, or is that not possible? I have a lot of data to process.

With the help of this script, below is the shortest working code. It gets all of the data into a dataframe, and then you can play with it further:
import traceback
import pandas as pd
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

# Extraction function
def tflog2pandas(path):
    runlog_data = pd.DataFrame({"metric": [], "value": [], "step": []})
    try:
        event_acc = EventAccumulator(path)
        event_acc.Reload()
        tags = event_acc.Tags()["scalars"]
        for tag in tags:
            event_list = event_acc.Scalars(tag)
            values = list(map(lambda x: x.value, event_list))
            step = list(map(lambda x: x.step, event_list))
            r = {"metric": [tag] * len(step), "value": values, "step": step}
            r = pd.DataFrame(r)
            runlog_data = pd.concat([runlog_data, r])
    # Dirty catch of DataLossError
    except Exception:
        print("Event file possibly corrupt: {}".format(path))
        traceback.print_exc()
    return runlog_data

path = "Run1"  # folder path
df = tflog2pandas(path)
# df = df[(df.metric != 'params/lr') & (df.metric != 'params/mm') & (df.metric != 'train/loss')]  # delete the mentioned rows
df.to_csv("output.csv")

Related

Python wildcard search

I have a Lambda Python function that I inherited which searches and reports on installed packages on EC2 instances. It pulls this information from SSM Inventory, where the results are output to an S3 bucket. All of the installed packages have had specific names until now; we now need to report on Palo Alto Cortex XDR. The issue I'm facing is that this product includes the version number in its name, and we have different versions installed. If I use the exact name (e.g. Cortex XDR 7.8.1.11343), I get reporting on that particular version but not the others, so I want to use a wildcard. I import regex (import re) on line 7 and then change line 71 to xdr=line['Cortex*']), but that gives me the following error. I'm a bit new to Python and coding, so any explanation as to what I'm doing wrong would be helpful.
File "/var/task/SoeSoftwareCompliance/RequiredSoftwareEmail.py", line 72, in build_html
xdr=line['Cortex*'])
import configparser
import logging
import csv
import json
from jinja2 import Template
import boto3
import re

# config
config = configparser.ConfigParser()
config.read("config.ini")

# logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

# #TODO
# refactor common_csv_header so that we use one with variable
# so that we write all content to one template file.

def build_html(account=None,
               ses_email_address=None,
               recipient_email=None):
    """
    :param recipient_email:
    :param ses_email_address:
    :param account:
    """
    account_id = account["id"]
    account_alias = account["alias"]
    linux_ec2s = []
    windows_ec2s = []
    ec2s_not_in_ssm = []
    excluded_ec2s = []
    # linux ec2s html
    with open(f"/tmp/{account_id}_linux_ec2s_required_software_report.csv", "r") as fp:
        lines = csv.DictReader(fp)
        for line in lines:
            if line["platform-type"] == "Linux":
                item = dict(id=line['instance-id'],
                            name=line['instance-name'],
                            ip=line['ip-address'],
                            ssm=line['amazon-ssm-agent'],
                            cw=line['amazon-cloudwatch-agent'],
                            ch=line['cloudhealth-agent'])
                # skip compliant linux ec2s where all values are found
                compliance_status = not all(item.values())
                if compliance_status:
                    linux_ec2s.append(item)
    # windows ec2s html
    with open(f"/tmp/{account_id}_windows_ec2s_required_software_report.csv", "r") as fp:
        lines = csv.DictReader(fp)
        for line in lines:
            if line["platform-type"] == "Windows":
                item = dict(id=line['instance-id'],
                            name=line['instance-name'],
                            ip=line['ip-address'],
                            ssm=line['Amazon SSM Agent'],
                            cw=line['Amazon CloudWatch Agent'],
                            ch=line['CloudHealth Agent'],
                            mav=line['McAfee VirusScan Enterprise'],
                            trx=line['Trellix Agent'],
                            xdr=line['Cortex*'])
                # skip compliant windows ec2s where all values are found
                compliance_status = not all(item.values())
                if compliance_status:
                    windows_ec2s.append(item)
    # ec2s not found in ssm
    with open(f"/tmp/{account_id}_ec2s_not_in_ssm.csv", "r") as fp:
        lines = csv.DictReader(fp)
        for line in lines:
            item = dict(name=line['instance-name'],
                        id=line['instance-id'],
                        ip=line['ip-address'],
                        pg=line['patch-group'])
            ec2s_not_in_ssm.append(item)
    # display or hide excluded ec2s from report
    display_excluded_ec2s_in_report = json.loads(config.get("settings", "display_excluded_ec2s_in_report"))
    if display_excluded_ec2s_in_report == "true":
        with open(f"/tmp/{account_id}_excluded_from_compliance.csv", "r") as fp:
            lines = csv.DictReader(fp)
            for line in lines:
                item = dict(id=line['instance-id'],
                            name=line['instance-name'],
                            pg=line['patch-group'])
                excluded_ec2s.append(item)
    # pass data to html template
    with open('templates/email.html') as file:
        template = Template(file.read())
    # pass parameters to template renderer
    html = template.render(
        linux_ec2s=linux_ec2s,
        windows_ec2s=windows_ec2s,
        ec2s_not_in_ssm=ec2s_not_in_ssm,
        excluded_ec2s=excluded_ec2s,
        account_id=account_id,
        account_alias=account_alias)
    # consolidated html with multiple tables
    tables_html_code = html
    client = boto3.client('ses')
    client.send_email(
        Destination={
            'ToAddresses': [recipient_email],
        },
        Message={
            'Body': {
                'Html':
                    {'Data': tables_html_code}
            },
            'Subject': {
                'Charset': 'UTF-8',
                'Data': f'SOE | Software Compliance | {account_alias}',
            },
        },
        Source=ses_email_address,
    )
    print(tables_html_code)
If I understand your problem correctly, you are getting a KeyError exception because Python does not support wildcards in dictionary keys. A csv.DictReader creates a standard Python dictionary for each row of the CSV, and Python's dictionary is just an associative array without pattern matching.
You can implement this with a regex, though. If you have a dictionary line and you don't know the full name of the key you are looking for, you can solve it with the re.search function.
line = {'Cortex XDR 7.8.1.11343': 'Some value you are looking for'}
val = next(v for k, v in line.items() if re.search('Cortex.+', k))
print(val) # 'Some value you are looking for'
Be aware that this assumes the line dictionary contains at least one key matching the 'Cortex.+' pattern, and that it returns the first match only; you would have to refactor this a bit to change that behavior.
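For instance, a minimal sketch of such a refactor (my own suggestion, assuming you want an empty string when no key matches instead of a StopIteration error; find_by_pattern is a made-up name):

import re

def find_by_pattern(row, pattern, default=""):
    # Return the value of the first key matching the regex, or the default.
    return next((v for k, v in row.items() if re.search(pattern, k)), default)

line = {'Cortex XDR 7.8.1.11343': 'Some value you are looking for'}
print(find_by_pattern(line, r'Cortex.+'))            # 'Some value you are looking for'
print(find_by_pattern({'other': 'x'}, r'Cortex.+'))  # '' (no match)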
1. import os is missing from the code.
2. def build_html(account=None, ...): when account is passed as None, account["id"] and account["alias"] will throw the error below.
Example:
Traceback (most recent call last):
  File "C:\Users\pro\Documents\project\pywilds.py", line 134, in <module>
    build_html(account=None)
  File "C:\Users\pro\Documents\project\pywilds.py", line 33, in build_html
    account_id = account["id"]
TypeError: 'NoneType' object is not subscriptable
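A minimal guard for that (my suggestion, not in the original code) would be to validate the argument at the top of build_html:

def build_html(account=None, ses_email_address=None, recipient_email=None):
    # fail fast with a clear message instead of a TypeError further down
    if account is None:
        raise ValueError("account must be a dict with 'id' and 'alias' keys")
    account_id = account["id"]
    account_alias = account["alias"]
    # ... rest of the function unchanged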
I hope this helps.

Converting JSON to CSV using Python. How to remove certain text/characters if found, and how to better format the cell?

I apologise in advance if I have not provided enough information, am using the wrong terminology, or am not formatting my question correctly. This is my first time asking a question here.
This is the Python script: https://pastebin.com/WWViemwf
This is the JSON file (it contains the first 4 elements: hydrogen, helium, lithium, beryllium): https://pastebin.com/fyiijpBG
As you can see, I'm converting the file from ".json" to ".csv".
The JSON file sometimes contains fields that say "NotApplicable" or "Unknown". Or it will show me weird text that I'm not familiar with.
For example here:
"LiquidDensity": {
"data": "NotAvailable",
"tex_description": "\\text{liquid density}"
},
And here:
"MagneticMoment": {
"data": "Unknown",
"tex_description": "\\text{magnetic dipole moment}"
},
Here is the code I've made to convert from ".json" to ".csv":
# liquid density
liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""
However, in the CSV file it shows up like this.
I'm also trying to remove these characters that I'm seeing in the ".csv" file.
In the JSON file, this is how the data is viewed:
"AtomicMass": {
"data": {
"value": "4.002602",
"tex_unit": "\\text{u}"
},
"tex_description": "\\text{atomic mass}"
},
And this is how I coded the conversion, using Python:
# atomic mass
atomic_mass = element_data["AtomicMass"]["data"]
if isinstance(atomic_mass, dict):
    atomic_mass_value = atomic_mass["value"]
    atomic_mass_unit = atomic_mass["tex_unit"]
else:
    atomic_mass_value = atomic_mass
    atomic_mass_unit = ""
What have I done wrong?
I've tried replacing:
# melting point
melting_point = element_data["MeltingPoint"]["data"]
if isinstance(melting_point, dict):
    melting_point_value = melting_point["value"]
    melting_point_unit = melting_point["tex_unit"]
else:
    melting_point_value = melting_point
    melting_point_value = ""
With:
# melting point
melting_point = element_data["MeltingPoint"]["data"]
if isinstance(melting_point, dict):
    melting_point_value = melting_point["value"]
    melting_point_unit = melting_point["tex_unit"]
elif melting_point == "NotApplicable" or melting_point == "Unknown":
    melting_point_value = ""
    melting_point_unit = ""
else:
    melting_point_value = melting_point
    melting_point_unit = ""
However, that doesn't seem to work.
Your code is fine; what went wrong is the writing step. Let me take out the relevant part of it.
# I will only be using Liquid Density as an example, so I won't be showing the others
headers = [..., "Liquid Density", ...]

# liquid_density data reading part
liquid_density = element_data["LiquidDensity"]["data"]
if isinstance(liquid_density, dict):
    liquid_density_value = liquid_density["value"]
    liquid_density_unit = liquid_density["tex_unit"]
else:
    liquid_density_value = liquid_density
    liquid_density_unit = ""

# your writing of the data into the csv
writer.writerow([..., liquid_density, ...])
You write liquid_density directly into your CSV; that is why it shows the dictionary. If you want to write the value only, I believe you should change that write line to
writer.writerow([..., liquid_density_value, ...])
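To also blank out the "NotAvailable"/"NotApplicable"/"Unknown" placeholders the question asks about, a small helper along these lines might work (a sketch; extract_value_unit is a made-up name, and the field layout is taken from the question's JSON):

PLACEHOLDERS = {"NotAvailable", "NotApplicable", "Unknown"}

def extract_value_unit(field):
    # Return (value, unit) from a field's 'data' entry, blanking placeholders.
    data = field["data"]
    if isinstance(data, dict):
        return data["value"], data["tex_unit"]
    if data in PLACEHOLDERS:
        return "", ""
    return data, ""

# usage, e.g. for liquid density:
liquid_density_value, liquid_density_unit = extract_value_unit(element_data["LiquidDensity"])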

Improving the runtime of a pandas loop [duplicate]

This question already has an answer here:
Pandas - Explanation on apply function being slow
(1 answer)
Closed 7 months ago.
I am actively running some Python code in Jupyter on a df consisting of about 84k rows, and I estimate it is going to take somewhere in the neighborhood of 9 hours at this rate. My code is below. I have read that ideally one would vectorize for maximum speed, but being somewhat new to Python and coding in general, I'm not sure how I can go about changing the code below to vectorize it. The goal is to take the value in the first column of the dataframe and add it to the end of a URL, then fetch the first line of the response and compare it to some predetermined values to find out whether there is a match. Any advice would be greatly appreciated!
# Python 3
import pandas as pd
import urllib.request

no_res = "Item Not Found"
error = "Does Not Exist"
for i in df1.index:
    path = 'http://xxxx/xxx/xxx.pl?part=' + str(df1['ITEM_ID'][i])
    parsed_path = path.replace(' ', '%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = lines[0]
    if r == no_res:
        sap = 'NO'
    elif r == error:
        sap = 'ERROR'
    else:
        sap = 'YES'
    df1["item exists"][i] = sap
    df1["Path"][i] = path
    df1["URL return value"][i] = r
Edit: adding test code below.
import concurrent.futures
import pandas as pd
import urllib.request
import numpy as np

def my_func(df_row):
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ', '%20')
    f = urllib.request.urlopen(parsed_path)
    raw = str(f.read().decode("utf-8"))
    lines = raw.split('\n')
    r = df_row['0']
    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"
    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r

n = 1000
my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
my_df['4'] = ""
my_df['5'] = ""
my_df['6'] = ""
my_df = my_df.apply(lambda col: col.astype('category'))
executor = concurrent.futures.ProcessPoolExecutor(8)
futures = [executor.submit(my_func, row) for _, row in my_df.iterrows()]
concurrent.futures.wait(futures)
This is throwing the following error (shortened):
DoneAndNotDoneFutures(done={<Future at 0x1cfe4938040 state=finished raised BrokenProcessPool>, <Future at 0x1cfe48b8040 state=finished raised BrokenProcessPool>,
Since you are doing an outside operation with a URL, I do not think vectorization is a solution (if it is even possible).
The bottleneck of your operation is the following line:
f = urllib.request.urlopen(parsed_path)
This line waits for the response and is blocking; as mentioned, your operation is I/O bound. The CPU can start other jobs while waiting for the response. The way to address this is concurrency.
Edit: my original answer used Python's built-in multithreading, which was problematic. The best way to do multiprocessing/threading with a pandas dataframe is the dask library.
The following code was tested with the dummy data set on my PC; on average it speeds up the naive for loop by roughly 12 times.
#%%
import time
import urllib.request
import pandas as pd
import numpy as np
import dask.dataframe as dd

def my_func(df_row):
    df_row = df_row.copy()
    no_res = "random"
    error = "entered"
    path = "http://www.google.com"
    parsed_path = path.replace(' ', '%20')
    f = urllib.request.urlopen(parsed_path)
    # I had to change the encoding on my machine.
    raw = str(f.read().decode("windows-1252"))
    lines = raw.split('\n')
    r = df_row[0]
    if r == no_res:
        sap = "NO"
    elif r == error:
        sap = "ERROR"
    else:
        sap = "YES"
    df_row['4'] = sap
    df_row['5'] = lines[0]
    df_row['6'] = r
    return df_row
def run():
    print("started.")
    n = 1000
    my_df = pd.DataFrame(np.random.choice(['random','words','entered'], size=(n,3)))
    my_df = my_df.apply(lambda col: col.astype('category'))
    my_df['4'] = ""
    my_df['5'] = ""
    my_df['6'] = ""
    # Literally dask partitions the original dataframe into
    # npartitions chunks and uses them in the apply function
    # in parallel.
    my_ddf = dd.from_pandas(my_df, npartitions=15)
    start = time.time()
    q = my_ddf.apply(my_func, axis=1, meta=my_ddf)
    # num_workers is the number of threads used
    print(q.compute(num_workers=50))
    time_end = time.time()
    print(f"Elapsed: {time_end - start:10.2f}")

if __name__ == "__main__":
    run()
dask provides many other tools and options to facilitate concurrent processing, and it would be a good idea to take a look at its documentation to investigate the other options.
P.S.: if you run the above code too many times against Google, you will receive "HTTP Error 429: Too Many Requests". This exists to prevent something like a DDoS attack on a public server. So if, for your real job, you are querying a public website, you may end up receiving the same 429 response if you attempt 84k queries in a short time.
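As a lighter-weight alternative (my own sketch, not part of the original answer): since the work is I/O bound, the standard library's ThreadPoolExecutor can also parallelize the fetches, provided each task returns its result instead of mutating a shared row. The df1 column names and placeholder URL below are taken from the question; the worker count is a guess to tune:

import concurrent.futures
import urllib.request

def fetch_first_line(item_id):
    # Fetch the page for one item and return (item_id, first line of the body).
    url = 'http://xxxx/xxx/xxx.pl?part=' + str(item_id)  # placeholder URL from the question
    with urllib.request.urlopen(url.replace(' ', '%20')) as f:
        return item_id, f.read().decode("utf-8").split('\n')[0]

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    results = dict(pool.map(fetch_first_line, df1['ITEM_ID']))

# map the fetched first lines back onto the dataframe in one vectorized step
df1["URL return value"] = df1['ITEM_ID'].map(results)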

KeyError for 'snippet' when using YouTube Data API RelatedToVideoID feature

This is my first ever question on Stack Overflow so please do tell me if anything remains unclear. :)
My issue is somewhat related to this thread. I am trying to use the YouTube Data API to sample videos for my thesis. I have done so successfully with the code below; however, when I change the criterion from a query (q) to relatedToVideoId, the unpacking section breaks for some reason.
It works outside of my loop, but not inside it (the same goes for the .get() suggestion from the other thread). Does anyone know why this might be and how I can solve it?
This is the (shortened) code I wrote, which you can use to replicate the issue:
import numpy as np
import pandas as pd

# Allocate credentials:
from googleapiclient.discovery import build
api_key = "YOUR KEY SHOULD GO HERE"

# Session Build
youtube = build('youtube', 'v3', developerKey=api_key)

df_sample_v2 = pd.DataFrame(columns=["Video.ID", "Title", "Channel Name"])
keywords = ['Global Warming',
            'Coronavirus'
            ]
iter = list(range(1, 150))
rand_selec_ids = ['H6u0VBqNBQ8',
                  'LEZCxxKp0hM'
                  ]
for i in iter:
    # Search Request
    request = youtube.search().list(
        part="snippet",
        #q = keywords[4],
        relatedToVideoId=rand_selec_ids[1],
        type="video",
        maxResults=5000,
        videoCategoryId=28,
        order="relevance",
        eventType="completed",
        videoDuration="medium"
    )
    # Save Response
    response = request.execute()
    # Unpack Response
    rows = []
    for i in list(range(0, response['pageInfo']['resultsPerPage'])):
        rows.append([response['items'][i]['id']['videoId'],
                     response['items'][i]['snippet']['title'],  # this is the problematic line
                     response['items'][i]['snippet']['channelTitle']]
                    )
    temp = pd.DataFrame(rows, columns=["Video.ID", "Title", "Channel Name"])
    df_sample_v2 = df_sample_v2.append(temp)
print(f'{len(df_sample_v2)} videos retrieved!')
The KeyError I get is at the second line of rows.append(), where I try to access the snippet.
KeyError                                  Traceback (most recent call last)
<ipython-input-90-c6c01139e372> in <module>
     45
     46 rows.append([response['items'][i]['id']['videoId'],
---> 47              response['items'][i]['snippet']['title'],
     48              response['items'][i]['snippet']['channelTitle']]
     49             )
KeyError: 'snippet'
Your issue stems from the fact that the property resultsPerPage should not be used as an indicator of the size of the items array.
The proper way to iterate over the items obtained from the API is as follows (this is also the general Pythonic way of doing this kind of iteration):
for item in response['items']:
    rows.append([
        item['id']['videoId'],
        item['snippet']['title'],
        item['snippet']['channelTitle']
    ])
You may well add to your code something like the debugging code below to convince yourself of the claim I made.
print(f"resultsPerPage={response['pageInfo']['resultsPerPage']}")
print(f"len(items)={len(response['items'])}")

Optimize Python Script to parse xml

I'm parsing the US patent XML files (downloaded from the Google patent dumps) using Python and BeautifulSoup; the parsed data is exported to a MySQL database.
Each year's data contains close to 200-300K patents, which means parsing 200-300K XML files.
The server on which I'm running the Python script is pretty powerful (16 cores, 160 GB of RAM, etc.), but it is still taking close to 3 days to parse one year's worth of data.
I've been learning and using Python for 2 years, so I can get stuff done but do not know how to get it done in the most efficient manner. I'm reading up on it.
How can I optimize the script below to make it efficient?
Any guidance would be greatly appreciated.
Below is the code:
from bs4 import BeautifulSoup
import pandas as pd
from pandas.core.frame import DataFrame
import MySQLdb as db
import os

cnxn = db.connect('xx.xx.xx.xx', 'xxxxx', 'xxxxx', 'xxxx', charset='utf8', use_unicode=True)

def separated_xml(infile):
    file = open(infile, "r")
    buffer = [file.readline()]
    for line in file:
        if line.startswith("<?xml "):
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    yield "".join(buffer)
    file.close()

def get_data(soup):
    df = pd.DataFrame(columns=['doc_id', 'patcit_num', 'patcit_document_id_country', 'patcit_document_id_doc_number', 'patcit_document_id_kind', 'patcit_document_id_name', 'patcit_document_id_date', 'category'])
    if soup.findAll('us-citation'):
        cit = soup.findAll('us-citation')
    else:
        cit = soup.findAll('citation')
    doc_id = soup.findAll('publication-reference')[0].find('doc-number').text
    for x in cit:
        try:
            patcit_num = x.find('patcit')['num']
        except:
            patcit_num = None
        try:
            patcit_document_id_country = x.find('country').text
        except:
            patcit_document_id_country = None
        try:
            patcit_document_id_doc_number = x.find('doc-number').text
        except:
            patcit_document_id_doc_number = None
        try:
            patcit_document_id_kind = x.find('kind').text
        except:
            patcit_document_id_kind = None
        try:
            patcit_document_id_name = x.find('name').text
        except:
            patcit_document_id_name = None
        try:
            patcit_document_id_date = x.find('date').text
        except:
            patcit_document_id_date = None
        try:
            category = x.find('category').text
        except:
            category = None
        print doc_id
        val = {'doc_id': doc_id, 'patcit_num': patcit_num, 'patcit_document_id_country': patcit_document_id_country, 'patcit_document_id_doc_number': patcit_document_id_doc_number, 'patcit_document_id_kind': patcit_document_id_kind, 'patcit_document_id_name': patcit_document_id_name, 'patcit_document_id_date': patcit_document_id_date, 'category': category}
        df = df.append(val, ignore_index=True)
    df.to_sql(name='table_name', con=cnxn, flavor='mysql', if_exists='append')
    print '1 doc exported'

i = 0
l = os.listdir('/path/')
for item in l:
    f = '/path/' + item
    print 'Currently parsing - ', item
    for xml_string in separated_xml(f):
        soup = BeautifulSoup(xml_string, 'xml')
        if soup.find('us-patent-grant'):
            print item, i, xml_string[177:204]
            get_data(soup)
        else:
            print item, i, xml_string[177:204], '***********************************soup not found********************************************'
        i += 1
print 'DONE!!!'
Here is a tutorial on multi-threading, because currently that code will run on 1 thread, on 1 core.
Remove all the bare try/except statements and handle the errors properly; exceptions are expensive.
Run a profiler to find the choke points (see the sketch below), and multi-thread those or find a way to run them fewer times.
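For instance, a quick way to profile one file's worth of work (a minimal sketch using the standard library's cProfile; sorting by cumulative time is one reasonable choice among several):

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... parse a single XML file here, e.g. get_data(soup) from the question ...
profiler.disable()
# print the 20 functions with the highest cumulative time
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)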
So, you're doing two things wrong. First, you're using BeautifulSoup, which is slow, and second, you're using "find" calls, which are also slow.
As a first cut, look at lxml's ability to pre-compile XPath queries (look at the heading "The XPath class"). That will give you a huge speed boost.
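For illustration, a minimal sketch of the precompiled-XPath approach (my own example, using the citation tags from the question; the file name is hypothetical):

import lxml.etree as ET

# Compile the XPath expressions once, outside any loop
find_citations = ET.XPath(".//us-citation | .//citation")
find_doc_number = ET.XPath(".//publication-reference//doc-number/text()")

tree = ET.parse("patent.xml")  # hypothetical single-patent file
doc_id = find_doc_number(tree)[0]
for cit in find_citations(tree):
    country = cit.findtext(".//country")  # returns None if the element is absent
    print(doc_id, country)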
Alternatively, I've been working on a library called yankee that does this kind of parsing declaratively, using best practices for lxml speed, including precompiled XPath.
Yankee on PyPI |
Yankee on GitHub
You could do the same thing with yankee like this:
from yankee.xml import Schema, fields as f

# Create a schema for citations
class Citation(Schema):
    num = f.Str(".//patcit")
    country = f.Str(".//country")
    # ... and so forth for the rest of your fields

# Then create a "wrapper" to get all the citations
class Patent(Schema):
    citations = f.List(".//us-citation|.//citation")

# Then just feed the Schema your lxml.etrees for each patent:
import lxml.etree as ET
schema = Patent()
for _, doc in ET.iterparse(xml_string, "xml"):
    result = schema.load(doc)
The result will look like this:
{
    "citations": [
        {
            "num": "<some value>",
            "country": "<some value>",
        },
        {
            "num": "<some value>",
            "country": "<some value>",
        },
    ]
}
I would also check out Dask to help you multithread it more efficiently. Pretty much all my projects use it.
