Write to CSV cell by cell in a row using Python

I've written a Python boto3 code to get the average EC2 CPU utilization/day for the last 2 days. Here's the code:
import boto3
import datetime
import csv
accountId = boto3.client('sts').get_caller_identity()['Account']
session = boto3.session.Session()
region = session.region_name
ec2 = session.resource('ec2',region_name=region)
s3 = session.resource('s3')
fields = ['Account' , 'Region' , 'InstanceID' , 'InstanceName']
start = datetime.datetime.today() - datetime.timedelta(days=2)
end = datetime.datetime.today()
instanceId = ''
instanceName = ''
rows = []
filename = 'CPUUtilization.csv'
def get_cpu_utilization(instanceId):
    cw = boto3.client('cloudwatch', region_name=region)
    res = cw.get_metric_statistics(
        Namespace='AWS/EC2',
        Period=86400,
        StartTime=start,
        EndTime=end,
        MetricName='CPUUtilization',
        Statistics=['Average'],
        Unit='Percent',
        Dimensions=[
            {
                'Name': 'InstanceId',
                'Value': instanceId
            }
        ]
    )
    return res
def lambda_handler(event, context):
    for instance in ec2.instances.all():
        if instance.tags != None:
            for tags in instance.tags:
                if tags['Key'] == 'Name':
                    instanceName = tags['Value']
                    break
        instanceId = str(instance.id)
        response = get_cpu_utilization(instanceId)
        rows.append([accountId, region, instanceId, instanceName])
        for r in response['Datapoints']:
            day = r['Timestamp'].date()
            week = day.strftime('%a')
            avg = r['Average']
            day_uti = ' '.join([str(day), week])
            fields.append(day_uti)
            rows.append([avg])

    with open("/tmp/" + filename, 'w+') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(fields)
        csvwriter.writerows(rows)
    csvfile.close()
    s3.Bucket('instances-cmdb').upload_file('/tmp/CPUUtilization.csv', 'CPUUtilization.csv')
The output written to the CSV file is like this: the average CPU utilization value is written to cell A3, but it should be written to cell E2, under its date. All subsequent dates should go into the first row, and the corresponding values should go into the second row, cell by cell, under their respective dates.
How can I achieve this?
I have a couple of other questions related to AWS CloudWatch metrics.
This particular instance was in the stopped state for the whole day (1st April 2022), yet this Lambda function reports some CPU utilization for that day. When I check the same metric from the console, I don't see any data. How is this possible? Am I making a mistake?
When I ran this function multiple times, I got different CPU utilization values. The image attached above is from the 1st execution (average CPU utilization = 0.110935...). Below is the result from the 2nd execution.
Here the average CPU utilization for the same instance on the same day is different (0.53698...) from the previous result. Is this a mistake on my side, or what is going on?
Please help.
NOTE: There is only one instance in my account, and it was in the stopped state the whole day (1st April 2022); it was started only on 2nd April 2022 at around 8:00 PM IST.

You need to rethink your logic for adding columns for each datapoint returned.
The rows list contains one entry per row. It starts with this:
rows.append([accountId, region, instanceId, instanceName])
That creates one entry in the list that is a list with four values.
Later, the code attempts to add another column with:
rows.append([avg])
This results in rows having the value of [[accountId, region, instanceId, instanceName], [avg]].
This adds another row, which is why it appears in the CSV file as a separate line. Rather than adding another row, the code needs to add another entry to the existing row.
The easiest way to do this is to build the row in a local list and only append it to rows once it has all of its values.
So, you could replace this line:
rows.append([accountId, region, instanceId, instanceName])
with:
current_row = [accountId, region, instanceId, instanceName]
And you could later add to it with:
current_row.append(avg)
Then, after the for loop has completed adding all the columns, it can be stored with:
rows.append(current_row)
Also, be careful with this line:
fields.append(day_uti)
It adds the date to the fields (header) list, but if there is more than one instance, each instance will append its own set of date columns. I presume you want them to share the same date headers, so it won't work out the way you expect.
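Putting those pieces together, a minimal sketch of the handler might look like this (untested; it assumes a single instance, as in your note, so the date columns in fields are only appended once):
def lambda_handler(event, context):
    for instance in ec2.instances.all():
        instanceName = ''
        if instance.tags is not None:
            for tags in instance.tags:
                if tags['Key'] == 'Name':
                    instanceName = tags['Value']
                    break
        instanceId = str(instance.id)
        response = get_cpu_utilization(instanceId)

        # build the whole row first, then append it to rows once
        current_row = [accountId, region, instanceId, instanceName]
        # sort by timestamp so the value columns line up with the date headers
        for r in sorted(response['Datapoints'], key=lambda d: d['Timestamp']):
            day = r['Timestamp'].date()
            week = day.strftime('%a')
            day_uti = ' '.join([str(day), week])
            fields.append(day_uti)            # one header column per datapoint
            current_row.append(r['Average'])  # value goes in the same row, next cell
        rows.append(current_row)

    with open("/tmp/" + filename, 'w') as csvfile:
        csvwriter = csv.writer(csvfile)
        csvwriter.writerow(fields)
        csvwriter.writerows(rows)
    s3.Bucket('instances-cmdb').upload_file('/tmp/' + filename, filename)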

Related

DataFrame returns Value Error after adding auto index

This script needs to query the DC server for events. Since this is done live, each time the server is queried, it returns query results of varying lengths. The log file is long and messy, as most logs are. I need to filter only the event names and their codes and then create a DataFrame. Additionally, I need to add a third column that counts the number of times each event took place. I've done most of it but can't figure out how to fix the error I'm getting.
After doing all the filtering from Elasticsearch, I get two lists - action and code - which I have emulated here.
action_list = ['logged-out', 'logged-out', 'logged-out', 'Directory Service Access', 'Directory Service Access', 'Directory Service Access', 'logged-out', 'logged-out', 'Directory Service Access', 'created-process', 'created-process']
code_list = ['4634', '4634', '4634', '4662', '4662', '4662', '4634', '4634', '4662', '4688', '4688']
I then created a list that contains only the codes that need to be filtered out.
event_code_list = ['4662', '4688']
My script is as follows:
import pandas as pd
from collections import Counter
#Create a dict that combines action and code
lists2dict = {}
lists2dict = dict(zip(action_list,code_list))
# print(lists2dict)
#Filter only wanted events
filtered_events = {k: v for k, v in lists2dict.items() if v in event_code_list}
# print(filtered_events)
index = 1 * pd.RangeIndex(start=1, stop=2) #add automatic index to DataFrame
df = pd.DataFrame(filtered_events,index=index)#Create DataFrame from filtered events
#Create Auto Index
count = Counter(df)
action_count = dict(Counter(count))
action_count_values = action_count.values()
# print(action_count_values)
#Convert Columns to Rows and Add Index
new_df = df.melt(var_name="Event",value_name="Code")
new_df['Count'] = action_count_values
print(new_df)
Up until this point, everything works as it should. The problem is what comes next. If there are no events, the script outputs an empty DataFrame. This works fine. However, if there are events, then we should see the events, the codes, and the number of times each event occurred. The problem is that it always outputs 1. How can I fix this? I'm sure it's something ridiculous that I'm missing.
#If no alerts, create empty DataFrame
if new_df.empty:
    empty_df = pd.DataFrame(columns=['Event','Code','Count'])
    empty_df['Event'] = ['-']
    empty_df['Code'] = ['-']
    empty_df['Count'] = ['-']
    empty_df.to_html()
    html = empty_df.to_html()
    with open('alerts.html', 'w') as f:
        f.write(html)
else: #else, output alerts + codes + count
    new_df.to_html()
    html = new_df.to_html()
    with open('alerts.html', 'w') as f:
        f.write(html)
Any help is appreciated.
This is because you are collecting the result as a dictionary, so repeated records are ignored. You lose the record count here: lists2dict = dict(zip(action_list, code_list)).
You can do all of these operations directly on a DataFrame. Construct a pandas DataFrame from the given lists, then filter by code, group by, and aggregate as a count:
df = pd.DataFrame({"Event": action_list, "Code": code_list})
df = df[df.Code.isin(event_code_list)] \
    .groupby(["Event", "Code"]) \
    .agg(Count=("Code", len)) \
    .reset_index()
print(df)
Output:
Event Code Count
0 Directory Service Access 4662 4
1 created-process 4688 2
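If you still need the empty-table fallback and the HTML output, the rest of your script can stay roughly as it was; a minimal sketch, assuming the df built above:
# if nothing survived the filter, emit a placeholder row instead
if df.empty:
    df = pd.DataFrame([['-', '-', '-']], columns=['Event', 'Code', 'Count'])

with open('alerts.html', 'w') as f:
    f.write(df.to_html())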

How to build a dataframe from scratch while filling in missing data? (details included in question)

I have a dataframe which looks like the following (the name of the first dataframe, shown in the image below, is relevantdata in the code):
I want the dataframe to be transformed to the following format:
Essentially, I want to get the relevant confirmed number for each Key for all the dates that are available in the dataframe. If a particular date is not available for a Key, that value should be zero.
Currently my code is as follows (a try/except block is used because some Keys don't have the whole range of dates, so a KeyError occurs the first time you refer to that date using countrydata.at[date,'Confirmed'] for the respective Key; the except block then makes an entry of zero in the dictionary for that date):
relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')
dates = relevantdata['Date'].unique().tolist()
covidcountries = relevantdata['Key'].unique().tolist()
data = dict()
data['Country'] = covidcountries
confirmeddata = relevantdata[['Date','Key','Confirmed']]
for country in covidcountries:
    for date in dates:
        countrydata = confirmeddata.loc[lambda confirmeddata: confirmeddata['Key'] == country].set_index('Date')
        try:
            if (date in data.keys()) == False:
                data[date] = list()
                data[date].append(countrydata.at[date,'Confirmed'])
            else:
                data[date].append(countrydata.at[date,'Confirmed'])
        except:
            if (date in data.keys()) == False:
                data[date].append(0)
            else:
                data[date].append(0)
finaldf = pandas.DataFrame(data = data)
While the above code accomplishes what I want and gets the dataframe into the format I require, it is far too slow, since it has to loop through every Key and date. I want to know if there is a better and faster method of doing the same thing without a nested for loop. Thank you for all your help.
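For what it's worth, the same reshape can usually be done in one vectorised step with pivot_table, which creates the missing Key/date combinations and lets fillna(0) turn them into zeros. A minimal sketch, assuming the column names above (Date, Key, Confirmed) and untested against the actual CSV:
import pandas

relevantdata = pandas.read_csv('https://raw.githubusercontent.com/open-covid-19/data/master/output/data_minimal.csv')

# rows become one line per Key, columns become the available dates,
# missing Key/date combinations end up as zero
finaldf = (relevantdata
           .pivot_table(index='Key', columns='Date', values='Confirmed', aggfunc='sum')
           .fillna(0)
           .reset_index()
           .rename(columns={'Key': 'Country'}))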

Iterate a piece of code connecting to API using two variables pulled from two lists

I'm trying to run a script (API to google search console) over a table of keywords and dates in order to check if there was improvement in keyword performance (SEO) after the date.
Since I'm really clueless, I'm guessing and trying things, but the Jupyter notebook isn't responding, so I can't even tell if I'm wrong...
The repository from which I took this code was made by Josh Carty:
https://github.com/joshcarty/google-searchconsole
I have already read the input table with pd.read_csv (it consists of two columns, 'keyword' and 'date') and turned the columns into two separate lists (or maybe it is better to use a dictionary or something else?):
KW_list and
Date_list
I tried:
for i in KW_list and j in Date_list:
    account = searchconsole.authenticate(client_config='client_secrets.json',
                                         credentials='credentials.json')
    webproperty = account['https://www.example.com/']
    report = webproperty.query.range(j, days=-30).filter('query', i, 'contains').get()
    report2 = webproperty.query.range(j, days=30).filter('query', i, 'contains').get()
    df = pd.DataFrame(report)
    df2 = pd.DataFrame(report2)
df
I expect to see a data frame of all the different keywords (keyword1 - stats1, keyword2 - stats2 below it, etc., with no overwriting) for the dates 30 days before the date in the neighbouring cell (in the input file), or at least some response from the Jupyter notebook so that I know what is going on.
Try using the zip function to combine the lists into a list of tuples. This way, the date and the corresponding keyword are combined.
account = searchconsole.authenticate(client_config='client_secrets.json', credentials='credentials.json')
webproperty = account['https://www.example.com/']
df1 = None
df2 = None
first = True
for (keyword, date) in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    if first:
        df1 = pd.DataFrame(report)
        df2 = pd.DataFrame(report2)
        first = False
    else:
        df1 = df1.append(pd.DataFrame(report))
        df2 = df2.append(pd.DataFrame(report2))
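One caveat: DataFrame.append is deprecated in recent pandas releases (and removed in pandas 2.0), so a variant of the same idea collects the per-keyword frames in lists and concatenates them once at the end. A sketch, reusing the webproperty object from above:
frames1 = []
frames2 = []
for keyword, date in zip(KW_list, Date_list):
    report = webproperty.query.range(date, days=-30).filter('query', keyword, 'contains').get()
    report2 = webproperty.query.range(date, days=30).filter('query', keyword, 'contains').get()
    frames1.append(pd.DataFrame(report))
    frames2.append(pd.DataFrame(report2))

# a single concat at the end avoids repeatedly copying the growing frames
df1 = pd.concat(frames1, ignore_index=True)
df2 = pd.concat(frames2, ignore_index=True)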

Check logs with Spark

I'm new to Spark and I'm trying to develop a python script that reads a csv file with some logs:
userId,timestamp,ip,event
13,2016-12-29 16:53:44,86.20.90.121,login
43,2016-12-29 16:53:44,106.9.38.79,login
66,2016-12-29 16:53:44,204.102.78.108,logoff
101,2016-12-29 16:53:44,14.139.102.226,login
91,2016-12-29 16:53:44,23.195.2.174,logoff
and checks whether a user showed some strange behaviour, for example two consecutive 'login' events without a 'logoff' in between. I've loaded the csv as a Spark DataFrame and I want to compare the log rows of a single user, ordered by timestamp, checking whether two consecutive events are of the same type (login - login, logoff - logoff). I'm looking for a 'map-reduce' way of doing it, but at the moment I can't figure out how to use a reduce function that compares consecutive rows.
The code I've written works, but the performance is very bad.
sc = SparkContext("local", "Data Check")
sqlContext = SQLContext(sc)
LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"
N_USERS = 10*1000

dataFrame = sqlContext.read.format("com.databricks.spark.csv").load(LOG_FILE_PATH)
dataFrame = dataFrame.selectExpr("C0 as userID", "C1 as timestamp", "C2 as ip", "C3 as event")

wrongUsers = []
for i in range(0, N_USERS):
    userDataFrame = dataFrame.where(dataFrame['userId'] == i)
    userDataFrame = userDataFrame.sort('timestamp')
    prevEvent = ''
    for row in userDataFrame.rdd.collect():
        currEvent = row[3]
        if(prevEvent == currEvent):
            wrongUsers.append(row[0])
        prevEvent = currEvent

badUsers = sqlContext.createDataFrame(wrongUsers)
badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
First (not related, but still): be sure that the number of entries per user is not too big, because the collect() in for row in userDataFrame.rdd.collect(): is dangerous.
Second, you don't need to leave the DataFrame area here to use classical Python, just stick to Spark.
Now, your problem. It's basically "for each line I want to know something from the previous line": that belongs to the concept of Window functions and to be precise the lag function. Here are two interesting articles about Window functions in Spark: one from Databricks with code in Python and one from Xinh with (I think easier to understand) examples in Scala.
I have a solution in Scala, but I think you'll manage to translate it into Python:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._

val LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
val RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

val data = sqlContext
  .read
  .format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true") // use the header from your csv
  .load(LOG_FILE_PATH)

val wSpec = Window.partitionBy("userId").orderBy("timestamp")

val badUsers = data
  .withColumn("previousEvent", lag($"event", 1).over(wSpec))
  .filter($"previousEvent" === $"event")
  .select("userId")
  .distinct

badUsers.write.format("com.databricks.spark.csv").save(RESULTS_FILE_PATH)
Basically you just retrieve the value from the previous line and compare it to the value on your current line; if they match, that is the wrong behaviour and you keep the userId. For the first line in each userId's "block" of lines, the previous value will be null: when it is compared with the current value the boolean expression is false, so there is no problem there.
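If it helps, here is a rough, untested Python translation of the same idea, using the modern SparkSession/DataFrame reader instead of SQLContext and the spark-csv package:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lag
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("Data Check").getOrCreate()

LOG_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/flume/events/*"
RESULTS_FILE_PATH = "hdfs://quickstart.cloudera:8020/user/cloudera/spark/script_results/prova/bad_users.csv"

data = (spark.read
        .option("inferSchema", "true")
        .option("header", "true")   # use the header from your csv
        .csv(LOG_FILE_PATH))

w_spec = Window.partitionBy("userId").orderBy("timestamp")

bad_users = (data
             .withColumn("previousEvent", lag(col("event"), 1).over(w_spec))  # value from the previous row per user
             .filter(col("previousEvent") == col("event"))                    # same event twice in a row
             .select("userId")
             .distinct())

bad_users.write.csv(RESULTS_FILE_PATH)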

group a bunch of files by day

I have a bunch of files containing atmospheric measurements in one directory. The file format is NetCDF. Each file has a timestamp (variable 'basetime'). I can read all of the files and plot individual measurement events (temperature vs. altitude).
What I need to do next is group the files by day and plot all measurements taken on a single day together in one plot. Unfortunately I have no clue how to do that.
One idea is to use the variable 'measurement_day' as it is defined in the code below.
For each day I normally have four different files containing temperature and altitude.
Ideally the data from those four different files should be grouped (e.g. for plotting).
I hope my question is clear. Can anyone please help me?
EDIT: I am trying to use a dictionary now, but I have trouble determining whether an entry already exists for a given measurement day. Please see the edited code below.
from netCDF4 import Dataset

data = {}  # was edited

for f in listdir(path):
    if isfile(join(path, f)):
        full_path = join(path, f)
        f = Dataset(full_path, 'r')
        basetime = f.variables['base_time'][:]
        altitude = f.variables['alt'][:]
        temp = f.variables['tdry'][:]
        actual_date = strftime("%Y-%m-%d %H:%M:%S", gmtime(basetime))
        measurement_day = strftime("%Y-%m-%d", gmtime(basetime))
        # check if dict entries for day already exist, if not create empty dict
        # and lists inside
        if len(data[measurement_day]) == 0:
            data[measurement_day] = {}
        else: pass
        if len(data[measurement_day]['temp']) == 0:
            data[measurement_day]['temp'] = []
            data[measurement_day]['altitude'] = []
        else: pass
I get the following error message:
Traceback (most recent call last):... if len(data[measurement_day]) == 0:
KeyError: '2009/05/28'
Can anyone please help me.
I will try. Though I'm not totally clear on what you already have.
I can read all files and plot individual measurement events
(temperature vs. altitude). What I need to do next is "group the files
by day" and plot all measurements taken at one single day together in
one plot.
From this, I am assuming that you know how to plot the information given a list of Datasets. To get that list of Datasets, try something like this.
# stdlib imports implied by the calls below (directory walk and time formatting)
from os import listdir
from os.path import isfile, join
from time import gmtime, strftime

from netCDF4 import Dataset

# a dictionary of lists that hold all the datasets from a given day
grouped_datasets = {}

for f in listdir(path):
    if isfile(join(path, f)):
        full_path = join(path, f)
        f = Dataset(full_path, 'r')
        basetime = f.variables['base_time'][:]
        altitude = f.variables['alt'][:]
        temp = f.variables['tdry'][:]
        actual_date = strftime("%Y-%m-%d %H:%M:%S", gmtime(basetime))
        measurement_day = strftime("%Y-%m-%d", gmtime(basetime))
        # if we haven't encountered any datasets from this day yet...
        if measurement_day not in grouped_datasets:
            # add that day to our dict
            grouped_datasets[measurement_day] = []
        # now append our dataset to the correct day (list)
        grouped_datasets[measurement_day].append(f)
Now you have a dictionary keyed on measurement_day. I'm not sure how you are graphing your data, so this is as far as I can get you. Hope it helps, good luck.
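From there, a rough plotting sketch (assuming matplotlib is acceptable and using the variable names from your code, 'tdry' and 'alt'; one figure per day, with the files still open as Dataset objects):
import matplotlib.pyplot as plt

for day, datasets in grouped_datasets.items():
    fig, ax = plt.subplots()
    for ds in datasets:
        # read the profiles from the still-open Dataset objects
        ax.plot(ds.variables['tdry'][:], ds.variables['alt'][:])
    ax.set_xlabel('Temperature (tdry)')
    ax.set_ylabel('Altitude (alt)')
    ax.set_title(day)
    fig.savefig('measurements_%s.png' % day)
    plt.close(fig)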
