How to combine boto3 python code and get one csv output - python

I have written two scripts to get some details about EC2 instances. The reason there are two: I am not able to get the 'ComputerName' from EC2 describe_instances, so I created a separate script that uses the boto3 SSM client to get the 'ComputerName'. Now I want to combine both scripts into a single one and get the output in a single CSV with separate columns and rows. Could someone help me with the code below to get a single CSV output? Please also find the sample output.
import boto3
import csv

profiles = ['Dev_Databases', 'Dev_App', 'Prod_Database', 'Prod_App']

########################EC2-Details################################
csv_ob = open("EC2-Inventory.csv", "w", newline='')
csv_w = csv.writer(csv_ob)
csv_w.writerow(["S_NO", "profile", "Instance_Id", "Instance_Type", "Platform", "State", "LaunchTime", "Private_Ip"])
cnt = 1
for ec2 in profiles:
    aws_mag_con = boto3.session.Session(profile_name=ec2)
    ec2_con_re = aws_mag_con.resource(service_name="ec2", region_name="ap-southeast-1")
    for each in ec2_con_re.instances.all():
        print(cnt, ec2, each.instance_id, each.instance_type, each.platform, each.state, each.launch_time.strftime("%Y-%m-%d"), each.private_ip_address)
        csv_w.writerow([cnt, ec2, each.instance_id, each.instance_type, each.platform, each.state, each.launch_time.strftime("%Y-%m-%d"), each.private_ip_address])
        cnt += 1
csv_ob.close()
#######################HostName-Details###########################
csv_ob1 = open("Hostname-Inventory.csv", "w", newline='')
csv_w1 = csv.writer(csv_ob1)
csv_w1.writerow(["S_NO", "Profile", "InstanceId", "ComputerName", "PlatformName"])
cnt1 = 1
for ssm in profiles:
    session = boto3.Session(profile_name=ssm)
    ssm_client = session.client('ssm', region_name='ap-southeast-1')
    paginator = ssm_client.get_paginator('describe_instance_information')
    response_iterator = paginator.paginate(Filters=[{'Key': 'PingStatus', 'Values': ['Online']}])
    for item in response_iterator:
        for instance in item['InstanceInformationList']:
            if instance.get('PingStatus') == 'Online':
                InstanceId = instance.get('InstanceId')
                ComputerName = instance.get('ComputerName')  # .replace(".WORKGROUP", "")
                PlatformName = instance.get('PlatformName')
                print(InstanceId, ComputerName, PlatformName)
                csv_w1.writerow([cnt1, ssm, InstanceId, ComputerName, PlatformName])
                cnt1 += 1
csv_ob1.close()
Sample Output Below:

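One way to combine the two scripts into a single CSV (a sketch, untested against your accounts): for each profile, first build an InstanceId -> (ComputerName, PlatformName) lookup from SSM, then write one row per EC2 instance with those extra columns appended. The column names follow your existing headers; the state column is narrowed to the state name so the CSV stays readable.

import boto3
import csv

profiles = ['Dev_Databases', 'Dev_App', 'Prod_Database', 'Prod_App']
region = 'ap-southeast-1'

with open('EC2-Inventory.csv', 'w', newline='') as csv_ob:
    csv_w = csv.writer(csv_ob)
    csv_w.writerow(['S_NO', 'Profile', 'Instance_Id', 'Instance_Type', 'Platform',
                    'State', 'LaunchTime', 'Private_Ip', 'ComputerName', 'PlatformName'])
    cnt = 1
    for profile in profiles:
        session = boto3.session.Session(profile_name=profile)

        # Build an InstanceId -> (ComputerName, PlatformName) lookup from SSM.
        ssm_info = {}
        ssm_client = session.client('ssm', region_name=region)
        paginator = ssm_client.get_paginator('describe_instance_information')
        for page in paginator.paginate(Filters=[{'Key': 'PingStatus', 'Values': ['Online']}]):
            for inst in page['InstanceInformationList']:
                ssm_info[inst['InstanceId']] = (inst.get('ComputerName'), inst.get('PlatformName'))

        # One row per EC2 instance, joined to the SSM data on InstanceId.
        ec2 = session.resource('ec2', region_name=region)
        for each in ec2.instances.all():
            computer_name, platform_name = ssm_info.get(each.instance_id, (None, None))
            csv_w.writerow([cnt, profile, each.instance_id, each.instance_type,
                            each.platform, each.state['Name'],
                            each.launch_time.strftime('%Y-%m-%d'),
                            each.private_ip_address, computer_name, platform_name])
            cnt += 1

Instances that are not managed by SSM (or not Online) simply get empty ComputerName/PlatformName columns instead of being dropped.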
Related

python skips broken json output

I'm using masscan and the example script, but when I print the results only the command is printed, because the JSON output for the scan is broken (at least one issue mentions that). How can I fix that, or otherwise still access the JSON data?
The example code:
import masscan
mas = masscan.PortScanner()
mas.scan('172.0.8.78/24', ports='22,80,8080', arguments='--max-rate 1000')
print(mas.scan_result)
Masscan should return the scan results.
Instead it returns: {"command_line": "masscan -oJ - 0.0.0.0/24 -p 25565 --rate=1000", "scan":{}
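No answer is shown here, but as a hedged sketch of the "still access the JSON data" part: depending on the python-masscan version, scan_result may be a raw JSON string rather than a dict. The snippet below (reusing mas from the example code) parses it defensively, closing any unbalanced braces before giving up; it does not address why the scan section itself is empty.

import json

raw = mas.scan_result
if isinstance(raw, str):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # The output shown above is missing closing braces; balance them and retry.
        data = json.loads(raw + '}' * (raw.count('{') - raw.count('}')))
else:
    data = raw  # already parsed into a dict
print(data.get('scan', {}))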

Error trying to write CSV file to Google Cloud Storage from Dataflow pipeline

I'm working on building a Dataflow pipeline that reads a CSV file (containing 250,000 rows) from my Cloud Storage bucket, modifies the value of each row and then writes the modified contents to a new CSV in the same bucket. With the code below I'm able to read and modify the contents of the original file, but when I attempt to write the contents of the new file in GCS I get the following error:
google.api_core.exceptions.TooManyRequests: 429 POST https://storage.googleapis.com/upload/storage/v1/b/my-bucket/o?uploadType=multipart: {
  "error": {
    "code": 429,
    "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
    "errors": [
      {
        "message": "The rate of change requests to the object my-bucket/product-codes/URL_test_codes.csv exceeds the rate limit. Please reduce the rate of create, update, and delete requests.",
        "domain": "usageLimits",
        "reason": "rateLimitExceeded"
      }
    ]
  }
}
: ('Request failed with status code', 429, 'Expected one of', <HTTPStatus.OK: 200>) [while running 'Store Output File']
My code in Dataflow:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
import traceback
import sys
import pandas as pd
from cryptography.fernet import Fernet
import google.auth
from google.cloud import storage
fernet_secret = 'aD4t9MlsHLdHyuFKhoyhy9_eLKDfe8eyVSD3tu8KzoP='
bucket = 'my-bucket'
inputFile = f'gs://{bucket}/product-codes/test_codes.csv'
outputFile = 'product-codes/URL_test_codes.csv'
#Pipeline Logic
def product_codes_pipeline(project, env, region='us-central1'):
    options = PipelineOptions(
        streaming=False,
        project=project,
        region=region,
        staging_location="gs://my-bucket-dataflows/Templates/staging",
        temp_location="gs://my-bucket-dataflows/Templates/temp",
        template_location="gs://my-bucket-dataflows/Templates/Generate_Product_Codes.py",
        subnetwork='https://www.googleapis.com/compute/v1/projects/{}/regions/us-central1/subnetworks/{}-private'.format(project, env)
    )

    # Transform function
    def genURLs(code):
        f = Fernet(fernet_secret)
        encoded = code.encode()
        encrypted = f.encrypt(encoded)
        decrypted = f.decrypt(encrypted.decode().encode())
        decoded = decrypted.decode()
        if code != decoded:
            print(f'Error: Code {code} and decoded code {decoded} do not match')
            sys.exit(1)
        url = 'https://some-url.com/redeem/product-code=' + encrypted.decode()
        return url

    class WriteCSVFIle(beam.DoFn):
        def __init__(self, bucket_name):
            self.bucket_name = bucket_name

        def start_bundle(self):
            self.client = storage.Client()

        def process(self, urls):
            df = pd.DataFrame([urls], columns=['URL'])
            bucket = self.client.get_bucket(self.bucket_name)
            bucket.blob(f'{outputFile}').upload_from_string(df.to_csv(index=False), 'text/csv')

    # End function
    p = beam.Pipeline(options=options)
    (p | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
       | 'Map Codes' >> beam.Map(genURLs)
       | 'Store Output File' >> beam.ParDo(WriteCSVFIle(bucket)))
    p.run()
The code produces URL_test_codes.csv in my bucket, but the file only contains one row (not including the 'URL' header) which tells me that my code is writing/overwriting the file as it processes each row. Is there a way to bulk write the contents of the entire file instead of making a series of requests to update the file? I'm new to Python/Dataflow so any help is greatly appreciated.
Let's point out the issues. The obvious one is a quota issue on the GCS side, reflected by the 429 error codes, but as you noted it is a symptom of the underlying problem: how you write your data to the blob.
A Beam pipeline works on a parallel collection (PCollection) of elements, and each pipeline step is executed once per element. In other words, your ParDo tries to write something to the output file once for every element in the PCollection.
So there are some issues with your WriteCSVFIle function. To write a PCollection to GCS, it is better to use a separate pipeline step dedicated to writing the whole PCollection, such as the following:
First, you can import this Function already included in Apache Beam:
from apache_beam.io import WriteToText
Then, you use it at the end of your pipeline:
| 'Write PCollection to Bucket' >> WriteToText('gs://{0}/{1}'.format(bucket_name, outputFile))
With this option you don't need to create a storage client or reference a blob; the transform only needs the GCS URI where it should write the final result, and you can adjust its behavior with the parameters described in the documentation.
That leaves the DataFrame created in your WriteCSVFIle function. Each pipeline step produces a new PCollection, so a DataFrame-creating step that receives elements from a PCollection of URLs will, with your current logic, emit one DataFrame per URL. Since 'URL' is the only column and you just want to write the results of genURLs, going directly from genURLs to WriteToText may already give you what you're looking for.
Either way you can adjust the pipeline as needed, but the WriteToText transform will take care of writing your whole final PCollection to your Cloud Storage bucket.
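Putting that together, the tail of the pipeline might look like the sketch below (reusing options, inputFile, bucket, outputFile and genURLs from the question; file_name_suffix and header are optional WriteToText parameters, and the output object name will gain shard suffixes):

import apache_beam as beam
from apache_beam.io import WriteToText

with beam.Pipeline(options=options) as p:
    (p
     | 'Read Input CSV' >> beam.io.ReadFromText(inputFile, skip_header_lines=1)
     | 'Map Codes' >> beam.Map(genURLs)
     | 'Write PCollection to Bucket' >> WriteToText(
           'gs://{0}/{1}'.format(bucket, outputFile),
           file_name_suffix='.csv',
           header='URL'))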

How To Export GCP Security Command Center Findings To BigQuery?

Similar to this: How to export GCP's Security Center Assets to a Cloud Storage via cloud Function?
I need to export the Findings as seen in the Security Command Center to BigQuery so we can easily filter the data we need and generate custom reports.
Using this documentation as an example (https://cloud.google.com/security-command-center/docs/how-to-api-list-findings#python), I wrote the following:
from google.cloud import securitycenter
from google.cloud import bigquery
JSONPath = "Path to JSON File For Service Account"
client = securitycenter.SecurityCenterClient().from_service_account_json(JSONPath)
BQclient = bigquery.Client().from_service_account_json(JSONPath)
table_id = "project.security_center.assets"
org_name = "organizations/1234567891011"
all_sources = "{org_name}/sources/-".format(org_name=org_name)
finding_result_iterator = client.list_findings(request={"parent": all_sources})
for i, finding_result in enumerate(finding_result_iterator):
    errors = BQclient.insert_rows_json(table_id, finding_result)
    if errors == []:
        print("New rows have been added.")
    else:
        print("Encountered errors while inserting rows: {}".format(errors))
However, that then gave me the error:
"json_rows argument should be a sequence of dicts".
Any help with this would be greatly appreciated :)
Not sure if this existed back in Q2 2021, but there is now documentation describing how to do this:
https://cloud.google.com/security-command-center/docs/how-to-analyze-findings-in-big-query
You can create exports of SCC findings to BigQuery using this command:
gcloud scc bqexports create BIG_QUERY_EXPORT \
--dataset=DATASET_NAME \
--folder=FOLDER_ID | --organization=ORGANIZATION_ID | --project=PROJECT_ID \
[--description=DESCRIPTION] \
[--filter=FILTER]
The filter lets you exclude unwanted findings (they will still appear in SCC, but won't be copied to BigQuery).
This is useful if you only want to export findings from one project or from selected categories. (Use -category:CATEGORY to exclude a category; the same syntax works for other attributes as well.)
I managed to sort this by building a list of dicts and passing that to insert_rows_json:
for i, finding_result in enumerate(finding_result_iterator):
    rows_to_insert = [
        {
            u"category": finding_result.finding.category,
            u"name": finding_result.finding.name,
            u"project": finding_result.resource.project_display_name,
            u"external_uri": finding_result.finding.external_uri,
        },
    ]
    errors = BQclient.insert_rows_json(table_id, rows_to_insert)  # a sequence of dicts, as the error asked for

InvalidS3ObjectException when calling the AnalyzeDocument operation:

InvalidS3ObjectException when calling the AnalyzeDocument operation: Unable to get object metadata from S3. Check object key, region and/or access permissions."
I keep getting this error, over and over. The program worked with my test cases, where the event was JSON like {"body":"imagename.jpg"}. But the moment I try to use the actual payload my JS sends in, I get this error. What confuses me is that I've checked the regions and they are fine. I went into my account, created users with full access to all AWS and S3 features, and used those credentials; I've used my root account, everything. All I'm trying to do is access an image from my S3 bucket. Why won't it work? Below is my code. It works if I use the test case I provided above, but the moment I try to use the website it's connected to, it doesn't work.
def main(event, context):
    key_map, value_map, block_map = get_kv_map(event)  # Take map variables in to get the key and value map we need.
It goes to this function...
def get_kv_map(event):
    filePath = event
    fileExt = filePath.get('body')
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket('react-images-ex')
    obj = bucket.Object(bucket)
    client = boto3.client('textract')  # We utilize boto3's textract lib
    response = client.analyze_document(
        Document={'S3Object': {'Bucket': 'react-images-ex', 'Name': fileExt}},
        FeatureTypes=['FORMS'])
    # Get the text blocks
    blocks = response['Blocks']  # The blocks we find in the document
    # Get key and value maps
    key_map = {}
    value_map = {}
    block_map = {}
    for block in blocks:  # Traverse the blocks found in the document
        block_id = block['Id']  # The Id found at that block location
        block_map[block_id] = block  # Map that Id to the block
        if block['BlockType'] == "KEY_VALUE_SET":
            # Key/value set pair: if it's not a key, it's a value; send it to the respective map.
            if 'KEY' in block['EntityTypes']:
                key_map[block_id] = block
            else:
                value_map[block_id] = block
    return key_map, value_map, block_map  # Return the maps we need after they're filled.
I have been told before that this code is fine and should work, so why exactly am I getting this error?
Based on the comments: the issue with body was that it was a JSON string, not an actual JSON object.
The solution was to parse the string into JSON:
fileExt = json.loads(filePath.get('body'))
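If it helps, here is a hedged sketch of a small helper (the name extract_file_key is made up for illustration) that handles both the hand-written test event and an HTTP-style event whose body arrives as a JSON string; adjust the final key lookup to whatever your front end actually sends:

import json

def extract_file_key(event):
    # In the test event, 'body' is already the bare file name ("imagename.jpg").
    # When invoked through an HTTP front end, 'body' is usually a JSON string
    # that still has to be parsed before the file name can be read from it.
    body = event.get('body')
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError:
            return body  # plain file name, not JSON
    if isinstance(body, dict):
        return body.get('body')
    return body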
Try awscli to see if you can access the image in s3:
aws s3 ls s3://react-images-ex/<some-fileExt>
Either you are parsing fileExt incorrectly, or you don't have S3 permission to access the file. The awscli command will help verify which.

How to get ID of EMR matching specific name only with boto3

How do I get a list of AWS EMR cluster IDs matching a specific name with boto3?
I have this code here:
import sys
import time
import boto3
client = boto3.client("emr")
cluster_name = 'Adhoc-CSDP-EMR'
response = client.list_clusters(
    ClusterStates=[
        'RUNNING', 'WAITING'
    ]
)
for cluster in response['Clusters']:
    print(cluster['Name'])
    print(cluster['Id'])
That will print all clusters in the running or waiting state. How do I filter the results that match cluster_name?
Umm, why can't we do something like this?
matching_cluster_ids = list()
for cluster in response['Clusters']:
    if cluster_name == cluster['Name']:
        matching_cluster_ids.append(cluster['Id'])
Later you can execute a describe_cluster() (or any other operation) on any of the matching cluster_ids if you want.
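That works for the response shown; if an account has more clusters than one list_clusters() page returns, a paginator avoids missing matches. A sketch using the names from the question:

import boto3

client = boto3.client("emr")
cluster_name = 'Adhoc-CSDP-EMR'

matching_cluster_ids = []
paginator = client.get_paginator('list_clusters')
for page in paginator.paginate(ClusterStates=['RUNNING', 'WAITING']):
    for cluster in page['Clusters']:
        if cluster['Name'] == cluster_name:
            matching_cluster_ids.append(cluster['Id'])

print(matching_cluster_ids)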
