I've created a Lambda that scans my S3 bucket and collects some metadata for each object it finds. However, I'm hitting a roadblock when exporting a CSV with the data of the S3 objects. My CSV only contains one record; how can I get it to include all objects?
Please see my Lambda code below:
import re
import datetime
from datetime import date
import os
import math
import csv
import boto3
import logging

s3 = boto3.client('s3')
logger = logging.getLogger()
logger.setLevel(logging.INFO)
time = date.today().strftime("%d/%m/%Y")
def lambda_handler(event, context):
    s3_resource = boto3.resource('s3')
    result = []
    bucket = s3_resource.Bucket('dev-bucket')
    key = 'csv_file.csv'
    for object in bucket.objects.all():
        name = object.key
        size = object.size
        si = list(name)
        dates = object.last_modified.strftime("%d/%m/%Y")
        owner = object.owner['DisplayName']
        days_since_creation = datetime.datetime.strptime(time, "%d/%m/%Y") - datetime.datetime.strptime(dates, "%d/%m/%Y")
        days_since_creation = days_since_creation.days
        to_delete = []
        if days_since_creation >= 30:
            to_delete = 'Y'
        else:
            to_delete = 'N'
        myfile = open("/tmp/csv_file.csv", "w+")
        writer = csv.writer(myfile, delimiter='|')
        rows = name, size, dates, days_since_creation
        rows = list(rows)
        writer.writerow(rows)
        myfile.close()
        # upload the data into s3
        s3.upload_file('/tmp/csv_file.csv', 'dev-bucket', 'cleanuptest.csv')
        print(rows)
My current output is below:
09ff0687-a644-4d5e-9de8-277594b194a6.csv.metadata|280|29/11/2021|78
The preferred output would be:
0944ee8b-1e17-496a-9196-0caed1e1de11.csv.metadata|152|08/12/2021|69
0954d7e5-dcc6-4cb6-8c07-70cbf37a73ef.csv|8776432|16/11/2021|91
0954d7e5-dcc6-4cb6-8c07-70cbf37a73ef.csv.metadata|336|16/11/2021|91
0959edc4-fa02-493f-9c05-9040964f4756.csv|6338|29/11/2021|78
0959edc4-fa02-493f-9c05-9040964f4756.csv.metadata|225|29/11/2021|78
0965cf32-fc31-4acc-9c32-a983d8ea720d.txt|844|10/12/2021|67
0965cf32-fc31-4acc-9c32-a983d8ea720d.txt.metadata|312|10/12/2021|67
096ed35c-e2a7-4ec4-8dae-f87b42bfe97c.csv|1761|09/12/2021|68
Unfortunately I can't get it right, and I'm not sure what I'm doing wrong. Help would be appreciated.
In your current setup, you open the file in write mode inside the loop, so each iteration truncates the file and writes a single row. At the end, the file contains only the last row.
What you probably want is this:
myfile = open("/tmp/csv_file.csv", "w+")
for object in bucket.objects.all():
    <the looping logic>
myfile.close()
s3.upload_file('/tmp/csv_file.csv', 'dev-bucket', 'cleanuptest.csv')
You can prove that opening & closing the file each time rewrites the file by running the below minimal version of your script:
import csv
myfile1 = open("csv_file.csv", "w+")
writer1 = csv.writer(myfile1,delimiter='|')
row1 = "a", "b", "c"
rows1 = list(row1)
writer1.writerow(rows1)
myfile1.close()
print(rows1)
myfile2 = open("csv_file.csv", "w+")
writer2 = csv.writer(myfile2,delimiter='|')
row2 = "x", "y", "z"
rows2 = list(row2)
writer2.writerow(rows2)
myfile2.close()
print(rows2)
Output in file:
x|y|z
FYI, you can also open the file in append mode using 'a' to ensure existing rows are not overwritten:
myfile = open("/tmp/csv_file.csv", "a")
Using 'w' has the following caveat, as mentioned in the docs:
'w' for only writing (an existing file with the same name will be erased)
'a' opens the file for appending; any data written to the file is automatically added to the end.
...
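For contrast, here is the counterpart of that minimal script with the file opened once, before the rows are written — both rows survive:

```python
import csv

# Open the file once, write every row, then close: nothing is overwritten.
myfile = open("csv_file.csv", "w+", newline="")
writer = csv.writer(myfile, delimiter='|')
for row in (["a", "b", "c"], ["x", "y", "z"]):
    writer.writerow(row)
myfile.close()

with open("csv_file.csv") as f:
    print(f.read())  # a|b|c  then  x|y|z
```

This is exactly the structure suggested above: open before the loop, write inside it, close after.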
I have an input file containing one row with multiple mu (μ) characters. My Python code opens the file, does some manipulation, and saves the result in .csv format. When I save the file as .csv, it produces weird and funny characters (�). The attached images show the input file and the output file when opened in Excel.
Input CSV file:
Output CSV file:
from pathlib import Path
import pandas as pd
import time
import argparse

parser = argparse.ArgumentParser(description='Process some integers.')
parser.add_argument('path',
                    help='define the directory to folder/file')

start = time.time()

def main(path_files):
    rs_columns = "SourceFile,RowNum,SampleID,Method,Element,Result".split(",")
    rs = pd.DataFrame(columns=rs_columns)
    if path_files.is_file():
        fnames = [path_files]
    else:
        fnames = list(Path(path_files).glob("*.csv"))
    for fn in fnames:
        if "csv" in str(fn):
            #df = pd.read_csv(str(fn))
            df = pd.read_csv(str(fn), header=None, sep='\n')
            df = df[0].str.split(',', expand=True)
        else:
            print("Unknown file", str(fn))
        non_null_columns = [col for col in df.columns if df.loc[:, col].notna().any()]
        # loop thru each column for the whole file and create a row of results in the output file
        for i in range(1, len(non_null_columns)):
            SourceFile = Path(fn.name)
            Method = "WetScreening"
            Element = df.iloc[1, i]
            print(Element)
            for j in range(2, len(df)):
                RowNum = j + 1
                Result = df.iloc[j, i]
                SampleID = df.iloc[j, 0]
                rs = rs.append(pd.DataFrame({
                    "SourceFile": [SourceFile],
                    "RowNum": [RowNum],
                    "SampleID": [SampleID],
                    "Method": [Method],
                    "Element": [Element],
                    "Result": [Result]
                }), ignore_index=True)
    rs.to_csv("check.csv", index=False)
    print("Output: check.csv")

if __name__ == "__main__":
    start = time.time()
    args = parser.parse_args()
    path = Path(args.path)
    main(path)
    print("Processed time: ", time.time() - start)
Any help would be appreciated.
Try encoding to utf-8:
rs.to_csv("check.csv",index=False, encoding='UTF-8')
See also Pandas df.to_csv("file.csv" encode="utf-8") still gives trash characters for minus sign
That answer mentions the BOM bytes (0xEF, 0xBB, 0xBF) at the start of the file that acts as a utf-8 signature.
rs.to_csv("check.csv", index=False, encoding='utf-8-sig')
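The utf-8-sig codec is not pandas-specific; plain Python writes the same three BOM bytes at the start of the file. A standalone sketch (using a throwaway file, not the rs DataFrame from the question):

```python
import csv

# Write a CSV containing the mu character, with a UTF-8 BOM ("utf-8-sig").
with open("mu_check.csv", "w", newline="", encoding="utf-8-sig") as f:
    writer = csv.writer(f)
    writer.writerow(["SampleID", "Result"])
    writer.writerow(["S1", "5 μg/L"])

# The first three bytes are the BOM Excel looks for to detect UTF-8.
with open("mu_check.csv", "rb") as f:
    raw = f.read()
print(raw[:3])  # b'\xef\xbb\xbf'
```

With the BOM in place, Excel decodes the μ characters correctly instead of showing �.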
I want to use Textract (via the AWS CLI) to extract tables from a PDF file (located in an S3 location) and export them into a CSV file. I have tried writing a .py script but am struggling to read from the file.
Any suggestions for writing the .py script are welcome.
This is my current script. I run into the error:
  File "extract-table.py", line 63, in get_table_csv_results
    blocks=response['Blocks']
KeyError: 'Blocks'
import webbrowser, os
import json
import boto3
import io
from io import BytesIO
import sys
from pprint import pprint

def get_rows_columns_map(table_result, blocks_map):
    rows = {}
    for relationship in table_result['Relationships']:
        if relationship['Type'] == 'CHILD':
            for child_id in relationship['Ids']:
                cell = blocks_map[child_id]
                if cell['BlockType'] == 'CELL':
                    row_index = cell['RowIndex']
                    col_index = cell['ColumnIndex']
                    if row_index not in rows:
                        # create new row
                        rows[row_index] = {}
                    # get the text value
                    rows[row_index][col_index] = get_text(cell, blocks_map)
    return rows

def get_text(result, blocks_map):
    text = ''
    if 'Relationships' in result:
        for relationship in result['Relationships']:
            if relationship['Type'] == 'CHILD':
                for child_id in relationship['Ids']:
                    word = blocks_map[child_id]
                    if word['BlockType'] == 'WORD':
                        text += word['Text'] + ' '
                    if word['BlockType'] == 'SELECTION_ELEMENT':
                        if word['SelectionStatus'] == 'SELECTED':
                            text += 'X '
    return text

def get_table_csv_results(file_name):
    with open(file_name, 'rb') as file:
        img_test = file.read()
        bytes_test = bytearray(img_test)
        print('Image loaded', file_name)

    # process using image bytes
    # get the results
    client = boto3.client('textract')

    # Response
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3BucketName,
                'Name': documentName
            }
        })

    # Get the text blocks
    blocks = response['Blocks']
    pprint(blocks)

    blocks_map = {}
    table_blocks = []
    for block in blocks:
        blocks_map[block['Id']] = block
        if block['BlockType'] == "TABLE":
            table_blocks.append(block)

    if len(table_blocks) <= 0:
        return "<b> NO Table FOUND </b>"

    csv = ''
    for index, table in enumerate(table_blocks):
        csv += generate_table_csv(table, blocks_map, index + 1)
        csv += '\n\n'
    return csv

def generate_table_csv(table_result, blocks_map, table_index):
    rows = get_rows_columns_map(table_result, blocks_map)
    table_id = 'Table_' + str(table_index)
    # get cells.
    csv = 'Table: {0}\n\n'.format(table_id)
    for row_index, cols in rows.items():
        for col_index, text in cols.items():
            csv += '{}'.format(text) + ","
        csv += '\n'
    csv += '\n\n\n'
    return csv

def main(file_name):
    table_csv = get_table_csv_results(file_name)
    output_file = 'output.csv'
    # replace content
    with open(output_file, "wt") as fout:
        fout.write(table_csv)
    # show the results
    print('CSV OUTPUT FILE: ', output_file)

# Document
s3BucketName = "chrisyou.sagemi.com"
documentName = "DETAIL.pdf"

if __name__ == "__main__":
    file_name = sys.argv[1]
    main(file_name)
There is a much simpler way using the amazon-textract-textractor library: pip install amazon-textract-textractor
This will create one CSV per table in your PDF document, e.g. output_p0_t0.csv:
from textractor import Textractor

def extract_tables(s3_file_path, output_directory, s3_output_path):
    extractor = Textractor(profile_name="default")
    document = extractor.start_document_analysis(s3_file_path, textractor.data.constants.TextractFeatures.TABLES, s3_output_path)
    for j, page in enumerate(document.pages):
        for i, table in enumerate(document.tables):
            with open(output_directory + f'/output_p{j}_t{i}.csv', 'w') as csv_file:
                csv_file.write(table.to_csv())
    return document

document = extract_tables('s3://<INPUT_FILE.PDF>', './<LOCAL_DIRECTORY_FOR_CSV>', 's3://<TEXTRACT_OUTPUT_DIRECTORY>')
I had to make a slight change to @Thomas's answer by creating the extractor with extractor = Textractor(profile_name="default") right after importing Textractor, as shown below, to avoid this error: NameError: name 'textractor' is not defined.
from textractor import Textractor

extractor = Textractor(profile_name="default")

def extract_tables(s3_file_path, output_directory, s3_output_path):
    document = extractor.start_document_analysis(s3_file_path, textractor.data.constants.TextractFeatures.TABLES, s3_output_path)
    for j, page in enumerate(document.pages):
        for i, table in enumerate(document.tables):
            with open(output_directory + f'/output_p{j}_t{i}.csv', 'w') as csv_file:
                csv_file.write(table.to_csv())
    return document

document = extract_tables('s3://<INPUT_FILE.PDF>', './<LOCAL_DIRECTORY_FOR_CSV>', 's3://<TEXTRACT_OUTPUT_DIRECTORY>')
Hope it helps someone out there.
I have the following txt files (tens of thousands of them) in multiple directories, e.g.:
BaseDirectory\04_April\2019-04-14\UniqeDirectoryName1 (username)\345308457384745637.txt
BaseDirectory\04_April\2019-04-14\UniqeDirectoryName2 (username)\657453456456546543.txt
BaseDirectory\04_April\2019-04-14\UniqeDirectoryName3 (username)\234545743564356774.txt
BaseDirectory\05_May\2019-05-14\UniqeDirectoryName1 (username)\266434564564563565.txt
BaseDirectory\05_May\2019-05-14\UniqeDirectoryName2 (username)\934573845739632048.txt
BaseDirectory\05_May\2019-05-14\UniqeDirectoryName3 (username)\634534534535654501.txt
So, in other words, each date folder contains multiple directories that in turn contain text files.
import os
import re
import csv

for path, subdirs, files in os.walk("E:\\BaseDir\\"):
    for name in files:
        file_fullinfo = os.path.join(path, name)
        path, filename = os.path.split(file_fullinfo)
        NoExtension = os.path.splitext(file_fullinfo)[0]
        file_noext = str(NoExtension)
        file_splitinfo = re.split('\\\\', file_noext, 0)
        file_month = file_splitinfo[2]
        file_date = file_splitinfo[3]
        file_folder = re.sub(r'\([^)]*\)', '', file_splitinfo[4])
        file_name = file_splitinfo[5]
        file_category = file_folder
My script generates the following:
['E:', 'BaseDirectory', '04_April', '2019-04-09', 'UniqeDirectoryName', '345308457384745637.txt', 'UniqeDirectoryName']
So far so good, writing this to a generic CSV file is also straight forward, but I want to create a new CSV file based on the changing date like this.
E:\BaseDir\2019-04-09.csv
file_folder, file_name, file_category
'UniqeDirectoryName', '543968732948754398','UniqeDirectoryName'
'UniqeDirectoryName', '345308457384745637','UniqeDirectoryName'
'UniqeDirectoryName', '324089734983987439','UniqeDirectoryName'
E:\BaseDir\2019-05-14.csv
file_folder, file_name, file_category
'UniqeDirectoryName', '543968732948754398','UniqeDirectoryName'
'UniqeDirectoryName', '345308457384745637','UniqeDirectoryName'
'UniqeDirectoryName', '324089734983987439','UniqeDirectoryName'
How can I accomplish this? I can't quite wrap my head around it. The struggle of being a Python noob is real.. :)
If you can live without the first line being a header row, it can be achieved quite simply:
output_file_path = 'D:/output_files/' + file_date + '.csv'

with open(file=output_file_path, mode='a') as csv_file:  # open a csv file to write to in append mode
    csv_file.write("my data\n")
If you absolutely must have the header, you can test whether the file exists first and write the header row only if it doesn't:
import os.path

output_file_path = 'D:/output_files/' + file_date + '.csv'

if not os.path.exists(output_file_path):  # write the header row only if the file doesn't exist yet
    with open(file=output_file_path, mode='a') as csv_file:
        csv_file.write("my header row\n")

with open(file=output_file_path, mode='a') as csv_file:  # open the csv file to write to in append mode
    csv_file.write("my data\n")
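Putting the two pieces together, here is a runnable sketch of the whole idea, using csv.writer and a few made-up rows in place of the os.walk results (the dates and values are hypothetical, and the CSVs are written to the current directory rather than D:/output_files/):

```python
import csv
import os

# Hypothetical parsed results: (file_date, file_folder, file_name, file_category)
parsed = [
    ("2019-04-09", "UniqeDirectoryName", "543968732948754398", "UniqeDirectoryName"),
    ("2019-04-09", "UniqeDirectoryName", "345308457384745637", "UniqeDirectoryName"),
    ("2019-05-14", "UniqeDirectoryName", "324089734983987439", "UniqeDirectoryName"),
]

# Start fresh for the demo so repeated runs don't append duplicates.
for p in ("2019-04-09.csv", "2019-05-14.csv"):
    if os.path.exists(p):
        os.remove(p)

for file_date, file_folder, file_name, file_category in parsed:
    output_file_path = file_date + ".csv"  # one CSV per date
    needs_header = not os.path.exists(output_file_path)
    with open(output_file_path, "a", newline="") as csv_file:
        writer = csv.writer(csv_file)
        if needs_header:  # header only when the file is first created
            writer.writerow(["file_folder", "file_name", "file_category"])
        writer.writerow([file_folder, file_name, file_category])
```

Because the file is opened in append mode keyed on file_date, rows for the same date accumulate in the same CSV regardless of the order they are walked.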
I'm writing a Python script to generate a QR code from the first column in a csv (concatenated with a local name), and that part works well. The csv just has three columns and looks like this:
ID First Last
144 Jerry Seinfeld
491 George Costanza
104 Elaine Benes
99 Cosmo Kramer
And I use my Python script to take that file, append a prefix to the IDs (in this case, 'NBC') and then create QR codes for each record in a new folder. It's a little long but all of this seems to work fine also:
import csv
import qrcode
import os
import shutil
import time
import inquirer

# Identify timestamp
timestr = time.strftime("%Y%m%d-%H%M%S")

local = 'NBC'

# Load csv
filename = "stackoverflowtest.csv"

# Path to new local folder
localfolder = local
localimagefolder = localfolder + '/image'
localfilefolder = localfolder + '/files'

# Check/create folders based on local
if not os.path.exists(localfolder):
    os.makedirs(localfolder)
if not os.path.exists(localimagefolder):
    os.makedirs(localimagefolder)
if not os.path.exists(localfilefolder):
    os.makedirs(localfilefolder)

# Copy uploaded file to their local's file folder
shutil.copy2(filename, localfilefolder + '/' + local + '-' + timestr + '.csv')  # complete target filename given

# Read csv and generate QR code for local+first column of csv
with open(filename, 'rU') as csvfile:
    next(csvfile, None)  # skip header row
    reader = csv.reader(csvfile, delimiter=',', dialect=csv.excel_tab)
    for i, row in enumerate(reader):
        labeldata = row[0]  # choose first column of data to create QR codes
        print labeldata

        qr = qrcode.QRCode(
            version=1,
            error_correction=qrcode.constants.ERROR_CORRECT_L,
            box_size=10,
            border=4,
        )

        qr.add_data(local + "-" + labeldata)
        qr.make()

        img = qr.make_image()
        img.save(localimagefolder + "/" + local + "-" + labeldata + ".png".format(i))  # save image
It creates the NBC folder, copies each csv file into one subfolder, and creates the QR codes for each ID (NBC-144, NBC-491, NBC-104, NBC-99) in another.
The part where I'm running into a problem is opening the csv and writing the filepath/filename back to it (or to a copy of the csv, since from what I've read I likely can't do it to the same one). Is that possible?
The closest I've come with a working script is appending the local name to the ID and writing that back to a column, but I can't seem to figure out how to do the same with a variable, let alone a filepath/filename:
import csv
import os
import sys
filename = 'stackoverflowtest.csv'
newfilename = 'stackoverflowtest2.csv'
local = 'NBC'
with open(filename, 'rU') as f:
    reader = csv.reader(f)
    with open(newfilename, 'w') as g:
        writer = csv.writer(g)
        for row in reader:
            new_row = row[0:] + ['-'.join([local, row[0]])]
            writer.writerow(new_row)
Is it possible to write something like that within my existing script to add a column for the filepath and filename? Everything I try breaks -- especially if I attempt to do it in the same script.
EDIT:
This is my closest attempt that overwrote the existing file
f = open(newfilename, 'r+')
w = csv.writer(f)
for path, dirs, files in os.walk(path):
    for filename in files:
        w.writerow([newfilename])
Also it's still in a separate script.
Since I can't run the code in your question directly, I had to comment out portions of it below for testing, but I think this does everything you wanted in one loop in one script.
import csv
#import qrcode
import os
import shutil
import time
#import inquirer

# Identify Timestamp
timestr = time.strftime("%Y%m%d-%H%M%S")

local = 'NBC'

# Load csv
filename = "stackoverflowtest.csv"

# Path to new local folder
localfolder = local
localimagefolder = os.path.join(localfolder, 'image')
localfilefolder = os.path.join(localfolder, 'files')

# Check/create folders based on local
if not os.path.exists(localfolder):
    os.makedirs(localfolder)
if not os.path.exists(localimagefolder):
    os.makedirs(localimagefolder)
if not os.path.exists(localfilefolder):
    os.makedirs(localfilefolder)

# Copy uploaded file to their local's file folder
target = os.path.join(localfilefolder, local + '-' + timestr + '.csv')  # Target filename
#shutil.copy2(filename, target)  # Don't need to do this.

# Read csv and generate QR code for local+first column of csv
with open(filename, 'rb') as csvfile, open(target, 'wb') as outfile:
    reader = csv.reader(csvfile, delimiter=',', dialect=csv.excel_tab)
    writer = csv.writer(outfile, delimiter=',', dialect=csv.excel_tab)
    next(reader)  # Skip header row.
    for row in reader:
        id, first, last = row
#        qr = qrcode.QRCode(
#            version=1,
#            error_correction=qrcode.constants.ERROR_CORRECT_L,
#            box_size=10,
#            border=4,
#        )
#
#        qr.add_data(local+"-"+id)
#        qr.make()
#
#        img = qr.make_image()
        imagepath = os.path.join(localimagefolder, local + "-" + id + ".png")
#        img.save(imagepath)  # Save image.
        print "saving img:", imagepath

        writer.writerow(row + [local + '-' + id, imagepath])
Output from sample input data:
144,Jerry,Seinfeld,NBC-144,NBC/image/NBC-144.png
491,George,Costanza,NBC-491,NBC/image/NBC-491.png
104,Elaine,Benes,NBC-104,NBC/image/NBC-104.png
99,Cosmo,Kramer,NBC-99,NBC/image/NBC-99.png
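Note that the answer above uses Python 2 conventions ('rb'/'wb' file modes for csv and the print statement). For anyone on Python 3, a sketch of the same read-append-write loop (qrcode generation omitted; the sample input file is created inline so the snippet is self-contained):

```python
import csv
import os

local = "NBC"

# Create a small sample input like the question's (hypothetical data).
with open("stackoverflowtest.csv", "w", newline="") as f:
    csv.writer(f).writerows(
        [["ID", "First", "Last"],
         ["144", "Jerry", "Seinfeld"],
         ["491", "George", "Costanza"]]
    )

# Python 3: open text files with newline="" for the csv module.
with open("stackoverflowtest.csv", newline="") as csvfile, \
     open("out.csv", "w", newline="") as outfile:
    reader = csv.reader(csvfile)
    writer = csv.writer(outfile)
    next(reader)  # skip header row
    for row in reader:
        id_, first, last = row
        imagepath = os.path.join(local, "image", local + "-" + id_ + ".png")
        # Append the prefixed ID and the image path as extra columns.
        writer.writerow(row + [local + "-" + id_, imagepath])
```

The structure is identical; only the file modes and print syntax change.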
So I have the following code and I'm trying to export a csv and immediately open it in Python.
# define weekly pull code
def GT_Weekly_Run(keys):
    # connect to Google
    connector = pyGTrends(google_username, google_password)
    # make request
    connector.request_report(keys, geo="US")
    # wait a random amount of time between requests to avoid bot detection
    time.sleep(randint(5, 10))
    # download file
    connector.save_csv(path, '_' + "GT_Weekly" + '_' + keys)

    name = path, '_' + "GT_Weekly" + '_' + keys
    with open(name + '.csv', 'rt') as csvfile:
        csvReader = csv.reader(csvfile)
        data = []
        data = [row for row in csvReader if row and row[0].startswith("20")]
        week_df = pd.DataFrame(data)
        cols = ["Date", "Trend"]
        week_df.columns = [cols]
The problem is that I'm not able to match the saved filename with the filename used to open the file. I have tried a number of things but keep getting errors like:
IOError: [Errno 2] No such file or directory: 'GT_Weekly_football.csv'
TypeError: can only concatenate tuple (not "str") to tuple
Is there anything that looks off? I just need to save the file as X and use that same name (X) to import it back in.
Thanks!
I would recommend you create a variable to hold the filename. That way, the same name will be used both for creation and loading back.
import os

# define weekly pull code
def GT_Weekly_Run(keys):
    # connect to Google
    connector = pyGTrends(google_username, google_password)
    # make request
    connector.request_report(keys, geo="US")
    # wait a random amount of time between requests to avoid bot detection
    time.sleep(randint(5, 10))
    # download file
    filename = "_GT_Weekly_" + keys
    connector.save_csv(path, filename)

    with open(os.path.join(path, filename), 'rt') as csvfile:
        csvReader = csv.reader(csvfile)
        data = []
        data = [row for row in csvReader if row and row[0].startswith("20")]
        week_df = pd.DataFrame(data)
        cols = ["Date", "Trend"]
        week_df.columns = [cols]
It is safer to make use of Python's os.path.join function to create your full file names.
Also take a look at the keys parameter you are passing to GT_Weekly_Run; it should just be a simple string.
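As a side note on the TypeError: the comma in name = path, '_' + ... builds a (path, filename) tuple rather than a string, which is why appending '.csv' then fails. A quick illustration with a made-up path value:

```python
import os

path = "downloads"  # hypothetical download directory
keys = "football"

# The comma makes this a tuple, not a string:
name = path, '_' + "GT_Weekly" + '_' + keys
print(type(name))  # <class 'tuple'>
# name + '.csv' would raise: TypeError: can only concatenate tuple (not "str") to tuple

# What was intended: build one filename string, then join it to the directory.
filename = "_GT_Weekly_" + keys + ".csv"
full_path = os.path.join(path, filename)
print(full_path)
```

This is why keeping the filename in a single variable and joining it with os.path.join, as in the answer above, avoids both errors at once.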