Read an object from Alibaba OSS and modify it using pandas (Python)

So, my data is in the form of CSV files in an OSS bucket on Alibaba Cloud.
I am currently executing a Python script in which I:
1. Download the file to my local machine.
2. Make the changes with a Python script on my local machine.
3. Store the result in the AWS cloud.
I have to modify this approach and schedule a cron job in Alibaba Cloud to automate running the script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
1. Read a file from the OSS bucket into pandas.
2. Modify it (merge it with other data, change some columns) - this will be done in pandas.
3. Store the modified file in AWS RDS.
I am stuck at the first step itself.
Error log:
"No module found" for oss2 and pandas.
What is the correct way of doing this?
This is a rough draft of my script (showing how I was able to execute it on my local machine):
import os, re
import oss2            # throws an error. No module found.
import datetime as dt
import pandas as pd    # throws an error. No module found.
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file))  # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
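One way to read the object without saving it to disk is to take the bytes returned by bucket.get_object(...).read(), wrap them in an io.BytesIO buffer, and hand that buffer to pandas, as in the adjusted script below. (The "No module found" errors simply mean that oss2 and pandas are not installed in the environment where the scheduled job runs, so they also need to be available there.)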

import os, re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io  ## include this new library

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read()  ## read the file from OSS
            img_buf = io.BytesIO(bucket_object)  # wrap the bytes in an in-memory buffer
            df = pd.read_csv(img_buf)  # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
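For the remaining steps (modifying the DataFrame and writing it to AWS RDS), a minimal sketch could look like the following. This is only an illustration: the merge key, column names, table name and connection details are placeholders to adapt to your data, and it assumes a MySQL-flavoured RDS instance reachable from the job (pandas.to_sql expects an SQLAlchemy engine, here using the mysql-connector driver).

import pandas as pd
from sqlalchemy import create_engine  # pandas.to_sql expects an SQLAlchemy engine

def modify_and_upload(df, other_df):
    # Step 2: merge with other data and adjust columns (hypothetical key/column names)
    merged = df.merge(other_df, on="order_id", how="left")
    merged = merged.rename(columns={"old_column": "new_column"})
    # Step 3: write to AWS RDS (MySQL) - placeholder credentials, endpoint and table name
    engine = create_engine(
        "mysql+mysqlconnector://<user>:<password>@<rds-endpoint>:3306/<database>"
    )
    merged.to_sql("orders", con=engine, if_exists="append", index=False)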

Related

How to interact (read/write/delete) files in S3 bucket directly through streamlit webapp?

I am trying to load a file into an S3 bucket through the Streamlit web UI. The next step is to read or delete files present in the S3 bucket through the Streamlit app in an interactive way. I have written code to upload a file in the Streamlit app, but it fails to load the file into the S3 bucket.
We have a pretty detailed doc on this. You can use s3fs to make the connection, e.g.:
# streamlit_app.py
import streamlit as st
import s3fs
import os

# Create connection object.
# `anon=False` means not anonymous, i.e. it uses access keys to pull data.
fs = s3fs.S3FileSystem(anon=False)

# Retrieve file contents.
# Uses st.experimental_memo to only rerun when the query changes or after 10 min.
@st.experimental_memo(ttl=600)
def read_file(filename):
    with fs.open(filename) as f:
        return f.read().decode("utf-8")

content = read_file("testbucket-jrieke/myfile.csv")

# Print results.
for line in content.strip().split("\n"):
    name, pet = line.split(",")
    st.write(f"{name} has a :{pet}:")
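The question also asks about uploading a file from the Streamlit app to S3, which the snippet above does not cover. A minimal sketch for that direction, assuming the same access-key setup and a placeholder bucket/prefix, could combine st.file_uploader with s3fs:

# hypothetical upload counterpart - adjust the bucket/prefix to your setup
import streamlit as st
import s3fs

fs = s3fs.S3FileSystem(anon=False)  # uses your AWS access keys

uploaded = st.file_uploader("Choose a file to upload to S3")
if uploaded is not None:
    # "testbucket-jrieke/uploads/" is a placeholder path
    with fs.open(f"testbucket-jrieke/uploads/{uploaded.name}", "wb") as f:
        f.write(uploaded.getvalue())
    st.success(f"Uploaded {uploaded.name} to S3")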

Azure Grab Data from Blob Storage w. Python (No downloading)

I'm trying to open a series of different cracked documents / texts that we've stored in Azure Blob Storage, ideally pushing them all into a pandas DataFrame. I do not want to download them (I'm going to be opening them from a Docker container); I just want to store the information in memory.
The file structure looks like: Azure Blob Storage -> MyContainer -> UUIDFolderNames (many) -> 1 "knowledge.json" file in each Folder.
What I've got working:
container = ContainerClient.from_connection_string( <my connection str>, <MyContainer> )
blob_list = container.list_blobs()
for blob in blob_list:
    blobClient = container.get_blob_client( blob )  # Not sure this is needed
Ideally, for each item in my for loop, I'd do something like opening the .json file, then adding its text to a row in my dataframe. However, I can't actually manage to open any of the JSON files.
What I've tried:
#1
name = blob.name
json.loads( name )

#2
with open(name, 'r') as f:
    data = json.load( f )
Errors:
#1 Json Decoder Error Expecting Value: line 1 column 1 (char 0)
#2: No such file or directory
I've tried other sillier things like json.loads( blob ) or json.loads('knowledge.json') (no folder name in path), but those are nonsensical things I was just trying to see if they worked; they're not exactly reasonable.
Most methods (including on Azure's documentation) download the file first, but again, I don't want to download the file.
*Edit: I realized that it's somewhat obvious why the files cannot be found: json.load etc. will look in my local directory (where I'm running the Python file from) rather than the blob location. Still, I am not sure how to load a file without downloading it.
With the help of the block below you will be able to view the JSON blob content:
for blobs in container_client.list_blobs():
    blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
    content = blob_client.download_blob()
    contentastext = content.readall()
    print(contentastext)
Below is the full code to read JSON files from blobs; later you can add this data to your dataframes:
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, PublicAccess
import os
import logging
import sys
import azure.functions as func
from azure.storage import blob
from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient, __version__

def UploadFiles():
    CONNECTION_STRING = "ENTER_CONNECTION_STR"
    Container_name = "gatherblobs"
    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)
    for blobs in container_client.list_blobs():
        blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
        content = blob_client.download_blob()
        contentastext = content.readall()
        print(contentastext)

if __name__ == '__main__':
    UploadFiles()
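To get those JSON payloads into a DataFrame, which is what the question ultimately wants, one possible sketch (assuming each knowledge.json holds a single flat JSON object; the connection string and container name are the same placeholders as above) is:

import json
import pandas as pd
from azure.storage.blob import BlobServiceClient

def BlobsToDataFrame():
    CONNECTION_STRING = "ENTER_CONNECTION_STR"  # placeholder, as above
    Container_name = "gatherblobs"
    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)
    records = []
    for blobs in container_client.list_blobs():
        blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
        contentastext = blob_client.download_blob().readall()
        # assumes each knowledge.json is a single flat JSON object
        records.append(json.loads(contentastext))
    return pd.DataFrame(records)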

Reading tdms files with Python and notebook on Azure

I am trying to read TDMS files from one Azure Data Lake to another and convert them to Parquet at the same time. I managed to install the nptdms package in Azure Data Factory and ran the code line below:
from nptdms import TdmsFile
But I don't know what value to give for path_to_file in the second code line (or the third):
tdms_file = TdmsFile.read("path_to_file.tdms")
Every file in Azure Data Lake has a URL as its file path, in this format:
https://xxxyyy.blob.core.windows.net/name_of_file.tdms
That did not work. I believe the nptdms package was written for on-premises use and does not work with cloud paths.
I wonder if anyone has experience with reading .tdms files on the Azure platform and can share it.
Since the files may be large, you should download and store them in a temporary file so that you can pass the file path to TdmsFile.open or TdmsFile.read.
tmp_file.name is that path here.
from shutil import copyfileobj
from urllib.request import urlopen
from tempfile import NamedTemporaryFile
from nptdms import TdmsFile

with urlopen('http://www-personal.acfr.usyd.edu.au/zubizarreta/f/exampleMeasurements.tdms') as response:
    with NamedTemporaryFile(delete=False) as tmp_file:
        copyfileobj(response, tmp_file)

tdms_file = TdmsFile.open(tmp_file.name)
for group in tdms_file.groups():
    group_name = group.name
    print(f'Group name: {group_name}')
    for channel in group.channels():
        channel_name = channel.name
        print(f'Channel name: {channel_name}')
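The question also asks about converting the files to Parquet. A rough follow-on sketch, assuming the file fits in memory (so TdmsFile.read is used rather than TdmsFile.open) and that a Parquet engine such as pyarrow is installed; the output filename is just a placeholder:

import pandas as pd
from nptdms import TdmsFile

# Load the whole TDMS file into memory and flatten it into a DataFrame.
tdms_data = TdmsFile.read(tmp_file.name)
df = tdms_data.as_dataframe()

# Write it out as Parquet (requires pyarrow or fastparquet).
df.to_parquet("exampleMeasurements.parquet")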

Read .odt and .doc File from url in python

How can I extract text from '.odt' and '.doc' format files from a URL using Python? I tried searching for it but couldn't find anything.
Any lead will be helpful.
from odf import text, teletype
from odf.opendocument import load

textdoc = load(r"C:\Users\OMS\Downloads\sample1.odt")
allparas = textdoc.getElementsByType(text.P)
for i in range(len(allparas)):
    a = teletype.extractText(allparas[i])
    print(a)
This works for a local .odt file, but now I need to extract text from a file at a URL such as
"https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"
Assume the connection to AWS S3 has been made using boto3.
The following was tested with Python 3.6 and with this test odt file:
import boto3
import io
from odf import text, teletype
from odf.opendocument import load

s3_client = boto3.resource('s3')  # TODO: change aws connection logic as per your setup

# TODO: refactor name, readability
def get_contents(file_name):
    obj = s3_client.Object('s3_bucket_name', file_name)  # TODO: change aws s3 bucket name as per your setup
    body = obj.get()['Body'].read()
    return load(io.BytesIO(body))

textdoc = get_contents("test.odt")  # TODO: change odt file name as per your setup
allparas = textdoc.getElementsByType(text.P)
for i in range(len(allparas)):
    a = teletype.extractText(allparas[i])
    print(a)
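The question also mentions '.doc' files, which the odfpy-based answer does not cover. For the modern .docx format a similar in-memory approach with python-docx should work; this is an assumption on top of the original answer, and legacy binary .doc files would first need a converter (for example LibreOffice or textract):

import io
import boto3
from docx import Document  # python-docx reads .docx, not legacy .doc

s3_client = boto3.resource('s3')  # TODO: change aws connection logic as per your setup

def get_docx_text(file_name):
    obj = s3_client.Object('s3_bucket_name', file_name)  # placeholder bucket name
    body = obj.get()['Body'].read()
    doc = Document(io.BytesIO(body))
    return "\n".join(p.text for p in doc.paragraphs)

print(get_docx_text("sample1.docx"))  # hypothetical file name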

Extract particular file from zip blob stored in azure container with python using Jupyter notebook

I have uploaded a zip file to my Azure account as a blob in an Azure container.
The zip file contains .csv, .ascii files and many other formats.
I need to read a specific file, let's say the ASCII file data contained in the zip file. I am using Python for this case.
How can I read particular file data from this zip file without downloading it locally? I would like to handle this process in memory only.
I am also trying the Jupyter notebook provided by Azure for ML functionality.
I am using the ZipFile Python package for this case.
Please assist me in reading the file.
Please find the following code snippet:
blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
blob_list = blob_service.list_blobs(CONTAINER_NAME)
allBlobs = []
for blob in blob_list:
    allBlobs.append(blob.name)

sampleZipFile = allBlobs[0]
print(sampleZipFile)
The below code should work. This example accesses an Azure Container using an Account URL and Key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile

key = r'my_key'
service = BlobServiceClient(account_url="my_account_url", credential=key)
container_client = service.get_container_client('container_name')

zipfilename = 'myzipfile.zip'
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.content_as_bytes()
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)

otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (e.g. pandas.read_csv()).
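For example, to load the contained CSV into pandas:

import pandas as pd

df = pd.read_csv(filetoread)  # filetoread is the BytesIO object built above
print(df.head())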
You could use the code below to read a file inside a .zip file without extracting it in Python:
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details, you can refer to the ZipFile docs here.
Alternatively, you can do something like this:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 1 11:14:56 2019
#author: moverm
"""
import zipfile

zfile = zipfile.ZipFile('C:\\LAB\Pyt\sample.zip')
for finfo in zfile.infolist():
    ifile = zfile.open(finfo)
    line_list = ifile.readlines()
    print(line_list)
Hope it helps.
