I am trying to read .tdms files from one Azure Data Lake, convert them to Parquet, and write them into another. I managed to install the nptdms package in Azure Data Factory and ran the line below:
from nptdms import TdmsFile
But I don't know what value to give path_to_file in the next line:
tdms_file = TdmsFile.read("path_to_file.tdms")
Every file in Azure Data Lake has a URL as its file path, in this format:
https://xxxyyy.blob.core.windows.net/name_of_file.tdms
Passing that URL did not work. I believe the nptdms package was written for on-premises file paths and does not understand cloud URLs.
I wonder if anyone has experience with reading .tdms files on the Azure platform and can share it.
Since the files may be large, you should download each one into a temporary file and pass that file's path to TdmsFile.open or TdmsFile.read.
tmp_file.name is that path in the example below.
from shutil import copyfileobj
from urllib.request import urlopen
from tempfile import NamedTemporaryFile

from nptdms import TdmsFile

# Stream the remote .tdms file into a temporary file on local disk
with urlopen('http://www-personal.acfr.usyd.edu.au/zubizarreta/f/exampleMeasurements.tdms') as response:
    with NamedTemporaryFile(delete=False) as tmp_file:
        copyfileobj(response, tmp_file)

# Open the temporary file by its path and walk its groups and channels
tdms_file = TdmsFile.open(tmp_file.name)

for group in tdms_file.groups():
    group_name = group.name
    print(f'Group name: {group_name}')
    for channel in group.channels():
        channel_name = channel.name
        print(f'Channel name: {channel_name}')
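If the file sits in Azure Blob Storage / Data Lake rather than behind a public URL, the same idea works with the azure-storage-blob SDK: download the blob into a temporary file and hand its path to nptdms. This is a minimal sketch, assuming SDK v12 and placeholder values for the connection string and container name:

from tempfile import NamedTemporaryFile

from azure.storage.blob import BlobClient
from nptdms import TdmsFile

# Placeholders -- substitute your own storage account details
blob_client = BlobClient.from_connection_string(
    conn_str="<your-connection-string>",
    container_name="<container-name>",
    blob_name="name_of_file.tdms",
)

# Stream the blob into a temporary file so nptdms can read it from a local path
with NamedTemporaryFile(suffix=".tdms", delete=False) as tmp_file:
    blob_client.download_blob().readinto(tmp_file)

tdms_file = TdmsFile.read(tmp_file.name)

From there the channel data can be turned into a pandas DataFrame (tdms_file.as_dataframe()) and written out as Parquet for the destination data lake.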
I'm trying to open a series of different cracked documents/texts that we've stored in Azure Blob Storage, ideally pushing them all into a pandas DataFrame. I do not want to download them to disk (I'm going to be opening them from a Docker container); I just want to hold the information in memory.
The file structure looks like: Azure Blob Storage -> MyContainer -> UUIDFolderNames (many) -> 1 "knowledge.json" file in each Folder.
What I've got working:
container = ContainerClient.from_connection_string( <my connection str>, <MyContainer> )
blob_list = container.list_blobs()
for blob in blob_list:
    blobClient = container.get_blob_client( blob )  # Not sure this is needed
Ideally, for each item in my for loop, I'd do something like opening the .json file and then adding its text to a row in my DataFrame. However, I can't actually manage to open any of the JSON files.
What I've tried:
#1
name = blob.name
json.loads( name )

#2
with open(name, 'r') as f:
    data = json.load( f )
Errors:
#1: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
#2: No such file or directory
I've tried other, sillier things like json.loads( blob ) or json.loads('knowledge.json') (no folder name in the path), but those are nonsensical attempts I made just to see if they worked; they're not exactly reasonable.
Most methods (including those in Azure's documentation) download the file first, but again, I don't want to download the file.
Edit: I realized it's somewhat obvious why the files cannot be found: json.load etc. look in my local directory (where I'm running the Python file from) rather than at the blob location. Still, I'm not sure how to load a file without downloading it.
With the help of the block below you will be able to view each JSON blob:
for blobs in container_client.list_blobs():
    blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
    content = blob_client.download_blob()
    contentastext = content.readall()
    print(contentastext)
Below is the full code to read JSON files from blobs; later you can add this data to your DataFrames (see the sketch after the code):
from azure.storage.blob import BlobServiceClient

def UploadFiles():
    CONNECTION_STRING = "ENTER_CONNECTION_STR"
    Container_name = "gatherblobs"

    service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
    container_client = service_client.get_container_client(Container_name)

    # Download each blob and print its contents as text
    for blobs in container_client.list_blobs():
        blob_client = service_client.get_blob_client(container=Container_name, blob=blobs)
        content = blob_client.download_blob()
        contentastext = content.readall()
        print(contentastext)

if __name__ == '__main__':
    UploadFiles()
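To get this into pandas without touching disk, you can parse each blob's bytes with json.loads and collect the results into a DataFrame. A minimal sketch, assuming each knowledge.json holds a single flat JSON object (that structure is an assumption):

import json

import pandas as pd
from azure.storage.blob import ContainerClient

container_client = ContainerClient.from_connection_string(
    "ENTER_CONNECTION_STR", "gatherblobs"
)

records = []
for blob in container_client.list_blobs():
    blob_client = container_client.get_blob_client(blob)
    data = blob_client.download_blob().readall()  # bytes, kept in memory only
    records.append(json.loads(data))              # one dict per knowledge.json

df = pd.DataFrame(records)  # one row per JSON file
print(df.head())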
I need to open and work on data coming in a text file with Python.
The file will be stored in Azure Blob Storage or Azure File Share.
However, my question is: can I use the same modules and functions, like os.chdir() and read_fwf(), that I was using on Windows? The code I want to run:
import pandas as pd
import os

os.chdir(file_path)
df = pd.read_fwf(filename)
I want to be able to run this code where file_path is a directory in Azure Blob Storage.
Please let me know if this is possible. If you have a better idea of where the file could be stored, please share.
Thanks,
As far as I know, os.chdir(path) only operates on local paths. If you want to copy the file from storage to the local file system first, you can refer to the following code:
from azure.storage.blob import BlobServiceClient

connect_str = "<your-connection-string>"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
container_name = "<container-name>"
file_name = "<blob-name>"

container_client = blob_service_client.get_container_client(container_name)
blob_client = container_client.get_blob_client(file_name)

download_file_path = "<local-path>"
with open(download_file_path, "wb") as download_file:
    download_file.write(blob_client.download_blob().readall())
Alternatively, pandas.read_fwf can read a blob directly from storage using its URL (with a SAS token appended):
For example:
url = "https://<your-account>.blob.core.windows.net/test/test.txt?<sas-token>"
df = pd.read_fwf(url)
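A third option, if you would rather not download to disk or expose a SAS URL, is to pull the blob into memory and pass the bytes to read_fwf as a file-like object. This is a minimal sketch, assuming the azure-storage-blob v12 SDK and placeholder names:

from io import BytesIO

import pandas as pd
from azure.storage.blob import BlobClient

blob_client = BlobClient.from_connection_string(
    conn_str="<your-connection-string>",
    container_name="<container-name>",
    blob_name="<blob-name>",
)

# Download the blob into an in-memory buffer and let read_fwf parse it
data = BytesIO(blob_client.download_blob().readall())
df = pd.read_fwf(data)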
So, my data is stored as CSV files in an OSS bucket on Alibaba Cloud.
I am currently executing a Python script, wherein:
I download the file to my local machine.
I make the changes using a Python script on my local machine.
I store the result in the AWS cloud.
I have to modify this method and schedule a cron job in Alibaba Cloud to automate the running of this script.
The Python script will be uploaded into Task Management of Alibaba Cloud.
So the new steps will be:
Read a file from the OSS bucket into pandas.
Modify it (merge it with other data, make some column changes); this will be done in pandas.
Store the modified data in AWS RDS.
I am stuck at the first step itself.
Error Log:
"No module found" for OSS2 & pandas.
What is the correct way of doing it?
This is a rough draft of my script (how I was able to execute it on my local machine):
import os, re
import oss2           # throws an error: no module found
import datetime as dt
import pandas as pd   # throws an error: no module found
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            df = pd.read_csv(bucket.get_object(order_file))  # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
import os, re
import oss2
import datetime as dt
import pandas as pd
import tarfile
import mysql.connector
from datetime import datetime
from itertools import islice
import io  # include this new library

dates = (dt.datetime.now() + dt.timedelta(days=-1)).strftime("%Y%m%d")

def download_file(access_key_id, access_key_secret, endpoint, bucket):
    # Authentication
    auth = oss2.Auth(access_key_id, access_key_secret)
    # Bucket name
    bucket = oss2.Bucket(auth, endpoint, bucket)
    # Download the file
    try:
        # List all objects in the fun folder and its subfolders.
        for obj in oss2.ObjectIterator(bucket, prefix=dates + 'order'):
            order_file = obj.key
            objectName = order_file.split('/')[1]
            bucket_object = bucket.get_object(order_file).read()  # read the file from OSS as bytes
            img_buf = io.BytesIO(bucket_object)                   # wrap the bytes in an in-memory buffer
            df = pd.read_csv(img_buf)                             # to read into pandas
            # FUNCTION to modify and upload
        print("File downloaded")
    except:
        print("Pls check!!! File not read")
    return objectName
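For the later step of storing the modified data in AWS RDS, and since the draft already imports mysql.connector, something along these lines could follow the pandas step. This is only a sketch; the endpoint, credentials, table, and column names are all placeholder assumptions:

import mysql.connector

def upload_to_rds(df):
    # Placeholder connection details for a MySQL-compatible RDS instance
    conn = mysql.connector.connect(
        host="<rds-endpoint>",
        user="<user>",
        password="<password>",
        database="<database>",
    )
    cursor = conn.cursor()
    # Assumes a table orders(order_id, amount) matching the DataFrame columns
    rows = [tuple(row) for row in df[["order_id", "amount"]].itertuples(index=False)]
    cursor.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (%s, %s)",
        rows,
    )
    conn.commit()
    cursor.close()
    conn.close()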
I have uploaded a zip file to my Azure account as a blob in an Azure container.
The zip file contains .csv, .ascii and many other file formats.
I need to read the data of a specific file, let's say the ASCII file, contained in the zip. I am using Python for this case.
How can I read a particular file's data from this zip file without downloading it locally? I would like to handle this process in memory only.
I am also trying this in the Jupyter notebook provided by Azure for its ML functionality.
I am using Python's zipfile package for this case.
Please assist me with reading the file.
Please find the following code snippet:
blob_service = BlockBlobService(account_name=ACCOUNT_NAME, account_key=ACCOUNT_KEY)
blob_list = blob_service.list_blobs(CONTAINER_NAME)

allBlobs = []
for blob in blob_list:
    allBlobs.append(blob.name)

sampleZipFile = allBlobs[0]
print(sampleZipFile)
The below code should work. This example accesses an Azure Container using an Account URL and Key combination.
from azure.storage.blob import BlobServiceClient
from io import BytesIO
from zipfile import ZipFile
key = r'my_key'
service = BlobServiceClient(account_url="my_account_url", credential=key)
container_client = service.get_container_client('container_name')
zipfilename = 'myzipfile.zip'
blob_data = container_client.download_blob(zipfilename)
blob_bytes = blob_data.content_as_bytes()
inmem = BytesIO(blob_bytes)
myzip = ZipFile(inmem)
otherfilename = 'mycontainedfile.csv'
filetoread = BytesIO(myzip.read(otherfilename))
Now all you have to do is pass filetoread into whatever method you would normally use to read a local file (e.g. pandas.read_csv()).
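For example, a quick usage sketch with pandas (mycontainedfile.csv is just the placeholder name from the snippet above):

import pandas as pd

# filetoread is the in-memory BytesIO object built from the zip above
df = pd.read_csv(filetoread)
print(df.head())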
You could use the code below to read a file inside a .zip file without extracting it in Python:
import zipfile
archive = zipfile.ZipFile('images.zip', 'r')
imgdata = archive.read('img_01.png')
For details, you can refer to the zipfile docs.
Alternatively, you can do something like this:
# -*- coding: utf-8 -*-
"""
Created on Mon Apr 1 11:14:56 2019

@author: moverm
"""
import zipfile

zfile = zipfile.ZipFile('C:\\LAB\\Pyt\\sample.zip')

for finfo in zfile.infolist():
    ifile = zfile.open(finfo)
    line_list = ifile.readlines()
    print(line_list)
The output is the list of lines read from each file inside the archive.
Hope it helps.
I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?
This will download the zip, open it, and give you a csv reader object for whichever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)
sio = StringIO.StringIO()
sio.write(remoteCSV.read())
# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.
z = ZipFile(sio, 'r')
# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
print z.namelist()
# A list with the names of all the files in the zip you just downloaded
# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
    # Opens the 2nd file in the zip
    csvr = csv.reader(f)
    for row in csvr:
        print row
For more information see ZipFile Docs and StringIO Docs
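If you are on Python 3, where urllib2 and StringIO no longer exist, a roughly equivalent sketch uses urllib.request, io.BytesIO, and pandas. The index [1] for the data file inside the zip is carried over from the answer above and may differ, and the number of metadata rows to skip is an assumption:

import io
import urllib.request
import zipfile

import pandas as pd

baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"

# Download the zip archive into memory
with urllib.request.urlopen(baseUrl) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))

print(archive.namelist())  # names of the files inside the zip

# Read the data file into a DataFrame; adjust skiprows to match the metadata rows at the top
with archive.open(archive.namelist()[1]) as f:
    df = pd.read_csv(f, skiprows=4)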
import os
import urllib
import zipfile
from StringIO import StringIO
package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
    csv = os.path.join(pwd, filename)
    with open(csv, 'w') as fp:
        fp.write(zip.read(filename))
    print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.
We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata data
Extracting metadata and data
Converting to a Data Package
The script is Python-based and uses Python 3. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You also can read our analysis about data from World Bank:
https://datahub.io/awesome/world-bank
Just a suggestion rather than a solution: you can use pd.read_csv to read any CSV file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')