Loading a parquet file from a GitHub repository - python

I tried to read a parquet (.parq) file I have stored in a GitHub project, using the following script:
import pandas as pd
import numpy as np
import ipywidgets as widgets
import datetime
from ipywidgets import interactive
from IPython.display import display, Javascript
import warnings
warnings.filterwarnings('ignore')
parquet_file = r'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
and it gave me this error:
ArrowInvalid: Could not open Parquet input source '': Parquet
magic bytes not found in footer. Either the file is corrupted or this
is not a parquet file.
Does anyone know what this error message means and how I can load the file in my GitHub repository? Thank you in advance.

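You should use the URL under the raw.githubusercontent.com domain. The github.com/.../blob/... address returns the HTML page that displays the file, not the file itself, which is why the Parquet magic bytes are not found.
For your example: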
parquet_file = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')

You can read parquet files directly from a web URL like this. However, when reading a data file from a Git repository, you need to make sure it is the raw file URL:
url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq?raw=true'
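As a rough sketch of the two answers above: the first helper below rewrites a github.com "blob" URL into its raw.githubusercontent.com form, and the second checks for the PAR1 marker that Parquet files carry at both ends, which is what the "magic bytes" in the error refers to. The names to_raw_github_url and looks_like_parquet are made up for this illustration, and the byte check is only a quick heuristic, not a full validation.

```python
from urllib.parse import urlsplit


def to_raw_github_url(url: str) -> str:
    """Rewrite https://github.com/<owner>/<repo>/blob/<ref>/<path> to its raw form.

    Hypothetical helper: any URL that does not match this shape is returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "github.com" or len(segments) < 5 or segments[2] != "blob":
        return url
    # Drop the "blob" segment and switch to the raw content host.
    owner, repo, _blob, *rest = segments
    return "https://raw.githubusercontent.com/" + "/".join([owner, repo, *rest])


def looks_like_parquet(data: bytes) -> bool:
    """Heuristic: a Parquet file starts and ends with the 4-byte marker PAR1."""
    return len(data) >= 12 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"


blob_url = "https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq"
print(to_raw_github_url(blob_url))
```

Passing the converted URL to pd.read_parquet should then fetch the file itself rather than GitHub's HTML page.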

Related

How to parse CSV into pandas dataframe

I am having a couple of issues setting up a way to automate the download of a CSV. The first is that when downloading with a simple pandas read_csv(url) call I get an SSL error, so I switched to using requests and trying to parse the response. The second is that I am getting errors when parsing the response. I'm not sure whether the reason is that the URL is actually returning a zip file and, if so, how I can get around that.
Here is the URL: https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/
and here is the code:
import pandas as pd
import numpy as np
import os
import io
import requests
import urllib3
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:#SECLEVEL=1'
url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"
res = requests.get(url).content
data = pd.read_csv(io.StringIO(res.decode('utf-8')))
If the content is in zip format, you should unzip it and use its contents (csv, txt...).
I wasn't able to download the file myself because the host was too slow.
Here is the answer I found, although I don't really need to save these files locally, so if anyone knows how to parse zip files without saving them to disk, that would be great. I'm also not sure why I get that SSL error with pandas but not with requests.
import requests
import zipfile
from io import BytesIO
url = "https://www.californiadgstats.ca.gov/download/interconnection_rule21_applications/"
pathSave = "C:/Users/wherever"
filename = url.split('/')[-1]
r = requests.get(url)
archive = zipfile.ZipFile(BytesIO(r.content))  # a distinct name avoids shadowing the zipfile module
archive.extractall(pathSave)
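On the side question of parsing the archive without saving it locally: one possible approach is to keep the zip in memory and open a member as a file-like object. A minimal sketch, assuming the archive contains at least one .csv member; open_first_csv is a hypothetical helper name, and the small archive built at the bottom is stand-in data, not the real download.

```python
import io
import zipfile


def open_first_csv(zip_bytes: bytes, encoding: str = "utf-8") -> io.TextIOWrapper:
    """Return a text stream for the first .csv member of an in-memory zip archive."""
    archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    for name in archive.namelist():
        if name.lower().endswith(".csv"):
            # archive.open() yields a binary stream; wrap it to decode text.
            return io.TextIOWrapper(archive.open(name), encoding=encoding)
    raise ValueError("no .csv member found in the archive")


# Stand-in for r.content: a tiny archive built entirely in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("notes.txt", "not a csv")
    demo.writestr("data.csv", "a,b\n1,2\n")

stream = open_first_csv(buffer.getvalue())
print(stream.read())
```

With the real response this would be open_first_csv(r.content), and the returned stream can be handed to pd.read_csv. When the archive holds exactly one file, pd.read_csv(BytesIO(r.content), compression='zip') should also work.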

Reading tdms files with python and notebook on Azure

I am trying to read tdms files from one Azure data lake to another and convert them to parquet at the same time. I managed to install the package nptdms in Azure Data Factory and ran the code line below
from nptdms import TdmsFile
But I don't know what value to pass as path_to_file in the next line:
tdms_file = TdmsFile.read("path_to_file.tdms")
Every file in the Azure data lake has a URL as its file path, in this format:
https://xxxyyy.blob.core.windows.net/name_of_file.tdms
Passing that URL did not work. I believe the nptdms package was written only for on-premises use and does not work with cloud paths.
I wonder whether anyone has experience reading TDMS files on the Azure platform and can share it.
Since the files may be large, download each one and store it in a temporary file so that you can pass its path to TdmsFile.open or TdmsFile.read.
Here, tmp_file.name is that path.
from shutil import copyfileobj
from urllib.request import urlopen
from tempfile import NamedTemporaryFile
from nptdms import TdmsFile
with urlopen('http://www-personal.acfr.usyd.edu.au/zubizarreta/f/exampleMeasurements.tdms') as response:
    with NamedTemporaryFile(delete=False) as tmp_file:
        copyfileobj(response, tmp_file)
tdms_file = TdmsFile.open(tmp_file.name)
for group in tdms_file.groups():
    group_name = group.name
    print(f'Group name: {group_name}')
    for channel in group.channels():
        channel_name = channel.name
        print(f'Channel name: {channel_name}')

How to use parse_dates for an imported CSV file in Google Colab? The CSV file is imported from the local drive

In Google Colab, I imported a CSV file from my local drive using the code below:
from google.colab import files
uploaded = files.upload()
Then, to read the CSV file and parse the dates, I have the code below:
import pandas as pd
import io
df = pd.read_csv(io.StringIO(uploaded['Auto.csv'], parse_dates = ['Date'],date_parser=parse))
print(df)
It shows the error message below:
TypeError: StringIO() takes at most 2 arguments (3 given)
But when importing a file from GitHub it works fine, for example:
df = pd.read_csv('https://raw.githubusercontent.com/master/dataset/electricity_consumption.csv', parse_dates = ['Bill_Date'],date_parser=parse) #this code works good from github
So how can I use parse_dates for a CSV file imported from the local drive? Any help would be appreciated.
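One likely reading of the TypeError: in the failing call, the closing parenthesis of io.StringIO(...) comes after date_parser=parse, so parse_dates and date_parser are handed to StringIO instead of read_csv. In addition, files.upload() returns a dict mapping file names to raw bytes, so io.BytesIO is the matching wrapper. A minimal sketch under those assumptions; the uploaded dict below is stand-in data (the real one comes from files.upload()), read_uploaded_csv is a hypothetical helper, and date_parser=parse is left out because parse is not defined in the question.

```python
import io

import pandas as pd


def read_uploaded_csv(uploaded: dict, name: str, date_columns: list) -> pd.DataFrame:
    """Read one uploaded file (name -> bytes) into a DataFrame, parsing date columns."""
    # The upload dict holds raw bytes, so wrap them in BytesIO rather than StringIO.
    buffer = io.BytesIO(uploaded[name])
    # parse_dates is an argument of read_csv, so it goes outside the BytesIO(...) call.
    return pd.read_csv(buffer, parse_dates=date_columns)


# Stand-in for the dict returned by files.upload() (hypothetical data).
uploaded = {"Auto.csv": b"Date,Price\n2020-01-01,10\n2020-01-02,12\n"}
df = read_uploaded_csv(uploaded, "Auto.csv", ["Date"])
print(df.dtypes)
```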

Create a pandas dataframe from a qrc resource file

I would like to save a CSV file into a qrc file and then read it, putting its contents in a pandas dataframe, but I have some problems.
I created a qrc file called res.qrc:
<!DOCTYPE RCC><RCC version="1.0">
<qresource>
<file>dataset.csv</file>
</qresource>
</RCC>
I compiled it obtaining the res_rc.py file.
To read it I created a python script called resource.py:
import pandas as pd
import res_rc
from PySide.QtCore import *
file = QFile(":/dataset.csv")
df = pd.read_csv(file.fileName())
print(df)
But I obtain the error: IOError: File :/dataset.csv does not exist
All the files (resource.py, res.qrc, res_rc.py, dataset.csv) are in the same folder.
If I do res_rc.qt_resource_data I can see the contents.
How can I create the pandas dataframe?
A qresource path is a virtual path that only Qt knows how to resolve, and its internals can change without warning, so pandas cannot open it by name. In these cases, read all the data through Qt and wrap it in a stream with io.BytesIO:
import io
import pandas as pd
from PySide import QtCore
import res_rc
file = QtCore.QFile(":/dataset.csv")
if file.open(QtCore.QIODevice.ReadOnly):
    f = io.BytesIO(file.readAll().data())
    df = pd.read_csv(f)
    print(df)

How to download an Excel file from behind a paywall into a pandas dataframe?

I have this website that requires log in to access data.
import pandas as pd
import requests
r = requests.get(my_url, cookies=my_cookies) # my_cookies are imported from a selenium session.
df = pd.io.excel.read_excel(r.content, sheetname=0)
Response:
IOError: [Errno 2] No such file or directory: 'Ticker\tAction\tName\tShares\tPrice\...
Apparently, the str is processed as a filename. Is there a way to process it as a file? Alternatively can we pass cookies to pd.get_html?
EDIT: After further processing we can now see that this is actually a csv file. The content of the downloaded file is:
In [201]: r.content
Out [201]: 'Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\nBRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\nCOHU\tSELL\tCohu Inc\t700.0\t12.79\t-1.00\t8,953\t0.00\nUNTD\tBUY\tUnited Online Inc\t560.0\t15.15\t-1.00\t-8,484\t0.00\nFLXS\tBUY\tFlexsteel Industries Inc\t210.0\t40.31\t-1.00\t-8,465\t0.00\nUPRO\tCOVER\tProShares UltraPro S&P500\t17.0\t71.02\t-0.00\t-1,207\t0.00\n'
Notice that it is tab delimited. Still, trying:
# csv version 1
df = pd.read_csv(r.content)
# Returns error, file does not exist. Apparently read_csv() is also trying to read it as a file.
# csv version 2
fh = io.BytesIO(r.content)
df = pd.read_csv(fh) # ValueError: No columns to parse from file.
# csv version 3
s = StringIO(r.content)
df = pd.read_csv(s)
# No error, but the resulting df is not parsed properly; \t's show up in the text of the dataframe.
Simply wrap the file contents in a BytesIO:
with io.BytesIO(r.content) as fh:
    df = pd.io.excel.read_excel(fh, sheetname=0)
This functionality was included in an update from 2014. According to the documentation it is as simple as providing the url:
The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. For instance, a local file could be file://localhost/path/to/workbook.xlsx
Based on the code you've provided, it looks like you are using pandas 0.13.x. If you can upgrade to a newer version (the code below was tested with 0.16.x), you can get this to work without the requests library; URL support was added in 0.14.1.
data2 = pd.read_excel(data_url)
As an example of a full script (with the example XLS document taken from the original bug report stating that read_excel didn't accept a URL):
import pandas as pd
data_url = "http://www.eia.gov/dnav/pet/xls/PET_PRI_ALLMG_A_EPM0_PTC_DPGAL_M.xls"
data = pd.read_excel(data_url, "Data 1", skiprows=2)
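For the tab-delimited content shown in the question's edit, telling read_csv about the delimiter is probably what was missing in versions 2 and 3. A small sketch using the first rows of the quoted r.content; the thousands=',' argument is an assumption about how values such as 7,140 should be read.

```python
import io

import pandas as pd

# First rows of the response body quoted in the question (tab-separated).
content = (
    b"Ticker\tAction\tName\tShares\tPrice\tCommission\tAmount\tTarget Weight\n"
    b"BRSS\tSELL\tGlobal Brass and Copper Holdings Inc\t400.0\t17.85\t-1.00\t7,140\t0.00\n"
    b"COHU\tSELL\tCohu Inc\t700.0\t12.79\t-1.00\t8,953\t0.00\n"
)

# sep="\t" splits on tabs; thousands="," lets 7,140 parse as the number 7140.
df = pd.read_csv(io.BytesIO(content), sep="\t", thousands=",")
print(df)
```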
