How do you import a ndjson file in jupyter notebook

How do you import a ndjson file in jupyter notebook - python

I have tried the code below but it's not working
import json
with open("/Users/elton/20210228test2.ndjson") as f:
test2data = ndjson.load(f)

This works for me. import ndjson instead of import json. See more here: https://pypi.org/project/ndjson/
import ndjson
# load from file-like objects
with open('data.ndjson') as f:
data = ndjson.load(f)

Related

Loading a parquet file from a GitHub repository

I tried to read a parquet (.parq) file I have stored in a GitHub project, using the following script:
import pandas as pd
import numpy as np
import ipywidgets as widgets
import datetime
from ipywidgets import interactive
from IPython.display import display, Javascript
import warnings
warnings.filterwarnings('ignore')
parquet_file = r'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')
and it gave me this error:
ArrowInvalid: Could not open Parquet input source '': Parquet
magic bytes not found in footer. Either the file is corrupted or this
is not a parquet file.
Does anyone know what this error message means and how I can load the file in my GitHub repository? Thank you in advance.

You should use the URL under the domain raw.githubusercontent.com.
As for your example:
parquet_file = 'https://raw.githubusercontent.com/smaanan/sev.en_commodities/main/random_deals.parq'
df = pd.read_parquet(parquet_file, engine='auto')

You can read parquet files directly from a web URL like this. However, when reading a data file from a git repository you need to make sure it is the raw file url:
url = 'https://github.com/smaanan/sev.en_commodities/blob/main/random_deals.parq?raw=true'

How to import the json file in python

I am trying to import a json file to python and then export is to an excel file using the following code:
import pandas as pd
df = pd.read_json('pub_settings.json')
df.to_excel('pub_settings.xlsx')
but i am getting the following error:
can anyone please tell me what i am doing wrong?

First import json file as a dictionary using following code:-
import json
with open("") as f:
data = json.load(f)
Then you can use following link to convert it to xlsx:-
https://pypi.org/project/tablib/0.9.3/

Python: Reading fortran file from url

I would like to do the following in Python 3: Read in a FortranFile, but from an URL rather than a local file. The reason is that for my concrete example there are a lot of files and I want to avoid having to download them all first.
I have managed to
a) read in a simple .txt file from an URL
import urllib
from urllib.request import urlopen
url='http://www.deus-consortium.org/deus-library/filelist/deus_file_list_501.txt'
data=urllib.request.urlopen(url)
i=0
for line in data: # files are iterable
print(i,line)
i+=1
#alternative: data.read()
b) read in a local FortranFile (binary little endian unformated Fortran file) like this:
The file is from: http://www.deus-consortium.org/deus-library/efiler1/Babel_le/boxlen648_n2048_lcdmw7/post/fof/output_00090/fof_boxlen648_n2048_lcdmw7_masst_00000
from scipy.io import FortranFile
filename='../../Downloads/fof_boxlen648_n2048_rpcdmw7_masst_00000'
ff = FortranFile(filename, 'r')
nhalos=ff.read_ints(dtype=np.int32)[0]
print('number of halos in file',nhalos)
Is there any way to avoid downloading and reading FortranFiles directly from the URL? I tried
import urllib
from urllib.request import urlopen
url='http://www.deus-consortium.org/deus-library/efiler1/Babel_le/boxlen648_n2048_lcdmw7/cube_00090/fof_boxlen648_n2048_lcdmw7_cube_00000'
pathname = urllib.request.urlopen(url)
ff = FortranFile(pathname, 'r')
ff.read_ints()
gives "OSError: obtaining file position failed". pathname.read() doesn't work either because it's a fortran file.
Any ideas? Thanks in advance!

Maybe you can use tempfile module to download and read the data?
For example:
import urllib
import tempfile
from scipy.io import FortranFile
from urllib.request import urlopen
url='http://www.deus-consortium.org/deus-library/efiler1/Babel_le/boxlen648_n2048_lcdmw7/cube_00090/fof_boxlen648_n2048_lcdmw7_cube_00000'
with tempfile.TemporaryFile() as fp:
fp.write(urllib.request.urlopen(url).read())
fp.seek(0)
ff = FortranFile(fp, 'r')
info = ff.read_ints()
print(info)
Prints:
[12808737]

How to load a json file in jupyter notebook using pandas?

I am trying to load a json file in my jupyter notebook
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as plt
import json
%matplotlib inline
with open("pud.json") as datafile:
data = json.load(datafile)
dataframe = pd.DataFrame(data)
I am getting the following error
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Please help

If you want to load a json file use pandas.read_json.
pandas.read_json("pud.json")
This will load the json as a dataframe.
The function usage is as shown below
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
You can get more information about the parameters here
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html

Another way using json!
import pandas as pd
import json
with open('File_location.json') as f:
data = json.load(f)
df=pd.DataFrame(data)

with open('pud.json', 'r') as file:
variable_name = json.load(file)
The json file will be loaded as python dictionary.

This code you are writing here is completely okay . The problem is the .json file that you are loading is not a JSON file. Kindly check that file.

How to download a CSV file from the World Bank's dataset

I would like to automate the download of CSV files from the World Bank's dataset.
My problem is that the URL corresponding to a specific dataset does not lead directly to the desired CSV file but is instead a query to the World Bank's API. As an example, this is the URL to get the GDP per capita data: http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv.
If you paste this URL in your browser, it will automatically start the download of the corresponding file. As a consequence, the code I usually use to collect and save CSV files in Python is not working in the present situation:
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen("%s" %(baseUrl))
myData = csv.reader(remoteCSV)
How should I modify my code in order to download the file coming from the query to the API?

This will get the zip downloaded, open it and get you a csv object with whatever file you want.
import urllib2
import StringIO
from zipfile import ZipFile
import csv
baseUrl = "http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv"
remoteCSV = urllib2.urlopen(baseUrl)
sio = StringIO.StringIO()
sio.write(remoteCSV.read())
# We create a StringIO object so that we can work on the results of the request (a string) as though it is a file.
z = ZipFile(sio, 'r')
# We now create a ZipFile object pointed to by 'z' and we can do a few things here:
print z.namelist()
# A list with the names of all the files in the zip you just downloaded
# We can use z.namelist()[1] to refer to 'ny.gdp.pcap.cd_Indicator_en_csv_v2.csv'
with z.open(z.namelist()[1]) as f:
# Opens the 2nd file in the zip
csvr = csv.reader(f)
for row in csvr:
print row
For more information see ZipFile Docs and StringIO Docs

import os
import urllib
import zipfile
from StringIO import StringIO
package = StringIO(urllib.urlopen("http://api.worldbank.org/v2/en/indicator/ny.gdp.pcap.cd?downloadformat=csv").read())
zip = zipfile.ZipFile(package, 'r')
pwd = os.path.abspath(os.curdir)
for filename in zip.namelist():
csv = os.path.join(pwd, filename)
with open(csv, 'w') as fp:
fp.write(zip.read(filename))
print filename, 'downloaded successfully'
From here you can use your approach to handle CSV files.

We have a script to automate access and data extraction for World Bank World Development Indicators like: https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
The script does the following:
Downloading the metadata data
Extracting metadata and data
Converting to a Data Package
The script is python based and uses python 3.0. It has no dependencies outside of the standard library. Try it:
python scripts/get.py
python scripts/get.py https://data.worldbank.org/indicator/GC.DOD.TOTL.GD.ZS
You also can read our analysis about data from World Bank:
https://datahub.io/awesome/world-bank

Just a suggestion than a solution. You can use pd.read_csv to read any csv file directly from a URL.
import pandas as pd
data = pd.read_csv('http://url_to_the_csv_file')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How do you import a ndjson file in jupyter notebook - python

I have tried the code below but it's not working import json with open("/Users/elton/20210228test2.ndjson") as f: test2data = ndjson.load(f)

This works for me. import ndjson instead of import json. See more here: https://pypi.org/project/ndjson/ import ndjson # load from file-like objects with open('data.ndjson') as f: data = ndjson.load(f)

Related

Loading a parquet file from a GitHub repository

How to import the json file in python

Python: Reading fortran file from url

How to load a json file in jupyter notebook using pandas?

How to download a CSV file from the World Bank's dataset

Categories

Resources