Using pyspark to read json file directly from a website - python

Is it possible to use sqlContext to read a JSON file directly from a website?
For instance, I can read a file like this:
myRDD = sqlContext.read.json("sample.json")
but I get an error when I try something like this:
myRDD = sqlContext.read.json("http://192.168.0.13:9200/sample.json")
I'm using Spark 1.4.1.
Thanks in advance!

It is not possible. Paths you use should point either to the local file system or to another file system supported by Hadoop. As long as sample.json has the expected format (a single JSON object per line), you can try something like this:
import json
import requests

r = requests.get("http://192.168.0.13:9200/sample.json")
# skip any empty lines that iter_lines() may yield, then parse each JSON object
df = sqlContext.createDataFrame([json.loads(line) for line in r.iter_lines() if line])
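If the objects are nested and createDataFrame does not infer the schema the way you want, an alternative sketch (an assumption on my part, not tested against 1.4.1) is to parallelize the raw lines with the SparkContext sc and let Spark infer the schema itself:
lines = [line for line in r.iter_lines() if line]  # raw JSON strings, empty lines dropped
jsonRDD = sc.parallelize(lines)
df = sqlContext.jsonRDD(jsonRDD)  # on newer Spark versions, sqlContext.read.json(jsonRDD)
df.printSchema()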

Related

Read XML file from URL in Python

I'm using an open source project called OpenTripPlanner, which is a tool that I plan to use to simulate a lot of itineraries from one point to another at a given time. So far, I've managed to find the URL where an XML file containing all the information about an itinerary is located. The XML is built on request, so the URL isn't static. The URL looks something like this:
http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK
(You need to have an OpenTripPlanner server running to open it)
Now, I want to read these XML files and do some data analysis using Python 3, but I can't find a way to read the files. I've tried using urllib.request to download the file locally, but the file I get from this is oddly formed. It looks something like this:
{"requestParameters":{"date":"2017/12/04","mode":"TRANSIT,WALK","fromPlace":"48.40915, -71.04996","toPlace":"48.41428, -71.06996","time":"8:00:00"},"plan":{"date":1512392400000,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"orig":"","vertexType":"NORMAL"},"to":{"name":"Destination","lon":-71.06996,"lat":48.41428,"orig":"","vertexType":"NORMAL"},"itineraries":[{"duration":1538,"startTime":1512392809000,"endTime":1512394347000,"walkTime":934,"transitTime":602,"waitingTime":2,"walkDistance":1189.6595112715966,"walkLimitExceeded":false,"elevationLost":0.0,"elevationGained":0.0,"transfers":0,"legs":[{"startTime":1512392809000,"endTime":1512393537000,"departureDelay":0,"arrivalDelay":0,"realTime":false,"distance":926.553,"pathway":false,"mode":"WALK","route":"","agencyTimeZoneOffset":-18000000,"interlineWithPreviousLeg":false,"from":{"name":"Origin","lon":-71.04996,"lat":48.40915,"departure":1512392809000,"orig":"","vertexType":"NORMAL"},"to":{"name":"Roitelets / Martinets","stopId":"1:370","stopCode":"370","lon":-71.047688,"lat":48.401531,"arrival":1512393537000,"departure":1512393538000,"stopIndex":15,"stopSequence":16,"vertexType":"TRANSIT"},"legGeometry":{"points":"s{mfHb{spL|ExBp#sDl#V##lB|#j#FL?j#GbCk#|A]vEsA^KBA|C{#pCeACS~CuA`#Q","length":19},"rentedBike":false,"transitLeg":false,"duration":728.0,"steps":[{"distance":131.991,"relativeDirection":"DEPART","streetName":"Rue D.-V.-Morrier","absoluteDirection":"SOUTH","stayOn":false,"area":false,"bogusName":false,"lon":-71.04961760502248,"lat":48.4090671692228,"elevation":[]},{"distance":72.319,"relativeDirection":"LEFT","streetName":"Rue Lorenzo-Genest","absoluteDirection":"EAST","stayOn":false,"area":false,"bogusName":false,"lon":-71.0502299,"lat":48.4079519,"elevation":[]}
And when I try to open the file in a browser, I get an error that says
XML Parsing Error: not well-formed
Location: http://localhost:63342/XML_reader/file.xml?_ijt=e1d6h53s4mh1ak94sqortejf9v
Line Number 1, Column 1: ...
The script I'm using is very simple; it looks like this:
import urllib.request
testfile = urllib.request.URLopener()
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.xml")
How can I make the outputted XML files well-formed? Is there another way besides urllib.request that I may want to try?
Thanks a lot
To import this file as JSON data (not XML), you need the json library:
import urllib.request
import json
from pprint import pprint
testfile = urllib.request.URLopener()  # legacy opener; urllib.request.urlretrieve would also work
file_name = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
testfile.retrieve(file_name, "file.json")  # save the response locally as file.json
data = json.load(open('file.json'))        # parse the saved file into a Python object
pprint(data)
json.load reads the JSON data and converts it into a Python object (https://docs.python.org/2/library/json.html?highlight=json%20load#json.load)
pprint is for "pretty printing" the JSON data (https://docs.python.org/2/library/pprint.html)
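A simpler sketch (my own variation, not part of the original answer) skips the intermediate file and parses the HTTP response directly:
import json
import urllib.request
from pprint import pprint

url = 'http://localhost:8080/otp/routers/default/plan?fromPlace=48.40915,%20-71.04996&toPlace=48.41428,%20-71.06996&date=2017/12/04&time=8:00:00&mode=TRANSIT,WALK'
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode('utf-8'))  # decode the bytes, then parse the JSON
pprint(data['plan']['itineraries'])  # the itineraries shown in the question live under "plan"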

Opening a file that has been uploaded in Flask

I'm trying to modify a CSV that is uploaded into my Flask application. I have logic that works just fine when I don't upload it through Flask:
import pandas as pd
import StringIO
with open('example.csv') as f:
    data = f.read()
data = data.replace(',"', ",'")
data = data.replace('",', "',")
df = pd.read_csv(StringIO.StringIO(data), header=None, sep=',', quotechar="'")
print df.head(10)
I upload it to Flask and access it using:
f = request.files['data_file']
When I run it through the code above, replacing open('example.csv') with open(f), I get the following error
coercing to Unicode: need string or buffer, FileStorage found
I have figured out that the problem is the file type. I can't use open on my file because open expects a file name, and when the file is uploaded through Flask it is the file object itself (a FileStorage instance) that gets passed to open. However, I don't know how to make this work. I've tried skipping the open call and just using data = f.read(), but that doesn't work. Any suggestions?
Thanks
FileStorage is a file-like wrapper around the incoming data. You can pass it directly to read_csv.
pd.read_csv(request.files['data_file'])
You most likely should not be performing those replace calls on the data, as the CSV module should handle that and the naive replacement can corrupt data in quoted columns. However, if you still need to, you can read the data out just like you were before.
data = request.files['data_file'].read()
If your data has a mix of quoting styles, you should fix the source of your data.
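If you do still need the replacements, a sketch that applies the question's original logic to the uploaded file (assuming Python 2, as in the question) would be:
import StringIO
import pandas as pd

data = request.files['data_file'].read()  # raw contents of the uploaded CSV
data = data.replace(',"', ",'")
data = data.replace('",', "',")
df = pd.read_csv(StringIO.StringIO(data), header=None, sep=',', quotechar="'")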
Answering my own question in case someone else needs this.
FileStorage objects have a .stream attribute, which will be an io.BytesIO:
import pandas
f = request.files['data_file']
df = pandas.read_csv(f.stream)
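For context, a minimal route sketch that ties the upload and the read together (the endpoint name and the data_file form field are assumptions):
from flask import Flask, request
import pandas as pd

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload():
    f = request.files['data_file']  # werkzeug FileStorage object
    df = pd.read_csv(f.stream)      # read straight from the underlying stream
    return 'parsed %d rows' % len(df)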

Mongoexport exporting invalid json files

I collected some tweets from the Twitter API and stored them in MongoDB. I tried exporting the data to a JSON file and didn't have any issues there, until I tried to write a Python script to read the JSON and convert it to a CSV. I get this traceback error with my code:
json.decoder.JSONDecodeError: Extra data: line 367 column 1 (char 9745)
So, after digging around the internet I was pointed to check the actual JSON data in an online validator, which I did. This gave me the error of:
Multiple JSON root elements
from the site https://jsonformatter.curiousconcept.com/
Now, the problem is, I haven't found anything on the internet about how to handle that error. I'm not sure if it's an error with the data I've collected or exported, or if I just don't know how to work with it.
My end game with these tweets is to make a network graph. I was looking at either Networkx or Gephi, which is why I'd like to get a csv file.
Robert Moskal is right. If you can address the issue at the source and use the --jsonArray flag with mongoexport, that will make the problem easier. If you can't address it at the source, then read the points below.
The code below will extract the individual JSON objects from the given file and convert them to Python dictionaries.
You can then apply your CSV logic to each individual dictionary.
If you are using the csv module, I would suggest the unicodecsv module instead, as it handles the Unicode data in your JSON objects.
import json

with open('path_to_your_json_file', 'rb') as infile:
    json_block = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            json_block = []
            print json_dict
If you want to convert it to CSV using pandas, you can use the code below:
import json, pandas as pd

with open('path_to_your_json_file', 'rb') as infile:
    json_block = []
    dictlist = []
    for line in infile:
        json_block.append(line)
        if line.startswith('}'):
            json_dict = json.loads(''.join(json_block))
            dictlist.append(json_dict)
            json_block = []

df = pd.DataFrame(dictlist)
df.to_csv('out.csv', encoding='utf-8')
If you want to flatten out the JSON objects, you can use the pandas.io.json.json_normalize() method.
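For example, a short sketch of that (reusing dictlist from the snippet above; the resulting column names depend on your tweet structure):
from pandas.io.json import json_normalize

flat_df = json_normalize(dictlist)  # nested keys become dotted column names, e.g. "user.screen_name"
flat_df.to_csv('out_flat.csv', encoding='utf-8')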
Elaborating on @MYGz's suggestion to use --jsonArray:
Your post doesn't show how you exported the data from mongo. If you use the following via the terminal, you will get valid JSON from MongoDB:
mongoexport --collection=somecollection --db=somedb --jsonArray --out=validfile.json
Replace somecollection, somedb and validfile.json with your target collection, target database, and desired output filename respectively.
The following command, mongoexport --collection=somecollection --db=somedb --out=validfile.json, will NOT give you the result you are looking for, because by default "mongoexport writes data using one JSON document for every MongoDB document" (Ref).
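With --jsonArray the exported file is one JSON array, so a sketch for loading it into pandas (assuming validfile.json from the command above) is:
import json
import pandas as pd

with open('validfile.json') as infile:
    records = json.load(infile)  # the whole file parses as a single JSON array
df = pd.DataFrame(records)
df.to_csv('out.csv', encoding='utf-8')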
A bit of a late reply, and I am not sure it was available at the time this question was posted. Anyway, now there is a simple way to import the mongoexport JSON data as follows:
import pandas as pd
df = pd.read_json(filename, lines=True)
mongoexport writes each line as a JSON object itself, instead of the whole file as a single JSON document.
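As a usage sketch, getting from there to the CSV needed for Gephi/NetworkX is one more line (the output file name is an assumption):
df.to_csv('tweets.csv', index=False, encoding='utf-8')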

How to read hadoop map file using python?

I have a map file that is block-compressed using DefaultCodec. The map file is created by a Java application like this:
MapFile.Writer writer =
    new MapFile.Writer(conf, path,
        MapFile.Writer.keyClass(IntWritable.class),
        MapFile.Writer.valueClass(BytesWritable.class),
        MapFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()));
This file is stored in HDFS, and I need to read some key/value pairs from it in another application using Python. I can't find any library that can do that. Do you have any suggestions or examples?
Thanks
I would suggest using Spark, which has a function called textFile() that can read files from HDFS and turn them into RDDs for further processing using other Spark libraries.
Here's the documentation: PySpark
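Note that textFile() reads plain text. A MapFile is actually a directory holding a data SequenceFile plus an index, so if textFile() does not give you usable records, a PySpark sketch using sequenceFile() on the data part (the HDFS path is hypothetical; the Writable class names come from the question's writer) might look like:
rdd = sc.sequenceFile(
    "hdfs:///path/to/mapfile/data",
    keyClass="org.apache.hadoop.io.IntWritable",
    valueClass="org.apache.hadoop.io.BytesWritable")
print rdd.take(5)  # a few (key, value) pairs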
Create a reader as follows:
path = '/hdfs/path/to/file'
key = LongWritable()
value = LongWritable()
reader = MapFile.Reader(path)
while reader.next(key, value):
    print key, value
Check out these hadoop.io.MapFile Python examples
And available methods in MapFile.py

How should one go about collecting data from a .CSV file using Python?

I am attempting to use the Yahoo! Finance API to gather current stock quotes using Python, a language I am particularly new to. The Yahoo! Finance API appears to offer their data in the form of a .CSV file that can be downloaded.
I am wondering what the best way to use this data is. Is it inefficient to download the file and then read it? Is there a method to convert it to a JSON or XML file that I can parse using something like urllib?
The .CSV I am getting is being generated on the following page, in this case the quote for Microsoft (MSFT):
http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1
Many thanks in advance.
Python has a built-in csv module.
http://docs.python.org/library/csv.html
To answer your downloading question:
file = r"http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1"
import urllib
text = urllib.urlopen(file).read()
>>> print text
... "MSFT","Microsoft Corpora",29.51
Based on the csv module and DictReader, it is easier to parse the data after urlopen. I reuse the code above from @kreativitea:
import csv
import urllib

file = r"http://finance.yahoo.com/d/quotes.csv?s=MSFT&f=snl1"
quotefile = urllib.urlopen(file)
fieldheaders = ["abbr", "name", "index"]

reader = csv.DictReader(quotefile, fieldnames=fieldheaders)

for row in reader:
    print row
The result is:
$ quote.py
{'index': '29.51', 'abbr': 'MSFT', 'name': 'Microsoft Corpora'}
The row in the for loop is a dictionary (hash table), which is easy to work with.
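Since the question also asks about converting to JSON, a quick sketch reusing row from the loop above:
import json
print json.dumps(row)
# {"index": "29.51", "abbr": "MSFT", "name": "Microsoft Corpora"}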
