SOQL Socrata query datetime between - python

May I know what is wrong with my code below? I would like to query all records where date_occ is between '2015-01-10T12:00:00' and '2015-12-31T24:00:00':
response = requests.get('https://data.lacity.org/api/id/7fvc-faax.json?$select=*&$where = date_occ between 2015-01-10T12:00:00 and 2015-12-131T24:00:00')
I get the following error:
Unrecognized arguments [$where ]
I realise the following doesn't work either:
response = requests.get('https://data.lacity.org/api/id/7fvc-faax.json?$select=*&vict_age >20')
data = response.json()
data = json_normalize(data)
data = pd.DataFrame(data)
But this works:
response = requests.get('https://data.lacity.org/api/id/7fvc-faax.json?$select=*&vict_sex=M')
What am I missing here?

There are a few questions and answers posed in this one. Starting with your second query, where you want to look at ages above 20: looking at the metadata (click the down arrow), the victim age is not numeric but a text string. Thus, you won't be able to use operators like greater than, less than, etc. However, you can look for "equal to". The query below will work:
https://data.lacity.org/resource/7fvc-faax.json?$where=vict_age = '20'
Note: I've dropped the $select and am just using $where for simpler display.
Your third example works because you've set it to query a text field for an exact match. If you want LA to change vict_age to a numeric field, click "Contact Dataset Owner" under the ellipsis button.
Your first question, on dates, needs a few changes. First, your single quotation marks were misplaced and some were missing. Second, the latter date is 2015-12-131T24:00:00, which has an invalid day. Finally, the data on the portal does not have a timestamp, so you only need the year-month-day. This will work:
https://data.lacity.org/resource/7fvc-faax.json?$where=date_occ between '2015-01-10' and '2015-12-13'
Finally, I would recommend using the URL structure https://data.lacity.org/resource/7fvc-faax.json? instead of /api/id/. The former is the proper URL structure for Socrata-based APIs.
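As an aside, here is a minimal sketch of how you might build this in Python, letting requests URL-encode the query for you (this also avoids the stray space after $where that produced the "Unrecognized arguments" error):
import requests

url = 'https://data.lacity.org/resource/7fvc-faax.json'
params = {'$where': "date_occ between '2015-01-10' and '2015-12-13'"}
response = requests.get(url, params=params)  # requests handles the URL encoding
data = response.json()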

Related

how to get nested data with pandas and request

I'm going crazy trying to get data through an API call using requests and pandas. It looks like nested data, but I can't get the data I need.
https://xorosoft.docs.apiary.io/#reference/sales-orders/get-sales-orders
Above is the API documentation. I'm just trying to keep it simple and get the ItemNumber and QtyRemainingToShip, but I can't even figure out how to access the nested data. I'm trying to use DataFrame to get it, but am just lost. Any help would be appreciated. I keep getting stuck at the 'Data' level.
type(json['Data'])
df = pd.DataFrame(['Data'])
df.explode('SoEstimateHeader')
Cell In [64], line 1
df.explode([0:])
^
SyntaxError: invalid syntax
I used the link to grab a sample response from the API documentation page you provided. From the code you provided it looks like you are already able to get the data, and I'm assuming that you have it as a dictionary type already.
From what I can tell, I don't think you should be using pandas, unless it's some downstream requirement of the task you are doing. But to get the ItemNumber and QtyRemainingToShip you can use the code below.
# get the interesting part of the data out of the api response
data_list = json['Data']
# data_list is only one element long, so grab the first element, which is a dictionary
data = data_list[0]
# the dictionary has two keys at the top level
so_estimate_header = data['SoEstimateHeader']
# as with data_list, the value under "SoEstimateItemLineArr" is a one-element list, so grab its first and only element
so_estimate_item_line_arr = data['SoEstimateItemLineArr'][0]
# now we can grab the pieces of information we're interested in out of the dictionary
qtyremainingtoship = so_estimate_item_line_arr["QtyRemainingToShip"]
itemnumber = so_estimate_item_line_arr["ItemNumber"]
print("QtyRemainingToShip: ", qtyremainingtoship)
print("ItemNumber: ", itemnumber)
Output
QtyRemainingToShip: 1
ItemNumber: BC
Side Note
As a side note, I wouldn't name any variable json, because that's also the name of a popular Python library for parsing JSON; it will confuse future readers and will clash with the name if you end up having to import the json library.
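If pandas does turn out to be a downstream requirement, here is a minimal sketch (assuming the same response shape as above; api_response is a hypothetical name for the parsed response dict) that flattens the line-item array with pd.json_normalize:
import pandas as pd

# api_response is the parsed API response dict (hypothetical name)
df = pd.json_normalize(api_response['Data'], record_path='SoEstimateItemLineArr')
print(df[['ItemNumber', 'QtyRemainingToShip']])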

Scraping data from an HTTP & JavaScript site

I currently want to scrape some data from an Amazon page and I'm kind of stuck.
For example, let's take this page.
https://www.amazon.com/NIKE-Hyperfre3sh-Athletic-Sneakers-Shoes/dp/B01KWIUHAM/ref=sr_1_1_sspa?ie=UTF8&qid=1546731934&sr=8-1-spons&keywords=nike+shoes&psc=1
I wanted to scrape every variant of shoe size and color. That data can be found by opening the source code and searching for 'variationValues'.
There we can see sort of a dictionary containing all the sizes and colors and, below that, in 'asinToDimensionIndexMap', every product code with numbers indicating the variant from the variationValues 'dictionary'.
For example, in asinToDimensionIndexMap we can see
"B01KWIUH5M":[0,0]
which means that the product code B01KWIUH5M is associated with the size '8M US' (position 0 in the variationValues size_name section) and the color 'Teal' (same idea as before).
I want to scrape both variationValues and asinToDimensionIndexMap, so I can associate the IndexMap numbers with the variationValues ones.
Another person on the site (thanks for the help, btw) suggested doing it this way:
script = response.xpath('//script/text()').extract_first()
import re
# capture everything between {}
data = re.findall(r'(\{.+?\})', script)
import json
d = json.loads(data[0])
d['products'][0]
I can sort of understand the first part. We get everything that's a 'script' as a string and then capture everything between {}. The issue is what happens after that. My knowledge of JSON is not that great, and reading some material about it didn't help that much.
Is there a way to get, from that data, two dictionaries or lists with variationValues and asinToDimensionIndexMap (maybe using some regular expressions in the middle to pull some data out of a big string)? Or could you explain a little what happens in the json part?
Thanks for the help!
EDIT: Added photo of variationValues and asinToDimensionIndexMap
I think you are close Manuel!
The following code will turn your scraped source into easy-to-select boxes:
import json
d = json.loads(data[0])
JSON is a universal format for storing object information. In other words, it's designed to interpret string data into object data, regardless of the platform you are working with.
https://www.w3schools.com/js/js_json_intro.asp
I'm assuming that where you may be finding things challenging is when errors come up while accessing a particular "box" inside your JSON object.
Your code format looks correct, but your access within "each box" may look different.
E.g. if your 'asinToDimensionIndexMap' object is nested within a smaller box inside the larger 'products' object, then you might access it like this (after running the code above):
d['products'][0]['asinToDimensionIndexMap']
I've hacked and slashed a little bit so you can better understand the structure of your particular JSON file. Take a look at the link below. On the right-hand side, you will see "which boxes are within one another" - which is precisely what you need to know for accessing what you need.
JSON Object Viewer
For example, the following would yield "companyCompliancePolicies_feature_div":
import json
d = json.loads(data[0])
d['updateDivLists']['full'][0]['divToUpdate']
The person helping you before outlined a general case for you, but you'll need to go in and look at the structure this way to truly find what you're looking for.
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
asinVariationValues = re.findall(r'asinVariationValues\" : ({.*?}})', ' '.join(script))[0]
dimensionValuesData = re.findall(r'dimensionValuesData\" : (\[.*\])', ' '.join(script))[0]
asinToDimensionIndexMap = re.findall(r'asinToDimensionIndexMap\" : ({.*})', ' '.join(script))[0]
dimensionValuesDisplayData = re.findall(r'dimensionValuesDisplayData\" : ({.*})', ' '.join(script))[0]
Now you can easily convert them with json.loads and combine them as you wish.
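For instance, here is a minimal sketch (assuming the captures above succeeded; the 'size_name' and 'color_name' keys are taken from your description and may differ on the live page) that pairs each ASIN with its size and color:
import json

values = json.loads(variationValues)
index_map = json.loads(asinToDimensionIndexMap)

# each ASIN maps to a list of indices into the variation lists, e.g. [size_index, color_index]
for asin, (size_idx, color_idx) in index_map.items():
    print(asin, values['size_name'][size_idx], values['color_name'][color_idx])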

Quandl data, API call

Recently I have been reading a stock prices database in Quandl, using an API call to extract the data. But I am really confused by the example I have.
import requests
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
Can anyone explain that to me?
1) For api_url, if I copy that webpage address, it says 404 not found. So if I want to use another database, how do I prepare this api_url? What does '% stock' mean?
2) Here requests looks like it is used to extract the data. What is the format of raw_data? How do I know the column names? How do I extract the columns?
To expand on my comment above:
% stock is a string formatting operation, replacing %s in the preceding string with the value referenced by stock. Further details can be found here
raw_data actually references a Response object (part of the requests module; details found here).
To expand on your code:
import requests
#Set the stock we are interested in, AAPL is Apple stock code
stock = 'AAPL'
#Your code
api_url = 'https://www.quandl.com/api/v1/datasets/WIKI/%s.json' % stock
session = requests.Session()
session.mount('http://', requests.adapters.HTTPAdapter(max_retries=3))
raw_data = session.get(api_url)
# Probably want to check that the response status code is 200 (OK) here
# to make sure we got the content successfully.
# requests.Response has a json() method that returns the body as a Python dict
aapl_stock = raw_data.json()
# We can then look at the keys to see what we have access to
aapl_stock.keys()
# column_names seems to describe the individual data points
aapl_stock['column_names']
# A big list of data; let's just look at the first ten points...
aapl_stock['data'][0:10]
Edit to answer question in comment
So aapl_stock['column_names'] shows Date and Open as the first and second values respectively. This means they correspond to positions 0 and 1 in each element of the data.
Note that aapl_stock['data'][0:10] returns the first ten rows, each of which is itself a list, so you index the row first and the column second: aapl_stock['data'][0][0] is the Date of the first row, and aapl_stock['data'][0][1] is its Open value.
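To get every [Date, Open] pair in the dataset, a minimal sketch (assuming the response shape above) using a list comprehension rather than slicing:
# each element of 'data' is one row, e.g. [date, open, high, low, close, ...]
aapl_date_open = [row[0:2] for row in aapl_stock['data']]
print(aapl_date_open[0:3])  # the first three [Date, Open] pairs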
If you are new to Python I seriously recommend looking at list slice notation; a quick intro can be found here.

Python Syntax Incorrect for Email Creation

I am trying to write out some basic Python for my Kolab email server. For primary_mail, I want it to be first initial plus last name, such as jdoe. The default is first name (dot) last name: john.doe@domain.com.
I have come up with the following:
primary_mail ='%(givenname)s'[0:1]%(surname)s@%(domain)s
which I basically want to evaluate to jdoe@domain.com.
givenname would be someone's first name (i.e. John)
surname would be someone's last name (i.e. Doe)
domain is the email domain: domain.com
When Python goes to canonify it, it comes up with some mumbo jumbo like so:
'john[0:1]'doe@domain.com
Can someone help me out with correcting this? I am so close.
EDIT:
According to the Kolab documentation, it looks like it is something like:
"{0}@{1}": "format('%(uid)s', '%(domain)s')"
This of course doesn't work for me though....
EDIT 2:
I am getting the following in my error logs:
imaps[1916]: ptload completely failed: unable to canonify identifier: 'john'[0:1]doe@domain.com
String formatting is by far the easiest, most readable and preferred way of accomplishing this:
first_name = 'John'
surname = 'Smith'
domain = 'company.com'
primary_mail = '{initial}{surname}@{domain}'.format(initial=first_name[0].lower(), surname=surname.lower(), domain=domain)
primary_mail now equals 'jsmith@company.com'. You define a string containing named placeholders in braces, then call the format method to have those placeholders replaced at runtime with the appropriate values. Here, we take the first character of first_name and convert it to lower case, convert the entirety of surname also, and leave domain unchanged.
You can read more on string formatting at the Python 2.7 docs.
James Scholes is right that format is a better way of doing it; however, reading the Kolab documentation, it seems that you can only supply the format string, and Kolab applies the % style formatter internally, which you can't change. From
the Kolab 'primary_mail' documentation:
primary_mail = %(givenname)s.%(surname)s@%(domain)s
The equivalent of the following Python is then executed:
primary_mail = "%(givenname)s.%(surname)s#%(domain)s" % {
"givenname": "Maria",
"surname": "Moller",
"preferredlanguage": "en_US"
}
In this case, we need a modifier on the format conversion. We have %(givenname)s, which ensures that givenname is formatted as a string. We can also specify a minimum length, followed by a . and then a precision. This is normally only used with numbers, but we can use it for strings, too. Here is a format string with no minimum length, but a maximum length (precision) of 1 character:
"%(givenname).1s"
So you probably want a string like this (note the s conversion is still required after each placeholder):
"%(givenname).1s%(surname)s@%(domain)s"

Scraping blog and saving date to database causes DateError: unknown date format

I am working on a project where I scrape a number of blogs and save a selection of the data to a SQLite database, such as the title of the post, the date it was posted, and the content of the post.
The goal in the end is to do some fancy textual analyses, but right now I have a problem with writing the data to the database.
I work with the pattern library for Python (the module about databases can be found here).
I am busy with the third blog now. The data from the two other blogs is already saved in the database, and for the third blog, which is similarly structured, I adapted the code.
There are several functions well integrated with each other, and they work fine. I also accessed all the data the right way when I tried it out in IPython Notebook. When I ran the code as a trial in the console for only one blog page (there are 43 altogether), it also worked and saved everything nicely in the database. But when I ran it again for all 43 pages, it threw a date error.
There are some comments and print statements inside the functions now which I used for debugging. The problem seems to happen in the function parse_post_info, which passes a dictionary on to the function that goes over all blog pages and opens every single post, and then saves the dictionary that parse_post_info returns IF it is not None; but I think it IS empty, because something about the date format goes wrong.
Also - why does the code work once, and then the same code throws a DateError the second time:
DateError: unknown date format for '2015-06-09T07:01:55+00:00'
Here is the function:
from pattern.db import Database, field, pk, date, STRING, INTEGER, BOOLEAN, DATE, NOW, TEXT, TableError, PRIMARY, eq, all
from pattern.web import URL, Element, DOM, plaintext
def parse_post_info(p):
    """ This function receives a post Element from the post list and
    returns a dictionary with post url, post title, labels, date.
    """
    try:
        post_header = p("header.entry-header")[0]
        title_tag = post_header("a < h1")[0]
        post_title = plaintext(title_tag.content)
        print post_title
        post_url = title_tag("a")[0].href
        date_tag = post_header("div.entry-meta")[0]
        post_date = plaintext(date_tag("time")[0].datetime).split("T")[0]
        #post_date = date(post_date_text)
        print post_date
        post_id = int(((p).id).split("-")[-1])
        post_content = get_post_content(post_url)
        labels = " "
        print labels
        return dict(blog_no=blog_no,
                    post_title=post_title,
                    post_url=post_url,
                    post_date=post_date,
                    post_id=post_id,
                    labels=labels,
                    post_content=post_content)
    except:
        pass
The date() function returns a new Date, a convenient subclass of Python's datetime.datetime. It takes an integer (a Unix timestamp), a string, or NOW.
Your scraped value carries a UTC offset, so it can also differ from your local time.
The default format date() expects is "YYYY-MM-DD hh:mm:ss", which is why the ISO 8601 string '2015-06-09T07:01:55+00:00' (with its 'T' separator and '+00:00' offset) is rejected; normalize the string before passing it in.
Details on converting time formats can be found here.
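For example, a minimal sketch in plain Python (not pattern-specific) that normalizes the scraped timestamp into that format before calling date():
from datetime import datetime

raw = '2015-06-09T07:01:55+00:00'
# drop the '+00:00' offset, then parse the ISO 8601 timestamp
parsed = datetime.strptime(raw.split('+')[0], '%Y-%m-%dT%H:%M:%S')
clean = parsed.strftime('%Y-%m-%d %H:%M:%S')
print(clean)  # 2015-06-09 07:01:55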
