I'm trying to import the following dataset and store it in a pandas dataframe: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh/data
I use the following code:
import requests
import pandas as pd

r = requests.get('https://data.nasa.gov/resource/gh4g-9sfh.json')
meteor_data = r.json()
df = pd.DataFrame(meteor_data)
print(df.shape)
The resulting dataframe only has 1000 rows. I need it to have all 45,716 rows. How do I do this?
Check out the docs on the $limit parameter
The $limit parameter controls the total number of rows returned, and
it defaults to 1,000 records per request.
Note: The maximum value for $limit is 50,000 records, and if you
exceed that limit you'll get a 400 Bad Request response.
So you're just getting the default number of records back.
You will not be able to get more than 50,000 records in a single API call; fetching a larger dataset takes multiple calls using $limit together with $offset.
Try:
https://data.nasa.gov/resource/gh4g-9sfh.json?$limit=50000
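If a dataset ever exceeds that 50,000-row cap, a minimal pagination sketch with requests, $limit and $offset could look like this (untested; it orders by the SODA :id system field so paging is stable):

import pandas as pd
import requests

url = 'https://data.nasa.gov/resource/gh4g-9sfh.json'
page_size = 50000                      # SODA's maximum rows per request
frames = []
offset = 0
while True:
    r = requests.get(url, params={'$limit': page_size,
                                  '$offset': offset,
                                  '$order': ':id'})   # stable ordering for paging
    r.raise_for_status()
    batch = r.json()
    if not batch:                      # an empty page means everything has been read
        break
    frames.append(pd.DataFrame(batch))
    offset += page_size

df = pd.concat(frames, ignore_index=True)
print(df.shape)                        # all 45,716 rows for this dataset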
See Why am I limited to 1,000 rows on SODA API when I have an App Key
Do it like this and set the limit:
import pandas as pd
from sodapy import Socrata
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.nasa.gov", None)
# Example authenticated client (needed for non-public datasets):
# client = Socrata("data.nasa.gov",
#                  "MyAppToken",
#                  username="user@example.com",
#                  password="AFakePassword")
# First 2000 results, returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("gh4g-9sfh", limit=2000)
# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
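Since the Meteorite Landings dataset has 45,716 rows, which is below the 50,000-record cap, a single sodapy call with a higher limit should return everything:

results = client.get("gh4g-9sfh", limit=50000)
results_df = pd.DataFrame.from_records(results)
print(results_df.shape)                # expect 45,716 rows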
There is a huge dataframe with hundreds of thousands of rows for every single client. I want to summarise this dataframe into another dataframe where a single row contains the summarised data of all the rows of that client.
The problem is that this is not the only code; there are about 1,000 more similar lines, and it takes a lot of time to execute. But when I run the equivalent in R, it is 10 times faster. I'm attaching the R code for reference as well.
Is there a way I can make the Python code as fast as the R code?
Python Code:
for i in range(len(client)):
    print(i)
    sub = data.loc[data['Client Name']==client['Client Name'][i], :]
    client['requests'][i] = len(sub)
    client['ppt_req'][i] = len(sub)/sub['CID'].nunique()
    client['approval'][i] = (((sub['verify']=='Yes').sum())/client['requests'][i])*100
    client['denial'][i] = (((sub['verify']=='No').sum())/client['requests'][i])*100
    client['male'][i] = (((sub['gender']=='Male').sum())/client['requests'][i])*100
    client['female'][i] = (((sub['gender ']=='female').sum())/client['requests'][i])*100
R Code:
for(i in 1:nrow(client))
{
  print(i)
  #i=1
  sub <- subset(data, data$Client.Name==client$Client.Name[i])
  client$requests[i] <- nrow(sub)
  client$ppt_req[i] <- nrow(sub)/(length(unique(sub$CID)))
  client$approval[i] <- ((as.numeric(table(sub$verify=="Yes")["TRUE"]))/client$requests[i])*100
  client$denial[i] <- ((as.numeric(table(sub$verify=="No")["TRUE"]))/client$requests[i])*100
  client$male[i] <- ((as.numeric(table(sub$gender)["Male"]))/client$requests[i])*100
  client$female[i] <- ((as.numeric(table(sub$gender)["Female"]))/client$requests[i])*100
}
Iterating over a Pandas dataframe with Python loops is very slow.
But the main issue comes from the line data.loc[data['Client Name']==client['Client Name'][i], :], which walks through the whole dataframe data for each client. This means the line ultimately iterates over >100,000 strings >100,000 times, so tens of billions of costly string comparisons are made. Not to mention that the per-group computations are replicated for each client.
You can solve this by using a groupby on client names followed by a merge.
Here is a sketch of code (untested):
# If the number of client names in `data` is much larger than in `client`,
# one can filter `data` before applying the next `groupby` using:
# client['Client Name'].unique()

# Generate a compact dataframe containing the information for each
# possible client name that appears in `data`.
clientDataInfos = pd.DataFrame([
    {
        'Client Name': name,
        'requests': len(group),
        'ppt_req': len(group) / group['CID'].nunique(),
        'approval': (((group['verify']=='Yes').sum()) / len(group)) * 100,
        'denial': (((group['verify']=='No').sum()) / len(group)) * 100,
        'male': (((group['gender']=='Male').sum()) / len(group)) * 100,
        'female': (((group['gender ']=='female').sum()) / len(group)) * 100
    }
    for name, group in data.groupby('Client Name')
])
# Extend `client` with the precomputed information in `clientDataInfos`.
# The extended columns should not already appear in `client`.
client = client.merge(clientDataInfos, on='Client Name')
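For reference, an equivalent (also untested) variant uses groupby().apply() and .mean() for the percentage columns, keeping the column names from the question, including the 'gender ' column with its trailing space:

stats = (
    data.groupby('Client Name')
        .apply(lambda g: pd.Series({
            'requests': len(g),
            'ppt_req': len(g) / g['CID'].nunique(),
            'approval': (g['verify'] == 'Yes').mean() * 100,
            'denial': (g['verify'] == 'No').mean() * 100,
            'male': (g['gender'] == 'Male').mean() * 100,
            # column name copied verbatim from the question ('gender ' with a trailing space)
            'female': (g['gender '] == 'female').mean() * 100,
        }))
        .reset_index()
)
client = client.merge(stats, on='Client Name')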
Pytrends for Google Trends data does not return a column if there is no data for a search parameter on a specific region.
The code below is from pytrends.request
def interest_over_time(self):
    """Request data from Google's Interest Over Time section and return a dataframe"""
    over_time_payload = {
        # convert to string as requests will mangle
        'req': json.dumps(self.interest_over_time_widget['request']),
        'token': self.interest_over_time_widget['token'],
        'tz': self.tz
    }
    # make the request and parse the returned json
    req_json = self._get_data(
        url=TrendReq.INTEREST_OVER_TIME_URL,
        method=TrendReq.GET_METHOD,
        trim_chars=5,
        params=over_time_payload,
    )
    df = pd.DataFrame(req_json['default']['timelineData'])
    if (df.empty):
        return df
    df['date'] = pd.to_datetime(df['time'].astype(dtype='float64'), unit='s')
    df = df.set_index(['date']).sort_index()
From the code above, if there is no data, it just returns df, which will be empty.
My question is, how can I make it return a column with "No data" on every line and the search term as header, so that I can clearly see for which search terms there is no data?
Thank you.
I hit this problem, then I hit this web page. My solution was to ask Google trends for data on a search item it would have data for, then rename the column and 0 the data.
I used the ".drop" method to get rid of the "isPartial" column and the ".rename" method to change the column name. To zero the data in the column, I did the following, I created a function:
# Make every value zero
def MakeZero(x):
    return x * 0

Then I used the ".apply" method on the dataframe to zero the column:

ThisYrRslt = BlankResult.apply(MakeZero)
: ) But the question is, what search term do you ask google trends about that will always return a value? I chose "Google". : )
I'm sure you can think of some better ones, but it's hard to leave those words in commercial code.
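Here is a rough sketch of that workaround; 'my obscure term' and the timeframe are placeholders, and MakeZero is the helper defined above:

from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US', tz=360)
missing_term = 'my obscure term'       # the search term that returned no data

# Ask for a term that will (almost) always have data, e.g. "Google".
pytrends.build_payload(['Google'], timeframe='today 12-m')
BlankResult = pytrends.interest_over_time()

# Drop "isPartial", rename the placeholder column to the missing term,
# and zero the values with the MakeZero/apply trick shown above.
BlankResult = BlankResult.drop(columns=['isPartial'])
BlankResult = BlankResult.rename(columns={'Google': missing_term})
ThisYrRslt = BlankResult.apply(MakeZero)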
I am building a word cloud from public Tweets. I have connected to the API via Tweepy and have successfully gotten it to return Tweets related to my search term, but for some reason can only get it to return 15 Tweets.
import pandas as pd
import tweepy

# `api` is an authenticated tweepy.API instance (auth setup omitted here)

# subject of word cloud
search_term = 'ENTER SEARCH TERM HERE'

# creating dataframe containing the username and corresponding tweet content relating to our search term
df = pd.DataFrame(
    [tweet.user.id, tweet.user.name, tweet.text] for tweet in api.search(q=search_term, lang="en")
)
# renaming columns of data frame
df.rename(columns={0 : 'user id'}, inplace=True)
df.rename(columns={1 : 'screen name'}, inplace=True)
df.rename(columns={2 : 'text'}, inplace=True)
df
By default, the standard search API that API.search uses returns up to 15 Tweets per page.
You need to specify the count parameter, up to a maximum of 100, if you want to retrieve more per request.
If you want more than 100 or a guaranteed amount, you'll need to look into paginating using tweepy.Cursor.
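A minimal sketch, assuming api and search_term are the same objects as in the question (note that newer Tweepy versions rename api.search to api.search_tweets):

import pandas as pd
import tweepy

# Up to 100 Tweets in a single request via the count parameter.
results = api.search(q=search_term, lang="en", count=100)

# Or paginate with tweepy.Cursor to collect an arbitrary number, e.g. 500.
tweets = [
    [tweet.user.id, tweet.user.name, tweet.text]
    for tweet in tweepy.Cursor(api.search, q=search_term, lang="en", count=100).items(500)
]
df = pd.DataFrame(tweets, columns=['user id', 'screen name', 'text'])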
I am using Firestore to store time series data being pulled from a sensor. I am using Python to push the data, namely the firebase-admin package for verification. I chose to store this data using arrays, where the same index corresponds to the same observation across the different fields. Is there a way to add non-unique elements to the array? Or can arrays only store unique elements? If so, what data structure would you suggest for storing time series data?
I am trying to add observations to an existing array in Firestore, but ArrayUnion only adds an element if it is not already present in the array. When I execute the second chunk of code (to update the existing array), only unique values are saved.
# Initialize arrays and push to Firestore
import firebase_admin
from firebase_admin import credentials
from firebase_admin import firestore
import datetime
cred = credentials.Certificate('path_to_certificate')
firebase_admin.initialize_app(cred)
db = firestore.client()
pred_v_arr = []
meas_v_arr = []
c_arr = []
soc_arr = []
cell_1_arr = []
cell_2_arr = []
cell_3_arr = []
exec_time_arr = []
curr_time_arr = []
pred_volt = 12562.70
meas_volt = 12362.70
current = 0.0
soc = 0.0
cell_1_volt = 4.32
cell_2_volt = 4.4
cell_3_volt = 4.23
exec_time = 0.4
curr_time = datetime.datetime.now()
pred_v_arr.append(pred_volt)
meas_v_arr.append(meas_volt)
c_arr.append(current)
soc_arr.append(soc)
cell_1_arr.append(cell_1_volt)
cell_2_arr.append(cell_2_volt)
cell_3_arr.append(cell_3_volt)
exec_time_arr.append(exec_time)
curr_time_arr.append(curr_time)
try:
    push_data = {
        u'time': curr_time_arr,
        u'vpred': pred_v_arr,
        u'vmeas': meas_v_arr,
        u'current': c_arr,
        u'soc': soc_arr,
        u'cell1': cell_1_arr,
        u'cell2': cell_2_arr,
        u'cell3': cell_3_arr,
        u'exectime': exec_time_arr
    }
    db.collection(u'battery1').document(u"day1").set(push_data)
except Exception as err:
    # handle write failures as needed
    print(err)
# Add a new observation to the different arrays
db.collection(u'battery1').document(u"day1").update({'time': firestore.ArrayUnion(curr_time_arr)})
db.collection(u'battery1').document(u"day1").update({'vpred': firestore.ArrayUnion(pred_v_arr)})
db.collection(u'battery1').document(u"day1").update({'vmeas': firestore.ArrayUnion(meas_v_arr)})
db.collection(u'battery1').document(u"day1").update({'current': firestore.ArrayUnion(c_arr)})
db.collection(u'battery1').document(u"day1").update({'cell1': firestore.ArrayUnion(cell_1_arr)})
db.collection(u'battery1').document(u"day1").update({'cell2': firestore.ArrayUnion(cell_2_arr)})
db.collection(u'battery1').document(u"day1").update({'cell3': firestore.ArrayUnion(cell_3_arr)})
db.collection(u'battery1').document(u"day1").update({'exectime': firestore.ArrayUnion(exec_time_arr)})
db.collection(u'battery1').document(u"day1").update({'soc': firestore.ArrayUnion(soc_arr)})
In the screenshot above you can see that there are 8 elements in the "time" field (as all calls to datetime.now() produce unique instances of timestamps), while all the other fields have only saved the unique data points sent (exectime/soc only have two data points, for 8 calls to ArrayUnion).
When you use firestore.ArrayUnion that operator's job is literally to ensure each value can only be present once in the array.
If you want to allow non-unique values, don't use firestore.ArrayUnion but just add the elements to the array regularly. This does require that you read the entire document and array first, then add the element locally, and write the result back.
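A minimal sketch of that read-modify-write approach, reusing the db handle and field names from the question (two fields shown; for concurrent writers you would wrap this in a transaction):

doc_ref = db.collection(u'battery1').document(u'day1')

# Read the whole document, including its arrays.
snapshot = doc_ref.get()
data = snapshot.to_dict() or {}

# Append the new (possibly duplicate) observations locally...
data.setdefault('soc', []).append(soc)
data.setdefault('exectime', []).append(exec_time)
# ...repeat for the other fields...

# ...then write the full arrays back, replacing the stored ones.
doc_ref.update({'soc': data['soc'], 'exectime': data['exectime']})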
Is there a way to check the HTTP status code in the code below, as I have not used the requests or urllib libraries, which would allow for this?
import pandas as pd
from pandas.io.excel import read_excel
url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'
#check the sheet number, spot: 9/9, short end 7/9
spot_curve = read_excel(url, sheetname=8) #Creates the dataframes
short_end_spot_curve = read_excel(url, sheetname=6)
# do some cleaning, keep NaN for now, as forward fill NaN is not recommended for yield curve
spot_curve.columns = spot_curve.loc['years:']
valid_index = spot_curve.index[4:]
spot_curve = spot_curve.loc[valid_index]
# remove all maturities within 5 years as those are duplicated in short-end file
col_mask = spot_curve.columns.values > 5
spot_curve = spot_curve.iloc[:, col_mask]
#Providing correct names
short_end_spot_curve.columns = short_end_spot_curve.loc['years:']
valid_index = short_end_spot_curve.index[4:]
short_end_spot_curve = short_end_spot_curve.loc[valid_index]
# merge these two, time index are identical
# ==============================================
combined_data = pd.concat([short_end_spot_curve, spot_curve], axis=1, join='outer')
# sort the maturity from short end to long end
combined_data.sort_index(axis=1, inplace=True)
def filter_func(group):
    return group.isnull().sum(axis=1) <= 50
combined_data = combined_data.groupby(level=0).filter(filter_func)
In pandas:
read_excel tries to use urllib2.urlopen (urllib.request.urlopen in Python 3) to open the URL and immediately calls .read() on the response, without keeping the HTTP response object around, roughly like:
data = urlopen(url).read()
Even though you only need part of the Excel file, pandas will download the whole file each time. So I upvoted @jonnybazookatone's answer.
It's better to save the Excel file locally first; then you can check the status code and the file's md5 checksum to verify data integrity, among other things.
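For example, a sketch along those lines: download the file with requests so the status code (and an md5 checksum) can be checked first, then hand the bytes to pandas. Note that current pandas spells the keyword sheet_name rather than the older sheetname used in the question.

import hashlib
import io

import pandas as pd
import requests

url = 'http://www.bankofengland.co.uk/statistics/Documents/yieldcurve/uknom05_mdaily.xls'

resp = requests.get(url)
print(resp.status_code)                         # e.g. 200
resp.raise_for_status()                         # stop if the download failed

print(hashlib.md5(resp.content).hexdigest())    # optional integrity check

spot_curve = pd.read_excel(io.BytesIO(resp.content), sheet_name=8)
short_end_spot_curve = pd.read_excel(io.BytesIO(resp.content), sheet_name=6)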