I am trying to merge CRSP and IBES through the Eikon API.
I have extracted CUSIP codes from CRSP and want to convert them to RIC codes in order to extract analyst estimates.
When I do the following in Python it returns an error (Payload Too Large). I guess that means I have reached some data limit. But how can the limit be so low? We are only talking about roughly 28,000 requests (data points). And secondly, how can I circumvent it, if that is possible?
ric = ek.get_symbology(cusips, from_symbol_type="CUSIP", to_symbol_type="RIC")
You could create a loop to retrieve the data in batches:
import pandas as pd

dfs = []  # will hold one DataFrame per batch
batchsize = 200
for i in range(0, len(cusips), batchsize):
    batch = cusips[i:i + batchsize]
    r = ek.get_symbology(batch, from_symbol_type="CUSIP", to_symbol_type="RIC")
    dfs.append(r)

rics = pd.concat(dfs)
print(rics)
NB: I haven't tested this specific batch size; you can play around with the number to see what works best for you.
Hopefully this helps!
Since my last post lacked information, here is an example of my df (the important columns):
deviceID: unique ID of the vehicle. Vehicles send data every X minutes.
mileage: the distance travelled since the last message (in km)
position_timestamp_measure: Unix timestamp of the time the record was created.
deviceID mileage position_timestamp_measure
54672 10 1600696079
43423 20 1600696079
42342 3 1600701501
54672 3 1600702102
43423 2 1600702701
My goal is to validate the mileage by comparing it to the maximum speed of the vehicle (80 km/h), calculating the speed of the vehicle from the timestamp and the mileage. The result should then be written into the original dataset.
What I've done so far is the following:
maxSpeedKMH = 80  # maximum plausible speed of a vehicle

df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
# create new col and set all values to 0 (invalid)
df_ori['valid'] = 0
for group_name, group in df:
    # sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    # since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'valid'] = 1
    # iterate through each row in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i] - group.position_timestamp_measure.values[i - 1])
        timeHours = (timeGoneSec / 60) / 60
        # calculate speed and mark the row as valid if it is plausible
        if (group.mileage.values[i] / timeHours) < maxSpeedKMH:
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'valid'] = 1

df_ori.valid.value_counts()
It definitely works the way I want it to; however, performance is very poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
# create new col and set all values to 0 (invalid)
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])

# Subtract the preceding value from the current value within each device group
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()

# The operation above produces NaN for the first row of each group;
# fill 'valid' with 1 for those rows, matching the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1

df_ori['timeHours'] = df_ori['timeGoneSec'] / 3600  # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) < maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1

# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, i.e. row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.
I've been going in circles for days now, and I've run out of steam. Doesn't help that I'm new to python / numpy / pandas etc.
I started with numpy which led me to pandas, because of a GIS function that delivers a numpy array of data. That is my starting point. I'm trying to get to an endpoint being a small enriched dataset, in an excel spreadsheet.
But it seems like going down a rabbit hole trying to extract that data, and then manipulate it with the numpy toolsets. The delivered data is one dimensional, but each row contains 8 fields. A simple conversion to pandas and then to ndarray, magically makes it all good. Except that I lose headers in the process, and it just snowballs from there.
I've had to re-evaluate my understanding, based on some feedback on another post, and that's fine. But I'm just going in circles. Example after example seems to use predominantly numerical data, and I'm starting to get the feeling that's where its strength lies. Trying to use it for what I'd call a more non-mathematical/non-numerical purpose makes me feel like I'm barking up the wrong tree.
Any advice?
Addendum
The data I extract from the GIS system consists of names, dates, and other textual data. I then have another CSV file that I need to use as a lookup, so that I can enrich the source with more textual information, which finally gets published to Excel.
SAMPLE DATA - SOURCE
WorkCode Status WorkName StartDate EndDate siteType Supplier
0 AT-W34319 None Second building 2020-05-04 2020-05-31 Type A Acem 1
1 AT-W67713 None Left of the red office tower 2019-02-11 2020-08-28 Type B Quester Q
2 AT-W68713 None 12 main street 2019-05-23 2020-11-03 Class 1 Type B Dettlim Group
3 AT-W70105 None city central 2019-03-07 2021-08-06 Other Hans Int
4 AT-W73855 None top floor 2019-05-06 2020-10-28 Type a None
SAMPLE DATA - CSV
["Id", "Version","Utility/Principal","Principal Contractor Contact"]
XM-N33463,7.1,"A Contracting company", "555-12345"
XM-N33211,2.1,"Contractor #b", "555-12345"
XM-N33225,1.3,"That other contractor", "555-12345"
XM-N58755,1.0,"v Contracting", "555-12345"
XM-N58755,2.3,"dsContracting", "555-12345"
XM-222222,2.3,"dsContracting", "555-12345"
BM-O33343,2.1,"dsContracting", "555-12345"
def SMAN():
    ####################################################################################################################
    # Exporting the results of the analysis...
    ####################################################################################################################
    """
    Approach is as follows:
    1) Get the source data
    2) Get the CSV lookup data loaded into memory - it'll be faster
    3) Iterate through the source data, looking for matches in the CSV data
    4) Add an extra couple of columns onto the source data, and populate them with the (matching) lookup data.
    5) Export the now enhanced data to Excel.
    """
    arcpy.env.workspace = workspace + filenameGDB
    input = "ApprovedActivityByLocalBoard"
    exportFile = arcpy.da.FeatureClassToNumPyArray(input, ['WorkCode', 'Status', 'WorkName', 'PSN2', 'StartDate', 'EndDate', 'siteType', 'Supplier'])
    # we have our data, but it's (9893,) instead of [9893 rows x 8 columns]
    pdExportFile = pandas.DataFrame(exportFile)
    LBW = pdExportFile.to_numpy()
    del exportFile
    del pdExportFile
    # Now we have [9893 rows x 8 columns] - but we've lost the headers
    col_list = ["Id", "Version", "Utility/Principal", "Principal Contractor Contact"]
    allPermits = pandas.read_csv("lookup.csv", usecols=col_list)
    # Now we have the CSV file loaded ... and only the important parts - should be fast.
    # Shape: (94523, 4)
    # will have to find a way to improve this...
    # The CSV file has more than one row per Id, because there are different versions (as different records)
    # Only want the last one.
    # each record must now be "enhanced" with the matching record from the CSV file.
    finalReport = []  # we are expecting this to be [9893 rows x 12 columns] at the end
    counter = -1
    for eachWorksite in LBW[:5]:  # let's just work with 5 records right now...
        counter += 1
        # eachWorksite = list(eachWorksite)  # eachWorksite is a tuple - so need to convert it
        # # but if we change it to a list, we lose the headers!
        certID = LBW[counter][0]  # get the ID to use for lookup matching
        # Search the CSV data
        permitsFound = allPermits[allPermits['Id'] == certID]
        permitsFound = permitsFound.to_numpy()
        if numpy.shape(permitsFound)[0] > 1:
            print("Too many hits!")  # got to deal with that CSV Version field.
            exit()
        else:
            # now "enrich" the record/row by adding on the fields from the lookup
            # so a row goes from 8 fields to 12 fields
            newline = numpy.append(eachWorksite, permitsFound)
            # and this enhanced record/row must become the new normal
            # but I cannot change the original, so it must go into a new container
            finalReport = numpy.append(finalReport, newline, axis=0)
    # now I should have a new container of "enriched" data
    # which has gone from [9893 rows x 8 columns] to [9893 rows x 12 columns]
    # Some of the columns, of course, could be empty.
    # Now let's dump the results to an Excel file and make it accessible for everyone else.
    df = pandas.DataFrame(finalReport)
    filepath = 'finalreport.csv'
    df.to_csv(filepath, index=False)
    # Somewhere I was getting Error("Cannot convert {0!r} to Excel".format(value))
    # Now I get
    filepath = 'finalReport.xlsx'
    df.to_excel(filepath, index=False)
I have eventually answered my own question, and this is how:
Yes, for my situation, pandas worked just fine, even beautifully, for
manipulating non-numerical data. I just had to learn some basics.
The biggest learning was to understand the pandas DataFrame as an object that has to be manipulated remotely by various functions/tools. Just because I "print" the dataframe doesn't mean it's just text. (Thanks juanpa.arrivillaga for pointing out my erroneous assumptions in Why can I not reproduce a nd array manually?)
I also had to wrap my mind around the concept of indexes and columns, and how they could be altered/manipulated/ etc. And then, how to use them to maximum effect.
Once those fundamentals had been sorted, the rest followed naturally, and my code reduced to a couple of nice elegant functions.
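In broad strokes, the solution boiled down to a merge-based enrichment. Here is a minimal sketch of the idea (the exact join keys, the lookup file name, and the drop_duplicates step for keeping only the latest Version are illustrative assumptions, not the exact code I ended up with):

import pandas

# Source rows straight from the GIS export; the structured array keeps its field names.
source = pandas.DataFrame(exportFile)  # exportFile as returned by FeatureClassToNumPyArray
lookup = pandas.read_csv("lookup.csv",
                         usecols=["Id", "Version", "Utility/Principal", "Principal Contractor Contact"])

# Keep only the latest Version per Id, then left-join it onto the source rows.
latest = lookup.sort_values("Version").drop_duplicates("Id", keep="last")
enriched = source.merge(latest, left_on="WorkCode", right_on="Id", how="left")

enriched.to_excel("finalReport.xlsx", index=False)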
Cheers
I am working with a CSV sheet which contains data from a brewery, e.g. date required, quantity ordered, etc.
I want to write a module to read the CSV file and load the data into a suitable data structure
in Python. I then have to interpret the data by calculating the average growth rate and the ratio of sales for
different beers, and use these values to predict sales for a given week or month in the future.
I have no idea where to start. The only lines of code I have so far are:
df = pd.read_csv(r'file location')
print(df)
To illustrate, I have downloaded data on the US employment level (https://fred.stlouisfed.org/series/CE16OV) and population (https://fred.stlouisfed.org/series/POP).
import pandas as pd
employ = pd.read_csv('/home/brb/bugs/data/CE16OV.csv')
employ = employ.rename(columns={'DATE':'date'})
employ = employ.rename(columns={'CE16OV':'employ'})
employ = employ[employ['date']>='1952-01-01']
pop = pd.read_csv('/home/brb/bugs/data/POP.csv')
pop = pop.rename(columns={'DATE':'date'})
pop = pop.rename(columns={'POP':'pop'})
pop = pop[pop['date']<='2019-10-01']
df = pd.merge(employ,pop)
df['employ_monthly'] = df['employ'].pct_change()
df['employ_yoy'] = df['employ'].pct_change(periods=12)
df['employ_pop'] = df['employ']/df['pop']
df.head()
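Applied to the brewery data, the same pattern might look roughly like this (the column names 'date', 'beer', and 'quantity' are placeholders for whatever your CSV actually contains, and the forecast is deliberately naive):

import pandas as pd

df = pd.read_csv(r'file location', parse_dates=['date'])  # placeholder path and column names

# average week-over-week growth rate of the total quantity ordered
weekly = df.groupby(pd.Grouper(key='date', freq='W'))['quantity'].sum()
avg_growth = weekly.pct_change().mean()

# ratio of sales between the different beers
sales_by_beer = df.groupby('beer')['quantity'].sum()
sales_ratio = sales_by_beer / sales_by_beer.sum()

# naive prediction for next week: last observed week scaled by the average growth rate
next_week_forecast = weekly.iloc[-1] * (1 + avg_growth)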
I want to get 1000 tweets (without retweets) for the period 07/08/2006 00:00 to 07/08/2006 23:59 using the premium full archive. The API returns a maximum of 500 tweets per request. How can I get 1000 tweets without running my code twice? Also, how can I export the tweets to CSV format, including all the keys?
I am new to Python. I tried to get the tweets, but as I said in the summary I'm getting 500 tweets (including retweets). Also, when I save the tweets to the CSV, every even row is empty.
for example:
| created_at | id_str | source       | user  |
|------------|--------|--------------|-------|
| 2008       | 949483 | www.none.com | John  |
| empty      | empty  | empty        | empty |
| 2009       | 74332  | www.non2.com | Marc  |
| empty      | empty  | empty        | empty |
My questions are:
How can I get 1000 tweets (excluding retweets), without duplicates, by running the code only once? And how can I save all the keys of the output in a CSV without having empty even rows?
from TwitterAPI import TwitterAPI
import csv
SEARCH_TERM = '#nOne'
PRODUCT = 'fullarchive'
LABEL = 'dev-environment'
api = TwitterAPI("consumer_key",
                 "consumer_secret",
                 "access_token_key",
                 "access_token_secret")

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200608070000',
                 'toDate': '200608072359',
                 'maxResults': 500
                 })

csvFile = open('data.csv', 'w', encoding='UTF-8')
csvWriter = csv.writer(csvFile)

for item in r:
    csvWriter.writerow([item['created_at'],
                        item["id_str"],
                        item["source"],
                        item['user']['screen_name'],
                        item["user"]["location"],
                        item["geo"],
                        item["coordinates"],
                        item['text'] if 'text' in item else item])
I expect to get 1000 unique tweets (excluding retweets) in CSV format by running the code once.
Thanks
If you are using the TwitterAPI package, you should take advantage of the TwitterPager class which uses the next element in the returned JSON to get the next page of tweets. Look at this simple example to understand the usage.
In your case, you would simply replace this:
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200608070000',
                 'toDate': '200608072359',
                 'maxResults': 500
                 })
...with this:
from TwitterAPI import TwitterPager
r = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                 {'query': SEARCH_TERM,
                  'fromDate': '200608070000',
                  'toDate': '200608072359',
                  'maxResults': 500
                  }).get_iterator()
By default, TwitterPager waits 5 seconds between requests. In the Sandbox environment you should be able to reduce this to 2 seconds without exceeding the rate limit. To change the wait to 2 seconds you would call get_iterator with a parameter, like this:
get_iterator(wait=2)
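Putting it together, here is a rough sketch of how you could cap the results at 1000 tweets, skip retweets, and avoid the blank rows in the CSV. The blank rows typically come from opening the file without newline=''; the retweeted_status check is the usual way to spot retweets in the Twitter payload, but double-check both assumptions against what your endpoint actually returns:

from TwitterAPI import TwitterPager
import csv

pager = TwitterPager(api, 'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {'query': SEARCH_TERM,
                      'fromDate': '200608070000',
                      'toDate': '200608072359',
                      'maxResults': 500})

count = 0
with open('data.csv', 'w', newline='', encoding='UTF-8') as csvFile:  # newline='' avoids the empty rows
    csvWriter = csv.writer(csvFile)
    for item in pager.get_iterator(wait=2):
        if 'retweeted_status' in item:  # assumption: retweets carry this key in the full-archive payload
            continue
        csvWriter.writerow([item['created_at'],
                            item['id_str'],
                            item['source'],
                            item['user']['screen_name'],
                            item['text'] if 'text' in item else item])
        count += 1
        if count >= 1000:
            break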
earnings = self.collection.find({})  # returns 60k documents
----
data_dic = {'score': [], 'reading_time': []}
for earning in earnings:
    data_dic['reading_time'].append(earning["reading_time"])
    data_dic['score'].append(earning["score"])
----
df = pd.DataFrame()
df['reading_time'] = data_dic["reading_time"]
df['score'] = data_dic["score"]
The code between the ---- markers takes 4 seconds to complete. How can I improve this function?
The time consists of these parts: MongoDB query time, time used to transfer the data, network round trips, and Python list operations. You can optimize each of them.
One is to reduce the amount of data to transfer. Since you only need reading_time and score, you can fetch only those fields. If your average document size is big, this approach is very effective.
earnings = self.collection.find({}, {'reading_time': True, 'score': True})
Second, Mongo transfers a limited amount of data per batch. Since the data contains up to 60k rows, it takes multiple round trips to transfer it all. You can adjust cursor.batchSize to reduce the round-trip count.
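With PyMongo, combining the projection above with a larger batch size would look roughly like this (the value 10000 is only illustrative; tune it for your documents):

earnings = self.collection.find({}, {'reading_time': True, 'score': True}).batch_size(10000)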
Third, increase the network bandwidth if you can.
Fourth, you can accelerate things by using a NumPy array. It is a C-like array data structure which is faster than a Python list. Pre-allocate a fixed-length array and assign values by index. This avoids the internal resizing that happens when calling list.append.
import numpy as np

count = self.collection.count_documents({})  # cursor.count() is deprecated in recent PyMongo versions
score = np.empty((count,), dtype=float)
reading_time = np.empty((count,), dtype='datetime64[us]')
for i, earning in enumerate(earnings):
    score[i] = earning["score"]
    reading_time[i] = earning["reading_time"]
df = pd.DataFrame()
df['reading_time'] = reading_time
df['score'] = score