I'm using the Quandl API and its Python package to download stock financial data.
Right now I'm using the free SF0 database, which provides annual operational/financial data.
For example, this query pulls the last 6-8 years of data for the ticker "CRM" into a dataframe:
df=quandl.get('SF0/CRM_REVENUE_MRY')
df
Out[29]:
Value
Date
2010-01-31 1.305583e+09
2011-01-31 1.657139e+09
2012-01-31 2.266539e+09
2013-01-31 3.050195e+09
2014-01-31 4.071003e+09
2015-01-31 5.373586e+09
2016-01-31 6.667216e+09
What I want to do with this is loop over a list of about 50 stocks and also grab about 15 other columns from this database, using different query codes appended to the SF0/CRM_ part of the query.
qcolumns = ['REVUSD_MRY',
'GP_MRY',
'INVCAP_MRY',
'DEBT_MRY',
'NETINC_MRY',
'RETEARN_MRY',
'SHARESWADIL_MRY',
'SHARESWA_MRY',
'COR_MRY',
'FCF_MRY',
'DEBTUSD_MRY',
'EBITDAUSD_MRY',
'SGNA_MRY',
'NCFO_MRY',
'RND_MRY']
So, I think I need to:
a) run the query for each column and in each case append the result to the dataframe,
b) add column names to the dataframe,
c) create a dataframe for each stock (should this be a panel or a list of dataframes? Apologies, I'm new to pandas and dataframes and still on my learning curve), and
d) write everything to CSV.
Could you suggest an approach or point me in the right direction?
This code works to run two queries (two columns of data, both date-indexed), rename the columns, and then concatenate them:
df=quandl.get('SF0/CRM_REVENUE_MRY')
df = df.rename(columns={'Value': 'REVENUE_MRY'})
dfnext=quandl.get('SF0/CRM_NETINC_MRY')
dfnext = dfnext.rename(columns={'Value': 'CRM_NETINC_MRY'})
frames = [df, dfnext]
dfcombine = pd.concat([df, dfnext], axis=1) # now question is how to add stock tag "CRM" to frame
dfcombine
Out[39]:
REVENUE_MRY CRM_NETINC_MRY
Date
2010-01-31 1.305583e+09 80719000.0
2011-01-31 1.657139e+09 64474000.0
2012-01-31 2.266539e+09 -11572000.0
2013-01-31 3.050195e+09 -270445000.0
2014-01-31 4.071003e+09 -232175000.0
2015-01-31 5.373586e+09 -262688000.0
2016-01-31 6.667216e+09 -47426000.0
I can wrap this in a loop to get all the columns (there are around 15), but how do I tag each frame for each stock? Use a key? Use a 3D panel? Thanks for helping a struggling Python programmer!
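Since panels are deprecated in recent pandas, the usual way to tag per-stock frames is either a dict keyed by ticker or a single frame with a (Ticker, Date) MultiIndex built via the keys argument of pd.concat. Here is a minimal sketch of the whole loop under that approach; the ticker list, the output file name, and the assumption that every SF0 code exists for every ticker are mine (a try/except around quandl.get would cover missing series).
import pandas as pd
import quandl

tickers = ['CRM', 'AAPL', 'MSFT']  # hypothetical subset of the ~50 tickers

def fetch_stock(ticker):
    # One single-column frame per query code, renamed so the columns are distinct.
    frames = []
    for code in qcolumns:
        f = quandl.get('SF0/{}_{}'.format(ticker, code))
        frames.append(f.rename(columns={'Value': code}))
    return pd.concat(frames, axis=1)  # columns side by side, aligned on Date

# keys= tags each per-ticker frame, giving a (Ticker, Date) MultiIndex.
dfall = pd.concat([fetch_stock(t) for t in tickers],
                  keys=tickers, names=['Ticker', 'Date'])
dfall.to_csv('fundamentals.csv')  # hypothetical output file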
Related
I have two data sets.
df1 contains a stock ticker and date (these are insider trading transactions) -- about 15 million examples.
df2 contains a stock ticker and date (these are M&A events) -- more than 100,000 examples.
I want to create a column in df1, called 'FutureEvent', which is a binary indicator of whether there is an M&A event (in df2) within 30 days of the date in df1.
The way I currently have it uses iterrows: it takes every transaction and then loops through the M&A events. Obviously this is incredibly slow, since it requires 15M * 100,000+ iterations. Is there a faster way to do this?
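One vectorised alternative to the iterrows loop is pandas.merge_asof, which can match each transaction to the next event for the same ticker within a time window. A minimal sketch, assuming both frames have 'ticker' and 'date' columns already parsed with pd.to_datetime (the column names are my assumption):
import pandas as pd

df1 = df1.sort_values('date').reset_index(drop=True)
events = df2.rename(columns={'date': 'event_date'}).sort_values('event_date')

# For each transaction, find the next M&A event for the same ticker
# that falls within the following 30 days (NaT if there is none).
matched = pd.merge_asof(
    df1,
    events,
    left_on='date',
    right_on='event_date',
    by='ticker',
    direction='forward',
    tolerance=pd.Timedelta(days=30),
)
df1['FutureEvent'] = matched['event_date'].notna().astype(int)
This replaces the 15M * 100,000+ pairwise comparisons with a single sorted merge.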
I have made my dataframe, but I want to sort it by date. For example, I want the data for 02.01.2016 to come just after 01.01.2016.
df_data_2311 = df_data_231.groupby('Date').agg({'Wind Offshore in [MW]': ['sum']})
df_data_2311 = pd.DataFrame(df_data_2311)
After running this, I got the below output. This dataframe has 2192 rows.
Wind Offshore in [MW]
sum
Date
01.01.2016 5249.75
01.01.2017 12941.75
01.01.2018 19020.00
01.01.2019 13723.00
01.01.2020 17246.25
... ...
31.12.2017 21322.50
31.12.2018 13951.75
31.12.2019 21457.25
31.12.2020 16491.25
31.12.2021 35683.25
Kindly let me know how I can sort this data chronologically by date.
You can use the sort_values function in pandas.
df_data_2311.sort_values(by=["Date"])
However, in order to sort by the Date column you will need to call reset_index() on your grouped dataframe and then convert the date values to datetime with pandas.to_datetime; otherwise the dates are compared as day-first strings, which is the ordering you are currently seeing. (Using 'sum' instead of ['sum'] in agg keeps the column labels flat, so "Date" stays an ordinary column.)
df_data_2311 = df_data_231.groupby('Date').agg({'Wind Offshore in [MW]': 'sum'}).reset_index()
df_data_2311["Date"] = pd.to_datetime(df_data_2311["Date"], format="%d.%m.%Y")
df_data_2311 = df_data_2311.sort_values(by=["Date"])
I recommend reviewing the pandas docs.
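If you would rather keep Date as the index and skip reset_index, an alternative sketch is to parse the index itself and sort it:
df_data_2311 = df_data_231.groupby('Date')[['Wind Offshore in [MW]']].sum()
df_data_2311.index = pd.to_datetime(df_data_2311.index, format="%d.%m.%Y")
df_data_2311 = df_data_2311.sort_index()
sort_index then orders the rows chronologically rather than alphabetically.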
I am uploading a zipped file that contains a single CSV file. I then unzip it on the server and load it into a dataframe. Then I am creating django objects from it.
This works fine until my dataframe becomes too large. When the dataset gets too big, the Django server shuts down. I am assuming this happens because every iteration of my loop uses more memory, and when there is no memory left the process is killed. I am working with big datasets and want my code to work no matter how big the dataframe is.
Imagine I have a df like this:
cola fanta sprite libella
2018-01-01 00:00:00 0 12 12 34
2018-01-01 01:00:00 1 34 23 23
2018-01-01 02:00:00 2 4 2 4
2018-01-01 03:00:00 3 24 2 4
2018-01-01 04:00:00 4 2 2 5
Imagine that there could be up to 1000 brand columns and more than half a million rows. Further imagine I have a model that saves this data in a JSONB field: every column becomes a Django object, with the column name as the name and the timestamps paired with that column's values as the JSONB field.
e.g.: name=fanta, json_data={ "2018-01-01 00:00:00": 12, "2018-01-01 01:00:00": 34 }
My code to unzip, load to df and then create a django instance is this:
df = pd.read_csv(
    file_decompressed.open(name_file),
    index_col=0,
)
for column_name in df:
    Drink.objects.create(
        name=column_name,
        json_data=df[column_name].to_dict(),
    )
As I said, this works, but my loop breaks after having created about 15 objects. Searching the internet I found that bulk_create could make this more efficient, but I have custom signals implemented, so that is not really a solution. I also thought about using to_sql, but since I have to restructure the dataframe's data I don't think that will work. Maybe not use pandas at all? Maybe chunk the dataframe somehow?
I am looking for a solution that works independently of the number of columns; the number of rows is at most half a million. I also tried a while loop, but the same problem occurs.
Any ideas how I can make this work? Help is very much appreciated.
My dummy model could be as simple as:
class Juice(models.Model):
    name = models.CharField(...)
    json_data = models.JSONField(...)
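One way to keep memory roughly constant no matter how wide the frame is would be to skip loading the whole CSV and instead read a handful of columns per pass with read_csv's usecols; objects are still created one at a time, so custom signals keep firing. A rough sketch under that assumption (the batch size and the re-reading of the zip member on every pass are my choices):
import pandas as pd

# Read only the header row to learn the column names; the first column is the index.
header = pd.read_csv(file_decompressed.open(name_file), nrows=0).columns
index_col, value_columns = header[0], list(header[1:])

BATCH = 50  # hypothetical batch size; tune to your memory budget
for start in range(0, len(value_columns), BATCH):
    batch = value_columns[start:start + BATCH]
    # Re-open the zip member and load only the index column plus this batch.
    part = pd.read_csv(
        file_decompressed.open(name_file),
        usecols=[index_col] + batch,
        index_col=index_col,
    )
    for column_name in batch:
        Drink.objects.create(
            name=column_name,
            json_data=part[column_name].to_dict(),
        )
    del part  # free this batch before reading the next one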
I am trying to store a pandas series column with around 3000 rows in a new table in PostgreSQL. This column is part of an Excel file holding time-series data recorded by a number of different sensors. My Excel file looks like this:
dateTime              sensor1  sensor2  ...  sensor-n
2021-06-12 00:00:01   0.0
2021-06-13 00:00:03   0.0      1.0
2021-06-14 00:00:05   0.0
...
If I store the name of the sensor and the pandas series for each sensor, this gives me redundancy. Do you have any idea how I can store a pandas series efficiently for different sensors in PostgreSQL? Please help, I am new to PostgreSQL. Thank you.
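One common layout that avoids the redundancy is a long ("tidy") table with one row per (timestamp, sensor, value) reading; you can reshape with DataFrame.melt and write it with to_sql. A minimal sketch, where the file name, connection string, and table name are placeholders of mine:
import pandas as pd
from sqlalchemy import create_engine

df = pd.read_excel("sensors.xlsx")  # placeholder file name

# Wide (one column per sensor) -> long (one row per reading); drop empty cells.
long_df = df.melt(id_vars="dateTime", var_name="sensor", value_name="value")
long_df = long_df.dropna(subset=["value"])

engine = create_engine("postgresql://user:password@localhost:5432/mydb")  # placeholder credentials
long_df.to_sql("sensor_readings", engine, if_exists="append", index=False)
In PostgreSQL this gives a three-column table (timestamp, sensor name, value) that you can index on (sensor, timestamp), instead of one sparse column per sensor.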
I have a number of dataframes, all of which contain columns labeled 'Date' and 'Cost' along with additional columns. I'd like to add up the numerical data in the 'Cost' columns across the different frames, lining up the rows by the dates in the 'Date' columns, to produce a time series of total cost for each date.
There are different numbers of rows in each of the dataframes.
This seems like something that Pandas should be well suited to doing, but I can't find a clean solution.
Any help appreciated!
Here are two of the dataframes:
df1:
Date Total Cost Funded Costs
0 2015-09-30 724824 940451
1 2015-10-31 757605 940451
2 2015-11-15 788051 940451
3 2015-11-30 809368 940451
df2:
Date Total Cost Funded Costs
0 2015-11-30 3022 60000
1 2016-01-15 3051 60000
I want the resulting dataframe to have five rows (there are five different dates) and a single column containing the total of the 'Total Cost' columns from the dataframes. Initially I used the following:
totalFunding = df1['Total Cost'].values + df2['Total Cost'].values
This worked fine until there were different dates in each of the dataframes.
Thanks!
The solution posted below works great, except that I need to apply it repeatedly, as I have a number of data frames. I created the following function:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value=0)
    return dfTotal
This works fine when adding the first two dataframes. However, the set_index/add step leaves 'Date' as the index of the resulting sum rather than as a column, so subsequent passes through the function fail when they call set_index('Date') again. Here is what dfTotal looks like after the first two data frames are added together:
Total Cost Funded Costs Remaining Cost Total Employee Hours
Date
2015-09-30 1449648 1880902 431254 7410.6
2015-10-31 1515210 1880902 365692 7874.4
2015-11-15 1576102 1880902 304800 8367.2
2015-11-30 1618736 1880902 262166 8578.0
2015-12-15 1671462 1880902 209440 8945.2
2015-12-31 1721840 1880902 159062 9161.2
2016-01-15 1764894 1880902 116008 9495.0
Note that what was originally a column called 'Date' is now the index, which causes df.set_index('Date') to raise an error on subsequent passes through my function.
DataFrame.add does exactly what you're looking for; it matches the DataFrames based on index, so:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)
should do the trick. If you just want the Total Cost column and you want it as a DataFrame:
df1.set_index('Date').add(df2.set_index('Date'), fill_value=0)[['Total Cost']]
See also the documentation for DataFrame.add at:
http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.add.html
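With the two sample frames above, the single-column version works out to something like this (hand-computed from the data shown; dates that appear in only one frame keep that frame's cost because of fill_value=0):
            Total Cost
Date
2015-09-30    724824.0
2015-10-31    757605.0
2015-11-15    788051.0
2015-11-30    812390.0
2016-01-15      3051.0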
Solution found. As mentioned, the set_index/add step left the 'Date' column as the dataframe index. This was resolved by copying the index back into a column:
dfTotal['Date'] = dfTotal.index
The complete function is then:
def addDataFrames(f_arg, *argv):
    dfTotal = f_arg
    for arg in argv:
        dfTotal = dfTotal.set_index('Date').add(arg.set_index('Date'), fill_value=0)
        dfTotal['Date'] = dfTotal.index
    return dfTotal
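An alternative sketch that avoids shuttling 'Date' between column and index on every pass is to set the index once per frame and fold the additions with functools.reduce (the function name here is my own):
import functools

def add_dataframes(*frames):
    # Index each frame by Date once, then add pairwise, treating missing dates as 0.
    indexed = [f.set_index('Date') for f in frames]
    total = functools.reduce(lambda a, b: a.add(b, fill_value=0), indexed)
    return total.reset_index()  # Date back as a regular column

# e.g. dfTotal = add_dataframes(df1, df2, df3)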