Python- Convert Dict hidden in List to DataFrame - python

I'm having trouble using the normalize in JSON for a dictionary that is being recognized as a list. The goal is to create a data frame from yahoo_finance.
from yahoofinancials import YahooFinancials
import pandas as pd
from pandas.io.json import json_normalize
ticker = 'AAPL'
yahoo_financials = YahooFinancials(ticker)
balance_sheet_data_qt = yahoo_financials.get_financial_stmts('quarterly', 'balance')
#The return is a bit messy, I've simplified it with:
user_dict=balance_sheet_data_qt.get('balanceSheetHistoryQuarterly').get(ticker)
df=pd.DataFrame(user_dict)
But Still having trouble taking the data across the finish line, the goal would be for each quarterly date as the index for each row, and the key finances listed as columns.

You can use ChainMap from collections.
from collections import ChainMap
df = pd.DataFrame.from_dict(ChainMap(*user_dict), orient='index')
If you don't want to use ChainMap, you can iterate through the dicts in user_dict (a list), and then append these DFs to the main df.
df = pd.DataFrame()
for d in user_dict:
df = df.append(pd.DataFrame.from_dict(d, orient='index'))
ChainMap runs significantly faster for me
1.43 ms ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vs
7 ms ± 121 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

I think this is what you are looking for.
The following has the date as index and the financial data in the columns:
dataframe_entries = list()
for result in balance_sheet_data_qt.get('balanceSheetHistoryQuarterly').get('AAPL'):
extracted_date = list(result)[0]
dataframe_row = list(result.values())[0]
dataframe_row['date'] = extracted_date
dataframe_entries.append(dataframe_row)
df = pd.DataFrame(dataframe_entries).set_index('date')
Outputs:
date accountsPayable treasuryStock
2018-12-29 44293000000 ... -3588000000
2018-09-29 55888000000 ... -3454000000
2018-06-30 38489000000 ... -3111000000
2018-03-31 34311000000 ... -3064000000

Related

How to extract components from a pandas datetime column and assign them

The following code for getting the week number and year works:
import pandas as pd
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
df['weekNo'] = df['date'].dt.isocalendar().week
df['year'] = df['date'].dt.year
date weekNo year
0 2021-12-05 48 2021
1 2021-12-12 49 2021
2 2021-12-19 50 2021
3 2021-12-26 51 2021
4 2022-01-02 52 2022
5 2022-01-09 1 2022
6 2022-01-16 2 2022
7 2022-01-23 3 2022
8 2022-01-30 4 2022
9 2022-02-06 5 2022
but,
df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
Gives the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_26440/999845293.py in <module>
----> 1 df['weekYear'] = "%d/%d" % (df['date'].dt.isocalendar().week, df['date'].dt.year)
TypeError: %d format: a number is required, not Series
I am accessing the week and year in a way that accesses the series of values, as shown by the first code snippet. Why doesn't that work when I want a formatted string? How do I re-write the code in snippet 2, to make it work? I don't want to make intermediate columns.
Why doesn't that work when I want a formatted string? The error is clear, because '%d' expects a single decimal value, not a pandas.Series
Providing there is a format code for the value to be extracted, dt.strftime can be used.
This requires the 'date' column to be a datetime dtype, which can be done with pd.to_datetime. The column in the following example is already the correct dtype.
'%V': ISO 8601 week as a decimal number with Monday as the first day of the week. Week 01 is the week containing Jan 4.
'%Y': Year with century as a decimal number.
import pandas as pd
# sample data
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='w', periods=10), columns=['date'])
# add week number and year
df['weekYear'] = df.date.dt.strftime('%V/%Y')
# display(df)
date weekYear
0 2021-12-05 48/2021
1 2021-12-12 49/2021
2 2021-12-19 50/2021
3 2021-12-26 51/2021
4 2022-01-02 52/2022
5 2022-01-09 01/2022
6 2022-01-16 02/2022
7 2022-01-23 03/2022
8 2022-01-30 04/2022
9 2022-02-06 05/2022
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df.date.dt.strftime('%V/%Y')
[out]: 3.74 s ± 19.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can just use:
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
Or using pandas.Series.str.cat
df['weekYear'] = df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
Or using list comprehension
df['weekYear'] = [f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].dt.isocalendar().week.astype(str) + '/' + df['date'].dt.year.astype(str)
[out]: 886 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df['date'].dt.isocalendar().week.astype(str).str.cat(df['date'].dt.year.astype(str), sep='/')
[out]: 965 ms ± 8.56 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
[f"{week}/{year}" for week, year in zip(df['date'].dt.isocalendar().week, df['date'].dt.year)]
[out]: 587 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If you want to use the formatting, can use map to get that map or apply the formatting to every road, the .dt is not needed since you will be working with date itself, not Series of dates. Also isocalendar() returns a tuple where second element is the week number:
df["date"] = pd.to_datetime(df["date"])
df['weekYear'] = df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
Timing for 1M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='h', periods=1000000), columns=['date'])
%%timeit
df['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
[out]: 2.03 s ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There are clearly a number of ways this can be solved, so a timing comparison is the best way to determine which is the "best" answer.
Here's a single implementation for anyone to run a timing analysis in Jupyter of all the current answers.
See this answer to modify the code to create a timing analysis plot with a varying number of rows.
See IPython: %timeit for the option descriptions.
import pandas as pd
# sample data with 60M rows
df = pd.DataFrame(data=pd.date_range('2021-11-29', freq='s', periods=60000000), columns=['date'])
# functions
def test1(d):
return d.date.dt.strftime('%V/%Y')
def test2(d):
return d['date'].dt.isocalendar().week.astype(str) + '/' + d['date'].dt.year.astype(str)
def test3(d):
return d['date'].dt.isocalendar().week.astype(str).str.cat(d['date'].dt.year.astype(str), sep='/')
def test4(d):
return [f"{week}/{year}" for week, year in zip(d['date'].dt.isocalendar().week, d['date'].dt.year)]
def test5(d):
return d['date'].map(lambda x: "%d/%d" % (x.isocalendar()[1], x.year))
t1 = %timeit -r2 -n1 -q -o test1(df)
t2 = %timeit -r2 -n1 -q -o test2(df)
t3 = %timeit -r2 -n1 -q -o test3(df)
t4 = %timeit -r2 -n1 -q -o test4(df)
t5 = %timeit -r2 -n1 -q -o test5(df)
print(f'test1 result: {t1}')
print(f'test2 result: {t2}')
print(f'test3 result: {t3}')
print(f'test4 result: {t4}')
print(f'test5 result: {t5}')
test1 result: 3min 45s ± 653 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test2 result: 53.4 s ± 459 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test3 result: 59.7 s ± 164 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test4 result: 35.5 s ± 409 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)
test5 result: 2min 2s ± 29.1 ms per loop (mean ± std. dev. of 2 runs, 1 loop each)

What's the most efficient way to convert a time-series data into a cross-sectional one?

Here's the thing, I have the dataset below where date is the index:
date value
2020-01-01 100
2020-02-01 140
2020-03-01 156
2020-04-01 161
2020-05-01 170
.
.
.
And I want to transform it in this other dataset:
value_t0 value_t1 value_t2 value_t3 value_t4 ...
100 NaN NaN NaN NaN ...
140 100 NaN NaN NaN ...
156 140 100 NaN NaN ...
161 156 140 100 NaN ...
170 161 156 140 100 ...
First I thought about using pandas.pivot_table to do something, but that would just provide a different layout grouped by some column, which is not exactly what I want. Later, I thought about using pandasql and apply 'case when', but that wouldn't work because I would have to type dozens of lines of code. So I'm stuck here.
try this:
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
The series .shift(n) method can get you a single column of your desired output by shifting everything down and filling in NaNs above. So we're building a new dataframe by feeding it a dictionary of the form {column name: column data, ...}, by using dictionary comprehension to iterate through your original dataframe.
I think the best is use numpy
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0], 1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
Times for 5000 rows
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values).add_prefix('value_t')
556 ms ± 35.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = pd.DataFrame({f"value_t{i}": df['value'].shift(i) for i in range(len(df))})
1.31 s ± 36.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Time without add_prefix
%%timeit
values = np.asarray(df['value'].astype(float))
new_values = np.tril(np.repeat([values], values.shape[0], axis=0).T)
new_values[np.triu_indices(new_values.shape[0],1)] = np.nan
new_df = pd.DataFrame(new_values)
357 ms ± 8.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas DatetimeIndex: Number of periods in a frequency string?

How can I get a count of the number of periods in a Pandas DatetimeIndex using a frequency string (offset alias)? For example, let's say I have the following DatetimeIndex:
idx = pd.date_range("2019-03-01", periods=10000, freq='5T')
I would like to know how many 5 minute periods are in a week, or '7D'. I can calculate this "manually":
periods = (7*24*60)//5
Or I can get the length of a dummy index:
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
Neither approach seems very efficient. Is there a better way using Pandas date functionality?
try using numpy
len(np.arange(pd.Timedelta('1 days'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
2016
My testing, first import time:
import time
the OP solution:
start_time = time.time()
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
print((time.time() - start_time))
out:
0.0011057853698730469]
using numpy
start_time = time.time()
len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
print((time.time() - start_time))
out:
0.0001723766326904297
Follow the sugestion of #meW, doing the performance test using timeit
using timedelta_range:
%timeit len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
out:
91.1 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using numpy:
%timeit len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
16.3 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I finally figured out a reasonable solution:
pd.to_timedelta('7D')//idx.freq
This has the advantage that I can specify a range using a frequency string (offset alias) and the period or frequency is inferred from the dataframe. The numpy solution suggested by #Terry is still the fastest solution where speed is important.

Alternative to groupby for generating a summary table from tidy pandas DataFrame

I want to generate a summary table from a tidy pandas DataFrame. I now use groupby and two for loops, which does not seem efficient. Seems stacking and unstacking would get me there, but I have failed.
Sample data
import pandas as pd
import numpy as np
import copy
import random
df_tidy = pd.DataFrame(columns = ['Stage', 'Exc', 'Cat', 'Score'])
for _ in range(10):
df_tidy = df_tidy.append(
{
'Stage': random.choice(['OP', 'FUEL', 'EOL']),
'Exc': str(np.random.randint(low=0, high=1000)),
'Cat': random.choice(['CC', 'HT', 'PM']),
'Score': np.random.random(),
}, ignore_index=True
)
df_tidy
returns
Stage Exc Cat Score
0 OP 929 HT 0.946234
1 OP 813 CC 0.829522
2 FUEL 114 PM 0.868605
3 OP 896 CC 0.382077
4 FUEL 10 CC 0.832246
5 FUEL 515 HT 0.632220
6 EOL 970 PM 0.532310
7 FUEL 198 CC 0.209856
8 FUEL 848 CC 0.479470
9 OP 968 HT 0.348093
I would like a new DataFrame with Stages as columns, Cats as rows and sum of Scores as values. I achieve it this way:
Working but probably inefficient approach
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
for lcs, smaller_df in small_df.groupby('Stage'):
new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
new_df
Which returns what I want:
OP FUEL EOL Total
CC 1.2116 1.52157 NaN 2.733170
HT 1.29433 0.63222 NaN 1.926548
PM NaN 0.868605 0.53231 1.400915
But I cannot believe this is the simplest or most efficient path.
Question
What pandas magic am I missing out on?
Update - Timing the proposed solutions
To understand the differences between pivot_table and crosstab proposed below, I timed the three solutions with a 100,000 row dataframe built exactly as above:
groupby solution, that I thought was inefficient:
%%timeit
new_df = pd.DataFrame(columns=list(df_tidy['Stage'].unique()))
for cat, small_df in df_tidy.groupby('Cat'):
for lcs, smaller_df in small_df.groupby('Stage'):
new_df.loc[cat, lcs] = smaller_df['Score'].sum()
new_df['Total'] = new_df.sum(axis=1)
41.2 ms ± 3.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
crosstab solution, that requires a creation of a DataFrame in the background, even if the passed data is already in DataFrame format:
%%timeit
pd.crosstab(index=df_tidy.Cat,columns=df_tidy.Stage, values=df_tidy.Score, aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
67.8 ms ± 1.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
pivot_table solution:
%%timeit
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]
713 ms ± 20.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, it would appear that the clunky groupbysolution is the quickest.
A simple solution from crosstab
pd.crosstab(index=df.Cat,columns=df.Stage,values=df.Score,aggfunc='sum', margins = True, margins_name = 'Total').iloc[:-1,:]
Out[342]:
Stage EOL FUEL OP Total
Cat
CC NaN 1.521572 1.211599 2.733171
HT NaN 0.632220 1.294327 1.926547
PM 0.53231 0.868605 NaN 1.400915
I was wondering if not a simpler solution than using pd.crosstab is to use pd.pivot:
pd.pivot_table(df_tidy, index=['Cat'], columns=["Stage"], margins=True, margins_name='Total', aggfunc=np.sum).iloc[:-1,:]

Python date string to date object

How do I convert a string to a date object in python?
The string would be: "24052010" (corresponding to the format: "%d%m%Y")
I don't want a datetime.datetime object, but rather a datetime.date.
You can use strptime in the datetime package of Python:
>>> import datetime
>>> datetime.datetime.strptime('24052010', "%d%m%Y").date()
datetime.date(2010, 5, 24)
import datetime
datetime.datetime.strptime('24052010', '%d%m%Y').date()
Directly related question:
What if you have
datetime.datetime.strptime("2015-02-24T13:00:00-08:00", "%Y-%B-%dT%H:%M:%S-%H:%M").date()
and you get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/_strptime.py", line 308, in _strptime
format_regex = _TimeRE_cache.compile(format)
File "/usr/local/lib/python2.7/_strptime.py", line 265, in compile
return re_compile(self.pattern(format), IGNORECASE)
File "/usr/local/lib/python2.7/re.py", line 194, in compile
return _compile(pattern, flags)
File "/usr/local/lib/python2.7/re.py", line 251, in _compile
raise error, v # invalid expression
sre_constants.error: redefinition of group name 'H' as group 7; was group 4
and you tried:
<-24T13:00:00-08:00", "%Y-%B-%dT%HH:%MM:%SS-%HH:%MM").date()
but you still get the traceback above.
Answer:
>>> from dateutil.parser import parse
>>> from datetime import datetime
>>> parse("2015-02-24T13:00:00-08:00")
datetime.datetime(2015, 2, 24, 13, 0, tzinfo=tzoffset(None, -28800))
If you are lazy and don't want to fight with string literals, you can just go with the parser module.
from dateutil import parser
dt = parser.parse("Jun 1 2005 1:33PM")
print(dt.year, dt.month, dt.day,dt.hour, dt.minute, dt.second)
>2005 6 1 13 33 0
Just a side note, as we are trying to match any string representation, it is 10x slower than strptime
you have a date string like this, "24052010" and you want date object for this,
from datetime import datetime
cus_date = datetime.strptime("24052010", "%d%m%Y").date()
this cus_date will give you date object.
you can retrieve date string from your date object using this,
cus_date.strftime("%d%m%Y")
For single value the datetime.strptime method is the fastest
import arrow
from datetime import datetime
import pandas as pd
l = ['24052010']
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.86 µs ± 56.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
305 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
46 µs ± 978 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
For a list of values the pandas pd.to_datetime is the fastest
l = ['24052010'] * 1000
%timeit _ = list(map(lambda x: datetime.strptime(x, '%d%m%Y').date(), l))
6.32 ms ± 89.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit _ = list(map(lambda x: x.date(), pd.to_datetime(l, format='%d%m%Y')))
1.76 ms ± 27.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit _ = list(map(lambda x: arrow.get(x, 'DMYYYY').date(), l))
44.5 ms ± 522 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
For ISO8601 datetime format the ciso8601 is a rocket
import ciso8601
l = ['2010-05-24'] * 1000
%timeit _ = list(map(lambda x: ciso8601.parse_datetime(x).date(), l))
241 µs ± 3.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
There is another library called arrow really great to make manipulation on python date.
import arrow
import datetime
a = arrow.get('24052010', 'DMYYYY').date()
print(isinstance(a, datetime.date)) # True
string "24052010"
In a very manual way you could just go like this:-
first split the string as (yyyy-mm-dd) format so you could get a tuple something like this (2010, 5, 24), then simply convert this tuple to a date format something like 2010-05-24.
you could run this code on a list of string object similar to above and convert the entire list of tuples object to date object by simply unpacking(*tuple) check the code below.
import datetime
#for single string simply use:-
my_str= "24052010"
date_tup = (int(my_str[4:]),int(my_str[2:4]),int(my_str[:2]))
print(datetime.datetime(*date_tup))
output: 2012-01-01 00:00:00
# for a list of string date objects you could use below code.
date_list = []
str_date = ["24052010", "25082011", "25122011","01012012"]
for items in str_date:
date_list.append((int(items[4:]),int(items[2:4]),int(items[:2])))
for dates in date_list:
# unpack all tuple objects and convert to date
print(datetime.datetime(*dates))
output:
2010-05-24 00:00:00
2011-08-25 00:00:00
2011-12-25 00:00:00
2012-01-01 00:00:00
Use time module to convert data.
Code snippet:
import time
tring='20150103040500'
var = int(time.mktime(time.strptime(tring, '%Y%m%d%H%M%S')))
print var

Categories

Resources