I'd like to create a multi-index dataframe from a dictionary of dataframes, where the top-level index is the index of the dataframes within the dictionary and the second-level index is the keys of the dictionary.
Example:
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1,3],[7,4],[5,8]], index=dt_index, columns=column_names),
           'B': pd.DataFrame([[12,3],[9,8],[75,0]], index=dt_index, columns=column_names),
           'C': pd.DataFrame([[3,12],[5,1],[22,5]], index=dt_index, columns=column_names)}
Expected output:
Y X
2003-05-01 A 1 3
2003-05-01 B 12 3
2003-05-01 C 3 12
2003-05-02 A 7 4
2003-05-02 B 9 8
2003-05-02 C 5 1
2003-05-03 A 5 8
2003-05-03 B 75 0
2003-05-03 C 22 5
I've tried
pd.concat(df_dict, axis=0)
but this gives me the levels of the MultiIndex in the incorrect order.
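For reference, this is roughly what the plain pd.concat(df_dict, axis=0) produces on the example above: the dictionary keys end up as the outer level and the dates as the inner level, i.e. the reverse of what I want.
              Y   X
A 2003-05-01   1   3
  2003-05-02   7   4
  2003-05-03   5   8
B 2003-05-01  12   3
  2003-05-02   9   8
  2003-05-03  75   0
C 2003-05-01   3  12
  2003-05-02   5   1
  2003-05-03  22   5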
Edit: Timings
Based on the answers so far, this seems to be a slow operation to perform as the DataFrame scales.
Larger dummy data:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
Converting the dictionary to a dataframe, albeit with the index levels swapped, takes:
%timeit pd.concat(df_dict, axis=0)
63.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Even in the best case, creating a dataframe with the index levels in the other order takes 8 times longer than the above!
%timeit pd.concat(df_dict, axis=0).swaplevel().sort_index()
528 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat(df_dict, axis=1).stack(0)
1.72 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use DataFrame.swaplevel with DataFrame.sort_index:
df = pd.concat(df_dict, axis=0).swaplevel(0,1).sort_index()
print (df)
Y X
2003-05-01 A 1 3
B 12 3
C 3 12
2003-05-02 A 7 4
B 9 8
C 5 1
2003-05-03 A 5 8
B 75 0
C 22 5
You can reach down into numpy for a speed up if you can guarantee two things:
Each of your DataFrames in df_dict has the exact same index.
Each of your DataFrames is already sorted (see the quick check sketched after this list).
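Before taking the shortcut, those two assumptions could be verified up front along these lines (this check is my sketch, not part of the original answer; the names match the snippet below):
# sketch: verify the assumptions behind the numpy shortcut
first = next(iter(df_dict.values()))
assert all(df.index.equals(first.index) for df in df_dict.values()), "indices differ"
assert first.index.is_monotonic_increasing, "index is not sorted"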
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# check if this result is consistent with other answers
assert (pd.concat(df_dict, axis=0).swaplevel().sort_index() == out).all().all()
Timing:
%%timeit
pd.concat(df_dict, axis=0)
# 26.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, 500),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# 31.2 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat(df_dict, axis=0).swaplevel().sort_index()
# 123 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
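If the original frames carry named columns (as in the small example at the top of the question), those names can be passed through as well. A sketch, assuming every frame shares the same columns:
out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
    columns=df_dict["a"].columns,  # reuse the shared column labels
)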
Use concat on axis=1 and stack:
out = pd.concat(df_dict, axis=1).stack(0)
Output:
X Y
2003-05-01 A 3 1
B 3 12
C 12 3
2003-05-02 A 4 7
B 8 9
C 1 5
2003-05-03 A 8 5
B 0 75
C 5 22
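Note that the columns come out here in sorted order (X before Y) rather than in the original order. If the original column order matters, one option (not part of the answer above) is to select the columns afterwards:
# reorder the columns back to the original ['Y', 'X'] from the question's setup
out = pd.concat(df_dict, axis=1).stack(0)[column_names]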
I have a pandas Series/column where the values look like this:
Values
101;1001
130;125
113;99
1001;101
I need to sort the values within each cell, with an expected outcome like the one below, using Python. The dataframe is large, with more than 5 million values, so any faster way would be appreciated.
Values
101;1001
125;130
99;113
101;1001
Convert the split values to integers, sort them, convert back to strings and join:
df['Values'] = df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
Performance:
#10k rows
df = pd.concat([df] * 10000, ignore_index=True)
#enke solution
In [52]: %timeit df['Values'].str.split(';').explode().sort_values(key=lambda x: x.str.zfill(10)).groupby(level=0).agg(';'.join)
616 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [53]: %timeit df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
70.7 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#1M rows
df = pd.concat([df] * 1000000, ignore_index=True)
#mozway solution
In [60]: %timeit df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
8.03 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit df['Values'] = df['Values'].map(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
7.88 s ± 602 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution for 2 columns:
df1 = df['Values'].str.split(';', expand=True).astype(int)
df1 = pd.DataFrame(np.sort(df1, axis=1), index=df1.index, columns=df1.columns)
print (df1)
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001
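If the result needs to go back into the original single-column, semicolon-separated format, the sorted integer columns can be joined again. A small follow-up sketch, not from the answer itself:
# rebuild the 'Values' strings from the sorted integer columns of df1
df['Values'] = df1[0].astype(str) + ';' + df1[1].astype(str)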
You can use a list comprehension; this will be faster on small datasets:
df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
output:
Values
0 101;1001
1 125;130
2 99;113
3 101;1001
timings:
For two columns:
df2 = pd.DataFrame([sorted(map(int, x.split(';'))) for x in df['Values']])
output:
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001
Say I have a dataframe and I want to count how many times each element of a list, e.g. [1,5,2], occurs in a given column.
I could do something like
elem_list = [1,5,2]
for e in elem_list:
    (df["col1"] == e).sum()
but isn't there a better way like
elem_list = [1,5,2]
df["col1"].count_elements(elem_list)
#1 5 # 1 occurs 5 times
#5 3 # 5 occurs 3 times
#2 0 # 2 occurs 0 times
Note that it should count all the elements in the list, and return 0 if an element in the list is not in the column.
You can use value_counts and reindex:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})
elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)
output:
1 5
5 2
2 0
benchmark (100k values):
# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})
df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pass to Categorical, which will return 0 for missing items:
pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]:
1 3
5 0
2 1
dtype: int64
First filter with Series.isin and DataFrame.loc, then use Series.value_counts; finally, if order is important, add Series.reindex:
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
You could do something like this:
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()
# col1
# 1 17
# 0 10
# dtype: int64
I want to calculate daily bond returns from clean prices, based on the logarithm of the bond price at time t divided by the bond price at time t-1. So far, I calculate it like this:
import pandas as pd
import numpy as np
#create example data
col1 = np.random.randint(0,10,size=10)
df = pd.DataFrame()
df["col1"] = col1
df["result"] = [0]*len(df)
#slow computation
for i in range(len(df)):
    if i == 0:
        df["result"][i] = np.nan
    else:
        df["result"][i] = np.log(df["col1"][i] / df["col1"][i-1])
However, since I have a large sample this takes a lot of time to compute. Is there a way to improve the code in order to make it faster?
Use Series.shift on the col1 column together with Series.div for division:
df["result1"] = np.log(df["col1"].div(df["col1"].shift()))
#alternative
#df["result1"] = np.log(df["col1"] / df["col1"].shift())
print (df)
col1 result result1
0 5 NaN NaN
1 0 -inf -inf
2 3 inf inf
3 3 0.000000 0.000000
4 7 0.847298 0.847298
5 9 0.251314 0.251314
6 3 -1.098612 -1.098612
7 5 0.510826 0.510826
8 2 -0.916291 -0.916291
9 4 0.693147 0.693147
I tested both solutions:
np.random.seed(0)
col1 = np.random.randint(0,10,size=10000)
df = pd.DataFrame({'col1':col1})
In [128]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
865 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [129]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
1.16 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
1.03 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(0)
col1 = np.random.randint(0,10,size=100000)
df = pd.DataFrame({'col1':col1})
In [132]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
3.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [133]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
6.31 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
3.75 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
No need to use multiple functions; use Series.pct_change():
df = df.assign(
    result=lambda x: np.log(x.col1.pct_change() + 1)
)
print(df)
col1 result
0 3 NaN
1 5 0.510826
2 8 0.470004
3 7 -0.133531
4 9 0.251314
5 1 -2.197225
6 1 0.000000
7 2 0.693147
8 7 1.252763
9 0 -inf
This should be a much faster way to get the same results:
df["result_2"] = np.log(df["col1"] / df["col1"].shift())
I have several dataframes in the following format:
time, 2019-01-25
07:00-07:30, 180.22
07:30-08:00, 119.12
08:00-08:30, 11.94
08:30-09:00, 41.62
09:00-09:30, 28.69
09:30-10:00, 119.77
...
(I have many files like the above loaded into an array of dataframes called frames.)
And I am using Pandas to merge them with the code:
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['time'],
                                                how='outer'), frames).fillna('0.0').set_index('time')
(the code initially came from here)
The merge technically works; however, the final merged dataframe omits the time column. Does anyone know how to perform the merge as above while still keeping the time column in df_merged?
I would look at using join instead of merge in this situation.
Setup:
df1 = pd.DataFrame({'A':[*'ABCDE'], 'B':np.random.randint(0,10,5)})
df2 = pd.DataFrame({'A':[*'ABCDE'], 'C':np.random.randint(0,100,5)})
df3 = pd.DataFrame({'A':[*'ABCDE'], 'D':np.random.randint(0,1000,5)})
df4 = pd.DataFrame({'A':[*'ABCDE'], 'E':np.random.randint(0,10000,5)})
result1 = reduce(lambda l,r: pd.merge(l,r), [df1,df2,df3,df4])
result2 = df1.set_index('A').join([d.set_index('A') for d in [df2,df3,df4]]).reset_index()
all(result1 == result2)
True
Output(result1):
A B C D E
0 A 7 19 980 8635
1 B 7 44 528 431
2 C 5 4 572 9405
3 D 7 7 96 2596
4 E 1 6 514 940
Timings:
%%timeit
result1 = reduce(lambda l,r: pd.merge(l,r), [df1,df2,df3,df4])
9.37 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
result2 = df1.set_index('A').join([d.set_index('A') for d in [df2,df3,df4]]).reset_index()
4.04 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
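Applied to the frames from the question, the same join idea might look roughly like this (a sketch, assuming frames is the list of dataframes; the final reset_index is what keeps time as a regular column):
df_merged = (
    frames[0].set_index('time')
    .join([d.set_index('time') for d in frames[1:]], how='outer')
    .fillna(0.0)       # fill gaps with numeric zero
    .reset_index()     # keep 'time' as a column instead of the index
)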
I have a Pandas DataFrame with a column (ip) containing certain values, and a separate Pandas Series (not in this DataFrame) holding a collection of these values. I want to create a column in the DataFrame that is 1 if a given row has its ip in my Pandas Series (blacklist).
import pandas as pd
dict = {'ip': {0: 103022, 1: 114221, 2: 47902, 3: 23550, 4: 84644}, 'os': {0: 23, 1: 19, 2: 17, 3: 13, 4: 19}}
df = pd.DataFrame(dict)
df
ip os
0 103022 23
1 114221 19
2 47902 17
3 23550 13
4 84644 19
blacklist = pd.Series([103022, 23550])
blacklist
0 103022
1 23550
My question is: how can I create a new column in df such that it shows 1 when the given ip is in the blacklist and 0 otherwise?
Sorry if this is too dumb; I'm still new to programming. Thanks a lot in advance!
Use isin with astype:
df['new'] = df['ip'].isin(blacklist).astype(np.int8)
It is also possible to convert the column to categorical:
df['new'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
print (df)
ip os new
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
Interestingly, for a large DataFrame, converting to Categorical does not save memory:
df = pd.concat([df] * 10000, ignore_index=True)
df['new1'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
df['new2'] = df['ip'].isin(blacklist).astype(np.int8)
df['new3'] = df['ip'].isin(blacklist)
print (df.memory_usage())
Index 80
ip 400000
os 400000
new1 50096
new2 50000
new3 50000
dtype: int64
Timings:
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
print (len(df))
10000
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
print (len(blacklist))
100
In [320]: %timeit df['ip'].isin(blacklist).astype(np.int8)
465 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
915 µs ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [322]: %timeit pd.Categorical(df['ip'], categories = blacklist.unique()).notnull().astype(int)
1.59 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [323]: %timeit df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
81.8 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Slow, but simple and readable method:
Another way to do this would be to create your new column using a list comprehension, set to assign a 1 if your ip value is in blacklist and a 0 otherwise:
df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
>>> df
ip os new_column
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
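As a side note (not from the answer), converting the blacklist to a plain Python set should make the membership test in this comprehension much faster, since x in blacklist.values scans the whole array for every row:
# sketch: use a set for O(1) membership tests
black_set = set(blacklist.values)
df['new_column'] = [1 if x in black_set else 0 for x in df.ip]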
EDIT: Faster method building on Categorical: If you want to maximize speed, the following would be quite fast, though not quite as fast as the .isin non-categorical method. It builds on the use of pd.Categorical as suggested by @jezrael, but leverages its capacity for assigning categories:
df['new_column'] = pd.Categorical(df['ip'],
                                  categories=blacklist.unique()).notnull().astype(int)
Timings:
import numpy as np
import pandas as pd
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
%timeit df['ip'].isin(blacklist).astype(np.int8)
# 453 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
# 892 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'], categories = \
blacklist.unique()).notnull().astype(int)
# 565 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)