I'd like to create a multi-index dataframe from a dictionary of dataframes, where the top-level index is the index of the dataframes within the dictionary and the second-level index is the keys of the dictionary.
Example:
import pandas as pd
dt_index = pd.to_datetime(['2003-05-01', '2003-05-02', '2003-05-03'])
column_names = ['Y', 'X']
df_dict = {'A': pd.DataFrame([[1,3],[7,4],[5,8]], index=dt_index, columns=column_names),
           'B': pd.DataFrame([[12,3],[9,8],[75,0]], index=dt_index, columns=column_names),
           'C': pd.DataFrame([[3,12],[5,1],[22,5]], index=dt_index, columns=column_names)}
Expected output:
Y X
2003-05-01 A 1 3
2003-05-01 B 12 3
2003-05-01 C 3 12
2003-05-02 A 7 4
2003-05-02 B 9 8
2003-05-02 C 5 1
2003-05-03 A 5 8
2003-05-03 B 75 0
2003-05-03 C 22 5
I've tried
pd.concat(df_dict, axis=0)
but this gives me the levels of the MultiIndex in the incorrect order.
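For reference, this is roughly what the plain pd.concat(df_dict, axis=0) produces on the example above: the dictionary keys end up as the outer level and the dates as the inner level, i.e. the reverse of what I want.
              Y   X
A 2003-05-01   1   3
  2003-05-02   7   4
  2003-05-03   5   8
B 2003-05-01  12   3
  2003-05-02   9   8
  2003-05-03  75   0
C 2003-05-01   3  12
  2003-05-02   5   1
  2003-05-03  22   5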
Edit: Timings
Based on the answers so far, this seems to be a slow operation to perform as the DataFrame scales.
Larger dummy data:
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
Converting the dictionary to a dataframe, albeit with the index levels swapped, takes:
%timeit pd.concat(df_dict, axis=0)
63.4 ms ± 1.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Even in the best case, creating a dataframe with the index levels in the other order takes 8 times longer than the above!
%timeit pd.concat(df_dict, axis=0).swaplevel().sort_index()
528 ms ± 25.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.concat(df_dict, axis=1).stack(0)
1.72 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Use DataFrame.swaplevel with DataFrame.sort_index:
df = pd.concat(df_dict, axis=0).swaplevel(0,1).sort_index()
print (df)
Y X
2003-05-01 A 1 3
B 12 3
C 3 12
2003-05-02 A 7 4
B 9 8
C 5 1
2003-05-03 A 5 8
B 75 0
C 22 5
You can reach down into numpy for a speed up if you can guarantee two things:
Each of your DataFrames in df_dict has the exact same index.
Each of your DataFrames is already sorted (see the quick check sketched after this list).
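Before taking the shortcut, those two assumptions could be verified up front along these lines (this check is my sketch, not part of the original answer; the names match the snippet below):
# sketch: verify the assumptions behind the numpy shortcut
first = next(iter(df_dict.values()))
assert all(df.index.equals(first.index) for df in df_dict.values()), "indices differ"
assert first.index.is_monotonic_increasing, "index is not sorted"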
import numpy as np
import pandas as pd
D = 3000
C = 500
dt_index = pd.date_range('2000-1-1', periods=D)
keys = 'abcdefghijk'
df_dict = {k:pd.DataFrame(np.random.rand(D,C), index=dt_index) for k in keys}
out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# check if this result is consistent with other answers
assert (pd.concat(df_dict, axis=0).swaplevel().sort_index() == out).all().all()
Timing:
%%timeit
pd.concat(df_dict, axis=0)
# 26.2 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, 500),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
)
# 31.2 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat(df_dict, axis=0).swaplevel().sort_index()
# 123 ms ± 1.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
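If the original frames carry named columns (as in the small example at the top of the question), those names can be passed through as well. A sketch, assuming every frame shares the same columns:
out = pd.DataFrame(
    data=np.column_stack([*df_dict.values()]).reshape(-1, C),
    index=pd.MultiIndex.from_product([df_dict["a"].index, df_dict.keys()]),
    columns=df_dict["a"].columns,  # reuse the shared column labels
)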
Use concat on axis=1 and stack:
out = pd.concat(df_dict, axis=1).stack(0)
Output:
X Y
2003-05-01 A 3 1
B 3 12
C 12 3
2003-05-02 A 4 7
B 8 9
C 1 5
2003-05-03 A 8 5
B 0 75
C 5 22
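Note that the columns come out here in sorted order (X before Y) rather than in the original order. If the original column order matters, one option (not part of the answer above) is to select the columns afterwards:
# reorder the columns back to the original ['Y', 'X'] from the question's setup
out = pd.concat(df_dict, axis=1).stack(0)[column_names]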
I have a pandas Series/column where the values look like this:
Values
101;1001
130;125
113;99
1001;101
I need to sort the values within each cell, with an expected outcome like the one below, using Python. The dataframe is large, with more than 5 million values, so any faster way would be appreciated.
Values
101;1001
125;130
99;113
101;1001
Convert the split values to integers, sort them, convert back to strings and join:
df['Values'] = df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
Performance:
#10k rows
df = pd.concat([df] * 10000, ignore_index=True)
#enke solution
In [52]: %timeit df['Values'].str.split(';').explode().sort_values(key=lambda x: x.str.zfill(10)).groupby(level=0).agg(';'.join)
616 ms ± 6.03 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [53]: %timeit df['Values'].apply(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
70.7 ms ± 420 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#1M rows
df = pd.concat([df] * 1000000, ignore_index=True)
#mozway solution
In [60]: %timeit df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
8.03 s ± 409 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [61]: %timeit df['Values'] = df['Values'].map(lambda x: ';'.join(map(str, sorted(map(int, x.split(';'))))))
7.88 s ± 602 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution for 2 columns:
df1 = df['Values'].str.split(';', expand=True).astype(int)
df1 = pd.DataFrame(np.sort(df1, axis=1), index=df1.index, columns=df1.columns)
print (df1)
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001
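If the result needs to go back into the original single-column, semicolon-separated format, the sorted integer columns can be joined again. A small follow-up sketch, not from the answer itself:
# rebuild the 'Values' strings from the sorted integer columns of df1
df['Values'] = df1[0].astype(str) + ';' + df1[1].astype(str)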
You can use a list comprehension; this will be faster on small datasets:
df['Values'] = [';'.join(map(str, sorted(map(int, x.split(';'))))) for x in df['Values']]
output:
Values
0 101;1001
1 125;130
2 99;113
3 101;1001
timings:
For two columns:
df2 = pd.DataFrame([sorted(map(int, x.split(';'))) for x in df['Values']])
output:
0 1
0 101 1001
1 125 130
2 99 113
3 101 1001
Say I have a dataframe and I want to count how many times each element of a list, e.g. [1,5,2], occurs in a given column.
I could do something like
elem_list = [1,5,2]
for e in elem_list:
    (df["col1"] == e).sum()
but isn't there a better way like
elem_list = [1,5,2]
df["col1"].count_elements(elem_list)
#1 5 # 1 occurs 5 times
#5 3 # 5 occurs 3 times
#2 0 # 2 occurs 0 times
Note that it should count all the elements in the list, and return 0 if an element in the list is not in the column.
You can use value_counts and reindex:
df = pd.DataFrame({'col1': [1,1,5,1,5,1,1,4,3]})
elem_list = [1,5,2]
df['col1'].value_counts().reindex(elem_list, fill_value=0)
output:
1 5
5 2
2 0
benchmark (100k values):
# setup
df = pd.DataFrame({'col1': np.random.randint(0,10, size=100000)})
df['col1'].value_counts().reindex(elem_list, fill_value=0)
# 774 µs ± 10.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
pd.Categorical(df['col1'],elem_list).value_counts()
# 2.72 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
# 2.98 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Pass to Categorical, which will return 0 for missing items:
pd.Categorical(df['col1'],elem_list).value_counts()
Out[62]:
1 3
5 0
2 1
dtype: int64
First filter with Series.isin and DataFrame.loc, then use Series.value_counts; finally, if order is important, add Series.reindex:
df.loc[df["col1"].isin(elem_list), 'col1'].value_counts().reindex(elem_list, fill_value=0)
You could do something like this:
df = pd.DataFrame({"col1":np.random.randint(0,10, 100)})
df[df["col1"].isin([0,1])].value_counts()
# col1
# 1 17
# 0 10
# dtype: int64
I want to calculate daily bond returns from clean prices, based on the logarithm of the bond price at time t divided by the bond price at time t-1. So far, I calculate it like this:
import pandas as pd
import numpy as np
#create example data
col1 = np.random.randint(0,10,size=10)
df = pd.DataFrame()
df["col1"] = col1
df["result"] = [0]*len(df)
#slow computation
for i in range(len(df)):
    if i == 0:
        df["result"][i] = np.nan
    else:
        df["result"][i] = np.log(df["col1"][i] / df["col1"][i-1])
However, since I have a large sample this takes a lot of time to compute. Is there a way to improve the code in order to make it faster?
Use Series.shift on the col1 column together with Series.div for division:
df["result1"] = np.log(df["col1"].div(df["col1"].shift()))
#alternative
#df["result1"] = np.log(df["col1"] / df["col1"].shift())
print (df)
col1 result result1
0 5 NaN NaN
1 0 -inf -inf
2 3 inf inf
3 3 0.000000 0.000000
4 7 0.847298 0.847298
5 9 0.251314 0.251314
6 3 -1.098612 -1.098612
7 5 0.510826 0.510826
8 2 -0.916291 -0.916291
9 4 0.693147 0.693147
I tested both solutions:
np.random.seed(0)
col1 = np.random.randint(0,10,size=10000)
df = pd.DataFrame({'col1':col1})
In [128]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
865 µs ± 139 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [129]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
1.16 ms ± 11.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [130]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
1.03 ms ± 14.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
np.random.seed(0)
col1 = np.random.randint(0,10,size=100000)
df = pd.DataFrame({'col1':col1})
In [132]: %timeit df["result1"] = np.log(df["col1"] / df["col1"].shift())
3.7 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [133]: %timeit df.assign(result=lambda x: np.log(x.col1.pct_change() + 1))
6.31 ms ± 545 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df["result1"] = np.log(df["col1"].pct_change() + 1)
3.75 ms ± 269 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
No need to use multiple functions; use Series.pct_change():
df = df.assign(
    result=lambda x: np.log(x.col1.pct_change() + 1)
)
print(df)
col1 result
0 3 NaN
1 5 0.510826
2 8 0.470004
3 7 -0.133531
4 9 0.251314
5 1 -2.197225
6 1 0.000000
7 2 0.693147
8 7 1.252763
9 0 -inf
This should be a much faster way to get the same results:
df["result_2"] = np.log(df["col1"] / df["col1"].shift())
I have several dataframes in the following format:
time, 2019-01-25
07:00-07:30, 180.22
07:30-08:00, 119.12
08:00-08:30, 11.94
08:30-09:00, 41.62
09:00-09:30, 28.69
09:30-10:00, 119.77
...
(I have many files like the above loaded into an array of dataframes called frames.)
And I am using Pandas to merge them with the code:
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['time'],
                                                how='outer'), frames).fillna('0.0').set_index('time')
(the code initially came from here)
The merge technically works; however, the final merged dataframe omits the time column. Does anyone know how to perform the merge as above while still keeping the time column in df_merged?
I would look at using join instead of merge in this situation.
Setup:
df1 = pd.DataFrame({'A':[*'ABCDE'], 'B':np.random.randint(0,10,5)})
df2 = pd.DataFrame({'A':[*'ABCDE'], 'C':np.random.randint(0,100,5)})
df3 = pd.DataFrame({'A':[*'ABCDE'], 'D':np.random.randint(0,1000,5)})
df4 = pd.DataFrame({'A':[*'ABCDE'], 'E':np.random.randint(0,10000,5)})
result1 = reduce(lambda l,r: pd.merge(l,r), [df1,df2,df3,df4])
result2 = df1.set_index('A').join([d.set_index('A') for d in [df2,df3,df4]]).reset_index()
all(result1 == result2)
True
Output(result1):
A B C D E
0 A 7 19 980 8635
1 B 7 44 528 431
2 C 5 4 572 9405
3 D 7 7 96 2596
4 E 1 6 514 940
Timings:
%%timeit
result1 = reduce(lambda l,r: pd.merge(l,r), [df1,df2,df3,df4])
9.37 ms ± 325 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
result2 = df1.set_index('A').join([d.set_index('A') for d in [df2,df3,df4]]).reset_index()
4.04 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
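Applied to the frames from the question, the same join idea might look roughly like this (a sketch, assuming frames is the list of dataframes; the final reset_index is what keeps time as a regular column):
df_merged = (
    frames[0].set_index('time')
    .join([d.set_index('time') for d in frames[1:]], how='outer')
    .fillna(0.0)       # fill gaps with numeric zero
    .reset_index()     # keep 'time' as a column instead of the index
)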
I have a Pandas DataFrame with a column (ip) containing certain values, and a separate Pandas Series (not in this DataFrame) holding a collection of these values. I want to create a column in the DataFrame that is 1 if a given row has its ip in my Pandas Series (blacklist).
import pandas as pd
dict = {'ip': {0: 103022, 1: 114221, 2: 47902, 3: 23550, 4: 84644}, 'os': {0: 23, 1: 19, 2: 17, 3: 13, 4: 19}}
df = pd.DataFrame(dict)
df
ip os
0 103022 23
1 114221 19
2 47902 17
3 23550 13
4 84644 19
blacklist = pd.Series([103022, 23550])
blacklist
0 103022
1 23550
My question is: how can I create a new column in df such that it shows 1 when the given ip is in the blacklist and 0 otherwise?
Sorry if this is too dumb; I'm still new to programming. Thanks a lot in advance!
Use isin with astype:
df['new'] = df['ip'].isin(blacklist).astype(np.int8)
It is also possible to convert the column to categorical:
df['new'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
print (df)
ip os new
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
Interestingly, for a large DataFrame, converting to Categorical does not save memory:
df = pd.concat([df] * 10000, ignore_index=True)
df['new1'] = pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
df['new2'] = df['ip'].isin(blacklist).astype(np.int8)
df['new3'] = df['ip'].isin(blacklist)
print (df.memory_usage())
Index 80
ip 400000
os 400000
new1 50096
new2 50000
new3 50000
dtype: int64
Timings:
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
print (len(df))
10000
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
print (len(blacklist))
100
In [320]: %timeit df['ip'].isin(blacklist).astype(np.int8)
465 µs ± 21.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [321]: %timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
915 µs ± 49.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [322]: %timeit pd.Categorical(df['ip'], categories = blacklist.unique()).notnull().astype(int)
1.59 ms ± 20.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [323]: %timeit df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
81.8 ms ± 2.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Slow, but simple and readable method:
Another way to do this would be to create your new column using a list comprehension, set to assign a 1 if your ip value is in blacklist and a 0 otherwise:
df['new_column'] = [1 if x in blacklist.values else 0 for x in df.ip]
>>> df
ip os new_column
0 103022 23 1
1 114221 19 0
2 47902 17 0
3 23550 13 1
4 84644 19 0
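As a side note (not from the answer), converting the blacklist to a plain Python set should make the membership test in this comprehension much faster, since x in blacklist.values scans the whole array for every row:
# sketch: use a set for O(1) membership tests
black_set = set(blacklist.values)
df['new_column'] = [1 if x in black_set else 0 for x in df.ip]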
EDIT: Faster method building on Categorical: If you want to maximize speed, the following would be quite fast, though not quite as fast as the .isin non-categorical method. It builds on the use of pd.Categorical as suggested by @jezrael, but leverages its capacity for assigning categories:
df['new_column'] = pd.Categorical(df['ip'],
                                  categories=blacklist.unique()).notnull().astype(int)
Timings:
import numpy as np
import pandas as pd
np.random.seed(4545)
N = 10000
df = pd.DataFrame(np.random.randint(1000,size=N), columns=['ip'])
blacklist = pd.Series(np.random.randint(500,size=int(N/100)))
%timeit df['ip'].isin(blacklist).astype(np.int8)
# 453 µs ± 8.81 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'].isin(blacklist).astype(np.int8))
# 892 µs ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit pd.Categorical(df['ip'], categories = \
blacklist.unique()).notnull().astype(int)
# 565 µs ± 32.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)