Pandas Split Dataframe into two Dataframes at a specific row - python

I have a pandas DataFrame that I built with concat. Each row consists of 96 values, and I would like to split the DataFrame after the 72nd value,
so that the first 72 values of each row are stored in Dataframe1 and the next 24 values in Dataframe2.
I create my DataFrame as follows:
temps = DataFrame(myData)
datasX = concat(
[temps.shift(72), temps.shift(71), temps.shift(70), temps.shift(69), temps.shift(68), temps.shift(67),
temps.shift(66), temps.shift(65), temps.shift(64), temps.shift(63), temps.shift(62), temps.shift(61),
temps.shift(60), temps.shift(59), temps.shift(58), temps.shift(57), temps.shift(56), temps.shift(55),
temps.shift(54), temps.shift(53), temps.shift(52), temps.shift(51), temps.shift(50), temps.shift(49),
temps.shift(48), temps.shift(47), temps.shift(46), temps.shift(45), temps.shift(44), temps.shift(43),
temps.shift(42), temps.shift(41), temps.shift(40), temps.shift(39), temps.shift(38), temps.shift(37),
temps.shift(36), temps.shift(35), temps.shift(34), temps.shift(33), temps.shift(32), temps.shift(31),
temps.shift(30), temps.shift(29), temps.shift(28), temps.shift(27), temps.shift(26), temps.shift(25),
temps.shift(24), temps.shift(23), temps.shift(22), temps.shift(21), temps.shift(20), temps.shift(19),
temps.shift(18), temps.shift(17), temps.shift(16), temps.shift(15), temps.shift(14), temps.shift(13),
temps.shift(12), temps.shift(11), temps.shift(10), temps.shift(9), temps.shift(8), temps.shift(7),
temps.shift(6), temps.shift(5), temps.shift(4), temps.shift(3), temps.shift(2), temps.shift(1), temps,
temps.shift(-1), temps.shift(-2), temps.shift(-3), temps.shift(-4), temps.shift(-5), temps.shift(-6),
temps.shift(-7), temps.shift(-8), temps.shift(-9), temps.shift(-10), temps.shift(-11), temps.shift(-12),
temps.shift(-13), temps.shift(-14), temps.shift(-15), temps.shift(-16), temps.shift(-17), temps.shift(-18),
temps.shift(-19), temps.shift(-20), temps.shift(-21), temps.shift(-22), temps.shift(-23)], axis=1)
The question is: how can I split them? :)

Use iloc to split by column position:
df1 = datasX.iloc[:, :72]
df2 = datasX.iloc[:, 72:]
(iloc docs)
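A minimal sketch of the same positional split on a small lag matrix. The toy data, the names `n_back` and `n_forward`, and the comprehension that stands in for the handwritten list of shifts are assumptions for illustration, not the asker's code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the asker's data (assumption): one column of 200 values.
temps = pd.DataFrame(np.arange(200.0))

n_back, n_forward = 72, 24  # 72 past values, then 24 current/future values

# shift(72), shift(71), ..., shift(1), shift(0), shift(-1), ..., shift(-23)
shifts = range(n_back, -n_forward, -1)
datasX = pd.concat([temps.shift(k) for k in shifts], axis=1)

# Positional column split: first 72 columns, then the remaining 24.
df1 = datasX.iloc[:, :n_back]
df2 = datasX.iloc[:, n_back:]
```

Because every column of `datasX` carries the same label here, a positional indexer such as `iloc` is the safer choice over label-based selection.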

use np.split(..., axis=1):
Demo:
In [255]: df = pd.DataFrame(np.random.rand(5, 6), columns=list('abcdef'))
In [256]: df
Out[256]:
a b c d e f
0 0.823638 0.767999 0.460358 0.034578 0.592420 0.776803
1 0.344320 0.754412 0.274944 0.545039 0.031752 0.784564
2 0.238826 0.610893 0.861127 0.189441 0.294646 0.557034
3 0.478562 0.571750 0.116209 0.534039 0.869545 0.855520
4 0.130601 0.678583 0.157052 0.899672 0.093976 0.268974
In [257]: dfs = np.split(df, [4], axis=1)
In [258]: dfs[0]
Out[258]:
a b c d
0 0.823638 0.767999 0.460358 0.034578
1 0.344320 0.754412 0.274944 0.545039
2 0.238826 0.610893 0.861127 0.189441
3 0.478562 0.571750 0.116209 0.534039
4 0.130601 0.678583 0.157052 0.899672
In [259]: dfs[1]
Out[259]:
e f
0 0.592420 0.776803
1 0.031752 0.784564
2 0.294646 0.557034
3 0.869545 0.855520
4 0.093976 0.268974
np.split() is pretty flexible - let's split an original DF into 3 DFs at columns with indexes [2,3]:
In [260]: dfs = np.split(df, [2,3], axis=1)
In [261]: dfs[0]
Out[261]:
a b
0 0.823638 0.767999
1 0.344320 0.754412
2 0.238826 0.610893
3 0.478562 0.571750
4 0.130601 0.678583
In [262]: dfs[1]
Out[262]:
c
0 0.460358
1 0.274944
2 0.861127
3 0.116209
4 0.157052
In [263]: dfs[2]
Out[263]:
d e f
0 0.034578 0.592420 0.776803
1 0.545039 0.031752 0.784564
2 0.189441 0.294646 0.557034
3 0.534039 0.869545 0.855520
4 0.899672 0.093976 0.268974

I generally use np.array_split because its syntax is simpler and it scales better to more than 2 partitions.
import numpy as np
partitions = 2
dfs = np.array_split(df, partitions)
np.split(df, [100,200,300], axis=0) wants explicit index numbers, which may or may not be desirable.
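The difference between the two styles can be sketched without handing the DataFrame itself to NumPy, which may trigger deprecation warnings in some pandas versions. Here NumPy only splits the row positions and `iloc` does the slicing; the toy frame and the boundary values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['a', 'b'])
partitions = 3

# "How many parts": split the row positions (a plain array) into 3 chunks.
chunks = np.array_split(np.arange(len(df)), partitions)
dfs = [df.iloc[pos] for pos in chunks]

# "Where to cut": explicit boundaries, comparable to np.split(df, [3, 7], axis=0).
bounds = [0, 3, 7, len(df)]
dfs_explicit = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]
```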

Related

how to apply multiplication within pandas dataframe

Please advise how to get the following output:
df1 = pd.DataFrame([['1, 2', '2, 2','3, 2','1, 1', '2, 1','3, 1']])
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']])
import numpy as np
df22 = df2.rename(index = lambda x: x + 1).set_axis(np.arange(1, len(df2.columns) + 1), axis=1)
f = lambda x: df22.loc[tuple(map(int, x.split(',')))]
df = df1.applymap(f)
print (df)
Output:
0 1 2 3 4 5
0 2 4 6 1 3 5
df1 holds 'addresses' into df2 in row, col format ('1, 2' is first row, second column, which is 2; '2, 2' is 4; '3, 2' is 6; etc.).
I need to combine these with the values from the 3rd and 4th columns to get something like (2*100x, 4*200y, 6*300x, 1*100x, 3*200y, 5*300x).
The output should be 5000 (the sum of the x's and y's) and 0.28 (1400/5000, the share of y's).
It's not clear to me why you need df1 and df... Maybe your question is lacking some details?
You can compute your values directly:
df22['val'] = (df22[1] + df22[2])*df22[3]
Output:
1 2 3 4 val
1 1 2 100 x 300
2 3 4 200 y 1400
3 5 6 300 x 3300
From there it's straightforward to compute the sums (total and grouped by column 4):
total = df22['val'].sum() # 5000
y_sum = df22.groupby(4).sum().loc['y', 'val'] # 1400
print(y_sum/total) # 0.28
Edit: if df1 doesn't necessarily contain all members of columns 1 and 2, you could loop through it (it's not clear in your question why df1 is a Dataframe or if it can have more than one row, therefore I flattened it):
df22['val'] = 0
for c in df1.to_numpy().flatten():
    i, j = map(int, c.split(','))
    df22.loc[i, 'val'] += df22.loc[i, j]*df22.loc[i, 3]
This gives you the same output as above for your example but will ignore values that are not in df1.
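For reference, a self-contained sketch of the direct computation with named columns. The column names `a`, `b`, `weight` and `kind` are hypothetical labels chosen for readability; the question's frame uses integer labels.

```python
import pandas as pd

# Same values as df2 in the question, with hypothetical column names.
df2 = pd.DataFrame([[1, 2, 100, 'x'], [3, 4, 200, 'y'], [5, 6, 300, 'x']],
                   columns=['a', 'b', 'weight', 'kind'])

# (a + b) * weight per row: 300, 1400, 3300
df2['val'] = (df2['a'] + df2['b']) * df2['weight']

total = df2['val'].sum()                             # sum over x's and y's
share = df2.groupby('kind')['val'].sum() / total     # share per kind
```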

Convert dataframe with tuples into dataframe with multiindex

I want to convert a dataframe which has tuples in cells into a dataframe with MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
I am unsure about the efficiency, because this relies on the apply method, but you could concat the dataframe with itself, adding an 'A' column to the first copy and a 'B' column to the second. Then you sort the resulting dataframe by its index, and use apply to change even rows to the first value of the tuple and odd rows to the second:
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
'X', append=True).sort_index().rename_axis(index=(None, None))
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives as expected:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
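An alternative sketch that extracts each tuple position into its own frame first and then stacks the two frames, instead of assigning into alternating rows. The helper name `pick` is hypothetical; this is one possible variation, not the method from the answer above.

```python
import pandas as pd

d = {2: [(0, 2), (0, 4)],
     3: [(826.0, 826.0), (4132.0, 4132.0)],
     4: [(6019.0, 6019.0), (12037.0, 12037.0)],
     6: [(18337.0, 18605.0), (36674.0, 37209.0)]}
test = pd.DataFrame(d)

def pick(frame, position):
    """Return a frame holding the tuple element at `position` from every cell."""
    return frame.apply(lambda col: col.map(lambda cell: cell[position]))

# The dict keys become the outer index level; swaplevel puts the row number
# first, and sort_index interleaves the A and B rows.
result = (pd.concat({'A': pick(test, 0), 'B': pick(test, 1)})
            .swaplevel()
            .sort_index())
```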

add a different random number to every cell in a pandas dataframe

I need to add some 'noise' to my data, so I would like to add a different random number to every cell in my pandas dataframe. This code works, but seems unpythonic. Is there a better way?
import pandas as pd
import numpy as np
df = pd.DataFrame(0.0, index=[1,2,3,4,5], columns=list('ABC') )
print(df)
for x,line in df.iterrows():
    for col in df:
        line[col] = line[col] + (np.random.rand()-0.5)/1000.0
print(df)
df + np.random.rand(*df.shape) / 10000.0
OR
Let's use applymap:
df = pd.DataFrame(1.0, index=[1,2,3,4,5], columns=list('ABC') )
df.applymap(lambda x: x + np.random.rand()/10000.0)
output:
A \
1 [[1.00006953418, 1.00009164785, 1.00003177706]...
2 [[1.00007291245, 1.00004186046, 1.00006935173]...
3 [[1.00000490127, 1.0000633115, 1.00004117181],...
4 [[1.00007159622, 1.0000559506, 1.00007038891],...
5 [[1.00000980335, 1.00004760836, 1.00004214422]...
B \
1 [[1.00000320322, 1.00006981682, 1.00008912557]...
2 [[1.00007443802, 1.00009270815, 1.00007225764]...
3 [[1.00001371778, 1.00001512412, 1.00007986851]...
4 [[1.00005883343, 1.00007936509, 1.00009523334]...
5 [[1.00009329606, 1.00003174878, 1.00006187704]...
C
1 [[1.00005894836, 1.00006592776, 1.0000171843],...
2 [[1.00009085391, 1.00006606979, 1.00001755092]...
3 [[1.00009736701, 1.00007240762, 1.00004558753]...
4 [[1.00003981393, 1.00007505714, 1.00007209959]...
5 [[1.0000031608, 1.00009372917, 1.00001960112],...
This would be the more succinct method and equivalent:
In [147]:
df = pd.DataFrame((np.random.rand(5,3) - 0.5)/1000.0, columns=list('ABC'))
df
Out[147]:
A B C
0 0.000381 -0.000167 0.000020
1 0.000482 0.000007 -0.000281
2 -0.000032 -0.000402 -0.000251
3 -0.000037 -0.000319 0.000260
4 -0.000035 0.000178 0.000166
If you're doing this to an existing df with non-zero values then add:
In [149]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df
Out[149]:
A B C
0 -1.705644 0.149067 0.835378
1 -0.956335 -0.586120 0.212981
2 0.550727 -0.401768 1.421064
3 0.348885 0.879210 0.136858
4 0.271063 0.132579 1.233789
In [154]:
df.add((np.random.rand(df.shape[0], df.shape[1]) - 0.5)/1000.0)
Out[154]:
A B C
0 -1.705459 0.148671 0.835761
1 -0.956745 -0.586382 0.213339
2 0.550368 -0.401651 1.421515
3 0.348938 0.878923 0.136914
4 0.270864 0.132864 1.233622
For nonzero data:
df + (np.random.rand(*df.shape)-0.5)*0.001
OR
df + np.random.uniform(-0.01,0.01,df.shape)
For cases where your data frame contains zeros that you wish to keep as zero:
df * (1 + (np.random.rand(*df.shape)-0.5)*0.001)
OR
df * (1 + np.random.uniform(-0.01,0.01,df.shape))
I think either of these should work. It's a case of generating an array with the same shape as your existing df and adding it to the df (or multiplying by 1 + random where you want zeros to remain zero). With the uniform function you can set the scale of the noise by changing the 0.01 value.
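The additive and multiplicative variants above can be sketched with NumPy's `default_rng` generator, seeded so the run is reproducible. The toy frame and the name `scale` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator for reproducibility
df = pd.DataFrame({'A': [0.0, 1.0, 2.0], 'B': [3.0, 0.0, 5.0]})
scale = 0.01  # half-width of the uniform noise

# Additive noise: every cell, including zeros, is perturbed.
additive = df + rng.uniform(-scale, scale, size=df.shape)

# Multiplicative noise: zeros stay exactly zero, other cells change by up to 1%.
multiplicative = df * (1 + rng.uniform(-scale, scale, size=df.shape))
```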

concatenation large number of dataframes

I have a dictionary D that contains many dataframes.
I can access every dataframe with D[0], D[1]...D[i], with the integers as keys/identifier of the respective dataframe.
I now want to concat all the dataframes in this fashion into a new dataframe:
new_df = pd.concat([D[0],D[1],...D[i]], axis= 1)
How would you suggest I solve this (concat still needs to be used)?
I tried with generating a list of D's and including this but received an error message.
I think the easiest thing to do is to use a list comprehension over the dict items:
In [14]:
d = {'a':pd.DataFrame(np.random.randn(5,3), columns=list('abc')), 'b':pd.DataFrame(np.random.randn(5,3), columns=list('def'))}
d
Out[14]:
{'a': a b c
0 0.030358 1.523752 1.040409
1 -0.220019 -1.579467 -0.312059
2 1.019489 -0.272261 1.182399
3 0.580368 1.985362 -0.835338
4 0.183974 -1.150667 1.571003, 'b': d e f
0 -0.911246 0.721034 -0.347018
1 0.483298 -0.553996 0.374566
2 -0.041415 -0.275874 -0.858687
3 0.105171 -1.509721 0.265802
4 -0.788434 0.648109 0.688839}
In [29]:
pd.concat([df for k,df in d.items()], axis=1)
Out[29]:
a b c d e f
0 0.030358 1.523752 1.040409 -0.911246 0.721034 -0.347018
1 -0.220019 -1.579467 -0.312059 0.483298 -0.553996 0.374566
2 1.019489 -0.272261 1.182399 -0.041415 -0.275874 -0.858687
3 0.580368 1.985362 -0.835338 0.105171 -1.509721 0.265802
4 0.183974 -1.150667 1.571003 -0.788434 0.648109 0.688839
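A short sketch for the integer-keyed case described in the question. The dictionary `D` built here is a hypothetical stand-in; sorting the keys makes the column order explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical dictionary with integer keys, as described in the question.
D = {i: pd.DataFrame(np.full((3, 2), i), columns=[f'x{i}', f'y{i}'])
     for i in range(4)}

# Concatenate side by side in key order.
new_df = pd.concat([D[k] for k in sorted(D)], axis=1)

# Passing the dict directly also works; the keys become an outer column level.
new_df_keyed = pd.concat(D, axis=1)
```

The keyed form is convenient when the source of each column should remain visible, for example `new_df_keyed[2]` returns the columns that came from `D[2]`.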

How to create a lagged data structure using pandas dataframe

Example
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)
1 5
2 4
3 3
4 2
5 1
Is there an efficient way to create a series containing, in each row, the lagged values (in this example up to lag 2)?
3 [3, 4, 5]
4 [2, 3, 4]
5 [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done in an efficient way for dataframes with a lot of timeseries which are very long?
Thanks
Edited after seeing the answers
OK, in the end I implemented this function:
def buildLaggedFeatures(s,lag=2,dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if type(s) is pd.DataFrame:
        new_dict={}
        for col_name in s:
            new_dict[col_name]=s[col_name]
            # create lagged Series
            for l in range(1,lag+1):
                new_dict['%s_lag%d' %(col_name,l)]=s[col_name].shift(l)
        res=pd.DataFrame(new_dict,index=s.index)
    elif type(s) is pd.Series:
        the_range=range(lag+1)
        res=pd.concat([s.shift(i) for i in the_range],axis=1)
        res.columns=['lag_%d' %i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired output and manages the naming of columns in the resulting DataFrame.
For a Series as input:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
lag_0 lag_1 lag_2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
and for a DataFrame as input:
s2=pd.DataFrame({'a':[5,4,3,2,1], 'b':[50,40,30,20,10]},index=[1,2,3,4,5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)
a a_lag1 a_lag2 b b_lag1 b_lag2
3 3 4 5 30 40 50
4 2 3 4 20 30 40
5 1 2 3 10 20 30
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
0 1 2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
0 1 2
3 3 4 5
4 2 3 4
5 1 2 3
Doing work on this will be more efficient than on lists...
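The remark above about window-based functions avoiding extra copies can be illustrated with NumPy's `sliding_window_view` (available in NumPy 1.20 and later). This is a swapped-in technique, not the `concat`/`shift` approach from the answer, and it assumes a regularly spaced series with no gaps.

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lag = 2

# Each window is [s[t-2], s[t-1], s[t]]; the view itself does not copy the data.
windows = np.lib.stride_tricks.sliding_window_view(s.to_numpy(), lag + 1)

# Reverse each window so column 0 is the current value and column k is lag k.
lagged = pd.DataFrame(windows[:, ::-1], index=s.index[lag:],
                      columns=[f'lag_{k}' for k in range(lag + 1)])
```

The rows that would contain NaN are simply absent, so the result corresponds to the `dropna()` variant.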
Very simple solution using pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in xrange(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
# if you want a numpy array with no null values:
df.dropna().values
For Python 3.x (change xrange to range):
number_lags = 3
df = pd.DataFrame(data={'vals':[5,4,3,2,1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
print(df)
vals lag_1 lag_2 lag_3
0 5 NaN NaN NaN
1 4 5.0 NaN NaN
2 3 4.0 5.0 NaN
3 2 3.0 4.0 5.0
4 1 2.0 3.0 4.0
For a dataframe df with the lag to be applied on 'col name', you can use the shift function.
df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.
Here's an example of the result:
# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s=pd.Series(
[5, 4, 3, 2, 1],
index=indx,
name='population')
shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])
Result: a MultiIndex DataFrame with two column labels: the original one ("population") and a new one ("lag"):
Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.
def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
                     .pipe(append_level_to_columns_of_dataframe,
                           lag, lag_label))
I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.
def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns
    new_level : scalar, or arraylike of length equal to the number of columns
        in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.
    name_of_new_level : str
        The label to give the new level.
    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given
        `new_level` appended to each column label.
    """
    old_columns = dataframe.columns
    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]
    if isinstance(dataframe.columns, pd.MultiIndex):
        # Use the per-column labels of each level; `.levels` holds only the
        # unique values and would not line up with the columns.
        old_arrays = [old_columns.get_level_values(i)
                      for i in range(old_columns.nlevels)]
        new_columns = pd.MultiIndex.from_arrays(
            old_arrays + [new_level],
            names=(list(old_columns.names) + [name_of_new_level]))
    elif isinstance(dataframe.columns, pd.Index):
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=([old_columns.name] + [name_of_new_level]))
    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe
Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:
def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
        df.shift(lag)
        for lag in lags},
        axis=1)
Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]):
Here is a cool one liner for lagged features with _lagN suffixes in column names using pd.concat:
lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag+1)) for lag in range(3)], axis=1).dropna()
You can do the following:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res = pd.DataFrame(index = s.index)
for l in range(3):
    res[l] = s.shift(l)
print(res.loc[3:,:].to_numpy())
It produces:
array([[ 3., 4., 5.],
[ 2., 3., 4.],
[ 1., 2., 3.]])
which I hope is very close to what you actually want.
For multiple (many of them) lags, this could be more compact:
df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
Assuming you are focusing on a single column of your data frame, saved into s, this short code generates copies of the column shifted by lags 0 to 2.
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5], name='test')
shiftdf=pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf , s.shift(i).rename(s.name+'_'+str(i))], axis=1)
shiftdf
>>
test_0 test_1 test_2
1 5 NaN NaN
2 4 5.0 NaN
3 3 4.0 5.0
4 2 3.0 4.0
5 1 2.0 3.0
Based on the proposal by @charlie-brummitt, here is a revision that keeps a fixed set of columns unshifted:
def shift_timeseries_by_lags(df, fix_columns, lag_numbers, lag_label='lag'):
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)
    df_lagged = pd.concat({f'{lag_label}_{lag}':
                           df_lag.shift(lag) for lag in lag_numbers},
                          axis=1)
    df_lagged.columns = ['__'.join(reversed(x)) for x in df_lagged.columns.to_flat_index()]
    return pd.concat([df_fix, df_lagged], axis=1)
Here is an example of usage:
df = shift_timeseries_by_lags(df_province_cases, fix_columns=['country', 'state'], lag_numbers=[1,2,3])
I personally prefer the lag name as a suffix, but this can be changed by removing reversed().
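Since `df_province_cases` is not shown, here is a self-contained sketch of the same idea on toy data. The function name `shift_with_fixed_columns` and the `cases` frame are hypothetical; the logic mirrors the revision above.

```python
import pandas as pd

def shift_with_fixed_columns(df, fix_columns, lag_numbers, lag_label='lag'):
    """Lag every column except `fix_columns`, which are kept unshifted."""
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)
    df_lagged = pd.concat(
        {f'{lag_label}_{lag}': df_lag.shift(lag) for lag in lag_numbers},
        axis=1)
    # Flatten ('lag_1', 'cases') into 'cases__lag_1' (lag name as suffix).
    df_lagged.columns = ['__'.join(reversed(c))
                         for c in df_lagged.columns.to_flat_index()]
    return pd.concat([df_fix, df_lagged], axis=1)

# Toy data (assumption): one country/state pair with four observations.
cases = pd.DataFrame({'country': ['X'] * 4, 'state': ['Y'] * 4,
                      'cases': [10, 20, 30, 40]})
out = shift_with_fixed_columns(cases, ['country', 'state'], [1, 2])
```

Note that `shift` moves values down the whole frame, so with several countries or states in one frame the lags would cross group boundaries unless the shift is applied per group.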
