I want to convert a dataframe which has tuples in cells into a dataframe with MultiIndex.
Here is an example of the table code:
d = {2:[(0,2),(0,4)], 3:[(826.0, 826.0),(4132.0, 4132.0)], 4:[(6019.0, 6019.0),(12037.0, 12037.0)], 6:[(18337.0, 18605.0),(36674.0, 37209.0)]}
test = pd.DataFrame(d)
This is what the dataframe looks like:
2 3 4 6
0 (0, 2) (826.0, 826.0) (6019.0, 6019.0) (18337.0, 18605.0)
1 (0, 4) (4132.0, 4132.0) (12037.0, 12037.0) (36674.0, 37209.0)
This is what I want it to look like:
2 3 4 6
0 A 0 826.0 6019.0 18337.0
B 2 826.0 6019.0 18605.0
1 A 0 4132.0 12037.0 36674.0
B 4 4132.0 12037.0 37209.0
Thanks for your help!
Unsure about the efficiency, because this relies on the apply method, but you could concat the dataframe with itself, adding an 'A' column to the first and a 'B' one to the second. Then you sort the resulting dataframe by its index, and use apply to change even rows to the first value of the tuple and odd ones to the second:
# Duplicate the frame with 'A'/'B' markers and move the marker into the index
df = pd.concat([test.assign(X='A'), test.assign(X='B')]).set_index(
    'X', append=True).sort_index().rename_axis(index=(None, None))
# Even rows ('A') keep the first tuple element, odd rows ('B') the second
df.iloc[0:len(df):2] = df.iloc[0:len(df):2].apply(lambda x: x.apply(lambda y: y[0]))
df.iloc[1:len(df):2] = df.iloc[1:len(df):2].apply(lambda x: x.apply(lambda y: y[1]))
It gives as expected:
2 3 4 6
0 A 0 826 6019 18337
B 2 826 6019 18605
1 A 0 4132 12037 36674
B 4 4132 12037 37209
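If apply proves slow, a vectorized sketch of the same reshape is possible, under the assumption that every cell holds a 2-tuple: flatten each column and rebuild the index with MultiIndex.from_product:
# Hypothetical apply-free alternative: flatten the tuples column by column
# (assumes every cell is a 2-tuple, pairing the old index with ['A', 'B'])
df = pd.DataFrame(
    {col: [v for tup in test[col] for v in tup] for col in test.columns},
    index=pd.MultiIndex.from_product([test.index, ['A', 'B']]))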
I am trying to count how many characters from the first column appear in the second one. They may appear in a different order and they should not be counted twice.
For example, in this df
df = pd.DataFrame(data=[["AL0", "CP1", "NM3", "PK9", "RM2"],
                        ["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS"]]).T
the result should be:
result = [3,2,2,2,3]
Looks like set.intersection after zipping the 2 columns:
[len(set(a).intersection(set(b))) for a,b in zip(df[0],df[1])]
#[3, 2, 2, 2, 3]
The other solutions will fail when both strings contain the same character more than once, e.g. AAL0 and AAL0X24. The result there should be 4.
from collections import Counter

df = pd.DataFrame(data=[["AL0", "CP1", "NM3", "PK9", "RM2", "AAL0"],
                        ["AL0X24", "CXP44", "MLN", "KKRR9", "22MMRRS", "AAL0X24"]]).T

def num_shared_chars(char_counter1, char_counter2):
    # Count each shared character as often as it appears in both strings
    shared_chars = set(char_counter1.keys()).intersection(char_counter2.keys())
    return sum(min(char_counter1[k], char_counter2[k]) for k in shared_chars)

# Turn every cell into a Counter of its characters
df_counter = df.applymap(Counter)
df['shared_chars'] = df_counter.apply(lambda row: num_shared_chars(row[0], row[1]), axis='columns')
Result:
0 1 shared_chars
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
5 AAL0 AAL0X24 4
Sticking to the dataframe data structure, you could do:
>>> def count_common(s1, s2):
... return len(set(s1) & set(s2))
...
>>> df["result"] = df.apply(lambda x: count_common(x[0], x[1]), axis=1)
>>> df
0 1 result
0 AL0 AL0X24 3
1 CP1 CXP44 2
2 NM3 MLN 2
3 PK9 KKRR9 2
4 RM2 22MMRRS 3
I am working on a project using Learning to Rank. Below is the example dataset format (taken from https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/). The first column is the rank, the second column is the query id, and the following columns are [feature number]:[feature value].
1008 qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000 … 46:0.00000
1007 qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333 … 46:0.000000
1006 qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000 … 46:0.000000
Right now, I have successfully converted my data into the following format in a pandas DataFrame.
10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0 92.28260 ...
...
The first two columns are already fine. What I need next is to prepend the feature number to the remaining columns (e.g. the first feature, 3500, becomes 1:3500).
I know I can prepend a string to a column by using the following command.
df['col'] = 'str' + df['col'].astype(str)
The first feature, 3500, is located at column index 2, so what I can think of is prepending (column index - 1) to each column. How do I prepend the string based on the column number?
Any help would be appreciated.
I think you need DataFrame.radd to add the column names from the right side, and iloc to select from the second column to the end:
print (df)
0 1 2 3 4 5 6 7 8 \
0 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
1 10 qid:354714443278337 3500 1 122.0 156.0 13.0 1698.0 1840.0
9
0 92.2826
1 92.2826
# Works because the column labels are integers: label - 1 is the feature number
df.iloc[:, 2:] = df.iloc[:, 2:].astype(str).radd(':').radd((df.columns[2:] - 1).astype(str))
print (df)
0 1 2 3 4 5 6 7 \
0 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
1 10 qid:354714443278337 1:3500 2:1 3:122.0 4:156.0 5:13.0 6:1698.0
8 9
0 7:1840.0 8:92.2826
1 7:1840.0 8:92.2826
You can simply concatenate the columns:
df['new_col'] = df[df.columns[3]].astype(str) + ':' + df[df.columns[2]].astype(str)
This will output a new column in your df named new_col. You can then delete the unnecessary columns.
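A sketch that instead derives the prefix from the column position, so every feature column is handled at once (assuming, as in the question, that the features start at column index 2, so the feature number is the position minus one):
# Hypothetical generalization: prefix each feature column with its number
for i, col in enumerate(df.columns[2:], start=1):
    df[col] = str(i) + ':' + df[col].astype(str)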
You can convert each string to a dictionary and then read it back as a pandas dataframe.
import ast
import pandas as pd

df = pd.DataFrame({'rank': [1008, 1007, 1006],
                   'column': ['qid:10 1:0.004356 2:0.080000 3:0.036364 4:0.000000',
                              'qid:10 1:0.004901 2:0.000000 3:0.036364 4:0.333333',
                              'qid:10 1:0.019058 2:0.240000 3:0.072727 4:0.500000']})

def putquotes(x):
    # 'qid:10' -> "'qid':10", so each token becomes a dict-literal entry
    x1 = x.split(":")
    return "'" + x1[0] + "':" + x1[1]

def putcommas(x):
    x1 = x.split()
    return "{" + ",".join([putquotes(t) for t in x1]) + "}"

df1 = [ast.literal_eval(putcommas(x)) for x in df['column'].tolist()]
df = pd.concat([df, pd.DataFrame(df1)], axis=1)
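A lighter-weight sketch of the same idea, splitting on whitespace and colons directly rather than round-tripping through string literals (it assumes every token has the form key:value and that everything except qid is numeric):
# Hypothetical direct parse: build one dict per row, no ast needed
parsed = [{k: (v if k == 'qid' else float(v))
           for k, v in (tok.split(':') for tok in row.split())}
          for row in df['column']]
df = pd.concat([df, pd.DataFrame(parsed)], axis=1)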
Given a setup such as below:
import pandas as pd
import numpy as np
#Create random number dataframes
df1 = pd.DataFrame(np.random.rand(10,4))
df2 = pd.DataFrame(np.random.rand(10,4))
df3 = pd.DataFrame(np.random.rand(10,4))
#Create list of dataframes
data_frame_list = [df1, df2, df3]
#Introduce some NaN values
df1.iloc[4,3] = np.NaN
df2.iloc[1:4,2] = np.NaN
#Create loop to ffill any NaN values
for df in data_frame_list:
    df = df.fillna(method='ffill')
This still leaves df2 (for example) as:
0 1 2 3
0 0.946601 0.492957 0.688421 0.582571
1 0.365173 0.507617 NaN 0.997909
2 0.185005 0.496989 NaN 0.962120
3 0.278633 0.515227 NaN 0.868952
4 0.346495 0.779571 0.376018 0.750900
5 0.384307 0.594381 0.741655 0.510144
6 0.499180 0.885632 0.13413 0.196010
7 0.245445 0.771402 0.371148 0.222618
8 0.564510 0.487644 0.121945 0.095932
9 0.401214 0.282698 0.0181196 0.689916
Although the individual line of code:
df2 = df2.fillna(method='ffill')
does work. I thought the issue might be due to the way I was naming variables, so I introduced globals()[df], but this didn't seem to work either.
Wondering if it is possible to do a ffill of an entire dataframe in a for loop, or am I going wrong somewhere in my approach?
No, it unfortunately does not. You are calling fillna not in place, which generates a copy that you then reassign to the loop variable df. Reassigning that variable does not change the contents of the list.
If you want to do that, iterate over the index or use a list comprehension.
data_frame_list = [df.ffill() for df in data_frame_list]
Or,
for i in range(len(data_frame_list)):
    data_frame_list[i].ffill(inplace=True)
You can change the DataFrames inside the list of DataFrames in place, so df1 - df3 themselves are changed, with ffill and the parameter inplace=True:
data_frame_list = [df1, df2, df3]
for df in data_frame_list:
    df.ffill(inplace=True)
print (data_frame_list)
[ 0 1 2 3
0 0.506726 0.057531 0.627580 0.132553
1 0.131085 0.788544 0.506686 0.412826
2 0.578009 0.488174 0.335964 0.140816
3 0.891442 0.086312 0.847512 0.529616
4 0.550261 0.848461 0.158998 0.529616
5 0.817808 0.977898 0.933133 0.310414
6 0.481331 0.382784 0.874249 0.363505
7 0.384864 0.035155 0.634643 0.009076
8 0.197091 0.880822 0.002330 0.109501
9 0.623105 0.999237 0.567151 0.487938, 0 1 2 3
0 0.104856 0.525416 0.284066 0.658453
1 0.989523 0.644251 0.284066 0.141395
2 0.488099 0.167418 0.284066 0.097982
3 0.930415 0.486878 0.284066 0.192273
4 0.210032 0.244598 0.175200 0.367130
5 0.981763 0.285865 0.979590 0.924292
6 0.631067 0.119238 0.855842 0.782623
7 0.815908 0.575624 0.037598 0.532883
8 0.346577 0.329280 0.606794 0.825932
9 0.273021 0.503340 0.828568 0.429792, 0 1 2 3
0 0.491665 0.752531 0.780970 0.524148
1 0.635208 0.283928 0.821345 0.874243
2 0.454211 0.622611 0.267682 0.726456
3 0.379144 0.345580 0.694614 0.585782
4 0.844209 0.662073 0.590640 0.612480
5 0.258679 0.413567 0.797383 0.431819
6 0.034473 0.581294 0.282111 0.856725
7 0.352072 0.801542 0.862749 0.000285
8 0.793939 0.297286 0.441013 0.294635
9 0.841181 0.804839 0.311352 0.171094]
Or you can concat:
df = pd.concat([df1, df2, df3], keys=['df1', 'df2', 'df3'])
[x for _, x in df.groupby(level=0).ffill().groupby(level=0)]
Example
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
print(s)
1 5
2 4
3 3
4 2
5 1
Is there an efficient way to create a series containing in each row the lagged values (in this example up to lag 2)?
3 [3, 4, 5]
4 [2, 3, 4]
5 [1, 2, 3]
This corresponds to s=pd.Series([[3,4,5],[2,3,4],[1,2,3]], index=[3,4,5])
How can this be done in an efficient way for dataframes with a lot of timeseries which are very long?
Thanks
Edited after seeing the answers:
OK, in the end I implemented this function:
def buildLaggedFeatures(s, lag=2, dropna=True):
    '''
    Builds a new DataFrame to facilitate regressing over all possible lagged features
    '''
    if isinstance(s, pd.DataFrame):
        new_dict = {}
        for col_name in s:
            new_dict[col_name] = s[col_name]
            # create lagged Series
            for l in range(1, lag + 1):
                new_dict['%s_lag%d' % (col_name, l)] = s[col_name].shift(l)
        res = pd.DataFrame(new_dict, index=s.index)
    elif isinstance(s, pd.Series):
        the_range = range(lag + 1)
        res = pd.concat([s.shift(i) for i in the_range], axis=1)
        res.columns = ['lag_%d' % i for i in the_range]
    else:
        print('Only works for DataFrame or Series')
        return None
    if dropna:
        return res.dropna()
    else:
        return res
It produces the desired output and manages the naming of the columns in the resulting DataFrame.
For a Series as input:
s=pd.Series([5,4,3,2,1], index=[1,2,3,4,5])
res=buildLaggedFeatures(s,lag=2,dropna=False)
lag_0 lag_1 lag_2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
and for a DataFrame as input:
s2 = pd.DataFrame({'a': [5, 4, 3, 2, 1], 'b': [50, 40, 30, 20, 10]}, index=[1, 2, 3, 4, 5])
res2=buildLaggedFeatures(s2,lag=2,dropna=True)
a a_lag1 a_lag2 b b_lag1 b_lag2
3 3 4 5 30 40 50
4 2 3 4 20 30 40
5 1 2 3 10 20 30
As mentioned, it could be worth looking into the rolling_ functions, which will mean you won't have as many copies around.
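For example, a rolling window can produce the list-valued rows from the question without materializing a shifted copy per lag. A minimal sketch, assuming pandas >= 1.1, where a Rolling object is iterable:
# Keep only full windows, reversed so the most recent value comes first
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
lagged = pd.Series([w[::-1].tolist() for w in s.rolling(3) if len(w) == 3],
                   index=s.index[2:])
print(lagged)
# 3    [3, 4, 5]
# 4    [2, 3, 4]
# 5    [1, 2, 3]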
One solution is to concat shifted Series together to make a DataFrame:
In [11]: pd.concat([s, s.shift(), s.shift(2)], axis=1)
Out[11]:
0 1 2
1 5 NaN NaN
2 4 5 NaN
3 3 4 5
4 2 3 4
5 1 2 3
In [12]: pd.concat([s, s.shift(), s.shift(2)], axis=1).dropna()
Out[12]:
0 1 2
3 3 4 5
4 2 3 4
5 1 2 3
Doing work on this will be more efficient than on lists...
Very simple solution using a pandas DataFrame:
number_lags = 3
df = pd.DataFrame(data={'vals': [5, 4, 3, 2, 1]})
for lag in range(1, number_lags + 1):
    df['lag_' + str(lag)] = df.vals.shift(lag)
print(df)
# on Python 2.x, use xrange instead of range
# if you want numpy arrays with no null values: df.dropna().values
   vals  lag_1  lag_2  lag_3
0     5    NaN    NaN    NaN
1     4    5.0    NaN    NaN
2     3    4.0    5.0    NaN
3     2    3.0    4.0    5.0
4     1    2.0    3.0    4.0
For a dataframe df, to apply the lag to column 'col name' you can use the shift function.
df['lag1']=df['col name'].shift(1)
df['lag2']=df['col name'].shift(2)
I like to put the lag numbers in the columns by making the columns a MultiIndex. This way, the names of the columns are retained.
Here's an example of the result:
# Setup
indx = pd.Index([1, 2, 3, 4, 5], name='time')
s=pd.Series(
[5, 4, 3, 2, 1],
index=indx,
name='population')
shift_timeseries_by_lags(pd.DataFrame(s), [0, 1, 2])
Result: a MultiIndex DataFrame with two column levels: the original one ("population") and a new one ("lag"):
     population
lag           0    1    2
time
1             5  NaN  NaN
2             4  5.0  NaN
3             3  4.0  5.0
4             2  3.0  4.0
5             1  2.0  3.0
Solution: Like in the accepted solution, we use DataFrame.shift and then pandas.concat.
def shift_timeseries_by_lags(df, lags, lag_label='lag'):
    return pd.concat([
        shift_timeseries_and_create_multiindex_column(df, lag,
                                                      lag_label=lag_label)
        for lag in lags], axis=1)

def shift_timeseries_and_create_multiindex_column(
        dataframe, lag, lag_label='lag'):
    return (dataframe.shift(lag)
            .pipe(append_level_to_columns_of_dataframe,
                  lag, lag_label))
I wish there were an easy way to append a list of labels to the existing columns. Here's my solution.
def append_level_to_columns_of_dataframe(
        dataframe, new_level, name_of_new_level, inplace=False):
    """Given a (possibly MultiIndex) DataFrame, append labels to the column
    labels and assign this new level a name.

    Parameters
    ----------
    dataframe : a pandas DataFrame with an Index or MultiIndex columns
    new_level : scalar, or arraylike of length equal to the number of columns
        in `dataframe`
        The labels to put on the columns. If scalar, it is broadcast into a
        list of length equal to the number of columns in `dataframe`.
    name_of_new_level : str
        The label to give the new level.
    inplace : bool, optional, default: False
        Whether to modify `dataframe` in place or to return a copy
        that is modified.

    Returns
    -------
    dataframe_with_new_columns : pandas DataFrame with MultiIndex columns
        The original `dataframe` with new columns that have the given `level`
        appended to each column label.
    """
    old_columns = dataframe.columns
    if not hasattr(new_level, '__len__') or isinstance(new_level, str):
        new_level = [new_level] * dataframe.shape[1]
    if isinstance(old_columns, pd.MultiIndex):
        # Use get_level_values (not .levels, which holds only the unique
        # values per level) so repeated column labels survive the rebuild
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns.get_level_values(i)
             for i in range(old_columns.nlevels)] + [new_level],
            names=list(old_columns.names) + [name_of_new_level])
    else:
        new_columns = pd.MultiIndex.from_arrays(
            [old_columns] + [new_level],
            names=[old_columns.name] + [name_of_new_level])
    if inplace:
        dataframe.columns = new_columns
        return dataframe
    else:
        copy_dataframe = dataframe.copy()
        copy_dataframe.columns = new_columns
        return copy_dataframe
Update: I learned from this solution another way to put a new level in a column, which makes it unnecessary to use append_level_to_columns_of_dataframe:
def shift_timeseries_by_lags_v2(df, lags, lag_label='lag'):
    return pd.concat({
        '{lag_label}_{lag_number}'.format(lag_label=lag_label, lag_number=lag):
            df.shift(lag)
        for lag in lags},
        axis=1)
Here's the result of shift_timeseries_by_lags_v2(pd.DataFrame(s), [0, 1, 2]):
          lag_0      lag_1      lag_2
     population population population
time
1             5        NaN        NaN
2             4        5.0        NaN
3             3        4.0        5.0
4             2        3.0        4.0
5             1        2.0        3.0
Here is a cool one-liner for lagged features with _lagN suffixes in the column names, using pd.concat (note it assumes s has a name, and it labels the unshifted column _lag1):
lagged = pd.concat([s.shift(lag).rename('{}_lag{}'.format(s.name, lag + 1)) for lag in range(3)], axis=1).dropna()
You can do the following:
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5])
res = pd.DataFrame(index=s.index)
for l in range(3):
    res[l] = s.shift(l)
print(res.loc[3:].to_numpy())
It produces:
array([[ 3., 4., 5.],
[ 2., 3., 4.],
[ 1., 2., 3.]])
which I hope is very close to what you actually want.
For multiple lags (many of them), this could be more compact:
df=pd.DataFrame({'year': range(2000, 2010), 'gdp': [234, 253, 256, 267, 272, 273, 271, 275, 280, 282]})
df.join(pd.DataFrame({'gdp_' + str(lag): df['gdp'].shift(lag) for lag in range(1,4)}))
Assuming you are focusing on a single column of your data frame, saved into s, this short code will generate lagged copies of the column (three lags in this example).
s = pd.Series([5, 4, 3, 2, 1], index=[1, 2, 3, 4, 5], name='test')
shiftdf = pd.DataFrame()
for i in range(3):
    shiftdf = pd.concat([shiftdf, s.shift(i).rename(s.name + '_' + str(i))], axis=1)
shiftdf
>>
test_0 test_1 test_2
1 5 NaN NaN
2 4 5.0 NaN
3 3 4.0 5.0
4 2 3.0 4.0
5 1 2.0 3.0
Based on the proposal by @charlie-brummitt, here is a revision that fixes a set of columns:
def shift_timeseries_by_lags(df, fix_columns, lag_numbers, lag_label='lag'):
    df_fix = df[fix_columns]
    df_lag = df.drop(columns=fix_columns)

    df_lagged = pd.concat({f'{lag_label}_{lag}': df_lag.shift(lag)
                           for lag in lag_numbers},
                          axis=1)
    df_lagged.columns = ['__'.join(reversed(x)) for x in df_lagged.columns.to_flat_index()]

    return pd.concat([df_fix, df_lagged], axis=1)
Here is an example of usage:
df = shift_timeseries_by_lags(df_province_cases, fix_columns=['country', 'state'], lag_numbers=[1,2,3])
I personally prefer the lag name as a suffix, but that can be changed by removing reversed().