Let's assume I have a MultiIndex consisting of the date and some categories (one category, for simplicity, in the example below), and for each category I have a time series with the values of some process.
I only have a value where there was an observation, and I now want to add a "0" for every date on which there was no observation.
I found a way that seems very inefficient: stacking and unstacking, which would create a huge number of intermediate columns given millions of categories.
import datetime as dt
import pandas as pd

days = 4
# list of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
print(df)
# insert 0 values for missing dates
print(df.unstack().reindex(all_dates).fillna(0).stack())
print(all_dates)
                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3
                     value
           category
2013-02-13 1             2
           2             3
2013-02-12 1             0
           2             0
2013-02-11 1             0
           2             7
2013-02-10 1             4
           2             7
[datetime.date(2013, 2, 13), datetime.date(2013, 2, 12),
 datetime.date(2013, 2, 11), datetime.date(2013, 2, 10)]
Does anybody know a smarter way to achieve the same?
EDIT: I found another way to achieve the same result:
import datetime as dt
import pandas as pd

days = 4
# list of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x) for x in range(days)]
df = pd.DataFrame([(dt.date(2013, 2, 10), 1, 4, 5),
                   (dt.date(2013, 2, 10), 2, 1, 7),
                   (dt.date(2013, 2, 10), 2, 2, 7),
                   (dt.date(2013, 2, 11), 2, 3, 7),
                   (dt.date(2013, 2, 13), 1, 4, 2),
                   (dt.date(2013, 2, 13), 2, 4, 3)],
                  columns=['date', 'category', 'cat2', 'value'])
date_col = 'date'
other_index = ['category', 'cat2']
index = [date_col] + other_index
df.set_index(index, inplace=True)
grouped = df.groupby(level=other_index)
df_list = []
for i, group in grouped:
    df_list.append(group.reset_index(level=other_index)
                        .reindex(all_dates).fillna(0))
print(pd.concat(df_list).set_index(other_index, append=True))
                          value
           category cat2
2013-02-13 1        4         2
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 1        4         5
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        1         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 2        2         7
2013-02-13 0        0         0
2013-02-12 0        0         0
2013-02-11 2        3         7
2013-02-10 0        0         0
2013-02-13 2        4         3
2013-02-12 0        0         0
2013-02-11 0        0         0
2013-02-10 0        0         0
You can make a new multi index based on the Cartesian product of the index levels you want. Then, re-index your data frame using the new index.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
new_df = df.reindex(new_index)
# Optional: convert missing values to zero, and convert the data back
# to integers. See explanation below.
new_df = new_df.fillna(0).astype(int)
That's it! The new data frame has all the possible index values. The existing data is indexed correctly.
Read on for a more detailed explanation.
Explanation
Set up sample data
import datetime as dt
import pandas as pd

days = 4
# list of all dates that should be in the index
all_dates = [dt.date(2013, 2, 13) - dt.timedelta(days=x)
             for x in range(days)]
df = pd.DataFrame([
    (dt.date(2013, 2, 10), 1, 4),
    (dt.date(2013, 2, 10), 2, 7),
    (dt.date(2013, 2, 11), 2, 7),
    (dt.date(2013, 2, 13), 1, 2),
    (dt.date(2013, 2, 13), 2, 3)],
    columns=['date', 'category', 'value'])
df.set_index(['date', 'category'], inplace=True)
Here's what the sample data looks like
                     value
date       category
2013-02-10 1             4
           2             7
2013-02-11 2             7
2013-02-13 1             2
           2             3
Make new index
Using from_product we can make a new multi index. This new index is the Cartesian product of all the values you pass to the function.
(date_index, category_index) = df.index.levels
new_index = pd.MultiIndex.from_product([all_dates, category_index])
Reindex
Use the new index to reindex the existing data frame.
All the possible combinations are now present. The missing values are null (NaN).
new_df = df.reindex(new_index)
Now, the expanded, re-indexed data frame looks like this:
              value
2013-02-13 1    2.0
           2    3.0
2013-02-12 1    NaN
           2    NaN
2013-02-11 1    NaN
           2    7.0
2013-02-10 1    4.0
           2    7.0
Nulls in integer column
You can see that the data in the new data frame has been converted from ints to floats. Pandas can't have nulls in an integer column. Optionally, we can convert all the nulls to 0, and cast the data back to integers.
new_df = new_df.fillna(0).astype(int)
Result
              value
2013-02-13 1      2
           2      3
2013-02-12 1      0
           2      0
2013-02-11 1      0
           2      7
2013-02-10 1      4
           2      7
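As a side note, reindex also accepts a fill_value argument, so the zero-filling and the integer dtype can be kept in one step, avoiding the NaN detour entirely. A minimal sketch, reusing df and new_index from above:

# fill_value fills the new rows directly, so the column never passes
# through NaN and stays integer-typed
new_df = df.reindex(new_index, fill_value=0)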
Check out this answer: How to fill the missing record of Pandas dataframe in pythonic way?
You can do something like:
import datetime
import pandas as pd

# make an empty dataframe with the index you want
def get_datetime(x):
    return datetime.date(2013, 2, 13) - datetime.timedelta(days=x)

all_dates = [get_datetime(x) for x in range(4)]
categories = [1, 2, 3, 4]
index = [[date, cat] for cat in categories for date in all_dates]

# this df will be just an index
df = pd.DataFrame(index, columns=['date', 'category'])
df = df.set_index(['date', 'category'])

# now if your original df is called df_orig you can reindex against the new index
df_orig = df_orig.reindex(df.index)
# and to add zeros
df_orig = df_orig.fillna(0)
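A minimal sketch of the same index construction using itertools.product instead of the nested list comprehension (full_index is just an illustrative name):

import datetime
import itertools
import pandas as pd

all_dates = [datetime.date(2013, 2, 13) - datetime.timedelta(days=x) for x in range(4)]
categories = [1, 2, 3, 4]
# Cartesian product of dates and categories, materialized as a MultiIndex
full_index = pd.MultiIndex.from_tuples(list(itertools.product(all_dates, categories)),
                                       names=['date', 'category'])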
Related
My first dataframe is like this:
           value
idx1 idx2
1    a         9
     b         8
2    a         7
     b         6
I want to convert it into this:
idx1   a   b
1      9   8
2      7   6
Assuming a MultiIndex, you need to convert to Series and unstack:
df['value'].unstack('idx2').reset_index().rename_axis(None, axis=1)
output:
   idx1  a  b
0     1  9  8
1     2  7  6
used input:
df = (pd.DataFrame({'value': {(1, 'a'): 9, (1, 'b'): 8, (2, 'a'): 7, (2, 'b'): 6}})
.rename_axis(['idx1', 'idx2'])
)
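For reference, a sketch of an equivalent route via pivot, assuming the same input df (out is just an illustrative name):

out = (df.reset_index()
         .pivot(index='idx1', columns='idx2', values='value')
         .reset_index()
         .rename_axis(None, axis=1))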
I am trying to aggregate and count values together. Below you can see my dataset
import numpy as np
import pandas as pd

data = {'id': ['1', '2', '3', '4', '5'],
        'name': ['Company1', 'Company1', 'Company3', 'Company3', 'Company5'],
        'sales': [10, 3, 5, 1, 0],
        'income': [10, 3, 5, 1, 0],
        }
df = pd.DataFrame(data, columns=['id', 'name', 'sales', 'income'])
conditions = [
    (df['sales'] < 1),
    (df['sales'] >= 1) & (df['sales'] < 3),
    (df['sales'] >= 3) & (df['sales'] < 5),
    (df['sales'] >= 5)
]
values = ['<1', '1-3', '3-5', '>= 5']
df['range'] = np.select(conditions, values)
df = df.groupby('range')[['sales', 'income']].agg(['count', 'sum']).reset_index()
This code gives me a table with a two-level column header, where 'count' appears twice (once under 'sales' and once under 'income'). I am not satisfied with that appearance, so can anybody help me reshape this table in order to have separate columns 'range', 'count', 'income' and 'sales'?
You could try named aggregation:
df.groupby('range', as_index=False).agg(count=('range','count'), sales=('sales','sum'), income=('income','sum'))
Output:
  range  count  sales  income
0   1-3      1      1       1
1   3-5      1      3       3
2    <1      1      0       0
3  >= 5      2     15      15
P.S. You probably want to make "range" a categorical variable, so that the output is sorted in the correct order:
df['range'] = pd.Categorical(np.select(conditions, values), categories=values, ordered=True)
Then the above code outputs:
  range  count  sales  income
0    <1      1      0       0
1   1-3      1      1       1
2   3-5      1      3       3
3  >= 5      2     15      15
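As a sketch of an alternative to np.select, pd.cut can build the same bins as an ordered categorical in one call, assuming df as originally constructed (before it is overwritten by the groupby result); the bin edges below mirror the conditions above:

import numpy as np
import pandas as pd

df['range'] = pd.cut(df['sales'],
                     bins=[-np.inf, 1, 3, 5, np.inf],
                     labels=['<1', '1-3', '3-5', '>= 5'],
                     right=False)
# pd.cut returns an ordered Categorical, so grouped output sorts by bin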
I want to transform a two-dimensional DataFrame into a one-dimensional Series.
Here's an example:
In [11]: df = pd.DataFrame(np.reshape(range(9), (3,3)))
In [12]: df
Out[12]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
What I want is a Series like this:
In [13]: pd.Series([(0,1,2), (3,4,5), (6,7,8)])
Out[13]:
0 (0, 1, 2)
1 (3, 4, 5)
2 (6, 7, 8)
dtype: object
This merges all the elements of each row into a tuple, reducing the dimensionality from 2 to 1.
Is there a good method to do this?
And how can I recover the DataFrame from the tuple Series?
Use a list comprehension with df.to_numpy to convert it to a Series:
In [556]: l = [tuple(r) for r in df.to_numpy()]
In [563]: new_series = pd.Series(l)
In [564]: new_series
Out[564]:
0 (0, 1, 2)
1 (3, 4, 5)
2 (6, 7, 8)
dtype: object
To convert it back to a df, pass the list to the DataFrame constructor:
In [561]: pd.DataFrame(l)
Out[561]:
   0  1  2
0  0  1  2
1  3  4  5
2  6  7  8
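For completeness, a minimal sketch of the full round trip that also preserves the index and column labels (restored is just an illustrative name):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(range(9), (3, 3)))
s = pd.Series([tuple(r) for r in df.to_numpy()], index=df.index)
# s.tolist() expands the tuples back into rows; reusing the original
# index and columns makes the round trip exact
restored = pd.DataFrame(s.tolist(), index=s.index, columns=df.columns)
restored.equals(df)  # True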
I have the following function to calculate a value from two points:
import numpy as np
import math

def some_func(pt1, pt2):
    return math.sqrt((pt2[0] - pt1[0]) * (pt2[0] - pt1[0])
                     + (pt2[1] - pt1[1]) * (pt2[1] - pt1[1]))
usage:
a = 1, 2
b = 4, 5
some_func(a,b)
#outputs = 4.24264
#or some_func((1,2), (4,5)) would give the same output too
I have the following df:
seq  x   y   points
1    2   3   (2,3)
1    10  5   (10,5)
1    6   7   (6,7)
2    8   9   (8,9)
2    10  11  (10,11)
column "points" was obtained using the below piece of code:
df["points"] = list(zip(df.loc[:, "x"], df.loc[:, "y"]))
I want to apply the some_func function to the whole df, also grouping by "seq".
I tried:
df["value"] = some_func(df["points"].values, df["points"].shift(1).values)
#without using groupby
and
df["value"] = df.groupby("seq").points.apply(some_func) #with groupby
but both of them raise a TypeError, complaining about a missing argument or an unsupported data type.
Expected df
seq  x   y   points   value
1    2   3   (2,3)      NaN
1    10  5   (10,5)    8.24
1    6   7   (6,7)     4.47
2    8   9   (8,9)      NaN
2    10  11  (10,11)   2.82
You can use groupby with DataFrameGroupBy.shift first, but then you need to replace the NaNs with tuples - one possible solution is to use fillna with a Series of tuples. Last, use apply:
s = pd.Series([(np.nan, np.nan)], index=df.index)
df['shifted'] = df.groupby('seq').points.shift().fillna(s)
df['values'] = df.apply(lambda x: some_func(x['points'], x['shifted']), axis=1)
print(df)
   seq   x   y    points     shifted    values
0    1   2   3    (2, 3)  (nan, nan)       NaN
1    1  10   5   (10, 5)      (2, 3)  8.246211
2    1   6   7    (6, 7)     (10, 5)  4.472136
3    2   8   9    (8, 9)  (nan, nan)       NaN
4    2  10  11  (10, 11)      (8, 9)  2.828427
Another solution is to filter out the NaNs in apply:
df['shifted'] = df.groupby('seq').points.shift()
f = lambda x: some_func(x['points'], x['shifted']) if pd.notnull(x['shifted']) else np.nan
df['values'] = df.apply(f, axis=1)
print(df)
   seq   x   y    points  shifted    values
0    1   2   3    (2, 3)      NaN       NaN
1    1  10   5   (10, 5)   (2, 3)  8.246211
2    1   6   7    (6, 7)  (10, 5)  4.472136
3    2   8   9    (8, 9)      NaN       NaN
4    2  10  11  (10, 11)   (8, 9)  2.828427
A variant without apply: wrap some_func in a lambda and skip the rows that have no previous point within their group:
f = lambda x, y: some_func(x, y)
shifted = df.groupby("seq")["points"].shift()
df["value"] = [f(p, s) if isinstance(s, tuple) else np.nan
               for p, s in zip(df["points"], shifted)]
Let's say I have a dataframe column. I want to create a new column where the value for a given observation is 1 if the corresponding value in the old column is above average. But the value should be 0 if the value in the other column is average or below.
What's the fastest way of doing this?
Say you have the following DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 4, 6, 2, 8, 3, 7, 1, 5]})
df['A'].mean()
Out: 4.111111111111111
Comparison against the mean will get you a boolean vector. You can cast that to integer:
df['B'] = (df['A'] > df['A'].mean()).astype(int)
or use np.where:
df['B'] = np.where(df['A'] > df['A'].mean(), 1, 0)
df
Out:
   A  B
0  1  0
1  4  0
2  6  1
3  2  0
4  8  1
5  3  0
6  7  1
7  1  0
8  5  1
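For what it's worth, the same comparison can also be written with the gt method, which chains nicely inside longer pipelines; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'A': [1, 4, 6, 2, 8, 3, 7, 1, 5]})
df['B'] = df['A'].gt(df['A'].mean()).astype(int)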