creating columns based on values in the hierarchical row index - python

I have a pandas dataframe with a hierarchical row index:
import numpy as np
import pandas as pd

def stack_example():
    i = pd.DatetimeIndex(['2011-04-04', '2011-04-06', '2011-04-12', '2011-04-13'])
    cols = pd.MultiIndex.from_product([['milk', 'honey'], [u'jan', u'feb'], [u'PRICE', u'LITERS']])
    df = pd.DataFrame(np.random.randint(12, size=(len(i), 8)), index=i, columns=cols)
    df.columns.names = ['food', 'month', 'measure']
    df.index.names = ['when']
    df = df.stack('food')
    df = df.stack('month')
    df['constant_col'] = "foo"
    df['liters_related_col'] = df['LITERS'] * 99
    return df
I can add new columns to this dataframe based on constants or based on calculations involving other columns.
I would like to add new columns based in part on calculations involving the index.
For example, just repeat the food name twice:
df.index
MultiIndex(levels=[[2011-04-04 00:00:00, 2011-04-06 00:00:00, 2011-04-12 00:00:00, 2011-04-13 00:00:00], [u'honey', u'milk'], [u'feb', u'jan']],
labels=[[0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
names=[u'when', u'food', u'month'])
df.index.values[4][1]*2
'honeyhoney'
But I can't figure out the syntax for creating something like this:
df['xcol'] = df.index.values[2]*2
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2519, in __setitem__
self._set_item(key, value)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2585, in _set_item
value = self._sanitize_column(key, value)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\frame.py", line 2760, in _sanitize_column
value = _sanitize_index(value, self.index, copy=False)
File "C:\Users\mds\Anaconda2\envs\bbg27\lib\site-packages\pandas\core\series.py", line 3080, in _sanitize_index
raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
I've also tried variations like df['xcol'] = df.index.values[:][2]*2

In the case of df.index.values[4][1] * 2, where the value is a string (honeyhoney), it's fine to assign that to a column:
df['col1'] = df.index.values[4][1] * 2
df.col1
when food month
2011-04-04 honey feb honeyhoney
jan honeyhoney
milk feb honeyhoney
jan honeyhoney
In your second example, though, the one with the error, you're not actually performing an operation on a single value:
df.index.values[2]*2
(Timestamp('2011-04-04 00:00:00'),
'milk',
'feb',
Timestamp('2011-04-04 00:00:00'),
'milk',
'feb')
You could still smush all that into a string, or into some other format, depending on your needs:
df['col2'] = ''.join([str(x) for x in df.index.values[2]*2])
But the main issue is that the output of df.index.values[2]*2 gives you a multi-dimensional structure, which doesn't map to the existing structure of df.
New columns in df can either be a single value (in which case it's replicated automatically to fit the number of rows in df), or they can have the same number of entries as len(df).
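For instance (a minimal sketch; these column names are just illustrative):
df['constant_col2'] = "bar"            # scalar: broadcast to every row
df['row_number'] = np.arange(len(df))  # array-like: must have exactly len(df) values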
UPDATE
per comments
IIUC, you can use get_level_values() to apply an operation to an entire level of a MultiIndex:
df.index.get_level_values(1).values*2
array(['honeyhoney', 'honeyhoney', 'milkmilk', 'milkmilk', 'honeyhoney',
'honeyhoney', 'milkmilk', 'milkmilk', 'honeyhoney', 'honeyhoney',
'milkmilk', 'milkmilk', 'honeyhoney', 'honeyhoney', 'milkmilk',
'milkmilk'], dtype=object)
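Since that array has exactly len(df) entries, it can be assigned straight to a new column (a minimal sketch; 'food_doubled' is just an illustrative name):
df['food_doubled'] = df.index.get_level_values('food').values * 2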

Related

Pandas how to reorder my table or dataframe

I have a dataframe in pandas where each column has a different value range. For example:
My desired output is:
First, it makes a 2-level index, then unstacks based on the MultiIndex and, finally, renames the columns.
df = pd.DataFrame({'axis_x': [0, 1, 2, 0, 1, 2, 0, 1, 2], 'axis_y': [0, 0, 0, 1, 1, 1, 2, 2, 2],
'data': ['diode', 'switch', 'coil', '$2.2', '$4.5', '$3.2', 'colombia', 'china', 'brazil']})
df = df.set_index(['axis_x', 'axis_y']).unstack().rename(columns={0: 'product', 1: 'price', 2: 'country'})
print(df)
Prints:
data
axis_y product price country
axis_x
0 diode $2.2 colombia
1 switch $4.5 china
2 coil $3.2 brazil
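If the leftover 'data' level on top of the columns is unwanted, one option (a sketch, assuming the output above) is to drop it:
df.columns = df.columns.droplevel(0)
which leaves product, price and country as plain column names.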

how to get last column of pandas series

I am trying to count the frequencies of an array.
I've read this post; I am using a DataFrame and getting a Series.
>>> a = np.array([1, 1, 5, 0, 1, 2, 2, 0, 1, 4])
>>> df = pd.DataFrame(a, columns=['a'])
>>> b = df.groupby('a').size()
>>> b
a
0 2
1 4
2 2
4 1
5 1
dtype: int64
>>> b.iloc[:,-1]
When I try to get the last column, I get this error.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1472, in __getitem__
    return self._getitem_tuple(key)
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 2013, in _getitem_tuple
    self._has_valid_tuple(tup)
  File "/Users/pan/anaconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 220, in _has_valid_tuple
    raise IndexingError('Too many indexers')
pandas.core.indexing.IndexingError: Too many indexers
How do I get the last column of b?
Since a pandas.Series is a
"One-dimensional ndarray with axis labels",
it has no columns. If you want just the frequencies, i.e. the values of
your series, use:
b.tolist()
or, alternatively:
b.to_dict()
to keep both labels and frequencies.
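With the series above, for illustration, those give:
>>> b.tolist()
[2, 4, 2, 1, 1]
>>> b.to_dict()
{0: 2, 1: 4, 2: 2, 4: 1, 5: 1}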
P.S:
For your specific task consider also collections package:
>>> from collections import Counter
>>> a = [1, 1, 5, 0, 1, 2, 2, 0, 1, 4]
>>> c = Counter(a)
>>> list(c.values())
[2, 4, 2, 1, 1]
The problem is that the output of GroupBy.size is a Series, and a Series has no columns, so it is only possible to get the last value:
b.iloc[-1]
If you use:
b.iloc[:,-1]
it returns the last column of a DataFrame: here : means all rows and -1 in the second position means the last column.
So if you create a DataFrame from the Series:
b1 = df.groupby('a').size().reset_index(name='count')
it works as expected.
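For illustration, with the data above:
>>> b1 = df.groupby('a').size().reset_index(name='count')
>>> b1.iloc[:, -1]
0    2
1    4
2    2
3    1
4    1
Name: count, dtype: int64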

Efficient way: find row where nearly no zero appears in column

I have a problem that has to be solved as efficiently as possible. My current approach kind of works, but is extremely slow.
I have a dataframe with multiple columns; in this case I only care about one of them. It contains positive continuous numbers and some zeros.
My goal is to find the row after which nearly no zeros appear in the following rows.
To make clear what I mean, I wrote this example to replicate my problem:
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'))
There are some zeros at the beginning, but they become rarer after some time.
Here is my unoptimized code to visualize the number of zeros:
zerosum = 0  # counter for all zeros that have appeared so far
for i in range(len(df)):
    if df[0][i] == 0.0:
        df.loc[df.index[i], 'zerosum'] = zerosum
        zerosum += 1
    else:
        df.loc[df.index[i], 'zerosum'] = zerosum
df['zerosum'].plot()
With that unoptimized code I can see the distribution of zeros over time.
My expected output in this example would be the date 01-Jan-2018 08:00, because no zeros appear after that date.
The problem when dealing with my real data is that some isolated zeros can appear later. Therefore I can't just pick the last row that contains a zero; I have to somehow inspect the distribution of zeros and ignore later outliers.
Note: the visualization is not necessary to solve my problem, I just included it to explain the problem as well as possible. Thanks.
OK, second go:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame([0,0,0,0,1,0,1,0,0,2,0,0,0,1,1,0,1,2,3,4,0,4,0,5,1,0,1,2,3,4,
0,0,1,2,1,1,1,1,2,2,1,3,6,1,1,5,1,2,3,4,4,4,3,5,1,2,1,2,3,4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'),
columns=['values'])
We create a column that contains the rank of each zero, and zero where the value is non-zero:
df['zero_idx'] = np.where(df['values']==0,np.cumsum(np.where(df['values']==0,1,0)), 0)
We can use this column to get the location of a zero of any rank. I don't know what your criterion is for calling a zero an outlier, but let's say we want to make sure we are past at least 90% of all zeros:
# Total number of zeros
n_zeros = max(df['zero_idx'])
# Get past at least this percentage
tolerance = 0.9
# The rank of the abovementioned zero
rank_tolerance = math.ceil(tolerance * n_zeros)
df[df['zero_idx']==rank_tolerance].index
Out[44]: DatetimeIndex(['2018-01-01 07:30:00'], dtype='datetime64[ns]', freq='15T')
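If you then want to discard everything before that point, a possible follow-up (a sketch; cutoff and df_trimmed are just illustrative names) is:
cutoff = df[df['zero_idx'] == rank_tolerance].index[0]
df_trimmed = df.loc[cutoff:]  # keep only rows from the cutoff onward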
Okay, if you need to get the index right after the last zero occurred, you can try this:
last = 0
for i in range(len(df)):
    if df[0][i] == 0:
        last = i
print(df.iloc[last + 1])
Or by filtering:
new = df.loc[df[0]==0]
last = df.index.get_loc(new.index[-1])
print(df.iloc[last+1])
Here is my solution, using a filter and cumsum:
df = pd.DataFrame([0, 0, 0, 0, 1, 0, 1, 0, 0, 2, 0, 0, 0, 1, 1, 0, 1, 2, 3, 4, 0, 4, 0, 5, 1, 0, 1, 2, 3, 4,
0, 0, 1, 2, 1, 1, 1, 1, 2, 2, 1, 3, 6, 1, 1, 5, 1, 2, 3, 4, 4, 4, 3, 5, 1, 2, 1, 2, 3, 4],
index=pd.date_range('2018-01-01', periods=60, freq='15T'))
a = df[0] == 0
df['zerosum'] = a.cumsum()
maxval = max(df['zerosum'])
firstdate = df[df['zerosum'] == maxval].index[1]
print(firstdate)
output:
2018-01-01 08:00:00

How to read the data ignoring the comma in columns using pd.read_csv

I used the following way to read my data:
df=pd.read_csv("file.dat",delim_whitespace=True,header=None,skiprows=None)
df.head()
and then I obtained:
0, 1, 0, 1, 0, 0, 0, 0, 0, 0.5
1, 1, 0, 1, 0, 0, 0, 0, 0, 1.5
...
As shown, all the columns (except the last one) contain the number followed by a ','. However, I just need the numeric value (without the comma) in each column. How should I read the table?
Your option delim_whitespace=True is equivalent to sep='\s+'. If your file has values separated by commas, omit delim_whitespace=True; you don't need sep=',' either, since it is the default.
try:
df = pd.read_csv("file.dat", header=None, skiprows=None)
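If the fields really are separated by a comma plus whitespace, a regex separator is another option (a sketch; note that regex separators require the python engine):
df = pd.read_csv("file.dat", sep=r",\s*", header=None, engine="python")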

Filter a pandas Dataframe based on specific month values and conditional on another column

I have a large dataframe with the following columns:
import pandas as pd
f = pd.DataFrame(columns=['month', 'family_id', 'house_value'])
Months go from 0 to 239, family_ids go up to 10900, and house values vary, so the dataframe has more than two and a half million rows.
I want to filter the DataFrame to keep only the families for which the final house value differs from the initial one.
Some sample data would look like this:
f = pd.DataFrame({'month': [0, 0, 0, 0, 0, 1, 1, 239, 239], 'family_id': [0, 1, 2, 3, 4, 0, 1, 0, 1], 'house_value': [10, 10, 5, 7, 8, 10, 11, 10, 11]})
And from that sample, the resulting dataframe would be:
g = pd.DataFrame({'month': [0, 1, 239], 'family_id': [1, 1, 1], 'house_value': [10, 11, 11]})
So I tried code along these lines:
ft = f[f.loc['month'==239, 'house_value'] > f.loc['month'==0, 'house_value']]
Also tried this:
g = f[f.house_value[f.month==239] > f.house_value[f.month==0] and f.family_id[f.month==239] == f.family_id[f.month==0]]
The above code gives a KeyError: False and a ValueError. Any ideas? Thanks.
Use groupby.filter:
(f.sort_values('month')
.groupby('family_id')
.filter(lambda g: g.house_value.iat[-1] != g.house_value.iat[0]))
# family_id house_value month
#1 1 10 0
#6 1 11 1
#8 1 11 239
As commented by @Bharath, your approach errors out because boolean filtering expects the boolean Series to have the same length as the original DataFrame, which is not true in either of your cases due to the filtering you applied before the comparison.
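If you prefer a boolean mask of the same length as f, one alternative sketch (assuming months are sorted within each family) uses groupby.transform:
f = f.sort_values('month')
first = f.groupby('family_id')['house_value'].transform('first')
last = f.groupby('family_id')['house_value'].transform('last')
g = f[first != last]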
