I am not really sure how multi-indexing works, so I may simply be trying to do the wrong thing here. If I have a dataframe with
       Value
A B
1 1     5.67
1 2     6.87
1 3     7.23
2 1     8.67
2 2     9.87
2 3    10.23
If I want to access the elements where B=2, how would I do that? df.ix[2] gives me the rows where A=2. To get a particular value it seems I need df.ix[(1,2)], but then what is the purpose of the B index if you can't access it directly?
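For reproducibility, a minimal sketch that rebuilds the example frame (the level names A/B and the Value column are taken from the display above):
import pandas as pd

# two-level index: A in {1, 2}, B in {1, 2, 3}
idx = pd.MultiIndex.from_product([[1, 2], [1, 2, 3]], names=['A', 'B'])
df = pd.DataFrame({'Value': [5.67, 6.87, 7.23, 8.67, 9.87, 10.23]}, index=idx)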
You can use xs:
In [11]: df.xs(2, level='B')
Out[11]:
Value
A
1 6.87
2 9.87
alternatively:
In [12]: df.xs(1, level=1)
Out[12]:
Value
A
1 5.67
2 8.67
Just as an alternative, you could use df.loc:
>>> df.loc[(slice(None),2),:]
Value
A B
1 2 6.87
2 2 9.87
The tuple accesses the index levels in order: slice(None) grabs all values from the first level 'A', and the second position limits the second level, 'B'=2 in this example. The : specifies that you want all columns, but you could subset the columns there as well.
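For instance, a minimal sketch that keeps the same row selection but returns only the Value column as a Series:
df.loc[(slice(None), 2), 'Value']   # all of level A, B == 2, just the Value column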
If you only want to return a cross-section, use xs (as mentioned by @Andy Hayden).
However, if you want to overwrite some values in the original dataframe, use pd.IndexSlice with .loc instead. Given a dataframe df:
In [73]: df
Out[73]:
                 col_1  col_2
index_1 index_2
1       1            5      6
        1            5      6
        2            5      6
2       2            5      6
if you want to overwrite all elements in col_1 where index_2 == 2 with 0, do:
In [75]: df.loc[pd.IndexSlice[:, 2], 'col_1'] = 0
In [76]: df
Out[76]:
                 col_1  col_2
index_1 index_2
1       1            5      6
        1            5      6
        2            0      6
2       2            0      6
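For reference, a minimal sketch that reproduces this frame (the index and column names are taken from the output above):
import pandas as pd

idx = pd.MultiIndex.from_tuples([(1, 1), (1, 1), (1, 2), (2, 2)],
                                names=['index_1', 'index_2'])
df = pd.DataFrame({'col_1': 5, 'col_2': 6}, index=idx)  # scalars broadcast over the index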
I am trying to select only one row from a dask.dataframe using the command x.loc[0].compute(). It returns 4 rows, all having index=0. I tried reset_index, but there will still be 4 rows with index=0 after resetting. (I think I did the reset correctly, because I used reset_index(drop=False) and I could see the original index in the new column.)
I read the dask.dataframe documentation, and it says something along the lines that there might be more than one row with index=0 due to how dask structures the chunked data.
So, if I really want only one row by using index=0 for subsetting, how can I do this?
Edit
Your problem probably comes from reset_index; this issue is explained at the end of the answer. The earlier part of the text shows how to work around it.
For example, there is the following dask DataFrame:
import pandas as pd
import dask
import dask.dataframe as dd
df = pd.DataFrame({'col_1': [1,2,3,4,5,6,7], 'col_2': list('abcdefg')},
                  index=pd.Index([0,0,1,2,3,4,5]))
df = dd.from_pandas(df, npartitions=2)
df.compute()
Out[1]:
col_1 col_2
0 1 a
0 2 b
1 3 c
2 4 d
3 5 e
4 6 f
5 7 g
it has a numerical index with repeated 0 values. As loc is a
Purely label-location based indexer for selection by label
- it selects all 0-labeled rows. If you do
df.loc[0].compute()
Out[]:
col_1 col_2
0 1 a
0 2 b
- you'll get all the rows labeled 0 (or whatever label you specify).
In pandas there is pd.DataFrame.iloc, which helps us select a row by its numerical position. Unfortunately, in dask you can't do so, because iloc is
Purely integer-location based indexing for selection by position.
Only indexing the column positions is supported. Trying to select row positions will raise a ValueError.
To work around this problem, you can do some indexing tricks:
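The snippet that builds this helper index is not shown above; here is a minimal sketch of one way to do it in dask (a reconstruction based on the output below, which shows an index column and a new index named x):
df = df.reset_index()                 # the old, duplicated index becomes a column named 'index'
df['x'] = 1
df['x'] = df['x'].cumsum() - 1        # a global 0-based row number across all partitions
df = df.set_index('x', sorted=True)   # use it as the new, unique index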
df.compute()
Out[2]:
index col_1 col_2
x
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
4 3 5 e
5 4 6 f
6 5 7 g
- now there's a new index ranging from 0 to the length of the dataframe minus 1.
It's possible to slice it with loc and do the following (I suppose that selecting the 0 label via loc now means "select the first row"):
df.loc[0].compute()
Out[3]:
index col_1 col_2
x
0 0 1 a
About the duplicated 0 index label
If you need the original index, it's still there and can be accessed through the index column:
df.loc[:, 'index'].compute()
Out[4]:
x
0 0
1 0
2 1
3 2
4 3
5 4
6 5
I guess you got such duplication from reset_index() or similar, because it generates a new 0-based index for each partition. For example, for this dataframe of 2 partitions:
df.reset_index().compute()
Out[5]:
index col_1 col_2
0 0 1 a
1 0 2 b
2 1 3 c
3 2 4 d
0 3 5 e
1 4 6 f
2 5 7 g
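If all you need is a single one of the duplicated rows, a pragmatic sketch is to compute the label-based selection first and then pick one row with pandas iloc on the (now in-memory) result:
row = df.loc[0].compute().iloc[[0]]   # first of the rows labeled 0, as a one-row DataFrame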
In the following dataset, what's the best way to duplicate rows so that any Type with a groupby(['Type']) count < 3 is padded up to 3? df is the input and df1 is my desired outcome: you can see that row 3 from df was duplicated twice at the end. This is only an example; the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I don't know the best way to write func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0, downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append is available in pandas>=0.23.0; remove it if you are using a lower version.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
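DataFrame.append was removed in pandas 2.0; if you are on a newer version, a minimal equivalent sketch using pd.concat with the same repeat_num column as above:
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type','Val']]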
Example:
Current df looks like:
df=
A B
1 5
2 6
3 8
4 1
I want the resulting df to be like this (B is sorted and A remains untouched):
df=
A B
1 8
2 6
3 5
4 1
You need to bypass an internal Pandas safety mechanism: alignment by index, which takes care of data consistency. Assigning a 1D NumPy array or a plain Python list does the trick, because neither of them carries an index, so Pandas can't align:
df['B'] = df['B'].sort_values(ascending=False).values
or
df['B'] = df['B'].sort_values(ascending=False).tolist()
both yield:
In [77]: df
Out[77]:
A B
0 1 8
1 2 6
2 3 5
3 4 1
You can do this as well:
df['B'] = sorted(df['B'].tolist())[::-1]
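Note that sorted can take the Series directly along with a reverse flag, so the reversed slice isn't needed:
df['B'] = sorted(df['B'], reverse=True)   # same result, one pass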
I have a question regarding pandas dataframes:
I have a dataframe like the following,
df = pd.DataFrame([[1,1,10],[1,1,30],[1,2,40],[2,3,50],[2,3,150],[2,4,100]],columns=["a","b","c"])
a b c
0 1 1 10
1 1 1 30
2 1 2 40
3 2 3 50
4 2 3 150
5 2 4 100
And I want to produce the following output,
a "new col"
0 1 30
1 2 100
where the first line is calculated as follows:
Group df by the first column "a",
then group each "a"-group by "b",
calculate the mean of "c" for each b-group,
calculate the mean of all b-group means for one "a";
this is the final value stored in "new col" for that "a".
I can imagine that this is somewhat confusing, but I hope it is understandable nevertheless.
I achieved the desired result, but as I need this for a huge dataframe, my solution is probably much too slow:
pd.DataFrame([ [a, adata.groupby("b").agg({"c": lambda x:x.mean()}).mean()[0]] for a,adata in df.groupby("a") ],columns=["a","new col"])
a new col
0 1 30.0
1 2 100.0
Therefore, what I would need is something like (?)
df.groupby("a").groupby("b")["c"].mean()
Thank you very much in advance!
Here's one way
In [101]: (df.groupby(['a', 'b'], as_index=False)['c'].mean()
             .groupby('a', as_index=False)['c'].mean()
             .rename(columns={'c': 'new col'}))
Out[101]:
a new col
0 1 30
1 2 100
In [57]: df.groupby(['a','b'])['c'].mean().mean(level=0).reset_index()
Out[57]:
a c
0 1 30
1 2 100
df.groupby(['a','b']).mean().reset_index().groupby('a').mean()
Out[117]:
b c
a
1 1.5 30.0
2 3.5 100.0
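The level argument to mean (used in In [57] above) was removed in pandas 2.0; a minimal equivalent sketch for newer versions:
out = (df.groupby(['a', 'b'])['c'].mean()   # mean of c within each (a, b) group
         .groupby(level='a').mean()         # mean of those means per a
         .reset_index(name='new col'))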
I've been through several trials; nothing seems to work so far.
I have tried df.insert(0, "XYZ", 555), which seemed to work until it didn't, for reasons I am not certain of.
I understand that the issue is that the Index is not considered a Series, so df.iloc[0] does not allow you to insert data directly above the index column.
I've also tried manually adding a first index with the value "XYZ" to the list of indices in the dataframe definition, but nothing has worked.
Thanks for your help
A B C D are my columns and range(5) is my index. I am trying to obtain the layout below: an arbitrary row starting with type, followed by a list of strings. Thanks
A B C D
type 'string1' 'string2' 'string3' 'string4'
0
1
2
3
4
If you use Timestamps as the index, adding a custom single row with its own index this way will throw an error:
ValueError: Cannot add integral value to Timestamp without offset. I am guessing it's due to the difference in the operands, e.g. if I subtract an integer from a Timestamp? How could I fix this in a generic manner? Thanks!
If you want to insert a row before the first row, you can do it this way:
data:
In [57]: df
Out[57]:
id var
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
adding one row:
In [58]: df.loc[df.index.min() - 1] = ['z', -1]
In [59]: df
Out[59]:
id var
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
-1 z -1
sort index:
In [60]: df = df.sort_index()
In [61]: df
Out[61]:
id var
-1 z -1
0 a 1
1 a 2
2 a 3
3 b 5
4 b 9
optionally reset your index :
In [62]: df = df.reset_index(drop=True)
In [63]: df
Out[63]:
id var
0 z -1
1 a 1
2 a 2
3 a 3
4 b 5
5 b 9
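Regarding the Timestamp comment above: subtracting an integer from a Timestamp is exactly what raises that ValueError, so for a DatetimeIndex subtract a pd.Timedelta instead. A sketch, assuming the same two-column frame; the offset size is arbitrary:
df.loc[df.index.min() - pd.Timedelta(days=1)] = ['z', -1]   # new label strictly before the first
df = df.sort_index()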