Multi-slice pandas dataframe - python

I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, and so on.
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient, and not particularly elegant, to take what I want out of a dataframe and then get the rest by checking which original values are absent from the new dataframe. Is there an iloc expression, like the one used to generate df1, that could do the second part of the slicing and replace the isin line? Even better, is there a single expression that executes the entire two-step slice in one step?

Use the row position modulo 3, comparing for not equal to 0, to keep exactly the rows the first slice did not take:
import numpy as np

# for a default RangeIndex
df2 = df[df.index % 3 != 0]
# for any index
df2 = df[np.arange(len(df)) % 3 != 0]

print(df2)
val
1 b
2 c
4 e
5 f
7 h
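If you want both halves from a single step, a minimal sketch building on the same modulo idea is to compute the boolean mask once and split the dataframe on it:
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})

# True for every 3rd row (positions 0, 3, 6, ...), False for the rest
mask = np.arange(len(df)) % 3 == 0

df1, df2 = df[mask], df[~mask]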

Why doesn't pandas reindex() operate in-place?

From the reindex docs:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Therefore, I thought that by setting copy=False I would get a reordered DataFrame in place. It appears, however, that I do get a copy and need to assign it back to the original object. I would like to avoid assigning it back if I can (the reason comes from this other question).
This is what I am doing:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df.columns = [ 'a', 'b', 'c', 'd', 'e' ]
df.head()
Output:
a b c d e
0 0.234296 0.011235 0.664617 0.983243 0.177639
1 0.378308 0.659315 0.949093 0.872945 0.383024
2 0.976728 0.419274 0.993282 0.668539 0.970228
3 0.322936 0.555642 0.862659 0.134570 0.675897
4 0.167638 0.578831 0.141339 0.232592 0.976057
Reindex gives me the correct output, but I'd need to assign it back to the original object, which is what I wanted to avoid by using copy=False:
df.reindex(columns=['e', 'd', 'c', 'b', 'a'], copy=False)
The desired output after that line is:
e d c b a
0 0.177639 0.983243 0.664617 0.011235 0.234296
1 0.383024 0.872945 0.949093 0.659315 0.378308
2 0.970228 0.668539 0.993282 0.419274 0.976728
3 0.675897 0.134570 0.862659 0.555642 0.322936
4 0.976057 0.232592 0.141339 0.578831 0.167638
Why is copy=False not working in place?
Is it possible to do that at all?
Working with python 3.5.3, pandas 0.23.3
reindex is a structural change, not a cosmetic or transformative one. As such, a copy is always returned because the operation cannot be done in-place (it would require allocating new memory for the underlying arrays, etc.). This means you have to assign the result back; there is no other choice.
df = df.reindex(['e', 'd', 'c', 'b', 'a'], axis=1)
Also see the discussion on GH21598.
The one corner case where copy=False is actually of any use is when the indices used to reindex df are identical to the ones it already has. You can check by comparing the ids:
id(df)
# 4839372504
id(df.reindex(df.index, copy=False)) # same object returned
# 4839372504
id(df.reindex(df.index, copy=True)) # new object created - ids are different
# 4839371608
A bit off topic, but I believe this would rearrange the columns in place:
for i, colname in enumerate(list_of_columns_in_desired_order):
    col = dataset.pop(colname)       # remove the column from the frame
    dataset.insert(i, colname, col)  # re-insert it at its target position
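A self-contained sketch of that pop/insert approach, applied to the frame from the question; since pop and insert mutate the DataFrame, the object keeps its identity:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 5), columns=['a', 'b', 'c', 'd', 'e'])
before = id(df)

for i, colname in enumerate(['e', 'd', 'c', 'b', 'a']):
    col = df.pop(colname)        # remove the column
    df.insert(i, colname, col)   # put it back at position i

print(df.columns.tolist())  # ['e', 'd', 'c', 'b', 'a']
print(id(df) == before)     # True - same object, columns reordered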

Python: for cycle in range over files

I am trying to create a list that takes values from different files.
I have three dataframes, called for example "df1", "df2", "df3".
Each file contains two columns of data, so for example "df1" looks like this:
0, 1
1, 4
7, 7
I want to create a list that takes the value from the first row of the second column in each file, so it should look like this:
F = [1, value from df2, value from df3]
My attempt:
import pandas as pd
df1 = pd.read_csv(file1)
df2 = pd.read_csv(file2)
df3 = pd.read_csv(file3)
F = []
for i in range(3):
    F.append(df{"i"}[1][0])
Probably that is not how to iterate over them, but I cannot figure out the correct way.
You can use iloc and a list comprehension:
vals = [df.iloc[0, 1] for df in [df1, df2, df3]]
iloc gets the value from the first row (index 0) and second column (index 1). If you wanted, say, the value from the third row and fourth column, you would use .iloc[2, 3], and so forth.
As suggested by @jpp, you may use iat instead:
vals = [df.iat[0, 1] for df in [df1, df2, df3]]
For the difference between them, check this and this question.
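If the dataframes exist only to pull out that one value each, a small sketch that loops over the files directly (the file names here are the question's placeholders):
import pandas as pd

files = ['file1', 'file2', 'file3']

# read each file and grab the first-row, second-column value in one pass
F = [pd.read_csv(f).iat[0, 1] for f in files]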

Use results from two queries with common key to create a dataframe without having to use merge

data set:
df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'],
                  index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'],
              name='availability').reset_index()
(Feel free to remove the reset_index bits above; they are simply there to make it easy to try a different approach. However, the resulting datasets from the queries I'm running most closely resemble the above.)
I have two separate queries that return data similar to the above. One query pulls a column of information from the DB that does not exist in the other. The 'index' column is the common key across both tables.
My result set needs to have the second query's result series injected into the first query's dataframe at a specific column index.
I know that I can simply run:
df = df.merge(s, how='left', on='index')
Then to enforce column order:
df = df[['index', 'A', 'B', 'availability', 'C', 'D']]
I saw that you can do df.insert, but that requires the series to be the same length as the df.
I'm wondering if there is a way to do this without having to run merge and then enforce column order. With my actual dataset, the list of columns is significantly longer. I'd imagine the best solution likely relies on list manipulation, but I'd much rather do something clever with how the dataframe is created in the first place.
df.set_index(['index','id']).index.map(s['availability'])
is returning:
TypeError: 'Series' object is not callable
s is a dataframe with a multi-index and one boolean column. df is a dataframe whose columns make up s's multi-index.
IIUC:
In [260]: df.insert(3, 'availability',
     ...:           df['index'].map(s.set_index('index')['availability']))
In [261]: df
Out[261]:
index A B availability C D
0 abcd 1.867270 0.517894 True 0.584115 -0.162361
1 efgh -0.036696 1.155110 True -1.112075 2.005678
2 abcd 0.693795 -0.843335 True -1.003202 1.001791
3 abc123 -1.466148 -0.848055 False -0.373293 0.360091
4 efgh -0.436618 -0.625454 True -0.285795 -0.220717
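Putting that together as a runnable sketch with the question's own sample data (your numbers will differ, since the data is random):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4), columns=['A', 'B', 'C', 'D'],
                  index=['abcd', 'efgh', 'abcd', 'abc123', 'efgh']).reset_index()
s = pd.Series(data=[True, True, False], index=['abcd', 'efgh', 'abc123'],
              name='availability').reset_index()

# map each key in df['index'] to its availability flag, then insert the
# result at column position 3 - no merge and no column reordering needed
df.insert(3, 'availability', df['index'].map(s.set_index('index')['availability']))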

How do I slice X rows from a DataFrame beginning at a specific label index in Pandas and Python

I want to specify a label index, then slice X rows from a dataframe, without necessarily knowing my end label. My labels are usually timestamps, but that should not matter. I am having trouble mixing labels with an integer number of rows wanted.
so if:
df = pd.DataFrame(np.random.rand(8, 3), columns=list('abc'), index=list('lmnopqrs'))
How do I get the result given by this code:
df.loc['q':'o':-1]
BUT, what if I only know the 'q' label? I want something with logic like this:
df.loc['q':"3 rows only":-1]
Normally I would never know which integer position 'q' is at; I would only know its name, not where in the dataframe it sits. Thanks.
I am not sure if there are better ways to do this, but you can use df.index to access the indexes in the DataFrame, and df.index.tolist() to get the index as a list.
So in your case, df.index.tolist() would give -
In [13]: df.index.tolist()
Out[13]: ['l', 'm', 'n', 'o', 'p', 'q', 'r', 's']
Then you can find the position of 'q' in that list using the list.index() method, and get the element that is 2 positions before 'q'. Example -
In [19]: df.index[df.index.tolist().index('q')-2]
Out[19]: 'o'
You can use this to index your DataFrame. Example -
In [20]: df.loc['q':df.index[df.index.tolist().index('q')-2]:-1]
Out[20]:
a b c
q 0.791467 0.703116 0.268405
p 0.643924 0.434607 0.918549
o 0.630881 0.209446 0.351309
You can do this with the .ix attribute:
index = df.index.searchsorted('q') # or just a number if you already have it.
offset = 3
df.ix[index : index - offset : -1]
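Note that .ix has since been deprecated (and removed in pandas 1.0); a purely positional sketch of the same idea using iloc, assuming the index is sorted so searchsorted can find 'q':
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(8, 3), columns=list('abc'), index=list('lmnopqrs'))

pos = df.index.searchsorted('q')  # or df.index.get_loc('q') for an unsorted index
offset = 3
print(df.iloc[pos : pos - offset : -1])  # rows q, p, o in reverse order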

A series query on groupby function

I have a data frame called active, and its POS column has 10 unique values.
I group by the POS values, mean normalize the OPW columns, and store the normalized values as a separate column ['resid'].
If I group by the POS values, shouldn't the new active data frame's POS column contain only unique POS values?
For example:
df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
print(df2)
df2.groupby(['X']).sum()
I get an output like this:
Y
X
A 7
B 3
In my example, shouldn't I get a column with only unique POS values, instead of repeated values like those shown below?
POS Other Columns
Rf values
2B values
LF values
2B values
OF values
I can't be 100% sure without the actual data, but I'm pretty sure the problem here is that you are not aggregating the data.
Let's go through the groupby step by step.
When you do active.groupby('POS'), what actually happens is that the dataframe is sliced into one piece per unique POS, and each of these slices is passed, sequentially, to the applied function.
You can get a better view of what's happening by using get_group (e.g. active.groupby('POS').get_group('RF')).
So you're applying your meanNormalizeOPW function to each of those slices. That function computes a mean normalized value in the 'resid' column for each line of the passed dataframe, and you return that dataframe, ending with a shape similar to what was passed in.
So if you just add an aggregation function to the returned df, it should work fine. I guess here you want a mean, so just change return df into return df.mean().
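A small sketch of the difference, reusing the df2 example from the question (the transform below stands in for what meanNormalizeOPW presumably does):
import pandas as pd

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

# transform returns one value per original row, so the group keys repeat
df2['resid'] = df2.groupby('X')['Y'].transform(lambda s: s - s.mean())
print(df2)
#    X  Y  resid
# 0  B  1   -0.5
# 1  B  2    0.5
# 2  A  3   -0.5
# 3  A  4    0.5

# aggregating collapses each group to one row, so the keys are unique
print(df2.groupby('X')['Y'].mean())
# X
# A    3.5
# B    1.5
# Name: Y, dtype: float64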
