Reorder DataFrame rows inplace - python

Is it possible to reorder pandas.DataFrame rows (based on an index) inplace?
I am looking for something like this:
df.reorder(idx, axis=0, inplace=True)
Here idx is not equal to df.index but is of the same type; it contains the same elements, just in a different order. The output should be df reordered according to the new idx.
I have not found anything in the documentation and I failed to make reindex_axis do this, which made me hope it was possible because:
A new object is produced unless the new index is equivalent to the
current one and copy=False
I might have misunderstood what equivalent index means in this context.

Try using the reindex function (note that this is not inplace):
>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['test', 'hi', 'hello']})
>>> df
   col1   col2
0     1   test
1     2     hi
2     3  hello
>>> df = df.reindex([2, 0, 1])
>>> df
   col1   col2
2     3  hello
0     1   test
1     2     hi
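As far as I know there is no true in-place reorder; the usual idiom is to compute the reordered frame and rebind the name. A minimal sketch, assuming idx is a permutation of df.index (the [2, 0, 1] order is just an example):
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['test', 'hi', 'hello']})
idx = [2, 0, 1]  # assumed: same labels as df.index, in a different order

df = df.reindex(idx)   # returns a reordered copy; assign it back
# df = df.loc[idx]     # equivalent label-based selection
Other references to the original object keep the old order, but for a single-name workflow the reassignment behaves like an in-place reorder.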

Related

Adding a column with one single categorical value to a pandas dataframe

I have a pandas.DataFrame df and would like to add a new column col with one single value "hello". I would like this column to be of dtype category with the single category "hello". I can do the following.
df["col"] = "hello"
df["col"] = df["col"].astype("catgegory")
Do I really need to write df["col"] three times in order to achieve this?
After the first line I am worried that the intermediate dataframe df might take up a lot of space before the new column is converted to categorical. (The dataframe is rather large with millions of rows and the value "hello" is actually a much longer string.)
Are there any other straightforward, "short and snappy" ways of achieving this while avoiding the above issues?
An alternative solution is
df["col"] = pd.Categorical(itertools.repeat("hello", len(df)))
but it requires itertools and the use of len(df), and I am not sure how memory usage is under the hood.
We can explicitly build the Series of the correct size and type instead of implicitly doing so via __setitem__ then converting:
df['col'] = pd.Series('hello', index=df.index, dtype='category')
Sample Program:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3]})
df['col'] = pd.Series('hello', index=df.index, dtype='category')
print(df)
print(df.dtypes)
print(df['col'].cat.categories)
   a    col
0  1  hello
1  2  hello
2  3  hello
a         int64
col    category
dtype: object
Index(['hello'], dtype='object')
A simple way to do this would be to use df.assign to create your new column, then change its dtype to category using df.astype with a dictionary of dtypes for the specific columns.
df = df.assign(col="hello").astype({'col':'category'})
df.dtypes
A         int64
col    category
dtype: object
That way you don't have to create a Series with the same length as the dataframe; the input string is broadcast directly, which is a bit more time- and memory-efficient.
This approach also scales well, as you can see below. You can assign multiple columns as needed, some based on more complex functions, and then set their dtypes as required.
df = pd.DataFrame({'A': [1, 2, 3, 4]})
df = (df.assign(col1='hello',                       # define a column by broadcasting a scalar
                col2=lambda x: x['A'] ** 2,         # define a column based on existing columns
                col3=lambda x: x['col2'] / x['A'])  # define a column based on previously defined columns
        .astype({'col1': 'category',
                 'col2': 'float'}))
print(df)
print(df.dtypes)
   A   col1  col2  col3
0  1  hello   1.0   1.0
1  2  hello   4.0   2.0
2  3  hello   9.0   3.0
3  4  hello  16.0   4.0
A          int64
col1    category  # <- changed dtype
col2     float64  # <- changed dtype
col3     float64
dtype: object
This solution certainly addresses the first point; I am not sure about the second:
df['col'] = pd.Categorical(('hello' for i in range(len(df))))
Essentially,
we first create a generator yielding 'hello' once per record in df,
then we pass it to pd.Categorical to make it a categorical column.
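If the worry is the memory used by the intermediate values, one possible sketch (my suggestion, not from the answers above) is to build the categorical directly from a codes array, so no Python list of repeated strings is ever materialized:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(5)})

# Every row gets code 0, which maps to the single category 'hello'.
codes = np.zeros(len(df), dtype='int8')
df['col'] = pd.Categorical.from_codes(codes, categories=['hello'])
Here only the small integer codes array plus one copy of the category string are stored.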

get value from dataframe based on row values without using column names

I am trying to get a value situated on the third column from a pandas dataframe by knowing the values of interest on the first two columns, which point me to the right value to fish out. I do not know the row index, just the values I need to look for on the first two columns. The combination of values from the first two columns is unique, so I do not expect to get a subset of the dataframe, but only a row. I do not have column names and I would like to avoid using them.
Consider the dataframe df:
a 1 bla
b 2 tra
b 3 foo
b 1 bar
c 3 cra
I would like to get tra from the second row, based on the b and 2 combination that I know beforehand. I've tried subsetting with
df = df.loc['b', :]
which returns all the rows with b in that column (provided I've read the data with index_col = 0), but I am not able to pass multiple conditions to it without crashing or without knowing the index of the row of interest. I tried both df.loc and df.iloc.
In other words, ideally I would like to get tra without even using row indexes, by doing something like:
df[(df[,0] == 'b' & df[,1] == 2)][2]
Any suggestions? Probably it is something simple enough, but I have the tendency to use the same syntax as in R, which apparently is not compatible.
Thank you in advance
As @anky suggested, a way to do this without knowing the column names or the row index of your value of interest is to read the file into a pandas DataFrame using a multi-column index.
For the provided example, knowing the column indexes at least, that would be:
df = pd.read_csv(path, sep='\t', index_col=[0, 1])
then you can use:
df.iloc[df.index.get_loc(("b", 2))].iloc[0]
to get the value of interest (get_loc returns the row position for the ("b", 2) key, and the single remaining column holds the value).
Thanks again @anky for your help. If you found this answer useful, please upvote @anky's comment under the question.
I'd probably use DataFrame.query for that:
import pandas as pd
df = pd.DataFrame(index=['a', 'b', 'b', 'b', 'c'], data={"col1": [1, 2, 3, 1, 3], "col2": ['bla', 'tra', 'foo', 'bar', 'cra']})
df
   col1 col2
a     1  bla
b     2  tra
b     3  foo
b     1  bar
c     3  cra
df.query('col1 == 2 and col2 == "tra"')
   col1 col2
b     2  tra
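For completeness, the R-style pseudocode in the question translates almost directly into positional boolean indexing; here is a minimal sketch that never references column names (the data and the looked-up values are assumptions taken from the example):
import pandas as pd

# Same data as the example, with default (unnamed) integer column labels.
df = pd.DataFrame([['a', 1, 'bla'],
                   ['b', 2, 'tra'],
                   ['b', 3, 'foo'],
                   ['b', 1, 'bar'],
                   ['c', 3, 'cra']])

mask = (df.iloc[:, 0] == 'b') & (df.iloc[:, 1] == 2)  # first two columns by position
value = df.loc[mask].iloc[0, 2]                       # third column of the single match
print(value)  # 'tra'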

Pandas - slicing column values based on another column

How can I slice column values based on first & last character location indicators from two other columns?
Here is the code for a sample df:
import pandas as pd
d = {'W': ['abcde', 'abcde', 'abcde', 'abcde']}
df = pd.DataFrame(data=d)
df['First'] = [0, 0, 0, 0]
df['Last'] = [1, 2, 3, 5]
df['Slice'] = ['a', 'ab', 'abc', 'abcde']
print(df.head())
Code output / desired output: shown as screenshots in the original post (the Slice column above illustrates the desired result).
Just do it with a for loop; if you are worried about the speed, check For loops with pandas - When should I care?
df['Slice'] = [x[y:z] for x, y, z in zip(df.W, df.First, df.Last)]
df
Out[918]:
       W  First  Last  Slice
0  abcde      0     1      a
1  abcde      0     2     ab
2  abcde      0     3    abc
3  abcde      0     5  abcde
I am not sure if this will be faster, but a similar approach would be:
df['Slice'] = df.apply(lambda x: x[0][x[1]:x[2]], axis=1)
Briefly, you go through each row (axis=1) and apply a custom function. The function takes the row (stored as x), and slices the first element using the second and third elements as indices for the slicing (that's the lambda part). I will be happy to elaborate more if this isn't clear.
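A name-based variant of the same apply idea (using the column names from the sample df, which is my assumption about the real data) may read a little more clearly than positional lookups:
df['Slice'] = df.apply(lambda row: row['W'][row['First']:row['Last']], axis=1)
Both versions walk the DataFrame row by row, so for large frames the list comprehension in the previous answer will typically be faster.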

How to get the index value of column value compared with another column value

I want something like this. Given the DataFrame

Index  Sentence
0      I
1      want
2      like
3      this

I want the Index value for each keyword:

Keyword  Index
want     1
this     3
I tried df.index("Keyword"), but it does not return the index for all the rows. It would be really helpful if someone could solve this.
Use isin with boolean indexing only:
df = df[df['Sentence'].isin(['want', 'this'])]
print (df)
   Index Sentence
1      1     want
3      3     this
EDIT: If need compare by another column:
df = df[df['Sentence'].isin(df['Keyword'])]
#another DataFrame df2
#df = df[df['Sentence'].isin(df2['Keyword'])]
And if need index values:
idx = df.index[df['Sentence'].isin(df['Keyword'])]
#alternative
#idx = df[df['Sentence'].isin(df['Keyword'])].index
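If the goal is literally the Keyword/Index table shown in the question, one possible sketch (column names assumed from the example) is:
import pandas as pd

df = pd.DataFrame({'Index': [0, 1, 2, 3],
                   'Sentence': ['I', 'want', 'like', 'this']})
keywords = ['want', 'this']

# Keep only the keyword rows, then rename Sentence -> Keyword.
out = (df.loc[df['Sentence'].isin(keywords), ['Sentence', 'Index']]
         .rename(columns={'Sentence': 'Keyword'})
         .reset_index(drop=True))
print(out)
#   Keyword  Index
# 0    want      1
# 1    this      3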

Grouping by many columns in Pandas

I basically have a dataset that looks as follows
Col1  Col2  Col3  Count
A     B     1     50
A     B     1     50
A     C     20    1
A     D     17    2
A     E     5     70
A     E     15    20
Suppose it is called data. I basically do data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum(), which should give me this:
Col1  Col2  Col3  Count
A     B     1     100
A     C     20    1
A     D     17    2
A     E     5     70
A     E     15    20
However, this returns an empty result: it has the columns I want but no rows. The only caveat is that the by parameter is computed dynamically rather than being fixed (that's because the columns might change, although Count will always be there).
Any ideas on why this could be failing, and how to fix it?
EDIT: Further searching revealed that pandas' groupby drops rows that have NULL in any of the grouping columns. This is a problem for me because every single column might contain NULLs. Hence, the actual question is: is there any reasonable way to deal with NULLs and still use groupby?
I would love to be corrected here, but I'm not sure there is a clean way to handle missing data. As you noted, pandas simply excludes rows from groupby whose group keys contain NaN values.
You could fill the NaN values with something beyond the range of your data:
data = pd.read_csv("c:/Users/simon/Desktop/data.csv")
data.fillna(-999, inplace=True)
new = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False, sort=False).sum()
This is messy because those filled values will not end up in the "correct" group for the summation, but there is no real way to group by something that is missing.
Another method might be to fill each column separately with some missing value that is appropriate for that variable.
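Note that newer pandas versions (1.1 and later) address this directly: groupby accepts dropna=False, so NaN keys form their own groups instead of being discarded. A minimal sketch, assuming such a pandas version and some NULLs in the grouping columns:
import pandas as pd

data = pd.DataFrame({'Col1': ['A', 'A', 'A', None],
                     'Col2': ['B', 'B', None, None],
                     'Col3': [1, 1, 20, 20],
                     'Count': [50, 50, 1, 2]})

# dropna=False keeps rows whose group keys are NaN/None (pandas >= 1.1).
out = data.groupby(by=['Col1', 'Col2', 'Col3'], as_index=False,
                   sort=False, dropna=False).sum()
print(out)
This avoids the sentinel-fill workaround entirely, at the cost of requiring a sufficiently recent pandas.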
