Deleting multiple series from a dataframe in one command - python

In short ... I have a Python Pandas data frame that is read in from an Excel file using 'read_table'. I would like to keep a handful of the series from the data, and purge the rest. I know that I can just delete what I don't want one-by-one using 'del data['SeriesName']', but what I'd rather do is specify what to keep instead of specifying what to delete.
If the simplest answer is to copy the existing data frame into a new data frame that only contains the series I want, and then delete the existing frame in its entirety, I would be satisfied with that solution ... but if that is indeed the best way, can someone walk me through it?
TIA ... I'm a newb to Pandas. :)

You can use the DataFrame drop function to remove columns. You have to pass the axis=1 option for it to work on columns and not rows. Note that it returns a copy, so you have to assign the result back:
In [1]: from pandas import *
In [2]: df = DataFrame(dict(x=[0,0,1,0,1], y=[1,0,1,1,0], z=[0,0,1,0,1]))
In [3]: df
Out[3]:
   x  y  z
0  0  1  0
1  0  0  0
2  1  1  1
3  0  1  0
4  1  0  1
In [4]: df = df.drop(['x','y'], axis=1)
In [5]: df
Out[5]:
   z
0  0
1  0
2  1
3  0
4  1
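In newer pandas (0.21+), drop also accepts a columns keyword, so you can skip axis=1; this is equivalent:
df = df.drop(columns=['x', 'y'])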

Basically the same as Zelazny7's answer -- just specifying what to keep:
In [68]: df
Out[68]:
   x  y  z
0  0  1  0
1  0  0  0
2  1  1  1
3  0  1  0
4  1  0  1
In [70]: df = df[['x','z']]
In [71]: df
Out[71]:
   x  z
0  0  0
1  0  0
2  1  1
3  0  0
4  1  1
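If you prefer a method call over bracket indexing, DataFrame.filter does the same keep-only selection; a minimal sketch:
df = df.filter(items=['x', 'z'])
Note that filter quietly ignores any listed labels that don't exist in the frame, which can be convenient or surprising depending on your data.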
*Edit*
You can select a large number of columns by indexing/slicing into the DataFrame.columns object.
This object, of type pandas.Index, behaves like an ordered sequence of column labels (with some extended functionality).
See this extension of the above examples:
In [4]: df.columns
Out[4]: Index([x, y, z], dtype=object)
In [5]: df[df.columns[1:]]
Out[5]:
   y  z
0  1  0
1  0  0
2  1  1
3  1  0
4  0  1
In [7]: df.drop(df.columns[1:], axis=1)
Out[7]:
   x
0  0
1  0
2  1
3  0
4  1

You can also specify a list of columns to keep with the usecols option in pandas.read_table. This speeds up the loading process as well.
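A minimal sketch, assuming a tab-delimited text file whose header includes columns named 'x' and 'z' (both the file name and the column names here are placeholders):
import pandas as pd

data = pd.read_table('data.txt', usecols=['x', 'z'])
Columns left out of usecols are never built, so parsing is faster and memory use is lower.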

Related

How to use lambda function on a pandas data frame via map/apply where lambda takes different values for each column

The idea is to transform a data frame in the fastest way according to the values specific to each column.
For simplicity, here is an example where each element of a column is compared to the mean of the column it belongs to and replaced with 1 if greater than mean(column) and 0 otherwise.
In [26]: df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
In [27]: df
Out[27]:
   0  1  2
0  1  2  3
1  4  5  6
In [28]: df.mean().values.tolist()
Out[28]: [2.5, 3.5, 4.5]
The snippet below is not real code, just pseudocode to illustrate the desired behavior. I used the apply method, but the solution can be whatever works fastest.
In [29]: f = lambda x: 0 if x < means else 1
In [30]: df.apply(f)
In [31]: df
Out[31]:
   0  1  2
0  0  0  0
1  1  1  1
This is a toy example but the solution has to be applied to a big data frame, therefore, it has to be fast.
Cheers!
You can create a boolean mask of the dataframe by comparing each element with the mean of that column. It can be easily achieved using
df > df.mean()
       0      1      2
0  False  False  False
1   True   True   True
Since True equates to 1 and False to 0, a boolean dataframe can be easily converted to integer using astype.
(df > df.mean()).astype(int)
   0  1  2
0  0  0  0
1  1  1  1
If you need the output to be strings rather than 0 and 1, use np.where, which works as np.where(condition, value_if_true, value_if_false):
pd.DataFrame(np.where(df > df.mean(), 'm', 'n'))
   0  1  2
0  n  n  n
1  m  m  m
Edit: addressing the question in the comments: what if 'm' and 'n' are column-dependent?
df = pd.DataFrame(np.arange(12).reshape(4,3))
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
pd.DataFrame(np.where(df > df.mean(), df.min(), df.max()))
   0   1   2
0  9  10  11
1  9  10  11
2  0   1   2
3  0   1   2

Drop all columns where all values are zero

I have a simple question which relates to similar questions here, and here.
I am trying to drop all columns from a pandas dataframe, which have only zeroes (vertically, axis=1). Let me give you an example:
df = pd.DataFrame({'a':[0,0,0,0], 'b':[0,-1,0,1]})
   a  b
0  0  0
1  0 -1
2  0  0
3  0  1
I'd like to drop column a since it has only zeroes.
However, I'd like to do it in a nice and vectorized fashion if possible. My data set is huge - so I don't want to loop. Hence I tried
df = df.loc[(df).any(1), (df!=0).any(0)]
   b
1 -1
3  1
Which allows me to drop both columns and rows. But if I just try to drop the columns, loc seems to fail. Any ideas?
You are really close; use any, since 0 is cast to False:
df = df.loc[:, df.any()]
print (df)
   b
0  0
1 -1
2  0
3  1
If it's a matter of zeros rather than sums, use df.any:
In [291]: df.T[df.any()].T
Out[291]:
   b
0  0
1 -1
2  0
3  1
Alternatively:
In [296]: df.T[(df != 0).any()].T # or df.loc[:, (df != 0).any()]
Out[296]:
   b
0  0
1 -1
2  0
3  1
In [73]: df.loc[:, df.ne(0).any()]
Out[73]:
   b
0  0
1 -1
2  0
3  1
or:
In [71]: df.loc[:, ~df.eq(0).all()]
Out[71]:
   b
0  0
1 -1
2  0
3  1
A sum-based check is a different criterion: with the question's data, column b sums to 0 (0 - 1 + 0 + 1), so df.sum().astype(bool) drops b as well and leaves an empty frame:
In [78]: df.loc[:, df.sum().astype(bool)]
Out[78]:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
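If you do want a sum-style check that is robust to negative values cancelling out, summing absolute values works; a minimal sketch:
df.loc[:, df.abs().sum() != 0]
For this df it keeps b (absolute sum 2) and drops a (absolute sum 0).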

How to save a pandas dataframe such that there is no delimiter?

I have the following pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.choice([0,1], (6,3)), columns=list('XYZ'))
   X  Y  Z
0  1  0  1
1  1  1  0
2  0  0  0
3  0  1  1
4  0  1  1
5  1  1  1
Let's say I take the transpose and wish to save it
df = df.T
   0  1  2  3  4  5
X  1  1  0  0  0  1
Y  0  1  0  1  1  1
Z  1  0  0  1  1  1
So, there are three rows. I would like to save it in this format:
X 110001
Y 010111
Z 100111
I tried
df.to_csv("filename.txt", header=None, index=None, sep='')
However, this outputs an error:
TypeError: "delimiter" must be a 1-character string
Is it possible to save the dataframe in this manner, or is there some way to combine all columns into one? What is the most "pandas" solution?
Leave original df alone. Don't transpose.
df = pd.DataFrame(np.random.choice([0,1], (6,3)), columns=list('XYZ'))
df.astype(str).apply(''.join)
X    101100
Y    101001
Z    111110
dtype: object
If you do want to transpose, then you can do something like this:
In [126]: df.T.apply(lambda row: ''.join(map(str, row)), axis=1)
Out[126]:
X    001111
Y    000000
Z    010010
dtype: object
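To actually write one of these Series in the desired X 110001 format, write it with a single space as the separator; a sketch (the file name is a placeholder):
s = df.astype(str).apply(''.join)
s.to_csv('filename.txt', sep=' ', header=False)
Here the single space sits between the row label and the joined digits, so no per-digit delimiter is needed.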

Add columns to pandas dataframe containing max of each row, AND corresponding column name

My system
Windows 7, 64 bit
python 3.5.1
The challenge
I've got a pandas dataframe, and I would like to append two new columns to it: one containing the maximum value of each row, and one containing the name of the column where that maximum value is located.
A similar question has been asked and answered for R in this post.
Reproducible example
In[1]:
# Make pandas dataframe
df = pd.DataFrame({'a':[1,0,0,1,3], 'b':[0,0,1,0,1], 'c':[0,0,0,0,0]})
# Calculate max
my_series = df.max(numeric_only=True, axis=1)
my_series.name = "maxval"
# Include maxval in df
df = df.join(my_series)
df
Out[1]:
   a  b  c  maxval
0  1  0  0       1
1  0  0  0       0
2  0  1  0       1
3  1  0  0       1
4  3  1  0       3
So far so good. Now for the part where I add another column containing the name of the column with the max value:
In[2]:
?
?
?
# This is what I'd like to accomplish:
Out[2]:
   a  b  c  maxval maxcol
0  1  0  0       1      a
1  0  0  0       0  a,b,c
2  0  1  0       1      b
3  1  0  0       1      a
4  3  1  0       3      a
Notice that I'd like to return all column names if multiple columns contain the same maximum value. Also please notice that the column maxval is not included in maxcol since that would not make much sense. Thanks in advance if anyone out there finds this interesting.
You can compare the df against maxval using eq with axis=0 to build a boolean mask, then use apply with a lambda to select the matching column names and join them:
In [183]:
df['maxcol'] = df.loc[:, :'c'].eq(df['maxval'], axis=0).apply(lambda x: ','.join(df.columns[:3][x == x.max()]), axis=1)
df
Out[183]:
   a  b  c  maxval maxcol
0  1  0  0       1      a
1  0  0  0       0  a,b,c
2  0  1  0       1      b
3  1  0  0       1      a
4  3  1  0       3      a
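An alternative that avoids apply entirely is the dot trick: matrix-multiplying the boolean mask by the column names concatenates the names of every True cell in a row. A sketch under the same assumptions (value columns a through c):
value_cols = df[['a', 'b', 'c']]
df['maxcol'] = value_cols.eq(df['maxval'], axis=0).dot(value_cols.columns + ',').str.rstrip(',')
Since True * 'a,' is 'a,' and False * 'a,' is '', the dot product per row is the comma-joined names of the tied maxima, and rstrip removes the trailing comma.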

How to apply math operations to a row of a csv in python?

I've been successful creating functions in python and reading/writing files. However, I really need to apply certain functions to whole rows of data (not columns) and can't find out anything about how to do this. The goals are:
Read a csv or txt file into python (can-do)
Find a row of data and apply certain conditions and operations
Do the same with a second row of data
Then compare results from the rows to each other (done with a similarity function)
Print the resulting data into a separate file (easy peasy)
The real function includes "if/then" conditions for ratios, sums, and square roots -- I won't include the whole function here. For the example, just use sum.
Here's what I have so far (not much...):
import numpy as np

# file_to_read.csv has no header row, so don't pass names=True
data = np.genfromtxt('file_to_read.csv',
                     dtype=float,
                     delimiter=",")
np.sum(data)  # placeholder: the row-wise conditions and operations go here
print(data)
np.savetxt('test.csv', data, delimiter=',')
file_to_read.csv is this:
0,2,1
0,2,2
0,2,3
0,1,0
0,2,0
0,3,0
1,0,0
2,0,0
3,0,0
You can transpose your matrix or data frame (if using pandas) and work with columns.
Example (pandas):
Original DF
In [162]: df
Out[162]:
   a  b  c
0  0  2  1
1  0  2  2
2  0  2  3
3  0  1  0
4  0  2  0
5  0  3  0
6  1  0  0
7  2  0  0
8  3  0  0
Transposed DF
In [163]: df.T
Out[163]:
   0  1  2  3  4  5  6  7  8
a  0  0  0  0  0  0  1  2  3
b  2  2  2  1  2  3  0  0  0
c  1  2  3  0  0  0  0  0  0
Select rows where b>0 and c>1:
In [166]: df[(df.b>0) & (df.c>1)]
Out[166]:
   a  b  c
1  0  2  2
2  0  2  3
Now calculate the sum of the cells in each matching row:
In [167]: df[(df.b>0) & (df.c>1)].sum(axis=1)
Out[167]:
1    4
2    5
dtype: int64
or product:
In [169]: df[(df.b>0) & (df.c>1)].product(axis=1)
Out[169]:
1    0
2    0
dtype: int64
PS: axis=1 instructs Pandas/NumPy to operate along rows instead of columns.
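Putting the pieces together on the original file, a minimal end-to-end sketch (it assumes file_to_read.csv has no header row, as shown above):
import pandas as pd

# Read the headerless CSV and name the columns for readability
df = pd.read_csv('file_to_read.csv', header=None, names=list('abc'))

# Filter rows by a condition, then reduce each matching row with sum
result = df[(df.b > 0) & (df.c > 1)].sum(axis=1)
result.to_csv('test.csv', header=False)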
