Why does not work pandas df.loc + lambda? - python

I have created pandas frame from csv file.
And I want to select rows use lambda.
But it does not work.
I use this pandas manual.
exception:
what is problem?
thanks.

As #BrenBam has said in the comment this syntax was added in 0.18.1 and it won't work in previous versions.
Selection By Callable:
.loc, .iloc, .ix and also [] indexing can accept a callable as
indexer. The callable must be a function with one argument (the
calling Series, DataFrame or Panel) and that returns valid output for
indexing.
Example (version 0.18.1):
In [10]: df
Out[10]:
a b c
0 1 4 2
1 2 2 4
2 3 4 0
3 0 2 3
4 3 0 4
In [11]: df.loc[lambda df: df.a == 3]
Out[11]:
a b c
2 3 4 0
4 3 0 4
For versions <= 0.18.0 you can't use Selection by callable:
do it this way instead:
df.loc[df['Date'] == '2003-01-01 00:00:00', ['Date']]

Related

How to drop columns with NA using cudf?

Pandas:
data = data.dropna(axis = 'columns')
I am trying to do something similar using a cudf dataframe but the apis don't offer this functionality.
My solution is to convert to a pandas df, do the above command, then re-convert to a cudf. Is there a better solution?
cuDF now supports column based dropna, so the following will work:
import cudf
​
df = cudf.DataFrame({'a':[0,1,None], 'b':[None,0,2], 'c':[1,2,3]})
print(df)
a b c
0 0 null 1
1 1 0 2
2 null 2 3
df.dropna(axis='columns')
c
0 1
1 2
2 3
Until dropna is implemented, you can check the null_count of each column and drop the ones with null_count>0.

Select columns using pandas dataframe.query()

The documentation on dataframe.query() is very terse http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html . I was also unable to find examples of projections by web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So .. how to do this?
After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
UPDATE:
javadba points out that the return value of eval is not a dataframe. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return at DataFrame, it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc/loc, e.g.
df.loc[0:4, ['A', 'C']]
output
A C
0 -0.497163 -0.046484
1 1.331614 0.741711
2 1.046903 -2.511548
3 0.314644 -0.526187
4 -0.061883 -0.615978
Dataframe.query is more like the where clause in a SQL statement than the select part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query method you do something like
df.query('A > B') which would return all the rows where the value in column A is greater than the value in column b.
A B C D
2000-01-03 1.265936 -0.866740 -0.678886 -0.094709
2000-01-04 1.491390 -0.638902 -0.443982 -0.434351
2000-01-05 2.205930 2.186786 1.004054 0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
Which is more readable in my opinion that boolean index selection with
df[df['A'] > df['B']]
How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
Would filter rows where col1 equals 1 and col2 equals "X" and return only columns 1 and 3.
but you would need to filter for rows otherwise it doesn't work.
for filtering columns only better use .loc or .iloc
pandasql
https://pypi.python.org/pypi/pandasql/0.1.0
Here is an example from the following blog http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html . The inputs are two DataFrames meat and births : and this approach gives the projections, filtering, aggregation and sorting expected from sql.
#maxpower did mention this package is buggy: so let's see.. At least the code from the blog and shown below works fine.
pysqldf = lambda q: sqldf(q, globals())
q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""
meat = load_meat()
births = load_births()
df = pysqldf(q)
The output is a pandas DataFrame as desired.
It is working great for my particular use case (evaluating us crimes)
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)
odf
: SMURDER SRAPE SROBBERY SAGASSLT SOTHASLT SVANDLSM SWEAPONS
0 0 0 0 1 1 10 54
1 0 0 0 0 1 0 52
2 0 0 0 0 1 0 46
3 0 0 0 0 1 0 43
4 0 0 0 0 1 0 33
5 1 0 2 16 28 4 32
6 0 0 0 7 17 4 30
7 0 0 0 0 1 0 29
8 0 0 0 7 16 3 29
9 0 0 0 1 0 5 28
Update I have done a bunch of stuff with pandasql now: calculated fields, limits, aliases, cascaded dataframes.. it is just so productive.
Another update (3 yrs later) This works but warning it is very slow (seconds vs milliseconds) –
Just a simpler example solution (using get):
My goal:
I want the lat and lon columns out of the result of the query.
My table details:
df_city.columns
Index(['name', 'city_id', 'lat', 'lon', 'CountryName',
'ContinentName'], dtype='object')
# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')
# Only lat and lon
city_continent[['lat', 'lon']]
lat lon
113883 -19.12753 -169.84623
113884 -19.11667 -169.90000
113885 -19.10000 -169.91667
113886 -46.33333 168.85000
113887 -46.36667 168.55000
... ... ...
347956 -23.14083 113.77630
347957 -31.48023 131.84242
347958 -28.29967 153.30142
347959 -35.60358 138.10548
347960 -35.02852 117.83416
3712 rows × 2 columns

Replacing categorical strings with ints in pandas

I have a series containing data like
0 a
1 ab
2 b
3 a
And I want to replace any row containing 'b' to 1, and all others to 0. I've tried
one = labels.str.contains('b')
zero = ~labels.str.contains('b')
labels.ix[one] = 1
labels.ix[zero] = 0
And this does the trick but it gives this pesky warning
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
And I know I've seen this before in the last few times I've used pandas. Could you please give the recommended approach? My method gives the desired result, but what should I do? Also, I think Python is supposed to be an 'if it makes logical sense and you type it it will run' kind of language, but my solution seems perfectly logical in the human-readable sense and it seems very non-pythonic that it throws an error.
Try this:
ds = pd.Series(['a','ab','b','a'])
ds
0 a
1 ab
2 b
3 a
dtype: object
ds.apply(lambda x: 1 if 'b' in x else 0)
0 0
1 1
2 1
3 0
dtype: int64
You can use numpy.where. Output is numpy.ndarray, so you have to use Series constructor:
import pandas as pd
import numpy as np
ser = pd.Series(['a','ab','b','a'])
print ser
0 a
1 ab
2 b
3 a
dtype: object
print np.where(ser.str.contains('b'),1,0)
[0 1 1 0]
print type(np.where(ser.str.contains('b'),1,0))
<type 'numpy.ndarray'>
print pd.Series(np.where(ser.str.contains('b'),1,0), index=ser.index)
0 0
1 1
2 1
3 0
dtype: int32

Python pandas correlation corr() TypeError: Could not compare ['pearson'] with block values

one = pd.DataFrame(data=[1,2,3,4,5], index=[1,2,3,4,5])
two = pd.DataFrame(data=[5,4,3,2,1], index=[1,2,3,4,5])
one.corr(two)
I think it should return a float = -1.00 but instead it's generating the following error:
TypeError: Could not compare ['pearson'] with block values
Thanks in advance for your help.
pandas.DataFrame.corr computes pairwise correlation between the columns of a single data frame. What you need here is pandas.DataFrame.corrwith:
>>> one.corrwith(two)
0 -1
dtype: float64
You are operating on a DataFrame when you should be operating on a Series.
In [1]: import pandas as pd
In [2]: one = pd.DataFrame(data=[1,2,3,4,5], index=[1,2,3,4,5])
In [3]: two = pd.DataFrame(data=[5,4,3,2,1], index=[1,2,3,4,5])
In [4]: one
Out[4]:
0
1 1
2 2
3 3
4 4
5 5
In [5]: two
Out[5]:
0
1 5
2 4
3 3
4 2
5 1
In [6]: one[0].corr(two[0])
Out[6]: -1.0
Why subscript with [0]? Because that is the name of the column in the DataFrame, since you didn't give it one. When you reference a column in a DataFrame, it will return a Series, which is 1-dimensional. The documentation for this function is here.

SettingWithCopyWarning in pandas: how to set the first value in a column?

When running my code I get the following message:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
df['detect'][df.index[0]] = df['event'][df.index[0]]
What is the correct way of setting the first value of a column equal to the first value of another column?
Use ix to perform the index label selection:
In [102]:
df = pd.DataFrame({'detect':np.random.randn(5), 'event':np.arange(5)})
df
Out[102]:
detect event
0 -0.815105 0
1 -0.656923 1
2 -1.417722 2
3 0.210070 3
4 0.211728 4
In [103]:
df.ix[0,'detect'] = df.ix[0,'event']
df
Out[103]:
detect event
0 0.000000 0
1 -0.656923 1
2 -1.417722 2
3 0.210070 3
4 0.211728 4
What you are doing is called chained indexing and may or may not work hence the warning

Categories

Resources