Select columns using pandas dataframe.query() - python

The documentation on DataFrame.query() is very terse (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html), and I was also unable to find examples of projections by web search.
So I tried simply providing the column names: that gave a syntax error. Likewise for typing select and then the column names. So... how do I do this?

After playing around with this for a while and reading through the source code for DataFrame.query, I can't figure out a way to do it.
If it's not impossible, apparently it's at least strongly discouraged. When this question came up on github, prolific Pandas dev/maintainer jreback suggested using df.eval() for selecting columns and df.query() for filtering on rows.
UPDATE:
javadba points out that the return value of eval is not a DataFrame. For example, to flesh out jreback's example a bit more...
df.eval('A')
returns a Pandas Series, but
df.eval(['A', 'B'])
does not return a DataFrame; it returns a list (of Pandas Series).
So it seems ultimately the best way to maintain flexibility to filter on rows and columns is to use iloc/loc, e.g.
df.loc[0:4, ['A', 'C']]
Output:
          A         C
0 -0.497163 -0.046484
1  1.331614  0.741711
2  1.046903 -2.511548
3  0.314644 -0.526187
4 -0.061883 -0.615978

DataFrame.query is more like the WHERE clause of a SQL statement than the SELECT part.
import pandas as pd
import numpy as np
np.random.seed(123)
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])
To select a column or columns you can use the following:
df['A'] or df.loc[:,'A']
or
df[['A','B']] or df.loc[:,['A','B']]
To use the .query method you do something like
df.query('A > B'), which would return all the rows where the value in column A is greater than the value in column B.
                   A         B         C         D
2000-01-03  1.265936 -0.866740 -0.678886 -0.094709
2000-01-04  1.491390 -0.638902 -0.443982 -0.434351
2000-01-05  2.205930  2.186786  1.004054  0.386186
2000-01-08 -0.140069 -0.861755 -0.255619 -2.798589
This is more readable, in my opinion, than boolean index selection with
df[df['A'] > df['B']]

How about
df_new = df.query('col1==1 & col2=="x" ')[['col1', 'col3']]
This would filter rows where col1 equals 1 and col2 equals "x", and return only columns col1 and col3.
Note that you do need to filter rows for this to work; for selecting columns only, it's better to use .loc or .iloc.
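A minimal, self-contained sketch of that pattern (the frame and values here are made up for illustration):

import pandas as pd

# toy frame with the column names used above (values are illustrative)
df = pd.DataFrame({
    'col1': [1, 1, 2],
    'col2': ['x', 'y', 'x'],
    'col3': [10, 20, 30],
})

# query() filters the rows; the trailing [['col1', 'col3']] picks the columns
df_new = df.query('col1 == 1 & col2 == "x"')[['col1', 'col3']]
print(df_new)
#    col1  col3
# 0     1    10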

pandasql
https://pypi.python.org/pypi/pandasql/0.1.0
Here is an example from the following blog post: http://blog.yhat.com/posts/pandasql-sql-for-pandas-dataframes.html. The inputs are two DataFrames, meat and births, and this approach gives the projections, filtering, aggregation and sorting expected from SQL.
@maxpower did mention this package is buggy, so let's see... at least the code from the blog, shown below, works fine.
from pandasql import sqldf, load_meat, load_births

pysqldf = lambda q: sqldf(q, globals())
q = """
SELECT
m.date
, m.beef
, b.births
FROM
meat m
LEFT JOIN
births b
ON m.date = b.date
WHERE
m.date > '1974-12-31';
"""
meat = load_meat()
births = load_births()
df = pysqldf(q)
The output is a pandas DataFrame as desired.
It is working great for my particular use case (evaluating US crimes).
odf = pysqldf("select %s from df where sweapons > 10 order by sweapons desc limit 10" %scols)
p('odf\n', odf)
odf:
   SMURDER  SRAPE  SROBBERY  SAGASSLT  SOTHASLT  SVANDLSM  SWEAPONS
0        0      0         0         1         1        10        54
1        0      0         0         0         1         0        52
2        0      0         0         0         1         0        46
3        0      0         0         0         1         0        43
4        0      0         0         0         1         0        33
5        1      0         2        16        28         4        32
6        0      0         0         7        17         4        30
7        0      0         0         0         1         0        29
8        0      0         0         7        16         3        29
9        0      0         0         1         0         5        28
Update I have done a bunch of stuff with pandasql now: calculated fields, limits, aliases, cascaded dataframes.. it is just so productive.
Another update (3 years later): this works, but be warned that it is very slow (seconds vs. milliseconds).
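For reference, a minimal self-contained sketch of the same approach on a toy frame (the frame and column names are made up; only sqldf from pandasql is assumed):

import pandas as pd
from pandasql import sqldf

df = pd.DataFrame({'col1': [1, 1, 2], 'col2': ['x', 'y', 'x'], 'col3': [10, 20, 30]})

# plain SQL: projection in SELECT, row filter in WHERE
out = sqldf("SELECT col1, col3 FROM df WHERE col2 = 'x'", locals())
print(out)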

Just a simpler example solution (using get):
My goal:
I want the lat and lon columns out of the result of the query.
My table details:
df_city.columns
Index(['name', 'city_id', 'lat', 'lon', 'CountryName',
'ContinentName'], dtype='object')
# All columns
city_continent = df_city.get(df_city['ContinentName']=='Oceania')
# Only lat and lon
city_continent[['lat', 'lon']]
lat lon
113883 -19.12753 -169.84623
113884 -19.11667 -169.90000
113885 -19.10000 -169.91667
113886 -46.33333 168.85000
113887 -46.36667 168.55000
... ... ...
347956 -23.14083 113.77630
347957 -31.48023 131.84242
347958 -28.29967 153.30142
347959 -35.60358 138.10548
347960 -35.02852 117.83416
3712 rows × 2 columns
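The same result can be had in one step with .loc, using the boolean mask for the rows and an explicit column list for the projection (a small sketch reusing the columns above):

# rows via boolean mask, columns via name list
oceania_coords = df_city.loc[df_city['ContinentName'] == 'Oceania', ['lat', 'lon']]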

Related

Extract multiple sub-fields from Pandas dataframe column into a new dataframe

I have a Pandas dataframe (approx 100k rows) as my input. It is an export from a database, and each of the fields in one of the columns contains one or more records which I need to expand into independent records. For example:
record_id  text_field
0          r0_sub_record1_field1#r0_sub_record1_field2#r0_sub_record2_field1#r0_sub_record2_field2#
1          sub_record1_field1#sub_record1_field2#
2          sub_record1_field1#sub_record1_field2#sub_record2_field1#sub_record2_field2#sub_record3_field1#sub_record3_field2#
The desired result should look like this:
record_id  field1                 field2                 original_record_id
0          r0_sub_record1_field1  r0_sub_record1_field2  0
1          r0_sub_record2_field1  r0_sub_record2_field2  0
2          r1_sub_record1_field1  r1_sub_record1_field2  1
3          r2_sub_record1_field1  r2_sub_record1_field2  2
4          r2_sub_record2_field1  r2_sub_record2_field2  2
5          r2_sub_record3_field1  r2_sub_record3_field2  2
It is quite straightforward to extract the data I need using a loop, but I suspect that it is not the most efficient and also not the nicest way.
As I understand it, I cannot use apply or map here, because I am building another dataframe with the extracted data.
Is there a good Python-esque and Panda-style way to solve the problem?
I am using Python 3.7 and Pandas 1.2.1.
I think you need to explode based on # then split the # text.
df1 = (df.assign(t=df['text_field'].str.split('#'))
         .drop('text_field', axis=1)
         .explode('t')
         .reset_index(drop=True))
df2 = df1.join(df1['t'].str.split('#', expand=True)).drop('t', axis=1)
print(df2.dropna())
print(df2.dropna())
record_id 0 1
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
3 1 sub_record1_field1 sub_record1_field2
5 2 sub_record1_field1 sub_record1_field2
6 2 sub_record2_field1 sub_record2_field2
7 2 sub_record3_field1 sub_record3_field2
Is this what you expect?
out = df['text_field'].str.strip('#').str.split('#').explode() \
        .str.split('#').apply(pd.Series)
prefix = 'r' + out.index.map(str) + '_'
out = out.apply(lambda v: prefix + v).reset_index() \
         .rename(columns={0: 'field1', 1: 'field2', 'index': 'original_record_id'})
>>> out
original_record_id field1 field2
0 0 r0_sub_record1_field1 r0_sub_record1_field2
1 0 r0_sub_record2_field1 r0_sub_record2_field2
2 1 r1_sub_record1_field1 r1_sub_record1_field2
3 2 r2_sub_record1_field1 r2_sub_record1_field2
4 2 r2_sub_record2_field1 r2_sub_record2_field2
5 2 r2_sub_record3_field1 r2_sub_record3_field2
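If the sub-records really are just consecutive field1/field2 pairs joined by '#', another way (a sketch on a made-up toy frame, not the asker's real data) is to split once and pair up the pieces:

import pandas as pd

# toy frame mimicking the question's layout (values are illustrative)
df = pd.DataFrame({
    'record_id': [0, 1],
    'text_field': ['a_f1#a_f2#b_f1#b_f2#', 'c_f1#c_f2#'],
})

# split into individual fields, drop the trailing empty piece,
# then pair consecutive fields into sub-records
fields = df['text_field'].str.strip('#').str.split('#')
rows = [
    {'original_record_id': rid, 'field1': vals[i], 'field2': vals[i + 1]}
    for rid, vals in zip(df['record_id'], fields)
    for i in range(0, len(vals), 2)
]
out = pd.DataFrame(rows).reset_index().rename(columns={'index': 'record_id'})
print(out)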

Iterating Conditions through Pandas .loc

I just wanted to ask the community and see if there is a more efficient way to do this.
I have several rows in a data frame and I am using .loc to filter values in column A so I can perform calculations on column B.
I can easily do something like...
filter_1 = df.loc[df['Condition'] == 1]
And then perform the mathematical calculation on column B that I need.
But there are many conditions I must go through, so I was wondering if I could make a list of the conditions and then iterate them through the .loc function in fewer lines of code?
Would something like this work where I create a list, then iterate the conditions through a loop?
Thank you!
This example gets most of what I want. I just need it to show 6.4 and 7.0 in this example. How can I manipulate the iteration so it shows the results for the unique values in column 'a'?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
col = ['a', 'b']
list_1 = []
for i, j in zip(a, b):
    list_1.append([i, j])
df1 = pd.DataFrame(list_1, columns=col)
for i in a:
    aa = df1[df1['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using set
set_a = set(a)
for i in set_a:
    aa = df[df['a'].isin([i])]
    aa1 = aa['b'].mean()
    print(aa1)
Solution using pandas mean function
Is this what you are looking for?
import pandas as pd
a = [1,2,1,2,1,2,1,2,1,2]
b = [5,1,3,5,7,20,9,5,8,4]
df = pd.DataFrame({'a':a,'b':b})
print (df)
print(df.groupby('a').mean())
The results from this are:
Original Dataframe df:
a b
0 1 5
1 2 1
2 1 3
3 2 5
4 1 7
5 2 20
6 1 9
7 2 5
8 1 8
9 2 4
The mean of column 'b' grouped by 'a' is:
b
a
1 6.4
2 7.0
Here you go:
df = df[(df['A'] > 1) & (df['A'] < 10)]
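If the goal is literally to iterate a list of conditions through .loc, here is a minimal sketch (the conditions themselves are made up and mirror the data above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
                   'b': [5, 1, 3, 5, 7, 20, 9, 5, 8, 4]})

# each entry is a boolean mask over the frame
conditions = [df['a'] == 1, df['a'] == 2]

for cond in conditions:
    # .loc filters the rows for this condition; the calculation runs on 'b'
    print(df.loc[cond, 'b'].mean())   # 6.4, then 7.0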

Efficiently flattening a large MultiIndex in pandas

I have a very large DataFrame that looks like this:
                    A  B
SPH2008 3/21/2008   1  2
        3/21/2008   1  2
        3/21/2008   1  2
SPM2008 6/21/2008   1  2
        6/21/2008   1  2
        6/21/2008   1  2
And I have the following code, which is intended to flatten the index and collect the unique pairs of the two index levels into a new DataFrame:
indeces = [df.index.get_level_values(0), df.index.get_level_values(1)]
tmp = pd.DataFrame(data=indeces).T.drop_duplicates()
tmp.columns = ['ID', 'ExpirationDate']
tmp.sort_values('ExpirationDate', inplace=True)
However, this operation takes a remarkably long amount of time. Is there a more efficient way to do this?
pandas.DataFrame.index.drop_duplicates
pd.DataFrame([*df.index.drop_duplicates()], columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
With older versions of Python that can't unpack in that way
pd.DataFrame(df.index.drop_duplicates().tolist(), columns=['ID', 'ExpirationDate'])
IIUC, You can also groupby the levels of your multiindex, then create a dataframe from that with your desired columns:
>>> pd.DataFrame(df.groupby(level=[0,1]).groups.keys(), columns=['ID', 'ExpirationDate'])
ID ExpirationDate
0 SPH2008 3/21/2008
1 SPM2008 6/21/2008
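Another option, assuming a reasonably recent pandas, is to drop the duplicates on the index itself and convert it straight to a DataFrame with to_frame:

# drop duplicate (ID, ExpirationDate) pairs directly on the MultiIndex,
# then materialise it as a two-column frame
flat = df.index.drop_duplicates().to_frame(index=False)
flat.columns = ['ID', 'ExpirationDate']
flat = flat.sort_values('ExpirationDate')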

How to delete all columns in DataFrame except certain ones?

Let's say I have a DataFrame that looks like this:
a b c d e f g
1 2 3 4 5 6 7
4 3 7 1 6 9 4
8 9 0 2 4 2 1
How would I go about deleting every column besides a and b?
This would result in:
a b
1 2
4 3
8 9
I would like a way to delete these using a simple line of code that says, delete all columns besides a and b, because let's say hypothetically I have 1000 columns of data.
Thank you.
In [48]: df.drop(df.columns.difference(['a','b']), axis=1, inplace=True)

In [49]: df
Out[49]:
   a  b
0  1  2
1  4  3
2  8  9
or:
In [55]: df = df.loc[:, df.columns.intersection(['a','b'])]
In [56]: df
Out[56]:
a b
0 1 2
1 4 3
2 8 9
PS: please be aware that the most idiomatic Pandas way to do that was already proposed by @Wen:
df = df[['a','b']]
or
df = df.loc[:, ['a','b']]
Another option to add to the mix. I prefer this approach for readability.
df = df.filter(['a', 'b'])
Here the first positional argument corresponds to items=[...].
Bonus
You can also use a like argument or regex to filter.
Helpful if you have a set of columns like ['a_1','a_2','b_1','b_2']
You can do
df = df.filter(like='b_')
and end up with ['b_1','b_2']
Pandas documentation for filter.
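For example, a regex variant of the same idea (the pattern here is just illustrative):

# keep every column whose name starts with 'b_'
df_b = df.filter(regex=r'^b_')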
There are multiple solutions:
df = df[['a','b']]                                            # 1
df = df[list('ab')]                                           # 2
df = df.loc[:, df.columns.isin(['a','b'])]                    # 3
df = pd.DataFrame(data=df.eval('a,b').T, columns=['a','b'])   # 4 (PS: I do not recommend this method, but it is still a way to achieve this)
Hey what you are looking for is:
df = df[["a","b"]]
You will receive a dataframe which only contains the columns a and b.
If, on the other hand, you want to drop only a few columns and keep all the rest, put a "~" before the .isin statement to select every column except the ones listed:
df = df.loc[:, ~df.columns.isin(['a','b'])]
If you have more than two columns that you want to keep, say 20 or 30, you can list them and drop everything else. Make sure that you also specify the axis value.
keep_list = ["a", "b"]
df = df.drop(df.columns.difference(keep_list), axis=1)

How to df.groupby(cols).apply(my_func) for some columns, while leave a few columns not tackled?

Suppose I have a Pandas dataframe df that has columns a, b, c, d ... z, and I want to do df.groupby('a').apply(my_func) for columns d-z, while leaving columns 'b' and 'c' unchanged. How do I do that?
I know Pandas can apply a different function to each column by passing a dict, but I have a long column list and just want a parameter or trick that tells Pandas to skip some columns and apply my_func() to the rest of the columns (otherwise I have to build a long dict).
One simple (and general) approach is to create a view of the dataframe with the subset you are interested in (or, stated for your case, a view with all columns except the ones you want to ignore), and then use apply on that view.
In [116]: df
Out[116]:
a b c d f
0 one 3 0.493808 a bob
1 two 8 0.150585 b alice
2 one 6 0.641816 c michael
3 two 5 0.935653 d joe
4 one 1 0.521159 e kate
Use your favorite methods to create the view you need. You could select a range of columns like so: df_view = df.loc[:, 'b':'d'] (the older .ix accessor has since been removed from pandas), but the following might be more useful for your scenario:
#I want all columns except two
cols = df.columns.tolist()
mycols = [x for x in cols if x not in ['a', 'f']]
df_view = df[mycols]
Apply your function to that view. (Note this doesn't yet change anything in df.)
In [158]: df_view.apply(lambda x: x /2)
Out[158]:
b c d
0 1 0.246904 20
1 4 0.075293 25
2 3 0.320908 28
3 2 0.467827 28
4 0 0.260579 24
Update the df using update()
In [156]: df.update(df_view.apply(lambda x: x/2))
In [157]: df
Out[157]:
a b c d f
0 one 1 0.246904 20 bob
1 two 4 0.075293 25 alice
2 one 3 0.320908 28 michael
3 two 2 0.467827 28 joe
4 one 0 0.260579 24 kate
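If the per-group function can be written as a transform, a more compact alternative (a sketch with made-up data and an illustrative my_func, not the asker's actual function) is to run the groupby only on the columns you care about and assign the result back:

import pandas as pd

# toy frame mirroring the shape discussed above (values are illustrative)
df = pd.DataFrame({
    'a': ['one', 'two', 'one', 'two'],
    'b': [3, 8, 6, 5],          # leave untouched
    'c': [0.4, 0.1, 0.6, 0.9],  # leave untouched
    'd': [1.0, 2.0, 3.0, 4.0],
    'e': [5.0, 6.0, 7.0, 8.0],
})

def my_func(s):
    # example per-group transformation: demean each column within its 'a' group
    return s - s.mean()

cols = df.columns.difference(['a', 'b', 'c'])     # everything except a, b, c
df[cols] = df.groupby('a')[cols].transform(my_func)
print(df)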
