If I import or create a pandas column that contains no spaces, I can access it as such:
from pandas import DataFrame
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df1.data1
which would return that series for me. If, however, that column has a space in its name, it isn't accessible via that method:
from pandas import DataFrame
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data 2': range(3)})
df2.data 2 # <--- not the droid I'm looking for.
I know I can access it using .xs():
df2.xs('data 2', axis=1)
There's got to be another way. I've googled it like mad and can't think of any other way to google it. I've read all 96 entries here on SO that contain "column" and "string" and "pandas" and could find no previous answer. Is this the only way, or is there something better?
Old post, but may be interesting: an idea (which is destructive, but does the job if you want it quick and dirty) is to rename columns using underscores:
df1.columns = [c.replace(' ', '_') for c in df1.columns]
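A minimal runnable sketch of that rename (the frame here is made up for illustration):

```python
import pandas as pd

# A column whose name contains a space is not attribute-accessible
df1 = pd.DataFrame({'key': ['a', 'b'], 'data 1': [1, 2]})

# Replace spaces with underscores in every column name
df1.columns = [c.replace(' ', '_') for c in df1.columns]

df1.data_1  # now works as an attribute
```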
I think the default way is to use the bracket method instead of the dot notation.
import pandas as pd
df1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'dat a1': range(7)
})
df1['dat a1']
The other methods, like exposing it as an attribute, are more for convenience.
If you'd like to supply a column name that contains spaces to a pandas method such as assign, you can dictionarize your inputs and unpack them:
df.assign(**{'space column': (lambda x: x['space column2'])})
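A runnable sketch of that idea (the column names and values here are made up):

```python
import pandas as pd

df = pd.DataFrame({'space column2': [1, 2, 3]})

# A dict key may contain spaces, and ** unpacks it into keyword arguments
df = df.assign(**{'space column': lambda x: x['space column2'] * 2})
```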
While the accepted answer works for column selection with dictionaries or []-indexing, it does not generalise to other situations where one needs to refer to columns, such as the assign method:
>>> df.assign("data 2"=lambda x: x.sum(axis=1))
SyntaxError: keyword can't be an expression
You can do it with df['Column Name']
If you want to apply filtering, that's also possible with column names that contain spaces, e.g. filtering for NULL values or empty strings:
df_package[(df_package['Country_Region Code'].notnull()) |
(df_package['Country_Region Code'] != u'')]
as I figured out thanks to Rutger Kassies' answer.
Related
I have tried a lot but could not find a way to do the following, and I am not even sure whether it is possible in pandas.
Assume I have a dataframe like in (1).
When I use dataframe.groupby() on "col-a" I get (2), and I can process the grouped dataframe as usual, for example by applying a function. My question is:
Is it possible to group the dataframe as in (3) before processing (i.e. the row having "1" in Col-x included in group2 via a condition or something similar)? Or is it possible to apply a function that includes that row from group1 in group2 while processing?
Thank you all for your attention.
One last request, and maybe the most important one :). Although I started learning pandas a while ago, as a retired software developer I still have difficulty understanding its inner mechanisms. Could a pandas pro please recommend a document, book, or other resource for learning pandas' basic principles well? I really love it.
groupby can use a defined function to select groups. The function can combine column values in any way you want. To use your example this could be done along these lines:
import pandas as pd

df = pd.DataFrame({'col_a': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'],
                   'col_x': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
                   'col_calc': [1, 1, 1, 1, 1, 99, 1, 1, 1, 1, 1, 1]})

def func(mdf, idx, col1, col2):
    x = mdf[col1].loc[idx]
    y = mdf[col2].loc[idx]
    if x == 'a' and y == 0:
        return 'g1'
    if x == 'b' or y == 1:
        return 'g2'

df2 = df.groupby(lambda x: func(df, x, 'col_a', 'col_x'))['col_calc'].sum()
print(df2)
print(df2)
which gives:
g1 5
g2 105
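The same grouping can also be expressed by precomputing the labels as an array and passing it to groupby (an alternative vectorised sketch, not the answer above's method, using the same example data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_a': list('aaaaaabbbbbb'),
                   'col_x': [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
                   'col_calc': [1, 1, 1, 1, 1, 99, 1, 1, 1, 1, 1, 1]})

# Rows are 'g2' when col_a is 'b' or col_x is 1, otherwise 'g1'
labels = np.where((df['col_a'] == 'b') | (df['col_x'] == 1), 'g2', 'g1')
totals = df.groupby(labels)['col_calc'].sum()
```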
I have a dataframe df and a string variable cond which contains a condition, let's say:
cond = """F.col('some-column').isin(['some-value'])"""
I need to apply/parse this condition that is stored as text on the dataframe df. How can I accomplish this?
I know if I change it a little bit, I can utilize SparkSQL. However, for a bunch of upcoming requirements, I would prefer this method. If it would be possible, that is.
Thanks to @ghost's input, this turned out to be easier than I thought.
cond = """F.col('some-column').isin(['some-value'])"""
df = df.filter(eval(cond))
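For illustration, the same pattern works outside Spark too; here is a plain-pandas sketch of what eval is doing (the column name and value are made up). Keep in mind that eval runs arbitrary code, so the condition string must come from a trusted source:

```python
import pandas as pd

df = pd.DataFrame({'some_column': ['some-value', 'other']})

# The condition arrives as text and is evaluated against the dataframe
cond = """df['some_column'].isin(['some-value'])"""
filtered = df[eval(cond)]
```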
IIUC, use filter on the dataframe directly.
a_list = [1, 2, 3]
df = df.filter(F.col("col_name").isin(a_list))
Updated the answer as per my understanding :)
# business logic in a case condition here
logic = F.expr("""
    CASE WHEN column_name IN ('A', 'B', 'C') THEN '1'
         WHEN column_name IN ('D', 'E', 'F') THEN '2'
         ELSE column_name  -- or any default condition
    END""")
df = df.withColumn('new_column_name', logic)
I have been trying to change the column names of a pandas dataframe using a list of names. The following code is being used:
df.rename(columns = list_of_names, inplace=True)
However, I got a TypeError each time, with an error message saying "'list' object is not callable".
I would like to know why this happens. What can I do to solve this problem?
Thank you for your help.
You could use
df.columns = ['Leader', 'Time', 'Score']
If you need rename (where l is your list of new names):
df.rename(columns=dict(zip(df.columns,l)))
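A runnable sketch of the dict(zip(...)) rename, with made-up old column names:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})
l = ['Leader', 'Time', 'Score']

# zip pairs old names with new ones; dict() turns the pairs into a mapping
df = df.rename(columns=dict(zip(df.columns, l)))
```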
Just update the columns attribute:
df.columns = list_of_names
set_axis
To set column names, use set_axis along axis=1 or axis='columns':
df = df.set_axis(list_of_names, axis=1)
Note that the default axis=0 sets index names.
Why not just modify df.columns directly?
The accepted answer is fine and is used often, but set_axis has some advantages:
set_axis allows method chaining:
df.some_method().set_axis(list_of_names, axis=1).another_method()
vs:
df = df.some_method()
df.columns = list_of_names
df.another_method()
set_axis should theoretically provide better error checking than directly modifying an attribute, though I can't find a specific example at the moment.
If your list is column_list, say column_list = ['a', 'b', 'c'], and the original df.columns is ['X', 'Y', 'Z'], you just need: df.columns = column_list
I'm constructing a new DataFrame by concatenating the columns of other DataFrames, like so:
pairs = pd.concat([pos1['Close'], pos2['Close'], pos3['Close'], pos4['Close'], pos5['Close'],
pos6['Close'], pos7['Close']], axis=1)
I want to rename all of the columns of the pairs DataFrame to the symbol of the underlying securities. Is there a way to do this during the concat method call? Reading through the docs on the method here http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.concat.html didn't give me a solid answer.
You can achieve the same in one go using the keys argument:
pairs = pd.concat([pos1['Close'], pos2['Close'], pos3['Close'], pos4['Close'], pos5['Close'], pos6['Close'], pos7['Close']],
                  axis=1, keys=['JPM', 'WFC', 'BAC', 'C', 'STI', 'PNC', 'CMA'])
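A small self-contained sketch of the same idea, with two stand-in frames instead of the seven pos objects:

```python
import pandas as pd

# Hypothetical per-security 'Close' series standing in for pos1..pos7
pos1 = pd.DataFrame({'Close': [1.0, 2.0]})
pos2 = pd.DataFrame({'Close': [3.0, 4.0]})

# keys relabels each concatenated column in one go
pairs = pd.concat([pos1['Close'], pos2['Close']], axis=1, keys=['JPM', 'WFC'])
```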
This is the approach I'm taking. Seems to fit all my requirements.
symbols = ['JPM', 'WFC', 'BAC', 'C', 'STI', 'PNC', 'CMA']
pairs.columns = symbols
I am using Pandas to select columns from a dataframe, olddf. Let's say the column names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
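A quick sketch showing both approaches side by side on a toy frame (the values are made up):

```python
import pandas as pd

olddf = pd.DataFrame({'a': [1], 'b': [2], 'c': [3],
                      'startswith1': [4], 'startswith2': [5]})

filter_col = [col for col in olddf.columns if col.startswith('startswith')]

# Adding the lists concatenates them into one selection list
newdf = olddf[['a', 'b'] + filter_col]

# Equivalent selection via filter with a regex on the column names
newdf2 = olddf.filter(regex=r'^(startswith|[ab])')
```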