I have just started learning pandas and this is really my first question here, so please don't mind if it's too basic!
When should I use df.index.name and when df.index.names?
I would really appreciate knowing the difference and when to use each.
Many Thanks
name returns the name of the Index or MultiIndex.
names returns a list-like FrozenList of the level names: one entry for an Index (just one level), one per level for a MultiIndex (more than one level).
Index:
df1 = pd.DataFrame(columns=['a', 'b', 'c']).set_index('a')
print(df1.index.name, df1.index.names)
# a ['a']
MultiIndex:
df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])
print(df2.index.name, df2.index.names)
# None ['a', 'b']
df2.index.name = 'my_multiindex'
print(df2.index.name, df2.index.names)
# my_multiindex ['a', 'b']
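One practical application of names is renaming the levels of a MultiIndex. Note that in recent pandas versions, directly assigning index.name on a MultiIndex (as above) may raise and point you to set_names instead. A small hedged sketch using set_names and rename_axis (the level names 'first' and 'second' are just illustrative):
import pandas as pd

df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])

# set_names renames all levels at once (or a single one via the level= argument)
df2.index = df2.index.set_names(['first', 'second'])
print(df2.index.names)  # ['first', 'second']

# rename_axis does the same without touching the index object directly
df2 = df2.rename_axis(['first', 'second'])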
I want to anonymize or replace almost all columns in a pyspark dataframe except a few.
I know it's possible to do something like:
anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))\
.withColumn("NAME1", lit(""))\
.withColumn("TELEPHONE", lit(""))\
.withColumn("ELECTRONICMAILADDRESS", lit(""))
However, doing this for all columns is a tedious process. I would rather do something along the lines of this:
anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))\
.withColumn("*", lit(""))  # replace all other columns
This does not seem to work, however. Are there other workarounds that achieve this?
I guess one solution would be to create a list of column names and do something along these lines:
col_list = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
for col in col_list:
    employee_df = employee_df.withColumn(col, lit(""))
Any other suggestions would be of much help.
You can use select. Syntax-wise it won't be much different, but it will only create a single projection instead of one per withColumn call.
keep_cols = ['a', 'b', 'c']
empty_cols = ['d', 'e', 'f'] # or list(set(df.columns) - set(keep_cols))
df = df.select(*keep_cols, *[lit('').alias(x) for x in empty_cols])
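For completeness, a minimal self-contained sketch of this select approach (the SparkSession setup and the sample data here are illustrative, not from the original question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

employee_df = spark.createDataFrame(
    [(1, 'Alice', '555-0100', 'alice@example.com')],
    ['EMPLOYEENUMBER', 'NAME1', 'TELEPHONE', 'ELECTRONICMAILADDRESS'])

keep_cols = ['EMPLOYEENUMBER']
empty_cols = [c for c in employee_df.columns if c not in keep_cols]

# keep the identifying column, blank out everything else in a single select
anonymized_df = employee_df.select(
    *keep_cols, *[lit('').alias(c) for c in empty_cols])
anonymized_df.show()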
I have a dataframe with both numerical and categorical data (df1). I am creating a second dataframe (df2) that resembles the first, meaning it has the same column names and dtypes as df1. However, besides the names and dtypes of df1, I would also like to keep the categories of the categorical variables, even if they don't appear in df2 when I create it.
So far the easiest solution I have found is to loop over all categorical variables in df2, adding the categories of each categorical variable of df1. However, I believe there must be a faster/more efficient solution than the one I am proposing.
df1 = pd.DataFrame({
'A' : pd.Categorical(list('bbeebbaa'), categories=['e','a','b'], ordered=True),
'B' : [1,2,1,2,2,1,2,1],
'C' : pd.Categorical(list('ddeeccaa'), categories=['e','a','d', 'c'], ordered=True)})
df2 = pd.DataFrame({
'A' : pd.Categorical(list('bbeebbbb'), categories=['e', 'b'], ordered=True),
'B' : [1,2,1,2,2,1,2,1],
'C' : pd.Categorical(list('cccccccc'), categories=['c'], ordered=True)})
categorical = ['A', 'C']
for var in categorical:
    # only add the categories from df1 that df2 does not already have
    new_cats = df1[var].cat.categories.difference(df2[var].cat.categories)
    df2[var] = df2[var].cat.add_categories(new_cats)
If all categories of df2 are in df1, you can use the set_categories() function.
l = list(df1['A'].cat.categories)
df2['A'] = df2['A'].cat.set_categories(l)
Or in one line:
df2['A'] = df2['A'].cat.set_categories(list(df1['A'].cat.categories))
If both df1 and df2 contain categories unique to them, I am not sure how I would handle it; probably similarly to the approach you presented here.
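For that last case, a hedged option (assuming the goal is simply the union of the categories from both frames; note the union comes back sorted, so any original category order is not preserved) would be:
# apply the combined category set of df1 and df2 to both frames
for var in ['A', 'C']:
    combined = df1[var].cat.categories.union(df2[var].cat.categories)
    df1[var] = df1[var].cat.set_categories(combined)
    df2[var] = df2[var].cat.set_categories(combined)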
I am trying to use pivot_table in dask while maintaining a sorted index. I have a simple pandas dataframe that looks something like this:
import pandas as pd
import dask.dataframe

# make dataframe, first in pandas and then in dask
df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'B': ['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
                   'dist': [0, .1, .2, .1, 0, .3, .4, .1, 0]})
df.sort_values(by='A', inplace=True)
dd = dask.dataframe.from_pandas(df, chunksize=3)  # just for demo's sake, you obviously don't ever want a chunksize of 3
print(dd.known_divisions) # Here I get True, which means my data is sorted
# now pivot and see if the index remains sorted
dd = dd.categorize('B')
pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
print(pivot_dd.known_divisions) # Here I get False, which makes me sad
I would love to find a way to get pivot_dd to have a sorted index, but I don't see a sort_index method in dask and cannot set 'A' as an index without getting a key error (it already is the index!).
In this toy example, I could pivot the pandas table first and then sort. The real application I have in mind won't allow me to do that.
Thanks in advance for any help/suggestions.
This may not be what you were wishing for, and perhaps not even the best answer, but it does seem to work. The first wrinkle is that pivot operations create a categorical index for the columns, which is annoying. You could do the following.
>>> pivot_dd = dd.pivot_table(index='A', columns='B', values='dist')
>>> pivot_dd.columns = list(pivot_dd.columns)
>>> pivot_dd = pivot_dd.reset_index().set_index('A', sorted=True)
>>> pivot_dd.known_divisions
True
I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
If I import or create a pandas column that contains no spaces, I can access it as such:
from pandas import DataFrame
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df1.data1
which would return that series for me. If, however, that column has a space in its name, it isn't accessible via that method:
from pandas import DataFrame
df2 = DataFrame({'key': ['a','b','d'],
'data 2': range(3)})
df2.data 2 # <--- not the droid I'm looking for.
I know I can access it using .xs():
df2.xs('data 2', axis=1)
There's got to be another way. I've googled it like mad and can't think of any other way to google it. I've read all 96 entries here on SO that contain "column" and "string" and "pandas" and could find no previous answer. Is this the only way, or is there something better?
Old post, but may be interesting: an idea (which is destructive, but does the job if you want it quick and dirty) is to rename columns using underscores:
df1.columns = [c.replace(' ', '_') for c in df1.columns]
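An equivalent one-liner using the Index string methods (assuming a reasonably recent pandas version) would be:
df1.columns = df1.columns.str.replace(' ', '_')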
I think the default way is to use the bracket method instead of the dot notation.
import pandas as pd
df1 = pd.DataFrame({
'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'dat a1': range(7)
})
df1['dat a1']
The other methods, like exposing a column as an attribute, are more for convenience.
If you want to supply column names containing spaces to a pandas method like assign, you can pass them through a dictionary and unpack it.
df.assign(**{'space column': (lambda x: x['space column2'])})
While the accepted answer works for column-specification when using dictionaries or []-selection, it does not generalise to other situations where one needs to refer to columns, such as the assign method:
>>> df.assign("data 2"=lambda x: x.sum(axis=1))
SyntaxError: keyword can't be an expression
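A small runnable sketch of the dictionary-unpacking workaround shown above, using the df2 from the question (the new column name is just for illustration):
import pandas as pd

df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data 2': range(3)})

# pass the spaced names through a dict and unpack it into assign
df2 = df2.assign(**{'data 2 doubled': lambda x: x['data 2'] * 2})
print(df2)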
You can do it with df['Column Name']
If you want to apply filtering, that's also possible with column names that have spaces in them, e.g. filtering for NULL values or empty strings:
df_package[(df_package['Country_Region Code'].notnull()) |
(df_package['Country_Region Code'] != u'')]
as I figured out thanks to Rutger Kassies' answer.