Using .withColumn on all remaining columns in DF - python

I want to anonymize or replace almost all columns in a PySpark dataframe, except for a few.
I know it's possible to do something like:
from pyspark.sql.functions import col, lit

anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))\
    .withColumn("NAME1", lit(""))\
    .withColumn("TELEPHONE", lit(""))\
    .withColumn("ELECTRONICMAILADDRESS", lit(""))
However, doing this for all columns is tedious. I would rather do something along these lines:
anonymized_df = employee_df.withColumn("EMPLOYEENUMBER", col("EMPLOYEENUMBER"))\
    .withColumn("*", lit(""))  # replace all other columns
This does not seem to work, however. Are there other workarounds that achieve this?
I guess one solution would be to create a list of column names and do something along these lines:
col_list = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']
for c in col_list:
    employee_df = employee_df.withColumn(c, lit(""))
Any other suggestions would be much appreciated.

You can use select. Syntax-wise it won't be much different, but it will only create one snapshot.
keep_cols = ['a', 'b', 'c']
empty_cols = ['d', 'e', 'f'] # or list(set(df.columns) - set(keep_cols))
df = df.select(*keep_cols, *[lit('').alias(x) for x in empty_cols])
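Applied to the employee frame from the question, keeping only EMPLOYEENUMBER (an assumption based on the snippet above), the whole thing might look like:

from pyspark.sql.functions import lit

keep_cols = ['EMPLOYEENUMBER']
empty_cols = [c for c in employee_df.columns if c not in keep_cols]

# one projection instead of a chain of withColumn calls;
# note the kept columns move to the front of the result
anonymized_df = employee_df.select(
    *keep_cols,
    *[lit('').alias(c) for c in empty_cols]
)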

Related

Difference between index.name and index.names in pandas

I have just started learning pandas, and this is really my first question here, so please don't mind if it's too basic!
When should I use df.index.name and when df.index.names?
I would really appreciate knowing the difference and when to apply each.
Many thanks!
name returns the name of the Index or MultiIndex.
names returns a list of the level names of an Index (just one level) or a MultiIndex (more than one level).
Index:
df1 = pd.DataFrame(columns=['a', 'b', 'c']).set_index('a')
print(df1.index.name, df1.index.names)
# a ['a']
MultiIndex:
df2 = pd.DataFrame(columns=['a', 'b', 'c']).set_index(['a', 'b'])
print(df2.index.name, df2.index.names)
# None ['a', 'b']
df2.index.name = 'my_multiindex'
print(df2.index.name, df2.index.names)
# my_multiindex ['a', 'b']
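Note that names is itself assignable, which is the usual way to relabel the levels of a MultiIndex (rename_axis does the same without mutating in place):

# assigning to names relabels the levels; the list length must match
df2.index.names = ['x', 'y']
print(df2.index.names)
# ['x', 'y']

# non-mutating equivalent
df2 = df2.rename_axis(index=['x', 'y'])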

Parallel Processing of Loop of Pandas Columns

I have the following code which I would like to speed up.
EDIT: we would like the columns in 'colsi' to be shifted within the groups defined by the columns in 'colsj'. Pandas allows us to shift multiple columns at once by vectorizing over 'colsi'. I loop through each group column and perform the vectorized shifts, then fill the NAs with the medians of the columns in 'colsi'. The reindex just creates new blank columns before they are assigned. The issue is that I have many groups, and looping through each one is becoming time consuming.
EDIT2: My goal is to engineer new columns from the lag of each group. I have many group columns and many columns to be shifted. 'colsi' contains the columns to be shifted, 'colsj' the group columns. I am able to vectorize over 'colsi', but looping through each group column in 'colsj' is still time consuming.
colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

for j in colsj:
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)
Parallelization seems to be a good way to do it. Leaning on this code, I attempted the following but it didn't work:
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=3)

colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def funct(j):
    newcols = [j + i + '_n' for i in colsi]
    newmed = med.copy()
    newmed.index = newcols
    df = df.reindex(columns=df.columns.tolist() + newcols)
    df[newcols] = df.groupby(j)[colsi].shift()
    df[newcols] = df[newcols].fillna(newmed)

for j in colsj:
    pool.apply_async(funct, (j))
I do not have much knowledge of parallel processing, so I am not sure what's missing here. Please advise.
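For what it's worth, here is a minimal sketch of how the attempt above might be repaired (assuming the same df, colsi and colsj as before). Three things stand out: assigning to df inside the function only creates a local variable, apply_async wants its arguments as a tuple, i.e. (j,) rather than (j), and the results have to be collected before the pool shuts down:

from multiprocessing.pool import ThreadPool
import pandas as pd

colsi = ['a', 'b', 'c']
colsj = ['d', 'e', 'f']
med = df[colsi].median()

def funct(j):
    # return the new columns instead of rebinding the global df
    newcols = [j + i + '_n' for i in colsi]
    shifted = df.groupby(j)[colsi].shift()
    shifted.columns = newcols
    return shifted.fillna(dict(zip(newcols, med)))

pool = ThreadPool(processes=3)
results = [pool.apply_async(funct, (j,)) for j in colsj]  # note the comma
pool.close()
pool.join()

# the shifted frames share df's index, so a concat along columns lines up
df = pd.concat([df] + [r.get() for r in results], axis=1)

Whether threads actually buy you anything here depends on how much of the groupby/shift work releases the GIL; if they don't, a process pool (with its pickling overhead) would be the next thing to try.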

How to change column names in pandas Dataframe using a list of names?

I have been trying to change the column names of a pandas dataframe using a list of names. The following code is being used:
df.rename(columns = list_of_names, inplace=True)
However, I got a TypeError each time, with an error message saying "'list' object is not callable".
I would like to know why this happens, and what I can do to solve the problem.
Thank you for your help.
You could use
df.columns = ['Leader', 'Time', 'Score']
If you need to use rename (where l is your list of new names):
df = df.rename(columns=dict(zip(df.columns, l)))
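For instance, assuming the three-column frame from the other answers, the zip builds the old-to-new mapping:

import pandas as pd

df = pd.DataFrame(columns=['X', 'Y', 'Z'])
l = ['Leader', 'Time', 'Score']

# dict(zip(...)) pairs each old name with its replacement
print(dict(zip(df.columns, l)))
# {'X': 'Leader', 'Y': 'Time', 'Z': 'Score'}

df = df.rename(columns=dict(zip(df.columns, l)))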
Just update the columns attribute:
df.columns = list_of_names
set_axis
To set column names, use set_axis along axis=1 or axis='columns':
df = df.set_axis(list_of_names, axis=1)
Note that the default axis=0 sets index names.
Why not just modify df.columns directly?
The accepted answer is fine and is used often, but set_axis has some advantages:
set_axis allows method chaining:
df.some_method().set_axis(list_of_names, axis=1).another_method()
vs:
df = df.some_method()
df.columns = list_of_names
df.another_method()
set_axis should theoretically provide better error checking than directly modifying an attribute, though I can't find a specific example at the moment.
If your list is column_list, e.g. column_list is ['a', 'b', 'c'], and the original df.columns is ['X', 'Y', 'Z'], you just need:
df.columns = column_list

Pandas dataframe column selection

I am using Pandas to select columns from a dataframe, olddf. Let's say the variable names are 'a', 'b', 'c', 'startswith1', 'startswith2', 'startswith3', ..., 'startswith10'.
My approach was to create a list of all variables with a common starting value.
filter_col = [col for col in list(olddf) if col.startswith('startswith')]
I'd like to then select columns within that list as well as others, by name, so I don't have to type them all out. However, this doesn't work:
newdf = olddf['a','b',filter_col]
And this doesn't either:
newdf = olddf[['a','b'],filter_col]
I'm a newbie so this is probably pretty simple. Is the reason this doesn't work because I'm mixing a list improperly?
Thanks.
Use
newdf = olddf[['a','b']+filter_col]
since adding lists concatenates them:
In [264]: ['a', 'b'] + ['startswith1']
Out[264]: ['a', 'b', 'startswith1']
Alternatively, you could use the filter method:
newdf = olddf.filter(regex=r'^(startswith|[ab])')
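A quick check on a toy frame (column names assumed from the question) shows the two approaches select the same columns:

import pandas as pd

olddf = pd.DataFrame(columns=['a', 'b', 'c',
                              'startswith1', 'startswith2', 'startswith3'])

filter_col = [col for col in list(olddf) if col.startswith('startswith')]

newdf = olddf[['a', 'b'] + filter_col]
print(list(newdf))
# ['a', 'b', 'startswith1', 'startswith2', 'startswith3']

# the regex route picks the same columns (order follows the original frame)
print(list(olddf.filter(regex=r'^(startswith|[ab])')))
# ['a', 'b', 'startswith1', 'startswith2', 'startswith3']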

Pandas column access w/column names containing spaces

If I import or create a pandas column that contains no spaces, I can access it as such:
from pandas import DataFrame
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                 'data1': range(7)})
df1.data1
which would return that series for me. If, however, that column has a space in its name, it isn't accessible via that method:
from pandas import DataFrame
df2 = DataFrame({'key': ['a', 'b', 'd'],
                 'data 2': range(3)})
df2.data 2 # <--- not the droid I'm looking for.
I know I can access it using .xs():
df2.xs('data 2', axis=1)
There's got to be another way. I've googled it like mad and can't think of any other way to google it. I've read all 96 entries here on SO that contain "column" and "string" and "pandas" and could find no previous answer. Is this the only way, or is there something better?
Old post, but may be interesting: an idea (which is destructive, but does the job if you want it quick and dirty) is to rename columns using underscores:
df1.columns = [c.replace(' ', '_') for c in df1.columns]
I think the default way is to use the bracket method instead of the dot notation.
import pandas as pd
df1 = pd.DataFrame({
    'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
    'dat a1': range(7)
})
df1['dat a1']
The other methods, like exposing it as an attribute, are more for convenience.
If you want to supply a spaced column name to a pandas method like assign, you can dictionarize your inputs.
df.assign(**{'space column': (lambda x: x['space column2'])})
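A runnable version of that idea, on a hypothetical frame:

import pandas as pd

df = pd.DataFrame({'space column2': [1, 2, 3]})

# **-unpacking lets the keyword be an arbitrary string, spaces included
df = df.assign(**{'space column': lambda x: x['space column2'] * 2})
print(df.columns.tolist())
# ['space column2', 'space column']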
While the accepted answer works for column-specification when using dictionaries or []-selection, it does not generalise to other situations where one needs to refer to columns, such as the assign method:
> df.assign("data 2" = lambda x: x.sum(axis=1))
SyntaxError: keyword can't be an expression
You can do it with df['Column Name']
Filtering is also possible with column names containing spaces, e.g. filtering for NULL values or empty strings:
df_package[(df_package['Country_Region Code'].notnull()) |
           (df_package['Country_Region Code'] != u'')]
as I figured out thanks to Rutger Kassies' answer.
