How to make a withColumnRenamed query generic in PySpark - python

Description
I have 2 lists:
List1 = ['curentColumnName1', 'curentColumnName2', 'currentColumnName3']
List2 = ['newColumnName1', 'newColumnName2', 'newColumnName3']
There is a dataframe df which contains all the columns.
I want to check whether column 'curentColumnName1' is present in the dataframe; if yes, rename it to 'newColumnName1'.
This needs to be done for all the columns in the list that are present in the dataframe.
How can I achieve this scenario using PySpark?

Just iterate over the first list, check if the column is in the dataframe's columns, and rename it:
for i in range(len(List1)):
    if List1[i] in df.columns:
        df = df.withColumnRenamed(List1[i], List2[i])
P.S. Instead of two lists, it's better to use a dictionary - it's easier to maintain, and you avoid errors that creep in when you add or remove elements in only one list.
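A minimal sketch of that dictionary variant (rename_map is just an illustrative name), assuming the same df and lists as above:
rename_map = dict(zip(List1, List2))
for old_name, new_name in rename_map.items():
    if old_name in df.columns:
        df = df.withColumnRenamed(old_name, new_name)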

Here is another way of doing it as a one-liner:
from functools import reduce
df = reduce(
    lambda a, b: a.withColumnRenamed(b[0], b[1]),
    zip(List1, List2),
    df,
)

You can achieve this in one line:
df.selectExpr(*[f"{old_col} AS {new_col}" for old_col, new_col in zip(List1, List2)]).show()
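Note that this selectExpr call keeps only the listed columns and will raise an error if any name in List1 is missing from the dataframe. A sketch that renames while keeping the remaining columns (rename_map is just an illustrative name):
from pyspark.sql import functions as F

rename_map = dict(zip(List1, List2))
df_renamed = df.select(
    *[F.col(c).alias(rename_map[c]) if c in rename_map else F.col(c) for c in df.columns]
)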

Related

Filtering a large Pandas DataFrame based on a list of strings in column names

Stack Overflow Family,
I have recently started learning Python and am using Pandas to handle some factory data. The csv file is essentially a large dataframe (1621 rows × 5633 columns). While I need all the rows, as these are data for each unit, I need to filter out many unwanted columns. I have identified a list of strings in these column names that I can use to find only the wanted columns; however, I am not able to figure out what a good logic here would be, or whether there are any built-in Python functions for it.
dropna is not an option for me, as some of the wanted columns have NA as values (for example, test limit).
dropna for columns with all NA is also not good enough, as I will still end up with a large number of columns.
Looking for some guidance here. Thank you for your time.
If you have a list of valid columns, you can just use df.filter(cols_subset, axis=1) to drop everything else.
You could use a regex to also match substrings from your list in column names:
df.filter(regex='|'.join(cols_subset), axis=1)
Or you could match only columns starting with a substring from your list:
df.filter(regex='^('+'|'.join(cols_subset)+')', axis=1)
EDIT:
Given the time complexity of my previous solution, I came up with a way to use list comprehension:
fruits = ["apple", "banana", "cherry", "kiwi", "mango"]
app = ["app", "ban"]
new_list = [x for x in fruits if any(y in x for y in app)]
output:
['apple', 'banana']
This should only display the columns you need. In your case you just need to do:
my_strings = ["A", "B", ...]
new_list = [x for x in df.columns if any(y in x for y in my_strings)]
print(new_list)
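To then keep only those columns in the dataframe, subsetting by the resulting list should work (a sketch, assuming new_list contains valid column names):
df_filtered = df[new_list]            # plain label-based selection
# or, equivalently
df_filtered = df.filter(items=new_list, axis=1)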
If you know the exact column names, you could do something like this:
unwanted_cols = ['col1', 'col4'] #list of unwanted cols names
df_cleaned = current_df.drop(unwanted_cols, axis=1)
# or
current_df.drop(unwanted_cols, inplace=True, axis=1)
If you don't know the exact column names, you could first retrieve all of the columns:
all_cols = current_df.columns.tolist()
and then apply a regex to all of the column names to obtain the ones that match your list of strings, and apply the same code as above (see the sketch below).
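A short sketch of that last step (cols_substrings is a hypothetical list of the strings you are matching on):
import re

all_cols = current_df.columns.tolist()
pattern = re.compile('|'.join(cols_substrings))
# keep columns whose name matches any of the substrings, drop the rest
unwanted_cols = [c for c in all_cols if not pattern.search(c)]
df_cleaned = current_df.drop(unwanted_cols, axis=1)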
You can drop columns from a dataframe by applying str.contains with a regular expression. Below is an example:
df.drop(df.columns[df.columns.str.contains('^abc')], axis=1)

Group by with lists in a DataFrame

I have a problem with a DataFrame that looks like this:
It contains "ClusterLabels" (0-44) and I want to group the "Document" column by the ClusterLabel value. I want the lists in "Document" to be combined into one list per cluster (duplicate words should be kept).
I tried ".groupby", but it gives the error "sequence item 0: expected str instance, list found".
Can someone help?
Don't use sum to concatenate lists. It looks fancy but it's quadratic and should be considered bad practice.
Better is to use a list comprehension to flatten the lists:
df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: [z for y in x for z in y])
         .reset_index())
Or flatten with itertools.chain:
from itertools import chain

df1 = (df.groupby('ClusterLabel')['Document']
         .agg(lambda x: list(chain(*x)))
         .reset_index())
You can do it like this:
import pandas as pd

df = pd.DataFrame({"Document": [["a","b","c","d"], ["a","d"], ["a","b"], ["c","d"], ["d"]],
                   "ClusterLabel": [0, 0, 0, 1, 1]})
df
df.groupby("ClusterLabel").sum()

List comprehension pandas assignment

How do I use a list comprehension, or any other technique, to refactor the code I have? I'm working on a DataFrame, modifying values in the first example and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'],errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'],errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines
I think you just need a simple loop, especially if you want to avoid apply and have many columns:
cols = ['start_d', 'end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is necessary because the result is a list of Series:
comp = [pd.to_datetime(df[c],errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
But a solution with apply is still possible:
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
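For Example 2, a similar loop-free sketch using DataFrame.assign (the column names are taken from the question):
new_cols = ['col1', 'col2']
df = df.assign(**{c: 'NA' for c in new_cols})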

Compare a list of dictionaries to a dataframe, show missing values

I have a list of dictionaries:
example_list = [{'email':'myemail@email.com'},{'email':'another@email.com'}]
and a dataframe with an 'Email' column
I need to compare the list against the dataframe and return the values that are not in the dataframe.
I can certainly iterate over the list, check in the dataframe, but I was looking for a more pythonic way, perhaps using list comprehension or perhaps a map function in dataframes?
To return those values that are not in the dataframe's 'Email' column, here are a couple of options involving set difference operations:
np.setdiff1d
emails = [d['email'] for d in example_list]
diff = np.setdiff1d(emails, df['Email'])  # returns an array
set.difference
# returns a set
diff = set(d['email'] for d in example_list).difference(df['Email'])
One way is to take one set from another. For a functional solution you can use operator.itemgetter:
from operator import itemgetter
res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
Note: - is syntactic sugar for set.difference.
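A tiny self-contained check of this approach, using made-up sample data:
import pandas as pd
from operator import itemgetter

example_list = [{'email': 'a@example.com'}, {'email': 'b@example.com'}]
df = pd.DataFrame({'Email': ['a@example.com']})
res = set(map(itemgetter('email'), example_list)) - set(df['Email'])
# res == {'b@example.com'}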
I ended up converting the list into a dataframe, comparing the two dataframes by merging them on a column, and then creating a dataframe out of the missing values
So, for example:
example_list = [{'email':'myemail@email.com'},{'email':'another@email.com'}]
df_two = pd.DataFrame(item for item in example_list)
common = df_one.merge(df_two, on=['Email'])
df_diff = df_one[(~df_one.Email.isin(common.Email))]

Python pandas: selecting columns from a dataframe via a list of column names

I have a dataframe with a lot of columns in it. Now I want to select only certain columns. I have saved all the names of the columns that I want to select into a Python list and now I want to filter my dataframe according to this list.
I've been trying to do:
df_new = df[[list]]
where list includes all the column names that I want to select.
However I get the error:
TypeError: unhashable type: 'list'
Any help on this one?
You can remove one []:
df_new = df[list]
Also, it's better to use a name other than list (which shadows the built-in), e.g. L:
df_new = df[L]
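A quick self-contained example (the column names here are made up for illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
L = ['a', 'c']
df_new = df[L]   # keeps only columns 'a' and 'c'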
It looks like it's working; I would only simplify it:
L = []
for x in df.columns:
    if "_" not in x[-3:]:
        L.append(x)
print(L)
List comprehension:
print([x for x in df.columns if "_" not in x[-3:]])
