List comprehension pandas assignment - python

How do I use a list comprehension, or any other technique, to refactor the code I have? I'm working on a DataFrame, modifying values in the first example and adding new columns in the second.
Example 1
df['start_d'] = pd.to_datetime(df['start_d'], errors='coerce').dt.strftime('%Y-%b-%d')
df['end_d'] = pd.to_datetime(df['end_d'], errors='coerce').dt.strftime('%Y-%b-%d')
Example 2
df['col1'] = 'NA'
df['col2'] = 'NA'
I'd prefer to avoid using apply, just because it'll increase the number of lines.

I think you simply need a loop, especially if you want to avoid apply and have many columns:
cols = ['start_d', 'end_d']
for c in cols:
    df[c] = pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d')
If you need a list comprehension, concat is then necessary because the result is a list of Series:
comp = [pd.to_datetime(df[c], errors='coerce').dt.strftime('%Y-%b-%d') for c in cols]
df = pd.concat(comp, axis=1)
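Note that this rebuilds df from only the converted columns. To keep any remaining columns intact, you could assign the result back into the original frame instead, for example:
df[cols] = pd.concat(comp, axis=1)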
Still, here is a possible solution with apply:
df[cols] = df[cols].apply(lambda x: pd.to_datetime(x, errors='coerce').dt.strftime('%Y-%b-%d'))
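For Example 2 (adding several constant columns), the same loop pattern covers it; a minimal sketch:
for c in ['col1', 'col2']:
    df[c] = 'NA'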

Related

How to make a withColumnRenamed query generic in pyspark

Description
I have 2 lists
List1 = ['curentColumnName1', 'curentColumnName2', 'currentColumnName3']
List2 = ['newColumnName1', 'newColumnName2', 'newColumnName3']
There is a dataframe df which contains all the columns.
I want to check whether column 'curentColumnName1' is present in the dataframe; if yes, rename it to 'newColumnName1'.
This needs to be done for all the columns that are present in the dataframe.
How can I achieve this in pyspark?
Just iterate over the first list, check if each entry is in the column list, and rename:
for i in range(len(List1)):
    if List1[i] in df.columns:
        df = df.withColumnRenamed(List1[i], List2[i])
P.S. Instead of two lists, it's better to use a dictionary: it's easier to maintain, and you avoid errors when you add or remove elements in only one of the lists.
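A minimal sketch of that dictionary version:
rename_map = dict(zip(List1, List2))
for old, new in rename_map.items():
    if old in df.columns:
        df = df.withColumnRenamed(old, new)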
Here is another way of doing it as a one-liner:
from functools import reduce

# Fold df through the (old, new) pairs, renaming one column per step
df = reduce(
    lambda a, b: a.withColumnRenamed(b[0], b[1]),
    zip(List1, List2),
    df,
)
You can also achieve it in one line with selectExpr (note that this keeps only the listed columns):
df.selectExpr(*[f"{old_col} AS {new_col}" for old_col, new_col in zip(List1, List2)]).show()
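If the columns that are not being renamed must survive the select, a sketch of an alternative (assuming the standard pyspark.sql.functions import) is to alias only the matching columns:

from pyspark.sql import functions as F

mapping = dict(zip(List1, List2))
# Alias a column if it appears in the mapping, otherwise keep it as-is
df = df.select([F.col(c).alias(mapping.get(c, c)) for c in df.columns])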

Apply Pandas series string function to the whole dataframe

I want to apply the method pd.Series.str.join() to my whole dataframe
           A      B
0  [foo,bar]  [1,2]
1  [bar,foo]  [3,4]
Desired output:
        A   B
0  foobar  12
1  barfoo  34
For now I've used a rather slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seems to work. As an extension of the question: how can I efficiently apply a Series method to the whole dataframe?
Thank you.
There will always be a problem when trying to join lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
df = pd.DataFrame({'A': [['Foo','Bar'], ['Bar','Foo']], 'B': [[1,2], [3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df], index=df.columns).T
Which correctly outputs:
        A   B
0  FooBar  12
1  BarFoo  34
import pandas as pd

df = pd.DataFrame({'A': [['foo','bar'], ['bar','foo']], 'B': [[1,2], [3,4]]})
# If 'B' holds lists of integers; otherwise this step can be skipped
df['B'] = df['B'].transform(lambda value: [str(x) for x in value])
df = df.applymap(lambda value: ''.join(value))
Explanation: applymap() applies a function to each value of your dataframe.
I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have quite a big dataframe, so a for loop is not really scalable.
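For reference, a self-contained sketch of this stack/unstack approach on data like the sample above (the integer lists still need converting to strings first):

import pandas as pd

df_sum = pd.DataFrame({'A': [['foo','bar'], ['bar','foo']], 'B': [[1,2], [3,4]]})
# Stringify every list element so .str.join works on both columns
df_sum = df_sum.applymap(lambda lst: [str(v) for v in lst])
df_sum = df_sum.stack().str.join('').unstack()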

Converting for loop to list comprehension?

This has been killing me!
Any idea how to convert this to a list comprehension?
for x in dataframe:
    if dataframe[x].value_counts().sum() <= 1:
        dataframe.drop(x, axis=1, inplace=True)
[dataframe.drop(x, axis=1, inplace=True) for x in dataframe if dataframe[x].value_counts().sum() <= 1]
I have not used pandas yet, but the documentation on dataframe.drop says it returns a new object, so I assume it will work.
I would probably suggest going the other way and filtering instead. I don't know your dataframe, but something like this should work:
counts_valid = df.apply(pd.Series.value_counts).sum() > 1
df = df.loc[:, counts_valid]
Or, if I see what you are doing, you may be better off with:
counts_valid = df.nunique() > 1
df = df.loc[:, counts_valid]
That will just keep the columns that have more than one unique value.
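A tiny sketch (hypothetical data) showing the column filter in action:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [7, 7, 7], 'c': [None, None, None]})
df = df.loc[:, df.nunique() > 1]  # keeps only column 'a'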

Loop over columns to cleanse the values

I have many columns in my pandas data frame which I want to cleanse with a specific method. I want to see if there is a way to do this in one go.
This is what I tried, and it does not work:
list = list(bigtable)  # this list has all the columns I want to cleanse
for index in list:
    bigtable1.column = bigtable.column.str.split(',', expand=True).apply(lambda x: pd.Series(np.sort(x)).str.cat(sep=','), axis=1)
Try this; it should work:
import numpy as np
import pandas as pd

bigtable1 = pd.DataFrame()
for index in list:
    bigtable1[index] = bigtable[index].str.split(',', expand=True).apply(lambda x: pd.Series(np.sort(x)).str.cat(sep=','), axis=1)
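To illustrate what the loop does, a runnable sketch with hypothetical data (each cell's comma-separated tokens come back sorted):

import numpy as np
import pandas as pd

bigtable = pd.DataFrame({'col1': ['b,a,c', 'z,y,x'], 'col2': ['2,1,3', '9,8,7']})
bigtable1 = pd.DataFrame()
for index in ['col1', 'col2']:
    # Split each cell on commas, sort the pieces, and rejoin them
    bigtable1[index] = bigtable[index].str.split(',', expand=True).apply(lambda x: pd.Series(np.sort(x)).str.cat(sep=','), axis=1)
# bigtable1['col1'] is now ['a,b,c', 'x,y,z']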

pandas: check membership in array of lists, avoid looping through columns

What is the best way to accomplish the following task?
In the following DataFrame,
df = pd.DataFrame({'a': [20, 21, 99], 'b': [[1,2,3,4], [1,2,99], [1,2]], 'c': ['x', 'y', 'z']})
I want to check which elements in column df['a'] are contained in some list in column df['b']. In case there is a match I want the corresponding element in column df['c'], and if no match is found a 0.
So in my example I would like to get a Series:
[0,0,'y'].
This is because 99 is the only element in column df['a'] contained in a list from column df['b'], and that list corresponds to element 'y' in column df['c'].
I tried:
def match(item):
    for ind, row in df.iterrows():
        if item in row.b:
            return row.c
    return 0
df['a'].apply(match)
But it is quite slow.
Thanks!
I think this is an example of why you never want a column of lists in a Pandas DataFrame. Accessing the values in the lists forces you to use Python loops, with no opportunity to really take advantage of Pandas.
Ideally, I think you would be best off altering the way you are constructing df so that you do not store the values in b as lists. The appropriate data structure to use depends on how you intend to use the data.
For the particular purpose you describe in the question, a dict would be useful.
To construct the dict given the current df, you could do this:
In [69]: dct = {key: row['c'] for i, row in df[['b', 'c']].iterrows() for key in row['b']}

In [70]: df['a'].map(dct).fillna(0)
Out[70]:
0    0
1    0
2    y
Name: a, dtype: object
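If the frame is large, the same mapping can also be built without iterrows; a sketch zipping over the two columns directly:

# Build the lookup dict from paired values of 'b' (lists) and 'c' (labels)
dct = {key: c for lst, c in zip(df['b'], df['c']) for key in lst}
df['a'].map(dct).fillna(0)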
