Find pattern in pandas column names and change such columns using pipe - python

Let's say I have the calculation below:
import pandas as pd
dat = pd.DataFrame({'xx1' : [1,2,3], 'aa2' : ['qq', '4', 'd'], 'xx3' : [4,5,6]})
dat2 = (dat
.assign(xx1 = lambda x : [str(i) for i in x['xx1'].values])
.assign(xx3 = lambda x : [str(i) for i in x['xx3'].values])
)
Basically, I need to find the columns whose names match the pattern xx + a sequence of numbers (i.e. xx1, xx2, xx3, etc.) and then apply some transformation to those columns (e.g. apply the str function).
One way I can do this is like above, i.e. find those columns manually and perform the transformation. I wonder if there is any way to generalise this approach. I prefer to use a pipe like above.
Any pointers will be very helpful.

You could do:
# Matches all columns starting with 'xx' with a sequence of numbers afterwards.
cols_to_transform = dat.columns[dat.columns.str.match('^xx[0-9]+$')]
# Transform to apply (column-wise).
transform_function = lambda c: c.astype(str)
# If you want a new DataFrame and not modify the other in-place.
dat2 = dat.copy()
dat2[cols_to_transform] = dat2[cols_to_transform].transform(transform_function, axis=0)
To use it within assign:
# A lambda avoids precomputing all the transformations in the dict comprehension.
# Note: col=col binds the current column name; without it, every lambda would see
# the last value of col by the time assign actually calls it.
dat.assign(**{col: lambda df, col=col: df[col].astype(str) for col in cols_to_transform})
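Since the question asked for a pipe, here is a minimal pipe-based sketch of the same idea (the helper name transform_matching is mine, not from the answer above):
import pandas as pd

def transform_matching(df, pattern, func):
    # Apply func to every column whose name matches pattern, returning a new frame.
    out = df.copy()
    cols = out.columns[out.columns.str.match(pattern)]
    out[cols] = out[cols].transform(func)
    return out

dat2 = dat.pipe(transform_matching, r'^xx[0-9]+$', lambda c: c.astype(str))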

import pandas as pd
frame = pd.DataFrame({'xx1' : [1,2,3], 'aa2' : ['qq', '4', 'd'], 'xx3' : [4,5,6]})
def parse_column(col, vals):
    if "xx" == col[:2] and col[2:].isdigit():
        return [str(i) for i in vals]
    return vals

# iteritems() was removed in pandas 2.0; items() is the current equivalent
for (name, col) in frame.items():
    frame[name] = parse_column(name, col.values)
You can iterate over columns, getting their names and values as a Series.
The rather niche str.isdigit() method is built into Python, and it came in useful here.

One option is to select the relevant columns, apply your function and assign them back to the dataframe via unpacking:
result = dat.assign(**dat.filter(regex=r"xx\d+").astype(str))
result.dtypes
xx1 object
aa2 object
xx3 object
dtype: object
dat.dtypes
xx1 int64
aa2 object
xx3 int64
dtype: object
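If you prefer to modify dat in place instead of building a new result, a small sketch of the same idea:
cols = dat.filter(regex=r"xx\d+").columns
dat[cols] = dat[cols].astype(str)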


PySpark: Sum up columns from array [duplicate]

I've got a list of column names I want to sum:
columns = ['col1', 'col2', 'col3']
How can I add the three and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.)
Dataframe with result I want:
col1 col2 col3 result
1 2 3 6
TL;DR:
You can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
Explanation:
The df.na.fill(0) portion handles nulls in your data. If you don't have any nulls, you can skip it and do this instead:
df.withColumn("result", reduce(add, [col(x) for x in df.columns]))
If you have a static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to generate the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use reduce with the add operator:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so you effectively get (col("col1") + col("col2")) + col("col3") rather than col("col1") + col("col2") + col("col3"), but the effect is the same.
The col(x) ensures that you are adding Column objects rather than doing a simple string concatenation (which would generate col1col2col3).
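As a plain-Python illustration of how reduce folds the list two at a time (nothing PySpark-specific here):
from functools import reduce
from operator import add

# reduce folds left to right: ((1 + 2) + 3) == 6
print(reduce(add, [1, 2, 3]))  # 6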
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns will be the list of columns from df.
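A sketch of how this looks end to end, assuming the column names from the question and a local Spark session:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['col1', 'col2', 'col3'])
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.show()
# +----+----+----+------+
# |col1|col2|col3|result|
# +----+----+----+------+
# |   1|   2|   3|     6|
# +----+----+----+------+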
Add multiple columns from a list into one column
I tried a lot of methods, and the following are my observations:
PySpark's sum function doesn't support column addition (PySpark version 2.3.1).
Python's built-in sum function works for some folks but gives an error for others.
So the addition of multiple columns can be achieved using the expr function in PySpark, which takes an expression to be computed as input:
from pyspark.sql.functions import expr
cols_list = ['a', 'b', 'c']
# Creating an addition expression using `join`
expression = '+'.join(cols_list)
df = df.withColumn('sum_cols', expr(expression))
This gives us the desired sum of columns. We can also use any other, more complex expression to get different output.
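For example, a sketch of handling nulls inside the same expr approach by wrapping each column in coalesce (coalesce is standard Spark SQL; the column names are the ones assumed above):
expression = ' + '.join('coalesce({}, 0)'.format(c) for c in cols_list)
df = df.withColumn('sum_cols', expr(expression))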

How to express dataframe operations using symbols?

Suppose I have a sympy expression; it seems to me I can only substitute symbols with numbers. The question is: can I substitute them with something else, like a pandas Series? For example,
from sympy import Symbol, Function
a_sym = Symbol('a')
b_sym = Symbol('b')
sum_func_sym = Function('sum_func')
expression = sum_func_sym(a_sym+b_sym)
is there a way for me to substitute a_sym and b_sym with pandas series and replace the sum_func_sym with series sum and then calculate the result?
import pandas as pd
df = pd.DataFrame({'a': [1,2], 'b': [3,4]})
a = df.a
b = df.b
def sum_func(series):
    return series.sum()
When I do the substitution and replacement I get an error:
expression.subs(a_sym, a).subs(b_sym, b).replace(sum_func_sym, sum_func)
AttributeError: 'Add' object has no attribute 'sum'
Building upon this answer, I came up with the following implementation that seems to work for at least fairly simple use cases:
import pandas as pd
from sympy import Symbol, symbols, lambdify
from sympy.parsing.sympy_parser import parse_expr

df = pd.DataFrame({'a': range(5), 'b': range(5)})
my_vars = symbols('a b')  # have to have the same names as the DataFrame columns
expr = parse_expr('a + sqrt(b) + 1')
# Create a callable version of the expression
callable_obj = lambdify(my_vars, expr)
# Call the object, passing in the DataFrame columns as parameters,
# and write the result in a new column of the dataframe
df['result'] = callable_obj(**{
    str(a): df[str(a)]        # pass column as variable with the same name
    for a in expr.atoms()     # for all atomic expressions...
    if isinstance(a, Symbol)  # that are Symbols (not constants)
})
The output is (as expected):
0 1.000000
1 3.000000
2 4.414214
3 5.732051
4 7.000000
dtype: float64
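If you also want to keep the custom sum_func from the question, lambdify accepts a modules mapping so the undefined sympy Function can be bound to a Python callable. A sketch under that assumption, reusing the names from the question:
import pandas as pd
from sympy import Symbol, Function, lambdify

a_sym, b_sym = Symbol('a'), Symbol('b')
sum_func_sym = Function('sum_func')
expression = sum_func_sym(a_sym + b_sym)

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Map the undefined sympy Function onto a concrete Python implementation.
f = lambdify((a_sym, b_sym), expression, modules=[{'sum_func': lambda s: s.sum()}])
result = f(df['a'], df['b'])  # sums the element-wise a+b Series -> 10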
I assume that you have a dataframe with many columns and you want to add two of them. However, the names of the columns to be added are variables, unknown beforehand. Here is a solution for this case. f-strings work for Python 3.6+; for other versions, please modify appropriately.
def sum(a, b):  # note: this shadows the built-in sum
    global df
    df[f'sum_of_{a}_and_{b}'] = df[a] + df[b]
    # For a more general function than sum:
    # df[f'sum_of_{a}_and_{b}'] = df.apply(lambda x: f(x[a], x[b]), axis=1)
    # where f is the function instead of the sum
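A usage sketch of the function above, assuming a df with columns 'a' and 'b':
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
sum('a', 'b')   # the function defined above; adds column 'sum_of_a_and_b' = [4, 6]
print(df)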

Apply Pandas series string function to the whole dataframe

I want to apply the method pd.Series.str.join() to my whole dataframe
A B
[foo,bar] [1,2]
[bar,foo] [3,4]
Desired output:
A B
foobar 12
barfoo 34
For now I used a quite slow method:
a = [df[x].str.join('') for x in df.columns]
I tried
df.apply(pd.Series.str.join)
and
df.agg(pd.Series.str.join)
and
df.applymap(str.join)
but none of them seem to work. As an extension of the question: how can I efficiently apply a Series method to the whole dataframe?
Thank you.
There will always be a problem when trying to join on lists that contain numeric values, which is why I suggest we first turn them into strings. Afterwards, we can solve it with a nested list comprehension:
df = pd.DataFrame({'A':[['Foo','Bar'],['Bar','Foo']],'B':[[1,2],[3,4]]})
df['B'] = df['B'].map(lambda x: [str(i) for i in x])
df_new = pd.DataFrame([[''.join(x) for x in df[i]] for i in df],index=df.columns).T
Which correctly outputs:
A B
FooBar 12
BarFoo 34
import pandas as pd

df = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']], 'B': [[1, 2], [3, 4]]})
# If 'B' is a list of integers; otherwise the step below can be skipped
df['B'] = df['B'].transform(lambda value: [str(x) for x in value])
df = df.applymap(lambda value: ''.join(value))
Explanation: applymap() applies a given function to each value of your dataframe.
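Side note: on recent pandas (2.1+) applymap is deprecated in favour of DataFrame.map; a sketch of the equivalent there, assuming that pandas version:
df = df.map(lambda value: ''.join(value))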
I came up with this solution:
df_sum = df_sum.stack().str.join('').unstack()
I have quite a big dataframe, so a for loop is not really scalable.
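A sketch of that approach end to end, assuming the numeric column has already been converted to strings as in the answers above:
import pandas as pd

df = pd.DataFrame({'A': [['foo', 'bar'], ['bar', 'foo']],
                   'B': [['1', '2'], ['3', '4']]})
# stack gives a Series of lists, .str.join works on it regardless of column,
# and unstack restores the original shape.
result = df.stack().str.join('').unstack()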

Pandas DataFrame.apply on object dtype: create new column without affecting used columns

I'd like to create a new column B by applying a function to each row of column A (which is of data type object and filled with list data) in dataframe DF, without changing the values of column A.
def f(i):
    if type(i) is list:
        for j in range(0, len(i)):
            i[j] += 1
    else:
        i += 1
    return i
df = pd.DataFrame([1, 1], columns=['A'])
df['A'] = df['A'].astype(object)
df.at[0, 'A'] = [1, 2]
df.at[1, 'A'] = [1, 2]
df['B'] = df['A'].apply(lambda x: f(x))
Unfortunately, the following happens: df['B'] = function(df['A']), but also df['A'] = function(df['A']) - column A gets modified as well.
Please note: each value of df['A'] is a list; the dtype is object (O).
To be clear: I want column A to remain as original. Can anyone tell me how to achieve this?
You want to use apply on column A:
df['B'] = df['A'].apply(function)
This applies the function to each value in A.
Essentially you are using the apply method of the Series object; more info:
pandas.Series.apply
df2 = df.copy()
df['B'] = df2.apply(lambda row: function(row['A']), axis=1)
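Because the values in column A are lists, even a copied frame can still reference the same list objects, so the function will keep mutating A unless it builds new lists. A minimal sketch of that variant (f_no_mutate is my own name, not from the answers above):
def f_no_mutate(i):
    # Return a new list instead of modifying the original one in place.
    if isinstance(i, list):
        return [j + 1 for j in i]
    return i + 1

df['B'] = df['A'].apply(f_no_mutate)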

Python Pandas add new column which will have multiple columns values along with column names

I am currently in the process of automating a SQL script using a csv file and the pandas module, where the WHERE condition is based on the values present in my csv file.
A sample csv file would be as below.
First Last
X A
Y B
Z C
I want a new dataframe which should look like this (with the new column added):
First Last condition
X A First='X' and Last='A'
Y B First='Y' and Last='B'
Z C First='Z' and Last='C'
so I can use the third column in my SQL WHERE condition.
Note:
I can achieve this with the method below, but I cannot use it because my column names are not static - I will be using this on multiple csv's/dataframes which will have different column names, and the number of columns might be more than 2.
df['condition'] = 'First=\'' + df['First'] + '\' and ' + 'Last=\'' + df['Last'] + '\''
If I resolve the 'condition' column then my final SQL should look like this:
Select First, Last from mydb.customers
where
(First='X' and Last='A') or
(First='Y' and Last='B') or
(First='Z' and Last='C')
Thanks
You can use apply with axis=1 to execute a function on every row - the function gets all the information about the data in the row: column names and values.
import pandas as pd

df = pd.DataFrame({
    'First': ['X', 'Y', 'Z'],
    'Second': ['1', '2', '3'],
    'Last': ['A', 'B', 'C'],
})
print(df)

def concatenate(row):
    parts = []
    for name, value in row.items():
        parts.append("{}='{}'".format(name, value))
    return ' and '.join(parts)

df['condition'] = df.apply(concatenate, axis=1)
print(df['condition'])
Data:
(Because I used a dictionary, which doesn't have to keep order, I get Second as the last element ;) )
First Last Second
0 X A 1
1 Y B 2
2 Z C 3
Result:
0 First='X' and Last='A' and Second='1'
1 First='Y' and Last='B' and Second='2'
2 First='Z' and Last='C' and Second='3'
Name: condition, dtype: object
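To go from the condition column to the final WHERE clause asked for in the question, one more join is needed - a sketch, not part of the original answer:
where_clause = ' or\n'.join('({})'.format(c) for c in df['condition'])
sql = 'Select First, Last from mydb.customers\nwhere\n' + where_clause
print(sql)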
You can create a function that accomplishes what you are attempting. This takes any string series (such as yours) and creates the pattern you want with the series name.
Avoiding explicitly naming the columns is the hard part.
from functools import reduce  # needed for Python 3; it is a builtin in Python 2

def series_to_str(s):
    n = s.name
    return n + "='" + s + "'"

df['condition'] = reduce(lambda x, y: x + ' and ' + y,
                         map(series_to_str, (df[col] for col in df)))
