Iterate over two dataframes' columns and str.encode in utf8 - python

I'm currently running Python 2.7 and have two dataframes, x and y. I would like to use some sort of list comprehension to iterate over the columns of both and call str.encode('UTF8') on each column to get rid of unicode.
The following works perfectly fine and is easily readable, but I wanted to try something faster and more efficient.
for col in y:
    if y[col].dtype == 'O':
        y[col] = y[col].str.encode("utf-8")
for col in x:
    if x[col].dtype == 'O':
        x[col] = x[col].str.encode("utf-8")
Other methods I have tried:
1.) [y[col].str.encode("utf-8") for col in y if y[col].dtype=='O' ]
2.) y.columns= [( y[col].str.encode("utf-8") if y[col].dtype=='O' else y[col]) for col in y ]
3.) y.apply(lambda x : (y[col].str.encode("utf-8") for col in y if y[col].dtype=='O'))
I am getting ValueErrors and length mismatch errors for 2.) and 3.).

You can use select_dtypes to get object columns, then call apply over each column to encode it:
u = df.select_dtypes(include=[object])
df[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
Write a small function to do this and call it for each dataframe.
def encode_df(df):
    u = df.select_dtypes(include=[object])
    df[u.columns] = u.apply(lambda x: x.str.encode('utf-8'))
    return df
x, y = encode_df(x), encode_df(y)
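As a quick check of the select_dtypes approach, here is a minimal sketch on a made-up two-column frame. It assumes Python 3, where str.encode yields bytes objects, and it forces the text column to object dtype explicitly because newer pandas versions may infer a dedicated string dtype instead. The name encode_object_columns and the sample data are illustrative and not part of the answer above; unlike encode_df, this variant works on a copy, so the caller's frame is left untouched.

```python
import pandas as pd


def encode_object_columns(df):
    """Return a copy of df with every object-dtype column UTF-8 encoded."""
    out = df.copy()
    obj_cols = out.select_dtypes(include=[object]).columns
    for col in obj_cols:
        # .str.encode works element-wise on the string values of the column
        out[col] = out[col].str.encode("utf-8")
    return out


# Illustrative data: one text column (forced to object dtype) and one numeric column
sample = pd.DataFrame({
    "name": pd.Series(["abc", "caf\u00e9"], dtype=object),
    "n": [1, 2],
})
encoded = encode_object_columns(sample)
print(encoded["name"].tolist())
```

Returning a copy is a design choice here; mutating in place, as encode_df does, avoids the extra memory at the cost of changing the caller's data.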

Use this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a':[1,2,3,4], 'b':[11,12,13,14]})
def f(x):
    return x**2
pd.DataFrame([[f(i) for i in tuple(v)] for k,v in df.iterrows()], columns=df.columns)
Out[54]:
a b
0 1 121
1 4 144
2 9 169
3 16 196

Related

Apply a function row by row using other dataframes' rows as list inputs in python

I'm trying to apply a function row by row which takes 5 inputs, 3 of which are lists. I want these lists to come from each row of 3 corresponding dataframes.
I've tried using 'apply' and 'lambda' as follows:
sol['tf_dd']=sol.apply(lambda tsol, rfsol, rbsol:
                       taurho_difdif(xy=xy,
                                     l=l,
                                     t=tsol,
                                     rf=rfsol,
                                     rb=rbsol),
                       axis=1)
However I get the error <lambda>() missing 2 required positional arguments: 'rfsol' and 'rbsol'
The DataFrame sol and the DataFrames tsol, rfsol and rbsol all have the same length. For each row, I want the entire row from tsol, rfsol and rbsol to be input as three lists.
Here is much simplified example (first with single lists, which I then want to replicate row by row with dataframes):
The output with single lists is a single value (120). With dataframes as inputs I want an output dataframe of length 10 where all values are 120.
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
def simple_func(t, rf, rb):
    x=sum(t)
    y=sum(rf)
    z=sum(rb)
    return x+y+z
out=simple_func(t,rf,rb)
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=['output'])
out2['output'] = out2.apply(lambda tsol, rfsol, rbsol:
                            simple_func(t=tsol.tolist(),
                                        rf=rfsol.tolist(),
                                        rb=rbsol.tolist()),
                            axis=1)
Try using the "name" attribute of the row Series to get its index value, and then fetch the row at the same position from each of the other DataFrames:
import pandas as pd
import numpy as np
def positional_sum(row, df1, df2, df3):
    """
    Take the index of the input row and gather the row at the same position from each of the other DataFrames
    """
    position = row.name
    x = df1.iloc[position].sum()
    y = df2.iloc[position].sum()
    z = df3.iloc[position].sum()
    return x + y + z
# dataframe rows as lists
tsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rfsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
rbsol = pd.DataFrame(np.random.randn(10, 5), columns=range(5))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = out2.apply(lambda x: positional_sum(x, tsol, rfsol, rbsol), axis=1)
out2
Hope this helps!
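When you run df.apply() with axis=1, it does not pass the columns to the function as individual arguments; it passes each row as a single Series object. The lambda must therefore take one argument and read the values it needs from that row. Assuming the DataFrame has columns named tsol, rfsol and rbsol, the correct way to do this would be
out2['output'] = out2.apply(lambda row:
                            simple_func(t=row["tsol"],
                                        rf=row["rfsol"],
                                        rb=row["rbsol"]),
                            axis=1)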
You can eliminate the simple function using this:
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
Here is the complete code:
import pandas as pd
t=[1,2,3,4,5]
rf=[6,7,8,9,10]
rb=[11,12,13,14,15]
# dataframe rows as lists
tsol=pd.DataFrame((t,t,t,t,t,t,t,t,t,t))
rfsol=pd.DataFrame((rf,rf,rf,rf,rf,rf,rf,rf,rf,rf))
rbsol=pd.DataFrame((rb,rb,rb,rb,rb,rb,rb,rb,rb,rb))
out2 = pd.DataFrame(index=range(len(tsol)), columns=["output"])
out2["output"] = tsol.sum(axis=1) + rfsol.sum(axis=1) + rbsol.sum(axis=1)
print(out2)
OUTPUT:
output
0 120
1 120
2 120
3 120
4 120
5 120
6 120
7 120
8 120
9 120
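The column sums above work because simple_func only adds things up. For a function that cannot be vectorized and really needs three lists per row, one possible sketch is to zip the rows of the three frames together, which avoids apply entirely. It assumes the three frames have the same number of rows in the same order; the data and simple_func are taken from the question.

```python
import pandas as pd


def simple_func(t, rf, rb):
    return sum(t) + sum(rf) + sum(rb)


t = [1, 2, 3, 4, 5]
rf = [6, 7, 8, 9, 10]
rb = [11, 12, 13, 14, 15]

# Ten identical rows per frame, as in the question
tsol = pd.DataFrame([t] * 10)
rfsol = pd.DataFrame([rf] * 10)
rbsol = pd.DataFrame([rb] * 10)

# Each element of `rows` is a triple of plain Python lists, one list per frame
rows = zip(tsol.values.tolist(), rfsol.values.tolist(), rbsol.values.tolist())
out2 = pd.DataFrame(
    {"output": [simple_func(a, b, c) for a, b, c in rows]},
    index=tsol.index,
)
print(out2)
```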

how to apply fsolve over pandas dataframe columns?

I'm trying to solve a system of equations, and I would like to apply fsolve over a pandas dataframe.
How can I do that?
This is my code:
import numpy as np
import pandas as pd
import scipy.optimize as opt
a = np.linspace(300,400,30)
b = np.random.randint(700,18000,30)
c = np.random.uniform(1.4,4.0,30)
df = pd.DataFrame({'A':a, 'B':b, 'C':c})
def func(zGuess,*Params):
    x,y,z = zGuess
    a,b,c = Params
    eq_1 = ((3.47-np.log10(y))**2+(np.log10(c)+1.22)**2)**0.5
    eq_2 = (a/101.32) * (101.32/b)** z
    eq_3 = 0.381 * x + 0.05 * (b/101.32) -0.15
    return eq_1,eq_2,eq_3
zGuess = np.array([2.6,20.2,0.92])
df['result']= df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])))
But it is still not working, and I can't see the problem.
The error KeyError: 'A' basically means that pandas can't find the label 'A'.
That happens because apply does not operate on rows by default; it passes each column to the function.
Passing 1 as the axis argument at the end makes it iterate over the rows, so the column references 'A', 'B', ... can be resolved:
df['result']= df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])),1)
That, however, might not give the desired result, as it stores the whole output (an array) in a single column.
To get one column per value, assign to the three columns you want to create and unpack the results with zip(*...):
df['output_a'],df['output_b'],df['output_c'] = zip(*df.apply(lambda x: opt.fsolve(func,zGuess,args=(x['A'],x['B'],x['C'])),1) )
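An alternative to the zip(*...) unpacking is apply's result_type="expand" option, which turns a list-like return value into separate columns. The sketch below uses a hypothetical stand_in_solver in place of opt.fsolve so that it runs without SciPy; the column names and sample values are likewise made up for illustration.

```python
import numpy as np
import pandas as pd


def stand_in_solver(a, b):
    # Hypothetical placeholder for a call such as opt.fsolve(...):
    # like fsolve on a 3-variable system, it returns an array of three values.
    return np.array([a + b, a * b, a - b])


df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]})

# axis=1 passes one row at a time; "expand" spreads each returned array over columns
expanded = df.apply(
    lambda row: stand_in_solver(row["A"], row["B"]),
    axis=1,
    result_type="expand",
)
expanded.columns = ["output_a", "output_b", "output_c"]
df = df.join(expanded)
print(df)
```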

pandas Dataframe create new column

I have this snippet of code working with a pandas dataframe. I am trying to use the apply function to create a new column called STDEV_TV, but I keep running into the error below. All the columns I am working with are of type float.
TypeError: ("'float' object is not iterable", 'occurred at index 0')
Can someone help me understand why I keep getting this error?
def sigma(df):
    val = df.volume2Sum / df.volumeSum - df.vwap * df.vwap
    return math.sqrt(max(val))
df['STDEV_TV'] = df.apply(sigma, axis=1)
Try:
import pandas as pd
import numpy as np
import math
df = pd.DataFrame(np.random.randint(1, 10, (5, 3)),
                  columns=['volume2Sum', 'volumeSum', 'vwap'])
def sigma(df):
    val = df.volume2Sum / df.volumeSum - df.vwap * df.vwap
    return math.sqrt(val) if val >= 0 else val
df['STDEV_TV'] = df.apply(sigma, axis=1)
Output:
>>> df
volume2Sum volumeSum vwap STDEV_TV
0 4 5 8 -63.200000
1 2 8 4 -15.750000
2 3 3 3 -8.000000
3 8 3 4 -13.333333
4 4 2 3 -7.000000
You need to apply sigma to each set of values not the whole DataFrame.
I would use a lambda function, eg:
def sigma(volume2Sum, volumeSum, vwap):
    val = volume2Sum / volumeSum - vwap * vwap
    return math.sqrt(val)
df['STDEV_TV'] = df.apply(lambda x: sigma(x.volume2Sum, x.volumeSum, x.vwap), axis=1)
That should put the result for each row into the STDEV_TV column, and you can find the max value separately.
Take care not to take the square root of a negative number.
Your function sigma returns a single number, because its first step takes the maximum:
max(val)
and that is only one number.
You then try to apply the function to each row of the data instead of to the whole set.
Use this as the last line of your code instead:
df['STDEV_TV'] = sigma(df)
That will work.
Change
return math.sqrt(max(val))
to
return math.sqrt(max(val)) if isinstance(val, pd.Series) else (math.sqrt(val) if val >= 0 else val)
max() iterates over an iterable and finds the maximum value. The problem here is that, since you're applying sigma to every row, the local variable val is a float, not a list, so what you have is similar to max(1.3).
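Since the formula only uses column arithmetic, it can also be computed for all rows at once without apply. The following is a minimal sketch with made-up numbers; it makes one assumption that differs from the answers above, namely that rows with a negative value should become NaN rather than raise an error or keep the negative number.

```python
import numpy as np
import pandas as pd

# Illustrative values chosen so that one row is valid and one is negative
df = pd.DataFrame({
    "volume2Sum": [50.0, 8.0],
    "volumeSum": [2.0, 2.0],
    "vwap": [3.0, 4.0],
})

# Column-wise arithmetic: one value per row, no Python-level loop
val = df["volume2Sum"] / df["volumeSum"] - df["vwap"] ** 2

# Mask negative values to NaN before the square root (an assumption, see above)
df["STDEV_TV"] = np.sqrt(val.where(val >= 0))
print(df)
```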

manipulate a column of dataframe with conditions

I want to turn the suffix of each string into a prefix in a column of a dataframe, which is created with the following code, for example.
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
a b
1 100000.ss 10
2 200000.zz 18
I tried the one-line code below, but the result shows that the if/else statement doesn't work. Why?
df['a'] = df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") else 'zz.'+x[:6])
a b
1 ss.100000 10
2 ss.200000 18
Each x of your lambda function is a string. x.find returns -1 if the substring is not found, and -1 is truthy (only 0 is falsy). Therefore, your lambda always returns 'ss.' + .... Try changing your lambda to this:
df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") != -1 else 'zz.'+x[:6])
Out[4]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Anyway, you don't need apply for this issue. Just use pandas str accessor
df['a'].str[-2:] + '.' + df['a'].str[:-3]
Out[10]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
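The truthiness point is easy to verify in plain Python. The sketch below prints what find returns for both sample strings and shows a membership test with in as a less error-prone check; swap_suffix is a hypothetical helper name, and it assumes every string contains a dot.

```python
def swap_suffix(s):
    """Move the part after the last '.' to the front: '100000.ss' -> 'ss.100000'."""
    stem, _, suffix = s.rpartition(".")
    return suffix + "." + stem


for s in ["100000.ss", "200000.zz"]:
    pos = s.find("ss")
    # bool(pos) is True in both cases: 7 and -1 are both non-zero
    print(s, pos, bool(pos), "ss" in s, swap_suffix(s))
```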
Why do the hard work when there is a library that does it for you?
import pandas as pd
from pathlib import Path
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df.assign(
    a=lambda x: x["a"].apply(lambda s: f"{Path(s).suffix[1:]}.{Path(s).stem}")
)
output
a b
ss.100000 10
zz.200000 18
There might be ways to do this in fewer lines, but here is one solution:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df[['First','Last']] = df.a.str.split(".",expand=True)
df['a']=df['Last']+'.'+df['First']
df.drop(['First','Last'],axis=1)

Create dataframe in a loop

I would like to create dataframes in a loop and afterwards use these dataframes in a loop. I tried the eval() function but it didn't work.
For example :
for i in range(5):
    df_i = df[(df.age == i)]
Here I would like to create df_0, df_1, etc., and then concatenate these new dataframes after some calculations:
final_df = pd.concat(df_0,df_1)
for i in range(2:5):
    final_df = pd.concat(final_df, df_i)
You can create a dict of DataFrames, x, with i as the dict keys:
import numpy as np
import pandas as pd
np.random.seed(42)
df = pd.DataFrame({'age': np.random.randint(0, 5, 20)})
x = {}
for i in range(5):
    x[i] = df[df['age']==i]
final = pd.concat(x.values())
Then you can refer to individual DataFrames as:
x[1]
Output:
age
5 1
13 1
15 1
And concatenate all of them with:
pd.concat(x.values())
Output:
age
18 0
5 1
13 1
15 1
2 2
6 2
...
This way is weird and not recommended, but it can be done.
Answer
# exec creates the variables df_0 ... df_4; this only works at module level
for i in range(5):
    exec(f"df_{i} = df[df['age']=={i}]")
def UDF(dfi):
    # do something in user-defined function
    return dfi
for i in range(5):
    exec(f"df_{i} = UDF(df_{i})")
final_df = pd.concat([df_0, df_1])
for i in range(2, 5):
    final_df = pd.concat([final_df, eval(f"df_{i}")])
Better Way 1
Using a list or a dict to store the dataframes should be a better way, since you can access each dataframe by an index or a key.
Since another answer shows the way using a dict (#perl), I will show you the way using a list.
def UDF(dfi):
    # do something in user-defined function
    return dfi
dfs = [df[df['age']==i] for i in range(5)]
final_df = pd.concat(map(UDF, dfs))
Better Way 2
Since you are using pandas.DataFrame, the groupby function is the 'pandas' way to do what you want. (Maybe, I guess, because I don't know what you want to do. LOL)
def UDF(dfi):
    # do something in user-defined function
    return dfi
final_df = df.groupby('age').apply(UDF)
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
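To make the groupby route concrete, here is a minimal sketch that iterates over the groups, keeps the processed pieces in a dict keyed by age, and concatenates them at the end. The sample data and the add_group_mean calculation are hypothetical stand-ins for whatever the per-group step actually is.

```python
import pandas as pd

# Illustrative data; 'score' is a made-up column
df = pd.DataFrame({
    "age": [0, 1, 2, 1, 0, 3],
    "score": [10, 20, 30, 40, 50, 60],
})


def add_group_mean(group):
    # Hypothetical per-group calculation: attach the group's mean score
    group = group.copy()
    group["group_mean"] = group["score"].mean()
    return group


# Iterating over a groupby yields (key, sub-DataFrame) pairs
pieces = {age: add_group_mean(group) for age, group in df.groupby("age")}
final_df = pd.concat(list(pieces.values()))
print(final_df)
```

Keeping the pieces in a dict means each one stays addressable (pieces[1], for instance) without creating numbered variables.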
