Create a new column based on calculations that change between rows? - python

I would like to calculate a sum of variables for a given day. Each day contains a different calculation, but all the days use the variables consistently.
There is a df which specifies my variables and a df which specifies how calculations will change depending on the day.
How can I create a new column containing answers from these different equations?
import pandas as pd
import numpy as np
conversion = [["a",5],["b",1],["c",10]]
conversion_table = pd.DataFrame(conversion,columns=['Variable','Cost'])
data1 = [[1,"3a+b"],[2,"c"],[3,"2c"]]
to_solve = pd.DataFrame(data1,columns=['Day','Q1'])
desired = [[1,16],[2,10],[3,20]]
desired_table=pd.DataFrame(desired,columns=['Day','Q1 solved'])
I have separated my variables and equations based on row. Can I loop though these equations to find non-numerics and re-assign them?
#separate out equations and values
for var in conversion_table["Variable"]:
cost=(conversion_table.loc[conversion_table['Variable'] == var, 'Cost']).mean()
for row in to_solve["Q1"]:
equation=row

A simple suggestion, perhaps you need to rewrite a part of your code. Not sure if your want something like this:
a = 5
b = 1
c = 10
# Rewrite the equation that is readable by Python
# e.g. replace 3a+b by 3*a+b
data1 = [[1,"3*a+b"],
[2,"c"],
[3,"2*c"]]
desired_table = pd.DataFrame(data1,
columns=['Day','Q1'])
desired_table['Q1 solved'] = desired_table['Q1'].apply(lambda x: eval(x))
desired_table
Output:
Day Q1 Q1 solved
0 1 3*a+b 16
1 2 c 10
2 3 2*c 20

If it's possible to have the equations changed to equations with * then you could do this.
Get the mapping from the
mapping = dict(zip(conversion_table['Variable'], conversion_table['Cost'])
the eval the function and replace variables with numeric from the mapping
desired_table['Q1 solved'] = to_solve['Q1'].map(lambda x: eval(''.join([str(mapping[i]) if i.isalpha() else str(i) for i in x])))
0 16
1 10
2 20

Related

Count the number of elements in a list where the list contains the empty string

I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
ID NETWORK
0 1 []
1 2 [OPE, GSR, REP]
2 3 [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.
So whatever standard method I try to use to calculate the number of elements within each list I get a wrong value of 1 for those who are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
ID NETWORK COUNT
0 1 [] 1
1 2 [OPE, GSR, REP] 3
2 3 [MER] 1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the ''
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
NETWORK count
ID
1 [] NaN
2 [OPE, GSR, REP] 3.0
3 [MER] 1.0
Fill NA with zeroes if needed.
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and in the lambda function return the length of the array

How to replace two entire columns in a df by adding 5 to the previous value?

I'm new to Python and stackoverflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
df before adjustment
I want to plot this df in ArcGIS but for that I need to replace the (mathematical) coordinates that currently exist by the real life geograhical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly but my problem is that it takes much too long to run because of the size of the df and I haven't yet got a result after some hours because I used the row number as the range like this:
for i in range(0,3108729):
if i == 0:
df.at[i,'IDX'] = 1055000
else:
df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
if j == 0:
df.at[j,'IDY'] = 6315000
else:
df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range? I appreciate any help !!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5

Why does Vectorization fail with larger numbers but Map and Apply work?

I'm trying to further understand the difference between Map, Apply, and Vectorization, and just encountered a challenge I don't understand: for small numbers, these three functions achieve the same outcome, but for large numbers Vectorization appears to fail. Here's what I mean:
# get a simple dataframe set up
import numpy as np
import pandas as pd
x = range(10)
y = range(10,20)
df = pd.DataFrame(data = zip(x,y), columns = ['x','y'])
# define a simple function to test map, apply, and vectorization with
def simple_power(num1, num2):
return num1 ** num2
# use Map, Apply, and Vectorization to apply the function to every row in the dataframe
df['map power'] = list(map(simple_power, *(df['x'], df['y'])))
df['apply power'] = df.apply(lambda row: simple_power(row['x'], row['y']), axis=1)
df['optimize power'] = simple_power(df['x'], df['y'])
Everything works:
in: df.head()
out: x y map power apply power vectorized power
0 0 10 0 0 0
1 1 11 1 1 1
2 2 12 4096 4096 4096
3 3 13 1594323 1594323 1594323
4 4 14 268435456 268435456 268435456
Here's where things get confusing: if I replace my x and y with larger ranges, map and apply still work, but vectorization fails:
# set up dataframe with larger numbers to multiply together
x = range(100)
y = range(100,200)
df = pd.DataFrame(data = zip(x,y), columns = ['x','y'])
Then if I re-run map, apply, and vectorization, I get a wonky output for vectorization:
in: df.head()
out:
Map and Apply are consistent with each other, but Vectorization gives a nonsense results.
Can anyone tell me what's going on? Thank you!
https://github.com/numpy/numpy/issues/8987 and https://github.com/numpy/numpy/issues/10964 are where your problem is foundering.
When using ** in your function you are implicitly using numpy.power When you overflow the integer you don't see the error.
This is a known bug and should be getting fixed.

Pandas Apply/Lambda returning dataframe and not single row

New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows, with a column called 'Distance' and I want to calculate a new column (TotalCost) with apply and a lambda funtion that I have created. Snippet below of the function
def TotalCost(Distance, m, c):
return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding in an example of input and desired output,
Input: df = {Distance: '1','2','3'}
if we assume m and c equal 10,
then the output of applying the function should be
df['TotalCost'] = 20,30,40
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1
Your lambda in apply should process only one row. BTW, apply return only calculated columns, not whole dataframe
def TotalCost(Distance,m,c): return m * Distance + c
df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'],m,c),axis=1)
Your apply function will basically pass one row at a time to your lambda function and then returns a copy of your data frame with the edited or changed values
Finally it returns a modified copy of dataframe constructed with rows returned by lambda functions, instead of altering the original dataframe.
have a look at this link it should help you gain more insight
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd
def star(x,m,c):
return x*m+c
vals=[(1,2,4),
(3,4,5),
(5,6,6) ]
df=pd.DataFrame(vals,columns=('one','two','three'))
res=df.apply(star,axis=0,args=[2,3])
Initial DataFrame
one two three
0 1 2 4
1 3 4 5
2 5 6 6
After applying the function you should get this stored in res
one two three
0 5 7 11
1 9 11 13
2 13 15 15
This is a more memory-efficient and cleaner way:
df.eval('total_cost = #m * Distance + #c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))

dataframe apply custom function

I have a huge dataframe, and I want to use several columns to apply a custom function, and put the result in a new column. But I have met a problem.
The following is my function to calculate the distance between two rows.
def calcDist(p, q):
diff = p - q
square_diff = diff ** 2
sum_square_diff = square_diff.sum()
return sum_square_diff ** 0.5
One of the parameters in the function is constant(a series with 0 and 1), the second parameter of the function is the data in the dataframe which in the selected columns(somthing like a series with 0 and 1).
I have tried the following codes.
cols = ['a','b','c']
new = [0,1,1]
df.columns = ['aa','a','b','c','dd','ee']
df['dist'] = df.loc[:,cols].apply(lamda x: calcdist(x, new))
But I get NaN in the 'dist' column.
I 've already tried for loop to solve this problem. But it works to slow.
house_chosen['dist'] = 0
for i in range(len(house_chosen)):
cols_chosen = house_chosen.loc[:, addition_list]
series_chosen = cols_chosen.iloc[i, :]
house_chosen.iloc[i, 42] = calcDist(new_house_addition, series_chosen)
So is there any way to solve the problem with apply function?
thx

Categories

Resources