I would like to calculate a sum of variables for a given day. Each day contains a different calculation, but all the days use the variables consistently.
There is a df which specifies my variables and a df which specifies how calculations will change depending on the day.
How can I create a new column containing answers from these different equations?
import pandas as pd
import numpy as np
conversion = [["a",5],["b",1],["c",10]]
conversion_table = pd.DataFrame(conversion,columns=['Variable','Cost'])
data1 = [[1,"3a+b"],[2,"c"],[3,"2c"]]
to_solve = pd.DataFrame(data1,columns=['Day','Q1'])
desired = [[1,16],[2,10],[3,20]]
desired_table=pd.DataFrame(desired,columns=['Day','Q1 solved'])
I have separated my variables and equations based on row. Can I loop through these equations to find non-numerics and re-assign them?
# separate out equations and values
for var in conversion_table["Variable"]:
    cost = conversion_table.loc[conversion_table['Variable'] == var, 'Cost'].mean()
for row in to_solve["Q1"]:
    equation = row
A simple suggestion: perhaps you need to rewrite part of your code. Not sure if you want something like this:
a = 5
b = 1
c = 10
# Rewrite the equation that is readable by Python
# e.g. replace 3a+b by 3*a+b
data1 = [[1,"3*a+b"],
[2,"c"],
[3,"2*c"]]
desired_table = pd.DataFrame(data1,
columns=['Day','Q1'])
desired_table['Q1 solved'] = desired_table['Q1'].apply(lambda x: eval(x))
desired_table
Output:
   Day     Q1  Q1 solved
0    1  3*a+b         16
1    2      c         10
2    3    2*c         20
If the equations can be rewritten with an explicit *, then you could do this.
First, build the mapping from the conversion table:
mapping = dict(zip(conversion_table['Variable'], conversion_table['Cost']))
Then replace each variable with its numeric value from the mapping and eval the resulting expression:
desired_table['Q1 solved'] = to_solve['Q1'].map(lambda x: eval(''.join([str(mapping[i]) if i.isalpha() else str(i) for i in x])))
0 16
1 10
2 20
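If the equations cannot be rewritten by hand, the implicit multiplication could be inserted programmatically before evaluating. This is only a minimal sketch: it assumes single-letter variable names and that the equations contain nothing but digits, letters, and arithmetic operators. The helper name `solve` is hypothetical and not part of the original answers.

```python
import re
import pandas as pd

conversion_table = pd.DataFrame([["a", 5], ["b", 1], ["c", 10]],
                                columns=["Variable", "Cost"])
to_solve = pd.DataFrame([[1, "3a+b"], [2, "c"], [3, "2c"]],
                        columns=["Day", "Q1"])

# Variable name -> numeric value
mapping = dict(zip(conversion_table["Variable"], conversion_table["Cost"]))

def solve(equation, values):
    # Hypothetical helper: insert "*" between a digit and a following letter,
    # e.g. "3a+b" becomes "3*a+b"
    expr = re.sub(r"(\d)([A-Za-z])", r"\1*\2", equation)
    # Evaluate with builtins disabled so only the mapped variables are visible
    return eval(expr, {"__builtins__": {}}, dict(values))

to_solve["Q1 solved"] = to_solve["Q1"].map(lambda eq: solve(eq, mapping))
print(to_solve)
```

Disabling builtins narrows what the expression can reach, but eval should still only be used on equations you trust.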
I'm having difficulties counting the number of elements in a list within a DataFrame's column. My problem comes from the fact that, after importing my input csv file, the rows that are supposed to contain an empty list [] are actually parsed as lists containing the empty string [""]. Here's a reproducible example to make things clearer:
import pandas as pd
df = pd.DataFrame({"ID": [1, 2, 3], "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})
print(df)
   ID          NETWORK
0   1               []
1   2  [OPE, GSR, REP]
2   3            [MER]
Even though one might think that the list for the row where ID = 1 is empty, it's not. It actually contains the empty string [""] which took me a long time to figure out.
So whatever standard method I use to count the elements in each list, I get a wrong value of 1 for the lists that are supposed to be empty:
df["COUNT"] = df["NETWORK"].str.len()
print(df)
   ID          NETWORK  COUNT
0   1               []      1
1   2  [OPE, GSR, REP]      3
2   3            [MER]      1
I searched and tried a lot of things before posting here but I couldn't find a solution to what seems to be a very simple problem. I should also note that I'm looking for a solution that doesn't require me to modify my original input file nor modify the way I'm importing it.
You just need to write a custom apply function that ignores the ''
df['COUNT'] = df['NETWORK'].apply(lambda x: sum(1 for w in x if w!=''))
Another way:
df['NETWORK'].apply(lambda x: len([y for y in x if y]))
Using apply is probably more straightforward. Alternatively, explode, filter, then group by count.
_s = df['NETWORK'].explode()
_s = _s[_s != '']
df['count'] = _s.groupby(level=0).count()
This yields:
            NETWORK  count
ID
1                []    NaN
2   [OPE, GSR, REP]    3.0
3             [MER]    1.0
Fill NA with zeroes if needed.
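The fill step can be folded into the same pipeline. The sketch below reuses the example frame from the question and assumes an integer count column is wanted; the intermediate names `exploded`, `non_empty`, and `counts` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3],
                   "NETWORK": [[""], ["OPE", "GSR", "REP"], ["MER"]]})

# One row per list element, keeping the original row label as the index
exploded = df["NETWORK"].explode()
# Drop the empty strings
non_empty = exploded[exploded != ""]
# Count the surviving elements per original row
counts = non_empty.groupby(level=0).count()
# Rows whose lists held only "" are missing from counts; reindex restores them as 0
df["COUNT"] = counts.reindex(df.index, fill_value=0).astype(int)
print(df)
```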
df["COUNT"] = df["NETWORK"].apply(lambda x: len(x))
Use a lambda function on each row and return the length of the list. Note that this still counts the empty string, so a row containing [""] gets a count of 1.
I'm new to Python and stackoverflow, so please forgive the bad edit on this question.
I have a df with 11 columns and 3 108 730 rows.
Columns 1 and 2 represent the X and Y (mathematical) coordinates, respectively and the other columns represent different frequencies in Hz.
The df looks like this:
[Image: df before adjustment]
I want to plot this df in ArcGIS, but for that I need to replace the (mathematical) coordinates that currently exist with the real-life geographical coordinates.
The trick is that I was only given the first geographical coordinate which is x=1055000 and y=6315000.
The other rows in columns 1 and 2 should be replaced by adding 5 to the previous row value so for example, for the x coordinates it should be 1055000, 1055005, 1055010, 1055015, .... and so on.
I have written two for loops that replace the values accordingly, but they take far too long to run because of the size of the df: after several hours I still have no result. I used the row number as the range, like this:
for i in range(0,3108729):
    if i == 0:
        df.at[i,'IDX'] = 1055000
    else:
        df.at[i,'IDX'] = df.at[i-1,'IDX'] + 5
df.head()
and like this for the y coordinates:
for j in range(0,3108729):
    if j == 0:
        df.at[j,'IDY'] = 6315000
    else:
        df.at[j,'IDY'] = df.at[j-1,'IDY'] + 5
df.head()
I have run the loops as a test with range(0,5) and it works but I'm sure there is a way to replace the coordinates in a more time-efficient manner without having to define a range? I appreciate any help !!
You can just build a range series in one go, no need to iterate:
df.loc[:, 'IDX'] = 1055000 + pd.Series(range(len(df))) * 5
df.loc[:, 'IDY'] = 6315000 + pd.Series(range(len(df))) * 5
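A variant of the same idea with a NumPy array instead of a Series is sketched below. The five-row frame is a made-up stand-in for the real data, and `step` and `offsets` are illustrative names.

```python
import numpy as np
import pandas as pd

# Small stand-in frame; the real one has roughly 3.1 million rows
df = pd.DataFrame({"IDX": [0.0] * 5, "IDY": [0.0] * 5})

step = 5
# 0, 5, 10, ... one offset per row
offsets = np.arange(len(df)) * step

df["IDX"] = 1055000 + offsets
df["IDY"] = 6315000 + offsets
print(df)
```

A plain array is assigned by position, so this also works when the frame's index is not the default 0..n-1. A pd.Series built from range(len(df)) is aligned on index labels instead, which can leave NaN in rows whose labels do not match.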
I'm trying to further understand the difference between Map, Apply, and Vectorization, and just encountered a challenge I don't understand: for small numbers, these three functions achieve the same outcome, but for large numbers Vectorization appears to fail. Here's what I mean:
# get a simple dataframe set up
import numpy as np
import pandas as pd
x = range(10)
y = range(10,20)
df = pd.DataFrame(data = zip(x,y), columns = ['x','y'])
# define a simple function to test map, apply, and vectorization with
def simple_power(num1, num2):
return num1 ** num2
# use Map, Apply, and Vectorization to apply the function to every row in the dataframe
df['map power'] = list(map(simple_power, *(df['x'], df['y'])))
df['apply power'] = df.apply(lambda row: simple_power(row['x'], row['y']), axis=1)
df['vectorized power'] = simple_power(df['x'], df['y'])
Everything works:
in: df.head()
out:
   x   y  map power  apply power  vectorized power
0  0  10          0            0                 0
1  1  11          1            1                 1
2  2  12       4096         4096              4096
3  3  13    1594323      1594323           1594323
4  4  14  268435456    268435456         268435456
Here's where things get confusing: if I replace my x and y with larger ranges, map and apply still work, but vectorization fails:
# set up dataframe with larger numbers to multiply together
x = range(100)
y = range(100,200)
df = pd.DataFrame(data = zip(x,y), columns = ['x','y'])
Then if I re-run map, apply, and vectorization, I get a wonky output for vectorization:
in: df.head()
out:
Map and Apply are consistent with each other, but Vectorization gives nonsense results.
Can anyone tell me what's going on? Thank you!
https://github.com/numpy/numpy/issues/8987 and https://github.com/numpy/numpy/issues/10964 describe the problem you are running into.
When you use ** on whole columns in your function, you are implicitly using numpy.power. When the fixed-width integer overflows, no error is raised and you just get a wrong value.
This is a known bug and should be getting fixed.
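The overflow can be reproduced on a tiny frame. The sketch below assumes int64 columns (what pandas creates from Python ints) and shows two possible workarounds, neither of which comes from the linked issues: object dtype, which keeps exact Python integers, and float64, which trades exactness for range. The names `wrapped`, `exact`, and `approx` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"x": [2, 10], "y": [70, 30]})

# int64 arithmetic wraps silently once a result exceeds 2**63 - 1
wrapped = df["x"] ** df["y"]

# Python ints have arbitrary precision, so object dtype gives exact results
x_obj = pd.Series(df["x"].tolist(), dtype=object)
y_obj = pd.Series(df["y"].tolist(), dtype=object)
exact = x_obj ** y_obj

# float64 does not wrap, but it only keeps about 15-17 significant digits
approx = df["x"].astype("float64") ** df["y"]

print(wrapped.tolist())
print(exact.tolist())
print(approx.tolist())
```

This is also why map and apply gave sensible numbers in the question: they can end up operating on individual Python ints rather than on a fixed-width integer array.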
New to Python and Pandas, so please bear with me here.
I have created a dataframe with 10 rows and a column called 'Distance', and I want to calculate a new column (TotalCost) with apply and a lambda function that I have created. Here is the function:
def TotalCost(Distance, m, c):
    return m * df.Distance + c
where Distance is the column in the dataframe df, while m and c are just constants that I declare earlier in the main code.
I then try to apply it in the following manner:
df = df.apply(lambda row: TotalCost(row['Distance'], m, c), axis=1)
but when running this, I keep getting a dataframe as an output, instead of a single row.
EDIT: Adding in an example of input and desired output,
Input: df = {Distance: '1','2','3'}
if we assume m and c equal 10,
then the output of applying the function should be
df['TotalCost'] = 20,30,40
I will post the error below this, but what am I missing here? As far as I understand, my syntax is correct. Any assistance would be greatly appreciated :)
The error message:
ValueError: Wrong number of items passed 10, placement implies 1
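Your lambda in apply should process only one row, so the function must use its Distance argument rather than df.Distance. Also, apply returns only the calculated column, not the whole dataframe, so assign the result to a new column: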
def TotalCost(Distance,m,c): return m * Distance + c
df['TotalCost'] = df.apply(lambda row: TotalCost(row['Distance'],m,c),axis=1)
apply passes one row at a time to your lambda function.
It then returns a new object built from the values the lambda returned, instead of altering the original dataframe.
Have a look at this link; it should help you gain more insight:
https://thispointer.com/pandas-apply-apply-a-function-to-each-row-column-in-dataframe/
import numpy as np
import pandas as pd
def star(x,m,c):
    return x*m+c
vals=[(1,2,4),
(3,4,5),
(5,6,6) ]
df=pd.DataFrame(vals,columns=('one','two','three'))
res=df.apply(star,axis=0,args=[2,3])
Initial DataFrame
one two three
0 1 2 4
1 3 4 5
2 5 6 6
After applying the function you should get this stored in res
one two three
0 5 7 11
1 9 11 13
2 13 15 15
This is a more memory-efficient and cleaner way:
df.eval('total_cost = @m * Distance + @c', inplace=True)
Update: I also sometimes stick to assign,
df = df.assign(total_cost=lambda x: TotalCost(x['Distance'], m, c))
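Since the formula is plain arithmetic, the corrected function can also be called on the whole column without apply. This is a minimal sketch using the example input from the question, with m and c both assumed to be 10 as stated there.

```python
import pandas as pd

df = pd.DataFrame({"Distance": [1, 2, 3]})
m, c = 10, 10

def TotalCost(Distance, m, c):
    # Uses the argument, not df.Distance, so it works for a scalar or a column
    return m * Distance + c

# Passing the whole column computes every row at once
df["TotalCost"] = TotalCost(df["Distance"], m, c)
print(df)
```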
I have a huge dataframe, and I want to use several columns to apply a custom function, and put the result in a new column. But I have met a problem.
The following is my function to calculate the distance between two rows.
def calcDist(p, q):
    diff = p - q
    square_diff = diff ** 2
    sum_square_diff = square_diff.sum()
    return sum_square_diff ** 0.5
One of the parameters of the function is constant (a series of 0s and 1s); the second parameter is the data from the selected columns of one dataframe row (also something like a series of 0s and 1s).
I have tried the following codes.
cols = ['a','b','c']
new = [0,1,1]
df.columns = ['aa','a','b','c','dd','ee']
df['dist'] = df.loc[:,cols].apply(lambda x: calcDist(x, new))
But I get NaN in the 'dist' column.
I've already tried a for loop to solve this problem, but it runs too slowly.
house_chosen['dist'] = 0
for i in range(len(house_chosen)):
    cols_chosen = house_chosen.loc[:, addition_list]
    series_chosen = cols_chosen.iloc[i, :]
    house_chosen.iloc[i, 42] = calcDist(new_house_addition, series_chosen)
So is there any way to solve the problem with apply function?
thx
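One likely cause of the NaN values is that apply works column by column unless axis=1 is given, so the result is indexed by column name and does not line up with the row index when assigned. The sketch below assumes that is the issue. The three-row frame and its values are invented for illustration, and `dist_vec` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

def calcDist(p, q):
    diff = p - q
    return (diff ** 2).sum() ** 0.5

# Made-up data standing in for the real frame
df = pd.DataFrame({"a": [0, 1, 0], "b": [1, 1, 0], "c": [1, 0, 0]})
cols = ["a", "b", "c"]
new = np.array([0, 1, 1])

# axis=1 passes each row (not each column) to the function
df["dist"] = df[cols].apply(lambda row: calcDist(row, new), axis=1)

# Same Euclidean distance computed on the whole array at once
df["dist_vec"] = np.sqrt(((df[cols].to_numpy() - new) ** 2).sum(axis=1))
print(df)
```

On a huge dataframe the array version is probably the better fit, because apply with axis=1 still calls the Python function once per row.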