I have a dataframe like this:
lis = [['a','b','c'],
['17','10','6'],
['5','30','x'],
['78','50','2'],
['4','58','x']]
df = pd.DataFrame(lis[1:],columns=lis[0])
How can I write a function that says, if 'x' is in column [c], then overwrite that value with the corresponding one in column [b]. The result would be this:
[['a','b','c'],
['17','10','6'],
['5','30','30'],
['78','50','2'],
['4','58','58']]
By using np.where:
import numpy as np
df.c=np.where(df.c=='x',df.b,df.c)
df
Out[569]:
a b c
0 17 10 6
1 5 30 30
2 78 50 2
3 4 58 58
This should do the trick
import numpy as np
df.c = np.where(df.c == 'x',df.b, df.c)
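As an alternative to np.where, the same overwrite can be sketched with boolean indexing through .loc. This is only an illustration built on the question's sample data; the variable name is_x is my own and does not come from the original posts.

```python
import pandas as pd

lis = [['a', 'b', 'c'],
       ['17', '10', '6'],
       ['5', '30', 'x'],
       ['78', '50', '2'],
       ['4', '58', 'x']]
df = pd.DataFrame(lis[1:], columns=lis[0])

# Boolean mask marking the rows where column c holds the placeholder 'x'
# (is_x is a hypothetical name chosen for this sketch)
is_x = df['c'] == 'x'

# Overwrite only those rows of c with the value from column b in the same row
df.loc[is_x, 'c'] = df.loc[is_x, 'b']

print(df)
```

Unlike np.where, this modifies only the selected rows instead of rebuilding the whole column.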
I am not into pandas, but if you want to change lis itself you could do it like so:
>>> [x if x[2] != "x" else [x[0], x[1], x[1]] for x in lis]
[['a','b','c'],
['17','10','6'],
['5','30','30'],
['78','50','2'],
['4','58','58']]
I am learning Pandas and I am moving my Python code to Pandas. I want to compare every value with the values that follow it, using a function: the first with the second and so on, then the second with the third, but not with the first because that pair was already compared. In Python I use two nested loops over a list:
def match_values(a, b):
    pass  # do some stuff...
l = ['a', 'b', 'c']
length = len(l)
for i in range(1, length):
    for j in range(i, length):  # starts from i, not from the start!
        if match_values(l[i], l[j]):
            pass  # do some stuff...
How do I do a similar technique in Pandas when my list is a column in a dataframe? Do I simply reference every value like before or is there a clever "vector-style" way to do this fast and efficient?
Thanks in advance,
Jo
Can you please check this? For each row, it produces a list with the result of comparing that row's value against every value that follows it.
>>> import pandas as pd
>>> import numpy as np
>>> val = [16,19,15,19,15]
>>> df = pd.DataFrame({'val': val})
>>> df
val
0 16
1 19
2 15
3 19
4 15
>>>
>>>
>>> df['match'] = df.apply(lambda x: [ (1 if (x['val'] == df.loc[idx, 'val']) else 0) for idx in range(x.name+1, len(df)) ], axis=1)
>>> df
val match
0 16 [0, 0, 0, 0]
1 19 [0, 1, 0]
2 15 [0, 1]
3 19 [0]
4 15 []
Yes, pandas supports vectorised comparison because it is built on NumPy:
df['columnname'] > 5
This results in a Boolean Series. If you also want to return the matching part of the dataframe:
df[df['columnname'] > 5]
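For the pairwise case in the question, one possible vectorised sketch uses NumPy broadcasting to compare every value with every later value at once. It assumes the match test is plain equality (the question's match_values is not specified), and it builds an n-by-n matrix, so it only suits columns small enough for that to fit in memory. The names equal, upper and pairs are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': [16, 19, 15, 19, 15]})
vals = df['val'].to_numpy()

# Compare every value with every other value in one broadcasted operation;
# equal[i, j] is True when vals[i] == vals[j]
equal = vals[:, None] == vals[None, :]

# Keep only positions with j > i, so each pair is checked once
# and no value is compared with itself
upper = np.triu(equal, k=1)

# Row positions (i, j) of the matching pairs
pairs = [(int(i), int(j)) for i, j in np.argwhere(upper)]
print(pairs)
```

Here the matching pairs are rows 1 and 3 (both 19) and rows 2 and 4 (both 15), which agrees with the match lists in the apply-based answer.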
I have the following dataframe:
import pandas as pd
import numpy as np
np.random.seed(123)
n = 10
df = pd.DataFrame({"val": np.random.randint(1, 10, n),
"cat": np.random.choice(["X", "Y", "Z"], n)})
val cat
0 3 Z
1 3 X
2 7 Y
3 2 Z
4 4 Y
5 7 X
6 2 X
7 1 X
8 2 X
9 1 Y
I want to know the percentage each category X, Y, and Z has of the entire val column sum. I can aggregate df like this:
total_sum = df.val.sum()
#32
s = df.groupby("cat").val.sum().div(total_sum)*100
#this is the desired result in % of total val
cat
X 46.875 #15/32
Y 37.500 #12/32
Z 15.625 #5/32
Name: val, dtype: float64
However, I find it rather surprising that pandas seemingly does not have a percentage/frequency function, something like df.groupby("cat").val.freq(), to use instead of df.groupby("cat").val.sum() or df.groupby("cat").val.mean(). I assumed this is a common operation, and Series.value_counts implements it with normalize=True, but for groupby aggregation I cannot find anything similar. Am I missing something here, or is there indeed no out-of-the-box function?
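Whether or not a dedicated method exists, the normalisation can at least be kept inside one method chain with Series.pipe. The sketch below is only one workaround: pct_of_total is a hypothetical helper written for this example, not a pandas function, and the dataframe is typed in from the printed values above rather than regenerated from the random seed.

```python
import pandas as pd

# Values copied from the printed dataframe in the question
df = pd.DataFrame({"val": [3, 3, 7, 2, 4, 7, 2, 1, 2, 1],
                   "cat": ["Z", "X", "Y", "Z", "Y", "X", "X", "X", "X", "Y"]})


def pct_of_total(s):
    """Hypothetical helper: express each entry of s as a percentage of s.sum()."""
    return s / s.sum() * 100


# Sum val per category, then divide each group sum by the sum of all groups
shares = df.groupby("cat")["val"].sum().pipe(pct_of_total)
print(shares)
```

Because the group sums add up to the column total, dividing by their own sum gives the same percentages as dividing by df.val.sum().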
Environment:
Python 3.6.4
pandas 0.23.4
My code is below.
from math import sqrt
import pandas as pd
df = pd.DataFrame({'x':[1,2,3], 'y':[4,5,6]})
df = df.assign(d = lambda z: sqrt(z.x**2 + z.y**2))
The last line raises a TypeError, as shown below.
...
TypeError: cannot convert the series to <class 'float'>
Without sqrt, it works.
df = df.assign(d2 = lambda z: z.x**2 + z.y**2)
df
Out[6]:
x y d2
0 1 4 17
1 2 5 29
2 3 6 45
And apply also works.
df['d3'] = df.apply(lambda z: sqrt(z.x**2 + z.y**2), axis=1)
df
Out[8]:
x y d2 d3
0 1 4 17 4.123106
1 2 5 29 5.385165
2 3 6 45 6.708204
What's the matter with the first?
Use numpy.sqrt (after import numpy as np) - it also works with 1d arrays, whereas sqrt from math works only with scalars:
df = df.assign(d = lambda z: np.sqrt(z.x**2 + z.y**2))
Another solution is to use **(1/2):
df = df.assign(d = lambda z: (z.x**2 + z.y**2)**(1/2))
print (df)
x y d
0 1 4 4.123106
1 2 5 5.385165
2 3 6 6.708204
Your apply solution works because with axis=1 the function receives one row at a time, so z.x and z.y are scalars. But as @jpp mentioned, apply should not be preferred, as it involves a Python-level row-wise loop.
df.apply(lambda z: print(z.x), axis=1)
1
2
3
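Since the expression is a Euclidean norm of two columns, another vectorised option is numpy.hypot, which computes sqrt(x**2 + y**2) element-wise. This is a small sketch on the question's data, offered as an alternative rather than as what the answers above used.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6]})

# np.hypot works on whole columns at once, so no row-wise loop is needed
df = df.assign(d=lambda z: np.hypot(z.x, z.y))
print(df)
```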
A pandas Series is like a NumPy array: you cannot pass it to a math module function,
which expects a single number and not a series.
The default arithmetic operators are valid, but functions that do not work on arrays/series are not.
What you can do is:
df = df.assign(d = lambda z: (z.x**2 + z.y**2)**0.5)
or
df['d'] = (df.x**2 + df.y**2)**0.5
which uses only pandas' standard operations.
I need to apply a function to a subset of columns in a dataframe. Consider the following toy example:
pdf = pd.DataFrame({'a' : [1, 2, 3], 'b' : [2, 3, 4], 'c' : [5, 6, 7]})
arb_cols = ['a', 'b']
What I want to do is this:
[pdf[c] = pdf[c].apply(lambda x : 99 if x == 2 else x) for c in arb_cols]
But this is bad syntax. Is it possible to accomplish such a task without a for loop?
With mask
pdf.mask(pdf.loc[:,arb_cols]==2,99).assign(c=pdf.c)
Out[1190]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Or with assign
pdf.assign(**pdf.loc[:,arb_cols].mask(pdf.loc[:,arb_cols]==2,99))
Out[1193]:
a b c
0 1 99 5
1 99 3 6
2 3 4 7
Do not use pd.Series.apply when you can use vectorised functions.
For example, the below should be efficient for larger dataframes even though there is an outer loop:
for col in arb_cols:
pdf.loc[pdf[col] == 2, col] = 99
Another option is to use pd.DataFrame.replace:
pdf[arb_cols] = pdf[arb_cols].replace(2, 99)
Yet another option is to use numpy.where:
import numpy as np
pdf[arb_cols] = np.where(pdf[arb_cols] == 2, 99, pdf[arb_cols])
For this case it would probably be better to use applymap if you need to apply a custom function (note that from pandas 2.1 onwards applymap is deprecated in favour of DataFrame.map):
pdf[arb_cols] = pdf[arb_cols].applymap(lambda x : 99 if x == 2 else x)
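The mask idea from the earlier answer can also be restricted to the chosen columns and written straight back, which avoids having to restore column c afterwards. A minimal sketch on the toy data from the question:

```python
import pandas as pd

pdf = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [5, 6, 7]})
arb_cols = ['a', 'b']

# Select only the chosen columns, replace the 2s there with 99,
# and write the result back; column c is never touched
pdf[arb_cols] = pdf[arb_cols].mask(pdf[arb_cols] == 2, 99)
print(pdf)
```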
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop: use ge, which means "greater than or equal to", and cast the resulting boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
It wouldn't work, for three reasons. Firstly, you're using Data, not data. Secondly, even with that fixed, you'd be comparing a scalar against a whole column, so the if statement would raise a ValueError because the truth value of the resulting boolean Series is ambiguous. Thirdly, you're assigning to the entire column, so each iteration overwrites every row.
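The scalar-versus-column problem can be reproduced in isolation. The sketch below rebuilds the sample data inline instead of reading XYZ.csv, and the names comparison and raised exist only for this illustration.

```python
import pandas as pd

data = pd.DataFrame({'ID': [1, 2, 3], 'X': [10, 20, 21], 'Y': [3, 23, 34]})

# Comparing one scalar with a whole column gives one True/False per row
comparison = 10 >= data['Y']
print(comparison.tolist())

# Using that Series as an `if` condition asks for a single truth value,
# which pandas refuses to guess
try:
    if comparison:
        pass
    raised = False
except ValueError as err:
    raised = True
    print(err)
```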
You need to access the index label, which your loop didn't. You can use items (called iteritems in pandas versions before 2.0) to do this:
In [125]:
for idx, x in df["X"].items():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
Firstly, you capitalized your dataframe name as 'Data' instead of 'data'. Fixing that alone is not enough, though, because x >= data["Y"] still compares a single value with the whole Y column.
For efficient code, EdChum has a great answer above. Another vectorised method, which reads much like the if/else in your loop and is easy to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
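To check that the two vectorised suggestions agree, the sketch below runs both on the sample rows from the question. The data is typed in directly instead of being read from XYZ.csv, and z_ge and z_where are names invented for this comparison.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': [1, 2, 3], 'X': [10, 20, 21], 'Y': [3, 23, 34]})

# Approach 1: element-wise >= via ge, then cast the booleans to integers
z_ge = data['X'].ge(data['Y']).astype(int)

# Approach 2: np.where picks 1 where the condition holds and 0 elsewhere
z_where = np.where(data['X'] >= data['Y'], 1, 0)

data['Z'] = z_ge
print(data)
```

Both give 1 for the first row only, since 10 >= 3 while 20 < 23 and 21 < 34.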