Suppose I have the following dataframe:
df = pd.DataFrame({'X':['AB_123_CD','EF_123CD','XY_Z'],'Y':[1,2,3]})
X Y
0 AB_123_CD 1
1 EF_123CD 2
2 XY_Z 3
I want to use the strip method to get rid of the first prefix, so that I get
X Y
0 123_CD 1
1 123CD 2
2 Z 3
I tried df.X.str.split('_').str[-1].str.strip(), but since the positions of the _'s differ between rows, it returns a different result from the one desired above. How can I address this issue?
You're close: you can split once from the left (n=1) and keep the second part (str[1]):
df.X = df.X.str.split("_", n=1).str[1]
to get
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
Try this instead:
df["X"] = df["X"].apply(lambda x: x[x.find("_")+1:])
>>> df
X Y
0 123_CD 1
1 123CD 2
2 Z 3
This keeps the entire string after the first occurrence of _.
The following code could do the job:
df['X'] = df.X.apply(lambda x: '_'.join(x.split('_')[1:]))
Your solution is very close. With some minor changes, it should work:
df.X.str.split('_').str[1:].str.join('_')
0 123_CD
1 123CD
2 Z
Name: X, dtype: object
You can set maxsplit in Python's str.split() function. It sounds like you just want to split with maxsplit=1 and take the last element:
df['X'] = df['X'].apply(lambda x: x.split('_',1)[-1])
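All of these answers produce the same result on the sample frame; a minimal runnable sketch of the split-once variant (column names and values taken from the question):

```python
import pandas as pd

df = pd.DataFrame({'X': ['AB_123_CD', 'EF_123CD', 'XY_Z'], 'Y': [1, 2, 3]})

# Split on the first underscore only and keep everything after it.
df['X'] = df['X'].str.split('_', n=1).str[1]

print(df['X'].tolist())  # ['123_CD', '123CD', 'Z']
```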
Related
I have a column with land dimensions in Pandas. It looks like this:
df.LotSizeDimensions.value_counts(dropna=False)
40.00X150.00 2
57.00X130.00 2
27.00X117.00 2
63.00X135.00 2
37.00X108.00 2
65.00X134.00 2
57.00X116.00 2
33x124x67x31x20x118 1
55.00X160.00 1
63.00X126.00 1
36.00X105.50 1
In rows where there is only one X, I would like to create a separate column holding the product of the two values. In rows where there is more than one X, I would like to return a zero. This is the code I came up with:
def dimensions_split(df: pd.DataFrame):
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip()
    df.LotSizeDimensions = df.LotSizeDimensions.str.upper()
    df.LotSizeDimensions = df.LotSizeDimensions.str.strip('`"M')
    if df.LotSizeDimensions.count('X') > 1:
        return 0
    df['LotSize'] = map(int(df.LotSizeDimensions.str.split("X", 1).str[0]) * int(df.LotSizeDimensions.str.split("X", 1).str[1]))
This is coming back with the following error:
TypeError: cannot convert the series to <class 'int'>
I would also like to add a line so that, if there are any non-numeric characters other than X, it returns a zero.
The idea is to first strip and uppercase the LotSizeDimensions column, then use Series.str.split with expand=True to get a DataFrame, and then multiply the two columns where there is exactly one X, returning 0 otherwise:
s = df.LotSizeDimensions.str.strip('`"M ').str.upper()
df1 = s.str.split('X', expand=True).astype(float)
# for general data (coerce non-numeric parts to NaN):
# df1 = s.str.split('X', expand=True).apply(lambda x: pd.to_numeric(x, errors='coerce'))
df['LotSize'] = np.where(s.str.count('X').eq(1), df1[0] * df1[1], 0)
print (df)
LotSizeDimensions LotSize
0 40.00X150.00 6000.0
1 57.00X130.00 7410.0
2 27.00X117.00 3159.0
3 37.00X108.00 3996.0
4 63.00X135.00 8505.0
5 65.00X134.00 8710.0
6 57.00X116.00 6612.0
7 33x124x67x31x20x118 0.0
8 55.00X160.00 8800.0
9 63.00X126.00 7938.0
10 36.00X105.50 3798.0
I get this using a list comprehension:
import pandas as pd

df = pd.DataFrame(['40.00X150.00', '57.00X130.00', '27.00X117.00',
                   '37.00X108.00', '63.00X135.00', '65.00X134.00',
                   '57.00X116.00', '33x124x67x31x20x118',
                   '55.00X160.00', '63.00X126.00', '36.00X105.50'])

df[1] = [float(str_data.strip().split("X")[0]) * float(str_data.strip().split("X")[1])
         if len(str_data.strip().split("X")) == 2 else None
         for str_data in df[0]]
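The non-numeric requirement from the question can be folded into the split-and-multiply approach with pd.to_numeric(errors='coerce'); a sketch, assuming any row that does not parse into exactly two numbers should yield 0 (the third sample value here is hypothetical, added to exercise the non-numeric case):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'LotSizeDimensions': ['40.00X150.00',
                                         '33x124x67x31x20x118',
                                         '36.00M']})

s = df['LotSizeDimensions'].str.strip('`"M ').str.upper()
# Split on X; non-numeric parts become NaN instead of raising.
parts = s.str.split('X', expand=True).apply(lambda c: pd.to_numeric(c, errors='coerce'))

# A row is valid only if it has exactly one X and both parts are numeric.
ok = s.str.count('X').eq(1) & parts[0].notna() & parts[1].notna()
df['LotSize'] = np.where(ok, parts[0] * parts[1], 0)
```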
I have a df called df like so. The tag_positions column contains either a string or a list, but I want them all to be strings. How can I do this? I also want to remove the whitespace at the ends.
input
id tag_positions
1 center
2 right
3 ['left']
4 ['center ']
5 [' left']
6 ['right']
7 left
expected output
id tag_positions
1 center
2 right
3 left
4 center
5 left
6 right
7 left
You can explode and then strip:
df.tag_positions = df.tag_positions.explode().str.strip()
to get
id tag_positions
0 1 center
1 2 right
2 3 left
3 4 center
4 5 left
5 6 right
6 7 left
You can join:
df['tag_positions'].map(''.join)
Or:
df['tag_positions'].str.join('')
Try the str accessor chained with np.where:
df['tag_positions'] = np.where(df['tag_positions'].map(lambda x: type(x).__name__) == 'list',
                               df['tag_positions'].str[0],
                               df['tag_positions'])
Also, my favorite, explode:
df = df.explode('tag_positions')
You can convert with the apply method like this:
df.tag_positions = df.tag_positions.apply(lambda x : ''.join(x) if type(x) == list else x)
If all the lists have length 1, you can also do this:
df.tag_positions = df.tag_positions.apply(lambda x : x[0] if type(x) == list else x)
You can use apply and check whether an item is an instance of a list; if yes, take the first element. Then you can use str.strip to strip off the unwanted spaces:
df['tag_positions'].apply(lambda x: x[0] if isinstance(x, list) else x).str.strip()
OUTPUT
Out[42]:
0 center
1 right
2 left
3 center
4 left
5 right
6 left
Name: 0, dtype: object
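Putting the pieces together, a self-contained sketch of the isinstance-plus-strip approach (data reconstructed from the question's table):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7],
                   'tag_positions': ['center', 'right', ['left'], ['center '],
                                     [' left'], ['right'], 'left']})

# Unwrap single-element lists, then strip surrounding whitespace.
df['tag_positions'] = (df['tag_positions']
                       .apply(lambda x: x[0] if isinstance(x, list) else x)
                       .str.strip())

print(df['tag_positions'].tolist())
# ['center', 'right', 'left', 'center', 'left', 'right', 'left']
```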
Consider the following data set stored in a pandas DataFrame dfX:
A B
1 2
4 6
7 9
I have a function that is:
def someThingSpecial(x, y):
    # z = do something special with x, y
    return z
I now want to create a new column in dfX that holds the computed z value.
Looking at other SO examples, I've tried several variants including:
dfX['C'] = dfX.apply(lambda x: someThingSpecial(x=x['A'], y=x['B']), axis=1)
This returns errors. What is the right way to do this?
This seems to work for me on v0.21. Take a look -
df
A B
0 1 2
1 4 6
2 7 9
def someThingSpecial(x, y):
    return x + y
df.apply(lambda x: someThingSpecial(x.A, x.B), 1)
0 3
1 10
2 16
dtype: int64
You might want to try upgrading your pandas version to the latest stable release (0.21 as of now).
Here's another option. You can vectorise your function.
v = np.vectorize(someThingSpecial)
v now accepts arrays, but operates on each pair of elements individually. Note that this just hides the loop, as apply does, but is much cleaner. Now, you can compute C as so -
df['C'] = v(df.A, df.B)
If your function only needs one column's value, then do this instead of coldspeed's answer:
dfX['A'].apply(your_func)
to store it:
dfX['C'] = dfX['A'].apply(your_func)
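End to end, the axis=1 pattern from the question looks like this; a sketch with someThingSpecial stubbed out as addition purely for illustration:

```python
import pandas as pd

dfX = pd.DataFrame({'A': [1, 4, 7], 'B': [2, 6, 9]})

def someThingSpecial(x, y):
    # Stand-in for the real computation.
    return x + y

# axis=1 passes each row to the lambda as a Series.
dfX['C'] = dfX.apply(lambda row: someThingSpecial(row['A'], row['B']), axis=1)

print(dfX['C'].tolist())  # [3, 10, 16]
```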
I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
    if x >= data["Y"]:
        Data["Z"] = 1
    else:
        Data["Z"] = 0
You can do this without a loop by using ge, which means greater than or equal to, and casting the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
it wouldn't work. Firstly, you're using Data instead of data. Even with that fixed, you'd be comparing a scalar against an array, which raises an error because such a comparison is ambiguous. Thirdly, you're assigning to the entire column on each iteration, overwriting it every time.
You need to access the index label, which your loop didn't do; you can use iteritems for this:
In [125]:
for idx, x in df["X"].iteritems():
    if x >= df['Y'].loc[idx]:
        df.loc[idx, 'Z'] = 1
    else:
        df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary, as there is a vectorised method here.
Firstly, your code is just fine. You simply capitalized your dataframe name as 'Data' instead of making it 'data'.
However, for efficient code, EdChum has a great answer above. Here is another method, similar in efficiency but with easier-to-remember code:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)
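As a quick sanity check, both vectorised forms give the same column; a sketch where the frame is built inline instead of read from XYZ.csv:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'ID': [1, 2, 3], 'X': [10, 20, 21], 'Y': [3, 23, 34]})

z_ge = data['X'].ge(data['Y']).astype(int)        # boolean comparison cast to int
z_where = np.where(data['X'] >= data['Y'], 1, 0)  # explicit 1/0 selection

data['Z'] = z_ge
print(data['Z'].tolist())  # [1, 0, 0]
```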
Dataframe:
one two
a 1 x
b 1 y
c 2 y
d 2 z
e 3 z
grp = DataFrame.groupby('one')
grp.agg(lambda x: ???) #or equivalent function
Desired output from grp.agg:
one two
1 x|y
2 y|z
3 z
My agg function before integrating dataframes was "|".join(sorted(set(x))). Ideally I want to have any number of columns in the group and agg returns the "|".join(sorted(set()) for each column item like two above. I also tried np.char.join().
Love Pandas and it has taken me from a 800 line complicated program to a 400 line walk in the park that zooms. Thank you :)
You were so close:
In [1]: df.groupby('one').agg(lambda x: "|".join(x.tolist()))
Out[1]:
two
one
1 x|y
2 y|z
3 z
Expanded answer to handle sorting and take only the set:
In [1]: df = DataFrame({'one':[1,1,2,2,3], 'two':list('xyyzz'), 'three':list('eecba')}, index=list('abcde'), columns=['one','two','three'])
In [2]: df
Out[2]:
one two three
a 1 x e
b 1 y e
c 2 y c
d 2 z b
e 3 z a
In [3]: df.groupby('one').agg(lambda x: "|".join(x.sort_values().unique().tolist()))
Out[3]:
two three
one
1 x|y e
2 y|z b|c
3 z a
Just an elaboration on the accepted answer:
df.groupby('one').agg(lambda x: "|".join(x.tolist()))
Note that df.groupby('one') is a DataFrameGroupBy object; agg applies the given function to each column, so the x in the above lambda is a Series.
Another note: defining the agg function as a lambda is not necessary. If the aggregation function is complex, it can be defined separately as a regular function, as below. The only constraint is that x should be a Series (or compatible with it):
def myfun1(x):
    return "|".join(x.tolist())
and then:
df.groupby('one').agg(myfun1)
There is a better way to concatenate strings, described in the pandas documentation (Series.str.cat), so I prefer this way:
In [1]: df.groupby('one').agg(lambda x: x.str.cat(sep='|'))
Out[1]:
two
one
1 x|y
2 y|z
3 z
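Combining str.cat with the sorted-set behaviour from the expanded answer, a sketch using the non-deprecated sort_values/drop_duplicates in place of the old order method:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 1, 2, 2, 3], 'two': list('xyyzz')},
                  index=list('abcde'))

# Per group: sort, drop duplicates, then concatenate with '|'.
out = df.groupby('one')['two'].agg(
    lambda x: x.sort_values().drop_duplicates().str.cat(sep='|'))

print(out.tolist())  # ['x|y', 'y|z', 'z']
```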