Split Columns in pandas with str.split and keep values - python

So I am stuck with a problem here:
I have a pandas dataframe which looks like the following:
ID Name Value
0 Peter 21,2
1 Frank 24
2 Tom 23,21/23,60
3 Ismael 21,2/ 21,54
4 Joe 23,1
and so on...
What I am trying to do is split the "Value" column on the forward slash (/) but keep all the values that do not contain a slash.
Like here:
ID Name Value
0 Peter 21,2
1 Frank 24
2 Tom 23,21
3 Ismael 21,2
4 Joe 23,1
How can I achieve this? I tried the str.split method but it's not giving me the solution I want. Instead, it returns NaN as can be seen in the following.
My Code: df['Value']=df['value'].str.split('/', expand=True)[0]
Returns:
ID Name Value
0 Peter NaN
1 Frank NaN
2 Tom 23,21
3 Ismael 21,2
4 Joe NaN
All I need is the first value, i.e. everything before the '/'.
Appreciate any kind of help!

Remove expand=True so that split returns lists, then add str[0] to select the first value of each list:
df['Value'] = df['Value'].str.split('/').str[0]
print (df)
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
If performance is important, use a list comprehension:
df['Value'] = [x.split('/')[0] for x in df['Value']]
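A minimal, self-contained sketch of both approaches on the sample data from the question. The `mixed` series is a hypothetical illustration, not the asker's actual data: when an object column mixes strings with numbers, the `.str` accessor returns NaN for the non-string entries, which is one possible explanation for the NaN values reported above.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Peter", "Frank", "Tom", "Ismael", "Joe"],
    "Value": ["21,2", "24", "23,21/23,60", "21,2/ 21,54", "23,1"],
})

# Vectorised: split on "/" and take the first piece of every row.
first_by_split = df["Value"].str.split("/").str[0]

# Plain Python loop; assumes every entry is a string (no NaN, no numbers).
first_by_loop = [v.split("/")[0] for v in df["Value"]]

# Hypothetical mixed-type column: the integer 24 is not a string,
# so .str.split yields NaN for it unless the column is cast first.
mixed = pd.Series(["21,2", 24, "23,21/23,60"], dtype=object)
with_nan = mixed.str.split("/").str[0]
casted = mixed.astype(str).str.split("/").str[0]
```

If the real column does contain non-string values, casting with `astype(str)` before splitting, as in `casted`, is one way to avoid the NaN entries.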

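Use pandas.Series.str.replace with a regular expression (pass regex=True explicitly; since pandas 2.0 the pattern is treated as a literal string by default):
df.assign(Value=df.Value.str.replace('/.*', '', regex=True))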
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
Optionally, you can assign the result directly back to the dataframe:
df['Value'] = df.Value.str.replace('/.*', '', regex=True)
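As a small sketch of the regex approach, the snippet below removes the first slash and everything after it. The two-row frame is made up for illustration. The `regex=True` argument is spelled out because pandas 2.0 and later treat the pattern as literal text by default.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Tom", "Frank"],
    "Value": ["23,21/23,60", "24"],
})

# "/.*" matches the first slash and the rest of the string;
# rows without a slash are returned unchanged.
trimmed = df["Value"].str.replace(r"/.*", "", regex=True)

# assign returns a new frame and leaves df untouched.
result = df.assign(Value=trimmed)
```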

Related

How do I create a new column of max values of a column(corresponding to specific name) using pandas?

I'm wondering if it is possible to use Pandas to create a new column for the max values of a column (corresponding to different names, so that each name will have a max value).
For an example:
name value max
Alice 1 9
Linda 1 1
Ben 3 5
Alice 4 9
Alice 9 9
Ben 5 5
Linda 1 1
So for Alice, we are picking the max of 1, 4, and 9, which is 9. For Linda max(1,1) = 1, and for Ben max(3,5) = 5.
I was thinking of using .loc to select the name == "Alice", then get the max value of these rows, then create the new column. But since I'm dealing with a large dataset, this does not seem like a good option. Is there a smarter way to do this so that I don't need to know what specific names?
Grouping by name and taking the max gives the maximum per name, which can then be merged back into the original df:
df.merge(df.groupby(['name'])['value'].max().reset_index(),
         on='name').rename(
    columns={'value_x': 'value',
             'value_y': 'max'})
name value max
0 Alice 1 9
1 Alice 4 9
2 Alice 9 9
3 Linda 1 1
4 Linda 1 1
5 Ben 3 5
6 Ben 5 5
You could use transform or map:
df['max'] = df.groupby('name')['value'].transform('max')
or
df['max'] = df['name'].map(df.groupby('name')['value'].max())
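A self-contained sketch of the transform and map variants on the example data from the question. The column names `max_transform` and `max_map` are chosen here only so that the two results can be compared side by side.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Linda", "Ben", "Alice", "Alice", "Ben", "Linda"],
    "value": [1, 1, 3, 4, 9, 5, 1],
})

# transform broadcasts the group maximum back onto every row of the group.
df["max_transform"] = df.groupby("name")["value"].transform("max")

# map looks each name up in a Series of per-name maxima.
per_name_max = df.groupby("name")["value"].max()
df["max_map"] = df["name"].map(per_name_max)
```

Both variants keep the original row order, unlike the merge-based version, whose output above is grouped by name.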

Pandas - dense rank but keep current group numbers

I'm dealing with pandas dataframe and have a frame like:
data = {
"name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
"id": [3, 3, 1, 0, 0, 0, 2]
}
df = pd.DataFrame(data)
----------------------
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 0
4 Mary 0
5 Andrew 0
6 Michael 2
I'm trying to write code to group values by "name" column. However, I want to keep the current group numbers.
If the value is 0, it means that there is no assignment.
For the example above, assign a value of 3 for each occurrence of Andrew and a value of 1 for each occurrence of James. For Mary, there is no assignment so assign next/unique number.
The expected output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
I've spent time already trying to figure this out. I managed to get to something like this:
df.loc[df["id"].eq(0), "id"] = ( df['name'].rank(method='dense').astype(int))
The issue with the above is that it ignores the records equal to 0, so the numbers are incorrect. I removed that part (the values equal to 0), but then the existing numbering is not preserved.
Can you please help me?
Replace the 0 values with missing values, so that GroupBy.transform with 'first' fills each group with its existing id (first skips missing values). Then fill the remaining missing values using Series.rank on the names, offset by the maximal id, and convert to integers (np here is numpy, imported as np):
df = df.replace({'id':{0:np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')
s = df.loc[df["id"].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print (df)
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
IIUC you can first fill in the non-zero IDs with groupby.transform('max') to get the max existing ID, then complete the names without ID to the next available ID on the masked data (you can use factorize or rank as you wish):
# fill existing non-zero IDs
s = df.groupby('name')['id'].transform('max')
m = s.eq(0)
df['id'] = s.mask(m)
# add new ones
df.loc[m, 'id'] = pd.factorize(df.loc[m, 'name'])[0]+df['id'].max()+1
# or rank, although factorize is more appropriate for non numerical data
# df.loc[m, 'id'] = df.loc[m, 'name'].rank(method='dense')+df['id'].max()
# optional, if you want integers
df['id']= df['id'].convert_dtypes()
output:
name id
0 Andrew 3
1 Andrew 3
2 James 1
3 James 1
4 Mary 4
5 Andrew 3
6 Michael 2
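The factorize-based answer can be wrapped into a small helper, sketched below. The function name `fill_missing_ids` is hypothetical. It assumes 0 always means "no id assigned" and that existing ids are positive integers.

```python
import pandas as pd


def fill_missing_ids(df):
    """Return a copy of df where names without an id get the next free ids."""
    out = df.copy()
    # Largest id seen per name; stays 0 for names that never had an id.
    existing = out.groupby("name")["id"].transform("max")
    missing = existing.eq(0)
    ids = existing.mask(missing)  # NaN where no id exists yet
    # Start numbering new ids right after the largest existing id.
    start = 0 if ids.isna().all() else int(ids.max())
    codes = pd.factorize(out.loc[missing, "name"])[0]
    ids.loc[missing] = codes + start + 1
    out["id"] = ids.astype(int)
    return out


data = {
    "name": ["Andrew", "Andrew", "James", "James", "Mary", "Andrew", "Michael"],
    "id": [3, 3, 1, 0, 0, 0, 2],
}
result = fill_missing_ids(pd.DataFrame(data))
```

Because factorize numbers names in order of first appearance, several unassigned names would receive consecutive ids in the order they occur, rather than in alphabetical order as with rank.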

How to keep certain rows based on a condition in python pandas

I have the following df. Below are two fields that pertain to my question
name tardy
max 0
max 1
ben 0
amy 0
amy 1
sue 1
tyler 0
tyler 1
I would like to keep only the name of those who have both tardy==0 and tardy==1. Thus, my desired output is the following
name tardy
max 0
max 1
amy 0
amy 1
tyler 0
tyler 1
Getting rid of name==sue and name==ben makes it so that the only name showing up is for those who have both a 0 and 1 value for tardy.
I tried doing a .loc
df[(df.tardy==0) & (df.tardy==1)]
but this doesn't take into account filtering it by name.
Any help is appreciated. Thanks!
For the most general solution, working for any data, convert each group's values to a set and compare it with the required values; to avoid matching groups with repeated values such as 0,1,0, also compare the lengths:
vals = set([0,1])
m = df.groupby('name')['tardy'].transform(lambda x: set(x)==vals and len(x)==len(vals))
df = df[m]
print (df)
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
Or a solution with pandas functions only: check that the number of unique values per group is 2, that the group size is 2, and that the values are in 0,1:
vals = [0,1]
g = df.groupby('name')['tardy']
df = df[g.transform('nunique').eq(2) & g.transform('size').eq(2) & df['tardy'].isin(vals)]
print (df)
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
You can use groupby().nunique():
df[df.groupby('name')['tardy'].transform('nunique')==2]
Output:
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
The easiest way is to use df.groupby().filter, which filters the dataframe's groups based on a condition.
tardy_vals = {0, 1}
df.groupby('name').filter(lambda g: tardy_vals.issubset(g['tardy']))
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
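For comparison, here is a self-contained sketch that runs the nunique and filter approaches on the data from the question. The variable names `required`, `by_nunique` and `by_filter` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["max", "max", "ben", "amy", "amy", "sue", "tyler", "tyler"],
    "tardy": [0, 1, 0, 0, 1, 1, 0, 1],
})
required = {0, 1}

# Keep rows whose group has as many distinct tardy values as required.
by_nunique = df[df.groupby("name")["tardy"].transform("nunique") == len(required)]

# Keep whole groups whose tardy values include every required value.
by_filter = df.groupby("name").filter(lambda g: required.issubset(g["tardy"]))
```

The two agree here, but they are not equivalent in general: nunique only counts distinct values, so a group holding, say, 1 and 2 would also pass, whereas issubset checks the values themselves.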

Pandas, how to make matrix

I have a question about pandas and if someone could help me, I would be grateful for that very much.
I have a dataframe
df1 = pd.DataFrame( {'Name': ['A', 'B','A','A']})
df1
I want to do groupby for this.
x=df1.groupby("Name").size()
x
I also have another dataframe
df2 = pd.DataFrame( {'Name2': ['Jon','Maria','Maria','Mike','Mike','Mike']})
df2
For this one, I do groupby as well.
y= df2.groupby("Name2").size()
Then I want to build a matrix whose rows come from x and whose columns come from y, where each cell is the product of the two counts.
I want the matrix like this.
Jon Maria Mike
A 3 6 9
B 1 2 3
If you could tell me how to do that, I would greatly appreciate it.
You could perform a dot product:
x.to_frame().dot(y.to_frame().T)
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
If you want to remove the axis labels, use rename_axis:
x.to_frame().dot(y.to_frame().T)\
.rename_axis(None).rename_axis(None, axis=1)
Jon Maria Mike
A 3 6 9
B 1 2 3
Alternatively, assign in-place:
v = x.to_frame().dot(y.to_frame().T)
v.index.name = v.columns.name = None
v
Jon Maria Mike
A 3 6 9
B 1 2 3
In [35]: (pd.DataFrame(y.values[:, None].dot(x.values[:, None].T).T, columns=y.index, index=x.index)
.rename_axis(None)
.rename_axis(None, axis=1))
Out[35]:
Jon Maria Mike
A 3 6 9
B 1 2 3
Or we can use np.multiply.outer:
pd.DataFrame(np.multiply.outer(x.values,y.values),columns=y.index,index=x.index)
Out[344]:
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
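Putting the pieces together, the following sketch rebuilds both count series from the question and forms the outer product with NumPy. It uses `to_numpy()` where the answer uses `.values`, and the variable name `matrix` is illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Name": ["A", "B", "A", "A"]})
df2 = pd.DataFrame({"Name2": ["Jon", "Maria", "Maria", "Mike", "Mike", "Mike"]})

x = df1.groupby("Name").size()   # A -> 3, B -> 1
y = df2.groupby("Name2").size()  # Jon -> 1, Maria -> 2, Mike -> 3

# Outer product: cell (i, j) is x[i] * y[j].
matrix = pd.DataFrame(
    np.multiply.outer(x.to_numpy(), y.to_numpy()),
    index=x.index,
    columns=y.index,
)
```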

Flatten a pandas dataframe column

I currently have the following column:
0 [Joe]
1 John
2 Mary
3 [Joey]
4 Harry
5 [Susan]
6 Kevin
I can't seem to remove the [] without turning the rows that contain [] into NaN.
To be clear I want the column to look like this:
0 Joe
1 John
2 Mary
3 Joey
4 Harry
5 Susan
6 Kevin
Can anyone help?
Your title seems to imply that some elements of your series are lists.
setup
s = pd.Series([['Joe'], 'John', 'Mary', ['Joey'], 'Harry', ['Susan'], 'Kevin'])
s
0 [Joe]
1 John
2 Mary
3 [Joey]
4 Harry
5 [Susan]
6 Kevin
dtype: object
option 1
apply with pd.Series
s.apply(pd.Series).squeeze()
0 Joe
1 John
2 Mary
3 Joey
4 Harry
5 Susan
6 Kevin
Name: 0, dtype: object
Try this:
df['column_name'] = df['column_name'].apply(lambda x: str(x).strip("'[]") if type(x) == list else x)
Why not just do s.astype(str).str.strip("'[]'")
or
s.map(lambda x: x if type(x) != list else x[0])
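A runnable sketch of the map-based variant, using the series from the setup above. It assumes that every list in the column holds exactly one element; an empty list would raise an IndexError.

```python
import pandas as pd

s = pd.Series([["Joe"], "John", "Mary", ["Joey"], "Harry", ["Susan"], "Kevin"])

# Unwrap one-element lists and leave plain strings untouched.
flat = s.map(lambda x: x[0] if isinstance(x, list) else x)
```

Unlike the `astype(str).str.strip(...)` variant, this does not convert values to strings first, so non-list entries keep their original type.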
