I currently have the following column:
0 [Joe]
1 John
2 Mary
3 [Joey]
4 Harry
5 [Susan]
6 Kevin
I can't seem to remove the [] without turning the rows that contain [] into NaN.
To be clear I want the column to look like this:
0 Joe
1 John
2 Mary
3 Joey
4 Harry
5 Susan
6 Kevin
Can anyone help?
Your title seems to imply that some elements of your series are lists.
setup
s = pd.Series([['Joe'], 'John', 'Mary', ['Joey'], 'Harry', ['Susan'], 'Kevin'])
s
0 [Joe]
1 John
2 Mary
3 [Joey]
4 Harry
5 [Susan]
6 Kevin
dtype: object
option 1
apply with pd.Series
s.apply(pd.Series).squeeze()
0 Joe
1 John
2 Mary
3 Joey
4 Harry
5 Susan
6 Kevin
Name: 0, dtype: object
Try this:
df['column_name'] = df['column_name'].apply(lambda x: str(x).strip("'[]") if isinstance(x, list) else x)
Why not just do s.astype(str).str.strip("'[]")
or
s.map(lambda x: x if not isinstance(x, list) else x[0])
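As a further option (assuming pandas ≥ 0.25), `Series.explode` flattens the one-element lists while leaving plain strings untouched:

```python
import pandas as pd

# Mixed series: some elements are one-element lists, some are plain strings
s = pd.Series([['Joe'], 'John', 'Mary', ['Joey'], 'Harry', ['Susan'], 'Kevin'])

# explode() unpacks each list into its own row; scalars pass through
# unchanged, so one-element lists preserve the original shape.
flat = s.explode()
print(flat.tolist())  # ['Joe', 'John', 'Mary', 'Joey', 'Harry', 'Susan', 'Kevin']
```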
Related
I have a list of names:
lst = ['Albert', 'Carl', 'Julian', 'Mary']
and I have a DF:
target id name
A 100 Albert
A 110 Albert
B 200 Carl
D 500 Mary
E 235 Mary
I want to make another dataframe counting how many id per name in lst:
lst_names Count
Albert 2
Carl 1
Julian 0
Mary 2
What's the most efficient way to do this considering the list of names has 12k unique names on it?
Check with value_counts
pd.Categorical(df['name'],lst).value_counts()
Out[894]:
Albert 2
Carl 1
Julian 0
Mary 2
dtype: int64
Or
df['name'].value_counts().reindex(lst,fill_value=0)
Out[896]:
Albert 2
Carl 1
Julian 0
Mary 2
Name: name, dtype: int64
You can use value_counts, then add a zero-valued Series indexed by lst, and fill the resulting NaN (for names absent from df) with 0:
(df['name'].value_counts() + pd.Series(0, index=lst)).fillna(0).astype(int)
Output:
Albert 2
Carl 1
Julian 0
Mary 2
dtype: int64
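For a quick check, here is the reindex approach run end to end, with the DataFrame reconstructed from the question's sample:

```python
import pandas as pd

lst = ['Albert', 'Carl', 'Julian', 'Mary']
df = pd.DataFrame({'target': ['A', 'A', 'B', 'D', 'E'],
                   'id': [100, 110, 200, 500, 235],
                   'name': ['Albert', 'Albert', 'Carl', 'Mary', 'Mary']})

# reindex aligns the counts to lst, filling names absent from df with 0
counts = df['name'].value_counts().reindex(lst, fill_value=0)
print(counts.tolist())  # [2, 1, 0, 2]
```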
I have the following dataframe:
df = pd.DataFrame({'Value': [0, 1, 2,3, 4,5,6,7,8,9],'Name': ['John', 'Jim', 'John','Jim', 'John','Jim','Jim','John','Jim','John']})
df
Value Name
0 0 John
1 1 Jim
2 2 John
3 3 Jim
4 4 John
5 5 Jim
6 6 Jim
7 7 John
8 8 Jim
9 9 John
I would like to select the top n values per Name and compute their mean from the Value column.
I have tried this:
df['Top2Mean'] = df.groupby(['Name'])['Value'].nlargest(2).transform('mean')
But I get the following error:
ValueError: transforms cannot produce aggregated results
My expected result is a new column called Top2Mean with a 8 next to John and 7 next to Jim.
Thanks in advance!
Let us calculate the mean of the top-2 values per group (grouping the nlargest result by its first index level), then map the calculated means to the Name column to broadcast the aggregated results.
top2 = df.groupby('Name')['Value'].nlargest(2).groupby(level=0).mean()
df['Top2Mean'] = df['Name'].map(top2)
(In pandas < 2.0 the same could be written as .nlargest(2).mean(level=0); that form was removed in 2.0.)
If we need to group on multiple columns, for example Name and City, then we take the mean over those index levels and map the calculated means using MultiIndex.map:
c = ['Name', 'City']
top2 = df.groupby(c)['Value'].nlargest(2).groupby(level=c).mean()
df['Top2Mean'] = df.set_index(c).index.map(top2)
Alternative approach with groupby and transform using a custom lambda function
df['Top2Mean'] = df.groupby('Name')['Value']\
.transform(lambda v: v.nlargest(2).mean())
Value Name Top2Mean
0 0 John 8
1 1 Jim 7
2 2 John 8
3 3 Jim 7
4 4 John 8
5 5 Jim 7
6 6 Jim 7
7 7 John 8
8 8 Jim 7
9 9 John 8
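Putting the transform approach together as a self-contained sketch of the question's data:

```python
import pandas as pd

df = pd.DataFrame({'Value': range(10),
                   'Name': ['John', 'Jim', 'John', 'Jim', 'John',
                            'Jim', 'Jim', 'John', 'Jim', 'John']})

# transform broadcasts the per-group top-2 mean back to every row,
# so the result aligns with the original index.
df['Top2Mean'] = (df.groupby('Name')['Value']
                    .transform(lambda v: v.nlargest(2).mean()))
print(df)
```

John's values are 0, 2, 4, 7, 9, so its top-2 mean is (9 + 7) / 2 = 8; Jim's are 1, 3, 5, 6, 8, giving (8 + 6) / 2 = 7.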
So I am stuck with a problem here:
I have a pandas dataframe which looks like the following:
ID Name Value
0 Peter 21,2
1 Frank 24
2 Tom 23,21/23,60
3 Ismael 21,2/ 21,54
4 Joe 23,1
and so on...
What I am trying to do is split the "Value" column on the forward slash (/) but keep all the values that do not contain this pattern.
Like here:
ID Name Value
0 Peter 21,2
1 Frank 24
2 Tom 23,21
3 Ismael 21,2
4 Joe 23,1
How can I achieve this? I tried the str.split method, but it's not giving me the solution I want; instead it returns NaN, as can be seen below.
My code: df['Value'] = df['Value'].str.split('/', expand=True)[0]
Returns:
ID Name Value
0 Peter NaN
1 Frank NaN
2 Tom 23,21
3 Ismael 21,2
4 Joe NaN
All I need is the very first Value before the '/' is coming.
Appreciate any kind of help!
Remove expand=True so lists are returned, and add str[0] to select the first value:
df['Value'] = df['Value'].str.split('/').str[0]
print (df)
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
If performance is important use list comprehension:
df['Value'] = [x.split('/')[0] for x in df['Value']]
pandas.Series.str.replace with a regex (note that regex=True is required in pandas ≥ 2.0, where the default changed to literal matching):
df.assign(Value=df.Value.str.replace('/.*', '', regex=True))
ID Name Value
0 0 Peter 21,2
1 1 Frank 24
2 2 Tom 23,21
3 3 Ismael 21,2
4 4 Joe 23,1
Optionally, you can assign the result directly back to the dataframe:
df['Value'] = df.Value.str.replace('/.*', '', regex=True)
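A minimal end-to-end sketch of the str.split answer, with the data reconstructed from the question:

```python
import pandas as pd

# Data reconstructed from the question
df = pd.DataFrame({'Name': ['Peter', 'Frank', 'Tom', 'Ismael', 'Joe'],
                   'Value': ['21,2', '24', '23,21/23,60', '21,2/ 21,54', '23,1']})

# Without expand=True, str.split returns lists; .str[0] takes the first
# piece, and strings without a '/' pass through unchanged.
df['Value'] = df['Value'].str.split('/').str[0]
print(df['Value'].tolist())  # ['21,2', '24', '23,21', '21,2', '23,1']
```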
C1
0 John
1 John
2 John
3 Michale
4 Michale
5 Newton
6 Newton
7 John
8 John
9 John
I want to know how many times each name occurred in consecutive rows. For example, John occurs from 0 to 2, so the result should read: 0 to 2 John, 3 to 4 Michale, 5 to 6 Newton.
Result I want in this format:
Start End Name
0 2 John
3 4 Michale
5 6 Newton
7 9 John
Use
In [163]: df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
Out[163]:
start end
C1
John 0 2
Michale 3 4
Newton 5 6
@Zero: Would adding the below to your code help? :)
df_new = df.reset_index().groupby('C1')['index'].agg(['min', 'max']).rename(
columns={'min': 'start', 'max': 'end'})
df_new.reset_index().rename(columns={'C1':'Name'})
Edit: Maybe something like this? I am still learning, but there is no harm in trying. :)
labels = (df.C1 != df.C1.shift()).cumsum().rename('label')
df1 = pd.concat([df, labels], axis=1)
df_new = (df1.reset_index()
             .groupby(['label', 'C1'])['index']
             .agg(['min', 'max'])
             .rename(columns={'min': 'Start', 'max': 'End'})
             .reset_index()
             .rename(columns={'C1': 'Name'}))
df_new
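For reference, the shift/cumsum run-labelling idea as a self-contained sketch; the column names Start/End/Name are chosen to match the requested output:

```python
import pandas as pd

df = pd.DataFrame({'C1': ['John'] * 3 + ['Michale'] * 2 + ['Newton'] * 2 + ['John'] * 3})

# A new run starts whenever the name differs from the previous row;
# cumsum turns those change points into consecutive run labels.
labels = (df['C1'] != df['C1'].shift()).cumsum().rename('run')

# Group by (run label, name) so repeated names form separate runs,
# then take the min/max of the original row positions per run.
runs = (df.reset_index()
          .groupby([labels, 'C1'], sort=False)['index']
          .agg(Start='min', End='max')
          .reset_index(level='C1')
          .rename(columns={'C1': 'Name'})
          .reset_index(drop=True))
print(runs)
```

Grouping by the run label as well as the name is what lets John appear twice in the result (rows 0-2 and 7-9), which a plain groupby('C1') cannot produce.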
I have a question about pandas and would be very grateful if someone could help me.
I have a dataframe
df1 = pd.DataFrame( {'Name': ['A', 'B','A','A']})
df1
I want to do groupby for this.
x=df1.groupby("Name").size()
x
I also have another dataframe
df2 = pd.DataFrame( {'Name2': ['Jon', 'Maria', 'Maria', 'Mike', 'Mike', 'Mike']})
df2
For this one, I do groupby as well.
y= df2.groupby("Name2").size()
And then I want to make a matrix whose rows are indexed by x and whose columns are indexed by y, multiplying the values.
I want the matrix like this.
Jon Maria Mike
A 3 6 9
B 1 2 3
If you could tell me how to do that, I would greatly appreciate it.
You could perform a dot product:
x.to_frame().dot(y.to_frame().T)
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
If you want to remove the axis labels, use rename_axis:
x.to_frame().dot(y.to_frame().T)\
    .rename_axis(None).rename_axis(None, axis=1)
Jon Maria Mike
A 3 6 9
B 1 2 3
Alternatively, assign in-place:
v = x.to_frame().dot(y.to_frame().T)
v.index.name = v.columns.name = None
v
Jon Maria Mike
A 3 6 9
B 1 2 3
In [35]: (pd.DataFrame(y.values[:, None].dot(x.values[:, None].T).T, columns=y.index, index=x.index)
    .rename_axis(None)
    .rename_axis(None, axis=1))
Out[35]:
Jon Maria Mike
A 3 6 9
B 1 2 3
Or we can use np.multiply.outer:
pd.DataFrame(np.multiply.outer(x.values,y.values),columns=y.index,index=x.index)
Out[344]:
Name2 Jon Maria Mike
Name
A 3 6 9
B 1 2 3
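The outer-product answers above can be checked with a small self-contained sketch; here x and y are hard-coded to the group sizes from the question (A x3, B x1 and Jon x1, Maria x2, Mike x3):

```python
import numpy as np
import pandas as pd

# Group sizes hard-coded from the question's df1/df2 groupbys
x = pd.Series([3, 1], index=['A', 'B'])
y = pd.Series([1, 2, 3], index=['Jon', 'Maria', 'Mike'])

# Outer product: m[i, j] = x[i] * y[j]
m = pd.DataFrame(np.multiply.outer(x.values, y.values),
                 index=x.index, columns=y.index)
print(m)
```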