I have a pandas DataFrame with labels for n classes. Now I want to add a column that stores the number of rows between two consecutive elements of the same class.
Class
0 0
1 1
2 1
3 1
4 0
and I want to get this:
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
This is the code I used:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Class': [0, 1, 1, 1, 0]})
df['Shift'] = np.nan
for item in df.Class.unique():
    _df = df[df['Class'] == item]
    _df = _df.reset_index().rename({'index': 'idx'}, axis=1)
    df.loc[_df.idx, 'Shift'] = _df['idx'].diff().values
df
This seems circuitous to me. Is there a more elegant way of producing this output?
You could do:
df['shift'] = np.arange(len(df))
df['shift'] = df.groupby('Class')['shift'].diff()
print(df)
Output
Class shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
As an alternative:
df['shift'] = df.assign(shift=np.arange(len(df))).groupby('Class')['shift'].diff()
The idea is to create a column with consecutive values, group by the Class column and compute the diff on the new column.
If there is a default RangeIndex, use Index.to_series with grouping by the column df['Class'] and DataFrameGroupBy.diff:
df['Shift'] = df.index.to_series().groupby(df['Class']).diff()
A similar alternative is to create a helper column:
df['Shift'] = df.assign(tmp = df.index).groupby('Class')['tmp'].diff()
print (df)
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
Your solution with resetting the index can be simplified to:
df['Shift'] = df.reset_index().groupby('Class')['index'].diff().to_numpy()
I have the df below:
Cost,Reve
0,3
4,0
0,0
10,10
4,8
len(df['Cost']) = 300
len(df['Reve']) = 300
I need to divide df['Cost'] by df['Reve'].
Below is my code:
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
I got the error ValueError: Columns must be same length as key
df['C/R'] = df[['Cost']].div(df['Reve'].values, axis=0)
I got the error ValueError: Wrong number of items passed 2, placement implies 1
The problem is duplicated column names; verify:
#generate duplicates
df = pd.concat([df, df], axis=1)
print (df)
Cost Reve Cost Reve
0 0 3 0 3
1 4 0 4 0
2 0 0 0 0
3 10 10 10 10
4 4 8 4 8
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
print (df)
# ValueError: Columns must be same length as key
You can find these column names with:
print (df.columns[df.columns.duplicated(keep=False)])
Index(['Cost', 'Reve', 'Cost', 'Reve'], dtype='object')
If the duplicated columns contain the same values, you can remove the duplicates with:
df = df.loc[:, ~df.columns.duplicated()]
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
#simplify division
df['C/R'] = df['Cost'].div(df['Reve'])
print (df)
Cost Reve C/R
0 0 3 0.0
1 4 0 inf
2 0 0 NaN
3 10 10 1.0
4 4 8 0.5
The issue lies in the shape of the data you are trying to assign to the columns. I had an issue with this:
df[['X1','X2', 'X3']] = pd.DataFrame(df.X.tolist(), index= df.index)
I was trying to assign the values of X to the 3 columns X1, X2, X3, assuming that X had 3 values, but X had 4 values.
So the revised code in my case was:
df[['X1','X2', 'X3','X4']] = pd.DataFrame(df.X.tolist(), index= df.index)
I had the same error, but it did not come from the two issues above. In my case the columns had the same length. What helped me was transforming my Series to a DataFrame with pd.DataFrame() and then assigning its values to a new column of my existing df; a sketch of this idea follows.
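For illustration, a minimal sketch of that workaround (the frames and names here are hypothetical, since the post doesn't show the originals): going through pd.DataFrame(...) and its underlying array assigns by position instead of aligning on index labels.
import pandas as pd

df = pd.DataFrame({'Cost': [0, 4, 10]}, index=[10, 11, 12])
s = pd.Series([3, 0, 5])  # plain RangeIndex: its labels don't match df's

# Assigning the Series directly would align on index labels and yield all NaN;
# converting to a DataFrame and assigning its flattened array is positional.
df['Reve'] = pd.DataFrame(s).to_numpy().ravel()
print(df)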
I have a dataframe as shown below:
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1],
'val' :[5,6.4,5.4,6,6,6]
})
I would like to drop the rows whose val ends in .[1-9]; basically, I would like to retain values like 5.0 and 6.0 and drop values like 5.4, 6.4, etc.
Though I tried the below, it isn't accurate:
df['val'] = df['val'].astype(int)
df.drop_duplicates()  # doesn't give the expected output
I expect my output to keep only the rows where val is a whole number (the 5.0 and 6.0 rows).
The first idea is to compare the original values with the column cast to integer, and also assign the integers back for the expected output (integers in the column):
s = df['val']
df['val'] = df['val'].astype(int)
df = df[df['val'] == s]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
Another idea is to test each value with float.is_integer:
mask = df['val'].apply(lambda x: x.is_integer())
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
If you need floats in the output, you can use:
df1 = df[df['val'].astype(int) == df['val']]
print (df1)
subject_id val
0 1 5.0
3 1 6.0
4 1 6.0
5 1 6.0
Use mod 1 to determine the remainder. If the remainder is 0, the number is an integer. Then use the result as a mask to select only those rows.
df.loc[df.val.mod(1).eq(0)].astype(int)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
I have a dataframe like this:
id = [1,1,2,3]
x1 = [0,1,1,2]
x2 = [2,3,1,1]
df = pd.DataFrame({'id':id, 'x1':x1, 'x2':x2})
df
id x1 x2
1 0 2
1 1 3
2 1 1
3 2 1
Some rows have the same id. I want to sum up such rows (over x1 and x2) to obtain a new dataframe with unique ids:
df_new
id x1 x2
1 1 5
2 1 1
3 2 1
An important detail is that the real number of columns x1, x2,... is large, so I cannot apply a function that requires manual input of column names.
As discussed, you can use the pandas groupby function to sum based on the id value:
df.groupby(df.id).sum()
# or
df.groupby('id').sum()
If you don't want id to become the index, then you can:
df.groupby('id').sum().reset_index()
# or
df.groupby('id', as_index=False).sum() # #John_Gait
With pivot_table:
In [31]: df.pivot_table(index='id', aggfunc=sum)
Out[31]:
x1 x2
id
1 1 5
2 1 1
3 2 1
Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult in Stack Overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _x, _y columns? I'd like the values in one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
  Name  Nonprofit_x  Business  Education_x  Nonprofit_y  Education_y
0    X            1         1            0          NaN          NaN
1    Y            0         1            0          1.0          1.0
2    Z            0         0            0          1.0          1.0
3    Y            0         1            0          1.0          1.0
In a previous post, I tried combine_first and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(To clarify: the value in the 'Business' column where Name = Z should be 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
# str_to_regex and searchnamesre are my own helper functions
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.loc[pubunis.index, ['Education', 'Public']] = 1
Attention: in recent versions of pandas, both of the answers below no longer work.
KSD's answer will raise an error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, these will work safely only if the values in column 'Name' are unique and sorted identically in both data frames.
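To illustrate (a toy sketch invented for this point, not taken from the thread): assignment via .values is purely positional, so if the Name order differs between the two frames, values land on the wrong rows without raising any error.
import pandas as pd

df1 = pd.DataFrame({'Name': ['Z', 'Y'], 'Nonprofit': [0, 0]})
df2 = pd.DataFrame({'Name': ['Y', 'Z'], 'Nonprofit': [1, 2]})

# Both names match, so the mask selects both rows of df1, but the assignment
# is positional: row 0 of df1 ('Z') receives row 0 of df2 ('Y').
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit']] = df2[['Nonprofit']].values
print(df1)  # Z ends up with Y's value and vice versa -- silently wrong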
Here is my answer:
Way 1:
df1 = df1.merge(df2, on='Name', how='left')
# df2 has no Business column, so only Nonprofit and Education get suffixed
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
df1.drop(['Nonprofit_x', 'Education_x'], inplace=True, axis=1)
df1.rename(columns={'Nonprofit_y': 'Nonprofit', 'Education_y': 'Education'}, inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More guidance about update: the columns that the two data frames set as the index do not need to have the same name before calling 'update'; you could use 'Name1' and 'Name2'. Also, it works even if df2 has extra rows that are not in df1; they simply won't update anything. In other words, df2 doesn't need to be a superset of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0
Use the boolean mask from isin to filter the df and assign the desired row values from the right-hand-side df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
This is the correct one:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exist in df. In other words, df should be a superset of df1.
In case df1 has some rows that don't match df (in other words, df is not a superset of df1), you should do the following:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
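Finally, another option posted in the thread: set Name as the index on both frames and use combine_first, which takes df2's values where they exist and falls back to df1's otherwise: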
df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()