Decreasing column values with Pandas

Decreasing column values with Pandas - python

I would like to decrease values in a column. The following code
df['ID'] = df.index
df.loc[df['ID'].astype(int)] -= 1
Should convert
ID Process ID Name Time
0 0 NaN NaN msecond
1 1 32434.0 X1 1.4
2 2 32434.0 X2 1.3
to
ID Process ID Name Time
0 -1 NaN NaN msecond
1 0 32434.0 X1 1.4
2 1 32434.0 X2 1.3
How can I fix that? I should keep the first line because of the code I have written.

Subtract ID only:
#if necessary convert to integers
#df['ID'] = df['ID'].astype(int)
df['ID'] -= 1
Or maybe:
df['ID'] = df.index - 1
For first column:
df.insert(0, 'ID', df.index - 1)

Related

Find the latest occurrence of an class item and store how many values are between these two in a pandas DataFrame

I have a pandas DataFrame with some labels for n classes. Now I want to add a column and store how many items are between two elements of the same class.
Class
0 0
1 1
2 1
3 1
4 0
and I want to get this:
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
This is the code I used:
df = pd.DataFrame({'Class':[0,1,1,1,0]})
df['Shift'] = np.nan
for item in df.Class.unique():
_df = df[df['Class'] == item]
_df = _df.reset_index().rename({'index':'idx'}, axis=1)
df.loc[_df.idx, 'Shift'] = _df['idx'].diff().values
df
This seems circuitous to me. Is there a more elegant way of producing this output?

You could do:
df['shift'] = np.arange(len(df))
df['shift'] = df.groupby('Class')['shift'].diff()
print(df)
Output
Class shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
As an alternative:
df['shift'] = df.assign(shift=np.arange(len(df))).groupby('Class')['shift'].diff()
The idea is to create a column with consecutive values, group by the Class column and compute the diff on the new column.

If there is default RangeIndex use Index.to_series with grouping by column df['Class'] and DataFrameGroupBy.diff:
df['Shift'] = df.index.to_series().groupby(df['Class']).diff()
Similar alternative is create helper column:
df['Shift'] = df.assign(tmp = df.index).groupby('Class')['tmp'].diff()
print (df)
Class Shift
0 0 NaN
1 1 NaN
2 1 1.0
3 1 1.0
4 0 4.0
Your solution with reseting index should be simplify by:
df['Shift'] = df.reset_index().groupby('Class')['index'].diff().to_numpy()

ValueError: Columns must be same length as key in pandas

i have df below
Cost,Reve
0,3
4,0
0,0
10,10
4,8
len(df['Cost']) = 300
len(df['Reve']) = 300
I need to divide df['Cost'] / df['Reve']
Below is my code
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
I got the error ValueError: Columns must be same length as key
df['C/R'] = df[['Cost']].div(df['Reve'].values, axis=0)
I got the error ValueError: Wrong number of items passed 2, placement implies 1

Problem is duplicated columns names, verify:
#generate duplicates
df = pd.concat([df, df], axis=1)
print (df)
Cost Reve Cost Reve
0 0 3 0 3
1 4 0 4 0
2 0 0 0 0
3 10 10 10 10
4 4 8 4 8
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
print (df)
# ValueError: Columns must be same length as key
You can find this columns names:
print (df.columns[df.columns.duplicated(keep=False)])
Index(['Cost', 'Reve', 'Cost', 'Reve'], dtype='object')
If same values in columns is possible remove duplicated by:
df = df.loc[:, ~df.columns.duplicated()]
df[['Cost','Reve']] = df[['Cost','Reve']].apply(pd.to_numeric)
#simplify division
df['C/R'] = df['Cost'].div(df['Reve'])
print (df)
Cost Reve C/R
0 0 3 0.0
1 4 0 inf
2 0 0 NaN
3 10 10 1.0
4 4 8 0.5

The issue lies in the size of data that you are trying to assign to the columns. I had an issue with this:
df[['X1','X2', 'X3']] = pd.DataFrame(df.X.tolist(), index= df.index)
I was trying to assign values of X to 3 columns X1,X2,X3, assuming that X has 3 values, but, X had 4 values.
So the revised code in my case was
df[['X1','X2', 'X3','X4']] = pd.DataFrame(df.X.tolist(), index= df.index)

I had the same error, but it did not come from the above two issues. In my case the columns had the same length. What helped me was transforming my Series to a DataFrame with pd.DataFrame() and then assigning its values to a new column of my existing df.

How to drop float values from a column - pandas

I have a dataframe given shown below
df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1],
'val' :[5,6.4,5.4,6,6,6]
})
It looks like as shown below
I would like to drop the values from val column which ends with .[1-9]. Basically I would like to retain values like 5.0,6.0 and drop values like 5.4,6.4 etc
Though I tried below, it isn't accurate
df['val'] = df['val'].astype(int)
df.drop_duplicates() # it doesn't give expected output and not accurate.
I expect my output to be like as shown below

First idea is compare original value with casted column to integer, also assign integers back for expected output (integers in column):
s = df['val']
df['val'] = df['val'].astype(int)
df = df[df['val'] == s]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
Another idea is test is_integer:
mask = df['val'].apply(lambda x: x.is_integer())
df['val'] = df['val'].astype(int)
df = df[mask]
print (df)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6
If need floats in output you can use:
df1 = df[ df['val'].astype(int) == df['val']]
print (df1)
subject_id val
0 1 5.0
3 1 6.0
4 1 6.0
5 1 6.0

Use mod 1 to determine the residual. If residual is 0 it means the number is a int. Then use the results as a mask to select only those rows.
df.loc[df.val.mod(1).eq(0)].astype(int)
subject_id val
0 1 5
3 1 6
4 1 6
5 1 6

Sum up non-unique rows in DataFrame

I have a dataframe like this:
id = [1,1,2,3]
x1 = [0,1,1,2]
x2 = [2,3,1,1]
df = pd.DataFrame({'id':id, 'x1':x1, 'x2':x2})
df
id x1 x2
1 0 2
1 1 3
2 1 1
3 2 1
Some rows have the same id. I want to sum up such rows (over x1 and x2) to obtain a new dataframe with unique ids:
df_new
id x1 x2
1 1 5
2 1 1
3 2 1
An important detail is that the real number of columns x1, x2,... is large, so I cannot apply a function that requires manual input of column names.

As discussed you can use pandas groupby function to sum based on the id value:
df.groupby(df.id).sum()
# or
df.groupby('id').sum()
If you need don't want id to become the index then you can:
df.groupby('id').sum().reset_index()
# or
df.groupby('id', as_index=False).sum() # #John_Gait

With pivot_table:
In [31]: df.pivot_table(index='id', aggfunc=sum)
Out[31]:
x1 x2
id
1 1 5
2 1 1
3 2 1

Replace column values based on another dataframe python pandas - better way?

Note:for simplicity's sake, i'm using a toy example, because copy/pasting dataframes is difficult in stack overflow (please let me know if there's an easy way to do this).
Is there a way to merge the values from one dataframe onto another without getting the _X, _Y columns? I'd like the values on one column to replace all zero values of another column.
df1:
Name Nonprofit Business Education
X 1 1 0
Y 0 1 0 <- Y and Z have zero values for Nonprofit and Educ
Z 0 0 0
Y 0 1 0
df2:
Name Nonprofit Education
Y 1 1 <- this df has the correct values.
Z 1 1
pd.merge(df1, df2, on='Name', how='outer')
Name Nonprofit_X Business Education_X Nonprofit_Y Education_Y
Y 1 1 1 1 1
Y 1 1 1 1 1
X 1 1 0 nan nan
Z 1 1 1 1 1
In a previous post, I tried combine_First and dropna(), but these don't do the job.
I want to replace zeros in df1 with the values in df2.
Furthermore, I want all rows with the same Names to be changed according to df2.
Name Nonprofit Business Education
Y 1 1 1
Y 1 1 1
X 1 1 0
Z 1 0 1
(need to clarify: The value in 'Business' column where name = Z should 0.)
My existing solution does the following:
I subset based on the names that exist in df2, and then replace those values with the correct value. However, I'd like a less hacky way to do this.
pubunis_df = df2
sdf = df1
regex = str_to_regex(', '.join(pubunis_df.ORGS))
pubunis = searchnamesre(sdf, 'ORGS', regex)
sdf.ix[pubunis.index, ['Education', 'Public']] = 1
searchnamesre(sdf, 'ORGS', regex)

Attention: In latest version of pandas, both answers above doesn't work anymore:
KSD's answer will raise error:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,0,0]],columns=["Name","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1]],columns=["Name","Nonprofit", "Education"])
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name),['Nonprofit', 'Education']].values
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
Out[851]:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
and EdChum's answer will give us the wrong result:
df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
df1
Out[852]:
Name Nonprofit Business Education
0 X 1.0 1 0.0
1 Y 1.0 1 1.0
2 Z NaN 0 NaN
3 Y NaN 1 NaN
Well, it will work safely only if values in column 'Name' are unique and are sorted in both data frames.
Here is my answer:
Way 1:
df1 = df1.merge(df2,on='Name',how="left")
df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
df1['Business_y'] = df1['Business_y'].fillna(df1['Business_x'])
df1.drop(["Business_x","Nonprofit_x"],inplace=True,axis=1)
df1.rename(columns={'Business_y':'Business','Nonprofit_y':'Nonprofit'},inplace=True)
Way 2:
df1 = df1.set_index('Name')
df2 = df2.set_index('Name')
df1.update(df2)
df1.reset_index(inplace=True)
More guide about update.. The columns names of both data frames need to set index are not necessary same before 'update'. You could try 'Name1' and 'Name2'. Also, it works even if other unnecessary row in df2, which won't update df1. In other words, df2 doesn't need to be the super set of df1.
Example:
df1 = pd.DataFrame([["X",1,1,0],
["Y",0,1,0],
["Z",0,0,0],
["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])
df2 = pd.DataFrame([["Y",1,1],
["Z",1,1],
['U',1,3]],columns=["Name2","Nonprofit", "Education"])
df1 = df1.set_index('Name1')
df2 = df2.set_index('Name2')
df1.update(df2)
result:
Nonprofit Business Education
Name1
X 1.0 1 0.0
Y 1.0 1 1.0
Z 1.0 0 1.0
Y 1.0 1 1.0

Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:
In [27]:
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]

In [27]:
This is the correct one.
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
df
Out[27]:
Name Nonprofit Business Education
0 X 1 1 0
1 Y 1 1 1
2 Z 1 0 1
3 Y 1 1 1
[4 rows x 4 columns]
The above will work only when all rows in df1 exists in df . In other words df should be super set of df1
Incase if you have some non matching rows to df in df1,you should follow below
In other words df is not superset of df1 :
df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] =
df1.loc[df1.Name.isin(df.Name),['Nonprofit', 'Education']].values

df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Decreasing column values with Pandas - python

Subtract ID only: #if necessary convert to integers #df['ID'] = df['ID'].astype(int) df['ID'] -= 1 Or maybe: df['ID'] = df.index - 1 For first column: df.insert(0, 'ID', df.index - 1)

Related

Find the latest occurrence of an class item and store how many values are between these two in a pandas DataFrame

ValueError: Columns must be same length as key in pandas

How to drop float values from a column - pandas

Sum up non-unique rows in DataFrame

Replace column values based on another dataframe python pandas - better way?

Categories

Resources