My df looks as follows:
Index Country Val1 Val2 ... Val10
1 Australia 1 3 ... 5
2 Bambua 12 33 ... 56
3 Tambua 14 34 ... 58
I'd like to subtract Val1 from Val10 for each country, so the output looks like:
Country Val10-Val1
Australia 4
Bambua 23
Tambua 24
So far I've got:
def myDelta(row):
    data = row[['Val10', 'Val1']]
    return pd.Series({'Delta': np.subtract(data)})

def runDeltas():
    myDF = getDF() \
        .apply(myDelta, axis=1) \
        .sort_values(by=['Delta'], ascending=False)
    return myDF
runDeltas results in this error:
ValueError: ('invalid number of arguments', u'occurred at index 9')
What's the proper way to fix this?
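The immediate error comes from np.subtract(data) being called with a single argument; np.subtract is a binary ufunc and needs two. A minimal fix that keeps your row-wise approach (just a sketch; the answers below show simpler vectorized alternatives) would be:
def myDelta(row):
    # subtract the two values explicitly instead of passing one Series to np.subtract
    return pd.Series({'Delta': row['Val10'] - row['Val1']})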
Given the following dataframe:
import pandas as pd
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
It comes down to a simple broadcasting operation:
>>> df["Val1"] - df["Val10"]
0 -4
1 -44
2 -44
dtype: int64
You can also store this into a new column with:
>>> df['Val_1_minus_10'] = df['Val1'] - df['Val10']
>>> df
Country Val1 Val2 Val10 Val_1_minus_10
0 Australia 1 3 5 -4
1 Bambua 12 33 56 -44
2 Tambua 14 34 58 -44
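If you want the exact layout from the question (Country alongside Val10 - Val1, sorted descending as the original runDeltas attempted), a minimal sketch:
# build a small result frame with just the country and the difference
result = df[['Country']].copy()
result['Val10-Val1'] = df['Val10'] - df['Val1']
result = result.sort_values('Val10-Val1', ascending=False)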
Using this as the df:
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
You can also do the subtraction and put it into a new column as follows.
>>> df['Val_Diff'] = df['Val10'] - df['Val1']
>>> df
Country Val1 Val2 Val10 Val_Diff
0 Australia 1 3 5 4
1 Bambua 12 33 56 44
2 Tambua 14 34 58 44
You can also do this with a lambda function passed to apply, assigning the result to a new column:
df['Val10-Val1'] = df.apply(lambda x: x['Val10'] - x['Val1'], axis=1)
print(df)
You can also use the pandas.DataFrame.assign function, e.g.:
import numpy as np
import pandas as pd
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
df = df.assign(Val10_minus_Val1 = df['Val10'] - df['Val1'])
The best part of assign is that you can chain as many assignments as you wish, e.g. computing both the difference and then its log:
df = df.assign(Val10_minus_Val1=df['Val10'] - df['Val1'], log_result=lambda x: np.log(x.Val10_minus_Val1))
Results:
     Country  Val1  Val2  Val10  Val10_minus_Val1  log_result
0  Australia     1     3      5                 4    1.386294
1     Bambua    12    33     56                44    3.784190
2     Tambua    14    34     58                44    3.784190
Though it's an old question, pandas also allows subtracting two DataFrames or Series using pandas.DataFrame.subtract:
import pandas as pd
df = pd.DataFrame([["Australia", 1, 3, 5],
                   ["Bambua", 12, 33, 56],
                   ["Tambua", 14, 34, 58]],
                  columns=["Country", "Val1", "Val2", "Val10"])
df["Val1"].subtract(df["Val2"])
Output:
0 -2
1 -21
2 -20
dtype: int64
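The snippet above subtracts Val2 for illustration; for the question's Val10 - Val1, the same method gives:
df["Val10"].subtract(df["Val1"])
0     4
1    44
2    44
dtype: int64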
You can also use eval here:
In [12]: df.eval('Val10_minus_Val1 = Val10-Val1', inplace=True)
In [13]: df
Out[13]:
Country Val1 Val2 Val10 Val10_minus_Val1
0 Australia 1 3 5 4
1 Bambua 12 33 56 44
2 Tambua 14 34 58 44
Since inplace=True you don't have to assign it back to df.
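If you'd rather not mutate df in place, the non-inplace form returns a new DataFrame you can assign back, e.g.:
df = df.eval('Val10_minus_Val1 = Val10 - Val1')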
Something I faced today is worth sharing. As mentioned above, you can simply use:
df['Val10-Val1'] = df['Val10'] - df['Val1']
but sometimes you might need the apply function, in which case you can use the following line (note the axis=1, which applies the lambda row-wise):
df['Val10-Val1'] = df.apply(lambda row: row['Val10'] - row['Val1'], axis=1)
I have a dataframe like this:
import pandas as pd
data1 = {
"siteID": [1, 2, 3, 1, 2, 'nan', 'nan', 'nan'],
"date": [42, 30, 43, 29, 26, 34, 10, 14],
}
df = pd.DataFrame(data1)
But I want to delete any duplicates in siteID, keeping only the most up-to-date value AND keeping all 'nan' values.
I get close with this code:
df_no_dup = df.sort_values('date').drop_duplicates('siteID', keep='last')
which only keeps the siteID with the highest date value. The issue is that most of the rows with 'nan' for siteID are being removed when I want to ignore them all. Is there any way to keep all the rows where siteID is equal to 'nan'?
Expected output:
siteID date
nan 10
nan 14
2 30
nan 34
1 42
3 43
I would use df.duplicated to build a custom condition, like this:
df.drop(df[df.sort_values('date').duplicated('siteID', keep='last') & (df.siteID!='nan')].index)
Result
siteID date
0 1 42
1 2 30
2 3 43
5 nan 34
6 nan 10
7 nan 14
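An alternative sketch (assuming, as in the example, that the missing IDs are the literal string 'nan'): deduplicate only the non-'nan' rows, concatenate the 'nan' rows back, and sort by date to match the expected output:
# keep the latest row per real siteID, then append all 'nan' rows untouched
not_nan = df[df['siteID'] != 'nan'].sort_values('date').drop_duplicates('siteID', keep='last')
result = pd.concat([not_nan, df[df['siteID'] == 'nan']]).sort_values('date')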
I have a pandas DataFrame df that looks like:
df =
sample col1 data_value time_stamp
A 1 15 0.5
A 1 45 0.5
A 1 32 0.5
A 2 3 1
A 2 57 1
A 2 89 1
B 1 10 0.5
B 1 20 0.5
B 1 30 0.5
B 2 12 1
B 2 24 1
B 2 36 1
For a given sample and its respective column, I am trying to condense all data values into a numpy array in a new column merged_data to look like:
sample col1 merged_data time_stamp
A 1 [15, 45, 32] 0.5
A 2 [3, 57, 89] 1
B 1 [10, 20, 30] 0.5
B 2 [12, 24, 36] 1
I've tried df['merged_data'] = df.to_numpy() and df['merged_data'] = np.array(df.iloc[0:2, :].to_numpy()), but they don't work. All elements in the merged_data column need to be numpy arrays or lists (it's easy to convert between the two).
Lastly, I need to retain the time_stamp column for each combination of sample and col1. How can I include this in the groupby?
Any help or thoughts would be greatly appreciated!
You can do this:
df = df.groupby(['sample','col1'], as_index=False)['data_value'].agg(list)
Output:
sample col1 data_value
0 A 1 [15, 45, 32]
1 A 2 [3, 57, 89]
2 B 1 [10, 20, 30]
3 B 2 [12, 24, 36]
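To also keep time_stamp as asked (assuming it is constant within each sample/col1 group, as in the example data), one option is named aggregation:
merged = (df.groupby(['sample', 'col1'], as_index=False)
            .agg(merged_data=('data_value', list),      # collect the values into a list
                 time_stamp=('time_stamp', 'first')))   # keep the group's time stamp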
If the number of values in each group is identical, you can use:
import numpy as np
a = np.vstack(df.groupby(['sample','col1'])['data_value'].agg(list))
Or:
a = (df
     .assign(col=lambda d: d.groupby(['sample', 'col1']).cumcount())
     .pivot(index=['sample', 'col1'], columns='col', values='data_value')
     .to_numpy()
     )
output:
array([[15, 45, 32],
[ 3, 57, 89],
[10, 20, 30],
[12, 24, 36]])
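Note that np.vstack only stacks cleanly because every group here has the same number of values. If group sizes differ, a fallback sketch is a 1-D object array whose elements are the per-group lists:
a = df.groupby(['sample', 'col1'])['data_value'].agg(list).to_numpy()
# a[0] is [15, 45, 32], a[1] is [3, 57, 89], and so on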
I have a dataframe with three columns: id, horstid, and date. The date column has one NaN value. The code below works with pandas; I want to do the same with numpy.
First I want to transform my dataframe to a numpy array. Then I want to find all rows where the date is NaN and print them, and finally remove those rows. How can I do this in numpy?
This is my dataframe
id horstid date
0 1 11 2008-09-24
1 2 22 NaN
2 3 33 2008-09-18
3 4 33 2008-10-24
This is my code. It works fine, but it uses pandas.
d = {'id': [1, 2, 3, 4], 'horstid': [11, 22, 33, 33], 'date': ['2008-09-24', np.nan, '2008-09-18', '2008-10-24']}
df = pd.DataFrame(data=d)
df['date'].isna()
[OUT]
0 False
1 True
2 False
3 False
df.drop(df.index[df['date'].isna() == True])
[OUT]
id horstid date
0 1 11 2008-09-24
2 3 33 2008-09-18
3 4 33 2008-10-24
What I want is the above code without pandas but with numpy.
npArray = df.to_numpy()
date = npArray[:, 2].astype(np.datetime64)
[OUT]
ValueError: Cannot create a NumPy datetime other than NaT with generic units
Here's a solution based on NumPy and pure Python:
df = pd.DataFrame.from_dict(dict(horstid=[11, 22, 33, 33], id=[1, 2, 3, 4], date=['2008-09-24', np.nan, '2008-09-18', '2008-10-24']))
a = df.values
index = list(map(lambda x: type(x) != type(1.), a[:, 2]))  # NaN is a float, so keep only the non-float entries
print(a[index, :])
[[11 1 '2008-09-24']
[33 3 '2008-09-18']
[33 4 '2008-10-24']]
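Another sketch, working from the question's column order (id, horstid, date), uses pd.isna, which also handles object-dtype numpy arrays element-wise:
a = df.to_numpy()
mask = pd.isna(a[:, 2])   # True where the date column is NaN
print(a[mask])            # the rows with a missing date
a_clean = a[~mask]        # all rows with a valid date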
This is the current dataframe I have. It is Nx1, with each cell containing a numpy array.
print (df)
age
0 [35, 34, 55, 56]
1 [25, 34, 35, 66]
2 [45, 35, 53, 16]
.
.
.
N [45, 35, 53, 16]
I would like to somehow ravel the values of each cell into new columns.
# do conversion
print (df)
age1 age2 age3 age4
0 35 34 55 56
1 25 34 35 66
2 45 35 53 16
.
.
.
N 45 35 53 16
You can reconstruct the dataframe from the lists, and customize the column names with:
df = pd.DataFrame(df.age.values.tolist())
df.columns += 1
df = df.add_prefix('age')
print(df)
age1 age2 age3 age4
0 35 34 55 56
1 25 34 35 66
...
Here is another alternative:
import pandas as pd
df = pd.DataFrame({'age':[[35,34,55,54],[1,2,3,4],[5,6,7,8],[9,10,11,12]]})
df['age_aux'] = df['age'].astype(str).str.split(',')
for i in range(4):
    df['age_'+str(i)] = df['age_aux'].str.get(i).map(lambda x: x.lstrip('[').rstrip(']'))
df = df.drop(columns=['age','age_aux'])
print(df)
Output:
age_0 age_1 age_2 age_3
0 35 34 55 54
1 1 2 3 4
2 5 6 7 8
3 9 10 11 12
You can create the DataFrame with the constructor for better performance and change the column names with rename and an f-string:
df1 = (pd.DataFrame(df.age.values.tolist(), index=df.index)
         .rename(columns=lambda x: f'age{x+1}'))
Another variation is to apply pd.Series to the column and massage the column names:
df= pd.DataFrame( { "age": [[1,2,3,4],[2,3,4,5]] })
df = df["age"].apply(pd.Series)
df.columns = ["age1","age2","age3","age4"]
I have 2 dataframes:
df = pd.DataFrame({'begin': [10, 20, 30, 40, 50],
                   'end': [15, 23, 36, 48, 56]})
begin end
0 10 15
1 20 23
2 30 36
3 40 48
4 50 56
df2 = pd.DataFrame({'begin2': [12, 13, 22, 40],
                    'end2': [14, 13, 26, 48]})
begin2 end2
0 12 14
1 13 13
2 22 26
3 40 48
How can I get the rows of df2 whose intervals fall within the rows of df1? I want each row of df2 to be compared against all rows of df1.
That is, I want a df3 like:
begin2 end2
0 12 14
1 13 13
3 40 48
I tried:
df3 = df2.loc[ (df['begin'] <= df2['begin2']) & (df2['end2'] <= df['end'] )]
But it only compares row by row and requires the dataframes to be the same size.
You need apply with boolean indexing:
df = df2[df2.apply(lambda x: any((df['begin'] <= x['begin2']) &
                                 (x['end2'] <= df['end'])), axis=1)]
print (df)
begin2 end2
0 12 14
1 13 13
3 40 48
Detail:
print (df2.apply(lambda x: any((df['begin'] <= x['begin2']) &
                               (x['end2'] <= df['end'])), axis=1))
0 True
1 True
2 False
3 True
dtype: bool
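A vectorized alternative sketch using numpy broadcasting, which compares every df2 interval against every df interval at once and avoids the row-wise apply:
import numpy as np

b2 = df2['begin2'].to_numpy()[:, None]   # shape (len(df2), 1)
e2 = df2['end2'].to_numpy()[:, None]
# (len(df2), len(df)) matrix: True where df2's interval lies inside df's interval
contained = (df['begin'].to_numpy() <= b2) & (e2 <= df['end'].to_numpy())
df3 = df2[contained.any(axis=1)]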