Adding values of two Pandas series with different column names - python

I have two pandas series of the same length but with different column names. How can one add the values in them?
series.add(other, fill_value=0, axis=0) does avoid NaN values, but the values are not added. Instead, the result is a concatenation of the two series.
Is there a way to obtain a new series consisting of the sum of the values in two series?

Mismatched indices
The issue is that your two series have different indices. Here's an example:
import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan, 3, np.nan, 5], index=np.arange(5))
s2 = pd.Series([np.nan, 7, 8, np.nan, np.nan], index=np.arange(5) + 10)
print(s1.add(s2, fill_value=0, axis=0))
0     1.0
1     NaN
2     3.0
3     NaN
4     5.0
10    NaN
11    7.0
12    8.0
13    NaN
14    NaN
dtype: float64
You have two options: remap the index of one series (for example, via a dictionary) so the indices align, or disregard the indices and add your series positionally.
Map index of one series to align with the other
You can use a dictionary to realign. The mapping below is arbitrary. NaN values occur where, after reindexing, values in both series are NaN:
index_map = dict(zip(np.arange(5) + 10, [3, 2, 4, 0, 1]))
s2.index = s2.index.map(index_map)
print(s1.add(s2, fill_value=0, axis=0))
0     1.0
1     NaN
2    10.0
3     NaN
4    13.0
dtype: float64
Disregard indices; use positional location only
In this case, you can either construct a new series with the regular pd.RangeIndex as index (i.e. 0, 1, 2, ...), or use an index from one of the input series:
# normalized index
res = pd.Series(s1.values + s2.values)
# take index from s1
res = pd.Series(s1.values + s2.values, index=s1.index)

The values attribute lets you access the underlying raw numpy arrays. You can add those.
raw_sum = series.values + other.values
series2 = pd.Series(raw_sum, index=series.index)
This also works:
series2 = series + other.values
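Note that plain + without a fill_value propagates NaN wherever either operand is missing. A minimal sketch, reusing the s1 and s2 defined above (my example, not part of the original answer):

import numpy as np
import pandas as pd

s1 = pd.Series([1, np.nan, 3, np.nan, 5], index=np.arange(5))
s2 = pd.Series([np.nan, 7, 8, np.nan, np.nan], index=np.arange(5) + 10)

# .values strips the index, so the addition is purely positional,
# but NaN in either operand still yields NaN in the result
print(s1 + s2.values)
# 0     NaN
# 1     NaN
# 2    11.0
# 3     NaN
# 4     NaN
# dtype: float64

# To also ignore NaNs positionally, np.nansum over the stacked values works
# (note that positions where both are NaN become 0.0, unlike fill_value):
print(pd.Series(np.nansum([s1.values, s2.values], axis=0), index=s1.index))
# 0     1.0
# 1     7.0
# 2    11.0
# 3     0.0
# 4     5.0
# dtype: float64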

Related

How to iterate over rows and multiple columns in pandas?

I have a dataframe (df1) and I want to replace the values for the columns V2 and V3 if they have the same value as V1.
import pandas as pd
import numpy as np
df_start= pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[10,5,20,17,15], "V3":[10, 25, 15, 10, 20]})
df_end = pd.DataFrame({"ID":[1, 2 , 3 ,4, 5], "V1":[10,5,15,20,20], "V2":[np.nan,np.nan,20,17,15], "V3":[np.nan, 25, np.nan, 10, np.nan]})
I know iterrows is not recommended but I don't know what I should do.
You can use mask:
For a separate dataframe, use assign:
df_end = df_start.assign(**df_start[['V2', 'V3']]
                         .mask(df_start[['V2', 'V3']].eq(df_start['V1'], axis=0)))
To modify the input dataframe in place, just assign:
df_start[['V2', 'V3']] = (df_start[['V2', 'V3']]
                          .mask(df_start[['V2', 'V3']].eq(df_start['V1'], axis=0)))
   ID  V1    V2    V3
0   1  10   NaN   NaN
1   2   5   NaN  25.0
2   3  15  20.0   NaN
3   4  20  17.0  10.0
4   5  20  15.0   NaN
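mask replaces values with NaN wherever the condition is True, and .eq(df_start['V1'], axis=0) compares each of V2 and V3 against V1 row by row. A minimal standalone sketch of that behaviour (my toy data, not from the original answer):

import pandas as pd

s = pd.Series([10, 5, 20])
print(s.mask(s.eq(10)))  # values equal to 10 become NaN
# 0     NaN
# 1     5.0
# 2    20.0
# dtype: float64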
You'll still use a regular loop to go through the columns, but the apply function is your best friend for this kind of row-wise operation. If you're going to use info from more than one column (here you're comparing some column and "V1"), use apply on the DataFrame and specify the axis. If you only need info from one column (like making a column that doubles values from V1 if they're even), you can use apply on just a Series, as sketched after the loop below.
For both versions of the function, the argument you pass is a lambda expression. If you apply it to a DataFrame like you are here, x represents a row whose values can be indexed by column name. Finally, you assign the result back to a new or existing column in your DataFrame.
Assuming that df_start and df_end represent your planned input and output:
cols = ["V2", "V3"]
for col in cols:
    df_start[col] = df_start.apply(lambda x: x[col] if x[col] != x["V1"] else np.nan, axis=1)
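A minimal sketch of the single-column Series.apply case mentioned above (the V1_doubled column name is made up for illustration):

# Double V1 where it is even, otherwise keep it unchanged
df_start["V1_doubled"] = df_start["V1"].apply(lambda v: v * 2 if v % 2 == 0 else v)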

Need to combine multiple rows based on index

I have a dataframe with values like
     0    1    2
a    5  NaN    6
a  NaN    2  NaN
I need to combine the two rows based on the index 'a', which is the same in both rows. I also need to add the multiple columns together and output a single column.
I need the output as below; the value is 13 since it adds 5, 2 and 6:
    0
a  13
I tried this using the concat function but got errors.
How about using pandas DataFrame.sum()?
import pandas as pd
import numpy as np
data = pd.DataFrame({"0": [5, np.nan], "1": [np.nan, 2], "2": [6, np.nan]})
row_total = data.sum(axis=1, skipna=True)
row_total.sum(axis=0)
result:
13.0
EDIT: @Chris's comment (which I did not see while writing my answer) shows how to do it in one line, if all rows share the same index.
data:
data = pd.DataFrame({"0": [5, np.nan],
                     "1": [np.nan, 2],
                     "2": [6, np.nan]},
                    index=['a', 'a'])
gives:
     0    1    2
a  5.0  NaN  6.0
a  NaN  2.0  NaN
Then
data.groupby(data.index).sum().sum(axis=1)
Returns
a    13.0
dtype: float64
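For clarity, that one-liner can be split into its two steps (a sketch using the indexed data above):

combined = data.groupby(data.index).sum()  # collapse the duplicate 'a' rows; NaNs count as 0
print(combined)
#      0    1    2
# a  5.0  2.0  6.0
print(combined.sum(axis=1))  # then add across the columns
# a    13.0
# dtype: float64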

Pandas how to place an array in a single dataframe cell?

So I currently have a dataframe (shown as a screenshot in the original post). I want to add a completely new column called "Predictors" with only one cell that contains an array.
So [0, 'Predictors'] should contain an array and everything below that cell in the same column should be empty.
Here's my attempt, I tried to create a separate dataframe that just contained the "Predictors" column, and tried appending it to the current dataframe, but I get: 'Length mismatch: Expected axis has 3 elements, new values have 4 elements.'
How do I append a single cell containing an array to my dataframe?
# create a list and dataframe to hold the names of predictors
dataframe=dataframe.drop(['price','Date'],axis=1)
predictorsList = dataframe.columns.get_values().tolist()
predictorsList = np.array(predictorsList, dtype=object)
# Combine actual and forecasted lists to one dataframe
combinedResults = pd.DataFrame({'Actual': actual, 'Forecasted': forecasted})
predictorsDF = pd.DataFrame({'Predictors': [predictorsList]})
# Add Predictors to dataframe
#combinedResults.at[0, 'Predictors'] = predictorsList
pd.concat([combinedResults,predictorsDF], ignore_index=True, axis=1)
You could fill the rest of the cells in the desired column with NaN, but they will not be "empty". To do that, use pd.merge on both indexes:
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Actual': [18.442, 15.4233, 20.6217, 16.7, 18.185],
    'Forecasted': [19.6377, 13.1665, 19.3992, 17.4557, 14.0053]
})
arr = np.zeros(3)
df_arr = pd.DataFrame({'Predictors': [arr]})
Merging df and df_arr
result = pd.merge(
    df,
    df_arr,
    how='left',
    left_index=True,   # merge on both indexes, since the right frame only has index 0...
    right_index=True   # all the other rows will be NaN
)
Results
>>> print(result)
    Actual  Forecasted       Predictors
0  18.4420     19.6377  [0.0, 0.0, 0.0]
1  15.4233     13.1665              NaN
2  20.6217     19.3992              NaN
3  16.7000     17.4557              NaN
4  18.1850     14.0053              NaN
>>> result.loc[0, 'Predictors']
array([0., 0., 0.])
>>> result.loc[1, 'Predictors'] # actually contains a NaN value
nan
You need to change the dtype of the column (in your case Predictors) to object first:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list('abcd'))
df = df.astype(object)  # this line allows the assignment of an array to a cell
df.iloc[1, 2] = np.array([99, 99, 99])
print(df)
gives
    a   b             c   d
0   0   1             2   3
1   4   5  [99, 99, 99]   7
2   8   9            10  11
3  12  13            14  15
4  16  17            18  19
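The same idea makes the asker's commented-out .at attempt work, provided the column is created with object dtype first. A sketch with made-up predictor names, not part of the original answer:

import numpy as np
import pandas as pd

combinedResults = pd.DataFrame({'Actual': [18.442, 15.4233],
                                'Forecasted': [19.6377, 13.1665]})
# create the column as object dtype so a single cell can hold a whole array
combinedResults['Predictors'] = pd.Series(dtype=object)
combinedResults.at[0, 'Predictors'] = np.array(['open', 'high', 'low'], dtype=object)
print(combinedResults)
#     Actual  Forecasted         Predictors
# 0  18.4420     19.6377  [open, high, low]
# 1  15.4233     13.1665                NaN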

Can I get a trimmed mean of all columns in a dataframe with nan values?

The problem is that I want to get the trimmed mean of all the columns in a pandas dataframe (i.e. the mean of the values in a given column, excluding the max and the min values). It's likely that some columns will have nan values. Basically, I want to get the exact same functionality as the pandas.DataFrame.mean function, except that it's the trimmed mean.
The obvious solution is to use the scipy tmean function, and iterate over the df columns. So I did:
import scipy as sp

trim_mean = []
for i in data_clean3.columns:
    trim_mean.append(sp.tmean(data_clean3[i]))
This worked great, until I encountered nan values, which caused tmean to choke. Worse, when I dropped the nan values in the dataframe, there were some datasets that were wiped out completely as they had an nan value in every column. This means that when I amalgamate all my datasets into a master set, there'll be holes on the master set where the trimmed mean should be.
Does anyone know of a way around this? As in, is there a way to get tmean to behave like the standard scipy stats functions and ignore nan values?
(Note that my code is calculating a big number of descriptive statistics on large datasets with limited hardware; highly involved or inefficient workarounds might not be optimal. Hopefully, though, I'm just missing something simple.)
(EDIT: Someone suggested in a comment (that has since vanished?) that I should use the scipy trim_mean function, which allows you to top and tail a specific proportion of the data. This is just to say that this solution won't work for me, as my datasets are of unequal sizes, so I cannot specify a fixed proportion of data that will be OK to remove in every case; it must always be just the max and the min values.)
Consider this df:
np.random.seed()
data = np.random.choice((0, 25, 35, 100, np.nan),
                        (1000, 2),
                        p=(.01, .39, .39, .01, .2))
df = pd.DataFrame(data, columns=list('AB'))
Construct your mean using sums, and divide by the relevant normalizer (the count of non-null values, minus the 2 trimmed extremes):
(df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)
A    29.707674
B    30.402228
dtype: float64
df.mean()
A    29.756987
B    30.450617
dtype: float64
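If this is needed repeatedly, the expression wraps naturally into a helper. A sketch; note that if the max or min value occurs more than once, only one occurrence is excluded, matching the "just the max and the min values" requirement:

def trimmed_mean(df):
    # column-wise mean excluding each column's single max and min, NaNs ignored;
    # assumes every column has at least three non-null values
    return (df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)

trimmed = trimmed_mean(df)  # df as constructed above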
You could use df.mean(skipna=True); see DataFrame.mean.
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[5, 1, 'a'], [6, 2, 'b'], [7, 3, 'd'],
                    [np.nan, 4, 'e'], [9, 5, 'f'], [5, 1, 'g']],
                   columns=["A", "B", "C"])
print(df1)
df1 = df1[df1.A != df1.A.max()]  # remove rows holding the max value of A
df1 = df1[df1.A != df1.A.min()]  # remove rows holding the min value of A
print("\nDataframe after removing max and min\n")
print(df1)
print("\nMean of A\n")
print(df1["A"].mean(skipna=True))
output
     A  B  C
0  5.0  1  a
1  6.0  2  b
2  7.0  3  d
3  NaN  4  e
4  9.0  5  f
5  5.0  1  g
Dataframe after removing max and min
     A  B  C
1  6.0  2  b
2  7.0  3  d
3  NaN  4  e
Mean of A
6.5

Concatenating Columns Pandas

I'm trying to concatenate several columns which mostly contain NaNs to one, but here is an example on 2 only:
2013-06-18 21:46:33.422096-05:00 A NaN
2013-06-18 21:46:35.715770-05:00 A NaN
2013-06-18 21:46:42.669825-05:00 NaN B
2013-06-18 21:46:45.409733-05:00 A NaN
2013-06-18 21:46:47.130747-05:00 NaN B
2013-06-18 21:46:47.131314-05:00 NaN B
This could go on for 3 or 4 or 10 columns, always 1 being pd.notnull() and the rest are NaN.
I want to concatenate these into 1 column the fastest way possible. How can I do this?
You have one string per row and the other cells are NaN, so the operation to apply is simply the row-wise max:
df.max(axis=1)
As per a comment, if it doesn't work in Python 3 (where strings and floats cannot be compared), convert your NaNs to strings first:
df.fillna('').max(axis=1)
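A quick sketch of the max() trick on a toy frame with exactly one non-null string per row (my data, not the asker's):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['A', 'A', np.nan], 'y': [np.nan, np.nan, 'B']})
# '' sorts below any letter, so the single real string wins the row-wise max
print(df.fillna('').max(axis=1))
# 0    A
# 1    A
# 2    B
# dtype: object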
You could do
In [278]: df = pd.DataFrame([[1, np.nan], [2, np.nan], [np.nan, 3]])
In [279]: df
Out[279]:
     0    1
0  1.0  NaN
1  2.0  NaN
2  NaN  3.0

In [280]: df.sum(axis=1)
Out[280]:
0    1.0
1    2.0
2    3.0
dtype: float64
Since NaNs are treated as 0 when summed, they don't show up.
A couple of caveats: you need to be sure that only one of the columns has a non-NaN value in each row for this to work. It will also only work on numeric data.
You can also use
df.fillna(method='ffill', axis=1).iloc[:, -1]
The last column will now contain all the valid observations, since the valid values have been filled forward. See the fillna documentation. This approach should be more flexible than max or sum, but slower. iloc[:, -1] then selects every row and the last column.
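On the numeric df from the previous answer, the forward-fill version gives the same result (a sketch; df.ffill(axis=1) is the modern spelling of fillna(method='ffill', axis=1)):

df.fillna(method='ffill', axis=1).iloc[:, -1]
# 0    1.0
# 1    2.0
# 2    3.0
# dtype: float64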
