So this is more of a question than a problem I have.
I wanted to .append() some pandas Series together, and without thinking I just did total = series1 + series2 + series3.
The lengths of the Series are 2199902, 171175, and 178989 respectively, and sum(pd.isnull(i) for i in total) = 2214596.
P.S. All 3 Series had no null values to start with. Is it something to do with merging 3 Series of different lengths, which creates missing values? Even if that is the case, why are 2,214,596 null values created?
If you're trying to append Series, you're doing it wrong. The + operator calls .add, which adds the corresponding elements of each Series, aligned by index label. If your Series are not aligned (i.e., a label exists in one Series but not the other), this results in a lot of NaNs being generated.
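To see the alignment behaviour concretely, here is a minimal sketch using the small sample Series the answers below work with (s1 has the default 0..3 index, s2 sits on labels 10 and 11):
import pandas as pd

s1 = pd.Series([1, 2, 4, 5])
s2 = pd.Series([4, 7], index=[10, 11])
s3 = pd.Series([40, 70], index=[2, 4])

# + aligns on index labels before adding; any label missing from either
# operand produces NaN, so s1 + s2 (disjoint indexes) is NaN everywhere
print(s1 + s2)
0 NaN
1 NaN
2 NaN
3 NaN
10 NaN
11 NaN
dtype: float64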
If you're looking to append these together into one long series, you can use pd.concat:
pd.concat([s1, s2, s3], ignore_index=True)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
If you're going to use append (note that Series.append was deprecated in pandas 1.4 and removed in 2.0, so prefer pd.concat on modern versions), you can do this in a loop, or with reduce:
s = s1
for i in [s2, s3]:
    s = s.append(i, ignore_index=True)
s
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
from functools import reduce
reduce(lambda x, y: x.append(y, ignore_index=True), [s1, s2, s3])
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
Both solutions generalise to any number of Series quite nicely, but they are slow in comparison to pd.concat or np.concatenate.
When you sum Series, all indexes are aligned. So if some index exists in series1 but not in another Series, you get NaNs.
So you need to add with fill_value=0:
s = s1.add(s2, fill_value=0).add(s3, fill_value=0)
Sample:
s1 = pd.Series([1,2,4,5])
s2 = pd.Series([4,7], index=[10,11])
s3 = pd.Series([40,70], index=[2,4])
s = s1.add(s2, fill_value=0).add(s3, fill_value=0)
print (s)
0 1.0
1 2.0
2 44.0
3 5.0
4 70.0
10 4.0
11 7.0
dtype: float64
But if you need to append them together (or use concat, as cᴏʟᴅsᴘᴇᴇᴅ mentioned):
s = s1.append(s2, ignore_index=True).append(s3, ignore_index=True)
print (s)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
And a NumPy alternative:
#alternative, thanks cᴏʟᴅsᴘᴇᴇᴅ - np.concatenate([s1, s2, s3])
s = pd.Series(np.concatenate([s1.values, s2.values, s3.values]))
print (s)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
If you want to use + for appending, you need to convert the Series to lists first:
s = pd.Series(s1.tolist() + s2.tolist() + s3.tolist())
print (s)
0 1
1 2
2 4
3 5
4 4
5 7
6 40
7 70
dtype: int64
Related
I have the following dataframe:
df = pd.DataFrame({'timestamp' : [10,10,10,20,20,20], 'idx': [1,2,3,1,2,3], 'v1' : [1,2,4,5,1,9], 'v2' : [1,2,8,5,1,2]})
timestamp idx v1 v2
0 10 1 1 1
1 10 2 2 2
2 10 3 4 8
3 20 1 5 5
4 20 2 1 1
5 20 3 9 2
I'd like to group the data by timestamp and calculate the following aggregate statistic:
np.sum(v1 * v2) for every timestamp. I'd like to see the following result:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
I'm trying to do the following:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)
df.loc[:, 'stat'] = df.groupby('timestamp').apply(calc_some_stat)
But the stat column receives all NaN values. What is wrong in my code?
We want groupby transform here, not groupby apply:
df['stat'] = (df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum')
If we really want to use the function, we need to join back to scale the aggregated result up to the original DataFrame:
def calc_some_stat(d):
    return np.sum(d.v1 * d.v2)

df = df.join(
    df.groupby('timestamp').apply(calc_some_stat)
      .rename('stat'),  # needed to use join, but also sets the col name
    on='timestamp'
)
df:
timestamp idx v1 v2 stat
0 10 1 1 1 37
1 10 2 2 2 37
2 10 3 4 8 37
3 20 1 5 5 44
4 20 2 1 1 44
5 20 3 9 2 44
The issue is that groupby apply is producing summary information:
timestamp
10 37
20 44
dtype: int64
This does not assign back to the DataFrame naturally, as there are only 2 rows where the initial DataFrame has 6. We either need to use join to scale these 2 rows up to align with the original DataFrame, or we can avoid all of this by using groupby transform, which is designed to produce a:
"like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values"
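For comparison, a quick sketch of what transform returns here: a Series the same length as df, with each group's sum broadcast back onto that group's rows, which is why the assignment aligns naturally:
print((df['v1'] * df['v2']).groupby(df['timestamp']).transform('sum'))
0 37
1 37
2 37
3 44
4 44
5 44
dtype: int64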
I have 3 DataFrames, all with over 100 rows and 1000 columns. I am trying to combine these DataFrames into one, in such a way that common columns from each DataFrame are summed up. I understand there is a summation method, pd.DataFrame.sum(), but remember, I have over 1000 columns and I cannot add each common column manually. I am attaching sample DataFrames and the result I want. Help will be appreciated.
#Sample DataFrames.
df_1 = pd.DataFrame({'a':[1,2,3],'b':[2,1,0],'c':[1,3,5]})
df_2 = pd.DataFrame({'a':[1,1,0],'b':[2,1,4],'c':[1,0,2],'d':[2,2,2]})
df_3 = pd.DataFrame({'a':[1,2,3],'c':[1,3,5], 'x':[2,3,4]})
#Result.
df_total = pd.DataFrame({'a':[3,5,6],'b':[4,2,4],'c':[3,6,12],'d':[2,2,2], 'x':[2,3,4]})
df_total
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Let us do pd.concat, then sum over the duplicated column labels:
out = pd.concat([df_1,df_2,df_3],axis=1).sum(level=0,axis=1)
Out[7]:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
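Note that the level argument of sum is deprecated in newer pandas (and removed in 2.0). An equivalent sketch for modern versions, assuming the same df_1, df_2, df_3: transpose so the duplicated labels become the row index, group on them, then transpose back:
out = pd.concat([df_1, df_2, df_3], axis=1).T.groupby(level=0).sum().T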
You can add with fill_value=0:
df_1.add(df_2, fill_value=0).add(df_3, fill_value=0).astype(int)
Output:
a b c d x
0 3 4 3 2 2
1 5 2 6 2 3
2 6 4 12 2 4
Note: pandas intrinsically aligns most operations along indexes (index and column headers).
I have data like this.
I calculate the mean of each ID:
df.groupby(['ID'], as_index=False)['A'].mean()
Now, I want to drop all those IDs whose mean value is more than 3:
df.drop(df[df.A > 3].index)
And this is where I am stuck. I want to save the file in the original format (without grouping and no mean values), but without those IDs whose means were more than 3.
Any idea how I can achieve this? The output should be something like this. Also, I want to know how many unique IDs were removed by the drop.
Use transform to get a Series of means the same size as the original DataFrame, so it is possible to filter by the inverted condition (from > 3 to <= 3) with boolean indexing:
df1 = df[df.groupby('ID')['A'].transform('mean') <= 3]
print (df1)
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
Details:
print (df.groupby('ID')['A'].transform('mean'))
0 2.000000
1 2.000000
2 2.000000
3 6.666667
4 6.666667
5 6.666667
6 2.250000
7 2.250000
8 2.250000
9 2.250000
Name: A, dtype: float64
print (df.groupby('ID')['A'].transform('mean') <= 3)
0 True
1 True
2 True
3 False
4 False
5 False
6 True
7 True
8 True
9 True
Name: A, dtype: bool
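The asker also wanted to know how many unique IDs were removed; a small sketch reusing the same condition as a mask:
mask = df.groupby('ID')['A'].transform('mean') <= 3
df1 = df[mask]
print (df.loc[~mask, 'ID'].nunique())
1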
Another solution uses groupby and filter. This solution is slower than using transform with boolean indexing.
df.groupby('ID').filter(lambda x: x['A'].mean() <= 3)
Output:
ID A
0 1 2
1 1 3
2 1 1
6 3 6
7 3 1
8 3 1
9 3 1
I am looking to create a new column in pandas based on the value in the row. My sample data:
df = pd.DataFrame({"A": ['a','a','a','a','a','a','b','b','b','b'],
                   "Sales": [2,3,7,1,4,3,5,6,9,10],
                   "Week": [1,2,3,4,5,11,1,2,3,4]})
I want a new column "Last3WeekSales" corresponding to each week, holding the sum of sales for the previous 3 weeks.
NOTE: shift() won't work here, as data for some weeks is missing.
Logic which I thought:
Checking the week no. in each row, then summing up the data from w-1, w-2, w-3.
Output required:
A Week Last3WeekSales
0 a 1 0
1 a 2 2
2 a 3 5
3 a 4 12
4 a 5 11
5 a 11 0
6 b 1 0
7 b 2 5
8 b 3 11
9 b 4 20
Use groupby, shift and rolling:
df['Last3WeekSales'] = df.groupby('A')['Sales']\
                         .apply(lambda x: x.shift(1)
                                           .rolling(3, min_periods=1)
                                           .sum())\
                         .fillna(0)
Output:
A Sales Week Last3WeekSales
0 a 2 1 0.0
1 a 3 2 2.0
2 a 7 3 5.0
3 a 1 4 12.0
4 a 4 5 11.0
5 a 3 11 12.0
6 b 5 1 0.0
7 b 6 2 5.0
8 b 9 3 11.0
9 b 10 4 20.0
Note that row 5 (week 11 of group a) gets 12.0 rather than the desired 0: the rolling window is positional, so it sums the last three existing rows and ignores the gap between weeks 5 and 11.
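A sketch that does respect the missing weeks: reindex each group's sales onto the full week range (filling gaps with zero sales) before shifting and rolling, then map the result back onto the rows. The helper name last3 is my own, not from the original answer:
def last3(g):
    # index sales by week and fill the missing weeks with 0,
    # so the window counts calendar weeks rather than rows
    s = g.set_index('Week')['Sales'].reindex(
        range(g['Week'].min(), g['Week'].max() + 1), fill_value=0)
    # sum of the three weeks strictly before each week
    out = s.shift(1).rolling(3, min_periods=1).sum()
    return g['Week'].map(out).fillna(0)

df['Last3WeekSales'] = df.groupby('A', group_keys=False)[['Week', 'Sales']].apply(last3)
This reproduces the desired output above, including 0 for week 11.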
You can use rolling(3).sum() to sum over the last 3 values (the old module-level pd.rolling_sum was removed from pandas long ago), and shift(n) to shift your column by n positions (1 in your case).
If we suppose you have a column 'sales' with the sales of each week, the code would be:
df["Last3WeekSales"] = df.groupby("A")["sales"].apply(lambda x: x.shift(1).rolling(3).sum())
Is it possible to multiply all the columns of a pandas DataFrame together, to get a single value for every row in the DataFrame?
As an example, using
df = pd.DataFrame(np.random.randn(5,3)*10)
I want a new DataFrame df2 where df2.ix[x, 0] will have the value of df.ix[x, 0] * df.ix[x, 1] * df.ix[x, 2].
However, I do not want to hardcode this; how can I use a loop to achieve it?
I found a function df.mul(series, axis=1) but can't figure out a way to use it for my purpose.
You could use DataFrame.prod():
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 7 7 5
1 1 8 6
2 4 8 4
3 2 9 5
4 3 8 7
>>> df.prod(axis=1)
0 245
1 48
2 128
3 90
4 168
dtype: int64
You could also apply np.prod, which is what I'd originally done, but when a direct method is available it is usually faster than apply.
>>> df = pd.DataFrame(np.random.randint(1, 10, (5, 3)))
>>> df
0 1 2
0 9 3 3
1 8 5 4
2 3 6 7
3 9 8 5
4 7 1 2
>>> df.apply(np.prod, axis=1)
0 81
1 160
2 126
3 360
4 14
dtype: int64
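If you specifically want to build the row product out of mul, which the asker mentioned, one sketch is to fold it across the columns with functools.reduce; this is equivalent to df.prod(axis=1):
>>> from functools import reduce
>>> reduce(lambda acc, col: acc.mul(df[col]), df.columns[1:], df[df.columns[0]])
0 81
1 160
2 126
3 360
4 14
dtype: int64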