I have two pandas Series, each holding one column of values. I'm pasting them exactly as they print below:
Top: (it has no name, as it is the result of Top = Df1.groupby('col1')['att1'].diff().dropna())
1 15.566667
3 5.066667
5 57.266667
7 -10.366667
9 18.966667
11 50.966667
13 -5.633333
15 -14.266667
17 18.933333
19 3.100000
21 35.966667
23 -17.566667
25 -8.066667
27 -6.366667
29 7.133333
31 -2.633333
33 3.333333
35 -23.800000
37 2.333333
39 -53.533333
41 -17.300000
dtype: float64
Bottom: which is the result of Bottom = np.sqrt(Df2.groupby('ID')['Col2'].sum()/n)
ID
12868123 1.029001
757E13D7 1.432014
79731492 2.912770
799EFB29 1.826576
7D44062A 1.736757
7D4C0E2F 1.943503
7DBA169D 0.650023
7E558E2B 1.256287
7E8B3815 1.491974
7EB80123 0.558717
7FFB607D 1.505221
8065A321 1.809937
80EFE91B 2.064825
811F1B1E 0.992645
82B67C94 0.980618
833C27AE 0.969195
83957B28 0.469914
8447B85D 1.477168
84877498 0.872973
8569499D 2.215307
8617B7D9 1.033294
Name: Col2, dtype: float64
I want to divide the values of those two Series by each other:
Top/Bottom
I get the following:
1 NaN
3 NaN
5 NaN
7 NaN
9 NaN
11 NaN
13 NaN
15 NaN
17 NaN
19 NaN
21 NaN
23 NaN
25 NaN
27 NaN
29 NaN
31 NaN
33 NaN
35 NaN
37 NaN
39 NaN
41 NaN
12868123 NaN
757E13D7 NaN
79731492 NaN
799EFB29 NaN
7D44062A NaN
7D4C0E2F NaN
7DBA169D NaN
7E558E2B NaN
7E8B3815 NaN
7EB80123 NaN
7FFB607D NaN
8065A321 NaN
80EFE91B NaN
811F1B1E NaN
82B67C94 NaN
833C27AE NaN
83957B28 NaN
8447B85D NaN
84877498 NaN
8569499D NaN
8617B7D9 NaN
dtype: float64
I tried resetting the index, but it didn't help, and I'm not sure why it's not working.
The problem is the different index values: arithmetic operations align Series by index, so you need to cast one side to a NumPy array via .values:
print (Top/Bottom.values)
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
dtype: float64
Solution with div:
print (Top.div(Bottom.values))
1 15.127942
3 3.538141
5 19.660552
7 -5.675464
9 10.920737
11 26.224126
13 -8.666359
15 -11.356216
17 12.690123
19 5.548426
21 23.894609
23 -9.705679
25 -3.906707
27 -6.413841
29 7.274324
31 -2.717031
33 7.093496
35 -16.111911
37 2.672858
39 -24.165198
41 -16.742573
dtype: float64
But if you assign one Series' index values to the other, you can divide directly:
Top.index = Bottom.index
print (Top/Bottom)
ID
12868123 15.127942
757E13D7 3.538141
79731492 19.660552
799EFB29 -5.675464
7D44062A 10.920737
7D4C0E2F 26.224126
7DBA169D -8.666359
7E558E2B -11.356216
7E8B3815 12.690123
7EB80123 5.548426
7FFB607D 23.894609
8065A321 -9.705679
80EFE91B -3.906707
811F1B1E -6.413841
82B67C94 7.274324
833C27AE -2.717031
83957B28 7.093496
8447B85D -16.111911
84877498 2.672858
8569499D -24.165198
8617B7D9 -16.742573
dtype: float64
And if you get an error like:
ValueError: operands could not be broadcast together with shapes (20,) (21,)
the problem is that the Series have different lengths.
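To make the alignment behavior concrete, here is a minimal sketch (with made-up numbers) showing the failure and the fixes side by side:
import pandas as pd

# same length, completely different index labels
top = pd.Series([10.0, 20.0, 30.0], index=[1, 3, 5])
bottom = pd.Series([2.0, 4.0, 5.0], index=['a', 'b', 'c'])

print (top / bottom)            # all NaN: no index labels overlap
print (top / bottom.values)     # works: right side is a plain array
print (top.div(bottom.values))  # same, via div

# or drop both indices so the division is purely positional
print (top.reset_index(drop=True) / bottom.reset_index(drop=True))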
I arrived here because I was looking for how to divide a column by a subset of itself.
I found a solution which is not reported here.
Suppose you have a df like
d = {'mycol1':[0,0,1,1,2,2],'mycol2':[1,2,3,6,4,8]}
df = pd.DataFrame(data=d)
i.e.
mycol1 mycol2
0 0 1
1 0 2
2 1 3
3 1 6
4 2 4
5 2 8
And now you want to divide mycol2 by the subset composed of the first two values:
df['mycol2'].div(df[df['mycol1']==0.0]['mycol2'])
will result in
0 1.0
1 1.0
2 NaN
3 NaN
4 NaN
5 NaN
because of the index problem reported by jezrael.
The solution is to simply use concat to concatenate the subset to match the length of the original df.
Nrows = df[df['mycol1']==0.0]['mycol2'].shape[0]
Nrows_tot = df['mycol2'].shape[0]
times_longer = int(Nrows_tot/Nrows)
df['mycol3'] = df['mycol2'].div(pd.concat([df[df['mycol1']==0.0]['mycol2']]*times_longer,ignore_index=True))
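A minimal runnable sketch of the same approach, using the example df above (note the tiling only works because the subset length divides the total length evenly):
import pandas as pd

d = {'mycol1': [0, 0, 1, 1, 2, 2], 'mycol2': [1, 2, 3, 6, 4, 8]}
df = pd.DataFrame(data=d)

subset = df.loc[df['mycol1'] == 0.0, 'mycol2']
times_longer = len(df) // len(subset)          # must divide evenly
repeated = pd.concat([subset] * times_longer, ignore_index=True)
df['mycol3'] = df['mycol2'].div(repeated)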
Related
I have one data frame (df1) with 5 columns and another (df2) with 10 columns. I want to add the columns from df2 to df1, but only the column names (without values). I also want to do the same in reverse, adding the columns without values from df1 to df2.
Here are the data frames:
df1
A B C D E
1 234 52 1 54
54 23 87 5 125
678 67 63 8 18
45 21 36 5 65
8 5 24 3 13
df2
F G H I J K L M N O
12 34 2 17 4 19 54 7 58 123
154 3 7 53 25 2 47 27 84 6
78 7 3 82 8 56 21 29 547 1
And I want to get this:
df1
A B C D E F G H I J K L M N O
1 234 52 1 54
54 23 87 5 125
678 67 63 8 18
45 21 36 5 65
8 5 24 3 13
And this:
df2
A B C D E F G H I J K L M N O
12 34 2 17 4 19 54 7 58 123
154 3 7 53 25 2 47 27 84 6
78 7 3 82 8 56 21 29 547 1
I tried df.columns.values and got the array of column names, but then I have to apply them as data frame columns and give them empty values, and the way I am doing that now takes too many lines of code. Is there an easier way to do this?
I will appreciate any help.
Use Index.union with DataFrame.reindex:
cols = df1.columns.union(df2.columns)
#if order is important
#cols = df1.columns.append(df2.columns)
df1 = df1.reindex(columns=cols)
df2 = df2.reindex(columns=cols)
print (df1)
A B C D E F G H I J K L M N O
0 1 234 52 1 54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 54 23 87 5 125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 678 67 63 8 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 45 21 36 5 65 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 8 5 24 3 13 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
print (df2)
A B C D E F G H I J K L M N O
0 NaN NaN NaN NaN NaN 12 34 2 17 4 19 54 7 58 123
1 NaN NaN NaN NaN NaN 154 3 7 53 25 2 47 27 84 6
2 NaN NaN NaN NaN NaN 78 7 3 82 8 56 21 29 547 1
If both DataFrames have the same index values, you can use DataFrame.align:
print (df1)
A B C D E
0 1 234 52 1 54
1 54 23 87 5 125
2 678 67 63 8 18
df1, df2 = df1.align(df2)
print (df1)
A B C D E F G H I J K L M N O
0 1 234 52 1 54 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 54 23 87 5 125 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 678 67 63 8 18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
print (df2)
A B C D E F G H I J K L M N O
0 NaN NaN NaN NaN NaN 12 34 2 17 4 19 54 7 58 123
1 NaN NaN NaN NaN NaN 154 3 7 53 25 2 47 27 84 6
2 NaN NaN NaN NaN NaN 78 7 3 82 8 56 21 29 547 1
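If only the columns should be aligned and the rows left alone (for example when the two frames have different numbers of rows), align takes an axis argument; a short sketch:
df1, df2 = df1.align(df2, axis=1)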
I have a DataFrame in Python where the cell values are purchase quantities, like:
code 1/18 2/18 3/18 4/18 5/18
1 NaN 15 15 16 14
2 NaN NaN 30 23 24
3 24 21 23 NaN 26
I want to order the codes by the date they were first purchased, so the result would be:
code 1/18 2/18 3/18 4/18 5/18
3 24 21 23 NaN 26
1 NaN 15 15 16 14
2 NaN NaN 30 23 24
Please help!
I think you need to specify the columns for sorting by indexing, here all columns except the first:
print (df.columns[1:].tolist())
['1/18', '2/18', '3/18', '4/18', '5/18']
df = df.sort_values(by=df.columns[1:].tolist())
print (df)
code 1/18 2/18 3/18 4/18 5/18
2 3 24.0 21.0 23 NaN 26
0 1 NaN 15.0 15 16.0 14
1 2 NaN NaN 30 23.0 24
If the first column is the index:
print (df.columns.tolist())
['1/18', '2/18', '3/18', '4/18', '5/18']
df = df.sort_values(by=df.columns.tolist())
print (df)
1/18 2/18 3/18 4/18 5/18
code
3 24.0 21.0 23 NaN 26
1 NaN 15.0 15 16.0 14
2 NaN NaN 30 23.0 24
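An alternative sketch that keys explicitly on where each row's first purchase appears (this assumes 'code' is the index, as in the second example, and uses a stable sort so ties keep their original order):
import numpy as np

first_valid = df.notna().values.argmax(axis=1)  # column position of the first non-NaN per row
df = df.iloc[np.argsort(first_valid, kind='stable')]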
I have a dataframe with, say, 4 columns [['a','b','c','d']], to which I add another column ['total'] containing the sum of all the other columns for each row. I then add another column ['growth of total'] with the growth rate of the total.
Some of the values in [['a','b','c','d']] are blank, rendering the ['total'] column invalid for these rows. I can easily get rid of these rows with df.dropna(how='any').
However, my growth rate will be invalid not only for rows with missing values in [['a','b','c','d']], but also for the following row. How do I drop all of these rows?
IIUC, you can use notnull with all to mask off any rows with NaN and any rows that follow NaN rows:
In [43]:
df = pd.DataFrame({'a':[0, np.nan, 2, 3, np.nan], 'b':[np.nan, 1, 2, 3, 4], 'c':[0, np.nan, 2, 3, 4]})
df
Out[43]:
a b c
0 0 NaN 0
1 NaN 1 NaN
2 2 2 2
3 3 3 3
4 NaN 4 4
In [44]:
df[df.notnull().all(axis=1) & df.shift().notnull().all(axis=1)]
Out[44]:
a b c
3 3 3 3
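Note that this also drops the very first row, because df.shift() turns its (nonexistent) predecessor into all NaN. If a fully valid first row should be kept, one sketch (assuming pandas 0.24+ for the fill_value argument) is:
df[df.notnull().all(axis=1) & df.shift(fill_value=0).notnull().all(axis=1)]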
Here's one option that I think does what you're looking for:
In [76]: df = pd.DataFrame(np.arange(40).reshape(10,4))
In [77]: df.iloc[1, 2] = np.nan
In [78]: df.iloc[6, 1] = np.nan
In [79]: df['total'] = df.sum(axis=1, skipna=False)
In [80]: df
Out[80]:
0 1 2 3 total
0 0 1 2 3 6
1 4 5 NaN 7 NaN
2 8 9 10 11 38
3 12 13 14 15 54
4 16 17 18 19 70
5 20 21 22 23 86
6 24 NaN 26 27 NaN
7 28 29 30 31 118
8 32 33 34 35 134
9 36 37 38 39 150
In [81]: df['growth'] = df['total'].iloc[1:] - df['total'].values[:-1]
In [82]: df
Out[82]:
0 1 2 3 total growth
0 0 1 2 3 6 NaN
1 4 5 NaN 7 NaN NaN
2 8 9 10 11 38 NaN
3 12 13 14 15 54 16
4 16 17 18 19 70 16
5 20 21 22 23 86 16
6 24 NaN 26 27 NaN NaN
7 28 29 30 31 118 NaN
8 32 33 34 35 134 16
9 36 37 38 39 150 16
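A more concise sketch of the same idea: Series.diff computes exactly the row-over-row change built manually above, and a NaN in 'total' propagates into 'growth' both for the bad row and for the row after it, so a single dropna removes every invalid row (including the first, whose growth is undefined):
df['growth'] = df['total'].diff()
df = df.dropna()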
I am trying to fill NaN values in a dataframe with values coming from a standard normal distribution.
This is currently my code:
sqlStatement = "select * from sn.clustering_normalized_dataset"
df = psql.frame_query(sqlStatement, cnx)
data=df.pivot("user","phrase","tfw")
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
data[np.isnan(data)] = dfrand[np.isnan(data)]
After pivoting, the dataframe 'data' looks like this:
phrase aaron abbas abdul abe able abroad abu abuse \
user
14233664 NaN NaN NaN NaN NaN NaN NaN NaN
52602716 NaN NaN NaN NaN NaN NaN NaN NaN
123456789 NaN NaN NaN NaN NaN NaN NaN NaN
500158258 NaN NaN NaN NaN NaN NaN NaN NaN
517187571 0.4 NaN NaN 0.142857 1 0.4 0.181818 NaN
However, I need each NaN value to be replaced with a new random value. So I created a new df consisting of only random values (dfrand) and then tried to swap the missing numbers (NaN) with the values from dfrand corresponding to the indices of the NaNs. Unfortunately, it doesn't work:
Although the expression
np.isnan(data)
returns a dataframe consisting of True and False values, the expression
dfrand[np.isnan(data)]
returns only NaN values, so the overall trick doesn't work.
Any ideas what the issue is?
Three thousand columns is not so many. How many rows do you have? You could always make a random dataframe of the same size and do a logical replacement (the size of your dataframe will dictate whether this is feasible or not).
If you know the size of your dataframe:
import pandas as pd
import numpy as np
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(rows, cols))  # rows, cols: the known dimensions of your data
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
If you do not know the size of your dataframe, just shuffle things around:
import pandas as pd
import numpy as np
# import "real" dataframe
data = pd.read_csv(etc.) # or however you choose to read it in
# create random dummy dataframe
dfrand = pd.DataFrame(data=np.random.randn(data.shape[0],data.shape[1]))
# replace nans
data[np.isnan(data)] = dfrand[np.isnan(data)]
EDIT
Per "users" last comment:
"dfrand[np.isnan(data)] returns NaN only."
Right! And that is exactly what you wanted. In my solution I have: data[np.isnan(data)] = dfrand[np.isnan(data)]. Translated, this means: take the randomly-generated value from dfrand that corresponds to the NaN-location within "data" and insert it in "data" where "data" is NaN. An example will help:
a = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
a.iloc[5, 0] = np.nan
In [32]: a
Out[32]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 NaN 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
# define randomly-generated dataframe, much like what you are doing, and replace NaN's
b = pd.DataFrame(data=np.random.randint(0,100,(10,3)))
In [39]: b
Out[39]:
0 1 2
0 92 21 55
1 65 53 89
2 54 98 97
3 48 87 79
4 98 38 62
5 46 16 30
6 95 39 70
7 90 59 9
8 14 85 37
9 48 29 46
a[np.isnan(a)] = b[np.isnan(a)]
In [38]: a
Out[38]:
0 1 2
0 2 26 28
1 14 79 82
2 89 32 59
3 65 47 31
4 29 59 15
5 46 58 90
6 15 66 60
7 10 19 96
8 90 26 92
9 0 19 23
As you can see, all NaNs in a have been replaced with the randomly-generated values in b, based on a's NaN-value indices.
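And here is a tiny sketch (made-up values) of why the masked frame prints as mostly NaN: boolean-mask indexing keeps the full shape and blanks every position where the mask is False, so the random values survive only at the NaN locations:
import numpy as np
import pandas as pd

data = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 4.0]})
dfrand = pd.DataFrame(np.random.randn(2, 2), columns=['x', 'y'])
# NaN everywhere except the two positions where data was NaN
print(dfrand[np.isnan(data)])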
You could try something like this, assuming you are dealing with one Series:
ser = data['column_with_nulls_to_replace']
index = ser[ser.isnull()].index
df = pd.DataFrame(np.random.randn(len(index)), index=index, columns=['column_with_nulls_to_replace'])
ser.update(df)
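For the whole-DataFrame case, a sketch of an equivalent one-liner: DataFrame.fillna also accepts another DataFrame (aligned on index and columns), so a random frame of the same shape fills every NaN in one call:
dfrand = pd.DataFrame(np.random.randn(*data.shape),
                      index=data.index, columns=data.columns)
data = data.fillna(dfrand)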
I'm just getting into pandas, trying to do what I would do easily in Excel, just with a large data set. I have a selection of futures price data that I have input into pandas using:
df = pd.read_csv('TData1.csv')
this gives me a DataFrame. The data is in the form below:
Date,Time,Open,High,Low,Close,Volume,Tick Count
02/01/2013,05:01:00,1443.00,1443.75,1438.25,1440.25,20926,4652
02/01/2013,05:02:00,1440.25,1441.75,1440.00,1441.25,7261,1781
02/01/2013,05:03:00,1441.25,1443.25,1441.00,1443.25,5010,1014
Now what I'm essentially trying to do is calculate a Bollinger band in pandas. If I were in Excel, I would select the whole block of 'High', 'Low', 'Open' and 'Close' columns for, say, 20 rows and calculate the standard deviation.
I see pandas has the rolling_std function, which can calculate a rolling standard deviation, but seemingly just on one column. How do I get pandas to calculate a rolling standard deviation on the 'High', 'Low', 'Open' and 'Close' columns for, say, 20 periods?
Thanks.
You can compute a rolling standard deviation on a whole DataFrame or on a subset of columns. (pd.rolling_std was the old spelling; since pandas 0.18 it is written df.rolling(window).std().)
>>> df[['high','open','close','low']].rolling(5).std()
like this:
>>> df = pd.DataFrame({'high':np.random.randint(15,25,size=10), 'close':np.random.randint(15,25,size=10), 'low':np.random.randint(15,25,size=10), 'open':np.random.randint(15,25,size=10), 'a':list('abcdefghij')})
>>> df
a close high low open
0 a 16 20 18 15
1 b 21 23 22 15
2 c 20 23 21 23
3 d 19 24 24 17
4 e 23 19 20 17
5 f 15 16 19 17
6 g 19 24 23 19
7 h 21 18 17 22
8 i 22 22 17 15
9 j 19 20 17 18
>>> df[['high','open','close','low']].rolling(5).std()
high open close low
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 2.167948 3.286335 2.588436 2.236068
5 3.391165 3.033150 2.966479 1.923538
6 3.563706 2.607681 2.863564 2.073644
7 3.633180 2.190890 2.966479 2.880972
8 3.193744 2.645751 3.162278 2.489980
9 3.162278 2.588436 2.683282 2.607681
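Since the end goal is a Bollinger band, here is a sketch of the usual construction (assuming a 20-period window, bands at 2 standard deviations, and the Close column from the CSV shown in the question):
import pandas as pd

df = pd.read_csv('TData1.csv')
window = 20
mid = df['Close'].rolling(window).mean()   # middle band: rolling mean
std = df['Close'].rolling(window).std()    # rolling standard deviation
df['bb_upper'] = mid + 2 * std
df['bb_lower'] = mid - 2 * std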