I know this subject has come up a few times on Stack Overflow, but I'm still stuck on an interpolation problem.
I have a complex dataframe of a set of columns, which could look something like this if simplified:
df_new = pd.DataFrame(np.random.randn(5,7), columns=[402.3, 407.2, 412.3, 415.8, 419.9, 423.5, 428.3])
wl = np.array([400.0, 408.2, 412.5, 417.2, 420.5, 423.3, 425.0])
So what I need to do is interpolate column-wise, for each row, onto the new column values given in wl.
And how do I get a new dataframe whose columns contain ONLY the values present in the wl array?
Use reindex to include wl as new columns (whose values will be filled with NaNs).
Then use interpolate(axis=1) to interpolate across the columns.
Strictly speaking interpolation is only done between known values.
You could, however, use limit_direction='both' to fill NaN edge values in both the forward and backward directions:
>>> df_new.reindex(columns=df_new.columns.union(wl)).interpolate(axis=1, limit_direction='both')
400.0 402.3 407.2 408.2 412.3 412.5 415.8 417.2 419.9 420.5 423.3 423.5 425.0 428.3
0 0.342346 0.342346 1.502418 1.102496 0.702573 0.379089 0.055606 -0.135563 -0.326732 -0.022298 0.282135 0.586569 0.164917 -0.256734
1 -0.220773 -0.220773 -0.567199 -0.789194 -1.011190 -0.485832 0.039526 -0.426771 -0.893069 -0.191818 0.509432 1.210683 0.414023 -0.382636
2 0.078147 0.078147 0.335040 -0.146892 -0.628824 -0.280976 0.066873 -0.881153 -1.829178 -0.960608 -0.092038 0.776532 0.458758 0.140985
3 -0.792214 -0.792214 0.254805 0.027573 -0.199659 -1.173250 -2.146841 -1.421482 -0.696124 -0.073018 0.550088 1.173194 -0.049967 -1.273128
4 -0.485818 -0.485818 0.019046 -1.421351 -2.861747 -1.020571 0.820605 0.097722 -0.625160 -0.782700 -0.940241 -1.097781 -0.809617 -0.521453
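One caveat worth checking: with the default method='linear', interpolate ignores the column labels and treats the values as equally spaced, which is why the 408.2 value in row 0 above is exactly midway between its neighbours. If the actual spacing of the wavelengths should be taken into account, method='index' is the usual choice. A minimal sketch on made-up data (df_demo and wl_demo are illustrative names, not from the question):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row, three known wavelengths.
df_demo = pd.DataFrame([[0.0, 10.0, 30.0]], columns=[0.0, 10.0, 20.0])
wl_demo = np.array([2.5, 15.0])

# Add the new wavelengths as NaN-filled columns.
expanded = df_demo.reindex(columns=df_demo.columns.union(wl_demo))

# Default 'linear' ignores the labels: 2.5 lands midway between 0 and 10.
by_position = expanded.interpolate(axis=1)

# 'index' weights by the numeric labels; transpose so the labels are the index.
by_value = expanded.T.interpolate(method="index").T

print(by_position.loc[0, 2.5], by_value.loc[0, 2.5])
```

Selecting by_value[wl_demo] afterwards keeps only the new columns.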
Note that Pandas DataFrames store values in a primarily column-based data structure. So computations are generally more efficient when done column-wise, not row-wise. Therefore, it might be better to transpose your dataframe:
df = df_new.T
and then proceed similarly as described above:
df = df.reindex(index=df.index.union(wl))
df = df.interpolate(limit_direction='both')
If you want to extrapolate edge values, you could use scipy.interpolate.interp1d with fill_value='extrapolate':
import numpy as np
import pandas as pd
import scipy.interpolate as interpolate
np.random.seed(2018)
df_new = pd.DataFrame(np.random.randn(5,7), columns=[402.3, 407.2, 412.3, 415.8, 419.9, 423.5, 428.3])
wl = np.array([400.0, 408.2, 412.5, 417.2, 420.5, 423.3, 425.0, 500])
x = df_new.columns
y = df_new.values
newx = x.union(wl)
result = pd.DataFrame(
interpolate.interp1d(x, y, fill_value='extrapolate')(newx),
columns=newx)
yields
400.0 402.3 407.2 408.2 412.3 412.5 415.8 417.2 419.9 420.5 423.3 423.5 425.0 428.3 500.0
0 -0.679793 -0.276768 0.581851 0.889017 2.148399 1.952520 -1.279487 -0.671080 0.502277 0.561236 0.836376 0.856029 0.543898 -0.142790 -15.062654
1 0.484717 0.110079 -0.688065 -0.468138 0.433564 0.437944 0.510221 0.279613 -0.165131 -0.362906 -1.285854 -1.351779 -0.758526 0.546631 28.904127
2 1.303039 1.230655 1.076446 0.628001 -1.210625 -1.158971 -0.306677 -0.563028 -1.057419 -0.814173 0.320975 0.402057 0.366778 0.289165 -1.397156
3 2.385057 1.282733 -1.065696 -1.191370 -1.706633 -1.618985 -0.172797 -0.092039 0.063710 0.114863 0.353577 0.370628 -0.246613 -1.604543 -31.108665
4 -3.360837 -2.165729 0.380370 0.251572 -0.276501 -0.293597 -0.575682 -0.235060 0.421854 0.469009 0.689062 0.704780 0.498724 0.045401 -9.804075
If you wish to create a DataFrame containing only the wl columns, you could sub-select those columns using result[wl], or you could simply interpolate only at the wl values:
result_wl = pd.DataFrame(
interpolate.interp1d(x, y, fill_value='extrapolate')(wl),
columns=wl)
I have a drug database saved in a SINGLE column of a CSV file that I can read with Pandas. The file contains 750000 rows and its elements are divided by "///". The column also ends with "///". It seems every row ends with ";".
I would like to split it into multiple columns in order to create a structured database. Capitalized words (drug information) like "ENTRY", "NAME" etc. will be the headers of these new columns.
So it has some structure, although the elements can be described by a different number and kind of fields, meaning some elements will just have NaN in some cells. I have never worked with such an SQL-like format, and it is difficult to reproduce it as Pandas code, too. Please see the screenshots for more information.
An example of desired output would look like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "ENTRY":["001", "002", "003"],
    "NAME":["water", "ibuprofen", "paralen"],
    "FORMULA":["H2O","C5H16O85", "C14H24O8"],
    "COMPONENT":[np.nan, np.nan, "paracetamol"]})
I am guessing there will be .split() involved based on CAPITALIZED words? The Python 3 code solution would be appreciated. It can help a lot of people. Thanks!
Here is what I could come up with:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We create an additional dataframe.
dfi = pd.DataFrame()
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
dfi['Key1'] = dfi['Key'] = df[(df['Key'] == 'ENTRY')].index
dfi = dfi.set_index('Key1')
df = df.join(dfi, lsuffix='_caller', rsuffix='_other')
df.ffill(inplace=True)  # fillna(method="ffill") is deprecated in recent pandas
df = df.astype({"Key_other": "Int64"})
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key_caller', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
Small code refactoring:
import pandas as pd
cols = ['ENTRY', 'NAME', 'FORMULA', 'COMPONENT']
# We read the file, get two columns and leave only the necessary lines.
df = pd.read_fwf(r'C:\Users\ф\drug\drug', header=None, names=['Key', 'Value'])
df = df[df['Key'].isin(cols)]
# To "flip" the dataframe, we first prepare an additional column
# with indexing by groups from one 'ENTRY' row to another.
df['Key_other'] = None
df.loc[(df['Key'] == 'ENTRY'), 'Key_other'] = df[(df['Key'] == 'ENTRY')].index
# Assign the forward-filled column back; a chained inplace fillna is unreliable.
df['Key_other'] = df['Key_other'].ffill()
# Change the shape of the table.
df = df.pivot(index='Key_other', columns='Key', values='Value')
df = df.reindex(columns=cols)
# We clean up the resulting dataframe a little.
df['ENTRY'] = df['ENTRY'].str.split(r'\s+', expand=True)[0]
df['NAME'] = df['NAME'].str.split(r'\(', expand=True)[0]
df.reset_index(drop=True, inplace=True)
pd.set_option('display.max_columns', 10)
print(df)
Key ENTRY NAME FORMULA \
0 D00001 Water H2O
1 D00002 Nadide C21H28N7O14P2
2 D00003 Oxygen O2
3 D00004 Carbon dioxide CO2
4 D00005 Flavin adenine dinucleotide C27H33N9O15P2
... ... ... ...
11983 D12452 Fostroxacitabine bralpamide hydrochloride C22H30BrN4O8P. HCl
11984 D12453 Guretolimod C24H34F3N5O4
11985 D12454 Icenticaftor C12H13F6N3O3
11986 D12455 Lirafugratinib C28H24FN7O2
11987 D12456 Lirafugratinib hydrochloride C28H24FN7O2. HCl
Key COMPONENT
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
... ...
11983 NaN
11984 NaN
11985 NaN
11986 NaN
11987 NaN
[11988 rows x 4 columns]
It still needs a little polishing, which I leave to you.
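For comparison, the same reshaping can be sketched in plain Python by walking the file record by record. This assumes a layout in which every field starts at the beginning of a line with its capitalized keyword followed by whitespace, continuation lines are indented, and a line starting with "///" closes a record; parse_records and SAMPLE are hypothetical names and data built from the desired output in the question.

```python
import io
import pandas as pd

# Hypothetical sample in the assumed layout.
SAMPLE = """\
ENTRY       001
NAME        water
FORMULA     H2O
///
ENTRY       003
NAME        paralen
FORMULA     C14H24O8
COMPONENT   paracetamol
///
"""

def parse_records(lines, wanted):
    """Collect the first value of each wanted keyword per '///'-terminated record."""
    records, current = [], {}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("///"):
            if current:
                records.append(current)
            current = {}
            continue
        # Skip blank lines and indented continuation lines.
        if not line or line[0].isspace():
            continue
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key in wanted and key not in current:
            current[key] = value
    if current:  # file did not end with '///'
        records.append(current)
    # Keywords missing from a record become NaN.
    return pd.DataFrame(records, columns=wanted)

cols = ["ENTRY", "NAME", "FORMULA", "COMPONENT"]
parsed = parse_records(io.StringIO(SAMPLE), cols)
print(parsed)
```

With a real file, the io.StringIO(SAMPLE) argument would be replaced by an open file handle.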
I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers as seen below
# User defined function : find_outliers
#(I)
from scipy import stats
outlier_threshold = 1.5
ddof = 0
def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
'col_A':['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002','B_1002','B_1002','B_1002','D_1003','D_1003','D_1003','D_1003'],
'col_X':[110.21, 191.12, 190.21, 12.00, 245.09,4321.8,122.99,122.88,134.28,148.14,161.17,132.17],
'col_Y':[100.22,199.10, 191.13,199.99, 255.19,131.22,144.27,192.21,7005.15,12.02,185.42,198.00],
'col_Z':[140.29, 291.07, 390.22, 245.09, 4122.62,4004.52,395.17,149.19,288.91,123.93,913.17,1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs=dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to have find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II) and finally flag the outliers as in (III).
Thanks for your attempt. :)
Desired output should look like this, but for the three dataframes
This may not be exactly what you want, but it is how I'd approach it. You'll have to work out the details of the function, because yours is written to receive a Series rather than a DataFrame. groupby().apply() passes each subset of rows to the function, where you can perform the actions on that subset and return the result.
For consideration:
inside the function you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X','col_Y','col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x
newdf = df.groupby('col_A').apply(find_outliers)
col_A col_X col_Y col_Z
0 A_1001 outlier
1 A_1001
2 A_1001
3 A_1001 outlier
4 B_1002 outlier
5 B_1002 outlier
6 B_1002
7 B_1002
8 D_1003 outlier
9 D_1003
10 D_1003
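If the dictionary of dataframes should be kept, a dict comprehension applies the same logic to each entry. The sketch below uses made-up data and computes the z-score by hand with ddof=0 instead of calling scipy; outlier_mask and demo are illustrative names, not from the question.

```python
import pandas as pd

OUTLIER_THRESHOLD = 1.5

def outlier_mask(frame, cols, threshold=OUTLIER_THRESHOLD):
    """Boolean frame: True where |z-score| (population std, ddof=0) exceeds threshold."""
    values = frame[cols]
    z = (values - values.mean()) / values.std(ddof=0)
    return z.abs() > threshold

# Hypothetical toy data with one obvious outlier per group.
demo = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "x": [1.0, 1.0, 1.0, 1.0, 10.0, 2.0, 2.0, 2.0, 2.0, -7.0],
})

dict_of_dfs = {key: frame for key, frame in demo.groupby("group")}

# One mask per dataframe, keyed like the original dictionary.
masks = {key: outlier_mask(frame, ["x"]) for key, frame in dict_of_dfs.items()}
print(masks["a"])
```

The same comprehension works with the original styling function, e.g. frame.style.apply(find_outliers, subset=desired_cols) as the value expression, since .style exists on each dataframe but not on the dict itself.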
Given a multi-index multi-column dataframe below, I want to apply LinearRegression to each block of this dataframe, for example, "index(X,1), column A". And compute the predicted dataframe as df_result.
A B
X 1 1997-01-31 -0.061332 0.630682
1997-02-28 -2.671818 0.377036
1997-03-31 0.861159 0.303689
...
1998-01-31 0.535192 -0.076420
...
1998-12-31 1.430995 -0.763758
Y 1 1997-01-31 -0.061332 0.630682
1997-02-28 -2.671818 0.377036
1997-03-31 0.861159 0.303689
...
1998-01-31 0.535192 -0.076420
...
1998-12-31 1.430995 -0.763758
Here is what I tried:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
N = 24
dates = pd.date_range('19970101', periods=N, freq='M')
df=pd.DataFrame(np.random.randn(len(dates),2),index=dates,columns=list('AB'))
df2=pd.concat([df,df],keys=[('X','1'),('Y','1')])
regr = LinearRegression()
# df_result will be reassigned; copy the index and metadata from df2
df_result=df2.copy()
# I know the double loop below is not a clever idea. What is the right way?
for row in df2.index.to_series().unique():
    for col in df2.columns:
        #df2 can contain missing values
        lenX=np.count_nonzero(df2.ix[row[:1],col].notnull().values.ravel())
        X=np.array(range(lenX)).reshape(lenX,1)
        y=df2.ix[row[:1],col]
        y=y[y.notnull()]
        # train the model
        regr.fit(X,y)
        df_result.ix[row[:1],col][:lenX] = regr.predict(X)
The problem is that the double loop above makes the computation quite slow: more than ten minutes for a 100 KB data set. What is the pythonic way to do this?
EDIT:
The second question for the last line of the code above is that I am working with a copy of a slice of the dataframe. Some columns of "df_result" are not updated with this operation.
EDIT2:
Some columns of the original data can contain missing values, and we cannot apply regression directly on them. For example,
df2.ix[('X','1','1997-12-31')]['A']=np.nan
df2.ix[('Y','1','1998-12-31')]['A']=np.nan
I don't quite understand the row looping.
Anyhow, to keep the numbers consistent I put np.random.seed(1) at the top.
In short, I think you can achieve what you want with a function, a groupby, and a call to .transform().
def do_regression(y):
    X=np.array(range(len(y))).reshape(len(y),1)
    regr.fit(X,y)
    return regr.predict(X)
df_regressed = df2.groupby(level=[0,1]).transform(do_regression)
print(df_regressed.head())
A B
X 1 1997-01-31 0.779476 -1.222119
1997-02-28 0.727184 -1.138630
1997-03-31 0.674892 -1.055142
1997-04-30 0.622601 -0.971653
1997-05-31 0.570309 -0.888164
which matches your df_result output.
print(df_result.head())
A B
X 1 1997-01-31 0.779476 -1.222119
1997-02-28 0.727184 -1.138630
1997-03-31 0.674892 -1.055142
1997-04-30 0.622601 -0.971653
1997-05-31 0.570309 -0.888164
oh and a couple of alternatives for:
X=np.array(range(len(y))).reshape(len(y),1)
1.) X = np.expand_dims(range(len(y)), axis=1)
2.) X = np.arange(len(y))[:,np.newaxis]
Edit for empty data
ok 2 suggestions:
Would it be legitimate to use the interpolate method to fill the null values?
df2 = df2.interpolate()
OR
do the regression on non null values and then pop the nulls back in at the appropriate index position
def do_regression(y):
    x_s = np.arange(len(y))
    not_null = y.notnull().values
    x_s_non_nulls = np.expand_dims(x_s[not_null], axis=1)
    y_non_nulls = y[not_null] # get the non nulls
    regr.fit(x_s_non_nulls, y_non_nulls) # regression
    # start from an all-NaN array and write the predictions back at the
    # non-null positions, so the nulls stay where they were
    results = np.full(len(y), np.nan)
    results[not_null] = regr.predict(x_s_non_nulls)
    return results
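A self-contained sketch of the second suggestion, run through groupby().transform() on made-up data. It swaps sklearn's LinearRegression for np.polyfit with deg=1 so that it runs with numpy and pandas alone; linear_trend and demo are illustrative names.

```python
import numpy as np
import pandas as pd

def linear_trend(y):
    """Fit a straight line to the non-null values; nulls stay NaN in the output."""
    x = np.arange(len(y), dtype=float)
    mask = y.notna().to_numpy()
    slope, intercept = np.polyfit(x[mask], y.to_numpy(dtype=float)[mask], deg=1)
    fitted = np.full(len(y), np.nan)
    fitted[mask] = slope * x[mask] + intercept
    return pd.Series(fitted, index=y.index)

# Hypothetical data: two groups, one missing value in group X.
demo = pd.DataFrame({
    "group": ["X"] * 4 + ["Y"] * 4,
    "A": [0.0, 1.0, np.nan, 3.0, 10.0, 8.0, 6.0, 4.0],
})

trend = demo.groupby("group")["A"].transform(linear_trend)
print(trend)
```

Because both toy groups lie exactly on a line, the fitted values reproduce the inputs, with the NaN left in place.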
I am trying to downsample grouped data to daily averages, calculated for each group, and plot the resulting time series in a single plot.
My starting point is the following pd.DataFrame:
value time type
0.1234 2013-04-03 A
0.2345 2013-04-05 A
0.34564 2013-04-07 A
... ... ...
0.2345 2013-04-03 B
0.1234 2013-04-05 B
0.2345 2013-04-07 C
0.34564 2013-04-07 C
I would like to calculate daily means for each type of content, and plot the time series of these daily means in a single plot.
I currently have this...
names = list(test['type'].unique())
types = []
for name in names:
    single = df.loc[df.type == name]
    single = single.set_index(single.time, drop=False)
    single = single.resample("D")
    types.append(single)
for single, name in zip(types, names):
    single.rename(columns={"value":name}, inplace=True)
combined = pd.concat(types, axis=1)
combined.plot()
... resulting in the combined data frame containing the desired output and the following plot:
It seems to me that this could be achieved more easily by using groupby on the initial dataframe but so far I have not been able to reproduce the desired plot using this method.
What is "the smart way" to do this?
EDIT:
Bigger data sample (csv, 1000 rows) at: http://pastebin.com/gi16nZdh
Thanks,
Matthias
You can use pandas.DataFrame.pivot easily to do what you want, I've created a random example DataFrame below and then used df.pivot to arrange the table as wanted.
Note: I've resampled as weekly as I only have one data value per type per day, don't forget to change this for your data.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
dates = pd.date_range('2013-04-03', periods = 50, freq='D')
dfs = [pd.DataFrame(dict(time=dates, value=np.random.randn(len(dates)), type=i)) for i in ['A', 'B', 'C', 'D']]
df = pd.concat(dfs)
pivoted = df.pivot(index='time', columns='type', values='value')
# resample() returns a new object and needs an aggregation; keep the result
weekly = pivoted.resample('W').mean()
print(pivoted.head(10))
# type A B C D
# time
# 2013-04-03 0.161839 0.509179 0.055078 -2.072243
# 2013-04-04 0.323308 0.891982 -1.266360 1.950389
# 2013-04-05 -2.542464 -0.441849 -2.686183 0.717737
# 2013-04-06 0.750871 0.438343 -0.002004 0.478821
# 2013-04-07 -0.118890 1.026121 1.283397 -1.306257
# 2013-04-08 -0.396373 -1.078925 -0.539617 -1.625549
# 2013-04-09 0.328076 1.964779 0.194198 0.232702
# 2013-04-10 -0.178683 0.177359 0.500873 -0.729988
# 2013-04-11 0.762800 1.576662 -0.456480 0.526162
# 2013-04-12 -1.301265 -0.586977 -0.903313 0.162008
pivoted.plot()
plt.show()
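This code creates a pivoted DataFrame called pivoted in which each column is a type and the index is the time. We then simply resample it using pivoted.resample('W').mean(); note that this returns a new DataFrame, so assign the result if you want to plot the resampled data.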
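Since the question asks about groupby: pivot raises an error when a type has several values for the same timestamp, whereas grouping by type and by day averages them directly. A minimal sketch on made-up data (demo is an illustrative name), using pd.Grouper for the daily bins:

```python
import pandas as pd

# Hypothetical long-format data with several readings per type and day.
demo = pd.DataFrame({
    "value": [1.0, 3.0, 5.0, 2.0, 4.0],
    "time": pd.to_datetime([
        "2013-04-03 08:00", "2013-04-03 20:00", "2013-04-04 12:00",
        "2013-04-03 09:00", "2013-04-04 10:00",
    ]),
    "type": ["A", "A", "A", "B", "B"],
})

# Mean per (type, calendar day), then one column per type.
daily = (
    demo.groupby(["type", pd.Grouper(key="time", freq="D")])["value"]
    .mean()
    .unstack("type")
)
print(daily)
```

Calling daily.plot() then draws one line per type on a shared time axis.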
I have the following dataframe (p1.head(7)):
ColA
0 6.286333
1 3.317000
2 13.24889
3 26.20667
4 26.25556
5 60.59000
6 79.59000
7 1.361111
I can get the bin ranges using:
pandas.qcut(p1.ColA, 4)
Is there a way I can create a new column where each value corresponds to the mean value of the bin? I.e for each bin, (a,b], I want (a+b)/2
The key here is the retbins option on qcut.
import numpy
import pandas
df = pandas.DataFrame(numpy.random.random(100)*100, columns=['val1'])
pctiles = pandas.qcut(df['val1'],4,retbins=True)
pctile_object = pctiles[0]
pctile_boundaries = pctiles[1]
Here pctile_object is just what qcut would return if you hadn't passed retbins=True, and pctile_boundaries is a numpy array of the interval boundaries.
import numpy
bin_halfway = pctile_boundaries[:-1] + (numpy.diff(pctile_boundaries)/2)
This gives us the halfway points of the bins.
Now we make a dataframe with just the intervals and the halfway points. (In current pandas the categories are available as .cat.categories; older versions exposed them as .levels.)
df2 = pandas.DataFrame({'quartile boundaries': pctile_object.cat.categories,
                        'midway point': bin_halfway})
Finally, merge the bin halfway points back into the original dataframe.
df['quartile boundaries'] = pctile_object
pandas.merge(df,df2,on='quartile boundaries')
Then you can drop quartile boundaries if you want.
I wrote a function that uses @exp1orer's logic:
def midway_quantiles(feature_series,q=4):
    import pandas as pd
    pctiles = pd.qcut(feature_series,q,retbins=True)
    pctile_object = pctiles[0]
    df1= pd.DataFrame({"feature":feature_series,"q_bound": pctile_object})
    pctile_boundaries = pctiles[1]
    import numpy as np
    bin_halfway = pctile_boundaries[:-1] + (np.diff(pctile_boundaries)/2)
    df2 = pd.DataFrame({"q_bound": pctile_object.cat.categories,
                        "midpoint": bin_halfway})
    df3=pd.merge(df1,df2,on="q_bound",how="left")
    return df3["midpoint"]
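The merge can also be skipped: the categorical returned by qcut carries an integer code per row, and that code indexes straight into the array of midpoints. A minimal sketch on made-up values (values and bin_mid are illustrative names):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# bins: categorical of intervals; edges: the q + 1 bin boundaries.
bins, edges = pd.qcut(values, 4, retbins=True)

# Midpoint (a + b) / 2 of every bin (a, b].
midpoints = (edges[:-1] + edges[1:]) / 2

# Each row's category code selects the midpoint of its bin.
bin_mid = pd.Series(midpoints[bins.cat.codes.to_numpy()], index=values.index)
print(bin_mid)
```

This assumes there are no missing values in the input; a NaN would get code -1 and need separate handling.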