I have the following dataframe:
import pandas as pd

table2 = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I wrote the following code to generate a new DataFrame with a modified output for each 'sim' column:
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']

df = table2.filter(regex="sim|Type")
Output:
>>> df
Product Type sim_1 sim_2
0 A 35.0 60.0
1 B -39.0 36.0
2 C 56.0 92.0
3 D 23.0 33.0
I want to run this on 10,000 sims, and currently each loop iteration takes about 0.25 seconds. Is there any way to modify this code to avoid the loop and be more time-efficient?
Edit: If you're curious what this code is trying to accomplish, you can see my (somewhat disorganized) self-answered question here: Pandas DataFrame: Complex linear interpolation
I was able to accomplish this with no loops using the following code:
As a result, on my 10k x 200 table it ran in 3 minutes instead of the previous 2 hours.
Unfortunately I now need to run it on a 10k x 4k table, and I hit a MemoryError on that one, but that may be out of the scope of this question.
import numpy as np
import pandas as pd

df = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'], axis=0), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:,None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:,None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
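Regarding the MemoryError on the 10k x 4k table: one workaround (outside the scope of the accepted no-loop approach above, and only a sketch under the assumption that the wide table has columns sim_1 … sim_N alongside the same State/Bound columns) is to run the vectorised version over blocks of sim columns and concatenate the results, so only one block of intermediates lives in memory at a time. interpolate_chunk and chunk_size are hypothetical names:

import numpy as np
import pandas as pd

def interpolate_chunk(df, sim_cols):
    """Vectorised interpolation for one block of sim_* columns (hypothetical helper)."""
    sims = df[sim_cols]
    buckets = sims.sub(df['Lower_Bound'], axis=0).div(
        df['Upper_Bound'] - df['Lower_Bound'], axis=0) * 5 + 1
    low = buckets.astype(int).clip(1, 5)
    high = (buckets.astype(int) + 1).clip(2, 6)
    # column 0 is 'Product Type', columns 1..6 are the State values, as in the answer above
    states = df.filter(regex="State|Type").values
    rows = np.arange(len(df))[:, None]
    low_value = pd.DataFrame(states[rows, low], index=df.index, columns=sim_cols)
    high_value = pd.DataFrame(states[rows, high], index=df.index, columns=sim_cols)
    return (high_value - low_value) * (buckets - low).values + low_value

sim_cols = [c for c in df.columns if c.startswith('sim_')]
chunk_size = 500  # tune so one chunk of intermediates fits comfortably in memory
results = [interpolate_chunk(df, sim_cols[i:i + chunk_size])
           for i in range(0, len(sim_cols), chunk_size)]
out = pd.concat(results, axis=1)
out.insert(0, 'Product Type', df['Product Type'])

A smaller chunk_size trades a little speed for memory; the per-chunk logic is just the no-loop code above wrapped in a function.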
Related
Consider a dataframe like pivoted, where replicates of some data are given as lists in a dataframe:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
[20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
[30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an Excel file as follows:
with pd.ExcelWriter('output.xlsx') as writer:
pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that the replicate data are given in side-by-side cells (one cell per replicate) in the Excel file?
Then you need to pivot your data in a slightly different way: first explode the Data column, then number the replicates with groupby.cumcount:
(df.explode('Data')
.assign(n=lambda d: d.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'n'], values='Data')
.droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
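If the goal is the Excel file itself, the same ExcelWriter call from the question should work on this reshaped frame; a minimal sketch, assuming the result above is saved in a variable called out:

out = (df.explode('Data')
         .assign(n=lambda d: d.groupby(level=0).cumcount())
         .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
         .droplevel('n', axis=1).rename_axis(columns=None))

# duplicate column labels are written to adjacent cells, giving the side-by-side layout
with pd.ExcelWriter('output.xlsx') as writer:
    out.to_excel(writer, sheet_name='Sheet1', index_label='Conc')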
Besides @mozway's answer, just for the formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
With Python 3.10
Sample data:
import pandas as pd
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
print(df)
df = df.mask(df == 0).fillna(df.mean())
print(df)  # <--- this works, but you will see what I mean about it looking off...
Updated Solution:
import numpy as np

df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# it replaces the zeros and NULLs with a value (sometimes it fits well, sometimes not so much).
The idea is that I have one or more columns with bad or missing data.
If I use .fillna(df.mean()) for this, the filled values stick out like a sore thumb.
My goal is to use a percentage of the total number of elements in the DataFrame column to compute the new mean from.
I would like to take len(df)*0.30 (30%) and divide it in half.
I would collect half of those numbers from above the index point where the bad data (null/0) exists,
and half of the numbers from below that index.
These collected elements would then be used to calculate the replacement for the missing or bad index point.
This would be more helpful for a data set that is irregular or has missing/bad data.
You can take a rolling mean with min_periods=1 to smooth out the data, or you can do a variant of this method to customise what you want.
Inside the lambda I used np.nanmean(x).
import pandas as pd
import numpy as np
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = df['data'].rolling(3,1).apply(lambda x : np.nanmean(x))
df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])
print(df)
result:
column data something ma final
0 1 14890.0 3 14890.000000 14890.0
1 4 5.0 6 7447.500000 5.0
2 7 8.0 9 4967.666667 8.0
3 11 13.0 14 8.666667 13.0
4 12 0.0 18 7.000000 0.0
5 87 NaN 54 6.500000 6.5
6 1 0.0 3 0.000000 0.0
7 4 5.0 6 2.500000 5.0
8 7 8.0 9 4.333333 8.0
9 11 13.0 14 8.666667 13.0
10 12 0.0 18 7.000000 0.0
11 87 NaN 54 6.500000 6.5
12 1 0.0 3 0.000000 0.0
13 4 5.0 6 2.500000 5.0
14 7 8.0 9 4.333333 8.0
15 11 13.0 14 8.666667 13.0
16 12 0.0 18 7.000000 0.0
17 87 10026.0 54 3346.333333 10026.0
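If you want something closer to the windowed approach described in the question (use roughly 30% of the column length, half above and half below each bad point), a centered rolling mean that treats both zeros and NaNs as missing is one way to sketch it. The window size, the zero-masking rule, and the final3 column name are assumptions, not part of the answer above:

import numpy as np
import pandas as pd

# data as defined above
df = pd.DataFrame(data, columns=['column', 'data', 'something'])

window = max(int(len(df) * 0.30), 1)      # ~30% of the column length, as described in the question
clean = df['data'].mask(df['data'] == 0)  # treat zeros as missing too (assumption)
# centered window: roughly half the points come from above and half from below each index
local_mean = clean.rolling(window, min_periods=1, center=True).mean()

df['final3'] = clean.fillna(local_mean)   # only the bad points are replaced; good values stay untouched
print(df)

Because only the masked positions are filled, the good values pass through unchanged, which avoids the fillna(df.mean()) "sore thumb" effect.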
I have a big array with 4 dimensions, as follows:
>>> raw_data
<xarray.DataArray 'TRAC04' (time: 3, Z: 34, YC: 588, XC: 2160)>
[129548160 values with dtype=float32]
Coordinates: (12/15)
iter (time) int64 ...
* time (time) datetime64[ns] 2017-01-30T12:40:00 ... 2017-04-01T09:20:00
* XC (XC) float32 0.08333 0.25 0.4167 0.5833 ... 359.6 359.8 359.9
* YC (YC) float32 -77.98 -77.95 -77.91 -77.88 ... -30.02 -29.87 -29.72
* Z (Z) float32 -2.1 -6.7 -12.15 -18.55 ... -614.0 -700.0 -800.0
rA (YC, XC) float32 ...
... ...
maskC (Z, YC, XC) bool ...
maskCtrlC (Z, YC, XC) bool ...
rhoRef (Z) float32 ...
rLowC (YC, XC) float32 ...
maskInC (YC, XC) bool ...
rSurfC (YC, XC) float32 ...
Attributes:
standard_name: TRAC04
long_name: Variable concentration
units: mol N/m^3
I want to transform it into a DataFrame with 5 columns: 'XC', 'YC', 'Z', 'time', 'TRAC04'.
I tried to follow this question like this:
import itertools
data = list(itertools.chain(*raw_data))
df = pd.DataFrame.from_records(data)
It runs, however I do not see anything being created in the environment. Furthermore, if I try to look at df with pd.head(df), it runs forever without giving back any output.
In any case, I tried to save df, following this question, but it also runs without ending:
np.savetxt(r'c:\data\DF_TRAC04.txt', df.values, fmt='%d')
df.to_csv(r'c:\data\DF_TRAC04.csv', header=None, index=None, sep=' ', mode='a')
I hope my answer can still help.
Let's first create a mock data with space variables x, y, z, and a time variable t.
import numpy as np
import xarray as xr
val = np.arange(54).reshape(2,3,3,3)
xc = np.array([10, 20, 30])
yc = np.array([50, 60, 70])
zc = np.array([1000, 2000, 3000])
t = np.array([0, 1])
da = xr.DataArray(
val,
coords={'time': t,
'z': zc,
'y': yc,
'x': xc},
dims=["time","z","y", "x"]
)
You will get the following DataArray:
<xarray.DataArray (time: 2, z: 3, y: 3, x: 3)>
array([[[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],
[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],
[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]]],
[[[27, 28, 29],
[30, 31, 32],
[33, 34, 35]],
[[36, 37, 38],
[39, 40, 41],
[42, 43, 44]],
[[45, 46, 47],
[48, 49, 50],
[51, 52, 53]]]])
Coordinates:
* time (time) int64 0 1
* z (z) int64 1000 2000 3000
* y (y) int64 50 60 70
* x (x) int64 10 20 30
If you want to have a flat file representation of the DataArray, you can use
da.to_dataframe(name='value').reset_index()
and this is the result:
time z y x value
0 0 1000 50 10 0
1 0 1000 50 20 1
2 0 1000 50 30 2
3 0 1000 60 10 3
4 0 1000 60 20 4
...
49 1 3000 60 20 49
50 1 3000 60 30 50
51 1 3000 70 10 51
52 1 3000 70 20 52
53 1 3000 70 30 53
For saving the DataFrame to an ASCII file without the index, use:
da.to_dataframe(name='value').reset_index().to_csv('dump.csv', index=False)
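Applied to the raw_data DataArray from the question, the same pattern should give the five requested columns. This is only a sketch, untested against the real file; reset_coords(drop=True) is used here to drop the auxiliary 2-D/3-D coordinates (rA, maskC, ...) before flattening, and with ~130 million values the resulting frame will be very large, so memory may still be a concern:

# raw_data is the xarray.DataArray 'TRAC04' shown in the question
df = (raw_data
      .reset_coords(drop=True)   # drop the extra non-dimension coords so only time/Z/YC/XC remain
      .to_dataframe()            # the DataArray already has a name, so no name= argument is needed
      .reset_index()[['XC', 'YC', 'Z', 'time', 'TRAC04']])
df.to_csv('DF_TRAC04.csv', index=False)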
Input DataFrame as below:
data = {
's_id' :[5,7,26,70.0,55,71.0,8.0,'nan','nan',4],
'r_id' : [[34, 44, 23, 11, 71], [53, 33, 73, 41], [17], [10, 31], [17], [75, 8],[7],[68],[50],[]]
}
df = pd.DataFrame.from_dict(data)
df
Out[240]:
s_id r_id
0 5 [34, 44, 23, 11, 71]
1 7 [53, 33, 73, 41]
2 26 [17]
3 70 [10, 31]
4 55 [17]
5 71 [75, 8]
6 8 [7]
7 nan [68]
8 nan [50]
9 4 []
Expected DataFrame:
data = {
's_id' :[5,7,26,70.0,55,71.0,8.0,'nan','nan',4],
'r_id' : [[5,34, 44, 23, 11, 71], [7,53, 33, 73, 41], [26,17], [70,10, 31], [55,17], [71,75, 8],[8,7],[68],[50],[4]]
}
df = pd.DataFrame.from_dict(data)
df
Out[241]:
s_id r_id
0 5 [5, 34, 44, 23, 11, 71]
1 7 [7, 53, 33, 73, 41]
2 26 [26, 17]
3 70 [70, 10, 31]
4 55 [55, 17]
5 71 [71, 75, 8]
6 8 [8, 7]
7 nan [68]
8 nan [50]
9 4 [4]
I need to populate the r_id list column so that the value from s_id becomes the first element of each list. I also have nan values, and some of the ids are appearing as floats. Thank you.
I tried the following:
df['r_id'] = df["s_id"].apply(lambda x : x.append(df['r_id']) )
df['r_id'] = df["s_id"].apply(lambda x : [x].append(df['r_id'].values.tolist()))
If the nans are missing values, use apply to prepend the s_id value as a one-element list (converted to integer), filtering with pd.notna to omit the missing values:
data = {
's_id' :[5,7,26,70.0,55,71.0,8.0,np.nan,np.nan,4],
'r_id' : [[34, 44, 23, 11, 71], [53, 33, 73, 41],
[17], [10, 31], [17], [75, 8],[7],[68],[50],[]]
}
df = pd.DataFrame.from_dict(data)
print (df)
f = lambda x : [int(x["s_id"])] + x['r_id'] if pd.notna(x["s_id"]) else x['r_id']
df['r_id'] = df.apply(f, axis=1)
print (df)
s_id r_id
0 5.0 [5, 34, 44, 23, 11, 71]
1 7.0 [7, 53, 33, 73, 41]
2 26.0 [26, 17]
3 70.0 [70, 10, 31]
4 55.0 [55, 17]
5 71.0 [71, 75, 8]
6 8.0 [8, 7]
7 NaN [68]
8 NaN [50]
9 4.0 [4]
Another idea is to filter the column and apply the function only to the non-NaN rows:
m = df["s_id"].notna()
f = lambda x : [int(x["s_id"])] + x['r_id']
df.loc[m, 'r_id'] = df[m].apply(f, axis=1)
print (df)
s_id r_id
0 5.0 [5, 34, 44, 23, 11, 71]
1 7.0 [7, 53, 33, 73, 41]
2 26.0 [26, 17]
3 70.0 [70, 10, 31]
4 55.0 [55, 17]
5 71.0 [71, 75, 8]
6 8.0 [8, 7]
7 NaN [68]
8 NaN [50]
9 4.0 [4]
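For what it's worth, apply with axis=1 loops in Python anyway, so a plain list comprehension over the original df builds the same lists and is often a bit faster; just a sketch of the same logic:

# starting again from the original df (not one already modified above)
df['r_id'] = [[int(s)] + r if pd.notna(s) else r
              for s, r in zip(df['s_id'], df['r_id'])]
print(df)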
I have a list of elements and I want to use mapping functions to generate an element-wise list of whether each element falls within any of the ranges in a list of ranges. I already have a solution that uses a for-loop, but for-loops are too slow because both my element and range lists will be much larger.
Here is my code so far:
import pandas as pd
# check element-wise if [1,0,45,60] within ranges 1-10, 21-30, or 41-50
# expected output: true, false, true, false
s = pd.Series([1,0,45,60])
f = lambda x: any((x >= pd.Series([1,20,40])) & (x <= pd.Series([10,30,50])))
print map(f, s)
Error:
elif isinstance(other, (np.ndarray, pd.Index)):
--> if len(self) != len(other):
raise ValueError('Lengths must match to compare')
return self._constructor(na_op(self.values, np.asarray(other)),
TypeError: len() of unsized object
Figured it out. It seems like everything works and is still fast if I convert to NumPy. Normally I'd frown on introducing a new library, but pandas is built on top of NumPy anyway.
import pandas as pd, numpy as np
s = pd.Series([1,0,45,60])
mins = np.array(pd.Series([1,20,40]))
maxes = np.array(pd.Series([10,30,50]))
f = lambda x: np.any((x >= mins) & (x <= maxes))
print map(f, s)
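These snippets are Python 2 (print without parentheses, map returning a list). Under Python 3, and to skip the per-element map entirely, the same check can be broadcast against all the ranges at once; a small sketch of that idea:

import numpy as np
import pandas as pd

s = pd.Series([1, 0, 45, 60])
mins = np.array([1, 20, 40])
maxes = np.array([10, 30, 50])

# compare every element against every range in one shot: shape (len(s), len(mins))
in_any_range = ((s.to_numpy()[:, None] >= mins) & (s.to_numpy()[:, None] <= maxes)).any(axis=1)
print(list(in_any_range))   # [True, False, True, False]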
I think you can first create all the ranges and then check membership with isin and tolist:
import pandas as pd
s = pd.Series([1,0,45,60])
print s
0 1
1 0
2 45
3 60
dtype: int64
rng = range(1,11) + range(21,31) + range(41,51)
print rng
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
print s.isin(rng)
0 True
1 False
2 True
3 False
dtype: bool
print s.isin(rng).tolist()
[True, False, True, False]
EDIT:
For creating ranges you can use numpy.arange and numpy.concatenate:
import numpy as np
rng = np.concatenate((np.arange(1, 11), np.arange(21, 31), np.arange(41, 51)))
print rng
[ 1 2 3 4 5 6 7 8 9 10 21 22 23 24 25
26 27 28 29 30 41 42 43 44 45 46 47 48 49 50]
Another solution for generating ranges can be slicing:
s = range(0,51)
print s
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
print s[1:11] + s[21:31] + s[41:51]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
You can use the cut() function to categorize your values:
In [296]: s[pd.cut(s, bins=range(0, 110, 10), labels=labels).isin(['1 - 10','21 - 30','41 - 50'])]
Out[296]:
0 1
3 23
5 45
dtype: int64
Explanation:
original series:
In [291]: s
Out[291]:
0 1
1 0
2 19
3 23
4 35
5 45
6 60
dtype: int64
labels for categories:
In [292]: labels = [ "{0} - {1}".format(i, i + 9) for i in range(1, 100, 10) ]
In [293]: labels
Out[293]:
['1 - 10',
'11 - 20',
'21 - 30',
'31 - 40',
'41 - 50',
'51 - 60',
'61 - 70',
'71 - 80',
'81 - 90',
'91 - 100']
using cut() for categorizing your series:
In [294]: pd.cut(s, bins=range(0, 110, 10), labels=labels)
Out[294]:
0 1 - 10
1 NaN
2 11 - 20
3 21 - 30
4 31 - 40
5 41 - 50
6 51 - 60
dtype: category
Categories (10, object): [1 - 10 < 11 - 20 < 21 - 30 < 31 - 40 ... 61 - 70 < 71 - 80 < 81 - 90 <
91 - 100]
select only the interesting categories:
In [295]: pd.cut(s, bins=range(0, 110, 10), labels=labels).isin(['1 - 10','21 - 30','41 - 50'])
Out[295]:
0 True
1 False
2 False
3 True
4 False
5 True
6 False
dtype: bool
and finally:
In [296]: s[pd.cut(s, bins=range(0, 110, 10), labels=labels).isin(['1 - 10','21 - 30','41 - 50'])]
Out[296]:
0 1
3 23
5 45
dtype: int64