With Python 3.10
Sample data:
import pandas as pd
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
print(df)
df = df.mask(df == 0).fillna(df.mean())
print(df)  # <-- this works, but you will see what I mean about the filled values looking off
Updated Solution:
import numpy as np

df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = round(df['data'].rolling(4, 1).apply(lambda x: np.nanmean(x)), 2)
df['final2'] = np.where(df['data'] > 0, df['data'], df['ma'])
print(df)
# This replaces the zeros and NULLs with a value (sometimes it fits well, sometimes not so much).
The idea is that I have one or more columns with bad or missing data.
If I use .fillna(df.mean()) for this, the filled values stick out like a sore thumb.
My goal is to use a percentage of the total number of elements in the column to compute the replacement mean.
I would like to take len(df) * 0.30 (30%) and divide it in half.
I would collect half of those numbers from above the index where the bad data (null/0) sits,
and the other half from below that index.
These collected elements would then be used to calculate a replacement for the missing or bad point.
This would be more helpful for a data set that is irregular or has missing/bad data.
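A minimal sketch of that idea (illustration only, not a tested solution; the helper name fill_from_neighbours is made up): size a centred rolling window to roughly 30% of the frame, so each bad point is averaged from about 15% of the rows above it and 15% below it.
import pandas as pd

def fill_from_neighbours(s, frac=0.30):
    # window covering roughly `frac` of the series, centred on each row
    window = max(int(len(s) * frac), 1)
    cleaned = s.mask(s == 0)  # treat 0 as bad data, leave NaN as-is
    centred = cleaned.rolling(window, min_periods=1, center=True).mean()
    return cleaned.fillna(centred)  # only the bad points get replaced

# df['data_filled'] = fill_from_neighbours(df['data'])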
You can take a rolling mean with min_periods=1 to smooth out the data,
or you can do a variant of this method to customise what you want.
Inside the lambda I used np.nanmean(x).
import pandas as pd
import numpy as np
data = [[1, 14890, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9],
[11, 13, 14], [12, 0, 18], [87, None, 54], [1, 0, 3], [4, 5, 6], [7, 8, 9], [11, 13, 14], [12, 0, 18],
[87,10026, 54]]
df = pd.DataFrame(data, columns=['column', 'data', 'something'])
df['ma'] = df['data'].rolling(3,1).apply(lambda x : np.nanmean(x))
df['final'] = np.where(df['data'] >= 0, df['data'], df['ma'])
print(df)
result:
column data something ma final
0 1 14890.0 3 14890.000000 14890.0
1 4 5.0 6 7447.500000 5.0
2 7 8.0 9 4967.666667 8.0
3 11 13.0 14 8.666667 13.0
4 12 0.0 18 7.000000 0.0
5 87 NaN 54 6.500000 6.5
6 1 0.0 3 0.000000 0.0
7 4 5.0 6 2.500000 5.0
8 7 8.0 9 4.333333 8.0
9 11 13.0 14 8.666667 13.0
10 12 0.0 18 7.000000 0.0
11 87 NaN 54 6.500000 6.5
12 1 0.0 3 0.000000 0.0
13 4 5.0 6 2.500000 5.0
14 7 8.0 9 4.333333 8.0
15 11 13.0 14 8.666667 13.0
16 12 0.0 18 7.000000 0.0
17 87 10026.0 54 3346.333333 10026.0
Consider a dataframe like pivoted below, where replicates of some data are given as lists:
d = {'Compound': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'Conc': [1, 0.5, 0.1, 1, 0.5, 0.1, 2, 1, 0.5, 0.1],
'Data': [[100, 90, 80], [50, 40, 30], [10, 9.7, 8],
[20, 15, 10], [3, 4, 5, 6], [100, 110, 80],
[30, 40, 50, 20], [10, 5, 9, 3], [2, 1, 2, 2], [1, 1, 0]]}
df = pd.DataFrame(data=d)
pivoted = df.pivot(index='Conc', columns='Compound', values='Data')
This df can be written to an excel file as such:
with pd.ExcelWriter('output.xlsx') as writer:
    pivoted.to_excel(writer, sheet_name='Sheet1', index_label='Conc')
How can this instead be written so that the replicate data are placed in side-by-side cells in the Excel file?
Then you need to pivot your data in a slightly different way: first explode the Data column, then deduplicate with groupby.cumcount:
(df.explode('Data')
.assign(n=lambda d: d.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'n'], values='Data')
.droplevel('n', axis=1).rename_axis(columns=None)
)
Output:
A A A B B B B C C C C
Conc
0.1 10 9.7 8 100 110 80 NaN 1 1 0 NaN
0.5 50 40 30 3 4 5 6 2 1 2 2
1.0 100 90 80 20 15 10 NaN 10 5 9 3
2.0 NaN NaN NaN NaN NaN NaN NaN 30 40 50 20
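Assigning the result to a variable (out is just an assumed name here), it can then be written to Excel the same way as in the question:
out = (df.explode('Data')
       .assign(n=lambda d: d.groupby(level=0).cumcount())
       .pivot(index='Conc', columns=['Compound', 'n'], values='Data')
       .droplevel('n', axis=1).rename_axis(columns=None))

with pd.ExcelWriter('output.xlsx') as writer:
    out.to_excel(writer, sheet_name='Sheet1', index_label='Conc')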
Besides @mozway's answer, just for formatting, you can use:
piv = (df.explode('Data').assign(col=lambda x: x.groupby(level=0).cumcount())
.pivot(index='Conc', columns=['Compound', 'col'], values='Data')
.rename_axis(None))
piv.columns = pd.Index([i if j == 0 else '' for i, j in piv.columns], name='Conc')
piv.to_excel('file.xlsx')
I have an input df:
input_ = pd.DataFrame.from_records(
    [
        [1, 10, 11, 31],
        [2, 20, 12, 21],
        [3, 30, 13, 11],
    ],
    columns=['X_val', 'Y_val1', 'Y_val2', 'Y_val3'])
and want to concatenate every Y value while still keeping track of which column each value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns and ended up concatenating them column by column and extending with repeated values, but I was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
    [
        [1, 10, 'Y_val1'],
        [1, 11, 'Y_val2'],
        [1, 31, 'Y_val3'],
        [2, 20, 'Y_val1'],
        [2, 12, 'Y_val2'],
        [2, 21, 'Y_val3'],
        [3, 30, 'Y_val1'],
        [3, 13, 'Y_val2'],
        [3, 11, 'Y_val3'],
    ],
    columns=['X_val', 'Y_val', 'Y_type'])
You can use pandas.DataFrame.melt:
input_.melt(
id_vars=['X_val'],
value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can also use the following variation, especially for a large number of similarly named Y_val* columns:
input_.melt(
id_vars=['X_val'],
value_vars=input_.filter(regex='Y_val').columns,
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
X_val Y_type Y_val
0 1 Y_val1 10
1 1 Y_val2 11
2 1 Y_val3 31
3 2 Y_val1 20
4 2 Y_val2 12
5 2 Y_val3 21
6 3 Y_val1 30
7 3 Y_val2 13
8 3 Y_val3 11
Optionally, you can rearrange the column sequence if you like.
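For instance (a small illustration; melted is just an assumed variable name for the result of the melt above), to put Y_val before Y_type as in the expected output:
melted = input_.melt(id_vars=['X_val'], var_name='Y_type', value_name='Y_val')
melted = melted.sort_values('X_val', ignore_index=True)[['X_val', 'Y_val', 'Y_type']]
print(melted)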
I am trying to read data using pandas.
Here is what I have tried:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("samples_data.csv")
in_x = df.for_x
in_y = df.for_y
in_init = df.Init
plt.plot(in_x[0], in_y[0], 'b-')
The problem is that in_x and in_y contain strings: (0, '[5 3 9 4.8 2]') (1, '[6 3 9 4.8 2]') ... How can I solve the problem?
Thank you for taking the time to answer my question.
I was expecting numbers, but indexing gives single characters instead:
in_x_1 = in_x[2][0]  # output: [
in_x_2 = in_x[2][1]  # output: 6
Read the data into a dataframe and slice with the iloc method:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame([
[[5,3,9,4.8,2], [5,3,9,4.8,9], 33],
[[6,3,9,4.8,2], [4,3.8,9,8,4], 87],
[[6.08,2.89,9,4.8,2], [8,3,9,4,7.34], 93],
],
columns=["for_x", "for_y", "Init"]
)
print(df)
in_x = df.for_x.iloc[0]
in_y = df.for_y.iloc[0]
plt.plot(in_x, in_y, 'b-')
plt.show()
Printing the dataframe:
for_x for_y Init
0 [5, 3, 9, 4.8, 2] [5, 3, 9, 4.8, 9] 33
1 [6, 3, 9, 4.8, 2] [4, 3.8, 9, 8, 4] 87
2 [6.08, 2.89, 9, 4.8, 2] [8, 3, 9, 4, 7.34] 93
If your dataframe has string entries, the eval function will turn them into lists, which you can then plot data from:
df_2 = pd.DataFrame([
['[5,3,9,4.8,2]', '[5,3,9,4.8,9]', 33],
['[6,3,9,4.8,2]', '[4,3.8,9,8,4]', 87],
['[6.08,2.89,9,4.8,2]', '[8,3,9,4,7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
in_x = eval(df_2.for_x.iloc[0])
in_y = eval(df_2.for_y.iloc[0])
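eval will execute arbitrary expressions, so if the strings come from an external file, ast.literal_eval is a safer drop-in for this case (a small variation, not from the original answer):
from ast import literal_eval

in_x = literal_eval(df_2.for_x.iloc[0])  # safely parses '[5,3,9,4.8,2]' into a list
in_y = literal_eval(df_2.for_y.iloc[0])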
If your values are not comma separated:
df_3 = pd.DataFrame([
['[5 3 9 4.8 2]', '[5 3 9 4.8 9]', 33],
['[6 3 9 4.8 2]', '[4 3.8 9 8 4]', 87],
['[6.08 2.89 9 4.8 2]', '[8 3 9 4 7.34]', 93],
],
columns=["for_x", "for_y", "Init"]
)
string_of_nums_x = df_3.for_x.iloc[0].strip('[').strip(']')
in_x = [float(s) for s in string_of_nums_x.split()]
string_of_nums_y = df_3.for_y.iloc[0].strip('[').strip(']')
in_y = [float(s) for s in string_of_nums_y.split()]
Plotting then works the same as with the list-valued dataframe.
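If the data comes from a CSV file as in the question, the parsing can also be done at read time with a converter (a sketch; the file name samples_data.csv and the columns for_x/for_y are taken from the question, while the helper name parse_bracketed is made up):
import pandas as pd

def parse_bracketed(value):
    # turn a string like '[5 3 9 4.8 2]' into a list of floats
    return [float(s) for s in value.strip('[]').split()]

df = pd.read_csv("samples_data.csv",
                 converters={"for_x": parse_bracketed, "for_y": parse_bracketed})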
Let's assume that I have the following dataframe:
df_raw = pd.DataFrame({"id": [102, 102, 103, 103, 103], "val1": [9,2,4,7,6], "val2": [np.nan, 3, np.nan, 4, 5], "val3": [4, np.nan, np.nan, 5, 1], "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 3)]})
I want to have access to the rows where the first occurrence of each id is. So these rows would be:
df_first = pd.DataFrame({"id": [102, 103], "val1": [9, 4], "val2": [np.nan, np.nan], "val3": [4, np.nan], "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2003, 4, 4)]})
Basically, at the end what I would like to achieve is fill up the NaNs that appear in the first occurrence of each id. So the final data frame might be:
df_processed = pd.DataFrame({"id": [102, 102, 103, 103, 103], "val1": [9,2,4,7,6], "val2": [-1, 3, -1, 4, 5], "val3": [4, np.nan, -1, 5, 1], "date": [pd.Timestamp(2002, 1, 1), pd.Timestamp(2002, 3, 3), pd.Timestamp(2003, 4, 4), pd.Timestamp(2003, 8, 9), pd.Timestamp(2005, 2, 3)]})
An important note is that the rows are already grouped by id and date and sorted in an ascending manner, so they appear exactly as in the provided example.
IIUC, using drop_duplicates then concat:
df1=df_raw.drop_duplicates('id').fillna(-1)
target=pd.concat([df1,df_raw.loc[~df_raw.index.isin(df1.index)]]).sort_index()
target
date id val1 val2 val3
0 2002-01-01 102 9 -1.0 4.0
1 2002-03-03 102 2 3.0 NaN
2 2003-04-04 103 4 -1.0 -1.0
3 2003-08-09 103 7 4.0 5.0
4 2005-02-03 103 6 5.0 1.0
You can use pd.Series.duplicated with Boolean row indexing:
mask = ~df_raw['id'].duplicated()
val_cols = ['val2', 'val3']
df_raw.loc[mask, val_cols] = df_raw.loc[mask, val_cols].fillna(-1)
print(df_raw)
id val1 val2 val3 date
0 102 9 -1.0 4.0 2002-01-01
1 102 2 3.0 NaN 2002-03-03
2 103 4 -1.0 -1.0 2003-04-04
3 103 7 4.0 5.0 2003-08-09
4 103 6 5.0 1.0 2005-02-03
I have the following dataframe:
table2 = pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
And I wrote the following code to generate a new DataFrame with a modified output for each 'sim':
for i in range(1, 3):
    table2['Bucket%s' % i] = 5 * (table2['sim_%s' % i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s' % i].map(int)
    table2['hv'] = table2['Bucket%s' % i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value' % row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value' % row['hv']], axis=1)
    table2['Final_value_%s' % i] = (table2['nHigher'] - table2['nLower']) * (table2['Bucket%s' % i] - table2['lv']) + table2['nLower']
df = table2.filter(regex="sim|Type")
Output:
>>> df
Product Type sim_1 sim_2
0 A 35.0 60.0
1 B -39.0 36.0
2 C 56.0 92.0
3 D 23.0 33.0
I want to run this on 10,000 sims, and currently each loop takes about .25 seconds. Is there any way to modify this code to avoid the loop and be more time efficient?
Edit: If you're curious what this code is trying to accomplish, you can see my self-answered, somewhat disorganized question here: Pandas DataFrame: Complex linear interpolation
I was able to accomplish this with no loops using the code below.
As a result, on my 10k x 200 table it ran in 3 minutes instead of the previous 2 hours.
Unfortunately I now need to run it on a 10k x 4k table, and I hit a MemoryError on that one, but that may be out of the scope of this question.
df= pd.DataFrame({
'Product Type': ['A', 'B', 'C', 'D'],
'State_1_Value': [10, 11, 12, 13],
'State_2_Value': [20, 21, 22, 23],
'State_3_Value': [30, 31, 32, 33],
'State_4_Value': [40, 41, 42, 43],
'State_5_Value': [50, 51, 52, 53],
'State_6_Value': [60, 61, 62, 63],
'Lower_Bound': [-1, 1, .5, 5],
'Upper_Bound': [1, 2, .625, 15],
'sim_1': [0, 0, .61, 7],
'sim_2': [1, 1.5, .7, 9],
})
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'], axis=0), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:,None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:,None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
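If memory becomes the bottleneck on the wider table, one possibility (a rough sketch of my own, not part of the answer above; the function name interpolate_sims and the chunk_size parameter are made up) is to process the sim columns in chunks with plain NumPy, so only one chunk's worth of intermediate arrays is alive at a time:
import numpy as np
import pandas as pd

def interpolate_sims(df, chunk_size=100):
    state = df.filter(regex="State").values            # (n_rows, 6): State_1..State_6 values
    lower = df['Lower_Bound'].values[:, None]
    upper = df['Upper_Bound'].values[:, None]
    sims = df.filter(regex="sim")
    rows = np.arange(len(df))[:, None]
    out = {}
    for start in range(0, sims.shape[1], chunk_size):
        chunk = sims.iloc[:, start:start + chunk_size].values
        buckets = 5 * (chunk - lower) / (upper - lower) + 1
        low = np.clip(buckets.astype(int), 1, 5)        # clamp to State_1..State_5
        high = np.clip(buckets.astype(int) + 1, 2, 6)   # clamp to State_2..State_6
        low_val = state[rows, low - 1]                  # State_k_Value sits at column k-1
        high_val = state[rows, high - 1]
        final = (high_val - low_val) * (buckets - low) + low_val
        for j, col in enumerate(sims.columns[start:start + chunk_size]):
            out['Final_value_' + col.split('_')[-1]] = final[:, j]
    return pd.DataFrame(out, index=df.index)
On the 4-row example above, interpolate_sims(df) should reproduce the same Final_value_1 and Final_value_2 columns as the loop version, while keeping peak memory proportional to chunk_size rather than to the total number of sim columns.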