I have a dataframe with messy data.
df:
1 2 3
-- ------- ------- -------
0 123/100 221/100 103/50
1 49/100 333/100 223/50
2 153/100 81/50 229/100
3 183/100 47/25 31/20
4 2.23 3.2 3.04
5 2.39 3.61 2.69
I want the fractional values converted to decimal, where the conversion is simply the fractional value plus 1. For example:
123/100 -> (123/100 + 1) = 2.23
333/100 -> (333/100 + 1) = 4.33
The decimal values should of course be left as they are.
How can I do it in Pandas and Python?
A simple way to do this is to first define a conversion function, to be applied to each element in a column (this assumes the values are stored as strings):
def convert(s):
    if '/' in s:  # the value is a fraction
        num, den = s.split('/')
        return 1 + (int(num) / int(den))
    else:
        return float(s)
Then use the .apply function to run all elements of a column through this function:
df['1'] = df['1'].apply(convert)
Result:
df['1']:
0 2.23
1 1.49
2 2.53
3 2.83
4 2.23
5 2.39
Then repeat on any other column as needed.
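To convert every column in one go, a minimal sketch (assuming all columns hold string values, as above) could be:
for col in df.columns:
    df[col] = df[col].apply(convert)  # reuse the same conversion per column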
If you trust the data in your dataset, the simplest way is to use eval or, better, pd.eval, as suggested by @mozway:
>>> df.replace(r'(\d+)/(\d+)', r'1+\1/\2', regex=True).applymap(pd.eval)
1 2 3
0 2.23 3.21 3.06
1 1.49 4.33 5.46
2 2.53 2.62 3.29
3 2.83 2.88 2.55
4 2.23 3.20 3.04
5 2.39 3.61 2.69
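Note that DataFrame.applymap was deprecated in pandas 2.1 in favor of DataFrame.map, so on recent pandas versions the equivalent would be:
df.replace(r'(\d+)/(\d+)', r'1+\1/\2', regex=True).map(pd.eval)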
So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about combining two columns into a new one using .apply() and .agg(), but I have found no info on how to reshape the data as shown above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried this way, but of course it does not work. I have also tried using pd.Series(), but with no success.
Is there a site where I can learn how to do this, or does anybody know the correct way to solve it?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
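For instance, one common approach (a sketch, assuming the month abbreviations shown above) is to make Month an ordered categorical between those two steps:
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
# sort chronologically: year first, then calendar-month order
long_df['Month'] = pd.Categorical(long_df['Month'], categories=months, ordered=True)
long_df = long_df.sort_values(['Year', 'Month'])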
You can use pandas.DataFrame.melt:
out = (
    df
    .melt(id_vars="Year", var_name="Month", value_name="Price")
    .assign(month_num=lambda x: pd.to_datetime(x["Month"], format="%b").dt.month)
    .sort_values(by=["Year", "month_num"])
    .assign(Date=lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
    .loc[:, ["Date", "Price"]]
)
# Output :
print(out)
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]
I have a pandas dataframe in which there are longer gaps in time, and I want to slice it into smaller dataframes so that the time "clusters" stay together.
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking the time difference between two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Let's say I want to slice this dataframe into smaller dataframes where the time difference between two consecutive points is smaller than 40. How would I go about doing this?
I could loop over the rows, but this is frowned upon, so is there a smarter solution?
Edit: Here is an example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just
df1 = df[df['t_dif']<30]
df2 = df[df['t_dif']>=30]
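If you need one frame per cluster rather than a single two-way split, a cumsum-based sketch (assuming the Time column and the threshold of 40 from the question) could be:
df = df.sort_values('Time')
# start a new cluster wherever the gap to the previous point exceeds 40
cluster = df['Time'].diff().abs().gt(40).cumsum()
frames = [g for _, g in df.groupby(cluster)]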
def split_dataframe(df, value):
    df = df.sort_values("Time")
    df = df.reset_index()
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    # indices of rows where the gap to the next point exceeds the threshold
    indxs = df.index[df['t_dif'] > value].tolist()
    indxs.append(-1)
    indxs.append(len(df))
    indxs.sort()
    frames = []
    for i in range(1, len(indxs)):
        # each cluster runs from just after one break up to the next break (inclusive)
        frames.append(df.iloc[indxs[i - 1] + 1: indxs[i] + 1])
    return frames
This returns the correct dataframes as a list.
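Usage, with the threshold of 40 from the question:
frames = split_dataframe(df, 40)
for frame in frames:
    print(frame)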
I have a pandas dataframe with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the dataframe, containing the stock name without ".SW"?
Example: the first row's result should be IBGL (modified value of IBGL.SW).
Example: the last row's result should be CSSX5E (split value of CSSX5E.SW).
If I send the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive an error message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation with str.get(0):
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then getting the first column.
df['SYMBOL'] = df['Stock'].str.split('.', expand = True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (for more complex processes). Note, this is slower but good if you have your own UDF.
df['SYMBOL'] = df['Stock'].apply(lambda x:x.split('.')[0])
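As a side note, if the goal is only to strip a known suffix, pandas 1.4+ also offers Series.str.removesuffix, which avoids surprises if a ticker contains more than one dot:
df['SYMBOL'] = df['Stock'].str.removesuffix('.SW')  # removes '.SW' only at the end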
This is not an error but a warning; as you may have noticed, your script finishes its execution.
Edit: Given your comments, it seems the issue originates earlier in your code, so I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df.loc[:, 'Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)  # literal match, not regex
X =
[[14.23 3.06 5.64 2.43]
[13.2 2.76 4.38 2.14]
[13.16 3.24 5.68 2.67]
[14.37 3.49 7.8 2.5 ]
[13.24 2.69 4.32 2.87]
[14.2 3.39 6.75 2.45]
[14.39 2.52 5.25 2.45]
[14.06 2.51 5.05 2.61]
[14.83 2.98 5.2 2.17]
[13.86 3.15 7.22 2.27]
[14.1 3.32 5.75 2.3 ]
[14.12 2.43 5. 2.32]
[13.75 2.76 5.6 2.41]
[14.75 3.69 5.4 2.39]
[14.38 3.64 7.5 2.38]
[13.63 2.91 7.3 2.7 ]
[14.3 3.14 6.2 2.72]
[13.83 3.4 6.6 2.62]
[14.19 3.93 8.7 2.48]
[13.64 3.03 5.1 2.56]]
Here is my dataset. Now I want to calculate the Euclidean distance between 2 of the vectors (rows).
Row1 = X[1]
Row2 = X[2]
My function:
def Edistance(v1, v2):
    distance = 0.0
    for i in range(len(v1)-1):
        distance += (v1(i)) - (v2(i))**2
    return sqrt(distance)

Edistance(Row1, Row2)
I then get a TypeError: NumPy array is not callable. Can I not use an array as my function's input?
You can pass any object as a function argument, so you can pass arrays, but as @xdurch0 mentioned earlier, your syntax is wrong.
def Edistance(v1, v2):
    distance = 0.0
    for i in range(len(v1)-1):
        distance += (v1(i)) - (v2(i))**2
    return sqrt(distance)
What you are trying to do here is call v1 and v2 as if they were functions, since () is used to call. What you want, as far as I understand, is [] to reference an element inside the array.
So, basically, you want v1[i] and v2[i] (instead of v1(i) and v2(i), respectively).
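For reference, a corrected sketch (also fixing the loop range, which skipped the last component, and the missing parentheses around the squared difference) might look like:
from math import sqrt

def Edistance(v1, v2):
    distance = 0.0
    for i in range(len(v1)):              # iterate over all components
        distance += (v1[i] - v2[i]) ** 2  # square each difference, then sum
    return sqrt(distance)

# equivalently, vectorized with numpy: np.linalg.norm(v1 - v2)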
I'm trying to get the mean of each column while grouped by id, BUT only the middle 50% of values, between the 25% quantile and the 75% quantile, should be used for the calculation. (So ignore the lowest 25% of values and the highest 25%.)
The data:
ID Property1 Property2 Property3
1 10.2 ... ...
1 20.1
1 51.9
1 15.8
1 12.5
...
1203 104.4
1203 11.5
1203 19.4
1203 23.1
What I tried:
data.groupby('id').quantile(0.75).mean();
#data.groupby('id').agg(lambda grp: grp.quantile(0.25, 0.75)).mean(); something like that?
CW 67.089733
fd 0.265917
fd_maxna -1929.522001
fd_maxv -1542.468399
fd_sumna -1928.239954
fd_sumv -1488.165382
planc -13.165445
slope 13.654163
Something like that, but GroupBy.quantile doesn't accept an in-between range to my knowledge, and I don't know how to also remove the lower 25%.
This also doesn't return a dataframe.
What I want
Ideally, I would like to have a dataframe as follows:
ID Property1 Property2 Property3
1 37.8 5.6 2.3
2 33.0 1.5 10.4
3 34.9 91.5 10.3
4 33.0 10.3 14.3
Where only the data between the 25% quantile and the 75% quantile are used for the mean calculation, i.e. only the 50% in between.
Using GroupBy.apply here can be slow, so let's avoid it. I suppose this is your data frame:
print(df)
ID Property3 Property2 Property1
0 1 10.2 58.337589 45.083237
1 1 20.1 70.844807 29.423138
2 1 51.9 67.126043 90.558225
3 1 15.8 17.478715 41.492485
4 1 12.5 18.247211 26.449900
5 1203 104.4 113.728439 130.698964
6 1203 11.5 29.659894 45.991533
7 1203 19.4 78.910591 40.049054
8 1203 23.1 78.395974 67.345487
So I would use GroupBy.cumcount + DataFrame.pivot_table
to calculate quantiles without using apply:
df['aux'] = df.groupby('ID').cumcount()
new_df = df.pivot_table(columns='ID', index='aux', values=['Property1', 'Property2', 'Property3'])
print(new_df)
Property1 Property2 Property3
ID 1 1203 1 1203 1 1203
aux
0 45.083237 130.698964 58.337589 113.728439 10.2 104.4
1 29.423138 45.991533 70.844807 29.659894 20.1 11.5
2 90.558225 40.049054 67.126043 78.910591 51.9 19.4
3 41.492485 67.345487 17.478715 78.395974 15.8 23.1
4 26.449900 NaN 18.247211 NaN 12.5 NaN
# remove the aux column
df = df.drop('aux', axis=1)
Now we calculate the mean using boolean indexing:
new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))].mean()
ID
Property1 1 59.963006
1203 70.661294
Property2 1 49.863814
1203 45.703292
Property3 1 15.800000
1203 21.250000
dtype: float64
or create DataFrame with the mean:
mean_df = (new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))].mean()
           .rename_axis(index=['Property', 'ID'])
           .unstack('Property'))
print(mean_df)
Property Property1 Property2 Property3
ID
1 41.492485 58.337589 15.80
1203 56.668510 78.653283 21.25
Measured times:
%%timeit
df['aux'] = df.groupby('ID').cumcount()
new_df = df.pivot_table(columns='ID', index='aux', values=['Property1', 'Property2', 'Property3'])
df = df.drop('aux', axis=1)
(new_df[(new_df.quantile(0.75) > new_df) & (new_df > new_df.quantile(0.25))].mean()
 .rename_axis(index=['Property', 'ID'])
 .unstack('Property'))
25.2 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()

df.groupby("ID").apply(lambda x: x.apply(mean_of_25_to_75_pct))
33 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()

means = df.groupby("ID").apply(filter_mean)
23 ms ± 809 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
It is almost as fast even on this small data frame; on larger data frames, such as the original one, it would be much faster than the other proposed methods (see: when to use apply).
You can use the quantile function to return multiple quantiles. Then, you can filter out values based on these bounds and compute the mean:
def filter_mean(df):
    bounds = df.quantile([.25, .75])
    mask = (df < bounds.loc[0.75]) & (df > bounds.loc[0.25])
    return df[mask].mean()
means = data.groupby("id").apply(filter_mean)
Please try this.
def mean_of_25_to_75_pct(s: pd.Series):
    low, high = s.quantile(.25), s.quantile(.75)
    return s.loc[(s >= low) & (s < high)].mean()
data.groupby("id").apply(lambda x: x.apply(mean_of_25_to_75_pct))
You could use SciPy's ready-made function for the trimmed mean, trim_mean():
from scipy import stats
means = data.groupby("id").apply(stats.trim_mean, 0.25)
If you insist on getting a dataframe, you could:
data.groupby("id").agg(lambda x: stats.trim_mean(x, 0.25)).reset_index()