rearrange dataframe for paired items - python

I have a dataframe like this:
id Shimmer
P01_S01_a 2.31
P01_S01_b 3.87
P01_S02_a 2.54
P01_S02_b 2.96
P02_S01_a 1.78
P02_S01_b 3.19
P02_S02_1 2.04
P02_S02_2 2.08
and I want to rearrange it to this:
id Shimmer_a Shimmer_b
P01_S01 2.31 3.87
P01_S02 2.54 2.96
P02_S01 1.78 3.19
P02_S02 2.04 2.08
I think it would be good to start with a loop because the items are always pairwise, but I don't know how to tell Python to rearrange them.
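For reference, the sample frame from the question can be reconstructed like this (a minimal sketch so the answers below are runnable as-is):

```python
import pandas as pd

# Rebuild the sample frame from the question; note the last pair
# uses trailing 1/2 instead of a/b.
df = pd.DataFrame({
    'id': ['P01_S01_a', 'P01_S01_b', 'P01_S02_a', 'P01_S02_b',
           'P02_S01_a', 'P02_S01_b', 'P02_S02_1', 'P02_S02_2'],
    'Shimmer': [2.31, 3.87, 2.54, 2.96, 1.78, 3.19, 2.04, 2.08],
})
```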

Use Series.replace with a dictionary and $ to match the end of each string (here the trailing 1 and 2), reshape with Series.str.rsplit using n=1 to split on the last _, then DataFrame.pivot, and clean up with DataFrame.rename_axis and DataFrame.add_prefix:
df1 = (df.join(df['id'].replace({'1$': 'a', '2$': 'b'}, regex=True)
                       .str.rsplit('_', expand=True, n=1))
         .pivot(index=0, columns=1, values='Shimmer')
         .rename_axis(index='id', columns=None)
         .add_prefix('Shimmer_')
         .reset_index())
A solution without renaming the last values after _, using a counter from GroupBy.cumcount instead:
df1 = (df.assign(a=df['id'].str.rsplit('_', n=1).str[0],
                 g=lambda x: x.groupby('a').cumcount())
         .pivot(index='a', columns='g', values='Shimmer')
         .rename(columns={0: 'a', 1: 'b'})
         .rename_axis(index='id', columns=None)
         .add_prefix('Shimmer_')
         .reset_index())
print(df1)
id Shimmer_a Shimmer_b
0 P01_S01 2.31 3.87
1 P01_S02 2.54 2.96
2 P02_S01 1.78 3.19
3 P02_S02 2.04 2.08
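To see the counter that GroupBy.cumcount produces for the sample ids (a minimal sketch with a subset of the data):

```python
import pandas as pd

# Derive the pair key and the within-pair counter used by the cumcount solution.
df = pd.DataFrame({'id': ['P01_S01_a', 'P01_S01_b', 'P02_S02_1', 'P02_S02_2']})
a = df['id'].str.rsplit('_', n=1).str[0]   # 'P01_S01', 'P01_S01', 'P02_S02', 'P02_S02'
g = a.groupby(a).cumcount()                # 0/1 within each pair, regardless of suffix
print(g.tolist())  # [0, 1, 0, 1]
```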

Combination of str.split and pivot:
temp = df['id'].str.split('_')
df['id'], df['group'] = temp.str[:-1].str.join('_'), temp.str[-1]
df['group'] = df['group'].replace({'1': 'a', '2': 'b'})
df = df.pivot(index='id', columns='group', values='Shimmer')
df.columns = ['Shimmer_a', 'Shimmer_b']
Shimmer_a Shimmer_b
id
P01_S01 2.31 3.87
P01_S02 2.54 2.96
P02_S01 1.78 3.19
P02_S02 2.04 2.08

Combine a row with column in dataFrame and show the corresponding values

So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about combining two columns into another dataframe using .apply() and .agg(), but found no info on how to combine them as shown above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried this way, but of course it does not work. I have also tried using pd.Series(), but with no success.
Is there a site where I can learn how to do this, or does anybody know the correct way to solve it?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
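One possible approach (a sketch; the linked resource may suggest something different) is an ordered Categorical for the month column, which makes lexical month names sort chronologically:

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']
# A small melted frame in scrambled order, for illustration.
long_df = pd.DataFrame({'Year': [1997, 1997, 1998, 1998],
                        'Month': ['Feb', 'Jan', 'Feb', 'Jan'],
                        'Price': [2.15, 3.45, 2.23, 2.09]})
# An ordered Categorical sorts by category order, not alphabetically.
long_df['Month'] = pd.Categorical(long_df['Month'], categories=months, ordered=True)
long_df = long_df.sort_values(['Year', 'Month'])
print(long_df['Month'].astype(str).tolist())  # ['Jan', 'Feb', 'Jan', 'Feb']
```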
You can use pandas.DataFrame.melt :
out = (
    df
    .melt(id_vars="Year", var_name="Month", value_name="Price")
    .assign(month_num=lambda x: pd.to_datetime(x["Month"], format="%b").dt.month)
    .sort_values(by=["Year", "month_num"])
    .assign(Date=lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
    .loc[:, ["Date", "Price"]]
)
# Output:
print(out)
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]

Dataframe split columns value, how to solve error message?

I have a pandas dataframe with the following columns:
Stock ROC5 ROC20 ROC63 ROCmean
0 IBGL.SW -0.59 3.55 6.57 3.18
0 EHYA.SW 0.98 4.00 6.98 3.99
0 HIGH.SW 0.94 4.22 7.18 4.11
0 IHYG.SW 0.56 2.46 6.16 3.06
0 HYGU.SW 1.12 4.56 7.82 4.50
0 IBCI.SW 0.64 3.57 6.04 3.42
0 IAEX.SW 8.34 18.49 14.95 13.93
0 AGED.SW 9.45 24.74 28.13 20.77
0 ISAG.SW 7.97 21.61 34.34 21.31
0 IAPD.SW 0.51 6.62 19.54 8.89
0 IASP.SW 1.08 2.54 12.18 5.27
0 RBOT.SW 10.35 30.53 39.15 26.68
0 RBOD.SW 11.33 30.50 39.69 27.17
0 BRIC.SW 7.24 11.08 75.60 31.31
0 CNYB.SW 1.14 4.78 8.36 4.76
0 FXC.SW 5.68 13.84 19.29 12.94
0 DJSXE.SW 3.11 9.24 6.44 6.26
0 CSSX5E.SW -0.53 5.29 11.85 5.54
How can I add a new column "Symbol" to the dataframe containing the stock ticker without ".SW"?
For example, the first row's result should be IBGL (from IBGL.SW), and the last row's should be CSSX5E (from the split value CSSX5E.SW).
If I send the following command:
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
Then I receive an error message:
:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
new_df['Symbol'] = new_df.loc[:, ('Stock')].str.split('.').str[0]
How can I solve this problem?
Thanks a lot for your support.
METHOD 1:
You can do a vectorized operation with str.get(0):
df['SYMBOL'] = df['Stock'].str.split('.').str.get(0)
METHOD 2:
You can do another vectorized operation by using expand=True in str.split() and then getting the first column.
df['SYMBOL'] = df['Stock'].str.split('.', expand = True)[0]
METHOD 3:
Or you can write a custom lambda function with apply (for more complex processing). Note that this is slower, but useful if you have your own UDF.
df['SYMBOL'] = df['Stock'].apply(lambda x:x.split('.')[0])
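All three methods produce the same column; a quick check with two of the sample tickers (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Stock': ['IBGL.SW', 'CSSX5E.SW']})
m1 = df['Stock'].str.split('.').str.get(0)          # METHOD 1
m2 = df['Stock'].str.split('.', expand=True)[0]     # METHOD 2
m3 = df['Stock'].apply(lambda x: x.split('.')[0])   # METHOD 3
assert m1.equals(m2) and m2.equals(m3)
print(m1.tolist())  # ['IBGL', 'CSSX5E']
```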
This is not an error but a warning; as you may have noticed, your script finishes its execution.
Edit: Given your comments, it seems the issue originates earlier in your code, so I suggest you use the following:
new_df = new_df.copy(deep=False)
And then proceed to solve it with:
new_df['Symbol'] = new_df['Stock'].str.split('.').str[0]
new_df = new_df.copy()
new_df['Symbol'] = new_df.Stock.str.replace('.SW', '', regex=False)
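A caution on str.replace: when the pattern is treated as a regex, the dot matches any character, so '.SW' would also strip, e.g., an 'XSW' ending. Passing regex=False (or escaping the dot) makes it a literal replacement. A minimal sketch:

```python
import pandas as pd

s = pd.Series(['IBGL.SW', 'CSSX5E.SW'])
# Literal replacement: '.' is not treated as a regex wildcard.
print(s.str.replace('.SW', '', regex=False).tolist())  # ['IBGL', 'CSSX5E']
```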

change a series to a dataframe, using the number inside each label as the index and the other part as the column name

I have a series it is like:
{lag1mid_quoteDiff: 1.51
lag1TradeDirection: 2.12
lag2mid_quoteDiff: 1.53
lag2TradeDirection: 2.18
lag3mid_quoteDiff: 1.59
lag3TradeDirection: 2.10}
I need a dataframe with two columns, lagmid_quoteDiff and lagTradeDirection, and 3 rows indexed 1, 2, 3:
lagmid_quoteDiff lagTradeDirection
1 1.51 2.12
2 1.53 2.18
3 1.59 2.10
How can I do this?
Try crosstab after modifying the series:
s = pd.Series(data)
s1 = s.index.str.extract(r'(\d+)')[0]
out = pd.crosstab(index=s1, columns=s.index.str.replace(r'\d+', '', regex=True),
                  values=s.values, aggfunc='sum')
out
col_0 lagTradeDirection lagmid_quoteDiff
0
1 2.12 1.51
2 2.18 1.53
3 2.10 1.59
If you can ensure that the order of your dictionary entries is consistent:
import pandas as pd
import numpy as np
data = {"lag1mid_quoteDiff": 1.51,
"lag1TradeDirection": 2.12,
"lag2mid_quoteDiff": 1.53,
"lag2TradeDirection": 2.18,
"lag3mid_quoteDiff": 1.59,
"lag3TradeDirection": 2.10}
data = np.array(list(data.values()))
df = pd.DataFrame(data.reshape(-1, 2), columns=["lagmid_quoteDiff", "lagTradeDirection"])
print(df)
lagmid_quoteDiff lagTradeDirection
0 1.51 2.12
1 1.53 2.18
2 1.59 2.10
If you can not guarantee the order of your dictionary entries, try this:
import pandas as pd
import numpy as np
data = {"lag1mid_quoteDiff": 1.51,
"lag1TradeDirection": 2.12,
"lag2mid_quoteDiff": 1.53,
"lag2TradeDirection": 2.18,
"lag3mid_quoteDiff": 1.59,
"lag3TradeDirection": 2.10}
df = pd.DataFrame(data, index=[0]).melt()
df = (df["variable"].str.extract(r"(?P<lag_n>\d+)(?P<type>\w+)")
.join(df)
.pivot(index="lag_n", columns="type", values="value")
.rename_axis(columns=None))
print(df)
TradeDirection mid_quoteDiff
lag_n
1 2.12 1.51
2 2.18 1.53
3 2.10 1.59
Not quite as elegant, but gets the job done:
import re
import pandas as pd

series = pd.Series({
'lag1mid_quoteDiff': 1.51,
'lag1TradeDirection': 2.12,
'lag2mid_quoteDiff': 1.53,
'lag2TradeDirection': 2.18,
'lag3mid_quoteDiff': 1.59,
'lag3TradeDirection': 2.10
})
unpack = {}
for k, v in series.items():  # .iteritems() was removed in pandas 2.0
re_match = re.match(r'(lag)(\d+)(.*)', k)
try:
index_num = int(re_match.group(2))
col = re_match.group(1)+re_match.group(3)
if unpack.get(col):
unpack[col][index_num] = v
else:
unpack[col] = {index_num: v}
except Exception as e:
raise ValueError(f"Provided key is incorrect format: {k}")
df = pd.DataFrame(unpack)
lagmid_quoteDiff lagTradeDirection
1 1.51 2.12
2 1.53 2.18
3 1.59 2.10

Why is the axis for the .mean() method in pandas the opposite in this scenario?

I have a dataframe, height_df, with three measurements, 'height_1','height_2','height_3'. I want to create a new column that has the mean of all three heights. A printout of height_df is given below
height_1 height_2 height_3
0 1.78 1.80 1.80
1 1.70 1.70 1.69
2 1.74 1.75 1.73
3 1.66 1.68 1.67
The following code works but I don't understand why
height_df['height'] = height_df[['height_1','height_2','height_3']].mean(axis=1)
I actually want the mean across the row axis, i.e. for each row, the average of the three heights. I would have thought the axis argument should be 0, since that corresponds to rows, yet axis=1 gives the result I am looking for. Why? If axis=1 is for columns and axis=0 is for rows, why does .mean(axis=1) take the mean across rows?
You just need to tell mean to work across the columns with axis=1. The axis argument names the dimension that gets collapsed: axis=0 collapses the rows (one result per column), while axis=1 collapses the columns (one result per row).
df = pd.DataFrame({"height_1": [1.78, 1.7, 1.74, 1.66],
                   "height_2": [1.8, 1.7, 1.75, 1.68],
                   "height_3": [1.8, 1.69, 1.73, 1.67]})
# Either over all columns, or over an explicit subset:
df = df.assign(height_mean=df.mean(axis=1))
df = df.assign(height_mean=df.loc[:, ['height_1', 'height_2', 'height_3']].mean(axis=1))
print(df.to_string(index=False))
output
height_1 height_2 height_3 height_mean
1.78 1.80 1.80 1.793333
1.70 1.70 1.69 1.696667
1.74 1.75 1.73 1.740000
1.66 1.68 1.67 1.670000
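A small demonstration of the two axes (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
print(df.mean(axis=0).tolist())  # collapses rows: one value per column, [1.5, 3.5]
print(df.mean(axis=1).tolist())  # collapses columns: one value per row, [2.0, 3.0]
```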

dataFrame duplication extraction row

The code below gives exactly the following Jupyter output:
date open high low close volume
0 29/04/1992 2.21 2.21 1.98 1.99 0
1 29/04/1992 2.21 2.21 1.98 1.98 0
2 30/04/1992 2.02 2.32 1.95 1.98 0
size: 6686
no duplicates? False
date open high low close volume
0 29/04/1992 2.21 2.21 1.98 1.99 0
1 29/04/1992 2.21 2.21 1.98 1.98 0
2 30/04/1992 2.02 2.32 1.95 1.98 0
no duplicates? False
size: 6686
What should I change in the line that drops the duplicates?
Thanks!
fskilnik
checking = pd.DataFrame(df)
print(checking.head(3))
size2 = len(checking.index)
print('size:',size2)
print('no duplicates?', checking.date.is_unique)
checking.drop_duplicates(['date'], keep='last')
print(checking.head(3))
print('no duplicates?', checking.date.is_unique)
size2 = len(checking.index)
print('size:',size2)
You should add inplace=True to the drop_duplicates call or reassign the dataframe:
checking.drop_duplicates(['date'], keep='last', inplace=True)
Or:
checking = checking.drop_duplicates(['date'], keep='last')
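For completeness, a sketch of how keep changes which row survives (the data is illustrative, mirroring the first two rows above):

```python
import pandas as pd

df = pd.DataFrame({'date': ['29/04/1992', '29/04/1992', '30/04/1992'],
                   'close': [1.99, 1.98, 1.98]})
# keep='last' retains the final row of each duplicated date.
print(df.drop_duplicates(['date'], keep='last')['close'].tolist())   # [1.98, 1.98]
# keep='first' (the default) retains the first occurrence instead.
print(df.drop_duplicates(['date'], keep='first')['close'].tolist())  # [1.99, 1.98]
```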
