I have two pandas dataframes that I want to merge. My first dataframe, names, is a list of stock tickers and corresponding dates. Example below:
Date Symbol DateSym
0 2017-01-05 AGRX AGRX01-05-2017
1 2017-01-05 TMDI TMDI01-05-2017
2 2017-01-06 ATHE ATHE01-06-2017
3 2017-01-06 AVTX AVTX01-06-2017
4 2017-01-09 CVM CVM01-09-2017
5 2017-01-10 DFFN DFFN01-10-2017
6 2017-01-10 VKTX VKTX01-10-2017
7 2017-01-11 BIOC BIOC01-11-2017
8 2017-01-11 CVM CVM01-11-2017
9 2017-01-11 PALI PALI01-11-2017
I created another dataframe, price1, by looping through the tickers and building a dataframe with the open, high, low, close, and other relevant info I need. When I merge the two dataframes together, I want to show only the names dataframe on the left with the corresponding price data on the right. When I ran a test of the first 10 tickers, I noticed that the combined dataframe outputs redundant rows (see CVM in rows 4 and 5 below), even though the price1 dataframe doesn't have duplicate values. What am I doing wrong?
import numpy as np
import pandas as pd
import yfinance as yf
# `nyse` is assumed to be an exchange calendar created elsewhere,
# e.g. pandas_market_calendars.get_calendar('NYSE')

def price_stats(df):
    # df['ticker'] = df
    df['Gap Up%'] = df['Open'] / df['Close'].shift(1) - 1
    df['HOD%'] = df['High'] / df['Open'] - 1
    df['Close vs Open%'] = df['Close'] / df['Open'] - 1
    df['Close%'] = df['Close'] / df['Close'].shift(1) - 1
    df['GU and Goes Red'] = np.where((df['Low'] < df['Close'].shift(1)) & (df['Open'] > df['Close'].shift(1)), 1, 0)
    df['Yday Intraday>30%'] = np.where((df['Close vs Open%'].shift(1) > .30), 1, 0)
    df['Gap Up?'] = np.where((df['Gap Up%'] > 0), 1, 0)
    df['Sloppy $ Vol'] = (df['High'] + df['Low'] + df['Close']) / 3 * df['Volume']
    df['Prev Day Sloppy $ Vol'] = (df['High'].shift(1) + df['Low'].shift(1) + df['Close'].shift(1)) / 3 * df['Volume'].shift(1)
    df['Prev Open'] = df['Open'].shift(1)
    df['Prev High'] = df['High'].shift(1)
    df['Prev Low'] = df['Low'].shift(1)
    df['Prev Close'] = df['Close'].shift(1)
    df['Prev Vol'] = df['Volume'].shift(1)
    df['D-2 Close'] = df['Close'].shift(2)
    df['D-2 Vol'] = df['Volume'].shift(2)
    df['D-3 Close'] = df['Close'].shift(3)
    df['D-2 Open'] = df['Open'].shift(2)
    df['D-2 High'] = df['High'].shift(2)
    df['D-2 Low'] = df['Low'].shift(2)
    df['D-2 Intraday Range'] = df['D-2 Close'] / df['D-2 Open'] - 1
    df['D-2 Close%'] = df['D-2 Close'] / df['D-3 Close'] - 1
    df.dropna(inplace=True)
vol_excel = pd.read_excel('C://U******.xlsx')
names = vol_excel.Symbol.to_list()

price1 = pd.DataFrame()
for name in names[0:10]:
    print(name)
    price = yf.download(name, start="2016-12-01", end="2022-03-04")
    price['ticker'] = name
    price_stats(price)
    price1 = pd.concat([price1, price])

price1 = price1.reset_index()
orig_day = pd.to_datetime(price1['Date'])
price1['Prev Day Date'] = orig_day - pd.tseries.offsets.CustomBusinessDay(1, holidays=nyse.holidays().holidays)
price1['DateSym'] = price1['ticker'] + price1['Date'].dt.strftime('%m-%d-%Y')
price1 = price1.rename(columns={'ticker': 'Symbol'})

# move DateSym to the front
datesym = price1['DateSym']
price1.drop(labels=['DateSym'], axis=1, inplace=True)
price1.insert(0, 'DateSym', datesym)

vol_excel['DateSym'] = vol_excel['Symbol'] + vol_excel['Date'].dt.strftime('%m-%d-%Y')
dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')
See how CVM is duplicated when I print out dfcombo:
Date Symbol DateSym_x DateSym_y Open High Low Close Adj Close Volume ... Prev Vol D-2 Close D-2 Vol D-3 Close D-2 Open D-2 High D-2 Low D-2 Intraday Range D-2 Close% Prev Day Date
0 2017-01-05 AGRX AGRX01-05-2017 AGRX01-05-2017 2.71 2.71 2.40 2.52 2.52 2408400 ... 18584900.0 5.000 2390400.0 5.700000 5.770 5.813000 4.460 -0.133449 -0.122807 2017-01-04
1 2017-01-05 TMDI TMDI01-05-2017 TMDI01-05-2017 15.60 16.50 12.90 13.50 13.50 43830 ... 114327.0 10.500 61543.0 7.200000 7.500 10.500000 7.500 0.400000 0.458333 2017-01-04
2 2017-01-06 ATHE ATHE01-06-2017 ATHE01-06-2017 2.58 2.60 2.23 2.42 2.42 222500 ... 1750700.0 1.930 53900.0 1.750000 1.790 1.950000 1.790 0.078212 0.102857 2017-01-05
3 2017-01-06 AVTX AVTX01-06-2017 AVTX01-06-2017 1.24 1.24 1.02 1.07 1.07 480500 ... 1246100.0 0.883 44900.0 0.890000 0.896 0.950000 0.827 -0.014509 -0.007865 2017-01-05
4 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
5 2017-01-09 CVM CVM01-09-2017 CVM01-09-2017 2.75 3.00 2.75 2.75 2.75 376520 ... 414056.0 2.000 77360.0 2.000000 2.000 2.250000 2.000 0.000000 0.000000 2017-01-06
6 2017-01-10 DFFN DFFN01-10-2017 DFFN01-10-2017 111.00 232.50 108.75 125.25 125.25 165407 ... 43167.0 30.900 67.0 34.650002 31.500 34.349998 30.900 -0.019048 -0.108225 2017-01-09
7 2017-01-10 VKTX VKTX01-10-2017 VKTX01-10-2017 1.64 1.64 1.43 1.56 1.56 981700 ... 1550400.0 1.260 264900.0 1.230000 1.250 1.299000 1.210 0.008000 0.024390 2017-01-09
8 2017-01-11 BIOC BIOC01-11-2017 BIOC01-11-2017 858.00 1017.00 630.00 813.00 813.00 210182 ... 78392.0 306.000 5368.0 285.000000 285.000 315.000000 285.000 0.073684 0.073684 2017-01-10
9 2017-01-11 CVM CVM01-11-2017 CVM01-11-2017 4.25 4.50 3.00 3.75 3.75 487584 ... 672692.0 2.750 376520.0 2.750000 2.750 3.000000 2.750 0.000000 0.000000 2017-01-10
I'm wondering whether the issue is that the names dataframe can contain the same ticker with different dates, and each pass through the loop pulls that ticker's price data and appends it to the price1 dataframe again.
For example, in the names dataframe, AGRX can be listed for both 2017-01-05 and 2020-12-20. My loop as shown pulls the Yahoo data and appends it to the price1 dataframe even though it's the same set of data. By the same token, is there a way for me to skip appending that duplicate ticker, and will that solve the issue?
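A minimal sketch of both ideas, reusing the variables defined above (names, price1, vol_excel); either deduplicating the tickers before the loop or dropping the duplicated rows before the merge should remove the repeated CVM rows:
# Sketch only, not a drop-in replacement for the script above.

# Option 1: download each ticker only once, even if it appears on several dates.
unique_names = list(dict.fromkeys(names))  # preserves order, drops repeats

# Option 2: drop exact duplicate rows after the loop, before merging.
price1 = price1.drop_duplicates(subset=['Symbol', 'Date'])
dfcombo = vol_excel.merge(price1, on=['Date', 'Symbol'], how='inner')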
Sorry if I'm not clear, but I have a challenge.
This is the sample data I have generated to try to make my challenge clear:
        B      V      S      F      K
0    0.32  10.32  11.32  12.32  13.32
1    1.32  11.32  12.32  13.32  14.32
2    2.32  12.32  13.32  14.32  15.32
3    3.32  13.32  14.32  15.32  16.32
4    4.32  14.32  15.32  16.32  17.32
5    5.32  15.32  16.32  17.32  18.32
6    6.32  16.32  17.32  18.32  19.32
7    7.32  17.32  18.32  19.32  20.32
8    8.32  18.32  19.32  20.32  21.32
9    9.32  19.32  20.32  21.32  22.32
10  10.32  20.32  21.32  22.32  23.32
My expected output:
    K      L  M
0   1   2.32
1   2   3.32
2   3   4.32
3   4   5.32
4   5   6.32
5   6  13.32
6   7  14.32
7   8  15.32
8   9  16.32
9  10  17.32
The table above shows the outcome I'm after: I would like to know how to create another column M in dataset 2 that returns the name of the column from dataset 1 that contains the value in column L (which is in dataset 2).
I have tried the code below, but it didn't work. I hope someone here can help; thanks in advance!
spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]
It returned the following error:
~\AppData\Local\Temp/ipykernel_25368/552331776.py in <module>
----> 1 spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]

~\AppData\Local\Temp/ipykernel_25368/552331776.py in <listcomp>(.0)
----> 1 spike_cols = [col for col in stata.columns if df['IMAGE NUMBER'] in col]

TypeError: 'in <string>' requires string as left operand, not Series
I created two dataframes here. The for loop searches for matches and records the matching column names for each row. You can take the dataframes df and df1 from my answer and swap in your own data.
import pandas as pd
import numpy as np

df = pd.DataFrame({'B': [0.32, 1.32, 2.32, 3.32, 4.32, 5.32, 6.32, 7.32, 8.32, 9.32, 10.32],
                   'V': [10.32, 11.32, 12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32],
                   'S': [11.32, 12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32],
                   'F': [12.32, 13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32, 22.32],
                   'K': [13.32, 14.32, 15.32, 16.32, 17.32, 18.32, 19.32, 20.32, 21.32, 22.32, 23.32]})
print(df)

df1 = pd.DataFrame({'K': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'L': [2.32, 3.32, 4.32, 5.32, 6.32, 13.32, 14.32, 15.32, 16.32, 17.32]})

M = []
for k in range(len(df1)):
    i, c = np.where(df == df1['L'][k])   # row and column indexes where there was a match
    ttt = df.columns[c]                  # names of the matching columns
    M.append(','.join(list(ttt)))        # join multiple matches into a comma-separated string

df1['M'] = M  # add a column with the collected values
print(df1)
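For reference, a vectorized sketch that produces the same M column without the explicit loop, assuming the same df and df1 as above:
# For each value in L, collect the names of the df columns that contain it.
df1['M'] = df1['L'].map(lambda v: ','.join(df.columns[(df == v).any(axis=0)]))
print(df1)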
This is my dataset:
import pandas as pd
import numpy as np

data = {'brand_no': ['BH 1', 'BH 2', 'BH 5', 'BH 7', 'BH 6'],
        '1240000601_min': [5.87, 5.87, 5.87, 5.87, np.nan],
        '1240000601_max': [8.87, 7.47, 10.1, 1.9, 10.8],
        '1240000603_min': [5.87, np.nan, 6.5, 2.0, 7.8],
        '1240000603_max': [8.57, 7.47, 10.2, 1.0, 10.2],
        '1240000604_min': [5.87, 5.67, 6.9, 1.0, 7.8],
        '1240000604_max': [8.87, 8.87, 8.87, np.nan, 8.87],
        '1240000605_min': [15.87, 15.67, 16.9, 1.0, 17.8],
        '1240000605_max': [18.11, 17.47, 20.1, 1.9, 22.6],
        '1240000606_min': [8.12, 8.12, np.nan, 8.12, np.nan],
        '1240000606_max;': [np.nan, 7.47, 10.1, 1.9, np.nan]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
df
As you can see, some columns hold the same value in every row apart from NaN (because the data is sparse, it has NaN as well). I want to drop the columns whose value is the same across all rows: in this case, 1240000601_min, 1240000604_max, and 1240000606_min.
Desired output: all the columns with the same value across all rows are dropped. Please help me get this.
You can use something like this:
columns = [column for column in df.columns if df[column].nunique() == 1]
df = df.drop(columns=columns)
df.nunique() ignores NaNs by default, so you don't have to worry about those.
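A quick illustration of that NaN behaviour on a made-up series (dropna is a standard parameter of nunique):
import pandas as pd
import numpy as np

s = pd.Series([5.87, 5.87, np.nan])
print(s.nunique())               # 1 -- NaN is ignored by default
print(s.nunique(dropna=False))   # 2 -- NaN counted as a distinct value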
Try this:
df_cleared = df.loc[:, df.nunique() > 1]
You can use df.nunique() to count the unique items in each column and keep those greater than 1 with .gt(1). This forms a boolean mask over the columns. Then use .loc, putting the mask in its second (column) position to filter the columns:
df_cleaned = df.loc[:, df.nunique().gt(1)]
Result:
print(df_cleaned)
brand_no 1240000601_max 1240000603_min 1240000603_max 1240000604_min 1240000605_min 1240000605_max 1240000606_max;
0 BH 1 8.87 5.87 8.57 5.87 15.87 18.11 NaN
1 BH 2 7.47 NaN 7.47 5.67 15.67 17.47 7.47
2 BH 5 10.10 6.50 10.20 6.90 16.90 20.10 10.10
3 BH 7 1.90 2.00 1.00 1.00 1.00 1.90 1.90
4 BH 6 10.80 7.80 10.20 7.80 17.80 22.60 NaN
import pandas as pd

df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID', 'VegType', 'Elevation', 'C', 'N', 'C/N', 'Soil13C', 'Soil15N', 'pH', 'Na_[mg/g]', 'K_[mg/g]', 'Mg_[mg/g]', 'Ca_[mg/g]', 'P_[µg/g]']]
new_list = df[['Elevation', 'C', 'N', 'C/N', 'Soil13C', 'Soil15N', 'pH', 'Na_[mg/g]', 'K_[mg/g]', 'Mg_[mg/g]', 'Ca_[mg/g]', 'P_[µg/g]']]
nnew_list = [float(i) for i in new_list]

print(type(df['ID']))
# --> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
# --> True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the column headings, but I only wanted to convert the values.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because iterating over a dataframe, like iterating over a dictionary and its keys, yields only the column headings.
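A tiny demonstration of that iteration behaviour on a made-up two-column frame:
import pandas as pd

df = pd.DataFrame({'Elevation': [2425, 2489], 'C': [2.78, 2.18]})
print([col for col in df])  # ['Elevation', 'C'] -- iteration yields the labels
# float() on one of those labels raises exactly your error:
# ValueError: could not convert string to float: 'Elevation'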
Update 1
Try also using the options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, delim_whitespace=True)
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40
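Note that the Na_[mg/g] column above is still text because of the 'ND' entries (hence the unconverted '0,0005'). A sketch of one more step, reusing the na_values='ND' option from the question so that column parses numerically as well:
mystr.seek(0)  # rewind the buffer before re-reading it
df = pd.read_csv(mystr, decimal=',', delimiter=';', na_values='ND')
print(df['Na_[mg/g]'].dtype)  # float64 -- 'ND' becomes NaN and the commas parse as decimals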