I want to show this data in just two columns. For example, I want to turn this data:
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about how to combine two columns into another dataframe using .apply() and .agg(), but I have found nothing on how to combine them as shown above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried it this way, but of course it does not work. I have also tried using pd.Series(), with no success.
Is there any site where I can learn how to do this, or does anybody know the correct way to solve it?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want to sort your date column, here is a good resource. Follow those instructions after melting and before creating the Date column.
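One way to do that sorting step, sketched here under the assumption that the month columns are the six abbreviations from the question, is to turn Month into an ordered categorical before sorting. The names month_order and result are made up for this illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Year": [1997, 1998, 1999, 2000],
    "Jan": [3.45, 2.09, 1.85, 2.42],
    "Feb": [2.15, 2.23, 1.77, 2.66],
    "Mar": [1.89, 2.24, 1.79, 2.79],
    "Apr": [2.03, 2.43, 2.15, 3.04],
    "May": [2.25, 2.14, 2.26, 3.59],
    "Jun": [2.20, 2.17, 2.30, 4.29],
})

# Assumed calendar order of the month columns (hypothetical helper list).
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]

long_df = pd.melt(df, id_vars=["Year"], var_name="Month", value_name="Price")

# An ordered categorical sorts by calendar position instead of alphabetically.
long_df["Month"] = pd.Categorical(long_df["Month"], categories=month_order, ordered=True)
long_df = long_df.sort_values(["Year", "Month"]).reset_index(drop=True)

# Build the Date label only after sorting.
long_df["Date"] = long_df["Month"].astype(str) + "-" + long_df["Year"].astype(str)
result = long_df[["Date", "Price"]]
print(result)
```

Sorting on the plain strings would put Apr before Feb, which is why the categorical is created before sort_values is called.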
You can use pandas.DataFrame.melt :
out = (
    df
    .melt(id_vars="Year", var_name="Month", value_name="Price")
    .assign(month_num=lambda x: pd.to_datetime(x["Month"], format="%b").dt.month)
    .sort_values(by=["Year", "month_num"])
    .assign(Date=lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
    .loc[:, ["Date", "Price"]]
)
# Output :
print(out)
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]
I have the following 2 data frames:
df1 = pd.DataFrame({
'dates': ['02-Jan','03-Jan','30-Jan'],
'currency': ['aud','gbp','eur'],
'amount': [100,330,500]
})
df2 = pd.DataFrame({
'dates': ['01-Jan','02-Jan','03-Jan','30-Jan'],
'aud': [0.72,0.73,0.74,0.71],
'gbp': [1.29,1.30,1.4,1.26],
'eur': [1.15,1.16,1.17,1.18]
})
I want to obtain the intersection of df1.dates and df1.currency in df2. For example, looking up the prevailing 'aud' exchange rate on '02-Jan'.
This can be solved with the INDEX + MATCH functionality of Excel. What would be the best way to replicate it in pandas?
Desired Output: add a new column 'price'
dates currency amount price
02-Jan aud 100 0.73
03-Jan gbp 330 1.4
30-Jan eur 500 1.18
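The closest equivalent of INDEX MATCH was DataFrame.lookup. Note that it was deprecated in pandas 1.2.0 and removed in pandas 2.0, so the two lines below only run on older versions: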
df2 = df2.set_index('dates')
df1['price'] = df2.lookup(df1['dates'], df1['currency'])
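On pandas versions that no longer have DataFrame.lookup, a similar cell-by-cell lookup can be sketched with positional NumPy indexing. This is one possible replacement rather than the original answer's method; the names rates, row_pos and col_pos are made up for illustration, and the guard reflects the assumption that every date and currency in df1 exists in df2.

```python
import pandas as pd

df1 = pd.DataFrame({
    'dates': ['02-Jan', '03-Jan', '30-Jan'],
    'currency': ['aud', 'gbp', 'eur'],
    'amount': [100, 330, 500],
})
df2 = pd.DataFrame({
    'dates': ['01-Jan', '02-Jan', '03-Jan', '30-Jan'],
    'aud': [0.72, 0.73, 0.74, 0.71],
    'gbp': [1.29, 1.30, 1.4, 1.26],
    'eur': [1.15, 1.16, 1.17, 1.18],
})

rates = df2.set_index('dates')

# Positions of each df1 row's date and currency in the rates table
# (get_indexer returns -1 for labels that are not found).
row_pos = rates.index.get_indexer(df1['dates'])
col_pos = rates.columns.get_indexer(df1['currency'])
if (row_pos < 0).any() or (col_pos < 0).any():
    raise KeyError('some dates or currencies in df1 are missing from df2')

# Pick one cell per df1 row, like INDEX(MATCH, MATCH) in Excel.
df1['price'] = rates.to_numpy()[row_pos, col_pos]
print(df1)
```

The explicit check matters because indexing with -1 would silently return the last row or column instead of failing.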
Reshaping your df2 makes it a lot easier to do a straightforward merge:
In [42]: df2.set_index("dates").unstack().to_frame("value")
Out[42]:
value
dates
aud 01-Jan 0.72
02-Jan 0.73
03-Jan 0.74
30-Jan 0.71
gbp 01-Jan 1.29
02-Jan 1.30
03-Jan 1.40
30-Jan 1.26
eur 01-Jan 1.15
02-Jan 1.16
03-Jan 1.17
30-Jan 1.18
In this form, you just need to match the df1 fields with df2's new index as such:
In [43]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True)
Out[43]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
You can also use a left merge if you don't want to lose rows with missing data (I had to change your df1 a little for this):
In [44]: df1.merge(df2.set_index("dates").unstack().to_frame("value"), left_on=["currency", "dates"], right_index=True, how="left")
Out[44]:
dates currency amount value
0 02-Jan aud 100 0.73
1 03-Jan gbp 330 1.40
2 04-Jan eur 500 NaN
df = pd.read_csv('TestdataNALA.csv', sep=';', na_values='ND')
df.fillna(0, inplace=True)
df = df[['ID','VegType','Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
new_list = df[['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']]
nnew_list = [float(i) for i in new_list]
print(type(df['ID']))
#--> <class 'pandas.core.series.Series'>
print('ID' in df.columns)
#-->True
Some data from the CSV:
ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
When I run this, I get the error:
ValueError: could not convert string to float: 'Elevation'
It seems as if it's trying to convert the column headers, but I only wanted to convert the values in the list.
To convert columns to numeric, you should use pd.to_numeric:
cols = ['Elevation','C','N','C/N','Soil13C','Soil15N','pH','Na_[mg/g]','K_[mg/g]','Mg_[mg/g]','Ca_[mg/g]','P_[µg/g]']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
Your code will not work because, like a dictionary and its keys, when you loop through a dataframe you only find the column headings.
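A small sketch of that point, using a made-up two-column frame rather than the asker's file: iterating over a DataFrame yields its column labels, so float() is applied to the header strings and never reaches the cell values.

```python
import pandas as pd

# Hypothetical miniature frame, for illustration only.
df = pd.DataFrame({'Elevation': ['2425', '2489'], 'C': ['2.78', '2.18']})

# Iterating over a DataFrame yields column labels, like dict keys.
labels = [col for col in df]
print(labels)

try:
    [float(col) for col in df]
except ValueError as exc:
    print(exc)  # the message names the label, not any cell value

# Converting column by column operates on the data instead.
converted = df.apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
```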
Update 1
Try also using the options below:
import pandas as pd
from io import StringIO
mystr = StringIO(""" <bound method NDFrame.head of ID VegType Elevation C N C/N Soil13C Soil15N pH \
1 C2489WCG Cyt_Gen 2489 2,18 0,17 12,50 -24,51 5,53 4,21
2 C2591WCG Cyt_Gen 2591 5,13 0,36 14,29 -25,52 3,41 4,30
3 E2695WC Cyt 2695 1,43 0,14 10,55 -25,71 5,75 4,95
""")
# sep=r'\s+' splits on whitespace; delim_whitespace=True is deprecated in recent pandas
df = pd.read_csv(mystr, skiprows=1, decimal=',', header=None, sep=r'\s+')
# 0 1 2 3 4 5 6 7 8 9
# 0 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 1 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 2 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
Update 2
import pandas as pd
from io import StringIO
mystr = StringIO("""ID;VegType;Elevation;C;N;C/N;Soil13C;Soil15N;pH;Na_[mg/g];K_[mg/g];Mg_[mg/g];Ca_[mg/g];P_[µg/g]
A2425WC;Cyt;2425;2,78;0,23;12,14;-24,43;4,85;4,88;0,0005;0,0772;0,0646;0,3588;13,8400
C2489WCG;Cyt_Gen;2489;2,18;0,17;12,50;-24,51;5,53;4,21;0,0010;0,0639;0,0286;0,0601;0,6800
C2591WCG;Cyt_Gen;2591;5,13;0,36;14,29;-25,52;3,41;4,30;0,0046;0,0854;0,1169;0,7753;5,7000
E2695WC;Cyt;2695;1,43;0,14;10,55;-25,71;5,75;4,95;ND;0,0766;0,0441;0,0978;8,8500
A2900WC;Cyt;2900;6,95;0,53;13,11;-25,54;3,38;5,35;0,0032;0,1119;0,2356;1,8050;14,7100
A2800WC;Cyt;2800;1,88;0,16;11,62;-23,68;5,88;5,28;0,0025;0,0983;0,0770;0,3777;5,4200
A3050WC;Cyt;3050;1,50;0,12;12,50;-24,62;2,23;5,97;ND;0,0696;0,0729;0,5736;9,4000
""")
df = pd.read_csv(mystr, decimal=',', delimiter=';')
# ID VegType Elevation C N C/N Soil13C Soil15N pH \
# 0 A2425WC Cyt 2425 2.78 0.23 12.14 -24.43 4.85 4.88
# 1 C2489WCG Cyt_Gen 2489 2.18 0.17 12.50 -24.51 5.53 4.21
# 2 C2591WCG Cyt_Gen 2591 5.13 0.36 14.29 -25.52 3.41 4.30
# 3 E2695WC Cyt 2695 1.43 0.14 10.55 -25.71 5.75 4.95
# 4 A2900WC Cyt 2900 6.95 0.53 13.11 -25.54 3.38 5.35
# 5 A2800WC Cyt 2800 1.88 0.16 11.62 -23.68 5.88 5.28
# 6 A3050WC Cyt 3050 1.50 0.12 12.50 -24.62 2.23 5.97
# Na_[mg/g] K_[mg/g] Mg_[mg/g] Ca_[mg/g] P_[µg/g]
# 0 0,0005 0.0772 0.0646 0.3588 13.84
# 1 0,0010 0.0639 0.0286 0.0601 0.68
# 2 0,0046 0.0854 0.1169 0.7753 5.70
# 3 ND 0.0766 0.0441 0.0978 8.85
# 4 0,0032 0.1119 0.2356 1.8050 14.71
# 5 0,0025 0.0983 0.0770 0.3777 5.42
# 6 ND 0.0696 0.0729 0.5736 9.40
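In the output above, Na_[mg/g] still shows comma strings, most likely because the ND entries keep that column as text. A possible variation, assuming ND always marks a missing value (as the na_values='ND' in the question suggests), is to pass na_values together with decimal. The sample below is shortened to three rows and five columns of the posted data.

```python
import pandas as pd
from io import StringIO

# Shortened sample of the posted CSV (three rows, five columns).
mystr = StringIO("""ID;VegType;Elevation;C;Na_[mg/g]
A2425WC;Cyt;2425;2,78;0,0005
E2695WC;Cyt;2695;1,43;ND
A2900WC;Cyt;2900;6,95;0,0032
""")

# decimal=',' handles the decimal commas; na_values='ND' turns ND into NaN
# so the column can be parsed as float instead of staying as strings.
df = pd.read_csv(mystr, sep=';', decimal=',', na_values='ND')
print(df.dtypes)
print(df)
```

If zeros are preferred over NaN, df.fillna(0) can follow, as in the question's own code.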
Hi, I have a data frame like this:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 -
2 GWRE 416.06 9.80 5.33 45.62 -
3 PEGA 129.02 4.41 9.85 285.10 0.28%
4 BLKB 87.68 4.96 14.36 41.81 0.62%
Firstly, I want to convert the values in the columns that contain numbers (currently strings) to floats. Here that means the four middle columns. Would a simple loop work in this case?
Secondly, there is a problem with the last column, 'Dividend', which holds percentage values as strings. I can convert them to decimals, but I was wondering whether there is a way to retain the % and still have the values be calculable.
Any ideas for these two issues?
Plan:
Take out 'Ticker' because it isn't numeric
Use assign to overwrite Dividend, stripping off the %
Use apply with pd.to_numeric to convert all the columns
Use eval to scale Dividend to a decimal fraction
df[['Ticker']].join(
    df.assign(
        Dividend=df.Dividend.str.strip('%')
    ).drop(columns='Ticker').apply(
        pd.to_numeric, errors='coerce'
    )
).eval('Dividend = Dividend / 100', inplace=False)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
More lines, but more readable:
nums = df.drop(columns='Ticker').assign(Dividend=df.Dividend.str.strip('%'))
nums = nums.apply(pd.to_numeric, errors='coerce')
nums = nums.assign(Dividend=nums.Dividend / 100)
df[['Ticker']].join(nums)
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
Assuming that all P/... columns contain proper numbers:
In [47]: df.assign(Dividend=pd.to_numeric(df.Dividend.str.replace('%', '', regex=False), errors='coerce')
...: .div(100)) \
...: .set_index('Ticker', append=True) \
...: .astype('float') \
...: .reset_index('Ticker')
...:
Out[47]:
Ticker P/E P/S P/B P/FCF Dividend
No.
1 NTCT 457.32 3.03 1.44 26.04 NaN
2 GWRE 416.06 9.80 5.33 45.62 NaN
3 PEGA 129.02 4.41 9.85 285.10 0.0028
4 BLKB 87.68 4.96 14.36 41.81 0.0062
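On the second question (keeping the % sign while the values stay calculable), one option is to store Dividend as a plain float and add the % only for display. The sketch below uses just the Ticker and Dividend columns of the sample rows, and Dividend_display is a hypothetical column name chosen for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['NTCT', 'GWRE', 'PEGA', 'BLKB'],
    'Dividend': ['-', '-', '0.28%', '0.62%'],
})

# Numeric column: strip the % sign, coerce '-' to NaN, scale to a fraction.
df['Dividend'] = pd.to_numeric(df['Dividend'].str.rstrip('%'), errors='coerce') / 100

# Display-only column: format the fraction back as a percentage string.
df['Dividend_display'] = df['Dividend'].map(
    lambda v: '-' if pd.isna(v) else f'{v:.2%}'
)
print(df)
```

Calculations then use Dividend, while Dividend_display is only for printing or exporting.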