Pandas dataframe slicing - python

I have the following dataframe:
2012 2013 2014 2015 2016 2017 2018 Kategorie
0 5.31 5.27 5.61 4.34 4.54 5.02 7.07 Gewinn pro Aktie in EUR
1 13.39 14.70 12.45 16.29 15.67 14.17 10.08 KGV
2 -21.21 -0.75 6.45 -22.63 -7.75 9.76 47.52 Gewinnwachstum
3 -17.78 2.27 -0.55 3.39 1.48 0.34 NaN PEG
Now, I am selecting only the KGV row with:
df[df["Kategorie"] == "KGV"]
Which outputs:
2012 2013 2014 2015 2016 2017 2018 Kategorie
1 13.39 14.7 12.45 16.29 15.67 14.17 10.08 KGV
How do I calculate the mean() of the last five years (2016,15,14,13,12 in this example)?
I tried
df[df["Kategorie"] == "KGV"]["2016":"2012"].mean()
but this throws a TypeError. Why can I not slice the columns here?

loc supports that type of slicing (from left to right):
df.loc[df["Kategorie"] == "KGV", "2012":"2016"].mean(axis=1)
Out:
1 14.5
dtype: float64
Note that this does not necessarily mean 2012, 2013, 2014, 2015 and 2016. These are strings so it means all columns between df['2012'] and df['2016']. There could be a column named foo in between and it would be selected.
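A runnable sketch reconstructing the question's frame (string column names, as in the post) shows the slice in action:

```python
import pandas as pd

# Rebuild two rows of the question's frame; note the year columns are strings
df = pd.DataFrame({
    '2012': [5.31, 13.39], '2013': [5.27, 14.70], '2014': [5.61, 12.45],
    '2015': [4.34, 16.29], '2016': [4.54, 15.67], '2017': [5.02, 14.17],
    '2018': [7.07, 10.08],
    'Kategorie': ['Gewinn pro Aktie in EUR', 'KGV'],
})

# Label-based slicing selects every column between '2012' and '2016'
mean_kgv = df.loc[df['Kategorie'] == 'KGV', '2012':'2016'].mean(axis=1)
```

Here `mean_kgv` holds the single value 14.5 for row 1, matching the answer's output.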

I used filter and iloc
row = df[df.Kategorie == 'KGV']
row.filter(regex=r'\d{4}').sort_index(axis=1).iloc[:, -5:].mean(axis=1)
1 13.732
dtype: float64

Not sure why the last five years are 2012-2016 (they seem to be the first five years). Notwithstanding, to find the mean for 2012-2016 for 'KGV', you can use
df[df['Kategorie'] == 'KGV'][[c for c in df.columns if c != 'Kategorie' and 2012 <= int(c) <= 2016]].mean(axis=1)


Python: How do I use the if function when calling out a specific row?

This is my data frame (labeled unp):
LOCATION TIME Unemployment_Rate Unit_Labour_Cost GDP_CAP PTEmployment HR_WKD Collective IndividualCollective Individual Temp GDPCAP_ULC GDP_Growth
0 AUT 2013 5.336031 2.632506 47936.67796 19.863556 1632.1 2.14 1.80 1.66 1.47 18209.522774 NaN
1 AUT 2014 5.621219 1.996807 48813.53441 20.939237 1621.6 2.14 1.80 1.66 1.47 24445.794917 876.85645
2 AUT 2015 5.723468 1.515733 49925.22780 21.026548 1598.9 2.14 1.80 1.66 1.47 32938.009399 1111.69339
3 AUT 2016 6.014071 1.610391 50923.69330 20.889132 1609.4 2.14 1.80 1.66 1.47 31621.943553 998.46550
4 BEL 2013 8.425185 1.988013 43745.95156 18.212509 1558.0 2.48 2.22 2.11 1.91 22004.861920 -7177.74174
... ... ... ... ... ... ... ... ... ... ... ... ... ...
101 SWE 2016 6.991096 1.899792 48690.14644 13.800736 1626.0 2.72 2.54 2.48 1.55 25629.198586 779.74573
102 USA 2013 7.375000 1.099109 53016.28880 12.255613 1782.0 1.33 1.31 1.30 0.27 48235.697096 4326.14236
103 USA 2014 6.166667 2.027852 54935.20048 10.611552 1784.0 1.33 1.31 1.30 0.27 27090.340163 1918.91168
104 USA 2015 5.291667 1.912012 56700.88042 9.879047 1785.0 1.33 1.31 1.30 0.27 29655.086066 1765.67994
105 USA 2016 4.866667 1.045644 57797.46221 9.454144 1781.0 1.33 1.31 1.30 0.27 55274.512367 1096.58179
I want to fill the column GDP_Growth, which is currently blank, with the value of:
unp.GDP_CAP - unp.GDP_CAP.shift(1)
provided that 'TIME' is 2014 or later; otherwise it should be NaN.
I tried using an if statement directly, but it's not working:
if unp.loc[unp['TIME'] > 2014]:
    unp['GDP_Growth'] = unp.GDP_CAP - unp.GDP_CAP.shift(1)
else:
    return
You should avoid if statements when working with dataframes, as row-by-row Python logic is slower (less efficient).
Instead, depending on what you need, you can use np.where().
Because the dataframe in the question is a picture (as opposed to text), I give you the standard implementation, which looks like this:
import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [5, 6, 7, 8, 9]})

# Use np.where() to select values from column 'A' where column 'B' is greater than 7
result = np.where(df['B'] > 7, df['A'], 0)

# Print the result
print(result)
The result of the above is this (np.where returns a NumPy array):
[0 0 0 4 5]
You will need to modify the above for your particular dataframe.
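For instance, here is a minimal sketch adapted to the unp frame from the question (column names and the AUT values are taken from the post; treating the condition as "'TIME' is 2014 or later" is my reading of it):

```python
import pandas as pd
import numpy as np

# A small stand-in for unp, with the AUT rows from the question
unp = pd.DataFrame({
    'LOCATION': ['AUT', 'AUT', 'AUT', 'AUT'],
    'TIME': [2013, 2014, 2015, 2016],
    'GDP_CAP': [47936.67796, 48813.53441, 49925.22780, 50923.69330],
})

# The asker's formula, kept only where TIME is 2014 or later
growth = unp.GDP_CAP - unp.GDP_CAP.shift(1)
unp['GDP_Growth'] = np.where(unp['TIME'] >= 2014, growth, np.nan)
```

This reproduces the 876.85645 and 1111.69339 values shown for AUT 2014 and 2015, with NaN for 2013.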
The question in the title is currently Python: How do I use the if function when calling out a specific row?, which my answer does not address directly. Instead, we will compute the derivative / 'growth' and selectively apply it.
Explanation: In Python, you generally want a functional, vectorized style that keeps most computations outside of the Python interpreter and inside C-implemented functions.
Solution:
A. Obtain the derivative/'growth'
For your dataframe df = pd.DataFrame(...) you can obtain the change in value for a specific column with df['column_name'].diff(), e.g.
# This is your dataframe
In : df
Out:
gdp growth year
0 0 <NA> 2000
1 1 <NA> 2001
2 2 <NA> 2002
3 3 <NA> 2003
4 4 <NA> 2004
In : df['gdp'].diff()
Out:
0 NaN
1 1.0
2 1.0
3 1.0
4 1.0
Name: gdp, dtype: float64
B. Apply it to the 'growth' column
In :df['growth'] = df['gdp'].diff()
df
Out:
gdp growth year
0 0 NaN 2000
1 1 1.0 2001
2 2 1.0 2002
3 3 1.0 2003
4 4 1.0 2004
C. Selectively exclude values
If you then want specific years to have a certain value, apply them selectively
In : df.loc[df['year'] < 2003, 'growth'] = np.nan
df
Out:
gdp growth year
0 0 NaN 2000
1 1 NaN 2001
2 2 NaN 2002
3 3 1.0 2003
4 4 1.0 2004

Combine a row with column in dataFrame and show the corresponding values

So I want to show this data in just two columns. For example, I want to turn this data
Year Jan Feb Mar Apr May Jun
1997 3.45 2.15 1.89 2.03 2.25 2.20
1998 2.09 2.23 2.24 2.43 2.14 2.17
1999 1.85 1.77 1.79 2.15 2.26 2.30
2000 2.42 2.66 2.79 3.04 3.59 4.29
into this
Date Price
Jan-1997 3.45
Feb-1997 2.15
Mar-1997 1.89
Apr-1997 2.03
....
Jan-2000 2.42
Feb-2000 2.66
So far, I have read about how to combine two columns into another dataframe using .apply() or .agg(), but I found no info on how to combine them as I showed above.
import pandas as pd
df = pd.read_csv('matrix-A.csv', index_col =0 )
matrix_b = ({})
new = pd.DataFrame(matrix_b)
new["Date"] = df['Year'].astype(float) + "-" + df["Dec"]
print(new)
I have tried this way, but of course it does not work. I have also tried using pd.Series(), but with no success.
Is there any site where I can learn how to do this, or does anybody know the correct way to solve it?
Another possible solution, which is based on pandas.DataFrame.stack:
out = df.set_index('Year').stack()
out.index = ['{}_{}'.format(j, i) for i, j in out.index]
out = out.reset_index()
out.columns = ['Date', 'Value']
Output:
Date Value
0 Jan_1997 3.45
1 Feb_1997 2.15
2 Mar_1997 1.89
3 Apr_1997 2.03
4 May_1997 2.25
....
19 Feb_2000 2.66
20 Mar_2000 2.79
21 Apr_2000 3.04
22 May_2000 3.59
23 Jun_2000 4.29
You can first convert it to long-form using melt. Then, create a new column for Date by combining two columns.
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name="Price")
long_df['Date'] = long_df['Month'] + "-" + long_df['Year'].astype('str')
long_df[['Date', 'Price']]
If you want your Date column sorted chronologically, do the sorting after melting and before creating the Date column (for example by converting the month names to numbers and sorting on Year and month).
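Put together with the question's data (the wide frame is reconstructed here for illustration), the melt approach runs as:

```python
import pandas as pd

# Reconstruct the question's wide frame
df = pd.DataFrame({
    'Year': [1997, 1998, 1999, 2000],
    'Jan': [3.45, 2.09, 1.85, 2.42], 'Feb': [2.15, 2.23, 1.77, 2.66],
    'Mar': [1.89, 2.24, 1.79, 2.79], 'Apr': [2.03, 2.43, 2.15, 3.04],
    'May': [2.25, 2.14, 2.26, 3.59], 'Jun': [2.20, 2.17, 2.30, 4.29],
})

# Wide -> long, then build the Date string from the two id columns
long_df = pd.melt(df, id_vars=['Year'], var_name='Month', value_name='Price')
long_df['Date'] = long_df['Month'] + '-' + long_df['Year'].astype('str')
out = long_df[['Date', 'Price']]
```

The result has 24 rows, starting with `Jan-1997 3.45`.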
You can use pandas.DataFrame.melt :
out = (
    df
    .melt(id_vars="Year", var_name="Month", value_name="Price")
    .assign(month_num=lambda x: pd.to_datetime(x["Month"], format="%b").dt.month)
    .sort_values(by=["Year", "month_num"])
    .assign(Date=lambda x: x.pop("Month") + "-" + x.pop("Year").astype(str))
    .loc[:, ["Date", "Price"]]
)
# Output :
print(out)
Date Price
0 Jan-1997 3.45
4 Feb-1997 2.15
8 Mar-1997 1.89
12 Apr-1997 2.03
16 May-1997 2.25
.. ... ...
7 Feb-2000 2.66
11 Mar-2000 2.79
15 Apr-2000 3.04
19 May-2000 3.59
23 Jun-2000 4.29
[24 rows x 2 columns]

Slicing pandas dataframe by ordered values into clusters

I have a pandas dataframe in which there are longer gaps in time, and I want to slice it into smaller dataframes where the time "clusters" are kept together:
Time Value
0 56610.41341 8.55
1 56587.56394 5.27
2 56590.62965 6.81
3 56598.63790 5.47
4 56606.52203 6.71
5 56980.44206 4.75
6 56592.53327 6.53
7 57335.52837 0.74
8 56942.59094 6.96
9 56921.63669 9.16
10 56599.52053 6.14
11 56605.50235 5.20
12 57343.63828 3.12
13 57337.51641 3.17
14 56593.60374 5.69
15 56882.61571 9.50
I tried sorting this and taking time difference of two consecutive points with
df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)
And it gives
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
0 56610.41341 8.55 -272.20230
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
5 56980.44206 4.75 -355.08631
7 57335.52837 0.74 -1.98804
13 57337.51641 3.17 -6.12187
12 57343.63828 3.12 NaN
Let's say I want to slice this dataframe into smaller dataframes in which the time difference between two consecutive points is smaller than 40. How would I go about doing this?
I could loop over the rows, but this is frowned upon, so is there a smarter solution?
Edit: Here is an example:
df1:
Time Value t_dif
1 56587.56394 5.27 -3.06571
2 56590.62965 6.81 -1.90362
6 56592.53327 6.53 -1.07047
14 56593.60374 5.69 -5.03416
3 56598.63790 5.47 -0.88263
10 56599.52053 6.14 -5.98182
11 56605.50235 5.20 -1.01968
4 56606.52203 6.71 -3.89138
df2:
0 56610.41341 8.55 -272.20230
df3:
15 56882.61571 9.50 -39.02098
9 56921.63669 9.16 -20.95425
8 56942.59094 6.96 -37.85112
...
etc.
I think you can just
df1 = df[df['t_dif']<30]
df2 = df[df['t_dif']>=30]
def split_dataframe(df, value):
    df = df.sort_values("Time")
    df = df.reset_index()
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    indxs = df.index[df['t_dif'] > value].tolist()
    indxs.append(-1)
    indxs.append(len(df))
    indxs.sort()
    frames = []
    for i in range(1, len(indxs)):
        # each cluster runs from just after the previous break up to this break
        val = df.iloc[indxs[i - 1] + 1: indxs[i] + 1]
        frames.append(val)
    return frames
Returns the correct dataframes as a list
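An equivalent vectorized idiom (a sketch, not taken from either answer) labels the clusters with diff and cumsum and splits with groupby:

```python
import pandas as pd

# Toy times with two gaps larger than the threshold of 40
df = pd.DataFrame({'Time': [56610.4, 56587.6, 56980.4, 56942.6, 57335.5],
                   'Value': [8.55, 5.27, 4.75, 6.96, 0.74]})

df = df.sort_values('Time').reset_index(drop=True)

# A new cluster starts wherever the gap to the previous row exceeds 40
cluster_id = df['Time'].diff().gt(40).cumsum()
frames = [g for _, g in df.groupby(cluster_id)]
```

For the toy data this yields three frames of sizes 2, 2 and 1, without any explicit Python loop over rows.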

Python pandas: how to vectorize this function

I have two DataFrames df and evol as follows (simplified for the example):
In[6]: df
Out[6]:
data year_final year_init
0 12 2023 2012
1 34 2034 2015
2 9 2019 2013
...
In[7]: evol
Out[7]:
evolution
year
2000 1.474946
2001 1.473874
2002 1.079157
...
2037 1.463840
2038 1.980807
2039 1.726468
I would like to operate the following operation in a vectorized way (current for loop implementation is just too long when I have Gb of data):
for index, row in df.iterrows():
    for year in range(row['year_init'], row['year_final']):
        factor = evol.at[year, 'evolution']
        df.at[index, 'data'] += df.at[index, 'data'] * factor
Complexity comes from the fact that the range of years is not the same on each row...
In the above example the output would be:
data year_final year_init
0 163673 2023 2012
1 594596046 2034 2015
2 1277 2019 2013
(full evol dataframe for testing purpose:)
evolution
year
2000 1.474946
2001 1.473874
2002 1.079157
2003 1.876762
2004 1.541348
2005 1.581923
2006 1.869508
2007 1.289033
2008 1.924791
2009 1.527834
2010 1.762448
2011 1.554491
2012 1.927348
2013 1.058588
2014 1.729124
2015 1.025824
2016 1.117728
2017 1.261009
2018 1.705705
2019 1.178354
2020 1.158688
2021 1.904780
2022 1.332230
2023 1.807508
2024 1.779713
2025 1.558423
2026 1.234135
2027 1.574954
2028 1.170016
2029 1.767164
2030 1.995633
2031 1.222417
2032 1.165851
2033 1.136498
2034 1.745103
2035 1.018893
2036 1.813705
2037 1.463840
2038 1.980807
2039 1.726468
One vectorized approach using only pandas is to do a cartesian join between the two frames and then subset. It would start out like:
df['dummy'] = 1
evol['dummy'] = 1
combined = df.merge(evol, on='dummy')
# filter date ranges, multiply etc
This will likely be faster than what you are doing, but is memory inefficient and might blow up on your real data.
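Completing that sketch with toy inputs (the evolution values below are an assumed constant 0.1, not the question's series; note the compounding loop `data += data * factor` is equivalent to multiplying once by the product of 1 + factor over the year range):

```python
import pandas as pd

# Toy inputs shaped like the question's frames
df = pd.DataFrame({'data': [12.0, 34.0, 9.0],
                   'year_final': [2023, 2034, 2019],
                   'year_init': [2012, 2015, 2013]})
evol = pd.DataFrame({'evolution': [0.1] * 40},
                    index=pd.RangeIndex(2000, 2040, name='year'))

# Cartesian join via a dummy key, as sketched above
df = df.reset_index().rename(columns={'index': 'row'})
df['dummy'] = 1
ev = evol.reset_index()
ev['dummy'] = 1
combined = df.merge(ev, on='dummy')

# Keep only each row's [year_init, year_final) range
in_range = combined[(combined['year'] >= combined['year_init']) &
                    (combined['year'] < combined['year_final'])]

# Compound all factors at once: prod(1 + factor) per original row
growth = (1 + in_range['evolution']).groupby(in_range['row']).prod()
df['data'] = df['data'] * df['row'].map(growth)
```

With a constant factor of 0.1, row 0 spans 11 years and ends up at 12 * 1.1**11, which is easy to verify by hand.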
If you can take on the numba dependency, something like this should be very fast - essentially a compiled version of what you are doing now. Something similar would be possible in cython as well. Note that this requires that the evol dataframe is sorted and contiguous by year; that could be relaxed with modification.
import numba
import numpy as np

@numba.njit
def f(data, year_final, year_init, evol_year, evol_factor):
    data = data.copy()
    for i in range(len(data)):
        year_pos = np.searchsorted(evol_year, year_init[i])
        n_years = year_final[i] - year_init[i]
        for offset in range(n_years):
            data[i] += data[i] * evol_factor[year_pos + offset]
    return data
f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
Out[24]: array([ 163673, 594596044, 1277], dtype=int64)
Edit:
Some timings with your test data
In [25]: %timeit f(df['data'].values, df['year_final'].values, df['year_init'].values, evol.index.values, evol['evolution'].values)
15.6 µs ± 338 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
In [26]: %%time
    ...: for index, row in df.iterrows():
    ...:     for year in range(row['year_init'], row['year_final']):
    ...:         factor = evol.at[year, 'evolution']
    ...:         df.at[index, 'data'] += df.at[index, 'data'] * factor
Wall time: 3 ms

Appending data row from one dataframe to another with respect to date

I am brand new to pandas and working with two dataframes. My goal is to append the non-date values of df_ls (below) column-wise to their nearest respective date in df_1. Is the only way to do this a traditional for-loop, or is there some more effective built-in method/function? I have googled this extensively without any luck and have only found ways to append blocks of dataframes to other dataframes, not how to search through one dataframe and append a row from another at the nearest respective date. See the example below:
Example of first dataframe (lets call it df_ls):
DATE ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 1999-07-04 0.070771 1.606958 1.292280 0.128069 0.103018
1 1999-07-20 0.030795 2.326290 1.728147 0.099020 0.073595
2 1999-08-21 0.022819 2.492871 1.762536 0.096888 0.068502
3 1999-09-06 0.014613 2.792271 1.894225 0.090590 0.061445
4 1999-10-08 0.004978 2.781847 1.790768 0.089291 0.057521
5 1999-10-24 0.003144 2.818474 1.805257 0.090623 0.058054
6 1999-11-09 0.000859 3.146100 1.993941 0.092787 0.058823
7 1999-12-11 0.000912 2.913604 1.656642 0.097239 0.055357
8 1999-12-27 0.000877 2.974692 1.799949 0.098282 0.059427
9 2000-01-28 0.000758 3.092533 1.782112 0.095153 0.054809
10 2000-03-16 0.002933 2.969185 1.727465 0.083059 0.048322
11 2000-04-01 0.016814 2.366437 1.514110 0.089720 0.057398
12 2000-05-03 0.047370 1.847763 1.401930 0.109767 0.083290
13 2000-05-19 0.089432 1.402798 1.178798 0.137965 0.115936
14 2000-06-04 0.056340 1.807828 1.422489 0.118601 0.093328
Example of second dataframe (let's call it df_1)
Sample Date Value
0 2000-05-09 1.68
1 2000-05-09 1.68
2 2000-05-18 1.75
3 2000-05-18 1.75
4 2000-05-31 1.40
5 2000-05-31 1.40
6 2000-06-13 1.07
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
In the end, my goal is to have something like this (note that the appended values are those closest to the Sample Date, even though the dates don't match up perfectly):
Sample Date Value ALBEDO_SUR B13_RATIO B23_RATIO B1_RAW B2_RAW
0 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
1 2000-05-09 1.68 0.047370 1.847763 1.401930 0.109767 0.083290
2 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
3 2000-05-18 1.75 0.089432 1.402798 1.178798 0.137965 0.115936
4 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
5 2000-05-31 1.40 0.056340 1.807828 1.422489 0.118601 0.093328
6 2000-06-13 1.07 ETC.... ETC.... ETC ...
7 2000-06-13 1.07
8 2000-06-27 1.49
9 2000-06-27 1.49
10 2000-07-11 2.29
11 2000-07-11 2.29
Thanks for any and all help. As I said, I am new to this; I have experience with this sort of thing in MATLAB, but pandas is new to me.
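One built-in tool that matches this "nearest date" alignment is pd.merge_asof with direction='nearest'. A sketch with tiny stand-ins for the two frames (dates and values abbreviated from the question):

```python
import pandas as pd

# Abbreviated stand-in for df_ls (one value column shown)
df_ls = pd.DataFrame({
    'DATE': pd.to_datetime(['2000-05-03', '2000-05-19', '2000-06-04']),
    'ALBEDO_SUR': [0.047370, 0.089432, 0.056340],
})

# Abbreviated stand-in for df_1
df_1 = pd.DataFrame({
    'Sample Date': pd.to_datetime(['2000-05-09', '2000-05-18', '2000-05-31']),
    'Value': [1.68, 1.75, 1.40],
})

# merge_asof requires both frames sorted on their date keys
merged = pd.merge_asof(df_1.sort_values('Sample Date'),
                       df_ls.sort_values('DATE'),
                       left_on='Sample Date', right_on='DATE',
                       direction='nearest')
```

For these rows the nearest matches are 05-03, 05-19 and 06-04 respectively, which reproduces the ALBEDO_SUR values in the desired output above.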
