Turning 2D pd.DataFrame into 3D array - python

My dataset consists of 3535560 rows × 16 columns. It contains three 'index' variables that I would like to use to reshape the dataset: 1830 days, 46 latitude values and 42 longitude values. The transformed dataset should therefore have shape (1830, 46, 42), but I have no idea how to do this. I saw that I could maybe use pd.pivot or pd.MultiIndex, but I have not been able to find a solution.
In short, the data looks like:
      time        lat     lon       var 1     var 2     var 3     var 4     var 5
0     2021-01-01  60.125  -120.125  0.381828  0.917779  0.718022  0.064032  0.886050
1     2021-01-01  60.125  -119.875  0.221697  0.232657  0.298497  0.680900  0.124440
...
41    2021-01-01  60.125  -109.125  0.922149  0.708139  0.778329  0.267685  0.552542
42    2021-01-01  59.875  -120.125  0.569874  0.053829  0.740229  0.747286  0.194214
43    2021-01-01  59.875  -119.875  0.500091  0.185990  0.845510  0.877692  0.556584
...
1931  2021-01-01  48.875  -109.125  0.221697  0.232657  0.298497  0.680900  0.124440
1932  2021-01-02  60.125  -120.125  0.666589  0.849857  0.338648  0.552114  0.730678
1933  2021-01-02  60.125  -119.875  0.351144  0.467692  0.161488  0.530906  0.277561
As you can see in the table, after 42 longitude values the data moves to a new latitude value and loops over all longitude values again. It does this for all latitude values before going to the next day (after 1932 rows, i.e. 46 x 42).
Would someone be able to help me fix this?
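Since the rows are already ordered by day, then latitude, then longitude, one possible approach is to reshape the underlying NumPy values directly. The sketch below is not from the original thread: it uses a small synthetic frame (3 days, 4 latitudes, 5 longitudes, two variables) as a stand-in for the real data, and it assumes every (time, lat, lon) combination occurs exactly once.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumed sizes: 3 days, 4 lats, 5 lons).
times = pd.date_range("2021-01-01", periods=3, freq="D")
lats = np.array([60.125, 59.875, 59.625, 59.375])  # descending, as in the question
lons = np.array([-120.125, -119.875, -119.625, -119.375, -119.125])
index = pd.MultiIndex.from_product([times, lats, lons], names=["time", "lat", "lon"])
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.random((len(index), 2)), columns=["var 1", "var 2"], index=index
).reset_index()

# Make the row order explicit: day ascending, latitude descending, longitude ascending.
ordered = df.sort_values(["time", "lat", "lon"], ascending=[True, False, True])

n_time = ordered["time"].nunique()
n_lat = ordered["lat"].nunique()
n_lon = ordered["lon"].nunique()

# One variable -> array of shape (time, lat, lon).
cube = ordered["var 1"].to_numpy().reshape(n_time, n_lat, n_lon)

# All variables at once -> array of shape (time, lat, lon, variable).
value_cols = ["var 1", "var 2"]
cube_all = ordered[value_cols].to_numpy().reshape(n_time, n_lat, n_lon, len(value_cols))

print(cube.shape, cube_all.shape)
```

With the real data the same reshape would use 1830, 46 and 42, giving one (1830, 46, 42) array per variable. The reshape only lines up if no combination is missing or duplicated, so it is worth checking that n_time * n_lat * n_lon equals the number of rows first.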

Related

How can I subtract two panda data frame columns without getting an index error? [duplicate]

In python, how can I reference previous row and calculate something against it? Specifically, I am working with dataframes in pandas - I have a data frame full of stock price information that looks like this:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
Here is how I created this dataframe:
import pandas
url = 'http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
data = pandas.read_csv(url)
## now I sorted the data frame ascending by date
data = data.sort_values(by='Date')
Starting with row number 2, or in this case, I guess it's 250 (PS - is that the index?), I want to calculate the difference between 2011-01-03 and 2011-01-04, and so on for every entry in this dataframe. I believe the appropriate way is to write a function that takes the current row, figures out the previous row, and calculates the difference between them, then use the pandas apply function to update the dataframe with the value.
Is that the right approach? If so, should I be using the index to determine the difference? (note - I'm still in python beginner mode, so index may not be the right term, nor even the correct way to implement this)
I think you want to do something like this:
In [26]: data
Out[26]:
Date Close Adj Close
251 2011-01-03 147.48 143.25
250 2011-01-04 147.64 143.41
249 2011-01-05 147.05 142.83
248 2011-01-06 148.66 144.40
247 2011-01-07 147.93 143.69
In [27]: data.set_index('Date').diff()
Out[27]:
Close Adj Close
Date
2011-01-03 NaN NaN
2011-01-04 0.16 0.16
2011-01-05 -0.59 -0.58
2011-01-06 1.61 1.57
2011-01-07 -0.73 -0.71
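For anyone who wants to reproduce this without the download, here is a minimal sketch that rebuilds the five rows shown above by hand (the frame is typed in from the question, not fetched from the URL), sorts them by date and takes the row-to-row difference.

```python
import pandas as pd

# The five rows from the question, deliberately entered in unsorted order.
data = pd.DataFrame(
    {
        "Date": ["2011-01-07", "2011-01-06", "2011-01-05", "2011-01-04", "2011-01-03"],
        "Close": [147.93, 148.66, 147.05, 147.64, 147.48],
        "Adj Close": [143.69, 144.40, 142.83, 143.41, 143.25],
    }
)
data["Date"] = pd.to_datetime(data["Date"])

# Sort ascending by date, index by date, then subtract the previous row from each row.
diffs = data.sort_values("Date").set_index("Date").diff()
print(diffs.round(2))
```

The first row comes out as NaN because it has no previous row to subtract.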
To calculate the difference for a single column, here is what you can do.
df=
A B
0 10 56
1 45 48
2 26 48
3 32 65
Suppose we want the row-to-row difference in A only, and then want to keep the rows where that difference is less than 15.
df['A_dif'] = df['A'].diff()
df=
A B A_dif
0 10 56 NaN
1 45 48 35.0
2 26 48 -19.0
3 32 65 6.0
df = df[df['A_dif']<15]
df=
A B A_dif
2 26 48 -19.0
3 32 65 6.0
Row 0 is dropped because its difference is NaN, and any comparison with NaN is False.
I don't know pandas, and I'm pretty sure it has something specific for this; however, I'll give you the pure-Python solution, which might be of some help even if you need to use pandas:
import csv
import urllib.request
# Retrieve the CSV file and load it into a list, converting
# all numeric values to floats
url='http://ichart.finance.yahoo.com/table.csv?s=IBM&a=00&b=1&c=2011&d=11&e=31&f=2011&g=d&ignore=.csv'
lines = urllib.request.urlopen(url).read().decode('utf-8').splitlines()
reader = csv.reader(lines, delimiter=',')
# Sort the output list so the records are ordered by date
cleaned = sorted([r[0]] + [float(v) for v in r[1:]] for r in list(reader)[1:])
# Start at the second row: with i == 0, cleaned[i - 1] would silently wrap
# around to the last row instead of raising an IndexError
for i in range(1, len(cleaned)):
    row, prev = cleaned[i], cleaned[i - 1]
    # Difference of each numeric field with the same field in the row before
    print(row[0], [row[j] - prev[j] for j in range(1, len(row))])

Pandas: calculating mean value of multiple columns using datetime and Grouper removes columns or doesn't return correct Dataframe

As part of a larger task, I want to calculate the monthly mean values for each specific station. This is already difficult to do, but I am getting close.
The dataframe has many columns, but ultimately I only use the following information:
Date Value Station_Name
0 2006-01-03 18 2
1 2006-01-04 12 2
2 2006-01-05 11 2
3 2006-01-06 10 2
4 2006-01-09 22 2
... ... ...
3510 2006-12-23 47 45
3511 2006-12-24 46 45
3512 2006-12-26 35 45
3513 2006-12-27 35 45
3514 2006-12-30 28 45
I am running into two issues, using:
df.groupby(['Station_Name', pd.Grouper(freq='M')])['Value'].mean()
It results in something like:
Station_Name Date
2 2003-01-31 29.448387
2003-02-28 30.617857
2003-03-31 28.758065
2003-04-30 28.392593
2003-05-31 30.318519
...
45 2003-09-30 16.160000
2003-10-31 18.906452
2003-11-30 26.296667
2003-12-31 30.306667
2004-01-31 29.330000
Which I can't seem to use as a regular dataframe, and the datetime is messed up as it doesn't show the monthly mean but gives the last day back. Also the station name is a single index, and not for the whole column. Plus the mean value doesn't have a "column name" at all. This isn't a dataframe, but a pandas.core.series.Series. I can't convert this again because it's not correct, and using the .to_frame() method shows that it is still indeed a Dataframe. I don't get this part.
I found that, in order to get a normal dataframe back, I should pass
as_index = False
to the groupby method. But this results in the months not being shown:
df.groupby(['Station_Name', pd.Grouper(freq='M')], as_index = False)['Value'].mean()
Gives:
Station_Name Value
0 2 29.448387
1 2 30.617857
2 2 28.758065
3 2 28.392593
4 2 30.318519
... ... ...
142 45 16.160000
143 45 18.906452
144 45 26.296667
145 45 30.306667
146 45 29.330000
I can't just simply add the month later, as not every station has an observation in every month.
I've tried using other methods, such as
df.resample("M").mean()
But it doesn't seem possible to do this on multiple columns. It returns the mean value of everything.
Edit: This is ultimately what I would want.
Station_Name Date Value
0 2 2003-01 29.448387
1 2 2003-02 30.617857
2 2 2003-03 28.758065
3 2 2003-04 28.392593
4 2 2003-05 30.318519
... ... ...
142 45 2003-08 16.160000
143 45 2003-09 18.906452
144 45 2003-10 26.296667
145 45 2003-11 30.306667
146 45 2003-12 29.330000
OK, how about this:
df = df.groupby(['Station_Name',df['Date'].dt.to_period('M')])['Value'].mean().reset_index()
Output:
Station_Name Date Value
0 2 2006-01 14.6
1 45 2006-12 38.2
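A self-contained version of the same idea, for anyone who wants to run it: the sketch below assumes the ten sample rows visible in the question and that Date is (or has been converted to) a datetime column. Grouping by the monthly period keeps the month as a regular column after reset_index().

```python
import pandas as pd

# The ten rows visible in the question; Date is parsed as datetime.
df = pd.DataFrame(
    {
        "Date": pd.to_datetime(
            ["2006-01-03", "2006-01-04", "2006-01-05", "2006-01-06", "2006-01-09",
             "2006-12-23", "2006-12-24", "2006-12-26", "2006-12-27", "2006-12-30"]
        ),
        "Value": [18, 12, 11, 10, 22, 47, 46, 35, 35, 28],
        "Station_Name": [2] * 5 + [45] * 5,
    }
)

# Group by station and by calendar month, then flatten back to a plain DataFrame.
monthly = (
    df.groupby(["Station_Name", df["Date"].dt.to_period("M")])["Value"]
    .mean()
    .reset_index()
)
print(monthly)
```

Only station/month pairs that actually have observations appear in the result, so stations with missing months are handled without any extra work.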

Python: How to allocate a 'misc' total between other categories

I'm building a report in Python to automate a lot of manual transformation we do in Excel at the moment. I'm able to extract the data and pivot it, to get something like this
Date      Category 1  Category 2  Category 3  Misc
01/01/21  40          30          30          10
02/01/21  30          20          50          20
Is it possible to divide the misc total for each date into the other categories by ratio? So I would end up with the below:
Date      Category 1  Category 2  Category 3
01/01/21  44          33          33
02/01/21  36          24          60
The only way I can think of is to split the misc values off to their own table, work out the ratios of the other categories, and then add misc * ratio to each category value, but I just wondered if there was a function I could use to condense the working on this?
Thanks
I think your solution hits the nail on the head. However, it can already be written quite compactly:
>>> cat = df.filter(regex='Category')
>>> df.update(cat + cat.mul(df['Misc'] / cat.sum(axis=1), axis=0))
>>> df.drop(columns=['Misc'])
Date Category 1 Category 2 Category 3
0 01/01/21 44.0 33.0 33.0
1 02/01/21 36.0 24.0 60.0
cat.mul(df['Misc'] / cat.sum(axis=1), axis=0) gets you the reallocated misc values per row, since you multiply each value by misc and divide it by the row total. .mul() lets you do the multiplication while specifying the axis along which to align; the rest is about selecting the right columns.
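Here is a runnable sketch of the same reallocation on the two sample rows from the question. It is a slight variant rather than the exact code above: it folds the misc share into a single per-row scale factor (1 + Misc / row total) and assigns the result to a new frame instead of calling df.update(). The names scale and result are just illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Date": ["01/01/21", "02/01/21"],
        "Category 1": [40, 30],
        "Category 2": [30, 20],
        "Category 3": [30, 50],
        "Misc": [10, 20],
    }
)

# Select only the category columns.
cat = df.filter(regex="Category")

# Each category grows by its proportional share of Misc:
# value + value * Misc / row_total == value * (1 + Misc / row_total)
scale = 1 + df["Misc"] / cat.sum(axis=1)

result = df.drop(columns="Misc")
result[cat.columns] = cat.mul(scale, axis=0)
print(result)
```

Each row total after the reallocation equals the original category total plus Misc, which is a quick way to sanity-check the result.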

Get maximum relative difference between row-values and row-mean in new pandas dataframe column

I want to have an extra column with the maximum relative difference [-] of the row-values and the mean of these rows:
The df is filled with energy use data for several years.
The theoretical formula that should get me this is as follows:
df['max_rel_dif'] = MAX [ ABS(highest energy use – mean energy use), ABS(lowest energy use – mean energy use)] / mean energy use
Initial dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014
0 23 22631 21954.0 22314.0 22032 21843
1 43 27456 29654.0 28159.0 28654 2000
2 36 61200 NaN NaN 31895 1600
3 87 87621 86542.0 87542.0 88456 86961
4 90 58951 57486.0 2000.0 0 0
5 98 24587 25478.0 NaN 24896 25461
Desired dataframe:
ID y_2010 y_2011 y_2012 y_2013 y_2014 max_rel_dif
0 23 22631 21954.0 22314.0 22032 21843 0.02149
1 43 27456 29654.0 28159.0 28654 2000 0.91373
2 36 61200 NaN NaN 31895 1600 0.94931
3 87 87621 86542.0 87542.0 88456 86961 0.01179
4 90 58951 57486.0 2000.0 0 0 1.48870
5 98 24587 25478.0 NaN 24896 25461 0.02065
tried code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
"y_2010": [22631,27456,61200,87621,58951,24587],
"y_2011": [21954,29654,np.nan,86542,57486,25478],
"y_2012": [22314,28159,np.nan,87542,2000,np.nan],
"y_2013": [22032,28654,31895,88456,0,24896,],
"y_2014": [21843,2000,1600,86961,0,25461]})

print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]


# calculate mean
mean = a.mean(1)
# calculate max_rel_dif
df['max_rel_dif'] = (((df.max(axis=1).sub(mean)).abs(),(df.min(axis=1).sub(mean)).abs()).max()).div(mean)
# AttributeError: 'tuple' object has no attribute 'max'
-> I'm obviously doing the wrong thing with the tuple. I just don't know how to get the maximum values
from the tuples and then divide them by the mean in a proper Pythonic way.
I feel like the whole function can be
s=df.filter(like='y')
s.sub(s.mean(1),axis=0).abs().max(1)/s.mean(1)
0 0.021494
1 0.913736
2 0.949311
3 0.011800
4 1.488707
5 0.020653
dtype: float64
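Put together with the data from the question, the approach can be sketched as below. The largest absolute deviation from the row mean is always reached at either the row maximum or the row minimum, so taking the maximum of all absolute deviations matches the formula in the question. Selecting only the year columns also keeps ID out of the calculation, which the attempted code did not do. The variable names years and mean are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

# Only the yearly energy-use columns; ID is excluded.
years = df.filter(like="y_")

# Row means (NaN values are skipped by default).
mean = years.mean(axis=1)

# Largest absolute deviation from the row mean, relative to that mean.
df["max_rel_dif"] = years.sub(mean, axis=0).abs().max(axis=1).div(mean)
print(df)
```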

calculating moving average in pandas

So, this is a fairly new topic for me and I don't quite understand it yet. I wanted to make a new column in a dataset that contains the moving average of the volume column. The window size is 5, and the moving average of row x is calculated from rows x-2, x-1, x, x+1, and x+2. For x=1 and x=2, the moving average is calculated using three and four rows, respectively.
I did this.
df['Volume_moving'] = df.iloc[:,5].rolling(window=5).mean()
df
Date Open High Low Close Volume Adj Close Volume_moving
0 2012-10-15 632.35 635.13 623.85 634.76 15446500 631.87 NaN
1 2012-10-16 635.37 650.30 631.00 649.79 19634700 646.84 NaN
2 2012-10-17 648.87 652.79 644.00 644.61 13894200 641.68 NaN
3 2012-10-18 639.59 642.06 630.00 632.64 17022300 629.76 NaN
4 2012-10-19 631.05 631.77 609.62 609.84 26574500 607.07 18514440.0
... ... ... ... ... ... ... ... ...
85 2013-01-08 529.21 531.89 521.25 525.31 16382400 525.31 17504860.0
86 2013-01-09 522.50 525.01 515.99 517.10 14557300 517.10 16412620.0
87 2013-01-10 528.55 528.72 515.52 523.51 21469500 523.51 18185340.0
88 2013-01-11 521.00 525.32 519.02 520.30 12518100 520.30 16443720.0
91 2013-01-14 502.68 507.50 498.51 501.75 26179000 501.75 18221260.0
However, I think that the result is not accurate, as I tried it with a different dataframe and got the exact same result.
Can anyone please help me with this?
Try with this:
df['Volume_moving'] = df['Volume'].rolling(window=5).mean()
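Note that a plain rolling(window=5) is a trailing window (rows x-4 to x), which is why the first four rows are NaN. If the goal is the centered window described in the question, one option is to pass center=True together with min_periods=1 so that the edge rows use whatever neighbours exist. A minimal sketch on the first five Volume values from the question (the Series here is typed in by hand for illustration):

```python
import pandas as pd

# First five Volume values from the question.
volume = pd.Series([15446500, 19634700, 13894200, 17022300, 26574500], name="Volume")

# center=True makes the window span rows x-2 .. x+2;
# min_periods=1 lets the edges average over the rows that are available.
centered = volume.rolling(window=5, center=True, min_periods=1).mean()
print(centered)
```

Here the first row averages three values, the second averages four, and the third averages all five.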
