I have the following pandas DataFrame:
A B
0 0.0 114422.0
1 99997.0 174382.0
2 0.0 24863.0
3 0.0 91559.0
4 0.0 94248.0
5 0.0 66020.0
6 0.0 61543.0
7 0.0 69267.0
8 0.0 6253.0
9 0.0 93002.0
10 0.0 13891.0
11 0.0 49261.0
12 0.0 20050.0
13 0.0 24710.0
14 0.0 10034.0
15 0.0 24508.0
16 0.0 18249.0
17 0.0 50646.0
18 0.0 150033.0
19 0.0 68424.0
20 0.0 125526.0
21 0.0 110526.0
22 40000.0 217450.0
23 0.0 75543.0
24 145000.0 305310.0
25 12000.0 98583.0
26 0.0 262202.0
27 0.0 277680.0
28 0.0 101420.0
29 0.0 109480.0
30 0.0 65230.0
which I tried to normalize (column-wise) with scikit-learn's RobustScaler:
array_scaled = RobustScaler().fit_transform(df)
df_scaled = pd.DataFrame(array_scaled, columns = df.columns)
However, in the resulting df_scaled the first column has not been scaled (or changed) at all:
A B
0 0.0 0.515555
1 99997.0 1.310653
2 0.0 -0.672042
3 0.0 0.212380
4 0.0 0.248037
5 0.0 -0.126280
6 0.0 -0.185647
7 0.0 -0.083223
8 0.0 -0.918819
9 0.0 0.231515
10 0.0 -0.817536
11 0.0 -0.348512
12 0.0 -0.735864
13 0.0 -0.674070
14 0.0 -0.868681
15 0.0 -0.676749
16 0.0 -0.759746
17 0.0 -0.330146
18 0.0 0.987774
19 0.0 -0.094401
20 0.0 0.662799
21 0.0 0.463892
22 40000.0 1.881756
23 0.0 0.000000
24 145000.0 3.046823
25 12000.0 0.305522
26 0.0 2.475190
27 0.0 2.680435
28 0.0 0.343142
29 0.0 0.450021
30 0.0 -0.136755
I do not understand this. I expected column A to be centered and scaled by the interquartile range too (as in the case of column B). What is the explanation here?
The middle 50% of the values in A are all zero, so both the median and the IQR of that column are zero. Removing a zero median changes nothing, and since the quantile range is zero there is nothing meaningful to divide by (scikit-learn replaces a zero scale with 1 to avoid division by zero), so the column comes out unchanged.
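You can check this directly; reconstructing column A from the question with NumPy:

```python
import numpy as np

# Column A from the question: 27 zeros plus four nonzero outliers
a = np.array([0.0] * 27 + [99997.0, 40000.0, 145000.0, 12000.0])

q1, med, q3 = np.percentile(a, [25, 50, 75])
print(med)       # median of A
print(q3 - q1)   # interquartile range of A
```

Both the median and the IQR print as 0.0, which is why RobustScaler leaves the column as-is.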
I grouped a data frame by week number and got a column of numbers that looks like this:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 235.0
17 849.0
18 1013.0
19 1155.0
20 1170.0
21 1247.0
22 1037.0
23 1197.0
24 1125.0
25 1106.0
26 1229.0
I used the following line of code on the column: df_group['rolling_total'] = df_group['totals'].rolling(2).sum()
This is my desired result:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
235.0
1084.0
2097.0
3252.0
4422.0
5669.0
6706.0
7903.0
9028.0
10134.0
11363.0
I get this instead:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
235.0
1084.0
1862.0
2168.0
2325.0
2417.0
2284.0
2234.0
2322.0
2231.0
2335.0
All I want is the rolling sum of the column. Is .rolling() not the way to accomplish this? Is there something I am doing wrong?
Use .cumsum(), not .rolling(): .rolling(2).sum() only sums a sliding window of two values, whereas a running total accumulates everything seen so far:
print(df.cumsum())
Prints:
column1
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 235.0
17 1084.0
18 2097.0
19 3252.0
20 4422.0
21 5669.0
22 6706.0
23 7903.0
24 9028.0
25 10134.0
26 11363.0
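The difference is easy to see on a small slice of the data: the rolling sum adds only the current value and the one before it, while the cumulative sum keeps accumulating.

```python
import pandas as pd

s = pd.Series([0.0, 235.0, 849.0, 1013.0])
print(s.rolling(2).sum().tolist())  # windowed sum over the last 2 values
print(s.cumsum().tolist())          # running total over all values so far
```

Here `rolling(2)` yields `[nan, 235.0, 1084.0, 1862.0]` (note the 1862.0, as in your unwanted output), while `cumsum()` yields `[0.0, 235.0, 1084.0, 2097.0]`, matching the desired result.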
Suppose I have the following code that calculates how many products I can purchase given my budget:
import math
import pandas as pd
data = [['2021-01-02', 5.5], ['2021-02-02', 10.5], ['2021-03-02', 15.0], ['2021-04-02', 20.0]]
df = pd.DataFrame(data, columns=['Date', 'Current_Price'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(mn - pd.tseries.offsets.MonthBegin(), mx + pd.tseries.offsets.MonthEnd(), name="Date")
df = df.set_index("Date").reindex(dr).reset_index()
df['Current_Price'] = df.groupby(
    pd.Grouper(key='Date', freq='1M'))['Current_Price'].ffill().bfill()
# The dataframe below shows the current price of the product
# I'd like to buy at the specific date_range
print(df)
# Create 'Day' column to know which day of the month
df['Day'] = pd.to_datetime(df['Date']).dt.day
# Create 'Deposit' column to record how much money is
# deposited in, say, my bank account to buy the product.
# 'Withdrawal' column is to record how much I spent in
# buying product(s) at the current price on a specific date.
# 'Num_of_Products_Bought' shows how many items I bought
# on that specific date.
#
# Please note that the calculate below takes into account
# the left over money, which remains after I've purchased a
# product, for future purchase. For example, if you observe
# the resulting dataframe at the end of this code, you'll
# notice that I was able to purchase 7 products on March 1, 2021
# although my deposit on that day was $100. That is because
# on the days leading up to March 1, 2021, I have been saving
# the spare change from previous product purchases and that
# extra money allows me to buy an extra product on March 1, 2021
# despite my budget of $100 should only allow me to purchase
# 6 products.
df[['Deposit', 'Withdrawal', 'Num_of_Products_Bought']] = 0.0
# Suppose I save $100 at the beginning of every month in my bank account
df.loc[df['Day'] == 1, 'Deposit'] = 100.0
for index, row in df.iterrows():
    if df.loc[index, 'Day'] == 1:
        # num_prod_bought = (sum_of_deposit_so_far - sum_of_withdrawal)/current_price
        df.loc[index, 'Num_of_Products_Bought'] = math.floor(
            (sum(df.iloc[0:(index + 1)]['Deposit'])
             - sum(df.iloc[0:(index + 1)]['Withdrawal']))
            / df.loc[index, 'Current_Price'])
        # Record how much I spent buying the product on specific date
        df.loc[index, 'Withdrawal'] = df.loc[index, 'Num_of_Products_Bought'] * df.loc[index, 'Current_Price']
print(df)
# This code above is working as intended,
# but how can I make it more efficient/pandas-like?
# In particular, I don't like to idea of having to
# iterate the rows and having to recalculate
# the running (sum of) deposit amount and
# the running (sum of) the withdrawal.
As mentioned in the comments in the code, I would like to know how to accomplish the same thing without iterating the rows one by one and recomputing the sum of all rows up to the current one on each iteration (I read around StackOverflow and saw the cumsum() function, but I don't think cumsum alone captures the notion of the current row in the iteration).
Thank you very much in advance for your suggestions/answers!
A solution using .apply with a generator that carries the leftover money between calls:
def fn():
    leftover = 0
    amount, deposit = yield
    while True:
        new_amount, new_deposit = yield (deposit + leftover) // amount
        leftover = (deposit + leftover) % amount
        amount, deposit = new_amount, new_deposit
df = df.set_index("Date")
s = fn()
next(s)
m = df.index.day == 1
df.loc[m, "Deposit"] = 100
df.loc[m, "Num_of_Products_Bought"] = df.loc[
    m, ["Current_Price", "Deposit"]
].apply(lambda x: s.send((x["Current_Price"], x["Deposit"])), axis=1)
df.loc[m, "Withdrawal"] = (
    df.loc[m, "Num_of_Products_Bought"] * df.loc[m, "Current_Price"]
)
print(df.fillna(0).reset_index())
Prints:
Date Current_Price Deposit Num_of_Products_Bought Withdrawal
0 2021-01-01 5.5 100.0 18.0 99.0
1 2021-01-02 5.5 0.0 0.0 0.0
2 2021-01-03 5.5 0.0 0.0 0.0
3 2021-01-04 5.5 0.0 0.0 0.0
4 2021-01-05 5.5 0.0 0.0 0.0
5 2021-01-06 5.5 0.0 0.0 0.0
6 2021-01-07 5.5 0.0 0.0 0.0
7 2021-01-08 5.5 0.0 0.0 0.0
8 2021-01-09 5.5 0.0 0.0 0.0
9 2021-01-10 5.5 0.0 0.0 0.0
10 2021-01-11 5.5 0.0 0.0 0.0
11 2021-01-12 5.5 0.0 0.0 0.0
12 2021-01-13 5.5 0.0 0.0 0.0
13 2021-01-14 5.5 0.0 0.0 0.0
14 2021-01-15 5.5 0.0 0.0 0.0
15 2021-01-16 5.5 0.0 0.0 0.0
16 2021-01-17 5.5 0.0 0.0 0.0
17 2021-01-18 5.5 0.0 0.0 0.0
18 2021-01-19 5.5 0.0 0.0 0.0
19 2021-01-20 5.5 0.0 0.0 0.0
20 2021-01-21 5.5 0.0 0.0 0.0
21 2021-01-22 5.5 0.0 0.0 0.0
22 2021-01-23 5.5 0.0 0.0 0.0
23 2021-01-24 5.5 0.0 0.0 0.0
24 2021-01-25 5.5 0.0 0.0 0.0
25 2021-01-26 5.5 0.0 0.0 0.0
26 2021-01-27 5.5 0.0 0.0 0.0
27 2021-01-28 5.5 0.0 0.0 0.0
28 2021-01-29 5.5 0.0 0.0 0.0
29 2021-01-30 5.5 0.0 0.0 0.0
30 2021-01-31 5.5 0.0 0.0 0.0
31 2021-02-01 10.5 100.0 9.0 94.5
32 2021-02-02 10.5 0.0 0.0 0.0
33 2021-02-03 10.5 0.0 0.0 0.0
34 2021-02-04 10.5 0.0 0.0 0.0
35 2021-02-05 10.5 0.0 0.0 0.0
36 2021-02-06 10.5 0.0 0.0 0.0
37 2021-02-07 10.5 0.0 0.0 0.0
38 2021-02-08 10.5 0.0 0.0 0.0
39 2021-02-09 10.5 0.0 0.0 0.0
40 2021-02-10 10.5 0.0 0.0 0.0
41 2021-02-11 10.5 0.0 0.0 0.0
42 2021-02-12 10.5 0.0 0.0 0.0
43 2021-02-13 10.5 0.0 0.0 0.0
44 2021-02-14 10.5 0.0 0.0 0.0
45 2021-02-15 10.5 0.0 0.0 0.0
46 2021-02-16 10.5 0.0 0.0 0.0
47 2021-02-17 10.5 0.0 0.0 0.0
48 2021-02-18 10.5 0.0 0.0 0.0
49 2021-02-19 10.5 0.0 0.0 0.0
50 2021-02-20 10.5 0.0 0.0 0.0
51 2021-02-21 10.5 0.0 0.0 0.0
52 2021-02-22 10.5 0.0 0.0 0.0
53 2021-02-23 10.5 0.0 0.0 0.0
54 2021-02-24 10.5 0.0 0.0 0.0
55 2021-02-25 10.5 0.0 0.0 0.0
56 2021-02-26 10.5 0.0 0.0 0.0
57 2021-02-27 10.5 0.0 0.0 0.0
58 2021-02-28 10.5 0.0 0.0 0.0
59 2021-03-01 15.0 100.0 7.0 105.0
60 2021-03-02 15.0 0.0 0.0 0.0
61 2021-03-03 15.0 0.0 0.0 0.0
62 2021-03-04 15.0 0.0 0.0 0.0
63 2021-03-05 15.0 0.0 0.0 0.0
64 2021-03-06 15.0 0.0 0.0 0.0
65 2021-03-07 15.0 0.0 0.0 0.0
66 2021-03-08 15.0 0.0 0.0 0.0
67 2021-03-09 15.0 0.0 0.0 0.0
68 2021-03-10 15.0 0.0 0.0 0.0
69 2021-03-11 15.0 0.0 0.0 0.0
70 2021-03-12 15.0 0.0 0.0 0.0
71 2021-03-13 15.0 0.0 0.0 0.0
72 2021-03-14 15.0 0.0 0.0 0.0
73 2021-03-15 15.0 0.0 0.0 0.0
74 2021-03-16 15.0 0.0 0.0 0.0
75 2021-03-17 15.0 0.0 0.0 0.0
76 2021-03-18 15.0 0.0 0.0 0.0
77 2021-03-19 15.0 0.0 0.0 0.0
78 2021-03-20 15.0 0.0 0.0 0.0
79 2021-03-21 15.0 0.0 0.0 0.0
80 2021-03-22 15.0 0.0 0.0 0.0
81 2021-03-23 15.0 0.0 0.0 0.0
82 2021-03-24 15.0 0.0 0.0 0.0
83 2021-03-25 15.0 0.0 0.0 0.0
84 2021-03-26 15.0 0.0 0.0 0.0
85 2021-03-27 15.0 0.0 0.0 0.0
86 2021-03-28 15.0 0.0 0.0 0.0
87 2021-03-29 15.0 0.0 0.0 0.0
88 2021-03-30 15.0 0.0 0.0 0.0
89 2021-03-31 15.0 0.0 0.0 0.0
90 2021-04-01 20.0 100.0 5.0 100.0
91 2021-04-02 20.0 0.0 0.0 0.0
92 2021-04-03 20.0 0.0 0.0 0.0
93 2021-04-04 20.0 0.0 0.0 0.0
94 2021-04-05 20.0 0.0 0.0 0.0
95 2021-04-06 20.0 0.0 0.0 0.0
96 2021-04-07 20.0 0.0 0.0 0.0
97 2021-04-08 20.0 0.0 0.0 0.0
98 2021-04-09 20.0 0.0 0.0 0.0
99 2021-04-10 20.0 0.0 0.0 0.0
100 2021-04-11 20.0 0.0 0.0 0.0
101 2021-04-12 20.0 0.0 0.0 0.0
102 2021-04-13 20.0 0.0 0.0 0.0
103 2021-04-14 20.0 0.0 0.0 0.0
104 2021-04-15 20.0 0.0 0.0 0.0
105 2021-04-16 20.0 0.0 0.0 0.0
106 2021-04-17 20.0 0.0 0.0 0.0
107 2021-04-18 20.0 0.0 0.0 0.0
108 2021-04-19 20.0 0.0 0.0 0.0
109 2021-04-20 20.0 0.0 0.0 0.0
110 2021-04-21 20.0 0.0 0.0 0.0
111 2021-04-22 20.0 0.0 0.0 0.0
112 2021-04-23 20.0 0.0 0.0 0.0
113 2021-04-24 20.0 0.0 0.0 0.0
114 2021-04-25 20.0 0.0 0.0 0.0
115 2021-04-26 20.0 0.0 0.0 0.0
116 2021-04-27 20.0 0.0 0.0 0.0
117 2021-04-28 20.0 0.0 0.0 0.0
118 2021-04-29 20.0 0.0 0.0 0.0
119 2021-04-30 20.0 0.0 0.0 0.0
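To see how the generator carries the leftover across .send() calls, it helps to isolate it from the DataFrame machinery. The numbers below match the first three purchase rows in the output table (18, 9, 7 products).

```python
def fn():
    # Coroutine: each .send((price, deposit)) yields the number of products
    # affordable with the deposit plus the leftover from earlier purchases.
    leftover = 0
    amount, deposit = yield
    while True:
        new_amount, new_deposit = yield (deposit + leftover) // amount
        leftover = (deposit + leftover) % amount
        amount, deposit = new_amount, new_deposit

s = fn()
next(s)  # prime the generator up to the first bare `yield`
print(s.send((5.5, 100)))   # January:  floor(100 / 5.5)    = 18, leftover 1.0
print(s.send((10.5, 100)))  # February: floor(101 / 10.5)   = 9,  leftover 6.5
print(s.send((15.0, 100)))  # March:    floor(106.5 / 15.0) = 7
```

The March purchase of 7 products (rather than 6) is exactly the "spare change" effect described in the question's comments.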
I am trying to build an ML model on the Titanic dataset, and while preparing it I used OneHotEncoder to create Embarked dummies; in the process I lost my column headers.
Here is how the dataset looked before.
Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 3 1 22.000000 1 0 7.2500 146 2
1 1 0 38.000000 1 0 71.2833 81 0
2 3 0 26.000000 0 0 7.9250 146 2
3 1 0 35.000000 1 0 53.1000 55 2
4 3 1 35.000000 0 0 8.0500 146 2
... ... ... ... ... ... ... ... ...
886 2 1 27.000000 0 0 13.0000 146 2
887 1 0 19.000000 0 0 30.0000 30 2
888 3 0 29.699118 1 2 23.4500 146 2
889 1 1 26.000000 0 0 30.0000 60 0
890 3 1 32.000000 0 0 7.7500 146 1
Here is the code.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X))
X
Here is how the dataset is looking now.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 22.000000 1.0 7.2500 146.0
1 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 38.000000 1.0 71.2833 81.0
2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 26.000000 0.0 7.9250 146.0
3 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 35.000000 1.0 53.1000 55.0
4 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 35.000000 0.0 8.0500 146.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 27.000000 0.0 13.0000 146.0
887 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 19.000000 0.0 30.0000 30.0
888 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 29.699118 1.0 23.4500 146.0
889 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 26.000000 0.0 30.0000 60.0
890 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 3.0 1.0 32.000000 0.0 7.7500 146.0
You can use the get_feature_names method of ColumnTransformer (renamed get_feature_names_out in scikit-learn 1.0+), provided all your transformers support it and you fit on a DataFrame.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names())
X
The output of fit_transform is array-like:
X_t : {array-like, sparse matrix} of shape (n_samples, sum_n_components)
(not DataFrame-like), thus no headers. If you want headers, you'll have to name the columns when rebuilding the DataFrame.
I have a .txt file that looks like this:
08/19/93 UW ARCHIVE 100.0 1962 W IEEE 14 Bus Test Case
BUS DATA FOLLOWS 14 ITEMS
1 Bus 1 HV 1 1 3 1.060 0.0 0.0 0.0 232.4 -16.9 0.0 1.060 0.0 0.0 0.0 0.0 0
2 Bus 2 HV 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045 50.0 -40.0 0.0 0.0 0
3 Bus 3 HV 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010 40.0 0.0 0.0 0.0 0
4 Bus 4 HV 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
5 Bus 5 HV 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
6 Bus 6 LV 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070 24.0 -6.0 0.0 0.0 0
7 Bus 7 ZV 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
8 Bus 8 TV 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090 24.0 -6.0 0.0 0.0 0
9 Bus 9 LV 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.19 0
10 Bus 10 LV 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
11 Bus 11 LV 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
12 Bus 12 LV 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
13 Bus 13 LV 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
14 Bus 14 LV 1 1 0 1.036 -16.04 14.9 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I need to remove the text fields from this file and keep only the numerical data in matrix form. I am relatively new to Python, so any kind of help will be really appreciated. Thank you.
I would suggest reading the data into a pandas DataFrame and then deleting the columns with text, or creating a second frame without the text columns.
Try:
data = pd.read_csv('output_list.txt', sep=" ", header=None)
data.columns = ["a", "b", "c", "etc."]
Since this is simple to do in pandas when the data is well-formed, here is my take:
import pandas as pd
data = '''\
08/19/93 UW ARCHIVE 100.0 1962 W IEEE 14 Bus Test Case
BUS DATA FOLLOWS 14 ITEMS
1 Bus 1 HV 1 1 3 1.060 0.0 0.0 0.0 232.4 -16.9 0.0 1.060 0.0 0.0 0.0 0.0 0
2 Bus 2 HV 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045 50.0 -40.0 0.0 0.0 0
3 Bus 3 HV 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010 40.0 0.0 0.0 0.0 0
4 Bus 4 HV 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
5 Bus 5 HV 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
6 Bus 6 LV 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070 24.0 -6.0 0.0 0.0 0
7 Bus 7 ZV 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
8 Bus 8 TV 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090 24.0 -6.0 0.0 0.0 0
9 Bus 9 LV 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.19 0
10 Bus 10 LV 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
11 Bus 11 LV 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
12 Bus 12 LV 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
13 Bus 13 LV 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
'''
from io import StringIO  # pd.compat.StringIO was removed in newer pandas versions
fileobj = StringIO(data)
# change fileobj to a file path (and adjust sep, e.g. to '\t') to read the actual file
df = pd.read_csv(fileobj, sep=r'\s+', header=None, skiprows=2)
df = df.loc[:,df.dtypes != 'object']
print(df)
Returns:
0 2 4 5 6 7 8 9 10 11 12 13 14 \
0 1 1 1 1 3 1.060 0.00 0.0 0.0 232.4 -16.9 0.0 1.060
1 2 2 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045
2 3 3 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010
3 4 4 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.000
4 5 5 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.000
5 6 6 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070
6 7 7 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.000
7 8 8 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090
8 9 9 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.000
9 10 10 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.000
10 11 11 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.000
11 12 12 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.000
12 13 13 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.000
15 16 17 18 19
0 0.0 0.0 0.0 0.00 0
1 50.0 -40.0 0.0 0.00 0
2 40.0 0.0 0.0 0.00 0
3 0.0 0.0 0.0 0.00 0
4 0.0 0.0 0.0 0.00 0
5 24.0 -6.0 0.0 0.00 0
6 0.0 0.0 0.0 0.00 0
7 24.0 -6.0 0.0 0.00 0
8 0.0 0.0 0.0 0.19 0
9 0.0 0.0 0.0 0.00 0
10 0.0 0.0 0.0 0.00 0
11 0.0 0.0 0.0 0.00 0
12 0.0 0.0 0.0 0.00 0
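If the end goal is a plain matrix rather than a DataFrame, .to_numpy() converts the numeric frame to a 2-D NumPy array. A minimal sketch with a stand-in frame:

```python
import pandas as pd

# Stand-in for the numeric-only frame obtained above
df = pd.DataFrame({"bus": [1, 2], "volt": [1.060, 1.045]})

matrix = df.to_numpy()  # 2-D NumPy array of the numeric values
print(matrix.shape)
```

`to_numpy()` is the recommended replacement for the older `.values` attribute.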
It's a bit complicated to explain, so I'll do my best. I have a pandas DataFrame with two columns: hour (from 1 to 24) and value (corresponding to each hour). The dataset index is huge, but the hour column repeats on a 24-hour basis (from 1 to 24). I am trying to create 24 new columns: value -1, value -2, value -3, ... value -24, where each row holds the value from 1 hour earlier, 2 hours earlier, and so on (taken from the rows above).
hour | value | value -1 | value -2 | value -3| ... | value - 24
1 10 0 0 0 0
2 11 10 0 0 0
3 12 11 10 0 0
4 13 12 11 10 0
...
24 32 31 30 29 0
1 33 32 31 30 10
2 34 33 32 31 11
and so on...
All the value numbers are just for the example. As I said, there are lots of rows: not only the 24 hours of a single day, but the full time series repeating from 1 to 24 over and over.
Thanks in advance and may the force be with you!
Is this what you need?
df = pd.DataFrame([[1, 10], [2, 11],
                   [3, 12], [4, 13]], columns=['hour', 'value'])
for i in range(1, 24):
    df['value -' + str(i)] = df['value'].shift(i).fillna(0)
result:
Is this what you are looking for?
import pandas as pd
df = pd.DataFrame({'hour': list(range(24))*2,
                   'value': list(range(48))})
shift_cols_n = 10
for shift in range(1, shift_cols_n):
    new_columns_name = 'value - ' + str(shift)
    # Assuming that you don't have any NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift).fillna(0)
    # A safer (and less simple) way, in case you have NAs in your dataframe
    df[new_columns_name] = df['value'].shift(shift)
    df.loc[:shift - 1, new_columns_name] = 0  # only the first `shift` rows are shifted-in NaNs
print(df.head(9))
hour value value - 1 value - 2 value - 3 value - 4 value - 5 \
0 0 0 0.0 0.0 0.0 0.0 0.0
1 1 1 0.0 0.0 0.0 0.0 0.0
2 2 2 1.0 0.0 0.0 0.0 0.0
3 3 3 2.0 1.0 0.0 0.0 0.0
4 4 4 3.0 2.0 1.0 0.0 0.0
5 5 5 4.0 3.0 2.0 1.0 0.0
6 6 6 5.0 4.0 3.0 2.0 1.0
7 7 7 6.0 5.0 4.0 3.0 2.0
8 8 8 7.0 6.0 5.0 4.0 3.0
value - 6 value - 7 value - 8 value - 9
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 0.0
6 0.0 0.0 0.0 0.0
7 1.0 0.0 0.0 0.0
8 2.0 1.0 0.0 0.0
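As an aside, the same lag columns can be built in one shot with a dict comprehension and pd.concat, which avoids growing the frame column by column. A sketch on the asker's 4-row example (only 3 lags for brevity):

```python
import pandas as pd

df = pd.DataFrame({'hour': [1, 2, 3, 4], 'value': [10, 11, 12, 13]})

# Build every lag column at once, then attach them in a single join
lags = pd.concat(
    {f'value -{i}': df['value'].shift(i).fillna(0) for i in range(1, 4)},
    axis=1,
)
df = df.join(lags)
print(df)
```

Constructing all the columns first and joining once is generally faster than inserting columns in a loop, since pandas doesn't have to reallocate the frame repeatedly.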