.rolling() on groupby dataframe - python

I grouped a data frame by week number and got a column of numbers that looks like this:
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 235.0
17 849.0
18 1013.0
19 1155.0
20 1170.0
21 1247.0
22 1037.0
23 1197.0
24 1125.0
25 1106.0
26 1229.0
I used the following line of code on the column:
df_group['rolling_total'] = df_group['totals'].rolling(2).sum()
This is my desired result:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
235.0
1084.0
2097.0
3252.0
4422.0
5669.0
6706.0
7903.0
9028.0
10134.0
11363.0
I get this instead:
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
235.0
1084.0
1862.0
2168.0
2325.0
2417.0
2284.0
2234.0
2322.0
2231.0
2335.0
All I want is the rolling sum of the column. Is .rolling() not the way to accomplish this? Is there something I am doing wrong?

Use .cumsum(), not .rolling(). .rolling(2).sum() computes a sliding-window sum over the last two rows, whereas the running total you describe is a cumulative sum:
print(df.cumsum())
Prints:
column1
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
5 0.0
6 0.0
7 0.0
8 0.0
9 0.0
10 0.0
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 235.0
17 1084.0
18 2097.0
19 3252.0
20 4422.0
21 5669.0
22 6706.0
23 7903.0
24 9028.0
25 10134.0
26 11363.0
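Applied to the column from your snippet (assuming df_group and totals as defined there), that is:
df_group['rolling_total'] = df_group['totals'].cumsum()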

Related

Sklearn's RobustScaler doesn't scale certain columns at all

I have the following Pandas DF:
A B
0 0.0 114422.0
1 99997.0 174382.0
2 0.0 24863.0
3 0.0 91559.0
4 0.0 94248.0
5 0.0 66020.0
6 0.0 61543.0
7 0.0 69267.0
8 0.0 6253.0
9 0.0 93002.0
10 0.0 13891.0
11 0.0 49261.0
12 0.0 20050.0
13 0.0 24710.0
14 0.0 10034.0
15 0.0 24508.0
16 0.0 18249.0
17 0.0 50646.0
18 0.0 150033.0
19 0.0 68424.0
20 0.0 125526.0
21 0.0 110526.0
22 40000.0 217450.0
23 0.0 75543.0
24 145000.0 305310.0
25 12000.0 98583.0
26 0.0 262202.0
27 0.0 277680.0
28 0.0 101420.0
29 0.0 109480.0
30 0.0 65230.0
which I tried to normalize (column-wise) with scikit-learn's RobustScaler:
array_scaled = RobustScaler().fit_transform(df)
df_scaled = pd.DataFrame(array_scaled, columns = df.columns)
However, in the resulted df_scaled the first column has not been scaled (or changed) at all:
A B
0 0.0 0.515555
1 99997.0 1.310653
2 0.0 -0.672042
3 0.0 0.212380
4 0.0 0.248037
5 0.0 -0.126280
6 0.0 -0.185647
7 0.0 -0.083223
8 0.0 -0.918819
9 0.0 0.231515
10 0.0 -0.817536
11 0.0 -0.348512
12 0.0 -0.735864
13 0.0 -0.674070
14 0.0 -0.868681
15 0.0 -0.676749
16 0.0 -0.759746
17 0.0 -0.330146
18 0.0 0.987774
19 0.0 -0.094401
20 0.0 0.662799
21 0.0 0.463892
22 40000.0 1.881756
23 0.0 0.000000
24 145000.0 3.046823
25 12000.0 0.305522
26 0.0 2.475190
27 0.0 2.680435
28 0.0 0.343142
29 0.0 0.450021
30 0.0 -0.136755
I do not understand this. I expected column A to be scaled (and centered) by the interquartile range too, as happened with column B. What is the explanation here?
Your middle 50% of values in A are all zero, thus the IQR as well as the overall median are both zero. This effectively leads to no change when the median is removed, and no change when the data is scaled according to the quantile range.
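A quick way to verify this yourself (a sketch, assuming df is the frame above):
import numpy as np
# the 25th, 50th and 75th percentiles of A are all zero, so RobustScaler's
# (x - median) / IQR leaves the column unchanged (scikit-learn replaces a
# zero scale with 1 rather than dividing by zero)
q1, median, q3 = np.percentile(df['A'], [25, 50, 75])
print(q1, median, q3)  # 0.0 0.0 0.0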

I am trying to write a For Loop in Python to identify types of sales for use with a 'sales report'

UPDATED - 4.13.22
I am new to Python programming and am trying to create a program, using for loops, that goes through a data frame row by row to identify different types of 'group sales' (made up of different combinations of product sales) and posts the results in a 'Result' column.
I was told in previous comments to print the df and paste it:
Date LFMIX SALE LCSIX SALE LOTIX SALE LSPIX SALE LEQIX SALE \
0 0.0 0.0 30000.0 0.0 0.0 0.0
1 0.0 0.0 30000.0 0.0 0.0 0.0
2 0.0 30000.0 0.0 0.0 0.0 0.0
3 0.0 25000.0 25000.0 0.0 0.0 0.0
4 0.0 30000.0 30000.0 0.0 0.0 0.0
5 0.0 30000.0 0.0 0.0 0.0 30000.0
6 0.0 0.0 30000.0 0.0 0.0 30000.0
7 0.0 25000.0 25000.0 0.0 0.0 25000.0
AUM LFMIX AUM LCSIX AUM LOTIX AUM LSPIX AUM LEQIX \
0 200000.0 0.0 0.0 0.0 0.0
1 500000.0 0.0 0.0 0.0 0.0
2 0.0 200000.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 200000.0
5 0.0 200000.0 0.0 0.0 0.0
6 200000.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
is the sale = 10% of pairing fund AUM LFMIX LCSIX LOTIX LSPIX LEQIX \
0 0.0 1 1 0.0 0.0 0.0
1 0.0 1 1 0.0 0.0 0.0
2 0.0 1 1 0.0 0.0 0.0
3 0.0 1 1 0.0 0.0 0.0
4 0.0 1 1 0.0 0.0 1.0
5 0.0 1 1 0.0 0.0 1.0
6 0.0 1 1 0.0 0.0 1.0
7 0.0 1 1 0.0 0.0 1.0
Expected_Result Result
0 DP1
1 0
2 DP2
3 DP3
4 TT1
5 TT2
6 TT3
7 TT4
my Python code to identify just the 1st type:
for row in range(len(df)):
    if df["LCSIX"][row] >= (df["AUM LFMIX"][row] * .1):
        df["Result"][row] = "DP1"
and the results:
Date LFMIX SALE LCSIX SALE LOTIX SALE LSPIX SALE LEQIX SALE \
0 0.0 0.0 30000.0 0.0 0.0 0.0
1 0.0 0.0 30000.0 0.0 0.0 0.0
2 0.0 30000.0 0.0 0.0 0.0 0.0
3 0.0 25000.0 25000.0 0.0 0.0 0.0
4 0.0 30000.0 30000.0 0.0 0.0 0.0
5 0.0 30000.0 0.0 0.0 0.0 30000.0
6 0.0 0.0 30000.0 0.0 0.0 30000.0
7 0.0 25000.0 25000.0 0.0 0.0 25000.0
AUM LFMIX AUM LCSIX AUM LOTIX AUM LSPIX AUM LEQIX \
0 200000.0 0.0 0.0 0.0 0.0
1 500000.0 0.0 0.0 0.0 0.0
2 0.0 200000.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 200000.0
5 0.0 200000.0 0.0 0.0 0.0
6 200000.0 0.0 0.0 0.0 0.0
7 0.0 0.0 0.0 0.0 0.0
is the sale = 10% of pairing fund AUM LFMIX LCSIX LOTIX LSPIX LEQIX \
0 0.0 1 1 0.0 0.0 0.0
1 0.0 1 1 0.0 0.0 0.0
2 0.0 1 1 0.0 0.0 0.0
3 0.0 1 1 0.0 0.0 0.0
4 0.0 1 1 0.0 0.0 1.0
5 0.0 1 1 0.0 0.0 1.0
6 0.0 1 1 0.0 0.0 1.0
7 0.0 1 1 0.0 0.0 1.0
Expected_Result Result
0 DP1
1 0
2 DP2 DP1
3 DP3 DP1
4 TT1 DP1
5 TT2 DP1
6 TT3
7 TT4 DP1
As you can see, the code fails to identify row 0 as a DP1 and misidentifies other rows.
I am planning on coding 'For Loops' that will identify 17 different types of group sales, this is simply the 1st group I am trying to identify...
Thanks for the help.
When you're working with pandas, you need to think in terms of doing things with whole columns, NOT row by row, which is hopelessly slow in pandas. If you need to go row by row, then do all of that before you convert to pandas.
In this case, you need to set the "Result" column for all rows where your condition is met. This does that in one line:
df.loc[df["LCSIX"] >= df["AUM LFMIX"] * 0.1, "Result"] = "DP1"
So, we select the rows where the relation is true and assign "DP1" to the "Result" column for exactly those rows; using .loc also avoids pandas' chained-assignment warning. Simple. ;)
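If you later need all 17 sale types, one way to avoid 17 separate passes is numpy's select, which assigns the label of the first matching condition per row. A sketch (the DP2 rule below is a hypothetical placeholder; substitute your real conditions):
import numpy as np

conditions = [
    df["LCSIX"] >= df["AUM LFMIX"] * 0.1,       # the DP1 rule from above
    df["LFMIX SALE"] >= df["AUM LCSIX"] * 0.1,  # hypothetical DP2 rule
]
labels = ["DP1", "DP2"]
df["Result"] = np.select(conditions, labels, default="0")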

Calculating sum of up to the current row in pandas while iterating on each row in a time series data

Suppose I have the following code that calculates how many products I can purchase given my budget-
import math
import pandas as pd
data = [['2021-01-02', 5.5], ['2021-02-02', 10.5], ['2021-03-02', 15.0], ['2021-04-02', 20.0]]
df = pd.DataFrame(data, columns=['Date', 'Current_Price'])
df.Date = pd.to_datetime(df.Date)
mn = df.Date.min()
mx = df.Date.max()
dr = pd.date_range(mn - pd.tseries.offsets.MonthBegin(), mx + pd.tseries.offsets.MonthEnd(), name="Date")
df = df.set_index("Date").reindex(dr).reset_index()
df['Current_Price'] = df.groupby(
    pd.Grouper(key='Date', freq='1M'))['Current_Price'].ffill().bfill()
# The dataframe below shows the current price of the product
# I'd like to buy at the specific date_range
print(df)
# Create 'Day' column to know which day of the month
df['Day'] = pd.to_datetime(df['Date']).dt.day
# Create 'Deposit' column to record how much money is
# deposited in, say, my bank account to buy the product.
# 'Withdrawal' column is to record how much I spent in
# buying product(s) at the current price on a specific date.
# 'Num_of_Products_Bought' shows how many items I bought
# on that specific date.
#
# Please note that the calculation below takes into account
# the left over money, which remains after I've purchased a
# product, for future purchase. For example, if you observe
# the resulting dataframe at the end of this code, you'll
# notice that I was able to purchase 7 products on March 1, 2021
# although my deposit on that day was $100. That is because
# on the days leading up to March 1, 2021, I have been saving
# the spare change from previous product purchases and that
# extra money allows me to buy an extra product on March 1, 2021
# even though my budget of $100 should only allow me to purchase
# 6 products.
df[['Deposit', 'Withdrawal', 'Num_of_Products_Bought']] = 0.0
# Suppose I save $100 at the beginning of every month in my bank account
df.loc[df['Day'] == 1, 'Deposit'] = 100.0
for index, row in df.iterrows():
    if df.loc[index, 'Day'] == 1:
        # num_prod_bought = (sum_of_deposit_so_far - sum_of_withdrawal)/current_price
        df.loc[index, 'Num_of_Products_Bought'] = math.floor(
            (sum(df.iloc[0:(index + 1)]['Deposit'])
             - sum(df.iloc[0:(index + 1)]['Withdrawal']))
            / df.loc[index, 'Current_Price'])
        # Record how much I spent buying the product on specific date
        df.loc[index, 'Withdrawal'] = df.loc[index, 'Num_of_Products_Bought'] * df.loc[index, 'Current_Price']
print(df)
# This code above is working as intended,
# but how can I make it more efficient/pandas-like?
# In particular, I don't like the idea of having to
# iterate the rows and having to recalculate
# the running (sum of) deposit amount and
# the running (sum of) withdrawals.
As mentioned in the comments in the code, I would like to know how to accomplish the same thing without iterating the rows one by one and recalculating the sum of the rows up to the current row on each iteration (I read around StackOverflow and saw the cumsum() function, but I don't think cumsum has the notion of the current row in the iteration).
Thank you very much in advance for your suggestions/answers!
A solution using .apply:
def fn():
    leftover = 0
    amount, deposit = yield
    while True:
        new_amount, new_deposit = yield (deposit + leftover) // amount
        leftover = (deposit + leftover) % amount
        amount, deposit = new_amount, new_deposit
df = df.set_index("Date")
s = fn()
next(s)
m = df.index.day == 1
df.loc[m, "Deposit"] = 100
df.loc[m, "Num_of_Products_Bought"] = df.loc[
m, ["Current_Price", "Deposit"]
].apply(lambda x: s.send((x["Current_Price"], x["Deposit"])), axis=1)
df.loc[m, "Withdrawal"] = (
df.loc[m, "Num_of_Products_Bought"] * df.loc[m, "Current_Price"]
)
print(df.fillna(0).reset_index())
Prints:
Date Current_Price Deposit Num_of_Products_Bought Withdrawal
0 2021-01-01 5.5 100.0 18.0 99.0
1 2021-01-02 5.5 0.0 0.0 0.0
2 2021-01-03 5.5 0.0 0.0 0.0
3 2021-01-04 5.5 0.0 0.0 0.0
4 2021-01-05 5.5 0.0 0.0 0.0
5 2021-01-06 5.5 0.0 0.0 0.0
6 2021-01-07 5.5 0.0 0.0 0.0
7 2021-01-08 5.5 0.0 0.0 0.0
8 2021-01-09 5.5 0.0 0.0 0.0
9 2021-01-10 5.5 0.0 0.0 0.0
10 2021-01-11 5.5 0.0 0.0 0.0
11 2021-01-12 5.5 0.0 0.0 0.0
12 2021-01-13 5.5 0.0 0.0 0.0
13 2021-01-14 5.5 0.0 0.0 0.0
14 2021-01-15 5.5 0.0 0.0 0.0
15 2021-01-16 5.5 0.0 0.0 0.0
16 2021-01-17 5.5 0.0 0.0 0.0
17 2021-01-18 5.5 0.0 0.0 0.0
18 2021-01-19 5.5 0.0 0.0 0.0
19 2021-01-20 5.5 0.0 0.0 0.0
20 2021-01-21 5.5 0.0 0.0 0.0
21 2021-01-22 5.5 0.0 0.0 0.0
22 2021-01-23 5.5 0.0 0.0 0.0
23 2021-01-24 5.5 0.0 0.0 0.0
24 2021-01-25 5.5 0.0 0.0 0.0
25 2021-01-26 5.5 0.0 0.0 0.0
26 2021-01-27 5.5 0.0 0.0 0.0
27 2021-01-28 5.5 0.0 0.0 0.0
28 2021-01-29 5.5 0.0 0.0 0.0
29 2021-01-30 5.5 0.0 0.0 0.0
30 2021-01-31 5.5 0.0 0.0 0.0
31 2021-02-01 10.5 100.0 9.0 94.5
32 2021-02-02 10.5 0.0 0.0 0.0
33 2021-02-03 10.5 0.0 0.0 0.0
34 2021-02-04 10.5 0.0 0.0 0.0
35 2021-02-05 10.5 0.0 0.0 0.0
36 2021-02-06 10.5 0.0 0.0 0.0
37 2021-02-07 10.5 0.0 0.0 0.0
38 2021-02-08 10.5 0.0 0.0 0.0
39 2021-02-09 10.5 0.0 0.0 0.0
40 2021-02-10 10.5 0.0 0.0 0.0
41 2021-02-11 10.5 0.0 0.0 0.0
42 2021-02-12 10.5 0.0 0.0 0.0
43 2021-02-13 10.5 0.0 0.0 0.0
44 2021-02-14 10.5 0.0 0.0 0.0
45 2021-02-15 10.5 0.0 0.0 0.0
46 2021-02-16 10.5 0.0 0.0 0.0
47 2021-02-17 10.5 0.0 0.0 0.0
48 2021-02-18 10.5 0.0 0.0 0.0
49 2021-02-19 10.5 0.0 0.0 0.0
50 2021-02-20 10.5 0.0 0.0 0.0
51 2021-02-21 10.5 0.0 0.0 0.0
52 2021-02-22 10.5 0.0 0.0 0.0
53 2021-02-23 10.5 0.0 0.0 0.0
54 2021-02-24 10.5 0.0 0.0 0.0
55 2021-02-25 10.5 0.0 0.0 0.0
56 2021-02-26 10.5 0.0 0.0 0.0
57 2021-02-27 10.5 0.0 0.0 0.0
58 2021-02-28 10.5 0.0 0.0 0.0
59 2021-03-01 15.0 100.0 7.0 105.0
60 2021-03-02 15.0 0.0 0.0 0.0
61 2021-03-03 15.0 0.0 0.0 0.0
62 2021-03-04 15.0 0.0 0.0 0.0
63 2021-03-05 15.0 0.0 0.0 0.0
64 2021-03-06 15.0 0.0 0.0 0.0
65 2021-03-07 15.0 0.0 0.0 0.0
66 2021-03-08 15.0 0.0 0.0 0.0
67 2021-03-09 15.0 0.0 0.0 0.0
68 2021-03-10 15.0 0.0 0.0 0.0
69 2021-03-11 15.0 0.0 0.0 0.0
70 2021-03-12 15.0 0.0 0.0 0.0
71 2021-03-13 15.0 0.0 0.0 0.0
72 2021-03-14 15.0 0.0 0.0 0.0
73 2021-03-15 15.0 0.0 0.0 0.0
74 2021-03-16 15.0 0.0 0.0 0.0
75 2021-03-17 15.0 0.0 0.0 0.0
76 2021-03-18 15.0 0.0 0.0 0.0
77 2021-03-19 15.0 0.0 0.0 0.0
78 2021-03-20 15.0 0.0 0.0 0.0
79 2021-03-21 15.0 0.0 0.0 0.0
80 2021-03-22 15.0 0.0 0.0 0.0
81 2021-03-23 15.0 0.0 0.0 0.0
82 2021-03-24 15.0 0.0 0.0 0.0
83 2021-03-25 15.0 0.0 0.0 0.0
84 2021-03-26 15.0 0.0 0.0 0.0
85 2021-03-27 15.0 0.0 0.0 0.0
86 2021-03-28 15.0 0.0 0.0 0.0
87 2021-03-29 15.0 0.0 0.0 0.0
88 2021-03-30 15.0 0.0 0.0 0.0
89 2021-03-31 15.0 0.0 0.0 0.0
90 2021-04-01 20.0 100.0 5.0 100.0
91 2021-04-02 20.0 0.0 0.0 0.0
92 2021-04-03 20.0 0.0 0.0 0.0
93 2021-04-04 20.0 0.0 0.0 0.0
94 2021-04-05 20.0 0.0 0.0 0.0
95 2021-04-06 20.0 0.0 0.0 0.0
96 2021-04-07 20.0 0.0 0.0 0.0
97 2021-04-08 20.0 0.0 0.0 0.0
98 2021-04-09 20.0 0.0 0.0 0.0
99 2021-04-10 20.0 0.0 0.0 0.0
100 2021-04-11 20.0 0.0 0.0 0.0
101 2021-04-12 20.0 0.0 0.0 0.0
102 2021-04-13 20.0 0.0 0.0 0.0
103 2021-04-14 20.0 0.0 0.0 0.0
104 2021-04-15 20.0 0.0 0.0 0.0
105 2021-04-16 20.0 0.0 0.0 0.0
106 2021-04-17 20.0 0.0 0.0 0.0
107 2021-04-18 20.0 0.0 0.0 0.0
108 2021-04-19 20.0 0.0 0.0 0.0
109 2021-04-20 20.0 0.0 0.0 0.0
110 2021-04-21 20.0 0.0 0.0 0.0
111 2021-04-22 20.0 0.0 0.0 0.0
112 2021-04-23 20.0 0.0 0.0 0.0
113 2021-04-24 20.0 0.0 0.0 0.0
114 2021-04-25 20.0 0.0 0.0 0.0
115 2021-04-26 20.0 0.0 0.0 0.0
116 2021-04-27 20.0 0.0 0.0 0.0
117 2021-04-28 20.0 0.0 0.0 0.0
118 2021-04-29 20.0 0.0 0.0 0.0
119 2021-04-30 20.0 0.0 0.0 0.0
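To see what fn is doing in isolation: it is a generator used as a coroutine that remembers the leftover change between calls. A minimal sketch reusing the fn defined above (the numbers match the January and February rows of the output):
g = fn()
next(g)                       # prime the coroutine up to its first yield
print(g.send((5.5, 100.0)))   # 18.0 -> 100 // 5.5; leftover 1.0 is kept
print(g.send((10.5, 100.0)))  # 9.0  -> (100 + 1.0) // 10.5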

OneHotEncoder stripping headers

I am trying to build an ML model on the Titanic dataset, and while preparing it I used OneHotEncoder to create Embarked dummies; in the process I lost my column headers.
Here is how the dataset looked before.
Pclass Sex Age SibSp Parch Fare Cabin Embarked
0 3 1 22.000000 1 0 7.2500 146 2
1 1 0 38.000000 1 0 71.2833 81 0
2 3 0 26.000000 0 0 7.9250 146 2
3 1 0 35.000000 1 0 53.1000 55 2
4 3 1 35.000000 0 0 8.0500 146 2
... ... ... ... ... ... ... ... ...
886 2 1 27.000000 0 0 13.0000 146 2
887 1 0 19.000000 0 0 30.0000 30 2
888 3 0 29.699118 1 2 23.4500 146 2
889 1 1 26.000000 0 0 30.0000 60 0
890 3 1 32.000000 0 0 7.7500 146 1
Here is the code.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X))
X
Here is how the dataset is looking now.
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 22.000000 1.0 7.2500 146.0
1 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 38.000000 1.0 71.2833 81.0
2 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 26.000000 0.0 7.9250 146.0
3 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 35.000000 1.0 53.1000 55.0
4 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 1.0 35.000000 0.0 8.0500 146.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 1.0 27.000000 0.0 13.0000 146.0
887 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 19.000000 0.0 30.0000 30.0
888 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 3.0 0.0 29.699118 1.0 23.4500 146.0
889 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 26.000000 0.0 30.0000 60.0
890 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 3.0 1.0 32.000000 0.0 7.7500 146.0
You can use the get_feature_names method of ColumnTransformer, provided all your transformers support that method and you've trained on a dataframe.
ct = ColumnTransformer([('encoder', OneHotEncoder(), [7])], remainder='passthrough')
X = pd.DataFrame(ct.fit_transform(X), columns=ct.get_feature_names())
X
The output of fit_transform is array-like; per the docs it returns
X_t : {array-like, sparse matrix} of shape (n_samples, sum_n_components)
(not DataFrame-like), thus no headers. If you want headers, you'll have to name them when rebuilding the DataFrame.
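As a side note (newer scikit-learn versions, not the API at the time of this answer): get_feature_names has been replaced by get_feature_names_out, and from scikit-learn 1.2 you can ask for pandas output directly, roughly:
ct = ColumnTransformer(
    [('encoder', OneHotEncoder(sparse_output=False), [7])],
    remainder='passthrough')
ct.set_output(transform="pandas")  # requires scikit-learn >= 1.2
X = ct.fit_transform(X)            # DataFrame with generated column names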

Separating strings from numerical data in a .txt file in Python [duplicate]

I have a .txt file that looks like this:
08/19/93 UW ARCHIVE 100.0 1962 W IEEE 14 Bus Test Case
BUS DATA FOLLOWS 14 ITEMS
1 Bus 1 HV 1 1 3 1.060 0.0 0.0 0.0 232.4 -16.9 0.0 1.060 0.0 0.0 0.0 0.0 0
2 Bus 2 HV 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045 50.0 -40.0 0.0 0.0 0
3 Bus 3 HV 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010 40.0 0.0 0.0 0.0 0
4 Bus 4 HV 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
5 Bus 5 HV 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
6 Bus 6 LV 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070 24.0 -6.0 0.0 0.0 0
7 Bus 7 ZV 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
8 Bus 8 TV 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090 24.0 -6.0 0.0 0.0 0
9 Bus 9 LV 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.19 0
10 Bus 10 LV 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
11 Bus 11 LV 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
12 Bus 12 LV 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
13 Bus 13 LV 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
14 Bus 14 LV 1 1 0 1.036 -16.04 14.9 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
I need to remove the characters from this file and keep only the numerical data in matrix form. I am relatively new to Python, so any kind of help will be really appreciated. Thank you.
I would suggest reading the data into a Pandas DataFrame and then deleting the text columns, or creating a second frame without them.
Try:
data = pd.read_csv('output_list.txt', sep='\s+', header=None)
data.columns = ["a", "b", "c", "etc."]
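For the "delete the text columns" step, one possibility (a sketch, assuming data was read as above with header=None) is to keep only the numeric columns:
# 'Bus', 'HV' and the date parse as object dtype; numbers as int/float
numeric = data.select_dtypes(include='number')
matrix = numeric.to_numpy()  # plain numerical matrix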
As it is simple to do this in pandas if the data is well-formed, here is my take:
from io import StringIO
import pandas as pd
data = '''\
08/19/93 UW ARCHIVE 100.0 1962 W IEEE 14 Bus Test Case
BUS DATA FOLLOWS 14 ITEMS
1 Bus 1 HV 1 1 3 1.060 0.0 0.0 0.0 232.4 -16.9 0.0 1.060 0.0 0.0 0.0 0.0 0
2 Bus 2 HV 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045 50.0 -40.0 0.0 0.0 0
3 Bus 3 HV 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010 40.0 0.0 0.0 0.0 0
4 Bus 4 HV 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
5 Bus 5 HV 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
6 Bus 6 LV 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070 24.0 -6.0 0.0 0.0 0
7 Bus 7 ZV 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
8 Bus 8 TV 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090 24.0 -6.0 0.0 0.0 0
9 Bus 9 LV 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.19 0
10 Bus 10 LV 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
11 Bus 11 LV 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
12 Bus 12 LV 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
13 Bus 13 LV 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
'''
fileobj = StringIO(data)  # pd.compat.StringIO was removed in newer pandas; use io.StringIO
# when reading from the actual file, pass the file path in place of fileobj
df = pd.read_csv(fileobj, sep='\s+', header=None, skiprows=2)
df = df.loc[:,df.dtypes != 'object']
print(df)
Returns:
0 2 4 5 6 7 8 9 10 11 12 13 14 \
0 1 1 1 1 3 1.060 0.00 0.0 0.0 232.4 -16.9 0.0 1.060
1 2 2 1 1 2 1.045 -4.98 21.7 12.7 40.0 42.4 0.0 1.045
2 3 3 1 1 2 1.010 -12.72 94.2 19.0 0.0 23.4 0.0 1.010
3 4 4 1 1 0 1.019 -10.33 47.8 -3.9 0.0 0.0 0.0 0.000
4 5 5 1 1 0 1.020 -8.78 7.6 1.6 0.0 0.0 0.0 0.000
5 6 6 1 1 2 1.070 -14.22 11.2 7.5 0.0 12.2 0.0 1.070
6 7 7 1 1 0 1.062 -13.37 0.0 0.0 0.0 0.0 0.0 0.000
7 8 8 1 1 2 1.090 -13.36 0.0 0.0 0.0 17.4 0.0 1.090
8 9 9 1 1 0 1.056 -14.94 29.5 16.6 0.0 0.0 0.0 0.000
9 10 10 1 1 0 1.051 -15.10 9.0 5.8 0.0 0.0 0.0 0.000
10 11 11 1 1 0 1.057 -14.79 3.5 1.8 0.0 0.0 0.0 0.000
11 12 12 1 1 0 1.055 -15.07 6.1 1.6 0.0 0.0 0.0 0.000
12 13 13 1 1 0 1.050 -15.16 13.5 5.8 0.0 0.0 0.0 0.000
15 16 17 18 19
0 0.0 0.0 0.0 0.00 0
1 50.0 -40.0 0.0 0.00 0
2 40.0 0.0 0.0 0.00 0
3 0.0 0.0 0.0 0.00 0
4 0.0 0.0 0.0 0.00 0
5 24.0 -6.0 0.0 0.00 0
6 0.0 0.0 0.0 0.00 0
7 24.0 -6.0 0.0 0.00 0
8 0.0 0.0 0.0 0.19 0
9 0.0 0.0 0.0 0.00 0
10 0.0 0.0 0.0 0.00 0
11 0.0 0.0 0.0 0.00 0
12 0.0 0.0 0.0 0.00 0
