I am playing with time series data, and I want to train a model to predict future outcomes. I have some data shaped like the following:
Date Failures
0 2021-06 10
1 2021-05 2
2 2021-04 7
3 2021-03 9
4 2021-02 3
...
I would like to shape this data (not necessarily as a pandas df) into rolling windows with four entries:
10 2 7 9 3
...
and then the fifth entry being the number I want to predict. I have read on Stack Exchange that one should avoid iterating over a pandas DataFrame, so what would be the appropriate way to transform my dataframe? I have heard of the .rolling method; however, it does not seem to achieve what I want.
IIUC, you want to reshape your column to shape (49, 5), given that it has an initial length of 245. You can use the underlying numpy array:
df['Failures'].values.reshape(-1,5)
Output (dummy numbers):
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[ 10, 11, 12, 13, 14],
[ 15, 16, 17, 18, 19],
...
[235, 236, 237, 238, 239],
[240, 241, 242, 243, 244]])
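Note that this reshape produces non-overlapping blocks of five, while the question's wording suggests overlapping rolling windows. If that is what you need, one option (a minimal sketch, assuming NumPy >= 1.20 for sliding_window_view) is:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

values = df['Failures'].to_numpy()

# every consecutive length-5 window; shape (len(values) - 4, 5)
windows = sliding_window_view(values, 5)

# first four columns as features, fifth as the value to predict
X, y = windows[:, :4], windows[:, 4]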
I have an input df:
input_ = pd.DataFrame.from_records(
    [
        [1, 10, 11, 31],
        [2, 20, 12, 21],
        [3, 30, 13, 11],
    ],
    columns=['X_val', 'Y_val1', 'Y_val2', 'Y_val3'])
and want to concatenate every Y value while still keeping track of which column each value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns and ended up concatenating them column by column and extending with a multiplied value, but I was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
    [
        [1, 10, 'Y_val1'],
        [1, 11, 'Y_val2'],
        [1, 31, 'Y_val3'],
        [2, 20, 'Y_val1'],
        [2, 12, 'Y_val2'],
        [2, 21, 'Y_val3'],
        [3, 30, 'Y_val1'],
        [3, 13, 'Y_val2'],
        [3, 11, 'Y_val3'],
    ],
    columns=['X_val', 'Y_val', 'Y_type'])
You can use pandas.DataFrame.melt:
input_.melt(
id_vars=['X_val'],
value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can also use the following variation, especially for a large number of similarly named Y_val* columns:
input_.melt(
id_vars=['X_val'],
value_vars=input_.filter(regex='Y_val').columns,
var_name='Y_type',
value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
X_val Y_type Y_val
0 1 Y_val1 10
1 1 Y_val2 11
2 1 Y_val3 31
3 2 Y_val1 20
4 2 Y_val2 12
5 2 Y_val3 21
6 3 Y_val1 30
7 3 Y_val2 13
8 3 Y_val3 11
Optionally, you can rearrange the column sequence if you like.
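For example, assuming the melted result is stored in a variable (the name out is just illustrative):

out = input_.melt(
    id_vars=['X_val'],
    value_vars=input_.filter(regex='Y_val').columns,
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)

# match the column order of the expected output
out = out[['X_val', 'Y_val', 'Y_type']]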
I have code in Matlab which I need to translate into Python. One point is that shapes and indices really matter here, since the code works with tensors. I am a little confused: it seems that it should be enough to use order='F' in Python's reshape(), but when I work with 3D data I noticed that it does not work. For example, if A is an array from 1 to 27 in Python:
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
If I perform A.reshape(3, 9, order='F'), I get:
[[ 1 4 7 2 5 8 3 6 9]
[10 13 16 11 14 17 12 15 18]
[19 22 25 20 23 26 21 24 27]]
In Matlab, for A = 1:27 reshaped to [3, 3, 3] and then to [3, 9], I seem to get a different array:
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27
SVD in Matlab and Python also gives different results. Is there a way to fix this?
More generally, what is the correct way of porting multidimensional-array code from Matlab to Python? For instance, should arange(1, 13).reshape(3, 4) in Python give the same SVD as reshape(1:12, [3, 4]) in Matlab? Can I somehow swap axes in Python to get the same results as in Matlab, or change the order of the axes in reshape(x1, x2, x3, ...)?
I was having the same issues until I found this Wikipedia article: row- and column-major order.
Python (and C) organizes data arrays in row-major order. As you can see in your first example, the elements first increase along the columns:
array([[[ 1, 2, 3],
- - - -> increasing
Then along the rows:
array([[[ 1, 2, 3],
[ 4, <--- new element
When all columns and rows are full, it moves on to the next page:
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, <-- new element in next page
Matlab (like Fortran) increases the rows first, then the columns, and so on.
For N-dimensional arrays it looks like:
Python (row major -> last dimension is contiguous): [dim1, dim2, ..., dimN]
Matlab (column major -> first dimension is contiguous): the same tensor in memory looks the other way around: [dimN, ..., dim2, dim1]
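A quick way to see the two orders side by side is to flatten the same array in both conventions (a small illustration):

import numpy as np

A = np.arange(1, 28).reshape(3, 3, 3)

print(A.ravel(order='C')[:6])  # row-major: [1 2 3 4 5 6]
print(A.ravel(order='F')[:6])  # column-major: [ 1 10 19  4 13 22]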
If you want to export n-dimensional arrays from Python to Matlab, the easiest way is to permute the dimensions first:
(in python)
import numpy as np
import scipy.io as sio
A=np.reshape(range(1,28),[3,3,3])
sio.savemat('A',{'A':A})
(in matlab)
load('A.mat')
A=permute(A,[3 2 1]);%dimensions in reverse ordering
reshape(A,9,3)' %gives the same result as A.reshape([3,9]) in python
Just notice that the (9,3) and the (3,9) are intentionally put in reverse order.
In Matlab
A = 1:27;
A = reshape(A,3,3,3);
B = reshape(A,9,3)'
B =
1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27
size(B)
ans =
3 9
In Python
A = np.array(range(1,28))
A = A.reshape(3,3,3)
B = A.reshape(3,9)
B
array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24, 25, 26, 27]])
np.shape(B)
(3, 9)
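If the goal is to reproduce the Matlab result inside Python rather than export, one option (a sketch, not the only way) is to use order='F' consistently, both when building the array and when reshaping:

import numpy as np

# build the array exactly as Matlab's reshape(1:27, [3, 3, 3]) does
A = np.reshape(np.arange(1, 28), (3, 3, 3), order='F')

# equivalent of Matlab's reshape(A, [3, 9])
B = np.reshape(A, (3, 9), order='F')
print(B)
# [[ 1  4  7 10 13 16 19 22 25]
#  [ 2  5  8 11 14 17 20 23 26]
#  [ 3  6  9 12 15 18 21 24 27]]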
After doing a few data manipulations I got two lists, avglist and sumlist,
and I passed these two lists to my result_df:
result_df = pd.DataFrame({"File Name": filelist, "Average": avglist, "Sum": sumlist})
print(result_df)
Below is my output, but the problem here is:
1) even my header Continental AG and the datatype info are included;
I just need my values, "874" and 584.
- I tried avglist.value(), but .value is not a list function;
I also tried a few variations with .index but did not get the expected result.
Am I missing any steps here?
There is something wrong with how you're importing your files. If you take a .sum() of your DataFrame, it will give you back the sum of each column. I suspect you may be doing that, since you are summing a DataFrame; then, when you try to put the list into another DataFrame, it looks funky.
Let's take the following two DataFrames:
df = pd.DataFrame({'a':[1, 20, 30, 4, 0],
'b':[1, 0, 3, 4, 0],
'c':[1, 3, 7, 7, 5],
'd':[1, 8, 3, 8, 5],
'e':[1, 11, 3, 4, 0]})
df2 = pd.DataFrame({'a':[1, 20, 100, 4, 0],
'b':[1, 0, 39, 49, 10],
'c':[1, 3, 97, 7, 95],
'd':[441, 38, 23, 8, 115],
'e':[1, 11, 13, 114, 0]})
Looking at the sum of one of these DataFrames:
df.sum()
a 55
b 8
c 23
d 25
e 19
dtype: int64
Now if we were to take the sums of the DataFrames and put them in a list:
sums = [x.sum() for x in [df, df2]]
When we inspect this we get:
[a 55
b 8
c 23
d 25
e 19
dtype: int64, a 125
b 99
c 203
d 625
e 139
dtype: int64]
If you want the sum of the whole DataFrame and not just per column, you can use .sum().sum(), which sums each column first and then sums those column totals:
df.sum().sum()
130
So across the DataFrames it would be:
sums = [x.sum().sum() for x in [df, df2]]
Computing the mean depends on how your CSVs look. If you were to do .mean().mean(), that might be very different from what you're looking for. If it's just one column every time, it would be fine; but if there were more, say five columns, it would take the mean of the five columns and then the mean of those (the five averages summed and divided by five).
Lastly, it looks like "Continental AG (Worldwide)" is the name of your column.
So in your loop you should be doing:
sums = [df['Continental AG (Worldwide)'].sum() for df in list_dfs]
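Putting it together, a minimal sketch of the whole per-file computation (list_dfs and the column name come from the question; filelist is assumed to hold the matching file names):

col = 'Continental AG (Worldwide)'

sums = [df[col].sum() for df in list_dfs]
avgs = [df[col].mean() for df in list_dfs]

result_df = pd.DataFrame({'File Name': filelist, 'Average': avgs, 'Sum': sums})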
I performed a few operations, something like below:
while i < len(filepath):
.....
df['Date']=df['Time'].apply(lambda i:i.split('T')[0])
.......
.......
sum1=sum_df.sum(axis=0)
avg1=Avg_df.sum(axis=0)
.......
.......
avglist.append(avg1)
sumlist.append(sum1)
.....
i+=1
So I changed all my operations to the below:
df['Date']=df.iloc[:,0].apply(lambda i:i.split('T')[0])
.........
.........
sum1=sum_df.iloc[:,0].sum()
avg1=Avg_df.iloc[:,0].mean()
.....
.....
avglist.append(avg1)
sumlist.append(sum1)
Instead of using the column name and an axis in my operations,
I switched everything to DataFrame.iloc, and it started giving me the correct results.
I'm still not sure of the precise reason, but these changes worked for me.
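The likely reason: on a DataFrame, .sum(axis=0) returns a Series of per-column sums (which drags the header and dtype along when appended to a list), while .iloc[:, 0].sum() on a single column returns a plain scalar. A small demonstration (the column name and numbers are illustrative, based on the question):

import pandas as pd

df = pd.DataFrame({'Continental AG (Worldwide)': [400, 474]})

print(df.sum(axis=0))        # a Series: header and dtype come along
print(df.iloc[:, 0].sum())   # a plain scalar: 874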
I want to generate "category intervals" from categories.
For example, suppose I have the following:
>>> df['start'].describe()
count 259431.000000
mean 10.435858
std 5.504730
min 0.000000
25% 6.000000
50% 11.000000
75% 15.000000
max 20.000000
Name: start, dtype: float64
and the unique values of my column are:
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20], dtype=int8)
but I want to use the following list of intervals:
>>> intervals
[[0, 2.2222222222222223],
[2.2222222222222223, 4.4444444444444446],
[4.4444444444444446, 6.666666666666667],
[6.666666666666667, 8.8888888888888893],
[8.8888888888888893, 11.111111111111111],
[11.111111111111111, 13.333333333333332],
[13.333333333333332, 15.555555555555554],
[15.555555555555554, 17.777777777777775],
[17.777777777777775, 20]]
to change my column 'start' into values x, where x represents the index of the interval that contains df['start'] (so x in my case will vary from 0 to 8).
Is there a more or less simple way to do it using pandas/numpy?
Thanks in advance for the help.
Regards.
You can use np.digitize:
import numpy as np
import pandas as pd
df = pd.DataFrame(dict(start=np.random.randint(0, 21, 10000)))  # randint's upper bound is exclusive
# the left-hand edges of each "interval"
intervals = np.linspace(0, 20, 9, endpoint=False)
print(intervals)
# [ 0. 2.22222222 4.44444444 6.66666667 8.88888889
# 11.11111111 13.33333333 15.55555556 17.77777778]
df['start_idx'] = np.digitize(df['start'], intervals) - 1
print(df.head())
# start start_idx
# 0 8 3
# 1 16 7
# 2 0 0
# 3 7 3
# 4 0 0
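Alternatively, continuing from the df above, a pandas-only route (a sketch using pd.cut with explicit bin edges; labels=False returns the interval index directly):

# 10 edges -> 9 intervals, matching the list in the question
edges = np.linspace(0, 20, 10)
df['start_idx'] = pd.cut(df['start'], bins=edges, labels=False, include_lowest=True)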
I need to subtract a number from my numpy arrays.
Let's say, we have two arrays and I need to subtract 10 from each of its elements.
a = numpy.array([10, 11, 23, 45])
b = numpy.array([55, 23, 54, 489, 45, 12])
To do that, I enter:
a - 10
b - 10
And I get the desired output, which is:
array([ 0, 1, 13, 35])
array([ 45, 13, 44, 479, 35, 2])
But, as I have lots of such arrays, I was wondering if it is possible to get the same result, for example by entering (a,b)-10?
numpy.array([a, b], dtype=object) - 10 will work.
If you enter:
numpy.array((a, b), dtype=object) - 10
you get the desired result:
array([array([ 0,  1, 13, 35]), array([ 45,  13,  44, 479,  35,   2])], dtype=object)
(Because a and b have different lengths, recent NumPy versions require the explicit dtype=object to build such a ragged array.)
(a, b) - 10 doesn't work because a plain tuple does not support element-by-element arithmetic; that behaviour only comes with numpy arrays. So the solution, as above, is to put a and b into one numpy array.
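If you would rather keep a and b as ordinary integer arrays instead of going through an object-dtype container, a plain generator expression works just as well (a minimal sketch):

import numpy as np

a = np.array([10, 11, 23, 45])
b = np.array([55, 23, 54, 489, 45, 12])

# subtract 10 from each array, keeping their regular integer dtype
a, b = (arr - 10 for arr in (a, b))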