Plot moving average with data [duplicate] - python

I am trying to calculate and plot moving average along with the data it is calculated from:
def movingAvg(df):
    window_size = 7
    i = 0
    moving_averages = []
    while i < len(df) - window_size + 1:
        current_window = df[i : i + window_size]
        window_average = current_window.mean()
        moving_averages.append(window_average)
        i += 1
    return moving_averages
dates = df_valid['dateTime']
startDay = dates.iloc[0]
lastDay = dates.iloc[-1]
fig, ax = plt.subplots(figsize=(20, 10))
ax.autoscale()
#plt.xlim(startDay, lastDay)
df_valid.sedentaryActivityMins.reset_index(drop=True, inplace=True)
df_moving = pd.DataFrame(movingAvg(df_valid['sedentaryActivityMins']))
df_nan = [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan]
df_nan = pd.DataFrame(df_nan)
df_moving = pd.concat([df_nan, df_moving])
plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
#plt.show()
But as the moving average uses 7 windows, the list of moving averages is 7 items short, and therefore the plots do not follow each other correctly.
I tried putting 7 "NaN" into the moving average list, but those are ignored when I plot.
The plot is as follows:
But I would like the orange line to start 7 steps ahead.
So it looks like this:
df_valid.sedentaryActivityMins.head(40)
0 608
1 494
2 579
3 586
4 404
5 750
6 573
7 466
8 389
9 604
10 351
11 553
12 768
13 572
14 616
15 522
16 675
17 607
18 229
19 529
20 746
21 646
22 625
23 590
24 572
25 462
26 708
27 662
28 649
29 626
30 485
31 509
32 561
33 664
34 517
35 587
36 602
37 601
38 495
39 352
Name: sedentaryActivityMins, dtype: int64
Any ideas as to how?
Thanks in advance!

pd.concat keeps each input's original index, so the NaN rows end up with the same index labels (0 to 6) as the first 7 moving averages instead of shifting them along. Either reset the index after the concat or pass ignore_index=True, as follows:
df_moving = pd.concat([df_nan, df_moving],ignore_index=True)
plt.plot(df_valid.sedentaryActivityMins)
plt.plot(df_moving)
This gives the output as expected:
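The same alignment can also be obtained from pandas' built-in rolling window, which keeps the original index and fills the leading positions with NaN. The following is a minimal sketch, assuming the first ten values of the question's sample as stand-in data; the names s, rolled, manual, and padded are illustrative only. It also shows that a window of 7 yields len - 6 averages, so the manual version needs window_size - 1 = 6 NaN values of padding.

```python
import numpy as np
import pandas as pd

# First ten values from the question's sample output (stand-in data).
s = pd.Series([608, 494, 579, 586, 404, 750, 573, 466, 389, 604],
              name="sedentaryActivityMins")

window_size = 7

# rolling() keeps the index of s; the first window_size - 1 entries are NaN.
rolled = s.rolling(window=window_size).mean()

# Manual equivalent: compute the averages, pad with NaN, and renumber the
# index with ignore_index=True so the averages line up with the data.
manual = pd.Series([s.iloc[i:i + window_size].mean()
                    for i in range(len(s) - window_size + 1)])
padded = pd.concat([pd.Series([np.nan] * (window_size - 1)), manual],
                   ignore_index=True)
```

Plotting s and rolled on the same axes should then place the average under the last point of each window, since both share one index.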

Related

text file rows into CSV column python

I have a text file containing data like this:
A 34 45 7789 3475768 443 67 8999 3343 656 8876 802 383358 873 36789 2374859 485994 86960 32838459 3484549 24549 58423
T 3445 574649 68078 59348604 45959 64585304 56568 595 49686 656564 55446 665 677 778 433 545 333 65665 3535
and so on.
I want to make a CSV file from this text file, with A and T as column headings and the numbers listed beneath them, like this:
A T
34 3445
45 574649
7789 68078
3475768 59348604
443 45959
EDIT (A lot simpler solution inspired by Michael Butscher's comment):
import pandas as pd
df = pd.read_csv("filename.txt", delimiter=" ")
df.T.to_csv("filename.csv", header=False)
Here is the code:
import pandas as pd
# Read file
with open("filename.txt", "r") as f:
    data = f.read()
# Split data by lines and remove empty lines
columns = data.split("\n")
columns = [x.split() for x in columns if x!=""]
# Row sizes are different in your example so find max number of rows
column_lengths = [len(x) for x in columns]
max_col_length = max(column_lengths)
data = {}
for i in columns:
    # Add None to end for columns that have less values
    if len(i)<max_col_length:
        i += [None]*(max_col_length-len(i))
    data[i[0]] = i[1:]
# Create dataframe
df = pd.DataFrame(data)
# Create csv
df.to_csv("filename.csv", index=False)
Output should look like this:
A T
0 34 3445
1 45 574649
2 7789 68078
3 3475768 59348604
4 443 45959
5 67 64585304
6 8999 56568
7 3343 595
8 656 49686
9 8876 656564
10 802 55446
11 383358 665
12 873 677
13 36789 778
14 2374859 433
15 485994 545
16 86960 333
17 32838459 65665
18 3484549 3535
19 24549 None
20 58423 None
Here is my code:
import pandas as pd
data = pd.read_csv("text (3).txt", header = None)
Our_Data = pd.DataFrame(data)
for rows in Our_Data:
    New_Data=pd.DataFrame(Our_Data[rows].str.split(' ').tolist()).T
New_Data.columns = New_Data.iloc[0]
New_Data = New_Data[1:]
New_Data.to_csv("filename.csv", index=False)
The output:
A T
1 34 3445
2 45 574649
3 7789 68078
4 3475768 59348604
5 443 45959
6 67 64585304
7 8999 56568
8 3343 595
9 656 49686
10 8876 656564
11 802 55446
12 383358 665
13 873 677
14 36789 778
15 2374859 433
16 485994 545
17 86960 333
18 32838459 65665
19 3484549 3535
20 24549 None
21 58423 None
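For comparison, the same transposition can be done without pandas. This is a minimal sketch using only the standard library, assuming a shortened stand-in for the input text and writing to an in-memory buffer rather than a real file; the names text, rows, and csv_text are illustrative. itertools.zip_longest turns rows into columns and pads the shorter row with empty strings.

```python
import csv
import io
from itertools import zip_longest

# Shortened stand-in for the text file in the question (rows of unequal length).
text = "A 34 45 7789\nT 3445 574649\n"

# One list per non-empty input row; the first token of each row is its heading.
rows = [line.split() for line in text.splitlines() if line.strip()]

# zip_longest transposes rows into columns, padding short rows with "".
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(zip_longest(*rows, fillvalue=""))
csv_text = out.getvalue()
```

To write an actual file, the StringIO buffer could be swapped for open("filename.csv", "w", newline="").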

Pandas DataFrame: Creating 3D Surface Plots

I am trying to draw a 3D surface plot using 3 columns (3 features) in a data frame:
age size_tc Survival_days
0 60.463 43185.0 289
1 52.263 15709.0 616
2 54.301 3731.0 464
3 39.068 26400.0 788
4 68.493 14410.0 465
5 67.126 44774.0 269
6 69.912 9557.0 503
7 56.419 76260.0 1155
8 48.367 6994.0 515
9 65.899 8280.0 495
10 59.693 14535.0 698
11 51.734 27568.0 359
12 62.614 17677.0 169
13 55.759 41082.0 368
14 58.258 14713.0 439
15 61.605 2036.0 486
16 68.049 20547.0 287
17 56.921 5669.0 576
18 44.162 30637.0 350
19 67.833 17526.0 332
20 46.666 28472.0 331
21 76.367 15027.0 106
22 67.860 24355.0 473
23 46.452 44985.0 1283
24 71.370 5751.0 89
25 75.978 24963.0 172
26 53.362 19018.0 84
27 75.312 40795.0 726
28 46.570 3461.0 660
29 77.337 7635.0 522
My code is as follows:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import random
from matplotlib import cm
fig = plt.figure(figsize=(10,8))
ax = fig.add_subplot(111, projection='3d')
x = df['age']
y = df['size_tc']
z = df['Survival_days']
surf = ax.plot_trisurf(x, y, z, cmap= cm.coolwarm, linewidth=0.2)
The plot looks like this:
Why is the plot not smooth, and how can I generate a plot like this?

How can I loop through pandas groupby and manipulate data?

I am trying to work out the time delta between values in a grouped pandas df.
My df looks like this:
Location ID Item Qty Time
0 7 202545942 100130 1 07:19:46
1 8 202545943 100130 1 07:20:08
2 11 202545950 100130 1 07:20:31
3 13 202545955 100130 1 07:21:08
4 15 202545958 100130 1 07:21:18
5 18 202545963 100130 3 07:21:53
6 217 202546320 100130 1 07:22:43
7 219 202546324 100130 1 07:22:54
8 229 202546351 100130 1 07:23:32
9 246 202546376 100130 1 07:24:09
10 273 202546438 100130 1 07:24:37
11 286 202546464 100130 1 07:24:59
12 296 202546490 100130 1 07:25:16
13 297 202546491 100130 1 07:25:24
14 310 202546516 100130 1 07:25:59
15 321 202546538 100130 1 07:26:17
16 329 202546549 100130 1 07:28:09
17 388 202546669 100130 1 07:29:02
18 420 202546717 100130 2 07:30:01
19 451 202546766 100130 1 07:30:19
20 456 202546773 100130 1 07:30:27
(...)
42688 458 202546777 999969 1 06:51:16
42689 509 202546884 999969 1 06:53:09
42690 567 202546977 999969 1 06:54:21
42691 656 202547104 999969 1 06:57:27
I have grouped this using the following method:
ndf = df.groupby(['ID','Location','Time'])
If I add .size() to the end of the above and print(ndf) I get the following output:
(...)
ID Location Time
995812 696 07:10:36 1
730 07:11:41 1
761 07:12:30 1
771 07:20:49 1
995820 381 06:55:07 1
761 07:12:44 1
(...)
This is as desired.
My challenge is that I need to work out the time delta between each time per Item and add this as a column in the dataframe grouping. It should give me the following:
ID Location Time Delta
(...)
995812 696 07:10:36 0
730 07:11:41 00:01:05
761 07:12:30 00:00:49
771 07:20:49 00:08:19
995820 381 06:55:07 0
761 07:12:44 00:17:37
(...)
I am pulling my hair out trying to work out a method of doing this, so I'm turning to the greats.
Please help. Thanks in advance.
Convert the Time column to timedeltas with to_timedelta, sort by all three columns with DataFrame.sort_values, take the difference within each group with DataFrameGroupBy.diff, and replace the missing values with a zero timedelta using Series.fillna:
# astype(str) can be omitted if Time already holds strings
df['Time'] = pd.to_timedelta(df['Time'].astype(str))
df = df.sort_values(['ID','Location','Time'])
df['Delta'] = df.groupby('ID')['Time'].diff().fillna(pd.Timedelta(0))
It is also possible to convert the timedeltas to seconds by adding Series.dt.total_seconds:
df['Delta_sec'] = df.groupby('ID')['Time'].diff().dt.total_seconds().fillna(0)
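As a quick check, here is a minimal, self-contained sketch of the steps above, assuming a small frame rebuilt from the six rows of the desired output in the question (the full dataset is not available, so these rows are a stand-in).

```python
import pandas as pd

# Six rows taken from the desired output shown in the question.
df = pd.DataFrame({
    "ID": [995812, 995812, 995812, 995812, 995820, 995820],
    "Location": [696, 730, 761, 771, 381, 761],
    "Time": ["07:10:36", "07:11:41", "07:12:30",
             "07:20:49", "06:55:07", "07:12:44"],
})

# Parse the HH:MM:SS strings into timedeltas so they can be subtracted.
df["Time"] = pd.to_timedelta(df["Time"])
df = df.sort_values(["ID", "Location", "Time"])

# diff() restarts for every ID; the first row of each group becomes 0.
df["Delta"] = df.groupby("ID")["Time"].diff().fillna(pd.Timedelta(0))
df["Delta_sec"] = df["Delta"].dt.total_seconds()
```

On these rows the deltas come out as 0, 00:01:05, 00:00:49, 00:08:19, 0, and 00:17:37, matching the table in the question.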
If you just want to iterate over the groupby object, as in your original question title, you can do it like this:
for (x, y) in df.groupby(['ID','Location','Time']):
    print("{0}, {1}".format(x, y))
    # your logic
However, while this is fine for 10,000 or 100,000 rows, it becomes slow for 10^6 rows or more.

Python pdist: Setting an array element with a sequence

I have written the following code
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': [x[0]],'Y':[x[1]],'Z':[x[2]]})
coord_table = pd.DataFrame(arr_coord)
print(coord_table)
To generate the following dataframe
X Y Z
0 [-5.43] [28.077] [-0.842]
1 [-3.183] [26.472] [1.741]
2 [-2.574] [22.752] [1.69]
3 [-1.743] [21.321] [5.121]
4 [0.413] [18.212] [5.392]
5 [0.714] [15.803] [8.332]
6 [4.078] [15.689] [10.138]
7 [5.192] [12.2] [9.065]
8 [4.088] [12.79] [5.475]
9 [5.875] [16.117] [4.945]
10 [8.514] [15.909] [2.22]
11 [12.235] [15.85] [2.943]
12 [13.079] [16.427] [-0.719]
13 [10.832] [19.066] [-2.324]
14 [12.327] [22.569] [-2.163]
15 [8.976] [24.342] [-1.742]
16 [7.689] [25.565] [1.689]
17 [5.174] [23.336] [3.467]
18 [2.339] [24.135] [5.889]
19 [0.9] [22.203] [8.827]
20 [-1.217] [22.065] [11.975]
21 [0.334] [20.465] [15.09]
22 [0.0] [20.066] [18.885]
23 [2.738] [21.762] [20.915]
24 [4.087] [19.615] [23.742]
25 [7.186] [21.618] [24.704]
26 [8.867] [24.914] [23.91]
27 [11.679] [27.173] [24.946]
28 [10.76] [30.763] [25.731]
29 [11.517] [33.056] [22.764]
.. ... ... ...
431 [8.093] [34.654] [68.474]
432 [7.171] [32.741] [65.298]
433 [5.088] [35.626] [63.932]
434 [7.859] [38.22] [64.329]
435 [10.623] [35.908] [63.1]
436 [12.253] [36.776] [59.767]
437 [10.65] [35.048] [56.795]
438 [7.459] [34.084] [58.628]
439 [4.399] [35.164] [56.713]
440 [0.694] [35.273] [57.347]
441 [-1.906] [34.388] [54.667]
442 [-5.139] [35.863] [55.987]
443 [-8.663] [36.808] [55.097]
444 [-9.629] [40.233] [56.493]
445 [-12.886] [42.15] [56.888]
446 [-12.969] [45.937] [56.576]
447 [-14.759] [47.638] [59.485]
448 [-14.836] [51.367] [60.099]
449 [-11.607] [51.863] [58.176]
450 [-9.836] [48.934] [59.829]
451 [-8.95] [45.445] [58.689]
452 [-9.824] [42.599] [61.073]
453 [-8.559] [39.047] [60.598]
454 [-11.201] [36.341] [60.195]
455 [-11.561] [32.71] [59.077]
456 [-7.786] [32.216] [59.387]
457 [-5.785] [29.886] [61.675]
458 [-2.143] [29.222] [62.469]
459 [-0.946] [25.828] [61.248]
460 [2.239] [25.804] [63.373]
[461 rows x 3 columns]
What I intend to do is to create a Euclidean distance matrix using these X, Y, and Z values. I tried to do this using the pdist function
dist = pdist(coord_table, metric = 'euclidean')
distance_matrix = squareform(dist)
print(distance_matrix)
However, the interpreter gives the following error
ValueError: setting an array element with a sequence.
I am not sure how to interpret this error or how to fix it.
Change your loop so that each cell holds a plain number rather than a one-element list:
arr_coord = []
for chains in structure:
    for chain in chains:
        for residue in chain:
            for atom in residue:
                x = atom.get_coord()
                arr_coord.append({'X': x[0],'Y':x[1],'Z':x[2]}) # no need for a list inside each cell
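To illustrate the effect of the change, here is a minimal sketch that assumes the first three coordinates from the question's table as stand-in data, since the structure object itself is not shown. With scalar cells the frame converts to a numeric array that pdist accepts.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# First three atoms from the question's table, stored as plain floats.
coords = [(-5.43, 28.077, -0.842),
          (-3.183, 26.472, 1.741),
          (-2.574, 22.752, 1.69)]
coord_table = pd.DataFrame(coords, columns=["X", "Y", "Z"])

# pdist returns the condensed pairwise distances; squareform expands
# them into a symmetric n x n matrix with zeros on the diagonal.
dist = pdist(coord_table.to_numpy(), metric="euclidean")
distance_matrix = squareform(dist)
```

With the original one-element lists the frame would hold Python objects instead of floats, which is what leads to the "setting an array element with a sequence" error.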

Format Pandas Pivot Table

I ran into a problem formatting a pivot table created by pandas.
I built a matrix table between two columns (A, B) of my source data using pandas.pivot_table, with A as the columns and B as the index.
>> df = PD.read_excel("data.xls")
>> table = PD.pivot_table(df,index=["B"],
                          values='Count',columns=["A"],aggfunc=[NUM.sum],
                          fill_value=0,margins=True,dropna= True)
>> table
It returns as:
     sum
A      1   2   3  All
B
1     23  52   0   75
2     16  35  12   65
3     56   0   0   56
All   95  87  12  196
And I hope to have a format like this:
              A       All_B
          1   2   3
    1    23  52   0      75
B   2    16  35  12      65
    3    56   0   0      56
All_A    95  87  12     196
How should I do this? Thanks very much in advance.
The table returned by pd.pivot_table is convenient to work with as it is (single-level index and columns) and normally does not need any further formatting. If you still want the layout from your post, you need to build a multi-level index and multi-level columns with pd.MultiIndex. Here is an example of how to do it.
Before manipulation:
import pandas as pd
import numpy as np
np.random.seed(0)
a = np.random.randint(1, 4, 100)
b = np.random.randint(1, 4, 100)
df = pd.DataFrame(dict(A=a,B=b,Val=np.random.randint(1,100,100)))
table = pd.pivot_table(df, index='A', columns='B', values='Val', aggfunc=sum, fill_value=0, margins=True)
print(table)
B       1     2     3   All
A
1     454   649   770  1873
2     628   576   467  1671
3     376   247   481  1104
All  1458  1472  1718  4648
After:
multi_level_column = pd.MultiIndex.from_arrays([['A', 'A', 'A', 'All_B'], [1,2,3,'']])
multi_level_index = pd.MultiIndex.from_arrays([['B', 'B', 'B', 'All_A'], [1,2,3,'']])
table.index = multi_level_index
table.columns = multi_level_column
print(table)
            A              All_B
            1     2     3
B     1   454   649   770   1873
      2   628   576   467   1671
      3   376   247   481   1104
All_A    1458  1472  1718   4648
