I have a dataset that contains NaN values and I am attempting to fill in those values using a rolling average. My code for doing so is as follows:
df = pd.DataFrame({'vals': med_vals})
print(df[353:363])
vals
353 17682.196292
354 13796.403594
355 14880.418179
356 14139.141779
357 15397.070537
358 15108.345602
359 14286.259755
360 14962.745719
361 NaN
362 NaN
df_filled = df.fillna(df.rolling(7, min_periods=1).mean())
print(df_filled[353:365])
vals
353 17682.196292
354 13796.403594
355 14880.418179
356 14139.141779
357 15397.070537
358 15108.345602
359 14286.259755
360 14962.745719
361 14795.663595
362 14778.712678
363 14938.605403
364 14785.783692
365 14624.502737
366 14962.745719
367 NaN
368 NaN
369 NaN
How can I make it so my code takes into account previously filled in values when calculating the rolling average?
Edit: I found a method that works, but I'm not too happy with it:
while pd.isnull(df).any().any():
    df.fillna(df.rolling(window=8, min_periods=7).mean(), inplace=True)
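For reference, the same idea can be written as a plain loop that fills one NaN at a time, so each filled value is visible to later windows (a sketch; the trailing window of 7 mirrors the rolling(7) in the original code and is an assumption):
import numpy as np

# Fill NaNs in index order with the mean of up to the 7 preceding
# values, so previously filled values feed into later windows.
vals = df['vals'].to_numpy(dtype=float)
for i in np.flatnonzero(np.isnan(vals)):
    window = vals[max(0, i - 7):i]          # trailing window, current cell excluded
    window = window[~np.isnan(window)]      # drop any NaNs still in the window
    if window.size:
        vals[i] = window.mean()
df['vals'] = vals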
You are getting exactly what you asked for. When you do a rolling average, pandas places the current cell at the right edge of the window. So, when setting cell 361:
355 356 357 358 359 360 361 362 363 364 365 366
^-----------------------------^
Since 361 is a NaN, you get the average of the other six. Continuing:
355 356 357 358 359 360 361 362 363 364 365 366
^-----------------------------^
^-----------------------------^
^-----------------------------^
^-----------------------------^
^-----------------------------^
So, when it's computing a value for 366, it will average from 360 through 366. The only cell in that range that has a value is 360, so that becomes the average. You told it there only needed to be one value in the range to be valid.
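You can see the right-edged window directly with a tiny example:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])
print(s.rolling(3, min_periods=1).mean())
# 0    1.0   <- mean of [1]
# 1    1.5   <- mean of [1, 2]
# 2    2.0   <- mean of [1, 2, 3]
# 3    3.0   <- mean of [2, 3, 4]
# 4    4.0   <- mean of [3, 4, 5]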
You're saying there is an issue, but it is not at all clear to me what you were expecting.
Related
I created a dataframe with the help of a pivot, and I have:
name x y z All
A 155 202 218 575
C 206 149 45 400
B 368 215 275 858
Total 729 566 538 1833
I would like to sort by column "All", not taking into account the row "Total". I am using:
df.sort_values(by = ["All"], ascending = False)
Thank you in advance!
If the Total row is the last one, you can sort the other rows and then concat the last row back on:
df = pd.concat([df.iloc[:-1, :].sort_values(by="All"), df.iloc[-1:, :]])
print(df)
Prints:
name x y z All
C 206 149 45 400
A 155 202 218 575
B 368 215 275 858
Total 729 566 538 1833
You can try the following, although it raises a FutureWarning you should heed: DataFrame.append was deprecated and later removed in pandas 2.0, so on recent versions use the pd.concat approach above instead.
df = df.iloc[:-1, :].sort_values('All', ascending=False).append(df.iloc[-1, :])
This outputs:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can get the sorted order without Total (assumed here to be the last row), then index by position:
import numpy as np
idx = np.argsort(df['All'].iloc[:-1])
df2 = df.iloc[np.r_[idx[::-1], len(df)-1]]
NB: since we are only sorting an indexer here, this should be very fast.
Output:
name x y z All
2 B 368 215 275 858
0 A 155 202 218 575
1 C 206 149 45 400
3 Total 729 566 538 1833
You can just drop the last row before sorting:
df.iloc[:-1].sort_values(by = ["All"], ascending = False)
I want to edit a table based on overlapping values.
Column 2 holds a group name, column 3 a start position, and column 4 an end position.
I want to keep only rows whose position range (start to end) is not contained within the range of another row of the same group (e.g. CE170_HUMAN).
For example, CE170_HUMAN has 6 rows, some with overlapping ranges: the 165-523 range (358 positions) is contained within the 1-523 range, so I want to keep only the row with 1-523, as it covers the longer range (523 positions). Then do the same for the next group, PURA2, and so on.
Input:
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 165 523
RAEG_00037368-RA CE170_HUMAN 326 523
RAEG_00037368-RD CE170_HUMAN 165 370
RAEG_00037368-RC CE170_HUMAN 1 523
RAEG_00037368-RE CE170_HUMAN 1 370
RAEG_00037388-RB PURA2_PIG 61 456
RAEG_00037388-RC PURA2_PIG 61 357
RAEG_00037388-RA PURA2_PIG 181 456
RAEG_00037400-RA KI26B_HUMAN 454 545
RAEG_00037401-RA KI26B_HUMAN 753 2108
RAEG_00037415-RA CNST_HUMAN 137 613
RAEG_00037416-RA CNST_HUMAN 637 725
RAEG_00037420-RE ELYS_HUMAN 1 2266
RAEG_00037420-RG ELYS_HUMAN 1080 2266
RAEG_00037420-RF ELYS_HUMAN 1 2266
RAEG_00037420-RD ELYS_HUMAN 1080 2266
RAEG_00037420-RC ELYS_HUMAN 205 2266
RAEG_00037420-RB ELYS_HUMAN 1080 2266
Desired output:
RAEG_00037367-RA CE170_HUMAN 557 1584
RAEG_00037368-RB CE170_HUMAN 1 523
RAEG_00037388-RC PURA2_PIG 61 357
RAEG_00037400-RA KI26B_HUMAN 454 545
RAEG_00037401-RA KI26B_HUMAN 753 2108
RAEG_00037415-RA CNST_HUMAN 137 613
RAEG_00037416-RA CNST_HUMAN 637 725
RAEG_00037420-RE ELYS_HUMAN 1 2266
I am looking for a solution in either bash, Perl, or Python.
I appreciate your help!
I don't understand your format, but I am sure you can adapt this:
rows = [
    "Hello",
    "World",
    "Hello World"
]

solution = []
found = False
for i in range(len(rows)):
    for j in range(len(rows)):
        if i == j:
            # Don't compare a row with itself (would be a false positive).
            continue
        if str(rows[i]) in str(rows[j]):
            # rows[i] is contained in another row, so it is not a solution.
            found = True
            break
    if not found:
        # rows[i] is not contained in any other row: keep it.
        solution.append(rows[i])
    else:
        # Not a solution; reset the flag for the next row.
        found = False

for i in solution:
    print(i)
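For the interval data itself, a pandas version along these lines might be closer to what you need (a sketch: the column names and the input file name are assumptions, and identical ranges are collapsed to a single row):
import pandas as pd

# Read the whitespace-separated table; column names are assumed.
df = pd.read_csv('input.txt', sep=r'\s+', header=None,
                 names=['id', 'group', 'start', 'end'])

def drop_contained(g):
    # Widest ranges first, so a row can only be contained by a row already kept.
    g = g.sort_values(['start', 'end'], ascending=[True, False])
    kept = []
    for _, row in g.iterrows():
        contained = any(k['start'] <= row['start'] and row['end'] <= k['end']
                        for k in kept)
        if not contained:
            kept.append(row)
    return pd.DataFrame(kept)

result = df.groupby('group', sort=False, group_keys=False).apply(drop_contained)
print(result.to_string(index=False, header=False))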
I am trying to create a seaborn FacetGrid to plot the distribution of each column in my DataFrame decathlon. The data looks as such:
P100m Plj Psp Phj P400m P110h Ppv Pdt Pjt P1500
0 938 1061 773 859 896 911 880 732 757 752
1 839 975 870 749 887 878 880 823 863 741
2 814 866 841 887 921 939 819 778 884 691
3 872 898 789 878 848 879 790 790 861 804
4 892 913 742 803 816 869 1004 789 854 699
... ... ... ... ... ... ... ... ... ...
7963 755 760 604 714 812 794 482 571 539 780
7964 830 845 524 767 786 783 601 573 562 535
7965 819 804 653 840 791 699 659 461 448 632
7966 804 720 539 758 830 782 731 487 425 729
7967 687 809 692 714 565 741 804 527 738 523
I am relatively new to Python and I can't understand my error. My attempt to format the data and create the grid is as follows:
import seaborn as sns
df_stacked = decathlon.stack().reset_index(1).rename({'level_1': 'column', 0: 'values'}, axis=1)
g = sns.FacetGrid(df_stacked, row = 'column')
g = g.map(plt.hist, "values")
However, I receive the following error:
ValueError: Axes instance argument was not found in a figure
Can anyone explain what exactly this error means and how I would go about fixing it?
EDIT
df_stacked looks as such:
column values
0 P100m 938
0 Plj 1061
0 Psp 773
0 Phj 859
0 P400m 896
... ...
7967 P110h 741
7967 Ppv 804
7967 Pdt 527
7967 Pjt 738
7967 P1500 523
I encountered a similar issue when running a Jupyter Notebook.
My solution involved:
Restarting the notebook
Re-running the imports: %matplotlib inline; import matplotlib.pyplot as plt
Since you did not post a full working example, this involves a bit of guessing.
What might be going wrong is the line g = g.map(plt.hist, "values"), because the error comes from deep within matplotlib. You can see something similar in this SO question, where another function outside matplotlib, pylab.sca(axes[i]), triggered the same error because the axes it referenced were not attached to the current figure.
Likely you installed or updated something in your (conda?) environment (changes in environment paths?) and it was only picked up after the next restart.
I also wonder where your plt.hist comes from: fully typed it should resemble matplotlib.pyplot.hist, which points at a missing or stale import, but that's guessing (awaiting your updated example code).
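For reference, here is a self-contained version of the intended plot that runs cleanly once the imports are in place (synthetic data standing in for decathlon):
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the decathlon scores in the question.
decathlon = pd.DataFrame(np.random.randint(400, 1100, size=(100, 3)),
                         columns=['P100m', 'Plj', 'Psp'])

# Long format: one (column, value) pair per row, as in the question.
df_stacked = (decathlon.stack()
              .reset_index(level=1)
              .rename({'level_1': 'column', 0: 'values'}, axis=1))

g = sns.FacetGrid(df_stacked, row='column')
g.map(plt.hist, 'values')
plt.show()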
I am working on a problem set where the data is recorded at microsecond resolution. I have 4 hours of data so far, and the dataset is very large because of the microsecond sampling. I want to aggregate the microsecond rows into their respective seconds so the data is easier to analyze.
example:
Vibration1 Vibration2 Vibration3 Temperature Pressure Time
816 698 822 1852 710 2019-03-26 09:49:09.013650
702 690 764 2002 810 2019-03-26 09:49:09.014308
702 692 768 1888 706 2019-03-26 09:49:09.014680
696 690 704 2004 810 2019-03-26 09:49:09.015094
738 696 772 1990 710 2019-03-26 09:49:09.015682
834 692 704 2066 704 2019-03-26 09:49:09.016153
798 692 690 1892 722 2019-03-26 09:49:09.016520
696 722 708 2102 700 2019-03-26 09:49:09.016875
824 690 700 2058 718 2019-03-26 09:49:09.017213
692 702 694 2106 704 2019-03-26 09:49:09.017564
Like this, I have many rows within the 09:49:09 second.
I have a total of 4 hours of data. How should I group the rows by second while keeping their respective hours and minutes?
Please help me.
If I do a groupby on seconds alone, it groups all the data by second-of-minute irrespective of hours and minutes.
I set a DatetimeIndex and then tried the code below, but it returned about 60 rows, aggregated by minute-of-hour irrespective of hour:
df.groupby(df.index.minute).mean()
First, make sure your Time is a datetime object:
df.Time = pd.to_datetime(df.Time)
Then you need to resample:
df.set_index('Time').resample('1S').mean()
With your example data as df, the above results in:
Vibration1 Vibration2 Vibration3 Temperature Pressure
Time
2019-03-26 09:49:09 749.8 696.4 732.6 1996.0 729.4
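An equivalent alternative, if you prefer groupby over resample, is to floor the timestamps to whole seconds instead of grouping on a single time component (a sketch):
# Floor each timestamp to its second, then aggregate per second.
df.Time = pd.to_datetime(df.Time)
df = df.set_index('Time')
print(df.groupby(df.index.floor('S')).mean())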
Can you change the 'Time' column instead?
Example:
import pandas as pd
data = {
'dates': ['09:49:09.015682', '09:50:09.025682', '09:51:09.055682', '09:49:09.035682', '09:50:09.015682'],
'values': [ 1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# Truncate each timestamp to HH:MM:SS (drop the microseconds).
df['dates'] = df['dates'].str[:8]
print(df.groupby('dates').mean())
Output:
values
dates
09:49:09 2.5
09:50:09 3.5
09:51:09 3.0
I have the following collection of items. I would like to add a comma followed by a space after each item so I can create a list out of them. I am assuming the best way to do this is to form a single string out of the items and then replace the whitespace between items with a comma and a space, using regular expressions?
I would like to do this in Python, which I am new to.
179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281
283 293 307 311 313 317 331 337 347 349
353 359 367 373 379 383 389 397 401 409
419 421 431 433 439 443 449 457 461 463
Instead of a regular expression, how about this (assuming you have it in a file somewhere):
items = open('your_file.txt').read().split()
If it's just in a string variable:
items = your_input.split()
To combine them again with a comma in between:
print(', '.join(items))
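For example, with the first two rows of numbers from the question in a string:
text = """179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281"""
print(', '.join(text.split()))
# -> 179, 181, 191, 193, ... as one comma-separated line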
data = """179 181 191 193 197 199 211 223 227 229
233 239 241 251 257 263 269 271 277 281 """
To get the list out of it:
import re

lst = re.findall(r"\d+", data)
print(lst)
To add a comma after each item, replace each run of whitespace (the data spans multiple lines) with a comma and a space:
data = re.sub(r"\s+", ", ", data)
print(data)