Using if statements to filter data? - python

Lets say I have an excel document with the following format. I'm reading said excel doc with pandas and plotting data using matplotlib and numpy. Everything is great!
Buttttt..... I wan't more constraints. Now I want to constrain my data so that I can sort for only specific zenith angles and azimuth angles. More specifically: I only want zenith when it is between 30 and 90, and I only want azimuth when it is between 30 and 330
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
0 81 10
70 35 7
110 90 17
270 45 23
330 45 13
345 47 6
175 82 7
220 7 8
This is an example of the sort of constraint I'm looking for.
Air Quality Data
Azimuth Zenith Ozone Amount
230 50 12
70 35 7
110 90 17
270 45 23
330 45 13
175 82 7
The following is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
P_file = file1
out_file = file2
out_file2 = file3
data = pd.read_csv(file1,header=None,sep=' ')
df=pd.DataFrame(data=data)
df.to_csv(file2,sep=',',header = [19 headers. The three that matter for this question are 'DateTime', 'Zenith', 'Azimuth', and 'Ozone Amount'.]
df=pd.read_csv(file2,header='infer')
mask = df[df['DateTime'].str.contains('20141201')] ## In this line I'm sorting for anything containing the locator for the given day.
mask.to_csv(file2) ##I'm now updating file 2 so that it only has the data I want sorted for.
data2 = pd.read_csv(file2,header='infer')
df2=pd.DataFrame(data=data2)
def tojuliandate(date):
return.... ##give a function that changes normal date of format %Y%m%dT%H%M%SZ to julian date format of %y%j
def timeofday(date):
changes %Y%m%dT%H%M%SZ to %H%M%S for more narrow views of data
df2['Time of Day'] = df2['DateTime'].apply(timeofday)
df2.to_csv(file2) ##adds a column for "timeofday" to the file
So basically at this point this is all the code that goes into making the csv I want to sort. How would I go about sorting
'Zenith' and 'Azimuth'
If they met the criteria I specified above?
I know that I will need if statements to do this.
I tried something like this but it didn't work and I was looking for a bit of help:

df[(df["Zenith"]>30) & (df["Zenith"]<90) & (df["Azimuth"]>30) & (df["Azimuth"]<330)]
Basically a duplicate of Efficient way to apply multiple filters to pandas DataFrame or Series

You can use series between:
df[(df['Zenith'].between(30, 90)) & (df['Azimuth'].between(30, 330))]
Yields:
Azimuth Zenith Ozone Amount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7
Note that by default, these upper and lower bounds are inclusive (inclusive=True).

You can only write those entries of the dataframe to your file, which are meeting your boundary conditions
# replace the line df.to_csv(...) in your example with
df[((df['Zenith'] >= 3) & (df['Zenith'] <= 90)) and
((df['Azimuth'] >= 30) & (df['Azimuth'] <= 330))].to_csv('my_csv.csv')

Using pd.DataFrame.query:
df_new = df.query('30 <= Zenith <= 90 and 30 <= Azimuth <= 330')
print(df_new)
Azimuth Zenith OzoneAmount
0 230 50 12
2 70 35 7
3 110 90 17
4 270 45 23
5 330 45 13
7 175 82 7

Related

Sort rows of curve shaped data in python

I have a dataset that consists of 5 rows that are formed like a curve. I want to separate the inner row from the other or if possible each row and store them in a separate array. Is there any way to do this, like somehow flatten the curved data and sorting it afterwards based on the x and y values?
I would like to assign each row from left to right numbers from 0 to the max of the row. Right now the labels for each dot are not useful for me and I can't change the labels.
Here are the first 50 data points of my data set:
x y
0 -6.4165 0.3716
1 -4.0227 2.63
2 -7.206 3.0652
3 -3.2584 -0.0392
4 -0.7565 2.1039
5 -0.0498 -0.5159
6 2.363 1.5329
7 -10.7253 3.4654
8 -8.0621 5.9083
9 -4.6328 5.3028
10 -1.4237 4.8455
11 1.8047 4.2297
12 4.8147 3.6074
13 -5.3504 8.1889
14 -1.7743 7.6165
15 1.1783 6.9698
16 4.3471 6.2411
17 7.4067 5.5988
18 -2.6037 10.4623
19 0.8613 9.7628
20 3.8054 9.0202
21 7.023 8.1962
22 9.9776 7.5563
23 0.1733 12.6547
24 3.7137 11.9097
25 6.4672 10.9363
26 9.6489 10.1246
27 12.5674 9.3369
28 3.2124 14.7492
29 6.4983 13.7562
30 9.2606 12.7241
31 12.4003 11.878
32 15.3578 11.0027
33 6.3128 16.7014
34 9.7676 15.6557
35 12.2103 14.4967
36 15.3182 13.5166
37 18.2495 12.5836
38 9.3947 18.5506
39 12.496 17.2993
40 15.3987 16.2716
41 18.2212 15.1871
42 21.1241 14.0893
43 12.3548 20.2538
44 15.3682 18.9439
45 18.357 17.8862
46 21.0834 16.6258
47 23.9992 15.4145
48 15.3776 21.9402
49 18.3568 20.5803
50 21.1733 19.3041
It seems that your curves have a pattern, so you could select the curve of interest using splicing. I had the offset the selection slightly to get the five curves because the first 8 points are not in the same order as the rest of the data. So the initial 8 data points are discarded. But these could be added back in afterwards if required.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({ 'x': [-6.4165, -4.0227, -7.206, -3.2584, -0.7565, -0.0498, 2.363, -10.7253, -8.0621, -4.6328, -1.4237, 1.8047, 4.8147, -5.3504, -1.7743, 1.1783, 4.3471, 7.4067, -2.6037, 0.8613, 3.8054, 7.023, 9.9776, 0.1733, 3.7137, 6.4672, 9.6489, 12.5674, 3.2124, 6.4983, 9.2606, 12.4003, 15.3578, 6.3128, 9.7676, 12.2103, 15.3182, 18.2495, 9.3947, 12.496, 15.3987, 18.2212, 21.1241, 12.3548, 15.3682, 18.357, 21.0834, 23.9992, 15.3776, 18.3568, 21.1733],
'y': [0.3716, 2.63, 3.0652, -0.0392, 2.1039, -0.5159, 1.5329, 3.4654, 5.9083, 5.3028, 4.8455, 4.2297, 3.6074, 8.1889, 7.6165, 6.9698, 6.2411, 5.5988, 10.4623, 9.7628, 9.0202, 8.1962, 7.5563, 12.6547, 11.9097, 10.9363, 10.1246, 9.3369, 14.7492, 13.7562, 12.7241, 11.878, 11.0027, 16.7014, 15.6557, 14.4967, 13.5166, 12.5836, 18.5506, 17.2993, 16.2716, 15.1871, 14.0893, 20.2538, 18.9439, 17.8862, 16.6258, 15.4145, 21.9402, 20.5803, 19.3041]})
# Generate the 5 dataframes
df_list = [df.iloc[i+8::5, :] for i in range(5)]
# Generate the plot
fig = plt.figure()
for frame in df_list:
plt.scatter(frame['x'], frame['y'])
plt.show()
# Print the data of the innermost curve
print(df_list[4])
OUTPUT:
The 5th dataframe df_list[4] contains the data of the innermost plot.
x y
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
You can then add the missing data like this:
# Retrieve the two missing points of the inner curve
inner_curve = pd.concat([df_list[4], df[5:7]]).sort_index(ascending=True)
print(inner_curve)
# Plot the inner curve only
fig2 = plt.figure()
plt.scatter(inner_curve['x'], inner_curve['y'], color = '#9467BD')
plt.show()
OUTPUT: inner curve
x y
5 -0.0498 -0.5159
6 2.3630 1.5329
12 4.8147 3.6074
17 7.4067 5.5988
22 9.9776 7.5563
27 12.5674 9.3369
32 15.3578 11.0027
37 18.2495 12.5836
42 21.1241 14.0893
47 23.9992 15.4145
Complete Inner Curve

Numpy Vectorized Window Operations

I'm interested in figuring out how to do vectorized computations in a numpy array / pandas dataframe where each new cell is updated with local information.
For example, lets say I'm a weatherman interested in making predictions about the weather. My prediction algorithm will be the mean of the past 3 days. While this prediction is simple, I'd like to be able to do this with an arbitrary function.
Example data:
day temp
1 70
2 72
3 68
4 67
...
After a transformation should become
day temp prediction
1 70 None (no previous data)
2 72 70 (only one data point)
3 68 71 (two data points)
4 67 70
5 70 69
...
I'm only interested in the prediction column, so no need to make an attempt to join the data back together after achieving the prediction! Thanks!
Use rolling with a window of 3 and the min_periods of 1
df['prediction'] = df['temp'].rolling(window = 3, min_periods = 1).mean().shift()
df
day temp prediction
0 1 70 NaN
1 2 72 70
2 3 68 71
3 4 67 70
4 5 70 69

Python combine integer columns to create multiple columns with suffix

I have a dataframe with a sample of the employee survey results as shown below. The values in the delta columns are just the difference between the FY21 and FY20 columns.
Employee leadership_fy21 leadership_fy20 leadership_delta comms_fy21 comms_fy20 comms_delta
patrick.t#abc.com 88 50 38 90 80 10
johnson.g#abc.com 22 82 -60 80 90 -10
pamela.u#abc.com 41 94 -53 44 60 -16
yasmine.a#abc.com 90 66 24 30 10 20
I'd like to create multiple columns that
i. contain the % in the fy21 values
ii. merge it with the columns with the delta suffix such that the delta values are in a ().
example output would be:
Employee leadership_fy21 leadership_delta leadership_final comms_fy21 comms_delta comms_final
patrick.t#abc.com 88 38 88% (38) 90 10 90% (10)
johnson.g#abc.com 22 -60 22% (-60) 80 -10 80% (-10)
pamela.u#abc.com 41 -53 41% (-53) 44 -16 44% (-16)
yasmine.a#abc.com 90 24 90% (24) 30 20 30% (20)
I have tried the following code but it doesn't seem to work. It might have to do with numpy not being able to combine strings. Appreciate any form of help I can get, thank you.
#create a list of all the rating columns
ratingcollist = ['leadership','comms','wellbeing','teamwork']
#create a for loop to get all the columns that match the column list
for rat in ratingcollist:
cols = df.filter(like=rat).columns
fy21cols = df[cols].filter(like='_fy21').columns
deltacols = df[cols].filter(like='_delta').columns
if len(cols) > 0:
df[f'{rat.lower()}final'] = (df[fy21cols].values.astype(str) + '%' + '(' + df[deltacols].values.astype(str) + ')')
You can do this:
def yourfunction(ratingcol):
x=df.filter(regex=f'{ratingcol}(_delta|_fy21)')
fy=x.filter(regex='21').iloc[:,0].astype(str)
delta=x.filter(regex='_delta').iloc[:,0].astype(str)
return(fy+"%("+delta+")")
yourfunction('leadership')
0 88%(38)
1 22%(-60)
2 41%(-53)
3 90%(24)
Then, using a for loop you can create your columns
for i in ratingcollist:
df[f"{i}_final"]=yourfunction(i)

Calculating average/standard deviations of rows containing certain string in pandas dataframe

I have a large pandas dataframe read as table. I would like to calculate the means and standard deviations of the two different groups, CRPS and Age, so I can plot them in a bar plot with std deviations as the error bars.
I can get the mean calculated by just the Age column. I figured it's a for loop that I have to construct, but I don't know how to construct further than table["Age"].mean(), which just gives me the average of all data points' age values. This is where I need some guidance. I want to look in the group column, tell it to calculate the average and standard deviation for the ages of that group. So, an average and standard deviation value for the ages of the CRPS group, for example.
I have the first 25 rows down below just to show what the dataframe looks like. I also have imported numpy as np as well.
Group Age
0 CRPS 50
1 CRPS 59
2 CRPS 22
3 CRPS 48
4 CRPS 53
5 CRPS 48
6 CRPS 29
7 CRPS 44
8 CRPS 28
9 CRPS 42
10 CRPS 35
11 CONTROLS 54
12 CONTROLS 43
13 CRPS 50
14 CRPS 62
15 CONTROLS 64
16 CONTROLS 39
17 CRPS 40
18 CRPS 59
19 CRPS 46
20 CONTROLS 56
21 CRPS 21
22 CRPS 45
23 CONTROLS 41
24 CRPS 46
25 CONTROLS 35
I don't think you need a for-loop.
Instead, you might try something like:
table.iloc[table['Group'] == 'CRPS']['Age'].mean()
I haven't tested with your table, but I think that will work.
The idea is to first create a boolean array, which is true for row indices where the group field contains 'CRPS', then to select all of those rows using iloc, and finally to take the mean. You could iterate over all of the groups in the following way:
mean_age = dict()
for group in set(table['Group']):
mean_age[group] = table.iloc[table['Group'] == group]['Age'].mean()
Maybe this is where you intended to use a for loop.

Plot histogram using two columns (values, counts) in python dataframe

I have a dataframe having multiple columns in pairs: if one column is values then the adjacent column is the corresponding counts. I want to plot a histogram using values as x variable and counts as the frequency.
For example, I have the following columns:
Age Counts
60 1204
45 700
21 400
. .
. .
34 56
10 150
I want my code to bin the Age values in ten-year intervals between the maximum and minimum values and get the cumulative frequencies for each interval from the Counts column and then plot a histogram. Is there a way to do this using matplotlib ?
I have tried the following but in vain:
patient_dets.plot(x='PatientAge', y='PatientAgecounts', kind='hist')
(patient_dets is the dataframe with 'PatientAge' and 'PatientAgecounts' as columns)
I think you need Series.plot.bar:
patient_dets.set_index('PatientAge')['PatientAgecounts'].plot.bar()
If need bins, one possible solution is with pd.cut:
#helper df with min and max ages
df1 = pd.DataFrame({'G':['14 yo and younger','15-19','20-24','25-29','30-34',
'35-39','40-44','45-49','50-54','55-59','60-64','65+'],
'Min':[0, 15,20,25,30,35,40,45,50,55,60,65],
'Max':[14,19,24,29,34,39,44,49,54,59,64,120]})
print (df1)
G Max Min
0 14 yo and younger 14 0
1 15-19 19 15
2 20-24 24 20
3 25-29 29 25
4 30-34 34 30
5 35-39 39 35
6 40-44 44 40
7 45-49 49 45
8 50-54 54 50
9 55-59 59 55
10 60-64 64 60
11 65+ 120 65
cutoff = np.hstack([np.array(df1.Min[0]), df1.Max.values])
labels = df1.G.values
patient_dets['Groups'] = pd.cut(patient_dets.PatientAge, bins=cutoff, labels=labels, right=True, include_lowest=True)
print (patient_dets)
PatientAge PatientAgecounts Groups
0 60 1204 60-64
1 45 700 45-49
2 21 400 20-24
3 34 56 30-34
4 10 150 14 yo and younger
patient_dets.groupby(['PatientAge','Groups'])['PatientAgecounts'].sum().plot.bar()
You can use pd.cut() to bin your data, and then plot using the function plot('bar')
import numpy as np
nBins = 10
my_bins = np.linspace(patient_dets.Age.min(),patient_dets.Age.max(),nBins)
patient_dets.groupby(pd.cut(patient_dets.Age, bins =nBins)).sum()['Counts'].plot('bar')

Categories

Resources