Python barbs wrong direction

There is probably a really simple answer to this, and I'm only asking as a last resort since I usually find my answers by searching, but I can't figure this out. I'm plotting some wind barbs in Python, but they point in the wrong direction and I don't know why.
The data is imported from a file and put into lists. I found on another Stack Overflow post how to set the U and V components for the barbs using np.sin and np.cos, which gives the correct wind speed, but the direction is wrong. I'm plotting a very simple tephigram (Skew-T).
# Program to read in radiosonde data from a file named "data.dat"
# Import numpy since we are going to use numpy arrays and the loadtxt
# function.
import numpy as np
import matplotlib.pyplot as plt
# Open the file for reading and store the file handle as "f"
# The filename is 'data.dat'
f = open('data.dat')
# Read the data from the file handle f. np.loadtxt() is useful for reading
# simply-formatted text files.
datain=np.loadtxt(f)
# Close the file.
f.close()
# We can copy the different columns into
# pressure, temperature, dewpoint temperature, wind direction and wind speed
# Note that the colon means consider all elements in that dimension.
# and remember indices start from zero
p=datain[:,0]
temp=datain[:,1]
temp_dew=datain[:,2]
wind_dir=datain[:,3]
wind_spd=datain[:,4]
print('Pressure/hPa: ', p)
print('Temperature/C: ', temp)
print('Dewpoint temperature/C: ', temp_dew)
print('Wind Direction/Deg: ', wind_dir)
print('Wind Speed/kts: ', wind_spd)
# for the barb vectors. This is the bit I think is causing the problem
u=wind_spd*np.sin(wind_dir)
v=wind_spd*np.cos(wind_dir)
#change units
#p=p/10
#temp=temp/10
#temp_dew=temp_dew/10
#plot graphs
fig1=plt.figure()
x1=temp
x2=temp_dew
y1=p
y2=p
x = np.linspace(50, 50, len(y1))  # constant x-position (50) for every barb
# print(x)
plt.plot(x1,y1,'r',label='Temp')
plt.plot(x2,y2,'g',label='Dew Point Temp')
plt.legend(loc=3,fontsize='x-small')
plt.gca().invert_yaxis()
#fig2=plt.figure()
plt.barbs(x,y1,u,v)
plt.yticks(y1)
plt.grid(axis='y')
plt.show()
As you can see from the direction column in the data, the barbs should all point in roughly the same direction, but they don't.
Any help is appreciated. Thank you.
Here is the data that is used:
996 25.2 24.9 290 12
963.2 24.5 22.6 315 42
930.4 23.8 20.1 325 43
929 23.8 20 325 43
925 23.4 19.6 325 43
900 22 17 325 43
898.6 21.9 17 325 43
867.6 20.1 16.5 320 41
850 19 16.2 320 44
807.9 16.8 14 320 43
779.4 15.2 12.4 320 44
752 13.7 10.9 325 43
725.5 12.2 9.3 320 44
700 10.6 7.8 325 45
649.7 7 4.9 315 44
603.2 3.4 1.9 325 49
563 0 -0.8 325 50
559.6 -0.2 -1 325 50
500 -3.5 -4.9 335 52
499.3 -3.5 -5 330 54
491 -4.1 -5.5 332 52
480.3 -5 -6.4 335 50
427.2 -9.7 -11 330 45
413 -11.1 -12.3 335 43
400 -12.7 -14.4 340 42
363.9 -16.9 -19.2 350 37
300 -26.3 -30.2 325 40
250 -36.7 -41.2 330 35
200 -49.9 0 335 0
150 -66.6 0 0 10
100 -83.5 0 0 30
Liam

# for the barb vectors. This is the bit I think is causing the problem
u=wind_spd*np.sin(wind_dir)
v=wind_spd*np.cos(wind_dir)
Instead try:
u=wind_spd*np.sin((np.pi/180)*wind_dir)
v=wind_spd*np.cos((np.pi/180)*wind_dir)
(http://tornado.sfsu.edu/geosciences/classes/m430/Wind/WindDirection.html)
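As a side note, NumPy also provides np.radians (and the equivalent np.deg2rad) for the degree-to-radian conversion, which may read more clearly:
u = wind_spd*np.sin(np.radians(wind_dir))
v = wind_spd*np.cos(np.radians(wind_dir))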

Related

data.dropna() doesn't work for my data.csv file and I still get data with NaN elements

I'm studying pandas in Python.
I'm trying to remove the NaN elements from my data.csv file with data.dropna(), but they aren't being removed.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is data.csv content.
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is formatted incorrectly?
The data.csv file is indeed formatted incorrectly; to fix it, you need to add commas.
Corrected format: data.csv
Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,2020/12/26,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
TL;DR:
Try this:
new_data = data.fillna(pd.NA).dropna()
or:
import numpy as np
new_data = data.fillna(np.nan).dropna()
Is that the real csv file? I don't think so.
There isn't any formal specification of missing values for CSV. In my experience, missing values in a csv file are represented by nothing between two separators (if the separator is a comma, it looks like ,,).
From the pandas docs, pandas.read_csv accepts an na_values argument:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: '', '#N/A', '#N/A N/A', '#NA', '-1.#IND', '-1.#QNAN', '-NaN', '-nan', '1.#IND', '1.#QNAN', '<NA>', 'N/A', 'NA', 'NULL', 'NaN', 'n/a', 'nan', 'null'.
If your csv file contains the string 'NaN', pandas can infer it and read it in as NaN, but you can pass extra values via this parameter as you need.
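For example (a sketch; 'missing' is just a hypothetical extra token to treat as NA, not something in your file):
import pandas as pd
data = pd.read_csv('data.csv', na_values=['missing'])
new_data = data.dropna()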
Also, you can inspect the type of individual cells (where i is the row number and j the column number):
type(df.iloc[i,j])
Compare with:
type(np.nan) # numpy NaN
float
type(pd.NA) # pandas NaN
pandas._libs.missing.NAType

Splitting data into subsamples

I have a huge dataset containing particle coordinates. In order to split the data into test and training sets, I want to divide the space into many subspaces. I did this with a for-loop in every direction (x, y, z), but the code takes very long to run and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                              df_particles['Y'].between(init+j*final, final+final*j) &
                              df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, df_particles contains every particle coordinate (x,y,z).
After running this, particle_boxes contains 125 (number_box**3) equally spaced sub-boxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something order of magnitude faster.
Sample data
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
Instead of multiple nested for loops, consider a single loop using itertools.product. But of course, avoid loops entirely if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []
for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k)
                  for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in scikit-learn.
I think it is almost exactly the kind of functionality you need.
Its source code is available on GitHub.
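For instance (a minimal sketch; the 80/20 split and the random_state value are arbitrary choices, not from the original answer):
from sklearn.model_selection import train_test_split

# randomly split the particle dataframe into training and test sets
train_df, test_df = train_test_split(df_particles, test_size=0.2, random_state=42)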

Possible to store getvalue() from FTP into a dataframe in Python

I'm trying to see if I can read .txt files directly from an FTP server and turn them into a single concatenated dataframe variable, without saving each file to my local drive.
My method of doing this right now looks like this:
# import packages
import pandas as pd
import ftplib
from io import StringIO, BytesIO

ftp = ftplib.FTP('ftp.cpc.ncep.noaa.gov')  # enter the main page
ftp.login()  # log in to archive
# directory for monthly data
ftp.cwd('htdocs/products/analysis_monitoring/cdus/degree_days/archives/Heating degree Days/monthly states/2018')
filenames_monthly = ftp.nlst()  # get list of filenames in archive
# grab wanted files, using a list of dates already on the local machine
filenames_wanted_monthly = list(set(filenames_monthly).intersection(date_list_monthly))  # get list of unobtained dates

df_list = []  # collect the decoded text of each file
for file_month in filenames_wanted_monthly:
    r = BytesIO()
    ftp.retrbinary('RETR ' + file_month, r.write)
    df_list.append(r.getvalue().decode('utf-8'))
    print(r.getvalue())
When I print the values I get the text I'm looking for but in an unaggregated and messy form:
b' \n HEATING DEGREE DAY DATA MONTHLY SUMMARY\n POPULATION-WEIGHTED STATE,REGIONAL,AND NATIONAL AVERAGES\n CLIMATE PREDICTION CENTER-NCEP-NWS-NOAA\n \n MONTHLY DATA FOR OCT 2018\n ACCUMULATIONS ARE FROM JULY 1, 2018\n -999 = NORMAL LESS THAN 100 OR RATIO INCALCULABLE\n \n STATE MONTH MON MON CUM CUM CUM CUM CUM\n TOTAL DEV DEV TOTAL DEV DEV DEV DEV\n FROM FROM FROM FROM FROM FROM\n NORM L YR NORM L YR NORM L YR\n PRCT PRCT\n \n ALABAMA 90 -63 -28 90 -78 -48 -46 -35\n ALASKA 685 -314 -149 1436 -629 -251 -30 -15\n ARIZONA 43 -33 42 43 -40 42 -999 -999\n ARKANSAS 197 24 24 215 14 24 7 13\n CALIFORNIA 43 -76 -5 43 -126 -22 -75 -34\n COLORADO 593 20 37 721 -187 -38 -21 -5\n CONNECTICUT 399 -22 179 498 -56 137 -10 38\n DELAWARE 242 -43 94 245 -83 75 -25 44\n DISTRCT COLUMBIA 194 -11 103 195 -36 97 -16 99\n FLORIDA 3 -7 -19 3 -7 -19 -999 -999\n GEORGIA 117 -37 -18 117 -52 -39 -31 -25\n HAWAII 0 0 0 0 0 0 -999 -999\n IDAHO 535 3 -78 732 -112 -104 -13 -12\n ILLINOIS 430 35 135 529 9 107 2 25\n INDIANA 386 10 95 471 -24 45 -5 11\n IOWA
Appending to the list gives me single-element lists of strings.
Is there a way to parse these .txt files, splitting on the \n characters, into a single readable dataframe?
My desired output would be a dataframe with 2 columns holding the first 2 values after each instance of \n:
STATE TOTAL
ALABAMA 90
ARIZONA 43
CALIFORNIA 43
CONNECTICUT 399
.
.
Use StringIO to wrap the string from getvalue() as an in-memory file, then read that file with pandas to get the desired dataframe:
test_result = StringIO(result)  # wrap the string in a file-like object
pd.read_csv(test_result, sep="\n")  # convert to dataframe
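Note that read_csv alone will not split the STATE and TOTAL columns out of this fixed-layout report. One possible way to do it (a sketch; parse_degree_days and its regex are mine, assuming every data row is an upper-case state name followed by eight signed integers):
import re
import pandas as pd

def parse_degree_days(text):
    # keep only lines that look like: NAME  n1 n2 n3 n4 n5 n6 n7 n8
    rows = []
    for line in text.splitlines():
        m = re.match(r"\s*([A-Z][A-Z ]+?)\s+(-?\d+)(?:\s+-?\d+){7}\s*$", line)
        if m:
            rows.append((m.group(1).strip(), int(m.group(2))))
    return pd.DataFrame(rows, columns=['STATE', 'TOTAL'])

# one dataframe per downloaded file, concatenated into a single dataframe
df = pd.concat([parse_degree_days(t) for t in df_list], ignore_index=True)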

Pandas diff SeriesGroupBy is relatively slow

Total time: 1.01876 s
Function: prepare at line 91

Line #      Hits         Time   Per Hit   % Time  Line Contents
==============================================================
    98          1        536.0     536.0      0.1  tss = df.groupby('user_id').timestamp
    99          1     949643.0  949643.0     93.2  delta = tss.diff()
(every other line in prepare() accounts for under 2% of the total time)
I have a dataframe which I group by some key and then select a column from each group and perform diff on that column (per group). As shown in the profiling results, the diff operation is super slow compared to the rest and is kind of a bottleneck. Is this expected? Are there faster alternatives to achieve the same result?
Edit: some more explanation
In my use case, the timestamps represent the times of a user's actions, and I want to calculate the deltas between those actions (they are sorted). Each user's actions are completely independent of other users'.
Edit: Sample code
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'ts': [1,2,3,4,60,61,62,63,64,150,155,156,
            1,2,3,4,60,61,62,63,64,150,155,163,
            1,2,3,4,60,61,62,63,64,150,155,183],
     'id': [1,2,3,4,60,61,62,63,64,150,155,156,
            71,72,73,74,80,81,82,83,64,160,165,166,
            21,22,23,24,90,91,92,93,94,180,185,186],
     'other': ['x','x','x','','x','x','','x','x','','x','',
               'y','y','y','','y','y','','y','y','','y','',
               'z','z','z','','z','z','','z','z','','z',''],
     'user': ['x','x','x','x','x','x','x','x','z','x','x','y',
              'y','y','y','y','y','y','y','y','x','y','y','x',
              'z','z','z','z','z','z','z','z','y','z','z','z']
    })
df.set_index('id',inplace=True)
deltas=df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is effectively stored as integer codes.
In the example below, I see an improvement from 86 ms to 59 ms (35.7 ms for the groupby plus the one-off 23.4 ms conversion). This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
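For reference, the "drop down to numpy" route mentioned above might look something like this (a sketch, not the original answer's code; it assumes each user's rows should be diffed in their existing order, which the stable sort preserves):
import numpy as np

# a stable sort groups each user's rows together without reordering them internally
order = np.argsort(df['user'].to_numpy(), kind='stable')
ts_sorted = df['ts'].to_numpy()[order]
users_sorted = df['user'].to_numpy()[order]

deltas = np.empty(len(df))
deltas[0] = np.nan
deltas[1:] = np.diff(ts_sorted)
deltas[1:][users_sorted[1:] != users_sorted[:-1]] = np.nan  # reset at user boundaries

# scatter the deltas back into the original row order
result = np.empty(len(df))
result[order] = deltas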

Graphically displaying BLAST alignments from local source

I have an issue that I am trying to work through. I have a large dataset of about 25,000 genes that seem to be the product of domain shuffling or gene fusions. I would like to view these alignments in PDF format, based on BLAST outfmt 6 output.
I have a BLAST output file for each of these genes, with 1 query sequence (the recombinogenic gene) and a varying number of subject genes, with the following columns:
qseqid sseqid evalue qstart qend qlen sstart send slen length
I was hoping to parse the files through some code to produce images like the attached file, using the following example blast output file:
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-30 5 100 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-10 200 230 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_3___Aver00900 1e-20 5 100 300 10 105 125 100
Cluster_1___Hsap10003 Cluster_3___Atha00809 1e-20 5 110 300 5 115 120 105
Cluster_1___Hsap10003 Cluster_4___Ecol00002 1e-10 70 170 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_4___Ecol00003 1e-30 75 175 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_4___Sfle00009 1e-10 80 180 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Spom00010 1e-30 160 260 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_5___Scer01566 1e-10 170 270 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Afla00888 1e-30 175 275 300 10 105 240 95
I am looking for the query sequence to be drawn as a thick coloured bar, and the alignment section of each subject as a thick colourful bar, with thin black lines showing the rest of each gene's length (one subject per line, showing all alignment sections against the query).
Does anyone know of any software, or of any GitHub code, that may do something like this?
Thanks so much!
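In the absence of a dedicated tool, a rough matplotlib sketch of this kind of plot might be a starting point (the filename is a placeholder, the colours and sizing are arbitrary, and it draws one line per alignment section rather than merging sections per subject):
import matplotlib.pyplot as plt
import pandas as pd

cols = ['qseqid', 'sseqid', 'evalue', 'qstart', 'qend',
        'qlen', 'sstart', 'send', 'slen', 'length']
hits = pd.read_csv('blast_outfmt6.txt', sep=r'\s+', names=cols)

fig, ax = plt.subplots(figsize=(8, 0.4*(len(hits) + 2)))

# query: one thick bar spanning its full length at the top
qlen = hits['qlen'].iloc[0]
ax.hlines(0, 0, qlen, colors='tab:blue', lw=8)
ax.text(qlen + 5, 0, hits['qseqid'].iloc[0], va='center', fontsize=7)

# subjects: thin black line for the whole gene, shifted so the aligned
# region (thick coloured bar) sits beneath the matching query span
for row, (_, h) in enumerate(hits.iterrows(), start=1):
    offset = h['qstart'] - h['sstart']
    ax.hlines(-row, offset, offset + h['slen'], colors='black', lw=1)
    ax.hlines(-row, h['qstart'], h['qend'], colors='tab:orange', lw=6)
    ax.text(offset + h['slen'] + 5, -row, h['sseqid'], va='center', fontsize=7)

ax.set_yticks([])
ax.set_xlabel('position')
plt.tight_layout()
plt.savefig('alignments.pdf')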
