I'm studying pandas in Python.
I'm trying to remove NaN elements from my data.csv file with data.dropna(), but it isn't removing them.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is the content of data.csv:
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is written incorrectly?
The data.csv file is written incorrectly; to fix it, you need to add commas.
Corrected data.csv:
Duration,Date,Pulse,Maxpulse,Calories
60,'2020/12/01',110,130,409.1
60,'2020/12/02',117,145,479.0
60,'2020/12/03',103,135,340.0
45,'2020/12/04',109,175,282.4
45,'2020/12/05',117,148,406.0
60,'2020/12/06',102,127,300.0
60,'2020/12/07',110,136,374.0
450,'2020/12/08',104,134,253.3
30,'2020/12/09',109,133,195.1
60,'2020/12/10',98,124,269.0
60,'2020/12/11',103,147,329.3
60,'2020/12/12',100,120,250.7
60,'2020/12/12',100,120,250.7
60,'2020/12/13',106,128,345.3
60,'2020/12/14',104,132,379.3
60,'2020/12/15',98,123,275.0
60,'2020/12/16',98,120,215.2
60,'2020/12/17',100,120,300.0
45,'2020/12/18',90,112,
60,'2020/12/19',103,123,323.0
45,'2020/12/20',97,125,243.0
60,'2020/12/21',108,131,364.2
45,,100,119,282.0
60,'2020/12/23',130,101,300.0
45,'2020/12/24',105,132,246.0
60,'2020/12/25',102,126,334.5
60,20201226,100,120,250.0
60,'2020/12/27',92,118,241.0
60,'2020/12/28',103,132,
60,'2020/12/29',100,132,280.0
60,'2020/12/30',102,129,380.3
60,'2020/12/31',92,115,243.0
TL;DR:
Try this:
new_data = data.fillna(pd.NA).dropna()
or:
import numpy as np
new_data = data.fillna(np.NaN).dropna()
Is that the real CSV file? I don't think so.
There isn't any specification of missing values in the CSV format. In my experience, missing values in a CSV are represented by nothing between two separators (if the separator is a comma, it looks like ,,).
From the pandas docs, pandas.read_csv has an na_values argument:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
If your CSV file contains 'NaN', pandas is capable of inferring it and reading it as NaN, but you can pass the parameter as you need.
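For example, a minimal sketch of passing na_values explicitly when reading the corrected file above (the extra token 'missing' is hypothetical, only there to show the parameter):

import pandas as pd

# 'missing' is recognised as NaN in addition to the default tokens
data = pd.read_csv('data.csv', na_values=['missing'])
new_data = data.dropna()
print(new_data)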
Also, you can check the type of an individual cell (where i is the row index and j the column index):
type(df.iloc[i,j])
Compare with:
type(np.NaN) # numpy NaN
float
type(pd.NA) # pandas NaN
pandas._libs.missing.NAType
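As a small follow-up sketch, pd.isna treats both kinds of missing value the same way, which is usually more robust than comparing types:

import numpy as np
import pandas as pd

print(pd.isna(np.nan))  # True (NumPy missing value)
print(pd.isna(pd.NA))   # True (pandas missing value)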
I have a huge dataset which contains the coordinates of particles. In order to split the data into test and training sets, I want to divide the space into many subspaces. I did this with a for-loop in every direction (x, y, z), but when I run the code it takes very long and is not efficient enough, especially for large datasets:
particle_boxes = []
init = 0
final = 50
number_box = 5
for i in range(number_box):
    for j in range(number_box):
        for k in range(number_box):
            index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                              df_particles['Y'].between(init+j*final, final+final*j) &
                              df_particles['Z'].between(init+k*final, final+final*k))
            particle_boxes.append(df_particles[index_particle])
where init and final define the box size, and df_particles contains every particle's coordinates (x, y, z).
After running this, particle_boxes contains 125 (number_box^3) equally spaced sub-boxes.
Is there any way to write this code more efficiently?
Note on efficiency
I conducted a number of tests using other tricks and nothing changed substantially. This is roughly as good as any other technique I used.
I'm curious to see if anyone else comes up with something an order of magnitude faster.
Sample data
import numpy as np
import pandas as pd

np.random.seed([3, 1415])
df_particles = pd.DataFrame(
    np.random.randint(250, size=(1000, 3)),
    columns=['X', 'Y', 'Z']
)
Solution
Construct an array a that represents your boundaries
a = np.array([50, 100, 150, 200, 250])
Then use searchsorted to create the individual dimensional bins
x_bin = a.searchsorted(df_particles['X'].to_numpy())
y_bin = a.searchsorted(df_particles['Y'].to_numpy())
z_bin = a.searchsorted(df_particles['Z'].to_numpy())
Use groupby on the three bins. I used trickery to get that into a dict
g = dict((*df_particles.groupby([x_bin, y_bin, z_bin]),))
We can see the first zone
g[(0, 0, 0)]
X Y Z
30 2 36 47
194 0 34 45
276 46 37 34
364 10 16 21
378 4 15 4
429 12 34 13
645 36 17 5
743 18 36 13
876 46 11 34
and the last
g[(4, 4, 4)]
X Y Z
87 223 236 213
125 206 241 249
174 218 247 221
234 222 204 237
298 208 211 225
461 234 204 238
596 209 229 241
731 210 220 242
761 225 215 231
762 206 241 240
840 211 241 238
846 212 242 241
899 249 203 228
970 214 217 232
981 236 216 248
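As a hedged usage sketch (not part of the original answer), the dict can be turned back into the question's ordered list of 125 sub-boxes, assuming 5 bins per axis as implied by the boundary array; bins that contain no particles fall back to an empty frame:

from itertools import product

empty = df_particles.iloc[0:0]  # empty frame with the same columns
particle_boxes = [g.get((i, j, k), empty) for i, j, k in product(range(5), repeat=3)]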
Instead of multiple nested for loops, consider one loop using itertools.product. But of course, avoid loops entirely if possible, as @piRSquared shows:
from itertools import product
particle_boxes = []

for i, j, k in product(range(number_box), range(number_box), range(number_box)):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    particle_boxes.append(df_particles[index_particle])
Alternatively, with a list comprehension:
def sub_df(i, j, k):
    index_particle = (df_particles['X'].between(init+i*final, final+final*i) &
                      df_particles['Y'].between(init+j*final, final+final*j) &
                      df_particles['Z'].between(init+k*final, final+final*k))
    return df_particles[index_particle]

particle_boxes = [sub_df(i, j, k) for i, j, k in product(range(number_box), range(number_box), range(number_box))]
Have a look at the train_test_split function available in the scikit-learn library.
I think it is almost exactly the kind of functionality that you need.
Its source code is available on GitHub.
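A minimal sketch of using it on the particle dataframe (assuming a plain random split is acceptable; test_size=0.2 and random_state=42 are arbitrary example values):

from sklearn.model_selection import train_test_split

# randomly split the particles into an 80/20 train/test partition
train_df, test_df = train_test_split(df_particles, test_size=0.2, random_state=42)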
I'm trying to see if I can read .txt files directly from an FTP server and turn them into a single concatenated dataframe variable without saving each file to my local drive.
My method of doing this right now looks like this:
# import packages
import pandas as pd
import numpy as np
import csv
import os, io
import re
import ftplib
from ftplib import FTP
from io import StringIO, BytesIO

ftp = ftplib.FTP('ftp.cpc.ncep.noaa.gov')  # connect to the main page
ftp.login()  # log in to the archive

# directory for monthly data
ftp.cwd('htdocs/products/analysis_monitoring/cdus/degree_days/archives/Heating degree Days/monthly states/2018')

filenames_monthly = ftp.nlst()  # get list of filenames in the archive

# grab wanted files from a list of files already on the local machine
filenames_wanted_monthly = list(set(filenames_monthly).intersection(date_list_monthly))  # list of unobtained dates

df_list = []  # collected file contents
for file_month in filenames_wanted_monthly:
    r = BytesIO()
    ftp.retrbinary('RETR ' + file_month, r.write)
    df_list.append(r.getvalue().decode('utf-8'))
    print(r.getvalue())
When I print the values I get the text I'm looking for but in an unaggregated and messy form:
b' \n HEATING DEGREE DAY DATA MONTHLY SUMMARY\n POPULATION-WEIGHTED STATE,REGIONAL,AND NATIONAL AVERAGES\n CLIMATE PREDICTION CENTER-NCEP-NWS-NOAA\n \n MONTHLY DATA FOR OCT 2018\n ACCUMULATIONS ARE FROM JULY 1, 2018\n -999 = NORMAL LESS THAN 100 OR RATIO INCALCULABLE\n \n STATE MONTH MON MON CUM CUM CUM CUM CUM\n TOTAL DEV DEV TOTAL DEV DEV DEV DEV\n FROM FROM FROM FROM FROM FROM\n NORM L YR NORM L YR NORM L YR\n PRCT PRCT\n \n ALABAMA 90 -63 -28 90 -78 -48 -46 -35\n ALASKA 685 -314 -149 1436 -629 -251 -30 -15\n ARIZONA 43 -33 42 43 -40 42 -999 -999\n ARKANSAS 197 24 24 215 14 24 7 13\n CALIFORNIA 43 -76 -5 43 -126 -22 -75 -34\n COLORADO 593 20 37 721 -187 -38 -21 -5\n CONNECTICUT 399 -22 179 498 -56 137 -10 38\n DELAWARE 242 -43 94 245 -83 75 -25 44\n DISTRCT COLUMBIA 194 -11 103 195 -36 97 -16 99\n FLORIDA 3 -7 -19 3 -7 -19 -999 -999\n GEORGIA 117 -37 -18 117 -52 -39 -31 -25\n HAWAII 0 0 0 0 0 0 -999 -999\n IDAHO 535 3 -78 732 -112 -104 -13 -12\n ILLINOIS 430 35 135 529 9 107 2 25\n INDIANA 386 10 95 471 -24 45 -5 11\n IOWA
Appending to the list just gives me each file as one long string.
Is there a way to store these .txt files, splitting them on the \n characters, as a single readable dataframe?
My desired output would be a dataframe with 2 columns containing the first 2 values after each instance of \n:
STATE TOTAL
ALABAMA 90
ARIZONA 43
CALIFORNIA 43
CONNECTICUT 399
.
.
Use StringIO to wrap the string from getvalue() as an in-memory file object, then read that with pandas to get the desired dataframe:

test_result = StringIO(result)             # wrap the decoded string in a file-like object
df = pd.read_csv(test_result, sep="\n")    # read the lines into a dataframe
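If a single-column read is too coarse, here is a hedged sketch (not the answerer's code; report_to_frame is just an illustrative helper name) that pulls the STATE and monthly TOTAL columns out of each decoded report, assuming every data row ends with the 8 numeric fields shown in the report header:

import pandas as pd

def report_to_frame(text):
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # keep only lines ending in 8 integer fields (the data rows);
        # everything before them is the (possibly multi-word) state name
        if len(parts) >= 9 and all(p.lstrip('-').isdigit() for p in parts[-8:]):
            rows.append((' '.join(parts[:-8]), int(parts[-8])))
    return pd.DataFrame(rows, columns=['STATE', 'TOTAL'])

df = pd.concat(report_to_frame(t) for t in df_list)  # df_list holds the decoded strings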
Total time: 1.01876 s
Function: prepare at line 91

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    91                                           @profile
    92                                           def prepare():
   ...
    98         1        536.0    536.0      0.1      tss = df.groupby('user_id').timestamp
    99         1     949643.0 949643.0     93.2      delta = tss.diff()
   ...
(remaining lines of prepare() omitted; together they account for under 7% of the runtime)
I have a dataframe which I group by some key and then select a column from each group and perform diff on that column (per group). As shown in the profiling results, the diff operation is super slow compared to the rest and is kind of a bottleneck. Is this expected? Are there faster alternatives to achieve the same result?
Edit: some more explanation
In my use case, timestamps represent the times of a user's actions, and I want to calculate the deltas between these actions (they are sorted). Each user's actions are completely independent of those of other users.
Edit: Sample code
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {'ts':    [1, 2, 3, 4, 60, 61, 62, 63, 64, 150, 155, 156,
               1, 2, 3, 4, 60, 61, 62, 63, 64, 150, 155, 163,
               1, 2, 3, 4, 60, 61, 62, 63, 64, 150, 155, 183],
     'id':    [1, 2, 3, 4, 60, 61, 62, 63, 64, 150, 155, 156,
               71, 72, 73, 74, 80, 81, 82, 83, 64, 160, 165, 166,
               21, 22, 23, 24, 90, 91, 92, 93, 94, 180, 185, 186],
     'other': ['x', 'x', 'x', '', 'x', 'x', '', 'x', 'x', '', 'x', '',
               'y', 'y', 'y', '', 'y', 'y', '', 'y', 'y', '', 'y', '',
               'z', 'z', 'z', '', 'z', 'z', '', 'z', 'z', '', 'z', ''],
     'user':  ['x', 'x', 'x', 'x', 'x', 'x', 'x', 'x', 'z', 'x', 'x', 'y',
               'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'x', 'y', 'y', 'x',
               'z', 'z', 'z', 'z', 'z', 'z', 'z', 'z', 'y', 'z', 'z', 'z']
     })
df.set_index('id',inplace=True)
deltas=df.groupby('user').ts.transform(pd.Series.diff)
If you do not wish to sort your data or drop down to numpy, then a significant performance improvement may be possible by changing your user series to Categorical. Categorical data is effectively stored as integer pointers.
In the below example, I see an improvement from 86ms to 59ms. This may improve further for larger datasets and where more users are repeated.
df = pd.concat([df]*10000)
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 86.1 ms per loop
%timeit df['user'].astype('category') # 23.4 ms per loop
df['user'] = df['user'].astype('category')
%timeit df.groupby('user').ts.transform(pd.Series.diff) # 35.7 ms per loop
If you are performing multiple operations, then the one-off cost of converting to categorical can be discounted.
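For instance, a small sketch (not from the original answer) of paying the conversion cost once and reusing the categorical column across several grouped operations:

df['user'] = df['user'].astype('category')                 # one-off conversion cost
deltas = df.groupby('user').ts.transform(pd.Series.diff)   # faster than with object dtype
counts = df.groupby('user').ts.count()                     # every later groupby benefits too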
I have an issue that I am trying to work through. I have a large dataset of about 25,000 genes that seem to be the product of domain shuffling or gene fusions. I would like to view these alignments in PDF format based on BLAST outfmt 6 output.
I have BLAST output files for each of these genes, with one query sequence (the recombinogenic gene) and a varying number of subject genes, with the following columns:
qseqid sseqid evalue qstart qend qlen sstart send slen length
I was hoping to run the files through some code to produce images like the attached file, using the following example BLAST output file:
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-30 5 100 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_2___Hsap00200 1e-10 200 230 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_3___Aver00900 1e-20 5 100 300 10 105 125 100
Cluster_1___Hsap10003 Cluster_3___Atha00809 1e-20 5 110 300 5 115 120 105
Cluster_1___Hsap10003 Cluster_4___Ecol00002 1e-10 70 170 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_4___Ecol00003 1e-30 75 175 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_4___Sfle00009 1e-10 80 180 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Spom00010 1e-30 160 260 300 10 105 240 95
Cluster_1___Hsap10003 Cluster_5___Scer01566 1e-10 170 270 300 205 235 30 95
Cluster_1___Hsap10003 Cluster_5___Afla00888 1e-30 175 275 300 10 105 240 95
I am looking for the query sequence to be drawn as a thick coloured bar, and the aligned section of each subject as a thick coloured bar, with thin black lines showing the rest of the gene length (one subject per line, showing all aligned sections against the query).
Does anyone know any software or know of any github code that may do something like this?
Thanks so much!