I'm studying Pandas from Python.
I'm trying to remove NaN elements from my data.csv file with data.dropna() and it isn't removing.
import pandas as pd
data = pd.read_csv('data.csv')
new_data = data.dropna()
print(new_data)
This is data.csv content.
Duration Date Pulse Maxpulse Calories
60 '2020/12/01' 110 130 409.1
60 '2020/12/02' 117 145 479.0
60 '2020/12/03' 103 135 340.0
45 '2020/12/04' 109 175 282.4
45 '2020/12/05' 117 148 406.0
60 '2020/12/06' 102 127 300.0
60 '2020/12/07' 110 136 374.0
450 '2020/12/08' 104 134 253.3
30 '2020/12/09' 109 133 195.1
60 '2020/12/10' 98 124 269.0
60 '2020/12/11' 103 147 329.3
60 '2020/12/12' 100 120 250.7
60 '2020/12/12' 100 120 250.7
60 '2020/12/13' 106 128 345.3
60 '2020/12/14' 104 132 379.3
60 '2020/12/15' 98 123 275.0
60 '2020/12/16' 98 120 215.2
60 '2020/12/17' 100 120 300.0
45 '2020/12/18' 90 112 NaN
60 '2020/12/19' 103 123 323.0
45 '2020/12/20' 97 125 243.0
60 '2020/12/21' 108 131 364.2
45 NaN 100 119 282.0
60 '2020/12/23' 130 101 300.0
45 '2020/12/24' 105 132 246.0
60 '2020/12/25' 102 126 334.5
60 2020/12/26 100 120 250.0
60 '2020/12/27' 92 118 241.0
60 '2020/12/28' 103 132 NaN
60 '2020/12/29' 100 132 280.0
60 '2020/12/30' 102 129 380.3
60 '2020/12/31' 92 115 243.0
My guess is that data.csv is written incorrect?
The data.csv file is written wrong, to fix it need to add commas.
Corrected format: data.csv
Duration,Date,Pulse,Maxpulse,Calories
60,2020/12/01',110,130,409.1
60,2020/12/02',117,145,479.0
60,2020/12/03',103,135,340.0
45,2020/12/04',109,175,282.4
45,2020/12/05',117,148,406.0
60,2020/12/06',102,127,300.0
60,2020/12/07',110,136,374.0
450,2020/12/08',104,134,253.3
30,2020/12/09',109,133,195.1
60,2020/12/10',98,124,269.0
60,2020/12/11',103,147,329.3
60,2020/12/12',100,120,250.7
60,2020/12/12',100,120,250.7
60,2020/12/13',106,128,345.3
60,2020/12/14',104,132,379.3
60,2020/12/15',98,123,275.0
60,2020/12/16',98,120,215.2
60,2020/12/17',100,120,300.0
45,2020/12/18',90,112,
60,2020/12/19',103,123,323.0
45,2020/12/20',97,125,243.0
60,2020/12/21',108,131,364.2
45,,100,119,282.0
60,2020/12/23',130,101,300.0
45,2020/12/24',105,132,246.0
60,2020/12/25',102,126,334.5
60,20201226,100,120,250.0
60,2020/12/27',92,118,241.0
60,2020/12/28',103,132,
60,2020/12/29',100,132,280.0
60,2020/12/30',102,129,380.3
60,2020/12/31',92,115,243.0
TL,DR:
Try this:
new_data = df.fillna(pd.NA).dropna()
or:
import numpy as np
new_data = df.fillna(np.NaN).dropna()
That's the real csv file? I don't think so.
There isn't any specification of missing values in csv doc. From my experience, missing values in csv are represented by nothing between two separators (if the separator is a comma, it looks like ,,).
From pandas doc, the pandas.read_csv contains an argument na_values:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
If your csv file contains 'NaN', pandas are capable to infer and read as NaN, but you can pass the parameter as you need.
Also, you can use (consider i as the number of row and j for column):
type(df.iloc[i,j])
Compare with:
type(np.NaN) # numpy NaN
float
type(pd.NA) # pandas NaN
pandas._libs.missing.NAType
I am new in Python and i have a question. I have an exported .csv with values and i want to sum each row's total value than make a total column to there.
I've tried that but it doesnt work.
import pandas as pd
wine = pd.read_csv('testelek.csv', 'rb', delimiter=';')
wine['Total'] = [wine[row].sum(axis=1) for row in wine]
I want to make my DataFrame like this.
101 102 103 104 .... Total
__________________________________________________________________________
0 80 84 86 78 .... 328
1 78 76 77 79 .... 310
2 79 81 88 83 .... 331
3 70 85 89 84 .... 328
4 78 84 88 85 .... 335
You can bypass the need for the list comprehension and just use the axis=1 parameter to get what you want.
wine['Total'] = wine.sum(axis=1)
A nice way to do this is by using .apply().
Suppose that you want to create a new column named Total by adding the values per row for columns named 101, 102, and 103 you can try the following:
wine['Total'] = wine.apply(lambda row: sum([row['101'], row['102'], row['103']]), axis=1)
I need to change the names of a subset of columns in a dataframe from whatever number they are to that number plus a string suffix. I know there is a function to add a suffix, but it doesn't seem to work on just indices.
I create a list with all the column indices in it, then run a loop that, for each item in that list, it renames the dataframe column that matches the list item to the same number, plus the suffix string.
if scalename == "CDR":
print(scaledf.columns.tolist())
oldCols = scaledf.columns[7:].tolist()
for f in range(len(oldCols)):
changeCol = int(oldCols[f])
print(changeCol)
scaledf.rename(columns = {changeCol:scalename + str(changeCol)})
print(scaledf.columns)
This doesn't work.
The code will print out the column names, and prints out every item, but it does not rename the columns. It doesn't throw errors, it just doesn't work. I've tried variation after variation, and gotten all kinds of other errors, but this error-free code does nothing. It just runs, and doesn't rename anything.
Any help would be seriously appreciated! Thank you.
Adding sample of list:
45
52
54
55
59
60
61
66
67
68
69
73
74
75
80
81
82
94
101
103
104
108
110
115
116
117
129
136
138
139
143
144
145
150
151
157
158
159
171
178
180
181
185
186
187
192
193
199
200
201
213
220
222
223
227
228
229
234
235
236
Try this:
scaledf = scaledf.rename(columns=lambda c:scalename + str(c) if c in oldCols else c)
I am trying to translate the input dataframe (inp_df) to output dataframe (out_df) using the the data from the cell based intermediate dataframe (matrix_df) as shown below.
There are several cell number based files with distance values shown in matrix_df .
The program iterates by cell & fetches data from appropriate file so each time matrix_df will have the data for all rows of the current cell# that we are iterating for in inp_df.
inp_df
A B cell
100 200 1
115 270 1
145 255 2
115 266 1
matrix_df (cell_1.csv)
B 100 115 199 avg_distance
200 7.5 80.7 67.8 52
270 6.8 53 92 50
266 58 84 31 57
matrix_df (cell_2.csv)
B 145 121 166 avg_distance
255 74.9 77.53 8 53.47
out_df dataframe
A B cell distance avg_distance
100 200 1 7.5 52
115 270 1 53 50
145 255 2 74.9 53.47
115 266 1 84 57
My current thought process for each cell# based data is
use a apply function to go row by row
then use a join based on column B in the inp_df with with matrix_df, where the matrix df is somehow translated into a tuple of column name, distance & average distance.
But I am looking for a pandonic way of doing this since my approach will slow down when there are millions of rows in the input. I am specifically looking for core logic inside an iteration to fetch the matches, since in each cell the number of columns in matrix_df would vary
If its any help the matrix files is the distance based outputs from sklearn.metrics.pairwise.pairwise_distances .
NB: In inp_df the value of column B is unique and values of column A may or may not be unique
Also the matrix_dfs first column was empty & i had renamed it with the following code for easiness in understanding since it was a header-less matrix output file.
dist_df = pd.read_csv(mypath,index_col=False)
dist_df.rename(columns={'Unnamed: 0':'B'}, inplace=True)
Step 1: Concatenate your inputs with pd.concat and merge with inp_df using df.merge
In [641]: out_df = pd.concat([matrix_df1, matrix_df2]).merge(inp_df)
Step 2: Create the distance column with df.apply by using A's values to index into the correct column
In [642]: out_df.assign(distance=out_df.apply(lambda x: x[str(int(x['A']))], axis=1))\
[['A', 'B', 'cell', 'distance', 'avg_distance']]
Out[642]:
A B cell distance avg_distance
0 100 200 1 7.5 52.00
1 115 270 1 53.0 50.00
2 115 266 1 84.0 57.00
3 145 255 2 74.9 53.47