I have a quick question about pandas replace.
import pandas as pd
import numpy as np
infile = pd.read_csv('sum_prog.csv')
df = pd.DataFrame(infile)
df_no_na = df.replace({0, np.nan})
df_no_na = df_no_na.dropna()
print(df_no_na.head())
print(df.head())
This code will return:
Cell ID Duration ... Overall Angle Median Overall Euclidean Median
0 372003 148 ... 0.0 1.9535615635898635
1 372005 536 ... 45.16432169606084 37.85959470668756
2 372006 840 ... 0.0 1.0821891332154392
3 372010 840 ... 0.0 1.4200380286464513
4 372011 840 ... 0.0 1.0594536197046835
[5 rows x 20 columns]
Cell ID Duration ... Overall Angle Median Overall Euclidean Median
0 372003 148 ... 0.0 1.9535615635898635
1 372005 536 ... 45.16432169606084 37.85959470668756
2 372006 840 ... 0.0 1.0821891332154392
3 372010 840 ... 0.0 1.4200380286464513
4 372011 840 ... 0.0 1.0594536197046835
I have done this exact same thing before and it worked; I have no idea why it won't now. Any help would be awesome, thanks!
You are passing df.replace() a set instead of a dictionary. You need to replace {0, np.nan} with {0: np.nan}:
import pandas as pd
import numpy as np
infile = pd.read_csv('sum_prog.csv')
df = pd.DataFrame(infile)
print(df)
df_no_na = df.replace({0: np.nan}) # change this line
print(df_no_na)
df_no_na = df_no_na.dropna()
print(df_no_na)
index Cell_ID Duration Overall_Angle_Median Overall_Euclidean_Median
1 1.0 372005.0 536.0 45.164322 37.859595
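For a quick sanity check of the dict form on a toy frame (a small sketch, not your data):
import numpy as np
import pandas as pd
toy = pd.DataFrame({'a': [0, 1, 2], 'b': [3, 0, 4]})
# The dict maps each key (0) to its replacement (np.nan), then dropna removes those rows
no_zeros = toy.replace({0: np.nan}).dropna()
print(no_zeros)  # only the row without any 0 survives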
I am using pandas to scrape this site https://www.mapsofworld.com/lat_long/poland-lat-long.html but I am only getting 3 elements. How can I get all elements from the table?
import numpy as np
import pandas as pd
#for getting world map
import folium
# Retrieving Latitude and Longitude coordinates
info = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",match='Augustow',skiprows=2)
# converting the table data into a DataFrame
coordinates = pd.DataFrame(info[0])
data = coordinates.head()
print(data)
It looks like if you install and use html5lib as your parser it may fix your issues:
df = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",attrs={"class":"tableizer-table"},skiprows=2,flavor="html5lib")
>>>df
[ 0 1 2
0 Locations Latitude Longitude
1 NaN NaN NaN
2 Augustow 53°51'N 23°00'E
3 Auschwitz/Oswiecim 50°02'N 19°11'E
4 Biala Podxlaska 52°04'N 23°06'E
.. ... ... ...
177 Zawiercie 50°30'N 19°24'E
178 Zdunska Wola 51°37'N 18°59'E
179 Zgorzelec 51°10'N 15°0'E
180 Zyrardow 52°3'N 20°28'E
181 Zywiec 49°42'N 19°10'E
[182 rows x 3 columns]]
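html5lib is not bundled with pandas, so you would typically install it first (together with beautifulsoup4, which the html5lib flavor needs). After that, one possible way to tidy the result, assuming the layout shown above (header text in the first parsed row, then an empty spacer row), is:
# pip install html5lib beautifulsoup4
import pandas as pd
tables = pd.read_html("https://www.mapsofworld.com/lat_long/poland-lat-long.html",attrs={"class":"tableizer-table"},skiprows=2,flavor="html5lib")
coords = tables[0]
coords.columns = coords.iloc[0]                      # promote "Locations / Latitude / Longitude" to column headers
coords = coords.drop([0, 1]).reset_index(drop=True)  # drop the header row and the empty spacer row
print(coords)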
Can someone help me understand where I'm going wrong? I don't know why I get different volatility in each column...
This is an example of my code:
from math import sqrt
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from statistics import stdev
data = around(a=uniform(low=1.0, high=50.0, size=(500, 1)), decimals=3)
df = DataFrame(data=data, columns=['close'], dtype='float64')
df.loc[:, 'delta'] = df.loc[:, 'close'].pct_change().fillna(0).round(3)
volatility = []
for index in range(df.shape[0]):
    if index < 90:
        volatility.append(0)
    else:
        start = index - 90
        stop = index + 1
        volatility.append(stdev(df.loc[start:stop, 'delta']) * sqrt(252))
df.loc[:, 'volatility1'] = volatility
df.loc[:, 'volatility2'] = df.loc[:, 'delta'].rolling(window=90).std(ddof=0) * sqrt(252)
print(df)
close delta volatility1 volatility2
0 10.099 0.000 0.000000 NaN
1 26.331 1.607 0.000000 NaN
2 32.361 0.229 0.000000 NaN
3 2.068 -0.936 0.000000 NaN
4 36.241 16.525 0.000000 NaN
.. ... ... ... ...
495 48.015 -0.029 46.078037 46.132943
496 6.988 -0.854 46.036210 46.178820
497 23.331 2.339 46.003184 45.837245
498 25.551 0.095 45.608260 45.792188
499 46.248 0.810 45.793012 45.769787
[500 rows x 4 columns]
Thank you so much!
There are three small changes needed; comments are added inline. 89 is needed because .loc slicing is endpoint-inclusive (unlike most Python slicing). ddof=1 is needed because statistics.stdev uses the sample standard deviation (ddof=1) by default, whereas numpy's std defaults to ddof=0; what ddof does is the same in both cases.
Also, in the future, try changing size to something like 95. You don't need the other 405 rows when debugging, and it is nice to see the changeover from 0/NaN to actual volatility to confirm you need 89, not 90.
The 0 vs NaN difference still exists. This is a result of you appending 0 and rolling's default behavior. I wasn't sure whether that was intentional, so I left it.
from math import sqrt
from numpy import around
from numpy.random import uniform
from pandas import DataFrame
from statistics import stdev
data = around(a=uniform(low=1.0, high=50.0, size=(500, 1)), decimals=3)
df = DataFrame(data=data, columns=['close'], dtype='float64')
df['delta'] = df['close'].pct_change().fillna(0).round(3)
volatility = []
for index in range(df.shape[0]):
    if index < 89:  # change to 89
        volatility.append(0)
    else:
        start = index - 89  # change to 89
        stop = index
        volatility.append(stdev(df.loc[start:stop, 'delta']) * sqrt(252))
df['volatility1'] = volatility
df['volatility2'] = df.loc[:, 'delta'].rolling(window=90).std(ddof=1) * sqrt(252) #change to ddof=1
print(df)
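Two quick checks of those claims, if you want to convince yourself (a small standalone sketch, not tied to the data above):
from statistics import stdev
import numpy as np
import pandas as pd
s = pd.Series(range(100))
print(len(s.loc[0:89]))        # 90 -> .loc slicing includes the endpoint
sample = [1.0, 2.0, 4.0, 8.0]
print(stdev(sample))           # statistics.stdev divides by n - 1 ...
print(np.std(sample, ddof=1))  # ... which matches ddof=1
print(np.std(sample, ddof=0))  # numpy's default ddof=0 gives a smaller value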
I have been searching for hours. I have 190 columns of a pivot table to loop over in my script.
I have this script:
corr = pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[0]], list(df_pvt.columns)], method='pearson')[['X','Y','r']]
This provides the output:
X ... r
0 CORSEC_Mainstream Media_Negative Count ... 1.000
1 CORSEC_Mainstream Media_Negative Count ... 0.960
2 CORSEC_Mainstream Media_Negative Count ... -0.203
3 CORSEC_Mainstream Media_Negative Count ... -0.446
4 CORSEC_Mainstream Media_Negative Count ... 0.488
.. ... ... ...
179 CORSEC_Mainstream Media_Negative Count ... -0.483
180 CORSEC_Mainstream Media_Negative Count ... -0.487
181 CORSEC_Mainstream Media_Negative Count ... 0.145
182 CORSEC_Mainstream Media_Negative Count ... 0.128
183 CORSEC_Mainstream Media_Negative Count ... 0.520
[184 rows x 3 columns]
I want to append the other 189 columns,
but this script keeps producing only 2 appended variables and keeps replacing them up to the 189th variable:
for var in list(range(1,189)):
    corr_all = corr.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[var]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
print(corr_all)
Any advice?
Edit:
It works like this:
corr = pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[0]], list(df_pvt.columns)], method='pearson')[['X','Y','r']]
corr_1 = corr.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[1]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
corr_2 = corr_1.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[2]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
But how do I loop it until corr_189?
You can try making a list of values (Pearson coefficients) for each of your 189 columns, and then concatenating those columns onto df_final, which would be the dataframe containing all 190 columns:
corr = pd.DataFrame(corr)
df_final = corr.copy()
for k in range(189):
    list_Pearson_k = 'formula to compute a list of pearson values'
    df_list_k = pd.DataFrame(list_Pearson_k)
    df_final = pd.concat([df_final, df_list_k], axis=1)
In your loop you always append to the original corr, so corr_all only ever holds the latest result. Collect the results in a Python list instead (note that the list append method modifies the list in place and returns None). Change your code to this:
corr_all = []
for var in range(1,189):
    corr_all.append(pg.pairwise_corr(df_pvt, columns=[[df_pvt.columns[var]], list(df_pvt.columns)], method='pearson')[['X','Y','r']])
print(corr_all)
This should help.
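If you then want everything in a single DataFrame, you can concatenate the pieces once at the end (a small sketch; pd.concat is also the recommended replacement for DataFrame.append in newer pandas versions):
import pandas as pd
corr_all = pd.concat(corr_all, ignore_index=True)  # stack the per-column results into one frame
print(corr_all)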
I'm at a beginner to intermediate data science level. I want to impute missing values from a dataframe using knn.
As the dataframe contains strings and floats, I need to encode / decode values using LabelEncoder.
My method is as follows:
Replace NaN to be able to encode
Encode the text values and put them in a dictionary
Retrieve the NaN (previously converted) to be imputed with knn
Assign values with knn
Decode values from the dictionary
Unfortunately, in the last step, imputing values adds new values that cannot be decoded (unseen labels error message).
Could you please explain to me what I am doing wrong, and ideally help me correct it? Before concluding, I wanted to say that I know there are other tools like OneHotEncoder, but I don't know them well enough, and I found LabelEncoder much more intuitive because you can see the result directly in the dataframe (whereas OneHotEncoder produces an array).
Please find below an example of my method. Thank you very much for your help:
[1]
# Import libraries.
import pandas as pd
import numpy as np
# Initialise data of lists.
data = {'Name':['Jack', np.nan, 'Victoria', 'Nicolas', 'Victor', 'Brad'], 'Age':[59, np.nan, 29, np.nan, 65, 50], 'Car color':['Blue', 'Black', np.nan, 'Black', 'Grey', np.nan], 'Height ':[177, 150, np.nan, 180, 175, 190]}
# Make a DataFrame
df = pd.DataFrame(data)
# Print the output.
df
Output :
Name Age Car color Height
0 Jack 59.0 Blue 177.0
1 NaN NaN Black 150.0
2 Victoria 29.0 NaN NaN
3 Nicolas NaN Black 180.0
4 Victor 65.0 Grey 175.0
5 Brad 50.0 NaN 190.0
[2]
# LabelEncoder does not work with NaN values, so I replace them with value '1000' :
df = df.replace(np.nan, 1000)
# And to avoid errors, str columns must be set as strings (even '1000' value) :
df[['Name','Car color']] = df[['Name','Car color']].astype(str)
df
Output
Name Age Car color Height
0 Jack 59.0 Blue 177.0
1 1000 1000.0 Black 150.0
2 Victoria 29.0 1000 1000.0
3 Nicolas 1000.0 Black 180.0
4 Victor 65.0 Grey 175.0
5 Brad 50.0 1000 190.0
[3]
# Import LabelEncoder library :
from sklearn.preprocessing import LabelEncoder
# define labelencoder :
le = LabelEncoder()
# Import defaultdict library to make a dict of labelencoder :
from collections import defaultdict
# Initiate a dict of LabelEncoder values :
encoder_dict = defaultdict(LabelEncoder)
# Make a new dataframe of LabelEncoder values :
df[['Name','Car color']] = df[['Name','Car color']].apply(lambda x: encoder_dict[x.name].fit_transform(x))
# Show output :
df
Output
Name Age Car color Height
0 2 59.0 2 177.0
1 0 1000.0 1 150.0
2 5 29.0 0 1000.0
3 3 1000.0 1 180.0
4 4 65.0 3 175.0
5 1 50.0 0 190.0
[4]
# Convert the 1000 placeholders back to missing values in order to impute them :
df = df.replace(1000, np.nan)
df
Output
Name Age Car color Height
0 2 59.0 2 177.0
1 0 NaN 1 150.0
2 5 29.0 0 NaN
3 3 NaN 1 180.0
4 4 65.0 3 175.0
5 1 50.0 0 190.0
[5]
# Import KNN imputer library to impute missing values :
from sklearn.impute import KNNImputer
# Define imputer :
imputer = KNNImputer(n_neighbors=2)
# Impute and reassign index/columns :
df = pd.DataFrame(np.round(imputer.fit_transform(df)),columns = df.columns)
df
Output
Name Age Car color Height
0 2.0 59.0 2.0 177.0
1 0.0 47.0 1.0 150.0
2 5.0 29.0 0.0 165.0
3 3.0 44.0 1.0 180.0
4 4.0 65.0 3.0 175.0
5 1.0 50.0 0.0 190.0
[6]
# Decode data :
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
# Apply it to df -> THIS IS WHERE ERROR OCCURS :
df[['Name','Car color']].apply(inverse_transform_lambda)
Error message :
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-55-8a5e369215f6> in <module>()
----> 1 df[['Name','Car color']].apply(inverse_transform_lambda)
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6926 kwds=kwds,
6927 )
-> 6928 return op.get_result()
6929
6930 def applymap(self, func):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in get_result(self)
184 return self.apply_raw()
185
--> 186 return self.apply_standard()
187
188 def apply_empty_result(self):
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_standard(self)
290
291 # compute the result using the series generator
--> 292 self.apply_series_generator()
293
294 # wrap results
/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py in apply_series_generator(self)
319 try:
320 for i, v in enumerate(series_gen):
--> 321 results[i] = self.f(v)
322 keys.append(v.name)
323 except Exception as e:
<ipython-input-54-f16f4965b2c4> in <lambda>(x)
----> 1 inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x)
/usr/local/lib/python3.6/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
297 "y contains previously unseen labels: %s" % str(diff))
298 y = np.asarray(y)
--> 299 return self.classes_[y]
300
301 def _more_tags(self):
IndexError: ('arrays used as indices must be of integer (or boolean) type', 'occurred at index Name')
Based on my comment you should do
# Decode data :
inverse_transform_lambda = lambda x: encoder_dict[x.name].inverse_transform(x.astype(int))  # cast to int: KNNImputer returns floats, and inverse_transform needs integer codes
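Applied to the frame from the question, that would look something like this (a rough sketch, assuming the df and encoder_dict defined above, and that the imputed codes were already rounded to whole numbers by np.round):
# Decode the encoded columns and write the original labels back
df[['Name','Car color']] = df[['Name','Car color']].apply(inverse_transform_lambda)
print(df)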
Let's say I have a 10 GB CSV file and I want to get the summary statistics of the file using the DataFrame describe method.
In this case, I first need to create a DataFrame for all 10 GB of CSV data.
text_csv=Pandas.read_csv("target.csv")
df=Pandas.DataFrame(text_csv)
df.describe()
Does this mean all 10 GB will get loaded into memory to calculate the statistics?
Yes, I think you are right. And you can omit df=Pandas.DataFrame(text_csv), because the output of read_csv is already a DataFrame:
import pandas as pd
df = pd.read_csv("target.csv")
print df.describe()
Or you can use dask:
import dask.dataframe as dd
df = dd.read_csv('target.csv')
print df.describe().compute()
You can use the chunksize parameter of read_csv, but then you get a TextFileReader, not a DataFrame, so you need concat:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
#chunksize = 2 for testing
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print tp
<pandas.io.parsers.TextFileReader object at 0x000000001995ADA0>
df = pd.concat(tp, ignore_index=True)
print df.describe()
a b
count 15.000000 15.000000
mean 3.333333 527.600000
std 1.877181 5.082182
min 1.000000 519.000000
25% 2.000000 524.500000
50% 3.000000 528.000000
75% 5.000000 531.500000
max 6.000000 535.000000
You can convert each chunk of the TextFileReader to a DataFrame, but aggregating this output can be difficult:
import pandas as pd
import io
temp=u"""a;b
1;525
1;526
1;533
2;527
2;528
2;532
3;519
3;534
3;535
4;530
5;529
5;531
6;520
6;521
6;524"""
# after testing, replace io.StringIO(temp) with the filename
tp = pd.read_csv(io.StringIO(temp), sep=";", chunksize=2)
print tp
dfs = []
for t in tp:
    df = pd.DataFrame(t)
    df1 = df.describe()
    dfs.append(df1.T)
df2 = pd.concat(dfs)
print df2
count mean std min 25% 50% 75% max
a 2 1.0 0.000000 1 1.00 1.0 1.00 1
b 2 525.5 0.707107 525 525.25 525.5 525.75 526
a 2 1.5 0.707107 1 1.25 1.5 1.75 2
b 2 530.0 4.242641 527 528.50 530.0 531.50 533
a 2 2.0 0.000000 2 2.00 2.0 2.00 2
b 2 530.0 2.828427 528 529.00 530.0 531.00 532
a 2 3.0 0.000000 3 3.00 3.0 3.00 3
b 2 526.5 10.606602 519 522.75 526.5 530.25 534
a 2 3.5 0.707107 3 3.25 3.5 3.75 4
b 2 532.5 3.535534 530 531.25 532.5 533.75 535
a 2 5.0 0.000000 5 5.00 5.0 5.00 5
b 2 530.0 1.414214 529 529.50 530.0 530.50 531
a 2 6.0 0.000000 6 6.00 6.0 6.00 6
b 2 520.5 0.707107 520 520.25 520.5 520.75 521
a 1 6.0 NaN 6 6.00 6.0 6.00 6
b 1 524.0 NaN 524 524.00 524.0 524.00 524
It seems there is no file-size limitation for the pandas.read_csv method.
According to @fickludd's and @Sebastian Raschka's answers in Large, persistent DataFrame in pandas, you can use iterator=True and chunksize=xxx to load the giant csv file in pieces and calculate the statistics you want:
import pandas as pd
reader = pd.read_csv('some_data.csv', iterator=True, chunksize=1000)  # gives a TextFileReader, which is iterable with chunks of 1000 rows
partial_descs = [chunk.describe() for chunk in reader]  # describe each chunk separately
And aggregate all the partial describe info yourself.
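As a rough illustration of that aggregation step, here is a minimal sketch under some assumptions: the columns are numeric, the file name big_data.csv is hypothetical, and only count/mean/std/min/max are rebuilt (exact percentiles cannot be recovered from per-chunk summaries):
import numpy as np
import pandas as pd
counts = sums = sq_sums = col_min = col_max = None
# Stream the file in chunks and accumulate per-column sufficient statistics
for chunk in pd.read_csv('big_data.csv', chunksize=100000):
    num = chunk.select_dtypes(include=[np.number])
    counts = num.count() if counts is None else counts + num.count()
    sums = num.sum() if sums is None else sums + num.sum()
    sq_sums = (num ** 2).sum() if sq_sums is None else sq_sums + (num ** 2).sum()
    col_min = num.min() if col_min is None else np.minimum(col_min, num.min())
    col_max = num.max() if col_max is None else np.maximum(col_max, num.max())
mean = sums / counts
std = np.sqrt((sq_sums - counts * mean ** 2) / (counts - 1))  # sample std, matching describe()
summary = pd.DataFrame({'count': counts, 'mean': mean, 'std': std, 'min': col_min, 'max': col_max})
print(summary)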