Pandas replace method and object datatypes - python

I am using df = df.replace('No data', np.nan) on a csv file that contains 'No data' instead of blank/null entries where there is no numeric data. Using the head() method I see that replace does substitute NaN for the 'No data' entries, but df.info() still reports the datatype of every series as object.
When I open the csv file in Excel and manually use find and replace to change 'No data' to blank/null entries, the dataframes look exactly the same with df.head(), yet df.info() now reports the datatypes as floats.
I was wondering why this is, and how I can make the datatypes of my series floats without having to manually edit the csv files.

If the rest of the data in your columns is numeric then you can use pd.to_numeric with errors='coerce'. Note that pd.to_numeric works on a Series rather than a whole DataFrame, so apply it column by column (a sketch of this follows the example below). Alternatively, replace the strings and cast to float:

import pandas as pd
import numpy as np
# Create data for columns with strings in it
column_data = [1,2,3] + ['no data']
# Construct data frame with two columns
df = pd.DataFrame({'col1':column_data, 'col2':column_data[::-1]})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1 4 non-null object
col2 4 non-null object
dtypes: object(2)
memory usage: 144.0+ bytes
# Replace 'no data' with NaN
df_nan = df.replace('no data', np.nan)
# Set type of all columns to float
df_result = df_nan.astype({c: float for c in df_nan.columns})
df_result.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
col1 3 non-null float64
col2 3 non-null float64
dtypes: float64(2)
memory usage: 144.0 bytes
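A minimal sketch of the pd.to_numeric route mentioned above, using the same toy frame; errors='coerce' turns anything non-numeric, including 'no data', into NaN without a separate replace step:
import pandas as pd
# Same toy frame as above: two object columns containing 'no data'
df = pd.DataFrame({'col1': [1, 2, 3, 'no data'],
                   'col2': ['no data', 3, 2, 1]})
# pd.to_numeric works on a Series, so apply it column by column
df_numeric = df.apply(pd.to_numeric, errors='coerce')
df_numeric.info()   # both columns become float64 with one NaN each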

Consider using the converters argument in pandas.read_csv(), where you pass a dictionary mapping column numbers to a conversion function. The lambda below checks for the 'No data' string and conditionally replaces the value with np.nan, otherwise leaving it as is:
import numpy as np
import pandas as pd
c_fct = lambda x : float(x if 'No data' not in x else np.nan)
convertdict = {1:c_fct, 2:c_fct, 3:c_fct, 4:c_fct, 5:c_fct}
df = pd.read_csv('Input.csv', converters=convertdict)
Input CSV
ID Col1 Col2 Col3 Col4 Col5
TGG 0.634516647 0.900464347 0.998505978 0.170422713 0.893340128
GRI No data 0.045915333 0.718398939 0.924813864 No data
NLB 0.921127268 0.614460813 0.677857676 0.343612947 0.559437744
SEI 0.081852313 No data 0.890816385 0.943313021 0.874857844
LOY 0.632556715 0.362855866 0.038702448 0.253762859 No data
OPL 0.375088582 0.268283238 0.761552111 0.589547625 0.192223208
CTK 0.349464541 0.844718987 No data 0.841439909 0.898093646
EUE 0.629784261 0.982589843 0.315670377 0.832419474 0.950044814
JLP 0.543942659 0.988380305 0.417191823 0.823857176 0.542514099
RHK 0.728053447 0.521816539 0.402523435 No data 0.558226706
AEM 0.005495116 0.715363776 0.075508356 0.959119268 0.844730368
VLQ 0.21146319 0.558208766 0.501769554 0.226539046 0.795861461
MDB 0.230514689 0.223163664 No data 0.324636384 0.700716246
LPH 0.853433224 0.582678173 0.633109347 0.432191426 No data
PEP 0.41096305 No data .627776178 0.482359278 0.179863537
UQK 0.252598809 0.497517585 0.276060768 No data 0.087985623
KGJ 0.033985585 0.033702088 anNo data 0.286682709 0.543349787
JUQ 0.25971543 0.142067155 0.597985191 0.219841249 0.699822866
NYW No data 0.17187907 0.157413049 0.209011772 0.592824483
Output
print(df)
# ID Col1 Col2 Col3 Col4 Col5
# 0 TGG 0.634517 0.900464 0.998506 0.170423 0.893340
# 1 GRI NaN 0.045915 0.718399 0.924814 NaN
# 2 NLB 0.921127 0.614461 0.677858 0.343613 0.559438
# 3 SEI 0.081852 NaN 0.890816 0.943313 0.874858
# 4 LOY 0.632557 0.362856 0.038702 0.253763 NaN
# 5 OPL 0.375089 0.268283 0.761552 0.589548 0.192223
# 6 CTK 0.349465 0.844719 NaN 0.841440 0.898094
# 7 EUE 0.629784 0.982590 0.315670 0.832419 0.950045
# 8 JLP 0.543943 0.988380 0.417192 0.823857 0.542514
# 9 RHK 0.728053 0.521817 0.402523 NaN 0.558227
# 10 AEM 0.005495 0.715364 0.075508 0.959119 0.844730
# 11 VLQ 0.211463 0.558209 0.501770 0.226539 0.795861
# 12 MDB 0.230515 0.223164 NaN 0.324636 0.700716
# 13 LPH 0.853433 0.582678 0.633109 0.432191 NaN
# 14 PEP 0.410963 NaN 0.627776 0.482359 0.179864
# 15 UQK 0.252599 0.497518 0.276061 NaN 0.087986
# 16 KGJ 0.033986 0.033702 NaN 0.286683 0.543350
# 17 JUQ 0.259715 0.142067 0.597985 0.219841 0.699823
# 18 NYW NaN 0.171879 0.157413 0.209012 0.592824
print(df.dtypes)
# ID object
# Col1 float64
# Col2 float64
# Col3 float64
# Col4 float64
# Col5 float64
# dtype: object
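A simpler alternative, assuming the same hypothetical Input.csv, is to let read_csv do the substitution via na_values. Note that na_values matches the exact cell value, unlike the converter's substring check, so an entry such as 'anNo data' would not be caught:
import pandas as pd
# Treat exact 'No data' cells as missing while parsing; the numeric columns then parse as float64
df = pd.read_csv('Input.csv', na_values=['No data'])
print(df.dtypes)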

Related

Pandas .loc doesn't find null values based on .isnull() boolean array

I'm trying to get info about the null-values in my DF column LotFrontage. As you can see, there are some of them, confirmed in 2 ways:
hp_lot = houseprices[['LotFrontage', 'LotArea']]
hp_lot.describe()
hp_lot.info()
output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2919 entries, 1 to 2919
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 LotFrontage 2433 non-null float64
1 LotArea 2919 non-null int64
dtypes: float64(1), int64(1)
memory usage: 68.4 KB
houseprices_num['LotFrontage'].isnull().describe()
output:
count 2919
unique 2
top False
freq 2433
But when I'm trying to locate them, I'm just getting this:
lf_null = houseprices_num.loc[houseprices_num['LotFrontage'].isnull(), ['LotFrontage']]
lf_null.describe()
output:
count 0.0
mean NaN
std NaN
min NaN
25% NaN
50% NaN
75% NaN
max NaN
Name: LotFrontage, dtype: float64
My question is where the hell are my null-values? And if I messed up something in the syntax, why am I not getting an error message of some kind?
The variables:
traindf = pd.read_csv('.\\train.csv', sep=',', header=1, index_col='Id')
testdf = pd.read_csv('.\\test.csv', sep=',', header=0, index_col='Id')
houseprices = pd.concat([traindf, testdf], axis=0)
houseprices_num = houseprices[['LotFrontage', 'LotArea', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF',
'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch',
'3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'SalePrice']]
The mistake is in this line:
lf_null = houseprices_num['LotFrontage'].loc[houseprices_num['LotFrontage'].isnull()]
lf_null.describe()
You need to use loc in the following way:
lf_null = houseprices_num.loc[houseprices_num['LotFrontage'].isnull(), ['LotFrontage']]
lf_null.describe()
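Independent of the .loc form used, a quick sanity check (a sketch, assuming houseprices_num as defined above) shows the null values are still there; describe() reports count 0.0 for the selected rows simply because every selected LotFrontage value is NaN and describe() only counts non-null entries:
# Number of missing LotFrontage values: 2919 total - 2433 non-null = 486
print(houseprices_num['LotFrontage'].isnull().sum())
# The rows are found; they just contain nothing but NaN in that column
lf_null = houseprices_num.loc[houseprices_num['LotFrontage'].isnull(), ['LotFrontage']]
print(len(lf_null))      # same count as above
print(lf_null.head())    # all NaN, hence count 0.0 in describe()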

In pandas, how to set an index from a combination of multiple columns?

I have a DataFrame like below:
InvoiceID PayerAccountId ... user:Project user:Purpose
0 314758801 123456789012 ... NaN NaN
1 314758801 123456789012 ... NaN NaN
2 314758801 123456789012 ... NaN NaN
3 314758801 123456789012 ... NaN NaN
4 314758801 123456789012 ... NaN NaN
... ... ... ... ... ...
1726119 NaN 123456789012 ... NaN NaN
1726120 NaN 123456789012 ... NaN NaN
1726121 NaN 123456789012 ... NaN NaN
1726122 NaN 123456789012 ... NaN NaN
1726123 NaN 123456789012 ... NaN NaN
[1726124 rows x 27 columns]
And its info is here:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1726124 entries, 0 to 1726123
Data columns (total 27 columns):
InvoiceID object
PayerAccountId object
LinkedAccountId object
RecordType object
ProductName object
RateId object
SubscriptionId object
UsageType object
Operation object
AvailabilityZone object
ReservedInstance object
ItemDescription object
UsageStartDate datetime64[ns]
UsageEndDate datetime64[ns]
UsageQuantity float64
BlendedRate float64
BlendedCost float64
UnBlendedRate float64
UnBlendedCost float64
ResourceId object
aws:cloudformation:stack-name object
user:Cost object
user:CostNo object
user:Dept object
user:Name object
user:Project object
user:Purpose object
dtypes: datetime64[ns](2), float64(5), object(20)
memory usage: 355.6+ MB
I want to set an index using the columns that are of object type:
Index(['InvoiceID', 'PayerAccountId', 'LinkedAccountId', 'RecordType',
'ProductName', 'RateId', 'SubscriptionId', 'UsageType', 'Operation',
'AvailabilityZone', 'ReservedInstance', 'ItemDescription', 'ResourceId',
'aws:cloudformation:stack-name', 'user:Cost', 'user:CostNo',
'user:Dept', 'user:Name', 'user:Project', 'user:Purpose'],
dtype='object')
Then I want to get the sum of the float columns and the sum of UsageEndDate - UsageStartDate. How can I achieve that? Thanks in advance.
Thanks to @Joshua Maerker for the help; your code inspired me. The final solution is here:
import pandas as pd
import numpy as np
# Define the columns data type
data_type = {
    "UsageStartDate": "datetime64[ns]",
    "UsageEndDate": "datetime64[ns]",
    "UsageQuantity": float,
    "BlendedRate": float,
    "BlendedCost": float,
    "UnBlendedRate": float,
    "UnBlendedCost": float
}
# Read everything as object first (np.float and np.object have been removed from recent NumPy versions)
df = pd.read_csv("data.csv", dtype=object)
# Drop the useless columns
list_drop = ["RecordId", "PricingPlanId"]
df.drop(columns=list_drop, inplace=True)
# Change the type of some column
for k, v in data_type.items():
    df[k] = df[k].astype(v)
# Get the unique attributes
df1 = df.drop(columns=list(data_type.keys())).drop_duplicates().reset_index(drop=True)
# Add the auxiliary column
df["Auxiliary"] = df[df1.columns].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
df1["Auxiliary"] = df1[df1.columns].apply(lambda row: ''.join(row.values.astype(str)), axis=1)
# Add the duration column
df["Duration"] = df["UsageEndDate"] - df["UsageStartDate"]
# Structure the rules for grouped to apply
agg = {
    "UsageQuantity": "sum",
    "Duration": "sum",
    "BlendedCost": "sum",
    "UnBlendedCost": "sum",
}
# Get the result
result = df.groupby("Auxiliary", sort=False).agg(agg)
# Combine the result
cleaned = pd.merge(df1, result, how="inner", on="Auxiliary")
# Drop auxiliary column
df = cleaned.drop(columns="Auxiliary")
# Transfer the result into a MySQL database (engine is assumed to be a SQLAlchemy engine created elsewhere)
df.to_sql(name="cleaned_result", con=engine, if_exists="replace", index=False)
BTW, your function for creating the auxiliary column did NOT work for me; maybe that is because there are some NaN values in my rows.
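That NaN issue can also be avoided without the auxiliary column. A sketch, assuming pandas 1.1 or newer and the df with the Duration column from the solution above: group directly by the list of object columns and pass dropna=False so that rows with NaN in the key columns are not silently dropped:
# key_cols: the object-typed columns that identify a unique record (assumption)
key_cols = df.select_dtypes(include=['object']).columns.tolist()
# Same aggregation rules as above
agg = {"UsageQuantity": "sum", "Duration": "sum",
       "BlendedCost": "sum", "UnBlendedCost": "sum"}
# dropna=False (pandas >= 1.1) keeps groups whose keys contain NaN
cleaned = df.groupby(key_cols, dropna=False, sort=False).agg(agg).reset_index()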
Try this:
# select all columns with object Type
dtypObj = df.select_dtypes(include=['object'])
# create a new column
df['indexstring'] = ""
# iterate over all columns with object type
for column in dtypObj.columns:
    df['indexstring'] = df['indexstring'] + df[column]
# Set new Column as Index
df = df.set_index('indexstring')
# select all float types
dtypFloat = df.select_dtypes(include=['float64', 'float32'])
# sum of all Float Columns
sumFloats = df[dtypFloat.columns].sum()
# sum of UsageEndDate - UsageStartDate
df['sumDifference'] = df["UsageEndDate"] - df["UsageStartDate"]
df['sumDifference'].sum()

I lose my values in the columns

I've organized my data using pandas, and I filled out my procedure as below:
import pandas as pd
import numpy as np
df1 = pd.read_table(r'E:\빅데이터 캠퍼스\골목상권 프로파일링 - 서울 열린데이터 광장 3.초기-16년5월분1\17.상권-추정매출\201301-201605\tbsm_trdar_selng.txt\tbsm_trdar_selng_utf8.txt' , sep='|' ,header=None
,dtype = { '0' : pd.np.int})
df1 = df1.replace('201301', int(201301))
df2 = df1[[0 ,1, 2, 3 ,4, 11,12 ,82 ]]
df2_rename = df2.columns = ['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO' ]
print(df2.head(40))
df3_groupby = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ])
df4_agg = df3_groupby.agg(np.sum)
print(df4_agg.head(30))
When I print df2 I can see the 11947 and 11948 values in my TRDAR_CD column, as in the picture below.
After that, I used the groupby function and I lose my 11948 values in the TRDAR_CD column. You can see this situation in the picture below.
Probably this problem comes from the warning message? The warning message is 'sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.'
help me plz
print(df2.info()) is
RangeIndex: 1089023 entries, 0 to 1089022
Data columns (total 8 columns):
STDR_YM_CD 1089023 non-null object
TRDAR_CD 1089023 non-null int64
TRDAR_CD_NM 1085428 non-null object
SVC_INDUTY_CD 1089023 non-null object
SVC_INDUTY_CD_NM 1089023 non-null object
THSMON_SELNG_AMT 1089023 non-null int64
THSMON_SELNG_CO 1089023 non-null int64
STOR_CO 1089023 non-null int64
dtypes: int64(4), object(4)
memory usage: 66.5+ MB
None
The MultiIndex is created from the first and second columns, and if the first level has duplicates, pandas by default 'sparsifies' the higher levels of the index to make the console output a bit easier on the eyes.
You can show the data in the first level of the MultiIndex by setting display.multi_sparse to False.
Sample:
df = pd.DataFrame({'A':[1,1,3],
                   'B':[4,5,6],
                   'C':[7,8,9]})
df.set_index(['A','B'], inplace=True)
print (df)
C
A B
1 4 7
5 8
3 6 9
#temporary set multi_sparse to False
#http://pandas.pydata.org/pandas-docs/stable/options.html#getting-and-setting-options
with pd.option_context('display.multi_sparse', False):
    print (df)
C
A B
1 4 7
1 5 8
3 6 9
EDIT (based on the edit of the question):
I think the problem is that the value 11948 is stored as a string, so it is omitted.
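A quick way to check whether that is the case (a sketch, assuming df2 as defined in the question) is to look at the Python types actually present in the column and, if they are mixed, coerce it to numeric before grouping:
# Which Python types occur in the column? Mixed int/str would explain the missing group
print(df2['TRDAR_CD'].map(type).value_counts())
# Coerce everything to numeric so '11948' and 11948 end up in the same group
df2['TRDAR_CD'] = pd.to_numeric(df2['TRDAR_CD'], errors='coerce')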
EDIT1 (based on the file):
You can simplify your solution by adding the usecols parameter to read_table and then aggregating with GroupBy.sum:
import pandas as pd
import numpy as np
df2 = pd.read_table(r'tbsm_trdar_selng_utf8.txt' ,
sep='|' ,
header=None ,
usecols=[0 ,1, 2, 3 ,4, 11,12 ,82],
names=['STDR_YM_CD', 'TRDAR_CD', 'TRDAR_CD_NM', 'SVC_INDUTY_CD', 'SVC_INDUTY_CD_NM', 'THSMON_SELNG_AMT', 'THSMON_SELNG_CO', 'STOR_CO'],
dtype = { '0' : int})
df4_agg = df2.groupby(['STDR_YM_CD', 'TRDAR_CD' ]).sum()
print(df4_agg.head(10))
THSMON_SELNG_AMT THSMON_SELNG_CO STOR_CO
STDR_YM_CD TRDAR_CD
201301 11947 1966588856 74798 73
11948 3404215104 89064 116
11949 1078973946 42005 45
11950 1759827974 93245 71
11953 779024380 21042 84
11954 2367130386 94033 128
11956 511840921 23340 33
11957 329738651 15531 50
11958 1255880439 42774 118
11962 1837895919 66692 68

Filtering out strings in a Pandas DataFrame

I have the following formulas that I use to compute data in my DataFrame. The DataFrame consists of downloaded data. My index is made of dates, and the first row contains only strings.
cols = df.columns.values.tolist()
weight = pd.DataFrame([df[col] / df.sum(axis=1) for col in df], index=cols).T
std = pd.DataFrame([df.std(axis=1) for col in df], index=cols).T
A B C D E
2006-04-27 00:00:00 'dd' 'de' 'ede' 'wew' 'were'
2006-04-28 00:00:00 69.62 69.62 6.518 65.09 69.62
2006-05-01 00:00:00 71.5 71.5 6.522 65.16 71.5
2006-05-02 00:00:00 72.34 72.34 6.669 66.55 72.34
2006-05-03 00:00:00 70.22 70.22 6.662 66.46 70.22
2006-05-04 00:00:00 68.32 68.32 6.758 67.48 68.32
2006-05-05 00:00:00 68 68 6.805 67.99 68
2006-05-08 00:00:00 67.88 67.88 6.768 67.56 67.88
The issue I am having is that the formulas I use do not seem to ignore the index and the first indexed row, which contains only strings. Thus I get the following error for the weight formula:
TypeError: Cannot compare type 'Timestamp' with type 'str'
and I get the following error for the std formula:
ValueError: No axis named 1 for object type
You could filter the rows so as to compute weight and standard deviation as follows:
df_string = df.iloc[0] # Assign First row to DF
df_numeric = df.iloc[1:].astype(float) # Assign All rows after first row to DF
cols = df_numeric.columns.values.tolist()
Computing:
weight = pd.DataFrame([df_numeric[col] / df_numeric.sum(axis=1) for col in df_numeric],
index=cols).T
weight
std = pd.DataFrame([df_numeric.std(axis=1) for col in df_numeric],index=cols).T
std
To reassign, say std values back to the original DF, you could do:
df_string_std = df_string.to_frame().T.append(std)
df_string_std
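One note on that last step: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the same re-assembly can be written with pd.concat (a sketch, using df_string and std from above):
import pandas as pd
# Stack the string row back on top of the computed std rows
df_string_std = pd.concat([df_string.to_frame().T, std])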
As the OP had difficulty in reproducing the results, here is the complete summary of the DF used:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8 entries, 2006-04-27 to 2006-05-08
Data columns (total 5 columns):
A 8 non-null object
B 8 non-null object
C 8 non-null object
D 8 non-null object
E 8 non-null object
dtypes: object(5)
memory usage: 384.0+ bytes
df.index
DatetimeIndex(['2006-04-27', '2006-04-28', '2006-05-01', '2006-05-02',
'2006-05-03', '2006-05-04', '2006-05-05', '2006-05-08'],
dtype='datetime64[ns]', name='Date', freq=None)
Starting DF used:
df

Replacing Strings in Column of Dataframe with the number in the string

I currently have a dataframe as follows and all I want to do is just replace the strings in Maturity with just the number within them. For example, I want to replace FZCY0D with 0 and so on.
Date Maturity Yield_pct Currency
0 2009-01-02 FZCY0D 4.25 AUS
1 2009-01-05 FZCY0D 4.25 AUS
2 2009-01-06 FZCY0D 4.25 AUS
My code is as follows. I tried replacing these strings with the numbers, but that led to the error AttributeError: 'Series' object has no attribute 'split' in the line result.Maturity.replace(result['Maturity'], [int(s) for s in result['Maturity'].split() if s.isdigit()]). I am hence struggling to understand how to do this.
from pandas.io.excel import read_excel
import pandas as pd
import numpy as np
import xlrd
url = 'http://www.rba.gov.au/statistics/tables/xls/f17hist.xls'
xls = pd.ExcelFile(url)
#Gets rid of the information that I don't need in my dataframe
df = xls.parse('Yields', skiprows=10, index_col=None, na_values=['NA'])
df.rename(columns={'Series ID': 'Date'}, inplace=True)
# This line assumes you want datetime, ignore if you don't
#combined_data['Date'] = pd.to_datetime(combined_data['Date'])
result = pd.melt(df, id_vars=['Date'])
result['Currency'] = 'AUS'
result.rename(columns={'value': 'Yield_pct'}, inplace=True)
result.rename(columns={'variable': 'Maturity'}, inplace=True)
result.Maturity.replace(result['Maturity'], [int(s) for s in result['Maturity'].split() if s.isdigit()])
print result
You can use the vectorised str methods and pass a regex to extract the number:
In [15]:
df['Maturity'] = df['Maturity'].str.extract('(\d+)')
df
Out[15]:
Date Maturity Yield_pct Currency
0 2009-01-02 0 4.25 AUS
1 2009-01-05 0 4.25 AUS
2 2009-01-06 0 4.25 AUS
You can call astype(int) to cast the series to int:
In [17]:
df['Maturity'] = df['Maturity'].str.extract('(\d+)').astype(int)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 0 to 2
Data columns (total 4 columns):
Date 3 non-null object
Maturity 3 non-null int32
Yield_pct 3 non-null float64
Currency 3 non-null object
dtypes: float64(1), int32(1), object(2)
memory usage: 108.0+ bytes
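Two small, version-related notes (not part of the original answer): on newer Python a raw string avoids an invalid-escape warning for \d, and expand=False makes str.extract return a Series rather than a one-column DataFrame, which is what the assignment expects on current pandas:
df['Maturity'] = df['Maturity'].str.extract(r'(\d+)', expand=False).astype(int)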
