Adding a new column to a dataframe and converting it to a list - Python

Based on the given dataset, I must create a function which takes the original dataframe as input and returns the same dataframe, but with one additional column called subr_faved_by_as_list, containing the same information as subr_faved_by but as a Python list instead of a string.
This is my code:
from urllib import request
import pandas as pd

module_url = "https://raw.githubusercontent.com/luisespinosaanke/cmt309-portfolio/master/data_portfolio_21.csv"
module_name = module_url.split('/')[-1]
print(f'Fetching {module_url}')
with request.urlopen(module_url) as f, open(module_name, 'w') as outf:
    a = f.read()
    outf.write(a.decode('utf-8'))

df = pd.read_csv('data_portfolio_21.csv')
# this fills empty cells with empty strings
df = df.fillna('')
df.info()
Data columns (total 17 columns):
 #   Column                 Non-Null Count  Dtype
---  ------                 --------------  -----
 0   author                 19940 non-null  object
 1   posted_at              19940 non-null  object
 2   num_comments           19940 non-null  int64
 3   score                  19940 non-null  int64
 4   selftext               19940 non-null  object
 5   subr_created_at        19940 non-null  object
 6   subr_description       19940 non-null  object
 7   subr_faved_by          19940 non-null  object
 8   subr_numb_members      19940 non-null  int64
 9   subr_numb_posts        19940 non-null  int64
 10  subreddit              19940 non-null  object
 11  title                  19940 non-null  object
 12  total_awards_received  19940 non-null  int64
 13  upvote_ratio           19940 non-null  float64
 14  user_num_posts         19940 non-null  int64
 15  user_registered_at     19940 non-null  object
 16  user_upvote_ratio      19940 non-null  float64
dtypes: float64(2), int64(6), object(9)
And this is the function:
def transform_faves(df):
    df['subr_faved_by_as_list'] = df['subr_faved_by'].str.split(' ', n=1, expand=True)
    return df

df = transform_faves(df)
I am getting the following error:
list indices must be integers or slices, not str

If you use expand=True, str.split returns a DataFrame rather than a Series, so you need to omit it. You can also omit the space separator, since split() splits on whitespace by default:
def transform_faves(df):
    df['subr_faved_by_as_list'] = df['subr_faved_by'].str.split(n=1)
    return df

df = transform_faves(df)
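A minimal sketch of the difference on a made-up Series of space-separated names (the exact format of subr_faved_by is assumed here):
import pandas as pd

s = pd.Series(['alice bob carol', 'dave erin'])

# expand=True spreads the pieces over DataFrame columns,
# which cannot be assigned to a single column
print(s.str.split(n=1, expand=True))
#        0          1
# 0  alice  bob carol
# 1   dave       erin

# without expand, each row becomes a Python list
print(s.str.split(n=1).tolist())
# [['alice', 'bob carol'], ['dave', 'erin']]

# dropping n=1 as well splits on every whitespace run,
# which is what you want if the column holds a full list of users
print(s.str.split().tolist())
# [['alice', 'bob', 'carol'], ['dave', 'erin']]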

Related

pandas.DataFrame.convert_dtypes increasing memory usage

A question to discuss and understand a bit more about pandas.DataFrame.convert_dtypes.
I have this DF imported from a SAS table:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   cd_unco_tab      857613 non-null  object
 1   cd_ref_cnv       856389 non-null  object
 2   cd_cli           849637 non-null  object
 3   nm_prd           857613 non-null  object
 4   nm_ctgr_cpr      857613 non-null  object
 5   ts_cpr           857229 non-null  datetime64[ns]
 6   ts_cnfc          857613 non-null  datetime64[ns]
 7   ts_incl          857613 non-null  datetime64[ns]
 8   vl_cmss_rec      857613 non-null  float64
 9   qt_prd           857613 non-null  float64
 10  pc_cmss_rec      857242 non-null  float64
 11  nm_loja          857242 non-null  object
 12  vl_brto_cpr      857242 non-null  float64
 13  vl_cpr           857242 non-null  float64
 14  qt_dvlc          857613 non-null  float64
 15  cd_in_evt_espl   857613 non-null  float64
 16  cd_mm_aa_ref     840959 non-null  object
 17  nr_est_ctbc_evt  857613 non-null  float64
 18  nr_est_cnfc_pcr  18963 non-null   float64
 19  cd_tran_pcr      0 non-null       object
 20  ts_est           18963 non-null   datetime64[ns]
 21  tx_est_tran      18963 non-null   object
 22  vl_tran          18963 non-null   float64
 23  cd_pcr           0 non-null       float64
 24  vl_cbac_cli      653563 non-null  float64
 25  pc_cbac_cli      653563 non-null  float64
 26  cd_vndr          18963 non-null   float64
dtypes: datetime64[ns](4), float64(14), object(9)
memory usage: 176.7+ MB
Basically, the DF is composed of datetime64, float64 and object types, none of which are memory efficient (as far as I know).
I read a bit about DataFrame.convert_dtypes for optimizing memory usage; this is the result:
dfcompras = dfcompras.convert_dtypes(infer_objects=True, convert_string=True, convert_integer=True, convert_boolean=True, convert_floating=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 857613 entries, 0 to 857612
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype
---  ------           --------------   -----
 0   cd_unco_tab      857613 non-null  string
 1   cd_ref_cnv       856389 non-null  string
 2   cd_cli           849637 non-null  string
 3   nm_prd           857613 non-null  string
 4   nm_ctgr_cpr      857613 non-null  string
 5   ts_cpr           857229 non-null  datetime64[ns]
 6   ts_cnfc          857613 non-null  datetime64[ns]
 7   ts_incl          857613 non-null  datetime64[ns]
 8   vl_cmss_rec      857613 non-null  Float64
 9   qt_prd           857613 non-null  Int64
 10  pc_cmss_rec      857242 non-null  Float64
 11  nm_loja          857242 non-null  string
 12  vl_brto_cpr      857242 non-null  Float64
 13  vl_cpr           857242 non-null  Float64
 14  qt_dvlc          857613 non-null  Int64
 15  cd_in_evt_espl   857613 non-null  Int64
 16  cd_mm_aa_ref     840959 non-null  string
 17  nr_est_ctbc_evt  857613 non-null  Int64
 18  nr_est_cnfc_pcr  18963 non-null   Int64
 19  cd_tran_pcr      0 non-null       Int64
 20  ts_est           18963 non-null   datetime64[ns]
 21  tx_est_tran      18963 non-null   string
 22  vl_tran          18963 non-null   Float64
 23  cd_pcr           0 non-null       Int64
 24  vl_cbac_cli      653563 non-null  Float64
 25  pc_cbac_cli      653563 non-null  Float64
 26  cd_vndr          18963 non-null   Int64
dtypes: Float64(7), Int64(8), datetime64[ns](4), string(8)
memory usage: 188.9 MB
Most columns were changed from object to string and from float64 to Int64, so I expected memory usage to drop, but as we can see, it increased!
Any guesses?
After doing some analysis, it seems there is additional memory overhead when using the new nullable Int64/Float64 dtypes. Nullable Int64/Float64 values take approximately 9 bytes each, while plain int64/float64 values take 8 bytes: the extra byte per value is the boolean validity mask that tracks which entries are missing.
Here is a small example to demonstrate this:
>>> pd.DataFrame({'col': range(10)}).astype('float64').memory_usage()
Index    128
col       80  # 8 bytes per item * 10 items
dtype: int64

>>> pd.DataFrame({'col': range(10)}).astype('Float64').memory_usage()
Index    128
col       90  # 9 bytes per item * 10 items
dtype: int64
Now, coming back to your example: after executing convert_dtypes, around 15 columns were converted from float64 to Int64/Float64. Let's calculate the amount of extra bytes required to store the data with the new types. The formula is fairly simple: n_columns * n_rows * overhead_in_bytes.
>>> extra_bytes = 15 * 857613 * 1
>>> extra_mega_bytes = extra_bytes / 1024 ** 2
>>> extra_mega_bytes
12.2682523727417
It turns out extra_mega_bytes is around 12.27 MB, which is approximately the difference between the memory usage of your new and old dataframes (188.9 MB - 176.7 MB = 12.2 MB).
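If you want to confirm where the overhead lands, you can compare per-column memory before and after the conversion; a quick sketch, assuming the dataframe is named dfcompras as in the question:
# per-column memory in bytes, before and after convert_dtypes
before = dfcompras.memory_usage()
after = dfcompras.convert_dtypes().memory_usage()

# columns converted to a nullable dtype should show roughly
# n_rows extra bytes (one mask byte per row)
print((after - before).sort_values(ascending=False).head(15))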
Some details about the new nullable integer datatype:
Int64/Float64 (notice the capital first letter) are some of the new nullable types introduced in pandas version 0.24. At a high level, they let you use pd.NA instead of np.nan to represent missing values, and the implication of this is best understood through the following example:
s = pd.Series([1, 2, np.nan])
print(s)
0    1.0
1    2.0
2    NaN
dtype: float64
When you check the dtype of the series s, you'll see that pandas automatically cast it to float64 because of the null value. This is harmless in most cases, but if the column acts as an identifier, the automatic conversion to float is undesirable. To prevent this, pandas introduced the new nullable integer types:
s = pd.Series([1, 2, np.nan], dtype='Int64')
print(s)
0       1
1       2
2    <NA>
dtype: Int64
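As an aside, operations on a nullable series propagate pd.NA rather than silently coercing to float, so the integer dtype is preserved; a small illustrative check:
# arithmetic keeps the Int64 dtype and propagates <NA>
print(s + 1)
# 0       2
# 1       3
# 2    <NA>
# dtype: Int64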
Some details on the string dtype:
As of now there isn't much of a performance or memory difference when using the new string type, but this may change in the near future. See this quote from the pandas docs:
Currently, the performance of object dtype arrays of strings and
StringArray are about the same. We expect future enhancements to
significantly increase the performance and lower the memory overhead
of StringArray.
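You can verify this yourself; a minimal sketch comparing deep memory usage of the same data under both dtypes:
s_obj = pd.Series(['spam', 'eggs', 'ham'] * 1000)  # object dtype
s_str = s_obj.astype('string')                     # StringDtype

# deep=True counts the actual string payloads, not just the pointers
print(s_obj.memory_usage(deep=True))
print(s_str.memory_usage(deep=True))  # roughly the same size today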

Sorting an alphanumeric DataFrame in Pandas

The y-axis of my plot is ordered 1, 11, 12, etc. I would like it to be 1, 2, 3, 4, ..., 10, 11.
df = pd.read_csv("numerical_subset_cleaned.csv",
                 names=["age", "fnlwgt", "educational-num",
                        "capital-gain", "capital-loss", "hours-per-week"])
sns.set_style("darkgrid")
bubble_plot(df, x='age', y='educational-num', fontsize=16, figsize=(15, 10),
            normalization_by_all=True)
My df.info():
<class 'pandas.core.frame.DataFrame'>
Index: 32420 entries, 0.0 to string
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   age              32420 non-null  object
 1   fnlwgt           32420 non-null  object
 2   educational-num  32420 non-null  object
 3   capital-gain     32420 non-null  object
 4   capital-loss     32420 non-null  object
 5   hours-per-week   32420 non-null  object
dtypes: object(6)
memory usage: 1.7+ MB
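Every column shows up as object dtype, which is why the axis values sort lexicographically ('1', '11', '12', ... before '2'); the "Index: ... 0.0 to string" line also hints that a stray header row was read as data. A hedged sketch of one likely fix, converting the columns to numbers before plotting (bubble_plot and the CSV layout are taken from the question):
# coerce the object columns to numbers; non-numeric strays become NaN
for col in df.columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')
df = df.dropna()

# with numeric dtypes the y-axis orders 1, 2, ..., 10, 11 as expected
bubble_plot(df, x='age', y='educational-num', fontsize=16,
            figsize=(15, 10), normalization_by_all=True)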

Python: Pandas df.fillna() function changes all data types into object

I want to fill the features that have null values in my dataframe. But after I fill them, every data type changes to object.
I have a dataframe with these data types:
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   umur           7832 non-null   float64
 1   jenis_kelamin  7840 non-null   object
 2   pekerjaan      7760 non-null   object
 3   provinsi       7831 non-null   object
 4   gaji           7843 non-null   float64
 5   is_menikah     7917 non-null   object
 6   is_keturunan   7917 non-null   object
 7   berat          7861 non-null   float64
 8   tinggi         7843 non-null   float64
 9   sampo          7858 non-null   object
 10  is_merokok     7917 non-null   object
 11  pendidikan     7847 non-null   object
 12  stress         7853 non-null   float64
And I use fillna() to fill the null values in every feature:
# Categorical feature imputation
df['jenis_kelamin'].fillna(df['jenis_kelamin'].mode()[0], inplace=True)
df['pekerjaan'].fillna(df['pekerjaan'].mode()[0], inplace=True)
df['provinsi'].fillna(df['provinsi'].mode()[0], inplace=True)
df['sampo'].fillna(df['sampo'].mode()[0], inplace=True)
df['pendidikan'].fillna(df['pendidikan'].mode()[0], inplace=True)
# Numeric feature imputation
df['umur'].fillna(df['umur'].mean, inplace=True)
df['gaji'].fillna(df['gaji'].mean, inplace=True)
df['berat'].fillna(df['berat'].mean, inplace=True)
df['tinggi'].fillna(df['tinggi'].mean, inplace=True)
df['stress'].fillna(df['stress'].mean, inplace=True)
But after that, every feature's data type has changed to object:
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   umur           7917 non-null   object
 1   jenis_kelamin  7917 non-null   object
 2   pekerjaan      7917 non-null   object
 3   provinsi       7917 non-null   object
 4   gaji           7917 non-null   object
 5   is_menikah     7917 non-null   object
 6   is_keturunan   7917 non-null   object
 7   berat          7917 non-null   object
 8   tinggi         7917 non-null   object
 9   sampo          7917 non-null   object
 10  is_merokok     7917 non-null   object
 11  pendidikan     7917 non-null   object
 12  stress         7917 non-null   object
I think it could work to convert every feature back with astype(), but is there a more efficient way to fill the null values without changing the datatypes?
I think you are missing the brackets on .mean(), so you are filling the series with the method object itself instead of the actual mean value; pandas then upcasts the column to object to hold it.
You want, for example:
df['umur'].fillna(df['umur'].mean(), inplace = True)
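If you'd rather not repeat this per column, a possible sketch that fills every numeric column with its own mean and every categorical column with its mode, preserving dtypes (the numeric/categorical split is assumed from your info() output):
# fill each numeric column with its own mean; float64 dtype is preserved
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# fill each categorical column with its mode (first row of df.mode())
cat_cols = df.select_dtypes(include='object').columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])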

How do I convert integer 'category' dtypes in a Pandas DataFrame to 'int64'/'float64'?

Take a look at the Pandas DataFrame here.
I have certain columns that are strings, and others that are integers/floats. However, all the columns in the dataset are currently formatted with a 'category' dtype.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29744 entries, 0 to 29743
Data columns (total 366 columns):
 #    Column   Non-Null Count  Dtype
---   ------   --------------  -----
 0    ASBG01   29360 non-null  category
 1    ASBG03   28726 non-null  category
 2    ASBG04   28577 non-null  category
 3    ASBG05A  29130 non-null  category
 4    ASBG05B  29055 non-null  category
 5    ASBG05C  29001 non-null  category
 6    ASBG05D  28938 non-null  category
 7    ASBG05E  28938 non-null  category
 8    ASBG05F  29030 non-null  category
 9    ASBG05G  28745 non-null  category
 10   ASBG05H  28978 non-null  category
 11   ASBG05I  28971 non-null  category
 12   ASBG06A  28956 non-null  category
 13   ASBG06B  28797 non-null  category
 14   ASBG07   28834 non-null  category
 15   ASBG08   28955 non-null  category
 16   ASBG09A  28503 non-null  category
 17   ASBG09B  27778 non-null  category
 18   ASBG10A  29025 non-null  category
 19   ASBG10B  28940 non-null  category
 ...
 363  ATDMDAT  13133 non-null  category
 364  ATDMMEM  25385 non-null  category
 365  Target   29744 non-null  float64
dtypes: category(365), float64(1)
memory usage: 60.5 MB
How can I convert all the columns that hold integer/float values to actual integer/float dtypes?
Thanks.
Suppose the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({'cat_str': ['Hello', 'World'],
                   'cat_int': [0, 1],
                   'cat_float': [3.14, 2.71]}, dtype='category')
print(df.dtypes)
# Output
cat_str      category
cat_int      category
cat_float    category
dtype: object
You can try:
dtypes = {col: df[col].cat.categories.dtype for col in df.columns
          if np.issubdtype(df[col].cat.categories.dtype, np.number)}
df = df.astype(dtypes)
print(df.dtypes)
# Output
cat_str      category
cat_int         int64
cat_float     float64
dtype: object
Or if you want to remove all category dtypes, use:
dtypes = {col: df[col].cat.categories.dtype for col in df.columns}
df = df.astype(dtypes)
print(df.dtypes)
# Output
cat_str       object
cat_int        int64
cat_float    float64
dtype: object
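One caveat: your info() output shows missing values in most columns, and astype with a plain int64 target fails on NaN. A hedged variant that routes integer-backed categories through the nullable Int64 dtype instead (behaviour assumed for a reasonably recent pandas):
# map integer-backed category columns to nullable Int64 so NaNs survive
dtypes = {}
for col in df.columns:
    if df[col].dtype == 'category':
        cat_dtype = df[col].cat.categories.dtype
        if np.issubdtype(cat_dtype, np.integer):
            dtypes[col] = 'Int64'    # nullable integer, keeps <NA>
        elif np.issubdtype(cat_dtype, np.floating):
            dtypes[col] = 'float64'  # NaN is representable in floats
df = df.astype(dtypes)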

In python with pandas and chunks, are there ways to read a file faster?

In Python 3 and pandas, I have a script that reads a 9.8 GB CSV file.
I use chunks and search the "cnae_fiscal" column for codes of interest.
The result then goes into a new dataframe. I did it like this:
import pandas as pd
import numpy as np

# Establish chunks in the file and reduce the number of columns that will be used
# Two columns with identifier codes need to be read as str
# And other routines to handle the file
TextFileReader = pd.read_csv(
    'dados/empresa.csv',
    chunksize=1000,
    sep=',',
    header=None,
    names=['indicador_full_diario', 'tipo_de_atualizacao', 'cnpj',
           'identificador_matrizfilial', 'razao_socialnome_empresarial',
           'nome_fantasia', 'situacao_cadastral', 'data_situacao_cadastral',
           'motivo_situacao_cadastral', 'nm_cidade_exterior', 'co_pais',
           'nm_pais', 'codigo_natureza_juridica', 'data_inicio_atividade',
           'cnae_fiscal', 'descricao_tipo_logradouro', 'logradouro', 'numero',
           'complemento', 'bairro', 'cep', 'uf', 'codigo_municipio',
           'municipio', 'ddd_telefone_1', 'ddd_telefone_2', 'ddd_fax',
           'correio_eletronico', 'qualificacao_do_responsavel',
           'capital_social_da_empresa', 'porte_empresa', 'opcao_pelo_simples',
           'data_opcao_pelo_simples', 'data_exclusao_do_simples',
           'opcao_pelo_mei', 'situacao_especial', 'data_situacao_especial'],
    converters={'cnpj': lambda x: str(x),
                'cnae_fiscal': lambda x: str(x)},
    usecols=['cnpj', 'identificador_matrizfilial',
             'razao_socialnome_empresarial', 'nome_fantasia',
             'situacao_cadastral', 'nm_cidade_exterior', 'nm_pais',
             'codigo_natureza_juridica', 'data_inicio_atividade',
             'cnae_fiscal', 'descricao_tipo_logradouro', 'logradouro',
             'numero', 'complemento', 'bairro', 'cep', 'uf', 'municipio',
             'qualificacao_do_responsavel', 'capital_social_da_empresa',
             'porte_empresa', 'situacao_especial'],
    decimal=',')
dfList = []  # Create empty list
# Set a counter: 0 means assign the found values, greater than zero means append
conta = 0

# Iterate over each chunk
for df in TextFileReader:
    dfList.append(df)
    df_parcial = pd.concat(dfList, sort=False)
    # Search for the codes of interest
    nome = df_parcial[((df_parcial['cnae_fiscal'] == '2121101') |
                       (df_parcial['cnae_fiscal'] == '4771701') |
                       (df_parcial['cnae_fiscal'] == '2121103') |
                       (df_parcial['cnae_fiscal'] == '4644301') |
                       (df_parcial['cnae_fiscal'] == '2110600') |
                       (df_parcial['cnae_fiscal'] == '2121102')) &
                      (df_parcial['situacao_cadastral'] == 2)]
    # Check if anything was found
    if nome.empty is False:
        if conta == 0:
            df_final = nome
        else:
            df_final = df_final.append(nome)
        conta = conta + 1
    dfList = []
    df = ''

df_final.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 120154 entries, 9101 to 40183445
Data columns (total 22 columns):
cnpj                            120154 non-null  object
identificador_matrizfilial      120154 non-null  int64
razao_socialnome_empresarial    120154 non-null  object
nome_fantasia                   90585 non-null   object
situacao_cadastral              120154 non-null  int64
nm_cidade_exterior              39 non-null      object
nm_pais                         203 non-null     object
codigo_natureza_juridica        120154 non-null  int64
data_inicio_atividade           120154 non-null  object
cnae_fiscal                     120154 non-null  object
descricao_tipo_logradouro       119251 non-null  object
logradouro                      120152 non-null  object
numero                          119979 non-null  object
complemento                     49883 non-null   object
bairro                          119759 non-null  object
cep                             119951 non-null  float64
uf                              120154 non-null  object
municipio                       120154 non-null  object
qualificacao_do_responsavel     120154 non-null  int64
capital_social_da_empresa       120154 non-null  int64
porte_empresa                   120154 non-null  int64
situacao_especial               63 non-null      object
dtypes: float64(1), int64(6), object(15)
memory usage: 21.1+ MB
I did this on a Mac (2.2 GHz Intel Core i7) with 16 GB of RAM.
It took about 50 minutes.
Are there code techniques to make this processing faster?
Or is it a matter of having a faster computer with more memory?
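A few things commonly help here; what follows is a hedged sketch, not a definitive answer: a much larger chunksize (1,000 rows per chunk means tens of thousands of tiny iterations over a 9.8 GB file), dtype=str instead of per-value lambda converters, isin() instead of six chained comparisons, filtering each chunk on its own rather than re-concatenating, and collecting matches in a list for a single concat at the end (repeated append in a loop grows quadratically). nomes_colunas and colunas_uteis stand in for the names= and usecols= lists above:
codigos = {'2121101', '4771701', '2121103', '4644301', '2110600', '2121102'}

reader = pd.read_csv('dados/empresa.csv',
                     chunksize=500_000,  # far fewer, larger chunks
                     sep=',', header=None,
                     names=nomes_colunas,    # the names= list above
                     usecols=colunas_uteis,  # the usecols= list above
                     dtype={'cnpj': str, 'cnae_fiscal': str},  # faster than converters
                     decimal=',')

partes = []
for chunk in reader:
    # filter each chunk independently; no cumulative concat needed
    achados = chunk[chunk['cnae_fiscal'].isin(codigos) &
                    (chunk['situacao_cadastral'] == 2)]
    if not achados.empty:
        partes.append(achados)

# one concat at the end instead of append() inside the loop
df_final = pd.concat(partes, ignore_index=True)
df_final.info()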
