Loading CSV with Pandas - arrays not parsed correctly - python

I have a dataset which I transformed to CSV as potential input for a Keras autoencoder.
Loading the CSV works flawlessly with pandas.read_csv(), but the data types are not correct.
The CSV contains just two columns: label and features, where the label column contains strings and the features column contains arrays of signed integers ([-1, 1]). So in general a pretty simple structure.
To get two different dataframes for further processing I created them via:
labels = pd.DataFrame(columns=['label'], data=csv_data, dtype='U')
and
features = pd.DataFrame(columns=['features'], data=csv_data)
In both cases I got wrong data types, as both are marked as object-typed dataframes. What am I doing wrong?
For the features it is even harder, because the parsing returns a pandas Series that contains the array as a string: ['[1, ..., 1]'].
So I tried a tedious workaround: parsing the string back to a numpy array via .to_numpy(), a Python cast for every element, and then an np.asarray() - but the type of the dataframe is still incorrect. This cannot be the general approach to this task. As I am fairly new to pandas I checked some tutorials and the API, but in most cases a cell in a dataframe contains a single value rather than a complete array. Maybe my overall design of the dataframe is just not suitable for this task.
Any help appreciated!

You are reading the file as strings, but the column contains a Python list, so you need to evaluate it to get the list back.
I am not sure of the use case, but you can also split the labels for a more readable dataframe:
import ast
import pandas as pd

features = ["featurea", "featureb", "featurec", "featured", "featuree"]
labels = ["[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]", "[1,0,1,1,1,1]"]
df = pd.DataFrame(list(zip(features, labels)), columns=['Features', 'Labels'])

# convert strings to lists
df['Labels'] = df['Labels'].map(ast.literal_eval)
df.index = df['Features']

# since the list itself might not be useful, split and expand it into multiple columns
new_df = pd.DataFrame(df['Labels'].values.tolist(), index=df.index)
Output
          0  1  2  3  4  5
Features
featurea  1  0  1  1  1  1
featureb  1  0  1  1  1  1
featurec  1  0  1  1  1  1
featured  1  0  1  1  1  1
featuree  1  0  1  1  1  1
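As a side note, the parsing can also happen at load time through read_csv's converters argument, which avoids the string round-trip entirely. A minimal sketch, assuming the two-column layout from the question (the file name is hypothetical):

import ast
import pandas as pd

# 'data.csv' is a placeholder for the CSV described in the question
csv_data = pd.read_csv('data.csv', converters={'features': ast.literal_eval})
labels = csv_data['label']        # strings are stored as object dtype, which is expected
features = csv_data['features']   # each cell is now a real Python list, not a string

Note that a column of lists will still report dtype object; that is normal for non-scalar cell contents.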

The input CSV was formatted incorrectly, so the parsing was accurate but not what I intended. I expanded the real columns and skipped the header to have a column for every array entry - now pandas recognizes the types and the correct dimensions.

Cannot set a Categorical with another, without identical categories. Replace almost identical categories

I have the following dataframe:
import numpy as np
import pandas as pd

np.random.seed(3)
s = pd.DataFrame(np.random.choice(['Feijão', 'feijão'], size=[3, 2]), dtype='category')
print(s[0].cat.categories)
print(s[1].cat.categories)
As you can see, the dataframe is basically two similar strings, differing only in one uppercase letter. What I am trying to do is replace the category 'feijão' with 'Feijão'.
When I write the following line of code I get this error:
s.loc[s[0].isin(['feijão']),1] = s.loc[s[0].isin(['feijão']),1].replace({'feijão':'Feijão'})
TypeError: Cannot set a Categorical with another, without identical categories
I was wondering what this error means, and I am also genuinely curious whether filtering the invalid values and replacing only them is the most efficient way of doing this. Should I just use replace without the filter part?
Use DataFrame.update:
s.update(s.loc[s[0].isin(['feijão']), 1].replace({'feijão': 'Feijão'}))
print(s)

        0       1
0  Feijão  Feijão
1  feijão  Feijão
2  Feijão  Feijão
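Another way to sidestep the error: since 'Feijão' is already one of the categories in column 1 of this example, assigning the plain scalar avoids building a Categorical with mismatched categories. A minimal sketch:

# assigning an existing category as a scalar keeps the column categorical
s.loc[s[0] == 'feijão', 1] = 'Feijão'

This fails with a similar error if the value is not an existing category, in which case it has to be added first with cat.add_categories.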

What is the best way to map a list of numerical codes to descriptions in Pandas?

Here is the dataset:
df = pd.read_csv('https://data.lacity.org/api/views/d5tf-ez2w/rows.csv?accessType=DOWNLOAD')
The problem:
I have a pandas dataframe of traffic accidents in Los Angeles.
Each accident has a mo_codes column, which is a string of numerical codes (which I converted into a list of codes).
I also have a dictionary mapping each mo_code to its description, loaded in the notebook.
Now, using the code below I can combine the numeric code with the description:
mo_code_list_final = []
for i in range(20):
    for j in df.mo_codes.iloc[i]:
        print(i, mo_code_dict[j])
So, I haven't added this as a column in Pandas yet. I wanted to ask if there is a better way to solve my problem, which is: how best to add the textual descriptions as a column in pandas?
Also, is there an easier way to process this with a pandas function like .assign instead of the for loop? Maybe a list comprehension to process the mo_codes into a new dataframe with the descriptions?
Thanks in advance.
ps. if there is a technical term for this type of problem, please let me know.
import pandas

codes = {0: 'Test1', 1: 'test 2', 2: 'test 3', 3: 'test 4'}
df1 = pandas.DataFrame([["red", [0, 1, 2], 5], ["blue", [3, 1], 6]], columns=[0, 'codes', 2])

# first explode the list into its own rows
df2 = (df1['codes']
       .apply(pandas.Series)
       .stack()
       .astype(int)
       .reset_index(level=1, drop=True)
       .to_frame('codes')
       .join(df1[[0, 2]]))

# now use map to apply the text descriptions
df2['desc'] = df2['codes'].map(codes)
print(df2)
"""
   codes     0  2    desc
0      0   red  5   Test1
0      1   red  5  test 2
0      2   red  5  test 3
1      3  blue  6  test 4
1      1  blue  6  test 2
"""
I finally figured out how to do this. I found the answer in JavaScript, but the same concept applies.
You simply create a dictionary of mocodes and their string values.
export const mocodesDict = {
  "0100": "Suspect Impersonate",
  "0101": "Aid victim",
  "0102": "Blind",
  "0103": "Crippled",
  ...
}
After that, it's as simple as doing this:
mocodesDict[item]
where item is the code you want to convert.
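Staying in pandas, the same dictionary lookup can be written with explode (available from pandas 0.25) followed by map. A minimal sketch, assuming df.mo_codes already holds lists of codes and mo_code_dict is the dictionary from the question (mo_description is a made-up column name):

# one row per code, then a vectorized dictionary lookup
exploded = df.explode('mo_codes')
exploded['mo_description'] = exploded['mo_codes'].map(mo_code_dict)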

faster replacement of -1 and 0 with NaNs in columns for a large dataset

The 'azdias' dataframe is my main dataset; its metadata, or feature summary, lies in the dataframe 'feat_info'. 'feat_info' lists, for every column, the values that should be treated as NaN.
Ex: column1 has the values [-1, 0] marked as NaN values. So my job is to find every -1 and 0 in column1 and replace it with NaN.
I have tried the following in a Jupyter notebook:
import numpy as np

def NAFunc(x, miss_unknown_list):
    x_output = x
    for i in miss_unknown_list:
        try:
            miss_unknown_value = float(i)
        except ValueError:
            miss_unknown_value = i
        if x == miss_unknown_value:
            x_output = np.nan
            break
    return x_output

for cols in azdias.columns.tolist():
    NAList = feat_info[feat_info.attribute == cols]['missing_or_unknown'].values[0]
    azdias[cols] = azdias[cols].apply(lambda x: NAFunc(x, NAList))
Question 1: I am trying to impute NaN values, but my code is very slow. I wish to speed up my process of execution.
I have attached samples of both dataframes:
azdias_sample

   AGER_TYP  ALTERSKATEGORIE_GROB  ANREDE_KZ  CJT_GESAMTTYP  FINANZ_MINIMALIST
0        -1                     2          1            2.0                  3
1        -1                     1          2            5.0                  1
2        -1                     3          2            3.0                  1
3         2                     4          2            2.0                  4
4        -1                     3          1            5.0                  4

feat_info_sample

attribute             information_level  type         missing_or_unknown
AGER_TYP              person             categorical   [-1,0]
ALTERSKATEGORIE_GROB  person             ordinal       [-1,0,9]
ANREDE_KZ             person             categorical   [-1,0]
CJT_GESAMTTYP         person             categorical   [0]
FINANZ_MINIMALIST     person             ordinal       [-1]
If the azdias dataset is obtained from read_csv or a similar IO function, the na_values keyword argument can be used to specify column-specific missing value representations, making sure the returned data frame already has NaN values in place from the very beginning. Sample code is shown in the following.
from ast import literal_eval
import pandas as pd

feat_info.set_index("attribute", inplace=True)
# A more concise but less efficient alternative is
# na_dict = feat_info["missing_or_unknown"].apply(literal_eval).to_dict()
na_dict = {attr: literal_eval(val) for attr, val in feat_info["missing_or_unknown"].items()}
df_azdias = pd.read_csv("azdias.csv", na_values=na_dict)
As for the data type, there is no built-in NaN representation for integer data types. Hence a float data type is needed. If the missing values are imputed using fillna, the downcast argument can be specified to make the returned series or data frame have an appropriate data type.
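For illustration, a minimal sketch of that fillna/downcast idea on one of the sample columns (AGER_TYP is taken from the sample above; note that the downcast argument of fillna has since been deprecated in recent pandas versions):

# fill the NaNs and let pandas downcast the float column back to integer
# where the result allows it
azdias['AGER_TYP'] = azdias['AGER_TYP'].fillna(0, downcast='infer')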
Try using the DataFrame's replace method. How about this?
for c in azdias.columns.tolist():
    replace_list = feat_info[feat_info['attribute'] == c]['missing_or_unknown'].values
    azdias[c] = azdias[c].replace(to_replace=list(replace_list), value=np.nan)
A couple things I'm not sure about without being able to execute your code:
In your example, you used .values[0]. Don't you want all the values?
I'm not sure if it's necessary to do to_replace=list(replace_list), it may work to just use to_replace=replace_list.
In general, I recommend thinking to yourself "surely Pandas has a function to do this for me." Often, they do. For performance with Pandas generally, avoid looping over and setting things. Vectorized methods tend to be much faster.
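Following that advice here, the per-column loop above can be collapsed into a single vectorized call. A minimal sketch, assuming missing_or_unknown holds strings like '[-1,0]' that parse cleanly with literal_eval, as in the sample:

import numpy as np
from ast import literal_eval

# build {column: {bad_value: NaN, ...}} once, then replace in a single pass
na_map = {
    attr: {v: np.nan for v in literal_eval(vals)}
    for attr, vals in zip(feat_info['attribute'], feat_info['missing_or_unknown'])
}
azdias = azdias.replace(na_map)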

Apply Feature Hashing to specific columns from a DataFrame

I'm a bit lost with the use of Feature Hashing in Python Pandas.
I have a DataFrame with multiple columns, holding information of many different types. One column represents a class for the data.
Example:
   col1  col2  colType
1     1     2      'A'
2     1     1      'B'
3     2     4      'C'
My goal is to apply FeatureHashing for the ColType, in order to be able to apply a Machine Learning Algorithm.
I have created a separate DataFrame for the colType, having something like this:
  colType  value
1     'A'      1
2     'B'      2
3     'C'      3
4     'D'      4
Then I applied Feature Hashing to this class DataFrame. But I don't understand how to add the result of the Feature Hashing back to my DataFrame with the info, in order to use it as an input to a Machine Learning algorithm.
This is how I use FeatureHashing:
from sklearn.feature_extraction import FeatureHasher
fh = FeatureHasher(n_features=10, input_type='string')
result = fh.fit_transform(categoriesDF)
How do I insert this FeatureHasher result into my DataFrame? How bad is my approach? Is there any better way to achieve what I am doing?
Thanks!
I know this answer comes in late, but I stumbled upon the same problem and found this works:
fh = FeatureHasher(n_features=8, input_type='string')
sp = fh.fit_transform(df['colType'])
hashed = pd.DataFrame(sp.toarray(), columns=['fh1', 'fh2', 'fh3', 'fh4', 'fh5', 'fh6', 'fh7', 'fh8'])
df = pd.concat([df, hashed], axis=1)
This creates a dataframe out of the sparse matrix returned by the FeatureHasher and concatenates it to the existing dataframe.
I have switched to One-Hot Encoding, using something like this:
categoriesDF = pd.get_dummies(categoriesDF)
This function will create a column for every distinct category value, filled with 1 or 0.
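Applied to the frame from the question, a minimal sketch (the columns argument restricts the encoding to colType, so col1 and col2 pass through unchanged):

# one-hot encode only the class column
df = pd.get_dummies(df, columns=['colType'])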

Pandas Groupby - Sparse Matrix Error

This question is related to a question I asked previously about using the pandas get_dummies() function (link below).
Pandas Get_dummies for nested tables
However, in the course of utilizing the solution provided in the answer I noticed odd behavior in the groupby function. The issue is that repeated (non-unique) index values for a dataframe appear to cause an error when the matrix is represented in sparse format, while working as expected for a dense matrix.
I have extremely high-dimensional data, thus a sparse matrix will be required for memory reasons. An example of the error is below. If anyone has a workaround it would be greatly appreciated.
Working:
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name')
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Failing:
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Note you will need pandas version 0.16.1 or greater.
Thank you in advance.
You can perform your groupby in a different way as a workaround. Don't set Instance as the index; use the column for your groupby and drop the Instance column afterwards (the last column in this case, since it was just added). Groupby will make an Instance index.
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance

# WORKAROUND: group on the column, then drop it from the output
result = result.groupby('Instance').apply(max)[result.columns[:-1]]
result
Out[58]:
          Name_Doe  Name_Jane  Name_John  Name_Smith
Instance
1                0          0          1           1
2                0          1          0           0
3                1          0          0           0
Note: The sparse dataframe stores your Instance ints as floats within a BlockIndex in the dataframe column. In order to have the index exactly the same as in the first example, you'd need to change it from float to int:
result.index=result.index.map(int)
result.index.name='Instance'
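If the sparse representation is not required at this step, the dense route from the working example also reduces to the idiomatic groupby followed by max, with no apply needed. A minimal sketch:

# dense variant: aggregate the duplicate Instance rows with the built-in max
dense = pd.get_dummies(df.Cat_col, prefix='Name')
dense['Instance'] = df.Instance
result = dense.groupby('Instance').max()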
