Categorical Data mining with python (pandas) (data Cleaning) - python

Can anyone help me out?
I am new to data mining and have a categorical dataset.
I want to clean the data using Python, and I am stuck on the 'signature' column during the data cleaning step. I want to convert that column to numeric values so that I can use it later in a machine learning model.
Here is a two-row sample of the dataset:
print(df.sample(2).to_dict("list"))
{'md5': ['da90ea57d95eb78e2caffdd7344a12d8', '7cd341f0d4ab31ecd4e5dec4f24c23ab'], 'version': ['9.20.0', '2.73'], 'min_sdk': ['18', '15'], 'min_screen': ['small', 'small'], 'min_opengl': [2.0, nan], 'supported_cpu': ['arm64-v8a', 'arm64-v8a, armeabi, armeabi-v7a, mips, mips64, x86, x86_64'], 'signature': ['38:91:8A:45:3D:07:19:93:54:F8:B1:9A:F0:5E:C6:56:2C:ED:57:88', '98:24:D8:09:21:56:CE:EC:84:3F:66:04:EF:7D:10:D0:ED:BF:4E:B7'], 'rating_number': [0.0, 4.1], 'rating_count': [0, 125], 'androidpermissionbind_wallpaper': [0, 0], 'androidpermissionforce_back': [0, 0], 'androidpermissionread_calendar': [0, 0], 'androidpermissionbody_sensors': [0, 0], 'androidpermissionread_social_stream': [0, 0], 'androidpermissionread_sync_stats': [0, 0], 'androidpermissioninternet': [1, 1], 'androidpermissionchange_configuration': [0, 0], 'androidpermissionbind_dream_service': [0, 0], 'androidpermissionhardware_test': [0, 0], 'comandroidbrowserpermissionwrite_history_bookmarks': [0, 0], 'comandroidlauncherpermissioninstall_shortcut': [1, 1], 'androidpermissionbind_tv_input': [0, 0], 'androidpermissionbind_vpn_service': [0, 0], 'androidpermissionbluetooth_privileged': [0, 0], 'androidpermissionwrite_call_log': [0, 0], 'androidpermissionchange_wifi_multicast_state': [0, 0], 'androidpermissionbind_input_method': [0, 0], 'androidpermissionset_time_zone': [0, 0], 'androidpermissionwrite_sync_settings': [1, 1], 'androidpermissionwrite_gservices': [0, 0], 'androidpermissionset_orientation': [0, 0], 'androidpermissionbind_device_admin': [0, 0], 'androidpermissionmanage_documents': [0, 0], 'androidpermissionforce_stop_packages': [0, 0], 'androidpermissionwrite_secure_settings': [0, 0], 'androidpermissioncall_privileged': [0, 0], 'androidpermissionmount_format_filesystems': [0, 0], 'androidpermissionsystem_alert_window': [0, 0], 'androidpermissionaccess_location_extra_commands': [0, 0], 'androidpermissionbrick': [0, 0], 'androidpermissiondump': [0, 0], 'androidpermissionchange_wifi_state': [0, 
0], 'androidpermissionrecord_audio': [0, 0], 'androidpermissionmodify_phone_state': [0, 0], 'androidpermissionread_profile': [0, 0], 'androidpermissionaccount_manager': [0, 0], 'androidpermissionset_animation_scale': [0, 0], 'androidpermissionset_process_limit': [0, 0], 'androidpermissioncapture_secure_video_output': [0, 0], 'androidpermissionset_preferred_applications': [0, 0], 'androidpermissionaccess_all_downloads': [0, 0], 'androidpermissionset_debug_app': [0, 0], 'androidpermissionstop_app_switches': [0, 0], 'androidpermissionbluetooth': [0, 0], 'androidpermissionaccess_wifi_state': [1, 1], 'androidpermissionset_wallpaper_hints': [0, 0], 'androidpermissionbind_notification_listener_service': [0, 0], 'androidpermissionmms_send_outbox_msg': [0, 0], 'androidpermissioncontrol_location_updates': [0, 0], 'androidpermissionupdate_app_ops_stats': [0, 0], 'androidpermissionreboot': [0, 0], 'androidpermissionbroadcast_wap_push': [0, 0], 'comandroidlauncher3permissionread_settings': [0, 0], 'androidpermissionaccess_network_state': [1, 1], 'androidpermissionstatus_bar': [0, 0], 'androidpermissionwrite_user_dictionary': [0, 0], 'comandroidbrowserpermissionread_history_bookmarks': [0, 0], 'androidpermissionbroadcast_package_removed': [0, 0], 'androidpermissionreceive_sms': [0, 0], 'androidpermissionwrite_contacts': [0, 0], 'androidpermissionread_contacts': [0, 0], 'androidpermissionbind_appwidget': [0, 0], 'androidpermissionsignal_persistent_processes': [0, 0], 'androidpermissioninstall_location_provider': [0, 0], 'androidpermissionaccess_download_manager_advanced': [0, 0], 'androidpermissionwrite_settings': [0, 0], 'androidpermissionmaster_clear': [0, 0], 'androidpermissionread_input_state': [0, 0], 'androidpermissionmanage_app_tokens': [0, 0], 'androidpermissionbind_remoteviews': [0, 0], 'comandroidemailpermissionaccess_provider': [0, 0], 'androidpermissionbind_voice_interaction': [0, 0], 'comandroidlauncherpermissionwrite_settings': [0, 0], 
'comandroidgallery3dfiltershowpermissionread': [0, 0], 'androidpermissionbind_print_service': [0, 0], 'androidpermissionmodify_audio_settings': [0, 0], 'androidpermissionuse_sip': [0, 0], 'androidpermissionwrite_apn_settings': [0, 0], 'androidpermissionaccess_surface_flinger': [0, 0], 'androidpermissionfactory_test': [0, 0], 'androidpermissionread_logs': [1, 1], 'androidpermissionprocess_outgoing_calls': [0, 0], 'androidpermissionupdate_device_stats': [0, 0], 'androidpermissionsend_download_completed_intents': [0, 0], 'androidpermissionwrite_calendar': [0, 0], 'androidpermissionnfc': [0, 0], 'androidpermissionmanage_accounts': [1, 1], 'androidpermissionsend_sms': [0, 0], 'androidpermissioninteract_across_users_full': [0, 0], 'androidpermissionaccess_mock_location': [0, 0], 'androidpermissionbind_accessibility_service': [0, 0], 'androidpermissioncapture_audio_output': [0, 0], 'androidpermissionwrite_sms': [0, 0], 'androidpermissionget_tasks': [0, 0], 'androidpermissiondelete_packages': [0, 0], 'androidpermissionaccess_checkin_properties': [0, 0], 'androidpermissionsend_respond_via_message': [0, 0], 'androidpermissionmedia_content_control': [0, 0], 'androidpermissiondownload_without_notification': [0, 0], 'androidpermissionreceive_boot_completed': [0, 0], 'androidpermissionvibrate': [0, 0], 'androidpermissiondiagnostic': [0, 0], 'androidpermissionwrite_profile': [0, 0], 'androidpermissioncall_phone': [0, 0], 'androidpermissionflashlight': [0, 0], 'androidpermissionread_phone_state': [1, 1], 'androidpermissionchange_component_enabled_state': [0, 0], 'androidpermissionclear_app_user_data': [0, 0], 'androidpermissionbroadcast_sms': [0, 0], 'androidpermissionkill_background_processes': [0, 0], 'androidpermissionread_frame_buffer': [0, 0], 'androidpermissionsubscribed_feeds_write': [0, 0], 'androidpermissioncamera': [0, 0], 'androidpermissionreceive_mms': [0, 0], 'androidpermissionwake_lock': [0, 0], 'androidpermissionaccess_download_manager': [0, 0], 
'comandroidlauncher3permissionwrite_settings': [0, 0], 'androidpermissiondelete_cache_files': [0, 0], 'androidpermissionrestart_packages': [0, 0], 'androidpermissionget_accounts': [1, 1], 'androidpermissionsubscribed_feeds_read': [0, 0], 'androidpermissionchange_network_state': [0, 0], 'androidpermissionread_sync_settings': [1, 1], 'androidpermissiondisable_keyguard': [0, 0], 'comandroidlauncherpermissionuninstall_shortcut': [1, 1], 'androidpermissionuse_credentials': [1, 1], 'androidpermissionread_user_dictionary': [0, 0], 'androidpermissionwrite_media_storage': [0, 0], 'androidpermissionaccess_coarse_location': [1, 1], 'comandroidemailpermissionread_attachment': [0, 0], 'androidpermissionset_pointer_speed': [0, 0], 'androidpermissionbackup': [0, 0], 'androidpermissionexpand_status_bar': [0, 0], 'androidpermissionbluetooth_admin': [0, 0], 'androidpermissionaccess_fine_location': [0, 0], 'androidpermissionlocation_hardware': [0, 0], 'androidpermissionpersistent_activity': [0, 0], 'androidpermissionreorder_tasks': [0, 0], 'androidpermissionbind_text_service': [0, 0], 'androidpermissiondevice_power': [0, 0], 'androidpermissionset_wallpaper': [0, 0], 'androidpermissionread_call_log': [0, 0], 'androidpermissionwrite_external_storage': [1, 1], 'androidpermissionget_package_size': [0, 0], 'androidpermissionwrite_social_stream': [0, 0], 'androidpermissionread_external_storage': [0, 0], 'androidpermissioninstall_packages': [0, 0], 'androidpermissionauthenticate_accounts': [1, 1], 'comandroidlauncherpermissionread_settings': [0, 0], 'comandroidalarmpermissionset_alarm': [0, 0], 'androidpermissioninternal_system_window': [0, 0], 'androidpermissionclear_app_cache': [0, 0], 'androidpermissioncapture_video_output': [0, 0], 'androidpermissionget_top_activity_info': [0, 0], 'androidpermissioninject_events': [0, 0], 'androidpermissionset_activity_watcher': [0, 0], 'androidpermissionread_sms': [0, 0], 'androidpermissionbattery_stats': [0, 0], 'androidpermissionglobal_search': [0, 
0], 'androidpermissionbind_nfc_service': [0, 0], 'androidpermissionpackage_usage_stats': [0, 0], 'androidpermissionset_always_finish': [0, 0], 'androidpermissionaccess_drm': [0, 0], 'androidpermissionbroadcast_sticky': [0, 0], 'androidpermissionmount_unmount_filesystems': [0, 0], 'label': ['malware', 'malware']}
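One possible way to make the signature column numeric, sketched under assumptions: each signature string looks like a certificate fingerprint, so it is treated here as a nominal identifier rather than as a quantity. The three-row frame and the column names signature_code and signature_freq are made up for illustration. pd.factorize assigns an arbitrary integer to each distinct value, and the frequency column counts how many rows share each signature.

```python
import pandas as pd

# Toy frame reusing the two signatures from the sample above (one repeated).
df = pd.DataFrame({
    "signature": [
        "38:91:8A:45:3D:07:19:93:54:F8:B1:9A:F0:5E:C6:56:2C:ED:57:88",
        "98:24:D8:09:21:56:CE:EC:84:3F:66:04:EF:7D:10:D0:ED:BF:4E:B7",
        "38:91:8A:45:3D:07:19:93:54:F8:B1:9A:F0:5E:C6:56:2C:ED:57:88",
    ]
})

# Integer label per distinct signature, numbered in order of first appearance.
codes, uniques = pd.factorize(df["signature"])
df["signature_code"] = codes

# Frequency encoding: how many rows share each signature.
df["signature_freq"] = df["signature"].map(df["signature"].value_counts())

print(df[["signature_code", "signature_freq"]])
```

The integer codes carry no order, so they are probably a better fit for tree-based models than for linear ones; whether the signature is a useful feature at all depends on the data.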

Related

Filling a Pandas table with values

First of all, sorry for the somewhat misleading title, but I didn't know how to word it properly. My problem is that I don't know how to fill a Pandas table with the values below. I need a table with rows and columns going from 0 to X (where X is the maximum value in the "[X,Y]" brackets below), which I managed to make with Numpy, but I don't know how to insert the corresponding value_counts() data.
import pandas as pd
det_vect1 = [[0, 1], [1, 1], [0, 1], [0, 1], [1, 0], [0, 0], [0, 0], [0, 0], [1, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 0], [0, 1], [0, 0], [3, 1], [0, 0], [0, 0], [0, 0], [2, 0], [1, 0], [3, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [1, 0], [0, 0], [0, 1], [0, 0], [1, 0], [0, 0], [2, 1], [0, 0], [0, 0], [1, 0], [0, 0], [0, 0], [0, 1], [1, 0], [0, 0], [0, 0], [0, 0], [2, 0], [0, 0], [0, 1], [0, 0], [0, 0], [6, 0], [0, 1], [0, 1], [2, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 1], [1, 0], [2, 1], [0, 0], [0, 1], [0, 0], [0, 1], [1, 0], [0, 1], [0, 1], [0, 0], [0, 1], [0, 1], [1, 0], [0, 0], [1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [1, 0], [2, 0], [0, 0], [1, 0], [0, 1], [0, 0], [2, 1], [0, 1], [2, 0], [1, 0], [0, 1], [0, 0], [0, 0], [0, 1], [0, 0], [1, 0], [2, 0], [0, 2], [1, 0], [0, 1], [0, 1], [0, 0], [1, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 0], [0, 1], [0, 1], [0, 0], [0, 1], [1, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 1], [0, 0], [0, 1], [0, 0], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [1, 0], [0, 0], [1, 0], [0, 1], [2, 1], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [1, 0], [0, 0], [0, 1], [1, 0], [0, 3], [0, 0], [0, 0], [1, 1], [0, 1], [0, 0], [0, 0], [1, 0], [0, 0], [1, 0], [0, 0], [0, 1], [1, 2], [0, 1], [1, 0], [1, 0], [1, 1], [0, 1], [1, 0], [0, 0], [0, 1], [0, 0], [0, 3], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [2, 1], [0, 1], [1, 1], [0, 0], [0, 0], [0, 1], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1], [0, 1], [0, 0]]
print(pd.Series(det_vect1).value_counts())
which gives
[0, 0] 116
[0, 1] 64
[1, 0] 27
[2, 0] 6
[1, 1] 5
[2, 1] 5
[0, 3] 2
[3, 1] 1
[3, 0] 1
[6, 0] 1
[0, 2] 1
[1, 2] 1
dtype: int64
So I need something like this (sorry for the scrappy Paint example with random values):
Of course the non-existing values can be easily filled with zeros, no problem.
Thank you in advance!
Use a DataFrame constructor, not Series, then unstack:
out = pd.DataFrame(det_vect1).value_counts().unstack(fill_value=0)
Or, with crosstab:
df_tmp = pd.DataFrame(det_vect1)
df = pd.crosstab(df_tmp[0], df_tmp[1])
output:
1    0   1  2  3
0
0  116  64  1  2
1   27   5  1  0
2    6   5  0  0
3    1   1  0  0
6    1   0  0  0
For a complete index of values, reindex:
(pd.DataFrame(det_vect1)
   .value_counts().unstack(fill_value=0)  # or crosstab alternative
   .pipe(lambda d: d.reindex(index=range(d.index.max()+1),
                             columns=range(d.columns.max()+1),
                             fill_value=0
                             )
         )
   .rename_axis(index=None, columns=None)
)
output:
     0   1  2  3
0  116  64  1  2
1   27   5  1  0
2    6   5  0  0
3    1   1  0  0
4    0   0  0  0
5    0   0  0  0
6    1   0  0  0
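Since the question mentions building the empty table with NumPy, the same kind of count table can probably also be filled without pandas. This is only a sketch on a shortened, made-up list of pairs (the names pairs and table are illustrative, not from the original post). It builds a square table from 0 to the overall maximum, and np.add.at accumulates correctly when the same [row, column] pair appears more than once.

```python
import numpy as np

# Shortened, made-up sample of [row, column] pairs.
pairs = np.array([[0, 1], [1, 1], [0, 1], [0, 0], [3, 1], [0, 0]])

# Square table covering 0..X, where X is the largest value in any pair.
size = pairs.max() + 1
table = np.zeros((size, size), dtype=int)

# Unbuffered in-place add: repeated index pairs are each counted.
np.add.at(table, (pairs[:, 0], pairs[:, 1]), 1)

print(table)
```

A plain `table[rows, cols] += 1` would count each repeated pair only once, which is why np.add.at is used here.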

Strange behavior of skimage.morphology.skeletonize_3d

It is strange that skimage.morphology.skeletonize_3d removes all elements when applied to the structure below. The structure is an equilateral triangle in 3D space.
import numpy as np
from skimage import morphology

array = np.array([
    [[0, 1, 0],
     [0, 0, 1],
     [0, 0, 0]],
    [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]],
    [[0, 0, 0],
     [0, 0, 0],
     [0, 0, 0]]]).astype('uint8')
morphology.skeletonize_3d(array)
Output:
array([[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]], dtype=uint8)
The result is an all-zero array. Could anyone explain why this happens and how to avoid it?

Change all positive values in array to 1 (Python)

So I have several 3D arrays that I need to add together. Each array contains only 0s and 1s, and all arrays have the same dimensions. When I add them together, some of the nonzero entries overlap, so the sum contains values such as 2 or 3. However, I only need the structure of the combined array: any position where one or more arrays have a 1 should simply be 1, and any position that is zero in every array should remain zero.
So basically what I have is:
array1 =
[[[1, 0, 0], [0, 0, 0], [0, 0, 0]],
[[0, 1, 0], [0, 0, 0], [0, 0, 0]],
[[0, 0, 1], [1, 1, 1], [0, 0, 0]]]
array2 =
[[[1, 0, 0], [0, 1, 0], [0, 0, 0]],
[[0, 0, 0], [1, 1, 0], [0, 0, 0]],
[[0, 0, 1], [0, 1, 0], [0, 0, 0]]]
So when adding them together I get:
array_total = array1 + array2 =
[[[2, 0, 0], [0, 1, 0], [0, 0, 0]],
[[0, 1, 0], [1, 1, 0], [0, 0, 0]],
[[0, 0, 2], [1, 2, 1], [0, 0, 0]]]
Where I actually want it to give me:
array_total = array1 + array2 =
[[[1, 0, 0], [0, 1, 0], [0, 0, 0]],
[[0, 1, 0], [1, 1, 0], [0, 0, 0]],
[[0, 0, 1], [1, 1, 1], [0, 0, 0]]]
So can anyone give me a hint on how this is done?
(Assuming those are numpy arrays, or array1 + array2 would behave differently).
If you want to "change all positive values to 1", you can do this
array_total[array_total > 0] = 1
But what you actually want is an array that has a 1 where array1 or array2 has a 1, so just write it directly like that:
array_total = array1 | array2
Example:
>>> array1 = np.array([[[1, 0, 0], [0, 0, 0], [0, 0, 0]],
... [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
... [[0, 0, 1], [1, 1, 1], [0, 0, 0]]])
>>> array2 = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 0]],
... [[0, 0, 0], [1, 1, 0], [0, 0, 0]],
... [[0, 0, 1], [0, 1, 0], [0, 0, 0]]])
>>> array1 | array2
array([[[1, 0, 0], [0, 1, 0], [0, 0, 0]],
[[0, 1, 0], [1, 1, 0], [0, 0, 0]],
[[0, 0, 1], [1, 1, 1], [0, 0, 0]]])
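The question mentions several arrays, not just two. A minimal sketch of one way to extend the same idea, assuming all inputs are NumPy arrays of 0s and 1s with identical shapes: np.logical_or.reduce combines any number of them at once. The third array, array3, is hypothetical and added only to show more than two inputs.

```python
import numpy as np

array1 = np.array([[[1, 0, 0], [0, 0, 0], [0, 0, 0]],
                   [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
                   [[0, 0, 1], [1, 1, 1], [0, 0, 0]]])
array2 = np.array([[[1, 0, 0], [0, 1, 0], [0, 0, 0]],
                   [[0, 0, 0], [1, 1, 0], [0, 0, 0]],
                   [[0, 0, 1], [0, 1, 0], [0, 0, 0]]])

# Hypothetical third array with a single 1 in the last corner.
array3 = np.zeros_like(array1)
array3[2, 2, 2] = 1

# Element-wise OR across all arrays, converted back from bool to 0/1.
combined = np.logical_or.reduce([array1, array2, array3]).astype(int)

print(combined)
```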

label 3d numpy array with scipy.ndimage.label

I've got a large 3d numpy array which consists of ones and zeros. I would like to use the scipy.ndimage.label tool to label the features in each sub-array (2d).
A subset of the 3d-array looks like:
import numpy as np
from scipy.ndimage import label
subset = np.array([[[1, 0, 0],
[1, 0, 1],
[0, 0, 0]],
[[0, 0, 0],
[1, 0, 1],
[0, 0, 1]],
[[0, 0, 0],
[1, 0, 0],
[0, 1, 1]],
[[0, 0, 0],
[1, 0, 0],
[1, 1, 1]]], dtype=np.uint8)
When I use the label tool on a small part of this subset, it works correctly:
>>>label(subset[0:3])
(array([[[1, 0, 0],
[1, 0, 2],
[0, 0, 0]],
[[0, 0, 0],
[1, 0, 2],
[0, 0, 2]],
[[0, 0, 0],
[1, 0, 0],
[0, 2, 2]]]), 2)
However, when I use the entire subset the label tool is not working properly:
>>>label(subset)
(array([[[1, 0, 0],
[1, 0, 1],
[0, 0, 0]],
[[0, 0, 0],
[1, 0, 1],
[0, 0, 1]],
[[0, 0, 0],
[1, 0, 0],
[0, 1, 1]],
[[0, 0, 0],
[1, 0, 0],
[1, 1, 1]]]), 1)
Any ideas how this problem can be tackled?
ps.
The complete array which I am trying to label consists of 350219 2d arrays.
I answered this question with the help of dan-man.
I had to define a new 3D structure for the label tool:
import numpy as np
from scipy.ndimage import label
str_3D = np.array([[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],
[[0, 1, 0],
[1, 1, 1],
[0, 1, 0]],
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]], dtype='uint8')
Now the label returns the following for my subset:
>>> label(subset, structure=str_3D)
# outputs:
(array([[[1, 0, 0],
[1, 0, 2],
[0, 0, 0]],
[[0, 0, 0],
[3, 0, 4],
[0, 0, 4]],
[[0, 0, 0],
[5, 0, 0],
[0, 6, 6]],
[[0, 0, 0],
[7, 0, 0],
[7, 7, 7]]]), 7)
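The same structuring element can also be built programmatically instead of typed out by hand. This is a sketch, not part of the original answer: it places the 2D cross from scipy.ndimage.generate_binary_structure in the middle slice of an otherwise empty 3x3x3 array, so that voxels are connected only within each 2D sub-array and never across the first axis. The names structure, labeled and n_features are illustrative.

```python
import numpy as np
from scipy.ndimage import generate_binary_structure, label

subset = np.array([[[1, 0, 0],
                    [1, 0, 1],
                    [0, 0, 0]],
                   [[0, 0, 0],
                    [1, 0, 1],
                    [0, 0, 1]],
                   [[0, 0, 0],
                    [1, 0, 0],
                    [0, 1, 1]],
                   [[0, 0, 0],
                    [1, 0, 0],
                    [1, 1, 1]]], dtype=np.uint8)

# Empty 3x3x3 element: no connectivity to the slices before or after.
structure = np.zeros((3, 3, 3), dtype=int)
# Middle slice: 2D cross (4-connectivity) within a single sub-array.
structure[1] = generate_binary_structure(2, 1)

labeled, n_features = label(subset, structure=structure)
print(n_features)
```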

Python matrix: instead of updating an element it updates all rows [duplicate]

This question already has answers here:
List of lists changes reflected across sublists unexpectedly
(17 answers)
Closed 6 years ago.
c_k_list = [[0, 0]]*(sorted_degrees[len(sorted_degrees)-1]+1)
c_k_list[entry[1]][0] = c_k_list[entry[1]][0]+1
where entry[1]=1
In the above statement, instead of adding 1 to a particular element in c_k_list, it adds 1 to all the rows.
Eg:
c_k_list is
[[1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0], [1, 0]]
instead of
[[0,0], [1,0], [0,0]......[0,0]]
Lists are objects and are stored by reference. Multiplying a list of lists with * does not copy the inner list; it just repeats the reference, so every row is the same object. To correct this, build each row separately:
c_k_list = [[0, 0] for i in range(5)]
c_k_list[1][0] = c_k_list[1][0]+1
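A small side-by-side check can make the difference visible. The names shared, independent and same_object are chosen only for this illustration:

```python
# Three references to one and the same inner list.
shared = [[0, 0]] * 3
shared[1][0] += 1

# Three separate inner lists, one created per loop iteration.
independent = [[0, 0] for _ in range(3)]
independent[1][0] += 1

# Identity check: the rows of `shared` are the same object.
same_object = shared[0] is shared[1]

print(shared)
print(independent)
print(same_object)
```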
