Command to get a non-numeric value from a dataset? - python

New to Python here. As part of an assessment I need to collect data samples from a dataset. The information has been put through a LabelEncoder with:
le = LabelEncoder()
for i in columns:
    #print(i)
    data[i] = le.fit_transform(data[i])
data.head()
This shows the table below.
If I use the command:
data['native-country'].value_counts()
I get numerical values, but at this point I want to see the actual country rather than the numerical value assigned. How do I do this?
Thanks.

The function value_counts returns a Series whose index entries are the values being counted. Since the values in the dataframe are numbers, that's what you get.
You can use the phone_iso3166 library to look up the numeric country codes (I assume they are telephone prefixes) and update the index:
df
# Out:
#    col_1  native_country
# 0      3              39
# 1      2              39
# 2      1              20
vc = df['native_country'].value_counts()
vc
# 39    2
# 20    1
# Name: native_country, dtype: int64
Import the library and look up the country codes:
from phone_iso3166.country import phone_country
vc.to_frame().set_index(vc.index.map(phone_country))
# Out:
#     native_country
# IT               2
# EG               1
vc.to_frame().set_index(vc.index.map(phone_country)) \
  .rename(columns={'native_country': 'count'})
# Out:
#     count
# IT      2
# EG      1
Or use any other suitable dictionary/library for converting the codes to country names.
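Alternatively, since the numeric codes in the question came from scikit-learn's LabelEncoder, you can invert the encoding itself. A minimal sketch, assuming one encoder is kept per column (the question's loop reuses a single le, so only the last column fitted could be inverted that way):
from sklearn.preprocessing import LabelEncoder

encoders = {}
for i in columns:
    encoders[i] = LabelEncoder()  # one encoder per column
    data[i] = encoders[i].fit_transform(data[i])

vc = data['native-country'].value_counts()
# map the numeric codes back to the original country strings
vc.index = encoders['native-country'].inverse_transform(vc.index)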

Related

Ignore Pandas Dataframe indexes which are not intended to be mapped using map function

I have the following dataframe
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,James,NaN,3,no
e,Emily,9.0,2,no
I am trying to use the pandas map function to update the name column to a test value 99 wherever the name is either James or Emily.
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes)
dff
I am getting the following output:
index,name,score,attempts,qualify
a,NaN,12.5,1,yes
b,NaN,9.0,3,no
c,NaN,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
Note that the name column values James and Emily have been updated to 99, but the rest of the name values are mapped to NaN.
How can we ignore indexes which are not intended to be mapped?
The issue is that map builds the new column entirely from the dictionary, so every value without a dictionary entry becomes NaN, not just the ones you specified. To get around this, you can use the replace method instead:
dff['name'] = dff['name'].replace({'James':'99','Emily':'99'})
This will replace only the specified values and leave the others unchanged.
I believe you may be looking for replace instead of map.
import pandas as pd

names = pd.Series([
    "Anastasia",
    "Dima",
    "Katherine",
    "James",
    "Emily",
])
names.replace({"James": "99", "Emily": "99"})
# 0    Anastasia
# 1         Dima
# 2    Katherine
# 3           99
# 4           99
# dtype: object
If you're really set on using map, then you have to provide a function that knows how to handle every single name it might encounter.
codes = {"James": "99", "Emily": "99"}
# If the lookup into `code` fails,
# return the name that was used for lookup
names.map(lambda name: codes.get(name, name))
codes = {'James': '99',
         'Emily': '99'}
dff['name'] = dff['name'].replace(codes)
dff
replace() satisfies the requirement:
index,name,score,attempts,qualify
a,Anastasia,12.5,1,yes
b,Dima,9.0,3,no
c,Katherine,16.5,2,yes
d,99,NaN,3,no
e,99,9.0,2,no
One way to achieve it is to map and then fill the unmapped entries back in from the original column with fillna:
codes = {'James':'99','Emily':'99'}
dff['name'] = dff['name'].map(codes).fillna(dff['name'])
dff
  index       name  score  attempts qualify
0     a  Anastasia   12.5         1     yes
1     b       Dima    9.0         3      no
2     c  Katherine   16.5         2     yes
3     d         99    NaN         3      no
4     e         99    9.0         2      no

Use a value from one dataframe to lookup the value in another and return an adjacent cell value and update the first dataframe value

I have 2 datasets (dataframes), one called source and the other crossmap. I am trying to find rows with a specific column value starting with "999"; if one is found, I need to look up the complete value of that column (e.g. "99912345") in the crossmap dataset (dataframe) and return the value from a column on that row of the crossmap.
# Source Dataframe
        0         1  2          3      4
   ------  --------  -  ---------  -----
0  303290    544981  2  408300622  85882
1  321833  99910722  1  408300902  85897
2  323241  99902978  3  408056001  95564
# Cross Map Dataframe
      ID  NDC ID  DIN(NDC)  GTIN            NAME                    PRDID
 -------  ------  --------  --------------  ----------------------  -----
   44563  321833  99910722  99910722000000  SALBUTAMOL SULFATE (A)  90367
   69281  321833  99910722  99910722000000  SALBUTAMOL SULFATE (A)  90367
 6002800  323241  99902978  75402850039706  EPINEPHRINE (A)         95564
 8001116  323241  99902978  99902978000000  EPINEPHRINE (A)         95564
The 'straw dog' logic I am working with is this:
1. search the source file and find '999' entries in column 1:
df_source[df_source['Column1'].str.contains('999')]
2. iterate through the rows returned, look up the column-1 value in the crossmap dataframe column DIN(NDC), and return the corresponding PRDID
3. update the source dataframe with the PRDID, and write the updated file
It is the last two pieces that I am struggling with. Appreciate any direction/guidance anyone can provide.
Is there maybe a better/easier means of doing this using Python but not pandas/dataframes?
So, if I understood you correctly: we are looking for values starting with 999 in the first column of the Source dataframe. Next, we find those values in the Cross Map column DIN(NDC) and take the values of the PRDID column on those rows.
If everything is correct, then I don't quite follow what your further steps should be, but here is a start:
import pandas as pd
import more_itertools as mit

Cross_Map = pd.DataFrame({'DIN(NDC)': [99910722, 99910722, 99902978, 99902978],
                          'PRDID': [90367, 90367, 95564, 95564]})
df = pd.DataFrame({0: [303290, 321833, 323241], 1: [544981, 99910722, 99902978], 2: [2, 1, 3],
                   3: [408300622, 408300902, 408056001], 4: [85882, 85897, 95564]})

m = [i for i in df[1] if str(i)[:3] == '999']  # find the '999...' values in column 1
index = list(mit.locate(list(Cross_Map['DIN(NDC)']), lambda x: x in m))  # indexes of the matched DIN(NDC) values
print(Cross_Map['PRDID'][index])
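If the goal is also to write the looked-up PRDID back into the source dataframe, a short sketch along the same lines (using the Cross_Map and df defined above; where the value should land is an assumption, here a new PRDID column):
# one PRDID per DIN(NDC); the crossmap holds duplicate rows
lookup = Cross_Map.drop_duplicates('DIN(NDC)').set_index('DIN(NDC)')['PRDID']

# rows whose column-1 value starts with '999'
mask = df[1].astype(str).str.startswith('999')

# look up the PRDID for the matched rows and write it back
df.loc[mask, 'PRDID'] = df.loc[mask, 1].map(lookup)
print(df)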

How to generate 8 digit unique identifier to replace the existing one in python pandas

Let us say I have the following simple data frame. But in reality, I have hundreds of thousands of rows like this.
df
ID Sales
倀굖곾ꆹ譋῾理 100
倀굖곾ꆹ 50
倀굖곾ꆹ譋῾理 70
곾ꆹ텊躥㫆 60
My idea is that I want to replace the Chinese characters with randomly generated 8-digit numbers, something like below.
ID Sales
13434535 100
67894335 50
13434535 70
10986467 60
The digits are randomly generated, but they should preserve uniqueness as well. For example, rows 0 and 2 are the same, and when replaced by a random unique ID they should still be the same.
Can anyone help on this in Python pandas? Any solution that is already done before is also welcome.
The primary method here will be to use Series.map() on the 'ID's to assign the new values.
Used for substituting each value in a Series with another value, that may be derived from a function, a dict or a Series.
which is exactly what you're looking for.
Here are some options for generating the new IDs:
1. Randomly generated 8-digit integers, as asked
You can first create a map of randomly generated 8-digit integers with each of the unique ID's in the dataframe. Then use Series.map() on the 'ID's to assign the new values back. I've included a while loop to ensure that the generated ID's are unique.
import random

original_ids = df['ID'].unique()

while True:
    new_ids = {id_: random.randint(10_000_000, 99_999_999) for id_ in original_ids}
    if len(set(new_ids.values())) == len(original_ids):
        # all the generated id's were unique
        break
    # otherwise this will repeat until they are

df['ID'] = df['ID'].map(new_ids)
Output:
         ID  Sales
0  91154173    100
1  27127403     50
2  91154173     70
3  55892778     60
Edit & Warning: the original IDs are Chinese characters, and they are already length 8. There are far more than 10 distinct Chinese characters, while there are only 90 million possible 8-digit integers, so with enough distinct original IDs it can become impossible (or extremely slow) to draw a unique 8-digit ID for each one. Unless you are memory bound, I'd recommend using 16-24 digits. Or even better...
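To put a rough number on that warning, here is a quick birthday-problem estimate (the 50,000 below is a hypothetical count of unique original IDs, not from the question):
import math

space = 9 * 10 ** 7  # how many 8-digit integers exist (10_000_000..99_999_999)
n = 50_000           # hypothetical number of unique original IDs
# probability that n uniform draws are all distinct ~ exp(-n*(n-1)/(2*space))
print(math.exp(-n * (n - 1) / (2 * space)))  # ~9e-7, so the retry loop above would almost never terminate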
2. Use UUIDs. [IDEAL]
You can still use the "integer" version of the ID instead of hex. This has the added benefit of not needing to check for uniqueness:
import uuid
original_ids = df['ID'].unique()
new_ids = {cid: uuid.uuid4().int for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
(If you are okay with hex id's, change uuid.uuid4().int above to uuid.uuid4().hex.)
Output:
                                         ID  Sales
0   10302456644733067873760508402841674050    100
1   99013251285361656191123600060539725783     50
2   10302456644733067873760508402841674050     70
3  112767087159616563475161054356643068804     60
2.B. Smaller numbers from UUIDs
If the ID generated above is too long, you could truncate it, with some minor risk. Here, I'm only using the first 16 hex characters and converting those to an int. You may put that in the uniqueness loop check as done for option 1, above.
import uuid
original_ids = df['ID'].unique()
DIGITS = 16 # number of hex digits of the UUID to use
new_ids = {cid: int(uuid.uuid4().hex[:DIGITS], base=16) for cid in original_ids}
df['ID'] = df['ID'].map(new_ids)
Output:
                     ID  Sales
0  14173925717660158959    100
1  10599965012234224109     50
2  14173925717660158959     70
3  13414338319624454663     60
3. Creating a mapping based on the actual value:
This group of options has these advantages:
it doesn't need a uniqueness check, since the new ID is derived deterministically from the original ID, so original IDs which were the same will generate the same new ID
it doesn't need a map created in advance
3.A. CRC32
(Higher probability of finding a collision with different IDs, compared to option 2.B. above.)
import zlib
df['ID'] = df['ID'].map(lambda cid: zlib.crc32(bytes(cid, 'utf-8')))
Output:
           ID  Sales
0  2083453980    100
1  1445801542     50
2  2083453980     70
3   708870156     60
3.B. Python's built-in hash() of the original ID [My preferred approach in this scenario]
Can be done in one line, no imports needed
Reasonably unlikely to generate collisions for IDs which are different
Note that string hashing in Python is salted per interpreter run (see PYTHONHASHSEED), so the generated IDs are stable within one run but differ across runs.
df['ID'] = df['ID'].map(hash)
Output:
                    ID  Sales
0  4663892623205934004    100
1  1324266143210735079     50
2  4663892623205934004     70
3  6251873913398988390     60
3.C. MD5Sum, or anything from hashlib
Since the IDs are expected to be small (8 chars), even with MD5, the probability of a collision is very low.
import hashlib

DIGITS = 16  # number of hex digits of the hash to use
df['ID'] = df['ID'].str.encode('utf-8').map(
    lambda x: int(hashlib.md5(x).hexdigest()[:DIGITS], base=16))
Output:
                     ID  Sales
0  17469287633857111608    100
1   4297816388092454656     50
2  17469287633857111608     70
3  11434864915351595420     60
I'm not very expert in Pandas, so I'm implementing a solution for you with NumPy + Pandas. As the solution uses fast NumPy, it will be much faster than a pure Python solution, especially if you have thousands of rows.
Try it online!
import pandas as pd, numpy as np

df = pd.DataFrame([
    ['倀굖곾ꆹ譋῾理', 100],
    ['倀굖곾ꆹ', 50],
    ['倀굖곾ꆹ譋῾理', 70],
    ['곾ꆹ텊躥㫆', 60],
], columns=['ID', 'Sales'])

u, iv = np.unique(df.ID.values, return_inverse=True)

while True:
    ids = np.random.randint(10 ** 7, 10 ** 8, u.size)
    if np.all(np.unique(ids, return_counts=True)[1] <= 1):
        break

df.ID = ids[iv]
print(df)
Output:
         ID  Sales
0  31043191    100
1  36168634     50
2  31043191     70
3  17162753     60
Given a dataframe df, create a list of the unique ids:
id_list = list(set(df.ID))
Then import the random package:
from random import randint
from collections import deque

def idSetToNumber(id_list):
    id_set = deque(set(id_list))
    checked_numbers = []
    while len(id_set) > 0:
        # draw a candidate 8-digit id
        candidate = randint(10_000_000, 99_999_999)
        # accept it only if it hasn't been drawn before
        if candidate not in checked_numbers:
            checked_numbers.append(candidate)
            id_set.popleft()
    return checked_numbers

This gives a list of unique 8-digit numbers, one for each of your keys.
Then create a dictionary (note: id_list must contain only unique ids here, otherwise the positions would not line up):
checked_numbers = idSetToNumber(id_list)
name2id = {}
for i in range(len(checked_numbers)):
    name2id[id_list[i]] = checked_numbers[i]
Last step, replace all the pandas ID fields with the ones in the dictionary (using .loc to avoid chained assignment):
for i in range(df.shape[0]):
    df.loc[i, 'ID'] = str(name2id[df.loc[i, 'ID']])
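As a side note, that final row loop can be replaced by a single vectorized call using the same name2id dictionary:
df['ID'] = df['ID'].map(name2id).astype(str)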
I would:
identify the unique ID values
build (from np.random) an array of unique values of the same size
build a transformation dataframe with that array
use merge to replace the original ID values
Possible code:
trans = df[['ID']].drop_duplicates()  # unique ID values
n = len(trans)
# np.random.seed(0)  # uncomment for reproducible pseudo-random sequences

while True:
    # build a larger array to have a higher chance of getting enough unique values
    arr = np.unique(np.random.randint(10000000, 100000000, n + n // 2))
    if len(arr) >= n:
        arr = arr[:n]  # ok, keep only the required number
        break

trans['new'] = arr  # ok, we have our transformation table
df['ID'] = df.merge(trans, how='left', on='ID')['new']  # done...
With your sample data (and with np.random.seed(0)), it gives:
         ID  Sales
0  12215104    100
1  48712131     50
2  12215104     70
3  70969723     60
Per @Arty's comment, np.unique returns an ascending sequence. If you do not want that, shuffle it before using it for the transformation table:
...
np.random.shuffle(arr)
trans['new'] = arr
...

DataFrame.sort_values only looks at the first digit rather than the entire number

I have a DataFrame that looks like this,
del Ticker Open Interest
0 1 SPY 20,996,893
1 3 IWM 7,391,074
2 5 EEM 6,545,445
...
47 46 MU 1,268,256
48 48 NOK 1,222,759
49 50 ET 1,141,467
I want it to go in order from the lowest number to the greatest using df['del'], but when I write df.sort_values('del') I get
del Ticker
0 1 SPY
29 10 BAC
5 11 GE
It appears to sort based on the first digit rather than the whole number. Am I using the correct code or do I need to completely change it?
Assuming you have the numbers as type string, you can either:
add leading zeros to the string numbers, which will allow for ordering of the strings
df["del"] = df["del"].map(lambda x: x.zfill(10))
df = df.sort_values('del')
or convert the type to integer
df["del"] = df["del"].astype('int')  # as recommended by Alex.Kh in a comment
#df["del"] = df["del"].map(int)  # my initial answer
df = df.sort_values('del')
I also noticed that del seems to be sorted in the same way as your index, so you could even do:
df = df.sort_index(ascending=False)
To go from lowest to highest, you can explicitly use .sort_values('del', ascending=True).
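A minimal demonstration of why plain string sorting misorders the numbers, and how either fix restores numeric order (toy data, not the asker's frame):
import pandas as pd

df = pd.DataFrame({'del': ['1', '10', '11', '2', '3']})
print(df.sort_values('del')['del'].tolist())
# ['1', '10', '11', '2', '3']  <- lexicographic: '10' sorts before '2'
print(df['del'].map(lambda x: x.zfill(10)).sort_values().tolist())
# ['0000000001', '0000000002', '0000000003', '0000000010', '0000000011']
print(df['del'].astype(int).sort_values().tolist())
# [1, 2, 3, 10, 11]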

Pandas very slow query

I have the following code, which reads a csv file and then analyzes it. One patient can have more than one illness, and I need to find how many times each illness is seen across all patients. But the query given here
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
is so slow that it takes more than 15 mins. Is there a way to make the query faster?
raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label": [],
                          "Count_of_Patientes_Having": [],
                          "Count_of_Times_Being_Shown_In_An_Image": []})
ids = raw_data["Patient ID"].drop_duplicates()

index = 0
for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1
Part of dataframes:
Raw_data
Finding Labels - Patient ID
IllnessA|IllnessB - 1
Illness A - 2
From what I read I understand that ctr stands for the name of a disease.
When you are doing this query:
raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
You are not only filtering the rows which have the disease, but also those which have a specific patient id. If you have a lot of patients, you will need to run this query a lot of times. A simpler way is to not filter on the patient id and instead take the count of all the rows which have the disease.
This would be:
raw_data[raw_data['Finding Labels'].str.contains(ctr)].size
And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).
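A quick illustration of that difference on a toy frame:
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(df.size)  # 6 -- number of cells (rows x columns)
print(len(df))  # 3 -- number of rows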
Finally, another source of error in your current code was that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"], not set it to a new value each time.
The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:
for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])
I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.
Likely the reason your code is slow is that you are growing a data frame row by row inside a loop, which can involve repeated in-memory copying. Usually this is reminiscent of general-purpose Python rather than Pandas programming, which ideally handles data in blockwise, vectorized processing.
Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.
# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills': ills, 'key': 1}), on='key')
                    .drop(columns=['key'])
           )
# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})
To demonstrate, consider below with random, seeded input data and output.
Input Data (attempting to mirror original data)
import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']
ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia",
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax",
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                             np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                      np.random.choice(ills, 25).astype('str')),
                             np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                        })
print(raw_data.head(10))
#   Patient ID       Finding Labels
# 0          r   xPNPneumothoraxXYm
# 1     python   ScSInfiltration9Ud
# 2      stata   tJhInfiltrationJtG
# 3          r      thLPneumoniaWdr
# 4      stata    thYAtelectasis6iW
# 5        sas      2WLPneumonia1if
# 6      julia  OPEConsolidationKq0
# 7        sas   UFFCardiomegaly7wZ
# 8      stata         9NQHerniaMl4
# 9     python         NB8HerniapWK
Output (after running above process)
print(illnesses)
#                     Count_of_Times_Being_Shown_In_An_Image  Count_of_Patients_Having
# ills
# Atelectasis                                              3                         1
# Cardiomegaly                                             2                         1
# Consolidation                                            1                         1
# Effusion                                                 1                         1
# Emphysema                                                1                         1
# Fibrosis                                                 2                         2
# Hernia                                                   4                         3
# Infiltration                                             2                         2
# Mass                                                     1                         1
# Nodule                                                   2                         2
# Pleural_Thickening                                       1                         1
# Pneumonia                                                3                         3
# Pneumothorax                                             2                         2
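For what it's worth, once the cross join and subset above are done, the two aggregates can also be computed in a single groupby with named aggregation and nunique (a sketch against the same raw_data, pandas >= 0.25):
illnesses = raw_data.groupby('ills').agg(
    Count_of_Times_Being_Shown_In_An_Image=('Patient ID', 'size'),
    Count_of_Patients_Having=('Patient ID', 'nunique'),
)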
