I have a column in my DataFrame that has values that look like this (I want to sort my DataFrame by this column):
Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R']
Basically, all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit as the highest priority and Char_2 as the second priority, and then sort my whole DataFrame by this column.
I thought I could do that with sorted(), using this function as the key= argument:
import re

def alpha_numeric_sort_key(unsorted_list):
    return int("".join(re.findall(r"\d*", unsorted_list)))
This works for lists. I tried the same thing for my DataFrame but got this error:
df = raw_df.sort_values(by='Mutation',key=alpha_numeric_sort_key,ignore_index=True) #sorts values by one letter amino acid code
TypeError: expected string or bytes-like object
I just need to understand the right way to use key= in df.sort_values(), explained in a way that's understandable to someone with an intermediate level of Python experience.
I also provided the head of what my DataFrame looks like, in case that is helpful for answering my question. If not, ignore it.
Thanks!
raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )
If you look at the definition of the key parameter in sort_values, it clearly says:
It should expect a Series and return a Series with the same shape as
the input. It will be applied to each column in by independently.
So the key function receives a whole Series and must return one; you cannot use a function that works on a single scalar value as the sort key.
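To see why the original function blows up, note that the key callable receives the whole column at once, not one cell at a time. Here is a minimal sketch (column values taken from the question) contrasting the two:
import re
import pandas as pd

col = pd.Series(["A67D", "C447E", "C92A"], name="Mutation")

# re.findall(r"\d*", col)  # raises TypeError: expected string or bytes-like object
# A vectorized version returns a Series of the same shape, which is what key= expects:
col.str.extract(r"(\d+)", expand=False).astype(int)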
You can do sorting in two ways:
First way: chain two sorts, sorting by the lower-priority key (the trailing letters) first and then by the higher-priority key (the digits) with a stable sort, so ties keep the order of the previous sort:
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)

raw_df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="stable"
)
Second way: assign the extracted values as temporary columns and sort them by specifying those columns in the by parameter (converting the digits to int so they sort numerically rather than as strings):
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"])
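If you would rather keep a single sort_values call with one key, here is a sketch that assumes each mutation string contains exactly one run of at most six digits: zero-pad the digits and append the trailing characters so that plain lexicographic order matches number-then-letter order.
def combined_key(col):
    parts = col.str.extract(r"(\d+)(\D*)", expand=True)  # e.g. "F100D" -> ("100", "D")
    return parts[0].str.zfill(6) + parts[1]               # -> "000100D"

raw_df.sort_values(by="Mutation", key=combined_key, ignore_index=True)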
I have a DataFrame with 16 columns, and when I group it by one of the columns it reduces the DataFrame to 12 columns.
def update_centroids(cluster, k):
    df["Cluster"] = cluster
    display(df)
    display(df.groupby("Cluster").mean().shape)
    return df.groupby("Cluster").mean()
This is the "df"
This is what the function returns
It just removes the columns "WindGustDir", "WindDir9am" & "WindDir3pm".
I can't think of anything that would cause that and I can't seem to find anything online.
Sample Data (df)
{'MinTemp': {0: 0.11720244784576628,
1: -0.8427745726259455,
2: 0.03720436280645697,
3: -0.5547814664844322,
4: 0.7731867451681026},
'MaxTemp': {0: -0.10786029175347862,
1: 0.20733745161878284,
2: 0.29330047253849006,
3: 0.6228253860640358,
4: 1.2388937026552729},
'Rainfall': {0: -0.20728093808218293,
1: -0.2769572340371439,
2: -0.2769572340371439,
3: -0.2769572340371439,
4: -0.16083007411220893},
'WindGustDir': {0: 1.0108491579487748,
1: 1.2354908672839122,
2: 0.7862074486136377,
3: -1.2355679354025977,
4: 1.0108491579487748},
'WindGustSpeed': {0: 0.24007558342699453,
1: 0.24007558342699453,
2: 0.390124802975703,
3: -1.2604166120600901,
4: 0.015001754103931822},
'WindDir9am': {0: 1.0468595036148063,
1: 1.6948858890138538,
2: 1.0468595036148063,
3: -0.24919326718328894,
4: -0.8972196525823365},
'WindDir3pm': {0: 1.2126203373471025,
1: 0.764386964628212,
2: 0.764386964628212,
3: -0.8044298398879051,
4: 1.4367370237065478},
'WindSpeed9am': {0: 0.5756272310362935,
1: -1.3396174328344796,
2: 0.45592443954437023,
3: -0.5016978923910164,
4: -0.9805090583587096},
'WindSpeed3pm': {0: 0.5236467885906614,
1: 0.29063908419611084,
2: 0.7566544929852119,
3: -1.2239109943684676,
4: 0.057631379801560294},
'Humidity9am': {0: 0.18973417158101255,
1: -1.2387396584055541,
2: -1.556178287291458,
3: -1.1858332202579036,
4: 0.7717049912051694},
'Humidity3pm': {0: -1.381454000080091,
1: -1.2369929482683248,
2: -0.9962245285820479,
3: -1.6703761037036233,
4: -0.8517634767702817},
'Pressure9am': {0: -1.3829003897707315,
1: -0.9704973422317451,
2: -1.3971211845134583,
3: 0.024958289758919134,
4: -0.9420557527463073},
'Pressure3pm': {0: -1.142657774670493,
1: -1.0420314949852805,
2: -0.9126548496756962,
3: -0.32327235437655133,
4: -1.3007847856044166},
'Temp9am': {0: -0.08857174739799159,
1: -0.0413389204605655,
2: 0.5569435540801637,
3: 0.1003595603517128,
4: 0.0531267334142867},
'Temp3pm': {0: -0.04739593761083206,
1: 0.318392868036414,
2: 0.1574457935516255,
3: 0.6402870170059903,
4: 1.1084966882344651},
'Cluster': {0: 1, 1: 1, 2: 1, 3: 2, 4: 1}}
With your sample data for df, it looks like df.groupby("Cluster").mean() does not remove any columns.
One possible explanation: using groupby().mean() on non-numeric types will cause the column to be removed entirely. If you have a large dataset that you are importing from a file or something, is it possible that there are non-numeric data types?
If, for example, you are reading a csv file using pd.read_csv(), and there are cells that read "NULL", they may be parsed as strings instead of numeric values. One option to find rows with non-numeric values is to use:
import numpy as np
df[~df.applymap(np.isreal).all(1)]
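As a quick illustration (toy data, not the poster's frame), here is a sketch showing a numeric-looking column stored as strings being left out of the grouped mean, and pd.to_numeric bringing it back:
import pandas as pd

toy = pd.DataFrame({
    "Cluster": [1, 1, 2],
    "MinTemp": [0.12, -0.84, 0.04],
    "WindGustDir": ["1.01", "1.24", "NULL"],  # strings, i.e. object dtype
})

# the string column is excluded from the numeric aggregation
print(toy.groupby("Cluster").mean(numeric_only=True))

# coercing to numbers ("NULL" becomes NaN) keeps the column in the result
toy["WindGustDir"] = pd.to_numeric(toy["WindGustDir"], errors="coerce")
print(toy.groupby("Cluster").mean(numeric_only=True))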
I have a dictionary which includes integer keys and string values. The keys go up to a number N but include gaps. Is there an efficient way of filling all the gaps up to a specific number? (Numbering starts at 1, not 0.)
Example:
{1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf"}
N = 9
The result should be:
{1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf", 2: "placeholder", 6: "placeholder", 8: "placeholder", 9: "placeholder"}
Of course I can loop manually through it, but is there a smarter way to do that?
One quick way to do it (which does admittedly involve a bit of looping) is
mydict = dict.fromkeys(range(1,N+1),"placeholder") | {
1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf"}
Though I suspect you might be reaching for collections.defaultdict:
from collections import defaultdict

mydict = defaultdict(lambda: "placeholder", {
    1: "fdkh", 3: "wnww", 4: "fdngfne", 5: "wqiw", 7: "sdfsdf"})
I want to compute the percentage distribution over the elements and have the output dictionary use time_key as keys and a dictionary of element percentages as values.
I wrote this code:
def class_distr(x):
    fractions = x.value_counts(normalize=True)  # use value_counts normalize instead
    return [dict(zip(fractions.keys(), fractions.tolist()))]

def get_dict(df, col):
    grouped = df[['time_key', col]].groupby('time_key')[col].apply(lambda x: class_distr(x)).reset_index(name=col)
    return {key: value for key, value in zip(grouped['time_key'], grouped[col])}
Here is a dataframe:
d = {'time_key': {9394: '2019-03-01',
898: '2018-09-01',
2398: '2018-10-01',
5906: '2018-12-01',
2343: '2018-10-01',
8225: '2019-02-01',
5506: '2018-12-01',
6451: '2019-01-01',
2670: '2018-10-01',
3497: '2018-10-01'},
'target': {9394: 3,
898: 4,
2398: 0,
5906: 3,
2343: 4,
8225: 1,
5506: 0,
6451: 0,
2670: 0,
3497: 2}}
df = pd.DataFrame(d)
get_dict(df, 'target')
Output
{'2018-09-01': [{4: 1.0}],
'2018-10-01': [{0: 0.5, 2: 0.25, 4: 0.25}],
'2018-12-01': [{3: 0.5, 0: 0.5}],
'2019-01-01': [{0: 1.0}],
'2019-02-01': [{1: 1.0}],
'2019-03-01': [{3: 1.0}]}
It can be seen that the inner dictionaries are wrapped with square brackets.
I don't need them, but the class_distr function doesn't work correctly without the brackets when used with groupby.
How can I handle this without an additional loop to extract the dictionaries from the brackets?
We can group the target column by time_key, then inside a dict comprehension iterate over the groups and create key-value pairs where the key is the timestamp and the value is the normalized distribution of target for that timestamp:
grp = df.groupby('time_key')['target']
{k: g.value_counts(normalize=True).to_dict() for k, g in grp}
{'2018-09-01': {4: 1.0},
'2018-10-01': {0: 0.5, 4: 0.25, 2: 0.25},
'2018-12-01': {0: 0.5, 3: 0.5},
'2019-01-01': {0: 1.0},
'2019-02-01': {1: 1.0},
'2019-03-01': {3: 1.0}}
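If you still want it packaged like the original helper, here is a drop-in sketch of get_dict built on the same comprehension (keeping the poster's function name and the time_key column):
def get_dict(df, col):
    return {k: g.value_counts(normalize=True).to_dict()
            for k, g in df.groupby('time_key')[col]}

get_dict(df, 'target')  # inner dicts, no surrounding brackets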
I have a dictionary (named distances) which looks like this:
{0: {0: 122.97560733739029, 1: 208.76062847194152, 2: 34.713109915419565}, 1: {0: 84.463009655114703, 1: 20.83266665599966, 2: 237.6299644405141}, 2: {0: 27.018512172212592, 1: 104.38390680559911, 2: 137.70257804413103}}
Now, what I need to do is find the minimum value corresponding to each key and then store its index separately. I have written this code for that:
weights_indexes = {}
for index1 in distances:
    min_dist = min(distances[index1], key=distances[index1].get)
    weights_indexes[index1] = min_dist
The output for this looks like:
{0: 2, 1: 1, 2: 0}
Now, the issue is that the indexes should always be unique. Let's say we now have a dictionary like:
{0: {0: 34.713109915419565, 1: 208.76062847194152, 2: 122.97560733739029}, 1: {0: 84.463009655114703, 1: 20.83266665599966, 2: 237.6299644405141}, 2: {0: 27.018512172212592, 1: 104.38390680559911, 2: 137.70257804413103}}
so the output of finding the minimum indexes for this will be:
{0: 0, 1: 1, 2: 0}
Here, the indexes (values) obtained are not unique. In this scenario, the values at the duplicated index have to be compared. So 34.713109915419565 and 27.018512172212592 will be compared; since 27.018512172212592 is smaller, its key keeps index 0. Key 0 is then mapped to its next-smallest index, that of 122.97560733739029. So the final mapping will look like:
{0: 2, 1: 1, 2: 0}
This should happen iteratively until all the values are unique.
I am not able to figure out how to check for uniqueness and then iteratively keep finding the next minimum to make the mapping.
Here is a workable solution:
test = {0: {0: 12.33334444, 1: 208.76062847194152, 2: 34.713109915419565}, 1: {0: 84.463009655114703, 1: 20.83266665599966, 2: 237.6299644405141}, 2: {0: 27.018512172212592, 1: 104.38390680559911, 2: 137.70257804413103}}

sorted_index_map = {}
for key, value in test.items():
    sorted_index_map[key] = sorted(value, key=lambda k: value[k])

index_of_min_index_map = {key: 0 for key in test}

need_to_check_duplicate = True
while need_to_check_duplicate:
    need_to_check_duplicate = False
    min_index_map = {key: sorted_index_map[key][i] for key, i in index_of_min_index_map.items()}
    index_set = list(min_index_map.values())
    for key, index in min_index_map.items():
        if index_set.count(index) == 1:
            continue
        else:
            for key_to_check, index_to_check in min_index_map.items():
                if key != key_to_check and index == index_to_check:
                    if test[key][index] > test[key_to_check][index_to_check]:
                        index_of_min_index_map[key] += 1
                        need_to_check_duplicate = True
                        break

result = {key: sorted_index_map[key][i] for key, i in index_of_min_index_map.items()}
print(result)
The result:
{0: 0, 1: 1, 2: 2}
Breakdown:
First, sort the indexes by their values:
test = {0: {0: 12.33334444, 1: 208.76062847194152, 2: 34.713109915419565}, 1: {0: 84.463009655114703, 1: 20.83266665599966, 2: 237.6299644405141}, 2: {0: 27.018512172212592, 1: 104.38390680559911, 2: 137.70257804413103}}
sorted_index_map = {}
for key, value in test.items():
    sorted_index_map[key] = sorted(value, key=lambda k: value[k])
Then, for each key, the index of its minimum value is the first entry in sorted_index_map:
index_of_min_index_map = {key: 0 for key in test}
Now we need to check whether there are any duplicate indexes. If there are, then for every value sharing that index that is not the smallest, we shift its key to the next-smallest index, i.e. the next one in that key's sorted_index_map. If there are no duplicates, we're done.
need_to_check_duplicate = True
while need_to_check_duplicate:
    need_to_check_duplicate = False
    min_index_map = {key: sorted_index_map[key][i] for key, i in index_of_min_index_map.items()}
    index_set = list(min_index_map.values())
    for key, index in min_index_map.items():
        if index_set.count(index) == 1:
            continue
        else:
            for key_to_check, index_to_check in min_index_map.items():
                if key != key_to_check and index == index_to_check:
                    if test[key][index] > test[key_to_check][index_to_check]:
                        index_of_min_index_map[key] += 1
                        need_to_check_duplicate = True
                        break
Note: you haven't mentioned how to handle the case where two values are identical, so I assume there won't be any.
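If the real goal is a globally best one-to-one mapping, an alternative sketch is SciPy's Hungarian-algorithm solver. Be aware that it minimizes the total distance of the whole assignment, which is a different rule from the greedy tie-breaking above and can therefore return a different mapping in some cases:
import numpy as np
from scipy.optimize import linear_sum_assignment

distances = {0: {0: 34.713109915419565, 1: 208.76062847194152, 2: 122.97560733739029},
             1: {0: 84.463009655114703, 1: 20.83266665599966, 2: 237.6299644405141},
             2: {0: 27.018512172212592, 1: 104.38390680559911, 2: 137.70257804413103}}

keys = sorted(distances)
cost = np.array([[distances[k][i] for i in sorted(distances[k])] for k in keys])
rows, cols = linear_sum_assignment(cost)  # one column per row, minimal total cost
print({keys[r]: int(c) for r, c in zip(rows, cols)})  # {0: 2, 1: 1, 2: 0}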