I have a column in my DataFrame whose values look like this (I want to sort my DataFrame by this column):
Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R']
Basically, all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit as the highest priority and Char_2 as the second priority, and then sort my whole DataFrame by this column.
I thought I could do that with sorted(), using this function as my key=:
import re

def alpha_numeric_sort_key(unsorted_list):
    return int("".join(re.findall(r"\d*", unsorted_list)))
This works for lists. I tried the same thing for my DataFrame but got this error:
df = raw_df.sort_values(by='Mutation', key=alpha_numeric_sort_key, ignore_index=True)  # sort values by one-letter amino acid code
TypeError: expected string or bytes-like object
I just need to understand the right way to use key= in df.sort_values(), explained in a way that's understandable to someone with an intermediate level of Python experience.
I have also provided the head of my DataFrame in case it helps answer my question. If not, ignore it.
Thanks!
raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )
If you look at the definition of the key parameter in sort_values, it clearly says:
It should expect a Series and return a Series with the same shape as
the input. It will be applied to each column in by independently.
You cannot use a single scalar as a key to sort.
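To see why the original attempt raises TypeError: sort_values hands the whole column (a Series) to key, not one cell at a time, so re.findall receives a Series instead of a string. Here is a minimal sketch of a Series-in, Series-out key (the toy column is made up for illustration):
import re
import pandas as pd

col = pd.Series(["A67D", "C92A"])

# re.findall(r"\d*", col)   # TypeError: expected string or bytes-like object

# A valid key accepts the Series and returns a same-length Series:
int_key = lambda c: c.str.extract(r"(\d+)", expand=False).astype(int)
print(int_key(col).tolist())  # [67, 92]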
You can do sorting in two ways:
First way:
Note that the extracted digits must be cast to int (as strings, "100" would sort before "92"), and the two passes should run with the secondary key first and a stable sort on the primary key last:
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)

raw_df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="stable"
)
Second way: assign the extracted values as temporary columns and sort on both at once via the by parameter, which avoids the two-pass subtlety entirely:
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"]).drop(columns=["sort_int", "sort_char"])
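As a quick sanity check of the two-pass version, here is a sketch on a hand-picked subset of the mutation strings (the subset is chosen purely for illustration):
import pandas as pd

df = pd.DataFrame({"Mutation": ["C447E", "C92A", "D103C", "C92D", "A67D"]})

sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)

out = df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="stable", ignore_index=True
)
print(out["Mutation"].tolist())
# ['A67D', 'C92A', 'C92D', 'D103C', 'C447E']
The digit dominates (67 < 92 < 103 < 447) and the trailing letter breaks the tie between C92A and C92D.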
I have a DataFrame with 16 columns, and when I group it by one of the columns it reduces the DataFrame to 12 columns:
def update_centroids(cluster, k):
    df["Cluster"] = cluster
    display(df)
    display(df.groupby("Cluster").mean().shape)
    return df.groupby("Cluster").mean()
This is the "df"
This is what the function returns
It just removes the cloumn "WindGustDir", "WindDir9am" & "WindDir3pm"
I can't think of anything that would cause that and I can't seem to find anything online.
Sample Data (df)
{'MinTemp': {0: 0.11720244784576628,
1: -0.8427745726259455,
2: 0.03720436280645697,
3: -0.5547814664844322,
4: 0.7731867451681026},
'MaxTemp': {0: -0.10786029175347862,
1: 0.20733745161878284,
2: 0.29330047253849006,
3: 0.6228253860640358,
4: 1.2388937026552729},
'Rainfall': {0: -0.20728093808218293,
1: -0.2769572340371439,
2: -0.2769572340371439,
3: -0.2769572340371439,
4: -0.16083007411220893},
'WindGustDir': {0: 1.0108491579487748,
1: 1.2354908672839122,
2: 0.7862074486136377,
3: -1.2355679354025977,
4: 1.0108491579487748},
'WindGustSpeed': {0: 0.24007558342699453,
1: 0.24007558342699453,
2: 0.390124802975703,
3: -1.2604166120600901,
4: 0.015001754103931822},
'WindDir9am': {0: 1.0468595036148063,
1: 1.6948858890138538,
2: 1.0468595036148063,
3: -0.24919326718328894,
4: -0.8972196525823365},
'WindDir3pm': {0: 1.2126203373471025,
1: 0.764386964628212,
2: 0.764386964628212,
3: -0.8044298398879051,
4: 1.4367370237065478},
'WindSpeed9am': {0: 0.5756272310362935,
1: -1.3396174328344796,
2: 0.45592443954437023,
3: -0.5016978923910164,
4: -0.9805090583587096},
'WindSpeed3pm': {0: 0.5236467885906614,
1: 0.29063908419611084,
2: 0.7566544929852119,
3: -1.2239109943684676,
4: 0.057631379801560294},
'Humidity9am': {0: 0.18973417158101255,
1: -1.2387396584055541,
2: -1.556178287291458,
3: -1.1858332202579036,
4: 0.7717049912051694},
'Humidity3pm': {0: -1.381454000080091,
1: -1.2369929482683248,
2: -0.9962245285820479,
3: -1.6703761037036233,
4: -0.8517634767702817},
'Pressure9am': {0: -1.3829003897707315,
1: -0.9704973422317451,
2: -1.3971211845134583,
3: 0.024958289758919134,
4: -0.9420557527463073},
'Pressure3pm': {0: -1.142657774670493,
1: -1.0420314949852805,
2: -0.9126548496756962,
3: -0.32327235437655133,
4: -1.3007847856044166},
'Temp9am': {0: -0.08857174739799159,
1: -0.0413389204605655,
2: 0.5569435540801637,
3: 0.1003595603517128,
4: 0.0531267334142867},
'Temp3pm': {0: -0.04739593761083206,
1: 0.318392868036414,
2: 0.1574457935516255,
3: 0.6402870170059903,
4: 1.1084966882344651},
'Cluster': {0: 1, 1: 1, 2: 1, 3: 2, 4: 1}}
With your sample data for df, it looks like df.groupby("Cluster").mean() does not remove any columns.
One possible explanation: using groupby().mean() on non-numeric types will cause the column to be removed entirely. If you have a large dataset that you are importing from a file or something, is it possible that there are non-numeric data types?
If, for example, you are reading a csv file using pd.read_csv(), and there are cells that read "NULL", they may be parsed as strings instead of numeric values. One option to find rows with non-numeric values is to use:
import numpy as np

# select the rows where at least one cell is not a real number
df[~df.applymap(np.isreal).all(1)]
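To illustrate the failure mode, here is a small sketch (the string column is hypothetical, standing in for a mis-parsed "NULL" import); with numeric_only enabled, or on older pandas versions where silently dropping was the default, any non-numeric column simply disappears from the groupby result:
import pandas as pd

df = pd.DataFrame({
    "Cluster": [1, 1, 2],
    "MinTemp": [0.1, -0.8, 0.0],
    "WindGustDir": ["1.01", "NULL", "0.79"],  # parsed as strings, not floats
})

print(df.groupby("Cluster").mean(numeric_only=True).columns.tolist())
# ['MinTemp']  (the string column 'WindGustDir' is dropped)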
I am trying to replicate a table, currently produced in R, in Python using the plotnine library. I am using facet_grid with two variables (CBRegion and CBIndustry).
I found a similar problem, but it is also solved in R. I applied similar code as in that link and produced the following table:
I tried to use exactly the same code in Python with plotnine, but the final output is very ugly. This is my Python code so far:
myplot = ggplot(data=df_data_bar) + aes(x="CCR100PDMid %", y="CBSector") + \
    geom_segment(aes(yend="CBSector", xend=0), colour="black", size=2) + \
    geom_text(aes(label="label")) + \
    facet_grid('CBIndustry ~ CBRegion', scales="free_y", space="free") + \
    labs(x="", y="", title=title) + \
    theme_bw() + \
    theme(panel_grid_major_y=element_blank(),  # theme() tweaks must follow theme_bw(), or they are overridden
          plot_title=element_text(linespacing=0.8, face="bold", size=20, va="center"),
          axis_text_x=element_text(colour="#333333", size=12, rotation=0, ha="center", va="top", face="bold"),
          axis_text_y=element_text(colour="#333333", size=12, rotation=0, ha="right", va="center", face="bold"),
          axis_title_x=element_blank(),
          axis_title_y=element_blank(),
          legend_position="none",
          strip_text_x=element_text(size=12, face="bold", colour="black", angle=0),
          strip_text_y=element_text(size=8, face="bold", colour="black", angle=0, ha="left"),
          strip_background_y=element_text(width=0.2),
          figure_size=(30, 20))
The image from plotnine is as follows:
Comparing Python vs R, we can clearly see that the y-axis labels overlap with plotnine. In addition, looking at the Europe and Community Groups panels, they get the same size box as panels with multiple groups, which is unnecessary. I also tried different aspect ratios, but that did not resolve the problem.
In short, I would like the same plot that R produces. It does not have to be made with plotnine; alternatives are also welcome. The top ten rows of data are:
{'CBRegion': {0: 'Europe', 1: 'Europe', 2: 'Europe', 3: 'Europe', 4: 'Europe', 5: 'Europe', 6: 'Europe', 7: 'Europe', 8: 'Europe', 9: 'Europe'}, 'CBSector': {0: 'Aerospace & Defense', 1: 'Alternative Energy', 2: 'Automobiles & Parts', 3: 'Banks', 4: 'Beverages', 5: 'Chemicals', 6: 'Colleges & Universities', 7: 'Community Groups', 8: 'Construction & Materials', 9: 'Electricity'}, 'CBIndustry': {0: 'Industrials', 1: 'Oil & Gas', 2: 'Consumer Goods', 3: 'Financials', 4: 'Consumer Goods', 5: 'Basic Materials', 6: 'NPO', 7: 'Community Groups', 8: 'Industrials', 9: 'Utilities'}, 'CCR100PDMid': {0: 0.015545818181818181, 1: 0.003296, 2: 0.012897471223021583, 3: 0.008079544600938968, 4: 0.008716597402597401, 5: 0.0094617476340694, 6: 0.008897475862068967, 7: 0.000821, 8: 0.012205547455295736, 9: 0.0050264210526315784}, 'CCR100PDMid %': {0: 1.554581818181818, 1: 0.3296, 2: 1.2897471223021584, 3: 0.8079544600938968, 4: 0.8716597402597401, 5: 0.9461747634069401, 6: 0.8897475862068966, 7: 0.0821, 8: 1.2205547455295735, 9: 0.5026421052631579}, 'label': {0: '1.6%', 1: '0.3%', 2: '1.3%', 3: '0.8%', 4: '0.9%', 5: '0.9%', 6: '0.9%', 7: '0.1%', 8: '1.2%', 9: '0.5%'}}
If necessary, I can upload the entire dataset, but I just read the guidance on minimal reproducible examples and it says that I should only include a subset of the data. I am new to SO and hope I have included all the vital information. I will be grateful for any help. Thank you in advance!
The other issues (colours, overlapping labels, wrapping text, etc.) can be fixed, but unfortunately space = 'free' is not currently supported in plotnine; see the documentation here. That is kind of a deal-breaker for your table, so you will need to do it in R's ggplot.
I want to make a histogram of all the intervals between repeated values in a list. I wrote some code that works, but it uses a for loop with if statements. I often find that a version using clever slicing and/or predefined Python (NumPy) methods is much faster than one using for loops, but in this case I can't think of a way to do that. Can anyone suggest a faster or more Pythonic way of doing this?
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    values = sorted(set(a))  # get list of which values are in a
    # set up the dicts to hold the histogram and the last index seen
    hist, last_index = {}, {}
    for i in values:
        hist[i] = {}
        last_index[i] = -1  # some default value
    # now go through the array and find intervals
    for i in range(len(a)):
        val = a[i]
        if last_index[val] != -1:  # do nothing if it's the first time
            interval = i - last_index[val]
            if interval in hist[val]:
                hist[val][interval] += 1
            else:
                hist[val][interval] = 1
        last_index[val] = i
    return hist
# example list/array
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
histdict = hist_intervals(a)
print("histdict = ",histdict)
# correct answer for this example
answer = { 1: {3:1, 6:1},
2: {2:1, 5:1},
3: {1:1, 3:1, 6:1},
4: {6:1},
5: {6:1}
}
print("answer = ",answer)
Sample output:
histdict = {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}
answer = {1: {3: 1, 6: 1}, 2: {2: 1, 5: 1}, 3: {1: 1, 3: 1, 6: 1}, 4: {6: 1}, 5: {6: 1}}
^ note: I don't care about the ordering in the dict, so this solution is acceptable, but I want to be able to run this on really large arrays/lists, and I suspect my current method will be slow.
You can eliminate the setup loop by a carefully constructed defaultdict. Then you're just left with a single scan over the input list, which is as good as it gets. Here I change the resultant defaultdict back to a regular Dict[int, Dict[int, int]], but that's just so it prints nicely.
from collections import defaultdict

def count_intervals(iterable):
    # setup
    last_seen = {}
    hist = defaultdict(lambda: defaultdict(int))
    # the actual work
    for i, x in enumerate(iterable):
        if x in last_seen:
            hist[x][i - last_seen[x]] += 1
        last_seen[x] = i
    return hist
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
hist = count_intervals(a)
for k, v in hist.items():
    print(k, dict(v))
# 1 {3: 1, 6: 1}
# 3 {3: 1, 6: 1, 1: 1}
# 2 {5: 1, 2: 1}
# 5 {6: 1}
# 4 {6: 1}
There is an obvious change to make in terms of data structures: instead of using a dictionary of dictionaries for hist, use a defaultdict of Counters. This lets the code become:
from collections import defaultdict, Counter

# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    # the defaultdict of Counters replaces all the manual setup
    hist, last_index = defaultdict(Counter), {}
    # now go through the array and find intervals
    for i, val in enumerate(a):
        if val in last_index:
            interval = i - last_index[val]
            hist[val].update((interval,))
        last_index[val] = i
    return hist
This will be faster, since the default handling now happens in C rather than in explicit if statements, and the code is also cleaner.
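Since the question explicitly wonders about NumPy, here is a mostly-vectorised sketch of my own (not from either answer): a stable argsort groups the indices of equal values together, so the intervals are just the differences between neighbouring indices within each group. The final tally still loops in Python, so benchmark before assuming a win on your data.
import numpy as np
from collections import defaultdict, Counter

def hist_intervals_np(a):
    a = np.asarray(a)
    order = np.argsort(a, kind="stable")         # indices of a, grouped by value
    sorted_vals = a[order]
    gaps = np.diff(order)                        # index gaps between neighbours
    same = sorted_vals[1:] == sorted_vals[:-1]   # keep only pairs within one value group
    hist = defaultdict(Counter)
    for v, g in zip(sorted_vals[1:][same], gaps[same]):
        hist[int(v)][int(g)] += 1
    return hist

a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
print({k: dict(v) for k, v in hist_intervals_np(a).items()})
# {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}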
I have a dictionary of lists like this:
my_dict = {
'key_a': [1, 3, 4],
'key_b': [0, 2],
}
I want to create a reverse lookup dict like this:
reverse_dict = {
0: 'key_b',
1: 'key_a',
2: 'key_b',
3: 'key_a',
4: 'key_a',
}
I have a working version of the solution:
reverse_dict = {elem: key for key, a_list in my_dict.items() for elem in a_list}
But I wanted to know if someone can provide an alternative that doesn't use a double for, as I feel it loses readability. I'd prefer a single for loop, or functions like those in itertools, or a functional-programming approach.
Your solution using dictionary comprehension is the Pythonic way to achieve it.
However, as an alternative with a single for loop, as you requested, here is a functional one using zip(), itertools.repeat(), and itertools.chain.from_iterable(), though I doubt it is any better than your solution in terms of readability:
my_dict = {
'key_a': [1, 3, 4],
'key_b': [0, 2],
}
from itertools import chain, repeat
new_dict = dict(chain.from_iterable(zip(v, repeat(k)) for k, v in my_dict.items()))
where new_dict will hold:
{0: 'key_b', 1: 'key_a', 2: 'key_b', 3: 'key_a', 4: 'key_a'}
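To see what each piece produces, a quick illustration with one key:
from itertools import repeat
print(list(zip([1, 3, 4], repeat('key_a'))))
# [(1, 'key_a'), (3, 'key_a'), (4, 'key_a')]
zip() stops as soon as the list is exhausted, so the infinite repeat() is safe here.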
>>> from itertools import chain
>>> dict(chain.from_iterable(map(lambda k: map(lambda j:(j, k), my_dict.get(k)), my_dict)))
{1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}
No for loops, but it's clearly not more readable.
Another one, based on @MoinuddinQuadri's idea:
>>> from itertools import chain, repeat
>>> dict(chain.from_iterable(map(lambda k: zip(my_dict[k], repeat(k)), my_dict)))
{1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}
Here's one without an explicit for loop. Not really recommended, though (a side effect inside map is a no-no to many):
>>> import itertools as it
>>> out = {}
>>> any(map(out.update, it.starmap(dict.fromkeys, map(reversed, my_dict.items()))))
False
>>> out
{1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}
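For what it's worth, a plain single-loop version (my own sketch, using dict.fromkeys) avoids both the double for and the itertools machinery:
reverse_dict = {}
for key, a_list in my_dict.items():
    reverse_dict.update(dict.fromkeys(a_list, key))

print(reverse_dict)
# {1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}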