I have a column in my DataFrame whose values look like this (I want to sort my DataFrame by this column):
Mutations=['A67D','C447E','C447F','C447G','C447H','C447I','C447K','C447L','C447M','C447N','C447P','C447Q','C447R','C447S','C447T','C447V','C447W','C447Y','C447_','C92A','C92D','C92E','C92F','C92G','C92H','C92I','C92K','C92L','C92M','C92N','C92P','C92Q','C92R','C92S','C92T','C92V','C92W','C92_','D103A','D103C','D103F','D103G','D103H','D103I','D103K','D103L','D103M','D103N','D103R','D103S','D103T','D103V','silent_C88G','silent_G556R']
Basically, all the values are in the format Char_1-Digit-Char_2. I want to sort them with Digit as the highest priority and Char_2 as the second priority, and then sort my whole DataFrame by this column.
I thought I could do that with sorted(), using this function as my key=:
import re

def alpha_numeric_sort_key(unsorted_list):
    return int("".join(re.findall(r"\d*", unsorted_list)))
This works for lists. I tried the same thing for my DataFrame but got this error:
df = raw_df.sort_values(by='Mutation', key=alpha_numeric_sort_key, ignore_index=True)  # sort values by one-letter amino acid code
TypeError: expected string or bytes-like object
I just need to understand the right way to use key= in df.sort_values(), explained in a way that's understandable to someone with an intermediate level of Python experience.
I have also provided the head of my DataFrame in case it helps answer my question. If not, ignore it.
Thanks!
raw_df=pd.DataFrame({'0': {0: 100, 1: 100, 2: 100, 3: 100}, 'Mutation': {0: 'F100D', 1: 'F100S', 2: 'F100M', 3: 'F100G'},
'rep1_AGGTTGGG-TCGATTAG': {0: 2.0, 1: 15.0, 2: 49.0, 3: 19.0},
'Input_AGGTTGGG-TCGATTAG': {0: 48.0, 1: 125.0, 2: 52.0, 3: 98.0}, 'rep2_GTGTGGTG-TGTTCTAG': {0: 8.0, 1: 40.0, 2: 33.0, 3: 11.0}, 'WT_plasmid_GTGTGGTG-TGTTCTAG': {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0},
'Amplicon': {0: 'Amp1', 1: 'Amp1', 2: 'Amp1', 3: 'Amp1'},
'WT_plas_norm': {0: 1.9076506328630974e-06, 1: 7.63060253145239e-06, 2: 1.9076506328630974e-06, 3: 1.9076506328630974e-06},
'Input_norm': {0: 9.171121666392808e-05, 1: 0.0002388312933956, 2: 9.935381805258876e-05, 3: 0.0001872437340221},
'escape_rep1_norm': {0: 4.499235130027895e-05, 1: 0.000337442634752, 2: 0.0011023126068568, 3: 0.0004274273373526},
'escape_rep1_fitness': {0: -1.5465897459555915, 1: -1.087197258196361, 2: -0.1921857678502714, 3: -0.8788509789836517} } )
If you look at the definition of the key parameter in sort_values, it clearly says:
It should expect a Series and return a Series with the same shape as
the input. It will be applied to each column in by independently.
You cannot use a single scalar as a key to sort.
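To see why the original attempt raises TypeError: sort_values hands the whole column (a Series) to key, not one cell at a time, so re.findall receives a Series instead of a string. Here is a minimal sketch of a Series-in, Series-out key (the toy column is made up for illustration):
import re
import pandas as pd

col = pd.Series(["A67D", "C92A"])

# re.findall(r"\d*", col)   # TypeError: expected string or bytes-like object

# A valid key accepts the Series and returns a same-length Series:
int_key = lambda c: c.str.extract(r"(\d+)", expand=False).astype(int)
print(int_key(col).tolist())  # [67, 92]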
You can do sorting in two ways:
First way:
Note that the extracted digits must be cast to int (as strings, "100" would sort before "92"), and the two passes should run with the secondary key first and a stable sort on the primary key last:
sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)

raw_df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="stable"
)
Second way: assign the extracted values as temporary columns and sort on both at once via the by parameter, which avoids the two-pass subtlety entirely:
raw_df.assign(
    sort_int=raw_df["Mutation"].str.extract(r"(\d+)", expand=False).astype(int),
    sort_char=raw_df["Mutation"].str.extract(r"\d+(\w+)", expand=False),
).sort_values(by=["sort_int", "sort_char"]).drop(columns=["sort_int", "sort_char"])
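As a quick sanity check of the two-pass version, here is a sketch on a hand-picked subset of the mutation strings (the subset is chosen purely for illustration):
import pandas as pd

df = pd.DataFrame({"Mutation": ["C447E", "C92A", "D103C", "C92D", "A67D"]})

sort_int_key = lambda col: col.str.extract(r"(\d+)", expand=False).astype(int)
sort_char_key = lambda col: col.str.extract(r"\d+(\w+)", expand=False)

out = df.sort_values(by="Mutation", key=sort_char_key).sort_values(
    by="Mutation", key=sort_int_key, kind="stable", ignore_index=True
)
print(out["Mutation"].tolist())
# ['A67D', 'C92A', 'C92D', 'D103C', 'C447E']
The digit dominates (67 < 92 < 103 < 447) and the trailing letter breaks the tie between C92A and C92D.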
I have a DataFrame with 16 columns, and when I group it by one of the columns it reduces the DataFrame to 12 columns:
def update_centroids(cluster, k):
    df["Cluster"] = cluster
    display(df)
    display(df.groupby("Cluster").mean().shape)
    return df.groupby("Cluster").mean()
This is the "df"
This is what the function returns
It just removes the cloumn "WindGustDir", "WindDir9am" & "WindDir3pm"
I can't think of anything that would cause that and I can't seem to find anything online.
Sample Data (df)
{'MinTemp': {0: 0.11720244784576628,
1: -0.8427745726259455,
2: 0.03720436280645697,
3: -0.5547814664844322,
4: 0.7731867451681026},
'MaxTemp': {0: -0.10786029175347862,
1: 0.20733745161878284,
2: 0.29330047253849006,
3: 0.6228253860640358,
4: 1.2388937026552729},
'Rainfall': {0: -0.20728093808218293,
1: -0.2769572340371439,
2: -0.2769572340371439,
3: -0.2769572340371439,
4: -0.16083007411220893},
'WindGustDir': {0: 1.0108491579487748,
1: 1.2354908672839122,
2: 0.7862074486136377,
3: -1.2355679354025977,
4: 1.0108491579487748},
'WindGustSpeed': {0: 0.24007558342699453,
1: 0.24007558342699453,
2: 0.390124802975703,
3: -1.2604166120600901,
4: 0.015001754103931822},
'WindDir9am': {0: 1.0468595036148063,
1: 1.6948858890138538,
2: 1.0468595036148063,
3: -0.24919326718328894,
4: -0.8972196525823365},
'WindDir3pm': {0: 1.2126203373471025,
1: 0.764386964628212,
2: 0.764386964628212,
3: -0.8044298398879051,
4: 1.4367370237065478},
'WindSpeed9am': {0: 0.5756272310362935,
1: -1.3396174328344796,
2: 0.45592443954437023,
3: -0.5016978923910164,
4: -0.9805090583587096},
'WindSpeed3pm': {0: 0.5236467885906614,
1: 0.29063908419611084,
2: 0.7566544929852119,
3: -1.2239109943684676,
4: 0.057631379801560294},
'Humidity9am': {0: 0.18973417158101255,
1: -1.2387396584055541,
2: -1.556178287291458,
3: -1.1858332202579036,
4: 0.7717049912051694},
'Humidity3pm': {0: -1.381454000080091,
1: -1.2369929482683248,
2: -0.9962245285820479,
3: -1.6703761037036233,
4: -0.8517634767702817},
'Pressure9am': {0: -1.3829003897707315,
1: -0.9704973422317451,
2: -1.3971211845134583,
3: 0.024958289758919134,
4: -0.9420557527463073},
'Pressure3pm': {0: -1.142657774670493,
1: -1.0420314949852805,
2: -0.9126548496756962,
3: -0.32327235437655133,
4: -1.3007847856044166},
'Temp9am': {0: -0.08857174739799159,
1: -0.0413389204605655,
2: 0.5569435540801637,
3: 0.1003595603517128,
4: 0.0531267334142867},
'Temp3pm': {0: -0.04739593761083206,
1: 0.318392868036414,
2: 0.1574457935516255,
3: 0.6402870170059903,
4: 1.1084966882344651},
'Cluster': {0: 1, 1: 1, 2: 1, 3: 2, 4: 1}}
With your sample data for df, it looks like df.groupby("Cluster").mean() does not remove any columns.
One possible explanation: using groupby().mean() on non-numeric types will cause the column to be removed entirely. If you have a large dataset that you are importing from a file or something, is it possible that there are non-numeric data types?
If, for example, you are reading a csv file using pd.read_csv(), and there are cells that read "NULL", they may be parsed as strings instead of numeric values. One option to find rows with non-numeric values is to use:
import numpy as np

# select the rows where at least one cell is not a real number
df[~df.applymap(np.isreal).all(1)]
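To illustrate the failure mode, here is a small sketch (the string column is hypothetical, standing in for a mis-parsed "NULL" import); with numeric_only enabled, or on older pandas versions where silently dropping was the default, any non-numeric column simply disappears from the groupby result:
import pandas as pd

df = pd.DataFrame({
    "Cluster": [1, 1, 2],
    "MinTemp": [0.1, -0.8, 0.0],
    "WindGustDir": ["1.01", "NULL", "0.79"],  # parsed as strings, not floats
})

print(df.groupby("Cluster").mean(numeric_only=True).columns.tolist())
# ['MinTemp']  (the string column 'WindGustDir' is dropped)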
I am trying to replicate a table, currently produced in R, in Python using the plotnine library. I am using facet_grid with two variables (CBRegion and CBIndustry).
I found a similar problem, but it is also solved in R. I applied similar code as in that link and produced the following table:
I tried to use exactly the same code in Python with plotnine, but the final output is very ugly. This is my Python code so far:
myplot = ggplot(data=df_data_bar) + aes(x="CCR100PDMid %", y="CBSector") + \
    geom_segment(aes(yend="CBSector", xend=0), colour="black", size=2) + \
    geom_text(aes(label="label")) + \
    facet_grid('CBIndustry ~ CBRegion', scales="free_y", space="free") + \
    labs(x="", y="", title=title) + \
    theme_bw() + \
    theme(panel_grid_major_y=element_blank(),  # theme() tweaks must follow theme_bw(), or they are overridden
          plot_title=element_text(linespacing=0.8, face="bold", size=20, va="center"),
          axis_text_x=element_text(colour="#333333", size=12, rotation=0, ha="center", va="top", face="bold"),
          axis_text_y=element_text(colour="#333333", size=12, rotation=0, ha="right", va="center", face="bold"),
          axis_title_x=element_blank(),
          axis_title_y=element_blank(),
          legend_position="none",
          strip_text_x=element_text(size=12, face="bold", colour="black", angle=0),
          strip_text_y=element_text(size=8, face="bold", colour="black", angle=0, ha="left"),
          strip_background_y=element_text(width=0.2),
          figure_size=(30, 20))
The image from plotnine is as follows:
Comparing Python vs R, we can clearly see that the y-axis labels overlap with plotnine. In addition, looking at the Europe and Community Groups panels, they get the same size box as panels with multiple groups, which is unnecessary. I also tried different aspect ratios, but that did not resolve the problem.
In short, I would like the same plot that R produces. It does not have to be made with plotnine; alternatives are also welcome. The top ten rows of data are:
{'CBRegion': {0: 'Europe', 1: 'Europe', 2: 'Europe', 3: 'Europe', 4: 'Europe', 5: 'Europe', 6: 'Europe', 7: 'Europe', 8: 'Europe', 9: 'Europe'}, 'CBSector': {0: 'Aerospace & Defense', 1: 'Alternative Energy', 2: 'Automobiles & Parts', 3: 'Banks', 4: 'Beverages', 5: 'Chemicals', 6: 'Colleges & Universities', 7: 'Community Groups', 8: 'Construction & Materials', 9: 'Electricity'}, 'CBIndustry': {0: 'Industrials', 1: 'Oil & Gas', 2: 'Consumer Goods', 3: 'Financials', 4: 'Consumer Goods', 5: 'Basic Materials', 6: 'NPO', 7: 'Community Groups', 8: 'Industrials', 9: 'Utilities'}, 'CCR100PDMid': {0: 0.015545818181818181, 1: 0.003296, 2: 0.012897471223021583, 3: 0.008079544600938968, 4: 0.008716597402597401, 5: 0.0094617476340694, 6: 0.008897475862068967, 7: 0.000821, 8: 0.012205547455295736, 9: 0.0050264210526315784}, 'CCR100PDMid %': {0: 1.554581818181818, 1: 0.3296, 2: 1.2897471223021584, 3: 0.8079544600938968, 4: 0.8716597402597401, 5: 0.9461747634069401, 6: 0.8897475862068966, 7: 0.0821, 8: 1.2205547455295735, 9: 0.5026421052631579}, 'label': {0: '1.6%', 1: '0.3%', 2: '1.3%', 3: '0.8%', 4: '0.9%', 5: '0.9%', 6: '0.9%', 7: '0.1%', 8: '1.2%', 9: '0.5%'}}
If necessary, I can upload the entire dataset, but I just read the guidance on minimal reproducible examples and it says that I should only include a subset of the data. I am new to SO and hope I have included all the vital information. I will be grateful for any help. Thank you in advance!
The other issues (colours, overlapping labels, wrapping text, etc.) can be fixed, but unfortunately space = 'free' is not currently supported in plotnine; see the documentation here. That is kind of a deal-breaker for your table, so you will need to do it in R's ggplot.
I want to make a histogram of all the intervals between repeated values in a list. I wrote some code that works, but it uses a for loop with if statements. I often find that a version using clever slicing and/or predefined Python (NumPy) methods is much faster than one using for loops, but in this case I can't think of a way to do that. Can anyone suggest a faster or more Pythonic way of doing this?
# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    values = sorted(set(a))  # get list of which values are in a
    # set up the dicts to hold the histogram and the last index seen
    hist, last_index = {}, {}
    for i in values:
        hist[i] = {}
        last_index[i] = -1  # some default value
    # now go through the array and find intervals
    for i in range(len(a)):
        val = a[i]
        if last_index[val] != -1:  # do nothing if it's the first time
            interval = i - last_index[val]
            if interval in hist[val]:
                hist[val][interval] += 1
            else:
                hist[val][interval] = 1
        last_index[val] = i
    return hist
# example list/array
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
histdict = hist_intervals(a)
print("histdict = ",histdict)
# correct answer for this example
answer = { 1: {3:1, 6:1},
2: {2:1, 5:1},
3: {1:1, 3:1, 6:1},
4: {6:1},
5: {6:1}
}
print("answer = ",answer)
Sample output:
histdict = {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}
answer = {1: {3: 1, 6: 1}, 2: {2: 1, 5: 1}, 3: {1: 1, 3: 1, 6: 1}, 4: {6: 1}, 5: {6: 1}}
^ note: I don't care about the ordering in the dict, so this solution is acceptable, but I want to be able to run this on really large arrays/lists, and I suspect my current method will be slow.
You can eliminate the setup loop by a carefully constructed defaultdict. Then you're just left with a single scan over the input list, which is as good as it gets. Here I change the resultant defaultdict back to a regular Dict[int, Dict[int, int]], but that's just so it prints nicely.
from collections import defaultdict

def count_intervals(iterable):
    # setup
    last_seen = {}
    hist = defaultdict(lambda: defaultdict(int))
    # the actual work
    for i, x in enumerate(iterable):
        if x in last_seen:
            hist[x][i - last_seen[x]] += 1
        last_seen[x] = i
    return hist
a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
hist = count_intervals(a)
for k, v in hist.items():
    print(k, dict(v))
# 1 {3: 1, 6: 1}
# 3 {3: 1, 6: 1, 1: 1}
# 2 {5: 1, 2: 1}
# 5 {6: 1}
# 4 {6: 1}
There is an obvious change to make in terms of data structures: instead of using a dictionary of dictionaries for hist, use a defaultdict of Counters. This lets the code become:
from collections import defaultdict, Counter

# make a 'histogram'/count of all the intervals between repeated values
def hist_intervals(a):
    # the defaultdict of Counters replaces all the manual setup
    hist, last_index = defaultdict(Counter), {}
    # now go through the array and find intervals
    for i, val in enumerate(a):
        if val in last_index:
            interval = i - last_index[val]
            hist[val].update((interval,))
        last_index[val] = i
    return hist
This will be faster, since the default handling now happens in C rather than in explicit if statements, and the code is also cleaner.
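Since the question explicitly wonders about NumPy, here is a mostly-vectorised sketch of my own (not from either answer): a stable argsort groups the indices of equal values together, so the intervals are just the differences between neighbouring indices within each group. The final tally still loops in Python, so benchmark before assuming a win on your data.
import numpy as np
from collections import defaultdict, Counter

def hist_intervals_np(a):
    a = np.asarray(a)
    order = np.argsort(a, kind="stable")         # indices of a, grouped by value
    sorted_vals = a[order]
    gaps = np.diff(order)                        # index gaps between neighbours
    same = sorted_vals[1:] == sorted_vals[:-1]   # keep only pairs within one value group
    hist = defaultdict(Counter)
    for v, g in zip(sorted_vals[1:][same], gaps[same]):
        hist[int(v)][int(g)] += 1
    return hist

a = [1,2,3,1,5,3,2,4,2,1,5,3,3,4]
print({k: dict(v) for k, v in hist_intervals_np(a).items()})
# {1: {3: 1, 6: 1}, 2: {5: 1, 2: 1}, 3: {3: 1, 6: 1, 1: 1}, 4: {6: 1}, 5: {6: 1}}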
I have a dictionary of lists like this:
my_dict = {
'key_a': [1, 3, 4],
'key_b': [0, 2],
}
I want to create a reverse lookup dict like this:
reverse_dict = {
0: 'key_b',
1: 'key_a',
2: 'key_b',
3: 'key_a',
4: 'key_a',
}
I have a working version of the solution:
reverse_dict = {elem: key for key, a_list in my_dict.items() for elem in a_list}
But I wanted to know if someone can provide an alternative that doesn't use a double for, as I feel it loses readability. I'd prefer a single for loop, or functions like those in itertools, or a functional-programming approach.
Your solution using dictionary comprehension is the Pythonic way to achieve it.
However, as an alternative with a single for loop, as you requested, here is a functional one using zip(), itertools.repeat(), and itertools.chain.from_iterable(), though I doubt it is any better than your solution in terms of readability:
my_dict = {
'key_a': [1, 3, 4],
'key_b': [0, 2],
}
from itertools import chain, repeat
new_dict = dict(chain.from_iterable(zip(v, repeat(k)) for k, v in my_dict.items()))
where new_dict will hold:
{0: 'key_b', 1: 'key_a', 2: 'key_b', 3: 'key_a', 4: 'key_a'}
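To see what each piece produces, a quick illustration with one key:
from itertools import repeat
print(list(zip([1, 3, 4], repeat('key_a'))))
# [(1, 'key_a'), (3, 'key_a'), (4, 'key_a')]
zip() stops as soon as the list is exhausted, so the infinite repeat() is safe here.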
>>> from itertools import chain
>>> dict(chain.from_iterable(map(lambda k: map(lambda j:(j, k), my_dict.get(k)), my_dict)))
{1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}
No for loops, but it's clearly not more readable.
Another one, based on @MoinuddinQuadri's idea:
>>> from itertools import chain, repeat
>>> dict(chain.from_iterable(map(lambda k: zip(my_dict[k], repeat(k)), my_dict)))
{1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}
Here's one without an explicit for loop. Not really recommended, though (a side effect inside map is a no-no to many):
>>> import itertools as it
>>> out = {}
>>> any(map(out.update, it.starmap(dict.fromkeys, map(reversed, my_dict.items()))))
False
>>> out
{1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}
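For what it's worth, a plain single-loop version (my own sketch, using dict.fromkeys) avoids both the double for and the itertools machinery:
reverse_dict = {}
for key, a_list in my_dict.items():
    reverse_dict.update(dict.fromkeys(a_list, key))

print(reverse_dict)
# {1: 'key_a', 3: 'key_a', 4: 'key_a', 0: 'key_b', 2: 'key_b'}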