Is it possible to do something like the following as a one-liner in Python, with syntax that stays readable?
d = dict((i,i+1) for i in range(10))
d.update((i,i+2) for i in range(20,25))
>>> from itertools import chain
>>> dict(chain(((i, i+1) for i in range(10)),
...            ((i, i+2) for i in range(20, 25))))
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 20: 22, 21: 23, 22: 24, 23: 25, 24: 26}
How about this (note that passing non-string keys through ** only works on CPython 2; Python 3 raises "keywords must be strings"):
d = dict(dict((i, i+1) for i in range(10)), **dict((i, i+2) for i in range(20, 25)))
result:
{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 10, 20: 22, 21: 23, 22: 24, 23: 25, 24: 26}
@jamylak's answer is great and should do the job. Anyway, for this specific problem I would probably do this:
d = dict((i, i+1) if i < 10 else (i, i+2) for i in range(25) if i < 10 or i >= 20)
This gives the same output:
d = dict((i, i+x) for x, y in [(1, range(10)), (2, range(20, 25))] for i in y)
You could also write it with enumerate, so:
d = dict((i, i+x) for x, y in enumerate([range(10), range(20, 25)], 1) for i in y)
But it's slightly longer, and it assumes you intend a smooth incrementation, which might not hold later. The real difficulty is not knowing whether you plan to extend this into an even longer expression, which would change the requirements and affect which answer is most convenient.
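On Python 3.9+ the same merge can be written with the dict union operator |, which arguably reads better than either chain or the ** trick:

```python
# Python 3.9+: merge two dict comprehensions with the union operator
d = {i: i + 1 for i in range(10)} | {i: i + 2 for i in range(20, 25)}
print(d)
```

On earlier Python 3 versions, `{**a, **b}` gives the same result.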
I'm trying to write a piece of code, and I keep getting stuck trying to search a DataFrame for a list of columns, which fails because a list is unhashable.
Essentially, I have a sequence: 'KGTLPK'
I want to first locate every instance of 'K' in the sequence: [0,5]
Then I want to search columns 0 and 5 of my DataFrame for a specific value: 80
I want to get a list of rows that have '80' in columns 0 & 5, and delete those rows.
query = 'KGTLPK'
AA = 'K'
x = []
for pos, char in enumerate(query):
    if char == AA:
        x.append(pos)
print(x)

# pddf = my DataFrame
df2 = pddf.filter(regex=x)
print(df2)
rows_removal = list(pddf.loc[pddf[df2] == Phosphorylation].index.tolist())
print(rows_removal)
pddf.drop(pddf.index[rows_removal])
My full DataFrame has 8855 rows, and this needs to decrease as improper values are identified. In this example I deleted all rows where column 0 was not equal to 16. I just need an easier way to do this so I don't hardcode everything.
pddf.head(10).to_dict() output, DataFrame unedited:
{0: {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}, 1: {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}, 2: {0: 16, 1: 16, 2: 16, 3: 16, 4: 16, 5: 16, 6: 16, 7: 16, 8: 16, 9: 16}, 3: {0: 16, 1: 44, 2: 80, 3: 42, 4: 71, 5: 28, 6: 14, 7: 28, 8: 42, 9: 81}}
This extends for 8000 rows.
This is as far as I've gotten. My goal for example would be to say something like "for every instance of '42' in column 3, delete that row"
Any help I can get is greatly appreciated!
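For the row-dropping step described above ("for every instance of a value in given columns, delete that row"), boolean masking may be simpler than filter with a regex. Here is a sketch against a small hypothetical frame (not the real pddf), using 80 as the target value:

```python
import pandas as pd

# Hypothetical stand-in for pddf; integer column labels 0 and 5
# correspond to the 'K' positions in the sequence
pddf = pd.DataFrame({0: [16, 80, 16], 3: [42, 42, 42], 5: [16, 16, 80]})

query = 'KGTLPK'
AA = 'K'
cols = [pos for pos, char in enumerate(query) if char == AA]  # [0, 5]
cols = [c for c in cols if c in pddf.columns]  # keep only labels that exist

# drop every row where any of the selected columns holds the value 80
mask = (pddf[cols] == 80).any(axis=1)
pddf = pddf[~mask]
print(pddf)
```

The same mask approach works for the "delete rows where column 3 equals 42" case: `pddf = pddf[pddf[3] != 42]`.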
I'm having difficulty getting the following complex dict comprehension to work as expected. It's a doubly nested loop with conditionals.
Let me first explain what I'm doing:
import pandas as pd
dict1 = {'stringA':['ABCDBAABDCBD','BBXB'], 'stringB':['ABDCXXXBDDDD', 'AAAB'], 'num':[42, 13]}
df = pd.DataFrame(dict1)
print(df)
stringA stringB num
0 ABCDBAABDCBD ABDCXXXBDDDD 42
1 BBXB AAAB 13
This DataFrame has two columns stringA and stringB with strings containing characters A, B, C, D, X. By definition, these two strings have the same length.
Based on these two columns, I create dictionaries keyed by the index of stringA (starting at 0), with values starting at num.
Here's the function I use:
def create_translation(x):
    x['translated_dictionary'] = {i: i + x['num'] for i, e in enumerate(x['stringA'])}
    return x
df2 = df.apply(create_translation, axis=1).groupby('stringA')['translated_dictionary']
df2.head()
0 {0: 42, 1: 43, 2: 44, 3: 45, 4: 46, 5: 47, 6: ...
1 {0: 13, 1: 14, 2: 15, 3: 16}
Name: translated_dictionary, dtype: object
print(df2.head()[0])
{0: 42, 1: 43, 2: 44, 3: 45, 4: 46, 5: 47, 6: 48, 7: 49, 8: 50, 9: 51, 10: 52, 11: 53}
print(df2.head()[1])
{0: 13, 1: 14, 2: 15, 3: 16}
That's correct.
However, there are 'X' characters in these strings. That requires a special rule: If X is in stringA, don't create a key-value pair in the dictionary. If X is in stringB, then the value should not be i + x['num'] but -500.
I tried the following dict comprehension:
def try1(x):
    for count, element in enumerate(x['stringB']):
        x['translated_dictionary'] = {i: -500 if element == 'X' else i + x['num'] for i, e in enumerate(x['stringA']) if e != 'X'}
    return x
That gives the wrong answer.
df3 = df.apply(try1, axis=1).groupby('stringA')['translated_dictionary']
print(df3.head()[0]) ## this is wrong!
{0: 42, 1: 43, 2: 44, 3: 45, 4: 46, 5: 47, 6: 48, 7: 49, 8: 50, 9: 51, 10: 52, 11: 53}
print(df3.head()[1]) ## this is correct! There is no key for 2:15!
{0: 13, 1: 14, 3: 16}
There are no -500 values!
The correct answer is:
print(df3.head()[0])
{0: 42, 1: 43, 2: 44, 3: 45, 4: -500, 5: -500, 6: -500, 7: 49, 8: 50, 9: 51, 10: 52, 11: 53}
print(df3.head()[1])
{0: 13, 1: 14, 3: 16}
Here's a simple way, without any comprehensions (because they aren't helping clarify the code):
def create_translation(x):
    out = {}
    num = x['num']
    for i, (a, b) in enumerate(zip(x['stringA'], x['stringB'])):
        if a == 'X':
            pass
        elif b == 'X':
            out[i] = -500
        else:
            out[i] = num
        num += 1
    x['translated_dictionary'] = out
    return x
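If a comprehension is still wanted, the same rule can be expressed by zipping the two strings. Since num advances on every position, the value at position i is simply x['num'] + i. A sketch, equivalent in output to the loop version for this data:

```python
import pandas as pd

def create_translation(x):
    # skip positions where stringA is 'X'; use -500 where stringB is 'X';
    # otherwise the value is num plus the position index
    x['translated_dictionary'] = {
        i: (-500 if b == 'X' else x['num'] + i)
        for i, (a, b) in enumerate(zip(x['stringA'], x['stringB']))
        if a != 'X'
    }
    return x

df = pd.DataFrame({'stringA': ['ABCDBAABDCBD', 'BBXB'],
                   'stringB': ['ABDCXXXBDDDD', 'AAAB'],
                   'num': [42, 13]})
df2 = df.apply(create_translation, axis=1)
print(df2.loc[0, 'translated_dictionary'])
print(df2.loc[1, 'translated_dictionary'])
```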
Why not flatten your df (you can check with this post) and then recreate the dict:
n = df.stringA.str.len()
newdf = pd.DataFrame({'num': df.num.repeat(n),
                      'stringA': sum(list(map(list, df.stringA)), []),
                      'stringB': sum(list(map(list, df.stringB)), [])})
newdf = newdf.loc[newdf.stringA != 'X'].copy()  # remove rows where stringA is 'X'
newdf['value'] = newdf.groupby('num').cumcount() + newdf.num  # cumcount within each group
newdf.loc[newdf.stringB == 'X', 'value'] = -500  # assign -500 when stringB is 'X'
[dict(zip(x.groupby('num').cumcount(), x['value'])) for _, x in newdf.groupby('num')]  # one dict per num
Out[390]:
[{0: 13, 1: 14, 2: 15},
{0: 42,
1: 43,
2: 44,
3: 45,
4: -500,
5: -500,
6: -500,
7: 49,
8: 50,
9: 51,
10: 52,
11: 53}]
I have a reference dictionary with subjects and page numbers like so:
reference = { 'maths': [3, 24],'physics': [4, 9, 12],'chemistry': [1, 3, 15] }
I need help writing a function that inverts the reference. That is, returns a dictionary with page numbers as keys, each with an associated list of subjects. For example, swap(reference) run on the above example should return
{ 1: ['chemistry'], 3: ['maths', 'chemistry'], 4: ['physics'],
9: ['physics'], 12: ['physics'], 15: ['chemistry'], 24: ['maths'] }
You can use a defaultdict:
from collections import defaultdict
d = defaultdict(list)
reference = { 'maths': [3, 24],'physics': [4, 9, 12],'chemistry': [1, 3, 15] }
for a, b in reference.items():
    for i in b:
        d[i].append(a)
print(dict(d))
Output:
{1: ['chemistry'], 3: ['maths', 'chemistry'], 4: ['physics'], 9: ['physics'], 12: ['physics'], 15: ['chemistry'], 24: ['maths']}
Without importing from collections:
d = {}
for a, b in reference.items():
    for i in b:
        if i in d:
            d[i].append(a)
        else:
            d[i] = [a]
Output:
{1: ['chemistry'], 3: ['maths', 'chemistry'], 4: ['physics'], 9: ['physics'], 12: ['physics'], 15: ['chemistry'], 24: ['maths']}
reference = {'maths': [3, 24], 'physics': [4, 9, 12], 'chemistry': [1, 3, 15]}
table = []
newReference = {}
for key in reference:
    values = reference[key]
    for value in values:
        table.append((value, key))
for x in table:
    if x[0] in newReference.keys():
        newReference[x[0]] = newReference[x[0]] + [x[1]]
    else:
        newReference[x[0]] = [x[1]]
print(newReference)
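The same inversion can also be written with dict.setdefault, which avoids both the import and the explicit membership test:

```python
reference = {'maths': [3, 24], 'physics': [4, 9, 12], 'chemistry': [1, 3, 15]}

inverted = {}
for subject, pages in reference.items():
    for page in pages:
        # create the list on first sight of a page, then append
        inverted.setdefault(page, []).append(subject)
print(inverted)
```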
I have the following code, and I don't understand the behavior behind it. Can anyone explain?
import sys

data = {}
print sys.getsizeof(data)
# output: 280

data = {1: 2, 2: 1, 3: 2, 4: 5, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 0: 0, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
print sys.getsizeof(data)
# output: 1816

data = {1: 2, 2: 1, 3: 2, 4: 5, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 0: 0, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16}
print sys.getsizeof(data)
# output: 1048
If we increase the length of the dictionary, its size in memory should increase too, so why does it decrease?
getsizeof() calls the object’s __sizeof__ method and adds an additional garbage collector overhead if the object is managed by the garbage collector.
On Windows x64, if I do it like below:
data={ 1:2,2:1,3:2,4:5,5:5,6:6,7:7,8:8,9:9,0:0,11:11,12:12,13:13,14:14,15:15}
print sys.getsizeof(data)
print data
data[16]=16
print sys.getsizeof(data)
print data
printed:
1808
{0: 0, 1: 2, 2: 1, 3: 2, 4: 5, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15}
1808
{0: 0, 1: 2, 2: 1, 3: 2, 4: 5, 5: 5, 6: 6, 7: 7, 8: 8, 9: 9, 11: 11, 12: 12, 13: 13, 14: 14, 15: 15, 16: 16}
But I did indeed notice the same behavior when rebuilding the data dictionary from a literal, as you did:
272   # empty data dict
1808  # 15 elements in data dict
1040  # 16 elements in data dict
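The drop happens because sys.getsizeof reports the dict's current internal table allocation, which depends on how the dict was built and resized, not merely on its length. A rough sketch (Python 3 syntax; exact byte counts vary across versions and platforms, so none are shown):

```python
import sys

# The same 16 keys built two ways: inserting one at a time triggers
# incremental resizes with over-allocation, while dict.fromkeys can
# presize from the length hint; the reported sizes may therefore differ.
grown = {}
for i in range(16):
    grown[i] = i

presized = dict.fromkeys(range(16), 0)

print(sys.getsizeof(grown), sys.getsizeof(presized))
```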
I have the following dataset:
data = {'VALVE_SCORE': {0: 34.1,1: 41.0,2: 49.7,3: 53.8,4: 35.8,5: 49.2,6: 38.6,7: 51.2,8: 44.8,9: 51.5,10: 41.9,11: 46.0,12: 41.9,13: 51.4,14: 35.0,15: 49.7,16: 41.5,17: 51.5,18: 45.2,19: 53.4,20: 38.1,21: 50.2,22: 25.4,23: 30.0,24: 28.1,25: 49.9,26: 27.5,27: 37.2,28: 27.7,29: 45.7,30: 27.2,31: 30.0,32: 27.9,33: 34.3,34: 29.5,35: 34.5,36: 28.0,37: 33.6,38: 26.8,39: 31.8},
'DAY': {0: 6, 1: 6, 2: 6, 3: 6, 4: 13, 5: 13, 6: 13, 7: 13, 8: 20, 9: 20, 10: 20, 11: 20, 12: 27, 13: 27, 14: 27, 15: 27, 16: 3, 17: 3, 18: 3, 19: 3, 20: 10, 21: 10, 22: 10, 23: 10, 24: 17, 25: 17, 26: 17, 27: 17, 28: 24, 29: 24, 30: 24, 31: 24, 32: 3, 33: 3, 34: 3, 35: 3, 36: 10, 37: 10, 38: 10, 39: 10},
'MONTH': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1, 11: 1, 12: 1, 13: 1, 14: 1, 15: 1, 16: 2, 17: 2, 18: 2, 19: 2, 20: 2, 21: 2, 22: 2, 23: 2, 24: 2, 25: 2, 26: 2, 27: 2, 28: 2, 29: 2, 30: 2, 31: 2, 32: 3, 33: 3, 34: 3, 35: 3, 36: 3, 37: 3, 38: 3, 39: 3}}
df = pd.DataFrame(data)
First, I would like to take the mean by day and then by month. However, grouping by day averages the MONTH column as well, producing decimal months. I would like to preserve the months before I do a groupby('MONTH').mean().
In [401]: df.groupby("DAY").mean()
Out[401]:
VALVE_SCORE MONTH
DAY
3 39.7250 2.5
6 44.6500 1.0
10 32.9875 2.5
13 43.7000 1.0
17 35.6750 2.0
20 46.0500 1.0
24 32.6500 2.0
27 44.5000 1.0
I would like the end result to be:
MONTH VALVE_SCORE
1 value
2 value
3 value
Given your data, you want the daily mean and then the monthly mean. Putting the same data into an Excel pivot table produces that result. Doing the same in pandas, grouping by month alone is enough to get the same result:
df.groupby(['MONTH']).mean()
DAY VALVE_SCORE
MONTH
1 16.5 44.7250
2 13.5 38.0375
3 6.5 30.8000
Since the DAY values are numeric, pandas averages that column as well; if the 'DAY' values were strings rather than numbers, the column would be excluded from the mean and you would get this result instead:
VALVE_SCORE
MONTH
1 44.7250
2 38.0375
3 30.8000
So here, pandas gives the same values as computing the daily means first and then averaging them by month.
Here's a possible solution. Do let me know if there is a more efficient way of doing it.
df = pd.DataFrame(data)
months = list(df['MONTH'].unique())
frames = []
for p in months:
    df_part = df[df['MONTH'] == p]
    df_part_avg = df_part.groupby("DAY", as_index=False).mean()
    df_part_avg = df_part_avg.drop('DAY', axis=1)
    frames.append(df_part_avg)
df_months = pd.concat(frames)
df_final = df_months.groupby("MONTH", as_index=False).mean()
And the result is:
In [430]: df_final
Out[430]:
MONTH VALVE_SCORE
0 1 44.7250
1 2 38.0375
2 3 30.8000
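For reference, the explicit per-month loop can also be condensed into a single chained groupby that computes daily means first and then averages them per month. A sketch on a small hypothetical frame (not the full dataset):

```python
import pandas as pd

# Small hypothetical frame standing in for the full dataset
df = pd.DataFrame({
    'MONTH': [1, 1, 1, 1, 2, 2],
    'DAY': [6, 6, 13, 13, 3, 3],
    'VALVE_SCORE': [40.0, 44.0, 48.0, 52.0, 30.0, 34.0],
})

# daily means first, then the mean of those daily means per month
monthly = (df.groupby(['MONTH', 'DAY'])['VALVE_SCORE'].mean()
             .groupby(level='MONTH').mean())
print(monthly)
```

Note that averaging daily means only matches the direct monthly mean when every day has the same number of rows, as in this dataset.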