I'm creating a function to filter many dataframes using groupby. The dataframes look like the one below; however, each dataframe does not always contain the same number of columns.
df = pd.DataFrame({
'xyz CODE': [1,2,3,3,4, 5,6,7,7,8],
'a': [4, 5, 3, 1, 2, 20, 10, 40, 50, 30],
'b': [20, 10, 40, 50, 30, 4, 5, 3, 1, 2],
'c': [25, 20, 5, 15, 10, 25, 20, 5, 15, 10] })
For each dataframe I always apply groupby to the first column - which is named differently across dataframes. All other columns are named consistently across all dataframes.
My question: Is it possible to run groupby using a combination of column location and column names? How can I do it?
I wrote the following function and got the error TypeError: unhashable type: 'list':
def filter_all_df(df):
    df['max_c'] = df.groupby(df.columns[0])['a'].transform('max')
    newdf = df[df['a'] == df['max_c']].drop(['max_c'], axis=1)
    newdf['max_score'] = newdf.groupby([newdf.columns[0], 'a', 'b'])['c'].transform('max')
    newdf = newdf[newdf['c'] == newdf['max_score']]
    newdf = newdf.sort_values([newdf.columns[0]]).drop_duplicates([newdf.columns[0], 'a', 'b', 'c'], keep='last')
    newdf.to_csv('newdf_all.csv')
    return newdf
I am trying to split an array into three new arrays using inequalities.
This will give you an idea of what I am trying to achieve:
measurement = [1, 5, 10, 13, 40, 43, 60]
for x in measurement:
    if 0 < x < 6:
        small = measurement
    elif 6 < x < 15:
        medium = measurement
    else:
        large = measurement
Intended Output:
small = [1, 5]
medium = [10, 13]
large = [40, 43, 60]
If your array is sorted, you can do:
measurement = [1, 5, 10, 13, 40, 43, 60]
one_third = len(measurement) // 3
two_thirds = (2 * len(measurement)) // 3
small = measurement[:one_third]
medium = measurement[one_third:two_thirds]
large = measurement[two_thirds:]
You could easily generalize this to any number of splits with a loop, as sketched below. I'm not sure whether you wanted exactly those inequalities or just to split the array into three parts; if it's the former, my answer is not right.
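A minimal sketch of that generalization (the chunking arithmetic and the names n and parts are my own, not part of the original answer):

measurement = [1, 5, 10, 13, 40, 43, 60]
n = 3  # number of consecutive, roughly equal-sized parts

parts = []
for k in range(n):
    start = (k * len(measurement)) // n
    stop = ((k + 1) * len(measurement)) // n
    parts.append(measurement[start:stop])

print(parts)  # [[1, 5], [10, 13], [40, 43, 60]]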
You can use numpy:
import numpy as np

arr = np.array(measurement)
small = arr[(arr > 0) & (arr < 6)]    # array([1, 5])
medium = arr[(arr > 6) & (arr < 15)]  # array([10, 13])
large = arr[arr > 15]                 # array([40, 43, 60])
You can also use a dictionary:
d = {'small': [], 'medium': [], 'large': []}
for x in measurement:
    if 0 < x < 6:
        d['small'].append(x)
    elif 6 < x < 15:
        d['medium'].append(x)
    else:
        d['large'].append(x)
Output:
{'small': [1, 5], 'medium': [10, 13], 'large': [40, 43, 60]}
With the bisect module you can do something along these lines:
from bisect import bisect

breaks = [0, 6, 15, float('inf')]
buckets = {}
m = [1, 5, 10, 13, 40, 43, 60]
for e in m:
    buckets.setdefault(breaks[bisect(breaks, e)], []).append(e)
You then have a dict of lists matching what you are looking for:
>>> buckets
{6: [1, 5], 15: [10, 13], inf: [40, 43, 60]}
You can also pair labels with empty lists (which will later become a dict) and form tuples of your break points to build the sub-lists:
m = [1, 5, 10, 13, 40, 43, 60]
buckets=[('small',[]), ('medium',[]), ('large',[]), ('other',[])]
breaks=[(0,6),(6,15),(15,float('inf'))]
for x in m:
    buckets[
        next((i for i, t in enumerate(breaks) if t[0] <= x < t[1]), -1)
    ][1].append(x)
>>> dict(buckets)
{'small': [1, 5], 'medium': [10, 13], 'large': [40, 43, 60], 'other': []}
I tried several things using pandas iloc and .append(), but my code doesn't work at all :(
What I want:
I want to look in one row at the values of "dt" and "RT",
then loop through the rest of the dataframe to check whether the following conditions are met:
the value of "dt" should be within +-0.1 of the compared "dt" value
and
the value of "RT" should be within +-0.1 of the compared "RT" value
If both criteria are met,
copy these 2 rows (all the rows that fulfill these criteria) to a new dataframe.
df1 = pd.DataFrame([[1, 760, 36.00, 14.1 , 15000], [2, 184, 36.05, 14.12, 11000], [3, 104, 36.95, 14.13, 12000], [4, 120, 34, 13, 16000]], columns=list(["ID","mz","dt","RT", "area"]))
a = [0,1,2,3]
for i in a:
    df2 = df2.append((df1.loc[(df1.loc[i, ["dt"]]) - 0.1) <= df1.loc[(df1.loc[i, ["dt"]]) <= (df1.loc[(df1.loc[i, ["dt"]]) + 0.1)) & (df1.loc[i, ["RT"]])])
I think you overcomplicated the matter with that chain full of conditionals. You can combine multiple conditions in a single ".loc" by simply adding "&" between them.
Here I show how to apply it to your own example:
import pandas as pd
df1 = pd.DataFrame([[1, 760, 36.00, 14.1, 15000], [2, 184, 36.05, 14.12, 11000], [3, 104, 36.95, 14.13, 12000], [4, 120, 34, 13, 16000]],
                   columns=["ID", "mz", "dt", "RT", "area"])
df_list = []
a = [0, 1, 2, 3]
for i in a:
    current_row_dt = df1.loc[i]['dt']
    current_row_RT = df1.loc[i]['RT']
    # Multiple conditions: keep rows whose dt and RT lie within +-0.1 of the current row
    # ==================================================================================
    df_i = df1.loc[(df1['dt'] >= current_row_dt - 0.1) & (df1['dt'] <= current_row_dt + 0.1) &
                   (df1['RT'] >= current_row_RT - 0.1) & (df1['RT'] <= current_row_RT + 0.1)]
    df_list.append(df_i)
You have to pay close attention to the parentheses, since Python assigns "&" higher precedence than comparison operators like <=, ==, and !=:
https://docs.python.org/3/reference/expressions.html#operator-precedence
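As a small illustration of that precedence rule (the toy dataframe here is mine, not from the question):

import pandas as pd

df = pd.DataFrame({'dt': [36.00, 36.05, 36.95]})
v = 36.00

# Correct: wrap each comparison in parentheses before combining with &
mask = (df['dt'] >= v - 0.1) & (df['dt'] <= v + 0.1)
print(df[mask])  # keeps 36.00 and 36.05

# Without the parentheses, df['dt'] >= v - 0.1 & df['dt'] <= v + 0.1 is
# parsed as df['dt'] >= ((v - 0.1) & df['dt']) <= (v + 0.1), which raises
# an error instead of producing a boolean mask.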
If you have any questions feel free to ask.
PS: Note that, in my example, the rows obtained on each loop iteration are saved in a list of dataframes rather than in a single dataframe. I did this so you can see better what was added on each iteration, but you may want to change that.
Thank you so much for your help already :) It does what I want, but now I am trying to improve it a little: if NO entry in the dataframe can be found that meets the criteria, then the initial row should be dropped.
So my idea is that I want my data to be "compressed/cleaned".
The final result should ONLY include data where at least 2 rows match each other in terms of RT and DT.
If a row is "unique" in terms of RT and DT, it should NOT be present in the final result.
df1 = pd.DataFrame([[1, 760, 36.00, 14.1 , 15000, 22], [3, 104, 35.95, 14.13, 12000, 22], [4, 120, 34, 13, 16000, 22], [2, 184, 36.05, 14.12, 11000, 22],[8, 8, 8, 8, 8, 22],[7, 7, 7, 7, 7, 22],[6, 6, 6, 6, 6, 22]], columns=list(["ID","mz","DT","RT", "area", "random"]))
result = ([1, 760, 36.00, 14.1 , 15000], [2, 184, 36.05, 14.12, 11000], [3, 104, 35.95, 14.13, 12000])
The code below more or less does the job, but it takes a huge amount of time because of the two for loops ...
import pandas as pd
df1 = pd.read_csv("test7.csv")
#df1 = pd.DataFrame([[1, 760, 36.00, 14.1 , 15000, 22], [3, 104, 35.95, 14.13, 12000, 22], [4, 120, 34, 13, 16000, 22], [2, 184, 36.05, 14.12, 11000, 22],[8, 8, 8, 8, 8, 22],[7, 7, 7, 7, 7, 22],[6, 6, 6, 6, 6, 22]], columns=list(["ID","mz","DT","RT", "area", "random"]))
df_list = pd.DataFrame()
final = pd.DataFrame()
a = len(df1)
df2 = df1
for i in range(a):
    current_row_dt = df2.loc[i]['DT']
    current_row_RT = df2.loc[i]['RT']
    for b in range(a):
        compared_row_dt = df2.loc[b]['DT']
        compared_row_RT = df2.loc[b]['RT']
        if compared_row_dt <= (current_row_dt + 0.1) and compared_row_dt >= (current_row_dt - 0.1):
            if compared_row_RT <= (current_row_RT + 0.1) and compared_row_RT >= (current_row_RT - 0.1):
                df_i = df2.loc[b]
                df_list = df_list.append(df_i)
df_dup = df_list[df_list.duplicated(keep=False)]
df_final = df_dup.drop_duplicates()
print(df_final)
df_final.to_csv("test7_sorted.csv")
For the following dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame({'chr_key': [1, 1, 1, 2, 2, 3, 4],
                   'position': [123, 124, 125, 126, 127, 128, 129],
                   'hit_count': [20, 19, 18, 17, 16, 15, 14]})
df['strand'] = np.nan
I want to revise the strand column such that:
for i in range(0, len(df['position'])):
    if df['chr_key'][i] == df['chr_key'][i+1] and df['hit_count'][i] >= df['hit_count'][i+1]:
        df['strand'][i] = 'F'
    else:
        df['strand'][i] = 'R'
My actual df is >100k lines, so a for-loop is slow as one can imagine. Is there a fast way to achieve this?
I modified my original dataframe. Output will be:
df = pd.DataFrame({'chr_key' : [1, 1, 1, 2, 2, 3, 4], 'position' : [123, 124, 125, 126, 127, 128, 129], 'hit_count' : [20, 19, 18, 17, 16, 15, 14], 'strand': ['R', 'R', 'F', 'R', 'F', 'F', 'F']})
Because there are only three rows with chr_key == 1, when it comes to the third row there is no i+1 row to compare against, so the strand value will default to F.
You can try this:
import pandas as pd
df = pd.DataFrame({'chr_key' : [1, 1, 1, 2, 2, 3, 4], 'position' : [123, 124, 125, 126, 127, 128, 129], 'hit_count' : [20, 19, 18, 17, 16, 15, 14]})
df['strand'] = 'R'
idx_1 = df.chr_key == df.chr_key.shift(-1)
idx_2 = df.hit_count >= df.hit_count.shift(-1)
df.loc[idx_1 & idx_2, 'strand'] = 'F'
Using the loc or iloc methods to access a pandas dataframe is better practice: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
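For example, a small sketch of the difference (my own illustration, not from the linked docs):

import pandas as pd

df = pd.DataFrame({'chr_key': [1, 1, 2], 'strand': ['R', 'R', 'R']})

# Chained indexing such as df['strand'][0] = 'F' can raise a
# SettingWithCopyWarning and may not modify df at all.
# Selecting the rows and the column in a single .loc call is the safer pattern:
df.loc[df['chr_key'] == 1, 'strand'] = 'F'
print(df)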
I am using np.where and shift:
import numpy as np

c1 = (df.chr_key == df.chr_key.shift(-1))
c2 = (df.hit_count >= df.hit_count.shift(-1))
df['strand'] = np.where(c1 & c2, 'F', 'R')
I have an array containing an even number of integers. The array represents pairings of an identifier and a count, and the pairs have already been sorted by the identifier. I would like to merge a few of these arrays together. I have thought of a few ways to do it, but they are fairly complicated, and I feel there might be an easy way to do this with Python.
i.e.:
[<id>, <count>, <id>, <count>]
Input:
[14, 1, 16, 4, 153, 21]
[14, 2, 16, 3, 18, 9]
Output:
[14, 3, 16, 7, 18, 9, 153, 21]
It would be better to store these as dictionaries than as lists (not just for this purpose, but for other use cases, such as extracting the value of a single ID):
x1 = [14, 1, 16, 4, 153, 21]
x2 = [14, 2, 16, 3, 18, 9]
# turn into dictionaries (could write a function to convert)
d1 = dict([(x1[i], x1[i + 1]) for i in range(0, len(x1), 2)])
d2 = dict([(x2[i], x2[i + 1]) for i in range(0, len(x2), 2)])
print d1
# {16: 4, 153: 21, 14: 1}
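The comment above mentions that the conversion could be factored into a function; a minimal sketch of such a helper (the name pairs_to_dict is mine):

def pairs_to_dict(flat):
    # turn [id, count, id, count, ...] into {id: count, ...}
    return dict(zip(flat[0::2], flat[1::2]))

print(pairs_to_dict([14, 1, 16, 4, 153, 21]))
# {14: 1, 16: 4, 153: 21}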
After that, you could use any of the solutions in this question to add them together. For example (taken from the first answer):
import collections

def d_sum(a, b):
    d = collections.defaultdict(int, a)
    for k, v in b.items():
        d[k] += v
    return dict(d)
print d_sum(d1, d2)
# {16: 7, 153: 21, 18: 9, 14: 3}
collections.Counter() is what you need here:
In [21]: lis1=[14, 1, 16, 4, 153, 21]
In [22]: lis2=[14, 2, 16, 3, 18, 9]
In [23]: from collections import Counter
In [24]: dic1=Counter(dict(zip(lis1[0::2],lis1[1::2])))
In [25]: dic2=Counter(dict(zip(lis2[0::2],lis2[1::2])))
In [26]: dic1+dic2
Out[26]: Counter({153: 21, 18: 9, 16: 7, 14: 3})
or:
In [51]: it1=iter(lis1)
In [52]: it2=iter(lis2)
In [53]: dic1=Counter(dict((next(it1),next(it1)) for _ in xrange(len(lis1)/2)))
In [54]: dic2=Counter(dict((next(it2),next(it2)) for _ in xrange(len(lis2)/2)))
In [55]: dic1+dic2
Out[55]: Counter({153: 21, 18: 9, 16: 7, 14: 3})
Use collections.Counter:
import itertools
import collections
def grouper(n, iterable, fillvalue=None):
    args = [iter(iterable)] * n
    return itertools.izip_longest(fillvalue=fillvalue, *args)
count1 = collections.Counter(dict(grouper(2, lst1)))
count2 = collections.Counter(dict(grouper(2, lst2)))
result = count1 + count2
I've used the itertools library grouper recipe here to convert your data to dictionaries, but as other answers have shown you there are more ways to skin that particular cat.
result is a Counter with each id pointing to a total count:
Counter({153: 21, 18: 9, 16: 7, 14: 3})
Counters are multi-sets and will keep track of the count of each key with ease. It feels like a much better data structure for your data. They support summing, as used above, for example.
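For instance, a few of the multiset-style operations a Counter supports beyond the summing used above (a small illustration of my own, reusing the question's data):

from collections import Counter

c1 = Counter({14: 1, 16: 4, 153: 21})
c2 = Counter({14: 2, 16: 3, 18: 9})

print(c1 + c2)                   # Counter({153: 21, 18: 9, 16: 7, 14: 3})
print((c1 + c2).most_common(2))  # [(153, 21), (18, 9)]
print(c1 - c2)                   # non-positive counts are dropped: Counter({153: 21, 16: 1})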
All of the previous answers look good, but I think that the JSON blob should be properly formed to begin with or else (from my experience) it can cause some serious problems down the road during debugging etc. In this case with id and count as the fields, the JSON should look like
[{"id":1, "count":10}, {"id":2, "count":10}, {"id":1, "count":5}, ...]
Properly formed JSON like that is much easier to deal with, and probably similar to what you have coming in anyway.
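If the data really arrives as the flat lists from the question, one way to reshape it into that form could look like this (a sketch; the field names just follow the example above):

import json

flat = [14, 1, 16, 4, 153, 21]
records = [{"id": i, "count": c} for i, c in zip(flat[0::2], flat[1::2])]
print(json.dumps(records))
# [{"id": 14, "count": 1}, {"id": 16, "count": 4}, {"id": 153, "count": 21}]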
This class is a bit general, but certainly extensible
from itertools import groupby

class ListOfDicts():
    def __init__(self, listofD=None):
        self.list = []
        if listofD is not None:
            self.list = listofD

    def key_total(self, group_by_key, aggregate_key):
        """ Aggregate a list of dicts by a group-by key and an aggregation key """
        out_dict = {}
        for k, g in groupby(self.list, key=lambda r: r[group_by_key]):
            print k
            total = 0
            for record in g:
                print " ", record
                total += record[aggregate_key]
            out_dict[k] = total
        return out_dict

if __name__ == "__main__":
    z = ListOfDicts([{'id': 1, 'count': 2, 'junk': 2},
                     {'id': 1, 'count': 4, 'junk': 2},
                     {'id': 1, 'count': 6, 'junk': 2},
                     {'id': 2, 'count': 2, 'junk': 2},
                     {'id': 2, 'count': 3, 'junk': 2},
                     {'id': 2, 'count': 3, 'junk': 2},
                     {'id': 3, 'count': 10, 'junk': 2},
                     ])
    totals = z.key_total("id", "count")
    print totals
Which gives
1
{'count': 2, 'junk': 2, 'id': 1}
{'count': 4, 'junk': 2, 'id': 1}
{'count': 6, 'junk': 2, 'id': 1}
2
{'count': 2, 'junk': 2, 'id': 2}
{'count': 3, 'junk': 2, 'id': 2}
{'count': 3, 'junk': 2, 'id': 2}
3
{'count': 10, 'junk': 2, 'id': 3}
{1: 12, 2: 8, 3: 10}