I have multiple columns in a data frame that have numerical data. I want to quartile each column, changing each value to either q1, q2, q3 or q4.
I currently loop through each column and change them using the pandas qcut function:
for column_name in df.columns:
    df[column_name] = pd.qcut(df[column_name].astype('float'), 4, ['q1', 'q2', 'q3', 'q4'])
This is very slow! Is there a faster way to do this?
Played around with the following example a little. It looks like converting from a string to float is what increases the time. Since a working example was not provided, the original dtype can't be known. df[column].astype(float, copy=...) performs about the same whether or not it copies. Not much else to go on.
import pandas as pd
import numpy as np
import random
import time

random.seed(2)
indexes = [i for i in range(1, 10000) for _ in range(10)]
df = pd.DataFrame({'A': indexes,
                   'B': [str(random.randint(1, 99)) for e in indexes],
                   'C': [str(random.randint(1, 99)) for e in indexes],
                   'D': [str(random.randint(1, 99)) for e in indexes]})
#df = pd.DataFrame({'A': indexes, 'B': [random.randint(1,99) for e in indexes], 'C':[random.randint(1,99) for e in indexes], 'D':[random.randint(1,99) for e in indexes]})
df_result = pd.DataFrame({'A': indexes,
                          'B': [random.randint(1, 99) for e in indexes],
                          'C': [random.randint(1, 99) for e in indexes],
                          'D': [random.randint(1, 99) for e in indexes]})

def qcut(copy, x):
    for i, column_name in enumerate(df.columns):
        s = pd.qcut(df[column_name].astype('float', copy=copy), 4, ['q1', 'q2', 'q3', 'q4'])
        df_result["col %d %d" % (x, i)] = s.values

# time.clock() was removed in Python 3.8; time.perf_counter() is the replacement
times = []
for x in range(0, 10):
    a = time.perf_counter()
    qcut(True, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))

times = []  # reset, so the second average is not skewed by the first batch
for x in range(10, 20):
    a = time.perf_counter()
    qcut(False, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))
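Given that, here is a minimal sketch of the obvious speedup to try, assuming the slow part really is the per-column string-to-float cast: convert the whole frame to float once, then let apply run qcut over the columns.
# assumption: every column holds numeric strings, as in the timing example above
floats = df.astype(float)  # one cast up front instead of one per loop iteration
quartiles = floats.apply(pd.qcut, q=4, labels=['q1', 'q2', 'q3', 'q4'])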
Related
I'm trying to compare cells within a data frame using pandas.
The data looks like this:
seqnames, start, end, width, strand, s1, s2, s3, sn
1, Ha412HOChr01, 1, 220000, 220000, CN2, CN10, CN2, CN2
2, Ha412HOChr01, 1, 220000, 220000, CN2, CN2, CN2, CN2
3, Ha412HOChr01, 1, 220000, 220000, CN2, CN4, CN2, CN2
n, Ha412HOChr01, 1, 220000, 220000, CN2, CN2, CN2, CN6
I was able to make individual comparisons with the following code:
import pandas as pd
df = pd.read_csv("test.csv")
if df.iloc[0, 5] != df.iloc[0, 6]:
    print("yay!")
else:
    print("not interesting...")
I would like to iterate a comparison between s1 and all the other s columns, line by line, in a loop or in any other more efficient way.
When I tried the following code:
df = pd.read_csv("test.csv")
df.columns
#make sure to change in future analysis
ref = df[' Sunflower_14_S8']
all_the_rest = df.drop(['seqnames', ' start', ' end', ' width', ' strand'], axis=1)
#all_the_rest.columns
OP = ref.eq(all_the_rest)
OP.to_csv("OP.csv")
I got a weird output:
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
444,False,False,False,False,False,False,False,False,False,False,False,False,False
It seems like it compares all the characters instead of the strings.
I'm new to programming and I'm stuck; I appreciate your help!
Does this help?
import pandas as pd
# define a list of columns you want to compare
cols = ['s1', 's2', 's3']
# some sample data
df = pd.DataFrame(columns=cols)
df['s1'] = ['CN2', 'CN10', 'CN2', 'CN2']
df['s2'] = ['CN2', 'CN2', 'CN2', 'CN2']
df['s3'] = ['CN2', 'CN2', 'CN2', 'CN6']
# remove 's1' from the list of columns
cols_except_s1 = [x for x in cols if x!='s1']
# create a blank dataframe to hold our comparisons
df_comparison = pd.DataFrame(columns=cols_except_s1)
# iterate through each other column, comparing it against 's1'
for x in cols_except_s1:
    comparison_series = df['s1'] == df[x]
    df_comparison[x] = comparison_series
# the result is a dataframe that has columns of Boolean values
print(df_comparison)
outputs
s2 s3
0 True True
1 False False
2 True True
3 True False
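For what it's worth, the loop can also be collapsed into a single call; this is a sketch of an equivalent variant, relying on DataFrame.eq broadcasting a Series along the index:
# compare every remaining column against s1 in one shot
df_comparison = df[cols_except_s1].eq(df['s1'], axis=0)
print(df_comparison)  # same Boolean frame as the loop above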
Well, 9 hours later I have found a way without using pandas...
df = pd.read_csv("test.csv")
#df.columns
# convert the data frame to a list of rows
rows = df.values.tolist()  # renamed from "list" to avoid shadowing the built-in
for line in rows:
    lineAVG = sum(line[5:]) / len(line[5:])
    ref = line[5]
    if lineAVG - ref > 0.15:
        output = line
        print(output)
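For completeness, a pandas rendition of the same check, as a sketch, assuming the s columns start at position 5 and hold numeric values:
s_cols = df.columns[5:]
mask = df[s_cols].mean(axis=1) - df.iloc[:, 5] > 0.15  # row average minus the reference column
print(df[mask])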
I have a pandas dataframe like so:
A
a
b
c
d
I am trying to create a python dictionary which would look like this:
df_dict = {'a':0, 'b':1, 'c':2, 'd':3}
What I've tried:
df.reset_index(inplace=True)
df = {x : y for x in df['A'] for y in df['index']}
But the df is 75k rows long and it's been running for a while now; I'm not even sure this produces the result I need. Is there a neat, fast way of achieving this?
Use dict with zip and range:
d = dict(zip(df['A'], range(len(df))))
print (d)
{'a': 0, 'b': 1, 'c': 2, 'd': 3}
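If the frame still has its default RangeIndex, an equivalent variant (a sketch, not part of the answer above) is to zip against the index directly:
d = dict(zip(df['A'], df.index))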
You can do it like this:
#creating an example dataframe with 75,000 rows
import uuid
import pandas as pd
df = pd.DataFrame({"col": [str(uuid.uuid4()) for _ in range(75000)]})
#your bit (note: value as key, index as value, to match the desired output)
{v: i for i, v in df.reset_index().values}
It runs in seconds.
You could convert the Series to a list and use enumerate:
lst = { x: i for i, x in enumerate(df['A'].tolist()) }
In a simplified dataframe:
import pandas as pd
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5])
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
I would like to solve an equation which was attached as an image; judging from the code below, it is roughly INT = ∫ I(x) · x dx / (a · b).
I is equal to the values in the data frame, and x is the int_range values, so in this case from 350 to 355 with dx = 1.
a and b are optional constants.
I need to get a dataframe as output, with one value per row.
For now I do something like this, but I'm not sure it's correct:
from scipy import integrate

dict_INT = {}
for index, row in df1.iterrows():
    func = df1.loc[index] * df1.loc[index].index.astype('float')
    x = df1.loc[index].index.astype('float')
    dict_INT[index] = integrate.trapz(func, x)

df_out = pd.DataFrame(dict_INT, index=['INT']).T
df_fin = df_out / (a * b)
This is the final sum I get per row:
1 3.505796e+06
2 3.068796e+06
3 2.700446e+06
4 2.199336e+06
5 1.840992e+06
I solved this by first converting the dataframe to a dict, then applying your equation to each item in each row, then writing these values to a dict of tuples with collections.defaultdict. I will break it down:
import pandas as pd
from collections import defaultdict
df1 = pd.DataFrame({'350': [7.898167, 6.912074, 6.049002, 5.000357, 4.072320],
'351': [8.094912, 7.090584, 6.221289, 5.154516, 4.211746],
'352': [8.291657, 7.269095, 6.393576, 5.308674, 4.351173],
'353': [8.421007, 7.374317, 6.496641, 5.403691, 4.439815],
'354': [8.535562, 7.463452, 6.584512, 5.485725, 4.517310],
'355': [8.650118, 7.552586, 6.672383, 4.517310, 4.594806]},
index=[1, 2, 3, 4, 5]
)
int_range = df1.columns.astype(float)
a = 0.005
b = 0.837
dx = 1
df_dict = df1.to_dict() # convert df to dict for easier operations
integrated_dict = {} # initialize empty dict
d = defaultdict(list) # initialize empty dict of lists for tuples later
integrated_list = []
for k, v in df_dict.items():  # unpack df dict of dicts
    for x, y in v.items():  # unpack dicts by column and index (x is index, y is column)
        integrated_list.append((k, (float(k) * float(y) * float(dx)) / (a * b)))  # store a list of tuples

for x, y in integrated_list:  # create dict with column header as key and new integrated calc as value (currently a tuple)
    d[x].append(y)

d = {k: tuple(v) for k, v in d.items()}  # unpack to multiple values
integrated_df = pd.DataFrame.from_dict(d) # to df
integrated_df['Sum'] = integrated_df.iloc[:, :].sum(axis=1)
output (updated to include sum):
350 351 352 353 354 \
0 660539.653524 678928.103226 697410.576822 710302.382557 722004.527599
1 578070.704898 594694.141935 611402.972521 622015.269056 631317.086738
2 505890.250896 521785.529032 537763.142652 547984.294624 556969.473835
3 418189.952210 432314.245161 446512.126165 455795.202628 464025.483871
4 340576.344086 353243.212903 365976.797133 374493.356033 382109.376344
355 Sum
0 733761.502987 4.202947e+06
1 640661.416965 3.678162e+06
2 565996.646356 3.236389e+06
3 383188.781362 2.600026e+06
4 389762.516129 2.206162e+06
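As a sketch, the row-wise trapezoid integration from the question can also be done in one vectorized call, assuming the intended quantity really is ∫ I · x dx / (a · b):
import numpy as np
x = df1.columns.astype(float).to_numpy()
func = df1.to_numpy() * x  # I(x) * x for every row at once
# np.trapz is called np.trapezoid in numpy 2.0
df_fin = pd.DataFrame({'INT': np.trapz(func, x, axis=1) / (a * b)}, index=df1.index)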
Given a dataframe containing a numeric (float) series and a categorical ID (df), how can I create a dictionary of the form 'key': [], where the key is an ID from the dataframe and the list contains the differences between the numbers in the two dataframes?
I have managed this using loops, though I am looking for a more pandas way of doing it.
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'a': [0.75435, 0.74897, 0.60949,
0.87438, 0.90885, 0.28547,
0.27327, 0.31078, 0.15576,
0.58139],
'id': list('aaaxxbbyyy')})
rl = pd.DataFrame({'b': [0.51, 0.30], 'id': ['aaa', 'bbb']})
interval = 0.1
d = defaultdict(list)
for index, row in rl.iterrows():
    # note: newer pandas expects inclusive='neither' / 'both' instead of False / True
    before = df[df['a'].between(row['b'] - interval, row['b'], inclusive='neither')]
    after = df[df['a'].between(row['b'], row['b'] + interval, inclusive='both')]
    for x, b_row in before.iterrows():
        d[b_row['id']].append(b_row['a'] - row['b'])
    for x, a_row in after.iterrows():
        d[a_row['id']].append(a_row['a'] - row['b'])

for k, v in d.items():
    print('{k}\t{v}'.format(k=k, v=len(v)))
a 1
y 2
b 2
d
defaultdict(list,
{'a': [0.09948],
'b': [-0.01452, -0.02672],
'y': [0.07138, 0.01078]})
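As a minimal loop-free sketch using numpy broadcasting, assuming both frames fit in memory (the combined window -interval < a - b <= interval matches the before/after split above):
import numpy as np
diff = df['a'].to_numpy()[:, None] - rl['b'].to_numpy()[None, :]  # every a minus every b
rows, cols = np.where((diff > -interval) & (diff <= interval))
pairs = pd.DataFrame({'id': df['id'].to_numpy()[rows], 'diff': diff[rows, cols]})
d = {k: g['diff'].tolist() for k, g in pairs.groupby('id')}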
I have a pandas df that contains 2 columns 'start' and 'end' (both are integers). I would like an efficient method to search for rows such that the range that is represented by the row [start,end] contains a specific value.
Two additional notes:
It is possible to assume that ranges don't overlap
The solution should support a batch mode - that given a list of inputs, the output will be a mapping (dictionary or whatever) to the row indices that contain the matching range.
For example:
start end
0 7216 7342
1 7343 7343
2 7344 7471
3 7472 8239
4 8240 8495
and the query of
[7215,7217,7344]
will result in
{7217: 0, 7344: 2}
Thanks!
A brute-force solution; it could use lots of improvements, though.
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
'end': [7342, 7343, 7471, 8239, 8495]})
search = [7215, 7217, 7344]
res = {}
for i in search:
    mask = (df.start <= i) & (df.end >= i)
    idx = df[mask].index.values
    if len(idx):
        res[i] = idx[0]
print(res)
Yields
{7344: 2, 7217: 0}
Selected solution
This new solution could have better performance. But there is a limitation: it only works if there are no gaps between ranges, as in the example provided.
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
query = [7215,7217,7344]
# Reshaping the original DataFrame
df = df.reset_index()
df = pd.concat([df['start'], df['end']]).reset_index()
df = df.set_index(0).sort_index()
# Creating a DataFrame with a continuous index
max_range = max(df.index) + 1
min_range = min(df.index)
s = pd.DataFrame(index=range(min_range,max_range))
# Joining them
s = s.join(df)
# Filling the gaps (in newer pandas: s = s.bfill())
s = s.fillna(method='backfill')
# Then a simple selection gives the result
s.loc[query,:].dropna().to_dict()['index']
# Result
{7217: 0.0, 7344: 2.0}
Previous proposal
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start','end'])
# Constructing a DataFrame containing the query numbers
import numpy as np
query = [7215, 7217, 7344]
result = pd.DataFrame(np.tile(query, (len(df), 1)), columns=query)
# Merging the data and the query
df = pd.concat([df, result], axis=1)
# Making the test
df = df.apply(lambda x: (x >= x['start']) & (x <= x['end']), axis=1).loc[:,query]
# Keeping only values found
df = df[df == True]
# newer pandas no longer accepts a tuple for axis; drop empty rows, then empty columns
df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)
# Extracting to the output format
result = df.to_dict('split')
result = dict(zip(result['columns'], result['index']))
# The result
{7217: 0, 7344: 2}
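Another direction worth sketching, assuming the ranges are sorted and non-overlapping as stated: binary search with numpy's searchsorted, which handles the whole query batch at once.
import numpy as np
import pandas as pd
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start', 'end'])
q = np.asarray([7215, 7217, 7344])
starts = df['start'].to_numpy()
ends = df['end'].to_numpy()
pos = np.searchsorted(starts, q, side='right') - 1  # candidate row for each query value
valid = (pos >= 0) & (q <= ends[pos])  # keep it only if the value falls inside that row's range
result = {int(v): int(p) for v, p, ok in zip(q, pos, valid) if ok}
# {7217: 0, 7344: 2}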