Comparing cells by iteration using pandas - Python

I'm trying to compare cells within a data frame using pandas.
The data looks like this:
seqnames, start, end, width, strand, s1, s2, s3, sn
1, Ha412HOChr01, 1, 220000, 220000, CN2, CN10, CN2, CN2
2, Ha412HOChr01, 1, 220000, 220000, CN2, CN2, CN2, CN2
3, Ha412HOChr01, 1, 220000, 220000, CN2, CN4, CN2, CN2
n, Ha412HOChr01, 1, 220000, 220000, CN2, CN2, CN2, CN6
I was able to make individual comparisons with the following code:
import pandas as pd

df = pd.read_csv("test.csv")
if df.iloc[0, 5] != df.iloc[0, 6]:
    print("yay!")
else:
    print("not interesting...")
I would like to iterate a comparison between s1 and all the other s columns, line by line, in a loop or by any other more efficient method.
When I tried the following code:
df = pd.read_csv("test.csv")
df.columns
# make sure to change this in future analyses
ref = df[' Sunflower_14_S8']
all_the_rest = df.drop(['seqnames', ' start', ' end', ' width', ' strand'], axis=1)
# all_the_rest.columns
OP = ref.eq(all_the_rest)
OP.to_csv("OP.csv")
I got a weird output:
0,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False
444,False,False,False,False,False,False,False,False,False,False,False,False,False
It seems like it compares all the characters instead of the strings.
I'm new to programming and I'm stuck; I appreciate your help!

Does this help?
import pandas as pd

# define a list of columns you want to compare
cols = ['s1', 's2', 's3']

# some sample data
df = pd.DataFrame(columns=cols)
df['s1'] = ['CN2', 'CN10', 'CN2', 'CN2']
df['s2'] = ['CN2', 'CN2', 'CN2', 'CN2']
df['s3'] = ['CN2', 'CN2', 'CN2', 'CN6']

# remove 's1' from the list of columns
cols_except_s1 = [x for x in cols if x != 's1']

# create a blank dataframe to hold our comparisons
df_comparison = pd.DataFrame(columns=cols_except_s1)

# iterate through each other column, comparing it against 's1'
for x in cols_except_s1:
    comparison_series = df['s1'] == df[x]
    df_comparison[x] = comparison_series

# the result is a dataframe that has columns of Boolean values
print(df_comparison)
outputs
      s2     s3
0   True   True
1  False  False
2   True   True
3   True  False
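For reference, the same comparison can also be done without an explicit loop. This is a sketch of the vectorized form; the key detail is axis=0, which aligns the Series with the DataFrame's rows (the original ref.eq(all_the_rest) aligned on labels instead of rows, which is why every cell came back False):
import pandas as pd

df = pd.DataFrame({'s1': ['CN2', 'CN10', 'CN2', 'CN2'],
                   's2': ['CN2', 'CN2', 'CN2', 'CN2'],
                   's3': ['CN2', 'CN2', 'CN2', 'CN6']})

# compare every other column against s1 in one shot, row by row
print(df[['s2', 's3']].eq(df['s1'], axis=0))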

Well, 9 hours later I have found a way without using pandas...
import pandas as pd

df = pd.read_csv("test.csv")
# df.columns
# convert the data frame to a list of rows (renamed from `list`
# to avoid shadowing the built-in)
rows = df.values.tolist()
for line in rows:
    lineAVG = sum(line[5:]) / len(line[5:])
    ref = line[5]
    if lineAVG - ref > 0.15:
        output = line
        print(output)
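A vectorized pandas equivalent of that loop (a sketch, assuming the sample columns from position 5 onward are numeric) could be:
import pandas as pd

df = pd.read_csv("test.csv")

# per-row mean of the sample columns, minus the reference column
ref = df.iloc[:, 5]
diff = df.iloc[:, 5:].mean(axis=1) - ref

# rows where the average deviates from the reference by more than 0.15
print(df[diff > 0.15])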

Related

python pandas dataframe add colour to adjusted and inserted row

I have the following DataFrame:
import pandas as pd
df = pd.DataFrame()
df['number'] = (651,651,651,4267,4267,4267,4267,4267,4267,4267,8806,8806,8806,6841,6841,6841,6841)
df['name']=('Alex','Alex','Alex','Ankit','Ankit','Ankit','Ankit','Ankit','Ankit','Ankit','Abhishek','Abhishek','Abhishek','Blake','Blake','Blake','Blake')
df['hours']=(8.25,7.5,7.5,7.5,14,12,15,11,6.5,14,15,15,13.5,8,8,8,8)
df['loc']=('Nar','SCC','RSL','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNIT-C','UNI','UNI','UNI','UNKING','UNKING','UNKING','UNKING')
print(df)
If the running balance of an individual's hours reaches 38, an adjustment is made to the cell that reached the 38th hour, a duplicate row is inserted, and the balance of hours is added to the following row. The following code performs this, and the difference between the original and the adjusted data can be seen in its output.
# running total of hours within each 'number' group
s = df.groupby('number')['hours'].cumsum()
# rows where the running total exceeds 38
m = s.gt(38)
# first row per group that crosses the threshold
idx = m.groupby(df['number']).idxmax()
# hours that fit under 38 at each row (38 minus the previous running total)
delta = s.groupby(df['number']).shift().rsub(38).fillna(s)
# duplicate the crossing rows
out = df.loc[df.index.repeat((df.index.isin(idx) & m) + 1)]
# split the hours between the adjusted row and the inserted copy
out.loc[out.index.duplicated(keep='last'), 'hours'] = delta
out.loc[out.index.duplicated(), 'hours'] -= delta
print(out)
I then output to CSV with the following:
out.to_csv('Output.csv', index = False)
I need to have the row that got adjusted and the row that got inserted highlighted in a color (any color) when it is exported to csv.
UPDATE: as CSV output cannot carry colours, any way to tag the adjusted and inserted rows is acceptable.
You can't add any kind of formatting, including colors, to a CSV. You can, however, color records in a DataFrame.
# single-index:
# Load a dataset
import seaborn as sns
df = sns.load_dataset('planets')

# Now let's group the data
groups = df.groupby('method').mean()
groups

# Highlight the maximum values
groups.style.highlight_max(color='lightgreen')
# multi-index:
import numpy as np
import pandas as pd

df = pd.DataFrame([['one', 'A', 100, 3], ['two', 'A', 101, 4],
                   ['three', 'A', 102, 6], ['one', 'B', 103, 6],
                   ['two', 'B', 104, 0], ['three', 'B', 105, 3]],
                  columns=['c1', 'c2', 'c3', 'c4']).set_index(['c1', 'c2']).sort_index()
print(df)

def highlight_min(data):
    color = 'red'
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_min = data == data.min()
        return [attr if v else '' for v in is_min]
    else:
        is_min = data.groupby(level=0).transform('min') == data
        return pd.DataFrame(np.where(is_min, attr, ''),
                            index=data.index, columns=data.columns)

# apply the styling function (note: .style.apply, not DataFrame.apply)
df = df.style.apply(highlight_min, axis=0)
df
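Since the update asks for a tag that survives the CSV export rather than styling, a minimal sketch reusing the duplicated-index trick from the question's own code could be:
# after the repeat step, an adjusted row and its inserted copy share an
# index label, so duplicated(keep=False) flags both of them
out['adjusted'] = out.index.duplicated(keep=False)
out.to_csv('Output.csv', index=False)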

How to transpose values from top few rows in python dataframe into new columns

I am trying to select the values from the top 3 records of each group in a sorted pandas DataFrame and put them into new columns. I have a function that processes each group, but I am having difficulty finding the right method to extract and rename the series, then combine the result as a single series to return.
Below is a simplified example of an input dataframe (df_in) and the expected output (df_out):
import pandas as pd

data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Price', 'Qty'])
Below are two examples of the functions I've tested; I am trying to find a more efficient option that works, especially if I have to process many more columns and records.
Function best3_prices_v1 works, but I have to explicitly specify each column or variable, which becomes an issue as I add more columns.
def best3_prices_v1(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    # 1st best
    d['Price_1'] = best_price_lv1['Price']
    d['Qty_1'] = best_price_lv1['Qty']
    # 2nd best
    d['Price_2'] = best_price_lv2['Price']
    d['Qty_2'] = best_price_lv2['Qty']
    # 3rd best
    d['Price_3'] = best_price_lv3['Price']
    d['Qty_3'] = best_price_lv3['Qty']
    # return combined results as a series
    return pd.Series(d, index=['Price_1', 'Qty_1', 'Price_2', 'Qty_2', 'Price_3', 'Qty_3'])
Code to call the function:
# sort dataframe by Product and Price
df_in.sort_values(by=['Product', 'Price'], ascending=True, inplace=True)
# get best 3 prices and qty as new columns
df_out = df_in.groupby(['Product']).apply(best3_prices_v1).reset_index()
My second attempt tries to reduce the code and the explicit names for each variable, but it is not complete and not working.
def best3_prices_v2(x):
    d = {}
    # get best 3 records if records available, else set volumes as zeroes
    best_price_lv1 = x.iloc[0].copy()
    rec_with_zeroes = best_price_lv1.copy()
    rec_with_zeroes['Price'] = 0
    rec_with_zeroes['Qty'] = 0
    recs = len(x)  # number of records
    if recs == 1:
        # 2nd and 3rd records not available
        best_price_lv2 = rec_with_zeroes.copy()
        best_price_lv3 = rec_with_zeroes.copy()
    elif recs == 2:
        best_price_lv2 = x.iloc[1]
        # 3rd record not available
        best_price_lv3 = rec_with_zeroes.copy()
    else:
        best_price_lv2 = x.iloc[1]
        best_price_lv3 = x.iloc[2]
    stats_columns = ['Price', 'Qty']
    # get record values for the best 3 prices
    d_lv1 = best_price_lv1[stats_columns]
    d_lv2 = best_price_lv2[stats_columns]
    d_lv3 = best_price_lv3[stats_columns]
    # How to rename (keys?) or combine values to return?
    lv1_stats_columns = [c + '_1' for c in stats_columns]
    lv2_stats_columns = [c + '_2' for c in stats_columns]
    lv3_stats_columns = [c + '_3' for c in stats_columns]
    # return combined results as a series
    return pd.Series(d, index=lv1_stats_columns + lv2_stats_columns + lv3_stats_columns)
Let's unstack():
df_in = (df_in.set_index([df_in.groupby('Product').cumcount().add(1), 'Product'])
              .unstack(0, fill_value=0))
df_in.columns = [f"{x}_{y}" for x, y in df_in.columns]
df_in = df_in.reset_index()
Or via pivot():
df_in = (df_in.assign(key=df_in.groupby('Product').cumcount().add(1))
              .pivot(index='Product', columns='key', values=['Price', 'Qty'])
              .fillna(0, downcast='infer'))
df_in.columns = [f"{x}_{y}" for x, y in df_in.columns]
df_in = df_in.reset_index()
Based on @AnuragDabas's pivot solution and @ceruler's feedback above, I can now expand it to a more general problem.
New dataframe with more groups and columns:
data_in = {'Product': ['A', 'A', 'A', 'A', 'B', 'C', 'C'],
           'Model': ['A1', 'A1', 'A1', 'A2', 'B1', 'C1', 'C1'],
           'Price': [25.0, 30.5, 50.0, 61.5, 120.0, 650.0, 680.0],
           'Qty': [15, 13, 14, 10, 5, 2, 1],
           'Ratings': [9, 7, 8, 10, 6, 7, 8]}
df_in = pd.DataFrame(data_in, columns=['Product', 'Model', 'Price', 'Qty', 'Ratings'])

group_list = ['Product', 'Model']
stats_list = ['Price', 'Qty', 'Ratings']

# keep only the top 3 records per group, then pivot to wide form
df_out = df_in.groupby(group_list).head(3)
df_out = (df_out.assign(key=df_out.groupby(group_list).cumcount().add(1))
                .pivot(index=group_list, columns='key', values=stats_list)
                .fillna(0, downcast='infer'))
df_out.columns = [f"{x}_{y}" for x, y in df_out.columns]
df_out = df_out.reset_index()
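As a quick sanity check (illustrative; the exact column order can vary by pandas version), the wide frame should have one row per (Product, Model) pair and suffixed stats columns:
print(df_out.columns.tolist())
# expected to look like:
# ['Product', 'Model', 'Price_1', 'Price_2', 'Price_3',
#  'Qty_1', 'Qty_2', 'Qty_3', 'Ratings_1', 'Ratings_2', 'Ratings_3']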

Optimising quartiling of columns of a pandas dataframe?

I have multiple columns in a data frame that have numerical data. I want to quartile each column, changing each value to either q1, q2, q3 or q4.
I currently loop through each column and change them using the pandas qcut function:
for column_name in df.columns:
    df[column_name] = pd.qcut(df[column_name].astype('float'), 4, ['q1', 'q2', 'q3', 'q4'])
This is very slow! Is there a faster way to do this?
I played around with the following example a little. It looks like converting to float from a string is what increases the time. A working example was not provided, though, so the original dtype can't be known. df[column].astype(copy=...) appears to perform similarly whether copying or not. Not much else to go after.
import pandas as pd
import numpy as np
import random
import time

random.seed(2)
indexes = [i for i in range(1, 10000) for _ in range(10)]
df = pd.DataFrame({'A': indexes,
                   'B': [str(random.randint(1, 99)) for e in indexes],
                   'C': [str(random.randint(1, 99)) for e in indexes],
                   'D': [str(random.randint(1, 99)) for e in indexes]})
#df = pd.DataFrame({'A': indexes,
#                   'B': [random.randint(1, 99) for e in indexes],
#                   'C': [random.randint(1, 99) for e in indexes],
#                   'D': [random.randint(1, 99) for e in indexes]})
df_result = pd.DataFrame({'A': indexes,
                          'B': [random.randint(1, 99) for e in indexes],
                          'C': [random.randint(1, 99) for e in indexes],
                          'D': [random.randint(1, 99) for e in indexes]})

def qcut(copy, x):
    for i, column_name in enumerate(df.columns):
        s = pd.qcut(df[column_name].astype('float', copy=copy), 4, ['q1', 'q2', 'q3', 'q4'])
        df_result["col %d %d" % (x, i)] = s.values

# time the copy=True runs
times = []
for x in range(0, 10):
    a = time.perf_counter()
    qcut(True, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))

# time the copy=False runs (reset timings so the means are comparable)
times = []
for x in range(10, 20):
    a = time.perf_counter()
    qcut(False, x)
    b = time.perf_counter()
    times.append(b - a)
print(np.mean(times))
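For what it's worth, a possibly faster alternative is to convert the strings once and then derive quartiles from per-column percentile ranks instead of calling pd.qcut column by column. This is only a sketch, and it is an approximation of qcut: ties and bin edges are handled differently, so results can disagree on duplicated values.
import numpy as np
import pandas as pd

# assumption: every column holds numeric values or numeric strings
numeric = df.apply(pd.to_numeric)

# rank(pct=True) gives each value its percentile within its column;
# ceil(pct * 4) maps those percentiles to quartile numbers 1..4
quartiles = np.ceil(numeric.rank(pct=True) * 4).astype(int)
df_q = quartiles.replace({1: 'q1', 2: 'q2', 3: 'q3', 4: 'q4'})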

Ordering data by index after importing dictionary of dictionaries into DataFrame

I am trying to reorder a DataFrame by index. The DataFrame was created from a dictionary of dictionaries, and I am trying to use DataFrame.sort_values. However, the sorting makes absolutely no difference when I print the DataFrame.
The following code exemplifies what I am trying to achieve:
import pandas as pd
# Metrics dictionary keys
GOLD_CNT_KEY = 'Gold_Cnt'
PRED_CNT_KEY = 'Pred_Cnt'
NER_INTERSEC_CNT_KEY = 'NER_Intersec_Cnt'
NER_PREC_KEY = 'NER_Precision'
NER_REC_KEY = 'NER_Recall'
NER_F1_KEY = 'NER_F1'
NERC_INTERSEC_CNT_KEY = 'NERC_Intersec_Cnt'
NERC_PREC_KEY = 'NERC_Precision'
NERC_REC_KEY = 'NERC_Recall'
NERC_F1_KEY = 'NERC_F1'
tag_classes = ['X', 'Y', 'Z', 'V']
def get_empty_stats_dict():
    """Dictionary for the counts and metrics"""
    return {GOLD_CNT_KEY: 0,
            PRED_CNT_KEY: 0,
            NER_INTERSEC_CNT_KEY: 0,
            NER_PREC_KEY: 0,
            NER_REC_KEY: 0,
            NER_F1_KEY: 0,
            NERC_INTERSEC_CNT_KEY: 0,
            NERC_PREC_KEY: 0,
            NERC_REC_KEY: 0,
            NERC_F1_KEY: 0}

stats = {}
for tag_class in tag_classes:
    stats.update({tag_class: get_empty_stats_dict()})

# I want to order by these indexes
index_order = [GOLD_CNT_KEY, PRED_CNT_KEY, NER_INTERSEC_CNT_KEY, NERC_INTERSEC_CNT_KEY,
               NER_PREC_KEY, NERC_PREC_KEY, NER_REC_KEY, NERC_REC_KEY, NER_F1_KEY, NERC_F1_KEY]

_stats = pd.DataFrame(stats)

# These two prints yield exactly the same output. Why doesn't sort_values make any difference?
print(_stats)
print(_stats.sort_values(index_order, axis=1))
Try: _stats = _stats.reindex(index_order)
sort_values sorts rows or columns by the values they contain, not into a custom label order; reindex is what imposes an explicit order on the index.
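A minimal demonstration with the example above:
# reindex reorders the rows to match the given list of labels
_stats = _stats.reindex(index_order)
print(_stats)  # rows now appear in the order given by index_order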

searching in a pandas df that contains ranges

I have a pandas df that contains 2 columns 'start' and 'end' (both are integers). I would like an efficient method to search for rows such that the range that is represented by the row [start,end] contains a specific value.
Two additional notes:
It is possible to assume that ranges don't overlap
The solution should support a batch mode - that given a list of inputs, the output will be a mapping (dictionary or whatever) to the row indices that contain the matching range.
For example:
start end
0 7216 7342
1 7343 7343
2 7344 7471
3 7472 8239
4 8240 8495
and the query of
[7215,7217,7344]
will result in
{7217: 0, 7344: 2}
Thanks!
Brute force solution, could use lots of improvements though.
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
search = [7215, 7217, 7344]
res = {}
for i in search:
    # rows whose [start, end] range contains i
    mask = (df.start <= i) & (df.end >= i)
    idx = df[mask].index.values
    if len(idx):
        res[i] = idx[0]
print(res)
Yields
{7344: 2, 7217: 0}
Selected solution
This new solution should have better performance. But there is a limitation: it only works if there is no gap between ranges, as in the example provided.
# Test data
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start', 'end'])
query = [7215, 7217, 7344]

# Reshaping the original DataFrame
df = df.reset_index()
df = pd.concat([df['start'], df['end']]).reset_index()
df = df.set_index(0).sort_index()

# Creating a DataFrame with a continuous index
max_range = max(df.index) + 1
min_range = min(df.index)
s = pd.DataFrame(index=range(min_range, max_range))

# Joining them
s = s.join(df)

# Filling the gaps
s = s.bfill()

# Then a simple selection gives the result (reindex tolerates query
# values that fall outside the continuous index, unlike .loc)
s.reindex(query).dropna().to_dict()['index']

# Result
{7217: 0.0, 7344: 2.0}
Previous proposal
# Test data
import numpy as np
df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]}, columns=['start', 'end'])

# Constructing a DataFrame containing the query numbers
query = [7215, 7217, 7344]
result = pd.DataFrame(np.tile(query, (len(df), 1)), columns=query)

# Merging the data and the query
df = pd.concat([df, result], axis=1)

# Making the test
df = df.apply(lambda x: (x >= x['start']) & (x <= x['end']), axis=1).loc[:, query]

# Keeping only values found
df = df[df == True]
df = df.dropna(how='all', axis=0).dropna(how='all', axis=1)

# Extracting to the output format
result = df.to_dict('split')
result = dict(zip(result['columns'], result['index']))

# The result
{7217: 0, 7344: 2}
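Another option worth noting for non-overlapping ranges is pd.IntervalIndex, which is built for exactly this kind of containment lookup. A minimal sketch (get_indexer returns -1 for values that fall in no interval):
import pandas as pd

df = pd.DataFrame({'start': [7216, 7343, 7344, 7472, 8240],
                   'end': [7342, 7343, 7471, 8239, 8495]})
query = [7215, 7217, 7344]

# build an interval index over the [start, end] ranges (both ends inclusive)
intervals = pd.IntervalIndex.from_arrays(df['start'], df['end'], closed='both')

# map each query value to the position of its containing interval
positions = intervals.get_indexer(query)
res = {q: int(p) for q, p in zip(query, positions) if p != -1}
print(res)  # {7217: 0, 7344: 2}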
