Pandas conditionally creating a new dataframe using another - python

I have a list:
orig = [2, 3, 4, -5, -6, -7]
I want to create another list in which the entries corresponding to positive values above hold the sum of the positives, and those corresponding to negative values hold the absolute value of the sum of the negatives. So the desired output is:
final = [9, 9, 9, 18, 18, 18]
I am doing this:
raw = pd.DataFrame(orig, columns =['raw'])
raw
raw
0 2
1 3
2 4
3 -5
4 -6
5 -7
sum_pos = raw[raw> 0].sum()
sum_neg = -1*raw[raw < 0].sum()
final = pd.DataFrame(index = raw.index, columns = ['final'])
final
final
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
final.loc[raw >0, 'final'] = sum_pos
KeyError: "[('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w')\n ('r', 'a', 'w') ('r', 'a', 'w')] not in index"
So basically I was trying to create an empty dataframe like raw and then fill it conditionally. However, the above method fails.
Even when I try to create a new column instead of a new df, it fails:
raw.loc[raw>0, 'final']= sum_pos
KeyError: "[('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w') ('r', 'a', 'w')\n ('r', 'a', 'w') ('r', 'a', 'w')] not in index"
The best solution I've found so far is this:
pd.DataFrame(np.where(raw>0, sum_pos, sum_neg), index= raw.index, columns=['final'])
final
0 9.0
1 9.0
2 9.0
3 18.0
4 18.0
5 18.0
However, I don't understand what is wrong with the other approaches. Is there something I am missing here?

You can try grouping on np.sign, then sum and abs:
s = pd.Series(orig)
s.groupby(np.sign(s)).transform('sum').abs().tolist()
Output:
[9, 9, 9, 18, 18, 18]
You're not aligning indexes. sum_pos is a Series with a single element whose index label is 'raw', and you are trying to assign that Series to a part of the dataframe that doesn't have 'raw' in its index. Note also that the row mask passed to .loc should be a boolean Series (raw['raw'] > 0) rather than the boolean DataFrame raw > 0.
Pandas does almost everything using index alignment. To do this properly you need to extract the values from the sum_pos Series:
final.loc[raw['raw'] > 0, 'final'] = sum_pos.values
print(final)
Output:
final
0 9.0
1 9.0
2 9.0
3 NaN
4 NaN
5 NaN
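To fill both halves with .loc, one option is to reduce the sums to plain scalars so that no index alignment takes place. This is a minimal sketch, assuming the orig list from the question; the name is_pos is introduced here only for illustration:

```python
import pandas as pd

orig = [2, 3, 4, -5, -6, -7]
raw = pd.DataFrame(orig, columns=['raw'])

# Boolean Series (one value per row), which .loc accepts as a row mask
is_pos = raw['raw'] > 0

# Summing a single column gives a scalar, so nothing has to be aligned
sum_pos = raw.loc[is_pos, 'raw'].sum()
sum_neg = -raw.loc[~is_pos, 'raw'].sum()

# Start from a column of zeros and fill each half conditionally.
# Note: ~is_pos would also select zeros; the example data has none.
final = pd.DataFrame({'final': 0}, index=raw.index)
final.loc[is_pos, 'final'] = sum_pos
final.loc[~is_pos, 'final'] = sum_neg

print(final['final'].tolist())  # [9, 9, 9, 18, 18, 18]
```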

Related

Creating a union of columns based on metrics

I have a dataframe-
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
a b c
0 1 0 1
1 2 3 1
2 4 5 1
and a list [('a', 0.91), ('b', 5), ('c', 2)].
Now I want to iterate through each row, multiply each df element by the matching list element, select the top 2 scores, and build a new list of the corresponding column names.
For example, in the first row we have:
1*0.91=0.91, 0*5=0, 1*2=2
so the top 2 columns are a and c, and we append them to a new list.
Second row:
2*0.91=1.82, 3*5=15, 1*2=2
so list=[a,c,b]
and so on...
Third row:
4*0.91=3.64, 5*5=25, 1*2=2
so the list remains unchanged: [a,c,b]
So the final output is [a,c,b]
If I understand you correctly, I think the previous answers are incomplete, so here is a solution. It uses numpy, which I hope is acceptable.
Create the weights:
n = [('a', 0.91), ('b', 5), ('c', 2)]
d = { a:b for a,b in n}
weights = [d[i] for i in df.columns]
Then we create a table with weights multiplied in:
df = pd.DataFrame({'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]})
df = df*weights
This yields:
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Then we can get top two indices for this in numpy:
b = np.argsort(df.values,axis=1)
b = b[:,-2:]
This yields:
array([[0, 2],
[2, 1],
[0, 1]], dtype=int64)
Finally we can calculate the order of appearance and give back column names:
c =b.reshape(-1)
_, idx = np.unique(c, return_index=True)
d = c[np.sort(idx)]
print(list(df.columns[d].values))
This yields:
['a', 'c', 'b']
Try this:
dict1 = {'a':[1,2,4], 'b': [0,3,5],'c':[1,1,1]} # arrays must all be same length
df = pd.DataFrame(dict1)
list1 = [('a', 0.91), ('b', 5), ('c', 2)]
df2 = pd.DataFrame({k : [j*v[1] for j in dict1[k]] for k in dict1 for v in list1 if k == v[0]})
"""
df2 should be like this :
a b c
0 0.91 0 2
1 1.82 15 2
2 3.64 25 2
"""
IIUC, you need:
a = [('a', 0.91), ('b', 5), ('c', 2)]
m= df.mul(pd.DataFrame(a).set_index(0)[1])
a b c
0 0.91 0.0 2.0
1 1.82 15.0 2.0
2 3.64 25.0 2.0
Applying rank to each row, summing the ranks per column, then sorting and taking the index gives your desired output:
m.rank(axis=1,method='dense').sum().sort_values().index.tolist()
#['a', 'c', 'b']
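The row-by-row procedure described in the question can also be written out literally. This is only a sketch under the assumption that, within a row, new names are appended in column order (which reproduces the [a, c, b] result from the question); the names scored and result are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 4], 'b': [0, 3, 5], 'c': [1, 1, 1]})
weights = dict([('a', 0.91), ('b', 5), ('c', 2)])

# Multiply each column by its weight (the Series index aligns with df.columns)
scored = df * pd.Series(weights)

result = []
for _, row in scored.iterrows():
    # Names of the two highest-scoring columns in this row
    top = set(row.nlargest(2).index)
    # Append unseen names, walking the columns in their original order
    for name in scored.columns:
        if name in top and name not in result:
            result.append(name)

print(result)  # ['a', 'c', 'b']
```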

np.arange creates a null value matrix on resizing

The following is the code that I am using:
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
animals = DataFrame(np.arange(16).resize(4, 4), columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
The output I get for this is:
W X Y Z
Dog NaN NaN NaN NaN
Cat NaN NaN NaN NaN
Bird NaN NaN NaN NaN
Mouse NaN NaN NaN NaN
The output that I expect is:
W X Y Z
Dog 0 1 2 3
Cat 4 5 6 7
Bird 8 9 10 11
Mouse 12 13 14 15
However, if I run just:
print(np.arange(16))
the output I get is:
[0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Use reshape:
import numpy as np
import pandas as pd
animals = pd.DataFrame(np.arange(16).reshape(4, 4), columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
Or use numpy.resize():
np.resize(np.arange(16),(4, 4))
With numpy.resize() you need to pass the array as an argument:
import pandas as pd
animals = pd.DataFrame(np.resize(np.arange(16),(4, 4)), columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
ndarray.resize() operates in place and returns None, so resize the array first and then create the dataframe:
a=np.arange(16)
a.resize(4,4)
import pandas as pd
animals = pd.DataFrame(a, columns=['W', 'X', 'Y', 'Z'], index=['Dog', 'Cat', 'Bird', 'Mouse'])
print(animals)
From the docs for resize: "Change shape and size of array in-place."
Thus, your call to resize returns None.
You want reshape. As in np.arange(16).reshape(4, 4)
Just to add to the answer above, docs for resize:
ndarray.resize(new_shape, refcheck=True)
Change shape and size of array in-place.
Therefore, unlike reshape, resize doesn't create a new array. In fact np.arange(16).resize(4, 4) returns None, which is why you get the NaN values.
Using reshape returns a new array:
ndarray.reshape(shape, order='C')
Returns an array containing the same data with a new shape.
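The difference between the two calls can be checked directly. A small sketch (the variable names are illustrative):

```python
import numpy as np

a = np.arange(16)
returned = a.resize(4, 4)               # modifies a in place
reshaped = np.arange(16).reshape(4, 4)  # returns a new array object

print(returned)        # None
print(a.shape)         # (4, 4)
print(reshaped.shape)  # (4, 4)
```

Passing returned (that is, None) as the data for a DataFrame with an explicit index and columns is what produces the all-NaN table in the question.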

Add labels to Categorical Data in Dataframe

I am trying to convert survey data on the marital status which look as follows:
df['d11104'].value_counts()
[1] Married 1 250507
[2] Single 2 99131
[4] Divorced 4 32817
[3] Widowed 3 24839
[5] Separated 5 8098
[-1] keine Angabe 2571
Name: d11104, dtype: int64
So far, I did df['marstat'] = df['d11104'].cat.codes.astype('category'), yielding
df['marstat'].value_counts()
1 250507
2 99131
4 32817
3 24839
5 8098
0 2571
Name: marstat, dtype: int64
Now I'd like to add labels to the column marstat such that the numerical values are maintained, i.e. I'd like to identify people by the condition df['marstat'] == 1 while at the same time having the labels ['Married','Single','Divorced','Widowed'] attached to this variable. How can this be done?
EDIT: Thanks to jpp's answer, I simply created a new variable and defined the labels by hand:
df['marstat_lb'] = df['marstat'].map({1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'})
You can convert your result to a dataframe and include both the category code and name in the output.
A dictionary of category mapping can be extracted via enumerating the categories. Minimal example below.
import pandas as pd
df = pd.DataFrame({'A': ['M', 'M', 'S', 'D', 'W', 'M', 'M', 'S',
'S', 'S', 'M', 'W']}, dtype='category')
print(df.A.cat.categories)
# Index(['D', 'M', 'S', 'W'], dtype='object')
res = df.A.cat.codes.value_counts().to_frame('count')
cat_map = dict(enumerate(df.A.cat.categories))
res['A'] = res.index.map(cat_map.get)
print(res)
# count A
# 1 5 M
# 2 4 S
# 3 2 W
# 0 1 D
For example, you can access "M" in res by either res['A'] == 'M' or res.index == 1.
A more straightforward solution is to use value_counts and then add an extra column for the codes (on pandas 2.0 and later the column created by reset_index is named 'A' rather than 'index'):
res = df.A.value_counts().to_frame('count').reset_index()
res['code'] = res['index'].cat.codes
index count code
0 M 5 1
1 S 4 2
2 W 2 3
3 D 1 0
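The approach from the question's EDIT, keeping the numeric column and adding a labelled companion column, might look like the sketch below. The sample values are made up for illustration; only the mapping is taken from the question:

```python
import pandas as pd

# Hypothetical sample of the numeric codes
df = pd.DataFrame({'marstat': [1, 2, 4, 3, 5, 1]})

labels = {1: 'Married', 2: 'Single', 3: 'Widowed', 4: 'Divorced', 5: 'Separated'}

# Numeric codes stay untouched; labels live in a separate categorical column
df['marstat_lb'] = df['marstat'].map(labels).astype('category')

# Selection still works on the numeric value
married = df[df['marstat'] == 1]
print(married)
```

Codes missing from the mapping (such as 0 for "keine Angabe") would become NaN in marstat_lb.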

How to select values from pandas dataframe by column value

I am doing an analysis of a dataset with 6 classes, zero based. The dataset is many thousands of items long.
I need two dataframes with classes 0 & 1 for the first data set and 3 & 5 for the second.
I can get 0 & 1 together easily enough:
mnist_01 = mnist.loc[mnist['class']<= 1]
However, I am not sure how to get classes 3 & 5... so what I would like to be able to do is:
mnist_35 = mnist.loc[mnist['class'] == (3 or 5)]
...rather than doing:
mnist_3 = mnist.loc[mnist['class'] == 3]
mnist_5 = mnist.loc[mnist['class'] == 5]
mnist_35 = pd.concat([mnist_3,mnist_5],axis=0)
You can use isin, which tests each value for membership in the collection you pass (a set works as well as a list):
mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
'val': ['a', 'b', 'c', 'd', 'e', 'f']})
>>> mnist.loc[mnist['class'].isin({3, 5})]
class val
3 3 d
5 5 f
>>> mnist.loc[mnist['class'].isin({0, 1})]
class val
0 0 a
1 1 b
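As for why the attempt in the question does not work: (3 or 5) is evaluated by Python before pandas sees it and yields 3, so the comparison only ever matches class 3. A short sketch comparing isin with the equivalent element-wise | expression, using the same toy frame as above:

```python
import pandas as pd

mnist = pd.DataFrame({'class': [0, 1, 2, 3, 4, 5],
                      'val': ['a', 'b', 'c', 'd', 'e', 'f']})

print(3 or 5)  # 3

# Two equivalent ways to select classes 3 and 5
by_isin = mnist.loc[mnist['class'].isin([3, 5])]
by_or = mnist.loc[(mnist['class'] == 3) | (mnist['class'] == 5)]

print(by_isin.equals(by_or))  # True
```

The parentheses around each comparison are required because | binds more tightly than ==.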

Pandas: for loop through columns

My data looks like:
SNP Name ss715583617 ss715592335 ss715591044 ss715598181
4 PI081762 T A A T
5 PI101404A T A A T
6 PI101404B T A A T
7 PI135624 T A A T
8 PI326581 T A A T
9 PI326582A T A A T
10 PI326582B T A A T
11 PI339732 T A A T
12 PI339735A T A A T
13 PI339735B T A A T
14 PI342618A T A A T
In reality I have a dataset of 50,000 columns of 479 rows. My objective is to go through each column with characters and convert the data to integers depending on which is the most abundant character.
Right now I have the data input, and I have more or less written the function I would like to use to analyze each column separately. However, I can't quite understand how to use a for loop or the apply function across all of the columns in the dataset. I would prefer not to hardcode the columns because I will have 40,000~50,000 columns to analyze.
My code so far is:
import pandas as pd
df = pd.read_csv("/home/dfreese/Desktop/testSNPtext", delimiter='\t')
df.head() # check that the file format fits
# ncol df
df2 = df.iloc[4:-1] # Select the rows you want to analyze in a subset df
print(df2)
My function:
def countAlleles(N):
    # N is just supposed to be the column; ideally, once I've optimized the function,
    # I need to analyze every column
    # Will hold the counts of each letter in the column
    letterCount = [0] * 6
    # This is a parallel array to know the order
    letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
    # Booleans to mark which one is the maximum
    TFlag = None
    AFlag = None
    GFlag = None
    CFlag = None
    HFlag = None
    UFlag = None
    # Loop through the column to count each letter
    for i in range(len(N)):  # How do I get index information of the column?
        if N[i] == 'T':  # If the element in the column is T
            letterCount[0] = letterCount[0] + 1
        elif N[i] == 'A':
            letterCount[1] = letterCount[1] + 1
        elif N[i] == 'G':
            letterCount[2] = letterCount[2] + 1
        elif N[i] == 'C':
            letterCount[3] = letterCount[3] + 1
        elif N[i] == 'H':
            letterCount[4] = letterCount[4] + 1
        else:
            letterCount[5] = letterCount[5] + 1
    max = letterCount[0]  # This will hold the value of the maximum
    mIndex = 0  # This holds the index position with the max value
    # Determine which one is max
    for i in range(len(letterCount)):
        if letterCount[i] > max:
            max = letterCount[i]
            mIndex = i
So I designed the function to input the column, in hopes to be able to iterate through all the columns of the dataframe. My main question is:
1) How would I pass each column as a parameter so that the for loop runs through the elements of that column?
My major source of confusion is how indexes are being used in pandas. I'm familiar with 2-dimensional array in C++ and Java and that is most of where my knowledge stems from.
I'm attempting to use the apply function:
df2 = df2.apply(countAlleles('ss715583617'), axis=2)
but it doesn't seem that my application is correct.
Updated answer: the dataframe is now analyzed and its values are replaced with ints according to the number of occurrences of each allele per column. The problem of ties remains: if two alleles occur equally often in a column, the assignment is not unique.
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = np.array(['T', 'A', 'G', 'C', 'H', 'U'])
for col in df:
    alleles = list()
    for allele in letterOrder:
        alleles.append(df[col].str.count(allele).sum())
    # alleles ordered from most to least frequent
    repl = letterOrder[np.argsort(alleles)][::-1]
    # directly replace chars by their rank
    for num, char in enumerate(repl):
        df[col] = df[col].replace(char, num+1)
print(df)
This will change the initial dataframe
ss1 ss2 ss3
0 T G C
1 T G H
2 T T C
3 G A H
to the new dataframe with ints sorted according to the number of occurrences:
ss1 ss2 ss3
0 1 1 2
1 1 1 1
2 1 3 2
3 2 2 1
For reference, here is the old answer, which gives the most frequent allele per column:
import pandas as pd
import numpy as np
from collections import OrderedDict
df = pd.DataFrame.from_dict({"ss1": ["T", "T", "T", "G"],
"ss2": ["G", "G", "T", "A"],
"ss3": ["C", "H", "C", "H"]})
letterOrder = ['T', 'A', 'G', 'C', 'H', 'U']
full_results = OrderedDict()
for col in df:
    alleles = list()
    for num, allele in enumerate(letterOrder):
        alleles.append(df[col].str.count(allele).sum())
    full_results[col] = [letterOrder[np.argmax(alleles)], np.max(alleles)]
print(full_results)
This will give:
OrderedDict([('ss1', ['T', 3]), ('ss2', ['G', 2]), ('ss3', ['C', 2])])
The key in the dict is the name of your column, and the value is a list with [allele, number_of_occurrences].
I used OrderedDict to keep the order of your columns and the name, but if you don't need the order, you can use a dict, or if you don't need the column name (and the implicit ID is enough), use a list.
But be careful: If in one column two (or more) characters have the same number of counts, this will only return one of them. You would need to add an additional test for this.
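One possible form of such a test, sketched here with value_counts rather than the counting loop above (the names counts and tied are illustrative): collect every letter whose count equals the maximum and check whether there is more than one.

```python
import pandas as pd

col = pd.Series(["C", "H", "C", "H"])  # same data as column ss3 above

counts = col.value_counts()
# All letters that share the highest count
tied = counts[counts == counts.max()].index.tolist()

if len(tied) > 1:
    print("tie between:", sorted(tied))
```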
To iterate over the columns in, e.g., a for loop, use list(df). That said, you can easily do what you are attempting with collections.Counter.
Assume a dataframe df:
df
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 T A A T
#1 PI101404A T A A T
#2 PI101404B T A A T
#3 PI135624 T A A T
#4 PI326581 T A F D
#5 PI326582A G A F T
#6 PI326582B G A A T
#7 PI339732 D H A T
#8 PI339735A D A A T
#9 PI339735B A A A T
#10 PI342618A D A A T
From what I gather from the comments section and your original post, you want to replace each character in each column according to its frequency of occurrence. This is one approach:
Make the Counters
from collections import Counter
cols = [ col for col in list(df) if col not in ['Name'] ] # all the column you want to operate on
col_counters = { col: Counter( df[col] ) for col in cols }
#{'ss715583617': Counter({'T': 5, 'D': 3, 'G': 2, 'A': 1}),
# 'ss715591044': Counter({'A': 9, 'F': 2}),
# 'ss715592335': Counter({'A': 10, 'H': 1}),
# 'ss715598181': Counter({'T': 10, 'D': 1})}
Sort the items in each Counter
sort_func = lambda items: sorted(items, key=lambda x:x[1], reverse=True ) # sort a nested list according to second element in each sublist
sort_result = { col: sort_func(counter.items()) for col,counter in col_counters.items() }
#{'ss715583617': [('T', 5), ('D', 3), ('G', 2), ('A', 1)],
# 'ss715591044': [('A', 9), ('F', 2)],
# 'ss715592335': [('A', 10), ('H', 1)],
# 'ss715598181': [('T', 10), ('D', 1)]}
Replace letters in dataframe according to sort result
Here we will use enumerate to get the position of each sort result
mapper = { col: {letter: i+1 for i, (letter, count) in enumerate(sort_result[col]) } for col in sort_result }
#{'ss715583617': {'A': 4, 'D': 2, 'G': 3, 'T': 1},
# 'ss715591044': {'A': 1, 'F': 2},
# 'ss715592335': {'A': 1, 'H': 2},
# 'ss715598181': {'D': 2, 'T': 1}}
df.replace( to_replace=mapper, inplace=True)
# Name ss715583617 ss715592335 ss715591044 ss715598181
#0 PI081762 1 1 1 1
#1 PI101404A 1 1 1 1
#2 PI101404B 1 1 1 1
#3 PI135624 1 1 1 1
#4 PI326581 1 1 2 2
#5 PI326582A 3 1 2 1
#6 PI326582B 3 1 1 1
#7 PI339732 2 2 1 1
#8 PI339735A 2 1 1 1
#9 PI339735B 4 1 1 1
#10 PI342618A 2 1 1 1
This should be enough to get you on your way. I am not sure how you want to handle ties, for instance if a column has the same number of T and G.
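To answer the original question about apply: DataFrame.apply passes each column to the function as a Series (with the default axis=0), and the function object itself is passed, not the result of calling it. A minimal sketch using value_counts instead of Counter; the function name rank_alleles is hypothetical:

```python
import pandas as pd

def rank_alleles(column):
    # value_counts sorts from most to least frequent
    counts = column.value_counts()
    # Most frequent letter -> 1, next -> 2, and so on
    ranks = {letter: i + 1 for i, letter in enumerate(counts.index)}
    return column.map(ranks)

df = pd.DataFrame({"ss1": ["T", "T", "T", "G"],
                   "ss2": ["G", "G", "T", "A"],
                   "ss3": ["C", "H", "C", "H"]})

# apply calls rank_alleles once per column
coded = df.apply(rank_alleles)
print(coded)
```

As with the other answers, letters that occur equally often in a column still receive distinct ranks, so ties need separate handling if they matter.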
