I'm new to pandas and am trying to learn how to process my multi-dimensional data.
My data
Let's assume my data is a big CSV with the columns ['A', 'B', 'C', 'D', 'E', 'F', 'G']. This data describes some simulation results, where ['A', 'B', ..., 'F'] are simulation parameters and 'G' is one of the outputs (the only output in this example!).
EDIT / UPDATE:
As Boud suggested in the comments, let's generate some data that is compatible with mine:
import pandas as pd
import itertools
import numpy as np
npData = np.zeros(5000, dtype=[('A','i4'),('B','f4'),('C','i4'), ('D', 'i4'), ('E', 'f4'), ('F', 'i4'), ('G', 'f4')])
A = [0,1,2,3,6] # param A: int
B = [1000.0, 10.000] # param B: float
C = [100,150,200,250,300] # param C: int
D = [10,15,20,25,30] # param D: int
E = [0.1, 0.3] # param E: float
F = [0,1,2,3,4,5,6,7,8,9] # param F = random-seed = int -> 10 runs per scenario
# some beta-distribution parameters for randomizing the results in column "G"
aDistParams = [(6, 1),
               (5, 2),
               (4, 3),
               (3, 4),
               (2, 5),
               (1, 6),
               (1, 7)]
counter = 0
for i in itertools.product(A, B, C, D, E, F):
    npData[counter]['A'] = i[0]
    npData[counter]['B'] = i[1]
    npData[counter]['C'] = i[2]
    npData[counter]['D'] = i[3]
    npData[counter]['E'] = i[4]
    npData[counter]['F'] = i[5]
    np.random.seed(i[5])  # seed must be called as a function; assigning to np.random.seed silently does nothing
    npData[counter]['G'] = np.random.beta(a=aDistParams[i[0]][0], b=aDistParams[i[0]][1])
    counter += 1
data = pd.DataFrame(npData)
data = data.reindex(np.random.permutation(data.index)) # shuffle rows because my original data doesn't give any guarantees
Because the parameters ['A', 'B', ..., 'F'] are generated as a Cartesian product (meaning: nested for-loops, known a priori), I want to use groupby to obtain each possible 'simulation scenario' before analysing the output.
The parameter 'F' describes multiple runs for each scenario (each scenario is defined by 'A', 'B', ..., 'E'; let's assume that 'F' is the random seed), so my code becomes:
grouped = data.groupby(['A','B','C','D','E'])
# -> every group defines one simulation scenario
grouped_agg = grouped.agg({'G': np.mean})
# -> the mean of the simulation output in 'G' over 'F' is calculated for each group/scenario
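Since 'F' takes 10 values, every scenario group should contain exactly 10 rows; a quick sanity check on the generated data (a sketch):
sizes = data.groupby(['A', 'B', 'C', 'D', 'E']).size()
assert (sizes == 10).all()  # 10 runs per scenario, one per value of 'F'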
What do I want to do now?
I: display all the (unique) values of each scenario parameter within these groups -> grouped_agg gives me an iterable of tuples where, for example, the entries at position 0 are all the values for 'A' (so with a few lines of Python I could collect the unique values, but maybe there is a function for that)
Update: my approach
list(set(grouped_agg.index.get_level_values('A'))) (when interested in 'A'; set gives the unique values; probably not what you want if you need high performance)
=> [0, 1, 2, 3, 6]
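For what it's worth, recent pandas versions can do this directly on the index; a minimal sketch (the comment shows the expected result):
unique_A = grouped_agg.index.get_level_values('A').unique()
# -> the unique level values [0, 1, 2, 3, 6], without the list/set round-trip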
II: generate some plots (of lower dimension) -> I need to make some variables constant and filter/select my data before plotting (this is why step I is needed) =>
'B' const
'C' const
'E' const
'D' = x-axis
'G' = y-axis / output from my aggregation
'A' = one more dimension = multiple colors within 2d-plot -> one G/y-axis for each value of 'A'
How would I generate a plot like that?
I think reshaping my data is the key step, and pandas' plotting capabilities will then handle the rest. Maybe a shape with 5 columns (one for each value of parameter 'A') and the corresponding G-values for each index selection plus param-A selection would be enough, but I wasn't able to achieve that form yet.
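Update: a reshaping sketch along those lines, holding 'B', 'C', 'E' at example values taken from the generated data (any other combination works the same way):
flat = grouped_agg.reset_index()
subset = flat[(flat['B'] == 1000.0) & (flat['C'] == 100) & (flat['E'] == 0.1)]
wide = subset.pivot(index='D', columns='A', values='G')  # one column of mean 'G' per value of 'A'
wide.plot()  # 'D' on the x-axis, one line per value of 'A'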
Thanks for your input!
(I'm using pandas 0.12 within Enthought Canopy)
Sascha
I: If I understand your example and desired output, I don't see why grouping is necessary.
data.A.unique()
II: Updated....
I will implement the example you sketch above. Assume that we have averaged 'G' over the random seed ('F') like so:
data = data.groupby(['A','B','C','D','E']).agg({'G': np.mean}).reset_index()
Start by selecting the rows where B, C, and E have some constant values that you specify.
df1 = data[(data['B'] == const1) & (data['C'] == const2) & (data['E'] == const3)]
Now we want to plot 'G' as a function of 'D', with a different color for every value of 'A'.
df1.set_index('D').groupby('A')['G'].plot(legend=True)
I tested the above on some dummy data, and it works as you describe. The range of 'G' corresponding to each 'A' is plotted in a distinct color on the same axes.
III: I don't know how to answer that broad question.
IV: No, I don't think that's an issue for you here.
I suggest playing with simpler, small data sets and getting more familiar with pandas.
Related
I want to search for a value in a 2d array and get the value of the corresponding "pair".
In this example I want to search for 'd' and get 14.
I tried it with np.where without success and ended up with this crap code; does someone have a smarter solution?
import numpy as np
ar=[[11,'a'],[12,'b'],[13,'c'],[14,'d']]
arr = np.array(ar)
x = np.where(arr == 'd')
print(x)
print("x[0]:"+str(x[0]))
print("x[1]:"+str(x[1]))
a = str(x[0]).replace("[", "")
a = a.replace("]", "")
a = int(a)
print(a)
b = str(x[1]).replace("[", "")
b = b.replace("]", "")
b = int(b) - 1
print(b)
print(ar[a][b])
# got 14
So you want to look up a key and get a value?
It feels like you need a dict!
>>> ar=[[11,'a'],[12,'b'],[13,'c'],[14,'d']]
>>> d = dict([(k,v) for v,k in ar])
>>> d
{'a': 11, 'b': 12, 'c': 13, 'd': 14}
>>> d['d']
14
Use a dict, simple and straightforward:
dct = {k:v for v,k in ar}
dct['d']
If you are hell-bent on using np.where, then you can use this:
import numpy as np
ar = np.array([[11,'a'],[12,'b'],[13,'c'],[14,'d']])
i = np.where(ar[:,1] == 'd')[0][0]
result = ar[i, 0]
I didn't know about np.where! Its docstring mentions using nonzero directly, so here's a code snippet that uses that to print the rows matching your requirement. Note that I add another row with 'd' to show it works for the general case where multiple rows match the condition:
ar=[[11,'a'],[12,'b'],[13,'c'],[14,'d'],[15,'e'],[16,'d']]
arr = np.array(ar)
rows = arr[(arr=='d').nonzero()[0], :]
# array([['14', 'd'],
# ['16', 'd']], dtype='<U21')
This works because nonzero (or where) returns a tuple of row/column indexes of the match. So we just use the first entry in the tuple (an array of row indexes) to index the array row-wise and ask Numpy for all columns (:). This makes the code a bit fragile if you move to 3D or higher dimensions, so beware.
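To make the tuple structure concrete, a small sketch of what nonzero returns for the six-row array above:
idx = (arr == 'd').nonzero()
# idx is a tuple of index arrays: (array([3, 5]), array([1, 1]))
# idx[0] holds the matching row indexes, idx[1] the matching column indexes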
This is assuming you really do intend to use Numpy! A dict is better for many reasons.
In the following code, what is the right way to print the y ticks as E 40% instead of ('E', 40, '%'), i.e. with no brackets, no quotation marks, and no space between 40 and %?
from matplotlib import pyplot as plt
a = [20, 40, 60, 120, 160] # number of samples of each classes
b = ['A', 'B', 'C', 'D', 'E'] # different types of classes
c = [5, 10, 15, 30, 40] # percentage of samples in each classes
d = ['%' for i in range(0, len(a))]
e = list(zip(b, c, d))
plt.barh(b, a)
plt.xlabel('Number of Samples')
plt.ylabel('Different Classes')
plt.yticks(ticks = b, labels = e)
plt.show()
You're creating a tuple of the elements in b, c, and d with the list(zip(..)) call. What you should do instead is combine them into a single string, which can easily be done in a list comprehension:
e = ['{:s} {:d}%'.format(b[i],c[i]) for i in range(len(a))]
To explain: this creates a string of the form "string value%" using the values in your lists b and c. It also removes the need for the d list of "%" signs, which is redundant since you use a percent sign in every label (if you used different symbols per entry it might be necessary, however, and you could simply add another {:s} field with the formatted d[i] value to the string above).
Substituting the new list into your current code produces the labels you want.
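For completeness, a sketch of the full script with the new label list substituted in:
from matplotlib import pyplot as plt
a = [20, 40, 60, 120, 160]  # number of samples in each class
b = ['A', 'B', 'C', 'D', 'E']  # class names
c = [5, 10, 15, 30, 40]  # percentage of samples in each class
e = ['{:s} {:d}%'.format(b[i], c[i]) for i in range(len(a))]
plt.barh(b, a)
plt.xlabel('Number of Samples')
plt.ylabel('Different Classes')
plt.yticks(ticks=b, labels=e)
plt.show()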
I have a dictionary whose keys are tuples of word pairs and whose values are probabilities, for example
d = {('a','b'): 0.5, ('b', 'c'): 0.5, ('a', 'd'): 0.25, ...} and so forth, where every word appears in a tuple with every other word. So, for example, if there are 4 words in total, the length of the dictionary would be 16.
I am trying to put the values in a numpy array, in the format
      a   b   c   d
  a
  b
  c
  d
However, I am having a hard time doing this. Any help would be appreciated. Thanks in advance!
The easiest way to think about this is that your letters/words are indices. You want to convert the letter a to the index 0 and the letter b to the index 1.
With that in mind, a simple way to do this is to use the index method of a list. For example, if we have a list with unique values like x = ['cat', 'dog', 'panda'] we can do x.index('dog') to get the index 1 where 'dog' occurs in the list x.
With that in mind, let's generate some data similar to what you described:
import numpy as np
# I'm going to cheat a bit and use some numpy tricks to generate the data
x = np.random.random((5, 5))
values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
your_data = {}
for (i, j), prob in np.ndenumerate(x):
    your_data[(values[i], values[j])] = prob
print(your_data)
This gives something like:
{('alpha', 'beta'): 0.8066925762434737, ('alpha', 'gamma'): 0.7778280007471104, ...}
So far, we've just generated some example data. Now let's do the inverse to solve your problem:
values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
output = np.zeros((len(values), len(values)), dtype=float)
for key in your_data:
    i = values.index(key[0])
    j = values.index(key[1])
    output[i, j] = your_data[key]
print(output)
That will give us a numpy array with values similar to what you described.
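As a design note, list.index is a linear scan on every lookup; a dict from word to index avoids that. A sketch of the same fill loop with a lookup table:
idx = {word: i for i, word in enumerate(values)}
output = np.zeros((len(values), len(values)), dtype=float)
for (w1, w2), prob in your_data.items():
    output[idx[w1], idx[w2]] = prob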
I am trying to extract values in one column based on another column in pandas.
For example, suppose I have 2 columns in a dataframe as below:
>>> check
child parent
0 b a
1 c a
2 d b
3 e d
Now I want to extract all values in column "child" for a given value in column "parent".
My initial value can differ; for now, suppose it is "a" in column "parent".
The length of the dataframe might also differ.
I tried the code below, but it does not work once there are more matching values and the dataframe gets longer:
check = pd.read_csv("Book2.csv", encoding='cp1252')
new = (check.loc[check['parent'] == 'a', 'child']).tolist()
a = []
a.append(new)
for i in range(len(new)):
    new1 = (check.loc[check['parent'] == new[i], 'child']).tolist()
    if len(new1) > 0:
        a.append(new1)
    for i in range(len(new1)):
        new2 = (check.loc[check['parent'] == new1[i], 'child']).tolist()
        if len(new1) > 0:
            a.append(new2)
flat_list = [item for sublist in a for item in sublist]
>>> flat_list
['b', 'c', 'd', 'e']
Is there any efficient way to get the desired results? It would be a great help. Please advise.
Recursion is one way to do it. Supposing that check is your dataframe, define a recursive function:
final = []  # empty list used to store all results

def getchilds(df, res, value):
    where = df['parent'].isin([value])  # check rows where parent equals value
    newvals = list(df['child'].loc[where])  # get the corresponding child values
    if len(newvals) > 0:
        res.extend(newvals)
    for i in newvals:  # recursive calls using the child values
        getchilds(df, res, i)

getchilds(check, final, 'a')
print(final)
print(final) prints ['b', 'c', 'd', 'e'] if check is your example.
This works if you do not have cyclic calls, like 'b' is child of 'a' and 'a' is child of 'b'. If this is the case, you need to add further checks to prevent infinite recursion.
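One possible guard, sketched here as an assumption rather than part of the original answer, is to carry a set of already-visited values and stop when a value repeats:
def getchilds_safe(df, res, value, seen=None):
    # track visited values to avoid infinite recursion on cyclic data
    if seen is None:
        seen = set()
    if value in seen:
        return
    seen.add(value)
    newvals = list(df['child'].loc[df['parent'].isin([value])])
    res.extend(newvals)
    for i in newvals:
        getchilds_safe(df, res, i, seen)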
out_dict = {}
for v in pd.unique(check['parent']):
    out_dict[v] = list(pd.unique(check['child'][check['parent'] == v]))
Inspecting out_dict then gives:
{'a': ['b', 'c'], 'b': ['d'], 'd': ['e']}
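This builds a parent-to-children adjacency dict rather than a flat list; one could then walk it iteratively to collect all descendants of 'a', for example with this sketch:
stack = ['a']
descendants = []
while stack:
    node = stack.pop()
    for child in out_dict.get(node, []):
        descendants.append(child)
        stack.append(child)
print(descendants)  # ['b', 'c', 'd', 'e'] for the example above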
Let me just make a guess and say you want to get all the values of column "child" where the "parent" value is x.
import pandas as pd

def get_x_values_of_y(comparison_val, df, val_type="get_parent"):
    val_to_be_found = ["child", "parent"][val_type == "get_parent"]
    val_existing = ["child", "parent"][val_type != "get_parent"]
    mask_value = df[val_existing] == comparison_val  # was hard-coded to "a"; use the argument instead
    to_be_found_column = df[mask_value][val_to_be_found]
    unique_results = to_be_found_column.unique().tolist()
    return unique_results

check = pd.read_csv("Book2.csv", encoding='cp1252')

# to get all parents of child "a"
print(get_x_values_of_y("a", check))

# to get all children of parent "b"
print(get_x_values_of_y("b", check, val_type="get_child"))

# to get all parents of every child
list_of_all_children = check["child"].unique().tolist()
for each_child in list_of_all_children:
    print(get_x_values_of_y(each_child, check))

# to get all children of every parent
list_of_all_parents = check["parent"].unique().tolist()
for each_parent in list_of_all_parents:
    print(get_x_values_of_y(each_parent, check, val_type="get_child"))
Hope this solves your problem.
I have a list like this with about 141 entries:
training = [40.0,49.0,77.0,...... 3122.0]
and my goal is to select the first 20% of the list. I did it like this:
testfile_first20 = training[0:int(len(set(training))*0.2)]
testfile_second20 = training[int(len(set(training))*0.2):int(len(set(training))*0.4)]
testfile_third20 = training[int(len(set(training))*0.4):int(len(set(training))*0.6)]
testfile_fourth20 = training[int(len(set(training))*0.6):int(len(set(training))*0.8)]
testfile_fifth20 = training[int(len(set(training))*0.8):]
Is there any way to do this automatically in a loop? This is my way of selecting the K folds.
Thank you.
You can use list comprehensions:
div_length = int(0.2*len(set(training)))
testfile_divisions = [training[i*div_length:(i+1)*div_length] for i in range(5)]
This will give you your results stacked in a list:
>>> [testfile_first20, testfile_second20, testfile_third20, testfile_fourth20, testfile_fifth20]
If len(training) does not divide equally into five parts, then you can either have five full divisions with a sixth taking the remainder as follows:
import math
div_length = math.floor(0.2*len(set(training)))
testfile_divisions = [training[i*div_length:min(len(training), (i+1)*div_length)] for i in range(6)]
or you can have four full divisions with the fifth taking the remainder as follows:
import math
div_length = math.ceil(0.2*len(set(training)))
testfile_divisions = [training[i*div_length:min(len(training), (i+1)*div_length)] for i in range(5)]
Here's a simple take with a list comprehension:
lst = list('abcdefghijkl')
l = len(lst)
[lst[i:i+l//5] for i in range(0, l, l//5)]
# [['a', 'b'],
# ['c', 'd'],
# ['e', 'f'],
# ['g', 'h'],
# ['i', 'j'],
# ['k', 'l']]
Edit: Actually, now that I look at my answer, it's not a true 20% split, as it returns 6 sublists instead of 5. What is expected to happen when the list cannot be divided equally into 5 parts? I'll leave this up for now until further clarification is given.
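A variant that always produces exactly 5 sublists, letting the chunk sizes differ by at most one element (a sketch):
lst = list('abcdefghijkl')
k = 5
chunks = [lst[i * len(lst) // k:(i + 1) * len(lst) // k] for i in range(k)]
# [['a', 'b'],
#  ['c', 'd'],
#  ['e', 'f', 'g'],
#  ['h', 'i'],
#  ['j', 'k', 'l']]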
You can loop this by storing the "size" of 20% and the current starting point in two variables, then adding one to the other:
start = 0
twenty_pct = len(training) // 5
parts = []
for k in range(5):
    parts.append(training[start:start+twenty_pct])
    start += twenty_pct
However, I suspect there are numpy/pandas/scipy operations that might be a better match for what you want. For example, sklearn includes a function called KFold: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html
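A minimal KFold sketch (assuming scikit-learn is installed; note that split yields index arrays, not the data itself):
import numpy as np
from sklearn.model_selection import KFold

data = np.array(training)
kf = KFold(n_splits=5)
for train_idx, test_idx in kf.split(data):
    test_fold = data[test_idx]  # roughly 20% of the data per fold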
Something like this, though you may lose an element due to rounding:
tlen = float(len(training))
testfiles = [ training[ int(i*0.2*tlen): int((i+1)*0.2*tlen) ] for i in range(5) ]