I have a bunch of data that was generated by a nested loop. The inner counter increases on every pass, but the outer counter only increments periodically. I want to plot the data against both counters. The inner counter works fine as the primary x-axis, but I cannot figure out how to show the outer counter on the plot so that it lines up correctly with the primary x-axis. I have tried the secondary_xaxis function, defining transforms that just do an inverse lookup on my dataframe. Unfortunately I get an error saying the transform functions are not the same length (ValueError: 'Lengths must match to compare').
ctr A  ctr B  val1          val2
1      1      -3.22E-03     -0.001010008
1      2      -3.21E-03     -0.002629743
1      3      -3.21E-03     -0.002210752
2      4      -3.21E-03     -0.002210752
2      5      -5.86E-03     -0.004594075
3      6      -0.003212838  -0.002210758
3      7      -0.003645778  -0.002577823
3      8      0.000129821   0.000223856
3      9      -6.06E-06     2.77E-05
4      10     6.05E-07      2.23E-05
def getCtrSub( x ):
    return df.loc[ 'ctrA' == x, 'ctrB' ].iloc[0]
def getCtrTop( x ):
    return df.loc[ 'ctrB' == x, 'ctrA' ].iloc[0]
figInd = plt.figure(); axInd=figInd.gca();
axInd.plot( ctrB, valA, label=ind )
axInd2 = axInd.secondary_xaxis('bottom', functions=(getCtrTop, getCtrSub) )
plt.show()
Unfortunately I can't make a picture exactly of what I want, but text-wise, the x-Axis would look something like this:
1 - 2 - 3 - 4 - 5 - 6 - 7 - 8 - 9 - 10
1 --------- 2 ------3 ---------------4
Any help would be greatly appreciated.
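One possible way around the transform problem, offered as a sketch rather than a definitive answer: secondary_xaxis expects a pair of functions that accept arrays and are inverses of each other, and a step-like outer counter has no inverse. The sketch below therefore leaves out the `functions` argument, so the secondary axis keeps the primary scale, and only relabels its ticks. The column names `ctrA`, `ctrB`, `val1`, the helper names `outer_ticks` and `plot_with_outer_counter`, and the offset `-0.2` are assumptions made for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Sample data copied from the table in the question
df = pd.DataFrame({
    "ctrA": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4],
    "ctrB": list(range(1, 11)),
    "val1": [-3.22e-03, -3.21e-03, -3.21e-03, -3.21e-03, -5.86e-03,
             -0.003212838, -0.003645778, 0.000129821, -6.06e-06, 6.05e-07],
})


def outer_ticks(frame):
    """Return (positions, labels): the first ctrB seen for each ctrA value."""
    first = frame.groupby("ctrA")["ctrB"].first()
    return first.tolist(), first.index.tolist()


def plot_with_outer_counter(frame):
    """Plot val1 against ctrB and add a second tick row showing ctrA."""
    fig, ax = plt.subplots()
    fig.subplots_adjust(bottom=0.3)  # leave room for the second tick row
    ax.plot(frame["ctrB"], frame["val1"])
    ax.set_xticks(frame["ctrB"].tolist())
    ax.set_xlabel("ctrB")

    # No `functions` argument: the secondary axis shares the primary scale.
    # A float location places it below the main axis (axes-relative units).
    sec = ax.secondary_xaxis(-0.2)
    positions, labels = outer_ticks(frame)
    sec.set_xticks(positions)
    sec.set_xticklabels([str(v) for v in labels])
    sec.set_xlabel("ctrA")
    return fig, sec


positions, labels = outer_ticks(df)
```

Calling `plot_with_outer_counter(df)` followed by `plt.show()` should draw the outer counter at ctrB = 1, 4, 6 and 10, which matches the text diagram in the question.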
I want to plot a bar graph from the dataframe below.
df2 = pd.DataFrame({'URL': ['A','A','B','B','C','C'],
'X': [5,0,7,1,0,0],
'Y': [4,0,4,7,9,0],
'Z':[11,0,8,4,0,0]})
URL X Y Z
0 A 5 4 11
1 A 0 0 0
2 B 7 4 8
3 B 1 7 4
4 C 0 9 0
5 C 0 0 0
I want to plot a bar graph with URL counts on the y-axis and X, Y, Z on the x-axis, with two bars for each column.
The first bar counts the non-zero values in each column; I have this working (code at the end).
The second bar should count the distinct URLs that have at least one non-zero value in the corresponding column. For example, A appears twice in URL and column X has one non-zero value for A, so A counts once. For B, both values in X are non-zero, so B also counts once. For C, both values in X are zero, so C is not counted. The same applies to Y and Z. I have managed to draw the first bar (code below) but not the second. Can anyone help? Thank you. Here is the code:
df2.melt("URL").\
groupby("variable").\
agg(Keywords_count=("value", lambda x: sum(x != 0)),
dup = ("URL", "nunique")).\
plot(kind="bar")
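A minimal sketch of one way to get both bars, assuming the second bar means "number of distinct URLs with at least one non-zero value in that column". The names `nonzero` and `counts` are introduced here for illustration only.

```python
import pandas as pd

df2 = pd.DataFrame({'URL': ['A', 'A', 'B', 'B', 'C', 'C'],
                    'X': [5, 0, 7, 1, 0, 0],
                    'Y': [4, 0, 4, 7, 9, 0],
                    'Z': [11, 0, 8, 4, 0, 0]})

# Boolean frame indexed by URL: True where the value is non-zero
nonzero = df2.set_index('URL').ne(0)

counts = pd.DataFrame({
    # First bar: number of non-zero entries per column
    'Keywords_count': nonzero.sum(),
    # Second bar: number of URLs with at least one non-zero entry per column
    'dup': nonzero.groupby(level='URL').any().sum(),
})
```

With matplotlib installed, `counts.plot(kind="bar")` should then draw the two bars side by side for X, Y and Z.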
I have a dataframe df that looks like this:
ID Sequence
0 A->A
1 C->C->A
2 C->B->A
3 B->A
4 A->C->A
5 A->C->C
6 A->C
7 A->C->C
8 B->B
9 C->C and so on ....
I want to create a column called 'Outcome', which is binomial in nature.
Its value essentially depends on three lists that I am generating from below
Whenever 'A' occurs in a sequence, probability of "Outcome" being 1 is 2%
Whenever 'B' occurs in a sequence, probability of "Outcome" being 1 is 6%
Whenever 'C' occurs in a sequence, probability of "Outcome" being 1 is 1%
so here is the code which is generating these 3 (bi_A, bi_B, bi_C) lists -
A=0.02
B=0.06
C=0.01
count_A=0
count_B=0
count_C=0
for i in range(0,len(df)):
    if('A' in df.sequence[i]):
        count_A+=1
    if('B' in df.sequence[i]):
        count_B+=1
    if('C' in df.sequence[i]):
        count_C+=1
bi_A = np.random.binomial(1, A, count_A)
bi_B = np.random.binomial(1, B, count_B)
bi_C = np.random.binomial(1, C, count_C)
What I am trying to do is combine these 3 lists into an "output" column so that the probability of Outcome being 1 when "A" is in the sequence is 2%, and so on. How do I solve this? As I understand it, there will be overlap, where bi_A says a sequence is 0 and bi_B says it is 1, so how should that be resolved?
End data should look like -
ID Sequence Output
0 A->A 0
1 C->C->A 1
2 C->B->A 0
3 B->A 0
4 A->C->A 0
5 A->C->C 1
6 A->C 0
7 A->C->C 0
8 B->B 0
9 C->C 0
and so on ....
Such that when I find probability of Outcome = 1 when A is in string, it should be 2%
EDIT -
you can generate the sequence data using this code-
import pandas as pd
import itertools
import numpy as np
import random
alphabets=['A','B','C']
combinations=[]
for i in range(1,len(alphabets)+1):
    combinations.append(['->'.join(i) for i in itertools.product(alphabets, repeat = i)])
combinations=(sum(combinations, []))
weights=np.random.normal(100,30,len(combinations))
weights/=sum(weights)
weights=weights.tolist()
#weights=np.random.dirichlet(np.ones(len(combinations))*1000.,size=1)
'''n = len(combinations)
weights = [random.random() for _ in range(n)]
sum_weights = sum(weights)
weights = [w/sum_weights for w in weights]'''
df=pd.DataFrame(random.choices(
population=combinations,weights=weights,
k=10000),columns=['sequence'])
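To make the overlap concrete, here is a sketch that keeps each letter's draws aligned with the rows instead of in three separate lists. The combination rule (Output is 1 if any per-letter draw is 1) is an assumption, not something the question specifies, and the seed is only there for repeatability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only for repeatability
probs = {"A": 0.02, "B": 0.06, "C": 0.01}

df = pd.DataFrame({"sequence": ["A->A", "C->C->A", "C->B->A", "B->A", "A->C->A",
                                "A->C->C", "A->C", "A->C->C", "B->B", "C->C"]})

for letter, p in probs.items():
    # True where the letter occurs anywhere in the sequence string
    present = df["sequence"].str.contains(letter, regex=False)
    # One Bernoulli draw per row, zeroed where the letter is absent
    draws = rng.binomial(1, p, size=len(df))
    df[f"bi_{letter}"] = np.where(present, draws, 0)

# Assumed rule: a row's Output is 1 if any of its per-letter draws is 1
df["Output"] = df[[f"bi_{k}" for k in probs]].max(axis=1)
```

Under this rule the rate for rows containing A is 2% only when A is the sole letter present. A row containing both A and B, for instance, comes out as 1 with probability 1 - 0.98 * 0.94, roughly 7.9%, so a different rule is needed if the 2% figure must hold exactly.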
I'm trying to calculate the centre of mass of 20 objects, where each object has its own mass.
These objects are represented in a dataframe cm_x, and their associated masses in a list. Below I show an example of just 3 of those 20 objects, for the sake of saving space. Each object has an x, y, z coordinate, but I'll just show the x and then I can apply the same technique to the rest. Below is the head of the dataframe.
bar_head_x bar_hip_centre_x bar_left_ankle_x
0 -203.3502 -195.4573 -293.262
1 -203.4280 -195.4720 -293.251
2 -203.4954 -195.4675 -293.248
3 -203.5022 -195.9193 -293.219
4 -203.5014 -195.9092 -293.328
m_head = 0.081
m_hipc = 0.139
m_lank = 0.0465
m = [m_head,m_hipc,m_lank]
In a similar question, someone suggested the method below. However, it doesn't incorporate the masses, and that is where I'm having an issue:
def series_sum(pd_series):
    return np.sum(np.dot(pd_series.values, np.asarray(range(1, len(pd_series)+1)))/np.sum(pd_series))
cm_x.apply(series_sum, axis=1)
Basically I want for each row, to have an associated centre of mass, using the formula for centre of mass which is sum(x_i * m_i) / sum(m_i).
The desired result would be a new column in the dataframe like so:
cm_x
0 -214.92
1 ...
2 ...
3 ...
4 ...
Any help?
If I understand correctly, you can compute the desired column like this:
>>> df.mul(m).sum(axis=1)/sum(m)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
Use DataFrame.dot and divide by sum of list m:
s = df.dot(m).div(sum(m))
print (s)
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
dtype: float64
If you need a DataFrame, add Series.to_frame:
df1 = df.dot(m).div(sum(m)).to_frame('cm_x')
print (df1)
cm_x
0 -214.921628
1 -214.951023
2 -214.968638
3 -215.201292
4 -215.214800
7441 -245.078910
7442 -244.943961
7443 -244.806606
7444 -244.665285
7445 -244.533503
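As a quick check that the two answers above compute the same weighted mean sum(x_i * m_i) / sum(m_i), here is a small sketch on the five sample rows from the question. The `np.average` variant is added here for comparison and is not part of either answer.

```python
import numpy as np
import pandas as pd

# The five sample rows and three masses from the question
cm_x = pd.DataFrame({
    "bar_head_x": [-203.3502, -203.4280, -203.4954, -203.5022, -203.5014],
    "bar_hip_centre_x": [-195.4573, -195.4720, -195.4675, -195.9193, -195.9092],
    "bar_left_ankle_x": [-293.262, -293.251, -293.248, -293.219, -293.328],
})
m = [0.081, 0.139, 0.0465]

# Multiply each column by its mass, sum across the row, divide by total mass
via_mul = cm_x.mul(m).sum(axis=1) / sum(m)

# Same quantity written as a matrix-vector product
via_dot = cm_x.dot(m) / sum(m)

# Same quantity via NumPy's weighted average along the columns
via_average = pd.Series(np.average(cm_x.to_numpy(), axis=1, weights=m),
                        index=cm_x.index)
```

All three give about -214.9216 for the first row, in line with the output shown in the answers.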
I have some data in the form:
ID  A  B  VALUE  EXPECTED RESULT
1   1  2  5      GROUP1
2   2  3  5      GROUP1
3   3  4  6      GROUP2
4   3  5  5      GROUP1
5   6  4  5      GROUP3
What I want to do is iterate through the data (thousands of rows) and create a common field so I can join the data easily. (A is the start node, B is the end node, and VALUE is the order. The data form something like a chain where only neighbors share a common A or B.)
Rules for joining:
1. Equal value for all elements of a group.
2. A of element one equals B of element two (or the opposite, but NOT A=A' or B=B').
3. The most difficult one: assign to the same group all sequential data that form a series of intersecting nodes.
That is, the first element [1 1 2 5] has to be joined with [2 2 3 5] and then with [4 3 5 5].
Any idea how to accomplish this robustly when iterating through a large amount of data? I have a problem with rule number 3; the others are easily applied. For limited data I have some success, but this depends on the order in which I start examining the data, and it doesn't work for the large dataset.
I can use arcpy (preferably), or Python, R, or Matlab to solve this. I have tried arcpy with no success, so I am checking alternatives.
In ArcPy this code works, but only to a limited extent (i.e. in large features with many segments I get 3-4 groups instead of 1):
import arcpy

TheShapefile="c:/Temp/temp.shp"
desc = arcpy.Describe(TheShapefile)
flds = desc.fields
fldin = 'no'
for fld in flds:      #Check if new field exists
    if fld.name == 'new':
        fldin = 'yes'
if fldin!='yes':      #If not create
    arcpy.AddField_management(TheShapefile, "new", "SHORT")
arcpy.CalculateField_management(TheShapefile,"new",'!FID!', "PYTHON_9.3")  # Copy FID to new
with arcpy.da.SearchCursor(TheShapefile, ["FID","NODE_A","NODE_B","ORDER_","new"]) as TheSearch:
    for SearchRow in TheSearch:
        if SearchRow[1]==SearchRow[4]:
            Outer_FID=SearchRow[0]
        else:
            Outer_FID=SearchRow[4]
        Outer_NODEA=SearchRow[1]
        Outer_NODEB=SearchRow[2]
        Outer_ORDER=SearchRow[3]
        Outer_NEW=SearchRow[4]
        with arcpy.da.UpdateCursor(TheShapefile, ["FID","NODE_A","NODE_B","ORDER_","new"]) as TheUpdate:
            for UpdateRow in TheUpdate:
                Inner_FID=UpdateRow[0]
                Inner_NODEA=UpdateRow[1]
                Inner_NODEB=UpdateRow[2]
                Inner_ORDER=UpdateRow[3]
                if Inner_ORDER==Outer_ORDER and (Inner_NODEA==Outer_NODEB or Inner_NODEB==Outer_NODEA):
                    UpdateRow[4]=Outer_FID
                    TheUpdate.updateRow(UpdateRow)
And some data in shapefile form and dbf form
Using matlab:
A = [1 1 2 5
     2 2 3 5
     3 3 4 6
     4 3 5 5
     5 6 4 5]
%% Initialization
% index of matrix line sharing the same group
ind = 1
% length of the index
len = length(ind)
% the group array
g = []
% group counter
c = 1
% Start the small algorithm
while 1
    % Check if another line with the same "Value" share some common node
    ind = find(any(ismember(A(:,2:3),A(ind,2:3)) & A(:,4) == A(ind(end),4),2));
    % If there is no new line, we create a group with the discovered line
    if length(ind) == len
        %group assignment
        g(A(ind,1)) = c
        c = c+1
        % delete the already discovered line (or node...)
        A(ind,:) = []
        % break if no more node
        if isempty(A)
            break
        end
        % reset the index for the next group
        ind = 1;
    end
    len = length(ind);
end
And here is the output:
g =
1 1 2 1 3
As expected
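Since the question also allows plain Python, here is a sketch of the same grouping using a union-find (disjoint set) structure, which does not depend on the order in which rows are examined. It follows rule 2 as stated in the question (A of one row equals B of another, same VALUE), which is stricter than the shared-node test in the MATLAB code above. The function name `group_chains` and the tuple layout `(id, a, b, value)` are assumptions for illustration.

```python
from collections import defaultdict


def group_chains(rows):
    """rows: iterable of (id, a, b, value). Returns {id: group number}."""
    rows = list(rows)
    parent = list(range(len(rows)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Index rows by (value, end node) so each lookup is a dictionary access
    by_end = defaultdict(list)
    for idx, (_, _, b, value) in enumerate(rows):
        by_end[(value, b)].append(idx)

    # Join each row to every same-value row whose end node is its start node
    for idx, (_, a, _, value) in enumerate(rows):
        for other in by_end.get((value, a), []):
            union(idx, other)

    # Number the groups in order of first appearance
    labels, result = {}, {}
    for idx, (row_id, *_rest) in enumerate(rows):
        root = find(idx)
        labels.setdefault(root, len(labels) + 1)
        result[row_id] = labels[root]
    return result


data = [(1, 1, 2, 5), (2, 2, 3, 5), (3, 3, 4, 6), (4, 3, 5, 5), (5, 6, 4, 5)]
groups = group_chains(data)
```

For the sample table this yields groups 1, 1, 2, 1, 3 for IDs 1 to 5, the same as the expected result column. Each row is visited a constant number of times plus its matches, so the approach should scale to thousands of rows.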
I imported a CSV file that has two columns ID and Bee_type. The bee_type has two types in it - bumblebee and honey bee. I'm trying to convert them to numbers instead of names; i.e. instead of bumblebee it says 1.
However, my code is setting everything to 1. How can I keep the ID column its original value and only change the bee_type column?
# load the labels using pandas
labels = pd.read_csv("bees/train_labels.csv")
#Set bumble_bee to one
for index in range(len(labels)):
    labels[labels['bee_type'] == 'bumble_bee'] = 1
I believe you need map with a dictionary if only 2 possible values exist:
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
Another solution is to use numpy.where - set values by condition:
labels['bee_type'] = np.where(labels['bee_type'] == 'bumble_bee', 1, 2)
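Your loop is not needed, and because the assignment names no column it overwrites every column in the matching rows, which is why ID changes too. Remove the loop and select the bee_type column with loc:
labels.loc[labels['bee_type'] == 'bumble_bee', 'bee_type'] = 1
print (labels)
   ID   bee_type
0   0          1
1   1  honey_bee
2   2          1
3   3  honey_bee
4   4          1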
Sample:
labels = pd.DataFrame({
'bee_type': ['bumble_bee','honey_bee','bumble_bee','honey_bee','bumble_bee'],
'ID': list(range(5))
})
print (labels)
ID bee_type
0 0 bumble_bee
1 1 honey_bee
2 2 bumble_bee
3 3 honey_bee
4 4 bumble_bee
labels['bee_type'] = labels['bee_type'].map({'bumble_bee': 1, 'honey_bee': 2})
print (labels)
ID bee_type
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
As far as I can understand, you want to convert names to numbers. If that's the scenario, please try LabelEncoder. Detailed documentation is available under sklearn LabelEncoder.
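A short sketch of the LabelEncoder suggestion, assuming scikit-learn is installed and reusing the sample frame from the earlier answer. Note that LabelEncoder numbers the classes from 0 in sorted order, so bumble_bee becomes 0 rather than 1. The name `mapping` is introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.DataFrame({
    'ID': list(range(5)),
    'bee_type': ['bumble_bee', 'honey_bee', 'bumble_bee', 'honey_bee', 'bumble_bee'],
})

encoder = LabelEncoder()
# Only the bee_type column is replaced; ID is left untouched
labels['bee_type'] = encoder.fit_transform(labels['bee_type'])

# Lookup from the original name to its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}
```

If the codes must be exactly 1 and 2, the map approach shown earlier gives direct control over the numbering.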