I want numpy to go through column 1 and find all numbers that are greater than 0. Then I want numpy to print out the positive numbers it found in column 1 and, for each one, whatever number is associated with it in column 2.
import numpy
N = 3
a1 = [-0.0119,0.0754,0.0272,0.0107,-0.0053,-0.0114,0.0148,0.0062,0.0043,0,0.022,-0.0153,0.0207,-0.0065,0.0069,-0.0018,0.0149,-0.0084,-0.0021,0.0072,0.0095,0.0004,0.0068,0.0016,-0.0048,0.0051,0.0025,0.0081,-0.0203,-0.0008,-0.0008,-0.0047,-0.0007,-0.0291,0.0071,0.0033,0.0179,-0.0016,0.0397,0.0075,0.0061,-0.0075,0.0026,-0.0055,-0.006,0.0026,-0.0046,0.0046,0.0201,0.023,0.0014,-0.0029,0.0115,0.0066,0.0071,0.0061,-0.0081,-0.0071,0.0005,-0.0076,0.0102,-0.0051,0.018,0.0017,0.0123,0.0021,-0.0032,0.0049,0.0004,0.0053,-0.0004,0.0138,-0.0215,0.0019,0.0023,-0.0059,-0.013,-0.0478,-0.0009,0.0089,0.0006,0.014,-0.0077,0.0006,0.0024,0.0113,0.0062,-0.0162,0.0198,0.0096,0.0167,-0.0018,0.0038,0.0088,0.0023,-0.0063,-0.0109,0.0127,-0.027,0,0.0089,-0.0003,0.023,-0.0009,0.02,-0.0059,0.0029,0.0219,-0.0003,0.0029,0.0072,-0.009,0.0025,0.0123,0.0106,-0.0024,-0.0267,0.0124,0.0012,0.0046,-0.0131,0.0133,-0.0075,0.009,0.0209,0.0106,0.0031,0.0019,-0.0122,0.002,-0.0261,-0.004,0.4034]
a1 = a1[::-1]
a1 = numpy.array(a1)
numbers_mean = numpy.convolve(a1, numpy.ones((N,))/N)[(N-1):]
numbers_mean = numbers_mean[::-1]
numbers_mean = numbers_mean.reshape(-1,1)
a1 = a1.reshape(-1,1)
x = numpy.column_stack((a1,numbers_mean))
l = x[0<a1]
When I print l, all I get is the results from column 1. I also want whatever number is showing in column 2 (column 2 is not filtered).
This is the solution: how to filter by column 1 and bring all the columns along.
xx = x[x[:,0]>0,:]
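As a minimal sketch (with a toy array in place of the data above), the boolean mask built from column 0 selects whole rows, so column 1 comes along unfiltered:

```python
import numpy as np

# toy 2-column array standing in for x above
x = np.array([[-0.5, 1.0],
              [ 0.3, 2.0],
              [ 0.7, 3.0],
              [-0.1, 4.0]])

# x[:, 0] > 0 is a 1-D boolean mask over rows; indexing with it
# keeps every column of the selected rows
xx = x[x[:, 0] > 0, :]
print(xx)  # rows [0.3, 2.0] and [0.7, 3.0]
```

The key difference from `x[0 < a1]` is that the mask here is one-dimensional (one entry per row), so it selects rows rather than individual elements.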
DF is a dataframe with columns [A1,B1,C1,A2,B2,C2,A3,B3,C3]
I want to split that 'DF' data frame into small dataframes DF1,DF2,DF3
DF1 to have [A1,B1,C1] as columns
DF2 to have [A2,B2,C2] as columns
DF3 to have [A3,B3,C3] as columns
The number in the name of the dataframe DF'3' should match with its columns [A'3',B'3',C'3']
I tried:
for i in range(1,4):
    'DF{}'.format(i) = DF[DF['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]]
Getting the error
SyntaxError: cannot assign to function call
Is it possible to do this in a single loop?
You can't dynamically change an object's name.
You can use a list comprehension with explicit definition of the dfs:
df1,df2,df3=[df[['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]] for i in range(1,4)]
Update based on ViettelSolutions' comment
Here is a more concise way of doing that: df1,df2,df3=[df[[f'A{i}',f'B{i}',f'C{i}']] for i in range(1,4)]
You can also use a list instead of naming the dfs explicitly, and unpack them when needed.
n = 3  # the number of dfs
dfs = [df[['A{}'.format(i),'B{}'.format(i),'C{}'.format(i)]] for i in range(1, n + 1)]
The error message stems from trying to assign a dataframe to the result of a string format call instead of to a variable.
Dynamically creating variables from DF1 to DFN for N numbers can be a bit tricky. It is easy to create key-item pairs in dicts though. Try the following:
dfs = {}
for i in range(1,4):
    dfs["DF{}".format(i)] = DF[["A{}".format(i), "B{}".format(i), "C{}".format(i)]]
Instead of getting DF1, DF2 and DF3 variables, you get dfs["DF1"], dfs["DF2"], and dfs["DF3"]
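As a quick self-contained check (using a toy DF with dummy values), the dict keys then behave exactly like the variable names would:

```python
import pandas as pd

# toy frame with the nine columns from the question
DF = pd.DataFrame({c: [0, 1] for c in
                   ['A1', 'B1', 'C1', 'A2', 'B2', 'C2', 'A3', 'B3', 'C3']})

dfs = {}
for i in range(1, 4):
    dfs["DF{}".format(i)] = DF[["A{}".format(i), "B{}".format(i), "C{}".format(i)]]

print(list(dfs["DF2"].columns))  # ['A2', 'B2', 'C2']
```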
You could make it completely configurable:
def split_dataframe(df, letters, numbers):
return [df[[f'{letter}{number}' for letter in letters]] for number in numbers]
letters = ("A","B","C")
numbers = range(1,4)
df1, df2, df3 = split_dataframe(df, letters, numbers)
You can make the function even more general as follows:
import re
letters_pattern = re.compile(r"^\D+")
numbers_pattern = re.compile(r"\d+$")
def split_dataframe(df):
    letters = sorted(set(letters_pattern.findall(x)[0] for x in df.columns))
    numbers = sorted(set(numbers_pattern.findall(x)[0] for x in df.columns))
    return [df[[x for x in [f'{letter}{number}' for letter in letters] if x in df.columns]] for number in numbers]
This method has two advantages:
you don't need to provide the letters and the numbers in advance; the method will discover what is available in the header and proceed
it will manage "irregular" situations, when, for example, D1 exists but D2 doesn't
To give a concrete example:
df = pd.DataFrame({"A1":[1,2], "B1":[2,3], "C1":[3,4], "D1":[4,5], "A2":[2,3], "B2":[10,11], "C2":[12,13]})
for sub_df in split_dataframe(df):
    print(sub_df)
OUTPUT
A1 B1 C1 D1
0 1 2 3 4
1 2 3 4 5
A2 B2 C2
0 2 10 12
1 3 11 13
The columns names discovery process could be set as optional if you pass letters and numbers you only want to consider, as follows:
def split_dataframe(df, letters=None, numbers=None):
    letters = sorted(set(letters_pattern.findall(x)[0] for x in df.columns)) if letters is None else letters
    numbers = sorted(set(numbers_pattern.findall(x)[0] for x in df.columns)) if numbers is None else numbers
    return [df[[x for x in [f'{letter}{number}' for letter in letters] if x in df.columns]] for number in numbers]
for sub_df in split_dataframe(df, letters=("B","C"), numbers=[1,2]):
    print(sub_df)
OUTPUT
B1 C1
0 2 3
1 3 4
B2 C2
0 10 12
1 11 13
I have a pandas dataframe which looks like this:
A B
x 5.9027.5276
y 656.344872.0
z 78.954.23
What I want is to replace each string entry in column B with a float built from the first four digits of the entry, with the decimal point placed after the first digit.
Therefore, I wrote the following code:
for entry in df['B']:
    entry = re.search(r'((\d\.?){1,4})', entry).group().replace(".","")
    df['B'] = entry[:1] + '.' + entry[1:]
df['B'] = df['B'].astype(float)
It almost does what I want but it replaces all the entries in B with the float value of the first row. Instead, I would like to replace the entries with the according float value of each row.
How could I do this?
Thanks a lot!
You can use the relevant pandas string functions:
df['B'] = df['B'].str.extract(r'((\d\.?){1,4})')[0].str.replace(r'\.', '', regex=True)
df['B'] = df['B'].str[:1] + '.' + df['B'].str[1:]
df['B'] = df['B'].astype(float)
print(df)
A B
0 x 5.902
1 y 6.563
2 z 7.895
You might enclose your operation in a function and then use .apply, i.e.:
import re
import pandas as pd
df = pd.DataFrame({'A':['x','y','z'],'B':['5.9027.5276','656.344872.0','78.954.23']})
def func(entry):
    entry = re.search(r'((\d\.?){1,4})', entry).group().replace(".","")
    return entry[:1] + '.' + entry[1:]
df['B'] = df['B'].apply(func)
df['B'] = df['B'].astype(float)
print(df)
output:
A B
0 x 5.902
1 y 6.563
2 z 7.895
I have a Dataframe like this:
Interesting genre_1 probabilities
1 no Empty 0.251306
2 yes Empty 0.042043
3 no Alternative 5.871099
4 yes Alternative 5.723896
5 no Blues 0.027028
6 yes Blues 0.120248
7 no Children's 0.207213
8 yes Children's 0.426679
9 no Classical 0.306316
10 yes Classical 1.044135
I would like to compute the Gini index within each category, based on the Interesting column. After that, I would like to add the resulting value in a new pandas column.
This is the function to get the Gini index:
#Gini Function
#a and b are the quantities of each class
def gini(a,b):
    a1 = (a/(a+b))**2
    b1 = (b/(a+b))**2
    return 1 - (a1 + b1)
EDIT: Sorry, I had an error in my final desired dataframe. Being interesting or not matters when it comes to choosing prob(A) and prob(B), but the Gini score will be the same, because it measures how much impurity we get when classifying a song as interesting or not. So if the probabilities are around 50/50, the Gini score will reach its maximum (0.5), because it is equally possible to be mistaken when choosing interesting or not.
So for the first two rows, the Gini index will be:
a=no; b=Empty -> gini(0.251306, 0.042043)= 0.245559831601612
a=yes; b=Empty -> gini(0.042043, 0.251306)= 0.245559831601612
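As a quick sanity check, the gini function defined above reproduces that value and is symmetric in its arguments, which is why both orderings give the same score:

```python
def gini(a, b):
    a1 = (a / (a + b)) ** 2
    b1 = (b / (a + b)) ** 2
    return 1 - (a1 + b1)

# both orderings give the same impurity for the Empty genre
print(gini(0.251306, 0.042043))  # ~0.2455598
print(gini(0.042043, 0.251306))  # same value
```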
Then I would like to get something like:
   Interesting  genre_1      probabilities  GINI INDEX
1  no           Empty        0.251306       0.245559831601612
2  yes          Empty        0.042043       0.245559831601612
3  no           Alternative  5.871099       0.4999194135183881
4  yes          Alternative  5.723896       0.4999194135183881
5  no           Blues        0.027028       ...
6  yes          Blues        0.120248       ...
7  no           Children's   0.207213       ...
8  yes          Children's   0.426679       ...
9  no           Classical    0.306316       ...
10 yes          Classical    1.044135       ...
Ok, I think I know what you mean. The code below does not care whether the Interesting value is 'yes' or 'no'. But what you want is to calculate the GINI coefficient in two different ways for each row, based on the value in the Interesting column of that row. So if Interesting == 'no', the result is 0.5, because a == b. But if Interesting is 'yes', then you need to use a = probability[i] and b = probability[i+1]. So skip this section for the updated code below.
import pandas as pd
df = pd.read_csv('df.txt',delim_whitespace=True)
probs = df['probabilities']
def ROLLING_GINI(probabilities):
    a1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    b1 = (probabilities[0]/(probabilities[0]+probabilities[0]))**2
    res = 1 - (a1 + b1)
    yield res
    for i in range(len(probabilities)-1):
        a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
        b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
        res = 1 - (a1 + b1)
        yield res
df['GINI'] = [val for val in ROLLING_GINI(probs)]
print(df)
This is where the real trouble starts, because if I understand your idea correctly, you cannot calculate the last GINI value; your dataframe won't allow it. The important bit here is that the last Interesting value in your dataframe is 'yes'. This means I have to use a = probability[i] and b = probability[i+1]. But your dataframe doesn't have a row number 11. You have 10 rows, and on row i == 10 you'd need a probability in row 11 to calculate a GINI coefficient. So for your idea to work, the last Interesting value MUST be 'no'; otherwise you will always get an index error.
Here's the code anyways:
import pandas as pd
df = pd.read_csv('df.txt',delim_whitespace=True)
def ROLLING_GINI(dataframe):
    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']
    for i in range(len(dataframe)-1):
        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i+1]))**2
            b1 = (probabilities[i+1]/(probabilities[i]+probabilities[i+1]))**2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res
GINI = [val for val in ROLLING_GINI(df)]
print('All GINI coefficients: %s'%GINI)
print('Length of all calculable GINI coefficients: %s'%len(GINI))
print('Number of rows in the dataframe: %s'%len(df))
print('The last Interesting value is: %s'%df.iloc[-1,0])
EDIT NUMBER THREE (sorry for the late realization):
It does work if I apply the indexing correctly. The problem was that I was using the next probability when I should have used the previous one. So it's a = probabilities[i-1] and b = probabilities[i]:
import pandas as pd
df = pd.read_csv('df.txt',delim_whitespace=True)
def ROLLING_GINI(dataframe):
    probabilities = dataframe['probabilities']
    how_to_calculate = dataframe['Interesting']
    for i in range(len(dataframe)):
        if how_to_calculate[i] == 'yes':
            a1 = (probabilities[i-1]/(probabilities[i-1]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i-1]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res
        elif how_to_calculate[i] == 'no':
            a1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            b1 = (probabilities[i]/(probabilities[i]+probabilities[i]))**2
            res = 1 - (a1 + b1)
            yield res
GINI = [val for val in ROLLING_GINI(df)]
print('All GINI coefficients: %s'%GINI)
print('Length of all calculable GINI coefficients: %s'%len(GINI))
print('Number of rows in the dataframe: %s'%len(df))
print('The last Interesting value is: %s'%df.iloc[-1,0])
I am not sure how the Interesting column plays into all of this, but I highly recommend making the new column with numpy.where(). The syntax would be something like:
import numpy as np
df['GINI INDEX'] = np.where(__condition__,__what to do if true__,__what to do if false__)
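For example, here is a minimal sketch of that pattern with a made-up flag column (the values and condition are illustrative only; the exact condition and branch values depend on how Interesting should drive the calculation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Interesting': ['no', 'yes', 'no', 'yes'],
                   'probabilities': [0.251306, 0.042043, 5.871099, 5.723896]})

# np.where(condition, value_if_true, value_if_false), evaluated element-wise
df['flag'] = np.where(df['Interesting'] == 'yes', 1.0, 0.0)
print(df)
```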
I'm attempting to loop through groups of phrases to match and score them among all the members of each group. Even if some of the phrases are the same, they may have different codes, which is what I'm trimming from the loop inputs but need to retain in the final df2. I have to make the comparison in the loop without the code; the issue is tying the result back to the original df that contains the code, so I can identify which rows need to be flagged.
The code below works, but I need to add the original DESCR to df2. Appending a and b only captures the trimmed values.
I've tried df.at[] but have mixed, incorrect results. Thank you.
import pandas as pd
from fuzzywuzzy import fuzz as fz
import itertools
data = [[1,'Oneab'],[1,'Onebc'],[1,'Twode'],[2,'Threegh'],[2,'Threehi'],[2,'Fourjk'],[3,'Fivekl'],[3,'Fivelm'],[3,'Fiveyz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])
n_list = []
a_list = []
b_list = []
pr_list = []
tsr_list = []
groups = df.groupby('Ids')
for n, g in groups:
    for a, b in itertools.product(g['DESCR'].str[:-2], g['DESCR'].str[:-2]):
        if str(a) < str(b):
            try:
                n_list.append(n)
                a_list.append(a)
                b_list.append(b)
                pr_list.append(fz.partial_ratio(a,b))
                tsr_list.append(fz.token_set_ratio(a,b))
            except:
                pass
df2 = pd.DataFrame({'Group': n_list, 'First Comparator': a_list, 'Second Comparator': b_list, 'Partial Ratio': pr_list, 'Token Set Ratio': tsr_list})
Instead of:
ab bc 50 50
ab de 0 0
bc de 0 0
gh hi 50 50
gh jk 0 0
hi jk 50 50
...
I'd like to see:
Oneab Onebc 50 50
Oneab Twode 0 0
Onebc Twode 0 0
Threegh Threehi 50 50
Threegh Fourjk 0 0
Threehi Fourjk 50 50
...
In case anyone else runs into a similar issue: I figured it out. Instead of filtering the inputs at the beginning of the second-level loop, I'm bringing the full value into the second loop and stripping it there:
a2 = a[:-2]
b2 = b[:-2]
So:
import pandas as pd
from fuzzywuzzy import fuzz as fz
import itertools
data = [[1,'Oneab'],[1,'Onebc'],[1,'Twode'],[2,'Threegh'],[2,'Threehi'],[2,'Fourjk'],[3,'Fivekl'],[3,'Fivelm'],[3,'Fiveyz']]
df = pd.DataFrame(data,columns=['Ids','DESCR'])
n_list = []
a_list = []
b_list = []
pr_list = []
tsr_list = []
groups = df.groupby('Ids')
for n, g in groups:
    for a, b in itertools.product(g['DESCR'], g['DESCR']):
        if str(a) < str(b):
            try:
                a2 = a[:-2]
                b2 = b[:-2]
                n_list.append(n)
                a_list.append(a)
                b_list.append(b)
                pr_list.append(fz.partial_ratio(a2,b2))
                tsr_list.append(fz.token_set_ratio(a2,b2))
            except:
                pass
df2 = pd.DataFrame({'Group': n_list, 'First Comparator': a_list, 'Second Comparator': b_list, 'Partial Ratio': pr_list, 'Token Set Ratio': tsr_list})
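A design note on the pairing itself: the product(...) plus str(a) < str(b) filter is one way to visit each unordered pair exactly once; itertools.combinations does the same thing directly, and for distinct, sorted input the two produce identical pair lists (illustrated here with one group's toy values):

```python
import itertools

values = ['Oneab', 'Onebc', 'Twode']  # one group's DESCR values, already sorted

# product + filter: visit each unordered pair (a, b) with a < b once
filtered = [(a, b) for a, b in itertools.product(values, values) if a < b]
# combinations: the same pairs, without generating and discarding the rest
direct = list(itertools.combinations(values, 2))
print(filtered == direct)  # True for distinct, sorted input
```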
I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row on, use the previous row's value to iterate through the cx function.
import pandas as pd

data = {'x':[1,2,3,4,5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1,'x_new'] == 0:
        df.loc[1,'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = (cx(df['x']))
The final dataframe
I am not sure on how to do this.
Thank you for your help
This is what I have so far:
data = {'depth':[1,2,3,4,5]}
df=pd.DataFrame(data)
df
# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be the previous row
    return z

depth_cal = depth_cal(df['depth'])  # how to set d as the previous row?
print (depth_cal)
depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # Does not put list in a column
df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. it does not put the depth_cal list properly into a column
2. in the depth_cal function, I want d to be the previous row
Thank you
I would do this by just using a loop to generate your new data. It might not be ideal if the dataset is particularly huge, but it's a quick operation. Let me know how you get on with this:
data = {'depth':[1,2,3,4,5]}
df = pd.DataFrame(data)
res = list(data['depth'])  # copy, so the original dict is not mutated
res[0] = -5.63
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1
df['new_depth'] = res
print(df)
To get
depth new_depth
0 1 -5.63
1 2 17.89
2 3 -52.67
3 4 159.01
4 5 -476.03
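The same recurrence can also be written with itertools.accumulate (Python 3.8+ for the initial keyword), which makes the dependence on the previous value explicit without indexing:

```python
from itertools import accumulate

import pandas as pd

df = pd.DataFrame({'depth': [1, 2, 3, 4, 5]})

# start at -5.63, then apply new = -3*prev + 1 once per remaining row
vals = list(accumulate(range(len(df) - 1),
                       lambda prev, _: -3 * prev + 1,
                       initial=-5.63))
df['new_depth'] = vals
print(df)
```

This is a stylistic alternative to the loop above; both produce the same column.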