I have a dataframe that I have created by hand. I am working on a code that copies the dataframe and concatenates the new dataframe to the end of the first one. For now, I need the code to look through each value of a column of the 'Name' dataframe that contains strings and if there is a number in the string, increase this number by 1. I need the number to be turned into an int so that I can create a function that will look through the dataframe and automatically add 1 to the largest number in the dataframe. An example:
import pandas as pd
data = {'ID': [1,2,3,4],
'Name': ['BN #1', 'HHC', 'A comp', 'B Comp']}
df = pd.DataFrame(data)
df['SysNum'] = [int(re.search('(?<=#)\d', x)[0]) for x in df['Name'].values]
Afterwards the new df looks like
data2 = {'ID': [1,2,3,4,5,6,7,8],
'Name': ['BN #1', 'HHC', 'A comp', 'B Comp','BN #2', 'HHC', 'A comp', 'B Comp']}
When I run this, I receive a 'NoneType' object is not subscriptable error. This makes sense because only the BN # row has a number and re.search returns None when the string parameters are not met, but I cannot figure out how to tell python to ignore the other rows.
EDIT
Only the first row each dataframe will increase by 1, so if there is an easier way where I do not use re.search, that is fine. I know there are a couple ways of doing this but I want to be able to always look through the string value of BN and increase it by 1 every time I run the code.
REGEX EDIT
df2['BaseName'] = [re.sub('\d', '', x) for x in df2['Name'].values]
df['BaseName'] = [re.sub('\d', '', x) for x in df['Name'].values]
df2['SysNum'] = [int(re.search('(?<=#)\d', x)[0]) for x in df2['Name'].values]
# df2['SysNum'] = df2['Name'].get(r'(?<=#)\d').astype(int)
# df['SysNum'] = [int(re.search('(?<=#)\d', x)[0]) for x in df['Name'].values]
df['SysNum'] = df['Name'].str.contains('(?<=#)\d').astype(int)
m = re.search(r'(?<=#)\d', df2['Name'].iloc[0])
if m:
df2['SysNum'] = int(m.group(0)) + 1
n = re.search(r'(?<=#)\d', df['Name'].iloc[0])
if n:
df['SysNum'] = int(n.group(1)) + 1
new_names = df2['BaseName'].unique()
maxes2 = np.zeros((len(new_names), ))
for j in range(len(new_names)):
un2 = new_names[j]
maxes2[j] = df['SysNum'].loc[df['BaseName'] == un2].max()
df2['SysNum'].loc[df2['BaseName'] == un2] = np.linspace(1, len(df2['SysNum'].loc[df2['BaseName'] == un2]), len(df2['SysNum'].loc[df2['BaseName'] == un2]))
df2['SysNum'].loc[df2['BaseName'] == un2] += maxes2[j]
newnames2 = [s + '%d' % num for s,num in zip(df2['BaseName'].loc[df2['BaseName'] == un2].values, df2['SysNum'].loc[df2['BaseName'] == un2].values)]
df2['Name'].loc[df2['BaseName'] == un2] = newnames2
I have this code working for two dataframes and the numbering works out how I would like it to. The first two have a "Name-###" naming convention for all the rows in the dataframe. This allows the commented out re.search line at the top to run just fine. The next two dataframes I am working on are like the examples I put up earlier with the BN #1 and the rest of the names do not have a number. When I run the commented out re.search lines, the code tries to convert the NoneTypes to int and it cannot do that. When I run the code as is now, a new number is put on each and every row immediately following the name, but I need it to add a new number to the row with the #. So what I need and I am struggling with is a piece of code that looks through the dataframe, looks for a # sign, turns the number after the # sign into an int, a loop that looks for the max int and then adds 1 to that number, adds that new number onto the new dataframe, adds new dataframe onto the old one for a larger master list.
You can access the value on the first row of the Name column using df['Name'].iloc[0].
Thus, you can search for a sequence of digits after a # sign in that value using
m = re.search(r'#(\d+)', df['Name'].iloc[0])
if m:
df['SysNum'] = int(m.group(1)) + 1
Output:
>>> df
ID Name SysNum
0 1 BN #1 2
1 2 HHC 2
2 3 A comp 2
3 4 B Comp 2
My Problem
I have a loop that creates a column using either a formula based on values from other columns or the previous value in the column depending on a condition ("days from new low == 0"). It is really slow over a huge dataset so I wanted to get rid of the loop and find a formula that is faster.
Current Working Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('stock_price.csv', delimiter = ',')
df = pd.DataFrame(csv1)
for x in range(1,len(df.index)):
if df["days from new low"].iloc[x] == 0:
df["mB"].iloc[x] = (df["RSI on new low"].iloc[x-1] - df["RSI on new low"].iloc[x]) / -df["days from new low"].iloc[x-1]
else:
df["mB"].iloc[x] = df["mB"].iloc[x-1]
df
Input Data and Expected Output
RSI on new low,days from new low,mB
0,22,0
29.6,0,1.3
29.6,1,1.3
29.6,2,1.3
29.6,3,1.3
29.6,4,1.3
21.7,0,-2.0
21.7,1,-2.0
21.7,2,-2.0
21.7,3,-2.0
21.7,4,-2.0
21.7,5,-2.0
21.7,6,-2.0
21.7,7,-2.0
21.7,8,-2.0
21.7,9,-2.0
25.9,0,0.5
25.9,1,0.5
25.9,2,0.5
23.9,0,-1.0
23.9,1,-1.0
Attempt at Solution
def mB_calc (var1,var2,var3):
df[var3]= np.where(df[var1] == 0, df[var2].shift(1) - df[var2] / -df[var1].shift(1) , "")
return df
df = mB_calc('days from new low','RSI on new low','mB')
First, it gives me this "TypeError: can't multiply sequence by non-int of type 'float'" and second I dont know how to incorporate the "ffill" into the formula.
Any idea how I might be able to do it?
Cheers!
Try this one:
df["mB_temp"] = (df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift()
df["mB"] = df["mB"].shift()
df["mB"].loc[df["days from new low"] == 0]=df["mB_temp"].loc[df["days from new low"] == 0]
df.drop(["mB_temp"], axis=1)
And with np.where:
df["mB"] = np.where(df["days from new low"]==0, df["RSI on new low"].shift() - df["RSI on new low"]) / -df["days from new low"].shift(), df["mB"].shift())
Using Panda, I am dealing with the following CSV data type:
f,f,f,f,f,t,f,f,f,t,f,t,g,f,n,f,f,t,f,f,f,f,f,f,f,f,f,f,f,f,f,f,f,t,t,t,nowin
t,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,nowin
t,f,f,f,t,f,f,f,t,f,t,f,g,f,b,f,f,t,f,f,f,t,f,t,f,t,f,f,f,f,f,f,f,t,f,n,won
f,f,f,f,f,f,f,f,f,f,t,f,g,f,b,f,f,t,f,f,f,f,f,t,f,t,f,f,f,f,f,f,f,t,f,n,win
For this part of the raw data, I was trying to return something like:
Column1_name -- t -- counts of nowin = 0
Column1_name -- t -- count of wins = 3
Column1_name -- f -- count of nowin = 2
Column1_name -- f -- count of win = 1
Based on this idea get dataframe row count based on conditions I was thinking in doing something like this:
print(df[df.target == 'won'].count())
However, this would return always the same number of "wons" based on the last column without taking into consideration if this column it's a "f" or a "t". In other others, I was hoping to use something from Panda dataframe work that would produce the idea of a "group by" from SQL, grouping based on, for example, the 1st and last column.
Should I keep pursing this idea of should I simply start using for loops?
If you need, the rest of my code:
import pandas as pd
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/chess/king-rook-vs-king-pawn/kr-vs-kp.data"
df = pd.read_csv(url,names=[
'bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target'
])
features = ['bkblk','bknwy','bkon8','bkona','bkspr','bkxbq','bkxcr','bkxwp','blxwp','bxqsq','cntxt','dsopp','dwipd',
'hdchk','katri','mulch','qxmsq','r2ar8','reskd','reskr','rimmx','rkxwp','rxmsq','simpl','skach','skewr',
'skrxp','spcop','stlmt','thrsk','wkcti','wkna8','wknck','wkovl','wkpos','wtoeg','target']
# number of lines
#tot_of_records = np.size(my_data,0)
#tot_of_records = np.unique(my_data[:,1])
#for item in my_data:
# item[:,0]
num_of_won=0
num_of_nowin=0
for item in df.target:
if item == 'won':
num_of_won = num_of_won + 1
else:
num_of_nowin = num_of_nowin + 1
print(num_of_won)
print(num_of_nowin)
print(df[df.target == 'won'].count())
#print(df[:1])
#print(df.bkblk.to_string(index=False))
#print(df.target.unique())
#ini_entropy = (() + ())
This could work -
outdf = df.apply(lambda x: pd.crosstab(index=df.target,columns=x).to_dict())
Basically we are going in on each feature column and making a crosstab with target column
Hope this helps! :)