Adding a new column with values calculated from other columns - Python

I was wondering how you could add a new column with values calculated from other columns.
Let's say a row counts as 'small' when it has a value less than 5. I want to record this information in a new column called 'small', which holds 1 if the row is small and 0 otherwise.
I know you can do that with numpy like this:
import pandas as pd
import numpy as np
rawData = pd.read_csv('data.csv', encoding = 'ISO-8859-1')
rawData['small'] = np.where(rawData['value'] < 5, '1', '0')
How would you do the same operation without using numpy?
I've read about numpy.where() but I still don't get it.
Help would be really appreciated, thanks!

You can convert the boolean mask to an integer and then to a string:
rawData['small'] = (rawData['value'] < 5).astype(int).astype(str)
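For example, on a small made-up frame (a sketch; the column name 'value' is taken from the question):
import pandas as pd

rawData = pd.DataFrame({'value': [2, 7, 4, 9]})  # toy stand-in for data.csv
rawData['small'] = (rawData['value'] < 5).astype(int).astype(str)
print(rawData)
#    value small
# 0      2     1
# 1      7     0
# 2      4     1
# 3      9     0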

Related

Is there a way to add two arrays in two columns into a third array using pandas

I am working on a project which uses a pandas DataFrame, and I received some values into its columns as below.
I need to add the pos_vec column and the word_vec column element-wise and create a new column called sum_of_arrays, so each array in the third column should also have size 2.
Eg:
pos_vec                     word_vec                  sum_of_arrays
[-0.22683072, 0.32770252]   [0.3655883, 0.2535131]    [0.13875758, 0.58121562]
Is there anyone who can help me? I'm stuck here. :(
If you convert them to np.array you can simply sum them.
import pandas as pd
import numpy as np
df = pd.DataFrame({'pos_vec': [[-0.22683072,0.32770252],[0.14382899,0.049593687],[-0.24300802,-0.0908088],[-0.2507714,-0.18816864],[0.32294357,0.4486494]],
                   'word_vec': [[0.3655883,0.2535131],[0.33788466,0.038143277],[-0.047320127,0.28842866],[0.14382899,0.049593687],[-0.24300802,-0.0908088]]})
If you want to use numpy:
df['col_sum'] = df[['pos_vec','word_vec']].applymap(lambda x: np.array(x)).sum(1)  # note: in pandas >= 2.1, applymap is deprecated in favour of DataFrame.map
If you don't want to use numpy:
df['col_sum'] = df.apply(lambda row: [a + b for a, b in zip(row.pos_vec, row.word_vec)], axis=1)
There may be cleaner approaches possible using pandas to iterate over the columns; however, this is the solution I came up with by extracting the data from the DataFrame as lists:
# Extract data as lists
pos_vec = df["pos_vec"].tolist()
word_vec = df["word_vec"].tolist()
# Create new list with desired calculation
sum_of_arrays = [[x+y for x,y in zip(l1, l2)] for l1,l2 in zip(pos_vec, word_vec)]
# Add new list to DataFrame
df["sum_of_arrays"] = sum_of_arrays

Python: check if a value in a column of one dataset is within a range of values reported in another dataset

I have read through similar posts but can't find an exact solution.
I have a dataset with a column named "A" and want to check whether each value in this column is contained within any of the intervals in another dataset with two interval columns, "Start" and "End", returning True or False in column "B". Please see the attached image (data is always in ascending order). Thank you.
This is not the most efficient solution but it should do what you are asking:
import pandas as pd
import numpy as np

df1 = pd.DataFrame({"A": list(range(20))})
df2 = pd.DataFrame({"START": [1,3,5,7],
                    "END": [2,4,6,8]})

def compare_with_df(x, df):
    for row in range(df.shape[0]):
        if x >= df.loc[row, 'START'] and x <= df.loc[row, 'END']:
            return True
    return False

df1['B'] = df1['A'].apply(lambda x: compare_with_df(x, df2))
As you can see, the compare_with_df() function loops through df2 and compares a given x to all possible ranges (this can and probably should be optimized for larger datasets; see the vectorized sketch below). The apply() method is equivalent to looping through the values of the given column (Series).
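One way to drop the Python-level loop (a sketch, reusing df1 and df2 from above) is to compare every value against every interval at once with NumPy broadcasting:
import numpy as np

a = df1['A'].to_numpy()[:, None]   # shape (n_values, 1)
start = df2['START'].to_numpy()    # shape (n_intervals,)
end = df2['END'].to_numpy()
df1['B'] = ((a >= start) & (a <= end)).any(axis=1)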

Rename column values using pandas DataFrame

In one of the columns of my dataframe I have five values:
1, G, 2, 3, 4
How can I change every "G" to 1?
I tried:
df = df['col_name'].replace({'G': 1})
I also tried:
df = df['col_name'].replace('G',1)
"G" is in fact 1 (I do not know why there is a mixed naming)
Edit:
works correctly with:
df['col_name'] = df['col_name'].replace({'G': 1})
If I am understanding your question correctly, you are trying to change the values in a column, not the column name itself.
Given that you have mixed data types there, I assume the column is of type object and thus the number is read as a string.
df['col_name'] = df['col_name'].str.replace('G', '1')  # note: .str methods return NaN for non-string entries, so this assumes the values are stored as strings
You could try the following line
df.replace('G', 1, inplace=True)
Use numpy:
import numpy as np
df['a'] = np.where(df.a == 'G', 1, df.a)
You can try this. Let's say your data looks like:
ab = pd.DataFrame({'a': [1, 2, 3, 'G', 5]})
Then you can replace it like this:
ab1 = ab.replace('G', 4)
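A quick check of the fix the asker settled on, on a toy column (a sketch; 'col_name' comes from the question):
import pandas as pd

df = pd.DataFrame({'col_name': [1, 'G', 2, 3, 4]})  # mixed-type object column
df['col_name'] = df['col_name'].replace({'G': 1})
print(df['col_name'].tolist())  # [1, 1, 2, 3, 4]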

Reading values from Pandas dataframe rows into equations and entering result back into dataframe

I have a dataframe. For each row of the dataframe: I need to read values from two column indexes, pass these values to a set of equations, enter the result of each equation into its own column index in the same row, go to the next row and repeat.
After reading the responses to similar questions I tried:
import pandas as pd

DF = pd.read_csv("...")

Equation_1 = f(x, y)
Equation_2 = g(x, y)

for index, row in DF.iterrows():
    a = DF[m]
    b = DF[n]
    DF[p] = Equation_1(a, b)
    DF[q] = Equation_2(a, b)
Rather than iterating over DF and entering new values for each row, this code iterates over DF and enters the same values in every row. I am not sure what I am doing wrong here.
Also, from what I have read it is actually faster to treat the DF as a NumPy array and perform the calculation over the entire array at once rather than iterating. Not sure how I would go about this.
Thanks.
Turns out that this is extremely easy. All that must be done is to define two variables and assign the desired columns to them, then set the output column equal to the equation containing the variables.
Pandas already knows that it must apply the equation to every row and return each value to its proper index. I didn't realize it would be this easy and was looking for more explicit code.
e.g., (a sketch; f stands in for whatever equation you need)
import pandas as pd

df = pd.read_csv("...")  # df is a large 2D table

A = df[0]                # first input column
B = df[1]                # second input column

def f(a, b):
    ...                  # your equation goes here

df[3] = f(A, B)          # pandas applies it to every row, index-aligned
# If your equations are simple enough, do operations column-wise in Pandas:
import pandas as pd

test = pd.DataFrame([[1,2],[3,4],[5,6]])
test             # default column names are 0, 1
test[0]          # this is column 0
test.iloc[:, 0]  # column by 0-indexed position, returned as a Series
test.columns = ['S', 'Q']  # column names are easier to use
test             # column names! Use them column-wise:
test['result'] = test.S**2 + test.Q
test             # results stored in the DataFrame

# For more complicated stuff, try apply, as in "Python pandas apply on more columns":
def toyfun(row):
    return row['S'] - row['Q']**2

test['out2'] = test[['S','Q']].apply(toyfun, axis=1)

# You can also define the column names when you generate the DataFrame:
test2 = pd.DataFrame([[1,2],[3,4],[5,6]], columns=list('AB'))

Referencing columns by assigned name in numpy array

I am trying to create column names for easy reference, so that I can just use the name from the rest of the program instead of having to know where each column sits. The from_ column array is coming up empty. I am new to numpy, so I am just wondering how this is done. Changing the data type for columns 5 and 6 was successful, though.
def array_setter():
    import os
    import glob
    import numpy as np
    os.chdir\
        ('C:\Users\U2970\Documents\Arcgis\Text_files\Data_exports\North_data_folder')
    for file in glob.glob('*.TXT'):
        reader = open(file)
        headerLine = reader.readlines()
        for col in headerLine:
            valueList = col.split(",")
            data = np.array([valueList])
            from_ = np.array(data[1:,[5]], dtype=np.float32)
            # trying to assign a name to columns for easy reference
            to = np.array(data[1:,[6]], dtype=np.float32)
            if data[:,[1]] == 'C005706N':
                if data[:,[from_] < 1.0]:
                    print data[:,[from_]]

array_setter()
If you want to index array columns by name, I would recommend turning the array into a pandas DataFrame. For example:
import pandas as pd
import numpy as np

arr = np.array([[1, 2], [3, 4]])
df = pd.DataFrame(arr, columns=['f', 's'])
print(df['f'])
The nice part of this approach is that the arrays still maintain all their structure, but you also get all the optimized indexing/slicing/etc. capabilities of pandas. For example, if you wanted to find the elements of 'f' that correspond to elements of 's' being equal to some value a, you could use loc:
a = 2
print(df.loc[df['s'] == a, 'f'])
Check out the pandas docs for different ways to use the DataFrame object. Or you could read the book by Wes McKinney (pandas creator), Python for Data Analysis. Even though it was written for an older version of pandas, it's a great starting point and will set you in the right direction.
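If you would rather stay in pure numpy, named columns are also possible with a structured array (a minimal sketch; the field names from the question are reused for illustration):
import numpy as np

# each row is a tuple; the dtype assigns a name and a type to every column
arr = np.array([(1.0, 2.0), (3.0, 4.0)],
               dtype=[('from_', np.float32), ('to', np.float32)])
print(arr['from_'])  # index a column by its name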
