I am trying to multiply a single-column DataFrame into every column of another DataFrame. Both have 40 rows.
The idea is that the single column holds calculated weights which I want to apply to the returns given in the other DataFrame.
Here is everything I tried, but I keep getting NaN values.
#lineaInter5m= li_wg * stocks5m
#lineaInter5m = stocks5m.mul(li_wg);
#print(li_wg.multiply(stocks5m))
#li_wg["li_wg"] * stocks5m
You need to set the axis parameter to "index" (or 0):
lineaInter5m = stocks5m.mul(li_wg, axis=0)
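For example, with made-up data (hypothetical column names), mul(..., axis=0) aligns the weights on the row index and broadcasts them across every column. If li_wg is a one-column DataFrame rather than a Series, pass the column itself (e.g. li_wg['li_wg'] or li_wg.squeeze()):
import pandas as pd

# toy data: 3 rows instead of 40, hypothetical tickers
stocks5m = pd.DataFrame({'AAA': [0.01, 0.02, 0.03],
                         'BBB': [0.04, 0.05, 0.06]})
li_wg = pd.Series([0.50, 0.25, 0.25], name='li_wg')   # the weights, one per row

# align on the row index, so each weight multiplies its whole row
lineaInter5m = stocks5m.mul(li_wg, axis=0)
print(lineaInter5m)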
I have a DataFrame with some NA and I want to drop the rows where a particular column has NA values.
My first trial has been:
- Identifying the rows where the specific column's values were NA
- Passing them to DataFrame.drop()
In my specific case, I have a DataFrame of 39164 rows by 40 columns.
If I look for NA in the specific column, I find 17715 affected labels, which I saved to a dedicated variable. I then passed them to DataFrame.drop(), expecting about 22000 rows remaining, but I only got 2001.
If I use DataFrame.dropna() I get 21449 rows remaining, which is what I was expecting.
Here follows my code.
The first code portion downloads data from gouv.fr (sorry for not using fake data... but it should take less than 10 seconds to execute).
WARNING: only the last 5 years are stored in the online database, so my example may need to be adapted later...
import pandas as pd
villes = {'Versailles' : '78646',
'Aix-en-Provence' : '13001'}
years = range(2014,2019)
root = "https://cadastre.data.gouv.fr/data/etalab-dvf/latest/csv/"
data = pd.DataFrame({})
for ville in villes.keys():
    for year in years:
        file_addr = '/'.join([root, str(year), 'communes', villes[ville][:2], villes[ville] + '.csv'])
        print(file_addr)
        tmp = pd.read_csv(file_addr)
        data = pd.concat([data, tmp])
This is the second portion, where I try to drop some rows. As said, the results are very different depending on the chosen strategy (data_1 vs data_2). data_2, obtained with dropna(), is the expected result.
print(data.shape)
undefined_surface = data.index[data.surface_reelle_bati.isna()]
print(undefined_surface)
data_1 = data.drop(undefined_surface)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_1.shape)
print(data_2.shape)
Using dropna() is totally fine for me, but I would like to understand what I was doing wrong with drop() since I got a very silly result compared to my expectation and I would like to be aware of that in the future...
Thanks in advance for help.
This is because your index is not unique: each CSV you concatenate starts its own index at 0, so a label such as 0 appears once per downloaded file. You can see the duplicates by selecting that label:
data.loc[0].shape
# (n, 40) where n is the number of concatenated files that have a row labelled 0
If at least one of the rows labelled 0 has surface_reelle_bati missing, drop() removes all of the rows carrying that label from data_1. This is why you drop far more rows when creating data_1 than when creating data_2.
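A tiny self-contained example (made-up data) shows the mechanism:
import numpy as np
import pandas as pd

# two small frames concatenated without ignore_index -> labels 0 and 1 are duplicated
a = pd.DataFrame({'surface': [np.nan, 20.0]})   # index 0, 1
b = pd.DataFrame({'surface': [30.0, 40.0]})     # index 0, 1 again
data = pd.concat([a, b])

bad = data.index[data.surface.isna()]           # contains the label 0 (one NaN row)
print(data.drop(bad).shape)                     # (2, 1): BOTH rows labelled 0 are gone
print(data.dropna(subset=['surface']).shape)    # (3, 1): only the NaN row is gone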
To solve this, use reset_index() so that the index goes from 0 to the number of rows of data:
data = data.reset_index()
undefined_surface = data.index[data.surface_reelle_bati.isna()].tolist()
data_1 = data.drop(undefined_surface)
print(data_1.shape)
# (21449, 41)
data_2 = data.dropna(subset=['surface_reelle_bati'])
print(data_2.shape)
# (21449, 41)
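Alternatively, you can avoid the duplicate labels in the first place by letting concat renumber the rows while you build data, for example:
# in the download loop:
data = pd.concat([data, tmp], ignore_index=True)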
I'm looking to trim my dataframe by removing the top and bottom 5% or so of data from specific columns. There are erroneous outliers that are preventing me from using the data effectively.
The dataframe has a "name" column and a few other non-numeric columns, so I want to be able to select specific columns to trim the df from.
I think converting the cell to NaN if its value is in the largest or smallest x% would be an effective way to do it, but I'm open to other ways if they work, too.
Here is an example of what I'm trying to do:
for column in df.columns:
    top = column.quantile(0.95)
    bottom = column.quantile(0.05)
    for cell in column:
        if (cell >= top) | (cell <= bottom):
            cell = np.NaN
I think you want between. Also, you can pass an array to quantile():
for column in your_list_of_columns:
    bottom, top = df[column].quantile([0.05, 0.95])
    df[column] = df[column].where(df[column].between(bottom, top))
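A quick made-up example of what that does: values outside the 5th-95th percentile band become NaN while everything else is kept.
import pandas as pd

df = pd.DataFrame({'name': list('abcdefghij'),
                   'value': [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]})   # 1000 is an outlier

for column in ['value']:
    bottom, top = df[column].quantile([0.05, 0.95])
    df[column] = df[column].where(df[column].between(bottom, top))

print(df)   # the 1000 (and the lowest extreme) are now NaN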
You can use the np.argpartition method as below to select the top and bottom 5% of each column. This is more efficient because it is vectorised and doesn't need to sort all the rows:
trim_len = int(len(df) * 0.05)   # number of rows in each 5% tail (assumed definition)
bottom_ind = np.argpartition(df.values, trim_len, axis=0)[:trim_len]
top_ind = np.argpartition(df.values, -trim_len, axis=0)[-trim_len:]
trim_ind = np.r_[bottom_ind, top_ind]   # row positions to blank out, one column per df column
## you can use a loop here if you have more columns
df.iloc[trim_ind[:, 0], 0] = np.nan
df.iloc[trim_ind[:, 1], 1] = np.nan
df
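To cover every column rather than hard-coding the first two, the loop hinted at in the comment could look like this (a sketch, assuming all columns of df are numeric):
for j in range(df.shape[1]):
    df.iloc[trim_ind[:, j], j] = np.nan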
This is what I am trying to do - I was able to do steps 1 to 4 and need help with steps 5 onward.
Basically, for each data point I would like to find the Euclidean distance from all mean vectors, based upon column y:
1. take data
2. separate out non-numerical columns
3. find mean vectors by y column
4. save means
5. subtract each mean vector from each row based upon y value
6. square each column
7. add all columns
8. join back to the numerical dataset and then join the non-numerical columns
import pandas as pd
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean()
For each row of means, subtract that row from each row of df_numeric. Then take the square of each column in the output, and for each row add up all the columns. Then join this data back to df_numeric and df_non_numeric.
-------------- Update 1 --------------
I added code as below. My questions have changed and the updated questions are at the end.
import numpy as np

def calculate_distance(row):
    # squared distance from the first mean vector (means.head(1))
    return np.sum(np.square(row - means.head(1)), 1)

def calculate_distance2(row):
    # squared distance from the second mean vector (means.tail(1))
    return np.sum(np.square(row - means.tail(1)), 1)

df_numeric2 = df_numeric.drop("class", axis=1)
#np.sum(np.square(df_numeric2.head(1)-means.head(1)),1)
df_numeric2['distance0'] = df_numeric.apply(calculate_distance, axis=1)
df_numeric2['distance1'] = df_numeric.apply(calculate_distance2, axis=1)
print(df_numeric2)
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
Could anyone confirm that this is a correct way to achieve the results? I am mainly concerned about the last two statements. Would the second-to-last statement do a correct join? Would the final statement assign the original class? I would like to confirm that Python won't do the concat and the class assignment in a random order, and that the order in which the rows appear is maintained.
final = pd.concat([df_non_numeric, df_numeric2], axis=1)
final["class"] = df["class"]
I think this is what you want
import pandas as pd
import numpy as np
data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data,columns=['Name','Age','weight','class'],dtype=float)
print (df)
df_numeric = df.select_dtypes(include='number')
# Make df_non_numeric a copy and not a view
df_non_numeric=df.select_dtypes(exclude='number').copy()
# Subtract mean (calculated using the transform function which preserves the
# number of rows) for each class to create distance to mean
df_dist_to_mean = df_numeric[['Age', 'weight']] - df_numeric[['Age', 'weight', 'class']].groupby('class').transform('mean')
# Finally calculate the euclidean distance (hypotenuse)
df_non_numeric['euc_dist'] = np.hypot(df_dist_to_mean['Age'], df_dist_to_mean['weight'])
df_non_numeric['class'] = df_numeric['class']
# If you want a separate dataframe named 'final' with the end result
df_final = df_non_numeric.copy()
print(df_final)
It is probably possible to write this even denser, but this way you'll see what's going on.
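For instance, a denser variant of the same idea (just a sketch on the same toy 'Age'/'weight'/'class' columns):
centered = df[['Age', 'weight']] - df.groupby('class')[['Age', 'weight']].transform('mean')
df['euc_dist'] = np.hypot(centered['Age'], centered['weight'])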
I'm sure there is a better way to do this, but I iterated through class by class and followed the exact steps:
- Assigned the 'class' as the index.
- Rotated so that the 'class' was in the columns.
- Subtracted the means that corresponded with each part of df_numeric.
- Squared the values.
- Summed the rows.
- Concatenated the dataframes back together.
import numpy as np
import pandas as pd

data = [['Alex',10,5,0],['Bob',12,4,1],['Clarke',13,6,0],['brke',15,1,0]]
df = pd.DataFrame(data, columns=['Name','Age','weight','class'], dtype=float)
#print (df)
df_numeric = df.select_dtypes(include='number')
df_non_numeric = df.select_dtypes(exclude='number')
means = df_numeric.groupby('class').mean().T

# Changed index so every row is labelled with its class
df_numeric.index = df_numeric['class']
df_numeric.drop('class', axis=1, inplace=True)
# Rotated the numeric data sideways so the class labels are in the columns
df_numeric = df_numeric.T

# Iterated through the classes in means and picked the matching df_numeric columns
store = []   # one entry per class
for j in means:
    sto = df_numeric[j]
    if isinstance(sto, pd.Series):   # a class with a single row comes out as a Series
        sto = sto.to_frame()         # need to convert it to a DataFrame
    # subtract that class's mean vector from each of its columns
    store.append(sto.sub(means[j], axis=0))

# Squaring the deviations and summing over the features (the rows, after the transpose)
summed = []
for frame in store:
    summed.append((frame ** 2).sum(axis=0))

df_new = pd.concat(summed)
df_new
Background
I deal with a CSV datasheet that prints out columns of numbers. I am working on a program that takes the first column, asks the user for a time as a float (i.e. 45 and a half hours = 45.5), and then subtracts that number from the first column. I have been successful in that regard. Now I need to find the row index of the "zero" time point. I use min to find that index and then use it to look up the value in the following column, A1.1. I need to find the reading at time 0 so that I can normalize column A1.1 to it, so that on a graph the reading at the 0 time point is 1 in column A1.1 (and eventually all subsequent columns, but baby steps for me).
time_zero = float(input("Which time would you like to be set to 0?"))
df['A1']= df['A1']-time_zero
This works fine so far to set the zero time.
zero_location_series = df[df['A1'] == df['A1'].min()]
r1 = zero_location_series[' A1.1']
df[' A1.1'] = df[' A1.1']/r1
Here's where I run into trouble. The first line correctly identifies a series that I can pull from for all my other columns. Next, r1 correctly identifies the proper A1.1 value, and this value is a float when I use type(r1).
However, when I divide df[' A1.1'] by r1 it yields only one correct value, the one where r1/r1 = 1. All other values come out as NaN.
My Questions:
How do I divide a column by a float, I guess? Why am I getting NaN?
Is there a faster way to do this, as I need to do it for 16 columns (i.e. 'A2'/r2, 'A3'/r3, etc.)?
Do I need to use inplace=True anywhere to make the operations stick prior to re-saving the data? Or is that only for adding/deleting rows?
Example
DataFrame that looks like this:
http://i.imgur.com/ObUzY7p.png
Zero time sets properly (image not shown).
After dividing the column:
http://i.imgur.com/TpLUiyE.png
This should work:
df['A1.1']=df['A1.1']/df['A1.1'].min()
I think the reason df[' A1.1'] = df[' A1.1']/r1 did not work is that r1 is a Series, not an individual float: dividing one Series by another aligns them on their index, so every row whose index label is not in r1 comes out as NaN. Try r1? (in IPython) instead of type(r1) and you will see that r1 is a Series.
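If you do want to keep the r1 approach, you can pull the scalar out of the one-row Series first, for example (a sketch):
r1 = zero_location_series[' A1.1'].iloc[0]   # a plain float now
df[' A1.1'] = df[' A1.1'] / r1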
To do it for all columns in one pass, you have to iterate over each column, like this:
for c in df:
    df[c] = df[c]/df[c].min()
If you want to divide every value in the column by r1, it's best to use apply. For example:
import pandas as pd
df = pd.DataFrame([1,2,3,4,5])
# apply an anonymous function to the first column ([0]): divide every value
# in the column by 3
df[0] = df[0].apply(lambda x: x / 3.0)
print(df)
So you'd probably want something like this:
df = df["A1.1"].apply(lambda x: x/r1, 0)
This really only answers part 2 of your question. apply is probably your best bet for running a function over multiple rows and columns quickly. As for why you're getting NaNs when dividing by a float: is it possible the values in your columns are something other than floats or integers?
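For part 2, here is a sketch of running the same normalisation over several columns at once with apply (hypothetical column names; it assumes the row labels are unique so the zero-time reading is a single scalar per column):
cols = [' A1.1', ' A2.1', ' A3.1']          # hypothetical reading columns
zero_row = df['A1'].idxmin()                # row label where the shifted time is closest to 0
df[cols] = df[cols].apply(lambda col: col / col.loc[zero_row])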