How to select the first column of a dataset? - python

I am trying to get the first column of the dataset to calculate summary statistics such as mean, median, variance, stdev, etc.
This is how I read my CSV file:
wine_data = pd.read_csv('winequality-white.csv')
I tried to select the first column in two ways:
first_col = wine_data[wine_data.columns[0]]
wine_data.iloc[:,0]
But I get this whole result:
0 7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6
1 6.3;0.3;0.34;1.6;0.049;14;132;0.994;3.3;0.49;9...
2 8.1;0.28;0.4;6.9;0.05;30;97;0.9951;3.26;0.44;1...
...
4896 5.5;0.29;0.3;1.1;0.022;20;110;0.98869;3.34;0.3...
4897 6;0.21;0.38;0.8;0.02;22;98;0.98941;3.26;0.32;1...
Name: fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality", Length: 4898, dtype: object
How can I just select the first column, with values such as 7, 6.3, 8.1, 5.5, 6.0?

You might use the following:
#to see all columns
df.columns
#Selecting one column
df['column_name']
#Selecting multiple columns
df[['column_one', 'column_two','column_four', 'column_seven']]
Or, if you prefer positional selection, you might use df.iloc, as sketched below.
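A small illustrative sketch of df.iloc (the frame and column names here are made up, not taken from the question):
import pandas as pd

# toy frame just to illustrate positional selection
df = pd.DataFrame({'column_one': [1, 2, 3],
                   'column_two': [4, 5, 6],
                   'column_four': [7, 8, 9]})

first_col = df.iloc[:, 0]    # first column, all rows
first_two = df.iloc[:, 0:2]  # first two columns, all rows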

You can try this (note that .ix has been removed from recent versions of pandas, so .iloc is the positional equivalent):
first_col = wine_data.iloc[:, 0]
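Side note: the printed output shows every value packed into a single column separated by semicolons, which suggests the file is semicolon-delimited rather than comma-delimited. A minimal sketch, assuming the file from the question:
import pandas as pd

# winequality-white.csv appears to use ';' as the delimiter
wine_data = pd.read_csv('winequality-white.csv', sep=';')

# first column by position, then some summary statistics
first_col = wine_data.iloc[:, 0]
print(first_col.mean(), first_col.median(), first_col.var(), first_col.std())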

Related

How to select a range of columns when using the replace function on a large dataframe?

I have a large dataframe that consists of around 19,000 rows and 150 columns. Many of these columns contain values of -1 and -2. When I try to replace the -1s and -2s with 0 using the following code, Jupyter times out on me and says there is no memory left. So, I am curious whether you can select a range of columns and apply the replace function to it. This way I can replace in batches, since I can't seem to replace in one pass with my available memory.
Here is the code I tried to use, which timed out on me when first replacing the -2s:
df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:
columns = your_df.columns
division_num = 10
# note: if the number of columns is not a multiple of division_num,
# handle the leftover columns in one extra pass
chunks_num = int(len(columns) / division_num)

index = 0
for i in range(chunks_num):
    cols = columns[index: index + division_num]
    # replace() returns a new frame, so assign the result back;
    # a numeric 0 (rather than the string "0") keeps the columns numeric
    your_df[cols] = your_df[cols].replace(to_replace=-2, value=0)
    index += division_num
If your memory keeps overflowing, then maybe you can try loc/iloc to divide the data by rows instead of columns, roughly as sketched below.
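For completeness, a rough sketch of that row-wise variant (the chunk size and the combined -1/-2 replacement are assumptions, not something tested on the original data; it also assumes a unique index):
chunk_size = 1000  # assumed batch size

for start in range(0, len(your_df), chunk_size):
    idx = your_df.index[start:start + chunk_size]
    # replace on one block of rows and write the result back by label
    your_df.loc[idx] = your_df.loc[idx].replace(to_replace=[-1, -2], value=0)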

Replace NaN values from one column with different length into other column with additional condition

I am working with the Titanic data set. This set has 891 rows. At the moment I am focused on the column 'Age'.
import pandas as pd
import numpy as np
import os
titanic_df = pd.read_csv('titanic_data.csv')
titanic_df['Age']
Column 'Age' has 177 NaN values, so I want to replace these values with values from my sample. I already made a sample for this column, as you can see in the code below.
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(177)
So the next step should be replacing the NaN values in titanic_df['Age'] with values from age_sample. In order to do this I tried these lines of code.
titanic_df['Age'] = age_sample
titanic_df['Age'].isna() = age_sample
But obviously I made some mistakes here. So can anybody help me with how to replace only the NaN values in the original data set (891 rows) with values from my sample (177 rows)?
A two-line solution (note that this rebuilds the column with the non-null values first and the sampled values at the end, so the ages no longer stay aligned with their original rows; the loc-based answers below preserve row positions):
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(177))
If the number of NaN values is not known:
nan_len = len(df['Age'][df['Age'].isna()])
age_sample = df['Age'][df['Age'].notnull()]
df['Age'] = list(age_sample) + list(age_sample.sample(nan_len))
You need to select the subframe you want to update using loc. Use .values on the right-hand side, otherwise pandas aligns age_sample on its original index and the missing rows stay NaN:
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.values
I will divide my answer into two parts: the solution you are looking for, and a solution that makes it more robust.
The solution you are looking for
We have to find the number of missing values first, then generate a sample of exactly that size, and then assign it. This ensures that the sample size matches the number of missing values.
...
# count the missing values
age_na_size = titanic_df['Age'].isna().sum()
# generate a sample of that size
age_sample = titanic_df['Age'][titanic_df['Age'].notnull()].sample(age_na_size)
# feed that to the missing values (.values avoids index alignment, which would
# otherwise leave the missing rows as NaN)
titanic_df.loc[titanic_df['Age'].isna(), 'Age'] = age_sample.values
Solutions to make it more robust
Find the group mean (or median) age and replace the missing values accordingly, for example grouping by gender, cabin and other features that make sense, and using the group median age as the replacement (a minimal sketch follows this list).
Use k-Nearest Neighbours as the age replacer; see scikit-learn's KNNImputer.
Use bins of age instead of actual ages. That way you can first train a classifier to predict the age bin and then use that as your imputer.
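A minimal sketch of the group-based idea, assuming the usual Titanic column names 'Sex' and 'Pclass' (adjust the grouping features to whatever makes sense for your data):
# per-row group median age, then fill only the missing entries
group_median = titanic_df.groupby(['Sex', 'Pclass'])['Age'].transform('median')
titanic_df['Age'] = titanic_df['Age'].fillna(group_median)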

min of all columns of the dataframe in a range

I want to find the min value of every row of a dataframe, restricting to only a few columns.
For example: consider a dataframe of size 10*100. I want the min over the middle 5 columns, so the restricted dataframe becomes of size 10*5.
I know how to find the min using df.min(axis=0), but I don't know how to restrict the number of columns. Thanks for the help.
I am using the pandas library.
You can start by selecting the slice of columns you are interested in and applying DataFrame.min() to only that selection:
df.iloc[:, start:end].min(axis=0)
If you want these to be the middle 5, simply find the integer indices which correspond to the start and end of that range:
n_columns = df.shape[1]
start = int(n_columns / 2 - 2.5)
end = start + 5
Following pciunkiewicz's logic:
First you should select the columns you want. You can use .loc[...] or .iloc[...].
With the first one you use the names of the columns. When it takes two arguments, the first selects rows and the second selects columns.
df.loc[[rows], [columns]] # the row and column selections go inside the brackets
df.loc[:, [columns]] # this considers all rows
You can also use .iloc. In this case you have to use integers to locate the data, so you don't need to know the names of the columns, only their positions, as sketched below.
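A short sketch of the .iloc variant (the integer bounds are just placeholders for the middle 5 of 100 columns):
middle_five = df.iloc[:, 47:52]      # columns selected purely by position
row_mins = middle_five.min(axis=1)   # min of each row across those columns
col_mins = middle_five.min(axis=0)   # min of each of those columns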

Set column value as the mean of a group in pandas

I have a data frame with the columns X, Y, temperature, label.
label is an integer between 1 and 9
I want to add an additional column my_label_mean_temperature which will contain for each row the mean of the temperatures of the rows that has the same label.
I'm pretty sure I need to start with my_df.groupby('label'), but I am not sure how to calculate the mean of temperature and propagate the values to all the rows of my original data frame.
Your problem could be solved with the transform method of pandas.
You could try something like this:
df['my_label_mean_temperature'] = df.groupby('label')['temperature'].transform('mean')
Something like this?
import numpy as np
import pandas as pd

df = pd.DataFrame(data={'x': np.random.rand(19),
                        'y': np.arange(19),
                        'temp': [22,33,22,55,3,7,55,1,33,4,5,6,7,8,9,4,3,6,2],
                        'label': [1,2,3,4,2,3,9,3,2,9,2,3,9,4,1,2,9,7,1]})
df['my_label_mean_temperature'] = df.groupby(['label'], sort=False)['temp'].transform('mean')
The aggregation df.groupby('label', as_index=False)['temperature'].mean() returns one row per label rather than one value per row, so assigning it to a new column does not line up; transform('mean') broadcasts the group mean back onto every row (see the sketch below):
df['my_label_mean_temperature'] = df.groupby('label')['temperature'].transform('mean')
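A tiny self-contained sketch (made-up data) of the difference between the aggregated mean and the broadcast transform:
import pandas as pd

df = pd.DataFrame({'X': [0, 1, 2, 3],
                   'Y': [4, 5, 6, 7],
                   'temperature': [10.0, 20.0, 30.0, 40.0],
                   'label': [1, 1, 2, 2]})

# one row per label -- cannot be assigned straight to a column of df
print(df.groupby('label')['temperature'].mean())

# one value per original row, aligned on the index
df['my_label_mean_temperature'] = df.groupby('label')['temperature'].transform('mean')
print(df)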

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40,000 rows. All the columns contain NaN values (some columns contain more NaNs than others), so I want to drop those columns having at least 80% NaN values. How can I do this?
To solve my problem, I tried the following code
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in the column x by the length of x, and the part < 0.8 chooses those columns containing less than 80% NaN.
The problem is that when I run this code I only get the names of the columns together with the boolean True, but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find which columns contain fewer than 80% NaNs. Second, you want to keep only those columns in your DataFrame.
To get a pandas Series indicating whether a column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8 * df.shape[0]
This gives True for every column to keep: .isnull() gives True (or 1) if an element is NaN and False (or 0) for a valid number, .sum(axis=0) sums down each column, giving the number of NaNs per column, and the comparison then checks whether that number is smaller than 80% of the number of rows.
For the second task, you can use this mask to index your columns:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
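As a possible shortcut (worth double-checking the boundary behaviour against your exact 80% cut-off), dropna can do the same in one call through its thresh parameter, which is the minimum number of non-NaN values a column needs in order to be kept:
# keep columns with strictly more than 20% non-NaN values,
# i.e. drop columns with at least 80% NaN
min_non_nan = int(0.2 * len(df)) + 1
df1 = df.dropna(axis=1, thresh=min_non_nan)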
