How to randomly delete 10% of attribute values from a df in pandas - python

I have an example dataset with 2000 rows and 15 columns. The last column will be needed as the decision class for classification.
I need to randomly delete 10% of the attribute values, so 10% of the values in columns 0-13 should become NA.
I wrote a for loop that randomizes a colNumber (0-13) and a rowNumber (0-2000) and replaces that value with NA, but I think (and I can see) that it's not a fast solution. I tried to find something else in pandas, rather than in core Python, but couldn't find anything.
Maybe someone has a better idea? A more pandas-like solution? Or maybe something completely different?
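For reference, the kind of per-cell loop described above might look roughly like this (a sketch on a dummy 2000 x 15 frame; the 10% cell count is my reading of the question):
import random
import numpy as np
import pandas as pd

# dummy stand-in for the 2000 x 15 dataset
df = pd.DataFrame(np.random.random((2000, 15)))

n_attribute_cells = 2000 * 14                  # cells in attribute columns 0-13
for _ in range(int(n_attribute_cells * 0.1)):  # roughly 10% of them
    row = random.randint(0, 1999)              # random row
    col = random.randint(0, 13)                # random attribute column
    df.iat[row, col] = np.nan
Besides being slow, this can hit the same cell twice, so the actual NaN fraction ends up slightly below 10%; the sample-based answer below draws without replacement, so it hits exactly 10% per column.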

You can make use of pandas' sample method.
Imports and set up data
import string

import numpy as np
import pandas as pd

n = 100
data = {
    'a': np.random.random(size=n),
    'b': np.random.choice(list(string.ascii_lowercase), size=n),
    'c': np.random.random(size=n),
}
df = pd.DataFrame(data)
Solution
for col in df.columns:
    df.loc[df.sample(frac=0.1).index, col] = np.nan
Solution without a for loop:
def delete_10(col):
    # blank out a random 10% of this column
    col.loc[col.sample(frac=0.1).index] = np.nan
    return col

df = df.apply(delete_10, axis=0)
Check
Check to see proportion of NaN values:
df.isnull().sum() / len(df)
Output:
a 0.1
b 0.1
c 0.1
dtype: float64

Maybe this works: create a random array of the same shape as the DataFrame, mark the entries where it is less than 0.1, and leave the decision-class column untouched:
mask = np.random.random(df.shape) < 0.1
mask[:, 14:] = False   # don't touch the last column (the decision class)
df[mask] = np.nan

Normalizing a huge python dataframe

I have a huge csv file (~2GB) that I have imported using Dask. Now I want to normalize this dataframe. The dataframe contains about 70k columns. I have written this python function to calculate this:
from tqdm import tqdm

def normalize(df):
    result = df.copy()
    for col in tqdm(df.columns):
        if col != 'name':  # basically, do not normalize the column named "name"
            max_value = df[col].max()
            min_value = df[col].min()
            result[col] = (df[col] - min_value) / (max_value - min_value)
    return result
It works okay but takes a lot of time. When I ran it, it showed that it will take approximately 88 hours to complete. I tried switching to sklearn's MinMaxScaler(), but it doesn't show any progress of the normalization and I am afraid that it will also take quite a lot of time. Is there any other way to normalize all the columns (and ignore a few, like I did with that if condition)?
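For reference, the sklearn route mentioned above would look roughly like this (a sketch with made-up data; it gives no progress bar either):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"name": ["a", "b", "c", "d"], "x": [2, 3, 4, 6], "y": [9, 5, 2, 34]})
num_cols = [col for col in df.columns if col != "name"]

scaler = MinMaxScaler()                            # scales each column to [0, 1]
df[num_cols] = scaler.fit_transform(df[num_cols])
print(df)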
You don't need to loop through this. If all columns other than name hold numerical values, you can just do something like the following:
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
Here is a minimal code sample:
import pandas as pd
df = pd.DataFrame({"name": ["a"]*4, "a": [2,3,4,6], "b": [9,5,2,34]})
num_cols = [col for col in df.columns if col != "name"]
df.loc[:, num_cols] = (df[num_cols] - df[num_cols].min()) / (df[num_cols].max() - df[num_cols].min())
print(df)
"I am afraid that it will also take quite a lot of time"

Considering that you just need numerical operations, I suggest using numpy for the actual number crunching and pandas only for extracting the columns to process. Simple example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'], 'x1': [1, 2, 3], 'x2': [4, 8, 6], 'x3': [10, 15, 30]})
num_arr = df[['x1', 'x2', 'x3']].to_numpy()
mins = np.min(num_arr, axis=0)
maxs = np.max(num_arr, axis=0)
result_arr = (num_arr - mins) / (maxs - mins)
result_df = pd.concat([df[['name']], pd.DataFrame(result_arr, columns=['x1', 'x2', 'x3'])], axis=1)
print(result_df)
output
name x1 x2 x3
0 A 0.0 0.0 0.00
1 B 0.5 1.0 0.25
2 C 1.0 0.5 1.00
Disclaimer: this solution assumes that df has indices like 0, 1, 2, ...
If you need a further speed increase, consider parallelization, which can be used here because the values in each column are computed independently of the other columns.
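A rough sketch of that idea, splitting the columns into blocks and normalizing the blocks in a thread pool (the names and block count are illustrative; whether it actually beats the plain vectorized version depends on the data, so measure first):
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

df = pd.DataFrame(np.random.random((1000, 8)), columns=[f"x{i}" for i in range(8)])

def normalize_block(block):
    # min-max normalize one 2-D block of columns
    mins = block.min(axis=0)
    maxs = block.max(axis=0)
    return (block - mins) / (maxs - mins)

# split the columns into 4 blocks and process them concurrently
blocks = np.array_split(df.to_numpy(), 4, axis=1)
with ThreadPoolExecutor(max_workers=4) as pool:
    normalized = list(pool.map(normalize_block, blocks))

result = pd.DataFrame(np.hstack(normalized), columns=df.columns)
print(result.head())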

Calculate the mean in pandas while a column has a string

I am currently learning pandas and I am using an IMDb movies database, where one of the columns is the duration of the movies. However, one of the values is "None", so I can't calculate the mean because this string is in the middle. I thought of changing the "None" to 0, but that would skew the results, as can be seen with the code below.
dur_temp = duration.replace("None", 0)
dur_temp = dur_temp.astype(float)
descricao_duration = dur_temp.mean()
Any ideas on what I should do in order not to skew the data? I also graphed it, and it becomes clearer how it skews the results.
You can replace "None" with numpy.nan, instead that using 0.
Something like this should do the trick:
import numpy as np
dur_temp = duration.replace("None", np.nan)
descricao_duration = dur_temp.mean()
If you want it to work for any string in your pandas Series, you could use pd.to_numeric:
pd.to_numeric(dur_temp, errors='coerce').mean()
This way, all the values that cannot be converted to float will be replaced by NaN, regardless of what they are.
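A small self-contained check of that (the values here are made up):
import pandas as pd

duration = pd.Series([100, 90, "None", 120])
print(pd.to_numeric(duration, errors='coerce').mean())   # 103.33..., the "None" is ignored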
Just filter by condition, like this:
df[df['a'] != 'None']  # assuming the values you want to average are in column a
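For example, assuming the durations are stored as strings in a column named a (made-up data):
import pandas as pd

df = pd.DataFrame({'a': ['10', '20', 'None', '30']})
print(df[df['a'] != 'None']['a'].astype(float).mean())   # 20.0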
Make them np.nan values.
I am writing this as an answer because I can't comment:
df = df.replace('None', np.nan)
or
df.replace('None', np.nan, inplace=True)
You can use fillna(value=np.nan) as shown below:
descricao_duration = dur_temp.fillna(value=np.nan).mean()
Demo:
import pandas as pd
import numpy as np
dur_temp = pd.DataFrame({'duration': [10, 20, None, 15, None]})
descricao_duration = dur_temp.fillna(value=np.nan).mean()
print(descricao_duration)
Output:
duration 15.0
dtype: float64

Python Pandas dataframe, getting the count where 2 columns are True

So I have a really big dataframe with the following information:
There are 2 columns, "eethuis" and "cateraar", which are True or False depending on whether the place has one or not. Now I need to find the number of places that have both an eethuis and a cateraar. This means I need to count the rows where both eethuis and cateraar are True, but I can't really find a way, even after searching for some time.
This is what I have: I merged the 2 columns that I need together, but now I still need to select and count the rows where both are True.
In the picture (not included here) you will not see one where both are True, but there are some. It's a really long table with over 800 columns.
Would be nice if someone could help me!
If I understand your question correctly, you can use '&'. Here is an example on random data:
import pandas as pd
import random

# create random data
df = pd.DataFrame()
df['col1'] = [random.randint(0, 1) for x in range(10000)]
df['col2'] = [random.randint(0, 1) for x in range(10000)]
df = df.astype(bool)

# filter it:
df1 = df[(df['col1'] == True) & (df['col2'] == True)]

# check size:
df1.shape
Thanks to Ezer K I found the solution! Here is the code:
totaal = df_resto[(df_resto['eethuis'] == True) & (df_resto['cateraar'] == True)]
This is the output (screenshot not included here). So you see, it works!
And the count is 41!
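If only the count is needed, the filtered frame can be skipped entirely; assuming both columns are already boolean, summing the combined mask gives the same number:
count = (df_resto['eethuis'] & df_resto['cateraar']).sum()
print(count)   # should also print 41 for the same data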

Is there a faster way to search every column of a dataframe for a String than with .apply and str.contains?

So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows, filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row which doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want it to, but it is way too slow. So I tried finding a way to do it with vectorization or a list comprehension, but I wasn't able to do it or to find any example code on the internet. So my question is whether it is possible to speed this process up or not.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
import pandas as pd

np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings) + junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
COLUMN
0 BBCAA
1 6
2 ADDDA
3 DCABB
4 ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
COLUMN
4 ADABC
31 BDABC
40 BABCB
88 AABCA
101 ABCBB
Testing all columns:
Here is an alternative using a good old custom function. One could think that it should be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and 3x3000 dataframes with matches and no matches):
def has_match(series):
    # return True as soon as any value in the row contains 'Airbag'
    for s in series:
        if 'Airbag' in s:
            return True
    return False

df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
#           col1     col2          col3
# 0   Airbag_101  String1         Tires
# 1  Distance_xy  String2  Wheel_Airbag
As mozway said, "optimal" depends on the data. Timing plots (not reproduced here) compared the approaches vs. the number of rows (fixed at 3 columns) and vs. the number of columns (fixed at 3,000 rows).
Ok, I was able to speed it up with the help of numpy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')
    master_index.append(index)

print(df.iloc[master_index[1][0]])
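Note that np.where(np_array == 'Airbag') only catches cells that are exactly 'Airbag'. A substring version of the same NumPy idea could look like this (a sketch using the example dataframe from the question; np.char.find needs a string-typed array):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'],
                   'col2': ['String1', 'String2', 'String3'],
                   'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})

mask = np.zeros(len(df), dtype=bool)
for column in df.columns:
    values = np.asarray(df[column].astype(str), dtype=str)
    mask |= np.char.find(values, 'Airbag') >= 0   # -1 means "not found"

print(df[mask])   # keeps rows 0 and 1, which contain 'Airbag' somewhere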

Mean of every 15 rows of a dataframe in python

I have a dataframe of shape (1500 x 11). I have to take every 15 rows and compute the mean of each of the 11 columns separately, so my final dataframe should have dimensions 100 x 11. How do I do this in Python?
The following should work:
dfnew = df[:0]
for i in range(100):
    df2 = df.iloc[i*15:i*15+15, :]
    x = pd.Series(dict(df2.mean()))
    dfnew = dfnew.append(x, ignore_index=True)
print(dfnew)
I don't know much about pandas, so I've coded my solution in pure numpy, without any Python loops, which makes it very efficient, and then converted the result back to a pandas DataFrame.
Try the following code:
import pandas as pd, numpy as np

df = pd.DataFrame([[i + j for j in range(11)] for i in range(1500)])
a = df.values
a = a.reshape((a.shape[0] // 15, 15, a.shape[1]))   # group the rows into chunks of 15
a = np.mean(a, axis=1)                              # average within each chunk
df = pd.DataFrame(a)
print(df)
You can use pandas.DataFrame.
Use a for loop to compute the means, with a counter that is reset every 15 entries.
columns = [col1, col2, ..., col12]
for column, values in df.items():
    # compute the mean
    # save it every 15 entries
Also, using pd.DataFrame() you can create the new dataframe.
I'd recommend you read the documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/frame.html
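A minimal sketch of that loop idea, taking the rows in chunks of 15 and collecting one mean row per chunk (the frame here is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1500 * 11).reshape(1500, 11))

chunk_means = []
for start in range(0, len(df), 15):
    chunk = df.iloc[start:start + 15]   # rows start .. start+14
    chunk_means.append(chunk.mean())    # one Series of 11 column means

dfnew = pd.DataFrame(chunk_means).reset_index(drop=True)
print(dfnew.shape)   # (100, 11)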
