Replace values by NaN - Python

I've got a dataframe that looks like this:
[index, Data]
[1, [5, 3, 6, 8, 4, 5, 7, ...]]
The data in my "Data" column is stored as an array, and I need at least 75 values in each array. The dataframe is 438 rows long.
I need a filter so that every array containing fewer than 75 values is replaced by NaN.
I thought of something like this:
for i in range(len(df_window)):
    if len(df_window['Data'][i]) < 75:
        ...
I don't know if this is right or how to continue. The dataframe is called df_window.
Can someone help me, please?

You can use lengths = df_window['Data'].apply(len) to get the series of array lengths. Then, using df_window.loc[lengths < 75, 'Data'] = np.nan, you should get what you want.
EDIT: Corrected first line.
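A minimal, self-contained sketch of that approach (the example data here is made up just to mirror the question's shape):
import numpy as np
import pandas as pd

# Illustrative dataframe: each 'Data' cell holds an array of values.
df_window = pd.DataFrame({
    "Data": [np.random.rand(80), np.random.rand(40), np.random.rand(75)]
})

# Length of the array stored in each row.
lengths = df_window["Data"].apply(len)

# Replace every array shorter than 75 elements with NaN.
df_window.loc[lengths < 75, "Data"] = np.nan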

Related

How to select a range of columns when using the replace function on a large dataframe?

I have a large dataframe of around 19,000 rows and 150 columns. Many of these columns contain values of -1 and -2. When I try to replace the -1s and -2s with 0 using the following code, Jupyter runs out of memory. So I am curious whether you can select a range of columns and apply the replace function to each range in turn. That way I could replace in batches, since I can't seem to replace everything in one pass with my available memory.
Here is the code I tried, which ran out of memory when first replacing the -2s:
df.replace(to_replace=-2, value="0")
Thank you for any guidance!
Sean
Let's say you want to divide your columns into chunks of 10; then you could try something like this:
columns = your_df.columns
division_num = 10
chunks_num = int(len(columns) / division_num)
index = 0
for i in range(chunks_num):
    cols = columns[index: index + division_num]
    # replace() returns a new object, so assign the result back
    your_df[cols] = your_df[cols].replace(to_replace=-2, value=0)
    index += division_num
If your memory still overflows, you can instead use loc or iloc to divide the data by rows rather than columns, as sketched below.
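A rough sketch of that row-wise variant; the name df and the chunk size are placeholders, not part of the original answer:
# df is assumed to be the large dataframe from the question.
chunk_size = 5000  # illustrative; tune to the available memory

for start in range(0, len(df), chunk_size):
    stop = start + chunk_size
    # .iloc selects a block of rows; assign the replaced block back in place.
    df.iloc[start:stop] = df.iloc[start:stop].replace(to_replace=[-1, -2], value=0)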

Remove "?" from pandas column

I have a pandas dataset whose columns have dtype object. The columns contain numerical float values along with '?', and I'm trying to convert them to float. I want to remove the '?' from the entire column, turn those values into NaN (not 0), and then convert the column to float64.
The output of value_counts() for the Voltage column looks like this:
? 3771
240.67 363
240.48 356
240.74 356
240.62 356
...
227.61 1
227.01 1
226.36 1
227.28 1
227.02 1
Name: Voltage, Length: 2276, dtype: int64
What is the best way to do this when the entire dataset has '?' mixed in with the numbers and I want to convert everything at once?
I tried something like this, but it's not working. I want to do this operation for all the columns. Thanks.
df['Voltage'] = df['Voltage'].apply(lambda x: float(x.split()[0].replace('?', '')))
One more question: how can I count the '?' values in all the columns? I tried something like this. Thanks.
counts = []
for col in df.columns:
    # skip columns that do not contain any '?'
    if '?' not in df[col].values:
        continue
    counts.append(df[col].value_counts()['?'])
So, from your value_counts output it is clear that you have some values that are floats stored as strings, and some values that contain '?' (apparently values that simply are '?').
So the one thing NOT to do is use apply or applymap.
Those are just one step below for loops and iterrows in the hierarchy of what not to do.
The only case where you should use apply is when you would otherwise have to iterate over rows with a for loop, and those cases almost never happen (in real life I've used apply only once, and that was when I was a beginner; I'm pretty sure that if I reviewed that code now, I would find another way).
In your case
df.Voltage = df.Voltage.where(~df.Voltage.str.contains(r'\?')).astype(float)
should do what you want
df.Voltage.str.contains(r'\?') is a True/False series saying whether each row contains a '?'. So ~df.Voltage.str.contains(r'\?') is the opposite (True if the row does not contain a '?'). Therefore df.Voltage.where(~df.Voltage.str.contains(r'\?')) is a series in which values matching that condition are left as is, and the others are replaced by the second argument, or, when there is no second argument (our case), by NaN. So, exactly what you want. Adding .astype(float) converts everything to float, which is now possible (every row contains either a string representing a float, such as 230.18, or a NaN, so everything is convertible to float).
An alternative, closer to what you were trying, which first replaces the '?' in place, would be:
df.loc[df.Voltage=='?', 'Voltage']=None
# And then, df.Voltage.astype(float) converts to float, with NaN where you put None
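To address the follow-up about handling every column at once, here is a minimal sketch; it assumes that all object columns should get the same treatment, which may not hold for your dataset:
import numpy as np

# Count the '?' occurrences per column (0 where none are present).
question_mark_counts = (df == '?').sum()
print(question_mark_counts)

# Replace '?' with NaN everywhere, then convert the remaining object columns to float.
df = df.replace('?', np.nan)
for col in df.columns[df.dtypes == object]:
    df[col] = df[col].astype(float)  # assumes the remaining strings are numeric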

Efficient way of converting a numpy array of 2 dimensions into a list with no duplicates

I want to extract the values from two different columns of a pandas dataframe, put them in a list with no duplicate values.
I have tried the following:
import numpy as np

arr = df[['column1', 'column2']].values
thelist = []
for ix, iy in np.ndindex(arr.shape):
    if arr[ix, iy] not in thelist:
        thelist.append(arr[ix, iy])
This works but it is taking too long. The dataframe contains around 30 million rows.
Example:
   column1  column2
1  adr1     adr2
2  adr1     adr2
3  adr3     adr4
4  adr4     adr5
Should generate the list with the values:
[adr1, adr2, adr3, adr4, adr5]
Can you please help me find a more efficient way of doing this, considering that the dataframe contains 30 million rows?
@ALollz gave the right answer; I'll extend from there. To convert the result into a list as expected, just use list(np.unique(df.values)).
You can use just np.unique(df) (maybe this is the shortest version).
Formally, the first parameter of np.unique should be an array_like object,
but as I checked, you can also pass just a DataFrame.
Of course, if you want a plain list rather than an ndarray, write:
np.unique(df).tolist().
Edit following your comment
If you want the list unique but in the order of appearance, write:
pd.DataFrame(df.values.reshape(-1,1))[0].drop_duplicates().tolist()
Operation order:
reshape changes the source array into a single column.
Then a DataFrame is created, with default column name = 0.
Then [0] takes just this (the only) column.
drop_duplicates does exactly what the name says.
And the last step: tolist converts to a plain list.
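A quick, self-contained demonstration of both variants on the sample data from the question (the dataframe construction below is mine):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "column1": ["adr1", "adr1", "adr3", "adr4"],
    "column2": ["adr2", "adr2", "adr4", "adr5"],
})

# Sorted unique values (np.unique sorts its output).
print(np.unique(df).tolist())   # ['adr1', 'adr2', 'adr3', 'adr4', 'adr5']

# Unique values in order of first appearance.
print(pd.DataFrame(df.values.reshape(-1, 1))[0].drop_duplicates().tolist())
# For this sample the two results coincide, because the data happens to be sorted already.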

Values being altered in numpy array

So I have a 2D numpy array (256, 256), containing values between 0 and 10, which is essentially an image. I need to remove the 0 values and set them to NaN so that I can plot the array using a specific library (APLpy). However, whenever I try to change all of the 0 values, some of the other values get altered, in some cases to 100 times their original value (no idea why).
The code I'm using is:
for index, value in np.ndenumerate(tex_data):
    if value == 0:
        tex_data[index] = 'NaN'
where tex_data is the data array from which I need to remove the zeros. Unfortunately I can't just use a mask for the values I don't need, as APLpy won't accept masked arrays as far as I can tell.
Is there any way I can set the 0 values to NaN without changing the other values in the array?
Use fancy-indexing. Like this:
tex_data[tex_data==0] = np.nan
I don't know why your original code was failing. It looks correct to me, although terribly inefficient.
Using floating-point rules,
tex_data / tex_data * tex_data
also does the job here, since 0/0 evaluates to NaN.
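A small sketch combining the boolean-mask fix with a cast to float; the cast is my addition, since NaN can only be stored in a floating-point array (an integer-typed tex_data would not accept it):
import numpy as np

# Illustrative stand-in for the (256, 256) image array from the question.
tex_data = np.random.randint(0, 11, size=(256, 256))

# NaN only exists for floating-point dtypes, so convert first if needed.
tex_data = tex_data.astype(float)

# Boolean-mask assignment: set every exact zero to NaN in one vectorized step.
tex_data[tex_data == 0] = np.nan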

In Python, how do I select the columns of a dataframe satisfying a condition on the number of NaN?

I hope someone can help me. I'm new to Python, and I have a dataframe with 111 columns and over 40,000 rows. All the columns contain NaN values (some contain more NaNs than others), so I want to drop the columns having at least 80% NaN values. How can I do this?
To solve my problem, I tried the following code:
df1=df.apply(lambda x : x.isnull().sum()/len(x) < 0.8, axis=0)
The expression x.isnull().sum()/len(x) divides the number of NaNs in column x by the length of x, and the < 0.8 part selects the columns containing less than 80% NaN.
The problem is that when I run this code I only get the column names together with the boolean True, but I want the entire columns, not just the names. What should I do?
You could do this:
filt = df.isnull().sum()/len(df) < 0.8
df1 = df.loc[:, filt]
You want to achieve two things. First, you have to find which columns contain less than 80% NaNs. Second, you want to keep only those columns in your DataFrame and discard the rest.
To get a pandas Series indicating whether each column should be kept, you can do:
df1 = df.isnull().sum(axis=0) < 0.8*df.shape[0]
(Btw. you have a typo in your question. You should drop the ==True as it always tests whether 0.5==True)
This will give True for every column to keep, as .isnull() gives True (or 1) if an element is NaN and False (or 0) if it is a valid number. Then .sum(axis=0) sums down each column, giving the number of NaNs in that column. The comparison then checks whether that number is smaller than 80% of the number of rows.
For the second task, you can use df1 to index your columns:
df = df[df.columns[df1]]
or as suggested in the comments by doing:
df.drop(df.columns[df1==False], axis=1, inplace=True)
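As a side note that is not part of either answer: pandas' built-in dropna can express the same rule directly. A minimal sketch, assuming the same 80% threshold and the asker's dataframe df:
# Keep only the columns that have at least 20% non-NaN values,
# i.e. drop columns in which 80% or more of the entries are NaN.
min_non_nan = int(0.2 * len(df))
df1 = df.dropna(axis=1, thresh=min_non_nan)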
