I have a data set something like this:
import pandas as pd
# initialize data of lists.
data = {'name': ['x', 'y', 'z'],
        'value': ['fb', 'nan', 'ti']}
# Create DataFrame
df = pd.DataFrame(data)
I now want to check the value column and count the number of rows where value is neither 'fb' nor 'nan' (null values).
How can I do this?
df[~df.value.isin(['fb','nan'])].shape[0]
In this case, we select only the rows where the value is not in that list; from there we get the row count from the shape of the filtered dataframe.
Output
1
This would be the resulting dataframe:
  name value
2    z    ti
If in the future you also want to ignore rows where the value column is NA (real NA values, such as None or numpy.NaN), then you can use this:
df[(~df.value.isin(['fb','nan'])) & (~df.value.isnull())].shape[0]
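As a quick sanity check, here is a minimal sketch of that combined filter on a variant of the sample data; the extra row holding None is invented for illustration:

import pandas as pd

df2 = pd.DataFrame({'name': ['x', 'y', 'z', 'w'],
                    'value': ['fb', 'nan', 'ti', None]})
# keep rows whose value is neither 'fb', the string 'nan', nor a real NA
count = df2[(~df2.value.isin(['fb', 'nan'])) & (~df2.value.isnull())].shape[0]
print(count)  # 1 -- only the 'ti' row survives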
To count the values that are neither fb nor nan (note that str.contains treats 'fb|nan' as a regular expression, matching either substring):
(~df.value.str.contains('fb|nan')).sum()
Omit the tilde if you want to count the fb and nan values instead.
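For the sample dataframe above, this gives the same count as the isin approach:
>>> (~df.value.str.contains('fb|nan')).sum()
1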
Just make a condition checking for "fb" or NaN, and use sum to get the count of Trues:
>>> (df['value'].eq('fb') | df['value'].isna()).sum()
1
Note that in the sample data 'nan' is a literal string, not a real NA, so isna() will not match it; add df['value'].eq('nan') to the condition if you need to catch the string as well.
Related
I was given a dataset by my professor and one of my questions is, "Find the number of missing values, 99999, in each column and list them." How would I do this in python? I have multiple columns all with numerical data.
The missing values in the dataset are denoted by 99999 instead of the usual NA.
I don't have much experience in Python and have tried many things to no avail.
Use a lambda function to find all occurrences of 99999, then use sum() to get the total number of occurrences per column:
# import pandas package
import pandas as pd
# load dataset with pandas, for example if you have a csv:
df = pd.read_csv("YOUR_FILEPATH_HERE")
# print out the number of occurrences of 99999 in each column
print(df.apply(lambda x: (x == 99999).sum()))
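As a quick check, here is a minimal sketch with a small in-memory dataframe; the column names and values are made up for illustration:

import pandas as pd

df = pd.DataFrame({'a': [1, 99999, 3],
                   'b': [99999, 99999, 6]})
print(df.apply(lambda x: (x == 99999).sum()))
# a    1
# b    2
# dtype: int64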
A non-pandas answer:
NA = 99999
data = [
    [1, NA, 3],
    [NA, NA, 6],
]
NAs = [0] * len(data[0])  # create a list of counters; one for each column
for row in data:
    for x, value in enumerate(row):
        if value == NA:
            NAs[x] += 1
print(NAs)
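For the sample data above, this prints [1, 2, 0]: one 99999 in the first column, two in the second, and none in the third.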
# Replace the missing value code '99999' with the default missing value code NaN
df = df.replace(99999, np.nan)
# Identify the missing values in each column of the DataFrame (where NaN is the default missing value code)
missing_values = df.isnull()
Remember to import numpy as np.
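Putting the two steps together, a minimal sketch that counts the 99999 markers per column (the small dataframe is stand-in data so the sketch runs on its own):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 99999], 'b': [99999, 99999]})  # stand-in data
missing_per_column = df.replace(99999, np.nan).isnull().sum()
print(missing_per_column)
# a    1
# b    2
# dtype: int64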
I am trying to replace NaN with a list of numbers generated by a random seed. This means each NaN value needs to be replaced by a unique integer. The items within each column are unique, but the rows just seem to be replicating themselves. Any suggestions would be welcome.
np.random.seed(56)
rs = np.random.randint(1, 100, size=total)
df = pd.DataFrame(index=np.arange(rows), columns=np.arange(columns))
for i in rs:
    df = df.fillna(value=i, limit=1)
df.fillna(value=i) on its own would replace every NA value; with limit=1, each call fills at most one NaN per column. So each iteration of the for loop writes the same value i into the next unfilled row of every column, which is why your rows look replicated.
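A minimal sketch of that limit=1 behaviour on a toy frame (the data is made up for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, np.nan], 'b': [np.nan, np.nan]})
print(df.fillna(value=7, limit=1))
#      a    b
# 0  7.0  7.0
# 1  NaN  NaN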
You can use the applymap function to iterate over every cell and fill the NaNs with a randomly generated value, in this way:
df.applymap(lambda l: l if not np.isnan(l) else np.random.randint(1, 100))
You can try stack:
s = df.stack(dropna=False)  # flatten to a Series, keeping the NaNs
s = s[s.isna()]             # keep only the missing entries
s[:] = rs                   # assign one random value per NaN
df.update(s.unstack())      # write the filled values back
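Note that this assumes rs holds exactly one value for each NaN in df.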
Or we can create the dataframe directly from the random values (this assumes rs has row * col elements):
df = pd.DataFrame(data=rs.reshape(row, col),
                  index=np.arange(row),
                  columns=np.arange(col))
I want to shift the values of row 10, Fintech, into the next column and fill the city column in the same row with Bahamas. Is there any way to do that?
I found the dataframe.shift() function of pandas but it is limited to columns and it shifts all the values.
Use DataFrame.shift with filtered rows and axis=1:
# test for missing values such as None or NaN
m = df['Select Investors'].isna()
# or, if the missing values are the string 'None':
# m = df['Select Investors'].eq('None')
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
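A minimal sketch of this on a made-up frame (the column names follow the answer; the data is invented for illustration):

import pandas as pd

df = pd.DataFrame({'Company': ['Acme', 'Fintech'],
                   'Country': ['US', 'Nassau'],   # Fintech's city landed here
                   'City': ['NYC', 'a, b'],       # and its investors landed here
                   'Select Investors': ['x, y', None]})
m = df['Select Investors'].isna()
df.loc[m, 'Country':] = df.loc[m, 'Country':].shift(axis=1)
df.loc[m, 'Country'] = 'Bahamas'                  # fill the freed-up column
print(df)
#    Company  Country    City Select Investors
# 0     Acme       US     NYC             x, y
# 1  Fintech  Bahamas  Nassau             a, b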
The code below will generate only one value of a normal distribution, and fill in all the missing values with this same value:
helper_df = df.dropna()
df = df.fillna(numpy.random.normal(loc=helper_df.mean(), scale=numpy.std(helper_df)))
What can we do to generate a value for each missing value?
You can create a series with normal values. You should extract the index of the NaN values in the column you are working on. With df as your dataframe and 'col' as the column containing the NaN values:
index = df[df['col'].isna()].index
values = np.random.normal(loc=df['col'].mean(), scale=df['col'].std(), size=df['col'].isna().sum())
df['col'] = df['col'].fillna(pd.Series(values, index=index))
You can create a series of random values with the same length as your dataframe, then apply fillna:
df.fillna(pd.Series([np.random.normal() for x in range(len(df))]))
If a value in a row is not missing, fillna just ignores it. Note that when df is a DataFrame rather than a Series, fillna aligns the passed Series on the columns, so in that case apply this per column, e.g. df['col'].fillna(...).
I have a data frame called active and it has 10 unique POS column values.
I group by the POS values, mean-normalize the OPW columns, and then store the normalized values as a separate column ['resid'].
If I group by POS values, shouldn't the new active data frame's POS column contain only unique POS values?
For example:
df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})
print(df2)
df2.groupby(['X']).sum()
I get an output like this:
   Y
X
A  7
B  3
In my example, shouldn't I get a column with only unique POS values, as shown below?
POS Other Columns
Rf values
2B values
LF values
2B values
OF values
I can't be 100% sure without the actual data, but I'm pretty sure that the problem here is that you are not aggregating the data.
Let's go through the groupby step by step.
When you do active.groupby('POS'), what's actually happening is that you are slicing the dataframe per each unique POS, and passing each of these slices, sequentially, to the applied function.
You can get a better view of what's happening by using get_group (e.g. active.groupby('POS').get_group('RF')).
So you're applying your meanNormalizeOPW function to each of those slices. That function creates a mean-normalized value of the 'resid' column for each line of the passed dataframe, and you return that dataframe, ending up with a shape similar to what was passed in.
So if you just add an aggregation function to the returned df, it should work fine. I guess here you want a mean, so just change return df into return df.mean()
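A minimal sketch of the difference, reusing the df2 example above (the lambdas stand in for your meanNormalizeOPW function):

import pandas as pd

df2 = pd.DataFrame({'X': ['B', 'B', 'A', 'A'], 'Y': [1, 2, 3, 4]})

# returning the slice unchanged keeps one row per original row
print(df2.groupby('X').apply(lambda g: g))

# returning an aggregate collapses each slice to a single row
print(df2.groupby('X')['Y'].apply(lambda g: g.mean()))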