Python Pandas, deleting NaN

Python Pandas, deleting NaN - python

So basically I am stuck on a very simple thing. For some reason when I execute this code:
import pandas as pd
x = pd.read_csv('titanic.csv')
v = x.dropna(axis=0,how="any")
z = v[["Survived"]]
y = z.where(z == 1)
print (y)
It still prints values with NaN, even though I have already done dropna on the whole file and it works. I just want to print rows with value 1. I have tried many variations and I cant seem to fix it. Any ideas?
Output
Part of the file I am interested in

try:
y = z.where(z == 1).dropna(subset=['Survived'])

SAMPLE DATA:
PassengerId Survived pClass
1 1 3
2 1 4
3 0 2
4 1 9
5 0 6
6 0 0
import pandas as pd
import numpy as np
columns = ['PassengerId','Survived', 'pClass']
PassengerIdList = [1,2,3,4,5,6]
SurvivedList = [1,1,0,1,0,0]
pClassList = [3,4,2,9,6,0]
newList = list(zip(PassengerIdList,SurvivedList,pClassList))
data = np.array(newList)
# print(data)
df = pd.DataFrame(data, columns=columns)
filtered_df = df.loc[df['Survived'] == 1]
print(filtered_df)
OUTPUT:
PassengerId Survived pClass
1 1 3
2 1 4
4 1 9
pyFiddle

Am guessing you have empty rows in your datasets, try using the:
x.fillna(-99999, inplace=True)
that should solve the problem or better still, post what your output looks like and we can know what to do.

You can do this also
y = z.loc[z['Survived'] == 1]

You could use loc, and just locate every row which fulfills your criteria.
survivors = df.loc[df['Survived'] == 1]

Related

How not to use loop in a df when access previous lines

I use pandas to process transport data. I study attendance of bus lines. I have 2 columns to count people getting on and off the bus at each stop of the bus. I want to create one which count the people currently on board. At the moment, i use a loop through the df and for the line n, it does : current[n]=on[n]-off[n]+current[n-1] as showns in the following example:
for index,row in df.iterrows():
if index == 0:
df.loc[index,'current']=df.loc[index,'on']
else :
df.loc[index,'current']=df.loc[index,'on']-df.loc[index,'off']+df.loc[index-1,'current']
Is there a way to avoid using a loop ?
Thanks for your time !

You can use Series.cumsum(), which accumulates the the numbers in a given Series.
a = pd.DataFrame([[3,4],[6,4],[1,2],[4,5]], columns=["off", "on"])
a["current"] = a["on"].cumsum() - a["off"].cumsum()
off on current
0 3 4 1
1 6 4 -1
2 1 2 0
3 4 5 1

If I've understood the problem properly, you could calculate the difference between people getting on and off, then have a running total using Series.cumsum():
import pandas as pd
# Create dataframe for demo
d = {'Stop':['A','B','C','D'],'On':[3,2,3,2],'Off':[2,1,0,1]}
df = pd.DataFrame(data=d)
# Get difference between 'On' and 'Off' columns.
df['current'] = df['On']-df['Off']
# Get cumulative sum of column
df['Total'] = df['current'].cumsum()
# Same thing in one line
df['Total'] = (df['On']-df['Off']).cumsum()
Stop On Off Total
A 3 2 1
B 2 1 2
C 3 0 5
D 2 1 6

how to find increasing-decreasing trends in Python

I am trying to compare a dataframe's different columns with each other row by row like
for (i= startday to endday)
if(df[i]<df[i+1])
counter=counter+1
else
i=endday+1
the goal is find increasing (or decreasing) trends(need to be consecutive)
And my data looks like this
df= 1 2 3 0 1 1 1
1 1 1 1 0 1 2
1 2 1 0 1 1 2
0 0 0 0 1 0 1
(In this example startday to endday is 7 but actually these two are unstable)
As a result i expect to find this {2,0,1,0} and i need it to work fast because my data is quite big(1.2 million). Because of the time limit I tried not to use loops (for, if etc.)
I tried the code below but couldn't find how to stop counter if condition is false
import math
import numpy as np
import pandas as pd
df1=df.copy()
df2=df.copy()
bool1 = (np.less_equal.outer(startday.startday, range(1,13))
& np.greater_equal.outer(endday.endday, range(1,13))
)
bool1= np.c_[np.zeros(len(startday)),bool1].astype('bool')
bool2 = (np.less_equal.outer(startday2.startday2, range(1,13))
& np.greater_equal.outer(endday2.endday2, range(1,13))
)
bool2= np.c_[bool2, np.zeros(len(startday))].astype('bool')
df1.insert(0, 'c_False',math.pi)
df2.insert(12, 'c_False',math.pi)
#df2.head()
arr_bool = (bool1&bool2&(df1.values<df2.values))
df_new = pd.DataFrame(np.sum(arr_bool , axis=1),
index=data_idx, columns=['coll'])
df_new.coll= np.select( condlist = [startday.startday > endday.endday],
choicelist = [-999],
default = df_new.coll)

Add zeros at the end, then use np.diff, then get the first "non positive" using argmin:
(np.diff(np.hstack((df.values, np.zeros((df.values.shape[0], 1)))), axis=1) > 0).argmin(axis=1)
>> array([2, 0, 1, 0], dtype=int64)

How to add a new column to a table formed from conditional statements?

I have a very simple query.
I have a csv that looks like this:
ID X Y
1 10 3
2 20 23
3 21 34
And I want to add a new column called Z which is equal to 1 if X is equal to or bigger than Y, or 0 otherwise.
My code so far is:
import pandas as pd
data = pd.read_csv("XYZ.csv")
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0

You can do this without using a loop by using ge which means greater than or equal to and cast the boolean array to int using astype:
In [119]:
df['Z'] = (df['X'].ge(df['Y'])).astype(int)
df
Out[119]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
Regarding your attempt:
for x in data["X"]:
if x >= data["Y"]:
Data["Z"] = 1
else:
Data["Z"] = 0
it wouldn't work, firstly you're using Data not data, even with that fixed you'd be comparing a scalar against an array so this would raise a warning as it's ambiguous to compare a scalar with an array, thirdly you're assigning the entire column so overwriting the column.
You need to access the index label which your loop didn't you can use iteritems to do this:
In [125]:
for idx, x in df["X"].iteritems():
if x >= df['Y'].loc[idx]:
df.loc[idx, 'Z'] = 1
else:
df.loc[idx, 'Z'] = 0
df
Out[125]:
ID X Y Z
0 1 10 3 1
1 2 20 23 0
2 3 21 34 0
But really this is unnecessary as there is a vectorised method here

Firstly, your code is just fine. You simply capitalized your dataframe name as 'Data' instead of making it 'data'.
However, for efficient code, EdChum has a great answer above. Or another method similar to the for loop in efficiency but easier code to remember:
import numpy as np
data['Z'] = np.where(data.X >= data.Y, 1, 0)

How to plot data after groupby

I have a data frame similar to this
import pandas as pd
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = [['age','data']]
print(df) #printing dataframe.
I performed the groupby function on it to get the required output.
df['COUNTER'] =1 #initially, set that counter to 1.
group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
print(group_data)
now i want to plot the out using matplot lib. Please help me with it.. I am not able to figure how to start and what to do.
I want to plot using the counter value and something similar to bar graph

Try:
group_data = group_data.reset_index()
in order to get rid of the multiple index that the groupby() has created for you.
Your print(group_data) will give you this:
In [24]: group_data = df.groupby(['age','data'])['COUNTER'].sum() #sum function
In [25]: print(group_data)
age data
1 ONE 3
THREE 1
TWO 1
2 ONE 2
TWO 1
3 ONE 1
TWO 1
Name: COUNTER, dtype: int64
Whereas, reseting will 'simplify' the new index:
In [26]: group_data = group_data.reset_index()
In [27]: group_data
Out[27]:
age data COUNTER
0 1 ONE 3
1 1 THREE 1
2 1 TWO 1
3 2 ONE 2
4 2 TWO 1
5 3 ONE 1
6 3 TWO 1
Then depending on what it is exactly that you want to plot, you might want to take a look at the Matplotlib docs
EDIT
I now read more carefully that you want to create a 'bar' chart.
If that is the case then I would take a step back and not use reset_index() on the groupby result. Instead, try this:
In [46]: fig = group_data.plot.bar()
In [47]: fig.figure.show()
I hope this helps

Try with this:
# This is a great tool to add plots to jupyter notebook
% matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
# Params get plot bigger
plt.rcParams["axes.labelsize"] = 16
plt.rcParams["xtick.labelsize"] = 14
plt.rcParams["ytick.labelsize"] = 14
plt.rcParams["legend.fontsize"] = 12
plt.rcParams["figure.figsize"] = [15, 7]
df = pd.DataFrame([['1','3','1','2','3','1','2','2','1','1'], ['ONE','TWO','ONE','ONE','ONE','TWO','ONE','TWO','ONE','THREE']]).T
df.columns = [['age','data']]
df['COUNTER'] = 1
group_data = df.groupby(['age','data']).sum()[['COUNTER']].plot.bar(rot = 90) # If you want to rotate labels from x axis
_ = group_data.set(xlabel = 'xlabel', ylabel = 'ylabel'), group_data.legend(['Legend']) # you can add labels and legend

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange
N=3
x = DataFrame.from_dict({'farm' : ['A','B','A','B'],
'fruit':['apple','apple','pear','pear']})
out = DataFrame()
for i,row in x.iterrows():
rows = pd.concat([row]*N).reset_index(drop=True) # requires row to be a DataFrame
out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work. How do I convert it to a DataFrame? (Eg. the difference between x.ix[0:0] and x.ix[0])
Thanks!

Given what you commented, I would try
def giveMeSomeRows(group):
return random.uniform(low=group.low, high=group.high, size=N)
results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate result dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways, if we'd know more about your data.
Update
Running code example
def giveMeSomeRows(group):
return random.uniform(low=group.low, high=group.high, size=N)
N = 3
df = pd.DataFrame(arange(0,8).reshape(4,2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0,4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
return pandas.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Pandas, deleting NaN - python

try: y = z.where(z == 1).dropna(subset=['Survived'])

Am guessing you have empty rows in your datasets, try using the: x.fillna(-99999, inplace=True) that should solve the problem or better still, post what your output looks like and we can know what to do.

You can do this also y = z.loc[z['Survived'] == 1]

You could use loc, and just locate every row which fulfills your criteria. survivors = df.loc[df['Survived'] == 1]

Related

How not to use loop in a df when access previous lines

how to find increasing-decreasing trends in Python

How to add a new column to a table formed from conditional statements?

How to plot data after groupby

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

Categories

Resources