what is the difference between notna() and dropna()? - python

I'm working on the Titanic data right now, using pandas. The funny thing is, when dealing with missing values, dropna() does not work but notna() does.
temp.Embarked.dropna(inplace = True)
temp.isnull().sum()
Embarked 2
temp = temp[temp['Embarked'].notna()]
temp.isnull().sum()
Embarked 0

I think both do the same job, but note that the first call works on the column you pulled out, not on the DataFrame: temp.Embarked.dropna(inplace=True) only shortens that extracted Series, so temp itself still reports the 2 missing values. Also, when you use dropna() you have to say how the NaNs should be dropped, i.e. along which axis (row-wise or column-wise) and for which columns.
Eg: temp.dropna(subset=['Embarked'], axis=0, inplace=True)
This will drop the entire row for every NaN in the Embarked column.
For further clarification, please refer to this question: What does axis in pandas mean?
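To see the two approaches side by side, here is a minimal sketch (a made-up frame standing in for the Titanic data, not the real values):

import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (hypothetical values).
temp = pd.DataFrame({"Embarked": ["S", np.nan, "C", np.nan],
                     "Fare": [7.25, 8.05, 71.28, 8.46]})

# Boolean mask with notna(): keep only rows where Embarked is present.
masked = temp[temp["Embarked"].notna()]

# dropna() on the whole DataFrame, restricted to the Embarked column:
# drops the same rows (row-wise, axis=0, is the default).
dropped = temp.dropna(subset=["Embarked"])

print(masked.isnull().sum())   # Embarked: 0
print(dropped.isnull().sum())  # Embarked: 0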

Related

Complete NaN values in a dataframe based on the values completed in another dataframe

So, what I am trying to do is fill the NaN values of a DataFrame with the correct values that are found in a second DataFrame. It would be something like this:
df={"Name":["Lennon","Mercury","Jagger"],"Band":["The Beatles", "Queen", NaN]}
df2={"Name":["Jagger"],"Band":["The Rolling Stones"]}
So, I have this command to know which rows have at least one NaN:
inds = list(pd.isnull(dfinal).any(1).nonzero()[0].astype(int))
I thought this would be useful inside some kind of for loop (I didn't succeed there).
And then I tried this:
result=df.join(dfinal, on=["Name"])
But it gives me the following error
ValueError: You are trying to merge on object and int64 columns. If
you wish to proceed you should use pd.concat
I checked, and both "Name" Series are string values, so I am unable to solve this.
Keep in mind there are more columns, and typically if a row has one NaN, it will have around 7 NaNs.
Is there a way to solve this?
Thanks in advance!
Map and Fillna()
We can fill the missing values in your target df with the values from the second dataframe, matching on the Name column.
df["Band"] = df["Band"].fillna(df["Name"].map(df2.set_index("Name")["Band"]))
print(df)
Name Band
0 Lennon The Beatles
1 Mercury Queen
2 Jagger The Rolling Stones
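If the real data has several columns with NaNs in the same rows (as mentioned in the question), a variation of the same idea is to align both frames on Name and let fillna fill every shared column at once. A minimal sketch, assuming df and df2 are plain DataFrames like the ones above:

import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Lennon", "Mercury", "Jagger"],
                   "Band": ["The Beatles", "Queen", np.nan]})
df2 = pd.DataFrame({"Name": ["Jagger"],
                    "Band": ["The Rolling Stones"]})

# Index both frames by Name so fillna can align rows and columns,
# then take values from df2 wherever df has a NaN.
filled = df.set_index("Name").fillna(df2.set_index("Name")).reset_index()
print(filled)
#       Name                Band
# 0   Lennon         The Beatles
# 1  Mercury               Queen
# 2   Jagger  The Rolling Stones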

Subtract each column by the preceding column on Dataframe in Python

Using the following code I can build a simple table with the current COVID-19 cases worldwide, per country:
url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
raw_data = pd.read_csv(url, sep=",")
raw_data.drop(['Province/State','Lat','Long'], axis = 1, inplace = True)
plot_data = raw_data.groupby('Country/Region').sum()
plot_data is then a simple DataFrame.
What I would like to do now is subtract from each column the values of the column for the prior day - i.e., I want to get the new cases per day.
If I do something like plot_data['3/30/20'].add(-plot_data['3/29/20']), it works well. But if I do something like plot_data.iloc[:,68:69].add(-plot_data.iloc[:,67:68]), I get two columns of NaN values, i.e. pandas tries to "preserve" the column headers and doesn't perform the operation the way I would like it to.
My goal was to perform this operation in an "elegant way". I was thinking of something along the lines of plot_data.iloc[:,1:69].add(-plot_data.iloc[:,0:68]). But of course, if it doesn't work in the single-column example, it doesn't work with multiple columns either (as pandas will match the column headers and return a bunch of zeros/NaN values).
Maybe there is a way to tell Python to ignore the headers during an operation with a DataFrame? I know that I can transform my DataFrame into a NumPy array and do a bunch of operations. However, since this is a simple/small table, I thought I would try to keep using a DataFrame data type.
The good old shift can be used on the horizontal axis:
plot_data - plot_data.shift(-1, axis=1)
should be what you want.
Thank you very much @Serge Ballesta! Your answer is exactly the type of "elegant solution" I was looking for. The only comment is that the shift sign should be positive:
plot_data - plot_data.shift(1, axis=1)
This way we bring the historical figures forward one day and now I can subtract it from the actual numbers on each day.
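For what it's worth, pandas also has a built-in diff that computes exactly this column-over-column difference, so the subtraction and shift can be written in one call. A small sketch with made-up numbers:

import pandas as pd

# Made-up cumulative counts standing in for plot_data.
plot_data = pd.DataFrame(
    {"3/28/20": [100, 50], "3/29/20": [120, 55], "3/30/20": [150, 70]},
    index=["CountryA", "CountryB"],
)

# Day-over-day new cases: each column minus the preceding column.
new_cases = plot_data.diff(axis=1)   # same result as plot_data - plot_data.shift(1, axis=1)
print(new_cases)
#           3/28/20  3/29/20  3/30/20
# CountryA      NaN     20.0     30.0
# CountryB      NaN      5.0     15.0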

How to index a 2d array properly pandas dataframe?

I am reading an .xlsx Excel file into a pandas dataframe.
Here is what it looks like, in text form (the original post shows it as an image):
1 2 3 4
3.5 15.48403728 23.22605592 30.96807456 38.7100932
4 17.41954194 26.12931291 34.83908388 43.54885485
4.5 19.3550466 29.0325699 38.7100932 48.3876165
5 21.29055126 31.93582689 42.58110252 53.22637815
As you can see there is a space in the top left hand cell that is empty.
The rows are amounts and the columns are material, the values are the prices.
I really don't know how to give names properly for indexing.
If I was to try
df.columns = ['Material 1',...'Material 4']
It errors because it obviously expects 5 column names, as there are five columns.
Really what I want is to label the top left column as amount/material or something like that, but I don't have a clue on how to do it.
I think the best way would be for me to try and transform this dataframe into something like this:
Amount Material Price
3.5 1 15.48...
3.5 2 23.22...
...
5 4 53.22...
as this will hopefully make it easier to deal with.
Any idea how to do this?
I believe this is called "unpivot columns" in Excel, or something like that?
I am not sure how you have read the Excel file, but if all you want is to rename your columns, then you can set the column names while reading the Excel file itself.
Supposing my file name is MyExcelFile.xlsx and the column names I want are 'Amount', 'Material_1', 'Material_2', 'Material_3' and 'Material_4', then I would read it as follows. If these column names do not already exist in the Excel file, then you have to pass header=None explicitly.
MyDF = pd.read_excel('/FullPathToYourExcelFile/MyExcelFile.xlsx', names=['Amount','Material_1','Material_2','Material_3','Material_4'], header=None)
The output then has the column names set as desired.
See the documentation here (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html). If you have already done it as I suggested above, then I am sorry, I have underestimated your problem requirements. All the best.
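To then get the long Amount / Material / Price layout described in the question (the "unpivot" in Excel terms), pandas' melt can be used once the columns are named. A sketch under the same hypothetical path and column names as above:

import pandas as pd

# Hypothetical path and column names, matching the read_excel call above.
MyDF = pd.read_excel('/FullPathToYourExcelFile/MyExcelFile.xlsx',
                     names=['Amount', 'Material_1', 'Material_2',
                            'Material_3', 'Material_4'],
                     header=None)

# Unpivot: one row per (Amount, Material) pair, with its Price.
long_df = MyDF.melt(id_vars='Amount', var_name='Material', value_name='Price')

# Optionally turn 'Material_1' into the plain material number 1.
long_df['Material'] = long_df['Material'].str.replace('Material_', '', regex=False).astype(int)
print(long_df)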

How to combine multiple rows of data into a single string per group

To preface: I'm new to using Python.
I'm working on cleaning up a file where data was spread across multiple rows. I'm struggling to find a solution that will concatenate multiple text strings to a single cell. The .csv data looks similar to this:
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
with one or two blank rows between each entry, too.
The number of rows used for 'description' isn't consistent. Sometimes it's just one cell, sometimes up to about four. The ideal output turns these multiple rows into a single row of useful data, without all the wasted space. I thought maybe I could create a series of masks by copying the data across a few columns, shifted up, and then iterating in some way. I haven't found a solution that matches what I'm trying to do, though. This is where I'm at so far:
#Add columns of description stuff, shifted up a row, for concatenation
DogData['Z'] = DogData['Y'].shift(-1)
DogData['AA'] = DogData['Z'].shift(-1)
DogData['AB'] = DogData['AA'].shift(-1)
#create series checks to determine how to concat values properly
YNAs = DogData['Y'].isnull()
ZNAs = DogData['Z'].isnull()
AANAs = DogData['AA'].isnull()
The idea here was basically that I'd iterate over column 'Y', check if the same row in column 'Z' was NA or had a value, and concat if it did. If not, just use the value in 'Y'. Carry that logic across but stopping if it encountered an NA in any subsequent columns. I can't figure out how to do that, or if there's a more efficient way to do this.
What do I have to do to get to my end result? I can't figure out the right way to iterate or concatenate in the way I was hoping to.
'''
name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
'''
df = pd.read_clipboard(sep=',')
df.fillna(method='ffill').groupby([
    'name',
    'date'
]).description.apply(lambda x: ', '.join(x)).to_frame(name='description')
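For reference, a self-contained version of the same idea (a sketch that reads the sample text through io.StringIO instead of the clipboard, and uses .ffill(), which is equivalent to fillna(method='ffill')):

import io
import pandas as pd

data = """name,date,description
bundy,12-12-2017,good dog
,,smells kind of weird
,,needs to be washed
"""

df = pd.read_csv(io.StringIO(data))

# Carry name and date forward over the blank cells, then join all
# description fragments that belong to the same (name, date) pair.
result = (df.ffill()
            .groupby(['name', 'date'])['description']
            .apply(', '.join)
            .to_frame(name='description'))
print(result)
# description for (bundy, 12-12-2017):
# "good dog, smells kind of weird, needs to be washed"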
I'm not sure I follow exactly what you mean. I took that text, saved it as a csv file, and successfully read it into a pandas dataframe.
import pandas as pd
df = pd.read_csv('test.csv')
df
Output:
name date description
0 bundy 12-12-2017 good dog
1 NaN NaN smells kind of weird
2 NaN NaN needs to be washed
Isn't this the output you require?

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the earliest date for which data is available in the shorter series, and remove the data in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NaNs in one of the columns for more recent dates too).
You can use idxmax on the inverted Series df['osr'][::-1] and then take a subset of df:
print(df)
#               osr      go
# Date
# 1990-08-17     NaN  239.75
# 1990-08-20     NaN  251.50
# 1990-08-21  352.00  265.00
# 1990-08-22  353.25  274.25
# 1990-08-23  351.75  290.25
s = df['osr'][::-1]
print(s)
# Date
# 1990-08-23    351.75
# 1990-08-22    353.25
# 1990-08-21    352.00
# 1990-08-20       NaN
# 1990-08-17       NaN
# Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print(maxnull)
# 1990-08-20 00:00:00
print(df[df.index > maxnull])
#               osr      go
# Date
# 1990-08-21  352.00  265.00
# 1990-08-22  353.25  274.25
# 1990-08-23  351.75  290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good, or that you don't care about dropping rows in the middle; it depends on how sequential the data needs to be. If the data needs to stay sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of structure in your dataframe, so I am going to make assumptions: you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and this is all stored in df.
First we can find the shorter series, i.e. the one with fewer non-NaN values:
shorter_col = df['time_series_1'] if df['time_series_1'].count() < df['time_series_2'].count() else df['time_series_2']
Now we want the first date for which that series has data:
remove_date = shorter_col.first_valid_index()
Now we want to remove the rows before that date:
df = df[df.index >= remove_date]
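For reference, here is the idea applied to the sample frame from the question, as a compact sketch using Series.first_valid_index (it assumes, as above, that the index is already sorted by date):

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"osr": [np.nan, np.nan, 352.00, 353.25, 351.75],
     "go":  [239.75, 251.50, 265.00, 274.25, 290.25]},
    index=pd.to_datetime(["1990-08-17", "1990-08-20", "1990-08-21",
                          "1990-08-22", "1990-08-23"]),
)
df.index.name = "Date"

# Earliest date for which the shorter series (osr) has data ...
start = df["osr"].first_valid_index()    # Timestamp('1990-08-21')

# ... then keep only the rows from that date onwards.
trimmed = df[df.index >= start]
print(trimmed)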
