I have a dataset:
367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN
When I plot it using plt.plot(df, '-o'), each column shows up as mostly isolated points.
What I would like is for the datapoints from each column to be connected in a line.
I understand that matplotlib does not connect datapoints that are separated by NaN values. I looked at all the options here for dealing with missing data, but all of them would essentially misrepresent the data in the dataframe. Each value in the dataframe represents an incident; if I replace the NaNs with scalar values or use the interpolate option, I get a bunch of points that are not actually in my dataset. Here's what interpolate looks like:
df_wanted2 = df.apply(pd.Series.interpolate)
If I try to use dropna I'll lose entire rows/columns from the dataframe, and those rows hold valuable data.
Does anyone know a way to connect up my dots? I suspect I need to extract individual arrays from the dataframe and plot them, as is the advice given here, but this seems like a lot of work (and my actual dataframe is much bigger). Does anyone have a solution?
Use the interpolate method with the parameter 'index':
df.interpolate('index').plot(marker='o')
Alternative answer
Plot each column after dropping its NaNs with items:
for _, c in df.items():
    c.dropna().plot(marker='o')
Extra credit
Only interpolate from the first valid index to the last valid index of each column:
for _, c in df.items():
    fi, li = c.first_valid_index(), c.last_valid_index()
    c.loc[fi:li].interpolate('index').plot(marker='o')
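For completeness, here is a minimal, self-contained sketch of the first approach, rebuilding the sample frame from the question (the column names and dates are transcribed from the data above):

import matplotlib.pyplot as plt
import pandas as pd

idx = pd.to_datetime(['1999-01-11', '1999-03-23', '1999-04-30', '1999-06-02',
                      '1999-08-08', '1999-08-12', '1999-08-17', '1999-10-22',
                      '1999-12-04', '2000-03-04', '2000-09-29', '2000-09-30'])
df = pd.DataFrame({367235: [8, None, None, None, 2, None, None, None, None, 2, 9, 9],
                   419895: [5, 4, None, 9, None, 3, None, 3, None, None, None, None],
                   992194: [1, None, 1, None, None, None, 10, None, 4, None, None, None]},
                  index=idx)

# Interpolate along the datetime index so gaps are bridged, then plot.
df.interpolate('index').plot(marker='o')
plt.show()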
Try iterating through the columns with apply, and inside the applied function drop the missing values:
def make_plot(s):
    s.dropna().plot()

df.apply(make_plot)
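Note that df.apply(make_plot) draws every column onto the current axes and you still have to show (or save) the figure; a minimal usage sketch:

import matplotlib.pyplot as plt

df.apply(make_plot)  # each column is drawn onto the active axes
plt.show()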
An alternative would be to outsource the NaN handling to the graphing library Plotly, using its connectgaps option.
import plotly
import pandas as pd
txt = """367235 419895 992194
1999-01-11 8 5 1
1999-03-23 NaN 4 NaN
1999-04-30 NaN NaN 1
1999-06-02 NaN 9 NaN
1999-08-08 2 NaN NaN
1999-08-12 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-12-04 NaN NaN 4
2000-03-04 2 NaN NaN
2000-09-29 9 NaN NaN
2000-09-30 9 NaN NaN"""
data_points = [line.split(' ') for line in txt.splitlines()[1:]]
df = pd.DataFrame(data_points)
data = list()
for i in range(1, len(df.columns)):
    data.append(plotly.graph_objs.Scatter(
        x=df.iloc[:, 0].tolist(),
        y=df.iloc[:, i].tolist(),
        mode='lines',  # 'lines' is the valid mode string, not 'line'
        connectgaps=True
    ))
fig = dict(data=data)
plotly.plotly.sign_in('user', 'token')
plot_url = plotly.plotly.plot(fig)
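Note that plotly.plotly.sign_in/plot is the old Chart Studio online workflow (split out into the separate chart_studio package as of Plotly 4). If you only need a local figure, a sketch along these lines should work with recent Plotly versions:

import pandas as pd
import plotly.graph_objects as go

fig = go.Figure()
for col in df.columns[1:]:
    fig.add_trace(go.Scatter(
        x=df.iloc[:, 0],
        y=pd.to_numeric(df[col], errors='coerce'),  # the parsed values are strings; 'NaN' becomes NaN
        mode='lines+markers',
        connectgaps=True,  # draw the line straight across the gaps
        name=str(col),
    ))
fig.show()  # renders locally; no sign-in required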
Related
I need to apply a function to each column in a Pandas dataframe, and the function uses the count of NaN in each column. Say that I have this dataframe:
import pandas as pd
df = pd.DataFrame({'Baseball': [3, 1, 2], 'Soccer': [1, 6, 7], 'Rugby': [8, 7, None]})
Baseball Soccer Rugby
0 3 1 8.0
1 1 6 7.0
2 2 7 NaN
I can get the count of NaN in each column with:
df.isnull().sum()
Baseball 0
Soccer 0
Rugby 1
But I can't figure out how to use that result in a function to apply to each column. Say just as an example, I want to add the number of NaN in a column to each element in that column to get:
Baseball Soccer Rugby
0 3 1 9.0
1 1 6 8.0
2 2 7 NaN
(My actual function is more complex.) I tried:
def f(x, y):
    return x + y

df2 = df.apply(lambda x: f(x, df.isnull().sum()))
and I get the thoroughly mangled:
Baseball Soccer Rugby
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
Baseball NaN NaN NaN
Rugby NaN NaN NaN
Soccer NaN NaN NaN
Any idea how to use the count of NaN in each column in a function applied to each column?
Thanks in advance!
Thanks to Datanovice and vb_rises, the answer is:
df.apply(lambda x: x + df.isnull().sum(), axis=1)
In case anyone has a similar question, I wanted the answer to be clear without the need to read through the comments. I had thought axis=1 was the default in Pandas, but apply defaults to axis=0 (the function receives each column); axis=1 hands it each row instead, which is what lets the per-column NaN counts align correctly here.
I prefer @ALollz's answer: df.add(df.isnull().sum()).
The lambda function @Dribbler is defining already exists in the form of .add().
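For reference, a quick check, assuming the example frame from the question, that the .add() form produces the desired result:

import pandas as pd

df = pd.DataFrame({'Baseball': [3, 1, 2], 'Soccer': [1, 6, 7], 'Rugby': [8, 7, None]})
print(df.add(df.isnull().sum()))  # the counts Series aligns on the column labels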
Currently I'm extracting data from PDFs and putting it in a CSV file. I'll explain how this works.
First I create an empty dataframe:
ndataFrame = pandas.DataFrame()
Then I read the data. Assume, for simplicity, that the data is the same for each PDF:
data = {'shoe': ['a', 'b'], 'fury': ['c','d','e','f'], 'chaos': ['g','h']}
dataFrame = pandas.DataFrame({k:pandas.Series(v) for k, v in data.items()})
Then I append this data to the empty dataframe:
ndataFrame = ndataFrame.append(dataFrame)
This is the output:
shoe fury chaos
0 a c g
1 b d h
2 NaN e NaN
3 NaN f NaN
However, now comes the issue. I need some columns (let's say 4) to be empty between the columns fury and chaos. This is my desired output:
shoe fury chaos
0 a c g
1 b d h
2 NaN e NaN
3 NaN f NaN
I tried some stuff with reindexing but I couldn't figure it out. Any help is welcome.
By the way, my desired output might be confusing. To be clear, I need some columns to be completely empty between fury and chaos (this is because some other data goes in there manually).
Thanks for reading
This answer assumes you have no way to change the way the data is being read in upstream. As always, it is better to handle these types of formatting changes at the source. If that is not possible, here is a way to do it after parsing.
You can use reindex here, using numpy.insert to add your four columns:
import numpy as np

dataFrame.reindex(columns=np.insert(dataFrame.columns, 2, [1, 2, 3, 4]))
shoe fury 1 2 3 4 chaos
0 a c NaN NaN NaN NaN g
1 b d NaN NaN NaN NaN h
2 NaN e NaN NaN NaN NaN NaN
3 NaN f NaN NaN NaN NaN NaN
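If integer labels would get in the way downstream (say, when writing the CSV), the same trick works with string names; the names below are just hypothetical placeholders:

import numpy as np

blanks = ['blank1', 'blank2', 'blank3', 'blank4']  # placeholder names, pick your own
dataFrame.reindex(columns=np.insert(dataFrame.columns, 2, blanks))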
I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work; it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some na values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1); print(df)
0 1 2 3 4 test
0 NaN NaN NaN NaN 0.73046 0.73046
1 NaN NaN NaN NaN 0.79060 0.79060
2 NaN NaN NaN NaN 0.53859 0.53859
3 0.97469 0.60224 0.90022 0.45015 0.52246 1.87283
4 0.84111 0.52958 0.71513 0.17180 0.34494 1.23187
5 0.21991 0.10479 0.60755 0.79287 0.11051 1.51094
6 0.64966 0.53332 0.76289 0.38522 0.92313 2.07124
7 0.40139 0.41158 0.30072 0.09303 0.37026 0.76401
8 0.59258 0.06255 0.43663 0.52148 0.62933 1.58744
9 0.12762 0.01651 0.09622 0.30517 0.78018 1.18156
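Applied to the frame in the question, that would be:

df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum(axis=1)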
I have a dataframe (it is the product of using the pivot function, which is why it has the c and a):
c 367235 419895 992194
a
1999-02-06 NaN 9 NaN
2000-04-03 2 NaN NaN
1999-04-12 NaN NaN 4
1999-08-08 2 NaN NaN
1999-11-01 8 5 1
1999-12-08 NaN 3 NaN
1999-08-17 NaN NaN 10
1999-10-22 NaN 3 NaN
1999-03-23 NaN 4 NaN
2000-09-29 9 NaN NaN
1999-04-30 NaN NaN 1
2000-09-30 9 NaN NaN
I would like to add a new row at the bottom of this dataframe. Each cell in the new row will evaluate the column above it; if the column contains the numbers 9, 8 or 3, the cell will evaluate to "TRUE". If the column does not contain those numbers, the cell will evaluate to "FALSE". Ultimately, my goal is to delete the columns with a "FALSE" cell using the drop function, creating a dataset like so:
c 367235 419895
a
1999-02-06 NaN 9
2000-04-03 2 NaN
1999-04-12 NaN NaN
1999-08-08 2 NaN
1999-11-01 8 5
1999-12-08 NaN 3
1999-08-17 NaN NaN
1999-10-22 NaN 3
1999-03-23 NaN 4
2000-09-29 9 NaN
1999-04-30 NaN NaN
2000-09-30 9 NaN
TRUE TRUE
My problem:
I can write a function that checks whether one of several numbers is in a list, but I cannot get this function to work with .apply.
That is, I found that this works for determining if a group of numbers is in a list:
How to check if one of the following items is in a list?
I tried to modify it as follows for the apply function:
def BIS(i):
    L1 = [9, 8, 3]
    if i in L1:
        return "TRUE"
    else:
        return "FALSE"

df_wanted.apply(BIS, axis=0)
this results in an error:
ValueError: ('The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 367235')
This makes me think that although .apply takes an entire column as input, it cannot aggregate the truth values of the individual cells into a single truth value for the column. I looked up a.any and a.bool, and they look very useful, but I don't know where they fit in. For example, this didn't work:
df_wanted.apply.any(BIS, axis = 0)
nor did this
df_wanted.apply(BIS.any, axis = 0).
Can anyone point me in the right direction? Many thanks in advance
You can use the .isin() method:
df.loc[:, df.isin(['9','8','3']).any()]
And if you need to append the condition to the data frame:
cond = df.isin(['9','8','3']).any().rename("cond")
df.append(cond).loc[:, cond]
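One hedged caveat: the quoted values '9', '8', '3' assume the pivoted frame stores its values as strings; if your columns are numeric, compare against numbers instead, e.g.:

keep = df.isin([9, 8, 3]).any()  # boolean Series with one entry per column
df_wanted = df.loc[:, keep]      # keeps 367235 and 419895, drops 992194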
I want to code a script that takes the values from a column, splits them into strings, and makes a new column for each of the resulting strings (filled with NaN for now). Since the df is grouped by Column1, I want to do this for every group.
My input data frame looks like this:
df1:
Column1 Column2
0 L17 a,b,c,d,e
1 L7 a,b,c
2 L6 a,b,f
3 L6 h,d,e
What I finally want to have is:
Column1 Column2 a b c d e f h
0 L17 a,b,c,d,e nan nan nan nan nan nan nan
1 L7 a,b,c nan nan nan nan nan nan nan
2 L6 a,b,f nan nan nan nan nan nan nan
My code currently looks like this:
def NewCols(x):
    for item, frame in group['Column2'].iteritems():
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
            return x

df1.groupby('Column1').apply(NewCols)
My thought behind this was that the code loops through Column2 of every grouped object, splitting the values contained in frame at the commas and creating a list for that group. So far the code works fine. Then I added
for value in Genes:
    string = value
    x[string] = np.nan
    return x
with the intention of adding a new column for every value contained in the list Genes. However, my output looks like this:
Column1 Column2 d
0 L17 a,b,c,d,e nan
1 L7 a,b,c nan
2 L6 a,b,f nan
3 L6 h,d,e nan
and I am pretty much struck dumb. Can someone explain why only one column gets appended (which is not even named after the first value in the first list of the first group) and suggest how I could improve my code?
I think you just return too early in your function, before the two loops have finished. If you dedent the return two levels, like this:
def NewCols(x):
    for item, frame in x['Column2'].items():  # x is the group handed in by apply
        Genes = frame.split(',')
        for value in Genes:
            string = value
            x[string] = np.nan
    return x

df1.groupby('Column1').apply(NewCols)
It should work fine!
Alternatively, collect every split value first, then append the new columns in one go:
import numpy as np
import pandas as pd

cols = sorted(set(df1['Column2'].apply(lambda x: x.split(',')).sum()))
df = df1.groupby('Column1').agg(lambda x: ','.join(x)).reset_index()
pd.concat([df, pd.DataFrame({c: np.nan for c in cols}, index=df.index)], axis=1)
Column1 Column2 a b c d e f h
0 L17 a,b,c,d,e NaN NaN NaN NaN NaN NaN NaN
1 L6 a,b,f,h,d,e NaN NaN NaN NaN NaN NaN NaN
2 L7 a,b,c NaN NaN NaN NaN NaN NaN NaN
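For what it's worth, a variant of the same idea that adds the empty columns with reindex instead of concat (a sketch, assuming df and cols as built above):

df.reindex(columns=[*df.columns, *cols])  # appends one all-NaN column per split value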