For a dataframe df, I'm trying to fill column b with the value 2017-01-01, except where the value in column a is either missing (NaN) or 'Others':
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['Coffee', 'Muffin', 'Donut', 'Others', np.nan, np.nan]})
a
0 Coffee
1 Muffin
2 Donut
3 Others
4 NaN
5 NaN
The expected result is like this:
a b
0 Coffee 2017-01-01
1 Muffin 2017-01-01
2 Donut 2017-01-01
3 Others NaN
4 NaN NaN
5 NaN NaN
What I have tried, which didn't exclude the NaNs:
df.loc[~df['a'].isin(['nan', 'Others']), 'b'] = '2017-01-01'
a b
0 Coffee 2017-01-01
1 Muffin 2017-01-01
2 Donut 2017-01-01
3 Others NaN
4 NaN 2017-01-01
5 NaN 2017-01-01
Thanks!
Use np.nan instead of the string 'nan':
df.loc[~df['a'].isin([np.nan, 'Others']), 'b'] = '2017-01-01'
Or, before comparing, replace the missing values with 'Others':
df.loc[~df['a'].fillna('Others').eq('Others'), 'b'] = '2017-01-01'
print (df)
a b
0 Coffee 2017-01-01
1 Muffin 2017-01-01
2 Donut 2017-01-01
3 Others NaN
4 NaN NaN
5 NaN NaN
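For completeness, the same mask can also be written explicitly (a sketch, assuming the usual import numpy as np and import pandas as pd): notna() drops the missing values and ne('Others') drops the rest.
mask = df['a'].notna() & df['a'].ne('Others')  # True only for Coffee/Muffin/Donut rows
df.loc[mask, 'b'] = '2017-01-01'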
Check this out:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': ['Coffee', 'Muffin', 'Donut', 'Others', np.nan, np.nan]})
conditions = [
    (df['a'] == 'Others'),
    (df['a'].isnull())
]
choices = [np.nan, np.nan]
df['b'] = np.select(conditions, choices, default='2017-01-01')
print(df)
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['Coffee', 'Muffin', 'Donut', 'Others', np.nan, np.nan]})
df.loc[df['a'].replace('Others', np.nan).notnull(), 'b'] = '2017-01-01'
print(df)
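Here replace turns 'Others' into NaN first, so a single notnull() check covers both the missing values and the 'Others' rows.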
This is what I have:
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[6,np.nan,np.nan,3,np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend non-missing values of B to the missing values of B underneath them, so I have:
A B C
0 1 6.0 6.0
1 2 NaN 6.0
2 3 NaN 6.0
3 4 3.0 3.0
4 5 NaN 3.0
I tried something like this, and it worked last night:
for i in df.index:
    df['C'][i] = np.where(pd.isnull(df['B'].iloc[i]), df['C'][i-1], df.B.iloc[i])
But when I woke up this morning it said it didn't recognize 'C'. I couldn't identify the conditions under which it worked and didn't work.
Thanks!
You could use the pandas fillna() method to forward fill the missing values with the last non-null value; see the pandas documentation for more details. (Your loop is fragile because df['C'] does not exist yet when row 0 is assigned, and chained indexing like df['C'][i] is unreliable in any case.)
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, np.nan, np.nan, 3, np.nan]
})
df['C'] = df['B'].fillna(method='ffill')
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0
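Note that in pandas 2.1 and later, fillna(method='ffill') emits a deprecation warning; the dedicated method does the same thing:
df['C'] = df['B'].ffill()  # forward fill, same result as fillna(method='ffill')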
I am trying to replace/update the price column's values on this condition: if date equals 2019-09-01, replace them with np.nan. I have tried two methods, but neither has worked so far:
price pct date
0 10379.00000 0.0242 2019/6/1
1 10608.25214 NaN 2019/9/1
2 10400.00000 0.0658 2019/6/1
3 10258.48471 NaN 2019/9/1
4 12294.00000 0.1633 2019/6/1
5 11635.07402 NaN 2019/9/1
6 12564.00000 -0.0066 2019/6/1
7 13615.10992 NaN 2019/9/1
Solution 1: df.price.where(df.date == '2019-09-01', np.nan, inplace=True), but it replaced all price values with NaN
price pct date
0 NaN 0.0242 2019-06-01
1 NaN NaN 2019-09-01
2 NaN 0.0658 2019-06-01
3 NaN NaN 2019-09-01
4 NaN 0.1633 2019-06-01
5 NaN NaN 2019-09-01
6 NaN -0.0066 2019-06-01
7 NaN NaN 2019-09-01
Solution 2: df.loc[df.date == '2019-09-01', 'price'] = np.nan, but this didn't replace any values.
price pct date
0 10379.00000 0.0242 2019-06-01
1 10608.25214 NaN 2019-09-01
2 10400.00000 0.0658 2019-06-01
3 10258.48471 NaN 2019-09-01
4 12294.00000 0.1633 2019-06-01
5 11635.07402 NaN 2019-09-01
6 12564.00000 -0.0066 2019-06-01
7 13615.10992 NaN 2019-09-01
Please note that the date in the Excel file was in 2019/9/1 format before read_excel; I converted it with df['date'] = pd.to_datetime(df['date']).dt.date.
Can someone explain why this doesn't work? Thanks.
'2019-09-01' is a string, while df.date holds date objects, so the comparison never matches.
You should convert df.date to str to match:
df.loc[df.date.astype(str) == '2019-09-01', 'price'] = np.nan
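Alternatively (a sketch, relying on df.date holding datetime.date objects after the .dt.date conversion above), compare against a real date instead of a string:
import datetime
df.loc[df.date == datetime.date(2019, 9, 1), 'price'] = np.nan  # date == date, no string round-trip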
Actually the first solution works (kind of) for me, try this:
import pandas as pd
import numpy as np
df = pd.DataFrame(
    np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [3, 2, 1], [5, 6, 7]]),
    columns=['a', 'b', 'c']
)
The df should be:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 3 2 1
4 5 6 7
Then, using similar code:
df.a.where(df.c != 7, np.nan, inplace=True)
I got the df as:
a b c
0 1.0 2 3
1 4.0 5 6
2 7.0 8 9
3 3.0 2 1
4 NaN 6 7
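Note that Series.where keeps values where the condition is True and replaces them where it is False. In the question, the string comparison was never True, so where replaced every value. To blank out the matching rows directly, the inverse method Series.mask is the closer fit (a sketch reusing the astype(str) comparison from above):
df.price.mask(df.date.astype(str) == '2019-09-01', np.nan, inplace=True)  # replace where True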
I have an empty pandas dataframe (df), a list of (index, column) pairs (pair_list), and a list of corresponding values (value_list). I want to assign the value in value_list to the corresponding position in df according to pair_list. The following code is what I am using currently, but it is slow. Is there any faster way to do it?
import pandas as pd
import numpy as np
df = pd.DataFrame(index=[0,1,2,3], columns=['a', 'b','c','d'])
pair_list = [(0,'a'),(1,'c'),(0,'d')]
value_list = np.array([3,2,4])
for pos, item in enumerate(pair_list):
    df.at[item] = value_list[pos]
The output of the code should be:
a b c d
0 3 NaN NaN 4
1 NaN NaN 2 NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
One idea is to create a MultiIndex with MultiIndex.from_tuples, build a Series from the values, reshape it with Series.unstack, and then add the missing columns and index values with DataFrame.reindex:
pair_list = [(0,'a'),(1,'c'),(0,'d')]
value_list = np.array([3,2,4])
mux = pd.MultiIndex.from_tuples(pair_list)
cols = ['a', 'b','c','d']
idx = [0,1,2,3]
df = pd.Series(value_list, index=mux).unstack().reindex(index=idx, columns=cols)
print (df)
a b c d
0 3.0 NaN NaN 4.0
1 NaN NaN 2.0 NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
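If speed is the main concern, another sketch is to fill the underlying NumPy array directly and build the DataFrame once at the end (this assumes, as in the question, that the target index and columns are known up front):
import numpy as np
import pandas as pd
idx = [0, 1, 2, 3]
cols = ['a', 'b', 'c', 'd']
pair_list = [(0, 'a'), (1, 'c'), (0, 'd')]
value_list = np.array([3, 2, 4])
arr = np.full((len(idx), len(cols)), np.nan)      # start from an all-NaN array
row_pos = [idx.index(r) for r, c in pair_list]    # integer row positions
col_pos = [cols.index(c) for r, c in pair_list]   # integer column positions
arr[row_pos, col_pos] = value_list                # one vectorized assignment
df = pd.DataFrame(arr, index=idx, columns=cols)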
The following is an example of the data I have in an Excel sheet.
A B C
1 2 3
4 5 6
I am trying to get the columns name using the following code:
p1 = list(df1t.columns.values)
The output is like this:
[A, B, C, 'Unnamed: 3', 'unnamed 4', 'unnamed 5', .....]
I checked the Excel sheet; there are only three columns named A, B, and C. The other columns are blank. Any suggestions?
Just in case anybody stumbles over this problem: The issue can also arise if the excel sheet contains empty cells that are formatted with a background color:
import pandas as pd
df1t = pd.read_excel('test.xlsx')
print(df1t)
A B C Unnamed: 3
0 1 2 3 NaN
1 4 5 6 NaN
One option is to drop the 'Unnamed' columns as described here:
https://stackoverflow.com/a/44272830/11826257
df1t = df1t[df1t.columns.drop(list(df1t.filter(regex='Unnamed:')))]
print(df1t)
A B C
0 1 2 3
1 4 5 6
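Alternatively (a sketch, assuming the real data really is confined to columns A:C of the sheet), the stray columns can be skipped at read time:
df1t = pd.read_excel('test.xlsx', usecols='A:C')  # only read the three real columns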
The problem is that some cells are not empty but contain whitespace.
If you need the column names with the Unnamed ones filtered out:
cols = [col for col in df if not col.startswith('Unnamed:')]
print (cols)
['A', 'B', 'C']
Sample with file:
df = pd.read_excel('https://dl.dropboxusercontent.com/u/84444599/file_unnamed_cols.xlsx')
print (df)
A B C Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7
0 4.0 6.0 8.0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN
cols = [col for col in df if not col.startswith('Unnamed:')]
print (cols)
['A', 'B', 'C']
Another solution:
cols = df.columns[~df.columns.str.startswith('Unnamed:')]
print (cols)
Index(['A', 'B', 'C'], dtype='object')
And to return all the columns selected by cols, use:
print (df[cols])
A B C
0 4.0 6.0 8.0
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
And if necessary, remove the all-NaN rows:
print (df[cols].dropna(how='all'))
A B C
0 4.0 6.0 8.0
I have a pandas dataframe with two id variables:
df = pd.DataFrame({'id': [1,1,1,2,2,3],
'num': [10,10,12,13,14,15],
'q': ['a', 'b', 'd', 'a', 'b', 'z'],
'v': [2,4,6,8,10,12]})
id num q v
0 1 10 a 2
1 1 10 b 4
2 1 12 d 6
3 2 13 a 8
4 2 14 b 10
5 3 15 z 12
I can pivot the table with:
df.pivot(index='id', columns='q', values='v')
And end up with something close:
q a b d z
id
1 2 4 6 NaN
2 8 10 NaN NaN
3 NaN NaN NaN 12
However, what I really want is (the original unmelted form):
id num a b d z
1 10 2 4 NaN NaN
1 12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
2 14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
In other words:
'id' and 'num' are my indices (normally I've only seen either 'id' or 'num' as the index, but I need both since I'm trying to recover the original unmelted form),
'q' are my columns
'v' are my values in the table
Update
I found a close solution from Wes McKinney's blog:
df.pivot_table(index=['id','num'], columns='q')
v
q a b d z
id num
1 10 2 4 NaN NaN
12 NaN NaN 6 NaN
2 13 8 NaN NaN NaN
14 NaN 10 NaN NaN
3 15 NaN NaN NaN 12
However, the format is not quite the same as what I want above.
You could use set_index and unstack:
In [18]: df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
Out[18]:
q id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
You're really close, slaw. Just rename your column index to None and you've got what you want:
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel().rename(None)
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Note that the 'v' column is expected to be numeric by default so that it can be aggregated. Otherwise, pandas will error out with:
DataError: No numeric types to aggregate
To resolve this, you can specify your own aggregation function by using a custom lambda function:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc=lambda x: x)
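A hedged alternative: since each (id, num, q) group holds a single value here, the built-in 'first' aggregator gives the same result without a custom lambda:
df2 = df.pivot_table(index=['id','num'], columns='q', aggfunc='first')  # keep the sole value per group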
You can remove the column index name q:
df1.columns = df1.columns.tolist()
Zero's answer plus removing q:
df1 = df.set_index(['id', 'num', 'q'])['v'].unstack().reset_index()
df1.columns = df1.columns.tolist()
id num a b d z
0 1 10 2.0 4.0 NaN NaN
1 1 12 NaN NaN 6.0 NaN
2 2 13 8.0 NaN NaN NaN
3 2 14 NaN 10.0 NaN NaN
4 3 15 NaN NaN NaN 12.0
This might work just fine:
Pivot:
df2 = df.pivot_table(index=['id', 'num'], columns='q').reset_index()
Concatenate the first-level column names with the second:
df2.columns = [s1 + str(s2) for (s1, s2) in df2.columns.tolist()]
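With the sample df this yields flat column names ['id', 'num', 'va', 'vb', 'vd', 'vz'], the 'v' prefix coming from the values level of the pivoted columns.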
Came up with a close solution:
df2 = df.pivot_table(index=['id','num'], columns='q')
df2.columns = df2.columns.droplevel()
df2.reset_index().fillna("null").to_csv("test.csv", sep="\t", index=None)
Still can't figure out how to drop 'q' from the dataframe.
It can be done in three steps:
#1: Prepare an auxiliary column 'id_num':
df['id_num'] = df[['id', 'num']].apply(tuple, axis=1)
df = df.drop(columns=['id', 'num'])
#2: 'pivot' is almost an inverse of melt:
df, df.columns.name = df.pivot(index='id_num', columns='q', values='v').reset_index(), ''
#3: Bring back 'id' and 'num' columns:
df['id'], df['num'] = zip(*df['id_num'])
df = df.drop(columns=['id_num'])
This is the result, but with a different column order:
a b d z id num
0 2.0 4.0 NaN NaN 1 10
1 NaN NaN 6.0 NaN 1 12
2 8.0 NaN NaN NaN 2 13
3 NaN 10.0 NaN NaN 2 14
4 NaN NaN NaN 12.0 3 15
Alternatively, with the proper column order:
def multiindex_pivot(df, columns=None, values=None):
    # inspired by: https://github.com/pandas-dev/pandas/issues/23955
    names = list(df.index.names)
    df = df.reset_index()
    list_index = df[names].values
    tuples_index = [tuple(i) for i in list_index]  # hashable
    df = df.assign(tuples_index=tuples_index)
    df = df.pivot(index="tuples_index", columns=columns, values=values)
    tuples_index = df.index  # reduced
    index = pd.MultiIndex.from_tuples(tuples_index, names=names)
    df.index = index
    df = df.reset_index()   # added: bring 'id' and 'num' back as columns
    df.columns.name = ''    # added: drop the 'q' name from the column index
    return df
df = df.set_index(['id', 'num'])
df = multiindex_pivot(df, columns='q', values='v')
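With the sample df from this question, the result should match the table from the earlier answers, with id and num restored as leading columns:
print(df)
#    id  num    a     b    d     z
# 0   1   10  2.0   4.0  NaN   NaN
# 1   1   12  NaN   NaN  6.0   NaN
# 2   2   13  8.0   NaN  NaN   NaN
# 3   2   14  NaN  10.0  NaN   NaN
# 4   3   15  NaN   NaN  NaN  12.0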