Python - format dataframe index

My original df.index is in yyyy-mm-dd format (it is a str, not a datetime dtype). How do I format it as ddmmmyyyy?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=['2017-01-01', '2017-02-01', '2017-03-01'],
                   columns=["A", "B", "C"],
                   data=[[5, np.nan, "ok"], [7, 8, "fine"], ["3rd", 100, np.nan]])
df1
The result that I need is the same frame, but with the index shown as 01JAN2017, 01FEB2017, 01MAR2017:

Are you trying to change it programmatically? Otherwise you can just change the string literal, like 3.6 biturbo suggested:
df1 = pd.DataFrame(index=['01JAN2017', '01FEB2017', '01MAR2017'],
                   columns=["A", "B", "C"],
                   data=[[5, np.nan, "ok"], [7, 8, "fine"], ["3rd", 100, np.nan]])
df1
Otherwise, if you have an actual datetime column (a df['datetime'] column of datetime dtype is assumed here), you could try:
df['date'] = df['datetime'].apply(lambda x: x.strftime('%d%m%Y'))
df['time'] = df['datetime'].apply(lambda x: x.strftime('%H%M%S'))

Use the DatetimeIndex.strftime() method:
In [193]: df1.index = pd.to_datetime(df1.index).strftime('%d%b%Y')
In [194]: df1
Out[194]:
             A      B     C
01Jan2017    5    NaN    ok
01Feb2017    7    8.0  fine
01Mar2017  3rd  100.0   NaN
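Note that %b gives a capitalized month (01Jan2017). If you need the fully uppercase 01JAN2017 from the question, one option (my addition, assuming a pandas version where strftime returns an Index) is to chain .str.upper():
df1.index = pd.to_datetime(df1.index).strftime('%d%b%Y').str.upper()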

Try this solution (simply writing the index strings in the target format):
df1 = pd.DataFrame(index=['01JAN2017', '01FEB2017', '01MAR2017'],
                   columns=["A", "B", "C"],
                   data=[[5, np.nan, "ok"], [7, 8, "fine"], ["3rd", 100, np.nan]])
df1

Related

How to check in which columns a certain value appears in a pandas.DataFrame?

I have a DataFrame in Python like below:
df = pd.DataFrame({"col1" : ["a", "b", "c"], "col2" : ["a", "d", "e"], "col3" : ["r", "t" , "g"]})
And I would like to check in which columns the value "a" appears (of course, in "col1" and "col2"). How can I check this?
(df=='a').any()
col1     True
col2     True
col3    False
dtype: bool
If you need the column names as a list, compare all values with DataFrame.eq, use DataFrame.any to check whether there is at least one True (match) per column, and finally filter the column names:
c = df.columns[df.eq('a').any()].tolist()
print (c)
['col1', 'col2']
If you need to filter those columns into a new DataFrame, use DataFrame.loc:
df1 = df.loc[:, df.eq('a').any()]
print (df1)
  col1 col2
0    a    a
1    b    d
2    c    e
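If you also need the rows where the value occurs (a step beyond the question; this snippet is my addition), you can stack the boolean mask and keep the True entries:
df.eq('a').stack().loc[lambda s: s].index.tolist()
# [(0, 'col1'), (0, 'col2')]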

Pandas if statement in vectorized operation

import pandas as pd

df = pd.DataFrame([["a", "d"], ["", ""], ["", "3"]],
                  columns=["a", "b"])
df
   a  b
0  a  d
1
2     3
I'm looking to do a vectorized string concatenation with an if statement like this:
df["c"] = df["a"] + "()" + df["b"] if df["a"].item != "" else ""
But it doesn't work, because df["a"] is a whole Series, not a single value. Is it possible to do this without an apply or lambda method that goes through each row? In a vectorized operation pandas works on whole columns at once, which makes it faster...
Desired output:
df
   a  b     c
0  a  d  a()d
1
2     3
Try this, using np.where():
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", "d"], ["", ""], ["", "3"]],
                  columns=["a", "b"])
df['c'] = np.where(df['a'] != '', df['a'] + '()' + df['b'], '')
print(df)
output:
   a  b     c
0  a  d  a()d
1
2     3
IIUC, you could use mask to concatenate both columns (separated by some string via str.cat) whenever a condition holds:
df['c'] = df.a.mask(df.a.ne(''), df.a.str.cat(df.b, sep='()'))
print(df)
   a  b     c
0  a  d  a()d
1
2     3
Since nobody has mentioned it yet, you can also use the apply method:
df['c'] = df.apply(lambda r: r['a'] + '()' + r['b'] if r['a'] != '' else '', axis=1)
If anyone checks performance, please comment below :)
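Since the last answer invites a performance check, here is a minimal benchmark sketch (my addition; the frame size and repetition count are arbitrary choices, not from the thread). On frames of this size the vectorized np.where and mask versions typically beat the row-wise apply by a wide margin:
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame([["a", "d"], ["", ""], ["", "3"]] * 10000, columns=["a", "b"])

def with_np_where():
    return np.where(df["a"] != "", df["a"] + "()" + df["b"], "")

def with_mask():
    return df.a.mask(df.a.ne(""), df.a.str.cat(df.b, sep="()"))

def with_apply():
    return df.apply(lambda r: r["a"] + "()" + r["b"] if r["a"] != "" else "", axis=1)

for f in (with_np_where, with_mask, with_apply):
    print(f.__name__, timeit.timeit(f, number=10))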

Join two same columns from two dataframes, pandas

I am looking for the fastest way to join columns with the same names using a separator.
my dataframes:
df1:
A,B,C,D
my,he,she,it
df2:
A,B,C,D
dog,cat,elephant,fish
expected output:
df:
A,B,C,D
my:dog,he:cat,she:elephant,it:fish
As you can see, I want to merge columns with the same names, joining the two cells into one.
I can use this code for the A column:
df = df1.merge(df2)
df['A'] = df[['A_x', 'A_y']].apply(lambda x: ':'.join(x), axis=1)
In my real dataset I have over 30 columns, and I don't want to write the same lines for each of them. Is there a faster way to get my expected output?
How about concat and groupby?
# stack the two frames (both rows share index 0), join each column's values per index group,
# then drop the duplicated broadcast row
df3 = pd.concat([df1, df2], axis=0)
df3 = df3.groupby(df3.index).transform(lambda x: ':'.join(x)).drop_duplicates()
print(df3)
        A       B             C        D
0  my:dog  he:cat  she:elephant  it:fish
How about this?
df3 = df1 + ':' + df2
print(df3)
        A       B             C        D
0  my:dog  he:cat  she:elephant  it:fish
This is good because if there are columns that don't match, you get NaN, so you can filter them out later if you want:
df1 = pd.DataFrame({'A': ['my'], 'B': ['he'], 'C': ['she'], 'D': ['it'], 'E': ['another'], 'F': ['and another']})
df2 = pd.DataFrame({'A': ['dog'], 'B': ['cat'], 'C': ['elephant'], 'D': ['fish']})
df1 + ':' + df2
        A       B             C        D    E    F
0  my:dog  he:cat  she:elephant  it:fish  NaN  NaN
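If you then want to keep only the columns that matched, a short follow-up (my addition, not part of the original answer): dropna(axis=1) removes every column that picked up a NaN:
(df1 + ':' + df2).dropna(axis=1)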
You can do this by simply adding the two dataframes with a separator:
import pandas as pd

df1 = pd.DataFrame({"A": ["my"], "B": ["he"], "C": ["she"], "D": ["it"]})
df2 = pd.DataFrame({"A": ["dog"], "B": ["cat"], "C": ["elephant"], "D": ["fish"]})
print(df1)
print(df2)
df3 = df1 + ':' + df2
print(df3)
This will give you a result like:
    A   B    C   D
0  my  he  she  it
     A    B         C     D
0  dog  cat  elephant  fish
        A       B             C        D
0  my:dog  he:cat  she:elephant  it:fish
Is this what you are trying to achieve? Note that this only works if both dataframes have the same columns; the extra columns will contain NaNs. What do you want to do with the columns that are not shared between df1 and df2? Please comment below to help me understand your problem better.
You can simply do:
df = df1 + ':' + df2
print(df)
which is simple, effective, and should be your answer.
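Since the question mentions 30+ columns that may not all be shared, a hedged generalization (my addition) is to restrict the addition to the columns common to both frames:
common = df1.columns.intersection(df2.columns)
df = df1[common] + ':' + df2[common]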

Pandas: Calculate mean leaving out own row's value

I want to calculate means by group, leaving out the value of the row itself.
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
I know how to return means by group:
df.groupby('col1').agg({'col2': 'mean'})
Which returns:
Out[247]:
      col2
col1
a     0.75
b     3.00
But what I want is the mean by group, leaving out the row's own value. E.g. for the first row (the other "a" values are 4, -5 and 4):
df.query('col1 == "a"')[1:4].mean()
which returns:
Out[251]:
col2 1.0
dtype: float64
Edit:
Expected output is a dataframe of the same format as df above, with a column mean_excl_own which is the mean across all other members in the group, excluding the row's own value.
You could group on col1 and transform with 'sum' and 'count'; subtracting the row's own value from the group sum and dividing by count - 1 gives the leave-one-out mean:
g = df.groupby('col1')['col2']
df['mean_excl_own'] = (g.transform('sum') - df['col2']) / (g.transform('count') - 1)
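A quick sanity check of that one-liner against the example worked out above (self-contained, using the question's df):
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
g = df.groupby('col1')['col2']
df['mean_excl_own'] = (g.transform('sum') - df['col2']) / (g.transform('count') - 1)
print(df['mean_excl_own'].tolist())  # [1.0, -0.333..., 3.0, 2.666..., 3.0, -0.333...]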
Thanks for all your input. I ended up using the approach linked to by @VnC.
Here's how I solved it:
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
df = pd.merge(df, group_summary, on='col1')
df['other_sum'] = df['count'] * df['mean'] - df['col2']  # group sum minus the row's own value
df['result'] = df['other_sum'] / (df['count'] - 1)
Check out the final result:
df['result']
Which prints:
Out:
0    1.000000
1   -0.333333
2    2.666667
3   -0.333333
4    3.000000
5    3.000000
Name: result, dtype: float64
(Note that the merge regroups the rows by col1, so these values line up with the "a" rows first, then the "b" rows.)
Edit: I previously had some trouble with column names, but I fixed it using this answer.

pandas groupby - custom function

I have the following dataframe, to which I apply groupby and sum():
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
      col2
col1
A      6.0
B     15.0
C      0.0
I want C to show NaN instead of 0 since all of the values for C are NaN. How can I accomplish this? Apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum, skipna=False).reset_index(drop=True)
# Or --> df.groupby('col1', as_index=False).apply(pd.DataFrame.sum, skipna=False)
Output (note that col1 gets summed as well, which concatenates each group's strings, hence AAA):
  col1  col2
0  AAA   6.0
1  BBB  15.0
2  CCC   NaN
Without the apply(), thanks to @piRSquared and @Alollz — min_count=1 returns the sum for groups that merely contain some NaN, and NaN only when a group is all-NaN:
df.set_index('col1').sum(level=0, min_count=1).reset_index()
Thanks to @piRSquared, @Alollz, and @anky_91:
You can do it without setting and resetting the index:
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"], 'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
  col1  col2
0    A   6.0
1    B  15.0
2    C   NaN
Make the call to sum use the parameter skipna=False.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
That link should provide the documentation you need, and I expect it will fix your problem. (skipna is a DataFrame.sum parameter, which is why the apply-based answer above routes the groups through pd.DataFrame.sum.)
