I am comparing dataframes with pandas. I want to distinguish the compared columns by naming them, so I'm using the result_names parameter described in the pandas documentation, but it raises: TypeError: DataFrame.compare() got an unexpected keyword argument 'result_names'.
Here is the code, which is simply the example from the documentation (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.compare.html):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
df.compare(df2, result_names=("left", "right"))
Any ideas why?
You need pandas ≥ 1.5: the result_names parameter was only added in pandas 1.5.0.
For earlier versions, you can instead rename the level:
df.compare(df2).rename({'self': 'left', 'other': 'right'}, axis=1, level=1)
output:
  col1       col3
  left right left right
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
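If the same script has to run on both older and newer pandas, you can branch on the version. Here is a minimal sketch (the helper name compare_named is my own invention), using packaging for a robust version comparison:
from packaging.version import Version
import pandas as pd

def compare_named(left, right, names=("left", "right")):
    # result_names exists only from pandas 1.5.0 onward
    if Version(pd.__version__) >= Version("1.5"):
        return left.compare(right, result_names=names)
    # older pandas: rename the inner column level after the fact
    return left.compare(right).rename(
        {"self": names[0], "other": names[1]}, axis=1, level=1
    )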
What is the best way to convert the datatype of columns of a Pandas dataframe using a dict with data types?
e.g. I have a dataframe df:
import pandas as pd

d = {'col1': ["1", "abc"], 'col2': ["abc", "02-02-2021"]}
df = pd.DataFrame(data=d)
and I have a dict:
dtype_dict = {"col1": int,
              "col2": datetime}
When a value in a column cannot be converted to the correct datatype, I need it to be set to NaN (similar behaviour to the errors='coerce' parameter in pd.to_numeric).
Expected output:
d_out = {'col1': [1, NaN], 'col2': [NaN, "02-02-2021"]}
df_out = pd.DataFrame(data=d_out)
My true dataset consists of multiple large pandas dataframes and corresponding dicts, so I am looking for an automated way to convert complete dataframes.
Thanks!
If you slightly modify your dict, you can use:
import pandas as pd
from functools import partial
dtype_dict = { "col1": partial(pd.to_numeric, errors='coerce'),
"col2": partial(pd.to_datetime, errors='coerce')}
out = df.agg(dtype_dict)
>>> out
   col1       col2
0   1.0        NaT
1   NaN 2021-02-02
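For the "multiple dataframes with corresponding dicts" part, a plain comprehension over frame/dict pairs should be enough. A sketch with toy data (all names below are made up for illustration); note that agg with a dict returns only the columns named in the dict:
from functools import partial
import pandas as pd

to_num = partial(pd.to_numeric, errors='coerce')
to_dt = partial(pd.to_datetime, errors='coerce')

# toy stand-ins for the real dataframes and their dtype dicts
frames = [
    pd.DataFrame({'col1': ["1", "abc"], 'col2': ["abc", "02-02-2021"]}),
    pd.DataFrame({'a': ["3.5", "x"], 'b': ["2021-01-01", "oops"]}),
]
dicts = [
    {'col1': to_num, 'col2': to_dt},
    {'a': to_num, 'b': to_dt},
]

converted = [frame.agg(mapping) for frame, mapping in zip(frames, dicts)]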
Is there a good, clean way to hard-code a pandas dataframe into Python code (e.g. a .py file)?
I don't want to store it in a separate CSV (I want the script to be able to run on its own), and the dataframe is not very big. I also want it to be clear in the code what it is, and easily modifiable.
For example:
import pandas as pd

cols = ['val1', 'val2', 'val3']
rows = ['red', 'blue', 'green', 'orange', 'pink']
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0], [13.0, 14.0, 15.0]]
pd.DataFrame(data, index=rows, columns=cols)
This works OK, but if you want to modify, say, green's val2, it's not easy to find the right value at a glance. Slightly better (in some ways):
cols = ['val1', 'val2', 'val3']
rows = ['red', 'blue', 'green', 'orange', 'pink']
data = [
    # val1   val2   val3
    [  1.0,   2.0,   3.0],  # red
    [  4.0,   5.0,   6.0],  # blue
    [  7.0,   8.0,   9.0],  # green
    [ 10.0,  11.0,  12.0],  # orange
    [ 13.0,  14.0,  15.0],  # pink
]
pd.DataFrame(data, index=rows, columns=cols)
but this requires a lot of manual formatting, or writing a separate dataframe printer, and is ugly and hackish.
Use pd.read_csv based on a string literal:
try: from io import StringIO                        # Python 3
except ImportError: from StringIO import StringIO   # Python 2
import pandas as pd
TESTDATA = u"""\
val1, val2, val3, color
1.0, 2.0, 3.0, red
4.0, 5.0, 6.0, blue
7.0, 8.0, 9.0, green
10.0, 11.0, 12.0, orange
13.0, 14.0, 15.0, pink
"""
df = pd.read_csv(StringIO(TESTDATA), index_col='color', sep=r",\s*", engine='python')
print(df)
# prints:
#         val1  val2  val3
# color
# red      1.0   2.0   3.0
# blue     4.0   5.0   6.0
# green    7.0   8.0   9.0
# orange  10.0  11.0  12.0
# pink    13.0  14.0  15.0
The inclusion of \s* in sep means that you then have the option to pretty-format your data with whitespace. Since you say the dataframe is not very big, why not do that, for the sake of readability? But if you're averse to manually aligning things even for a small dataframe, you could just remove the spaces and paste the raw CSV content in TESTDATA. Then you can drop the \s* out of sep and remove engine='python' (the latter is only there to suppress a warning associated with the use of regular expressions in sep).
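For instance, the compact variant could look like this (same data; with a plain comma separator the default C engine is fine):
TESTDATA = u"""\
val1,val2,val3,color
1.0,2.0,3.0,red
4.0,5.0,6.0,blue
7.0,8.0,9.0,green
10.0,11.0,12.0,orange
13.0,14.0,15.0,pink
"""
df = pd.read_csv(StringIO(TESTDATA), index_col='color')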
An even better version, which allows you to use the print(df) output itself as the input, without manual editing, would be:
try: from io import StringIO                        # Python 3
except ImportError: from StringIO import StringIO   # Python 2
import pandas as pd
TESTDATA = u"""\
        val1  val2  val3
color
red      1.0   2.0   3.0
blue     4.0   5.0   6.0
green    7.0   8.0   9.0
orange  10.0  11.0  12.0
pink    13.0  14.0  15.0
"""
df = pd.read_csv(StringIO(TESTDATA), index_col=0, sep=r"\s+", engine='python')
print(df)
To provide a complete answer based on our comments:
from io import StringIO
import pandas as pd
data = """
col1,col2,col3
a,b,c
d,e,f
"""
s = StringIO(data)
df = pd.read_csv(s)
result:
  col1 col2 col3
0    a    b    c
1    d    e    f
I have the following dataframe, to which I apply groupby and sum():
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
      col2
col1
A      6.0
B     15.0
C      0.0
I want C to show NaN instead of 0 since all of the values for C are NaN. How can I accomplish this? Apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum, skipna=False).reset_index(drop=True)
#Or --> df.groupby('col1', as_index=False).apply(pd.DataFrame.sum, skipna=False)
Output (note that the apply sums col1 itself as well, so each group's strings are concatenated):
  col1  col2
0  AAA   6.0
1  BBB  15.0
2  CCC   NaN
Without the apply(), thanks to #piRSquared and #Alollz: min_count=1 returns NaN for all-NaN groups while still summing groups that merely contain some NaNs (the level argument of sum is gone in newer pandas; the groupby version below is the modern equivalent):
df.set_index('col1').sum(level=0, min_count=1).reset_index()
Thanks to #piRSquared, #Alollz, and #anky_91, you can do the same without setting and resetting the index:
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C","C"], 'col2': [1,2,3,4,5,6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
  col1  col2
0    A   6.0
1    B  15.0
2    C   NaN
Make the call to sum use the parameter skipna=False.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
That link should provide the documentation you need, and I expect it will fix your problem.
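Concretely, since DataFrameGroupBy.sum accepts min_count but (in most pandas versions) no skipna argument, one way to act on that suggestion is to call Series.sum(skipna=False) inside apply. A sketch (note this is stricter than min_count=1: any group containing a NaN becomes NaN):
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
                   'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]})

# NaN as soon as the group has any NaN at all
out = df.groupby('col1')['col2'].apply(lambda s: s.sum(skipna=False))
print(out)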
My original df.index is in yyyy-mm-dd format (not a datetime dtype, it is a str). How do I format it as ddmmmyyyy?
import numpy as np
import pandas as pd

df1 = pd.DataFrame(index=['2017-01-01', '2017-02-01', '2017-03-01'],
                   columns=["A", "B", "C"],
                   data=[[5, np.nan, "ok"], [7, 8, "fine"], ["3rd", 100, np.nan]])
df1
The result that I need is an index like 01JAN2017, 01FEB2017, 01MAR2017.
Are you trying to change it programmatically? Otherwise you can just change the string literals, like 3.6 biturbo suggested:
df1 = pd.DataFrame(index=['01JAN2017', '01FEB2017', '01MAR2017'],
                   columns=["A", "B", "C"],
                   data=[[5, np.nan, "ok"], [7, 8, "fine"], ["3rd", 100, np.nan]])
df1
Otherwise, if you have a datetime column, you could try (%d%b%Y gives the ddmmmyyyy form the question asks for):
df['date'] = df['datetime'].apply(lambda x: x.strftime('%d%b%Y'))
df['time'] = df['datetime'].apply(lambda x: x.strftime('%H%M%S'))
Or use the DatetimeIndex.strftime() method:
In [193]: df1.index = pd.to_datetime(df1.index).strftime('%d%b%Y')

In [194]: df1
Out[194]:
             A      B     C
01Jan2017    5    NaN    ok
01Feb2017    7    8.0  fine
01Mar2017  3rd  100.0   NaN
I am trying to do a calculation in Pandas that seems obvious, but after several tries I have not found how to do it correctly.
I have a dataframe that looks like this:
import pandas as pd

df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "a", 5.0],
                   ["B", "b", 6.0],
                   ["B", "c", 7.0]])
The first column is a test name, the second column is a class, and the third column gives a time. Each test normally appears in the table with the 3 classes.
This format is suitable for plotting it like this:
import seaborn as sns
sns.factorplot(x="2", y="0", hue="1", data=df, kind="bar")
So that for each test, I get a group of 3 bars, one for each class.
However I would like to change the dataframe so that each value in column 2 is not an absolute value, but a ratio compared to class "a".
So I would like to transform it to this:
df = pd.DataFrame([["A", "a", 1.0],
["A", "b", 1.2],
["A", "c", 1.3],
["B", "a", 1.0],
["B", "b", 1.2],
["B", "c", 1.4]])
I am able to extract the series, change the index so that they match, and do the computation, for example:
df_a = df[df[1] == "a"].set_index(0)
df_b = df[df[1] == "b"].set_index(0)
df_b["ratio_a"] = df_b[2] / df_a[2]
But this is certainly very inefficient, and I would then need to assemble the results back into the original format.
What is the correct way to do it?
You could sort so that class "a" comes first within each test, then use groupby/transform('first') to pick out that first value in each group:
import pandas as pd

df = pd.DataFrame([["A", "a", 10.0],
                   ["A", "b", 12.0],
                   ["A", "c", 13.0],
                   ["B", "b", 6.0],
                   ["B", "a", 5.0],
                   ["B", "c", 7.0]])

df = df.sort_values(by=[0, 1])  # guarantees class "a" is the first row of each test group
df[2] /= df.groupby(0)[2].transform('first')
yields
   0  1    2
0  A  a  1.0
1  A  b  1.2
2  A  c  1.3
4  B  a  1.0
3  B  b  1.2
5  B  c  1.4
You can also do this with some index alignment. Using the integer column labels of the example (with named columns you would set those instead), turn the time column into a Series indexed by (test, class) and divide it by its class-"a" cross-section, broadcasting over the test level:
s = df.set_index([0, 1])[2]
s.div(s.xs('a', level=1), level=0)
But transform is better.