Pandas reversing DataFrames giving strange results - python

Let's say I have a df:
            value
2023-02-01      a
2023-02-02      b
2023-02-03      c
2023-02-04      d
When I reverse the df and, after doing some operations, reverse it back, it does not show the complete df:
import pandas as pd

df = pd.DataFrame(
    {"value": {0: "a", 1: "b", 2: "c", 3: "d"}}
)
df.index = pd.date_range(start='2023-02-01', periods=4, freq='D')
new = df.loc[::-1]   # Reversing one time
new2 = new.loc[::-1] # Reversing back again
print(new2)
This gives me (even though I haven't done any operations on df):
2023-02-04      d
Expected:
            value
2023-02-01      a
2023-02-02      b
2023-02-03      c
2023-02-04      d
Version details:
Pandas 1.3.5
Python 3.9
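For reference, a minimal round-trip check of the snippet above (an editor's sketch, not part of the original question; pd.testing.assert_frame_equal is pandas' standard testing helper):
import pandas as pd

df = pd.DataFrame({"value": {0: "a", 1: "b", 2: "c", 3: "d"}})
df.index = pd.date_range(start='2023-02-01', periods=4, freq='D')

# .loc[::-1] returns a reversed copy; applying it twice should give back
# a frame identical to the original.
round_trip = df.loc[::-1].loc[::-1]
pd.testing.assert_frame_equal(df, round_trip)  # raises if they differ
print(round_trip)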

Related

Dropping redundant levels from a pandas multiindex

I have a Pandas data frame with a MultiIndex that is filtered interactively. The resulting filtered frame can have redundant levels in the index, where all entries in a level are the same.
Is there a way to drop these levels from the index?
Having a data frame like:
>>> df = pd.DataFrame(
...     [[1, 2], [3, 4]],
...     columns=["a", "b"],
...     index=pd.MultiIndex.from_tuples(
...         [("a", "b", "c"), ("d", "b", "e")], names=["one", "two", "three"]
...     ),
... )
>>> df
               a  b
one two three
a   b   c      1  2
d   b   e      3  4
I would like to drop level "two" but without specifying the level since I wouldn't know beforehand which level is redundant.
Something like this (a made-up function):
>>> df.index = df.index.drop_redundant()
>>> df
           a  b
one three
a   c      1  2
d   e      3  4
You can convert the index to a dataframe, then count the number of unique values per level. Levels with nunique == 1 will then be dropped:
nunique = df.index.to_frame().nunique()
to_drop = nunique.index[nunique == 1]
df = df.droplevel(to_drop)
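For completeness, a self-contained check of this approach on the example frame above (an editor's sketch):
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=["a", "b"],
    index=pd.MultiIndex.from_tuples(
        [("a", "b", "c"), ("d", "b", "e")], names=["one", "two", "three"]))

nunique = df.index.to_frame().nunique()
to_drop = nunique.index[nunique == 1]  # only level "two" is constant here
print(df.droplevel(to_drop.tolist()))
#            a  b
# one three
# a   c      1  2
# d   e      3  4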
If you do this a lot, you can monkey-patch it to the DataFrame class:
def drop_redundant(df: pd.DataFrame, inplace=False):
    if not isinstance(df.index, pd.MultiIndex):
        return df
    nunique = df.index.to_frame().nunique()
    to_drop = nunique.index[nunique == 1]
    return df.set_index(df.index.droplevel(to_drop), inplace=inplace)

# The monkey patching
pd.DataFrame.drop_redundant = drop_redundant
# Usage
df = df.drop_redundant() # chaining
df.drop_redundant(inplace=True) # in-place
Another possible solution, which is based on janitor.drop_constant_columns:
# pip install pyjanitor
import janitor

df.index = pd.MultiIndex.from_frame(
    janitor.drop_constant_columns(df.index.to_frame())
)
Output:
           a  b
one three
a   c      1  2
d   e      3  4

How to combine two columns of different dataframes such that they have unique values?

I have two different dataframes and I want to get the sorted
values of two columns.
Setup
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    'id': range(7),
    'c': list('EDBBCCC')
})
df2 = pd.DataFrame({
    'id': range(8),
    'c': list('EBBCCCAA')
})
Desired Output
# notice that ABCDE appear in alphabetical order
  c_first c_second
      NaN        A
        B        B
        C        C
        D      NaN
        E        E
What I've tried
pd.concat([df1.c.sort_values().drop_duplicates().rename('c_first'),
           df2.c.sort_values().drop_duplicates().rename('c_second')],
          axis=1)
How can I get the output in the required format?
Here is one possible way to achieve it:
t1 = df1.c.drop_duplicates()
t2 = df2.c.drop_duplicates()
tmp1 = pd.DataFrame({'id': t1, 'c_first': t1})
tmp2 = pd.DataFrame({'id': t2, 'c_second': t2})
result = pd.merge(tmp1, tmp2, how='outer').sort_values('id').drop('id', axis=1)
result
  c_first c_second
4     NaN        A
0       B        B
1       C        C
2       D      NaN
3       E        E
There is a sort argument in the concat function (see https://pandas.pydata.org/pandas-docs/version/0.25.0/reference/api/pandas.concat.html).
Try adding sort=True.
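A minimal sketch of that suggestion (an editor's addition, assuming the setup above; indexing each Series by its own letter values is not part of the original answer, but concat needs a shared index to align and sort on):
import pandas as pd

df1 = pd.DataFrame({'id': range(7), 'c': list('EDBBCCC')})
df2 = pd.DataFrame({'id': range(8), 'c': list('EBBCCCAA')})

u1 = df1.c.drop_duplicates().to_list()  # ['E', 'D', 'B', 'C']
u2 = df2.c.drop_duplicates().to_list()  # ['E', 'B', 'C', 'A']

# Index each Series by its own letter values so concat aligns on letters;
# sort=True then sorts the combined outer-joined index alphabetically.
out = pd.concat([pd.Series(u1, index=u1, name='c_first'),
                 pd.Series(u2, index=u2, name='c_second')],
                axis=1, sort=True)
print(out)
#   c_first c_second
# A     NaN        A
# B       B        B
# C       C        C
# D       D      NaN
# E       E        E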

How to return columns from a Dataframe in a function that were not calculated by the function?

I have the following DataFrames:
import pandas as pd

df_county = pd.DataFrame({
    "A": [50],
    "B": [60],
    "C": [70]})
df_voronoi = pd.DataFrame({
    "area": [1000, 2000, 3000, 4000],
    "county": ["A", "B", "C", "A"],
    "bus": ["bus1", "bus4", "bus20", "bus2"]})
With the following function I am calculating my values:
def calc(df1, df2):
    return [1/(df1[county] / area) for county, area in zip(df2.county, df2.area)]

df = calc(df_county, df_voronoi)
df = pd.DataFrame(df)
print(df)
Result:
           0
A  20.000000
B  33.333333
C  42.857143
A  80.000000
Here county is the index. I want county as its own column, and I want the bus column from the Voronoi DataFrame as a column with the right relation to county and area.
That means I would like the function's output to look like this:
How to realize that?
And an extra question:
Does it matter at what position I define the function? I have an example where the function is created at the top and the type of the return is a pandas DataFrame. In this example it's a list and I have to make a DataFrame from the list. If yes, can you explain why?
I think you need a small modification to your existing structure. Try this:
import pandas as pd

df_county = pd.DataFrame({
    "A": [50],
    "B": [60],
    "C": [70]})
df_voronoi = pd.DataFrame({
    "area": [1000, 2000, 3000, 4000],
    "country": ["A", "B", "C", "A"],
    "bus": ["bus1", "bus4", "bus20", "bus2"]})

def calc(df1, df2):
    return [(1/(df1[country] / area), area) for country, area in zip(df2.country, df2.area)]

df = calc(df_county, df_voronoi)
mdf = pd.DataFrame([f[0] for f in df]).reset_index()
mdf["area"] = [f[1] for f in df]
mdf.columns = ["country", "factor", "area"]
print(mdf)
  country     factor  area
0       A  20.000000  1000
1       B  33.333333  2000
2       C  42.857143  3000
3       A  80.000000  4000
I added the area column; otherwise we can't identify which bus we need (since there are two A rows in df2).
merged = pd.merge(mdf, df_voronoi, on=["country", "area"], how="left")
merged = merged.drop(columns=["area"])
print(merged)
  country     factor    bus
0       A  20.000000   bus1
1       B  33.333333   bus4
2       C  42.857143  bus20
3       A  80.000000   bus2
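A simpler variant is also possible (an editor's sketch, not from the original answer): start from a copy of the df_voronoi columns you want to carry through and add the calculated column, so the function returns a DataFrame directly. This also answers the extra question: the position of the def only matters in that it must run before the function is called, and the return type depends on what the function builds, not on where it is defined.
import pandas as pd

df_county = pd.DataFrame({"A": [50], "B": [60], "C": [70]})
df_voronoi = pd.DataFrame({
    "area": [1000, 2000, 3000, 4000],
    "county": ["A", "B", "C", "A"],
    "bus": ["bus1", "bus4", "bus20", "bus2"]})

def calc(df1, df2):
    # Carry the un-calculated columns (county, bus) through untouched,
    # then attach the calculated factor column alongside them.
    out = df2[["county", "bus"]].copy()
    out["factor"] = [1 / (df1[c].iloc[0] / a)
                     for c, a in zip(df2.county, df2.area)]
    return out

print(calc(df_county, df_voronoi))
#   county    bus     factor
# 0      A   bus1  20.000000
# 1      B   bus4  33.333333
# 2      C  bus20  42.857143
# 3      A   bus2  80.000000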

Pandas: Calculate mean leaving out own row's value

I want to calculate means by group, leaving out the value of the row itself.
import pandas as pd
d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)
I know how to return means by group:
df.groupby('col1').agg({'col2': 'mean'})
Which returns:
Out[247]:
      col2
col1
a     0.75
b     3.00
But what I want is mean by group, leaving out the row's value. E.g. for the first row:
df.query('col1 == "a"')[1:4].mean()
which returns:
Out[251]:
col2 1.0
dtype: float64
Edit:
Expected output is a dataframe of the same format as df above, with a column mean_excl_own which is the mean across all other members in the group, excluding the row's own value.
You could group by col1 and transform with the sum and count, then compute (group sum minus the row's own value) divided by (group count minus one):
g = df.groupby('col1').col2
df['mean_excl_own'] = (g.transform('sum') - df.col2) / (g.transform('count') - 1)
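A quick hand check on the example data (group a sums to 3 over 4 rows, group b to 6 over 2 rows):
print(df['mean_excl_own'].round(6).to_list())
# [1.0, -0.333333, 3.0, 2.666667, 3.0, -0.333333]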
Thanks for all your input. I ended up using the approach linked to by @VnC.
Here's how I solved it:
import pandas as pd

d = {'col1': ["a", "a", "b", "a", "b", "a"], 'col2': [0, 4, 3, -5, 3, 4]}
df = pd.DataFrame(data=d)

group_summary = df.groupby('col1', as_index=False)['col2'].agg(['mean', 'count'])
df = pd.merge(df, group_summary, on='col1')
# Sum of the other rows in the group = group count * group mean - own value
df['other_sum'] = df['count'] * df['mean'] - df['col2']
df['result'] = df['other_sum'] / (df['count'] - 1)
Check out the final result:
df['result']
Which prints:
Out:
0    1.000000
1   -0.333333
2    2.666667
3   -0.333333
4    3.000000
5    3.000000
Name: result, dtype: float64
Edit: I previously had some trouble with column names, but I fixed it using this answer.

pandas groupby - custom function

I have the following dataframe, to which I apply groupby and sum():
import numpy as np
import pandas as pd

d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1").sum()
This results in the following:
      col2
col1
A      6.0
B     15.0
C      0.0
I want C to show NaN instead of 0 since all of the values for C are NaN. How can I accomplish this? Apply() with a lambda function? Any help would be appreciated.
Use this:
df.groupby('col1').apply(pd.DataFrame.sum, skipna=False).reset_index(drop=True)
# Or --> df.groupby('col1', as_index=False).apply(pd.DataFrame.sum, skipna=False)
Output (summing each group's col1 strings concatenates them):
  col1  col2
0  AAA   6.0
1  BBB  15.0
2  CCC   NaN
Without the apply(), thanks to @piRSquared and @Alollz. With min_count=1, a group that merely contains a NaN still returns its sum, and only all-NaN groups return NaN:
df.set_index('col1').sum(level=0, min_count=1).reset_index()
Thanks to @piRSquared, @Alollz, and @anky_91:
You can do this without setting and resetting the index:
d = {'col1': ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
     'col2': [1, 2, 3, 4, 5, 6, np.nan, np.nan, np.nan]}
df = pd.DataFrame(data=d)
df.groupby("col1", as_index=False).sum(min_count=1)
Output:
  col1  col2
0    A   6.0
1    B  15.0
2    C   NaN
Make the call to sum use the parameter skipna=False.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sum.html
That link should provide the documentation you need, and I expect it will fix your problem.
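For context (an editor's note): skipna is a parameter of DataFrame.sum / Series.sum itself, which is why the apply-based answer above passes it through per group. A minimal illustration:
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan])
print(s.sum())              # 1.0 -- NaN is skipped by default
print(s.sum(skipna=False))  # nan -- NaN propagates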
