Pandas groupby does not set index correctly - python

I'm a bit confused by the behaviour of the pandas groupby function:
df = pd.DataFrame({"row_id": [1, 2, 3], "group": [1, 2, 2], "col1": [1, 100, 2], "col2": [2, 200, 2]})
for i, e in df.groupby("group", as_index=True):
    print(e.index)
Here I would expect to get the "group" column as the new index. However, the print returns:
Int64Index([0], dtype='int64')
Int64Index([1, 2], dtype='int64')
So the "old" index has been kept, and the "group" column is still in place as a separate column.
Shouldn't the result be:
Int64Index([1], dtype='int64')
Int64Index([2, 2], dtype='int64')
I don't understand the logic, especially as as_index=False doesn't change anything.
P.S. I am using pandas 1.3.5.

as_index only changes the output format of aggregations; it has no effect on iteration. From the documentation:
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
>>> df.groupby('group', as_index=False).first()
   group  row_id  col1  col2
0      1       1     1     2
1      2       2   100   200
>>> df.groupby('group', as_index=True).first()
       row_id  col1  col2
group
1           1     1     2
2           2   100   200
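If you actually want the group labels as each chunk's index while iterating, one option (a minimal sketch, not the only way) is to set "group" as the index before grouping:
import pandas as pd

df = pd.DataFrame({"row_id": [1, 2, 3], "group": [1, 2, 2],
                   "col1": [1, 100, 2], "col2": [2, 200, 2]})

# Each iterated chunk now carries the group label as its index:
for label, chunk in df.set_index("group").groupby(level="group"):
    print(chunk.index)
# Int64Index([1], dtype='int64', name='group')
# Int64Index([2, 2], dtype='int64', name='group')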

Related

pandas: how to merge columns irrespective of index

I have two dataframes with meaningless indexes but carefully curated row order, and I want to merge them while preserving that order. So, for example:
>>> df1
   First
a      1
b      3
and
>>> df2
   Second
c       2
d       4
After merging, what I want to obtain is this:
>>> Desired_output
                    First  Second
AnythingAtAll           1       2  # <--- Row names are meaningless.
SeriouslyIDontCare      3       4  # <--- But the ORDER of the rows is critical and must be preserved.
The fact that I've got row indices "a/b" and "c/d" is irrelevant; what is crucial is the order in which the rows appear. Every version of "join" I've seen requires me to manually reset indices, which seems really awkward, and I don't trust that it won't screw up the ordering. I thought concat would work, but I get this:
>>> pd.concat([df1, df2], axis=1, ignore_index=True)
     0    1
a  1.0  NaN
b  3.0  NaN
c  NaN  2.0
d  NaN  4.0
# ^ obviously not what I want
Even when I explicitly declare ignore_index.
How do I "overrule" the indexing and force the columns to be merged with the rows kept in the exact order that I supply them?
Edit:
Note that if I assign another column, the results are all "NaN".
>>> df1["second"]=df2["Second"]
>>> df1
First second
a 1 NaN
b 3 NaN
This was screwing me up, but thanks to the suggestions from jsmart and topsail, you can sidestep the index alignment by directly accessing the values in the column:
df1["second"]=df2["Second"].values
>>> df1
First second
a 1 2
b 3 4
^ Solution
This should also work I think:
df1["second"] = df2["second"].values
It would keep the index from the first dataframe, but since you have index values in there such as "AnythingAtAll" and "SeriouslyIDontCare", I guess any index values whatsoever are acceptable.
Basically, we are just adding the values from your series as a new column to the first dataframe.
Here's a test example similar to your described problem:
import pandas as pd

# -----------
# sample data
# -----------
df1 = pd.DataFrame(
    {
        'x': ['a', 'b'],
        'First': [1, 3],
    })
df1.set_index("x", drop=True, inplace=True)

df2 = pd.DataFrame(
    {
        'x': ['c', 'd'],
        'Second': [2, 4],
    })
df2.set_index("x", drop=True, inplace=True)

# ---------------------------------------------
# Add series as a new column to first dataframe
# ---------------------------------------------
df1["Second"] = df2["Second"].values
Result is:
   First  Second
x
a      1       2
b      3       4
The goal is to combine data based on position (not by Index). Here is one way to do it:
import pandas as pd

# create data frames df1 and df2
df1 = pd.DataFrame(data={'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame(data={'Second': [2, 4]}, index=['c', 'd'])

# add a column to df1 -- add by position, not by Index
df1['Second'] = df2['Second'].values
print(df1)
   First  Second
a      1       2
b      3       4
And you could create a completely new data frame like this:
data = {'1st': df1['First'].values, '2nd': df1['Second'].values}
print(pd.DataFrame(data))
   1st  2nd
0    1    2
1    3    4
ignore_index controls whether the output dataframe keeps the original labels along the concatenation axis. If it is True, the original labels are discarded and replaced by 0 to n-1; with axis=1 that applies to the columns, which is why you see the column headers 0 and 1 in your result.
You can try
out = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print(out)
   First  Second
0      1       2
1      3       4
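Another option (a sketch, assuming both frames have the same number of rows) is to relabel df2's rows with df1's index before concatenating, which preserves df1's original index instead of resetting it:
import pandas as pd

df1 = pd.DataFrame({'First': [1, 3]}, index=['a', 'b'])
df2 = pd.DataFrame({'Second': [2, 4]}, index=['c', 'd'])

# set_axis replaces df2's row labels, so concat aligns the rows 1:1:
out = pd.concat([df1, df2.set_axis(df1.index)], axis=1)
print(out)
#    First  Second
# a      1       2
# b      3       4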

Pandas: Sort by sum of 2 columns

I have a DataFrame:
   COL1  COL2
0     1     1
1     3     1
2     1     3
I need to sort by COL1 + COL2.
The key=lambda col: f(col) argument of sort_values(...) lets you sort by a transformed column, but in the described case I need to sort on the basis of 2 columns. So it would be nice if one could provide a key function over 2 or more columns, but I don't know whether such a thing exists.
So, how can I sort its rows by sum COL1 + COL2?
Thank you for your time!
Assuming a unique index, you can also conveniently use the key parameter of sort_values to pass a callable to apply to the by column. Here we can add the other column:
df.sort_values(by='COL1', key=df['COL2'].add)
We can even generalize to any number of columns using sort_index:
df.sort_index(key=df.sum(axis=1).get)
Output:
   COL1  COL2
0     1     1
2     1     3
1     3     2
Used input:
data = {"COL1": [1, 3, 1], "COL2": [1, 2, 3]}
df = pd.DataFrame(data)
This does the trick:
data = {"Column 1": [1, 3, 1], "Column 2": [1, 2, 3]}
df = pd.DataFrame(data)
sorted_indices = (df["Column 1"] + df["Column 2"]).sort_values().index
df.loc[sorted_indices, :]
I just created a series holding the sum of both columns, sorted it, took the sorted indices, and then used .loc with those indices to reorder the dataframe.
(I changed the data a little so you can see the sorting in action. With the data you provided, the sorted result would have looked the same as the original.)
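If you'd rather avoid sort_values entirely, here is a small sketch of the same idea using argsort, which is purely positional and therefore independent of the index:
import pandas as pd

df = pd.DataFrame({"COL1": [1, 3, 1], "COL2": [1, 2, 3]})

# argsort returns the positions that would sort the row sums;
# iloc then reorders the rows by those positions.
print(df.iloc[(df["COL1"] + df["COL2"]).argsort()])
#    COL1  COL2
# 0     1     1
# 2     1     3
# 1     3     2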

Python - count number of elements that are equal between two columns of two dataframes

I have two dataframes, df1 and df2, which contain the columns col1 and col2 respectively. I would like to calculate the number of elements in column col1 of df1 that are equal to elements of col2 of df2. How can I do that?
You can use Series.isin: df1.col1.isin(df2.col2).sum():
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})
nb_common_elements = df1.col1.isin(df2.col2).sum()
assert nb_common_elements == 3
Be cautious depending on your use case because:
df1 = pd.DataFrame({'col1': [1, 1, 1, 2, 7]})
df1.col1.isin(df2.col2).sum()
Would return 4 and not 2, because all the 1s from df1.col1 are present in df2.col2. If that's not the expected behaviour, you could drop duplicates from df1.col1 before testing the intersection size:
df1.col1.drop_duplicates().isin(df2.col2).sum()
Which in this example would return 2.
To better understand why this is happening, you can have a look at what .isin is returning:
df1['isin df2.col2'] = df1.col1.isin(df2.col2)
Which gives:
   col1  isin df2.col2
0     1           True
1     1           True
2     1           True
3     2          False
4     7           True
Now .sum() adds up the booleans from column isin df2.col2 (a total of 4 True).
I assume you're using pandas.
One way is to simply use pd.merge: merge the two frames on the matching columns (with left_on/right_on, since the column names differ) and take the length of the result, as sketched below.
len(pd.merge(df1, df2, left_on="col1", right_on="col2"))
Pandas does an inner merge by default.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html
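A minimal runnable sketch of the merge-based count, reusing the sample frames from the isin answer above (note that an inner merge repeats rows for duplicate keys, so deduplicate first if you only want distinct matches):
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({'col2': [1, 3, 5, 7]})

# The inner merge keeps only matching rows; its length is the count.
merged = pd.merge(df1, df2, left_on="col1", right_on="col2")
print(len(merged))  # 3

# With duplicate keys on either side, drop duplicates first:
n_distinct = len(pd.merge(df1.drop_duplicates(), df2.drop_duplicates(),
                          left_on="col1", right_on="col2"))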

Pandas DataFrame filter

My question is about the pandas.DataFrame.filter command. It seems that pandas writes any changes to a copy of the data frame. How am I able to write to the data frame itself?
In other words:
import pandas as pd

d = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=d)
df.filter(regex='col1').iloc[0] = 10
Output:
   col1  col2
0     1     3
1     2     4
Desired Output:
   col1  col2
0    10     3
1     2     4
I think you need to extract the column names and then use the loc or iloc functions:
cols = df.filter(regex='col1').columns
df.loc[0, cols] = 10
Or:
df.iloc[0, df.columns.get_indexer(cols)] = 10
print(df)
   col1  col2
0    10     3
1     2     4
You cannot use the filter function for this, because it returns a subset Series/DataFrame that may have its data as a view. That's why a SettingWithCopyWarning is possible there (or an error is raised, if you set the option).
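A quick sketch of why the assignment through filter doesn't stick: the subset is a separate object, so writing into it leaves df untouched.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

subset = df.filter(regex='col1')  # a new object, not a writable view of df
subset.iloc[0] = 10               # modifies only the subset

print(df)  # df is unchanged
#    col1  col2
# 0     1     3
# 1     2     4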

append dictionary to data frame

I have a function, which returns a dictionary like this:
{'truth': 185.179993, 'day1': 197.22307753038834, 'day2': 197.26118010160317, 'day3': 197.19846975345905, 'day4': 197.1490578795196, 'day5': 197.37179265011116}
I am trying to append this dictionary to a dataframe like so:
output = pd.DataFrame()
output.append(dictionary, ignore_index=True)
print(output.head())
Unfortunately, the printing of the dataframe results in an empty dataframe. Any ideas?
You don't assign the result back: append returns a new dataframe instead of modifying output in place.
output = pd.DataFrame()
output = output.append(dictionary, ignore_index=True)
print(output.head())
The previous answer (user alex, answered Aug 9 2018 at 20:09) now triggers a warning saying that appending to a dataframe is deprecated (DataFrame.append has since been removed in pandas 2.0).
A way to do it is to transform the dictionary into a dataframe and then concatenate the dataframes:
output = pd.DataFrame()
df_dictionary = pd.DataFrame([dictionary])
output = pd.concat([output, df_dictionary], ignore_index=True)
print(output.head())
I always do it this way because this syntax is less confusing for me.
I believe the concat method is the recommended one, though.
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
d = {'col1': 5, 'col2': 6}
df.loc[len(df)] = d
>>> df
   col1  col2
0     1     3
1     2     4
2     5     6
Note that the iloc method won't work this way, since .iloc cannot enlarge a dataframe (see the sketch below).
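A short sketch of the difference: .loc with a new label enlarges the frame, while .iloc is purely positional and cannot.
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
d = {'col1': 5, 'col2': 6}

df.loc[len(df)] = d       # enlarges df: adds a new row labelled 2

try:
    df.iloc[len(df)] = d  # position 3 is out of range
except IndexError as e:
    print(e)              # "iloc cannot enlarge its target object"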
