Reading data from Dataframe using other Dataframe data as iloc inputs - python

I'm trying to grab a value from an existing df using iloc coordinates stored in another df, then store that value in the second df.
df_source (source):
Category1 Category2 Category3
Bucket1 100 200 300
Bucket2 400 500 600
Bucket3 700 800 900
df_coord (coordinates):
Index_X Index_Y
0 0
1 1
2 2
Want:
df_coord
Index_X Index_Y Added
0 0 100
1 1 500
2 2 900
I'm more familiar with analytical languages like SAS, where data is processed one line at a time, so the natural approach for me was this:
df_coord['Added'] = df_source.iloc[df_coord['Index_X']][df_coord['Index_Y']]
When I tried this I got an error, which I understand to mean that df_coord['Index_X'] does not refer to the data on the same row. I have seen a few posts where an axis=1 option worked for their respective cases, but I can't figure out how to apply it to this case. Thank you.

You could index the underlying ndarray, i.e. the array exposed by the values attribute, using the columns of df_coord as the first and second axis indexers:
df_coord['Added'] = df_source.values[df_coord.Index_X, df_coord.Index_Y]
Index_X Index_Y Added
0 0 0 100
1 1 1 500
2 2 2 900
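For reference, here is a self-contained version of that approach (a sketch, assuming the frames shown in the question; .to_numpy() is the newer spelling of .values):
import pandas as pd

# Rebuild the two frames from the question.
df_source = pd.DataFrame([[100, 200, 300],
                          [400, 500, 600],
                          [700, 800, 900]],
                         index=['Bucket1', 'Bucket2', 'Bucket3'],
                         columns=['Category1', 'Category2', 'Category3'])
df_coord = pd.DataFrame({'Index_X': [0, 1, 2], 'Index_Y': [0, 1, 2]})

# Positional lookup on the underlying ndarray: Index_X picks the row
# position and Index_Y the column position, element-wise.
df_coord['Added'] = df_source.to_numpy()[df_coord['Index_X'], df_coord['Index_Y']]
print(df_coord)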

Related

Pandas grouping with filtering on other columns

I have the following dataframe in Pandas:
name  value  in  out
A        50   1    0
A       -20   0    1
B       150   1    0
C        10   1    0
D       500   1    0
D      -250   0    1
E       800   1    0
There are at most 2 observations for each name: one for in and one for out.
If there is only an in record for a name, there is only one observation for it.
You can create this dataset with this code:
data = {
    'name': ['A','A','B','C','D','D','E'],
    'values': [50,-20,150,10,500,-250,800],
    'in': [1,0,1,1,1,0,1],
    'out': [0,1,0,0,0,1,0]
}
df = pd.DataFrame.from_dict(data)
I want to sum the value column for each name, but only if the name has both an in and an out record. In other words, only when a unique name has exactly 2 rows.
The result should look like this:
name  value
A        30
D       250
If I run the following code, I get all the results without filtering based on in and out:
df.groupby('name').sum()
name  value
A        30
B       150
C        10
D       250
E       800
How do I add the aforementioned filtering based on these columns?
Maybe you can try something with groupby, agg, and query (like below):
df.groupby('name').agg({'name':'count', 'values': 'sum'}).query('name>1')[['values']]
Output:
values
name
A 30
D 250
You could also use .query('name==2') above if you like, but since a name can occur at most twice, .query('name>1') returns the same result.
IIUC, you could filter before aggregation:
# check that we have exactly 1 in and 1 out per group
mask = df.groupby('name')[['in', 'out']].transform('sum').eq([1,1]).all(1)
# slice the correct groups and aggregate
out = df[mask].groupby('name', as_index=False)['values'].sum()
Or, you could filter afterwards (maybe less efficient if you have a lot of groups that would be filtered out):
(df.groupby('name', as_index=False).sum()
   .loc[lambda d: d['in'].eq(1) & d['out'].eq(1), ['name', 'values']]
)
output:
name values
0 A 30
1 D 250
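For completeness, a variant that filters on the row count per name instead of the in/out sums, using the df built from the question's data dict (a sketch, relying on the question's guarantee that each name has at most one in and one out row):
# Keep only names that appear twice (i.e. have both an in and an out row),
# then sum the values per name.
has_both = df.groupby('name')['name'].transform('size').eq(2)
out = df[has_both].groupby('name', as_index=False)['values'].sum()
print(out)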

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

This is closely related to the question I asked earlier here: Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help. Very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value but not the Row/Header/Axis(?). Additionally, no matter what I try I just can't seem to get rid of the blank row between the headers and the DataTable.
I (think) I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style style object functionality properly.
import pandas as pd
import numpy as np
data = [
    ['AAA',2,'X',3,5,1],
    ['AAA',2,'Y',1,10,2],
    ['AAA',2,'Z',2,15,3],
    ['BBB',3,'X',3,15,3],
    ['BBB',3,'Y',1,10,2],
    ['BBB',3,'Z',2,5,1],
    ['CCC',1,'X',3,10,2],
    ['CCC',1,'Y',1,15,3],
    ['CCC',1,'Z',2,5,1],
]
df = pd.DataFrame(data, columns=['Name','Name_Rank','Bucket','Bucket_Rank','Price','Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price','Change'],
                      index=['Bucket_Rank','Bucket'],
                      columns=['Name_Rank','Name'], aggfunc=np.mean)
         .swaplevel(1,0,axis=1)
         .sort_index(level=0,axis=1)
         .reindex(['Price','Change'],level=1,axis=1)
         .swaplevel(2,1,axis=1)
         .rename_axis(columns=[None,None,None])
      ).reset_index().drop('Bucket_Rank',axis=1).set_index('Bucket').rename_axis(columns=[None,None,None])
which looks like this:
            1             2             3
          CCC           AAA           BBB
        Price Change  Price Change  Price Change
Bucket
Y          15      3     10      2     10      2
Z           5      1     15      3      5      1
X          10      2      5      1     15      3
Ok, so...
A) How do I get rid of the row/header/axis(?) that used to be "Name_Rank" (e.g. the integer "Sorting Values" 1, 2, 3)? I figured out a hack where the df is exported to XLS and re-imported with header=(1,2), but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should "rename_axis=[None]" but this doesn't seem to work no matter which order I try.
C) Is there a way to set the Header(s) such that the both what used to be "Name" and "Price/Change" rows are Headers so that the .style functionality can be employed to format them separate from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method:
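A sketch of what that could look like with the df2 built above (assuming, as in the output shown, that the integer Name_Rank values ended up as the top column level and that the blank row comes from the "Bucket" index name):
styler = (
    df2.style
       .hide(axis="columns", level=0)   # (A) hide the integer Name_Rank header level
       .hide(axis="index", names=True)  # (B) hide the "Bucket" index-name row above the data
)
styler  # in a notebook this renders without the sorting values or the blank row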

How to divide a column by the number of rows with equal id in a dataframe?

I have a DataFrame that looks like this:
Id  Price
1     300
1     300
1     300
2     400
2     400
3     100
My goal is to divide the price for each observation by the number of rows with the same Id number. The expected output would be:
Id  Price
1     100
1     100
1     100
2     200
2     200
3     100
However I am having some issues finding the most optimized way to conduct this operation. I did manage to do this using the code below, but it takes more than 5 minutes to run (as I have roughly 200k observations):
# For each row in the dataset, get the number of rows with the same Id and store them in a list
sum_of_each_id = []
for i in df['Id'].to_numpy():
    sum_of_each_id.append(len(df[df['Id']==i]))
# Creating an auxiliary column in the dataframe, with the number of rows associated to each Id
df['auxiliar'] = sum_of_each_id
# Dividing the price by the number of rows with the same Id
df['Price'] = df['Price']/df['auxiliar']
Could you please let me know what would be the best way to do this?
Try groupby with transform.
Make groups on the basis of Id using groupby('Id').
Get the count of values in each group for every row using transform('count').
Divide df["Price"] by that series, which contains the counts.
df = pd.DataFrame({"Id":[1,1,1,2,2,3],"Price":[300,300,300,400,400,100]})
df["new_Price"] = (df["Price"]/df.groupby("Id")["Price"].transform("count")).astype('int')
print(df)
Id Price new_Price
0 1 300 100
1 1 300 100
2 1 300 100
3 2 400 200
4 2 400 200
5 3 100 100
import pandas as pd
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "price": [300, 300, 300, 400, 400, 100]})
df.set_index("id") / df.groupby("id").count()
Explanation:
df.groupby("id").count() calculates the number of rows with the same Id number. the resulting DataFrame will have an Id as index.
df.set_index("id") will set the Id column as index
Then we simply divide the frames and pandas will match the numbers by the index.
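If you want the result back in the original row-per-observation shape, a small sketch (assuming id should return to being a regular column; note the division produces floats):
result = (df.set_index("id") / df.groupby("id").count()).reset_index()
print(result)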

Getting a merged dataframe from two dataframe

I have two dataframes:
Source dataframe
index A x y
1 1 100 100
2 1 100 400
3 1 100 700
4 1 300 200
5 2 50 200
6 2 100 200
7 2 800 400
8 2 1200 800
Destination dataframe
index A x y
1 1 105 100
2 1 110 410
3 1 110 780
4 2 1000 90
For each row in the source dataframe I need to find the values nearest to it in the destination dataframe, grouped by the 'A' column. The resultant dataframe should be as below (just a sample, taking only one row from source (index 1) and the corresponding nearest ones from destination in that group (A == 1)):
A x_1 y_1 x_2 y_2 nearness(approx.)
1 100 100 105 100 95
1 100 100 110 410 50
1 100 100 110 780 20
NOTE: The nearness column is just a mere representation and will be a calculation function in the future based on x and y. What I need is row wise merging between the two dataframe.
This might be a basic question, but can someone explain how merge works?
pd.merge(source_df, dest_df, on='A')
Basically, it will go through every item of the left dataframe, look for its key in the right dataframe, and create an entry in the merged dataframe (it creates an entry for each time the key is found in the right dataframe; you can check for unexpected duplication with the validate keyword).
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html for more info.
source_df.merge(dest_df, on='A')
What it does is match source_df's column 'A' with dest_df's column 'A' (when 'on' is specified), much like a SQL join. If 'on' is not given, it joins on the intersection of the common column names; you can join on the index with the left_index/right_index arguments, or on differently named columns with left_on and right_on.
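To make the row-wise pairing concrete for the question's data, a sketch using the frames from the question (the nearness formula here is only a placeholder Euclidean distance, since the real function is still to be defined):
import pandas as pd

source_df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 2],
                          'x': [100, 100, 100, 300, 50, 100, 800, 1200],
                          'y': [100, 400, 700, 200, 200, 200, 400, 800]})
dest_df = pd.DataFrame({'A': [1, 1, 1, 2],
                        'x': [105, 110, 110, 1000],
                        'y': [100, 410, 780, 90]})

# Every source row is paired with every destination row that shares the same 'A'.
merged = source_df.merge(dest_df, on='A', suffixes=('_1', '_2'))

# Placeholder metric (assumption): Euclidean distance between the paired points.
merged['nearness'] = ((merged['x_1'] - merged['x_2'])**2
                      + (merged['y_1'] - merged['y_2'])**2) ** 0.5
print(merged.head())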

Aggregating data using python

I have some data that I want to both sum and count based upon a certain field. My data looks like this
Value ID Object
100 ABD Type1
200 ABD Type1
400 ABD Type2
200 BCE Type1
100 BCE Type1
800 JHO Type3
600 TVM Type4
And I am trying to get to this, where I have counted the number of unique Objects related to each ID and also summed the total Value related to that ID:
ValueSum ID CountObject
700 ABD 2
300 BCE 1
800 JHO 1
600 TVM 1
What I have been looking at is using the .groupby() function along with .count() and .sum(), but I can't seem to get things in the right format.
Any help is much appreciated.
Thanks!
You can pass a dict of the funcs to perform on multiple columns on your df using groupby and agg:
In [289]:
gp = df.groupby('ID', as_index=False).agg({'Value':sum, 'Object':'nunique'})
gp = gp.rename(columns={'Value':'ValueSum', 'Object':'ObjectCount'})
gp
Out[289]:
ID ValueSum ObjectCount
0 ABD 700 2
1 BCE 300 1
2 JHO 800 1
3 TVM 600 1
Here we pass a dict with the corresponding column names and the func to perform; for the counting we use nunique, which returns the number of unique values.
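As a side note, the same result can be written with named aggregation (available since pandas 0.25), which labels the output columns in one step; a sketch assuming the same df:
gp = df.groupby('ID', as_index=False).agg(
    ValueSum=('Value', 'sum'),
    ObjectCount=('Object', 'nunique'),
)
print(gp)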
