Python pandas find difference and update - python

I need to compare new datasets to existing SQL datasets and update them when new information presents itself.
data from db:
dfa
id foo bar
1 2 "home"
2 5 "work"
3 6 "car"
4 99 "people"
new data:
dfb
id foo bar
1 22 "home"
2 5 "work"
8 8 "pet"
4 99 "humans"
What I need is a way to recognize that for id 1 there is a different value in column foo, and that for id 4 there is a new value in column bar, and then update the dataframe from the db before sending it back to the db. I'd like to do this in a runtime-efficient manner.
dfout
id foo bar
1 22 "home"
2 5 "work"
3 6 "car"
4 99 "humans"
I have searched the web for a solution. But I can't find my specific case and I have trouble fitting what I do find into my case. Can someone explain how I would do this?
These seem related but deal with non-overlapping data and entirely new rows.
Pandas sort columns and find difference
Python Pandas - Find difference between two data frames

Use DataFrame.update matched by id, so first convert the id column to the index in both DataFrames:
# align both DataFrames on id so update() matches rows by key
df1 = dfa.set_index('id')
df2 = dfb.set_index('id')
# overwrite df1 with non-NA values from df2; rows only in dfb (id 8) are ignored
df1.update(df2)
# restore id as a column and the original dtypes (update() upcasts to float)
dfa = df1.reset_index().astype(dfa.dtypes)
print(dfa)
id foo bar
0 1 22 home
1 2 5 work
2 3 6 car
3 4 99 humans
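If it also helps to see exactly which cells changed before writing back (a sketch, not part of the answer above; DataFrame.compare needs pandas >= 1.1), diff the rows whose id exists in both frames before calling update:
# run this before df1.update(df2): compare rows whose id exists in both frames
common = df1.index.intersection(df2.index)
print(df1.loc[common].compare(df2.loc[common]))  # foo differs for id 1, bar for id 4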

Related

Pandas groupby over consecutive duplicates

Given a table,
Id  Value
1   1
2   2
2   3
3   4
4   5
4   6
2   8
2   3
1   1
Instead of a simple groupby('Id').agg({'Value':'sum'}) which would perform aggregation over all the instances and yield a table with only four rows, I wish the result to aggregate only over the nearby instances and hence maintaining the order the table was created.
The expected output is the following:
Id  Value
1   1
2   5
3   4
4   11
2   11
1   1
If not possible with pandas groupby, any other kind of trick would also be greatly appreciated.
Note: If the above example is not helpful, basically what I want is to somehow compress the table with aggregating over 'Value'. The aggregation should be done only over the duplicate 'Id's which occur one exactly after the other.
Unfortunately, the answers from eshirvana and wwnde don't seem to work for a long dataset. Inspired by wwnde's answer, I found a workaround:
# build a group label that increments whenever the Id changes
new = []
i = -1
seen = None
for item in df.Id:
    if item != seen:
        i += 1
        seen = item
    new.append(i)
df['temp'] = new
Now, we group by the 'temp' column:
df.groupby('temp').agg({'Id':max, 'Value':sum}).reset_index(drop=True)
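For reference, a common vectorized way to build the same group key is to flag every row where Id differs from the previous row and take a cumulative sum. A sketch, worth verifying on the long datasets where the posted answers broke down:
# True wherever Id changes; the cumulative sum turns change points into group numbers
group = (df['Id'] != df['Id'].shift()).cumsum()
out = df.groupby(group).agg({'Id': 'max', 'Value': 'sum'}).reset_index(drop=True)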

Pandas DataFrame MultiIndex Pivot - Remove Empty Headers and Axis Rows

This is closely related to the question I asked earlier here: Python Pandas Dataframe Pivot Table Column and Values Order. Thanks again for the help, it is very much appreciated.
I'm trying to automate a report that will be distributed via email to a large audience so it needs to look "pretty" :)
I'm having trouble resetting/removing the Indexes and/or Axis post-Pivots to enable me to use the .style CSS functions (i.e. creating a Styler Object out of the df) to make the table look nice.
I have a DataFrame where two of the principal fields (in my example here they are "Name" and "Bucket") will be variable. The desired display order will also change (so it can't be hard-coded) but it can be derived earlier in the application (e.g. "Name_Rank" and "Bucket_Rank") into Integer "Sorting Values" which can be easily sorted (and theoretically dropped later).
I can drop the column Sorting Value but not the Row/Header/Axis(?). Additionally, no matter what I try I just can't seem to get rid of the blank row between the headers and the DataTable.
I (think) I need to set the Index = Bucket and Headers = "Name" and "TDY/Change" to use the .style style object functionality properly.
import pandas as pd
import numpy as np
data = [
    ['AAA', 2, 'X', 3,  5, 1],
    ['AAA', 2, 'Y', 1, 10, 2],
    ['AAA', 2, 'Z', 2, 15, 3],
    ['BBB', 3, 'X', 3, 15, 3],
    ['BBB', 3, 'Y', 1, 10, 2],
    ['BBB', 3, 'Z', 2,  5, 1],
    ['CCC', 1, 'X', 3, 10, 2],
    ['CCC', 1, 'Y', 1, 15, 3],
    ['CCC', 1, 'Z', 2,  5, 1],
]
df = pd.DataFrame(data, columns=['Name', 'Name_Rank', 'Bucket',
                                 'Bucket_Rank', 'Price', 'Change'])
display(df)
  Name  Name_Rank Bucket  Bucket_Rank  Price  Change
0  AAA          2      X            3      5       1
1  AAA          2      Y            1     10       2
2  AAA          2      Z            2     15       3
3  BBB          3      X            3     15       3
4  BBB          3      Y            1     10       2
5  BBB          3      Z            2      5       1
6  CCC          1      X            3     10       2
7  CCC          1      Y            1     15       3
8  CCC          1      Z            2      5       1
Based on the prior question/answer I can pretty much get the table into the right format:
df2 = (pd.pivot_table(df, values=['Price', 'Change'], index=['Bucket_Rank', 'Bucket'],
                      columns=['Name_Rank', 'Name'], aggfunc=np.mean)
       .swaplevel(1, 0, axis=1)
       .sort_index(level=0, axis=1)
       .reindex(['Price', 'Change'], level=1, axis=1)
       .swaplevel(2, 1, axis=1)
       .rename_axis(columns=[None, None, None])
      ).reset_index().drop('Bucket_Rank', axis=1).set_index('Bucket') \
       .rename_axis(columns=[None, None, None])
which looks like this:
            1             2             3
          CCC           AAA           BBB
        Price Change  Price Change  Price Change
Bucket
Y          15      3     10      2     10      2
Z           5      1     15      3      5      1
X          10      2      5      1     15      3
Ok, so...
A) How do I get rid of the Row/Header/Axis(?) that used to be "Name_Rank" (i.e. the integer "Sorting Values" 1,2,3)? I figured out a hack where the df is exported to XLS and re-imported with Header=(1,2), but that can't be the best way to accomplish the objective.
B) How do I get rid of the blank row above the data in the table? From what I've read online it seems like you should "rename_axis=[None]" but this doesn't seem to work no matter which order I try.
C) Is there a way to set the Header(s) such that the both what used to be "Name" and "Price/Change" rows are Headers so that the .style functionality can be employed to format them separate from the data in the table below?
Thanks a lot for whatever suggestions anyone might have. I'm totally stuck!
Cheers,
Devon
In pandas 1.4.0 the options for A and B are directly available using the Styler.hide method:
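For example, a minimal sketch assuming pandas >= 1.4 and the df2 built above (the axis and level arguments may need adjusting to your exact header layout):
styled = (df2.style
             .hide(axis='columns', level=0)   # (A) drop the integer "Name_Rank" header level
             .hide(axis='index', names=True)  # (B) drop the row carrying the "Bucket" axis name
         )
styled  # render in a notebook, or styled.to_html() for the emailed report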

Python: Map a string value from one dataframe to another dataframe using a id, while creating a new column in the older dataframe [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 1 year ago.
I turned a JSON file into a DataFrame, but I am unsure how to map a certain value from the JSON DataFrame onto the existing DataFrame I have.
df1 = # (the 2nd column doesn't matter, it's just there)
category_id  tags
1            a
1            a
10           b
10           c
40           d
df2 (json) =
id  title
1   film
2   music
3   travel
4   cooking
5   dance
I would like to make a new column in df1 that maps the titles from df2 onto df1 according to the category_id. I am sorry, I am new to Python programming. I know I can hard-code the dictionary and key values and go from there, but I was wondering if there is an easier way to do this with Python/pandas.
You can use pandas.Series.map() which maps values of Series according to input correspondence.
df1['title'] = df1['category_id'].map(df2.set_index('id')['title'])
# print(df1)
   category_id tags  title
0            1    a   film
1            1    a   film
2           10    b    NaN
3           10    c    NaN
4           40    d    NaN
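If you prefer a merge (a sketch along the lines of the linked duplicate rather than the answer above), renaming df2's id column and left-joining gives the same result:
# left-join df2 onto df1; category_ids with no matching id get NaN in title
df1 = df1.merge(df2.rename(columns={'id': 'category_id'}), on='category_id', how='left')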

Python: Can you check how many times a unique combination of two column values appears in another dataframe?

I am trying to see how many times a unique combination of two column values appears in another dataframe, and to add it as a new column in one line. I have a reference table of the unique combinations of the ID and Desc fields. I also have a table that holds all active occurrences of those combinations.
ref_table                  active_data
   ID  Desc                   ID  Desc
0   1  Windows            0    1  Windows
1   1  Linux              1    1  Windows
2   2  Linux              2    1  Linux
3   3  Network            3    2  Linux
4   4  Automation         4    3  Network
                          5    3  Network
                          6    3  Network
                          7    4  Automation
I'd like to add to the ref_table the count of the unique combinations of the ID and Desc fields that appears in active_data like so:
ref_table
ID Desc Count
0 1 Windows 2
1 1 Linux 1
2 2 Linux 1
3 3 Network 3
4 4 Automation 1
I recognize this can be accomplished by performing pd.merge or join. However, if possible, I would like to do it with one line, and if I was just concerned with the count of one column like ID, I know it can be done with:
ref_table['Count'] = ref_table['ID'].map(active_data['ID'].value_counts())
Trying to extend this to look at both the ID AND Desc columns using:
ref_table['Count'] = ref_table[['ID', 'Desc']].apply(active_data[['ID', 'Desc']].value_counts()) produces an error, KeyError: "None of [Index([3, 'Network'], dtype='object')] are in the [index]". Ideally I would like to use the value_counts solution, but cannot figure it out with two columns.
You can do a merge on a groupby:
ref_table.merge(active_data.groupby(['ID', 'Desc']).size().reset_index(name='Count'),
                on=['ID', 'Desc'], how='left')
Or you can merge, then groupby:
(ref_table.merge(active_data, on=['ID', 'Desc'], how='left')
          .groupby(['ID', 'Desc'])['ID'].count()
          .reset_index(name='Count')
)
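If you would still like something closer to your value_counts attempt, DataFrame.value_counts accepts multiple columns (pandas >= 1.1), and the resulting MultiIndex of counts can be joined back onto ref_table. A minimal sketch, not tested against your real data:
# count each (ID, Desc) pair in active_data, then align the counts with ref_table
pair_counts = active_data[['ID', 'Desc']].value_counts().rename('Count')
ref_table = ref_table.join(pair_counts, on=['ID', 'Desc'])
ref_table['Count'] = ref_table['Count'].fillna(0).astype(int)  # 0 for unseen pairs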

Get values of two different columns based on a condition in a third column

I have a certain condition (Incident = yes) and I want to know the values in two columns fulfilling this condition. I have a very big data frame (many rows and many columns) and I am looking for a "screening" function.
To illustrate, here is an example with the df (which has many more columns than shown):
Repetition Step Incident Test1 Test2
1 1 no 10 20
1 1 no 9 11
1 2 yes 9 19
1 2 yes 11 20
1 2 yes 12 22
1 3 yes 9 18
1 3 yes 8 18
What I would like to get as an answer is
Repetition Step
1 2
1 3
If I only wanted to know the Step, I would use the following command:
df[df.Incident == 'yes'].Step.unique()
Is there a similar command to get the values of two columns for a specific condition?
Thanks for the help! :-)
You could use query for the condition, select the columns of interest, and finally remove the duplicate rows:
df.query('Incident=="yes"').filter(['Repetition','Step']).drop_duplicates()
OR
you could use pandas' loc method: pass the condition as the row selector and the columns you are interested in as the column selector, then drop the duplicates.
df.loc[df.Incident=="yes",['Repetition','Step']].drop_duplicates()
