Subset of columns from another data frame - python

I have a dataframe (G) whose columns are "Client" and "TIV".
I have another dataframe (B) whose columns are "Client", "TIV", "A", "B", "C".
I want to select all rows from B whose clients are not in G. In other words, if a row in B has a Client that also exists in G, I want to delete it.
I did this:
x = B[B['Client'] != G['Client']]
But it raised the error "can only compare identically labeled Series objects".
I appreciate your help.

You can use df.isin combined with the ~ operator:
B[~B.Client.isin(G.Client)]
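As a quick sketch of why this works, with made-up Client values (the data below is purely illustrative):

```python
import pandas as pd

# Illustrative data: G holds the clients to exclude, B is the full table
G = pd.DataFrame({"Client": ["acme", "beta"], "TIV": [100, 200]})
B = pd.DataFrame({"Client": ["acme", "gamma", "delta"],
                  "TIV": [100, 300, 400],
                  "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# isin builds a boolean mask per row of B; ~ negates it,
# keeping only rows whose Client does NOT appear in G
x = B[~B["Client"].isin(G["Client"])]
print(x["Client"].tolist())  # ['gamma', 'delta']
```

Unlike `!=`, `isin` does not require the two Series to share an index, which is why it avoids the "identically labeled" error.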

Maybe the following code snippet helps:
import pandas as pd
df1 = pd.DataFrame(data={'Client': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame(data={'Client': [1, 2, 3, 6, 7]})
# Identify which Clients are in df1 and not in df2
clients_diff = set(df1.Client).difference(df2.Client)
df1.loc[df1.Client.isin(clients_diff)]
The idea is to filter df1 down to the clients that are not in df2.


Optimal way to create a column by matching two other columns

The first df I have is one that has station codes and names, along with lat/long (not as relevant), like so:
code name latitude longitude
I have another df with start/end dates for travel times. This df has only the station code, not the station name, like so:
start_date start_station_code end_date end_station_code duration_sec
I am looking to add columns that have the name of the start/end stations to the second df by matching the first df "code" and second df "start_station_code" / "end_station_code".
I am relatively new to pandas, and was looking for a way to optimize doing this as my current method takes quite a while. I use the following code:
for j in range(0, len(df_stations)):
    for i in range(0, len(df)):
        if(df_stations['code'][j] == df['start_station_code'][i]):
            df['start_station'][i] = df_stations['name'][j]
        if(df_stations['code'][j] == df['end_station_code'][i]):
            df['end_station'][i] = df_stations['name'][j]
I am looking for a faster method, any help is appreciated. Thank you in advance.
Use merge. If you are familiar with SQL, merge is the equivalent of a JOIN (with how="left" it is a LEFT JOIN; the default is an inner join):
cols = ["code", "name"]
result = (
    second_df
    .merge(first_df[cols], left_on="start_station_code", right_on="code")
    .merge(first_df[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"code_x": "start_station_code", "code_y": "end_station_code"})
)
The answer by @Code-Different is very nearly correct. However, the columns to be renamed are the name columns, not the code columns. For neatness you will likely want to drop the additional code columns that get created by the merges. Using your names for the dataframes, df and df_stations, the code needed to produce required_df is:
cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=["code_x", "code_y"])
)
As you may notice, each merge gives the dataframe a duplicate 'code' column, and the duplicates get suffixed automatically ('code_x', 'code_y'). This is a built-in default of the merge command; see https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html for more detail.
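A minimal end-to-end sketch of the double merge, using invented station codes and names (the tables below are illustrative, not the asker's data):

```python
import pandas as pd

# Toy lookup table of stations and a toy trips table
df_stations = pd.DataFrame({"code": [10, 20], "name": ["North", "South"]})
df = pd.DataFrame({"start_station_code": [10, 20],
                   "end_station_code": [20, 10],
                   "duration_sec": [300, 420]})

cols = ["code", "name"]
required_df = (
    df
    .merge(df_stations[cols], left_on="start_station_code", right_on="code")
    .merge(df_stations[cols], left_on="end_station_code", right_on="code")
    # the second merge suffixes the overlapping columns as _x / _y
    .rename(columns={"name_x": "start_station", "name_y": "end_station"})
    .drop(columns=["code_x", "code_y"])
)
print(required_df[["start_station", "end_station"]].values.tolist())
```

Each merge does one vectorized lookup, so this replaces the nested loops entirely.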

how to join 2 rows based on column value

I have a dataframe like the picture below:
[screenshot of the original dataframe]
and based on the col_3 value I want to extract this dataframe:
[screenshot of the desired dataframe]
I tried:
df1 = df[df['col_8'] == 2]
df2 = df[df['col_8'] == 3]
df3 = pd.merge(df1, df2, on=['col_3'], how='inner')
but because I have just one row with col_3 = 252, after the merge this row is deleted.
How can I fix the problem, and with which function can I extract the above dataframe?
What are you trying to do?
In your picture, col_3 only has values of 2 and 3. You split the dataframe on the condition col_3 = 2 or 3, and then you want to merge it back.
So, you are trying to slice a dataframe and then rejoin it as it was? Why?
I think this is happening because your df2 is empty, since there is no row with df['col_8'] == 3. An inner join is the intersection of the sets, so if df2 is empty and you try to merge it, the result will be empty too.
I think you are trying to do this:
df2 = df[df['col_8_3'] == 3]
Then, when you take the inner join, it should work and produce one row.
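A small sketch of the failure mode described above, using invented column values (only the column names col_3 and col_8 come from the question):

```python
import pandas as pd

# Illustrative data: no row has col_8 == 3, so the second slice is empty
df = pd.DataFrame({"col_3": [252, 252, 101], "col_8": [2, 2, 2]})
df1 = df[df["col_8"] == 2]
df2 = df[df["col_8"] == 3]          # empty frame

# An inner merge with an empty frame can only return an empty frame
merged = pd.merge(df1, df2, on=["col_3"], how="inner")
print(len(df2), len(merged))  # 0 0
```

This is why the row with col_3 = 252 "disappears": nothing on the right side matches it.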

How to assign values to the rows of a data frame which satisfy certain conditions?

I have two data frames:
df1 = pd.read_excel("test1.xlsx")
df2 = pd.read_excel("test2.xlsx")
I am trying to assign values of df1 to df2 where a certain condition is met (where Col1 in df1 equals Col1 in df2, assign the values of ColY to ColX).
df1.loc[df1['Col1'] == df2['Col1'],'ColX'] = df2['ColY']
This results in an error because df2['ColY'] is the whole column. How do I assign for only the rows that match?
You can use numpy.where:
import numpy as np
df1['ColX'] = np.where(df1['Col1'].eq(df2['Col1']), df2['ColY'], df1['ColX'])
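A sketch of this pattern with invented values (note that both `.eq()` and `np.where` here work positionally, so the two frames are assumed to have the same length and row order):

```python
import numpy as np
import pandas as pd

# Illustrative frames standing in for the asker's Excel data
df1 = pd.DataFrame({"Col1": [1, 2, 3], "ColX": [10, 20, 30]})
df2 = pd.DataFrame({"Col1": [1, 9, 3], "ColY": [111, 222, 333]})

# Where Col1 matches row-by-row, take ColY from df2; otherwise keep ColX
df1["ColX"] = np.where(df1["Col1"].eq(df2["Col1"]), df2["ColY"], df1["ColX"])
print(df1["ColX"].tolist())  # [111, 20, 333]
```

If the rows were not aligned, a merge on Col1 would be the safer tool.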
Since you wanted to assign from df1 to df2, your code should have been:
df2.loc[df1['Col1'] == df2['Col1'], 'ColX'] = df1['ColY']
The code you wrote won't assign the values from df1 to df2, but from df2 to df1.
Also, if you could clarify which dataframe ColX and ColY belong to, I could help more (or do both dataframes have them?).
Your code is pretty much right! Only swap df1 and df2 as above.

comparing a dataframe column with another data frame

I have two datasets
df1 = pd.DataFrame({"skuid": ("45", "22", "32", "33"), "country": ("A", "B", "C", "A")})
df2 = pd.DataFrame({"skuid": ("45", "32", "40", "21"), "salesprice": (10, 0, 0, 30), "regularprice": (9, 10, 0, 2)})
I want to find how many rows df2 has in common with df1 when the country is A (a count only).
I want the output to be 1, because skuid 45 is in both datasets and its country is A.
I did it by subsetting by country and using isin(), like:
df3 = df1.loc[df1['country'] == 'A']
df3['skuid'].isin(df2['skuid']).value_counts()
but I want to know whether I can do it in a single line.
Here is what I tried as a one-liner:
df1.loc['skuid'].isin(df2.skuid[df1.country.eq('A')].unique().sum()):,])
I know my mistake: I'm comparing df1 with df2 of a country that doesn't exist.
So, is there any way I can do it in one or two lines, without subsetting each country?
Thanks in advance
Let's try:
df1.loc[df1['country']=='A', 'skuid'].isin(df2['skuid']).sum()
# out: 1
Or
(df1['skuid'].isin(df2['skuid']) & df1['country'].eq('A')).sum()
You can also do for all countries with groupby():
df1['skuid'].isin(df2['skuid']).groupby(df1['country']).sum()
Output:
country
A 1
B 0
C 1
Name: skuid, dtype: int64
If I understood correctly, you need this:
df3 = df1[lambda x: x['skuid'].isin(df2['skuid']) & (x['country'] == 'A')].count()

pandas appending df1 to df2 get 0s/NaNs in result

I have 2 dataframes. df1 comprises a Series of values.
df1 = pd.DataFrame({'winnings': cumsums_winnings_s, 'returns':cumsums_returns_s, 'spent': cumsums_spent_s, 'runs': cumsums_runs_s, 'wins': cumsums_wins_s, 'expected': cumsums_expected_s}, columns=["winnings", "returns", "runs", "wins", "expected"])
df2 runs each row through a function which takes 3 columns and produces a result for each row - specialSauce:
df2 = pd.DataFrame(list(map(lambda w, r, e: doStuff(w, r, e), df1['wins'], df1['runs'], df1['expected'])), columns=["specialSauce"])
print(df2.append(df1))
produces all the df1 columns but NaN for the df2 rows (and vice versa if df1/df2 are switched in the append).
So the problem I have is how to append these 2 dataframes correctly.
As I understand things, your issue seems to be related to the fact that you get NaN's in the result DataFrame.
The reason for this is that you are trying to .append() one dataframe to the other while they don't have the same columns.
df2 has one extra column, the one created with doStuff, while df1 does not have that column. When trying to append one pd.DataFrame to the other, the result will have all columns of both pd.DataFrame objects. Naturally, you will have some NaN's in ['specialSauce'] since this column does not exist in df1.
This would be the same if you were to use pd.concat(), both methods do the same thing in this case. The one thing that you could do to bring the result closer to your desired result is use the ignore_index flag like this:
>> df2.append(df1, ignore_index=True)
This would at least give you a 'fresh' index for the result pd.DataFrame.
EDIT
If what you're looking for is to "append" the result of doStuff to the end of your existing df, in the form of a new column (['specialSauce']), then what you'll have to do is use pd.concat() like this:
>> pd.concat([df1, df2], axis=1)
This will return the result pd.DataFrame as you want it.
If you had a pd.Series to add to the columns of df1 then you'd need to add it like this:
>> df1['specialSauce'] = <'specialSauce values'>
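A small sketch of the two options above, with invented values standing in for df1 and the doStuff result:

```python
import pandas as pd

# Illustrative frames: df1 holds two stat columns, df2 holds the computed column
df1 = pd.DataFrame({"wins": [1, 2], "runs": [3, 4]})
df2 = pd.DataFrame({"specialSauce": [4, 6]})

# concat along axis=1 joins the columns side by side on the shared index
side_by_side = pd.concat([df1, df2], axis=1)
print(list(side_by_side.columns))  # ['wins', 'runs', 'specialSauce']

# Equivalent here: assign the Series as a new column of df1
df1["specialSauce"] = df2["specialSauce"]
```

Both forms rely on the two frames sharing the same index; with mismatched indexes, concat would introduce NaN rows instead.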
I hope that helps, if not please rephrase the description of what you're after.
Ok, there are a couple of things going on here. You've left code out and I had to fill in the gaps. For example you did not define doStuff, so I had to.
doStuff = lambda w, r, e: w + r + e
With that defined, your code does not run. I had to guess what you were trying to do. I'm guessing that you want to have an additional column called 'specialSauce' adjacent to your other columns.
So, this is how I set it up and solved the problem.
Setup and Solution
import pandas as pd
import numpy as np
np.random.seed(314)
df = pd.DataFrame(np.random.randn(100, 6),
                  columns=["winnings", "returns",
                           "spent", "runs",
                           "wins", "expected"]).cumsum()
doStuff = lambda w, r, e: w + r + e
df['specialSauce'] = df[['wins', 'runs', 'expected']].apply(lambda x: doStuff(*x), axis=1)
print(df.head())
winnings returns spent runs wins expected specialSauce
0 0.166085 0.781964 0.852285 -0.707071 -0.931657 0.886661 -0.752067
1 -0.055704 1.163688 0.079710 0.155916 -1.212917 -0.045265 -1.102266
2 -0.554241 1.928014 0.271214 -0.462848 0.452802 1.692924 1.682878
3 0.627985 3.047389 -1.594841 -1.099262 -0.308115 4.356977 2.949601
4 0.796156 3.228755 -0.273482 -0.661442 -0.111355 2.827409 2.054611
Also
You tried to use pd.DataFrame.append(). Per the documentation, it attaches the DataFrame specified as the argument to the end of the DataFrame that is being appended to. What you wanted here is pd.concat() along axis=1.
