comparing a dataframe column with another data frame - python

I have two datasets
df1 = pd.DataFrame ({"skuid" :("45","22","32","33"), "country": ("A","B","C","A")})
df2 = pd.DataFrame ({"skuid" :("45","32","40","21"),"salesprice" :(10,0,0,30),"regularprice" : (9,10,0,2)})
I want to find how many rows df2 has in common with df1 when country is A (only the count).
The expected output is 1, because skuid 45 is in both datasets and its country is A.
I did this by subsetting on country and then using isin(), like this:
df3 = df1.loc[df1['country']=='A']
df3['skuid'].isin(df2['skuid']).value_counts()
but I want to know whether I can do it in a single line.
Here is what I tried as a one-liner:
df1.loc['skuid'].isin(df2.skuid[df1.country.eq('A')].unique().sum()):,])
I know my mistake: I am filtering df2 by a country column that it does not have.
So, is there any way to do this in one or two lines, without subsetting each country separately?
Thanks in advance.

Let's try:
df1.loc[df1['country']=='A', 'skuid'].isin(df2['skuid']).sum()
# out: 1
Or
(df1['skuid'].isin(df2['skuid']) & df1['country'].eq('A')).sum()
You can also do this for all countries at once with groupby():
df1['skuid'].isin(df2['skuid']).groupby(df1['country']).sum()
Output:
country
A    1
B    0
C    1
Name: skuid, dtype: int64
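For anyone who wants to run the approaches above end to end, here is a minimal self-contained sketch using the data from the question. The variable names in_df2, count_a and per_country are only illustrative and do not appear in the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({"skuid": ("45", "22", "32", "33"),
                    "country": ("A", "B", "C", "A")})
df2 = pd.DataFrame({"skuid": ("45", "32", "40", "21"),
                    "salesprice": (10, 0, 0, 30),
                    "regularprice": (9, 10, 0, 2)})

# True where a skuid from df1 also appears in df2
in_df2 = df1["skuid"].isin(df2["skuid"])

# Count for a single country: combine the membership mask with a country mask
count_a = (in_df2 & df1["country"].eq("A")).sum()

# Count for every country at once: group the membership mask by country
per_country = in_df2.groupby(df1["country"]).sum()

print(count_a)
print(per_country)
```

Building the boolean mask once and reusing it keeps both the single-country and the per-country count as one-liners.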

If I understood correctly, you need this:
df3 = df1[lambda x: x['skuid'].isin(df2['skuid']) & (x['country'] == 'A')].count()

Related

Subset of columns from another data frame

I have a dataframe (G) whose columns are "Client" and "TIV".
I have another dataframe (B) whose columns are "Client", "TIV", "A", "B", "C".
I want to select all rows from B whose clients are not in G. In other words, if there is a row in B whose Client also exists in G, then I want to delete it.
I did this:
x = B[B['Client'] != G['Client']]
But it raised an error saying "Can only compare identically-labeled Series objects".
I appreciate your help.
You can use Series.isin combined with the ~ operator:
B[~B.Client.isin(G.Client)]
Maybe the following code snippet helps:
df1 = pd.DataFrame(data={'Client': [1,2,3,4,5]})
df2 = pd.DataFrame(data={'Client': [1,2,3,6,7]})
# Identify what Clients are in df1 and not in df2
clients_diff = set(df1.Client).difference(df2.Client)
df1.loc[df1.Client.isin(clients_diff)]
The idea is to filter df1 on all clients which are not in df2
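The != comparison in the question fails because comparing two Series element by element requires both to carry identical index labels, whereas isin only tests membership. A minimal sketch of the isin approach follows; the sample values in G and B are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with the column names from the question
G = pd.DataFrame({"Client": ["x", "y"], "TIV": [10, 20]})
B = pd.DataFrame({"Client": ["x", "z", "w"],
                  "TIV": [1, 2, 3],
                  "A": [0, 0, 0]})

# isin checks each Client in B for membership in G's Client values,
# so the two frames may differ in length and index.
only_in_B = B[~B["Client"].isin(G["Client"])]
print(only_in_B)
```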

how to join 2 rows based on column value

I have a dataframe like the one in the picture below:
[image: input dataframe]
Based on the col_3 value, I want to extract this dataframe:
[image: desired output dataframe]
I tried:
df1 = df[df['col_8'] == 2]
df2 = df[df['col_8'] == 3]
df3 = pd.merge(df1, df2, on=['col_3'], how = 'inner')
But because I have just one row with col_3 = 252, that row is dropped after the merge.
How can I fix this, and which function can I use to extract the dataframe above?
What are you trying to do?
In your picture, col_3 only has values of 2 and 3. And then, you split the dataframe on the condition of col_3 = 2 or 3. And then you want to merge it.
So, you are trying to slice a dataframe and then rejoin it as it was? Why?
I think this is happening because your df2 is empty, since no rows satisfy df['col_8'] == 3. An inner join is the intersection of the two sets, so when df2 is empty the merge returns nothing.
I think you are trying to do this:
df2 = df[df['col_8_3'] == 3]
Then the inner join should work and produce one row.
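Since the pictures from the question are not available, here is a small sketch with made-up data showing how the how argument of pd.merge decides whether a key that exists on only one side survives. The frames left and right and their value columns are hypothetical; only col_3 and the value 252 come from the question.

```python
import pandas as pd

# Made-up data: key 252 appears only in the left frame
left = pd.DataFrame({"col_3": [100, 252], "left_val": ["a", "b"]})
right = pd.DataFrame({"col_3": [100], "right_val": ["x"]})

# Inner join keeps only keys present in both frames, so 252 is dropped
inner = pd.merge(left, right, on="col_3", how="inner")

# Outer join keeps keys from either frame; missing values become NaN
outer = pd.merge(left, right, on="col_3", how="outer")

print(inner)
print(outer)
```

If the unmatched row has to be kept, how='outer' (or how='left' when only the left frame's rows matter) may be closer to what the question asks for than how='inner'.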

i want to extract dataframe that meet certain conditions using python, pandas

I load Excel data with the columns Time, Name, Good, Bad using Python and pandas.
I want to turn this dataframe into another dataframe that meets certain conditions.
In detail, I would like to print a dataframe that stores the sum of the Good and Bad values for each Name over the entire time range.
Could anybody who knows Python and pandas well please help?
First aggregate with sum via DataFrame.groupby, rename the columns with DataFrame.add_prefix, add a new column with DataFrame.assign, and finally convert the index to a column with DataFrame.reset_index:
df = pd.DataFrame({
    'Name': list('aaabbb'),
    'Bad': [1, 3, 5, 7, 1, 0],
    'Good': [5, 3, 6, 9, 2, 4]
})
# Select several columns with a list: [['Good', 'Bad']], not ['Good', 'Bad']
df1 = (df.groupby('Name')[['Good', 'Bad']]
         .sum()
         .add_prefix('Total_')
         .assign(Total_Count=lambda x: x.sum(axis=1))
         .reset_index())
print(df1)
  Name  Total_Good  Total_Bad  Total_Count
0    a          14          9           23
1    b          15          8           23
Use pandas NamedAgg with eval:
df.groupby('Name')[['Good', 'Bad']]\
  .agg(Total_Good=('Good', 'sum'),
       Total_Bad=('Bad', 'sum'))\
  .eval('Total_Count = Total_Good + Total_Bad')

right way to add values in empty column in loop in dataframe , python

df = pd.DataFrame(columns=['one', 'two', 'three'])
for home in city:
    adres = home
    for a in abc:  # abc is a pd.Series of names
        # here I want to add the values adres and a, but my dataframe has 3 columns and I will use only 2 here
        df.loc[len(df)] = [adres, a, np.nan]
print(df)
one, two, three
alabama, flat, NaN
How should I properly add the values adres and a to columns one and two and leave column three untouched?
Thank you.
I would first create a pandas Series that contains the values and the columns I want, for example:
new_values = pd.Series(["a", "b"], index=["one", "two"])
Then I would append this Series to the original dataframe as a new row (DataFrame.append was removed in pandas 2.0, so pd.concat is used here):
df = pd.concat([df, new_values.to_frame().T], ignore_index=True)
I think you are looking for something like:
for i in range(1):
    df.loc[i, ['one', 'two']] = ['adress', 'a']
print(df)
      one two three
0  adress   a   NaN
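As an alternative to growing the dataframe row by row, one common pattern is to collect the rows in a plain list and build the dataframe once at the end. The sketch below assumes city is a list of strings and abc is a Series of names; both sample values are made up, since the question does not show them.

```python
import pandas as pd

# Hypothetical stand-ins for the question's `city` and `abc`
city = ["alabama", "texas"]
abc = pd.Series(["flat", "house"])

# Collect one dict per row, leaving out the column that should stay empty
rows = [{"one": home, "two": a} for home in city for a in abc]

# Passing `columns` keeps 'three' in the frame; it is filled with NaN
df = pd.DataFrame(rows, columns=["one", "two", "three"])
print(df)
```

Building the frame in one step avoids repeated reallocation inside the loop, and column three is never touched.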

Add new column to Pandas DataFrame and fill with first word from another column from same df

I have a dataset of crimes reported by Gloucestershire Constabulary from 2011-16. It's a .csv file that I have imported to a Pandas dataframe. The data include a column stating the Lower Super Output Area (LSOA) in which the crime occurred, so for crimes in Tewkesbury, for instance, each record has the corresponding LSOA name, e.g. 'Tewkesbury 009D'; 'Tewkesbury 009E'.
I want to group these data by the town/city they relate to, e.g. 'Gloucester', 'Tewkesbury', ignoring the specific LSOAs within each conurbation. Ideally, I would append a new column to the dataframe, with just the place name copied across, and group on that. I am comfortable with how to do the grouping, just not the new column in the first place. Any advice on how to do this is gratefully received.
I am no Pandas expert, but I think you can use string slicing to strip off the last five characters (it supports regex too, if I recall correctly, so you can do a proper 'search' if required).
#x is the original dataframe
new_col = x.lsoa.str[:-5] #lsoa is the column containing city names
pd.concat([x, new_col], axis=1)
The str method can be used to extract a string out of the lsoa column of the dataframe.
Something along these lines should work:
df['town'] = [x.split()[0] for x in df['LSOA']]
You can use regex to extract the city name from the DataFrame and then join the result to the original DataFrame. If your initial DataFrame is df:
df = pd.DataFrame(['Tewkesbury 009D', 'Tewkesbury 009E'], columns=['LSOA'])
In [2]: df
Out[2]:
              LSOA
0  Tewkesbury 009D
1  Tewkesbury 009E
Then you can extract the city name and optionally the LSOA code into a new DataFrame df_new:
df_new = df['LSOA'].str.extract(r'(\w*)\s(\d+\w*)', expand=True)
In [10]: df_new
Out[10]:
            0     1
0  Tewkesbury  009D
1  Tewkesbury  009E
If you want to discard the code and just keep the city name, remove the second pair of parentheses from the regex, as in r'(\w*)\s\d+\w*'. Now you can join this result to the original DataFrame:
In [11]: df.join(df_new)
Out[11]:
              LSOA           0     1
0  Tewkesbury 009D  Tewkesbury  009D
1  Tewkesbury 009E  Tewkesbury  009E
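The answers above can be combined into a short sketch that adds the new column and groups on it. It assumes the column is called LSOA and that the code is always the last whitespace-separated token; the sample rows, including the multi-word name, are made up for illustration. Splitting from the right keeps multi-word place names intact, which taking only the first word would not.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the crime CSV
df = pd.DataFrame({"LSOA": ["Tewkesbury 009D",
                            "Tewkesbury 009E",
                            "Forest of Dean 001A"]})

# Split once from the right on whitespace and keep everything before the code
df["town"] = df["LSOA"].str.rsplit(n=1).str[0]

# Group on the new column, for example to count records per town
counts = df.groupby("town").size()
print(df)
print(counts)
```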
