I have two dataframes with values.
df1:
number 1 2 3
12354 mark 24 london
12356 jacob 25 denver
12357 luther 26 berlin
12358 john 27 tokyo
12359 marshall 28 cairo
12350 ted 29 delhi
df2:
number
12354
12357
12359
I want to remove all rows in df1 whose `number` value also appears in df2.
Expected output:
0 1 2 3
12356 jacob 25 denver
12358 john 27 tokyo
12350 ted 29 delhi
Here is an example:
import pandas as pd
from io import StringIO
df1 = """
number,1,2,3
12354,mark,24,london
12356,jacob,25,denver
12357,luther,26,berlin
12358,john,27,tokyo
12359,marshall,28,cairo
12350,ted,29,delhi
"""
df2 = """
number
12354
12357
12359
"""
df_df2 = pd.read_csv(StringIO(df2), sep=',')
df_df1 = pd.read_csv(StringIO(df1), sep=',')
df = pd.merge(df_df1, df_df2, indicator=True, how='outer').query('_merge == "left_only"')
df.drop(['_merge'], axis=1, inplace=True)
df.rename(columns={'number': '0'}, inplace=True)
print(df)
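If `number` is the only column that needs to match, a shorter alternative (a sketch, assuming the key column is literally named `number`) is an anti-join with isin:

```python
import pandas as pd

# stand-in frames mirroring the question's data (columns abbreviated)
df1 = pd.DataFrame({
    'number': [12354, 12356, 12357, 12358, 12359, 12350],
    'name': ['mark', 'jacob', 'luther', 'john', 'marshall', 'ted'],
})
df2 = pd.DataFrame({'number': [12354, 12357, 12359]})

# anti-join: keep only rows of df1 whose 'number' never appears in df2
result = df1[~df1['number'].isin(df2['number'])]
print(result)
```

This avoids the merge/indicator round trip and the follow-up column cleanup.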
Hello, I have this pandas code (below), but it turns out it gives me this error: TypeError: can only concatenate str (not "int") to str
import pandas as pd
import numpy as np
import os
_data0 = pd.read_excel("C:\\Users\\HP\\Documents\\DataScience task\\Gender_Age.xlsx")
_data0['Age' + 1]
I want to change the element values of the 'Age' column. For example, if I wanted to increase every element of 'Age' by 1, how do I do that? (And the same for 'Number of Children'.)
The output I wanted:
First Name Last Name Age Number of Children
0 Kimberly Watson 36 2
1 Victor Wilson 35 6
2 Adrian Elliott 35 2
3 Richard Bailey 36 5
4 Blake Roberts 35 6
Original output:
First Name Last Name Age Number of Children
0 Kimberly Watson 24 1
1 Victor Wilson 23 5
2 Adrian Elliott 23 1
3 Richard Bailey 24 4
4 Blake Roberts 23 5
Try (to go from your original output to the output you want, you need to add, not subtract):
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
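As a runnable sketch with made-up stand-in data (the original Gender_Age.xlsx isn't available here), the vectorized column arithmetic works like this:

```python
import pandas as pd

# illustrative rows standing in for the Excel file's columns
df = pd.DataFrame({'Age': [24, 23, 23],
                   'Number of Children': [1, 5, 1]})

# adding a scalar to a column is vectorized: every element is updated
df['Age'] = df['Age'] + 12
df['Number of Children'] = df['Number of Children'] + 1
print(df)
```

The original error came from `_data0['Age' + 1]`, which tries to concatenate the string `'Age'` with the integer `1` to form a column label, rather than adding 1 to the column's values.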
Hello,
I need to create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014, using pandas. The function should return a 5x2 DataFrame with columns ['STNAME', 'CTYNAME'] and the same index as census_df (sorted ascending by index).
You'll find a description of my data in the picture:
Consider the following demo:
In [19]: df
Out[19]:
REGION STNAME CTYNAME POPESTIMATE2014 POPESTIMATE2015
0 0 Washington Washington 10 12
1 1 Washington Washington County 11 13
2 2 Alabama Alabama County 13 15
3 4 Alaska Alaska 14 12
4 3 Montana Montana 10 11
5 2 Washington Washington 15 19
In [20]: qry = "REGION in [1,2] and POPESTIMATE2015 > POPESTIMATE2014 and CTYNAME.str.contains('^Washington')"
In [21]: df.query(qry, engine='python')[['STNAME', 'CTYNAME']]
Out[21]:
STNAME CTYNAME
1 Washington Washington County
5 Washington Washington
Use boolean indexing with a mask created from isin and startswith (wrap the multi-line expression in parentheses so it parses as one statement):
mask = (df['REGION'].isin([1, 2])
        & df['CTYNAME'].str.startswith('Washington')
        & (df['POPESTIMATE2015'] > df['POPESTIMATE2014']))
df = df.loc[mask, ['STNAME', 'CTYNAME']]
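A self-contained sketch of the boolean-indexing approach, using the same made-up demo data as above (the real census_df columns may differ):

```python
import pandas as pd

# small stand-in for census_df; column names follow the question
df = pd.DataFrame({
    'REGION': [0, 1, 2, 4, 3, 2],
    'STNAME': ['Washington', 'Washington', 'Alabama',
               'Alaska', 'Montana', 'Washington'],
    'CTYNAME': ['Washington', 'Washington County', 'Alabama County',
                'Alaska', 'Montana', 'Washington'],
    'POPESTIMATE2014': [10, 11, 13, 14, 10, 15],
    'POPESTIMATE2015': [12, 13, 15, 12, 11, 19],
})

# combine the three conditions elementwise; & requires parentheses
mask = (df['REGION'].isin([1, 2])
        & df['CTYNAME'].str.startswith('Washington')
        & (df['POPESTIMATE2015'] > df['POPESTIMATE2014']))
result = df.loc[mask, ['STNAME', 'CTYNAME']]
print(result)
```

Boolean indexing keeps the original index labels, which matches the question's requirement of preserving the census_df index.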
Below is my dataframe
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I want to insert a new row at the first position
name: dean, age: 45, sex: male
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
What is the best way to do this in pandas?
Probably this is not the most efficient way but:
df.loc[-1] = [45, 'Dean', 'male']  # adding a row
df.index = df.index + 1 # shifting index
df.sort_index(inplace=True)
Output:
age name sex
0 45 Dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
If it's going to be a frequent operation, then it makes sense (in terms of performance) to gather the data into a list first and then use pd.concat([], ignore_index=True) (similar to #Serenity's solution):
Demo:
data = []
# always inserting new rows at the first position - last row will be always on top
data.insert(0, {'name': 'dean', 'age': 45, 'sex': 'male'})
data.insert(0, {'name': 'joe', 'age': 33, 'sex': 'male'})
#...
pd.concat([pd.DataFrame(data), df], ignore_index=True)
In [56]: pd.concat([pd.DataFrame(data), df], ignore_index=True)
Out[56]:
age name sex
0 33 joe male
1 45 dean male
2 30 jon male
3 25 sam male
4 18 jane female
5 26 bob male
PS: I wouldn't call .append(), pd.concat(), or .sort_index() for every single row, as they're pretty expensive. So the idea is to do it in chunks...
#edyvedy13's solution worked great for me. However, it needs to be updated for the deprecation of pandas' sort method, which has been replaced with sort_index:
df.loc[-1] = [45, 'Dean', 'male']  # adding a row
df.index = df.index + 1 # shifting index
df = df.sort_index() # sorting by index
Use pandas.concat and reset the index of the new dataframe:
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex':['male','male','female','male']})
# new line
line = pd.DataFrame({'name': 'dean', 'age': 45, 'sex': 'male'}, index=[0])
# concatenate two dataframe
df2 = pd.concat([line, df]).reset_index(drop=True)  # .ix is removed from pandas
print (df2)
Output:
age name sex
0 45 dean male
1 30 jon male
2 25 sam male
3 18 jane female
4 26 bob male
import pandas as pd
df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
'age': [30,25,18,26],
'sex': ['male','male','female','male']})
df1 = pd.DataFrame({'name': ['dean'], 'age': [45], 'sex': ['male']})
df1 = pd.concat([df1, df])  # DataFrame.append was removed in pandas 2.0
df1 = df1.reset_index(drop=True)
That works.
This works for me:
>>> import pandas as pd
>>> df = pd.DataFrame({'name': ['jon','sam','jane','bob'],
... 'age': [30,25,18,26],
... 'sex':['male','male','female','male']})
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
>>> df.loc['a']=[45,'dean','male']
>>> df
age name sex
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
a 45 dean male
>>> newIndex=['a']+[ind for ind in df.index if ind!='a']
>>> df=df.reindex(index=newIndex)
>>> df
age name sex
a 45 dean male
0 30 jon male
1 25 sam male
2 18 jane female
3 26 bob male
I am trying to make a dataframe so that I can send it to a CSV easily; otherwise I have to do this process manually.
I'd like this to be my final output. Each person has a month and year combo that starts at 1/1/2014 and goes to 12/1/2016:
Name date
0 ben 1/1/2014
1 ben 2/1/2014
2 ben 3/1/2014
3 ben 4/1/2014
....
12 dan 1/1/2014
13 dan 2/1/2014
14 dan 3/1/2014
code so far:
import pandas as pd
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df = pd.DataFrame({"Name": listof_people})
for month in months:
df.append({'date': month}, ignore_index=True)
print(df)
When I try looping to create the dataframe, it either does not work or I get index errors (because of the non-matching lists), and I'm at a loss.
I've done a good bit of searching and have found some following links that are similar, but I can't reverse engineer the work to fit my case.
Filling empty python dataframe using loops
How to build and fill pandas dataframe from for loop?
I don't want anyone to feel like they are "doing my homework", so if I'm derping on something simple please let me know.
I think you can use itertools.product for all the combinations, with to_datetime for the date column:
import pandas as pd
from itertools import product
days = [1]
months = list(range(1, 13))
years = ['2014', '2015', '2016']
listof_people = ['ben','dan','nathan', 'gary', 'Mark', 'Sean', 'Tim', 'Chris']
df1 = pd.DataFrame(list(product(listof_people, months, days, years)))
df1.columns = ['Name', 'month','day','year']
print (df1)
Name month day year
0 ben 1 1 2014
1 ben 1 1 2015
2 ben 1 1 2016
3 ben 2 1 2014
4 ben 2 1 2015
5 ben 2 1 2016
6 ben 3 1 2014
7 ben 3 1 2015
8 ben 3 1 2016
9 ben 4 1 2014
10 ben 4 1 2015
...
...
df1['date'] = pd.to_datetime(df1[['month','day','year']])
df1 = df1[['Name','date']]
print (df1)
Name date
0 ben 2014-01-01
1 ben 2015-01-01
2 ben 2016-01-01
3 ben 2014-02-01
4 ben 2015-02-01
5 ben 2016-02-01
6 ben 2014-03-01
7 ben 2015-03-01
...
...
Alternatively, build a MultiIndex from the same product (note the lambda in assign, so the date is computed from the freshly built frame rather than some outer df):
mux = pd.MultiIndex.from_product(
    [listof_people, years, months],
    names=['Name', 'Year', 'Month'])
pd.Series(
    1, mux, name='Day'
).reset_index().assign(
    date=lambda d: pd.to_datetime(d[['Year', 'Month', 'Day']])
)[['Name', 'date']]
I have a pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so that each value in a list becomes a single value in the column. An example says it all:
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ]})
location name
0 Amsterdam Tom
1 [Berlin, Paris] Jim
2 [Antwerp, Barcelona, Pisa] Claus
I want to turn into:
dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus','Claus','Claus'],
u'location': ['Amsterdam', 'Berlin','Paris', 'Antwerp','Barcelona','Pisa']})
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
I first tried using apply but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...
def duplicator(series):
if type(series['location']) == list:
for location in series['location']:
subSeries = series
subSeries['location'] = location
dfOut.append(subSeries)
else:
dfOut.append(series)
for index, row in dfIn.iterrows():
duplicator(row)
Not the most interesting or fancy pandas usage, but this works:
import numpy as np
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
all_names = np.hstack([[n]*len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location':all_locations, 'name':all_names})
It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all location entries are already iterables, you can remove the atleast_1d call, which gives about another 20% speedup.
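For reference, a self-contained sketch of this hstack approach (using zip rather than unpacking .values, which behaves the same here):

```python
import numpy as np
import pandas as pd

dfIn = pd.DataFrame({
    'name': ['Tom', 'Jim', 'Claus'],
    'location': ['Amsterdam', ['Berlin', 'Paris'],
                 ['Antwerp', 'Barcelona', 'Pisa']],
})

# normalize scalars to length-1 arrays so every entry is an iterable
locs = dfIn['location'].apply(np.atleast_1d)

# flatten all locations, and repeat each name once per location
all_locations = np.hstack(list(locs))
all_names = np.hstack([[n] * len(l) for n, l in zip(dfIn['name'], locs)])
dfOut = pd.DataFrame({'location': all_locations, 'name': all_names})
print(dfOut)
```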
If you return a series whose index is a list of locations, then dfIn.apply will collate those series into a table:
import pandas as pd
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'],
['Antwerp','Barcelona','Pisa'] ]})
def expand(row):
locations = row['location'] if isinstance(row['location'], list) else [row['location']]
s = pd.Series(row['name'], index=list(set(locations)))
return s
In [156]: dfIn.apply(expand, axis=1)
Out[156]:
Amsterdam Antwerp Barcelona Berlin Paris Pisa
0 Tom NaN NaN NaN NaN NaN
1 NaN NaN NaN Jim Jim NaN
2 NaN Claus Claus NaN NaN Claus
You can then stack this DataFrame to obtain:
In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]:
0 Amsterdam Tom
1 Berlin Jim
Paris Jim
2 Antwerp Claus
Barcelona Claus
Pisa Claus
dtype: object
This is a Series, while you want a DataFrame. A little massaging with reset_index gives you the desired result:
dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)
yields
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
In pandas 0.25+ you can simply use DataFrame.explode:
import pandas as pd
dfIn = pd.DataFrame({
u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ],
})
print(dfIn.explode('location'))
>>>
name location
0 Tom Amsterdam
1 Jim Berlin
1 Jim Paris
2 Claus Antwerp
2 Claus Barcelona
2 Claus Pisa