Filling Null Values of a Pandas Column with a List

I have a DataFrame df that can be re-created with the code below:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['jim', 'john', 'joe', 'jack', 'jake']})
df2 = pd.DataFrame({'name': ['jim', 'john', 'jack'],
                    'listings': [['orlando', 'los angeles', 'houston'],
                                 ['buffalo', 'boston', 'dallas', 'none'],
                                 ['phoenix', 'montreal', 'seattle', 'none']]})
df = pd.merge(df1, df2, on='name', how='left')
print(df)
name listings
0 jim [orlando, los angeles, houston]
1 john [buffalo, boston, dallas, none]
2 joe NaN
3 jack [phoenix, montreal, seattle, none]
4 jake NaN
I want to fill the NaN values in the listings column with a list of 'none' repeated to the length of the lists in the listings column, i.e. ['none']*4, so that the resulting dataframe looks like below:
print(df)
name listings
0 jim [orlando, los angeles, houston]
1 john [buffalo, boston, dallas, none]
2 joe [none, none, none, none]
3 jack [phoenix, montreal, seattle, none]
4 jake [none, none, none, none]
I've tried both approaches below, and neither works:
# Failed Approach 1
df['listings'] = np.where(df['listings'].isnull(), ['none']*4, df['listings'])
# Failed Approach 2
df['listings'].fillna(['none']*4)

You can do:
df.loc[df['listings'].isna(),'listings'] = [['none']*4]
name listings
0 jim [orlando, los angeles, houston]
1 john [buffalo, boston, dallas, none]
2 joe [none, none, none, none]
3 jack [phoenix, montreal, seattle, none]
4 jake [none, none, none, none]
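
For what it's worth, a minimal alternative sketch (not from the answer above): fillna() refuses list values outright, and np.where() tries to broadcast the 4-element list against the 5-row condition, which is why both attempts fail. Filling row by row with apply sidesteps that:
import pandas as pd

# Sketch: give each NaN its own fresh ['none']*4 list.
# The isinstance check is needed because a missing cell is a float NaN, not a list.
df['listings'] = df['listings'].apply(
    lambda x: x if isinstance(x, list) else ['none'] * 4
)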

Related

Filter one dataframe using multiple columns of another dataframe in Python

I have one dataframe (df1), which is my raw data, from which I want to filter or extract part of the data. I have another dataframe (df2) which holds my filter conditions. The catch is that if a filter condition column is blank, that column's condition should be skipped and the other column conditions applied.
Example below:
DF1:
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
CZCH  SEATLLE   DC    CZCH     Europe
NY    MARYLAND  DC    US       S America
NY    WASHIN    NY    US       America
NY    SEAGA     NJ    UK       Europe
DF2: (sample filter condition table - this table can have multiple conditions)
City  District  Town  Country  Continent
NY              DC
                NJ
Notice that I have left the District, Country and Continent columns blank, as I may or may not use them later. I cannot delete these columns.
OUTPUT DF: should look like this
City  District  Town  Country  Continent
NY    WASHIN    DC    US       America
NY    MARYLAND  DC    US       S America
NY    SEAGA     NJ    UK       Europe
So basically I need a filter condition table which will extract information from the raw data for the fields I fill in in the filter table. I cannot change/delete columns in DF2; I can only leave a column blank if I don't require that filter condition.
Thanks in advance,
Nitz
If DF2 always has one row:
df = df1.merge(df2.dropna(axis=1))
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
1 NY MARYLAND DC US S America
If multiple rows with missing values:
Sample data:
nan = np.nan
df1 = pd.DataFrame({'City': ['NY', 'CZCH', 'NY', 'NY', 'NY'], 'District': ['WASHIN', 'SEATLLE', 'MARYLAND', 'WASHIN', 'SEAGA'], 'Town': ['DC', 'DC', 'DC', 'NY', 'NJ'], 'Country': ['US', 'CZCH', 'US', 'US', 'UK'], 'Continent': ['America', 'Europe', 'S America', 'America', 'Europe']})
df2 = pd.DataFrame({'City': ['NY', nan], 'District': [nan, nan], 'Town': ['DC', 'NJ'], 'Country': [nan, nan], 'Continent': [nan, nan]})
First remove the missing values and reshape with DataFrame.stack:
print (df2.stack())
0 City NY
Town DC
1 Town NJ
dtype: object
Then, for each condition row, compare the df1 columns whose names appear in the stacked index against the values from df2:
m = [df1[list(v.droplevel(0).index)].eq(v.droplevel(0)).all(axis=1)
     for k, v in df2.stack().groupby(level=0)]
print (m)
[0 True
1 False
2 True
3 False
4 False
dtype: bool, 0 False
1 False
2 False
3 False
4 True
dtype: bool]
Use np.logical_or.reduce and filter with boolean indexing:
print (np.logical_or.reduce(m))
[ True False True False True]
df = df1[np.logical_or.reduce(m)]
print (df)
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
Another possible solution, using numpy broadcasting (it works even when df2 has more than one row):
# For each df2 row, count how many cells of every df1 row match (3D
# broadcasted compare); a df1 row satisfies a condition row when that count
# equals the number of non-NaN cells in that condition row. Keep df1 rows
# that satisfy exactly one condition row.
df1.loc[np.sum(np.sum(
    df1.values == df2.values[:, None], axis=2) ==
    np.sum(df2.notna().values, axis=1)[:, None], axis=0) == 1]
Output:
City District Town Country Continent
0 NY WASHIN DC US America
2 NY MARYLAND DC US S America
4 NY SEAGA NJ UK Europe
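
A further hedged sketch, assuming df2 stays small and every condition row has at least one filled cell: generalize the one-row merge above by merging df1 against each condition row separately (dropping that row's blank cells), then concatenating the pieces. drop_duplicates() guards against a df1 row matching more than one condition row:
import pandas as pd

frames = [
    df1.merge(row.dropna().to_frame().T)  # df1 rows matching this condition row
    for _, row in df2.iterrows()
]
print(pd.concat(frames).drop_duplicates())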

Pandas create new rows based on column values

df_current = pd.DataFrame({'Date': ['2022-09-16', '2022-09-17', '2022-09-18'],
                           'Name': ['Bob Jones', 'Mike Smith', 'Adam Smith'],
                           'Items Sold': [1, 3, 2],
                           'Ticket Type': ['1 x GA', '2 x VIP, 1 x GA', '1 x GA, 1 x VIP']})
Date Name Items Sold Ticket Type
0 2022-09-16 Bob Jones 1 1 x GA
1 2022-09-17 Mike Smith 3 2 x VIP, 1 x GA
2 2022-09-18 Adam Smith 2 1 x GA, 1 x VIP
Hi there. I have the above dataframe, and what I'm after is new rows with the ticket type and number of tickets sold split out, as below:
df_desired = pd.DataFrame({'Date':['2022-09-16', '2022-09-17', '2022-09-17', '2022-09-18', '2022-09-18'],
'Name': ['Bob Jones', 'Mike Smith', 'Mike Smith', 'Adam Smith', 'Adam Smith'],
'Items Sold':[1, 2, 1, 1, 1], 'Ticket Type':['GA', 'VIP', 'GA', 'GA', 'VIP']})
Any help would be greatly appreciated!
# create df2 by splitting df_current['Ticket Type'] on "," and exploding to create rows
df2 = df_current.assign(tt=df_current['Ticket Type'].str.split(',')).explode('tt')
# split again on 'x'
df2[['Items Sold', 'Ticket Type']] = df2['tt'].str.split('x', expand=True)
# drop the temp column
df2.drop(columns='tt', inplace=True)
df2
Date Name Items Sold Ticket Type
0 2022-09-16 Bob Jones 1 GA
1 2022-09-17 Mike Smith 2 VIP
1 2022-09-17 Mike Smith 1 GA
2 2022-09-18 Adam Smith 1 GA
2 2022-09-18 Adam Smith 1 VIP
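
One caveat with splitting on the bare 'x': both pieces keep their surrounding spaces and 'Items Sold' stays a string. A hedged variant of the same idea (a sketch, assuming the ticket strings always use ', ' and ' x ' as separators) that also casts the counts back to int:
out = df_current.assign(tt=df_current['Ticket Type'].str.split(', ')).explode('tt')
out[['Items Sold', 'Ticket Type']] = out['tt'].str.split(' x ', expand=True)
out['Items Sold'] = out['Items Sold'].astype(int)  # counts come back as strings
out = out.drop(columns='tt').reset_index(drop=True)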

Python DataFrame : find previous row's value before a specific value with same value in other columns

I have a datafame as follows
import pandas as pd
d = {
    'Name': ['James', 'John', 'Peter', 'Thomas', 'Jacob', 'Andrew',
             'John', 'Peter', 'Thomas', 'Jacob', 'Peter', 'Thomas'],
    'Order': [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
    'Place': ['Paris', 'London', 'Rome', 'Paris', 'Venice', 'Rome',
              'Paris', 'Paris', 'London', 'Paris', 'Milan', 'Milan']
}
df = pd.DataFrame(d)
Name Order Place
0 James 1 Paris
1 John 1 London
2 Peter 1 Rome
3 Thomas 1 Paris
4 Jacob 1 Venice
5 Andrew 1 Rome
6 John 2 Paris
7 Peter 2 Paris
8 Thomas 2 London
9 Jacob 2 Paris
10 Peter 3 Milan
11 Thomas 3 Milan
The dataframe represents people visiting various cities; the Order column defines the order of the visits.
I would like to find which city each person visited before Paris.
Expected dataframe is as follows
Name Order Place
1 John 1 London
2 Peter 1 Rome
4 Jacob 1 Venice
What is the pythonic way to find it?
Using merge:
# people and orders for the Paris visits
s = df.loc[df.Place.eq('Paris'), ['Name', 'Order']]
# step back one visit for each person
m = s.assign(Order=s.Order.sub(1))
m.merge(df, on=['Name', 'Order'])
Name Order Place
0 John 1 London
1 Peter 1 Rome
2 Jacob 1 Venice
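
An alternative sketch using groupby and shift instead of merge: sort by person and visit order, look one row ahead within each Name group, and keep the rows whose next stop is Paris. Unlike the merge approach, which assumes the previous visit has Order exactly one less, this picks the immediately preceding visit regardless of gaps in Order:
s = df.sort_values(['Name', 'Order'])
# Place of each person's *next* visit, aligned with the current row
next_place = s.groupby('Name')['Place'].shift(-1)
print(s[next_place.eq('Paris')])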

pretty nested dictionary as a table

Is there any way to pretty print a nested dictionary in a table format? My data structure looks like this:
data = {'01/09/16': {'In': ['Jack'], 'Out': ['Lisa', 'Tom', 'Roger', 'Max', 'Harry', 'Same', 'Joseph', 'Luke', 'Mohammad', 'Sammy']},
        '02/09/16': {'In': ['Jack', 'Lisa', 'Rache', 'Allan'], 'Out': ['Lisa', 'Tom']},
        '03/09/16': {'In': ['James', 'Jack', 'Nowel', 'Harry', 'Timmy'], 'Out': ['Lisa', 'Tom']}}
And I'm trying to print it out something like this, with the names listed below one another:
+--------+-----------+-----------+-----------+
| Status | 01/09/16  | 02/09/16  | 03/09/16  |
+--------+-----------+-----------+-----------+
| In     | Jack      | Jack      | James     |
|        |           | Lisa      | Jack      |
|        |           | ...       | ...       |
+--------+-----------+-----------+-----------+
| Out    | Lisa      | Lisa      | Lisa      |
|        | Tom       | Tom       | Tom       |
|        | ...       |           |           |
+--------+-----------+-----------+-----------+
I've tried using pandas with this code:
pd.set_option('display.max_colwidth', -1)
df = pd.DataFrame(role_assignment)
df.fillna('None', inplace=True)
print df
But the problem with the above is that pandas prints the names on a single line, which doesn't look good, especially if there are a lot of names:
01/09/16 \
In [Jack]
Out [Lisa, Tom, Roger, Max, Harry, Same, Joseph, Luke, Mohammad, Sammy]
02/09/16 03/09/16
In [Jack, Lisa, Rache, Allan] [James, Jack, Nowel, Harry, Timmy]
Out [Lisa, Tom] [Lisa, Tom]
I'd prefer this, but with the names listed below one another:
01/09/16 02/09/16 03/09/16
In [Jack] [Jack] [James]
Out [Lisa] [Lisa] [Lisa]
Is there a way to print it neater using pandas or another tool?
This is nonsense hackery, and only for display purposes.
data = {
'01/09/16': {
'In': ['Jack'],
'Out': ['Lisa', 'Tom', 'Roger',
'Max', 'Harry', 'Same',
'Joseph', 'Luke', 'Mohammad', 'Sammy']
},
'02/09/16': {
'In': ['Jack', 'Lisa', 'Rache', 'Allan'],
'Out': ['Lisa', 'Tom']
},
'03/09/16': {
'In': ['James', 'Jack', 'Nowel', 'Harry', 'Timmy'],
'Out': ['Lisa', 'Tom']
}
}
df = pd.DataFrame(data)
d1 = df.stack().apply(pd.Series).stack().unstack(1).fillna('')
# blank out the inner index level so repeated row labels print as whitespace
d1.index.set_levels([''] * len(d1.index.levels[1]), level=1, inplace=True)
print(d1)
01/09/16 02/09/16 03/09/16
In Jack Jack James
Lisa Jack
Rache Nowel
Allan Harry
Timmy
Out Lisa Lisa Lisa
Tom Tom Tom
Roger
Max
Harry
Same
Joseph
Luke
Mohammad
Sammy
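
A sketch of the same display with more recent pandas idioms, using explode instead of apply(pd.Series) and avoiding the deprecated inplace set_levels call; the pivot call with a list index assumes pandas >= 1.1:
df = pd.DataFrame(data)
s = df.stack().explode().rename('name').reset_index()
s.columns = ['status', 'date', 'name']
s['pos'] = s.groupby(['status', 'date']).cumcount()  # position of each name in its list
table = s.pivot(index=['status', 'pos'], columns='date', values='name').fillna('')
print(table)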

Iterate over rows and expand pandas dataframe

I have a pandas dataframe with a column containing values or lists of values (of unequal length). I want to 'expand' the rows, so each value in a list becomes a single value in the column. An example says it all:
dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin', 'Paris'], ['Antwerp', 'Barcelona', 'Pisa']]})
location name
0 Amsterdam Tom
1 [Berlin, Paris] Jim
2 [Antwerp, Barcelona, Pisa] Claus
I want to turn into:
dfOut = pd.DataFrame({u'name': ['Tom', 'Jim', 'Jim', 'Claus', 'Claus', 'Claus'],
                      u'location': ['Amsterdam', 'Berlin', 'Paris', 'Antwerp', 'Barcelona', 'Pisa']})
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
I first tried using apply, but it's not possible to return multiple Series as far as I know. iterrows seems to be the trick. But the code below gives me an empty dataframe...
def duplicator(series):
    if type(series['location']) == list:
        for location in series['location']:
            subSeries = series
            subSeries['location'] = location
            dfOut.append(subSeries)
    else:
        dfOut.append(series)

for index, row in dfIn.iterrows():
    duplicator(row)
Not the most interesting/fancy pandas usage, but this works:
import numpy as np

# make every entry list-like, then flatten all locations into one array
dfIn.loc[:, 'location'] = dfIn.location.apply(np.atleast_1d)
all_locations = np.hstack(dfIn.location)
# repeat each name once per location in its row
all_names = np.hstack([[n] * len(l) for n, l in dfIn[['name', 'location']].values])
dfOut = pd.DataFrame({'location': all_locations, 'name': all_names})
It's about 40x faster than the apply/stack/reindex approach. As far as I can tell, that ratio holds at pretty much all dataframe sizes (didn't test how it scales with the size of the lists in each row). If you can guarantee that all location entries are already iterables, you can remove the atleast_1d call, which gives about another 20% speedup.
If you return a Series whose index is the list of locations, then dfIn.apply will collate those Series into a table:
import pandas as pd

dfIn = pd.DataFrame({u'name': ['Tom', 'Jim', 'Claus'],
                     u'location': ['Amsterdam', ['Berlin', 'Paris'],
                                   ['Antwerp', 'Barcelona', 'Pisa']]})

def expand(row):
    locations = row['location'] if isinstance(row['location'], list) else [row['location']]
    s = pd.Series(row['name'], index=list(set(locations)))
    return s
In [156]: dfIn.apply(expand, axis=1)
Out[156]:
Amsterdam Antwerp Barcelona Berlin Paris Pisa
0 Tom NaN NaN NaN NaN NaN
1 NaN NaN NaN Jim Jim NaN
2 NaN Claus Claus NaN NaN Claus
You can then stack this DataFrame to obtain:
In [157]: dfIn.apply(expand, axis=1).stack()
Out[157]:
0 Amsterdam Tom
1 Berlin Jim
Paris Jim
2 Antwerp Claus
Barcelona Claus
Pisa Claus
dtype: object
This is a Series, while you want a DataFrame. A little massaging with reset_index gives you the desired result:
dfOut = dfIn.apply(expand, axis=1).stack()
dfOut = dfOut.to_frame().reset_index(level=1, drop=False)
dfOut.columns = ['location', 'name']
dfOut.reset_index(drop=True, inplace=True)
print(dfOut)
yields
location name
0 Amsterdam Tom
1 Berlin Jim
2 Paris Jim
3 Antwerp Claus
4 Barcelona Claus
5 Pisa Claus
import pandas as pd
dfIn = pd.DataFrame({
u'name': ['Tom', 'Jim', 'Claus'],
u'location': ['Amsterdam', ['Berlin','Paris'], ['Antwerp','Barcelona','Pisa'] ],
})
print(dfIn.explode('location'))
name location
0 Tom Amsterdam
1 Jim Berlin
1 Jim Paris
2 Claus Antwerp
2 Claus Barcelona
2 Claus Pisa
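
Note that explode keeps the original index (hence the repeated 1s and 2s above); a reset_index gives the clean 0..5 range from the question:
dfOut = dfIn.explode('location').reset_index(drop=True)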
