Creating a pandas dataframe out of nested dictionaries - python

I have a dictionary data which has a structure like so:
{
1: {
'title': 'Test x Miss LaFamilia - All Mine [Music Video] | Link Up TV',
'time': '2020-06-28T18:30:06Z',
'channel': 'Link Up TV',
'description': 'SUB & ENABLE NOTIFICATIONS for more: Visit our clothing store: Visit our website for the latest videos: ...',
'url': 'youtube',
'region_searched': 'US',
'time_searched': datetime.datetime(2020, 8, 6, 13, 6, 5, 188727, tzinfo=<UTC>)
},
2: {
'title': 'Day 1 Highlights | England Frustrated by Rain as Babar Impresses | England v Pakistan 1st Test 2020',
'time': '2020-08-05T18:29:43Z',
'channel': 'England & Wales Cricket Board',
'description': 'Watch match highlights of Day 1 from the 1st Test between England and Pakistan at Old Trafford. Find out more at ecb.co.uk This is the official channel of the ...',
'url': 'youtube',
'region_searched': 'US',
'time_searched': datetime.datetime(2020, 8, 6, 13, 6, 5, 188750, tzinfo=<UTC>)
}
}
I am trying to make a pandas DataFrame which would look like this:
rank title time channel description url region_searched time_searched
1 Test x Miss LaFamilia... 2020-06-28T18:30:06Z Link Up TV SUB & ENABLE NOTIFICATIONS for more... youtube.com US 2020-8-6 13:06:05
2 Day 1 Highlights | E... 2020-08-05T18:29:43 England & .. Watch match highlights of D youtube.com US 2020-8-6 13:06:05
In my data dictionary, each top-level key should become the rank entry in my DataFrame, and each inner dictionary should become a row, with its keys as the column names and its values as the cell values.
When I simply run:
df = pd.DataFrame(data)
The df looks like this:
1 2
title Test x Miss LaFamilia - All Mine [Music Video]... Day 1 Highlights | England Frustrated by Rain ...
time 2020-06-28T18:30:06Z 2020-08-05T18:29:43Z
channel Link Up TV England & Wales Cricket Board
description SUB & ENABLE NOTIFICATIONS for more: http://go... Watch match highlights of Day 1 from the 1st T...
url youtube.com/watch?v=YB3xASruJHE youtube.com/watch?v=xABoyLxWc7c
region_searched US US
time_searched 2020-08-06 2020-08-06
This feels like it is a few smart pivot lines away from what I need, but I can't figure out how to achieve the structure I want in a clean way.

It can be done in a much simpler way, as @dm2 mentioned in the comments. Here d is the dictionary that holds the data:
df = pd.DataFrame(d)
dfz = df.T
To create the rank column:
dfz['rank'] = dfz.index

try this,
import pandas as pd
pd.DataFrame(data.values()).assign(rank = data.keys())
title ... rank
0 Test x Miss LaFamilia - All Mine [Music Video]... ... 1
1 Day 1 Highlights | England Frustrated by Rain ... ... 2

If you want index and rank to be two different columns
Create a dataframe from the data
df = pd.DataFrame(data.values())
Just add a rank column in the dataframe
df['rank'] = data.keys()
OR
To do this in one line use assign method
df = pd.DataFrame(data.values()).assign(rank = data.keys())
If you want index and rank to be the same column
Create the dataframe but in transpose order
df = pd.DataFrame(data).T
Rename the index
df.index.names = ['rank']
It should work.
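For completeness, the same result can be reached in one expression with from_dict: orient='index' treats each top-level key as a row label, and rename_axis/reset_index turns those keys into a rank column. A minimal sketch against toy data shaped like the question's dictionary (values abbreviated):

```python
import pandas as pd

# Toy data mirroring the question's structure (abbreviated values)
data = {
    1: {'title': 'Test x Miss LaFamilia...', 'channel': 'Link Up TV'},
    2: {'title': 'Day 1 Highlights...', 'channel': 'England & Wales Cricket Board'},
}

# orient='index' makes each top-level key a row label
df = (pd.DataFrame.from_dict(data, orient='index')
        .rename_axis('rank')
        .reset_index())
print(df)
```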

Try looping through the dict's keys and appending a one-row frame for each value (replace data with your variable name; note that each inner dict holds scalars, so an index must be passed):
df_full = pd.DataFrame()
for key in data.keys():
    df_temp = pd.DataFrame(data[key], index=[key])
    df_full = pd.concat([df_full, df_temp], axis=0)

Related

Long to wide format using a dictionary

I would like to make a long to wide transformation of my dataframe, starting from
match_id player goals home
1 John 1 home
1 Jim 3 home
...
2 John 0 away
2 Jim 2 away
...
ending up with:
match_id player_1 player_2 player_1_goals player_2_goals player_1_home player_2_home ...
1 John Jim 1 3 home home
2 John Jim 0 2 away away
...
Since I'm going to have columns with new names, I thought that maybe I should try to build a dictionary for that, where the outer key is the match id, with one entry per match, like so:
dict = {1: {
'player_1': 'John',
'player_1_goals':1,
'player_1_home': 'home',
'player_2': 'Jim',
'player_2_goals':3,
'player_2_home': 'home'
},
2: {
'player_1': 'John',
'player_1_goals':0,
'player_1_home': 'away',
'player_2': 'Jim',
'player_2_goals':2,
'player_2_home': 'away'
},
}
and then:
pd.DataFrame.from_dict(dict).T
In the real case scenario, however, the number of players will vary and I can't hardcode it.
Is this the best way of doing this using dictionaries? If so, how could I build this dict and populate it from my original pandas dataframe?
It looks like you want to pivot the dataframe. The problem is there is no column in your dataframe that "enumerates" the players for you. If you assign such a column via assign() method, then pivot() becomes easy.
So far, it actually looks incredibly similar to this case here. The only difference is that you seem to need to format the column names in a specific way, where the string "player" needs to be prepended to each column name. The set_axis() call below does that.
(df
.assign(
ind=df.groupby('match_id').cumcount().add(1).astype(str)
)
.pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
.pipe(lambda x: x.set_axis([
'_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
for (c, i) in x
], axis=1))
.reset_index()
)
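Put together on a small made-up sample of the question's long format, the assign/pivot steps look like this (the column flattening is written as a plain assignment here rather than the pipe/set_axis form, which is equivalent):

```python
import pandas as pd

# Toy long-format data mirroring the question
df = pd.DataFrame({
    'match_id': [1, 1, 2, 2],
    'player':   ['John', 'Jim', 'John', 'Jim'],
    'goals':    [1, 3, 0, 2],
    'home':     ['home', 'home', 'away', 'away'],
})

# Enumerate players within each match, then pivot to wide form
wide = (df
    .assign(ind=df.groupby('match_id').cumcount().add(1).astype(str))
    .pivot(index='match_id', columns='ind', values=['player', 'goals', 'home'])
)
# Flatten the (value, ind) MultiIndex into the requested column names
wide.columns = ['_'.join([c, i]) if c == 'player' else '_'.join(['player', i, c])
                for (c, i) in wide.columns]
wide = wide.reset_index()
print(wide)
```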

Big pandas dataframe to dict of some columns

I have a big dataframe like:
product price serial category department origin
0 cookies 4 2345 breakfast food V
1 paper 0.5 4556 stationery work V
2 spoon 2 9843 kitchen household M
I want to convert to dict, but I just want an output like:
{serial: 2345}{serial: 4556}{serial: 9843} and {origin: V}{origin: V}{origin: M}
where key is column name and value is value
Now, I've tried with df.to_dict('values'), selected dic['origin'], and it returns:
{0: V}{1:V}{2:M}
I've tried too with df.to_dict('records') but it give me:
{product: cookies, price: 4, serial:2345, category: breakfast, department:food, origin:V}
and I don't know how to select only 'origin' or 'serial'
You can do something like (the 'r' shorthand is deprecated; use 'records'):
serial_dict = df[['serial']].to_dict('records')
origin_dict = df[['origin']].to_dict('records')
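Selecting a single column with double brackets before calling to_dict('records') produces exactly the one-key dicts asked for. A quick sketch on toy rows taken from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'product': ['cookies', 'paper', 'spoon'],
    'serial':  [2345, 4556, 9843],
    'origin':  ['V', 'V', 'M'],
})

# [['serial']] keeps a one-column DataFrame, so each record has one key
serial_dicts = df[['serial']].to_dict('records')
origin_dicts = df[['origin']].to_dict('records')
print(serial_dicts)  # [{'serial': 2345}, {'serial': 4556}, {'serial': 9843}]
print(origin_dicts)  # [{'origin': 'V'}, {'origin': 'V'}, {'origin': 'M'}]
```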

Creating a conditional loop in Numpy/Pandas

Absolute newbie here....
I have a dataset with a list of expenses data 1
I would like to create a loop to identify the dates in which the person spends more than the previous day and also spends more than the next day. In doing so, I would like it to either print the date and amount(expenses) or create a new column reading true/false.
Should I use Numpy or Pandas?
I was thinking of something along the lines of: today = i, yesterday = i-1, and tomorrow = i+1,
...and then proceeding to create a loop
Are you looking for something like this:
# sample data
np.random.seed(4)
df = pd.DataFrame({'Date': pd.date_range('2020-01-01', '2020-01-10'),
'Name': ['Some Name', 'Another Name']*5,
'Price': np.random.randint(100,1000, 10)})
# groupby name
g = df.groupby('Name')['Price']
# create a mask to filter your dataframe where the current price is greater than the price above and below
mask = (g.shift(0) > g.shift(1)) & (g.shift(0) > g.shift(-1))
df[mask]
Date Name Price
3 2020-01-04 Another Name 809
4 2020-01-05 Some Name 997
7 2020-01-08 Another Name 556
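When there is only one person's expense series, the same shift-based comparison works without the groupby. A minimal sketch on made-up numbers:

```python
import pandas as pd

# Hypothetical daily expenses for one person
s = pd.Series([10, 25, 5, 30, 30, 12],
              index=pd.date_range('2020-01-01', periods=6))

# True where the day's spend exceeds both the previous and the next day
peak = (s > s.shift(1)) & (s > s.shift(-1))
print(s[peak])  # only 2020-01-02 (25) qualifies here
```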

Conditional Filling in Missing Values in a Pandas Data frame using non-conventional means

TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenge(s) we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurrence across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
|code| |name|
001 Australia
002 London
...
001 <blank>
My approach I have used is as follows:
Loop through entire dataframe and identify areas with blanks "". Replace all blanks via copying the associated correct code (ordered) to the dataframe.
code_names = [ "",
'Economic management',
'Public sector governance',
'Rule of law',
'Financial and private sector development',
'Trade and integration',
'Social protection and risk management',
'Social dev/gender/inclusion',
'Human development',
'Urban development',
'Rural development',
'Environment and natural resources management'
]
df_copy = df_.copy()
# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works - is there a better, more pythonic way of achieving what I'm after? I feel the approach I have used is very clunky due to my skills not being very well developed.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
code name
0 1 Australia
1 2 London
2 1
You can apply the following:
new_df = (df.replace({'name':{'':np.nan}})
.sort_values(['code', 'name'])
.fillna(method='ffill')
.sort_index())
>>> new_df
code name
0 1 Australia
1 2 London
2 1 Australia
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series to map the codes to non-blank names, and use .map to map that series to your code column:
df['name'] = (df['code']
.map(
df.replace({'name':{'':np.nan}})
.sort_values(['code', 'name'])
.groupby('code')
.first()
.squeeze()
))
>>> df
code name
0 1 Australia
1 2 London
2 1 Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.
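A third variant, assuming each code is associated with at least one non-blank name, is to build the code-to-name mapping once from the non-blank rows and fill only the blanks from it. A sketch on the same toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'code': [1, 2, 1], 'name': ['Australia', 'London', '']})

# Build a code -> name mapping from the rows that do have a name
mapping = df.loc[df['name'] != ''].set_index('code')['name'].to_dict()

# Turn blanks into NaN, then fill them by looking each code up in the mapping
df['name'] = df['name'].replace('', np.nan).fillna(df['code'].map(mapping))
print(df)
```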

How to fix Pandas dataframe which shows NaN for string, and remove list brackets when write dataframe to csv

I am converting python lists into Pandas dataframe, then write the dataframe into csv. The lists are as following:
name = ['james beard', 'james beard']
ids = [304589, 304589]
year = [1999, 1999]
co_authors = [['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani'], ['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani']]
title = ['world wide databaseintegrating the web corba and databases', 'world wide databaseintegrating the web corba and databases']
venue = ['international conference on management of data', 'international conference on management of data']
data = {
'Name': name,
'ID': ids,
'Year': year,
'Co-author': co_authors,
'Title:': title,
'Venue:': venue,
}
df = pd.DataFrame(data, columns=['Name','ID','Year','Co-author','Title', 'Venue'])
df
df.to_csv('test.csv')
My questions are
(a) "Title" and "Venue" columns are shown as 'NaN' instead of their values (see below). How can I fix this?
Name ID Year Co-author Title Venue
0 james beard 304589 1999 [athman bouguettaya, boualem benatallah, lily ... NaN NaN
1 james beard 304589 1999 [athman bouguettaya, boualem benatallah, lily ... NaN NaN
(b) In CSV (see below), how to add "Index" to the header and remove brackets in "Co-author"?
,Name,ID,Year,Co-author,Title,Venue
0,james beard,304589,1999,"['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani']",,
1,james beard,304589,1999,"['athman bouguettaya', 'boualem benatallah', 'lily hendra', 'kevin smith', 'mourad quzzani']",,
As for the first problem: in data you have the character : in the names 'Title:' and 'Venue:',
so DataFrame can't find 'Title' and 'Venue' in data.
You have to remove the :
Or you can skip columns=[...] and it will use the names with : ('Title:', 'Venue:'):
df = pd.DataFrame(data)
As for the second: I was searching for a solution with pandas after (or during) creating the DataFrame,
and I didn't find one.
But if you assume you can modify the data before you create the DataFrame, then you can write your version more concisely:
co_authors = [','.join(row) for row in co_authors]
Ah well, I solved (b) using the below before loading into data:
tmp = []
for c in range(len(co_authors)):
    tmp.append(','.join(map(str, co_authors[c])))
co_authors = tmp
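For the "Index" header in (b), to_csv's index_label parameter names the index column in the CSV header. Combined with the join above, a sketch on two rows abbreviated from the question:

```python
import pandas as pd

co_authors = [['athman bouguettaya', 'boualem benatallah'],
              ['athman bouguettaya', 'boualem benatallah']]
df = pd.DataFrame({
    'Name': ['james beard', 'james beard'],
    # Join each list into one string so no brackets land in the CSV
    'Co-author': [', '.join(row) for row in co_authors],
})
# index_label names the index column in the written header
df.to_csv('test.csv', index_label='Index')
print(open('test.csv').read())
```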
