I have a big dataframe like:
  product  price  serial    category  department  origin
0  cookies      4    2345   breakfast        food       V
1    paper    0.5    4556  stationery        work       V
2    spoon      2    9843     kitchen   household       M
I want to convert to dict, but I just want an output like:
{serial: 2345}{serial: 4556}{serial: 9843} and {origin: V}{origin: V}{origin: M}
where key is column name and value is value
Now, I've tried df.to_dict('dict') and selected dic['origin'], which returns
{0: 'V', 1: 'V', 2: 'M'}
I've also tried df.to_dict('records'), but it gives me:
{product: cookies, price: 4, serial:2345, category: breakfast, department:food, origin:V}
and I don't know how to select only 'origin' or 'serial'
You can do something like:
serial_dict = df[['serial']].to_dict('records')
origin_dict = df[['origin']].to_dict('records')
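For the sample frame above, these should give roughly:

serial_dict  ->  [{'serial': 2345}, {'serial': 4556}, {'serial': 9843}]
origin_dict  ->  [{'origin': 'V'}, {'origin': 'V'}, {'origin': 'M'}]

i.e. one dict per row, keyed by column name, as requested.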
user_id  partner_name  order_sequence
      2  Star Bucks                 1
      2  KFC                        2
      2  MCD                        3
      6  Coffee Store               1
      6  MCD                        2
      9  KFC                        1
I am trying to figure out which two-restaurant combinations occur most often. For instance, the user with user_id 2 went to Star Bucks, KFC, and MCD, so I want a two-dimensional array like [[Star Bucks, KFC], [KFC, MCD]].
However, each time the user_id changes (for instance, between rows 3 and 4), the code should skip adding that combination.
Also, if a user has only one entry in the table, for instance, user with user_id 9, then this user should not be added to the list because he did not visit two or more restaurants.
The final result I am expecting for this table is:
[[Star Bucks, KFC], [KFC, MCD], [Coffee Store, MCD]]
I have written the following lines of code but so far, I am unable to crack it.
Requesting help!
arr1 = []
arr2 = []
for idx, x in enumerate(df['order_sequence']):
    if x != 1:
        arr1.append(df['partner_name'][idx])
        arr1.append(df['partner_name'][idx + 1])
        arr2.append(arr1)
You could try to use .groupby() and zip():
res = [
    pair
    for _, sdf in df.groupby("user_id")
    for pair in zip(sdf["partner_name"], sdf["partner_name"].iloc[1:])
]
Result for the sample dataframe:
[('Star Bucks', 'KFC'), ('KFC', 'MCD'), ('Coffee Store', 'MCD')]
Or try
res = (
    df.groupby("user_id")["partner_name"].agg(list)
    .map(lambda l: list(zip(l, l[1:])))
    .sum()
)
with the same result.
You might have to sort the dataframe first:
df = df.sort_values(["user_id", "order_sequence"])
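Since the stated goal is to find which two-restaurant combination occurs most often, a minimal follow-up sketch (assuming res holds the pairs produced above, and that pairs should be counted across all users) is a collections.Counter:

from collections import Counter

# Tally how often each consecutive-visit pair appears
pair_counts = Counter(res)
print(pair_counts.most_common(3))

In the sample data every pair occurs exactly once, so most_common() only becomes informative on a larger table.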
I work with a lot of CSV data for my job. I am trying to use pandas to populate each spouse's 'PrimaryMemberEmail' column with the 'Email' of their associated primary member. Here is a sample of what I mean:
import pandas as pd
user_data = {'FirstName': ['John', 'Jane', 'Bob'],
             'Lastname': ['Snack', 'Snack', 'Tack'],
             'EmployeeID': ['12345', '12345S', '54321'],
             'Email': ['John@issues.com', 'NaN', 'Bob@issues.com'],
             'DOB': ['09/07/1988', '12/25/1990', '07/13/1964'],
             'Role': ['Employee On Plan', 'Spouse On Plan', 'Employee Off Plan'],
             'PrimaryMemberEmail': ['NaN', 'NaN', 'NaN'],
             'PrimaryMemberEmployeeId': ['NaN', '12345', 'NaN']
             }
df = pd.DataFrame(user_data)
I have thousands of rows like this. I need to populate 'PrimaryMemberEmail' only when the user is a spouse, using the 'Email' of their associated primary holder. So in this case I would want to auto-populate the 'PrimaryMemberEmail' for Jane Snack with that of her spouse, John Snack, which is 'John@issues.com'. I cannot find a good way to do this. Currently I am using:
p = -1  # start before the first row so p steps through 0, 1, 2, ...
for i in df['EmployeeID']:
    p = p + len(df['EmployeeID']) - (len(df['EmployeeID']) - 1)  # i.e. p += 1
    EEID = df['EmployeeID'].iloc[p]
    if 'S' in EEID:
        df['PrimaryMemberEmail'].iloc[p] = df['Email'].iloc[p - 1]
What bothers me is that this only works if my file comes in correctly, like how I showed in the example DataFrame. Also my NaN values do not work with dropna() or other methods, but that is a question for another time.
I am new to python and programming. I am trying to add value to myself in my current health career and I find this all very fascinating. Any help is appreciated.
IIUC, map the values and fillna:
df['PrimaryMemberEmail'] = (df['PrimaryMemberEmployeeId']
                            .map(df.set_index('EmployeeID')['Email'])
                            .fillna(df['PrimaryMemberEmail'])
                            )
Alternatively, if you have real NaNs (not strings), use boolean indexing:
df.loc[df['PrimaryMemberEmployeeId'].notna(),
       'PrimaryMemberEmail'] = df['PrimaryMemberEmployeeId'].map(df.set_index('EmployeeID')['Email'])
output:
  FirstName Lastname EmployeeID            Email         DOB               Role PrimaryMemberEmail PrimaryMemberEmployeeId
0      John    Snack      12345  John@issues.com  09/07/1988   Employee On Plan                NaN                     NaN
1      Jane    Snack     12345S              NaN  12/25/1990     Spouse On Plan    John@issues.com                   12345
2       Bob     Tack      54321   Bob@issues.com  07/13/1964  Employee Off Plan                NaN                     NaN
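On the side question about 'NaN' values not working with dropna(): in the sample they are the literal string 'NaN', not real missing values. A minimal sketch, assuming every such string really should become a true missing value, is:

import numpy as np

# Turn literal 'NaN' strings into real NaN so notna()/dropna()/fillna() behave
df = df.replace('NaN', np.nan)

With that done, the boolean-indexing variant above applies directly.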
I have a dictionary data which has a structure like so:
{
    1: {
        'title': 'Test x Miss LaFamilia - All Mine [Music Video] | Link Up TV',
        'time': '2020-06-28T18:30:06Z',
        'channel': 'Link Up TV',
        'description': 'SUB & ENABLE NOTIFICATIONS for more: Visit our clothing store: Visit our website for the latest videos: ...',
        'url': 'youtube',
        'region_searched': 'US',
        'time_searched': datetime.datetime(2020, 8, 6, 13, 6, 5, 188727, tzinfo=<UTC>)
    },
    2: {
        'title': 'Day 1 Highlights | England Frustrated by Rain as Babar Impresses | England v Pakistan 1st Test 2020',
        'time': '2020-08-05T18:29:43Z',
        'channel': 'England & Wales Cricket Board',
        'description': 'Watch match highlights of Day 1 from the 1st Test between England and Pakistan at Old Trafford. Find out more at ecb.co.uk This is the official channel of the ...',
        'url': 'youtube',
        'region_searched': 'US',
        'time_searched': datetime.datetime(2020, 8, 6, 13, 6, 5, 188750, tzinfo=<UTC>)
    }
}
I am trying to make a pandas DataFrame which would look like this:
rank title time channel description url region_searched time_searched
1 Test x Miss LaFamilia... 2020-06-28T18:30:06Z Link Up TV SUB & ENABLE NOTIFICATIONS for more... youtube.com US 2020-8-6 13:06:05
2 Day 1 Highlights | E... 2020-08-05T18:29:43 England & .. Watch match highlights of D youtube.com US 2020-8-6 13:06:05
In my data dictionary, each top-level key should become the rank entry in my DataFrame, and each key inside it should become a column name, with its value as that column's value.
When I simply run:
df = pd.DataFrame(data)
The df looks like this:
1 2
title Test x Miss LaFamilia - All Mine [Music Video]... Day 1 Highlights | England Frustrated by Rain ...
time 2020-06-28T18:30:06Z 2020-08-05T18:29:43Z
channel Link Up TV England & Wales Cricket Board
description SUB & ENABLE NOTIFICATIONS for more: http://go... Watch match highlights of Day 1 from the 1st T...
url youtube.com/watch?v=YB3xASruJHE youtube.com/watch?v=xABoyLxWc7c
region_searched US US
time_searched 2020-08-06 2020-08-06
This feels like it is a few smart pivot lines away from what I need, but I can't figure out how to achieve the structure I want in a smart way.
It can be done in a much simpler way, as @dm2 mentioned in the comments. Here d is the dictionary which holds the data:

df = pd.DataFrame(d)
dfz = df.T

To create the rank column:

dfz['rank'] = dfz.index
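If rank should end up as an ordinary column instead of the index, an equivalent one-liner (same assumption that d holds the data) would be:

df = pd.DataFrame(d).T.rename_axis('rank').reset_index()

rename_axis('rank') names the index, and reset_index() then turns it into a regular column.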
try this,
import pandas as pd
pd.DataFrame(data.values()).assign(rank=data.keys())
title ... rank
0 Test x Miss LaFamilia - All Mine [Music Video]... ... 1
1 Day 1 Highlights | England Frustrated by Rain ... ... 2
If you want index and rank to be two different columns
Create a dataframe from the data
df = pd.DataFrame(data.values())
Just add a rank column in the dataframe
df['rank'] = data.keys()
OR
To do this in one line use assign method
df = pd.DataFrame(data.values()).assign(rank=data.keys())
If you want index and rank to be same column
Create the dataframe but in transpose order
df = pd.DataFrame(data).T
Rename the index
df.index.names = ['rank']
It should work.
Try looping through the dict's keys and appending to a new df for each value (replace dict with your variable name):
df_full = pd.DataFrame()
for key in dict.keys():
    # index=[key] is required because each inner dict holds scalar values;
    # it also carries the rank through as the row index
    df_temp = pd.DataFrame(dict[key], index=[key])
    df_full = pd.concat([df_full, df_temp], axis=0)
The dataframe in the format below has to be converted into "op_df":
ip_df = pd.DataFrame({'class': ['I', 'II', 'III'],
                      'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                                  [{'sec': 'B', 'assigned_to': 'joe'}],
                                  []]})
ip_df:
class details
0 I [{'sec':'A','assigned_to':'tom'},{'sec':'B','assigned_to':'sam'}]
1 II [{'sec':'B','assigned_to':'joe'}]
2 III []
The required output dataframe is suppose to be,
op_df:
class sec assigned_to
0 I A tom
1 I B sam
2 II B joe
3 III NaN NaN
How can I turn each dictionary in the "details" column into a new row, with the dictionary's keys as column names and its values as the corresponding column values?
I have tried:

ip_df.join(ip_df['details'].apply(pd.Series))

but I am unable to get the "op_df" layout from it.
I am sure there are better ways to do it, but I had to deconstruct your details list and create your dataframe as follows:
dict_values = {'class': ['I', 'II', 'III'],
               'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                           [{'sec': 'B', 'assigned_to': 'joe'}],
                           []]}

all_values = []
for cl, detail in zip(dict_values['class'], dict_values['details']):
    if len(detail) > 0:
        for innerdict in detail:
            row = {'class': cl}
            for innerkey in innerdict.keys():
                row[innerkey] = innerdict[innerkey]
            all_values.append(row)
    else:
        row = {'class': cl}
        all_values.append(row)

op_df = pd.DataFrame(all_values)
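For comparison, a shorter sketch built on explode() and json_normalize() (an alternative, not the code above; it assumes pandas >= 0.25 and that an empty details list should survive as a NaN row):

import pandas as pd

ip_df = pd.DataFrame({'class': ['I', 'II', 'III'],
                      'details': [[{'sec': 'A', 'assigned_to': 'tom'}, {'sec': 'B', 'assigned_to': 'sam'}],
                                  [{'sec': 'B', 'assigned_to': 'joe'}],
                                  []]})

# explode() gives each list element its own row; an empty list becomes NaN
exploded = ip_df.explode('details').reset_index(drop=True)

# json_normalize() spreads the dicts into columns; NaN entries are mapped to {}
# first so they come out as all-NaN rows
details = pd.json_normalize(
    exploded['details'].map(lambda d: d if isinstance(d, dict) else {}).tolist()
)

op_df = pd.concat([exploded.drop(columns='details'), details], axis=1)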
TLDR; How can I improve my code and make it more pythonic?
Hi,
One of the interesting challenge(s) we were given in a tutorial was the following:
"There are X missing entries in the data frame with an associated code but a 'blank' entry next to the code. This is a random occurance across the data frame. Using your knowledge of pandas, map each missing 'blank' entry to the associated code."
So this looks like the following:
code  name
001   Australia
002   London
...
001   <blank>
The approach I have used is as follows: loop through the entire dataframe and identify entries with blanks (""), then replace each blank by copying in the correct name associated with its code (the codes are ordered).
code_names = ["",
              'Economic management',
              'Public sector governance',
              'Rule of law',
              'Financial and private sector development',
              'Trade and integration',
              'Social protection and risk management',
              'Social dev/gender/inclusion',
              'Human development',
              'Urban development',
              'Rural development',
              'Environment and natural resources management']
df_copy = df_.copy()

# Looks through each code name, and if it is empty, stores the proper name in its place
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        if df_copy.mjtheme_namecode[x][y]['name'] == "":
            df_copy.mjtheme_namecode[x][y]['name'] = code_names[int(df_copy.mjtheme_namecode[x][y]['code'])]
limit = 25
counter = 0
for x in range(len(df_copy.mjtheme_namecode)):
    for y in range(len(df_copy.mjtheme_namecode[x])):
        print(df_copy.mjtheme_namecode[x][y])
        counter += 1
        if counter >= limit:
            break
While the above approach works, is there a better, more pythonic way of achieving what I'm after? I feel the approach I have used is very clunky because my skills are not yet well developed.
Thank you!
Method 1:
One way to do this would be to replace all your "" blanks with NaN, sort the dataframe by code and name, and use fillna(method='ffill'):
Starting with this:
>>> df
   code       name
0     1  Australia
1     2     London
2     1
You can apply the following:
import numpy as np

new_df = (df.replace({'name': {'': np.nan}})
            .sort_values(['code', 'name'])
            .fillna(method='ffill')
            .sort_index())
>>> new_df
   code       name
0     1  Australia
1     2     London
2     1  Australia
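To make the ffill step concrete, here is roughly what the frame looks like just before fillna, after the replace and the sort (within code 1 the NaN sorts last, so ffill copies 'Australia' down into row 2):

   code       name
0     1  Australia
2     1        NaN
1     2     London

The final .sort_index() then restores the original row order.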
Method 2:
This is more convoluted, but will work as well:
Using groupby, first, and squeeze, you can create a pd.Series to map the codes to non-blank names, and use .map to map that series to your code column:
df['name'] = (df['code']
              .map(df.replace({'name': {'': np.nan}})
                     .sort_values(['code', 'name'])
                     .groupby('code')
                     .first()
                     .squeeze()))
>>> df
   code       name
0     1  Australia
1     2     London
2     1  Australia
Explanation: The pd.Series map that this creates looks like this:
code
1 Australia
2 London
And it works because it gets the first instance for every code (via the groupby), sorted in such a manner that the NaNs are last. So as long as each code is associated with a name, this method will work.