I have a column named 'hierarchy' in a pandas DataFrame that contains dictionary values:
{'5ff70ec16e8fa91c6462a47f': {'title': 'TP Layer', 'joinBy': '4a850c44-0107-48fb-a5e3-14a8e4cd44ab'}}
{'5fff3c3318d71e001221cc5b': {'title': 'Legal Entities', 'joinBy': '20e49f0a-4dca-43a3-8a5c-2ef1607c5e5f'}}
{'5ff76134930ddee5814becba': {'title': 'Line Item', 'joinBy': '5a8295e8-e006-4a6a-98b9-64587bb679c6'}}
nan
nan
nan
{'5ff74bc8930ddef3be4becb5': {'title': 'Relationship', 'joinBy': 'ea307ebb-1b40-4c6b-b922-b7d6d6920e03'}}
nan
nan
{'600062d318d71e001221cc5d': {'title': 'ProjeX V2 Periods', 'joinBy': '1e09f4d0-2736-4a38-a122-ac8e7ee35367'}}
I want to extract the title and joinBy values and create separate columns for them in the dataframe, so the result should appear like:
title           joinBy
TP Layer        4a850c44-0107-48fb-a5e3-14a8e4cd44ab
Legal Entities  20e49f0a-4dca-43a3-8a5c-2ef1607c5e5f
nan             nan
Does anyone have any idea how to do this?
With df referring to the input dataframe described in your post, the following will add new columns to your dataframe and fill them with "title" and "joinBy" values extracted from the "hierarchy" column:
import numpy as np
import pandas as pd
for i, entry in enumerate(df["hierarchy"]):
    if isinstance(entry, dict):
        k = list(entry.keys())[0]
        df.at[i, "title"] = entry[k]["title"]
        df.at[i, "joinBy"] = entry[k]["joinBy"]
    else:
        df.at[i, "title"] = np.nan
        df.at[i, "joinBy"] = np.nan
Note that I used np.nan in my snippet. If you want the missing values represented differently in the resulting dataframe, amend the code accordingly.
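A loop-free alternative for larger frames (a sketch, assuming every non-null cell is a single-key dict shaped like the examples above):

def unpack(entry):
    # each cell wraps one inner dict under an id-like outer key
    if isinstance(entry, dict):
        inner = next(iter(entry.values()))
        return pd.Series({"title": inner.get("title"), "joinBy": inner.get("joinBy")})
    return pd.Series({"title": np.nan, "joinBy": np.nan})

df[["title", "joinBy"]] = df["hierarchy"].apply(unpack)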
Related
I have a large data file as shown below.
I wanted to add two new columns (E and F) next to column D, and move the suite number (when applicable) and the City/State data in cells D3 and D4 to E2 and F2, respectively. The challenge is that not every entry has a suite number. I would need to insert a row first for the entries that don't have a suite number, only for them, not for those that already have the suite information.
I know how to write loops, but I am having trouble defining the conditions. One way would be to count the length of the string. How should I get started? I much appreciate your help!
This is how I would do it. I don't recommend looping when using pandas; it has so many tools that looping is often not needed. Some caution here: your spreadsheet has NaN, which I think is the numpy np.nan equivalent, and you also have blanks, which I am treating as the empty string "".
import pandas as pd
import numpy as np
# dictionary of your data
companies = {
'Comp ID': ['C1', '', np.nan, 'C2', '', np.nan, 'C3',np.nan],
'Address': ['10 foo', 'Suite A','foo city', '11 spam','STE 100','spam town', '12 ham', 'Myhammy'],
'phone': ['888-321-4567', '', np.nan, '888-321-4567', '', np.nan, '888-321-4567',np.nan],
'Type': ['W_sale', '', np.nan, 'W_sale', '', np.nan, 'W_sale',np.nan],
}
# make the frames needed.
df = pd.DataFrame( companies)
df1 = pd.DataFrame() # blank frame for suite and town columns
# Edit here to TEST the data types
for r in range(0, 5):
    v = df['Comp ID'].values[r]
    print(f'this "{v}" is a ', type(v))
# So this tells us the data types, so we can construct our where(). Back to the prior answer....
# We need a where clause; it is similar to an IF() statement in Excel
df1['Suite'] = np.where( df['Comp ID']=='', df['Address'], np.nan)
df1['City/State'] = np.where( df['Comp ID'].isna(), df['Address'], np.nan)
# copy values to rows above
df1 = df1[['Suite','City/State']].backfill()
# join the frames together on index
df = df.join(df1)
df.drop_duplicates(subset=['City/State'], keep='first', inplace=True)
# set the column order to what you want
df = df[['Comp ID', 'Type', 'Address', 'Suite', 'City/State', 'phone' ]]
Output:

  Comp ID    Type  Address    Suite City/State         phone
0      C1  W_sale   10 foo  Suite A   foo city  888-321-4567
3      C2  W_sale  11 spam  STE 100  spam town  888-321-4567
6      C3  W_sale   12 ham      NaN    Myhammy  888-321-4567
Edit: the numpy where statement:
numpy is brought in by the line import numpy as np at the top. We are creating a calculated column based on the 'Comp ID' column; numpy does this without loops. Think of where like an Excel IF() function:
df1['col'] = np.where(condition, value_if_true, value_if_false)
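For instance, a tiny self-contained example of the pattern:

import numpy as np

flags = np.where(np.array([1, 5, 2]) > 2, 'big', 'small')
print(flags)  # ['small' 'big' 'small']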
The pandas backfill
Sometimes a value sits in a cell below and you want to duplicate it into the blank cell(s) above it, so you backfill: df1 = df1[['Suite','City/State']].backfill() (backfill() is an alias for bfill()).
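A minimal illustration of the idea, using bfill(), the modern spelling of the same operation:

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 'Suite A', np.nan, 'STE 100'])
print(s.bfill().tolist())  # ['Suite A', 'Suite A', 'STE 100', 'STE 100']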
import numpy as np
import pandas as pd

df = pd.DataFrame({'action': ['visited', 'clicked', 'switched'],
                   'target': ['pricing page', 'homepage', 'succeesed'],
                   'type': [np.nan, np.nan, np.nan]})
I have an empty "type" column in the dataframe. I want text to be written into it when a row satisfies a certain condition, e.g.:
action=visited and target=pricing page gets type=free
df.loc[df['action'].eq('visited') & df['target'].eq('pricing page'), 'type'] = 'free'
     action        target  type
0   visited  pricing page  free
1   clicked      homepage   NaN
2  switched     succeesed   NaN
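If you have several such rules, numpy's np.select keeps them in one place. A sketch against the same frame; the second rule and the 'engaged' label are invented for illustration:

import numpy as np

conditions = [
    df['action'].eq('visited') & df['target'].eq('pricing page'),
    df['action'].eq('clicked') & df['target'].eq('homepage'),  # hypothetical second rule
]
choices = ['free', 'engaged']  # 'engaged' is an invented label
df['type'] = np.select(conditions, choices, default=None)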
I have a list of an arbitrary number of dictionaries in each cell of a pandas column.
df['Amenities'][0]
[{'Description': 'Basketball Court(s)'},
{'Description': 'Bike Rack / Bike Storage'},
{'Description': 'Bike Rental'},
{'Description': 'Business Center'},
{'Description': 'Clubhouse'},
{'Description': 'Community Garden'},
{'Description': 'Complex Wifi '},
{'Description': 'Courtesy Patrol/Officer'},
{'Description': 'Dog Park'},
{'Description': 'Health Club / Fitness Center'},
{'Description': 'Jacuzzi'},
{'Description': 'Pet Friendly'},
{'Description': 'Pet Park / Dog Run'},
{'Description': 'Pool'}]
I'd like to do the following.
1) Iterate over the list of dicts, unpack them, and create columns with value 1 (the amenity exists).
2) On subsequent iterations, check whether the column label already exists; if it does, add 1 as the value to the cell, or create a new column if it doesn't.
3) Fill the remaining cells with 0.
Basically, I am trying to create features that hold values 0 and 1 from a list of dictionaries.
The code below creates new columns based on the dict values, but the part that checks whether the column exists, creates a new one if it doesn't, and assigns the 1s and 0s still needs a bit of thinking.
for i, row in df.iterrows():
    dict_obj = row['Amenities']
    for key, val in dict_obj.items():
        if val in df.columns:
            df.loc[i, val] = 1
        else:
            .......
The expected outcome is one column per amenity, holding 1 where the amenity exists and 0 otherwise.
One way is to explode the column Amenities, then create a dataframe, use str.get_dummies on the column, and sum the rows back together on level=0 of the index, like:
#data example
df = pd.DataFrame({
'Amenities': [
[{'Description': 'Basketball Court(s)'},
{'Description': 'Bike Rental'}],
[{'Description': 'Basketball Court(s)'},
{'Description': 'Clubhouse'},
{'Description': 'Community Garden'}]
]})
# explode
s = df['Amenities'].explode()
# create dataframe, use get_dummies and group-sum on level=0 of the index
df_ = pd.DataFrame(s.tolist(), s.index)['Description'].str.get_dummies().groupby(level=0).sum()
print (df_)
   Basketball Court(s)  Bike Rental  Clubhouse  Community Garden
0                    1            1          0                 0
1                    1            0          1                 1
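An equivalent shortcut skips the intermediate dataframe, since Series.str.get can also extract a key from dicts (a sketch against the same exploded s):

# pull the 'Description' value out of each dict, then dummy-encode and group-sum
df_ = s.str.get('Description').str.get_dummies().groupby(level=0).sum()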
Your code was a great start and very close!
As you said, you need to iterate through the dictionaries. The solution is to use .loc to create the new column on your dataframe (for the amenity currently being processed) if it doesn't yet exist or set its value if it does.
import pandas as pd

df = pd.DataFrame(
    {
        "Amenities": [
            [
                {"Description": "Basketball Court(s)"},
                {"Description": "Bike Rack / Bike Storage"},
                {"Description": "Bike Rental"},
            ],
            [
                {"Description": "Basketball Court(s)"},
                {"Description": "Courtesy Patrol/Officer"},
                {"Description": "Dog Park"},
            ],
        ]
    }
)

for i, row in df.iterrows():
    amenities_list = row["Amenities"]
    for amenity in amenities_list:
        for k, v in amenity.items():
            df.loc[i, v] = 1

df = df.drop(columns="Amenities")
df = df.fillna(0).astype({i: "int" for i in df.columns})
Short description:
i is the row index and v is the name of the amenity (string). df.loc[] takes in row index, column index and creates a new column if the column index is not yet present.
After the for loop, we just drop the no-longer-needed "Amenities" column, replace all NA values with 0, and then convert all columns to integers (NA values exist only for floats, so by default the columns are floats to begin with).
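For the two-row example above, printing df after these steps gives:

   Basketball Court(s)  Bike Rack / Bike Storage  Bike Rental  Courtesy Patrol/Officer  Dog Park
0                    1                         1            1                        0         0
1                    1                         0            0                        1         1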
I've got a very large dataframe where one of the columns (let's say column 12) is itself a dictionary. That dictionary holds part of a hyperlink, which I want to get.
In Jupyter, I want to display a table with columns 0 and 2, as well as the completed hyperlink.
I think I need to:
1) Extract that dictionary from the dataframe
2) Get a particular keyed value from it
3) Create the full hyperlink from the extracted value
4) Copy the dataframe and replace the column with the hyperlink created above
Let's just tackle step 1 here; I'll post separate questions for the next steps.
How do I extract values from a dataframe into a variable I can play with?
import pytd
import pandas
client = pytd.Client(apikey=widget_api_key.value, database=widget_database.value)
results = client.query(query)
dataframe = pandas.DataFrame(**results)
dataframe
# Not sure what to do next
If you only want to extract one key from the dictionary and the dictionary is already stored as a dictionary in the column, you can do it as follows:
import numpy as np
import pandas as pd
# assuming, your dicts are stored in column 'data'
# and you want to store the url in column 'url'
df['url']= df['data'].map(lambda d: d.get('url', np.NaN) if hasattr(d, 'get') else np.NaN)
# from there you can do your transformation on the url column
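From there, the asker's step 3 (building the completed hyperlink) could look like the sketch below; BASE_URL is a hypothetical prefix you would replace with your real site:

# hypothetical base address; adjust to your site
BASE_URL = 'https://example.com'
df['link'] = df['url'].map(lambda u: f'{BASE_URL}/{u}' if isinstance(u, str) else np.NaN)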
Test data and results
df = pd.DataFrame({
    'col1': [1, 5, 6],
    'data': [{'url': 'http://foo.org', 'comment': 'not interesting'}, {'comment': 'great site about beer receipes, but forgot the url'}, np.NaN],
    'json': ['{"url": "http://foo.org", "comment": "not interesting"}', '{"comment": "great site about beer receipes, but forgot the url"}', np.NaN]
})
# Result of the logic above:
   col1                                               data             url
0     1  {'url': 'http://foo.org', 'comment': 'not inte...  http://foo.org
1     5  {'comment': 'great site about beer receipes, b...             NaN
2     6                                                NaN             NaN
If you need to test, if your data is already stored in python dicts (rather than strings), you can do it as follows:
print(df['data'].map(type))
If your dicts are stored as strings, you can convert them to dicts first based on the following code:
import json

def get_url_from_json(document):
    if pd.isnull(document):
        url = np.NaN
    else:
        try:
            _dict = json.loads(document)
            url = _dict.get('url', np.NaN)
        except (ValueError, TypeError):
            url = np.NaN
    return url

df['url2'] = df['json'].map(get_url_from_json)
# output:
print(df[['col1', 'url', 'url2']])
   col1             url            url2
0     1  http://foo.org  http://foo.org
1     5             NaN             NaN
2     6             NaN             NaN
I have thousands of rows in a given block structure. In this structure the first row (Response Comments), the second row (Customer Name) and the last row (Recommended) are fixed; the rest of the fields/rows are not mandatory.
I am trying to write code that reads Column Name = 'Response Comments' and then sets Key = the Column Values of the next row (the Customer Name).
This should be done from the Response Comments row through the Recommended row; then the loop breaks and a new key value begins.
The data is from an Excel file:
from pandas import DataFrame
import pandas as pd
import os
import numpy as np
xl = pd.ExcelFile('Filepath')
df = xl.parse('Reviews_Structured')
print(type (df))
RowNum  Column Name        Column Values                Key
1       Response Comments  they have been unresponsive
2       Customer Name      Brian
.
.
.
.
13      Recommended        no
Any help regarding this loop code will be appreciated.
One way to implement your logic is using collections.defaultdict and a nested dictionary structure. Below is an example:
from collections import defaultdict
import numpy as np
import pandas as pd

# input data
df = pd.DataFrame([[1, 'Response Comments', 'they have been unresponsive'],
                   [2, 'Customer Name', 'Brian'],
                   .....
                   [9, 'Recommended', 'yes']],
                  columns=['RowNum', 'Column Name', 'Column Values'])

# fill Key column: the row after each 'Response Comments' holds the customer name
df['Key'] = df['Column Values'].shift(-1)
df.loc[df['Column Name'] != 'Response Comments', 'Key'] = np.nan
df['Key'] = df['Key'].ffill()

# create defaultdict of dict
d = defaultdict(dict)

# iterate dataframe
for row in df.itertuples():
    d[row[4]].update({row[2]: row[3]})
# defaultdict(dict,
#             {'April': {'Customer Name': 'April',
#                        'Recommended': 'yes',
#                        'Response Comments': 'they have been responsive'},
#              'Brian': {'Customer Name': 'Brian',
#                        'Recommended': 'no',
#                        'Response Comments': 'they have been unresponsive'},
#              'John': {'Customer Name': 'John',
#                       'Recommended': 'yes',
#                       'Response Comments': 'they have been very responsive'}})
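If you then want one row per customer, the nested dict converts straight into a dataframe (a small follow-up, not part of the original snippet):

out = pd.DataFrame.from_dict(d, orient='index')  # one row per Key, field names as columns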
Am I understanding this correctly, that you want a new DataFrame with
columns = ['Response Comments', 'Customer name', ...]
to reshape your data from the parsed excel file?
Create an empty DataFrame from the known, mandatory column names, e.g.
df_new = pd.DataFrame(columns=['Response Comments', 'Customer name', ...])
index = 0
Iterate over the parsed Excel file row by row and assign your values:
for k, row in df.iterrows():
    if row['Column Name'] in df_new:
        df_new.at[index, row['Column Name']] = row['Column Values']
    if row['Column Name'] == 'Recommended':
        # 'Recommended' ends a customer's block, so advance to the next output row
        index += 1
        continue
Not a beauty, but I'm not quite sure what exactly you're trying to achieve :)
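As an aside: once the Key column from the first answer is filled in, the same reshape works without a manual loop (a sketch, assuming no field name repeats within a customer's block):

df_new = df.pivot(index='Key', columns='Column Name', values='Column Values')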