I've got a very large dataframe where one of the columns is itself a dictionary (let's say column 12). That dictionary contains part of a hyperlink, which I want to extract.
In Jupyter, I want to display a table with columns 0 and 2, as well as the completed hyperlink.
I think I need to:
1. Extract that dictionary from the dataframe
2. Get a particular keyed value from it
3. Create the full hyperlink from the extracted value
4. Copy the dataframe and replace the column with the hyperlink created above
Let's just tackle step 1, and I'll post separate questions for the next steps.
How do I extract values from a dataframe into a variable I can play with?
import pytd
import pandas
client = pytd.Client(apikey=widget_api_key.value, database=widget_database.value)
results = client.query(query)
dataframe = pandas.DataFrame(**results)
dataframe
# Not sure what to do next
If you only want to extract one key from the dictionary, and the column already holds real Python dicts, you can do it as follows:
import numpy as np
import pandas as pd
# assuming, your dicts are stored in column 'data'
# and you want to store the url in column 'url'
df['url'] = df['data'].map(lambda d: d.get('url', np.nan) if hasattr(d, 'get') else np.nan)
# from there you can do your transformation on the url column
Test data and results:
df = pd.DataFrame({
    'col1': [1, 5, 6],
    'data': [{'url': 'http://foo.org', 'comment': 'not interesting'},
             {'comment': 'great site about beer receipes, but forgot the url'},
             np.nan],
    'json': ['{"url": "http://foo.org", "comment": "not interesting"}',
             '{"comment": "great site about beer receipes, but forgot the url"}',
             np.nan],
})
# Result of the logic above:
col1 data url
0 1 {'url': 'http://foo.org', 'comment': 'not inte... http://foo.org
1 5 {'comment': 'great site about beer receipes, b... NaN
2 6 NaN NaN
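From there, completing the hyperlink (the OP's step 3) can be sketched as a simple string operation on the extracted `url` column. The base prefix below is a hypothetical placeholder, not something from the original post:

```python
import numpy as np
import pandas as pd

# stand-in for the frame after extracting the 'url' column as above
df = pd.DataFrame({'col1': [1, 5, 6],
                   'url': ['http://foo.org', np.nan, np.nan]})

base = 'https://example.com/go?to='  # hypothetical prefix, adjust to your real base URL

# prepend the prefix to every string entry; leave missing entries as NaN
df['full_url'] = df['url'].map(lambda u: base + u if isinstance(u, str) else np.nan)
print(df['full_url'])
```

The `isinstance` guard mirrors the `hasattr` check used above, so rows whose dictionary had no url stay NaN instead of raising.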
If you need to test whether your data is already stored as Python dicts (rather than strings), you can do it as follows:
print(df['data'].map(type))
If your dicts are stored as strings, you can convert them to dicts first with the following code:
import json

def get_url_from_json(document):
    if pd.isnull(document):
        url = np.nan
    else:
        try:
            _dict = json.loads(document)
            url = _dict.get('url', np.nan)
        except (ValueError, TypeError):
            url = np.nan
    return url

df['url2'] = df['json'].map(get_url_from_json)
# output:
print(df[['col1', 'url', 'url2']])
col1 url url2
0 1 http://foo.org http://foo.org
1 5 NaN NaN
2 6 NaN NaN
Related
import numpy as np
import pandas as pd

df = pd.DataFrame({'action': ['visited', 'clicked', 'switched'],
                   'target': ['pricing page', 'homepage', 'succeesed'],
                   'type': [np.nan, np.nan, np.nan]})
I have an empty "type" column in the dataframe. I want text to be written when a row satisfies a certain condition, e.g.:
action=visited and target=pricing page gets type=free
df.loc[df['action'].eq('visited') & df['target'].eq('pricing page'), 'type'] = 'free'
action target type
0 visited pricing page free
1 clicked homepage NaN
2 switched succeesed NaN
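If several such rules are needed, `np.select` generalizes the single `.loc` assignment to a list of condition/label pairs. Note the second rule and its 'engaged' label below are made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'action': ['visited', 'clicked', 'switched'],
                   'target': ['pricing page', 'homepage', 'succeesed']})

# one boolean mask per rule, in priority order
conditions = [
    df['action'].eq('visited') & df['target'].eq('pricing page'),
    df['action'].eq('clicked') & df['target'].eq('homepage'),  # hypothetical extra rule
]
choices = ['free', 'engaged']  # 'engaged' is a made-up label for illustration

# rows matching no condition fall back to the default
df['type'] = np.select(conditions, choices, default=None)
print(df)
```

Rules are evaluated in order, so the first matching condition wins for each row.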
I want to create a new column conditional on two other columns in Python.
Below is the dataframe:
name    address
apple   hello1234
banana  happy111
apple   str3333
pie     diary5144
I want to create a new column "want", conditional on columns "name" and "address".
The rules are as follows:
(1) If the value in "name" is apple, then the value in "want" should be the first five letters of "address".
(2) If the value in "name" is banana, then the value in "want" should be the first four letters of "address".
(3) If the value in "name" is pie, then the value in "want" should be the first three letters of "address".
The dataframe I want looks like this:
name    address    want
apple   hello1234  hello
banana  happy111   happ
apple   str3333    str33
pie     diary5144  dia
How do I address this problem? Thanks!
I hope you are well,
import pandas as pd

# Initialize data of lists.
data = {'Name': ['Apple', 'Banana', 'Apple', 'Pie'],
        'Address': ['hello1234', 'happy111', 'str3333', 'diary5144']}

# Create DataFrame
df = pd.DataFrame(data)

# Add an empty column
df['Want'] = ''

# Use .at rather than chained indexing so the assignment hits the original frame
for i in range(len(df)):
    if df['Name'].iloc[i] == "Apple":
        df.at[i, 'Want'] = df['Address'].iloc[i][:5]
    if df['Name'].iloc[i] == "Banana":
        df.at[i, 'Want'] = df['Address'].iloc[i][:4]
    if df['Name'].iloc[i] == "Pie":
        df.at[i, 'Want'] = df['Address'].iloc[i][:3]

# Print the Dataframe
print(df)
I hope it helps,
Have a lovely day
I think a broader way of doing this is by creating a conditional map dict and applying it with lambda functions on your dataset.
Creating the dataset:
import pandas as pd
data = {
    'name': ['apple', 'banana', 'apple', 'pie'],
    'address': ['hello1234', 'happy111', 'str3333', 'diary5144']
}
df = pd.DataFrame(data)
Defining the conditional dict:
conditionalMap = {
    'apple': lambda s: s[:5],
    'banana': lambda s: s[:4],
    'pie': lambda s: s[:3]
}
Applying the map:
df.loc[:, 'want'] = df.apply(lambda row: conditionalMap[row['name']](row['address']), axis=1)
With the resulting df:
    name    address    want
0   apple   hello1234  hello
1   banana  happy111   happ
2   apple   str3333    str33
3   pie     diary5144  dia
You could do the following:
for string, length in {"apple": 5, "banana": 4, "pie": 3}.items():
    mask = df["name"].eq(string)
    df.loc[mask, "want"] = df.loc[mask, "address"].str[:length]
Iterate over the 3 conditions: string is the string on which the length requirement depends, and the length requirement is stored in length.
Build a mask via df["name"].eq(string) which selects the rows with value string in column name.
Then set column want at those rows to the adequately clipped column address values.
Result for the sample dataframe:
name address want
0 apple hello1234 hello
1 banana happy111 happ
2 apple str3333 str33
3 pie diary5144 dia
Here's an example of the data I'm working with:
values variable.variableName timeZone
0 [{'value': [], turbidity PST
'qualifier': [],
'qualityControlLevel': [],
'method': [{
'methodDescription': '[TS087: YSI 6136]',
'methodID': 15009}],
'source': [],
'offset': [],
'sample': [],
'censorCode': []},
{'value': [{
'value': '17.2',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '17.5',
'qualifiers': ['P'],
'dateTime': '2022-01-05T14:00:00.000-08:00'}
}]
1 [{'value': degC PST
[{'value': '9.3',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:30:00.000-08:00'},
{'value': '9.4',
'qualifiers': ['P'],
'dateTime': '2022-01-05T12:45:00.000-08:00'},
}]
I'm trying to break each of the variables in the data out into its own dataframe. What I have so far works; however, if there are multiple sets of values (as in turbidity), it only pulls in the first set, which is sometimes empty. How do I pull in all the value sets? Here's what I have so far:
import requests
import pandas as pd
url = ('https://waterservices.usgs.gov/nwis/iv?sites=11273400&period=P1D&format=json')
response = requests.get(url)
result = response.json()
json_list = result['value']['timeSeries']
df = pd.json_normalize(json_list)
new_df = df['values'].apply(lambda x: pd.DataFrame(x[0]['value']))
new_df.index = df['variable.variableName']
# print turbidity
print(new_df.loc['Turbidity, water, unfiltered, monochrome near infra-red LED light, '
                 '780-900 nm, detection angle 90 ±2.5°, formazin nephelometric units (FNU)'])
This outputs:
turbidity df
Empty DataFrame
Columns: []
Index: []
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Whereas I want my output to be something like:
turbidity df
value qualifiers dateTime
0 17.2 P 2022-01-05T12:30:00.000-08:00
1 17.5 P 2022-01-05T14:00:00.000-08:00
degC df
value qualifiers dateTime
0 9.3 P 2022-01-05T12:30:00.000-08:00
1 9.4 P 2022-01-05T12:45:00.000-08:00
Unfortunately, it only grabs the first value set, which in the case of turbidity is empty. How can I grab them all, or check whether the dataframe is empty and grab the next one?
I believe the missing link here is DataFrame.explode() -- it allows you to split a single row that contains a list of values (your "values" column) into multiple rows.
You can then use
new_df = df.explode("values")
which will split the "turbidity" row into two.
You can then filter rows with empty "value" dictionaries and apply .explode() once again.
You can then also use pd.json_normalize again to expand a dictionary of values into multiple columns, or also look into Series.str.get() to extract a single element from a dict or list.
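The steps above can be sketched on a small stand-in for the USGS structure. The `json_list` here is a trimmed-down mock of the question's data, not the live API response:

```python
import pandas as pd

# trimmed mock of the 'timeSeries' records from the question
json_list = [
    {'variable': {'variableName': 'turbidity'},
     'values': [
         {'value': []},  # empty first set, as in the question
         {'value': [{'value': '17.2', 'qualifiers': ['P'],
                     'dateTime': '2022-01-05T12:30:00.000-08:00'},
                    {'value': '17.5', 'qualifiers': ['P'],
                     'dateTime': '2022-01-05T14:00:00.000-08:00'}]},
     ]},
    {'variable': {'variableName': 'degC'},
     'values': [
         {'value': [{'value': '9.3', 'qualifiers': ['P'],
                     'dateTime': '2022-01-05T12:30:00.000-08:00'}]},
     ]},
]

df = pd.json_normalize(json_list)
# one row per entry of the 'values' list, instead of keeping only values[0]
df = df.explode('values')
# drop the sets whose inner 'value' list is empty
df = df[df['values'].map(lambda v: len(v['value']) > 0)]
# expand each surviving set into its own small dataframe
for name, v in zip(df['variable.variableName'], df['values']):
    print(name)
    print(pd.DataFrame(v['value']))
```

With the empty turbidity set filtered out, the non-empty second set survives and both variables print with their value/qualifiers/dateTime columns.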
This JSON is deeply nested, so I think it takes a few steps to transform into what you want.
# First, use json_normalize on top level to extract values and variableName.
df = pd.json_normalize(result, record_path=['values'], meta=[['variable', 'variableName']])
# Then explode the value to flatten the array and filter out any empty array
df = df.explode('value').dropna(subset=['value'])
# Another json_normalize on the exploded value to extract the value and qualifier and dateTime, concat with variableName.
# explode('qualifiers') is to take out wrapping array.
df = pd.concat([df[['variable.variableName']].reset_index(drop=True),
pd.json_normalize(df.value).explode('qualifiers')], axis=1)
The resulting dataframe should look like this:
variable.variableName value qualifiers dateTime
0 Temperature, water, °C 10.7 P 2022-01-06T12:15:00.000-08:00
1 Temperature, water, °C 10.7 P 2022-01-06T12:30:00.000-08:00
2 Temperature, water, °C 10.7 P 2022-01-06T12:45:00.000-08:00
3 Temperature, water, °C 10.8 P 2022-01-06T13:00:00.000-08:00
If you will do further data processing, it is probably better to keep everything in one dataframe, but if you really need separate dataframes, take them out with filtering:
df_turbidity = df[df['variable.variableName'].str.startswith('Turbidity')]
I have a column named 'hierarchy' in a pandas dataframe which holds dictionary values:
{'5ff70ec16e8fa91c6462a47f': {'title': 'TP Layer', 'joinBy': '4a850c44-0107-48fb-a5e3-14a8e4cd44ab'}}
{'5fff3c3318d71e001221cc5b': {'title': 'Legal Entities', 'joinBy': '20e49f0a-4dca-43a3-8a5c-2ef1607c5e5f'}}
{'5ff76134930ddee5814becba': {'title': 'Line Item', 'joinBy': '5a8295e8-e006-4a6a-98b9-64587bb679c6'}}
nan
nan
nan
{'5ff74bc8930ddef3be4becb5': {'title': 'Relationship', 'joinBy': 'ea307ebb-1b40-4c6b-b922-b7d6d6920e03'}}
nan
nan
{'600062d318d71e001221cc5d': {'title': 'ProjeX V2 Periods', 'joinBy': '1e09f4d0-2736-4a38-a122-ac8e7ee35367'}}
I want to extract the title and joinBy and create separate columns for that in dataframe, so, the result should appear like
title joinBy
TP Layer 4a850c44-0107-48fb-a5e3-14a8e4cd44ab
Legal Entities 20e49f0a-4dca-43a3-8a5c-2ef1607c5e5f
nan nan
Does anyone have any idea how to do this?
With df referring to the input dataframe described in your post, the following will add new columns to your dataframe and fill them with "title" and "joinBy" values extracted from the "hierarchy" column:
import numpy as np
import pandas as pd
for i, entry in enumerate(df["hierarchy"]):
    if isinstance(entry, dict):
        k = list(entry.keys())[0]
        df.at[i, "title"] = entry[k]["title"]
        df.at[i, "joinBy"] = entry[k]["joinBy"]
    else:
        df.at[i, "title"] = np.nan
        df.at[i, "joinBy"] = np.nan
Note that I have used np.nan in my snippet. If you want the missing values represented differently in the created dataframe, you would have to amend the code accordingly.
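A loop-free variant of the same idea is to `apply` a small unpacking function that returns a two-column Series; the sample row below reuses the first dict from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hierarchy': [
    {'5ff70ec16e8fa91c6462a47f': {'title': 'TP Layer',
                                  'joinBy': '4a850c44-0107-48fb-a5e3-14a8e4cd44ab'}},
    np.nan,
]})

def unpack(entry):
    # each dict has a single opaque key; its value holds title/joinBy
    if isinstance(entry, dict):
        inner = next(iter(entry.values()))
        return pd.Series({'title': inner.get('title'), 'joinBy': inner.get('joinBy')})
    return pd.Series({'title': np.nan, 'joinBy': np.nan})

df[['title', 'joinBy']] = df['hierarchy'].apply(unpack)
print(df[['title', 'joinBy']])
```

Because `unpack` returns a Series, `apply` produces a dataframe whose columns can be assigned in one step, and NaN rows pass through unchanged.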
I have a pandas data frame where one of the columns is an array of keywords, one row in the data frame would look like
id, jobtitle, company, url, keywords
1, Software Engineer, Facebook, http://xx.xx, [javascript, java, python]
However, the number of keywords per row can range from 1 to 40.
I would like to do some data analysis, like:
what keyword appears most often across the whole dataset
what keywords appear most often for each job title/company
Apart from giving each keyword its own column and dealing with lots of NaN values, is there an easy way to answer these questions with Python (presumably pandas, as it's a dataframe)?
You can do something like this :
import pandas as pd

keyword_dict = {}

def count_keywords(keyword):
    for item in keyword:
        if item in keyword_dict:
            keyword_dict[item] += 1
        else:
            keyword_dict[item] = 1

def new_function():
    data = {'keywords':
            [['hello', 'test'], ['test', 'other'], ['test', 'hello']]}
    df = pd.DataFrame(data)
    df.keywords.map(count_keywords)
    print(keyword_dict)

if __name__ == '__main__':
    new_function()
output
{'hello': 2, 'test': 3, 'other': 1}
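An alternative that avoids the global counter dict is `explode` plus `value_counts`, which also answers the per-job-title/company question via `groupby`. The `jobtitle` values below are made up to illustrate:

```python
import pandas as pd

df = pd.DataFrame({
    'jobtitle': ['Software Engineer', 'Software Engineer', 'Data Scientist'],
    'keywords': [['hello', 'test'], ['test', 'other'], ['test', 'hello']],
})

# one row per (job, keyword) pair
exploded = df.explode('keywords')

# most frequent keywords across the whole dataset
print(exploded['keywords'].value_counts())

# most frequent keywords per job title
print(exploded.groupby('jobtitle')['keywords'].value_counts())
```

Since `value_counts` sorts descending by default, the top keyword overall is simply the first index entry of the first result.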