I have a DataFrame with a "details" column that I believe is a dictionary. The initial data is a JSON string parsed with json.loads, then converted from a dictionary to DataFrame. I would like to populate a new "affectedDealId" column with the value in data['details']['actions']['affectedDealId'].
I'm hoping I can do this the "DataFrame" way without using a loop with something like:
data['affectedDealId'] = data['details'].get('actions').get('affectedDealId')
To simplify I've tried:
data['actions'] = data['details'].get('actions')
But that ends up as "None".
Also data['details'] seems to be a series when I think it's a dictionary before converting it to a DataFrame.
Alternatively, I do later loop through the DataFrame. How would I access that 'affectedDealId' element?
Below is a screenshot of the DataFrame from the PyCharm debugger.
I'm making some assumptions about the details JSON, but does this help? You'll have to adjust the key/index applied to json.loads(x) to reach the right location.
df['affectedDealId'] = df['details'].apply(lambda x: json.loads(x)['affectedDealId'])
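If the "details" values are already parsed dictionaries (as the question suggests) rather than JSON strings, json.loads is not needed. Below is a minimal sketch under that assumption. The sample rows and the exact nesting ('actions' holding a dict with 'affectedDealId') are made up for illustration, so adjust the lookups to your real structure.

```python
import pandas as pd

# Hypothetical sample data: each "details" value is an already-parsed dict.
data = pd.DataFrame({
    "details": [
        {"actions": {"affectedDealId": "DEAL1"}},
        {"actions": {"affectedDealId": "DEAL2"}},
        {},  # a row with no "actions" key
    ]
})

# Series.get looks up an index label of the Series, not a key inside each
# dict, so this returns the default (None) instead of the nested values.
series_level_lookup = data["details"].get("actions")

# Apply the dict lookups element by element instead. Using .get with a
# default keeps rows without "actions" from raising a KeyError.
data["affectedDealId"] = data["details"].apply(
    lambda d: d.get("actions", {}).get("affectedDealId")
)

print(data["affectedDealId"].tolist())
```

This also explains the "None" in the question: data['details'] is a Series, so .get('actions') searches the row labels rather than the dictionaries stored in the rows.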
I think it would work well to do something like this.
This creates a new DataFrame from your JSON column by applying pd.Series to each value:
data_frame_new = df['details'].apply(pd.Series)
Then reassign your DataFrame by concatenating data_frame_new with your existing DataFrame:
df = pd.concat([df,data_frame_new],axis = 1)
print(df)
This approach worked for me on a recent project.
Your affectedDealId will become a column of its own, with the data populated.
It may be of help to you.
Thanks
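Here is a small runnable sketch of the apply(pd.Series) plus concat approach. The column values are invented sample data, and it assumes each "details" value is a flat dict. Note that only the top level is expanded, so a nested dict such as 'actions' would still arrive as a dict in its own column.

```python
import pandas as pd

# Hypothetical sample data with a flat dict in each "details" cell.
df = pd.DataFrame({
    "id": [1, 2],
    "details": [
        {"affectedDealId": "DEAL1", "status": "OPEN"},
        {"affectedDealId": "DEAL2", "status": "CLOSED"},
    ],
})

# Turn each dict into a row; the dict keys become the new column names.
data_frame_new = df["details"].apply(pd.Series)

# Put the expanded columns next to the original ones.
df = pd.concat([df, data_frame_new], axis=1)

print(df.columns.tolist())
```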
I am trying to bring this data from an API into a DataFrame, but some of the data is nested. For example, I would like to have each stat, like 'passing_att', as its own column. However, it is nested under 'passing', and 'passing', 'rushing' and 'receiving' are all nested under 'stats'. I can bring just 'passing' into a DataFrame, which shows 'passing_att' and the rest of the 'passing' data, but then I don't have the names and all the other data I would like.
The correct word is to 'flatten' a JSON object. There are many ways to do this. You could use the flatten_json pip package or write your own function to do this. Though for your specific use case, you could use the json_normalize() function from the pandas library because you are trying to flatten a list of JSON objects that can easily be converted into a pandas dataframe.
Once you have the nested JSON object as shown below (already done in your code):
json_list = json.loads(response.text)
Use the following code to flatten it:
df = pandas.json_normalize(json_list)
You will get all the nested columns flattened into a single-level DataFrame, with the levels of each column name separated by a period. For example, the passing_att attribute can be accessed via the column name 'stats.passing.passing_att'. You can change the separator by passing the sep parameter to the json_normalize function. For example, to use a hyphen as the separator:
df = pandas.json_normalize(json_list, sep='-')
If you want to select only a few columns, you can pass a list of column names to the dataframe's [] operator. For example, if you want to select only the player_name, position, passing_att and passing_cmp columns, you can do so by using the following code:
df = df[['player_name', 'position', 'stats.passing.passing_att', 'stats.passing.passing_cmp']]
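To make the flattening concrete, here is a self-contained sketch. The records in json_list are invented stand-ins for the API response (the real field names and values may differ), and it assumes the response is a list of player objects with 'stats' nested as described in the question.

```python
import pandas as pd

# Hypothetical stand-in for json.loads(response.text).
json_list = [
    {
        "player_name": "Player A",
        "position": "QB",
        "stats": {
            "passing": {"passing_att": 30, "passing_cmp": 20},
            "rushing": {"rushing_att": 3},
        },
    },
    {
        "player_name": "Player B",
        "position": "RB",
        "stats": {"rushing": {"rushing_att": 15}},
    },
]

# Nested keys are joined with "." by default.
df = pd.json_normalize(json_list)

# The same data with a hyphen as the separator.
df_hyphen = pd.json_normalize(json_list, sep="-")

# Keep only a few of the flattened columns.
subset = df[["player_name", "position", "stats.passing.passing_att"]]
print(subset)
```

Players without a given stat (Player B has no 'passing' entry here) get NaN in the corresponding columns.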
I am pulling some data from a graphql api and the data I am getting is being returned in a dict like this:
and then I also turned the dict into a pandas dataframe which returns this:
From my beginner understanding, it looks like the 'swaps' row is just one very long string. I have looked at some online tutorials and still cannot figure out how to transform this row into many rows (it's 1000 rows). Any help would be greatly appreciated!
You can convert only the value of 'swaps' to a DataFrame:
df = pd.DataFrame(x['data']['swaps'])
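As a small illustration, assuming the response has the shape {'data': {'swaps': [...]}} with one dict per swap (the field names below are made up, since the original dict is not shown):

```python
import pandas as pd

# Hypothetical stand-in for the GraphQL response.
x = {
    "data": {
        "swaps": [
            {"id": "0x01", "amountUSD": "10.5"},
            {"id": "0x02", "amountUSD": "3.2"},
        ]
    }
}

# A list of dicts becomes one row per dict, one column per key.
df = pd.DataFrame(x["data"]["swaps"])
print(df.shape)
```

Passing the whole dict to pd.DataFrame instead treats 'swaps' as a single cell, which is why the list looked like one long string.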
I have a column info in a DataFrame with data in a dict-like format, like the one below.
I would like to get another DataFrame with this info, and I tried:
feature = [d.get('Feature') for d in df['info']]
but it returns none.
How can I do it? I am really having a bad time trying to get this done.
As the dict is nested, you can try pd.json_normalize() which normalizes semi-structured JSON data into a flat table:
df_new = pd.json_normalize(df['info'])
As some of the inner dicts sit inside a list, you may need extra handling to dig out the deeper contents. In any case, this should serve as a good starting point.
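A hedged sketch of both steps follows. The 'info' values here are invented (the keys 'Meta' and 'Items' are hypothetical), and it assumes the cells hold real dicts. If they are strings that only look like dicts, they would need to be parsed first, for example with json.loads.

```python
import pandas as pd

# Hypothetical sample data: a dict in each "info" cell, with one nested
# dict ("Meta") and one nested list of dicts ("Items").
df = pd.DataFrame({
    "info": [
        {"Feature": "A", "Meta": {"Score": 1},
         "Items": [{"Name": "x"}, {"Name": "y"}]},
        {"Feature": "B", "Meta": {"Score": 2},
         "Items": [{"Name": "z"}]},
    ]
})

# Convert the column to a plain list of dicts before normalizing.
records = df["info"].tolist()

# Nested dicts are flattened into dotted column names; lists are left as-is.
df_new = pd.json_normalize(records)

# To expand a nested list, name it as record_path and carry along
# parent-level fields with meta.
items = pd.json_normalize(records, record_path="Items", meta=["Feature"])
print(items)
```

The second call produces one row per list element, with the parent's 'Feature' value repeated on each row.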
I am streaming data from a websocket into my python application successfully using these lines of code:
wsClient = GDAX.WebsocketClient(url="wss://ws-feed.gdax.com", products="LTC-USD")
wsClient.start()
I am having trouble saving the results of wsClient.start() into a pandas DataFrame. I'm not sure why records are not being appended with these lines of code. Can anyone please help me understand why not?
df1 = pd.DataFrame()
for i in wsClient.start():
    df1.append(wsClient.start())
Thank you in advance.
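If you look at the documentation for append, you can see that it returns the resulting DataFrame and alters neither the DataFrame on which append is called nor the DataFrame passed as the argument. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; on current versions use pd.concat instead.)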
In the body of your loop, you probably meant something like
df1 = df1.append(wsClient.start())
As DJK correctly notes below, for a more efficient alternative, you can first create a list of all DataFrames, then append:
dfs = []
for i in wsClient.start():
    dfs.append(wsClient.start())
df1 = pd.concat(dfs)
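The collect-then-build pattern can be sketched without the websocket client. The fake_feed generator below is a hypothetical stand-in for whatever yields your messages (it is not part of the GDAX library), and the message fields are invented.

```python
import pandas as pd


def fake_feed():
    """Hypothetical message stream; not part of any real client library."""
    yield {"type": "match", "product_id": "LTC-USD", "price": "50.10"}
    yield {"type": "match", "product_id": "LTC-USD", "price": "50.12"}
    yield {"type": "match", "product_id": "LTC-USD", "price": "50.09"}


# Accumulate plain dicts in a list; list.append mutates in place,
# unlike DataFrame.append, which returned a new object.
rows = []
for message in fake_feed():
    rows.append(message)

# Build the DataFrame once at the end instead of growing it row by row.
df1 = pd.DataFrame(rows)
df1["price"] = df1["price"].astype(float)
print(df1.shape)
```

Building the frame once avoids copying all existing rows on every iteration, which is what makes repeated appends slow.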
I have a tab separated file which I extracted in pandas dataframe as below:
import pandas as pd
data1 = pd.DataFrame.from_csv(r"C:\Users\Ashish\Documents\indeed_ml_dataset\train.tsv", sep="\t")
data1
Here is what data1 looks like:
Now, I want to view the column name tags. I don't know whether I should call it a column or not, but I have tried accessing it using the norm:
data2=data1[['tags']]
but it errors out. I have tried several other things as well, using index and loc, but all of them fail. Any suggestions?
To fix this you'll need to remove description from the index by resetting. Try the below:
data2 = data1.reset_index()
data2['tags']
You'll then be able to select by "tags".
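Try reading your data with pd.read_csv instead of pd.DataFrame.from_csv, because from_csv uses the first column as the index by default. (from_csv was deprecated in pandas 0.21 and removed in pandas 1.0, so read_csv is the only option on current versions.)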
For more info refer to this documentation on pandas website: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_csv.html
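A minimal sketch of the difference, using an in-memory string in place of train.tsv. The two rows are invented and the real file may have more columns. Passing index_col=0 to read_csv is used here to mimic the old from_csv default.

```python
import io

import pandas as pd

# Hypothetical two-column, tab-separated content standing in for train.tsv.
tsv = "description\ttags\nfirst job\tpart-time\nsecond job\tfull-time\n"

# read_csv keeps every field as a regular column by default.
data1 = pd.read_csv(io.StringIO(tsv), sep="\t")
tags = data1[["tags"]]

# With index_col=0 the first column becomes the index, so "description"
# is no longer selectable as a column.
indexed = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0)

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(restored.columns.tolist())
```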