Separate pd DataFrame Rows that are dictionaries into columns - python

I am extracting some data from an API and having challenges transforming it into a proper dataframe.
The resulting DataFrame df is arranged as such:
Index Column
0 {'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
1 {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}
I am trying to split the emails into one column and the list into a separate column:
Index Column1 Column2
0 email#email.com [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
Ideally, each 'action'/'date' would have it's own separate row, however I believe I can do the further unpacking myself.
After looking around I tried/failed lots of solutions such as:
df.apply(pd.Series) # does nothing
pd.DataFrame(df['column'].values.tolist()) # makes each dictionary key as a separate colum
where most of the rows are NaN except one which has the pair value
Edit:
As many of the questions asked the initial format of the data in the API, it's a list of dictionaries:
[{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]},{'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
Thanks

One naive way of doing this is as below:
inp = [{'email#email.com': [{'action': 'data', 'date': 'date'}, {'action': 'data', 'date': 'date'}]}
, {'different-email#email.com': [{'action': 'data', 'date': 'date'}]}]
index = 0
df = pd.DataFrame()
for each in inp: # iterate through the list of dicts
for k, v in each.items(): #take each key value pairs
for eachv in v: #the values being a list, iterate through each
print (str(eachv))
df.set_value(index,'Column1',k)
df.set_value(index,'Column2',str(eachv))
index += 1
I am sure there might be a better way of writing this. Hope this helps :)

Assuming you have already read it as dataframe, you can use following -
import ast
df['Column'] = df['Column'].apply(lambda x: ast.literal_eval(x))
df['email'] = df['Column'].apply(lambda x: x.keys()[0])
df['value'] = df['Column'].apply(lambda x: x.values()[0])

Related

Pandas Dataframe from list nested in json

I have a request that gets me some data that looks like this:
[{'__rowType': 'META',
'__type': 'units',
'data': [{'name': 'units.unit', 'type': 'STRING'},
{'name': 'units.classification', 'type': 'STRING'}]},
{'__rowType': 'DATA', '__type': 'units', 'data': ['A', 'Energie']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['bar', ' ']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CCM', 'Volumen']},
{'__rowType': 'DATA', '__type': 'units', 'data': ['CDM', 'Volumen']}]
and would like to construct a (Pandas) DataFrame that looks like this:
Things like pd.DataFrame(pd.json_normalize(test)['data'] are close but still throw the whole list into the column instead of making separate columns. record_path sounded right but I can't get it to work correctly either.
Any help?
It's difficult to know how the example generalizes, but for this particular case you could use:
pd.DataFrame([d['data'] for d in test
if d.get('__rowType', None)=='DATA' and 'data' in d],
columns=['unit', 'classification']
)
NB. assuming test the input list
output:
unit classification
0 A Energie
1 bar
2 CCM Volumen
3 CDM Volumen
Instead of just giving you the code, first I explain how you can do this by details and then I'll show you the exact steps to follow and the final code. This way you understand everything for any further situation.
When you want to create a pandas dataframe with two columns you can do this by creating a dictionary and passing it to DataFrame class:
my_data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=my_data)
This will result in this dataframe:
So if you want to have the dataframe you specified in your question the my_data dictionary should be like this:
my_data = {
'unit': ['A', 'bar', 'CCM', 'CDM'],
'classification': ['Energie', '', 'Volumen', 'Volumen'],
}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df
(You can see the df.index=... part. This is because that the index column of the desired dataframe is started at 1 in your question)
So if you want to do so you just have to extract these data from the data you provided and convert them to the exact dictionary mentioned above (my_data dictionary)
To do so you can do this:
# This will get the data values like 'bar', 'CCM' and etc from your initial data
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
So the whole code would be this:
d = YOUR_DATA
# This will get the data values like 'bar', 'CCM' and etc
values = [x['data'] for x in d if x['__rowType']=='DATA']
# This gets the columns names from meta data
meta = list(filter(lambda x: x['__rowType']=='META', d))[0]
columns = [x['name'].split('.')[-1] for x in meta['data']]
# This line creates the exact dictionary we need to send to DataFrame class.
my_data = {column:[v[i] for v in values] for i, column in enumerate(columns)}
df = pd.DataFrame(data=my_data, )
df.index = np.arange(1, len(df)+1)
df #or print(df)
Note: Of course you can do all of this in one complex line of code but to avoid confusion I decided to do this in couple of lines of code

Traverse through a JSON file

Hi I have a lrage JSON file.I'm reading the data from the JSON file & storing it in a list. I need to extract some element from the JSON file. So I wrote this code
l=len(alldata_json)
for i in range(l):
df_school_us.loc[i,'schoolName']=alldata_json[i].get('schoolName')
data_address=alldata_json[i].get('addressLocations')
df_school_us.loc[i,'Latitude']=data_address[0].get('Location').get('latitude')
df_school_us.loc[i,'Longitude']=data_address[0].get('Location').get('longitude')
print("i= ",i)
len(alldata_json) is returning 87598 & alldata_json contains my json data.But I'm feeling running for loop with this many number of rows is not an optimized approach. Can you suggest me how to do it without for loop?
df = pd.DataFrame(alldata_json)
df2 = pd.concat([df.drop('addressLocations', axis=1),
df['addressLocations'].apply(pd.Series)], axis=1)
Extracting countryCode, latitude, and longitude
import pandas as pd
data = [{'locationType': 'ab',
'address': {'countryCode': 'IN',
'city': 'Mumbai',
'zipCode': '5000',
'schoolNumber': '2252'},
'Location': {'latitude': 19.0760,
'longitude': 72.8777},
'names': [{'languageCode': 'IN', 'name': 'DPS'},
{'languageCode': 'IN', 'name': 'DPS'}]}]
df = pd.DataFrame(data)
df2 = pd.concat([df['address'].apply(pd.Series)['countryCode'],
df['Location'].apply(pd.Series)[['latitude', 'longitude']]

Pandas - Extracting data from Series

I am trying to extract a seat of data from a column that is of type pandas.core.series.Series.
I tried
df['col1'] = df['details'].astype(str).str.findall(r'name\=(.*?),')
but the above returns null
Given below is how the data looks like in column df['details']
[{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}]
Trying to extract value corresponding to name field
Expected output : Name1
try this: simple, change according to your need.
import pandas as pd
df = pd.DataFrame([{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}])
print(df['name'][0])
#or if DataFrame inside a column itself
df['details'][0]['name']
NOTE: as you mentioned details is one of the dataset that you have in the existing dataset
import pandas as pd
df = pd.DataFrame([{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}])
#Name column
print(df.name)
#Find specific values in Series
indeces = df.name.str.find("Name") #Returns indeces of such values
df.iloc[index] # Returns all columns that fields name contain "Name"
df.name.iloc[index] # Returns all values from column name, which contain "Name"
Hope, this example will help you.
EDIT:
Your data frame has column 'details', which contain a dict {'id':101, ...}
>>> df['details']
0 {'id': 101, 'name': 'Name1', 'state': 'active'...
And you want to get value from field 'name', so just try:
>>> df['details'][0]['name']
'Name1'
The structure in your series is a dictionary.
[{'id': 101, 'name': 'Name1', 'state': 'active', 'boardId': 101, 'goal': '', 'startDate': '2019-01-01T12:16:20.296Z', 'endDate': '2019-02-01T11:16:00.000Z'}]
You can just point to the element 'name' from that dict with the following command
df['details'][0]['name']
If the name could be different you can get the list of the keys in the dictionary and apply your regex on that list to get your field's name.
Hope that it can help you.

How to create a dict of dicts from pandas dataframe?

I have a dataframe df
id price date zipcode
u734 8923944 2017-01-05 AERIU87
uh72 9084582 2017-07-28 BJDHEU3
u029 299433 2017-09-31 038ZJKE
I want to create a dictionary with the following structure
{'id': xxx, 'data': {'price': xxx, 'date': xxx, 'zipcode': xxx}}
What I have done so far
ids = df['id']
prices = df['price']
dates = df['date']
zips = df['zipcode']
d = {'id':idx, 'data':{'price':p, 'date':d, 'zipcode':z} for idx,p,d,z in zip(ids,prices,dates,zips)}
>>> SyntaxError: invalid syntax
but I get the error above.
What would be the correct way to do this, using either
list comprehension
OR
pandas .to_dict()
bonus points: what is the complexity of the algorithm, and is there a more efficient way to do this?
I'd suggest the list comprehension.
v = df.pop('id')
data = [
{'id' : i, 'data' : j}
for i, j in zip(v, df.to_dict(orient='records'))
]
Or a compact version,
data = [dict(id=i, data=j) for i, j in zip(df.pop('id'), df.to_dict(orient='r'))]
Note that, if you're popping id inside the expression, it has to be the first argument to zip.
print(data)
[{'data': {'date': '2017-09-31',
'price': 299433,
'zipcode': '038ZJKE'},
'id': 'u029'},
{'data': {'date': '2017-01-05',
'price': 8923944,
'zipcode': 'AERIU87'},
'id': 'u734'},
{'data': {'date': '2017-07-28',
'price': 9084582,
'zipcode': 'BJDHEU3'},
'id': 'uh72'}]

JSON to Pandas Dataframe not knowing if JSON will have all the columns of the dataframe

I am doing a research project and trying to pull thousands of quarterly results for companies from the SEC EDGAR API.
Each result is a list of dictionaries structured as follows:
[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}...]
I want each result to be a row of a pandas dataframe. The issue is that each result may not have the same fields due to the data available. I would like to check if the column(field) of the dataframe is present in one of the results field and if it is add the result value to the row. If not, I would like to add an np.NaN. How would I go about doing this?
A list/dict comprehension ought to work here:
In [11]: s
Out[11]:
[[{'field': 'othercurrentliabilities', 'value': 6886000000.0},
{'field': 'otherliabilities', 'value': 13700000000.0},
{'field': 'propertyplantequipmentnet', 'value': 15789000000.0}],
[{'field': 'othercurrentliabilities', 'value': 6886000000.0}]]
In [12]: pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s])
Out[12]:
othercurrentliabilities otherliabilities propertyplantequipmentnet
0 6.886000e+09 1.370000e+10 1.578900e+10
1 6.886000e+09 NaN NaN
make a list of df.result.rows[x]['values']
like below
s=[]
for x in range(df.result.totalrows[0]):
s=s+[df.result.rows[x]['values']]
print(x)
df1=pd.DataFrame([{d["field"]: d["value"] for d in row} for row in s]
df1
will give you result.

Categories

Resources