Flatten nested JSON and concatenate to dataframe using pandas - python

I have searched for a lot of similar topics online, but I have not found the solution yet.
My pandas dataframe looks like this:
index FOR
0 [{'id': '2766', 'name': '0803 Computer Softwar...
1 [{'id': '2766', 'name': '0803 Computer Softwar...
2 [{'id': '2766', 'name': '0803 Computer Softwar...
3 [{'id': '2766', 'name': '0803 Computer Softwar...
4 [{'id': '2766', 'name': '0803 Computer Softwar...
And I would like to flatten every row into a dataframe like the following (shown here for the first row only):
index id name
0 2766 0803 Computer Software
I found a similar solution here. Unfortunately, it raised the following TypeError:
TypeError: the JSON object must be str, bytes or bytearray, not 'list'
My code was:
dfs = []
for i in test['FOR']:
    data = json.loads(i)
    dfx = pd.json_normalize(data)
    dfs.append(dfx)
df = pd.concat(dfs).reset_index(inplace = True)
print(df)
Could anyone help me here?
Thank you very much!

Try using literal_eval from the ast module in the standard library.
from ast import literal_eval
df_flattened = pd.json_normalize(df['FOR'].map(literal_eval))
Then drop duplicates.
print(df_flattened.drop_duplicates())
id name
0 2766 0803 Computer Software
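The TypeError in the question says the cells are already Python lists rather than strings, in which case neither json.loads nor literal_eval is needed. A minimal sketch of that case follows; the sample frame is made up to mimic the truncated output above, and it assumes every cell is a list of flat dicts.

```python
import pandas as pd

# Hypothetical sample: each cell of 'FOR' is already a list of dicts.
test = pd.DataFrame({
    "FOR": [
        [{"id": "2766", "name": "0803 Computer Software"}],
        [{"id": "2766", "name": "0803 Computer Software"}],
    ]
})

# explode() puts every dict on its own row; dropna() skips empty lists.
records = test["FOR"].explode().dropna().tolist()

# json_normalize turns the list of dicts into one column per key.
flat = pd.json_normalize(records).drop_duplicates().reset_index(drop=True)
print(flat)
```

If the cells turn out to be strings after all, mapping literal_eval over the column first, as suggested above, would be the step to add.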

After a few weeks away from this work, I ran into a similar case, and I think I now have a solution for it.
Please feel free to correct me or suggest other ideas.
I really appreciate all the help and generous support!
chuck = []
for i in range(len(test)):
    chuck.append(pd.json_normalize(test.iloc[i, :]['FOR']))
test_df = pd.concat(chuck)
And then drop duplicated columns for the test_df

Related

Python Pandas .str.extract method fails when indexing

I'd like to set values on a slice of a DataFrame with .loc, using the pandas .str.extract() method, but it fails with indexing errors. The same code works perfectly if I swap extract for contains.
Here is a sample frame:
import pandas as pd
df = pd.DataFrame(
    {
        'name': [
            'JUNK-0003426', 'TEST-0003435', 'JUNK-0003432', 'TEST-0003433', 'TEST-0003436',
        ],
        'value': [
            'Junk', 'None', 'Junk', 'None', 'None',
        ]
    }
)
Here is my code:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)")
How can I set the None values to the extracted regex string?
The problem seems to be that .str.extract returns a pd.DataFrame. You can .squeeze() it to turn it into a Series, and then it works fine:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").squeeze()
Index alignment takes care of the rest.
Instead of trying to get the group, you can replace the rest with the empty string:
df.loc[df['value']=='None', 'value'] = df.loc[df['value']=='None', 'name'].str.replace(r'TEST-\d{3}', '', regex=True)
Was this answer helpful to your problem?
Here is a way to do it:
df.loc[df["name"].str.startswith("TEST"), "value"] = df["name"].str.extract(r"TEST-\d{3}(\d+)").loc[:,0]
Output:
name value
0 JUNK-0003426 Junk
1 TEST-0003435 3435
2 JUNK-0003432 Junk
3 TEST-0003433 3433
4 TEST-0003436 3436
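Another option along the same lines is to ask .str.extract for a Series directly with expand=False, which avoids the squeeze or .loc[:, 0] step. This is only a sketch on the sample frame from the question; the names mask and extracted are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["JUNK-0003426", "TEST-0003435", "JUNK-0003432", "TEST-0003433", "TEST-0003436"],
        "value": ["Junk", "None", "Junk", "None", "None"],
    }
)

# Rows whose name starts with TEST are the ones to overwrite.
mask = df["name"].str.startswith("TEST")

# With a single capture group, expand=False returns a Series instead of a DataFrame.
extracted = df["name"].str.extract(r"TEST-\d{3}(\d+)", expand=False)

# Assign only the matching rows; pandas aligns on the index.
df.loc[mask, "value"] = extracted[mask]
print(df)
```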

JSON_Normalize (nested json) to csv

I have been trying to use pandas to extract data from a txt file containing UTF-8 encoded JSON data.
Direct link to data file - http://download.companieshouse.gov.uk/psc-snapshot-2022-02-06_8of20.zip
Data's structure looks like the following examples:
{"company_number":"04732933","data":{"address":{"address_line_1":"Windsor Road","locality":"Torquay","postal_code":"TQ1 1ST","premises":"Windsor Villas","region":"Devon"},"country_of_residence":"England","date_of_birth":{"month":1,"year":1964},"etag":"5623f35e4bb5dc9cb37e134cb2ac0ca3151cd01f","kind":"individual-person-with-significant-control","links":{"self":"/company/04732933/persons-with-significant-control/individual/8X3LALP5gAh5dAYEOYimeiRiJMQ"},"name":"Ms Karen Mychals","name_elements":{"forename":"Karen","surname":"Mychals","title":"Ms"},"nationality":"British","natures_of_control":["ownership-of-shares-50-to-75-percent"],"notified_on":"2016-04-06"}}
{"company_number":"10118870","data":{"address":{"address_line_1":"Hilltop Road","address_line_2":"Bearpark","country":"England","locality":"Durham","postal_code":"DH7 7TL","premises":"54"},"ceased_on":"2019-04-15","country_of_residence":"England","date_of_birth":{"month":9,"year":1983},"etag":"5b3c984156794e5519851b7f1b22d1bbd2a5c5df","kind":"individual-person-with-significant-control","links":{"self":"/company/10118870/persons-with-significant-control/individual/hS6dYoZ234aXhmI6Q9y83QbAhSY"},"name":"Mr Patrick John Burns","name_elements":{"forename":"Patrick","middle_name":"John","surname":"Burns","title":"Mr"},"nationality":"British","natures_of_control":["ownership-of-shares-25-to-50-percent","voting-rights-25-to-50-percent"],"notified_on":"2017-04-06"}}
A plain pd.read_json did not work initially (I got ValueError: Trailing data) until I passed lines=True (I am using a Jupyter notebook for this).
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
This is how the data structure is displayed via df.head():
company_number data
0 06851805 {'address': {'address_line_1': 'Briar Road', '...
1 04732933 {'address': {'address_line_1': 'Windsor Road',...
2 10118870 {'address': {'address_line_1': 'Hilltop Road',...
3 10118870 {'address': {'address_line_1': 'Hilltop Road',...
4 09565353 {'address': {'address_line_1': 'Old Hertford R...
After looking through Stack Overflow and several online tutorials, I tried pd.json_normalize(df) but keep getting AttributeError: 'str' object has no attribute 'values'. Ultimately, I would like to export this JSON file to a CSV file.
Thank you in advance for any advice!
You can solve that problem by applying the json_normalize just to the data column.
import pandas as pd
import json
df = pd.read_json(r'E:\JSON_data\psc-snapshot-2022-02-06_8of20.txt', encoding='utf8', lines=True)
# Flatten the nested 'data' column
df2 = pd.json_normalize(df['data'])
df = pd.concat([df, df2], axis=1)
# Output to CSV
df.to_csv("./OUTPUT_FILE_NAME")
print(df)
company_number ... name_elements.middle_name
0 4732933 ... NaN
1 10118870 ... John
[2 rows x 24 columns]
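Note that the printed output above shows 4732933 where the file has "04732933", so the leading zero was lost when read_json inferred a numeric type. One way to sidestep that, sketched below, is to parse each line with the json module and hand the records to json_normalize in one go. The two-line sample is a shortened, hypothetical stand-in for the real snapshot file, and the names sample, records and flat are illustrative.

```python
import io
import json

import pandas as pd

# Hypothetical stand-in for the line-delimited snapshot file.
sample = io.StringIO(
    '{"company_number":"04732933","data":{"name":"Ms Karen Mychals","address":{"locality":"Torquay"}}}\n'
    '{"company_number":"10118870","data":{"name":"Mr Patrick John Burns","address":{"locality":"Durham"}}}\n'
)

# One JSON document per line; skip blank lines.
records = [json.loads(line) for line in sample if line.strip()]

# Nested keys become dotted column names such as 'data.address.locality'.
flat = pd.json_normalize(records)

# company_number stays a string, so the leading zero survives the export.
csv_text = flat.to_csv(index=False)
print(csv_text)
```

For the real file, the StringIO object would be replaced by an open file handle, and to_csv would be given an output path.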

Extract data from specific format in Pandas DF

I have raw data in CSV format which looks like this:
product-name brand-name rating
["Whole Wheat"] ["bb Royal"] ["4.1"]
Expected output:
product-name brand-name rating
Whole Wheat bb Royal 4.1
I want this to affect every entry in my dataset. I have 10,000 rows of data. How can I do this using pandas?
Can we do this using regular expressions? Not sure how to do it.
Thank you.
Edit 1:
My data looks something like this:
df = {
    'product-name': [[""'Whole Wheat'""], [""'Milk'""]],
    'brand-name': [[""'bb Royal'""], [""'XYZ'""]],
    'rating': [[""'4.1'""], [""'4.0'""]],
}
df_p = pd.DataFrame(data=df)
It outputs like this: ["bb Royal"]
PS: Apologies for my programming. I am quite new to programming and also to this community. I really appreciate your help here :)
If I understand correctly, you can select the first value of each list:
df = df.apply(lambda x: x.str[0])
Or, if the values are strings:
df = df.replace(r'[\[\]]', '', regex=True)
You can also use the explode function:
df = df.apply(pd.Series.explode)
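If the CSV cells are read in as strings such as '["Whole Wheat"]', stripping the brackets alone would leave the quotes behind. A possible sketch for that case parses each cell with ast.literal_eval and keeps the first item. It assumes every cell is a valid one-element list literal; the helper first_item and the sample values are illustrative.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical sample: every cell is the string form of a one-element list.
df_p = pd.DataFrame({
    "product-name": ['["Whole Wheat"]', '["Milk"]'],
    "brand-name": ['["bb Royal"]', '["XYZ"]'],
    "rating": ['["4.1"]', '["4.0"]'],
})


def first_item(cell):
    """Parse a list literal such as '["4.1"]' and return its first element."""
    return literal_eval(cell)[0]


# Apply the helper to every cell, column by column.
cleaned = df_p.apply(lambda col: col.map(first_item))

# Ratings come out as strings, so convert them to numbers.
cleaned["rating"] = cleaned["rating"].astype(float)
print(cleaned)
```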

How to create a Pandas DataFrame from a list of OrderedDicts?

I have the following list:
o_dict_list = [(OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Coffee')]), 'Ambiguous'),
(OrderedDict([('StreetNamePreType', 'AVENUE'), ('StreetName', 'Washington')]), 'Ambiguous'),
(OrderedDict([('StreetNamePreType', 'ROAD'), ('StreetName', 'Quartz')]), 'Ambiguous')]
As the title says, I am trying to turn this list into a pandas dataframe whose columns are 'StreetNamePreType' and 'StreetName' and whose rows contain the corresponding values for each key in the OrderedDict.
I searched Stack Overflow for guidance on how to create the dataframe (see here), but I get an error when I run this code, which tries to replicate that response.
from collections import Counter, OrderedDict
import pandas as pd
col = Counter()
for k in o_dict_list:
    col.update(k)
df = pd.DataFrame([k.values() for k in o_dict_list], columns = col.keys())
When I run this code, the error I get is: TypeError: unhashable type: 'OrderedDict'
I looked up this error here and understand that there is a problem with the data types, but unfortunately I don't know enough about the inner workings of Python/pandas to resolve it on my own.
I suspect that my list of OrderedDicts is not exactly the same as the one here, which is why my code does not work. More specifically, I believe I have a list of sets, and each element contains an OrderedDict. The example I have linked to here seems to be a true list of OrderedDicts.
Any help would be appreciated.
I would use a list comprehension to do this, as follows.
pd.DataFrame([od for od, _ in o_dict_list])
See the output below.
StreetNamePreType StreetName
0 ROAD Coffee
1 AVENUE Washington
2 ROAD Quartz
Extracting the OrderedDict objects from your list and then using pd.DataFrame should work:
values = []
for i in range(len(o_dict_list)):
    values.append(o_dict_list[i][0])
pd.DataFrame(values)
StreetNamePreType StreetName
0 ROAD Coffee
1 AVENUE Washington
2 ROAD Quartz
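The original error arises because each element of o_dict_list is a tuple of (OrderedDict, label). Counter.update(k) then iterates over the tuple and tries to hash the OrderedDict itself. Unpacking the tuple avoids this, and the second element can be kept as well. The sketch below does that; the column name "label" is made up for illustration.

```python
from collections import OrderedDict

import pandas as pd

o_dict_list = [
    (OrderedDict([("StreetNamePreType", "ROAD"), ("StreetName", "Coffee")]), "Ambiguous"),
    (OrderedDict([("StreetNamePreType", "AVENUE"), ("StreetName", "Washington")]), "Ambiguous"),
    (OrderedDict([("StreetNamePreType", "ROAD"), ("StreetName", "Quartz")]), "Ambiguous"),
]

# Each element is a 2-tuple, so split it into the dict and its label.
records = [od for od, _label in o_dict_list]
labels = [label for _od, label in o_dict_list]

df = pd.DataFrame(records)
df["label"] = labels  # hypothetical column name for the tuple's second item
print(df)
```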
d = [{'points': 50, 'time': '5:00', 'year': 2010},
     {'points': 25, 'time': '6:00', 'month': "february"},
     {'points': 90, 'time': '9:00', 'month': 'january'},
     {'points_h1': 20, 'month': 'june'}]
pd.DataFrame(d)

how to convert pandas dataframe and numpy array into dictionary?

I have the following pandas dataframe, which looks like this:
code comp name
0 A292340 디비자산운용 마이티 200커버드콜ATM레버리지
1 A291630 키움투자자산운용 KOSEF 코스닥150선물레버리지
2 A278240 케이비자산운용 KBSTAR 코스닥150선물레버리지
3 A267770 미래에셋자산운용 TIGER 200선물레버리지
4 A267490 케이비자산운용 KBSTAR 미국장기국채선물레버리지(합성 H)
I would like to build a dictionary from it that looks like this:
{'20180408': {'A292340': {'comp': '디비자산운용', 'name': '마이티 200커버드콜ATM 레버리지'}}}
Sorry that the data is in a language that may be foreign to you.
What I tried is:
values = [comp, name]
names = ['comp', 'name']
tmp = {names:values for names, values in zip(names, values)}
tpm = {code:values for values in zip(*tmp)}
aaaa = {date:c for c in zip(*tpm)}
print(aaaa)
aaaa is what I am trying to get, and date is just a simple list of dates, from earlier dates up to today. But when I run this, I get the error:
TypeError: unhashable type: 'numpy.ndarray'
Thank you in advance for your answer.
Is this what you want? First, set the "code" column as the index. Then use to_dict with orient="index".
df.set_index("code").to_dict("index")
{'A267490': {'comp': '케이비자산운용', 'name': 'KBSTAR 미국장기국채선물레버리지(합성 H)'},
'A267770': {'comp': '미래에셋자산운용', 'name': 'TIGER 200선물레버리지'},
'A278240': {'comp': '케이비자산운용', 'name': 'KBSTAR 코스닥150선물레버리지'},
'A291630': {'comp': '키움투자자산운용', 'name': 'KOSEF 코스닥150선물레버리지'},
'A292340': {'comp': '디비자산운용', 'name': '마이티 200커버드콜ATM레버리지'}}
Using the argument "index" will give the layout:
{index -> {columnName -> valueOfTheColumn}}
Here since we set code as the index, we have
code -> {"comp" -> comp's value, "name" -> name's value}
'A267490': {'comp': '케이비자산운용', 'name': 'KBSTAR 미국장기국채선물레버리지(합성 H)'}
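To get the outer date key shown in the question, the result of to_dict can be wrapped in one more dictionary. This is a sketch only: it uses two rows of the sample data and assumes a single date string, whereas the question mentions a list of dates whose mapping to the rows is not specified.

```python
import pandas as pd

# Two rows taken from the sample dataframe in the question.
df = pd.DataFrame({
    "code": ["A292340", "A291630"],
    "comp": ["디비자산운용", "키움투자자산운용"],
    "name": ["마이티 200커버드콜ATM레버리지", "KOSEF 코스닥150선물레버리지"],
})

date = "20180408"  # assumed: one date key for the whole frame

# The inner dict is keyed by code; the outer dict is keyed by date.
result = {date: df.set_index("code").to_dict("index")}
print(result[date]["A292340"])
```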
