Convert a nested dictionary JSON response to a DataFrame - Python

I am receiving a nested dictionary as the response to an API call. I tried converting it into a dataframe, but I am not able to get the output I want.
I wrote some code to handle the file, but I am left with a massive chunk of nested dictionary data in the "items" column. How do I parse that and create a dataframe from it?
df = pd.json_normalize(response.json())
df.to_csv('file1.csv')
This is the csv file I was able to generate:
https://drive.google.com/file/d/1wg0QqkFmIpv_aUYefbrQxBMz_x4hRWMX/view?usp=share_link (check the items column)
I also tried the json_normalize and flatdict routes from other JSON/dict-to-df answers on Stack Overflow, but those did not work.
Any help is appreciated.

You can use:
df = df.explode('items').reset_index(drop=True)  # one row per item dict
items = pd.json_normalize(df.pop('items'))  # flatten each dict into columns
df = df.join(items)
(The reset_index is needed so the exploded rows line up with json_normalize's fresh index when joining.)
There are two columns left to convert.
print(df[['tags','productConfiguration.allowedOrderQuantities']])
'''
   tags               productConfiguration.allowedOrderQuantities
0  [popular, onsale]  []
1  [popular, onsale]  []
2  [popular, onsale]  []
3  [popular, onsale]  []
'''
Explode these into new rows:
df = df.explode('tags').explode('productConfiguration.allowedOrderQuantities').drop_duplicates()
There is one caveat, though: explode creates one row per list element. After this operation each original row is repeated 2 times, so if there are 100 rows in the dataset there will now be 200, because we have converted the nested JSON structures into extra columns and rows.
For a more general explode method:
explode_cols = []
for i in df.columns:
    if type(df[i][0]) == list:  # check whether the column's first value is a list
        explode_cols.append(i)  # if so, mark the column for exploding
df = df.explode(explode_cols)  # explode df with the collected column list
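For illustration, here is a small self-contained run of the same pattern on made-up data (the 'id', 'sku' and 'tags' names are hypothetical, not from the API above):
import pandas as pd
df = pd.DataFrame({
    'id': [1],
    'items': [[{'sku': 'a', 'tags': ['popular']},
               {'sku': 'b', 'tags': ['onsale']}]],
})
df = df.explode('items').reset_index(drop=True)  # one row per item dict
df = df.join(pd.json_normalize(df.pop('items')))  # flatten the dicts into columns
df = df.explode('tags')  # one row per tag
print(df)
This prints one row per (item, tag) pair, with the nested keys promoted to ordinary columns.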

Related

Adding whole lines of a dataframe via a for loop

I had code as follows to collect interesting rows into a new dataframe:
df = df1.iloc[[66, 113, 231, 51, 152, 122, 185, 179, 114, 169, 97]]
but I want to use a for loop to collect the data. I have read that I need to combine the data into a list and then create the dataframe from it, but all the examples I have seen are for numbers, and I can't do the same for each row of a dataframe. At the moment I have the following:
data = ['A','B','C','D','E']
for n in range(10):
    data.append(dict(zip(df1.iloc[n, 4])))
df = pd.Dataframe(data)
(P.S. The 4 is in the code because I want the data to be selected via column E; the dataframe is already sorted, so I am just looking for the first 10 rows.)
Thanks in advance for your help.
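A minimal sketch of one way to do this, assuming df1 is already sorted so that the first 10 rows are the ones wanted:
import pandas as pd
rows = []
for n in range(10):
    rows.append(df1.iloc[n].to_dict())  # each row becomes one dict
df = pd.DataFrame(rows)
Equivalently, without a loop, df = df1.head(10).copy() gives the same frame, since the sort has already put the interesting rows first.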

Using .loc and .isin to look up values in a column - error as list has ints and strings

I have a big data frame - TUCABCP
I have a list called millreflist which contains integers, but now also strings
My objective is to look up the items from millreflist in the dataframe TUCABCP under column 'PO' and then return all the matching rows in a new dataframe called NEWBCP.
My code was working fine until recently, when millreflist started to contain a string. Now, when using .loc and .isin, the search drops all the other rows and returns only the rows with the string under the 'PO' column.
Below is my code:
millreflist = combined['TUCA PO'].tolist()
millreflist = list(dict.fromkeys(millreflist))
df = TUCABCP.loc[TUCABCP['PO'].isin(millreflist)]
df.to_excel("NEWBCP.xlsx", header=True, index=False)
For example if I had 5 integers before I would get 5 rows in return - perfect.
Now if I have 5 integers and 1 string, I only get 1 row in return - the string row.
My question is: why are all the other rows being discarded, so that I only get back the row with the string?
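A likely cause is a dtype mismatch: isin compares by value and type, so the integer 1002 in 'PO' never matches the string '1002'. A minimal sketch of one fix, normalizing both sides to strings (using the variable names from the post):
millreflist = combined['TUCA PO'].astype(str).tolist()  # everything as strings
millreflist = list(dict.fromkeys(millreflist))  # de-duplicate, keeping order
NEWBCP = TUCABCP.loc[TUCABCP['PO'].astype(str).isin(millreflist)]
NEWBCP.to_excel("NEWBCP.xlsx", header=True, index=False)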

How do I convert a list into a Python dataframe?

I have used a for loop for text extraction from images, and I am getting errors while converting the list into a pandas dataframe.
info = []
for item in dirs:
    if os.path.isfile(path + item):
        for a in x:
            img = Image.open(path + item)
            crop = img.crop(a)
            text = pytesseract.image_to_string(crop)
            info.append(text)
df = pd.DataFrame([info], colnames=['col1','col2'])
df
Expected result: data store in dataframe row wise.
Yes, the list is not a list of two items. I have 14 predefined columns.
Here is another piece of code I tried:
for i in range(len(info)):
    df.loc[i] = [info[n] for n in range(14)]
Please check the documentation for pd.DataFrame:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
The line in which you create your dataframe
df = pd.DataFrame([info], colnames=['col1','col2']
is missing a parenthesis at the end, uses colnames instead of columns, has unnecessary square brackets around your list, and is creating two columns where you only need one.
Please mention the exact error
There are two problems here, I think.
First of all, you are passing [info] to the DataFrame although info is already a list. You can just pass the list as it is.
Second, you're trying to convert the list into a DataFrame with two columns via colnames=['col1','col2'], but the keyword is columns, not colnames.
I think that's the problem: your list is not a list of two-item lists (like [[a, b], [c, d]]). Just use:
df = pd.DataFrame(info, columns=['col1'])
Best
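Since the asker mentions 14 predefined columns, here is a minimal sketch of the row-wise reshaping, assuming info is a flat list whose length is a multiple of 14 (the column names below are hypothetical placeholders):
import pandas as pd
cols = ['col{0}'.format(i) for i in range(1, 15)]  # 14 placeholder column names
rows = [info[i:i + 14] for i in range(0, len(info), 14)]  # chunk the flat list row-wise
df = pd.DataFrame(rows, columns=cols)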

Searching CSV files with pandas (unique ids) - Python

I am looking into searching a CSV file with 242000 rows and want to count the unique identifiers in one of the columns. The column name is 'logid' and it has a number of different values, i.e. 1002, 3004, 5003. I want to search the CSV file using a pandas dataframe and count the occurrences of each unique identifier. If possible, I would then like to create a new CSV file that stores this information. For example, if I find there are 50 logids of 1004, I would like to create a CSV file that has the column name 1004 and the count of 50 displayed below it. I would do this for all unique identifiers and add them to the same CSV file. I am completely new at this and have done some searching, but have no idea where to start.
Thanks!
As you haven't posted your code, I can only answer about the general way this would work:
1. Load the CSV file into a pd.DataFrame using pandas.read_csv.
2. Save all values which have an occurrence > 1 in a separate df1 using pandas.DataFrame.drop_duplicates, like:
df1 = df.drop_duplicates(keep="first")
--> This will return a DataFrame which only contains the rows with the first occurrence of each duplicated value. E.g. if the value 1000 appears in 5 rows, only the first row is returned while the others are dropped.
--> Applying df1.shape[0] will then give you the number of distinct values in your df.
3. If you want to store all rows of df which contain a "duplicate value" in a separate CSV file, you have to do something like this:
df = pd.DataFrame({"A": [0, 1, 2, 3, 0, 1, 2, 5, 5]})  # this should represent your original data set
print(df)
# I assume the duplicate values are in column "A"; omit the subset keyword to check whole rows
df1 = df.drop_duplicates(subset="A", keep="first")
print(df1)
frames = []  # one sub-frame per distinct value
for m in df1["A"]:
    mask = (df == m)
    frames.append(df[mask].dropna())
for dfx in range(len(frames)):
    name = "file{0}".format(dfx)
    frames[dfx].to_csv(r"YOUR PATH\{0}".format(name))
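For the exact output described in the question (one column per logid with its count below), value_counts is a shorter route; a minimal sketch, assuming the input file is called data.csv:
import pandas as pd
df = pd.read_csv('data.csv')  # hypothetical input file name
counts = df['logid'].value_counts()  # occurrences of each unique logid
# transpose to one row: logids become column names, counts the values below them
counts.to_frame().T.to_csv('logid_counts.csv', index=False)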

Python 2.7: How to split a column into multiple columns based on special strings like this?

I'm a newbie at programming and Python, so I would appreciate your advice!
I have a dataframe like this.
In the 'info' column, there are 7 different categories: activities, locations, groups, skills, sights, types and other, and each category has its unique values within [ ] (i.e., "activities": ["Tour"]).
I would like to split the 'info' column into 7 different columns, one per category, as shown below.
I would like to allocate appropriate column names and also put the corresponding strings within [ ] into each row.
Is there any easy way to split dataframe like that?
I was thinking of using str.split functions to split it into pieces and merge everything later, but I am not sure that is the best way to go, and I wanted to see if there is a more sophisticated way to make a dataframe like this.
Any advice is appreciated!
--UPDATE--
When I print(dframe['info']), it shows like this.
It looks like the content of the info column is JSON-formatted, so you can parse that into a dict object easily:
>>> import json
>>> s = '''{"activities": ["Tour"], "locations": ["Tokyo"], "groups": []}'''
>>> j = json.loads(s)
>>> j
{u'activities': [u'Tour'], u'locations': [u'Tokyo'], u'groups': []}
Once you have the data as a dict, you can do whatever you like with it.
OK, here is how to do it:
import pandas as pd
import ast

# initial dataframe is df
mylist = list(df['info'])
mynewlist = []
for l in mylist:
    mynewlist.append(ast.literal_eval(l))
df_info = pd.DataFrame(mynewlist)

# add the columns of decoded info to the initial dataset
df_new = pd.concat([df, df_info], axis=1)

# remove the info column
del df_new['info']
You can use the json library to do that.
1) Import the json library:
import json
2) Turn all the rows of that column into strings, then apply the json.loads function to each of them, and store the result in an object:
jsonO = df['info'].map(str).apply(json.loads)
3) jsonO is now a Series of dicts you can navigate. For each key of your JSON data, create a column in your final dataframe:
df['activities'] = jsonO.apply(lambda x: x['activities'])
Here, each row's value for that key is dumped into the new column of your final dataframe df.
4) Repeat step 3 for all the columns you're interested in.
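Putting steps 2 to 4 together, a minimal sketch that loops over the seven categories named in the question (assuming every row of 'info' is a JSON string containing those keys):
import json
categories = ['activities', 'locations', 'groups', 'skills',
              'sights', 'types', 'other']
parsed = dframe['info'].map(str).apply(json.loads)  # step 2
for cat in categories:  # steps 3 and 4
    dframe[cat] = parsed.apply(lambda d, c=cat: d.get(c, []))
del dframe['info']  # drop the original column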
