I am a beginner programmer, so your help and support would be appreciated.
Here is a DataFrame where one column's data is a JSON-like string.
ID, Name, Information
1234, xxxx, "{'age': 25, 'gender': 'male'}"
2234, yyyy, "{'age': 34, 'gender': 'female'}"
3234, zzzz, "{'age': 55, 'gender': 'male'}"
I would like to convert this DataFrame as shown below.
ID, Name, age, gender
1234, xxxx, 25, male
2234, yyyy, 34, female
3234, zzzz, 55, male
I found that ast.literal_eval() can convert a str to a dict, but I have no idea how to write the code for this.
Would you please give an example of code that solves this issue?
Given test.csv
ID,Name,Information
1234,xxxx,"{'age': 25, 'gender': 'male'}"
2234,yyyy,"{'age': 34, 'gender': 'female'}"
3234,zzzz,"{'age': 55, 'gender': 'male'}"
Read the file in with pd.read_csv and use the converters parameter with ast.literal_eval, which will convert the data in the Information column from a str type to dict type.
Use pd.json_normalize to unpack the dict with keys as column headers and values in the rows
.join the normalized columns with df
.drop the Information column
import pandas as pd
from ast import literal_eval
df = pd.read_csv('test.csv', converters={'Information': literal_eval})
df = df.join(pd.json_normalize(df.Information))
df.drop(columns=['Information'], inplace=True)
# display(df)
ID Name age gender
0 1234 xxxx 25 male
1 2234 yyyy 34 female
2 3234 zzzz 55 male
If the data is not from a csv file:
import pandas as pd
from ast import literal_eval
data = {'ID': [1234, 2234, 3234],
'Name': ['xxxx', 'yyyy', 'zzzz'],
'Information': ["{'age': 25, 'gender': 'male'}", "{'age': 34, 'gender': 'female'}", "{'age': 55, 'gender': 'male'}"]}
df = pd.DataFrame(data)
# apply literal_eval to Information
df.Information = df.Information.apply(literal_eval)
# normalize the Information column and join to df
df = df.join(pd.json_normalize(df.Information))
# drop the Information column
df.drop(columns=['Information'], inplace=True)
If the third column were a JSON string, ' would not be valid quoting (JSON requires "), so we need to fix that before parsing.
If the third column is a string representation of a Python dict, you can use ast.literal_eval (safer than eval) to convert it.
A sample of code that splits the third column into dicts and merges them back into the original DataFrame:
import json
import pandas as pd

data = [
    [1234, 'xxxx', "{'age': 25, 'gender': 'male'}"],
    [2234, 'yyyy', "{'age': 34, 'gender': 'female'}"],
    [3234, 'zzzz', "{'age': 55, 'gender': 'male'}"],
]
df = pd.DataFrame(data)
# replace single quotes so the strings are valid JSON, then parse them into dicts
df[2] = df[2].apply(lambda x: json.loads(x.replace("'", '"')))
# expand the dicts into columns and merge them back with the original columns
merged = pd.concat([df[[0, 1]], df[2].apply(pd.Series)], axis=1)
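As an alternative to the quote replacement above, ast.literal_eval parses Python-style single-quoted dict strings directly, so no string surgery is needed. A minimal sketch (the column names are invented here for readability):

```python
import pandas as pd
from ast import literal_eval

df = pd.DataFrame(
    [[1234, 'xxxx', "{'age': 25, 'gender': 'male'}"],
     [2234, 'yyyy', "{'age': 34, 'gender': 'female'}"]],
    columns=['ID', 'Name', 'Information'])

# literal_eval understands single-quoted dict literals as-is
df['Information'] = df['Information'].apply(literal_eval)

# expand the dicts into columns and join them back onto ID and Name
merged = df.drop(columns='Information').join(df['Information'].apply(pd.Series))
```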
Related
I've got a little issue while coding a script that takes a CSV string and is supposed to select a column name and value based on the input. The CSV string contains names of NBA players, their universities, etc. When the input is "name" && "Andre Brown", it should search for those values in the given CSV string. I have a rough code laid out, but I am unsure how to implement the where method. Any ideas?
import csv
import pandas as pd
import io

class MySelectQuery:
    def __init__(self, table, columns, where):
        self.table = table
        self.columns = columns
        self.where = where

    def __str__(self):
        return f"SELECT {self.columns} FROM {self.table} WHERE {self.where}"
csvString = "name,year_start,year_end,position,height,weight,birth_date,college\nAlaa Abdelnaby,1991,1995,F-C,6-10,240,'June 24, 1968',Duke University\nZaid Abdul-Aziz,1969,1978,C-F,6-9,235,'April 7, 1946',Iowa State University\nKareem Abdul-Jabbar,1970,1989,C,7-2,225,'April 16, 1947','University of California, Los Angeles\nMahmoud Abdul-Rauf,1991,2001,G,6-1,162,'March 9, 1969',Louisiana State University\n"
df = pd.read_csv(io.StringIO(csvString), error_bad_lines=False)
where = "name = 'Alaa Abdelnaby' AND year_start = 1991"
df = df.query(where)
print(df)
The CSV string is being transformed into a pandas DataFrame, which should then find the values based on the input; however, I get the error "name 'where' not defined". I believe everything up to the df = ... part is correct; now I need help implementing the where method. (I've seen one other solution on SO but wasn't able to understand or figure that out.)
# importing pandas
import pandas as pd
record = {
'Name': ['Ankit', 'Amit', 'Aishwarya', 'Priyanka', 'Priya', 'Shaurya' ],
'Age': [21, 19, 20, 18, 17, 21],
'Stream': ['Math', 'Commerce', 'Science', 'Math', 'Math', 'Science'],
'Percentage': [88, 92, 95, 70, 65, 78]}
# create a dataframe
dataframe = pd.DataFrame(record, columns = ['Name', 'Age', 'Stream', 'Percentage'])
print("Given Dataframe :\n", dataframe)
options = ['Math', 'Science']
# selecting rows based on condition
rslt_df = dataframe[(dataframe['Age'] == 21) &
dataframe['Stream'].isin(options)]
print('\nResult dataframe :\n', rslt_df)
Output:
Result dataframe :
      Name  Age   Stream  Percentage
0    Ankit   21     Math          88
5  Shaurya   21  Science          78
Source: https://www.geeksforgeeks.org/selecting-rows-in-pandas-dataframe-based-on-conditions/
Sometimes Googling does the trick ;)
You need the double = there, and query expressions use Python-style lowercase and. So it should be:
where = "name == 'Alaa Abdelnaby' and year_start == 1991"
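For reference, here is a minimal runnable sketch of that corrected query against a toy two-row frame (data taken from the question's CSV string):

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alaa Abdelnaby', 'Zaid Abdul-Aziz'],
    'year_start': [1991, 1969],
})

# == for comparison; 'and' (or '&') combines conditions inside a query string
where = "name == 'Alaa Abdelnaby' and year_start == 1991"
result = df.query(where)
```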
I have a question related to Pandas.
In df1 I have a data frame with the id of each seller and their respective names.
In df2 I have the id of the salesmen and their respective sales.
I would like to add to df2 two new columns with the first and last names of the salesmen.
PS: in df2, one of the sales is shared between two vendors.
import pandas as pd
vendors = {'first_name': ['Montgomery', 'Dagmar', 'Reeba', 'Shalom', 'Broddy', 'Aurelia'],
'last_name': ['Humes', 'Elstow', 'Wattisham', 'Alen', 'Keningham', 'Brechin'],
'id_vendor': [127, 241, 329, 333, 212, 233]}
sales = {'id_vendor': [['127'], ['241'], ['329', '333'], ['212'], ['233']],
'sales': [1233, 25000, 8555, 4333, 3222]}
df1 = pd.DataFrame(vendors)
df2 = pd.DataFrame(sales)
I attach the code. Any suggestions?
Thank you in advance.
You can merge df1 with df2's exploded id_vendor column and use DataFrameGroupBy.agg when grouping by sales to obtain the columns you want:
transform_names = lambda x: ', '.join(list(x))
res = (df1.merge(df2.explode('id_vendor'))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
first_name last_name id_vendor
sales
1233 Montgomery Humes [127]
3222 Aurelia Brechin [233]
4333 Broddy Keningham [212]
8555 Reeba, Shalom Wattisham, Alen [329, 333]
25000 Dagmar Elstow [241]
Note:
In your example, id_vendor in df2 is populated by lists of strings, but since id_vendor in df1 is of integer type, I assume that was a typo. If id_vendor really does contain lists of strings, you also need to convert the strings to integers:
transform_names = lambda x: ', '.join(list(x))
# Notice the .astype(int) call.
res = (df1.merge(df2.explode('id_vendor').astype(int))
          .groupby('sales')
          .agg({'first_name': transform_names,
                'last_name': transform_names,
                'id_vendor': list}))
print(res)
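To make the explode step easier to picture, here is what it does to df2 on its own (a sketch assuming integer vendor ids, as the note above suggests):

```python
import pandas as pd

df2 = pd.DataFrame({'id_vendor': [[127], [241], [329, 333], [212], [233]],
                    'sales': [1233, 25000, 8555, 4333, 3222]})

# Each list element becomes its own row; the 'sales' value is repeated
# for every vendor that shares the sale
exploded = df2.explode('id_vendor')
```

After this, a plain merge with df1 on id_vendor attaches the names, and grouping by sales collects shared vendors back together.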
I have a nested JSON like below. I want to convert it into a pandas dataframe. As part of that, I also need to parse the weight value only. I don't need the unit.
I also want the number values converted from string to numeric.
Any help would be appreciated. I'm relatively new to python. Thank you.
JSON Example:
{'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'},
'gender': 'male'}
Sample output below:
id name weight gender
123 joe 100 male
Use json_normalize; in pandas ≥ 1.0 it is available directly as pd.json_normalize (older versions used from pandas.io.json import json_normalize). It flattens the nested dict into dotted column names:
id name weight.number weight.unit gender
123 joe 100 lbs male
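A minimal sketch of that call, including dropping the unit and converting the number to numeric as the question asks (assuming pandas ≥ 1.0, where json_normalize is exposed at the top level):

```python
import pandas as pd

temp = {'id': '123', 'name': 'joe',
        'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}

# Nested keys become dotted column names: weight.number, weight.unit
df = pd.json_normalize(temp)

# Keep only the weight number, as a numeric column
df = df.rename(columns={'weight.number': 'weight'}).drop(columns='weight.unit')
df['weight'] = pd.to_numeric(df['weight'])
```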
if you want to discard the weight unit, just flatten the dict first:
temp = {'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}
temp['weight'] = temp['weight']['number']
then turn it into a DataFrame (note the list around temp; pandas needs it when all values are scalars):
pd.DataFrame([temp])
Something like this should do the trick:
import pandas as pd

json_data = [{'id': '123', 'name': 'joe', 'weight': {'number': '100', 'unit': 'lbs'}, 'gender': 'male'}]
# convert the data to a DataFrame
df = pd.DataFrame.from_records(json_data)
# convert 'id' to an int
df['id'] = df['id'].apply(int)
# get the 'number' field of weight and convert it to an int
df['weight'] = df['weight'].apply(lambda x: int(x['number']))
df
My dataframe is as shown
name key value
john A223 390309
jason B439 230943
peter A5388 572039
john D23902 238939
jason F2390 23930
I want to convert the above generated dataframe into a dictionary in the below shown format.
{'john': {'key':'A223', 'value':'390309', 'key':'A5388', 'value':'572039'},
'jason': {'key':'B439','value':'230943', 'key':'F2390', 'value':'23930'},
'peter': {'key':'A5388' ,'value':'572039'}}
I tried a = dict(zip(dataframe['key'], dataframe['value'])), but that won't give me the DataFrame column headers.
Dictionary keys must be unique
Assuming, as in your desired output, you want to keep only the row with the first instance of each name, you can drop the duplicated names (to_dict with orient='index' requires a unique index) and then use to_dict:
res = df.drop_duplicates('name').set_index('name').to_dict('index')
print(res)
{'john': {'key': 'A223', 'value': 390309},
 'jason': {'key': 'B439', 'value': 230943},
 'peter': {'key': 'A5388', 'value': 572039}}
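If the goal is instead to keep every row per name (the desired output above repeats keys within one dict, which a Python dict cannot do), a common pattern is to map each name to a list of records; a sketch on the question's data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['john', 'jason', 'peter', 'john', 'jason'],
                   'key': ['A223', 'B439', 'A5388', 'D23902', 'F2390'],
                   'value': [390309, 230943, 572039, 238939, 23930]})

# One list of {'key': ..., 'value': ...} records per name
res = {name: g[['key', 'value']].to_dict('records')
       for name, g in df.groupby('name')}
```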
I have a column of values such as this:
123, 234, 345, 456, 567
When I do
pd.read_csv(dtype = {'column': str})
or
pd.read_csv(dtype = {'column': object})
they both produce values like
00123, 00234, 00345, 00456, 00567.
I was searching through Stack Exchange, and people say that you should use dtype: object, but it doesn't work for me.
If you want to read in your data as integers, drop the dtype:
df = pd.read_csv('data.csv')
If you want to convert the fields to strings, you can apply a str transformation with df.astype:
df = pd.read_csv('data.csv').astype(str)
Another option would be to use a converter:
df = pd.read_csv('data.csv', converters={'ColName': str})
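As a quick sanity check of the converters approach (using io.StringIO in place of a real file, and the invented column name ColName from the example above):

```python
import io
import pandas as pd

csv_data = "ColName,other\n00123,a\n00234,b\n"

# Without a converter, the column is inferred as int and leading zeros are lost
df_int = pd.read_csv(io.StringIO(csv_data))

# With the converter, the raw text is kept, zeros included
df_str = pd.read_csv(io.StringIO(csv_data), converters={'ColName': str})
```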