Slice pandas dataframe json column into columns - python

I have a data frame in which two columns hold JSON-like dicts, like this:
author  biblio  series
Mehrdad Vahabi  {'volume': 68, 'month': 'January', 'name': 'János Kornai', 'issue': 's', 'handle': 'n:v:68:y:2018:i', 'year': '2018', 'pages': '27-52', 'doi': ''}  {'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'}
Michael Bailey  {'c_date': '2017', 'number': '23608', 'handle': 'RePEc:nbr:nberwo:23608', 'name': 'Measuring'}  {'handle': '', 'name': ''}
I want my data frame to look like this:
author biblio.volume biblio.month biblio.name biblio.issue biblio.handle biblio.year biblio.pages biblio.doi biblio.c_date biblio.number series.handle series.name
Mehrdad Vahabi 68 January János Kornai s n:v:68:y:2018:i 2018 27-52 NA NA RePEc:aka:aoecon Oeconomica
Michael Bailey NA NA Measuring NA nberwo:23608 NA NA NA 2017 23608
I tried the answers to this question, but none of them works for me.
How can I do it?
[EDIT]
Here is a sample of the data
[EDIT]
Following @jezrael's solution, I get this:
df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns
df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns
col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)
Traceback (most recent call last):
File "dfs.py", line 8, in <module>
df1.columns = 'bibliographic.'+ df1.columns
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 583, in _evaluate_numeric_binop
other = self._validate_for_numeric_binop(other, op, opstr)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3961, in _validate_for_numeric_binop
raise TypeError("can only perform ops with scalar values")
TypeError: can only perform ops with scalar values
And with json_normalize:
Traceback (most recent call last):
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "dfs.py", line 7, in <module>
df = json_normalize(d)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/io/json/normalize.py", line 192, in json_normalize
if any([isinstance(x, dict) for x in compat.itervalues(data[0])]):
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in __getitem__
return self._getitem_column(key)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
return self._get_item_cache(key)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
values = self._data.get(item)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3843, in get
loc = self.items.get_loc(item)
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2527, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0
Following @Jhon H's solution, I get this:
Traceback (most recent call last):
File "dfs.py", line 7, in <module>
jsonSeries = df[['bibliographic']].tolist()
File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'tolist'

Create a new DataFrame from each dict column with the constructor, then concat them all together:
df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns
df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns
col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)
author biblio.c_date biblio.doi biblio.handle \
0 Mehrdad Vahabi NaN n:v:68:y:2018:i
1 Michael Bailey 2017 NaN RePEc:nbr:nberwo:23608
biblio.issue biblio.month biblio.name biblio.number biblio.pages \
0 s January Janos Kornai NaN 27-52
1 NaN NaN Measuring 23608 NaN
biblio.volume biblio.year series.handle series.name
0 68.0 2018 RePEc:aka:aoecon Oeconomica
1 NaN NaN
EDIT:
If the input is JSON, you can use json_normalize:
from pandas.io.json import json_normalize
d = [{"author":"Mehrdad Vahabi","biblio":{"volume":68,"month":"January","name":"Janos Kornai","issue":"s","handle":"n:v:68:y:2018:i","year":"2018","pages":"27-52","doi":""},"series":{"handle":"RePEc:aka:aoecon","name":"Oeconomica"}},{"author":"Michael Bailey","biblio":{"c_date":"2017","number":"23608","handle":"RePEc:nbr:nberwo:23608","name":"Measuring"},"series":{"handle":"","name":""}}]
df = json_normalize(d)
print (df)
author biblio.c_date biblio.doi biblio.handle \
0 Mehrdad Vahabi NaN n:v:68:y:2018:i
1 Michael Bailey 2017 NaN RePEc:nbr:nberwo:23608
biblio.issue biblio.month biblio.name biblio.number biblio.pages \
0 s January Janos Kornai NaN 27-52
1 NaN NaN Measuring 23608 NaN
biblio.volume biblio.year series.handle series.name
0 68.0 2018 RePEc:aka:aoecon Oeconomica
1 NaN NaN
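The `KeyError: 0` in the question's json_normalize traceback appears to come from passing a DataFrame rather than a list of dicts (the traceback shows `data[0]` being resolved as a column lookup). A minimal sketch of one way around that, assuming the cells already hold dicts and a pandas version that exposes the top-level `pd.json_normalize` (1.0 or later); the sample frame is a shortened, hypothetical stand-in for the real data:

```python
import pandas as pd

# Hypothetical frame whose 'biblio' and 'series' cells are already dicts.
df = pd.DataFrame({
    "author": ["Mehrdad Vahabi", "Michael Bailey"],
    "biblio": [{"volume": 68, "name": "Janos Kornai"},
               {"c_date": "2017", "name": "Measuring"}],
    "series": [{"handle": "RePEc:aka:aoecon", "name": "Oeconomica"},
               {"handle": "", "name": ""}],
})

# json_normalize expects a dict or a list of dicts, not a DataFrame,
# so turn the rows into a list of records first.
flat = pd.json_normalize(df.to_dict("records"))
print(flat.columns.tolist())
```

Keys missing from a record end up as NaN in the flattened frame.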
EDIT: The problem is that your dictionaries are strings, so you first need to convert them with ast.literal_eval:
import ast
df = pd.read_csv('probe.csv')
#print (df)
df1 = pd.DataFrame(df['bibliographic'].apply(ast.literal_eval).values.tolist())
df1.columns = 'bibliographic.'+ df1.columns
df2 = pd.DataFrame(df['series'].apply(ast.literal_eval).values.tolist())
df2.columns = 'series.'+ df2.columns
col = df.columns.difference(['bibliographic','series'])
df = pd.concat([df[col], df1, df2],axis=1)
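As an alternative sketch, the strings can be parsed while the file is read, using the `converters` argument of `read_csv`. The CSV text below is a hypothetical, shortened stand-in for probe.csv, whose real contents are not shown in the question. It also hints at why the earlier attempt raised "can only perform ops with scalar values": when the cells are plain strings, the constructor most likely produces a single unnamed column, so there are no string column labels to prefix.

```python
import ast
import io

import pandas as pd

# Hypothetical CSV standing in for 'probe.csv'; the dict cells are quoted text.
csv_text = """author,bibliographic,series
Mehrdad Vahabi,"{'volume': 68, 'name': 'Janos Kornai'}","{'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'}"
Michael Bailey,"{'c_date': '2017', 'name': 'Measuring'}","{'handle': '', 'name': ''}"
"""

dict_cols = ["bibliographic", "series"]

# Each converter receives the raw cell text and returns the parsed dict.
df = pd.read_csv(io.StringIO(csv_text),
                 converters={c: ast.literal_eval for c in dict_cols})
print(type(df.loc[0, "bibliographic"]))

# Expand each dict column and prefix the new column names.
biblio = pd.DataFrame(df["bibliographic"].tolist(), index=df.index).add_prefix("bibliographic.")
series = pd.DataFrame(df["series"].tolist(), index=df.index).add_prefix("series.")
result = pd.concat([df.drop(columns=dict_cols), biblio, series], axis=1)
print(result)
```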

You need to process the columns individually and then join them all together to get the format you need. Here is a simple example that you could follow:
import pandas as pd
records = [{'col1':'v1','col2':{'a1':1,'b1':1},'col3':{'c1':1,'d1':1}},
{'col1':'v2','col2':{'a1':2,'b1':2},'col3':{'c1':2,'d1':2}}]
sample_df = pd.DataFrame(records)
sample_df
col1 col2 col3
0 v1 {'a1': 1, 'b1': 1} {'c1': 1, 'd1': 1}
1 v2 {'a1': 2, 'b1': 2} {'c1': 2, 'd1': 2}
col2_expanded = sample_df.col2.apply(lambda x:pd.Series(x))
col2_expanded.columns = ['{}.{}'.format('col2',i) for i in col2_expanded]
col2_expanded
col2.a1 col2.b1
0 1 1
1 2 2
col3_expanded = sample_df.col3.apply(lambda x:pd.Series(x))
col3_expanded.columns = ['{}.{}'.format('col3',i) for i in col3_expanded]
col3_expanded
col3.c1 col3.d1
0 1 1
1 2 2
final = pd.concat([sample_df[['col1']],col2_expanded,col3_expanded],axis=1)
final
col1 col2.a1 col2.b1 col3.c1 col3.d1
0 v1 1 1 1 1
1 v2 2 2 2 2
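The per-column steps above could be wrapped in a small helper. This is only a sketch: `expand_dict_columns` is a hypothetical name, and it builds each expanded frame with the DataFrame constructor instead of `apply(pd.Series)`, which tends to be slower on large frames.

```python
import pandas as pd

def expand_dict_columns(df, columns, sep="."):
    """Return a new frame with each dict column replaced by prefixed columns."""
    parts = [df.drop(columns=columns)]
    for col in columns:
        # One new column per dict key; rows keep the original index.
        expanded = pd.DataFrame(df[col].tolist(), index=df.index)
        parts.append(expanded.add_prefix(col + sep))
    return pd.concat(parts, axis=1)

records = [{'col1': 'v1', 'col2': {'a1': 1, 'b1': 1}, 'col3': {'c1': 1, 'd1': 1}},
           {'col1': 'v2', 'col2': {'a1': 2, 'b1': 2}, 'col3': {'c1': 2, 'd1': 2}}]
final = expand_dict_columns(pd.DataFrame(records), ['col2', 'col3'])
print(final)
```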

Related

python pandas data frame error while trying to print it within single df[ _ , _ ] form

I get a DataFrame error when trying to print columns using the single-bracket df[ _ , _ ] form. The code is below:
#Data Frames code
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df =pd.DataFrame(randArr,np.arange(101,106,1),['PDS','Algo','SE','INS'])
print(df['PDS','SE'])
errors:
Traceback (most recent call last):
File "C:\Users\subro\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexes\base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: ('PDS', 'SE')
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "D:\Education\4th year\1st sem\Machine Learning Lab\1st Lab\python\pandas\pdDataFrame.py", line 11, in <module>
print(df['PDS','SE'])
File "C:\Users\subro\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "C:\Users\subro\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\site-packages\pandas\core\indexes\base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: ('PDS', 'SE')
Is this what you mean to do? It is clearer to name the columns with columns= when creating the DataFrame, and you need double square brackets df[[ ]] to select several columns from it:
import numpy as np
import pandas as pd
randArr = np.random.randint(0,100,20).reshape(5,4)
df = pd.DataFrame(randArr, columns=['PDS', 'SE', 'ABC', 'CDE'])
print(df)
print(df[['PDS','SE']])
Output:
PDS SE ABC CDE
0 56 77 82 42
1 17 12 84 46
2 34 9 19 12
3 19 88 34 19
4 51 54 9 94
PDS SE
0 56 77
1 17 12
2 34 9
3 19 88
4 51 54
Use print(df[['PDS','SE']]) instead of print(df['PDS','SE']).
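The reason the single-bracket form fails is that Python turns `df['PDS','SE']` into `df[('PDS', 'SE')]`, so pandas looks for one column whose label is that tuple. A short sketch, keeping the question's 101 to 105 index and using a seeded generator only so the example is reproducible:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(5, 4)),
                  index=np.arange(101, 106),
                  columns=['PDS', 'Algo', 'SE', 'INS'])

# df['PDS', 'SE'] is df[('PDS', 'SE')]: a single tuple key, i.e. one column label.
try:
    df['PDS', 'SE']
except KeyError as exc:
    print('KeyError:', exc)

subset = df[['PDS', 'SE']]        # a list of labels selects two columns
same = df.loc[:, ['PDS', 'SE']]   # the equivalent selection with .loc
cell = df.loc[101, 'SE']          # .loc is the accessor that takes "row, column"
print(subset)
```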

Pandas apply getting KeyError: [duplicate]

I was looking at this answer by Roman Pekar on using apply. I initially copied the code exactly and it worked fine. Then I used it on my df3, which is created from a CSV file, and I got a KeyError. I checked the dtypes: the columns I am using are int64, so that is okay, and there are no nulls. Once this works, I will make the function more complex. How do I get this working?
def fxy(x, y):
    return x * y
df3 = pd.read_csv(path + 'test_data.csv', usecols=[0,1,2])
print(df3.dtypes)
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
Traceback:
Traceback (most recent call last):
File "f:\...\my_file.py", line 54, in <module>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\frame.py", line 8833, in apply
return op.apply().__finalize__(self, method="apply")
File "C:\...\apply.py", line 727, in apply
return self.apply_standard()
File "C:\...\apply.py", line 851, in apply_standard
results, res_index = self.apply_series_generator()
File "C:\...\apply.py", line 867, in apply_series_generator
results[i] = self.f(v)
File "f:\...\my_file.py", line 54, in <lambda>
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']))
File "C:\...\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\...\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\...\range.py", line 389, in get_loc
raise KeyError(key)
KeyError: 'Len'
I don't see a way to attach the CSV file. Below is a sample of df3; if I save it from Excel as "CSV (Comma delimited) (*.csv)", I get the same results.
ID  Len  Width
A   170  4
B   362  5
C   12   15
D   42   7
E   15   3
F   46   49
G   71   74
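I think you are missing axis=1 in apply. Without it, the lambda receives each column rather than each row, so x['Len'] is looked up among the row labels and fails: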
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']), axis=1)
But in your case, you can just do:
df3['Area'] = df3['Len'] * df3['Width']
print(df3)
# Output
ID Len Width Area
0 A 170 4 680
1 B 362 5 1810
2 C 12 15 180
3 D 42 7 294
4 E 15 3 45
5 F 46 49 2254
6 G 71 74 5254
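To see what `apply` hands to the function under each axis, the sketch below records the `name` of every Series it receives. The three-row frame is a shortened copy of the sample data, and the `by_col` / `by_row` lists are only there for illustration.

```python
import pandas as pd

def fxy(x, y):
    return x * y

df3 = pd.DataFrame({'ID': ['A', 'B', 'C'],
                    'Len': [170, 362, 12],
                    'Width': [4, 5, 15]})

by_col, by_row = [], []
df3.apply(lambda s: by_col.append(s.name))           # default axis=0: one call per column
df3.apply(lambda s: by_row.append(s.name), axis=1)   # axis=1: one call per row
print(by_col)   # column labels
print(by_row)   # row labels

# With axis=1 each Series is a row, so 'Len' and 'Width' are valid keys.
df3['Area'] = df3.apply(lambda x: fxy(x['Len'], x['Width']), axis=1)
print(df3)
```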

How to draw a boxplot correctly by seaborn in python3.x

I've loaded a CSV file and it prints correctly, but I get an error when drawing a boxplot from a Series.
Loading and printing the data:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
data2 = pd.read_csv(...)
print(data2)
ax = sns.boxplot(x=data2['2'])
plt.show()
The data looks like this:
0 1 2 3 4 5 6 7 ... 29 30 31 32 33 34 35 36
0 2016-06-06 04:07:42 0 26.0 0 1 101 0 0 ... 0 0 0 0 0 0 0
1 2016-06-08 12:34:10 0 25.0 0 1 101 0 0 ... 0 0 0 0 0 0 0
....
I want to draw a boxplot of column 2 (values 26.0, 25.0, ...), but I got this error:
Traceback (most recent call last):
File "D:\Python-Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2657, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '2'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "E:/work/fLUTE/Solve-52/练习/sns练习/boxplot.py", line 16, in <module>
ax = sns.boxplot(x=data2['2'])
File "D:\Python-Anaconda\lib\site-packages\pandas\core\frame.py", line 2927, in __getitem__
indexer = self.columns.get_loc(key)
File "D:\Python-Anaconda\lib\site-packages\pandas\core\indexes\base.py", line 2659, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 129, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index_class_helper.pxi", line 91, in pandas._libs.index.Int64Engine._check_type
KeyError: '2'
When changing
ax = sns.boxplot(x=data2['2'])
to
ax = sns.boxplot(x=data2[2])
another error occurs:
TypeError: cannot perform reduce with flexible type
First, change ax = sns.boxplot(x=data2['2']) to ax = sns.boxplot(x=data2[2]), because the column labels are integers, not strings.
Second, convert the column to float before plotting by adding data2[2] = data2[2].astype(float).
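The "flexible type" error suggests that column 2 was read as text rather than numbers, although the question does not show the dtypes. Under that assumption, a slightly more forgiving variant of the conversion is `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into NaN instead of raising. The frame below is a hypothetical stand-in for data2:

```python
import pandas as pd

# Hypothetical stand-in for data2: integer column labels, numbers stored as text.
data2 = pd.DataFrame({0: ['2016-06-06', '2016-06-08', '2016-06-09'],
                      2: ['26.0', '25.0', 'n/a']})

print(data2[2].dtype)   # a text dtype, which boxplot cannot reduce

# Convert to numbers; entries that cannot be parsed become NaN and are dropped.
values = pd.to_numeric(data2[2], errors='coerce').dropna()
print(values.dtype)

# The numeric Series can then be plotted, for example:
# import seaborn as sns; import matplotlib.pyplot as plt
# sns.boxplot(x=values); plt.show()
```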

Error getting the index column from a DataFrame

I simply want to get index column.
import pandas as pd
df1=pd.read_csv(path1, index_col='ID')
df1.head()
VAR1 VAR2 VAR3 OUTCOME
ID
28677 28 1 0.0 0
27170 59 1 0.0 1
39245 65 1 0.0 1
31880 19 1 0.0 0
41441 24 1 0.0 1
I can get many columns like:
df1["VAR1"]
ID
28677 28
27170 59
39245 65
31880 19
41441 24
31070 77
39334 63
....
38348 23
38278 52
28177 58
but, I cannot get index column:
>>> df1["ID"]
Traceback (most recent call last):
File "C:\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2134, in get_loc
return self._engine.get_loc(key)
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'ID'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2059, in __getitem__
return self._getitem_column(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\frame.py", line 2066, in _getitem_column
return self._get_item_cache(key)
File "C:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1386, in _get_item_cache
values = self._data.get(item)
File "C:\Anaconda3\lib\site-packages\pandas\core\internals.py", line 3543, in get
loc = self.items.get_loc(item)
File "C:\Anaconda3\lib\site-packages\pandas\indexes\base.py", line 2136, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas\index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas\index.c:4433)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13742)
File "pandas\src\hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:13696)
KeyError: 'ID'
I want to get the index column as a list.
How do I do it?
Why do I get this error?
If I want to merge two DataFrames on the index column, how do I do it?
The first column is the index, so select it with:
print (df1.index)
Int64Index([28677, 27170, 39245, 31880, 41441], dtype='int64', name='ID')
If the index may be a MultiIndex, use get_level_values:
print (df1.index.get_level_values('ID'))
Int64Index([28677, 27170, 39245, 31880, 41441], dtype='int64', name='ID')
You can use the df.index property:
df.index (or df.index.values for numpy array)
pd.Series(df.index)
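The question also asks how to merge two DataFrames on the index. A minimal sketch, in which `df2` and its SCORE column are hypothetical: `df1["ID"]` fails because ID is the index rather than a column, so either merge on the index directly or move it back to a column with `reset_index`.

```python
import pandas as pd

df1 = pd.DataFrame({'VAR1': [28, 59, 65]},
                   index=pd.Index([28677, 27170, 39245], name='ID'))
# Hypothetical second frame that shares some of the same IDs.
df2 = pd.DataFrame({'SCORE': [0.1, 0.7]},
                   index=pd.Index([27170, 28677], name='ID'))

ids = df1.index.tolist()        # the index values as a plain list

as_column = df1.reset_index()   # 'ID' becomes an ordinary column again

# Merge on the index of both frames; only IDs present in both are kept.
merged = df1.merge(df2, left_index=True, right_index=True, how='inner')

# join() aligns on the index by default; how='left' keeps every row of df1.
joined = df1.join(df2, how='left')
print(merged)
print(joined)
```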

Pandas KeyError when working on split data frame

I want to perform some operations on a pandas data frame that is split into chunks. After splitting the data frame, I try to iterate over the chunks, but although the first iteration runs well, I then get an error (see below). I have gone through some questions like these: 1 and 2, but they don't quite address my issue. Kindly help me resolve this as I don't fully understand it.
import pandas as pd
tupList = [('Eisenstadt', 'Paris','1', '2'), ('London', 'Berlin','1','3'), ('Berlin', 'stuttgat','1', '4'),
('Liverpool', 'Southampton','1', '5'),('Tirana', 'Blackpool', '1', '6'),('blackpool', 'tirana','1','7'),
('Paris', 'Lyon','1','8'), ('Manchester', 'Nice','1','10'),('Orleans', 'Madrid','1', '12'),
('Lisbon','Stockholm','1','12')]
cities = pd.DataFrame(tupList, columns=['Origin', 'Destination', 'O_Code', 'D_code'])
# purpose - splits the DataFrame into smaller chunks of max size chunkSize (the last one is smaller)
def splitDataFrameIntoSmaller(df, chunkSize = 3):
    listOfDf = list()
    numberChunks = len(df) // chunkSize + 1
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
citiesChunks = splitDataFrameIntoSmaller(cities)
for ind, cc in enumerate(citiesChunks):
    cc["distance"] = 0
    cc["time"] = 0
    for i in xrange(len(cc)):
        al = cc['Origin'][i]
        bl = cc['Destination'][i]
        '...' #truncating to make it readable
    cc.to_csv('out.csv', sep=',', encoding='utf-8')
Traceback (most recent call last):
File ..., line 39, in <module>
al = cc['Origin'][i]
File ..., line 603, in __getitem__
result = self.index.get_value(self, key)
File ..., line 2169, in get_value
tz=getattr(series.dtype, 'tz', None))
File "pandas\index.pyx", line 98, in pandas.index.IndexEngine.get_value (pandas\index.c:3557)
File "pandas\index.pyx", line 106, in pandas.index.IndexEngine.get_value (pandas\index.c:3240)
File "pandas\index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)
File "pandas\src\hashtable_class_helper.pxi", line 404, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8564)
File "pandas\src\hashtable_class_helper.pxi", line 410, in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8508)
KeyError: 0L
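The KeyError occurs because every chunk after the first keeps its original row labels (3, 4, 5, ...), so the label 0 does not exist in it. You can first floor-divide the index values and then use a list comprehension: loop over the unique values, select each group with loc, and finally call reset_index to replace the duplicated index: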
cities.index = cities.index // 3
print (cities)
Origin Destination O_Code D_code
0 Eisenstadt Paris 1 2
0 London Berlin 1 3
0 Berlin stuttgat 1 4
1 Liverpool Southampton 1 5
1 Tirana Blackpool 1 6
1 blackpool tirana 1 7
2 Paris Lyon 1 8
2 Manchester Nice 1 10
2 Orleans Madrid 1 12
3 Lisbon Stockholm 1 12
citiesChunks = [cities.loc[[x]].reset_index(drop=True) for x in cities.index.unique()]
#print (citiesChunks)
print (citiesChunks[0])
Origin Destination O_Code D_code
0 Eisenstadt Paris 1 2
1 London Berlin 1 3
2 Berlin stuttgat 1 4
Finally, use iterrows if you need to loop over the rows of a DataFrame:
#write columns to file first
cols = ['Origin', 'Destination', 'O_Code', 'D_code', 'distance', 'time']
df = pd.DataFrame(columns=cols)
df.to_csv('out.csv', encoding='utf-8', index=False)
for ind, cc in enumerate(citiesChunks):
    cc["distance"] = 0
    cc["time"] = 0
    for i, val in cc.iterrows():
        al = cc.loc[i, 'Origin']
        bl = cc.loc[i, 'Destination']
        '...' #truncating to make it readable
    cc.to_csv('out.csv', encoding='utf-8', mode='a', header=None, index=False)
    print (cc.to_csv(encoding='utf-8'))
,Origin,Destination,O_Code,D_code,distance,time
0,Eisenstadt,Paris,1,2,0,0
1,London,Berlin,1,3,0,0
2,Berlin,stuttgat,1,4,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Liverpool,Southampton,1,5,0,0
1,Tirana,Blackpool,1,6,0,0
2,blackpool,tirana,1,7,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Paris,Lyon,1,8,0,0
1,Manchester,Nice,1,10,0,0
2,Orleans,Madrid,1,12,0,0
,Origin,Destination,O_Code,D_code,distance,time
0,Lisbon,Stockholm,1,12,0,0
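Another way to sketch the same idea is to split by position with iloc and then read rows by position rather than by label. The helper name `split_frame` and the small `cities` frame are hypothetical. Stepping through `range(0, len(df), chunk_size)` also avoids the empty trailing chunk that `len(df) // chunkSize + 1` produces when the length is an exact multiple of the chunk size.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the cities frame.
cities = pd.DataFrame({'Origin': list('ABCDEFG'),
                       'Destination': list('HIJKLMN')})

def split_frame(df, chunk_size=3):
    """Split df into consecutive chunks of at most chunk_size rows, by position."""
    # .copy() makes each chunk independent, so adding columns later is safe.
    return [df.iloc[start:start + chunk_size].copy()
            for start in range(0, len(df), chunk_size)]

chunks = split_frame(cities)
second = chunks[1]
print(second.index.tolist())   # [3, 4, 5]: the labels are inherited from cities

# second['Origin'][0] would raise KeyError because no row is labelled 0.
first_origin = second['Origin'].iloc[0]   # positional access works

# itertuples walks the rows without needing any labels at all.
for row in second.itertuples(index=False):
    pair = (row.Origin, row.Destination)
```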
