I have a pandas column containing nested JSON data as strings. I'd like to flatten the data into multiple pandas columns.
Here's data from a single cell:
rent['ques'][9] = "{'Rent': [{'Name': 'Asking', 'Value': 16.07, 'Unit': 'Usd'}], 'Vacancy': {'Name': 'Vacancy', 'Value': 25.34100001, 'Unit': 'Pct'}}"
For each cell in the pandas column, I'd like to parse this string and create multiple columns. The expected output looks something like this:
When I run json_normalize(rent['ques']), I receive the following error.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-28-cebc86357f34> in <module>()
----> 1 json_normalize(rentoff['Survey'])
/anaconda3/lib/python3.7/site-packages/pandas/io/json/normalize.py in json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep)
196 if record_path is None:
197 if any([[isinstance(x, dict)
--> 198 for x in compat.itervalues(y)] for y in data]):
199 # naive normalization, this is idempotent for flat records
200 # and potentially will inflate the data considerably for
/anaconda3/lib/python3.7/site-packages/pandas/io/json/normalize.py in <listcomp>(.0)
196 if record_path is None:
197 if any([[isinstance(x, dict)
--> 198 for x in compat.itervalues(y)] for y in data]):
199 # naive normalization, this is idempotent for flat records
200 # and potentially will inflate the data considerably for
/anaconda3/lib/python3.7/site-packages/pandas/compat/__init__.py in itervalues(obj, **kw)
210
211 def itervalues(obj, **kw):
--> 212 return iter(obj.values(**kw))
213
214 next = next
AttributeError: 'str' object has no attribute 'values'
Try this:
import json

import pandas as pd

df['quest'] = df['quest'].str.replace("'", '"')
dfs = []
for i in df['quest']:
    data = json.loads(i)
    dfx = pd.json_normalize(data, record_path=['Rent'],
                            meta=[['Vacancy', 'Name'], ['Vacancy', 'Unit'], ['Vacancy', 'Value']])
    dfs.append(dfx)
df = pd.concat(dfs).reset_index(drop=True)
print(df)
Name Value Unit Vacancy.Name Vacancy.Unit Vacancy.Value
0 Asking 16.07 Usd Vacancy Pct 25.341
1 Asking 16.07 Usd Vacancy Pct 25.341
2 Asking 16.07 Usd Vacancy Pct 25.341
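An alternative that avoids the quote replacement altogether is to parse each cell with ast.literal_eval, since the strings are Python dict literals rather than strict JSON. Below is a minimal sketch, assuming a one-row DataFrame stands in for the real rent['ques'] column:
import ast

import pandas as pd

# Hypothetical one-row stand-in for the question's rent['ques'] column.
rent = pd.DataFrame({'ques': [
    "{'Rent': [{'Name': 'Asking', 'Value': 16.07, 'Unit': 'Usd'}], "
    "'Vacancy': {'Name': 'Vacancy', 'Value': 25.34100001, 'Unit': 'Pct'}}",
]})

# ast.literal_eval parses the single-quoted dicts safely, so no quote
# replacement is needed (str.replace breaks if a value contains an apostrophe).
parsed = rent['ques'].apply(ast.literal_eval)

flat = pd.concat(
    [pd.json_normalize(d, record_path=['Rent'],
                       meta=[['Vacancy', 'Name'], ['Vacancy', 'Unit'], ['Vacancy', 'Value']])
     for d in parsed],
    ignore_index=True,
)
print(flat)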
Could you please help me solve this issue in my code? The spatial join followed by a pandas groupby()/agg() gives me the error below.
I have a data frame df and I use several of its columns to group by:
In the way below I almost get the table (data frame) that I need. What is missing is an additional column that contains the number of rows in each group. In other words, I have the mean, but I would also like to know how many numbers were used to get those means.
In short: how do I get group-wise statistics for a DataFrame?
Code:
def bin_the_midpoints(bins, midpoints):
    b = bins.copy()
    m = midpoints.copy()
    reindexed = b.reset_index().rename(columns={'index': 'bins_index'})
    joined = gpd.tools.sjoin(reindexed, m)
    bin_stats = joined.groupby('bins_index')['offset']\
        .agg({'fold': len, 'min_offset': np.min})
    return gpd.GeoDataFrame(b.join(bin_stats))
bin_stats = bin_the_midpoints(bins, midpoints)
Error:
---------------------------------------------------------------------------
SpecificationError Traceback (most recent call last)
Input In [103], in <cell line: 9>()
6 bin_stats = joined.groupby('bins_index')['offset']\
7 .agg({'fold': len, 'min_offset': np.min})
8 return gpd.GeoDataFrame(b.join(bin_stats))
----> 9 bin_stats = bin_the_midpoints(bins, midpoints)
Input In [103], in bin_the_midpoints(bins, midpoints)
4 reindexed = b.reset_index().rename(columns={'index':'bins_index'})
5 joined = gpd.tools.sjoin(reindexed, m)
----> 6 bin_stats = joined.groupby('bins_index')['offset']\
7 .agg({'fold': len, 'min_offset': np.min})
8 return gpd.GeoDataFrame(b.join(bin_stats))
File ~\anaconda3\envs\GeoSynapps\lib\site-packages\pandas\core\groupby\generic.py:271, in SeriesGroupBy.aggregate(self, func, engine, engine_kwargs, *args, **kwargs)
267 elif isinstance(func, abc.Iterable):
268 # Catch instances of lists / tuples
269 # but not the class list / tuple itself.
270 func = maybe_mangle_lambdas(func)
--> 271 ret = self._aggregate_multiple_funcs(func)
272 if relabeling:
273 # error: Incompatible types in assignment (expression has type
274 # "Optional[List[str]]", variable has type "Index")
275 ret.columns = columns # type: ignore[assignment]
File ~\anaconda3\envs\GeoSynapps\lib\site-packages\pandas\core\groupby\generic.py:307, in SeriesGroupBy._aggregate_multiple_funcs(self, arg)
301 def _aggregate_multiple_funcs(self, arg) -> DataFrame:
302 if isinstance(arg, dict):
303
304 # show the deprecation, but only if we
305 # have not shown a higher level one
306 # GH 15931
--> 307 raise SpecificationError("nested renamer is not supported")
309 elif any(isinstance(x, (tuple, list)) for x in arg):
310 arg = [(x, x) if not isinstance(x, (tuple, list)) else x for x in arg]
SpecificationError: nested renamer is not supported
You should read more about the agg method in pandas; you can easily add many calculations with it.
For example, you can write:
df.groupby(by=[...]).agg({'col1': ['count', 'sum', 'min']})
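For the error in the question specifically, the dict-of-renamers form .agg({'fold': len, 'min_offset': np.min}) was removed in pandas 1.0, and named aggregation is its replacement. Here is a minimal sketch, assuming a plain DataFrame stands in for the question's spatially joined GeoDataFrame:
import pandas as pd

# Hypothetical stand-in for the question's joined GeoDataFrame.
joined = pd.DataFrame({
    'bins_index': [0, 0, 1, 1, 1],
    'offset': [10.0, 12.5, 3.0, 4.5, 2.0],
})

# Named aggregation: each output column is given as name=aggregation.
# 'count' counts the non-null offsets per group, 'min' takes the smallest.
bin_stats = joined.groupby('bins_index')['offset'].agg(
    fold='count',
    min_offset='min',
)
print(bin_stats)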
I realize this has already been addressed elsewhere (e.g., Reading csv zipped files in python, How can I parse a YAML file in Python, Retrieving data from a yaml file based on a Python list). Nevertheless, I hope this question is different.
I know how to load a single YAML file into a pandas DataFrame:
import yaml
import pandas as pd
with open(r'1000851.yaml') as file:
    df = pd.io.json.json_normalize(yaml.load(file))
df.head()
I would like to read several yaml files from a directory into pandas dataframe and concatenate them into one big DataFrame. I have not been able to figure it out though...
import pandas as pd
import glob
path = r'../input/cricsheet-a-retrosheet-for-cricket/all' # use your path
all_files = glob.glob(path + "/*.yaml")
li = []
for filename in all_files:
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
frame = pd.concat(li, axis=0, ignore_index=True)
Error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<timed exec> in <module>
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
/opt/conda/lib/python3.7/site-packages/pandas/io/json/_normalize.py in <genexpr>(.0)
268
269 if record_path is None:
--> 270 if any([isinstance(x, dict) for x in y.values()] for y in data):
271 # naive normalization, this is idempotent for flat records
272 # and potentially will inflate the data considerably for
AttributeError: 'str' object has no attribute 'values'
Sample Dataset Zipped
Sample Dataset
Is there a way to do this and read files efficiently?
It seems the first part of your code and the second part you added are different.
The first part correctly reads YAML files, but the second part is broken:
for filename in all_files:
    # `filename` here is just a string containing the name of the file.
    df = pd.json_normalize(yaml.load(filename, Loader=yaml.FullLoader))
    li.append(df)
The problem is that you need to read the files. Currently you're passing just the filename, not the file content. Do this instead:
li = []
# Only loading 3 files:
for filename in all_files[:3]:
    with open(filename, 'r') as fh:
        df = pd.json_normalize(yaml.safe_load(fh.read()))
    li.append(df)
len(li)
3
pd.concat(li)
output:
innings meta.data_version meta.created meta.revision info.city info.competition ... info.player_of_match info.teams info.toss.decision info.toss.winner info.umpires info.venue
0 [{'1st innings': {'team': 'Glamorgan', 'delive... 0.9 2020-09-01 1 Bristol Vitality Blast ... [AG Salter] [Glamorgan, Gloucestershire] field Gloucestershire [JH Evans, ID Blackwell] County Ground
0 [{'1st innings': {'team': 'Pune Warriors', 'de... 0.9 2013-05-19 1 Pune IPL ... [LJ Wright] [Pune Warriors, Delhi Daredevils] bat Pune Warriors [NJ Llong, SJA Taufel] Subrata Roy Sahara Stadium
0 [{'1st innings': {'team': 'Botswana', 'deliver... 0.9 2020-08-29 1 Gaborone NaN ... [A Rangaswamy] [Botswana, St Helena] bat Botswana [R D'Mello, C Thorburn] Botswana Cricket Association Oval 1
[3 rows x 18 columns]
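To extend the same fix to every file in the directory rather than just the first three, the loop simply drops the slice. A short sketch, assuming the same path, glob pattern, and yaml import as in the question:
import glob

import pandas as pd
import yaml

path = r'../input/cricsheet-a-retrosheet-for-cricket/all'  # use your path
all_files = glob.glob(path + "/*.yaml")

# Open each file and parse its contents; safe_load is enough for plain data files.
frames = []
for filename in all_files:
    with open(filename, 'r') as fh:
        frames.append(pd.json_normalize(yaml.safe_load(fh)))

frame = pd.concat(frames, axis=0, ignore_index=True)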
I am trying to apply TukeyHSD from statsmodels but receive the following error message.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-68-503f7cac5cbb> in <module>()
1 from statsmodels.stats.multicomp import pairwise_tukeyhsd
2
----> 3 pairwise_tukeyhsd(endog=data['value'], groups=data['sample_id'])
c:\users\hambarsous\appdata\local\programs\python\python36\lib\site-packages\statsmodels\stats\multicomp.py in pairwise_tukeyhsd(endog, groups, alpha)
36 '''
37
---> 38 return MultiComparison(endog, groups).tukeyhsd(alpha=alpha)
c:\users\hambarsous\appdata\local\programs\python\python36\lib\site-packages\statsmodels\sandbox\stats\multicomp.py in __init__(self, data, groups, group_order)
794 if group_order is None:
795 self.groupsunique, self.groupintlab = np.unique(groups,
--> 796 return_inverse=True)
797 else:
798 #check if group_order has any names not in groups
c:\users\hambarsous\appdata\local\programs\python\python36\lib\site-packages\numpy\lib\arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
221 ar = np.asanyarray(ar)
222 if axis is None:
--> 223 return _unique1d(ar, return_index, return_inverse, return_counts)
224 if not (-ar.ndim <= axis < ar.ndim):
225 raise ValueError('Invalid axis kwarg specified for unique')
c:\users\hambarsous\appdata\local\programs\python\python36\lib\site-packages\numpy\lib\arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
278
279 if optional_indices:
--> 280 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
281 aux = ar[perm]
282 else:
TypeError: '<' not supported between instances of 'str' and 'int'
I've tried things along the lines of the following to see if I could get around the type error but no luck.
import pandas as pd
pd.to_numeric(data['value'], errors='coerce')
Below is the code that produces my error. My only assumption at this point is that when I read the data in and melt it, something happens to the data types that I don't fully understand. When I check the data types for value and sample_id, I get float64 and object.
import pandas as pd
data = pd.read_excel('CHOK1CGA2016WB1.xlsx', sheet_name='Cleaned Data')
data = pd.melt(data, id_vars=['sample_code', 'sample_id', 'day', 'test'], value_vars=['rep 1', 'rep 2', 'rep 3', 'rep 4', 'rep 5', 'rep 6'])
data.rename(columns={'variable': 'rep'}, inplace=True)
from statsmodels.stats.multicomp import pairwise_tukeyhsd
pairwise_tukeyhsd(endog=data['value'], groups=data['sample_id'])
The comment on my question put me on the path of further inspecting the elements of the data['sample_id'] column. It seems this is where the problem lay: since the column was a mix of numbers and letter/number codes, I believe the number-only entries were causing the error. I simply added some letters to the entries that were numbers only, and the code ran.
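A less manual fix than editing the data is to coerce the group labels to a single type before calling pairwise_tukeyhsd, so that np.unique never has to compare a str with an int. A small sketch, using a synthetic stand-in for the melted frame:
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for the melted data: group labels mix ints and strings.
data = pd.DataFrame({
    'sample_id': [1, 1, 1, 'A1', 'A1', 'A1', 'B2', 'B2', 'B2'],
    'value': [1.0, 1.2, 0.9, 2.1, 2.3, 2.0, 3.1, 2.9, 3.2],
})

# Casting the labels to one type avoids the str/int comparison inside np.unique.
data['sample_id'] = data['sample_id'].astype(str)

print(pairwise_tukeyhsd(endog=data['value'], groups=data['sample_id']))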
Hey guys,
I had a problem during the creation of my dataset with Python.
I'm doing this:
userTab = pd.read_csv('C:\\Users\\anto-\\Desktop\\Ex. Resource\\mapping_user_id.tsv', delimiter="\t", names=["User", "Sequence"])
wordTab = pd.read_csv('C:\\Users\\anto-\\Desktop\\Ex. Resource\\mapping_word_id.tsv', delimiter="\t", names=["Word", "Sequence"])
df = pd.DataFrame(data=data, index=userTab.User, columns=wordTab.Word)
I'm trying to create a dataset from two elements: userTab.User provides the rows and wordTab.Word the columns.
Maybe the shape is too big to compute this way.
I printed the shapes of my elements, because at first I thought I had got the dimensions wrong:
((603668,), (37419,), (603668, 37419))
After that I printed the types: my user and word are pandas Series, and data is a scipy.sparse.csc.csc_matrix.
Maybe I need to process it in chunks given this shape, but I looked at the pandas.DataFrame reference and there is no such option.
I have 8 GB of RAM on 64-bit Python. The sparse matrix is in an npz file (about 300 MB).
The error is a MemoryError:
MemoryError Traceback (most recent call last)
<ipython-input-26-ad363966ef6a> in <module>()
10 type(sparse_matrix)
11
---> 12 df = pd.DataFrame(data=sparse_matrix, index=np.array(userTab.User), columns=np.array(wordTab.Word))
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
416 if arr.ndim == 0 and index is not None and columns is not None:
417 values = cast_scalar_to_array((len(index), len(columns)),
--> 418 data, dtype=dtype)
419 mgr = self._init_ndarray(values, index, columns,
420 dtype=values.dtype, copy=False)
~\Anaconda3\lib\site-packages\pandas\core\dtypes\cast.py in cast_scalar_to_array(shape, value, dtype)
1164 fill_value = value
1165
-> 1166 values = np.empty(shape, dtype=dtype)
1167 values.fill(fill_value)
1168
MemoryError:
Maybe the problem is this: I have a sort of ID, and when I try to access the User column, that ID remains in userTab.User.
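One way to avoid allocating the dense 603668 x 37419 float array is to keep the matrix sparse on the pandas side instead of handing it to the plain DataFrame constructor. A sketch, assuming pandas >= 0.25 (which provides DataFrame.sparse.from_spmatrix) and a small random matrix standing in for the real one:
import pandas as pd
from scipy import sparse

# Hypothetical small stand-in for the real 603668 x 37419 matrix.
sparse_matrix = sparse.random(1000, 500, density=0.001, format='csc')
user_index = ['user_%d' % i for i in range(sparse_matrix.shape[0])]
word_cols = ['word_%d' % j for j in range(sparse_matrix.shape[1])]

# from_spmatrix keeps the data in SparseDtype columns instead of materialising
# a dense float array, so memory stays proportional to the non-zero entries.
df = pd.DataFrame.sparse.from_spmatrix(sparse_matrix, index=user_index, columns=word_cols)
print(df.dtypes.iloc[0])   # a SparseDtype column
print(df.sparse.density)   # fraction of stored values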
Rather confused by running into this type error while trying out the Apache Parquet file format for the first time. Shouldn't Parquet support all the data types that Pandas does? What am I missing?
import pandas
import pyarrow
import numpy
data = pandas.read_csv("data/BigData.csv", sep="|", encoding="iso-8859-1")
data_parquet = pyarrow.Table.from_pandas(data)
raises:
---------------------------------------------------------------------------
ArrowInvalid Traceback (most recent call last)
<ipython-input-9-90533507bcf2> in <module>()
----> 1 data_parquet = pyarrow.Table.from_pandas(data)
table.pxi in pyarrow.lib.Table.from_pandas()
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads)
354 arrays = list(executor.map(convert_column,
355 columns_to_convert,
--> 356 convert_types))
357
358 types = [x.type for x in arrays]
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
423 raise CancelledError()
424 elif self._state == FINISHED:
--> 425 return self.__get_result()
426
427 self._condition.wait(timeout)
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
~\AppData\Local\Continuum\anaconda3\lib\concurrent\futures\thread.py in run(self)
54
55 try:
---> 56 result = self.fn(*self.args, **self.kwargs)
57 except BaseException as exc:
58 self.future.set_exception(exc)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pyarrow\pandas_compat.py in convert_column(col, ty)
343
344 def convert_column(col, ty):
--> 345 return pa.array(col, from_pandas=True, type=ty)
346
347 if nthreads == 1:
array.pxi in pyarrow.lib.array()
array.pxi in pyarrow.lib._ndarray_to_array()
error.pxi in pyarrow.lib.check_status()
ArrowInvalid: Error converting from Python objects to Int64: Got Python object of type str but can only handle these types: integer
data.dtypes is:
0 object
1 object
2 object
3 object
4 object
5 float64
6 float64
7 object
8 object
9 object
10 object
11 object
12 object
13 float64
14 object
15 float64
16 object
17 float64
...
In Apache Arrow, table columns must be homogeneous in their data types. pandas supports Python object columns where values can be different types. So you will need to do some data scrubbing before writing to Parquet format.
We've handled some rudimentary cases (like both bytes and unicode in a column) in the Arrow-Python bindings but we don't hazard any guesses about how to handle bad data. I opened the JIRA https://issues.apache.org/jira/browse/ARROW-2098 about adding an option to coerce unexpected values to null in situations like this, which might help in the future.
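To make this concrete, a single object column mixing ints and strings is enough to reproduce the error, and coercing it to one type is enough to fix it. A small sketch, assuming pyarrow is installed:
import pandas as pd
import pyarrow as pa

# An object column holding both ints and a str: Arrow infers int64 from the
# leading values and then fails on the string.
df = pd.DataFrame({'mixed': [1, 2, 'three'], 'price': [1.0, 2.0, 3.0]})

try:
    pa.Table.from_pandas(df)
except (pa.ArrowInvalid, pa.ArrowTypeError) as exc:
    print("conversion failed:", exc)

# One possible scrub: coerce to numeric, turning unparseable values into NaN.
df['mixed'] = pd.to_numeric(df['mixed'], errors='coerce')
print(pa.Table.from_pandas(df).schema)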
I had this same issue and it took me a while to figure out a way to find the offending column. Here is what I came up with to find the mixed type column, although I know there must be a more efficient way.
The last column printed before the exception happens is the mixed type column.
# method1: try saving the parquet file while keeping only one object column
# at a time, to isolate the mixed type column.
cat_cols = df.select_dtypes('object').columns
for col in cat_cols:
    drop = set(cat_cols) - set([col])
    print(col)
    df.drop(drop, axis=1).reset_index(drop=True).to_parquet("c:/temp/df.pq")
Another attempt - list the columns and each type based on the unique values.
# method2: list all columns and the types within
def col_types(col):
    types = set([type(x) for x in col.unique()])
    return types

df.select_dtypes("object").apply(col_types, axis=0)
I faced a similar situation. If possible, you can first convert all columns to the required data type and then try to convert to Parquet. For example:
import pandas as pd

column_list = df.columns
for col in column_list:
    df[col] = df[col].astype(str)

df.to_parquet('df.parquet.gzip', compression='gzip')
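Note that casting every column to str also turns the float64 columns into strings in the Parquet file. If method1/method2 above has already identified the offending column, a narrower fix is to cast only that column; a small sketch with a hypothetical column name:
import pandas as pd

# 'mixed_col' is a hypothetical name for the column found by method1/method2.
df = pd.DataFrame({'mixed_col': [1, 'two', 3], 'price': [1.5, 2.5, 3.5]})

# Cast only the offending column; the numeric columns keep their dtypes.
df['mixed_col'] = df['mixed_col'].astype(str)

df.to_parquet('df.parquet.gzip', compression='gzip')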