Problem when storing dict into pandas DataFrame - python

Recently in my project, I need to store a dictionary into a Pandas DataFrame with the code
self.device_info.loc[i,'interference']=[temp_dict]
The device_info is a Pandas DataFrame. The temp_dict is a dictionary and I want it to be stored as an element in the DataFrame for future use. The square bracket is added to ensure there's no error when assigning.
I just found out today that with pandas version 0.22.0, this code packs the dictionary into a list and stores that list in the DataFrame. However, in version 0.24.2, the same code stores the dictionary itself in the DataFrame.
For example, say when i=0, after executing the code above
with pandas.__version__ == '0.22.0',
type(self.device_info.loc[0,'interference'])
returns list, while with pandas.__version__ == '0.24.2' the same call returns dict. From my perspective, I need consistent behavior: a dictionary should always be stored.
I am currently working on two PCs, one at home and one at my office, and I cannot update the older version of pandas on the office PC. I would appreciate it if anyone could help me figure out why this happens.

Pandas has a from_dict method, with many options, which takes a dict as input and returns a DataFrame.
You can choose to let the dtypes be inferred or force them (to str, for example).
Manipulating and appending DataFrames is then much easier, as you won't run into dict-object problems in that row or column again.
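A minimal sketch (the keys and values here are made-up stand-ins for the temp_dict described in the question):
import pandas as pd

# hypothetical per-device dictionaries, one per row index
records = {
    0: {'channel': 1, 'rssi': -70},
    1: {'channel': 6, 'rssi': -55},
}

# orient='index' makes each outer key a row label; dtype can be forced (e.g. dtype=str)
device_info = pd.DataFrame.from_dict(records, orient='index')
print(device_info)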

Related

unable to sort excel values using pandas [duplicate]

New to Pandas, so maybe I'm missing a big idea?
I have a Pandas DataFrame of register transactions with shape like (500,4):
Time datetime64[ns]
Net Total float64
Tax float64
Total Due float64
I'm working through my code in a Python3 Jupyter notebook. I can't get past sorting any column. Working through the different code examples for sort, I'm not seeing the output reorder when I inspect the df. So, I've reduced the problem to trying to order just one column:
df.sort_values(by='Time')
# OR
df.sort_values(['Total Due'])
# OR
df.sort_values(['Time'], ascending=True)
No matter which column title, or which boolean argument I use, the displayed results never change order.
Thinking it could be a Jupyter thing, I've previewed the results using print(df), df.head(), and HTML(df.to_html()) (the last example is for Jupyter notebooks). I've also rerun the whole notebook from importing the CSV to this code. And I'm also new to Python 3 (from 2.7), so I get stuck with that sometimes, but I don't see how that's relevant in this case.
Another post has a similar problem, Python pandas dataframe sort_values does not work. In that instance, the ordering was on a column type string. But as you can see all of the columns here are unambiguously sortable.
Why does my Pandas DataFrame not display new order using sort_values?
df.sort_values(['Total Due']) returns a sorted DataFrame, but it doesn't update the DataFrame in place.
So do it explicitly:
df = df.sort_values(['Total Due'])
or
df.sort_values(['Total Due'], inplace=True)
My problem, FYI, was that I wasn't returning the resulting dataframe, so PyCharm wasn't bothering to update said dataframe. Putting the dataframe's name after the return keyword fixed the issue.
Edit:
I had a bare return at the end of my method instead of
return df
which the debugger must have noticed, because df wasn't being updated in spite of my explicit, in-place sort.
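A minimal sketch of that pattern (the function name and data here are made up):
import pandas as pd

def sorted_transactions(df):
    # sort_values returns a new DataFrame; a bare `return` would discard it
    return df.sort_values(['Total Due'])

df = pd.DataFrame({'Total Due': [30.0, 10.0, 20.0]})
df = sorted_transactions(df)  # reassign so the sorted result is kept
print(df)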

Python convert list of large json objects to DF

I want to convert a list of objects to a pandas dataframe. The objects are large and complex; a sample one can be found here.
The output should be a DF with 3 columns: info, releases, URLs - as per the json object linked above. I've tried pd.DataFrame and from_records, but I keep getting hit with errors. Can anyone suggest a fix?
Have you tried the pd.read_json() function?
Here's a link to the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.io.json.read_json.html
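If the objects are already parsed into Python dicts, another minimal sketch is to keep just the three top-level keys mentioned in the question (the stand-in values below are invented; the real objects are far larger):
import pandas as pd

# invented stand-ins for the large JSON objects described in the question
objects = [
    {'info': {'id': 1}, 'releases': [{'tag': 'v1'}], 'URLs': ['http://example.com/a']},
    {'info': {'id': 2}, 'releases': [{'tag': 'v2'}], 'URLs': ['http://example.com/b']},
]

# keep each top-level key as one column, leaving the nested values intact
df = pd.DataFrame([{k: obj.get(k) for k in ('info', 'releases', 'URLs')} for obj in objects])
print(df)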

What is the difference between calling "iris.species" and "iris['species']"?

I have just recently started on Python data science and noticed that I can call the columns of a dataset in two ways. I was wondering if there is an advantage to using one method over the other, or can they be used interchangeably?
import seaborn
iris = seaborn.load_dataset('iris')
print(iris.species)
print(iris['species'])
Both print statements give the same output in Jupyter
There is no difference. iris is a pandas DataFrame, and these are two different ways to access a column in a DataFrame.
Try this:
iris['species'] is iris.species
# True
You can use either method, but I find the indexing approach (iris['species']) is more versatile, e.g. you can use it to access columns whose names contain spaces, you can use it to create new columns, and you won't ever accidentally retrieve a dataframe method or attribute (e.g. iris.shape) instead of a column.
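For instance (the extra column here is invented for illustration):
import seaborn
iris = seaborn.load_dataset('iris')

# creating a new column requires bracket indexing; iris.petal_area = ... would
# only attach an attribute to the DataFrame object, not add a column
iris['petal area'] = iris['petal_length'] * iris['petal_width']

# a column name containing a space can only be read back with brackets
print(iris['petal area'].head())

# dot access can collide with DataFrame attributes
print(iris.shape)  # the (rows, columns) tuple, not a column named 'shape'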
Also see answers to these questions:
In pandas, what's the difference between df['column'] and df.column?
For Pandas DataFrame, what's the difference between using squared brackets or dot to access a column?
Both methods of accessing a DataFrame column are equivalent.
The main advantage of accessing a column by its label (e.g. iris['species']) is that the label can contain spaces.
For example, you can access a 'plant color' column with iris['plant color']. However, you cannot access it via iris.plant color.

Reading Parquet File with Array<Map<String,String>> Column

I'm using Dask to read a Parquet file that was generated by PySpark, and one of the columns is a list of dictionaries (i.e. array<map<string,string>>). An example of the df would be:
import pandas as pd
df = pd.DataFrame.from_records([
    (1, [{'job_id': 1, 'started': '2019-07-04'}, {'job_id': 2, 'started': '2019-05-04'}], 100),
    (5, [{'job_id': 3, 'started': '2015-06-04'}, {'job_id': 9, 'started': '2019-02-02'}], 540)],
    columns=['uid', 'job_history', 'latency']
)
When using engine='fastparquet', Dask reads all other columns fine but returns a column of Nones for the column with the complex type. When I set engine='pyarrow', I get the following exception:
ArrowNotImplementedError: lists with structs are not supported.
A lot of googling has made it clear that reading a column with a Nested Array just isn't really supported right now, and I'm not totally sure what the best way to handle this is. I figure my options are:
Somehow tell dask/fastparquet to parse the column using the standard json library. The schema is simple and that would do the job if possible.
See if I can possibly re-run the Spark job that generated the output and save it as something else, though this is almost not an acceptable solution since my company uses parquet everywhere.
Turn the keys of the map into columns and break the data up across several columns with dtype list and note that the data across these columns are related/map to each other by index (e.g. the elements in idx 0 across these keys/columns all came from the same source). This would work, but frankly, breaks my heart :(
I'd love to hear how others have navigated around this limitation. My company uses nested arrays in their parquet files frequently, and I'd hate to have to let go of using Dask because of this.
It would be fairer to say that pandas does not support non-simple types very well (currently). It may be the case that pyarrow will, without conversion to pandas, and that at some future point pandas will use these arrow structures directly.
Indeed, the most direct method I can think of for you to use is to rewrite the columns as B/JSON-encoded text, and then load with fastparquet, specifying to load using B/JSON. You should get lists of dicts in the column, but the performance will be slow.
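A minimal sketch of that round trip using plain JSON text (column names follow the example above; this uses json.dumps/json.loads rather than fastparquet's own object encoding, and assumes fastparquet is installed):
import json
import pandas as pd

# df as constructed in the question above
df = pd.DataFrame.from_records([
    (1, [{'job_id': 1, 'started': '2019-07-04'}], 100),
    (5, [{'job_id': 3, 'started': '2015-06-04'}], 540)],
    columns=['uid', 'job_history', 'latency'])

# encode the nested column as JSON strings so the file only holds simple types
df['job_history'] = df['job_history'].apply(json.dumps)
df.to_parquet('jobs.parquet', engine='fastparquet')

# after reading back, decode the text column into lists of dicts again
out = pd.read_parquet('jobs.parquet', engine='fastparquet')
out['job_history'] = out['job_history'].apply(json.loads)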
Note that the old project oamap and its successor awkward provide a way to iterate and aggregate over nested list/map/struct trees using Python syntax, but compiled with Numba, so that you never need to instantiate the intermediate Python objects. They were not designed for parquet, but had parquet compatibility, so might just be useful to you.
I'm dealing with pyarrow.lib.ArrowNotImplementedError: Reading lists of structs from Parquet files not yet supported when I try to read using Pandas; however, when I read using pyspark and then convert to pandas, the data at least loads:
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.read.load(path)
pdf = df.toPandas()
and the offending field is now rendered as a pyspark Row object, which has some structured parsing, but you would probably have to write custom pandas functions to extract data from it:
>>> pdf["user"][0]["sessions"][0]["views"]
[Row(is_search=True, price=None, search_string='ABC', segment='listing', time=1571250719.393951), Row(is_search=True, price=None, search_string='ZYX', segment='homepage', time=1571250791.588197), Row(is_search=True, price=None, search_string='XYZ', segment='listing', time=1571250824.106184)]
An individual record can be rendered as a dictionary; simply call .asDict(recursive=True) on the Row object you want.
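For example, using the field names from the snippet above:
views = [row.asDict(recursive=True) for row in pdf["user"][0]["sessions"][0]["views"]]
# views is now a plain list of dicts, e.g. views[0]['search_string'] == 'ABC'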
Unfortunately, it takes ~5 seconds to start the SparkSession context, and every Spark action also takes much longer than pandas operations (for small to medium datasets), so I would greatly prefer a more Python-native option.

Set_index() on a Pandas DataFrame giving unexpected results

I asked this question before but the question was downgraded for being unclear. So I deleted it.
I hope that this re-worked version will be much clearer!
The buggy code is part of a much larger project so it's not so easy to create a minimum example, especially as I am still fairly new to Python and almost completely new to Pandas, but if required I will try.
All_holdings is part of the portfolio object. Looking at it in the variables window it appears to be a list of dictionaries (is this correct)?
As you can see from the code, it is then converted into a pandas DataFrame called curve using:
curve = pd.DataFrame(self.all_holdings)
At this point the curve data frame includes the columns 'datetime' and 'total' both containing the correct values from the original list of dicts in self.all_holdings.
However after performing:
curve.set_index('datetime', inplace=True)
The 'datetime' column has disappeared and the column 'total' now has the 'datetime' values.
The original values of column 'total' have also disappeared?
I would have expected the 'datetime' column to become the index (but not for its values to disappear) and everything else to stay the same.
Is this an issue of Python versions? I am using 3.6 versus his 2.7; I'm also using pandas 0.22.0, whereas the example uses an unspecified earlier version.
I don't see any issue there. You did set an index on datetime, so datetime is no longer an ordinary column; the total values are now indexed on datetime.
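A minimal sketch with made-up holdings data shows what to expect:
import pandas as pd

# made-up stand-in for self.all_holdings
all_holdings = [
    {'datetime': '2019-01-01', 'total': 100.0},
    {'datetime': '2019-01-02', 'total': 101.5},
]

curve = pd.DataFrame(all_holdings)
curve.set_index('datetime', inplace=True)

# 'datetime' is now the index, shown to the left of the data columns,
# and 'total' still holds its original values
print(curve)

# to get 'datetime' back as an ordinary column:
curve = curve.reset_index()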
