Pandas dataframe from dict, why?

I can create a pandas dataframe from a dict as follows:
import pandas as pd
d = {'Key': ['abc', 'def', 'xyz'], 'Value': [1, 2, 3]}
df = pd.DataFrame(d)
df.set_index('Key', inplace=True)
And also by first creating a series like this:
d = {'abc': 1, 'def': 2, 'xyz': 3}
a = pd.Series(d, name='Value')
df = pd.DataFrame(a)
But not directly like this (it raises ValueError: If using all scalar values, you must pass an index):
d = {'abc': 1, 'def': 2, 'xyz': 3}
df = pd.DataFrame(d)
I'm aware of the from_dict method, and this also gives the desired result:
d = {'abc': 1, 'def': 2, 'xyz': 3}
pd.DataFrame.from_dict(d, orient='index')
but I don't see why:
(1) a separate method is needed to create a dataframe from a dict, when creating one from a series or a list works without issue;
(2) creating a dataframe from a dict of lists works, but creating one from a dict of scalars directly does not.
I have found several SE answers that offer solutions, but I'm looking for the 'why', as this behavior seems inconsistent. Can anyone shed some light on what I may be missing here?

There's actually a lot happening here, so let's break it down.
The Problem
There are so many different ways to create a DataFrame (from a list of records, a dict, a csv, an ndarray, etc.) that even for Python veterans it can take a long time to understand them all. And within each of those ways, there are even more variations you can produce by tweaking parameters.
For example, for dictionaries (where the values are equal length lists), here are two ways pandas can handle them:
Case 1:
You treat each key as a column title and its values as that column's entries, one per row. In this case the rows don't have names, so by default you might just label them by their row index.
Case 2:
You treat each key as a row's name and its values as that row's entries, one per column. In this case the columns don't have names, so by default you might just label them by their column index.
The Solution
Python is a dynamically typed language (variables don't declare a type and functions don't declare a return type). As a result, it doesn't have function overloading. So you basically have two philosophies when you want a class that can be constructed in multiple ways:
Create only one constructor that checks the input and handles each case, covering all possible options. This can get very bloated and complicated when certain inputs have their own options/parameters and when there is simply too much variety.
Separate each option into its own @classmethod constructor that handles one specific way of building the object.
The second is generally better, as it enforces separation of concerns as a design principle; however, the user then needs to know all the different @classmethod constructors. Although, in my opinion, if your class is complicated enough to have many different construction options, the user should be aware of that anyway.
The Pandas Way
Pandas adopts a mix of the two approaches. It uses a default behaviour for each input type, and if you want any extra functionality you need to use the respective @classmethod constructor.
For example, for dicts: by default, if you pass a dict into the DataFrame constructor, it is handled as Case 1. If you want Case 2, you need DataFrame.from_dict with orient='index' (without orient='index', it would use the default behaviour described in Case 1).
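To make the two cases concrete, here is a minimal sketch of both constructions (using the scalar-valued dict from the question; the column name 'Value' is just an illustration):

import pandas as pd

d = {'abc': 1, 'def': 2, 'xyz': 3}

# Case 2: each key becomes a row label
df = pd.DataFrame.from_dict(d, orient='index', columns=['Value'])
#      Value
# abc      1
# def      2
# xyz      3

# Case 1 (the default) expects each value to be column data, so a dict
# of scalars needs an explicit index to become a single row:
df_row = pd.DataFrame(d, index=[0])
#    abc  def  xyz
# 0    1    2    3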
Personally, I'm not a fan of this kind of design; I find it more confusing than helpful. Honestly, a lot of pandas is designed like that, and there's a reason pandas is the topic of every other python-tagged question on Stack Overflow.

Related

Splitting a DataFrame to filtered "sub - datasets"

So I have a DataFrame with several columns, some containing objects (strings) and some numerical.
I'd like to create new dataframes, each "filtered" down to one combination of the object values available.
To be clear, those are my object type columns:
Index(['OS', 'Device', 'Design', 'Language'], dtype='object')
["Design"] and ["Language"] have 3 options each.
I filtered ["OS"] and ["Device"] manually as I needed to match them.
However, now I want to create multiple variables, each containing a "filtered" dataframe.
For example, I have "android_fltr1_d1" to represent the following filter:
["OS"]=android, ["Device"]=1, ["Design"]=1
and "android_fltr3_d2" to represent:
["OS"]=android, ["Device"]=3, ["Design"]=2
I tried the following code (which works perfectly fine):
android_fltr1_d1 = android_fltr1[android_fltr1["Design"]==1].drop(["Design"],axis=1)
android_fltr1_d2 = android_fltr1[android_fltr1["Design"]==2].drop(["Design"],axis=1)
android_fltr1_d3 = android_fltr1[android_fltr1["Design"]==3].drop(["Design"],axis=1)
android_fltr3_d1 = android_fltr3[android_fltr3["Design"]==1].drop(["Design"],axis=1)
android_fltr3_d2 = android_fltr3[android_fltr3["Design"]==2].drop(["Design"],axis=1)
android_fltr3_d3 = android_fltr3[android_fltr3["Design"]==3].drop(["Design"],axis=1)
android_fltr5_d1 = android_fltr5[android_fltr5["Design"]==1].drop(["Design"],axis=1)
android_fltr5_d2 = android_fltr5[android_fltr5["Design"]==2].drop(["Design"],axis=1)
android_fltr5_d3 = android_fltr5[android_fltr5["Design"]==3].drop(["Design"],axis=1)
As you can guess, I don't find this efficient and would like to use a for loop to generate those variables (as I'd need to match each ["Language"] option to each filter I created, about 60 variables in total).
I thought about using something like .format() in the loop as a kind of placeholder for the variable name, but couldn't find a way to do it.
It would probably be best to use a nested loop to create all the variables, though I'd be content even with a single loop per column.
I find it difficult to build the for loop and would be grateful for any help or directions.
Thanks!
As suggested, I tried to find my answer in: How do I create variable variables?
Yet I failed to understand how to use the globals() function in my case. I also found that using '%' no longer works.
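Since no answer is quoted here, a minimal sketch of the usual dictionary-based alternative to "variable variables" (assuming a DataFrame df with the columns above; the key layout is illustrative, not the asker's exact naming):

# Store each filtered sub-DataFrame in a dict keyed by its filter values,
# instead of creating ~60 individually named variables.
subsets = {}
for device in df["Device"].unique():
    for design in df["Design"].unique():
        mask = (df["OS"] == "android") & (df["Device"] == device) & (df["Design"] == design)
        subsets[(device, design)] = df[mask].drop(columns=["Design"])

# The equivalent of android_fltr1_d1 is then:
# subsets[(1, 1)]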

How can I manipulate a DataFrame name within a function?

How can I manipulate a DataFrame name within a function, so that it returns a new DataFrame whose name is derived from the input DataFrame's name?
Let's say I have this:
def some_func(df):
    # some operations
    return df_copy
and whatever df I put into this function, it should return the new df named ..._copy; e.g. some_func(my_frame) should return my_frame_copy.
Things that I considered are as follows:
As in string operations:
new_df_name = "{}_copy".format(df) -- I know this will not work, since df refers to an object, but it helps to explain what I am trying to do.
def date_timer(df):
    df_copy = df.copy()
    dates = df_copy.columns[df_copy.columns.str.contains('date')]
    for i in range(len(dates)):
        df_copy[dates[i]] = pd.to_datetime(df_copy[dates[i]].str.replace('T', ' '), errors='coerce')
    return df_copy
Actually this was the first thing that I tried. If only DataFrame had a "name" attribute that let us manipulate the name, but there is no such thing:
df.name
Maybe f-strings or some other kind of string operation could make it happen; if not, it might not be possible in Python.
I think this might be related to the variable-name assignment rules in Python, and in a sense what I want is to reverse engineer that, which is probably not possible.
Please advise...
It looks like you're trying to dynamically set a variable name in the global/local namespace of your program.
Unless your data object belongs to a more structured namespace object, I'd discourage you from dynamically setting names with such a method, since a lot can go wrong. As per the docs:
Changes may not affect the values of local and free variables used by the interpreter.
The name attribute of your df is not an ideal solution either, since that attribute is not set by default, nor is it particularly common. However, here is a solid SO answer which addresses this.
You might be better off storing your data objects in a dictionary, using dates or something meaningful as keys. Example:
my_data = {}
for my_date in dates:
    df_temp = df.copy(deep=True)  # deep copy ensures no changes propagate back to the parent object
    # Modify your df here (not sure what you are trying to do exactly)
    df_temp[my_date] = "foo"
    # Now save that df
    my_data[my_date] = df_temp
Hope this answers your Q. Feel free to clarify in the comments.

What is the purpose of the alias method in PySpark?

While learning Spark in Python, I'm having trouble understanding both the purpose of the alias method and its usage. The documentation shows it being used to create copies of an existing DataFrame with new names, then join them together:
>>> from pyspark.sql.functions import *
>>> df_as1 = df.alias("df_as1")
>>> df_as2 = df.alias("df_as2")
>>> joined_df = df_as1.join(df_as2, col("df_as1.name") == col("df_as2.name"), 'inner')
>>> joined_df.select("df_as1.name", "df_as2.name", "df_as2.age").collect()
[Row(name=u'Bob', name=u'Bob', age=5), Row(name=u'Alice', name=u'Alice', age=2)]
My question has two parts:
What is the purpose of the alias input? It seems redundant to give the alias string "df_as1" when we are already assigning the new DataFrame to the variable df_as1. If we were to instead use df_as1 = df.alias("new_df"), where would "new_df" ever appear?
In general, when is the alias function useful? The example above feels a bit artificial, but from exploring tutorials and examples it seems to be used regularly -- I'm just not clear on what value it provides.
Edit: some of my original confusion came from the fact that both DataFrame and Column have alias methods. Nevertheless, I'm still curious about both of the above questions, with question 2 now applying to Column.alias as well.
The variable name is irrelevant and can be whatever you like it to be. It's the alias that will be used in string column identifiers and printouts.
I think the main purpose of aliases is brevity and avoiding confusion when there are conflicting column names. For example, what was simply 'age' could be aliased to 'max_age' after you searched for the biggest value in that column. Or you could have a data frame of a company's employees joined with itself and filtered so that you have manager-subordinate pairs; it could be useful to use column names like "manager.name" in such a context.
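A short sketch of both uses (assuming an active SparkSession named spark; the employee data is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as spark_max

spark = SparkSession.builder.getOrCreate()
emp = spark.createDataFrame(
    [("Alice", None, 50), ("Bob", "Alice", 40), ("Carol", "Alice", 35)],
    ["name", "manager", "age"],
)

# Column.alias: rename a derived column for brevity
emp.agg(spark_max("age").alias("max_age")).show()

# DataFrame.alias: disambiguate columns in a self-join
m = emp.alias("m")
s = emp.alias("s")
pairs = s.join(m, col("s.manager") == col("m.name"), "inner")
pairs.select(
    col("m.name").alias("manager_name"),
    col("s.name").alias("subordinate"),
).show()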

Python Sparse Dictionary / Repeated Values

I have a very, very large dictionary of dictionaries. Often the values are the same, and it seems there should be a way to reduce the size by having a reference to the dictionary value that is the same.
Currently I do this with a two-pass method: first "does this value have a synonym?", then looking up the synonym's value.
But ideally it would be great to have a way to do this in a single go.
animals = {
    'cat': {'legs': 4, 'eyes': 2},
    'dog': {'legs': 4, 'eyes': 2},
    'spider': {'legs': 8, 'eyes': 6},
}
I could define a value mammal and write 'cat': mammal, but what I'd like to be able to do is 'dog': animals['cat'], because as a reference it should take up less memory, which is the goal.
I am contemplating a class to handle this, but I can't be the first person to think that repeated values in a dictionary could be "squished" somehow, and I would prefer to do it in the most pythonic way.
I think objects and inheritance are the better way to do what you want, except maybe for the memory concern.
For using reference instead of copying the values of each dictionary, you can use the ctypes module:
import ctypes
animals = {'cat': {'legs': 4, 'eyes': 2}, 'spider': {'legs': 8, 'eyes': 6}}
# Store the id (address) of animals['cat'] under 'dog'
animals['dog'] = id(animals['cat'])
animals
{'dog': 47589527749808, 'spider': {'eyes': 6, 'legs': 8}, 'cat': {'eyes': 2, 'legs': 4}}
# You can dereference animals['dog'] with:
ctypes.cast(animals['dog'], ctypes.py_object).value
{'eyes': 2, 'legs': 4}
Not sure if it is the "most pythonic way", by the way. IMHO a class is the right way to do this.
Another way could be to use the weakref module. I don't know a lot about that one; look at this post and its different answers for other hints about using references.
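It is worth noting (as a sketch, not part of the original answer) that plain assignment in Python already stores a reference, so no ctypes indirection is needed just to share one dict between two keys:

animals = {'cat': {'legs': 4, 'eyes': 2}, 'spider': {'legs': 8, 'eyes': 6}}
animals['dog'] = animals['cat']          # both keys now point at the same dict object
print(animals['dog'] is animals['cat'])  # True -- no copy was made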

Data structures with Python

Python has a lot of convenient data structures (lists, tuples, dicts, sets, etc.) which can be used to build other 'conventional' data structures (e.g. I can use a Python list to create a stack, a collections.deque to make a queue, dicts to make trees and graphs, etc.).
There are even third-party data structures that can be used for specific tasks (for instance the structures in Pandas, pytables, etc).
So, if I know how to use lists, dicts, sets, etc, should I be able to implement any arbitrary data structure if I know what it is supposed to accomplish?
In other words, what kind of data structures can the Python data structures not be used for?
Thanks
For some simple data structures (e.g. a stack), you can just use the builtin list to get the job done. For more complex structures (e.g. a Bloom filter), you'll have to implement them yourself using the primitives the language supports.
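For instance, a minimal stack sketch on top of the builtin list:

stack = []
stack.append('a')    # push
stack.append('b')    # push
top = stack[-1]      # peek -> 'b'
item = stack.pop()   # pop  -> 'b'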
You should use the builtins if they serve your purpose really since they're debugged and optimised by a horde of people for a long time. Doing it from scratch by yourself will probably produce an inferior data structure. Whether you're using Python, C++, C#, Java, whatever, you should always look to the built in data structures first. They will generally be implemented using the same system primitives you would have to use doing it yourself, but with the advantage of having been tried and tested.
Combinations of these data structures (and maybe some of the functions from helper modules such as heapq and bisect) are generally sufficient to implement most richer structures that may be needed in real-life programming; however, that's not invariably the case.
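As a quick sketch of those helper modules:

import heapq
import bisect

# heapq turns a plain list into a min-heap / priority queue
tasks = []
heapq.heappush(tasks, (2, 'write docs'))
heapq.heappush(tasks, (1, 'fix bug'))
priority, task = heapq.heappop(tasks)  # -> (1, 'fix bug')

# bisect keeps a list sorted as you insert into it
scores = [10, 20, 30]
bisect.insort(scores, 25)  # scores is now [10, 20, 25, 30]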
Only when the provided data structures do not allow you to accomplish what you need, and there isn't an alternative and reliable library available to you, should you be looking at building something from scratch (or extending what's provided).
Let's say you need something more than the rich Python library provides. Consider the fact that an object's attributes (and items in collections) are essentially "pointers" to other objects (without pointer arithmetic), i.e. "reseatable references", in Python just as in Java. In Python, you normally use a None value in an attribute or item to represent what NULL would mean in C++ or null would mean in Java.
So, for example, you could implement binary trees via, e.g.:
class Node(object):
    __slots__ = 'data', 'left', 'right'
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right
plus methods or functions for traversal and similar operations (the __slots__ class attribute is optional -- mostly a memory optimization, to avoid each Node instance carrying its own __dict__, which would be substantially larger than the three needed attributes/references).
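For example, a minimal in-order traversal over such nodes might look like this (a sketch, not from the original answer):

def inorder(node):
    # Yield data from the left subtree, then the node, then the right subtree
    if node is not None:
        yield from inorder(node.left)
        yield node.data
        yield from inorder(node.right)

root = Node(2, Node(1), Node(3))
print(list(inorder(root)))  # [1, 2, 3]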
Other examples of data structures that may best be represented by dedicated Python classes, rather than by direct composition of other existing Python structures, include tries (see e.g. here) and graphs (see e.g. here).
You can use the Python data structures to do anything you like. The entire programming language Lisp (now people use either Common Lisp or Scheme) is built around the linked list data structure, and Lisp programmers can build any data structure they choose.
That said, there are sometimes data structures for which the Python data structures are not the best option. For instance, if you want to build a splay tree, you should either roll your own or use an open-source project like pysplay. If the built-in data structures solve your problem, use them. Otherwise, look beyond them. As always, use the best tool for the job.
Given that all data structures exist in memory, and memory is effectively just a list (array)... there is no data structure that couldn't be expressed in terms of the basic Python data structures (with appropriate code to interact with them).
It is important to realize that Python can represent hierarchical structures as combinations of list (or tuple) and dict. For example, list-of-dict, dict-of-list, and dict-of-dict are common simple structures. These are so common that, in my own practice, I append the data type to the variable name, like 'parameters_lod'. And these can be nested to arbitrary depth.
The _lod datatype can be easily converted into a pandas DataFrame, for example, or into any database table. In fact, some realizations of big-data tables use the _lod structure, sometimes omitting the commas between the dicts and the surrounding list brackets [], which makes it easy to append such lines to a file. AWS offers tables that use dict syntax.
A _lod can be easily converted to a _dod if there is a field that is unique and can be used to index the table. An important difference between _lod and _dod is that the _lod can have multiple entries for the same keyfield, whereas a dict is required to have only one. Thus, it is more general to start with the _lod as the primary basic table structure so duplicates are allowed until the table is inspected to combine those entries.
If the _lod is turned into a _dod, it is preferable to keep each dict intact and not remove the item used as the keyfield.
a_lod = [
    {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    {'k': 'joe', 'a': 7, 'b': 8, 'c': 9},
]

a_dod = {
    'sam': {'k': 'sam', 'a': 1, 'b': 2, 'c': 3},
    'sue': {'k': 'sue', 'a': 4, 'b': 5, 'c': 6},
    'joe': {'k': 'joe', 'a': 7, 'b': 8, 'c': 9},
}
Thus, the dict key is added but the records themselves are unchanged; we find this a good practice because the underlying dicts stay intact.
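The conversion itself is a one-line dict comprehension (a sketch using the 'k' field above as the keyfield):

a_dod = {rec['k']: rec for rec in a_lod}
# and back again:
a_lod_again = list(a_dod.values())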
Pandas' DataFrame.append() function is very slow (it was deprecated in pandas 1.4 and removed in 2.0). Therefore, you should not construct a dataframe one record at a time using this syntax:
df = df.append(record)
Instead, build it as a lod and then convert it to a df, as follows:
df = pd.DataFrame(lod)
This is much faster; the append-based method gets slower and slower as the df grows, because the whole df is copied each time.
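A minimal sketch of the pattern (the record contents are illustrative):

import pandas as pd

lod = []
for i in range(3):
    lod.append({'name': f'row{i}', 'value': i * 10})  # cheap list append

df = pd.DataFrame(lod)  # one conversion at the end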
It has become important in our development to emphasize the use of _lod and to keep field names consistent across records, so they can be easily converted to a DataFrame. So we avoid key fields in dicts like 'sam': (data) and use {'name': 'sam', 'dataname': (arbitrary data)} instead.
The most elegant thing about Python structures is that the default is to work with references to the data rather than values. This must be understood, because modifying data through a reference will modify the larger structure.
If you want a copy, then you need to use .copy(), and sometimes copy.deepcopy() or, with pandas, .copy(deep=True). Then the data structure will be copied; otherwise, a variable name is just another reference.
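A quick sketch of the difference:

import copy

parent = {'rows': [{'a': 1}]}
alias = parent['rows']            # a reference: no data is copied
alias.append({'a': 2})
print(len(parent['rows']))        # 2 -- the parent sees the change

snapshot = copy.deepcopy(parent)  # an independent copy
snapshot['rows'].append({'a': 3})
print(len(parent['rows']))        # still 2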
Further, we discourage using the dol structure, and instead prefer the lodol or dodol. This is because it is best to have each data item identified with a label, which also allows additional fields to be added.
