Pandas dict to dataframe - columns out of order? - python

I did a search but didn't see any results pertaining to this specific question. I have a Python dict, and am converting my dict to a pandas dataframe:
pandas.DataFrame(data_dict)
It works, with only one problem - the columns of my pandas dataframe are not in the same order as my Python dict. I'm not sure how pandas is reordering things. How do I retain the ordering?

Python dictionaries (before Python 3.7) are unordered, so the column order cannot be relied upon. You can simply set the column order afterwards:
In [1]:
df = pd.DataFrame({'a':np.random.rand(5),'b':np.random.randn(5)})
df
Out[1]:
a b
0 0.512103 -0.102990
1 0.762545 -0.037441
2 0.034237 1.343115
3 0.667295 -0.814033
4 0.372182 0.810172
In [2]:
df = df[['b','a']]
df
Out[2]:
b a
0 -0.102990 0.512103
1 -0.037441 0.762545
2 1.343115 0.034237
3 -0.814033 0.667295
4 0.810172 0.372182

A Python dictionary (before Python 3.7) is an unordered structure, and the key order you get when printing it (or looping over its keys) is arbitrary.
In this case, you need to explicitly specify the order of columns in the DataFrame with
pandas.DataFrame(data=data_dict, columns=column_order)
where column_order is a list of the column names in the desired order.
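A minimal sketch of this, using a made-up data_dict whose key order we want to preserve:

```python
import pandas as pd

# Hypothetical dict whose key order should become the column order
data_dict = {"b": [1, 2], "a": [3, 4], "c": [5, 6]}

# Passing the keys explicitly fixes the column order
df = pd.DataFrame(data=data_dict, columns=list(data_dict))
print(list(df.columns))  # ['b', 'a', 'c']
```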

Related

Pandas, groupby by 2 non numeric columns

I have a dataframe with several columns, but I only need to use 2 non-numeric columns:
one is 'hashed_id', the other is 'event_name' with 10 unique names.
I'm trying to group by these 2 non-numeric columns, so numeric aggregation functions would not work here.
My solution is:
df_events = df.groupby('subscription_hash', 'event_name')['event_name']
df_events = pd.DataFrame(df_events, columns=['subscription_hash', 'event_name'])
I'm trying to get a format like:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) AddToQueue
1 (0000379144f24717a8d124d798008a0e672) page_view
but instead getting:
subscription_hash event_name
0 (0000379144f24717a8d124d798008a0e672) 832433 AddToQueue
1 (0000379144f24717a8d124d798008a0e672) 245400 page_view
Please advise
Is your data clean? Where are those undesired numbers coming from?
From the docs, I see groupby being used by providing the column names as a list, together with an aggregate function:
df.groupby(['col1','col2']).mean()
Since your values are not numeric, maybe try the pivot method:
df.pivot(columns=['col1','col2'])
So I'd first try putting [] around your column names, then try the pivot.
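A minimal sketch of the bracketed-list suggestion, on toy data using the column names from the question; .size() plus reset_index() is an assumption about the desired aggregation (counting each unique pair):

```python
import pandas as pd

# Toy data with the column names from the question
df = pd.DataFrame({
    "subscription_hash": ["abc", "abc", "xyz"],
    "event_name": ["AddToQueue", "page_view", "AddToQueue"],
})

# Pass both column names as a list; .size() counts each pair,
# and reset_index() turns the group keys back into columns
pairs = df.groupby(["subscription_hash", "event_name"]).size().reset_index(name="count")
print(pairs)
```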

How to sort a dataFrame in python pandas by two or more columns based on a list of values? [duplicate]

So I have a pandas DataFrame, df, with columns that represent taxonomical classification (i.e. Kingdom, Phylum, Class etc...) I also have a list of taxonomic labels that correspond to the order I would like the DataFrame to be ordered by.
The list looks something like this:
class_list=['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes', 'Clostridia', 'Bacilli', 'Actinobacteria', 'Betaproteobacteria', 'delta/epsilon subdivisions', 'Synergistia', 'Mollicutes', 'Nitrospira', 'Spirochaetia', 'Thermotogae', 'Aquificae', 'Fimbriimonas', 'Gemmatimonadetes', 'Dehalococcoidia', 'Oscillatoriophycideae', 'Chlamydiae', 'Nostocales', 'Thermodesulfobacteria', 'Erysipelotrichia', 'Chlorobi', 'Deinococci']
This list would correspond to the Dataframe column df['Class']. I would like to sort all the rows for the whole dataframe based on the order of the list as df['Class'] is in a different order currently. What would be the best way to do this?
You could make the Class column your index column
df = df.set_index('Class')
and then use df.loc to reindex the DataFrame with class_list:
df.loc[class_list]
Minimal example:
>>> df = pd.DataFrame({'Class': ['Gammaproteobacteria', 'Bacteroidetes', 'Negativicutes'], 'Number': [3, 5, 6]})
>>> df
Class Number
0 Gammaproteobacteria 3
1 Bacteroidetes 5
2 Negativicutes 6
>>> df = df.set_index('Class')
>>> df.loc[['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']]
Number
Bacteroidetes 5
Negativicutes 6
Gammaproteobacteria 3
Alex's solution doesn't work if your original dataframe does not contain all of the elements in the ordered list, i.e. if your input data at some point in time does not contain "Negativicutes", this script will fail. One way to get past this is to append your df's to a list and concatenate them at the end. For example:
ordered_classes = ['Bacteroidetes', 'Negativicutes', 'Gammaproteobacteria']
df_list = []
for i in ordered_classes:
    df_list.append(df[df['Class']==i])
ordered_df = pd.concat(df_list)
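A self-contained sketch of this filter-and-concat approach, reusing the minimal example data from the first answer; 'Clostridia' is deliberately absent from the dataframe to show that missing labels are simply skipped:

```python
import pandas as pd

df = pd.DataFrame({
    "Class": ["Gammaproteobacteria", "Bacteroidetes", "Negativicutes"],
    "Number": [3, 5, 6],
})

# 'Clostridia' is absent from df; its filter yields an empty frame, which concat ignores
ordered_classes = ["Bacteroidetes", "Clostridia", "Negativicutes", "Gammaproteobacteria"]

df_list = [df[df["Class"] == cls] for cls in ordered_classes]
ordered_df = pd.concat(df_list, ignore_index=True)
print(ordered_df)
```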

Create a new pandas dataframe from a python list of lists with a column with lists

I have a python list of lists, e.g. [["chunky","bacon","foxes"],["dr_cham"],["organ","instructor"],...] and would like to create a pandas dataframe with one column containing the lists:
0 ["chunky","bacon","foxes"]
1 ["dr_cham"]
2 ["organ","instructor"]
...
The standard constructor (l is the list here)
pd.DataFrame(l)
returns a dataframe with 3 columns in that case.
How would this work? I'm sure it's very simple, but I've been searching for a solution for a while and can't figure it out for some obscure reason.
Thanks,
B.
The following code should achieve what you want:
import pandas as pd
l = [["hello", "goodbye"], ["cat", "dog"]]
# Replace "lists" with whatever you want to name the column
df = pd.DataFrame({"lists": l})
After printing df, we get
lists
0 [hello, goodbye]
1 [cat, dog]
Hope this helps -- let me know if I can clarify anything!
This pattern is rarely if ever a good idea and defeats the purpose of Pandas, but if you have some unusual use case that requires it, you can achieve it by further nesting your inner lists an additional level in your main list, then creating a DataFrame from that:
l = [[x] for x in l]
df = pd.DataFrame(l)
0
0 [chunky, bacon, foxes]
1 [dr_cham]
2 [organ, instructor]

Why does referencing a concatenated pandas dataframe return multiple entries?

When I create a dataframe using concat like this:
import pandas as pd
dfa = pd.DataFrame({'a':[1],'b':[2]})
dfb = pd.DataFrame({'a':[3],'b':[4]})
dfc = pd.concat([dfa,dfb])
And I try to reference like I would for any other DataFrame I get the following result:
>>> dfc['a'][0]
0 1
0 3
Name: a, dtype: int64
I would expect my concatenated DataFrame to behave like a normal DataFrame and return the integer that I want like this simple DataFrame does:
>>> dfa['a'][0]
1
I am just a beginner, is there a simple explanation for why the same call is returning an entire DataFrame and not the single entry that I want? Or, even better, an easy way to get my concatenated DataFrame to respond like a normal DataFrame when I try to reference it? Or should I be using something other than concat?
You've mistaken what normal behavior is. dfc['a'][0] is a label lookup and matches anything with an index value of 0 in which there are two because you concatenated two dataframes with index values including 0.
To look up position 0 instead, use .iloc:
dfc['a'].iloc[0]
or you could have constructed dfc like
dfc = pd.concat([dfa,dfb], ignore_index=True)
dfc['a'][0]
Both returning
1
EDITED (thanks to piRSquared's comment)
Alternatively, use append() instead of pd.concat() (note that DataFrame.append was deprecated and later removed in pandas 2.0, so pd.concat is preferred in current versions):
dfc = dfa.append(dfb, ignore_index=True)
dfc['a'][0]
1
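Putting the two fixes together, a minimal runnable sketch (pd.concat with ignore_index is the modern replacement for the removed append method):

```python
import pandas as pd

dfa = pd.DataFrame({"a": [1], "b": [2]})
dfb = pd.DataFrame({"a": [3], "b": [4]})

# Positional lookup works even with duplicate index labels
dfc = pd.concat([dfa, dfb])
print(dfc["a"].iloc[0])  # 1

# Or rebuild a unique 0..n-1 index so label lookup behaves as expected
dfc = pd.concat([dfa, dfb], ignore_index=True)
print(dfc["a"][0])  # 1
```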

pandas: check membership in array of lists, avoid looping through columns

What is the best way to accomplish the following task?
In the following DataFrame,
df = pd.DataFrame({'a':[20,21,99], 'b':[[1,2,3,4],[1,2,99],[1,2]], 'c':['x','y','z']})
I want to check which elements in column df['a'] are contained in some list in column df['b']. In case there is a match I want the corresponding element in column df['c'], and if no match is found a 0.
So in my example I would like to get a Series:
[0,0,'y'].
Since 99 is the only element in column df['a'] contained in a list from column df['b'], and that list corresponds to element 'y' in column df['c']
I tried:
def match(item):
    for ind, row in df.iterrows():
        if item in row.b:
            return row.c
    return 0
df['a'].apply(match)
But is quite slow.
Thanks!
I think this is an example of why you never want a column of lists in a Pandas DataFrame. Accessing the values in the lists forces you to use Python loops, with no opportunity to really take advantage of Pandas.
Ideally, I think you would be best off altering the way you are constructing df so that you do not store the values in b as lists. The appropriate data structure to use depends on how you intend to use the data.
For the particular purpose you describe in the question, a dict would be useful.
To construct the dict given the current df, you could do this:
In [69]: dct = {key:row['c'] for i, row in df[['b', 'c']].iterrows() for key in row['b']}
In [70]: df['a'].map(dct).fillna(0)
Out[70]:
0 0
1 0
2 y
Name: a, dtype: object
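A self-contained sketch of the dict-building approach above, on the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "a": [20, 21, 99],
    "b": [[1, 2, 3, 4], [1, 2, 99], [1, 2]],
    "c": ["x", "y", "z"],
})

# Flatten column b into a lookup: each list element -> that row's value in c
dct = {key: row["c"] for _, row in df[["b", "c"]].iterrows() for key in row["b"]}

# Map column a through the lookup; elements with no match become 0
result = df["a"].map(dct).fillna(0)
print(result.tolist())  # [0, 0, 'y']
```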
