When I wrote my results to a CSV file, I generated a pandas dataframe first. However, the dataframe's column order changed automatically. Why does this happen?
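As Youn Elan pointed out, Python dictionaries were not ordered before Python 3.7 (insertion order only became a language guarantee in 3.7), so if you use a dictionary to provide your data on an older version, the columns can end up in a different order from the one you wrote. You can use the columns argument to set the order of the columns explicitly, though: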
import pandas as pd
# without columns=, the column order depends on how the dict is ordered
before = pd.DataFrame({'lake_id': range(3), 'area': ['a', 'b', 'c']})
print('before')
print(before)
# columns= fixes the column order explicitly
after = pd.DataFrame({'lake_id': range(3), 'area': ['a', 'b', 'c']},
                     columns=['lake_id', 'area'])
print('after')
print(after)
Result (with a Python/pandas version that does not preserve dict insertion order):
before
  area  lake_id
0    a        0
1    b        1
2    c        2
after
   lake_id area
0        0    a
1        1    b
2        2    c
I notice you use a dictionary.
Dictionaries in Python (before 3.7) are not guaranteed to be in any particular order. The order depends on multiple factors, including what's in the dictionary. Keys are guaranteed to be unique, though.
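Since the original question is about CSV output, the column order can also be pinned at write time. This is a minimal sketch, assuming the same lake_id/area data as in the answer above; it writes to an in-memory buffer (io.StringIO) instead of a real file purely for illustration:

```python
import io
import pandas as pd

df = pd.DataFrame({'area': ['a', 'b', 'c'], 'lake_id': range(3)})

# to_csv accepts a columns argument that selects and orders the columns written
buffer = io.StringIO()
df.to_csv(buffer, columns=['lake_id', 'area'], index=False)

csv_lines = buffer.getvalue().splitlines()
print(csv_lines[0])  # header row of the CSV
```

Passing a file path instead of the buffer would write the same content to disk.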
Related
I have a dataframe:
import pandas as pd
df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
that I would like to slice into two new dataframes such that the first contains every nth value, while the second contains the remaining values not in the first.
For example, in the case of n=3, the second dataframe would keep two values from the original dataframe, skip one, keep two, skip one, etc. This slice is illustrated in the following image where the original dataframe values are blue, and these are split into a green set and a red set:
I have achieved this successfully using a combination of iloc and isin:
df1 = df.iloc[::3]
df2 = df[~df.val.isin(df1.val)]
but what I would like to know is:
Is this the most Pythonic way to achieve this? It seems inefficient and not particularly elegant to take what I want out of a dataframe, then get the rest by checking which values of the original are not in the new dataframe. Instead, is there an iloc expression, like the one used to generate df1, which could do the second part of the slicing procedure and replace the isin line? Even better, is there a single expression that could execute the entire two-step slice in one step?
Take the row positions modulo 3 and keep the rows where the result is not 0 (the complement of the rows selected by iloc[::3]):
import numpy as np
# for the default RangeIndex
df2 = df[df.index % 3 != 0]
# for any Index
df2 = df[np.arange(len(df)) % 3 != 0]
print(df2)
  val
1   b
2   c
4   e
5   f
7   h
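The question also asks whether both slices can come from one step. One possible sketch, assuming the df from the question and a variable n for the step (n is introduced here for illustration), builds a single positional boolean mask and uses it and its negation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'val': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
n = 3  # keep every n-th row in df1, everything else in df2

# True at positions 0, n, 2n, ...; positional, so it works for any index
mask = np.arange(len(df)) % n == 0
df1, df2 = df[mask], df[~mask]

print(df1)
print(df2)
```

Unlike the isin approach, this does not depend on the values in val being unique.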
Basically, I have a pandas dataframe with an inconvenient ordered category field. I might not even know what the category values are; I just know the category is ordered and has three values:
import pandas as pd
dfs = pd.DataFrame({'C1': pd.Categorical(list('abbacabac'), categories=['a', 'b', 'c'], ordered=True), 'C2': [1,2,3,4,5,6,7,8,9]})
I can get, say, all the items that are in the second category by doing:
df1 = dfs[dfs.C1 == 'b']
But I might not even know what the categories are, or they might be really inconvenient ones to type in or something.
Considering the categories in the example are ordered, is there a simple way to just get the items that have the second category by order, something like
df1 = dfs[dfs.C1.category_order == 1]
?
Use cat.categories and select by indexing:
dfs = dfs[dfs.C1 == dfs.C1.cat.categories[1]]
print (dfs)
  C1  C2
1  b   2
2  b   3
6  b   7
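An alternative that is closer to the category_order idea in the question is to compare the integer codes of the categorical. This is a sketch using the .cat.codes accessor, which stores each row's position within cat.categories; the variable name second is made up for this example:

```python
import pandas as pd

dfs = pd.DataFrame({
    'C1': pd.Categorical(list('abbacabac'), categories=['a', 'b', 'c'], ordered=True),
    'C2': [1, 2, 3, 4, 5, 6, 7, 8, 9],
})

# cat.codes holds each row's position in cat.categories (-1 for missing values)
second = dfs[dfs.C1.cat.codes == 1]
print(second)
```

This selects the same rows as the cat.categories[1] comparison, without looking up the category label.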
From the reindex docs:
Conform DataFrame to new index with optional filling logic, placing NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
Therefore, I thought that I would get a reordered Dataframe by setting copy=False in place (!). It appears, however, that I do get a copy and need to assign it to the original object again. I don't want to assign it back, if I can avoid it (the reason comes from this other question).
This is what I am doing:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5, 5))
df.columns = [ 'a', 'b', 'c', 'd', 'e' ]
df.head()
Out:
          a         b         c         d         e
0  0.234296  0.011235  0.664617  0.983243  0.177639
1  0.378308  0.659315  0.949093  0.872945  0.383024
2  0.976728  0.419274  0.993282  0.668539  0.970228
3  0.322936  0.555642  0.862659  0.134570  0.675897
4  0.167638  0.578831  0.141339  0.232592  0.976057
Reindex gives me the correct output, but I'd need to assign it back to the original object, which is what I wanted to avoid by using copy=False:
df.reindex( columns=['e', 'd', 'c', 'b', 'a'], copy=False )
The desired output after that line is:
          e         d         c         b         a
0  0.177639  0.983243  0.664617  0.011235  0.234296
1  0.383024  0.872945  0.949093  0.659315  0.378308
2  0.970228  0.668539  0.993282  0.419274  0.976728
3  0.675897  0.134570  0.862659  0.555642  0.322936
4  0.976057  0.232592  0.141339  0.578831  0.167638
Why is copy=False not working in place?
Is it possible to do that at all?
Working with python 3.5.3, pandas 0.23.3
reindex is a structural change, not a cosmetic or transformative one. As such, a copy is always returned because the operation cannot be done in-place (it would require allocating new memory for the underlying arrays, etc.). This means you have to assign the result back; there is no other choice.
df = df.reindex(['e', 'd', 'c', 'b', 'a'], axis=1)
Also see the discussion on GH21598.
The one corner case where copy=False is actually of any use is when the indices used to reindex df are identical to the ones it already has. You can check by comparing the ids:
id(df)
# 4839372504
id(df.reindex(df.index, copy=False)) # same object returned
# 4839372504
id(df.reindex(df.index, copy=True)) # new object created - ids are different
# 4839371608
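The assign-back behaviour described in this answer can be checked with a short sketch. It uses the same five-column random frame as the question; the helper names original, unchanged_order and new_order are made up for illustration, and the copy argument is left out on purpose since its handling may differ between pandas versions:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(5, 5), columns=['a', 'b', 'c', 'd', 'e'])
original = df  # second reference to the same object

# Calling reindex without assigning the result leaves df untouched
df.reindex(columns=['e', 'd', 'c', 'b', 'a'])
unchanged_order = list(df.columns)

# Assigning the result back rebinds the name df to the new, reordered object
df = df.reindex(columns=['e', 'd', 'c', 'b', 'a'])
new_order = list(df.columns)

print(unchanged_order, new_order, df is original)
```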
A bit off topic, but I believe this would rearrange the columns in place:
for i, colname in enumerate(list_of_columns_in_desired_order):
    col = dataset.pop(colname)
    dataset.insert(i, colname, col)
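A self-contained version of this pop/insert loop, as a sketch on a small made-up frame (the data and the names same_object and desired_order are assumptions for illustration), suggests that the column order changes while the DataFrame object stays the same:

```python
import pandas as pd

dataset = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
same_object = dataset  # second reference, to check identity afterwards
desired_order = ['c', 'a', 'b']

for i, colname in enumerate(desired_order):
    col = dataset.pop(colname)       # removes the column and returns it as a Series
    dataset.insert(i, colname, col)  # puts it back at position i

print(list(dataset.columns), dataset is same_object)
```

Whether the underlying data is moved without copying is a separate question that this sketch does not examine.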
So, I have a problem with my dataframe from dictionary - python actually "names" my rows and columns with numbers.
Here's my code:
a = dict()
dfList = [x for x in df['Marka'].tolist() if str(x) != 'nan']
dfSet = set(dfList)
dfList123 = list(dfSet)
for i in range(len(dfList123)):
    number = dfList.count(dfList123[i])
    a[dfList123[i]] = number
sorted_by_value = sorted(a.items(), key=lambda kv: kv[1], reverse=True)
dataframe=pd.DataFrame.from_dict(sorted_by_value)
print(dataframe)
I've tried to rename columns like this:
dataframe=pd.DataFrame.from_dict(sorted_by_value, orient='index', columns=['A', 'B', 'C']), but it gives me an error:
AttributeError: 'list' object has no attribute 'values'
Is there any way to fix it?
Edit:
Here's the first part of my data frame:
0 1
0 VW 1383
1 AUDI 1053
2 VOLVO 789
3 BMW 749
4 OPEL 621
5 MERCEDES BENZ 593
...
The 1st rows and columns are exactly what I need to remove/rename
index and columns are properties of your dataframe
As long as len(df.index) > 0 and len(df.columns) > 0, i.e. your dataframe has nonzero rows and nonzero columns, you cannot get rid of the labels from your pd.DataFrame object. Whether the dataframe is constructed from a dictionary, or otherwise, is irrelevant.
What you can do is remove them from a representation of your dataframe, with output either as a Python str object or a CSV file. Here's a minimal example:
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]])
print(df)
#    0  1  2
# 0  1  2  3
# 1  4  5  6
# output to string without index or headers
print(df.to_string(index=False, header=False))
# 1 2 3
# 4 5 6
# output to csv without index or headers
df.to_csv('file.csv', index=False, header=False)
By sorting the dict_items object (a.items()), you have created a list.
You can check this with type(sorted_by_value). Then, when you try to use the pd.DataFrame.from_dict() method, it fails because it is expecting a dictionary, which has 'values', but instead receives a list.
Probably the smallest fix you can make to the code is to replace the line:
dataframe=pd.DataFrame.from_dict(sorted_by_value)
with:
dataframe = pd.DataFrame(dict(sorted_by_value), index=[0])
(The index=[0] argument is required here because pd.DataFrame expects a dictionary to be in the form {'key1': [list1, of, values], 'key2': [list2, of, values]} but instead sorted_by_value is converted to the form {'key1': value1, 'key2': value2}.)
Another option is to use pd.DataFrame(sorted_by_value) to generate a dataframe directly from the sorted items, although you may need to tweak sorted_by_value or the result to get the desired dataframe format.
Alternatively, look at collections.OrderedDict to avoid sorting to a list and then converting back to a dictionary.
Edit
Regarding naming of columns and the index, without seeing the data/desired result it's difficult to give specific advice. The options above will remove the error and allow you to create a dataframe, the columns of which can then be renamed using dataframe.columns = [list, of, column, headings]. For the index, look at pd.DataFrame.set_index(drop=True) and pd.DataFrame.reset_index().
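As one way to combine the pd.DataFrame(sorted_by_value) option with named columns, here is a sketch. The brand values are a made-up stand-in for the question's df['Marka'] column, and the column name 'Count' is an assumption, not something from the question:

```python
import pandas as pd

# Hypothetical stand-in for the question's df['Marka'] column
df = pd.DataFrame({'Marka': ['VW', 'AUDI', 'VW', None, 'BMW', 'VW', 'AUDI']})

# Count occurrences per brand, skipping missing values
counts = {}
for brand in df['Marka'].dropna():
    counts[brand] = counts.get(brand, 0) + 1

# sorted() returns a list of (brand, count) tuples, largest count first
sorted_by_value = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

# A list of tuples goes straight into the constructor; columns= names the columns
dataframe = pd.DataFrame(sorted_by_value, columns=['Marka', 'Count'])

# index=False hides the 0, 1, 2, ... row labels in the printed output
print(dataframe.to_string(index=False))
```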
I did a search but didn't see any results pertaining to this specific question. I have a Python dict, and am converting my dict to a pandas dataframe:
pandas.DataFrame(data_dict)
It works, with only one problem - the columns of my pandas dataframe are not in the same order as my Python dict. I'm not sure how pandas is reordering things. How do I retain the ordering?
Python dictionaries (pre 3.6) are unordered, so the column order cannot be relied upon. You can simply set the column order afterwards.
In [1]:
df = pd.DataFrame({'a':np.random.rand(5),'b':np.random.randn(5)})
df
Out[1]:
          a         b
0  0.512103 -0.102990
1  0.762545 -0.037441
2  0.034237  1.343115
3  0.667295 -0.814033
4  0.372182  0.810172
In [2]:
df = df[['b','a']]
df
Out[2]:
          b         a
0 -0.102990  0.512103
1 -0.037441  0.762545
2  1.343115  0.034237
3 -0.814033  0.667295
4  0.810172  0.372182
In Python versions before 3.7, a dictionary is an unordered structure, and the key order you get when printing it (or looping over its keys) is arbitrary.
In this case, you would need to explicitly specify the order of columns in the DataFrame:
pandas.DataFrame(data=data_dict, columns=columns_order)
where columns_order is a list of column names in the desired order.
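A minimal runnable sketch of this, with a made-up data_dict and columns_order (both are placeholders for your own data):

```python
import pandas as pd

data_dict = {'b': [1, 2], 'a': [3, 4], 'c': [5, 6]}
columns_order = ['c', 'a', 'b']  # the order you want, regardless of dict order

df = pd.DataFrame(data=data_dict, columns=columns_order)
print(df)
```

Only the keys listed in columns_order are used, so the list also works as a column filter.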