I am getting really strange results with a pandas DataFrame grouping operation. What I want to do is group by index (my index is non-unique), and then fill null values appropriately. This works in many cases but in some instances I am getting a strange behavior where an empty DataFrame is all that is returned:
import pandas as pd

df = pd.DataFrame(columns=['sample', 'cooling_rate'],
                  index=['SYd', 'SYd', 'XNa', 'Xna', 'Qza_new', 'Qza_new'],
                  data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
                        ['XNa', 3], ['val1', 'val3'], ['val1', None]])
res = df.groupby(df.index).fillna('1')
# Empty DataFrame
# Columns: []
# Index: []
However, if I change the DataFrame ever so slightly, by renaming the index item 'Qza_new' to 'qza_new':
df = pd.DataFrame(columns=['sample', 'cooling_rate'],
                  index=['SYd', 'SYd', 'XNa', 'Xna', 'qza_new', 'qza_new'],
                  data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
                        ['XNa', 3], ['val1', 'val3'], ['val1', None]])
res = df.groupby(df.index).fillna('1')
#         sample cooling_rate
# SYd        SYd            3
# SYd        SYd            3
# XNa        XNa            3
# Xna        XNa            3
# qza_new   val1         val3
# qza_new   val1            1
The result is a properly grouped, filled DataFrame as expected. I can't make any sense of this behavior, and I'm not getting any sort of "error".
With more experimentation, it appears that the key is definitely in my DataFrame index line:
index=['SYd', 'SYd', 'XNa', 'Xna', 'qza_new', 'qza_new'],
It appears that the second to last value has to be earlier in the alphabet than the last value. In other words,
index=['SYd', 'SYd', 'XNa', 'XNa', 'a', 'b']
works and returns a filled in DataFrame, but:
index=['SYd', 'SYd', 'XNa', 'XNa', 'c', 'b']
returns an empty DataFrame. But why?
I suspect I must be missing something obvious, but I have no idea why I'm seeing this behavior.
Update:
This issue appears to be known: https://github.com/pandas-dev/pandas/issues/14955 Hopefully it will be fixed in the next release.
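In the meantime, one possible workaround is to fill inside apply rather than calling fillna directly on the groupby; this avoids the grouped fillna fast path, though I haven't verified it against every affected version. A sketch mirroring the toy fillna('1') above (in real use you'd presumably fill with a per-group method such as ffill):

import pandas as pd

df = pd.DataFrame(columns=['sample', 'cooling_rate'],
                  index=['SYd', 'SYd', 'XNa', 'Xna', 'Qza_new', 'Qza_new'],
                  data=[['SYd', 3], ['SYd', 3], ['XNa', 3],
                        ['XNa', 3], ['val1', 'val3'], ['val1', None]])

# Fill nulls group by group via apply instead of groupby(...).fillna(...).
res = df.groupby(df.index, sort=False).apply(lambda g: g.fillna('1'))
print(res)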
Related
I would like to apply the loop below, which for each index value returns the unique values of a column called SERIAL_NUMBER. Essentially I want to confirm that for each index there is a unique serial number.
index_values = df.index.levels

for i in index_values:
    x = df.loc[[i]]
    x["SERIAL_NUMBER"].unique()
The problem, however, is that my dataset has a multi-index, and as you can see below it is stored in a FrozenList. I am only interested in the index values that contain a long number. The word "vehicle", which also appears as an index level, can be dropped since it is repeated throughout the dataset.
How can I extract these values into a list so I can use them in the loop?
index_values
>>
FrozenList([['0557bf98-c3e0-4955-a23f-2394635ab531', '074705a3-a96a-418c-9bfe-14c37f5c4e6f', '0f47e260-0fa2-40ba-a417-7c00ea74248c', '17342ca2-6246-4150-8080-96d6125cf2b5', '26c6c0d1-0134-4b3a-a149-61dd93afab3b', '7600be43-5d0a-49b3-a1ee-fd107db5822f', 'a07f2b0c-447c-4143-a361-d7ddbffdcc77', 'b929801c-2f32-4a95-bfc4-48a05b48ee01', 'cc912023-0113-42cd-8fe7-4df4005127c2', 'e424bd02-e188-462e-a1a6-2f4ed8fe0a2d'], ['vehicle']])
Without an example it's hard to judge, but I think you need
df.index.get_level_values(0).unique() # add .tolist() if you want a list
import pandas as pd

df = pd.DataFrame({'A': [5]*5, 'B': [6]*5})
df = df.set_index('A', append=True)

df.index.get_level_values(0).unique()
Int64Index([0, 1, 2, 3, 4], dtype='int64')

df.index.get_level_values(1).unique()
Int64Index([5], dtype='int64', name='A')
To drop duplicates from an index level, use the .duplicated() method.
df[~df.index.get_level_values(1).duplicated(keep='first')]
     B
  A
0 5  6
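For the original goal (confirming that each index value maps to a single serial number), you may not need to extract the level values and loop at all; a grouped distinct count does it in one step. A sketch, assuming the column really is named SERIAL_NUMBER:

import pandas as pd

# Hypothetical frame with a two-level index (id, 'vehicle') like the one above.
df = pd.DataFrame(
    {'SERIAL_NUMBER': ['S1', 'S1', 'S2', 'S2', 'S3']},
    index=pd.MultiIndex.from_tuples([('id-1', 'vehicle'), ('id-1', 'vehicle'),
                                     ('id-2', 'vehicle'), ('id-2', 'vehicle'),
                                     ('id-3', 'vehicle')]))

# Distinct serial numbers per first-level index value; anything greater
# than 1 means that id has conflicting serial numbers.
counts = df.groupby(level=0)['SERIAL_NUMBER'].nunique()
print(counts[counts > 1])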
In the following pandas dataframe there are missing values in different columns for each row.
import pandas as pd
import numpy as np
d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)
df
I know I can use this to locate which columns are not empty in the ith row
df.iloc[0].notnull()
And then something like the following to find which specific columns are not empty.
np.where(df.iloc[0].notnull())
However, how can I then use those values as indices to return the non missing columns in the ith row?
For example, in the 0th row I'd like to get back columns col1 and col3:
df.iloc[0, [0,2]]
This isn't quite right, but I'm guessing is somewhere along these lines?
df.iloc[0, np.where(df.iloc[0].notnull())]
Edit:
I realize I can do this
df.iloc[0, np.where(df.iloc[0].notnull())[0].tolist()]
And this returns the expected result. However, is this the most efficient approach?
Here's a way using np.isnan:
# set the row number
row_number = 0
# keep only the columns that are not NaN in that row
df.loc[row_number, ~np.isnan(df.values)[row_number]]
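An alternative sketch that skips numpy entirely, and also works for non-numeric frames (np.isnan raises on object columns), is to use the row's notnull() mask directly as a column selector:

import pandas as pd

d = {'col1': [1, 2, None], 'col2': [None, 4, 5], 'col3': [3, None, None]}
df = pd.DataFrame(data=d)

i = 0
# Boolean mask over the columns of row i, passed straight to .loc.
print(df.loc[i, df.iloc[i].notnull()])
# col1    1.0
# col3    3.0
# Name: 0, dtype: float64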
I tried to create DataFrames from a JSON file.
I have a list named "Series_participants" containing a part of this JSON file. My list looks like this when I print it:
participantId 1
championId 76
stats {'item0': 3265, 'item2': 3143, 'totalUnitsHeal...
teamId 100
timeline {'participantId': 1, 'csDiffPerMinDeltas': {'1...
spell1Id 4
spell2Id 12
highestAchievedSeasonTier SILVER
dtype: object
<class 'list'>
Then I try to convert this list to a DataFrame like this:
pd.DataFrame(Series_participants)
But pandas uses the values of "stats" and "timeline" as the index for the DataFrame. I expected an automatic index range (0, ..., n).
EDIT 1:
participantId championId stats teamId timeline spell1Id spell2Id highestAchievedSeasonTier
0 1 76 3265 100 NaN 4 12 SILVER
I want to have a dataframe with "stats" & "timeline" columns containing dicts of their values, as in the Series display.
What is my error?
EDIT 2:
I have tried to create the DataFrame manually, but pandas didn't take my choices into consideration and ended up using the keys of "stats" from the Series as the index.
Here is my code:
for j in range(0, len(df.participants[0])):
    for i in range(0, len(df.participants[0][0])):
        Series_participants = pd.Series(df.participants[0][i])
        test = {'participantId': Series_participants.values[0],
                'championId': Series_participants.values[1],
                'stats': Series_participants.values[2],
                'teamId': Series_participants.values[3],
                'timeline': Series_participants.values[4],
                'spell1Id': Series_participants.values[5],
                'spell2Id': Series_participants.values[6],
                'highestAchievedSeasonTier': Series_participants.values[7]}
        if j == 0:
            df_participants = pd.DataFrame(test)
        else:
            df_participants.append(test, ignore_index=True)
The double loop is to parse all "participant" of my JSON file.
LAST EDIT:
I achieved what I wanted with the following code:
for i in range(0, len(df.participants[0])):
    Series_participants = pd.Series(df.participants[0][i])
    df_test = pd.DataFrame(data=[Series_participants.values],
                           columns=['participantId', 'championId', 'stats',
                                    'teamId', 'timeline', 'spell1Id',
                                    'spell2Id', 'highestAchievedSeasonTier'])
    if i == 0:
        df_participants = pd.DataFrame(df_test)
    else:
        df_participants = df_participants.append(df_test, ignore_index=True)

print(df_participants)
Thanks to all for your help!
For efficiency, you should try and manipulate your data as you construct your dataframe rather than as a separate step.
However, to split apart your dictionary keys and values you can use a combination of numpy.repeat and itertools.chain. Here's a minimal example:
import numpy as np
import pandas as pd
from itertools import chain

df = pd.DataFrame({'A': [1, 2],
                   'B': [{'key1': 'val0', 'key2': 'val9'},
                         {'key1': 'val1', 'key2': 'val2'}],
                   'C': [{'key3': 'val10', 'key4': 'val8'},
                         {'key3': 'val3', 'key4': 'val4'}]})

chainer = chain.from_iterable
lens = df['B'].map(len)

res = pd.DataFrame({'A': np.repeat(df['A'], lens),
                    'B': list(chainer(df['B'].map(lambda x: x.values())))})
res.index = chainer(df['B'].map(lambda x: x.keys()))

print(res)
      A     B
key1  1  val0
key2  1  val9
key1  2  val1
key2  2  val2
If you try to input lists, series or arrays containing dicts into the object constructor, it doesn't recognise what you're trying to do. One way around this is manually setting:
df.at['a', 'b'] = {'x':value}
Note, the above will only work if the columns and indexes are already created in your DataFrame.
Updated per comments: Pandas data frames can hold dictionaries, but it is not recommended.
Pandas is interpreting that you want one index entry for each of your dictionary keys and is then broadcasting the single-item columns across them.
So to help with what you are trying to do, I would recommend reading in your dictionary's items as columns, which is what data frames are typically used for and are very good at.
Example of the error, caused by pandas trying to read in the dictionary as key-value pairs:
df = pd.DataFrame(columns= ['a', 'b'], index=['a', 'b'])
df.loc['a','a'] = {'apple': 2}
returns
ValueError: Incompatible indexer with Series
Per jpp in the comments below (When using the constructor method):
"They can hold arbitrary types, e.g.
df.iat[0, 0] = {'apple': 2}
However, it's not recommended to use Pandas in this way."
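For what it's worth, if the starting point is a list of dicts (like the participants list in the question), passing it straight to the constructor already leaves the nested dicts as plain objects in their cells. A minimal sketch with made-up keys:

import pandas as pd

participants = [
    {'participantId': 1, 'championId': 76,
     'stats': {'item0': 3265, 'item2': 3143},
     'timeline': {'participantId': 1}},
    {'participantId': 2, 'championId': 12,
     'stats': {'item0': 1001},
     'timeline': {'participantId': 2}},
]

# Each dict becomes one row; the nested 'stats'/'timeline' dicts stay as
# objects in their cells instead of being exploded into the index.
df_participants = pd.DataFrame(participants)
print(df_participants)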
I've got a CSV file that is generated in a format I cannot change. The file has a multi-index: headers on two lines. The first line (the higher level of the index) has blanks where the value doesn't change.
In the header, the group name appears only above the first column of each group and the rest are blank; what I actually want is for every column to carry both the group name and the column name (e.g. Foo/A, Foo/B, ..., Bar/C).
I would like to be able to process it correctly in Python 2.7 with Pandas.
I resorted to looping over the first level of the index and, if the value is blank, setting it to the same value as the one on its left.
I start by loading the dataframe in pandas:
df = pd.read_csv(myFile, header=[0,1], sep=',')
df
I've tried the following:
l = []
for i, val in enumerate(df.columns.values):
    if val[0][:7] == 'Unnamed':
        l.append([l[i-1][0], val[1]])
    else:
        l.append(val)
The list "l" I'm getting appears to be what I want:
[('Foo', 'A'),
['Foo', 'B'],
['Foo', 'C'],
('Bar', 'A'),
['Bar', 'B'],
['Bar', 'C']]
I've tried both:
df.columns = l
This produces a dataframe whose columns are not a MultiIndex.
index = pd.MultiIndex.from_tuples(l)
df.reindex(columns = index)
This one gives me the correct index, but the values disappear (everything becomes NaN).
I'm getting a strong gut feeling that the entire approach I'm trying isn't very pythonic, nor does it make sense to build a list first and then convert it. Any idea how I can set up the MultiIndex properly?
Instead of using reindex, set the columns to your new index directly:
df.columns = pd.MultiIndex.from_tuples(l)
That should produce the desired result.
reindex doesn't just replace the index values (though that sounds like what it should do, and the documentation isn't especially clear). Instead it goes through your new indices, picks the rows or columns that match the new indices, and puts NaN where no old index matches a new index. That's what's happening to you: when reindex hits ['Foo', 'B'], which doesn't exist in your original dataframe, it fills the column in the new dataframe with NaN.
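A tiny sketch of that behaviour, with throwaway column names:

import pandas as pd

df = pd.DataFrame({'a': [1, 2]})
# 'b' doesn't exist in the original frame, so reindex doesn't rename 'a';
# it creates a brand-new, all-NaN column for 'b'.
print(df.reindex(columns=['b']))
#     b
# 0 NaN
# 1 NaN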
If your columns are always going to follow a consistent pattern (one top-level column for every three second-level columns, for example), you could also use MultiIndex.from_product to make the column index:
iterables = [["Foo", "Bar"], ["A", "B", "C"]]
index = pd.MultiIndex.from_product(iterables)
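Assigning it then works the same way as with from_tuples; here is a self-contained sketch with a toy frame standing in for the CSV (the column order just has to match the product order):

import pandas as pd

# Toy frame standing in for the CSV: six data columns under two top-level groups.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]])

iterables = [["Foo", "Bar"], ["A", "B", "C"]]
df.columns = pd.MultiIndex.from_product(iterables)
print(df)
#   Foo       Bar
#     A  B  C   A  B  C
# 0   1  2  3   4  5  6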
I would like to know how to add a new row efficiently to the dataframe.
Assuming I have an empty dataframe with columns "A" and "B":
columns = ['A', 'B']
user_list = pd.DataFrame(columns=columns)
I want to add one row, e.g. {A: 3, B: 4}, to the dataframe. How do I do that in the most efficient way?
import numpy as np
import pandas as pd

columns = ['A', 'B']
# Preallocate a block of NaN rows up front, then fill rows in place.
user_list = pd.DataFrame(np.zeros((1000, 2)) + np.nan, columns=columns)
user_list.iloc[0] = [3, 4]
user_list.iloc[1] = [4, 5]
Pandas doesn't have built-in resizing, but it will ignore NaNs pretty well. You'll have to manage your own resizing, though :/
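If the final number of rows isn't known up front, another common pattern is to accumulate plain Python rows and build the frame once at the end, so the construction cost is paid exactly once; a sketch:

import pandas as pd

columns = ['A', 'B']
rows = []

# Collect rows as plain dicts (or lists) in a Python list...
rows.append({'A': 3, 'B': 4})
rows.append({'A': 4, 'B': 5})

# ...and build the DataFrame a single time at the end.
user_list = pd.DataFrame(rows, columns=columns)
print(user_list)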