Why does the content of a dataframe affect setting? - python

The outcome of this case:
df = _pd.DataFrame({'a':['1','2','3']})
df['b'] = _np.nan
for index in df.index:
    df.loc[index, 'b'] = [{'a':1}]
is:
a b
0 1 {'a': 1}
1 2 [{'a': 1}]
2 3 [{'a': 1}]
The outcome of this case:
df = _pd.DataFrame({'a':[1,2,3]})
df['b'] = _np.nan
for index in df.index:
    df.loc[index, 'b'] = [{'a':1}]
is:
a b
0 1 {'a': 1}
1 2 {'a': 1}
2 3 {'a': 1}
Why?
_pd.__version__
'0.23.4'
Edit: I added the version number because this might be a bug.

I think this is caused by a dtype conversion. Column b starts out as float (it holds only NaN), so when you assign an object to it, the first assignment has to convert the whole column from float to object. Once the column is object, the assignments at index 1 and 2 behave as expected, because the column already has the right dtype. If you cast the column to object before the loop, every row is assigned the same way:
df = pd.DataFrame({'a':['1','2','3']})
df['b'] = np.nan
df['b']=df['b'].astype(object)
for index in df.index:
    df.loc[index, 'b'] = [{'a':1}]
    print(df.loc[index, 'b'], index)
[{'a': 1}] 0
[{'a': 1}] 1
[{'a': 1}] 2
df
a b
0 1 [{'a': 1}]
1 2 [{'a': 1}]
2 3 [{'a': 1}]
Also, I think this may be related to https://github.com/pandas-dev/pandas/issues/11617
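As a minimal sketch of the idea above (not necessarily what the answerer had in mind): cast the column to object first, then write each list into a single cell. Note that `.at` is swapped in here for `.loc` as an assumption on my part, since `.at` addresses exactly one cell and so should not try to unpack the list.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
df['b'] = np.nan

# Cast once, up front, so no assignment has to change the column dtype.
df['b'] = df['b'].astype(object)

for index in df.index:
    # .at targets one scalar cell; the list is stored as-is in the object column.
    df.at[index, 'b'] = [{'a': 1}]

print(df)
```

With the dtype fixed before the loop, the first row should no longer be treated differently from the others.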

Related

How to create a dataframe from a nested dictionary using pandas?

I have the following nested dictionary:
dict1 = {'a': 1,'b': 2,'remaining': {'c': 3,'d': 4}}
I want to create a dataframe using pandas in order to achieve the following
df = pd.DataFrame(columns=list('abcd'))
df.loc[0] = [1,2,3,4]
You could pop the 'remaining' dict to update dict1, then convert the values to vectors (like lists).
nested = dict1.pop('remaining')
dict1.update(nested)
pd.DataFrame({k: [v] for k, v in dict1.items()})
a b c d
0 1 2 3 4
You can use pandas.json_normalize:
dict1 = {'a': 1,'b': 2,'remaining': {'c': 3,'d': 4}}
df = pd.json_normalize(dict1)
df.columns = list('abcd')
Result:
a b c d
0 1 2 3 4
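For reference, here is a small sketch of what `pd.json_normalize` produces before the columns are renamed. The nested keys come back with a dotted prefix, so instead of hard-coding `list('abcd')` you could strip the prefix. This assumes the leaf names do not collide once the prefix is removed.

```python
import pandas as pd

dict1 = {'a': 1, 'b': 2, 'remaining': {'c': 3, 'd': 4}}

df = pd.json_normalize(dict1)
print(list(df.columns))  # ['a', 'b', 'remaining.c', 'remaining.d']

# Keep only the last path component of each flattened column name.
df.columns = [col.split('.')[-1] for col in df.columns]
print(df)
```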

Creating a dataframe in a for loop based on another dataframe

I have a data frame, df, and I'd like to get all the columns in it and the count of unique values in it and save it as another data frame. I can't seem to find a way to do that. I can, however, print what I want on the console. Here's what I mean:
def counting_unique_values_in_df(df):
    for evry_colm in df:
        print(evry_colm, "-", df[evry_colm].value_counts().count())
Now that prints what I want just fine. Instead of printing, if I do something like newdf = pd.DataFrame(evry_colm, df[evry_colm].value_counts().count(), columns = ('a', 'b')), it throws an error that reads "TypeError: object of type 'numpy.int32' has no len()". Obviously, that isn't right.
So, how can I make a data frame with columns like columnName and UniqueCounts?
To count unique values per column, you can use apply with the nunique function on the data frame.
Something like:
import pandas as pd
df = pd.DataFrame([
{'a': 1, 'b': 2},
{'a': 2, 'b': 2}
])
count_series = df.apply(lambda col: col.nunique())
# returned object is pandas Series
# a 2
# b 1
# to map it to DataFrame try
pd.DataFrame(count_series).T
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
print(df)
print()
df = pd.DataFrame({col: [df[col].nunique()] for col in df})
print(df)
Output:
A B
0 1 1
1 1 2
2 2 3
3 2 4
A B
0 2 4
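If you want the long layout from the question, with one row per column, something like the following sketch may be closer. The names `columnName` and `UniqueCounts` are taken from the question; `DataFrame.nunique` is used directly instead of `apply`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})

# nunique() returns a Series indexed by column name; name the index,
# then turn it into a regular column alongside the counts.
newdf = (
    df.nunique()
      .rename_axis('columnName')
      .reset_index(name='UniqueCounts')
)
print(newdf)
```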

convert a pandas dataframe to dictionary

I have a pandas dataframe as below:
df=pd.DataFrame({'a':['red','yellow','blue'], 'b':[0,0,1], 'c':[0,1,0], 'd':[1,0,0]})
df
which looks like
a b c d
0 red 0 0 1
1 yellow 0 1 0
2 blue 1 0 0
I want to convert it to a dictionary so that I get:
red d
yellow c
blue b
The dataset is quite large, so please avoid any iterative method. I haven't figured out a solution yet. Any help is appreciated.
First of all, if you really want to convert this to a dictionary, it's a little nicer to convert the value you want as a key into the index of the DataFrame:
df.set_index('a', inplace=True)
This looks like:
b c d
a
red 0 0 1
yellow 0 1 0
blue 1 0 0
Your data appears to be in "one-hot" encoding. You first have to reverse that, using the method detailed here:
series = df.idxmax(axis=1)
This looks like:
a
red d
yellow c
blue b
dtype: object
Almost there! Now use to_dict on the resulting series (this is where setting column a as the index helps out):
series.to_dict()
This looks like:
{'blue': 'b', 'red': 'd', 'yellow': 'c'}
Which I think is what you are looking for. As a one-liner:
df.set_index('a').idxmax(axis=1).to_dict()
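One caveat with `idxmax`: for a row that contains no 1 at all (or more than one), it still returns the first column holding the maximum, without complaint. A hedged sketch that checks the one-hot assumption before converting could look like this (the check is my addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({'a': ['red', 'yellow', 'blue'],
                   'b': [0, 0, 1], 'c': [0, 1, 0], 'd': [1, 0, 0]})

onehot = df.set_index('a')

# Verify that every row has exactly one 1; otherwise idxmax would
# silently pick the first maximal column.
if not (onehot.sum(axis=1) == 1).all():
    raise ValueError("rows are not strictly one-hot encoded")

mapping = onehot.idxmax(axis=1).to_dict()
print(mapping)
```

Both the check and `idxmax` are vectorized, so this stays free of Python-level row loops.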
You can try this.
df = df.set_index('a')
df.where(df > 0).stack().reset_index().drop(0, axis=1)
a level_1
0 red d
1 yellow c
2 blue b
You need dot and zip here
dict(zip(df.a,df.iloc[:,1:].dot(df.iloc[:,1:].columns)))
Out[508]: {'blue': 'b', 'red': 'd', 'yellow': 'c'}
Hope this works:
import pandas as pd
df=pd.DataFrame({'a':['red','yellow','blue'], 'b':[0,0,1], 'c':[0,1,0], 'd':[1,0,0]})
df['e'] = df.iloc[:,1:].idxmax(axis = 1)
newdf = df[["a","e"]]
print (newdf.to_dict(orient='index'))
Output:
{0: {'a': 'red', 'e': 'd'}, 1: {'a': 'yellow', 'e': 'c'}, 2: {'a': 'blue', 'e': 'b'}}
You can convert your dataframe to dict using pandas to_dict with list as argument. Then iterate over this resulting dict and fetch column label whose value is 1.
>>> {k:df.columns[1:][v.index(1)] for k,v in df.set_index('a').T.to_dict('list').items()}
{'yellow': 'c', 'blue': 'b', 'red': 'd'}
Set column a as the index, then for each row of df find the column label whose value is one, and convert the resulting series to a dictionary using to_dict.
Here is the code:
df.set_index('a').apply(lambda row:row[row==1].index[0],axis=1).to_dict()
Alternatively, set the index to a, then use idxmax to find the label of the max value in each row, and use to_dict to convert to a dictionary (in older pandas versions argmax also returned the label, but in current versions it returns an integer position):
df.set_index('a').apply(lambda row:row.idxmax(),axis=1).to_dict()
In both cases, the result will be
{'blue': 'b', 'red': 'd', 'yellow': 'c'}
P.S. I used apply to iterate through the rows of df by setting axis=1.

Pandas: convert each row to a <column name,row value> dict and add as a new column

I have a df such that
STATUS_ID STATUS_NM
0 1 A
1 2 B
2 3 C
3 4 D
I want to apply a row-wise operation to get a key-value pair for each row in a separate column. The final df should be
STATUS
0 {STATUS_ID:1,STATUS_NM:A}
1 {STATUS_ID:2,STATUS_NM:B}
2 {STATUS_ID:3,STATUS_NM:C}
3 {STATUS_ID:4,STATUS_NM:D}
UPDATE:
I have tried df[cols].apply(pd.Series.to_dict, axis=1) and df[cols].apply(lambda x: x.to_dict(), axis=1) but instead of getting the actual dict, I get
<built-in method values of dict object at 0x00...
I believe it's my version of pandas that is causing the issue. This has been discussed here - https://github.com/pandas-dev/pandas/issues/8735
So the question is if there's another way to perform the same operation circumventing this issue. I cannot update my Pandas version to 0.17
df['STATUS'] = df.apply(pd.Series.to_dict, axis=1)
df
Out:
STATUS_ID STATUS_NM STATUS
0 1 A {'STATUS_NM': 'A', 'STATUS_ID': 1}
1 2 B {'STATUS_NM': 'B', 'STATUS_ID': 2}
2 3 C {'STATUS_NM': 'C', 'STATUS_ID': 3}
3 4 D {'STATUS_NM': 'D', 'STATUS_ID': 4}
If in your real DataFrame you have other columns too, you may need to specify the columns you want to have in the dictionary.
cols = ['STATUS_ID', 'STATUS_NM']
df['STATUS'] = df[cols].apply(pd.Series.to_dict, axis=1)
An alternative would be iterating over the DataFrame:
lst = []
for _, row in df[cols].iterrows():
    lst.append({col: row[col] for col in cols})
This creates a list:
[{'STATUS_ID': 1, 'STATUS_NM': 'A'},
{'STATUS_ID': 2, 'STATUS_NM': 'B'},
{'STATUS_ID': 3, 'STATUS_NM': 'C'},
{'STATUS_ID': 4, 'STATUS_NM': 'D'}]
You can directly assign this to your DataFrame:
df['STATUS'] = lst
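As another option that avoids `apply` altogether, `DataFrame.to_dict` with `orient='records'` builds the same list of per-row dicts in one call. This is only a sketch; I have not checked it against the old pandas version mentioned in the question.

```python
import pandas as pd

df = pd.DataFrame({'STATUS_ID': [1, 2, 3, 4],
                   'STATUS_NM': ['A', 'B', 'C', 'D']})

cols = ['STATUS_ID', 'STATUS_NM']

# One dict per row, keyed by column name, in row order.
records = df[cols].to_dict(orient='records')

# Assigning a list of the same length stores one dict per cell.
df['STATUS'] = records
print(df)
```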

Pandas grouping and summing just a certain column

Below is a minimal example showing the problem that I am facing. Let our initial state be the following (I only use a dictionary for the purpose of demonstration):
A = [{'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 2}, {'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 4}, {'D': '16.5.2013', 'A':1, 'B': 0.5, 'C': 7}]
df = pd.DataFrame(A)
>>> df
A B C D
0 1 0.0 2 16.5.2013
1 1 0.0 4 16.5.2013
2 1 0.5 7 16.5.2013
How do I get from df to df_new which is:
A_new = [{'D': '16.5.2013', 'A':1, 'B': 0.0, 'C': 6}, {'D': '16.5.2013', 'A':1, 'B': 0.5, 'C': 7}]
df_new = pd.DataFrame(A_new)
>>> df_new
A B C D
0 1 0.0 6 16.5.2013
1 1 0.5 7 16.5.2013
The first and the second rows of the 'C' column are summed, because 'B' is the same for these two rows. The rest is left the same, for instance, column 'A' is not summed, column 'D' is unchanged. How do I do that assuming I only have df and I want to get df_new. I would really like to find some kind of elegant solution if possible.
Thanks in advance.
Assuming the other columns are always the same, and should not be treated specially.
First create the df_new grouped by B where I take for each column the first row in the group:
In [17]: df_new = df.groupby('B', as_index=False).first()
and then calculate specifically the C column as a sum for each group:
In [18]: df_new['C'] = df.groupby('B', as_index=False)['C'].sum()['C']
In [19]: df_new
Out[19]:
B A C D
0 0.0 1 6 16.5.2013
1 0.5 1 7 16.5.2013
If you have a limited number of columns, you can also do this in one step (but the above will be handier (less manual) if you have more columns) by specifying the desired function for each column:
In [20]: df_new = df.groupby('B', as_index=False).agg({'A':'first', 'C':'sum', 'D':'first'})
If A and D are always equal when grouping by B, then you can just group by A, B, and D, and sum C:
df.groupby(['A', 'B', 'D'], as_index = False).agg(sum)
Output:
A B D C
0 1 0.0 16.5.2013 6
1 1 0.5 16.5.2013 7
Alternatively:
You essentially want to aggregate the data grouped by column 'B'. To aggregate column C you will just use the built in sum function. For the other columns, you basically just want to select a sole value as you believe they are always the same within groups. To do that, just write a very simple function that aggregates those columns simply by taking the first value.
# will take first value of the grouped data
sole_value = lambda x : list(x)[0]
#dictionary that maps columns to aggregation functions
agg_funcs = {'A' : sole_value, 'C' : sum, 'D' : sole_value}
#group and aggregate
df.groupby('B', as_index = False).agg(agg_funcs)
Output:
B A C D
0 0.0 1 6 16.5.2013
1 0.5 1 7 16.5.2013
Of course you really need to be sure that you have values that are definitely equal in columns A, and D, otherwise you might be preserving the wrong data.
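If there are many columns, the aggregation dictionary can be built programmatically rather than typed out. The following is a rough sketch under the same assumption as above, namely that every column other than C is constant within each B group; the names `key`, `sum_cols` and `agg_map` are just illustrative.

```python
import pandas as pd

A = [{'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 2},
     {'D': '16.5.2013', 'A': 1, 'B': 0.0, 'C': 4},
     {'D': '16.5.2013', 'A': 1, 'B': 0.5, 'C': 7}]
df = pd.DataFrame(A)

key = 'B'          # grouping column
sum_cols = ['C']   # columns to sum; everything else keeps its first value

# Default every non-key column to 'first', then override the summed ones.
agg_map = {col: 'first' for col in df.columns if col != key}
agg_map.update({col: 'sum' for col in sum_cols})

df_new = df.groupby(key, as_index=False).agg(agg_map)
print(df_new)
```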
