I'm trying to add a column to my dataframe that contains the information from the other columns as a JSON object.
My dataframe looks like this:
col_1  col_2
1      1
2      2
I'm then trying to add the JSON column using the following:
for i, row in df.iterrows():
    i_val = row.to_json()
    df.at[i, 'raw_json'] = i_val
However, it results in a "cascaded" dataframe where the JSON appears twice:
col_1  col_2  raw_json
1      1      {"col_1":1,"col_2":1,"raw_json":{"col_1":1,"col_2":1}}
2      2      {"col_1":2,"col_2":2,"raw_json":{"col_1":2,"col_2":2}}
I'm expecting it to look like the following:
col_1  col_2  raw_json
1      1      {"col_1":1,"col_2":1}
2      2      {"col_1":2,"col_2":2}
Use df.to_json(orient='records'):
df['raw_json'] = df.to_json(orient='records')
col_1 col_2 raw_json
0 1 1 [{"col_1":1,"col_2":1},{"col_1":2,"col_2":2}]
1 2 2 [{"col_1":1,"col_2":1},{"col_1":2,"col_2":2}]
Using a list comprehension and iterrows (your expected output has a dict; if you want JSON, you can use to_json instead of to_dict and remove the [0]):
df["raw_json"] = [
    pd.DataFrame(data=[row], columns=df.columns).to_dict(orient="records")[0]
    for _, row in df.iterrows()
]
print(df)
Output:
col_1 col_2 raw_json
0 1 1 {'col_1': 1, 'col_2': 1}
1 2 2 {'col_1': 2, 'col_2': 2}
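Since DataFrame.to_dict(orient='records') already produces one dict per row, in row order, the list comprehension can also be collapsed into a single call; a sketch, assuming the dict-per-row output is what you want:

import pandas as pd

df = pd.DataFrame({'col_1': [1, 2], 'col_2': [1, 2]})

# to_dict('records') returns a list of one dict per row, aligned with df,
# so it can be assigned to a column directly.
df['raw_json'] = df.to_dict(orient='records')
print(df)
#    col_1  col_2                  raw_json
# 0      1      1  {'col_1': 1, 'col_2': 1}
# 1      2      2  {'col_1': 2, 'col_2': 2}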
I have a table like so:
id   col_1  col_2  col_3
101  1      17     12
102         17
103  4             2
How do I select only the records where col_1, col_2, and col_3 are all not blank?
Expected output:
id   col_1  col_2  col_3
101  1      17     12
This will select only those rows in the dataframe where all of ['col_1', 'col_2', 'col_3'] are non-empty:
df[df[['col_1', 'col_2', 'col_3']].ne('').all(axis=1)]
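A runnable demonstration, assuming the blanks in the table are empty strings (adjust the filter if they are NaN; the next answer handles both):

import pandas as pd

df = pd.DataFrame({'id': [101, 102, 103],
                   'col_1': [1, '', 4],
                   'col_2': [17, 17, ''],
                   'col_3': [12, '', 2]})

# Keep only the rows where every one of the three columns is non-empty.
out = df[df[['col_1', 'col_2', 'col_3']].ne('').all(axis=1)]
print(out)
#     id col_1 col_2 col_3
# 0  101     1    17    12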
Here is one way to do it: make a list of the blanks, nulls, etc., convert the columns into a True/False mask of those values, and take the row-wise sum. You need the rows where the sum is zero (this assumes import numpy as np):
df[df.isin([' ', '', np.nan]).astype(int).sum(axis=1).eq(0)]
id col_1 col_2 col_3
0 101 1 17 12
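As an aside, the astype(int) step is optional, since booleans sum directly; an equivalent and slightly leaner spelling under the same blank-value assumptions:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [101, 102, 103],
                   'col_1': [1, '', 4],
                   'col_2': [17, 17, ''],
                   'col_3': [12, '', 2]})

# A row survives only if none of its cells is a space, empty string, or NaN.
out = df[~df.isin([' ', '', np.nan]).any(axis=1)]
print(out)
#     id col_1 col_2 col_3
# 0  101     1    17    12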
This can be done using DataFrame.query():
df = df.query(' and '.join([f'{col}!=""' for col in ['col_1','col_2','col_3']]))
Alternatively you can do it this way:
df = df[lambda x: (x.drop(columns='id') != "").all(axis=1)]
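For reference, the joined expression that query() receives looks like this (a small sketch):

cols = ['col_1', 'col_2', 'col_3']
print(' and '.join([f'{col}!=""' for col in cols]))
# col_1!="" and col_2!="" and col_3!=""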
So I have a list:
my_list = [1,1,2,3,4,4]
I have a dataframe that looks like this:
col_1 col_2
a 1
b 1
c 2
d 3
e 3
f 4
g 4
h 4
I basically want a final dataframe like
col_1 col_2
a 1
b 1
c 2
d 3
f 4
g 4
Basically I can't use
my_df[my_df['col_2'].isin(my_list)]
since this will include all the matching rows. I want, for each item in the list, a matching row, keeping the same count of rows as in the list (so only two of the three rows with col_2 == 4).
Use GroupBy.cumcount to build an occurrence counter on both the original and a helper DataFrame made from the list, then filter by inner join with DataFrame.merge:
my_list = [1,1,2,3,4,4]
df1 = pd.DataFrame({'col_2':my_list})
df1['g'] = df1.groupby('col_2').cumcount()
my_df['g'] = my_df.groupby('col_2').cumcount()
df = my_df.merge(df1).drop('g', axis=1)
print (df)
col_1 col_2
0 a 1
1 b 1
2 c 2
3 d 3
4 f 4
5 g 4
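A self-contained run of the same idea (the my_df construction here is assumed from the question's table):

import pandas as pd

my_df = pd.DataFrame({'col_1': list('abcdefgh'),
                      'col_2': [1, 1, 2, 3, 3, 4, 4, 4]})
my_list = [1, 1, 2, 3, 4, 4]

df1 = pd.DataFrame({'col_2': my_list})
# cumcount numbers each duplicate: the first 1 gets 0, the second 1 gets 1, ...
df1['g'] = df1.groupby('col_2').cumcount()
my_df['g'] = my_df.groupby('col_2').cumcount()

# The inner join keeps a row only if its (value, occurrence) pair also
# appears in the list, so the surplus duplicates ('e', 'h') drop out.
df = my_df.merge(df1).drop('g', axis=1)
print(df)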
I have a dataframe similar to this:
data = {"col_1": [0, 1, 2],
"col_2": ["abc", "defg", "hi"]}
df = pd.DataFrame(data)
Visually:
col_1 col_2
0 0 abc
1 1 defg
2 2 hi
What I'd like to do is split each string in col_2 into its characters and append them as new columns to the dataframe.
Example iterative method:
def get_chars(string):
    chars = []
    for char in string:
        chars.append(char)
    return chars

char_df = pd.DataFrame()
for i in range(len(df)):
    char_arr = get_chars(df.loc[i, "col_2"])
    temp_df = pd.DataFrame(char_arr).T
    char_df = pd.concat([char_df, temp_df], ignore_index=True, axis=0)

df = pd.concat([df, char_df], ignore_index=True, axis=1)
Which results in the correct form:
0 1 2 3 4 5
0 0 abc a b c NaN
1 1 defg d e f g
2 2 hi h i NaN NaN
But I believe iterating through the dataframe like this is very inefficient, so I want to find a faster (ideally vectorised) solution.
In reality, I'm not really splitting up strings; the point of this question is to find a way to efficiently process one column and return many.
If you need performance, use the DataFrame constructor with the values converted to lists:
df = df.join(pd.DataFrame([list(x) for x in df['col_2']], index=df.index))
Or:
df = df.join(pd.DataFrame(df['col_2'].apply(list).tolist(), index=df.index))
print (df)
col_1 col_2 0 1 2 3
0 0 abc a b c None
1 1 defg d e f g
2 2 hi h i None None
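Since the real goal is "process one column, return many", the same constructor pattern generalises: compute a list of row-lists with any per-value function, then build all the new columns in a single DataFrame call. A sketch, where expand is a hypothetical stand-in for your own transformation:

import pandas as pd

df = pd.DataFrame({"col_1": [0, 1, 2],
                   "col_2": ["abc", "defg", "hi"]})

def expand(value):
    # Hypothetical per-value transformation; here it just splits a
    # string into characters, but it can return any list of outputs.
    return list(value)

# One constructor call builds all the new columns at once,
# avoiding the per-row concat of the iterative method.
df = df.join(pd.DataFrame([expand(v) for v in df["col_2"]], index=df.index))
print(df)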
I have a pandas df that contains 4 different columns. For every row there's a value that's of importance. I want to return the column name where that value appears. So for the df below, I want to return the column name wherever the value 2 is found.
d = {'A': [2, 0, 0, 2],
     'B': [0, 0, 2, 0],
     'C': [0, 2, 0, 0],
     'D': [0, 0, 0, 0]}
df = pd.DataFrame(data=d)
Output:
A B C D
0 2 0 0 0
1 0 0 2 0
2 0 2 0 0
3 2 0 0 0
So it would be A,C,B,A
I'm doing this via
m = (df == 2).idxmax(axis=1)[0]
and then changing the row, but this isn't very efficient.
I'm also hoping to produce the output as a pandas Series.
Use DataFrame.dot:
df.astype(bool).dot(df.columns).str.cat(sep=',')
Or,
','.join(df.astype(bool).dot(df.columns))
'A,C,B,A'
Or, as a list:
df.astype(bool).dot(df.columns).tolist()
['A', 'C', 'B', 'A']
...or a Series:
df.astype(bool).dot(df.columns)
0 A
1 C
2 B
3 A
dtype: object
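One caveat: astype(bool) flags any non-zero value, which works here only because 2 is the sole non-zero entry. If other values can occur, comparing against the target explicitly is safer; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'A': [2, 0, 0, 2],
                   'B': [0, 0, 2, 0],
                   'C': [0, 2, 0, 0],
                   'D': [0, 0, 0, 0]})

# eq(2) marks only the cells equal to 2, so stray non-zero values
# elsewhere cannot leak into the result.
s = df.eq(2).dot(df.columns)
print(s.tolist())   # ['A', 'C', 'B', 'A']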
>>> df
0 1
0 0 0
1 1 1
2 2 1
>>> df1
0 1 2
0 A B C
1 D E F
>>> crazy_magic()
>>> df
0 1 3
0 0 0 A #df1[0][0]
1 1 1 E #df1[1][1]
2 2 1 F #df1[2][1]
Is there a way to achieve this without a for loop?
import pandas as pd
df = pd.DataFrame([[0,0],[1,1],[2,1]])
df1 = pd.DataFrame([['A', 'B', 'C'],['D', 'E', 'F']])
df2 = df1.reset_index(drop=False)
# index 0 1 2
# 0 0 A B C
# 1 1 D E F
df3 = pd.melt(df2, id_vars=['index'])
# index variable value
# 0 0 0 A
# 1 1 0 D
# 2 0 1 B
# 3 1 1 E
# 4 0 2 C
# 5 1 2 F
result = pd.merge(df, df3, left_on=[0,1], right_on=['variable', 'index'])
result = result[[0, 1, 'value']]
print(result)
yields
0 1 value
0 0 0 A
1 1 1 E
2 2 1 F
My reasoning goes as follows:
We want to use two columns of df as coordinates. The word "coordinates" reminds me of pivot, since if you have two columns whose values represent "coordinates" and a third column representing values, and you want to convert that to a grid, then pivot is the tool to use.
But df does not have a third column of values. The values are in df1. In fact df1 looks like the result of a pivot operation. So instead of pivoting df, we want to unpivot df1.
pd.melt is the function to use when you want to unpivot.
So I tried melting df1. Comparison with other uses of pd.melt led me to conclude df1 needed the index as a column. That's the reason for defining df2. So we melt df2.
Once you get that far, visually comparing df3 to df leads you naturally to the use of pd.merge.
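For completeness, a NumPy-style alternative that skips the melt/merge round trip, using the same coordinate convention (a sketch, not part of the original answer):

import pandas as pd

df = pd.DataFrame([[0, 0], [1, 1], [2, 1]])
df1 = pd.DataFrame([['A', 'B', 'C'], ['D', 'E', 'F']])

# Fancy indexing: column 1 of df picks the row of df1,
# and column 0 picks the column.
df['value'] = df1.to_numpy()[df[1].to_numpy(), df[0].to_numpy()]
print(df)
#    0  1 value
# 0  0  0     A
# 1  1  1     E
# 2  2  1     F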