Converting all pandas column: row to key:value pair json - python

I am trying to add a new column at the end of my pandas dataframe that will contain the values of the previous cells as key:value pairs. I have tried the following:
import json
df["json_formatted"] = df.apply(
    lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
It creates the json_formatted column successfully with all the required data, but the problem is that it also adds json_formatted itself as another key. I don't want that; I want the JSON data to contain only the information from the original df columns. How can I do that?
Note: I set ensure_ascii=False because the column names are in Japanese characters.

Create a new variable holding the created column and add it afterwards:
json_formatted = df.apply(lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1)
df['json_formatted'] = json_formatted

This behaviour shouldn't happen on a fresh dataframe, but it can occur if you have run this more than once: the first run added the json_formatted column, and the second run's row.to_dict() then picked it up.
You can avoid this by making your columns explicit, e.g. df[['col1', 'col2']].apply(...), as sketched below.
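A minimal sketch of that fix with hypothetical data (Japanese column names, matching the note in the question):
import json
import pandas as pd

df = pd.DataFrame({"名前": ["りんご", "みかん"], "値段": ["120", "80"]})  # hypothetical data
source_cols = list(df.columns)  # capture the original columns before adding the new one
df["json_formatted"] = df[source_cols].apply(
    lambda row: json.dumps(row.to_dict(), ensure_ascii=False), axis=1
)
Because source_cols is fixed before the assignment, running the apply a second time still serializes only the original columns.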
Apply is an expensive operation in pandas, and if performance matters it is better to avoid it. An alternative way to do this is
df["json_formatted"] = [json.dumps(s, ensure_ascii=False) for s in df.T.to_dict().values()]


Using describe() method to exclude a column

I am new to using python with data sets and am trying to exclude a column ("id") from being shown in the output. Wondering how to go about this using the describe() and exclude functions.
describe works on the dtypes. You can include or exclude based on the data type, not based on column names. If your column id has a dtype that no other column shares, then
df.describe(exclude=[datatype])
or if you just want to remove the column(s) in describe, then try this
cols = set(df.columns) - {'id'}
df1 = df[list(cols)]
df1.describe()
TaDa, it's done. For more info, see the pandas documentation for describe.
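For instance, a minimal sketch assuming id is the only integer column (hypothetical data):
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "height": [1.7, 1.6, 1.8]})  # hypothetical data
df.describe(exclude=["int64"])  # 'id' is dropped purely because of its dtype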
You can do that by slicing your original DF to remove the 'id' column. One way is through .iloc. Let's suppose the column 'id' is the first column of your DF; then you could do this:
df.iloc[:,1:].describe()
The first colon represents the rows, the second the columns.
Although somebody responded with an example from the official docs which is more than enough, I'd just like to add this, since it might help a few people:
If your DataFrame is large (say hundreds of columns), removing one or two columns might not be enough; instead, create a smaller DataFrame holding only what you're interested in and go from there.
Example of removing two or more columns:
columns_to_keep = set(your_bigger_data_frame.columns) - {'column_1', 'column_2', 'column_3'}
your_new_smaller_data_frame = your_bigger_data_frame[list(columns_to_keep)]
your_new_smaller_data_frame.describe()
If your DataFrame is medium/small, you already know every column, and you only need a few of them, just create a new DataFrame and then apply describe().
Here is an example that reads a .csv file and then keeps a smaller portion of that DataFrame holding only what you need:
df = pd.read_csv(r'.\docs\project\file.csv')
df = df[['column_1', 'column_2', 'column_3']]
df.describe()
Note that describe(exclude=...) filters by dtype, not by column name, so output.describe(exclude=['id']) will not do what you want. Dropping the column first works: output.drop(columns=['id']).describe()

how to create the subcolumns in the dataframe and add the data using pandas in python

I am facing a problem adding data to a subcolumn in a specific format. I have created "Polypoints" as the main column, and I want
df["Polypoints"] = [{"__type":"Polygon","coordinates":Row_list}]
where Row_list is the column of dataframe which contains the data in the below format
df["Row_list"] = [[x1,y1],[x2,y2],[x3,y3]
[x1,y1],[x2,y2],[x3,y3]
[x1,y1],[x2,y2],[x3,y3]]
I want to convert the dataframe into json in the format
"Polypoints" :{"__type":"Polygon" ,"coordinates":Row_list}
There are various ways to do that.
One can create a function create_polygon that takes as input a row of the dataframe and the column name (columnname). That would look like the following
def create_polygon(row, columnname):
    return {"__type": "Polygon", "coordinates": row[columnname]}
(the extra argument can be passed as df.apply(create_polygon, axis=1, columnname='Row_list'))
Considering that the column name will be Row_list, the following is already enough
def create_polygon(row):
    return {"__type": "Polygon", "coordinates": row['Row_list']}
Then with pandas.DataFrame.apply (row-wise, axis=1) one can build the column Polypoints as follows
df['Polypoints'] = df.apply(create_polygon, axis=1)
As Itamar Mushkin mentions, one can also do it with a Lambda function as follows
df['Polypoints'] = df.apply(lambda row: {"__type":"Polygon", "coordinates":row['Row_list']} ,axis=1)
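Putting it together, here is a minimal end-to-end sketch with made-up coordinates showing the dataframe serialised into the requested JSON shape:
import pandas as pd

df = pd.DataFrame({"Row_list": [[[1, 2], [3, 4], [5, 6]],
                                [[7, 8], [9, 10], [11, 12]]]})  # hypothetical data
df["Polypoints"] = df.apply(
    lambda row: {"__type": "Polygon", "coordinates": row["Row_list"]}, axis=1
)
print(df["Polypoints"].to_json(orient="records"))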

adding row from one dataframe to another

I am trying to insert or add from one dataframe to another dataframe. I am going through the original dataframe looking for certain words in one column. When I find one of these terms I want to add that row to a new dataframe.
I get the row by using:
entry = df.loc[df['A'] == item]
But when trying to add this row to another dataframe using .add, .insert, .update or other methods, I just get an empty dataframe.
I have also tried adding the column to a dictionary and turning that into a dataframe, but it writes data for the entire row rather than just the column value. So is there a way to add one specific row to a new dataframe from my existing variable?
So the entry is a dataframe containing the rows you want to add?
You can simply concatenate the two dataframes using the concat function, if both have the same column names:
import pandas as pd
entry = df.loc[df['A'] == item]
concat_df = pd.concat([new_df,entry])
pandas.concat reference:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
DataFrame.append accepts another DataFrame (or a list of them), with each row carrying a value for every column. So, assuming you're trying to add one matched row, you should use:
entry = df.loc[df['A'] == item]
df2 = df2.append([entry])
Notice that unlike Python's list.append, DataFrame.append returns a new object rather than modifying the object it is called on. (Also note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pandas.concat is the replacement.)
Not sure how large your operations will be, but from an efficiency standpoint, you're better off adding all of the found rows to a list, and then concatenating them together at once using pandas.concat, and then using concat again to combine the found entries dataframe with the "insert into" dataframe. This will be much faster than using concat each time. If you're searching from a list of items search_keys, then something like:
entries = []
for i in search_keys:
    entry = df.loc[df['A'] == i]
    entries.append(entry)
found_df = pd.concat(entries)
result_df = pd.concat([old_df, found_df])
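As a side note (an alternative, not what the answer above uses): if the match really is just "A equals one of search_keys", pandas.Series.isin can do the whole filter in one shot:
import pandas as pd

# df, old_df and search_keys as in the snippet above
found_df = df[df["A"].isin(search_keys)]
result_df = pd.concat([old_df, found_df], ignore_index=True)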

Splitting Pandas Dataframe with groupby and last

I am working with a pandas dataframe where I want to group by one column, grab the last row of each group (creating a new dataframe), and then drop those rows from the original.
I've done a lot of reading and testing, and it seems that I can't do that as easily as I'd hoped. I can do a kludgy solution, but it seems inefficient and, well, kludgy.
Here's pseudocode for what I wanted to do:
df = pd.DataFrame
last_lines = df.groupby('id').last()
df.drop(last_lines.index)
Creating the last_lines dataframe is fine; it's dropping those rows from the original df that's the issue. The problem is that the original index (from df) is disconnected when last_lines is created. I looked at filter and transform, but neither seems to address this problem. Is there a good way to split the dataframe into two pieces based on position?
My kludge solution is to iterate over the group iterator, build a list of indexes, then drop those.
grouped = df.groupby('id')
idx_to_remove = []
for _, group in grouped:
    idx_to_remove.append(group.tail(1).index[0])
df.drop(idx_to_remove)
Better suggestions?
If you use .reset_index() first, you'll get the old index as a column, and you can take .last() of that column to get the indices you want:
last_lines = df.reset_index().groupby('A')['index'].last()
df.drop(last_lines)
Here the column is called 'index' because that is the default name reset_index gives it. If your index has a name, use that instead.
You can also "manually" grab the last index by using .apply():
last_lines = df.groupby('A').apply(lambda g: g.index[-1])
You'll probably have to do it this way if you're using a MultiIndex (since in that case using .reset_index() would add multiple columns that can't easily be combined back into indices to drop).
Try:
df.groupby('A').apply(lambda x: x.iloc[:-1, :])
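For completeness, a compact sketch of the whole split with made-up data: groupby().tail(1) keeps the original index, so the subsequent drop lines up exactly:
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 2, 2, 2], "val": [10, 11, 20, 21, 22]})  # hypothetical data
last_lines = df.groupby("id").tail(1)  # last row of each group, original index intact
rest = df.drop(last_lines.index)       # everything else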

Change values in Pandas Dataframe [Python]

I'm trying to remove the percent sign after a value in a pandas dataframe, relevant code:
for i in loansdata:
    if i.endswith('%'):
        i = i[:-1]
I was thinking that i = i[:-1] would set the new value, but it doesn't. How do I go about it? For clarity: if I print i inside the for loop, it prints without the percent sign. But if I print the whole dataframe, it has not changed.
Use str.replace to remove a specific character from a column:
df[col] = df[col].str.replace('%','')
What you're doing, depending on what loansdata actually is, is looping over either the column names or the row values of a column.
You can't modify the row contents like that, and even if you could, you should avoid loops where a vectorised solution exists.
If '%' appears in multiple columns, you could call the above for each column, but note that this .str method only exists for string (object) dtype columns.
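For example, a small sketch that strips '%' from every string column and converts the result to floats (the data is made up):
import pandas as pd

df = pd.DataFrame({"rate": ["5%", "7%"], "fee": ["1%", "2%"]})  # hypothetical data
for col in df.select_dtypes(include="object"):
    df[col] = df[col].str.rstrip("%").astype(float)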
