I have a dataframe that resembles the following:
Name    Amount
A       3,580,093,709.00
B       5,656,745,317.00
I am then applying some styling using CSS; however, when I do this the Amount values are rendered in scientific notation:
Name    Amount
A       3.58009e+09
B       5.65674e+09
How can I stop this from happening?
import pandas as pd

d = {'Name': ['A', 'B'], 'Amount': [3580093709.00, 5656745317.00]}
df = pd.DataFrame(data=d)
df = df.style
df
You are not showing how you are styling the columns, but to display the values as floats with two decimal places you should add the following format to your Styler, as described in the pandas styling documentation:
df = df.style.format(formatter={'Amount': "{:.2f}"})
Here is the link for more information:
https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html
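Putting it together with the data from the question (a minimal sketch; note that "{:,.2f}" is an extra touch that also restores the thousands separators from the original display):

import pandas as pd

d = {'Name': ['A', 'B'], 'Amount': [3580093709.00, 5656745317.00]}
df = pd.DataFrame(data=d)

# Keep the DataFrame and the Styler separate so df itself stays usable
styled = df.style.format(formatter={'Amount': "{:,.2f}"})
styled  # renders 3,580,093,709.00 and 5,656,745,317.00 in a notebook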
The dataframe below has columns of mixed types. The column of interest for expansion is "Info". Each row value in this column is a JSON object.
import pandas as pd

data = {'Code':['001', '002', '003', '004'],
'Info':['{"id":001,"x_cord":[1,1,1,1],"x_y_cord":[4.703978,-39.601876],"neutral":1,"code_h":"S38A46","group":null}','{"id":002,"x_cord":[2,1,3,1],"x_y_cord":[1.703978,-38.601876],"neutral":2,"code_h":"S17A46","group":"New"}','{"id":003,"x_cord":[1,1,4,1],"x_y_cord":[112.703978,-9.601876],"neutral":4,"code_h":"S12A46","group":"Old"}','{"id":004,"x_cord":[2,1,7,1],"x_y_cord":[6.703978,-56.601876],"neutral":1,"code_h":"S12A46","group":null}'],
'Region':['US','Pacific','Africa','Asia']}
df = pd.DataFrame(data)
I would like to have the headers expanded, i.e. have "Info.id", "Info.x_y_cord", "Info.neutral", etc. as individual columns with the corresponding values under them across the dataset. I've tried normalizing via pd.json_normalize(df["Info"]) but nothing seems to change. Do I need to convert the column to another type first? Can someone point me in the right direction?
The output should be something like this:
import numpy as np

data1 = {'Code':['001', '002', '003', '004'],
'Info.id':['001','002','003','004'],
'Info.x_cord':['[1,1,1,1]','[2,1,3,1]','[1,1,4,1]','[2,1,7,1]'],
'Info.x_y_cord':['[4.703978,-39.601876]','[1.703978,-38.601876]','[112.703978,-9.601876]','[6.703978,-56.601876]'],
'Info.neutral':[1,2,4,1],
'Info.code_h':['S38A46','S17A46','S12A46','S12A46'],
'Info.group':[np.NaN,"New","Old",np.NaN],
'Region':['US','Pacific','Africa','Asia']}
df_final = pd.DataFrame(data1)
First of all, your JSON strings are not valid because of the ID value: 001 is not a valid JSON number, so you'll need to pass the "id" value as a string instead. Here's one way to do that:
def id_as_string(match_obj):
    # Adds quotes around the ID value so it parses as a JSON string
    return f'"id":"{match_obj.group(1)}",'

df["Info"] = df["Info"].str.replace(r'"id":(\d*),', id_as_string, regex=True)
Once you've done that, you can use pd.json_normalize on your "Info" column after you've loaded the values from the JSON strings using json.loads:
import json
json_part_df = pd.json_normalize(df["Info"].map(json.loads))
After that, just rename the columns and use pd.concat to form the output dataframe:
# Rename columns
json_part_df.columns = [f"Info.{column}" for column in json_part_df.columns]
# Use pd.concat to create output
df = pd.concat([df[["Code", "Region"]], json_part_df], axis=1)
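As a quick sanity check, the columns of the result should then look roughly like this (a sketch; the Info.* order follows the key order in the JSON strings):

print(df.columns.tolist())
# ['Code', 'Region', 'Info.id', 'Info.x_cord', 'Info.x_y_cord',
#  'Info.neutral', 'Info.code_h', 'Info.group']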
I'm looking for a clean, fast way to expand a pandas dataframe column which contains a json object (essentially a dict of nested dicts), so that I get one column for each element of the json column in json-normalized form; however, this needs to retain all of the original dataframe columns as well. In some instances, this dict might have a common identifier I could use to merge with the original dataframe, but not always. For example:
import pandas as pd
import numpy as np
df = pd.DataFrame([
{
'col1': 'a',
'col2': {'col2.1': 'a1', 'col2.2': {'col2.2.1': 'a2.1', 'col2.2.2': 'a2.2'}},
'col3': '3a'
},
{
'col1': 'b',
'col2': np.nan,
'col3': '3b'
},
{
'col1': 'c',
'col2': {'col2.1': 'c1', 'col2.2': {'col2.2.1': np.nan, 'col2.2.2': 'c2.2'}},
'col3': '3c'
}
])
Here is a sample dataframe. As you can see, in each case col2 is either a dict (which has another nested dict inside of it, containing elements I would like to be able to access) or a null value. (For the nulls, I would want to be able to handle them at any level: the entire element in the row, or just specific elements of it.) In this case, they have no ID that could link up to the original dataframe. My end goal would be essentially to have this:
final = pd.DataFrame([
{
'col1': 'a',
'col2.1': 'a1',
'col2.2.col2.2.1': 'a2.1',
'col2.2.col2.2.2': 'a2.2',
'col3': '3a'
},
{
'col1': 'b',
'col2.1': np.nan,
'col2.2.col2.2.1': np.nan,
'col2.2.col2.2.2': np.nan,
'col3': '3b'
},
{
'col1': 'c',
'col2.1': 'c1',
'col2.2.col2.2.1': np.nan,
'col2.2.col2.2.2': 'c2.2',
'col3': '3c'
}
])
In my instance, the dict could have up to 50 nested key-value pairs, and I might only need to access a few of them. Additionally, I have about 50 - 100 other columns of data I need to preserve with these new columns (so an end goal of around 100 - 150). So I suppose there might be two methods I'd be looking for--getting a column for each value in the dict, or getting a column for a select few. The former option I haven't yet found a great workaround for; I've looked at some prior answers but found them to be rather confusing, and most threw errors. This seems especially difficult when there are dicts nested inside of the column. To attempt the second solution, I tried the following code:
def get_val_from_dict(row, col, label):
    if pd.isnull(row[col]):
        return np.nan
    norm = pd.json_normalize(row[col])
    try:
        return norm[label]
    except:
        return np.nan

needed_cols = ['col2.1', 'col2.2.col2.2.1', 'col2.2.col2.2.2']
for label in needed_cols:
    df[label] = df.apply(get_val_from_dict, args=('col2', label), axis=1)
This seemed to work for this example, and I'm perfectly happy with the output, but for my actual dataframe which had substantially more data, this seemed a bit slow--and, I would imagine, is not a great or scalable solution. Would anyone be able to offer an alternative to this sluggish approach to resolving the issue I'm having?
(Also, apologies for the massive amount of nesting in my naming here. If helpful, I am adding several images of the dataframes below: the original, then the target, and then the current output.)
Instead of using apply or pd.json_normalize on just the column that holds the dictionary, convert the whole dataframe to a dict of records and use pd.json_normalize on that, then pick the fields you wish to keep. This works because, while the dictionary column of any given row may be null, the entire row will not be.
example:
# note that this method also prefixes an extra `col2.`
# to the names of the de-nested columns, which is not
# present in the example output; the renaming below
# restores your desired column names.
import re
final_cols = ['col1', 'col2.col2.1', 'col2.col2.2.col2.2.1', 'col2.col2.2.col2.2.2', 'col3']
out = pd.json_normalize(df.to_dict(orient='records'))[final_cols]
out.rename(columns=lambda x: re.sub(r'^col2\.', '', x), inplace=True)
out
# out:
col1 col2.1 col2.2.col2.2.1 col2.2.col2.2.2 col3
0 a a1 a2.1 a2.2 3a
1 b NaN NaN NaN 3b
2 c c1 NaN c2.2 3c
but for my actual dataframe which had substantially more data, this was quite slow
Right now I have 1000 rows of data, each row has about 100 columns, and then the column I want to expand has about 50 nested key/value pairs in it. I would expect that the data could scale up to 100k rows with the same number of columns over the next year or so, and so I'm hoping to have a scalable process ready to go at that point
pd.json_normalize should be faster than your attempt, but it is not faster than doing the flattening in pure Python, so you might get better performance by writing a custom transform function and constructing the dataframe as below:
out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))
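A minimal sketch of what such a transform function could look like for the example above (the function name, the hard-coded field selection, and the isinstance check for null rows are illustrative assumptions, not part of the original answer):

import numpy as np

def transform(record):
    # Flatten the nested 'col2' dict in plain Python; missing levels become NaN
    col2 = record['col2'] if isinstance(record['col2'], dict) else {}
    col22 = col2.get('col2.2', {})
    return {
        'col1': record['col1'],
        'col2.1': col2.get('col2.1', np.nan),
        'col2.2.col2.2.1': col22.get('col2.2.1', np.nan),
        'col2.2.col2.2.2': col22.get('col2.2.2', np.nan),
        'col3': record['col3'],
    }

out = pd.DataFrame(transform(x) for x in df.to_dict(orient='records'))

Because the flattening is a single pass over plain dicts, it avoids the per-row overhead of json_normalize and scales more gracefully to the 100k-row case mentioned above.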
I've been running some groupings on a dataframe that I have and saving the results in variables. However, I just noticed that the variables are actually being saved as series rather than dataframes.
I've seen tutorials/docs on how to convert a series to a dataframe, but all of them show only static data (by manually typing each of the values into an array), and this isn't an option for me, because I have over 2 million rows in my data frame.
So if I have
TopCustomers = raw_data.groupby(raw_data['Company'])['Total Records'].sum()
Top10Customers = TopCustomers.sort_values().tail(10)
How can I turn Top10Customers into a dataframe? I need it because not all plots work with series.
The syntax frame = { 'Col 1': series1, 'Col 2': series2 } doesn't work because I only have 1 series
Here is a small example with data:
import pandas as pd
raw_data = pd.DataFrame({'Company':['A', 'A','B', 'B', 'C', 'C'], 'Total Records':[2,3,6,4,5,10]})
TopCustomers = raw_data.groupby(raw_data['Company'])['Total Records'].sum()
Indeed, type(TopCustomers) is pandas.core.series.Series.
The following turns it into a DataFrame:
pd.DataFrame(TopCustomers)
Otherwise, .to_frame() works equally well.
You can use the .to_frame() method and it will turn it into a pd.DataFrame.
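A minimal sketch with the grouped data from above (reset_index() is an optional extra, useful when a plotting call wants the Company index back as an ordinary column):

# to_frame() turns the Series into a single-column DataFrame, keeping
# the Series name ('Total Records') as the column name
Top10Customers = TopCustomers.sort_values().tail(10)
top10_df = Top10Customers.to_frame().reset_index()
print(type(top10_df))  # <class 'pandas.core.frame.DataFrame'>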
I'm very new, so please excuse this question if it's very basic, but I have a dataframe with some columns (Open, High, Low, and Close). I'd like to write a simple function that takes the Close column by default (but allows any of the other columns to be specified) and returns a new dataframe with just that column.
My code below just returns a dataframe with the column name but no data:
import pandas as pd
df = pd.read_csv('Book2.csv')
df = df.loc[2:, :'Close'].drop('Unnamed: 7', axis=1)
df.rename(columns={'Unnamed: 0': 'X'}, inplace=True)
df.drop(['O', 'H', 'L'], axis=1, inplace=True)
def agg_data(ys):
    agg_df = pd.DataFrame(ys, columns=['Y Values'])
    return agg_df
result = agg_data(df['Close'])
print(result)
You don't need to put the data into pd.DataFrame() when it already is a pandas dataframe. Correct me if I'm misunderstanding what you want, but as I see it this should be sufficient:
result = df['Close'].copy()
If you don't use copy(), your initial df will also change whenever you modify result; since you want a new dataframe (or rather a Series, since it's one-dimensional), the copy is probably what you want.
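If you specifically need a DataFrame rather than a Series (some plotting APIs insist on 2-D input), here is a minimal sketch, assuming the column is really named 'Close' as in the question. The original pd.DataFrame(ys, columns=['Y Values']) likely comes back empty because the Series is named 'Close', so columns=['Y Values'] reindexes to a column that doesn't exist:

# Double brackets select a one-column DataFrame; .copy() keeps it independent of df
result = df[['Close']].copy()

# Or, keeping the 'Y Values' name from the original function:
result = df['Close'].rename('Y Values').to_frame()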
The program I have written has generally done what I wanted it to do, for the most part: adding the totals of each column. The dataframe is read from a CSV file. My code is below:
import pandas as pd
import matplotlib.pyplot
class ColumnCalculation:
    """This houses the functions for all the column manipulation calculations"""

    def max_electricity(self):
        df.set_index('Date', inplace=True)
        df.loc['Total'] = df.sum()
        print(df)

df = pd.read_csv("2011-onwards-city-elec-consumption.csv")
ColumnCalculation.max_electricity(df)
(My dataset is attached as an image; I didn't know how to format it properly.)
The code nicely adds a Total row at the bottom with the sum of each column, except when it comes to the last column (2017), shown in the image below:
I am not sure why the program does this; I've tried different indexing options like .iloc or .ix but it doesn't seem to make a difference. I have also tried adding each column individually (below):
def max_electricity(self):
    df.set_index('Date', inplace=True)
    df.loc['Total', '2011'] = df['2011'].sum()
    df.loc['Total', '2012'] = df['2012'].sum()
    df.loc['Total', '2013'] = df['2013'].sum()
    df.loc['Total', '2014'] = df['2014'].sum()
    df.loc['Total', '2015'] = df['2015'].sum()
    df.loc['Total', '2016'] = df['2016'].sum()
    df.loc['Total', '2017'] = df['2017'].sum()
    print(df)
But I receive an error; I assume this is too much repetition? I've been trying to figure this out for a good hour and a bit.
Your last column isn't being parsed as floats but as strings (the values contain thousands separators).
To fix this, try casting to numeric before summing:
import locale
locale.setlocale(locale.LC_NUMERIC, '')
df['2017'] = df['2017'].map(locale.atoi)
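An alternative sketch without locale, assuming the strings contain only comma thousands separators:

# Strip the separators, then convert the column to a numeric dtype
df['2017'] = pd.to_numeric(df['2017'].str.replace(',', '', regex=False))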
Better still, try reading in the data as numeric data. For example:
df = pd.read_csv('file.csv', sep='\t', thousands=',')