Colouring one column of pandas dataframe: change in format - python

I would like to color a column in my dataframe (using the method proposed in the answer linked below). So (taking only the 1st row of my dataframe) I used the following expression to change the color:
data1.head(1).style.set_properties(**{'background-color': 'red'}, subset=['column10'])
However, it causes another problem: it changes the format of my dataframe (it adds more 0's after the decimal point). Is there any way I can keep the old format of my dataframe and still color a column? Thanks in advance.
Old output (first row):
2021-01-02 32072 0.0 1831 1831 1.0 1 0 1 0.0 1.0
New output (first row):
2021-01-02 32072 0.000000 1831 1831 1.000000 1 0 1 0.000000 1.000000 1.000000
Colouring one column of pandas dataframe

Edit, as I see you wanted rows, not columns:
Once you apply the style code, your object changes from a pandas.core.frame.DataFrame to a pandas.io.formats.style.Styler. The Styler renders floats differently than the DataFrame does, which is why you see the extra decimal places. You can set the display format with Styler.format to get the result you want:
import pandas as pd

data = [{"col1": "2021-01-02", "col2": 32072, "col3": 0.0, "col4": 1831, "col5": 1831,
         "col6": 1.0, "col7": 1, "col8": 0, "col9": 1, "column10": 0.0, "col11": 1.0}]
data1 = pd.DataFrame(data)
data1 = data1.style.format(precision=1, subset=list(data1.columns)).set_properties(**{'background-color': 'red'}, subset=['column10'])
data1
Output:
Once you use style, it is no longer a pandas DataFrame but a Styler object, so the usual DataFrame methods no longer work on the styled result (e.g. just calling head(10) no longer works). There are workarounds, though. If you want to look at only the first 10 rows of your Styler after applying the style, you can export the style that was used and reapply it to just the top 10 rows:
import pandas as pd

data = [{"col1": "2021-01-02", "col2": 32072, "col3": 0.0, "col4": 1831, "col5": 1831,
         "col6": 1.0, "col7": 1, "col8": 0, "col9": 1, "column10": 0.0, "col11": 1.0}]
data1 = pd.DataFrame(data)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job here
data1 = pd.concat([data1, pd.DataFrame(data * 20)]).reset_index(drop=True)
data1 = data1.style.format(precision=1, subset=list(data1.columns)).set_properties(**{'background-color': 'red'}, subset=['column10'])
This gives a larger dataframe:
And then running this afterwards will show only the first 10 rows, with the exported style reapplied:
style = data1.export()
data1.data.head(10).style.use(style).format(precision=1, subset=list(data1.columns))
Now it is only showing the first 10 rows:
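A small simplification worth noting (a sketch on a toy frame, not the asker's exact data): Styler.format applies to the whole frame when subset is omitted, and precision only changes how floats are rendered, so listing every column isn't strictly necessary:
import pandas as pd

data1 = pd.DataFrame({"col1": ["2021-01-02"], "col2": [32072], "col3": [0.0],
                      "col6": [1.0], "column10": [0.0]})

# precision=1 only affects float columns; strings and ints keep their default rendering
styled = data1.style.format(precision=1).set_properties(**{'background-color': 'red'}, subset=['column10'])
styled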

Related

Slicing pandas raw dataframe (prior to re-organizing the data)

This is my very first post but I'll do my best to make it relevant.
I have a dataframe of stock prices freshly imported with the DataReader, from Morningstar. It looks like this:
print df.head()
Close High Low Open Volume Symbol
Symbol Date
AAPL 2018-03-01 175.00 179.775 172.66 178.54 48801970 AAPL
2018-03-02 176.21 176.300 172.45 172.80 38453950 AAPL
2018-03-05 176.82 177.740 174.52 175.21 28401366 AAPL
2018-03-06 176.67 178.250 176.13 177.91 23788506 AAPL
2018-03-07 175.03 175.850 174.27 174.94 31703462 AAPL
I want to refer to specific cells in the dataframe, especially the values in the last row for a given stock. There are 255 rows.
Please note that the dataframe is a concatenation of multiple DataReader fetches. I made it from code found on StackOverflow, with slight updates and changes:
import pandas as pd
import pandas_datareader.data as web

rawdata = []  # empty list to collect one frame per ticker
for ticker in tickers:  # tickers is your list of symbols
    fetched = web.DataReader(ticker, "morningstar", start='3/1/2018', end='4/15/2018')  # bloody month/day/year
    fetched['Symbol'] = ticker  # add a symbol column
    rawdata.append(fetched)
stocks = pd.concat(rawdata)  # concatenate all the dfs (not just the last fetch)
Now
print df[255:]
returns the last row with column names, and
print df[255:].values
returns the values of the last row.
But
print df[-1]
returns an error.
I will need to refer to the last row after updating the dataframe, without knowing in advance how many rows there will be. Why can't I do df[-1]?
I've looked around and found techniques with "iloc" notably, but I'm trying to keep this very simple for the moment.
I've also looked for questions about slicing. But
print df[255:['Close']]
returns the error "unhashable type" - although there already is a column named 'Close'.
Is it because my dataframe is not properly indexed? Or because it is not a csv yet?
I know how to work on indexes, and also how to write to csv. And I will definitely have to organize the data in a better way at some stage. But I don't understand why I cannot call the last row or slice for a specific cell with the current format of my data.
Thanks for your kind attention
You need to be a bit careful when slicing DataFrames with [].
When you provide a single non-slice argument, pandas looks for a column with that label. When you write df[-1] you get KeyError: -1 because your df doesn't have any column labeled -1.
If you want the last row, either use a colon inside [] (df[-1:]) or, if you want to be super safe, use .iloc.
Hopefully this illustrates this a bit more. I've included a column labeled -1 just to show you what df[-1] will actually do.
import pandas as pd
df = pd.DataFrame({'value': [-2, -1, 0, 1, 2],
                   'name': ['a', 'b', 'c', 'd', 'e'],
                   -1: [1, 2, 3, 4, 5]})
# value name -1
#0 -2 a 1
#1 -1 b 2
#2 0 c 3
#3 1 d 4
#4 2 e 5
df[-1]
#0 1
#1 2
#2 3
#3 4
#4 5
#Name: -1, dtype: int64
df[-1:] # or df.iloc[-1:]
# value name -1
#4 2 e 5
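To tie this back to the original goal of reading a cell in the last row without knowing the row count, here is a minimal sketch on a toy frame (the real concatenated stocks frame isn't shown, so the column set is assumed):
import pandas as pd

df = pd.DataFrame({'Close': [175.00, 176.21, 176.82],
                   'Symbol': ['AAPL', 'AAPL', 'AAPL']})

last_row = df.iloc[-1]                                 # last row as a Series, whatever the row count
last_close = df.iloc[-1, df.columns.get_loc('Close')]  # a single cell from that last row
print(last_close)  # 176.82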

Python: reorganize a dataframe with repeated values appearing in one column.

I have a dataframe that looks like this:
Instrument Date Total Return
0 KYG2615B1014 2017-11-29T00:00:00Z 0.000000
1 KYG2615B1014 2017-11-28T00:00:00Z -10.679612
2 KYG2615B1014 2017-11-27T00:00:00Z -8.035714
3 JP3843250006 2017-11-29T00:00:00Z 0.348086
4 JP3843250006 2017-11-28T00:00:00Z 0.349301
5 JP3843250006 2017-11-27T00:00:00Z 0.200000
Given that dataframe, I would like to make it look like this:
11/27/2017 11/28/2017 11/29/2017
KYG2615B1014 -8.035714 -10.679612 0.000000
JP3843250006 0.200000 0.349301 0.348086
Basically what I want is to place every date as a new column and inside that column, placing the corresponding value. I wouldn't say "filtering" or "deleting" duplicates, I'd say this is much more like rearranging.
Both dataframes were generated by me, but the thing is that to acquire this data I have to call an API. For the 1st dataframe I make only one call and pull all of this data, while for the other I make one call per date. So the 1st is much more efficient than the 2nd, and I figured it was the right choice, but I'm stuck on this part of reorganizing the dataframe into what I need.
I thought of creating an empty dataframe and then populating it: picking the indexes of repeated elements in the 'Instrument' column, using those indexes to get elements from the 'Total Return' column, and then placing the elements from that chunk of data accordingly, but I don't know how to do it.
If someone can help me, I'll be happy to know.
Not sure if it's useful at this point, but this is how I generated the dataframe (before populating it) in the 2nd version:
import pandas as pd
import datetime
#Getting a list of dates
start=datetime.date(2017,11,27)
end=datetime.date.today() - datetime.timedelta(days=1)
row_dates=[x.strftime('%m/%d/%Y') for x in pd.bdate_range(start,end).tolist()]
#getting identifiers to be used on Eikon
csv_data=pd.read_csv('171128.csv', header=None)
identifiers=csv_data[0].tolist()
df=pd.DataFrame(index=identifiers, columns=row_dates)
You can use pd.crosstab:
pd.crosstab(df.Instrument, df['Date'], values=df['Total Return'], aggfunc='mean')
Output:
Date 2017-11-27T00:00:00Z 2017-11-28T00:00:00Z 2017-11-29T00:00:00Z
Instrument
JP3843250006 0.200000 0.349301 0.348086
KYG2615B1014 -8.035714 -10.679612 0.000000
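If the raw ISO timestamps in the column headers are a problem, one possible follow-up (a sketch, assuming the Date strings parse cleanly with pd.to_datetime) is to relabel the columns after the crosstab:
import pandas as pd

df = pd.DataFrame({'Instrument': ['KYG2615B1014', 'JP3843250006'],
                   'Date': ['2017-11-29T00:00:00Z', '2017-11-29T00:00:00Z'],
                   'Total Return': [0.000000, 0.348086]})

out = pd.crosstab(df.Instrument, df['Date'], values=df['Total Return'], aggfunc='mean')
out.columns = pd.to_datetime(out.columns).strftime('%m/%d/%Y')  # '2017-11-29T00:00:00Z' -> '11/29/2017'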
This looks like a job for pandas.pivot_table() to me; note you can add an agg function if you think there will be duplicates (from the example it looks like there is only one reading per day).
import pandas as pd

instrument = ['KYG2615B1014', 'KYG2615B1014', 'KYG2615B1014', 'JP3843250006', 'JP3843250006', 'JP3843250006']
date = ['11/29/2017', '11/28/2017', '11/27/2017', '11/29/2017', '11/28/2017', '11/27/2017']
total_return = [0.0, -10.679612, -8.035714, 0.348086, 0.349301, 0.200000]
stacked = pd.DataFrame(dict(Instrument=instrument, Date=date, Total_return=total_return))
pd.pivot_table(stacked, values='Total_return', index='Instrument', columns='Date')
This returns the following:
Date 11/27/2017 11/28/2017 11/29/2017
Instrument
JP3843250006 0.200000 0.349301 0.348086
KYG2615B1014 -8.035714 -10.679612 0.000000
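If there could be several readings per instrument per day, you can make the aggregation explicit (pivot_table already averages duplicates by default, so this is just the same call with the choice spelled out; swap in 'first', 'last', 'sum', etc. as needed):
pd.pivot_table(stacked, values='Total_return', index='Instrument', columns='Date', aggfunc='mean')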

Subset Pandas Data Frame Based on Some Row Value

I have a Pandas data frame with columns that are 'dynamic' (meaning that I don't know what the column names will be until I retrieve the data from the various databases).
The data frame is a single row and looks something like this:
Make Date Red Blue Green Black Yellow Pink Silver
89 BMW 2016-10-28 300.0 240.0 2.0 500.0 1.0 1.0 750.0
Note that '89' is the index of that particular row in the data frame.
I have the following code:
cars_bar_plot = df_cars.loc[(df_cars.Make == 'BMW') & (df_cars.Date == as_of_date)]
cars_bar_plot = cars_bar_plot.replace(0, value=np.nan)
cars_bar_plot = cars_bar_plot.dropna(axis=1, how='all')
This works fine for creating the above-mentioned single-row data frame, BUT some of the values in the columns are very small (e.g. 1.0 and 2.0) relative to the others, and they distort the horizontal bar chart I'm creating with Matplotlib. I'd like to get rid of numbers that are smaller than some minimum threshold value (e.g. 3.0).
Any idea how I can do that?
Thanks!
UPDATE 1
The following line of code helps, but does not fully solve the problem.
cars_bar_plot = cars_bar_plot.loc[:, (cars_bar_plot >= 3.0).any(axis=0)]
The problem is that it eliminates unintended columns. For example, referring to the original data frame, is it possible to modify this code so that it only removes columns with a value less than 3.0 to the right of the "Black" column (on the assumption that we actually want to retain the value of 2.0 in the "Green" column)?
Thanks!
Assuming you only want to keep rows matching your criteria, you can filter your data like this:
df[df.apply(lambda x: x > 0.5).min(axis=1)]
i.e. simply check all values against your condition, and remove the row if at least one of them doesn't satisfy it.
Here is the answer to my question:
lower_threshold = 3.0
start_column = 5  # position of the first column the threshold should apply to
keep_right = (df.iloc[:, start_column:] >= lower_threshold).any(axis=0)  # columns past the cutoff with at least one value >= threshold
df = df[list(df.columns[:start_column]) + list(keep_right[keep_right].index)]

python pandas - map using 2 columns as reference

I have 2 txt files I'd like to read into python: 1) a map file, and 2) a data file. I'd like a lookup table or dictionary to read the values from TWO COLUMNS of one file and determine which value to put in the 3rd column of the other, using something like the pandas Series.map function. The real map file is ~700,000 lines, and the real data file is ~10 million lines.
Toy Dataframe (or I could recreate as a dictionary) - Map
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2000 SNPD
Toy Dataframe - Data File
Chr Position
1 1000
1 2000
2 1000
2 2001
Resulting final table:
Chr Position Name
1 1000 SNPA
1 2000 SNPB
2 1000 SNPC
2 2001 NaN
I found several questions about this with only a one-column lookup: Adding a new pandas column with mapped value from a dictionary. But I can't seem to find a way to use 2 columns. I'm also open to other packages that may handle genomic data.
As a bonus second question, it'd also be nice if there were a way to map the 3rd column when the Position is within a certain amount of the mapped value. In other words, row 4 of the resulting table above would map to SNPD, as it's only 1 away. But I'd be happy to just get the solution to the above.
I would do it this way:
Read your map data so that the first two columns become the index:
dfm = pd.read_csv('/path/to/map.csv', delim_whitespace=True, index_col=[0,1])
Change delim_whitespace=True to sep=',' if you have a comma as the delimiter.
Read your data file (setting the same index):
df = pd.read_csv('/path/to/data.csv', delim_whitespace=True, index_col=[0,1])
Join your DFs:
df.join(dfm)
Output:
In [147]: df.join(dfm)
Out[147]:
Name
Chr Position
1 1000 SNPA
2000 SNPB
2 1000 SNPC
2001 NaN
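If you'd rather not set an index at all, a plain merge on the two columns does the same lookup; this is a sketch assuming both files have already been read into DataFrames with Chr and Position columns:
import pandas as pd

dfm = pd.DataFrame({'Chr': [1, 1, 2, 2], 'Position': [1000, 2000, 1000, 2000],
                    'Name': ['SNPA', 'SNPB', 'SNPC', 'SNPD']})
df = pd.DataFrame({'Chr': [1, 1, 2, 2], 'Position': [1000, 2000, 1000, 2001]})

# a left merge keeps every data row and leaves Name as NaN where no (Chr, Position) pair matches
result = df.merge(dfm, on=['Chr', 'Position'], how='left')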
PS: for the bonus question, try something like this.
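One possibility for that 'within a certain amount' matching (my own sketch, not necessarily what was originally suggested) is pd.merge_asof, which attaches the nearest Position per Chr within a tolerance. Reusing df and dfm from the sketch above:
# merge_asof needs both frames sorted by the 'on' key
near = pd.merge_asof(df.sort_values('Position'), dfm.sort_values('Position'),
                     on='Position', by='Chr', direction='nearest', tolerance=1)
near = near.sort_values(['Chr', 'Position']).reset_index(drop=True)  # row (2, 2001) now picks up SNPD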

Pandas.DataFrame - find the oldest date for which a value is available

I have a pandas.DataFrame object containing 2 time series. One series is much shorter than the other.
I want to determine the oldest date for which data is available in the shorter series, and remove the data in both columns before that date.
What is the most pythonic way to do that?
(I apologize that I don't really follow the SO guideline for submitting questions)
Here is a fragment of my dataframe:
osr go
Date
1990-08-17 NaN 239.75
1990-08-20 NaN 251.50
1990-08-21 352.00 265.00
1990-08-22 353.25 274.25
1990-08-23 351.75 290.25
In this case, I want to get rid of all rows before 1990-08-21 (I should add that there may be NaNs in one of the columns for more recent dates).
You can reverse the osr column with df['osr'][::-1], use isnull().idxmax() to find the most recent date that is still NaN, and then take the subset of df after that date:
print df
# osr go
#Date
#1990-08-17 NaN 239.75
#1990-08-20 NaN 251.50
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
s = df['osr'][::-1]
print s
#Date
#1990-08-23 351.75
#1990-08-22 353.25
#1990-08-21 352.00
#1990-08-20 NaN
#1990-08-17 NaN
#Name: osr, dtype: float64
maxnull = s.isnull().idxmax()
print maxnull
#1990-08-20 00:00:00
print df[df.index > maxnull]
# osr go
#Date
#1990-08-21 352.00 265.00
#1990-08-22 353.25 274.25
#1990-08-23 351.75 290.25
EDIT: New answer based upon comments/edits
It sounds like the data is sequential and once you have lines that don't have data you want to throw them out. This can be done easily with dropna.
df = df.dropna()
This answer assumes that once you are past the bad rows, they stay good, or that you don't care about dropping rows in the middle; it depends on how sequential you need the data to be. If the data needs to be sequential and your input is well formed, jezrael's answer is good.
Original answer
You haven't given much here by way of the structure of your dataframe, so I am going to make assumptions. I'll assume you have many columns, two of which, time_series_1 and time_series_2, are the ones you referred to in your question, and that this is all stored in df.
First we can find the shorter series by just using
shorter_col = df['time_series_1'] if len(df['time_series_1']) > len(df['time_series_2']) else df['time_series_2']
Now we want the last date in that
remove_date = max(shorter_col)
Now we want to remove data before that date
mask = (df['time_series_1'] > remove_date) | (df['time_series_2'] > remove_date)
df = df[mask]
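An alternative sketch that works directly on the sample frame above (assuming, as there, that 'osr' is the shorter series and the goal is simply to drop everything before its oldest date with data) uses first_valid_index:
import numpy as np
import pandas as pd

df = pd.DataFrame({'osr': [np.nan, np.nan, 352.00, 353.25, 351.75],
                   'go': [239.75, 251.50, 265.00, 274.25, 290.25]},
                  index=pd.to_datetime(['1990-08-17', '1990-08-20', '1990-08-21',
                                        '1990-08-22', '1990-08-23']))

first_date = df['osr'].first_valid_index()  # 1990-08-21, the oldest date with data in the short series
df = df.loc[first_date:]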
