Transposed dataframe to LaTeX - python

I am not able to change the number format in the LaTeX output of the pandas library.
Consider this example:
import pandas as pd
values = [ { "id":"id1", "c1":1e-10, "c2":int(1000) }]
df = pd.DataFrame.from_dict(values).set_index("id")
print(df)
with output:
               c1    c2
id
id1  1.000000e-10  1000
Let's say that I want c1 formatted with two decimal places, c2 as an integer:
s = df.style
s.clear()
s.format({ "c1":"{:.2f}", "c2":"{:d}" })
print(s.to_latex())
with output:
\begin{tabular}{lrr}
& c1 & c2 \\
id & & \\
id1 & 0.00 & 1000 \\
\end{tabular}
However, I do not need a LaTeX table for df but for df.T.
Question: since I can specify the styles only for the columns (at least it seems so in the docs), how can I specify the row-based output format for df.T?
If I simply write this:
dft = df.T
s2 = dft.style
# s2.clear() # nothing changes with this instruction
print(s2.to_latex())
it is even worse, as I get:
\begin{tabular}{lr}
id & id1 \\
c1 & 0.000000 \\
c2 & 1000.000000 \\
\end{tabular}
where even the integer (the one with value int(1000)) becomes a float under the default style/format.
I played with the subset parameter and various slices with no success.

As a workaround, you can format each value to a string with its correct format before transposing:
fmt = { "c1":"{:.2f}", "c2":"{:d}" }
df_print = df.apply(lambda x: [fmt[x.name].format(v) for v in x])
print(df_print.T.style.to_latex())
Output:
\begin{tabular}{ll}
{id} & {id1} \\
c1 & 0.00 \\
c2 & 1000 \\
\end{tabular}
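Alternatively, depending on your pandas version, Styler.format accepts row-wise subsets via pd.IndexSlice, so you may be able to format the transposed frame directly. A hedged sketch (note that transposing upcasts the mixed int/float data to float64, so "{:d}" no longer applies and a zero-decimal float format is used instead):
import pandas as pd

values = [{"id": "id1", "c1": 1e-10, "c2": 1000}]
df = pd.DataFrame.from_dict(values).set_index("id")
dft = df.T  # the transpose upcasts everything to float64

s2 = dft.style
# pd.IndexSlice[rows, columns] selects whole rows of the transposed frame
s2.format("{:.2f}", subset=pd.IndexSlice[["c1"], :])
s2.format("{:.0f}", subset=pd.IndexSlice[["c2"], :])  # the int became a float
print(s2.to_latex())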

Pandas df.to_latex() output gets truncated

Problem: I tried to export a pandas.DataFrame to LaTeX using .to_latex().
However, the output for long values (in my case long strings) gets truncated.
Steps to reproduce:
import pandas as pd
df = pd.DataFrame(['veryLongString' * i for i in range(1, 5)], dtype='string')
print(df.to_latex())
Output:
\begin{tabular}{ll}
\toprule
{} & 0 \\
\midrule
0 & veryLongString \\
1 & veryLongStringveryLongString \\
2 & veryLongStringveryLongStringveryLongString \\
3 & veryLongStringveryLongStringveryLongStringvery... \\
\bottomrule
\end{tabular}
As you can see, the last row gets truncated (with ...).
I already tried the col_space parameter, but it does not change the behaviour as expected.
It simply shifts the table cells by padding them with whitespace, as follows:
\begin{tabular}{ll}
\toprule
{} & 0 \\
\midrule
0 & veryLongString \\
1 & veryLongStringveryLongString \\
2 & veryLongStringveryLongStringveryLongString \\
3 & veryLongStringveryLongStringveryLongStringvery... \\
\bottomrule
\end{tabular}
How do I get the full content of the DataFrame exported to LaTeX?
You can use pd.option_context in a with statement to temporarily raise the maximum column width:
with pd.option_context("max_colwidth", 1000):
    print(df.to_latex())
Output:
\begin{tabular}{ll}
\toprule
{} & 0 \\
\midrule
0 & veryLongString \\
1 & veryLongStringveryLongString \\
2 & veryLongStringveryLongStringveryLongString \\
3 & veryLongStringveryLongStringveryLongStringveryLongString \\
\bottomrule
\end{tabular}
This behaviour is also described in the pandas documentation on options.
After spending some time trying out other parameters of to_latex() as well as other export options, e.g. to_csv(), I was sure that this was not a problem specific to to_latex().
I found the solution in the pandas documentation: setting the display.max_colwidth option to None removes the restriction on the output (globally).
pd.set_option('display.max_colwidth', None)
Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html
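If you do change the option globally, you can restore the default afterwards with pd.reset_option:
pd.reset_option('display.max_colwidth')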

Substituting a variable in a dataframe row based on another row's value

I have a dataframe that contains an ID column, a Formula column, and a Dependent ID column, whose IDs I extracted from the Formula column.
Now I have to substitute all the dependent IDs into the formulas, based on the dataframe.
My approach is to run a nested loop over the rows, substituting each dependent ID in the formula using the replace function, and to stop the loop when no more substitutions are possible. However, I don't know where to begin, and I am not sure this is the correct approach.
I am wondering if there is any function that can make the process easier?
Here is the code to create the current dataframe:
import pandas as pd

data = pd.DataFrame({'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
                     'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10', 'E3/2 +20', 'E3/2 +20', 'var_i', 'var_x'],
                     'Dependent ID': ['C2', 'B2, D3', 'E3', 'D3, E3', '', '']})
Examples of my current dataframe and my desired end result were shown as images in the original post; the current dataframe matches the code above.
Recursively replace each dependent ID inside the formula with that ID's own formula:
df = pd.DataFrame({'ID': ['A1', 'A3', 'B2', 'C2', 'D3', 'E3'],
                   'Formula': ['C2/500', 'If B2 >10 then (B2*D3) + 100 else D3+10', 'E3/2 +20', 'D3+E3', 'var_i', 'var_x'],
                   'Dependent ID': ['C2', 'B2,D3', 'E3', 'D3,E3', '', '']})

def find_formula(formula: str, ids: str):
    # replace all the IDs inside the formula with their own formulas
    if ids == '':
        return formula
    for x in ids.split(','):
        sub_formula = df.loc[df['ID'] == x, 'Formula'].values[0]
        sub_id = df.loc[df['ID'] == x, 'Dependent ID'].values[0]
        formula = formula.replace(x, find_formula(sub_formula, sub_id))
    return formula

df['new_formula'] = df.apply(lambda x: find_formula(x['Formula'], x['Dependent ID']), axis=1)
output:
   ID                                   Formula Dependent ID                                        new_formula
0  A1                                    C2/500           C2                                    var_i+var_x/500
1  A3              If B2 >10 then (B2*D3) + ...        B2,D3  If var_x/2 +20 >10 then (var_x/2 +20*var_i) + ...
2  B2                                  E3/2 +20           E3                                        var_x/2 +20
3  C2                                     D3+E3        D3,E3                                        var_i+var_x
4  D3                                     var_i                                                           var_i
5  E3                                     var_x                                                           var_x
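One caveat: find_formula has no guard against circular dependencies (e.g. two IDs whose formulas reference each other), which would recurse forever. A hedged sketch of a variant (find_formula_safe is a hypothetical name) that tracks visited IDs and leaves cyclic references unexpanded:
def find_formula_safe(formula: str, ids: str, seen=frozenset()):
    # like find_formula, but never re-expand an ID already on the current path
    if ids == '':
        return formula
    for x in ids.split(','):
        if x in seen:
            continue  # cycle detected; leave the ID as-is
        sub_formula = df.loc[df['ID'] == x, 'Formula'].values[0]
        sub_id = df.loc[df['ID'] == x, 'Dependent ID'].values[0]
        formula = formula.replace(x, find_formula_safe(sub_formula, sub_id, seen | {x}))
    return formula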

Numerical column not displaying numbers correctly in df

I have a df such as the one below (3 rows for example):
ID | Dollar_Value
C          45.32
E           5.21
V         121.32
When I view the df in my notebook, it shows Dollar_Value as:
ID | Dollar_Value
C   8.493000e+01
E   2.720000e+01
V   1.720000e+01
instead of the regular format. But when I filter the df for a specific ID, it shows the values as they are supposed to be (e.g. 82.23 or 2.45):
df[df['ID'] == 'E']
ID | Dollar_Value
E          45.32
Is there something I have to do formatting-wise so the df itself displays the value column as it's supposed to?
Thanks!
You can try running this code before printing, since your columns may contain very large or very small numbers (check with df.describe()):
pd.set_option('display.float_format', lambda x: '%.3f' % x)
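If you would rather not change the display format globally, the same formatter can be applied temporarily with pd.option_context:
with pd.option_context('display.float_format', '{:.2f}'.format):
    print(df)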

Statsmodels summary_col LaTeX format error

I am using statsmodels' summary_col to give me an output table which summarizes two regression outputs.
The code for this is
res3 = summary_col([res1, res2], stars=True, float_format='%0.2f',
                   info_dict={'R2': lambda x: "{:.2f}".format(x.rsquared)})
f = open('res3.tex', 'w')
f.write(res3.as_latex())
f.close()
I use the res3.tex file as input for another .tex file which then generates the results. The problem arises when I convert the table to LaTeX format using as_latex(): the table header shifts to the side in the rendered document (shown as an image in the original post).
The res3.tex file contains the following LaTeX code:
\begin{table}
\caption{}
\begin{center}
\begin{tabular}{lcc}
\hline
& investment I & investment II \\
\hline
\hline
\end{tabular}
\begin{tabular}{lll}
GDP & 1.35*** & 1.19*** \\
& (0.24) & (0.23) \\
bsent & 0.28*** & 0.26*** \\
& (0.06) & (0.06) \\
rate & -0.22* & -0.65*** \\
& (0.13) & (0.19) \\
research & & 0.80*** \\
& & (0.27) \\
R2 & 0.76 & 0.80 \\
\hline
\end{tabular}
\end{center}\end{table}
The problem seems to arise due to multiple tabular environments. Is there a way to get the investment header on top of the table without manually changing the res3 file (intermediary file)?
You first have to add a couple of things to the LaTeX preamble.
You can probably find an answer to your problem here: Outputting Regressions as Table in Python (similar to outreg in stata)?
res3 = summary_col([res1, res2], stars=True, float_format='%0.2f',
                   info_dict={'R2': lambda x: "{:.2f}".format(x.rsquared)})
beginningtex = """\\documentclass{report}
\\usepackage{booktabs}
\\begin{document}"""
endtex = "\\end{document}"
f = open('myreg.tex', 'w')
f.write(beginningtex)
f.write(res3.as_latex())
f.write(endtex)
f.close()
All credit to @BKay.
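If you want to avoid editing the intermediary file by hand, one possible workaround (a heuristic sketch, not part of the statsmodels API; merge_tabulars is a hypothetical helper) is to splice the two consecutive tabular environments into one before writing the file:
import re

def merge_tabulars(latex: str) -> str:
    # Remove the first \end{tabular}/\begin{tabular}{...} junction so the
    # header tabular and the body tabular become a single environment.
    # Heuristic: check that the merged table still compiles.
    return re.sub(r'\\end\{tabular\}\s*\\begin\{tabular\}\{[^}]*\}', '', latex, count=1)

# usage, assuming res3 from above:
# f.write(merge_tabulars(res3.as_latex()))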

subsetting a Python DataFrame

I am transitioning from R to Python. I just began using pandas. I have R code that subsets nicely:
k1 <- subset(data, Product == p.id & Month < mn & Year == yr, select = c(Time, Product))
Now, I want to do similar stuff in Python. This is what I have got so far:
import pandas as pd
data = pd.read_csv("../data/monthly_prod_sales.csv")
# first, index the dataset by Product, and get everything that matches a given 'p.id' and time
data.set_index('Product')
k = data.ix[[p.id, 'Time']]
# then, index this subset with Time and do more subsetting...
I am beginning to feel that I am doing this the wrong way. Perhaps there is an elegant solution. Can anyone help? I need to extract month and year from the timestamp I have and do subsetting. Perhaps there is a one-liner that will accomplish all this:
k1 <- subset(data, Product == p.id & Time >= start_time & Time < end_time, select = c(Time, Product))
Thanks.
I'll assume that Time and Product are columns in a DataFrame, df is an instance of DataFrame, and that other variables are scalar values:
For now, you'll have to reference the DataFrame instance:
k1 = df.loc[(df.Product == p_id) & (df.Time >= start_time) & (df.Time < end_time), ['Time', 'Product']]
The parentheses are also necessary because of the precedence of the & operator vs. the comparison operators: & is actually an overloaded bitwise operator, and it binds more tightly than the comparison operators (though less tightly than the arithmetic operators), so an unparenthesized expression would not be grouped the way you expect.
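A minimal demonstration of why the parentheses matter (the frame and column here are just an illustration):
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

df[(df.a > 1) & (df.a < 3)]  # element-wise AND: selects the row where a == 2
# df[df.a > 1 & df.a < 3]    # parsed as df.a > (1 & df.a) < 3, a chained
#                            # comparison that raises "truth value ... is ambiguous"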
In pandas 0.13 a new experimental DataFrame.query() method will be available. It's extremely similar to subset, modulo the select argument:
With query() you'd do it like this (note the @ prefix, which lets query refer to local Python variables, and that the column selection comes last so Month and Year are still visible to the query):
df.query('Product == @p_id and Month < @mn and Year == @yr')[['Time', 'Product']]
Here's a simple example:
In [8]: import numpy as np; from pandas import DataFrame; from numpy.random import poisson

In [9]: df = DataFrame({'gender': np.random.choice(['m', 'f'], size=10), 'price': poisson(100, size=10)})
In [10]: df
Out[10]:
  gender  price
0      m     89
1      f    123
2      f    100
3      m    104
4      m     98
5      m    103
6      f    100
7      f    109
8      f     95
9      m     87
In [11]: df.query('gender == "m" and price < 100')
Out[11]:
  gender  price
0      m     89
4      m     98
9      m     87
The final query that you're interested in will even be able to take advantage of chained comparisons, like this:
k1 = df[['Time', 'Product']].query('Product == @p_id and @start_time <= Time < @end_time')
Just for someone looking for a solution more similar to R:
df[(df.Product == p_id) & (df.Time > start_time) & (df.Time < end_time)][['Time', 'Product']]
No need for data.loc or query, but I do think it is a bit long.
I've found that you can use any subset condition for a given column by wrapping it in []. For instance, suppose you have a df with columns ['Product', 'Time', 'Year', 'Color'].
Let's say you want to include products made before 2014. You could write
df[df['Year'] < 2014]
to return all the rows where this is the case. You can chain on additional conditions:
df[df['Year'] < 2014][df['Color'] == 'Red']
Then just choose the columns you want, as directed above; for instance, the Product and Color columns for the df above:
df[df['Year'] < 2014][df['Color'] == 'Red'][['Product','Color']]
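An equivalent single-step version of the same filter, which avoids chained indexing (chained selections can behave unexpectedly on assignment and may emit SettingWithCopyWarning):
df.loc[(df['Year'] < 2014) & (df['Color'] == 'Red'), ['Product', 'Color']]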
Regarding some points mentioned in previous answers ("no need for data.loc or query, but I do think it is a bit long" and "the parentheses are also necessary, because of the precedence of the & operator vs. the comparison operators"), and to improve readability:
I like to write such expressions as follows - fewer brackets, faster to type, easier to read. Closer to R, too.
# c is just a convenience
c = lambda v: v.split(',')

q_product = df.Product == p_id
q_start = df.Time > start_time
q_end = df.Time < end_time
df.loc[q_product & q_start & q_end, c('Time,Product')]
Creating an empty dataframe with known column names:
Names = ['Col1', 'ActivityID', 'TransactionID']
df = pd.DataFrame(columns=Names)
Creating a dataframe from a csv file:
df = pd.read_csv('...../file_name.csv')
Creating a dynamic filter to subset a dataframe:
i = 12
df[df['ActivityID'] <= i]
Creating a dynamic filter to subset required columns of a dataframe:
df[df['ActivityID'] == i][['TransactionID', 'ActivityID']]
