How to extract data from LaTeX table using Python

I have a table written in LaTeX in a .tex file:
\begin{tabular}{cccccccc}
\hline
\hline
$ \beta$ & $T\times L^3$ & $am_{ud}^\bare$ & $am_s^\bare$ & $Z_S (am_{ud})$ & $a\Mss$ & $a M_\pi$ & $aF_\pi/Z_A$ \\ \hline
%48\_24\_3.5\_ud-0.041\_s-0.006
& $48\times 24^3$ & -0.041 & -0.006
& 0.01475(33) & 0.3415(5)(2) & 0.19188(50)(6) & 0.05491(34)(0) \\
%48\_24\_3.5\_ud-0.0437\_s-0.006
& $48\times 24^3$ & -0.0437 & -0.006
& 0.01188(27) & 0.3396(5)(2) & 0.17238(49)(3) & 0.05263(34)(0) \\
%64\_24\_3.5\_ud-0.041\_s-0.012
& $64 \times 24^3$ & -0.041 & -0.012
& 0.01428(33) & 0.3175(95)(4) & 0.18790(90)(30) & 0.05384(84)(6) \\
%64\_32\_3.5\_ud-0.0463\_s-0.012
& $64 \times 32^3$ & -0.0463 & -0.012
& 0.00853(20) & 0.3134(10)(7) & 0.14440(70)(60) & 0.05004(62)(6) \\
%64\_32\_3.5\_ud-0.048\_s-0.0023
3.5 & $64 \times 32^3$ & -0.048 & -0.0023
& 0.00726(17) & 0.3496(75)(5) & 0.13480(70)(20) & 0.04982(59)(1) \\
%64\_32\_3.5\_ud-0.049\_s-0.006
& $64 \times 32^3$ & -0.049 & -0.006
& 0.00579(15) & 0.3339(10)(5) & 0.12100(9)(3) & 0.04837(84)(3) \\
%64\_32\_3.5\_ud-0.049\_s-0.012
& $64 \times 32^3$ & -0.049 & -0.012
& 0.00560(14) & 0.3103(69)(9) & 0.11733(64)(3) & 0.04800(68)(2) \\
%64\_48\_3.5\_ud-0.0515\_s-0.012
& $64 \times 48^3$ & -0.0515 & -0.012
& 0.00288(7) & 0.3079(9)(1) & 0.08410(60)(20) & 0.04628(58)(3) \\
%64\_64\_3.5\_ud-0.05294\_s-0.006
& $64 \times 64^3$ & -0.05294 & -0.006
& 0.00149(5) & 0.3281(9)(5) & 0.06126(60)(9) & 0.04440(75)(6) \\
\hline
%48\_32\_3.61\_ud-0.028\_s0.0045
& $48 \times 32^3$ & -0.028 & 0.0045
& 0.01008(23) & 0.2955(6)(3) & 0.14852(49)(2) & 0.04408(34)(2) \\
%48\_32\_3.61\_ud-0.03\_s0.0045
& $48 \times 32^3$ & -0.03 & 0.0045
& 0.00808(18) & 0.2929(7)(3) & 0.13217(50)(9) & 0.04262(39)(1) \\
%48\_32\_3.61\_ud-0.03\_s-0.0042
& $48 \times 32^3$ & -0.03 & -0.0042
& 0.00783(18) & 0.2602(7)(2) & 0.12943(59)(4) & 0.04207(39)(1) \\
%48\_48\_3.61\_ud-0.03121\_s0.0045
3.61 & $48 \times 48^3$ & -0.03121 & 0.0045
\end{tabular}
I obviously only want the numbers but I'm having trouble even getting Python to read the lines. If I for example define:
file=open('lattice-data.tex','r')
and try file.read() or file.readline() I only get '' in return.

Extract the information using regular expressions (regex). W3Schools, for example, has an introduction to regex in Python.
At the core, you should look for the column delimiter, the ampersand '&', and the row terminator, the double backslash '\\'. After prying the table apart, you can decode each entry with a regex suited to its type of data. (I cannot see a clear pattern in the unformatted source.)
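The splitting described above can be sketched roughly as follows. This is only a minimal sketch, assuming that comment lines start with %, that every row ends with a double backslash, and that numeric cells look like 0.3415(5)(2). The names parse_tabular and parse_value are made up for illustration.

```python
import re

# A number optionally followed by parenthesised uncertainties, e.g. 0.3415(5)(2).
VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)((?:\(\d+\))*)$")


def parse_tabular(tex):
    """Split the body of a LaTeX tabular into rows of cell strings."""
    kept = []
    for line in tex.splitlines():
        line = line.strip()
        # Skip blank lines, LaTeX comments and the environment delimiters.
        if not line or line.startswith("%"):
            continue
        if line.startswith(r"\begin{tabular}") or line.startswith(r"\end{tabular}"):
            continue
        # Horizontal rules carry no data.
        kept.append(line.replace(r"\hline", ""))
    body = " ".join(kept)
    rows = []
    # Rows end with a double backslash; cells are separated by '&'.
    for raw_row in body.split(r"\\"):
        if raw_row.strip():
            rows.append([cell.strip() for cell in raw_row.split("&")])
    return rows


def parse_value(cell):
    """Return (value, [uncertainties]) for a numeric cell, or None otherwise."""
    match = VALUE.match(cell)
    if match is None:
        return None
    errors = [int(e) for e in re.findall(r"\((\d+)\)", match.group(2))]
    return float(match.group(1)), errors


sample = r"""
\begin{tabular}{cccccccc}
\hline
%48\_24\_3.5\_ud-0.041\_s-0.006
& $48\times 24^3$ & -0.041 & -0.006
& 0.01475(33) & 0.3415(5)(2) & 0.19188(50)(6) & 0.05491(34)(0) \\
\hline
\end{tabular}
"""

rows = parse_tabular(sample)
for cell in rows[0]:
    print(repr(cell), parse_value(cell))
```

To run it on the real file, read the text first, for example with open('lattice-data.tex') as fh: tex = fh.read(), and pass tex to parse_tabular. Cells that are not plain numbers (the empty beta column, the lattice size in math mode) come back as None, so they need their own handling.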

Related

Converting a string from long to short based on logic in python

I have multiple long values like the following in a column in pandas dataframe (an example) -
((Type=Food & Value1=Fruit & Value2=Apple) or (Type=Food & Value1=Fruit & Value2=Banana) or (Type=Food & Value1=Vegetable & Value2=Carrot) or (Type=Food & Value1=Vegetable & Value2=Tomato))
I want to convert it to -
((Type=Food & Value1=Fruit & Value2 = Apple|Banana) or (Type=Food & Value1=Vegetable & Value2= Carrot|Tomato))
How can I do this? I could not find anything that helps with it.

((Type=Food & Value1=Fruit & Value2 = Apple|Banana) =>
((Type=Food & Value1=Fruit & ((Value2 = Apple) or (Value2 = Banana))
Is this helpful?

OK, I think you need something like this:
fruits = ['banana','apple']
print('banana' in fruits)
print('value' in fruits)
Output:
True
False
For your case:
((Type=Food & Value1=Fruit & Value2 in [Apple,Banana])
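If the goal is to actually produce the shortened string from the question, one possible approach is to group the inner clauses by everything except Value2 and then join the collected values with |. This is a rough sketch, not a general boolean simplifier. It assumes every clause is a flat list of key=value pairs joined by &, that each clause contains the key being merged, and that it is acceptable for the merged key to end up last. The function name merge_conditions is made up.

```python
import re


def merge_conditions(expr, key="Value2"):
    """Merge OR-ed clauses that differ only in the value of `key`."""
    # Each innermost (...) group is one AND-clause.
    clauses = re.findall(r"\(([^()]+)\)", expr)
    grouped = {}  # maps the other key/value pairs to the list of `key` values
    for clause in clauses:
        pairs = [part.strip().split("=", 1) for part in clause.split("&")]
        fields = {k.strip(): v.strip() for k, v in pairs}
        value = fields.pop(key)  # raises KeyError if the clause lacks `key`
        grouped.setdefault(tuple(fields.items()), []).append(value)
    parts = []
    for prefix, values in grouped.items():
        conds = [f"{k}={v}" for k, v in prefix]
        conds.append(f"{key}={'|'.join(values)}")
        parts.append("(" + " & ".join(conds) + ")")
    return "(" + " or ".join(parts) + ")"


long_expr = (
    "((Type=Food & Value1=Fruit & Value2=Apple) or "
    "(Type=Food & Value1=Fruit & Value2=Banana) or "
    "(Type=Food & Value1=Vegetable & Value2=Carrot) or "
    "(Type=Food & Value1=Vegetable & Value2=Tomato))"
)
short_expr = merge_conditions(long_expr)
print(short_expr)
```

For a DataFrame column, the same function could be applied with the column's map method, e.g. df['col'].map(merge_conditions), where 'col' stands for whatever your column is called.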

Pandas df.to_latex() output gets truncated

Problem: I tried to export a pandas.DataFrame to LaTeX using .to_latex().
However, the output for long values (in my case long strings) gets truncated.
Steps to reproduce:
import pandas as pd
df = pd.DataFrame(['veryLongString' * i for i in range(1, 5)], dtype='string')
print(df.to_latex())
Output:
\begin{tabular}{ll}
\toprule
{} & 0 \\
\midrule
0 & veryLongString \\
1 & veryLongStringveryLongString \\
2 & veryLongStringveryLongStringveryLongString \\
3 & veryLongStringveryLongStringveryLongStringvery... \\
\bottomrule
\end{tabular}
As you can see, the last row gets truncated (with ...).
I already tried the col_space parameter, but it does not change the behavior as I expected.
It simply shifts the table cells, as follows:
\begin{tabular}{ll}
\toprule
{} & 0 \\
\midrule
0 & veryLongString \\
1 & veryLongStringveryLongString \\
2 & veryLongStringveryLongStringveryLongString \\
3 & veryLongStringveryLongStringveryLongStringvery... \\
\bottomrule
\end{tabular}
How do I get the full content of the DataFrame exported to Latex?

You can use the option_context context manager in a with statement to temporarily change the maximum column width:
with pd.option_context("display.max_colwidth", 1000):
    print(df.to_latex())
Output:
\begin{tabular}{ll}
\toprule
{} & 0 \\
\midrule
0 & veryLongString \\
1 & veryLongStringveryLongString \\
2 & veryLongStringveryLongStringveryLongString \\
3 & veryLongStringveryLongStringveryLongStringveryLongString \\
\bottomrule
\end{tabular}
This behaviour is also described here.

After spending some time trying other parameters of to_latex() as well as other export options, e.g. to_csv(), I was sure that this is not a problem specific to to_latex().
I found the solution in the pandas documentation: the display.max_colwidth option limits the width of each column in the output.
So the solution is to set this option to None so that the output is not restricted (globally).
pd.set_option('display.max_colwidth', None)
Source: https://pandas.pydata.org/pandas-docs/stable/user_guide/options.html

Convert statsmodel table to latex style .png with python

I think I am quite close to solving a problem a lot of statsmodels users have had in the past. However, my nonexistent LaTeX experience is slowing down my progress a lot. :D
Let's jump directly into the problem:
In the following I am using the example dataset from https://www.statsmodels.org/stable/regression.html.
The output is therefore the following regression table:
OLS Regression Results
I have tried several packages to convert the output to a latex style .png. The most promising method seems to be to use the sympy.printing.preview.preview function:
In:
from sympy.printing.preview import preview
preamble = "\\documentclass[12pt]{article}\n" \
"\\usepackage{booktabs,amsmath,amsfonts}\\begin{document}"
preview(res.summary().as_latex(), output='png', filename='output.png', preamble=preamble)
Out:
Latex style .png 1
As you can see, the second and third table of the regression as well as the warning below the table are not in line with the first table.
To fix this, I have tried to use the tabularx package as you can see in the following:
preamble = "\\documentclass[12pt]{article}\n" \
"\\usepackage{booktabs,amsmath,amsfonts, tabularx}\\begin{document}"
preview(res.summary().as_latex().replace("\\begin{tabular}", "\\begin{tabularx}{\\textwidth}").replace("\\end{tabular}", "\\end{tabularx}"), output='png', filename='output.png', preamble=preamble)
Output:
Latex style .png 2
The horizontal lines of tables two and three are now aligned with those of table one; however, the table contents and the warning are not.
Does anybody know how to continue from here? How can I save the original output table as a LaTeX-style .png with Python?
Thank you guys in advance!
Edit1:
Since res.summary().as_latex() outputs the code shown at the end of the post, I have managed to align the tables more or less manually with:
preview(res.summary().as_latex().replace("\\begin{tabular}", "\\begin{tabularx}{20cm}").replace("\\end{tabular}", "\\end{tabularx}").replace("{lclc}", "{>{\\hsize=.3\\hsize}X >{\\hsize=.2\\hsize}X >{\\hsize=.3\\hsize}X >{\\hsize=.2\\hsize}X}").replace("{lcccccc}", "{XXXXXX p{3.6cm}}"), output='png', filename='output.png', preamble=preamble)
The new output looks like this. This shouldn't and can't be the final solution, so I would like to extend the question: does somebody know a more dynamic (non-manual) way to do this?
Full latex code of the three tables:
\begin{center}
\begin{tabular}{lclc}
\toprule
\textbf{Dep. Variable:} & y & \textbf{ R-squared: } & 0.416 \\
\textbf{Model:} & OLS & \textbf{ Adj. R-squared: } & 0.353 \\
\textbf{Method:} & Least Squares & \textbf{ F-statistic: } & 6.646 \\
\textbf{Date:} & Thu, 04 Feb 2021 & \textbf{ Prob (F-statistic):} & 0.00157 \\
\textbf{Time:} & 18:38:15 & \textbf{ Log-Likelihood: } & -12.978 \\
\textbf{No. Observations:} & 32 & \textbf{ AIC: } & 33.96 \\
\textbf{Df Residuals:} & 28 & \textbf{ BIC: } & 39.82 \\
\textbf{Df Model:} & 3 & \textbf{ } & \\
\bottomrule
\end{tabular}
\begin{tabular}{lcccccc}
& \textbf{coef} & \textbf{std err} & \textbf{t} & \textbf{P$> |$t$|$} & \textbf{[0.025} & \textbf{0.975]} \\
\midrule
\textbf{x1} & 0.4639 & 0.162 & 2.864 & 0.008 & 0.132 & 0.796 \\
\textbf{x2} & 0.0105 & 0.019 & 0.539 & 0.594 & -0.029 & 0.050 \\
\textbf{x3} & 0.3786 & 0.139 & 2.720 & 0.011 & 0.093 & 0.664 \\
\textbf{const} & -1.4980 & 0.524 & -2.859 & 0.008 & -2.571 & -0.425 \\
\bottomrule
\end{tabular}
\begin{tabular}{lclc}
\textbf{Omnibus:} & 0.176 & \textbf{ Durbin-Watson: } & 2.346 \\
\textbf{Prob(Omnibus):} & 0.916 & \textbf{ Jarque-Bera (JB): } & 0.167 \\
\textbf{Skew:} & 0.141 & \textbf{ Prob(JB): } & 0.920 \\
\textbf{Kurtosis:} & 2.786 & \textbf{ Cond. No. } & 176. \\
\bottomrule
\end{tabular}
%\caption{OLS Regression Results}
\end{center}
Notes: \newline
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

regex matching with multiple conditions

I need to obtain a string of letters from the following rows_string variable:
'Equity & 1,638 & \\$3,227,305 & \\$2,649,208 & \\$3,270,402 & \\$3,114,298 & \\$3,173,369 & \\$2,978,769 & \\$3,016,161 & \\$2,807,840\\\\\nFixed Income & 420 & \\$765,856 & \\$661,395 & \\$824,603 & \\$792,579 & \\$794,224 & \\$783,793 & \\$719,307 & \\$630,298\\\\\nCommodities & 119 & \\$72,911 & \\$66,302 & \\$81,649 & \\$81,633 & \\$79,296 & \\$76,450 & \\$64,136 & \\$63,667\\\\\nAsset Allocation & 63 & \\$10,190 & \\$9,275 & \\$10,684 & \\$10,089 & \\$10,371 & \\$9,829 & \\$9,619 & \\$8,880\\\\\nAlternatives & 55 & \\$5,601 & \\$6,023 & \\$6,715 & \\$6,279 & \\$6,365 & \\$6,645 & \\$6,757 & \\$6,243\\\\\nCurrency & 34 & \\$311 & \\$2,014 & \\$1,665 & \\$1,743 & \\$1,683 & \\$1,666 & \\$1,722 & \\$2,058\\\\\nTOTALS & 2,329 & \\$4,082,173 & \\$3,394,217 & \\$4,195,718 & \\$4,006,620 & \\$4,065,308 & \\$3,857,151 & \\$3,817,700 & \\$3,518,986\\\\'
So for instance, I need the following list:
[Equity, Fixed Income, Commodities, Asset Allocation, Alternatives, Currency, Total]
I tried:
re.findall(r'\\\\\n(\w+.*?) &', rows_string)
This works, but it omits the first entry, "Equity".
It also gives me an empty list for this string variable:
'Starting Portfolio & sell & 21.39\\% & -0.91\\% & 1.52\\% & 9.29\\% & 9.72\\% & 14.89\\% & 38.21\\% & 55.4\\% & & 90.86\\%\\\\'
So for the second string, I need ['Starting Portfolio', 'sell']
What I want is to grab the first item following \\\\\n and the first item before the '&' in the string variable. Thank you.

You are just missing one \. You are not searching for the letters \ and n but for a line break, so just add a \ at the beginning of your regex. You are also missing the first entry because your pattern requires each word to be preceded by \\\\\n. To also get the first entry you could use, for example, ^(\w+.*?)|[\\\\\n](\w+.*?) &

Try this pattern with re.finditer():
import re
pattern = r"(((?!\\\\\n)([a-zA-Z\s]+))|([a-zA-Z\s]{2,}\s?(?!\&)))"
output_list = [i.group().strip() for i in re.finditer(pattern, rows_string) if i.group().strip()]
Inputs :
s1 = 'Equity & 1,638 & \\$3,227,305 & \\$2,649,208 & \\$3,270,402 & \\$3,114,298 & \\$3,173,369 & \\$2,978,769 & \\$3,016,161 & \\$2,807,840\\\\\nFixed Income & 420 & \\$765,856 & \\$661,395 & \\$824,603 & \\$792,579 & \\$794,224 & \\$783,793 & \\$719,307 & \\$630,298\\\\\nCommodities & 119 & \\$72,911 & \\$66,302 & \\$81,649 & \\$81,633 & \\$79,296 & \\$76,450 & \\$64,136 & \\$63,667\\\\\nAsset Allocation & 63 & \\$10,190 & \\$9,275 & \\$10,684 & \\$10,089 & \\$10,371 & \\$9,829 & \\$9,619 & \\$8,880\\\\\nAlternatives & 55 & \\$5,601 & \\$6,023 & \\$6,715 & \\$6,279 & \\$6,365 & \\$6,645 & \\$6,757 & \\$6,243\\\\\nCurrency & 34 & \\$311 & \\$2,014 & \\$1,665 & \\$1,743 & \\$1,683 & \\$1,666 & \\$1,722 & \\$2,058\\\\\nTOTALS & 2,329 & \\$4,082,173 & \\$3,394,217 & \\$4,195,718 & \\$4,006,620 & \\$4,065,308 & \\$3,857,151 & \\$3,817,700 & \\$3,518,986\\\\'
s2 = 'Starting Portfolio & sell & 21.39\\% & -0.91\\% & 1.52\\% & 9.29\\% & 9.72\\% & 14.89\\% & 38.21\\% & 55.4\\% & & 90.86\\%\\\\'
Output :
['Equity', 'Fixed Income', 'Commodities', 'Asset Allocation', 'Alternatives', 'Currency', 'TOTALS']
['Starting Portfolio', 'sell']

I don't think there's any reason to focus on the escaped newlines. This should do the trick:
import re
pattern = r'\b[A-Za-z ]*[A-Za-z]\b'
rows_string = 'Equity & 1,638 & \\$3,227,305 & \\$2,649,208 & \\$3,270,402 & \\$3,114,298 & \\$3,173,369 & \\$2,978,769 & \\$3,016,161 & \\$2,807,840\\\\\nFixed Income & 420 & \\$765,856 & \\$661,395 & \\$824,603 & \\$792,579 & \\$794,224 & \\$783,793 & \\$719,307 & \\$630,298\\\\\nCommodities & 119 & \\$72,911 & \\$66,302 & \\$81,649 & \\$81,633 & \\$79,296 & \\$76,450 & \\$64,136 & \\$63,667\\\\\nAsset Allocation & 63 & \\$10,190 & \\$9,275 & \\$10,684 & \\$10,089 & \\$10,371 & \\$9,829 & \\$9,619 & \\$8,880\\\\\nAlternatives & 55 & \\$5,601 & \\$6,023 & \\$6,715 & \\$6,279 & \\$6,365 & \\$6,645 & \\$6,757 & \\$6,243\\\\\nCurrency & 34 & \\$311 & \\$2,014 & \\$1,665 & \\$1,743 & \\$1,683 & \\$1,666 & \\$1,722 & \\$2,058\\\\\nTOTALS & 2,329 & \\$4,082,173 & \\$3,394,217 & \\$4,195,718 & \\$4,006,620 & \\$4,065,308 & \\$3,857,151 & \\$3,817,700 & \\$3,518,986\\\\'
rows = re.findall(pattern, rows_string)
print(rows)
rows_string2 = 'Starting Portfolio & sell & 21.39\\% & -0.91\\% & 1.52\\% & 9.29\\% & 9.72\\% & 14.89\\% & 38.21\\% & 55.4\\% & & 90.86\\%\\\\'
rows2 = re.findall(pattern, rows_string2)
print(rows2)

To get the values, you might use an alternation to either match the words from the start of the string or get the words before &
(?:^[A-Za-z]+(?: [A-Za-z]+)*|[A-Za-z]+(?: [A-Za-z]+)*(?= &))
(?: Non capturing group
^ Start of the line
[A-Za-z]+(?: [A-Za-z]+)* Match 1+ words with only chars A-Za-z
| Or
[A-Za-z]+(?: [A-Za-z]+)*(?= &) Match words followed by &
) Close group
For example
import re
pattern = r'(?:^[A-Za-z]+(?: [A-Za-z]+)*|[A-Za-z]+(?: [A-Za-z]+)*(?= &))'
rows_string = 'Equity & 1,638 & \\$3,227,305 & \\$2,649,208 & \\$3,270,402 & \\$3,114,298 & \\$3,173,369 & \\$2,978,769 & \\$3,016,161 & \\$2,807,840\\\\\nFixed Income & 420 & \\$765,856 & \\$661,395 & \\$824,603 & \\$792,579 & \\$794,224 & \\$783,793 & \\$719,307 & \\$630,298\\\\\nCommodities & 119 & \\$72,911 & \\$66,302 & \\$81,649 & \\$81,633 & \\$79,296 & \\$76,450 & \\$64,136 & \\$63,667\\\\\nAsset Allocation & 63 & \\$10,190 & \\$9,275 & \\$10,684 & \\$10,089 & \\$10,371 & \\$9,829 & \\$9,619 & \\$8,880\\\\\nAlternatives & 55 & \\$5,601 & \\$6,023 & \\$6,715 & \\$6,279 & \\$6,365 & \\$6,645 & \\$6,757 & \\$6,243\\\\\nCurrency & 34 & \\$311 & \\$2,014 & \\$1,665 & \\$1,743 & \\$1,683 & \\$1,666 & \\$1,722 & \\$2,058\\\\\nTOTALS & 2,329 & \\$4,082,173 & \\$3,394,217 & \\$4,195,718 & \\$4,006,620 & \\$4,065,308 & \\$3,857,151 & \\$3,817,700 & \\$3,518,986\\\\'
print(re.findall(pattern, rows_string, re.M))
rows_string2 = 'Starting Portfolio & sell & 21.39\\% & -0.91\\% & 1.52\\% & 9.29\\% & 9.72\\% & 14.89\\% & 38.21\\% & 55.4\\% & & 90.86\\%\\\\'
print(re.findall(pattern, rows_string2, re.M))
Output
['Equity', 'Fixed Income', 'Commodities', 'Asset Allocation', 'Alternatives', 'Currency', 'TOTALS']
['Starting Portfolio', 'sell']
If all matches should be followed by & you might simplify the pattern to
[A-Za-z]+(?: [A-Za-z]+)*(?= &)

Assuming your target string (financial keyword) comes after a newline (or start of string) and before & you could do:
>>> re.findall(r'(?:\n|^)([A-Za-z ]+)\s&', s)
['Equity', 'Fixed Income', 'Commodities', 'Asset Allocation', 'Alternatives', 'Currency', 'TOTALS']
This takes some shortcuts. Unless you have more complex labels, such as "P&E" or "Misc. Expenses", the above might suffice.
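As an alternative to a single regex, the rows can also be split on the delimiters the string already contains. The following is only a sketch, assuming rows end with two backslashes, cells are separated by &, and the wanted labels are the leading cells that consist purely of letters and spaces. The function name leading_labels is made up.

```python
def leading_labels(rows_string):
    """Collect the leading text-only cells of every row."""
    labels = []
    # "\\\\" is a Python literal for two backslashes, the LaTeX row terminator.
    for row in rows_string.split("\\\\"):
        for cell in row.split("&"):
            cell = cell.strip()
            if cell and all(ch.isalpha() or ch.isspace() for ch in cell):
                labels.append(cell)
            else:
                # Stop at the first empty or non-text cell in this row.
                break
    return labels


# Shortened versions of the strings from the question.
s1 = ('Equity & 1,638 & \\$3,227,305\\\\\n'
      'Fixed Income & 420 & \\$765,856\\\\\n'
      'TOTALS & 2,329 & \\$4,082,173\\\\')
s2 = 'Starting Portfolio & sell & 21.39\\% & -0.91\\% & & 90.86\\%\\\\'

print(leading_labels(s1))
print(leading_labels(s2))
```

Because each row is scanned only until its first non-text cell, this returns both 'Starting Portfolio' and 'sell' for the second string while ignoring any text that might appear later in a row.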

Statsmodel summary_col latex format error

I am using statsmodels' summary_col to produce an output table that summarizes the output of two regressions.
The code for this is
res3 = summary_col([res1,res2],stars=True,float_format='%0.2f',
info_dict={'R2':lambda x: "{:.2f}".format(x.rsquared)})
f = open('res3.tex', 'w')
f.write(res3.as_latex())
f.close()
I use the res3.tex file as input for another tex file, which then generates the results. The problem arises when I convert the table to LaTeX format using as_latex(). The table header shifts to the side in the tex file and looks like this.
The res3.tex file has the following latex code
\begin{table}
\caption{}
\begin{center}
\begin{tabular}{lcc}
\hline
& investment I & investment II \\
\hline
\hline
\end{tabular}
\begin{tabular}{lll}
GDP & 1.35*** & 1.19*** \\
& (0.24) & (0.23) \\
bsent & 0.28*** & 0.26*** \\
& (0.06) & (0.06) \\
rate & -0.22* & -0.65*** \\
& (0.13) & (0.19) \\
research & & 0.80*** \\
& & (0.27) \\
R2 & 0.76 & 0.80 \\
\hline
\end{tabular}
\end{center}\end{table}
The problem seems to arise due to multiple tabular environments. Is there a way to get the investment header on top of the table without manually changing the res3 file (intermediary file)?

You first have to add a couple of things to the LaTeX preamble.
You can probably find an answer to your problem in the question "Outputting Regressions as Table in Python (similar to outreg in stata)?".
res3 = summary_col([res1,res2],stars=True,float_format='%0.2f',
info_dict={'R2':lambda x: "{:.2f}".format(x.rsquared)})
beginningtex = """\\documentclass{report}
\\usepackage{booktabs}
\\begin{document}"""
endtex = "\\end{document}"
f = open('myreg.tex', 'w')
f.write(beginningtex)
f.write(res3.as_latex())
f.write(endtex)
f.close()
All credit to @BKay
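The wrapping step above can also be written as a small helper that uses raw strings (so no backslash needs doubling) and a with block (so the file is closed even on errors). This is just a sketch of the same idea. The names wrap_fragment and write_standalone are made up, and the fragment would be whatever res3.as_latex() returns.

```python
# Raw strings keep the LaTeX backslashes exactly as typed.
PREAMBLE = "\n".join([
    r"\documentclass{report}",
    r"\usepackage{booktabs}",
    r"\begin{document}",
])
CLOSING = r"\end{document}"


def wrap_fragment(fragment):
    """Return a complete LaTeX document containing the given fragment."""
    return PREAMBLE + "\n" + fragment.strip() + "\n" + CLOSING + "\n"


def write_standalone(fragment, path):
    """Write the wrapped fragment to `path`, closing the file afterwards."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(wrap_fragment(fragment))


# Stand-in for res3.as_latex(), which needs fitted statsmodels results.
doc = wrap_fragment("TABLE")
print(doc)
```

With the real results this would be called as write_standalone(res3.as_latex(), 'myreg.tex').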
