Print whole numbers as integers in Pandas LaTeX conversion - python

I'm trying to write a small script to print LaTeX tables based on CSV.
A lot of the functionality formerly in e.g. matrix2latex has now been included in Pandas proper, which is cool.
However, no matter what I do (I tried a number of the other suggestions on here; they turned into a convoluted mess which in effect changed nothing), the output keeps coming out like this:
[deco]/tmp/table ❱ python lala.py
Dataframe:
Unnamed: 0 Treatment Implant Transgenic Untreated Mice Transgenic Treated Mice Wildtype Mice Total Mice
0 P1 Armodafinil VTA 20+15 20.0 5.0 60
1 P2 NaN LC 50 NaN 10.0 60
2 P3 Escitalopram DR 20 20.0 NaN 40
3 P4 Reboxetine LC 20 20.0 NaN 40
LaTeX Table Conversion:
\begin{tabular}{lllllrrr}
& Unnamed: 0 & Treatment & Implant & Transgenic Untreated Mice & Transgenic Treated Mice & Wildtype Mice & Total Mice \\
0 & P1 & Armodafinil & VTA & 20+15 & 20.000000 & 5.000000 & 60 \\
1 & P2 & nan & LC & 50 & nan & 10.000000 & 60 \\
2 & P3 & Escitalopram & DR & 20 & 20.000000 & nan & 40 \\
3 & P4 & Reboxetine & LC & 20 & 20.000000 & nan & 40 \\
\end{tabular}
[deco]/tmp/table ❱ cat lala.py
import pandas as pd
df = pd.read_csv("table.csv")
print("\n")
print("Dataframe:")
print("")
print(df)
tex = df.style.to_latex()
print("\n")
print("LaTeX Table Conversion:")
print("")
print(tex)
[deco]/tmp/table ❱ cat table.csv
,Treatment,Implant,Transgenic Untreated Mice,Transgenic Treated Mice,Wildtype Mice,Total Mice
P1,Armodafinil,VTA,20+15,20,5,60
P2,N/A,LC,50,,10,60
P3,Escitalopram,DR,20,20,,40
P4,Reboxetine,LC,20,20,,40
Is there any way to make sure that whole numbers are always displayed as integers?

The issue you seem to be facing is that missing table entries are interpreted as NaN, which then forces the entire column to float. You can prevent the empty entries from being read as NaN, and have them just left empty, by using keep_default_na=False:
import pandas as pd

# keep_default_na=False stops read_csv from turning empty fields (and "N/A")
# into NaN, so they stay as plain strings
df = pd.read_csv("table.csv", keep_default_na=False)
print(df)
tex = df.style.to_latex()
print(tex)
This leads to the following output:
Unnamed: 0 Treatment Implant Transgenic Untreated Mice Transgenic Treated Mice Wildtype Mice Total Mice
0 P1 Armodafinil VTA 20+15 20 5 60
1 P2 N/A LC 50 10 60
2 P3 Escitalopram DR 20 20 40
3 P4 Reboxetine LC 20 20 40
\begin{tabular}{lllllllr}
& Unnamed: 0 & Treatment & Implant & Transgenic Untreated Mice & Transgenic Treated Mice & Wildtype Mice & Total Mice \\
0 & P1 & Armodafinil & VTA & 20+15 & 20 & 5 & 60 \\
1 & P2 & N/A & LC & 50 & & 10 & 60 \\
2 & P3 & Escitalopram & DR & 20 & 20 & & 40 \\
3 & P4 & Reboxetine & LC & 20 & 20 & & 40 \\
\end{tabular}
If you're dissatisfied with the numbering of the rows, you might want to remove the first comma on the first line of your CSV; the header then has one field fewer than the data rows, so read_csv uses the first column as the index, and the output looks like this:
Treatment Implant Transgenic Untreated Mice Transgenic Treated Mice Wildtype Mice Total Mice
P1 Armodafinil VTA 20+15 20 5 60
P2 N/A LC 50 10 60
P3 Escitalopram DR 20 20 40
P4 Reboxetine LC 20 20 40
\begin{tabular}{llllllr}
& Treatment & Implant & Transgenic Untreated Mice & Transgenic Treated Mice & Wildtype Mice & Total Mice \\
P1 & Armodafinil & VTA & 20+15 & 20 & 5 & 60 \\
P2 & N/A & LC & 50 & & 10 & 60 \\
P3 & Escitalopram & DR & 20 & 20 & & 40 \\
P4 & Reboxetine & LC & 20 & 20 & & 40 \\
\end{tabular}
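One thing to note: with keep_default_na=False, the columns that contain empty fields are read as plain strings (object dtype). That is exactly why the whole numbers print without decimals, but it also means you can't do arithmetic on those columns afterwards.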

You can cast the numeric columns with .astype() before converting to LaTeX. Note that plain int cannot represent missing values, so the columns containing NaN need pandas' nullable "Int64" dtype, and "Transgenic Untreated Mice" has to stay as text because of the "20+15" entry:
import pandas as pd

df = pd.read_csv("table.csv")
# Cast the numeric columns; "Int64" (capital I) is the nullable integer
# dtype, which tolerates the NaN entries where plain int would raise
df = df.astype({"Transgenic Treated Mice": "Int64",
                "Wildtype Mice": "Int64",
                "Total Mice": int})
# Render missing values as empty cells instead of "nan"
tex = df.style.format(na_rep="").to_latex()
print("\n")
print("Dataframe:")
print("")
print(df)
print("\n")
print("LaTeX Table Conversion:")
print("")
print(tex)
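If you'd rather leave the dtypes alone, the format option can instead control just the rendering of those columns. A sketch (subset and na_rep are standard Styler.format parameters; the column names are taken from the CSV above):
# Display-only variant: keep the float dtypes, print the two float columns
# without decimal places, and show missing values as empty cells
tex = df.style.format("{:.0f}", subset=["Transgenic Treated Mice", "Wildtype Mice"],
                      na_rep="").to_latex()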

A more general way to do that would be to replace the missing values with an integer (if possible) and convert the datatypes using convert_dtypes.
The same works without replacing the NaN values first; convert_dtypes then promotes the dtype to the nullable Int64.
import pandas as pd

df = pd.read_csv("table.csv")
print("\n")
print("Dataframe:")
print("")
print(df)
# fillna(-1) marks the missing entries with an integer so that
# convert_dtypes() can infer Int64 for the numeric columns
tex = df.fillna(-1).convert_dtypes().to_latex()
print("\n")
print("LaTeX Table Conversion:")
print("")
print(tex)
LaTeX Table Conversion:
\begin{tabular}{lllllrrr}
\toprule
{} & Unnamed: 0 & Treatment & Implant & Transgenic Untreated Mice & Transgenic Treated Mice & Wildtype Mice & Total Mice \\
\midrule
0 & P1 & Armodafinil & VTA & 20+15 & 20 & 5 & 60 \\
1 & P2 & -1 & LC & 50 & -1 & 10 & 60 \\
2 & P3 & Escitalopram & DR & 20 & 20 & -1 & 40 \\
3 & P4 & Reboxetine & LC & 20 & 20 & -1 & 40 \\
\bottomrule
\end{tabular}
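The variant without filling the NaNs first, mentioned above, might look like this (a sketch; to_latex's na_rep argument renders the remaining <NA> entries as empty cells):
import pandas as pd

df = pd.read_csv("table.csv")
# convert_dtypes() promotes the integral float columns to nullable Int64,
# so whole numbers print without a decimal point even with NaNs present
tex = df.convert_dtypes().to_latex(na_rep="")
print(tex)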

Related

Using str.extract with regex on pandas df column

I have some address info. stored in a pandas df column like below:
df['Addr']
LT 75 CEDAR WOOD 3RD PL
LTS 22,25 & 26 MULLINS CORNER
LTS 7 & 8
PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION
I want to extract lot information and create a new column so for the example above, my expected results are as below:
df['Lot']
75
22,25 & 26
7 & 8
22-23
This is my code:
df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LT|LTS?) (\w+(?:-\d+)*)')
Results I'm getting is:
75
22
7
22-23
How can I modify my regex for expected results if at all possible? Please advise.
You could use
\b(?:LOT|LTS?) (\d+(?:(?:[-,]| & )\d+)*)
Explanation
\b A word boundary
(?:LOT|LTS?) Match LOT or LT or LTS
( Capture group 1
\d+ Match 1+ digits
(?:(?:[-,]| & )\d+)* Optionally repeat either - or , or & followed by 1+ digits
) Close group 1
import pandas as pd

data = [
    "LT 75 CEDAR WOOD 3RD PL",
    "LTS 22,25 & 26 MULLINS CORNER",
    "LTS 7 & 8",
    "PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION"
]
df = pd.DataFrame(data, columns=['Addr'])
df['Lot'] = df['Addr'].str.extract(r'\b(?:LOT|LTS?) (\d+(?:(?:[-,]| & )\d+)*)')
print(df)
Output
Addr Lot
0 LT 75 CEDAR WOOD 3RD PL 75
1 LTS 22,25 & 26 MULLINS CORNER 22,25 & 26
2 LTS 7 & 8 7 & 8
3 PT LT 22-23 JEFFERSON HIGHLANDS EXTENSION 22-23
If the - , and & can all be surrounded by optional whitespace chars, you might shorten the pattern to:
\b(?:LOT|LTS?) (\d+(?:\s*[-,&]\s*\d+)*)\b
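A quick, made-up check of that whitespace-tolerant variant:
import pandas as pd

# Hypothetical address with irregular spacing around the separators
s = pd.Series(["LTS 22 , 25 &26 MULLINS CORNER"])
print(s.str.extract(r'\b(?:LOT|LTS?) (\d+(?:\s*[-,&]\s*\d+)*)\b'))
# captures "22 , 25 &26"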

How to prevent selecting first row as index column

I have come to encounter a problem when reading my data: the first column is assigned as the index column even though I use index_col=None. A similar issue was posted as pandas read_csv index_col=None not working with delimiters at the end of each line.
import pandas as pd

raw_data = {'patient': ['spried & roy']*5,
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': ['strong', 'weak', 'normal', 'weak', 'strong'],
            }
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
patient obs treatment score
0 spried & roy 1 0 strong
1 spried & roy 2 1 weak
2 spried & roy 3 0 normal
3 spried & roy 1 1 weak
4 spried & roy 2 0 strong
Writing df to CSV in tab-separated format:
df.to_csv('xgboost.txt', sep='\t', index=False)
Reading it back again:
read_df=pd.read_table(r'xgboost.txt', header=0,index_col=None, skiprows=0, skipfooter=0, sep="\t",delim_whitespace=True)
read_df
patient obs treatment score
spried & roy 1 0 strong
& roy 2 1 weak
& roy 3 0 normal
& roy 1 1 weak
& roy 2 0 strong
As we can see, the patient column got separated into spried & and roy, and spried & became the index column even though I explicitly wrote index_col=None.
How can we correctly read the patient column as it is, and control whether an index column exists or not?
Thanks
Just remove delim_whitespace=True, because it uses whitespace as the separator instead of tabs and overrides your sep='\t'. Here only the sep='\t' parameter and the file name are needed:
df.to_csv('xgboost.txt', sep='\t', index=False)
read_df=pd.read_table(r'xgboost.txt', sep="\t")
print (read_df)
patient obs treatment score
0 spried & roy 1 0 strong
1 spried & roy 2 1 weak
2 spried & roy 3 0 normal
3 spried & roy 1 1 weak
4 spried & roy 2 0 strong
Another idea is to write the file with a whitespace separator, so delim_whitespace=True works nicely:
df.to_csv('xgboost.txt', sep=' ', index=False)
read_df=pd.read_table(r'xgboost.txt', delim_whitespace=True)
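Note that recent pandas versions deprecate delim_whitespace in favour of an equivalent regex separator, so the second variant can also be written as:
read_df = pd.read_table('xgboost.txt', sep=r'\s+')  # same effect as delim_whitespace=True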

Fuzzy String match and merge database - Dataframe

I have two dataframes (with strings) that I am trying to compare to each other. One has a list of areas; the other has a list of areas with long/lat info as well. I am struggling to write code that performs the following:
1) Check if the string in df1 matches (or partially matches) the area names in df2; if it does, merge and carry over the long/lat columns.
2) If an entry in df1 does not match anything in df2, the new columns will have NaN or zero.
Code:
import pandas as pd
df1 = pd.read_csv('Dubai Communities1.csv')
df1.head()
CNAME_E1
0 Abu Hail
1 Al Asbaq
2 Al Aweer First
3 Al Aweer Second
4 Al Bada
df2 = pd.read_csv('Dubai Communities2.csv')
df2.head()
COMM_NUM CNAME_E2 Latitude Longitude
0 315 UMM HURAIR 55.3237 25.2364
1 917 AL MARMOOM 55.4518 24.9756
2 624 WARSAN 55.4034 25.1424
3 123 AL MUTEENA 55.3228 25.2739
4 813 AL ROWAIYAH 55.3981 25.1053
The output after search and join will look like this:
CName_E1 CName_E3 Latitude Longitude
0 Area1 Area1 22 7.25
1 Area2 Area2 38 71.83
2 Area3 NaN NaN NaN
3 Area4 Area4 35 8.05
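No answer is reproduced here, but one common approach is difflib from the standard library: map each name in df1 to its closest match in df2, then do a left merge. A minimal sketch with made-up data (the 0.8 cutoff and the upper-casing are assumptions, not part of the question):
import difflib
import pandas as pd

df1 = pd.DataFrame({'CNAME_E1': ['Abu Hail', 'Al Aweer First', 'Al Bada']})
df2 = pd.DataFrame({'CNAME_E2': ['ABU HAIL', 'AL AWEER FIRST', 'UMM HURAIR'],
                    'Latitude': [55.32, 55.40, 55.32],
                    'Longitude': [25.24, 25.14, 25.24]})

def closest(name, candidates, cutoff=0.8):
    # Best fuzzy match above the cutoff, or None when nothing is close enough
    hits = difflib.get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return hits[0] if hits else None

df1['CNAME_E2'] = df1['CNAME_E1'].apply(closest, candidates=df2['CNAME_E2'].tolist())
# Left merge keeps unmatched rows, leaving NaN in Latitude/Longitude
merged = df1.merge(df2, on='CNAME_E2', how='left')
print(merged)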

Python looping two different dataFrames to create a new column

I want to add a new column to a dataframe by referencing another dataframe.
I want to run an if statement using the startswith method to match the df1['BSI'] column against df2['initial'], assign the corresponding df2['marker'], and give df1 a new column of markers, which I will use for the cartopy marker style.
I am having trouble looping over df2 inside a df1 loop. I basically can't figure out how to use the current df1 item inside the df2 loop for the comparison.
df1 looks like this:
BSI Shelter_Number Location Latitude Longitude
0 AA-010 1085 SUSSEX (N SIDE) & RIDEAU FALLS 45.439571 -75.695694
1 AA-030 3690 SUSSEX (E SIDE) & ALEXANDER NS 45.442795 -75.692322
2 AA-180 279 CRICHTON (E SIDE) & BEECHWOOD FS 45.439556 -75.676849
3 AA-200 2018 BEECHWOOD (S SIDE) & CHARLEVOIX NS 45.441154 -75.673622
4 AA-220 3301 BEECHWOOD (S SIDE) & MAISONNEUVE NS 45.442188 -75.671356
df2 looks like this:
initial marker
0 AA bo
1 AB bv
2 AC b^
3 AD b<
4 AE b>
desired output is:
BSI Shelter_Number Location Latitude Longitude marker
0 AA-010 1085 SUSSEX (N SIDE) & RIDEAU FALLS 45.439571 -75.695694 bo
1 AA-030 3690 SUSSEX (E SIDE) & ALEXANDER NS 45.442795 -75.692322 bo
2 AA-180 279 CRICHTON (E SIDE) & BEECHWOOD FS 45.439556 -75.676849 bo
3 AA-200 2018 BEECHWOOD (S SIDE) & CHARLEVOIX NS 45.441154 -75.673622 bo
4 AA-220 3301 BEECHWOOD (S SIDE) & MAISONNEUVE NS 45.442188 -75.671356 bo
Use map. In fact, there are many similar answers using map; the only difference here is that you are matching on only part of the BSI column in df1:
df1['marker'] = df1['BSI'].str.extract('(.*)-', expand = False).map(df2.set_index('initial').marker)
BSI Shelter_Number Location Latitude Longitude marker
0 AA-010 1085 SUSSEX (N SIDE) & RIDEAU FALLS 45.439571 -75.695694 bo
1 AA-030 3690 SUSSEX (E SIDE) & ALEXANDER NS 45.442795 -75.692322 bo
2 AA-180 279 CRICHTON (E SIDE) & BEECHWOOD FS 45.439556 -75.676849 bo
3 AA-200 2018 BEECHWOOD (S SIDE) & CHARLEVOIX NS 45.441154 -75.673622 bo
4 AA-220 3301 BEECHWOOD (S SIDE) & MAISONNEUVE NS 45.442188 -75.671356 bo
You can create a dictionary from your df2 and then map df1 to create the new column. If all of your entries in BSI have the same format as provided, then it's simple to just select the first two letters. If it needs to be more complicated, like everything before the first hyphen, then you can use regex.
Here's some test data
import pandas as pd

df1 = pd.DataFrame({'BSI': ['AA-010', 'AA-030', 'AA-180', 'AA-200', 'AA-220'],
                    'Latitude': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'initial': ['AA', 'AB', 'AC', 'AD', 'AE'],
                    'marker': ['bo', 'bv', 'b^', 'b<', 'b>']})
Here's the mapping
dct = pd.Series(df2.marker.values, index=df2.initial).to_dict()
df1['marker'] = df1['BSI'].str[0:2].map(dct)
BSI Latitude marker
0 AA-010 1 bo
1 AA-030 2 bo
2 AA-180 3 bo
3 AA-200 4 bo
4 AA-220 5 bo
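If the prefix can be longer than two characters, taking everything before the first hyphen avoids hard-coding the width:
# Alternative: split on the first hyphen, then map as before
df1['marker'] = df1['BSI'].str.split('-').str[0].map(dct)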

Pandas DataFrame [cell=(label,value)], split into 2 separate dataframes

I found an awesome way to parse html with pandas. My data is in kind of a weird format (attached below). I want to split this data into 2 separate dataframes.
Notice how each cell is a label followed by a value in parentheses... is there any really efficient method to split all of these cells and create 2 dataframes, one for the labels and one for the (value) in parentheses?
NumPy has all those ufuncs; is there a way I can use them on string dtypes, since they can be converted to np.array with DF.as_matrix()? I'm trying to steer clear of for loops. I could iterate through all the indices and populate an empty array, but that's pretty barbaric.
I'm using Beaker Notebook btw, it's really cool (HIGHLY RECOMMENDED)
import pandas as pd

# Set URL destination
url = "http://www.reef.org/print/db/stats"
# Process raw table
DF_raw = pd.read_html(url)[0]
# Get start/end indices of table
start_label = "10 Most Frequent Species"; start_idx = (DF_raw.iloc[:, 0] == start_label).argmax()
end_label = "Top 10 Sites for Species Richness"; end_idx = (DF_raw.iloc[:, 0] == end_label).argmax()
# Process table
DF_freqSpecies = pd.DataFrame(
    DF_raw.as_matrix()[(start_idx + 1):end_idx, :],  # .as_matrix() is .to_numpy() in modern pandas
    columns=DF_raw.iloc[0, :]
)
DF_freqSpecies
# Split these into 2 separate DataFrames
#Split these into 2 separate DataFrames
Here's my naive way of doing such:
import re
import numpy as np

DF_species = pd.DataFrame(np.zeros_like(DF_freqSpecies), columns=DF_freqSpecies.columns)
DF_freq = pd.DataFrame(np.zeros_like(DF_freqSpecies).astype(str), columns=DF_freqSpecies.columns)
dims = DF_freqSpecies.shape
for i in range(dims[0]):
    for j in range(dims[1]):
        # Parse current cell, e.g. "Bluehead (85)" -> "Bluehead", "85)";
        # splitting on \s\( keeps all the digits (the pattern \s\(\d would eat the first one)
        species, freq = re.split(r"\s\(", DF_freqSpecies.iloc[i, j])
        freq = float(freq[:-1])
        # Populate split DataFrames
        DF_species.iloc[i, j] = species
        DF_freq.iloc[i, j] = freq
I want these 2 dataframes as my output:
(1) Species;
and (2) Frequencies
You can do it this way:
DF1:
In [182]: df1 = DF_freqSpecies.replace(r'\s*\(\d+\.*\d*\)', '', regex=True)
In [183]: df1.head()
Out[183]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska \
0 Bluehead Copper Rockfish
1 Blue Tang Lingcod
2 Stoplight Parrotfish Painted Greenling
3 Bicolor Damselfish Sunflower Star
4 French Grunt Plumose Anemone
0 Hawaii Tropical Eastern Pacific \
0 Saddle Wrasse King Angelfish
1 Hawaiian Whitespotted Toby Mexican Hogfish
2 Raccoon Butterflyfish Barberfish
3 Manybar Goatfish Flag Cabrilla
4 Moorish Idol Panamic Sergeant Major
0 South Pacific Northeast US and Eastern Canada \
0 Regal Angelfish Cunner
1 Bluestreak Cleaner Wrasse Winter Flounder
2 Manybar Goatfish Rock Gunnel
3 Brushtail Tang Pollock
4 Two-spined Angelfish Grubby Sculpin
0 South Atlantic States Central Indo-Pacific
0 Slippery Dick Moorish Idol
1 Belted Sandfish Three-spot Dascyllus
2 Black Sea Bass Bluestreak Cleaner Wrasse
3 Tomtate Blacklip Butterflyfish
4 Cubbyu Clark's Anemonefish
and DF2
In [193]: df2 = DF_freqSpecies.replace(r'.*\((\d+\.*\d*)\).*', r'\1', regex=True)
In [194]: df2.head()
Out[194]:
0 Tropical Western Atlantic California, Pacific Northwest and Alaska Hawaii \
0 85 54.6 92
1 84.8 53.2 85.8
2 81 50.8 85.7
3 79.9 50.2 85.7
4 74.8 49.7 82.9
0 Tropical Eastern Pacific South Pacific Northeast US and Eastern Canada \
0 85.7 79 67.4
1 82.5 77.3 46.6
2 75.2 73.9 26.2
3 68.9 73.3 25.2
4 67.9 72.8 23.7
0 South Atlantic States Central Indo-Pacific
0 79.7 80.1
1 78.5 75.6
2 78.5 73.5
3 72.7 71.4
4 65.7 70.2
RegEx debugging and explanation:
We basically want to remove everything except the number in parentheses:
(\d+\.*\d*) - group(1) - it's our number
\((\d+\.*\d*)\) - our number in parentheses
.*\((\d+\.*\d*)\).* - the whole thing: anything before '(', then '(', our number, ')', and anything up to the end of the cell
The whole match is replaced with group(1), our number.
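A quick way to sanity-check the two substitutions on a single made-up cell value:
import re

cell = "Bluehead (85)"
print(re.sub(r'\s*\(\d+\.*\d*\)', '', cell))        # -> 'Bluehead'
print(re.sub(r'.*\((\d+\.*\d*)\).*', r'\1', cell))  # -> '85'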
