I want SIMBAD to treat the dash(hyphen) as a space

I have code that uses astroquery's Simbad module to query star names. Simbad works with names like "LP 944-20", but my data contains names like "LP-944-20". How can I make the code ignore that first dash (hyphen)?
My code:
from astroquery.simbad import Simbad
result_table = Simbad.query_object("LP-944-20", wildcard=True)
print(result_table)

One simple approach would be to just replace the first hyphen with a space:
inp = ["LP-944-20", "944-20", "20"]
output = [x.replace("-", " ", 1) for x in inp]
print(output) # ['LP 944-20', '944 20', '20']
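If some identifiers in the data are already well formed, or start with a number, replacing the first hyphen unconditionally changes them too, as the '944-20' case above shows. A minimal sketch of a stricter variant, assuming the catalogue prefix is purely alphabetic; `normalize_name` is a hypothetical helper, not part of astroquery:

```python
import re

# Hypothetical helper: only touch a hyphen that directly follows a leading
# alphabetic catalogue prefix such as "LP".
_PREFIX_HYPHEN = re.compile(r"^([A-Za-z]+)-")

def normalize_name(name):
    """Turn 'LP-944-20' into 'LP 944-20'; leave other names unchanged."""
    return _PREFIX_HYPHEN.sub(r"\1 ", name.strip())

names = ["LP-944-20", "LP 944-20", "944-20"]
print([normalize_name(n) for n in names])
# ['LP 944-20', 'LP 944-20', '944-20']
```

The normalized string can then be passed to Simbad.query_object as in the question.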

Related

Is there a function to format the index name in a pandas Styler (DataFrame.style.to_latex) so that it is LaTeX-escaped?

I am trying to format the index name so that it is LaTeX-escaped when using .to_latex().
Using .format_index() works only for the index values but not for the index names.
Here is a minimal, reproducible example.
import pandas as pd
import numpy as np
import pylatex as pl
dict1 = {
    'employee_w': ['John_Smith','John_Smith','John_Smith', 'Marc_Jones','Marc_Jones', 'Tony_Jeff', 'Maria_Mora','Maria_Mora'],
    'customer&client': ['company_1','company_2','company_3','company_4','company_5','company_6','company_7','company_8'],
    'calendar_week': [18,18,19,21,21,22,23,23],
    'sales': [5,5,5,5,5,5,5,5],
}
df1 = pd.DataFrame(data=dict1)
ptable = pd.pivot_table(
    df1,
    values='sales',
    index=['employee_w','customer&client'],
    columns=['calendar_week'],
    aggfunc=np.sum
)
mystyler = ptable.style
mystyler.format(na_rep='-', precision=0, escape="latex")
mystyler.format_index(escape="latex", axis=0)
mystyler.format_index(escape="latex", axis=1)
latex_code1 = mystyler.to_latex(
    column_format='|c|c|c|c|c|c|c|',
    multirow_align="t",
    multicol_align="r",
    clines="all;data",
    hrules=True,
)
# latex_code1 = latex_code1.replace("employee_w", "employee")
# latex_code1 = latex_code1.replace("customer&client", "customer and client")
# latex_code1 = latex_code1.replace("calendar_week", "week")
doc = pl.Document(geometry_options=['a4paper'], document_options=["portrait"], textcomp = None)
doc.packages.append(pl.Package('newtxtext,newtxmath'))
doc.packages.append(pl.Package('textcomp'))
doc.packages.append(pl.Package('booktabs'))
doc.packages.append(pl.Package('xcolor',options= pl.NoEscape('table')))
doc.packages.append(pl.Package('multirow'))
doc.append(pl.NoEscape(latex_code1))
doc.generate_pdf('file1.pdf', clean_tex=False, silent=True)
Replacing the names with .replace(), as in the commented-out lines, gives the desired result.
But I'm dealing with hundreds of tables with unknown index/column names.
The goal is to generate PDF files automatically with PyLaTeX, so an HTML-based option does not help me.
Thanks in advance!

I coded all the Styler.to_latex features, and I'm afraid the index names are currently not formatted, which also means they are not escaped. So there is no direct function that does what you want. (By the way, it's great to see an example that uses so many of the features, including the hrules table style definitions.) I have just created an issue about this on the pandas GitHub.
However, the code itself contains an _escape_latex(s) function in pandas/io/formats/style_render.py:
def _escape_latex(s):
    r"""
    Replace the characters ``&``, ``%``, ``$``, ``#``, ``_``, ``{``, ``}``,
    ``~``, ``^``, and ``\`` in the string with LaTeX-safe sequences.
    Use this if you need to display text that might contain such characters in LaTeX.

    Parameters
    ----------
    s : str
        Input to be escaped

    Return
    ------
    str :
        Escaped string
    """
    return (
        s.replace("\\", "ab2§=§8yz")  # rare string for final conversion: avoid \\ clash
        .replace("ab2§=§8yz ", "ab2§=§8yz\\space ")  # since \backslash gobbles spaces
        .replace("&", "\\&")
        .replace("%", "\\%")
        .replace("$", "\\$")
        .replace("#", "\\#")
        .replace("_", "\\_")
        .replace("{", "\\{")
        .replace("}", "\\}")
        .replace("~ ", "~\\space ")  # since \textasciitilde gobbles spaces
        .replace("~", "\\textasciitilde ")
        .replace("^ ", "^\\space ")  # since \textasciicircum gobbles spaces
        .replace("^", "\\textasciicircum ")
        .replace("ab2§=§8yz", "\\textbackslash ")
    )
So your best bet is to reformat the input dataframe and escape the index name before you do any styling to it:
df.index.name = _escape_latex(df.index.name)
# then continue with your previous styling code
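The line above covers a single-level index. The pivot table in the question has a two-level row index plus a named column index, so every level name would need escaping. A minimal sketch under those assumptions: `escape_latex` here is a simplified stand-in that covers only a few characters (the pandas function shown above is private and may move between versions), and `escape_names` is a hypothetical helper:

```python
def escape_latex(s):
    """Simplified stand-in for the pandas function above: handles & % $ # _ { } only."""
    for ch in "&%$#_{}":
        s = s.replace(ch, "\\" + ch)
    return s

def escape_names(names, escape=escape_latex):
    """Escape every level name; unnamed levels (None) are left untouched."""
    return [escape(n) if isinstance(n, str) else n for n in names]

print(escape_names(["employee_w", "customer&client", None]))

# Assumed usage with the pivot table from the question, before styling:
# ptable.index.names = escape_names(ptable.index.names)
# ptable.columns.names = escape_names(ptable.columns.names)
```

Passing the full function shown above as the `escape` argument would also cover backslashes, tildes and carets.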

Is there any way to run the below code faster?

I ran the code below on about 20k rows. The code works and I get the expected output, but it is very slow: it took almost 45 minutes. Can someone suggest a faster approach?
Code:
import numpy as np
import pandas as pd
import re
def demoji(text):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)
df = pd.read_csv("data.csv")
print(df['Body'])
tweets=df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["",""], regex=True)
tweets[u'Body'] = tweets[u'Body'].astype(str)
tweets[u'Body'] = tweets[u'Body'].apply(lambda x:demoji(x))
#Preprocessing of RT #blablabla:
tweets['tweetos'] = ''
#add tweetos first part
for i in range(len(tweets['Body'])):
    try:
        tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]
    except AttributeError:
        tweets['tweetos'][i] = 'other'
#Preprocessing tweetos. select tweetos contains 'RT #'
for i in range(len(tweets['Body'])):
    if tweets['tweetos'].str.contains('#')[i] == False:
        tweets['tweetos'][i] = 'other'
# remove URLs, RTs, and twitter handles
for i in range(len(tweets['Body'])):
    tweets['Body'][i] = " ".join([word for word in tweets['Body'][i].split()
                                  if 'http' not in word and '#' not in word and '<' not in word])
This code cleans the text: it removes special characters such as /n, Twitter mentions, and so on.

Whenever you work with pandas and start iterating over DataFrame content, there's a good chance that your approach can be improved. Try to stick to the native pandas tools/methods, which are highly optimized! Also, watch out for repetition: in your code you do some things over and over again. For example, in every iteration of
the first loop you split the whole Body column (tweets['tweetos'][i] = tweets['Body'].str.split(' ')[i][0]), only to pick one item from the result;
the second loop you evaluate a complete column of the frame (tweets['tweetos'].str.contains('#')), only to pick one item from the result.
Your code could probably look like this:
import pandas as pd
import re
df = pd.read_csv("data.csv")
tweets = df.replace(to_replace=[r"\\t|\\n|\\r", "\t|/n|/r|w/|\n|w/|Quote::"], value=["",""], regex=True)
# Why not tweets = df.replace(r'\\t|\\n|\\r|\t|/n|/r|w/|\n|Quote::', '', regex=True) ?
re_emoji = re.compile(...) # As in your code
# regex=True is needed in recent pandas versions when passing a compiled pattern
tweets.Body = tweets.Body.astype(str).str.replace(re_emoji, '', regex=True) # Is the astype(str) necessary?
body_split = tweets.Body.str.split()
tweets['tweetos'] = body_split.map(lambda l: 'other' if not l else l[0])
# .loc avoids chained assignment, which may silently fail to update the frame
tweets.loc[~tweets.tweetos.str.contains('#'), 'tweetos'] = 'other'
re_discard = re.compile(r'http|#|<')
tweets.Body = (body_split.map(lambda l: [w for w in l if not re_discard.search(w)])
               .str.join(' '))
Be aware that I don't have any real insight into the data you're working with - you haven't provided a sample. So there might be bugs in the code I've proposed.
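To illustrate the repetition described above without needing the original data, here is a plain-Python analogue (not pandas code; the function names and sample strings are made up). The slow version redoes the whole-column split on every iteration, so the work grows with the square of the row count, which for 20k rows means on the order of 400 million splits:

```python
def first_tokens_slow(bodies):
    """Mimics the question's loop: the whole 'column' is re-split on every iteration."""
    result = []
    for i in range(len(bodies)):
        split_all = [b.split(' ') for b in bodies]  # O(n) work, repeated n times
        result.append(split_all[i][0])
    return result

def first_tokens_fast(bodies):
    """Split each row exactly once, in a single pass over the data."""
    return [b.split(' ')[0] for b in bodies]

bodies = ["RT #python is fun", "hello world", "#pandas tips"]
assert first_tokens_slow(bodies) == first_tokens_fast(bodies)
print(first_tokens_fast(bodies))  # ['RT', 'hello', '#pandas']
```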

How to put all words from a DataFrame column into normal form

I need to put all the words in one column of a DataFrame into normal form (using pymorphy2).
For example, I have:
First Sec
My Я вчера видел цветы красных цветов
After that, I need to get:
First Sec
My я вчера видеть цвета красных цветок

Try the following and let me know how you get on.
By the way, I don't know how to use pymorphy2, and the documentation is in Russian, which I don't speak, so you may need to adjust the line that calls it.
import pandas as pd
import pymorphy2
data = pd.read_excel(r'your_file.xlsx')
# Build the analyzer once; creating it for every word is slow.
morph = pymorphy2.MorphAnalyzer()
def converter(sentence):
    # normal_form is the lemma of the first (most likely) parse of each word
    words = sentence.split()
    return ' '.join(morph.parse(item)[0].normal_form for item in words)
data['column_to_convert'] = data['column_to_convert'].apply(converter)

Running multiple Regex once

I have a very long file that I managed to parse with Python regular expressions, one value at a time. For example, here is the code I'm using to print all the values between the <h2> tags:
import os
import re
def query():
    with open('company.txt', 'r') as f:
        names = re.findall(r'<h2>(.*?)</h2>', f.read(), re.DOTALL)
    for name in names:
        print(name)
if __name__=="__main__":
    query()
I repeat the same thing to print the area_code, just replacing the pattern in the findall call. This means I have to run the code twice.
My question is: is there a way to run the two queries at the same time and print the results on one line separated by a pipe (|)?
Like so: Planner | B21
Below is a short sample of the file I'm trying to parse.
<h2>Planner</h2>
area_place = 'City of Angels';
area_code = 'B21';
period = 'Summer';
... more content
<h2>Executive</h2>
area_place = 'London';
area_code = 'D33';
period = 'Winter';
...more content

This works for me with your test data in Python 2.7; give it a try:
import os
import re
def query():
    with open('company.txt', 'r') as f:
        names = re.findall(r"<h2>(.+?)</h2>.*?area_code = '(.+?)'", f.read(), re.DOTALL)
    for name in names:
        print(name[0] + " | " + name[1])
if __name__=="__main__":
    query()
Basically, I'm just incorporating both queries into one, and then specifying the capture group numerically. You may want to rename "names" since it makes less sense the way I'm doing it.
Alternatively, if you'd like to keep your existing queries and you can assume that they return the same number of matches, you could do something like this (text holds the file contents):
names = re.findall(names_regex, text, re.DOTALL)  # your names regex
area_codes = re.findall(area_code_regex, text, re.DOTALL)  # your area code regex
for i in range(len(names)):  # very dangerous: if one match fails, many entries may be mismatched!
    print(names[i] + " | " + area_codes[i])
However, I would not recommend this approach unless you're extremely confident in the regularity of your data.
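For reference, the combined-pattern approach can be sketched as a self-contained Python 3 snippet that runs against the sample from the question. The named groups and the `extract_pairs` helper are illustrative choices, not part of the original answer:

```python
import re

SAMPLE = """<h2>Planner</h2>
area_place = 'City of Angels';
area_code = 'B21';
period = 'Summer';
<h2>Executive</h2>
area_place = 'London';
area_code = 'D33';
period = 'Winter';
"""

# One pattern captures the heading and the first area_code that follows it.
PATTERN = re.compile(
    r"<h2>(?P<name>.+?)</h2>.*?area_code = '(?P<code>.+?)'", re.DOTALL
)

def extract_pairs(text):
    """Return (name, area_code) tuples in document order."""
    return [(m.group("name"), m.group("code")) for m in PATTERN.finditer(text)]

for name, code in extract_pairs(SAMPLE):
    print(f"{name} | {code}")
```

Because each heading and its code are captured by the same match, a section without an area_code cannot shift later results by one position the way two separate lists can. It can still mispair, though, since the lazy `.*?` would run on to the next section's area_code.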

Using the split function in Python

I am working with the csv module, and I am writing a simple program that takes the names of several authors listed in a file and formats them like this: john.doe
So far I've achieved the results I want, but I am having trouble getting the code to exclude titles such as "Mr.", "Mrs.", etc. I've been thinking about using the split function, but I am not sure whether this would be a good use for it.
Any suggestions? Thanks in advance!
Here's my code so far:
import csv
books = csv.reader(open("books.csv","rU"))
for row in books:
    print('.'.join([item.lower() for item in [row[index] for index in (1, 0)]]))

It depends on how messy the strings are; in the worst case, this regexp-based solution should do the job:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
x.sub("", text)
(I'm using re.compile() here because, for some reason, Python 2.6's re.sub doesn't accept the flags= keyword argument.)
UPDATE: I wrote some code to test this and, although I wasn't able to figure out a way to check the results automatically, it looks like it works fine. This is the test code:
import re
x=re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)
names = ["".join([a,b,c,d]) for a in ['', ' ', ' ', '..', 'X'] for b in ['mr', 'Mr', 'miss', 'Miss', 'mrs', 'Mrs', 'ms', 'Ms'] for c in ['', '.', '. ', ' '] for d in ['Aaaaa', 'Aaaa Bbbb', 'Aaa Bbb Ccc', ' aa ']]
print("\n".join([" => ".join((n,x.sub('',n))) for n in names]))
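One way to check the results automatically is a small table of inputs and expected outputs. This is only a sketch: the sample names are made up, and `strip_title` is a hypothetical wrapper around the pattern above.

```python
import re

TITLE_RE = re.compile(r"^\s*(mr|mrs|ms|miss)[\.\s]+", flags=re.IGNORECASE)

def strip_title(name):
    """Remove a leading title followed by a dot and/or whitespace."""
    return TITLE_RE.sub("", name)

# Input -> expected output; the last three must be left untouched.
cases = {
    "Mr. John": "John",
    "  Mrs Jane Doe": "Jane Doe",
    "miss. Ann": "Ann",
    "Ms Lee": "Lee",
    "Mrs": "Mrs",                # no separator after the title
    "Xmr Smith": "Xmr Smith",    # title is not at the start
    "Mrinal Sen": "Mrinal Sen",  # name that merely begins with "Mr"
}

for raw, expected in cases.items():
    assert strip_title(raw) == expected, (raw, strip_title(raw))
print("all cases pass")
```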

Depending on the complexity of your data and the scope of your needs you may be able to get away with something as simple as stripping titles from the lines in the csv using replace() as you iterate over them.
Something along the lines of:
titles = ["Mr.", "Mrs.", "Ms", "Dr"] #and so on
for line in lines:
    line_data = line
    for title in titles:
        line_data = line_data.replace(title,"")
    #your code for processing the line
This may not be the most efficient method, but depending on your needs it may be a good fit.
Here is how this could work with the code you posted (I am guessing the Mr./Mrs. is part of column 1, the first name):
import csv
titles = ["Mr.", "Mrs.", "Ms", "Dr"] #and so on
books = csv.reader(open("books.csv","rU"))
for row in books:
    first_name = row[1]
    last_name = row[0]
    for title in titles:
        first_name = first_name.replace(title,"")
    # strip() drops the space left behind by the removed title
    print('.'.join([first_name.strip().lower(), last_name.lower()]))
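Note that a plain replace() removes the title text wherever it occurs, so a first name such as "Drew" would lose its "Dr". A minimal Python 3 sketch that anchors the match at the start of the field instead, assuming column 0 holds the last name and column 1 the first name as in the question; the sample rows and `to_username` are made up for illustration:

```python
import csv
import io
import re

# Title must be at the start and be followed by a dot and/or whitespace.
TITLE_RE = re.compile(r"^\s*(mr|mrs|ms|miss|dr)[\.\s]+", re.IGNORECASE)

def to_username(row):
    """Build 'first.last' from a [last, first] row, dropping a leading title."""
    last, first = row[0], row[1]
    first = TITLE_RE.sub("", first).strip()
    return ".".join([first.lower(), last.strip().lower()])

# In-memory stand-in for books.csv
SAMPLE_CSV = "Doe,Mr. John\nSmith,Mrs Jane\nBrown,Drew\n"

usernames = [to_username(row) for row in csv.reader(io.StringIO(SAMPLE_CSV))]
print(usernames)  # ['john.doe', 'jane.smith', 'drew.brown']
```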
