if then else in Regex in DataFrame - python

I'm trying to clean up a badly scrambled phonebook .xls with several thousand records. Some fields are merged with others and/or saved in the wrong column, while other fields are split across two or more columns, and so on. I'm trying to find the patterns behind the main errors and fix them with regex, placing each value in the right column.
An example:
DataFrame as df:
id
Name
SecondName
Surname
Title
Company
01
Marc
Gigio
ETC ltd
02
Piero (Four
Season
Restaurant
)
03
bubbu(Caterpilar)
04
gaby(ts Inc)
05
Pit(REV inc)
REV Inc
06
Pluto
In record 01: nothing to do, but see how to manage the conditional exception in point 5.
In record 02: merge Name + SecondName + Surname + Company, then extract the name (Piero) from the new string into the Name column, and extract the content of the parentheses from the same string into the Company column:
df['Nameall_tmp'] = df['Name']+' '+df['SecondName']+' '+df['Surname']+' '+df['Company']
df['Name_tmp'] = df['Nameall_tmp'].str.extract(r'(.+)\(.+')
df['Company_tmp'] = df['Nameall_tmp'].str.extract(r'.*\((.+)\)')
Records 03 and 04: almost the same as 02.
In record 06:
df['Nameall_tmp'] = df['Name']+' '+df['SecondName']+' '+df['Surname']+' '+df['Company']
df['Name_tmp'] = df['Nameall_tmp'].str.extract(r'(.+)\(.+')
df['Name_tmp']= np.where(df['Name_tmp'] == 'nan' , df['Name'],df['Name_tmp'] )
In this case the np.where statement doesn't work like an if/then/else. I want to check whether df['Name_tmp'] is "nan" and, if so, fill it with the original df['Name'] to remove the "nan" from the record; otherwise keep df['Name_tmp']. Any suggestions?
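On the np.where part specifically: a missing value produced by str.extract is a real NaN, not the string 'nan', so the comparison == 'nan' is never true and np.where always takes the "else" branch. A minimal sketch of the usual fix, using a small made-up frame and a simplified regex (both are assumptions for illustration, not the asker's real data):

```python
import numpy as np
import pandas as pd

# Hypothetical two-row sample: one name with a company in parentheses, one without.
df = pd.DataFrame({"Name": ["Piero (Four", "Pluto"]})

# Text before the opening parenthesis; rows without "(" yield NaN.
df["Name_tmp"] = df["Name"].str.extract(r"(.+?)\s*\(", expand=False)

# Test for missing values with isna() instead of comparing to the string 'nan'.
df["Name_np"] = np.where(df["Name_tmp"].isna(), df["Name"], df["Name_tmp"])

# Equivalent and shorter: fill the gaps from the original column.
df["Name_fill"] = df["Name_tmp"].fillna(df["Name"])

print(df)
```

Both new columns fall back to the original Name wherever the extraction found nothing.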

Rough thinking here:
1. Munge the "Company" column: if it contains a legitimate company name, wrap it in (); if not, keep the original content.
2. Concatenate all columns into one conglomerate column.
3. Use a single regex with sr.str.extract(rex) to split that conglomerate column back into the desired columns.
Following this rough thinking, I have at least reduced the problem to fine-tuning a single regex:
import re
import numpy as np
import pandas as pd
df = pd.DataFrame(
columns=" index Name SecondName Surname Company ".split(),
data= [
[ 0, "Marc", np.nan, "Gigio", "ETC ltd", ],
[ 1, "Piero", "(four", "season", "restaurant)", ],
[ 2, "bubbu(caterpilar)", np.nan, np.nan, np.nan, ],
[ 3, np.nan, np.nan, np.nan, "gaby(ts inc)", ],
[ 4, "Pit(REV inc)", np.nan, np.nan, "REV inc", ],
[ 5, "pluto", np.nan, np.nan, np.nan, ],]).set_index("index", drop=True)
df = df.fillna('')
df['Company'] = df['Company'].apply(lambda x: f'({x})' if ('(' not in x and ')' not in x and x!="") else x)
# df['sum'] = df.sum(axis=1)
df['sum'] = df['Name'] + ' ' + df['SecondName'] + ' ' + df['Surname'] + ' ' + df['Company']
df['sum'] = df['sum'].str.replace(r'\s+', ' ', regex=True) # get rid of extra \s due to above concat
rex = re.compile( # very fragile and hardcoded
r"""
(?P<name0>[a-z]{2,})
\s?
(?P<surename0>[a-z]{2,})?
\s?
\(?
(?P<company0>[a-z\s]{3,})?
\)?
\s?
""",
re.X+re.I
)
df['sum'].str.extract(rex)
output:
+---------+---------+-------------+------------------------+
| index | name0 | surename0 | company0 |
|---------+---------+-------------+------------------------|
| 0 | Marc | Gigio | ETC ltd |
| 1 | Piero | nan | four season restaurant |
| 2 | bubbu | nan | caterpilar |
| 3 | gaby | nan | ts inc |
| 4 | Pit | nan | REV inc |
| 5 | pluto | nan | nan |
+---------+---------+-------------+------------------------+
EDIT:
An earlier version of this answer had an error in the regex (I forgot to add ? after \(), so it couldn't handle "pluto"; this is corrected now.
The moral of the story is that the regex you need to design will be very specialized, fragile and hardcoded. It is almost worth considering a df['sum'].apply(myfoo) approach just to parse df['sum'] more thoroughly.
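To give an idea of what such an apply-based parser could look like, here is a rough sketch. The helper name split_record, the output column names, and the sample strings (which imitate what the concatenated column looks like after the steps above) are all assumptions made for illustration:

```python
import pandas as pd

def split_record(text):
    """Hypothetical helper: split 'name [surname] (company)' into its parts."""
    head, _, rest = text.partition("(")      # everything before the first "("
    words = head.split()
    company = rest.split(")")[0].strip()     # content of the first parentheses
    return pd.Series({
        "name": words[0] if words else "",
        "surname": " ".join(words[1:]),
        "company": company,
    })

# Assumed shape of the concatenated column, one string per record.
combined = pd.Series([
    "Marc Gigio (ETC ltd)",
    "Piero (four season restaurant)",
    "bubbu(caterpilar)",
    " gaby(ts inc)",
    "Pit(REV inc) (REV inc)",
    "pluto",
])

parsed = combined.apply(split_record)
print(parsed)
```

Plain string methods are easier to extend with special cases than one large regex, at the cost of running Python code row by row.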

Related

How do I organize my data into fancy table/dataframe like this (pic inside) [duplicate]

I am quite new to Python and I am now struggling with formatting my data nicely for printed output.
I have one list that is used for two headings, and a matrix that should be the contents of the table. Like so:
teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = np.array([[1, 2, 1],
[0, 1, 0],
[2, 4, 2]])
Note that the heading names are not necessarily the same lengths. The data entries are all integers, though.
Now, I want to represent this in a table format, something like this:
            Man Utd   Man City   T Hotspur
  Man Utd         1          0           0
 Man City         1          1           0
T Hotspur         0          1           2
I have a hunch that there must be a data structure for this, but I cannot find it. I have tried using a dictionary and formatting the printing, I have tried for-loops with indentation and I have tried printing as strings.
I am sure there must be a very simple way to do this, but I am probably missing it due to lack of experience.
There are some light and useful python packages for this purpose:
1. tabulate: https://pypi.python.org/pypi/tabulate
from tabulate import tabulate
print(tabulate([['Alice', 24], ['Bob', 19]], headers=['Name', 'Age']))
Name      Age
------  -----
Alice      24
Bob        19
tabulate has many options to specify headers and table format.
print(tabulate([['Alice', 24], ['Bob', 19]], headers=['Name', 'Age'], tablefmt='orgtbl'))
| Name   |   Age |
|--------+-------|
| Alice  |    24 |
| Bob    |    19 |
2. PrettyTable: https://pypi.python.org/pypi/PrettyTable
from prettytable import PrettyTable
t = PrettyTable(['Name', 'Age'])
t.add_row(['Alice', 24])
t.add_row(['Bob', 19])
print(t)
+-------+-----+
| Name | Age |
+-------+-----+
| Alice | 24 |
| Bob | 19 |
+-------+-----+
PrettyTable has options to read data from csv, html, sql database. Also you are able to select subset of data, sort table and change table styles.
3. texttable: https://pypi.python.org/pypi/texttable
from texttable import Texttable
t = Texttable()
t.add_rows([['Name', 'Age'], ['Alice', 24], ['Bob', 19]])
print(t.draw())
+-------+-----+
| Name | Age |
+=======+=====+
| Alice | 24 |
+-------+-----+
| Bob | 19 |
+-------+-----+
with texttable you can control horizontal/vertical align, border style and data types.
4. termtables: https://github.com/nschloe/termtables
import termtables as tt
string = tt.to_string(
[["Alice", 24], ["Bob", 19]],
header=["Name", "Age"],
style=tt.styles.ascii_thin_double,
# alignment="ll",
# padding=(0, 1),
)
print(string)
+-------+-----+
| Name | Age |
+=======+=====+
| Alice | 24 |
+-------+-----+
| Bob | 19 |
+-------+-----+
Other options:
terminaltables Easily draw tables in terminal/console applications from a list of lists of strings. Supports multi-line rows.
asciitable Asciitable can read and write a wide range of ASCII table formats via built-in Extension Reader Classes.
Some ad-hoc code:
row_format = "{:>15}" * (len(teams_list) + 1)
print(row_format.format("", *teams_list))
for team, row in zip(teams_list, data):
    print(row_format.format(team, *row))
This relies on str.format() and the Format Specification Mini-Language.
>>> import pandas
>>> pandas.DataFrame(data, teams_list, teams_list)
           Man Utd  Man City  T Hotspur
Man Utd          1         2          1
Man City         0         1          0
T Hotspur        2         4          2
Python actually makes this quite easy.
Something like
for i in range(10):
    print('%-12i%-12i' % (10 ** i, 20 ** i))
will have the output
1           1
10          20
100         400
1000        8000
10000       160000
100000      3200000
1000000     64000000
10000000    1280000000
100000000   25600000000
1000000000  512000000000
The % within the string is essentially an escape character and the characters following it tell python what kind of format the data should have. The % outside and after the string is telling python that you intend to use the previous string as the format string and that the following data should be put into the format specified.
In this case I used "%-12i" twice. To break down each part:
'-' (left align)
'12' (how much space to be given to this part of the output)
'i' (we are printing an integer)
From the docs: https://docs.python.org/2/library/stdtypes.html#string-formatting
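For comparison, the same two-column layout can be written with an f-string. This is only a sketch of the equivalent format specification, where < means left-align, 12 is the field width and d marks an integer:

```python
# Each value is left-aligned in a 12-character field, like '%-12i'.
rows = [f"{10 ** i:<12d}{20 ** i:<12d}" for i in range(10)]

for line in rows:
    print(line)
```

The strings produced here are identical to those from the % version above.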
Updating Sven Marnach's answer to work in Python 3.4:
row_format = "{:>15}" * (len(teams_list) + 1)
print(row_format.format("", *teams_list))
for team, row in zip(teams_list, data):
    print(row_format.format(team, *row))
I know that I am late to the party, but I just made a library for this that I think could really help. It is extremely simple, that's why I think you should use it. It is called TableIT.
Basic Use
To use it, first follow the download instructions on the GitHub Page.
Then import it:
import TableIt
Then make a list of lists where each inner list is a row:
table = [
[4, 3, "Hi"],
[2, 1, 808890312093],
[5, "Hi", "Bye"]
]
Then all you have to do is print it:
TableIt.printTable(table)
This is the output you get:
+--------------------------------------------+
| 4            | 3            | Hi           |
| 2            | 1            | 808890312093 |
| 5            | Hi           | Bye          |
+--------------------------------------------+
Field Names
You can use field names if you want to (if you aren't using field names you don't have to say useFieldNames=False because it is set to that by default):
TableIt.printTable(table, useFieldNames=True)
From that you will get:
+--------------------------------------------+
| 4            | 3            | Hi           |
+--------------+--------------+--------------+
| 2            | 1            | 808890312093 |
| 5            | Hi           | Bye          |
+--------------------------------------------+
There are other uses too. For example, you could do this:
import TableIt
myList = [
["Name", "Email"],
["Richard", "richard@fakeemail.com"],
["Tasha", "tash@fakeemail.com"]
]
TableIt.printTable(myList, useFieldNames=True)
From that:
+-----------------------------------------------+
| Name                  | Email                 |
+-----------------------+-----------------------+
| Richard               | richard@fakeemail.com |
| Tasha                 | tash@fakeemail.com    |
+-----------------------------------------------+
Or you could do:
import TableIt
myList = [
["", "a", "b"],
["x", "a + x", "a + b"],
["z", "a + z", "z + b"]
]
TableIt.printTable(myList, useFieldNames=True)
And from that you get:
+-----------------------+
|       | a     | b     |
+-------+-------+-------+
| x     | a + x | a + b |
| z     | a + z | z + b |
+-----------------------+
Colors
You can also use colors.
You use colors by using the color option (by default it is set to None) and specifying RGB values.
Using the example from above:
import TableIt
myList = [
["", "a", "b"],
["x", "a + x", "a + b"],
["z", "a + z", "z + b"]
]
TableIt.printTable(myList, useFieldNames=True, color=(26, 156, 171))
This prints the same table in the given color.
Please note that printing colors might not work for you, but it works exactly the same way as in other libraries that print colored text. I have tested it and every single color works. The blue is not messed up either, as it would be with the default 34m ANSI escape sequence (if you don't know what that is, it doesn't matter). It all comes from the fact that every color is an RGB value rather than a system default.
More Info
For more info check the GitHub Page
Just use it
from beautifultable import BeautifulTable
table = BeautifulTable()
table.column_headers = ["", "Man Utd","Man City","T Hotspur"]
table.append_row(['Man Utd', 1, 2, 3])
table.append_row(['Man City', 7, 4, 1])
table.append_row(['T Hotspur', 3, 2, 2])
print(table)
As a result, you will get such a neat table and that's it.
A simple way to do this is to loop over all columns, measure their width, create a row_template for that max width, and then print the rows. It's not exactly what you are looking for, because in this case, you first have to put your headings inside the table, but I'm thinking it might be useful to someone else.
table = [
["", "Man Utd", "Man City", "T Hotspur"],
["Man Utd", 1, 0, 0],
["Man City", 1, 1, 0],
["T Hotspur", 0, 1, 2],
]
def print_table(table):
    longest_cols = [
        (max([len(str(row[i])) for row in table]) + 3)
        for i in range(len(table[0]))
    ]
    row_format = "".join(["{:>" + str(longest_col) + "}" for longest_col in longest_cols])
    for row in table:
        print(row_format.format(*row))
You use it like this:
>>> print_table(table)
               Man Utd   Man City   T Hotspur
     Man Utd         1          0           0
    Man City         1          1           0
   T Hotspur         0          1           2
When I do this, I like to have some control over the details of how the table is formatted. In particular, I want header cells to have a different format than body cells, and the table column widths to only be as wide as each one needs to be. Here's my solution:
def format_matrix(header, matrix,
                  top_format, left_format, cell_format, row_delim, col_delim):
    # Full table including the header row and the left-hand label column
    table = [[''] + header] + [[name] + row for name, row in zip(header, matrix)]
    # One format string per cell, mirroring the shape of the table
    table_format = [['{:^{}}'] + len(header) * [top_format]] \
                 + len(matrix) * [[left_format] + len(header) * [cell_format]]
    # Each column is as wide as its widest formatted cell
    col_widths = [max(
                      len(format.format(cell, 0))
                      for format, cell in zip(col_format, col))
                  for col_format, col in zip(zip(*table_format), zip(*table))]
    return row_delim.join(
               col_delim.join(
                   format.format(cell, width)
                   for format, cell, width in zip(row_format, row, col_widths))
               for row_format, row in zip(table_format, table))

print(format_matrix(['Man Utd', 'Man City', 'T Hotspur', 'Really Long Column'],
                    [[1, 2, 1, -1], [0, 1, 0, 5], [2, 4, 2, 2], [0, 1, 0, 6]],
                    '{:^{}}', '{:<{}}', '{:>{}.3f}', '\n', ' | '))
Here's the output:
                   | Man Utd | Man City | T Hotspur | Really Long Column
Man Utd            |   1.000 |    2.000 |     1.000 |             -1.000
Man City           |   0.000 |    1.000 |     0.000 |              5.000
T Hotspur          |   2.000 |    4.000 |     2.000 |              2.000
Really Long Column |   0.000 |    1.000 |     0.000 |              6.000
I think this is what you are looking for.
It's a simple module that just computes the maximum required width for the table entries and then just uses rjust and ljust to do a pretty print of the data.
If you want your left heading right-aligned, just replace this call on line 53:
print >> out, row[0].ljust(col_paddings[0] + 1),
with:
print >> out, row[0].rjust(col_paddings[0] + 1),
Pure Python 3
def print_table(data, cols, wide):
    '''Prints formatted data on columns of given width.'''
    n, r = divmod(len(data), cols)
    pat = '{{:{}}}'.format(wide)
    line = '\n'.join(pat * cols for _ in range(n))
    last_line = pat * r
    print(line.format(*data))
    print(last_line.format(*data[n*cols:]))

data = [str(i) for i in range(27)]
print_table(data, 6, 12)
Will print
0           1           2           3           4           5
6           7           8           9           10          11
12          13          14          15          16          17
18          19          20          21          22          23
24          25          26
table_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
for row in table_data:
    print("{: >20} {: >20} {: >20}".format(*row))
OUTPUT:
                   1                    2                    3
                   4                    5                    6
                   7                    8                    9
In the str.format() format specification:
">" is used for right alignment
"<" is used for left alignment
20 is the field width, which can be changed according to the requirement.
try rich: https://github.com/Textualize/rich
from rich.console import Console
from rich.table import Table
console = Console()
table = Table(show_header=True, header_style="bold magenta")
table.add_column("Date", style="dim", width=12)
table.add_column("Title")
table.add_column("Production Budget", justify="right")
table.add_column("Box Office", justify="right")
table.add_row(
"Dec 20, 2019", "Star Wars: The Rise of Skywalker", "$275,000,000", "$375,126,118"
)
table.add_row(
"May 25, 2018",
"[red]Solo[/red]: A Star Wars Story",
"$275,000,000",
"$393,151,347",
)
table.add_row(
"Dec 15, 2017",
"Star Wars Ep. VIII: The Last Jedi",
"$262,000,000",
"[bold]$1,332,539,889[/bold]",
)
console.print(table)
https://github.com/willmcgugan/rich/raw/master/imgs/table.png
The following function will create the requested table (with or without numpy) with Python 3 (maybe also Python 2). I have chosen to set the width of each column to match that of the longest team name. You could modify it if you wanted to use the length of the team name for each column, but will be more complicated.
Note: For a direct equivalent in Python 2 you could replace the zip with izip from itertools.
def print_results_table(data, teams_list):
    str_l = max(len(t) for t in teams_list)
    print(" ".join(['{:>{length}s}'.format(t, length = str_l) for t in [" "] + teams_list]))
    for t, row in zip(teams_list, data):
        print(" ".join(['{:>{length}s}'.format(str(x), length = str_l) for x in [t] + row]))

teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = [[1, 2, 1],
        [0, 1, 0],
        [2, 4, 2]]
print_results_table(data, teams_list)
This will produce the following table:
            Man Utd  Man City T Hotspur
  Man Utd         1         2         1
 Man City         0         1         0
T Hotspur         2         4         2
If you want to have vertical line separators, you can replace " ".join with " | ".join.
References:
lots about formatting https://pyformat.info/ (old and new formatting
styles)
the official Python tutorial (quite good) -
https://docs.python.org/3/tutorial/inputoutput.html#the-string-format-method
official Python information (can be difficult to read) -
https://docs.python.org/3/library/string.html#string-formatting
Another resource -
https://www.python-course.eu/python3_formatted_output.php
For simple cases you can just use modern string formatting (simplified Sven's answer):
f'{column1_value:15} {column2_value}':
table = {
'Amplitude': [round(amplitude, 3), 'm³/h'],
'MAE': [round(mae, 2), 'm³/h'],
'MAPE': [round(mape, 2), '%'],
}
for metric, value in table.items():
    print(f'{metric:14} : {value[0]:>6.3f} {value[1]}')
Output:
Amplitude      :  1.438 m³/h
MAE            :  0.171 m³/h
MAPE           : 27.740 %
Source: https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals
I found this just looking for a way to output simple columns. If you just need no-fuss columns, then you can use this:
print("Titlex\tTitley\tTitlez")
for x, y, z in data:
    print(x, "\t", y, "\t", z)
EDIT: I was trying to be as simple as possible, and thereby did some things manually instead of using the teams list. To generalize to the OP's actual question:
# Column headers
print("", end="\t")
for team in teams_list:
    print(" ", team, end="")
print()
# Rows
for team, row in enumerate(data):
    teamlabel = teams_list[team]
    # Left-pad the label to 9 characters
    while len(teamlabel) < 9:
        teamlabel = " " + teamlabel
    print(teamlabel, end="\t")
    for entry in row:
        print(entry, end="\t")
    print()
Outputs:
Man Utd Man City T Hotspur
Man Utd 1 2 1
Man City 0 1 0
T Hotspur 2 4 2
But this no longer seems any simpler than the other answers, with perhaps the benefit that it doesn't require any imports. But @campkeith's answer already met that and is more robust, as it can handle a wider variety of label lengths.
I would try to loop through the list and use a CSV formatter to represent the data you want.
You can specify tabs, commas, or any other char as the delimiter.
Otherwise, just loop through the list and print "\t" after each element
http://docs.python.org/library/csv.html
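A minimal sketch of that idea with the standard csv module, assuming the teams_list and data from the question (as plain lists) and a tab as the delimiter. The helper name format_table is made up for this example:

```python
import csv
import io

teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = [[1, 2, 1], [0, 1, 0], [2, 4, 2]]

def format_table(teams, rows, delimiter="\t"):
    """Hypothetical helper: return the table as delimiter-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delimiter, lineterminator="\n")
    writer.writerow([""] + teams)              # header row, empty corner cell
    for team, row in zip(teams, rows):
        writer.writerow([team] + list(row))    # row label followed by values
    return buf.getvalue()

print(format_table(teams_list, data), end="")
```

Passing delimiter="," instead gives ordinary CSV. Note that tab-separated columns only line up visually when the labels have similar lengths.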
I got a better one that can save a lot of spaces.
table = [
['number1', 'x', 'name'],
["4x", "3", "Hi"],
["2", "1", "808890312093"],
["5", "Hi", "Bye"]
]
column_max_width = [max(len(row[column_index]) for row in table) for column_index in range(len(table[0]))]
row_format = ["{:>"+str(width)+"}" for width in column_max_width]
for row in table:
    print("|".join([print_format.format(value) for print_format, value in zip(row_format, row)]))
output:
number1| x|        name
     4x| 3|          Hi
      2| 1|808890312093
      5|Hi|         Bye
To create a simple table using terminaltables open the terminal or your command prompt and run pip install terminaltables.
You can print a Python list as the following:
from terminaltables import AsciiTable
l = [
['Head', 'Head'],
['R1 C1', 'R1 C2'],
['R2 C1', 'R2 C2'],
['R3 C1', 'R3 C2']
]
table = AsciiTable(l)
print(table.table)
list1 = [1, 2, 3]
list2 = [10, 20, 30]
l = []
# Interleave the two lists: [1, 10, 2, 20, 3, 30]
for i in range(0, len(list1)):
    l.append(list1[i])
    l.append(list2[i])
# print(l)
# Print the values pairwise, one pair per line
for i in range(0, len(l), 2):
    print(l[i], "", l[i + 1])

How to assign a new column to a dataframe based on comparison between other columns?

In my one sheet Excel file that I created through my SQL, I have 3 columns that represent letter ratings. The rating values may differ between ratings 1, 2, and 3, but they can still be ranked with the same value.
I am trying to create a new column in my Excel file that can take these 3 letter ratings and pull the middle rating.
| ranking (1 lowest) | Rating_1 | Rating_2 | Rating_3 | NEW_COLUMN     |
| ------------------ | -------- | -------- | -------- | -------------- |
| 3                  | A+       | AA       | Aa       | middle(rating) |
| 2                  | B+       | BB       | Bb       | middle(rating) |
| 1                  | Fa       | Fb       | Fc       | middle(rating) |
There are three scenarios I need to account for:
if all three ratings differ, pick the rating between rating_1, rating_2, and rating_3 that isn't the highest rating or the lowest rating
if all three ratings are the same, pick rating on rating_1
if 2 of the ratings are the same, but one is different, pick the minimum rating
I created a dataframe :
df = pd.DataFrame(
{"Rating_1": ["A+", "AA", "Aa"],
"Rating_2": ["B+", "BB", "Bb"],
"Rating_3": ["Fa", "Fb", "Fc"]}
)
df["NEW COLUMN"] = {insert logic here}
Or is it easier to create a new DF that filters down the original DF?
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
{
"Rating_1": ["A+", "Cc", "Aa"],
"Rating_2": ["AA", "Cc", "Aa"],
"Rating_3": ["BB", "Cc", "Bb"],
}
)
print(df)
# Output
  Rating_1 Rating_2 Rating_3
0       A+       AA       BB
1       Cc       Cc       Cc
2       Aa       Aa       Bb
Here is one way to do it using Python sets to check conditions:
# First condition
df["Middle_rating"] = df.apply(
lambda x: sorted([x["Rating_1"], x["Rating_2"], x["Rating_3"]])[1]
if len(set([x["Rating_1"], x["Rating_2"], x["Rating_3"]])) == 3
else "",
axis=1,
)
# Second condition
df["Middle_rating"] = df.apply(
lambda x: x["Rating_1"]
if len(set([x["Rating_1"], x["Rating_2"], x["Rating_3"]])) == 1
else x["Middle_rating"],
axis=1,
)
# Third condition
ratings = {
rating: i
for i, rating in enumerate(["A+", "AA", "Aa", "B+", "BB", "Bb", "C+", "CC", "Cc"])
} # ratings ordered from best (A+: 0) to worst (Cc: 8)
df["Middle_rating"] = df.apply(
lambda x: max(x["Rating_1"], x["Rating_2"], x["Rating_3"])
if len(
set([ratings[x["Rating_1"]], ratings[x["Rating_2"]], ratings[x["Rating_3"]]])
)
== 2
else x["Middle_rating"],
axis=1,
)
Then:
print(df)
# Output
  Rating_1 Rating_2 Rating_3 Middle_rating
0       A+       AA       BB            AA
1       Cc       Cc       Cc            Cc
2       Aa       Aa       Bb            Bb
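The three conditions can also be folded into a single pass that ranks the ratings explicitly instead of relying on string comparison. This is only a sketch: it assumes the same best-to-worst ordering as the ratings dict above and reads "minimum rating" as the worst of the three. The names ORDER, RANK and middle_rating are made up for the example:

```python
import pandas as pd

# Assumed ordering from best to worst, as in the answer above.
ORDER = ["A+", "AA", "Aa", "B+", "BB", "Bb", "C+", "CC", "Cc"]
RANK = {rating: i for i, rating in enumerate(ORDER)}

def middle_rating(row):
    """Pick the rating for one row according to the three scenarios."""
    ranked = sorted(row, key=RANK.get)   # best first, worst last
    if len(set(ranked)) == 2:
        return ranked[-1]                # two equal, one different: worst rating
    return ranked[1]                     # all different: middle; all equal: that rating

df = pd.DataFrame(
    {
        "Rating_1": ["A+", "Cc", "Aa"],
        "Rating_2": ["AA", "Cc", "Aa"],
        "Rating_3": ["BB", "Cc", "Bb"],
    }
)
cols = ["Rating_1", "Rating_2", "Rating_3"]
df["Middle_rating"] = df[cols].apply(middle_rating, axis=1)
print(df)
```

With an explicit rank table, the result no longer depends on how the rating strings happen to sort alphabetically.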

Concatenation of multiple columns

I have three columns in my data frame:
CaseID  FirstName  LastName
1       rohit      pandey
2                  rai
3
In the output, I am trying to add the fourth column and have values as LastName,FirstName
I have this Python code
df_ids['ContactName'] = df_ids[['LastName', 'FirstName']].agg(lambda x: ','.join(x.values), axis=1)
But it also joins the blank values, so what I get looks like this:
CaseID  FirstName  LastName  ContactName
1       rohit      pandey    pandey, rohit
2                  rai       , rai
3                            ,
The expected output:
CaseID  FirstName  LastName  ContactName
1       rohit      pandey    pandey, rohit
2                  rai       rai
3
Someone has added the PySpark tag, so here is a PySpark version:
from pyspark.sql import functions as F
df_ids = df_ids.replace('', None) # Replaces empty strings with nulls
df_ids = df_ids.withColumn('ContactName', F.concat_ws(', ', 'LastName', 'FirstName'))
df_ids = df_ids.fillna('') # Replaces nulls back to empty strings
df_ids.show()
# +------+---------+--------+-------------+
# |CaseID|FirstName|LastName|  ContactName|
# +------+---------+--------+-------------+
# |     1|    rohit|  pandey|pandey, rohit|
# |     2|         |     rai|          rai|
# |     3|         |        |             |
# +------+---------+--------+-------------+
This is the easy way, using apply. apply takes each row one at a time and passes it to the given function.
import pandas as pd
data = [
[ 1, 'rohit', 'pandey' ],
[ 2, '', 'rai' ],
[ 3, '', '' ]
]
df = pd.DataFrame(data, columns=['CaseID', 'FirstName', 'LastName'] )
def fixup( row ):
    if not row['LastName']:
        return ''
    if not row['FirstName']:
        return row['LastName']
    return row['LastName'] + ', ' + row['FirstName']

print(df)
df['Contact1'] = df.apply(fixup, axis=1)
print(df)
Output:
   CaseID FirstName LastName
0       1     rohit   pandey
1       2                rai
2       3
   CaseID FirstName LastName       Contact1
0       1     rohit   pandey  pandey, rohit
1       2                rai            rai
2       3
Two (actually 1 and a half) other options, which are very close to your attempt:
df_ids['ContactName'] = (
df_ids[['LastName', 'FirstName']]
.agg(lambda row: ', '.join(name for name in row if name), axis=1)
)
or
df_ids['ContactName'] = (
df_ids[['LastName', 'FirstName']]
.agg(lambda row: ', '.join(filter(None, row)), axis=1)
)
In both versions the empty strings ('') are filtered out:
Via a generator expression: the if name makes sure that '' is dropped, because its truth value is False (try print(bool(''))).
By the built-in function filter() with the first argument set to None.
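If you prefer to avoid a row-wise lambda, a vectorized variant is also possible. The following is a sketch on the sample data from the question. It joins the columns with Series.str.cat and then strips leftover separators, which assumes that the names themselves never start or end with a comma or a space:

```python
import pandas as pd

df_ids = pd.DataFrame(
    {
        "CaseID": [1, 2, 3],
        "FirstName": ["rohit", "", ""],
        "LastName": ["pandey", "rai", ""],
    }
)

# Join "LastName, FirstName" for every row, then remove dangling ", " pieces
# that appear when one or both parts are empty strings.
df_ids["ContactName"] = (
    df_ids["LastName"].str.cat(df_ids["FirstName"], sep=", ").str.strip(", ")
)
print(df_ids)
```

Note that str.strip(", ") removes any leading or trailing commas and spaces, not just the exact separator, so the lambda versions above are the safer choice for messy input.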

Trying to return my list as a table in Python [duplicate]

I am quite new to Python and I am now struggling with formatting my data nicely for printed output.
I have one list that is used for two headings, and a matrix that should be the contents of the table. Like so:
teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = np.array([[1, 2, 1],
[0, 1, 0],
[2, 4, 2]])
Note that the heading names are not necessarily the same lengths. The data entries are all integers, though.
Now, I want to represent this in a table format, something like this:
Man Utd Man City T Hotspur
Man Utd 1 0 0
Man City 1 1 0
T Hotspur 0 1 2
I have a hunch that there must be a data structure for this, but I cannot find it. I have tried using a dictionary and formatting the printing, I have tried for-loops with indentation and I have tried printing as strings.
I am sure there must be a very simple way to do this, but I am probably missing it due to lack of experience.
There are some light and useful python packages for this purpose:
1. tabulate: https://pypi.python.org/pypi/tabulate
from tabulate import tabulate
print(tabulate([['Alice', 24], ['Bob', 19]], headers=['Name', 'Age']))
Name Age
------ -----
Alice 24
Bob 19
tabulate has many options to specify headers and table format.
print(tabulate([['Alice', 24], ['Bob', 19]], headers=['Name', 'Age'], tablefmt='orgtbl'))
| Name | Age |
|--------+-------|
| Alice | 24 |
| Bob | 19 |
2. PrettyTable: https://pypi.python.org/pypi/PrettyTable
from prettytable import PrettyTable
t = PrettyTable(['Name', 'Age'])
t.add_row(['Alice', 24])
t.add_row(['Bob', 19])
print(t)
+-------+-----+
| Name | Age |
+-------+-----+
| Alice | 24 |
| Bob | 19 |
+-------+-----+
PrettyTable has options to read data from csv, html, sql database. Also you are able to select subset of data, sort table and change table styles.
3. texttable: https://pypi.python.org/pypi/texttable
from texttable import Texttable
t = Texttable()
t.add_rows([['Name', 'Age'], ['Alice', 24], ['Bob', 19]])
print(t.draw())
+-------+-----+
| Name | Age |
+=======+=====+
| Alice | 24 |
+-------+-----+
| Bob | 19 |
+-------+-----+
with texttable you can control horizontal/vertical align, border style and data types.
4. termtables: https://github.com/nschloe/termtables
import termtables as tt
string = tt.to_string(
[["Alice", 24], ["Bob", 19]],
header=["Name", "Age"],
style=tt.styles.ascii_thin_double,
# alignment="ll",
# padding=(0, 1),
)
print(string)
+-------+-----+
| Name | Age |
+=======+=====+
| Alice | 24 |
+-------+-----+
| Bob | 19 |
+-------+-----+
with texttable you can control horizontal/vertical align, border style and data types.
Other options:
terminaltables Easily draw tables in terminal/console applications from a list of lists of strings. Supports multi-line rows.
asciitable Asciitable can read and write a wide range of ASCII table formats via built-in Extension Reader Classes.
Some ad-hoc code:
row_format ="{:>15}" * (len(teams_list) + 1)
print(row_format.format("", *teams_list))
for team, row in zip(teams_list, data):
print(row_format.format(team, *row))
This relies on str.format() and the Format Specification Mini-Language.
>>> import pandas
>>> pandas.DataFrame(data, teams_list, teams_list)
Man Utd Man City T Hotspur
Man Utd 1 2 1
Man City 0 1 0
T Hotspur 2 4 2
Python actually makes this quite easy.
Something like
for i in range(10):
print '%-12i%-12i' % (10 ** i, 20 ** i)
will have the output
1 1
10 20
100 400
1000 8000
10000 160000
100000 3200000
1000000 64000000
10000000 1280000000
100000000 25600000000
1000000000 512000000000
The % within the string is essentially an escape character and the characters following it tell python what kind of format the data should have. The % outside and after the string is telling python that you intend to use the previous string as the format string and that the following data should be put into the format specified.
In this case I used "%-12i" twice. To break down each part:
'-' (left align)
'12' (how much space to be given to this part of the output)
'i' (we are printing an integer)
From the docs: https://docs.python.org/2/library/stdtypes.html#string-formatting
Updating Sven Marnach's answer to work in Python 3.4:
row_format ="{:>15}" * (len(teams_list) + 1)
print(row_format.format("", *teams_list))
for team, row in zip(teams_list, data):
print(row_format.format(team, *row))
I know that I am late to the party, but I just made a library for this that I think could really help. It is extremely simple, that's why I think you should use it. It is called TableIT.
Basic Use
To use it, first follow the download instructions on the GitHub Page.
Then import it:
import TableIt
Then make a list of lists where each inner list is a row:
table = [
[4, 3, "Hi"],
[2, 1, 808890312093],
[5, "Hi", "Bye"]
]
Then all you have to do is print it:
TableIt.printTable(table)
This is the output you get:
+--------------------------------------------+
| 4 | 3 | Hi |
| 2 | 1 | 808890312093 |
| 5 | Hi | Bye |
+--------------------------------------------+
Field Names
You can use field names if you want to (if you aren't using field names you don't have to say useFieldNames=False because it is set to that by default):
TableIt.printTable(table, useFieldNames=True)
From that you will get:
+--------------------------------------------+
| 4 | 3 | Hi |
+--------------+--------------+--------------+
| 2 | 1 | 808890312093 |
| 5 | Hi | Bye |
+--------------------------------------------+
There are other uses to, for example you could do this:
import TableIt
myList = [
["Name", "Email"],
["Richard", "richard#fakeemail.com"],
["Tasha", "tash#fakeemail.com"]
]
TableIt.print(myList, useFieldNames=True)
From that:
+-----------------------------------------------+
| Name | Email |
+-----------------------+-----------------------+
| Richard | richard#fakeemail.com |
| Tasha | tash#fakeemail.com |
+-----------------------------------------------+
Or you could do:
import TableIt
myList = [
["", "a", "b"],
["x", "a + x", "a + b"],
["z", "a + z", "z + b"]
]
TableIt.printTable(myList, useFieldNames=True)
And from that you get:
+-----------------------+
| | a | b |
+-------+-------+-------+
| x | a + x | a + b |
| z | a + z | z + b |
+-----------------------+
Colors
You can also use colors.
You use colors by using the color option (by default it is set to None) and specifying RGB values.
Using the example from above:
import TableIt
myList = [
["", "a", "b"],
["x", "a + x", "a + b"],
["z", "a + z", "z + b"]
]
TableIt.printTable(myList, useFieldNames=True, color=(26, 156, 171))
Then you will get:
Please note that printing colors might not work for you but it does works the exact same as the other libraries that print colored text. I have tested and every single color works. The blue is not messed up either as it would if using the default 34m ANSI escape sequence (if you don't know what that is it doesn't matter). Anyway, it all comes from the fact that every color is RGB value rather than a system default.
More Info
For more info check the GitHub Page
Just use it
from beautifultable import BeautifulTable
table = BeautifulTable()
table.column_headers = ["", "Man Utd","Man City","T Hotspur"]
table.append_row(['Man Utd', 1, 2, 3])
table.append_row(['Man City', 7, 4, 1])
table.append_row(['T Hotspur', 3, 2, 2])
print(table)
As a result, you will get such a neat table and that's it.
A simple way to do this is to loop over all columns, measure their width, create a row_template for that max width, and then print the rows. It's not exactly what you are looking for, because in this case, you first have to put your headings inside the table, but I'm thinking it might be useful to someone else.
table = [
["", "Man Utd", "Man City", "T Hotspur"],
["Man Utd", 1, 0, 0],
["Man City", 1, 1, 0],
["T Hotspur", 0, 1, 2],
]
def print_table(table):
    longest_cols = [
        (max([len(str(row[i])) for row in table]) + 3)
        for i in range(len(table[0]))
    ]
    row_format = "".join(["{:>" + str(longest_col) + "}" for longest_col in longest_cols])
    for row in table:
        print(row_format.format(*row))
You use it like this:
>>> print_table(table)
               Man Utd   Man City   T Hotspur
     Man Utd         1          0           0
    Man City         1          1           0
   T Hotspur         0          1           2
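If the headings and the scores live in separate lists, as in the question, they can be merged into the single nested list that print_table expects. A small sketch; `build_table` is a hypothetical helper name, and `teams_list` and `data` are assumed to have the shapes used elsewhere in this thread.

```python
def build_table(teams_list, data):
    """Prepend a header row and a label column to a score matrix."""
    header = [""] + list(teams_list)
    rows = [[team] + list(row) for team, row in zip(teams_list, data)]
    return [header] + rows

teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = [[1, 0, 0],
        [1, 1, 0],
        [0, 1, 2]]
table = build_table(teams_list, data)
for row in table:
    print(row)
```

The resulting `table` has the same layout as the hand-written one above, so it can be passed to print_table unchanged.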
When I do this, I like to have some control over the details of how the table is formatted. In particular, I want header cells to have a different format than body cells, and the table column widths to only be as wide as each one needs to be. Here's my solution:
def format_matrix(header, matrix,
                  top_format, left_format, cell_format, row_delim, col_delim):
    # Full grid: a blank corner cell plus the header, then one labelled row per matrix row
    table = [[''] + header] + [[name] + row for name, row in zip(header, matrix)]
    # Matching grid of format strings, one per cell
    table_format = [['{:^{}}'] + len(header) * [top_format]] \
                 + len(matrix) * [[left_format] + len(header) * [cell_format]]
    # Each column is as wide as its widest formatted cell
    col_widths = [max(
                      len(format.format(cell, 0))
                      for format, cell in zip(col_format, col))
                  for col_format, col in zip(zip(*table_format), zip(*table))]
    return row_delim.join(
               col_delim.join(
                   format.format(cell, width)
                   for format, cell, width in zip(row_format, row, col_widths))
               for row_format, row in zip(table_format, table))
print(format_matrix(['Man Utd', 'Man City', 'T Hotspur', 'Really Long Column'],
                    [[1, 2, 1, -1], [0, 1, 0, 5], [2, 4, 2, 2], [0, 1, 0, 6]],
                    '{:^{}}', '{:<{}}', '{:>{}.3f}', '\n', ' | '))
Here's the output:
                   | Man Utd | Man City | T Hotspur | Really Long Column
Man Utd            |   1.000 |    2.000 |     1.000 |             -1.000
Man City           |   0.000 |    1.000 |     0.000 |              5.000
T Hotspur          |   2.000 |    4.000 |     2.000 |              2.000
Really Long Column |   0.000 |    1.000 |     0.000 |              6.000
I think this is what you are looking for.
It's a simple module that just computes the maximum required width for the table entries and then just uses rjust and ljust to do a pretty print of the data.
If you want your left heading right-aligned, just change this call on line 53:
print >> out, row[0].ljust(col_paddings[0] + 1),
to:
print >> out, row[0].rjust(col_paddings[0] + 1),
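The module itself is not reproduced here, so the following is only a sketch of the approach described (find the widest entry per column, then pad with ljust and rjust), written for Python 3. It is not the module's actual code: the function name `pprint_table` and the `left_heading_right_aligned` parameter are made up for illustration.

```python
import sys

def pprint_table(table, out=sys.stdout, left_heading_right_aligned=False):
    """Print a list of rows with columns padded to their widest entry."""
    cells = [[str(cell) for cell in row] for row in table]
    # Widest entry in each column.
    col_paddings = [max(len(row[i]) for row in cells) for i in range(len(cells[0]))]
    for row in cells:
        # First column holds the left heading; ljust or rjust picks its alignment.
        if left_heading_right_aligned:
            first = row[0].rjust(col_paddings[0] + 1)
        else:
            first = row[0].ljust(col_paddings[0] + 1)
        # Remaining columns are right-aligned with two extra spaces of padding.
        rest = "".join(cell.rjust(width + 2)
                       for cell, width in zip(row[1:], col_paddings[1:]))
        print(first + rest, file=out)

table = [["", "Man Utd", "Man City"],
         ["Man Utd", 1, 2],
         ["Man City", 0, 1]]
pprint_table(table)
```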
Pure Python 3
def print_table(data, cols, wide):
    '''Prints formatted data on columns of given width.'''
    n, r = divmod(len(data), cols)
    pat = '{{:{}}}'.format(wide)
    line = '\n'.join(pat * cols for _ in range(n))
    last_line = pat * r
    print(line.format(*data))
    print(last_line.format(*data[n*cols:]))
data = [str(i) for i in range(27)]
print_table(data, 6, 12)
Will print
0           1           2           3           4           5
6           7           8           9           10          11
12          13          14          15          16          17
18          19          20          21          22          23
24          25          26
table_data= [[1,2,3],[4,5,6],[7,8,9]]
for row in table_data:
    print("{: >20} {: >20} {: >20}".format(*row))
OUTPUT:
                   1                    2                    3
                   4                    5                    6
                   7                    8                    9
where, in the format specification:
">" is used for right alignment
"<" is used for left alignment
20 is the field width, which can be changed as required.
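To make the alignment options concrete, here is a small sketch that formats the same value with each alignment character. The "^" (center) option belongs to the same format specification mini-language, although it is not mentioned above; a width of 10 is used only to keep the lines short.

```python
value = "abc"
left = "{:<10}".format(value)    # padding goes on the right
right = "{:>10}".format(value)   # padding goes on the left
center = "{:^10}".format(value)  # padding is split between both sides

for label, text in (("left", left), ("right", right), ("center", center)):
    # The brackets make the padding visible.
    print(f"{label:>6}: [{text}]")
```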
try rich: https://github.com/Textualize/rich
from rich.console import Console
from rich.table import Table
console = Console()
table = Table(show_header=True, header_style="bold magenta")
table.add_column("Date", style="dim", width=12)
table.add_column("Title")
table.add_column("Production Budget", justify="right")
table.add_column("Box Office", justify="right")
table.add_row(
    "Dec 20, 2019", "Star Wars: The Rise of Skywalker", "$275,000,000", "$375,126,118"
)
table.add_row(
    "May 25, 2018",
    "[red]Solo[/red]: A Star Wars Story",
    "$275,000,000",
    "$393,151,347",
)
table.add_row(
    "Dec 15, 2017",
    "Star Wars Ep. VIII: The Last Jedi",
    "$262,000,000",
    "[bold]$1,332,539,889[/bold]",
)
console.print(table)
https://github.com/willmcgugan/rich/raw/master/imgs/table.png
The following function will create the requested table (with or without numpy) with Python 3 (maybe also Python 2). I have chosen to set the width of every column to that of the longest team name. You could modify it to size each column by its own team name instead, but that would be more complicated.
Note: For a direct equivalent in Python 2 you could replace the zip with izip from itertools.
def print_results_table(data, teams_list):
    str_l = max(len(t) for t in teams_list)
    print(" ".join(['{:>{length}s}'.format(t, length = str_l) for t in [" "] + teams_list]))
    for t, row in zip(teams_list, data):
        print(" ".join(['{:>{length}s}'.format(str(x), length = str_l) for x in [t] + row]))
teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = [[1, 2, 1],
        [0, 1, 0],
        [2, 4, 2]]
print_results_table(data, teams_list)
This will produce the following table:
            Man Utd  Man City T Hotspur
  Man Utd         1         2         1
 Man City         0         1         0
T Hotspur         2         4         2
If you want to have vertical line separators, you can replace " ".join with " | ".join.
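As a sketch of that last suggestion, here is a variant of the function with the separator exposed as a parameter. The name `format_results_table` and the `sep` parameter are introduced here for illustration only; the function returns the text instead of printing it so that it is easier to reuse.

```python
def format_results_table(data, teams_list, sep=" | "):
    """Return the results table as a string, with columns joined by sep."""
    width = max(len(t) for t in teams_list)
    # Header row: an empty corner cell followed by the team names.
    lines = [sep.join("{:>{w}}".format(t, w=width) for t in [""] + teams_list)]
    # One row per team: the team name followed by its scores.
    for team, row in zip(teams_list, data):
        lines.append(sep.join("{:>{w}}".format(str(x), w=width) for x in [team] + row))
    return "\n".join(lines)

teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = [[1, 2, 1],
        [0, 1, 0],
        [2, 4, 2]]
print(format_results_table(data, teams_list))
```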
References:
Lots about formatting (old and new formatting styles): https://pyformat.info/
The official Python tutorial (quite good): https://docs.python.org/3/tutorial/inputoutput.html#the-string-format-method
Official Python information (can be difficult to read): https://docs.python.org/3/library/string.html#string-formatting
Another resource: https://www.python-course.eu/python3_formatted_output.php
For simple cases you can just use modern string formatting (simplified Sven's answer):
f'{column1_value:15} {column2_value}':
table = {
'Amplitude': [round(amplitude, 3), 'm³/h'],
'MAE': [round(mae, 2), 'm³/h'],
'MAPE': [round(mape, 2), '%'],
}
for metric, value in table.items():
print(f'{metric:14} : {value[0]:>6.3f} {value[1]}')
Output:
Amplitude      :  1.438 m³/h
MAE            :  0.171 m³/h
MAPE           : 27.740 %
Source: https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals
I found this just looking for a way to output simple columns. If you just need no-fuss columns, then you can use this:
print("Titlex\tTitley\tTitlez")
for x, y, z in data:
    print(x, "\t", y, "\t", z)
EDIT: I was trying to be as simple as possible, and thereby did some things manually instead of using the teams list. To generalize to the OP's actual question:
# Column headers
print("", end="\t")
for team in teams_list:
    print(" ", team, end="")
print()
# Rows
for team, row in enumerate(data):
    teamlabel = teams_list[team]
    while len(teamlabel) < 9:
        teamlabel = " " + teamlabel
    print(teamlabel, end="\t")
    for entry in row:
        print(entry, end="\t")
    print()
Outputs:
Man Utd Man City T Hotspur
Man Utd 1 2 1
Man City 0 1 0
T Hotspur 2 4 2
But this no longer seems any simpler than the other answers, with perhaps the benefit that it doesn't require any imports. But @campkeith's answer already met that and is more robust, as it can handle a wider variety of label lengths.
I would try to loop through the list and use a CSV formatter to represent the data you want.
You can specify tabs, commas, or any other char as the delimiter.
Otherwise, just loop through the list and print "\t" after each element
http://docs.python.org/library/csv.html
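A minimal sketch of the CSV approach, assuming the same `teams_list` and `data` shapes used elsewhere in this thread; the function name `write_table` is hypothetical. It writes tab-delimited rows with the standard csv module. Note that tab-separated columns only line up visually when the entries are shorter than the terminal's tab stops.

```python
import csv
import sys

teams_list = ["Man Utd", "Man City", "T Hotspur"]
data = [[1, 2, 1],
        [0, 1, 0],
        [2, 4, 2]]

def write_table(out, teams_list, data, delimiter="\t"):
    """Write a header row and one labelled row per team using csv.writer."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow([""] + teams_list)
    for team, row in zip(teams_list, data):
        writer.writerow([team] + row)

write_table(sys.stdout, teams_list, data)
```

Passing `delimiter=","` instead produces ordinary comma-separated output.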
I have a more compact one that saves a lot of space.
table = [
['number1', 'x', 'name'],
["4x", "3", "Hi"],
["2", "1", "808890312093"],
["5", "Hi", "Bye"]
]
column_max_width = [max(len(row[column_index]) for row in table) for column_index in range(len(table[0]))]
row_format = ["{:>"+str(width)+"}" for width in column_max_width]
for row in table:
    print("|".join([print_format.format(value) for print_format, value in zip(row_format, row)]))
Output:
number1| x|        name
     4x| 3|          Hi
      2| 1|808890312093
      5|Hi|         Bye
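One caveat: the snippet above calls len() on each cell directly, so it assumes every cell is already a string. Here is a hedged variant that converts cells with str() first, so tables containing numbers work too. The function name `format_compact` is hypothetical.

```python
def format_compact(table, sep="|"):
    """Right-align each column to its widest entry, accepting non-string cells."""
    cells = [[str(value) for value in row] for row in table]
    widths = [max(len(row[i]) for row in cells) for i in range(len(cells[0]))]
    return "\n".join(
        sep.join(value.rjust(width) for value, width in zip(row, widths))
        for row in cells
    )

table = [
    ["", "Man Utd", "Man City"],
    ["Man Utd", 1, 2],
    ["Man City", 0, 1],
]
print(format_compact(table))
```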
To create a simple table using terminaltables open the terminal or your command prompt and run pip install terminaltables.
You can print a Python list as the following:
from terminaltables import AsciiTable
l = [
['Head', 'Head'],
['R1 C1', 'R1 C2'],
['R2 C1', 'R2 C2'],
['R3 C1', 'R3 C2']
]
table = AsciiTable(l)
print(table.table)
list1 = [1, 2, 3]
list2 = [10, 20, 30]
# Pair up the two lists and print one pair per line
for a, b in zip(list1, list2):
    print(a, "", b)

Pandas Convert the prints to dataframe

I have some code and the printed output looks pretty weird. I want to fix it.
*The Prints
Matching Score
0 john carry 73.684211
Matching Score
0 alex midlane 80.0
Matching Score
0 alex midlane 50.0
Matching Score
0 robert patt 53.333333
Matching Score
0 robert patt 70.588235
Matching Score
0 david baker 100.0
*I need this format
| Matching | Score |
| ------------ | -----------|
| john carry | 73.684211 |
| alex midlane | 80.0 |
| alex midlane | 50.0 |
| robert patt | 53.333333 |
| robert patt | 70.588235 |
| david baker | 100.0 |
*My Code
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz
df = pd.DataFrame({
    "NameTest": ["john carry", "alex midlane", "robert patt", "david baker", np.nan, np.nan, np.nan],
    "Name": ["john carrt", "john crat", "alex mid", "alex", "patt", "robert", "david baker"]
})
NameTests = [name for name in df["NameTest"] if isinstance(name, str)]
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        data = {'Matching': [match[0]],
                'Score': [match[1]]}
        df1 = pd.DataFrame(data)
        print(df1)
I have tried many ways but got the same prints.
Thank you for any suggestions.
You need a list to keep all the data, because you are creating a new DataFrame on every iteration of the loop. Collect the rows in the list first and build the DataFrame once, after the loop:
data = []
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        print(match[0])
        data.append({'Matching': match[0],
                     'Score': match[1]})
df1 = pd.DataFrame(data)
print(df1)
You create a new dataframe in each loop. You can store the result in a global dict and create dataframe from that dict after the loop.
data = {'Matching': [], 'Score': []}
for Name in df["Name"]:
    if isinstance(Name, str):
        match = process.extractOne(
            Name, NameTests,
            scorer=fuzz.ratio,
            processor=None,
            score_cutoff=10)
        data['Matching'].append(match[0])
        data['Score'].append(match[1])
df1 = pd.DataFrame(data)
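Once all matches are collected in one place, the pipe-delimited layout asked for in the question can also be produced by hand, without pandas. This is only a sketch: `to_pipe_table` is a hypothetical helper, and the sample rows are copied from the question's printed output rather than recomputed with rapidfuzz.

```python
def to_pipe_table(rows, headers):
    """Render a list of row tuples as a pipe-delimited text table."""
    cells = [[str(value) for value in row] for row in rows]
    # Each column is as wide as its header or its widest cell, whichever is longer.
    widths = [max([len(h)] + [len(row[i]) for row in cells])
              for i, h in enumerate(headers)]

    def fmt(values):
        return "| " + " | ".join(v.ljust(w) for v, w in zip(values, widths)) + " |"

    lines = [fmt(headers), fmt(["-" * w for w in widths])]
    lines.extend(fmt(row) for row in cells)
    return "\n".join(lines)

# Sample (match, score) pairs taken from the question's output.
rows = [("john carry", 73.684211), ("alex midlane", 80.0), ("david baker", 100.0)]
print(to_pipe_table(rows, ["Matching", "Score"]))
```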
