Turning a text file into a tabular format [duplicate] - python

I'm having trouble formatting a text file to meet the requirements of a school project. I've been stuck on it for a while, and since I'm still very new to coding I'd appreciate an answer I can understand and implement, so I can learn from people with more experience.
I want to read a text file (whose name the user enters) that looks like this inside the file:
Lennon 12 3.33
McCartney 57 7
Harrison 11 9.1
Starr 3 4.13
and print it in a tabular format like this:
Name        Hours  Total Pay
Lambert        34     357.00
Osborne        22     137.50
Giacometti      5     503.50
I can print the headers, though the code may not be pretty, but when I print the contents of the test file it usually comes out like this:
Name Hour Total pay
Lennon 12 3.33
McCartney 57 7
Harrison 11 9.1
Starr 3 4.13
I don't understand how to format it as a proper table that is right-justified and lined up with the headers. I'm not sure how to tackle it or even where to start, as I haven't made any real progress on this.
I've gutted my code down to just the skeleton after attempts with things like file_open.read().rstrip("\n") and .format() made a mess of the indexes and sometimes somehow left only single letters appearing:
file_name = input("Enter the file name: ")
print("Name" + " " * 12 + "Hour" + " " * 6 + "Total pay")
with open(file_name, 'r') as f:
    for line in f:
        print(line, end='')
I know it looks simple, because it is. Our instructor wants us to work with the open() function and stay away from anything that would make the code less readable, while keeping it as compact as possible. That rules out importing third-party tools, which shot down suggestions from friends to use things like beautifultable as an easier way out.
One classmate suggested reading the lines into a list and adjusting the formatting from there, and another said I could probably format it without making a list at all; I did notice that when the file is turned into a list, the newline character "\n" appears at the end of each list element.
ex: ['Lennon 12 3.33\n', 'McCartney 57 7\n', 'Harrison 11 9.1\n', 'Starr 3 4.13']
What I don't understand is how to format the items inside the list so that the name is separated from the numbers and lined up with the header. I don't have much experience with for loops, which classmates say make this an easy fix once you have them down.
I'm not looking for a straight coded answer so much as a point in the right direction, or something to read up on about manipulating list contents.

Here's something to get you headed in the right direction:
data_filename = 'employees.txt'
headers = 'Name', 'Hours', 'Rate'  # Column names.

# Read the data from file into a list-of-lists table.
with open(data_filename) as file:
    datatable = [line.split() for line in file.read().splitlines()]

# Find the longest data value or header to be printed in each column.
widths = [max(len(value) for value in col)
          for col in zip(*(datatable + [headers]))]

# Print heading followed by the data in datatable.
# (Uses '>' to right-justify the data in some columns.)
format_spec = '{:{widths[0]}} {:>{widths[1]}} {:>{widths[2]}}'
print(format_spec.format(*headers, widths=widths))
for fields in datatable:
    print(format_spec.format(*fields, widths=widths))
Output:
Name      Hours Rate
Lennon       12 3.33
McCartney    57    7
Harrison     11  9.1
Starr         3 4.13
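If the nested replacement fields look confusing, roughly the same output can be produced with str.ljust and str.rjust, reusing the widths list computed above (just an alternative sketch, not required):
# Same table, built with ljust/rjust instead of a nested format spec.
print(headers[0].ljust(widths[0]), headers[1].rjust(widths[1]), headers[2].rjust(widths[2]))
for name, hours, rate in datatable:
    print(name.ljust(widths[0]), hours.rjust(widths[1]), rate.rjust(widths[2]))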

You can use pandas for this; a DataFrame will do the required job:
import pandas as pd

# The file has no header row, so pass header=None; r'\s+' splits on any run of whitespace.
df = pd.read_csv('file.txt', sep=r'\s+', header=None)
df.columns = ['Name', 'Hours', 'Total Pay']
print(df)
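If you then want to write the aligned table back to a text file, pandas' DataFrame.to_string can render it without the index; a small sketch:
# Write the aligned table back out as plain text.
with open('formatted.txt', 'w') as out:
    out.write(df.to_string(index=False))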
Hope this helps.

Related

Google Kickstart 2014 Round D Sort a scrambled itinerary - Do I need to bring the input in a ready-to-use array format?

Problem:
Once upon a day, Mary bought a one-way ticket from somewhere to somewhere with some flight transfers.
For example: SFO->DFW DFW->JFK JFK->MIA MIA->ORD.
Obviously, transferring at the same city twice or more doesn't make any sense, so Mary will not do that.
Unfortunately, after she received the tickets, she mixed them up and forgot their order.
Help Mary rearrange the tickets to make the tickets in correct order.
Input:
The first line contains the number of test cases T, after which T cases follow.
For each case, it starts with an integer N, and N flight tickets follow.
Each ticket is given on the next 2 lines: the source airport on one line and the destination on the next.
Output:
For each test case, output one line containing "Case #x: itinerary", where x is the test case number (starting from 1) and the itinerary is a sorted list of flight tickets that represent the actual itinerary.
Each flight segment in the itinerary should be outputted as pair of source-destination airport codes.
Sample Input:
2
1
SFO
DFW
4
MIA
ORD
DFW
JFK
SFO
DFW
JFK
MIA

Sample Output:
Case #1: SFO-DFW
Case #2: SFO-DFW DFW-JFK JFK-MIA MIA-ORD
My question:
I am a beginner in competitive programming. My question is how to interpret the given input in this case. How did the Google engineers handle this input? When I write a function that takes a Python list as its argument, will that argument already be in a ready-to-use list format, or do I need to deal with the T and N numbers in the input myself and arrange the airport strings into a list before passing it to the function?
I looked at Google Kickstart's official Python solution to this problem (below) and was confused by how they simply pass the ticket_list argument to the function. Don't they need to strip the numbers T and N from the input and arrange the airport strings into a list first, as I described above?
Also, I could not understand how the attributes first and second can simply appear if no class has been defined, but that is probably a separate question...
def print_itinerary(ticket_list):
    arrival_map = {}
    destination_map = {}
    for ticket in ticket_list:
        arrival_map[ticket.second] += 1
        destination_map[ticket.first] += 1
    current = FindStart(arrival_map)
    while current in destination_map:
        next = destination_map[current]
        print current + "-" + next
        current = next
You need to implement it yourself to read data from standard input and write results to standard output.
Sample code for reading from standard input and writing to standard output can be found in the coding section of the FAQ on the KickStart Web site.
If you write the solution to this problem in python, you can get T and N as follows.
T = int(input())
for t in range(1, T + 1):
    N = int(input())
    ...
Then, if you want the source and destination of each flight ticket as a list, you can use the same input() call to build the list:
ticket_list = [[input(), input()] for _ in range(N)]
# [['MIA', 'ORD'], ['DFW', 'JFK'], ['SFO', 'DFW'], ['JFK', 'MIA']]
If you want to use first and second, try a namedtuple.
from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])
ticket_list = [Pair(input(), input()) for _ in range(N)]
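Putting those pieces together, here is a minimal sketch of a complete solution (it assumes each airport code sits on its own line, as in the sample input; the chain-walking logic is my own illustration, not the official solution):
from collections import namedtuple

Pair = namedtuple('Pair', ['first', 'second'])

T = int(input())
for case in range(1, T + 1):
    N = int(input())
    tickets = [Pair(input().strip(), input().strip()) for _ in range(N)]
    # Map each source airport to its destination.
    next_stop = {t.first: t.second for t in tickets}
    # The starting airport is the one source that never appears as a destination.
    destinations = {t.second for t in tickets}
    current = next(t.first for t in tickets if t.first not in destinations)
    # Walk the chain and collect the legs in order.
    legs = []
    while current in next_stop:
        legs.append(current + "-" + next_stop[current])
        current = next_stop[current]
    print("Case #{}: {}".format(case, " ".join(legs)))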

How can I use Python and Pandas to parse through text and return the strings I want in separate data cells?

So I have compiled a list of NFL game projections from the 2020 season for fantasy-relevant players. Each row contains the team names, the score, the relevant players, and their stats, like in the text below. The problem is that the player names and stats are of different lengths and written out in slightly different ways.
Bears 24-17 Jaguars
M.Trubisky- 234/2TDs
D.Montgomery- 113 scrim yards/1 rush TD/4 rec
A.Robinson- 9/114/1
C.Kmet- 3/35/0
G.Minshew- 183/1TD/2int
J.Robinson- 77 scrim yards/1 rush TD/4 rec
DJ.Chark- 3/36
I'm trying to create a data frame that will split the player name, receptions, yards, and touchdowns into separate columns. Then I will be able to compare these numbers to their actual game numbers and see how close the predictions were. Does anyone have an idea for a solution in Python? Even if you could point me in the right direction I'd greatly appreciate it!
You can split the full string using the '-' (dash/minus sign) as the separator. Then use indexing to get the different parts.
Using str.split(sep='-')[0] gives you the name. Here, str would be the row, for example M.Trubisky- 234/2TDs.
Similarly, str.split(sep='-')[1] gives you everything but the name.
As for splitting anything after the name, there is no way of doing it unless they are in a certain order. If you are able to somehow achieve this, there is a way of splitting into columns.
I am going to assume that the trend here is yards / touchdowns / receptions, in which case, we can again use the str.split() method. I am also assuming that the 'rows' only belong to one team. You might have to run this script once for each team to create a dataframe, and then join all dataframes with a new feature called 'team_name'.
You can define lists and append values to them, and then use the lists to create a dataframe. This snippet should help you.
import re
import pandas as pd

# rows: the player lines, e.g. ['M.Trubisky- 234/2TDs', 'D.Montgomery- 113 scrim yards/1 rush TD/4 rec', ...]
names, scrim_yards, touchdowns, receptions = [], [], [], []
for row in rows:
    # sample name: M.Trubisky
    names.append(row.split(sep='-')[0])
    stats = row.split(sep='-')[1]  # sample stats: ' 234/2TDs'
    # Since we only want the 'numbers' from each stat, we can filter them out with a regular expression.
    # (Pattern borrowed from another answer.)
    numerical_stats = re.findall(r'\b\d+\b', stats)  # sample result: ['234', '2']
    # Now we use indexing again to get the desired values.
    # Some rows carry fewer numbers, so guard against missing ones.
    scrim_yards.append(numerical_stats[0] if len(numerical_stats) > 0 else None)
    touchdowns.append(numerical_stats[1] if len(numerical_stats) > 1 else None)
    receptions.append(numerical_stats[2] if len(numerical_stats) > 2 else None)

# You can then create a pandas dataframe.
nfl_player_stats = pd.DataFrame({'names': names, 'scrim_yards': scrim_yards,
                                 'touchdowns': touchdowns, 'receptions': receptions})
As you point out, often the hardest part of processing a data file like this is handling all the variability and inconsistency in the file itself. A lot of things can vary inside the file, and sometimes it also contains silly errors (typos, missing whitespace, and the like). Depending on the size of the data file, you might be better off simply hand-editing it to make it easier to read into Python!
If you tackle this directly with Python code, it's a very good idea to verify carefully that the actual data matches your expectations of it. Here are some general concepts on how to handle this:
First off, make sure to strip every line of whitespace and ignore blank lines:
for curr_line in file_lines:
    curr_line = curr_line.strip()
    if len(curr_line) > 0:
        pass  # Process the line...
Once you have your stripped, non-blank line, make sure to handle the "game" line (the matchup between two teams) differently from the lines denoting players:
TEAM_NAMES = ["Cardinals", "Falcons", "Panthers", "Bears", "Cowboys", "Lions",
              "Packers", "Rams", "Vikings"]  # and 23 more; you get the idea

# ...down in the code where we are processing the lines...
if any(tn in curr_line for tn in TEAM_NAMES):
    pass  # ...handle as a "matchup"
else:
    pass  # ...handle as a "player"
When handling a player and their stats, we can use "- " as a separator. (You must include the space, otherwise players such as Clyde Edwards-Helaire will split the line in a way you did not want.) Here we unpack into exactly two variables, which gives us a nice error check since the code will raise an exception if the line doesn't split into exactly two parts.
p_name, p_stats = curr_line.split("- ")
Handling the stats will be the hardest part. It will all depend on what assumptions you can safely make about your input data. I would recommend being very paranoid about validating that the input data agrees with the assumptions in your code. Here is one notional idea -- an over-engineered solution, but that should help to manage the hassle of finding all the little issues that are probably lurking in that data file:
if "scrim yards" in p_stats:
# This is a running back, so "scrim yards" then "rush TD" then "rec:
rb_stats = p_stats.split("/")
# To get the number, just split by whitespace and grab the first one
scrim_yds = int(rb_stats[0].split()[0])
if len(rb_stats) >= 2:
rush_tds = int(rb_stats[1].split()[0])
if len(rb_stats) >= 3:
rec = int(rb_stats[2].split()[0])
# Always check for unexpected data...
if len(rb_stats) > 3:
raise Exception("Excess data found in rb_stats: {}".format(rb_stats))
elif "TD" in p_stats:
# This is a quarterback, so "yards"/"TD"/"int"
qb_stats = p_stats.split("/")
qb_yards = int(qb_stats[0]) # Or store directly into the DF; you get the idea
# Handle "TD" or "TDs". Personal preference is to avoid regexp's
if len(qb_stats) >= 2:
if qb_stats[1].endswidth("TD"):
qb_td = int(qb_stats[1][:-2])
elif qb_stats[1].endswith("TDs"):
qb_td = int(qb_stats[1][:-3])
else:
raise Exception("Unknown qb_stats: {}".format(qb_stats))
# Handle "int" if it's there
if len(qb_stats) >= 3:
if qb_stats[2].endswidth("int"):
qb_int = int(qb_stats[2][:-3])
else:
raise Exception("Unknown qb_stats: {}".format(qb_stats))
# Always check for unexpected data...
if len(qb_stats) > 3:
raise Exception("Excess data found in qb_stats: {}".format(qb_stats))
else:
# Must be a running back: receptions/yards/TD
rb_rec, rb_yds, rb_td = p_stats.split("/")

is there a way to modify a string to remove a decimal?

I have a folder with a lot of images. Each image is named something like:
100304.jpg
100305.jpg
100306.jpg
etc...
I also have a spreadsheet in which each image is a row; the first value in the row is the name, and the values after the name are various decimals and 0's describing features of each image.
The issue is that when I pull the name from the sheet, something adds a decimal, which then prevents the file from being moved via shutil.move().
import xlrd
import shutil

dataLocation = "C:/Users/User/Documents/Python/Project/sort_solutions_rev1.xlsx"
imageLocBase = "C:/Users/User/Documents/Python/Project/unsorted"

print("Specify which folder to put images in. Type the number only.")
print("1")
print("2")
print("3")
typeOfSet = input("")

# Sorting for folder 1
if int(typeOfSet) == 1:
    # Identifying what to move
    name = str(sheet.cell(int(nameRow), 0).value)
    sortDataStorage = (sheet.cell(int(nameRow), 8).value)  # float
    sortDataStorageNoFloat = str(sortDataStorage)  # non-float
    print("Processing: " + name)
    print(name + " has a correlation of " + (sortDataStorageNoFloat))
    # (sorting for this folder utilizes the information in column 8)
    if sortDataStorage >= sortAc:
        print("test success")
        folderPath = "C:/Users/User/Documents/Python/Project/Image Folder/Folder1"
        shutil.move(imageLocBase + "/" + name, folderPath)
        print(name + " has been sorted.")
    else:
        print(name + " does not meet correlation requirement. Moving to next image.")
The issue I'm having occurs with the shutil.move(imageLocBase + "/" + name, folderPath) line.
For some reason my code takes the name from the spreadsheet (ex: 100304) and then adds a ".0", so when trying to move a file it is trying to move 100304.0 (which doesn't exist) instead of 100304.
Using pandas to read your Excel file.
As suggested in a comment on the original question, here is a quick example of how to use pandas to read your Excel file, along with an example of the data structure.
Any questions, feel free to shout, or have a look into the docs.
import pandas as pd
# My path looks a little different as I'm on Linux.
path = '~/Desktop/so/MyImages.xlsx'
df = pd.read_excel(path)
Data Structure
This is completely contrived as I don't have an example of your actual file.
   IMAGE_NAME  FEATURE_1  FEATURE_2  FEATURE_3
0  100304.jpg     0.0111      0.111      1.111
1  100305.jpg     0.0222      0.222      2.222
2  100306.jpg     0.0333      0.333      3.333
Hope this helps get you started.
Suggestion:
Excel likes to think it's clever and does 'unexpected' things, as you're experiencing with the decimal (data type) issue. Perhaps consider storing your image data in a database (SQLite) or as a plain old CSV file. Pandas can read from either of these as well! :-)
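A rough sketch of that idea: telling pandas to keep the name column as text sidesteps the float issue entirely (the column name IMAGE_NAME comes from the contrived example above, so adjust it to your real sheet).
import pandas as pd

# Read the sheet but keep the image-name column as a string,
# so 100304 never becomes 100304.0.
df = pd.read_excel('~/Desktop/so/MyImages.xlsx', dtype={'IMAGE_NAME': str})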
splitOn = '.'
nameOfFile = text.split(splitOn, 1)[0]
Should work
To spell it out: if we take your file name, e.g. 12345.0, and store it in a variable:
name = "12345.0"
Now we need to split this variable. In this case we wish to split on '.', so we save that separator as a second variable:
splitOn = '.'
Python's str.split() takes the separator and an optional maximum number of splits. Splitting "12345.0" at '.' with one split gives a two-element list: '12345' at position 0 (the 1st value) and '0' at position 1 (the 2nd value). Since lists are 0-based, we ask for [0] to keep the first part:
nameOfFile = name.split(splitOn, 1)[0]
Putting it all together:
name = "12345.0"
splitOn = '.'
nameOfFile = name.split(splitOn, 1)[0]
print(nameOfFile)
The output will be:
12345
I hope that helps.
https://www.geeksforgeeks.org/python-string-split/
OR, as highlighted below, convert the float to an int:
https://www.geeksforgeeks.org/type-conversion-python/
If the name is saved as a float:
name = 12345.0
newName = int(round(name))
This rounds the float and drops the .0 (since the fractional part is 0, it rounds down).
OR, if the float is saved as a string:
print(int(float(name)))
Apparently the value you retrieve from the spreadsheet comes parsed as a float, so when you cast it to a string it retains the decimal part.
You can trim the ".0" from the string value, or cast it to an integer before casting it to a string.
You could also check the spreadsheet's cell format and make sure it is set to text (I don't know the exact setting, but something that is not a number). With that fixed, your data probably won't come with the .0 anymore.
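For example, based on the code in the question (illustrative only, assuming the cell always holds a whole number):
raw_value = sheet.cell(int(nameRow), 0).value  # e.g. 100304.0 (a float)
name = str(int(raw_value))                     # '100304'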
If it always adds ".0" to the end of the value, you can read the string "name" this way:
shutil.move(imageLocBase + "/" + name[:-2], folderPath)
A string is like a list from which we can choose which elements to read; this technique is called slicing.
Sorry for my English. Bye
All these people have taken time to reply, please out of politeness rate the replies.

Python write to a text file after certain column

I am using the following code:
f.write(str(foo) + ' ' + str(bar) + '\n')
The problem is that the number of letters in foo is different for each value and I get the following output:
Account Category DORMANT
Last Made Update 21/12/2013
Mortgages Partly Satisfied 0
The problem is that because I am using the same amount of space (' ') for all the values, and "Mortgages Partly Satisfied" is a longer string, its value 0 ends up further to the right. What I would like the output to be is:
Account Category           DORMANT
Last Made Update           21/12/2013
Mortgages Partly Satisfied 0
My question is: Is there a way to insert the second value bar after a certain number of columns so the values will always be aligned?
I hope I was clear enough.
It's probably best to use string formatting with the str.format method, like so:
items = [
    ('Account Category', 'DORMANT'),
    ('Last Made Update', '21/12/2013'),
    ('Mortgages Partly Satisfied', '0'),
]

for label, value in items:
    f.write('{:28} {}\n'.format(label, value))
The :28 is the width specifier. See format string docs for more info.
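If the longest label isn't known ahead of time, the width can also be computed from the data instead of hard-coding 28; a small sketch reusing the items list above:
# Width of the widest label, then pass it as a nested format field.
width = max(len(label) for label, _ in items)
for label, value in items:
    f.write('{:{w}} {}\n'.format(label, value, w=width))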
Python lets you add padding for strings by specifying the number of characters a given field should use. This can be used when writing to your file as follows:
data = [["Account Category", "DORMANT"],
        ["Last Made Update", "21/12/2013"],
        ["Mortgages Partly Satisfied", "0"]]

with open('output.txt', 'w') as f:
    for v1, v2 in data:
        f.write("{:28} {}\n".format(v1, v2))
Giving you:
Account Category             DORMANT
Last Made Update             21/12/2013
Mortgages Partly Satisfied   0
You can use the ljust method, which returns the string left-justified.
Just try it:
f.write(str(foo).ljust(40) + str(bar) + '\n')
You can also check other methods in the docs
This will give you the following output:
Last Made Update                        21/12/2013
Account Category                        DORMANT
Mortgages Partly Satisfied              0

adding '+' to all the numbers as a prefix (numbers are stored in a csv file) using a python script

goal
All the numbers in the csv file that I exported from Hotmail are stored as 91123456789, whereas to complete a call I need to dial +91123456789. These contacts will be converted to a batch of vcf files and exported to my phone. I want to add the + at the beginning of all my contacts' numbers.
approach
write a python script that can do this for an indefinite number of contacts.
pre-conditions
none of the numbers in the csv file will have a + in them.
problem
(a) There is a possibility that the number itself may have a 91 in it, like +919658912365. This makes adding the plus more difficult.
Explanation: I am adding this as a problem because, if the 91 only ever appeared at the beginning of a number, we could simply check the first two digits: if they are 91 we add the +, otherwise we don't and move on to the next number.
(b) The fields are separated by commas. I want to add the + as a prefix only in front of the field which has the header mobile, and not in any other field where the digits 91 may appear (like in landline numbers or fax numbers).
research
I tried this with Excel, but the process would take an unreasonable amount of time (like 2 hours!).
specs
I have 400 contacts.
Windows XP SP 3
please help me solve this problem.
Something like below??
import csv

for row in csv.reader(['num1, 123456789', 'num2, 987654321', 'num3, +23456789']):
    phoneNumber = row[1].strip()
    if not phoneNumber.startswith('+'):
        phoneNumber = '+' + phoneNumber
    print(phoneNumber)
You could also iterate over the phone numbers and test each one, as below:
phone_numbers = ['12234', '91232324', '913746', '3453', '9145653', '95843']
for i, number in enumerate(phone_numbers):
    phone_numbers[i] = '+' + number if number.startswith('91') else number
Hope that helps
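For the actual CSV, a rough sketch that only touches the mobile column could use csv.DictReader and csv.DictWriter (the file names and the header 'Mobile Phone' are assumptions; check what your Hotmail export actually calls it):
import csv

with open('contacts.csv', newline='') as src, \
        open('contacts_fixed.csv', 'w', newline='') as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        number = row.get('Mobile Phone', '').strip()
        # Prefix '+' only when the mobile field holds a number without one.
        if number and not number.startswith('+'):
            row['Mobile Phone'] = '+' + number
        writer.writerow(row)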
