Open a LaTeX file with pandas? - python

I am trying to replicate in Python the content of the "Tidy Data" paper available here.
However, the datasets are available on GitHub as .tex files, and I can't seem to open them with pandas.
From my searches so far, it seems that pandas can export to LaTeX, but not import from it...
1) Am I correct?
2) If so, how would you advise me to open those files?
Thank you for your time!

Using this as an example:
import pandas as pd
from io import StringIO  # pandas.compat.StringIO was removed in newer pandas
with open('test.tex') as input_file:
    text = ""
    for line in input_file:
        if '&' in line:
            # drop the LaTeX row terminator '\\'; each line keeps its own newline
            text += line.replace('\\', '')
data = StringIO(text)
df = pd.read_csv(data, sep="&")
data.close()
Returns:
year artist track time date.entered wk1 wk2 wk3
0 2000 2 Pac Baby Don't Cry 4:22 2000-02-26 87 82 72
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87 92
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 68
3 2000 98verb|^|0 Give Me Just One Nig... 3:24 2000-08-19 51 39 34
4 2000 A*Teens Dancing Queen 3:44 2000-07-08 97 97 96
5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 84 62 51
6 2000 Aaliyah Try Again 4:03 2000-03-18 59 53 38
7 2000 Adams, Yolanda Open My Heart 5:30 2000-08-26 76 76 74
You can also write one script which transforms the file:
with open('test.tex') as input_file:
    with open('test.csv', 'w') as output_file:
        for line in input_file:
            if '&' in line:
                output_file.write(line.replace('\\', ''))
Then another script which uses pandas:
import pandas as pd
pd.read_csv('test.csv', sep="&")

1) To my knowledge you can open any standard type of file with Python.
2) You could try:
with open('test.tex') as text_file:
    # do something with text_file here
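A minimal end-to-end sketch of the line-filtering approach from the first answer, using io.StringIO (pandas.compat.StringIO is gone from current pandas) and a made-up table snippet standing in for one of the paper's .tex files:

```python
import io
import pandas as pd

# Hypothetical snippet standing in for one of the paper's .tex tables.
tex = r"""
\begin{tabular}{lll}
year & artist & track \\
2000 & 2 Pac & Baby Don't Cry \\
2000 & 3 Doors Down & Kryptonite \\
\end{tabular}
"""

# Keep only lines containing '&' (table rows) and drop the '\\' row terminator.
rows = [line.replace("\\", "") for line in tex.splitlines() if "&" in line]
df = pd.read_csv(io.StringIO("\n".join(rows)), sep="&")
# header cells keep their LaTeX padding otherwise
df.columns = [c.strip() for c in df.columns]
```

String cells keep their surrounding whitespace; numeric cells are converted cleanly.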

Related

better way to write a csv into a StringIO from another StringIO object

I have the following StringIO object:
s = io.StringIO("""idx Exam_Results Hours_Studied
0 93 8.232795
1 94 7.879095
2 92 6.972698
3 88 6.854017
4 91 6.043066
5 87 5.510013
6 89 5.509297""")
I want to transform it into csv format and dump it into a new StringIO object. I'm currently using this strategy, but it seems a bit clumsy to me.
output = ""
for line in s.getvalue().split('\n'):
    output += re.sub(r'\s+', ',', line) + '\n'
output = io.StringIO(output)
print(output.getvalue())
Result:
idx,Exam_Results,Hours_Studied
0,93,8.232795
1,94,7.879095
2,92,6.972698
3,88,6.854017
4,91,6.043066
5,87,5.510013
6,89,5.509297
Is there a cleverer way to achieve this?
You can use the csv module:
import csv
from io import StringIO
s = StringIO(
"""idx Exam_Results Hours_Studied
0 93 8.232795
1 94 7.879095
2 92 6.972698
3 88 6.854017
4 91 6.043066
5 87 5.510013
6 89 5.509297"""
)
def convert(origin: str) -> StringIO:
    si = StringIO(newline="")
    spamwriter = csv.writer(
        si, delimiter=",", quotechar="|", quoting=csv.QUOTE_MINIMAL
    )
    for line in origin.splitlines():
        spamwriter.writerow(line.split())
    return si

def main():
    sio = convert(s.getvalue())
    print(sio.getvalue())

if __name__ == "__main__":
    main()
from io import StringIO
import csv
text = StringIO("""idx Exam_Results Hours_Studied
0 93 8.232795
1 94 7.879095
2 92 6.972698
3 88 6.854017
4 91 6.043066
5 87 5.510013
6 89 5.509297""")
output = StringIO('')
writer = csv.writer(output, delimiter=',')
writer.writerows(csv.reader(text, delimiter=' ', skipinitialspace=True))
print(output.getvalue())
Output:
idx,Exam_Results,Hours_Studied
0,93,8.232795
1,94,7.879095
2,92,6.972698
3,88,6.854017
4,91,6.043066
5,87,5.510013
6,89,5.509297
You can try the pandas package:
import io
import pandas as pd
s = io.StringIO("""idx Exam_Results Hours_Studied
0 93 8.232795
1 94 7.879095
2 92 6.972698
3 88 6.854017
4 91 6.043066
5 87 5.510013
6 89 5.509297""")
out = io.StringIO()
pd.read_csv(s, sep=r'\s+').to_csv(out, index=False, sep=';')
print(out.getvalue())
idx;Exam_Results;Hours_Studied
0;93;8.232795
1;94;7.879095
2;92;6.972698
3;88;6.854017
4;91;6.043066
5;87;5.510013
6;89;5.509297
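If whitespace runs are the only separators, a single regex substitution over the whole buffer is a shorter equivalent of the question's line-by-line loop; a sketch:

```python
import io
import re

s = io.StringIO("""idx Exam_Results Hours_Studied
0 93 8.232795
1 94 7.879095""")

# [^\S\n]+ matches runs of whitespace other than newlines, so line
# breaks survive while the gaps between columns become commas.
output = io.StringIO(re.sub(r"[^\S\n]+", ",", s.getvalue()))
```

This only works when no field itself contains whitespace; for quoted or embedded spaces, the csv-module answers above are safer.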

How to insert one column from a table into a file in Python

I want to copy one column from a file and write it to a new file.
I have a file:
1 12 13 14
2 22 23 24
3 32 33 34
4 42 43 44
5 52 53 54
6 62 63 64
I need to copy the 4th column to a new file.
In the code, you can see that I take the file and delete the first two lines in it. After that come my attempts to create a file with one column.
f = open("1234.txt").readlines()
for i in [0, 0, -1]:
    f.pop(i)
with open("1234.txt", 'w') as F:
    F.writelines(f)
ff = open("1234.txt", 'r')
df1 = ff.iloc[:, 3:3]
print(df1)
with open('12345.txt', 'w') as F:
    df.writelines('12345.txt')
I'm not sure whether I need to import something for iloc (pandas, maybe?). Should I close the files in the code, and when?
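As a sketch of what seems intended here: .iloc belongs to a pandas DataFrame, not to a plain file handle, so the file has to be loaded into a DataFrame first. A minimal version, assuming whitespace-separated data and using an in-memory stand-in for the question's file:

```python
import io
import pandas as pd

# Stand-in for the question's whitespace-separated 1234.txt.
data = io.StringIO("""1 12 13 14
2 22 23 24
3 32 33 34""")

# Load into a DataFrame (no header row), take the fourth column
# (index 3), and write it out on its own.
df = pd.read_csv(data, sep=r"\s+", header=None)
out = io.StringIO()
df.iloc[:, 3].to_csv(out, index=False, header=False)
```

With a real file you would pass the path to read_csv and to to_csv directly; the with-open blocks become unnecessary because pandas closes the files itself.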

How to get a specific field for parsing log files using pandas regular expressions [duplicate]

I have pandas DataFrame like this
X Y Z Value
0 18 55 1 70
1 18 55 2 67
2 18 57 2 75
3 18 58 1 35
4 19 54 2 70
I want to write this data to a text file that looks like this:
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
I have tried something like
f = open(writePath, 'a')
f.writelines(['\n', str(data['X']), ' ', str(data['Y']), ' ', str(data['Z']), ' ', str(data['Value'])])
f.close()
It's not correct. How can I do this?
You can just use np.savetxt and access the DataFrame's .values attribute:
import numpy as np
np.savetxt(r'c:\data\np.txt', df.values, fmt='%d')
yields:
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
or to_csv:
df.to_csv(r'c:\data\pandas.txt', header=None, index=None, sep=' ', mode='a')
Note for np.savetxt you'd have to pass a filehandle that has been created with append mode.
The native way to do this is to use df.to_string():
with open(writePath, 'a') as f:
    dfAsString = df.to_string(header=False, index=False)
    f.write(dfAsString)
This will output the following:
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
This method also lets you easily choose which columns to print with the columns argument, lets you keep the column and index labels if you wish, and has other arguments for spacing, etc.
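For instance, a small sketch of that column selection (the frame values here are made up from the question's sample):

```python
import pandas as pd

# Hypothetical frame with the question's column names.
df = pd.DataFrame({"X": [18, 18], "Y": [55, 57], "Z": [1, 2], "Value": [70, 75]})

# columns= restricts which columns reach the output string;
# header/index control the labels, exactly as in the answer above.
text = df.to_string(header=False, index=False, columns=["X", "Y", "Value"])
```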
You can use pandas.DataFrame.to_csv(), and setting both index and header to False:
In [97]: print df.to_csv(sep=' ', index=False, header=False)
18 55 1 70
18 55 2 67
18 57 2 75
18 58 1 35
19 54 2 70
pandas.DataFrame.to_csv can write to a file directly, for more info you can refer to the docs linked above.
Late to the party, but try this:
import os
base_filename = 'Values.txt'
with open(os.path.join(WorkingFolder, base_filename), 'w') as outfile:
    df.to_string(outfile)
    # neatly writes all columns and rows to the .txt file
@AHegde - to get tab-delimited output, use the separator sep='\t'.
For df.to_csv:
df.to_csv(r'c:\data\pandas.txt', header=None, index=None, sep='\t', mode='a')
For np.savetxt:
np.savetxt(r'c:\data\np.txt', df.values, fmt='%d', delimiter='\t')
Way to get Excel data to a text file in tab-delimited form.
Need to use pandas as well as xlrd.
import pandas as pd
import xlrd
import os

Path = r"C:\downloads"
wb = pd.ExcelFile(Path + r"\input.xlsx", engine=None)
sheet2 = pd.read_excel(wb, sheet_name="Sheet1")
Excel_Filter = sheet2[sheet2['Name'] == 'Test']
Excel_Filter.to_excel(Path + r"\output.xlsx", index=None)
wb2 = xlrd.open_workbook(Path + r"\output.xlsx")
df = wb2.sheet_by_name("Sheet1")
x = df.nrows
y = df.ncols
for i in range(0, x):
    for j in range(0, y):
        A = str(df.cell_value(i, j))
        f = open(Path + r"\emails.txt", "a")
        f.write(A + "\t")
        f.close()
    f = open(Path + r"\emails.txt", "a")
    f.write("\n")
    f.close()
os.remove(Path + r"\output.xlsx")
print(Excel_Filter)
We need to first generate the xlsx file with filtered data and then convert the information into a text file.
Depending on requirements, we can use \n and \t in the loops to shape the data in the text file.
I used a slightly modified version:
with open(file_name, 'w', encoding='utf-8') as f:
    for rec_index, rec in df.iterrows():
        f.write(rec['<field>'] + '\n')
I had to write the contents of a dataframe field (that was delimited) as a text file.
If you have a Dataframe that is an output of pandas compare method, such a dataframe looks like below when it is printed:
grossRevenue netRevenue defaultCost
self other self other self other
2098 150.0 160.0 NaN NaN NaN NaN
2110 1400.0 400.0 NaN NaN NaN NaN
2127 NaN NaN NaN NaN 0.0 909.0
2137 NaN NaN 0.000000 8.900000e+01 NaN NaN
2150 NaN NaN 0.000000 8.888889e+07 NaN NaN
2162 NaN NaN 1815.000039 1.815000e+03 NaN NaN
I was looking to persist the whole dataframe into a text file as it's visible above. Using pandas' to_csv or numpy's savetxt does not achieve this goal. I used plain old print to log the same into a text file:
with open('file1.txt', mode='w') as file_object:
    print(data_frame, file=file_object)

Issues reading csv with footer and arbitray number of blank lines at end

I have issues using the pandas package for reading a .csv file with a single footer and an arbitrary number (>= 0) of blank lines at the end of the file (blank lines come after the footer). For example this is my .csv file:
col_1, col_2
1, 30
2, 40
3, 50
(last row)
(I cannot show the arbitrary number of blank lines at the end, because the SO editor doesn't preserve them. To avoid any confusion: (last row) is the footer.)
When I run:
>>> import pandas as pd
>>> pd.read_csv('test.csv', header=0, engine='python', skipfooter=1, skip_blank_lines=True)
col_1 col_2
0 1 30.0
1 2 40.0
2 3 50.0
3 (last row) NaN
I get the undesired row with index 3:
(last row) NaN
An undesired side effect is that the values in my first column are all strings instead of ints, and the values in the 2nd column are floats instead of ints.
I can fix it by truncating the last row and converting the columns to the right type, but it should be possible by passing the right value to the skipfooter or skip_blank_lines argument. However, whatever parameters I use, it fails. What is going wrong?
I use pandas version 0.20.3 and Python 2.7.12 on a Linux system.
You can create your own parser pretty easily, like:
CSV Parser:
def read_my_csv(file_handle):
    # build a csv reader
    reader = csv.reader(file_handle)
    # for each row, check for the footer
    for row in reader:
        if row[0].strip() == '(last row)':
            break
        yield row
To use:
import csv
import pandas as pd

with open("test.csv", 'rU') as f:
    generator = read_my_csv(f)
    columns = next(generator)
    df = pd.DataFrame(generator, columns=columns)
    print(df)
Results:
col_1 col_2
0 1 30
1 2 40
2 3 50
Finally, I could reproduce your behaviour by putting the special symbol ^A on the last line.
If I print file to the console, there is nothing special:
$cat test.csv
col_1, col_2
1, 30
2, 40
3, 50
$
But looking at a hexdump, you can see an unusual 01 byte just before the trailing newlines:
$hexdump -C test.csv
00000000 63 6f 6c 5f 31 2c 20 63 6f 6c 5f 32 0a 31 2c 20 |col_1, col_2.1, |
00000010 33 30 0a 32 2c 20 34 30 0a 33 2c 20 35 30 0a 01 |30.2, 40.3, 50..|
00000020 0a 0a |..|
00000022
$
When reading such file with pandas, I got exactly the same results you described.
The easiest way to check your file is to view it with the less command-line tool:
$less test.csv
col_1, col_2
1, 30
2, 40
3, 50
^A
$
The way to fix this situation depends on how this special character got into the file.
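Whatever put the stray byte there, one robust workaround is to pre-filter the lines yourself and hand pandas only the ones that look like data. A sketch assuming the footer text and column layout from the question:

```python
import io
import pandas as pd

# Stand-in for the question's file: a footer plus trailing blank lines.
raw = """col_1, col_2
1, 30
2, 40
3, 50
(last row)

"""

# Keep lines up to the footer marker; blank (or junk-only) trailing
# lines never reach the parser at all.
lines = []
for line in raw.splitlines():
    if line.strip().startswith("(last row)"):
        break
    if line.strip():
        lines.append(line)

df = pd.read_csv(io.StringIO("\n".join(lines)), skipinitialspace=True)
```

Because the footer row never reaches read_csv, both columns keep their integer dtypes instead of being coerced to object/float.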

How to read specific columns and Rows in Python?

Timestamp SP DP
20-03-2017 10:00:01 50 60.5
20-03-2017 10:10:00 60 70
20-03-2017 10:40:01 75 80
20-03-2017 11:05:00 44 65
20-03-2017 11:25:01 98 42
20-03-2017 11:50:01 12 99
20-03-2017 12:00:05 13 54
20-03-2017 12:05:01 78 78
20-03-2017 12:59:01 15 89
20-03-2017 13:00:00 46 99
20-03-2017 13:23:01 44 45
20-03-2017 13:45:08 80 39
import csv
output = []
f = open('test.csv', 'r')  # open the file in read mode
for line in f:
    cells = line.split(",")
    output.append((cells[0], cells[1]))  # we want the first and second columns
print(output)
How do I read specific columns and specific rows?
Desired output:
I want only the first two columns and the first 2 rows:
Timestamp SP
20-03-2017 10:00:01 50
20-03-2017 10:10:00 60
How to do that?
Use the csv module, and either count your rows (using the enumerate() function) or use itertools.islice() to limit how much is read:
import csv
output = []
with open('test.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for counter, row in enumerate(reader):
        if counter > 2:
            # read only the header and first two rows
            break
        output.append(row[:2])
or using islice():
import csv
from itertools import islice
with open('test.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    output = list(islice((row[:2] for row in reader), 3))
You can use index slicing. Just read the csv from the source.
from pandas import *
df = read_csv("Name of csv file.")
df2 = df.iloc[:2, 0:2]
print(df2)
Try it.
You can use pandas to read it.
import pandas
df = pandas.read_csv("filepath", index_col=0)
Then you can get the first column and 2 rows with
df.SP.head(2)
or
df.iloc[:2, 0:2]  # first two rows, first two columns
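read_csv can also do the limiting itself: usecols picks columns by name and nrows caps how many data rows are read. A sketch assuming the file is actually comma-separated, as the question's own line.split(",") suggests:

```python
import io
import pandas as pd

# Stand-in for test.csv in comma-separated form.
data = io.StringIO("""Timestamp,SP,DP
20-03-2017 10:00:01,50,60.5
20-03-2017 10:10:00,60,70
20-03-2017 10:40:01,75,80""")

# usecols keeps only the named columns; nrows reads only two data rows,
# so the rest of the file is never parsed at all.
df = pd.read_csv(data, usecols=["Timestamp", "SP"], nrows=2)
```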
