Better way to convert a file to .bib - python

I have CSV files, text files, and Word files containing a large number of bibliography entries that I want to export to a .bib file, preferably with a custom field order that I define.
What is the best way to convert this information to a .bib file using Python?
Should I convert the text to XML and then to .bib? Or load a CSV file, iterate through the columns, and write them out column by column?
I've written a script that loads a CSV and takes each column as a list, but I can't work out how to convert that to .bib.
Here's the code I've put together:
import csv
import pandas
import bibtexparser
from bibtexparser.bparser import BibTexParser
from bibtexparser.customization import *

colnames = ['AUTHORS', 'TITLE', 'EDITOR']
data = pandas.read_csv('file.csv', names=colnames, delimiter=r";", encoding='latin1')
list1 = list(data.AUTHORS)

def customs(record):
    record = type(record)
    record = author(record)
    record = editor(record)
    return record

with open('123.bib', 'w') as bibtex_file:
    parser = BibTexParser()
    parser.customization = customs
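One possible direction, as a minimal sketch rather than a definitive answer: since the CSV already has one column per field, each row can be written out as a BibTeX entry with plain string formatting, without going through XML. This sketch deliberately does not use bibtexparser. The helper names `rows_to_bibtex` and `csv_to_bibtex`, the `@book` entry type, and the `ref1`, `ref2`, ... citation keys are all assumptions made for illustration. The order of the `fields` list is what controls the field order in the output. Special BibTeX characters in the values are not escaped here.

```python
import csv
import io


def rows_to_bibtex(rows, entry_type="book"):
    """Render dicts with AUTHORS/TITLE/EDITOR keys as BibTeX entries."""
    entries = []
    for number, row in enumerate(rows, start=1):
        # The order of this list sets the field order inside each entry.
        fields = [
            ("author", row.get("AUTHORS", "")),
            ("title", row.get("TITLE", "")),
            ("editor", row.get("EDITOR", "")),
        ]
        # Skip fields that are missing or blank in this row.
        body = ",\n".join(
            "  {} = {{{}}}".format(name, value.strip())
            for name, value in fields
            if value and value.strip()
        )
        # "ref1", "ref2", ... is a placeholder key scheme (an assumption).
        entries.append("@{}{{ref{},\n{}\n}}".format(entry_type, number, body))
    return "\n\n".join(entries) + "\n"


def csv_to_bibtex(csv_text, colnames=("AUTHORS", "TITLE", "EDITOR")):
    """Parse semicolon-delimited, header-less CSV text and return BibTeX text."""
    reader = csv.DictReader(
        io.StringIO(csv_text), fieldnames=list(colnames), delimiter=";"
    )
    return rows_to_bibtex(reader)


sample = "Doe, John;An Example Title;Roe, Jane\n"
print(csv_to_bibtex(sample))
```

To work from the real file, the text could be read with `open('file.csv', encoding='latin1').read()` and the returned string written to `123.bib`.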

Related

Grab values from separate csv file and replace the values of columns in a pipe delimited file

I'm trying to whip this out in Python. Long story short, I have a CSV file that contains column data I need to inject into another file that is pipe-delimited. My understanding is that Python can't replace values in place, so I have to rewrite the whole file with the new values.
data file(csv):
value1,value2,iwantthisvalue3
source file(txt, | delimited)
value1|value2|iwanttoreplacethisvalue3|value4|value5|etc
fixed file(txt, | delimited)
samevalue1|samevalue2|replacedvalue3|value4|value5|etc
I can't figure out how to accomplish this. This is my latest attempt(broken code):
import re
import csv
result = []
row = []
with open("C:\data\generatedfixed.csv","r") as data_file:
    for line in data_file:
        fields = line.split(',')
        result.append(fields[2])
with open("C:\data\data.txt","r") as source_file, with open("C:\data\data_fixed.txt", "w") as fixed_file:
    for line in source_file:
        fields = line.split('|')
        n=0
        for value in result:
            fields[2] = result[n]
            n=n+1
            row.append(line)
    for value in row
        fixed_file.write(row)
I would strongly suggest using the pandas package here. It makes handling tabular data very easy and would help you a lot in this case. Once you have installed pandas, import it with:
import pandas as pd
To read the files, use raw strings for the Windows paths so the backslashes are not treated as escape characters:
data_file = pd.read_csv(r"C:\data\generatedfixed.csv")
source_file = pd.read_csv(r"C:\data\data.txt", delimiter="|")
After that, manipulating these two files is easy. I'm not exactly sure how many values or which ones you want to replace, but if both files have a header row and the "iwantthisvalue3" and "iwanttoreplacethisvalue3" columns have the same length, this should do the trick:
source_file['iwanttoreplacethisvalue3'] = data_file['iwantthisvalue3']
Now all you need to do is save the DataFrame (the table we just updated) to a file. Since you want a .txt file with "|" as the delimiter, this is the line to do that (you can customize the output in many other ways):
source_file.to_csv(r"C:\data\data_fixed.txt", sep='|', index=False)
Let me know if everything works and this helped you. I would also encourage you to read up on pandas (or watch some videos) if you're planning to work with tabular data. It is an awesome library with great documentation and functionality.
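If the files have no header row, as the samples in the question suggest, the same replacement can be sketched with only the standard csv module. This is an illustration under assumptions, not the method from the answer above: the function name `replace_third_column` is made up, the two files are assumed to have matching row order, and the inputs are passed as text so the example is self-contained.

```python
import csv
import io


def replace_third_column(data_csv, source_txt):
    """Return source_txt with column 3 of each row taken from data_csv."""
    data_rows = csv.reader(io.StringIO(data_csv))
    source_rows = csv.reader(io.StringIO(source_txt), delimiter="|")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    # zip pairs the first rows, then the second rows, and so on,
    # stopping at the end of the shorter input.
    for data_row, source_row in zip(data_rows, source_rows):
        source_row[2] = data_row[2]  # index 2 is the third column
        writer.writerow(source_row)
    return out.getvalue()


print(replace_third_column(
    "value1,value2,iwantthisvalue3\n",
    "value1|value2|iwanttoreplacethisvalue3|value4|value5\n",
))
```

With real files, the two input strings would come from reading the CSV and the pipe-delimited file, and the result would be written to the fixed file.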

Multiple txt files as separate rows in a csv file without breaking into lines (in pandas dataframe)

I have many txt files (which have been converted from pdf) in a folder. I want to create a csv/excel dataset where each text file will become a row. Right now I am opening the files in pandas dataframe and then trying to save it to a csv file. When I print the dataframe, I get one row per txt file. However, when saving to csv file, the texts get broken and create multiple rows/lines for each txt file rather than just one row. Do you know how I can solve this problem? Any help would be highly appreciated. Thank you.
Following is the code I am using now.
import glob
import os
import pandas as pd
file_list = glob.glob(os.path.join(os.getcwd(), "K:\\text_all", "*.txt"))
corpus = []
for file_path in file_list:
    with open(file_path, encoding="latin-1") as f_input:
        corpus.append(f_input.read())
df = pd.DataFrame({'col':corpus})
print(df)
df.to_csv('K:\\out.csv')
Update
If this is not possible, it would also help to transform the data a bit in the pandas dataframe. I want to add a column holding the name of each txt file, so that the file name becomes the identifier of the respective text. I will then save it in TSV format so that the lines do not get split on commas, as someone here suggested.
I need something like following.
identifier    col
txt1          example text in this file
txt2          second example text in this file
...
txtn          final example text in this file
Use
import csv
df.to_csv('K:\\out.csv', quoting=csv.QUOTE_ALL)
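For the layout requested in the update (one identifier column plus one text column, saved as TSV), here is a rough sketch using only the standard library. The function name `texts_to_tsv` is hypothetical, and collapsing all whitespace (including line breaks and tabs) into single spaces is an assumption about what "one row per file" should mean. It does change the text, so it may not suit every corpus.

```python
import csv
import io
from pathlib import Path


def texts_to_tsv(texts_by_path):
    """Build TSV text with one row per file: identifier, then the text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["identifier", "col"])
    for path, text in texts_by_path.items():
        identifier = Path(path).stem      # file name without ".txt"
        one_line = " ".join(text.split())  # collapse newlines and tabs
        writer.writerow([identifier, one_line])
    return out.getvalue()


print(texts_to_tsv({
    "txt1.txt": "example text\nin this file",
    "txt2.txt": "second example text\nin this file",
}))
```

The dictionary could be filled from the same glob loop as in the question, using each `file_path` as the key and the file contents as the value.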

How to create a hierarchical csv file?

I have N invoices like the following in Excel, and I want to create a CSV of that file so that it can be imported whenever needed. How can I achieve this?
Here is a screenshot:
Assuming you have a folder "excel" full of Excel files in your project directory, and another folder "csv" where you intend to put the generated CSV files, you can batch-convert all the Excel files in the "excel" directory to CSV using pandas.
This assumes you already have pandas installed. Otherwise, install it via: pip install pandas. The commented snippet below illustrates the process:
# IMPORT PANDAS AND OS
import pandas as pd
import os
# THE TWO FOLDERS DESCRIBED ABOVE
excelDir = "excel"
csvDir = "csv"
# OUR GOAL IS:
# LOOP THROUGH THE FOLDER: excelDir
# AT EACH ITERATION IN THE LOOP, CHECK IF THE CURRENT FILE IS AN EXCEL FILE,
# IF IT IS, SIMPLY CONVERT IT TO CSV AND SAVE IT:
for fileName in os.listdir(excelDir):
    # DO WE HAVE AN EXCEL FILE?
    if fileName.endswith(".xls") or fileName.endswith(".xlsx"):
        # IF WE DO, THEN WE DO THE CONVERSION USING PANDAS...
        targetXLFile = os.path.join(excelDir, fileName)
        targetCSVFile = os.path.join(csvDir, fileName) + ".csv"
        # NOW, WE READ "IN" THE EXCEL FILE
        dFrame = pd.read_excel(targetXLFile)
        # ONCE WE ARE DONE READING, WE CAN SIMPLY SAVE THE DATA TO CSV
        dFrame.to_csv(targetCSVFile)
Hope this does the Trick for you.....
Cheers and Good-Luck.
Instead of putting the total output into one CSV, you could go with the following steps:
1. Convert your Excel content to CSV files or CSV objects.
2. Tag each object with its invoice id and save it into a dictionary. Your dictionary data structure could look like {'invoice-id': csv-object, 'invoice-id2': csv-object2, ...}.
3. Write a custom function which reads your csv-object and gives you name, product-id, qty, etc.
Hope this helps.
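The dictionary idea above might look roughly like this. It is a sketch under assumptions: the column names (`invoice_id`, `name`, `product_id`, `qty`) and the sample rows are invented for illustration, since the screenshot of the real data is not available, and `group_by_invoice` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict


def group_by_invoice(csv_text, id_column="invoice_id"):
    """Map each invoice id to the list of its line-item rows (as dicts)."""
    invoices = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        invoices[row[id_column]].append(row)
    return dict(invoices)


# Made-up sample data: one row per invoice line item.
sample = (
    "invoice_id,name,product_id,qty\n"
    "INV-1,Alice,P10,2\n"
    "INV-1,Alice,P11,1\n"
    "INV-2,Bob,P10,5\n"
)

for invoice_id, items in group_by_invoice(sample).items():
    print(invoice_id, [(item["product_id"], item["qty"]) for item in items])
```

Keeping the invoice id on every row is one common way to flatten a hierarchy into CSV, because the grouping can be rebuilt on import.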

How would I transfer CSV "words" into Python as strings

I'm quite a beginner in Python. What I'm trying to do is download the CSV file for each NYSE symbol. I have every symbol in an Excel file. The Yahoo API allows you to download the CSV file by appending the symbol to the base URL.
My first instinct was to use pandas, but pandas doesn't store strings.
This is what I have so far:
import urllib
strs = ["" for x in range(3297)]
#strs makes the blank string array for each symbol
#somehow I need to be able to iterate each symbol into the blank array spots
while y < 3297:
    strs[y] = "symbol of each company from csv"
    y = y+1
#loop for downloading each file from the link with strs[y].
while i < 3297:
    N = urllib.URLopener()
    N.retrieve('http://ichart.yahoo.com/table.csv?s='strs[y],'File')
    i = i+1
Perhaps the solution is simpler than what I am doing.
From what I can see, the missing piece is how to connect your list of stock symbols to the call that reads the data in pandas. For example, 'http://ichart.yahoo.com/table.csv?s='strs[y] is not valid syntax.
Valid syntax for this is:
pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(strs[y]))
It would be helpful if you could add a few sample lines from your csv file to the question. Guessing at your structure you would do something like:
import pandas as pd
symbol_df = pd.read_csv('path_to_csv_file')
for stock_symbol in symbol_df.symbol_column_name:
    df = pd.read_csv('http://ichart.yahoo.com/table.csv?s={}'.format(stock_symbol))
    # process your dataframe here
Assuming you take that Excel file with the symbols and export it as a CSV, you can use Python's built-in csv reader to parse it:
import csv
from urllib.request import urlopen  # Python 3; on Python 2 use urllib.urlopen
base_url = 'http://ichart.yahoo.com/table.csv?s={}'
with open('symbols.csv', newline='') as symbols_file:
    for row in csv.reader(symbols_file):
        symbol = row[0]
        data_csv = urlopen(base_url.format(symbol)).read()
        # save out to file, or parse with CSV library...
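To separate the "get the symbols into Python as strings" step from the download step, something like the following could work. It is only a sketch: `read_symbols` and `build_urls` are hypothetical helper names, the symbols are assumed to sit in the first column of the exported CSV, and no network request is made here.

```python
import csv
import io

BASE_URL = "http://ichart.yahoo.com/table.csv?s={}"


def read_symbols(csv_text):
    """Return the first column of each non-empty row as a list of strings."""
    return [
        row[0].strip()
        for row in csv.reader(io.StringIO(csv_text))
        if row and row[0].strip()
    ]


def build_urls(symbols):
    """Insert each symbol into the base URL from the question."""
    return [BASE_URL.format(symbol) for symbol in symbols]


symbols = read_symbols("AAPL\nIBM\n\nGE\n")
for url in build_urls(symbols):
    print(url)
```

This replaces the pre-sized `strs` list and the manual counters: the list grows as rows are read, so the number of symbols never has to be hard-coded.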

Converting a folder of Excel files into CSV files/Merge Excel Workbooks

I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutils, and XlsxWriter?
I would like the newly converted CSV files to have the extension '_convert.csv'.
OTHERWISE...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
    fullname = os.path.abspath(f)
    xl.Workbooks.Open(fullname)
    xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx','.csv'),
                             FileFormat=win32com.client.constants.xlCSVMSDOS,
                             CreateBackup=False)
    xl.ActiveWorkbook.Close(SaveChanges=False)
Here is an attempt using my library, pyexcel:
from pyexcel import Book, BookWriter
import glob
import os
for f in glob.glob("your_directory/*.xlsx"):
    fullname = os.path.abspath(f)
    converted_filename = fullname.replace(".xlsx", "_converted.csv")
    book = Book(f)
    converted_csvs = BookWriter(converted_filename)
    converted_csvs.write_book_reader(book)
    converted_csvs.close()
If you have an xlsx file with more than 2 sheets, I imagine you will get more than 2 csv files. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script saves all converted csv files in the same directory as the xlsx files.
In addition, if you want to merge everything into one file, that is very easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Look at OpenOffice's Python library; I suspect OpenOffice supports MS document files.
Python has no native support for Excel files.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
You can use this function to read the data from each file:
import xlrd
def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
    Data = []
    book = xlrd.open_workbook(Filename)
    sheet = book.sheets()[sheetnum]
    rowcount = 0
    while rowcount < sheet.nrows:
        row = sheet.row_values(rowcount)
        if len(row)>=min_row_len: Data.append(row)
        rowcount+=1
    if get_datemode: return Data, book.datemode
    else: return Data
and this function to write the data after you combine the lists together:
import csv
def writeCSVFile(filename, data, headers=None):
    # Put the header row first when one is supplied
    if headers:
        data = [headers] + list(data)
    # newline="" lets the csv module control line endings (Python 3)
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(data)
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
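As an illustration of that re-formatting step, here is a small sketch in plain Python. The helper names are hypothetical. `serial_to_datetime` assumes the workbook uses Excel's 1900 date system (datemode 0) and that the serial falls on or after 1 March 1900. A bare float does not say whether a cell is a date, so the caller has to know which columns hold dates. xlrd also ships its own converter, `xlrd.xldate_as_tuple(value, datemode)`, which takes the datemode into account.

```python
from datetime import datetime, timedelta

# Day zero for the 1900 date system; valid for serials from 61 upward.
EXCEL_EPOCH_1900 = datetime(1899, 12, 30)


def normalize_number(value):
    """Turn floats like 3.0 into the int 3; leave everything else alone."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def serial_to_datetime(serial):
    """Convert an Excel serial date (1900 system) to a datetime."""
    return EXCEL_EPOCH_1900 + timedelta(days=serial)


row = ["widget", 3.0, 2.5, 43831.0]
cleaned = [normalize_number(cell) for cell in row[:3]]
cleaned.append(serial_to_datetime(row[3]).date().isoformat())
print(cleaned)
```

A cleanup like this could be applied to each row returned by the reading function before the rows are passed to the CSV writer.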
Edited to add code calling the above functions:
import glob
filelist = glob.glob("*.xls*")
alldata = []
headers = []
for filename in filelist:
    data = getXLData(filename)
    headers = data.pop(0) # omit this line if files do not have a header row
    alldata.extend(data)
writeCSVFile("Output.csv", alldata, headers)
