I'm using psycopg2 to copy csvs into postgres. all my csvs will always have the same suffix, such as competitors or customers, along with a unique number each time i generate them. therefore I want a python script that can run to use COPY and I can just have 'dir/customers*.csv'
is there an easy way to do this, does asterisk exist as some other character in python?
Thanks in advance!
Ok so what I currently have is a geoprocessing model in ArcGIS that will generate a bunch of CSVs into C:\Data\Sheltered BLPUs\CSVs such as competitors and customers, but attaches a unique ID for each location onto the end of the filename e.g. customers17.csv. I then want a python script to copy all of these into postgres. All the tables exist as 'customers' or 'competitors' - this is what I wanted to write:
cur.execute("COPY competitors FROM 'C:/Data/Sheltered BLPUs/CSVs/competitors*.csv' DELIMITER ',' CSV;")
import glob
print glob.glob('dir/customers*.csv')
Related
I am new in the python world and I try to build a solution I struggle to develop. The goal is to check that some mandatory information (it will be keywords) are present in a pdf. I have an Excel file where each row correspond to a transaction, and I need to check that all the transaction (and the mandatory information related to them) are in the a corresponding PDF sent during the day.
So, on one side, I have several Excel row in a sheet with the mandatory information (corresponding to info on each transaction), and on the other side, I have a folder with several PDF.
I try to extract data of each pdf to allow the workflow to check if the information for each row in my Excel file are in a single pdf. I check some question raised here and tried to apply some solution to my problem, but I haven't managed to obtain a full working solution.
I have been able to build the partial code that will extract the pdf data and look for the keywords:
Import os
from glob import glob
import re
from PyPDF2 import PdfFileReader
def search_page(pattern, page):
yield from pattern.findall(page.extractText())
def search_document(pattern, path):
document = PdfFileReader(path)
for page in document.pages:
yield from search_page(pattern, page)
searchWords = ['my list of keywords in each row of my Excel file']
pattern = re.compiler(r'\b(?:%s)\b' % '|'.join(searchWords))
for path in glob('path of my folder with all the pdf files'):
matches = search_document(pattern, path)
#inspired by a solution on stackoverflow used to count the occurences of keywords
Also, I think that using panda to build the list of keyword should work, but I can't use it in me previous code, the search tool want a string, not a list.
import pandas as pd
df=pd.read_excel('path of my Excel file', sheet_name=0, usecols='G,L,R,S,Z')
print(df) #I wanted to check that the code was selecting the right colomn only, as some other colomn have unnecessary information
I don't know how to do a searchwords list for each row of my Excel file and put it in the first part of the code. Also, I don't know how to ask to search for ALL the keywords of the list (row in excel), as it is mandatory to have all the information of a transaction in the same pdf. And when it finds all the info, return "ok row 1" or something like that and do the check for the second row, etc. (and put error if it doesn't find all the information).
P.S.: Originally, I wanted only to extract the data with a python code and add it in an Alteryx Workflow, but the python tool of alteryx doesn't accept some Package in my company.
I would be very thankfull for any help!
I'm a doctor trying to learn some code for work, and was hoping you could help me solve a problem I have with regards to importing multiple images into python.
I am working in Jupyter Notebook, where I have created a dataframe (named df_1) using pandas. In this dataframe each row represents a patient, and the first column shows the case number for each patient (e.g. 85).
Now, what I want to do is import multiple images (.bmp) from a given folder(same location as the .ipynb file). There are many images in this folder, and I do not want all of them - only the ones who have filenames corresponding to the "case_number" column in my dataframe (e.g. 85.bmp).
I already read this post, but I must admit it was way to complicated for me to understand.
Is there some simple loop (or something else) I could create to import all images with filenames corresponding to the values of the "case number" column in the dataframe?
I was imagining something like the below would be possible, I just do not know how to write it.
for i=[(df_1['case_number'()]
cv2.imread('[i].bmp')
The images don't really need to be implemented in the dataframe, but I would like to be able to view them in my notebook by using e.g. plt.imshow(85) afterwards.
Here is an image of the head of my dataframe
Thank you for helping!
You can access all of your files using this:
imageList = []
for i in range(0, len(df_1)):
cv2.imread('./' + str(df_1['case_number'][i]) + '.bmp')
imageList.append('./' + str(df_1['case_number'][i]) + '.bmp')
plt.imshow(imagelist[x])
This is looping through every item in the case_number column, the ./ shows that your file is within the current directory, using the directory path leading up to your current file. And by making everything a string and joining it you make it so that the file path is readable. The path created by joining the strings should look something like ./85.bmp, which should open your desired file. Also, you are appending the filenames to the list so that they can be accessed by the plt.imshow()
If you would like to access the files based on their name, you can use another variable (which could be set as an input) and implement the code below
fileName = input('Enter Your Value: ')
inputFile = imageList.index('./' + fileName + '.bmp')
and from here, you could use the same plt.imshow(imagelist[x]), but replace the x with the inputFile variable.
I have two files on HDFS and I just want to join these two files on a column say employee id.
I am trying to simply print the files to make sure we are reading that correctly from HDFS.
lines = sc.textFile("hdfs://ip:8020/emp.txt")
print lines.count()
I have tried foreach and println functions as well and I am not able to display file data.
I am working in python and totally new to both python and spark as well.
This is really easy just do a collect
You must be sure that all the data fits the memory on your master
my_rdd = sc.parallelize(xrange(10000000))
print my_rdd.collect()
If that is not the case You must just take a sample by using take method.
# I use an exagerated number to remind you it is very large and won't fit the memory in your master so collect wouldn't work
my_rdd = sc.parallelize(xrange(100000000000000000))
print my_rdd.take(100)
Another example using .ipynb:
I'm trying to populate a SQLite database using Django with data from a file that consists of 6 million records. However the code that I've written is giving me a lot of time issues even with 50000 records.
This is the code with which I'm trying to populate the database:
import os
def populate():
with open("filename") as f:
for line in f:
col = line.strip().split("|")
duns=col[1]
name=col[8]
job=col[12]
dun_add = add_c_duns(duns)
add_contact(c_duns = dun_add, fn=name, job=job)
def add_contact(c_duns, fn, job):
c = Contact.objects.get_or_create(duns=c_duns, fullName=fn, title=job)
return c
def add_c_duns(duns):
cd = Contact_DUNS.objects.get_or_create(duns=duns)[0]
return cd
if __name__ == '__main__':
print "Populating Contact db...."
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
from web.models import Contact, Contact_DUNS
populate()
print "Done!!"
The code works fine since I have tested this with dummy records, and it gives the desired results. I would like to know if there is a way using which I can lower the execution time of this code. Thanks.
I don't have enough reputation to comment, but here's a speculative answer.
Basically the only way to do this through django's ORM is to use bulk_create . So the first thing to consider is the use of get_or_create. If your database has existing records that might have duplicates in the input file, then your only choice is writing the SQL yourself. If you use it to avoid duplicates inside the input file, then preprocess it to remove duplicate rows.
So if you can live without the get part of get_or_create, then you can follow this strategy:
Go through each row of the input file and instantiate a Contact_DUNS instance for each entry (don't actually create the rows, just write Contact_DUNS(duns=duns) ) and save all instances to an array. Pass the array to bulk_create to actually create the rows.
Generate a list of DUNS-id pairs with value_list and convert them to a dict with the DUNS number as the key and the row id as the value.
Repeat step 1 but with Contact instances. Before creating each instance use the DUNS number to get the Contact_DUNS id from the dictionary of step 2. The instantiate each Contact in the following way: Contact(duns_id=c_duns_id, fullName=fn, title=job). Again, after collecting the Contact instances just pass them to bulk_create to create the rows.
This should radically improve performance as you'll be no longer executing a query for each input line. But as I said above, this can only work if you can be certain that there are no duplicates in the database or the input file.
EDIT Here's the code:
import os
def populate_duns():
# Will only work if there are no DUNS duplicates
# (both in the DB and within the file)
duns_instances = []
with open("filename") as f:
for line in f:
duns = line.strip().split("|")[1]
duns_instances.append(Contact_DUNS(duns=duns))
# Run a single INSERT query for all DUNS instances
# (actually it will be run in batches run but it's still quite fast)
Contact_DUNS.objects.bulk_create(duns_instances)
def get_duns_dict():
# This is basically a SELECT query for these two fields
duns_id_pairs = Contact_DUNS.objects.values_list('duns', 'id')
return dict(duns_id_pairs)
def populate_contacts():
# Repeat the same process for Contacts
contact_instances = []
duns_dict = get_duns_dict()
with open("filename") as f:
for line in f:
col = line.strip().split("|")
duns = col[1]
name = col[8]
job = col[12]
ci = Contact(duns_id=duns_dict[duns],
fullName=name,
title=job)
contact_instances.append(ci)
# Again, run only a single INSERT query
Contact.objects.bulk_create(contact_instances)
if __name__ == '__main__':
print "Populating Contact db...."
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")
from web.models import Contact, Contact_DUNS
populate_duns()
populate_contacts()
print "Done!!"
CSV Import
First of all 6 million records is a quite a lot for sqllite and worse still sqlite isn't very good and importing CSV data directly.
There is no standard as to what a CSV file should look like, and the
SQLite shell does not even attempt to handle all the intricacies of
interpreting a CSV file. If you need to import a complex CSV file and
the SQLite shell doesn't handle it, you may want to try a different
front end, such as SQLite Database Browser.
On the other hand Mysql and Postgresql are more capable of handling CSV data and mysql's LOAD DATA IN FILE and Postgresql COPY are both painless ways to import very large amounts of data in a very short period of time.
Suitability of Sqlite.
You are using django => you are building a web app => more than one user will access the database. This is from the manual about concurrency.
SQLite supports an unlimited number of simultaneous readers, but it
will only allow one writer at any instant in time. For many
situations, this is not a problem. Writer queue up. Each application
does its database work quickly and moves on, and no lock lasts for
more than a few dozen milliseconds. But there are some applications
that require more concurrency, and those applications may need to seek
a different solution.
Even your read operations are likely to be rather slow because an sqlite database is just one single file. So with this amount of data there will be a lot of seek operations involved. The data cannot be spread across multiple files or even disks as is possible with proper client server databases.
The good news for you is that with Django you can usually switch from Sqlite to Mysql to Postgresql just by changing your settings.py. No other changes are needed. (The reverse isn't always true)
So I urge you to consider switching to mysql or postgresl before you get in too deep. It will help you solve your present problem and also help to avoid problems that you will run into sooner or later.
6,000,000 is quite a lot to import via Python. If Python is not a hard requirement, you could write a SQLite script that directly import the CSV data and create your tables using SQL statements. Even faster would be to preprocess your file using awk and output two CSV files corresponding to your two tables.
I used to import 20,000,000 records using sqlite3 CSV importer and it took only a few minutes minutes.
I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I was able to just take the .mdb file and dump all the tables into .csv. It creates huge mess since there are lots of files that need to be processed.
What I need is to extract particular sorted data from particular table. Like for example, I need the table called "Voltage". The table consists of numerous cycles and each cycle has several rows also. The cycles usually go in chronological order, but in some cases time stamp get recorded with delay. Like cycle's one first row can have later time than cycles 1 first row. I need to extract the latest row of the cycle based on time for the first or last five cycles. For example, in table below, I will need the second row.
Cycle# Time Data
1 100.59 34
1 101.34 54
1 98.78 45
2
2
2 ...........
Here is the script I use. I am using the command python extract.py table_files.mdb. But I would like the script to just be invoked with ./extract.py. The path to filenames should be in the script itself.
import sys, subprocess, os
DATABASE = sys.argv[1]
subprocess.call(["mdb-schema", DATABASE, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()
# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
if table != '':
filename = table.replace(" ","_") + ".csv"
file = open(filename, 'w')
print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", DATABASE, table],
stdout=subprocess.PIPE).communicate()[0]
file.write(contents)
file.close()
Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.
A few suggestions:
Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.
When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.
At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:
SELECT
v1.SourceFile,
v1.Cycle,
v1.Time,
v1.Data
FROM
Voltage v1
INNER JOIN
(
SELECT
SourceFile,
Cycle,
MAX([Time]) AS MaxTime
FROM Voltage
GROUP BY SourceFile, Cycle
) v2
ON v1.SourceFile=v2.SourceFile
AND v1.Cycle=v2.Cycle
AND v1.Time=v2.MaxTime
To bring it directly to Pandas in python3 I wrote this little snippet
import sys, subprocess, os
from io import StringIO
import pandas as pd
VERBOSE = True
def mdb_to_pandas(database_path):
subprocess.call(["mdb-schema", database_path, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
sys.stdout.flush()
# Dump each table as a stringio using "mdb-export",
out_tables = {}
for rtable in tables:
table = rtable.decode()
if VERBOSE: print('running table:',table)
if table != '':
if VERBOSE: print("Dumping " + table)
contents = subprocess.Popen(["mdb-export", database_path, table],
stdout=subprocess.PIPE).communicate()[0]
temp_io = StringIO(contents.decode())
print(table, temp_io)
out_tables[table] = pd.read_csv(temp_io)
return out_tables
There's an alternative to mdbtools for Python: JayDeBeApi with the UcanAccess driver. It uses a Python -> Java bridge which slows things down, but I've been using it with considerable success and comes with decent error handling.
It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.