I have 3000 CSV files stored on my hard drive, each containing thousands of rows and 10 columns. Rows correspond to dates, and both the number of rows and the exact dates differ across spreadsheets. The columns are the same in number (10) and label for all spreadsheets. For each date from the earliest date across all spreadsheets to the latest date across all spreadsheets, I need to (i) access the columns in each spreadsheet for which data for that date exists, (ii) run some calculations, and (iii) store the results (a set of 3 or 4 scalar values) for that date. To clarify, the results should be a variable in my workspace that stores the results for each date across all CSVs.
Is there a way to load this data using Python that is both time and memory efficient? I tried creating a Pandas data frame for each CSV, but loading all the data into RAM takes almost ten minutes and almost completely fills up my RAM. Is it possible to check if the date exists in a given CSV, and if so, load the columns corresponding to that CSV into a single data frame? This way, I could load just the rows that I need from each CSV to do my calculations.
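A minimal sketch of the per-file check described above, assuming every CSV has a date column literally named date (the column name and folder path are placeholders):
import glob
import pandas as pd

target_date = pd.Timestamp("2020-01-15")   # hypothetical date of interest
pieces = []
for path in glob.glob("data/*.csv"):       # placeholder folder
    # Read only the date column first to check whether the date is present
    dates = pd.read_csv(path, usecols=["date"], parse_dates=["date"])["date"]
    if (dates == target_date).any():
        frame = pd.read_csv(path, parse_dates=["date"])
        pieces.append(frame[frame["date"] == target_date])

combined = pd.concat(pieces, ignore_index=True) if pieces else pd.DataFrame()
Reading just one column per file keeps memory low, at the cost of opening each file twice; the SQLite approach in the answer below avoids re-reading the CSVs entirely.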
Simple solution.
Download DB Browser for SQLite.
Open it and create a new database.
Then go to File > Import > Table from CSV file (do this for all of your CSV files). Alternatively, you can use a Python script with the sqlite3 library to automate creating the tables and inserting the values from your CSV files.
When you are done importing all the tables, adapt the following functions to your data.
import sqlite3
import pandas as pd

data = pd.read_csv("my_CSV_file.csv")  # your CSV data path
connection_str = 'database.db'

def create_database():  # create the database and table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS my_CSV_data (id INTEGER PRIMARY KEY, name TEXT, address TEXT, mobile TEXT, phone TEXT, balance FLOAT, max_balance INTEGER)")
    con.commit()
    con.close()

def insert_into_company():  # insert the CSV rows into the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    for row in data.itertuples(index=False):  # iterate over rows, not column names
        cur.execute("INSERT INTO my_CSV_data VALUES (NULL, ?, ?, ?, ?, ?, ?)",
                    (row[0], row[1], row[2], row[3], row[4], row[5]))
    con.commit()
    con.close()

def select_company():  # read the data back
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")
    rows = cur.fetchall()
    con.close()
    return rows

create_database()
insert_into_company()
for j in select_company():
    print(j)
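Once everything is in one SQLite table, the per-date lookup from the original question could look roughly like this. This is only a sketch: it assumes the combined table is called all_data and has an ISO-formatted date column plus the 10 data columns, which differs from the example schema above.
import sqlite3
import pandas as pd

con = sqlite3.connect('database.db')

# Overall date range across every row that was imported
lo, hi = con.execute("SELECT MIN(date), MAX(date) FROM all_data").fetchone()

results = {}  # date -> the 3 or 4 scalar results for that date
for day in pd.date_range(lo, hi):
    key = day.strftime('%Y-%m-%d')
    chunk = pd.read_sql_query("SELECT * FROM all_data WHERE date = ?", con, params=(key,))
    if not chunk.empty:
        results[key] = (chunk['col1'].mean(), chunk['col2'].sum())  # placeholder calculations

con.close()
Only the rows for one date are ever in memory at a time, which is the main point of moving the data into SQLite first.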
Do this once and you can use it again and again. It will let you access the data in less than a second. Ask if you need any other help; I'll be happy to guide you through it.
I want to delete multiple table rows that are present in the CSV file
I'm able to make a connection to the db.
conn = psycopg2.connect(host=dbendpoint, port=portinfo, database=databasename, user=dbusername, password=password)
cur = conn.cursor()
I'm able to read the CSV file and store the contents in the dataframe
dfOutput = spark.read.format("text").option("delimiter",",").option("header", "true").option("mode","FAILFAST").option("compression","None").csv(inputFile)
I think I have to convert my dataframe rows into tuples to use them with psycopg2.
Now I'm unable to proceed forward.
I'm new to psycopg2, so any help would be appreciated.
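One possible sketch, reusing the connection from the question: collect the key values from the dataframe as tuples and pass them to psycopg2's executemany with a parameterized DELETE. The table name my_table and the id column are assumptions, not taken from the question.
import pandas as pd
import psycopg2

conn = psycopg2.connect(host=dbendpoint, port=portinfo, database=databasename,
                        user=dbusername, password=password)
cur = conn.cursor()

# Read the CSV with pandas for simplicity; with the Spark dataframe above,
# [tuple(r) for r in dfOutput.collect()] would give the same kind of list.
df = pd.read_csv(inputFile)

# Cast numpy ints to plain Python ints so psycopg2 can adapt them
keys = [(int(v),) for v in df['id']]   # hypothetical key column

cur.executemany("DELETE FROM my_table WHERE id = %s", keys)  # hypothetical table name
conn.commit()
cur.close()
conn.close()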
Is there any way to turn my Excel workbook into a MySQL database? For example, if my Excel workbook is named copybook.xls, then the MySQL database name would be copybook. I have not been able to do it. Your help would be really appreciated.
Here is an outline and explanation of the process, including links to the relevant documentation. Since some details are missing from the original question, the approach will need to be tailored to your particular needs.
The solution
There are two steps in the process:
1) Import the Excel workbook as a Pandas data frame
Here we use the standard pandas.read_excel method to get the data out of the Excel file. If there is a specific sheet we want, it can be selected with sheet_name. If the first column contains row labels, we can use it as the index with the index_col parameter.
import pandas as pd
# Let's select Characters sheet and include column labels
df = pd.read_excel("copybook.xls", sheet_name = "Characters", index_col = 0)
df now contains the following imaginary data frame, which represents the data in the original Excel file:
first last
0 John Snow
1 Sansa Stark
2 Bran Stark
2) Write records stored in a DataFrame to a SQL database
Pandas has a neat method, pandas.DataFrame.to_sql, for interacting with SQL databases through the SQLAlchemy library. The original question mentioned MySQL, so here we assume we already have a running MySQL instance. To connect to the database, we use create_engine. Lastly, we write the records stored in the data frame to a SQL table called characters.
from sqlalchemy import create_engine
engine = create_engine('mysql://USERNAME:PASSWORD@localhost/copybook')
# Write records stored in a DataFrame to a SQL database
df.to_sql("characters", con = engine)
We can check if the data has been stored
engine.execute("SELECT * FROM characters").fetchall()
Out:
[(0, 'John', 'Snow'), (1, 'Sansa', 'Stark'), (2, 'Bran', 'Stark')]
or better, use pandas.read_sql_table to read back the data directly as data frame
pd.read_sql_table("characters", engine)
Out:
index first last
0 0 John Snow
1 1 Sansa Stark
2 2 Bran Stark
Learn more
No MySQL instance available?
You can test the approach by using an in-memory version of SQLite database. Just copy-paste the following code to play around:
import pandas as pd
from sqlalchemy import create_engine
# Create a new SQLite instance in memory
engine = create_engine("sqlite://")
# Create a dummy data frame for testing or read it from Excel file using pandas.read_excel
df = pd.DataFrame({'first' : ['John', 'Sansa', 'Bran'], 'last' : ['Snow', 'Stark', 'Stark']})
# Write records stored in a DataFrame to a SQL database
df.to_sql("characters", con = engine)
# Read SQL database table into a DataFrame
pd.read_sql_table('characters', engine)
I have a .CSV file that has data of the form:
2006111024_006919J.20J.919J-25.HLPP.FMGN519.XSVhV7u5SHK3H4gsep.log,2006111024,K0069192,MGN519,DN2BS460SEB0
This is how it appears in a text file; in Excel, the commas separate columns.
The .csv file can have hundreds of these rows. To make both writing and reading the code easier, I am using pandas mixed with SQLAlchemy. I am new to Python and all of these modules.
My initial method gets all the info but does one insert at a time for each row of a CSV file. My mentor says this is not the best way and that I should do a "bulk" insert: read all rows of the CSV and then insert them all at once. My method so far uses pandas df.to_sql. I hear this method has a "multi" mode for inserts. The problem is, with my limited knowledge I have no idea how to use it or how it would work with the method I have so far:
def odfsfromcsv_to_db(csvfilename_list, db_instance):
    odfsdict = db_instance['odfs_tester_history']
    for csv in csvfilename_list:  # is there a faster way to compare the list of files in archive and history?
        if csv not in archivefiles_set:
            odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WF_SCRIBE'])
            # print(odfscsv_df['ODFS_LOG_FILENAME'])
            for index, row in odfscsv_df.iterrows():
                table_row = {
                    "ODFS_LOG_FILENAME": row['ODFS_LOG_FILENAME'],
                    "ODFS_FILE_CREATE_DATETIME": row['ODFS_FILE_CREATE_DATETIME'],
                    "LOT": row['LOT'],
                    "TESTER": row['TESTER'],
                    "WF_SCRIBE": row['WF_SCRIBE'],
                    "CSV_FILENAME": csv.name
                }
                print(table_row)
                df1 = pd.DataFrame.from_dict([table_row])
                result = df1.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False)
        else:
            print(csv.name + " is in archive folder already")
How do I modify this to insert multiple records at once? I felt limited to creating a new dictionary for each row and then inserting it into the table one row at a time. Is there a way to collate the rows into one big structure and push them all at once into my DB using pandas?
You already have the pd.read_csv call; you would just need to use the following code:
odfscsv_df.to_sql('DATA', conn, if_exists='replace', index = False)
Here DATA is your table, conn is your connection, and so on. The two links below should help with any specifics of your code. I have also attached a snippet of some old code that might make things clearer; however, the two links are the better resource.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html?highlight=sql#io-sql
import sqlite3
import pandas as pd
from pandas import DataFrame
conn = None
try:
    conn = sqlite3.connect(':memory:')  # This allows the database to run in RAM, with no requirement to create a file.
    # conn = sqlite3.connect('data.db')  # You can create a file-based database by changing the name within the quotes.
except sqlite3.Error as e:
    print(e)
c = conn.cursor() # The database will be saved in the location where your 'py' file is saved IF you did not choose the :memory: option
# Create table - DATA from input.csv - this must match the values and headers of the incoming CSV file.
c.execute('''CREATE TABLE IF NOT EXISTS DATA
([generated_id] INTEGER PRIMARY KEY,
[What] text,
[Ever] text,
[Headers] text,
[And] text,
[Data] text,
[You] text,
[Have] text)''')
conn.commit()
# When reading the csv:
# - Place 'r' before the path string to read any special characters, such as '\'
# - Don't forget to put the file name at the end of the path + '.csv'
# - Before running the code, make sure that the column names in the CSV files match with the column names in the tables created and in the query below
# - If needed make sure that all the columns are in a TEXT format
read_data = pd.read_csv (r'input.csv', engine='python')
read_data.to_sql('DATA', conn, if_exists='replace', index = False) # Insert the values from the csv file into the table 'DATA'
# DO STUFF with your data
c.execute('''
DROP TABLE IF EXISTS DATA
''')
conn.close()
I found my own answer by just feeding the dataframe straight to SQL. It is a slight modification of @user13802268's answer:
odfscsv_df = pd.read_csv(csv, header=None, names=['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE'])
result = odfscsv_df.to_sql('odfs_tester_history', con=odfsdict['engine'], if_exists='append', index=False)
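As a note on the "multi" mode mentioned in the question: pandas.DataFrame.to_sql accepts method='multi' (together with chunksize) to pack many rows into each INSERT statement. A rough sketch of batching all the CSVs into a single to_sql call; csvfilename_list and odfsdict are the names from the function above, and the batch size is arbitrary:
import pandas as pd

cols = ['ODFS_LOG_FILENAME', 'ODFS_FILE_CREATE_DATETIME', 'LOT', 'TESTER', 'WAFER_SCRIBE']

frames = []
for csv in csvfilename_list:              # same file list as in the question
    df = pd.read_csv(csv, header=None, names=cols)
    df['CSV_FILENAME'] = csv.name
    frames.append(df)

if frames:
    # One big frame, one bulk insert: multi-row INSERTs, 1000 rows per statement
    all_rows = pd.concat(frames, ignore_index=True)
    all_rows.to_sql('odfs_tester_history', con=odfsdict['engine'],
                    if_exists='append', index=False, method='multi', chunksize=1000)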
Hope everyone is well and staying safe.
I'm uploading an Excel file to SQL using Python. I have three fields: CarNumber nvarchar(50), Status nvarchar(50), and DisassemblyStart date.
I'm having an issue importing the DisassemblyStart field dates. The connection and transaction using Python are successful.
However, I get zeroes all over, even though the Excel file is populated with dates. I've tried switching the column to nvarchar(50), date, and datetime to see if I could at least get a string, and got nothing. I saved the Excel file as CSV and TXT and tried uploading those and still got zeroes. I added 0.001 to every date in Excel (to add an artificial time) in case that would make it click, but nothing happened. Still zeroes. I'm sure there's a major oversight from being too deep in the weeds. I need help.
The Excel file has the three field columns.
This is the Python code:
import pandas as pd
import pyodbc
# Import CSV
data = pd.read_csv (r'PATH_TO_CSV\\\XXX.csv')
df = pd.DataFrame(data,columns= ['CarNumber','Status','DisassemblyStart'])
df = df.fillna(value=0)
# Connect to SQL Server
conn = pyodbc.connect("Driver={SQL Server};Server=SERVERNAME,PORT ;Database=DATABASENAME;Uid=USER;Pwd=PW;")
cursor = conn.cursor()
# Create Table
cursor.execute ('DROP TABLE OPS.dbo.TABLE')
cursor.execute ('CREATE TABLE OPS.dbo.TABLE (CarNumber nvarchar(50),Status nvarchar(50), DisassemblyStart date)')
# Insert READ_CSV INTO TABLE
for row in df.itertuples():
cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber,Status,DisassemblyStart) VALUES (?,?,Convert(datetime,?,23))',row.CarNumber,row.Status,row.DisassemblyStart)
conn.commit()
conn.close()
Help will be much appreciated.
Thank you and be safe,
David
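One thing that stands out in the code above is df.fillna(value=0), which turns any missing or unparsed DisassemblyStart values into literal zeroes before they reach SQL Server. A hedged sketch of one way the dates might be preserved instead, parsing them explicitly and passing NULL for blanks (it reuses data, cursor, and conn from the code above; everything else is an assumption):
import pandas as pd

df = pd.DataFrame(data, columns=['CarNumber', 'Status', 'DisassemblyStart'])

# Parse the date column explicitly instead of filling it with 0
df['DisassemblyStart'] = pd.to_datetime(df['DisassemblyStart'], errors='coerce')

for row in df.itertuples():
    # Pass a real date object (or None for NULL) so no CONVERT is needed
    start = None if pd.isna(row.DisassemblyStart) else row.DisassemblyStart.date()
    cursor.execute('INSERT INTO OPS.dbo.TABLE (CarNumber, Status, DisassemblyStart) VALUES (?,?,?)',
                   row.CarNumber, row.Status, start)
conn.commit()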
Context:
I have a table in a MySQL database with the format below. Every row is one day of stock price and volume data:
Ticker,Date/Time,Open,High,Low,Close,Volume
AAA,7/15/2010,19.581,20.347,18.429,18.698,174100
AAA,7/16/2010,19.002,19.002,17.855,17.855,109200
BBB,7/19/2010,19.002,19.002,17.777,17.777,104900
BBB,7/19/2010,19.002,19.002,17.777,17.777,104900
CCC,7/19/2010,19.002,19.002,17.777,17.777,104900
....100000 rows
This table is created by importing data from multiple *.txt files with the same columns and format. The *.txt file name is the same as the ticker name in the Ticker column, e.g. importing AAA.txt gives me the two rows of AAA data.
All of these *.txt files are generated automatically by a system that retrieves stock prices in my country. Every day, after the stock market closes, each .txt file gains one new row with that day's data.
Question: each day, how can I load the new row from each .txt file into the database? I do not want to reload all the data from the .txt files into the MySQL table every day because it takes a lot of time; I only want to load the new rows.
How should I write the code to do this update?
(1) Create/use an empty staging table, with no primary key:
create table db.temporary_stage (
    ... same columns as your original table, but no constraints or keys or an index ...
)
(2) # this should be really fast
LOAD DATA INFILE 'data.txt' INTO TABLE db.temporary_stage;
(3) Join on id, then use a hash function to eliminate all rows that haven't changed. The following can be made better, but all in all, bulk loads against databases are a lot faster when you have lots of rows, and that's mostly down to how the database moves data about internally: it can do its upkeep much more efficiently all at once than a little at a time.
UPDATE mytable
JOIN temporary_stage ON mytable.id = temporary_stage.id
SET
    mytable.... = temporary_stage....,
    mytable.precomputed_hash = hash(concat( .... ))
WHERE mytable.precomputed_hash != hash(concat( .... ));
# clean up
DELETE FROM temporary_stage;
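Since the question only needs the brand-new daily rows appended (rather than changed rows updated), a lighter Python sketch may also fit: read each .txt with pandas, keep only the rows dated after what is already stored for that ticker, and append those. The table name prices, the connection string, and the folder path are assumptions; the files are assumed to carry the header line shown in the question.
import glob
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql://USERNAME:PASSWORD@localhost/stocks')   # placeholder credentials

# Latest stored date per ticker, fetched once
latest = pd.read_sql_query(
    "SELECT Ticker, MAX(`Date/Time`) AS latest FROM prices GROUP BY Ticker",
    engine).set_index('Ticker')['latest']

for path in glob.glob('data/*.txt'):                                   # placeholder folder
    df = pd.read_csv(path, parse_dates=['Date/Time'])                  # assumes a header line in each file
    ticker = df['Ticker'].iloc[0]
    last_seen = latest.get(ticker)                                     # None if the ticker is new
    new_rows = df if last_seen is None or pd.isna(last_seen) else df[df['Date/Time'] > last_seen]
    if not new_rows.empty:
        new_rows.to_sql('prices', con=engine, if_exists='append', index=False)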