Rewriting CSV file with particular rows omitted - Python 3

G'day,
I posted this question, and had some excellent responses from @abarnert. I'm trying to remove particular rows from a CSV file. I've learned that CSV files won't allow particular rows to be deleted, so I'm going to rewrite the CSV whilst omitting the particular rows, then rename the new file as the old.
As per the above question in the link, I have tools being taken and returned from a toolbox. The CSV file I'm trying to rewrite is an ongoing 'log' of the tools currently checked out from the toolbox. Therefore, when a tool is returned, I need that tool to be removed from the log CSV file.
Here's what I have so far:
import csv
import os

absent_past = frozenset(tuple(row) for row in csv.reader(open('Absent_Past.csv', 'r')))
absent_current = frozenset(tuple(row) for row in csv.reader(open('Absent_Current.csv', 'r')))
tools_returned = [",".join(row) for row in absent_past - absent_current]

with open('Log.csv') as f:
    check = csv.reader(f)
    for row in check:
        if row[1] not in tools_returned:
            csv.writer(open('Log_Current.csv', 'a+')).writerow(row)

os.remove('Log.csv')
os.rename('Log_Current.csv', 'Log.csv')
As you can (hopefully) see from above, it will open the Log.csv file, and if a tool has been returned (i.e. the tool is listed in a row of tools_returned), it will not rewrite that entry into the new file. When all the non-returned tools have been written to the new file, the old file is deleted, and the new file is renamed from Log_Current.csv to Log.csv.
It's worth mentioning that the tools which have been taken are appended to Log_Current.csv before it is renamed. This part of the code works nicely :)
I've been instructed to avoid using CSV for this system, which I agree with. I would like to explore CSV handling in Python as much as I can at this point, however, as I know it will come in handy in the future. I will be looking to use the contextlib and shelve modules in the future.
Thanks!
EDIT: In the code above, I have if row[1]..., which I'm hoping means it will only check the value of the first column in the row? Basically, a row will consist of something like Hammer, James, Taken, 09:15:25, but I only want to search the Log.csv file for Hammer, as tools_returned consists only of tool names, i.e. Hammer, Drill, Saw etc. Is the row[1] approach correct for this?
At the moment, Log_Current.csv is being written with every row of Log.csv, regardless of whether the tool has been replaced or not. As such, I'm thinking that the if row[1] part of the code isn't working.

I figured I'd answer my own question, as I've now figured it out. The code posted above is correct, except for one minor error. When referring to the column number in a row, the first column is column 0, not column 1. As I was searching column '1' for the tool name, it was never going to work, as column '1' is actually the second column, which holds the name of the user.
Changing that line to if row[0] etc. rewrites a new file with the current list of tools that are checked out, and omits any tools that have been replaced, as desired!
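For completeness, here is a minimal working sketch with that fix applied (opening the output file once, instead of once per row, is a tidiness tweak on my part, not part of the original):
import csv
import os

absent_past = frozenset(tuple(row) for row in csv.reader(open('Absent_Past.csv')))
absent_current = frozenset(tuple(row) for row in csv.reader(open('Absent_Current.csv')))
tools_returned = [",".join(row) for row in absent_past - absent_current]

with open('Log.csv') as f, open('Log_Current.csv', 'a+', newline='') as out:
    writer = csv.writer(out)
    for row in csv.reader(f):
        if row[0] not in tools_returned:  # column 0 holds the tool name
            writer.writerow(row)

os.remove('Log.csv')
os.rename('Log_Current.csv', 'Log.csv')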

Related

How to get rid of rows with pandas in a CSV where the value of cells in a specific column is under 100 Billion?

I'm trying to filter through a CSV and make a new CSV which is exactly the same, except it gets rid of any rows that have a value greater than 100 billion in the 'marketcap' column.
The code I've written so far just spits out the same CSV as the original, and doesn't cut any rows from the old CSV when writing the new one.
Code:
db = pd.read_csv('SF1_original.csv')
db = db[db['marketcap']<= 100000000000]
db.to_csv('new_SF1_original.csv')
Example of old CSV (it's long, don't look through the whole thing, it's just to give you an idea):
ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000,7107000000,,4982000000,2125000000,,10.219,-30000000,1368000000,1368000000,1160000000,131000000,2.41,0.584,665000000,111000000,554000000,665000000,281000000,96000000,0,0.0,0.0,202000000,298000000,0.133,298000000,202000000,202000000,0.3,0.3,0.3,4486000000,,4486000000,50960600000,,,354000000,0.806,1.0,1086000000,0.484,0,0,4337000000,,1567000000,42000000,42000000,0,2621000000,2067000000,554000000,51663600000,1368000000,-160000000,2068000000,111000000,0,1192000000,-208000000,-42000000,384000000,0,131000000,131000000,131000000,0,0,0.058,915000000,171000000,635000000,0.0,11.517,,,1408000000,0,114.3,,,1445000000,131000000,2246000000,2246000000,290000000,,,,,0,625000000,1.0,452000000,439000000,440000000,5.116,7107000000,0,71000000,113000000,16.189,2915000000
Example of new CSV (exactly the same, when this line should have been cut):
,ticker,dimension,calendardate,datekey,reportperiod,lastupdated,accoci,assets,assetsavg,assetsc,assetsnc,assetturnover,bvps,capex,cashneq,cashnequsd,cor,consolinc,currentratio,de,debt,debtc,debtnc,debtusd,deferredrev,depamor,deposits,divyield,dps,ebit,ebitda,ebitdamargin,ebitdausd,ebitusd,ebt,eps,epsdil,epsusd,equity,equityavg,equityusd,ev,evebit,evebitda,fcf,fcfps,fxusd,gp,grossmargin,intangibles,intexp,invcap,invcapavg,inventory,investments,investmentsc,investmentsnc,liabilities,liabilitiesc,liabilitiesnc,marketcap,ncf,ncfbus,ncfcommon,ncfdebt,ncfdiv,ncff,ncfi,ncfinv,ncfo,ncfx,netinc,netinccmn,netinccmnusd,netincdis,netincnci,netmargin,opex,opinc,payables,payoutratio,pb,pe,pe1,ppnenet,prefdivis,price,ps,ps1,receivables,retearn,revenue,revenueusd,rnd,roa,roe,roic,ros,sbcomp,sgna,sharefactor,sharesbas,shareswa,shareswadil,sps,tangibles,taxassets,taxexp,taxliabilities,tbvps,workingcapital
0,A,ARQ,1999-12-31,2000-03-15,2000-01-31,2020-09-01,53000000.0,7107000000.0,,4982000000.0,2125000000.0,,10.219,-30000000.0,1368000000.0,1368000000.0,1160000000.0,131000000.0,2.41,0.584,665000000.0,111000000.0,554000000.0,665000000.0,281000000.0,96000000.0,0.0,0.0,0.0,202000000.0,298000000.0,0.133,298000000.0,202000000.0,202000000.0,0.3,0.3,0.3,4486000000.0,,4486000000.0,50960600000.0,,,354000000.0,0.8059999999999999,1.0,1086000000.0,0.484,0.0,0.0,4337000000.0,,1567000000.0,42000000.0,42000000.0,0.0,2621000000.0,2067000000.0,554000000.0,51663600000.0,1368000000.0,-160000000.0,2068000000.0,111000000.0,0.0,1192000000.0,-208000000.0,-42000000.0,384000000.0,0.0,131000000.0,131000000.0,131000000.0,0.0,0.0,0.057999999999999996,915000000.0,171000000.0,635000000.0,0.0,11.517000000000001,,,1408000000.0,0.0,114.3,,,1445000000.0,131000000.0,2246000000.0,2246000000.0,290000000.0,,,,,0.0,625000000.0,1.0,452000000.0,439000000.0,440000000.0,5.1160000000000005,7107000000.0,0.0,71000000.0,113000000.0,16.189,2915000000.0
I've seen two questions somewhat related to this on StackOverflow, but they haven't helped me much. This one uses the csv library instead of pandas (which is an option for me). This one is more helpful since it uses pandas, but it hasn't had any interaction and isn't exactly the same as my use case.
You can get the indexes of the rows with "marketcap" over 100 billion like so:
df.loc[df["marketcap"] > 100000000000]["marketcap"].index
All that's left to do is drop them from the DataFrame:
df.drop(df.loc[df["marketcap"] > 100000000000]["marketcap"].index, inplace=True)
Reading from CSV and writing to the CSV is already correctly taken care of in your code.
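Put together with the file names from the question, a minimal sketch (passing index=False to to_csv is an extra suggestion; it avoids the unnamed index column visible in the new-CSV example above):
import pandas as pd

df = pd.read_csv('SF1_original.csv')
df.drop(df.loc[df["marketcap"] > 100000000000].index, inplace=True)
df.to_csv('new_SF1_original.csv', index=False)  # index=False: no extra unnamed column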

Merging and cleaning up csv files in Python

I have been using pandas but am open to all suggestions. I'm not an expert at scripting and am at a complete loss. My goal is the following:
Merge multiple CSV files. Was able to do this in Pandas and have a dataframe with the merged dataset.
[Screenshot of how the merged dataset looks]
Delete the duplicated "GEO" columns after the first set. This part doesn't let me use df = df.loc[:,~df.columns.duplicated()] because they are not technically duplicated. The repeated column names end with a .1, .2, etc., which I am guessing the concat adds. Another problem is that some columns have a duplicated column name but are different datasets. I have been using the first row as the index since it's always the same coded values, but this row is unnecessary and will be deleted later in the script. This is my biggest problem right now (see the sketch below).
Delete certain columns such as the ones with the "Margins". I use ~df2.columns.str.startswith for this and have no trouble with this.
Replace spaces, ":" and ";" with underscores in the first row. I have no clue how to do this.
Insert a new column, write a '=TEXT(B1,0)' formula, fill it down the whole column (the formula would change to B2, B3, etc.), then copy the column and paste as values. I was able to attempt this in openpyxl, although I was having trouble and could not verify the final output because of Excel issues.
source = excel.Workbooks.Open(filename)
excel.Range("C1:C1337").Select()
excel.Selection.Copy()
excel.Selection.PasteSpecial(Paste=constants.xlPasteValues)
Not sure if it works, and I was wondering whether this is possible in pandas or win32com, or if I should stay with openpyxl. Thanks all!
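A rough pandas sketch of points 2, 4 and 5 above (the GEO suffix pattern, the column position for the =TEXT equivalent, and the new column name are all illustrative assumptions based on the description):
import pandas as pd

# Stand-in for the merged DataFrame from point 1
df = pd.DataFrame([[1, 2, 3, 4]], columns=['GEO', 'B: col', 'GEO.1', 'Margin x'])

# point 3 (already working for you): drop the "Margins" columns
df = df.loc[:, ~df.columns.str.startswith('Margin')]
# point 2: drop the repeated GEO columns that the concat suffixed with .1, .2, ...
df = df.loc[:, ~df.columns.str.match(r'^GEO\.\d+$')]
# point 4: replace spaces, ":" and ";" with underscores in the header row
df.columns = df.columns.str.replace(r'[ :;]', '_', regex=True)
# point 5: the equivalent of =TEXT(B1,0) pasted as values is a plain string cast
df['B_as_text'] = df.iloc[:, 1].astype(str)
print(df)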

Csv editing of files by each field

I am new to Python and my teacher has told me to create a program that can edit a CSV file field by field, depending on the value in it.
Here is a nested list to show the csv file which is split into lists by lines then by elements:
[["A","B","C","D"],["Yes",1,"05/11/2016","0"],["No","12","05/06/2017","1"],["Yes","6","08/09/2017","2"]]
What I am supposed to do is make a loop which detects the positions of the elements within each inner list, then changes the first element of each list to "No" if it is a "Yes", the 3rd element to today's date if the date stated is at least 6 months back, and the last element to a 1 if it is more than 1. How am I supposed to do this?
Below is my code:
filename="Assignment_Data1.csv"
file=open(filepath+filename,"r")
reader=csv.reader(file,delimiter=",")
from datetime import datetime
six_months = str(datetime.date.today() - datetime.timedelta(6*365/12-1))
fm_six_months=str(datetime.datetime.strptime(six_months, '%Y-%m-%d').strftime('%d/%m/%Y'))
td=datetime.now()
deDate = str(td)[8:10] + "/"+ str(td)[5:7] + "/"+ str(td)[0:4]
import csv
for row in reader:
    for field in row:
        if row[2]<=fm_six_months or row[4]>50 or row[2]<10:
            row[3]=deDate
            row[4]=0
            row[2]=100
Basically, what I am trying to do is replace the fields that match the above stated conditions with what I want, through a loop. Is it possible?
You're on the right track, but your code has a couple issues:
1) Import statements.
Your import statements should be all at the top of your program. Currently, you use csv.reader in line 3, but haven't imported csv yet.
The way you're importing the datetime module is inconsistent with the rest of the code. This is somewhat confusing, since the datetime module also has a datetime class. Given what you want to do, it would be easiest to change the import statement to import datetime and change line 8 to td=datetime.datetime.now() (now is a function of the datetime class).
2) Iterating over field and row is redundant. The construct you have, for row in reader: for field in row, will run your if statement more times than necessary.
3) Python is zero-indexed. This means that the first element in a list is accessed using row[0], not row[1]. In your case, the fourth column of your CSV would be accessed with row[3].
4) You're combining conditions. From the phrasing of the assignment, it sounds like each of the conditions (like "change the first element to a No if it is a yes") is supposed to be independent of the other. However, if row[2]<=fm_six_months or row[4]>50 or row[2]<10 means that you'll change the data if any condition is true. It sounds like you need three separate if blocks.
5) Your code has no writer. This is really the big one. Simply saying row[2] = 100 doesn't do anything lasting, as row is just an object in memory; changing row doesn't actually change the CSV file on your computer. To actually modify the csv, you'll need to write it back out to file, using a csv.writer.
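To make that concrete, a minimal sketch putting the five fixes together (assuming the dd/mm/yyyy format from the sample data, a header row, and treating "at least 6 months back" as roughly 182 days):
import csv
import datetime

six_months_ago = datetime.date.today() - datetime.timedelta(days=182)
today = datetime.date.today().strftime('%d/%m/%Y')

with open('Assignment_Data1.csv', newline='') as f:
    rows = list(csv.reader(f))

header, body = rows[0], rows[1:]
for row in body:
    if row[0] == 'Yes':  # first element: change Yes to No
        row[0] = 'No'
    if datetime.datetime.strptime(row[2], '%d/%m/%Y').date() <= six_months_ago:
        row[2] = today  # third element: stale date becomes today's date
    if int(row[3]) > 1:  # last element: anything above 1 becomes 1
        row[3] = '1'

with open('Assignment_Data1.csv', 'w', newline='') as f:
    csv.writer(f).writerows([header] + body)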

Python - read excel data when format changes every time

I get an excel from someone and I need to read the data every month. The format is not stable each time, and by saying "not stable" I mean:
Where the data starts changes: e.g. Section A may start on row 4, column D this time, but next time it may start at row 2, column E.
Under each section there are tags. The number of tags may change as well. But every time I only need the data in tag_2 and tag_3 (these two will always show up)
The only data that I need is from tag_2 and tag_3, for each month (month1 - month8). I want to find a way, using Python, to first locate the section name, then find tag_2 and tag_3 under that section, then get the data for month1 to month8 (the number of months may change as well).
Please note that I do NOT want to locate the data by specifying locations in the Excel sheet, since the locations change every time. How do I do this?
The end product should be a pandas dataframe that has monthly data for tag_2, tag_3, with a column that says which section the data come from.
Thanks.
I think you can directly read it as a comma-separated text file. Based on what you need, you can look at tag_2 and tag_3 for each line.
with open(filename, "r") as fs:
    for line in fs:
        cell_list = line.split(",")
        # At this point you will have all elements on the line as a list
        # You can check the size and implement your logic
Assuming that the (presumably manually pasted) block of information is unlikely to end up in the very bottom-right corner of the excel sheet, you could simply iterate over rows and columns (set maximum values for each to prevent long searching times) until you find a familiar value (such as "Section A") and go from there.
Unless I misunderstood you, the rest of the format should be consistent between the months, so you can simply assume that "month_1" is always one cell up and two to the right of that initial spot.
I have not personally worked with excel sheets in python, so I cannot state whether the following is possible in python, but it definitely works in ExcelVBA:
You could just as well use the Range.Find() method to find the value "Section A" and continue with the same process as above, perhaps writing any results to a txt file and calling your Python script from there if necessary.
I hope this helps a little.
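For what it's worth, the same search-for-an-anchor idea is possible in Python with openpyxl; a rough sketch (the filename, label text, and layout offsets are all assumptions):
from openpyxl import load_workbook

wb = load_workbook('monthly_report.xlsx', data_only=True)  # hypothetical filename
ws = wb.active

# Scan a bounded area for the section label instead of hard-coding an address
anchor = None
for row in ws.iter_rows(min_row=1, max_row=100, max_col=30):
    for cell in row:
        if cell.value == 'Section A':
            anchor = cell
            break
    if anchor:
        break

# Assumed layout: tag names sit below the label, month values to the right of each tag
if anchor:
    for r in range(anchor.row + 1, anchor.row + 20):
        tag = ws.cell(row=r, column=anchor.column).value
        if tag in ('tag_2', 'tag_3'):
            months = [ws.cell(row=r, column=c).value
                      for c in range(anchor.column + 1, anchor.column + 9)]
            print(tag, months)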

Create database by loading CSV files using the header as column names (and add a column that has the filename as a name)

I have CSV files that I want to make database tables from in MySQL. I've searched all over and can't find anything on how to use the header as the column names for the table. I suppose this must be possible. In other words, when creating a new table in MySQL, do you really have to define all the columns, their names, their types, etc. in advance? It would be great if MySQL could do something like Office Access, where it converts to the corresponding type depending on how the value looks.
I know this is maybe a too broadly defined question, but any pointers in this matter would be helpful. I am learning Python too, so if it can be done through a python script that would be great too.
Thank you very much.
Using Python, you could use the csv module's DictReader, which makes it pretty easy to use the headers from the csv files as labels for the input data. It basically reads all lines in as dictionary objects with the headers as keys, so you can use the keys as the source for your column names when accessing MySQL.
A quick example that reads a csv into a list of dictionaries:
example.csv:
name,address,city,state,phone
jack,111 washington st, somewhere, NE, 888-867-5309
jill,112 washington st, somewhere else, NE, 888-867-5310
john,113 washington st, another place, NE, 888-867-5311
example.py:
import csv
data = []
with open("example.csv") as csvfile:
reader = csv.DictReader(csvfile)
for line in reader:
data.append(line)
print(data[0].keys())
print(data[0]['address'])
print(data[1]['name'])
print(data[2]['phone'])
output:
$:python example.py
dict_keys(['name', 'address', 'city', 'state', 'phone'])
111 washington st
jill
888-867-5311
More in-depth examples at: http://java.dzone.com/articles/python-101-reading-and-writing
Some info on connection to MySQL in Python: How do I connect to a MySQL Database in Python?
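For the loading side, a hedged sketch of the insert step using the mysql-connector-python package, assuming a database and a table whose columns already match the CSV headers (connection details are placeholders):
import csv
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host='localhost', user='user',
                               password='secret', database='mydb')
cur = conn.cursor()

with open('example.csv', newline='') as f:
    reader = csv.DictReader(f)
    cols = reader.fieldnames
    # Build INSERT INTO example (name, address, ...) VALUES (%s, %s, ...)
    sql = 'INSERT INTO example (%s) VALUES (%s)' % (
        ', '.join(cols), ', '.join(['%s'] * len(cols)))
    cur.executemany(sql, [tuple(row[c] for c in cols) for row in reader])

conn.commit()
conn.close()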
The csv module can easily give you the column names from the first line, and then the values from the other ones. The hard part will be to guess the correct column types. When you load a csv file into an Excel worksheet, you only have a few types: numeric, string, date.
In a database like MySQL, you can define the size of string columns, and you can give the table a primary key and possibly other indexes. You will not be able to guess that part automatically from a csv file.
At the simplest, you can treat all columns as varchar(255). It is really uncommon to have fields in a csv file that do not fit in 255 characters. If you want something more clever, you will have to scan the file twice: the first time to find the maximum size for each column, after which you could take the smallest power of 2 greater than that. The next step would be to check whether any column contains only integers or floating point values. It starts to get harder to do that automatically, because the representation of floating point values may differ depending on the locale. For example, 12.51 in an English locale would be 12,51 in a French locale. But Python can give you the locale.
The hardest thing would be any date or datetime fields, because there are many possible formats, either purely numeric (dd/mm/yyyy or mm/dd/yy) or using plain text (Monday, 29th of September).
My advice would be to define a default mode, for example all strings, or just integers and strings, and use configuration parameters or even a configuration file to fine-tune the conversion per column.
For the reading part, the csv module will give you everything you need.
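As a starting point for the all-varchar(255) default described above, a small sketch (the helper name is just for illustration):
import csv

def create_table_sql(csv_path, table_name):
    # Treat every column as varchar(255), per the simple default above
    with open(csv_path, newline='') as f:
        header = next(csv.reader(f))
    cols = ', '.join('`%s` varchar(255)' % name.strip() for name in header)
    return 'CREATE TABLE `%s` (%s);' % (table_name, cols)

print(create_table_sql('example.csv', 'example'))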
