read write csv/dataframe files - python

I am trying to remove the fifth and sixth items of each line of my CSV file. Each line is a list, but when I try to run my code I get a "DataFrame constructor not properly called!" error. Please help.
I have tried everything I can, but I can't find a simple way to remove the last two items of every list. After that, I want to add two items to every list, each a random int between different numbers.

Just change how you're reading the file.
You should use
df = pd.read_csv('database.csv')
The way you are reading it now is what causes that error.

You are reading the file incorrectly.
df = pd.DataFrame() is for creating a new dataframe, not for loading a file.
You should use
df = pd.read_csv("filename.csv")
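Once the file loads, the rest of the question (dropping the fifth and sixth items and appending two random ints) could look roughly like the sketch below. It is only an illustration: the in-memory CSV text, the column names rand_low and rand_high, and the ranges 1-10 and 50-100 are all made up, since the question does not give them.

```python
import io
import random

import pandas as pd

# Stand-in for 'database.csv'; the header names are invented for this sketch.
csv_text = "a,b,c,d,e,f\n1,2,3,4,5,6\n7,8,9,10,11,12\n"
df = pd.read_csv(io.StringIO(csv_text))

# Drop the fifth and sixth columns by position (0-based positions 4 and 5).
df = df.drop(columns=df.columns[[4, 5]])

# Append two new columns, one random int per row in each.
# The ranges are placeholders; substitute your own.
df["rand_low"] = [random.randint(1, 10) for _ in range(len(df))]
df["rand_high"] = [random.randint(50, 100) for _ in range(len(df))]

print(df)
```

Calling df.to_csv('database.csv', index=False) afterwards would write the result back without adding an extra index column.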

Related

How to get first column of all rows in a list in python

Could someone please help me with printing the first column of all rows in a CSV file? I imported the CSV file in Python, and when I make a list using the code below, it prints all columns and all rows of the CSV file. When printed in Python, the columns are separated by a ';' instead of the usual ',' in a list. I would like to assign all rows of the first column to a variable. I would really appreciate it if someone could help me.
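import pandas as pd
data = pd.read_csv('your.csv', sep=';')  # the ';' in your printout suggests a semicolon-delimited file
first_column = data.iloc[:, 0].tolist()  # first column of every row, as a list
print(first_column)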

How to fix a blank column being added at the far left when reading my excel file?

I'm working on a rather large Excel file, and as part of it I'd like to insert two columns at the far right of a new Excel file. That works, but whenever I do so an unnamed column with numbers in it appears at the far left.
This is for an Excel file. I've tried the .drop method, tried a new file, and read about the same problem with CSV files, but I cannot seem to apply any of it here, so nothing solves it.
wdf = pd.read_excel(tLoc)
sheet_wdf_map = pd.read_excel(tLoc, sheet_name=None)
wdf['Adequate'] = np.nan
wdf['Explanation'] = np.nan
wdf = wdf.drop(" ", axis=1)
I expect the output to be my original columns with only the two new columns being on the far right without the unnamed column.
Add index_col=[0] as an argument to read_excel.
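A small sketch of where that unnamed column usually comes from: the row index gets written out on save and is then read back as ordinary data. To keep it runnable without an Excel engine, it round-trips through CSV text in memory; read_excel and to_excel accept the same index_col and index arguments. The sample column names Name and Score are invented.

```python
import io

import numpy as np
import pandas as pd

original = pd.DataFrame({"Name": ["a", "b"], "Score": [1, 2]})

# Saving with the default settings also writes the row index as a
# nameless first column.
saved = original.to_csv()

# Reading it back naively turns that index into a data column.
naive = pd.read_csv(io.StringIO(saved))

# index_col=0 tells pandas the first column is the index, not data.
fixed = pd.read_csv(io.StringIO(saved), index_col=0)
fixed["Adequate"] = np.nan
fixed["Explanation"] = np.nan

# index=False on output keeps the index out of the new file entirely.
clean_output = fixed.to_csv(index=False)

print(list(naive.columns))
print(list(fixed.columns))
```

If the stray column only shows up in the file you write, passing index=False to to_excel may be the simpler fix.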

Python: How to create a new dataframe with first row when a specific value

I am reading csv files into python using:
df = pd.read_csv(r"C:\csvfile.csv")
But the file has some summary data, and the raw data starts where the value "valx" is found. If "valx" is not found, the file is useless. I would like to create new dataframes that start where "valx" is found. I have been trying for a while with no success. Any help on how to achieve this is greatly appreciated.
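Unfortunately, pandas' skiprows has to be told up front which rows to skip (a count from the top, a list of row numbers, or a function of the row number); it cannot skip until a value is found. You might want to parse the file before creating the dataframe.
As an example:
import csv
with open(r"C:\csvfile.csv", "r", newline="") as f:
    rows = list(csv.reader(f))  # read every row into memory
# index of the first row that has a cell equal to 'valx', or None if there is none
start = next((i for i, row in enumerate(rows) if 'valx' in row), None)
data = rows[start:] if start is not None else None
Using the standard library csv module, you can read the file and check whether valx is in it. If it is found, data holds the rows from the first one containing valx onward; otherwise data is None.
From there you can use the data variable to create your dataframe.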
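For the last step, here is one way it might look. This is a sketch under an assumption the question does not confirm: that the row containing valx is the header row of the raw data. The helper name frame_from_marker and the sample text are made up for illustration.

```python
import csv
import io

import pandas as pd


def frame_from_marker(handle, marker="valx"):
    """Return a DataFrame starting at the first row containing `marker`,
    or None if the marker never appears (hypothetical helper)."""
    rows = list(csv.reader(handle))
    start = next((i for i, row in enumerate(rows) if marker in row), None)
    if start is None:
        return None
    # Assumption: the marker row holds the column names.
    header, *body = rows[start:]
    return pd.DataFrame(body, columns=header)


# Two summary lines followed by the raw data.
sample = "report,2020\ntotal,3\nvalx,valy\n1,2\n3,4\n"
df = frame_from_marker(io.StringIO(sample))
print(df)
```

The csv module yields strings, so the values would still need converting (for example with pd.to_numeric) if numbers are expected.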

Reading a .xlsx and accessing cell values but not by their position

This is my first question, so sorry in advance if I make mistakes in my explanation.
I'm coding in Python 2.7.
I wrote a .xlsx (Excel) file (it could have been a .xls, I don't really need the macro + VBA at this point). The Excel file looks like this:
The values are linked to the name of the column and the name of the row. For example, I have a column named "Curve 1" and a row named "Number of extremum". So in that cell I wrote "1" if curve 1 has 1 extremum.
I want to read this value in order to manipulate it in a Python script.
I know I can use the xlrd module with open_workbook, put the values of row 1 ("Number of extremum") in a list, and then take only the first one (corresponding to the column "Curve 1" and so to the value "1" I want), but this isn't what I would like to have.
Instead, I would like to access the "1" cell value by giving the Python script only the strings "Curve 1" and "Number of extremum", so that Python finds the cell where the two meet and takes its value: "1". Is that possible?
I would like to do this because the Excel file will change over time and cells could be moved. If I access a cell value by its position number (like row 1, column 1), I will have a problem if a column or a row is added at that position. I would like not to have to edit the Python script again whenever the xlsx file is edited.
Thank you very much.
Pandas is a popular 3rd party library for reading/writing datasets. You can use pd.DataFrame.at for efficient scalar access via row and column labels:
import pandas as pd
# read file, using the first column (the row names) as the index
df = pd.read_excel('file.xlsx', index_col=0)
# extract value by row label and column label
val = df.at['N of extremum', 'Curve 1']
This is very easy using Pandas. To obtain the cell you want, you can use loc, which lets you specify the row and column by label, just as you describe.
import pandas
# index_col=0 makes the first column (the row names) the index
df = pandas.read_excel('test.xlsx', index_col=0)
df.loc['N of extremum', 'Curve 1']
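To see that label-based access keeps working when rows or columns move, here is a small sketch that builds the table in memory instead of reading a workbook. The layout is a guess at the sheet in the question: the Property column name and the second row and column are invented.

```python
import pandas as pd

# Hypothetical stand-in for the sheet; the first column holds the row names.
sheet = pd.DataFrame({
    "Property": ["N of extremum", "N of inflexion"],
    "Curve 1": [1, 0],
    "Curve 2": [2, 1],
})

# Row labels only work once the name column is the index.
df = sheet.set_index("Property")

val = df.at["N of extremum", "Curve 1"]        # fast scalar access
same_val = df.loc["N of extremum", "Curve 1"]  # loc gives the same cell

# Reverse the rows and swap the columns: the labels still find the cell.
shuffled = df.iloc[::-1][["Curve 2", "Curve 1"]]
moved_val = shuffled.at["N of extremum", "Curve 1"]

print(val, same_val, moved_val)
```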

Python CSV formatting issue when writing specific columns to output file then opening in Excel

The Problem
I have a CSV file that contains a large number of items.
The first column can contain either an IP address or random garbage. The only other column I care about is the fourth one.
I have written the below snippet of code in an attempt to check if the first column is an IP address and, if so, write that and the contents of the fourth column to another CSV file side by side.
with open('results.csv','r') as csvresults:
    filecontent = csv.reader(csvresults)
    output = open('formatted_results.csv','w')
    processedcontent = csv.writer(output)
    for row in filecontent:
        first = str(row[0])
        fourth = str(row[3])
        if re.match('\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', first) != None:
            processedcontent.writerow(["{},{}".format(first,fourth)])
        else:
            continue
    output.close()
This works to an extent. However, when viewing in Excel, both items are placed in a single cell rather than two adjacent ones. If I open it in notepad I can see that each line is wrapped in quotation marks. If these are removed Excel will display the columns properly.
Example Input
1.2.3.4,rubbish1,rubbish2,reallyimportantdata
Desired Output
1.2.3.4 reallyimportantdata - two separate columns
Actual Output
"1.2.3.4,reallyimportantdata" - single column
The Question
Is there any way to fudge the format part to not write out with quotations? Alternatively, what would be the best way to achieve what I'm trying to do?
I've tried writing out to another file and stripping the lines but, despite not throwing any errors, the result was the same...
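writerow() takes a list of elements and writes each of those into a column. Since you are feeding it a list with only one element, everything is placed into one column, and because that element contains a comma, the writer wraps it in quotation marks.
Instead, feed writerow() a list with one element per column: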
processedcontent.writerow([first,fourth])
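Put together, the whole filter might look like the sketch below. It uses in-memory text in place of results.csv and formatted_results.csv so it can be run as-is, and the function name extract_ip_rows is made up. It also uses fullmatch rather than match so that values with trailing junk after the address are rejected; that is a choice made for this sketch, not something the question asked for.

```python
import csv
import io
import re

IP_PATTERN = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")


def extract_ip_rows(source, destination):
    """Copy the first and fourth columns of rows whose first cell looks like an IP."""
    reader = csv.reader(source)
    writer = csv.writer(destination)
    for row in reader:
        # Skip short rows and rows whose first cell is not a dotted quad.
        if len(row) >= 4 and IP_PATTERN.fullmatch(row[0]):
            # One list element per output column, so no quoting is needed.
            writer.writerow([row[0], row[3]])


source = io.StringIO(
    "1.2.3.4,rubbish1,rubbish2,reallyimportantdata\n"
    "not-an-ip,a,b,c\n"
)
destination = io.StringIO()
extract_ip_rows(source, destination)
result = destination.getvalue()
print(result)
```

With real files, opening both with newline='' is what the csv module documentation recommends, and it avoids blank lines between rows on Windows.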
Have you considered using Pandas?
import re
import pandas as pd
df = pd.read_csv("myFile.csv", header=0, low_memory=False, index_col=None)
with open("outputp.csv", "w") as fid:
    for index, row in df.iterrows():
        # keep only rows whose 'IP' value looks like a dotted-quad address
        aa = re.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", str(row['IP']))
        if aa:
            tline = '{0},{1}\n'.format(row['IP'], row['fourth column'])
            fid.write(tline)
There may be an error or two, and the regex is borrowed from elsewhere.
This assumes the first row of the CSV has titles which can be referenced. If it does not, you can use header=None and reference the columns with iloc.
Come to think of it, you could probably run the regex on the DataFrame, copy the first and fourth columns to a new DataFrame, and use the to_csv method in pandas.
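That last idea could be sketched roughly as follows, assuming a file with no header row so the columns are referenced by position. The sample text stands in for the real CSV.

```python
import io

import pandas as pd

raw = (
    "1.2.3.4,rubbish1,rubbish2,reallyimportantdata\n"
    "garbage,a,b,c\n"
)

# header=None: columns are numbered 0..3; dtype=str keeps everything as text.
df = pd.read_csv(io.StringIO(raw), header=None, dtype=str)

# Boolean mask: True where the first column looks like a dotted-quad address.
is_ip = df[0].str.match(r"^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$", na=False)

# Keep the first and fourth columns of the matching rows.
selected = df.loc[is_ip, [0, 3]]

# With no path argument, to_csv returns the CSV text instead of writing a file.
output = selected.to_csv(index=False, header=False)
print(output)
```

Passing a file name to to_csv, for example selected.to_csv('formatted_results.csv', index=False, header=False), would write the two columns straight to disk.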
