CSV copy with pandas - python

I know this topic has been extensively treated, but I'm not able to get what I want, sorry about the probably newbie question. So the thing is I have a CSV like this:
Date,"Tmax","Tmin","Tmedia","Rachas","Vmax","LT","L1","L2","L3","L4"
23 nov 2018,"14.0 (15:30)","7.3 (23:59)","10.7","12 (14:50)","5 (14:50)","2.0","1.6","0.4","0.0","0.0"
I get a new CSV like that one each day, with multiple rows, but I'm only interested in the first row after the header. What I want to do is copy that first row each day to a new CSV, so that at the end of the week that CSV has seven rows. Additionally, I'd like to check whether that date is already in that daily file. The thing is that I'm not getting the new CSV right. Here's my attempt:
import pandas as pd
df = pd.read_csv('file.csv', skiprows=4, header=None)
writer=df[df.index.isin([0])].to_csv('output.csv',header=None)
The problem with this code is that it overwrites the file output.csv each time. Then I considered changing it to:
writer=df[df.index.isin([0])]
pd.read_csv('output.csv').append(writer).to_csv('output.csv',header=None)
The problem now is that it does need the file to previously exist; and even so, the information is not correctly copied to the new file. I think it must be simpler than this, but I'm stuck. Thanks for your help.

If you only want the first row after the header, read the header and just use nrows=1. Then use open in append mode to write your one-row dataframe to the end of the csv file. The header=False argument deals nicely with excluding the header when writing.
df = pd.read_csv('file.csv', nrows=1)
with open('output.csv', 'a') as fout:
    df.to_csv(fout, header=False)
I've omitted skiprows=4 because it's not clear how this relates to your input data.
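To also cover the date check and the case where output.csv does not exist yet, something along these lines might work. It is only a sketch: the helper name append_first_row is made up, it assumes the date column is called Date as in the sample, and unlike the snippet above it writes a header line when it creates the file and leaves out the index column.

```python
import os
import pandas as pd


def append_first_row(daily_csv, weekly_csv, date_col="Date"):
    """Append the first data row of daily_csv to weekly_csv.

    Returns False (and writes nothing) if that row's date is already present.
    """
    # Header plus one data row; dtype=str keeps values like "14.0 (15:30)" as text.
    row = pd.read_csv(daily_csv, nrows=1, dtype=str)
    date = row.iloc[0][date_col]

    # Treat a missing or empty output file as "nothing copied yet".
    has_rows = os.path.exists(weekly_csv) and os.path.getsize(weekly_csv) > 0
    if has_rows:
        seen = pd.read_csv(weekly_csv, usecols=[date_col], dtype=str)[date_col]
        if date in set(seen):
            return False

    # Append; write the header only when starting a new file, and skip the index.
    row.to_csv(weekly_csv, mode="a", header=not has_rows, index=False)
    return True
```

Calling append_first_row('file.csv', 'output.csv') once a day should then leave one row per date in output.csv, even if it is run twice on the same day.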

Related

Can't get python to replace row in csv (creates new row instead)

def deletedata(uniquecode):
    with open('Stallingsbestand.csv', 'r+') as CSV:
        writer = csv.writer(CSV, delimiter=';')
        for row in CSV:
            if uniquecode in row:
                writer.writerow((uniquecode, ''))
Stallingsbestand.csv consists of rows that look like this:
uniquecode;Date_of_last_opening_a_function
I want to be able to delete the date of last opening and just have the unique code there.
(appending False at the end of the row can work too but I don't know which is easier)
I thought that just overwriting the row would be the easiest but I can't get it to work. Is there anyone who knows how to make this work?
You want to rename the file to Stallingsbestand.old, and write out a new version of Stallingsbestand.csv. One way to do this is to copy (sometimes) modified rows from a csv.reader to a csv.writer within a loop, similar to your current code.
You might find it more convenient to create an in-memory dataframe with pandas.read_csv(), mutate one of its rows, and then persist it with DataFrame.to_csv().
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
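The rename-and-rewrite approach could look roughly like this with just the csv module. This is a sketch under a few assumptions: the function name clear_date is made up, the unique code is taken to be the first field of each row, and the backup is simply the same file name with an .old extension.

```python
import csv
import os


def clear_date(path, uniquecode):
    """Rewrite path so the row starting with uniquecode keeps only the code."""
    # e.g. Stallingsbestand.csv -> Stallingsbestand.old (assumed backup name)
    backup = os.path.splitext(path)[0] + ".old"
    os.replace(path, backup)

    with open(backup, newline="") as src, open(path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter=";")
        writer = csv.writer(dst, delimiter=";")
        for row in reader:
            if row and row[0] == uniquecode:
                row = [uniquecode, ""]  # blank out the date field
            writer.writerow(row)  # every other row is copied unchanged
```

Comparing row[0] instead of testing `uniquecode in row` avoids touching a row that merely contains the code in some other field.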

Problems with creating a CSV file using Excel

I have some data in an Excel file. I would like to analyze them using Python. I started by creating a CSV file using this guide.
Thus I have created a CSV (Comma delimited) file filled with the following data:
I wrote a few lines of code in Python using Spyder:
import pandas
colnames = ['GDP', 'Unemployment', 'CPI', 'HousePricing']
data = pandas.read_csv('Dane_2.csv', names = colnames)
GDP = data.GDP.tolist()
print(GDP)
The output is not what I expected:
It can be easily seen that the output differs a lot from the figures in GDP column. I will appreciate any tips or hints which will help to deal with my problem.
It seems the GDP column contains the decimal values from the first column in the .csv file together with the first digits of the second column. Either something is wrong with the .csv you created or, more probably, you need to specify the separator in the pandas.read_csv call. Also add header=None to make it explicit that the first line of the file is data, not column names (pandas already assumes this when names is passed).
Try this:
import pandas
colnames = ['GDP', 'Unemployment', 'CPI', 'HousePricing']
data = pandas.read_csv('Dane_2.csv', names = colnames, header=None, sep=';')
GDP = data.GDP.tolist()
print(GDP)
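If Excel wrote the numbers with decimal commas (an assumption, since the actual file is not shown), the decimal argument of read_csv may be needed as well. A small self-contained sketch with invented sample values:

```python
import io
import pandas

# Invented sample in the style Excel produces under some regional settings:
# semicolons between fields, commas as the decimal mark.
sample = "1,5;4,25;2,5;100,75\n2,5;4,0;2,75;101,5\n"

colnames = ['GDP', 'Unemployment', 'CPI', 'HousePricing']
data = pandas.read_csv(io.StringIO(sample), names=colnames, header=None,
                       sep=';', decimal=',')
GDP = data.GDP.tolist()
print(GDP)
```

With a real file, io.StringIO(sample) would be replaced by the path, e.g. 'Dane_2.csv'.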

Saving DataFrame to csv but output cells type becomes number instead of text

import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a python error. Excel autoformats numeric columns to remove leading 0's. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting, and import it in Excel with that column's format set to "Text" instead of "General".
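On the pandas side, here is a small sketch of what QUOTE_ALL writes and how to read such a file back without losing the zeros. The two CUSIP values are invented, and nothing here changes how Excel itself interprets the file.

```python
import csv
import io
import pandas as pd

nocheck = pd.DataFrame({'CUSIP': ['0003418', '0123456']})

# QUOTE_ALL wraps every field in double quotes; index=False drops the row index.
text = nocheck.to_csv(index=False, quoting=csv.QUOTE_ALL)
print(text)

# dtype=str keeps the identifiers as text when reading the file back.
as_text = pd.read_csv(io.StringIO(text), dtype=str)

# Without dtype, pandas infers integers even though the fields are quoted.
as_numbers = pd.read_csv(io.StringIO(text))
```

So if the file is going to be read by pandas again, dtype=str on the reading side is what preserves the leading zeros, quoted or not.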

Unable to get correct output from tsv file using pandas

I have a tsv file which I am trying to read with pandas. The first two rows of the file are of no use and need to be ignored. However, the output comes out as two columns: the first is named Index and the second is named after a random row from the csv file.
import pandas as pd
data = pd.read_csv('zahlen.csv', sep='\t', skiprows=2)
Please refer to the screenshot below.
The second column name, in bold black, is one of the rows from the file. Moreover, using '\t' as the delimiter does not separate the values into different columns. I am using the Spyder IDE for this. Am I doing something wrong here?
Try this:
data = pd.read_table('zahlen.csv', header=None, skiprows=2)
read_table() uses the same parser as read_csv() but defaults to a tab delimiter, so it suits tsv files. header=None then makes the first remaining row data instead of a header.
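A quick way to check the effect of these options without the real file is to parse a small in-memory sample. The content below is invented (two junk lines followed by tab-separated values), and the sketch uses read_csv with sep='\t'.

```python
import io
import pandas as pd

# Invented stand-in for zahlen.csv: two lines to skip, then tab-separated rows.
sample = "report generated by tool\nversion 2\n1\t2\t3\n4\t5\t6\n"

# header=None: the first row after the skipped lines is data, not column names.
data = pd.read_csv(io.StringIO(sample), sep='\t', skiprows=2, header=None)
print(data.shape)
```

If the real file still ends up in a single column with these settings, the values are probably not separated by tabs at all, and the file is worth inspecting in a plain text editor.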

Extracting columns containing a certain name

I'm trying to use it to manipulate data in large txt-files.
I have a txt-file with more than 2000 columns, and about a third of these have a title which contains the word 'Net'. I want to extract only these columns and write them to a new txt file. Any suggestion on how I can do that?
I have searched around a bit but haven't been able to find something that helps me. Apologies if similar questions have been asked and solved before.
EDIT 1: Thank you all! At the moment of writing 3 users have suggested solutions and they all work really well. I honestly didn't think people would answer so I didn't check for a day or two, and was happily surprised by this. I'm very impressed.
EDIT 2: I've added a picture that shows what a part of the original txt-file can look like, in case it will help anyone in the future:
One way of doing this, without the installation of third-party modules like numpy/pandas, is as follows. Given an input file, called "input.csv" like this:
a,b,c_net,d,e_net
0,0,1,0,1
0,0,1,0,1
(remove the blank lines in between, they are just for formatting the content in this post)
The following code does what you want.
import csv
input_filename = 'input.csv'
output_filename = 'output.csv'
# newline='' lets the csv module handle line endings itself
with open(input_filename, newline='') as fin, \
        open(output_filename, 'w', newline='') as fout:
    # Instantiate a CSV reader, check if you have the appropriate delimiter
    reader = csv.reader(fin, delimiter=',')
    # Create a CSV writer to store the columns you want to keep
    writer = csv.writer(fout, delimiter=',')
    # Get the first row (assuming this row contains the header)
    input_header = next(reader)
    # Filter out the columns that you want to keep by storing the column
    # index
    columns_to_keep = []
    for i, name in enumerate(input_header):
        if 'net' in name:
            columns_to_keep.append(i)
    # Construct the header of the output file
    output_header = []
    for column_index in columns_to_keep:
        output_header.append(input_header[column_index])
    # Write the header to the output file
    writer.writerow(output_header)
    # Iterate over the remainder of the input file, construct a row
    # with the columns you want to keep and write this row to the output file
    for row in reader:
        new_row = []
        for column_index in columns_to_keep:
            new_row.append(row[column_index])
        writer.writerow(new_row)
Note that there is no error handling. There are at least two cases that should be handled. The first is checking that the input file exists (hint: check the functionality provided by the os and os.path modules). The second is handling blank lines or lines with an inconsistent number of columns.
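One possible way to handle those two cases is sketched below. The function name keep_columns and the choice to skip (and count) malformed rows rather than abort are illustrative decisions, not the only reasonable ones.

```python
import csv
import os


def keep_columns(input_filename, output_filename, needle='net'):
    """Copy the columns whose header contains needle; return how many rows were skipped."""
    # Case 1: fail early, before creating the output, if the input is missing.
    if not os.path.isfile(input_filename):
        raise FileNotFoundError(input_filename)

    skipped = 0
    with open(input_filename, newline='') as fin, \
            open(output_filename, 'w', newline='') as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        header = next(reader, None)
        if header is None:
            return skipped  # empty input file: nothing to copy
        keep = [i for i, name in enumerate(header) if needle in name]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            # Case 2: blank lines and rows with the wrong number of columns.
            if len(row) != len(header):
                skipped += 1
                continue
            writer.writerow([row[i] for i in keep])
    return skipped
```

Returning the number of skipped rows makes it easy to notice when the input is messier than expected.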
This could be done for instance with Pandas,
import pandas as pd
df = pd.read_csv('path_to_file.txt', sep=r'\s+')
print(df.columns) # check that the columns are parsed correctly
selected_columns = [col for col in df.columns if "net" in col]
df_filtered = df[selected_columns]
df_filtered.to_csv('new_file.txt')
Of course, since we don't have the structure of your text file, you would have to adapt the arguments of read_csv to make this work in your case (see the corresponding documentation).
This will load the whole file into memory and then filter out the unnecessary columns. If your file is so large that it cannot be loaded into RAM at once, there is a way to load only specific columns with the usecols argument.
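For instance, usecols also accepts a function that is called with each column name, so the filtering can happen while parsing. A sketch on an invented whitespace-separated sample:

```python
import io
import pandas as pd

# Invented sample standing in for the real txt file.
sample = "a b c_net d e_net\n0 0 1 0 1\n0 0 1 0 1\n"

# Only columns for which the function returns True are parsed and kept.
df_filtered = pd.read_csv(io.StringIO(sample), sep=r'\s+',
                          usecols=lambda name: 'net' in name)
print(df_filtered.columns.tolist())
```

The test inside the lambda is case-sensitive, so it would need to be 'Net' in name (or a lowercased comparison) if the titles are capitalised that way.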
You can use the pandas DataFrame.filter method to select columns whose names match a regex:
data_filtered = data.filter(regex='net')
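Since the question mentions 'Net' with a capital N, a case-insensitive pattern may be safer. A tiny sketch with an invented frame:

```python
import pandas as pd

# Invented frame with differently capitalised column names.
data = pd.DataFrame({'a': [0], 'Net_b': [1], 'c_net': [2]})

# (?i) makes the regular expression case-insensitive.
data_filtered = data.filter(regex='(?i)net')
print(data_filtered.columns.tolist())
```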
