How to choose which column to write in (.csv) in Python

import csv
f = csv.reader(open('lmt.csv', 'r'))  # open input file for reading
Date, Open, Hihh, mLow, Close, Volume = zip(*f)  # split it into separate columns
ofile = open("MYFILEnew1.csv", "wb")  # output csv file
c = csv.writer(ofile)
item = Date
item2 = Volume
rows = zip(item, item)
i = 0
for row in item2:
    print row
    writer = csv.writer(ofile, delimiter='\t')
    writer.writerow([row])
ofile.close()
Above is what I have produced so far.
As you can see on the third line, I have extracted six columns from a spreadsheet.
I want to create a .csv file named MYFILEnew1.csv that has only two columns, Date and Volume.
What I have above creates a .csv file that only writes the Volume column, into the first column of the new .csv file.
How would you go about placing Date in the second column?
For example
Date Open High Low Close Volume
17-Feb-16 210 212.97 209.1 212.74 1237731
is what i have. and Id like to produce a new csv file such that it has
Date Volume
17-Feb-16 1237731

If I understand your question correctly, you can achieve this very easily using pandas' read_csv and to_csv; the final solution to your problem can be found below under EDIT2:
import pandas as pd
# this assumes that your file is comma separated
# if it is e.g. tab separated you should use pd.read_csv('data.csv', sep = '\t')
df = pd.read_csv('data.csv')
# select desired columns
df = df[['Date', 'Volume']]
# write to the file (tab separated)
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)
So, if your data.csv file looks like this:
Date,Open,Hihh,mLow,Close,Volume
1,5,9,13,17,21
2,6,10,14,18,22
3,7,11,15,19,23
4,8,12,16,20,24
Then the MYFILEnew1.csv would look like this after running the script above:
Date Volume
1 21
2 22
3 23
4 24
EDIT
Using your data (tab separated, stored in the file data3.csv):
Date Open Hihh mLow Close Volume
17-Feb-16 210 212.97 209.1 212.74 1237731
Then
import pandas as pd
df = pd.read_csv('data3.csv', sep='\t')
# select desired columns
df = df[['Date', 'Volume']]
# write to the file (tab separated)
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)
gives the desired output
Date Volume
17-Feb-16 1237731
EDIT2
Since the header in your input csv file seems to be messed up (as discussed in the comments), you have to rename the first column. The following now works fine for me using your entire dataset:
import pandas as pd
df = pd.read_csv('lmt.csv', sep=',')
# get rid of the wrongly formatted column name
df.rename(columns={df.columns[0]: 'Date' }, inplace=True)
# select desired columns
df = df[['Date', 'Volume']]
# write to the file (tab separated)
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)
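As an aside (assuming the header names in lmt.csv are clean, which EDIT2 suggests they may not be), pd.read_csv also accepts a usecols parameter, so you could load only the two needed columns in the first place:
import pandas as pd
# read only the two columns of interest directly
df = pd.read_csv('lmt.csv', usecols=['Date', 'Volume'])
df.to_csv('MYFILEnew1.csv', sep='\t', index=False)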

Here I would suggest using the csv module's csv.DictReader object to read and write from the files. To read the file, you would do something like
import csv
fieldnames = ('Date', 'Open', 'High', 'mLow', 'Close', 'Volume')
with open('myfilename.csv') as f:
    reader = csv.DictReader(f, fieldnames=fieldnames)
Beyond this, you will just need to filter out the keys you don't want from each row and similarly use the csv.DictWriter class to write to your export file.
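For example, a minimal sketch of the full round trip might look like this (the filenames are placeholders; the header row is skipped explicitly because we supply fieldnames ourselves):
import csv

fieldnames = ('Date', 'Open', 'High', 'mLow', 'Close', 'Volume')
wanted = ('Date', 'Volume')

with open('myfilename.csv', newline='') as f, \
        open('out.csv', 'w', newline='') as out:
    reader = csv.DictReader(f, fieldnames=fieldnames)
    next(reader)  # first row is the header, since fieldnames were given explicitly
    writer = csv.DictWriter(out, fieldnames=wanted)
    writer.writeheader()
    for row in reader:
        # keep only the wanted keys from each input row
        writer.writerow({k: row[k] for k in wanted})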

You were so close:
import csv
f = csv.reader(open('lmt.csv','rb')) # csv is binary
Date, Open, Hihh, mLow, Close, Volume = zip(*f)
rows = zip(Date, Volume)
ofile = open("MYFILEnew1.csv", "wb")
writer = csv.writer(ofile)
for row in rows:
    writer.writerow(row)  # row is already a tuple so no need to make it a list
ofile.close()
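Note that opening the csv files in binary mode is a Python 2 idiom; on Python 3 the same approach would use text mode with newline='' (a sketch of the equivalent):
import csv

with open('lmt.csv', newline='') as inp:
    Date, Open, Hihh, mLow, Close, Volume = zip(*csv.reader(inp))

with open('MYFILEnew1.csv', 'w', newline='') as ofile:
    writer = csv.writer(ofile)
    writer.writerows(zip(Date, Volume))  # write the two columns row by row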

Related

How to open several nc files, filter a desired period and write the result to a single txt file?

I have several nc files and I need to open them all, filter the desired period, and write the combined result to a single txt file.
The nc files each correspond to a month (Jan, Feb, Mar, ...) and have four variables (temperature, dew point, u and v).
I need to assemble a table with all the variables side by side for a specific period, for example from January to October: the first column being temperature, the second dew point, the third u and the last v.
from netCDF4 import MFDataset
import pandas as pd
import xarray as xr
import csv
import tempfile
ds=xr.open_mfdataset('/home/milena/Documentos/dados_obs_haroldo/media_horaria/MEDIA_HORARIA_*.nc')
lat = ds.variables['lat'][:]
lon = ds.variables['lon'][:]
t2mj = ds.variables['t2mj'][:]
td2mj = ds.variables['td2mj'][:]
u10mj = ds.variables['u10mj'][:]
v10mj = ds.variables['v10mj'][:]
#Brasilia
t2mj_txt=ds.t2mj.isel(lat=153, lon=117).to_dataframe().to_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/t2mj.csv')
td2mj_txt=ds.td2mj.isel(lat=153, lon=117).to_dataframe().to_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/td2mj.csv')
u10mj_txt=ds.u10mj.isel(lat=153, lon=117).to_dataframe().to_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/u10mj.csv')
v10mj_txt=ds.v10mj.isel(lat=153, lon=117).to_dataframe().to_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/v10mj.csv')
#print(t2mj_txt)
# open csv
t2mj_csv = pd.read_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/t2mj.csv', skipinitialspace=True)
td2mj_csv = pd.read_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/td2mj.csv', skipinitialspace=True)
u10mj_csv = pd.read_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/u10mj.csv', skipinitialspace=True)
v10mj_csv = pd.read_csv('/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/v10mj.csv', skipinitialspace=True)
#print(t2mj_csv)
#filter desired period
t2mj_date=t2mj_csv[(t2mj_csv['time'])<"2022-12-01"]
td2mj_date=td2mj_csv[(td2mj_csv['time'])<"2022-12-01"]
u10mj_date=u10mj_csv[(u10mj_csv['time'])<"2022-12-01"]
v10mj_date=v10mj_csv[(v10mj_csv['time'])<"2022-12-01"]
arquivo = open("/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/t2mj_filter.txt", "w")
arquivo.write(t2mj_date['t2mj'].to_string())
arquivo.close()
arquivo2 = open("/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/td2mj_filter.txt", "w")
arquivo2.write(td2mj_date['td2mj'].to_string())
arquivo2.close()
arquivo3 = open("/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/u10mj_filter.txt", "w")
arquivo3.write(u10mj_date['u10mj'].to_string())
arquivo3.close()
arquivo4 = open("/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/v10mj_filter.txt", "w")
arquivo4.write(v10mj_date['v10mj'].to_string())
arquivo4.close()
file_list=['/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/t2mj_filter.txt', '/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/td2mj_filter.txt', '/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/u10mj_filter.txt', '/home/milena/Documentos/dados_obs_haroldo/media_horaria/csv/v10mj_filter.txt']
dfe = pd.DataFrame()
for file in file_list:
    temp_dfe = pd.read_csv(file, header=None, names=[file[:-4]])
    dfe = pd.concat([dfe, temp_dfe], axis=1)
arquivo5 = open("/home/milena/Documentos/dados_obs_haroldo/media_horaria/teste.txt", "w")
arquivo5.write(dfe.to_string())
arquivo5.close()
my result looks like this:
[screenshot: the variables written one after another in a single column]
I would like it to look like this:
[screenshot: the four variables side by side, one column per variable]
The beauty of xarray is that you don't need all this extra I/O to make this work. Just use xarray's selection logic, then convert the dataset to a dataframe before writing a multi-column text file.
ds = xr.open_mfdataset(...)
# select point, select time slice, drop lat/lon variables
ds_point = ds.isel(lat=153, lon=117).sel(time=slice(None, '2022-12-01')).drop(['lat', 'lon'])
# convert to dataframe
df_point = ds_point.to_dataframe()
# order columns then write to text file with tab separator
variables = ['t2mj', 'td2mj', 'u10mj', 'v10mj']
df_point[variables].to_csv(..., sep='\t')
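Note: on recent xarray versions Dataset.drop is deprecated in favor of drop_vars, so the selection line may need .drop_vars(['lat', 'lon']) instead of .drop(['lat', 'lon']).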

How to convert .dat to .csv using Python? The data is being expressed in one column

Hi, I'm trying to convert a .dat file to a .csv file, but I have a problem with it.
My .dat file's column names look like:
region GPS name ID stop1 stop2 stopname1 stopname2 time1 time2 stopgps1 stopgps2
Its delimiter is a tab.
I want to convert the dat file to a csv file, but the data keeps coming out in one column.
I tried the following code:
import pandas as pd
with open('file.dat', 'r') as f:
    df = pd.DataFrame([l.rstrip() for l in f.read().split()])
and
import csv
with open('file.dat', 'r') as input_file:
    lines = input_file.readlines()
    newLines = []
    for line in lines:
        newLine = line.strip('\t').split()
        newLines.append(newLine)
with open('file.csv', 'w') as output_file:
    file_writer = csv.writer(output_file)
    file_writer.writerows(newLines)
But all the data is being expressed in one column.
(I want 15 columns and 80,000 rows, but I get 1 column and 1,200,000 rows.)
I want to convert this into a csv file with the original data structure.
Where is my mistake?
Please help me... It's my first time dealing with data in Python.
If you're already using pandas, you can just use pd.read_csv() with another delimiter:
df = pd.read_csv("file.dat", sep="\t")
df.to_csv("file.csv")
See also the documentation for read_csv and to_csv
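One caveat: by default to_csv also writes the dataframe's index as an extra first column; if you don't want that, pass index=False:
df.to_csv("file.csv", index=False)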

I need to split one column in a csv file into two columns using Python

Hello everyone, I am new to Python and still learning. I have a column in a csv file whose values look like this: H2020-EU.3.1.;H2020-EU.3.1.7.
I want to divide the programme column on that semicolon into two columns, for example:
program 1: H2020-EU.3.1.
program 2: H2020-EU.3.1.7.
This is what I wrote initially
import csv
import os

with open('IMI.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    with open('new_IMI.csv', 'w') as new_file:
        csv_writer = csv.writer(new_file, delimiter='\t')
        #for line in csv_reader:
        #    csv_writer.writerow(line)
Please note that after I split the columns I need to write the result out again as a csv file and save it to my computer.
Please guide me.
Using .loc to iterate through each row of a dataframe is somewhat inefficient. It is better to split an entire column at once, using expand=True to assign the result to the new columns. As stated in the other answer, pandas makes this easy:
Code:
import pandas as pd
df = pd.read_csv('IMI.csv')
df[['programme1','programme2']] = df['programme'].str.split(';', expand=True)
df.drop(['programme'], axis=1, inplace=True)
df.to_csv('IMI.csv', index=False)
Example of output:
Before:
print(df)
id acronym status programme topics
0 945358 BIGPICTURE SIGNED H2020-EU.3.1.;H2020-EU3.1.7 IMI2-2019-18-01
1 821362 EBiSC2 SIGNED H2020-EU.3.1.;H2020-EU3.1.7 IMI2-2017-13-06
2 116026 HARMONY SIGNED H202-EU.3.1. IMI2-2015-06-04
After:
print(df)
id acronym status topics programme1 programme2
0 945358 BIGPICTURE SIGNED IMI2-2019-18-01 H2020-EU.3.1. H2020-EU3.1.7
1 821362 EBiSC2 SIGNED IMI2-2017-13-06 H2020-EU.3.1. H2020-EU3.1.7
2 116026 HARMONY SIGNED IMI2-2015-06-04 H2020-EU.3.1. None
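Note the None in programme2 for the HARMONY row, whose programme value had no semicolon to split on; if you would rather have an empty string there, one option is:
df['programme2'] = df['programme2'].fillna('')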
You can use the pandas library instead of csv.
import pandas as pd
df = pd.read_csv('IMI.csv')
p1 = {}
p2 = {}
for i in range(len(df)):
    if ';' in df['programme'].loc[i]:
        p1[df['id'].loc[i]] = df['programme'].loc[i].split(';')[0]
        p2[df['id'].loc[i]] = df['programme'].loc[i].split(';')[1]
df['programme1'] = df['id'].map(p1)
df['programme2'] = df['id'].map(p2)
and if you want to delete the programme column (drop returns a new dataframe, so assign the result back):
df = df.drop('programme', axis=1)
To save the new csv file:
df.to_csv('new_file.csv', index=False)

Removing a row from a CSV with Python if data wasn't recorded in a column

I'm trying to import a batch of CSV's into PostgreSQL and constantly run into an issue with missing data:
psycopg2.DataError: missing data for column "column_name"
CONTEXT: COPY table_name, line N (wherever in the CSV the data wasn't recorded), followed by the data values up to the missing column.
At times there is no way to get a complete set of data written to a row, and I have to deal with the files as they are. I am trying to figure out a way to remove a row if data wasn't recorded in any column. Here's what I have:
file_list = glob.glob(path)
for f in file_list:
    filename = os.path.basename(f)  # get the file name
    arc_csv = arc_path + filename  # path for revised copy of CSV
    with open(f, 'r') as inp, open(arc_csv, 'wb') as out:
        writer = csv.writer(out)
        for line in csv.reader(inp):
            if "" not in line:  # if the row doesn't have any empty fields
                writer.writerow(line)
    cursor.execute("COPY table_name FROM %s WITH CSV HEADER DELIMITER ','", (arc_csv,))
You could use pandas to remove rows with missing values:
import glob, os, pandas
file_list = glob.glob(path)
for f in file_list:
    filename = os.path.basename(f)
    arc_csv = arc_path + filename
    data = pandas.read_csv(f, index_col=0)
    # build a boolean index of all rows with no missing data
    ind = data.apply(lambda x: not pandas.isnull(x.values).any(), axis=1)
    data[ind].to_csv(arc_csv)  # write the revised data to csv
However, this could get slow if you're working with large datasets.
EDIT - added index_col=0 as an argument to pandas.read_csv() to prevent the added index column issue. This uses the first column in the csv as an existing index. Replace 0 with another column's number if you have reason not to use the first column as index.
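For what it's worth, the apply call above does the same thing as pandas' built-in dropna (empty csv fields are read in as NaN), which reads more simply:
data.dropna().to_csv(arc_csv)  # drop rows containing any missing value, then write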
Unfortunately, you cannot parameterize table names or the COPY file path. Use string formatting, but make sure to validate/escape the value properly:
cursor.execute("COPY table_name FROM '{path}' WITH CSV HEADER DELIMITER ','".format(path=arc_csv))
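Alternatively, psycopg2's copy_expert streams the file through the client, which avoids interpolating the path into the SQL at all (a sketch; table_name is still assumed to be trusted):
with open(arc_csv) as f:
    cursor.copy_expert("COPY table_name FROM STDIN WITH CSV HEADER DELIMITER ','", f)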

Python: extracting data values from one file with IDs from a second file

I’m new to coding, and trying to extract a subset of data from a large file.
File_1 contains the data in two columns: ID and Values.
File_2 contains a large list of IDs, some of which may be present in File_1 while others will not be present.
If an ID from File_2 is present in File_1, I would like to extract those values and write the ID and value to a new file, but I’m not sure how to do this. Here is an example of the files:
File_1: data.csv
ID Values
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
File_2: ID.xlsx
IDs
HOT224_1_0025m_c100047_1
HOT224_1_0025m_c100061_1
HOT225_1_0025m_c100547_1
HOT225_1_0025m_c100561_1
I tried the following:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col = 0)
ID_file = pd.read_excel('ID.xlsx')
values_from_ID = data_file.loc[['ID_file']]
The following error occurs:
KeyError: "None of [['ID_file']] are in the [index]"
Not sure if I am reading in the excel file correctly.
I also do not know how to write the extracted data to a new file once I get the code to do it.
Thanks for your help.
With pandas:
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
Content of result.csv:
IDs,Values
HOT224_1_0025m_c100047_1,16.0
HOT224_1_0025m_c100061_1,1.0
In steps:
You need to read your csv with whitespace delimited:
data_file = pd.read_csv('data.csv', index_col=0, delim_whitespace=True)
it looks like this:
>>> data_file
Values
ID
HOT224_1_0025m_c100047_1 16
HOT224_1_0025m_c10004_1 3
HOT224_1_0025m_c100061_1 1
HOT224_1_0025m_c10010_2 1
HOT224_1_0025m_c10020_1 1
Now, read your Excel file, using the ids as index:
ID_file = pd.read_excel('ID.xlsx', index_col=0)
and you use its index with loc to get the matching entries from your first dataframe. Drop the missing values with dropna():
res = data_file.loc[ID_file.index].dropna()
Finally, write to the result csv:
res.to_csv('result.csv')
You can do this using a simple dictionary in Python. You can build a dictionary from File 1 and read the IDs from File 2. The IDs from File 2 can then be checked against the dictionary, and only the matching ones written to your output file. Something like this could work:
with open('data.csv', 'r') as f:
    lines = f.readlines()
# skip the CSV header
lines = lines[1:]
table = {l.split()[0]: l.split()[1] for l in lines if len(l.strip()) != 0}

with open('id.csv', 'r') as f:
    lines = f.readlines()
# skip the CSV header
lines = lines[1:]
matchedIDs = [(l.strip(), table[l.strip()]) for l in lines if l.strip() in table]
Now you will have your matched IDs and their values in a list of tuples called matchedIDs. You can write them in any format you like in a file.
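For example, to write them out as a two-column csv (a sketch; the output filename and header names are made up):
import csv

with open('matched.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['ID', 'Value'])  # hypothetical header
    writer.writerows(matchedIDs)  # each (id, value) tuple becomes one row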
I'm also new to Python programming, so the code below might not be the most efficient. The situation I assumed is that we need to find the ids that appear in both data.csv and id.csv; there might be some ids in data.csv that are not in id.csv and vice versa.
import pandas as pd
data = pd.read_csv('data.csv')
id2 = pd.read_csv('id.csv')
data.ID = data['ID']
id2.ID = id2['IDs']
d = []
for row in data.ID:
    d.append(row)
f = []
for row in id2.ID:
    f.append(row)
g = []
for i in d:
    if i in f:
        g.append(i)
data = pd.read_csv('data.csv', index_col='ID')
new_data = data.loc[g, :]
new_data.to_csv('new_data.csv')
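As an aside, the three loops amount to a membership test that pandas can do directly with isin (a sketch of the same filtering, under the same file assumptions):
import pandas as pd

data = pd.read_csv('data.csv', index_col='ID')
id2 = pd.read_csv('id.csv')
# keep only the rows whose ID appears in id.csv
new_data = data[data.index.isin(id2['IDs'])]
new_data.to_csv('new_data.csv')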
This is the code I ended up using. It worked perfectly. Thanks to everyone for their responses.
import pandas as pd
data_file = pd.read_csv('data.csv', index_col=0)
ID_file = pd.read_excel('ID.xlsx', index_col=0)
res = data_file.loc[ID_file.index].dropna()
res.to_csv('result.csv')
