Add column from one .csv to another .csv file using Python

I am currently writing a script where I am creating a csv file ('tableau_input.csv') composed both of columns from other csv files and columns created by myself. I tried the following code:
def make_tableau_file(mp, current_season=2016):
    # Produces a csv file containing predicted and actual game results for the current season
    # Tableau uses the contents of the file to produce visualization
    game_data_filename = 'game_data' + str(current_season) + '.csv'
    datetime_filename = 'datetime' + str(current_season) + '.csv'
    with open('tableau_input.csv', 'wb') as writefile:
        tableau_write = csv.writer(writefile)
        tableau_write.writerow(['Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS', 'True_Result', 'Predicted_Results', 'Confidence', 'Date'])
        with open(game_data_filename, 'rb') as readfile:
            scores = csv.reader(readfile)
            scores.next()
            for score in scores:
                tableau_content = score[1::]
                # Append True_Result
                if int(tableau_content[3]) > int(tableau_content[1]):
                    tableau_content.append(1)
                else:
                    tableau_content.append(0)
                # Append 'Predicted_Result' and 'Confidence'
                prediction_results = mp.make_predictions(tableau_content[0], tableau_content[2])
                tableau_content += list(prediction_results)
                tableau_write.writerow(tableau_content)
        with open(datetime_filename, 'rb') as readfile2:
            days = csv.reader(readfile2)
            days.next()
            for day in days:
                tableau_write.writerow(day)
'tableau_input.csv' is the file I am creating. The columns 'Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS' come from game_data_filename (e.g. tableau_content = score[1::]). The columns 'True_Result', 'Predicted_Results', 'Confidence' are created in the first for loop.
So far everything works, but when I finally tried to add data from datetime_filename to the 'Date' column, using the same structure as above, there is no data in the 'Date' column when I open 'tableau_input.csv'. Can someone solve this problem?
For info, the original question included screenshots of the csv files for game_data_filename and datetime_filename (nb: the datetime values are in datetime format).

It's hard to test this as I don't really know what the input should look like, but try something like this:
def make_tableau_file(mp, current_season=2016):
    # Produces a csv file containing predicted and actual game results for the current season
    # Tableau uses the contents of the file to produce visualization
    game_data_filename = 'game_data' + str(current_season) + '.csv'
    datetime_filename = 'datetime' + str(current_season) + '.csv'
    with open('tableau_input.csv', 'wb') as writefile:
        tableau_write = csv.writer(writefile)
        tableau_write.writerow(
            ['Visitor_Team', 'V_Team_PTS', 'Home_Team', 'H_Team_PTS', 'True_Result', 'Predicted_Results', 'Confidence', 'Date'])
        with open(game_data_filename, 'rb') as readfile, open(datetime_filename, 'rb') as readfile2:
            scoreReader = csv.reader(readfile)
            scores = [row for row in scoreReader]
            scores = scores[1::]
            daysReader = csv.reader(readfile2)
            days = [day for day in daysReader]
            days = days[1::]  # skip the header row here too, matching the days.next() in your original code
            if len(scores) != len(days):
                print("File lengths do not match")
            else:
                for i in range(len(days)):
                    tableau_content = scores[i][1::]
                    tableau_date = days[i]
                    # Append True_Result
                    if int(tableau_content[3]) > int(tableau_content[1]):
                        tableau_content.append(1)
                    else:
                        tableau_content.append(0)
                    # Append 'Predicted_Result' and 'Confidence'
                    prediction_results = mp.make_predictions(tableau_content[0], tableau_content[2])
                    tableau_content += list(prediction_results)
                    tableau_content += tableau_date
                    tableau_write.writerow(tableau_content)
This combines both of the file reading parts into one.
As per your questions below:
scoreReader = csv.reader(readfile)
scores = [row for row in scoreReader]
scores = scores[1::]
This uses a list comprehension to create a list called scores, with every element being one of the rows from scoreReader. As scoreReader is a generator, every time we ask it for a row, it spits one out for us, until there are no more.
The second line scores = scores[1::] just chops off the first element of the list, as you don't want the header.
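As a side note (not part of the original answer), a minimal sketch of skipping the header without building the full list first, using the built-in next() on the reader:
scoreReader = csv.reader(readfile)
next(scoreReader)  # discard the header row
scores = [row for row in scoreReader]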
For more info try these:
Generators on Wiki
List Comprehensions
Good luck!

Related

Python entire XML file to list and then into dataframe, missing most of the file

My final goal is to take each xml file and enter the raw format of the XML into Snowflake, and this is the result I have so far. For some reason, though, when I convert the list to a DataFrame, the DataFrame only takes a couple of items from the list for each file, and not the entire 5000 rows in the xml.
My list data is grabbing all contents from multiple files; each list item is generating a numpy array, and it's splitting up the elements from the looks of it.
dated = datetime.today().strftime('%Y-%m-%d')
source_dir = r'C:\Users\jSmith\.spyder-py3\SampleXML'
table_name = 'LV_XML'
file_list = glob.glob(source_dir + '/*.XML')
data = []
for file_path in file_list:
    data.append(
        np.genfromtxt(file_path, dtype='str', delimiter='|', encoding='utf-8'))  # delimiter used to make sure it is not splitting based on spaces, might be the issue?
df = pd.DataFrame(list(zip(data)),
                  columns=['SRC_XML'])
df['SRC_XML'] = df['SRC_XML'].astype(str)
df = df.replace(',', '', regex=True)
df["TPR_AS_OF_DT"] = dated
The data frame has the following in each column: (screenshot omitted)
Solution via Dave, with a small tweak:
for file_path in file_list:
    with open(file_path, 'r') as afile:
        content = ''
        for aline in afile:
            content += aline.replace('\n', ' ')  # changed to replace for my needs
        data.append(content)
This puts the data from each file into a single string, leaving it ready to be inserted into the Snowflake table as one string, for future queries.
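For completeness, a minimal sketch of feeding those strings into the same dataframe layout as the question's original code (the names data, dated, SRC_XML and TPR_AS_OF_DT are all taken from above):
import pandas as pd

# one row per file: the whole XML document as a single string
df = pd.DataFrame({'SRC_XML': data})
df['TPR_AS_OF_DT'] = dated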
Perhaps replace the file reading with this:
for file_path in file_list:
    with open(file_path, 'r') as afile:
        content = ''
        for aline in afile:
            content += aline.strip('\n')
        data.append(content)

Read csv file with empty lines

Analysis software I'm using outputs many groups of results in 1 csv file and separates the groups with 2 empty lines.
I would like to break the results in groups so that I can then analyse them separately.
I'm sure there is a built-in function in Python (or one of its libraries) that does this. I tried this piece of code that I found somewhere, but it doesn't seem to work.
import csv
results = open('03_12_velocity_y.csv').read().split("\n\n")
# Feed first csv.reader
first_csv = csv.reader(results[0], delimiter=',')
# Feed second csv.reader
second_csv = csv.reader(results[1], delimiter=',')
Update:
The original code actually works, but my python skills are pretty limited and I did not implement it properly.
The .split('\n\n\n') method does work, but csv.reader returns an object; to get the data into a list (or something similar), you need to iterate through all the rows and append them to the list.
I then used Pandas to remove the header and convert the values in scientific notation to float. Code is below. Thanks everyone for the help.
import csv
import pandas as pd
# Open the csv file, read it and split it when it encounters 2 empty lines (\n\n\n)
results = open('03_12_velocity_y.csv').read().split('\n\n\n')
# Create csv.reader objects that are used to iterate over rows in a csv file
# Define the output - create an empty multi-dimensional list
output1 = [[],[]]
# Iterate through the rows in the csv file and append the data to the empty list
# Feed first csv.reader
csv_reader1 = csv.reader(results[0].splitlines(), delimiter=',')
for row in csv_reader1:
    output1.append(row)
df = pd.DataFrame(output1)
# remove first 7 rows of data (the start position of the slice is always included)
df = df.iloc[7:]
# Convert all data from string to float
df = df.astype(float)
If your row counts are inconsistent across groups, you'll need a little state machine to check when you're between groups and do something with the last group.
#!/usr/bin/env python3
import csv

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as out_f:
        csv.writer(out_f).writerows(group)

with open("input.csv", newline="") as f:
    reader = csv.reader(f)
    group_i = 1
    group = []
    last_row = []
    for row in reader:
        if row == [] and last_row == [] and group != []:
            write_group(group, group_i)
            group = []
            group_i += 1
            continue
        if row == []:
            last_row = row
            continue
        group.append(row)
        last_row = row
    # flush remaining group
    if group != []:
        write_group(group, group_i)
I mocked up this sample CSV:
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3


g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3


g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
And when I run the program above I get three CSV files:
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
group_2.csv
g2r1c1,g2r1c2,g2r1c3
g2r2c1,g2r2c2,g2r2c3
group_3.csv
g3r1c1,g3r1c2,g3r1c3
g3r2c1,g3r2c2,g3r2c3
g3r3c1,g3r3c2,g3r3c3
g3r4c1,g3r4c2,g3r4c3
g3r5c1,g3r5c2,g3r5c3
If your row counts are consistent, you can do this with fairly vanilla Python or using the Pandas library.
Vanilla Python
Define your group size and the size of the break (in "rows") between groups.
Loop over all the rows adding each row to a group accumulator.
When the group accumulator reaches the pre-defined group size, do something with it, reset the accumulator, and then skip break-size rows.
Here, I'm writing each group to its own numbered file:
import csv

group_sz = 5
break_sz = 2

def write_group(group, i):
    with open(f"group_{i}.csv", "w", newline="") as f_out:
        csv.writer(f_out).writerows(group)

with open("input.csv", newline="") as f_in:
    reader = csv.reader(f_in)
    group_i = 1
    group = []
    for row in reader:
        group.append(row)
        if len(group) == group_sz:
            write_group(group, group_i)
            group_i += 1
            group = []
            for _ in range(break_sz):
                try:
                    next(reader)
                except StopIteration:  # gracefully ignore an expected StopIteration (at the end of the file)
                    break
group_1.csv
g1r1c1,g1r1c2,g1r1c3
g1r2c1,g1r2c2,g1r2c3
g1r3c1,g1r3c2,g1r3c3
g1r4c1,g1r4c2,g1r4c3
g1r5c1,g1r5c2,g1r5c3
With Pandas
I'm new to Pandas, and learning this as I go, but it looks like Pandas will automatically trim blank rows/records from a chunk of data.
With that in mind, all you need to do is specify the size of your group, and tell Pandas to read your CSV file in "iterator mode", where you can ask for a chunk (your group size) of records at a time:
import pandas as pd

group_sz = 5

with pd.read_csv("input.csv", header=None, iterator=True) as reader:
    i = 1
    while True:
        try:
            df = reader.get_chunk(group_sz)
        except StopIteration:
            break
        df.to_csv(f"group_{i}.csv")
        i += 1
Pandas adds an index column and a default header when it writes out the CSV:
group_1.csv
,0,1,2
0,g1r1c1,g1r1c2,g1r1c3
1,g1r2c1,g1r2c2,g1r2c3
2,g1r3c1,g1r3c2,g1r3c3
3,g1r4c1,g1r4c2,g1r4c3
4,g1r5c1,g1r5c2,g1r5c3
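If you don't want those, passing the standard to_csv options suppresses them, e.g.:
df.to_csv(f"group_{i}.csv", index=False, header=False)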
Try this out with your output:
import pandas as pd

# csv file name to be read in
in_csv = 'input.csv'

# get the number of lines of the csv file to be read
number_lines = sum(1 for row in open(in_csv))

# size of rows of data to write to the csv,
# you can change the row size according to your need
rowsize = 500

# start looping through data writing it to a new file for each set
for i in range(1, number_lines, rowsize):
    df = pd.read_csv(in_csv,
                     header=None,
                     nrows=rowsize,  # number of rows to read at each loop
                     skiprows=i)  # skip rows that have been read
    # csv to write data to a new file with indexed name. input_1.csv etc.
    out_csv = 'input' + str(i) + '.csv'
    df.to_csv(out_csv,
              index=False,
              header=False,
              mode='a')  # append data to csv file
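As a side note (not part of the original answer): that loop re-opens the file and re-skips rows on every iteration, so it slows down as i grows. pandas can produce the same split in a single pass with the chunksize parameter:
import pandas as pd

rowsize = 500
for i, chunk in enumerate(pd.read_csv('input.csv', header=None, chunksize=rowsize), start=1):
    chunk.to_csv('input' + str(i) + '.csv', index=False, header=False)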
I updated the question with the final details that answered my question.

How do I sort through a CSV file in python so that it only returns certain values?

I am trying to sort through a CSV file in python so that only a certain value from each entry is printed. Each line of my csv files has the date, location, weather, temperature, etc. I am trying to return the temperature column, but instead it is printing the entire csv file. This is what I currently have:
import csv

with open('2000-2009.csv', newline="") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    temp = 0
    tempList = []
    index = 0
    for Tavg in csv_reader:
        temp = int(Tavg)
        tempList.append(temp)
print(tempList)
That's because you are iterating over the entire CSV file row by row; you want to extract just one column. Index each row at the temperature column's position (or add a counter based on the number of columns and read the value when the counter hits that column).
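For example, a minimal sketch, assuming the temperature sits at a hypothetical column index TEMP_COL and the file has a header row:
import csv

TEMP_COL = 4  # hypothetical index of the temperature column

with open('2000-2009.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    next(csv_reader)  # skip the header row
    tempList = [int(row[TEMP_COL]) for row in csv_reader]

print(tempList)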

How to print nested list as new line in CSV?

Nested List:
AIStable = [['Unknown,Unknown,-,-,9127057/-,-,-,-,0.0°/0.0kn,0.0S/0.0W,-'], ['Tanker,MarshallIslands,USHOU>DOSPM,Jan3,19:00,9683984/538005270,V7CJ5,183/32m,11.4m,112.0°/12.8kn,18.26069N/75.77137W,Jan2,202006:47UTC'], ['Productisnotfound!'], ['Productisnotfound!'], ['Productisnotfound!'], ['Productisnotfound!'], ['Cargoship,Russia,VLADIVOSTOK,RUSSIA,Dec31,00:00,9015785/273213710,UBCT2,119/18m,5.6m,7.9°/0.0kn,43.09932N/131.88229E,Jan1,202009:06UTC'], ['Tanker,Singapore,YEOSU,SK,Jan1,03:00,9370991/566376000,9V3383,100/16m,4.3m,188.0°/0.1kn,34.72847N/127.81362E,Jan2,202007:41UTC'], ['Cargoship,Italy,QINGDAO,Jan1,02:00,9511454/247283400,ICAH,292/45m,18.2m,324.0°/5.4kn,27.80572N/125.07295E,Jan2,202007:20UTC']]
Each nested list within the list signifies a new line to be added in the CSV.
The desired output in the CSV file is:
AIS_Type Flag Destination ETA IMO/MMSI Callsign Length/Beam Current_Draught Course/Speed Coordinates Last_Report
Unknown Unknown - - 9127057/- - - - - 0.0°/0.0kn 0.0S/0.0W -
Tanker MarshallIslands USHOU>DOSPM Jan3 19:00 9683984/538005270 V7CJ5 183/32m 11.4m 112.0°/12.8kn 18.26069N/75.77137W Jan2 202006:47UTC
Product is not found!
I tried the following:
AISUpdated = [[''.join(','.join(i).split())] for i in AIStable]
print(AISUpdated)

filename = "vesselsUpdated.csv"
with open(filename, 'w', newline='') as f:
    writer = csv.writer(f, delimiter=",")
    headers = "AIS_Type,Flag,Destination,ETA,IMO/MMSI,Callsign,Length/Beam,Current_Draught,Course/Speed,Coordinates,Last_Report"
    f.write(headers)
    writer.writerows("\n")
    for i in AISUpdated:
        writer.writerows([i])
The output I obtain does not put each value in its own column; it squeezes each whole record into one single column under AIS_Type. Hence, I want the values separated by ',' so that each piece of data ends up under the right column.
The problem is that all your lists only have one value.
You should either make them actual lists, ['Unknown', 'Unknown', '-', '-', '9127057', ...],
or do it like this:
AISUpdated = [i[0].split(',') for i in AIStable]

with open('file.csv', 'w', newline='') as f:  # newline='' avoids blank rows on Windows
    writer = csv.writer(f)
    writer.writerow(headers.split(','))  # the headers string as defined in your code
    writer.writerows(AISUpdated)

Implement parallel processing of for loop

Looking to make the following code parallel: it reads in data in one large 9 GB proprietary format and produces 30 individual csv files based on the 30 columns of data. It currently takes 9 minutes per csv written on a 30-minute data set. The solution space of parallel libraries in Python is a bit overwhelming. Can you direct me to any good tutorials/sample code? I couldn't find anything very informative.
for i in range(0, NumColumns):
    aa = datetime.datetime.now()
    allData = [TimeStamp]
    ColumnData = allColumns[i].data  # Get the data within this one Column
    Samples = ColumnData.size  # Find the number of elements in Column data
    print('Formatting Column {0}'.format(i + 1))
    truncColumnData = []  # Initialize truncColumnData array each time for loop runs
    if ColumnScale[i + 1] == 'Scale: ' + tempScaleName:  # If it's temperature, format every value to 5 characters
        for j in range(Samples):
            truncValue = '{:.1f}'.format(ColumnData[j])
            truncColumnData.append(truncValue)  # Appends formatted value to truncColumnData array
    allData.append(truncColumnData)  # append the formatted Column data to the all data array
    zipObject = zip(*allData)
    zipList = list(zipObject)
    csvFileColumn = 'Column_' + str('{0:02d}'.format(i + 1)) + '.csv'
    # Write the information to .csv file
    with open(csvFileColumn, 'wb') as csvFile:
        print('Writing to .csv file')
        writer = csv.writer(csvFile)
        counter = 0
        for z in zipList:
            counter = counter + 1
            timeString = '{:.26},'.format(z[0])
            zList = list(z)
            columnVals = zList[1:]
            columnValStrs = list(map(str, columnVals))
            formattedStr = ','.join(columnValStrs)
            csvFile.write(timeString + formattedStr + '\n')  # Writes the time stamps and channel data by columns
One possible solution may be to use Dask: http://dask.pydata.org/en/latest/
A coworker recently recommended it to me, which is why I thought of it.
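For the loop in the question, the standard-library multiprocessing module would be another option: each column is written to its own file, so the iterations are independent and can be farmed out to worker processes. A minimal sketch, assuming the body of the loop above is factored into a hypothetical write_column(i) function:
from multiprocessing import Pool

def write_column(i):
    # hypothetical wrapper: format allColumns[i].data and write Column_XX.csv
    # (the body of the original for loop goes here)
    ...

if __name__ == '__main__':
    with Pool() as pool:  # defaults to one worker process per CPU core
        pool.map(write_column, range(NumColumns))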
