Exporting max values of different csv files into one - python

I have 3 datasets that contain the flow in m3/s per location. Dataset 1 is a 5 year ARI flood, Dataset 2 is a 20 year ARI flood and Dataset 3 is a 50 year ARI flood.
For each location I found the maximum discharge (5, 20 and 50 year ARI).
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
    m = key
    y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:,m]
    y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:,m]
    y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:,m]
    max_y5F = max(y5F_RunID)
    max_y20F = max(y20F_RunID)
    max_y50F = max(y50F_RunID)
    Max_DataID = m, max_y5F, max_y20F, max_y50F
    print (Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)

Write to a file such as myexample.csv, giving the full path where you want the file created.
Max_DataID must be an iterable of rows for this to work. In your loop it is overwritten with a single tuple on every pass, so collect the tuples in a list first. Each tuple is converted with list() so that it is a supported value for writerow in csv.
import csv
# In Python 3, open in text mode with newline='' rather than 'wb'
with open('myexample.csv', 'w', newline='') as file:
    filewriter = csv.writer(file, delimiter=',')
    for data in Max_DataID:
        filewriter.writerow(list(data))

You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd
for i,chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
    chunk.to_csv('chunk{}.csv'.format(i))
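The call in the question fails because Max_DataID is a plain tuple, and to_csv is a DataFrame method. A minimal sketch of one way around that, assuming the three flow tables share the same location columns: take the column-wise maximum of each table and combine the results into one DataFrame. The small frames and the names flow_5, flow_20, flow_50 and summary are hypothetical stand-ins, and the non-maximum values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for the three flow tables (one column per location).
flow_5 = pd.DataFrame({"G60_18": [40.0, 44.0514], "Area5_11": [1028.4065, 900.0]})
flow_20 = pd.DataFrame({"G60_18": [47.625, 41.0], "Area5_11": [1000.0, 1191.5946]})
flow_50 = pd.DataFrame({"G60_18": [50.0, 56.1275], "Area5_11": [1475.9685, 1200.0]})

# DataFrame.max() returns one maximum per column, indexed by location name.
summary = pd.DataFrame({
    "max_5yr": flow_5.max(),
    "max_20yr": flow_20.max(),
    "max_50yr": flow_50.max(),
})
summary.index.name = "location"

# With no path, to_csv returns the CSV text; pass a file path to write to disk.
csv_text = summary.to_csv()
print(csv_text)
```

Passing a path, for example summary.to_csv(r'C:\Users\Max_DataID.csv'), would write the same table to a file.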

Related

Reading from a .dat file Dataframe in Python

I have a .dat file which looks something like the below....
#| step | Channel| Mode | Duration|Freq.| Amplitude | Phase|
0 1 AWG Pi/2 100 2 1
1 1 SIN^2 100 1 1
2 1 SIN^2 200 0.5 1
3 1 REC 50 100 1 1
100 0 REC Pi/2 150 1 1
I created a data frame and wanted to extract data from it, but I get an error:
TypeError: expected str, bytes or os.PathLike object, not DataFrame
My code is below here,
import pandas as pd
import numpy as np
path = "updated.dat"
datContent = [i.strip().split() for i in open(path).readlines()]
#print(datContent)
column_names = datContent.pop(0)
print(column_names)
df = pd.DataFrame(datContent)
print(df)
extract_column = df.iloc[:,2]
with open (df, 'r') as openfile :
    for line in openfile:
        for column_search in line:
            column_search = df.iloc[:,2]
            if "REC" in column_search:
                print ("Rec found")
Any suggestion would be appreciated
Since your post does not contain a clear question, I have to guess based on your code. I am assuming that you want to find all rows in the DataFrame where the column Mode contains the value REC.
Based on that, I prepared a small, self-contained example that works on your data.
In your situation, the only line you need is the last one. Assuming that your DataFrame is created and filled correctly, all of your code below print(df) can be replaced by this single line.
I would really recommend reading the official documentation about indexing and selecting data from DataFrames: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
import pandas as pd
from io import StringIO
data = StringIO("""
no;step;Channel;Mode;Duration;Freq.;Amplitude;Phase
;0;1;AWG;Pi/2;100;2;1
;1;1;SIN^2;;100;1;1
;2;1;SIN^2;;200;0.5;1
;3;1;REC;50;100;1;1
;100;0;REC;Pi/2;150;1;1
""")
df = pd.read_csv(data, sep=";")
df.loc[df.loc[:, 'Mode'] == "REC", :]
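The TypeError in the question comes from passing the DataFrame itself to open(), which expects a file path. A rough sketch of the same filter applied to a frame built the way the question builds it (whitespace-split lines, no column names), assuming the mode is always the third token on each row. The lines list is a stand-in for the contents of the .dat file after the header row has been removed.

```python
import pandas as pd

# Stand-in for the data rows of the .dat file (header line already removed).
lines = [
    "0 1 AWG Pi/2 100 2 1",
    "1 1 SIN^2 100 1 1",
    "2 1 SIN^2 200 0.5 1",
    "3 1 REC 50 100 1 1",
    "100 0 REC Pi/2 150 1 1",
]

# Split on whitespace, as in the question; shorter rows are padded with None.
rows = [line.split() for line in lines]
df = pd.DataFrame(rows)

# Boolean mask on the third column (position 2) keeps only the REC rows.
rec_rows = df[df.iloc[:, 2] == "REC"]
print(rec_rows)
```

No second open() call is needed, because the data is already in memory once the DataFrame exists.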

How to import data from a .txt file into arrays in python

I am trying to import data from a .txt file that contains four tab-separated columns and is several thousand lines long. This is what the start of the document looks like:
Data info
File name: D:\(path to file)
Start time: 6/26/2019 15:39:54.222
Number of channels: 3
Sample rate: 1E6
Store type: fast on trigger
Post time: 20
Global header information: from DEWESoft
Comments:
Events
Event Type Event Time Comment
1 storing started at 7.237599
2 storing stopped at 7.257599
Data1
Time Incidente Transmitida DI 6
s um/m um/m -
0 2.1690152 140.98599 1
1E-6 2.1690152 140.98599 1
2E-6 4.3380303 145.32402 1
3E-6 4.3380303 145.32402 1
4E-6 -2.1690152 145.32402 1
I have several of these files that I want to loop through and store in a list, where each list item contains the four columns. After that I just use that list to plot the data with a loop.
I saw that pandas library was suitable, but I don't understand how to use it.
fileNames = (["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
    "Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
    "Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"])
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Loop through each source document
for i in range(0,len(fileNames)):
    print('File location: '+folderName+fileNames[i])
    # Get data from source as arrays, cut out the first 20 lines
    temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
        skiprows=[19], error_bad_lines=False)
    # Store data in list/cell
    # data[i] = temp # sort it
This is something I tried that didn't work, and I don't really know how to proceed. I know there is some documentation on this problem, but I am new to this and need some help.
An error I get when trying the above:
ParserError: Error tokenizing data. C error: Expected 1 fields in line 12, saw 4
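So it was an easy fix: I just had to remove the square brackets from skiprows=[19]. With a list, pandas skips only the listed line (index 19); with an integer, it skips the first 19 lines.
The code now looks like this and works.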
fileNames = ["Test1_0001.txt", "Test2_0000.txt", "Test3_0000.txt",
    "Test4_0000.txt", "Test5_0000.txt", "Test6_0001.txt", "Test7_0000.txt",
    "Test8_0000.txt", "Test9_0000.txt", "Test10_0000.txt", "RawblueMat_0000.txt"]
folderName = 'AuxeticsSHPB\\' #Source folder for all files above
# Preallocation
data = []
for i in range(0,len(fileNames)):
    temp=pd.read_csv(folderName+fileNames[i], sep='\t', lineterminator='\r',
        skiprows=19)
    data.append(temp)
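The difference between the two forms of skiprows can be seen on a tiny in-memory file. This is only an illustration; the sample text and variable names are made up and have nothing to do with the DEWESoft files.

```python
from io import StringIO
import pandas as pd

text = "x\ty\n1\t2\n3\t4\n5\t6\n"

# An integer skips that many lines from the top, so "1\t2" becomes the header.
first_skipped = pd.read_csv(StringIO(text), sep="\t", skiprows=1)

# A list skips only the listed zero-based line numbers, so "x\ty" stays the header.
one_skipped = pd.read_csv(StringIO(text), sep="\t", skiprows=[1])

print(list(first_skipped.columns))
print(list(one_skipped.columns))
```

In the question, skiprows=[19] left the free-text header lines in place, which is why the parser complained about an unexpected number of fields.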

Slicing my data frame is returning unexpected results

I have 13 CSV files that contain billing information in an unusual format. Multiple readings are recorded every 30 minutes of the day. Five days are recorded beside each other (columns). Then the next five days are recorded under it. To make things more complicated, the day of the week, date, and billing day is shown over the first recording of KVAR each day.
The image below shows a small example. However, imagine that KW, KVAR, and KVA repeat 3 more times before continuing some 50 rows later.
My goal was to create a simple Python script that would turn the data into a data frame with the columns: DATE, TIME, KW, KVAR, KVA, and DAY.
The problem is my script returns NaN data for the KW, KVAR, and KVA data after the first five days (which is correlated with a new instance of a for loop). What is weird to me is that when I try to print out the same ranges I get the data that I expect.
My code is below. I have included comments to help further explain things. I also have an example of sample output of my function.
def make_df(df):
    #starting values
    output = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
    time = df1.loc[3:50,0]
    val_start = 3
    val_end = 51
    date_val = [0,2]
    day_type = [1,2]
    # There are 7 row movements that need to take place.
    for row_move in range(1,8):
        day = [1,2,3]
        date_val[1] = 2
        day_type[1] = 2
        # There are 5 column movements that take place.
        # The basic idea is that I would cycle through the five days, grab their data in a temporary dataframe,
        # and then append that dataframe onto the output dataframe
        for col_move in range(1,6):
            temp_df = pd.DataFrame(columns=["DATE", "TIME", "KW", "KVAR", "KVA", "DAY"])
            temp_df['TIME'] = time
            #These are the 3 values that stop working after the first column change
            # I get the values that I expect for the first 5 days
            temp_df['KW'] = df.iloc[val_start:val_end, day[0]]
            temp_df['KVAR'] = df.iloc[val_start:val_end, day[1]]
            temp_df['KVA'] = df.iloc[val_start:val_end, day[2]]
            # These 2 values work perfectly for the entire data set
            temp_df['DAY'] = df.iloc[day_type[0], day_type[1]]
            temp_df["DATE"] = df.iloc[date_val[0], date_val[1]]
            # trouble shooting
            print(df.iloc[val_start:val_end, day[0]])
            print(temp_df)
            output = output.append(temp_df)
            # increase values for each iteration of row loop.
            # seems to work perfectly when I print the data
            day = [x + 3 for x in day]
            date_val[1] = date_val[1] + 3
            day_type[1] = day_type[1] + 3
        # increase values for each iteration of column loop
        # seems to work perfectly when I print the data
        date_val[0] = date_val[0] + 55
        day_type [0]= day_type[0] + 55
        val_start = val_start + 55
        val_end = val_end + 55
    return output
test = make_df(df1)
Below is some sample output. It shows where the data starts to break down after the fifth day (or first instance of the column shift in the for loop). What am I doing wrong?
This could be caused by pandas matching on row indices: when a Series is assigned to a DataFrame column, its values are aligned to the existing row index, and rows with no matching label become NaN.
import pandas as pd
import numpy as np
output = pd.DataFrame(np.random.rand(5,2), columns=['a','b']) # fake data
output['c'] = list('abcde') # add a column of non-numerical entries
tmp = pd.DataFrame(columns=['a','b','c'])
tmp['a'] = output.iloc[0:2, 2] # sets the row index of tmp to 0, 1
tmp['b'] = output.iloc[3:5, 2] # generates NaN: labels 3, 4 are not in tmp
tmp['c'] = output.iloc[0:2, 2]
print(tmp)
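If index alignment is indeed the cause, one possible workaround is to assign the raw values instead of the Series, so that the row labels are ignored. The sketch below uses made-up data and column names; it is not the asker's billing layout. It also uses pd.concat to stack the blocks, since DataFrame.append was removed in pandas 2.0.

```python
import numpy as np
import pandas as pd

# Fake source table: row i holds [2*i, 2*i + 1].
df = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["kw", "kvar"])

block = pd.DataFrame()
block["TIME"] = df.iloc[0:3, 0]          # block now has row labels 0, 1, 2

aligned = block.copy()
aligned["KW"] = df.iloc[5:8, 1]          # labels 5..7 do not match, so all NaN

positional = block.copy()
positional["KW"] = df.iloc[5:8, 1].to_numpy()   # plain array, assigned by position

# Stack the per-day blocks with concat rather than DataFrame.append.
stacked = pd.concat([positional, positional], ignore_index=True)

print(aligned)
print(positional)
```

Calling reset_index(drop=True) on each slice before assigning would be another way to get matching labels.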
(initial response)
What does df1 look like? Does df.iloc[val_start:val_end, day[0]] have any issue past the fifth day? The code doesn't show how you read from the csv files, or df1 itself.
My guess: if val_start:val_end gives invalid indices on the sixth day, or df1 happens to be malformed past the fifth day, df.iloc[val_start:val_end, day[0]] will return an empty Series object, which may make its way into temp_df. iloc does not raise an error for out-of-range row slices, though an out-of-range column index would trigger an IndexError.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(5,3), columns=['a','b','c'], index=np.arange(5)) # fake data
df.iloc[0:2, 1] # returns the subset
df.iloc[100:102, 1] # returns: Series([], Name: b, dtype: float64)
A little off topic, but I would recommend preprocessing the csv files rather than dealing with indexing in a pandas DataFrame, as the original format is rather complex. Slice the data by date and later use pd.melt or DataFrame.groupby to shape it into the format you like. Alternatively, try a multi-index if you stick with pandas I/O.

add computed column to a csv file

I hope this is not a classic beginner question. I have read a lot and spent days trying to save my CSV data, without success.
I have a function that takes an input parameter that I set manually. The function generates 3 columns, which I save in a CSV file. When I run the function with other inputs and try to save the new data to the right of the previously computed columns, pandas instead writes the new rows below the old ones, repeating the headings each time.
I'm using the next code to save my data:
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
and the result is:
dot lake mock
1 42 11.914558
2 41 42.446977
3 40 89.188668
dot lake mock
1 42 226.266513
2 41 317.768887
dot lake mock
3 42 560.171830
4. 41. 555.005333
What I want is:
dot lake mock mock mock
0 42 11.914558. 226.266513. 560.171830
1 41 42.446977. 317.768887. 555.005533
2 40 89.188668
UPDATE:
My DataFrame was generated using a function like this:
First I opened a csv file:
df1=pd.read_csv('current_state.csv')
def my_function(df1, photos, coords=['X', 'Y']):
    Hzs = t.copy()
    # np.int was removed from NumPy; the built-in int does the same here
    shifts = np.floor(Hzs / t_step).astype(int)
    ms = np.zeros(shifts.size)
    delta_inv = np.arange(N+1)
    dot = delta_inv[N:0:-1]
    lake = np.arange(1,N+1)
    for i, shift in enumerate(shifts):
        diffs = df1[coords] - df1[coords].shift(-shift)
        sqdist = np.square(diffs).sum(axis=1)
        ms[i] = sqdist.sum()
    mock = np.divide(ms, dot)
    msds = pd.DataFrame({'dot':dot, 'lake':lake, 'mock':mock})
    return msds
data = my_function(df1, photos, coords=['X', 'Y'])
print(data)
data.to_csv('/Users/Computer/Desktop/Examples anaconda/data_new.csv', sep=',',mode='a')
I spent several days looking for a way to write several computed columns side by side in a CSV file, despite unpleasant comments from some people. I finally found how to do it, in case someone needs something similar:
First I save my data using to_csv:
data.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',',mode='a', index=False)
Once the file has been generated with the headers (and without the index, which I don't need), I call the function again and finish with:
b = data
a = pd.read_csv('data_new.csv')
c = pd.concat ([a,b],axis=1, ignore_index=True)
c.to_csv('/Users/Computer/Desktop/Examples/data_new.csv', sep=',', index=False)
As a result I got the CSV file I wanted, and the function can be called as many times as needed.
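The read, concat and rewrite cycle can be wrapped in a small helper. This is only a sketch of that idea: append_columns is a hypothetical name, the temporary directory stands in for the real folder, and the frames reuse a few values from the question. Unlike the snippet above, it keeps the column names instead of passing ignore_index=True.

```python
import os
import tempfile
import pandas as pd

def append_columns(csv_path, new_data):
    """Hypothetical helper: place new_data to the right of what the CSV already holds."""
    if os.path.exists(csv_path):
        existing = pd.read_csv(csv_path)
        # axis=1 puts the frames side by side; shorter columns are padded with NaN.
        combined = pd.concat([existing, new_data], axis=1)
    else:
        combined = new_data
    combined.to_csv(csv_path, index=False)
    return combined

# Temporary location used only for this illustration.
csv_path = os.path.join(tempfile.mkdtemp(), "data_new.csv")

first = pd.DataFrame({"dot": [1, 2, 3], "lake": [42, 41, 40],
                      "mock": [11.914558, 42.446977, 89.188668]})
second = pd.DataFrame({"mock": [226.266513, 317.768887]})

append_columns(csv_path, first)
result = append_columns(csv_path, second)
print(result)
```

Because the frames are aligned on their row index, a shorter run simply leaves empty cells at the bottom of its column.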

Python - average of unique values

I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years so the same dates occur multiple times. I would like to average the temperature so that I get a new table where each date is only occurring once and has the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.
You can use pandas, and run the groupby command, when df is your data frame:
df.groupby('DATE').mean()
Here is a toy example that shows the behaviour:
import pandas as pd
df=pd.DataFrame({"a":[1,2,3,1,2,3],"b":[1,2,3,4,5,6]})
df.groupby('a').mean()
Will result in
     b
a
1  2.5
2  3.5
3  4.5
When the original dataframe was
   a  b
0  1  1
1  2  2
2  3  3
3  1  4
4  2  5
5  3  6
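Applied to data shaped like the file in the question, the same call might look like the sketch below. It assumes the file has exactly the DATE and TEMP columns shown, and it reads DATE as text so that leading zeros such as 0101 are kept. The in-memory csv_text stands in for the real file and uses four of the rows from the question.

```python
from io import StringIO
import pandas as pd

# Stand-in for the real CSV file.
csv_text = """DATE,TEMP
0101,39.0
0102,40.9
0101,41.0
0102,39.9
"""

# dtype=str keeps DATE as '0101' instead of the integer 101.
df = pd.read_csv(StringIO(csv_text), dtype={"DATE": str})

# One row per date, with the mean temperature in the second column.
averages = df.groupby("DATE", as_index=False)["TEMP"].mean()
print(averages)
```

The result can be saved with averages.to_csv(...) and index=False to get the two-column table described in the question.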
If you can use defaultdict from the collections module, this type of thing becomes pretty easy.
Assuming your list is in the same directory as the Python script and it looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages.
#test.py
#usage: python test.py list.csv
import sys
from collections import defaultdict
# Open the file named as the first command-line argument
with open(sys.argv[1]) as File:
    # Skip the first line of the file (the "DATE,TEMP" header)
    next(File)
    # Create a dictionary of lists
    ourDict = defaultdict(list)
    # Parse the file, line by line
    for each in File:
        # Split the line by a comma,
        # or whatever separates the values (Comma Separated Values = CSV)
        each = each.split(',')
        # Now each[0] is a date, and each[1] is a value.
        # We use each[0] as the key, and append values to the list
        ourDict[each[0]].append(float(each[1]))
print("Date\tValue")
for key, value in ourDict.items():
    # Average is the sum of all members of the list
    # divided by the list's length
    print(key, '\t', sum(value)/len(value))
