Python - average of unique values

I have a CSV file that looks like this:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
...
0101,41.0
0102,39.9
0103,44.6
0104,42.0
0105,43.0
0106,42.4
It's a list of temperatures for specific dates. It contains data for several years, so the same dates occur multiple times. I would like to average the temperatures so that I get a new table where each date occurs only once, with the average temperature for that date in the second column.
I know that Stack Overflow requires you to include what you've attempted, but I really don't know how to do this and couldn't find any other answers on this.
I hope someone can help. Any help is much appreciated.

You can use pandas and run groupby, where df is your dataframe:
df.groupby('DATE').mean()
Here is a toy example to illustrate the behaviour:
import pandas as pd
df=pd.DataFrame({"a":[1,2,3,1,2,3],"b":[1,2,3,4,5,6]})
df.groupby('a').mean()
Will result in
     b
a
1  2.5
2  3.5
3  4.5
When the original dataframe was
a b
0 1 1
1 2 2
2 3 3
3 1 4
4 2 5
5 3 6
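Applied to the CSV in the question, a minimal sketch could look like this (the file names temps.csv and averages.csv are just placeholders; reading DATE as a string keeps the leading zeros):
import pandas as pd

# Read the file, keeping DATE as a string so that leading zeros survive
df = pd.read_csv('temps.csv', dtype={'DATE': str})

# One row per date with the mean temperature, then write it out
averages = df.groupby('DATE', as_index=False)['TEMP'].mean()
averages.to_csv('averages.csv', index=False)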

The defaultdict class from the collections module makes this type of thing pretty easy.
Assuming your file is in the same directory as the Python script and it looks like this:
list.csv:
DATE,TEMP
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
0101,39.0
0102,40.9
0103,44.4
0104,41.0
0105,40.0
0106,42.2
Here is the code I used to print out the averages.
#test.py
#usage: python test.py list.csv
import sys
from collections import defaultdict

# Open the file whose name is given as the second command-line argument
with open(sys.argv[1]) as in_file:
    # Skip the header line ("DATE,TEMP")
    next(in_file)
    # Create a dictionary mapping each date to a list of temperatures
    our_dict = defaultdict(list)
    # Parse the file line by line
    for line in in_file:
        # Split on the comma (Comma Separated Values = CSV)
        date, temp = line.strip().split(',')
        # Use the date as the key and append the temperature to its list
        our_dict[date].append(float(temp))

print("Date\tValue")
for key, values in our_dict.items():
    # The average is the sum of the list divided by its length
    print(f"{key}\t{sum(values) / len(values)}")
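With the sample list.csv above (where each date happens to appear twice with identical values), running python test.py list.csv should print something like:
Date	Value
0101	39.0
0102	40.9
0103	44.4
0104	41.0
0105	40.0
0106	42.2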

Related

Exporting max values of different csv files into one

I have 3 datasets which contain the flow in m3/s per location. Dataset 1 is a 5-year ARI flood, Dataset 2 is a 20-year ARI flood and Dataset 3 is a 50-year ARI flood.
Per location I found the maximum discharge (5, 20 & 50):
Code:
for key in Data_5_ARI_RunID_Flow_New.keys():
    m = key
    y5F_RunID = Data_5_ARI_RunID_Flow_New.loc[:, m]
    y20F_RunID = Data_20_ARI_RunID_Flow_New.loc[:, m]
    y50F_RunID = Data_50_ARI_RunID_Flow_New.loc[:, m]
    max_y5F = max(y5F_RunID)
    max_y20F = max(y20F_RunID)
    max_y50F = max(y50F_RunID)
    Max_DataID = m, max_y5F, max_y20F, max_y50F
    print(Max_DataID)
The output is like this:
('G60_18', 44.0514, 47.625, 56.1275)
('Area5_11', 1028.4065, 1191.5946, 1475.9685)
('Area5_12', 1017.8286, 1139.2628, 1424.4304)
('Area5_13', 994.5626, 1220.0084, 1501.1483)
('Area5_14', 995.9636, 1191.8066, 1517.4541)
Now I want to export this result to a csv file, but I don't know how. I used this line of code, but it didn't work:
Max_DataID.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index = False)
Use the file name myexample.csv, together with the path where you want the file created.
Check that Max_DataID is an iterable. Since your values are tuples, list() is used to convert each tuple into a list, which is an accepted argument for csv.writer's writerow.
import csv
with open('myexample.csv', 'w', newline='') as out_file:
    filewriter = csv.writer(out_file, delimiter=',')
    for data in Max_DataID:
        filewriter.writerow(list(data))
You can do the following.
df.to_csv(file_name, sep='\t')
Also, if you want to split it into chunks, like 10,000 rows, or whatever, you can do this.
import pandas as pd
for i, chunk in enumerate(pd.read_csv('C:/your_path_here/main.csv', chunksize=10000)):
    chunk.to_csv('chunk{}.csv'.format(i))
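Note that in the question's loop Max_DataID is overwritten on every pass, so only the last tuple is left when the loop ends. A possible fix, sketched here with the question's dataframes and with column names I have made up, is to collect all the tuples first and export them in one go:
import pandas as pd

rows = []
for key in Data_5_ARI_RunID_Flow_New.keys():
    # one tuple per location: (location, max 5-year, max 20-year, max 50-year)
    rows.append((key,
                 Data_5_ARI_RunID_Flow_New[key].max(),
                 Data_20_ARI_RunID_Flow_New[key].max(),
                 Data_50_ARI_RunID_Flow_New[key].max()))

# column names here are only illustrative
result = pd.DataFrame(rows, columns=['location', 'max_5y_ARI', 'max_20y_ARI', 'max_50y_ARI'])
result.to_csv(r'C:\Users\Max_DataID.csv', sep=',', index=False)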

Iterating through csv records based on the version of the record via Python

I have a csv file that has a primary_id field and a version field and it looks like this:
ful_id version xs at_grade date
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
Edit: this is what the actual data looks like, plus another 106 columns of data and 20,000 records.
The larger version number is the latest version of that record. I am having a difficult time working out the logic to get the latest record based on version and dumping that into a dictionary. I am pulling the info from the csv into a blank list, but if anyone could give me some guidance on the logic moving forward, I would appreciate it.
import csv
import pprint
from collections import defaultdict

reader = csv.DictReader(open('rpm_inv.csv', 'r'))
allData = list(reader)
dict_list = []
for line in allData:
    dict_list.append(line)
pprint.pprint(dict_list)
I'm not exactly sure what you want your output to look like, but this might at least point you in the right direction, as long as you're not opposed to pandas.
import pandas as pd
df = pd.read_csv('rpm_inv.csv', header=0)
by_version = df.groupby('version')
latest = by_version.max()
# To put it into a dictionary of {version: ful_id}
{v: row['ful_id'] for v, row in latest.iterrows()}
There's no need for anything fancy.
defaultdict is included in Python's standard library. It's an improved dictionary. I've used it here because it obviates the need to initialise entries in a dictionary. This means that I can write, for instance, result[id] = max(result[id], version). If no entry exists for id then defaultdict creates one initialised to 0, so the maximum is simply version.
I read through the lines in the input file, one at a time, discarding end-lines and blanks, splitting on the commas, and then use map to apply the int function to each string produced.
I ignore the first line in the file simply by reading it and assigning its contents to a variable that I have arbitrarily called ignore.
Finally, just to make the results more intelligible, I sort the keys in the dictionary, and present the contents of it in order.
>>> from collections import defaultdict
>>> result = defaultdict(int)
>>> with open('to_dict.txt') as input:
...     ignore = input.readline()
...     for line in input:
...         id, version = map(int, line.strip().replace(' ', '').split(','))
...         result[id] = max(result[id], version)
...
>>> ids = list(result.keys())
>>> ids.sort()
>>> for id in ids:
...     id, result[id]
...
(3, 1)
(11, 3)
(20, 2)
(400, 2)
EDIT: With that much data it becomes a different question, in my estimation, better processed with pandas.
I've put the df.groupby(['ful_id']).version.idxmax() bit in to demonstrate what I've done. I group on ful_id, then ask for the maximum value of version and the index of the maximum value, all in one step using idxmax. Although pandas displays this as a two-column table the result is actually a list of integers that I can use to select rows from the dataframe.
That's what I do with df.iloc[df.groupby(['ful_id']).version.idxmax(),:]. Here the df.groupby(['ful_id']).version.idxmax() part identifies the rows, and the : part identifies the columns, namely all of them.
Thanks for an interesting question!
>>> import pandas as pd
>>> df = pd.read_csv('different.csv', sep=r'\s+')
>>> df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
1 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 1 12 no 20170206
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
4 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 2 123 no 20170206
>>> df.groupby(['ful_id']).version.idxmax()
ful_id
000c1a6c-1f1c-45a6-a70d-f3555f7dd980 0
00dc5fec-ddb8-45fa-9c86-77e09ff590a9 3
034c1a6c-4f1c-aa36-a70d-f2245f7rr342 2
Name: version, dtype: int64
>>> new_df = df.iloc[df.groupby(['ful_id']).version.idxmax(),:]
>>> new_df
ful_id version xs at_grade date
0 000c1a6c-1f1c-45a6-a70d-f3555f7dd980 3 123 yes 20171003
3 00dc5fec-ddb8-45fa-9c86-77e09ff590a9 1 556 yes 20170201
2 034c1a6c-4f1c-aa36-a70d-f2245f7rr342 1 334 yes 20150302
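If, as in the question, the goal is a dictionary of the latest records rather than a dataframe, one possible follow-up is to key the rows on ful_id (a sketch, continuing the session above):
>>> latest_by_id = new_df.set_index('ful_id').to_dict('index')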

How can I actually change the values of a specific text file with pandas

Hi all,
I have a txt file with many columns with no headers.
I use
df = pd.read_csv('a.txt', sep=' ', header=None, usecols=[2,3], names=['waiting time','running time'])
Suppose the columns look like this:
waiting time running time
0 8617344 8638976
1 8681728 8703360
2 8703488 8725120
3 8725120 8725760
4 4185856 4207488
From the 'running time' column I want to subtract the values of the 'waiting time' column, so that I get:
waiting time running time
0 8617344 21632
1 8681728 21632
2 8703488 21632
3 8725120 640
4 4185856 21632
My question is: how do I make this change actually happen in the txt file, so that the file itself is updated accordingly?
If your question is how to update the text file with your new data, you just use the writing counterpart of your first line:
# Save to file a.txt
# Use a space as the separator
# Don't save the index
df.to_csv('a.txt', sep=' ', index=False)
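Putting the pieces together, a minimal sketch of the full round trip might be (reusing the read_csv call from the question and the subtraction described above; note that it writes back only the two columns that were read):
import pandas as pd

# Read the two columns, exactly as in the question
df = pd.read_csv('a.txt', sep=' ', header=None, usecols=[2, 3],
                 names=['waiting time', 'running time'])

# Replace 'running time' with the difference of the two columns
df['running time'] = df['running time'] - df['waiting time']

# Write the result back to a.txt, space-separated, without the index
df.to_csv('a.txt', sep=' ', index=False)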

reading variable number of columns with Python

I need to read a variable number of columns from my input file (the number of columns is defined by the user; there's no limit). For every column I have multiple variables to read, three in my case, set by the user as well.
So the file to read is like:
2 3 5
6 7 9
3 6 8
In Fortran this is really easy to do:
DO 180 I=1,NMOD
READ(10,*) QARR(I),RARR(I),WARR(I)
NMOD is defined by the user, as are all the values in the example. All of them are input parameters to be stored in memory. By doing this I can save all the variables I need and use them whenever I want, recalling them by changing the index I. How can I obtain the same result with Python?
Example file 'text'
2 3 5
6 7 9
3 6 8
Python code
data = []
with open('text') as file:
    columns_to_read = 1  # here you tell how many columns you want to read per line
    for line in file:
        data.append(list(map(int, line.split()[:columns_to_read])))
print(data)  # prints: [[2], [6], [3]]
data will hold an array of arrays that represent your lines.
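If you want separate arrays comparable to the Fortran QARR, RARR and WARR, a possible follow-up (assuming columns_to_read is set to 3 so each inner list holds three values) is to transpose the result:
# zip(*data) turns the list of rows into a list of columns
qarr, rarr, warr = (list(col) for col in zip(*data))
print(qarr)  # [2, 6, 3] for the example file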
from itertools import islice

with open('file.txt', 'rt') as f:
    # default slice: from row 0 until the end with step 1
    # example: islice(f, 10, 20, 2) takes only rows 10, 12, 14, 16, 18
    dat = islice(f, 0, None, 1)
    column = None  # change the number of columns to keep here; None means all
    # this keeps the list values as strings
    # mylist = [i.split()[:column] for i in dat]
    # this keeps the list values as ints
    mylist = [[int(j) for j in i.split()[:column]] for i in dat]
The code above constructs a 2D list; access it with mylist[row][column].
Example: mylist[2][3] accesses row 2, column 3.
Edit: improved code efficiency following @Guillaume's and @Javier's suggestions.

How to parse CSV file and search by item in first column

I have a CSV file with over 4,000 lines formatted like...
name, price, cost, quantity
How do I trim my CSV file so only the 20 names I want remain? I am able to parse/trim the CSV file, but I am coming up blank on how to search column 1.
Use pandas!
import pandas as pd
df = pd.DataFrame({'name': ['abc', 'ght', 'kjh'], 'price': [7,5,6], 'cost': [9, 0 ,2], 'quantity': [1,3,4]})
df = pd.read_csv('input_csv.csv')  # In your case you would load it like this
>>> df
cost name price quantity
0 9 abc 7 1
1 0 ght 5 3
2 2 kjh 6 4
>>> names_wanted = ['abc','kjh']
>>> df_trim = df[df['name'].isin(names_wanted)]
>>> df_trim
cost name price quantity
0 9 abc 7 1
2 2 kjh 6 4
Then export the file to csv:
>>> df_trim.to_csv('trimmed_csv.csv', index=False)
Done!
You can loop through csv.reader(). It returns rows, and each row is a list. Compare the first element of the list, i.e. row[0]; if it is one of the names you want, add the row to an output list.
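A minimal sketch of that approach (the file names and the wanted_names set are only placeholders):
import csv

wanted_names = {'name1', 'name2'}  # the 20 names you want to keep

kept_rows = []
with open('input.csv', newline='') as in_file:
    for row in csv.reader(in_file):
        # row[0] is the name column; keep the row if it matches
        if row[0].strip() in wanted_names:
            kept_rows.append(row)

with open('trimmed.csv', 'w', newline='') as out_file:
    csv.writer(out_file).writerows(kept_rows)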
You could create a plain text file with each of the 20 names on a separate line (perhaps called target_names). Then, with your CSV file (perhaps called file.csv), on the command line (bash):
for name in $(cat target_names); do grep $name file.csv >> my_new_small_file.csv; done
If you have issues with case sensitivity, use grep -i.
Not sure I understood you right, but can the snippet below do what you want?
def FilterCsv(_sFilename, _aAllowedNameList):
    l_aNewFileLines = []
    l_inputFile = open(_sFilename, 'r')
    for l_sLine in l_inputFile:
        l_aItems = l_sLine.split(',')
        if l_aItems[0] in _aAllowedNameList:
            l_aNewFileLines.append(l_sLine)
    l_inputFile.close()
    l_outputFile = open('output_' + _sFilename, 'w')
    for l_sLine in l_aNewFileLines:
        l_outputFile.write(l_sLine)
    l_outputFile.close()
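A possible call, with a hypothetical file name and name list:
FilterCsv('prices.csv', ['abc', 'kjh'])  # writes output_prices.csv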
Hope this can be of any help!
