Python: how to fix matplotlib plotting error? - python

I want to make a loop to create a plot for each corresponding column in 2 different csv files such that column 1 in csv A and column 1 in csv B are plotted together with the same timestamp (pulled from another csv). I do not think I will have trouble when I modify my code to create the loop, but I have to get matplotlib to work for the first column before trying to construct a loop.
I have already checked that the correct data is being passed into the function and that it is in the correct order. For example, I printed the zipped array as a list (t_array, b_array) and checked my csv files to verify that the data was in the correct order. I have also tried modifying the axes, ticks, and zoom, to no avail. I have checked the helper functions, which I lifted from my other projects, and they all work as expected.
def double_plot():
    before = read_file(r_before)
    after = read_file(r_after)
    time = read_file(timestamp)
    if len(before) == len(after):
        b_array = np.asarray(before[1])
        a_array = np.asarray(after[1])
        t_array = np.asarray(time[1])
        plt.plot(t_array, b_array)
        plt.plot(t_array, a_array)
        plt.show()
    else:
        print(len(before))
        print(len(after))
        print("dimension failure")
read_file() is a helper function that reads csv files and saves the columns to a dictionary, with the first column stored under key 1 and so on across the columns. I know I should probably change it to index from 0, but this is a problem for later...
[Images: what I would like the plot to look like, and what my code is actually producing]
Thank you for your time. This is my first time posting so I apologize if something I did was incorrect. I did attempt to find the answer before posting.
Edits: added a data sample [image: screenshot of Excel] and read_file():
def read_file(read_file):
    data = {}
    with open(read_file, 'r') as f:
        reader = csv.reader(f)
        for row in reader:
            col_num = 0
            for col in row:
                col_num += 1
                if col_num in data:
                    data[col_num].append(col)
                else:
                    ls = col
                    ls = [ls]
                    data[col_num] = ls
    return data
edit again: ^ it's much better to use pandas, but I am leaving this here because it's funny after seeing it done with dataframes

The arrays I was using with the plot function contained strings rather than floats.
These links explain the problem along with multiple ways to fix it:
Matplotlib y axis values are not ordered
In Python, how do I convert all of the items in a list to floats?
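A minimal sketch of the fix, assuming every cell in the column parses as a number (the sample values below are made up). csv.reader returns every cell as a string, and matplotlib treats strings as categories plotted in order of appearance rather than as numbers, so converting first restores a numeric axis:

```python
def to_floats(values):
    """Convert a sequence of numeric strings (as read by csv.reader) to floats."""
    return [float(v) for v in values]

# Hypothetical column contents, as strings straight from a CSV file.
b_column = ["10.5", "9.75", "100.0", "2"]

b_values = to_floats(b_column)
print(b_values)        # [10.5, 9.75, 100.0, 2.0]
print(max(b_column))   # '9.75'  (string comparison)
print(max(b_values))   # 100.0   (numeric comparison)
```

Inside double_plot(), np.asarray(before[1], dtype=float) should perform the same conversion in one step.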

Related

How can I periodically skip rows reading txt with pandas?

I need to process data measured every 20 seconds throughout 2018. The raw file has the following structure:
date time a lot of trash
in several rows
amount of samples trash again
data
date time a lot of trash
etc.
I want to build one pandas dataframe from it, or at least one dataframe per block of data (the block size is given by the amount of samples), keeping the time of measurement.
How can I ignore all the other trash? I know that it is written periodically (period = amount of samples), but:
- I don't know how many lines are in the file
- I don't want to call file.getline() explicitly in a loop, because it would run practically forever (especially in Python) and I don't have enough computing power for that
Is there any method to skip rows periodically in pandas or another library? Or how else can I solve this?
There is an example of my data:
https://drive.google.com/file/d/1OefLwpTaytL7L3WFqtnxg0mDXAljc56p/view?usp=sharing
I want to get a dataframe similar to the data table in the picture, plus an additional date-time column, without the technical rows.
Use itertools.islice, where N below means read every Nth line:
import pandas as pd
from itertools import islice
N = 3
sep = ','
with open(file_path, 'r') as f:
    lines_gen = islice(f, None, None, N)
    df = pd.DataFrame([x.strip().split(sep) for x in lines_gen])
I repeated your data three times. It sounds like you need every 4th row (not starting at 0), because that is where your data lies. The documentation for skiprows says:
If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].
So what if we pass a not in to the lambda function? That is what I am doing below.
I create a list of the row indices I want to keep and pass it to the skiprows argument with not in. In English: skip all the rows that are not every 4th line.
import pandas as pd
# creating a list of all the 4th row indexes. If you need more than 1 million, just up the range number
list_of_rows_to_keep = list(range(0,1000000))[3::4]
# passing this list to the lambda function using not in.
df = pd.read_csv(r'PATH_To_CSV.csv', skiprows=lambda x: x not in list_of_rows_to_keep)
df.head()
#output
0 data
1 data
2 data
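The same idea can be sketched without building a large list, by testing the row index with a modulo inside the callable. This assumes, as above, that the data sits on every 4th line; the sample text is made up and header=None is an assumption about the file having no column header:

```python
import io
import pandas as pd

# Made-up stand-in for the raw file: three junk lines, then one data line, repeated.
raw = io.StringIO(
    "2018-01-01 00:00:00 junk\n"
    "more junk\n"
    "1 junk\n"
    "1.5,2.5\n"
    "2018-01-01 00:00:20 junk\n"
    "more junk\n"
    "1 junk\n"
    "3.5,4.5\n"
)

# Keep only lines 3, 7, 11, ... (0-based); skip everything else.
df = pd.read_csv(raw, header=None, skiprows=lambda i: i % 4 != 3)
print(df)
```

A modulo test costs the same for every row, whereas x not in some_list scans the list on each call, which can get slow for files with hundreds of thousands of lines.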
Just count how many lines are in the file and pass the list of rows that should be skipped (call it useless_rows) to pandas.read_csv(..., skiprows=useless_rows).
My problem was counting the rows cheaply.
There are a few ways to do it:
The Linux command "wc -l" (here are instructions for calling it from your code: Running "wc -l <filename>" within Python Code)
Generators. I have a key in my relevant rows: it is in the last column. It is not really informative, but it rescued me. So I can count the lines containing it; it turns out there are about 500000 lines, and it took 0.00011 to count
# wrapped in a function (the name is arbitrary) so that yield is valid
def rows_without_key(filename):
    with open(filename) as f:
        for row in f:
            if '2147483647' in row:
                continue
            yield row

Trying to get the largest number in a column of a .csv file

This is what I have currently. I get the error 'int' object is not iterable. If I understand correctly, my issue is that BIKES_AVAILABLE is assigned a number at the top of my project, so instead of looking at the column it is looking at that number and hitting an error. How should I go about going through the column? I apologize in advance for the newbie question.
for i in range(len(stations[BIKES_AVAILABLE]) -1):
most_bikes = max(stations[BIKES_AVAILABLE])
sort(stations[BIKES_AVAILABLE]).remove(max(stations[BIKES_AVAILABLE]))
if most_bikes == max(stations[BIKES_AVAILABLE]):
second_most = max(stations[BIKES_AVAILABLE])
index_1 = index(most_bikes)
index_2 = index(second_most)
most_bikes = max(data[0][index_1], data[0][index_2])
return most_bikes
Another method that might be better for you to use with data manipulation is to try the pandas module.
Then you could do this:
import pandas as pd
data = pd.read_csv('bicycle_data.csv')
# Alternative:
# most_sales = data['sold'].max()
most_sales = max(data['sold'])
Now you don't have to worry about indexing columns with numbers.
You can also do something like this:
sorted_data = data.sort_values(by='sold', ascending=False)
# Displays top 5 sold bicycles.
print(sorted_data.head(5))
More importantly, if you enjoy using indexes, pandas has a built-in function called idxmax that gets you the index of the max value.
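As a rough illustration of idxmax, here is a sketch on a small made-up table; the column names station and bikes_available are assumptions, not taken from the asker's file:

```python
import pandas as pd

# Made-up data standing in for the CSV; the column names are assumptions.
data = pd.DataFrame({
    "station": ["North", "Central", "South"],
    "bikes_available": [3, 9, 4],
})

row_label = data["bikes_available"].idxmax()   # index label of the largest value
print(row_label)             # 1
print(data.loc[row_label])   # the full row for that station
```

idxmax returns the index label rather than the value, so data.loc[row_label] is the natural way to get the rest of the row.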
Using a generator inside max()
If you have a CSV file named test.csv, with contents:
line1,3,abc
line2,1,ahc
line3,9,sbc
line4,4,agc
You can use a generator expression inside the max() function for a memory efficient solution (i.e. no list is created).
If you wanted to do this for the second column, then:
max(int(l.split(',')[1]) for l in open("test.csv").readlines())
which would give 9 for this example.
Update
To get the row (index), you need to store the index of the max number in the column so that you can access this:
max(((i,int(l.split(',')[1])) for i,l in enumerate(open("test.csv").readlines())),key=lambda t:t[1])[0]
which gives 2 here as the line in test.csv (above) with the max number in column 2 (which is 9) is 2 (i.e. the third line).
This works fine, but you may prefer to just break it up slightly:
lines = open("test.csv").readlines()
max(((i,int(l.split(',')[1])) for i,l in enumerate(lines)),key=lambda t:t[1])[0]
Assuming a csv structure like so:
data = ['1,blue,15,True',
'2,red,25,False',
'3,orange,35,False',
'4,yellow,24,True',
'5,green,12,True']
If I want to get the max value from the 3rd column, I would convert each field to int first (otherwise max() compares the strings lexicographically) and do this:
largest_number = max(int(n.split(',')[2]) for n in data)

Getting all column values from google sheet using Gspread and Python

So I have a problem with gspread for Python 3.
When I do something like:
x = worksheet.cell(1,1).value
print(x)
Then I get the value of cell 1,1, which in my case is:
Nice
But when I do:
x = worksheet.col_values(1)
print(x)
Then I get all the results, as in:
'Nice', 'Cool','','','','','','','','','','','','','',''
This includes all the empty cells, which I don't understand: since I am asking just for values, why do I get all the '' empty strings, and why are the other results also in quotes? I would expect something like:
Nice
Cool
when I call for the values of a column, since those are the only values. Does anyone know how to get such results?
According to the documentation at https://github.com/burnash/gspread it should work, but it does not.
You are getting all of the column data, contained in a list. It starts at row one and gives you all rows in that column to the bottom of the spreadsheet (1000 rows by default), including empty cells. The documentation tells you this:
col_values(col) Returns a list of all values in column col.
Empty cells in this list will be rendered as None.
This seems to have been changed to return empty strings instead, but the principle is the same.
To get just values, use a list comprehension:
x = [item for item in worksheet.col_values(1) if item]
Note that the above will also remove blank rows between items, which might cause misalignment if you try to work with multiple columns where the row number is important. Since it's a list, individual items are accessed with:
for item in x:
    print(item)
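If row alignment matters, one alternative is to drop only the empty cells at the end of the column and keep the blanks in between. This is a plain-Python sketch, not a gspread feature; strip_trailing_blanks is a hypothetical helper and the list stands in for the result of worksheet.col_values(1):

```python
def strip_trailing_blanks(values):
    """Drop empty strings from the end of a column, keeping interior blanks."""
    end = len(values)
    while end > 0 and values[end - 1] == "":
        end -= 1
    return values[:end]

# Made-up stand-in for the list returned by worksheet.col_values(1).
column = ["Nice", "", "Cool", "", "", ""]
print(strip_trailing_blanks(column))   # ['Nice', '', 'Cool']
```

Position i in the result still corresponds to row i + 1 of the sheet, so several columns trimmed this way can be compared row by row.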
Looking again at the gspread-documentation, I was able to create a dataframe and then thereafter obtain the column-values:
gc = gspread.authorize(GoogleCredentials.get_application_default())
sht2 = gc.open_by_url('https://docs.google.com/spreadsheets/d/<id>')
worksheet = sht2.worksheet("Sheet-name")
dataframe = pd.DataFrame(worksheet.get_all_records())
dataframe.head(3)
Note: Don't forget to enable your gsheet's sharing-settings to "Anyone with a link", to be able to access the sheet from e.g. google colab.
You can also create a while loop and make something like this.
Let's say you want column E to G, you can start the loop from x=5 and end it on x=7. Just make sure that you transpose the dataframe at the end before printing it.
columns = []
x = 5
while x < 8:
    data = sheet.col_values(x)[1:]
    x += 1
    columns.append(data)
df = pd.DataFrame(columns).T
print(df)

Mapping CSV data into Python

I am new to Python, and I am trying to sort of 'migrate' a excel solver model that I have created to Python, in hopes of more efficient processing time.
I receive a .csv sheet that I use as my input for the model, it is always in the same format.
This model essentially uses 4 different metrics associated with product A, B and C, and I essentially determine how to price A, B, and C accordingly.
I am at the very nascent stage of effectively inputting this data to Python. This is what I have, and I would not be surprised if there is a better approach, so open to trying anything you veterans have to recommend!
import csv
f = open("141881.csv")
for row in csv.reader(f):
    price = row[0]
    a_metric1 = row[1]
    a_metric2 = row[2]
    a_metric3 = row[3]
    a_metric4 = row[4]
    b_metric1 = row[7]
    b_metric2 = row[8]
    b_metric3 = row[9]
    b_metric4 = row[10]
    c_metric1 = row[13]
    c_metric2 = row[14]
    c_metric3 = row[15]
    c_metric4 = row[16]
The .csv file comes in the format of price,a_metric1,a_metric2,a_metric3,a_metric4,,price,b_metric1,b_metric2,b_metric3,b_metric4,price,,c_metric1,c_metric2,c_metric3,c_metric4
I skip the second and third price column as they are identical to the first one.
However when I run the python script, I get the following error:
c_metric1 = row[13]
IndexError: list index out of range
And I have no idea why this occurs, when I can see the data is there myself (in Excel, this .csv file would go all the way to column Q, or what I understand as row[16]).
Your help is appreciated, and any advice on my approach is more than welcomed.
Thanks in advance!
Using print() can be your friend here:
import csv
with open('141881.csv') as file_handle:
    file_reader = csv.reader(file_handle)
    for row in file_reader:
        print(row)
The code above will print out EACH row.
To print out ONLY the first row, replace the for loop with: print(next(file_reader)) (assuming Python 3)
Printing out row(s) will allow you to see what exactly a "row" is.
P.S.
Using with is advisable because it handles the opening and closing of the file for you
Look into pandas.
Read the file as:
import pandas as pd
data = pd.read_csv('141881.csv')
To read a column:
col = data['column_name']
To read a row by position:
row = data.iloc[row_number]
The csv module in Python turns a spreadsheet into a matrix: a list of lists
The Python csv module transforms each line of your input into a list.
For each row, it splits the row into a list of cells. In other words, each list has as many elements as that line of your spreadsheet has columns.
Try in a terminal:
>>> f = open("141881.csv")
>>> print(list(csv.reader(f)))
[["id", "name", "company", "email"], ["1563", "defoe", "SuperFastCompany"], ["def#superfastcie.net"], ["1564", "doe", "Awsomestartup", "doe#awesomestartup"], ...]
So that's why you iterate through the rows of your spreadsheet, assigning each value to a new variable.
I recommend reading up on the basics of list manipulation.
But...
What is an IndexError? Catching exceptions:
If one row has fewer columns than the others, indexing it will throw an error such as the one you described. IndexError means Python wasn't able to find a value at that position in the list. In other words, if some lines of your spreadsheet are shorter than the others, there is no such value to assign and Python throws an IndexError. That is why knowing how to catch exceptions can be very useful for seeing the problem. Try to verify that every row's list has the same length and, if not, assign an empty value, for example:
try:
    # if every row always has exactly 17 cells,
    # I can assign them directly by unpacking the list;
    # _ marks the empty and repeated-price columns that are ignored
    (price, a_metric1, a_metric2, a_metric3, a_metric4, _, _,
     b_metric1, b_metric2, b_metric3, b_metric4, _, _,
     c_metric1, c_metric2, c_metric3, c_metric4) = row
except ValueError:
    # unpacking a row that does not have 17 cells raises ValueError
    # (indexing past the end, e.g. row[13], raises IndexError instead)
    # tell me how many cells are actually in the list
    print(len(row))
Now you can handle the error by assigning None to the values that don't appear in the csv file.
You can read more about catching exceptions.
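The "assign an empty value" idea could be sketched as a small helper that pads short rows before indexing them. pad_row is a hypothetical name, and the width of 17 comes from the column layout described in the question:

```python
def pad_row(row, width=17, fill=None):
    """Return row extended with fill values so that it has at least width cells."""
    return list(row) + [fill] * (width - len(row))

short_row = ["9.99", "1", "2"]   # made-up row with only 3 cells
padded = pad_row(short_row)
print(len(padded))    # 17
print(padded[13])     # None
```

After padding, row[13] no longer raises IndexError; missing cells simply come back as None and can be checked explicitly.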
Thanks everyone for your input - printing the results made me realize that I was getting the IndexError because of the very first row, which only had headers. Skipping that row got rid of the error.
I will look into pandas, it seems like that will be useful for the type of work I am doing.
Thanks again for all of your help, much appreciated.
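For reference, skipping a header row with the csv module might look like the sketch below. The two-column sample text is made up; the real file has 17 columns:

```python
import csv
import io

# Made-up two-column stand-in for the real file: a header line, then data.
sample = io.StringIO("price,a_metric1\n9.99,1\n12.50,2\n")

reader = csv.reader(sample)
header = next(reader, None)   # consume the header row; None if the file is empty
prices = [float(row[0]) for row in reader]
print(header)   # ['price', 'a_metric1']
print(prices)   # [9.99, 12.5]
```

Passing None as the second argument to next() avoids a StopIteration error on an empty file.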

CSV find max in column and append new data

I asked a question about two hours ago regarding the reading and writing of data from a website. I've spent the last two hours since then trying to find a way to read the maximum date value from column 'A' of the output, comparing that value to the refreshed website data, and appending any new data to the csv file without overriding the old ones or creating duplicates.
The code that is currently 100% working is this:
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
    f.write(data.text)
I've tried various ways of finding the maximum value of column 'A'. I've tried a bunch of different ways of using "Dict" and other methods of sorting/finding max, and even using pandas and numpy libs. None of which seem to work. Could someone point me in the direction of a decent way to find the maximum of a column from the .csv file? Thanks!
If you have it in a pandas DataFrame, you can get the max of any column like this:
>>> max(data['time'])
'2012-01-18 15:52:26'
where data is the variable name for the DataFrame and time is the name of the column
I'll give you two answers, one that just returns the max value, and one that returns the row from the CSV that includes the max value.
import csv
import operator as op
import requests
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
csv_file = "trades_{}.csv".format(symbol)
data = requests.get(url)
with open(csv_file, "w") as f:
    f.write(data.text)
with open(csv_file) as f:
    next(f)  # discard first row from file -- see notes
    max_value = max(row[0] for row in csv.reader(f))
with open(csv_file) as f:
    next(f)  # discard first row from file -- see notes
    max_row = max(csv.reader(f), key=op.itemgetter(0))
Notes:
max() can directly consume an iterator, and csv.reader() gives us an iterator, so we can just pass that in. I'm assuming you might need to throw away a header line so I showed how to do that. If you had multiple header lines to discard, you might want to use islice() from the itertools module.
In the first one, we use a "generator expression" to select a single value from each row, and find the max. This is very similar to a "list comprehension" but it doesn't build a whole list, it just lets us iterate over the resulting values. Then max() consumes the iterable and we get the max value.
max() can use a key= argument where you specify a "key function". It will use the key function to get a value and use that value to figure the max... but the value returned by max() will be the unmodified original value (in this case, a row value from the CSV). In this case, the key function is manufactured for you by operator.itemgetter()... you pass in which column you want, and operator.itemgetter() builds a function for you that gets that column.
The resulting function is the equivalent of:
def get_col_0(row):
    return row[0]
max_row = max(csv.reader(f), key=get_col_0)
Or, people will use lambda for this:
max_row = max(csv.reader(f), key=lambda row: row[0])
But I think operator.itemgetter() is convenient and nice to read. And it's fast.
I showed saving the data in a file, then pulling from the file again. If you want to go through the data without saving it anywhere, you just need to iterate over it by lines.
Perhaps something like:
text = data.text
rows = [line.split(',') for line in text.split("\n") if line]
rows.pop(0) # get rid of first row from data
max_value = max(row[0] for row in rows)
max_row = max(rows, key=op.itemgetter(0))
I don't know which column you want... column "A" might be column 0 so I used 0 in the above. Replace the column number as you like.
It seems like something like this should work:
import requests
import csv
symbol = "mtgoxUSD"
url = 'http://api.bitcoincharts.com/v1/trades.csv?symbol={}'.format(symbol)
data = requests.get(url)
with open("trades_{}.csv".format(symbol), "r+") as f:
    all_values = list(csv.reader(f))
    max_value = max([int(row[2]) for row in all_values[1:]])
    # (write out the value?)
EDITS: I used "row[2]" because that was the sample column I was taking max of in my csv. Also, I had to strip off the column headers, which were all text, which was why I looked at "all_values[1:]" from the second row to the end of the file.
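None of the answers shows the append step the question asks about, so here is a rough sketch of one way it could work. It assumes the first column holds an integer timestamp and that the file has no header row; append_new_rows and the demonstration data are made up for illustration:

```python
import csv
import os
import tempfile

def append_new_rows(csv_path, fetched_rows, time_col=0):
    """Append rows from fetched_rows whose timestamp is newer than any in csv_path."""
    with open(csv_path, newline="") as f:
        existing = [int(row[time_col]) for row in csv.reader(f) if row]
    latest = max(existing) if existing else -1

    new_rows = [row for row in fetched_rows if int(row[time_col]) > latest]
    with open(csv_path, "a", newline="") as f:   # "a" appends instead of overwriting
        csv.writer(f).writerows(new_rows)
    return len(new_rows)

# Made-up demonstration data: (timestamp, price, amount) rows.
path = os.path.join(tempfile.mkdtemp(), "trades_demo.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["100", "5.0", "1"], ["200", "5.1", "2"]])

fetched = [["200", "5.1", "2"], ["300", "5.2", "1"]]
added = append_new_rows(path, fetched)
print(added)   # 1
```

Comparing against the latest stored timestamp keeps the file free of duplicates, but it would also drop a genuinely new trade that shares its timestamp with the last stored one; if that can happen, comparing whole rows is safer.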
