How to remove the line connecting discontinuous points in matplotlib.pyplot.plot - python

I have a dataframe with one column that holds the price, and its index is a datetime.
2018-09-18T02:29:56.5 524.6
2018-09-18T02:29:57.0 524.6
2018-09-18T02:29:57.5 524.8
2018-09-18T02:29:59.0 525.1
2018-09-18T02:29:59.5 525.1
2018-09-18T02:30:00.0 524.8
2018-09-19T21:00:00.5 527.1
2018-09-19T21:00:01.0 527.1
2018-09-19T21:00:01.5 527.3
2018-09-19T21:00:02.0 527.7
2018-09-19T21:00:02.5 527.5
2018-09-19T21:00:03.0 527.6
2018-09-19T21:00:03.5 527.4
I'm trying to plot the time series with matplotlib.pyplot.plot(df).
It gives a plot, but with a long straight line connecting the two points on either side of the gap (the last data point at 2018-09-18T02:30:00.0 and the first data point at 2018-09-19T21:00:00.5). Is there a way to remove the connecting line across the gap in the data?

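Sorry, I think I figured out how to do it. Just use
df.set_index(df.index.astype(str)).plot()
Basically, convert the index from datetime to string and use the strings as the x-axis. The points are then spaced evenly along the axis, so the gap no longer shows up as a long line. (I originally wrote df.plot(x=df.index.astype(str)), but newer pandas versions expect x to be a column label or position.)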
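An alternative that keeps a real time axis is to insert a NaN inside each gap, since matplotlib does not draw a line through NaN values. The following is only a sketch: the helper name break_at_gaps and the one-minute threshold are assumptions chosen for illustration, and the series stands in for the price column.

```python
import numpy as np
import pandas as pd

def break_at_gaps(series, max_gap):
    """Insert a NaN point inside every gap longer than max_gap."""
    deltas = series.index.to_series().diff()
    after_gap = series.index[(deltas > max_gap).to_numpy()]
    # Place the NaN half of max_gap before the first point after each gap,
    # which always falls strictly inside the gap.
    fillers = pd.Series(np.nan, index=after_gap - max_gap / 2)
    return pd.concat([series, fillers]).sort_index()

index = pd.to_datetime([
    "2018-09-18T02:29:59.5",
    "2018-09-18T02:30:00.0",
    "2018-09-19T21:00:00.5",
    "2018-09-19T21:00:01.0",
])
price = pd.Series([525.1, 524.8, 527.1, 527.1], index=index)
broken = break_at_gaps(price, pd.Timedelta("1min"))
# broken.plot() would now draw two separate segments with no line across the gap
```

Unlike the string-index approach, the empty stretch of time still takes up space on the x-axis; only the connecting line disappears.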

Related

Struggling to clean the column in pandas

I need help cleaning one column of my data. Each cell in the column contains a mix of dates, times, letters, floating-point numbers and many other types of data. The dtype of this column is 'object'.
What I want to do is remove all the dates, leaving those cells empty, and keep only the times in the entire column. Then I want to insert the average time into the empty cells.
I'm using PyCharm, with pandas to clean the column.
I would imagine you can achieve this with something along the lines of the code below. For the time format, it seems that for your column just checking whether a string contains two colons is enough. You can also specify something more robust:
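def string_splitter(x):
    # split the cell into whitespace-separated pieces
    x = str(x).split()
    y = []
    for stuff in x:
        if stuff.count(":") == 2:  # looks like HH:MM:SS; you can replace this with a more robust pattern for time
            y.append(stuff)
        else:
            y.append("")  # add your own string for indicating empty space
    return " ".join(y)

# apply returns a new Series, so assign it back to keep the result
df['column_name'] = df['column_name'].apply(string_splitter)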
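As one possible "more robust" pattern, a regular expression can require the full HH:MM:SS shape instead of just counting colons. This is a sketch only: the pattern, the name keep_times and the sample cell are assumptions, and unlike the function above it drops non-time pieces entirely rather than leaving empty placeholders.

```python
import re

# One or two digits for the hour, two each for minutes and seconds.
TIME_PATTERN = re.compile(r"\d{1,2}:\d{2}:\d{2}")

def keep_times(cell):
    """Return only the whitespace-separated pieces that look like HH:MM:SS."""
    tokens = str(cell).split()
    return " ".join(t for t in tokens if TIME_PATTERN.fullmatch(t))

sample = "2021-03-01 12:30:45 abc 3.14"
cleaned = keep_times(sample)
```

It can be applied to a column in the same way, e.g. df['column_name'].apply(keep_times).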

bar plot with HH:MM:SS in x-axis

I have the following string in Python (and many more like it):
date = "00:01:43"
which represents hour:minute:seconds. It comes from reading a csv file which contains many such times.
Now I need to collect the values I am reading from the csv into some sort of array and then use it as the x-axis of a bar plot (matplotlib's bar).
The question is how to prepare the times I am reading so they can be used in a bar plot:
tempArray = []
with open('file.csv', 'r') as file:
    for line in file:
        time = line.split(',')[0]  # this is read like "HH:MM:SS"
        temp = line.split(',')[1]  # this is read like "Float as a string"
        tempArray.append(float(temp))
QUESTION
How do I assemble the times into an array to then be used in the following:
plt.bar(timeArray, tempArray)
where the x-axis would still show "HH:MM:SS" format.
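Since it looks like you have already created a list for the temperature data, you can just create another list for the time data in the same loop.
Then use ax.set_xticks with the bar positions as ticks and the time strings as labels to pair them up (passing labels to set_xticks needs a reasonably recent matplotlib):
ax.set_xticks({bar_positions}, {time_data_list})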
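Put together, the idea might look like the sketch below. The sample rows, the helper names read_time_temp and plot_bars, and the use of integer bar positions are all assumptions made for illustration. The time strings are kept as plain labels, so the axis still shows "HH:MM:SS".

```python
import csv
import io

def read_time_temp(lines):
    """Collect the first two CSV fields into a time list and a float list."""
    times, temps = [], []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        times.append(row[0].strip())
        temps.append(float(row[1]))
    return times, temps

# Stand-in for open('file.csv'); the values are made up.
sample = io.StringIO("00:01:43,21.5\n00:01:44,21.7\n00:01:45,21.6\n")
time_array, temp_array = read_time_temp(sample)

def plot_bars(times, temps):
    # Imported here so the parsing above runs even without matplotlib.
    import matplotlib.pyplot as plt
    fig, ax = plt.subplots()
    positions = list(range(len(times)))
    ax.bar(positions, temps)
    ax.set_xticks(positions)
    ax.set_xticklabels(times, rotation=45)
    return ax
```

Calling plot_bars(time_array, temp_array) followed by plt.show() would display the chart. With many rows it may be worth labelling only every n-th bar to keep the axis readable.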

dataframe line plot is not plotting a line with column values

I think there is something wrong with the data in my dataframe, but I am having a hard time coming to a conclusion. I think there might be some missing datetime values in the index of the dataframe. Given that there are over 1000 rows, it isn't possible for me to check each row manually. Here is a picture of my data and the corresponding line plot. Clearly this isn't a line plot!
Is there any way to fill in the possibly missing values in my dataframe somehow?
I also did a line plot in seaborn to get another perspective, but I don't think it was immediately helpful.
You have effectively done the same as what I have simulated below. Your data really has a two-level index: date and age_group. Plotting both groups together as one series means the line jumps between the two. Separate them out, plot them as separate lines, and it looks as you expect.
import numpy as np
import pandas as pd

d = pd.date_range("1-jan-2020", "16-mar-2021")
df = pd.concat([pd.DataFrame({"daily_percent":np.sort(np.random.uniform(0.5,1, len(d)))}, index=d).assign(age_group="0-9 Years"),
                pd.DataFrame({"daily_percent":np.sort(np.random.uniform(0,0.5, len(d)))}, index=d).assign(age_group="20-29 Years")])
# both groups in one series: the line jumps between them
df.plot(kind="line", y="daily_percent", color="red")
# one column per age group: one clean line each
df.set_index("age_group", append=True).unstack(1).droplevel(0, axis=1).plot(kind="line", color=["red","blue"])
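To confirm that repeated dates, rather than missing ones, are the cause, it may help to check the index for duplicates before reshaping. The sketch below uses a small made-up frame with the same column names as the simulation above, and DataFrame.pivot as an alternative way to get one column per group.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2020-01-01", periods=5)
df = pd.concat([
    pd.DataFrame({"daily_percent": np.linspace(0.5, 1.0, 5), "age_group": "0-9 Years"}, index=dates),
    pd.DataFrame({"daily_percent": np.linspace(0.0, 0.5, 5), "age_group": "20-29 Years"}, index=dates),
])

# True when the same date appears more than once in the index.
has_repeats = df.index.duplicated().any()

# One column per age group, indexed by date; wide.plot() draws one line per column.
wide = df.pivot(columns="age_group", values="daily_percent")
```

If has_repeats is False, the jagged plot has some other cause and this reshaping will not help.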

Plot latitude longitude with drop wrong data in rows

Hello, I need help or a clue with my data frame.
I have 319k rows with two columns named 'Latitude' and 'Longtitude'. To check the data I grouped and counted the rows: https://i.stack.imgur.com/IqCka.png , https://i.stack.imgur.com/gg9v0.png
I need to make a scatter plot, but unfortunately I'm very new to Python and I don't know how to select the rows with correct longitude and latitude values, without empty records or wrong data like -1.0000 (screenshot). Latitude and longitude for Boston (MA) are 42... and -72...
I think my plotting code is fine, but I can't filter my data correctly to make it work:
For seaborn, sns.stripplot(x='Latitude', y='Longtitude', data=MojaBaza) currently gives:
https://i.stack.imgur.com/RL1t1.png
For matplotlib, plt.scatter(x=MojaBaza['Longtitude'], y=MojaBaza['Latitude']) gives the error "'value' must be an instance of str or bytes, not a float".
Sorry if my question is stupid, but I really don't know how to handle it.
Greetings
The problem was the data type of the columns.
The solution is to convert them to float:
MojaBaza['Latitude']=MojaBaza['Latitude'].astype('float')
MojaBaza['Longitude']=MojaBaza['Longitude'].astype('float')
In the next step, filter down to the area of interest:
Filtr1 = MojaBaza['Latitude'] > 40
Filtr2 = MojaBaza['Longitude'] < -70
Lokacja = MojaBaza[Filtr1 & Filtr2]
and that did it.
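If some cells are empty or not numeric, astype('float') may raise an error. A hedged variant is sketched below with pd.to_numeric and errors="coerce", which turns unparseable cells into NaN so they can be dropped. The sample values are made up, and the thresholds are the ones used above.

```python
import pandas as pd

# Made-up rows standing in for MojaBaza, stored as strings like the original data.
raw = pd.DataFrame({
    "Latitude": ["42.35", "-1.0", "", "42.36"],
    "Longitude": ["-71.06", "-1.0", "-71.05", "-71.10"],
})

# Unparseable cells become NaN, then rows with any NaN are dropped.
clean = raw.apply(pd.to_numeric, errors="coerce").dropna()

# Keep only plausible coordinates, discarding placeholders such as -1.0.
boston = clean[(clean["Latitude"] > 40) & (clean["Longitude"] < -70)]
```

The filtered frame can then be passed to plt.scatter(boston["Longitude"], boston["Latitude"]).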

ValueError: microsecond must be in 0..999999 When trying to plot a series using scatter plot

I get ValueError: microsecond must be in 0..999999 when I try to plot two series using scatter plot.
I have two datasets (each contains the posts made on a platform, with the time they were created and the number of comments each post received). The goal is to understand at what time of day a post is most likely to attract a large number of comments.
hn_ask_sorted_data = hn_ask_data.sort_values(by = ['num_comments'],ascending=False)
hn_show_sorted_data = hn_show_data.sort_values(by = ['num_comments'],ascending=False)
hn_ask_sorted_data['created_at'] = pd.to_datetime(hn_ask_sorted_data['created_at'])
hn_show_sorted_data['created_at'] = pd.to_datetime(hn_show_sorted_data['created_at'])
I convert the column that contains the time into datetime objects, but I am more interested in the time component, so I keep only that part using .dt.time:
hn_ask_sorted_data['created_at'] = hn_ask_sorted_data['created_at'].dt.time
hn_show_sorted_data['created_at'] = hn_show_sorted_data['created_at'].dt.time
Then I make a scatter plot from two columns: the number of comments on the post and the time at which it was posted (i.e. the column created above). Instead of a plot, I get the error described above.
plt.scatter(hn_ask_sorted_data['created_at'],hn_ask_sorted_data['num_comments'])
plt.show()
plt.scatter(hn_show_sorted_data['created_at'],hn_show_sorted_data['num_comments'])
plt.show()
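One possible workaround, assuming the error comes from plotting datetime.time objects directly, is to convert the time of day to a plain number (hours as a float) and plot that instead. This is only a sketch: the rows below are made up, and hour_of_day is a hypothetical column name.

```python
import pandas as pd

# Made-up stand-in for hn_ask_data, with only the two columns used above.
posts = pd.DataFrame({
    "created_at": ["2016-08-16 09:55:00", "2016-11-22 13:43:30", "2016-05-02 10:15:00"],
    "num_comments": [6, 29, 1],
})

created = pd.to_datetime(posts["created_at"])
# Time of day as fractional hours, e.g. 10:15:00 becomes 10.25.
posts["hour_of_day"] = created.dt.hour + created.dt.minute / 60 + created.dt.second / 3600
```

plt.scatter(posts["hour_of_day"], posts["num_comments"]) then works on ordinary floats, and the x-axis reads as hours from 0 to 24.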
