I have a (dummy) time series dataframe, shown below, for which I am trying to create a line chart with plotly that plots the values of all columns on the y axis, with the index on the x axis. All columns of the dataframe have the same number of rows, and the index has dtype 'datetime64[ns, pytz.FixedOffset(60)]'.
However, when I create the line chart with the code below, I get the following error: "ValueError: All arguments should have the same length. The length of argument y is 5, whereas the length of previously-processed arguments ['time_before_fulfilment'] is 109". I went through other Stack Overflow answers and tried a couple of things, but couldn't solve it.
Could someone kindly help?
# Code to create dummy dataframe
import pandas as pd
import pytz

data = {
    '2001-07-21 10:00:00+05:00': [45, 51, 31, 3],
    '2001-07-21 10:15:00+05:00': [46, 50, 32, 3],
    '2001-07-21 10:30:00+05:00': [47, 51, 34, 7],
    '2001-07-21 10:45:00+05:00': [50, 50, 33, 9]
}
# Create the DataFrame
df = pd.DataFrame(data, index=['2001-07-21 10:45:00+05:00', 'Col2', 'Col3', 'Col4'])
df.index.name = 'date'
df = df.rename_axis(index=None, columns='date').T
df.index = pd.to_datetime(df.index, utc=True)
df.index = df.index.tz_convert(pytz.FixedOffset(60))
# Show the DataFrame
df
Dataframe
                           2001-07-21 10:45:00+05:00  Col2  Col3  Col4
date
2001-07-21 10:00:00+05:00                         45    51    31     3
2001-07-21 10:15:00+05:00                         46    50    32     3
2001-07-21 10:30:00+05:00                         47    51    34     7
2001-07-21 10:45:00+05:00                         50    50    33     9
Code
import plotly.express as px

def plot_graph():
    # px.line expects a boolean for markers
    fig = px.line(df, x=df.index, y=[df.columns[0], 'Col2', 'Col3', 'Col4'], markers=True)
    fig.update_xaxes(
        rangeslider_visible=True,
        rangeselector=dict(
            buttons=list([
                dict(count=1, label="1H", step="hour", stepmode="backward"),
                dict(step="all")
            ])
        )
    )
    fig.show()

plot_graph()
I will answer this question myself so that someone with a similar issue can get some insight. The problem was that the column name '2001-07-21 10:45:00+05:00' was a timestamp, which needed to be converted to type str. Doing that fixed the issue, and the plotly code generated the desired line graph.
Code:
## Convert the column name from type Timestamp to str
timestamp = pd.Timestamp(df.columns[0])
date_string = timestamp.strftime('%Y-%m-%d %H:%M:%S%z')
df = df.rename(columns={df.columns[0]: date_string})
df
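If more than one column label might be a non-string object, a more general variant of this fix is to convert every label in one step. This is only a minimal sketch, not the original code: the small frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose first column label is a Timestamp rather than a str
df = pd.DataFrame(
    {pd.Timestamp('2001-07-21 10:45:00+05:00'): [45, 46], 'Col2': [51, 50]}
)

# Convert every column label to str in one step
df.columns = df.columns.map(str)

print(df.columns.tolist())
```

With string labels in place, y=list(df.columns) can be passed to px.line instead of naming each column by hand.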
I have a dataset containing rows of measurements (weight) in the following format:
user_id, day_n, weight
user_id is the identifier of the user. There are multiple rows with the same user_id.
day_n is the day number on which the measurement is done.
weight is the weight in kg.
For removing outliers or incorrect data, I use a min and max value for both the weight and the day_n column.
For now, I plot all data into a scatter plot.
My question:
How can I only include users, which have their first measurement (weight) between two values (min_weight and max_weight)?
Considering the following example data and min_weight = 70 and max_weight = 75, user_id 1 should be included, but user_id 2 should not.
user_id, day_n, weight
1, 0, 72
1, 28, 70
1, 68, 69
2, 5, 76
2, 28, 80
2, 78, 78
I tried:
I tried to group by user_id and looked for the min() of day_n. I couldn't figure out how to check the weight in this row.
My code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv('../datasets/measurements_2019-12-31_2021-07-01.csv')
# Min and max weights
df = df[df['weight'] > 25]
df = df[df['weight'] < 300]
# Min and max days
df = df[df['day_n'] > -7]
df = df[df['day_n'] < 187]
# Only include users with > 3 measurements
df = df.groupby('user_id').filter(lambda x: len(x) > 3)
# Scatter plot
plt.scatter(df['day_n'], df['weight'])
# Regression line
m, b = np.polyfit(df['day_n'], df['weight'], 1)
plt.plot(df['day_n'], m*df['day_n']+b,color='red')
plt.show()
Sort by user and day, then find the user IDs to keep:
df = pd.DataFrame({'user_id':[1,1,1,2,2,2],
'day_n': [0, 28, 68, 5, 28, 78],
'weight': [72, 70, 69, 76, 80, 78]})
df = df.sort_values(['user_id', 'day_n'])
keep_user = df.groupby('user_id')['weight'].first().between(70, 75)
df.loc[df['user_id'].isin(keep_user[keep_user].index)]
   user_id  day_n  weight
0        1      0      72
1        1     28      70
2        1     68      69
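The same filter can also be written with groupby(...).transform('first'), which broadcasts each user's first weight to all of that user's rows so that a single boolean mask is enough. This is a sketch of an alternative under the same assumptions (rows sorted by day_n within each user, inclusive bounds); the names first_weight and filtered are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 1, 2, 2, 2],
                   'day_n': [0, 28, 68, 5, 28, 78],
                   'weight': [72, 70, 69, 76, 80, 78]})

# Sort so that "first" means the earliest measurement per user
df = df.sort_values(['user_id', 'day_n'])

# Each row receives the first weight of its own user
first_weight = df.groupby('user_id')['weight'].transform('first')

# between() is inclusive on both ends by default
filtered = df[first_weight.between(70, 75)]
print(filtered)
```

Because the mask is aligned with the original rows, this avoids the intermediate isin lookup.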
I have a CSV with two columns, Dates and Profit/Losses, that I have read into a data frame.
import os
import csv
import pandas as pd
cpath = os.path.join('..', 'Resources', 'budget_data.csv')
df = pd.read_csv(cpath)
df["Profit/Losses"]= df["Profit/Losses"].astype(int)
data = pd.DataFrame(
[
["2019-01-01", 40],
["2019-02-01", -5],
["2019-03-01", 15],
],
columns = ["Dates", "Profit/Losses"]
)
I want to know the month-to-month difference in profits and losses (each row being one month), so I thought to use df.diff to calculate the values:
df.diff()
However, this results in errors. I think it is trying to difference the Dates column as well, and I'm not sure how to make it calculate only the profits and losses.
Is this what you are looking for?
import pandas as pd
data = pd.DataFrame(
[
["2019-01-01", 40],
["2019-02-01", -5],
["2019-03-01", 15],
],
columns = ["Dates", "Profit/Losses"]
)
data.assign(Delta=lambda d: d["Profit/Losses"].diff().fillna(0))
Yields
        Dates  Profit/Losses  Delta
0  2019-01-01             40    0.0
1  2019-02-01             -5  -45.0
2  2019-03-01             15   20.0
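The error in the question comes from calling diff() on the whole frame, including the string Dates column. Two ways to restrict the calculation to the numeric column are sketched below, reusing the sample data from above; the variable name monthly is introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame(
    [
        ["2019-01-01", 40],
        ["2019-02-01", -5],
        ["2019-03-01", 15],
    ],
    columns=["Dates", "Profit/Losses"],
)

# Option 1: diff only the numeric column.
# The first row has no predecessor, so its difference is NaN.
data["Delta"] = data["Profit/Losses"].diff()

# Option 2: move the dates into the index so diff() never sees them.
monthly = data.set_index("Dates")["Profit/Losses"].diff()

print(data)
print(monthly)
```

Whether to keep the leading NaN or replace it with 0 via fillna(0) depends on how the first month should be reported.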
Maybe you can do this:
import pandas as pd
x = [[1,2], [1,2], [1,4]]
d = pd.DataFrame(x, columns=['loss', 'profit'])
d.insert(0, "diff", [d['profit'][i] - d['loss'][i] for i in d.index])
d.head()
Gives:
   diff  loss  profit
0     1     1       2
1     1     1       2
2     3     1       4
I have a sample of a dataframe as shown below.
data = {'Date':['2021-07-18','2021-07-19','2021-07-20','2021-07-21','2021-07-22','2021-07-23'],
'Invalid':["NaN", 1, 1, "NaN", "NaN", "NaN"],
'Negative':[23, 24, 17, 24, 20, 23],
'Positive':["NaN", 1, 1, 1, "NaN", 1]}
df_sample = pd.DataFrame(data)
df_sample
The code for displaying a stacked bar graph is given below and also the graph produced by it.
temp = Graph1_df.set_index(['Dates', 'Results']).sort_index(0).unstack()
temp.columns = temp.columns.get_level_values(1)
f, ax = plt.subplots(figsize=(20, 5))
temp.plot.bar(ax=ax, stacked=True, width = 0.3, color=['blue','green','red'])
ax.title.set_text('Total Test Count vs Dates')
plt.show()
Using the code above or with any new approach, I want just the values for 'positive' to be displayed on the chart.
Note: 3rd column in the dataframe snippet is the 'Positive' column.
Any help is greatly appreciated.
Thanks.
Plotting with pandas.DataFrame.plot with kind='bar'
Use .bar_label to add annotations
See this answer for other links and options related to .bar_label
Stacked bar plots are plotted in order from left to right and bottom to top, based on the order of the columns and rows, respectively.
Since 'Positive' is column index 2, we only want labels for i == 2
Tested in pandas 1.3.0 and requires matplotlib >=3.4.2 and python >=3.8
The list comprehension for labels uses an assignment expression, :=, which is only available from python 3.8
labels = [f'{v.get_height():.0f}' if ((v.get_height()) > 0) and (i == 2) else '' for v in c] is the option without :=
.bar_label is only available from matplotlib 3.4.2
This answer shows how to add annotations for matplotlib <3.4.2
import pandas as pd
import numpy as np # used for nan
# test dataframe
data = {'Date':['2021-07-18','2021-07-19','2021-07-20','2021-07-21','2021-07-22','2021-07-23'],
'Invalid':[np.nan, 1, 1, np.nan, np.nan, np.nan],
'Negative':[23, 24, 17, 24, 20, 23],
'Positive':[np.nan, 1, 1, 1, np.nan, 1]}
df = pd.DataFrame(data)
# convert the Date column to a datetime format and use the dt accessor to get only the date component
df.Date = pd.to_datetime(df.Date).dt.date
# set Date as index
df.set_index('Date', inplace=True)
# create multi-index column to match OP image
top = ['Size']
current = df.columns
df.columns = pd.MultiIndex.from_product([top, current], names=['', 'Result'])
# display(df)
              Size
Result     Invalid Negative Positive
Date
2021-07-18     NaN       23      NaN
2021-07-19     1.0       24      1.0
2021-07-20     1.0       17      1.0
2021-07-21     NaN       24      1.0
2021-07-22     NaN       20      NaN
2021-07-23     NaN       23      1.0
# reset the top index to a column
df = df.stack(level=0).rename_axis(['Date', 'Size']).reset_index(level=1)
# if there are many top levels that are reset as a column, then select the data to be plotted
sel = df[df.Size.eq('Size')]
# plot
ax = sel.iloc[:, 1:].plot(kind='bar', stacked=True, figsize=(20, 5), title='Total Test Count vs Dates', color=['blue','green','red'])
# add annotations
for i, c in enumerate(ax.containers):
    # format the labels
    labels = [f'{w:.0f}' if ((w := v.get_height()) > 0) and (i == 2) else '' for v in c]
    # annotate with custom labels
    ax.bar_label(c, labels=labels, label_type='center', fontsize=10)
# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)
I have a data frame like the one below and would like to convert the Latitude and Longitude columns from degree, minute, second format into decimal degrees, keeping the other columns in the updated table.
Any help would be appreciated.
Here is the relevant code. It uses apply with a lambda to process each row of the dataframe and creates a new column, lat_decimal, to hold the result.
# Create dataframe
import pandas as pd

d6 = {'id': ['a1', 'a2', 'a3'],
      'lat_deg': [10, 11, 12],
      'lat_min': [15, 30, 45],
      'lat_sec': [10, 20, 30]}
df6 = pd.DataFrame(data=d6)
# Decimal degrees = degrees + minutes/60 + seconds/3600, computed row by row
df6["lat_decimal"] = df6[["lat_deg","lat_min","lat_sec"]].apply(lambda row: row.values[0] + row.values[1]/60 + row.values[2]/3600, axis=1)
The resulting dataframe:
   id  lat_deg  lat_min  lat_sec  lat_decimal
0  a1       10       15       10    10.252778
1  a2       11       30       20    11.505556
2  a3       12       45       30    12.758333
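The same conversion can be done without apply by operating on whole columns, which is usually shorter and faster on large frames. The sketch below reuses the sample data from above and assumes non-negative degrees, as in the example; coordinates with a negative sign or an S/W hemisphere marker would need the sign handled separately.

```python
import pandas as pd

df6 = pd.DataFrame({'id': ['a1', 'a2', 'a3'],
                    'lat_deg': [10, 11, 12],
                    'lat_min': [15, 30, 45],
                    'lat_sec': [10, 20, 30]})

# Column-wise arithmetic: degrees + minutes/60 + seconds/3600
df6['lat_decimal'] = df6['lat_deg'] + df6['lat_min'] / 60 + df6['lat_sec'] / 3600

print(df6)
```

A longitude column in the same three-part layout could be converted with the same expression.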
Say I have a dataframe like so that I have read in from a file (note: *.ene is a txt file)
df = pd.read_fwf('filename.ene')
TS DENSITY STATS
1
2
3
1
2
3
I would like to only change the TS column. I wish to replace all the column values of 'TS' with the values from range(0,751,125). The desired output should look like so:
TS DENSITY STATS
0
125
250
500
625
750
I'm a bit lost and would like some insight regarding the code to do such a thing in a general format.
I used a for loop to store the values into a list:
K=(6*125)+1
m = []
for i in range(0,K,125):
    m.append(i)
I thought to use .replace like so:
df['TS']=df['TS'].replace(old_value, m, inplace=True)
but was not sure what to put in place of old_value to select all the values of the 'TS' column or if this would even work as a method.
It's pretty straightforward: if you're replacing all the data in the column, and m has exactly one value per row, you just need to do
df['TS'] = m
Example:
import pandas as pd
data = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
df = pd.DataFrame(data, index=[0, 1, 2], columns=['a', 'b', 'c'])
print(df)
#     a   b   c
# 0  10  20  30
# 1  40  50  60
# 2  70  80  90
df['a'] = [1,2,3]
print(df)
#    a   b   c
# 0  1  20  30
# 1  2  50  60
# 2  3  80  90
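Direct assignment of a list raises a ValueError when its length differs from the number of rows. Note that range(0, 751, 125) yields seven values, while the frame in the question shows six rows. One way to avoid the mismatch is to derive the values from len(df), as in this sketch; the frame below is a hypothetical stand-in for the one read from the file, and step is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for the frame read with pd.read_fwf
df = pd.DataFrame({'TS': [1, 2, 3, 1, 2, 3]})

step = 125
# One value per row (0, 125, 250, ...), so the lengths always match
df['TS'] = [step * i for i in range(len(df))]

print(df)
```

This keeps the spacing fixed at 125 and lets the number of rows determine the last value.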