Create values in column after groupby - python

I have a data frame that is obtained after grouping an initial data frame by the 'hour' and 'site' columns. So the current data frame has the 'value' details grouped per 'hour' and 'site'. What I want is to fill every hour that has no 'value' with zero. The 'hour' range is 0-23. How can I do this?
Left is input, right is expected output

You can try this:
import numpy as np
import pandas as pd
raw_df = pd.DataFrame(
    {
        "Hour": [1, 2, 4, 12, 0, 2, 7, 13],
        "Site": ["x", "x", "x", "x", "y", "y", "y", "y"],
        "Value": [1, 1, 1, 1, 1, 1, 1, 1],
    }
)
full_hour = pd.DataFrame(
    {
        "Hour": np.concatenate(
            [range(24) for site_name in raw_df["Site"].unique()]
        ),
        "Site": np.concatenate(
            [[site_name] * 24 for site_name in raw_df["Site"].unique()]
        ),
    }
)
result = full_hour.merge(raw_df, on=["Hour", "Site"], how="left").fillna(0)
Then you can get what you want. But I suggest you include your test data in your question as text instead of an image. Nobody should have to recreate your data by hand, and pasting it as text makes it much easier for others to answer your question.
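As an alternative to building the helper frame by hand, the same gap-filling can be done by reindexing over the full Hour x Site grid. This is a minimal sketch (not part of the original answer), assuming raw_df is the frame defined above:
# Sketch: fill missing hours per site with 0 via a full MultiIndex reindex
full_index = pd.MultiIndex.from_product(
    [range(24), raw_df["Site"].unique()], names=["Hour", "Site"]
)
result_alt = (
    raw_df.set_index(["Hour", "Site"])
          .reindex(full_index, fill_value=0)  # absent (Hour, Site) pairs get Value = 0
          .reset_index()
)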

So if you want to change the value in the hours column to zero wherever the value is not in the range 0-23, here is what to do. I didn't fully understand your question, so I assume this is what you want. I have used a dummy example since you have not provided your own data.
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['10/2/2011', '11/2/2011', '12/2/2011',
                            '13/2/2011', '14/2/2011'],
                   'Product': ['Umbrella', 'Matress', 'Badminton', 'Shuttle', 'ewf'],
                   'Last_Price': [1200, 1500, 1600, 352, 'ee'],
                   'Updated_Price': [12, 24, 0, 1, np.nan],
                   'Discount': [10, 10, 10, 10, 11]})
df['Updated_Price'] = df['Updated_Price'].fillna(0)
df.loc[df['Updated_Price'] > 23, 'Updated_Price'] = 0
This replaces all NaN values with 0 and also replaces any value greater than 23 with 0.
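As a quick sanity check (a sketch based on the dummy df above, not part of the original answer):
# With the dummy data above, Updated_Price goes from [12, 24, 0, 1, NaN]
# to [12.0, 0.0, 0.0, 1.0, 0.0]: both the NaN and the 24 become 0.
print(df['Updated_Price'].tolist())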

Related

Time between status reports with pyspark

I have a dataset that contains the status changes of one of our company's systems.
I am only able to use PySpark to process this data.
Each row in the dataset is a status change. There is a status column and an update timestamp.
Status   timestamp
red      2023-01-02T01:05:32.113Z
yellow   2023-01-02T01:15:47.329Z
red      2023-01-02T01:25:11.257Z
green    2023-01-02T01:33:12.187Z
red      2023-01-05T15:10:12.854Z
green    2023-01-05T15:26:24.131Z
For the sake of what I need to do, we are going to say a degradation runs from the first time the system reports anything other than green to the time it reports green again. What I am trying to do is create a table of degradations with the duration of each one, e.g.:
degradation     duration   start                       end
degradation 1   27.65      2023-01-02T01:05:32.113Z    2023-01-02T01:33:12.187Z
degradation 2   16.2       2023-01-05T15:10:12.854Z    2023-01-05T15:26:24.131Z
I can get PySpark to return durations between two timestamps without an issue; what I am struggling with is getting PySpark to use the timestamp from the first red up to the following green and then log that as a row in a new df.
Any help is appreciated. Thank you.
I have one solution. I don't know if it is the easiest and fastest way to calculate what you want. For me the problem is that these data are only valid when they are not partitioned and are in the correct order, which forces me to move all of them to one partition, at least in the first stage.
What I am doing here:
I am using one big window with a lag function and a sum to calculate new partitions. In this case a new partition starts at each occurrence of a record with status = 'green'.
Then I am using a group by to find the first/last event within each partition and calculate the diff.
import datetime

import pyspark.sql.functions as F
from pyspark.sql import Window

df = [
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 2, 1, 5, 32, 113)},
    {"Status": "yellow", "timestamp": datetime.datetime(2023, 1, 2, 1, 15, 47, 329)},
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 2, 1, 25, 11, 257)},
    {"Status": "green", "timestamp": datetime.datetime(2023, 1, 2, 1, 33, 12, 187)},
    {"Status": "red", "timestamp": datetime.datetime(2023, 1, 5, 15, 10, 12, 854)},
    {"Status": "green", "timestamp": datetime.datetime(2023, 1, 5, 15, 26, 24, 131)},
]
df = spark.createDataFrame(df)

# single unpartitioned window ordered by time
windowSpec = Window.partitionBy().orderBy("timestamp")

df.withColumn(
    "partition_number",
    # start a new group every time the previous row was 'green'
    F.sum(
        (F.coalesce(F.lag("Status").over(windowSpec), F.lit(0)) == F.lit("green")).cast("int")
    ).over(windowSpec),
).groupBy("partition_number").agg(
    F.first("timestamp", ignorenulls=True).alias("start"),
    F.last("timestamp", ignorenulls=True).alias("end"),
).withColumn(
    "duration",
    F.round((F.col("end").cast("long") - F.col("start").cast("long")) / 60, 2),
).withColumn(
    "degradation", F.concat(F.lit("Degradation"), F.col("partition_number"))
).select(
    "degradation", "duration", "start", "end"
).show(
    truncate=False
)
output is
+------------+--------+--------------------------+--------------------------+
|degradation |duration|start |end |
+------------+--------+--------------------------+--------------------------+
|Degradation0|27.67 |2023-01-02 01:05:32.000113|2023-01-02 01:33:12.000187|
|Degradation1|16.2 |2023-01-05 15:10:12.000854|2023-01-05 15:26:24.000131|
+------------+--------+--------------------------+--------------------------+
If needed, you can change the precision of the duration or adjust this code to start counting degradations from 1 instead of 0.
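For example, a minimal sketch of those adjustments (my illustration, not part of the original answer; it assumes the chained query above is split so that agg_df holds the output of the groupBy/agg step):
labelled = (
    agg_df  # hypothetical name for the result of the groupBy("partition_number").agg(...) step
    .withColumn(
        "duration",
        F.round((F.col("end").cast("long") - F.col("start").cast("long")) / 60, 0),  # whole minutes
    )
    .withColumn(
        "degradation",
        F.concat(F.lit("Degradation"), F.col("partition_number") + F.lit(1)),  # count from 1
    )
    .select("degradation", "duration", "start", "end")
)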
You can flag the rows whose status is green first, then use lag to shift that flag down one row. Then you can separate the degradations by summing the flag over a window.
import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy().orderBy("timestamp")

df.withColumn('status', f.lag((f.col('status') == f.lit('green')).cast('int'), 1, 0).over(w)) \
  .withColumn('status', f.sum('status').over(w) + 1) \
  .groupBy('status') \
  .agg(
      f.first('timestamp').alias('start'),
      f.last('timestamp').alias('end')
  ) \
  .select(
      f.concat(f.lit('Degradation'), f.col('status')).alias('degradation'),
      f.round((f.col('end') - f.col('start')).cast('long') / 60, 2).alias('duration'),
      'start',
      'end'
  ) \
  .show(truncate=False)
+------------+--------+-----------------------+-----------------------+
|degradation |duration|start |end |
+------------+--------+-----------------------+-----------------------+
|Degradation1|27.67 |2023-01-02 01:05:32.113|2023-01-02 01:33:12.187|
|Degradation2|16.18 |2023-01-05 15:10:12.854|2023-01-05 15:26:24.131|
+------------+--------+-----------------------+-----------------------+

How to convert a dataframe from a GridDB container to a list of lists?

I am using the GridDB Python Client and I have a container that stores my database, and I need the dataframe object converted to a list of lists. The read_sql_query function offered by pandas returns a dataframe, but I need the result as a list of lists instead. The first element in the list of lists is the header of the dataframe (the column names) and the other elements are the rows of the dataframe. Is there a way I could do that? Please help.
Here is the code for the container and the part where the program reads SQL queries:
#...
import griddb_python as griddb
import pandas as pd
from pprint import pprint

factory = griddb.StoreFactory.get_instance()

# Initialize container
try:
    gridstore = factory.get_store(host="127.0.0.1", port="8080",
                                  cluster_name="Cls36", username="root",
                                  password="")

    conInfo = griddb.ContainerInfo("Fresher_Students",
                                   [["id", griddb.Type.INTEGER],
                                    ["First Name", griddb.Type.STRING],
                                    ["Last Name", griddb.Type.STRING],
                                    ["Gender", griddb.Type.STRING],
                                    ["Course", griddb.Type.STRING]],
                                   griddb.ContainerType.COLLECTION, True)
    cont = gridstore.put_container(conInfo)
    cont.create_index("id", griddb.IndexType.DEFAULT)

    data = pd.read_csv("fresher_students.csv")
    # Add data
    for i in range(len(data)):
        ret = cont.put(data.iloc[i, :])
    print("Data added successfully")
except griddb.GSException as e:
    print(e)

sql_statement = ('SELECT * FROM Fresher_Students')
sql_query = pd.read_sql_query(sql_statement, cont)

def convert_to_lol(query):
    # Code goes here
    # ...

LOL = convert_to_lol(sql_query.head())  # not Laughing Out Loud, it's List Of Lists
pprint(LOL)
#...
I want to get something that looks like this:
[["id", "First Name", "Last Name", "Gender", "Course"],
[0, "Catherine", "Ghua", "F", "EEE"],
[1, "Jake", "Jonathan", "M", "BMS"],
[2, "Paul", "Smith", "M", "BFA"],
[3, "Laura", "Williams", "F", "MBBS"],
[4, "Felix", "Johnson", "M", "BSW"],
[5, "Vivian", "Davis", "F", "BFD"]]
[UPDATED]
The easiest way I know of (works for any DF):
df = pd.DataFrame({'id':[2, 3 ,4], 'age':[24, 42, 13]})
[df.columns.tolist()] + df.reset_index().values.tolist()
output:
[['id', 'age'], [0, 2, 24], [1, 3, 42], [2, 4, 13]]
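Note that reset_index().values.tolist() also prepends the positional index to each row, which is why the rows above start with 0, 1, 2. If you only want the original columns, a variant without the index (a minimal sketch assuming the same df) is:
# Sketch: list of lists without the positional index column
[df.columns.tolist()] + df.values.tolist()
# -> [['id', 'age'], [2, 24], [3, 42], [4, 13]]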

How to convert pandas dataframe into multi level JSON with headers?

I have a pandas dataframe that I would like to convert to JSON for my source system to use, and that system requires a very specific JSON format.
I can't seem to get the exact format shown in the expected output section using simple dictionary loops.
Is there any way I can convert a csv/pd.DataFrame to nested JSON?
Is there any Python package built specifically for this?
Input Dataframe:
# Create input dataframe
data = {
    'col6': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col7': [1, 1, 2, 1, 2, 2],
    'col8': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col10': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col14': [1, 1, 1, 1, 1, 2],
    'col15': [1, 2, 1, 1, 1, 1],
    'col16': [9, 10, 26, 9, 12, 4],
    'col18': [1, 1, 2, 1, 2, 3],
    'col1': ['xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx', 'xxxx'],
    'col2': [2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13, 2.02011E+13],
    'col3': ['xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012', 'xxxx20201107023012'],
    'col4': ['yyyy', 'yyyy', 'yyyy', 'yyyy', 'yyyy', 'yyyy'],
    'col5': [0, 0, 0, 0, 0, 0],
    'col9': ['A', 'A', 'A', 'B', 'B', 'B'],
    'col11': [0, 0, 0, 0, 0, 0],
    'col12': [0, 0, 0, 0, 0, 0],
    'col13': [0, 0, 0, 0, 0, 0],
    'col17': [51, 63, 47, 59, 53, 56]
}
pd.DataFrame(data)
Expected Output:
{
    "header1": {
        "col1": "xxxx",
        "col2": "20201107023012",
        "col3": "xxxx20201107023012",
        "col4": "yyyy",
        "col5": "0"
    },
    "header2": {
        "header3": [
            {
                col6: A,
                col7: 1,
                header4: [
                    {
                        col8: "A",
                        col9: 1,
                        col10: "A",
                        col11: 0,
                        col12: 0,
                        col13: 0,
                        "header5": [
                            {
                                col14: "1",
                                col15: 1,
                                col16: 1,
                                col17: 51,
                                col18: 1
                            },
                            {
                                col14: "1",
                                col15: 1,
                                col16: 2,
                                col17: 63,
                                col18: 2
                            }
                        ]
                    },
                    {
                        col8: "A",
                        col9: 1,
                        col10: "A",
                        col11: 0,
                        col12: 0,
                        col13: 0,
                        "header5": [
                            {
                                col14: "1",
                                col15: 1,
                                col16: 1,
                                col17: 51,
                                col18: 1
                            },
                            {
                                col14: "1",
                                col15: 1,
                                col16: 2,
                                col17: 63,
                                col18: 2
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
Maybe this will get you started. I'm not aware of a current python module that will do what you want but this is the basis of how I'd start it. Making assumptions based on what you've provided.
As each successive nest is based on some criteria, you'll need to loop through filtered dataframes. Depending on the size of your dataframes, using groupby may be a better option than what I have here, but the theory is the same (see the groupby sketch after the code). Also, you'll have to create your key-value pairs correctly; this just creates the data to support what you are building.
# assume header 1 is constant so take first row and use .T to transpose to create dictionaries
header1 = dict(df.iloc[0].T[['col1', 'col2', 'col3', 'col4', 'col5']])
print('header1', header1)

# for header three, looks like you need the unique combinations so create dataframe
# and then iterate through to get all the header3 dictionaries
header3_dicts = []
dfh3 = df[['col6', 'col7']].drop_duplicates().reset_index(drop=True)
for i in range(dfh3.shape[0]):
    header3_dicts.append(dict(dfh3.iloc[i].T[['col6', 'col7']]))
print('header3', header3_dicts)

# iterate over header3 to get header 4
for i in range(dfh3.shape[0]):
    # print(dfh3.iat[i, 0], dfh3.iat[i, 1])
    dfh4 = df.loc[(df['col6'] == dfh3.iat[i, 0]) & (df['col7'] == dfh3.iat[i, 1])]
    header4_dicts = []
    for j in range(dfh4.shape[0]):
        header4_dicts.append(dict(df.iloc[j].T[['col8', 'col9', 'col10', 'col11', 'col12', 'col13']]))
    print('header4', header4_dicts)

# next level repeat similar to above
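If groupby turns out to be the better fit, here is a compact sketch of the same idea (my illustration, not a drop-in replacement for the exact required format; it makes the same assumptions about which columns belong to each header level):
import json

import pandas as pd

df = pd.DataFrame(data)  # the input dataframe from the question

# header1: constant columns taken from the first row
header1 = df.iloc[0][['col1', 'col2', 'col3', 'col4', 'col5']].to_dict()

# nest header3 -> header4 -> header5 with groupby instead of filtered loops
header3 = []
for (c6, c7), g3 in df.groupby(['col6', 'col7']):
    header4 = []
    for (c8, c9, c10, c11, c12, c13), g4 in g3.groupby(
        ['col8', 'col9', 'col10', 'col11', 'col12', 'col13']
    ):
        header5 = g4[['col14', 'col15', 'col16', 'col17', 'col18']].to_dict('records')
        header4.append({'col8': c8, 'col9': c9, 'col10': c10,
                        'col11': c11, 'col12': c12, 'col13': c13,
                        'header5': header5})
    header3.append({'col6': c6, 'col7': c7, 'header4': header4})

result = {'header1': header1, 'header2': {'header3': header3}}
print(json.dumps(result, indent=2, default=str))  # default=str handles numpy scalar types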

Change bar colors in pandas matplotlib bar chart by passing a list/tuple

There are several threads on this topic, but none of them seem to directly address my question. I would like to plot a bar chart from a pandas dataframe with a custom color scheme that does not rely on a colormap, e.g. an arbitrary list of colors. It looks like I can pass a concatenated string of color shorthand names (first example below). When I use the suggestion here, the first color is repeated (see the second example below). There is a comment in that post which alludes to the same behavior I am observing. Of course, I could do this by setting up the subplot myself, but I'm lazy and want to do it in one line. So, I'd like to use the final example, where I pass in a list of hex codes and have it work as expected. I'm using pandas >=0.24 and matplotlib >1.5. My questions are:
Why does this happen?
What am I doing wrong?
Can I pass a list of colors?
pd.DataFrame( [ 1, 2, 3, 4, 5 ] ).plot( kind="bar", color="brgmk" )
pd.DataFrame( [ 1, 2, 3, 4, 5 ] ).plot( kind="bar", color=[ "b", "r", "g", "m", "k" ] )
pd.DataFrame( [ 1, 2, 3, 4, 5 ] ).plot( kind="bar", color=[ "#0000FF", "#FF0000", "#008000", "#FF00FF", "#000000" ] )
When plotting a dataframe, the first color information is used for the first column, the second for the second column etc. Color information may be just one value that is then used for all rows of this column, or multiple values that are used one-by-one for each row of the column (repeated from the beginning if more rows than colors). See the following example:
pd.DataFrame( [[ 1, 4], [2, 5], [3, 6]] ).plot(kind="bar", color=[[ "b", "r", "g" ], "m"] )
So in your case you just need to put the list of color values in a list (specifically not a tuple):
pd.DataFrame( [ 1, 2, 3, 4, 5 ] ).plot( kind="bar", color=[[ "b", "r", "g", "m", "k" ]] )
or
pd.DataFrame( [ 1, 2, 3, 4, 5 ] ).plot( kind="bar", color=[[ "#0000FF", "#FF0000", "#008000", "#FF00FF", "#000000" ]] )
The first case in the OP (color="brgmk") works as expected as pandas internally puts the color string in a list (strings are not considered list-like).
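For reference, here is a self-contained version of the working call (a minimal sketch; it only draws the figure, nothing else is assumed):
# Minimal runnable sketch of the nested-list color argument described above
import matplotlib.pyplot as plt
import pandas as pd

pd.DataFrame([1, 2, 3, 4, 5]).plot(
    kind="bar",
    color=[["b", "r", "g", "m", "k"]],  # outer list = one entry per column, inner list = per-bar colors
)
plt.show()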

Pandas ignore missing dates to find percentiles

I have a dataframe. I am trying to find percentiles of datetimes. I am using the function:
Dataframe:
student, attempts, time
student 1,14, 9/3/2019 12:32:32 AM
student 2,2, 9/3/2019 9:37:14 PM
student 3, 5
student 4, 16, 9/5/2019 8:58:14 PM
studentInfo2 = [14, 4, Timestamp('2019-09-04 00:26:36')]
data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(data['time'].notnull(), student2Info[2], 'rank')
where student2Info[2] holds the datetime for a particular student. When I try and do this I get the error:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Any ideas on how I can get the percentile to calculate correctly even when there are missing times in the columns?
You need to transform the Timestamps into units that percentileofscore can understand. Also, pd.DataFrame.notnull() returns a boolean mask that you can use to filter your DataFrame; it does not return the filtered data itself, so I've updated that for you. Here is a working example:
import pandas as pd
import scipy.stats as stats

data = pd.DataFrame.from_dict({
    "student": [1, 2, 3, 4],
    "attempts": [14, 2, 5, 16],
    "time_0001": [
        "9/3/2019 12:32:32 AM",
        "9/3/2019 9:37:14 PM",
        "",
        "9/5/2019 8:58:14 PM"
    ]
})
student2Info = [14, 4, pd.Timestamp('2019-09-04 00:26:36')]

data['time'] = pd.to_datetime(data['time_0001'], errors='coerce')
perc1_first = stats.percentileofscore(
    data[data['time'].notnull()].time.transform(pd.Timestamp.toordinal),
    student2Info[2].toordinal(),
    'rank'
)
print(perc1_first)  # -> 66.66666666666667
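As a rough cross-check (my sketch, not part of the original answer), a plain pandas 'weak' percentile over the non-missing times gives the same figure here, without the ordinal conversion:
# Rough cross-check: share of non-missing times at or before the student's time
valid_times = data['time'].dropna()
perc_weak = (valid_times <= student2Info[2]).mean() * 100
print(perc_weak)  # ~66.67 for the example data (no ties, so it matches the 'rank' result above)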
