Python get specific value from HDF table

I have two tables. The first one contains 300 rows; each row represents a case, with 3 columns holding 2 constant values that characterise the case plus the case number. The second table is my data table collected from sensors and contains the same indicators as the first, except for the case column. The idea is to detect which case each line of the second table belongs to, knowing that the data are not identical to the first table but fall within its range.
example:
First table:
[[1500, 22, 0], [1100, 40, 1], [2400, 19, 2]]
columns = ['analog', 'temperature', 'case']
Second table:
[[1420, 20], [1000, 39], [2300, 29]]
columns = ['analog', 'temperature']
I want to detect which case my first row (1420, 20) belongs to.

You can simply use a classifier, k-NN for instance:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# first table: the known cases
df = pd.DataFrame([[1500, 22, 0], [1100, 40, 1], [2400, 19, 2]],
                  columns=['analog', 'temperature', 'case'])
# second table: sensor readings to classify
df1 = pd.DataFrame([[1420, 20], [1000, 39], [2300, 29]],
                   columns=['analog', 'temperature'])

# 1-nearest-neighbour with Euclidean distance (Minkowski with p=2)
classifier = KNeighborsClassifier(n_neighbors=1, metric='minkowski', p=2)
classifier.fit(df[['analog', 'temperature']], df['case'])
df1['case'] = classifier.predict(df1)
Output of df1:
   analog  temperature  case
0    1420           20     0
1    1000           39     1
2    2300           29     2
So the first row (1420, 20) in df1 (the second table) belongs to case 0.
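As an optional sanity check (not part of the original answer), scikit-learn's kneighbors method can also report which reference row each reading was matched to and how far away it is:

dist, idx = classifier.kneighbors(df1[['analog', 'temperature']], n_neighbors=1)
df1['matched_row'] = idx.ravel()   # positional index of the matched row in df
df1['distance'] = dist.ravel()     # Euclidean distance to that row

Note that 'analog' is two orders of magnitude larger than 'temperature', so the distance is dominated by 'analog'; if temperature should carry comparable weight, scaling the features (for example with sklearn's StandardScaler) before fitting is worth considering.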

What do you mean by belong? [1420, 20] belongs to [?,?,?]?

Pandas - Add a new column extracting value from arrays based on other column value

I am currently stuck trying to extract a value from a list/array depending on the values in a dataframe.
Imagine I have this array. I create it manually, so I can arrange the numbers any way I want; I just thought this Python list was the most convenient form, but I can change it to anything here:
value = [[30, 120, 600, 3000], [15, 60, 300, 1500], [30, 120, 600, 3000], [10, 40, 200, 1000],[10, 40, 200, 1000], [10, 40, 200, 1000], [10, 40, 200, 1000], [5, 20, 100, 500]]
I also have a data frame that comes from much bigger/dynamic processing, with two columns of int type. Here is code to recreate those 2 columns as an example.
The possible values of id1 go from 0 to 6 and those of id2 from 0 to 3:
data = {'id1': [4, 2, 6, 6], 'id2': [1, 2, 3, 1]}
df = pd.DataFrame(data)
What I want to do is add an additional column to the dataframe df, based on the value in the array selected by the two id columns.
So, for example, the first row of the data frame will take value[4][1] = 40, ending up with a dataframe like this:
result = {'id1': [4, 2, 6, 6], 'id2': [1, 2, 3, 1], 'matched value': [40, 600, 1000, 40]}
dfresult = pd.DataFrame(result)
I am a bit lost on what the best way to achieve this is.
What comes to mind is a fairly brute-force solution: take the values of the multidimensional array and create a single flat list of all the possible 7*4 combinations, add a new column to the data frame that concatenates the two ids, and then do a straight join on that key. This would likely work here because the possible combinations are few, but I am certain there is a learning opportunity to use the lists dynamically that escapes me! (A cleaner version of this join idea is sketched after the answer below.)
You can use a list comprehension to iterate over the id pairs and retrieve the corresponding value for each pair:
df['matched_val'] = [value[i][j] for i, j in zip(df['id1'], df['id2'])]
Or a better solution with numpy indexing but applicable only if the sub-lists inside value are of equal length:
df['matched_val'] = np.array(value)[df['id1'], df['id2']]
Result:
   id1  id2  matched_val
0    4    1           40
1    2    2          600
2    6    3         1000
3    6    1           40
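For comparison, here is a hedged sketch of the merge-based idea described in the question, assuming the same value list and df as above; flattening the lookup into a long table makes the concatenated-key step unnecessary:

import pandas as pd

# flatten the 2-D lookup into one row per (id1, id2, matched_value) combination
lookup = pd.DataFrame(
    [(i, j, v) for i, row in enumerate(value) for j, v in enumerate(row)],
    columns=['id1', 'id2', 'matched_value'])

# plain left join on the two id columns
df = df.merge(lookup, on=['id1', 'id2'], how='left')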

use a pandas cell as a bucket and fill it up with other column values

The Problem
Figure out how to generate a new row when the sum of two values from another DF is larger than the specified max capacity of that row, and store the new values in the new row.
Background
A truck plans to visit two customers to deliver some packages, forming two routes (see df_routes). The truck has a max capacity of how many packages it can take per route (column truck_max_capacity). On the date of delivery, we get an update on how many packages each customer actually wants (see df_actuals), which differs from our initial plan.
In our scenario here, the actual values in df_actuals are larger than the planned ones, so a new row for an extra route needs to be created so we can send a truck to pick up the remaining packages. The wanted output should in theory look like df_new_routes below.
My idea for a solution:
A function that could take:
The sum of the plan for customers A and B in df_routes: 11 + 12 = 23
The new actual values from df_actuals: 14 + 13 = 27
Look at the truck_max_capacity column, which says the max should be 24:
We take 14 from customer A and 10 from customer B
Generate a new row for customer B for 27 - 24 = 3 (the rest value)
My challenges with this:
I'm not sure how to complete the above steps so that a new row is generated with the rest value of 3.
Am I overcomplicating this? Is there an easier way to check whether something exceeds the capacity and then create a new row with the "rest" value of 3 here? (A sketch implementing these steps follows the reproduction code below.)
Code to reproduce dataframes:
import numpy as np
import pandas as pd

df_routes = pd.DataFrame({
    'date': ['2021-12-24', '2021-12-25'],
    'customer_1_id': ['A', 'A'],
    'customer_2_id': ['B', 'B'],
    'customer_1_planned_packages': [11, 10],
    'customer_2_planned_packages': [12, 10],
    'total_pallets': [23, 20],
    'truck_max_capacity': [24, 24],
    'cost_route': [120, 120]
})
df_actuals = pd.DataFrame({
    'date': ['2021-12-24', '2021-12-24', '2021-12-25', '2021-12-25'],
    'shipper_id': ['A', 'B', 'A', 'B'],
    'actual_packages': [14, 13, 14, 13]
})
df_new_routes = pd.DataFrame({
    'date': ['2021-12-24', '2021-12-24', '2021-12-25', '2021-12-25'],
    'customer_1_id': ['A', 'B', 'A', 'B'],
    'customer_2_id': ['B', np.nan, 'B', np.nan],
    'customer_1_actual_packages': [14, 3, 14, 3],
    'customer_2_actual_packages': [10, np.nan, 10, np.nan],
    'total_pallets': [24, 3, 24, 3],  # sum of the customer actual packages per route
    'truck_max_capacity': [24, 24, 24, 24],
    'cost_route': [120, 130, 120, 130]
})
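No answer was posted to this question; the following is a minimal sketch of the steps described above, under two assumptions that are not stated in the question: customer 1 is loaded first and any overflow is pushed onto customer 2, and the cost of the extra route simply reuses the original cost_route (the 130 in df_new_routes is not explained, so it is left as a placeholder).

def split_routes(df_routes, df_actuals):
    """Load customer 1 first, then customer 2 up to truck_max_capacity;
    any remaining packages become an extra route row."""
    # actual packages looked up by (date, customer)
    actuals = df_actuals.set_index(['date', 'shipper_id'])['actual_packages']
    rows = []
    for _, r in df_routes.iterrows():
        cap = r['truck_max_capacity']
        a1 = actuals.loc[(r['date'], r['customer_1_id'])]
        a2 = actuals.loc[(r['date'], r['customer_2_id'])]
        take1 = min(a1, cap)            # customer 1 loads first
        take2 = min(a2, cap - take1)    # customer 2 fills what is left
        rows.append({'date': r['date'],
                     'customer_1_id': r['customer_1_id'],
                     'customer_2_id': r['customer_2_id'],
                     'customer_1_actual_packages': take1,
                     'customer_2_actual_packages': take2,
                     'total_pallets': take1 + take2,
                     'truck_max_capacity': cap,
                     'cost_route': r['cost_route']})
        rest = (a1 - take1) + (a2 - take2)   # packages that did not fit
        if rest > 0:
            rows.append({'date': r['date'],
                         'customer_1_id': r['customer_2_id'],   # overflow assumed to come from customer 2
                         'customer_2_id': np.nan,
                         'customer_1_actual_packages': rest,
                         'customer_2_actual_packages': np.nan,
                         'total_pallets': rest,
                         'truck_max_capacity': cap,
                         'cost_route': r['cost_route']})        # placeholder: cost of the extra route is unknown
    return pd.DataFrame(rows)

df_new_routes_generated = split_routes(df_routes, df_actuals)

With the sample data this reproduces df_new_routes above, except for the cost of the extra rows.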

Remove first two valid data points from time series data in wide format

I have data where each row is a customer and the values are the quantities they bought. There are 12 columns in the data, from Jan 2018 to Dec 2018 (each column is a month).
Let us say for customer X1, my data starts in June 2018 so first 5 columns of this row are empty.
For customer X2, my data starts in Aug 2018 so first 7 columns of this row are empty.
For customer X3, my data starts in Jan 2018 so all of the columns have data points.
For each row (i.e. every customer), I want to delete the first 2 valid data points and make them null.
import numpy as np
import pandas as pd

df = pd.DataFrame({'Jan-18': [np.nan, np.nan, 15],
                   'Feb-18': [np.nan, np.nan, 20],
                   'Mar-18': [np.nan, np.nan, 15],
                   'Apr-18': [np.nan, np.nan, 20],
                   'May-18': [np.nan, np.nan, 15],
                   'Jun-18': [2, np.nan, 20],
                   'Jul-18': [5, np.nan, 15],
                   'Aug-18': [10, 10, 20],
                   'Sep-18': [15, np.nan, 15],
                   'Oct-18': [20, 15, 20],
                   'Nov-18': [25, 20, 15],
                   'Dec-18': [30, 20, 20]})
output_df = pd.DataFrame({'Jan-18': [np.nan, np.nan, 15],
                          'Feb-18': [np.nan, np.nan, 20],
                          'Mar-18': [np.nan, np.nan, 15],
                          'Apr-18': [np.nan, np.nan, 20],
                          'May-18': [np.nan, np.nan, 15],
                          'Jun-18': [np.nan, np.nan, 20],
                          'Jul-18': [np.nan, np.nan, 15],
                          'Aug-18': [10, np.nan, 20],
                          'Sep-18': [15, np.nan, 15],
                          'Oct-18': [20, np.nan, 20],
                          'Nov-18': [25, 20, 15],
                          'Dec-18': [30, 20, 20]})
So for X1, I delete June and July (both were valid data points, i.e. not null) and the data will start from August.
For X2, I delete August; there was no data for Sept, but there is data for Oct, so I have to delete both August and Oct.
For X3, since I don't know when exactly in the past it became my customer, I don't want to delete anything. [I can calculate the count for every row and filter out rows with count 12, so no deletion happens there.]
I have thought about using count and shape to find the number of null values in every row: df.shape[1] - df.count(axis=1).
But I am not sure how to delete the first 2 valid data points in every row. Any help is appreciated.
# script (after using the provided code to generate `df`):
x, y = np.nonzero(df.notnull().values)
loc = pd.DataFrame({'x': x, 'y': y}).groupby('x').head(2)
xnull, ynull = zip(*loc.groupby('x').filter(lambda p: list(p.y) != [0, 1]).values)
for i, j in zip(xnull, ynull):
    df.iat[i, j] = np.nan
first the x & y index values are obtained for the dataframe having non-null values.
for each x coord, the first two y values are taken to form a dataframe loc with 6 rows.
loc is filtered to remove rows where the y coords are 0 & 1, i.e. the first non-null values are found in the first two columns of the original dataframe.
the filtered locations are split into the x & y coordinates that will be set to null.
the values in the dataframe are overwritten with null. note that this mutates the original dataframe, so if the original dataframe is required, ensure that a copy is made & modified instead.
notes:
the filter would null-out the case where a customer made a purchase in the first or second month & never purchased again.
xnull & ynull are tuples of row and column positions. They are assigned one (row, column) pair at a time with .iat, because passing both as lists to .iloc would select (and overwrite) the whole cross-product block rather than the individual cells. Note that list(p.y) in the filter aggregates the values in column y into a list.
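For what it is worth, a shorter alternative along the same lines (not from the original answer, so treat it as a sketch): mark the first two valid cells per row with a cumulative count, skip rows that already have all 12 months of data, and mask those cells.

valid = df.notna()
first_two = valid & (valid.cumsum(axis=1) <= 2)   # the first two valid cells in each row
first_two.loc[valid.all(axis=1)] = False          # leave customers with a full 12 months untouched
output_df = df.mask(first_two)                    # set the marked cells to NaN

Unlike the in-place assignment above, this leaves df itself unchanged.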

Issue in creating a DataFrame by passing a list

After doing a few data manipulations I got 2 lists, avglist and sumlist,
and I passed these 2 lists to my result_df:
result_df = pd.DataFrame({"File Name": filelist, "Average": avglist, "Sum": sumlist})
print(result_df)
Below is my output, but the problem is:
1) even the header Continental AG and the dtype info are included;
I just need my values, "874" and 584, in the Sum column.
I tried avglist.value(), but .value is not a list function.
I also tried a few variations with .index but did not get the expected result.
Am I missing any steps here?
There is something wrong with how you're importing your files. If you take a .sum() of your dataframe, it will give you back the sum of the columns. I suspect you may be doing that, since you are summing a dataframe. Then when you try to put that list in another dataframe, it looks funky.
Let's take the following two dataframes:
import pandas as pd

df = pd.DataFrame({'a': [1, 20, 30, 4, 0],
                   'b': [1, 0, 3, 4, 0],
                   'c': [1, 3, 7, 7, 5],
                   'd': [1, 8, 3, 8, 5],
                   'e': [1, 11, 3, 4, 0]})
df2 = pd.DataFrame({'a': [1, 20, 100, 4, 0],
                    'b': [1, 0, 39, 49, 10],
                    'c': [1, 3, 97, 7, 95],
                    'd': [441, 38, 23, 8, 115],
                    'e': [1, 11, 13, 114, 0]})
looking at the sum of one of these dataframes:
df.sum()
a 55
b 8
c 23
d 25
e 19
dtype: int64
now if we were to take the sums of dataframes and put them in a list:
sums = [x.sum() for x in [df, df2]]
when we inspect this we get:
[a 55
b 8
c 23
d 25
e 19
dtype: int64, a 125
b 99
c 203
d 625
e 139
dtype: int64]
if you want the sum of the whole dataframe and not just by column, you can use .sum().sum() which will sum first by columns and then sum those columns
df.sum().sum()
130
so across dataframes it would be:
sums = [x.sum().sum() for x in [df, df2]]
Doing the mean would depend on how your csvs are structured. If you were to do .mean().mean(), that might be very different from what you're looking for. If it's just 1 column every time, it would be fine; but if there were more, it would take the mean of the 5 columns and then the mean of that (those 5 averages summed and divided by 5).
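As a quick, hedged illustration with the df defined above (the two expressions only coincide when every column contributes the same number of non-NaN values):

df.values.mean()   # grand mean over all 25 cells
df.mean().mean()   # mean of the 5 per-column means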
Lastly, it looks like "Continental AG (Worldwide)" is the name of your column.
So in your for loop you should be doing:
sums = [df['Continental AG (Worldwide)'].sum() for df in list_dfs]
I performed a few operations, something like below:
while i < len(filepath):
    .....
    df['Date'] = df['Time'].apply(lambda i: i.split('T')[0])
    .......
    .......
    sum1 = sum_df.sum(axis=0)
    avg1 = Avg_df.sum(axis=0)
    .......
    .......
    avglist.append(avg1)
    sumlist.append(sum1)
    .....
    i += 1
So I changed all my operations to the below:
df['Date'] = df.iloc[:, 0].apply(lambda i: i.split('T')[0])
.........
.........
sum1 = sum_df.iloc[:, 0].sum()
avg1 = Avg_df.iloc[:, 0].mean()
.....
.....
avglist.append(avg1)
sumlist.append(sum1)
Instead of using the column name and axis in my operations, I switched everything to DataFrame.iloc, and it started giving me the correct result.
I'm still not sure about the precise reason, but these changes worked for me. (The likely reason: sum_df.sum(axis=0) returns a Series with one value per column, so appending it put a whole Series, header and dtype info included, into each cell, whereas .iloc[:, 0].sum() returns a single scalar.)

Pivot Table in Python

I am pretty new to Python and hence I need your help on the following:
I have two tables (dataframes):
Table 1 has all the data and it looks like this:
GenDate column has the generation day.
Date column has dates.
Columns D and onwards have different values.
I also have the following table:
Column I has "keywords" that can be found in the headers of Table 1.
Column K has dates that should be in column C of Table 1.
My goal is to produce a table like the following:
I have omitted a few columns for illustration purposes.
Every column in Table 1 should be split based on the Type that is written in the header.
E.g. A_Weeks: the Weeks type corresponds to 3 splits, Week1, Week2 and Week3.
Each one of these splits has a specific Date.
In the new table, 3 columns should be created, using A_ and then the split name:
A_Week1, A_Week2 and A_Week3.
For each one of these columns, the value that corresponds to the Date of each split should be used.
I hope the explanation is good.
Thanks
You can get the desired table with the following code (follow the comments and check the pandas API reference to learn about the functions used):
import numpy as np
import pandas as pd

# initial data
t_1 = pd.DataFrame(
    {'GenDate': [1, 1, 1, 2, 2, 2],
     'Date': [10, 20, 30, 10, 20, 30],
     'A_Days': [11, 12, 13, 14, 15, 16],
     'B_Days': [21, 22, 23, 24, 25, 26],
     'A_Weeks': [110, 120, 130, 140, np.nan, 160],
     'B_Weeks': [210, 220, 230, 240, np.nan, 260]})
# initial data
t_2 = pd.DataFrame(
    {'Type': ['Days', 'Days', 'Days', 'Weeks', 'Weeks'],
     'Split': ['Day1', 'Day2', 'Day3', 'Week1', 'Week2'],
     'Date': [10, 20, 30, 10, 30]})

# create a MultiIndex
t_1 = t_1.set_index(['GenDate', 'Date'])
# pivot the 'Date' level of the MultiIndex - unstack it from the index to the columns -
# and drop columns containing NaN values (here the Weeks columns for Date 20)
tt_1 = t_1.unstack().dropna(axis=1)
# tt_1 is what you need, with multi-level column labels

# mapping used to rename the columns: for each Type, Date -> Split name
t_2 = t_2.set_index(['Type'])
mapping = {
    type_: dict(zip(
        t_2.loc[type_, :].loc[:, 'Date'],
        t_2.loc[type_, :].loc[:, 'Split']))
    for type_ in t_2.index.unique()}

# new column names of the form '<letter>_<split name>'
new_columns = list()
for letter_type, date in tt_1.columns.values:
    letter, type_ = letter_type.split('_')
    new_columns.append('{}_{}'.format(letter, mapping[type_][date]))
tt_1.columns = new_columns
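To make the renaming step easier to follow: with the sample t_2 above, the mapping dictionary should come out as below (derived by hand from the data), and tt_1 ends up with GenDate as the index and columns such as A_Day1 ... B_Week2.

# mapping == {'Days':  {10: 'Day1', 20: 'Day2', 30: 'Day3'},
#             'Weeks': {10: 'Week1', 30: 'Week2'}}
print(tt_1)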
