Total the sum of a corresponding column with Python

| Store | Date       | Weekly_Sales | Holiday_Flag | Temperature | Fuel_Price | CPI        | Unemployment |
|-------|------------|--------------|--------------|-------------|------------|------------|--------------|
| 1     | 05-02-2010 | 1643690.90   | 0            | 42.31       | 2.572      | 211.096358 | 8.106        |
| 1     | 12-02-2010 | 1641957.44   | 1            | 38.51       | 2.548      | 211.242170 | 8.106        |
| 1     | 19-02-2010 | 1611968.17   | 0            | 39.93       | 2.514      | 211.289143 | 8.106        |
| 1     | 26-02-2010 | 1409727.59   | 0            | 46.63       | 2.561      | 211.319643 | 8.106        |
| 1     | 05-03-2010 | 1554806.68   | 0            | 46.50       | 2.625      | 211.350143 | 8.106        |
The Store column ranges from 1 to 40. How do I get the store with the maximum Weekly_Sales?

There are many ways to do this, and you haven't shown how you load the data into Python or what format it's in, which makes the question difficult to answer. I suggest you look into pandas or NumPy as data analysis libraries. If the data is stored in a .csv file or even a Python dictionary, you could try the following:
import pandas as pd

df = pd.read_csv('file.csv', header=0)
# df = pd.DataFrame.from_dict(dct)  # if the data is in a dictionary instead

value = df.Weekly_Sales.max()     # the largest single weekly sales figure
# or
index = df.Weekly_Sales.idxmax()  # the row label of that figure
store = df.loc[index, 'Store']    # the store it belongs to
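Note that max/idxmax above find the single largest week, not the store with the largest total. If you want the store with the highest summed Weekly_Sales across all weeks (as the title suggests), a sketch assuming the same file.csv layout:
import pandas as pd

df = pd.read_csv('file.csv', header=0)
totals = df.groupby('Store')['Weekly_Sales'].sum()  # one total per store
best_store = totals.idxmax()  # the Store id with the highest total
best_total = totals.max()     # that store's summed sales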

Dividing a dataframe into several dataframes according to date column

I have a dataframe with a date column called 'testdate', and a period between two specific dates, such as 20110501~20120731.
I want to divide that dataframe into multiple dataframes according to the year-month of 'testdate'.
For example, if 'testdate' falls within 20110501-20110531 the row goes to df1, if it falls in the next month, df2, and so on.
For example, the whole dataframe looks like this...
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 1 | 20110528 | 50 |
| 2 | 20110601 | 75 |
| 3 | 20110504 | 100 |
| 4 | 20110719 | 82 |
| 5 | 20111120 | 42 |
| 6 | 20111103 | 95 |
| 7 | 20120520 | 42 |
| 8 | 20120503 | 95 |
But, I want to divide it like this...
[DF1]: name should be 201105
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 1 | 20110528 | 50 |
| 3 | 20110504 | 100 |
[DF2]: name should be 201106
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 2 | 20110601 | 75 |
[DF3]: name should be 201107
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 4 | 20110719 | 82 |
[DF4]: name should be 201111
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 5 | 20111120 | 42 |
| 6 | 20111103 | 95 |
[DF5]: name should be 201205
| StudentID | Testdate | Record |
| -------- | -------- | ------ |
| 7 | 20120520 | 42 |
| 8 | 20120503 | 95 |
I found some code for dividing a dataframe according to the quarter, but I couldn't find any code for my task.
How can I deal with this? Many thanks for your help.
Create a grouper by slicing the yyyymm prefix from Testdate, then group the dataframe and store each group in a dict comprehension:
s = df['Testdate'].astype(str).str[:6]
dfs = {f'df_{k}': g for k, g in df.groupby(s)}
# dfs['df_201105']
StudentID Testdate Record
0 1 20110528 50
2 3 20110504 100
# dfs['df_201106']
StudentID Testdate Record
1 2 20110601 75
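If you want real date handling rather than string slicing, a hedged alternative grouper built from parsed datetimes (assuming the yyyymmdd layout shown above):
import pandas as pd

# Parse yyyymmdd values into datetimes, then group by their year-month string
dates = pd.to_datetime(df['Testdate'].astype(str), format='%Y%m%d')
dfs = {f'df_{ym}': g for ym, g in df.groupby(dates.dt.strftime('%Y%m'))}
# dfs['df_201105'], dfs['df_201106'], ... as before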

Creating dummy variables for dataset with continuous and categorical variables in Python using Pandas

I am working on a logistic regression for employee turnover.
The variables I have are
q3attendance object
q4attendance object
average_attendance object
training_days int64
esat float64
lastcompany object
client_category object
qual_category object
location object
rating object
role object
band object
resourcegroup object
skill object
status int64
I have marked the categorical variables using
cat = ['bandlevel', 'resourcegroup', 'skill', ...]
I define x and y using x=df.iloc[:,:-1] and y=df.iloc[:,-1].
Next I need to create dummy variables. So, I use the command
xd = pd.get_dummies(x, drop_first=True)
After this, I expect the continuous variables to remain as they are and dummies to be created only for the categorical variables. However, on executing the command, I find that the code treats the continuous variables as categorical too and creates dummies for all of them as well. So if the tenure is 3 years 2 months, 4 years 3 months, etc., both 3.2 and 4.3 are taken as categories. I end up with more than 1500 dummies, and it is a challenge to run the regression after that.
What am I missing? Should I specifically mark out the categorical variables when using get_dummies?
pd.get_dummies has an optional parameter, columns, which accepts a list of the columns you want encoded.
For example:
df.head()
+----+------+--------------+-------------+---------------------------+----------+-----------------+
| | id | first_name | last_name | email | gender | ip_address |
|----+------+--------------+-------------+---------------------------+----------+-----------------|
| 0 | 1 | Lucine | Krout | lkrout0#sourceforge.net | Female | 199.158.46.27 |
| 1 | 2 | Sherm | Jullian | sjullian1#mapy.cz | Male | 8.97.22.209 |
| 2 | 3 | Derk | Mulloch | dmulloch2#china.com.cn | Male | 132.108.184.131 |
| 3 | 4 | Elly | Sulley | esulley3#com.com | Female | 63.177.149.251 |
| 4 | 5 | Brocky | Jell | bjell4#huffingtonpost.com | Male | 152.32.40.4 |
| 5 | 6 | Harv | Allot | hallot5#blogtalkradio.com | Male | 71.135.240.164 |
| 6 | 7 | Wolfie | Stable | wstable6#utexas.edu | Male | 211.31.189.141 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7#whitehouse.gov | Male | 224.214.43.40 |
| 8 | 9 | Devina | Salerg | dsalerg8#furl.net | Female | 49.169.34.38 |
| 9 | 10 | Missie | Korpal | mkorpal9#wunderground.com | Female | 119.115.90.232 |
+----+------+--------------+-------------+---------------------------+----------+-----------------+
then
columns = ["gender"]
pd.get_dummies(df, columns=columns)
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
| | id | first_name | last_name | email | ip_address | gender_Female | gender_Male |
|----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------|
| 0 | 1 | Lucine | Krout | lkrout0#sourceforge.net | 199.158.46.27 | 1 | 0 |
| 1 | 2 | Sherm | Jullian | sjullian1#mapy.cz | 8.97.22.209 | 0 | 1 |
| 2 | 3 | Derk | Mulloch | dmulloch2#china.com.cn | 132.108.184.131 | 0 | 1 |
| 3 | 4 | Elly | Sulley | esulley3#com.com | 63.177.149.251 | 1 | 0 |
| 4 | 5 | Brocky | Jell | bjell4#huffingtonpost.com | 152.32.40.4 | 0 | 1 |
| 5 | 6 | Harv | Allot | hallot5#blogtalkradio.com | 71.135.240.164 | 0 | 1 |
| 6 | 7 | Wolfie | Stable | wstable6#utexas.edu | 211.31.189.141 | 0 | 1 |
| 7 | 8 | Harcourt | Dunguy | hdunguy7#whitehouse.gov | 224.214.43.40 | 0 | 1 |
| 8 | 9 | Devina | Salerg | dsalerg8#furl.net | 49.169.34.38 | 1 | 0 |
| 9 | 10 | Missie | Korpal | mkorpal9#wunderground.com | 119.115.90.232 | 1 | 0 |
+----+------+--------------+-------------+---------------------------+-----------------+-----------------+---------------+
This will encode only the gender column.
(All data above is auto-generated and doesn't represent real people.)
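For the original question, the likely root cause is that the continuous columns (q3attendance etc.) have object dtype, and get_dummies encodes every object column by default. A sketch of both fixes, with column names assumed from the question:
import pandas as pd

# Option 1: restrict encoding to the columns you marked as categorical
xd = pd.get_dummies(x, columns=cat, drop_first=True)

# Option 2: coerce object columns that are really numeric, then let the
# default behaviour leave them alone
for col in ['q3attendance', 'q4attendance', 'average_attendance']:
    x[col] = pd.to_numeric(x[col], errors='coerce')
xd = pd.get_dummies(x, drop_first=True)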

Multi-Index Lookup Mapping

I'm trying to create a new column whose value is based on two index levels of that row. I have two dataframes with equivalent multi-indexes on the levels I'm querying (but not of equal size). For each row in the first dataframe, I want the value from the second dataframe that matches the row's indices.
I originally thought I could use .loc[] and filter on the index values, but I cannot get this to change the output row by row. If I weren't using a dataframe object, I'd loop over the whole thing to do it.
I have tried the .apply() method, but I can't figure out what function to pass to it.
Creating some toy data with the same structure:
import numpy as np
import pandas as pd

np.random.seed(1)  # note: np.random.seed = 1 would overwrite the function instead of seeding
df = pd.DataFrame({'Aircraft': np.ones(15),
                   'DC': np.append(np.repeat(['A', 'B'], 7), 'C'),
                   'Test': np.array([10, 10, 10, 10, 10, 10, 20, 10, 10, 10, 10, 10, 10, 20, 10]),
                   'Record': np.array([1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1]),
                   # There are multiple "value" columns in my data, but I have simplified here
                   'Value': np.random.random(15)})
df.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
df.sort_index(inplace=True)
v = pd.DataFrame({'Aircraft': np.ones(7),
                  'DC': np.repeat('v', 7),
                  'Test': np.array([10, 10, 10, 10, 10, 10, 20]),
                  'Record': np.array([1, 2, 3, 4, 5, 6, 1]),
                  'Value': np.random.random(7)})
v.set_index(['Aircraft', 'DC', 'Test', 'Record'], inplace=True)
v.sort_index(inplace=True)
df['v'] = df.apply(lambda x: v.loc[df.iloc[x]])
This raises an error about indexing on a multi-index.
To set all values to a single "v" value:
df['v'] = float(v.loc[(slice(None), 'v', 10, 1), 'Value'])
So inputs look like this:
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | A | 10 | 1 | 0.847576 |
| | | | 2 | 0.860720 |
| | | | 3 | 0.017704 |
| | | | 4 | 0.082040 |
| | | | 5 | 0.583630 |
| | | | 6 | 0.506363 |
| | | 20 | 1 | 0.844716 |
| | B | 10 | 1 | 0.698131 |
| | | | 2 | 0.112444 |
| | | | 3 | 0.718316 |
| | | | 4 | 0.797613 |
| | | | 5 | 0.129207 |
| | | | 6 | 0.861329 |
| | | 20 | 1 | 0.535628 |
| | C | 10 | 1 | 0.121704 |
--------------------------------------------
--------------------------------------------
| Aircraft | DC | Test | Record | Value |
|----------|----|------|--------|----------|
| 1.0 | v | 10 | 1 | 0.961791 |
| | | | 2 | 0.046681 |
| | | | 3 | 0.913453 |
| | | | 4 | 0.495924 |
| | | | 5 | 0.149950 |
| | | | 6 | 0.708635 |
| | | 20 | 1 | 0.874841 |
--------------------------------------------
And after the operation, I want this:
| Aircraft | DC | Test | Record | Value | v |
|----------|----|------|--------|----------|----------|
| 1.0 | A | 10 | 1 | 0.847576 | 0.961791 |
| | | | 2 | 0.860720 | 0.046681 |
| | | | 3 | 0.017704 | 0.913453 |
| | | | 4 | 0.082040 | 0.495924 |
| | | | 5 | 0.583630 | 0.149950 |
| | | | 6 | 0.506363 | 0.708635 |
| | | 20 | 1 | 0.844716 | 0.874841 |
| | B | 10 | 1 | 0.698131 | 0.961791 |
| | | | 2 | 0.112444 | 0.046681 |
| | | | 3 | 0.718316 | 0.913453 |
| | | | 4 | 0.797613 | 0.495924 |
| | | | 5 | 0.129207 | 0.149950 |
| | | | 6 | 0.861329 | 0.708635 |
| | | 20 | 1 | 0.535628 | 0.874841 |
| | C | 10 | 1 | 0.121704 | 0.961791 |
Edit:
Since you are on pandas 0.23.4, just change droplevel to reset_index with the option drop=True:
df_result = (df.reset_index('DC').assign(v=v.reset_index('DC', drop=True))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Original:
One way is to move the index level DC of df into the columns, use assign to create the new column, then set_index and reorder_levels to restore the original index:
df_result = (df.reset_index('DC').assign(v=v.droplevel('DC'))
.set_index('DC', append=True)
.reorder_levels(v.index.names))
Out[1588]:
Value v
Aircraft DC Test Record
1.0 A 10 1 0.847576 0.961791
2 0.860720 0.046681
3 0.017704 0.913453
4 0.082040 0.495924
5 0.583630 0.149950
6 0.506363 0.708635
20 1 0.844716 0.874841
B 10 1 0.698131 0.961791
2 0.112444 0.046681
3 0.718316 0.913453
4 0.797613 0.495924
5 0.129207 0.149950
6 0.861329 0.708635
20 1 0.535628 0.874841
C 10 1 0.121704 0.961791
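On newer pandas (0.24+), there is a shorter route: join can align two MultiIndexed frames on their overlapping index levels, so dropping the mismatched DC level from v is enough. A sketch, assuming that version:
# Drop the level that differs, rename the column, and let join align on the
# remaining (Aircraft, Test, Record) levels
df_result = df.join(v.droplevel('DC').rename(columns={'Value': 'v'}))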

Dataframe: add a column with the count of rows where a condition holds

I have a dataframe (df) in Python with 2 columns: ID and Date.
| ID | Date |
| ------------- |:-------------:|
| 1 | 06-14-2019 |
| 1 | 06-10-2019 |
| 2 | 06-16-2019 |
| 3 | 06-12-2019 |
| 3 | 06-12-2019 |
I'm trying to add a column to the dataframe which contains the count of rows where ID matches ID of the current row and Date <= Date of the current row.
Like the following:
| ID | Date | Count |
| ------------- |:-------------:|:-------------:|
| 1 | 06-14-2019 | 2 |
| 1 | 06-10-2019 | 1 |
| 2 | 06-16-2019 | 1 |
| 3 | 06-12-2019 | 2 |
| 3 | 06-12-2019 | 2 |
I have tried something like:
grouped = df.groupby(['ID'])
df['count'] = df.apply(lambda row: grouped.get_group[row['ID']][grouped.get_group(row['ID'])['Date'] < row['Date']]['ID'].size, axis=1)
This results in the following error, because get_group is a method and needs parentheses rather than square brackets:
TypeError: ("'method' object is not subscriptable", 'occurred at index 0')
Suggestions are welcome.
I forgot to mention: my real dataframe contains almost 4 million rows, so I'm looking for a smart, fast solution that won't take too long to run.
Using df.iterrows():
df['Count'] = None
for idx, value in df.iterrows():
    df.iloc[idx, -1] = len(df[(df.ID == value[0]) & (df.Date <= value[1])].index)
Output:
+---+----+------------+-------+
| | ID | Date | Count |
+---+----+------------+-------+
| 0 | 1 | 06-14-2019 | 2 |
| 1 | 1 | 06-10-2019 | 1 |
| 2 | 2 | 06-16-2019 | 1 |
| 3 | 3 | 06-12-2019 | 2 |
| 4 | 3 | 06-12-2019 | 2 |
+---+----+------------+-------+
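With almost 4 million rows, iterrows is quadratic and will be very slow; string dates also compare lexicographically rather than chronologically. A vectorized sketch: the count of same-ID rows with Date <= the current Date is exactly the max-method rank of Date within each ID group (tied dates share the top rank):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%m-%d-%Y')
# method='max' gives tied dates the highest shared rank, matching the
# "Date <= current Date" count in the question
df['Count'] = df.groupby('ID')['Date'].rank(method='max').astype(int)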

Calculate the difference of column values for sets of row indices which are not successive in pandas

Say I have the following table:
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.72694 | 1.4742 | 0.32396 | 0.98535 | 1 | 0.83592 | 0.0046566 | 0.0039465 | 0.04779 | 0.12795 | 0.016108 | 0.0052323 | 0.00027477 | 1.1756 | 1 |
| 2 | 0.74173 | 1.5257 | 0.36116 | 0.98152 | 0.99825 | 0.79867 | 0.0052423 | 0.0050016 | 0.02416 | 0.090476 | 0.0081195 | 0.002708 | 7.48E-05 | 0.69659 | 1 |
| 3 | 0.76722 | 1.5725 | 0.38998 | 0.97755 | 1 | 0.80812 | 0.0074573 | 0.010121 | 0.011897 | 0.057445 | 0.0032891 | 0.00092068 | 3.79E-05 | 0.44348 | 1 |
| 4 | 0.73797 | 1.4597 | 0.35376 | 0.97566 | 1 | 0.81697 | 0.0068768 | 0.0086068 | 0.01595 | 0.065491 | 0.0042707 | 0.0011544 | 6.63E-05 | 0.58785 | 1 |
| 5 | 0.82301 | 1.7707 | 0.44462 | 0.97698 | 1 | 0.75493 | 0.007428 | 0.010042 | 0.0079379 | 0.045339 | 0.0020514 | 0.00055986 | 2.35E-05 | 0.34214 | 1 |
| 7 | 0.82063 | 1.7529 | 0.44458 | 0.97964 | 0.99649 | 0.7677 | 0.0059279 | 0.0063954 | 0.018375 | 0.080587 | 0.0064523 | 0.0022713 | 4.15E-05 | 0.53904 | 1 |
| 8 | 0.77982 | 1.6215 | 0.39222 | 0.98512 | 0.99825 | 0.80816 | 0.0050987 | 0.0047314 | 0.024875 | 0.089686 | 0.0079794 | 0.0024664 | 0.00014676 | 0.66975 | 1 |
| 9 | 0.83089 | 1.8199 | 0.45693 | 0.9824 | 1 | 0.77106 | 0.0060055 | 0.006564 | 0.0072447 | 0.040616 | 0.0016469 | 0.00038812 | 3.29E-05 | 0.33696 | 1 |
| 11 | 0.7459 | 1.4927 | 0.34116 | 0.98296 | 1 | 0.83088 | 0.0055665 | 0.0056395 | 0.0057679 | 0.036511 | 0.0013313 | 0.00030872 | 3.18E-05 | 0.25026 | 1 |
| 12 | 0.79606 | 1.6934 | 0.43387 | 0.98181 | 1 | 0.76985 | 0.0077992 | 0.011071 | 0.013677 | 0.057832 | 0.0033334 | 0.00081648 | 0.00013855 | 0.49751 | 1 |
+----+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
I have two sets of row indices:
set1 = [1, 3, 5, 8, 9]
set2 = [2, 4, 7, 10, 10]
Note: here the first row has index value 1. Both sets will always have the same length.
What I am looking for is a fast and pythonic way to get the difference of column values for the corresponding row indices, that is: the differences 1-2, 3-4, 5-7, 8-10, 9-10.
For this example, my resultant dataframe is the following:
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
| 1 | 0.01479 | 0.0515 | 0.0372 | 0.00383 | 0.00175 | 0.03725 | 0.0005857 | 0.0010551 | 0.02363 | 0.037474 | 0.0079885 | 0.0025243 | 0.00019997 | 0.47901 | 0 |
| 1 | 0.02925 | 0.1128 | 0.03622 | 0.00189 | 0 | 0.00885 | 0.0005805 | 0.0015142 | 0.004053 | 0.008046 | 0.0009816 | 0.00023372 | 0.0000284 | 0.14437 | 0 |
| 3 | 0.04319 | 0.1492 | 0.0524 | 0.00814 | 0.00175 | 0.05323 | 0.0023293 | 0.0053106 | 0.0169371 | 0.044347 | 0.005928 | 0.00190654 | 0.00012326 | 0.32761 | 0 |
| 3 | 0.03483 | 0.1265 | 0.02306 | 0.00059 | 0 | 0.00121 | 0.0017937 | 0.004507 | 0.0064323 | 0.017216 | 0.0016865 | 0.00042836 | 0.00010565 | 0.16055 | 0 |
| 1 | 0.05016 | 0.2007 | 0.09271 | 0.00115 | 0 | 0.06103 | 0.0022327 | 0.0054315 | 0.0079091 | 0.021321 | 0.0020021 | 0.00050776 | 0.00010675 | 0.24725 | 0 |
+---+---------+--------+---------+---------+---------+---------+-----------+-----------+-----------+----------+-----------+------------+------------+---------+---+
My resultant difference values are absolute here.
I can't apply diff(), since the row indices may not be consecutive.
I am currently achieving this by looping through the sets.
Is there a pandas trick to do this?
Use loc-based indexing:
df.loc[set1].values - df.loc[set2].values
Ensure that len(set1) is equal to len(set2). Also, keep in mind that setX is a counter-intuitive name for a list object.
Another way is to select by reindexing and then subtract:
df = df.reindex(set1) - df.reindex(set2).values
loc or iloc would raise a FutureWarning here, since passing list-likes to .loc or [] with any missing labels will raise a KeyError in the future.
In short, try the following:
df.iloc[::2].values - df.iloc[1::2].values
Or alternatively, if (as in your question) the indices follow no simple rule:
df.iloc[set1].values - df.iloc[set2].values
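To get the absolute differences back as a labelled dataframe, as in the expected output, you can wrap either variant; a sketch assuming set1/set2 are 1-based positions as stated in the question:
import numpy as np
import pandas as pd

pos1 = [i - 1 for i in set1]  # convert 1-based positions to 0-based
pos2 = [i - 1 for i in set2]
result = pd.DataFrame(np.abs(df.iloc[pos1].values - df.iloc[pos2].values),
                      columns=df.columns)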
