Hello, I am trying to reshape (unpivot) a table in a pandas DataFrame as follows, where A and B are both company names.
This is the DataFrame I have so far:
| Date   | A  | B   |
|--------|----|-----|
| date_1 | 34 | 8   |
| date_2 |    | 12  |
| date_3 | 6  | 321 |
and this is what I am looking to achieve:
| Date   | Company | Value |
|--------|---------|-------|
| date_1 | A       | 34    |
| date_1 | B       | 8     |
| date_2 | B       | 12    |
| date_3 | A       | 6     |
| date_3 | B       | 321   |
You are probably looking for melt, which should give you what you want.
pd.melt(df, id_vars=['Date'], value_vars=['A', 'B'], var_name='Company', value_name='Value').dropna()
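For example, a minimal reproducible sketch using the sample data above (the missing date_2 value for A is assumed to be NaN):
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': ['date_1', 'date_2', 'date_3'],
                   'A': [34, np.nan, 6],
                   'B': [8, 12, 321]})
# unpivot A/B into Company/Value pairs and drop the missing entry
out = pd.melt(df, id_vars=['Date'], value_vars=['A', 'B'],
              var_name='Company', value_name='Value').dropna()
print(out.sort_values(['Date', 'Company']))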
Is it possible to fill_forward only some columns when upsampling using Polars?
For example, I would like to fill in the missing dates in the sample dataframe (see code below). upsample and forward_fill work beautifully and are blazing fast with a much larger dataset; the output is as expected in the table below.
However, the question: is it possible to exclude a column from the forward_fill, so that, for example, the 'utc_time' column is left blank instead of having the time filled in? I have tried listing the columns in the select statement by replacing pl.all() with pl.col([...]), but that just removes the columns that are not listed.
import polars as pl
from datetime import datetime
df = pl.DataFrame(
    {
        'utc_created': pl.date_range(low=datetime(2021, 12, 16), high=datetime(2021, 12, 22, 0), interval="2d"),
        'utc_time': ['21:12:06', '21:20:06', '17:51:10', '03:54:49'],
        'sku': [9100000801, 9100000801, 9100000801, 9100000801],
        'old': [18, 17, 16, 15],
        'new': [17, 16, 15, 14],
        'alert_type': ['Inventory', 'Inventory', 'Inventory', 'Inventory'],
        'alert_level': ['Info', 'Info', 'Info', 'Info']
    }
)
df = (
    df.upsample(time_column='utc_created', every="1d", by='sku')
      .select(pl.all().forward_fill())
)
Returns:
| utc_created         | utc_time | sku        | old | new | alert_type | alert_level |
|---------------------|----------|------------|-----|-----|------------|-------------|
| 2021-12-16 00:00:00 | 21:12:06 | 9100000801 | 18  | 17  | Inventory  | Info        |
| 2021-12-17 00:00:00 | 21:12:06 | 9100000801 | 18  | 17  | Inventory  | Info        |
| 2021-12-18 00:00:00 | 21:20:06 | 9100000801 | 17  | 16  | Inventory  | Info        |
| 2021-12-19 00:00:00 | 21:20:06 | 9100000801 | 17  | 16  | Inventory  | Info        |
| 2021-12-20 00:00:00 | 17:51:10 | 9100000801 | 16  | 15  | Inventory  | Info        |
| 2021-12-21 00:00:00 | 17:51:10 | 9100000801 | 16  | 15  | Inventory  | Info        |
| 2021-12-22 00:00:00 | 03:54:49 | 9100000801 | 15  | 14  | Inventory  | Info        |
You can use pl.exclude(name).forward_fill(), e.g.
.with_column(pl.exclude("utc_time").forward_fill())
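Applied to the upsample query above, a minimal sketch (keeping the same, older Polars API that the question uses) would be:
df = (
    df.upsample(time_column='utc_created', every="1d", by='sku')
      .with_column(pl.exclude('utc_time').forward_fill())  # forward-fill every column except utc_time
)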
I have a dataset that looks like this:
| ColumnA | ColumnB |ColumnZ |
| --------| -------------- |--------|
| 1 | locationA |324 |
| 1 | n.a. |34 |
| 2 | n.a. |21 |
| 2 | locationA |n.a. |
| 2 | locationA |34 |
| 2 | n.a. |12 |
| 3 | n.a. |1 |
| 3 | locationB |134 |
| 3 | n.a. |n.a. |
| 4 | n.a. |134 |
| 4 | locationC |n.a. |
| 4 | locationD |132 |
| 4 | locationD |n.a. |
I now want to add a new ColumnC that says "different locations" whenever more than one distinct location in ColumnB belongs to the same group (i.e. the same number) in ColumnA. So my desired output is:
| ColumnA | ColumnB | ColumnZ | ColumnC |
| --------| -------------- | --------| ----------------- |
| 1 | locationA | 324 | |
| 1 | n.a. | 34 | |
| 2 | n.a. | 21 | |
| 2 | locationA | n.a. | |
| 2 | locationA | 34 | |
| 2 | n.a. | 12 | |
| 3 | n.a. | 1 | |
| 3 | locationB | 134 | |
| 3 | n.a. | n.a. | |
| 4 | n.a. | 134 | different locations |
| 4 | locationC | n.a. | different locations |
| 4 | locationD | 132 | different locations |
| 4 | locationD | n.a. | different locations |
Therefore I've started by turning all n.a. values in ColumnB into NaN values:
df['ColumnB'] = df['ColumnB'].replace('n.a.', np.NaN)
and then I've tried it with this function:
def no_of_locations(group):
if df['ColumnB'].nunique() > 1:
df['ColumnC'] = 'different locations'
pass
df.groupby('ColumnA').apply(no_of_locations)
Yet, the result is that it still counts all unique values across the whole ColumnB, not only within each group based on ColumnA. How can I restrict it to the respective group?
If the only condition is having more than one distinct ColumnB value within a ColumnA group after dropping the n.a. values, you can count the unique locations per group and use the result to build a mask for filtering your original dataframe:
mask = df['ColumnA'].isin(df.replace({'ColumnB': {'n.a.': np.nan}})
                            .dropna(subset=['ColumnB'])
                            .groupby('ColumnA')['ColumnB'].nunique()
                            .loc[lambda x: x > 1].index.values)
df.loc[mask, 'ColumnC'] = 'different locations'
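A shorter per-group alternative, sketched with groupby().transform (assuming a reasonably recent pandas where transform('nunique') is available):
# count distinct non-NaN locations within each ColumnA group, broadcast back to the rows
counts = df['ColumnB'].replace('n.a.', np.nan).groupby(df['ColumnA']).transform('nunique')
df.loc[counts > 1, 'ColumnC'] = 'different locations'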
I have about 2 billion records and I want to group the data with PySpark and save each group to its own CSV.
Here is my sample Dataframe:
+----+------+---------------------+
| id | name | date                |
+----+------+---------------------+
| 1  | a    | 2019-12-01 00:00:00 |
| 2  | b    | 2019-12-01 00:00:00 |
| 3  | c    | 2020-01-01 00:00:00 |
| 4  | a    | 2020-01-01 00:00:00 |
| 5  | b    | 2020-01-01 00:00:00 |
| 6  | a    | 2020-01-05 00:00:00 |
| 7  | b    | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
df.groupBy([
    'name',
    year('date').alias('year'),
    month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a    | 2019 | 12    | 1     |
| b    | 2019 | 12    | 1     |
| c    | 2020 | 01    | 1     |
| a    | 2020 | 01    | 2     |
| b    | 2020 | 01    | 2     |
+------+------+-------+-------+
But I want each group's elements as a DataFrame, like this:
+------+------+-------+-----------+
| name | year | month | element   |
+------+------+-------+-----------+
| a    | 2019 | 12    | Dataframe |
| b    | 2019 | 12    | Dataframe |
| c    | 2020 | 01    | Dataframe |
| a    | 2020 | 01    | Dataframe |
| b    | 2020 | 01    | Dataframe |
+------+------+-------+-----------+
Where "element column" contains grouped Dataframe in each group then I want to map each group and save them to separate csv.
Note: I have tried to use distinct and collect for grouping then select data for each group, but performance is too slow for my huge data. I think groupBy is faster, so I want to use groupBy instead.
How to do it in PySpark ?
You can achieve your goal using withColumn and lit. Note that withColumn must be called on a DataFrame, not on the GroupedData returned by groupBy, so apply it after the aggregation:
(df.groupBy(['name', year('date').alias('year'), month('date').alias('month')])
   .count()
   .withColumn('element', lit('Dataframe')))
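If the end goal is to write each group's rows to its own CSV output, a different technique worth sketching is the DataFrameWriter's partitionBy, which writes one output directory per name/year/month combination (the output path below is a placeholder assumption):
from pyspark.sql.functions import year, month
(df
 .withColumn('year', year('date'))
 .withColumn('month', month('date'))
 .write
 .partitionBy('name', 'year', 'month')      # one output folder per group
 .csv('/tmp/grouped_output', header=True))  # hypothetical output path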
Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in GraphLab / another Python function to split the Date column into Year | Month | Day columns?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas, I can do this: Which is the fastest way to extract day, month and year from a given date?
But converting an SFrame into pandas to split the date and then converting back into an SFrame is quite a chore.
You could also do it with the split_datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame), and it returns an SFrame which you can then add back to the original data (at basically zero cost).
A quick and dirty way to do this is
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.
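A minimal sketch of that last option, assuming the dates are stored as strings in '%Y-%m-%d' format (the str_to_datetime format string and the limit argument reflect the GraphLab Create API as I recall it):
# parse the date strings into datetime values, then expand into separate columns
sf['Date'] = sf['Date'].str_to_datetime('%Y-%m-%d')
ymd = sf['Date'].split_datetime(column_name_prefix='', limit=['year', 'month', 'day'])
sf = sf.add_columns(ymd)  # appends the year/month/day columns to the original SFrame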
I have two data frames that are both multi-indexed on 'Date' and 'name', and want to do a SQL style JOIN to combine them. I've tried
pd.merge(df1.reset_index(), df2.reset_index(), on=['name', 'Date'], how='inner')
which then results in an empty DataFrame.
If I inspect the data frames I can see that the index of one is represented as '2015-01-01' and the other is represented as '2015-01-01 00:00:00', which explains my issues with joining.
Is there a way to 'recast' the index to a specific format within pandas?
I've included the tables below so you can see what data I'm working with.
df1=
+-------------+------+------+------+
| Date | name | col1 | col2 |
+-------------+------+------+------+
| 2015-01-01 | mary | 12 | 123 |
| 2015-01-02 | mary | 23 | 33 |
| 2015-01-03 | mary | 34 | 45 |
| 2015-01-01 | john | 65 | 76 |
| 2015-01-02 | john | 67 | 78 |
| 2015-01-03 | john | 25 | 86 |
+-------------+------+------+------+
df2=
+------------+------+-------+-------+
| Date | name | col3 | col4 |
+------------+------+-------+-------+
| 2015-01-01 | mary | 80809 | 09885 |
| 2015-01-02 | mary | 53879 | 58972 |
| 2015-01-03 | mary | 23887 | 3908 |
| 2015-01-01 | john | 9238 | 2348 |
| 2015-01-02 | john | 234 | 234 |
| 2015-01-03 | john | 5325 | 6436 |
+------------+------+-------+-------+
DESIRED Result:
+-------------+------+------+-------+-------+-------+
| Date | name | col1 | col2 | col3 | col4 |
+-------------+------+------+-------+-------+-------+
| 2015-01-01 | mary | 12 | 123 | 80809 | 09885 |
| 2015-01-02 | mary | 23 | 33 | 53879 | 58972 |
| 2015-01-03 | mary | 34 | 45 | 23887 | 3908 |
| 2015-01-01 | john | 65 | 76 | 9238 | 2348 |
| 2015-01-02 | john | 67 | 78 | 234 | 234 |
| 2015-01-03 | john | 25 | 86 | 5325 | 6436 |
+-------------+------+------+-------+-------+-------+
The reason you cannot join is that you have different dtypes on the indices. Pandas silently fails if the indices have different dtypes.
You can easily change your indices from string representations of time to proper pandas datetimes like this:
df = pd.DataFrame({"data":range(1,30)}, index=['2015-04-{}'.format(d) for d in range(1,30)])
df.index.dtype
dtype('O')
df.index = df.index.to_series().apply(pd.to_datetime)
df.index.dtype
dtype('<M8[ns]')
Now you can merge the dataframes on their index:
pd.merge(left=df, left_index=True,
right=df2, right_index=True)
Assuming you have a df2, which my example is omitting...
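Applied to the frames in the question, a minimal sketch (using the column names from the question) would reset the indexes, normalize the Date dtype on both sides, and then merge:
left = df1.reset_index()
right = df2.reset_index()
# make both Date columns datetime64 so the join keys actually match
left['Date'] = pd.to_datetime(left['Date'])
right['Date'] = pd.to_datetime(right['Date'])
merged = pd.merge(left, right, on=['Date', 'name'], how='inner')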