I have two data frames that are both multi-indexed on 'Date' and 'name', and want to do a SQL style JOIN to combine them. I've tried
pd.merge(df1.reset_index(), df2.reset_index(), on=['name', 'Date'], how='inner')
which then results in an empty DataFrame.
If I inspect the data frames I can see that the index of one is represented as '2015-01-01' and the other is represented as '2015-01-01 00:00:00' which explains my issues with joining.
Is there a way to 'recast' the index to a specific format within pandas?
I've included the tables to see what data I'm working with.
df1=
+-------------+------+------+------+
| Date | name | col1 | col2 |
+-------------+------+------+------+
| 2015-01-01 | mary | 12 | 123 |
| 2015-01-02 | mary | 23 | 33 |
| 2015-01-03 | mary | 34 | 45 |
| 2015-01-01 | john | 65 | 76 |
| 2015-01-02 | john | 67 | 78 |
| 2015-01-03 | john | 25 | 86 |
+-------------+------+------+------+
df2=
+------------+------+-------+-------+
| Date | name | col3 | col4 |
+------------+------+-------+-------+
| 2015-01-01 | mary | 80809 | 09885 |
| 2015-01-02 | mary | 53879 | 58972 |
| 2015-01-03 | mary | 23887 | 3908 |
| 2015-01-01 | john | 9238 | 2348 |
| 2015-01-02 | john | 234 | 234 |
| 2015-01-03 | john | 5325 | 6436 |
+------------+------+-------+-------+
DESIRED Result:
+-------------+------+------+-------+-------+-------+
| Date | name | col1 | col2 | col3 | col4 |
+-------------+------+------+-------+-------+-------+
| 2015-01-01 | mary | 12 | 123 | 80809 | 09885 |
| 2015-01-02 | mary | 23 | 33 | 53879 | 58972 |
| 2015-01-03 | mary | 34 | 45 | 23887 | 3908 |
| 2015-01-01 | john | 65 | 76 | 9238 | 2348 |
| 2015-01-02 | john | 67 | 78 | 234 | 234 |
| 2015-01-03 | john | 25 | 86 | 5325 | 6436 |
+-------------+------+------+-------+-------+-------+
The reason you cannot join is that the indices have different dtypes: one holds strings and the other holds datetimes, so no keys match and the join comes back empty without any error.
You can easily change an index from string representations of time to proper pandas datetimes like this:
df = pd.DataFrame({"data":range(1,30)}, index=['2015-04-{}'.format(d) for d in range(1,30)])
df.index.dtype
dtype('O')
df.index = pd.to_datetime(df.index)
df.index.dtype
dtype('<M8[ns]')
Now you can merge the dataframes on their index:
pd.merge(left=df, left_index=True,
right=df2, right_index=True)
Assuming you have a df2, which my example is omitting...
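For the multi-indexed frames in the question, the same idea can be sketched as below. This is a minimal example under assumptions: the sample frames are cut-down stand-ins for the question's tables, and the names left, right and joined are made up for illustration. It converts the 'Date' column on both sides before merging, then restores the index.

```python
import pandas as pd

# Stand-in for df1: the 'Date' index level holds plain strings.
df1 = pd.DataFrame({
    "Date": ["2015-01-01", "2015-01-02", "2015-01-01", "2015-01-02"],
    "name": ["mary", "mary", "john", "john"],
    "col1": [12, 23, 65, 67],
}).set_index(["Date", "name"])

# Stand-in for df2: the 'Date' index level holds real datetimes.
df2 = pd.DataFrame({
    "Date": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-01", "2015-01-02"]),
    "name": ["mary", "mary", "john", "john"],
    "col3": [80809, 53879, 9238, 234],
}).set_index(["Date", "name"])

# Move the index levels into columns and give 'Date' the same dtype on both sides.
left = df1.reset_index()
left["Date"] = pd.to_datetime(left["Date"])
right = df2.reset_index()
right["Date"] = pd.to_datetime(right["Date"])

# Inner join on both keys, then put the MultiIndex back.
joined = pd.merge(left, right, on=["name", "Date"], how="inner").set_index(["Date", "name"])
print(joined)
```

Calling pd.to_datetime on a column that is already datetime is harmless, so the conversion can be applied to both frames without checking which one is the odd one out.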
I've got a list of over 500k people. The data looks like the first table. I'd like to use the admission date from the first table and if the admission date of the same person in the second table is within 30 days of their admission date in the first table then I'd like to store that overlapping record in the third table. The example of what I'd like is below. Is there a faster way to do this than using iterrows using the person_ids and dates from the first table and checking every row in the second table?
Table 1
| person_id | admission_date | value |
| 1234 | 2017-01-31 | 6 |
| 5678 | 2018-03-20 | 12 |
| 9101 | 2017-02-22 | 11 |
| 1234 | 2020-10-31 | 19 |
| 5678 | 2019-06-16 | 21 |
| 9101 | 2021-12-14 | 8 |
Table 2
| person_id | admission_date | value |
| 1234 | 2015-01-31 | 10 |
| 1234 | 2017-02-12 | 152 |
| 5678 | 2017-01-31 | 10 |
| 5678 | 2018-04-10 | 10 |
| 9101 | 2017-02-25 | 99 |
| 9101 | 2017-03-01 | 10 |
| 1234 | 2012-12-31 | 10 |
| 5678 | 2019-07-10 | 11 |
| 9101 | 2017-01-31 | 10 |
Table 3
| person_id | admission_date | value |
| 1234 | 2017-02-12 | 152 |
| 5678 | 2018-04-10 | 10 |
| 9101 | 2017-02-25 | 99 |
| 9101 | 2017-03-01 | 10 |
| 5678 | 2019-07-10 | 11 |
You need to use merge_asof:
df1['admission_date'] = pd.to_datetime(df1['admission_date'])
df2['admission_date'] = pd.to_datetime(df2['admission_date'])
out = (pd
.merge_asof(df1.sort_values(by='admission_date')
.rename(columns={'admission_date': 'date'})
.drop(columns='value'),
df2.sort_values(by='admission_date'),
by='person_id',
left_on='date',
right_on='admission_date',
direction='forward',
tolerance=pd.Timedelta('30D')
)
.drop(columns='date')
.dropna(subset='value')
)
output:
person_id admission_date value
0 1234 2017-02-12 152.0
1 9101 2017-02-25 99.0
2 5678 2018-04-10 10.0
3 5678 2019-07-10 11.0
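Note that merge_asof keeps only the nearest match for each row of table 1, which is why the 9101 record from 2017-03-01 is missing above. If every table 2 record within 30 days is wanted, as in table 3, one possible alternative is a plain merge on person_id followed by a filter on the date difference. The sketch below assumes "within 30 days" means 0 to 30 days after the table 1 admission; the names pairs, delta and first_date are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "person_id": [1234, 5678, 9101, 1234, 5678, 9101],
    "admission_date": pd.to_datetime([
        "2017-01-31", "2018-03-20", "2017-02-22",
        "2020-10-31", "2019-06-16", "2021-12-14",
    ]),
    "value": [6, 12, 11, 19, 21, 8],
})
df2 = pd.DataFrame({
    "person_id": [1234, 1234, 5678, 5678, 9101, 9101, 1234, 5678, 9101],
    "admission_date": pd.to_datetime([
        "2015-01-31", "2017-02-12", "2017-01-31", "2018-04-10", "2017-02-25",
        "2017-03-01", "2012-12-31", "2019-07-10", "2017-01-31",
    ]),
    "value": [10, 152, 10, 10, 99, 10, 10, 11, 10],
})

# Pair every table 1 admission with every table 2 record of the same person.
pairs = (
    df1[["person_id", "admission_date"]]
    .rename(columns={"admission_date": "first_date"})
    .merge(df2, on="person_id")
)

# Keep the table 2 records that fall 0 to 30 days after the table 1 admission.
delta = pairs["admission_date"] - pairs["first_date"]
within = pairs[(delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta(days=30))]

df3 = (
    within[["person_id", "admission_date", "value"]]
    .drop_duplicates()
    .reset_index(drop=True)
)
print(df3)
```

The intermediate pairs frame has one row per combination of admissions for the same person, so it stays manageable when each person has only a handful of records but can grow quickly otherwise.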
Let table 1 be df1, table 2 be df2 and table 3 be df3.
I'm not sure whether table 1 has duplicate person ids as table 2 does, so I assume it does and keep only the most recent admission date per person in both tables.
df1 = df1.sort_values(by=['person_id','admission_date'], ascending=False)
df1 = df1[~df1['person_id'].duplicated()]  # only has the latest admission for any person_id
df2 = df2.sort_values(by=['person_id','admission_date'], ascending=False)
df2 = df2[~df2['person_id'].duplicated()]  # only has the latest admission for any person_id
df3 = pd.concat([df1.set_index('person_id')['admission_date'].to_frame('adm_date_1'),df2.set_index('person_id')],axis=1,join='inner')
Now that we have the data aligned, we can check for the 30 day condition:
mask = (df3['adm_date_1'] - df3['admission_date']).dt.days.abs() <= 30
df3 = df3.loc[mask, ['admission_date', 'value']]
For this to work the date columns need to be of datetime type; if they are not, convert them first (for example with pd.to_datetime).
I'm trying to find the best practice when it comes to dumping prometheus data to parquet.
My data consists of known metric data monitored but unknown number of clients monitored being: VMs, Hosts, Pods, Containers, etc.
What I want first is to monitor all pods that have a fixed number of labels being tracked.
So I'm planning on structuring my data like this.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | b | 24 | 37 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | b | 17 | 37 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | b | 40 | 24 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | b | 1 | 22 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | b | 34 | 4 | 2020 | 6 | 5 |
Creating an item column for each client I have.
Then I'd save the next client, which would be labeled c.
| ds | item | label_x | label_n | year | month | day |
|:--------------------|:-------|----------:|----------:|-------:|--------:|------:|
| 2020-06-01 00:00:00 | c | 24 | 22 | 2020 | 6 | 1 |
| 2020-06-02 00:00:00 | c | 16 | 12 | 2020 | 6 | 2 |
| 2020-06-03 00:00:00 | c | 1 | 18 | 2020 | 6 | 3 |
| 2020-06-04 00:00:00 | c | 1 | 28 | 2020 | 6 | 4 |
| 2020-06-05 00:00:00 | c | 45 | 4 | 2020 | 6 | 5 |
and just append them.
Then, to retrieve all items in a month, for instance, I could just:
filters = [('year', '==', 2020), ('month', '==', 6)]
ddf = dd.read_parquet(db, columns=['item'], filters=filters)
items = ddf[['item']].drop_duplicates().compute()
and grab a unique item by:
filters = [('item', '==', 'b'), ('year', '==', 2020), ('month', '==', 6)]
Is there anything I'm missing?
Regards,
Carlos.
Hello, I am trying to transpose a table in a dataframe as follows, where A and B are both company names.
This is the dataframe I have so far
|---------------------|------------------|------------------|
| Date | A | B |
|---------------------|------------------|------------------|
| date_1 | 34 | 8 |
|---------------------|------------------|------------------|
| date_2 | | 12 |
|---------------------|------------------|------------------|
| date_3 | 6 | 321 |
|---------------------|------------------|------------------|
and this is what I am looking to achieve:
|---------------------|------------------|------------------|
| Date | Company | Value |
|---------------------|------------------|------------------|
| date_1 | A | 34 |
|---------------------|------------------|------------------|
| date_1 | B | 8 |
|---------------------|------------------|------------------|
| date_2 | B | 12 |
|---------------------|------------------|------------------|
| date_3 | A | 6 |
|---------------------|------------------|------------------|
| date_3 | B | 321 |
|---------------------|------------------|------------------|
You are probably looking for melt, which should give you what you want.
pd.melt(df, id_vars=['Date'],value_vars=['A','B'], var_name='Company',value_name='Value').dropna()
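As a self-contained sketch of that call, the frame below is rebuilt from the question's table, with the empty cell assumed to be NaN. The sort_values step and the name long_df are additions for illustration, used only to reproduce the row order of the desired table.

```python
import numpy as np
import pandas as pd

# Wide table from the question; the blank cell for A on date_2 becomes NaN.
df = pd.DataFrame({
    "Date": ["date_1", "date_2", "date_3"],
    "A": [34, np.nan, 6],
    "B": [8, 12, 321],
})

long_df = (
    pd.melt(df, id_vars=["Date"], value_vars=["A", "B"],
            var_name="Company", value_name="Value")
    .dropna(subset=["Value"])          # drop the (date_2, A) row that has no value
    .sort_values(["Date", "Company"])  # melt stacks all A rows first, then B
    .reset_index(drop=True)
)
print(long_df)
```

Because column A contains NaN, the Value column comes out as floats; cast it back with astype(int) after dropna if integers are needed.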
I have about 2 billion records and I want to group data with PySpark and save each grouped data to csv.
Here is my sample Dataframe:
+----+------+---------------------+
| id | name | date |
+----+------+---------------------+
| 1 | a | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 2 | b | 2019-12-01 00:00:00 |
+----+------+---------------------+
| 3 | c | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 4 | a | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 5 | b | 2020-01-01 00:00:00 |
+----+------+---------------------+
| 6 | a | 2020-01-05 00:00:00 |
+----+------+---------------------+
| 7 | b | 2020-01-05 00:00:00 |
+----+------+---------------------+
Then I use groupBy to group them with this code:
df.groupBy([
'name',
year('date').alias('year'),
month('date').alias('month')
]).count()
output:
+------+------+-------+-------+
| name | year | month | count |
+------+------+-------+-------+
| a | 2019 | 12 | 1 |
+------+------+-------+-------+
| b | 2019 | 12 | 1 |
+------+------+-------+-------+
| c | 2020 | 01 | 1 |
+------+------+-------+-------+
| a | 2020 | 01 | 2 |
+------+------+-------+-------+
| b | 2020 | 01 | 2 |
+------+------+-------+-------+
But I want each group elements in Dataframe like this:
+------+------+-------+-----------+
| name | year | month | element |
+------+------+-------+-----------+
| a | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| b | 2019 | 12 | Dataframe |
+------+------+-------+-----------+
| c | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| a | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
| b | 2020 | 01 | Dataframe |
+------+------+-------+-----------+
Where "element column" contains grouped Dataframe in each group then I want to map each group and save them to separate csv.
Note: I have tried to use distinct and collect for grouping then select data for each group, but performance is too slow for my huge data. I think groupBy is faster, so I want to use groupBy instead.
How to do it in PySpark ?
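You can achieve your goal using withColumn and lit:
from pyspark.sql.functions import year, month, lit
# groupBy returns GroupedData, so aggregate first to get a DataFrame back
df.groupBy(['name', year('date').alias('year'), month('date').alias('month')]) \
.count().drop('count') \
.withColumn('element', lit('Dataframe'))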
Given a graphlab SFrame where there's a column with dates, e.g.:
+-------+------------+---------+-----------+
| Store | Date | Sales | Customers |
+-------+------------+---------+-----------+
| 1 | 2015-07-31 | 5263.0 | 555.0 |
| 2 | 2015-07-31 | 6064.0 | 625.0 |
| 3 | 2015-07-31 | 8314.0 | 821.0 |
| 4 | 2015-07-31 | 13995.0 | 1498.0 |
| 3 | 2015-07-20 | 4822.0 | 559.0 |
| 2 | 2015-07-10 | 5651.0 | 589.0 |
| 4 | 2015-07-11 | 15344.0 | 1414.0 |
| 5 | 2015-07-23 | 8492.0 | 833.0 |
| 2 | 2015-07-19 | 8565.0 | 687.0 |
| 10 | 2015-07-09 | 7185.0 | 681.0 |
+-------+------------+---------+-----------+
[986159 rows x 4 columns]
Is there an easy way in graphlab / other python function to convert the Date column to Year|Month|Day?
+-------+------+----+----+---------+-----------+
| Store | YYYY | MM | DD | Sales | Customers |
+-------+------+----+----+---------+-----------+
| 1 | 2015 | 07 | 31 | 5263.0 | 555.0 |
| 2 | 2015 | 07 | 31 | 6064.0 | 625.0 |
| 3 | 2015 | 07 | 31 | 8314.0 | 821.0 |
+-------+------+----+----+---------+-----------+
[986159 rows x 6 columns]
In pandas, I can do this (see "Which is the fastest way to extract day, month and year from a given date?").
But converting an SFrame into a pandas DataFrame to split the date, and then converting it back into an SFrame, is quite a chore.
You could also do it with the split_datetime method. It gives you a bit more flexibility.
sf.add_columns(sf['Date'].split_datetime(column_name_prefix = ''))
The split_datetime method itself is on the SArray (a single column of the SFrame), and it returns an SFrame which you can then add back to the original data (at basically 0 cost).
A quick and dirty way to do this is
sf['date2'] = sf['Date'].apply(lambda x: x.split('-'))
sf = sf.unpack('date2')
Another option would be to convert the Date column to a datetime type, then use the graphlab.SArray.split_datetime function.