Create a date from Year and Month in a SELECT query - python

I'm working on Vertica, and I have a problem that looks really easy, but I can't find a way to figure it out.
From a query I can get two fields, "Month" and "Year". What I want is to automatically select another field, Date, that I'd build as '01/Month/Year' (in the SQL date format). The goal is:
What I have
SELECT MONTH, YEAR FROM MyTable
Output :
01 2020
11 2019
09 2021
What I want
SELECT MONTH, YEAR, *answer* FROM MyTable
Output :
01 2020 01-01-2020
11 2019 01-11-2019
09 2021 01-09-2021
Sorry, it looks really dumb and easy, but I didn't find any good way to do it. Thanks in advance.

Don't use string operations to build dates; you can mess things up considerably.
Today could be 16.07.2021 or 07/16/2021, or also 2021-07-16, and, in France, for example, 16/07/2021. Then, you could also left-trim the zeroes, or have 'July' instead of 07.
Try:
WITH
my_table (mth, yr) AS (
            SELECT  1, 2020
  UNION ALL SELECT 11, 2019
  UNION ALL SELECT  9, 2021
)
SELECT
  yr
, mth
, ADD_MONTHS(DATE '0001-01-01', (yr - 1) * 12 + (mth - 1)) AS firstofmonth
FROM my_table
ORDER BY 1, 2;
  yr  | mth | firstofmonth
------+-----+--------------
 2019 |  11 | 2019-11-01
 2020 |   1 | 2020-01-01
 2021 |   9 | 2021-09-01
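The ADD_MONTHS trick works because (yr - 1) * 12 + (mth - 1) counts whole months elapsed since year 1, month 1. The same arithmetic can be sanity-checked outside SQL; here is a small Python sketch (illustrative only, not Vertica code):

```python
from datetime import date

def first_of_month(yr: int, mth: int) -> date:
    """Same arithmetic as ADD_MONTHS(DATE '0001-01-01', (yr-1)*12 + (mth-1))."""
    months = (yr - 1) * 12 + (mth - 1)   # whole months elapsed since 0001-01-01
    return date(months // 12 + 1, months % 12 + 1, 1)

print(first_of_month(2019, 11))  # 2019-11-01
```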

I finally found a way to do it (note that CONCAT returns a string, so cast the result to DATE if you need a real date):
SELECT MONTH, YEAR, CONCAT(CONCAT(YEAR, '-'), CONCAT(MONTH, '-01')) FROM MyTable

Try this:
SELECT [MONTH], [YEAR], CONCAT(CONCAT(CONCAT('01-',[MONTH]),'-'),[YEAR]) AS [DATE]
FROM MyTable
Output will be:
| MONTH | YEAR | DATE |
|-------|------|------------|
| 01 | 2020 | 01-01-2020 |
| 11 | 2019 | 01-11-2019 |
| 09 | 2021 | 01-09-2021 |

Related

Imputing Values Based on FirstYear and LastYear in Long Table Format

I have a long firm-level table that has each firm's first and last active year and its zip code.
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})
Firm FirstYear LastYear Zipcode
A 2020 2021 00000
B 2019 2022 00001
C 2018 2019 00003
I want to get panel data that has the zipcode for every active year. So ideally I might want a wide table that imputes the Zipcode for the first year, the last year, and every year in between.
It should look like this:
      2020   2021   2019   2022   2018
A    00000  00000
B    00001  00001  00001  00001
C                  00003         00003
I have some code that builds the long table row by row, but I have many millions of rows and it takes a long time. What's the best way, in terms of performance and memory use, to transform the long table I have and impute every year's zipcode value in pandas?
Thanks in advance.
Responding to the answer's update:
Imagine there is a firm whose first and last years don't overlap with the other firms'.
df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 1997],
                   'LastYear': [2021, 2022, 2008],
                   'Zipcode': ['00000', '00001', '00003']})
The output from the code is like:
Firm   2020   2021   2019   2022   1997   2008
A     00000  00000
B     00001  00001  00001  00001
C                                 00003  00003
Here is a solution with pd.melt()
d = (pd.melt(df,id_vars=['Firm','Zipcode'])
.set_index(['Firm','value'])['Zipcode']
.unstack(level=1))
d = (d.ffill(axis=1)
.where(d.ffill(axis=1).notna() &
d.bfill(axis=1).notna())
.reindex(df[['FirstYear','LastYear']].stack().unique(),axis=1))
Original Answer:
(pd.melt(df,id_vars=['Firm','Zipcode'])
.set_index(['Firm','value'])['Zipcode']
.unstack(level=1)
.reindex(df[['FirstYear','LastYear']].stack().unique(),axis=1))
Output:
value   2020   2021   2019   2022   2018
Firm
A      00000  00000    NaN    NaN    NaN
B      00001  00001  00001  00001    NaN
C        NaN    NaN  00003    NaN  00003
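The updated solution can be run end-to-end on the sample frame; the key idea is that a cell is kept only where both a forward-fill and a backward-fill along the year axis are non-null, i.e. only between each firm's first and last year:

```python
import pandas as pd

df = pd.DataFrame({'Firm': ['A', 'B', 'C'],
                   'FirstYear': [2020, 2019, 2018],
                   'LastYear': [2021, 2022, 2019],
                   'Zipcode': ['00000', '00001', '00003']})

# Long -> wide: one column per year, Zipcode where the firm was active
d = (pd.melt(df, id_vars=['Firm', 'Zipcode'])   # rows: (Firm, Zipcode, variable, year)
       .set_index(['Firm', 'value'])['Zipcode']
       .unstack(level=1))                       # years become columns

# Keep cells between FirstYear and LastYear; leave everything outside as NaN
d = (d.ffill(axis=1)
       .where(d.ffill(axis=1).notna() & d.bfill(axis=1).notna())
       .reindex(df[['FirstYear', 'LastYear']].stack().unique(), axis=1))
```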

Extract date and sort rows by date

I have a dataset that includes some strings in the following forms:
Text
Jun 28, 2021 — Brendan Moore is p...
Professor of Psychology at University
Aug 24, 2019 — Chemistry (Nobel prize...
by A Craig · 2019 · Cited by 1 — Authors. ...
... 2020 | Volume 8 | Article 330Edited by:
I would like to create a new column containing, where they exist, the dates ranked in ascending order.
To do so, I need to extract the part of the string that contains the date information from each row, where it exists.
Something like this:
Text Numbering
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
All the rows not starting with a date (one that follows the format Jun 28, 2021 —) are assigned -1.
The first step would be to identify the pattern xxx xx, xxxx;
then transform the date object into datetime (yyyy-mm-dd).
Once this date information is obtained, it needs to be converted into a numerical rank, then sorted.
I am having difficulty with the last point, specifically how to filter only the dates and sort them in an appropriate way.
The expected output would be
Text Numbering (sort by date asc)
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
Mission accomplished:
import numpy as np
import pandas as pd

# Find rows that start with a date
matches = df['Text'].str.match(r'^\w+ \d+, \d{4}')
# Parse dates out of date rows
df['date'] = pd.to_datetime(df[matches]['Text'], format='%b %d, %Y', exact=False, errors='coerce')
# Assign numbering for dates
df['Numbering'] = df['date'].sort_values().groupby(np.ones(df.shape[0])).cumcount() + 1
# -1 for the non-dates
df.loc[~matches, 'Numbering'] = -1
# Cleanup
df.drop('date', axis=1, inplace=True)
Output:
>>> df
Text Numbering
0 Jun 28, 2021 - Brendan Moore is p... 2
1 Professor of Psychology at University -1
2 Aug 24, 2019 - Chemistry (Nobel prize... 1
3 by A Craig - 2019 - Cited by 1 - Authors. ... -1
4 ... 2020 | Volume 8 | Article 330Edited by: -1
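A shorter variant of the numbering step (an alternative sketch, not the answer's code) extracts the leading date with a regex group and ranks the parsed dates directly with Series.rank, which avoids the sort/groupby/cumcount dance:

```python
import pandas as pd

df = pd.DataFrame({'Text': [
    'Jun 28, 2021 - Brendan Moore is p...',
    'Professor of Psychology at University',
    'Aug 24, 2019 - Chemistry (Nobel prize...',
    'by A Craig - 2019 - Cited by 1 - Authors. ...',
    '... 2020 | Volume 8 | Article 330Edited by:',
]})

# Parse a leading "Mon DD, YYYY"; non-matching rows become NaT
dates = pd.to_datetime(df['Text'].str.extract(r'^(\w{3} \d{1,2}, \d{4})')[0],
                       format='%b %d, %Y', errors='coerce')

# Rank the parsed dates ascending; NaT rows get -1
df['Numbering'] = dates.rank(method='first').fillna(-1).astype(int)
```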

Doing a pandas left merge with duplicate column names (want to delete left and keep right) [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
So let's say I have df_1
Day Month Amt
--------------- --------- ---------
Monday Jan 10
Tuesday Feb 20
Wednesday Feb 30
Thursday April 40
Friday April 50
and df_2
Month Amt
--------------- ---------
Jan 999
Feb 1000000
April 123456
I want to get the following result when I do a left merge:
Day Month Amt
--------------- --------- ---------
Monday Jan 999
Tuesday Feb 1000000
Wednesday Feb 1000000
Thursday April 123456
Friday April 123456
So basically the 'Amt' values from the right table replace the 'Amt' values from the left table where applicable.
When I try
df_1.merge(df_2,how = 'left',on = 'Month')
I get:
Day Month Amt_X Amt_Y
--------------- --------- --------- -------
Monday Jan 10 999
Tuesday Feb 20 1000000
Wednesday Feb 30 1000000
Thursday April 40 123456
Friday April 50 123456
Anyone know of a simple and efficient fix? Thanks!
This answer is purely supplemental to the duplicate target, which is a much more comprehensive answer than this.
Strategy #1
There are two components to this problem.
Use df_2 to create a mapping.
The intuitive way to do this is
mapping = df_2.set_index('Month')['Amt']
which creates a series object that can be passed to pd.Series.map
However, I'm partial to
mapping = dict(zip(df_2.Month, df_2.Amt))
Or even more obtuse
mapping = dict(zip(*map(df_2.get, df_2)))
Use pandas.Series.map
df_1.Month.map(mapping)
0 999
1 1000000
2 1000000
3 123456
4 123456
Name: Month, dtype: int64
Finally, you want to put that into the existing dataframe.
Create a copy
df_1.assign(Amt=df_1.Month.map(mapping))
Day Month Amt
0 Monday Jan 999
1 Tuesday Feb 1000000
2 Wednesday Feb 1000000
3 Thursday April 123456
4 Friday April 123456
Overwrite existing data
df_1['Amt'] = df_1.Month.map(mapping)
Strategy #2
To use merge most succinctly, drop the column that is to be replaced.
df_1.drop('Amt', axis=1).merge(df_2)
Day Month Amt
0 Monday Jan 999
1 Tuesday Feb 1000000
2 Wednesday Feb 1000000
3 Thursday April 123456
4 Friday April 123456
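Both strategies can be checked against each other on the sample frames; note that Strategy #2 works because, after dropping Amt from df_1, Month is the only shared column, so merge joins on it by default and keeps df_2's Amt:

```python
import pandas as pd

df_1 = pd.DataFrame({'Day': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
                     'Month': ['Jan', 'Feb', 'Feb', 'April', 'April'],
                     'Amt': [10, 20, 30, 40, 50]})
df_2 = pd.DataFrame({'Month': ['Jan', 'Feb', 'April'],
                     'Amt': [999, 1000000, 123456]})

# Strategy #1: map Month -> replacement Amt
mapping = dict(zip(df_2.Month, df_2.Amt))
out1 = df_1.assign(Amt=df_1.Month.map(mapping))

# Strategy #2: drop the stale column, then merge on the shared Month column
out2 = df_1.drop('Amt', axis=1).merge(df_2)
```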

Pyspark split date string

I have a dataframe and want to split the start_date column (month name and year) and keep just the year in a new column (column 4):
ID start_date End_date start_year
|01874938| August 2013| December 2014| 2013|
|00798252| March 2009| May 2015| 2009|
|02202785| July 2, 2014|January 15, 2016| 2, |
|01646125| November 2012| November 2015| 2012|
As you can see, I can split the date and keep the year. However, for dates like the one in row 3, "July 2, 2014", the result is "2, " instead of 2014.
This is my code:
split_col = fn.split(df7_ct_map['start_date'], ' ')
df = df7_ct_map.withColumn('NAME1', split_col.getItem(0))
df = df.withColumn('start_year', split_col.getItem(1))
You could use a regular expression instead of splitting on spaces.
from pyspark.sql.functions import regexp_extract
df.withColumn('start_year', regexp_extract(df['start_date'], r'\d{4}', 0))
This will match 4 consecutive numbers, i.e. a year.
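The pattern can be verified quickly in plain Python (a sketch with the question's sample values, not PySpark code); re.search finds the first run of exactly four digits, so the lone "2" in "July 2, 2014" is skipped:

```python
import re

dates = ['August 2013', 'March 2009', 'July 2, 2014', 'November 2012']

# \d{4} grabs the first run of exactly four consecutive digits - the year
years = [re.search(r'\d{4}', d).group() for d in dates]
print(years)  # ['2013', '2009', '2014', '2012']
```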
You could also extract the last 4 characters of your column start_date.
from pyspark.sql import functions as F
df.withColumn('start_year',
              F.expr("substring(rtrim(start_date), length(rtrim(start_date)) - 3, 4)"))\
  .show()
+-------------+----------+
| start_date|start_year|
+-------------+----------+
| August 2013| 2013|
| March 2009| 2009|
| July 2, 2014| 2014|
|November 2014| 2014|
+-------------+----------+

Python Grouping column values into one value

Hi all, so using this past link:
I am trying to consolidate columns of values into rows using groupby:
hp = hp[hp.columns[:]].groupby('LC_REF').apply(lambda x: ','.join(x.dropna().astype(str)))
#what I have
22     | 23     | 24      | LC_REF
TV     | WATCH  | HELLO   | 2C16
SCREEN | SOCCER | WORLD   | 2C16
TEST   | HELP   | RED     | 2C17
SEND   | PLEASE | PARFAIT | 2C17
#desired output
22 | TV,SCREEN
23 | WATCH, SOCCER
24 | HELLO, WORLD
25 | TEST, SEND
26 | HELP,PLEASE
27 | RED, PARFAIT
Or some sort of variation where columns 22, 23, 24 are combined and grouped by LC_REF. My current code turns all of column 22 into one row, all of column 23 into another, etc. I am so close I can feel it!! Any help is appreciated.
It seems you need:
df = (hp.groupby('LC_REF')
        .agg(lambda x: ','.join(x.dropna().astype(str)))
        .stack()
        .rename_axis(('LC_REF', 'a'))
        .reset_index(name='vals'))
print(df)
LC_REF a vals
0 2C16 22 TV,SCREEN
1 2C16 23 WATCH,SOCCER
2 2C16 24 HELLO,WORLD
3 2C17 22 TEST,SEND
4 2C17 23 HELP,PLEASE
5 2C17 24 RED,PARFAIT
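The chain can be run end-to-end on the question's sample data: agg joins each column within a group, stack pivots the column labels into a second index level, and reset_index flattens it back out:

```python
import pandas as pd

hp = pd.DataFrame({'22': ['TV', 'SCREEN', 'TEST', 'SEND'],
                   '23': ['WATCH', 'SOCCER', 'HELP', 'PLEASE'],
                   '24': ['HELLO', 'WORLD', 'RED', 'PARFAIT'],
                   'LC_REF': ['2C16', '2C16', '2C17', '2C17']})

df = (hp.groupby('LC_REF')
        .agg(lambda x: ','.join(x.dropna().astype(str)))  # join each column per group
        .stack()                                          # columns -> second index level
        .rename_axis(('LC_REF', 'a'))
        .reset_index(name='vals'))
```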