I am working on a project in GIS software where I need a column containing dates in the format YYYY-MM-DD. Currently, in Excel I have three columns: one with the year, one with the month, and one with the day. It looks like this:
| A | B | C |
| 2012 | 1 | 1 |
| 2012 | 2 | 1 |
| 2012 | 3 | 1 |
...etc...
And I need it to look like this:
| A |
| 2012-01-01|
| 2012-02-01|
| 2012-03-01|
I have several workbooks that I need in the same format, so I figured that Python might be a useful tool so that I don't have to manually concatenate everything in Excel.
So, my question is: is there a simple way not only to concatenate these three columns, but also to add a zero in front of single-digit month and day numbers?
I have been experimenting a little bit with the python library openpyxl, but have not come up with anything useful so far. Any help would be appreciated, thanks.
If you're going to be staying in Excel, you may as well just use Excel's macro scripting. If your year, month, and day are in columns A, B and C, you can just type this in column D to concatenate them, then format it as a date and adjust the padding.
=$A1 & "-" & $B1 & "-" & $C1
Try this:
def DateFromABC(self, ws):
    import datetime
    # Start reading from row 1
    for i, row in enumerate(ws.rows, 1):
        sRow = str(i)
        # datetime goes into column 'D'
        Dcell = ws['D' + sRow]
        # Set the date from columns 'A', 'B', 'C'
        Dcell.value = datetime.date(year=ws['A' + sRow].value,
                                    month=ws['B' + sRow].value,
                                    day=ws['C' + sRow].value)
        print('i=%s, type=%s, value=%s' % (i, str(type(Dcell.value)), Dcell.value))
    # end for
# end def DateFromABC
Tested with Python:3.4.2 - openpyxl:2.4.1 - LibreOffice: 4.3.3.2
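As a follow-up note (not from the original answer): if the GIS tool needs the literal text YYYY-MM-DD rather than an Excel date value, you can either set a number format on the date cell or store the ISO string yourself. A tiny self-contained sketch with a hypothetical file name:

import datetime
import openpyxl

wb = openpyxl.Workbook()
ws = wb.active
d = datetime.date(2012, 1, 1)
ws['D1'].value = d
ws['D1'].number_format = 'yyyy-mm-dd'   # stored as a date, displayed as 2012-01-01
ws['E1'].value = d.isoformat()          # stored as the plain text '2012-01-01'
wb.save('format_demo.xlsx')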
I want to combine two datasets in Python based on multiple conditions using pandas.
The two datasets have different numbers of rows.
The first one contains almost 300k entries, while the second one contains almost 1000 entries.
More specifically, the first dataset "A" has the following columns:
Path | Line | Severity | Vulnerability | Name | Text | Title
An instance of the content of "A" is this:
src.bla.bla.class.java | 24 | medium | Logging found | hr.kravarscan.enchantedfortress_15 | description | Enchanted Fortress
While the second dataset "B" contains the following columns:
Class | Path | DTWC | DR | DW | IDFP
An instance of the content in "B" is this:
y.x.bla.MainActivity | com.lucao.limpazap_11| 0 | 0 | 0 | 0
I want to combine these two datasets as follows:
If A['Name'] is equal to B['Path'] AND B['Class'] is contained in A['Path'],
then
merge the two rows into another data frame "C".
An output example is the following:
Suppose that A contains:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress|
and B contains:
com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
the output should be the following:
src.bla.bla.class.java| 24| medium| Logging found| hr.kravarscan.enchantedfortress_15| description| Enchanted Fortress| com.bla.class | hr.kravarscan.enchantedfortress_15| 0 | 0 | 0 | 0
I'm not sure if this is the best or most efficient way, but I have tested it and it works. My answer is pretty straightforward: we loop over the two dataframes and apply the desired conditions.
Suppose dataset A is df_a and dataset B is df_b.
First we have to add a suffix to every column of df_a and df_b so the two rows can be appended later.
df_a.columns= [i+'_A' for i in df_a.columns]
df_b.columns= [i+'_B' for i in df_b.columns]
Then we can apply this for loop:
df_c = pd.DataFrame()
# Iterate through df_a
for (idx_A, v_A) in df_a.iterrows():
    # Iterate through df_b
    for (idx_B, v_B) in df_b.iterrows():
        # Apply the condition
        if v_A['Name_A'] == v_B['Path_B'] and v_B['Class_B'] in v_A['Path_A']:
            # Cast both series to dictionaries and merge them into a new dict
            c_dict = {**v_A.to_dict(), **v_B.to_dict()}
            # Append c_dict to df_c
            df_c = df_c.append(c_dict, ignore_index=True)
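As a side note (not part of the original answer), the nested iterrows loop can be very slow on 300k rows, and DataFrame.append was removed in pandas 2.0. A hedged sketch of a vectorized alternative, assuming the suffixed df_a and df_b from above: merge on the equality condition first, then filter on the substring condition.

import pandas as pd

# Assumes df_a and df_b already carry the _A / _B column suffixes added above.
merged = df_a.merge(df_b, left_on='Name_A', right_on='Path_B', how='inner')
# Keep only rows where the class name appears inside the path string.
mask = merged.apply(lambda r: r['Class_B'] in r['Path_A'], axis=1)
df_c = merged.loc[mask].reset_index(drop=True)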
I am working with time data in pandas and have seen that there are two ways of extracting a day-of-week integer from a timestamp. These are pd.Series.dt.weekday and pd.Series.dt.dayofweek.
The documentation says that both return a Series or Index containing integers indicating the day number, with Monday=0 and Sunday=6.
Am I missing something or are these two functions effectively the same?
In the "See Also" section, it does describe the other function as an Alias. Does this answer my question?
You answered your own question:
| dayofweek
| The day of the week with Monday=0, Sunday=6.
|
| Return the day of the week. It is assumed the week starts on
| Monday, which is denoted by 0 and ends on Sunday which is denoted
| by 6. This method is available on both Series with datetime
| values (using the `dt` accessor) or DatetimeIndex.
|
| Returns
| -------
| Series or Index
| Containing integers indicating the day number.
|
| See Also
| --------
| Series.dt.dayofweek : Alias.
| Series.dt.weekday : Alias. # <-- YES IT'S AN ALIAS
| Series.dt.day_name : Returns the name of the day of the week.
Source code:
dayofweek = day_of_week
weekday = dayofweek
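For what it's worth, a quick illustrative check (not from the original post) shows the two accessors produce identical results:

import pandas as pd

# A Monday (2021-01-11) and a Sunday (2021-01-17)
s = pd.Series(pd.to_datetime(['2021-01-11', '2021-01-17']))
print(s.dt.weekday.tolist())                 # [0, 6]
print(s.dt.dayofweek.tolist())               # [0, 6]
print(s.dt.weekday.equals(s.dt.dayofweek))   # True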
I know there are some questions on Stack Overflow about SUMPRODUCT, but the solutions are not working for me. I am also new to Python and pandas.
For each row, I want to do a sumproduct of certain columns only if column['2020'] !=0.
I used the below code, but get error:
IndexError: ('index 2018 is out of bounds for axis 0 with size 27', 'occurred at index 0')
Please help. Thank you.
# df_copy is my dataframe
column_list = [2018, 2019]
weights = [6, 9]

def test(df_copy):
    if df_copy[2020] != 0:
        W_Avg = sum(df_copy[column_list] * weights)
    else:
        W_Avg = 0
    return W_Avg

df_copy['sumpr'] = df_copy.apply(test, axis=1)
df_copy
| 2020 | 2018 | 2019 | sumpr |
| 0    | 100  | 20   | 0     |
| 1    | 30   | 10   | 270   |
| 3    | 10   | 10   | 150   |
I am sorry if the table doesn't look like a table. I can't create a table properly in Stackoverflow.
Basically, for a particular row, if
2020 = 2,
2018 = 30,
2019 = 10,
then sumpr = 30*6 + 10*9 = 270.
Your column names are most likely strings, not integers.
To confirm it, run df_copy.columns and you should receive something like:
Index(['2020', '2018', '2019'], dtype='object')
(note the quotes surrounding the column names).
So change your column list to:
column_list = ['2018', '2019']
In your function, also change the column name to a string:
df_copy['2020']
Then your code should run.
You can also run a more concise code:
df_copy['sumpr'] = np.where(df_copy['2020'] != 0,
                            (df_copy[column_list] * weights).sum(axis=1), 0)
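A quick check against the expected output above (an illustrative snippet, with the column names as strings):

import numpy as np
import pandas as pd

df_copy = pd.DataFrame({'2020': [0, 1, 3],
                        '2018': [100, 30, 10],
                        '2019': [20, 10, 10]})
column_list = ['2018', '2019']
weights = [6, 9]
df_copy['sumpr'] = np.where(df_copy['2020'] != 0,
                            (df_copy[column_list] * weights).sum(axis=1), 0)
print(df_copy)   # sumpr: 0, 270, 150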
I'm trying to convert some VBA scripts into a Python script, and I have been having trouble figuring a few things out, as the results seem different from what the Excel file gives.
So I have an example dataframe like this :
| Name     | A_Date     |
|----------|------------|
| RAHEAL   | 04/30/2020 |
| GIFTY    | 05/31/2020 |
| ERIC     | 03/16/2020 |
| PETER    | 05/01/2020 |
| EMMANUEL | 12/15/2019 |
| BABA     | 05/23/2020 |
and I want to achieve this result (the VBA script result):
| Name     | A_Date     | Sold |
|----------|------------|------|
| RAHEAL   | 04/30/2020 | No   |
| GIFTY    | 05/31/2020 | Yes  |
| ERIC     | 03/16/2020 | No   |
| PETER    | 05/01/2020 | Yes  |
| EMMANUEL | 12/15/2019 | No   |
| BABA     | 05/23/2020 | Yes  |
By converting this VBA script :
Range("C2").Select
Selection.Value = _
"=IF(RC[-1]>=(INT(R2C2)-DAY(INT(R2C2))+1),""Yes"",""No"")"
Selection.AutoFill Destination:=Range("C2", "C" & Cells(Rows.Count, 1).End(xlUp).Row)
Range("C1").Value = "Sold"
ActiveSheet.Columns("C").Copy
ActiveSheet.Columns("C").PasteSpecial xlPasteValues
Simply: =IF(B2>=(INT($B$2)-DAY(INT($B$2))+1),"Yes","No")
To this Python script:
sales['Sold']=np.where(sales['A_Date']>=(sales['A_Date'] - pd.to_timedelta(sales.A_Date.dt.day, unit='d'))+ timedelta(days=1),'Yes','No')
But I keep getting "Yes" throughout... could anyone help me spot where I might have made a mistake?
import pandas as pd

df = pd.DataFrame({'Name': ['RAHEAL', 'GIFTY', 'ERIC', 'PETER', 'EMMANUEL', 'BABA'],
                   'A_Date': ['04/30/2020', '05/31/2020', '03/16/2020',
                              '05/01/2020', '12/15/2019', '05/23/2020']})
df['A_Date'] = pd.to_datetime(df['A_Date'])
print(df)
df['Sold'] = df['A_Date'] >= df['A_Date'].iloc[0].replace(day=1)
df['Sold'] = df['Sold'].map({True: 'Yes', False: 'No'})
print(df)
output:
Name A_Date
0 RAHEAL 2020-04-30
1 GIFTY 2020-05-31
2 ERIC 2020-03-16
3 PETER 2020-05-01
4 EMMANUEL 2019-12-15
5 BABA 2020-05-23
Name A_Date Sold
0 RAHEAL 2020-04-30 Yes
1 GIFTY 2020-05-31 Yes
2 ERIC 2020-03-16 No
3 PETER 2020-05-01 Yes
4 EMMANUEL 2019-12-15 No
5 BABA 2020-05-23 Yes
If I read the formula right, it checks whether the A_Date value >= 04/01/2020 (i.e. the first day of the month for the date in B2), so RAHEAL should be Yes too.
I don't know if you noticed (and whether this is intended), but if an A_Date value has a fractional part (i.e. a time of day), there is room for error when you calculate the value for the 1st of the month. If the time in B2 is, say, 10:00 AM, the cutoff value will be 04/01/2020 10:00. Then another value of, say, 04/01/2020 09:00 will be evaluated as False/No. This is how it works in your Excel formula as well.
EDIT (12 Jan 2021): Note, values in column A_Date are of type datetime.datetime or datetime.date. Presumably they are converted when reading the Excel file or explicitly afterwards.
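If the time component matters, one option (a sketch, assuming A_Date has already been parsed with pd.to_datetime as in the snippet above) is to strip the time before building the cutoff, mirroring what INT() does in the Excel formula:

# Truncate B2 to midnight, then move to the 1st of its month
cutoff = df['A_Date'].iloc[0].normalize().replace(day=1)
df['Sold'] = (df['A_Date'] >= cutoff).map({True: 'Yes', False: 'No'})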
Very much embarrassed I didn't see the simple, elegant solution that buran gave (+1). I did more of a literal translation.
first_date.toordinal() - 693594 is the integer date value for your initial date, current_date.toordinal() - 693594 is the integer date value for the current iteration of the dates column. I apply your cell formula logic to each A_Date row value and output as the corresponding Sold column value.
import pandas as pd
from datetime import datetime

def is_sold(current_date: datetime, first_date: datetime, day_no: int) -> str:
    # use of toordinal idea from @rjha94 https://stackoverflow.com/a/47478659
    # day_no is the day of the first date ($B$2), matching DAY(INT($B$2)) in the Excel formula
    if current_date.toordinal() - 693594 >= first_date.toordinal() - 693594 - day_no + 1:
        return "Yes"
    else:
        return "No"

sales = pd.DataFrame({'Name': ['RAHEAL', 'GIFTY', 'ERIC', 'PETER', 'EMMANUEL', 'BABA'],
                      'A_Date': ['2020-04-30', '2020-05-31', '2020-03-16',
                                 '2020-05-01', '2019-12-15', '2020-05-23']})
sales['A_Date'] = pd.to_datetime(sales['A_Date'], errors='coerce')
first_date = sales['A_Date'].iloc[0]
sales['Sold'] = sales['A_Date'].apply(lambda x: is_sold(x, first_date, first_date.day))
print(sales)
I have done some research on this, but couldn't find a concise method when the index is of type 'string'.
Given the following Pandas dataframe:
Platform | Action | RPG | Fighting
----------------------------------------
PC | 4 | 6 | 9
Playstat | 6 | 7 | 5
Xbox | 9 | 4 | 6
Wii | 8 | 8 | 7
I was trying to get the index (Platform) of the smallest value in the 'RPG' column, which would return 'Xbox'. I managed to make it work, but it's not efficient, and I am looking for a better/quicker/more condensed approach. Here is what I got:
# Get the 'RPG' column as a Series of all platforms' values (Platform is the index)
series1 = ign_data.loc[:, 'RPG']
# Find the lowest value in the series
minim = min(series1)
# Get the index of that value using boolean indexing
result = series1[series1 == minim].index
# Format that index to a list, and return the first (and only) element
str_result = result.format()[0]
Use Series.idxmin:
df.set_index('Platform')['RPG'].idxmin()
#'Xbox'
or what @Quang Hoang suggests in the comments:
df.loc[df['RPG'].idxmin(), 'Platform']
if Platform is already the index:
df['RPG'].idxmin()
EDIT: to get, for a given platform, the column with the smallest value:
df.set_index('Platform').loc['Playstat'].idxmin()
#'Fighting'
df.set_index('Platform').idxmin(axis=1)['Playstat']
#'Fighting'
if Platform is already the index:
df.loc['Playstat'].idxmin()
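For reference, a small end-to-end check recreating the example frame from the question (the variable name ign_data follows the question):

import pandas as pd

# Recreate the example data from the question
ign_data = pd.DataFrame({'Platform': ['PC', 'Playstat', 'Xbox', 'Wii'],
                         'Action': [4, 6, 9, 8],
                         'RPG': [6, 7, 4, 8],
                         'Fighting': [9, 5, 6, 7]})

print(ign_data.set_index('Platform')['RPG'].idxmin())           # Xbox
print(ign_data.set_index('Platform').loc['Playstat'].idxmin())  # Fighting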