I am trying to learn Python, coming from a SAS background.
I have imported a SAS dataset, and one thing I noticed was that I have multiple date columns that are coming through as SAS dates (I believe).
In looking around, I found a link which explained how to perform this (here):
The code is as follows:
alldata['DateFirstOnsite'] = pd.to_timedelta(alldata.DateFirstOnsite, unit='s') + pd.datetime(1960, 1, 1)
However, I'm wondering how to do this for multiple columns. If I have multiple date fields, rather than repeating this line of code multiple times, can I create a list of fields I have, and then run this code on that list of fields? How is that done?
Thanks in advance
Yes, it's possible to create a list and iterate through it to convert the SAS date fields to pandas datetimes. However, I'm not sure why you're using the to_timedelta method unless the SAS date fields are represented as seconds after 1960/01/01. If you do plan on using to_timedelta, then it's simply a case of writing a function that takes your df and a field name and passing those two into the function:
def convert_SAS_to_datetime(df, field):
    df[field] = pd.to_timedelta(df[field], unit='s') + pd.datetime(1960, 1, 1)
    return df
Now, let's suppose you have your list of fields that you know should be converted to a datetime field (along with your df):
my_list = ['field1','field2','field3','field4','field5']
my_df = pd.read_sas('mySASfile.sas7bdat') # your SAS data that's converted to a pandas DF
You can now iterate through your list with a for loop while passing those fields and your df to the function:
for field in my_list:
    my_df = convert_SAS_to_datetime(my_df, field)
Now, the other method I would recommend is the to_datetime method, but this assumes that you know what the SAS format of your date fields is.
e.g. 01Jan2016 # date9 format
This is when you might have to look through the documentation here to determine the directives for converting the date. In the case of a date9 format, you can use:
df[field] = pd.to_datetime(df[date9field], format="%d%b%Y")
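If several columns share that date9 format, the same loop idea from above applies; a small sketch, with the field names assumed purely for illustration:

date9_fields = ['DateFirstOnsite', 'DateLastSeen']  # hypothetical column names
for field in date9_fields:
    my_df[field] = pd.to_datetime(my_df[field], format="%d%b%Y")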
If I read your question correctly, you want to apply your code to multiple columns? To do that, simply do this:
alldata[['col1','col2','col3']] = 'your_code_here'
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.NaN, np.NaN, 3, 4, 5, 5, 3, 1, 5, np.NaN],
                   'B': [1, 0, 3, 5, 0, 0, np.NaN, 9, 0, 0],
                   'C': ['Pharmacy of IDAHO', 'Access medicare arkansas', 'NJ Pharmacy', 'Idaho Rx', 'CA Herbals', 'Florida Pharma', 'AK RX', 'Ohio Drugs', 'PA Rx', 'USA Pharma'],
                   'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.NaN],
                   'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
df[['E', 'D']] = 1 # <---- notice double brackets
print(df)
A B C D E
0 NaN 1.0 Pharmacy of IDAHO 1 1
1 NaN 0.0 Access medicare arkansas 1 1
2 3.0 3.0 NJ Pharmacy 1 1
3 4.0 5.0 Idaho Rx 1 1
4 5.0 0.0 CA Herbals 1 1
5 5.0 0.0 Florida Pharma 1 1
6 3.0 NaN AK RX 1 1
7 1.0 9.0 Ohio Drugs 1 1
8 5.0 0.0 PA Rx 1 1
9 NaN 0.0 USA Pharma 1 1
Notice the double brackets around the column selection. Hope this helps!
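Tying this back to the original question, the double-bracket selection can be combined with apply to convert several SAS date columns in one statement; a sketch, with the column list assumed and pd.Timestamp used for the 1960-01-01 SAS epoch:

date_cols = ['DateFirstOnsite', 'DateFirstAdmit']  # hypothetical list of SAS date columns
alldata[date_cols] = alldata[date_cols].apply(
    lambda s: pd.to_timedelta(s, unit='s') + pd.Timestamp('1960-01-01')
)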
I have two dataframes A and B that contain different sets of patient data, and I need to append certain columns from B to A, but only for those rows that contain information from the same patient and visit, i.e. where A and B have a matching value in two particular columns. B is longer than A, and not all rows in A are contained in B. I don't know how this would be possible without looping, but many people discourage looping over pandas dataframes (apart from the fact that my loop solution does not work because of "Can only compare identically-labeled Series objects"). I read the options here
How to iterate over rows in a DataFrame in Pandas
but don't see which one I could apply here and would appreciate any tips!
Toy example (the actual dataframe has about 300 rows):
dict_A = {
'ID': ['A_190792','X_210392','B_050490','F_311291','K_010989'],
'Visit_Date': ['2010-10-31','2011-09-24','2010-30-01','2012-01-01','2013-08-13'],
'Score': [30, 23, 42, 23, 31],
}
A = pd.DataFrame(dict_A)
dict_B = {
'ID': ['G_090891','A_190792','Z_060791','X_210392','B_050490','F_311291','K_010989','F_230989'],
'Visit_Date': ['2013-03-01','2010-10-31','2013-04-03','2011-09-24','2010-30-01','2012-01-01','2013-08-13','2014-09-09'],
'Diagnosis': ['F12', 'G42', 'F34', 'F90', 'G98','G87','F23','G42'],
}
B = pd.DataFrame(dict_B)
for idx, row in A.iterrows():
    A.loc[row, 'Diagnosis'] = B['Diagnosis'][(B['Visit_Date'] == A['Visit_Date']) & (B['ID'] == A['ID'])]
    # Appends the Diagnosis column from B to A for rows where ID and date match
I have seen this question Append Columns to Dataframe 1 Based on Matching Column Values in Dataframe 2 but the only answer is quite specific to that case and also does not address the question of whether a loop can/should be used or not.
I think you can use merge:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='outer')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30.0 G42
1 X_210392 2011-09-24 23.0 F90
2 B_050490 2010-30-01 42.0 G98
3 F_311291 2012-01-01 23.0 G87
4 K_010989 2013-08-13 31.0 F23
5 G_090891 2013-03-01 NaN F12
6 Z_060791 2013-04-03 NaN F34
7 F_230989 2014-09-09 NaN G42
'''
If you want to keep only the rows of A:
A['Visit_Date']=pd.to_datetime(A['Visit_Date'])
B['Visit_Date']=pd.to_datetime(B['Visit_Date'])
final=A.merge(B,on=['Visit_Date','ID'],how='left')
print(final)
'''
ID Visit_Date Score Diagnosis
0 A_190792 2010-10-31 30 G42
1 X_210392 2011-09-24 23 F90
2 B_050490 2010-30-01 42 G98
3 F_311291 2012-01-01 23 G87
4 K_010989 2013-08-13 31 F23
'''
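If B carries other columns that shouldn't be appended, a common variant is to select just the key columns plus the payload from B before merging; a small sketch along the same lines:

cols_to_add = ['ID', 'Visit_Date', 'Diagnosis']  # keys plus the column(s) to append
final = A.merge(B[cols_to_add], on=['ID', 'Visit_Date'], how='left')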
What I am looking for, but can't seem to get right, is essentially this:
I'll read in a csv file to build a df, with the output being a table that I can manipulate or edit.
Ultimately, I'm looking for a method to rename the column header placeholders with the first "string" in the column.
Currently:
import pandas as pd
df = pd.read_csv("NameOfCSV.csv")
df
Example Output:
m1_name  m1_delta  m2_name  m2_delta  m3_name  m3_delta  ...  m10_name
CO2      1         NMHC     2         NaN      1         ...  NaN
CO2      2         NMHC     1         NaN      2         ...  NaN
CO2      1         NMHC     2         NaN      1         ...  NaN
CO2      2         NMHC     1         CH4      2         ...  NaN
What I am trying to understand is how to create a generic program that will grab the gas name within the column and rename the "m*_name" header, and any other header corresponding to "m*_blah", with the gas name from that "m*_name" column.
Example Desired Output where the headers reflect the gas name:
CO2_name  CO2_delta  NMHC_name  NMHC_delta  NO2_name  NO2_delta  ...  m10_name
CO2       1          NMHC       2           NaN       1          ...  NaN
CO2       2          NMHC       1           NaN       2          ...  NaN
CO2       1          NMHC       2           NO2       1          ...  NaN
I've tried playing around with several functions (primarily .rename()), looking for examples of similar problems, and digging through the documentation, but have been unsuccessful in coming up with a solution. Beyond the couple of column headers in the example there are about a dozen per gas name, so I was also trying to build a loop structure to find the headers with the corresponding m_number ("m*_otherHeader") and populate those headers as well. The datasets I'm working with are dynamic, and the positions of these original columns in the csv change as well. Some help would be appreciated, and/or pointing me in the direction of some examples or the proper documentation to read through would be great!
df = df.drop_duplicates().set_axis(
    [
        f"{df[f'{x}_name'].mode()[0]}_{y}"
        if df[f'{x}_name'].mode()[0] != 0
        else f'{x}_{y}'
        for x, y in df.columns.str.split('_', n=1)
    ],
    axis=1,
)
Output:
>>> df
CO2_name CO2_delta NMHC_name NMHC_delta CH4_name CH4_delta m10_name
0 CO2 1 NMHC 2 CH4 1 0
1 CO2 2 NMHC 1 CH4 2 0
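If the one-liner above is hard to follow, a rough step-by-step equivalent using rename might look like this. It is only a sketch: it assumes each m*_name column holds a single gas name, that missing names are real NaN values, and rename_by_gas is just an illustrative helper name.

import pandas as pd

def rename_by_gas(df):
    # Build a mapping from each original header to one that uses the gas
    # found in the matching m*_name column (headers without a gas stay as-is)
    mapping = {}
    for col in df.columns:
        prefix, _, suffix = col.partition('_')   # e.g. 'm1', '_', 'delta'
        name_col = f'{prefix}_name'              # e.g. 'm1_name'
        if name_col in df.columns:
            gases = df[name_col].dropna()
            if not gases.empty:
                mapping[col] = f'{gases.iloc[0]}_{suffix}'
    return df.rename(columns=mapping)

# usage: df = rename_by_gas(df.drop_duplicates())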
I have a dataframe which is sparse and looks something like this:
Conti_mV_XSCI_140|Conti_mV_XSCI_12|Conti_mV_XSCI_76|Conti_mV_XSCO_11|Conti_mV_XSCO_203|Conti_mV_XSCO_75
1 | nan | nan | 12 | nan | nan
nan | 22 | nan | nan | 13 | nan
nan | nan | 9 | nan | nan | 31
As you can see, XSCI is present in 3 header names; the only difference is a random number (_140, _12, _76) appended to them, which makes them distinct.
This is not correct. The column names should be Conti_mV_XSCI and Conti_mV_XSCO.
The final column name (without any random number) should hold the values from all three columns it was spread across (for example, XSCI was spread across XSCI_140, XSCI_12, XSCI_76).
The final dataframe should look something like this -
Conti_mV_XSCI| Conti_mV_XSCO
1 | 12
22 | 13
99 | 31
If you notice, the first value of XSCI comes from XSCI_140, the second value comes from the second column with XSCI, and so on. The same applies to XSCO.
The issue is, I have to do this for all the columns starting with a certain prefix, like "Conti_mV", "IDD_PowerUp_mA", etc.
My issue:
I am having a hard time cleaning up the header names, because as soon as I remove the random number from the end, it throws an error about duplicate columns; it is also not elegant.
It would be a great help if anyone can help me. Please comment if anything is not clear here.
I need a new dataframe with one column (where there were 3) combining the data from them.
Thanks.
First, if necessary, convert all columns to numeric:
df = df.apply(pd.to_numeric, errors='coerce')
If you need to group by the column names split from the right side and sum the values in each group:
df = df.groupby(lambda x: x.rsplit('_', 1)[0], axis=1).sum()
print (df)
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
1 22.0 13.0
2 9.0 31.0
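Note that groupby(..., axis=1) is deprecated in newer pandas releases; a rough equivalent sketch that groups the columns by their prefix without it:

# Group columns by everything before the trailing _<number> and sum row-wise
prefixes = df.columns.str.rsplit('_', n=1).str[0]
df = pd.DataFrame({p: df.loc[:, prefixes == p].sum(axis=1) for p in prefixes.unique()})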
If you need to filter the columns manually:
df['Conti_mV_XSCI'] = df.filter(like='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(like='XSCO').sum(axis=1)
EDIT: One idea for summing only the columns whose names start with values from a list:
cols = ['IOZH_Pat_uA', 'IOZL_Pat_uA', 'Power_Short_uA', 'IDDQ_uA']
for c in cols:
    # here ^ is for start of string
    columns = df.filter(regex=f'^{c}')
    df[c] = columns.sum(axis=1)
    df = df.drop(columns, axis=1)
print (df)
try:
df['Conti_mV_XSCI'] = df.filter(regex='XSCI').sum(axis=1)
df['Conti_mV_XSCO'] = df.filter(regex='XSCO').sum(axis=1)
edit:
you can fillna with zeroes before the above operations.
df=df.fillna(0)
This will add a column Conti_mV_XSCI with the first non-nan entry for any column whose name begins with Conti_mV_XSCI
from math import isnan
df['Conti_mV_XSCI'] = df.filter(regex=("Conti_mV_XSCI.*")).apply(lambda row: [_ for _ in row if not isnan(_)][0], axis=1)
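A similar effect without apply can be had with bfill, which takes the first non-NaN value from left to right across the matching columns; a small sketch:

# First non-NaN value per row among the Conti_mV_XSCI_* columns
df['Conti_mV_XSCI'] = df.filter(regex='Conti_mV_XSCI.*').bfill(axis=1).iloc[:, 0]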
You can use the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install the latest dev version of pyjanitor
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
(df.pivot_longer(names_to=".value",
names_pattern=r"(.+)_\d+")
.dropna())
Conti_mV_XSCI Conti_mV_XSCO
0 1.0 12.0
4 22.0 13.0
8 9.0 31.0
The names_pattern regex captures everything before the trailing underscore and number, and stacks the values from the matching columns under that shared header.
So I have a dataframe that has been read from a CSV. It has 36 columns and 3000+ rows. I want to split the dataframe on a column that contains items separated by a semicolon.
It is purchasing data, and most of the columns (for example Invoice Number, Sales Rep, etc.) would just be copied down for the split. That is the first step, and I have found many answers for this on SO, but none that solve the second part.
There are other columns: Quantity, Extended Cost, Extended Price, and Extended Gross Profit that would need to be recalculated based on the split. The quantity, for the rows with values in the column in question, would need to be 1 for each item in the list; the subsequent columns would need to be recalculated based on that column.
See below for an example DF:
How would I go about this?
A lot of implementations use df.split(';') and some use df.apply, but unfortunately I am not understanding the process from front to back.
Edit: This is the output I am looking for:
Proposed output
Using pandas 0.25.1+ you can use explode:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Quantity': [6, 50, 25, 4],
                   'Column in question': ['1;2;3;4;5;6', '', '', '7;8;9;10'],
                   'Price': ['$1.00', '$10.00', '$0.10', '$25.00'],
                   'Invoice Close Date': ['9/3/2019', '9/27/2019', '9/18/2019', '9/30/2019']})

df_out = df.assign(ciq=df['Column in question'].str.split(';')).explode('ciq')\
           .drop('Column in question', axis=1)\
           .rename(columns={'ciq': 'Column in question'})
df_out['Quantity'] = (df_out['Quantity'] / df_out.groupby(level=0)['Quantity'].transform('size'))
df_out
Output:
Quantity Price Invoice Close Date Column in question
0 1.0 $1.00 9/3/2019 1
0 1.0 $1.00 9/3/2019 2
0 1.0 $1.00 9/3/2019 3
0 1.0 $1.00 9/3/2019 4
0 1.0 $1.00 9/3/2019 5
0 1.0 $1.00 9/3/2019 6
1 50.0 $10.00 9/27/2019
2 25.0 $0.10 9/18/2019
3 1.0 $25.00 9/30/2019 7
3 1.0 $25.00 9/30/2019 8
3 1.0 $25.00 9/30/2019 9
3 1.0 $25.00 9/30/2019 10
Details:
First, create a column containing a list using str.split and assign.
Next, use explode, then drop the original column and rename the new column to the old name.
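The other extended columns mentioned in the question (Extended Cost, Extended Price, Extended Gross Profit) could be rescaled the same way as Quantity; a sketch, assuming those column names exist in the real data and hold numeric values:

# Hypothetical column names taken from the question; adjust to the real CSV headers
for col in ['Extended Cost', 'Extended Price', 'Extended Gross Profit']:
    if col in df_out.columns:
        df_out[col] = df_out[col] / df_out.groupby(level=0)[col].transform('size')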
So... I have a Dataframe that looks like this, but much larger:
DATE ITEM STORE STOCK
0 2018-06-06 A L001 4
1 2018-06-06 A L002 0
2 2018-06-06 A L003 4
3 2018-06-06 B L001 1
4 2018-06-06 B L002 2
You can reproduce the same DataFrame with the following code:
import pandas as pd
import numpy as np
import itertools as it
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))
I want to calculate the STOCK difference between days for every ITEM-STORE pair; iterating over groups in a groupby object makes this easy using the .diff() function, to get something like this:
DATE ITEM STORE STOCK DELTA
0 2018-06-06 A L001 4 NaN
9 2018-06-07 A L001 0 -4.0
18 2018-06-08 A L001 4 4.0
27 2018-06-09 A L001 0 -4.0
36 2018-06-10 A L001 3 3.0
45 2018-06-11 A L001 2 -1.0
54 2018-06-12 A L001 2 0.0
I've managed to do so with the following code:
gg = df.groupby([df.ITEM, df.STORE])
lg = []
for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    aux['DELTA'] = aux.STOCK.diff().fillna(value=0)
    lg.append(aux)
df = pd.concat(lg)
But on a large DataFrame this becomes impractical. Is there a faster, more pythonic way to do this task?
I've tried to improve your groupby code, so this should be a lot faster.
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
Some pointers/ideas here:
Don't iterate over groups
Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff can be vectorized
The last line is tantamount to a fillna, but fillna is slower than np.where
Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further
This can also be re-written as
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
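A quick usage check on the reproducible frame from the question (exact numbers vary because STOCK is random):

df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
print(df[(df.ITEM == 'A') & (df.STORE == 'L001')])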