I've got a pandas DataFrame which contains NaNs. For each column, I would like to find all blocks of consecutive valid values and, for each block, its first and last index.
Example data:
[
[ 1,nan],
[ 2,nan],
[ 3,nan],
[ 4,3.0],
[ 5,1.0],
[ 6,4.0],
[ 7,1.0],
[ 8,5.0],
[ 9,9.0],
[10,2.0],
[11,nan],
[12,nan],
[13,6.0],
[14,5.0],
[15,3.0],
[16,5.0]
]
where the first column is the index and the second column is the value I'd like to filter on. The result should be something like
[(4,10), (13,16)]
I would like to avoid manually iterating through the data by means of a for-loop for performance reasons...
Update 1:
Two additional criteria:
The valid values in the value column don't have to be equal. They can take any valid float value between -inf and +inf
I only need the first and the last index of valid blocks, not the NaN blocks in between.
I think you can use:
#set column names and set index by first column
df.columns = ['idx', 'a']
df = df.set_index('idx')
#find groups
df['b'] = (df.a.isnull() != df.a.shift(1).isnull()).cumsum()
#remove NaN
df = df[df.a.notnull()].reset_index()
#aggregate first and last values of column idx
df = df['idx'].groupby(df.b).agg(['first', 'last'])
print(list(zip(df['first'], df['last'])))
[(4, 10), (13, 16)]
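Since the question asks for the blocks of every column, the same cumsum idea can be wrapped in a helper and applied column by column. This is only a sketch: the helper name `valid_blocks` and the column name `a` are illustrative assumptions, and the sample values are the ones from the question.

```python
import numpy as np
import pandas as pd

def valid_blocks(s):
    """Return (first_index, last_index) for each run of consecutive non-NaN values."""
    valid = s.notna()
    # The block id increases by one whenever validity flips between neighbouring rows.
    block_id = (valid != valid.shift()).cumsum()
    # Keep only the index labels of valid rows and group them by block id.
    idx = s.index.to_series()[valid]
    bounds = idx.groupby(block_id[valid]).agg(['first', 'last'])
    return [(int(a), int(b)) for a, b in zip(bounds['first'], bounds['last'])]

nan = np.nan
values = [nan, nan, nan, 3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0,
          nan, nan, 6.0, 5.0, 3.0, 5.0]
df = pd.DataFrame({'a': values}, index=range(1, 17))

# One list of (first, last) pairs per column.
result = {col: valid_blocks(df[col]) for col in df.columns}
print(result)
```

The helper leaves the DataFrame untouched, so it can be called on as many columns as needed.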
Then I tried modifying cggarvey's solution:
#set column names and set index by first column
df.columns = ['idx', 'a']
df = df.set_index('idx')
#find edges
pre = df['a'] - df['a'].diff(-1)
pst = df['a'] - df['a'].diff(1)
a = pre.notnull() & pst.isnull()
z = pre.isnull() & pst.notnull()
print(list(zip(a[a].index, z[z].index)))
[(4, 10), (13, 16)]
Here's an example using NumPy. Not sure how it compares to @jezrael's solution, but you mentioned performance as a requirement so you can compare the two.
Note: This assumes your columns are named "index" and "val"
import numpy as np
pre = np.array(df['val'] - df.diff(-1)['val'])
pst = np.array(df['val'] - df.diff(1)['val'])
a = np.where(~np.isnan(pre) & np.isnan(pst))
z = np.where(np.isnan(pre) & ~np.isnan(pst))
# .ix has been removed from pandas; select positionally with .iloc instead
output = list(zip(df['index'].iloc[a[0]], df['index'].iloc[z[0]]))
Output:
[(4, 10), (13, 16)]
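A possible pure-NumPy variant, sketched here as an alternative rather than as part of the answers above, finds the edges by differencing a padded validity mask. The function name `valid_runs` is an assumption made for illustration. Unlike the `diff`-based approach, it should also report a block that consists of a single valid value.

```python
import numpy as np

def valid_runs(values, index):
    """Return (first, last) index labels for each run of non-NaN values."""
    valid = ~np.isnan(np.asarray(values, dtype=float))
    # Pad with False on both sides so runs touching the array ends still produce edges.
    padded = np.concatenate(([False], valid, [False])).astype(np.int8)
    edges = np.diff(padded)
    starts = np.flatnonzero(edges == 1)       # position where a run begins
    ends = np.flatnonzero(edges == -1) - 1    # position where a run ends (inclusive)
    index = np.asarray(index)
    return [(int(a), int(b)) for a, b in zip(index[starts], index[ends])]

nan = np.nan
vals = [nan, nan, nan, 3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0,
        nan, nan, 6.0, 5.0, 3.0, 5.0]
runs = valid_runs(vals, range(1, 17))
print(runs)
```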
I have two .txt files, and I want to split each data frame into two parts based on the value in the first column. If the value is less than "H1000", the row should go into a first dataframe; if it is greater than or equal to "H1000", it should go into a second dataframe. Each value in the first column starts with H followed by four digits. I want to ignore the H when comparing the number against 1000 in Python.
This is what I have tried, but it is not working:
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
Here is my code:
if (".txt" in str(path_txt).lower()) and path_txt.is_file():
    txt_files = [Path(path_txt)]
else:
    txt_files = list(Path(path_txt).glob("*.txt"))
for fn in txt_files:
    all_dfs = pd.read_csv(fn, sep="\t", header=None)  # Reading file
    all_dfs = all_dfs.dropna(axis=1, how='all')  # Drop the columns where all values are NaN
    all_dfs = all_dfs.dropna(axis=0, how='all')  # Drop the rows where all values are NaN
    print(all_dfs)
    ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
    print(ht_data)
    df_h = all_dfs[0:ht_data]  # Head Data
    df_t = all_dfs[ht_data:]  # Tene Data
Can anyone help me how to achieve this task in python?
Assuming this data
import pandas as pd
data = pd.DataFrame(
[
["H0002", "Version", "5"],
["H0003", "Date_generated", "8-Aug-11"],
["H0004", "Reporting_period_end_date", "19-Jun-11"],
["H0005", "State", "AW"],
["H1000", "Tene_no/Combined_rept_no", "E75/3794"],
["H1001", "Tenem_holder Magnetic Resources", "NL"],
],
columns = ["id", "col1", "col2"]
)
We can create a boolean mask that separates the values below a preset threshold, such as 1000, from those at or above it.
mask = data["id"].str.strip("H").astype(int) < 1000
df_h = data[mask]
df_t = data[~mask]
If you want to compare values of the format val = HXXXX, where each X is a digit stored as a character, try this:
val = 'H1003'
val_cmp = int(val[1:])  # drop the leading 'H' and convert the rest to an integer
if val_cmp < 1000:
    pass  # goes into the first DataFrame
else:
    pass  # goes into the second DataFrame
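Putting the pieces together for a file read with header=None, where the first column has no name, might look like the sketch below. The in-memory tab-separated text stands in for one of the .txt files and is an assumption for illustration, as is the use of `pd.to_numeric` with `errors="coerce"` to tolerate malformed codes.

```python
import io
import pandas as pd

# Stand-in for the contents of one tab-separated .txt file (illustrative only).
raw = (
    "H0002\tVersion\t5\n"
    "H0005\tState\tAW\n"
    "H1000\tTene_no/Combined_rept_no\tE75/3794\n"
    "H1001\tTenem_holder\tNL\n"
)
all_dfs = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# Drop the leading 'H' from the first column and convert the remainder to numbers.
# Values that cannot be parsed become NaN and end up in neither part.
codes = pd.to_numeric(all_dfs.iloc[:, 0].str[1:], errors="coerce")

df_h = all_dfs[codes < 1000]    # head data: H0000 to H0999
df_t = all_dfs[codes >= 1000]   # remaining data: H1000 and above
print(df_h)
print(df_t)
```

Comparing numbers rather than searching for the literal string "H1000" means the split still works when a file has no row with exactly that code.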
I have a dataframe. A snippet can be seen below:
import pandas as pd
data = {'EVENT_ID': [112335580,112335580,112335580,112335580,112335580,112335580,112335580,112335580, 112335582,
112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,112335582,
112335582,112335582,112335582],
'SELECTION_ID': [6356576,2554439,2503211,6297034,4233251,2522967,5284417,7660920,8112876,7546023,8175276,8145908,
8175274,7300754,8065540,8175275,8106158,8086265,2291406,8065533,8125015],
'BSP': [5.080818565,6.651493872,6.374683435,24.69510797,7.776082305,11.73219964,270.0383021,4,8.294425408,335.3223613,
14.06040142,2.423340019,126.7205863,70.53780982,21.3328554,225.2711962,92.25113066,193.0151362,3.775394142,
95.3786641,17.86333041],
'WIN_LOSE':[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0]}
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df = df.sort_index(level=0, ascending=True, sort_remaining=True)  # sortlevel was removed from pandas
df = pd.DataFrame(data, columns=['EVENT_ID', 'SELECTION_ID', 'BSP','WIN_LOSE'])
df = df.sort_values(["EVENT_ID","BSP"])
df.set_index(['EVENT_ID', 'SELECTION_ID'], inplace=True)
df['Win_Percentage'] = 1/df['BSP']
df['Lose_Percentage'] = 1 - df['Win_Percentage']
For each EVENT_ID, so index level zero, I would like to fit an equation of a line, exponential, power and log based on Lose_Percentage column.
So the fitted lines for EVENT_ID 112335580 would be based on the points (1, 0.750000), (2, 0.803181), (3, 0.843129), (4, 0.849658), (5, 0.871401), (6, 0.914764), (7, 0.959506), (8, 0.996297). This would then be done for all other EVENT_ID indexes.
To try and do this I want to convert the Lose_Percentage column into an array for each EVENT_ID. To do this I have tried the following:
df["Lose_Percentage"][112335580].tolist()
I don't want to access just one event, though. I want to access the values in the Lose_Percentage column for each EVENT_ID and pass each list to a function.
To fit a line to the data I can use polyfit. So I will need to pass the array to this.
Also, I have had a look to see how I can fit log, power and exponential curves, but I cannot find a function which can do this.
Any help would be appreciated, cheers.
Sandy
It's not necessary to extract the values. First, define a function that fits a line and evaluates it at each point:
import numpy as np
def fit_eval(df):
    y = df.values
    x = np.arange(0, len(y)) + 1
    z = np.polyfit(x, y, 1)  # degree 1: straight line
    p = np.poly1d(z)
    return p(x)
This function can then be used in a groupby:
df['fit'] = df.groupby(level=0)['Lose_Percentage'].transform(fit_eval)
You can select the required list by using loc:
extract = pd.Series(df.loc[112335580]["Lose_Percentage"])
extract.reset_index()
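For the exponential, power and log curves the question asks about, one common route is to linearise each model and reuse np.polyfit. The sketch below takes that route; it is not a dedicated pandas or NumPy fitting function. The helper name `fit_curves` and the coefficient labels are assumptions, the second event uses only a subset of its BSP values to keep the example short, and the log transforms require strictly positive y values. Note that fitting in log space minimises error in log(y), not in y itself.

```python
import numpy as np
import pandas as pd

def fit_curves(y):
    """Least-squares coefficients for four curve families, with x = 1..len(y)."""
    y = np.asarray(y, dtype=float)
    x = np.arange(1, len(y) + 1)
    lin_m, lin_c = np.polyfit(x, y, 1)                       # y = m*x + c
    exp_b, exp_log_a = np.polyfit(x, np.log(y), 1)           # y = a*exp(b*x)
    pow_b, pow_log_a = np.polyfit(np.log(x), np.log(y), 1)   # y = a*x**b
    log_b, log_a = np.polyfit(np.log(x), y, 1)               # y = a + b*ln(x)
    return pd.Series({
        'lin_m': lin_m, 'lin_c': lin_c,
        'exp_a': np.exp(exp_log_a), 'exp_b': exp_b,
        'pow_a': np.exp(pow_log_a), 'pow_b': pow_b,
        'log_a': log_a, 'log_b': log_b,
    })

df = pd.DataFrame({
    'EVENT_ID': [112335580] * 8 + [112335582] * 5,
    'BSP': [5.080818565, 6.651493872, 6.374683435, 24.69510797,
            7.776082305, 11.73219964, 270.0383021, 4,
            8.294425408, 335.3223613, 14.06040142, 2.423340019, 126.7205863],
})
df = df.sort_values(['EVENT_ID', 'BSP'])
df['Lose_Percentage'] = 1 - 1 / df['BSP']

# One row of coefficients per EVENT_ID.
fits = pd.DataFrame({
    event: fit_curves(group)
    for event, group in df.groupby('EVENT_ID')['Lose_Percentage']
}).T
print(fits)
```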
I have the following example of my dataframe:
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
The cust_num is equal for both rows
The Title is equal for both rows in the dataframe
The second_date in a row <= end_date in an other row
If all these requirements are met the value True should be appended to a new column in the original row.
Because I'm working with a big dataset I'm looking for an efficient way to do this.
In this case only the first record should get a true value.
I have looked at apply with a lambda and at the groupby function in Python, but couldn't find a way to make these work.
Try this (offhand, I cannot come up with a faster method):
import pandas as pd
import numpy as np
df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df["new_col"] = False
for cust in set(df["cust_num"]):
    indices = df.index[df["cust_num"] == cust].tolist()
    if len(indices) > 1:
        sub_df = df.loc[indices]
        for title in set(sub_df["Title"]):
            indices_title = sub_df.index[sub_df["Title"] == title]
            if len(indices_title) > 1:
                for i in indices_title:
                    if sub_df.loc[i, "second_date"] <= sub_df.loc[i, "end_date"]:
                        # flag only this row, not the whole column
                        df.loc[i, "new_col"] = True
                        break
First you need to make all date columns comparable with each other by casting them to datetime. Then create the additional column you want.
Now create a set of all unique customer numbers and iterate through them. For each customer number, get a list of all row indices with this customer number. If this list is longer than 1, the customer number occurs several times. Then create a sub-dataframe with all rows that share this customer number and iterate through the set of its titles. For each title, check whether the same title occurs somewhere else in the sub-dataframe (len > 1). If so, iterate through the matching rows and write True in the additional column of the first row where the date condition is met.
This should work. Also, from reading the comments, I am assuming that all cust_num values are unique.
import pandas as pd
df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
'cust_num': [1, 2, 1],
'Title': ['philips', 'samsung', 'philips']})
# the dates are day-first, so state the format explicitly
df["second_date"] = pd.to_datetime(df["second_date"], format='%d-%m-%Y')
df["end_date"] = pd.to_datetime(df["end_date"], format='%d-%m-%Y')
df['Value'] = False
for i in range(len(df)):
    for j in range(len(df)):
        if i != j:
            if df.loc[j, 'end_date'] >= df.loc[i, 'second_date']:
                if df.loc[i, 'cust_num'] == df.loc[j, 'cust_num']:
                    if df.loc[i, 'Title'] == df.loc[j, 'Title']:
                        df.loc[i, 'Value'] = True
Let me know whether this code works, and about any errors you hit.
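For a larger dataset, the pairwise comparison could be pushed into a self-merge instead of Python loops. The sketch below assumes the literal reading of the third condition: a row is flagged when its second_date is on or before the end_date of a different row with the same cust_num and Title. Under that reading the sample data flags the third row, so swap the comparison if the opposite direction is intended. The column name `match` is an illustrative assumption, and the merge still builds every pair within a group, so very large groups remain costly.

```python
import pandas as pd

df = pd.DataFrame({'first_date': ['01-07-2017', '01-07-2017', '01-08-2017'],
                   'end_date': ['01-08-2017', '01-08-2017', '15-08-2017'],
                   'second_date': ['01-09-2017', '01-08-2017', '15-07-2017'],
                   'cust_num': [1, 2, 1],
                   'Title': ['philips', 'samsung', 'philips']})
for col in ['end_date', 'second_date']:
    df[col] = pd.to_datetime(df[col], format='%d-%m-%Y')

# Pair every row with every row that shares its cust_num and Title.
flat = df.reset_index()  # the original row labels land in a column named 'index'
left = flat[['index', 'cust_num', 'Title', 'second_date']]
right = flat[['index', 'cust_num', 'Title', 'end_date']]
pairs = left.merge(right, on=['cust_num', 'Title'], suffixes=('', '_other'))

# Keep pairs of two different rows where the date condition holds.
hits = pairs[(pairs['index'] != pairs['index_other'])
             & (pairs['second_date'] <= pairs['end_date'])]

df['match'] = df.index.isin(hits['index'])
print(df[['cust_num', 'Title', 'second_date', 'end_date', 'match']])
```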
I am having trouble reformatting a dataframe.
My input has one row per day and one column per symbol (each symbol has different dates with its values):
Input
code to generate input
data = [("01-01-2010", 15, 10), ("02-01-2010", 16, 11), ("03-01-2010", 16.5, 10.5)]
labels = ["date", "AAPL", "AMZN"]
df_input = pd.DataFrame.from_records(data, columns=labels)
The needed output is (month row with new row for each month):
Needed output
code to generate output
data = [("01-01-2010","29-01-2010", "AAPL", 15, 20), ("01-01-2010","29-01-2010", "AMZN", 10, 15),("02-02-2010","30-02-2010", "AAPL", 20, 32)]
labels = ['bd start month', 'bd end month','stock', 'start_month_value', "end_month_value"]
df = pd.DataFrame.from_records(data, columns=labels)
Meaning (Pseudo code)
1. for each row take only non-NaN values to create a new "row" (maybe a dictionary with the date as the index and the [stock, value] as the value).
2. take only rows that are business start of month or business end of month.
3. write those rows to a new dataframe.
I have read several posts like this and this and several more.
They all deal with dataframes of the same "type" and just resample them, while I need to change the structure...
My code so far
import collections
# creating the new index with business days
df1 = pd.DataFrame(range(10000), index=pd.date_range(df.iloc[0].name, periods=10000, freq='D'))
from pandas.tseries.offsets import CustomBusinessMonthBegin
from pandas.tseries.holiday import USFederalHolidayCalendar
bmth_us = CustomBusinessMonthBegin(calendar=USFederalHolidayCalendar())
df2 = df1.resample(bmth_us).mean()
# creating the new index by intersecting my old one (daily) with the monthly index
new_index = df.index.intersection(df2.index)
# selecting only the rows I want
df = df.loc[new_index]
# creating a dict that will be my new dataset
new_dict = collections.OrderedDict()
# iterating over the rows and adding to dictionary
for index, row in df.iterrows():
    # print index
    date = df.loc[index].name
    # values are the not none values
    values = df.loc[index][~df.loc[index].isnull().values]
    new_dict[date] = values
# from dict to list
data = []
for key, values in new_dict.items():
    for i in range(0, len(values)):
        date = key
        stock_name = str(values.index[i])
        stock_value = values.iloc[i]
        row = (key, stock_name, stock_value)
        data.append(row)
# from the list to df
labels = ['date', 'stock', 'value']
df = pd.DataFrame.from_records(data, columns=labels)
df.to_excel("migdal_format.xls")
Current output I get
One big problem:
I only get value of the stock on the start of month day.. I need start and end so I can calculate the stock gain on this month..
One smaller problem:
I am sure this is not the cleanest and fastest code :)
Thanks a lot!
So I have found a way.
looping through each column
groupby month
taking the first and last value I have in that month
calculate return
import re
df_migdal = pd.DataFrame()
for col in df_input.columns[0:]:
    stock_position = df_input.loc[:, col]
    name = stock_position.name
    name = re.sub('[^a-zA-Z]+', '', name)
    name = name[0:-4]
    # pd.Grouper replaces the removed pd.TimeGrouper; use freq='ME' on pandas 2.2+
    stock_position = stock_position.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
    stock_position["name"] = name
    stock_position["return"] = ((stock_position["last"] / stock_position["first"]) - 1) * 100
    stock_position.dropna(inplace=True)
    # DataFrame.append was removed in pandas 2.0; concatenate instead
    df_migdal = pd.concat([df_migdal, stock_position])
df_migdal = df_migdal.round(decimals=2)
I tried a much cooler way, but did not know how to handle the MultiIndex I got... For each column, I needed to take the two sub-columns and create a third one from some lambda function.
df_input.groupby(pd.Grouper(freq='M')).agg(['first', 'last'])
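One way the MultiIndex columns from that call could be handled is to take cross-sections of the second column level, so that "first" and "last" become two ordinary frames with one column per stock. This is a sketch under assumptions: the prices and dates below are made up for illustration, the frame is assumed to have a DatetimeIndex, and grouping by `index.to_period("M")` is used in place of a month-end Grouper.

```python
import pandas as pd

# Hypothetical daily prices with a DatetimeIndex (illustrative values only).
dates = pd.to_datetime(["2010-01-04", "2010-01-15", "2010-01-29",
                        "2010-02-01", "2010-02-26"])
df_input = pd.DataFrame({"AAPL": [15.0, 16.0, 20.0, 20.0, 32.0],
                         "AMZN": [10.0, 11.0, 15.0, 15.0, 12.0]},
                        index=dates)

# Columns become a MultiIndex of (stock, 'first') and (stock, 'last').
monthly = df_input.groupby(df_input.index.to_period("M")).agg(["first", "last"])

# Slice the second column level to get one frame per aggregate.
first = monthly.xs("first", axis=1, level=1)
last = monthly.xs("last", axis=1, level=1)

# Percentage return per stock and month, computed for all columns at once.
returns = ((last / first) - 1) * 100
print(returns.round(2))
```

Because `first` and `last` share the same row and column labels, the division aligns automatically and no per-column loop or lambda is needed.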
I have a NumPy array of size 31x36 and I want to transform it into a pandas dataframe in order to process it. I am trying to convert it using the following code:
pd.DataFrame(data=matrix,
index=np.array(range(1, 31)),
columns=np.array(range(1, 36)))
However, I am receiving the following error:
ValueError: Shape of passed values is (36, 31), indices imply (35, 30)
How can I solve the issue and transform it properly?
As to why what you tried failed: the ranges are off by 1. This works:
pd.DataFrame(data=matrix,
index=np.array(range(1, 32)),
columns=np.array(range(1, 37)))
The original call failed because the last value isn't included in the range.
Actually, looking at what you're doing, you could've just done:
pd.DataFrame(data=matrix,
index=np.arange(1, 32),
columns=np.arange(1, 37))
Or in pure pandas:
pd.DataFrame(data=matrix,
index=pd.RangeIndex(range(1, 32)),
columns=pd.RangeIndex(range(1, 37)))
Also if you don't specify the index and column params, an auto-generated index and columns is made, which will start from 0. Unclear why you need them to start from 1
You could also have not passed the index and column params and just modified them after construction:
In[9]:
df = pd.DataFrame(adaption)
df.columns = df.columns+1
df.index = df.index + 1
df
Out[9]:
1 2 3 4 5 6
1 -2.219072 -1.637188 0.497752 -1.486244 1.702908 0.331697
2 -0.586996 0.040052 1.021568 0.783492 -1.263685 -0.192921
3 -0.605922 0.856685 -0.592779 -0.584826 1.196066 0.724332
4 -0.226160 -0.734373 -0.849138 0.776883 -0.160852 0.403073
5 -0.081573 -1.805827 -0.755215 -0.324553 -0.150827 -0.102148
You get an error because the end argument in range(start, end) is non-inclusive. You have a couple of options to account for this:
Don't pass index and columns
Just use df = pd.DataFrame(matrix). The pd.DataFrame constructor adds integer indices implicitly.
Pass in the shape of the array
matrix.shape gives a tuple of row and column count, so you need not specify them manually. For example:
df = pd.DataFrame(matrix, index=range(matrix.shape[0]),
columns=range(matrix.shape[1]))
If you need to start at 1, remember to add 1:
df = pd.DataFrame(matrix, index=range(1, matrix.shape[0] + 1),
columns=range(1, matrix.shape[1] + 1))
In addition to the above answer, range(1, X) describes the set of numbers from 1 up to X-1 inclusive. You need to use range(1, 32) and range(1, 37) to do what you describe.
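The off-by-one can be checked directly. The sketch below uses a zero-filled array as a stand-in for the real matrix, which is an assumption for illustration, and the flag name `mismatch_raised` is hypothetical.

```python
import numpy as np
import pandas as pd

matrix = np.zeros((31, 36))  # placeholder data with the shape from the question

# range(1, 31) stops at 30, so it supplies only 30 labels for 31 rows.
assert len(range(1, 31)) == 30

# Passing too few labels raises the ValueError from the question.
try:
    pd.DataFrame(matrix, index=range(1, 31), columns=range(1, 36))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Deriving the end points from the shape avoids hard-coding 32 and 37.
n_rows, n_cols = matrix.shape
df = pd.DataFrame(matrix,
                  index=range(1, n_rows + 1),
                  columns=range(1, n_cols + 1))
print(mismatch_raised, df.shape, df.index[-1], df.columns[-1])
```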