DataFrame.resample() works only with time-series data. I cannot find a way of getting every nth row from non-time-series data. What is the best method?
I'd use iloc, which takes a row/column slice, both based on integer position and following normal Python slice syntax. If you want every 5th row:
df.iloc[::5, :]
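For instance, starting at the third row instead of the first (a small sketch on made-up data):

import pandas as pd

df = pd.DataFrame({'a': range(10)})
df.iloc[::5, :]   # rows 0 and 5
df.iloc[2::5, :]  # rows 2 and 7; the slice start shifts the phase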
Though @chrisb's accepted answer does answer the question, I would like to add the following.
A simple method I use to get the nth data or drop the nth row is the following:
df1 = df[df.index % 3 != 0] # Excludes every 3rd row starting from 0
df2 = df[df.index % 3 == 0] # Selects every 3rd row starting from 0
This arithmetic-based sampling enables even more complex row selections.
This assumes, of course, that your index consists of ordered, consecutive integers starting at 0.
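A minimal sketch on made-up data, assuming the default RangeIndex (comparing an Index yields a plain boolean array, so no index alignment is involved):

import pandas as pd

df = pd.DataFrame({'a': range(6)})
df[df.index % 3 == 0]  # keeps rows 0 and 3
df[df.index % 3 != 0]  # keeps rows 1, 2, 4, 5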
There is an even simpler solution than the accepted answer, which involves directly invoking df.__getitem__ with a slice.
df = pd.DataFrame('x', index=range(5), columns=list('abc'))
df
a b c
0 x x x
1 x x x
2 x x x
3 x x x
4 x x x
For example, to get every 2nd row, you can do
df[::2]
a b c
0 x x x
2 x x x
4 x x x
There's also GroupBy.first/GroupBy.head; you group on the index:
df.index // 2
# Int64Index([0, 0, 1, 1, 2], dtype='int64')
df.groupby(df.index // 2).first()
# Alternatively,
# df.groupby(df.index // 2).head(1)
a b c
0 x x x
1 x x x
2 x x x
The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do
# df.groupby(np.arange(len(df)) // 2).first()
df.groupby(pd.RangeIndex(len(df)) // 2).first()
a b c
0 x x x
1 x x x
2 x x x
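As a quick sketch (assumed data), the RangeIndex variant also works on a string-indexed frame:

import pandas as pd

df2 = pd.DataFrame('x', index=list('vwxyz'), columns=list('abc'))
df2.groupby(pd.RangeIndex(len(df2)) // 2).first()
#    a  b  c
# 0  x  x  x
# 1  x  x  x
# 2  x  x  x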
Adding reset_index() to @metastableB's answer means you only need to assume that the rows are ordered and consecutive.
df1 = df[df.reset_index().index % 3 != 0] # Excludes every 3rd row starting from 0
df2 = df[df.reset_index().index % 3 == 0] # Selects every 3rd row starting from 0
df.reset_index().index will create an index that starts at 0 and increments by 1, allowing you to use the modulo easily.
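A small sketch under assumed data, with an index that is neither zero-based nor consecutive:

import pandas as pd

df = pd.DataFrame({'a': range(5)}, index=[10, 20, 31, 40, 52])
df[df.reset_index().index % 2 == 0]  # keeps the 1st, 3rd and 5th rows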
I had a similar requirement, but I wanted the nth item in a particular group. This is how I solved it.
groups = data.groupby(['group_key'])
selection = groups['index_col'].apply(lambda x: x % 3 == 0)
subset = data[selection]
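If there is no ready-made position column like index_col, GroupBy.cumcount can build one; a sketch under assumed column names:

import pandas as pd

data = pd.DataFrame({'group_key': list('aaabbb'), 'val': range(6)})
pos = data.groupby('group_key').cumcount()  # row position within each group
subset = data[pos % 3 == 0]                 # every 3rd row of each group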
A solution using the index was not viable for me (possibly the multi-gigabyte .csv was too large, or I missed some technique that would let me reindex without crashing).
So instead, walk through the file one row at a time and add every nth row to a new dataframe.
import pandas as pd
from csv import DictReader

def make_downsampled_df(filename, interval):
    rows = []
    with open(filename, 'r') as read_obj:
        csv_dict_reader = DictReader(read_obj)
        column_names = csv_dict_reader.fieldnames
        for index, row in enumerate(csv_dict_reader):
            if index % interval == 0:
                rows.append(row)
    # build the frame once at the end; appending row by row is slow,
    # and DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=column_names)
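If the file parses cleanly, read_csv can also skip rows at read time via a callable skiprows, which avoids building the frame row by row; a sketch, assuming the header sits on line 0:

import pandas as pd

def read_downsampled(filename, interval):
    # keep the header (line 0) and every `interval`-th data line
    return pd.read_csv(filename,
                       skiprows=lambda i: i > 0 and (i - 1) % interval != 0)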
df.drop(labels=df[df.index % 3 != 0].index, axis=0) # every 3rd row (mod 3)
Problem Statement:
I have a DataFrame that has to be filtered with multiple conditions.
Each condition is optional: if the user enters an invalid value for a certain condition, that condition is skipped completely, and the DataFrame is returned as if that specific condition had not been applied.
While I can implement this quite easily with multiple if-conditions, modifying the DataFrame sequentially, I am looking for something more elegant and scalable (as the number of input parameters grows), preferably using built-in pandas functionality.
Reproducible Example
Dummy dataframe -
df = pd.DataFrame({'One':['a','a','a','b'],
'Two':['x','y','y','y'],
'Three':['l','m','m','l']})
print(df)
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Let's say that invalid values are the values that don't belong to the respective column. So, for column 'One' all values other than 'a' and 'b' are invalid. If the user inputs 'a', then I should be able to filter the DataFrame with df[df['One']=='a']; however, if the user inputs any invalid value, no such filter should be applied, and the original DataFrame df is returned.
My attempt (with multiple parameters):
def valid_filtering(df, inp):
    if inp[0] in df['One'].values:
        df = df[df['One'] == inp[0]]
    if inp[1] in df['Two'].values:
        df = df[df['Two'] == inp[1]]
    if inp[2] in df['Three'].values:
        df = df[df['Three'] == inp[2]]
    return df
With all valid inputs -
inp = ['a','y','m'] #<- all filters valid so df is filtered before returning
print(valid_filtering(df, inp))
One Two Three
1 a y m
2 a y m
With some invalid inputs -
inp = ['a','NA','NA'] #<- only first filter is valid, so other 2 filters are ignored
print(valid_filtering(df, inp))
One Two Three
0 a x l
1 a y m
2 a y m
P.S. Additional question - is there a way to get DataFrame indexing to behave as -
df[df['One']=='valid'] -> returns filtered df
df[df['One']=='invalid'] -> returns original df
Because this would help me rewrite my filtering -
df[(df['One']=='valid') & (df['Two']=='invalid') & (df['Three']=='valid')] -> Filtered by col One and Three
EDIT: Solution -
An updated solution inspired by the code and logic provided by @corralien and @Ben.T
df.loc[(df.eq(inp)|~df.eq(inp).any(0)).all(1)]
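As a quick sanity check of the one-liner on the dummy df above (a sketch, with inp as defined earlier):

inp = ['a', 'NA', 'NA']
m = df.eq(inp)                  # element-wise match per column
keep_col = ~m.any(0)            # columns whose filter value never occurs (invalid)
df.loc[(m | keep_col).all(1)]   # rows matching every valid filter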
Here is one way: create a Boolean DataFrame by checking each value of inp against its column. Then use any along the rows to find the columns with at least one True, and, once those columns are selected, use all along the columns to keep the rows that are True everywhere.
def valid_filtering(df, inp):
    # check where inp values are the same as in df
    # (repeat inp for each row so the shapes match)
    m = (df == pd.DataFrame(data=[inp] * len(df), index=df.index, columns=df.columns))
    # select the columns with at least one True
    cols = m.columns[m.any()]
    # select the rows that are all True amongst the wanted columns
    rows = m[cols].all(axis=1)
    # return df with selected rows
    return df.loc[rows]
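For example, on the dummy df from the question, with a partially invalid input (sketch):

print(valid_filtering(df, ['a', 'NA', 'NA']))
#   One Two Three
# 0   a   x     l
# 1   a   y     m
# 2   a   y     m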
Note that if you don't have the same number of filters as columns in your original df, you can use a dictionary instead; it works too, as in the example below, where the column Three is ignored because it is all False.
d = {'One': 'a', 'Two': 'y'}
m = (df==pd.DataFrame(d, index=df.index).reindex(columns=df.columns))
The key idea is that if a column comes back all False (~b.any(), i.e. an invalid filter), it is replaced with True to accept all values of that column:
import numpy as np

mask = df.eq(inp).apply(lambda b: np.where(~b.any(), True, b))
out = df.loc[mask.all(axis="columns")]
Case 1: inp = ['a','y','m'] (with all valid inputs)
>>> out
One Two Three
1 a y m
2 a y m
Case 2: inp = ['a','NA','NA'] (with some invalid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
Case 3: inp = ['NA','NA','NA'] (with no valid inputs)
>>> out
One Two Three
0 a x l
1 a y m
2 a y m
3 b y l
Case 4: inp = ['b','x','m'] (with all valid inputs but no results)
>>> out
Empty DataFrame
Columns: [One, Two, Three]
Index: []
Of course, you can increase the number of input parameters:
df["Four"] = ['i','j','k','k']
inp = ['a','NA','m','k']
>>> out
One Two Three Four
2 a y m k
Another way with list comprehension:
def valid_filtering(df, inp):
    series = [df[column] == inp[i]
              for i, column in enumerate(df.columns)
              if len(df[df[column] == inp[i]].values) > 0]
    for s in series:
        df = df[s]
    return df
Output of print(valid_filtering(df, ['a','NA','NA'])):
One Two Three
0 a x l
1 a y m
2 a y m
Related: applying lambda row on multiple columns pandas
I have a data frame df that looks like:
A B C
Date
24/03/2014 -0.114726 -0.076779 -0.012105
25/03/2014 -0.118673 -0.078756 -0.008158
26/03/2014 -0.132919 -0.078067 0.006088
27/03/2014 -0.153581 -0.068223 0.02675
28/03/2014 -0.167744 -0.063045 0.040913
31/03/2014 -0.167399 -0.067346 -0.040568
01/04/2014 -0.166249 -0.068801 0.039418
02/04/2014 -0.160876 -0.077259 0.034045
03/04/2014 -0.156089 -0.090062 0.029258
04/04/2014 -0.161735 -0.079317 -0.034904
07/04/2014 -0.148305 -0.080767 0.021474
08/04/2014 -0.150812 -0.074792 0.023981
09/04/2014 -0.135339 -0.079736 0.008508
10/04/2014 -0.156345 -0.083574 0.029514
I am looking to create two variables: the sum of column C where values are greater than zero, and the sum where values are less than zero.
So in this example, the variable aboveZero would equal 0.259949 and the variable belowZero would equal -0.095735.
var1 = df.loc[df.C.lt(0), 'C'].sum()  # less than 0
var2 = df.loc[df.C.gt(0), 'C'].sum()  # greater than 0
If you prefer generator expression syntax:
aboveZero = sum(x for x in df["C"] if x > 0)
belowZero = sum(x for x in df["C"] if x < 0)
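A vectorized alternative (a sketch; clip truncates values at the given bound, so the unwanted sign contributes zero to the sum):

aboveZero = df["C"].clip(lower=0).sum()  # negatives become 0
belowZero = df["C"].clip(upper=0).sum()  # positives become 0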
I'm trying to create a function that adds a new column to a pandas dataframe by figuring out which substring occurs in a column of strings and using that substring as the new column's value.
The problem is that the text to find does not appear at the same position within x.
df = pd.DataFrame({'x': ["var_m500_0_somevartext","var_m500_0_vartextagain",
"varwithsomeothertext_0_500", "varwithsomext_m150_0_text"], 'x1': [4, 5, 6,8]})
finds = ["m500_0","0_500","m150_0"]
The goal is to determine which element of finds occurs in each row of df["x"].
I've made a function that works, but it is terribly slow for large datasets:
import re

def pd_create_substring_var(df, new_var_name="new_var", substring_list=["1"], var_ori="x"):
    df[new_var_name] = "na"
    cols = list(df.columns)
    for ix in range(len(df)):
        for find in substring_list:
            for m in re.finditer(find, df.iloc[ix][var_ori]):
                df.iat[ix, cols.index(new_var_name)] = df.iloc[ix][var_ori][m.start():m.end()]
    return df
df = pd_create_substring_var(df,"t",finds,var_ori="x")
df
x x1 t
0 var_m500_0_somevartext 4 m500_0
1 var_m500_0_vartextagain 5 m500_0
2 varwithsomeothertext_0_500 6 0_500
3 varwithsomext_m150_0_text 8 m150_0
Does this accomplish what you need?
finds = ["m500_0", "0_500", "m150_0"]
df["t"] = df["x"].str.extract(f"({'|'.join(finds)})")
Use Series.str.findall:
df['x'].str.findall("|".join(finds))
0 [m500_0]
1 [m500_0]
2 [0_500]
3 [m150_0]
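findall returns a list per row; to reduce it to a single value, you could take the first hit (a sketch; rows with no match would become NaN):

df['t'] = df['x'].str.findall("|".join(finds)).str[0]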
Probably not the best way:
df['t'] = df['x'].apply(lambda x: ''.join([i for i in finds if i in x]))
And now print(df) gives:
x x1 t
0 var_m500_0_somevartext 4 m500_0
1 var_m500_0_vartextagain 5 m500_0
2 varwithsomeothertext_0_500 6 0_500
3 varwithsomext_m150_0_text 8 m150_0
And now, just adding to @pythonjokeun's answer, you can do:
df["t"] = df["x"].str.extract("(%s)" % '|'.join(finds))
Or:
df["t"] = df["x"].str.extract("({})".format('|'.join(finds)))
Or:
df["t"] = df["x"].str.extract("(" + '|'.join(finds) + ")")
I don't know how large your dataset is, but you can use the map function as below:
import pandas

def subset_df_test():
    df = pandas.DataFrame({'x': ["var_m500_0_somevartext", "var_m500_0_vartextagain",
                                 "varwithsomeothertext_0_500", "varwithsomext_m150_0_text"],
                           'x1': [4, 5, 6, 8]})
    finds = ["m500_0", "0_500", "m150_0"]
    df['t'] = df['x'].map(lambda x: compare(x, finds))
    print(df)

def compare(x, finds):
    # return the first substring from finds found in x
    for f in finds:
        if f in x:
            return f
Try this:
df["t"] = df["x"].apply(lambda x: [i for i in finds if i in x][0])
My data looks as follows:
ID my_val db_val
a X X
a X X
a Y X
b X Y
b Y Y
b Y Y
c Z X
c X X
c Z X
Expected result:
ID  my_val   db  match
a   X:2;Y:1  X   full_match
b   Y:2;X:1  Y   full_match
c   Z:2;X:1  X   partial_match
A full_match is when db_val matches the most abundant my_val.
A partial_match is when db_val is among the other values but doesn't match the top one.
My current approach consists of grouping by ID, then counting values into a separate column, then concatenating each value with its count, then aggregating all values into one row for each ID.
This is how I aggregate the columns:
from functools import reduce

def all_hits_aggregate_df(df, columns=['my_val']):
    grouped = df.groupby('ID')
    l = []
    for c in columns:
        res = grouped[c].value_counts(ascending=False, normalize=False).to_frame('count_' + c).reset_index(level=1)
        res[c] = res[c].astype(str) + ':' + res['count_' + c].astype(str)
        l.append(res.groupby('ID').agg(lambda x: ';'.join(x)))
    return reduce(lambda x, y: pd.merge(x, y, on='ID'), l)
For the comparison phase, I loop through each row and parse the my_val column into lists, then do the comparison.
I am sure that the way I do the comparison step is extremely inefficient, but I am unsure how I would do it before aggregation, to avoid having to parse the generated string later in the process.
We can group the DataFrame by ID, then count the my_val values with value_counts and convert them to JSON with to_json. With some small changes in formatting, this gives us the requested format (we just need to remove curly brackets and quotes, and replace commas with semicolons). On the grouped data we also take the first (and presumably the only per ID) value of db_val, and calculate the fraction of matches: more than 50% gives us full_match, 0-50% is partial_match, and 0% is no_match:
import numpy as np

df['match'] = df['my_val'] == df['db_val']
z = (df
     .groupby('ID')
     .agg({'my_val': lambda x: x.value_counts().to_json(),
           'db_val': 'first',
           'match': 'mean'})
     ).reset_index()
z['my_val'] = z['my_val'].str.replace('[{"}]', '', regex=True).str.replace(',', ';')
z['match'] = np.select(
[z['match'] > 0.5, z['match'] > 0],
['full_match', 'partial_match'], 'no_match')
print(z)
Output:
ID my_val db_val match
0 a X:2;Y:1 X full_match
1 b Y:2;X:1 Y full_match
2 c Z:2;X:1 X partial_match
This should give you the first part of what you want:
df['equal'] = df.my_val == df.db_val
df2 = pd.DataFrame()
df2['my_val'] = df.groupby('ID')['my_val'].sum()
df2['db'] = df.groupby('ID')['db_val'].unique()
df2['match_val'] = df.groupby('ID')['equal'].mean()  # fraction of matching rows per ID
df2['match'] = ''
df2.loc[df2.match_val > 0.5, 'match'] = 'full_match'
df2.loc[df2.match_val <= 0.5, 'match'] = 'partial_match'
df2.loc[df2.match_val == 0, 'match'] = 'no_match'
df2 = df2.drop(columns='match_val')
print(df2)
my_val db match
ID
a XXY [X] full_match
b XYY [Y] full_match
c ZXZ [X] partial_match
I have a dataframe with a column x.
I want to make a new column x_new, but I want the first row of this new column to be set to a specific number (let's say -2).
Then, from the 2nd row onward, each value should be computed from the previous row via the cx function:
data = {'x': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

def cx(x):
    if df.loc[1, 'x_new'] == 0:
        df.loc[1, 'x_new'] = -2
    else:
        x_new = -10*x + 2
        return x_new

df['x_new'] = cx(df['x'])
The final dataframe
I am not sure how to do this.
Thank you for your help.
This is what I have so far:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)
df

# calculate equation
def depth_cal(d):
    z = -3*d + 1  # d must be previous row
    return z

depth_cal = depth_cal(df['depth'])  # how to set d as previous row?
print(depth_cal)

depth_new = []
for row in df['depth']:
    if row == 1:
        depth_new.append('-5.63')
    else:
        depth_new.append(depth_cal)  # does not put list in a column

df['Depth_correct'] = depth_new
correct output:
There are still two problems with this:
1. It does not put the depth_cal list properly into the column.
2. In the depth_cal function, I want d to be the previous row.
Thank you
I would do this by just using a loop to generate your new data; it might not be ideal for particularly huge frames, but it's a quick operation. Let me know how you get on with this:
data = {'depth': [1, 2, 3, 4, 5]}
df = pd.DataFrame(data)

res = list(data['depth'])  # working list, one slot per row
res[0] = -5.63             # seed value for the first row
for i in range(1, len(res)):
    res[i] = -3 * res[i-1] + 1  # each value depends on the previous one

df['new_depth'] = res
print(df)
To get
depth new_depth
0 1 -5.63
1 2 17.89
2 3 -52.67
3 4 159.01
4 5 -476.03
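For reference, the same recurrence can also be written with itertools.accumulate (a sketch; the initial argument requires Python 3.8+):

from itertools import accumulate

df['new_depth'] = list(accumulate(
    range(len(df) - 1),              # number of recurrence steps
    lambda prev, _: -3 * prev + 1,   # x_{n+1} = -3*x_n + 1
    initial=-5.63,
))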