How do you get the last (or "nth") column in a DataFrame?
I tried several different articles such as 1 and 2.
df = pd.read_csv(csv_file)
col = df.iloc[:, 0]   # returns Index([], dtype='object')
col2 = df.iloc[:, -1]  # returns the whole dataframe
col3 = df.columns[df.columns.str.startswith('c')]  # returns Index([], dtype='object')
The comments after each line show what I get when I print the result. Most of the time I get things like "Index([], dtype='object')".
Here is what df prints:
date open high low close
0 0 2019-07-09 09:20:10 296.235 296.245 296...
1 1 2019-07-09 09:20:15 296.245 296.245 296...
2 2 2019-07-09 09:20:20 296.235 296.245 296...
3 3 2019-07-09 09:20:25 296.235 296.275 296...
df.iloc can refer to both rows and columns. If you pass only a single integer, it refers to a row. You can mix indexer types for the index and columns; use : to select an entire axis.
df.iloc[:,-1:] will return the entire final column (as a one-column DataFrame).
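A minimal sketch on a small made-up frame (standing in for the asker's CSV):

```python
import pandas as pd

# Hypothetical data in place of the asker's csv_file
df = pd.DataFrame({
    "date": ["2019-07-09 09:20:10", "2019-07-09 09:20:15"],
    "open": [296.235, 296.245],
    "close": [296.245, 296.250],
})

first_col = df.iloc[:, 0]   # first column as a Series
last_col = df.iloc[:, -1]   # last column as a Series
last_df = df.iloc[:, -1:]   # last column kept as a one-column DataFrame
```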
I have a data frame with measurements for several groups of participants, and I am doing some calculations for each group. I want to add a column to a big data frame (all participants) from secondary data frames (each with a partial list of participants).
When I merge a couple of times (merging each new data frame into the existing one), it creates a duplicate of the column instead of a single column.
Since the dataframes differ in size, I cannot compare them directly.
I tried
# df1 - main bigger dataframe; df2 - smaller dataframe containing a group of df1
for i in range(len(df1)):
    # checking indices to place the data with the correct participant:
    if df1.index[i] not in df2['index']:
        pass
    else:
        df1['rate'][i] = list(df2['rate'][df2['index'] == i])
It does not work properly though. Can you please help with the correct way of assembling the column?
Update: where the index of the main dataframe matches the "index" column of the calculation dataframe, copy the rate value from the calculation into the main df.
main dataframe df1:

index  rate
1      0
2      0
3      0
4      0
5      0
6      0

dataframe with calculated values (df2):

index  rate
1      value
4      value
6      value

output df:

index  rate
1      value
2      0
3      0
4      value
5      0
6      value
Try this – using .join() to merge dataframes on their indices and combining two columns using .combine_first():
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
EDIT:
This assumes both dataframes use a matching index. If that is not the case for df2, run this first:
df2 = df2.set_index('index')
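A minimal reproduction of the tables above (the index values and the placeholder string "value" come from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"rate": [0, 0, 0, 0, 0, 0]}, index=[1, 2, 3, 4, 5, 6])
df2 = pd.DataFrame({"index": [1, 4, 6], "rate": ["value"] * 3})

df2 = df2.set_index("index")                        # align df2 on the shared index
df = df1.join(df2, lsuffix="_df1", rsuffix="_df2")  # both frames have a 'rate' column
df["rate"] = df["rate_df2"].combine_first(df["rate_df1"])
df = df[["rate"]]
```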
Date Sub Value
10/24/2020 A 1
9/18/2020 A 2
9/21/2020 A 3
9/13/2020 A 4
9/20/2020 A 5
I want to extract the data using latest date from the dataframe.
I used the following, but the output is different from what I want:
df = df.Date.max()
Output: 2020-10-24 00:00:00.
The output I am looking for is:
Date Sub Value
10/24/2020 A 1
To get multiple rows matching the same max value, you can do this:
In [2679]: df[df.Date == df.Date.max()]
Out[2679]:
Date Sub Value
0 2020-10-24 A 1
Use Series.idxmax with DataFrame.loc and [[]] to get a one-row DataFrame containing only the first row with the maximal datetime:
df1 = df.loc[[df.Date.idxmax()]]
Or use boolean indexing to compare against the max; this returns multiple rows if more than one row has the max value:
df1 = df[df.Date.eq(df.Date.max())]
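Both approaches can be sketched on the sample data from the question (assuming the dates parse as month/day/year):

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["10/24/2020", "9/18/2020", "9/21/2020", "9/13/2020", "9/20/2020"],
    "Sub": ["A"] * 5,
    "Value": [1, 2, 3, 4, 5],
})
df["Date"] = pd.to_datetime(df["Date"])  # parse strings so max() compares dates, not text

one_row = df.loc[[df["Date"].idxmax()]]        # first row with the max date
all_max = df[df["Date"].eq(df["Date"].max())]  # every row tied for the max date
```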
I am a doctor looking at surgical activity in a DataFrame that has 450,000 records and 129 fields or columns. Each row describes a patient admission to hospital, and the columns describe the operation codes and diagnosis codes for the patient.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452883 entries, 0 to 452882
Columns: 129 entries, fyear to opertn_24Type
dtypes: float64(5), int64(14), object(110)
memory usage: 445.7+ MB
There are 24 operation columns for each row. I want to search in the operation columns (1-24) for the codes for pituitary surgery "B041" and "B012", to identify all patients who have had surgery for a pituitary tumor.
I am a total python beginner and have tried using iloc to describe the range of columns (1-24) which appear starting at position 72 in the list of columns but couldn't get it to work.
I am quite happy searching for individual values eg "B041" in a single column using
df["OPERTN_01"] == "B041"
but would ideally like to search multiple columns (all the surgical columns 1-24) more efficiently.
I tried searching the whole dataframe using
test = df[df.isin(["B041", "B012"])]
but that just returns the entire dataframe with null values.
So I have a few questions.
How do I identify integer positions (iloc numbers) for columns in a large dataframe of 129 columns? I just listed them and counted them to get the first surgical column ("OPERTN_01") at position 72 — there must be an easier way.
What's the best way to slice a dataframe to select records with multiple values from multiple columns?
Let's use .iloc and create a boolean for you to filter by:
import pandas as pd
import numpy as np
np.random.seed(12)
df = pd.DataFrame({"A": ["John", "Deep", "Julia", "Kate", "Sandy"],
                   "result_1": np.random.randint(5, 15, 5),
                   "result_2": np.random.randint(5, 15, 5)})
print(df)
A result_1 result_2
0 John 11 5
1 Deep 6 11
2 Julia 7 6
3 Kate 8 9
4 Sandy 8 10
Next we need to find your intended values in the selected columns:
df.iloc[:, 1:].isin([11, 10])
this returns:
result_1 result_2
0 True False
1 False True
2 False False
3 False False
4 False True
From the above, we need to slice our original dataframe by the rows where any value is true (if I've understood you correctly).
For this we can use np.where() with .loc:
df.loc[np.where(df.iloc[:, 1:].isin([11, 10]))[0]]
A result_1 result_2
0 John 11 5
1 Deep 6 11
4 Sandy 8 10
From here it's a simple task to extract your unique IDs.
Answer 1:
Let's say you are looking for eg_col in your dataframe's columns. Then, you can find its index within the columns using:
df.columns.tolist().index('eg_col')
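pandas also has a built-in for this lookup, Index.get_loc, which avoids the list round-trip (sketch with a hypothetical frame):

```python
import pandas as pd

# Hypothetical column layout standing in for the asker's 129 columns
df = pd.DataFrame(columns=["fyear", "OPERTN_01", "OPERTN_02"])

pos = df.columns.get_loc("OPERTN_01")  # integer position of the column
```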
Answer 2:
In your example, if you know the name of the last surgical column (let's say it's called OPERTN_24), you can slice those columns using:
df_op = df.loc[:, 'OPERTN_01':'OPERTN_24']
Continuing from that, we can look for the values 'B041' and 'B012' in df_op as you tried: df_op.isin(['B041', 'B012']), which returns a boolean for every dataframe entry.
To extract, for example, only those rows where at least one of our target codes comes up, we select their indices with:
df.index[df_op.isin(['B041', 'B012']).any(axis=1)]
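Putting the pieces together on a made-up frame (the column names follow the question; the codes other than 'B041'/'B012' are invented placeholders):

```python
import pandas as pd

# Hypothetical stand-in for the surgical columns OPERTN_01..OPERTN_03
df = pd.DataFrame({
    "fyear": [2019, 2019, 2020],
    "OPERTN_01": ["B041", "X123", "A999"],
    "OPERTN_02": ["C456", "B012", "D111"],
    "OPERTN_03": [None, None, "E222"],
})

df_op = df.loc[:, "OPERTN_01":"OPERTN_03"]       # label-based column slice
mask = df_op.isin(["B041", "B012"]).any(axis=1)  # True where any op column matches
pituitary = df[mask]                             # rows for pituitary surgery patients
```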
I am running a for loop over each of the 12 months. For each month I get a bunch of dates, in random order, spread over various years of history, together with the corresponding temperature data on those dates. E.g. while the loop is on January, all the dates and temperatures I get from history are for January only.
I want to start with an empty pandas dataframe with two columns, 'Dates' and 'Temperature'. As the loop progresses I want to append the dates from the next month and the corresponding data to the 'Temperature' column.
After my dataframe is ready, I want to finally use the 'Dates' column as the index to order the 'Temperature' history, so that I have correctly sorted historical dates with their temperatures.
I have thought about using numpy arrays, storing the dates and data in two separate arrays and sorting the temperatures by the sorted date order via some kind of index, but I believe this will be better implemented in pandas, perhaps using its pivot table feature.
@Zanam Please refer to this syntax; I think your question is similar to this answer:
from random import randint

from pandas import DataFrame

df = DataFrame(columns=('lib', 'qty1', 'qty2'))
for i in range(5):
    df.loc[i] = [randint(-1, 1) for n in range(3)]
print(df)
lib qty1 qty2
0 0 0 -1
1 -1 -1 1
2 1 -1 1
3 0 0 0
4 1 -1 -1
[5 rows x 3 columns]
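Applied to the dates/temperature case, a minimal sketch (the dates and temperatures here are made-up placeholders for the loop's real history data):

```python
import pandas as pd

df = pd.DataFrame(columns=("Dates", "Temperature"))

# Hypothetical history rows gathered by the monthly loop, in random order
history = [("2001-01-15", 3.2), ("1999-01-02", -1.5), ("2000-01-20", 0.7)]
for i, (date, temp) in enumerate(history):
    df.loc[i] = [date, temp]

# After the loop: use Dates as the index and sort chronologically
df["Dates"] = pd.to_datetime(df["Dates"])
df = df.set_index("Dates").sort_index()
```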
I'm trying to create a column of microsatellite motifs in a pandas dataframe. I have one column that gives the length of the motif and another that has the whole microsatellite.
Here's an example of the columns of interest.
motif_len sequence
0 3 ATTATTATTATT
1 4 ATCTATCTATCT
2 3 ATCATCATCATC
I would like to slice the values in sequence using the values in motif_len to give a single repeat(motif) of each microsatellite. I'd then like to add all these motifs as a third column in the data frame to give something like this.
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
I've tried a few things with no luck.
>>df['motif'] = df.sequence.str[:df.motif_len]
>>df['motif'] = df.sequence.str[:df.motif_len.values]
Both make the motif column but all the values are NaN.
I think I understand why these don't work: I'm passing a series/array as the upper index in the slice rather than a value from the motif_len column.
I also tried to create a series by iterating through each row.
Any ideas?
You can call apply on the df and pass axis=1 to apply row-wise, using each row's column values to slice the str:
In [5]:
df['motif'] = df.apply(lambda x: x['sequence'][:x['motif_len']], axis=1)
df
Out[5]:
motif_len sequence motif
0 3 ATTATTATTATT ATT
1 4 ATCTATCTATCT ATCT
2 3 ATCATCATCATC ATC
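As an alternative sketch, a list comprehension zipped over the two columns avoids the row-wise apply and is usually faster on large frames:

```python
import pandas as pd

df = pd.DataFrame({
    "motif_len": [3, 4, 3],
    "sequence": ["ATTATTATTATT", "ATCTATCTATCT", "ATCATCATCATC"],
})

# Slice each sequence string by its own motif length
df["motif"] = [seq[:n] for n, seq in zip(df["motif_len"], df["sequence"])]
```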