My output dataframe will sometimes have 1, 5, or 10 rows. How do I select exactly the middle row?
My code:
df =
val
0 10
1 20
2 30
3 40
mid_rw = round(len(df)/2)
print(df.iloc[mid_rw])
But the above does not work if there is only one row. How can I make it work for one row as well?
How about this:
import pandas as pd
df = pd.DataFrame({'val':[10,20,30,40]})
mid_rw = int(len(df)/2)
print(df.iloc[mid_rw])
int() truncates toward zero, which floors the result here, so for a single row int(1/2) gives index 0 and the lookup stays in range.
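A quick sanity check that the floor index stays in range for any non-empty frame, including the single-row case from the question:

```python
import pandas as pd

# Try the floor approach on a one-row and a five-row frame.
for vals in ([10], [10, 20, 30, 40, 50]):
    df = pd.DataFrame({'val': vals})
    mid_rw = len(df) // 2  # floor division, equivalent to int(len(df) / 2)
    print(df.iloc[mid_rw]['val'])  # prints 10, then 30
```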
I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and I want to remove the "[", "-" and ")". Instead of showing the range (such as 0-10), I would like to show the middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype('int').mean(axis=1).astype('int')
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string typed, so I used extract to get the boundaries of the ranges and converted them to a real list of integers. Finally, exploding the dataframe on the modified Age column gets you a new row for each integer in the list; values in other columns are copied accordingly.
I tried this:
import pandas as pd
import re
data = {
    'age_range': [
        '[0-10)',
        '[10-20)',
        '[20-30)',
        '[30-40)',
        '[40-50)',
        '[50-60)',
        '[60-70)',
        '[70-80)',
    ]
}
df = pd.DataFrame(data=data)
def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0]) + int(ages[1])) / 2)
df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
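As a side note, the same result can be computed without apply by reusing the vectorized str.extract idea from the earlier answer; a sketch:

```python
import pandas as pd

df = pd.DataFrame({'age_range': ['[0-10)', '[10-20)', '[70-80)']})
# Extract both bounds as integers, then take the floored midpoint per row.
bounds = df['age_range'].str.extract(r'(\d+)-(\d+)').astype(int)
df['age'] = (bounds[0] + bounds[1]) // 2
print(df['age'].tolist())  # [5, 15, 75]
```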
In column J, I would like to get the value given by the Excel function IF(H3>I3,C2,0), and based on that, an occurrence number counted from the bottom up: the latest match is the 1st occurrence, the one above it the 2nd, and so on.
Here is the solution:
import pandas as pd
import numpy as np
# suppose we have this DataFrame:
df = pd.DataFrame({'A':[55,23,11,100,9] , 'B':[12,72,35,4,100]})
# suppose we want to keep the values of column 'A' when they are greater than
# or equal to the corresponding values in column 'B', and return 0 otherwise
# so I'll put the results in a new column named 'Result'
df['Result'] = np.where(df['A'] >= df['B'] , df['A'] , 0)
Then if you print the DataFrame:
df
result:
     A    B  Result
0   55   12      55
1   23   72       0
2   11   35       0
3  100    4     100
4    9  100       0
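The question also asks for an occurrence number counted from the bottom up. The exact layout is only visible in the missing image, so this is a guess at what is meant: number the non-zero results so the bottom-most match is 1, the next one above it 2, and so on.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [55, 23, 11, 100, 9], 'B': [12, 72, 35, 4, 100]})
df['Result'] = np.where(df['A'] >= df['B'], df['A'], 0)
# Number the non-zero results from the bottom up: latest match = 1.
mask = df['Result'] != 0
df['Occurrence'] = mask[::-1].cumsum()[::-1].where(mask, 0)
print(df['Occurrence'].tolist())  # [2, 0, 0, 1, 0]
```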
I have long rows of data where each row contains a single character of a number, and I want to merge every 5 rows into 1 using Python.
for example:
A Result
1 12335
2 23352
3 33525
3 35251
5 ...
2 ...
5 ...
1 ...
The first result contains rows 1 to 5 and the second result contains rows 2 to 6. Can someone help me with that? Any answer would be appreciated!
Something like this should work:
df = pd.DataFrame({'input': ['1', '2', '3', '4', '5', '6', '7']})
df['result'] = df['input']
for i in range(1, 5):  # add the next four rows, so five characters in total
    df['result'] = df['result'] + df['input'].shift(-i)
This also works, but only for non-overlapping groups of 5 (not the sliding window asked for), and it is not very elegant:
dz = (df.iloc[0::5, :].reset_index(drop=True)
      + df.iloc[1::5, :].reset_index(drop=True)
      + df.iloc[2::5, :].reset_index(drop=True)
      + df.iloc[3::5, :].reset_index(drop=True)
      + df.iloc[4::5, :].reset_index(drop=True))
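For the sliding five-row merge the question describes, a shorter sketch is to join the characters once and slice, assuming each row holds a single character as in the example:

```python
import pandas as pd

df = pd.DataFrame({'A': list('12335251')})
s = ''.join(df['A'])
# Each result takes five consecutive characters, sliding down one row at a time.
result = pd.Series([s[i:i + 5] for i in range(len(s) - 4)])
print(result.tolist())  # ['12335', '23352', '33525', '35251']
```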
I have a dataset showing below.
What I would like to do is three things.
Step 1: AA to CC is an index; however, I'm happy to keep it in the dataset for future use.
Step 2: Count the 0 values in each row.
Step 3: If more than 20% of a row is 0 (more than 2 zeroes in this case, because DD to MM is 10 columns), remove the row.
So I took a clumsy route to achieve the three steps above.
df = pd.read_csv("dataset.csv", header=None)
df_bool = (df == "0")
print(df_bool.sum(axis=1))
then I got an expected result showing below.
0 0
1 0
2 1
3 0
4 1
5 8
6 1
7 0
So I removed row #5, as indicated below.
df2 = df.drop([5], axis=0)
print(df2)
This works well even though it is not elegant. However, if I import my dataset with header=0, this approach does not work at all.
df = pd.read_csv("dataset.csv", header=0)
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
How come this happens?
Also, if I wanted to write this with loop, count and drop operations, what would the code look like?
You can just continue using boolean indexing:
First we calculate number of columns and number of zeroes per row:
n_columns = len(df.columns) # or df.shape[1]
zeroes = (df == "0").sum(axis=1)
We then select only the rows that have less than 20% zeroes.
proportion_zeroes = zeroes / n_columns
max_20 = proportion_zeroes < 0.20
df[max_20]  # contains only the rows with less than 20% zeroes
One liner:
df[((df == "0").sum(axis=1) / len(df.columns)) < 0.2]
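A likely explanation for the header= puzzle in the question, assuming the zeroes in the CSV are numeric: with header=None, the header row is read as data, which forces whole columns to string dtype, so df == "0" matches. With header=0, the columns parse as integers and comparing them to the string "0" is False everywhere. Comparing against the integer 0 makes the count work in that case:

```python
import io
import pandas as pd

csv = "a,b\n1,0\n0,0\n"
df = pd.read_csv(io.StringIO(csv), header=0)  # columns parse as integers

print((df == "0").sum(axis=1).tolist())  # [0, 0] - string comparison never matches
print((df == 0).sum(axis=1).tolist())    # [1, 2] - integer comparison counts the zeroes
```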
It would have been great if you had posted how the dataframe looks in pandas rather than a picture of an Excel file. However, constructing a dummy df:
df = pd.DataFrame({'index1': ['a', 'b', 'c'], 'index2': ['b', 'g', 'f'], 'index3': ['w', 'q', 'z'],
                   'Col1': [0, 1, 0], 'Col2': [1, 1, 0], 'Col3': [1, 1, 1], 'Col4': [2, 2, 0]})
Step 1: assigning the index can be done using the .set_index() method, as per below
df.set_index(['index1','index2','index3'],inplace=True)
Instead of doing everything manually when it comes to filtering, you can use the row sums you got from df_bool.sum(axis=1) directly in a boolean filter, as per below (illustrated with a 60% threshold so the dummy data has a matching row):
df.loc[(df==0).sum(axis=1) / (df.shape[1])>0.6]
index1 index2 index3 Col1 Col2 Col3 Col4
c f z 0 0 1 0
and using that you can drop those rows; assuming the 20% threshold, you would use
df = df.loc[(df==0).sum(axis=1) / (df.shape[1])<0.2]
When it comes to the header issue, it's a bit difficult to answer without seeing what the file or dataframe looks like.
I am using the pandas module for reading the data from a .csv file.
I can write out the following code to extract the data belonging to an individual column as follows:
import pandas as pd
df = pd.read_csv('somefile.tsv', sep='\t', header=0)
some_column = df.column_name
print(some_column)  # gives the values of all entries in the column
However, the file that I am trying to read now has more than 5000 columns and writing out the statement
some_column = df.column_name
is now not feasible. How can I get all the column values so that I can access them using indexing?
e.g. to extract the value present at the 100th row and the 50th column, I should be able to write something like this:
df([100][50])
Use DataFrame.iloc or DataFrame.iat, but note that Python counts from 0, so you need 99 and 49 to select the 100th row and the 50th column:
df = df.iloc[99,49]
Sample: select the 3rd row and the 4th column:
df = pd.DataFrame({'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9],
                   'D': [1, 3, 10],
                   'E': [5, 3, 6],
                   'F': [7, 4, 3]})
print (df)
A B C D E F
0 1 4 7 1 5 7
1 2 5 8 3 3 4
2 3 6 9 10 6 3
print (df.iloc[2,3])
10
print (df.iat[2,3])
10
Selecting by a combination of column name and row position is possible with Series.iloc or Series.iat:
print (df['D'].iloc[2])
10
print (df['D'].iat[2])
10
Pandas has positional indexing for dataframes, so you can use
df.iloc[[index]]["column header"]
The index is wrapped in a list because you can pass multiple indexes at once this way.
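A small illustration of that pattern (the column names here are made up for the demo):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# A list of positions selects several rows from the column at once.
print(df.iloc[[0, 2]]['B'].tolist())  # [4, 6]
```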