Adding a new column to a DataFrame with different values in different rows - python

I have a DataFrame df which has 50 columns and 28800 rows. I want to add a new column col_new which will have the value 0 in every row from 2880 to 5760, 12960 to 15840, and 23040 to 25920. All other rows will have the value 1.
How could I do that?

I believe what you are looking for is actually answered here: Add column in dataframe from list
myList = [1, 2, 3, 4, 5]
print(len(df.columns)) # 50
df['new_col'] = myList # the list must have one entry per row of df
print(len(df.columns)) # 51
Alternatively, you could set the value of a slice of the DataFrame like so:
df['new_col'] = 1
df.loc[2880:5760, 'new_col'] = 0
df.loc[12960:15840, 'new_col'] = 0
df.loc[23040:25920, 'new_col'] = 0
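One caveat: unlike positional slicing, .loc slices by label and includes both endpoints, so row 5760 itself is set to 0 above. A quick check on a toy frame with a default RangeIndex:
import pandas as pd

df = pd.DataFrame({'a': range(10)})
df['new_col'] = 1
df.loc[2:4, 'new_col'] = 0 # .loc includes the endpoint: rows 2, 3 and 4
print(df['new_col'].tolist()) # [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]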

df = pd.DataFrame([i for i in range(28800)])
df["new_col"] = 1
# build a boolean mask that is True for the rows that should become 0
zeros_bool = [(
    (2880 - 1 <= i < 5760) or (12960 - 1 <= i < 15840) or (23040 - 1 <= i < 25920)
) for i in range(28800)]
df.loc[zeros_bool, "new_col"] = 0
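For completeness, the same mask can be built without a Python-level loop. A minimal sketch using np.r_, assuming half-open ranges as above and a default RangeIndex:
import numpy as np
import pandas as pd

df = pd.DataFrame({"dummy": range(28800)}) # stand-in for the real 50-column frame
df["new_col"] = 1
zero_rows = np.r_[2880:5760, 12960:15840, 23040:25920] # half-open, like range()
df.loc[zero_rows, "new_col"] = 0
print(df["new_col"].value_counts()) # 20160 ones, 8640 zeros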

Related

I want to replace every value in the age column with its middle value

I have a column that looks like this:
Age
[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)
and I want to remove the "[", "-" and ")". Instead of showing a range such as 0-10, I would like to show the middle value for every row in the column.
Yet another solution:
The dataframe:
df = pd.DataFrame({'Age':['[0-10)','[10-20)','[20-30)','[30-40)','[40-50)','[50-60)','[60-70)','[70-80)']})
df
Age
0 [0-10)
1 [10-20)
2 [20-30)
3 [30-40)
4 [40-50)
5 [50-60)
6 [60-70)
7 [70-80)
The code:
df['Age'] = df.Age.str.extract(r'(\d+)-(\d+)').astype('int').mean(axis=1).astype('int')
The result:
df
Age
0 5
1 15
2 25
3 35
4 45
5 55
6 65
7 75
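Broken into steps, the one-liner does the following (same df as above):
bounds = df.Age.str.extract(r'(\d+)-(\d+)') # two string columns: lower and upper bound
bounds = bounds.astype('int') # make them numeric
df['Age'] = bounds.mean(axis=1).astype('int') # row-wise midpoint, e.g. (0 + 10) / 2 -> 5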
If you want to explode a row into multiple rows where each row carries a value from the range, you can do this:
data = '''[0-10)
[10-20)
[20-30)
[30-40)
[40-50)
[50-60)
[60-70)
[70-80)'''
df = pd.DataFrame({'Age': data.splitlines()})
df['Age'] = df['Age'].str.extract(r'\[(\d+)-(\d+)\)').astype(int).apply(lambda r: list(range(r[0], r[1])), axis=1)
df.explode('Age')
Note that I assume your Age column is string typed, so I used extract to get the boundaries of the ranges and converted them to a real list of integers. Finally, exploding the dataframe on the modified Age column gets you a new row for each integer in the list. Values in the other columns are copied accordingly.
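For reference, the first few exploded rows look like this (the original index is repeated for each value taken from the range):
print(df.explode('Age').head())
#   Age
# 0   0
# 0   1
# 0   2
# 0   3
# 0   4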
I tried this:
import pandas as pd
import re

data = {
    'age_range': [
        '[0-10)',
        '[10-20)',
        '[20-30)',
        '[30-40)',
        '[40-50)',
        '[50-60)',
        '[60-70)',
        '[70-80)',
    ]
}
df = pd.DataFrame(data=data)

def get_middle_age(age_range):
    pattern = r'(\d+)'
    ages = re.findall(pattern, age_range)
    return int((int(ages[0]) + int(ages[1])) / 2)

df['age'] = df.apply(lambda row: get_middle_age(row['age_range']), axis=1)
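Since get_middle_age only needs the one column, the row-wise apply can also be replaced with a plain map over the column; a small variant:
df['age'] = df['age_range'].map(get_middle_age)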

How to create a dataframe with the number of rows having a value above zero for specific columns?

I have a dataframe like this
name skill_1 skill_2
john 2 0
james 0 1
I would like to have a count of the rows above zero for each column starting with "skill".
Expected output for the new dataframe:
skills count
skill_1 1
skill_2 1
How can I do it with Pandas?
You can try:
df.filter(like="skill").gt(0).sum(axis=0).to_frame("count")
filter for the columns whose names include "skill"
Mark the entries that are greater than 0 as True and the others as False
Sum down each column (axis=0), where True is treated as 1 and False as 0, to get the counts
Convert the resulting Series to a dataframe
to get
count
skill_1 1
skill_2 1
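If you also want the index labelled skills, as in the expected output, rename_axis can be chained on; a small addition:
df.filter(like="skill").gt(0).sum(axis=0).to_frame("count").rename_axis("skills")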
Just filter by the condition on the "skill" columns:
import pandas as pd

df = pd.DataFrame(columns=['name', 'skill_1', 'skill_2'],
                  data=[['john', 2, 0],
                        ['james', 0, 1]])
skill_cols = [x for x in df.columns if 'skill' in x]
# masking turns non-matching entries into NaN; count() then counts the rest
subset_df = df[df[skill_cols] > 0][skill_cols]
column_count = subset_df.count()
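Printing the result gives the expected counts:
print(column_count)
# skill_1    1
# skill_2    1
# dtype: int64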

Pandas-iterate through a dataframe column and concatenate corresponding row values that contain a list

I have a data-frame with column1 containing string values and column2 containing lists of string values.
I want to iterate through column1 and concatenate column1 values with their corresponding row values into a new data-frame.
Say, my input is
dfd = {'TRAINSET': ['101', '102', '103', '104'], 'unique': [['a1','x1','b2'], ['a1','b3','b2'], ['d3','g5','x2'], ['x1','b2','a1']]}
after the operation my data will look like this
dfd2 = {'TRAINSET':['101a1','101x1','101b2', '102a1','102b3','102b2','103d3', '103g5','103x2','104x1','104b2', '104a1']}
What I tried is:
dg = pd.concat([g['TRAINSET'].map(g['unique']).apply(pd.Series)], axis = 1)
but I get KeyError: 'TRAINSET', as this is probably not the proper syntax. Also, I would like to remove the NaN values in the lists.
It is possible to use a list comprehension to flatten the values of the lists, join the values with +, and pass the result to the DataFrame constructor:
#if necessary
#df = df.reset_index()

#flatten values and filter out missing values
L = [(str(a) + x) for a, b in df[['TRAINSET','unique']].values for x in b if pd.notna(x)]
df1 = pd.DataFrame({'TRAINSET': L})
print(df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
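The pd.notna(x) guard is what drops the missing values mentioned in the question; a small check, assuming one NaN inside a list:
import numpy as np
df_nan = pd.DataFrame({'TRAINSET': ['101'], 'unique': [['a1', np.nan]]})
L = [str(a) + x for a, b in df_nan[['TRAINSET', 'unique']].values for x in b if pd.notna(x)]
print(L) # ['101a1'] - the NaN entry is skipped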
Or use DataFrame.explode (pandas 0.25+), create a default index, remove missing values with DataFrame.dropna, and join the columns with +, using Series.to_frame for a one-column DataFrame:
df = df.explode('unique').dropna(subset=['unique']).reset_index(drop=True)
df1 = (df['TRAINSET'].astype(str) + df['unique']).to_frame('TRAINSET')
print(df1)
TRAINSET
0 101a1
1 101x1
2 101b2
3 102a1
4 102b3
5 102b2
6 103d3
7 103g5
8 103x2
9 104x1
10 104b2
11 104a1
Coming from your original data, you can do the below using explode (new in pandas 0.25+) and agg:
Input:
dfd = {'TRAINSET': ['101', '102', '103', '104'],
       'unique': [['a1','x1','b2'], ['a1','b3','b2'], ['d3','g5','x2'], ['x1','b2','a1']]}
Solution:
df = pd.DataFrame(dfd)
df.explode('unique').astype(str).agg(''.join, axis=1).to_frame('TRAINSET').to_dict('list')
{'TRAINSET': ['101a1',
'101x1',
'101b2',
'102a1',
'102b3',
'102b2',
'103d3',
'103g5',
'103x2',
'104x1',
'104b2',
'104a1']}
Another solution, just to give you some choice...
import pandas as pd

_dfd = {'TRAINSET': ['101', '102', '103', '104'],
        'unique': [['a1','x1','b2'], ['a1','b3','b2'], ['d3','g5','x2'], ['x1','b2','a1']]}
dfd = pd.DataFrame.from_dict(_dfd)
dfd.set_index("TRAINSET", inplace=True)
print(dfd)
dfd2 = dfd.reset_index()

def refactor(row):
    # keep the list as a list; wrapping it in str() would iterate over single characters
    key, values = str(row["TRAINSET"]), row["unique"]
    return [key + v for v in values]

dfd2['TRAINSET'] = dfd2.apply(refactor, axis=1)
dfd2.set_index("TRAINSET", inplace=True)
dfd2.drop("unique", inplace=True, axis=1)
print(dfd2)

Pandas: Delete rows with different encoding of 0s in python

I have calculated statistical values and written them to a csv file. The NaN values are replaced with zeros. There are rows with only zeros, and there are rows containing only 0 and 0.0 values. How can I delete these rows? According to the attached image, rows number 5 and 6 (only 0.0s) and 9 and 11 (both 0s and 0.0s) need to be deleted.
import pandas as pd

all_df = pd.read_csv('source.csv')
all_df.dropna(subset=all_df.columns.tolist()[1:], how='all', inplace=True)
all_df.fillna(0, inplace=True)
all_df.to_csv('outfile.csv', index=False)
Use all_df[(all_df.T != 0).any()] or all_df[(all_df != 0).any(axis=1)]:
all_df = pd.DataFrame({'a':[0,0,0,1], 'b':[0,0,0,1]})
print(all_df)
a b
0 0 0
1 0 0
2 0 0
3 1 1
all_df = all_df[(all_df.T != 0).any()]
print(all_df)
a b
3 1 1
EDIT 1: After looking at your data, a solution is to convert all numerical columns to float and then do the operations. This problem arises from the way the initial data were saved into the .csv file.
all_df = pd.read_csv('/Users/me/Downloads/Test11.csv')
# do not select 'activity' column
df = all_df.loc[:, all_df.columns != 'activity']
# convert to float
df = df.astype(float)
# remove rows where every value is 0
mask = (df != 0).any(axis=1)
df = df[mask]
# apply the same mask to the activity column
recover_lines_of_activity_column = all_df['activity'][mask]
# Final result
final_df = pd.concat([recover_lines_of_activity_column, df], axis = 1)

How to hide axis labels in python pandas dataframe?

I've used the following code to generate a dataframe which is supposed to be the input for a seaborn plot.
data_array = np.array([['index', 'value']])
for x in range(len(value_list)):
    data_array = np.append(data_array, np.array([[int(x + 1), int(value_list[x])]]), axis=0)
data_frame = pd.DataFrame(data_array)
The output looks something like this:
0 1
0 index values
1 1 value_1
2 2 value_2
3 3 value_3
However, with this dataframe, seaborn returns an error. When comparing my data to the examples, I see that the first row is missing. The samples, being loaded in with load_dataset(), look something like this:
0 index values
1 1 value_1
2 2 value_2
3 3 value_3
How do I remove the first row of axis labels of my dataframe so that it looks like the samples provided? Removing the first row removes the strings "index" and "values", but not the axis label.
NumPy treats the index/values row as just another row of data, not as the names of the columns.
I think this would be a more Pythonic way of doing this:
pd.DataFrame(list(enumerate(value_list, 1)), columns=['index', 'values'])
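For example, with a hypothetical value_list of three numbers:
value_list = [10, 20, 30] # hypothetical sample values
print(pd.DataFrame(list(enumerate(value_list, 1)), columns=['index', 'values']))
#    index  values
# 0      1      10
# 1      2      20
# 2      3      30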
I don't know what value_list is. However, I would recommend another way to create the dataframe:
import pandas as pd

value_list = ['10', '20', '30']
data_frame = pd.DataFrame({
    'index': range(len(value_list)),
    'value': [int(x) for x in value_list]})
data_frame:
index value
0 0 10
1 1 20
2 2 30
Now you can easily change dataframe index and 'index' column:
data_frame.loc[:, 'index'] += 1
data_frame.index += 1
data_frame:
index value
1 1 10
2 2 20
3 3 30
Try:
new_header = df.iloc[0] # grab the first row for the header
df = df[1:] # take the data less the header row
df.columns = new_header # set the header row as the df header
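Applied to the array from the question, the same idea can be done at construction time; a minimal sketch:
# use the first array row as the header and the rest as data
data_frame = pd.DataFrame(data_array[1:], columns=data_array[0])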
just slice your dataframe:
df = data_frame[1:]
df.columns = data_frame.iloc[0] # set the column names
