I want to add a list that is longer than the DataFrame as a new column, but I get this error: ValueError: Length of values (4) does not match length of index (3)
import pandas as pd
df = pd.DataFrame({'Data': ['1', '2', '3']})
df['Data2'] =['1', '2', '3', '4']
print(df)
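How can I fix it?
Use DataFrame.reindex to add new rows so that the DataFrame is as long as the longer of the two (the new list or the original DataFrame). This is useful when the list length varies and is sometimes equal to the DataFrame length and sometimes longer. Note that assigning a plain list that is shorter than the DataFrame would still raise the same ValueError: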
df = pd.DataFrame({'Data': ['1', '2', '3']})
L = ['1', '2', '3', '4']
df = df.reindex(range(max(len(df), len(L))))
df['Data2'] = L
print (df)
Data Data2
0 1 1
1 2 2
2 3 3
3 NaN 4
If the list is always longer than the DataFrame:
df = df.reindex(range(len(L)))
df['Data2'] = L
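If the list can also be shorter than the DataFrame, one option is to wrap it in a Series so that pandas aligns on the index instead of checking lengths. The sketch below assumes the DataFrame has a default RangeIndex; the helper name add_column_any_length is hypothetical and not part of pandas.

```python
import pandas as pd

def add_column_any_length(df, name, values):
    """Hypothetical helper: add `values` as column `name`, padding with NaN."""
    # Wrapping the list in a Series lets pandas align on the index,
    # so a shorter list is padded with NaN instead of raising ValueError.
    new_col = pd.Series(list(values))
    # Grow the frame (if needed) to the longer of the two lengths.
    out = df.reindex(range(max(len(df), len(new_col))))
    out[name] = new_col
    return out

df = pd.DataFrame({'Data': ['1', '2', '3']})
longer = add_column_any_length(df, 'Data2', ['1', '2', '3', '4'])
shorter = add_column_any_length(df, 'Data2', ['1', '2'])
print(longer)
print(shorter)
```

The original df is left untouched because reindex returns a new DataFrame.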
You can also use pd.concat here: convert your list to a Series first, then concatenate along the columns:
l = ['1', '2', '3', '4']
pd.concat([df, pd.Series(l, name='Data2')], axis=1)
Data Data2
0 1 1
1 2 2
2 3 3
3 NaN 4
I am trying to write a function that finds the columns with "100" in the header and replaces all values above 100 in these columns with NaN:
import pandas as pd
data = {'first_100': ['25', '1568200', '5'],
'second_column': ['first_value', 'second_value', 'third_value'],
'third_100':['89', '9', '589'],
'fourth_column':['first_value', 'second_value', 'third_value'],
}
df = pd.DataFrame(data)
print (df)
so this is the output I am looking for
Use filter to identify the columns with '100', to_numeric to ensure having numeric values, then mask with a boolean array:
cols = df.filter(like='100').columns
df[cols] = df[cols].mask(df[cols].apply(pd.to_numeric, errors='coerce').gt(100))
output:
first_100 second_column third_100 fourth_column
0 25 first_value 89 first_value
1 NaN second_value 9 second_value
2 5 third_value NaN third_value
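Since the question asks for a function, the same steps can be wrapped up as in the sketch below. The name mask_above_limit and its parameters token and limit are hypothetical, chosen only for illustration.

```python
import pandas as pd

def mask_above_limit(df, token='100', limit=100):
    """Hypothetical helper: set values above `limit` to NaN in columns
    whose name contains `token`."""
    out = df.copy()
    cols = out.filter(like=token).columns
    # Compare on a numeric copy so the values that are kept stay unchanged.
    numeric = out[cols].apply(pd.to_numeric, errors='coerce')
    out[cols] = out[cols].mask(numeric.gt(limit))
    return out

data = {'first_100': ['25', '1568200', '5'],
        'second_column': ['first_value', 'second_value', 'third_value'],
        'third_100': ['89', '9', '589'],
        'fourth_column': ['first_value', 'second_value', 'third_value']}
df = pd.DataFrame(data)
result = mask_above_limit(df)
print(result)
```

Working on a copy keeps the input DataFrame intact, which makes the function easier to reuse.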
I'm working with a very long dataframe, so I'm looking for the fastest way to fill several columns at once given certain conditions.
So let's say you have this dataframe:
data = {
'col_A1':[1,'','',''],
'col_A2':['','','',''],
'col_A3':['','','',''],
'col_B1':['','',1,''],
'col_B2':['','','',''],
'col_B3':['','','',''],
'col_C1':[1,1,'',''],
'col_C2':['','','',''],
'col_C3':['','','',''],
}
df = pd.DataFrame(data)
df
Input:
  col_A1 col_A2 col_A3 col_B1 col_B2 col_B3 col_C1 col_C2 col_C3
0      1                                         1
1                                                1
2                           1
3
We want to find every 1 in columns A1, B1 and C1, and then fill the other columns of the same group in the matching rows (A2 and A3, B2 and B3, C2 and C3) as well:
Output:
  col_A1 col_A2 col_A3 col_B1 col_B2 col_B3 col_C1 col_C2 col_C3
0      1      2      3                           1      2      3
1                                                1      2      3
2                           1      2      3
3
I am currently iterating over columns A and looking for where A1 == 1 matches and then replacing the values for A2 and A3 in the matching rows, and the same for B, C...
But speed is important, so I'm wondering if I can do this for all columns at once, or in a more vectorized way.
You can use:
# extract letters/numbers from column names
nums = df.columns.str.extract(r'(\d+)$', expand=False)
# ['1', '2', '3', '1', '2', '3', '1', '2', '3']
letters = df.columns.str.extract(r'_(\D)', expand=False)
# ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C']
# or in a single line
# letters, nums = df.columns.str.extract(r'(\D)(\d+)$').T.to_numpy()
# compute a mask of values to fill
mask = df.ne('').groupby(letters, axis=1).cummax(axis=1)
# NB. alternatively use df.eq('1')...
# set the values
df2 = mask.mul(nums)
output:
  col_A1 col_A2 col_A3 col_B1 col_B2 col_B3 col_C1 col_C2 col_C3
0      1      2      3                           1      2      3
1                                                1      2      3
2                           1      2      3
3
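In recent pandas versions the axis=1 argument of groupby is deprecated, so the sketch below shows one possible variant that transposes the frame instead. It is an assumption-laden rewrite, not the answer's original code: the mask is cast to integers before cummax and back to booleans afterwards, and the values are filled in with np.where rather than mask.mul.

```python
import numpy as np
import pandas as pd

data = {
    'col_A1': [1, '', '', ''], 'col_A2': ['', '', '', ''], 'col_A3': ['', '', '', ''],
    'col_B1': ['', '', 1, ''], 'col_B2': ['', '', '', ''], 'col_B3': ['', '', '', ''],
    'col_C1': [1, 1, '', ''], 'col_C2': ['', '', '', ''], 'col_C3': ['', '', '', ''],
}
df = pd.DataFrame(data)

# Trailing number and group letter of every column name.
nums = df.columns.str.extract(r'(\d+)$', expand=False).to_numpy()
letters = df.columns.str.extract(r'_(\D)', expand=False).to_numpy()

# Transpose so that column groups become row groups, take the cumulative
# maximum inside each letter group, then transpose back.
filled = df.ne('').astype(int).T.groupby(letters).cummax().T.astype(bool)

# Where the mask is True, write the column's number; elsewhere keep ''.
df2 = pd.DataFrame(
    np.where(filled.to_numpy(), nums[None, :], ''),
    index=df.index, columns=df.columns,
)
print(df2)
```

Note that the filled values are strings ('1', '2', '3'), as in the answer above, because they are taken from the column names.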
I have a tricky problem selecting columns in a dataframe. My dataframe has multiple columns with the same name, "PTime".
This is my dataframe:
PTime first_column PTime third_column PTime fourth_column
0 4 first_value 1 first_value 6 first_value
1 4 second_value 2 second_value 7 second_value
This is what I want:
PTime first_column PTime fourth_column
0 4 first_value 6 first_value
1 4 second_value 7 second_value
I will select my columns from a list:
My code:
data = {'PTime': ['1', '1'],
'first_column': ['first_value', 'second_value'],
'PTime': ['2', '2'],
'third_column': ['first_value', 'second_value'],
'PTime': ['4', '4'],
'fourth_column': ['first_value', 'second_value'],
}
list_c = ['PTime','first_column','fourth_column']
df = pd.DataFrame(data)
#df = df[df.columns.intersection(list_c)]
df = df[list_c]
df
So my goal is to select each column that is in the list, together with the column to its left. If you have any idea how to do that, thank you very much. Regards
I don't know exactly how to select the column to the left of each one in the list, but here is a trick to get the table you want:
PTime first_column PTime fourth_column
0 4 first_value 6 first_value
1 4 second_value 7 second_value
You can simply remove the unwanted columns by index.
However, because several columns share the same name, df.columns[i] returns a label that matches all of them, so pandas will drop every column with that name rather than a single one.
To avoid this, first rename the duplicated columns, for example to PTime1, PTime2, PTime3, and then drop columns by index:
df.drop(df.columns[i], axis=1,inplace=True)
or
df = df.drop(df.columns[i], axis=1)
To drop several columns, pass a list of indices. In your case, after renaming the columns, it will be:
df.drop(df.columns[[2,3]],axis=1)
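If renaming is not convenient, selecting by position with iloc also works with duplicated names, because positions are always unique. The sketch below is only an illustration: the frame is built from rows (a dict literal would keep only the last 'PTime' key), and the positions 2 and 3 are the ones used in the example above.

```python
import pandas as pd

rows = [
    ['4', 'first_value', '1', 'first_value', '6', 'first_value'],
    ['4', 'second_value', '2', 'second_value', '7', 'second_value'],
]
columns = ['PTime', 'first_column', 'PTime', 'third_column', 'PTime', 'fourth_column']
df = pd.DataFrame(rows, columns=columns)

# Positions of the columns to remove (the second PTime and third_column).
drop_positions = {2, 3}
keep = [i for i in range(df.shape[1]) if i not in drop_positions]

# iloc selects by position, so the duplicated labels are not a problem.
result = df.iloc[:, keep]
print(result)
```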
In my dataframe I will not have multiple columns with the same name. All names will be distinct.
So in the case I have ten columns to select it will be difficult to list them all in a list.
data = {'PTime1': ['1', '1'],
'first_column': ['first_value', 'second_value'],
'PTime2': ['2', '2'],
'third_column': ['first_value', 'second_value'],
'PTime3': ['4', '4'],
'fourth_column': ['first_value', 'second_value'],
}
list_c = ['first_column','fourth_column'] #define column to select
df = pd.DataFrame(data) #create dataframe
list_index = [] #create list to store index column
for col in list_c:
    index_no = df.columns.get_loc(col) #get index column
    list_index.append(index_no-1) #insert index-1 in a list. Get column from the left
    list_index.append(index_no) #insert index from the column in the list.
df = df.iloc[:, list_index] #Subset the dataframe with the list of column selected.
df
This way I can select each column in my list together with the column to its left.
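The loop above can also be written without explicit iteration. This is a minimal sketch, assuming all column names are unique (Index.get_indexer requires that) and that no selected column is the first one; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

data = {'PTime1': ['1', '1'],
        'first_column': ['first_value', 'second_value'],
        'PTime2': ['2', '2'],
        'third_column': ['first_value', 'second_value'],
        'PTime3': ['4', '4'],
        'fourth_column': ['first_value', 'second_value']}
df = pd.DataFrame(data)
list_c = ['first_column', 'fourth_column']

# Positions of the requested columns; get_indexer returns -1 for missing names.
pos = df.columns.get_indexer(list_c)
if (pos < 1).any():
    # -1 means "not found", 0 means there is no column to the left.
    raise ValueError("each column must exist and have a left neighbour")

# Interleave each left neighbour with its column: [0, 1, 4, 5] here.
positions = np.column_stack([pos - 1, pos]).ravel()
subset = df.iloc[:, positions]
print(subset)
```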
I have csv data like the following.
1,2,3,4
a,b,c,d
1,2,3,4 is not a CSV header; it is data.
All of the values are strings.
I want to join the columns at (list) index 1 and 2 using pandas.
I want to get a result like the following.
The result data should be strings.
1,23,4
a,bc,d
The plain Python code looks like this.
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
vals = lines[0]
s = vals[0] + ',' + (vals[1] + vals[2]) + ',' + vals[3] + '\n'
vals = lines[1]
s += vals[0] + ',' + (vals[1] + vals[2]) + ',' + vals[3] + '\n'
print(s)
How do you do it?
If you want to use pandas, you could create a new column and remove the old ones:
import pandas as pd
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
df = pd.DataFrame(lines)
# Create new column
df['new_col'] = df[1] + df[2]
print(df)
# 0 1 2 3 new_col
# 0 1 2 3 4 23
# 1 a b c d bc
# Remove old columns if needed
df.drop([1, 2], axis=1, inplace=True)
print(df)
# 0 3 new_col
# 0 1 4 23
# 1 a d bc
If you want columns to be in specific order, use something like this:
print(df[[0, 'new_col', 3]])
# 0 new_col 3
# 0 1 23 4
# 1 a bc d
But it's better to store headers in the CSV.
You can loop over it using for or a list-comprehension.
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
vals = [','.join([w, f'{x}{y}', *z]) for w, x, y, *z in lines]
s = '\n'.join(vals)
print(s)
# prints:
# 1,23,4
# a,bc,d
You can do something like this.
import pandas as pd
lines = [
['1', '2', '3', '4'],
['a', 'b', 'c', 'd'],
]
df = pd.DataFrame(lines)
df['new_col'] = df.iloc[:, 1] + df.iloc[:, 2]
print(df)
Output:
   0  1  2  3 new_col
0  1  2  3  4      23
1  a  b  c  d      bc
You can then drop the columns you don't want.
Since OP specified pandas, here's a solution that may work.
Once the data is in pandas (e.g. loaded with pd.read_csv()), you can simply concatenate text (object) columns with +:
import pandas as pd
lines = [ ['1', '2', '3', '4'],
['a', 'b', 'c', 'd']]
df = pd.DataFrame(lines)
df[1] = df[1]+df[2]
df.drop(columns=2, inplace=True)
df
# 0 1 3
# 0 1 23 4
# 1 a bc d
Should give you what you want in a pandas dataframe.
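For completeness, here is a sketch of the full round trip from CSV text and back. It assumes the file has no header row, so header=None is passed, and dtype=str keeps every value as a string; io.StringIO stands in for a real file path.

```python
import io
import pandas as pd

csv_text = "1,2,3,4\na,b,c,d\n"

# header=None: the first line is data, not column names.
# dtype=str: keep '1', '2', ... as strings so that + concatenates.
df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)

df[1] = df[1] + df[2]        # join columns 1 and 2
df = df.drop(columns=2)      # remove the now-redundant column

# Without a path, to_csv returns the CSV as a string.
result = df.to_csv(header=False, index=False)
print(result)
```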
I have a multi indexed pandas table as below.
I want to update the Crop and Avl columns, say with 'Tomato' and 0, but only for a limited number of rows that satisfy a condition (say, I need only 10 rows for Tomato). Currently, with pandas, I end up updating all rows that satisfy the condition.
col1 = ildf1.index.get_level_values(1) # https://stackoverflow.com/a/50608928/9148067
cond = col1.str.contains('DN_Mega') & (ildf1['Avl'] == 1)
ildf1.iloc[ cond , [0,2]] = ['Tomato', 0]
How do I restrict it to, say, only 10 of the rows that satisfy the condition?
PS: I used get_level_values as I have 4 columns (GR, PP+MT, Bay, Row) multi indexed in my df.
For a df defined as below, you need to add an additional index level that numbers every row uniquely; then you can set the new values on a slice of the matching rows. Here you go =^..^=
import pandas as pd
df = pd.DataFrame({'Crop': ['', '', '', '', ''], 'IPR': ['', '', '', '', ''], 'Avi': [1, 2, 3, 4, 5]}, index=[['0','0', '8', '8', '8'], ['6', '7', '7', '7', '7']])
# add additional index
df['id'] = range(df.shape[0])
df.set_index('id', append=True, inplace=True)
# select only two values based on condition
condition = df.index[df.index.get_level_values(0).str.contains('8')][0:2]
df.loc[condition, ['Crop', 'IPR']] = ['Tomato', 0]
Output:
          Crop  IPR  Avi
    id
0 6 0                  1
  7 1                  2
8 7 2   Tomato    0    3
    3   Tomato    0    4
    4                  5
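An alternative that avoids adding an index level is to turn the boolean condition into row positions and keep only the first few. The sketch below uses made-up data shaped like the question (a 'DN_Mega' label in index level 1 and an Avl column); n_rows and the sample values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'Crop': ['', '', '', '', ''],
     'IPR': ['', '', '', '', ''],
     'Avl': [1, 1, 1, 0, 1]},
    index=[['0', '0', '8', '8', '8'],
           ['DN_Mega', 'Other', 'DN_Mega', 'DN_Mega', 'DN_Mega']],
)

n_rows = 2  # how many matching rows to update (10 in the question)

# Boolean condition over all rows, as in the question.
level1 = df.index.get_level_values(1)
cond = level1.str.contains('DN_Mega') & (df['Avl'] == 1).to_numpy()

# Positions of the matching rows, truncated to the first n_rows.
positions = np.flatnonzero(cond)[:n_rows]

# Assign one column at a time, by position, to keep the assignment unambiguous.
df.iloc[positions, df.columns.get_loc('Crop')] = 'Tomato'
df.iloc[positions, df.columns.get_loc('Avl')] = 0
print(df)
```

Because the update is positional, it also works when the MultiIndex contains repeated entries.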