I have to evaluate a lot of CSV files. The columns are in a different order in every file because some columns were removed and new ones were added. Some columns appear in every file under the same name, so I want to switch from numpy to pandas, which lets me access the data by column name.
I want to calculate the average of one column depending on the values in another column.
First I want to filter the values:
import pandas as pd
d = {"Y Position [0] [mm]": [1, 2, 3, 4, 5], "Y Position [1] [mm]": [6, 7, 8, 9, 0]}
df = pd.DataFrame(data=d)
dq = df.query("`Y Position [0] [mm]` > 2")
print(dq)
But I get this error:
File "<unknown>", line 1
Y_Position_[_0_]_[_mm_]_BACKTICK_QUOTED_STRING >2
^
SyntaxError: invalid syntax
When I remove the square brackets it works fine:
   Y Position 0  Y Position [1] [mm]
2             3                     8
3             4                     9
4             5                     0
I checked the documentation but I could not find a reason why it should not work.
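For what it's worth, plain boolean indexing accepts any column name and sidesteps query's backtick parsing entirely; a minimal sketch on the same frame:
import pandas as pd
d = {"Y Position [0] [mm]": [1, 2, 3, 4, 5], "Y Position [1] [mm]": [6, 7, 8, 9, 0]}
df = pd.DataFrame(data=d)
# a boolean mask avoids query's string expression, so brackets are no problem
dq = df[df["Y Position [0] [mm]"] > 2]
print(dq)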
I have the following dataframe:
d_test = {
    'random_staff': ['gfda', 'fsd', 'gec', 'erw', 'gd', 'kjhk', 'fd', 'kui'],
    'cluster_number': [1, 2, 3, 3, 2, 1, 4, 2]
}
df_test = pd.DataFrame(d_test)
The cluster_number column contains values from 1 to n. Values may repeat, but none are missing; in the example above those values are 1, 2, 3, 4.
I want to select a value from the cluster_number column and replace its occurrences with a set of unique values, again without leaving gaps. For example, if we select the value 2, the desired outcome for cluster_number is [1, 2, 3, 3, 5, 1, 4, 6]: the column had three 2s, the first stays 2, the next occurrence becomes 5, and the last becomes 6.
I wrote code for the logic above and it works fine:
cluster_number_to_change = 2
max_cluster = max(df_test['cluster_number'])
first_iter = True
i = cluster_number_to_change
for index, row in df_test.iterrows():
    if row['cluster_number'] == cluster_number_to_change:
        df_test.loc[index, 'cluster_number'] = i
        if first_iter:
            i = max_cluster + 1
            first_iter = False
        else:
            i += 1
But it is written as a for-loop, and I am trying to understand whether it can be transformed into a pandas .apply call (or any other efficient vectorized solution).
Using boolean indexing:
# get cluster #2
m1 = df_test['cluster_number'].eq(2)
# identify duplicates
m2 = df_test['cluster_number'].duplicated()
# increment duplicates using the max as reference
df_test.loc[m1 & m2, 'cluster_number'] = (
    m2.where(m1).cumsum()
      .add(df_test['cluster_number'].max())
      .convert_dtypes()
)
print(df_test)
Output:
  random_staff  cluster_number
0         gfda               1
1          fsd               2
2          gec               3
3          erw               3
4           gd               5
5         kjhk               1
6           fd               4
7          kui               6
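The same masks generalize to any target value; a hedged sketch wrapping them in a helper (the name renumber_duplicates is made up for illustration):
def renumber_duplicates(df, col, value):
    # rows holding the chosen value
    m1 = df[col].eq(value)
    # every repeated occurrence (first occurrences stay False)
    m2 = df[col].duplicated()
    # duplicates get max+1, max+2, ... in order of appearance
    df.loc[m1 & m2, col] = (
        m2.where(m1).cumsum()
          .add(df[col].max())
          .convert_dtypes()
    )
    return df
renumber_duplicates(df_test, 'cluster_number', 2)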
I would like to group by one column and take the element-wise sum of the lists in another column of my dataframe, but the following code does not work; the list length differs per user after I apply the sum function.
dt2 = dt.groupby(['user']).sum()
The data looks like this:
user vector
1 [1,2,3,4,5]
2 [1,3,2,4,5]
1 [3,3,3,4,4]
1 [1,2,2,1,1]
2 [1,1,2,0,0]
The expected table should be:
user vector
1    [5,7,8,9,10]
2    [2,4,4,4,5]
Here is one way: build a DataFrame from the vector column, group it on user and sum, then aggregate back to a list along axis=1:
(pd.DataFrame(dt['vector'].tolist())
   .groupby(dt['user']).sum()
   .agg(list, axis=1)
   .reset_index(name='vector'))
   user            vector
0     1  [5, 7, 8, 9, 10]
1     2   [2, 4, 4, 4, 5]
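An alternative that stays inside a single groupby, sketched here with numpy doing the element-wise sum; this is a suggestion of mine, not code from the question, run on the posted data:
import numpy as np
import pandas as pd
dt = pd.DataFrame({'user': [1, 2, 1, 1, 2],
                   'vector': [[1, 2, 3, 4, 5], [1, 3, 2, 4, 5],
                              [3, 3, 3, 4, 4], [1, 2, 2, 1, 1],
                              [1, 1, 2, 0, 0]]})
# sum the lists element-wise within each user group
dt2 = (dt.groupby('user')['vector']
         .apply(lambda lists: np.sum(lists.tolist(), axis=0).tolist())
         .reset_index())
print(dt2)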
Why am I getting this error message?
Here are the variables used in my code; the columns they select are all dummy variables:
country_cols = wine_dummies.loc[:, 'country_Chile':'country_US']
variety_cols = wine_dummies.loc[:, 'variety_Cabernet Sauvignon':'variety_Zinfandel']
pricecat_cols = wine_dummies.loc[:, 'price_category_low':]
Here is the code that is throwing the error (it fails at X = wine[feature_cols_1]):
feature_cols_1 = ['price', country_cols, variety_cols, 'year']
feature_cols_2 = [pricecat_cols, country_cols, variety_cols, 'year']
X = wine[feature_cols_1]  # <--- ERROR
y = wine['points']
Here is the head of my dataframe:
country designation points price province variety year ... variety_Riesling variety_Rosé variety_Sangiovese variety_Sauvignon Blanc variety_Syrah variety_Tempranillo variety_White Blend variety_Zinfandel price_category_low price_category_med
Portugal Avidagos 87 15.0 Douro Portuguese Red 2011.0 ... 0 0 0 0 0 0 0 0 1 0
^ each dummy variable (0s and 1s) after "..." corresponds to each column after "..."
This is admittedly cumbersome, so it's only worthwhile if you have lots of columns between 'country_Chile' and 'country_US'. In the example below, I deliberately leave the 'a' column out of middle_columns by slicing on the column positions.
The code uses pandas.Index.get_loc to find the positions of the start and end columns, slices the full column index with them, and then unpacks that slice with * into the final list of column names.
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5],
                   'd': [4, 5, 6], 'wine': ['happy', 'drunk', 'sad'],
                   'year': [2002, 2003, 2019]})
middle_columns = df.columns[df.columns.get_loc('b'):df.columns.get_loc('d')+1]
all_cols = ['wine', *middle_columns, 'year']
X = df[all_cols]
The reason your current approach doesn't work is that feature_cols_1 = ['price', country_cols, variety_cols, 'year'] builds a list that mixes strings with whole DataFrames, which you then try to use as a column indexer on a second dataframe.
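Applied to the question's frame, the fix would look roughly like this (a sketch, assuming wine carries the dummy columns shown in the posted head):
country_cols = wine.columns[wine.columns.get_loc('country_Chile'):
                            wine.columns.get_loc('country_US') + 1]
variety_cols = wine.columns[wine.columns.get_loc('variety_Cabernet Sauvignon'):
                            wine.columns.get_loc('variety_Zinfandel') + 1]
# every element is now a plain column name, so indexing works
feature_cols_1 = ['price', *country_cols, *variety_cols, 'year']
X = wine[feature_cols_1]
y = wine['points']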
I have a pandas dataframe that is used for a heatmap. I would like the minimum value of each column to lie along the diagonal.
I've sorted the columns using
data = data.loc[:, data.min().sort_values().index]
This works. Now I just need to sort the rows so that the min value of the first column lands in row 0, the min of the second column in row 1, and so on.
Example
import seaborn as sns
import pandas as pd
data = [[5, 1, 9],
        [7, 8, 6],
        [5, 3, 2]]
data = pd.DataFrame(data)
# sns.heatmap(data)
data = data.loc[:, data.min().sort_values().index]
# sns.heatmap(data)  # gives the result of step 1
# Step 1: columns sorted by min value: 1, 2, 5
data = [[1, 9, 5],
        [8, 6, 7],
        [3, 2, 5]]
data = pd.DataFrame(data)
# sns.heatmap(data)
# How do I perform step 2, maintaining the column order?
# Step 2: rows sorted by min value: 1, 2, 6
data = [[1, 9, 5],
        [3, 2, 5],
        [8, 6, 7]]
data = pd.DataFrame(data)
sns.heatmap(data)
Is this possible in pandas in a clever way?
Setup
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
You can accomplish this by using argsort of the diagonal elements of your sorted DataFrame, then indexing the DataFrame using these values.
Step 1
Use your initial sort:
data = data.loc[:, data.min().sort_values().index]
   1  2  0
0  1  9  5
1  8  6  7
2  3  2  5
Step 2
Use np.argsort with np.diag (numpy must be imported as np):
import numpy as np
data.iloc[np.argsort(np.diag(data))]
   1  2  0
0  1  9  5
2  3  2  5
1  8  6  7
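Putting both steps together, a minimal end-to-end sketch of this approach:
import numpy as np
import pandas as pd
data = pd.DataFrame([[5, 1, 9], [7, 8, 6], [5, 3, 2]])
# step 1: order the columns by their minimum
data = data.loc[:, data.min().sort_values().index]
# step 2: order the rows by the diagonal so each column's min lands on it
data = data.iloc[np.argsort(np.diag(data))]
print(data)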
I'm not quite sure, but you've already done the following to sort the columns:
data = data.loc[:, data.min().sort_values().index]
The same trick can also be applied to sort the rows:
data = data.loc[data.min(axis=1).sort_values().index, :]
To move some values around so that the min value within each column is placed along the diagonal, you could try something like this:
for i in range(len(data)):
    # idxmin returns a label; with the default RangeIndex it matches the iloc position
    min_index = data.iloc[:, i].idxmin()
    if data.iloc[i, i] != data.iloc[min_index, i]:
        # swap the column minimum onto the diagonal
        data.iloc[i, i], data.iloc[min_index, i] = data.iloc[min_index, i], data.iloc[i, i]
Basically just swap the min with the diagonal.
I have a tab-delimited file and I want to read all the column headers, but the last two data columns share a single header.
Example first rows of the file:
xx yy zz ii jj
5 5 10 2 a d
In my example, the shared header is jj and its values are a and d, spanning two tab-separated fields. I tried genfromtxt but it gives:
ValueError: Some errors were detected !
Line #2 (got 6 columns instead of 5).
I wish I could use numpy's genfromtxt because of my existing code, but any method will do at this point; genfromtxt seems difficult to use here.
I expect a tuple per row. At one point I got (5, 5, 10, 2, b'a') for the first row, but I would like to get (5, 5, 10, 2, ['a', 'd']) if possible.
Thank you
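One workable sketch, using pandas rather than genfromtxt (the filename data.tsv and the placeholder column name jj2 are assumptions):
import pandas as pd
# skip the short header row and supply six names, one per data field
df = pd.read_csv('data.tsv', sep='\t', skiprows=1, header=None,
                 names=['xx', 'yy', 'zz', 'ii', 'jj', 'jj2'])
# collapse the last two fields into a single list-valued 'jj' column
df['jj'] = df[['jj', 'jj2']].values.tolist()
df = df.drop(columns='jj2')
rows = list(df.itertuples(index=False, name=None))
# e.g. [(5, 5, 10, 2, ['a', 'd'])]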