I have this task:
Create a dictionary and name it cc. The dictionary has two keys: data and target, and the corresponding key values are NumPy arrays. For target, the key value is an array of values from the encoded column satisfaction of df. For data, the key value is an array of sub arrays and each sub array is an observation of one sample across the features in df2.
I am not sure if my code reflects the task, could anyone please take a look?
cc = {
"data": df2.iloc[:, :-1].to_numpy(),
"target": df["satisfaction_satisfied"].to_numpy(),
}
I am not sure I reference df and df2 correctly, and I am unsure whether my iloc indexing corresponds to what's asked.
Any help will be very much appreciated:)
Thank you!
M.
I tested this with pandas in Python 3.
import pandas as pd
data = [[1,2,3],
[2,3,4],
[3,4,5]]
# col0 col1 col2
# row 0: 1 2 3
# row 1: 2 3 4
# row 2: 3 4 5
df = pd.DataFrame(data, index=['row0', 'row1', 'row2'],
columns=['col0', 'col1', 'col2'])
#iloc[rows, cols]
data = df.iloc[:, :-1].to_numpy()
# meaning: All rows and All cols except last col.
print(data)
As you can see from the code, the first parameter of iloc selects which rows are indexed, and the second selects which columns are indexed.
The first parameter : means "index all rows".
The second parameter :-1 means "index all columns except the last column".
So iloc selects the elements that satisfy both conditions.
You may also want to read up on Python slicing. Here is a good reference:
https://railsware.com/blog/python-for-machine-learning-indexing-and-slicing-for-lists-tuples-strings-and-other-sequential-types/
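For completeness, here is a minimal sketch of how cc could be assembled, assuming df2 holds only the feature columns and df already contains the encoded column satisfaction_satisfied (the frames below are hypothetical stand-ins for your real data). If df2 contains nothing but features, df2.to_numpy() is enough; your iloc[:, :-1] would only be needed if the last column of df2 is not a feature.
import pandas as pd

# Hypothetical stand-ins for df and df2 (names and values are assumptions)
df = pd.DataFrame({"satisfaction_satisfied": [1, 0, 1]})
df2 = pd.DataFrame({"age": [25, 40, 31], "distance": [100, 250, 80]})

cc = {
    "data": df2.to_numpy(),                             # one sub-array per sample across the df2 features
    "target": df["satisfaction_satisfied"].to_numpy(),  # encoded satisfaction values
}
print(cc["data"].shape, cc["target"].shape)  # (3, 2) (3,)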
Related
I want to replace df2 elements with df1 elements according to this rule: if the first-row, first-column value of df2 is '1', then the first-row, first-column element of df1 goes there; if it is zero, '0' stays. If the last-column element of any row of df2 is '1', then the last-column element of that row of df1 goes there, and so on.
So I want to replace every '1' element of df2 with a df1 element according to that rule. df3 is going to look like:
abcde0000;
abcd0e000;
abcd00e00;...
We can use the apply function for this, but first you have to concat both frames along axis 1. I am using dummy tables with just three rows; the same approach works for any number of rows.
import pandas as pd
import numpy as np
# Dummy data
df1 = pd.DataFrame([['a','b','c','d','e'],['a','b','c','d','e'],['a','b','c','d','e']])
df2 = pd.DataFrame([[1,1,1,1,1,0,0,0,0],[1,1,1,1,0,1,0,0,0],[1,1,1,1,0,0,1,0,0]])
# Display the dataframes; display() may not work in plain Python scripts. I used it in Jupyter notebooks
display(df1)
display(df2)
# Concat DFs
df3 = pd.concat([df1,df2],axis=1)
display(df3)
# Define the replacement function: walk through the flag values and
# substitute each 1 with the next letter, leaving the 0s in place
def replace(letters, indexes):
    seek = 0
    for i in range(len(indexes)):
        if indexes[i] == 1:
            indexes[i] = letters[seek]
            seek += 1
    return ''.join(list(map(str, indexes)))

# Apply the replace function row-wise: the first 5 values are the letters,
# the remaining values are the 0/1 flags
df4 = df3.apply(lambda x: replace(x[:5], x[5:]), axis=1)
# Display df4
display(df4)
The result is
0 abcde0000
1 abcd0e000
2 abcd00e00
dtype: object
I think this will solve your problem.
So basically I have a bunch of dataframes with about 100 columns and 500-3000 rows, filled with different string values. Now I want to search the entire dataframe for, let's say, the string "Airbag" and delete every row which doesn't contain this string. I was able to do this with the following code:
df = df[df.apply(lambda row: row.astype(str).str.contains('Airbag', regex=False).any(), axis=1)]
This works exactly like I want, but it is way too slow. So I tried to find a way to do it with vectorization or a list comprehension, but I wasn't able to do it or find any example code on the internet. So my question is whether it is possible to speed this process up or not.
Example Dataframe:
df = pd.DataFrame({'col1': ['Airbag_101', 'Distance_xy', 'Sensor_2'], 'col2': ['String1', 'String2', 'String3'], 'col3': ['Tires', 'Wheel_Airbag', 'Antenna']})
Let's start from this dataframe with random strings and numbers in COLUMN:
import numpy as np
import pandas as pd
np.random.seed(0)
strings = np.apply_along_axis(''.join, 1, np.random.choice(list('ABCD'), size=(100, 5)))
junk = list(range(10))
col = list(strings)+junk
np.random.shuffle(col)
df = pd.DataFrame({'COLUMN': col})
>>> df.head()
COLUMN
0 BBCAA
1 6
2 ADDDA
3 DCABB
4 ADABC
You can simply apply pandas.Series.str.contains. You need to use fillna to account for the non-string elements:
>>> df[df['COLUMN'].str.contains('ABC').fillna(False)]
COLUMN
4 ADABC
31 BDABC
40 BABCB
88 AABCA
101 ABCBB
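Equivalently, the na parameter of str.contains can replace the fillna step; a small variation with the same result:
# non-string elements produce NaN in str.contains; na=False turns them into False
df[df['COLUMN'].str.contains('ABC', na=False)]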
Testing all columns:
Here is an alternative using a good old custom function. One could think it should be slower than apply/transform, but it is actually faster when you have a lot of columns and a decent frequency of the searched term (tested on the example dataframe, a 3x3 with no match, and on 3x3000 dataframes with and without matches):
def has_match(series):
    # return True as soon as any cell in the row contains 'Airbag'
    for s in series:
        if 'Airbag' in s:
            return True
    return False

df[df.apply(has_match, axis=1)]
Update (exact match)
Since it looks like you actually want an exact match, test with eq() instead of str.contains(). Then use boolean indexing with loc:
df.loc[df.eq('Airbag').any(axis=1)]
Original (substring)
Test for the string with applymap() and turn it into a row mask using any(axis=1):
df[df.applymap(lambda x: 'Airbag' in x).any(axis=1)]
# col1 col2 col3
# 0 Airbag_101 String1 Tires
# 1 Distance_xy String2 Wheel_Airbag
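As a side note, in recent pandas (2.1+) applymap is deprecated in favor of DataFrame.map; the element-wise test itself is unchanged (this assumes every cell is a string, as in the example):
# same row mask with DataFrame.map (pandas >= 2.1)
df[df.map(lambda x: 'Airbag' in x).any(axis=1)]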
As mozway said, "optimal" depends on the data. These are some timing plots for reference.
[Timing plots omitted: timings vs. number of rows (fixed at 3 columns), and timings vs. number of columns (fixed at 3,000 rows).]
OK, I was able to speed it up with the help of NumPy arrays, but thanks for the help :D
master_index = []
for column in df.columns:
    np_array = df[column].values
    index = np.where(np_array == 'Airbag')  # positions of exact matches in this column
    master_index.append(index)
print(df.iloc[master_index[1][0]])
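For reference, the same exact-match idea can be written without the per-column loop; a minimal sketch, assuming the example df from above:
import numpy as np
# boolean mask of rows where any cell equals 'Airbag' exactly
mask = (df.to_numpy() == 'Airbag').any(axis=1)
print(df[mask])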
I'm trying to filter a dataframe by the first row, but can't seem to figure out how to do it.
Here's a sample version of the data I'm working with:
In [11]: df = pd.DataFrame(
...: [['Open-Ended Response', 'Open-Ended Response', 'Response', 'Response'], [1, 2, 3, 4]],
...: columns=list('ABCD'),
...: )
In [12]: df
Out[12]:
A B C D
0 Open-Ended Response Open-Ended Response Response Response
1 1 2 3 4
What I want to do is filter for all columns whose value in the first non-header row starts with "Response". So in this case, just have the last two columns in their own dataframe.
I can easily filter the header with something like this:
respo = [col for col in df if col.startswith('Response')]
But it doesn't seem to work on the first non-header row. Importantly, I need to keep the current header after I filter.
Thank you.
First step is to select the values of the first row:
df.iloc[0] # selects the values in the first row
Then, use the pandas .str string-accessor methods to work with the data values rather than the column names:
df.iloc[0].str.startswith('Response') # Test the result of the above line
This will give you a Series with True/False values indexed by column name. Finally, use this to select the columns from your dataframe based on the matched labels:
df.loc[:, df.iloc[0].str.startswith('Response')] # Select columns based on the test
This should do the trick!
See pandas's docs on Indexing and Selecting Data and the StringAccessor methods for more help.
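Putting it together with the sample frame from the question, a quick usage sketch:
import pandas as pd

df = pd.DataFrame(
    [['Open-Ended Response', 'Open-Ended Response', 'Response', 'Response'], [1, 2, 3, 4]],
    columns=list('ABCD'),
)
respo = df.loc[:, df.iloc[0].str.startswith('Response')]
print(respo)  # keeps columns C and D, header preserved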
I have a problem where I produce a pandas dataframe by concatenating along the row axis (stacking vertically).
Each of the constituent dataframes has an autogenerated index (ascending numbers).
After concatenation, my index is screwed up: it counts up to n (where n is the shape[0] of the corresponding dataframe), and restarts at zero at the next dataframe.
I am trying to "re-calculate the index, given the current order", or "re-index" (or so I thought). Turns out that isn't exactly what DataFrame.reindex seems to be doing.
Here is what I tried to do:
train_df = pd.concat(train_class_df_list)
train_df = train_df.reindex(index=[i for i in range(train_df.shape[0])])
It failed with "cannot reindex from a duplicate axis." I don't want to change the order of my data... just need to delete the old index and set up a new one, with the order of rows preserved.
If your index is autogenerated and you don't want to keep it, you can use the ignore_index option.
train_df = pd.concat(train_class_df_list, ignore_index=True)
This will autogenerate a new index for you, and my guess is that this is exactly what you are after.
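A minimal illustration of the ignore_index behaviour:
import pandas as pd

parts = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]
train_df = pd.concat(parts, ignore_index=True)
print(train_df.index)  # RangeIndex(start=0, stop=4, step=1)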
After vertical concatenation, if you get an index of [0, n) followed by [0, m), all you need to do is call reset_index:
train_df.reset_index(drop=True)
(you can do this in place using inplace=True).
>>> import pandas as pd
>>> pd.concat([
...     pd.DataFrame({'a': [1, 2]}),
...     pd.DataFrame({'a': [1, 2]})]).reset_index(drop=True)
a
0 1
1 2
2 1
3 2
This should work:
train_df.reset_index(inplace=True, drop=True)
Set drop to True to avoid an additional column in your dataframe.
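For illustration, without drop=True the old index is kept as an extra column:
import pandas as pd

pd.concat([pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]).reset_index()
#    index  a
# 0      0  1
# 1      1  2
# 2      0  3
# 3      1  4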
I have a file where the separator (delimiter) is ';'. I read that file into a pandas dataframe df. Now, I want to select some rows from df using a criterion based on column c of df. The format of the data in column c is as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', then I want to select those rows from df, where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know I can write a for loop and iterate over each object in the dataframe, but I was wondering if I could do something in a more compact and efficient way.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
dat = df.iloc[:, c].str.split('|')  # .ix is removed in modern pandas; .iloc gives the same positional selection here
But I do not know how to index dat. dat is a pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But, it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance.
Here is an approach:
Data
df = pd.DataFrame({'c':['history|science','science|chemistry','geography|science','biology|IT'],'col2':range(4)})
Out[433]:
c col2
0 history|science 0
1 science|chemistry 1
2 geography|science 2
3 biology|IT 3
lst = ['geography', 'biology','IT']
Resolution
You can use a list comprehension:
df.loc[pd.Series([x.split('|')[0] not in lst for x in df.c.tolist()])]
Out[444]:
c col2
0 history|science 0
1 science|chemistry 1
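The same mask can also be built with the pandas string methods instead of a plain list comprehension (a sketch on the same data as above):
# first token of each value in c, tested against lst, then negated
mask = ~df['c'].str.split('|').str[0].isin(lst)
df[mask]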