Data selection using pandas - python

I have a file whose separator (delimiter) is ';'. I read that file into a pandas DataFrame df. Now I want to select some rows from df using a criterion on column c of df. The data in column c are formatted as follows:
[0]science|time|boot
[1]history|abc|red
and so on...
I have another list of words L, which has values such as
[history, geography,....]
Now, if I split the text in column c on '|', I want to select those rows of df where the first word does not belong to L.
Therefore, in this example, I will select df[0] but will not choose df[1], since history is present in L and science is not.
I know I can write a for loop and iterate over each object in the dataframe, but I was wondering whether I could do something more compact and efficient.
For example, we can do:
df.loc[df['column_name'].isin(some_values)]
I have this:
df = pd.read_csv(path, sep=';', header=None, error_bad_lines=False, warn_bad_lines=False)
# note: newer pandas versions replace error_bad_lines/warn_bad_lines with on_bad_lines='skip'
dat = df.iloc[:, c].str.split('|')  # .ix is deprecated; positional .iloc works here
But I do not know how to index dat. dat is a pandas Series, as follows:
0 [science, time, boot]
1 [history, abc, red]
....
I tried indexing dat as follows:
dat.iloc[:][0]
But it gives the entire series instead of just the first element.
Any help would be appreciated.
Thank you in advance.

Here is an approach:
Data
df = pd.DataFrame({'c': ['history|science', 'science|chemistry', 'geography|science', 'biology|IT'],
                   'col2': range(4)})

Out[433]:
                   c  col2
0    history|science     0
1  science|chemistry     1
2  geography|science     2
3         biology|IT     3

lst = ['geography', 'biology', 'IT']
Resolution
You can use a list comprehension:

df.loc[[x.split('|')[0] not in lst for x in df['c']]]

Out[444]:
                   c  col2
0    history|science     0
1  science|chemistry     1
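
As an alternative, the same filter can be written with pandas string methods and isin (the pattern mentioned in the question). A minimal vectorized sketch, using the same df and lst as above:

df[~df['c'].str.split('|').str[0].isin(lst)]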

Related

Searching values in dataframe using re.search

I have a lot of datasets that I need to iterate through, searching for a specific value and returning some values based on the search outcome.
The datasets are stored in a dictionary:

key   type       size     Value
df1   DataFrame  (89,10)  Column names:
df2   DataFrame  (89,10)  Column names:
df3   DataFrame  (89,10)  Column names:
Each dataset looks something like this, and I am trying to check whether the value in column A, row 1 contains 035 and, if so, return column B:

           A    B    C
0  02 la 035  NaN  NaN
1     Target    7    5
2    Warning    3    6
If I try to search for a specific value in it, I get an error:
TypeError: first argument must be string or compiled pattern
I have tried:

something = []
for key in df:
    text = df[key]
    if re.search('035', text):
        something.append(text['B'])

Something = pd.concat([something], axis=1)
You can use .str.contains(): https://pandas.pydata.org/docs/reference/api/pandas.Series.str.contains.html
df = pd.DataFrame({
    "A": ["02 la 035", "Target", "Warning"],
    "B": [0, 7, 3],
    "C": [0, 5, 6]
})

df[df["A"].str.contains("035")]  # Returns first row only.

Also works for regex:

df[df["A"].str.contains("[TW]ar")]  # Returns last two rows.
EDIT to answer additional details.
The dataframe I set up looks like this:

           A  B  C
0  02 la 035  0  0
1     Target  7  5
2    Warning  3  6

To extract column B for those rows which match the last regex pattern I used, amend the last line of code to:

df[df["A"].str.contains("[TW]ar")]["B"]

This returns a Series. Output:

1    7
2    3
Name: B, dtype: int64
Edit 2: I see you want a dataframe at the end. Just use:
df[df["A"].str.contains("[TW]ar")]["B"].to_frame()

How do I properly use iloc indexing?

I have this task:
Create a dictionary and name it cc. The dictionary has two keys, data and target, and the corresponding key values are NumPy arrays. For target, the key value is an array of values from the encoded column satisfaction of df. For data, the key value is an array of sub-arrays, and each sub-array is an observation of one sample across the features in df2.
I am not sure if my code reflects the task, could anyone please take a look?
cc = {
    "data": df2.iloc[:, :-1].to_numpy(),
    "target": df["satisfaction_satisfied"].to_numpy(),
}
I am not sure I am referencing df and df2 correctly, and I have no idea whether my iloc indexing corresponds to what's asked.
Any help will be very much appreciated:)
Thank you!
M.
I tested this with pandas in Python 3.
import pandas as pd

data = [[1, 2, 3],
        [2, 3, 4],
        [3, 4, 5]]
#        col0  col1  col2
# row0:    1     2     3
# row1:    2     3     4
# row2:    3     4     5
df = pd.DataFrame(data, index=['row0', 'row1', 'row2'],
                  columns=['col0', 'col1', 'col2'])

# iloc[rows, cols]
data = df.iloc[:, :-1].to_numpy()
# meaning: all rows, and all columns except the last one.
print(data)
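
This prints the first two columns (col0 and col1) as a NumPy array:

[[1 2]
 [2 3]
 [3 4]]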
As you can see in the code, the first parameter of iloc says which rows are indexed, and the second says which columns are indexed.
The first parameter : means "index all rows".
The second parameter :-1 means "index all columns except the last one".
So iloc selects exactly the elements at the row and column positions you specify.
You want to know about Python slicing. I found a good reference:
https://railsware.com/blog/python-for-machine-learning-indexing-and-slicing-for-lists-tuples-strings-and-other-sequential-types/
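
For reference, here is a small self-contained sketch (the values are made up for illustration) of common iloc patterns:

import pandas as pd

df = pd.DataFrame({'col0': [1, 2, 3], 'col1': [2, 3, 4], 'col2': [3, 4, 5]})

df.iloc[0]               # first row, as a Series
df.iloc[:, 0]            # first column, as a Series
df.iloc[1:3, :2]         # rows 1-2 and the first two columns (end-exclusive, like list slicing)
df.iloc[[0, 2], [0, 2]]  # explicit lists of row and column positions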

how to efficiently decode arrays to columns in pandas dataframe

I have a function that produces results for every month of a year. In my dataframe I collect these results in several data columns, so I end up with a dataframe in which multiple columns hold arrays as values. Now I want to "pivot" those columns to have each value in its own column.
For example, if a row contains values [1,2,3,4,5,6,7,8,9,10,11,12] in column 'A', I want to have twelve columns 'A_01', 'A_02', ..., 'A_12' that each contain one value from the array.
My current code is this:

# create new columns
columns_to_add = []
for _, row in df[columns_to_process].iterrows():
    columns_to_add += [[row[name][offset] if isinstance(row[name], list) else row[name]
                        for offset in range(array_len) for name in columns_to_process]]
new_df = pd.DataFrame(columns_to_add,
                      columns=[name + '_' + str(offset + 1) for offset in range(array_len)
                               for name in columns_to_process],
                      index=df.index)  # make dataframe addendum

(note: some rows don't have any values, so I had to put the isinstance(..., list) condition into the iteration)
But this code is awfully slow. I believe there must be a much more elegant solution. Can you show me such a solution?
IIUC, use Series.tolist with the pandas.DataFrame constructor.
We'll use DataFrame.rename as well to fix your column name format.
# Setup
df = pd.DataFrame({'A': [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]]})

pd.DataFrame(df['A'].tolist()).rename(columns=lambda x: f'A_{x+1:0>2d}')

[out]

   A_01  A_02  A_03  A_04  A_05  A_06  A_07  A_08  A_09  A_10  A_11  A_12
0     1     2     3     4     5     6     7     8     9    10    11    12
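
If several columns hold arrays, as in the question, a hedged sketch of the same idea applied per column and concatenated (columns_to_process is the question's list of column names, and each value is assumed to be a list of array_len items):

expanded = pd.concat(
    [pd.DataFrame(df[col].tolist(), index=df.index)
       .rename(columns=lambda x, col=col: f'{col}_{x+1:0>2d}')
     for col in columns_to_process],
    axis=1)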

Python, Pandas: Using isin() like functionality but do not ignore duplicates in input list

I am trying to filter an input dataframe (df_in) against a list of indices. The indices list contains duplicates and I want my output df_out to contain all occurrences of a particular index. As expected, isin() gives me only a single entry for every index.
How can I keep the duplicates and get output similar to df_out_desired?
import pandas as pd
import numpy as np

df_in = pd.DataFrame(index=np.arange(4), data={'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
indices_needed_list = pd.Series([1, 2, 3, 3, 3])

# In the output df, I do not particularly care about the 'index' from df_in
df_out = df_in[df_in.index.isin(indices_needed_list)].reset_index()

# With isin, as expected, I only get a single entry for each occurrence of index in indices_needed_list
# What I am trying to get is an output df that repeats rows of df_in as often as they occur in indices_needed_list
temp = df_out[df_out['index'] == 3]

# This is what I would like to get
df_out_desired = pd.concat([df_out, df_out[df_out['index'] == 3], df_out[df_out['index'] == 3]])
Thanks!
Check reindex:

df_out_desired = df_in.reindex(indices_needed_list)

df_out_desired
Out[177]:
   A   B
1  2  20
2  3  30
3  4  40
3  4  40
3  4  40
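
For completeness, a small follow-up sketch under the same setup: label-based selection with .loc also repeats rows once per occurrence, and .reset_index() reproduces the shape of the asker's df_out:

df_in.loc[indices_needed_list].reset_index()

Note that .loc raises a KeyError if any requested label is missing, whereas reindex fills such rows with NaN.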

Add new column to Pandas DataFrame and fill with first word from another column from same df

I have a dataset of crimes reported by Gloucestershire Constabulary from 2011-16. It's a .csv file that I have imported to a Pandas dataframe. The data include a column stating the Lower Super Output Area (LSOA) in which the crime occurred, so for crimes in Tewkesbury, for instance, each record has the corresponding LSOA name, e.g. 'Tewkesbury 009D'; 'Tewkesbury 009E'.
I want to group these data by the town/city they relate to, e.g. 'Gloucester', 'Tewkesbury', ignoring the specific LSOAs within each conurbation. Ideally, I would append a new column to the dataframe, with just the place name copied across, and group on that. I am comfortable with how to do the grouping, just not the new column in the first place. Any advice on how to do this is gratefully received.
I am no Pandas expert, but I think you can use string slicing to strip off the last five characters (it supports regex too, if I recall correctly, so you can do a proper 'search' if required).
# x is the original dataframe; lsoa is the column containing the LSOA names
new_col = x.lsoa.str[:-5].rename('town')  # rename so the result doesn't duplicate the 'lsoa' column name
pd.concat([x, new_col], axis=1)
The .str accessor can be used to slice the strings in the lsoa column of the dataframe.
Something along these lines should work:
df['town'] = [x.split()[0] for x in df['LSOA']]
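
The same idea can be vectorized with pandas string methods (a sketch using the question's 'LSOA' column name):

df['town'] = df['LSOA'].str.split().str[0]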
You can use a regex to extract the city name from the DataFrame and then join the result to the original DataFrame. If your initial DataFrame is df:

df = pd.DataFrame(['Tewkesbury 009D', 'Tewkesbury 009E'], columns=['LSOA'])

In [2]: df
Out[2]:
              LSOA
0  Tewkesbury 009D
1  Tewkesbury 009E
Then you can extract the city name, and optionally the LSOA code, into a new DataFrame df_new:

df_new = df['LSOA'].str.extract(r'(\w*)\s(\d+\w*)', expand=True)

In [10]: df_new
Out[10]:
            0     1
0  Tewkesbury  009D
1  Tewkesbury  009E
If you want to discard the code and just keep the city name, remove the second capture group from the regex, i.e. r'(\w*)\s\d+\w*'. Now you can append this result to the original DataFrame:

In [11]: df.join(df_new)
Out[11]:
              LSOA           0     1
0  Tewkesbury 009D  Tewkesbury  009D
1  Tewkesbury 009E  Tewkesbury  009E
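
Optionally, named capture groups let str.extract label the new columns directly (a small refinement sketch; the group names town and code are illustrative):

df_new = df['LSOA'].str.extract(r'(?P<town>\w+)\s(?P<code>\d+\w*)')
df.join(df_new)
#               LSOA        town  code
# 0  Tewkesbury 009D  Tewkesbury  009D
# 1  Tewkesbury 009E  Tewkesbury  009E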
