What is the simplest way to get the sub dataframe? - python

x is a dataframe whose index is code and contains a column pe.
>>> x
      pe
code
01    15
02    30
03    70
04     6
05    40
06    34
07    25
08    65
10    45
12    55
13    32
Get the index of x:
x.index
Index(['01', '02', '03', '04', '05', '06', '07',
       '08', '10', '12', '13'],
      dtype='object', name='code', length=11)
I want to get a sub dataframe whose index is ['01','04','08','10','12'].
x_sub
      pe
code
01    15
04     6
08    65
10    45
12    55
What is the simplest way to get the sub dataframe?

Use loc
x_sub = x.loc[['01','04','08','10','12']]
Or, if your index holds integers (note that leading-zero literals like 01 are a syntax error in Python 3):
x_sub = x.loc[[1, 4, 8, 10, 12]]
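For reference, a minimal runnable sketch that rebuilds the frame shown above and applies .loc:
import pandas as pd

x = pd.DataFrame(
    {'pe': [15, 30, 70, 6, 40, 34, 25, 65, 45, 55, 32]},
    index=pd.Index(['01', '02', '03', '04', '05', '06',
                    '07', '08', '10', '12', '13'], name='code'),
)
# select rows by a list of index labels
x_sub = x.loc[['01', '04', '08', '10', '12']]
print(x_sub)
If some labels might be absent, x.loc[x.index.intersection(labels)] avoids the KeyError that .loc raises for missing labels.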

Regex find greedy and lazy matches and all in-between

I have a sequence like '01 02 09 02 09 02 03 05 09 08 09 ', and I want to find a sequence that starts with 01 and ends with 09, with one to nine double-digit numbers (such as 02, 03, 04, etc.) in between. This is what I have tried so far.
I'm using \w{2}\s (\w{2} to match the two digits, and \s for the whitespace). This can occur one to nine times, which leads to (\w{2}\s){1,9}. The whole regex becomes
(01\s(\w{2}\s){1,9}09\s). This returns the following result:
<regex.Match object; span=(0, 33), match='01 02 09 02 09 02 03 05 09 08 09 '>
If I use the lazy quantifier ? instead (i.e. {1,9}?), it returns the following result:
<regex.Match object; span=(0, 9), match='01 02 09 '>
How can I obtain the results in-between too? The desired result would include all of the following:
<regex.Match object; span=(0, 9), match='01 02 09 '>
<regex.Match object; span=(0, 15), match='01 02 09 02 09 '>
<regex.Match object; span=(0, 27), match='01 02 09 02 09 02 03 05 09 '>
<regex.Match object; span=(0, 33), match='01 02 09 02 09 02 03 05 09 08 09 '>
You can extract these strings using:
import re

s = "01 02 09 02 09 02 03 05 09 08 09 "
m = re.search(r'01(?:\s\w{2})+\s09', s)
if m:
    print([x[::-1] for x in re.findall(r'(?=\b(90.*?10$))', m.group()[::-1])])
    # => ['01 02 09 02 09 02 03 05 09 08 09', '01 02 09 02 09 02 03 05 09', '01 02 09 02 09', '01 02 09']
With the 01(?:\s\w{2})+\s09 pattern and re.search, you extract the substring from 01 to the last 09 (with any space-separated two-word-character chunks in between).
The second step - [x[::-1] for x in re.findall(r'(?=\b(90.*?10$))', m.group()[::-1])] - is to reverse the string and the pattern to get all overlapping matches from 09 to 01 and then reverse them to get final strings.
You may also reverse the final list if you add [::-1] at the end of the list comprehension: print( [x[::-1] for x in re.findall(r'(?=\b(90.*?10$))', m.group()[::-1])][::-1] ).
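A less regex-heavy variant of the same idea, assuming the input from the question, is to take the single greedy match and then cut it at every 09 chunk instead of reversing:
import re

s = "01 02 09 02 09 02 03 05 09 08 09 "
m = re.search(r'01(?:\s\w{2})+\s09', s)
if m:
    text = m.group()
    # every prefix of the full match that ends at a "09" chunk
    print([text[:o.end()] for o in re.finditer(r'\b09\b', text)])
    # => ['01 02 09', '01 02 09 02 09', '01 02 09 02 09 02 03 05 09', '01 02 09 02 09 02 03 05 09 08 09']
This yields the matches shortest first; reverse the list if you want the longest first.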
Here is a non-regex answer that post-processes the matching elements (note that Python strings have strip(), not trim()):
s = '01 02 09 02 09 02 03 05 09 08 09 '.strip().split()
assert s[0] == '01' \
    and s[-1] == '09' \
    and (3 <= len(s) <= 11) \
    and len(s) == len([elem for elem in s if len(elem) == 2 and elem.isdigit() and elem[0] == '0'])
[s[:i+1] for i in sorted({s.index('09', i) for i in range(2, len(s))})]
# [
# ['01', '02', '09'],
# ['01', '02', '09', '02', '09'],
# ['01', '02', '09', '02', '09', '02', '03', '05', '09'],
# ['01', '02', '09', '02', '09', '02', '03', '05', '09', '08', '09']
# ]
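If you want strings like the regex approach returns, rather than lists, joining each prefix is enough; a small follow-up to the snippet above:
print([' '.join(p) for p in [s[:i+1] for i in sorted({s.index('09', i) for i in range(2, len(s))})]])
# ['01 02 09', '01 02 09 02 09', '01 02 09 02 09 02 03 05 09', '01 02 09 02 09 02 03 05 09 08 09']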

unique() in python doesn't print all values?

So I am trying to get all unique values in a dataframe.
This is the code:
for i in df.columns.tolist():
    print(f"{i}")
    print(df[i].unique())
This is the result I am getting:
customerID
['7590-VHVEG' '5575-GNVDE' '3668-QPYBK' ... '4801-JZAZL' '8361-LTMKD' '3186-AJIEK']
gender
['Female' 'Male']
SeniorCitizen
[0 1]
Partner
['Yes' 'No']
Dependents
['No' 'Yes']
tenure
[ 1 34 2 45 8 22 10 28 62 13 16 58 49 25 69 52 71 21 12 30 47 72 17 27
5 46 11 70 63 43 15 60 18 66 9 3 31 50 64 56 7 42 35 48 29 65 38 68
32 55 37 36 41 6 4 33 67 23 57 61 14 20 53 40 59 24 44 19 54 51 26 0
39]
PhoneService
['No' 'Yes']
MultipleLines
['No phone service' 'No' 'Yes']
InternetService
['DSL' 'Fiber optic' 'No']
OnlineSecurity
['No' 'Yes' 'No internet service']
OnlineBackup
['Yes' 'No' 'No internet service']
DeviceProtection
['No' 'Yes' 'No internet service']
TechSupport
['No' 'Yes' 'No internet service']
StreamingTV
['No' 'Yes' 'No internet service']
StreamingMovies
['No' 'Yes' 'No internet service']
Contract
['Month-to-month' 'One year' 'Two year']
PaperlessBilling
['Yes' 'No']
PaymentMethod
['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
'Credit card (automatic)']
MonthlyCharges
[29.85 56.95 53.85 ... 63.1 44.2 78.7 ]
TotalCharges
['29.85' '1889.5' '108.15' ... '346.45' '306.6' '6844.5']
Churn
['No' 'Yes']
Why is it skipping most values in MonthlyCharges and TotalCharges, and how do I deal with it?
Thank you
unique() returns a plain NumPy array, and NumPy has a setting to not show you all the values when an array exceeds some length: by default, any array with more than 1,000 elements is summarized as the first few values, an ellipsis (...), and the last few, which is what is happening here. That is why tenure (73 distinct values) prints in full while customerID, MonthlyCharges, and TotalCharges are cut short. You can change that by executing the following:
import sys
import numpy as np
np.set_printoptions(threshold=sys.maxsize)
(pd.set_option("display.max_rows", None) would not help here, because it only affects how pandas objects are displayed, not the NumPy arrays that unique() returns.)
I would anyhow not do that: if a column has thousands of unique values, printing them all can be very slow and hard to read.
Alternatively, you could iterate over the unique values and print them one by one:
for value in df[i].unique():
    print(value)
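If you only want the full output inside this one loop, NumPy (1.15+) also offers a context manager, so the global print settings stay untouched; a minimal sketch assuming the df from the question:
import sys
import numpy as np

for i in df.columns:
    print(f"{i}")
    # temporarily disable array summarization for this print only
    with np.printoptions(threshold=sys.maxsize):
        print(df[i].unique())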

How to find missing values in groups

I have a large dataset of restaurant inspections. One inspection will trigger several code violations. I want to find out if any inspections don't contain a specific code violation (for evidence of pests). I have the data in a Pandas data frame.
I tried separating the data frame based on whether the violation for pests was included. And I tried to group by the violation code. Can't seem to figure it out.
With the pest violation as "3A", data could look like:
import pandas as pd
df = pd.DataFrame(data={
    'visit': ['1', '1', '1', '2', '2', '3', '3'],
    'violation': ['3A', '4B', '5C', '3A', '6C', '7D', '8E']
})
  visit violation
0     1        3A
1     1        4B
2     1        5C
3     2        3A
4     2        6C
5     3        7D
6     3        8E
I'd like to end up with this:
result = pd.DataFrame(data={
    'visit': ['3', '3'], 'violation': ['7D', '8E']
})
Out[15]:
  visit violation
0     3        7D
1     3        8E
Try using:
value = '3A'
print(df.groupby('visit').filter(lambda x: all(value != i for i in x['violation'])))
Output:
  violation visit
5        7D     3
6        8E     3
Another approach would be:
violation_visits = df[df['violation']=='3A']['visit'].unique()
df[~df['visit'].isin(violation_visits.tolist())]
Out[16]:
  visit violation
5     3        7D
6     3        8E
One way, using filter:
df.groupby('visit').filter(lambda x: ~x['violation'].eq('3A').any())
  visit violation
5     3        7D
6     3        8E
Another way, using transform:
df[df.violation.ne('3A').groupby(df.visit).transform('all')]
  visit violation
5     3        7D
6     3        8E
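If all you need is the visit IDs that never logged the pest violation, rather than their full rows, a plain set difference works too; a minimal sketch using the df from the question:
# visits that contain at least one '3A' violation
flagged = set(df.loc[df['violation'] == '3A', 'visit'])
# visits that never triggered '3A'
print(sorted(set(df['visit']) - flagged))
# ['3']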

How to create a customized multi-index with different sub column headings using pandas in a dataframe

I have a dataset that contains multi-index columns with the first level consisting of a year divided into four quarters. How do I structure the index so as to have 4 sets of months under each quarter?
I found the following piece of code on Stack Overflow:
index = pd.MultiIndex.from_product([['S1', 'S2'], ['Start', 'Stop']])
print(pd.DataFrame([pd.DataFrame(dic).unstack().values], columns=index))
that gave the following output:
           S1                      S2
        Start        Stop       Start        Stop
0  2013-11-12  2013-11-13  2013-11-15  2013-11-17
However, it couldn't solve my requirement of having different sets of months under each quarter of the year.
My data looks like this:
                            2015
              Q1           Q2           Q3           Q4
Country  Jan Feb Mar  Apr May Jun  Jul Aug Sep  Oct Nov Dec
India     45  54  34   34  45  45   43  45  67   45  56  56
Canada    44  34  12   32  35  45   43  41  60   43  55  21
I wish to input the same structure of the dataset into pandas with the specific set of months under each quarter. How should I go about this?
You can also create a MultiIndex in a few other ways. One of these, which is useful if you have a complicated structure, is to construct it from an explicit set of tuples where each tuple is one hierarchical column. Below I first create all of the tuples that you need of the form (year, quarter, month), make a MultiIndex from these, then assign that as the columns of the dataframe.
import pandas as pd

year = 2015
months = [
    ("Jan", "Feb", "Mar"),
    ("Apr", "May", "Jun"),
    ("Jul", "Aug", "Sep"),
    ("Oct", "Nov", "Dec"),
]
tuples = [(year, f"Q{i + 1}", month) for i in range(4) for month in months[i]]
multi_index = pd.MultiIndex.from_tuples(tuples)
data = [
    [45, 54, 34, 34, 45, 45, 43, 45, 67, 45, 56, 56],
    [44, 34, 12, 32, 35, 45, 43, 41, 60, 43, 55, 21],
]
df = pd.DataFrame(data, index=["India", "Canada"], columns=multi_index)
df
#              2015
#                Q1           Q2           Q3           Q4
#          Jan Feb Mar  Apr May Jun  Jul Aug Sep  Oct Nov Dec
# India     45  54  34   34  45  45   43  45  67   45  56  56
# Canada    44  34  12   32  35  45   43  41  60   43  55  21
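One benefit of this layout is that selection works level by level; for example, with the df built above:
df[2015]                               # drop the year level, keep quarters and months
df[(2015, "Q1")]                       # just the Jan/Feb/Mar columns
df.loc["India", (2015, "Q2", "May")]   # a single cell: 45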

Python sorted to arrange according to decimal format

I would like to arrange the list according to the numbering as follows:
01
02
03
001
002
However, the default sorted() call gives me:
001
002
01
02
03
To make string length take precedence over lexicographic order, I believe you need to sort on two criteria:
nums = '03 01 002 02 001'
num_array = nums.split()
sorted_nums = sorted(num_array, key=lambda x: [len(x), x])
print(sorted_nums)
Output:
['01', '02', '03', '001', '002']
Or, double-sort the list (this works because Python's sort is stable, so the lexicographic order from the inner sort is preserved within each length):
>>> nums = '03 01 002 02 001'
>>> sorted(sorted(nums.split()), key=len)
['01', '02', '03', '001', '002']
Or sort numerically within each length:
s = '001 01 02 03 002'
l = s.split()
print(sorted(l, key=lambda e: (len(e), int(e))))
Output:
['01', '02', '03', '001', '002']
sorted_list = sorted(my_list, key=lambda x: (len(x), x))
This first compares the length of each string, and then the strings themselves, character by character.
xs = "01 02 03 001 002".split()
print(sorted(xs, key="{:<018s}".format))
# ['001', '002', '01', '02', '03']
This key pads each string on the right with zeros, so the values compare like decimal fractions, which is why '001' sorts before '01' here rather than after it as the question asks. Unless you are golfing, or your numbers have more than 18 digits, using two criteria in the key is probably the way to go.
