How to add entries to a Pandas DataFrame? - python

Basically I have US census data that I have read into Pandas from a CSV file.
Now I have to write a function that finds counties in a specific manner (not going to explain that because it's not what the question is about) from the table I got from the CSV file, and returns those counties.
MY TRY:
What I did is create lists with the names of the columns the function has to return, then apply the specific condition in a for loop using an if-statement to read the entries of all required columns into their respective lists. I then created a new DataFrame and want to read the entries from the lists into this new DataFrame. I tried the same for loop to accomplish it, but all in vain; I tried making Series out of those lists and passing them as parameters to the DataFrame, still in vain; I made DataFrames out of those lists and tried using the append() function to concatenate them, but still in vain. Any help would be appreciated.
CODE:
#idxl = list()
#st = list()
#cty = list()
idx2 = 0
cty_reg = pd.DataFrame(columns = ('STNAME', 'CTYNAME'))
for idx in range(census_df['CTYNAME'].count()):
    if ((census_df.iloc[idx]['REGION'] == 1 or census_df.iloc[idx]['REGION'] == 2)
            and (census_df.iloc[idx]['POPESTIMATE2015'] > census_df.iloc[idx]['POPESTIMATE2014'])
            and census_df.loc[idx]['CTYNAME'].startswith('Washington')):
        #idxl.append(census_df.index[idx])
        #st.append(census_df.iloc[idx]['STNAME'])
        #cty.append(census_df.iloc[idx]['CTYNAME'])
        cty_reg.index[idx2] = census_df.index[idx]
        cty_reg.iloc[idx2]['STNAME'] = census_df.iloc[idx]['STNAME']
        cty_reg.iloc[idx2]['CTYNAME'] = census_df.iloc[idx]['CTYNAME']
        idx2 = idx2 + 1
cty_reg
SAMPLE TABLE (in place of the census table screenshot):
REGION STNAME CTYNAME
0 2 "Wisconsin" "Washington County"
1 2 "Alabama" "Washington County"
2 1 "Texas" "Atauga County"
3 0 "California" "Washington County"
SAMPLE OUTPUT:
STNAME CTYNAME
0 Wisconsin Washington County
1 Alabama Washington County
I am sorry for my limited knowledge of US states and counties; I just randomly put state and county names in the sample table to show what I want to get out of it. Thanks for the help in advance.

There are some columns missing from the source DataFrame posted in the OP. However, reading the loop, I don't think a loop is required at all. There are three filters required: on REGION, POPESTIMATE2015 vs POPESTIMATE2014, and CTYNAME. If I have understood the logic in the OP, this should be feasible without the loop.
Option 1 - original answer
print(df.loc[
    (df.REGION.isin([1, 2])) &
    (df.POPESTIMATE2015 > df.POPESTIMATE2014) &
    (df.CTYNAME.str.startswith('Washington')),
    ['REGION', 'STNAME', 'CTYNAME']])
Option 2 - using and with pd.eval
q = pd.eval("(df.REGION.isin([1, 2])) and "
            "(df.POPESTIMATE2015 > df.POPESTIMATE2014) and "
            "(df.CTYNAME.str.startswith('Washington'))",
            engine='python')
print(df.loc[q, ['REGION', 'STNAME', 'CTYNAME']])
Option 3 - using and with df.query
regions_list = [1,2]
dfq = df.query("(REGION == @regions_list) and "
               "(POPESTIMATE2015 > POPESTIMATE2014) and "
               "(CTYNAME.str.startswith('Washington'))",
               engine='python')
print(dfq[['REGION', 'STNAME', 'CTYNAME']])

If I'm reading the logic in your code right, you want to select rows according to the following conditions:
REGION should be 1 or 2
POPESTIMATE2015 > POPESTIMATE2014
CTYNAME needs to start with "Washington"
In general, Pandas makes it easy to select rows based on conditions without having to iterate over the dataframe:
df = census_df[
    ((census_df.REGION == 1) | (census_df.REGION == 2)) &
    (census_df.POPESTIMATE2015 > census_df.POPESTIMATE2014) &
    (census_df.CTYNAME.str.startswith('Washington'))
]

Assuming you're selecting rows that satisfy some criteria, let's just say that a function select(row) returns True if the row is selected and False if not. I won't infer what the criteria are because you specifically said they're not important.
And then you wanted the STNAME and CTYNAME of that row.
So here's what you would do:
your_new_df = census_df[census_df.apply(select, axis=1)]\
.apply(lambda x: x[['STNAME', 'CTYNAME']], axis=1)
This is the one-liner that will get you what you wanted, provided you write the select function that picks the rows.

Related

Extract values within the quote signs into two separate columns with python

How can I extract the values within the quote signs into two separate columns with Python? The dataframe is given below:
df = pd.DataFrame(["'FRH02';'29290'", "'FRH01';'29300'", "'FRT02';'29310'", "'FRH03';'29340'",
"'FRH05';'29350'", "'FRG02';'29360'"], columns = ['postcode'])
df
postcode
0 'FRH02';'29290'
1 'FRH01';'29300'
2 'FRT02';'29310'
3 'FRH03';'29340'
4 'FRH05';'29350'
5 'FRG02';'29360'
I would like to get an output like the one below:
postcode1 postcode2
FRH02 29290
FRH01 29300
FRT02 29310
FRH03 29340
FRH05 29350
FRG02 29360
I have tried several str.extract approaches but haven't been able to figure this out. Thanks in advance.
Finishing Quang Hoang's solution that he left in the comments:
import pandas as pd
df = pd.DataFrame(["'FRH02';'29290'",
"'FRH01';'29300'",
"'FRT02';'29310'",
"'FRH03';'29340'",
"'FRH05';'29350'",
"'FRG02';'29360'"],
columns = ['postcode'])
# Remove the quotes and split the strings, which results in a Series made up of 2-element lists
postcodes = df['postcode'].str.replace("'", "").str.split(';')
# Unpack the transposed postcodes into 2 new columns
df['postcode1'], df['postcode2'] = zip(*postcodes)
# Delete the original column
del df['postcode']
print(df)
Output:
postcode1 postcode2
0 FRH02 29290
1 FRH01 29300
2 FRT02 29310
3 FRH03 29340
4 FRH05 29350
5 FRG02 29360
You can use Series.str.split:
p1 = []
p2 = []
for row in df['postcode'].str.split(';'):
    # strip the surrounding quotes from each piece
    p1.append(row[0].strip("'"))
    p2.append(row[1].strip("'"))
df2 = pd.DataFrame()
df2["postcode1"] = p1
df2["postcode2"] = p2

Python - How to filter a column mixing int and string using DataFrame.query()?

I would like to filter records based on some criteria as below:
import pandas as pd
def doFilter(df, type, criteria):
    if type == "contain":
        return df[df.country.apply(str).str.contains(criteria)]
    elif type == "start":
        return df[df.remarks.apply(str).str.startswith(criteria)]

df = pd.read_csv("testdata.csv")
tempdf = doFilter(df, "contain", "U")
finaldf = doFilter(tempdf, "start", "123")
print(finaldf)
[testdata.csv]
id country remarks
1 UK 123
2 UK 123abc
3 US 456
4 JP 456
[Output]
id country remarks
0 1 UK 123
1 2 UK 123abc
As I need to filter dynamically by reading an input config for different criteria (e.g. startswith(), contains(), endswith(), substring() etc.), I would like to use DataFrame.query() so that I can filter everything in one go.
e.g.
I've tried many ways similar to below but no luck:
output=df.query('country.apply(str).str.contains("U") & remarks.apply(str).str.startswith("123")')
Any help would be greatly appreciated. Thank you so much.
Could not test since you provide no sample data.
This will allow you to read filters at runtime and apply them with pandas built-in string methods.
# better to cast all relevant columns to string while setting up
df.country = df.country.astype(str)
df.remarks = df.remarks.astype(str)

# get passed filters
filter1 = [  # (field, filtertype, value)
    ('country', 'contains', 'U'),
    ('remarks', 'startswith', '123'),
]

# create a collection of boolean masks
mask = []
for field, filtertype, value in filter1:
    if filtertype == 'contains':
        mask.append(df[field].str.contains(value))
    elif filtertype == 'startswith':
        mask.append(df[field].str.startswith(value))
    elif filtertype == 'endswith':
        mask.append(df[field].str.endswith(value))

# all these filters need to be combined with `and`, as in `condition1 & condition2`
# if you need to allow for `or` then the whole thing gets a lot more complicated
# as you also need to expect parentheses as in `(cond1 | cond2) & cond3`
# but it can be done with a parser

# allowing only `and` conditions
mask_combined = mask[0]
for m in mask[1:]:
    mask_combined &= m

# apply filter
df_final = df[mask_combined]

new column based on specific string info from two different columns Python Pandas

I have 2 columns. I need to take specific string information from each column and create a new column with new strings based on it.
In the column "Name" I have well names. I need to look at the last 4 characters of each well name, and if they contain "H", label that row "HZ" in a new column.
I need to do the same thing if the column "WELLTYPE" contains specific words.
Using the data analysis program Spotfire, I can do this all in one simple expression (see below).
case
When right([UWI],4)~="H" Then "HZ"
When [WELLTYPE]~="Horizontal" Then "HZ"
When [WELLTYPE]~="Deviated" Then "D"
When [WELLTYPE]~="Multilateral" Then "ML"
else "V"
End
What would be the best way to do this in Python Pandas?
Is there a simple, clean way to do this all at once, like in the Spotfire equation above?
Here is the data table with the two columns and my hoped-for outcome column (it did not copy very well into this); I also provide the code to create the table below.
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V
Here is the code to create the dataframe
# Dataframe with hopeful outcome
raw_data = {'Name': ['HH-001HST2', 'HH-001HST', 'HB-002H', 'HB-002', 'HB-002', 'HB-004', 'HB-005', 'BB-007'],
            'WELLTYPE': ['Oil Horizontal', 'Oil_Horizontal', 'Oil', 'Water_Deviated', 'Oil_Multilateral', 'Oil', 'Source', 'Water'],
            'What I Want': ['HZ', 'HZ', 'HZ', 'D', 'ML', 'V', 'V', 'V']}
df = pd.DataFrame(raw_data, columns=['Name', 'WELLTYPE', 'What I Want'])
df
Nested 'where' variant:
df['What I Want'] = np.where(df.Name.str[-4:].str.contains('H'), 'HZ',
                    np.where(df.WELLTYPE.str.contains('Horizontal'), 'HZ',
                    np.where(df.WELLTYPE.str.contains('Deviated'), 'D',
                    np.where(df.WELLTYPE.str.contains('Multilateral'), 'ML',
                             'V'))))
Using apply by row:
def criteria(row):
    # use >= 0 so a match at position 0 also counts
    if row.Name[-4:].find('H') >= 0:
        return 'HZ'
    elif row.WELLTYPE.find('Horizontal') >= 0:
        return 'HZ'
    elif row.WELLTYPE.find('Deviated') >= 0:
        return 'D'
    elif row.WELLTYPE.find('Multilateral') >= 0:
        return 'ML'
    else:
        return 'V'

df['want'] = df.apply(criteria, axis=1)
This feels more natural to me. Obviously subjective.
from_name = df.Name.str[-4:].str.contains('H').map({True: 'HZ'})
regex = '(Horizontal|Deviated|Multilateral)'
m = dict(Horizontal='HZ', Deviated='D', Multilateral='ML')
from_well = df.WELLTYPE.str.extract(regex, expand=False).map(m)
df['What I Want'] = from_name.fillna(from_well).fillna('V')
print(df)
Name WELLTYPE What I Want
0 HH-001HST2 Oil Horizontal HZ
1 HH-001HST Oil_Horizontal HZ
2 HB-002H Oil HZ
3 HB-002 Water_Deviated D
4 HB-002 Oil_Multilateral ML
5 HB-004 Oil V
6 HB-005 Source V
7 BB-007 Water V

Merging DataFrames on multiple conditions - not specifically on equal values

Firstly, sorry if this is a bit lengthy, but I wanted to fully describe what I am having problems with and what I have tried already.
I am trying to join (merge) together two dataframe objects on multiple conditions. I know how to do this if the conditions to be met are all 'equals' operators; however, I need to make use of LESS THAN and MORE THAN.
The dataframes represent genetic information: one is a list of mutations in the genome (referred to as SNPs) and the other provides information on the locations of the genes on the human genome. Performing df.head() on these returns the following:
SNP DataFrame (snp_df):
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 752721
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
This shows the SNP reference IDs and their locations. 'BP' stands for the 'Base-Pair' position.
Gene DataFrame (gene_df):
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
This dataframe shows the locations of all the genes of interest.
What I want to find out is all of the SNPs which fall within the gene regions in the genome, and discard those that are outside of these regions.
If I wanted to merge together two dataframes based on multiple (equals) conditions, I would do something like the following:
merged_df = pd.merge(snp_df, gene_df, on=['chromosome', 'other_columns'])
However, in this instance - I need to find the SNPs where the chromosome values match those in the Gene dataframe, and the BP value falls between 'chr_start' and 'chr_stop'. What makes this challenging is that these dataframes are quite large. In this current dataset the snp_df has 6795021 rows, and the gene_df has 34362.
I have tried to tackle this by looking at either chromosomes or genes separately. There are 22 different chromosome values (ints 1-22) as the sex chromosomes are not used. Both methods are taking an extremely long time. One uses the pandasql module, while the other approach is to loop through the separate genes.
SQL method
import pandas as pd
import pandasql as psql

pysqldf = lambda q: psql.sqldf(q, globals())

q = """
SELECT s.SNP, g.feature_id
FROM this_snp s INNER JOIN this_genes g
WHERE s.BP >= g.chr_start
AND s.BP <= g.chr_stop;
"""

all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]
    genic_snps = pysqldf(q)
    all_dfs.append(genic_snps)

all_genic_snps = pd.concat(all_dfs)
Gene iteration method
all_dfs = []
for line in gene_df.iterrows():
    info = line[1]  # Getting the Series object
    this_snp = snp_df.loc[(snp_df['chromosome'] == info['chromosome']) &
                          (snp_df['BP'] >= info['chr_start']) &
                          (snp_df['BP'] <= info['chr_stop'])]
    if this_snp.shape[0] != 0:
        this_snp = this_snp[['SNP']]
        this_snp.insert(len(this_snp.columns), 'feature_id', info['feature_id'])
        all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
Can anyone give any suggestions of a more effective way of doing this?
I've just thought of a way to solve this - by combining my two methods:
First, focus on the individual chromosomes, and then loop through the genes in these smaller dataframes. This doesn't have to make use of any SQL queries either. I've also included a section to immediately discard any redundant genes that don't have any SNPs falling within their range. This makes use of a double for-loop, which I normally try to avoid, but in this case it works quite well.
all_dfs = []
for chromosome in snp_df['chromosome'].unique():
    this_chr_snp = snp_df.loc[snp_df['chromosome'] == chromosome]
    this_genes = gene_df.loc[gene_df['chromosome'] == chromosome]

    # Getting rid of redundant genes
    min_bp = this_chr_snp['BP'].min()
    max_bp = this_chr_snp['BP'].max()
    this_genes = this_genes.loc[~(this_genes['chr_start'] >= max_bp) &
                                ~(this_genes['chr_stop'] <= min_bp)]

    for line in this_genes.iterrows():
        info = line[1]
        this_snp = this_chr_snp.loc[(this_chr_snp['BP'] >= info['chr_start']) &
                                    (this_chr_snp['BP'] <= info['chr_stop'])]
        if this_snp.shape[0] != 0:
            this_snp = this_snp[['SNP']]
            this_snp.insert(1, 'feature_id', info['feature_id'])
            all_dfs.append(this_snp)

all_genic_snps = pd.concat(all_dfs)
While this doesn't run spectacularly quickly, it does run, so I can actually get some answers. I'd still like to know if anyone has any tips to make it run more efficiently though.
You can use the following to accomplish what you're looking for:
merged_df = snp_df.merge(gene_df, on=['chromosome'], how='inner')
merged_df = merged_df[(merged_df.BP >= merged_df.chr_start) &
                      (merged_df.BP <= merged_df.chr_stop)][['SNP', 'feature_id']]
Note: your example dataframes do not meet your join criteria. Here is an example using modified dataframes:
snp_df
Out[193]:
chromosome SNP BP
0 1 rs3094315 752566
1 1 rs3131972 30400
2 1 rs2073814 753474
3 1 rs3115859 754503
4 1 rs3131956 758144
gene_df
Out[194]:
chromosome chr_start chr_stop feature_id
0 1 10954 11507 GeneID:100506145
1 1 12190 13639 GeneID:100652771
2 1 14362 29370 GeneID:653635
3 1 30366 30503 GeneID:100302278
4 1 34611 36081 GeneID:645520
merged_df
Out[195]:
SNP feature_id
8 rs3131972 GeneID:100302278

Python: how to filter pandas.Series with function without losing index association?

I have a pandas.DataFrame whose rows I'm iterating over. On each row I need to filter out some non-valuable values while keeping the index association. This is where I'm at right now:
for i, row in df.iterrows():
    my_values = row["first_interesting_column":]
    # here I need to filter the 'my_values' Series based on a function
    # what I'm doing right now is use the built-in python filter function,
    # but what I get back is a list with no indexes anymore
    my_valuable_values = filter(lambda x: x != "-", my_values)
How can I do that?
Someone on IRC suggested the answer to me. Here it is:
w = my_values != "-" # creates a Series with a map of the stuff to be included/excluded
my_valuable_values = my_values[w]
... which could also be shortened in ...
my_valuable_values = my_values[my_values != "-"]
... and, of course, to avoid one more step ...
row["first_interesting_column":][row["first_interesting_column":] != "-"]
It is generally bad practice (and very slow) to iterate over rows. As @JohnE suggested, you want to use applymap.
If I understand your question, I think what you want to do is:
import numpy as np
import pandas as pd
from io import StringIO

datastring = StringIO("""\
2009  2010  2011  2012
1     4     -     4
3     -     2     3
4     -     8     7
""")
df = pd.read_table(datastring, sep=r'\s\s+', engine='python')
a = df[df.applymap(lambda x: x != '-')].astype(float).values
# keep only the valid numeric values as a flat array
a[~np.isnan(a)]
