Python Compare Dataframe columns and replace with contents based on prefix - python

Still relatively new to working in Python and am having some issues.
I currently have a small program that takes csv files, merges them, puts them into a dataframe, and then converts to Excel.
What I want to do is match the values of the 'Team' and 'Abrev' columns based on the prefix of the 'Team' values, and then replace the 'Team' column's contents with those of the 'Abrev' column.
Team Games Points Abrev
Arsenal 38 87 ARS
Liverpool 38 80 LIV
Manchester 38 82 MAN
Newcastle 38 73 NEW
I would like it to eventually look like the following:
Team Games Points
ARS 38 87
LIV 38 80
MAN 38 82
NEW 38 73
So what I'm thinking is that I need a for loop to iterate through the rows of the dataframe, and then I need a way to compare the contents against the prefix in the Abrev column: if the first three letters match, then replace. But I don't know how to go about it, because I am trying not to hard-code it.
Can someone help or point me in the right direction?

You can use an apply operation to get the desired output:
import pandas as pd

df = pd.read_csv('input.csv')
# replace Team with Abrev when the first three letters (uppercased) match
df['Team'] = df.apply(lambda row: row['Abrev'] if row['Team'][:3].upper() == row['Abrev']
                      else row['Team'], axis=1)
df.drop('Abrev', axis=1, inplace=True)
This gives you:
Team Games Points
ARS 38 87
LIV 38 80
MAN 38 82
NEW 38 73
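As an aside, the same replacement can be done without apply, using a vectorised comparison. A minimal sketch with the sample data built inline (the real code would read it from the csv as above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Team": ["Arsenal", "Liverpool", "Manchester", "Newcastle"],
    "Games": [38, 38, 38, 38],
    "Points": [87, 80, 82, 73],
    "Abrev": ["ARS", "LIV", "MAN", "NEW"],
})

# rows where the first three letters of Team (uppercased) equal Abrev
mask = df["Team"].str[:3].str.upper() == df["Abrev"]
# replace Team with Abrev on matching rows, keep Team otherwise
df["Team"] = np.where(mask, df["Abrev"], df["Team"])
df = df.drop(columns="Abrev")
print(df)
```

This avoids the per-row Python lambda, which matters on larger frames.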

pandas is what you are looking for:
import pandas as pd

df = pd.read_csv('input.csv')
df['Team'] = df['Abrev']
df.drop('Abrev', axis=1, inplace=True)
df.to_excel('output.xlsx')

Related

How to create a pivot table with value ranges in the index and headers and values in the frame?

This is my input data
And the output I want:
PAX range DELHI PUNE MUMBAI
0-50 56 22 56
51-100 55 33 77
101-150 52 27 89
A couple of things are not clear from your question:
There is no PAX column in the dataframe shown; perhaps there are more columns not shown, in which case that is all right.
I'm assuming from your comment that the aggregation function you want to use is a count of rows.
If that is all correct, then you can bin the PAX values with pd.cut and pass the result to a groupby call:
output = df.groupby([
    pd.cut(df.PAX, bins=[0, 50, 100, 150]), 'City'
]).size().unstack()
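To make that concrete, here is a runnable sketch with hypothetical input data (the PAX values and city assignments are made up, since the question's input was not shown):

```python
import pandas as pd

# hypothetical input: one row per record, with a PAX count and a City
df = pd.DataFrame({
    "PAX": [10, 45, 60, 120, 30, 90, 140],
    "City": ["DELHI", "PUNE", "DELHI", "MUMBAI", "MUMBAI", "PUNE", "DELHI"],
})

# bin PAX into ranges, count rows per (bin, City), pivot City to columns
output = df.groupby([
    pd.cut(df.PAX, bins=[0, 50, 100, 150]), "City"
]).size().unstack(fill_value=0)
print(output)
```

Each cell is the number of rows whose PAX falls in that bin for that city, matching the shape of the desired output table.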

A simple IF statement in Python 3.X Pandas not working

This is supposed to be a simple IF statement that updates based on a condition, but it is not working.
Here is my code:
df["Category"].fillna("999", inplace=True)

for index, row in df.iterrows():
    if str(row['Category']).strip() == "11":
        print(str(row['Category']).strip())
        df["Category_Description"] = "Agriculture, Forestry, Fishing and Hunting"
    elif str(row['Category']).strip() == "21":
        df["Category_Description"] = "Mining, Quarrying, and Oil and Gas Extraction"
The print statement
print(str(row['Category']).strip())
is working fine but updates to the Category_Description column are not working.
The input data has the following codes
Category Count of Records
48 17845
42 2024
99 1582
23 1058
54 1032
56 990
32 916
33 874
44 695
11 630
53 421
81 395
31 353
49 336
21 171
45 171
52 116
71 108
61 77
51 64
62 54
72 51
92 36
55 35
22 14
The update resulted in
Agriculture, Forestry, Fishing and Hunting 41183
Here is a small sample of the dataset and code on repl.it
https://repl.it/#RamprasadRengan/SimpleIF#main.py
When I run the code above with this data I still see the same issue.
What am I missing here?
You are performing a row operation but applying a dataframe-wide assignment inside the if statement. That assigns the value to every record, not just the current row.
Try something like:
def get_category_for_record(rec):
    if str(rec['Category']).strip() == "11":
        return "Agriculture, Forestry, Fishing and Hunting"
    elif str(rec['Category']).strip() == "21":
        return "Mining, Quarrying, and Oil and Gas Extraction"

df["Category_Description"] = df.apply(get_category_for_record, axis=1)
I think you want to add a column to the dataframe that maps category to a longer description. As mentioned in the comments, assignment to a column affects the entire column. But if you use a list, each row in the column gets the corresponding value.
So use a dictionary to map name to description, build a list, and assign it.
import pandas as pd

category_map = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, and Oil and Gas Extraction"}

df = pd.DataFrame([["48", 17845],
                   [" 11 ", 88888],
                   ["12", 33333],
                   ["21", 999]],
                  columns=["category", "count of records"])

# clean up category and add description
df["category"] = df["category"].str.strip()
df["Category_Description"] = [category_map.get(cat, "")
                              for cat in df["category"]]

# alternately....
# df.insert(2, "Category_Description",
#           [category_map.get(cat, "") for cat in df["category"]])

print(df)
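As a further simplification, Series.map accepts a dictionary directly, so the list comprehension can be dropped entirely. A sketch using the same hypothetical category_map and sample data:

```python
import pandas as pd

category_map = {
    "11": "Agriculture, Forestry, Fishing and Hunting",
    "21": "Mining, Quarrying, and Oil and Gas Extraction",
}

df = pd.DataFrame({"category": ["48", " 11 ", "12", "21"],
                   "count of records": [17845, 88888, 33333, 999]})

# strip whitespace, then map codes to descriptions; unmapped codes become ""
df["category"] = df["category"].str.strip()
df["Category_Description"] = df["category"].map(category_map).fillna("")
print(df)
```

With real data you would extend category_map with the remaining NAICS-style codes rather than adding elif branches.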

How to manipulate the index in one dataframe and filter for indices in another

I started learning pandas and am stuck on the issue below:
I have two large DataFrames:
df1=
ID KRAS ATM
TCGA-3C-AAAU-01A-11R-A41B-07 101 32
TCGA-3C-AALI-01A-11R-A41B-07 101 75
TCGA-3C-AALJ-01A-31R-A41B-07 102 65
TCGA-3C-ARLJ-01A-61R-A41B-07 87 54
df2=
ID BRCA1 ATM
TCGA-A1-A0SP 54 65
TCGA-3C-AALI 191 8
TCGA-3C-AALJ 37 68
The ID is the index in both dfs. First, I want to cut the ID down to only its first 10 digits (convert TCGA-3C-AAAU-01A-11R-A41B-07 to TCGA-3C-AAAU) in df1. Then I want to produce a new df from df1 which has only the IDs that exist in df2.
df3 should look like:
ID KRAS ATM
TCGA-3C-AALI 101 75
TCGA-3C-AALJ 102 65
I tried different ways but failed. Any suggestions on this, please?
Here is one way using vectorised functions:
# truncate to first 10 characters, or 12 including '-'
df1['ID'] = df1['ID'].str[:12]
# filter for IDs in df2
df3 = df1[df1['ID'].isin(df2['ID'])]
Result
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
Explanation
Use .str accessor to limit df1['ID'] to first 12 characters.
Mask df1 to include only IDs found in df2.
IIUC, TCGA-3C-AAAU contains 12 characters :-)
df3 = df1.assign(ID=df1.ID.str[:12]).loc[lambda x: x.ID.isin(df2.ID), :]
df3
Out[218]:
ID KRAS ATM
1 TCGA-3C-AALI 101 75
2 TCGA-3C-AALJ 102 65
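Since the question says the ID is the index in both frames (both answers treat it as a column), here is a sketch of the same truncate-and-filter done directly on the index, with the sample data rebuilt inline:

```python
import pandas as pd

df1 = pd.DataFrame(
    {"KRAS": [101, 101, 102, 87], "ATM": [32, 75, 65, 54]},
    index=["TCGA-3C-AAAU-01A-11R-A41B-07", "TCGA-3C-AALI-01A-11R-A41B-07",
           "TCGA-3C-AALJ-01A-31R-A41B-07", "TCGA-3C-ARLJ-01A-61R-A41B-07"])
df2 = pd.DataFrame(
    {"BRCA1": [54, 191, 37], "ATM": [65, 8, 68]},
    index=["TCGA-A1-A0SP", "TCGA-3C-AALI", "TCGA-3C-AALJ"])

# truncate df1's index to the first 12 characters
df1.index = df1.index.str[:12]
# keep only the rows whose truncated ID also appears in df2's index
df3 = df1[df1.index.isin(df2.index)]
print(df3)
```

The .str accessor works on an Index just as it does on a Series, so no reset_index round-trip is needed.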

Fetching records and then putting in a new column

I am working with a pandas dataframe for one of my projects.
I have a column named Count holding integer values.
I have 720 values, one for each hour, i.e. 24 * 30 days.
I want to run a loop which initially takes the first 24 values from the dataframe and puts them in a new column, then takes the next 24 and puts them in the next column, and so on.
for example:
input:
34
45
76
87
98
34
output:
34 87
45 98
76 34
Here there is a column of 6 values, and I am taking the first 3 and putting them in the first column and the next 3 in the second one.
Can someone please help with writing a code/program for the same? It would be of great help.
Thanks!
You can also try numpy's reshape method performed on pd.Series.values:
s = pd.Series(np.arange(720))
df = pd.DataFrame(s.values.reshape((30,24)).T)
Or use np.split (specify how many arrays you want to split into):
df = pd.DataFrame({"day" + str(i): v for i, v in enumerate(np.split(s.values, 30))})
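Applying the same reshape idea to the question's 6-value example (chunks of 3 instead of 24) gives exactly the desired output:

```python
import pandas as pd

# the question's example input: 6 values, split into 2 columns of 3
s = pd.Series([34, 45, 76, 87, 98, 34])

# reshape to (n_chunks, chunk_size) then transpose so each chunk is a column
df = pd.DataFrame(s.values.reshape((2, 3)).T)
print(df)
```

For the real data, reshape((30, 24)).T produces a 24-row frame with one column per day, as in the answer above the example would scale to.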

Should Indexing supply Direct Access to pandas DataFrame?

I have a dataframe which has several columns. I want to extract rows by a combination of values from two specific columns, so I use the set_index() method to index the dataframe by those columns. I figured that after doing so, I would have direct (O(1)) access to rows for a given combination of keys. Currently, that does not seem to be the case: it takes quite some time for a df.ix[ix1,ix2] operation to take place.
Example:
Say I have the following dataframe:
In [228]: df
Out[228]:
ID1 ID2 score
752476 5626887150_0 5626887150_6 96
752477 5626887150_0 5626887150_7 95
752478 5626887150_0 5626887150_2 95
752479 5626887150_0 5626887150_8 93
752480 5626887150_0 5626887150_1 89
752481 5626887150_0 2142280814_5 88
752482 5626887150_0 5626887150_3 84
752483 5626887150_0 6625625104_5 82
752484 5626887150_0 2142280814_4 81
And say I want to look at the score column for different ID1, ID2 combinations. To easily do this, I'm setting ID1 and ID2 as indexes, which gives the following result:
In [230]: df = df.set_index(['ID1','ID2'])
Out[230]:
score
ID1 ID2
5626887150_0 5626887150_6 96
5626887150_7 95
5626887150_2 95
5626887150_8 93
5626887150_1 89
2142280814_5 88
5626887150_3 84
6625625104_5 82
2142280814_4 81
Now I can easily access my data with ID1, ID2 combinations (e.g. df.ix['5626887150_0','5626887150_6']), that's true. BUT it does not seem to be O(1) access; it seems to take quite some time to return a value on a large dataframe.
So what exactly is the set_index() method doing? And is there a way to force O(1) access to the data?
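One detail worth checking here (an assumption about the cause, not stated in the question): a MultiIndex only supports fast lookups once it is lexsorted; on an unsorted MultiIndex, pandas may fall back to scanning. Calling sort_index() after set_index() typically restores fast (binary-search) lookups. A minimal sketch with synthetic IDs:

```python
import numpy as np
import pandas as pd

# synthetic two-level keyed data: 100 x 100 ID combinations
df = pd.DataFrame({
    "ID1": np.repeat([f"id1_{i}" for i in range(100)], 100),
    "ID2": [f"id2_{i}" for i in range(100)] * 100,
    "score": np.arange(10000),
}).set_index(["ID1", "ID2"])

# sort_index() lexsorts the MultiIndex, enabling fast tuple lookups
df = df.sort_index()
val = df.loc[("id1_3", "id2_7"), "score"]
print(val)
```

On a large frame, comparing the lookup time before and after sort_index() should show whether the unsorted index is the bottleneck.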
