how to read a column with another column name in pandas? - python

I have to read and process a set of files (e.g. 100 files). One of the files comes with the column name 'Idass', while the other files come with the column name 'IdassId'.
After processing, I select a few columns and write the output to Excel:
df.to_excel(writer, columns=['Date','IdassId','TankNo','GradeNo','Sales'],sheet_name='sales')
Here I lose that single file's entries, since its column is named 'Idass' rather than 'IdassId'.
(I cannot rename the column before processing, since the files come from another automated process.)
I tried renaming that column to 'IdassId' and then writing to Excel:
d = {'Idass': 'IdassId'}
df.rename(columns=d).to_excel(writer, columns=['Date','IdassId','TankNo','GradeNo','Sales'],sheet_name='sales')
but the above raises an error, because the other files already come with the column name 'IdassId':
ValueError: cannot reindex from a duplicate axis
How can I do this in pandas?

I'm assuming you're concatenating the Excel files together, so the result would look similar to this:
Idass IdassId
0 0.0 NaN
1 1.0 NaN
2 2.0 NaN
3 3.0 NaN
4 4.0 NaN
5 NaN 0.0
6 NaN 1.0
7 NaN 2.0
8 NaN 3.0
9 NaN 4.0
If you were to rename Idass to IdassId, you would end up with two columns named IdassId, and that is what is causing your error.
You should be able to fill in the null values of IdassId instead and get your desired result:
df['IdassId'] = df['IdassId'].where(df['IdassId'].notnull(), df['Idass'])
df.drop('Idass', axis=1, inplace=True)
IdassId
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
5 0.0
6 1.0
7 2.0
8 3.0
9 4.0
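For reference, here is the same idea as a self-contained sketch (the frame below just mimics the concatenated data shown above; fillna on the 'IdassId' column is an equivalent, slightly shorter form of the where call):
import numpy as np
import pandas as pd

# Mimic the concatenated result: one file used 'Idass', the rest used 'IdassId'
df = pd.DataFrame({'Idass':   [0.0, 1.0, 2.0, 3.0, 4.0] + [np.nan] * 5,
                   'IdassId': [np.nan] * 5 + [0.0, 1.0, 2.0, 3.0, 4.0]})

# Take 'IdassId' where present, otherwise fall back to 'Idass', then drop 'Idass'
df['IdassId'] = df['IdassId'].fillna(df['Idass'])
df = df.drop('Idass', axis=1)
print(df)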

It looks like your d dict renames Idass to IdassId, which is actually the correct {old_name: new_name} order for rename; the ValueError comes from the rename producing a second IdassId column, not from the key:value pairing being switched.

Related

Fill missing values with the most common value in the grouped form

Could anybody help me fill missing values with the most common value, but in grouped form? Here I want to fill the missing values of the cylinders column using rows with the same car model.
I tried this:
sh_cars['cylinders']=sh_cars['cylinders'].fillna(sh_cars.groupby('model')['cylinders'].agg(pd.Series.mode))
and other variants, but I got error messages every time.
Thanks in advance.
I think the problem is that some (or all) groups contain only NaNs, so an error is raised. A possible solution is to use a custom function with GroupBy.transform, which returns a Series the same size as the original DataFrame:
import numpy as np
import pandas as pd
data = {'model': ['a','a','a','a','b','b','a'],
        'cylinders': [2,9,9,np.nan,np.nan,np.nan,np.nan]}
sh_cars = pd.DataFrame(data)
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['new']=sh_cars['cylinders'].fillna(s)
print (sh_cars)
model cylinders new
0 a 2.0 2.0
1 a 9.0 9.0
2 a 9.0 9.0
3 a NaN 9.0
4 b NaN NaN
5 b NaN NaN
6 a NaN 9.0
Replace original column:
f = lambda x: x.mode().iat[0] if x.notna().any() else np.nan
s = sh_cars.groupby('model')['cylinders'].transform(f)
sh_cars['cylinders']=sh_cars['cylinders'].fillna(s)
print (sh_cars)
model cylinders
0 a 2.0
1 a 9.0
2 a 9.0
3 a 9.0
4 b NaN
5 b NaN
6 a 9.0
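As a side note, the notna().any() guard in the lambda matters because Series.mode() drops NaN by default, so an all-NaN group (like model 'b' above) produces an empty Series and .iat[0] would raise an IndexError:
import numpy as np
import pandas as pd

all_nan = pd.Series([np.nan, np.nan])
print(all_nan.mode())        # empty Series, so .iat[0] would raise IndexError
print(all_nan.mode().empty)  # True, hence the fallback to np.nan in the lambda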

Trying to fill NaNs with fillna() and groupby()

So I basically have an Airbnb data set with a few columns. Several of them correspond to ratings of different parameters (cleanliness, location, etc.). For those columns I have a bunch of NaNs that I want to fill.
As some of those NaNs correspond to listings from the same owner, I wanted to fill some of the NaNs with the corresponding hosts' rating average for each of those columns.
For example, let's say that for host X, the average value for review_scores_location is 7. What I want to do is, in the review_scores_location column, fill all the NaN values, that correspond to the host X, with 7.
I've tried the following code:
cols=['reviews_per_month','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin','review_scores_communication','review_scores_location','review_scores_value']
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].mean())
Although it does run and it doesn't return any error, it does not fill the NaN values, since when I check if there are still any NaNs, the amount hasn't changed.
What am I doing wrong?
Thanks for taking the time to read this!
The problem here is that when you pass the series airbnb.groupby('host_id')[i].mean() to fillna, pandas tries to align it on the index. The index of airbnb.groupby('host_id')[i].mean() consists of the host_id values, not the original index values of airbnb, so the fillna does not work as you expect. Several options are possible; one way is to use transform after the groupby, which aligns the mean value per group back to the original index values so that the fillna works as expected, such as:
for i in cols:
    airbnb[i] = airbnb[i].fillna(airbnb.groupby('host_id')[i].transform('mean'))
You can even use this method without a loop, such as:
airbnb = airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean'))
with an example:
airbnb = pd.DataFrame({'host_id': [1,1,1,2,2,2],
                       'reviews_per_month': [4,5,np.nan,9,3,5],
                       'review_scores_rating': [3,np.nan,np.nan,np.nan,7,8]})
print (airbnb)
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 NaN 5.0
2 1 NaN NaN
3 2 NaN 9.0
4 2 7.0 3.0
5 2 8.0 5.0
and you get:
cols=['reviews_per_month','review_scores_rating'] # would work with all your columns
print (airbnb.fillna(airbnb.groupby('host_id')[cols].transform('mean')))
host_id review_scores_rating reviews_per_month
0 1 3.0 4.0
1 1 3.0 5.0
2 1 3.0 4.5
3 2 7.5 9.0
4 2 7.0 3.0
5 2 8.0 5.0
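To make the alignment issue concrete, a small sketch reusing the example frame above: the per-group means are indexed by the host_id values, while transform broadcasts them back onto the original row labels, which is exactly what fillna needs to align on:
# Indexed by the host_id values (1, 2), not by the original row labels 0..5
means = airbnb.groupby('host_id')['reviews_per_month'].mean()
print(means.index)

# Same values, but broadcast back onto index 0..5, one per original row
print(airbnb.groupby('host_id')['reviews_per_month'].transform('mean'))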

Python - iterate over rows and columns

First of all, please pardon my skills. I am trying to get into Python; I learn just for fun, let's say. I don't use it professionally and I am quite bad, to be honest. There will probably be basic errors in my question.
Anyway, I am trying to go over a dataframe's rows and columns. I want to check if the values of the columns (except the first one) are NaNs. If they are, then they should change to the value of the first one.
import math
for index, row in rawdata3.iterrows():
    test = row[0]
    for column in row:
        if math.isnan(row.loc[column]) == True:
            row.loc[column] = test
The error I get is something like this:
the label [4.0] is not in the [columns]
I also had other errors with slightly different code like:
cannot do label indexing on class pandas.core.indexes.base.Index with these indexers class float
Could you give me a hand, please?
Thanks in advance!
Cheers.
I don't know if there is a better way but this works fine:
for i in df.columns:
    df.loc[df[i].isnull(), i] = df.loc[df[i].isnull(), 'A']
output:
A B C
0 5 5.0 2.0
1 6 5.0 6.0
2 9 9.0 9.0
3 2 4.0 6.0
Where df is:
A B C
0 5 NaN 2.0
1 6 5.0 NaN
2 9 NaN NaN
3 2 4.0 6.0
Alternatively, use transpose and fillna. Filling row-wise directly is not implemented, so df.fillna(value=df.A, axis=1) raises:
NotImplementedError: Currently only can fill with dict/Series column by column
Therefore we transpose, fill column-wise, and transpose back:
df.T.fillna(df.A).T
Output:
A B C
0 5.0 5.0 2.0
1 6.0 5.0 6.0
2 9.0 9.0 9.0
3 2.0 4.0 6.0
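Putting that answer together as a self-contained sketch (the A/B/C frame is the sample data shown above; the transpose variant is commented out because it upcasts everything to float, as in the last output):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [5, 6, 9, 2],
                   'B': [np.nan, 5.0, np.nan, 4.0],
                   'C': [2.0, np.nan, np.nan, 6.0]})

# Option 1: column by column, filling NaNs from column 'A'
for i in df.columns:
    df.loc[df[i].isnull(), i] = df.loc[df[i].isnull(), 'A']

# Option 2: transpose so fillna can broadcast 'A' across each row
# df = df.T.fillna(df['A']).T

print(df)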

Filter out CSV values after a space in python

So my goal is to read a CSV file created by a geocoder that has, annoyingly, put string values together with latitude or longitude values separated by a space. I could go through all of these Excel cells and split them manually, but I would really like to read the CSV instead, use the space as the delimiter, and filter out all of the string values. I know how to import a CSV, and even how to specify space as the delimiter, I think. What I don't understand is how to filter out all of the string values and save only the numeric values in a brand new Excel sheet. Does anyone know how to do this?
Here is the code I have so far to delimit the white space:
pd.read_csv('file.csv',delim_whitespace=True)
Use pd.read_csv to read your CSV, select_dtypes to select only numeric columns, and save only numeric columns to a CSV using to_csv.
df = pd.read_csv('file.csv', delim_whitespace=True)
df.select_dtypes(['float']).to_csv('file.csv')
If your file has no headers, you'll need to add header=None when reading the CSV.
df
a b c
0 1.0 0 foo
1 2.0 0 NaN
2 1.0 1 bar
3 1.0 1 foo
4 NaN 1 baz
5 3.0 1 foo
6 3.0 1 bar
df.select_dtypes(['float'])
a
0 1.0
1 2.0
2 1.0
3 1.0
4 NaN
5 3.0
6 3.0
If, for some reason, you have integer columns you want to save, change float to number:
df.select_dtypes(['number'])
a b
0 1.0 0
1 2.0 0
2 1.0 1
3 1.0 1
4 NaN 1
5 3.0 1
6 3.0 1
And just chain a .to_csv call.
If you get the data separated as you should, you can also coerce everything to numeric and drop the non-numeric columns (convert_objects is deprecated in newer pandas, so pd.to_numeric is the safer choice):
df.apply(pd.to_numeric, errors='coerce').dropna(axis=1)
and you can add .to_csv('your_file_name.csv') at the end.
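A self-contained sketch of the whole pipeline, with made-up geocoder-style lines standing in for file.csv (the labels, coordinates, and output file name are purely illustrative):
import io
import pandas as pd

# Stand-in for the geocoder output: a string label, then latitude and longitude
raw = io.StringIO("place1 40.7128 -74.0060\n"
                  "place2 34.0522 -118.2437\n")

df = pd.read_csv(raw, delim_whitespace=True, header=None,
                 names=['label', 'lat', 'lon'])

# Keep only the numeric columns and write them to a new file
df.select_dtypes(['number']).to_csv('coords_only.csv', index=False)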

Python Pandas: How to merge based on an "OR" condition?

Let's say I have two dataframes, and the column names for both are:
table 1 columns:
[ShipNumber, TrackNumber, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]
I want to merge the two tables based on both ShipNumber and TrackNumber.
However, if I simply use merge in the following way (pseudo code, not real code):
tab1.merge(tab2, "left", on=['ShipNumber','TrackNumber'])
then, that means the values in both ShipNumber and TrackNumber columns from both tables MUST MATCH.
However, in my case, sometimes the ShipNumber column values will match, sometimes the TrackNumber column values will match; as long as one of the two values match for a row, I want the merge to happen.
In other words, if row 1 ShipNumber in tab 1 matches row 3 ShipNumber in tab 2, but the TrackNumber in two tables for the two records do not match, I still want to match the two rows from the two tables.
So basically this is an either/or match condition (pseudo code):
if tab1.ShipNumber == tab2.ShipNumber OR tab1.TrackNumber == tab2.TrackNumber:
    then merge
I hope my question makes sense...
Any help is really really appreciated!
As suggested, I looked into this post:
Python pandas merge with OR logic
But it is not completely the same issue, I think, as the OP from that post has a mapping file and so can simply do two merges to solve this. I don't have a mapping file; rather, I have two DataFrames with the same key columns (ShipNumber, TrackNumber).
Use merge() and concat(). Then drop any duplicate cases where both A and B match (thanks #Scott Boston for that final step).
import pandas as pd

df1 = pd.DataFrame({'A': [3,2,1,4], 'B': [7,8,9,5]})
df2 = pd.DataFrame({'A': [1,5,6,4], 'B': [4,1,8,5]})
df1 df2
A B A B
0 3 7 0 1 4
1 2 8 1 5 1
2 1 9 2 6 8
3 4 5 3 4 5
With these data frames we should see:
df1.loc[0] matches A on df2.loc[0]
df1.loc[1] matches B on df2.loc[2]
df1.loc[3] matches both A and B on df2.loc[3]
We'll use suffixes to keep track of what matched where:
suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
df = pd.concat([df1.merge(df2, on='A', suffixes=suff_A),
                df1.merge(df2, on='B', suffixes=suff_B)])
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
1 4.0 NaN NaN NaN 5.0 5.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
Note that the second and fourth rows are duplicate matches (for both data frames, A = 4 and B = 5). We need to remove one of those sets.
dups = (df.B_on_A_match_1 == df.B_on_A_match_2) # also could remove A_on_B_match
df.loc[~dups]
A A_on_B_match_1 A_on_B_match_2 B B_on_A_match_1 B_on_A_match_2
0 1.0 NaN NaN NaN 9.0 4.0
0 NaN 2.0 6.0 8.0 NaN NaN
1 NaN 4.0 4.0 5.0 NaN NaN
I would suggest this alternative way of doing a merge like this. It seems easier to me.
table1["id_to_be_merged"] = table1.apply(
lambda row: row["ShipNumber"] if pd.notnull(row["ShipNumber"]) else row["TrackNumber"], axis=1)
You can add the same column in table2 as well if needed, and then use it in left_on or right_on based on your requirement.
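A minimal, self-contained sketch of that alternative (the toy data is made up, and fillna is just a shorthand for the apply/lambda above; whether a single fallback key covers your real matching rules depends on your data):
import numpy as np
import pandas as pd

# Toy versions of the two tables from the question
table1 = pd.DataFrame({'ShipNumber': ['S1', np.nan, 'S3'],
                       'TrackNumber': ['T1', 'T2', 'T9'],
                       'Quantity': [10, 20, 30]})
table2 = pd.DataFrame({'ShipNumber': ['S1', np.nan, np.nan],
                       'TrackNumber': ['T7', 'T2', 'T3'],
                       'AmountReceived': [1, 2, 3]})

# Build the same fallback key in both tables: ShipNumber if present, else TrackNumber
for t in (table1, table2):
    t['id_to_be_merged'] = t['ShipNumber'].fillna(t['TrackNumber'])

merged = table1.merge(table2, how='left', on='id_to_be_merged',
                      suffixes=('_t1', '_t2'))
print(merged)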
