Say I have two CSV files. The first one, input_1.csv, has an index column, so when I run:
import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1
I get a DataFrame with a default index, plus a column called Unnamed: 0 that duplicates it. I can prevent this duplication by adding the argument index_col=0, and everything is fine.
The second file, input_2.csv, has no index column, i.e., it looks like this:
| stuff | things |
|--------:|---------:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
Running pd.read_csv("input_2.csv") gives me a DataFrame with a default index. In this case, adding the index_col=0 argument will set the index to the stuff column, i.e. to actual data from the CSV file.
My problem is that I have a function that contains the read_csv part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?
CSV has no built-in notion of an "index" column, so I think the answer is that this isn't possible in general.
It would be nice if you could say "use column 0 as the index only if it is unnamed", but pandas does not give us that option.
Therefore you will probably need to check whether an Unnamed: column appears and, if so, set that column as the index.
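A minimal sketch of that check (the function name is my own; it assumes the auto-written index column is always labelled Unnamed: 0, which is what to_csv produces for an unnamed index):
import pandas as pd

def read_csv_smart_index(path):
    # Read normally first, then promote the first column to the index
    # only when it looks like an auto-written, unnamed index.
    df = pd.read_csv(path)
    first = df.columns[0]
    if first.startswith("Unnamed:"):
        df = df.set_index(first)
        df.index.name = None  # drop the "Unnamed: 0" label
    return df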
By "index", I assume you mean a column of serial numbers starting at 0 or 1.
You can add some post-import logic to decide whether the first column qualifies as an index column:
The logic: if the difference between the default index and the first column is the same for every row, the first column contains an increasing sequence (starting at any number). The precondition is that the column must be numeric.
For example:
idx value
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True
np.array(df.index) - df.iloc[:,0].values
>> array([-1, -1, -1, -1, -1, -1])
# If all values are equal
len(pd.Series(np.array(df.index) - df.iloc[:,0].values).unique()) == 1
>> True
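Wrapped up into a reusable check, a sketch (the helper name is my own; whether a serial-number column should really become the index remains a judgment call):
import numpy as np
import pandas as pd

def first_column_is_serial(df):
    # Precondition: the first column must be numeric.
    first = df.iloc[:, 0]
    if not pd.api.types.is_numeric_dtype(first):
        return False
    # A constant offset from the default RangeIndex means the column
    # increases by 1 per row, starting at any number.
    diffs = np.array(df.index) - first.values
    return len(pd.Series(diffs).unique()) == 1

df = pd.read_csv("input_2.csv")
if first_column_is_serial(df):
    df = df.set_index(df.columns[0])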
In pandas I have a table with the following columns:
Number of words | 1 | 2 | 4 |
...and I want to make it look like the following, with a spanning header above the value columns:
                |    worker/node    |
Number of words |   1   |  2  |  4  |
So how do I "create" this header for the sub-features?
And how do I merge the empty cell (in row 1, where the feature header sits) with the "Index" cell in row 2?
In other words, I want a two-level table header like the one sketched above.
Use MultiIndex.from_product to add a first level to the columns' MultiIndex from your string:
#if necessary convert some columns to index first
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
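A quick end-to-end check with made-up numbers standing in for the question's table:
import pandas as pd

# Hypothetical data; only the column layout matters here.
df = pd.DataFrame({'Number of words': [10, 20, 30],
                   1: [1, 1, 2], 2: [0, 2, 1], 4: [3, 0, 1]})
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
# Printing df now shows 'Worker/node' spanning the three value columns,
# with 'Number of words' as the index.
print(df)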
How can we use coalesce with multiple DataFrames?
columns_List = Emp_Id, Emp_Name, Dept_Id...
I have two DataFrames used in a Python script: df1[Columns_List] and df2[Columns_List]. Both have the same columns, but they will hold different values.
How can I use coalesce so that, say, if Emp_Name is null in df1[Columns_List], I pick Emp_Name from df2[Columns_List]?
I am trying to create an output CSV file.
Apologies if my framing of the question is wrong.
Please find sample data below.
For DataFrame 1 -- df1[Columns_List] -- the output is:
EmpID,Emp_Name,Dept_id,DeptName
1,,1,
2,,2,
For DataFrame 2 -- df2[Columns_List] -- the output is:
EmpID,Emp_Name,Dept_id,DeptName
1,XXXXX,1,Sciece
2,YYYYY,2,Maths
My source is a JSON file. After parsing the data in Python, I use two DataFrames in the same script. In DataFrame 1 (df1) I have Emp_Name & DeptName as null; in that case I want to pick the data from DataFrame 2 (df2).
In the example above I provided only a few columns, but I may have n columns; the column order and column names will always be the same. I am trying to achieve this so that if any column in df1 is null, the value is picked from df2.
Is that possible? Any suggestions are appreciated.
You can use pandas.DataFrame.combine. This method does what you need: it builds a dataframe taking elements from two dataframes according to a custom function.
You can then write a custom function which picks the element from dataframe one unless that is null, in which case the element is taken from dataframe two.
Consider the following two dataframes. I built them according to your examples, but with a small difference (different Dept_id values) to emphasize that only null values will be replaced:
import numpy as np
import pandas as pd

columnlist = ["EmpID", "Emp_Name", "Dept_id", "DeptName"]
df1 = pd.DataFrame([[1, None, 1, np.nan], [2, np.nan, 2, None]], columns=columnlist)
df2 = pd.DataFrame([[1, "XXX", 2, "Science"], [2, "YYY", 3, "Math"]], columns=columnlist)
They are:
df1
EmpID Emp_Name Dept_id DeptName
0 1 NaN 1 NaN
1 2 NaN 2 NaN
df2
EmpID Emp_Name Dept_id DeptName
0 1 XXX 1 Science
1 2 YYY 3 Math
What you need to do is:
ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))
to get ddf:
ddf
EmpID Emp_Name Dept_id DeptName
0 1 XXX 1 Science
1 2 YYY 2 Math
As you can see, only Null values in df1 have been replaced with the corresponding values in df2.
EDIT: A bit deeper explanation
Since I've been asked in the comments, let me explain the solution a bit more:
ddf = df1.combine(df2, lambda ss, rep_ss : pd.Series([r if pd.isna(x) else x for x, r in zip(ss, rep_ss)]))
It is a bit compact, but there is nothing more to it than some basic Python techniques, like list comprehensions, plus the use of pandas.DataFrame.combine. The pandas method is detailed in the docs linked above. It compares the two dataframes column by column: the columns are passed to a custom function, which must return a pandas.Series. This Series becomes a column in the returned dataframe.
In this case, the custom function is a lambda, which uses a list comprehension to loop over the pairs of elements (one from each column) and pick only one element of the pair (the first if not null, otherwise the second).
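As a side note, pandas also ships a shorthand for exactly this "coalesce" pattern: DataFrame.combine_first takes values from the first frame and falls back to the second wherever the first is null. Assuming the two frames align on index and columns, this one-liner should be equivalent:
ddf = df1.combine_first(df2)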
You can use a mask to get null values and replace those. The best part is that you don't have to eyeball anything; the function will find what to replace for you.
You can also adjust the pd.DataFrame.select_dtypes() function to suit your needs, or just go through multiple dtypes with appropriate conversion and detection measures being used.
import pandas as pd
ddict1 = {
    'EmpID': [1, 2],
    'Emp_Name': ['', ''],
    'Dept_id': [1, 2],
    'DeptName': ['', ''],
}
ddict2 = {
    'EmpID': [1, 2],
    'Emp_Name': ['XXXXX', 'YYYYY'],
    'Dept_id': [1, 2],
    'DeptName': ['Sciece', 'Maths'],
}
df1 = pd.DataFrame(ddict1)
df2 = pd.DataFrame(ddict2)
def replace_df_values(df_A, df_B):
    ## Select object dtypes
    for i in df_A.select_dtypes(include=['object']):
        ### Check whether the column contains any zero-length values
        if (df_A[i] == '').any():
            ### Create mask for zero-length values (or null, your choice)
            mask = df_A[i] == ''
            ### Replace on a 1-for-1 basis using .loc[]
            df_A.loc[mask, i] = df_B.loc[mask, i]

### Pass dataframes in reverse order to cover both scenarios
replace_df_values(df1, df2)
replace_df_values(df2, df1)
Initial values for df1:
EmpID Emp_Name Dept_id DeptName
0 1 1
1 2 2
Output for df1 after running function:
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Sciece
1 2 YYYYY 2 Maths
I replicated your dataframes:
# df1
EmpID Emp_Name Dept_id DeptName
0 1 1
1 2 2
# df2
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Sciece
1 2 YYYYY 2 Maths
If you want to replace missing values (NaN) from df1.column with existing values from df2.column, you could use .fillna(). For example:
df1['Emp_Name'] = df1['Emp_Name'].fillna(df2['Emp_Name'])
# df1
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1
1 2 YYYYY 2
If you want to replace all values in a given column with the values from the same column of another dataframe, you can assign the column directly (converting to a list sidesteps index alignment):
df1['DeptName'] = df2['DeptName'].to_list()
EmpID Emp_Name Dept_id DeptName
0 1 XXXXX 1 Sciece
1 2 YYYYY 2 Maths
I'm sure there's a better way to do this, but I hope this helps!
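Since the question mentions n columns, it is worth noting that fillna also accepts a whole DataFrame, so every column can be coalesced at once. A sketch, assuming the blanks are empty strings that first need converting to NaN:
import numpy as np

# Turn empty strings into NaN, then fill all missing cells from df2 in one shot.
df1 = df1.replace('', np.nan).fillna(df2)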
I have a df with about 50 columns:
Product ID | Cat1 | Cat2 | Cat3 | ... other columns ...
8937456    |    0 |    5 |   10 |
8497534    |   25 |    3 |    0 |
8754392    |    4 |   15 |    7 |
Each Cat column holds how many units of that product fell into the category. Now I want to add a column "Category" denoting the majority category for a product (ignoring the other columns and considering just the Cat columns).
df_goal:
Product ID | Cat1 | Cat2 | Cat3 | Category | ... other columns ...
8937456    |    0 |    5 |   10 |        3 |
8497534    |   25 |    3 |    0 |        1 |
8754392    |    4 |   15 |    7 |        2 |
I think I need to use max and apply or map?
I found the following related questions on Stack Overflow, but they don't address the category assignment. In Excel I renamed the columns from Cat 1 to 1 and used INDEX(MATCH(MAX)).
Python Pandas max value of selected columns
How should I take the max of 2 columns in a dataframe and make it another column?
Assign new value in DataFrame column based on group max
Here's a NumPy way with numpy.argmax -
df['Category'] = df.values[:,1:].argmax(1)+1
To restrict the selection to those columns, use the column headers/names explicitly, then use idxmax, and finally replace the string 'Cat' with an empty string, like so -
df['Category'] = df[['Cat1','Cat2','Cat3']].idxmax(1).str.replace('Cat','')
numpy.argmax or pandas' idxmax basically gets us the index of the max element along an axis.
If we know the Cat columns occupy positions 1 through 3 (the 2nd through 4th columns), we can slice the dataframe, df.iloc[:, 1:4], instead of writing df[['Cat1','Cat2','Cat3']].
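A small self-contained demo of the idxmax variant, rebuilding the frame from the sample rows:
import pandas as pd

df = pd.DataFrame({'Product ID': [8937456, 8497534, 8754392],
                   'Cat1': [0, 25, 4], 'Cat2': [5, 3, 15], 'Cat3': [10, 0, 7]})

# idxmax(axis=1) returns the winning column label per row; stripping 'Cat'
# leaves the category number as a string (add .astype(int) if needed).
df['Category'] = df[['Cat1', 'Cat2', 'Cat3']].idxmax(axis=1).str.replace('Cat', '')
print(df['Category'].tolist())  # ['3', '1', '2']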
I am working with Python pandas.
I have one table, table_one, which has the columns name, address, one, two, phone.
Now one is a foreign key referencing two.
I want pandas to do the join on this foreign key, and the resulting DataFrame should give the result below.
Input data frame:
Id | name | Address | one | two | number
1  | test | addrs   | 1   | 2   | number
2  | fert | addrs   | 2   | 1   | testnumber
3  | dumy | addrs   | 3   | 9   | testnumber
The output should be: join this df (data frame) to itself and get the name for its foreign key two.
That is, keep all columns of the left table and take only name from the right table.
For example, in row 1, one is a foreign key on two, so the resulting row will be:
1 test addrs 1 2 number fert
The same applies to every row: row 1 has value 1 in column one, which maps to column two; row 2 has value 1 in column two, so its name, fert, goes into the new column.
I tried the following:
pd.merge(df, df, left_on=['one'], right_on=['two'])
but I am not getting the required result: it returns all columns of the right table too, whereas I want only the name value along with all columns of the left table.
Any help will be appreciated.
Select the required columns before the merge (renaming name to avoid a conflict):
pd.merge(df, df[['two', 'name']].rename(columns={'name': 'for_name'}), left_on=['one'], right_on=['two'])
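A runnable sketch on a simplified version of the sample data (Address and phone omitted); a suffix keeps the right-hand key distinguishable so it can be dropped afterwards:
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3], 'name': ['test', 'fert', 'dumy'],
                   'one': [1, 2, 3], 'two': [2, 1, 9]})

out = pd.merge(df, df[['two', 'name']].rename(columns={'name': 'for_name'}),
               left_on='one', right_on='two', suffixes=('', '_r'))
out = out.drop(columns='two_r')
print(out)
#    Id  name  one  two for_name
# 0   1  test    1    2     fert
# 1   2  fert    2    1     test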