Header for sub-headers in Pandas - python

In Pandas I have a table with the following columns:

Number of words | 1 | 2 | 4 |

...and I want to make it look like this:

                | worker/node |
Number of words | 1 | 2 | 4 |

So how do I "create" this header above the sub-headers? And how do I merge the empty cell (in row 1, where the feature header sits) with the index header cell ("Number of words") in row 2? In other words, I want the table headers to form the two-row layout shown above.

Use MultiIndex.from_product to add a first level to the MultiIndex from your string:

# if necessary, convert some columns to the index first
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
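A minimal, self-contained sketch (the data here is made up for illustration):

import pandas as pd

# hypothetical data: one label column plus three value columns
df = pd.DataFrame({'Number of words': ['a', 'b'],
                   '1': [10, 20],
                   '2': [30, 40],
                   '4': [50, 60]})

df = df.set_index('Number of words')
# prepend a single top-level header spanning all remaining columns
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])

print(df.columns.tolist())
# [('Worker/node', '1'), ('Worker/node', '2'), ('Worker/node', '4')]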

Related

How to automatically set an index to a Pandas DataFrame when reading a CSV with or without an index column

Say I have two CSV files. The first one, input_1.csv, has an index column, so when I run:
import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1
I get a DataFrame with an index column, as well as a column called Unnamed: 0, which is the same as the index column. I can prevent this duplication by adding the argument index_col=0 and everything is fine.
The second file, input_2.csv, has no index column, i.e., it looks like this:
| stuff | things |
|--------:|---------:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
Running pd.read_csv("input_2.csv") gives me a DataFrame with an automatically generated index. In this case, adding the index_col=0 argument will set the index to the stuff column from the CSV file itself.
My problem is that I have a function that contains the read_csv part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?
CSV has no built-in notion of an "index" column, so I think the answer is that this isn't possible in general.
It would be nice if you could say "use column 0 as the index only if it is unnamed", but Pandas does not give us that option.
Therefore you will probably need to check whether an Unnamed: column appears and, if it does, set that column as the index.
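A minimal sketch of that check (the helper name read_csv_with_index is my own; the file names are the ones from the question):

import pandas as pd

def read_csv_with_index(path):
    # if pandas produced an 'Unnamed: 0' first column, the file had an index column
    df = pd.read_csv(path)
    if df.columns[0].startswith('Unnamed:'):
        df = df.set_index(df.columns[0])
        df.index.name = None  # the column had no real name in the file
    return df

df_1 = read_csv_with_index("input_1.csv")  # index column detected and used
df_2 = read_csv_with_index("input_2.csv")  # no index column; default RangeIndex kept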
By "index", I hope you mean a column holding a serial number starting at 0 or 1.
You can apply some post-import logic to decide whether the first column qualifies as an index column:
The logic is: if the difference between the default index and the first column is the same for all rows, then the first column contains an increasing sequence (starting at any number). The precondition is that the column must be numeric.
For example, given this DataFrame:

  idx value
0   1     a
1   2     b
2   3     c
3   4     d
4   5     e
5   6     f

import numpy as np
import pandas as pd

# 1) the first column must be numeric
pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True

# 2) element-wise difference between the default index and the first column
np.array(df.index) - df.iloc[:, 0].values
>> array([-1, -1, -1, -1, -1, -1])

# 3) if all differences are equal, the first column is an increasing sequence
len(pd.Series(np.array(df.index) - df.iloc[:, 0].values).unique()) == 1
>> True
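Wrapped into a reusable helper (a sketch under the stated assumptions; set_serial_index is my own name, not a pandas API):

import numpy as np
import pandas as pd

def set_serial_index(df):
    """If the first column is a numeric increasing sequence, promote it to the index."""
    first = df.iloc[:, 0]
    if pd.api.types.is_numeric_dtype(first):
        diffs = np.array(df.index) - first.values
        if len(pd.Series(diffs).unique()) == 1:
            return df.set_index(df.columns[0])
    return df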

Handling rows with 2 lines of data in Python

My DataFrame looks like this:

There are some rows (for example, row 297) where the "Price" column has two values (Plugs and Quarts). I have filled the NaNs with the previous row's value, since they belong to the same Latin Name. However, I am now thinking of splitting the Price column into two columns named "Quarts" and "Plugs" and filling in the amounts, with 0 where no Plugs are found, and likewise for Quarts.
Example:

Plugs | Quarts
    0 |      2
    2 |      3
    4 |      0
Can someone help me with this?
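Since the original DataFrame is only shown as an image, here is a minimal sketch under assumed column names: a hypothetical frame with columns Latin Name, Type (holding 'Plugs' or 'Quarts') and Price, reshaped with pivot_table:

import pandas as pd

# hypothetical data shaped like the description in the question
df = pd.DataFrame({'Latin Name': ['x', 'y', 'y', 'z'],
                   'Type': ['Quarts', 'Plugs', 'Quarts', 'Plugs'],
                   'Price': [2, 2, 3, 4]})

# one column per Type; missing combinations are filled with 0
wide = (df.pivot_table(index='Latin Name', columns='Type',
                       values='Price', fill_value=0)
          .rename_axis(columns=None)
          .reset_index())
print(wide)
#   Latin Name  Plugs  Quarts
# 0          x      0       2
# 1          y      2       3
# 2          z      4       0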

How to select a dataframe column based on a list?

I want to apply a condition for each unique value of a DataFrame column, using Python.
I tried to put the unique values in a list and iterate over them:

f = df['Technical family'].unique()
for i in f:
    df_2 = df[(df['Technical family'] = f[i])]
    S = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])

but apparently the line df_2 = df[(df['Technical family'] = f[i])] doesn't work!
Does anyone have an idea how to do it?
You need to use == instead of =. == is for comparing values while = is for assigning values.
For example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Technical family': np.random.choice(['1', '2', '3'], 100),
                   'PFG | ID': np.random.choice(['A', 'B', 'C'], 100),
                   'Comp. | Family': np.random.choice(['a', 'b', 'c'], 100)})

f = df['Technical family'].unique()
df_2 = df[df['Technical family'] == f[0]]
pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])

Comp. | Family  a  b  c
PFG | ID
A               5  5  3
B               3  5  3
C               4  3  4
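With that fix, the original loop works; collecting each crosstab in a dict (my own addition) keeps all results instead of overwriting S on every iteration:

tables = {}
for value in df['Technical family'].unique():
    df_2 = df[df['Technical family'] == value]
    tables[value] = pd.crosstab(df_2['PFG | ID'], df_2['Comp. | Family'])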
Also, as a further suggestion, you can build the crosstab directly on both grouping columns and then slice it:

res = pd.crosstab([df['Technical family'], df['PFG | ID']], df['Comp. | Family'])
res.loc['1']

Equivalent pandas DataFrame operation for a SQL self-join, selecting all columns of the left table and some columns of the right table

I am working with pandas in Python.
I have one table, table_one, which has the columns name, address, one, two, phone.
Here one is a foreign key referencing two.
Now I want pandas to do the join on this foreign key, and the resulting DataFrame should look like the output below.
Input DataFrame:

Id  name  address  one  two  number
1   test  addrs    1    2    number
2   fert  addrs    2    1    testnumber
3   dumy  addrs    3    9    testnumber

Output should be: join this DataFrame to itself and get the name for its foreign key, which is two. That is, take all columns of the left table and only name from the right table. For example, in row 1 the one value is 1, which maps to row 2, where column two has value 1, so the name fert is taken into a new column:

1  test  addrs  1  2  number  fert

I tried the following:

pd.merge(df, df, left_on=['one'], right_on=['two'])

but I am not getting the required result: it gives all columns of the right table as well, whereas I want only the name value along with all columns of the left table.
Any help will be appreciated.
Select the required columns before the merge (and rename name to avoid a column conflict):

pd.merge(df, df[['two', 'name']].rename(columns={'name': 'for_name'}),
         left_on=['one'], right_on=['two'])
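A quick check on the sample data from the question (values copied from the input frame shown there):

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3],
                   'name': ['test', 'fert', 'dumy'],
                   'address': ['addrs', 'addrs', 'addrs'],
                   'one': [1, 2, 3],
                   'two': [2, 1, 9],
                   'number': ['number', 'testnumber', 'testnumber']})

out = pd.merge(df,
               df[['two', 'name']].rename(columns={'name': 'for_name'}),
               left_on=['one'], right_on=['two'])
print(out)
#    Id  name address  one  two_x      number  two_y for_name
# 0   1  test   addrs    1      2      number      1     fert
# 1   2  fert   addrs    2      1  testnumber      2     test

Row 3 drops out because its one value (3) has no match in two; the duplicated two_y column can be dropped afterwards if it is not needed.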

Python: create new row based on column names in DataFrame

I would like to know how to build a new row from the column names of a DataFrame and append it to that same DataFrame.
Example:

df = pd.DataFrame(np.random.randn(10, 5), columns=['abx', 'bbx', 'cbx', 'acx', 'bcx'])

I want to create a new row based on the column names that gives b | b | b | c | c, by taking the middle character of each column name.
The idea is to use that new row, later, for multi-indexing the columns.
I'm assuming this is what you want, as you've not responded. We can append a new row by creating a dict from zipping the df columns with a list comprehension of the middle character (assuming column names are 3 characters long):
In [126]:
df.append(dict(zip(df.columns, [col[1] for col in df])), ignore_index=True)
Out[126]:
abx bbx cbx acx bcx
0 -0.373421 -0.1005462 -0.8280985 -0.1593167 1.335307
1 1.324328 -0.6189612 -0.743703 0.9419248 1.282682
2 0.3730312 -0.06697892 1.113707 -0.9691056 1.779643
3 -0.6644958 1.379606 -0.3751724 -1.135034 0.3287292
4 0.4406139 -0.5767996 -0.2267589 -1.384412 -0.03038372
5 -1.242734 -0.838923 -0.6724592 1.405247 -0.3716862
6 -1.682637 -1.69309 -1.291833 1.781704 0.6321988
7 -0.5793783 -0.6809975 1.03502 -0.6498381 -1.124236
8 1.589016 1.272961 -1.968225 0.5515182 0.3058628
9 -2.275342 2.892237 2.076253 -0.1422845 -0.09776171
10 b b b c c
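Note that DataFrame.append was removed in pandas 2.0, so on current versions the same idea can be written with pd.concat (a sketch reusing the df defined above):

import pandas as pd

# build a one-row frame whose values are the middle character of each column name
new_row = pd.DataFrame([{col: col[1] for col in df.columns}])
df = pd.concat([df, new_row], ignore_index=True)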
.ix lets you read an entire row; you just say whichever row you want. Then you take that row's values and assign them to the columns. (Note: .ix has since been removed from pandas, so the example below uses the positional equivalent .iloc.)
See the example below.

import pandas as pd

virData = pd.DataFrame(df)
virData.columns = virData.iloc[1].values
virData.columns
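Since the stated goal is multi-indexing the columns, it may be simpler to skip the extra row entirely and build the MultiIndex directly; a sketch using the df from the question:

import pandas as pd

# pair each column's middle character (new top level) with the original name
df.columns = pd.MultiIndex.from_arrays(
    [[col[1] for col in df.columns], df.columns])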
