Handling rows with 2 lines of data in Python

My DataFrame looks like this:
There are some rows (for example, row 297) where the "Price" column has two values (Plugs and Quarts). I have filled the NaNs with the previous row's value since it belongs to the same Latin Name. However, I was thinking of splitting the Price column further into two columns named "Quarts" and "Plugs" and filling in the amount, with 0 if there are no Plugs found, and the same for Quarts.
Example :
Plugs | Quarts
0 | 2
2 | 3
4 | 0
Can someone help me with this?
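Without the full DataFrame I can only sketch one possible approach. Assuming the Price column holds strings like "2 Quarts" or "4 Plugs" (the column names and sample values below are hypothetical), you could extract the amount and the unit and then pivot:

import pandas as pd

df = pd.DataFrame({
    "Latin Name": ["Acer rubrum", "Acer rubrum", "Quercus alba", "Pinus strobus"],
    "Price": ["2 Quarts", "0 Plugs", "3 Quarts", "4 Plugs"],
})

# Split the Price string into a numeric amount and a unit ("Plugs"/"Quarts")
parts = df["Price"].str.extract(r"(?P<amount>\d+)\s*(?P<unit>Plugs|Quarts)")
df = df.join(parts)
df["amount"] = df["amount"].astype(int)

# One row per Latin Name, separate Plugs/Quarts columns, 0 where missing
result = (df.pivot_table(index="Latin Name", columns="unit",
                         values="amount", aggfunc="sum", fill_value=0)
            .reindex(columns=["Plugs", "Quarts"], fill_value=0)
            .reset_index())
print(result)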

Related

How to automatically set an index to a Pandas DataFrame when reading a CSV with or without an index column

Say I have two CSV files. The first one, input_1.csv, has an index column, so when I run:
import pandas as pd
df_1 = pd.read_csv("input_1.csv")
df_1
I get a DataFrame with an index column, as well as a column called Unnamed: 0, which is the same as the index column. I can prevent this duplication by adding the argument index_col=0 and everything is fine.
The second file, input_2.csv, has no index column, i.e., it looks like this:
| stuff | things |
|--------:|---------:|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
Running pd.read_csv("input_2.csv") gives me a DataFrame with an index column. In this case, adding the index_col=0 argument will set the index to the stuff column, as it appears in the CSV file itself.
My problem is that I have a function that contains the read_csv part, and I want it to return a DataFrame with an index column in either case. Is there a way to detect whether the input file has an index column or not, set one if it doesn't, and do nothing if it does?
CSV has no built-in notion of an "index" column, so I think the answer is that this isn't possible in general.
It would be nice if you could say "use column 0 as the index only if it is unnamed", but Pandas does not give us that option.
Therefore you will probably need to just check whether an Unnamed: column appears and, if so, set it as the index, as sketched below.
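A minimal sketch of that check (untested; read_csv_with_index is just an illustrative name):

import pandas as pd

def read_csv_with_index(path):
    # Read normally, then promote the first column to the index only if
    # pandas reported it as an unnamed column (i.e. the CSV had one written out).
    df = pd.read_csv(path)
    first = df.columns[0]
    if first.startswith("Unnamed:"):
        df = df.set_index(first)
        df.index.name = None  # drop the meaningless "Unnamed: 0" label
    return df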
By index, I assume you mean a column with a serial number starting at either 0 or 1.
You can add some post-import logic to decide whether the first column qualifies as an index column.
The logic is: if the difference between the default index and the first column is the same for all rows, then the first column contains an increasing sequence (starting at any number). The precondition is that the column is numeric.
For example:
idx value
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
5 6 f
import numpy as np

pd.api.types.is_numeric_dtype(df[df.columns[0]])
>> True
np.array(df.index) - df.iloc[:, 0].values
>> array([-1, -1, -1, -1, -1, -1])
# If all differences are equal, the first column is a running sequence
len(pd.Series(np.array(df.index) - df.iloc[:, 0].values).unique()) == 1
>> True
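Wrapped into the kind of helper the question asks for, it could look roughly like this (an untested sketch; the function name is made up):

import numpy as np
import pandas as pd

def read_csv_smart_index(path):
    # Promote the first column to the index only when it looks like a
    # serial-number column: numeric, with a constant offset from 0..n-1.
    df = pd.read_csv(path)
    first = df.iloc[:, 0]
    if pd.api.types.is_numeric_dtype(first):
        diffs = np.array(df.index) - first.values
        if len(pd.Series(diffs).unique()) == 1:
            df = df.set_index(df.columns[0])
    return df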

Header for sub-headers in Pandas

In Pandas I have a table with the following columns:
Number of words | 1 | 2 | 4 |
...And I want to make it like the following:
----------------|worker/node|
Number of words | 1 | 2 | 4 |
So how do I "create" this header for the sub-features?
And how do I merge the empty cell (from row 1, where FeatureHeader is located) with the "Index" cell in row 2?
In other words, I want to make table headers like this:
Use MultiIndex.from_product to add the first level of the MultiIndex from your string:
#if necessary convert some columns to index first
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
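A quick illustration on a toy frame (the values are made up):

import pandas as pd

df = pd.DataFrame({'Number of words': [10, 20],
                   '1': [0.1, 0.2], '2': [0.3, 0.4], '4': [0.5, 0.6]})
df = df.set_index(['Number of words'])
df.columns = pd.MultiIndex.from_product([['Worker/node'], df.columns])
# df.columns is now a two-level MultiIndex:
# ('Worker/node', '1'), ('Worker/node', '2'), ('Worker/node', '4')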

How to merge by a column of collection using Python Pandas?

I have 2 lists of Stack Overflow questions, group A and group B. Both have two columns, Id and Tag, e.g.:
|Id |Tag
| -------- | --------------------------------------------
|2 |c#,winforms,type-conversion,decimal,opacity
For each question in group A, I need to find in group B all matched questions that have at least one tag overlapping with the question in group A, independent of the position of the tags. For example, these questions should all be matched questions:
|Id |Tag
|----------|---------------------------
|3 |c#
|4 |winforms,type-conversion
|5 |winforms,c#
My first thought was to convert the Tag variable into a set and merge using Pandas, because a set ignores position. However, it seems that Pandas doesn't allow a set to be the key variable. So I am now using a for loop to search over group B. But it is extremely slow, since I have 13 million observations in group B.
My questions are:
1. Is there any other way in Python to merge on a column of collections that can also tell me the number of overlapping tags?
2. How can I improve the efficiency of the for-loop search?
This can be achieved using df.merge and df.groupby.
This is the setup I'm working with:
df1 = pd.DataFrame({ 'Id' : [2], 'Tag' : [['c#', 'winforms', 'type-conversion', 'decimal', 'opacity']]})
Id Tag
0 2 [c#, winforms, type-conversion, decimal, opacity]
df2 = pd.DataFrame({ 'Id' : [3, 4, 5], 'Tag' : [['c#'], ['winforms', 'type-conversion'], ['winforms', 'c#']]})
Id Tag
0 3 [c#]
1 4 [winforms, type-conversion]
2 5 [winforms, c#]
Let's flatten out the Tag column in both data frames:
In [2331]: import numpy as np; from itertools import chain
In [2332]: def flatten(df):
      ...:     # one row per (Id, tag) pair
      ...:     return pd.DataFrame({"Id": np.repeat(df.Id.values, df.Tag.str.len()),
      ...:                          "Tag": list(chain.from_iterable(df.Tag))})
      ...:
In [2333]: df1 = flatten(df1)
In [2334]: df2 = flatten(df2)
In [2335]: df1.head()
Out[2335]:
Id Tag
0 2 c#
1 2 winforms
2 2 type-conversion
3 2 decimal
4 2 opacity
And similarly for df2, which is also flattened.
Now for the magic. We'll do a join on the Tag column, and then group by the joined Ids to find the count of overlapping tags.
In [2337]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index()
Out[2337]:
Id_x Id_y Tag
0 2 3 1
1 2 4 2
2 2 5 2
The output shows each pair of Ids along with the number of overlapping tags. Pairs with no overlaps are filtered out by the inner merge.
The count call counts the overlapping tags, and reset_index just prettifies the output: groupby turns the grouped columns into the index, so we reset it.
To see matching tags, you'll modify the above slightly:
In [2359]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y'])['Tag'].apply(list).reset_index()
Out[2359]:
Id_x Id_y Tag
0 2 3 [c#]
1 2 4 [winforms, type-conversion]
2 2 5 [c#, winforms]
To filter out 1-overlaps, chain a df.query call to the first expression:
In [2367]: df1.merge(df2, on='Tag').groupby(['Id_x', 'Id_y']).count().reset_index().query('Tag > 1')
Out[2367]:
Id_x Id_y Tag
1 2 4 2
2 2 5 2
Step 1: list all the tags.
Step 2: create a binary representation of each tag, i.e. use bit 1 or 0 to represent whether a question has or does not have the tag.
Step 3: to find any Id that shares a tag, call a simple apply function to decode the binary representation.
In terms of processing speed, it should be all right. However, if the number of tags is too large, there might be memory issues. If you only need to find questions with a shared tag for a single Id, I would suggest writing a simple function and calling df.apply. If you need to check many Ids and find questions with shared tags, the above approach will be better; a rough sketch of it follows below.
(Intended to leave it as comment, but not enough reputation... sigh)
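For concreteness, a rough sketch of this one-hot/binary approach (an untested illustration; it uses pandas' str.get_dummies, and the variable names are made up):

import pandas as pd

dfA = pd.DataFrame({'Id': [2],
                    'Tag': ['c#,winforms,type-conversion,decimal,opacity']})
dfB = pd.DataFrame({'Id': [3, 4, 5],
                    'Tag': ['c#', 'winforms,type-conversion', 'winforms,c#']})

# One 0/1 column per tag, one row per question
onehot_A = dfA.set_index('Id')['Tag'].str.get_dummies(sep=',')
onehot_B = dfB.set_index('Id')['Tag'].str.get_dummies(sep=',')

# Align group B on group A's tags, then a matrix product counts the
# shared tags for every (A question, B question) pair
onehot_B = onehot_B.reindex(columns=onehot_A.columns, fill_value=0)
overlap = onehot_A.dot(onehot_B.T)   # rows: A Ids, columns: B Ids
print(overlap)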

Pandas (python): max in columns define new value in new column

I have a df with about 50 columns:
Product ID | Cat1 | Cat2 |Cat3 | ... other columns ...
8937456 0 5 10
8497534 25 3 0
8754392 4 15 7
Cat signifies how many quantities of that product fell into a category. Now I want to add a column "Category" denoting the majority Category for a product (ignoring the other columns and just considering the Cat columns).
df_goal:
Product ID | Cat1 | Cat2 |Cat3 | Category | ... other columns ...
8937456 0 5 10 3
8497534 25 3 0 1
8754392 4 15 7 2
I think I need to use max and apply or map?
I found these on Stack Overflow, but they don't address the category assignment. In Excel I renamed the columns from Cat 1 to 1 and used index(match(max)).
Python Pandas max value of selected columns
How should I take the max of 2 columns in a dataframe and make it another column?
Assign new value in DataFrame column based on group max
Here's a NumPy way with numpy.argmax -
df['Category'] = df.values[:,1:].argmax(1)+1
To restrict the selection to those columns, use those column headers/names specifically, then use idxmax, and finally replace the string Cat with an empty string, like so -
df['Category'] = df[['Cat1','Cat2','Cat3']].idxmax(1).str.replace('Cat','')
numpy.argmax or pandas' idxmax basically gets us the index of the max element along an axis.
If we know that the Cat columns sit in the 2nd through 4th positions, we can slice the dataframe with df.iloc[:, 1:4] instead of df[['Cat1','Cat2','Cat3']].
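Put together on the sample data, a quick sketch to check that the output matches df_goal:

import pandas as pd

df = pd.DataFrame({'Product ID': [8937456, 8497534, 8754392],
                   'Cat1': [0, 25, 4],
                   'Cat2': [5, 3, 15],
                   'Cat3': [10, 0, 7]})

cat_cols = ['Cat1', 'Cat2', 'Cat3']
df['Category'] = df[cat_cols].idxmax(axis=1).str.replace('Cat', '').astype(int)
print(df)
#    Product ID  Cat1  Cat2  Cat3  Category
# 0     8937456     0     5    10         3
# 1     8497534    25     3     0         1
# 2     8754392     4    15     7         2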

Make an equivalent pandas DataFrame for a SQL query that joins a table on itself, selecting all columns of the first table and some columns of the right table

I am working with Python pandas.
I have one table, table_one, which has the columns name, address, one, two, phone.
Now one is a foreign key on two.
I want pandas to do the join on this foreign key, and the resulting data frame should give a result like below:
Input data frame:
Id | name | Address | one | two | number
1  | test | addrs   | 1   | 2   | number
2  | fert | addrs   | 2   | 1   | testnumber
3  | dumy | addrs   | 3   | 9   | testnumber
Output should be: join this df (data frame) to itself and get the name for its foreign key, which is two.
o/p:
Get all columns of the left table and only name from the right table in pandas.
That means, for example, for row 1: one is the foreign key on two, so the resulting row will be
1 test addrs 1 2 number fert
The same goes for all rows: for row 1, the one value 1 is mapped to column two of row 2 (which has value 1 in column two), so take the name fert into the resulting new column.
I tried the below:
pd.merge(df, df, left_on=['one'], right_on=['two'])
but I am not getting the required result; it gives all columns for the right table as well, but I want only the name value along with all columns of the left table.
Any help will be appreciated.
Select the required columns before the merge (and rename name to avoid a conflict):
pd.merge(df, df[['two', 'name']].rename(columns={'name': 'for_name'}), left_on=['one'], right_on=['two'])
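On the sample data this would look roughly like the following sketch (untested against the real table; use how='left' if rows without a match should be kept):

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3],
                   'name': ['test', 'fert', 'dumy'],
                   'Address': ['addrs', 'addrs', 'addrs'],
                   'one': [1, 2, 3],
                   'two': [2, 1, 9],
                   'number': ['number', 'testnumber', 'testnumber']})

out = pd.merge(df,
               df[['two', 'name']].rename(columns={'name': 'for_name'}),
               left_on='one', right_on='two',
               suffixes=('', '_fk'))
out = out.drop(columns=['two_fk'])   # keep all left columns plus the looked-up name
print(out)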
