The code below gives me this table:
raw = pd.read_clipboard()
raw.head()
+---+---------------------+-------------+---------+----------+-------------+
| | Afghanistan | South Asia | 652225 | 26000000 | Unnamed: 4 |
+---+---------------------+-------------+---------+----------+-------------+
| 0 | Albania | Europe | 28728 | 3200000 | 6656000000 |
| 1 | Algeria | Middle East | 2400000 | 32900000 | 75012000000 |
| 2 | Andorra | Europe | 468 | 64000 | NaN |
| 3 | Angola | Africa | 1250000 | 14500000 | 14935000000 |
| 4 | Antigua and Barbuda | Americas | 442 | 77000 | 770000000 |
+---+---------------------+-------------+---------+----------+-------------+
But when I attempt to rename the columns and create a DataFrame, all of the data disappears:
df = pd.DataFrame(raw, columns = ['name', 'region', 'area', 'population', 'gdp'])
df.head()
+---+------+--------+------+------------+-----+
| | name | region | area | population | gdp |
+---+------+--------+------+------------+-----+
| 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | NaN |
+---+------+--------+------+------------+-----+
Any idea why?
The data disappears because the `columns` argument of the DataFrame constructor selects columns by name from the existing frame rather than renaming them; since none of your new names exist in raw, every column comes back as NaN. To rename, you should just write:
df.columns = ['name', 'region', ...]
This is also more efficient, since you aren't asking the constructor to build (and possibly copy) an entirely new DataFrame. Note as well that your first data row (Afghanistan) was consumed as the header; re-reading with pd.read_clipboard(header=None) would keep it.
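A minimal sketch of the fix, using the names from your constructor call:
raw.columns = ['name', 'region', 'area', 'population', 'gdp']

# Or, if you prefer a new object instead of renaming in place:
df = raw.set_axis(['name', 'region', 'area', 'population', 'gdp'], axis=1)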
I have the following dataframes in pandas:
df:
| ID | country | money | code | money_add | other |
| -------- | -------------- | --------- | -------- | --------- | ----- |
| 832932 | Other | NaN | 00000 | NaN | NaN |
| 217#8# | NaN | NaN | NaN | NaN | NaN |
| 1329T2 | France | 12131 | 00020 | 3452 | 123 |
| 124932 | France | NaN | 00016 | NaN | NaN |
| 194022 | France | NaN | 00000 | NaN | NaN |
df1:
| cod_t | money | money_add | other |
| -------- | ------ | --------- | ----- |
| 00000 | 4532 | 72323 | 321 |
| 00016 | 1213 | 23822 | 843 |
| 00018 | 1313 | 8393 | 183 |
| 00020 | 1813 | 27328 | 128 |
| 00030 | 8932 | 3204 | 829 |
cols = df.columns.intersection(df1.columns)
print(df[cols].dtypes.eq(df1[cols].dtypes))
money False
money_add False
other False
dtype: bool
I want to match the dtypes of the columns of the second dataframe to be equal to those of the first one. Is there any way to do this?
Try looping over the shared columns only (cod_t exists only in df1 and would raise a KeyError otherwise):
cols = df.columns.intersection(df1.columns)
for col in cols:
    df1[col] = df1[col].astype(df[col].dtype)
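Equivalently, a one-call sketch using astype with a dict of dtypes, again restricted to the shared columns:
cols = df.columns.intersection(df1.columns)
df1 = df1.astype(df.dtypes[cols].to_dict())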
I tried scraping a table from the NBA site but got the error `no tables found`.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = 'https://www.nba.com/game/0022101100/play-by-play?latest=0'
html = requests.get(url).content
df_list = pd.read_html(html)
How do I go about getting the play-by-play table?
As stated, that data is dynamically rendered. You could a) use Selenium to simulate opening the browser, allow the page to render, then use pandas to parse the table tags, or b) use the NBA API and get the data in JSON format.
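For option a), a rough sketch (assumes a local Chrome/chromedriver setup; note the rendered page may not contain real table tags at all, in which case read_html will still find nothing):
import time
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.nba.com/game/0022101100/play-by-play?latest=0')
time.sleep(5)  # crude wait for the client-side render; an explicit wait is more robust
html = driver.page_source
driver.quit()

df_list = pd.read_html(html)  # parses whatever <table> tags exist in the rendered DOM
Option b), hitting the NBA's CDN play-by-play feed directly: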
import requests
import pandas as pd
gameId = '0022101100'
url = f'https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_{gameId}.json'
jsonData = requests.get(url).json()
# each play sits under game -> actions in the JSON payload
df = pd.json_normalize(jsonData,
                       record_path=['game', 'actions'])
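A quick sanity check; these column names are assumed to match the fields shown in the output table further down:
print(df.shape)
print(df[['clock', 'description', 'scoreHome', 'scoreAway']].head())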
Here is a second option, pulling the JSON embedded in the page's __NEXT_DATA__ script tag instead of calling the CDN:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import json
gameId = '0021900709'
url = f'https://www.nba.com/game/{gameId}/play-by-play'
headers = {
'referer': 'https://www.nba.com/',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
jsonStr = soup.find('script', {'id':'__NEXT_DATA__'}).text
jsonData = json.loads(jsonStr)['props']['pageProps']['playByPlay']
df = pd.json_normalize(jsonData, record_path=['actions'])
Output: first 10 rows of 548
print(df.head(10).to_markdown())
| | actionNumber | clock | timeActual | period | periodType | actionType | subType | qualifiers | personId | x | y | possession | scoreHome | scoreAway | edited | orderNumber | xLegacy | yLegacy | isFieldGoal | side | description | personIdsFilter | teamId | teamTricode | descriptor | jumpBallRecoveredName | jumpBallRecoverdPersonId | playerName | playerNameI | jumpBallWonPlayerName | jumpBallWonPersonId | jumpBallLostPlayerName | jumpBallLostPersonId | shotDistance | shotResult | pointsTotal | assistPlayerNameInitial | assistPersonId | assistTotal | officialId | foulPersonalTotal | foulTechnicalTotal | foulDrawnPlayerName | foulDrawnPersonId | shotActionNumber | reboundTotal | reboundDefensiveTotal | reboundOffensiveTotal | turnoverTotal | stealPlayerName | stealPersonId | blockPlayerName | blockPersonId | value |
|---:|---------------:|:------------|:-----------------------|---------:|:-------------|:-------------|:----------|:---------------------|-----------:|---------:|---------:|-------------:|------------:|------------:|:---------------------|--------------:|----------:|----------:|--------------:|:-------|:-------------------------------------------------------|:--------------------------|--------------:|:--------------|:-------------|:------------------------|---------------------------:|:-------------|:---------------|:------------------------|----------------------:|:-------------------------|-----------------------:|---------------:|:-------------|--------------:|:--------------------------|-----------------:|--------------:|-------------:|--------------------:|---------------------:|:----------------------|--------------------:|-------------------:|---------------:|------------------------:|------------------------:|----------------:|------------------:|----------------:|------------------:|----------------:|--------:|
| 0 | 2 | PT12M00.00S | 2022-03-25T23:10:44.0Z | 1 | REGULAR | period | start | [] | 0 | nan | nan | 0 | 0 | 0 | 2022-03-25T23:10:44Z | 20000 | nan | nan | 0 | | Period Start | [] | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 1 | 4 | PT11M55.00S | 2022-03-25T23:10:47.2Z | 1 | REGULAR | jumpball | recovered | [] | 1626220 | nan | nan | 1610612762 | 0 | 0 | 2022-03-25T23:10:47Z | 40000 | nan | nan | 0 | | Jump Ball R. Gobert vs. M. Plumlee: Tip to R. O'Neale | [1626220, 203497, 203486] | 1.61061e+09 | UTA | startperiod | R. O'Neale | 1.62622e+06 | O'Neale | R. O'Neale | Gobert | 203497 | Plumlee | 203486 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 2 | 7 | PT11M36.00S | 2022-03-25T23:11:06.3Z | 1 | REGULAR | 2pt | DUNK | ['pointsinthepaint'] | 203497 | 92.8548 | 47.0588 | 1610612762 | 0 | 2 | 2022-03-25T23:11:12Z | 70000 | -15 | 15 | 1 | right | R. Gobert DUNK (2 PTS) (D. Mitchell 1 AST) | [203497, 1628378] | 1.61061e+09 | UTA | nan | nan | nan | Gobert | R. Gobert | nan | nan | nan | nan | 2.08 | Made | 2 | D. Mitchell | 1.62838e+06 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 3 | 9 | PT11M21.00S | 2022-03-25T23:11:25.8Z | 1 | REGULAR | foul | personal | ['2freethrow'] | 203497 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:38Z | 90000 | nan | nan | 0 | | R. Gobert shooting personal FOUL (1 PF) (Plumlee 2 FT) | [203497, 203486] | 1.61061e+09 | UTA | shooting | nan | nan | Gobert | R. Gobert | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 200832 | 1 | 0 | Plumlee | 203486 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 4 | 11 | PT11M21.00S | 2022-03-25T23:11:50.7Z | 1 | REGULAR | freethrow | 1 of 2 | [] | 203486 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:50Z | 110000 | nan | nan | 0 | | MISS M. Plumlee Free Throw 1 of 2 | [203486] | 1.61061e+09 | CHA | nan | nan | nan | Plumlee | M. Plumlee | nan | nan | nan | nan | nan | Missed | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 5 | 12 | PT11M21.00S | 2022-03-25T23:11:50.7Z | 1 | REGULAR | rebound | offensive | ['deadball', 'team'] | 0 | nan | nan | 1610612766 | 0 | 2 | 2022-03-25T23:11:50Z | 120000 | nan | nan | 0 | | TEAM offensive REBOUND | [] | 1.61061e+09 | CHA | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 11 | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 6 | 13 | PT11M21.00S | 2022-03-25T23:12:06.4Z | 1 | REGULAR | freethrow | 2 of 2 | [] | 203486 | nan | nan | 1610612766 | 1 | 2 | 2022-03-25T23:12:06Z | 130000 | nan | nan | 0 | | M. Plumlee Free Throw 2 of 2 (1 PTS) | [203486] | 1.61061e+09 | CHA | nan | nan | nan | Plumlee | M. Plumlee | nan | nan | nan | nan | nan | Made | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 7 | 14 | PT11M06.00S | 2022-03-25T23:12:22.2Z | 1 | REGULAR | 3pt | Jump Shot | [] | 1626220 | 69.7273 | 75.2451 | 1610612762 | 1 | 2 | 2022-03-25T23:12:29Z | 140000 | 126 | 232 | 1 | right | MISS R. O'Neale 26' 3PT | [1626220] | 1.61061e+09 | UTA | nan | nan | nan | O'Neale | R. O'Neale | nan | nan | nan | nan | 26.42 | Missed | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| 8 | 15 | PT11M02.00S | 2022-03-25T23:12:26.2Z | 1 | REGULAR | rebound | offensive | [] | 1627823 | nan | nan | 1610612762 | 1 | 2 | 2022-03-25T23:12:29Z | 150000 | nan | nan | 0 | | J. Hernangomez REBOUND (Off:1 Def:0) | [1627823] | 1.61061e+09 | UTA | nan | nan | nan | Hernangomez | J. Hernangomez | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 14 | 1 | 0 | 1 | nan | nan | nan | nan | nan | nan |
| 9 | 16 | PT10M56.00S | 2022-03-25T23:12:33.1Z | 1 | REGULAR | 3pt | Jump Shot | ['2ndchance'] | 1628378 | 68.6761 | 70.098 | 1610612762 | 1 | 5 | 2022-03-25T23:12:38Z | 160000 | 100 | 242 | 1 | right | D. Mitchell 26' 3PT (3 PTS) (J. Hernangomez 1 AST) | [1628378, 1627823] | 1.61061e+09 | UTA | nan | nan | nan | Mitchell | D. Mitchell | nan | nan | nan | nan | 26.19 | Made | 3 | J. Hernangomez | 1.62782e+06 | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
The data is loaded dynamically from an API in JSON format, so you can extract it using json() and pandas as follows:
import requests
import pandas as pd
api_url = 'https://cdn.cookielaw.org/vendorlist/iab2Data.json'
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
}
req = requests.get(api_url, headers=headers).json()
df = pd.DataFrame(req)
print(df)
Output:
gvlSpecificationVersion tcfPolicyVersion ... vendorListVersion lastUpdated
1 2 2 ... 159 2022-09-01T16:05:33Z
2 2 2 ... 159 2022-09-01T16:05:33Z
3 2 2 ... 159 2022-09-01T16:05:33Z
4 2 2 ... 159 2022-09-01T16:05:33Z
5 2 2 ... 159 2022-09-01T16:05:33Z
... ... ... ... ... ...
1146 2 2 ... 159 2022-09-01T16:05:33Z
1147 2 2 ... 159 2022-09-01T16:05:33Z
1148 2 2 ... 159 2022-09-01T16:05:33Z
1149 2 2 ... 159 2022-09-01T16:05:33Z
1150 2 2 ... 159 2022-09-01T16:05:33Z
[907 rows x 10 columns]
I have a dataframe. I need to drop the duplicates of ticket_id if the owner_type is the same, and if not, pick 'm' over 's'; if no value is picked then NaN is returned:
data = pd.DataFrame({'owner_type':['m','m','m','s','s','m','s','s'],'ticket_id':[1,1,2,2,3,3,4,4]})
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | 1 |
| 1 | m | 1 |
| 2 | m | 2 |
| 3 | s | 2 |
| 4 | s | 3 |
| 5 | m | 3 |
| 6 | s | 4 |
| 7 | s | 4 |
Should give back:
| | owner_type | ticket_id |
|---:|:-------------|------------:|
| 0 | m | NaN |
| 1 | m | NaN |
| 2 | m | 2 |
| 3 | s | NaN |
| 4 | s | NaN |
| 5 | m | 3 |
| 6 | s | NaN |
| 7 | s | 4 |
Pseudo code would be: if ticket_id is duplicated, look at owner_type; if owner_type has more than one value, return the ticket_id for 'm' and NaN for 's'.
My attempt
data.groupby('ticket_id').apply(lambda x: x['owner_type'] if len(x) < 2 else NaN)
Not working
Try this ('min' picks 'm' whenever a ticket has both owner types, since 'm' sorts before 's'):
data['ticket_id'].where(
    ~data.duplicated(['owner_type', 'ticket_id'], keep=False)
    & data['owner_type'].eq(data.groupby('ticket_id')['owner_type'].transform('min')))
Old answer:
m = ~data.duplicated(keep=False) & data['owner_type'].eq('m')
data['ticket_id'].where(m)
Output:
0 NaN
1 NaN
2 2.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
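If the mask version is hard to read, here is a more explicit sketch of the same logic, spelled out per group (it assumes 'm' should win only when a ticket carries both owner types):
import numpy as np
import pandas as pd

def pick(group):
    # keep the ticket_id only where the group's owner types differ and this row is 'm'
    if group['owner_type'].nunique() == 2:
        return group['ticket_id'].where(group['owner_type'].eq('m'))
    return pd.Series(np.nan, index=group.index)

data.groupby('ticket_id', group_keys=False).apply(pick)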
Please help.
I have a dataframe like
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-20 21:24:03.390000 | 2020-10-20 23:46:36.990000 |
| 1 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 04:36:03.390000 | 2020-10-21 06:58:36.990000 |
| 2 | 12345 | nan | 49584 | 2827 | nan | nan | nan | 2020-10-21 09:24:03.390000 | 2020-10-21 11:46:36.990000 |
| 3 | 12345 | nan | nan | nan | 3940 | nan | nan | 2020-10-21 14:12:03.390000 | 2020-10-21 16:34:36.990000 |
| 4 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-21 21:24:03.390000 | 2020-10-21 23:46:36.990000 |
| 5 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 02:40:51.390000 | 2020-10-22 05:03:24.990000 |
| 6 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-22 08:26:27.390000 | 2020-10-22 10:49:00.990000 |
| 7 | 12345 | Pass | nan | nan | nan | 392 | 304 | 2020-10-22 14:12:03.390000 | 2020-10-22 16:34:36.990000 |
| 8 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-22 19:57:39.390000 | 2020-10-22 22:20:12.990000 |
| 9 | 12346 | nan | 22839 | 4059 | nan | nan | nan | 2020-10-23 01:43:15.390000 | 2020-10-23 04:05:48.990000 |
| 10 | 12346 | nan | nan | nan | 4059 | nan | nan | 2020-10-23 07:28:51.390000 | 2020-10-23 09:51:24.990000 |
| 11 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 13:14:27.390000 | 2020-10-23 15:37:00.990000 |
| 12 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-23 19:00:03.390000 | 2020-10-23 21:22:36.990000 |
| 13 | 12346 | nan | nan | nan | nan | nan | nan | 2020-10-24 00:45:39.390000 | 2020-10-24 03:08:12.990000 |
| 14 | 12346 | Fail | nan | nan | nan | 2938 | 495 | 2020-10-24 06:31:15.390000 | 2020-10-24 08:53:48.990000 |
| 15 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-24 12:16:51.390000 | 2020-10-24 14:39:24.990000 |
| 16 | 12345 | nan | 62839 | 1827 | nan | nan | nan | 2020-10-24 18:02:27.390000 | 2020-10-24 20:25:00.990000 |
| 17 | 12345 | nan | nan | nan | 2726 | nan | nan | 2020-10-24 23:48:03.390000 | 2020-10-25 02:10:36.990000 |
| 18 | 12345 | nan | nan | nan | nan | nan | nan | 2020-10-25 05:33:39.390000 | 2020-10-25 07:56:12.990000 |
| 19 | 12345 | Fail | nan | nan | nan | nan | 1827 | 2020-10-25 11:19:15.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
and want my output to look like
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
| | ID | Result | measurement_1 | measurement_2 | measurement_3 | measurement_4 | measurement_5 | start_time | end-time |
|----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------|
| 0 | 12345 | Pass | 49584 | 2827 | 3940 | 392 | 304 | 2020-10-20 21:24:03.390000 | 2020-10-22 16:34:36.990000 |
| 1 | 12346 | Fail | 22839 | 4059 | 4059 | 2938 | 495 | 2020-10-22 19:57:39.390000 | 2020-10-24 08:53:48.990000 |
| 2 | 12345 | Fail | 62839 | 1827 | 2726 | nan | 1827 | 2020-10-24 12:16:51.390000 | 2020-10-25 13:41:48.990000 |
+----+-------+----------+-----------------+-----------------+-----------------+-----------------+-----------------+----------------------------+----------------------------+
So far I am able to group the columns on `ID` and `Result`. Now I want to apply the coalesce to it (newDf):
df = pd.read_excel("Test_Coalesce.xlsx")
newDf = df.groupby(['ID','Result'])
newDf.all().reset_index()
It looks like you want to groupby consecutive blocks of ID. If so:
blocks = df['ID'].ne(df['ID'].shift()).cumsum()  # new block each time ID changes
agg_dict = {k: 'first' if k != 'end-time' else 'last'
            for k in df.columns}
df.groupby(blocks).agg(agg_dict)
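This behaves as a coalesce because groupby's 'first' and 'last' skip NaN and return the first/last non-null value in each block. A tiny demonstration:
import pandas as pd

s = pd.Series([None, 49584.0, None])
print(s.groupby([1, 1, 1]).first())  # -> 49584.0, the first non-null value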
My current data looks something like this
+-------+----------------------------+-------------------+-----------------------+
| Index | 0 | 1 | 2 |
+-------+----------------------------+-------------------+-----------------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date |
| 1 | V50011 Tech Comp | nan | Phone:0177222222 |
| 2 | Regis Place | nan | Fax:017757575789 |
| 3 | Catenberry | nan | nan |
| 4 | Manhattan, NY | nan | nan |
| 5 | V7484 Pipe | nan | Phone: |
| 6 | Japan | nan | nan |
| 7 | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan |
+-------+----------------------------+-------------------+-----------------------+
I am trying to create a new column, df['Company'], that should contain what is in df[0] if it starts with a "V" and df[2] has "Phone" in it. If the condition is not satisfied, it can be nan. Below is what I am looking for.
+-------+----------------------------+-------------------+-----------------------+------------+
| Index | 0 | 1 | 2 | Company |
+-------+----------------------------+-------------------+-----------------------+------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date | nan |
| 1 | V50011 Tech | nan | Phone:0177222222 |V50011 Tech |
| 2 | Regis Place | nan | Fax:017757575789 | nan |
| 3 | Catenberry | nan | nan | nan |
| 4 | Manhattan, NY | nan | nan | nan |
| 5 | V7484 Pipe | nan | Phone: | V7484 Pipe |
| 6 | Japan | nan | nan | nan |
| 7 | nan | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan | nan |
+-------+----------------------------+-------------------+-----------------------+------------+
I am trying the below script but I get an error ValueError: Wrong number of items passed 1420, placement implies 1
df['Company'] = pd.np.where(df[2].str.contains("Ph"), df[0].str.extract(r"(^V[A-Za-z0-9]+)"),"stop")
I put in "stop" as the else part because I don't know how to let python use nan when the condition is not met.
I would also like to be able to parse out a section of df[0], for example just the v5001 section but not the rest of the cell contents. I tried something like this using AMC's answer but got an error:
df.loc[df[0].str.startswith('V') & df[2].str.contains('Phone'), 'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)")
Thank you
You haven't provided an easy way for us to test potential solutions, but this should do the job (na=False stops the NaN rows from breaking the boolean mask):
df.loc[df[0].str.startswith('V', na=False) & df[2].str.contains('Phone', na=False), 'Company'] = df[0]
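For the follow-up (keeping only the leading V-token), the error comes from str.extract returning a DataFrame by default; expand=False returns a Series that .loc can assign, using the same regex from the question:
df.loc[df[0].str.startswith('V', na=False) & df[2].str.contains('Phone', na=False),
       'Company'] = df[0].str.extract(r"(^V[A-Za-z0-9]+)", expand=False)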
A potential solution would be to use a list comprehension. You could probably get a speed boost using some of pandas' built-in functions, but this will get you there.
#!/usr/bin/env python
import numpy as np
import pandas as pd
df = pd.DataFrame({
    0: ["reference", "v5001 tech comp", "catenberry", "very different"],
    1: ["not", "phone", "other", "text"]
})
df["new_column"] = [x if (x[0].lower() == "v") & ("phone" in y.lower())
                    else np.nan for x, y in df.loc[:, [0, 1]].values]
print(df)
Which will produce
0 1 new_column
0 reference not NaN
1 v5001 tech comp phone v5001 tech comp
2 catenberry other NaN
3 very different text NaN
All I'm doing is taking your two conditions and building a new list which will then be assigned to your new column.
Here's another way to get your result:
import numpy as np

condition1 = df[0].str.startswith('V', na=False)
condition2 = df[2].str.contains('Phone', na=False)
df['Company'] = np.where(condition1 & condition2, df[0], np.nan)
# split(...).str[0] keeps just the leading token; split(' ', expand=True)
# returns a DataFrame and cannot be assigned to a single column
df['Company'] = df['Company'].str.split(' ').str[0]
You can do it with the pandas apply function:
import re
import numpy as np
import pandas as pd

df['Company'] = df.apply(
    lambda x: x[0].split()[0]
    if re.match(r'^v[A-Za-z0-9]+', x[0].lower()) and 'phone' in x[1].lower()
    else np.nan,
    axis=1)
Edit: to adjust to the comment under #AMC's answer.
IIUC,
we can either use a boolean condition to extract the V number with some basic regex,
or apply the same formula within a where statement.
To set a value to NaN we can use np.nan.
If you want to grab the entire string after the V, we can use [V]\w+.*, which will grab everything after the first match.
from io import StringIO
d = """+-------+----------------------------+-------------------+-----------------------+
| Index | 0 | 1 | 2 |
+-------+----------------------------+-------------------+-----------------------+
| 0 | Reference Curr | Daybook / Voucher | Invoice Date Due Date |
| 1 | V50011 Tech Comp | nan | Phone:0177222222 |
| 2 | Regis Place | nan | Fax:017757575789 |
| 3 | Catenberry | nan | nan |
| 4 | Manhattan, NY | nan | nan |
| 5 | Ultilagro, CT | nan | nan |
| 6 | Japan | nan | nan |
| 7 | nan | nan | nan |
| 8 | 4543.34GBP (British Pound) | nan | nan |
+-------+----------------------------+-------------------+-----------------------+"""
df = pd.read_csv(StringIO(d), sep='|', skiprows=1)
df = df.iloc[1:-1, 2:-1]  # drop the ASCII border rows and the empty edge columns
df.columns = df.columns.str.strip()
df["3"] = df[df["2"].str.contains("phone", case=False) == True]["0"].str.extract(
r"([V]\w+)"
)
print(df[['0','2','3']])
0 2 3
1 Reference Curr Invoice Date Due Date nan
2 V50011 Tech Comp Phone:0177222222 V50011
3 Regis Place Fax:017757575789 nan
4 Catenberry nan nan
5 Manhattan, NY nan nan
6 Ultilagro, CT nan nan
7 Japan nan nan
8 nan nan nan
9 4543.34GBP (British Pound) nan nan
if you want as a where statement:
import numpy as np
df["3"] = np.where(
df[df["2"].str.contains("phone", case=False)], df["0"].str.extract(r"([V]\w+)"), np.nan
)
print(df[['0','2','3']])
0 2 3
1 Reference Curr Invoice Date Due Date NaN
2 V50011 Tech Comp Phone:0177222222 V50011
3 Regis Place Fax:017757575789 NaN
4 Catenberry nan NaN
5 Manhattan, NY nan NaN
6 Ultilagro, CT nan NaN
7 Japan nan NaN
8 nan nan NaN
9 4543.34GBP (British Pound) nan NaN