Extract strings from a Dataframe looping over a single row - python

I'm reading multiple PDFs (using tabula) into data frames like this:
nan = float('nan')
DataFrame_as_dict = {'Unnamed: 0': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 1': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'NOTA DE NEGOCIAÇÃO': {0: nan,
1: nan,
2: 'Rico Investimentos - Grupo XP',
3: 'Av. Presidente Juscelino Kubitschek - Torre Sul, 1909 - 25o ANDAR VILA OLIMPIA 4543-907',
4: 'Tel. 3003-5465Fax: (55 11) 4007-2465',
5: 'Internet: www.rico.com.vc SAC: 0800-774-0402 e-mail: atendimento#rico.com.vc'},
'Unnamed: 3': {0: 'Nr. nota Folha',
1: '330736 1',
2: nan,
3: 'SÃO PAULO - SP',
4: nan,
5: nan},
'Unnamed: 4': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan, 5: nan},
'Unnamed: 5': {0: 'Data pregão',
1: '30/09/2015',
2: nan,
3: nan,
4: nan,
5: nan}}
df = pd.DataFrame(DataFrame_as_dict)
(figure: the DataFrame above rendered as a table)
My intention is to store the value '330736 1' in a variable "number" and '30/09/2015' in a variable "date".
The issue is that, although these values will always be located in row 1, the columns vary in an unpredictable way across the multiple PDFs.
Therefore, I tried to loop over the columns of row 1 in order to extract these values regardless of which column they end up in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i], str):
        if df.iloc[1:2,i].str.contains("/",na=False,regex=False).any():
            date = str(df.iloc[1:2,i]).strip()
        else:
            n_nota = str(df.iloc[1:2,i]).strip()
However, without success... Any thoughts?

In your original code, if isinstance(df.iloc[1:2,i], str) will never evaluate to True for two reasons:
Strings inside DataFrames are of type object
df.iloc[1:2,i] will always be a pandas Series.
Since object is such a flexible type, it's not as useful as str for identifying the data you want. In the code below, I simply used a space character to differentiate the data you want for n_nota. If this doesn't work with your data, a regex pattern may be a good approach.
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1:2,i].values, object):
        if "/" in str(df.iloc[1:2,i].values):
            date = str(df.iloc[1:2,i].values[0]).strip()
        elif " " in str(df.iloc[1:2,i].values):
            n_nota = str(df.iloc[1:2,i].values[0]).strip()
Edit: As noted below, the original code in the question text would have worked if each df.iloc[1:2,i] were replaced with df.iloc[1,i] as in:
list_columns = df.columns
for i in range(len(list_columns)):
    if isinstance(df.iloc[1,i], str):
        if "/" in df.iloc[1,i]:
            date = df.iloc[1,i].strip()
        else:
            n_nota = df.iloc[1,i].strip()
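If the loop isn't needed at all, a vectorized variant along the lines of the regex suggestion above can also work. This is a hedged sketch, not part of the original answer; it assumes the df built from the question and that row 1 holds exactly one date-like string plus the note string:

row = df.iloc[1]
strings = row[row.apply(lambda v: isinstance(v, str))]  # keep only the string cells

date = strings[strings.str.contains(r"\d{2}/\d{2}/\d{4}")].iloc[0].strip()
n_nota = strings[~strings.str.contains(r"\d{2}/\d{2}/\d{4}")].iloc[0].strip()
# date   -> '30/09/2015'
# n_nota -> '330736 1'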

Related

Filtering by rows Pandas DataFrame [duplicate]

This question already has answers here:
Use a list of values to select rows from a Pandas dataframe
(8 answers)
I want to filter the pandas DataFrame so that it keeps only the rows whose Symbol appears in the rows list and drops everything else. How would I be able to do that and get the expected output?
import pandas as pd
from numpy import nan
data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
'Number of Buy s': {0: nan, 1: 2.0, 2: nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: nan},
'Number of Sell s': {0: 1.0, 1: nan, 2: 1.0, 3: nan, 4: nan, 5: nan, 6: nan, 7: 1.0},
'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59}, 'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
Expected Output:
Use .isin()
data[data['Symbol'].isin(rows)]
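Applied to the sample data above, this keeps only the three symbols listed in rows (a quick usage sketch; the printed alignment is approximate):

filtered = data[data['Symbol'].isin(rows)]
print(filtered)
#   Symbol  Number of Buy s  Number of Sell s  Gains/Losses  Percentage change
# 0   ABNB              NaN               1.0       2106.00                0.0
# 1   DKNG              2.0               NaN      -1479.20                2.0
# 2   EXPE              NaN               1.0       1863.18                0.0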

I am trying to plot countries on a world map. The problem is that only the USA is shown in the output.

import plotly.express as px

fig = px.scatter_geo(df, locations="country", color="country",
                     projection="natural earth")
fig.show()
In the output I do get the world map, and all the countries appear in the legend; the problem is that they are not drawn on the map itself.
Here is a snapshot of the sample data:
{'id': {0: '72b83200-4881-4806-b910-af86905256c4',
1: '5db5df19-c06b-489a-b2f4-c2ffc26643ba',
2: '6c9e4f0d-ef87-497f-97af-df207a25331d',
3: '004bf779-368d-47ae-b3cc-07b0ecad2464',
4: '8a2265d9-1f81-4c47-953f-0d4bfab326c0'},
'name': {0: 'BALCO BRANDS PTY LTD',
1: 'Bambury',
2: 'Bata Shoe Company of Australia',
3: 'Bean Body Care',
4: 'Caprice Australia '},
'canonical_name': {0: 'balcobrands',
1: 'bambury',
2: 'batashoecompanyofaustralia',
3: 'beanbodycare',
4: 'capriceaustralia'},
'url': {0: 'http://www.balcobrands.com',
1: 'http://www.bambury.com.au',
2: 'http://www.bataindustrials.com.au',
3: 'https://global.beanbodycare.com',
4: 'http://www.caprice.com.au'},
'type': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3},
'address': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'city': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'state': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'country': {0: 'Australia',
1: 'Australia',
2: 'Australia',
3: 'Australia',
4: 'Australia'},
'country_code': {0: 'AU', 1: 'AU', 2: 'AU', 3: 'AU', 4: 'AU'},
'created_at': {0: '2020-04-01 20:52:38.098099',
1: '2020-04-01 20:52:38.364935',
2: '2020-04-01 20:52:38.636768',
3: '2020-04-01 20:52:38.951573',
4: '2020-04-01 20:52:39.271376'},
'created_by': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan},
'updated_at': {0: '2020-04-01 20:52:38.098099',
1: '2020-04-01 20:52:38.364935',
2: '2020-04-01 20:52:38.636768',
3: '2020-04-01 20:52:38.951573',
4: '2020-04-01 20:52:39.271376'},
'updated_by': {0: nan, 1: nan, 2: nan, 3: nan, 4: nan}}
The data did not contain the three-letter ISO-3166 alpha-3 country codes that plotly uses for locations by default. Once the data was merged with another dataset that had those codes, the required output was obtained.
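A hedged sketch of that fix (the mapping dict and the iso_alpha3 column name are illustrative, not from the original post): px.scatter_geo interprets locations as ISO-3166 alpha-3 codes by default, so the alpha-2 codes in country_code need to be converted first.

import plotly.express as px

# Illustrative alpha-2 -> alpha-3 mapping; in practice, extend it or merge a
# lookup dataset (as described above) so every country in the data is covered.
alpha2_to_alpha3 = {'AU': 'AUS', 'US': 'USA'}
df['iso_alpha3'] = df['country_code'].map(alpha2_to_alpha3)

fig = px.scatter_geo(df, locations='iso_alpha3', color='country',
                     projection='natural earth')
fig.show()

# Alternatively, plain country names work too if the location mode is set:
# px.scatter_geo(df, locations='country', locationmode='country names', ...)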

Getting first non null value after group by function

I would like to return the first non null value of the utm_source column from each group after running a group by function.
This is the code I have written:
file[file['steps'] == 'Sign-ups'].sort_values(by=['ts']).groupby('anonymous_id')['utm_source'].apply(lambda x: x.first_valid_index())
This seems to return this:
anonymous_id
00003df1-be12-47b8-b3b8-d01c84a22fdf NaN
00009cc0-279f-4ccf-aea4-f6af1f2bb75a NaN
0000a6a0-00bc-475f-a9e5-9dcbb4309e78 NaN
0000c906-7060-4521-8090-9cd600b08974 638.0
0000c924-5959-4e2d-8757-0d10f96ca462 NaN
0000dc27-292c-4676-8a1b-4977f2ad1577 275.0
0000df7e-2579-4071-8aa5-814ab294bf9a 419.0
I am not quite sure what the values associated with the anon_id's are.
Here is a sample of my data:
{'anonymous_id': {0: '0000f8ea-3aa6-4423-9247-1d9580d378e1',
1: '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
2: '0002226e-26a4-4f55-9578-2eff2999de7e',
3: '00022b83-240e-4ef9-aaad-ac84064bb902',
4: '00022b83-240e-4ef9-aaad-ac84064bb902'},
'ts': {0: '2018-04-11 06:59:20.206000',
1: '2019-05-18 05:59:11.874000',
2: '2018-09-10 18:19:25.260000',
3: '2017-10-11 08:20:18.092000',
4: '2017-10-11 08:20:31.466000'},
'utm_source': {0: nan, 1: 'facebook', 2: 'facebook', 3: nan, 4: nan},
'rank': {0: 1, 1: 1, 2: 1, 3: 1, 4: 2},
'steps': {0: 'Sign-ups', 1: nan, 2: nan, 3: nan, 4: nan}}
So for each anonymous_id I would like to return the first (chronological, sorted by the ts column) utm_source associated with that anon_id.
Your apply(lambda x: x.first_valid_index()) returns the index label of the first non-null entry in each group (which is where those numbers and NaNs come from), not the utm_source value itself. IIUC, you can instead drop the null values first and then take the first value per group:
df.sort_values('ts').dropna(subset=['utm_source']).groupby('anonymous_id')['utm_source'].first()
Output for your example data:
anonymous_id
00015d49-2cd8-41b1-bbe7-6aedbefdb098 facebook
0002226e-26a4-4f55-9578-2eff2999de7e facebook
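For reference, a self-contained check against the sample data in the question (a sketch; only the columns the answer touches are rebuilt here):

import pandas as pd
from numpy import nan

df = pd.DataFrame({
    'anonymous_id': ['0000f8ea-3aa6-4423-9247-1d9580d378e1',
                     '00015d49-2cd8-41b1-bbe7-6aedbefdb098',
                     '0002226e-26a4-4f55-9578-2eff2999de7e',
                     '00022b83-240e-4ef9-aaad-ac84064bb902',
                     '00022b83-240e-4ef9-aaad-ac84064bb902'],
    'ts': ['2018-04-11 06:59:20.206000', '2019-05-18 05:59:11.874000',
           '2018-09-10 18:19:25.260000', '2017-10-11 08:20:18.092000',
           '2017-10-11 08:20:31.466000'],
    'utm_source': [nan, 'facebook', 'facebook', nan, nan],
})

print(df.sort_values('ts')
        .dropna(subset=['utm_source'])
        .groupby('anonymous_id')['utm_source']
        .first())
# anonymous_id
# 00015d49-2cd8-41b1-bbe7-6aedbefdb098    facebook
# 0002226e-26a4-4f55-9578-2eff2999de7e    facebook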

Parallelizing a for loop in python

I have a dictionary where each key (a date) maps to a table (multiple lists of the form [day1, val11, val21], [day2, val12, val22], [day3, val13, val23], ...). I want to transform it into a DataFrame; this is done with the following code:
df4 = pd.DataFrame(columns=sorted(set_days))
for date in dic.keys():
    days = [day for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    df4.loc[date, days] = val1
This code works fine, but it takes more than two hours to run.
After some research, I've realized I could parallelize it via the multiprocessing library; the following code is the intended parallel version
import multiprocessing

def func(date):
    global df4, dic
    days = [day for day, val1, val2 in dic[date]]
    val1 = [val1 for day, val1, val2 in dic[date]]
    df4.loc[date, days] = val1

multiprocessing.Pool(processes=8).map(func, dic.keys())
The problem with this code is that, after executing multiprocessing.Pool(processes..., the df4 DataFrame is empty.
Any help would be much appreciated.
Example
Suppose the dictionary contains two days:
dic['20030812'][:4]
Out: [[1, 24.25, 0.0], [20, 23.54, 23.54], [30, 23.13, 24.36], [50, 22.85, 23.57]]
dic['20030813'][:4]
Out: [[1, 24.23, 0.0], [19, 23.4, 22.82], [30, 22.97, 24.19], [49, 22.74, 23.25]]
then the DataFrame should be of the form:
df4.loc[:, 1:50]
1 2 3 4 5 ... 46 47 48 49 50
20030812 24.25 NaN NaN NaN NaN ... NaN NaN NaN NaN 22.85
20030813 24.23 NaN NaN NaN NaN ... NaN NaN NaN 22.74 NaN
Also,
dic.keys()
Out[36]: dict_keys(['20030812', '20030813'])
df1.head().to_dict()
Out:
{1: {'20030812': 24.25, '20030813': 24.23},
2: {'20030812': nan, '20030813': nan},
3: {'20030812': nan, '20030813': nan},
4: {'20030812': nan, '20030813': nan},
5: {'20030812': nan, '20030813': nan},
6: {'20030812': nan, '20030813': nan},
7: {'20030812': nan, '20030813': nan},
8: {'20030812': nan, '20030813': nan},
9: {'20030812': nan, '20030813': nan},
10: {'20030812': nan, '20030813': nan},
11: {'20030812': nan, '20030813': nan},
12: {'20030812': nan, '20030813': nan},
13: {'20030812': nan, '20030813': nan},
14: {'20030812': nan, '20030813': nan},
15: {'20030812': nan, '20030813': nan},
16: {'20030812': nan, '20030813': nan},
17: {'20030812': nan, '20030813': nan},
18: {'20030812': nan, '20030813': nan},
19: {'20030812': nan, '20030813': 23.4},
20: {'20030812': 23.54, '20030813': nan},
21: {'20030812': nan, '20030813': nan},
22: {'20030812': nan, '20030813': nan},
23: {'20030812': nan, '20030813': nan},
24: {'20030812': nan, '20030813': nan},
25: {'20030812': nan, '20030813': nan},
26: {'20030812': nan, '20030813': nan},
27: {'20030812': nan, '20030813': nan},
28: {'20030812': nan, '20030813': nan},
29: {'20030812': nan, '20030813': nan},
30: {'20030812': 23.13, '20030813': 22.97},
31: {'20030812': nan, '20030813': nan},
32: {'20030812': nan, '20030813': nan},
...
To answer your original question (roughly: "Why is the df4 DataFrame empty?"): this doesn't work because, when the Pool workers are launched, each worker inherits a personal copy-on-write view of the parent's data (either directly, if multiprocessing is running on a UNIX-like system and uses fork, or via a kludgy approach that simulates it when running on Windows).
Thus, when each worker does:
df4.loc[date, days] = val1
it's mutating the worker's personal copy of df4; the parent process's copy remains untouched.
In general, there are three ways to handle this:
Change your worker function to return something that can be used in the parent process. For example, instead of trying to perform in-place mutation with df4.loc[date, days] = val1, return what's necessary to do it in the parent, e.g. return date, days, val1, then change the parent to:
for date, days, val in multiprocessing.Pool(processes=8).map(func, dic.keys()):
    df4.loc[date, days] = val
The downside to this approach is that it requires each of the return values to be pickled (Python's form of serialization), piped from child to parent, and unpickled; if the worker task doesn't do very much work, and especially if the return values are large (as seems to be the case here), it can easily spend more time on serialization and IPC than it gains from parallelism. A fuller sketch of this return-to-parent pattern follows after this list.
Using shared object/memory (demonstrated in this answer to "Multiprocessing writing to pandas dataframe"). In practice, this usually doesn't gain you much, since stuff that isn't based on the more "raw" ctypes sharing using multiprocessing.sharedctypes is still ultimately going to end up needing to pipe data from one process to another; sharedctypes based stuff can get a meaningful speed boost though, since once mapped, shared raw C arrays are nearly as fast to access as local memory.
If the work being parallelized is I/O bound, or uses third party C extensions for CPU bound work (e.g. numpy), you may be able to get the required speed boosts from threads, despite GIL interference, and threads do share the same memory. Your case doesn't appear to be either I/O bound or meaningfully dependent on third party C extensions which might release the GIL, so it probably won't help here, but in general, the simple way to switch from process-based parallelism to thread-based parallelism (when you're already using multiprocessing) is to change the import from:
import multiprocessing
to
import multiprocessing.dummy as multiprocessing
which imports the thread-backed version of multiprocessing under the expected name, so code seamlessly switches from using processes to threads.
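Here is a hedged sketch of the first option (return plain data from the workers and apply it in the parent). It assumes dic and set_days from the question are defined at module level so the workers can see them (with fork-based start methods they are inherited automatically):

import multiprocessing

import pandas as pd

def func(date):
    # Workers only read dic and return picklable tuples; they never touch df4.
    rows = dic[date]
    days = [day for day, val1, val2 in rows]
    vals = [val1 for day, val1, val2 in rows]
    return date, days, vals

if __name__ == '__main__':
    df4 = pd.DataFrame(columns=sorted(set_days))
    with multiprocessing.Pool(processes=8) as pool:
        for date, days, vals in pool.map(func, dic.keys()):
            df4.loc[date, days] = vals   # mutation happens only in the parent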
As RafaelC hinted, it was an XY problem.
I've been able to reduce the execution time to 20 seconds without multiprocessing.
I created a list, lista, that replaces the dictionary; rather than adding a row to the df4 DataFrame for each date, I fill lista completely and then transform it into a DataFrame in one go.
import numpy as np
import pandas as pd

# Returns the largest day across all the dates (each date has a different number of days)
def longest_series(dic):
    largest_series = 0
    for date in dic.keys():
        # get the last day of this date's table
        current_series = dic[date][-1][0]
        if largest_series < current_series:
            largest_series = current_series
    return largest_series

ls = longest_series(dic)
l_total_days = list(range(1, ls + 1))
s_total_days = set(l_total_days)

# Build the lista list. lista is similar to dic, except that in lista every date
# has the same number of days (from 1 to ls) and the dates themselves are not stored.
# This part takes about 15 seconds.
lista = list()
for date in dic.keys():
    present_days = list()
    present_values = list()
    for day, val_252, _ in dic[date]:
        present_days.append(day)
        present_values.append(val_252)
    missing_days = list(s_total_days.difference(set(present_days)))  # extra days added to this date
    missing_values = [None] * len(missing_days)                      # extra values added to this date
    all_days_index = list(np.argsort(present_days + missing_days))   # indices that put days (and their paired values) in day order
    all_day_values = present_values + missing_values
    lista.append(list(np.array(all_day_values)[all_days_index]))

# This part takes about 4 seconds.
df = pd.DataFrame(lista, index=dic.keys(), columns=l_total_days)
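For reference, a quick usage sketch with the two sample dates from the question: define dic, run the code above to build lista and df, and then inspect the populated columns.

dic = {'20030812': [[1, 24.25, 0.0], [20, 23.54, 23.54], [30, 23.13, 24.36], [50, 22.85, 23.57]],
       '20030813': [[1, 24.23, 0.0], [19, 23.4, 22.82], [30, 22.97, 24.19], [49, 22.74, 23.25]]}

# ...run the code above to build `lista` and `df`, then:
print(df.loc[:, [1, 19, 20, 30, 49, 50]])

Because the gaps are filled with None, the frame comes out with object dtype; df.astype(float) converts the None entries to NaN and matches the layout shown in the question.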

'forward replace' with pandas

I have a df where every row looks something like this:
series = pd.Series({0: 1.0,
1: 1.0,
10: nan,
11: nan,
2: 1.0,
3: 1.0,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
'B': 3.0,
'D': 1.0})
For every row, I want series[series.D] to become 2, and all values at the numerical labels higher than series.D to become 2 as well. It's kind of a 'forward replace'.
So what I want is:
target = pd.Series({0: 1.0,
1: 2.0,
10: nan,
11: nan,
2: 2.0,
3: 2.0,
4: nan,
5: nan,
6: nan,
7: nan,
8: nan,
9: nan,
'B': 3.0,
'D': 1.0})
So far I've got:
def forward_replace(series):
    if pd.notnull(series['D']):
        cols = [0,1,2,3,4,5,6,7,8,9,10,11]
        target_cols = [x for x in cols if x > series.D]
        series.loc[target_cols].replace({1:2}, inplace=True)
    return series
It seems like it's not possible to use label-based indexing with numerical column labels?
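A minimal sketch of one way to reach the target shown above (not from the original post): the attempt does nothing because series.loc[target_cols] returns a copy, so replace(..., inplace=True) modifies that temporary copy rather than the original Series; label-based indexing with integer labels itself works fine. Writing back through .loc on the Series avoids the problem:

import pandas as pd

def forward_replace(series):
    if pd.notnull(series['D']):
        cols = list(range(12))                                   # the numeric labels 0..11
        target_cols = [c for c in cols if c >= series['D']]      # label D and everything above it
        to_set = [c for c in target_cols if series.loc[c] == 1]  # only replace the 1.0 values
        series.loc[to_set] = 2.0
    return series

result = forward_replace(series.copy())
# result[[0, 1, 2, 3, 'B', 'D']] -> 1.0, 2.0, 2.0, 2.0, 3.0, 1.0  (matches `target`)

For a whole DataFrame this could be applied row-wise, e.g. df.apply(forward_replace, axis=1).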
