I am trying to concatenate rows by column for every set of 4 rows/values.
I have 11 values: the first 4 values should be joined into one row, rows 5-8 into a second, and the last 3 rows into a third, even though that final group has fewer than four values.
df_in = pd.DataFrame({'Column_IN': ['text 1','text 2','text 3','text 4','text 5','text 6','text 7','text 8','text 9','text 10','text 11']})
and my expected output is as follows
df_out = pd.DataFrame({'Column_OUT': ['text 1&text 2&text 3&text 4','text 5&text 6&text 7&text 8','text 9&text 10&text 11']})
I have tried to get my desired output df_out as below.
df_2 = df_in.iloc[:-7].agg('&'.join).to_frame()
Any modification required to get required output?
Try using groupby and agg:
>>> df_in.groupby(df_in.index // 4).agg('&'.join)
Column_IN
0 text 1&text 2&text 3&text 4
1 text 5&text 6&text 7&text 8
2 text 9&text 10&text 11
>>>
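Note that df_in.index // 4 relies on the default RangeIndex. If the frame has a non-default index, a positional grouper should produce the same grouping; a minimal sketch of that variant:
>>> import numpy as np
>>> df_in.groupby(np.arange(len(df_in)) // 4).agg('&'.join)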
I need to locate the first place where the word 'then' appears in the Words table. I'm trying to write code that consolidates all strings in the 'text' column from that location up to the first text containing the substring '666' or '999', in this case the combination of their, stoma22, fe156, sligh334, pain666 (the desired subtrings_output = 'theirstoma22fe156sligh334pain666').
I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
As you can see, I'm not sure how to extend the condition of 666_999_loc to cover substring 666 or 999; also, slicing the index between the two variables renders an error (a Python name cannot start with a digit). Many thanks.
Words table:

page no  text      font
-------  --------  ----
1        they      0
1        ate       0
1        apples    0
2        and       0
2        then      1
2        their     0
2        stoma22   0
2        fe156     1
2        sligh334  0
2        pain666   1
2        given     0
2        the       1
3        fruit     0
You just need to add one to the end of the slice, and add an OR condition to the np.where for contains_666_or_999_loc using the | operator.
text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc + 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
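As a side note, the two contains calls can be collapsed into a single regex; assuming plain 666/999 substring matching is all that is needed, this should be equivalent:
contains_666_or_999_loc = np.where(text_col.str.contains('666|999', na=True))[0][0]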
IIUC, use pandas.Series.idxmax with "".join().
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value.
If multiple values equal the maximum, the first row label with that
value is returned.
So, assuming Words is your dataframe, try this:
their_loc = Words["text"].str.contains("their").idxmax()
_666_999_loc = Words["text"].str.contains("666|999").idxmax()  # regex covers either substring
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc+1]])
Output :
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()
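One caveat worth flagging (an assumption of this approach): idxmax returns index labels, while Words.index[their_loc:_666_999_loc+1] slices by position, so the two only line up on a default RangeIndex. On such an index the lookup simplifies to a label-based slice, since .loc includes the end label:
subtrings_output = "".join(Words["text"].loc[their_loc:_666_999_loc])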
I have one dataframe whose format is shown in the image below.
Every three columns represent one type of data: there is one column for the ticker, the next three columns hold one kind of data, and columns 5-7 hold a second kind.
Now I want to transform this so that each kind of data becomes a column, with the groups appended one below another.
Expected output is:
Is there any way to do this transformation in pandas using any API? I am doing it in a very basic way, creating a new dataframe for one group and then appending it.
Here is one way to do it:
use pd.melt to unstack the table, then split what used to be columns (and are now rows) on "/" to separate them into two columns (txt, year)
create the new row value by combining ticker and year, then use pivot to get the desired result set
df2=df.melt(id_vars='ticker', var_name='col') # line missed in earlier solution,updated
df2[['txt','year']] = df2['col'].str.split('/', expand=True)  # split "data1/2020" into txt and year
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
Based on the code that you posted in a comment: I missed a line, unfortunately, when posting the solution. It's added now.
df2 = pd.DataFrame(np.random.randint(0,100,size=(2, 6)),
columns=["data1/2020","data1/2021", "data1/2022", "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3=df2.melt(id_vars='ticker', var_name='col') # missed line in earlier posting
df3[['txt','year']] = df3['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
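As an aside, because the column names already follow a stub/suffix pattern ("data1/2020"), pandas.wide_to_long can express the same reshape more compactly. A sketch against the sample frame built above, assuming the columns match exactly that pattern:
df_long = pd.wide_to_long(df2, stubnames=['data1', 'data2'], i='ticker', j='year', sep='/')
df_long = df_long.reset_index()  # MultiIndex (ticker, year) back to columns
df_long['ticker2'] = df_long['ticker'] + '/' + df_long['year'].astype(str)
df_long = df_long[['ticker2', 'data1', 'data2']]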
I have 2 datasets (dataframes), one called source and the other crossmap. I am trying to find rows where a specific column value starts with "999"; if one is found, I need to look up the complete value of that column (e.g. "99912345") in the crossmap dataframe and return the value from a column on that row of the crossmap.
# Source Dataframe
0 1 2 3 4
------ -------- -- --------- -----
0 303290 544981 2 408300622 85882
1 321833 99910722 1 408300902 85897
2 323241 99902978 3 408056001 95564
# Cross Map Dataframe
ID NDC ID DIN(NDC) GTIN NAME PRDID
------- ------ -------- -------------- ---------------------- -----
44563 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
69281 321833 99910722 99910722000000 SALBUTAMOL SULFATE (A) 90367
6002800 323241 99902978 75402850039706 EPINEPHRINE (A) 95564
8001116 323241 99902978 99902978000000 EPINEPHRINE (A) 95564
The 'straw dog' logic I am working with is this:
search source file and find '999' entries in column 1
df_source[df_source['Column1'].str.contains('999')]
iterate through the rows returned, search for each value from column 1 in the crossmap dataframe column (DIN(NDC)), and return the corresponding PRDID
update the source dataframe with the PRDID, and write the updated file
It is these last two logic pieces where I am struggling with how to do this. Appreciate any direction/guidance anyone can provide.
Is there maybe a better/easier means of doing this using python but not pandas/dataframes?
So, if I understood you correctly: we are looking for values starting with 999 in the first column of the Source dataframe. Next, we find these values in the Cross Map column 'DIN(NDC)' and take the values of the 'PRDID' column on those rows.
If everything is correct, then what further actions do you need?
import pandas as pd
import more_itertools as mit
Cross_Map = pd.DataFrame({'DIN(NDC)': [99910722, 99910722, 99902978, 99902978],
'PRDID': [90367, 90367, 95564, 95564]})
df = pd.DataFrame({0: [303290, 321833, 323241], 1: [544981, 99910722, 99902978], 2: [2, 1, 3],
3: [408300622, 408300902, 408056001], 4: [85882, 85897, 95564]})
m = [i for i in df[1] if str(i)[:3] == '999'] #find the values in column 1
index = list(mit.locate(list(Cross_Map['DIN(NDC)']), lambda x: x in m)) #get the indexes of the matched column values DIN(NDC)
print(Cross_Map['PRDID'][index])
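A pandas-only alternative (a sketch against the same sample frames, assuming column 1 of the source holds the DIN values): filter the source with a string prefix test, then map the matches through a DIN-to-PRDID lookup built from the crossmap:
matches = df[df[1].astype(str).str.startswith('999')]  # source rows whose column 1 starts with 999
prdid_map = Cross_Map.drop_duplicates('DIN(NDC)').set_index('DIN(NDC)')['PRDID']  # DIN -> PRDID lookup
print(matches[1].map(prdid_map))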
For ex. consider the data frame given below:
Timestamp in_speed
1625638530 268.78
1625638590 262.75
1625638650 265.43
1625638710 270.67
1625638770 261.13
1625638830 265.49
1625638890 266.51
1625638950 270.54
1625639010 275.12
1625639070 267.62
1625639130 267.20
1625639190 265.29
1625639250 261.95
1625639310 264.39
1625639370 270.76
1625639430 291.18
I want to extract the whole row containing the maximum value for every 7 rows. Hence, desired output will be:
1625638710 270.67
1625639010 275.12
1625639430 291.18
Use DataFrameGroupBy.idxmax to get the indices of the maximal values, then select them with DataFrame.loc:
df = df.loc[df.groupby(df.index // 7)['in_speed'].idxmax()]
#alternative for not default index
#df = df.loc[df.groupby(np.arange(len(df)) // 7)['in_speed'].idxmax()]
print(df)
Timestamp in_speed
3 1625638710 270.67
8 1625639010 275.12
15 1625639430 291.18
I have two dataframes. Here is dwpjp.head():
   jp_number
0  25146315052147720191
1  57225427599900052634
2  86076681691411639833
3  50491824499499656478
4  95588382889227620465
and ct_data.head():
   imjp_number           imct_id
0  23605308039805192764  x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1  57225427599900052634  aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
2  53733358271401869469  6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
3  50491824499499656478  __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
4  82143248133286027306  __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
I want two new dataframes, cct_data and dct_data, split out of ct_data: if a row's imjp_number is present in the dwpjp dataframe it should go into cct_data, otherwise into dct_data.
I tried this for common jp_number present in dwpjp:
cct_data = ct_data[ct_data.isin(dwpjp).any(1).values]
and for the other I negated the condition as follows:
dct_data = ct_data[~[ct_data.isin(dwpjp).any(1).values]]
but I am not getting the results shown below.
cct_data
   imjp_number           imct_id
0  57225427599900052634  aa0d2dac654d4154bf7c09f73faeaf62|-vf6738ee3bed
1  50491824499499656478  __gbe204670ca784a01b7207b42a7e5a5d3|54e2c39cd3
and dct_data:
   imjp_number           imct_id
0  23605308039805192764  x1E5e3ukRyEFRT6SUAF6lg|d543d3d064da465b8576d87
1  53733358271401869469  6FfHZRoiWs2VO02Pruk07A|__g3d877adf9d154637be26
2  82143248133286027306  __g1114a30c6ea548a2a83d5a51718ff0fd|773840905c
Note: jp_number = imjp_number.
I modified your formula as below:
cct_data = ct_data[ct_data.imjp_number.isin(dwpjp.jp_number)]
and
dct_data = ct_data[~ct_data.imjp_number.isin(dwpjp.jp_number)]
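Since both selections share the same membership test, computing the mask once keeps the two frames complementary by construction, and reset_index(drop=True) reproduces the 0-based numbering in your expected output:
mask = ct_data['imjp_number'].isin(dwpjp['jp_number'])  # True where the number also appears in dwpjp
cct_data = ct_data[mask].reset_index(drop=True)
dct_data = ct_data[~mask].reset_index(drop=True)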