how to extract all repeating patterns from a string into a dataframe - python

i have a dataframe with the equiptment codes of certain trucks, this is a similar list o list of the cells
x = [[A0B,A1C,A1Z,A2E,A5C,B1B,B1F,B1H,B2A],
[A0A,A0B,A1C,A1Z,A2I,A5L,B1B,B1F,B1H,B2A,B2X,B3H,B4L,B5E,B5J,C0G,C1W,C5B,C5D],
[A0B,A1C,A1Z,A2E,A5C,B1B,B1F,B1H,B2A,B2X,B4L,B5C,B5I,C0A,C1J,C5B,C5D,C6C,C6J,C6Q]]
i want to extract all the values with match with "B" for example ("B1B,B1F,B1H");("B1B,B1F,B1H,B2A,B2X,B3H")("B1B,B1F,B1H,B2A,B2X,B4L,B5C,B5I") i try this code but every row each line has a different length
sublista = ['B1B','B1F','B1H','B2A','B2X','B4L','B5C','B5I']
df3 = pd.DataFrame(columns=['FIN', 'Equipmentcodes', 'AQUATARDER', 'CAJA'])
for elemento in sublista:
df_aux=(df2[df2['Equipmentcodes'].str.contains(elemento, case=False)])
df_aux['CAJA'] = elemento
df3 = df3.append(df_aux, ignore_index=True)

Assuming your column contains strings, you could use a regex:
df['selected'] = (df['code']
.str.extractall(r'\b(B[^,]*)\b')[0]
.groupby(level=0).apply(','.join)
)
example input:
x = ['A0B,A1C,A1Z,A2E,A5C,B1B,B1F,B1H,B2A',
'A0A,A0B,A1C,A1Z,A2I,A5L,B1B,B1F,B1H,B2A,B2X,B3H,B4L,B5E,B5J,C0G,C1W,C5B,C5D',
'A0B,A1C,A1Z,A2E,A5C,B1B,B1F,B1H,B2A,B2X,B4L,B5C,B5I,C0A,C1J,C5B,C5D,C6C,C6J,C6Q']
df = pd.DataFrame({'code': x})
output:
selected code
0 B1B,B1F,B1H,B2A A0B,A1C,A1Z,A2E,A5C,B1B,B1F,B1H,B2A
1 B1B,B1F,B1H,B2A,B2X,B3H,B4L,B5E,B5J A0A,A0B,A1C,A1Z,A2I,A5L,B1B,B1F,B1H,B2A,B2X,B3H,B4L,B5E,B5J,C0G,C1W,C5B,C5D
2 B1B,B1F,B1H,B2A,B2X,B4L,B5C,B5I A0B,A1C,A1Z,A2E,A5C,B1B,B1F,B1H,B2A,B2X,B4L,B5C,B5I,C0A,C1J,C5B,C5D,C6C,C6J,C6Q

Related

check if string in df and if true apply df row to a new list

Im new to python and have an issue.
I have the below df:
language, count
en-US,12
en,5
lang_list = ['en', 'es]
I would like to check if the items in lang_list are in my df. If so, i would like to append the df row with the matched language to a new list.
This is my code till now:
new list = []
for x in lang_list:
if x in df['language']:
Now, if true i would like to append the language and the count (taken from df) to my new list.
Can you please assist?
Check whether the first two characters from the language column exists in the lang_list, then filter df over those that do.
Finally, extract their count column and turn it into list.
res = df[df['language'].apply(lambda x: x[:2] in lang_list)]
res['count'].tolist()
Example:
language count
0 en-US 12
1 fr 21
2 en 5
lang_list = ['en', 'es']
res = df[df['language'].apply(lambda x: x[:2] in lang_list)]
res['count'].tolist()
# [12, 5] # output

Split a dataframe into two dataframe using first column that have a string values in python

I have two .txt file where I want to separate the data frame into two parts using the first column value. If the value is less than "H1000", we want in a first dataframe and if it is greater or equal to "H1000" we want in a second dataframe.First column starts the value with H followed by a four numbers. I want to ignore H when comparing numbers less than 1000 or greater than 1000 in python.
What I have tried this thing,but it is not working.
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
Here is my code:
if (".txt" in str(path_txt).lower()) and path_txt.is_file():
txt_files = [Path(path_txt)]
else:
txt_files = list(Path(path_txt).glob("*.txt"))
for fn in txt_files:
all_dfs = pd.read_csv(fn,sep="\t", header=None) #Reading file
all_dfs = all_dfs.dropna(axis=1, how='all') #Drop the columns where all columns are NaN
all_dfs = all_dfs.dropna(axis=0, how='all') #Drop the rows where all columns are NaN
print(all_dfs)
ht_data = all_dfs.index[all_dfs.iloc[:, 0] == "H1000"][0]
print(ht_data)
df_h = all_dfs[0:ht_data] # Head Data
df_t = all_dfs[ht_data:] # Tene Data
Can anyone help me how to achieve this task in python?
Assuming this data
import pandas as pd
data = pd.DataFrame(
[
["H0002", "Version", "5"],
["H0003", "Date_generated", "8-Aug-11"],
["H0004", "Reporting_period_end_date", "19-Jun-11"],
["H0005", "State", "AW"],
["H1000", "Tene_no/Combined_rept_no", "E75/3794"],
["H1001", "Tenem_holder Magnetic Resources", "NL"],
],
columns = ["id", "col1", "col2"]
)
We can create a mask of over and under a pre set threshold, like 1000.
mask = data["id"].str.strip("H").astype(int) < 1000
df_h = data[mask]
df_t = data[~mask]
If you want to compare values of the format val = HXXXX where X is a digit represented as a character, try this:
val = 'H1003'
val_cmp = int(val[1:])
if val_cmp < 1000:
# First Dataframe
else:
# Second Dataframe

How to find and replace substrings at the end of column headers

I have the following columns, among others, in my dataframe: dom_pop', 'an_dom_n', 'an_dom_ncmplt. Equivalent columns exist in multiple dataframes, with the suffix changing. For example, in another dataframe they may be called out as pa_pop', 'an_pa_n', 'an_pa_ncmplt. I want to append '_kwh' to these cols across all my dataframes.
I wrote the following code:
cols = ['_n$', '_ncmplt', '_pop'] << the $ is added to indicate string ending in _n.
filterfuel = 'kwh'
for c in cols:
dfdom.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfdom.columns]
dfpa.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfpa.columns]
dfsw.columns = [col.replace(f'{c}', f'{c}_{filterfuel}') for col in dfsw.columns]
kwh gets appended to _ncmplt and _pop cols, but not the _n column. If I remove the $ _n gets appended but then _ncmplt looks like 'an_dom_n_kwh_cmplt'.
for df dom the corrected names should look like dom_pop_kwh', 'an_dom_n_kwh', 'an_dom_ncmplt_kwh'
Why is $ not being recongnised as an end of string parameter?
You can use np.where with a regex
cols = ['_n$', '_ncmplt', '_pop']
filterfuel = 'kwh'
pattern = fr"(?:{'|'.join(cols)})"
for df in [dfdom, dfpa, dfsw]:
df.columns = np.where(df.columns.str.contains(pattern, regex=True),
df.columns + f"_{filterfuel}", df.columns)
Output:
>>> pattern
'(?:_n$|_ncmplt|_pop)'
# dfdom = pd.DataFrame([[0]*4], columns=['dom_pop', 'an_dom_n', 'an_dom_ncmplt', 'hello'])
# After:
>>> dfdom
dom_pop_kwh an_dom_n_kwh an_dom_ncmplt_kwh hello
0 0 0 0 0

How can I split a column into 2 in the correct way?

I am web-scraping tables from a website, and I am putting it to the Excel file.
My goal is to split a columns into 2 columns in the correct way.
The columns what i want to split: "FLIGHT"
I want this form:
First example: KL744 --> KL and 0744
Second example: BE1013 --> BE and 1013
So, I need to separete the FIRST 2 character (in the first column), and after that the next characters which are 1-2-3-4 characters. If 4 it's oke, i keep it, if 3, I want to put a 0 before it, if 2 : I want to put 00 before it (so my goal is to get 4 character/number in the second column.)
How Can I do this?
Here my relevant code, which is already contains a formatting code.
df2 = pd.DataFrame(datatable,columns = cols)
df2["UPLOAD_TIME"] = datetime.now()
mask = np.column_stack([df2[col].astype(str).str.contains(r"Scheduled", na=True) for col in df2])
df3 = df2.loc[~mask.any(axis=1)]
if os.path.isfile("output.csv"):
df1 = pd.read_csv("output.csv", sep=";")
df4 = pd.concat([df1,df3])
df4.to_csv("output.csv", index=False, sep=";")
else:
df3.to_csv
df3.to_csv("output.csv", index=False, sep=";")
Here the excel prt sc from my table:
You can use indexing with str with zfill:
df = pd.DataFrame({'FLIGHT':['KL744','BE1013']})
df['a'] = df['FLIGHT'].str[:2]
df['b'] = df['FLIGHT'].str[2:].str.zfill(4)
print (df)
FLIGHT a b
0 KL744 KL 0744
1 BE1013 BE 1013
I believe in your code need:
df2 = pd.DataFrame(datatable,columns = cols)
df2['a'] = df2['FLIGHT'].str[:2]
df2['b'] = df2['FLIGHT'].str[2:].str.zfill(4)
df2["UPLOAD_TIME"] = datetime.now()
...
...

create names of dataframes in a loop

I need to give names to previously defined dataframes.
I have a list of dataframes :
liste_verif = ( dffreesurfer,total,qcschizo)
And I would like to give them a name by doing something like:
for h in liste_verif:
h.name = str(h)
Would that be possible ?
When I'm testing this code, it's doesn't work : instead of considering h as a dataframe, python consider each column of my dataframe.
I would like the name of my dataframe to be 'dffreesurfer', 'total' etc...
You can use dict comprehension and map DataFrames by values in list L:
dffreesurfer = pd.DataFrame({'col1': [7,8]})
total = pd.DataFrame({'col2': [1,5]})
qcschizo = pd.DataFrame({'col2': [8,9]})
liste_verif = (dffreesurfer,total,qcschizo)
L = ['dffreesurfer','total','qcschizo']
dfs = {L[i]:x for i,x in enumerate(liste_verif)}
print (dfs['dffreesurfer'])
col1
0 7
1 8
print (dfs['total'])
col2
0 1
1 5

Categories

Resources