I have code that checks different columns for all the dates that are >= "2022-12-01" and <= "2024-12-31".
What I would like is to be able to extract some other information located on the same row.
These are the headers of my columns:
EMPL. NO
NOM A L'EMPLACEMENT
ADRESSE
VILLE
PROV
OBJET NO
EMPLACEMENT DE L'APPAREIL
DESCRIPTION DE L'APPAREIL
MANUFACTURIER
DIMENSIONS
MAWP
SVP
DERNIERE INSP. EXT.
FREQ. EXT.
DERNIERE INSP. INT.
FREQ. INT.
D_EXT_1
D_INT_1
D_EXT_2
D_INT_2
D_EXT_3
D_INT_3
D_EXT_4
D_INT_4
D_EXT_5
D_INT_5
D_EXT_6
D_INT_6
I would like to search for all the dates that are >= "2022-12-01" and <= "2024-12-31" in any of the columns with the prefix D_EXT_x, and extract them together with all the information in the columns that come before D_EXT_1 on the same row.
This is the code I got from a question I asked earlier:
import pandas as pd
cols = [prefix + str(i) for prefix in ['D_INT_'] for i in range(1,7)]
data = pd.read_csv("dates.csv")
for col in cols:
    data.loc[:, col] = pd.to_datetime(data.loc[:, col])
ext = data[
    (
        data.loc[:, cols].ge(pd.to_datetime("2022-12-01"))
        & data.loc[:, cols].le(pd.to_datetime("2024-12-31"))
    ).any(axis=1)
]
print(ext)
The problem is that it's not doing what it's supposed to do. My file has 1692 lines and 29 columns, but the output is giving me: [1692 rows x 1715 columns].
Here is the original question: how to extract entire row when a value is found
Any help would be appreciated.
# Get the rows (a chained comparison like a <= s <= b raises an error
# on a Series, so use .between for the range check)
rows_with_valid_date = df[df[date_column_name].between(after_this, before_this)]
# Get the wanted columns
needed_values = rows_with_valid_date[[wanted_column1, wanted_column2, etc]]
You can fill in the correct names where needed.
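Applied to the question above, a minimal sketch might look like this, assuming the headers and "dates.csv" from the question (note that the earlier code also built its column list from the D_INT_ prefix even though the question asks about D_EXT_x):

import pandas as pd

# filter on D_EXT_1..D_EXT_6 and keep the columns before D_EXT_1
ext_cols = [f"D_EXT_{i}" for i in range(1, 7)]
data = pd.read_csv("dates.csv", parse_dates=ext_cols)

start, end = pd.Timestamp("2022-12-01"), pd.Timestamp("2024-12-31")
mask = (data[ext_cols].ge(start) & data[ext_cols].le(end)).any(axis=1)

info_cols = data.columns[: data.columns.get_loc("D_EXT_1")]
print(data.loc[mask, info_cols])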
Below is the dataframe generated using Python and transferred to a csv file. The number of delimiters, i.e. (|), is 9, as shown below:
Date|ID|CD|BIN|INTRNL|PCC|IND|CENTRE|TRANS|ENTITY
20221231|APPLE|10004050|BCH_dummy|3505|N|Y|Y|6310|
20221231|APPLE|10004050|BCH_MOTOR|3502|N|Y|Y|6310|
Dataframe:
Date ID CD BIN INTRNL PCC IND CENTRE TRANS ENTITY
20221231 APPLE 10004050 BCH_dummy 3505 N Y Y 6310
20221231 APPLE 10004050 BCH_MOTOR 3502 N Y Y 6310
But I want to add an extra column named BDR2 to the left of the Date column and maintain the same number of delimiters (|), which is 9, as shown below.
Expected Output in CSV file:
BDR2|Date|ID|CD|BIN|INTRNL|PCC|IND|CENTRE|TRANS|ENTITY
20221231|APPLE|10004050|BCH_dummy|3505|N|Y|Y|6310|
20221231|APPLE|10004050|BCH_MOTOR|3502|N|Y|Y|6310|
df.insert(0, column="BDR2", value='')
df = df.shift(-1, axis = 1)
df.replace("nan",'',inplace=True)
df.to_csv(r"C:\INPUT\df_sample_test.csv",sep='|',index=False)
Technically, I don't think it's possible.
However, you can cheat/fake it by making a one-column csv like so:
import pandas as pd

out = (
    pd.read_csv("inputfile.csv", sep="|")
      .rename({"Date": "BDR2|Date"}, axis=1)
      .fillna("").astype(str)
      .pipe(lambda x: x.agg("|".join, axis=1)
                       .to_frame(f'{"|".join(x.columns)}'))
)
out.to_csv("outputfile.csv", index=False)
Output:
out.to_csv(sys.stdout, index=False)  # needs import sys; wrapping this in print() would also emit a stray None
BDR2|Date|ID|CD|BIN|INTRNL|PCC|IND|CENTRE|TRANS|ENTITY
20221231|APPLE|10004050|BCH_dummy|3505|N|Y|Y|6310|
20221231|APPLE|10004050|BCH_MOTOR|3502|N|Y|Y|6310|
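An alternative sketch that avoids the one-column trick: write the extended header line yourself, then append the data rows without a header (the filenames here are placeholders):

import pandas as pd

df = pd.read_csv("inputfile.csv", sep="|")
with open("outputfile.csv", "w", newline="") as f:
    f.write("BDR2|" + "|".join(df.columns) + "\n")    # hand-written header
    df.to_csv(f, sep="|", index=False, header=False)  # data rows only

Since the empty ENTITY field is read back as NaN and written out as an empty string, the trailing | on each data row is preserved.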
I have a csv list of keywords in this format:
75410,Sportart
75419,Ballsport
75428,Basketball
76207,Atomenergie
76212,Atomkraftwerk
76223,Wiederaufarbeitung
76225,Atomlager
67869,Werbewirtschaft
I read the values using pandas and create a table in this format:
DF: name
id
75410 Sportart
75419 Ballsport
75428 Basketball
76207 Atomenergie
76212 Atomkraftwerk
... ...
251450 Tag und Nacht
241473 Kollektivverhalten
270930 Indigene Völker
261949 Wirtschaft und Politik
282512 Impfen
Using the name, I want to delete the whole row, e.g. 'Sportart' deletes the first row.
I want to check this against values from my wordList array; I store them as strings in a list.
What did I miss? Using the code below I receive a '(value) not in axis' error.
df = pd.read_csv("labels.csv", header=None, index_col=0)
df.index.name = "id"
df.columns = ["name"]
print('DF: ',df)
df.drop(labels=wordList, axis=0,inplace=True)
pd_frame = pd.DataFrame(df)
cleaned_pd_frame = pd_frame.query('name != {}'.format(wordList))
I succeeded in hiding them with query(), but I want to remove them entirely.
You can use a helper function, index_to_drop below, to take in a name and filter its index out:
index_to_drop = lambda name: df.index[df['name']==name]
Then you can drop "Sportart" like:
df.drop(index_to_drop('Sportart'), inplace=True)
print(df)
Output:
       id                    name
1   75419               Ballsport
2   75428              Basketball
3   76207             Atomenergie
4   76212           Atomkraftwerk
5  251450           Tag und Nacht
6  241473      Kollektivverhalten
7  270930         Indigene Völker
8  261949  Wirtschaft und Politik
9  282512                  Impfen
That being said, this is just a convoluted way to drop a row. The same outcome can be obtained much more simply by using isin (see the sketch below for the whole wordList):
df = df[~df['name'].isin(['Sportart'])]
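For the wordList from the question, a minimal sketch (the names in the list are just examples):

import pandas as pd

wordList = ["Sportart", "Ballsport"]  # example names to drop

df = pd.read_csv("labels.csv", header=None, index_col=0)
df.index.name = "id"
df.columns = ["name"]

# keep only the rows whose name is NOT in wordList
df = df[~df["name"].isin(wordList)]
print(df)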
I feel really stupid now, this should be easy.
I got good help here: how-to-keep-the-index-of-my-pandas-dataframe-after-normalazation-json
I need to get the min/max value in the column 'price' only where the value in the column 'type' is buy/sell. Ultimately I also want to get back the 'id' for that specific order.
So first I need the price value, and second I need to get back the corresponding value of 'id'.
You can find the dataframe that I'm working with in the link.
What I can do is find the min/max value of the whole column 'price' like so :
x = df['price'].max() # = max price
and I can sort out all the "buy" type like so:
d = df[['type', 'price']].value_counts(ascending=True).loc['buy']
but I still can't do both at the same time.
You have to use the .loc method on the dataframe in order to filter by type.
import pandas as pd

data = {"type": ["buy", "other", "sell", "buy"], "price": [15, 222, 11, 25]}
df = pd.DataFrame(data)

# keep only the buy and sell orders
buy_and_sell = df.loc[df['type'].isin(["sell", "buy"])]

min_value = buy_and_sell['price'].min()
max_value = buy_and_sell['price'].max()

# all rows that hold those extreme prices
min_rows = buy_and_sell.loc[buy_and_sell['price'] == min_value]
max_rows = buy_and_sell.loc[buy_and_sell['price'] == max_value]
min_rows and max_rows can contain multiple rows because it is possible that the same min price is repeated.
To extract the index just use .index.
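For example, with the dataframe above:

print(min_rows.index.tolist())  # labels of the rows holding the lowest price
print(max_rows.index.tolist())  # labels of the rows holding the highest price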
hbid = df.loc[df.type == 'buy'].min()[['price', 'txid']]
gives me the lowest value of price and the lowest value of txid, not the id that belongs to the order with the lowest price. Any help or tips would be greatly appreciated!
0 OMG4EA-Z2WUP-AQJ2XU None ... buy 0.00200000 XBTEUR # limit 14600.0
1 OBTJMX-WTQSU-DNEOES None ... buy 0.00100000 XBTEUR # limit 14700.0
2 OAULXQ-3B5WJ-LMLSUC None ... buy 0.00100000 XBTEUR # limit 14800.0
[3 rows x 23 columns]
highest buy order =
14800.0
here the id and price . . txid =
price 14600.0
txid OAULXQ-3B5WJ-LMLSUC
I'm still not sure how your isin line works; buy_and_sell is not specified ;)
How I did it:
I first found the highest sell, then found the 'txid' for that price, then I had to remove the index from the returned series. And finally I had to remove a whitespace before my string; no idea how it got there.
import subprocess

def get_highest_sell_txid():
    hs = df.loc[df.type == 'sell', :].max()['price']  # highest sell price
    hsid = df.loc[df.price == hs, :]                  # row(s) at that price
    xd = hsid['txid']
    return xd.to_string(index=False)

xd = get_highest_sell_txid()
sd = xd.strip()  # to_string() pads the value with leading whitespace
cancel_order = 'python -m krakenapi CancelOrder txid=' + sd
subprocess.run(cancel_order)
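For what it's worth, idxmax offers a shorter route to the same txid. A sketch, assuming the 'price' column holds numeric values:

sells = df.loc[df.type == 'sell']
best = sells.loc[sells['price'].idxmax()]  # the whole row of the highest sell
print(best['txid'], best['price'])

idxmax returns the label of the row with the maximum price, so .loc gives back the entire order in one step instead of matching on the price value afterwards.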
I have 2 dataframes. I want to sort the values of the first dataframe by string length (which I used str.len() for), then sort the second dataframe based on the index of the first. I'm trying to use pandas masking but it gives me an error; any advice?
The indexes of both dataframes match.
My code:
wdata = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(wdata.count(' ') == 0)
wdata = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
length= wdata['sentences'].str.len().sort_values()
print(length)
sort= wdata['sentences'].sort_values('length', ascending=True, inplace=True).any(axis=1)
df=sort
print(df)
df2 = pd.read_csv(fileinput, nrows=0).columns[0]
skip = int(df2.count(' ') == 0)
df2 = pd.read_csv(fileinput, names=['sentences'], skiprows=skip)
newdata2 = df2[df2.sort(df.index)]
print(newdata2)
----------------------
#first dataframe example
----------------------
#how are you
#I want to die
#I was home
#I went to sleep at work
#he have a bad reputation
#it was me who went to him
#have good sleep home
#yes
#I'm good
----------------------
#second dataframe example
----------------------
#halaw kuy bashii
#damawe bmrm
#la malawa bum
#la esh nustm
#aw kabraya bash nya
#awa mn bum chum bo lay
#xaweki xosh basar bba la malawa
#bale
#mn bashm
The output I expect is both dataframes reordered by the sentence length of the first one.
The errors I'm getting:
raise ValueError("No axis named {0} for object type {1}".format(axis, cls))
ValueError: No axis named length for object type <class 'pandas.core.series.Series'>
What am I doing wrong? Any ideas to solve it, please?
For the first dataframe, use Series.argsort to get the positions of the sorted values, then pass them to DataFrame.iloc:
idx = wdata['sentences'].str.len().argsort()
df = wdata.iloc[idx]
print(df)
                   sentences
7                        yes
8                   I'm good
2                 I was home
0                how are you
1              I want to die
6       have good sleep home
3    I went to sleep at work
4   he have a bad reputation
5  it was me who went to him
If you want to select one column as a Series:
sentences = df['sentences']
For the second dataframe use the same positions, assuming it has the same index values as wdata:
newdata2 = df2.iloc[idx]
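Put together, a minimal runnable sketch (the file names are placeholders; both files are assumed to hold one sentence per line, as in the question):

import pandas as pd

wdata = pd.read_csv("first.csv", names=['sentences'])
df2 = pd.read_csv("second.csv", names=['sentences'])

idx = wdata['sentences'].str.len().argsort()  # positions sorted by length
print(wdata.iloc[idx])
print(df2.iloc[idx])  # same order applied to the second dataframe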
So, kinda a newb here, but I have this dataset that is transposed awkwardly. I want to have this back to our guy next week, and I've gotten pretty close to completing it, I think.
The problem I am facing is getting the data into one dataframe. When I run the code and print from the for loop, I can see the chunks of values that will need to be concatenated. However, I can't find a way to store all the values; when I do, I just get one chunk.
import pandas as pd
import numpy as np

df = pd.read_excel("DATA,h",
                   header=None,
                   dtype=object)

ranges = []
last_index = 0

def clean(df12, df13):
    df12 = df12.T
    df13 = df13.T
    value1 = pd.DataFrame(df12)
    value2 = pd.DataFrame(df13)
    final_value = value1.append(value2)
    return final_value

for i, row in df.iterrows():
    rows = df.iloc[i]
    if rows[9] == 'Member' or rows[9] == 'Non-Pledging Member':
        if last_index == 0:
            last_index = i
        else:
            ranges.append([last_index, i])
            last_index = i
    df44 = clean(row, row)
    print(df44)
When I print rows from the for loop, I get all the values I need in the terminal, but if I store it in a value or dataframe, I just see one of those blocks of data. Does anyone know what's going on?
Data (there are 15k of these):
Proctor, Terry 206-915-3555 Member
620 33rd Ave E 16283
Seattle, WA 98112
what I am shooting for:
Proctor, Terry, 620 33rd Ave E, Seattle, WA, 98112, 206-915-3555, Member
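A sketch of one way to store every chunk rather than the last one: collect each record in a list inside the loop and build a single dataframe once, after the loop. This mirrors the ranges logic above and assumes the default RangeIndex from read_excel(header=None) and that a block starts on each row whose column 9 reads 'Member' or 'Non-Pledging Member'.

import pandas as pd

chunks = []
last_index = 0
for i, row in df.iterrows():
    if row[9] in ('Member', 'Non-Pledging Member'):
        if last_index != 0:
            block = df.iloc[last_index:i]                  # rows of one record
            chunks.append(pd.Series(block.values.ravel())) # flatten to one row
        last_index = i
chunks.append(pd.Series(df.iloc[last_index:].values.ravel()))  # final record

result = pd.DataFrame(chunks).reset_index(drop=True)
print(result)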