pandas remove period after json normalize - python

I'm trying to build a TensorFlow application in Python, but after importing my data I needed to normalize it. No problem there, except all my columns are now titled like palm.velocity.x. I found a way to rename all of these columns; there are 230 of them in total, so df.rename and similar methods aren't much help unless they can be used like df.apply, and from what I've looked at there doesn't seem to be a way.

import re

def FixColumnHeading(column):
    # Split 'palm.velocity.x' into ['palm', 'velocity', 'x']
    columns = re.split(r'\.', column)
    name = []
    for word in range(len(columns)):
        if word > 0:
            # Capitalize every part after the first, to build camelCase
            columns[word] = columns[word].capitalize()
        name.append(columns[word])
    newColumn = ''
    for part in name:
        newColumn += part
    return newColumn

normalisedData.columns = normalisedData.columns.to_series().apply(lambda x: FixColumnHeading(x))
If anyone can think of a way to improve, please put what you would change below :)
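One way to tighten this up, as a sketch (assuming the same dot-to-camelCase goal, with normalisedData as above): let re.sub do the capitalization in a single pass over each column name.
import re

# Uppercase the first character after each '.' and drop the dot,
# e.g. 'palm.velocity.x' -> 'palmVelocityX'
normalisedData.columns = [
    re.sub(r'\.(\w)', lambda m: m.group(1).upper(), col)
    for col in normalisedData.columns
]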

Related

How to use Python to fill specific data into a column in Excel based on information in the first column?

I have a problem with an Excel file, and I want to automate it with a Python script that completes a column based on the information in the first column. For example: while the data is 'G711Alaw 64k' or 'G711Ulaw 64k', fill in '1-Jan', until '2-Jan' is found, then fill in '2-Jan', and so on.
before automation
I need it to look like this after automation:
after automation
Can anyone help me solve this issue?
The file:
the excel file
Thanks a lot for your help.
Try this. Pandas reads your '1-Jan' as a datetime type; if you need to change it to a string you can set that directly in the code. The following code assigns the value read to the second column:
import pandas as pd

df = pd.read_excel("add_date_column.xlsx", engine="openpyxl")
sig = []

def t(x):
    global sig
    # Non-string values in the first column are the dates
    if not isinstance(x.values[0], str):
        tmp_sig = x.values[0]
        if tmp_sig not in sig:
            sig = [tmp_sig]  # remember the most recent date
    # Write the last date seen into the second column
    x.values[1] = sig[-1]
    return x

new_df = df.apply(t, axis=1)
new_df.to_excel("new.xlsx", index=False)
The concept is very simple:
If the value is a date/time, copy it to [same row, next column].
If not, [same row, next column] is copied from [previous row, next column].
You do not specifically need Python for this task. The Excel formula for this would be:
=IF(ISNUMBER(A:A),A:A,B1)
Instead of checking whether the value is a date/time, I took advantage of the fact that the rest of the entries are alphanumeric (containing both letters and numbers), so ISNUMBER is false for them. This formula is applied in the new column.
Of course, you might already be in Python and just want to work within the same environment. So, here's the loop:
from datetime import datetime

for i in range(len(df)):
    if type(df["Orig. Codec"][i]) is datetime:
        df["Column1"][i] = df["Orig. Codec"][i]
    else:
        df["Column1"][i] = df["Column1"][i - 1]
There might be a way to use a lambda function for the same concept, but I am not aware of how to apply a lambda and a shift at the same time.

Removing a lot of rows from a dataframe in Python

I am trying to remove non-English tweets from a large dataset in the most efficient way possible. I have tried to create a list of rows that are not English and then removing them, but removing each tweet takes a long time (the langid.classify() function is not the problem).
def removeLanguage(df):
    rowsToDelete = []
    for i in range(len(df)):
        text = df['tweet'][i]
        try:
            if langid.classify(text)[0] != 'en':
                rowsToDelete.append(i)
                continue
        except ValueError:
            rowsToDelete.append(i)
            continue
    for i in rowsToDelete:
        df.drop(i, inplace=True)
    return df

newDf = removeLanguage(inputDf).reset_index(drop=True)
Is there a more efficient way to remove a set of rows from a DataFrame than df.drop()?
df.drop is pretty efficient, but I'd use a single boolean mask instead of dropping rows one at a time:
df = df[df['tweet'].apply(lambda t: langid.classify(t)[0] == 'en')]
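Since the original loop also catches ValueError, a slightly fuller sketch (assuming langid is imported and the column is named tweet) might look like this:
def is_english(text):
    # classify() returns a (language, score) tuple; guard against the
    # ValueError the original loop was catching
    try:
        return langid.classify(text)[0] == 'en'
    except ValueError:
        return False

df = df[df['tweet'].apply(is_english)].reset_index(drop=True)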

How to search through pandas data frame row by row and extract variables

I am trying to search through a pandas dataframe row by row and see if 3 variables are in the name of a file. If they are in the name of the file, more variables are extracted from that same row. For instance, I am checking to see if the concentration, substrate and the number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')
for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my path_ext is.
At the moment nothing is being saved for Actual_Frame_Rate and I don't know why. I have attached the pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
I am unsure why this helped, but I fixed it by combining it all into one string and matching like that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I am still unsure why I can't use the other method.
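The likely reason the first attempt failed is operator precedence: in only binds to the last operand, so str(a) and str(b) and str(c) in path_ext is evaluated as str(a) and str(b) and (str(c) in path_ext), and the first two non-empty strings are simply truthy. A sketch of the original condition with each part tested against path_ext:
for index, row in workbook.iterrows():
    # Each substring needs its own membership test
    if (str(row['Concentration']) in path_ext
            and str(row['substrate']) in path_ext
            and str(-+row['droplets']) in path_ext):
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']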

How do I drop multiple rows based on values in a pandas data frame with 1 line of code?

I'll post a little bit of my code here. Basically I've been manually removing one row at a time that I don't want, but I want it to look nicer than that, so I'm wondering if there's a cleaner way that allows me to delete everything in one line.
data = data[data.city_or_county != 'Alma']
data = data[data.city_or_county != 'Alpine']
data = data[data.city_or_county != 'Altadena']
data = data[data.city_or_county != 'Alsip']
You could use .isin:
data = data[~data.city_or_county.isin(["Alma", "Alpine", "Alsip", "Altadena"])]

Pandas For Loop, If String Is Present In ColumnA Then ColumnB Value = X

I'm pulling JSON data from the Binance REST API; after formatting, I'm left with the following...
I have a dataframe called Assets with 3 columns [Asset, Amount, Location].
['Asset'] holds ticker names for crypto assets, e.g. (ETH, LTC, BNB).
However, when all or part of that asset has been moved to 'Binance Earn', the strings are returned like this, e.g. (LDETH, LDLTC, LDBNB).
['Amount'] can be ignored for now.
['Location'] is initially empty.
I'm trying to set the value of ['Location'] to 'Earn' if the string in ['Asset'] includes 'LD'.
This is how far I got, but I can't remember how to apply the change to only the current item; it's been ages since I've used pandas or for loops. I'm only able to apply it to the entire column rather than the current row of the iteration.
for Row in Assets['Asset']:
    if Row.find('LD') == 0:
        print('Earn')
        Assets['Location'] = 'Earn'  # <---- How to apply this to the current row only?
    else:
        print('???')
        Assets['Location'] = '???'  # <---- How to apply this to the current row only?
The print statements work correctly, but currently the whole column gets populated with the same value (whichever was last) as you might expect.
So (LDETH,HOT,LDBTC) returns ('Earn','Earn','Earn') rather than the desired ('Earn','???','Earn')
Any help would be appreciated...
np.where() fits here. If the Asset starts with LD, then return Earn, else return ???:
import numpy as np

Assets['Location'] = np.where(Assets['Asset'].str.startswith('LD'), 'Earn', '???')
You could run a lambda in df.apply to check whether 'LD' is in df['Asset']:
df['Location'] = df['Asset'].apply(lambda x: 'Earn' if 'LD' in x else None)
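A variant of this that returns the '???' default the question asks for, and anchors the match to the start of the string (a guess at the intent, since the Earn prefix appears to be LD at position 0):
df['Location'] = df['Asset'].apply(lambda x: 'Earn' if x.startswith('LD') else '???')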
One possible solution:
def get_loc(row):
    asset = row['Asset']
    if asset.find('LD') == 0:
        print('Earn')
        return 'Earn'
    print('???')
    return '???'

Assets['Location'] = Assets.apply(get_loc, axis=1)
Note, you should almost never iterate over a pandas dataframe or series.
