I am getting data from an American system. The numbers I get in the CSV are strings like "(100)", and I have to convert them to the integer -100. The data frame has N columns, and I have to do this for all of them.
What I do now is replace every closing parenthesis with an empty string and every opening parenthesis with a minus sign. This is not a good solution because it changes every value in the data frame, not just the numbers.
import pandas as pd
df=pd.read_csv('American.csv', thousands=r',')
df=df.apply(lambda z: z.astype(str).str.replace(')','', regex=False))
df=df.apply(lambda z: z.astype(str).str.replace('(','-', regex=False))
What I expect:
"(100)" -> -100
"Nick (Jones)" ->"Nick **(Jones)**"
What I get:
"(100)" -> -100
"Nick (Jones)" ->"Nick **-Jones**"
I need code that converts the numbers in all columns but leaves other values untouched.
Use pandas.DataFrame.replace with regex=True:
df = pd.DataFrame(["(100)", "Nick (Jones)"])
new_df = df.replace(r'\((\d+)\)', r'-\1', regex=True)
print(new_df)
Output:
              0
0          -100
1  Nick (Jones)
Regex:
Captures one or more digits inside a pair of parentheses (Group #1) and puts - in front of them (-Group #1). The result is still a string; convert it afterwards if you need an integer.
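Since the question asks for integers in every column, here is a minimal sketch of one way to finish the conversion after the regex replacement. The sample frame and the column names `amount` and `name` are made up for illustration, and the pattern is anchored with `^` and `$` on the assumption that only whole-cell values like "(100)" should be converted.

```python
import pandas as pd

# Hypothetical sample data: one numeric-looking column, one text column.
df = pd.DataFrame({
    "amount": ["(100)", "250", "(7)"],
    "name": ["Nick (Jones)", "Ann", "Bob"],
})

# Turn "(123)" into "-123". The anchors keep cells such as
# "Nick (Jones)" or "x (12)" untouched.
cleaned = df.replace(r"^\((\d+)\)$", r"-\1", regex=True)

# Convert a column to int64 only if every value in it parses as a number.
result = cleaned.copy()
for name in result.columns:
    converted = pd.to_numeric(result[name], errors="coerce")
    if converted.notna().all():
        result[name] = converted.astype("int64")

print(result.dtypes)
```

Columns that contain any non-numeric value are left as they were, so text columns keep their parentheses.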
Example:
The df['column'] has a bunch of values similar to: F/4500/O or G/2/P
The number of digits ranges from 1 to 4, as in the examples above.
How can I transform that column to keep only the digits (for example 4500) as an integer?
I tried the split method but I can't get it right.
Thank you!
You could extract the value and convert to_numeric:
df['number'] = pd.to_numeric(df['column'].str.extract(r'/(\d+)/', expand=False))
Example:
     column  number
0  F/4500/O    4500
1     G/2/P       2
How about:
df['column'].map(lambda x: int(x.split('/')[1]))
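The two approaches differ when a row does not follow the letter/digits/letter pattern: the `split` version raises an error, while `str.extract` yields a missing value. A small sketch of the extract route that keeps an integer dtype despite missing values, assuming a made-up third row without digits and using the nullable `Int64` dtype:

```python
import pandas as pd

# The third row is a hypothetical malformed value added for illustration.
df = pd.DataFrame({"column": ["F/4500/O", "G/2/P", "no digits here"]})

# Capture 1 to 4 digits between two slashes; non-matching rows become NaN.
digits = df["column"].str.extract(r"/(\d{1,4})/", expand=False)

# Nullable Int64 keeps integers and represents the missing row as <NA>.
df["number"] = pd.to_numeric(digits).astype("Int64")

print(df)
```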
I have a first DataFrame with a column 'X':
X
A468593-3
A697269-2
A561044-2
A239882 04
and a second DataFrame with a column 'Y':
Y
000A561044
000A872220
I would like to match substrings across the two columns using a minimum number of characters (for example 7), considering only alphanumeric characters and ignoring all special characters.
So my output DataFrame should look like this:
X
A561044-2
Any possible solution would be highly appreciated.
Thanks in advance.
IIUC and assuming that the first three values of Y start with 0, you can slice Y by [3:] to remove the first three zero values. Then, you can join these values by |. Finally, you can create your mask using contains that checks whether a series contains a specified value (in your case you would have something like 'A|B' and check whether a value contains 'A' or 'B'). Then, this mask can be used to query your other data frame.
Code:
import pandas as pd
df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})
mask = df1["X"].str.contains("|".join(df2["Y"].str[3:]))
df1.loc[mask]
Output:
X
2 A561044-2
If the values in Y do not all start with exactly three zeros, you can use this function to strip every leading numeric character instead.
def remove_first_numerics(s):
    # Count the leading numeric characters, stopping at the end of the string.
    counter = 0
    while counter < len(s) and s[counter].isnumeric():
        counter += 1
    return s[counter:]
df_test = pd.DataFrame({"A": ["01Abd3Dc", "3Adv3Dc", "d31oVgZ", "10dZb1B", "CCcDx10"]})
df_test["A"].apply(remove_first_numerics)
Output:
0 Abd3Dc
1 Adv3Dc
2 d31oVgZ
3 dZb1B
4 CCcDx10
Name: A, dtype: object
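The question also asks to compare only alphanumeric characters, with 7 characters as the match length. A rough sketch of that reading follows; it assumes the key is the first 7 alphanumeric characters of each value after leading zeros are stripped from Y, which is an interpretation of the question rather than something it states.

```python
import pandas as pd

df1 = pd.DataFrame({"X": ["A468593-3", "A697269-2", "A561044-2", "A239882 04"]})
df2 = pd.DataFrame({"Y": ["000A561044", "000A872220"]})

KEY_LENGTH = 7  # assumed match length, taken from the example in the question

# Drop everything that is not a letter or digit, then keep the first 7 characters.
x_key = df1["X"].str.replace(r"[^A-Za-z0-9]", "", regex=True).str[:KEY_LENGTH]
y_key = (
    df2["Y"]
    .str.lstrip("0")  # assumption: leading zeros are padding
    .str.replace(r"[^A-Za-z0-9]", "", regex=True)
    .str[:KEY_LENGTH]
)

# Keep the rows of df1 whose key appears among the keys of df2.
matched = df1[x_key.isin(y_key)]
print(matched)
```

Comparing fixed keys with `isin` avoids building one large regular expression, which also sidesteps problems when the values contain regex metacharacters.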
I have a table example input :
Energy1 Energy2
-966.463549649 -966.463549649
-966.463608088 -966.463585840
I need a script that sums the two energies E1 and E2, multiplies the sum by the factor 627.51 (hartree to kcal/mol), and finally truncates the number to 4 digits.
I have never attempted this with Python. I've always written this in Julia, but I think it should be simple.
Do you know where I can find an example of reading the table and then doing operations with the numbers in it?
something like:
import numpy
data = numpy.loadtxt('table.tab')
print(data[?:,?].sum())
You can use pandas for this if you convert the table to a csv file. You add the columns directly then use the apply function with lambda to multiply each of the elements by the conversion factor. To truncate to 4 digits, you can change pandas global settings to display the format as 1 digit + 3 decimal in scientific notation.
import pandas as pd
df = pd.read_csv('something.csv')
pd.set_option('display.float_format', '{:.3E}'.format)
df['Sum Energies'] = (df['Energy1'] + df['Energy2']).apply(lambda x: x*627.51)
print(df)
This outputs:
     Energy1    Energy2  Sum Energies
0 -9.665E+02 -9.665E+02    -1.213E+06
1 -9.665E+02 -9.665E+02    -1.213E+06
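If you would rather stay with numpy, as in the snippet from the question, the same calculation can be sketched without pandas. The table is read from an in-memory string here so the example is self-contained; with a real file you would pass its path to `numpy.loadtxt` instead. "4 digits" is taken to mean 4 decimal places, which is an assumption.

```python
import io
import numpy as np

HARTREE_TO_KCAL_PER_MOL = 627.51  # conversion factor given in the question

# Stand-in for the whitespace-separated table file from the question.
table = io.StringIO(
    "Energy1 Energy2\n"
    "-966.463549649 -966.463549649\n"
    "-966.463608088 -966.463585840\n"
)

# skiprows=1 skips the header line; the result is a 2-D float array.
data = np.loadtxt(table, skiprows=1)

# Sum the two energies in each row, then convert the unit.
total_kcal = data.sum(axis=1) * HARTREE_TO_KCAL_PER_MOL

rounded = np.round(total_kcal, 4)             # rounds to 4 decimals
truncated = np.trunc(total_kcal * 1e4) / 1e4  # cuts off after 4 decimals

print(rounded)
print(truncated)
```

Rounding and truncating can differ in the last digit, so pick whichever the downstream use requires.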
I have a column with data like 3.4500,00 EUR.
I want to compare this with another column that holds float numbers like 4000.00.
How do I take this string, remove the EUR, replace the comma with a decimal point, and convert it to float for the comparison?
You can use regular expressions so that the cleanup works for any currency label, not just EUR:
# Make example dataframe for showing answer
df = pd.DataFrame({'Value':['3.4500,00 EUR', '88.782,21 DOLLAR']})
Value
0 3.4500,00 EUR
1 88.782,21 DOLLAR
Use str.replace with regular expression:
df['Value'].str.replace(r'[A-Za-z\.]', '', regex=True).str.replace(',', '.', regex=False).astype(float)
0 34500.00
1 88782.21
Name: Value, dtype: float64
Explanation:
str.replace(r'[A-Za-z\.]', '', regex=True) removes all alphabetic characters and dots.
str.replace(',', '.') replaces the comma with a dot.
astype(float) converts the column from object (string) type to float.
Here is my solution:
mock data:
amount amount2
0 3.4500,00EUR 4000
1 3.600,00EUR 500
Use apply() to clean each string, then convert the data type to float:
data['amount'] = data['amount'].apply(lambda x: x.replace('EUR', '').replace('.', '').replace(',', '.')).astype('float')
result:
amount amount2
0 34500.0 4000
1 3600.0 500
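To finish with the comparison the question asks for, here is a short sketch built on the mock data above. It assumes the amounts are never negative, because the pattern drops every character that is not a digit or a comma, and the name `amount_gt_amount2` is made up for illustration.

```python
import pandas as pd

# Mock data in the same shape as the example above.
data = pd.DataFrame({
    "amount": ["3.4500,00EUR", "3.600,00EUR"],
    "amount2": [4000.0, 500.0],
})

# Keep only digits and the decimal comma, then switch the comma to a dot.
parsed = (
    data["amount"]
    .str.replace(r"[^\d,]", "", regex=True)
    .str.replace(",", ".", regex=False)
    .astype(float)
)

# Element-wise comparison against the float column.
data["amount_gt_amount2"] = parsed > data["amount2"]
print(data)
```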
I have a column called accountnumber with values like 4.11889000e+11 in a pandas dataframe. I want to suppress the scientific notation and show the values in full, e.g. 411889000000. I tried the following, but it did not work.
df = pd.read_csv('data.csv')
pd.options.display.float_format = '{:,.3f}'.format
Please recommend.
You don't need the thousand separators "," and the 3 decimals for the account numbers.
Use the following instead.
pd.options.display.float_format = '{:.0f}'.format
I assume the exponential notation for the account numbers must come from the data file. If I create a small csv with the full account numbers, pandas will interpret them as integers.
acct_num
0 4118890000
1 9876543210
df['acct_num'].dtype
Out[51]: dtype('int64')
However, if the account numbers in the csv are represented in exponential notation then pandas will read them as floats.
acct_num
0 4.118890e+11
1 9.876543e+11
df['acct_num'].dtype
Out[54]: dtype('float64')
You have 2 options. First, correct the process that creates the csv so the account numbers are written out correctly. The second is to change the data type of the acct_num column to integer.
df['acct_num'] = df['acct_num'].astype('int64')
df
Out[66]:
acct_num
0 411889000000
1 987654321000
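If the column may also contain missing values, a plain astype('int64') fails, because NaN has no integer representation. One possible variation, sketched here with an in-memory CSV standing in for the real file, is to round and convert to the nullable Int64 dtype. Note that float64 represents integers exactly only up to 2**53, so very long account numbers are safer read as strings.

```python
import io
import pandas as pd

# Stand-in for a CSV whose account numbers were written in exponential notation.
csv_text = "acct_num\n4.11889e+11\n9.87654321e+11\n"

df = pd.read_csv(io.StringIO(csv_text))  # acct_num is parsed as float64

# Round first to guard against tiny floating-point errors, then convert.
df["acct_num"] = df["acct_num"].round().astype("Int64")

print(df)
```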