I'm cleaning a CSV file with pandas, mainly removing special characters such as '/', '#', etc. The file has 7 columns (none of which are dates).
In some of the columns, there's just numerical data such as '11/6/1980'.
I've noticed that directly after reading the csv file,
df = pd.read_csv('report16.csv', encoding='ANSI')
this data becomes '11/6/80', and after cleaning it becomes '11 6 80' (the same result appears in the output file). So wherever the data has '/', it is being interpreted as a date and Python is eliminating the first 2 digits of the year.
Data        Expected result   Actual Result
11/6/1980   11 6 1980         11 6 80
12/8/1983   12 8 1983         12 8 83
Both of the actual results above are wrong: in the Actual Result column, I'm losing 2 digits towards the end.
The data looks like this
Org Name   Code        Code copy
ABC        11/6/1980   11/6/1980
DEF        12/8/1983   12/8/1983
GH         11/5/1987   11/5/1987
OrgName, Code, Code copy
ABC, 11/6/1980, 11/6/1980
DEF, 12/8/1983, 12/8/1983
GH, 11/5/1987, 11/5/1987
KC, 9000494, 9000494
It's worth mentioning that the column also contains other data such as '900490', strings, etc., but those values don't cause any problems.
What can be done to prevent this conversion?
Not an answer, but comments don't allow well-presented code and data.
Here is what I call a minimal reproducible example:
Content of the sample.csv file:
Data,Expected result,Actual Result
11/6/1980,11 6 1980,11 6 80
12/8/1983,12 8 1983,12 8 83
Code:
import pandas as pd

df = pd.read_csv('sample.csv')
print(df)
s = df['Data'].str.replace('/', ' ')
print((df['Expected result'] == s).all())
It gives:
Data Expected result Actual Result
0 11/6/1980 11 6 1980 11 6 80
1 12/8/1983 12 8 1983 12 8 83
True
This proves that read_csv has correctly read the file and has not changed anything.
PLEASE SHOW THE CONTENT OF YOUR CSV FILE AS TEXT, along with enough code to reproduce your problem.
How about trying a string operation? First select the column you would like to modify and replace "/" or "#" with whitespace: column.str.replace("/", " "). I hope this works!
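A minimal sketch of that idea, assuming the dataframe has already been read and the column is named 'Code' as in the sample data above:
# replace '/' and '#' with spaces in the 'Code' column
df['Code'] = df['Code'].str.replace(r'[/#]', ' ', regex=True)
# or clean every text column in one pass
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.replace(r'[/#]', ' ', regex=True)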
The behavior of converting dates is not strictly a Python issue; you are using pandas' read_csv.
Try explicitly declaring a separator. If sep is not declared, read_csv makes guesses.
df = pd.read_csv('report16.csv', encoding='ANSI', sep=',')
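If the goal is to make sure nothing gets interpreted as anything other than text, forcing every column to be read as a string should also help; a sketch building on the line above (dtype=str is a standard read_csv parameter):
df = pd.read_csv('report16.csv', encoding='ANSI', sep=',', dtype=str)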
I have one dataframe whose format is shown in the image below.
In each row, every three columns represent one type of data: there is one column for the ticker, the next three columns hold one type of data, and columns 5-7 hold a second type of data.
Now I want to transform this so that each type of data becomes a column, with the groups appended below one another.
Expected output is:
Is there any way to do this transformation in pandas using any API? I am currently doing it in a very basic way: creating a new dataframe for one group and then appending it.
Here is one way to do it:
use pd.melt to unstack the table, then split what used to be the column names (and are now rows) on "/" to separate them into two columns (txt, year)
create the new row key by combining ticker and year, then use pivot to get the desired result set
df2 = df.melt(id_vars='ticker', var_name='col')   # this line was missing from the earlier version of the answer
df2[['txt', 'year']] = df2['col'].str.split('/', expand=True)
df2.assign(ticker2=df2['ticker'] + '/' + df2['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
Result set
txt ticker2 data1 data2
0 AAPL/2020 0.824676 0.616524
1 AAPL/2021 0.018540 0.046365
2 AAPL/2022 0.222349 0.729845
3 AMZ/2020 0.122288 0.087217
4 AMZ/2021 0.012168 0.734674
5 AMZ/2022 0.923501 0.437676
6 APPL/2020 0.886927 0.520650
7 APPL/2021 0.725515 0.543404
8 APPL/2022 0.211378 0.464898
9 GGL/2020 0.777676 0.052658
10 GGL/2021 0.297292 0.213876
11 GGL/2022 0.894150 0.185207
12 MICO/2020 0.898251 0.882252
13 MICO/2021 0.141342 0.105316
14 MICO/2022 0.440459 0.811005
Based on the code that you posted in a comment. I unfortunately missed a line when posting the solution; it is added now.
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.random.randint(0, 100, size=(2, 6)),
                   columns=["data1/2020", "data1/2021", "data1/2022", "data2/2020", "data2/2021", "data2/2022"])
ticker = ['APPL', 'MICO']
df2.insert(loc=0, column='ticker', value=ticker)
df2.head()
df3 = df2.melt(id_vars='ticker', var_name='col')   # the line that was missing from the earlier posting
df3[['txt', 'year']] = df3['col'].str.split('/', expand=True)
df3.head()
df3.assign(ticker2=df3['ticker'] + '/' + df3['year']).pivot(index='ticker2', columns='txt', values='value').reset_index()
txt ticker2 data1 data2
0 APPL/2020 26 9
1 APPL/2021 75 59
2 APPL/2022 20 44
3 MICO/2020 79 90
4 MICO/2021 63 30
5 MICO/2022 73 91
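For what it's worth, pd.wide_to_long can do the same reshape in a single call when the column names follow a stub/suffix pattern like data1/2020; a sketch against the df2 built above (it yields ticker and year as separate columns rather than a combined ticker2):
df4 = (pd.wide_to_long(df2, stubnames=['data1', 'data2'],
                       i='ticker', j='year', sep='/')
       .reset_index())
df4.head()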
Loading in the data:
in:  import pandas as pd
in:  df = pd.read_csv('name', sep=';', encoding='unicode_escape')
in:  df.dtypes
out: amount    object
I have an object column with amounts like 150,01 and 43,69. There are about 5,000 rows.
df['amount']
0 31
1 150,01
2 50
3 54,4
4 32,79
...
4950 25,5
4951 39,5
4952 75,56
4953 5,9
4954 43,69
Name: amount, Length: 4955, dtype: object
Naturally, I tried to convert the series using the locale module, which is supposed to turn it into floats. I came back with the following error:
In: import locale
    locale.setlocale(locale.LC_NUMERIC, 'en_US.UTF-8')
Out: 'en_US.UTF-8'
In: df['amount'].apply(locale.atof)
Out: ValueError: could not convert string to float: ' - '
Now that I'm aware that there are non-numeric values in the column, I tried to use isnumeric methods to turn the non-numeric values into NaN.
Unfortunately, due to the comma-separated structure, all the values turned into -1.
0 -1
1 -1
2 -1
3 -1
4 -1
..
4950 -1
4951 -1
4952 -1
4953 -1
4954 -1
Name: amount, Length: 4955, dtype: int64
How do I turn the ',' values into '.', after first removing the '-' values? I tried .drop() and .truncate(), but they don't help. If I simply replace the ',' with ' ', it also causes trouble, since there are non-integer values.
Please help!
Documentation that I came across:
- https://stackoverflow.com/questions/21771133/finding-non-numeric-rows-in-dataframe-in-pandas
- https://stackoverflow.com/questions/56315468/replace-comma-and-dot-in-pandas
p.s. This is my first post, please be kind
Sounds like you have a European-style CSV similar to the following. If your format is different, provide actual sample data, as many of the comments asked:
data.csv
thing;amount
thing1;31
thing2;150,01
thing3;50
thing4;54,4
thing5;1.500,22
To read it, specify the column separator, decimal separator and thousands separator as needed:
import pandas as pd
df = pd.read_csv('data.csv', sep=';', decimal=',', thousands='.')
print(df)
Output:
thing amount
0 thing1 31.00
1 thing2 150.01
2 thing3 50.00
3 thing4 54.40
4 thing5 1500.22
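If re-reading the file isn't an option and the column is already loaded as an object dtype (as in the question), a rough after-the-fact conversion could look like this; a sketch, assuming the column is named amount as above (pd.to_numeric with errors='coerce' turns entries like ' - ' into NaN):
df['amount'] = (df['amount']
                .str.replace('.', '', regex=False)    # drop thousands separators
                .str.replace(',', '.', regex=False))  # decimal comma -> decimal point
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')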
Posting as an answer since it contains multi-line code, despite not truly answering your question (yet):
Try using chardet. pip install chardet to get the package, then in your import block, add import chardet.
When importing the file, do something like:
with open("C:/path/to/file.csv", 'r') as f:
    data = f.read()
    result = chardet.detect(data.encode())
    charencode = result['encoding']
    # now re-set the handle to the beginning and re-read the file:
    f.seek(0, 0)
    data = pd.read_csv(f, delimiter=';', encoding=charencode)
Alternatively, for reasons I cannot fathom, passing engine='python' as a parameter often works. You'd just do
data = pd.read_csv('C:/path/to/file.csv', engine='python')
@Mark Tolonen has a more elegant approach to standardizing the actual data, but my (hacky) way of doing it was to just write a function:
def stripThousands(self, df_column):
    df_column.replace(',', '', regex=True, inplace=True)
    df_column = df_column.apply(pd.to_numeric, errors='coerce')
    return df_column
If you don't care about the entries that are just hyphens, you could use a function like
def screw_hyphens(self, column):
    column.replace(['-'], np.nan, inplace=True)
or if np.nan values will be a problem, you can just replace it with column.replace('-', '', inplace=True)
**EDIT: there was a typo in the block outlining the usage of chardet. It should be correct now (previously the end of the last line was encoding=charenc).
I don't exactly know how to describe the issue I'm having, so I'll just show it.
I have 2 data tables, and I'm using regex to search through and extract values in those tables based on if it matches with the correct word. I'll put the whole script for reference.
import re
import os
import pandas as pd
import numpy as np
os.chdir('C:/Users/Sams PC/Desktop')
f=open('test5.txt', 'w')
NHSQC=pd.read_csv('NHSQC.txt', sep='\s+', header=None)
NHSQC.columns=['Column_1','Column_2','Column_3']
HNCA=pd.read_csv('HNCA.txt', sep='\s+', header=None)
HNCA.columns=['Column_1','Column_2','Column_3','Column_4']
x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC))
y=re.findall('[A-Z][0-9][0-9][A-Z]-[C][A]-[H][N]',str(HNCA))
print (NHSQC)
print (HNCA)
print(x)
print (y)
data=[]
label=[]
for i in range (0,6):
    if x[i] in str(NHSQC):
        data2=NHSQC.set_index('Column_1',drop=False)
        data3=(data2.loc[str(x[i]), 'Column_2':'Column_3'])
        data.extend(list(data3))
        a=[x[i]]
        label.extend(a)
        label.extend(a)
        if y[i] in str(HNCA):
            data2=HNCA.set_index('Column_1',drop=False)
            data3=(data2.loc[str(y[i]),'Column_3'])
            data.append(data3)
            a=[y[i]]
            label.extend(a)
        else:
            print('Not Found')
    else:
        print('Not Found')
data6=[label,data]
matrix=data6
data5=np.transpose(matrix)
print(data5)
f.write(str(data5))
f.close()
This script does exactly what I want it to do, and it works as intended when I run my test data files, but fails when I run my actual data files. I don't know how to explain the issue, so I'll just show it. This is the output:
Column_1 Column_2 Column_3
0 S31N-HN 114.424 7.390
1 Y32N-HN 121.981 7.468
2 Q33N-HN 120.740 8.578
3 A34N-HN 118.317 7.561
4 G35N-HN 106.764 7.870
.. ... ... ...
89 R170N-HN 118.078 7.992
90 S171N-HN 110.960 7.930
91 R172N-HN 119.112 7.268
92 999_XN-HN 116.703 8.096
93 1000_XN-HN 117.530 8.040
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
.. ... ... ... ...
173 R170N-CA-HN 118.016 60.302 7.999
174 S171N-R170CA-S171HN 110.960 60.239 7.932
175 S171N-CA-HN 110.960 60.946 7.931
176 R172N-S171CA-R172HN 119.112 60.895 7.264
177 R172N-CA-HN 119.112 55.093 7.265
[178 rows x 4 columns]
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN']
Traceback (most recent call last):
File "test.py", line 29, in <module>
if y[i] in str(HNCA):
IndexError: list index out of range
As you can see, there is an issue: my regex for y isn't finding all the values. Furthermore, there is an issue with how many matches my x regex finds (only 5 instead of the hundreds it should find). Initially I thought this was just a display thing (it wasn't showing the hundreds of matches since that would take too long), and I also thought the ... in the middle of the printed table was also for display purposes. However, if I copy part of my HNCA.txt data and save it as a separate file, the issue goes away.
[94 rows x 3 columns]
Column_1 Column_2 Column_3 Column_4
0 Assignment w1 w2 w3
1 S31N-A30CA-S31HN 114.424 54.808 7.393
2 S31N-A30CA-S31HN 126.854 53.005 9.277
3 S31N-CA-HN 114.424 61.717 7.391
4 S31N-HA-HN 126.864 59.633 9.287
5 Y32N-S31CA-Y32HN 121.981 61.674 7.467
6 Y32N-CA-HN 121.981 60.789 7.469
7 Q33N-Y32CA-Q33HN 120.770 60.775 8.582
8 Q33N-CA-HN 120.701 58.706 8.585
9 A34N-Q33CA-A34HN 118.317 58.740 7.559
10 A34N-CA-HN 118.317 52.260 7.565
11 G35N-A34CA-G35HN 106.764 52.195 7.868
12 G35N-CA-HN 106.764 46.507 7.868
13 R36N-G35CA-R36HN 117.833 46.414 8.111
14 R36N-CA-HN 117.833 54.858 8.112
15 G37N-R36CA-G37HN 110.365 54.808 8.482
16 G37N-CA-HN 110.365 44.901 8.484
17 I55N-CA-HN 118.132 65.360 7.935
18 Y56N-I55CA-Y56HN 123.025 65.464 8.088
19 Y56N-CA-HN 123.025 62.195 8.082
20 A57N-Y56CA-A57HN 120.470 62.159 7.978
21 A57N-CA-HN 120.447 55.522 7.980
22 S72N-K71CA-S72HN 117.239 55.390 8.368
23 S72N-CA-HN 117.259 58.583 8.362
24 C73N-S72CA-C73HN 128.142 58.569 9.690
25 C73N-CA-HN 128.142 61.410 9.677
26 G74N-C73CA-G74HN 116.187 61.439 9.439
27 G74N-CA-HN 116.194 46.528 9.437
28 H75N-G74CA-H75HN 122.640 46.307 9.642
29 H75N-CA-HN 122.621 56.784 9.644
30 C76N-H75CA-C76HN 122.775 56.741 7.152
31 C76N-CA-HN 122.738 57.527 7.146
32 R77N-C76CA-R77HN 120.104 57.532 8.724
33 R77N-CA-HN 120.135 59.674 8.731
['S31N-HN', 'Y32N-HN', 'Q33N-HN', 'A34N-HN', 'G35N-HN']
['S31N-CA-HN', 'Y32N-CA-HN', 'Q33N-CA-HN', 'A34N-CA-HN', 'G35N-CA-HN', 'R36N-CA-HN', 'G37N-CA-HN', 'I55N-CA-HN', 'Y56N-CA-HN', 'A57N-CA-HN', 'S72N-CA-HN', 'C73N-CA-HN', 'G74N-CA-HN', 'H75N-CA-HN', 'C76N-CA-HN', 'R77N-CA-HN']
[['S31N-HN' '114.42399999999999']
I won't post the whole output, but as you can see, it now finds all the proper matches. It's also now displaying the entire table, instead of printing ... and only showing the top and bottom halves. I don't exactly understand where this issue arises from, though. Why does it display only the top and bottom half of my table, but if I copy and paste the data to another file, it displays the entire thing? And why does the regex not search through the entire table even if it isn't all displayed? (The fact that it shows the top and bottom halves makes me think the entire table is there and just not shown, to simplify the display, but why would what is displayed affect what the regex searches?)
Why is python only displaying the top and bottom portions of your table?
Python classes can define two "magic" methods:
__repr__(), which is supposed to produce a "representation" of the object as a string, and which has a pretty useless default implementation for most objects; and
__str__(), which is supposed to produce a readable "string" of the object, and which falls back to __repr__().
When the line x=re.findall('[A-Z][0-9][0-9][A-Z]-[H][N]',str(NHSQC)) is run, that last str(NHSQC) bit tells Python to call NHSQC.__str__(), which falls back to NHSQC.__repr__(), which you can read about here.
The developers of the pandas library implemented DataFrame.__repr__() in such a way that, depending on the values of certain global options, it will produce a string that does not fully represent the underlying data. The defaults truncate the DataFrame to show only the first 5 and last 5 rows, with ellipses (...) telling you that there are bits missing. Thus, as you suspected, you are only calling re.findall on the first 5 and last 5 rows of the DataFrame.
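(As an aside, if you ever do want to see the whole table for debugging, those same display options can be lifted temporarily; a small sketch, using the NHSQC DataFrame from the question:)
import pandas as pd
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(NHSQC)  # prints every row instead of the truncated repr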
What should you do instead?
Using str(NHSQC) is probably not what you intend to do. This converts the entire DataFrame into an (incomplete) string representation, then runs the regular expression search over that entire string. That's extremely inefficient, so why not use the Series.str methods instead?
For instance, you appear to be lining up Column_2 and Column_3 of rows from DataFrame NHSQC where the value of Column_1 matches the first regex in order with Column_3 of rows from DataFrame HNCA where the value of Column_1 matches the second regex, right?
df1 = NHSQC.loc[NHSQC["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-HN"))]
df2 = HNCA.loc[HNCA["Column_1"].str.match(re.compile("[A-Z][0-9][0-9][A-Z]-CA-HN")), ["Column_1", "Column_3"]]
Those lines will select the requisite rows and columns from the two DataFrames using Series.str.match on Column_1.
long1 = df1.melt(id_vars=["Column_1"]).drop("variable", axis="columns")
long2 = df2.rename(columns={"Column_3": "value"})
The first line uses DataFrame.melt to turn the three columns of df1 into a "longer" version with columns Column_1 as an identifier, variable as either the strings "Column_2" or "Column_3", and value, containing the thing you actually care about and are printing at the end of your program. You don't use the column name anymore, so it is dropped. The DataFrame df2 doesn't need to be converted to a longer format because it only has two columns, so we just rename Column_3 to value.
extra_long = pd.concat([long1, long2])
print(extra_long.to_numpy())
This just concatenates the two long DataFrames together, turns them into a numpy array, then prints them out.
Data:
from io import StringIO
import pandas as pd
s = '''ID,Level,QID,Text,ResponseID,responseText,date_key
375280046,S,D3M,Which is your favorite?,D5M0,option 1,2012-08-08 00:00:00
375280046,S,D3M,How often? (at home, at work, other),D3M0,Work,2010-03-31 00:00:00
375280046,M,A78,Do you prefer a, b, or c?,A78C,a,2010-03-31 00:00:00'''
df = pd.read_csv(StringIO(s))
Error received:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 7 fields in line 3, saw 9
It's very obvious why I'm receiving this error. The data contains text such as How often? (at home, at work, other) and Do you prefer a, b, or c?.
How does one read this type of data into a pandas DataFrame?
Of course, as I wrote the question, I figured it out. Rather than delete it, I'll share it with my future self for when I forget how to do this.
Apparently, pandas' sep parameter (which defaults to ',') can also be a regular expression.
The solution was to add sep=r',(?!\s)' to read_csv like so:
df = pd.read_csv(StringIO(s), sep=r',(?!\s)')
The (?!\s) part is a negative lookahead to match only commas that don't have a following space after them.
Result:
ID Level QID Text ResponseID \
0 375280046 S D3M Which is your favorite? D5M0
1 375280046 S D3M How often? (at home, at work, other) D3M0
2 375280046 M A78 Do you prefer a, b, or c? A78C
responseText date_key
0 option 1 2012-08-08 00:00:00
1 Work 2010-03-31 00:00:00
2 a 2010-03-31 00:00:00
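One note: a regex separator like this is only supported by the python parsing engine, so pandas falls back to it automatically (typically with a ParserWarning). Passing the engine explicitly gives the same result and keeps the warning quiet; a sketch:
df = pd.read_csv(StringIO(s), sep=r',(?!\s)', engine='python')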
I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:
LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13
LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08
First I tried the following:
df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])
This seemed to work fine, however it got messed up when it hit the second example line above, where there is no whitespace after the LOADEFFECT string (you may need to scroll a bit right to see it in the example). I got a result like:
632 1 2 1 200 10 3.474 5075. 3643. 0.13
633 1 3 1 200 10 3.515 LMAX= COV= NaN
Then I decided to use a regular expression to define my delimiters. After many trial and error runs (I am no expert in regex), I managed to get close with the following line:
df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')
This almost works, but for some reason creates a NaN column at the very beginning:
632 NaN 1 2 1 200 10 3.474 5075 3643 0.13
633 NaN 1 3 1 200 10 3.515 10009 9732 0.08
At this point I think I can just delete that first column and get away with it. However, I wonder what the correct way would be to set up the regex to parse this text file in one shot. Any ideas? Other than that, I am sure there is a smarter way to parse this text file; I would be glad to hear your recommendations.
Thanks!
import re
import pandas as pd
import csv

csvfile = open("parsing.txt")  # open the text file
reader = csv.reader(csvfile)
new_list = []
for line in reader:
    for i in line:
        # pull every number (integer or decimal) out of the field
        new_list.append(re.findall(r'(\d*\.\d+|\d+)', i))
table = pd.DataFrame(new_list)
table  # output will be a pandas DataFrame with the values
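If named columns are wanted afterwards, the nine numbers captured from each line could be labelled and converted to numeric types; a sketch (the names below are only guesses based on the labels in the original file):
table.columns = ['lane', 'mat_type', 'leffect', 'span', 'space',
                 'beta', 'loadeffect', 'lmax', 'cov']
table = table.apply(pd.to_numeric)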